29-07-2025
WebGuard Enhances AI Safety for Web Agents
AI Safety and Security
WebGuard, developed by Scale AI with UC Berkeley and Ohio State University, is a pioneering dataset designed to assess and improve the safety of AI web agents. It features 4,939 human-annotated actions from 193 websites, categorized by risk level to guide safe decision-making. Fine-tuning with WebGuard significantly boosts model accuracy, with smaller models like Qwen2.5-VL-7B achieving up to 80% accuracy in identifying high-risk actions. Researchers invite the community to use the public dataset to advance AI safety.
|
12-07-2025
Scale AI Team Tackles Controversial AI Statements
Podcast
In Episode 9 of Scale AI’s Human in the Loop podcast, the Enterprise team debates bold AI claims, such as whether coding agents will replace engineers or whether fine-tuning is obsolete given large context windows. They explore barriers to enterprise AI adoption, emphasizing human challenges over technical ones, and discuss the reliability of single versus multi-agent systems. The team’s insights, drawn from working with top enterprises, highlight practical strategies for building effective AI systems. Listen to the episode or read the transcript at scale.com to gain industry insights.
|
28-06-2025
Scale AI Recognized for AI Data Innovation
Awards & Honours
Scale AI, a leader in data annotation, earned a spot on TIME’s 2025 list of the 100 Most Influential Companies for its critical role in advancing AI through high-quality data labeling. With over 240,000 gig workers, Scale supports major AI firms, though its recent $14.3 billion Meta deal may shift ties with rivals like OpenAI. Its new division, backed by a U.S. Department of Defense contract, is growing rapidly by tailoring AI models for large organizations. Learn more at scale.com.
|
26-06-2025
Scale AI’s FORTRESS Enhances AI Safety Evaluation
AI Safety and Security
Scale AI introduces FORTRESS, a benchmark designed to evaluate large language models (LLMs) for national security and public safety risks. Featuring over 1,010 expert-crafted adversarial prompts, FORTRESS assesses model safeguards across domains like Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE) activities, ensuring robust protection against misuse while minimizing over-refusals of benign requests. It provides a balanced, scalable framework with instance-specific rubrics for precise evaluation. Visit scale.com to explore the FORTRESS leaderboard and methodology.
|
07-06-2025
DeepSeek-R1 Tops SEAL Leaderboard for Open-Source Models
AI Tool Benchmarking
Scale AI’s SEAL Leaderboards, designed to evaluate frontier large language models (LLMs) with unbiased private datasets, recently highlighted DeepSeek-R1’s superior performance among open-source models on the Humanity’s Last Exam (Text Only) benchmark. The leaderboards, crafted by verified domain experts, assess models across diverse tasks like coding, reasoning, and honesty, with top performers including Gemini-2.5-pro-preview and Claude Sonnet 4. Regular updates ensure rankings reflect the latest AI advancements, fostering trust and transparency.
|
07-06-2025
Red Teaming Enhances Enterprise AI Safety
Podcast
Scale AI’s Human in the Loop podcast episode highlights red teaming, a critical process for identifying vulnerabilities in enterprise AI systems. Experts discuss how tailored red teaming prevents issues like model drift and over-restrictive guardrails, ensuring AI remains both safe and effective. By simulating real-world risks, Scale AI’s approach helps enterprises balance usability with robust safety policies. Watch the episode at scale.com/blog to learn how to strengthen your AI defenses.
|
07-06-2025
Gemini-2.5 Pro Leads SEAL Leaderboards in Reasoning
AI Tool Benchmarking
Google’s Gemini-2.5 Pro preview model has secured the top spot on Scale AI’s SEAL Leaderboards, excelling in expert reasoning and visual understanding benchmarks. Evaluated by domain experts using private datasets, it outperformed competitors in tasks like Humanity’s Last Exam and VISTA, showcasing its strength in complex problem-solving. The leaderboards provide industry insights into AI model capabilities, ensuring trustworthy rankings. Visit scale.com/leaderboard to explore the full rankings.