Artificial Analysis

Technology, Information and Internet

Independent analysis of AI models and hosting providers: https://artificialanalysis.ai/

About us

Leading provider of independent analysis of AI models and providers. Understand the AI landscape to choose the best AI technologies for your use-case.

Website
https://artificialanalysis.ai/
Industry
Technology, Information and Internet
Company size
11-50 employees
Type
Privately Held

Updates

  • Thanks for the support Andrew Ng! Completely agree, faster token generation will become increasingly important as a greater proportion of output tokens are consumed by models, such as in multi-step agentic workflows, rather than being read by people.

    Andrew Ng

    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of Landing AI

    Shoutout to the team that built https://lnkd.in/g3Y-Zj3W . Really neat site that benchmarks the speed of different LLM API providers to help developers pick which models to use. This nicely complements the LMSYS Chatbot Arena, Hugging Face open LLM leaderboards and Stanford's HELM that focus more on the quality of the outputs. I hope benchmarks like this encourage more providers to work on fast token generation, which is critical for agentic workflows!

    Model & API Providers Analysis | Artificial Analysis

    artificialanalysis.ai

  • Announcing the Artificial Analysis AI Review - 2024 Highlights Release

    For Day 2 of our Launch Week, we have put together key themes from our AI benchmarks & analysis for our first public release of content from the Artificial Analysis AI Review.

    Key topics in our 2024 Highlights Release:
    ➤ 2024 saw multiple labs catch up to OpenAI's GPT-4, and the emergence of the first models to push beyond GPT-4's level of intelligence
    ➤ The US dominates the intelligence frontier - for now…
    ➤ The performance gap between open source and proprietary models has decreased significantly
    ➤ Language model inference pricing fell dramatically at all levels of intelligence
    ➤ A key driver of the decline in inference pricing and increase in speed has been small models
    … and much more!

    See the below article for further excerpts and a link to download the report 👇

    Announcing the Artificial Analysis AI Review - 2024 Highlights Release

    Artificial Analysis on LinkedIn

  • Announcement: we're doing five launches in five days! Inspired by OpenAI, we're wrapping up the year by launching a handful of projects we can't wait to share.

    Day 1/5: Image Arena Categories!

    With over 1.5 million votes cast in the Artificial Analysis Image Arena, we're excited to break down the performance of Text to Image models by style and subject. Want to know which model is best for text rendering, photorealism, or anime? We're now calculating individual ELO scores for a range of styles and subjects.

    We've added hundreds of new prompts to the Arena, expanding coverage across more diverse categories. Each category needs a minimum number of prompts & votes to display an ELO score, so start voting to see scores appear for more categories!

    The ranking of models for specific styles and subjects can change substantially from the overall ranking. For images with Text & Typography, while Recraft's Recraft V3 remains the top model, Ideogram's Ideogram v2 shows its strength in text rendering by rising from 6th to 2nd place.

  • OpenAI's Sora is now the leader in the Artificial Analysis Video Generation Arena!

    After 3,710 appearances or 'battles' in the arena over the past 2 days, OpenAI's Sora has an ELO score of 1,151, placing it as the clear #1 in the Artificial Analysis Video Generation Arena Leaderboard.

    See the below post for comparisons taken from the arena between OpenAI's Sora and the other top 3 models: Kuaishou Technology's Kling 1.5, MiniMax's Hailuo and Genmo's Mochi 1.

    OpenAI's Sora is now the leader in the Artificial Analysis Video Generation Arena!

    Artificial Analysis on LinkedIn

  • Google launches Gemini 2.0 Flash (experimental), now the smartest language model outside of OpenAI's o1 series

    Highlights from our benchmarks:
    ➤ Now the leading model on Artificial Analysis Quality other than OpenAI's o1 series - beating Google's Gemini 1.5 Pro (Sep) and Claude 3.5 Sonnet (Oct)
    ➤ Leaps forward compared to Gemini 1.5 Pro on all evaluation datasets we currently benchmark
    ➤ Still very fast - 170 output tokens/s, close to Gemini 1.5 Flash (Sep)
    ➤ Bold claims from Google on multimodal support (image/video/audio input, image and speech output) - stay tuned for our upcoming benchmark results

    See article for further analysis 👇

    Google launches Gemini 2.0 Flash (experimental), the smartest language model outside of OpenAI’s o1 series

    Artificial Analysis on LinkedIn

  • Comparison of OpenAI's Sora to other leading video generation models

    Earlier this week, OpenAI released Sora, their long-awaited video generation model. In the following article we compare Sora to other leading video generation models, including Runway Gen 3 Alpha, Pika 1.5, Kling 1.5, Hailuo, Luma Dream Machine and Hunyuan, across a range of prompts. These comparisons are extracted from the Artificial Analysis Video Arena, where we have crowdsourced over 150k votes to calculate an ELO score that can be used to compare the quality of the models.

    You can participate in the Artificial Analysis Video Arena here: https://lnkd.in/gXbjAjFE After 30 votes you can also see your own personalized ranking of the video generation models.

    Comparison of OpenAI's Sora to other leading video generation AI models

    Artificial Analysis on LinkedIn
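The arena leaderboards in these posts rank models by ELO scores computed from crowdsourced pairwise votes. As a rough illustration of how such scores emerge, here is a standard Elo update rule replayed over a stream of votes; this is a sketch only, not Artificial Analysis's published methodology, and the model names and votes below are made up:

```python
def expected_score(r_a, r_b):
    # Probability that model A beats model B under the Elo model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, a_won, k=32):
    # Move both ratings toward the observed outcome; k controls the step size
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Replay a stream of (model_a, model_b, winner) votes, all models starting at 1000
ratings = {"sora": 1000.0, "kling-1.5": 1000.0, "hailuo": 1000.0}
votes = [
    ("sora", "kling-1.5", "sora"),
    ("sora", "hailuo", "sora"),
    ("kling-1.5", "hailuo", "kling-1.5"),
]
for a, b, winner in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], winner == a)

print(sorted(ratings, key=ratings.get, reverse=True))
```

Each vote nudges the winner's rating up and the loser's down, by more when the result is surprising; scores like Sora's 1,151 are the accumulation of thousands of such battles.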

  • Initial benchmarks of providers of Meta's new Llama 3.3 70B model 📊

    In our independent evaluations, AI at Meta's Llama 3.3 70B model demonstrates intelligence comparable to OpenAI's GPT-4o and Mistral Large 2, and approaches the capabilities of Claude 3.5 Sonnet and Gemini 1.5 Pro. Llama 3.3 70B provides a clear upgrade path for users of Llama 3.1 70B, currently the most popular open-source model. It is also a potential opportunity for users of Llama 3.1 405B to access comparable intelligence at significantly faster speeds and lower cost, though we recommend extensive testing of your specific use-case before switching. Llama 3.3 70B sets itself apart with its permissive open-source license and, now with the launch of these APIs, the speed and cost at which this intelligence can be accessed.

    Congratulations to Cerebras Systems, SambaNova Systems, Groq, Fireworks AI, Together AI, Deep Infra Inc. and Hyperbolic on being fast to launch endpoints! In particular, we are seeing Cerebras Systems, SambaNova Systems and Groq set new records for the speed at which this level of intelligence can be accessed with their AI-focused custom chips. Congratulations to Cerebras Systems for being the fastest endpoint we benchmark, with a blazing 2,237 output tokens/s.

    All endpoints are priced below $1/M tokens (blended 3:1 input:output price), well below proprietary model endpoints of comparable intelligence (GPT-4o is $4.3 on the same basis). Congratulations to Deep Infra Inc. and Hyperbolic on offering the lowest-priced endpoints.

    See the attached article for further analysis

    Benchmarks of providers of Meta's new Llama 3.3 70B model

    Artificial Analysis on LinkedIn
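The "blended 3:1, input:output" pricing basis used above is a weighted average of the per-million-token input and output prices, reflecting a workload that consumes three input tokens per output token. A minimal sketch of the calculation; the GPT-4o prices used here are our assumption for illustration, not figures from the post:

```python
def blended_price(input_price_per_m, output_price_per_m, input_weight=3, output_weight=1):
    # Weighted average of input and output prices per million tokens
    total = input_weight + output_weight
    return (input_price_per_m * input_weight + output_price_per_m * output_weight) / total

# Assumed GPT-4o list prices: $2.50/M input tokens, $10.00/M output tokens
print(blended_price(2.50, 10.00))  # 4.375, in line with the ~$4.3/M figure quoted above
```

On this basis, an open-weights endpoint at, say, $0.60/M input and $0.90/M output blends to well under $1/M.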

  • Benchmarks of providers of Qwen2.5, a leading open-source model family 📊

    Alibaba Cloud's Qwen2.5 family of models includes Qwen2.5 72B, Qwen2.5 Coder 32B and a range of smaller models, including 1.5B and 0.5B models for 'edge' use-cases. Qwen2.5 72B, the flagship model, is competitive in intelligence evaluations with frontier models including Llama 3.3 70B, GPT-4o and Mistral Large 2. Despite its smaller size, Qwen2.5 Coder 32B achieves performance comparable to frontier models in coding benchmarks like HumanEval. Its size and capabilities position it well to support developers with fast code generation and emerging use-cases such as coding agents that require multi-step inference to autonomously develop features and applications.

    Amongst providers, SambaNova Systems is the clear leader in output speed, delivering ~225 output tokens/s on Qwen2.5 72B and 566 output tokens/s on Qwen2.5 Coder 32B in our coding workload benchmark. Nebius, Deep Infra Inc., Hyperbolic and Together AI also offer the models, all at prices significantly cheaper than comparable proprietary models such as GPT-4o.

    Benchmarks of providers of Qwen2.5, a leading open-source model family 📊

    Artificial Analysis on LinkedIn

  • Meta has launched Llama 3.3 70B, achieving a level of intelligence previously reserved for Llama 3.1 405B and leapfrogging the November release of GPT-4o

    We have completed our first round of independent evals on Llama 3.3 70B and are seeing a jump in Artificial Analysis Quality Index from 68 to 74, now matching Llama 3.1 405B's score. Congratulations to AI at Meta on an excellent update!

    Key details:
    ➤ Biggest increases in MATH-500 (64% to 76%), GPQA Diamond (43% to 49%) and HumanEval (80% to 85%)
    ➤ Smaller increase in MMLU (84% to 86%)
    ➤ Llama 3.3 70B now leads Llama 3.1 405B in MATH-500, and scores nearly equal to 405B in MMLU, GPQA Diamond and HumanEval
    ➤ With no change to model size, we anticipate that most providers serving Llama 3.1 70B APIs will imminently launch Llama 3.3 70B endpoints at equivalent price and speed to the 3.1 70B endpoints

    See the below article for a full analysis of the release.

    Analysis of Llama 3.3 70B, Meta's newly released AI model

    Artificial Analysis on LinkedIn

  • Takeaways from day 1 of '12 Days of OpenAI': full version of o1 (no API), o1 pro and a $200/month ChatGPT Pro plan

    Key changes:
    ➤ o1 has replaced o1-preview in ChatGPT
    ➤ OpenAI has not yet released API access to the full version of o1
    ➤ o1 is almost certainly the most intelligent language model ever released, but the upgrade from o1-preview may not be very noticeable for most ChatGPT users
    ➤ o1 in ChatGPT can now accept vision inputs, but cannot yet use ChatGPT's other built-in tools like web search or code interpreter
    ➤ o1 pro, which appears to be a new o1 mode leveraging greater inference-time compute (i.e. it thinks for longer), is available in a new ChatGPT Pro plan priced at $200/month

    Without API access, we have not yet been able to independently evaluate model performance. We look forward to independently evaluating o1 as soon as API access is available.

    Initial notes from our read of the system card:
    ➤ The reported English-language MMLU score of 92.3% is identical to what OpenAI reported in early September as the 'o1 (work in progress)' result
    ➤ o1 does not seem to make large gains in agentic evals like SWE-Bench Verified (identical score to o1-preview) - however, there may be significant potential for innovation in agentic scaffolding approaches to make the most of o1's capabilities
    ➤ OpenAI did not release updated MATH or HumanEval scores
    ➤ OpenAI presented evidence from new external red-teaming approaches, including a note that in a scenario where o1 was allowed to find "memos by its 'developers' that describe how it is misaligned and will be superseded by a new model", it attempted to "exfiltrate its 'weights' […] 2% of the time". This behavior was not observed in GPT-4o.

