The open source AI model 'Qwen3-235B-A22B-Thinking-2507' has been released, outperforming OpenAI's and Google's AI models on major benchmarks



The 'Qwen3' family of large language models (LLMs) developed by China's Alibaba has gained a new member, 'Qwen3-235B-A22B-Thinking-2507', a mixture-of-experts model with 235 billion total parameters and 22 billion active parameters. 'Qwen3-235B-A22B-Thinking-2507' is the family's most advanced reasoning model, improving both the quality and depth of the reasoning capabilities of 'Qwen3-235B-A22B'.

Qwen3-235B-A22B-Thinking-2507
https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/

It's Qwen's summer: Qwen3-235B-A22B-Thinking-2507 tops charts | VentureBeat
https://venturebeat.com/ai/its-qwens-summer-new-open-source-qwen3-235b-a22b-thinking-2507-tops-openai-gemini-reasoning-models-on-key-benchmarks/

Alibaba announced the Qwen3 family at the end of April 2025. The flagship model, Qwen3-235B-A22B, has 235 billion parameters with 22 billion active, and achieved results competitive with other companies' cutting-edge AI models, such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro, on benchmarks covering coding, mathematics, and general capabilities.

Alibaba announces 'Qwen3', a reasoning model that outperforms GPT-4o and o1, with the flagship model 'Qwen3-235B-A22B' carrying 235 billion parameters and 22 billion active parameters - GIGAZINE



'Qwen3-235B-A22B-Thinking-2507' is an improved version of 'Qwen3-235B-A22B'. Compared to its predecessor, it performs significantly better on reasoning tasks, including academic benchmarks that demand human expertise in logical reasoning, mathematics, science, and coding, as well as on general capabilities, and it achieves the best performance among open source reasoning models. The trade-off for the stronger reasoning is a longer thinking time, so it is recommended for complex reasoning tasks rather than everyday chat.
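In practice, the model is used like any chat model; the main operational difference is the long reasoning phase before the final answer. Below is a minimal sketch of calling it through an OpenAI-compatible endpoint; the base URL and model identifier are assumptions, so check your provider's documentation for the actual values.

# Minimal sketch: calling a Qwen3 thinking model through an
# OpenAI-compatible endpoint. The base_url and model id are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3-235b-a22b-thinking-2507",  # assumed model id
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

# Thinking models spend tokens reasoning before they answer, so hard
# prompts take noticeably longer than with a non-thinking model.
print(response.choices[0].message.content)

For simple questions the extra thinking time is wasted latency, which is why Alibaba positions the model for complex reasoning tasks.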




Below are the results of multiple benchmark tests comparing the reasoning models 'Qwen3-235B-A22B-Thinking-2507' (red), 'Qwen3-235B-A22B Thinking' (blue), 'Gemini 2.5 Pro' (gray), 'OpenAI o4-mini' (dark brown), and 'DeepSeek-R1-0528' (light brown). GPQA covers biology, physics, and chemistry; AIME25 covers mathematics; LiveCodeBench v6 covers coding; and HLE covers reasoning ability and advanced expertise.



Regarding the benchmark results of 'Qwen3-235B-A22B-Thinking-2507', David Hendrickson, a software engineer at Columbia University, said, 'The wins from China just keep coming. They are the winners of open source language models now. This new model, Qwen3-235B-A22B-Thinking-2507, is the highest-performing open source reasoning model. It is a model with a context length of 130,000, and on LiveBench it outperforms o1, o3-mini, Opus 4, and Flash 2.5, which have a context length of 120,000.'




The benchmark comparison graph published by Alibaba did not include results for 'Claude Opus 4 Thinking' (brown) or 'OpenAI o3' (dark gray), so Apolinario created a graph that adds them.



Simon Willison, an engineer who runs a unique AI benchmark called 'Drawing a Pelican on a Bicycle', wrote that when he had 'Qwen3-235B-A22B-Thinking-2507' draw a pelican on a bicycle, the model spent 166 seconds thinking. The generated pelican on a bicycle is shown below.
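The benchmark itself is straightforward: ask the model for an SVG drawing and render whatever comes back. A minimal sketch of reproducing it follows; the endpoint and model id are the same assumptions as in the earlier example, and the SVG extraction step is an illustrative addition, not part of Willison's published workflow.

# Minimal sketch of the pelican benchmark: request an SVG drawing and
# save the returned markup. Endpoint and model id are assumptions.
import re
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3-235b-a22b-thinking-2507",  # assumed model id
    messages=[{"role": "user",
               "content": "Generate an SVG of a pelican riding a bicycle"}],
)
answer = response.choices[0].message.content

# The model may wrap the drawing in explanatory prose, so extract the
# <svg>...</svg> block before saving it for viewing in a browser.
match = re.search(r"<svg.*?</svg>", answer, re.DOTALL)
if match:
    with open("pelican.svg", "w") as f:
        f.write(match.group(0))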



'Qwen3-235B-A22B-Thinking-2507' is available on Alibaba Cloud.




It is also published on Hugging Face.

Qwen/Qwen3-235B-A22B-Thinking-2507 · Hugging Face
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
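For local or self-hosted use, the open weights load through the standard transformers chat interface. The following is a minimal sketch assuming a recent transformers release and a multi-GPU server with enough memory for a 235-billion-parameter model; the prompt and generation budget are illustrative.

# Minimal sketch: loading the open weights from Hugging Face with
# transformers. This is illustrative, not a turnkey deployment.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-235B-A22B-Thinking-2507"  # repo linked above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard the weights across available GPUs
)

messages = [{"role": "user", "content": "How many prime numbers are below 100?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# The thinking model emits its chain of thought before the final answer,
# so a generous token budget is needed for hard problems.
output = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))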



in Software, Posted by logu_ii