To evaluate multilingual performance, we translated MMLU's test set into 14 languages using professional human translators rather than machine translation. This increases confidence in the accuracy of the translations, especially for low-resource languages like Yoruba.
| Language | o3-high | o1 | o4-mini-high | o3-mini-high | gpt-4.5-preview-2025-02-27 | gpt-4.1-2025-04-14 | gpt-4o-2024-11-20 | gpt-4.1-mini-2025-04-14 | gpt-4o-mini-2024-07-18 | gpt-4.1-nano-2025-04-14 |
|---|---|---|---|---|---|---|---|---|---|---|
| Arabic | 0.904 | 0.890 | 0.861 | 0.819 | 0.860 | 0.844 | 0.831 | 0.795 | 0.709 | 0.659 |
| Bengali | 0.878 | 0.873 | 0.840 | 0.801 | 0.848 | 0.827 | 0.801 | 0.749 | 0.658 | 0.583 |
| Chinese (Simplified) | 0.893 | 0.889 | 0.869 | 0.836 | 0.870 | 0.861 | 0.842 | 0.817 | 0.731 | 0.710 |
| French | 0.906 | 0.893 | 0.874 | 0.837 | 0.878 | 0.870 | 0.846 | 0.835 | 0.766 | 0.739 |
| German | 0.905 | 0.890 | 0.867 | 0.808 | 0.853 | 0.855 | 0.836 | 0.823 | 0.743 | 0.722 |
| Hindi | 0.898 | 0.883 | 0.859 | 0.811 | 0.858 | 0.842 | 0.819 | 0.780 | 0.692 | 0.629 |
| Indonesian | 0.898 | 0.886 | 0.869 | 0.828 | 0.872 | 0.859 | 0.840 | 0.816 | 0.745 | 0.714 |
| Italian | 0.912 | 0.897 | 0.877 | 0.838 | 0.878 | 0.869 | 0.845 | 0.835 | 0.764 | 0.734 |
| Japanese | 0.890 | 0.889 | 0.869 | 0.831 | 0.869 | 0.856 | 0.835 | 0.810 | 0.726 | 0.690 |
| Korean | 0.893 | 0.882 | 0.867 | 0.826 | 0.860 | 0.849 | 0.829 | 0.801 | 0.720 | 0.679 |
| Portuguese (Brazil) | 0.910 | 0.895 | 0.878 | 0.841 | 0.879 | 0.870 | 0.836 | 0.839 | 0.768 | 0.741 |
| Spanish | 0.911 | 0.899 | 0.880 | 0.840 | 0.884 | 0.876 | 0.843 | 0.839 | 0.774 | 0.748 |
| Swahili | 0.860 | 0.854 | 0.813 | 0.738 | 0.820 | 0.795 | 0.779 | 0.679 | 0.619 | 0.566 |
| Yoruba | 0.780 | 0.754 | 0.708 | 0.637 | 0.682 | 0.647 | 0.621 | 0.566 | 0.458 | 0.455 |
| Average | 0.888 | 0.877 | 0.852 | 0.807 | 0.851 | 0.837 | 0.814 | 0.785 | 0.705 | 0.669 |
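The "Average" row is the unweighted mean of the 14 per-language scores. A minimal sanity check for the o3-high column, using the values from the table above:

```python
# Per-language multilingual MMLU scores for the o3-high column,
# copied from the table above (14 languages, table order).
o3_high_scores = [
    0.904, 0.878, 0.893, 0.906, 0.905, 0.898, 0.898,
    0.912, 0.890, 0.893, 0.910, 0.911, 0.860, 0.780,
]

# Unweighted mean, rounded to three decimals to match the table.
average = round(sum(o3_high_scores) / len(o3_high_scores), 3)
print(average)  # → 0.888
```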
These results can be reproduced by running:

```
python -m simple-evals.run_multilingual_mmlu
```