
# Multilingual MMLU Benchmark Results

To evaluate multilingual performance, we translated MMLU’s test set into 14 languages using professional human translators. Relying on human translators for this evaluation increases confidence in the accuracy of the translations, especially for low-resource languages like Yoruba.

## Results

| Language | o3-high | o1 | o4-mini-high | o3-mini-high | gpt-4.5-preview-2025-02-27 | gpt-4.1-2025-04-14 | gpt-4o-2024-11-20 | gpt-4.1-mini-2025-04-14 | gpt-4o-mini-2024-07-18 | gpt-4.1-nano-2025-04-14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arabic | 0.904 | 0.890 | 0.861 | 0.819 | 0.860 | 0.844 | 0.831 | 0.795 | 0.709 | 0.659 |
| Bengali | 0.878 | 0.873 | 0.840 | 0.801 | 0.848 | 0.827 | 0.801 | 0.749 | 0.658 | 0.583 |
| Chinese (Simplified) | 0.893 | 0.889 | 0.869 | 0.836 | 0.870 | 0.861 | 0.842 | 0.817 | 0.731 | 0.710 |
| French | 0.906 | 0.893 | 0.874 | 0.837 | 0.878 | 0.870 | 0.846 | 0.835 | 0.766 | 0.739 |
| German | 0.905 | 0.890 | 0.867 | 0.808 | 0.853 | 0.855 | 0.836 | 0.823 | 0.743 | 0.722 |
| Hindi | 0.898 | 0.883 | 0.859 | 0.811 | 0.858 | 0.842 | 0.819 | 0.780 | 0.692 | 0.629 |
| Indonesian | 0.898 | 0.886 | 0.869 | 0.828 | 0.872 | 0.859 | 0.840 | 0.816 | 0.745 | 0.714 |
| Italian | 0.912 | 0.897 | 0.877 | 0.838 | 0.878 | 0.869 | 0.845 | 0.835 | 0.764 | 0.734 |
| Japanese | 0.890 | 0.889 | 0.869 | 0.831 | 0.869 | 0.856 | 0.835 | 0.810 | 0.726 | 0.690 |
| Korean | 0.893 | 0.882 | 0.867 | 0.826 | 0.860 | 0.849 | 0.829 | 0.801 | 0.720 | 0.679 |
| Portuguese (Brazil) | 0.910 | 0.895 | 0.878 | 0.841 | 0.879 | 0.870 | 0.836 | 0.839 | 0.768 | 0.741 |
| Spanish | 0.911 | 0.899 | 0.880 | 0.840 | 0.884 | 0.876 | 0.843 | 0.839 | 0.774 | 0.748 |
| Swahili | 0.860 | 0.854 | 0.813 | 0.738 | 0.820 | 0.795 | 0.779 | 0.679 | 0.619 | 0.566 |
| Yoruba | 0.780 | 0.754 | 0.708 | 0.637 | 0.682 | 0.647 | 0.621 | 0.566 | 0.458 | 0.455 |
| **Average** | 0.888 | 0.877 | 0.852 | 0.807 | 0.851 | 0.837 | 0.814 | 0.785 | 0.705 | 0.669 |
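As a sanity check, the "Average" row can be recomputed from the per-language scores: it is the unweighted mean over the 14 languages. A minimal Python sketch, with values copied from the table for two of the columns:

```python
# Recompute the per-model average as the unweighted mean over all
# 14 languages. Scores below are copied verbatim from the table for
# two example columns; the same check applies to any column.
scores = {
    "o3-high": [0.904, 0.878, 0.893, 0.906, 0.905, 0.898, 0.898,
                0.912, 0.890, 0.893, 0.910, 0.911, 0.860, 0.780],
    "gpt-4.1-nano-2025-04-14": [0.659, 0.583, 0.710, 0.739, 0.722,
                                0.629, 0.714, 0.734, 0.690, 0.679,
                                0.741, 0.748, 0.566, 0.455],
}

for model, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(f"{model}: {avg:.3f}")
```

Rounded to three decimals, these means match the reported averages (0.888 and 0.669).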

These results can be reproduced by running:

```bash
python -m simple-evals.run_multilingual_mmlu
```