To evaluate multilingual performance, we translated MMLU's test set into 14 languages using professional human translators rather than machine translation. This increases confidence in the accuracy of the translations, especially for low-resource languages like Yoruba.
| Language | o3-high | o1 | o4-mini-high | o3-mini-high | gpt-4.5-preview-2025-02-27 | gpt-4.1-2025-04-14 | gpt-4o-2024-11-20 | gpt-4.1-mini-2025-04-14 | gpt-4o-mini-2024-07-18 | gpt-4.1-nano-2025-04-14 |
|---|---|---|---|---|---|---|---|---|---|---|
| Arabic | 0.904 | 0.890 | 0.861 | 0.819 | 0.860 | 0.844 | 0.831 | 0.795 | 0.709 | 0.659 |
| Bengali | 0.878 | 0.873 | 0.840 | 0.801 | 0.848 | 0.827 | 0.801 | 0.749 | 0.658 | 0.583 |
| Chinese (Simplified) | 0.893 | 0.889 | 0.869 | 0.836 | 0.870 | 0.861 | 0.842 | 0.817 | 0.731 | 0.710 |
| French | 0.906 | 0.893 | 0.874 | 0.837 | 0.878 | 0.870 | 0.846 | 0.835 | 0.766 | 0.739 |
| German | 0.905 | 0.890 | 0.867 | 0.808 | 0.853 | 0.855 | 0.836 | 0.823 | 0.743 | 0.722 |
| Hindi | 0.898 | 0.883 | 0.859 | 0.811 | 0.858 | 0.842 | 0.819 | 0.780 | 0.692 | 0.629 |
| Indonesian | 0.898 | 0.886 | 0.869 | 0.828 | 0.872 | 0.859 | 0.840 | 0.816 | 0.745 | 0.714 |
| Italian | 0.912 | 0.897 | 0.877 | 0.838 | 0.878 | 0.869 | 0.845 | 0.835 | 0.764 | 0.734 |
| Japanese | 0.890 | 0.889 | 0.869 | 0.831 | 0.869 | 0.856 | 0.835 | 0.810 | 0.726 | 0.690 |
| Korean | 0.893 | 0.882 | 0.867 | 0.826 | 0.860 | 0.849 | 0.829 | 0.801 | 0.720 | 0.679 |
| Portuguese (Brazil) | 0.910 | 0.895 | 0.878 | 0.841 | 0.879 | 0.870 | 0.836 | 0.839 | 0.768 | 0.741 |
| Spanish | 0.911 | 0.899 | 0.880 | 0.840 | 0.884 | 0.876 | 0.843 | 0.839 | 0.774 | 0.748 |
| Swahili | 0.860 | 0.854 | 0.813 | 0.738 | 0.820 | 0.795 | 0.779 | 0.679 | 0.619 | 0.566 |
| Yoruba | 0.780 | 0.754 | 0.708 | 0.637 | 0.682 | 0.647 | 0.621 | 0.566 | 0.458 | 0.455 |
| Average | 0.888 | 0.877 | 0.852 | 0.807 | 0.851 | 0.837 | 0.814 | 0.785 | 0.705 | 0.669 |
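The "Average" row is the unweighted mean of the 14 per-language scores. A minimal sanity check for the o3-high column, using the values from the table above:

```python
# Per-language multilingual MMLU scores for the o3-high column,
# copied from the table above (14 languages, table order).
o3_high_scores = [
    0.904, 0.878, 0.893, 0.906, 0.905, 0.898, 0.898,
    0.912, 0.890, 0.893, 0.910, 0.911, 0.860, 0.780,
]

# Unweighted mean, rounded to three decimals to match the table.
average = round(sum(o3_high_scores) / len(o3_high_scores), 3)
print(average)  # → 0.888
```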
These results can be reproduced by running:

```
python -m simple-evals.run_multilingual_mmlu
```