The HalalBench dataset contains 1,043 food packaging images with 36,438 text annotations across 14 languages, annotated in COCO format with bounding boxes and transcriptions for ingredient text regions.
The dataset will be available for download at:
https://github.com/halallens-no/halalbench/releases
After downloading, extract into this directory:
cd data/
tar -xzf halalbench-v1.0.tar.gz
This will produce:
data/
annotations.json # COCO-format ground truth
images/ # 1,043 food packaging images
img_0001.jpg
img_0002.jpg
...
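After extraction it can be useful to confirm that every image referenced in `annotations.json` is actually present under `images/`. The following is a minimal sketch, not part of the dataset tooling; the function name `find_missing_images` is hypothetical, and it assumes the directory layout shown above.

```python
import json
from pathlib import Path


def find_missing_images(data_dir):
    """Return file names listed in annotations.json that are absent from images/."""
    root = Path(data_dir)
    with open(root / "annotations.json", encoding="utf-8") as f:
        coco = json.load(f)
    # Compare each image record against the extracted images/ directory.
    return sorted(
        img["file_name"]
        for img in coco["images"]
        if not (root / "images" / img["file_name"]).is_file()
    )
```

Calling `find_missing_images("data")` should return an empty list when the archive was extracted completely.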
The annotations follow the standard COCO format with additional fields for OCR:
{
"images": [
{
"id": 1,
"file_name": "img_0001.jpg",
"width": 1920,
"height": 1080,
"language": "en"
}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"bbox": [x, y, width, height],
"category_id": 1,
"attributes": {
"text": "sodium benzoate",
"confidence": 1.0
}
}
],
"categories": [
{
"id": 1,
"name": "ingredient_text"
}
]
}

| Field | Description |
|---|---|
| `images[].language` | ISO 639-1 language code of the primary text |
| `annotations[].bbox` | Bounding box in COCO format `[x, y, w, h]` |
| `annotations[].attributes.text` | Ground-truth transcription of the text region |
| Language | Code | Images | Annotations |
|---|---|---|---|
| Arabic | ar | 72 | 2,518 |
| Danish | da | 68 | 2,380 |
| Dutch | nl | 71 | 2,485 |
| English | en | 112 | 3,920 |
| French | fr | 89 | 3,115 |
| German | de | 85 | 2,975 |
| Indonesian | id | 62 | 2,170 |
| Japanese | ja | 78 | 2,730 |
| Korean | ko | 74 | 2,590 |
| Malay | ms | 59 | 2,065 |
| Norwegian | no | 65 | 2,275 |
| Swedish | sv | 70 | 2,450 |
| Thai | th | 69 | 2,415 |
| Turkish | tr | 69 | 2,350 |
| Total | | 1,043 | 36,438 |
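The per-language counts in the table can be recomputed from the annotation file by joining each annotation to its image's `language` field. This is a rough sketch under the schema described in this README; `language_stats` is a hypothetical helper, not a function shipped with the dataset.

```python
from collections import Counter


def language_stats(coco):
    """Return {language_code: (image_count, annotation_count)}."""
    # Look up each image's language once so annotations can be joined by image_id.
    lang_of = {img["id"]: img["language"] for img in coco["images"]}
    images = Counter(lang_of.values())
    annotations = Counter(lang_of[ann["image_id"]] for ann in coco["annotations"])
    return {lang: (images[lang], annotations[lang]) for lang in sorted(images)}
```

Summing the two counts over all languages gives the figures for the Total row.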
Each image was annotated by at least two annotators using the following protocol:
- Bounding boxes tightly enclose each ingredient text region
- Transcriptions preserve the original text as printed (including accents and special characters)
- Language labels reflect the primary language of the ingredient list on each image
- Inter-annotator disagreements were resolved by a third annotator
The HalalBench dataset is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
You are free to share and adapt the dataset, provided you give appropriate credit and distribute any derivative works under the same license.
Full terms: https://creativecommons.org/licenses/by-sa/4.0/
@article{halalbench2026,
title = {HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction},
author = {Nurul Isma},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}