HalalBench Dataset

Overview

The HalalBench dataset contains 1,043 food packaging images with 36,438 text annotations across 14 languages. Annotations are stored in COCO format, with a bounding box and a transcription for each ingredient text region.

Download

The dataset will be available for download at:

https://github.com/halallens-no/halalbench/releases

After downloading, extract into this directory:

cd data/
tar -xzf halalbench-v1.0.tar.gz

This will produce:

data/
  annotations.json     # COCO-format ground truth
  images/              # 1,043 food packaging images
    img_0001.jpg
    img_0002.jpg
    ...
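
As a quick sanity check after extraction, the sketch below lists any image named in annotations.json that is not present under images/. It assumes the layout shown above; the function name `find_missing_images` is illustrative and not part of the dataset tooling.

```python
import json
from pathlib import Path


def find_missing_images(data_dir):
    """Return file names listed in annotations.json but absent from images/."""
    data_dir = Path(data_dir)
    with open(data_dir / "annotations.json", encoding="utf-8") as f:
        coco = json.load(f)
    return [
        img["file_name"]
        for img in coco["images"]
        if not (data_dir / "images" / img["file_name"]).is_file()
    ]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains data/.
    missing = find_missing_images("data")
    print(f"{len(missing)} image(s) missing")
```

An empty result means every image referenced by the annotations was extracted.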

COCO Annotation Format

The annotations follow the standard COCO format, extended for OCR with a `language` field on each image and an `attributes` object (`text`, `confidence`) on each annotation:

{
  "images": [
    {
      "id": 1,
      "file_name": "img_0001.jpg",
      "width": 1920,
      "height": 1080,
      "language": "en"
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "bbox": [x, y, width, height],
      "category_id": 1,
      "attributes": {
        "text": "sodium benzoate",
        "confidence": 1.0
      }
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "ingredient_text"
    }
  ]
}

Key Fields

Field	Description
`images[].language`	ISO 639-1 language code of the primary text
`annotations[].bbox`	Bounding box in COCO format `[x, y, w, h]`: top-left corner, width, and height, in pixels
`annotations[].attributes.text`	Ground-truth transcription of the text region
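
The fields above can be read with the standard library alone. The following is a minimal sketch, assuming the file matches the schema shown in this README; `iter_regions` is a hypothetical helper name. It joins each annotation to its image and converts the COCO `[x, y, w, h]` box to corner coordinates, which many cropping and drawing APIs expect.

```python
import json


def iter_regions(annotation_path):
    """Yield (file_name, language, text, (x_min, y_min, x_max, y_max)) per annotation."""
    with open(annotation_path, encoding="utf-8") as f:
        coco = json.load(f)
    # Index images by id so each annotation can be joined to its image record.
    images = {img["id"]: img for img in coco["images"]}
    for ann in coco["annotations"]:
        img = images[ann["image_id"]]
        x, y, w, h = ann["bbox"]
        corners = (x, y, x + w, y + h)
        yield img["file_name"], img["language"], ann["attributes"]["text"], corners


if __name__ == "__main__":
    # Assumes the dataset was extracted into data/ as described in Download.
    for file_name, language, text, box in iter_regions("data/annotations.json"):
        print(file_name, language, box, text)
```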

Language Distribution

Language	Code	Images	Annotations
Arabic	ar	72	2,518
Danish	da	68	2,380
Dutch	nl	71	2,485
English	en	112	3,920
French	fr	89	3,115
German	de	85	2,975
Indonesian	id	62	2,170
Japanese	ja	78	2,730
Korean	ko	74	2,590
Malay	ms	59	2,065
Norwegian	no	65	2,275
Swedish	sv	70	2,450
Thai	th	69	2,415
Turkish	tr	69	2,350
Total		1,043	36,438
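
The per-language counts in this table can be recomputed from annotations.json. The sketch below assumes every image carries the `language` field described under Key Fields; `language_counts` is an illustrative name rather than a function shipped with the dataset.

```python
import json
from collections import Counter


def language_counts(annotation_path):
    """Return (images per language, annotations per language) as two Counters."""
    with open(annotation_path, encoding="utf-8") as f:
        coco = json.load(f)
    # Each annotation inherits the language of the image it belongs to.
    language_of = {img["id"]: img["language"] for img in coco["images"]}
    image_counts = Counter(language_of.values())
    annotation_counts = Counter(
        language_of[ann["image_id"]] for ann in coco["annotations"]
    )
    return image_counts, annotation_counts


if __name__ == "__main__":
    # Assumes the dataset was extracted into data/ as described in Download.
    images, annotations = language_counts("data/annotations.json")
    for code in sorted(images):
        print(code, images[code], annotations[code], sep="\t")
```

Comparing the printed rows with the table is a simple way to confirm that a download is complete.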

Annotation Guidelines

Each image was annotated by at least two annotators with the following protocol:

Bounding boxes tightly enclose each ingredient text region
Transcriptions preserve the original text as printed, including accents and special characters
Language labels reflect the primary language of the ingredient list on each image
Inter-annotator disagreements were resolved by a third annotator

License

The HalalBench dataset is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

You are free to share and adapt the dataset, provided you give appropriate credit and distribute any derivative works under the same license.

Full terms: https://creativecommons.org/licenses/by-sa/4.0/

Citation

@article{halalbench2026,
  title     = {HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction},
  author    = {Nurul Isma},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2026}
}
