This repository contains the source code and experimental setup for the solution developed by the Biomedical and Data Lab at Mahidol University, Thailand, for the Third Scientific Figure Captioning Challenge (SciCap Challenge 2025), held as part of the LM4Sci Workshop at COLM 2025 (October 7–10, Montreal, Canada). Our full paper can be found on arXiv.
The SciCap Challenge 2025 focuses on personalized caption generation for scientific figures using the new LaMP-CAP dataset, which includes over 300,000 figures from 110,000+ scientific papers. The dataset is designed for multimodal caption generation with emphasis on personalization across writing styles and research domains.
The dataset consists of 110,828 scientific articles. Each article includes one target figure and up to three associated profile figures. Each figure comes with its mentioned text, accompanying paragraph, OCR text, caption length, and figure type as context for caption generation. The dataset spans 8 fields with 155 unique categories. For more details about the competition and dataset, visit the SciCap Challenge 2025 website.
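To illustrate, a single record bundles the context fields listed above together with the profile figures; the key names below are illustrative placeholders, not the dataset's actual schema:

```python
# Hypothetical shape of one LaMP-CAP article record (key names are
# illustrative, not the dataset's actual schema).
target_figure = {
    "figure_id": "1234.56789-Figure2",
    "mention": "As shown in Fig. 2, accuracy saturates after 10 epochs.",
    "paragraph": "We train for 50 epochs and observe early saturation.",
    "ocr": ["epochs", "accuracy", "0.9"],
    "caption_length": 18,
    "figure_type": "line chart",
}

# Each article pairs a target figure with up to three profile figures
# from the same paper, used as style exemplars.
article = {
    "target": target_figure,
    "profiles": [target_figure.copy() for _ in range(3)],
}
print(len(article["profiles"]))  # 3
```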
Download the dataset via `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="CrowdAILab/scicap", repo_type="dataset")
```

then merge the split image archive into a single zip:

```shell
zip -F img-split.zip --out img.zip
```

After downloading, we divide the training split into 155 categories based on each article's category for further training. The metadata for referencing target and profile figures can be found in the LaMP-CAP repository.
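The per-category split can be sketched as follows; the `records` list and its `category` field are placeholders standing in for the actual metadata layout:

```python
from collections import defaultdict

# Toy metadata records; in practice these come from the LaMP-CAP
# metadata files (field names here are illustrative).
records = [
    {"paper_id": "a1", "category": "cs.CL"},
    {"paper_id": "a2", "category": "cs.CV"},
    {"paper_id": "a3", "category": "cs.CL"},
]

# Group the training split by the article's category; with the full
# metadata this yields 155 category buckets.
by_category = defaultdict(list)
for rec in records:
    by_category[rec["category"]].append(rec)

print(sorted(by_category))  # ['cs.CL', 'cs.CV']
```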
Our solution is a two-stage caption-generation pipeline that integrates contextual understanding in Stage 1 with author-specific stylistic adaptation in Stage 2.
- Sentence-based filtering with Flan-T5 to remove noisy or irrelevant text segments from the accompanying paragraphs
- Category-level prompt optimization with DSPy's MIPROv2 and SIMBA to develop domain-specific prompts optimized for each field
- Caption candidate selection with Gemini 2.5 Flash to rank the candidates and select the most contextually accurate caption
- Few-shot prompting with profile figures (up to 3 examples) to adapt the content-grounded captions to the author's writing style
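The Stage-1 sentence filtering step can be sketched in miniature as below; a simple keyword-overlap score stands in for the Flan-T5 relevance judgment used in the actual pipeline:

```python
# Sketch of sentence-based paragraph filtering. The overlap score is a
# stand-in for the Flan-T5 relevance model, not the real scorer.
def relevance(sentence: str, mention: str) -> float:
    s, m = set(sentence.lower().split()), set(mention.lower().split())
    return len(s & m) / max(len(m), 1)

def filter_paragraph(paragraph: str, mention: str, threshold: float = 0.2) -> str:
    # Keep only sentences sufficiently related to the figure mention.
    kept = [s for s in paragraph.split(". ") if relevance(s, mention) >= threshold]
    return ". ".join(kept)

mention = "Figure 2 shows test accuracy over epochs"
paragraph = ("Figure 2 shows test accuracy over training epochs. "
             "We thank the reviewers for helpful comments. "
             "Accuracy plateaus after ten epochs")
print(filter_paragraph(paragraph, mention))
```

The acknowledgement sentence is dropped because it shares no vocabulary with the mention, while both accuracy-related sentences survive.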
- We used BLEU-1 to BLEU-4 and ROUGE-1, ROUGE-2, and ROUGE-L (F1) to evaluate the generated captions on the test set.
- Stage 1: improved ROUGE-1 recall by +8.3% while limiting precision loss to -2.8% and BLEU-4 reduction to -10.9%.
- Stage 2: yielded 40-48% gains in BLEU scores and 25-27% gains in ROUGE scores.
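The BLEU n-gram precisions underlying these scores can be illustrated with a minimal pure-Python sketch (a real evaluation should use a standard implementation such as NLTK's):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision, the building block of BLEU-n."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

cand = "accuracy of the model over epochs".split()
ref = "test accuracy of the model across epochs".split()
print(round(modified_precision(cand, ref, 1), 2))  # 0.83
print(round(modified_precision(cand, ref, 2), 2))  # 0.6
```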
- Clone our GitHub repo and install dependencies:

```shell
git clone https://github.com/biodatlab/scicap-titipapa
cd scicap-titipapa
pip install -r requirements.txt
```

- Generate caption candidates with the optimized prompts from the `optimized_prompt` folder:

```shell
python candidate_inference.py
```

- Select the best caption with an LLM:

```shell
python llm_reranking.py
```

- Refine the caption with few-shot prompting on profile figures:

```shell
python caption_refinement.py
```

To evaluate, the output must be in the following format:
```json
[
  {
    "id": 1,
    "candidate": "example_candidate_1",
    "reference": "example_reference_1"
  },
  {
    "id": 2,
    "candidate": "example_candidate_2",
    "reference": "example_reference_2"
  }
]
```
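A quick way to check that an output file matches this schema before running the evaluation script (the variable names here are illustrative):

```python
import json

def validate_output(records):
    """Check the id/candidate/reference schema expected by the evaluator."""
    for rec in records:
        assert {"id", "candidate", "reference"} <= rec.keys(), rec
    return True

sample = [
    {"id": 1, "candidate": "example_candidate_1", "reference": "example_reference_1"},
    {"id": 2, "candidate": "example_candidate_2", "reference": "example_reference_2"},
]
# Round-trip through JSON, as the evaluation script reads a JSON file.
print(validate_output(json.loads(json.dumps(sample))))  # True
```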
The evaluation script is located in `utils/evaluation.py` and is used as follows:
- Download the tokenizer:

```python
import nltk

nltk.download('punkt')
```

- Run the evaluation script:

```shell
python utils/evaluation.py data/sample_input.json
```

After successful execution, a CSV file will be created in the same directory as the input JSON file.
If you use our solution in your research, please cite our paper with the following BibTeX entry:
```bibtex
@misc{timklaypachara2025leveragingauthorspecificcontextscientific,
  title={Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge},
  author={Watcharapong Timklaypachara and Monrada Chiewhawan and Nopporn Lekuthai and Titipat Achakulvisut},
  year={2025},
  eprint={2510.07993},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.07993},
}
```
