This repository provides a training framework for computer use agents with InternVL models.
pip install -r requirements.txtBefore starting fine-tuning, download the pre-trained model we provide.
mkdir pretrained/
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL3-8B --local-dir pretrained/InternVL3-8BAfter downloading the pre-trained model, prepare your customized SFT (Supervised Fine-Tuning) data. Update the JSON file in agent-sft/internvl_chat/data/train_meta.json similar to train_meta.json
The format for the JSON file should be:
{
"your-custom-dataset-1": {
"root": "path/to/the/image/",
"annotation": "path/to/the/jsonl/annotation",
"conv_style": "internvl2_5_mobile_planning_cot_v1",
"data_augment": false,
"max_dynamic_patch": 12,
"repeat_time": 1,
"length": "number of samples in the dataset"
}
}The jsonl annotation looks like:
{"image": "android_lab_data_Cantook/cantook_5/images/home_screen.png", "conversations": [{"from": "human", "value": "<image>\nPlease generate the next move according to the UI screenshot, task and previous operations.\n\nTask: Could you check my reading progress for War and Peace?\n\nPrevious operations:\nNone"}, {"from": "gpt", "value": "<think>\nBased on the second image, I can see this is an e-reader app showing a \"Recently read\" section with several books including \"1984\" by George Orwell (79.5% complete), \"The Sorcerer's Stone\" by J.K. Rowling (48.8% complete), and \"King of Envy\" by Ana Huang (52.9% complete). However, I don't see \"War and Peace\" anywhere in the currently visible book collection, either in the \"Recently read\" or \"Recently added\" sections. I would need to search or browse through the library further to find War and Peace and check its reading progress.\n</think>\n<operation>\nOpen the Cantook app to access your digital book library, which shows recently read books including '1984' by George Orwell (79.5% complete), 'The Sorcerer's Stone' by J.K. Rowling (48.8%), 'King of Envy' by Ana Huang (52.9%), and 'A Christmas Carol'. The app also displays your recently added books including multiple Harry Potter titles.\n</operation>\n<action>\nopen_app(app_name='Cantook')\n</action>"}], "width": 1440, "height": 3120}
{"image": "android_lab_data_Cantook/cantook_5/images/001_start_1749723405576.png", "conversations": [{"from": "human", "value": "<image>\nPlease generate the next move according to the UI screenshot, task and previous operations.\n\nTask: Could you check my reading progress for War and Peace?\n\nPrevious operations:\nStep 1: \"Open the Cantook app to access your digital book library, which shows recently read books including '1984' by George Orwell (79.5% complete), 'The Sorcerer's Stone' by J.K. Rowling (48.8%), 'King of Envy' by Ana Huang (52.9%), and 'A Christmas Carol'. The app also displays your recently added books including multiple Harry Potter titles.\""}, {"from": "gpt", "value": "<think>\nLooking at the screenshot of this e-reader app, I can see the user's library with several books including \"1984\" (79.5% complete), \"The Sorcerer's Stone,\" \"King of Envy,\" \"A Christmas Carol,\" and Harry Potter titles in the \"Recently read\" and \"Recently added\" sections. However, I don't see \"War and Peace\" displayed anywhere in the current Home view. The navigation tabs at the bottom show I could access \"My Books,\" which would likely provide a more complete view of the entire library where I might find War and Peace and check its reading progress. The most logical next step would be to navigate to the \"My Books\" section to search for this specific title.\n</think>\n<operation>\nClick on the 'My Books' tab at the bottom of the screen to view your book library\n</operation>\n<action>\nclick(x=0.3674, y=0.9192)\n</action>"}], "width": 1440, "height": 3120}
...Using 8 GPUs to finetnun VLM:
GPUS=8 bash shell/internvl3_8b_dynamic_res_stage1.sh \
./data/train_meta.json \ # Training data
work_dirs/InternVL3-8B-CUA-exp1 \ # output dir
pretrained/InternVL3-8B-MPO # Based on this pretrained model to finetune the model@article{liu2025scalecua,
title = {ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data},
author = {Liu, Zhaoyang and Xie, Jingjing and Ding, Zichen and Li, Zehao and Yang, Bowen and Wu, Zhenyu and Wang, Xuehui and Sun, Qiushi and Liu, Shi and Wang, Weiyun and Ye, Shenglong and Li, Qingyun and Dong, Xuan and Yu, Yue and Lu, Chenyu and Mo, YunXiang and Yan, Yao and Tian, Zeyue and Zhang, Xiao and Huang, Yuan and Liu, Yiqian and Su, Weijie and Luo, Gen and Yue, Xiangyu and Qi, Biqing and Chen, Kai and Zhou, Bowen and Qiao, Yu and Chen, Qifeng and Wang, Wenhai},
year = {2025},
url = {https://github.com/OpenGVLab/ScaleCUA}
}