
🏗️ AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

Zhi Jing1,2, Jinbin Qiao2,3, Ouyang Lu2,4, Jicong Ao2, Shuang Qiu5, Yu-Gang Jiang1,*, Chenjia Bai2,*

1Fudan University, 2Institute of Artificial Intelligence (TeleAI), China Telecom,

3Tianjin University, 4Northwestern Polytechnical University, 5City University of Hong Kong

* Equal advising | Equally leading organizations

Paper arXiv Model Datasets Project Page Code

🚀 News

  • [2026-04-29] 🔓 Open-sourced the inference code, AssemLM-V1 weights, and a demo dataset for inference.
  • [2026-04-16] 🗺️ Announced the open-source plan.
  • [2026-04-10] 📄 Uploaded the paper to arXiv: paper
  • [2026-03-15] 🎉 Released the first version of the project page.
  • [2026-03-05] 🏗️ Created the project page and code repository.

⚙️ Setup Environment

Installation Steps

1. Clone the repository

git clone https://github.com/TeleHuman/AssemLM.git
cd AssemLM

2. Create and set up the conda environment

conda create -n assemlm python=3.10.14 -y
conda activate assemlm
bash setting.sh

3. Prepare the model

mkdir models && cd models
huggingface-cli download TeleEmbodied/AssemLM-V1 --local-dir ./AssemLM-V1

4. Prepare the dataset

mkdir datasets && cd datasets
huggingface-cli download --repo-type dataset --resume-download TeleEmbodied/AssemLM --local-dir .
cd ..
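After steps 3 and 4, the commands above leave a `models/AssemLM-V1` folder and a `datasets` folder in the repository root. A small sanity check like the following (a sketch, not part of the official codebase) can confirm the downloads landed where the later scripts expect them:

```python
from pathlib import Path

def check_layout(root="."):
    """Verify that the model and dataset downloads exist.

    The paths mirror the commands in this README:
    models/AssemLM-V1 (step 3) and datasets (step 4).
    """
    expected = {
        "model": Path(root) / "models" / "AssemLM-V1",
        "dataset": Path(root) / "datasets",
    }
    return {name: path.is_dir() for name, path in expected.items()}
```

If either entry is `False`, re-run the corresponding `huggingface-cli download` command before moving on.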

🚀 Getting Started

  1. Run the API server for AssemLM:
bash scripts/run_api.sh
  2. Open another terminal and run the query code:
conda activate assemlm
bash scripts/query_assemlm.sh
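The two steps above start a local API server and query it via the provided script. For custom clients, a request could be assembled roughly as sketched below. Note that the URL, route, and payload field names here are all assumptions for illustration; the actual host, port, and request schema are defined by `scripts/run_api.sh` and `scripts/query_assemlm.sh`:

```python
import json
import urllib.request

# Hypothetical endpoint: the real host, port, and route are configured
# by scripts/run_api.sh. Adjust to match your server.
API_URL = "http://localhost:8000/query"

def build_payload(image_paths, instruction):
    """Assemble a JSON request body.

    The field names ("images", "instruction") are illustrative only,
    not the server's actual schema (see scripts/query_assemlm.sh).
    """
    body = {"images": list(image_paths), "instruction": instruction}
    return json.dumps(body).encode("utf-8")

def query(image_paths, instruction, url=API_URL):
    """POST a query to the running AssemLM API server and return its JSON reply."""
    req = urllib.request.Request(
        url,
        data=build_payload(image_paths, instruction),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

When in doubt, read `scripts/query_assemlm.sh` to see the exact request the official client sends.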

After running, two folders will be created in the root directory:

  • datasets_tmp: contains the input data for the current request.
  • results_tmp: contains the prediction results and visualization outputs.

The first three images are from datasets_tmp, while the last image is from results_tmp.
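To inspect the outputs programmatically, the files in `results_tmp` can be split into visualization images and other prediction files. The extension-based split below is an assumption about the output format, not a documented guarantee:

```python
from pathlib import Path

def collect_results(results_dir="results_tmp"):
    """Partition files under results_tmp into visualization images and
    other prediction outputs (the extension-based split is an assumption)."""
    image_exts = {".png", ".jpg", ".jpeg"}
    images, predictions = [], []
    for path in sorted(Path(results_dir).rglob("*")):
        if path.is_file():
            if path.suffix.lower() in image_exts:
                images.append(path)
            else:
                predictions.append(path)
    return images, predictions
```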

🗺️ Open-Source Plan

  • 🔓 Release AssemLM-V1 weights, inference code, and a demo dataset.
  • 📦 Release the majority of the AssemBench dataset.
  • 📚 Release additional datasets and benchmark resources.
  • 🧠 Release the training code.
  • ⚙️ Release the data processing pipeline.
  • 🚀 Release updated and improved model weights.

🔖 Citation

If you find our work helpful, please cite:

@article{jing2026assemlm,
  title={AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly},
  author={Jing, Zhi and Qiao, Jinbin and Lu, Ouyang and Ao, Jicong and Qiu, Shuang and Jiang, Yu-Gang and Bai, Chenjia},
  journal={arXiv preprint arXiv:2604.08983},
  year={2026}
}

Acknowledgements