This repository contains the source code and datasets for our paper published in Journal of Manufacturing Systems (2026):
Designing Synthetic Active Learning for Model Refinement in Manufacturing Parts Detection
Journal of Manufacturing Systems, Volume 84, 2026, Pages 68–84
📄 Read the paper (ScienceDirect)
SAL extends our previous work on static synthetic dataset generation for manufacturing object detection.
- 📄 Previous paper: ICRA 2025, Domain Randomization for Object Detection in Manufacturing Applications Using Synthetic Data: A Comprehensive Study
- 💻 base generation pipeline: SynMfg
The previous work generates a fixed synthetic dataset before training.
SAL instead continuously generates synthetic data during training based on model weaknesses.
This code generates synthetic data from 3D models using SAL. We use two datasets to generate synthetic images and train an object detection model, which performs well on real-world data.
- Robotic Dataset: Published by Horváth et al., this dataset includes both 3D models and real images.
  - 📂 3D Models: Located in `data/Objects/Robotic/`, containing 10 `.obj` files.
  - 🖼️ Real Images: Download from Dropbox – Public Robotic Dataset. We use the `yolo_cropped_all` subset for real-image evaluation.
- SIP15-OD Dataset: Developed by us in our previous ICRA 2025 paper. It contains 15 manufacturing object 3D models across three use cases, along with 395 real images featuring 996 annotated objects taken in various manufacturing environments.
  Due to company policy, the original CAD models cannot be publicly released. However, the real-world annotated images are available via Roboflow-SIP15OD.
Below are samples of the synthetic data and their real-world counterparts from the robotic dataset, as well as the three use cases from the SIP15-OD dataset.
Figure 1. Overview of the Synthetic Active Learning (SAL) framework. The system iteratively generates synthetic data, trains the detection model, evaluates performance weaknesses, and updates generation configurations to refine model performance.
Synthetic Active Learning (SAL) is a fully automatic model refinement framework for manufacturing parts detection using only synthetic data generated with domain randomization.
Traditional domain randomization pipelines generate a fixed synthetic dataset before training. While effective, static datasets cannot adapt to performance weaknesses that emerge during training, such as poor results for specific object categories, challenging materials, object sizes, or recurring misclassification patterns.
Inspired by active learning, SAL shifts the focus from selecting real data for labeling to selecting what synthetic data to generate from an effectively unlimited variation space. The framework continuously identifies weak areas of the model and generates targeted synthetic data to improve them.
The iterative SAL pipeline consists of:
- Synthetic data generation
- Model training
- Weakness evaluation
- Generation configuration update
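The four steps above form a closed loop that can be sketched as follows. This is an illustrative sketch only: `evaluate` and `update` are hypothetical placeholders standing in for the repository's internal stages (the real loop is wired up inside `train.py`), and the toy stand-ins exist just so the sketch runs.

```python
# Minimal sketch of the SAL closed loop (hypothetical helper names;
# the repository's real refinement loop lives inside train.py).

def sal_loop(evaluate, update, config, max_loops=10, min_gain=0.01):
    """Generate/train/evaluate/update until further gains become marginal."""
    prev_score = None
    for loop in range(max_loops):
        # evaluate() stands in for: generate synthetic data, train the
        # model, and run the weakness evaluators on the result.
        score, weaknesses = evaluate(config)
        if prev_score is not None and score - prev_score < min_gain:
            break  # improvements have become marginal: stop refining
        prev_score = score
        # update() stands in for the updaters adjusting the generation config.
        config = update(config, weaknesses)
    return prev_score, loop

# Toy stand-ins so the sketch runs: mAP@50 climbs, then plateaus.
scores = iter([0.60, 0.70, 0.75, 0.755])
toy_evaluate = lambda cfg: (next(scores), ["weak_category"])
toy_update = lambda cfg, weak: {**cfg, "boosted": weak}

best, loops = sal_loop(toy_evaluate, toy_update, {})
print(best, loops)  # prints: 0.75 3
```

The stopping rule mirrors the text: the loop exits once the per-iteration gain drops below a small threshold.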
To diagnose model weaknesses, SAL introduces four custom evaluators:
- Category Performance Evaluator
- Misclassification Evaluator
- Challenging Size Evaluator
- Overall Performance Evaluator
Based on these analyses, four corresponding updaters adjust the synthetic data generation process:
- Category Distribution Updater
- Object Pairwise Updater
- Object Size Updater
- Object Material Updater
This closed-loop system automatically regenerates data targeting underperforming attributes, without requiring real images or manual intervention. The process continues until performance stabilizes and further improvements become marginal.
To enable efficient simultaneous training and generation, SAL employs a data block shifting scheme, where one data block is regenerated while the remaining blocks are used for training. This design supports continuous dataset refinement with efficient GPU utilization.
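The block shifting idea can be illustrated with a toy schedule. `shifting_schedule` is an illustrative helper, not the repository's API; it just shows which block is regenerated versus trained on at each loop.

```python
# Illustrative sketch of data block shifting: with N blocks, one block is
# regenerated each loop while the remaining N-1 blocks feed training.

def shifting_schedule(n_blocks, n_loops):
    """Return (regenerated_block, training_blocks) for each loop."""
    schedule = []
    for loop in range(n_loops):
        regen = loop % n_blocks  # block being rewritten by the generator
        train = [b for b in range(n_blocks) if b != regen]  # blocks used for training
        schedule.append((regen, train))
    return schedule

# With the default nr_dataset_segments = 4:
for regen, train in shifting_schedule(4, 4):
    print(f"regenerate block {regen}, train on {train}")
```

Because generation (Blender/CPU) and training (GPU) touch disjoint blocks, the GPU never waits on data generation.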
Across four industrial use cases from two datasets, SAL achieves:
- +2–6 percentage point improvement in mAP@50 over static training
- Significant gains in previously underperforming categories
- More balanced per-class detection performance
- Reduced need for extensive hyperparameter tuning
SAL demonstrates that synthetic data pipelines can evolve from static dataset creation to adaptive, model-driven refinement suitable for scalable industrial deployment.
- Set up the conda environment using `conda env create -f environment.yml`.
- Activate the environment using `conda activate SAL_Code`.
- Go to Blender 3.4 and download the appropriate version of Blender for your system, e.g. `blender-3.4.1-windows-x64.msi` for Windows or `blender-3.4.1-linux-x64.tar.xz` for Linux.
- Install Blender.
- Set the environment variable `BLENDER_PATH` to the Blender executable, e.g. `C:\Program Files\Blender Foundation\Blender 3.4\blender.exe` for Windows or `/user/blender-3.4.1-linux-x64/blender` for Linux.
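For example, on Linux (adjust the path to wherever you extracted the archive):

```shell
# Point BLENDER_PATH at the Blender 3.4 executable before running the pipeline.
export BLENDER_PATH=/user/blender-3.4.1-linux-x64/blender

# On Windows (PowerShell) the equivalent would be:
#   $env:BLENDER_PATH = "C:\Program Files\Blender Foundation\Blender 3.4\blender.exe"
```

Add the export to your shell profile if you want it to persist across sessions.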
Downloaded textures are put into their corresponding folders inside the data folder structure.
Synthetic_Active_Learning_Code/
└── data/
├── Background_Images/
├── Objects/
├── PBR_Textures/
└── Texture_Images/
- Go to Google Drive.
- Download all image files from the train and testval folders.
- Put all images into `data/Background_Images`.
- Go to the Flickr 8k Dataset.
- Download all image files.
- Put all images into `data/Texture_Images`.
- Run `blenderproc download cc_textures data/PBR_Textures`. This downloads textures from cc0textures.com.
- To use specific material textures such as metal, create a new folder named `data/Metal_Textures` and place only the metal textures from the `cc_textures` data there.
The preparation of the 3D models used in the pipeline is described in the objects section.
This repository uses JSON configuration files (see configs/) to control both synthetic data generation (domain randomization) and the SAL training loop (training, evaluators, and updaters).
Below we summarize the main parameters and their default values used in our experiments.
| Parameter | Description | Default |
|---|---|---|
| **Scene** | | |
| `background_texture_type` | Background texture type. `1`: no texture, `2`: random images from BG-20L. | `2` |
| `total_distracting_objects` | Maximum number of distractors in the scene. | `10` |
| **Object characteristics** | | |
| `max_objects` | Maximum number of objects per scene. `-1` includes all objects and empty background images. | `-1` |
| `multiple_of_same_object` | Allow multiple instances of the same object in one scene. | `TRUE` |
| `object_weights` | Sampling weights for object categories. `[]` means uniform distribution. | `[]` |
| `nr_objects_weights` | Sampling weights for number of objects. `[]` means uniform distribution. | `[]` |
| `object_rotation_x` | Min–max rotation angle around the x-axis (degrees). | 0–360 |
| `object_rotation_y` | Min–max rotation angle around the y-axis (degrees). | 0–360 |
| `object_distance_scale` | Min–max distance ratio between objects. `0.53` prevents overlap. | 0.53–1.0 |
| `objects_texture_type` | Object texture type. `1`: RGB, `2`: image, `3`: PBR, `0`: random. | `3` |
| **Camera** | | |
| `camera_zoom` | Min–max camera zoom. | 0.1–0.7 |
| `camera_theta` | Min–max azimuth angle (degrees). | 0–360 |
| `camera_phi` | Min–max polar angle (degrees, max 90). | 0–60 |
| `camera_focus_point_x` | Min–max shift for focus point x. | 0–0.5 |
| `camera_focus_point_y` | Min–max shift for focus point y. | 0–0.5 |
| `camera_focus_point_z` | Min–max shift for focus point z. | 0–0.5 |
| **Illumination** | | |
| `light_count_auto` | Automatically set light count based on scene size. | `1` |
| `light_energy` | Min–max light energy. | 5–150 |
| `light_color_red` | Min–max red channel value. | 0–255 |
| `light_color_green` | Min–max green channel value. | 0–255 |
| `light_color_blue` | Min–max blue channel value. | 0–255 |
| **Post-processing** | | |
| `vertical_flip` | Probability of vertical flip augmentation. | 0.2 |
| `horizontal_flip` | Probability of horizontal flip augmentation. | 0.2 |
| `blur` | Probability of blur augmentation. | 0.2 |
| `to_gray` | Probability of grayscale conversion. | 0.2 |
| `clahe` | Probability of applying CLAHE. | 0.2 |
| `random_brightness_contrast` | Probability of brightness/contrast adjustment. | 0.2 |
| `random_gamma` | Probability of random gamma correction. | 0.2 |
| `image_compression` | Probability of image compression augmentation. | 0.2 |
| `crop_and_pad` | Probability of crop-and-pad augmentation. | 0.2 |
| `multiplicative_noise` | Probability of multiplicative noise augmentation. | 0.2 |
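As a sketch, a generation config using the defaults above could look like the following. The exact nesting and any additional keys in `configs/config-sample-generation.json` may differ; treat this flat layout as an assumption, not the file's real schema.

```json
{
  "background_texture_type": 2,
  "total_distracting_objects": 10,
  "max_objects": -1,
  "multiple_of_same_object": true,
  "object_weights": [],
  "object_rotation_x": [0, 360],
  "object_distance_scale": [0.53, 1.0],
  "objects_texture_type": 3,
  "camera_zoom": [0.1, 0.7],
  "camera_phi": [0, 60],
  "light_energy": [5, 150],
  "vertical_flip": 0.2
}
```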
| Parameter | Description | Default |
|---|---|---|
| **Training** | | |
| `model` | Base model checkpoint (e.g., `yolov8n.pt`, `yolov8s.pt`). | `yolov8m.pt` |
| `training_dataset_size` | Training dataset size (generates more if needed). | 10500 |
| `validation_dataset_size` | Validation dataset size (`-1` uses all). | 2500 |
| `evaluation_dataset_size` | Evaluation dataset size (`-1` uses all). | `-1` |
| `epochs` | Number of training epochs. | 2000 |
| `num_workers` | Data loader workers. | 4 |
| `batch_size` | Batch size. | 16 |
| `dropout` | Dropout probability. | 0.0 |
| `img_size` | Input image size. | 720 |
| `learning_rate` | Learning rate. | 0.001 |
| `weight_decay` | Weight decay. | 0.0005 |
| `postprocess_iou_thres` | IOU threshold used in post-processing. | 0.3 |
| `postprocess_conf_thres` | Confidence threshold used in post-processing. | 0.01 |
| `nr_dataset_segments` | Number of dataset segments (data blocks). | 4 |
| `training_stop_patience` | Patience (epochs) before stopping training. | 2000 |
| `evaluation_patience` | Patience (epochs) for evaluation. | 100 |
| `evaluation_backoff` | Backoff (epochs) before re-evaluating. | 100 |
| `early_stop_backoff` | Backoff (epochs) for early stopping after new data. | 80 |
| `configuration_update_ratio` | Ratio of new data generated using the updated configuration. | 0.7 |
| **Evaluators** | | |
| `map_conf_thres` | Confidence threshold for mAP computation. | 0.25 |
| `confusion_matrix.iou_thres` | IOU threshold for confusion matrix samples. | 0.45 |
| `confusion_matrix.conf_thres` | Confidence threshold for confusion matrix samples. | 0.25 |
| `incorrect_evaluator.iou_thres` | IOU threshold for incorrect evaluator. | 0.45 |
| `incorrect_evaluator.conf_thres` | Confidence threshold for incorrect evaluator. | 0.25 |
| `confusion_evaluator.iou_thres` | IOU threshold for confusion evaluator. | 0.45 |
| `confusion_evaluator.conf_thres` | Confidence threshold for confusion evaluator. | 0.25 |
| `size_evaluator.iou_thres` | IOU threshold for size evaluator. | 0.45 |
| `size_evaluator.conf_thres` | Confidence threshold for size evaluator. | 0.25 |
| **Updaters** | | |
| `size_configuration.enabled` | Enable object size updater. | True |
| `size_configuration.nr_objects_range` | Object count range used during configuration update. | [2, 12] |
| `size_configuration.camera_zoom_min_range` | Range for updating minimum camera zoom. | [0.05, 0.6] |
| `size_configuration.camera_zoom_max_range` | Range for updating maximum camera zoom. | [0.3, 0.75] |
| `size_configuration.min_object_distance` | Range for updating minimum object distance scale. | [0.5, 1.0] |
| `size_configuration.max_object_distance` | Range for updating maximum object distance scale. | [0.5, 1.0] |
| `class_configuration.enabled` | Enable category distribution updater. | True |
| `pair_configuration.enabled` | Enable object pairwise updater. | True |
| `metal_configuration.enabled` | Enable material updater. | True |
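A corresponding sketch for the training side, again with the nesting assumed rather than taken from the real sample file (dotted names in the table suggest nested objects, which is how they are rendered here):

```json
{
  "model": "yolov8m.pt",
  "training_dataset_size": 10500,
  "epochs": 2000,
  "batch_size": 16,
  "img_size": 720,
  "learning_rate": 0.001,
  "nr_dataset_segments": 4,
  "configuration_update_ratio": 0.7,
  "map_conf_thres": 0.25,
  "size_configuration": { "enabled": true, "nr_objects_range": [2, 12] },
  "class_configuration": { "enabled": true }
}
```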
To start the SAL training process, run `python train.py --config configs/config-sample-continuous.json`.
To run training, three datasets need to be specified in the dataset YAML file: training, validation, and evaluation. If the paths contain fewer samples than the configuration file specifies, more will be generated. After checking that there are enough samples, the datasets that will be continuously updated are copied to a working directory. This enables multiple training instances to run simultaneously and preserves the original dataset.
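For illustration, such a dataset YAML might follow the common Ultralytics layout shown below. The key names (in particular how the evaluation split is declared) and all paths are assumptions here; check the sample configs for the schema `train.py` actually expects.

```yaml
# Hypothetical dataset YAML (Ultralytics-style); paths, class names, and
# the evaluation/test keys are placeholders, not the repository's schema.
path: data/datasets/robotic
train: train/images        # synthetic, continuously regenerated
val: val/images            # synthetic validation split
test: test/images          # real images, used by test.py
names:
  0: part_a
  1: part_b
```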
The resulting model is saved in the parent save folder (`continuous_runs` by default) along with other graphs and metrics. Although the model uses the Ultralytics architecture, it is not saved in the Ultralytics format; to convert it back, use the conversion script.
The resulting SAL model can be tested using our own test implementation, which produces metrics similar to those of Ultralytics. The results are saved to the specified parent run folder. The "test" dataset in the data YAML is used for these tests. Alongside the metrics, predictions and ground truth from the test set are provided.
Run `python test.py --config configs/config-sample-test.json` to start the testing process.
There is also an option to use Ultralytics-based testing ("validation"). For this process, the SAL model is converted to an Ultralytics model. The terminal output of the Ultralytics validation process is captured into a text file placed in the validation folder, which is renamed using the model run name and the dataset name. Multiple model paths ("weights") and dataset YAMLs ("dataset_yamls") can be provided; all provided model paths are tested on all datasets. Be aware that this runs on the "val" images specified in the dataset YAML.
Run `python ultralytics_val.py --config configs/config-sample-ultralytics-val.json` to start the Ultralytics testing process.
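Since the text above mentions "weights" and "dataset_yamls" lists, a minimal config could look like the following sketch (the paths are hypothetical examples, and any other required keys are omitted):

```json
{
  "weights": [
    "continuous_runs/train1/weights/best.pt",
    "continuous_runs/train2/weights/best.pt"
  ],
  "dataset_yamls": [
    "configs/robotic_data.yaml"
  ]
}
```

With this config, both models would be validated on the one dataset, since every weights path is crossed with every dataset YAML.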
To make the model easier to use elsewhere, it can be converted to an Ultralytics model, which can then be used as normal.
Run `python convert_to_ultralytics.py --sal_model_path continuous_runs/train1/weights/best.pt --save_model_path converted_model.pt` to start the Ultralytics conversion process.
Run `python Generation/Blender/generation_main.py --config configs/config-sample-generation.json` to start the generation.
We compare static training with Synthetic Active Learning (SAL) on the robotics use case. Results are shown in terms of training dynamics and real-world evaluation performance.
Static training (above): training loss decreases smoothly on a fixed synthetic dataset.
SAL training (below): loss fluctuates as new synthetic data is introduced in each refinement loop.
The fluctuations in SAL are expected. Each time new targeted synthetic data is generated, the loss temporarily increases before decreasing again as the model adapts. This behavior reflects the continuous dataset refinement process.
Static training (above): mAP@50 on the real test dataset.
SAL training (below): mAP@50 on the real test dataset.
When evaluated on real data (never seen during training), SAL consistently outperforms static training.
- Higher overall mAP@50
- Clear improvements in previously underperforming categories
- More balanced per-class performance
The robotic dataset is from Horváth et al., including their `.obj` files and real images accessed from their GitLab repository. We thank them for their great work!
We also thank previous works on domain randomization for industrial applications, including Tobin et al., Eversberg and Lambrecht, and Horváth et al.
We acknowledge the contributions of the YOLOv8 model from Ultralytics, which we used for training our model.
If you find this work useful for your research, please consider citing:
@article{ZHU202668,
  title = {Designing Synthetic Active Learning for Model Refinement in Manufacturing Parts Detection},
  author = {Zhu, Xiaomeng and Henningsson, Jacob and Mårtensson, Pär and Hanson, Lars and Björkman, Mårten and Maki, Atsuto},
  journal = {Journal of Manufacturing Systems},
  volume = {84},
  pages = {68--84},
  year = {2026},
  doi = {10.1016/j.jmsy.2025.11.023}
}

For the static domain randomization pipeline, please cite our ICRA 2025 work (see the Related Publications section above) and refer to the corresponding GitHub repository: SynMfg








