FPS-SOAP is a set of scripts for efficient chemical dataset curation using the Farthest Point Sampling (FPS) algorithm combined with Smooth Overlap of Atomic Positions (SOAP) descriptors. The tool identifies structurally dissimilar compounds by calculating similarity scores between molecular geometries, which supports pruning or expanding datasets for machine learning applications in chemistry.
General reactive machine learning potentials for CHON elements
- Python 3.10.18
- DScribe 2.1.1
- ASE 3.25.0
- NumPy 2.2.6 (CPU version)
- PyTorch 2.7.1 (GPU version)
# Create environment from requirements.txt (supports the CPU version only)
conda create --name fps-soap --file requirements.txt
# Activate the environment
conda activate fps-soap
# Install PyTorch (if GPU version is required)
pip install torch==2.7.1+cu118 torchvision==0.22.1+cu118 torchaudio==2.7.1+cu118 --index-url https://download.pytorch.org/whl/cu118

Optimized CPU implementation of FPS-based structure similarity sampling using NumPy. It uses a Laplacian kernel (accelerated by Numba JIT) to calculate atomic similarity and an AverageKernel to aggregate atomic similarities into a molecular similarity.
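To make the two-level similarity concrete, here is a minimal NumPy sketch of a Laplacian kernel between atomic descriptors followed by a normalized average over all atom pairs. It is an illustration under assumptions, not the code in `fps_cpu_numpy.py`: the `gamma` parameter, the function names, and the random toy descriptors (standing in for real SOAP vectors) are all hypothetical, and the Numba acceleration is omitted.

```python
import numpy as np

def laplacian_kernel(x, y, gamma=1.0):
    """Pairwise Laplacian kernel exp(-gamma * ||x_i - y_j||_1) between rows of x and y."""
    # Broadcast to shape (n_x, n_y, n_features), then sum absolute differences.
    l1 = np.abs(x[:, None, :] - y[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * l1)

def average_kernel(a, b, gamma=1.0):
    """Normalized mean of all atom-pair similarities between structures a and b."""
    k_ab = laplacian_kernel(a, b, gamma).mean()
    k_aa = laplacian_kernel(a, a, gamma).mean()
    k_bb = laplacian_kernel(b, b, gamma).mean()
    # Normalization makes the self-similarity of any structure equal to 1.
    return k_ab / np.sqrt(k_aa * k_bb)

# Toy per-atom descriptors (assumption: random values, not real SOAP output).
rng = np.random.default_rng(0)
mol_a = rng.random((3, 5))  # 3 atoms, 5 features
mol_b = rng.random((4, 5))  # 4 atoms, 5 features

sim_ab = average_kernel(mol_a, mol_b)
sim_aa = average_kernel(mol_a, mol_a)
```

Because the average is taken over all atom pairs, structures with different atom counts can be compared directly, which is why a sketch like this needs no atom-to-atom matching step.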
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--ref` | str | `""` | Path to reference XYZ file (optional) |
| `--cand` | str | required | Path to candidate XYZ file (must be provided) |
| `--n_jobs` | int | None | Number of CPU cores for parallel processing (None = all available cores) |
| `--batch_size` | int | 50 | Batch size for CPU parallel processing |
| `--r_cut` | float | 10.0 | Cutoff radius for SOAP descriptor (unit: Å) |
| `--n_max` | int | 6 | Number of radial basis functions for SOAP descriptor |
| `--l_max` | int | 4 | Maximum degree of spherical harmonics for SOAP descriptor |
| `--threshold` | float | 0.9 | Similarity threshold (0-1, structures above this threshold are masked) |
| `--dynamic_species` | bool | False | Use only chemical elements in the current formula (enable with `--dynamic_species`) |
| `--max_fps_rounds` | int | None | Maximum number of FPS rounds (None = unlimited) |
| `--save_soap` | bool | False | Save calculated SOAP descriptors to .h5 file (enable with `--save_soap`) |
| `--save_dir` | str | fps_results | Directory to save output results (default: creates fps_results/[timestamp]/formula folders) |
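The table can be read as an `argparse` definition. The sketch below is a hypothetical reconstruction from the table alone, not the parser shipped with the scripts; the `build_parser` name and the `candidates.xyz` file name are placeholders. It mainly shows that the two boolean options are switches that take no value.

```python
import argparse

def build_parser():
    """Illustrative parser mirroring the parameter table (assumed, not the real one)."""
    p = argparse.ArgumentParser(description="FPS-SOAP sampling (illustrative)")
    p.add_argument("--ref", type=str, default="", help="reference XYZ file (optional)")
    p.add_argument("--cand", type=str, required=True, help="candidate XYZ file")
    p.add_argument("--n_jobs", type=int, default=None, help="CPU cores (None = all)")
    p.add_argument("--batch_size", type=int, default=50)
    p.add_argument("--r_cut", type=float, default=10.0, help="SOAP cutoff radius in Å")
    p.add_argument("--n_max", type=int, default=6, help="radial basis functions")
    p.add_argument("--l_max", type=int, default=4, help="max spherical harmonic degree")
    p.add_argument("--threshold", type=float, default=0.9)
    # Boolean options are switches: present means True, absent means False.
    p.add_argument("--dynamic_species", action="store_true")
    p.add_argument("--max_fps_rounds", type=int, default=None)
    p.add_argument("--save_soap", action="store_true")
    p.add_argument("--save_dir", type=str, default="fps_results")
    return p

# Parse an example command line instead of sys.argv so the snippet runs anywhere.
args = build_parser().parse_args(
    ["--cand", "candidates.xyz", "--threshold", "0.95", "--save_soap"]
)
```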
- Automatically initializes the reference set with the first candidate structure if `--ref` is empty
- Parallelizes similarity calculation across CPU cores using Numba
- Handles datasets of different sizes efficiently by adjusting `--threshold` and `--max_fps_rounds`
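The interplay between the reference set, `--threshold`, and `--max_fps_rounds` can be sketched as a greedy loop over a precomputed similarity matrix. This is one plausible reading of the behavior described above, not the scripts' actual implementation: the `fps_select` function, its arguments, and the 1D toy data are assumptions made for illustration.

```python
import numpy as np

def fps_select(sim, threshold=0.9, max_rounds=None, ref_idx=None):
    """Greedy farthest point sampling on a similarity matrix sim (values in 0-1).

    Returns the indices of the structures kept as references.
    """
    # Seed with the given references, or with the first structure if none are given.
    has_ref = ref_idx is not None and len(ref_idx) > 0
    selected = list(ref_idx) if has_ref else [0]
    # For every structure, track its highest similarity to any selected structure.
    max_sim = sim[:, selected].max(axis=1)
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        # Mask structures above the threshold, and those already selected.
        open_mask = max_sim <= threshold
        open_mask[selected] = False
        if not open_mask.any():
            break
        # Pick the unmasked structure that is least similar to the selected set.
        candidates = np.flatnonzero(open_mask)
        pick = int(candidates[np.argmin(max_sim[candidates])])
        selected.append(pick)
        max_sim = np.maximum(max_sim, sim[:, pick])
        rounds += 1
    return selected

# Toy data: five "structures" on a line, similarity decays with distance.
x = np.array([0.0, 0.01, 1.0, 1.02, 3.0])
sim = np.exp(-np.abs(x[:, None] - x[None, :]))

picked = fps_select(sim, threshold=0.9)                    # [0, 4, 3]
picked_one = fps_select(sim, threshold=0.9, max_rounds=1)  # [0, 4]
```

In the toy run, structures 1 and 2 are never selected because each is more than 0.9 similar to an already selected neighbor. Raising the threshold keeps more near-duplicates, while capping the rounds bounds the size of the result regardless of how many structures remain unmasked.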
GPU-accelerated version using PyTorch for FPS-based structure similarity sampling.
NOTE: The GPU version is currently significantly SLOWER than the CPU version and is not recommended!
python fps_cpu_numpy.py \
--cand tests/test_dataset/rxn000x_all.xyz \
--save_dir tests/test_result/ \
--threshold 0.99

Expected Output:
- Matching output files in `tests/test_result/` (compare with baseline)
- 8-core CPU runtime: ~10 seconds
Default outputs are saved in:
fps_results/
└── YYYY-MM-DD-HH-MM-SS/
    ├── total_output.log                              # Total log file
    └── Formula/
        ├── updated_ref_structures_Formula.xyz        # Filtered XYZ file
        ├── updated_ref_soap_descriptors_Formula.h5   # (Optional) Saved SOAP descriptors
        └── Formula_output.log                        # Log file
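Given this layout, the filtered structures of the most recent run can be located with a few lines of `pathlib`. The helper below is a hypothetical convenience, not part of the scripts; it assumes that each formula folder is named after the formula and that the file names follow the `updated_ref_structures_Formula.xyz` pattern shown above.

```python
from pathlib import Path

def latest_run_outputs(save_dir="fps_results"):
    """Return {formula: path to filtered XYZ file} for the most recent run."""
    root = Path(save_dir)
    if not root.is_dir():
        return {}
    # YYYY-MM-DD-HH-MM-SS folder names sort chronologically as plain strings.
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    if not runs:
        return {}
    outputs = {}
    for formula_dir in sorted(p for p in runs[-1].iterdir() if p.is_dir()):
        xyz = formula_dir / f"updated_ref_structures_{formula_dir.name}.xyz"
        if xyz.exists():
            outputs[formula_dir.name] = xyz
    return outputs
```

Calling `latest_run_outputs()` with no arguments looks in the default `fps_results/` directory and returns an empty dictionary if no run has been saved yet.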
If you use this tool in your research, please cite:
@article{BowenLi-2025,
title={General reactive machine learning potentials for CHON elements},
author={Bowen Li and Sixuan Mi and Jin Xiao and Shuwen Zhang and Han Wang and Tong Zhu},
journal={Nature Computational Science (Ready to submit)},
year={2025},
doi={10.XXXX/XXXX}
}

Last updated: 2025-06-06