FPS-SOAP: Farthest Point Sampling with SOAP Descriptors

FPS-SOAP is a set of scripts for efficient chemical dataset curation using Farthest Point Sampling (FPS) algorithm combined with Smooth Overlap of Atomic Positions (SOAP) descriptors. The tool helps identify structurally dissimilar compounds by calculating similarity scores between molecular geometries, enabling dataset pruning or expansion for machine learning applications in chemistry.

📄 Project Paper

General reactive machine learning potentials for CHON elements

🚀 Environment Setup

Dependencies

Python 3.10.18
Dscribe 2.1.1
ASE 3.25.0
Numpy 2.2.6 (CPU version)
PyTorch 2.7.1 (GPU version)

Installation

# Create environment from requirements.txt (only support CPU version)
conda create --name fps-soap --file requirements.txt

# Activate the environment
conda activate fps-soap

# Install PyTorch (if GPU version is required)
pip install torch==2.7.1+cu118 torchvision==0.22.1+cu118 torchaudio==2.7.1+cu118 --index-url https://download.pytorch.org/whl/cu118

📜 Script Documentation

1. CPU Version: `fps_cpu_numpy.py`

Purpose

Optimized CPU implementation for FPS-based structure similarity sampling using NumPy. Uses Laplacian kernel (accelerated by Numba JIT) for atomic similarity calculation and AverageKernel for molecular similarity aggregation.

Key Parameters

Parameter	Type	Default	Description
`--ref`	str	`""`	Path to reference XYZ file (optional)
`--cand`	str	required	Path to candidate XYZ file (must be provided)
`--n_jobs`	int	`None`	Number of CPU cores for parallel processing (`None` = all available cores)
`--batch_size`	int	50	Batch size for CPU parallel processing
`--r_cut`	float	10.0	Cutoff radius for SOAP descriptor (unit: Å)
`--n_max`	int	6	Number of radial basis functions for SOAP descriptor
`--l_max`	int	4	Maximum degree of spherical harmonics for SOAP descriptor
`--threshold`	float	0.9	Similarity threshold (0-1, structures above this threshold are masked)
`--dynamic_species`	bool	`False`	Use only chemical elements in the current formula (enable with `--dynamic_species`)
`--max_fps_rounds`	int	`None`	Maximum number of FPS rounds (`None` = unlimited)
`--save_soap`	bool	`False`	Save calculated SOAP descriptors to .h5 file (enable with `--save_soap`)
`--save_dir`	str	`fps_results`	Directory to save output results (default: creates `fps_results/[timestamp]/formula` folders)

Features

Automatically initializes reference set with first candidate structure if --ref is empty
Parallelizes across CPU cores for similarity calculation using Numba
Efficiently handles different sizes of datasets by adjusting --threshold and --max_fps_rounds

2. GPU Version: `fps_gpu_torch.py` (in development)

Purpose

GPU-accelerated version using PyTorch for FPS-based structure similarity sampling.
NOTE: GPU version now is significantly SLOWER than CPU version, not recommended!

🧪 Testing

Test Command (CPU)

python fps_cpu_numpy.py \
  --cand tests/test_dataset/rxn000x_all.xyz \
  --save_dir tests/test_result/ \
  --threshold 0.99

Expected Output:

Matching output files in tests/test_result/ (compare with baseline)
8-core CPU runtime: ~10 seconds

📁 Result Structure

Default outputs are saved in:

fps_results/
└── YYYY-MM-DD-HH-MM-SS/
    └── total_output.log                              # Total Log file
    └── Formula/
        ├── updated_ref_structures_Formula.xyz        # Filtered XYZ file
        ├── updated_ref_soap_descriptors_Formula.h5   # (Optional) Saved SOAP descriptors
        └── Formula_output.log                        # Log file

📝 Citation

If you use this tool in your research, please cite:

@article{BowenLi-2025,
  title={General reactive machine learning potentials for CHON elements},
  author={Bowen Li, Sixuan Mi, Jin Xiao, Shuwen Zhang, Han Wang, Tong Zhu},
  journal={Nature Computational Science (Ready to submit)},
  year={2025},
  doi={10.XXXX/XXXX}
}

Last updated: 2025-06-06

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
__pycache__		__pycache__
tests		tests
README.md		README.md
fps_cpu_numpy.py		fps_cpu_numpy.py
fps_gpu_torch.py		fps_gpu_torch.py
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FPS-SOAP: Farthest Point Sampling with SOAP Descriptors

📄 Project Paper

🚀 Environment Setup

Dependencies

Installation

📜 Script Documentation

1. CPU Version: `fps_cpu_numpy.py`

Purpose

Key Parameters

Features

2. GPU Version: `fps_gpu_torch.py` (in development)

Purpose

🧪 Testing

Test Command (CPU)

📁 Result Structure

📝 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FPS-SOAP: Farthest Point Sampling with SOAP Descriptors

📄 Project Paper

🚀 Environment Setup

Dependencies

Installation

📜 Script Documentation

1. CPU Version: fps_cpu_numpy.py

Purpose

Key Parameters

Features

2. GPU Version: fps_gpu_torch.py (in development)

Purpose

🧪 Testing

Test Command (CPU)

📁 Result Structure

📝 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

1. CPU Version: `fps_cpu_numpy.py`

2. GPU Version: `fps_gpu_torch.py` (in development)

Packages