🧪 Chemistry Meets ML: LogP Prediction for SIBUR "119 элемент" Challenge

For all the original datasets, look in the folder chem-sibur-119/allsiburdata

🕵️‍♂️ Data Cleaning (a bit messy, but it worked)

The initial dataset contained ~12k molecules. We attempted to gather more data using augmentation (check chem-sibur-119/data_augmentation.ipynb for more info) and simply adding more examples from datasets on Kaggle, HuggingFace, GitHub platforms and etc. but realized it was a bad idea because we were getting unsatisfactory results at the expense of time. So we tried a basic standardization on the primary dataset from "SIBUR" using RDKit and Chython libraries:

Removed invalid SMILES
The duplicates were removed a little crookedly
Some molecules were processed incorrectly — for example, some valid examples were dropped

In the end, we had a dataset with 10898 molecules: no duplicates, no invalid SMILES, everything is fine

Due to some misunderstandings we did not use the cleared dataset

For more information check chem-sibur-119/data_cleaning.ipynb

💡 Finding the best way to predict LogP

Once we had finished data preprocessing, we started thinking about LogP prediction methods:

1. Descriptors from RDKit and Mordred + Morgan Fingerprints + XGBRegressor + Optuna (check MolD_MolFP_pred.ipynb for more info), this part was prepared by Grisha, also check chem-sibur-119/grishapart

The best RMSE on public-leaderboard: 1.00064
The best RMSE on private-leaderboard: 1.00174

Our gut feeling was that using molecular descriptors and molecular fingerprints would not allow us to achieve good RMSE because it's so easy to perform, so we did not spend much time on it. In short, using these parameters seemed very simple and inflexible to us, so we did not pay much attention to it

2. Using GNN, in particular, DMPNN (check DMPNN_pred.ipynb for more info), this part was prepared by Kirill, also check chem-sibur-119/kirillpart

The best RMSE on public-leaderboard: 0.930880
The best RMSE on private-leaderboard: 0.958848

3. Our final model using Chemprop Library (check Chemprop_pred.ipynb for more info), this part was mostly done by Mikhail, but we (Kirill and Grisha) also took some part in it, also check chem-sibur-119/mikhailpart

The best RMSE on public-leaderboard: 0.672947
The best RMSE on private-leaderboard: 0.704836

Due to some misunderstandings, we chose a bad solution to submit our application to the private-leaderboard, so we finished this hackathon in 15th place out of 42, but we could have done better (taken a cleaned dataset and submitted the submission with the best RMSE on the public-leaderboard) and took 6-7th place

🎓 Conclusions

1. Data Quality Matters

Despite initial efforts to augment the dataset, cleaner and more consistent data (like the final processed 10898 molecules) could have improved model performance. Future work should prioritize rigorous validation of preprocessing steps to avoid accidental exclusion of valid samples

2. Model Selection & Flexibility

Simple descriptor-based methods (RDKit/Mordred + XGBoost) underperformed (RMSE ~1.0), confirming their limitations for complex LogP prediction
GNNs (DMPNN) showed better results (RMSE ~0.93–0.96), but the Chemprop-based model outperformed both (RMSE ~0.67–0.70), proving that task-specific architectures are critical

3. Submission Strategy

A misstep in leaderboard submission (using a suboptimal model version) cost us ~6–7 positions. Always validate the final pipeline before submission

4. Key Takeaways

Chemprop is a powerful tool for molecular property prediction, but requires careful hyperparameter tuning
Team coordination is as important as technical work — clear communication could have prevented the dataset/leaderboard issues

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧪 Chemistry Meets ML: LogP Prediction for SIBUR "119 элемент" Challenge

🕵️‍♂️ Data Cleaning (a bit messy, but it worked)

💡 Finding the best way to predict LogP

1. Descriptors from RDKit and Mordred + Morgan Fingerprints + XGBRegressor + Optuna (check MolD_MolFP_pred.ipynb for more info), this part was prepared by Grisha, also check chem-sibur-119/grishapart

2. Using GNN, in particular, DMPNN (check DMPNN_pred.ipynb for more info), this part was prepared by Kirill, also check chem-sibur-119/kirillpart

3. Our final model using Chemprop Library (check Chemprop_pred.ipynb for more info), this part was mostly done by Mikhail, but we (Kirill and Grisha) also took some part in it, also check chem-sibur-119/mikhailpart

🎓 Conclusions

1. Data Quality Matters

2. Model Selection & Flexibility

3. Submission Strategy

4. Key Takeaways

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
allsiburdata		allsiburdata
grishapart		grishapart
kirillpart		kirillpart
mikhailpart		mikhailpart
README.md		README.md
data_augmentation.ipynb		data_augmentation.ipynb
data_cleaning.ipynb		data_cleaning.ipynb

Folders and files

Latest commit

History

Repository files navigation

🧪 Chemistry Meets ML: LogP Prediction for SIBUR "119 элемент" Challenge

🕵️‍♂️ Data Cleaning (a bit messy, but it worked)

💡 Finding the best way to predict LogP

1. Descriptors from RDKit and Mordred + Morgan Fingerprints + XGBRegressor + Optuna (check MolD_MolFP_pred.ipynb for more info), this part was prepared by Grisha, also check chem-sibur-119/grishapart

2. Using GNN, in particular, DMPNN (check DMPNN_pred.ipynb for more info), this part was prepared by Kirill, also check chem-sibur-119/kirillpart

3. Our final model using Chemprop Library (check Chemprop_pred.ipynb for more info), this part was mostly done by Mikhail, but we (Kirill and Grisha) also took some part in it, also check chem-sibur-119/mikhailpart

🎓 Conclusions

1. Data Quality Matters

2. Model Selection & Flexibility

3. Submission Strategy

4. Key Takeaways

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages