AbeelLab/ml-assisted-p-coumaric-acid-optimization

p-Coumaric acid project

Code for the research paper: Machine Learning-Assisted Pathway Optimization in Large Combinatorial Design Spaces: a p-Coumaric Acid Case Study

NOTE: the data folder in the GitHub repo contains only processed data. For the raw data (and .fasta files), see the 4tu repository.

Data

DNA sequencing data (.fasta files) can be found in the 4tu repository.

  1. Promoter strength screening: PromoterScreen_FCS_Elif
  2. Data for choosing the genes for optimization: GeneTargets
     • The BRENDA database for brendapyrser
     • String-DB enrichment analysis for the genes coming out of FBA (one separate file for PAL/C4H, since these are natively from a different organism)
  3. Library clones (in 4tu repo)
  4. GFP expression: quantified_promoter_strength_rewritten.csv

Scripts

Sequence pool analysis

  1. Analysis of the pooled sequencing of the four library designs: SequencePoolAnalysis(Irsan).py

Preprocessing for ML

Sequencing data is stored in the 4tu repository. We also provide the count matrix and the numeric matrix used for machine learning. GFF files are available upon request.

  1. Processing FCS files to (mean) promoter strengths: PromoterScreenAnalysis.py

  2. Finding gene targets from the GEM, plus additional information on thermodynamics and literature enrichment: FindGeneTargets_YeastGEM.py

  3. Processing GFF files to a promoter-ORF-terminator (pro-orf-ter) count matrix

  4. Merging the previous WUR round with the TUD round (strain count matrix): MergeTUDWURdata.py. Outputs the file batch_corrected_screening_results_integrated.csv, which is used for training.

  5. Convert to a numeric matrix: ConstructNumericMatrix.py

  6. Batch-correction script for WUR data: BatchCorrectWUR.py

  7. Batch correction and some plots of TUD round: BatchCorrection.py

  8. Sampling the designs for sequencing, after screening: SampleDesignsForSequencing.py
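
As an illustration of steps 3 and 5, the sketch below turns per-strain cassette lists into a count matrix and then into a numeric matrix. It is not taken from the scripts in this repository. The input layout, the function names build_count_matrix and to_numeric_matrix, the part names pro1, pro2 and ter1, and the encoding (summed promoter strength per ORF) are all assumptions chosen for illustration; the actual encoding in ConstructNumericMatrix.py may differ.

```python
from collections import Counter


def build_count_matrix(strain_cassettes):
    """Count each (promoter, ORF, terminator) cassette per strain.

    strain_cassettes: dict mapping a strain id to a list of
    (promoter, orf, terminator) tuples (hypothetical input layout).
    Returns the sorted column labels and a dict of count rows.
    """
    columns = sorted({c for cassettes in strain_cassettes.values() for c in cassettes})
    matrix = {}
    for strain, cassettes in strain_cassettes.items():
        counts = Counter(cassettes)
        matrix[strain] = [counts.get(col, 0) for col in columns]
    return columns, matrix


def to_numeric_matrix(strain_cassettes, promoter_strength):
    """Encode each strain as one value per ORF (assumed encoding).

    The value is the summed strength of all promoters driving that ORF
    in the strain, and 0.0 if the ORF is absent.
    """
    orfs = sorted({orf for cassettes in strain_cassettes.values() for _, orf, _ in cassettes})
    matrix = {}
    for strain, cassettes in strain_cassettes.items():
        row = dict.fromkeys(orfs, 0.0)
        for promoter, orf, _ in cassettes:
            row[orf] += promoter_strength[promoter]
        matrix[strain] = [row[orf] for orf in orfs]
    return orfs, matrix


# Toy input: two strains, hypothetical part names and strengths.
strains = {
    "s1": [("pro1", "PAL", "ter1"), ("pro2", "C4H", "ter1")],
    "s2": [("pro1", "PAL", "ter1"), ("pro1", "PAL", "ter1")],
}
strength = {"pro1": 0.5, "pro2": 1.0}

columns, counts = build_count_matrix(strains)
orfs, numeric = to_numeric_matrix(strains, strength)
```

Keeping the count matrix and the numeric matrix as separate steps means the promoter strengths can be re-estimated without re-parsing the sequencing output.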

Rescreening

  1. Rescreened top 86: RescreeningTopProducers

Library visualizations

  1. Gene count fraction: AssessGeneContent.py

ML + Feature importance

  1. DenseWeight implementation + XGBoost training: DenseWeightTraining.py. The remeasured top 100 strains were also integrated here. Their remeasured values were lower than in the first (n=1) measurement round.
  2. TODO: the feature importance scripts need to be rewritten into one .py file. From the old project: 1704_combinatorial_interrogation.py, 010724_individualcontributions_genes.py, 2504_combination_analysis.py
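
For readers unfamiliar with DenseWeight, the following is a minimal pure-Python sketch of the weighting idea: samples whose target value lies in a sparsely populated region of the target distribution receive a larger weight. It is not the code in DenseWeightTraining.py. The function names, the Gaussian kernel, the normal-reference bandwidth rule, and the default alpha and eps values are assumptions made for this example.

```python
import math


def gaussian_kde(targets, bandwidth=None):
    """Return a 1-D Gaussian kernel density estimate of the targets.

    If no bandwidth is given, a normal-reference rule of thumb is used
    (an assumption of this sketch).
    """
    n = len(targets)
    if bandwidth is None:
        mean = sum(targets) / n
        std = math.sqrt(sum((y - mean) ** 2 for y in targets) / (n - 1))
        bandwidth = 1.06 * std * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(y):
        return sum(math.exp(-0.5 * ((y - t) / bandwidth) ** 2) for t in targets) / norm

    return density


def dense_weights(targets, alpha=1.0, eps=1e-6, bandwidth=None):
    """Compute DenseWeight-style sample weights with mean 1.

    alpha controls how strongly rare target values are up-weighted;
    alpha = 0 gives uniform weights.
    """
    density = gaussian_kde(targets, bandwidth)
    p = [density(y) for y in targets]
    lo, hi = min(p), max(p)
    if hi == lo:
        # All targets are equally dense: no sample is rarer than another.
        p_norm = [0.0] * len(p)
    else:
        # Min-max normalise the densities to [0, 1].
        p_norm = [(v - lo) / (hi - lo) for v in p]
    raw = [max(1.0 - alpha * v, eps) for v in p_norm]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


# Toy titers: five similar values and one rare high producer.
targets = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0]
weights = dense_weights(targets, alpha=1.0)
uniform = dense_weights(targets, alpha=0.0)
```

In a setup like the one described above, such weights could be passed to the sample_weight argument of xgboost.XGBRegressor.fit, so that rare high producers contribute more to the training loss.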

Round 2 (DoE)

  1. Recommendation of the new strains (ML-assisted DoE): 170624_newdesigns_V2.ipynb
  2. Analysis of the new designs: AnalysisValidationRound.py
