Code for the research paper: Machine Learning-Assisted Pathway Optimization in Large Combinatorial Design Spaces: a p-Coumaric Acid Case Study
NOTE: The data folder in the GitHub repo contains only processed data. The raw data, including the DNA sequencing data (.fasta files), can be found in the 4tu repository.
- Promoter strength screening: PromoterScreen_FCS_Elif
- Data for choosing the genes for optimization: GeneTargets
- The Brenda database for brendapyrser
- STRING-DB enrichment analysis for the genes identified by FBA (one separate file for PAL/C4H, since these genes are native to a different organism)
- Library clones (in the 4tu repository)
- GFP expression: quantified_promoter_strength_rewritten.csv
- Analysis of the pooled sequencing of the four library designs: SequencePoolAnalysis(Irsan).py
Sequencing data is stored in the 4tu repository. We also provide the count matrix and the numeric matrix used for machine learning. GFF files are available upon request.
- Processing fcs files to (mean) promoter strengths: PromoterScreenAnalysis.py
- Finding gene targets from GEM + additional info on Thermodynamics and Literature enrichment: FindGeneTargets_YeastGEM.py
- Processing gff files to pro-orf-ter count matrix
- Merging previous WUR round with TUD round (strain count matrix): MergeTUDWURdata.py. Outputs the file batch_corrected_screening_results_integrated.csv, which is used for training.
- Convert to a numeric matrix: ConstructNumericMatrix.py
- Batch-correction script for WUR data: BatchCorrectWUR.py
- Batch correction and some plots of TUD round: BatchCorrection.py
- Sampling the designs for sequencing, after screening: SampleDesignsForSequencing.py
- Rescreened top 86: RescreeningTopProducers
- Gene count fraction: AssessGeneContent.py
- DenseWeight implementation and XGBoost training: DenseWeightTraining.py. The remeasured top 100 strains were also integrated here; their measurements were lower than in the first (n=1) measurement round.
- TODO: the feature importance scripts need to be merged into a single .py file. From the old project: 1704_combinatorial_interrogation.py, 010724_individualcontributions_genes.py, 2504_combination_analysis.py
- Recommendation of the new strains (ML-assisted DoE): 170624_newdesigns_V2.ipynb
- Analysis of the new designs: AnalysisValidationRound.py
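
For readers unfamiliar with the weighting step listed under DenseWeightTraining.py, the following is a minimal sketch of density-based sample weighting for regression, not the code from that script. It assumes a Gaussian kernel density estimate with a normal-reference bandwidth; the function names, the `alpha` and `eps` defaults, and the synthetic titers are all hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_kde_density(y):
    """Evaluate a 1-D Gaussian kernel density estimate at every sample point.

    Assumption: the bandwidth follows the normal-reference rule of thumb,
    1.06 * std * n^(-1/5). The actual script may use a different estimator.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    bandwidth = 1.06 * y.std(ddof=1) * n ** (-1 / 5)
    diffs = (y[:, None] - y[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)


def dense_weights(y, alpha=1.0, eps=1e-6):
    """Give rare target values larger weights than common ones.

    alpha controls how strongly dense regions are down-weighted;
    eps keeps every weight strictly positive.
    """
    y = np.asarray(y, dtype=float)
    if np.ptp(y) == 0:
        # All targets identical: no density contrast, so weight uniformly.
        return np.ones_like(y)
    density = gaussian_kde_density(y)
    # Min-max normalise the density to [0, 1].
    normalized = (density - density.min()) / (density.max() - density.min())
    weights = np.maximum(1.0 - alpha * normalized, eps)
    # Rescale so that the mean weight is 1.
    return weights / weights.mean()


# Synthetic example (not project data): many low producers, few high producers.
rng = np.random.default_rng(0)
titers = np.concatenate([rng.normal(1.0, 0.1, 95), rng.normal(3.0, 0.1, 5)])
weights = dense_weights(titers, alpha=1.0)
```

Weights of this kind can be passed to a regressor through a `sample_weight` argument, for example `XGBRegressor.fit(X, y, sample_weight=weights)`, so that the rare high producers contribute more to the training loss. Whether the script passes them in exactly this way is an assumption.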