p-Coumaric acid project

Code for the research paper: Machine Learning-Assisted Pathway Optimization in Large Combinatorial Design Spaces: a p-Coumaric Acid Case Study

NOTE: The data folder in the GitHub repo contains only processed data. For the raw data (and .fasta files), see the 4tu repository.

Data

DNA sequencing data (.fasta files) can be found in the 4tu repository.

Promoter strength screening: PromoterScreen_FCS_Elif
Data for choosing the genes for optimization: GeneTargets

The Brenda database for brendapyrser
String-DB enrichment analysis for the genes identified by FBA (one separate file for PAL/C4H, since these genes natively come from a different organism).

Library clones (in 4tu repo)
GFP expression: quantified_promoter_strength_rewritten.csv

Scripts

Sequence pool analysis

Analysis of the pooled sequencing of the four library designs: SequencePoolAnalysis(Irsan).py

Preprocessing for ML

Sequencing data is stored in the 4tu repository. We also provide the count matrix and the numeric matrix used for machine learning. GFF files are available upon request.
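As an illustration of what a numeric matrix for machine learning could look like, here is a minimal sketch. It is not the code in ConstructNumericMatrix.py. It assumes that each strain is described by the promoter driving each gene, that a gene absent from a strain is encoded as 0.0, and that promoter strengths are available as a lookup table. The promoter names, strength values, strain names, and the function build_numeric_matrix are all hypothetical.

```python
# Hypothetical promoter strengths (arbitrary units). Real values would come
# from the promoter screen, not from this sketch.
PROMOTER_STRENGTH = {"pA": 1.0, "pB": 0.5, "pC": 0.1}


def build_numeric_matrix(designs, genes, strengths):
    """Return one row per strain (sorted by name) and one column per gene.

    designs: dict mapping strain name -> dict mapping gene -> promoter name.
    A gene that is missing from a strain is encoded as 0.0 (an assumption).
    """
    matrix = []
    for strain in sorted(designs):
        cassette = designs[strain]
        row = [strengths[cassette[g]] if g in cassette else 0.0 for g in genes]
        matrix.append(row)
    return matrix


# Two made-up strains: one carries both genes, one carries only PAL.
designs = {
    "strain_1": {"PAL": "pA", "C4H": "pB"},
    "strain_2": {"PAL": "pC"},
}
genes = ["PAL", "C4H"]
numeric = build_numeric_matrix(designs, genes, PROMOTER_STRENGTH)
```

Each row of numeric can then be paired with the measured production of that strain to form a training example.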

Processing .fcs files to (mean) promoter strengths: PromoterScreenAnalysis.py
Finding gene targets from the GEM, plus additional information on thermodynamics and literature enrichment: FindGeneTargets_YeastGEM.py
Processing GFF files to a pro-orf-ter (promoter-ORF-terminator) count matrix
Merging the previous WUR round with the TUD round (strain count matrix): MergeTUDWURdata.py. This outputs the file batch_corrected_screening_results_integrated.csv, which is used for training.
Converting to a numeric matrix: ConstructNumericMatrix.py
Batch correction of the WUR data: BatchCorrectWUR.py
Batch correction and some plots of the TUD round: BatchCorrection.py
Sampling the designs for sequencing after screening: SampleDesignsForSequencing.py
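To give an idea of what batch correction between screening rounds can involve, the sketch below rescales each batch so that its median matches the median of all measurements. This is one simple, generic approach and an assumption on our side for illustration only. It is not necessarily what BatchCorrectWUR.py or BatchCorrection.py do, and a real pipeline may instead rely on shared reference strains. The function name median_scale_batches and all values are hypothetical.

```python
from statistics import median


def median_scale_batches(measurements):
    """Rescale values so that every batch has the same median.

    measurements: list of (batch, strain, value) tuples.
    Returns a list of (batch, strain, corrected_value) tuples.
    Assumes every batch median is non-zero.
    """
    global_median = median(value for _, _, value in measurements)

    # Collect the values of each batch to compute per-batch medians.
    by_batch = {}
    for batch, _, value in measurements:
        by_batch.setdefault(batch, []).append(value)

    # One multiplicative factor per batch.
    factors = {b: global_median / median(vals) for b, vals in by_batch.items()}
    return [(b, s, v * factors[b]) for b, s, v in measurements]


# Made-up titers for two screening rounds with different overall levels.
data = [
    ("WUR", "s1", 10.0), ("WUR", "s2", 20.0), ("WUR", "s3", 30.0),
    ("TUD", "s4", 40.0), ("TUD", "s5", 60.0), ("TUD", "s6", 80.0),
]
corrected = median_scale_batches(data)
```

A multiplicative factor preserves the ranking of strains within a batch, which matters when the corrected values are used to pick top producers.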

Rescreening

Rescreened top 86: RescreeningTopProducers

Library visualizations

Gene count fraction: AssessGeneContent.py

ML + Feature importance

Dense Weight implementation + XGBoost training: DenseWeightTraining.py. The remeasured top 100 strains were also integrated here; their values were lower than in the first (n=1) measurement round.
TODO: Feature importance scripts need to be rewritten into one .py file. From the old project: 1704_combinatorial_interrogation.py, 010724_individualcontributions_genes.py, 2504_combination_analysis.py
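The idea behind the Dense Weight step mentioned above can be sketched as follows: rare target values (for example, the few high producers) receive larger sample weights than common ones. This is a minimal sketch and not the code in DenseWeightTraining.py. It assumes the usual density-based formulation, in which a kernel density estimate of the targets is min-max normalized and turned into weights max(1 - alpha * density, eps), rescaled to mean 1. The bandwidth rule, the function names, and the example values are our own assumptions.

```python
import math
from statistics import mean, stdev


def gaussian_kde(values, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(y):
        return norm * sum(
            math.exp(-0.5 * ((y - v) / bandwidth) ** 2) for v in values
        )

    return density


def dense_weights(targets, alpha=1.0, eps=1e-6):
    """Density-based sample weights: rare target values get larger weights."""
    n = len(targets)
    # Normal-reference rule of thumb for the bandwidth (an assumption).
    bandwidth = 1.06 * stdev(targets) * n ** (-1 / 5)
    density = gaussian_kde(targets, bandwidth)
    dens = [density(y) for y in targets]

    # Min-max normalize the densities to [0, 1].
    lo, hi = min(dens), max(dens)
    span = hi - lo
    normalized = [(d - lo) / span if span > 0 else 0.0 for d in dens]

    # Down-weight dense regions, clip at eps, and rescale to mean 1.
    raw = [max(1.0 - alpha * p, eps) for p in normalized]
    m = mean(raw)
    return [r / m for r in raw]


# Made-up titers: five similar strains and one rare high producer.
targets = [1.0, 1.1, 0.9, 1.2, 1.0, 5.0]
weights = dense_weights(targets)
```

Weights of this kind can be passed to a regressor that supports per-sample weights, for example through the sample_weight argument of XGBRegressor.fit in XGBoost. With alpha=0 all weights are equal, so alpha controls how strongly rare values are emphasized.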

Round 2 (DoE)

Recommendation of the new strains (ML-assisted DoE): 170624_newdesigns_V2.ipynb
Analysis of the new designs: AnalysisValidationRound.py
