iSCORE-PD SNV analysis

This is a set of R Markdown files with scripts used to analyze the genotypes and the potential CRISPR off-targets in the iSCORE-PD cell line collection. The main document, 00_main_document_Run1and2.Rmd, generates a self-contained, standalone HTML report of the full set of analyses.

Prerequisites

Software

R 4.5.0
Java Runtime (JRE) 21.0.6
git
wget or curl
C compiler with make, if running on Linux

Additional libraries if installing from vanilla Ubuntu 24.04 LTS

## Update apt packages
sudo apt update -y

## Install Ubuntu libraries and dependencies
sudo apt install -y \
  build-essential \
  libssl-dev \
  libcurl4-openssl-dev \
  libpng-dev \
  libxml2-dev \
  liblzma-dev \
  libbz2-dev \
  libblas-dev \
  liblapack-dev \
  gfortran \
  default-jre \
  default-jdk \
  libpcre2-dev \
  libdeflate-dev \
  libzstd-dev \
  libtirpc-dev \
  pandoc \
  libfontconfig1-dev \
  libtiff-dev \
  libcairo2-dev

## Update location of java for R deployment
sudo R CMD javareconf

R packages

Bioconductor

From R terminal:

install.packages("BiocManager")

For analyses:

remotes
httpgd
VariantAnnotation
data.table
TxDb.Hsapiens.UCSC.hg38.knownGene
BSgenome.Hsapiens.UCSC.hg38
org.Hs.eg.db
parallel
DT
htmltools
ggplot2
fastreeR
ape
gridExtra

From R terminal:

options(Ncpus = parallel::detectCores())

BiocManager::install(c("remotes",
  "VariantAnnotation",
  "data.table",
  "TxDb.Hsapiens.UCSC.hg38.knownGene",
  "BSgenome.Hsapiens.UCSC.hg38",
  "org.Hs.eg.db",
  "parallel",
  "DT",
  "htmltools",
  "ggplot2",
  "fastreeR",
  "ape",
  "gridExtra"))

remotes::install_github("nx10/httpgd")

For knitting the R Markdown document

rmarkdown

From R terminal:

options(Ncpus = parallel::detectCores())
install.packages("rmarkdown")

Installation

Clone the repo:

git clone https://github.com/hockemeyer-ucb/pd-sv-analysis.git

Download test data

To test your installation, download the iSCORE-PD cell line joint genotypes for chromosome 22 into a directory named SNV within the pd-sv-analysis repo.

wget

wget -P pd-sv-analysis/SNV 'https://pd-cell-lines-data.s3.us-west-2.amazonaws.com/joint-genotyping/study-chr22.vcf.gz'

curl

curl --create-dirs -O --output-dir pd-sv-analysis/SNV 'https://pd-cell-lines-data.s3.us-west-2.amazonaws.com/joint-genotyping/study-chr22.vcf.gz'
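
After either command, a quick integrity check can catch a truncated download before the analysis starts. The following is a minimal sketch, assuming gzip is installed; check_vcf_gz is a hypothetical helper name and is not part of the repository.

```shell
# Hypothetical helper (not part of the repo): verify that a downloaded
# VCF exists, is non-empty, and is a valid gzip stream.
check_vcf_gz() {
  local f="$1"
  if [ ! -s "$f" ]; then
    echo "missing or empty: $f" >&2
    return 1
  fi
  # gzip -t tests the compressed stream without writing any output
  if ! gzip -t "$f" 2>/dev/null; then
    echo "corrupt gzip: $f" >&2
    return 1
  fi
  echo "ok: $f"
}
```

For the test data, it could be called as check_vcf_gz pd-sv-analysis/SNV/study-chr22.vcf.gz.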

Suggested hardware

The script reads the VCF files in parallel, using up to the maximum number of CPUs, and consumes a copious amount of RAM during the serialization process. It has been successfully developed and tested using the following parameters. Please note that, in its current version, the script will crash an instance with 8 CPUs and 64 GB of RAM.

OS: Ubuntu 24.04 LTS
Architecture: ARM64
Disk space: 25 GB
CPU: 16
RAM: 124 GB
Test dataset runtime: ~5 mins
Full dataset runtime: ~2 hrs

I have successfully run the test data on a MacBook Air with an Apple M3 (ARM64) chip, 8 CPUs, and 24 GB of RAM in less than 5 minutes.
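
Before launching a long run on Linux, the machine can be compared against thresholds of your choosing. This is only a sketch: preflight is a hypothetical helper, it relies on nproc and /proc/meminfo (so it will not work on macOS), and the thresholds you pass are your own assumption rather than limits enforced by the scripts.

```shell
# Hypothetical pre-flight check (Linux only): compare this machine against
# minimum CPU and RAM thresholds before launching a long run.
preflight() {
  local min_cpu="$1" min_ram_gb="$2"
  local cpu ram_gb
  cpu=$(nproc)
  # MemTotal is reported in kB; convert to whole GB (rounded down)
  ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
  echo "CPUs: $cpu, RAM: ${ram_gb} GB"
  # Succeed only when both thresholds are met
  [ "$cpu" -ge "$min_cpu" ] && [ "$ram_gb" -ge "$min_ram_gb" ]
}
```

For example, preflight 16 120 || echo "below the tested configuration" prints a warning on smaller machines. The reported RAM is rounded down and is usually slightly below the nominal instance size, so leave some margin in the threshold.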

Knitting the R Markdown documents (with .Rmd files)

The 00_main_document_Run1and2.Rmd file is the main file, which references all the other child files. To knit it (i.e., to generate an HTML report), run the following commands from an R terminal:

library(rmarkdown)
render("./00_main_document_Run1and2.Rmd", output_dir = "html_output")

This will generate a standalone HTML report in the html_output directory.

If you load the .Rmd file in RStudio or VS Code, their knit command will automatically create the document in the final html_output directory.

Analyzing the full dataset

Cleaning the slate

Prior to downloading the full dataset, make sure that the SNV and data directories are emptied or deleted. Some of the larger, compute-intensive files are cached in the data directory the first time the analysis is run. There is no mechanism linking the content of the SNV directory to the files in the data directory, so a stale cache is not detected automatically.

From within the pd-sv-analysis repo, run:

rm -rf SNV data
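
Because rm -rf is unforgiving when run from the wrong directory, a guarded variant may be preferable. The sketch below is an assumption-laden convenience, not part of the repository: clean_slate is a hypothetical name, and it treats the presence of 00_main_document_Run1and2.Rmd as evidence that the target is the repository root.

```shell
# Hypothetical guarded cleanup: only delete SNV/ and data/ when the given
# directory looks like the repository root (it contains the main .Rmd).
clean_slate() {
  local repo="${1:-.}"
  if [ ! -f "$repo/00_main_document_Run1and2.Rmd" ]; then
    echo "not the pd-sv-analysis root: $repo" >&2
    return 1
  fi
  rm -rf "$repo/SNV" "$repo/data"
  echo "removed SNV and data from $repo"
}
```

Called without arguments, clean_slate acts on the current directory.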

Downloading the full dataset

The joint genotypes for all cell lines for each chromosome can be found here:

Here is a simple one-liner to download all the files. Run it in bash from the pd-sv-analysis directory:

for x in {1..22} X Y; do wget -P SNV "https://pd-cell-lines-data.s3.us-west-2.amazonaws.com/joint-genotyping/study-chr${x}.vcf.gz"; done
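
If wget is not available, the same loop can be written with curl. This is a sketch under assumptions: chrom_urls and download_all are hypothetical helper names, and --output-dir requires a reasonably recent curl (it is the same option used for the test data above).

```shell
# Print the 24 per-chromosome URLs (1-22, X, Y), one per line.
chrom_urls() {
  local base='https://pd-cell-lines-data.s3.us-west-2.amazonaws.com/joint-genotyping'
  local x
  for x in $(seq 1 22) X Y; do
    echo "$base/study-chr${x}.vcf.gz"
  done
}

# Hypothetical wrapper: download every file into SNV/ with curl.
# -f fails on HTTP errors instead of saving an error page,
# -S still prints the error, -L follows redirects.
download_all() {
  mkdir -p SNV
  chrom_urls | while read -r url; do
    curl -fSL --output-dir SNV -O "$url" || echo "failed: $url" >&2
  done
}
```

Run download_all from the pd-sv-analysis directory. Any file that fails is reported on stderr, so it can be retried individually.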

Analyse the full dataset

The scripts will pick up each individual gzipped VCF file in the SNV directory and run with them. From an R terminal, simply run the following command and be patient. On a 16-CPU Linux machine, it takes ~2 hrs.

library(rmarkdown)
render("./00_main_document_Run1and2.Rmd", output_dir = "html_output")

Supporting Files

This analysis uses files specific to the Parkinson's disease CRISPR-engineered cell lines. If you would like to run this analysis on your own cell lines, the following files in the supporting_files directory will need to be updated to match your design.

iSCORE-PD_design.csv

The iSCORE-PD_design.csv file is a comma-separated text file that starts with a header, followed by one cell line per line, with the following columns:

samples: Sample ID found in VCF header of the joint genotyping output file
group: Group ID linking the sample to its group
meta: Additional group relation. Not used anymore

iSCORE-PD_cells_grouped_by_editing_methods.csv

The iSCORE-PD_cells_grouped_by_editing_methods.csv file is a comma-separated text file that starts with a header, followed by one cell line per line, with the following columns:

samples: Sample ID found in VCF header of the joint genotyping output file
group: Group ID linking the sample to its group
meta: Additional group relation. Not used anymore
editing_group: Method used for the editing (Cas9, TALEN, PE)

It is used to assign each sample in the VCF to its edited cell line group. In our analysis, each CRISPR edit has several cell line clones.

iSCORE-PD_cells_grouped_with_guides.csv

The iSCORE-PD_cells_grouped_with_guides.csv file is an extension of the previous file, with extra columns holding the ID of the RNA guide(s) used to edit each cell line. The two files could be consolidated but were kept separate so that the samples analyzed can be changed in one file or the other. The number of extra columns equals the maximal number of guides used in any of the cell lines. In our case, some edits used up to 3 guides, so we added 3 extra columns, and each line lists the IDs of the guides used (one, two, or three guide IDs).

samples: Sample ID
group: Group ID
meta: Additional group relation. Not used anymore
editing_group: Method used for the editing (Cas9, TALEN, PE)
guide1
guide2
guide3
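
When adapting these files to your own design, it can help to confirm the header before a long run. The sketch below assumes the header line uses exactly the column names listed above, unquoted and comma-separated. That is an assumption about the file layout, and check_header is a hypothetical helper, not part of the repository.

```shell
# Hypothetical check: confirm that a design CSV starts with the expected
# comma-separated header before running the analysis.
check_header() {
  local file="$1" expected="$2" header
  # Read the first line and strip a trailing carriage return (Windows exports)
  header=$(head -n 1 "$file" | tr -d '\r')
  if [ "$header" = "$expected" ]; then
    echo "header ok: $file"
  else
    echo "unexpected header in $file: $header" >&2
    return 1
  fi
}
```

For the guides file, a call might look like check_header supporting_files/iSCORE-PD_cells_grouped_with_guides.csv 'samples,group,meta,editing_group,guide1,guide2,guide3'.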

sgRNA.txt

This file is used by Cas-OFFinder (https://github.com/snugel/cas-offinder) to predict the putative locations of CRISPR off-target sites. The first line is the location of the chromosome FASTA files for the human hg38 genome on your local hard drive. The second line corresponds to the type of RNA used for the search, given as a sequence pattern, together with the sizes of the DNA and RNA bulges when bulges are searched (details can be found in the Cas-OFFinder documentation). Each following line gives the RNA sequence of one guide, the number of mismatches to be considered, and a guide ID.

For example:

/home/ubuntu/working/genomes/hg38.fullAnalysisSet.chroms
NNNNNNNNNNNNNNNNNNNNNRG
GGAGGGAGTGGTGCATGGTGNNN	5	SNCA_A53T_peg
TCATAGGAATCTTGAATACTNNN	5	SNCA_A53T_nc
CAGGGTGTGGCAGAAGCAGCNNN	5	SNCA_A30P_peg
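
In the example above, every guide sequence has the same length as the pattern on the second line. A small check along those lines can catch copy-paste errors before a GPU run. This is a sketch only: check_sgrna is a hypothetical helper, and the equal-length rule is an assumption drawn from the example, not a statement about everything Cas-OFFinder accepts.

```shell
# Hypothetical sanity check for a Cas-OFFinder input file: every guide
# sequence (line 3 onward) should be as long as the pattern on line 2.
check_sgrna() {
  awk '
    NR == 2 { plen = length($1) }        # pattern is the first field of line 2
    NR >= 3 && NF > 0 {                  # guide lines; blank lines are skipped
      n++
      if (length($1) != plen) {
        printf "line %d: length %d, expected %d\n", NR, length($1), plen
        bad = 1
      }
    }
    END {
      if (!bad) printf "%d guides ok\n", n
      exit bad                           # non-zero status on any mismatch
    }
  ' "$1"
}
```

Run it as check_sgrna supporting_files/sgRNA.txt. It exits with a non-zero status and lists the offending lines if any guide has the wrong length.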

cas-offinder-out.txt

The cas-offinder-out.txt file is the output of running Cas-OFFinder (https://github.com/snugel/cas-offinder). It was generated once, on a GPU instance on AWS, using sgRNA.txt as input:

cas-offinder sgRNA.txt G cas-offinder-out.txt
