diff --git a/0.download_data/README.md b/0.download_data/README.md index e4ea49b..0dac6bd 100644 --- a/0.download_data/README.md +++ b/0.download_data/README.md @@ -1,11 +1,13 @@ -# Download JUMP-Target SQLite files from AWS +# Download JUMP-Target SQLite Files from AWS -In this module, we download the SQLite files from [AWS](https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0000-jump-pilot/source_4/workspace/backend/2020_11_04_CPJUMP1/) with [aws-cli](https://github.com/aws/aws-cli) on Aug 10, 2023 using instructions provided from [JUMP Cell Painting Datasets](https://github.com/jump-cellpainting/datasets). -There are 51 plates from the pilot dataset (cpg0000), totalling 1.1 TB of storage from the SQLite files. +In this module, we download SQLite files from [AWS](https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0000-jump-pilot/source_4/workspace/backend/2020_11_04_CPJUMP1/) with [aws-cli](https://github.com/aws/aws-cli). +There are 51 plates from the pilot dataset (`cpg0000`), totaling about 1.1 TB of SQLite files. -Firstly, we generate a manifest file in the [data folder](./data/) called [jump_dataset_location_manifest.csv](./data/jump_dataset_location_manifest.csv). -Afterwards, we process each plate using [CytoTable](https://github.com/cytomining/CytoTable). +First, we generate a manifest file in the [data folder](./data/) called [jump_dataset_location_manifest.csv](./data/jump_dataset_location_manifest.csv). +Afterward, we process each plate using [CytoTable](https://github.com/cytomining/CytoTable). -Optionally, to download only the SQLite plates, please use the [download_from_aws.sh](./download_from_aws.sh) file, which contains the bash script that will download the files from the paths in the manifest. +The module entrypoint is [run.sh](./run.sh), which runs manifest generation, CytoTable plate processing, and image-download notebook execution. -Please see the notes from the main [`README.md` on processing this step](../README.md#running-code-from-this-project). +Optionally, to download only the SQLite plates, use [download_from_aws.sh](./download_from_aws.sh), which downloads files from paths in the manifest. + +See the main [`README.md` section on running code](../README.md#running-code-from-this-project) for step-level execution details. diff --git a/1.process_data/README.md b/1.process_data/README.md index a7909d0..174c7f8 100644 --- a/1.process_data/README.md +++ b/1.process_data/README.md @@ -1,21 +1,22 @@ -# Merge, normalize, feature select, and aggregate single cells with pycytominer +# Single-cell Quality Control, Merge, Normalize, Feature Select, and Aggregate Single Cells -In this module, we perform four preprocessing steps on the SQLite files using [pycytominer](https://github.com/cytomining/pycytominer/tree/main): +In this module, we perform five preprocessing steps on single-cell data generated from CytoTable outputs, using [pycytominer](https://github.com/cytomining/pycytominer/tree/main) for normalization, feature selection, and aggregation: -1. Merge and annotate single cells from the SQLite file using the [pycytominer SingleCell class](https://github.com/cytomining/pycytominer/blob/main/pycytominer/cyto_utils/cells.py) -2. [Normalize](https://github.com/cytomining/pycytominer/blob/main/pycytominer/normalize.py) the single cells using the negative controls (e.g., DMSO for compound treatment, no-target or target intergenic region sgRNAs for crispr treatment, and genes with weak signatures in orf treatment) as reference for the standard scalar method per plate. -3. [Feature Select](https://github.com/cytomining/pycytominer/blob/main/pycytominer/feature_select.py) the single cell plate morphology data per plate by variance thresholding, correlation thresholding, and by filtering columns containing NaNs and columns specified in the blocklist. -4. [Aggregate](https://github.com/cytomining/pycytominer/blob/main/pycytominer/feature_select.py) both the normalized and feature selected single-cell morphology data to the well level. +1. Perform single-cell quality control (QC) after CytoTable and before annotation/merging to remove low-quality cells. +2. Merge and annotate single-cell profiles from CytoTable outputs for downstream normalization and feature selection. +3. [Normalize](https://github.com/cytomining/pycytominer/blob/main/pycytominer/normalize.py) single cells using negative controls (for example, DMSO for compounds, no-target or intergenic-targeting sgRNAs for CRISPR, and weak-signature genes for ORF) as reference populations for standard scaling per plate. +4. [Feature select](https://github.com/cytomining/pycytominer/blob/main/pycytominer/feature_select.py) single-cell morphology data per plate using variance thresholding, correlation thresholding, and filtering columns containing NaNs or listed in the blocklist. +5. Aggregate both normalized and feature-selected single-cell morphology data to the well level. -## Run merging and normalization notebook +## Run Single-cell Processing Pipeline -To process the data, run the [process_data.sh](./process_data.sh) file which will convert the notebook into a python file and run it from terminal. +To process the data, run [process_data.sh](./process_data.sh), which converts notebooks to Python scripts in `nbconverted/` and runs QC-aware single-cell processing through merging/annotation, normalization, and feature selection; then run [aggregate_sc_data.sh](./aggregate_sc_data.sh) for well-level aggregation. ```bash # Make sure you are in the 1.process_data directory cd 1.process_data -# Process the data with steps 1-3 +# Process the data with steps 1-4 ./process_data.sh -# Process the data with step 4 +# Process the data with step 5 ./aggregate_sc_data.sh ``` diff --git a/2.evaluate_data/README.md b/2.evaluate_data/README.md index 660272d..1cacd95 100644 --- a/2.evaluate_data/README.md +++ b/2.evaluate_data/README.md @@ -1,18 +1,19 @@ -# Apply phenotypic profiling model to JUMP data +# Apply phenotypic profiling models to JUMP data -In this module, we generate single-cell probabilities for each of the 15 phenotypic classes by applying the [phenotypic profiling model](https://github.com/WayScience/phenotypic_profiling_model). -There are two model types, final and shuffled baseline. -The shuffled baseline model trains using randomly shuffled single-cell features. -We output one file for all plates that contains phenotypic probabilities and relevant metadata for all of the single-cells. -The files we output are in `parquet` format. +This module generates single-cell probabilities for 15 phenotypic classes using trained morphology models. +Current workflows focus on class-balanced logistic regression prediction runs and related per-plate processing. -## Run the prediction notebook +Outputs are written as `parquet` files with phenotypic probabilities and relevant metadata. -To generate the probabilities for each single cell, run the [evaluate_data.sh](./evaluate_data.sh) file which will convert the notebook into a python file and run it from terminal. +## Run predictions + +To generate probabilities, run [evaluate_data.sh](./evaluate_data.sh): ```bash -# Make sure you are in the 2.evaluate_data directory +# make sure you are in the 2.evaluate_data directory cd 2.evaluate_data -# Run the notebook as a python script +# run prediction pipeline source evaluate_data.sh ``` + +`evaluate_data.sh` executes the active prediction workflow in `class_balanced_log_reg_areashape_predict_sc_probabilities/`. diff --git a/2.evaluate_data/compute_sc_anomalyze_data/compute_aggregate_treatment_anomaly_data.ipynb b/2.evaluate_data/compute_sc_anomalyze_data/compute_aggregate_treatment_anomaly_data.ipynb deleted file mode 100644 index fb570b8..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/compute_aggregate_treatment_anomaly_data.ipynb +++ /dev/null @@ -1,292 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "62fa96b8", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "# Compute Aggregate Treatment Anomaly Data\n", - "Anomalies data are computed by first\n", - "aggregating cellprofiler data to the treatment level, and\n", - "then computing anomaly scores and feature importances for later analysis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "47212c25", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "import argparse\n", - "import pathlib\n", - "import sys\n", - "\n", - "import numpy as np\n", - "import pandas as pd\n", - "from sklearn.ensemble import IsolationForest\n", - "\n", - "# Get the current working directory\n", - "cwd = pathlib.Path.cwd()\n", - "\n", - "if (cwd / \".git\").is_dir():\n", - " root_dir = cwd\n", - "\n", - "else:\n", - " root_dir = None\n", - " for parent in cwd.parents:\n", - " if (parent / \".git\").is_dir():\n", - " root_dir = parent\n", - " break\n", - "\n", - "# Check if a Git root directory was found\n", - "if root_dir is None:\n", - " raise FileNotFoundError(\"No Git root directory found.\")\n", - "\n", - "sys.path.append(str((root_dir / \"2.evaluate_data\" / \"utils\").resolve(strict=True)))\n", - "from isolation_forest_data_feature_importance import IsoforestFeatureImportance" - ] - }, - { - "cell_type": "markdown", - "id": "918d564a", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "## Inputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6e48ab2a", - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [], - "source": [ - "parser = argparse.ArgumentParser()\n", - "parser.add_argument(\n", - " \"--big_drive_path\",\n", - " type=pathlib.Path,\n", - " default=root_dir / \"big_drive\",\n", - ")\n", - "args = parser.parse_args()\n", - "\n", - "big_drive_path = args.big_drive_path.resolve(strict=True)\n", - "\n", - "agg_anomaly_data_path = (big_drive_path / \"feature_selected_sc_qc_data\").resolve(\n", - " strict=True\n", - ")\n", - "\n", - "exp_meta = pd.read_csv(\n", - " root_dir / \"reference_plate_data/experiment-metadata.tsv\", sep=\"\\t\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "5b726e8d", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "## Outputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9cb8285a", - "metadata": {}, - "outputs": [], - "source": [ - "agg_treatment_anomaly_data_path = big_drive_path / \"aggregated_anomaly_data\"\n", - "agg_treatment_anomaly_data_path.mkdir(parents=True, exist_ok=True)" - ] - }, - { - "cell_type": "markdown", - "id": "333acaf7", - "metadata": {}, - "source": [ - "## Aggregate Treatment Profiles per Plate\n", - "Aggregate the treatment profiles per plate, while\n", - "also taking the intersection of all cellprofiler features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6768d73a", - "metadata": {}, - "outputs": [], - "source": [ - "exp_meta = exp_meta[[\"Assay_Plate_Barcode\", \"Perturbation\"]]\n", - "agg_platedf = []\n", - "feat_cols = set()\n", - "\n", - "for sc_plate_path in agg_anomaly_data_path.rglob(\"*.parquet\"):\n", - " morph_platedf = pd.read_parquet(sc_plate_path)\n", - " treatment_col_name = \"Metadata_pert_iname\"\n", - " plate_feat_cols = set(\n", - " morph_platedf.columns[~morph_platedf.columns.str.contains(\"Metadata\")].tolist()\n", - " )\n", - "\n", - " if not feat_cols:\n", - " feat_cols = plate_feat_cols\n", - " else:\n", - " feat_cols &= plate_feat_cols\n", - "\n", - " agg_mapping = dict.fromkeys(feat_cols, \"mean\")\n", - " morph_platedf = morph_platedf.assign(Metadata_group_size=1)\n", - " agg_mapping |= {\"Metadata_group_size\": \"size\", \"Metadata_Plate\": \"first\"}\n", - "\n", - " # Distinguish between Compound data and (CRISPR and ORF data)\n", - " if not np.any(morph_platedf.columns.str.contains(treatment_col_name)):\n", - " treatment_col_name = \"Metadata_gene\"\n", - "\n", - " agg_platedf.append(\n", - " morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()\n", - " )\n", - "\n", - "feat_cols = sorted(feat_cols)\n", - "agg_platedf = pd.concat(agg_platedf, axis=0)" - ] - }, - { - "cell_type": "markdown", - "id": "c3f17d8a", - "metadata": {}, - "source": [ - "## Merging and Processing Aggregate Profiles" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8448988b", - "metadata": {}, - "outputs": [], - "source": [ - "agg_platedf = pd.merge(\n", - " exp_meta,\n", - " agg_platedf,\n", - " how=\"inner\",\n", - " left_on=\"Assay_Plate_Barcode\",\n", - " right_on=\"Metadata_Plate\",\n", - ")\n", - "\n", - "agg_platedf = agg_platedf.rename(columns={\"Perturbation\": \"Metadata_Perturbation\"})" - ] - }, - { - "cell_type": "markdown", - "id": "9f15d96d", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "## Compute Aggregate Feature Importances\n", - "Isolation forest reference:\n", - "https://ieeexplore.ieee.org/document/4781136" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9e1790b3", - "metadata": {}, - "outputs": [], - "source": [ - "isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1)\n", - "agg_platedf = agg_platedf.assign(\n", - " **{\n", - " \"Result_inlier\": isofor.fit_predict(agg_platedf[feat_cols]),\n", - " \"Result_anomaly_score\": isofor.decision_function(agg_platedf[feat_cols]),\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "d998c802", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "## Compute Feature Importances\n", - "Computes sample feature importances using anomaly scores per sample." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b420ae52", - "metadata": {}, - "outputs": [], - "source": [ - "metadf = agg_platedf.copy().filter(regex=\"Metadata|Result\")\n", - "\n", - "result = IsoforestFeatureImportance(\n", - " estimators=isofor.estimators_,\n", - " morphology_data=agg_platedf[list(isofor.feature_names_in_)],\n", - ")()" - ] - }, - { - "cell_type": "markdown", - "id": "4b2e159c", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "## Combine Anomaly Results and Metadata" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "19fc4a55", - "metadata": {}, - "outputs": [], - "source": [ - "meta_resultdf = metadf[metadf.columns.difference(result.columns)]\n", - "result = result.join(meta_resultdf, how=\"inner\")" - ] - }, - { - "cell_type": "markdown", - "id": "bd6d2407", - "metadata": {}, - "source": [ - "## Save Aggregate Anomaly Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5493f454", - "metadata": {}, - "outputs": [], - "source": [ - "result.to_parquet(\n", - " agg_treatment_anomaly_data_path / \"aggregated_treatment_anomaly_data.parquet\"\n", - ")" - ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "-all", - "main_language": "python", - "notebook_metadata_filter": "-all" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/2.evaluate_data/compute_sc_anomalyze_data/compute_all_sc_anomalyze_data.sh b/2.evaluate_data/compute_sc_anomalyze_data/compute_all_sc_anomalyze_data.sh deleted file mode 100755 index 0ed67e5..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/compute_all_sc_anomalyze_data.sh +++ /dev/null @@ -1,53 +0,0 @@ -#!/bin/bash - -# This script calls python code to compute single-cell anomaly data with all datasets. - -conda init bash -conda activate jump_sc - -# Get the root directory -git_root=$(git rev-parse --show-toplevel) - -py_path="nbconverted" - -jupyter nbconvert --to python --output-dir="${py_path}/" ./*.ipynb - -iso_forest_paths="${git_root}/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models" -big_drive_path="${1:-${git_root}/big_drive}" -anomaly_data_path="${2:-${big_drive_path}/sc_anomaly_data}" - - -# Get the single-cell data path (with multiple plates) -plate_paths=( - "${big_drive_path}/feature_selected_sc_qc_data" - "${big_drive_path}/normalized_sc_qc_data" - "${big_drive_path}/feature_selected_sc_data" - "${big_drive_path}/normalized_sc_data" -) - -for plate_dir in "${plate_paths[@]}"; do - - if [ -d "$plate_dir" ]; then - - echo -e "\nComputing anomaly data and feature importance data from $plate_dir" - for file in "$plate_dir"/*.parquet; do - - if [ -f "$file" ]; then - - iso_forest_path="$iso_forest_paths/$(basename "$plate_dir")_isolation_forest.joblib" - - /usr/bin/time -v python3 "$py_path/compute_sc_anomaly_data.py" "$file" "$iso_forest_path" "$anomaly_data_path" - - fi - - done - - else - echo "Error: '$plate_dir' is not a directory." - return 1 - fi - -done - -/usr/bin/time -v python3 "$py_path/compute_sc_feature_importance_data.py" --big_drive_path "$big_drive_path" -/usr/bin/time -v python3 "$py_path/compute_aggregate_treatment_anomaly_data.py" --big_drive_path "$big_drive_path" diff --git a/2.evaluate_data/compute_sc_anomalyze_data/compute_feature_importance_by_plate.ipynb b/2.evaluate_data/compute_sc_anomalyze_data/compute_feature_importance_by_plate.ipynb deleted file mode 100644 index eed23a2..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/compute_feature_importance_by_plate.ipynb +++ /dev/null @@ -1,179 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "34b8f202", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "# Compute Plate Feature Importances\n", - "Created to prevent memory leakage while computing feature importances by calling this script from a bash script" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8d20a767", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "import argparse\n", - "import pathlib\n", - "import sys\n", - "\n", - "import joblib\n", - "import pandas as pd\n", - "\n", - "# Get the current working directory\n", - "cwd = pathlib.Path.cwd()\n", - "\n", - "if (cwd / \".git\").is_dir():\n", - " root_dir = cwd\n", - "\n", - "else:\n", - " root_dir = None\n", - " for parent in cwd.parents:\n", - " if (parent / \".git\").is_dir():\n", - " root_dir = parent\n", - " break\n", - "\n", - "# Check if a Git root directory was found\n", - "if root_dir is None:\n", - " raise FileNotFoundError(\"No Git root directory found.\")\n", - "\n", - "sys.path.append(str((root_dir / \"2.evaluate_data\" / \"utils\").resolve(strict=True)))\n", - "from isolation_forest_data_feature_importance import IsoforestFeatureImportance" - ] - }, - { - "cell_type": "markdown", - "id": "b1eec855", - "metadata": {}, - "source": [ - "# Inputs and Outputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "acec67a7", - "metadata": {}, - "outputs": [], - "source": [ - "parser = argparse.ArgumentParser()\n", - "parser.add_argument(\"--anomaly_dataset_path\", type=str, required=True)\n", - "parser.add_argument(\"--plate\", type=str, required=True)\n", - "parser.add_argument(\"--morphology_dataset_path\", type=str, required=True)\n", - "parser.add_argument(\"--model_path\", type=str, required=True)\n", - "parser.add_argument(\"--feature_importances_output_path\", type=str, required=True)\n", - "parser.add_argument(\"--plate_mapping_path\", type=str, required=True)\n", - "\n", - "args = parser.parse_args()\n", - "\n", - "anomaly_dataset = pathlib.Path(args.anomaly_dataset_path)\n", - "morphology_dataset = pathlib.Path(args.morphology_dataset_path)\n", - "model_path = pathlib.Path(args.model_path)\n", - "feature_importances_output_path = pathlib.Path(args.feature_importances_output_path)\n", - "plate_mapping_path = pathlib.Path(args.plate_mapping_path)\n", - "anomalyze_model = joblib.load(model_path)" - ] - }, - { - "cell_type": "markdown", - "id": "268b43b6", - "metadata": {}, - "source": [ - "# Compute Feature Importances" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "37cc374b", - "metadata": {}, - "outputs": [], - "source": [ - "# Number of samples to compute feature importances for in each category (anomalous or inlier)\n", - "num_control_samples = 300\n", - "num_anomalous_samples = 300\n", - "\n", - "merge_cols = [\n", - " \"Metadata_Plate\",\n", - " \"Metadata_Well\",\n", - " \"Metadata_Site\",\n", - " \"Metadata_ObjectNumber\",\n", - "]\n", - "\n", - "plate_mappingdf = pd.read_csv(plate_mapping_path, sep=\"\\t\")[\n", - " [\"Assay_Plate_Barcode\", \"Perturbation\"]\n", - "].rename(\n", - " columns={\n", - " \"Assay_Plate_Barcode\": \"Metadata_Plate\",\n", - " \"Perturbation\": \"Metadata_treatment_type\",\n", - " }\n", - ")\n", - "\n", - "anomdf = pd.concat(\n", - " [pd.read_parquet(path) for path in anomaly_dataset.rglob(\"*.parquet\")],\n", - " ignore_index=True,\n", - ")\n", - "\n", - "feature_importancesdf = []\n", - "\n", - "\n", - "anomdf = anomdf.loc[anomdf[\"Metadata_Plate\"] == args.plate]\n", - "\n", - "for control_type in [\"negcon\", \"anomalous\"]:\n", - "\n", - " # Negative controls shouldn't be anomalous and vice versa\n", - " if control_type == \"negcon\":\n", - " morphologydf = anomdf.loc[\n", - " (anomdf[\"Metadata_control_type\"] == \"negcon\")\n", - " & (anomdf[\"Metadata_Plate\"] == args.plate)\n", - " ].nlargest(num_control_samples, \"Result_anomaly_score\")\n", - "\n", - " else:\n", - " morphologydf = anomdf.loc[\n", - " (anomdf[\"Metadata_control_type\"] != \"negcon\")\n", - " & (anomdf[\"Metadata_Plate\"] == args.plate)\n", - " ].nsmallest(num_anomalous_samples, \"Result_anomaly_score\")\n", - "\n", - " platedf = pd.read_parquet(\n", - " list(morphology_dataset.rglob(f\"*{args.plate}*.parquet\"))[0]\n", - " )\n", - "\n", - " platedf = platedf[merge_cols + anomalyze_model.feature_names_in_.tolist()]\n", - "\n", - " morphologyonlydf = pd.merge(platedf, morphologydf, on=merge_cols, how=\"inner\")\n", - " meta_resultdf = morphologyonlydf.copy().filter(regex=\"Metadata|Result\")\n", - " morphologyonlydf = morphologyonlydf[anomalyze_model.feature_names_in_.tolist()]\n", - "\n", - " result = IsoforestFeatureImportance(\n", - " estimators=anomalyze_model.estimators_,\n", - " morphology_data=morphologyonlydf,\n", - " )()\n", - "\n", - " meta_resultdf = meta_resultdf[meta_resultdf.columns.difference(result.columns)]\n", - " result = result.join(meta_resultdf, how=\"inner\")\n", - "\n", - " feature_importancesdf.append(result)\n", - "\n", - "if feature_importancesdf:\n", - " pd.concat(feature_importancesdf, axis=0, ignore_index=True).to_parquet(\n", - " feature_importances_output_path / f\"{args.plate}.parquet\"\n", - " )" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "python", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/2.evaluate_data/compute_sc_anomalyze_data/compute_sc_anomaly_data.ipynb b/2.evaluate_data/compute_sc_anomalyze_data/compute_sc_anomaly_data.ipynb deleted file mode 100644 index 6771a4f..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/compute_sc_anomaly_data.ipynb +++ /dev/null @@ -1,114 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "GKSjBhrcro" - }, - "source": [ - "# Identify single cell anomalies\n", - "In this analysis we compute single-cell anomaly data with anomalyze" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "wWc5vokN2k" - }, - "source": [ - "import pathlib\n", - "import sys\n", - "\n", - "import joblib\n", - "import pandas as pd\n", - "import pyarrow as pa\n", - "import pyarrow.parquet as pq\n", - "from sklearn.ensemble import IsolationForest" - ], - "outputs": [], - "execution_count": null - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "az7X2tr2FY" - }, - "source": [ - "## Define inputs and outputs" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "0Qhap9H5Xu" - }, - "source": [ - "sc_data_path = pathlib.Path(sys.argv[1]).resolve(strict=True)\n", - "sc_data_dir_name = sc_data_path.parent.name\n", - "pq_file = pq.ParquetFile(sc_data_path)\n", - "\n", - "iso_forest = joblib.load(pathlib.Path(sys.argv[2]).resolve(strict=True))\n", - "iso_forest.n_jobs = -1\n", - "\n", - "anomaly_data_path = pathlib.Path(sys.argv[3]) / sc_data_dir_name / sc_data_path.stem\n", - "anomaly_data_path.mkdir(parents=True, exist_ok=True)" - ], - "outputs": [], - "execution_count": null - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "H1uMt6i3l6" - }, - "source": [ - "feat_cols = iso_forest.feature_names_in_" - ], - "outputs": [], - "execution_count": null - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "7oCv6Vb79n" - }, - "source": [ - "## Compute Anomaly Data" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "kD5WcVSvOF" - }, - "source": [ - "# Isolation forest reference:\n", - "# https://ieeexplore.ieee.org/document/4781136\n", - "# The data is batched here to reduce the memory burden\n", - "for i, batch in enumerate(pq_file.iter_batches(batch_size=220_000)):\n", - " pdf = batch.to_pandas()\n", - " pdf = pdf.assign(Result_inlier=iso_forest.predict(pdf[feat_cols]))\n", - " pdf = pdf.assign(Result_anomaly_score=iso_forest.decision_function(pdf[feat_cols]))\n", - " keep_cols = [col for col in pdf.columns if \"Metadata\" in col or \"Result\" in col]\n", - " pdf = pdf[keep_cols]\n", - "\n", - " pdf.sort_values(by=\"Result_anomaly_score\", ascending=True, inplace=True)\n", - "\n", - " output_path = anomaly_data_path / f\"{sc_data_path.stem}_anomaly_batch_{i}.parquet\"\n", - " pdf.to_parquet(output_path)" - ], - "outputs": [], - "execution_count": null - } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "display_name": "python", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/2.evaluate_data/compute_sc_anomalyze_data/compute_sc_feature_importance_data.ipynb b/2.evaluate_data/compute_sc_anomalyze_data/compute_sc_feature_importance_data.ipynb deleted file mode 100644 index 608d18d..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/compute_sc_feature_importance_data.ipynb +++ /dev/null @@ -1,224 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "1d540ded", - "metadata": { - "jukit_cell_id": "zkd21a2pL8" - }, - "source": [ - "# Compute Sample Feature Importances\n", - "This notebook computes feature importances by dataset, treatment type, and by anomaly score." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36c92f2b", - "metadata": { - "jukit_cell_id": "xvb69t4oBw" - }, - "outputs": [], - "source": [ - "import argparse\n", - "import pathlib\n", - "import subprocess\n", - "\n", - "import joblib\n", - "import pandas as pd" - ] - }, - { - "cell_type": "markdown", - "id": "d7de68ac", - "metadata": { - "jukit_cell_id": "KUHmTNEP3T" - }, - "source": [ - "## Find the root of the git repo on the host system" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9011ba94", - "metadata": { - "jukit_cell_id": "vdfNeby9lU" - }, - "outputs": [], - "source": [ - "# Get the current working directory\n", - "cwd = pathlib.Path.cwd()\n", - "\n", - "if (cwd / \".git\").is_dir():\n", - " root_dir = cwd\n", - "\n", - "else:\n", - " root_dir = None\n", - " for parent in cwd.parents:\n", - " if (parent / \".git\").is_dir():\n", - " root_dir = parent\n", - " break\n", - "\n", - "# Check if a Git root directory was found\n", - "if root_dir is None:\n", - " raise FileNotFoundError(\"No Git root directory found.\")" - ] - }, - { - "cell_type": "markdown", - "id": "7e867582", - "metadata": { - "jukit_cell_id": "f2OGzBqBg7" - }, - "source": [ - "# Inputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d7bd46eb", - "metadata": { - "jukit_cell_id": "FPf9gGwrq9" - }, - "outputs": [], - "source": [ - "# Replace with your data storage path here\n", - "parser = argparse.ArgumentParser()\n", - "parser.add_argument(\n", - " \"--big_drive_path\",\n", - " type=pathlib.Path,\n", - " default=root_dir / \"big_drive\",\n", - ")\n", - "args = parser.parse_args()\n", - "\n", - "big_drive_path = args.big_drive_path.resolve(strict=True)\n", - "anomaly_datasets_path = (big_drive_path / \"sc_anomaly_data\").resolve(strict=True)\n", - "\n", - "anomalyze_models_path = (big_drive_path / \"isolation_forest_models\").resolve(\n", - " strict=True\n", - ")\n", - "\n", - "plate_mapping_path = (\n", - " root_dir / \"reference_plate_data/experiment-metadata.tsv\"\n", - ").resolve(strict=True)\n", - "\n", - "plate_mappingdf = pd.read_csv(plate_mapping_path, sep=\"\\t\")[\n", - " [\"Assay_Plate_Barcode\", \"Perturbation\"]\n", - "]" - ] - }, - { - "cell_type": "markdown", - "id": "197fbef5", - "metadata": { - "jukit_cell_id": "Pd6iLYYRnQ" - }, - "source": [ - "# Outputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5da0c52a", - "metadata": { - "jukit_cell_id": "VxPBfErVQM" - }, - "outputs": [], - "source": [ - "feature_importances_path = big_drive_path / \"sc_anomaly_feature_importances\"" - ] - }, - { - "cell_type": "markdown", - "id": "9fc71686", - "metadata": { - "jukit_cell_id": "6QWRMmeVMb" - }, - "source": [ - "# Sample and Compute Feature feature_importances\n", - "Sample and then compute feature importances between the most anomalous and the least anomalous cells." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b0b1cf72", - "metadata": { - "jukit_cell_id": "Eom828CFqT" - }, - "outputs": [], - "source": [ - "model_suffix_name = \"_isolation_forest.joblib\"\n", - "merge_cols = [\n", - " \"Metadata_Plate\",\n", - " \"Metadata_Well\",\n", - " \"Metadata_Site\",\n", - " \"Metadata_ObjectNumber\",\n", - "]\n", - "\n", - "plate_mappingdf.rename(\n", - " columns={\n", - " \"Assay_Plate_Barcode\": \"Metadata_Plate\",\n", - " \"Perturbation\": \"Metadata_treatment_type\",\n", - " },\n", - " inplace=True,\n", - ")\n", - "\n", - "for anomaly_dataset in anomaly_datasets_path.iterdir():\n", - " if \"normal\" in anomaly_dataset.stem or \"qc\" not in anomaly_dataset.stem:\n", - " continue\n", - "\n", - " anomaly_model_name = anomaly_dataset.stem + model_suffix_name\n", - " morphology_dataset = big_drive_path / f\"{anomaly_dataset.stem}\"\n", - " anomalyze_model = joblib.load(anomalyze_models_path / anomaly_model_name)\n", - "\n", - " anomaly_paths = list(anomaly_dataset.rglob(\"*.parquet\"))\n", - "\n", - " anomdf = pd.concat(\n", - " [pd.read_parquet(path) for path in anomaly_paths], ignore_index=True\n", - " )\n", - "\n", - " anomdf = pd.merge(\n", - " left=anomdf, right=plate_mappingdf, on=\"Metadata_Plate\", how=\"inner\"\n", - " )\n", - "\n", - " dataset_feature_importances_path = feature_importances_path / anomaly_dataset.stem\n", - "\n", - " for treatment_type_name, treatment_typedf in anomdf.groupby(\n", - " \"Metadata_treatment_type\"\n", - " ):\n", - " treatment_feature_importances_path = (\n", - " dataset_feature_importances_path / treatment_type_name\n", - " )\n", - " treatment_feature_importances_path.mkdir(parents=True, exist_ok=True)\n", - "\n", - " # Morphology parquet files are separated by plate in each dataset\n", - " for plate in treatment_typedf[\"Metadata_Plate\"].unique():\n", - " subprocess.run(\n", - " [\n", - " \"run_feature_importance.sh\",\n", - " str(anomaly_dataset),\n", - " plate,\n", - " str(morphology_dataset),\n", - " str(anomalyze_models_path / anomaly_model_name),\n", - " str(treatment_feature_importances_path),\n", - " str(plate_mapping_path),\n", - " \"nbconverted/compute_feature_importance_by_plate.py\",\n", - " ]\n", - " )" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "python", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_aggregate_treatment_anomaly_data.py b/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_aggregate_treatment_anomaly_data.py deleted file mode 100644 index 33c1dc2..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_aggregate_treatment_anomaly_data.py +++ /dev/null @@ -1,175 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# # Compute Aggregate Treatment Anomaly Data -# Anomalies data are computed by first -# aggregating cellprofiler data to the treatment level, and -# then computing anomaly scores and feature importances for later analysis. - -# In[ ]: - - -import argparse -import pathlib -import sys - -import numpy as np -import pandas as pd -from sklearn.ensemble import IsolationForest - -# Get the current working directory -cwd = pathlib.Path.cwd() - -if (cwd / ".git").is_dir(): - root_dir = cwd - -else: - root_dir = None - for parent in cwd.parents: - if (parent / ".git").is_dir(): - root_dir = parent - break - -# Check if a Git root directory was found -if root_dir is None: - raise FileNotFoundError("No Git root directory found.") - -sys.path.append(str((root_dir / "2.evaluate_data" / "utils").resolve(strict=True))) -from isolation_forest_data_feature_importance import IsoforestFeatureImportance - - -# ## Inputs - -# In[ ]: - - -parser = argparse.ArgumentParser() -parser.add_argument( - "--big_drive_path", - type=pathlib.Path, - default=root_dir / "big_drive", -) -args = parser.parse_args() - -big_drive_path = args.big_drive_path.resolve(strict=True) - -agg_anomaly_data_path = (big_drive_path / "feature_selected_sc_qc_data").resolve( - strict=True -) - -exp_meta = pd.read_csv( - root_dir / "reference_plate_data/experiment-metadata.tsv", sep="\t" -) - - -# ## Outputs - -# In[ ]: - - -agg_treatment_anomaly_data_path = big_drive_path / "aggregated_anomaly_data" -agg_treatment_anomaly_data_path.mkdir(parents=True, exist_ok=True) - - -# ## Aggregate Treatment Profiles per Plate -# Aggregate the treatment profiles per plate, while -# also taking the intersection of all cellprofiler features. - -# In[ ]: - - -exp_meta = exp_meta[["Assay_Plate_Barcode", "Perturbation"]] -agg_platedf = [] -feat_cols = set() - -for sc_plate_path in agg_anomaly_data_path.rglob("*.parquet"): - morph_platedf = pd.read_parquet(sc_plate_path) - treatment_col_name = "Metadata_pert_iname" - plate_feat_cols = set( - morph_platedf.columns[~morph_platedf.columns.str.contains("Metadata")].tolist() - ) - - if not feat_cols: - feat_cols = plate_feat_cols - else: - feat_cols &= plate_feat_cols - - agg_mapping = dict.fromkeys(feat_cols, "mean") - morph_platedf = morph_platedf.assign(Metadata_group_size=1) - agg_mapping |= {"Metadata_group_size": "size", "Metadata_Plate": "first"} - - # Distinguish between Compound data and (CRISPR and ORF data) - if not np.any(morph_platedf.columns.str.contains(treatment_col_name)): - treatment_col_name = "Metadata_gene" - - agg_platedf.append( - morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index() - ) - -feat_cols = sorted(feat_cols) -agg_platedf = pd.concat(agg_platedf, axis=0) - - -# ## Merging and Processing Aggregate Profiles - -# In[ ]: - - -agg_platedf = pd.merge( - exp_meta, - agg_platedf, - how="inner", - left_on="Assay_Plate_Barcode", - right_on="Metadata_Plate", -) - -agg_platedf = agg_platedf.rename(columns={"Perturbation": "Metadata_Perturbation"}) - - -# ## Compute Aggregate Feature Importances -# Isolation forest reference: -# https://ieeexplore.ieee.org/document/4781136 - -# In[ ]: - - -isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1) -agg_platedf = agg_platedf.assign( - **{ - "Result_inlier": isofor.fit_predict(agg_platedf[feat_cols]), - "Result_anomaly_score": isofor.decision_function(agg_platedf[feat_cols]), - } -) - - -# ## Compute Feature Importances -# Computes sample feature importances using anomaly scores per sample. - -# In[ ]: - - -metadf = agg_platedf.copy().filter(regex="Metadata|Result") - -result = IsoforestFeatureImportance( - estimators=isofor.estimators_, - morphology_data=agg_platedf[list(isofor.feature_names_in_)], -)() - - -# ## Combine Anomaly Results and Metadata - -# In[ ]: - - -meta_resultdf = metadf[metadf.columns.difference(result.columns)] -result = result.join(meta_resultdf, how="inner") - - -# ## Save Aggregate Anomaly Data - -# In[ ]: - - -result.to_parquet( - agg_treatment_anomaly_data_path / "aggregated_treatment_anomaly_data.parquet" -) diff --git a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_feature_importance_by_plate.py b/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_feature_importance_by_plate.py deleted file mode 100644 index d76d12f..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_feature_importance_by_plate.py +++ /dev/null @@ -1,134 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# # Compute Plate Feature Importances -# Created to prevent memory leakage while computing feature importances by calling this script from a bash script - -# In[ ]: - - -import argparse -import pathlib -import sys - -import joblib -import pandas as pd - -# Get the current working directory -cwd = pathlib.Path.cwd() - -if (cwd / ".git").is_dir(): - root_dir = cwd - -else: - root_dir = None - for parent in cwd.parents: - if (parent / ".git").is_dir(): - root_dir = parent - break - -# Check if a Git root directory was found -if root_dir is None: - raise FileNotFoundError("No Git root directory found.") - -sys.path.append(str((root_dir / "2.evaluate_data" / "utils").resolve(strict=True))) -from isolation_forest_data_feature_importance import IsoforestFeatureImportance - - -# # Inputs and Outputs - -# In[ ]: - - -parser = argparse.ArgumentParser() -parser.add_argument("--anomaly_dataset_path", type=str, required=True) -parser.add_argument("--plate", type=str, required=True) -parser.add_argument("--morphology_dataset_path", type=str, required=True) -parser.add_argument("--model_path", type=str, required=True) -parser.add_argument("--feature_importances_output_path", type=str, required=True) -parser.add_argument("--plate_mapping_path", type=str, required=True) - -args = parser.parse_args() - -anomaly_dataset = pathlib.Path(args.anomaly_dataset_path) -morphology_dataset = pathlib.Path(args.morphology_dataset_path) -model_path = pathlib.Path(args.model_path) -feature_importances_output_path = pathlib.Path(args.feature_importances_output_path) -plate_mapping_path = pathlib.Path(args.plate_mapping_path) -anomalyze_model = joblib.load(model_path) - - -# # Compute Feature Importances - -# In[ ]: - - -# Number of samples to compute feature importances for in each category (anomalous or inlier) -num_control_samples = 300 -num_anomalous_samples = 300 - -merge_cols = [ - "Metadata_Plate", - "Metadata_Well", - "Metadata_Site", - "Metadata_ObjectNumber", -] - -plate_mappingdf = pd.read_csv(plate_mapping_path, sep="\t")[ - ["Assay_Plate_Barcode", "Perturbation"] -].rename( - columns={ - "Assay_Plate_Barcode": "Metadata_Plate", - "Perturbation": "Metadata_treatment_type", - } -) - -anomdf = pd.concat( - [pd.read_parquet(path) for path in anomaly_dataset.rglob("*.parquet")], - ignore_index=True, -) - -feature_importancesdf = [] - - -anomdf = anomdf.loc[anomdf["Metadata_Plate"] == args.plate] - -for control_type in ["negcon", "anomalous"]: - - # Negative controls shouldn't be anomalous and vice versa - if control_type == "negcon": - morphologydf = anomdf.loc[ - (anomdf["Metadata_control_type"] == "negcon") - & (anomdf["Metadata_Plate"] == args.plate) - ].nlargest(num_control_samples, "Result_anomaly_score") - - else: - morphologydf = anomdf.loc[ - (anomdf["Metadata_control_type"] != "negcon") - & (anomdf["Metadata_Plate"] == args.plate) - ].nsmallest(num_anomalous_samples, "Result_anomaly_score") - - platedf = pd.read_parquet( - list(morphology_dataset.rglob(f"*{args.plate}*.parquet"))[0] - ) - - platedf = platedf[merge_cols + anomalyze_model.feature_names_in_.tolist()] - - morphologyonlydf = pd.merge(platedf, morphologydf, on=merge_cols, how="inner") - meta_resultdf = morphologyonlydf.copy().filter(regex="Metadata|Result") - morphologyonlydf = morphologyonlydf[anomalyze_model.feature_names_in_.tolist()] - - result = IsoforestFeatureImportance( - estimators=anomalyze_model.estimators_, - morphology_data=morphologyonlydf, - )() - - meta_resultdf = meta_resultdf[meta_resultdf.columns.difference(result.columns)] - result = result.join(meta_resultdf, how="inner") - - feature_importancesdf.append(result) - -if feature_importancesdf: - pd.concat(feature_importancesdf, axis=0, ignore_index=True).to_parquet( - feature_importances_output_path / f"{args.plate}.parquet" - ) diff --git a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_sc_anomaly_data.py b/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_sc_anomaly_data.py deleted file mode 100755 index 7baf021..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_sc_anomaly_data.py +++ /dev/null @@ -1,61 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# # Identify single cell anomalies -# In this analysis we compute single-cell anomaly data with anomalyze - -# In[ ]: - - -import pathlib -import sys - -import joblib -import pandas as pd -import pyarrow as pa -import pyarrow.parquet as pq -from sklearn.ensemble import IsolationForest - - -# ## Define inputs and outputs - -# In[ ]: - - -sc_data_path = pathlib.Path(sys.argv[1]).resolve(strict=True) -sc_data_dir_name = sc_data_path.parent.name -pq_file = pq.ParquetFile(sc_data_path) - -iso_forest = joblib.load(pathlib.Path(sys.argv[2]).resolve(strict=True)) -iso_forest.n_jobs = -1 - -anomaly_data_path = pathlib.Path(sys.argv[3]) / sc_data_dir_name / sc_data_path.stem -anomaly_data_path.mkdir(parents=True, exist_ok=True) - - -# In[ ]: - - -feat_cols = iso_forest.feature_names_in_ - - -# ## Compute Anomaly Data - -# In[ ]: - - -# Isolation forest reference: -# https://ieeexplore.ieee.org/document/4781136 -# The data is batched here to reduce the memory burden -for i, batch in enumerate(pq_file.iter_batches(batch_size=220_000)): - pdf = batch.to_pandas() - pdf = pdf.assign(Result_inlier=iso_forest.predict(pdf[feat_cols])) - pdf = pdf.assign(Result_anomaly_score=iso_forest.decision_function(pdf[feat_cols])) - keep_cols = [col for col in pdf.columns if "Metadata" in col or "Result" in col] - pdf = pdf[keep_cols] - - pdf.sort_values(by="Result_anomaly_score", ascending=True, inplace=True) - - output_path = anomaly_data_path / f"{sc_data_path.stem}_anomaly_batch_{i}.parquet" - pdf.to_parquet(output_path) - diff --git a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_sc_feature_importance_data.py b/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_sc_feature_importance_data.py deleted file mode 100644 index 172ad32..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_sc_feature_importance_data.py +++ /dev/null @@ -1,142 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# # Compute Sample Feature Importances -# This notebook computes feature importances by dataset, treatment type, and by anomaly score. - -# In[ ]: - - -import argparse -import pathlib -import subprocess - -import joblib -import pandas as pd - - -# ## Find the root of the git repo on the host system - -# In[ ]: - - -# Get the current working directory -cwd = pathlib.Path.cwd() - -if (cwd / ".git").is_dir(): - root_dir = cwd - -else: - root_dir = None - for parent in cwd.parents: - if (parent / ".git").is_dir(): - root_dir = parent - break - -# Check if a Git root directory was found -if root_dir is None: - raise FileNotFoundError("No Git root directory found.") - -parser = argparse.ArgumentParser() -parser.add_argument( - "--big_drive_path", - type=pathlib.Path, - default=root_dir / "big_drive", -) -args = parser.parse_args() - - -# # Inputs - -# In[ ]: - - -# Replace with your data storage path here -big_drive_path = args.big_drive_path.resolve(strict=True) -anomaly_datasets_path = (big_drive_path / "sc_anomaly_data").resolve(strict=True) - -anomalyze_models_path = (big_drive_path / "isolation_forest_models").resolve( - strict=True -) - -plate_mapping_path = ( - root_dir / "reference_plate_data/experiment-metadata.tsv" -).resolve(strict=True) - -plate_mappingdf = pd.read_csv(plate_mapping_path, sep="\t")[ - ["Assay_Plate_Barcode", "Perturbation"] -] - - -# # Outputs - -# In[ ]: - - -feature_importances_path = big_drive_path / "sc_anomaly_feature_importances" - - -# # Sample and Compute Feature feature_importances -# Sample and then compute feature importances between the most anomalous and the least anomalous cells. - -# In[ ]: - - -model_suffix_name = "_isolation_forest.joblib" -merge_cols = [ - "Metadata_Plate", - "Metadata_Well", - "Metadata_Site", - "Metadata_ObjectNumber", -] - -plate_mappingdf.rename( - columns={ - "Assay_Plate_Barcode": "Metadata_Plate", - "Perturbation": "Metadata_treatment_type", - }, - inplace=True, -) - -for anomaly_dataset in anomaly_datasets_path.iterdir(): - if "normal" in anomaly_dataset.stem or "qc" not in anomaly_dataset.stem: - continue - - anomaly_model_name = anomaly_dataset.stem + model_suffix_name - morphology_dataset = big_drive_path / f"{anomaly_dataset.stem}" - anomalyze_model = joblib.load(anomalyze_models_path / anomaly_model_name) - - anomaly_paths = list(anomaly_dataset.rglob("*.parquet")) - - anomdf = pd.concat( - [pd.read_parquet(path) for path in anomaly_paths], ignore_index=True - ) - - anomdf = pd.merge( - left=anomdf, right=plate_mappingdf, on="Metadata_Plate", how="inner" - ) - - dataset_feature_importances_path = feature_importances_path / anomaly_dataset.stem - - for treatment_type_name, treatment_typedf in anomdf.groupby( - "Metadata_treatment_type" - ): - treatment_feature_importances_path = ( - dataset_feature_importances_path / treatment_type_name - ) - treatment_feature_importances_path.mkdir(parents=True, exist_ok=True) - - # Morphology parquet files are separated by plate in each dataset - for plate in treatment_typedf["Metadata_Plate"].unique(): - subprocess.run( - [ - "run_feature_importance.sh", - str(anomaly_dataset), - plate, - str(morphology_dataset), - str(anomalyze_models_path / anomaly_model_name), - str(treatment_feature_importances_path), - str(plate_mapping_path), - "nbconverted/compute_feature_importance_by_plate.py", - ] - ) diff --git a/2.evaluate_data/compute_sc_anomalyze_data/run_feature_importance.sh b/2.evaluate_data/compute_sc_anomalyze_data/run_feature_importance.sh deleted file mode 100755 index 3835467..0000000 --- a/2.evaluate_data/compute_sc_anomalyze_data/run_feature_importance.sh +++ /dev/null @@ -1,19 +0,0 @@ -#!/bin/bash - -set -euo pipefail - -ANOMALY_DATASET_PATH="$1" -PLATE_NAME="$2" -MORPHOLOGY_DATASET_PATH="$3" -MODEL_PATH="$4" -FEATURE_IMPORTANCE_OUTPUT_PATH="$5" -PLATE_MAPPING_PATH="$6" -PYTHON_SCRIPT="$7" - -python3 "$PYTHON_SCRIPT" \ - --anomaly_dataset_path "$ANOMALY_DATASET_PATH" \ - --plate "$PLATE_NAME" \ - --morphology_dataset_path "$MORPHOLOGY_DATASET_PATH" \ - --model_path "$MODEL_PATH" \ - --feature_importances_output_path "$FEATURE_IMPORTANCE_OUTPUT_PATH" \ - --plate_mapping_path "$PLATE_MAPPING_PATH" diff --git a/2.evaluate_data/train_sc_anomalyze_models/README.md b/2.evaluate_data/train_sc_anomalyze_models/README.md deleted file mode 100644 index 47e2acf..0000000 --- a/2.evaluate_data/train_sc_anomalyze_models/README.md +++ /dev/null @@ -1,7 +0,0 @@ -# Identify single-cell anomalies -This analysis identifies JUMP single-cell anomalies after sampling and training processed JUMP data. -Anomalyze is trained on 4 combinations of processed data: -- Normalized single-cell QC'd data -- Normalized single-cell non-QC'd data -- Normalized and feature-selected single-cell QC'd data -- Normalized and feature-selected single-cell non-QC'd data diff --git a/2.evaluate_data/train_sc_anomalyze_models/identify_anomalous_single_cells_fs.ipynb b/2.evaluate_data/train_sc_anomalyze_models/identify_anomalous_single_cells_fs.ipynb deleted file mode 100644 index b837017..0000000 --- a/2.evaluate_data/train_sc_anomalyze_models/identify_anomalous_single_cells_fs.ipynb +++ /dev/null @@ -1,155 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "EzUt4Twg5N" - }, - "source": [ - "# Identify Single Cell Anomalies\n", - "Here we train Anomalyze models on different pre-computed datasets." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "OvtZf6TNyh" - }, - "source": [ - "### Import Libraries" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "QuNHIYPoOl" - }, - "source": [ - "import json\n", - "import pathlib\n", - "import sys\n", - "\n", - "import joblib\n", - "import pandas as pd\n", - "import pyarrow.parquet as pq\n", - "from sklearn.ensemble import IsolationForest" - ], - "outputs": [], - "execution_count": null - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "rHQcp3lptv" - }, - "source": [ - "## Define paths" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "3MslTnvRwY" - }, - "source": [ - "### Inputs" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "u0I43b5xFw" - }, - "source": [ - "plate_data_path = pathlib.Path(sys.argv[1])\n", - "plate_data_name = plate_data_path.name\n", - "sampled_plate_jump_data_path = sys.argv[2]\n", - "\n", - "sampled_platedf = pd.read_parquet(\n", - " f\"{sampled_plate_jump_data_path}/{plate_data_name}.parquet\"\n", - ")\n", - "\n", - "feature_columns_path = (\n", - " pathlib.Path(sys.argv[3]) / plate_data_name / \"feature_columns.json\"\n", - ").resolve(strict=True)\n", - "\n", - "with feature_columns_path.open(\"r\") as feat_cols_obj:\n", - " feat_cols = json.load(feat_cols_obj)" - ], - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": "\u001b[0;31m--------------------------------------------------------------\u001b[0m\n\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)\nCell \u001b[0;32mIn[7], line 4\u001b[0m\n\u001b[1;32m 1\u001b[0m big_drive_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mroot_dir\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m/big_drive\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# Plate morphology data\u001b[39;00m\n\u001b[0;32m----> 4\u001b[0m plate_paths \u001b[38;5;241m=\u001b[39m \u001b[43mpathlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPath\u001b[49m\u001b[43m(\u001b[49m\u001b[43msys\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43margv\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mresolve\u001b[49m\u001b[43m(\u001b[49m\u001b[43mstrict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n\u001b[1;32m 6\u001b[0m \u001b[38;5;66;03m# Boolean flag for if the data is single-cell\u001b[39;00m\n\u001b[1;32m 7\u001b[0m is_sc \u001b[38;5;241m=\u001b[39m sys\u001b[38;5;241m.\u001b[39margv[\u001b[38;5;241m2\u001b[39m]\u001b[38;5;241m.\u001b[39mlower() \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtrue\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:1215\u001b[0m, in \u001b[0;36mPath.resolve\u001b[0;34m(self, strict)\u001b[0m\n\u001b[1;32m 1209\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mresolve\u001b[39m(\u001b[38;5;28mself\u001b[39m, strict\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m):\n\u001b[1;32m 1210\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 1211\u001b[0m \u001b[38;5;124;03m Make the path absolute, resolving all symlinks on the way and also\u001b[39;00m\n\u001b[1;32m 1212\u001b[0m \u001b[38;5;124;03m normalizing it (for example turning slashes into backslashes under\u001b[39;00m\n\u001b[1;32m 1213\u001b[0m \u001b[38;5;124;03m Windows).\u001b[39;00m\n\u001b[1;32m 1214\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m-> 1215\u001b[0m s \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_flavour\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mresolve\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstrict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstrict\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1216\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m s \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1217\u001b[0m \u001b[38;5;66;03m# No symlink resolution => for consistency, raise an error if\u001b[39;00m\n\u001b[1;32m 1218\u001b[0m \u001b[38;5;66;03m# the path doesn't exist or is forbidden\u001b[39;00m\n\u001b[1;32m 1219\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstat()\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:373\u001b[0m, in \u001b[0;36m_PosixFlavour.resolve\u001b[0;34m(self, path, strict)\u001b[0m\n\u001b[1;32m 370\u001b[0m \u001b[38;5;66;03m# NOTE: according to POSIX, getcwd() cannot contain path components\u001b[39;00m\n\u001b[1;32m 371\u001b[0m \u001b[38;5;66;03m# which are symlinks.\u001b[39;00m\n\u001b[1;32m 372\u001b[0m base \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m'\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m path\u001b[38;5;241m.\u001b[39mis_absolute() \u001b[38;5;28;01melse\u001b[39;00m os\u001b[38;5;241m.\u001b[39mgetcwd()\n\u001b[0;32m--> 373\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_resolve\u001b[49m\u001b[43m(\u001b[49m\u001b[43mbase\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mstr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;129;01mor\u001b[39;00m sep\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:357\u001b[0m, in \u001b[0;36m_PosixFlavour.resolve.._resolve\u001b[0;34m(path, rest)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[38;5;66;03m# Resolve the symbolic link\u001b[39;00m\n\u001b[1;32m 356\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 357\u001b[0m target \u001b[38;5;241m=\u001b[39m \u001b[43maccessor\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mreadlink\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnewpath\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 358\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 359\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m e\u001b[38;5;241m.\u001b[39merrno \u001b[38;5;241m!=\u001b[39m EINVAL \u001b[38;5;129;01mand\u001b[39;00m strict:\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:462\u001b[0m, in \u001b[0;36m_NormalAccessor.readlink\u001b[0;34m(self, path)\u001b[0m\n\u001b[1;32m 461\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mreadlink\u001b[39m(\u001b[38;5;28mself\u001b[39m, path):\n\u001b[0;32m--> 462\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mos\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mreadlink\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m)\u001b[49m\n\n\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/home/camo/projects/JUMP-single-cell/2.evaluate_data/identify_sc_outlier_per_plate/-i'\n" - } - ], - "execution_count": 1 - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "veVF0pgtq2" - }, - "source": [ - "### Outputs" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "HeBnkGEuZf" - }, - "source": [ - "isoforest_path = pathlib.Path(sys.argv[4])\n", - "isoforest_path.mkdir(parents=True, exist_ok=True)\n", - "\n", - "isoforest_path = pathlib.Path(\n", - " isoforest_path / f\"{plate_data_name}_isolation_forest.joblib\"\n", - ")" - ], - "outputs": [], - "execution_count": null - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "Od3Z4Ripve" - }, - "source": [ - "## Train Anomalyze Models" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "LxCzoc8wlD" - }, - "source": [ - "meta_cols = [col for col in sampled_platedf.columns if \"Metadata\" in col]\n", - "featdf = sampled_platedf[feat_cols]\n", - "\n", - "# If 800 trees are trained with 256 samples per tree, then\n", - "# 800 * 256 gives approximately the expected number of samples per isolation forest.\n", - "# For some of the plate data, this number of samples can barely fit in memory.\n", - "# We also want to maximize the number of trees to learn many patterns for identifying anomalies.\n", - "# 256 is empirically the largest number of samples per tree that allowed outliers to be isolated better.\n", - "isofor = IsolationForest(n_estimators=800, random_state=0, n_jobs=-1)\n", - "isofor.fit(featdf)\n", - "\n", - "joblib.dump(isofor, isoforest_path)" - ], - "outputs": [], - "execution_count": null - } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "display_name": "python", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/feature_selected_sc_data_isolation_forest.joblib b/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/feature_selected_sc_data_isolation_forest.joblib deleted file mode 100644 index 42c17ef..0000000 Binary files a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/feature_selected_sc_data_isolation_forest.joblib and /dev/null differ diff --git a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/feature_selected_sc_qc_data_isolation_forest.joblib b/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/feature_selected_sc_qc_data_isolation_forest.joblib deleted file mode 100644 index fd26d4a..0000000 Binary files a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/feature_selected_sc_qc_data_isolation_forest.joblib and /dev/null differ diff --git a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/normalized_sc_data_isolation_forest.joblib b/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/normalized_sc_data_isolation_forest.joblib deleted file mode 100644 index d29ca07..0000000 Binary files a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/normalized_sc_data_isolation_forest.joblib and /dev/null differ diff --git a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/normalized_sc_qc_data_isolation_forest.joblib b/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/normalized_sc_qc_data_isolation_forest.joblib deleted file mode 100644 index 22d5c26..0000000 Binary files a/2.evaluate_data/train_sc_anomalyze_models/isolation_forest_models/normalized_sc_qc_data_isolation_forest.joblib and /dev/null differ diff --git a/2.evaluate_data/train_sc_anomalyze_models/nbconverted/identify_anomalous_single_cells_fs.py b/2.evaluate_data/train_sc_anomalyze_models/nbconverted/identify_anomalous_single_cells_fs.py deleted file mode 100644 index 779dd6b..0000000 --- a/2.evaluate_data/train_sc_anomalyze_models/nbconverted/identify_anomalous_single_cells_fs.py +++ /dev/null @@ -1,75 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# # Identify Single Cell Anomalies -# Here we train Anomalyze models on different pre-computed datasets. - -# ### Import Libraries - -# In[ ]: - - -import json -import pathlib -import sys - -import joblib -import pandas as pd -import pyarrow.parquet as pq -from sklearn.ensemble import IsolationForest - - -# ## Define paths - -# ### Inputs - -# In[1]: - - -plate_data_path = pathlib.Path(sys.argv[1]) -plate_data_name = plate_data_path.name -sampled_plate_jump_data_path = sys.argv[2] - -sampled_platedf = pd.read_parquet( - f"{sampled_plate_jump_data_path}/{plate_data_name}.parquet" -) - -feature_columns_path = ( - pathlib.Path(sys.argv[3]) / plate_data_name / "feature_columns.json" -).resolve(strict=True) - -with feature_columns_path.open("r") as feat_cols_obj: - feat_cols = json.load(feat_cols_obj) - - -# ### Outputs - -# In[ ]: - - -isoforest_path = pathlib.Path(sys.argv[4]) -isoforest_path.mkdir(parents=True, exist_ok=True) - -isoforest_path = pathlib.Path( - isoforest_path / f"{plate_data_name}_isolation_forest.joblib" -) - - -# ## Train Anomalyze Models - -# In[ ]: - - -meta_cols = [col for col in sampled_platedf.columns if "Metadata" in col] -featdf = sampled_platedf[feat_cols] - -# If 800 trees are trained with 256 samples per tree, then -# 800 * 256 gives approximately the expected number of samples per isolation forest. -# For some of the plate data, this number of samples can barely fit in memory. -# We also want to maximize the number of trees to learn many patterns for identifying anomalies. -# 256 is empirically the largest number of samples per tree that allowed outliers to be isolated better. -isofor = IsolationForest(n_estimators=800, random_state=0, n_jobs=-1) -isofor.fit(featdf) - -joblib.dump(isofor, isoforest_path) - diff --git a/2.evaluate_data/train_sc_anomalyze_models/nbconverted/sample_anomalous_single_cells_fs.py b/2.evaluate_data/train_sc_anomalyze_models/nbconverted/sample_anomalous_single_cells_fs.py deleted file mode 100644 index 87cdaae..0000000 --- a/2.evaluate_data/train_sc_anomalyze_models/nbconverted/sample_anomalous_single_cells_fs.py +++ /dev/null @@ -1,103 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# # Sample data from each well -# Deterministically sample data from each plate dataset and get the intersection of all features not containing nans. - -# ### Import Libraries - -# In[ ]: - - -import json -import pathlib -import sys - -import pandas as pd - -sys.path.append(str(pathlib.Path.cwd().parent / "utils")) -from DeterministicSampling import DeterministicSampling - - -# ## Define paths - -# ### Inputs - -# In[1]: - - -# Plate morphology data -sc_data_path = pathlib.Path(sys.argv[1]).resolve(strict=True) -sc_data_dir_name = sc_data_path.parent.name -plate_path = pathlib.Path(sc_data_path).resolve(strict=True) -platedf = pd.read_parquet(plate_path) - -sampled_plate_jump_path = pathlib.Path(sys.argv[2]) - -anomaly_data_path = pathlib.Path(sys.argv[3]) / sc_data_dir_name -anomaly_data_path.mkdir(parents=True, exist_ok=True) -feat_col_path = anomaly_data_path / "feature_columns.json" - - -# ### Outputs - -# In[2]: - - -plate_data_path = sampled_plate_jump_path / f"{plate_path.parent.name}.parquet" - - -# ## Save feature columns that don't contain NaNs - -# In[ ]: - - -feat_cols = [] -non_na_cols = platedf.columns[platedf.notna().all()].tolist() -non_na_cols = [col for col in non_na_cols if not "Metadata" in col] - -if feat_col_path.exists(): - with feat_col_path.open("r") as feat_col_obj: - feat_cols = json.load(feat_col_obj) - feat_cols = list(set(non_na_cols) & set(feat_cols)) - -else: - feat_cols = non_na_cols - -with feat_col_path.open("w") as feat_col_obj: - json.dump(feat_cols, feat_col_obj) - - -# ## Sample single-cell data by Hash - -# In[3]: - - -ds = DeterministicSampling( - _platedf=platedf, - # Number of samples per plate needed to train the JUMP isolation forests - # See identify_anomalous_single_cells_fs.py for more details - # Only 4_000 are needed, however each sampling will likely not be exactly 4_000 samples - # due to the probabilistic nature of hash-based sampling. - # To ensure a sufficient sample size I increased the threshold. - _samples_per_plate=4_100, - _plate_column="Metadata_Plate", - _well_column="Metadata_Well", - _cell_id_columns=["Metadata_Site", "Metadata_ObjectNumber"], -) - -sampled_platedf = ds.sample_plate_deterministically(_sample_strategy="well_sampling") - - -# ## Concatenate and Save Sampled Plate data - -# In[4]: - - -if plate_data_path.exists(): - sampled_platedf = pd.concat( - [sampled_platedf, pd.read_parquet(plate_data_path)], axis=0 - ) - -sampled_platedf.to_parquet(plate_data_path) - diff --git a/2.evaluate_data/train_sc_anomalyze_models/sample_anomalous_single_cells_fs.ipynb b/2.evaluate_data/train_sc_anomalyze_models/sample_anomalous_single_cells_fs.ipynb deleted file mode 100644 index d49a528..0000000 --- a/2.evaluate_data/train_sc_anomalyze_models/sample_anomalous_single_cells_fs.ipynb +++ /dev/null @@ -1,225 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "EzUt4Twg5N" - }, - "source": [ - "# Sample data from each well\n", - "Deterministically sample data from each plate dataset and get the intersection of all features not containing nans." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "OvtZf6TNyh" - }, - "source": [ - "### Import Libraries" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "QuNHIYPoOl" - }, - "source": [ - "import json\n", - "import pathlib\n", - "import sys\n", - "\n", - "import pandas as pd\n", - "\n", - "sys.path.append(str(pathlib.Path.cwd().parent / \"utils\"))\n", - "from DeterministicSampling import DeterministicSampling" - ], - "outputs": [], - "execution_count": null - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "rHQcp3lptv" - }, - "source": [ - "## Define paths" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "3MslTnvRwY" - }, - "source": [ - "### Inputs" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "u0I43b5xFw" - }, - "source": [ - "# Plate morphology data\n", - "sc_data_path = pathlib.Path(sys.argv[1]).resolve(strict=True)\n", - "sc_data_dir_name = sc_data_path.parent.name\n", - "plate_path = pathlib.Path(sc_data_path).resolve(strict=True)\n", - "platedf = pd.read_parquet(plate_path)\n", - "\n", - "sampled_plate_jump_path = pathlib.Path(sys.argv[2])\n", - "\n", - "anomaly_data_path = pathlib.Path(sys.argv[3]) / sc_data_dir_name\n", - "anomaly_data_path.mkdir(parents=True, exist_ok=True)\n", - "feat_col_path = anomaly_data_path / \"feature_columns.json\"" - ], - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": "\u001b[0;31m--------------------------------------------------------------\u001b[0m\n\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)\nCell \u001b[0;32mIn[5], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Plate morphology data\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m plate_path \u001b[38;5;241m=\u001b[39m \u001b[43mpathlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPath\u001b[49m\u001b[43m(\u001b[49m\u001b[43msys\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43margv\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mresolve\u001b[49m\u001b[43m(\u001b[49m\u001b[43mstrict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n\u001b[1;32m 3\u001b[0m platedf \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_parquet(plate_path)\n\u001b[1;32m 5\u001b[0m \u001b[38;5;66;03m# Boolean flag for if the data is single-cell\u001b[39;00m\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:1215\u001b[0m, in \u001b[0;36mPath.resolve\u001b[0;34m(self, strict)\u001b[0m\n\u001b[1;32m 1209\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mresolve\u001b[39m(\u001b[38;5;28mself\u001b[39m, strict\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m):\n\u001b[1;32m 1210\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 1211\u001b[0m \u001b[38;5;124;03m Make the path absolute, resolving all symlinks on the way and also\u001b[39;00m\n\u001b[1;32m 1212\u001b[0m \u001b[38;5;124;03m normalizing it (for example turning slashes into backslashes under\u001b[39;00m\n\u001b[1;32m 1213\u001b[0m \u001b[38;5;124;03m Windows).\u001b[39;00m\n\u001b[1;32m 1214\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m-> 1215\u001b[0m s \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_flavour\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mresolve\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstrict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstrict\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1216\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m s \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1217\u001b[0m \u001b[38;5;66;03m# No symlink resolution => for consistency, raise an error if\u001b[39;00m\n\u001b[1;32m 1218\u001b[0m \u001b[38;5;66;03m# the path doesn't exist or is forbidden\u001b[39;00m\n\u001b[1;32m 1219\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstat()\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:373\u001b[0m, in \u001b[0;36m_PosixFlavour.resolve\u001b[0;34m(self, path, strict)\u001b[0m\n\u001b[1;32m 370\u001b[0m \u001b[38;5;66;03m# NOTE: according to POSIX, getcwd() cannot contain path components\u001b[39;00m\n\u001b[1;32m 371\u001b[0m \u001b[38;5;66;03m# which are symlinks.\u001b[39;00m\n\u001b[1;32m 372\u001b[0m base \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m'\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m path\u001b[38;5;241m.\u001b[39mis_absolute() \u001b[38;5;28;01melse\u001b[39;00m os\u001b[38;5;241m.\u001b[39mgetcwd()\n\u001b[0;32m--> 373\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_resolve\u001b[49m\u001b[43m(\u001b[49m\u001b[43mbase\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mstr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;129;01mor\u001b[39;00m sep\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:357\u001b[0m, in \u001b[0;36m_PosixFlavour.resolve.._resolve\u001b[0;34m(path, rest)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[38;5;66;03m# Resolve the symbolic link\u001b[39;00m\n\u001b[1;32m 356\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 357\u001b[0m target \u001b[38;5;241m=\u001b[39m \u001b[43maccessor\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mreadlink\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnewpath\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 358\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 359\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m e\u001b[38;5;241m.\u001b[39merrno \u001b[38;5;241m!=\u001b[39m EINVAL \u001b[38;5;129;01mand\u001b[39;00m strict:\n\nFile \u001b[0;32m~/mambaforge-pypy3/envs/jump_sc/lib/python3.9/pathlib.py:462\u001b[0m, in \u001b[0;36m_NormalAccessor.readlink\u001b[0;34m(self, path)\u001b[0m\n\u001b[1;32m 461\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21mreadlink\u001b[39m(\u001b[38;5;28mself\u001b[39m, path):\n\u001b[0;32m--> 462\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mos\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mreadlink\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m)\u001b[49m\n\n\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/home/camo/projects/JUMP-single-cell/2.evaluate_data/identify_sc_outlier_per_plate/-i'\n" - } - ], - "execution_count": 1 - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "Vo5hG6EkKw" - }, - "source": [ - "### Outputs" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "CLFFnku4sw" - }, - "source": [ - "plate_data_path = sampled_plate_jump_path / f\"{plate_path.parent.name}.parquet\"" - ], - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": "\u001b[0;31m--------------------------------------------------------------\u001b[0m\n\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)\nCell \u001b[0;32mIn[7], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m plate_data_path \u001b[38;5;241m=\u001b[39m \u001b[43msampled_plate_jump_path\u001b[49m \u001b[38;5;241m/\u001b[39m plate_path\u001b[38;5;241m.\u001b[39mparent\u001b[38;5;241m.\u001b[39mname\n\n\u001b[0;31mNameError\u001b[0m: name 'sampled_plate_jump_path' is not defined\n" - } - ], - "execution_count": 2 - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "nJGYnPrMIE" - }, - "source": [ - "## Save feature columns that don't contain NaNs" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "soht24BsrT" - }, - "source": [ - "feat_cols = []\n", - "non_na_cols = platedf.columns[platedf.notna().all()].tolist()\n", - "non_na_cols = [col for col in non_na_cols if not \"Metadata\" in col]\n", - "\n", - "if feat_col_path.exists():\n", - " with feat_col_path.open(\"r\") as feat_col_obj:\n", - " feat_cols = json.load(feat_col_obj)\n", - " feat_cols = list(set(non_na_cols) & set(feat_cols))\n", - "\n", - "else:\n", - " feat_cols = non_na_cols\n", - "\n", - "with feat_col_path.open(\"w\") as feat_col_obj:\n", - " json.dump(feat_cols, feat_col_obj)" - ], - "outputs": [], - "execution_count": null - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "X0CVV2bWwS" - }, - "source": [ - "## Sample single-cell data by Hash" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "5OrHDaTxQX" - }, - "source": [ - "ds = DeterministicSampling(\n", - " _platedf=platedf,\n", - " # Number of samples per plate needed to train the JUMP isolation forests\n", - " # See identify_anomalous_single_cells_fs.py for more details\n", - " # Only 4_000 are needed, however each sampling will likely not be exactly 4_000 samples\n", - " # due to the probabilistic nature of hash-based sampling.\n", - " # To ensure a sufficient sample size I increased the threshold.\n", - " _samples_per_plate=4_100,\n", - " _plate_column=\"Metadata_Plate\",\n", - " _well_column=\"Metadata_Well\",\n", - " _cell_id_columns=[\"Metadata_Site\", \"Metadata_ObjectNumber\"],\n", - ")\n", - "\n", - "sampled_platedf = ds.sample_plate_deterministically(_sample_strategy=\"well_sampling\")" - ], - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": "\u001b[0;31m--------------------------------------------------------------\u001b[0m\n\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)\nCell \u001b[0;32mIn[8], line 7\u001b[0m\n\u001b[1;32m 4\u001b[0m hash_cols \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMetadata_Plate\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMetadata_Well\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 5\u001b[0m group_cols \u001b[38;5;241m=\u001b[39m hash_cols\u001b[38;5;241m.\u001b[39mcopy()\n\u001b[0;32m----> 7\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[43mis_sc\u001b[49m:\n\u001b[1;32m 8\u001b[0m hash_cols \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMetadata_Site\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMetadata_ObjectNumber\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 10\u001b[0m platedf[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMetadata_farmhash\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m platedf\u001b[38;5;241m.\u001b[39mapply(\n\u001b[1;32m 11\u001b[0m \u001b[38;5;28;01mlambda\u001b[39;00m row: Fingerprint64(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mjoin(\u001b[38;5;28mstr\u001b[39m(row[col]) \u001b[38;5;28;01mfor\u001b[39;00m col \u001b[38;5;129;01min\u001b[39;00m hash_cols)) \u001b[38;5;241m%\u001b[39m divisor,\n\u001b[1;32m 12\u001b[0m axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m,\n\u001b[1;32m 13\u001b[0m )\n\n\u001b[0;31mNameError\u001b[0m: name 'is_sc' is not defined\n" - } - ], - "execution_count": 3 - }, - { - "cell_type": "markdown", - "metadata": { - "jukit_cell_id": "A98Gum4i9t" - }, - "source": [ - "## Concatenate and Save Sampled Plate data" - ] - }, - { - "cell_type": "code", - "metadata": { - "jukit_cell_id": "tEcACoqJrH" - }, - "source": [ - "if plate_data_path.exists():\n", - " sampled_platedf = pd.concat(\n", - " [sampled_platedf, pd.read_parquet(plate_data_path)], axis=0\n", - " )\n", - "\n", - "sampled_platedf.to_parquet(plate_data_path)" - ], - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": "\u001b[0;31m--------------------------------------------------------------\u001b[0m\n\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)\nCell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[43mplate_data_path\u001b[49m\u001b[38;5;241m.\u001b[39mexists():\n\u001b[1;32m 2\u001b[0m sampled_platedf \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mconcat(\n\u001b[1;32m 3\u001b[0m [sampled_platedf, pd\u001b[38;5;241m.\u001b[39mread_parquet(plate_data_path)], axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m\n\u001b[1;32m 4\u001b[0m )\n\u001b[1;32m 6\u001b[0m sampled_platedf\u001b[38;5;241m.\u001b[39mto_parquet(plate_data_path)\n\n\u001b[0;31mNameError\u001b[0m: name 'plate_data_path' is not defined\n" - } - ], - "execution_count": 4 - } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "display_name": "python", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/2.evaluate_data/train_sc_anomalyze_models/sample_data_and_train_anomalyze.sh b/2.evaluate_data/train_sc_anomalyze_models/sample_data_and_train_anomalyze.sh deleted file mode 100755 index 24f7214..0000000 --- a/2.evaluate_data/train_sc_anomalyze_models/sample_data_and_train_anomalyze.sh +++ /dev/null @@ -1,55 +0,0 @@ -#!/bin/bash - -# This script calls python code to sample and train anomalyze by passing plates from processed data as input. - -conda init bash -conda activate jump_sc - -sampled_plate_jump_data=$(mktemp -d) - -cleanup() { - echo "Removing temporary plate data directory" - rm -rf "$sampled_plate_jump_data" -} - -# Trap common exit signals -trap cleanup EXIT INT TERM ERR - -# Get the root directory -git_root=$(git rev-parse --show-toplevel) - -py_path="nbconverted" - -jupyter nbconvert --to python --output-dir="${py_path}/" *.ipynb - -# Get the normalized data path (with multiple plates) -plate_paths=( - "${git_root}/big_drive/normalized_sc_data" - "${git_root}/big_drive/feature_selected_sc_qc_data" - "${git_root}/big_drive/normalized_sc_qc_data" - "${git_root}/big_drive/feature_selected_sc_data" -) - -feature_column_path="${git_root}/big_drive/sc_anomaly_data" -isoforest_path="${git_root}/big_drive/isolation_forest_models" - -for plate_dir in "${plate_paths[@]}"; do - - if [ -d "$plate_dir" ]; then - - echo -e "\nSampling from $plate_dir" - - for plate_file in $plate_dir/*; do - /usr/bin/time -v python3 "$py_path/sample_anomalous_single_cells_fs.py" "$plate_file" "$sampled_plate_jump_data" "$feature_column_path" - done - - echo -e "\nTraining on sampled data from $plate_dir" - - /usr/bin/time -v python3 "$py_path/identify_anomalous_single_cells_fs.py" "$plate_dir" "$sampled_plate_jump_data" "$feature_column_path" "$isoforest_path" - - else - echo "Error: '$plate_dir' is not a directory." - return 1 - fi - -done diff --git a/2.evaluate_data/utils/DeterministicSampling.py b/2.evaluate_data/utils/DeterministicSampling.py deleted file mode 100644 index 28fd4bc..0000000 --- a/2.evaluate_data/utils/DeterministicSampling.py +++ /dev/null @@ -1,74 +0,0 @@ -import warnings - -import pandas as pd -from farmhash import Fingerprint64 - - -class DeterministicSampling: - - def __init__( - self, - _platedf: pd.DataFrame, - _samples_per_plate: int, - _plate_column: str, - _well_column: str, - _cell_id_columns: list[str], - ): - self._platedf = _platedf - self._samples_per_plate = _samples_per_plate - self._plate_column = _plate_column - self._well_column = _well_column - self._cell_id_columns = _cell_id_columns - self._divisor = 10_000 - - """ - _plate_column, _well_column and _cell_id_columns store column names, and must be present in _platedf. - _cell_id_columns will be used with _plate_column and _well_column to uniquely identify each cell. - For example, _cell_id_columns in some projects could be ["Metadata_Site", "Metadata_ObjectNumber"] - """ - - @property - def platedf(self) -> pd.DataFrame: - return self._platedf.copy() - - def __hash_data(self) -> None: - # Create a hash for each data entry - - hash_cols = [self._plate_column, self._well_column] + self._cell_id_columns - self._group_cols = hash_cols[0:2] - - combined_strs = self._platedf[hash_cols].astype(str).agg("".join, axis=1) - - # Vectorized hash mapping - self._platedf["Metadata_farmhash"] = combined_strs.map( - lambda x: Fingerprint64(x) % self._divisor - ) - - def sample_plate_deterministically( - self, _sample_strategy: str = "well_sampling" - - ) -> pd.DataFrame: - self.__hash_data() - - if _sample_strategy != "well_sampling": - mod_cutoff = ( - self._samples_per_plate / self._platedf.shape[0] - ) * self._divisor - return self._platedf.loc[self._platedf["Metadata_farmhash"] < mod_cutoff] - - grouped = self._platedf.groupby(self._group_cols) - - num_groups = len(grouped) - samples_per_group = max(1, self._samples_per_plate // num_groups) - - # Fixed sampling per group using hash - return pd.concat( - ( - group_df.loc[ - group_df["Metadata_farmhash"] - < (samples_per_group / len(group_df)) * self._divisor - ] - for _, group_df in grouped - ), - ignore_index=True, - ) diff --git a/2.evaluate_data/utils/isolation_forest_data_feature_importance.py b/2.evaluate_data/utils/isolation_forest_data_feature_importance.py deleted file mode 100644 index 930f8f1..0000000 --- a/2.evaluate_data/utils/isolation_forest_data_feature_importance.py +++ /dev/null @@ -1,209 +0,0 @@ -# Learn more about the methods used in this module -# by reading the isolation forest paper: -# https://doi.org/10.1109/ICDM.2008.17 - -from typing import Iterable, Optional, Union - -import numpy as np -import pandas as pd -from joblib import Parallel, delayed -from sklearn.tree import DecisionTreeRegressor, _tree - - -def isoforest_expected_length(node_sample_counts: np.ndarray) -> np.ndarray: - """ - Compute expected path length c(n) for an array of node sizes, where c(n) - is the average path length of an unsuccessful Binary Search Tree (BST) search - (same as the iForest remaining-path normalizer and leaf correction term). - - Parameters - ---------- - node_sample_counts : np.ndarray - Number of training samples in each node. - - Returns - ------- - np.ndarray - Expected path length for each input node size. - """ - - node_sample_counts = np.asarray(node_sample_counts) - out = np.zeros_like(node_sample_counts, dtype=float) - mask_1 = node_sample_counts <= 1 - mask_2 = node_sample_counts == 2 - mask_rest = ~(mask_1 | mask_2) - - out[mask_1] = 0.0 - out[mask_2] = 1.0 - remaining_node_counts = node_sample_counts[mask_rest].astype(float) - - # This equation came from the paper to compute c(n) - out[mask_rest] = 2.0 * ( - np.log(remaining_node_counts - 1.0) + np.euler_gamma - ) - 2.0 * ((remaining_node_counts - 1.0) / remaining_node_counts) - - return out - - -class IsoforestFeatureImportance: - """ - Per-sample Isolation Forest feature contributions using split gain. - For each point x and tree t, contributions sum the split gains - g(v, x) = c(n_p) - [1 + c(n_c)] along the path x takes, then - average over trees. - Higher values indicate a feature was more important for isolating a sample. - """ - - def __init__( - self, - estimators: list[DecisionTreeRegressor], - morphology_data: pd.DataFrame, - estimators_features: Optional[Iterable[np.ndarray]] = None, - ): - """ - Parameters - ---------- - estimators : list[DecisionTreeRegressor] - Fitted isolation trees (e.g., from IsolationForest.estimators_). - morphology_data : pd.DataFrame - Feature matrix to explain; columns must match the model features. - estimators_features : Iterable[np.ndarray] | None - Optional per-tree feature index subsets (IsolationForest.estimators_features_). - """ - - self._estimators = estimators - self._estimators_features = estimators_features - self._morphology_data = morphology_data - self._isoforest_importances = None - - @property - def isoforest_importances(self) -> pd.DataFrame: - if self._isoforest_importances is None: - raise ValueError("isoforest_importances have not been computed") - - return self._isoforest_importances - - def _compute_tree_gains( - self, - estimator: DecisionTreeRegressor, - X_data: np.ndarray, - feat_subset: np.ndarray, - ) -> np.ndarray: - """ - Compute per-sample, per-feature gains for a single tree. - - Parameters - ---------- - estimator : DecisionTreeRegressor - The fitted isolation tree to traverse. - X_data : np.ndarray - Feature matrix (float) aligned to the model's feature order. - feat_subset : np.ndarray - Global feature indices used by this tree (from estimators_features_). - - Returns - ------- - np.ndarray - Array of shape (n_samples, n_features) with split-gain contributions - for this tree. - """ - - tree = estimator.tree_ - num_samples, _ = X_data.shape - - left_children = tree.children_left - right_children = tree.children_right - tree_feats = tree.feature - feat_thresholds = tree.threshold - - node_lengths = isoforest_expected_length(tree.n_node_samples) - total_tree_gain = np.zeros_like(X_data, dtype=float) - - for sample_idx in range(num_samples): - sample_node = 0 - while tree_feats[sample_node] != _tree.TREE_UNDEFINED: - tree_feat_idx = tree_feats[sample_node] - rel_feat_idx = feat_subset[tree_feat_idx] - - if X_data[sample_idx, rel_feat_idx] <= feat_thresholds[sample_node]: - child = left_children[sample_node] - else: - child = right_children[sample_node] - - # Subtracting 1, because the expected path length is one less - # once the next step was taken (to the child node). - # The child node will sometimes be a leaf node as some samples may - # not be isolated. - sample_tree_gain = node_lengths[sample_node] - 1.0 - node_lengths[child] - total_tree_gain[sample_idx, rel_feat_idx] += sample_tree_gain - sample_node = child - - return total_tree_gain - - def compute_isoforest_importances(self) -> pd.DataFrame: - """ - Compute per-sample, per-feature split-gain contributions averaged over trees. - - Returns - ------- - pd.DataFrame - DataFrame of averaged contributions with sample index matching - `morphology_data` and columns matching feature names. - """ - - feature_names = list(self._morphology_data.columns) - X = self._morphology_data[feature_names].to_numpy(dtype=float, copy=False) - - feat_subsets: list[np.ndarray] = ( - [np.asarray(fs, dtype=int) for fs in self._estimators_features] - if self._estimators_features is not None - else [np.arange(X.shape[1], dtype=int)] * len(self._estimators) - ) - - average_tree_gains = np.zeros_like(X, dtype=float) - - all_tree_gains = Parallel(n_jobs=-1)( - delayed(self._compute_tree_gains)(estimator, X, subset) - for estimator, subset in zip(self._estimators, feat_subsets) - ) - - for total_tree_gain in all_tree_gains: - average_tree_gains += total_tree_gain - - average_tree_gains /= len(self._estimators) - - self._isoforest_importances = pd.DataFrame( - average_tree_gains, columns=feature_names, index=self._morphology_data.index - ) - - return self._isoforest_importances - - def get_filtered_isoforest_importances( - self, features: Union[str, list[str]] - ) -> dict[str, list[float]]: - """ - Return feature importances without NaNs. - - Parameters - ---------- - features: Morphology feature names. - - Returns - ------- - dict[str, list[float]] - A mapping from feature name to a list of non-NaN importance values. - """ - - features = [features] if isinstance(features, str) else features - - if self._isoforest_importances is None: - filtered_morphology_data = self.compute_isoforest_importances() - else: - filtered_morphology_data = self._isoforest_importances[features].copy() - - return filtered_morphology_data.apply(lambda x: x.dropna().tolist()).to_dict() - - def __call__(self) -> pd.DataFrame: - """Compute or return cached per-sample, per-feature importances.""" - - return self.compute_isoforest_importances() diff --git a/3.analyze_data/README.md b/3.analyze_data/README.md index 85644cb..1d015ae 100644 --- a/3.analyze_data/README.md +++ b/3.analyze_data/README.md @@ -1,24 +1,24 @@ # Analyze Predicted Probabilities -In this module, we perform multiple analyses on the predicted probability data to validate the phenotypic predictions for each treatment (e.g., compound, CRISPR, or ORF). -To compare treatments and the negative control groups, we perform KS tests. +In this module, we perform analyses on predicted probability data to evaluate phenotypic behavior across treatments (for example, compound, CRISPR, and ORF). +Core comparisons use KS-based testing against negative control groups. ## Analyze Well Probabilities We compare the phenotype probabilities between each treated well and the remaining negative control wells on the corresponding plate. Each treatment well and corresponding negative control well phenotype probabilities are only compared if the number of cells in these groups is above a given cell count threshold. The group, treatment cells or control cells, are then randomly down-sampled depending on which of these groups has a larger population of cells. -Random sampling of the control cells is accomplished through stratification of cells by the plate's wells. +Random sampling of control cells is done through stratification by plate and well. After sampling the cell population, the cells from the treated and control groups are compared using the KS test statistic. -We have found that the predicted probabilities generated from [non-shuffled](https://github.com/WayScience/phenotypic_profiling_model/blob/main/2.train_model/models/multi_class_models/final__CP_areashape_only__balanced.joblib) and [shuffled](https://github.com/WayScience/phenotypic_profiling_model/blob/main/2.train_model/models/multi_class_models/shuffled_baseline__CP__balanced.joblib) weighted logistic regression models seem to perform the best from validation. These models were trained exclusively from [mitocheck](https://github.com/WayScience/mitocheck_data) cellprofiler areashape morphology features. +The analysis scripts in this directory include well-level aggregation within plates (summarizing single-cell probabilities per well), treatment-level aggregation across wells, significance testing, and probability distribution visualization. -## Run the analysis notebooks +## Run the Analysis Notebooks -To perform the analyses, run the [analyze_data.sh](./analyze_data.sh) file which will convert the notebook into a python file and run it from terminal. +To perform the analyses, run [analyze_data.sh](./analyze_data.sh). ```bash -# Make sure you are in the 3.analyze_data directory +# make sure you are in the 3.analyze_data directory cd 3.analyze_data -# Run the notebook as a python script +# run analysis workflow source analyze_data.sh ``` diff --git a/README.md b/README.md index dc90309..5e0d78c 100644 --- a/README.md +++ b/README.md @@ -4,12 +4,12 @@ ## Data -In this repository, we apply the [phenotypic profiling model](https://github.com/WayScience/phenotypic_profiling_model), which predicts phenotypic class of single cells using nuclei features, to the [JUMP-Target pilot data](https://github.com/jump-cellpainting/JUMP-Target) from the [JUMP consortium](https://jump-cellpainting.broadinstitute.org/). +In this repository, we apply the [phenotypic profiling model](https://github.com/WayScience/phenotypic_profiling_model), which predicts the phenotypic class of single cells using nuclei features, to the [JUMP-Target pilot data](https://github.com/jump-cellpainting/JUMP-Target) from the [JUMP consortium](https://jump-cellpainting.broadinstitute.org/). In this dataset, there are 51 plates with one of three perturbation types (Clustered Regularly Interspaced Short Palindromic Repeats \[CRISPR\], Open Reading Frame \[ORF\], and Compound) for two cell lines (A549 and U2OS). -Each perturbation type has it's own platemap and metadata file that can be found in the [reference_plate_data](./reference_plate_data/) folder. -A barcode platemap is include which associates each plate to the correct platemap file. +Each perturbation type has its own platemap and metadata file in the [reference_plate_data](./reference_plate_data/) folder. +A barcode platemap is included to associate each plate with the correct platemap file. We segment a total of **20,959,860 single cells** in all plates. @@ -30,13 +30,15 @@ Specifically, the benefits of single-cell phenotyping include: ## Repository Structure -| Module | Purpose | Description | -| :---------------------------------------------- | :----------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [0.download_data](./0.download_data/) | Download JUMP-Target SQLite files and process them with [CytoTable](https://github.com/cytomining/CytoTable) | We download the CellProfiler SQLite outputs for 51 plates from AWS and process them into Parquet files which combine compartment and image metadata as a single table. | -| [1.process_data](./1.process_data/) | Process SQLite files | We use pycytominer on the SQLite outputs to merge single-cells and normalize features | -| [2.evaluate_data](./2.evaluate_data/) | Apply phenotypic profiling model | We generate phenotypic predictions for single-cells using the phenotypic profiling model | -| [3.analyze_data](./3.analyze_data/) | Analyze phenotypic predictions | We perform multiple analyses to validate the phenotypic predicted class for each perturbation compared to control | -| [reference_plate_data](./reference_plate_data/) | Platemaps per perturbation type | This folder holds the platemap files with metadata based on perturbation type and the barcode platemap file | +| Module | Purpose | Description | +| :--- | :--- | :--- | +| [0.download_data](./0.download_data/) | Download JUMP-Target SQLite files and process them with [CytoTable](https://github.com/cytomining/CytoTable) | Downloads CellProfiler SQLite outputs for 51 plates from AWS and processes them into Parquet files that combine compartment and image metadata in one table. | +| [1.process_data](./1.process_data/) | Process SQLite files | Uses pycytominer on SQLite outputs to merge single cells, normalize features, and produce downstream-ready data. | +| [2.evaluate_data](./2.evaluate_data/) | Apply phenotypic profiling model | Runs class-balanced logistic regression prediction workflows to generate single-cell phenotype probabilities. | +| [3.analyze_data](./3.analyze_data/) | Analyze phenotypic predictions | Performs analyses to validate predicted phenotypic classes for perturbations compared to controls. | +| [4.validate_data](./4.validate_data/) | Validate phenotype-level findings | Contains validation scripts for compound and phenotype-level enrichment analyses. | +| [4.visualize_data](./4.visualize_data/) | Visualize analysis outputs | Contains visualization scripts for downstream interpretation and reporting. | +| [reference_plate_data](./reference_plate_data/) | Platemaps and metadata | Holds platemap files, metadata by perturbation type, and barcode platemap mappings. | ## Development @@ -45,13 +47,14 @@ Please see [`just` installation details](https://just.systems/man/en/packages.ht ### Environment -For all modules, we use conda environments that includes all necessary packages. +For all modules, we use conda environments that include the required packages. -To create the environment from terminal, run the code line below: +To create the environments from terminal, run the commands below: ```bash -# Make sure you are in the same directory as the environment file -conda env create -f environment.yml +# Make sure you are in the repository root +conda env create -n jump_sc -f environment.yml +conda env create -n R_jump_sc -f R_environment.yml ``` Alternatively, use the following `just` command. @@ -63,19 +66,15 @@ just setup-conda-envs ## Running Code from this Project -`just` commands are provided for processing each step. -There also is a command which can run all processing steps. +`just` commands are provided to run project tasks from the repository root. +These commands are an entrypoint separate from directly invoking module shell scripts. Individual steps: ```bash # run step 0.download_data just run-step-0 -``` - -Run all steps: -```bash -# run all steps -just run-all-steps ``` + +Module-specific scripts are also available in each workflow directory when you need direct execution. diff --git a/reference_plate_data/README.md b/reference_plate_data/README.md index 200d72d..aa6734a 100644 --- a/reference_plate_data/README.md +++ b/reference_plate_data/README.md @@ -1,25 +1,25 @@ # Barcode Platemap and Platemaps -In this folder, there three file types: +In this folder, there are four file types: -1. **Barcode plate** - CSV file that associates each plate to the correct platemap -2. **Platemap** - TXT that contains the plate layout (e.g., what broad sample is in what well) +1. **Barcode plate** - CSV file that associates each plate with the correct platemap +2. **Platemap** - TXT file that contains plate layout information (for example, which Broad sample is in each well) 3. **Metadata** - TSV file that contains all of the relevant metadata per broad sample 4. **Experimental Metadata** - TSV file that contains additional information about the experiment -The platemap and metadata files are broken down by treatment (e.g., compound, crispr, or orf). +The platemap and metadata files are broken down by treatment (for example, compound, CRISPR, or ORF). We downloaded metadata files excluding the **Experimental Metadata** from [the metadata folder in AWS](https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0000-jump-pilot/source_4/workspace/metadata/external_metadata/). We downloaded the barcode platemap and platemap files from [the platemaps folder in AWS](https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0000-jump-pilot/source_4/workspace/metadata/platemaps/2020_11_04_CPJUMP1/). We downloaded the experimental metadata file from [the jump-cellpainting github repo subfolder](https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/blob/9edd26d60524a62f993d4df40a5d8908206714f5/README.md#batch-and-plate-metadata). -Visualization of the platemaps are as follows: +Platemap visualizations are shown below: ![compound_platemap](./ex_figures/JUMP_compound_platemap.png) > Platemap for compound treatments ![crispr_platemap](./ex_figures/JUMP_crispr_platemap.png) -> Platemap for crispr treatments +> Platemap for CRISPR treatments ![orf_platemap](./ex_figures/JUMP_orf_platemap.png) -> Platemap for orf treatments +> Platemap for ORF treatments