29 commits
08cefec
NF_MAAffymetrix: update qmd structure
cyouh95 Dec 1, 2024
501ec0f
NF_MAAffymetrix: #113 update handling custom annotations
cyouh95 Jan 5, 2025
475a337
NF_MAAffymetrix: update pipeline documentation
cyouh95 Jan 5, 2025
6249c40
NF_MAAffymetrix: move updated doc to new pipeline version
cyouh95 Jan 10, 2025
683e0c3
NF_MAAffymetrix: track array annotations
cyouh95 Jan 19, 2025
8e6a96a
NF_MAAffymetrix: #113 update custom annotations config
cyouh95 Jan 20, 2025
4b490ba
NF_MAAffymetrix: use reference annotations GL-DPPD-7110-A
cyouh95 Jan 20, 2025
645b258
NF_MAAffymetrix: update tool versions
cyouh95 Jan 20, 2025
7b14d1d
NF_MAAffymetrix: update pipeline version from GL-DPPD-7114 to GL-DPPD…
cyouh95 Jan 21, 2025
a177711
NF_MAAffymetrix: update pipeline doc
cyouh95 Jan 31, 2025
0cd2a59
NF_MAAffymetrix: update pipeline doc
cyouh95 Feb 4, 2025
cfb2e94
Update GL-DPPD-7114-A.md
asaravia-butler Feb 7, 2025
7433f69
NF_MAAffymetrix: reorder DE table columns
cyouh95 Feb 10, 2025
44ecb25
NF_MAAffymetrix: remove visualization_PCA_table_GLmicroarray.csv output
cyouh95 Feb 10, 2025
bfa4eba
NF_MAAffymetrix: update report headings
cyouh95 Feb 10, 2025
a228c94
NF_MAAffymetrix: remove viz output from V&V
cyouh95 Feb 10, 2025
d07ee40
NF_MAAffymetrix: use original sample name in output files
cyouh95 Feb 10, 2025
f8bb3fe
Revert "NF_MAAffymetrix: use original sample name in output files"
cyouh95 Feb 11, 2025
2529319
NF_MAAffymetrix: minor updates to qmd
cyouh95 Feb 27, 2025
3c8e0af
NF_MAAffymetrix: update accepted ISA field name for label
cyouh95 Mar 4, 2025
57afe18
NF_MAAffymetrix: minor updates to workflow version 1.0.5
cyouh95 Mar 4, 2025
cca1601
NF_MAAffymetrix: minor update to pipeline doc
cyouh95 Mar 4, 2025
7733b7c
NF_MAAffymetrix: update custom functions in pipeline doc
cyouh95 Mar 25, 2025
0d79447
NF_MAAffymetrix: update nextflow version from 23.10.1 to 24.10.5
cyouh95 Mar 25, 2025
93d32ee
Update pipeline and annotation README docs
bnovak32 May 28, 2025
6ce8f28
Refactor Affymetrix Workflow: Update documentation, streamline proces…
jihanyehia Apr 1, 2026
263d0d3
Update 3rd party software licenses and add purrr license
jihanyehia Apr 2, 2026
8e891b6
Update 3rd party software licenses: replace glue and stringr license …
jihanyehia Apr 6, 2026
2bfdfce
Add conditional execution of UPDATE_ISA_TABLES that is not expected t…
jihanyehia Apr 8, 2026
21 changes: 11 additions & 10 deletions 3rd_Party_Licenses/Microarray_Affymetrix_3rd_Party_Software.md

Large diffs are not rendered by default.

1,622 changes: 1,622 additions & 0 deletions Microarray/Affymetrix/Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion Microarray/Affymetrix/README.md
@@ -1,7 +1,7 @@
# GeneLab bioinformatics processing pipeline for Affymetrix microarray data


> **The document [`GL-DPPD-7114.md`](Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md) holds an overview and example commands for how GeneLab processes Affymetrix microarray datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code is provided for each GLDS dataset along with the processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
> **The document [`GL-DPPD-7114-A.md`](Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md) holds an overview and example commands for how GeneLab processes Affymetrix microarray datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code is provided for each GLDS dataset along with the processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**

---

@@ -5,11 +5,11 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.5](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAffymetrix_1.0.5/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix) - 2024-08-30
## [1.0.5](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAffymetrix_1.0.5/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix) - 2026-05-XX

### Added

- Add support for bacteria annotations using manufacturer annotations ([#113](https://github.com/nasa/GeneLab_Data_Processing/issues/113))
- Support for custom annotations, see [specification](examples/annotations/README.md) ([#113](https://github.com/nasa/GeneLab_Data_Processing/issues/113))
- Add option to skip differential expression analysis (`--skipDE`) ([#104](https://github.com/nasa/GeneLab_Data_Processing/issues/104))

### Changed
@@ -21,6 +21,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Update MA plot to support HTAFeatureSet ([#105](https://github.com/nasa/GeneLab_Data_Processing/issues/105))
- Remove extra `.1` suffix in AFFY HTA 2 0 Probe IDs in the raw data to allow for merging to BioMart data ([#106](https://github.com/nasa/GeneLab_Data_Processing/issues/106))
- Decrease legend size when sample names are long to prevent it from covering plot ([#107](https://github.com/nasa/GeneLab_Data_Processing/issues/107))
- Update the custom `fetch_organism_specific_annotation_table()` function, used when loading organism-specific annotation metadata, to convert figshare ndownloader URLs to direct API endpoints, as ndownloader URLs require redirect handling that is not supported in all programmatic download contexts
- Simplify group sample retrieval during differential expression group-wise statistics computation to use a more concise `filter/pull/sort` chain instead of `group_by/summarize/filter/pull`, addressing the deprecation warning in dplyr >= 1.1.0 where returning more than 1 row per `summarise()` group is deprecated
- Update processed data protocol to auto-populate workflow version from `nextflow.config` and add Caenorhabditis elegans, Saccharomyces cerevisiae, Escherichia coli, and Pseudomonas aeruginosa to supported organisms ([#98](https://github.com/nasa/GeneLab_Data_Processing/issues/98))
- Update software table generation to exclude `R.utils` from table if data files are not compressed ([#99](https://github.com/nasa/GeneLab_Data_Processing/issues/99))

@@ -4,7 +4,7 @@

### Implementation Tools <!-- omit in toc -->

The current GeneLab Affymetrix Microarray consensus processing pipeline (NF_MAAffymetrix), [GL-DPPD-7114](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_MAAffymetrix) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.
The current GeneLab Affymetrix Microarray consensus processing pipeline (NF_MAAffymetrix), [GL-DPPD-7114-A](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_MAAffymetrix) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.

### Workflow & Subworkflows <!-- omit in toc -->

@@ -14,8 +14,8 @@ The current GeneLab Affymetrix Microarray consensus processing pipeline (NF_MAAf

---
The NF_MAAffymetrix workflow is composed of three subworkflows as shown in the image above.
Below is a description of each subworkflow and the additional output files generated that are not already indicated in the [GL-DPPD-7114 pipeline
document](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md):
Below is a description of each subworkflow and the additional output files generated that are not already indicated in the [GL-DPPD-7114-A pipeline
document](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md):

1. **Analysis Staging Subworkflow**

@@ -26,7 +26,7 @@ document](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md):
2. **Affymetrix Microarray Processing Subworkflow**

- Description:
- This subworkflow uses the staged raw data and metadata parameters from the Analysis Staging Subworkflow to generate processed data using the [GL-DPPD-7114 pipeline](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md).
- This subworkflow uses the staged raw data and metadata parameters from the Analysis Staging Subworkflow to generate processed data using the [GL-DPPD-7114-A pipeline](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md).

3. **V&V Pipeline Subworkflow**

@@ -143,6 +143,8 @@ nextflow run NF_MAAffymetrix_1.0.5/main.nf \
```bash
nextflow run NF_MAAffymetrix_1.0.5/main.nf \
-profile singularity \
--osdAccession OSD-266 \
--gldsAccession GLDS-266 \
--isaArchivePath </path/to/isaArchive>
```

@@ -157,7 +159,7 @@ nextflow run NF_MAAffymetrix_1.0.5/main.nf \

<br>

**Additional Required Parameters For [Approach 1](#3a-approach-1-run-the-workflow-on-a-genelab-agilent-1-channel-microarray-dataset):**
**Additional Required Parameters For [Approach 1](#3a-approach-1-run-the-workflow-on-a-genelab-affymetrix-microarray-dataset):**

* `--osdAccession OSD-###` – specifies the OSD ID to process through the NF_MAAffymetrix workflow (replace ### with the OSD number)

@@ -171,10 +173,22 @@ nextflow run NF_MAAffymetrix_1.0.5/main.nf \

<br>

**Additional Required Parameters For [Approach 3](#3c-approach-3-run-the-workflow-using-an-isa-archive):**

* `--osdAccession OSD-###` – specifies the OSD ID to process through the NF_MAAffymetrix workflow (replace ### with the OSD number)

* `--gldsAccession GLDS-###` – specifies the GLDS ID to process through the NF_MAAffymetrix workflow (replace ### with the GLDS number)

* `--isaArchivePath` - specifies the path to a previously-downloaded *ISA.zip (Default: an *ISA.zip is automatically fetched from the GeneLab Repository for the GLDS dataset being processed)

<br>

**Optional Parameters:**

* `--skipVV` - skip the automated V&V processes (Default: the automated V&V processes are active)

* `--skipDE` - skip the differential expression analysis (Default: the differential expression analysis is performed)

* `--resultsDir` - specifies the output directory for all files produced by the workflow (Default: <OSD-NNN_GLDS-NNN> if OSD and GLDS accessions are specified. Otherwise, the workflow launch directory.)
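
As a minimal sketch, the optional flags can be appended to the Approach 3 command. The accessions and flags below come from the example and parameter lists above; the ISA archive path and the results directory name are placeholders chosen for illustration, not values from the workflow documentation. The command is assembled in an array and printed for review rather than launched.

```bash
#!/usr/bin/env bash
# Sketch only: assemble an Approach 3 launch command with optional flags.
# The two values below are placeholders (assumptions), not documented defaults.
isa_archive="/path/to/isaArchive"
results_dir="OSD-266_no_DE_results"

cmd=(nextflow run NF_MAAffymetrix_1.0.5/main.nf
     -profile singularity
     --osdAccession OSD-266
     --gldsAccession GLDS-266
     --isaArchivePath "$isa_archive"
     --skipDE
     --resultsDir "$results_dir")

# Print the assembled command so it can be inspected before running.
printf '%s ' "${cmd[@]}"
echo
```

Once the printed command looks right, it can be launched with `"${cmd[@]}"`.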

<br>
@@ -200,10 +214,10 @@ All R code steps and output are rendered within a Quarto document yielding the f


The outputs from the Analysis Staging and V&V Pipeline Subworkflows are described below:
> Note: The outputs from the Affymetrix Microarray Processing Subworkflow are documented in the [GL-DPPD-7114.md](../../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md) processing protocol.
> Note: The outputs from the Affymetrix Microarray Processing Subworkflow are documented in the [GL-DPPD-7114-A.md](../../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md) processing protocol.

**Analysis Staging Subworkflow**

> Note: only applicable for [Approach 1](#3a-approach-1-run-the-workflow-on-a-genelab-affymetrix-microarray-dataset) and [Approach 3](#3c-approach-3-run-the-workflow-using-an-isa-archive)
- Output:
- \*_microarray_v1_runsheet.csv (table containing metadata required for processing, including the raw data file locations)
- \*-ISA.zip (the ISA archive of the GLDS dataset to be processed, downloaded from the GeneLab Data Repository)
@@ -0,0 +1,29 @@
# Custom Annotations Specification

## Description

* If using custom gene annotations when processing Affymetrix datasets through GeneLab's Affymetrix processing pipeline, a csv config file must be provided as specified below.
* See [config.csv](config.csv) for the latest config file used at GeneLab.


## Example

- [config.csv](config.csv)


## Required columns

| Column Name | Type | Description | Example |
|:------------|:-----|:------------|:--------|
| array_design | string | A BioMart attribute identifier denoting the microarray probe/probeset attribute used for annotation mapping. | AFFY E coli Genome 2 0 |
| annot_type | string | Used to determine how the custom annotations are parsed before merging to the data. Currently, only the below are supported: <ul><li>`3prime-IVT`: Annotations file is expected to be in the format of the 3' IVT expression analysis arrays annotations by [Thermo Fisher](https://www.thermofisher.com/us/en/home/life-science/microarray-analysis/microarray-data-analysis/genechip-array-annotation-files.html)</li><li>`custom`: Annotations file is merged as is, expected to have the following columns: `ProbesetID`, `ENTREZID`, `SYMBOL`, `GENENAME`, `ENSEMBL`, `REFSEQ`, `GOSLIM_IDS`, `STRING_id`, `count_gene_mappings`, `gene_mapping_source`</li></ul> | 3prime-IVT |
| annot_filename | string | Name of the custom annotations file. | E_coli_2.na36.annot.csv |

## Optional columns
If the file was downloaded from a website, provide the download link and the download date in additional columns after the required columns, for traceability.

| Column Name | Type | Description | Example |
|:------------|:-----|:------------|:--------|
| download_link | string | The URL used to retrieve the annotation file. | https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/E_coli_2-na36-annot-csv.zip |
| download_date | date string | The date the file was retrieved in YYYY-MM-DD format. | 2024-06-15 |
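
The column requirements above can be checked before launching the workflow. The following is a rough sketch, not part of the pipeline: `check_annotation_config` is a hypothetical helper name, and it assumes a simple CSV with no quoted fields or embedded commas.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of NF_MAAffymetrix): verify that a custom
# annotations config has the required columns and a supported annot_type.
# Assumes plain CSV without quoted fields or embedded commas.
check_annotation_config() {
  local config="$1" header col status=0
  header="$(head -n 1 "$config" | tr -d '\r')"

  # Every required column must appear in the header row.
  for col in array_design annot_type annot_filename; do
    case ",$header," in
      *",$col,"*) ;;
      *) echo "missing required column: $col"; status=1 ;;
    esac
  done
  [ "$status" -eq 0 ] || return 1

  # Every non-empty data row must use one of the two supported annot_type values.
  awk -F',' '
    { sub(/\r$/, "") }
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "annot_type") c = i; next }
    NF && $c != "3prime-IVT" && $c != "custom" {
      printf "line %d: unsupported annot_type: %s\n", NR, $c
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$config"
}

# Usage with a one-row config taken from the example values in the table above.
demo_config="$(mktemp)"
cat > "$demo_config" <<'EOF'
array_design,annot_type,annot_filename
AFFY E coli Genome 2 0,3prime-IVT,E_coli_2.na36.annot.csv
EOF
check_annotation_config "$demo_config" && echo "config OK"
rm -f "$demo_config"
```

The function returns a non-zero status on the first class of problem it finds, so it can gate a launch script with `check_annotation_config config.csv || exit 1`.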
@@ -0,0 +1,3 @@
array_design,annot_type,annot_filename,download_link,download_date
AFFY E coli Genome 2 0,3prime-IVT,E_coli_2.na36.annot.csv,https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/E_coli_2-na36-annot-csv.zip,2024-06-15
AFFY GeneChip P. aeruginosa Genome,3prime-IVT,Pae_G1a.na36.annot.csv,https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/Pae_G1a-na36-annot-csv.zip,2024-06-15
@@ -0,0 +1,7 @@
Sample Name,Study Assay Measurement,Study Assay Technology Type,Study Assay Technology Platform,organism,biomart_attribute,Source Name,Label,Array Data File Name,Array Data File Path,Comment[Array Data File Name],Factor Value[Spaceflight],Factor Value[Altered Gravity],Original Sample Name
Atha_Col-0_clsCC_FLT_1G_Rep1,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_1,biotin,GLDS-213_microarray_FC_front.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FC_front.CEL.gz,GLDS-213_microarray_FC_front.CEL.gz,Space Flight,1G by centrifugation,Atha_Col-0_clsCC_FLT_1G_Rep1
Atha_Col-0_clsCC_FLT_1G_Rep2,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_2,biotin,GLDS-213_microarray_FC_rear.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FC_rear.CEL.gz,GLDS-213_microarray_FC_rear.CEL.gz,Space Flight,1G by centrifugation,Atha_Col-0_clsCC_FLT_1G_Rep2
Atha_Col-0_clsCC_FLT_uG_Rep1,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_3,biotin,GLDS-213_microarray_FS_front.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FS_front.CEL.gz,GLDS-213_microarray_FS_front.CEL.gz,Space Flight,uG,Atha_Col-0_clsCC_FLT_uG_Rep1
Atha_Col-0_clsCC_FLT_uG_Rep2,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_4,biotin,GLDS-213_microarray_FS_rear.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FS_rear.CEL.gz,GLDS-213_microarray_FS_rear.CEL.gz,Space Flight,uG,Atha_Col-0_clsCC_FLT_uG_Rep2
Atha_Col-0_clsCC_GC_1G_Rep1,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_5,biotin,GLDS-213_microarray_GS_front.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_GS_front.CEL.gz,GLDS-213_microarray_GS_front.CEL.gz,Ground Control,1G on Earth,Atha_Col-0_clsCC_GC_1G_Rep1
Atha_Col-0_clsCC_GC_1G_Rep2,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_6,biotin,GLDS-213_microarray_GS_rear.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_GS_rear.CEL.gz,GLDS-213_microarray_GS_rear.CEL.gz,Ground Control,1G on Earth,Atha_Col-0_clsCC_GC_1G_Rep2
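
To sanity-check a runsheet like the one above before processing, samples can be tallied per combination of `Factor Value[...]` columns. This is an illustrative sketch only: `summarize_runsheet_groups` is a hypothetical helper, it is not how the workflow derives its groups, and it assumes no quoted fields or embedded commas.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of NF_MAAffymetrix): count samples per
# combination of "Factor Value[...]" columns in a runsheet CSV.
# Assumes plain CSV without quoted fields or embedded commas.
summarize_runsheet_groups() {
  awk -F',' '
    { sub(/\r$/, "") }
    NR == 1 {
      # Remember the position of every factor value column.
      for (i = 1; i <= NF; i++) if ($i ~ /^Factor Value\[/) cols[++n] = i
      next
    }
    NF {
      # Join the factor values of this sample into one group label.
      key = ""
      for (j = 1; j <= n; j++) key = key (j > 1 ? " & " : "") $(cols[j])
      count[key]++
    }
    END { for (k in count) printf "%s\t%d\n", k, count[k] }
  ' "$1" | sort
}

# Usage with a reduced copy (three columns, three samples) of the runsheet above.
demo_runsheet="$(mktemp)"
cat > "$demo_runsheet" <<'EOF'
Sample Name,Factor Value[Spaceflight],Factor Value[Altered Gravity]
Atha_Col-0_clsCC_FLT_1G_Rep1,Space Flight,1G by centrifugation
Atha_Col-0_clsCC_FLT_uG_Rep1,Space Flight,uG
Atha_Col-0_clsCC_GC_1G_Rep1,Ground Control,1G on Earth
EOF
summarize_runsheet_groups "$demo_runsheet"
rm -f "$demo_runsheet"
```

Each output line is a group label and a sample count separated by a tab, which makes it easy to spot a group with too few replicates.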