New version including changes in SDRF validation, remove validation of simple txt files #69
Changes from all commits
40624b8
0a6c80e
92c92ff
fdc415b
11ccc4c
45f3cff
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| FROM python:3.11-slim | ||
| RUN apt-get update && apt-get install -y --no-install-recommends git procps libglib2.0-0t64 && rm -rf /var/lib/apt/lists/* | ||
| WORKDIR /src | ||
| COPY . . | ||
| RUN pip install --no-cache-dir . | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -31,7 +31,7 @@ packages = [ | |||||
| [tool.poetry.dependencies] | ||||||
| python = "*" | ||||||
| click = "*" | ||||||
| sdrf-pipelines = "==0.0.33" | ||||||
| sdrf-pipelines = ">=0.1.1" | ||||||
|
Contributor
🧩 Analysis chain
🌐 Web query:
💡 Result: No released versions of sdrf-pipelines explicitly documented as exposing both sdrf_pipelines.converters.openms.unimod.UnimodDatabase and sdrf_pipelines.sdrf.sdrf.read_sdrf / validate_sdrf APIs could be identified from available sources. The package has releases from 0.0.2 to 0.1.0 (latest as of 2026-02-04), primarily CLI-focused (parse_sdrf validate-sdrf, convert-openms). Repository structure confirms sdrf_pipelines/ directory with sdrf/ and converters/, suggesting Python APIs exist, but no public API docs, changelogs, or version-specific mentions found. Downstream projects should test specific versions (e.g., >=0.0.33 <0.1.0) or pin to latest stable 0.1.0 after verifying API presence via code inspection, as no breaking change evidence exists. Citations:
🏁 Script executed:
#!/bin/bash
# Check sdrf-pipelines versions available on PyPI and GitHub releases
# Try to fetch PyPI JSON API
echo "=== PyPI sdrf-pipelines releases ==="
curl -s "https://pypi.org/pypi/sdrf-pipelines/json" | jq -r '.releases | keys[]' | sort -V
echo -e "\n=== Latest version info ==="
curl -s "https://pypi.org/pypi/sdrf-pipelines/json" | jq '.info | {version, release_url}'
Repository: bigbio/quantms-utils
Length of output: 441
🏁 Script executed:
#!/bin/bash
# Verify the code uses the mentioned sdrf-pipelines APIs
echo "=== Searching for sdrf_pipelines imports in the codebase ==="
rg "from sdrf_pipelines|import sdrf_pipelines" -n
echo -e "\n=== Files that import sdrf_pipelines modules ==="
rg "sdrf_pipelines\." -l | head -20
Repository: bigbio/quantms-utils
Length of output: 433
Bound […] This PR now depends on specific […]
Suggested change
-sdrf-pipelines = ">=0.1.1"
+sdrf-pipelines = ">=0.1.1,<0.2.0"
|
||||||
| pyopenms = ">=3.3.0" | ||||||
| pandas = "*" | ||||||
| pyarrow = ">=16.1.0" | ||||||
|
|
||||||
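The bound suggested in the review of `pyproject.toml` above (`>=0.1.1,<0.2.0`) can be illustrated with a small standard-library sketch. It is not how Poetry or pip resolve versions: the helper names `parse_version` and `in_range` are made up for this example, and the tuple comparison only handles plain dotted releases of equal length, not the full PEP 440 grammar.

```python
def parse_version(text: str) -> tuple:
    """Turn a plain dotted release such as '0.1.1' into (0, 1, 1)."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """True when lower <= version < upper, comparing release tuples."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Candidate releases checked against the suggested bound ">=0.1.1,<0.2.0".
# "0.0.33" is the old exact pin from the diff; the others are illustrative.
for candidate in ["0.0.33", "0.1.0", "0.1.1", "0.1.9", "0.2.0"]:
    print(candidate, in_range(candidate, "0.1.1", "0.2.0"))
```

Under these assumptions, only `0.1.1` and `0.1.9` fall inside the range, so the previous pin `0.0.33` and a future `0.2.0` would both be rejected. For real projects the `packaging` library's `SpecifierSet` is the safer way to evaluate such specifiers.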
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,192 +1,134 @@ | ||
| # nf-core: Update the script to check the sdrf | ||
| # This script is based on the example at: https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv | ||
|
|
||
| import errno | ||
| import logging | ||
| import os | ||
| import sys | ||
|
|
||
| import click | ||
| import pandas as pd | ||
| from sdrf_pipelines.sdrf.sdrf import SdrfDataFrame | ||
| from sdrf_pipelines.sdrf.sdrf_schema import DEFAULT_TEMPLATE, MASS_SPECTROMETRY | ||
|
|
||
| from sdrf_pipelines.sdrf.sdrf import read_sdrf | ||
|
coderabbitai[bot] marked this conversation as resolved.
Comment on lines +6 to +7
Contributor
🧩 Analysis chain
🌐 Web query:
💡 Result: sdrf-pipelines 0.0.33 is a Python package (https://pypi.org/project/sdrf-pipelines/0.0.33/, released Nov 23, 2025) from the bigbio/sdrf-pipelines GitHub repo (https://github.com/bigbio/sdrf-pipelines). It is the official SDRF (Sample and Data Relationship Format) file validator and converter for proteomics/multiomics, translating SDRF files to pipeline configurations (e.g., OpenMS, MaxQuant, MSstats). Primary usage is via the CLI tool parse_sdrf: - validate-sdrf: Validates SDRF structure, formatting, uniqueness, and optionally ontology terms (EFO, CL, MS) using YAML schema templates (default: mass spectrometry). Example: parse_sdrf validate-sdrf --sdrf_file path/to/sdrf.tsv --template human --skip-ontology Simplified: parse_sdrf validate-sdrf-simple path/to/sdrf.tsv --template default No explicit Python API documentation for read_sdrf or validate_sdrf functions found in official sources (PyPI, GitHub README, COMMANDS.md referenced but not detailed). These may be internal functions called by the CLI (e.g., in sdrf_pipelines.sdrf.validators or parsers), accessible after pip install sdrf-pipelines by importing the module, but primarily designed for CLI use. Check GitHub source code (sdrf_pipelines/ directory) for programmatic access. Install with pip install sdrf-pipelines==0.0.33 or sdrf-pipelines[ontology] for full validation. Citations:
🏁 Script executed:
#!/bin/bash
# Verify sdrf-pipelines 0.0.33 API structure by checking GitHub source
# First, check if we can access the repo and look at the module structure at the tag for 0.0.33
# Try to fetch the raw content from GitHub at the specific version
# Check the structure of sdrf_pipelines.sdrf.sdrf module
curl -s "https://raw.githubusercontent.com/bigbio/sdrf-pipelines/0.0.33/sdrf_pipelines/sdrf/sdrf.py" | head -100
echo "---"
# Also check for SDRF class and validate_sdrf method
curl -s "https://raw.githubusercontent.com/bigbio/sdrf-pipelines/0.0.33/sdrf_pipelines/sdrf/sdrf.py" | grep -A 10 "def read_sdrf\|def validate_sdrf\|class.*SDRF" | head -50
Repository: bigbio/quantms-utils
Length of output: 81
🏁 Script executed:
#!/bin/bash
# Check for dependency specifications in the repository
find . -type f \( -name "setup.py" -o -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.cfg" \) | head -20
echo "---"
# Check the actual usage in the check_samplesheet.py file
cat -n quantmsutils/sdrf/check_samplesheet.py | head -50
Repository: bigbio/quantms-utils
Length of output: 1897
🏁 Script executed:
#!/bin/bash
# Check the requirements and pyproject.toml for sdrf-pipelines version
echo "=== requirements.txt ==="
cat requirements.txt | grep -i sdrf
echo ""
echo "=== pyproject.toml ==="
cat pyproject.toml | grep -A 5 -B 5 sdrf
Repository: bigbio/quantms-utils
Length of output: 285
🏁 Script executed:
#!/bin/bash
# Check the actual check_samplesheet.py to see how read_sdrf and validate_sdrf are being used
cat -n quantmsutils/sdrf/check_samplesheet.py
echo ""
echo "=== Checking if there are other files using these imports ==="
rg "read_sdrf|validate_sdrf" --type py
Repository: bigbio/quantms-utils
Length of output: 2914
Fix version conflict: requirements.txt (0.0.33) vs pyproject.toml (>=0.1.1). The […]
|
||
|
|
||
| logging.basicConfig(format="%(asctime)s [%(funcName)s] - %(message)s", level=logging.DEBUG) | ||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def make_dir(path): | ||
| if len(path) > 0: | ||
| try: | ||
| os.makedirs(path) | ||
| except OSError as exception: | ||
| if exception.errno != errno.EEXIST: | ||
| raise exception | ||
|
|
||
|
|
||
| def print_error(error, context="Line", context_str=""): | ||
| error_str = "ERROR: Please check samplesheet -> {}".format(error) | ||
| if context != "" and context_str != "": | ||
| error_str = "ERROR: Please check samplesheet -> {}\n{}: '{}'".format( | ||
| error, context.strip(), context_str.strip() | ||
| ) | ||
| print(error_str) | ||
| sys.exit(1) | ||
| # Minimal columns required to run quantms/quantmsdiann pipelines. | ||
| # These are checked in --minimal mode instead of full schema validation. | ||
| MINIMAL_REQUIRED_COLUMNS = [ | ||
| "source name", | ||
| "assay name", | ||
| "comment[data file]", | ||
| "comment[label]", | ||
| "comment[cleavage agent details]", | ||
| "comment[instrument]", | ||
| "comment[proteomics data acquisition method]", | ||
| "technology type", | ||
| ] | ||
|
|
||
| # Recommended columns: warn if missing but don't fail | ||
| MINIMAL_RECOMMENDED_COLUMNS = [ | ||
| "comment[precursor mass tolerance]", | ||
| "comment[fragment mass tolerance]", | ||
| "comment[dissociation method]", | ||
| "comment[technical replicate]", | ||
| "comment[fraction identifier]", | ||
| ] | ||
|
|
||
|
|
||
| def check_sdrf( | ||
| input_sdrf: str, | ||
| skip_ms_validation: bool = False, | ||
| skip_factor_validation: bool = False, | ||
| skip_experimental_design_validation: bool = False, | ||
| template: str = "ms-proteomics", | ||
| minimal: bool = False, | ||
| use_ols_cache_only: bool = False, | ||
| skip_sdrf_validation: bool = False, | ||
| ): | ||
| """ | ||
| Check the SDRF file for errors. If any errors are found, print them and exit with a non-zero status code. | ||
| @param input_sdrf: Path to the SDRF file to check | ||
| @param skip_ms_validation: Disable the validation of mass spectrometry fields in SDRF (e.g. posttranslational modifications) | ||
| @param skip_factor_validation: Disable the validation of factor values in SDRF | ||
| @param skip_experimental_design_validation: Disable the validation of experimental design | ||
| @param use_ols_cache_only: Use ols cache for validation of the terms and not OLS internet service | ||
| @param skip_sdrf_validation: Disable the validation of SDRF | ||
| """ | ||
| if skip_sdrf_validation: | ||
| print("No SDRF validation was performed.") | ||
| sys.exit(0) | ||
|
|
||
| df = SdrfDataFrame.parse(input_sdrf) | ||
| errors = df.validate(DEFAULT_TEMPLATE, use_ols_cache_only) | ||
|
|
||
| if not skip_ms_validation: | ||
| errors = errors + df.validate(MASS_SPECTROMETRY, use_ols_cache_only) | ||
| Check the SDRF file for errors. | ||
|
|
||
| if not skip_factor_validation: | ||
| errors = errors + df.validate_factor_values() | ||
|
|
||
| if not skip_experimental_design_validation: | ||
| errors = errors + df.validate_experimental_design() | ||
| :param input_sdrf: Path to the SDRF file to check | ||
| :param template: Schema template for full validation (e.g. 'ms-proteomics', 'dia-acquisition') | ||
| :param minimal: Only validate columns required to run the pipeline (skip organism, etc.) | ||
| :param use_ols_cache_only: Use OLS cache instead of live OLS service | ||
| """ | ||
| if minimal: | ||
| errors = _validate_minimal(input_sdrf) | ||
| else: | ||
| df = read_sdrf(input_sdrf) | ||
| errors = df.validate_sdrf( | ||
| template=template, | ||
| use_ols_cache_only=use_ols_cache_only, | ||
| ) | ||
|
|
||
| for error in errors: | ||
| print(error) | ||
|
|
||
| sys.exit(bool(errors)) | ||
|
|
||
|
|
||
| def check_expdesign(expdesign): | ||
| """ | ||
| Check the expdesign file for errors. If any errors are found, print them and exit with a non-zero status code. | ||
| @param expdesign: Path to the expdesign file to check | ||
| """ | ||
| data = pd.read_csv(expdesign, sep="\t", header=0, dtype=str) | ||
| data = data.dropna() | ||
| schema_file = ["Fraction_Group", "Fraction", "Spectra_Filepath", "Label", "Sample"] | ||
| schema_sample = ["Sample", "MSstats_Condition", "MSstats_BioReplicate"] | ||
|
|
||
| # check table format: two table | ||
| with open(expdesign, "r") as f: | ||
| lines = f.readlines() | ||
| try: | ||
| empty_row = lines.index("\n") | ||
| except ValueError: | ||
| print( | ||
| "the one-table format parser is broken in OpenMS2.5, please use one-table or sdrf" | ||
| ) | ||
| sys.exit(1) | ||
|
|
||
| s_table = [i.replace("\n", "").split("\t") for i in lines[empty_row + 1 :]][1:] | ||
| s_header = lines[empty_row + 1].replace("\n", "").split("\t") | ||
| s_data_frame = pd.DataFrame(s_table, columns=s_header) | ||
|
|
||
| # check missed mandatory column | ||
| missed_columns = set(schema_file) - set(data.columns) | ||
| if len(missed_columns) != 0: | ||
| print("{0} column missed".format(" ".join(missed_columns))) | ||
| sys.exit(1) | ||
|
|
||
| missed_columns = set(schema_sample) - set(s_data_frame.columns) | ||
| if len(missed_columns) != 0: | ||
| print("{0} column missed".format(" ".join(missed_columns))) | ||
| sys.exit(1) | ||
| def _validate_minimal(input_sdrf: str) -> list[str]: | ||
| """Validate only the columns required to run the pipeline. | ||
|
|
||
| if len(set(data.Label)) != 1 and "MSstats_Mixture" not in s_data_frame.columns: | ||
| print("MSstats_Mixture column missed in ISO experiments") | ||
| sys.exit(1) | ||
|
|
||
| # check logical problem: may be improved | ||
| check_expdesign_logic(data, s_data_frame) | ||
| Returns a list of error strings. Only missing required columns | ||
| produce errors; missing recommended columns produce warnings (non-blocking). | ||
| """ | ||
| df_header = pd.read_csv(input_sdrf, sep="\t", nrows=0) | ||
| columns_lower = [c.lower() for c in df_header.columns] | ||
| errors = [] | ||
|
|
||
| # Reject header-only files | ||
| df_rows = pd.read_csv(input_sdrf, sep="\t", nrows=1) | ||
| if len(df_rows) == 0: | ||
| errors.append("ERROR: SDRF file contains a header but no data rows.") | ||
| return errors | ||
|
|
||
| # Check required columns (case-insensitive) | ||
| for col in MINIMAL_REQUIRED_COLUMNS: | ||
| if col.lower() not in columns_lower: | ||
| errors.append(f"ERROR: Required column '{col}' is missing from the SDRF file.") | ||
|
|
||
| # Check at least one modification parameters column exists | ||
| has_mod_col = any(c.startswith("comment[modification parameters") for c in columns_lower) | ||
| if not has_mod_col: | ||
| errors.append( | ||
| "ERROR: At least one 'comment[modification parameters]' column is required." | ||
| ) | ||
|
|
||
| # Warn about recommended columns (non-blocking) | ||
| for col in MINIMAL_RECOMMENDED_COLUMNS: | ||
| if col.lower() not in columns_lower: | ||
| logger.warning( | ||
| f"Recommended column '{col}' is missing. Pipeline will use default parameters." | ||
| ) | ||
|
|
||
| def check_expdesign_logic(f_table, s_table): | ||
| fg_ints = f_table["Fraction_Group"].astype(int) | ||
| if fg_ints.max() > fg_ints.nunique(): | ||
| print("Fraction_Group discontinuous!") | ||
| sys.exit(1) | ||
| f_table_d = f_table.drop_duplicates(["Fraction_Group", "Fraction", "Label", "Sample"]) | ||
| if f_table_d.shape[0] < f_table.shape[0]: | ||
| print("Existing duplicate entries in Fraction_Group, Fraction, Label and Sample") | ||
| sys.exit(1) | ||
| if len(set(s_table.Sample)) < s_table.shape[0]: | ||
| print("Existing duplicate Sample in sample table!") | ||
| sys.exit(1) | ||
| return errors | ||
|
|
||
|
|
||
| @click.command( | ||
| "checksamplesheet", | ||
| short_help="Reformat nf-core/quantms sdrf file and check its contents.", | ||
| short_help="Validate an SDRF file for quantms pipelines.", | ||
| ) | ||
| @click.option("--exp_design", help="SDRF/Expdesign file to be validated") | ||
| @click.option("--is_sdrf", help="SDRF file or Expdesign file", is_flag=True) | ||
| @click.option("--skip_sdrf_validation", help="Disable the validation of SDRF", is_flag=True) | ||
| @click.option("--exp_design", help="SDRF file to be validated", required=True) | ||
| @click.option( | ||
| "--skip_ms_validation", | ||
| help="Disable the validation of mass spectrometry fields in SDRF (e.g. posttranslational modifications)", | ||
| is_flag=True, | ||
| "--template", "-t", | ||
| help="Schema template for full validation (e.g. ms-proteomics, dia-acquisition)", | ||
| default="ms-proteomics", | ||
| ) | ||
| @click.option( | ||
| "--skip_factor_validation", | ||
| help="Disable the validation of factor values in SDRF", | ||
| is_flag=True, | ||
| ) | ||
| @click.option( | ||
| "--skip_experimental_design_validation", | ||
| help="Disable the validation of experimental design", | ||
| "--minimal", | ||
| help="Only validate columns required to run the pipeline (skip organism, metadata, etc.)", | ||
| is_flag=True, | ||
| ) | ||
| @click.option( | ||
| "--use_ols_cache_only", | ||
| help="Use ols cache for validation of the terms and not OLS internet service", | ||
| help="Use OLS cache for ontology validation instead of the live OLS service", | ||
| is_flag=True, | ||
| ) | ||
| def checksamplesheet( | ||
| exp_design: str, | ||
| is_sdrf: bool = False, | ||
| skip_sdrf_validation: bool = False, | ||
| skip_ms_validation: bool = False, | ||
| skip_factor_validation: bool = False, | ||
| skip_experimental_design_validation: bool = False, | ||
| template: str = "ms-proteomics", | ||
| minimal: bool = False, | ||
| use_ols_cache_only: bool = False, | ||
| ): | ||
| """ | ||
| Reformat nf-core/quantms sdrf file and check its contents. | ||
| @param exp_design: SDRF/Expdesign file to be validated | ||
| @param is_sdrf: SDRF file or Expdesign file | ||
| @param skip_sdrf_validation: Disable the validation of SDRF | ||
| @param skip_ms_validation: Disable the validation of mass spectrometry fields in SDRF (e.g. posttranslational modifications) | ||
| @param skip_factor_validation: Disable the validation of factor values in SDRF | ||
| @param skip_experimental_design_validation: Disable the validation of experimental design | ||
| @param use_ols_cache_only: Use ols cache for validation of the terms and not OLS internet service | ||
|
|
||
| """ | ||
| # TODO validate expdesign file | ||
| if is_sdrf: | ||
| check_sdrf( | ||
| input_sdrf=exp_design, | ||
| skip_sdrf_validation=skip_sdrf_validation, | ||
| skip_ms_validation=skip_ms_validation, | ||
| skip_factor_validation=skip_factor_validation, | ||
| skip_experimental_design_validation=skip_experimental_design_validation, | ||
| use_ols_cache_only=use_ols_cache_only, | ||
| ) | ||
| else: | ||
| check_expdesign(exp_design) | ||
| """Validate an SDRF file for quantms pipelines.""" | ||
| check_sdrf( | ||
| input_sdrf=exp_design, | ||
| template=template, | ||
| minimal=minimal, | ||
| use_ols_cache_only=use_ols_cache_only, | ||
| ) | ||
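The header checks that the new `--minimal` mode performs in the diff above can be approximated with the standard library alone. The sketch below is not the PR's code: it swaps `pandas.read_csv` for the `csv` module, skips the header-only rejection and the recommended-column warnings, and uses hypothetical names (`read_header`, `minimal_errors`, `SAMPLE`) plus made-up sample values. Only the column list and the case-insensitive matching are taken from the diff.

```python
import csv
import io

# Required columns, copied from MINIMAL_REQUIRED_COLUMNS in the diff above.
REQUIRED = [
    "source name",
    "assay name",
    "comment[data file]",
    "comment[label]",
    "comment[cleavage agent details]",
    "comment[instrument]",
    "comment[proteomics data acquisition method]",
    "technology type",
]


def read_header(sdrf_text: str) -> list:
    """Return the lower-cased column names from the first TSV row."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    return [name.lower() for name in next(reader, [])]


def minimal_errors(sdrf_text: str) -> list:
    """Stdlib approximation of the header checks in _validate_minimal."""
    columns = read_header(sdrf_text)
    errors = [f"missing required column: {col}" for col in REQUIRED if col not in columns]
    # Prefix match, as in the diff, so any modification parameters column counts.
    if not any(c.startswith("comment[modification parameters") for c in columns):
        errors.append("missing a modification parameters column")
    return errors


# Hypothetical SDRF text: mixed-case header, two required columns absent.
SAMPLE = (
    "Source Name\tAssay Name\tcomment[data file]\tcomment[label]\t"
    "comment[instrument]\ttechnology type\tcomment[modification parameters]\n"
    "sample 1\trun 1\trun1.raw\tlabel free sample\tQ Exactive\t"
    "proteomic profiling by mass spectrometry\tNT=Oxidation;MT=Variable;TA=M\n"
)

errors = minimal_errors(SAMPLE)
for error in errors:
    print(error)

# The diff calls sys.exit(bool(errors)); True is the integer 1, so a
# non-empty error list yields exit status 1 and an empty one yields 0.
exit_code = int(bool(errors))
```

With this sample, the two absent columns (`comment[cleavage agent details]` and `comment[proteomics data acquisition method]`) are reported and `exit_code` is 1, mirroring how the command signals failure to the calling pipeline.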
Run the dev container as a non-root user.
The image never switches away from root. That keeps the whole dev session over-privileged and will also leave bind-mounted files owned by root on the host.
🧰 Tools
🪛 Trivy (0.69.3)
[error] 1-1: Image user should not be 'root'
Specify at least 1 USER command in Dockerfile with non-root user as argument
Rule: DS-0002
(IaC/Dockerfile)
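A rough Python approximation of what the finding above checks, applied to the Dockerfile from this PR, might look as follows. This is not Trivy's implementation: `final_user` and `runs_as_root` are hypothetical helpers, the sketch ignores multi-stage builds and any user set by the base image, and the `dev` user name in the usage lines is only an example.

```python
from typing import Optional

# Dockerfile content as added in this PR's diff.
DOCKERFILE = """\
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends git procps libglib2.0-0t64 && rm -rf /var/lib/apt/lists/*
WORKDIR /src
COPY . .
RUN pip install --no-cache-dir .
"""


def final_user(dockerfile_text: str) -> Optional[str]:
    """Return the argument of the last USER instruction, or None if absent."""
    user = None
    for line in dockerfile_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "USER":
            user = parts[1].strip()
    return user


def runs_as_root(dockerfile_text: str) -> bool:
    """Treat a missing USER instruction, 'root', or uid 0 as running as root."""
    user = final_user(dockerfile_text)
    # Assumption of this sketch: no USER instruction means the image runs as root.
    return user is None or user.split(":")[0] in {"root", "0"}


print(runs_as_root(DOCKERFILE))                 # Dockerfile from the diff
print(runs_as_root(DOCKERFILE + "USER dev\n"))  # with a hypothetical non-root user
```

The Dockerfile from the diff has no `USER` instruction, so the sketch flags it. Appending a `USER` line for a non-root account clears the check, though a real fix would also need to create that account in the image.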