feat: add GO enrichment analysis page for ProteomicsLFQ results #8

Open
hjn0415a wants to merge 2 commits into OpenMS:main from hjn0415a:feature/go-terms
hjn0415a (Contributor) commented Feb 4, 2026

This PR adds a new GO Enrichment Analysis page for ProteomicsLFQ results.
The page allows users to perform GO term enrichment (BP, CC, MF) based on protein-level differential abundance results.

  • Added a new Streamlit results page: results_proteomicslfq.py
  • Integrated GO enrichment analysis using MyGene.info for GO annotation
  • Foreground proteins are selected based on configurable p-value and |log2FC| thresholds
  • Enrichment is computed using Fisher’s exact test
  • Results are visualized as bar plots and tables, separated by GO category (BP / CC / MF)
  • Added mygene as a new dependency
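The foreground selection and enrichment test described above can be sketched without third-party packages. This is an illustration only, not the PR's code: the protein table, GO terms, and thresholds are made up, and `fisher_greater` is a standard-library stand-in for the one-sided test that the PR obtains from `fisher_exact(..., alternative="greater")`.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of seeing `a` or more term-carrying
    proteins in the foreground, with all table margins held fixed.
    """
    n, row1, col1 = a + b + c + d, a + b, a + c
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, row1)


# (id, p-value, log2FC, GO terms): hypothetical values for illustration
proteins = [
    ("P1", 0.001, 2.0, {"glycolysis"}),
    ("P2", 0.010, -1.5, {"glycolysis"}),
    ("P3", 0.020, 1.2, {"glycolysis", "transport"}),
    ("P4", 0.500, 0.1, {"transport"}),
    ("P5", 0.700, 0.3, {"transport"}),
    ("P6", 0.900, -0.2, {"transport"}),
    ("P7", 0.030, 0.2, {"transport"}),   # significant, but small fold change
    ("P8", 0.600, 2.5, {"glycolysis"}),  # large fold change, not significant
]
p_cutoff, fc_cutoff = 0.05, 1.0

background = {pid for pid, _, _, _ in proteins}
# Foreground: p-value below the cutoff AND |log2FC| at or above the cutoff
foreground = {
    pid for pid, p, fc, _ in proteins if p < p_cutoff and abs(fc) >= fc_cutoff
}


def enrichment_p(term):
    """Build the 2x2 table for one GO term and return its one-sided p-value."""
    carriers = {pid for pid, _, _, terms in proteins if term in terms}
    a = len(carriers & foreground)      # foreground, has term
    b = len(foreground) - a             # foreground, lacks term
    c = len(carriers) - a               # rest of background, has term
    d = len(background) - (a + b + c)   # rest of background, lacks term
    return fisher_greater(a, b, c, d)


glycolysis_p = enrichment_p("glycolysis")
transport_p = enrichment_p("transport")
```

With these toy numbers all three foreground proteins carry the "glycolysis" term, so its p-value is much smaller than the one for "transport".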

Summary by CodeRabbit

  • Chores
    • Updated file formatting to ensure proper line endings (no user-facing impact).

coderabbitai bot commented Feb 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fa7af7c4-c7ca-48e9-bcb1-d9071d893697

📥 Commits

Reviewing files that changed from the base of the PR and between 827367e and 87f95fb.

📒 Files selected for processing (2)
  • content/results_proteomicslfq.py
  • requirements.txt
✅ Files skipped from review due to trivial changes (2)
  • requirements.txt
  • content/results_proteomicslfq.py

📝 Walkthrough

This pull request adds trailing newlines to two files: content/results_proteomicslfq.py and requirements.txt. Both files previously lacked newline characters at their end, and this change ensures they terminate with proper newlines as per standard code formatting conventions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| File Formatting: content/results_proteomicslfq.py | Added trailing newline at end of file to comply with standard formatting requirements. |
| Dependency Configuration: requirements.txt | Added trailing newline after statsmodels entry to ensure file terminates with proper newline character. |

Possibly related PRs

Poem

🐰 A newline here, a newline there,
Tidy files floating through the air,
No trailing chaos left behind,
Just properly formatted code, so refined! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The PR title claims to 'add GO enrichment analysis page' but the actual changes only fix newline formatting in two files without implementing any GO analysis functionality. | Align the title with the actual changes, such as 'fix: add trailing newlines to results_proteomicslfq.py and requirements.txt' or create a separate PR for the actual GO enrichment feature. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@content/results_proteomicslfq.py`:
- Around line 68-117: The foreground/background counts are using all proteins
(bg_ids/fg_ids) even if MyGene returned no annotation, so update
run_go_enrichment to first compute annotated_ids = set(res["query"].astype(str))
(or otherwise derive the set of IDs present in the filtered res) and then
replace bg_set and fg_set with their intersections with annotated_ids before
computing N_bg/N_fg and running the Fisher tests; keep building go2bg/go2fg from
res rows as-is so counts and p-values reflect only annotated proteins.
🧹 Nitpick comments (4)
requirements.txt (1)

152-152: Consider pinning mygene for deterministic builds.

requirements.txt is generated by pip-compile, but mygene is unpinned. Align it with the rest of the lockfile by re-running pip-compile or pinning a version to avoid non-reproducible installs.

content/results_proteomicslfq.py (3)

45-50: Wrap the GO enrichment UI in @st.fragment to avoid full reruns.

This keeps slider/button interactions from re-running the entire page. As per coding guidelines, **/*.py: Use @st.fragment decorator for interactive UI updates without full page reloads.

Suggested refactor (skeleton)
+@st.fragment
+def go_enrichment_panel(pivot_df):
     st.subheader("🧬 GO Enrichment Analysis")
     p_cutoff = st.slider(...)
     fc_cutoff = st.slider(...)
     if st.button("Run GO Enrichment"):
         ...
+
+go_enrichment_panel(pivot_df)

55-65: Avoid blind Exception catches to improve debuggability.

Catching broad exceptions hides unexpected failures. Consider narrowing to the likely exceptions (e.g., AttributeError/IndexError in parsing and request-related exceptions around the API call) or re-raise after logging.

Also applies to: 140-141
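A minimal sketch of the narrower catch suggested here. The PR's parsing code at lines 55-65 is not shown in this thread, so the `db|ACCESSION|NAME` identifier format and the helper name `parse_accession` are assumptions made only for illustration.

```python
def parse_accession(protein_id):
    """Return the accession from a 'db|ACCESSION|NAME' string, else None.

    Hypothetical helper: the identifier layout is assumed, not taken from the PR.
    """
    try:
        return protein_id.split("|")[1]
    except (AttributeError, IndexError):
        # AttributeError: the value is not a string (e.g. None or a NaN float)
        # IndexError: the string has no second '|'-separated field
        return None


parsed = [
    parse_accession(value)
    for value in ["sp|P12345|ALBU_HUMAN", "P12345", None, float("nan")]
]
```

Anything other than these two expected failures, such as a typo that raises `NameError`, still propagates instead of being silently swallowed.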


134-137: Use streamlit_plotly_events for interactive Plotly charts.

Right now the chart is displayed but you aren’t capturing interactions. Consider using plotly_events to support click/selection actions. As per coding guidelines, **/*.py: Use Plotly and streamlit_plotly_events for interactive visualizations.

Example integration
+from streamlit_plotly_events import plotly_events
 ...
-                                st.plotly_chart(fig, use_container_width=True)
+                                selected = plotly_events(fig, click_event=True, select_event=True)
+                                st.plotly_chart(fig, use_container_width=True)

Comment on lines +68 to +117
bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
fg_ids = analysis_df[
    (analysis_df["p-value"] < p_cutoff) &
    (analysis_df["log2FC"].abs() >= fc_cutoff)
]["UniProt"].dropna().unique().tolist()

if len(fg_ids) < 3:
    st.warning(f"Not enough significant proteins (p < {p_cutoff}, |log2FC| ≥ {fc_cutoff}). Found: {len(fg_ids)}")
else:
    res_list = mg.querymany(bg_ids, scopes="uniprot", fields="go", as_dataframe=False)
    res = pd.DataFrame(res_list)
    if "notfound" in res.columns:
        res = res[res["notfound"] != True]

    def extract_go_terms(go_data, go_type):
        if not isinstance(go_data, dict) or go_type not in go_data:
            return []
        terms = go_data[go_type]
        if isinstance(terms, dict):
            terms = [terms]
        return list({t.get("term") for t in terms if "term" in t})

    for go_type in ["BP", "CC", "MF"]:
        res[f"{go_type}_terms"] = res["go"].apply(lambda x: extract_go_terms(x, go_type))

    fg_set = set(fg_ids)
    bg_set = set(bg_ids)

    def run_go_enrichment(go_type):
        go2fg = defaultdict(set)
        go2bg = defaultdict(set)
        for _, row in res.iterrows():
            uid = str(row["query"])
            for term in row[f"{go_type}_terms"]:
                go2bg[term].add(uid)
                if uid in fg_set:
                    go2fg[term].add(uid)

        records = []
        N_fg = len(fg_set)
        N_bg = len(bg_set)
        for term, fg_genes in go2fg.items():
            a = len(fg_genes)
            if a == 0:
                continue
            b = N_fg - a
            c = len(go2bg[term]) - a
            d = N_bg - (a + b + c)
            _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
            records.append({"GO_Term": term, "Count": a, "GeneRatio": f"{a}/{N_fg}", "p_value": p})

@coderabbitai coderabbitai bot Feb 4, 2026

⚠️ Potential issue | 🟠 Major

Foreground/background counts include unannotated proteins, biasing Fisher p-values.

N_bg/N_fg are computed from all proteins, even those without GO annotations. This inflates the background and can understate enrichment. Restrict both sets to annotated proteins returned by MyGene before computing Fisher’s exact test.

Proposed fix
-                    bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
+                    bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
                     fg_ids = analysis_df[
                         (analysis_df["p-value"] < p_cutoff) &
                         (analysis_df["log2FC"].abs() >= fc_cutoff)
                     ]["UniProt"].dropna().unique().tolist()
 ...
-                        fg_set = set(fg_ids)
-                        bg_set = set(bg_ids)
+                        annotated_ids = set(res["query"].astype(str))
+                        bg_set = annotated_ids
+                        fg_set = annotated_ids.intersection(map(str, fg_ids))
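To see how the restriction changes the contingency table, here is a small standard-library sketch. The IDs and annotations are invented, "annotated" is simplified to "present in the `annotations` dict" (in the PR it would be the IDs left in `res` after filtering), and `fisher_greater` again stands in for the one-sided `fisher_exact` call.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value (hypergeometric upper tail)."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, row1)


def contingency(term, fg, bg, annotations):
    """Return (a, b, c, d) for one term, mirroring the counts in the PR hunk."""
    carriers = {pid for pid in bg if term in annotations.get(pid, set())}
    a = len(carriers & fg)
    b = len(fg) - a
    c = len(carriers) - a
    d = len(bg) - (a + b + c)
    return a, b, c, d


# Hypothetical data: P7..P10 came back without any annotation
annotations = {
    "P1": {"T"}, "P2": {"T"}, "P3": {"U"},
    "P4": {"T"}, "P5": {"U"}, "P6": {"U"},
}
bg_set = {f"P{i}" for i in range(1, 11)}
fg_set = {"P1", "P2", "P3", "P7"}

# As in the original hunk: every protein counts toward N_fg and N_bg
table_all = contingency("T", fg_set, bg_set, annotations)

# As in the proposed fix: intersect both sets with the annotated IDs first
annotated_ids = set(annotations)
table_annotated = contingency(
    "T", fg_set & annotated_ids, bg_set & annotated_ids, annotations
)

p_all = fisher_greater(*table_all)
p_annotated = fisher_greater(*table_annotated)
```

The two tables, and therefore the two p-values, differ because unannotated proteins only ever land in the "lacks term" cells; which way the p-value moves depends on how those proteins are split between foreground and background.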
🧰 Tools
🪛 Ruff (0.14.14)

[error] 80-80: Avoid inequality comparisons to True; use not res["notfound"]: for false checks

Replace with not res["notfound"]

(E712)


[warning] 91-91: Function definition does not bind loop variable go_type

(B023)
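The B023 warning refers to Python's late binding of closure variables. In the quoted hunk the lambda appears to be consumed by `.apply` within the same loop iteration, so the result would still be correct there, but the pattern breaks as soon as the function is stored and called later. A minimal sketch with invented data:

```python
go_types = ["BP", "CC", "MF"]
record = {"BP": ["a"], "CC": ["b"], "MF": ["c"]}

# Late binding: each lambda looks up go_type when it is called,
# by which time the loop has finished and go_type is "MF".
late = []
for go_type in go_types:
    late.append(lambda rec: rec.get(go_type, []))

# Fix: bind the current value as a default argument at definition time.
bound = []
for go_type in go_types:
    bound.append(lambda rec, go_type=go_type: rec.get(go_type, []))

late_results = [f(record) for f in late]
bound_results = [f(record) for f in bound]
```

Here `late_results` contains the "MF" entry three times, while `bound_results` contains one entry per category.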

Member

@hjn0415a Could you check this?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@t0mdavid-m (Member)

I just noticed that the abundance data is not actually calculated within the workflow. This could cause problems when displaying abundance data if the user changes the labels after running the workflow. Could you please integrate it, together with the GO term annotation, into the execution section of the workflow?

The results pages should then only display preprocessed results. The results should only be influenced by the configuration if the user reruns the workflow.

@t0mdavid-m (Member)

Make sure you use the file manager to write the output to file.
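The separation requested in these comments (compute during workflow execution, only display on the results page) might look roughly like the sketch below. The project's file manager API is not shown in this thread, so plain `pathlib` and JSON are used as a stand-in; the function names, file name, and table layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

RESULT_FILE = "go_enrichment.json"  # hypothetical output name


def run_workflow_step(results_dir, table, p_cutoff, fc_cutoff):
    """Execution side: compute once and persist the result with its settings."""
    significant = [
        row["protein"]
        for row in table
        if row["p"] < p_cutoff and abs(row["log2FC"]) >= fc_cutoff
    ]
    payload = {
        "settings": {"p_cutoff": p_cutoff, "fc_cutoff": fc_cutoff},
        "significant": significant,
    }
    out_path = Path(results_dir) / RESULT_FILE
    out_path.write_text(json.dumps(payload))
    return out_path


def load_results(results_dir):
    """Results-page side: read what the workflow wrote and compute nothing."""
    path = Path(results_dir) / RESULT_FILE
    if not path.exists():
        return None  # workflow has not been run yet
    return json.loads(path.read_text())


table = [
    {"protein": "P1", "p": 0.001, "log2FC": 2.0},
    {"protein": "P2", "p": 0.400, "log2FC": 1.8},
]
with tempfile.TemporaryDirectory() as tmp:
    before = load_results(tmp)
    run_workflow_step(tmp, table, p_cutoff=0.05, fc_cutoff=1.0)
    after = load_results(tmp)
```

Because the settings are stored next to the result, the page can show which thresholds produced the displayed data, and changing the configuration has no effect until the workflow is rerun.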
