feat: add GO enrichment analysis page for ProteomicsLFQ results #8
hjn0415a wants to merge 2 commits into OpenMS:main
📝 Walkthrough
This pull request adds trailing newlines to two files.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@content/results_proteomicslfq.py`:
- Around line 68-117: The foreground/background counts are using all proteins
(bg_ids/fg_ids) even if MyGene returned no annotation, so update
run_go_enrichment to first compute annotated_ids = set(res["query"].astype(str))
(or otherwise derive the set of IDs present in the filtered res) and then
replace bg_set and fg_set with their intersections with annotated_ids before
computing N_bg/N_fg and running the Fisher tests; keep building go2bg/go2fg from
res rows as-is so counts and p-values reflect only annotated proteins.
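The restriction described in this prompt can be sketched on toy data as follows. This is a minimal illustration, not the PR's code: the helper names `restrict_to_annotated` and `enrichment_p_value` and all protein IDs are hypothetical, and the one-sided Fisher p-value is computed as a hypergeometric tail with the standard library, standing in for the `fisher_exact(..., alternative="greater")` call used in the page.

```python
from math import comb


def restrict_to_annotated(fg_ids, bg_ids, annotated_ids):
    """Keep only IDs for which the annotation lookup returned a record."""
    annotated = {str(i) for i in annotated_ids}
    bg_set = {str(i) for i in bg_ids} & annotated
    fg_set = {str(i) for i in fg_ids} & bg_set
    return fg_set, bg_set


def enrichment_p_value(a, n_fg, k_bg, n_bg):
    """One-sided (greater) Fisher exact p-value as a hypergeometric tail.

    a:    foreground proteins carrying the term
    n_fg: size of the (annotated) foreground
    k_bg: background proteins carrying the term
    n_bg: size of the (annotated) background
    """
    tail = sum(
        comb(k_bg, k) * comb(n_bg - k_bg, n_fg - k)
        for k in range(a, min(n_fg, k_bg) + 1)
    )
    return tail / comb(n_bg, n_fg)


# Toy data (hypothetical): 10 quantified proteins, only 6 returned by the lookup.
bg_ids = [f"P{i}" for i in range(10)]
fg_ids = ["P0", "P1", "P9"]  # P9 has no annotation record
annotated_ids = ["P0", "P1", "P2", "P3", "P4", "P5"]
term_members = {"P0", "P1", "P2"}  # proteins carrying one GO term

fg_set, bg_set = restrict_to_annotated(fg_ids, bg_ids, annotated_ids)
a = len(fg_set & term_members)
p = enrichment_p_value(a, len(fg_set), len(bg_set & term_members), len(bg_set))
```

With these sets, the 2x2 table is built from 2 foreground and 6 background proteins rather than 3 and 10, so the resulting p-value differs from the one the unrestricted counts would give.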
🧹 Nitpick comments (4)
requirements.txt (1)
152-152: Consider pinning `mygene` for deterministic builds.
`requirements.txt` is generated by pip-compile, but `mygene` is unpinned. Align it with the rest of the lockfile by re-running pip-compile or pinning a version to avoid non-reproducible installs.
content/results_proteomicslfq.py (3)
45-50: Wrap the GO enrichment UI in `@st.fragment` to avoid full reruns.
This keeps slider/button interactions from re-running the entire page. As per coding guidelines, `**/*.py`: Use `@st.fragment` decorator for interactive UI updates without full page reloads.
Suggested refactor (skeleton)
+@st.fragment
+def go_enrichment_panel(pivot_df):
     st.subheader("🧬 GO Enrichment Analysis")
     p_cutoff = st.slider(...)
     fc_cutoff = st.slider(...)
     if st.button("Run GO Enrichment"):
         ...
+
+go_enrichment_panel(pivot_df)
55-65: Avoid blind `Exception` catches to improve debuggability.
Catching broad exceptions hides unexpected failures. Consider narrowing to the likely exceptions (e.g., `AttributeError`/`IndexError` in parsing and request-related exceptions around the API call) or re-raise after logging.
Also applies to: 140-141
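A minimal sketch of the narrowing suggested above, assuming the parsing step extracts an accession from pipe-delimited protein IDs such as `sp|P12345|ALBU_HUMAN`. Both the helper name `parse_accession` and that ID format are assumptions for illustration; the page's actual parsing code is not shown in this thread.

```python
import logging

logger = logging.getLogger(__name__)


def parse_accession(protein_id):
    """Return the accession from an 'sp|ACCESSION|NAME' style ID, or None.

    Only the failures expected from malformed input are caught; anything
    else propagates so that unexpected bugs stay visible.
    """
    try:
        return protein_id.split("|")[1]
    except (AttributeError, IndexError):
        # AttributeError: input is not a string (e.g. None or NaN).
        # IndexError: the string has no '|' separated second field.
        logger.warning("Could not parse accession from %r", protein_id)
        return None
```

For example, `parse_accession("sp|P12345|ALBU_HUMAN")` yields the middle field, while `None` or an ID without separators is logged and mapped to `None` instead of being swallowed by a bare `except Exception`.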
134-137: Use `streamlit_plotly_events` for interactive Plotly charts.
Right now the chart is displayed but you aren't capturing interactions. Consider using `plotly_events` to support click/selection actions. As per coding guidelines, `**/*.py`: Use Plotly and streamlit_plotly_events for interactive visualizations.
Example integration
+from streamlit_plotly_events import plotly_events
 ...
- st.plotly_chart(fig, use_container_width=True)
+ selected = plotly_events(fig, click_event=True, select_event=True)
+ st.plotly_chart(fig, use_container_width=True)
content/results_proteomicslfq.py
Outdated
bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
fg_ids = analysis_df[
    (analysis_df["p-value"] < p_cutoff) &
    (analysis_df["log2FC"].abs() >= fc_cutoff)
]["UniProt"].dropna().unique().tolist()

if len(fg_ids) < 3:
    st.warning(f"Not enough significant proteins (p < {p_cutoff}, |log2FC| ≥ {fc_cutoff}). Found: {len(fg_ids)}")
else:
    res_list = mg.querymany(bg_ids, scopes="uniprot", fields="go", as_dataframe=False)
    res = pd.DataFrame(res_list)
    if "notfound" in res.columns:
        res = res[res["notfound"] != True]

    def extract_go_terms(go_data, go_type):
        if not isinstance(go_data, dict) or go_type not in go_data:
            return []
        terms = go_data[go_type]
        if isinstance(terms, dict):
            terms = [terms]
        return list({t.get("term") for t in terms if "term" in t})

    for go_type in ["BP", "CC", "MF"]:
        res[f"{go_type}_terms"] = res["go"].apply(lambda x: extract_go_terms(x, go_type))

    fg_set = set(fg_ids)
    bg_set = set(bg_ids)

    def run_go_enrichment(go_type):
        go2fg = defaultdict(set)
        go2bg = defaultdict(set)
        for _, row in res.iterrows():
            uid = str(row["query"])
            for term in row[f"{go_type}_terms"]:
                go2bg[term].add(uid)
                if uid in fg_set:
                    go2fg[term].add(uid)

        records = []
        N_fg = len(fg_set)
        N_bg = len(bg_set)
        for term, fg_genes in go2fg.items():
            a = len(fg_genes)
            if a == 0:
                continue
            b = N_fg - a
            c = len(go2bg[term]) - a
            d = N_bg - (a + b + c)
            _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
            records.append({"GO_Term": term, "Count": a, "GeneRatio": f"{a}/{N_fg}", "p_value": p})
Foreground/background counts include unannotated proteins, biasing Fisher p-values.
N_bg/N_fg are computed from all proteins, even those without GO annotations. This inflates the background and can understate enrichment. Restrict both sets to annotated proteins returned by MyGene before computing Fisher’s exact test.
Proposed fix
- bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
+ bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
fg_ids = analysis_df[
(analysis_df["p-value"] < p_cutoff) &
(analysis_df["log2FC"].abs() >= fc_cutoff)
]["UniProt"].dropna().unique().tolist()
...
- fg_set = set(fg_ids)
- bg_set = set(bg_ids)
+ annotated_ids = set(res["query"].astype(str))
+ bg_set = annotated_ids
+ fg_set = annotated_ids.intersection(map(str, fg_ids))
🧰 Tools
🪛 Ruff (0.14.14)
[error] 80-80: Avoid inequality comparisons to `True`; use `not res["notfound"]:` for false checks
Replace with `not res["notfound"]`
(E712)
[warning] 91-91: Function definition does not bind loop variable `go_type`
(B023)
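For context on the B023 warning, here is a small standalone sketch of the late-binding behaviour it guards against, with one common remedy (binding the loop variable as a default argument). The function names and the toy record are hypothetical. In the reviewed code the lambda appears to be invoked immediately by `apply` within the same loop iteration, so the warning is probably harmless there, but binding the variable explicitly silences it and is safe either way.

```python
def make_extractors_late_bound(go_types):
    # Each lambda looks up go_type when it is CALLED, so after the
    # comprehension finishes they all see the last value.
    return [lambda record: record.get(go_type, []) for go_type in go_types]


def make_extractors_bound(go_types):
    # The default argument captures the current value of go_type
    # at the moment each lambda is created.
    return [
        lambda record, go_type=go_type: record.get(go_type, [])
        for go_type in go_types
    ]


record = {"BP": ["transport"], "CC": ["membrane"], "MF": ["binding"]}
go_types = ["BP", "CC", "MF"]

late = [f(record) for f in make_extractors_late_bound(go_types)]
bound = [f(record) for f in make_extractors_bound(go_types)]
```

Here `late` contains the `"MF"` terms three times, while `bound` contains one entry per GO category in order.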
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
I just noticed that the abundance data is not actually calculated within the workflow. This could cause problems when displaying abundance data if the user changes the labels after running the workflow. Could you please integrate it, together with the GO-term annotation, into the execution section of the workflow? The results pages should then only display preprocessed results, and the configuration should only influence the results when the user reruns the workflow.
Make sure you use the file manager to write the output to file.
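The split requested in these comments (compute during workflow execution, only display on the results page) could look roughly like the sketch below. It is an illustration under stated assumptions: the function names, the `go_enrichment.json` file name, and the use of plain `pathlib`/`json` are all hypothetical stand-ins, and in the actual change the write should go through the project's file manager, whose API is not shown in this thread.

```python
import json
from pathlib import Path

RESULT_NAME = "go_enrichment.json"  # hypothetical output file name


def write_enrichment_results(records, out_dir):
    """Execution step: persist precomputed enrichment records to disk."""
    out_path = Path(out_dir) / RESULT_NAME
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(records, indent=2))
    return out_path


def load_enrichment_results(out_dir):
    """Results page: read precomputed records; None if the workflow has not run."""
    out_path = Path(out_dir) / RESULT_NAME
    if not out_path.exists():
        return None
    return json.loads(out_path.read_text())
```

With this shape, changing sliders or labels on the results page cannot alter the stored numbers; they only change after the workflow is rerun and the file is rewritten.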
This PR adds a new GO Enrichment Analysis page for ProteomicsLFQ results.
The page allows users to perform GO term enrichment (BP, CC, MF) based on protein-level differential abundance results.