Conversation
Select data sources for each model component:

- Primary: cases.csv, population.csv, activity.csv, hrz_32626.geojson
- Secondary: prev_culls.csv (impute missing dates), movement.csv (fixed p_mov)
- Fixed inputs: mortality ledgers (r = 1.0/day)
- Excluded: clc_32626.geojson (revisit if binary HRZ insufficient)

Documents pre-processing requirements and parameter decisions.

Closes #4
No actionable comments were generated in the recent review. 🎉 (Configuration used: `.coderabbit.yaml`; review profile: CHILL; plan: Pro.)
Walkthrough: Adds a new documentation file that defines criteria and classifications for selecting data sources in the HPAI modelling workflow, maps fields and derived quantities per source, specifies preprocessing and missing-data handling, and links data sources to modelling questions and parameters to estimate or fix.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
hpai-challenge/notes/step03_data_source_selection.md (1)
3-3: Make the process‑DAG alignment explicit. You reference the process DAG but do not map sources to DAG nodes/edges. Consider adding a short table or bullet list linking each data source to specific DAG components to make the alignment auditable. As per coding guidelines: “Review for: … alignment with process DAG requirements.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hpai-challenge/notes/step03_data_source_selection.md` at line 3, The document mentions the process DAG but doesn't map data sources to DAG nodes/edges; update step03_data_source_selection.md to explicitly link each listed data source to the specific process‑DAG component(s) it supports (use a short table or bullet list) so each source is auditable against DAG nodes/edges and the "Review for: alignment with process DAG requirements" guideline; include enough detail to identify the DAG node/edge names and the role of the source (input, output, metadata) next to each source entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Around line 183-185: The imputation step for prev_cull dates currently assumes
a trigger_confirmation within 1km; add a deterministic fallback when none exists
by specifying which rule to use (e.g., use the nearest confirmed case regardless
of distance, or apply a global median delay anchored to each farm's cull_start
date). Update the "Impute prev_cull dates" section to (1) state the chosen
fallback clearly, (2) define the tie-breaker (nearest by distance then earliest
confirmation), and (3) document how to compute and apply the fallback delay
(e.g., median delay calculated from the 12 complete records) so downstream
preprocessing is unambiguous for prev_cull and trigger_confirmation.
- Around line 137-172: The fenced code block using triple backticks that
contains the ASCII diagram (the block starting with
"┌─────────────────────────────────────────────────────────────────┐" and the
opening "```") lacks a language specifier; update the opening fence from "```"
to include a language (e.g., "```text") so Markdownlint stops flagging it and
consistency is preserved across the document.
Actionable comments posted: 1
🧹 Nitpick comments (1)
hpai-challenge/notes/step03_data_source_selection.md (1)
186-186: Minor wording polish for readability. At Line 186, consider “use the nearest confirmed case … use the earliest confirmation date” for grammatical clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hpai-challenge/notes/step03_data_source_selection.md` at line 186, Edit the "Fallback" bullet text that currently reads "use nearest confirmed case regardless of distance; if multiple equidistant, use earliest confirmation date" to include the definite article "the" in both places for clarity, i.e., change to "use the nearest confirmed case regardless of distance; if multiple equidistant, use the earliest confirmation date" within the "Fallback" bullet in step03_data_source_selection.md.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Line 100: Standardize on a single anchor term and an explicit formula: adopt
"trigger_date" as the anchor for imputation in Option 2, define median_delay =
median(confirmed_date - trigger_date) computed from the 12 complete records, and
state the imputation rule as confirmed_date_imputed = trigger_date +
median_delay; replace any occurrences of "trigger_confirmation + delay" (lines
183-186) with this exact terminology and update the Decision line (Option 2 /
median_delay) so both places use "trigger_date", "median_delay" and the concrete
formula.
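The Option 2 rule spelled out above (confirmed_date_imputed = trigger_date + median_delay) could be implemented along these lines; the column names are assumptions for illustration, and `median_delay` is computed from the complete records exactly as the comment specifies.

```python
import pandas as pd

def impute_confirmed_dates(df):
    """Impute missing confirmed_date as trigger_date + median_delay,
    where median_delay is the median of (confirmed_date - trigger_date)
    over the complete records (12 in the source data)."""
    complete = df.dropna(subset=["confirmed_date"])
    median_delay = (complete["confirmed_date"] - complete["trigger_date"]).median()
    out = df.copy()
    missing = out["confirmed_date"].isna()
    out.loc[missing, "confirmed_date"] = out.loc[missing, "trigger_date"] + median_delay
    return out, median_delay
```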
Co-authored-by: sbfnk <sebastian.funk@lshtm.ac.uk>
Actionable comments posted: 4
🧹 Nitpick comments (2)
hpai-challenge/notes/step03_data_source_selection.md (2)
181-181: Consider a sparse or truncated distance matrix for 9,160 farms. A full pairwise distance matrix for 9,160 farms is ~84 million entries (~670 MB at float64). For a spatial kernel that decays rapidly, only distances within a cutoff radius are ever non-negligible. Pre-computing and storing a sparse matrix (distances ≤ e.g. 50 km) would substantially reduce both memory footprint and runtime during model fitting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hpai-challenge/notes/step03_data_source_selection.md` at line 181, Replace the full dense pairwise distance matrix computation for "pairwise distances between all 9,160 farms" with a sparse/truncated approach: compute distances only up to a cutoff radius (e.g., 50 km), store results in a sparse format (indices + distances) rather than a full NxN float64 array, and update any downstream references to use the sparse distance representation during kernel evaluations or model fitting so only nonzero entries are iterated over.
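One way to realise the sparse/truncated approach is SciPy's k-d tree, which builds the distance cutoff directly into the query. The coordinates below are synthetic stand-ins for the real farm locations, not data from population.csv.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
# Synthetic projected farm coordinates in metres (stand-in for the
# 9,160 real farm locations); 2,000 points in a 300 km x 300 km box.
coords = rng.uniform(0.0, 300_000.0, size=(2_000, 2))

tree = cKDTree(coords)
# Sparse (DOK) matrix holding only pairwise distances <= 50 km;
# pairs beyond the cutoff are simply absent, so kernel evaluation
# iterates over stored entries only.
sparse_d = tree.sparse_distance_matrix(tree, max_distance=50_000.0)
dense_fraction = sparse_d.nnz / coords.shape[0] ** 2
```

With a rapidly decaying kernel the stored fraction is typically well under 10% of the dense matrix, which is where the memory and runtime savings come from.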
107-107: Pandoc citation syntax (@Yoo2021) will not resolve in plain Markdown rendering. Lines 107 and 208 use `@Yoo2021`, which is valid only when processed by pandoc/Quarto with a `.bib` file. If this document is viewed on GitHub or rendered without a bibliography backend, the citation will appear as a raw string. Either add a bibliography entry and a YAML front-matter `bibliography:` field, or replace with an inline reference (e.g., "Yoo et al. (2021)") and a References section.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hpai-challenge/notes/step03_data_source_selection.md` at line 107, Replace the pandoc-only citation token "@Yoo2021" in step03_data_source_selection.md with a form that renders without a bibliography backend: either add a bibliography entry and YAML front-matter "bibliography:" pointing to a .bib file so pandoc/Quarto can resolve `@Yoo2021`, or change the inline occurrences of "@Yoo2021" to a plain-text citation like "Yoo et al. (2021)" and add a References section listing the full citation; ensure both occurrences are updated consistently.
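If the bibliography route is taken, the YAML front matter would look something like this (`refs.bib` is a placeholder filename, not a file named in the source):

```yaml
---
bibliography: refs.bib
---
```

With this in place, pandoc/Quarto resolves `@Yoo2021` against the matching BibTeX key in `refs.bib`.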
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Around line 218-224: Update the Questions → Data Sources table so Q2–Q5
include movement.csv (as it feeds the movement network for the transmission
model) and add mortality_ledgers where the row refers to culling/mortality
simulations; leave Q1 as descriptive only (cases.csv, population.csv). Ensure
the table rows for Q2, Q3, Q4, Q5 explicitly list "All primary + prev_culls +
movement.csv" and add "mortality_ledgers" to rows that simulate or estimate
culling/mortality (Q2–Q5 as appropriate) so the file aligns with the process DAG
requirements.
- Line 210: Change the ambiguous table row "Detection delay distribution" (the
row using date_suspicious → date_confirmed) to a concrete choice: either mark it
as "Fixed" with the chosen parameters documented, or split it into two rows
(e.g., "Mean detection delay — Estimate" and "Delay shape/variance — Fix") so
the preprocessing step and the observation model referenced at line 59 have a
canonical prior; update the table entry and any related text to match the chosen
option (ensure references to date_suspicious and date_confirmed remain
accurate).
- Around line 188-191: The "Movement network" note currently restricts movements
to the broiler_1 → broiler_2 pair without justification; update the text around
that bullet to state the reason (e.g., limited sample size for other
production-type pairs in movement.csv, biological/epidemiological rationale, or
alignment with pre-shipment testing), or explicitly mark the filter as
provisional and add a cross-reference to the issue tracker where the decision
can be revisited; mention "broiler_1 → broiler_2" and "movement.csv" so
reviewers can find the filter.
- Line 186: Update the fallback rule sentence to include the missing determiners
by changing the phrase "use nearest confirmed case regardless of distance; if
multiple equidistant, use earliest confirmation date" to explicitly use "the"
before the superlatives so it reads "use the nearest confirmed case regardless
of distance; if multiple are equidistant, use the earliest confirmation date",
and ensure the pluralization/auxiliary ("are") is added for grammatical
correctness in the second clause.
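The suggested split for the detection-delay row (“Mean detection delay — Estimate” and “Delay shape/variance — Fix”) could be made concrete with a method-of-moments sketch; the fixed gamma shape of 2.0 is purely illustrative, not a value from the document.

```python
import numpy as np

def detection_delay_prior(delays_days, fixed_shape=2.0):
    """Build a gamma prior for the date_suspicious -> date_confirmed delay.

    The mean is estimated from the observed gaps while the gamma shape
    is held fixed (fixed_shape=2.0 is an illustrative assumption).
    Returns (shape, scale), using gamma mean = shape * scale.
    """
    mean_delay = float(np.mean(delays_days))
    scale = mean_delay / fixed_shape
    return fixed_shape, scale
```

Splitting the row this way gives the observation model a canonical prior: one estimated quantity (the mean) and one documented fixed quantity (the shape).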
…ement to Q table Co-authored-by: sbfnk <sebastian.funk@lshtm.ac.uk>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Around line 221-224: The Q2–Q5 table rows currently list datasets for
full-model/simulation questions but omit the fixed input mortality_ledgers
(which provides r used in infectiousness per lines referencing r); update the
table rows for Q2, Q3, Q4, and Q5 to explicitly include mortality_ledgers among
the data sources (alongside prev_culls and movement.csv), or alternatively add a
brief table note that all full-model/simulation questions implicitly require
fixed inputs (e.g., mortality_ledgers) so callers of r and infectiousness can be
traced.
Co-authored-by: sbfnk <sebastian.funk@lshtm.ac.uk>