
Add data source selection (workflow step 03) #15

Merged
sbfnk merged 6 commits into main from step03 on Feb 26, 2026

Conversation

@sbfnk (Collaborator) commented Feb 17, 2026

Summary

  • Select data sources for each model component
  • Primary: cases.csv, population.csv, activity.csv, hrz_32626.geojson
  • Secondary: prev_culls.csv (impute 77% missing dates), movement.csv (fixed p_mov=0.01)
  • Fixed inputs: mortality ledgers (r=1.0/day)
  • Excluded initially: clc_32626.geojson (continuous spillover risk; revisit if binary HRZ insufficient)
  • Documents pre-processing requirements and parameter decisions

Closes #4

Summary by CodeRabbit

  • Documentation
    • New data source selection guide outlining selection criteria (information content, quality, complexity cost, redundancy), classification of sources (Primary, Secondary, Fixed) and per-source decisions (fields used, derived quantities, missing-data handling, estimated vs fixed).
    • Added preprocessing requirements, a data flow summary, mappings from sources to modelling questions, and notes on movement pathways and HRZ considerations.


coderabbitai Bot commented Feb 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78dd567 and baf49fa.

📒 Files selected for processing (1)
  • hpai-challenge/notes/step03_data_source_selection.md

Walkthrough

Adds a new documentation file that defines criteria and classifications for selecting data sources in the HPAI modelling workflow, maps fields and derived quantities per source, specifies preprocessing and missing‑data handling, and links data sources to modelling questions and parameters to estimate or fix.

Changes

Data Source Selection Documentation
  • hpai-challenge/notes/step03_data_source_selection.md: New documentation file detailing selection criteria (information content, quality, complexity cost, redundancy); classifies sources as Primary, Secondary, and Fixed Inputs; enumerates per-source fields, derived quantities, missing-data handling, and estimation vs fixed decisions; lists preprocessing steps (spatial joins, distance matrices, activity lookups, movement network), data flow summary, and mapping from model questions to required data.

Linked issue / Deliverable
  • Linked issue: #4. Recorded decisions and the deliverable confirm the added document matches the Step 03 objectives (Primary: cases.csv, population.csv, activity.csv, hrz_32626.geojson; Secondary: prev_culls.csv, movement.csv; Fixed inputs: mortality ledgers).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • seabbs-bot
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description check: Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check: Passed. The pull request title accurately summarizes the main change: adding documentation for data source selection as part of workflow step 03.
  • Linked Issues check: Passed. The pull request fulfils all coding requirements from issue #4: documents primary data sources (cases.csv, population.csv, activity.csv, hrz_32626.geojson), secondary sources with handling (prev_culls.csv with imputation, movement.csv with fixed p_mov), fixed inputs (mortality ledgers with r=1.0/day), and excluded sources (clc_32626.geojson).
  • Out of Scope Changes check: Passed. The pull request contains only the single file required by issue #4 (notes/step03_data_source_selection.md), with no extraneous or out-of-scope modifications.
  • Docstring Coverage check: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
hpai-challenge/notes/step03_data_source_selection.md (1)

3-3: Make the process‑DAG alignment explicit.

You reference the process DAG but do not map sources to DAG nodes/edges. Consider adding a short table or bullet list linking each data source to specific DAG components to make the alignment auditable. As per coding guidelines “Review for: … alignment with process DAG requirements.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hpai-challenge/notes/step03_data_source_selection.md` at line 3, The document
mentions the process DAG but doesn't map data sources to DAG nodes/edges; update
step03_data_source_selection.md to explicitly link each listed data source to
the specific process‑DAG component(s) it supports (use a short table or bullet
list) so each source is auditable against DAG nodes/edges and the "Review for:
alignment with process DAG requirements" guideline; include enough detail to
identify the DAG node/edge names and the role of the source (input, output,
metadata) next to each source entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Around line 183-185: The imputation step for prev_cull dates currently assumes
a trigger_confirmation within 1km; add a deterministic fallback when none exists
by specifying which rule to use (e.g., use the nearest confirmed case regardless
of distance, or apply a global median delay anchored to each farm's cull_start
date). Update the "Impute prev_cull dates" section to (1) state the chosen
fallback clearly, (2) define the tie-breaker (nearest by distance then earliest
confirmation), and (3) document how to compute and apply the fallback delay
(e.g., median delay calculated from the 12 complete records) so downstream
preprocessing is unambiguous for prev_cull and trigger_confirmation.
- Around line 137-172: The fenced code block using triple backticks that
contains the ASCII diagram (the block starting with
"┌─────────────────────────────────────────────────────────────────┐" and the
opening "```") lacks a language specifier; update the opening fence from "```"
to include a language (e.g., "```text") so Markdownlint stops flagging it and
consistency is preserved across the document.

---

Nitpick comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Line 3: The document mentions the process DAG but doesn't map data sources to
DAG nodes/edges; update step03_data_source_selection.md to explicitly link each
listed data source to the specific process‑DAG component(s) it supports (use a
short table or bullet list) so each source is auditable against DAG nodes/edges
and the "Review for: alignment with process DAG requirements" guideline; include
enough detail to identify the DAG node/edge names and the role of the source
(input, output, metadata) next to each source entry.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
hpai-challenge/notes/step03_data_source_selection.md (1)

186-186: Minor wording polish for readability.

At Line 186, consider “use the nearest confirmed case … use the earliest confirmation date” for grammatical clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hpai-challenge/notes/step03_data_source_selection.md` at line 186, Edit the
"Fallback" bullet text that currently reads "use nearest confirmed case
regardless of distance; if multiple equidistant, use earliest confirmation date"
to include the definite article "the" in both places for clarity, i.e., change
to "use the nearest confirmed case regardless of distance; if multiple
equidistant, use the earliest confirmation date" within the "Fallback" bullet in
step03_data_source_selection.md.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Line 100: Standardize on a single anchor term and an explicit formula: adopt
"trigger_date" as the anchor for imputation in Option 2, define median_delay =
median(confirmed_date - trigger_date) computed from the 12 complete records, and
state the imputation rule as confirmed_date_imputed = trigger_date +
median_delay; replace any occurrences of "trigger_confirmation + delay" (lines
183-186) with this exact terminology and update the Decision line (Option 2 /
median_delay) so both places use "trigger_date", "median_delay" and the concrete
formula.

---

Nitpick comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Line 186: Edit the "Fallback" bullet text that currently reads "use nearest
confirmed case regardless of distance; if multiple equidistant, use earliest
confirmation date" to include the definite article "the" in both places for
clarity, i.e., change to "use the nearest confirmed case regardless of distance;
if multiple equidistant, use the earliest confirmation date" within the
"Fallback" bullet in step03_data_source_selection.md.
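The imputation rule these comments converge on (confirmed_date_imputed = trigger_date + median_delay, with median_delay taken from the complete records) can be sketched in pandas. This is a minimal illustration, not the project's code; the column names trigger_date and confirmed_date are assumptions about the prev_culls.csv schema:

```python
import pandas as pd

def impute_confirmed_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing confirmed_date values as trigger_date + median_delay.

    median_delay is the median of (confirmed_date - trigger_date) across
    complete records. Column names are assumed, not taken from prev_culls.csv.
    """
    complete = df.dropna(subset=["trigger_date", "confirmed_date"])
    median_delay = (complete["confirmed_date"] - complete["trigger_date"]).median()
    out = df.copy()
    # Only fill rows that have an anchor date but no confirmation date
    missing = out["confirmed_date"].isna() & out["trigger_date"].notna()
    out.loc[missing, "confirmed_date"] = out.loc[missing, "trigger_date"] + median_delay
    return out
```

The spatial fallback (use the nearest confirmed case regardless of distance, ties broken by earliest confirmation date) would additionally need farm coordinates and is omitted from this sketch.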

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c23280 and f557916.

📒 Files selected for processing (1)
  • hpai-challenge/notes/step03_data_source_selection.md

Co-authored-by: sbfnk <sebastian.funk@lshtm.ac.uk>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
hpai-challenge/notes/step03_data_source_selection.md (2)

181-181: Consider a sparse or truncated distance matrix for 9,160 farms.

A full pairwise distance matrix for 9,160 farms is ~84 million entries (~640 MB at float64). For a spatial kernel that decays rapidly, only distances within a cutoff radius are ever non-negligible. Pre-computing and storing a sparse matrix (distances ≤ e.g. 50 km) would substantially reduce both memory footprint and runtime during model fitting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hpai-challenge/notes/step03_data_source_selection.md` at line 181, Replace
the full dense pairwise distance matrix computation for "pairwise distances
between all 9,160 farms" with a sparse/truncated approach: compute distances
only up to a cutoff radius (e.g., 50 km), store results in a sparse format
(indices + distances) rather than a full NxN float64 array, and update any
downstream references to use the sparse distance representation during kernel
evaluations or model fitting so only nonzero entries are iterated over.

107-107: Pandoc citation syntax (@Yoo2021) will not resolve in plain Markdown rendering.

Lines 107 and 208 use @Yoo2021, which is valid only when processed by pandoc/Quarto with a .bib file. If this document is viewed on GitHub or rendered without a bibliography backend, the citation will appear as a raw string. Either add a bibliography entry and a YAML front-matter bibliography: field, or replace with an inline reference (e.g., "Yoo et al. (2021)") and a References section.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hpai-challenge/notes/step03_data_source_selection.md` at line 107, Replace
the pandoc-only citation token "@Yoo2021" in step03_data_source_selection.md
with a form that renders without a bibliography backend: either add a
bibliography entry and YAML front-matter "bibliography:" pointing to a .bib file
so pandoc/Quarto can resolve `@Yoo2021`, or change the inline occurrences of
"@Yoo2021" to a plain-text citation like "Yoo et al. (2021)" and add a
References section listing the full citation; ensure both occurrences are
updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Around line 218-224: Update the Questions → Data Sources table so Q2–Q5
include movement.csv (as it feeds the movement network for the transmission
model) and add mortality_ledgers where the row refers to culling/mortality
simulations; leave Q1 as descriptive only (cases.csv, population.csv). Ensure
the table rows for Q2, Q3, Q4, Q5 explicitly list "All primary + prev_culls +
movement.csv" and add "mortality_ledgers" to rows that simulate or estimate
culling/mortality (Q2–Q5 as appropriate) so the file aligns with the process DAG
requirements.
- Line 210: Change the ambiguous table row "Detection delay distribution" (the
row using date_suspicious → date_confirmed) to a concrete choice: either mark it
as "Fixed" with the chosen parameters documented, or split it into two rows
(e.g., "Mean detection delay — Estimate" and "Delay shape/variance — Fix") so
the preprocessing step and the observation model referenced at line 59 have a
canonical prior; update the table entry and any related text to match the chosen
option (ensure references to date_suspicious and date_confirmed remain
accurate).
- Around line 188-191: The "Movement network" note currently restricts movements
to the broiler_1 → broiler_2 pair without justification; update the text around
that bullet to state the reason (e.g., limited sample size for other
production-type pairs in movement.csv, biological/epidemiological rationale, or
alignment with pre-shipment testing), or explicitly mark the filter as
provisional and add a cross-reference to the issue tracker where the decision
can be revisited; mention "broiler_1 → broiler_2" and "movement.csv" so
reviewers can find the filter.
- Line 186: Update the fallback rule sentence to include the missing determiners
by changing the phrase "use nearest confirmed case regardless of distance; if
multiple equidistant, use earliest confirmation date" to explicitly use "the"
before the superlatives so it reads "use the nearest confirmed case regardless
of distance; if multiple are equidistant, use the earliest confirmation date",
and ensure the pluralization/auxiliary ("are") is added for grammatical
correctness in the second clause.
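Whichever split of the detection delay (estimate the mean, fix the shape/variance) is chosen, the canonical prior would start from the empirical delays between date_suspicious and date_confirmed. A minimal pandas sketch, assuming those columns are already parsed as datetimes:

```python
import pandas as pd

def detection_delay_summary(cases: pd.DataFrame) -> dict:
    """Empirical suspicion-to-confirmation delay in days.

    The mean could be estimated in the model while the shape/variance
    is fixed, matching the split the review suggests.
    """
    delays = (cases["date_confirmed"] - cases["date_suspicious"]).dt.days
    return {"mean_days": float(delays.mean()), "sd_days": float(delays.std())}
```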

---

Nitpick comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Line 181: Replace the full dense pairwise distance matrix computation for
"pairwise distances between all 9,160 farms" with a sparse/truncated approach:
compute distances only up to a cutoff radius (e.g., 50 km), store results in a
sparse format (indices + distances) rather than a full NxN float64 array, and
update any downstream references to use the sparse distance representation
during kernel evaluations or model fitting so only nonzero entries are iterated
over.
- Line 107: Replace the pandoc-only citation token "@Yoo2021" in
step03_data_source_selection.md with a form that renders without a bibliography
backend: either add a bibliography entry and YAML front-matter "bibliography:"
pointing to a .bib file so pandoc/Quarto can resolve `@Yoo2021`, or change the
inline occurrences of "@Yoo2021" to a plain-text citation like "Yoo et al.
(2021)" and add a References section listing the full citation; ensure both
occurrences are updated consistently.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f557916 and 5b26871.

📒 Files selected for processing (1)
  • hpai-challenge/notes/step03_data_source_selection.md

…ement to Q table

Co-authored-by: sbfnk <sebastian.funk@lshtm.ac.uk>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hpai-challenge/notes/step03_data_source_selection.md`:
- Around line 221-224: The Q2–Q5 table rows currently list datasets for
full-model/simulation questions but omit the fixed input mortality_ledgers
(which provides r used in infectiousness per lines referencing r); update the
table rows for Q2, Q3, Q4, and Q5 to explicitly include mortality_ledgers among
the data sources (alongside prev_culls and movement.csv), or alternatively add a
brief table note that all full-model/simulation questions implicitly require
fixed inputs (e.g., mortality_ledgers) so callers of r and infectiousness can be
traced.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b26871 and 78dd567.

📒 Files selected for processing (1)
  • hpai-challenge/notes/step03_data_source_selection.md

Co-authored-by: sbfnk <sebastian.funk@lshtm.ac.uk>
@sbfnk sbfnk merged commit 5ed94ba into main Feb 26, 2026
1 check passed


Linked issue (closed by this PR): HPAI Workflow Step 03: Data Source Selection (#4)

2 participants