Skip to content

feat: ✨ raw data to staging#94

Open
martonvago wants to merge 7 commits into
mainfrom
feat/data-raw-to-staging
Open

feat: ✨ raw data to staging#94
martonvago wants to merge 7 commits into
mainfrom
feat/data-raw-to-staging

Conversation

@martonvago
Copy link
Copy Markdown
Collaborator

@martonvago martonvago commented May 28, 2026

Description

This PR transforms raw data to staged data. For now, this includes special transformations only for VAS.

Merge after #89

Closes #73

This PR needs an in-depth review.

Checklist

  • Ran just run-all

Comment thread scripts/stage_data.py
Comment on lines +10 to +12
VAS_TIME_FIELD_PATTERN = re.compile(
r"^vas_(?P<field_name>.+?)(_fasted)?_(?P<time>minus10|30|60|90|120|180|240)min$"
)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can share this with the metadata transformation

Comment thread scripts/stage_data.py
Comment on lines +19 to +23
if not files:
raise FileNotFoundError(
f"No raw data files found in '{file_path}'. "
"Have you run `just download-data`?"
)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we can just do nothing without throwing an error.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think these types of errors might not be necessary. And might be solved by using some type of "orchestrator", but that's for later. This is fine for now.

Comment thread scripts/stage_data.py
.rename({"redcap_event_name": "event"})
.with_columns(
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use this to write the parquet file then drop it

Comment thread scripts/stage_data.py
Comment on lines +93 to +100
for col in vas_cols:
match = cast(re.Match[str], VAS_TIME_FIELD_PATTERN.match(col))

time = match.group("time")
if time == "minus10":
time = "-10"

cols_grouped_by_time.setdefault(int(time), []).append(col)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rewrite this dictionary construction without a for loop if you want, but I think it will just be longer and more complicated

Comment thread .typos.toml
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done automatically, not sure if it's the right way

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread .gitattributes
@@ -1 +1,2 @@
raw/** filter=lfs diff=lfs merge=lfs -text
staging/** filter=lfs diff=lfs merge=lfs -text
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can track *.parquet

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, is that a question?

Comment thread scripts/stage_data.py
"""Selects columns and adds base columns common to all dataframes."""
return (
raw_df.select(["redcap_event_name"] + cols)
.rename({"redcap_event_name": "event"})
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having looked at more of the data, I don't think event by itself can be the PK. Raised #95

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I was thinking the same.

@martonvago martonvago moved this from Todo to In review in Data development May 28, 2026
@martonvago martonvago marked this pull request as ready for review May 28, 2026 10:24
@martonvago martonvago requested a review from a team as a code owner May 28, 2026 10:24
Comment thread scripts/stage_data.py
Comment on lines +19 to +23
if not files:
raise FileNotFoundError(
f"No raw data files found in '{file_path}'. "
"Have you run `just download-data`?"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think these types of errors might not be necessary. And might be solved by using some type of "orchestrator", but that's for later. This is fine for now.

Comment thread scripts/stage_data.py
"""Selects columns and adds base columns common to all dataframes."""
return (
raw_df.select(["redcap_event_name"] + cols)
.rename({"redcap_event_name": "event"})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I was thinking the same.

Comment thread .gitattributes
@@ -1 +1,2 @@
raw/** filter=lfs diff=lfs merge=lfs -text
staging/** filter=lfs diff=lfs merge=lfs -text
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, is that a question?

Comment thread .typos.toml
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread scripts/stage_data.py
)


def raw_to_staged(raw_df: pl.DataFrame) -> list[pl.DataFrame]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe organize so either all the _fn are above or below the normal functions?

Comment thread scripts/stage_data.py
Comment on lines +52 to +53
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),
# Only used for creating the Parquet files.
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),

Comment thread scripts/stage_data.py
Comment on lines +87 to +90
vas_cols = so.keep(
raw_df.columns,
lambda column: VAS_TIME_FIELD_PATTERN.match(column) is not None,
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better by using Polars rather than a filter. E.g. select() can take a pattern/exclude

Comment thread scripts/stage_data.py
vas_dfs = so.pairwise_fmap(
list(cols_grouped_by_time.items()), [raw_df], _create_df_for_time_group
)
return pl.concat(vas_dfs, how="vertical")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all of this would be better with a pivot https://docs.pola.rs/user-guide/transformations/pivot/

@github-project-automation github-project-automation Bot moved this from In review to In progress in Data development May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Raw data to staging

2 participants