feat: :sparkles: raw data to staging by martonvago · Pull Request #94 · onlimit-study/feasibility-data

martonvago · 2026-05-28T09:29:36Z

Description

This PR transforms raw data to staged data. For now, this includes special transformations only for VAS.

Merge after #89

Closes #73

This PR needs an in-depth review.

Checklist

Ran `just run-all`

martonvago · 2026-05-28T09:37:41Z

+VAS_TIME_FIELD_PATTERN = re.compile(
+    r"^vas_(?P<field_name>.+?)(_fasted)?_(?P<time>minus10|30|60|90|120|180|240)min$"
+)


We can share this with the metadata transformation

martonvago · 2026-05-28T09:38:29Z

+    if not files:
+        raise FileNotFoundError(
+            f"No raw data files found in '{file_path}'. "
+            "Have you run `just download-data`?"
+        )


Alternatively, we can just do nothing without throwing an error.
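For comparison, a minimal sketch of the "do nothing" variant, assuming the staging step receives the list of raw files it found. The name `stage_files` and the `.csv` extension are hypothetical and not taken from this PR:

```python
from pathlib import Path


def stage_files(files: list[Path]) -> list[str]:
    """Return the names of the files that would be staged.

    Hypothetical helper: returns early with an empty list, instead of
    raising FileNotFoundError, when there is nothing to stage.
    """
    if not files:
        # Nothing was downloaded yet, so there is nothing to do.
        return []
    return [file.name for file in files]


print(stage_files([]))
print(stage_files([Path("raw/data.csv")]))
```

The trade-off is that a silent return can hide a forgotten `just download-data`, while the error makes that mistake visible.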

martonvago · 2026-05-28T09:44:43Z

+        .rename({"redcap_event_name": "event"})
+        .with_columns(
+            pl.lit("Copenhagen").alias("center"),
+            pl.lit(resource_name).alias("resource_name"),


I'll use this to write the parquet file then drop it

martonvago · 2026-05-28T09:46:34Z

+    for col in vas_cols:
+        match = cast(re.Match[str], VAS_TIME_FIELD_PATTERN.match(col))
+
+        time = match.group("time")
+        if time == "minus10":
+            time = "-10"
+
+        cols_grouped_by_time.setdefault(int(time), []).append(col)


I can rewrite this dictionary construction without a for loop if you want, but I think it will just be longer and more complicated
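To make the grouping easier to follow, here is a small self-contained sketch of what the loop does. The function name `group_cols_by_time` and the sample column names are made up for illustration, and columns that do not match are skipped here instead of using `cast`:

```python
import re

VAS_TIME_FIELD_PATTERN = re.compile(
    r"^vas_(?P<field_name>.+?)(_fasted)?_(?P<time>minus10|30|60|90|120|180|240)min$"
)


def group_cols_by_time(cols: list[str]) -> dict[int, list[str]]:
    """Group VAS column names by their time point in minutes."""
    grouped: dict[int, list[str]] = {}
    for col in cols:
        match = VAS_TIME_FIELD_PATTERN.match(col)
        if match is None:
            # Not a VAS time column, so it is ignored in this sketch.
            continue
        time = match.group("time")
        # "minus10" encodes the measurement taken 10 minutes before.
        minutes = -10 if time == "minus10" else int(time)
        grouped.setdefault(minutes, []).append(col)
    return grouped


example = group_cols_by_time(
    ["vas_hunger_minus10min", "vas_hunger_fasted_30min", "vas_full_30min", "weight"]
)
print(example)
```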

martonvago · 2026-05-28T09:49:29Z

This was done automatically, not sure if it's the right way

martonvago · 2026-05-28T09:49:53Z

@@ -1 +1,2 @@
 raw/** filter=lfs diff=lfs merge=lfs -text
+staging/** filter=lfs diff=lfs merge=lfs -text


Or we can track *.parquet

martonvago · 2026-05-28T10:22:19Z

+    """Selects columns and adds base columns common to all dataframes."""
+    return (
+        raw_df.select(["redcap_event_name"] + cols)
+        .rename({"redcap_event_name": "event"})


Having looked at more of the data, I don't think event by itself can be the PK. Raised #95

lwjohnst86 · 2026-05-28T13:35:55Z

+    if not files:
+        raise FileNotFoundError(
+            f"No raw data files found in '{file_path}'. "
+            "Have you run `just download-data`?"
+        )


Yea, I think these types of errors might not be necessary. And might be solved by using some type of "orchestrator", but that's for later. This is fine for now.

lwjohnst86 · 2026-05-28T13:37:54Z

+    """Selects columns and adds base columns common to all dataframes."""
+    return (
+        raw_df.select(["redcap_event_name"] + cols)
+        .rename({"redcap_event_name": "event"})


Yea, I was thinking the same.

lwjohnst86 · 2026-05-28T13:38:59Z

@@ -1 +1,2 @@
 raw/** filter=lfs diff=lfs merge=lfs -text
+staging/** filter=lfs diff=lfs merge=lfs -text


Not sure, is that a question?

lwjohnst86 · 2026-05-28T13:39:18Z

lwjohnst86 · 2026-05-28T13:47:47Z

+    )
+
+
+def raw_to_staged(raw_df: pl.DataFrame) -> list[pl.DataFrame]:


Maybe organize this so that all the `_fn` (underscore-prefixed) functions are either above or below the normal functions?

lwjohnst86 · 2026-05-28T13:48:27Z

+            pl.lit("Copenhagen").alias("center"),
+            pl.lit(resource_name).alias("resource_name"),


Suggested change

-            pl.lit("Copenhagen").alias("center"),
-            pl.lit(resource_name).alias("resource_name"),
+            # Only used for creating the Parquet files.
+            pl.lit("Copenhagen").alias("center"),
+            pl.lit(resource_name).alias("resource_name"),

lwjohnst86 · 2026-05-28T13:51:51Z

+    vas_cols = so.keep(
+        raw_df.columns,
+        lambda column: VAS_TIME_FIELD_PATTERN.match(column) is not None,
+    )


This would be better done with Polars rather than a filter. E.g. `select()` can take a regex pattern or an exclude.

lwjohnst86 · 2026-05-28T13:53:51Z

+    vas_dfs = so.pairwise_fmap(
+        list(cols_grouped_by_time.items()), [raw_df], _create_df_for_time_group
+    )
+    return pl.concat(vas_dfs, how="vertical")


I think all of this would be better with a pivot https://docs.pola.rs/user-guide/transformations/pivot/

martonvago added 3 commits May 28, 2026 09:19

feat: ✨ stage raw data

c559150

feat: ✨ track staging with git lfs

b4608bb

refactor: ♻️ end early if no raw data

30214ad

martonvago self-assigned this May 28, 2026

add-to-board-token Bot added this to Data development May 28, 2026

github-project-automation Bot moved this to Todo in Data development May 28, 2026

martonvago and others added 2 commits May 28, 2026 10:36

chore: 🔧 add vas to known words

627ee16

Merge branch 'main' into feat/data-raw-to-staging

df75685

martonvago commented May 28, 2026

martonvago added 2 commits May 28, 2026 10:43

refactor: ♻️ add underscore prefix to functions

455a4ac

Merge branch 'feat/data-raw-to-staging' of github.com:onlimit-study/f…

a7eba5f

…easibility-data into feat/data-raw-to-staging

martonvago commented May 28, 2026

martonvago moved this from Todo to In review in Data development May 28, 2026

martonvago marked this pull request as ready for review May 28, 2026 10:24

martonvago requested a review from a team as a code owner May 28, 2026 10:24

lwjohnst86 requested changes May 28, 2026

github-project-automation Bot moved this from In review to In progress in Data development May 28, 2026
