feat: repository analytics & repo populated & repo health score & health score refactor (IN-1054) #3987
gaspergrom wants to merge 7 commits into main
…lth score refactor Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
    sum(failedAssessments) AS failed,
    sum(passedAssessments) AS passed,
    sum(failedAssessments + passedAssessments) AS total,
    round(100 * (passed / total)) AS percentage
Security score can divide by zero
Medium Severity
repo_health_score_security_category computes percentage as round(100 * (passed / total)) without guarding total = 0. When a repo/category has no remaining assessments after filtering, total can be zero and this expression can fail or yield invalid results, breaking repository security scoring.
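A minimal Python sketch of the guard this comment asks for. The function name is hypothetical, and returning "no score" (None) rather than 0 for an empty category is an assumption; which of the two the pipe should emit is a decision for the PR author.

```python
from typing import Optional


def security_percentage(passed: int, failed: int) -> Optional[int]:
    """Return the rounded pass percentage, or None when nothing was assessed."""
    total = passed + failed
    if total == 0:
        # Assumption: an empty category yields no score instead of a
        # division by zero or an invalid value.
        return None
    return round(100 * passed / total)
```

In the pipe itself, the same idea would be a conditional on total around the existing round(...) expression.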
    AND repo != ''
    {% if defined(repoUrl) %}
    AND repo = {{ String(repoUrl, description="Repository URL", required=False) }}
    {% end %}
Excluded repos bypass security endpoint filtering
Low Severity
repo_health_score_security.pipe does not apply the repos_to_channels_excluded check used by other repository health pipes. Queries by repoUrl can return security scores for repositories intentionally excluded from analytics, producing inconsistent behavior across repository health endpoints.
Pull request overview
This PR adds repository-level analytics (populated repo metadata + repo health score metrics and rollup) and extends the existing project insights/health-score pipeline to support both project and repository records, while refactoring project health-score benchmark logic into shared Tinybird includes.
Changes:
- Introduces repository “populated” enrichment copy pipe + datasource and a daily repository health score copy pipe + datasource.
- Adds repository health score metric pipes (dual API/batch modes) and a copy/rollup pipe that computes category percentages and overall score.
- Extends project_insights_copy_ds to include both project and repo records, adds a new combined insights endpoint, and refactors project health score benchmark logic into shared includes/*.incl.
Reviewed changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| services/libs/tinybird/pipes/repositories_populated_copy.pipe | New copy pipe to materialize enriched repository metadata into repositories_populated_ds. |
| services/libs/tinybird/pipes/repo_health_score_stars.pipe | New repo-level stars metric pipe + shared benchmark include. |
| services/libs/tinybird/pipes/repo_health_score_security.pipe | New repo-level security metric pipe. |
| services/libs/tinybird/pipes/repo_health_score_retention.pipe | New repo-level retention metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_pull_requests.pipe | New repo-level PR metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_organization_dependency.pipe | New repo-level org dependency metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_merge_lead_time.pipe | New repo-level merge lead time metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_issues_resolution.pipe | New repo-level issue resolution metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_forks.pipe | New repo-level forks metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_copy.pipe | New daily copy pipe to join repo metric pipes and compute category/overall scores into repo_health_score_copy_ds. |
| services/libs/tinybird/pipes/repo_health_score_contributor_dependency.pipe | New repo-level contributor dependency metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_contributions_outside_work_hours.pipe | New repo-level outside-work-hours metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_active_days.pipe | New repo-level active-days metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_active_contributors.pipe | New repo-level active-contributors metric pipe + shared include. |
| services/libs/tinybird/pipes/project_repo_insights.pipe | New endpoint serving combined project + repo insights from project_insights_copy_ds. |
| services/libs/tinybird/pipes/project_insights.pipe | Updated to filter type = 'project' after project_insights_copy_ds becomes mixed-type. |
| services/libs/tinybird/pipes/project_insights_copy.pipe | Extended copy logic to UNION project records with repo records, sourcing repo base from repositories_populated_ds. |
| services/libs/tinybird/pipes/health_score_stars.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_retention.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_pull_requests.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_organization_dependency.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_merge_lead_time.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_issues_resolution.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_forks.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_contributor_dependency.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_contributions_outside_work_hours.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_active_days.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_active_contributors.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/includes/health_score_stars.incl | New shared benchmark/include logic for stars scoring. |
| services/libs/tinybird/includes/health_score_retention.incl | New shared benchmark/include logic for retention scoring. |
| services/libs/tinybird/includes/health_score_pull_requests.incl | New shared benchmark/include logic for PR scoring. |
| services/libs/tinybird/includes/health_score_organization_dependency.incl | New shared processing + benchmark logic for org dependency scoring. |
| services/libs/tinybird/includes/health_score_merge_lead_time.incl | New shared benchmark/include logic for merge lead time scoring. |
| services/libs/tinybird/includes/health_score_issues_resolution.incl | New shared benchmark/include logic for issue resolution scoring. |
| services/libs/tinybird/includes/health_score_forks.incl | New shared benchmark/include logic for forks scoring. |
| services/libs/tinybird/includes/health_score_contributor_dependency.incl | New shared processing + benchmark logic for contributor dependency scoring. |
| services/libs/tinybird/includes/health_score_contributions_outside_work_hours.incl | New shared benchmark/include logic for outside-work-hours scoring. |
| services/libs/tinybird/includes/health_score_active_days.incl | New shared benchmark/include logic for active-days scoring. |
| services/libs/tinybird/includes/health_score_active_contributors.incl | New shared benchmark/include logic for active-contributors scoring. |
| services/libs/tinybird/datasources/repositories_populated_ds.datasource | New datasource for enriched repository metadata. |
| services/libs/tinybird/datasources/repo_health_score_copy_ds.datasource | New datasource for repository health score rollups and raw metrics/benchmarks. |
| services/libs/tinybird/datasources/project_insights_copy_ds.datasource | Updated schema to include type + repoUrl and adjusted sorting key for mixed-type records. |
    SELECT
    channel,
    uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS contributorCount,
    uniq(CASE WHEN organizationId != '' THEN organizationId ELSE NULL END) AS organizationCount
    FROM activityRelations_deduplicated_cleaned_bucket_union
uniq(CASE ... ELSE NULL END) will still count NULL as a distinct value in ClickHouse aggregate functions, so repositories that have any rows with empty memberId/organizationId can be overcounted by 1. Use uniqIf(memberId, memberId != '') / uniqIf(organizationId, organizationId != '') (or countDistinctIf) to exclude empty IDs without introducing NULL into the aggregation.
    {% if defined(startDate) %}
    AND timestamp
    > {{ DateTime(startDate, description="Filter after date", required=False) }}
    {% end %}
    {% if defined(endDate) %}
    AND timestamp
    < {{ DateTime(endDate, description="Filter before date", required=False) }}
    {% end %}
In single-repo (repoUrl) mode there’s no default previous-quarter time window unless startDate/endDate are provided, but batch mode restricts to the previous quarter. This makes the API endpoint return all-time active contributors by default. Consider applying the same previous-quarter bounds when repoUrl is defined and no dates are passed.
Suggested change:
    - {% if defined(startDate) %}
    - AND timestamp
    - > {{ DateTime(startDate, description="Filter after date", required=False) }}
    - {% end %}
    - {% if defined(endDate) %}
    - AND timestamp
    - < {{ DateTime(endDate, description="Filter before date", required=False) }}
    - {% end %}
    + {% if defined(startDate) or defined(endDate) %}
    + {% if defined(startDate) %}
    + AND timestamp
    + > {{ DateTime(startDate, description="Filter after date", required=False) }}
    + {% end %}
    + {% if defined(endDate) %}
    + AND timestamp
    + < {{ DateTime(endDate, description="Filter before date", required=False) }}
    + {% end %}
    + {% else %}
    + AND timestamp >= toStartOfQuarter(now() - toIntervalQuarter(1))
    + AND timestamp < toStartOfQuarter(now())
    + {% end %}
    AND timestamp
    > {{ DateTime(startDate, description="Filter after date", required=False) }}
    {% end %}
    {% if defined(endDate) %}
    AND timestamp
    < {{ DateTime(endDate, description="Filter before date", required=False) }}
In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode is explicitly “last 365 days”. This makes the API endpoint return all-time results by default. Apply the same 365-day bounds in the repoUrl branch when no explicit dates are passed.
Suggested change:
    - AND timestamp
    - > {{ DateTime(startDate, description="Filter after date", required=False) }}
    - {% end %}
    - {% if defined(endDate) %}
    - AND timestamp
    - < {{ DateTime(endDate, description="Filter before date", required=False) }}
    + AND timestamp > {{ DateTime(startDate, description="Filter after date", required=False) }}
    + {% else %}
    + AND timestamp >= toStartOfDay(now() - toIntervalDay(365))
    + {% end %}
    + {% if defined(endDate) %}
    + AND timestamp < {{ DateTime(endDate, description="Filter before date", required=False) }}
    + {% else %}
    + AND timestamp < toStartOfDay(now())
    WHERE
    (type = 'pull_request-opened' OR type = 'merge_request-opened' OR type = 'changeset-created')
    AND channel = {{ String(repoUrl, description="Repository URL", required=False) }}
    AND channel NOT IN (SELECT channel FROM repos_to_channels_excluded)
    {% if defined(startDate) %}
In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode is “last 365 days”. This makes the API endpoint return all-time PR counts by default. Add a default 365-day filter in the repoUrl branch when no explicit dates are passed.
    WHERE
    memberId != ''
    AND (type, platform) IN (SELECT activityType, platform FROM activityTypes_filtered)
    AND channel = {{ String(repoUrl, description="Repository URL", required=False) }}
    AND channel NOT IN (SELECT channel FROM repos_to_channels_excluded)
In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode restricts to the last 365 days. This makes the API endpoint return all-time dependency metrics by default. Apply the same default window in the repoUrl branch when no explicit dates are passed.
    WHERE
    category NOT IN ('Documentation', 'Vulnerability Management')
    AND repo != ''
    {% if defined(repoUrl) %}
    AND repo = {{ String(repoUrl, description="Repository URL", required=False) }}
This pipe doesn’t exclude repos/channels in repos_to_channels_excluded, unlike the other repo health score metrics. As a result, calling the endpoint with an excluded repoUrl can still return a security score. Consider adding an exclusion filter to match the other health-score pipes’ behavior.
    COALESCE(owh.contributionsOutsideWorkHours, 0) AS contributionsOutsideWorkHours,
    COALESCE(owh.contributionsOutsideWorkHoursBenchmark, 0) AS contributionsOutsideWorkHoursBenchmark,
    COALESCE(sec.securityPercentage, 0) AS securityPercentage
securityPercentage is COALESCE’d to 0 when there is no matching row from repo_health_score_security, which penalizes repos that simply lack security evaluation data (and differs from how other missing metrics are excluded via arrayFilter(... >= 0)). Consider keeping it NULL when absent and handling it explicitly in overallScore.
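To illustrate the difference, here is a small Python sketch of an overall score that skips absent categories instead of counting them as 0. The plain mean and the function name are assumptions for illustration only; the actual overallScore formula in repo_health_score_copy.pipe is not shown in this thread.

```python
from typing import Optional, Sequence


def overall_score(category_percentages: Sequence[Optional[float]]) -> Optional[float]:
    """Average the category percentages that are present; None means 'no data'."""
    present = [p for p in category_percentages if p is not None]
    if not present:
        return None  # nothing to score at all
    return sum(present) / len(present)
```

For a repo with category scores 80 and 60 and no security data, this gives 70, whereas treating the missing security score as 0 pulls the result down to roughly 46.7.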
    CASE WHEN organizationId != '' THEN organizationId ELSE NULL END
    ) AS activeOrganizationsLast365Days
    FROM activityRelations_deduplicated_cleaned_bucket_union
    WHERE timestamp <= now()
The “last 365 days” repo metrics node only filters timestamp <= now() (no lower bound), so it actually counts all historical activity. Add a timestamp >= now() - INTERVAL 365 DAY bound (or equivalent) to match the column name/description.
Suggested change:
    - WHERE timestamp <= now()
    + WHERE timestamp >= now() - INTERVAL 365 DAY
    + AND timestamp <= now()
    CASE WHEN organizationId != '' THEN organizationId ELSE NULL END
    ) AS activeOrganizationsPrevious365Days
    FROM activityRelations_deduplicated_cleaned_bucket_union
    WHERE timestamp < now() - INTERVAL 365 DAY
The “previous 365 days” repo metrics node only filters timestamp < now() - INTERVAL 365 DAY (no lower bound), so it counts all activity older than 365 days rather than the 365–730 day window implied by the column name. Add a lower bound (e.g., timestamp >= now() - INTERVAL 730 DAY).
Suggested change:
    - WHERE timestamp < now() - INTERVAL 365 DAY
    + WHERE timestamp >= now() - INTERVAL 730 DAY
    + AND timestamp < now() - INTERVAL 365 DAY
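The two bounded windows from these suggestions can be checked with a short Python sketch. The function name is hypothetical, and the boundary handling (inclusive lower bounds, "last" window inclusive of now) simply mirrors the suggested SQL.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def count_by_period(timestamps: Iterable[datetime], now: datetime) -> Tuple[int, int]:
    """Count events in the last 365 days and in the 365-730 day window before that."""
    last_start = now - timedelta(days=365)
    previous_start = now - timedelta(days=730)
    last = previous = 0
    for ts in timestamps:
        if last_start <= ts <= now:
            last += 1  # timestamp >= now - 365d AND timestamp <= now
        elif previous_start <= ts < last_start:
            previous += 1  # timestamp >= now - 730d AND timestamp < now - 365d
    return last, previous
```

Anything older than 730 days falls into neither bucket, which is the behavior the column names imply.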
    uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS activeContributorsLast365Days,
    uniq(
    CASE WHEN organizationId != '' THEN organizationId ELSE NULL END
    ) AS activeOrganizationsLast365Days
uniq(CASE ... ELSE NULL END) will count NULL as a distinct value in ClickHouse, so a repo with any rows missing memberId/organizationId can be overcounted by 1. Prefer uniqIf(memberId, memberId != '') / uniqIf(organizationId, organizationId != '') (or countDistinctIf).
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
    pm.starsPrevious365Days AS starsPrevious365Days,
    pm.forksPrevious365Days AS forksPrevious365Days,
    pm.activeContributorsPrevious365Days AS activeContributorsPrevious365Days,
    pm.activeOrganizationsPrevious365Days AS activeOrganizationsPrevious365Days
Stale projects can break insights copy
High Severity
project_insights_copy_project_results selects pm.* metrics without COALESCE, but project_insights_copy_period_metrics only emits rows for segments with activity in the last 730 days. Projects with older/no recent activity get NULL period metrics, which conflicts with non-null UInt64 columns in project_insights_copy_ds and can fail the copy.
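A Python sketch of the defaulting this comment implies, assuming (as the comment states) that segments without recent activity produce no period-metrics row. The function name and the dictionary-based row shape are hypothetical; the column names come from the snippet above.

```python
from typing import Dict, Iterable, List

METRIC_COLUMNS = (
    "starsPrevious365Days",
    "forksPrevious365Days",
    "activeContributorsPrevious365Days",
    "activeOrganizationsPrevious365Days",
)


def attach_period_metrics(
    segment_ids: Iterable[str], period_metrics: Dict[str, Dict[str, int]]
) -> List[Dict[str, object]]:
    """Left-join period metrics onto segments, defaulting missing values to 0."""
    rows = []
    for segment_id in segment_ids:
        metrics = period_metrics.get(segment_id, {})
        row: Dict[str, object] = {"segmentId": segment_id}
        for column in METRIC_COLUMNS:
            value = metrics.get(column)
            row[column] = 0 if value is None else value  # the COALESCE(..., 0) step
        rows.append(row)
    return rows
```

A stale project then gets zeros in every period column instead of a missing value that a non-nullable column would reject.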
    rp.softwareValue AS softwareValue,
    rp.firstCommit AS firstCommit
    FROM repositories_populated_ds AS rp
    JOIN repositories r FINAL ON r.id = rp.id
Repo status checks missing in insights join
Medium Severity
project_insights_copy_repo_base joins repositories by id without checking enabled, excluded, or deletedAt. Because project_insights_copy runs daily while repositories_populated_ds refreshes hourly, stale rows in rp can still be emitted as active repo insights for a full day after a repo is disabled, excluded, or deleted.
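The status check described here can be sketched in Python as a filter applied before the join. The Repository class and helper names are hypothetical, and treating enabled and excluded as booleans is an assumption; the real column types in repositories are not shown in this thread.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Repository:
    id: str
    enabled: bool
    excluded: bool
    deletedAt: Optional[datetime] = None


def active_repo_ids(repositories: Iterable[Repository]) -> Set[str]:
    """IDs of repositories that are enabled, not excluded, and not deleted."""
    return {
        r.id
        for r in repositories
        if r.enabled and not r.excluded and r.deletedAt is None
    }


def filter_populated(
    populated_rows: Iterable[Dict[str, object]], repositories: Iterable[Repository]
) -> List[Dict[str, object]]:
    """Keep only populated rows whose repository is still active."""
    active = active_repo_ids(repositories)
    return [row for row in populated_rows if row["id"] in active]
```

Applied at copy time, this drops stale rows for disabled, excluded, or deleted repos regardless of when the populated datasource was last refreshed.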
    SELECT
    channel,
    uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS contributorCount,
    uniq(CASE WHEN organizationId != '' THEN organizationId ELSE NULL END) AS organizationCount
Null counted as contributor and organization
Medium Severity
repositories_populated_copy_contributor_org_counts uses uniq(CASE ... ELSE NULL END), and NULL is treated as a distinct value in uniq. Repositories with only empty memberId or organizationId can get counts of 1 instead of 0, which propagates incorrect contributorCount and organizationCount into project_insights_copy_ds.
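The intended result (distinct non-empty IDs, and 0 when every ID is empty) is easy to pin down with a Python sketch. The function name is hypothetical. How uniq treats NULL in the deployed ClickHouse version is worth verifying separately before changing the pipe.

```python
from typing import Iterable, Optional


def count_distinct_non_empty(ids: Iterable[Optional[str]]) -> int:
    """Count distinct IDs, ignoring empty strings and missing values."""
    return len({value for value in ids if value})
```

A conditional aggregate such as the uniqIf form suggested in the earlier Copilot comment expresses the same filter without introducing NULL into the aggregation at all.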
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
    WHERE timestamp >= now() - INTERVAL 730 DAY AND timestamp <= now()
    GROUP BY segmentId

    NODE project_insights_copy_previous_365_days_metrics
Project period metrics silently change from all-time to 365-day
Medium Severity
The refactored project_insights_copy_period_metrics node introduces a significant undocumented behavior change for project-level metrics. The original project_insights_copy_last_365_days_metrics used WHERE timestamp <= now() with countIf(type = 'star') — effectively counting all-time stars, forks, and unique contributors/orgs. The new version correctly applies countIf(type = 'star' AND timestamp >= now() - INTERVAL 365 DAY), restricting counts to the actual last 365 days. Similarly, the original previous-period node had no lower time bound (counting all data before 365 days ago), while the new code limits to the 365–730 day window. While the new logic matches the column names, this will cause existing project starsLast365Days, activeContributorsLast365Days, etc. to drop substantially for projects with older data.
Reviewed by Cursor Bugbot for commit ee11f8c.
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 7 total unresolved issues (including 6 from previous reviews).
Reviewed by Cursor Bugbot for commit 50a13b2.
    ELSE 0
    END AS activeContributorsBenchmark
    FROM health_score_active_contributors_score
    INCLUDE "../includes/health_score_active_contributors.incl" "GROUP_COL=segmentId" "SOURCE_NODE=health_score_active_contributors_score"
Score nodes deleted but still referenced by includes
High Severity
The refactoring replaced the entire contents of all 11 project-level health_score_*.pipe files with only the INCLUDE directive, deleting the data-fetching (score) NODEs. The .incl files expand to FROM $SOURCE_NODE (e.g., FROM health_score_active_contributors_score), but those nodes no longer exist anywhere in the codebase. The same issue applies to all new repo_health_score_*.pipe files, which reference nodes like repo_health_score_active_contributors_score that were never defined. This breaks both the project-level health_score_copy.pipe and the new repo_health_score_copy.pipe, since their LEFT JOIN on these pipes will fail when the included nodes try to read from non-existent sources.
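A rough Python model of the failure mode, assuming the include is a text template whose $GROUP_COL and $SOURCE_NODE placeholders are substituted before the pipe is parsed. The include body, helper names, and the node set below are hypothetical; this does not reproduce Tinybird's actual INCLUDE mechanics.

```python
from string import Template
from typing import Dict, Iterable, List

# Hypothetical include body; the real .incl files contain the benchmark logic.
INCLUDE_BODY = Template(
    "SELECT $GROUP_COL, count() AS n FROM $SOURCE_NODE GROUP BY $GROUP_COL"
)


def expand_include(params: Dict[str, str]) -> str:
    """Substitute the include parameters into the template text."""
    return INCLUDE_BODY.substitute(params)


def undefined_sources(
    include_params: Iterable[Dict[str, str]], defined_nodes: Iterable[str]
) -> List[str]:
    """Return SOURCE_NODE values that no pipe defines, sorted for stable output."""
    defined = set(defined_nodes)
    return sorted(
        {p["SOURCE_NODE"] for p in include_params if p["SOURCE_NODE"] not in defined}
    )
```

A check like undefined_sources, run over every INCLUDE line before deploying, would flag a source node that was deleted during the refactor.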
Reviewed by Cursor Bugbot for commit 50a13b2.


Summary
Details
Shared includes (includes/)
Introduced 11 .incl files containing parameterized benchmark/processing logic ($GROUP_COL, $SOURCE_NODE). These are used by both project-level and repo-level health score pipes, ensuring benchmark thresholds (e.g., stars 0-9 → 0, 1000+ → 5) are defined in a single place.
Repository health score pipes (repo_health_score_*.pipe)
12 new pipes, each with dual mode (API and batch).
Metrics: active contributors, contributor dependency, organization dependency, retention, stars, forks, issue resolution, pull requests, merge lead time, active days, contributions outside work hours, security.
Repository health score copy pipe & datasource
Joins the repo metric pipes and computes category percentages and an overall score. Runs daily at 01:50 UTC.
Repository populated data
Runs hourly.
Project insights extension
Project health score refactor
Refactored 11 existing project-level health_score_*.pipe files to replace inline benchmark nodes with INCLUDE directives. No functional change: same columns, same endpoint behavior. health_score_copy.pipe unchanged.
Note
Medium Risk
Medium risk due to new Tinybird datasources/pipes and a schema change to project_insights_copy_ds that can affect downstream queries and backfills, plus new scoring computations that may impact analytics correctness and performance.
Overview
Adds repository-level analytics: a new hourly repositories_populated_ds (via repositories_populated_copy.pipe) and a daily repo_health_score_copy_ds populated by repo_health_score_copy.pipe, which aggregates multiple per-repo metric pipes and computes category percentages plus an overall score.
Extends project_insights_copy_ds to store both project and repo insight records (type, repoUrl), updates project_insights_copy.pipe to UNION ALL repo rows with project rows (and optimizes project period metrics into a single scan), and introduces project_repo_insights.pipe to query the combined dataset while scoping the existing project_insights.pipe to type = 'project'.
Refactors multiple project health_score_*.pipe files to use new shared benchmark includes/*.incl modules, reducing duplicated benchmark logic; scripts/format.sh now formats includes/ too.
Reviewed by Cursor Bugbot for commit 50a13b2. Bugbot is set up for automated code reviews on this repo.