diff --git a/scripts/migrate-candidates-to-v2/README.md b/scripts/migrate-candidates-to-v2/README.md
index f59c54c..c232853 100644
--- a/scripts/migrate-candidates-to-v2/README.md
+++ b/scripts/migrate-candidates-to-v2/README.md
@@ -25,9 +25,8 @@ for the architectural context.
 `BEGIN; UPDATE * N; COMMIT;` — a single transaction that applies all v2
 blobs to the prod DB.
 
-Two migration-time fallbacks are applied to recover currently-dead data
-(neither changes any normalizer behavior; both classes of project are
-unreadable in the prod UI today):
+Three migration-time fallbacks are applied to recover currently-dead data
+(all classes are unreadable in the prod UI today):
 
 - **Sole-algo fallback** (~18 projects). Pre-2026 saves omit the
   per-entry `algorithm` field AND the `results[0].search_meta.algorithm`
@@ -37,6 +36,15 @@ unreadable in the prod UI today):
   `/orgs/WHO/sources/ICD-11-WHO/` (no version in URL) saved before
   `ocl_issues#2522` removed the silent `'HEAD'` default. Use `'HEAD'` as
   the migrate-time version so the projectContext builds.
+- **Orphan-algo result-derived recovery** (`ocl_issues#2555`, 9 projects,
+  731 codes). When an entry's algorithm tag isn't in the project's current
+  `algorithms[]` (reconfigured / mistagged project), the normalizer can't look
+  up a concept_identity, so it derives one from the RESULT itself: if the
+  result's source is the project's target source → anchor to the target
+  canonical+version (formal, visible/recommendable); otherwise key on its own
+  generated source (correctly NON-target, so the UI's target filter keeps it
+  inert rather than mis-surfacing it). See `vendored/normalizers.js`
+  `reference_source: 'result'`.
 
 ## Runbook
 
@@ -44,37 +52,47 @@ The migration is read-only against the source data (dump + normalize)
 until the final `psql -f migrate.sql` step. Everything before that can
 be inspected, diffed, and re-run safely.
 
+> **IMPORTANT — read before running (verified 2026-06-02, see
+> `verification-report.md`):**
+> 1. **`migrate.mjs` is self-contained.** It imports the normalizer from
+>    `./vendored/` (pinned to the validated `315e9b0` version), so
+>    `node migrate.mjs` runs from any checkout — including `main`, where the
+>    v2-only PR (`9f8f4b3`) deleted `normalizeLegacyAllCandidates` from `src/`.
+>    No special checkout or worktree needed.
+> 2. **Dump with `dump-projects.sql`** (not the bare query below) so concept_keys
+>    use the formal source canonical (matching the live app) instead of a
+>    generated `ns.openconceptlab.org` URL. Also apply `fix-algorithm-canonicals.sql`
+>    (`--rw`) so the live app stays consistent on re-run. See `ocl_issues#2555`.
+> 3. **Validate** every run with `node validate.mjs` (0 hard failures expected;
+>    review the orphan-tag drop list — currently ~9 reconfigured/mistagged
+>    projects, separate from this fix).
+> 4. **Retain the backup permanently** (not just for the rollback window). It is
+>    the only recoverable copy of the orphan-tag-dropped raw results.
+> 5. **Never re-dump + re-run `migrate.mjs` after applying** — it throws
+>    `TypeError: (candidates || []) is not iterable` on v2 data. (`migrate.sql`
+>    re-apply is safe/idempotent.)
+
 ### 1. Dry-run locally
 
 Confirms current prod data shape against the live normalizer. Run any
 time without coordination.
 
 ```bash
-# From any workstation that can reach oclapi2 prod through ocl-psql:
-mkdir -p scripts/migrate-candidates-to-v2/input
-
-~/.ocl/bin/ocl-psql oclapi2 -t -A -c "
-  SELECT jsonb_build_object(
-    'id', id,
-    'name', name,
-    'target_repo_url', target_repo_url,
-    'owner_url', COALESCE(
-      '/orgs/' || (SELECT mnemonic FROM organizations WHERE id = organization_id) || '/',
-      '/users/' || (SELECT username FROM user_profiles WHERE id = user_id) || '/'
-    ),
-    'algorithms', algorithms,
-    'candidates', candidates
-  )::text
-  FROM map_projects
-  WHERE candidates IS NOT NULL AND candidates::text NOT IN ('{}', '[]', 'null')
-  ORDER BY id;
-" > scripts/migrate-candidates-to-v2/input/projects.ndjson
-
-node scripts/migrate-candidates-to-v2/migrate.mjs
+cd scripts/migrate-candidates-to-v2
+mkdir -p input
+
+# 1. Dump with resolved formal canonicals (read-only):
+~/.ocl/bin/ocl-psql oclapi2 -t -A -f dump-projects.sql > input/projects.ndjson
+
+# 2. Normalize to v2 (self-contained; runs from any checkout):
+node migrate.mjs
+
+# 3. Independently validate the transformation:
+node validate.mjs
 ```
 
-Inspect `scripts/migrate-candidates-to-v2/output/summary.json` and
-`skipped.json`. Spot-check 3-4 generated `proj-.v2.json` blobs
+Inspect `output/summary.json`, `skipped.json`, and `output/validation-report.json`
+(expect 0 hard failures). Spot-check 3-4 generated `proj-.v2.json` blobs
 against representative live projects:
 
 - One bridge-heavy LOINC (e.g. proj 105, 130, 73 — Jussara / Top-200 / ARUP).
@@ -91,18 +109,20 @@ The migration window (~15-20 min) consists of:
 ```
 T-0:00 Announce maintenance window (Slack/email to active users)
 T-0:01 Block oclmap traffic at nginx (or 503 maintenance page)
-T-0:02 Snapshot the table:
+T-0:02 Snapshot the table (RETAIN PERMANENTLY, not just for rollback):
        pg_dump --table=map_projects --data-only > map_projects.backup.sql
-T-0:03 Re-dump the live candidates (in case a save landed since the dry run):
-       ~/.ocl/bin/ocl-psql oclapi2 -t -A -c "" \
-         > scripts/migrate-candidates-to-v2/input/projects.ndjson
-T-0:04 Run migrate.mjs (regenerates v2 blobs + migrate.sql):
-       node scripts/migrate-candidates-to-v2/migrate.mjs
-T-0:05 Apply the migration (single BEGIN/COMMIT transaction):
-       ~/.ocl/bin/ocl-psql oclapi2 --rw \
-         -f scripts/migrate-candidates-to-v2/output/migrate.sql
-T-0:06 Deploy the v2-only oclmap build to prod
-T-0:15 Smoke test: open 4 representative projects (above) in the UI.
+T-0:03 Fix the live algorithms config (formal canonicals; app re-run consistency):
+       ~/.ocl/bin/ocl-psql oclapi2 --rw -f fix-algorithm-canonicals.sql
+T-0:04 Re-dump the live candidates (in case a save landed since the dry run):
+       ~/.ocl/bin/ocl-psql oclapi2 -t -A -f dump-projects.sql \
+         > input/projects.ndjson
+T-0:05 Normalize to v2 + validate:
+       node migrate.mjs
+       node validate.mjs   # confirm 0 hard failures
+T-0:07 Apply the migration (single BEGIN/COMMIT transaction):
+       ~/.ocl/bin/ocl-psql oclapi2 --rw -f output/migrate.sql
+T-0:08 Deploy the v2-only oclmap build to prod
+T-0:16 Smoke test: open 4 representative projects (above) in the UI.
        Confirm candidates render correctly.
 T-0:18 Unblock traffic
 ```
@@ -123,25 +143,32 @@ If smoke test fails on the v2-only oclmap:
 
 ## Sizes
 
-Verified against the 2026-05-27 prod snapshot:
+Verified against the 2026-06-02 prod snapshot:
 
-- 68 projects with non-empty `candidates`
-- v1 total: ~803 MB
-- v2 total: ~1199 MB (+49% — structural expansion: bridge cascade targets
-  become explicit `bridge_child` Candidate + ConceptDefinition entries)
-- Migration runtime: ~10 sec (Node) + ~30-60 sec (psql apply over the
-  tunnel)
+- **69** projects with non-empty `candidates` (was 68 on 2026-05-27 — the set
+  drifts, so always re-dump immediately before the window)
+- v1 total: ~860 MB · v2 total: ~1.2 GB (+~40% structural expansion)
+- Migration runtime: ~10 sec (Node normalize) + ~30-60 sec (psql apply)
 
 ## Files
 
-- `migrate.mjs` — the migration script (Node 22+, ESM).
+- `migrate.mjs` — the migration script (Node 22+, ESM). Self-contained; runs from any checkout.
+- `vendored/normalizers.js`, `vendored/conceptKey.js` — vendored normalizer (the
+  `315e9b0` version + the migration-only orphan-algo `reference_source: 'result'`
+  recovery, `ocl_issues#2555`), so `migrate.mjs` doesn't depend on `src/` (where the
+  v2-only PR deleted `normalizeLegacyAllCandidates`). Dir is `vendored/` rather than
+  `lib/` because the root `.gitignore` ignores `lib/`.
+- `dump-projects.sql` — the dump query with resolved formal canonicals.
+- `fix-algorithm-canonicals.sql` — one-time `--rw` fix of `algorithms` config so
+  the live app keys concepts on the same formal canonicals as the migrated data.
+- `validate.mjs` — independent v1↔v2 equivalence + referential-integrity validator.
+- `verification-report.md` — the pre-prod verification record (`ocl_issues#2555`).
 - `README.md` — this file.
-- `.gitignore` — excludes `input/`, `output/`, and `*.tmp` from version
-  control.
+- `.gitignore` — excludes `input/`, `output/`, and `*.tmp` from version control.
 
 ## After the migration
 
-This whole folder can be deleted once the migration has run cleanly in
-prod and the v2-only oclmap is stable. The script imports
-`normalizers.js`'s `normalizeLegacyAllCandidates` function, which itself
-is deleted in the v2-only oclmap PR.
+This whole folder (including `vendored/`) can be deleted once the migration
+has run cleanly in prod and the v2-only oclmap is stable. `vendored/normalizers.js`
+is the `315e9b0` normalizer (`normalizeLegacyAllCandidates`, no longer in `src/`)
+plus the migration-only orphan-algo recovery (`reference_source: 'result'`).
diff --git a/scripts/migrate-candidates-to-v2/dump-projects.sql b/scripts/migrate-candidates-to-v2/dump-projects.sql
new file mode 100644
index 0000000..39d258a
--- /dev/null
+++ b/scripts/migrate-candidates-to-v2/dump-projects.sql
@@ -0,0 +1,83 @@
+-- ============================================================================
+-- Migration dump (replaces the inline query in README §1). Produces
+-- input/projects.ndjson for migrate.mjs, with formal source canonicals resolved
+-- so the migrated concept_keys match what the live app generates.
+-- ============================================================================
+-- Self-contained: resolves canonicals inline so it works for dry-runs and the
+-- real run alike, with NO dependency on fix-algorithm-canonicals.sql having run
+-- first. It resolves three things:
+--   (1) target_repo.canonical_url (migrate.mjs keys target_repo concepts on it)
+--   (2) custom algo canonical_url (ICD-11 -> formal, overriding stray variants;
+--       recovers custom algos that would otherwise drop)
+--   (3) bridge algo bridge_repo.canonical_url (from the bridge source)
+--
+-- Rule: ICD-11 (any variant) -> http://id.who.int/icd/release/11/mms;
+--       else source's registered canonical_url (CIEL/LOINC/SNOMED-GPS);
+--       else NULL -> migrate.mjs generates the ns.openconceptlab.org URL (PIH).
+--
+-- (fix-algorithm-canonicals.sql applies the SAME custom/bridge resolution to the
+-- live `algorithms` column so re-runs in the app stay consistent with the
+-- migrated data; target_repo the app fetches live.)
+--
+--   ocl-psql oclapi2 -t -A -f dump-projects.sql > input/projects.ndjson
+-- ============================================================================
+WITH resolved AS (
+  SELECT DISTINCT ON (mp.id)
+    mp.id, mp.name, mp.target_repo_url, mp.algorithms, mp.candidates,
+    mp.organization_id, mp.user_id,
+    CASE
+      WHEN mp.target_repo_url ILIKE '%ICD-11%' THEN 'http://id.who.int/icd/release/11/mms'
+      ELSE s.canonical_url
+    END AS resolved_canonical
+  FROM map_projects mp
+  LEFT JOIN organizations o ON o.mnemonic = split_part(mp.target_repo_url,'/',3)
+  LEFT JOIN user_profiles u ON u.username = split_part(mp.target_repo_url,'/',3)
+  LEFT JOIN sources s ON s.mnemonic = split_part(mp.target_repo_url,'/',5)
+    AND s.version = 'HEAD'
+    AND (s.organization_id = o.id OR s.user_id = u.id)
+  WHERE mp.candidates IS NOT NULL AND mp.candidates::text NOT IN ('{}','[]','null')
+  ORDER BY mp.id
+)
+SELECT jsonb_build_object(
+  'id', id,
+  'name', name,
+  'target_repo_url', target_repo_url,
+  'owner_url', COALESCE(
+    '/orgs/' || (SELECT mnemonic FROM organizations WHERE id = organization_id) || '/',
+    '/users/' || (SELECT username FROM user_profiles WHERE id = user_id) || '/'
+  ),
+  'algorithms', (
+    SELECT jsonb_agg(
+      CASE
+        -- custom: set/override canonical (ICD-11 forced to formal; others filled)
+        WHEN a.elem->>'type' = 'custom' AND resolved_canonical IS NOT NULL
+             AND (COALESCE(a.elem->>'canonical_url','') = '' OR target_repo_url ILIKE '%ICD-11%')
+          THEN a.elem || jsonb_build_object('canonical_url', resolved_canonical)
+        -- bridge: set bridge_repo.canonical_url from the bridge source when missing
+        WHEN a.elem->>'type' IN ('ocl-bridge','ocl-ciel-bridge')
+             AND COALESCE(a.elem->'bridge_repo'->>'canonical_url','') = ''
+          THEN a.elem ||
jsonb_build_object('bridge_repo', jsonb_build_object('canonical_url',
+            COALESCE(
+              (SELECT bs.canonical_url FROM sources bs
+                 JOIN organizations bo ON bo.id = bs.organization_id
+                WHERE bs.mnemonic = split_part(COALESCE(NULLIF(a.elem->>'target_repo_url',''),'/orgs/CIEL/sources/CIEL/'),'/',5)
+                  AND bo.mnemonic = split_part(COALESCE(NULLIF(a.elem->>'target_repo_url',''),'/orgs/CIEL/sources/CIEL/'),'/',3)
+                  AND bs.version='HEAD'
+                LIMIT 1),
+              'https://ns.openconceptlab.org' || COALESCE(NULLIF(a.elem->>'target_repo_url',''),'/orgs/CIEL/sources/CIEL/')
+            )))
+        ELSE a.elem
+      END
+      ORDER BY a.ord
+    )
+    FROM unnest(algorithms) WITH ORDINALITY AS a(elem, ord)
+  ),
+  'candidates', candidates,
+  'target_repo', CASE
+    WHEN resolved_canonical IS NOT NULL
+      THEN jsonb_build_object('canonical_url', resolved_canonical)
+    ELSE NULL
+  END
+)::text
+FROM resolved
+ORDER BY id;
diff --git a/scripts/migrate-candidates-to-v2/fix-algorithm-canonicals.sql b/scripts/migrate-candidates-to-v2/fix-algorithm-canonicals.sql
new file mode 100644
index 0000000..c257d37
--- /dev/null
+++ b/scripts/migrate-candidates-to-v2/fix-algorithm-canonicals.sql
@@ -0,0 +1,73 @@
+-- ============================================================================
+-- PROD PRE-STEP (run BEFORE the v1->v2 candidates migration): canonical fix
+-- ============================================================================
+-- Sets the formal canonical_url on map_projects.algorithms so BOTH the live app
+-- and the migration produce identical, formal concept_keys. Without this, custom
+-- ICD-11 algos drop entirely (no identity) and LOINC/CIEL/ICD-11 concepts split
+-- between the formal canonical and a generated ns.openconceptlab.org URL.
+--
+-- Rules (per @paynejd 2026-06-02):
+--   * ICD-11 (any variant, incl. ICD-11-WHO-Agent) -> http://id.who.int/icd/release/11/mms
+--   * else -> the target/bridge source's registered canonical_url (CIEL, LOINC, SNOMED-GPS)
+--   * else (no registered canonical, e.g.
PIH) -> leave unset -> app/migration generate ns URL
+--
+-- WHY this is needed (not just the migration dump): the live app reads custom
+-- algo canonical from algorithms[].canonical_url and bridge canonical from
+-- algorithms[].bridge_repo.canonical_url (it does NOT fetch the bridge source).
+-- target_repo, by contrast, the app fetches live, so the migration dump resolves
+-- that one (see dump-projects.sql) and no config change is needed for it.
+--
+-- Validated read-only against prod 2026-06-02: 51 custom algos -> formal,
+-- bridge -> CIELterminology.org, built-in types untouched.
+--
+-- SAFETY: single transaction; idempotent (re-running is a no-op on already-fixed
+-- algos). Take the standard map_projects backup first. Apply with:
+--   ocl-psql oclapi2 --rw -f fix-algorithm-canonicals.sql
+-- ============================================================================
+BEGIN;
+
+UPDATE map_projects mp
+SET algorithms = (
+  SELECT array_agg(
+    CASE
+      -- custom algos: force ICD-11 to formal (overrides stray '/icd11/mms');
+      -- fill others from the project target source canonical.
+      WHEN elem->>'type' = 'custom' AND r.resolved IS NOT NULL
+           AND (COALESCE(elem->>'canonical_url','') = '' OR mp.target_repo_url ILIKE '%ICD-11%')
+        THEN elem || jsonb_build_object('canonical_url', r.resolved)
+      -- bridge algos: set bridge_repo.canonical_url from the bridge source when absent.
+      WHEN elem->>'type' IN ('ocl-bridge','ocl-ciel-bridge')
+           AND COALESCE(elem->'bridge_repo'->>'canonical_url','') = ''
+        THEN elem || jsonb_build_object('bridge_repo', jsonb_build_object('canonical_url',
+          COALESCE(
+            (SELECT bs.canonical_url FROM sources bs
+               JOIN organizations bo ON bo.id = bs.organization_id
+              WHERE bs.mnemonic = split_part(COALESCE(NULLIF(elem->>'target_repo_url',''),'/orgs/CIEL/sources/CIEL/'),'/',5)
+                AND bo.mnemonic = split_part(COALESCE(NULLIF(elem->>'target_repo_url',''),'/orgs/CIEL/sources/CIEL/'),'/',3)
+                AND bs.version='HEAD' LIMIT 1),
+            'https://ns.openconceptlab.org' || COALESCE(NULLIF(elem->>'target_repo_url',''),'/orgs/CIEL/sources/CIEL/')
+          )))
+      ELSE elem
+    END ORDER BY ord
+  )
+  FROM unnest(mp.algorithms) WITH ORDINALITY AS t(elem, ord)
+)
+FROM (
+  SELECT mp2.id,
+    CASE WHEN mp2.target_repo_url ILIKE '%ICD-11%'
+      THEN 'http://id.who.int/icd/release/11/mms'
+      ELSE s.canonical_url END AS resolved
+  FROM map_projects mp2
+  LEFT JOIN organizations o ON o.mnemonic = split_part(mp2.target_repo_url,'/',3)
+  LEFT JOIN sources s ON s.mnemonic = split_part(mp2.target_repo_url,'/',5)
+    AND s.version='HEAD' AND s.organization_id = o.id
+) r
+WHERE mp.id = r.id
+  AND EXISTS (
+    SELECT 1 FROM unnest(mp.algorithms) e
+    WHERE e->>'type'='custom'
+       OR (e->>'type' IN ('ocl-bridge','ocl-ciel-bridge')
+           AND COALESCE(e->'bridge_repo'->>'canonical_url','')='')
+  );
+
+COMMIT;
diff --git a/scripts/migrate-candidates-to-v2/migrate.mjs b/scripts/migrate-candidates-to-v2/migrate.mjs
index 97bd295..d3dfd5d 100644
--- a/scripts/migrate-candidates-to-v2/migrate.mjs
+++ b/scripts/migrate-candidates-to-v2/migrate.mjs
@@ -29,9 +29,15 @@ import { createInterface } from 'node:readline'
 import { fileURLToPath } from 'node:url'
 import { dirname, join, resolve } from 'node:path'
 
+// Self-contained: the normalizer is vendored into ./vendored/ (pinned to the
+// 315e9b0 version this migration was written + validated against) so the
+// script runs from any checkout,
including main where the v2-only PR
+// (9f8f4b3) deleted normalizeLegacyAllCandidates from src/. The whole folder
+// is deleted after the one-shot prod migration. (Not ./lib/ — root .gitignore
+// ignores lib/.)
 import {
   normalizeLegacyAllCandidates
-} from '../../src/components/map-projects/normalizers.js'
+} from './vendored/normalizers.js'
 
 const __dirname = dirname(fileURLToPath(import.meta.url))
 const INPUT = resolve(process.env.INPUT || join(__dirname, 'input/projects.ndjson'))
diff --git a/scripts/migrate-candidates-to-v2/validate.mjs b/scripts/migrate-candidates-to-v2/validate.mjs
new file mode 100644
index 0000000..92ebcfe
--- /dev/null
+++ b/scripts/migrate-candidates-to-v2/validate.mjs
@@ -0,0 +1,216 @@
+/* eslint-env node */
+/* eslint-disable no-process-env, no-console */
+
+/**
+ * Independent v1 -> v2 migration equivalence validator.
+ *
+ * Reads the SAME NDJSON input that migrate.mjs consumed plus the v2 blobs it
+ * produced (output/proj-.v2.json), and independently re-derives what the
+ * migration SHOULD have preserved, then diffs against what it DID produce.
+ *
+ * This does NOT trust the normalizer — it recomputes expectations from the raw
+ * v1 data and asserts:
+ *   - code coverage : every concept code from a survivable v1 result appears in v2
+ *   - row coverage : every v1 row with survivable results appears in v2.rows
+ *   - referential integrity within v2 (candidate->def, candidate->algo_response,
+ *     concept_row->def)
+ * and QUANTIFIES the two known silent-drop classes:
+ *   - orphan-tag drop : results tagged with an algorithm id not in the project's
+ *     current algorithms[] config (normalizers.js `if(!algoDef) return`)
+ *   - no-identity drop : results under a configured algo with no derivable
+ *     concept_identity (e.g. type:custom with no canonical_url)
+ *
+ * Read-only. Operates on dry-run artifacts only.
+ *
+ * INPUT ./input/projects.ndjson
+ * OUTDIR ./output (reads proj-.v2.json, writes validation-report.json)
+ */
+
+import { createReadStream } from 'node:fs'
+import { readFile, writeFile } from 'node:fs/promises'
+import { createInterface } from 'node:readline'
+import { fileURLToPath } from 'node:url'
+import { dirname, join, resolve } from 'node:path'
+
+const __dirname = dirname(fileURLToPath(import.meta.url))
+const INPUT = resolve(process.env.INPUT || join(__dirname, 'input/projects.ndjson'))
+const OUTDIR = resolve(process.env.OUTDIR || join(__dirname, 'output'))
+
+// Mirror of migrate.mjs CONCEPT_IDENTITY_BY_TYPE keys (which types have a
+// built-in identity). custom needs a canonical_url to get one.
+const TYPES_WITH_IDENTITY = new Set([
+  'ocl-semantic', 'ocl-search', 'ocl-bridge', 'ocl-ciel-bridge', 'ocl-scispacy'
+])
+const algoHasIdentity = (a) =>
+  !!a && (TYPES_WITH_IDENTITY.has(a.type) || (a.type === 'custom' && !!a.canonical_url))
+
+// Effective tag the migration groups an entry under (mirror of
+// candidatesByAlgo): entry.algorithm || results[0].search_meta.algorithm ||
+// (sole configured algo id, only when the first two are absent).
+const effectiveTag = (entry, soleAlgoId) => {
+  const t = entry?.algorithm || entry?.results?.[0]?.search_meta?.algorithm
+  if(t) return t
+  return soleAlgoId || null
+}
+
+const codeFromConceptKey = (key) => {
+  try { const a = JSON.parse(key); return Array.isArray(a) ? String(a[1]) : null } catch { return null }
+}
+
+const main = async () => {
+  const rl = createInterface({ input: createReadStream(INPUT, { encoding: 'utf8' }), crlfDelay: Infinity })
+  const report = []
+  let hardFail = 0
+
+  for await (const line of rl) {
+    if(!line.trim()) continue
+    const p = JSON.parse(line)
+    const { id, name } = p
+    const candidates = Array.isArray(p.candidates) ? p.candidates : []
+    const algos = Array.isArray(p.algorithms) ?
p.algorithms : []
+    const configuredIds = new Set(algos.map(a => a?.id).filter(Boolean))
+    const algoById = new Map(algos.map(a => [a?.id, a]))
+    const soleAlgoId = algos.length === 1 ? algos[0]?.id : null
+
+    // --- v1 expectations ---
+    let v1Results = 0
+    let survivableResults = 0 // tag in config AND algo has identity
+    let droppedOrphanTag = 0 // tag not in config
+    let droppedNoIdentity = 0 // tag in config but algo lacks identity
+    const expectedCodes = new Set() // codes from survivable results (+ cascade targets)
+    const survivableRows = new Set()
+    const orphanTags = new Map()
+    // Codes that the orphan/no-identity result-derived fallback should recover
+    // (PR3-C / ocl_issues#2555). Verified present in v2 below.
+    const recoverableCodes = new Set()
+    const addCodes = (results) => results.forEach(r => { if(r?.id) recoverableCodes.add(String(r.id)) })
+
+    for(const entry of candidates) {
+      const results = entry?.results || []
+      v1Results += results.length
+      const tag = effectiveTag(entry, soleAlgoId)
+      const idx = entry?.row?.__index
+      if(!tag) { // null tag, multi-algo -> recovered via result-derived fallback
+        droppedOrphanTag += results.length
+        if(results.length) orphanTags.set('', (orphanTags.get('') || 0) + results.length)
+        addCodes(results)
+        continue
+      }
+      if(!configuredIds.has(tag)) {
+        droppedOrphanTag += results.length
+        if(results.length) orphanTags.set(tag, (orphanTags.get(tag) || 0) + results.length)
+        addCodes(results)
+        continue
+      }
+      if(!algoHasIdentity(algoById.get(tag))) {
+        droppedNoIdentity += results.length
+        addCodes(results)
+        continue
+      }
+      // survivable
+      survivableResults += results.length
+      if(idx !== undefined && idx !== null && results.length) survivableRows.add(String(idx))
+      // Only bridge algos expand result.mappings[].cascade_target into their own
+      // concept_defs.
For non-bridge algos, result.mappings are the concept's own
+      // cross-references and are NOT migrated to concept_defs, so harvesting them
+      // here would be a false-positive coverage miss.
+      const isBridge = ['ocl-bridge', 'ocl-ciel-bridge'].includes(algoById.get(tag)?.type)
+      for(const r of results) {
+        if(r?.id) expectedCodes.add(String(r.id))
+        if(isBridge) {
+          for(const m of (r?.mappings || [])) {
+            if(m?.cascade_target_concept_code) expectedCodes.add(String(m.cascade_target_concept_code))
+          }
+        }
+      }
+    }
+
+    // --- v2 actuals ---
+    let v2 = null
+    try { v2 = JSON.parse(await readFile(join(OUTDIR, `proj-${id}.v2.json`), 'utf8')) } catch { /* missing */ }
+    const defs = v2?.concept_definitions || []
+    const defKeys = new Set(defs.map(([k]) => k))
+    const v2Codes = new Set(defs.map(([k]) => codeFromConceptKey(k)).filter(Boolean))
+    const rows = v2?.rows || {}
+    let v2Candidates = 0
+    const refErrors = []
+    for(const [idx, row] of Object.entries(rows)) {
+      const arIds = new Set(Object.keys(row?.algorithm_responses || {}))
+      for(const [cid, c] of Object.entries(row?.candidates || {})) {
+        v2Candidates++
+        if(c?.concept_key && !defKeys.has(c.concept_key)) refErrors.push(`row ${idx} cand ${cid}: concept_key not in defs`)
+        if(c?.algorithm_response_id && !arIds.has(c.algorithm_response_id)) refErrors.push(`row ${idx} cand ${cid}: algorithm_response_id dangling`)
+      }
+      for(const ck of Object.keys(row?.concept_rows || {})) {
+        if(!defKeys.has(ck)) refErrors.push(`row ${idx} concept_row ${ck}: not in defs`)
+      }
+    }
+
+    // --- checks ---
+    const missingCodes = [...expectedCodes].filter(c => !v2Codes.has(c))
+    const missingRows = [...survivableRows].filter(r => !(r in rows))
+    // Orphan/no-identity recovery: every recoverable code should now be in v2.
+    const orphanMissingCodes = [...recoverableCodes].filter(c => !v2Codes.has(c))
+    const orphanRecoveredCount = recoverableCodes.size - orphanMissingCodes.length
+    const flags = []
+    if(missingCodes.length) flags.push(`CODE_COVERAGE_FAIL(${missingCodes.length} missing)`)
+    if(missingRows.length) flags.push(`ROW_COVERAGE_FAIL(${missingRows.length} missing)`)
+    if(refErrors.length) flags.push(`REF_INTEGRITY_FAIL(${refErrors.length})`)
+    if(orphanMissingCodes.length) flags.push(`ORPHAN_UNRECOVERED(${orphanMissingCodes.length} codes)`)
+    // info: orphan/no-identity at-risk results now recovered via fallback
+    if(droppedOrphanTag > 0) flags.push(`ORPHAN_TAG_RECOVERED(${orphanRecoveredCount}/${recoverableCodes.size} codes)`)
+    if(v1Results > 0 && v2Candidates === 0) flags.push('ZERO_OUT_WITH_RESULTS')
+
+    const hard = missingCodes.length > 0 || missingRows.length > 0 || refErrors.length > 0 || orphanMissingCodes.length > 0
+    if(hard) hardFail++
+
+    report.push({
+      id, name,
+      v1_results: v1Results,
+      survivable_results: survivableResults,
+      dropped_orphan_tag: droppedOrphanTag,
+      dropped_no_identity: droppedNoIdentity,
+      orphan_tags: Object.fromEntries(orphanTags),
+      configured_algo_ids: [...configuredIds],
+      v2_candidates: v2Candidates,
+      v2_concept_defs: defs.length,
+      v2_rows: Object.keys(rows).length,
+      missing_codes: missingCodes.length,
+      missing_rows: missingRows.length,
+      orphan_recoverable_codes: recoverableCodes.size,
+      orphan_recovered_codes: orphanRecoveredCount,
+      orphan_unrecovered_codes: orphanMissingCodes.length,
+      ref_errors: refErrors.slice(0, 5),
+      hard_fail: hard,
+      flags
+    })
+  }
+
+  await writeFile(join(OUTDIR, 'validation-report.json'), JSON.stringify(report, null, 2))
+
+  // ---- console summary ----
+  const sum = (f) => report.reduce((s, r) => s + (r[f] || 0), 0)
+  console.log(`\n=== VALIDATION SUMMARY (${report.length} projects) ===`)
+  console.log(`HARD FAILURES (coverage / referential integrity): ${hardFail}`)
+  console.log(`Total v1 results:
${sum('v1_results')}`)
+  console.log(` survivable (matched algo, expected in v2): ${sum('survivable_results')}`)
+  console.log(` at-risk - orphan tag (not in current algorithms[]): ${sum('dropped_orphan_tag')}`)
+  console.log(` at-risk - no concept_identity (custom, no canonical): ${sum('dropped_no_identity')}`)
+  console.log(`\nOrphan/no-identity RECOVERY (result-derived fallback):`)
+  console.log(` recoverable codes: ${sum('orphan_recoverable_codes')}`)
+  console.log(` recovered in v2: ${sum('orphan_recovered_codes')}`)
+  console.log(` STILL MISSING: ${sum('orphan_unrecovered_codes')}`)
+  const orphanProjs = report.filter(r => r.dropped_orphan_tag > 0 || r.dropped_no_identity > 0)
+  console.log(`\n-- at-risk projects (recovered codes / recoverable) --`)
+  for(const r of orphanProjs.sort((a, b) => b.orphan_recoverable_codes - a.orphan_recoverable_codes)) {
+    console.log(` id=${r.id} recovered=${r.orphan_recovered_codes}/${r.orphan_recoverable_codes} missing=${r.orphan_unrecovered_codes} v2_cands=${r.v2_candidates} tags=${JSON.stringify(r.orphan_tags)} | ${r.name.slice(0, 30)}`)
+  }
+  if(hardFail) {
+    console.log(`\n-- HARD FAILURES --`)
+    for(const r of report.filter(x => x.hard_fail))
+      console.log(` id=${r.id} missing_codes=${r.missing_codes} missing_rows=${r.missing_rows} ref_errors=${r.ref_errors.length} | ${r.name.slice(0, 32)}`)
+  }
+  console.log(`\nReport: ${join(OUTDIR, 'validation-report.json')}`)
+}
+
+main().catch(e => { console.error('FATAL', e); process.exit(1) })
diff --git a/scripts/migrate-candidates-to-v2/vendored/conceptKey.js b/scripts/migrate-candidates-to-v2/vendored/conceptKey.js
new file mode 100644
index 0000000..2f01a15
--- /dev/null
+++ b/scripts/migrate-candidates-to-v2/vendored/conceptKey.js
@@ -0,0 +1,54 @@
+/**
+ * Helpers for the opaque internal pool key used to index ConceptDefinitions
+ * and ConceptRows by canonical concept identity.
+ *
+ * See plans/unified-mapper-model.md (the "key" and FHIR alignment notes
+ * section).
This file is the only place that knows the on-the-wire format
+ * of the key. Components must use makeConceptKey / parseConceptKey rather
+ * than constructing or splitting strings themselves.
+ *
+ * Why opaque: a key like `${url}|${code}` would collide with FHIR canonical
+ * URL syntax (where `|` is the version separator: `http://loinc.org|2.74`).
+ * Using JSON.stringify of an array sidesteps this and lets us extend with
+ * version transparently.
+ */
+
+/**
+ * @typedef {Object} ConceptReference
+ * @property {string} url - canonical URL of the code system
+ * @property {string} code - the concept code
+ * @property {string} [version] - optional code system version
+ */
+
+/**
+ * Produce the internal pool key for a ConceptReference.
+ * @param {ConceptReference} reference
+ * @returns {string}
+ */
+export const makeConceptKey = (reference) => {
+  if (!reference || typeof reference.url !== 'string' || typeof reference.code !== 'string') {
+    throw new TypeError('makeConceptKey requires a ConceptReference with url and code')
+  }
+  return JSON.stringify([reference.url, reference.code, reference.version ?? null])
+}
+
+/**
+ * Recover the ConceptReference from an internal pool key.
+ * @param {string} key
+ * @returns {ConceptReference}
+ */
+export const parseConceptKey = (key) => {
+  const [url, code, version] = JSON.parse(key)
+  return version === null ? { url, code } : { url, code, version }
+}
+
+/**
+ * Compare two references for canonical equality (ignoring undefined version).
+ * @param {ConceptReference} a
+ * @param {ConceptReference} b
+ * @returns {boolean}
+ */
+export const referencesEqual = (a, b) => {
+  if (!a || !b) return false
+  return a.url === b.url && a.code === b.code && (a.version ?? null) === (b.version ??
null)
+}
diff --git a/scripts/migrate-candidates-to-v2/vendored/normalizers.js b/scripts/migrate-candidates-to-v2/vendored/normalizers.js
new file mode 100644
index 0000000..08b1661
--- /dev/null
+++ b/scripts/migrate-candidates-to-v2/vendored/normalizers.js
@@ -0,0 +1,754 @@
+/* eslint-disable no-undef */
+/**
+ * Pure normalization functions for the unified mapper data model.
+ * See plans/unified-mapper-model.md for the architectural rationale.
+ *
+ * Splits the legacy flat candidate-shaped object returned by match algorithms
+ * into four entities:
+ *   - AlgorithmResponse: raw algorithm output (preserved verbatim)
+ *   - Candidate: a claim that a concept matches a row, per algorithm
+ *   - ConceptDefinition: project-wide canonical concept data (lives in conceptCache)
+ *   - ConceptRow: per-row presence of a concept (rerank_score lives here)
+ *
+ * Concept identity is canonical: ConceptReference = {url, code, version?}, where
+ * `url` is the canonical URL of the code system (NOT the OCL relative URL of an
+ * instance). Each algorithm declares how to extract the reference from its
+ * response via a `concept_identity` config block; the normalizer resolves
+ * `reference_source: 'target_repo' | 'bridge_repo' | 'fixed'` against the
+ * project context.
+ *
+ * These functions intentionally have no React or lodash dependencies — they
+ * are unit-testable in isolation against captured algorithm fixtures.
+ */
+
+import { makeConceptKey } from './conceptKey.js'
+
+const newId = () => {
+  if (typeof crypto !== 'undefined' && typeof crypto.randomUUID === 'function') {
+    return crypto.randomUUID()
+  }
+  return `id-${Date.now()}-${Math.random().toString(36).slice(2)}`
+}
+
+/**
+ * Create an AlgorithmResponse entity wrapping the raw algorithm output.
+ * Preserves the response untouched so debug/audit views can render it later.
+ */ +export const createAlgorithmResponse = (rawResponse, algorithmId, options = {}) => { + const { status = 'success', error, rowIndex } = options + return { + id: newId(), + algorithm_id: algorithmId, + row_index: rowIndex, + raw: rawResponse, + received_at: new Date().toISOString(), + status, + ...(error ? { error } : {}) + } +} + +/** + * Decide a concept's lookup_status based on which fields the algorithm response + * populated. 'full' means the response carries enough data for the UI: + * - `property` field present (the verbose-payload marker — OCL's + * ConceptDetailSerializer always emits it, even as an empty array), OR + * - a populated `names` array (the list-serializer signal — names are + * enough to render the row's display). + * Many concepts (LOINC especially) have no separate `descriptions`, so + * requiring descriptions for 'full' stranded verbose-loaded concepts at + * 'partial' and made the UI's summary chips disappear. We don't check + * `extras` because scispacy synthesizes that field with internal metadata + * (LOINC_NUM, composite_score) — not the OCL schema property dict. + */ +const inferLookupStatus = (result) => { + if (!result) return 'pending' + const hasNames = Array.isArray(result.names) && result.names.length > 0 + const hasVerbosePayload = Array.isArray(result.property) + || (Array.isArray(result.properties) && result.properties.length > 0) + if (result.id && result.display_name && (hasVerbosePayload || hasNames)) return 'full' + if (result.id && result.display_name) return 'partial' + return 'pending' +} + +/** + * Resolve a ConceptReference from a result given an identity config and project context. + * + * @param {Object} result One algorithm result (concept-shaped). + * @param {Object} identityConfig The algorithm's concept_identity config block. + * Shape: { reference_source, code_field, canonical_url? } + * @param {Object} projectContext Project-level canonical context. 
+ * Shape: { target_repo: {canonical_url, version?}, + * bridge_repo?: {canonical_url, version?} } + * @returns {ConceptReference | null} + */ +// Orphan-algo recovery (PR3-C / ocl_issues#2555): extract the OCL source path +// (/owner-type/owner/sources/mnemonic/) and the source mnemonic from an OCL +// url, host-agnostic (handles both relative and absolute api.* urls). +const sourcePathOf = (u) => { + if (!u) return null + const parts = String(u).replace(/^https?:\/\/[^/]+/, '').split('/').filter(Boolean) + const i = parts.indexOf('sources') + return (i >= 0 && parts.length > i + 1) ? '/' + parts.slice(0, i + 2).join('/') + '/' : null +} +const sourceMnemonicOf = (u) => { + const p = sourcePathOf(u) + return p ? p.split('/').filter(Boolean).pop() : null +} + +const resolveReference = (result, identityConfig, projectContext) => { + if (!result || !identityConfig) return null + const codeField = identityConfig.code_field || 'id' + const code = result[codeField] + if (!code) return null + + let url + let version + switch (identityConfig.reference_source) { + case 'fixed': + url = identityConfig.canonical_url + // When a fixed-canonical algo's URL matches the project's target_repo, + // it's targeting the same repo as the target_repo-pathway algos and must + // key on the same (url, code, version) for dedup. Without this, e.g. + // scispacy LOINC (fixed) splits from ocl-search/ocl-semantic LOINC + // (target_repo) and from ocl-bridge cascade target (target_repo via + // cascade) under any pinned version. The mapping project pins ONE + // explicit target repo version; concept identity for dedup must be + // anchored to that pin across all algorithm paths. 
+ if (url && url === projectContext?.target_repo?.canonical_url) { + version = projectContext.target_repo.version + } + break + case 'target_repo': + url = projectContext?.target_repo?.canonical_url + version = projectContext?.target_repo?.version + break + case 'bridge_repo': + url = projectContext?.bridge_repo?.canonical_url + version = projectContext?.bridge_repo?.version + break + case 'result': { + // Orphan-algo recovery: the algorithm that produced this result is no + // longer in the project config, so there's no concept_identity to look + // up. Derive the concept's home from the RESULT itself. If it lives in + // the project's target source, anchor to the target canonical+version + // (formal — so it dedups and renders/recommends like any target concept); + // otherwise key on its own (generated) source so it stays correctly + // NON-target (the UI's target-repo filter keeps it inert instead of + // mis-surfacing a non-target concept as a mapping candidate). + const targetSrc = sourcePathOf(projectContext?.target_repo?.relative_url) + const targetMnem = sourceMnemonicOf(projectContext?.target_repo?.relative_url) + const resultSrc = sourcePathOf(result.url || result.ocl_url) + if (resultSrc && targetSrc && resultSrc === targetSrc) { + url = projectContext.target_repo.canonical_url + version = projectContext.target_repo.version + } else if (resultSrc) { + url = `https://ns.openconceptlab.org${resultSrc}` + } else if (result.source && targetMnem && result.source === targetMnem) { + // stub (no url) whose source mnemonic matches the project target source + url = projectContext.target_repo.canonical_url + version = projectContext.target_repo.version + } + break + } + default: + return null + } + + if (!url) return null + return version ? { url, code, version } : { url, code } +} + +/** + * Build a ConceptDefinition from a concept-shaped result + a resolved reference. 
+ */ +const toConceptDefinition = (result, reference, identityConfig, { algorithmId } = {}) => { + if (!reference) return null + const oclUrlField = identityConfig?.ocl_url_field + const oclUrl = oclUrlField ? result?.[oclUrlField] : undefined + return { + reference, + key: makeConceptKey(reference), + ocl_url: oclUrl, + id: result?.id, + display_name: result?.display_name, + source: result?.source, + owner: result?.owner, + names: result?.names, + descriptions: result?.descriptions, + concept_class: result?.concept_class, + datatype: result?.datatype, + retired: result?.retired, + properties: result?.properties, + // `property` (singular) is the schema-specific property dict OCL returns + // for sources like LOINC (COMPONENT/PROPERTY/TIME_ASPCT/etc.) sourced from + // ConceptDetailSerializer.property = JSONField(source='properties'). + // ConceptSummaryProperties.jsx reads this field directly. Without it the + // verbose payload's schema chips never reach the UI even when $match + // returns them. + property: result?.property, + extras: result?.extras, + lookup_status: inferLookupStatus(result), + lookup_source_type: 'algorithm', + lookup_source: algorithmId + } +} + +/** + * Build a stub ConceptDefinition for a bridge cascade target. Bridge responses + * only carry name+code+url for cascade targets, so the resulting definition is + * always 'pending' — ensureLoaded will fill it in. + * + * @param {Object} mapping A cascade mapping entry from the bridge response. + * @param {Object} cascadeIdentity The algo's concept_identity.cascade_target config. + * @param {Object} projectContext + */ +const cascadeTargetToConceptDefinition = (mapping, cascadeIdentity, projectContext) => { + if (!mapping || !cascadeIdentity) return null + // Build a synthetic "result" shape so resolveReference can run uniformly. 
+ const syntheticResult = { + [cascadeIdentity.code_field || 'cascade_target_concept_code']: + mapping[cascadeIdentity.code_field || 'cascade_target_concept_code'], + ...mapping + } + const reference = resolveReference(syntheticResult, cascadeIdentity, projectContext) + if (!reference) return null + + const oclUrlField = cascadeIdentity.ocl_url_field + const oclUrl = oclUrlField ? mapping?.[oclUrlField] : undefined + + return { + reference, + key: makeConceptKey(reference), + ocl_url: oclUrl, + id: reference.code, + display_name: mapping.cascade_target_concept_name, + source: mapping.cascade_target_source_name, + owner: undefined, + names: undefined, + descriptions: undefined, + concept_class: undefined, + datatype: undefined, + retired: undefined, + properties: undefined, + property: undefined, + extras: undefined, + lookup_status: 'pending', + lookup_source_type: undefined, + lookup_source: undefined + } +} + +// rerank_score on ConceptRow comes from search_normalized_score ONLY when +// the caller signals it came from a reranker-true single-algo $match path +// — that's the path where the server-side scores ARE the unified rerank. +// In all other paths (multi-algo, bridge, scispacy, custom) the response's +// search_normalized_score is just the per-algo native score (e.g. FAISS +// similarity × 100), which the OCL server emits unconditionally. Treating +// that as a unified rerank score yields chip values like "100.00%" for +// every top semantic candidate until the debounced $rerank/ pipeline runs +// and overwrites it — misleading during the interim window. +const newConceptRow = (conceptKey, result, trustServerRerank = false) => ({ + concept_key: conceptKey, + rerank_score: trustServerRerank && typeof result?.search_meta?.search_normalized_score === 'number' + ? 
result.search_meta.search_normalized_score + : undefined +}) + +const isBridgeResult = (identityConfig) => + identityConfig?.reference_source === 'bridge_repo' && Boolean(identityConfig.cascade_target) + +/** + * Normalize a single algorithm result into the new entity model. + * + * @param {Object} result One result object (concept-shaped). + * @param {Object} ctx Normalization context. + * @param {string} ctx.algorithmId The algorithm ID (e.g. 'ocl-search'). + * @param {Object} ctx.algorithmConfig The algorithm definition with concept_identity. + * @param {string} ctx.algorithmResponseId FK back to the AlgorithmResponse. + * @param {Object} ctx.projectContext {target_repo, bridge_repo?, namespace} + * @returns {{candidates: Array, concept_definitions: Array, concept_rows: Array}} + */ +export const normalizeAlgoResult = (result, ctx = {}) => { + const empty = { candidates: [], concept_definitions: [], concept_rows: [] } + if (!result) return empty + + const { algorithmId, algorithmConfig, algorithmResponseId, projectContext, trustServerRerank } = ctx + const identityConfig = algorithmConfig?.concept_identity + if (!identityConfig) return empty + + const meta = result.search_meta || {} + const isBridge = isBridgeResult(identityConfig) + + // Resolve the primary reference (the result's own concept). + const primaryReference = resolveReference(result, identityConfig, projectContext) + if (!primaryReference) return empty + + const candidates = [] + const conceptDefinitions = [] + const conceptRows = [] + + const primaryDef = toConceptDefinition(result, primaryReference, identityConfig, { algorithmId }) + conceptDefinitions.push(primaryDef) + conceptRows.push(newConceptRow(primaryDef.key, result, trustServerRerank)) + + const primaryCandidate = { + id: newId(), + algorithm_response_id: algorithmResponseId, + algorithm_id: algorithmId, + concept_key: primaryDef.key, + type: isBridge ? 
'bridge' : 'standard', + score: meta.search_score, + highlights: meta.search_highlight + } + candidates.push(primaryCandidate) + + // For bridge results, fan out one bridge_child candidate per cascade mapping. + if (isBridge && Array.isArray(result.mappings)) { + result.mappings.forEach(mapping => { + const targetDef = cascadeTargetToConceptDefinition( + mapping, + identityConfig.cascade_target, + projectContext + ) + if (!targetDef) return + + // Avoid duplicate ConceptDefinition entries within the same result. + if (!conceptDefinitions.some(cd => cd.key === targetDef.key)) { + conceptDefinitions.push(targetDef) + // Cascade target's row carries no rerank_score yet — the bridge + // response doesn't score targets, so the debounced rerank pass + // will fill it once the row is eligible. + conceptRows.push(newConceptRow(targetDef.key)) + } + + candidates.push({ + id: newId(), + algorithm_response_id: algorithmResponseId, + algorithm_id: algorithmId, + concept_key: targetDef.key, + type: 'bridge_child', + score: undefined, + highlights: undefined, + bridge_concept_key: primaryDef.key, + parent_candidate_id: primaryCandidate.id, + map_type: mapping.map_type + }) + }) + } + + return { + candidates, + concept_definitions: conceptDefinitions, + concept_rows: conceptRows + } +} + +/** + * Higher-level orchestration: normalize a full algorithm invocation for one row. + * Takes the wrapped `{row, results}` payload and produces the AlgorithmResponse + + * flattened entity arrays, with intra-payload dedup. + * + * @param {Object} rawPayload The {row, results} envelope. + * @param {Object} ctx + * @param {string} ctx.algorithmId + * @param {Object} ctx.algorithmConfig + * @param {Object} ctx.projectContext + * @param {number} ctx.rowIndex + * @param {string} [ctx.status='success'] + * @param {string} [ctx.error] + * @param {*} [ctx.rawResponse] Override stored raw (defaults to rawPayload). 
+ */ +export const normalizeAlgorithmInvocation = (rawPayload, ctx = {}) => { + const { + algorithmId, + algorithmConfig, + projectContext, + rowIndex, + status = 'success', + error, + rawResponse, + trustServerRerank + } = ctx + + const algorithmResponse = createAlgorithmResponse( + rawResponse !== undefined ? rawResponse : rawPayload, + algorithmId, + { status, error, rowIndex } + ) + + const results = Array.isArray(rawPayload?.results) ? rawPayload.results : [] + + const allCandidates = [] + const defsByKey = new Map() + const rowsByKey = new Map() + + results.forEach(result => { + const { candidates, concept_definitions, concept_rows } = normalizeAlgoResult(result, { + algorithmId, + algorithmConfig, + algorithmResponseId: algorithmResponse.id, + projectContext, + trustServerRerank + }) + allCandidates.push(...candidates) + concept_definitions.forEach(cd => { + const existing = defsByKey.get(cd.key) + // Prefer richer definitions: never overwrite a 'full' with a 'pending'. + if (!existing || lookupStatusRank(cd.lookup_status) > lookupStatusRank(existing.lookup_status)) { + defsByKey.set(cd.key, cd) + } + }) + concept_rows.forEach(cr => { + if (!rowsByKey.has(cr.concept_key)) { + rowsByKey.set(cr.concept_key, cr) + } + }) + }) + + return { + algorithm_response: algorithmResponse, + candidates: allCandidates, + concept_definitions: Array.from(defsByKey.values()), + concept_rows: Array.from(rowsByKey.values()) + } +} + +const LOOKUP_RANK = { pending: 0, failed: 0, partial: 1, full: 2 } +export const lookupStatusRank = (status) => LOOKUP_RANK[status] ?? 0 + +/** + * Filter a ConceptDefinition.property[] dict array to the subset the project + * considers identifying — repoVersion.meta.display.concept_summary_properties + * (e.g. for LOINC: COMPONENT / PROPERTY / TIME_ASPCT / SYSTEM / SCALE_TYP / + * METHOD). Producer-side volume control for the AI Assistant v2 payload's + * recommendable_concepts[*].property push. 
+ * + * Returns: + * - undefined when `property` is missing/empty (essentializer drops the key) + * - `property` whole when no summary list is configured (FHIR-passthrough + * sources, or repos that haven't pinned summary properties) + * - filtered subset matching the summary codes otherwise + * + * Pure: no lodash, no React. + */ +export const filterPropertyBySummary = (property, summaryCodes) => { + if (!Array.isArray(property) || property.length === 0) return undefined + if (!Array.isArray(summaryCodes) || summaryCodes.length === 0) return property + const set = new Set(summaryCodes) + return property.filter(p => p?.code && set.has(p.code)) +} + +/** + * Compact a property[] array to a `{code: value}` object for the AI Assistant + * payload. PR3-D3-lite (L-5): saves ~50% bytes vs. the FHIR-aligned + * `[{code, valueCoding}, ...]` shape by eliminating per-entry overhead. The + * LLM reads property values by axis name in plain English (per the LOINC + * system prompt's axis-by-axis methodology); the wire shape doesn't matter. + * + * Safe fallback: FHIR allows a CodeSystem.property `code` to repeat within a + * single concept (e.g. multiple `parent` codes). If any code appears twice, + * the original array is returned unchanged so no values are lost to map + * collision. LOINC's identifying axes (COMPONENT, PROPERTY, etc.) are unique + * per concept; ICD-11 / SNOMED edge cases preserve the array form. + * + * Pure. + */ +export const compactProperty = (property) => { + if (!Array.isArray(property) || property.length === 0) return property + const seen = new Set() + for (const p of property) { + const code = p?.code + if (!code) return property + if (seen.has(code)) return property + seen.add(code) + } + const obj = {} + for (const p of property) { + obj[p.code] = p.valueCoding ?? p.value ?? null + } + return obj +} + +/** + * Reduce a locale tag to its primary subtag (the part before any `-`), + * lowercased. `en-US` → `en`, `pt-BR` → `pt`, `EN` → `en`.
Empty/nullish + * input returns `''`. Pure. + */ +export const primarySubtag = (locale) => { + if (!locale || typeof locale !== 'string') return '' + const idx = locale.indexOf('-') + return (idx === -1 ? locale : locale.slice(0, idx)).toLowerCase() +} + +/** + * Reduce a list of locale tags to the unique set of primary subtags. + * `['en', 'en-US', 'PT-br']` → `['en', 'pt']`. Preserves first-seen order. + * Pure. + */ +export const uniqByPrimarySubtag = (locales) => { + if (!Array.isArray(locales)) return [] + const seen = new Set() + const out = [] + for (const l of locales) { + const tag = primarySubtag(l) + if (!tag || seen.has(tag)) continue + seen.add(tag) + out.push(tag) + } + return out +} + +/** + * Trim a single `names[*]` entry to the fields the LLM actually consumes. + * Drops `external_id` (upstream-system identifier, useless to the LLM). + * Keeps `name_type` when set (semantically meaningful — fully-specified + * vs. synonym vs. index term carries weight for match-recommendation). + * Emits `locale_preferred` only when literally `true` (mirrors the + * truthy-only pattern used for `retired`). + * + * Pure. + */ +const trimName = (n) => { + if (!n) return n + const out = { name: n.name, locale: n.locale } + if (n.name_type) out.name_type = n.name_type + if (n.locale_preferred === true) out.locale_preferred = true + return out +} + +/** + * Trim a single `descriptions[*]` entry. Drops `external_id`. Keeps `type` + * (description-types like `Definition` / `Usage` carry meaning). + * + * Pure. + */ +const trimDescription = (d) => { + if (!d) return d + const out = { description: d.description, locale: d.locale } + if (d.type) out.type = d.type + if (d.locale_preferred === true) out.locale_preferred = true + return out +} + +/** + * Filter a list of `names[*]` entries by primary-subtag match against the + * effective locale set, then trim each surviving entry to the LLM-relevant + * fields. 
Names with no `locale` field pass through (defensive — typically + * canonical fully-specified names that lost their tag in upstream processing). + * + * Behavior: + * - `effectiveLocales` empty/missing → identity locale-wise (every entry + * survives the filter) but `trimName` is still applied. This is the + * backwards-compatible path for old projects with neither `input_locale` + * nor `filters.locale` set; they pay the structural-trim savings but + * keep their full multi-locale set. + * - `effectiveLocales` populated → keep entries whose locale's primary + * subtag is in the set (case-insensitive), plus entries with no locale. + * + * Pure. + */ +export const filterAndTrimNames = (names, effectiveLocales) => { + if (!Array.isArray(names)) return names + const set = new Set(uniqByPrimarySubtag(effectiveLocales)) + const filterActive = set.size > 0 + const out = [] + for (const n of names) { + if (!n) continue + if (filterActive && n.locale) { + const tag = primarySubtag(n.locale) + if (tag && !set.has(tag)) continue + } + out.push(trimName(n)) + } + return out +} + +/** + * Filter+trim companion for `descriptions[*]`. Same locale semantics as + * `filterAndTrimNames`. Pure. + */ +export const filterAndTrimDescriptions = (descriptions, effectiveLocales) => { + if (!Array.isArray(descriptions)) return descriptions + const set = new Set(uniqByPrimarySubtag(effectiveLocales)) + const filterActive = set.size > 0 + const out = [] + for (const d of descriptions) { + if (!d) continue + if (filterActive && d.locale) { + const tag = primarySubtag(d.locale) + if (tag && !set.has(tag)) continue + } + out.push(trimDescription(d)) + } + return out +} + +/** + * Build a single recommendable_concepts[*] entry for the AI Assistant v2 + * payload. Pure projection — no React, no refs. The MapProject component + * calls this once per (target-canonical) ConceptDefinition while walking + * defsByKey in buildV2RecommendationPayload. 
+ * + * PR3-D3-lite trims applied here: + * - L-2: `ocl_url` dropped (UI deeplink; LLM doesn't use it). + * - L-5: `property` compacted via `compactProperty` (FHIR array → object). + * - L-8: `descriptions` omitted when the source array is empty. + * + * PR3-H trims applied here: + * - `names`/`descriptions` filtered by primary-subtag match against + * `effectiveLocales` (empty set → no locale filter, identity behavior). + * - Per-entry structural trim: drop `external_id` always; emit + * `locale_preferred` only when `true`. + * + * L-3 (drop constant `concept_class` / `datatype` across the surfaced set) is + * applied by the caller AFTER this function builds each entry, since the + * homogeneity check requires the full set. + * + * @param {Object} args + * @param {Object} args.def ConceptDefinition (from conceptCache or normalizer output) + * @param {string} args.key the concept_key + * @param {Array} args.evidence [{algorithm_id, candidate_type, score, highlights, via?}, ...] + * @param {Object} [args.rowState] rowMatchState[rowIndex] — provides rerank_score per concept_key + * @param {Array} [args.summaryPropertyCodes] target_repo summary property codes (optional filter) + * @param {Array} [args.effectiveLocales] PR3-H union of (input_locale, filters.locale); empty = no locale filter + */ +export const buildRecommendableConceptEntry = ({ + def, key, evidence, rowState, summaryPropertyCodes, effectiveLocales +}) => { + const filteredNames = filterAndTrimNames(def.names, effectiveLocales) + const filteredDescriptions = filterAndTrimDescriptions(def.descriptions, effectiveLocales) + const entry = { + concept_key: key, + canonical_reference: def.reference, + display_name: def.display_name, + names: filteredNames, + concept_class: def.concept_class, + datatype: def.datatype, + property: compactProperty(filterPropertyBySummary(def.property, summaryPropertyCodes)), + rerank_score: rowState?.concept_rows?.[key]?.rerank_score, + evidence + } + if 
(Array.isArray(filteredDescriptions) && filteredDescriptions.length > 0) + entry.descriptions = filteredDescriptions + if (def.retired === true) entry.retired = true + return entry +} + +/** + * Apply L-3 (drop constant `concept_class` / `datatype`) across a built + * recommendable_concepts[] set. If ALL entries share the same value on + * either field, that field is removed from every entry — it carries zero + * discriminatory signal for the LLM in the homogeneous case. If the + * surfaced set is heterogeneous on the field, the field is kept on every + * entry as today. + * + * Pure. Mutates entries in place. + */ +export const stripConstantClassAndDatatype = (entries) => { + if (!Array.isArray(entries) || entries.length < 2) return entries + for (const field of ['concept_class', 'datatype']) { + const values = entries.map(e => e[field]) + const first = values[0] + if (first !== undefined && values.every(v => v === first)) { + for (const e of entries) delete e[field] + } + } + return entries +} + +/** + * Backfill rowMatchState + ConceptDefinitions from the legacy + * `allCandidates` shape (a saved-project artifact: `{ [algoId]: [{row, + * results}, ...] }`). Called on project load so that v1-saved projects + * render correctly under UNIFIED_MODEL_ENABLED=true. A precursor to + * PR3's `normalizeLegacy.js`. + * + * Pure function: no React, no APIService, no mutation of inputs. + * + * @param {Object} allCandidates { [algoId]: [{row, results}, ...] } + * @param {Object} projectContext {namespace, target_repo, bridge_repo?} + * @param {Array} algorithms algo defs (may carry concept_identity) + * @param {Object} [conceptIdentityByType] optional fallback map for algos + * missing concept_identity (e.g. API- + * loaded bridge/scispacy variants). 
+ * @returns {{ + * rowMatchState: Object, keyed by row __index + * conceptDefinitionsByKey: Map + * }} + */ +export const normalizeLegacyAllCandidates = ( + allCandidates, + projectContext, + algorithms, + conceptIdentityByType = {} +) => { + const rowMatchState = {} + const conceptDefinitionsByKey = new Map() + if(!allCandidates || !projectContext) return { rowMatchState, conceptDefinitionsByKey } + + const algoById = new Map((algorithms || []).map(a => [a.id, a])) + + Object.entries(allCandidates).forEach(([algoId, rowEntries]) => { + const algoDef = algoById.get(algoId) + // Orphan-algo recovery (PR3-C / ocl_issues#2555): when the tag isn't a + // configured algo (reconfigured / mistagged project) or yields no identity, + // fall back to a result-derived identity instead of dropping the entries. + const algoConfig = algoDef?.concept_identity + ? algoDef + : (algoDef && conceptIdentityByType[algoDef.type] + ? { ...algoDef, concept_identity: conceptIdentityByType[algoDef.type] } + : { id: algoId, concept_identity: { reference_source: 'result', code_field: 'id', ocl_url_field: 'url' } }) + + ;(rowEntries || []).forEach(rowEntry => { + const idx = rowEntry?.row?.__index + if(idx === undefined || idx === null) return + + const normalized = normalizeAlgorithmInvocation( + { row: rowEntry.row, results: rowEntry.results || [] }, + { + algorithmId: algoId, + algorithmConfig: algoConfig, + projectContext, + rowIndex: idx, + // Saved-project legacy data: search_normalized_score was persisted + // from a prior session's $rerank/ output, so it IS the canonical + // rerank score. Honor it (no client-side rerank will fire for + // already-loaded rows). 
+ trustServerRerank: true + } + ) + + const prevRow = rowMatchState[idx] || { + algorithm_responses: {}, + candidates: {}, + concept_rows: {} + } + const nextRow = { + algorithm_responses: { + ...prevRow.algorithm_responses, + [normalized.algorithm_response.id]: normalized.algorithm_response + }, + candidates: { ...prevRow.candidates }, + concept_rows: { ...prevRow.concept_rows } + } + normalized.candidates.forEach(c => { nextRow.candidates[c.id] = c }) + normalized.concept_rows.forEach(cr => { + const existing = nextRow.concept_rows[cr.concept_key] + // Existing entry keeps its rerank_score (richer wins); new arrivals + // are taken only when no entry exists yet. + if(!existing) nextRow.concept_rows[cr.concept_key] = cr + else if(existing.rerank_score === undefined && cr.rerank_score !== undefined) + nextRow.concept_rows[cr.concept_key] = cr + }) + rowMatchState[idx] = nextRow + + normalized.concept_definitions.forEach(def => { + const existing = conceptDefinitionsByKey.get(def.key) + if(!existing || lookupStatusRank(def.lookup_status) > lookupStatusRank(existing.lookup_status)) + conceptDefinitionsByKey.set(def.key, def) + }) + }) + }) + + return { rowMatchState, conceptDefinitionsByKey } +} diff --git a/scripts/migrate-candidates-to-v2/verification-report.md b/scripts/migrate-candidates-to-v2/verification-report.md new file mode 100644 index 0000000..4dbd045 --- /dev/null +++ b/scripts/migrate-candidates-to-v2/verification-report.md @@ -0,0 +1,178 @@ +# PR3-C migration verification — findings (interim) + +Date: 2026-06-02 · Verifier: automated dry-run + sandbox rehearsal against live prod data +Scope: oclmap candidates v1→v2 migration (`migrate.mjs` / PR #34) + v2-only app (PR #35), +ahead of the production migration + deploy. QA + staging already migrated. 
+ +## Bottom line + +**Migration MECHANICS and transformation FIDELITY are verified and safe.** Of the 69 +non-empty prod projects, **52 (75%) migrate with zero data loss** and **100% of the +133,042 "survivable" candidate results transform faithfully** (perfect code coverage, +row coverage, and v2 referential integrity — 0 hard failures). The apply is a single +atomic transaction that takes ~25s and turns **100% of projects into +`mapper_schema_version: 2`**; rollback restores **byte-identical** data. + +**One material finding:** the migration **permanently discards ~2,678 raw candidate +results (~2%) across 17 projects** (5 of them fully zeroed). **Crucially this is NOT a +new user-facing regression** — those candidates are *already invisible in current +production* (PR2b runs `UNIFIED_MODEL_ENABLED=true` and normalizes-on-load via the +identical function). The migration's only new effect is that it **overwrites the raw v1 +data**, so a future bug-fix could no longer recover those results from the live row — +only from the backup. + +**Recommendation: GO**, conditioned on (a) **retaining the pre-migration `pg_dump` +permanently** (archive, not just for the rollback window), and ideally (b) landing a +small `normalizers.js` fix first that recovers most of the loss. Plus the runbook +corrections below. Final 98% gate still needs the staging deployed-app UI smoke tests +(Track C — blocked on SSO creds). 
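The headline figures in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates totals reported elsewhere in this report (135,720 total v1 results, 975 orphan-tag drops, 1,703 no-identity drops, 52 of 69 clean projects); it does not recompute them from the dump, and the variable names are illustrative rather than taken from `validate.mjs`.

```javascript
// Totals as reported in this verification report (not recomputed here).
const totalV1Results = 135720
const orphanTagDrops = 975
const noIdentityDrops = 1703
const cleanProjects = 52
const nonEmptyProjects = 69

// Results the migration discards, and what remains "survivable".
const dropped = orphanTagDrops + noIdentityDrops
const survivable = totalV1Results - dropped

// Shares quoted in the summary (rounded there to ~2% and 75%).
const lossPct = (dropped / totalV1Results) * 100
const cleanProjectPct = (cleanProjects / nonEmptyProjects) * 100

console.log({
  dropped,
  survivable,
  lossPct: lossPct.toFixed(2),
  cleanProjectPct: cleanProjectPct.toFixed(1)
})
```

The two drop classes sum to the ~2,678 discarded results and leave the 133,042 survivable results quoted above, so the summary percentages are consistent with the per-class counts.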
+
+## What was tested
+
+| Track | What | Result |
+|---|---|---|
+| A | Fresh read-only dump of all 69 non-empty prod projects (860 MB) | ✅ |
+| A | `migrate.mjs` dry-run | ✅ **69 ok / 0 skipped / 0 errors** |
+| A | Sandbox Postgres: apply exact `migrate.sql` | ✅ 69 UPDATEs, 1 txn, **100% → v2**, 0 nulled |
+| B | `validate.mjs` independent v1↔v2 equivalence | ✅ **0 hard failures**; loss fully characterized |
+| D | Rollback rehearsal (TRUNCATE + restore backup) | ✅ **byte-identical** (69/69 md5 match) |
+| D | Idempotency of `migrate.sql` re-apply | ✅ 69→69 v2 |
+| D | Double-run guard (re-run `migrate.mjs` on v2 data) | ✅ crashes loudly & safely (no corruption) |
+| E | `matches`/`analysis` cross-field integrity | ✅ matches reference concept by URL+rowIndex — survive intact |
+| F | QA parity | ✅ no un-migrated leftovers (QA has no real data) |
+| C | Staging deployed-app UI smoke tests | ⏳ blocked on staging SSO creds |
+
+## The data-loss finding (detail)
+
+**Root cause** — `oclmap/src/components/map-projects/normalizers.js`, in
+`normalizeLegacyAllCandidates` (commit `315e9b0`):
+
+```js
+const algoById = new Map((algorithms || []).map(a => [a.id, a]))
+Object.entries(allCandidates).forEach(([algoId, rowEntries]) => {
+  const algoDef = algoById.get(algoId)
+  if(!algoDef) return   // ← silently drops EVERY result whose tag isn't a CURRENT algo id
+```
+
+Candidate entries are grouped by their stored algorithm tag (`entry.algorithm` →
+`results[0].search_meta.algorithm` → sole-algo fallback), then matched against the
+project's **current** `algorithms[]` by `id`. Two ways an entry's tag fails to match:
+
+- **Reconfigured project** — algorithms changed after results were saved.
+  e.g. id=132/135 ran `ocl-ciel-bridge`+`ocl-semantic`, now configured `ocl-scispacy-loinc`.
+- **Mistagged results** — e.g. id=52 "CES drugs": results tagged `ocl-search`, algo
+  configured as `ocl-semantic`. (The sole-algo fallback is skipped because a tag *is*
+  present, just the wrong one.)
+
+A second, smaller class is **no-identity drop** (1,703 results): `custom` algos with no
+`canonical_url` get no `concept_identity` (the ICD-11 custom-API demo projects).
+
+**Blast radius** (from `validate.mjs` / `validation-report.json`):
+
+- Total v1 results: 135,720 → survivable 133,042 · orphan-tag drop 975 · no-identity drop 1,703.
+- 52/69 projects clean. 17 projects lose ≥1 result.
+- **5 worst-case "FULL-ZERO + USED"** (had results + user matches/analysis, → 0 candidates):
+  | id | name | results→v2 | matches | analysis | tags vs config |
+  |---|---|---|---|---|---|
+  | 52 | CES drugs | 132→0 | 126 | 0 | `ocl-search` vs `ocl-semantic` |
+  | 132 | HL7 Brazil Jusssara | 270→0 | 4 | 9 | bridge+semantic vs `ocl-scispacy-loinc` |
+  | 67 | ICD 11 Demo Showcase #001 | 53→0 | 18 | 15 | ICD-API names vs `custom` |
+  | 57 | ICD 11 Test | 9→0 | 6 | 1 | ICD-API names vs `custom` |
+  | 55 | conceptDictionary…copy | 28→0 | 1 | 0 | ICD-API name vs `custom` |
+- 12 partial-loss projects keep most data (e.g. id=90 dropped 663 of 8,518; id=135 270→10).
+
+**Why it is NOT a new regression:** current prod (PR2b) renders the candidate grid from
+the normalized `rowMatchState` (`buildQualityRowViews(rowMatchStateRef.current[idx], …)`,
+MapProject.jsx@bbdc360:3705), which is built by the *same* `normalizeLegacyAllCandidates`
+with the *same* drop. So these candidates already don't render today. The migration
+persists that state and discards the raw v1 source.
+
+**`matches` are safe:** each match is self-contained (`{url, mapType, repoURL, decision,
+rowIndex, state}`) and references concepts by URL — never by candidate-array index. So
+users' actual mapping decisions and CSV export survive intact for all projects, including
+the zeroed ones.
+
+## Recommendations
+
+1. **GO** for the prod migration on data-integrity grounds.
+2. **Retain the pre-migration `pg_dump` permanently** (archive it) — standard safety
+   net, even though the canonical + orphan-recovery fixes now bring preventable
+   drops to ~0.
+3. **DONE — both data-loss classes recovered** (see the two sections below):
+   the canonical fix eliminated the 1,703 no-identity drops, and the orphan-algo
+   `reference_source: 'result'` fallback recovered all 731 orphan-tag codes
+   (`validate.mjs`: 0 still-missing, 0 hard failures).
+
+## Runbook corrections (found during rehearsal)
+
+- **Run the migration from commit `315e9b0`**, not `main`. `migrate.mjs` imports
+  `normalizeLegacyAllCandidates`, which PR #35 (`9f8f4b3`) deletes — running from `main`
+  crashes on import. (Use a detached worktree at `315e9b0`.)
+- **Target count is now 69** non-empty projects (was 68 at the 2026-05-27 dry-run).
+  Always re-dump immediately before the window (the runbook's T-0:03 step) — the set drifts.
+- **Never re-dump+migrate after applying** — `migrate.mjs` on v2 data throws
+  `TypeError: (candidates || []) is not iterable`. (`migrate.sql` re-apply is safe/idempotent.)
+- nginx-block step is necessary: a save landing between dump and apply is overwritten; the
+  T-0:03 re-dump mitigates.
+
+## Artifacts (this folder; input/ and output/ are gitignored)
+
+- `validate.mjs` — new independent equivalence validator (read-only over dry-run outputs).
+- `output/summary.json`, `output/skipped.json` (empty), `output/validation-report.json`,
+  `output/migrate.sql`, `output/proj-*.v2.json`.
+- Sandbox: throwaway Postgres at `127.0.0.1:55432` (data in `/tmp/oclmap-sandbox-pg`),
+  `/tmp/map_projects.backup.sql`, `/tmp/pre_migration_md5.txt`. Remove when done.
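To illustrate the orphan-tag drop described under "The data-loss finding", here is a minimal sketch that resolves each entry's tag in the order the report gives (`entry.algorithm`, then `results[0].search_meta.algorithm`, then the sole-algo fallback) and counts what would fall through `if(!algoDef) return`. The flat `entries` array, the `auditOrphanTags` and `resolveTag` names, and the `(untagged)` bucket are assumptions made for illustration; this is not the vendored normalizer and the real v1 shape may differ.

```javascript
// Hypothetical v1 shapes, for illustration only:
//   entry   = { algorithm?: string, results: [{ search_meta?: { algorithm?: string } }] }
//   project = { algorithms: [{ id: string }], entries: entry[] }

// Resolve an entry's tag: entry.algorithm, then results[0].search_meta.algorithm,
// then the sole-algo fallback.
function resolveTag(entry, algorithms) {
  const explicit = entry.algorithm ?? entry.results?.[0]?.search_meta?.algorithm
  if (explicit) return explicit
  // The fallback only applies when NO tag is stored and exactly one algo is configured.
  return algorithms.length === 1 ? algorithms[0].id : null
}

// Count results that survive vs. results whose tag is not a currently configured id.
function auditOrphanTags(project) {
  const algorithms = project.algorithms || []
  const algoIds = new Set(algorithms.map(a => a.id))
  const report = { survivable: 0, orphanDropped: 0, orphanTags: {} }
  for (const entry of project.entries || []) {
    const tag = resolveTag(entry, algorithms)
    const count = (entry.results || []).length
    if (tag !== null && algoIds.has(tag)) {
      report.survivable += count
    } else {
      // Simplification (assumption): untagged entries in multi-algo projects
      // are lumped into the same bucket as orphan tags.
      report.orphanDropped += count
      const key = tag ?? '(untagged)'
      report.orphanTags[key] = (report.orphanTags[key] || 0) + count
    }
  }
  return report
}

// A mistagged project: results tagged ocl-search, but only ocl-semantic is configured.
const mistagged = {
  algorithms: [{ id: 'ocl-semantic' }],
  entries: [
    { algorithm: 'ocl-search', results: [{}, {}] },
    { results: [{ search_meta: { algorithm: 'ocl-search' } }] },
    { results: [{}] }, // no tag at all, so the sole-algo fallback rescues it
  ],
}
console.log(auditOrphanTags(mistagged))
// → { survivable: 1, orphanDropped: 3, orphanTags: { 'ocl-search': 3 } }
```

The third entry shows why the sole-algo fallback does not help a mistagged project: it only fires when no tag is stored, whereas a wrong tag is taken at face value.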
+
+## Canonical resolution (follow-up — fixes most of the loss + a deeper issue)
+
+The "no-canonical → drop" path turned out to be one face of a broader problem: the
+migration always keys concepts on the **generated** `https://ns.openconceptlab.org/...`
+URL, while the live app keys them on each source's **formal** `canonical_url`
+(it fetches `target_repo` at runtime; reads `custom`/`bridge` canonicals from the
+algo config). Consequences on real data: (a) custom ICD-11 algos drop entirely
+(no identity), (b) **10 LOINC/ICD-11 projects split a concept's identity** between
+formal + generated, (c) migrated keys mismatch the live app on re-run.
+
+**Decision (@paynejd):** ICD-11 (all variants) → `http://id.who.int/icd/release/11/mms`;
+PIH → generated; everything else → the source's registered canonical. Resolve at
+dump time (static snapshot), not a runtime read.
+
+**Prototyped + measured** (dry-run + validator on the real 69):
+
+| Metric | before (generated) | after (formal resolution) |
+|---|---|---|
+| no-identity drops | 1,703 | **0** |
+| split-identity projects | 10 | **0** |
+| concept keys | mostly generated | **all formal** (only 24 PIH generated) |
+| hard failures | 0 | 0 |
+| orphan-tag drops (separate bug) | 975 | 975 (unchanged) |
+
+**Two production pieces (both validated):**
+1. `fix-algorithm-canonicals.sql` — one-time `--rw` UPDATE of `map_projects.algorithms`
+   setting formal canonical on custom + bridge algos. Fixes the **live app** too
+   (it reads these from config). Run before the migration, after backup.
+2. `dump-projects.sql` — migration dump with resolved `target_repo.canonical_url`
+   (the app fetches target live; the migration can't, so it resolves at dump time).
+   `migrate.mjs` is unchanged.
+
+Hand both to Sunny.
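The resolution rule in the decision above can be sketched as a small pure function applied once at dump time. This is only an illustration in JavaScript (the production pieces are the two SQL files): the input shape (`owner`, `sourceId`, `registeredCanonical`, `generatedUrl`), the way ICD-11 variants and PIH sources are detected, and the fallback when no canonical is registered are all assumptions. The generated URL is passed in rather than built, because its path is not spelled out in this report.

```javascript
// The ICD-11 canonical named in the decision above.
const ICD11_CANONICAL = 'http://id.who.int/icd/release/11/mms'

// Hypothetical input shape, for illustration only:
//   { owner, sourceId, registeredCanonical, generatedUrl }
function resolveCanonical(source) {
  // Assumption: ICD-11 variants are recognisable from the source id (e.g. "ICD-11-WHO").
  if (/^icd-?11/i.test(source.sourceId)) return ICD11_CANONICAL
  // Assumption: PIH sources are identified by their owner.
  if (source.owner === 'PIH') return source.generatedUrl
  // Everything else: the source's registered canonical.
  // Falling back to the generated URL when none is registered is an assumption.
  return source.registeredCanonical || source.generatedUrl
}

// Resolve once and store the result with the snapshot, instead of reading it at runtime.
function snapshotTargetRepo(source) {
  return { ...source, canonical_url: resolveCanonical(source) }
}

// Usage with made-up example.org URLs:
const snapshot = snapshotTargetRepo({
  owner: 'WHO',
  sourceId: 'ICD-11-WHO',
  registeredCanonical: null,
  generatedUrl: 'https://example.org/generated/icd-11-who',
})
console.log(snapshot.canonical_url)
```

Keeping the rule in one function makes the three-way decision easy to review against the table above: after resolution, only the PIH branch should still yield generated keys.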
+
+**RESOLVED — orphan-tag drops (`ocl_issues#2555`).** Results tagged with an algorithm
+not in the project's current `algorithms[]` (reconfigured/mistagged projects: id 52,
+55, 57, 61, 67, 95, 114, 132, 135) were dropped at the normalizer's `if(!algoDef)
+return`. Added a migration-only `reference_source: 'result'` fallback to
+`vendored/normalizers.js`: derive the concept identity from the result's own source
+— target-source results anchor to the project's target canonical (formal, visible +
+recommendable); non-target results key on their own generated source (correctly inert,
+never mis-surfaced). Measured: **731/731 recoverable codes now in v2, 0 still-missing,
+0 hard failures**, target/non-target keying verified correct. The remaining UI nicety
+(showing orphan-algo candidates in the *by-algorithm* view, which iterates configured
+algos only) is a separate small **app** PR — out of scope for this migration-tooling PR.
+
+## Outstanding for ≥98% (Track C)
+
+Seed staging with ~8 diverse real prod projects (incl. a bridge, a scispacy, an
+ICD-11/HEAD, a sole-algo, the largest, and one of the zeroed projects), then SSO browser
+smoke tests on the deployed v2 app: load without legacy alert, render candidates/scores,
+bridge cascade targets, Concept Details panel, AI Recommend, CSV export, re-save→reload,
+and a fresh new-project save. **Needs staging Keycloak creds.**
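The `reference_source: 'result'` fallback from the orphan-tag note above can be pictured roughly as follows. This is a hedged sketch, not the code in `vendored/normalizers.js`: the `result` and `target` shapes, the field names `sourceUrl`, `generatedSourceUrl`, `canonical`, `version` and `isTarget`, and the trailing-slash comparison are all assumptions for illustration. Only the `reference_source: 'result'` marker and the target/non-target split come from this report.

```javascript
// Hypothetical shapes, for illustration only:
//   result = { sourceUrl, generatedSourceUrl }
//   target = { sourceUrl, canonical, version }
function deriveIdentityFromResult(result, target) {
  // Assumption: sources are compared by URL, ignoring trailing slashes.
  const norm = url => String(url || '').replace(/\/+$/, '')
  const isTarget =
    norm(result.sourceUrl) !== '' && norm(result.sourceUrl) === norm(target.sourceUrl)

  if (isTarget) {
    // Target-source result: anchor to the project's target canonical + version,
    // so it stays visible and recommendable.
    return {
      reference_source: 'result',
      canonical: target.canonical,
      version: target.version,
      isTarget: true,
    }
  }
  // Non-target result: key on its own generated source so a target filter keeps it inert.
  // Leaving version as null here is an assumption; the report does not say what is stored.
  return {
    reference_source: 'result',
    canonical: result.generatedSourceUrl,
    version: null,
    isTarget: false,
  }
}

// Usage with made-up URLs:
const target = {
  sourceUrl: '/orgs/Demo/sources/Target/',
  canonical: 'https://example.org/formal/target',
  version: 'v1',
}
const fromTarget = deriveIdentityFromResult(
  { sourceUrl: '/orgs/Demo/sources/Target', generatedSourceUrl: 'https://example.org/generated/target' },
  target
)
const fromOther = deriveIdentityFromResult(
  { sourceUrl: '/orgs/Demo/sources/Other/', generatedSourceUrl: 'https://example.org/generated/other' },
  target
)
console.log(fromTarget.canonical, fromOther.canonical)
```

The point of the split is that recovered candidates never gain more visibility than they would have had under a correctly configured algorithm: only results that already belong to the target source are anchored to its canonical.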