Skip to content

Recurse into nested class_derivations in validate-spec#231

Open
amc-corey-cox wants to merge 9 commits into
mainfrom
validate-nested-cross-table
Open

Recurse into nested class_derivations in validate-spec#231
amc-corey-cox wants to merge 9 commits into
mainfrom
validate-nested-cross-table

Conversation

@amc-corey-cox
Copy link
Copy Markdown
Contributor

Closes #211. Closes #213. Closes #219.

Summary

Walk nested class_derivations declared on slot_derivations and validate them like top-level CDs, threading the parent CD through so cross-table boundaries can be diagnosed at validate-spec time. Adds three checks plus the recursion shape that enables them.

Checks added

Cross-table mismatch — closes #211

When a nested CD's populated_from differs from its parent's:

The warning surfaces the silent-null hazard from the original issue at validate-spec time, instead of waiting until transform.

Joined-class attribute refs — closes #213

Expression refs like {alias.field} are validated against the joined class's slot list. Previously, the validator dropped {alias.*} patterns from expression-reference checking entirely (a workaround for false positives on join aliases); now those refs are checked against the actual joined class. Covers both:

  • explicit joins: aliases (with class_named honored)
  • implicit-join targets surfaced via nested CDs' populated_from

This is partial coverage of #213 — runtime-side eval-time validation is separately needed for the bare {scroe} typo case. With #230 + this PR landed, the silent-null mechanism the issue describes is closed at the validate-spec layer.

is_a / mixins resolution — closes #219 (Option C)

is_a (string) and each entry of mixins (list) resolve in this order:

  1. Spec-internal class_derivation name pool — matches Transformer._find_class_derivation_by_name
  2. Target schema classes via SchemaView — matches LinkML's plain-schema is_a convention
  3. Else → error ('is_a: TotallyMadeUpName' does not resolve to ...)

Composes cleanly with #230's structural change: that PR makes the JSON Schema layer accept string references; this PR makes the semantic layer resolve them. They can land in any order; once both are in, the full chain works end-to-end.

Other changes

  • ValidationMessage.severity widened to Literal["error", "warning", "info"] to carry the implicit-join INFO.
  • New extract_expr_attribute_references() helper — returns {base: {attr, ...}} for {base.attr} patterns, mirroring extract_expr_slot_references().
  • New _build_joined_class_map(), _check_cross_table_join(), _collect_class_derivation_pool(), _check_class_inheritance_refs() helpers.
  • _validate_class_derivation now takes optional parent_class_deriv, parent_path, and derivation_pool kwargs; recurses into nested CDs.
  • _validate_slot_derivation's joined_aliases: set[str] becomes joined_class_slots: dict[str, set[str] | None].

Test plan

Walk nested class_derivations declared on slot_derivations and validate
them like top-level CDs, threading the parent CD through so cross-table
boundaries can be diagnosed. Adds three checks:

- Cross-table mismatch: when a nested CD's populated_from differs from
  its parent's, emit INFO when an implicit join would synthesize at
  runtime (#212) and WARNING when no implicit join is possible. Closes
  #211.

- Joined-class attribute refs: expressions like {alias.field} are now
  validated against the joined class's slot list — for both explicit
  joins: aliases and nested-CD implicit-join targets. Replaces the
  previous "drop join aliases from check" behavior. Closes #213.

- is_a / mixins resolution: string references resolve against the
  spec-internal class_derivation name pool first (matches the runtime's
  _find_class_derivation_by_name), then against target schema classes
  via SchemaView, else error. Closes #219.

ValidationMessage.severity widens to Literal["error", "warning", "info"]
to carry the implicit-join INFO.
Copilot AI review requested due to automatic review settings May 12, 2026 16:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances validate-spec semantic validation to traverse nested class_derivations under slot_derivations, enabling earlier detection of cross-table nesting issues, validating {alias.field} expression references against joined classes, and resolving is_a/mixins references against spec-internal derivations and/or the target schema.

Changes:

  • Added recursive semantic validation of nested class_derivations, including a new cross-table joinability diagnostic for parent→child populated_from mismatches.
  • Added extract_expr_attribute_references() and extended expression validation to check {alias.field} patterns against joined-class slot sets.
  • Added is_a/mixins reference resolution via a top-level class-derivation name pool with fallback to target-schema classes; widened ValidationMessage.severity to include "info".

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/linkml_map/validator.py Implements nested-CD recursion, cross-table join diagnostics, joined-class expression attribute validation, and inheritance reference resolution; adds "info" severity.
tests/test_validator.py Adds unit tests covering attribute reference extraction, nested cross-table diagnostics, joined-class attribute validation, and is_a/mixins resolution behavior.

Comment thread src/linkml_map/validator.py
Comment thread src/linkml_map/validator.py
Comment thread src/linkml_map/validator.py Outdated
Comment thread src/linkml_map/validator.py
- Cross-table join: when an explicit joins: entry covers the nested
  source, also verify the spec actually has 'join_on' or both
  'source_key' and 'lookup_key'. An empty entry (structurally valid
  but missing keys) now warns at validate-spec time instead of failing
  at transform time with ValueError.

- is_a / mixins: skip the resolution check entirely when target_sv is
  None. The "not in pool" half is genuinely ambiguous without the
  target schema — the reference could still resolve there. Erroring
  on half-checked pools was a false-positive risk.

- Joined-class error wording: include both the resolved class name
  and the alias when they differ ('joined class X (alias Y)') so users
  can tell which schema class was actually checked when class_named is
  set.

- _build_joined_class_map now returns
  dict[alias, (class_name, slot_set | None)] so the consumer can
  surface the resolved class name in error messages.

Added five new regression tests covering each scenario.
When an implicit join would synthesize at runtime (#212), additionally
check whether parent and child share non-join columns. Runtime behavior
on those columns is data-dependent (#217 Bug 2): rows with a join match
merge and mark shared columns _AMBIGUOUS (unqualified access errors),
rows without a match leave the bare parent row in place and silently
return the parent's value. Same spec, same dataset, can both error and
silently succeed depending on which row is being processed.

The hazard is fully determined by the schemas — surface it at
validate-spec time instead of waiting for a sparse row to expose it.

This addresses the static half of #217 Bug 2. The runtime behavior
itself (consistent _AMBIGUOUS insertion regardless of join hit, and
nested-object shape on no-match) remains a follow-up.
Copilot AI review requested due to automatic review settings May 12, 2026 16:53
@amc-corey-cox
Copy link
Copy Markdown
Contributor Author

Scope refinement after handover from a runtime-side session:

Re #213: The PR body says "partial coverage of #213 — runtime-side eval-time validation is separately needed for the bare {scroe} typo case." That undersells what's here. _validate_slot_derivation (validator.py:919-928) already validates bare-name expression refs against source_class_slots — so expr: '{scroe}' is caught at validate-spec time today. With this PR's joined-class attribute check on top, the static side of #213 is fully covered for both bare and dotted forms. The remaining #213 work is the runtime change in eval_utils._eval_name (stop swallowing NameNotDefined → None) plus defense-in-depth for cases where the validator has no schema — both separate from this PR.

Re #217 (new in d4a7ca2): Added a static warning for the column-overlap hazard described in #217 Bug 2. When _check_cross_table_join would synthesize an implicit join, it now also checks whether parent and child share non-join columns. If they do, runtime ambiguity behavior is data-dependent (rows with a join match merge and mark shared columns _AMBIGUOUS; rows without a match silently return the parent's value for the same column). The hazard is fully determined by the schemas, so the validator surfaces it instead of waiting for a sparse row to expose it. This addresses the static half of #217 Bug 2. Bug 1 (misleading {value: None} shape on join miss) and the runtime side of Bug 2 (consistent sentinel insertion regardless of join hit) remain follow-up work.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread src/linkml_map/validator.py
The overlap warning added in d4a7ca2 was unreachable given how
pick_join_key works: it returns a non-None key only when exactly one
non-identifier column is common (or when only identifier columns are
common, as a fallback). Any remaining overlap after subtracting the
join key is therefore identifier-only. Filtering identifiers (the only
sensible refinement) would make the check fire on zero realistic
inputs.

The genuine "two non-identifier columns in common, one is the join key,
the other is a real data column that resolves ambiguously" hazard is
already covered by the existing 'multiple candidate join columns'
WARNING — pick_join_key refuses to pick in that scenario, so we fall
through to the ambiguous-candidates path, not the implicit-join INFO
branch.

Dropping the warning and the three tests rather than carrying dead
code on speculation that pick_join_key might one day get smarter.
Copilot AI review requested due to automatic review settings May 12, 2026 18:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread src/linkml_map/validator.py
Comment thread src/linkml_map/validator.py Outdated
@amc-corey-cox
Copy link
Copy Markdown
Contributor Author

Correction to the previous comment re #217:

Reverted the column-overlap warning in 481d461. On a second look it was unreachable in practice: pick_join_key only returns a non-None key when there's exactly one non-identifier column in common, so any remaining overlap after subtracting the join key is identifier-only. Filtering identifiers (the only sensible refinement) would have made the check fire on zero realistic inputs.

The genuine #217 Bug 2 hazard — two non-identifier columns in common where one is the join key and the other is a real data column — is already caught: pick_join_key refuses to pick in that scenario, and we fall through to the existing "multiple candidate join columns" WARNING, not the implicit-join INFO branch.

So the practical position is:

Net: this PR no longer touches #217 with its own diagnostic; the issue's static hazard is incidentally covered by the cross-table check we already had.

Both messages referred to "joins: block" wholesale, but the path fires
whenever a specific entry for nested_source is absent — the parent CD
may already have a joins: mapping for other tables. Reword to point at
the specific missing entry instead of implying the whole block needs
adding.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/linkml_map/validator.py:768

  • _check_cross_table_join() derives parent_source from parent_cd.get("populated_from") only. If the parent CD omits populated_from (identity mapping), the function returns early and skips cross-table mismatch diagnostics, but the runtime uses parent_class_deriv.populated_from or parent_class_deriv.name when resolving nested objects. To avoid missing #211-style warnings/infos, compute the effective parent source the same way (fallback to the parent CD’s name).
    nested_source = nested_cd.get("populated_from")
    parent_source = parent_cd.get("populated_from")
    if not nested_source or not parent_source or nested_source == parent_source:
        return

Comment thread src/linkml_map/validator.py Outdated
_derive_nested_objects in the runtime uses
``parent_class_deriv.populated_from or parent_class_deriv.name``, so a
CD that omits ``populated_from`` falls back to its own name as the
source class (the "identity" case). _build_joined_class_map and
_check_cross_table_join previously only consulted ``populated_from``,
so an identity-style outer CD with a nested cross-table reference
silently slipped past validation. Mirror the runtime fallback in both
places and add a regression test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants