(GH-538) Initial work for exporting schemas from code #1406
michaeltlombardi wants to merge 5 commits into PowerShell:main
This change defines several new utility methods for `schemars::Schema`:

- `get_bundled_schema_resource_defs_key()` to retrieve the key for a bundled schema resource in `$defs` by its `$id` URI.
- `remove_bundled_schema_resources()` to discover and remove entries in the `$defs` keyword that define bundled schema resources. This enables canonicalizing a generated schema for publishing.
- `rename_defs_subschema()` to rename an entry in the `$defs` keyword. This is used by other methods for canonicalizing bundled schema resources.
- `get_references_to_bundled_schema_resource()` to retrieve every unique reference to a given bundled schema resource. This is used by other methods to canonicalize bundled schema resource definitions and references to them.
- `to_value_with_stable_order()` to return a sorted map for exporting to disk. This is only intended for use in exporting, where we want a stable order to minimize the work of comparing changed JSON Schemas. The `schemars` generator for JSON Schemas is not fully deterministic in the ordering of keys for a generated schema.
- `canonicalize_refs_and_defs()` to standardize how a schema references and bundles schema resources. Standardizing bundled schema resources and references to them simplifies munging the schemas for different forms.

This change includes initial testing and reference documentation for the new utility methods.
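To illustrate what a stable ordering buys when diffing exported schemas, here is a minimal sketch of recursive key sorting. It is not the implementation of `to_value_with_stable_order()`; the `Json` enum and `with_stable_order()` function are hypothetical stand-ins so the example only needs the standard library. Array order is left alone because it is semantically meaningful in JSON.

```rust
/// Hypothetical stand-in for a JSON value. Objects keep insertion order,
/// like an order-preserving map would.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Bool(bool),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// Returns a copy of the value with every object's keys sorted, recursively.
fn with_stable_order(value: Json) -> Json {
    match value {
        // Arrays keep their order, but their items are normalized.
        Json::Arr(items) => Json::Arr(items.into_iter().map(with_stable_order).collect()),
        Json::Obj(entries) => {
            let mut sorted: Vec<(String, Json)> = entries
                .into_iter()
                .map(|(key, item)| (key, with_stable_order(item)))
                .collect();
            // Sort by key so repeated exports produce identical output.
            sorted.sort_by(|a, b| a.0.cmp(&b.0));
            Json::Obj(sorted)
        }
        scalar => scalar,
    }
}
```

Two generated schemas that differ only in key order normalize to the same value, so a diff of the exported files shows only real changes.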
This change defines two new transforms for managing schemas with bundled schema resources:

- `canonicalize_refs_and_defs()` to munge the keys for entries in `$defs` that contain bundled schema resources and update all references to bundled schema resources to use the resource's `$id` as the reference, pointing to the absolute URI instead of using a relative URI or fragment pointer URI.
- `remove_bundled_schema_resources()` to remove every entry in the `$defs` keyword that is a bundled schema resource (defines the `$id` keyword). This enables slimming down the canonical form of schemas for publishing.

This change provides reference documentation for the new transforms. Because the transforms are just invoking the newly defined, documented, and tested schema utility methods, this change doesn't duplicate those tests for the transforms.
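The effect of the two transforms can be sketched on a toy model. Everything here is an assumption for illustration: `MiniSchema` and `Def` are hypothetical stand-ins that track only the `$defs` keys, each entry's optional `$id`, and a flat list of `$ref` values. The real transforms operate on `schemars::Schema` and also handle relative URI references, which this sketch ignores.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a `$defs` entry; `id` models the `$id` keyword.
#[derive(Debug, Clone, PartialEq)]
struct Def {
    id: Option<String>,
}

/// Hypothetical stand-in for a schema: its `$defs` map plus every `$ref` value.
#[derive(Debug)]
struct MiniSchema {
    defs: BTreeMap<String, Def>,
    refs: Vec<String>,
}

/// Re-keys bundled resources by their `$id` and points fragment references at that `$id`.
fn canonicalize_refs_and_defs(schema: &mut MiniSchema) {
    let old_defs = std::mem::take(&mut schema.defs);
    for (key, def) in old_defs {
        // Entries without `$id` are not bundled resources; keep them as-is.
        let Some(id) = def.id.clone() else {
            schema.defs.insert(key, def);
            continue;
        };
        let pointer = format!("#/$defs/{key}");
        for reference in &mut schema.refs {
            if *reference == pointer {
                *reference = id.clone();
            }
        }
        schema.defs.insert(id, def);
    }
}

/// Drops every `$defs` entry that defines `$id`.
fn remove_bundled_schema_resources(schema: &mut MiniSchema) {
    schema.defs.retain(|_, def| def.id.is_none());
}
```

Running the first transform and then the second leaves a schema whose references are absolute URIs and whose `$defs` holds only local definitions, which matches the description of the canonical form above.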
This change introduces two new transforms specific to VS Code:

- `remove_vs_code_keywords()` to recursively remove keywords that belong to the VS Code extended vocabulary from a schema. This enables publishing canonical and canonically bundled forms of the schema.
- `urlencode_defs_keys()` to update the keys for entries in `$defs` that contain characters that need to be URL encoded for VS Code to correctly discover references.

  The JSON Schema integration in VS Code doesn't discover canonically bundled schema resources. Instead, references need to point to `#/$defs/<key>`. This transformer munges the definition keys and references to them so that VS Code can resolve the references.
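As a rough illustration of the `#/$defs/<key>` munging, the sketch below percent-encodes a `$defs` key and builds the fragment reference for it. Both function names are hypothetical, and the choice of which characters to encode (everything outside the RFC 3986 unreserved set) is an assumption; the actual transform may use a different set.

```rust
/// Percent-encodes every byte outside the RFC 3986 unreserved set.
/// Assumption: the real transform may encode a different set of characters.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char);
            }
            other => out.push_str(&format!("%{other:02X}")),
        }
    }
    out
}

/// Builds the fragment reference for a `$defs` key after encoding it.
fn vscode_ref_for_key(key: &str) -> String {
    format!("#/$defs/{}", percent_encode(key))
}
```

With a key that is itself a URI, the slashes and colon are escaped, so the resulting reference contains only the two structural `/` separators of the `#/$defs/<key>` pointer.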
This change enhances the functionality of the `DscRepoSchema` trait for
exporting the schemas by defining the following associated functions
and methods:

- `default_export_schema_id_uri()` to emit the `$id` URI for the
  `vNext` version of the canonical-form schema on GitHub. This should
  be inserted into the JSON Schema for every DSC repo schema as the
  `$id` keyword, either with the `schemars` attribute or in the
  manual implementation for the `JsonSchema` trait.
- `default_export_meta_schema_uri()` to emit the `$schema` URI for the
  schema. Currently always returns the 2020-12 URI. In the future,
  when we support other meta schemas, this should be overridden.

  This should be inserted into the JSON Schema for every DSC repo
  schema as the `$schema` keyword, either with the `schemars` attribute
  or in the manual implementation for the `JsonSchema` trait.
- `generate_exportable_schema()` to return a `Schema` for a given
  version and form. The prefix URI for an exported schema is _always_
  the GitHub URI, because the `aka.ms` prefix is a user-convenience -
  the actual hosted schemas always live on GitHub.

  This method is a slightly more ergonomic API for exporting over using
  the `generate_schema()` method directly and always specifying the
  GitHub schema form.
- `generate_schema()` to return a `Schema` with the given version, form,
  and URI prefix.

  This associated function retrieves the schema with `schemars` and
  then:

  1. Sets the `$id` for the schema to match the version, form and URI
     prefix.
  1. Canonicalizes the references and definitions for the schema so
     that the key for bundled schema resources is always their `$id`
     URI and references to bundled schema resources are always absolute
     URIs instead of fragment pointers.
  1. Syncs the versions for all bundled resources to match the current
     version of the root schema.
  1. Munges the schema as needed for the given form by:
     - Removing VS Code keywords and bundled resources from canonical
       schemas.
     - Removing VS Code keywords from bundled schemas.
     - Munging references and definitions in VS Code schemas to enable
       VS Code to correctly resolve references to bundled schema
       resources.
  1. Returns the munged schema.
- `set_schema_id_uri()` to define the `$id` keyword for a given schema
  to the correct URI for a given version, form, and URI prefix.
- `get_schema_relative_path()` to return a `PathBuf` pointing to the
  export location for a given version and form of the schema, relative
  to the `schemas` folder. This enables exporting to disk.
- `set_enhanced_schema_id_uri()` as an ergonomic alternative to
  `set_schema_id_uri()` for a given version of the schema when working
  with the VS Code enhanced form.
- `set_canonical_schema_id_uri()` as an ergonomic alternative to
  `set_schema_id_uri()` for a given version of the schema when working
  with the canonical form.
- `set_bundled_schema_id_uri()` as an ergonomic alternative to
  `set_schema_id_uri()` for a given version of the schema when working
  with the canonically bundled form.
- `get_valid_schema_forms()` to return a vector containing the valid
  forms for a schema. The returned vector always includes the canonical
  form. If the schema is for a type that should provide a bundled form,
  the vector also includes the VS Code and Bundled forms.
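The form-dependent behavior described in this commit can be summarized in a small sketch. The variant names on `SchemaForm`, the `should_bundle` flag, and both free functions are assumptions made for illustration; the transform names are the ones listed in the earlier commits, returned here as plain strings rather than invoked.

```rust
/// Hypothetical model of the three schema forms named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SchemaForm {
    Canonical,
    Bundled,
    VSCode,
}

/// The canonical form is always valid; types that should provide a bundled
/// form also get the VS Code and Bundled forms.
fn valid_schema_forms(should_bundle: bool) -> Vec<SchemaForm> {
    let mut forms = vec![SchemaForm::Canonical];
    if should_bundle {
        forms.extend([SchemaForm::VSCode, SchemaForm::Bundled]);
    }
    forms
}

/// Names of the form-specific transforms applied after the shared steps
/// (setting `$id`, canonicalizing refs and defs, syncing bundled versions).
fn form_specific_transforms(form: SchemaForm) -> &'static [&'static str] {
    match form {
        SchemaForm::Canonical => &["remove_vs_code_keywords", "remove_bundled_schema_resources"],
        SchemaForm::Bundled => &["remove_vs_code_keywords"],
        SchemaForm::VSCode => &["urlencode_defs_keys"],
    }
}
```

Keeping the shared steps first and branching only on the last step means every form starts from the same canonicalized schema.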
Prior to this change, we had no way to perform build tasks based on the Rust code itself. We could compile the code for different targets, but we couldn't use the `cargo build` subsystem to take any actions _on_ our code because that subsystem runs before compilation.

This change adapts the [`cargo xtask` pattern][01] for our work by:

- Defining a top-level folder for a new `xtask` crate, which defines a binary for various developmental and maintenance tasks that need access to Rust code.
- Adding an alias to the cargo configuration so that you can invoke `cargo xtask <command> <subcommand>` to use the `xtask` binary.
- Defining an initial task for exporting the main DSC schemas from the `dsc-lib` crate to disk. In this initial implementation it only exports the schemas for `vNext`.

The eventual intention is to be able to generate and export schemas as part of the build and release automation using xtasks. This change doesn't modify or affect general building, testing, and releasing steps for the various projects.

[01]: https://github.com/matklad/cargo-xtask
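For readers new to the pattern, the `<command> <subcommand>` dispatch can be sketched with the standard library alone. This is not the argument parser in `xtask/src/args.rs`; the `Command` enum and `parse_command()` function are hypothetical, and the input is the argument list with the program name already removed.

```rust
/// Hypothetical set of tasks; the PR describes a single `schema export` task.
#[derive(Debug, PartialEq, Eq)]
enum Command {
    SchemaExport,
}

/// Maps `<command> <subcommand>` arguments to a task.
fn parse_command(args: &[&str]) -> Result<Command, String> {
    match args {
        ["schema", "export"] => Ok(Command::SchemaExport),
        [] => Err("expected a command, such as `schema export`".to_string()),
        other => Err(format!("unrecognized command: {}", other.join(" "))),
    }
}
```

Because the `xtask` binary is an ordinary workspace member, each task it dispatches to can depend on the project's library crates and call their Rust APIs directly.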
Pull request overview
This pull request introduces initial support for exporting JSON Schemas from Rust code using the cargo xtask pattern. The implementation adds build task functionality that enables maintainers to export DSC schemas to disk with `cargo xtask schema export`, addressing issue #538 for canonicalizing schemas.
Changes:
- Adds a new `xtask` workspace member implementing the cargo xtask pattern for build tasks
- Implements schema transformation functions for canonicalizing, bundling, and VS Code-specific formatting
- Extends `DscRepoSchema` trait with schema generation and export methods
- Adds utility methods to `SchemaUtilityExtensions` for managing bundled schema resources
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| xtask/Cargo.toml | Defines the new xtask package with build task dependencies |
| xtask/src/main.rs | Entry point for xtask CLI with schema export command |
| xtask/src/args.rs | CLI argument parsing for xtask subcommands |
| xtask/src/schemas/export.rs | Core export logic with macro for exporting multiple type schemas |
| xtask/locales/en-us.toml | Localized strings for xtask CLI messages |
| xtask/.project.data.json | Project metadata marking xtask as TestOnly |
| lib/dsc-lib-jsonschema/src/vscode/transforms/urlencode_defs_keys.rs | VS Code-specific transform to URL-encode $defs keys for compatibility |
| lib/dsc-lib-jsonschema/src/vscode/transforms/remove_vs_code_keywords.rs | Transform to remove VS Code-specific keywords from schemas |
| lib/dsc-lib-jsonschema/src/vscode/transforms/mod.rs | Module exports for VS Code transforms |
| lib/dsc-lib-jsonschema/src/vscode/mod.rs | Adds transforms module to vscode namespace |
| lib/dsc-lib-jsonschema/src/transforms/remove_bundled_schema_resources.rs | Public transform to remove bundled schema resources from $defs |
| lib/dsc-lib-jsonschema/src/transforms/canonicalize_refs_and_defs.rs | Public transform to standardize bundled resource references and definitions |
| lib/dsc-lib-jsonschema/src/transforms/mod.rs | Exports new public transform functions |
| lib/dsc-lib-jsonschema/src/tests/vscode/transforms/*.rs | Test infrastructure for VS Code transforms |
| lib/dsc-lib-jsonschema/src/tests/schema_utility_extensions.rs | Unit tests for new utility extension methods |
| lib/dsc-lib-jsonschema/src/schema_utility_extensions.rs | New trait methods for bundled resource management and schema canonicalization |
| lib/dsc-lib-jsonschema/src/dsc_repo/mod.rs | Helper functions for schema version management and bundled resource synchronization |
| lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs | Enhanced DscRepoSchema trait with schema generation and export methods |
| data.build.json | Registers xtask in build data |
| Cargo.toml | Adds xtask to workspace members and platform groups |
| Cargo.lock | Lock file updates for xtask dependencies |
| .cargo/config.toml | Cargo alias enabling cargo xtask command |
Comments suppressed due to low confidence (15)
xtask/src/schemas/export.rs:29
- Using `unwrap()` on `parent()` assumes that the `CARGO_MANIFEST_DIR` always has a parent directory. While this is typically true, it's better to handle this case explicitly with an error message or `expect()` call that explains why the parent must exist.
p.parent().unwrap().into()
xtask/src/schemas/export.rs:37
- Using `unwrap()` on `parent()` can panic if the path has no parent. While this is unlikely given the path construction, consider using `expect()` with a descriptive message or proper error handling to make debugging easier if this assumption is violated.
let folder = path.parent().unwrap();
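One way to act on the two `parent()` suggestions is to turn the missing-parent case into an error value. This is only a sketch: the `parent_folder()` helper and its `String` error type are hypothetical, and the real code would presumably use the crate's own error type.

```rust
use std::path::{Path, PathBuf};

/// Returns the parent folder of `path`, or a descriptive error instead of panicking.
fn parent_folder(path: &Path) -> Result<PathBuf, String> {
    path.parent()
        .map(Path::to_path_buf)
        .ok_or_else(|| format!("path '{}' has no parent folder", path.display()))
}
```

A caller can then propagate the failure with `?` and report which path violated the assumption.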
lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:217
- Typo in documentation: "the the" should be "the". This appears in multiple method doc comments for the set_*_schema_id_uri methods.
/// Sets the `$id` for a schema to the the URI for the enhanced form of the schema with the
/// default prefix for a given version.
lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:238
- Typo in documentation: "the the" should be "the".
/// Sets the `$id` for a schema to the the URI for the canonical form of the schema with the
/// default prefix for a given version.
xtask/src/schemas/export.rs:45
- Unnecessary explicit `return` statement. In Rust, the last expression in a function is automatically returned. This line should be simplified to just `Ok(())` without the `return` keyword, which is the idiomatic Rust style.
return Ok(())
lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:325
- Typo in documentation: "Retuns" should be "Returns".
/// Retuns a vector of the [`SchemaForm`]s that are valid for the type.
lib/dsc-lib-jsonschema/src/dsc_repo/mod.rs:233
- The function clones the entire schema twice: once at line 225 and implicitly through canonicalize_refs_and_defs_for_bundled_resources which also clones. For large schemas with many bundled resources, this could be a performance concern. Consider optimizing by minimizing schema clones or using more efficient lookup strategies.
pub(crate) fn sync_bundled_resource_id_versions(schema: &mut Schema) {
// First get the root ID so we can update the bundled dsc repo schema resources.
let lookup_schema = &schema.clone();
let Some(schema_version_folder) = get_schema_id_version(lookup_schema) else {
return;
};
let replacement_pattern = regex::Regex::new(r"schemas/v(Next|\d+(\.\d+){0,2})/").unwrap();
let replacement_value = &format!("schemas/{schema_version_folder}/");
// Make sure we're working from canonicalized references and definitions:
schema.canonicalize_refs_and_defs_for_bundled_resources();
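The clone in the snippet above exists to read the version while the schema is later mutated. A possible alternative is to copy out only the version string, which ends the immutable borrow before any mutation begins. The sketch below is an assumption-laden illustration: `SchemaStandIn` and both functions are hypothetical, and plain string splitting replaces the regular expression, so it does not validate the version folder the way the original pattern does.

```rust
/// Hypothetical stand-in: a root `$id` plus the `$id`s of bundled resources.
struct SchemaStandIn {
    id: String,
    bundled_ids: Vec<String>,
}

/// Splits an `$id` into the text through "schemas/", the version folder,
/// and the remainder starting at the next '/'.
fn split_version_folder(id: &str) -> Option<(&str, &str, &str)> {
    const MARKER: &str = "schemas/";
    let start = id.find(MARKER)? + MARKER.len();
    let len = id[start..].find('/')?;
    Some((&id[..start], &id[start..start + len], &id[start + len..]))
}

/// Rewrites the version folder of every bundled `$id` to match the root `$id`.
fn sync_bundled_versions(schema: &mut SchemaStandIn) {
    // Copy out only the small version string; the borrow of `schema.id` ends here.
    let Some(version) = split_version_folder(&schema.id).map(|(_, v, _)| v.to_string()) else {
        return;
    };
    for id in &mut schema.bundled_ids {
        let updated = match split_version_folder(id.as_str()) {
            Some((prefix, _, rest)) => format!("{prefix}{version}{rest}"),
            None => continue,
        };
        *id = updated;
    }
}
```

Whether this matters in practice depends on schema size; for small schemas the original clone is likely harmless.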
lib/dsc-lib-jsonschema/src/schema_utility_extensions.rs:2000
- The `rename_defs_subschema` implementation rebuilds the entire `$defs` map by cloning all entries, even when only one key needs to be renamed. For schemas with many definitions, this is inefficient. Consider using a more efficient approach, such as removing the old key, cloning only the value, and inserting it with the new key.
fn rename_defs_subschema(&mut self, old_key: &str, new_key: &str) {
let Some(defs) = self.get_defs_mut() else {
return;
};
*defs = defs.iter_mut().map(|(k, v)| {
if k.as_str() == old_key {
(new_key.to_string(), v.clone())
} else {
(k.clone(), v.clone())
}
}).collect();
}
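The remove-then-insert approach suggested in this comment can be sketched on a standard map. This is an illustration, not a drop-in replacement: `rename_key()` is a hypothetical helper and `BTreeMap` stands in for the `$defs` map type. Two behavioral differences are worth weighing: on an insertion-ordered map the renamed entry would move to the end, whereas the rebuild above keeps its position, and an existing entry under the new key is overwritten.

```rust
use std::collections::BTreeMap;

/// Renames one key by moving its value, without cloning any other entry.
fn rename_key<V>(map: &mut BTreeMap<String, V>, old_key: &str, new_key: &str) {
    if old_key == new_key {
        return;
    }
    // `remove` hands back ownership of the value, so no clone is needed.
    if let Some(value) = map.remove(old_key) {
        map.insert(new_key.to_string(), value);
    }
}
```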
lib/dsc-lib-jsonschema/src/tests/vscode/transforms/urlencode_defs_keys.rs:2
- The test file for `urlencode_defs_keys` is empty (only contains the copyright header). The `vscodify_refs_and_defs` function in `urlencode_defs_keys.rs` lacks unit tests, unlike the `remove_vs_code_keywords` transform which has comprehensive tests. Consider adding tests to verify the URL encoding behavior and reference updates.
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
lib/dsc-lib-jsonschema/src/transforms/mod.rs:16
- The new public transform functions `canonicalize_refs_and_defs` and `remove_bundled_schema_resources` in the transforms module lack integration tests. While the underlying trait methods have unit tests, the public API functions should have integration tests following the pattern established by `idiomaticize_externally_tagged_enum` and `idiomaticize_string_enum`. Consider adding comprehensive integration test files for these transforms in `tests/integration/transforms/`.
mod canonicalize_refs_and_defs;
pub use canonicalize_refs_and_defs::canonicalize_refs_and_defs;
mod idiomaticize_externally_tagged_enum;
pub use idiomaticize_externally_tagged_enum::idiomaticize_externally_tagged_enum;
mod idiomaticize_string_enum;
pub use idiomaticize_string_enum::idiomaticize_string_enum;
mod remove_bundled_schema_resources;
pub use remove_bundled_schema_resources::remove_bundled_schema_resources;
xtask/src/main.rs:36
- The match expression can be simplified. Instead of matching the result and wrapping it in `XTaskError`, you can use the `?` operator directly since SchemaExportError already implements From for XTaskError. Change this to: `export_schemas(RecognizedSchemaVersion::VNext)?; Ok(())`
match export_schemas(RecognizedSchemaVersion::VNext) {
Ok(_) => Ok(()),
Err(e) => Err(XTaskError::SchemaExport(e)),
}
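The simplification relies on the `?` operator converting the error through a `From` implementation. The sketch below shows the mechanism with hypothetical stand-in types that reuse the names from the comment; it assumes, as the comment does, that the conversion exists, and the `fail` flag is only there to exercise both paths.

```rust
/// Hypothetical stand-in for the export error.
#[derive(Debug, PartialEq)]
struct SchemaExportError(String);

/// Hypothetical stand-in for the top-level task error.
#[derive(Debug, PartialEq)]
enum XTaskError {
    SchemaExport(SchemaExportError),
}

impl From<SchemaExportError> for XTaskError {
    fn from(err: SchemaExportError) -> Self {
        XTaskError::SchemaExport(err)
    }
}

/// Stand-in for the export step; `fail` forces the error path.
fn export_schemas(fail: bool) -> Result<(), SchemaExportError> {
    if fail {
        Err(SchemaExportError("disk full".to_string()))
    } else {
        Ok(())
    }
}

/// `?` converts `SchemaExportError` into `XTaskError` through the `From` impl.
fn run(fail: bool) -> Result<(), XTaskError> {
    export_schemas(fail)?;
    Ok(())
}
```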
lib/dsc-lib-jsonschema/src/dsc_repo/mod.rs:250
- Using `unwrap()` on a regex that's expected to match schema bundled resources can panic if the assumption is violated. Consider using `expect()` with a descriptive message explaining that this should always match DSC repo schema resource IDs with version folders, or handle the error case explicitly.
let definition = schema.get_defs_subschema_from_id_mut(resource_id).unwrap();
lib/dsc-lib-jsonschema/src/schema_utility_extensions.rs:1999
- The `rename_defs_subschema` implementation clones all map entries even when only one needs to be renamed. This is inefficient for large schemas. Consider using `BTreeMap` or iterating once to find and rename only the matching key, or using `entry` API to modify in place.
*defs = defs.iter_mut().map(|(k, v)| {
if k.as_str() == old_key {
(new_key.to_string(), v.clone())
} else {
(k.clone(), v.clone())
}
}).collect();
lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:261
- Typo in documentation: "the the" should be "the".
/// Sets the `$id` for a schema to the the URI for the bundled form of the schema with the
/// default prefix for a given version.
xtask/src/schemas/export.rs:28
- Using `unwrap()` on `std::env::var("CARGO_MANIFEST_DIR")` can cause a panic if the environment variable is not set. While this is typically set by Cargo during build, it's better to handle the error case explicitly with proper error reporting. Consider returning a Result or providing a more helpful panic message that explains the expected environment.
let p = PathBuf::from(std::env::var("CARGO_MANIFEST_DIR").unwrap());
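A sketch of the `Result`-returning variant suggested here follows. The `path_from_env()` helper and its `String` error are hypothetical; the point is only that the message can name the variable and explain the expected environment.

```rust
use std::env;
use std::path::PathBuf;

/// Reads a path from an environment variable, reporting which variable was missing.
fn path_from_env(name: &str) -> Result<PathBuf, String> {
    env::var(name).map(PathBuf::from).map_err(|err| {
        format!("could not read `{name}` ({err}); run this task through cargo so the variable is set")
    })
}
```

Calling `path_from_env("CARGO_MANIFEST_DIR")?` in place of the `unwrap()` would surface the failure as an ordinary error, provided the caller's error type can be built from the message.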
PR Summary
The primary purpose of this change is to enable exporting JSON Schemas from Rust code. This change adopts the `cargo xtask` pattern to enable build tasks that require access to Rust code in our libraries.

As a maintainer or contributor, you can invoke `cargo xtask schema export` to export the next version of the DSC schemas to disk in the `schemas/vNext` folder.

This change required implementing several helpers and transformers for the generated schemas. Each subset of functionality is captured in a separate commit.
Note that this change doesn't modify the build, test, or release steps for any existing projects. This is an initial implementation that will require continuous improvement until we are able to reliably export canonical schemas for the project from source code.
PR Context
As part of #538, we need to generate and export schemas from code to disk.