
(GH-538) Initial work for exporting schemas from code #1406

Open
michaeltlombardi wants to merge 5 commits into PowerShell:main from michaeltlombardi:gh-538/main/prep-gen-schema-export

@michaeltlombardi
Collaborator

PR Summary

The primary purpose of this change is to enable exporting JSON Schemas from Rust code. This change adopts the `cargo xtask` pattern to enable build tasks that require access to Rust code in our libraries.

As a maintainer or contributor, you can invoke `cargo xtask schema export` to export the next version of the DSC schemas to disk in the `schemas/vNext` folder.

This change required implementing several helpers and transformers for the generated schemas. Each subset of functionality is captured in a separate commit.

Note that this change doesn't modify the build, test, or release steps for any existing projects. This is an initial implementation that will require continuous improvement until we are able to reliably export canonical schemas for the project from source code.

PR Context

As part of #538, we need to generate and export schemas from code to disk.

This change defines several new utility methods for `schemars::Schema`:

- `get_bundled_schema_resource_defs_key()` to retrieve the key for a
  bundled schema resource in `$defs` by its `$id` URI.
- `remove_bundled_schema_resources()` to discover and remove entries in
  the `$defs` keyword that define bundled schema resources. This
  enables canonicalizing a generated schema for publishing.
- `rename_defs_subschema()` to rename an entry in the `$defs` keyword.
  This is used by other methods for canonicalizing bundled schema
  resources.
- `get_references_to_bundled_schema_resource()` to retrieve every
  unique reference to a given bundled schema resource. This is used by
  other methods to canonicalize bundled schema resource definitions and
  references to them.
- `to_value_with_stable_order()` to return a sorted map for exporting
  to disk. This is only intended for use in exporting where we want a
  stable order to minimize the work of comparing changed JSON Schemas.
  The `schemars` generator for JSON Schemas is not fully deterministic
  in the ordering of keys for a generated schema.
- `canonicalize_refs_and_defs()` to standardize how a schema references
  and bundles schema resources. Standardizing bundled schema resources
  and references to them simplifies munging the schemas for different
  forms.
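
As a rough illustration of the lookup and removal helpers listed above, the following sketch models `$defs` as a plain map. `SubSchema`, `bundled_resource_defs_key`, and `remove_bundled_resources` are hypothetical stand-ins written for this example; they are not the `schemars::Schema` extension methods from this change, and they use only the standard library.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a subschema: only the keyword this sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct SubSchema {
    /// The `$id` keyword; present only for bundled schema resources.
    id: Option<String>,
}

/// Returns the `$defs` key whose subschema declares the given `$id`, if any.
fn bundled_resource_defs_key<'a>(
    defs: &'a BTreeMap<String, SubSchema>,
    id_uri: &str,
) -> Option<&'a str> {
    defs.iter()
        .find(|(_, sub)| sub.id.as_deref() == Some(id_uri))
        .map(|(key, _)| key.as_str())
}

/// Removes every `$defs` entry that is a bundled schema resource,
/// which this sketch treats as any entry that declares `$id`.
fn remove_bundled_resources(defs: &mut BTreeMap<String, SubSchema>) {
    defs.retain(|_, sub| sub.id.is_none());
}
```

A `BTreeMap` also iterates in sorted key order, which is one simple way to get the stable ordering that `to_value_with_stable_order()` is described as providing.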

This change includes initial testing and reference documentation for
the new utility methods.

This change defines two new transforms for managing schemas with bundled
schema resources:

- `canonicalize_refs_and_defs()` to munge the keys for entries in
  `$defs` that contain bundled schema resources and update all
  references to bundled schema resources to use the resource's `$id` as the
  reference, pointing to the absolute URI instead of using a relative
  URI or fragment pointer URI.
- `remove_bundled_schema_resources()` to remove every entry in the
  `$defs` keyword that is a bundled schema resource (defines the `$id`
  keyword). This enables slimming down the canonical form of schemas
  for publishing.
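
The canonicalization described above can be sketched against a simplified model. Here `SketchSchema` and `canonicalize_sketch` are hypothetical names for this example only: the model flattens a schema to its `$defs` keys and its `$ref` values, and assumes every fragment reference has the form `#/$defs/<key>`. The real transform operates on `schemars::Schema` and may handle more cases.

```rust
use std::collections::BTreeMap;

/// Hypothetical, flattened view of a schema for illustration.
#[derive(Debug, Default)]
struct SketchSchema {
    /// `$defs` entries: key mapped to the optional `$id` of the subschema.
    defs: BTreeMap<String, Option<String>>,
    /// Every `$ref` value found in the schema.
    refs: Vec<String>,
}

/// Re-keys bundled resources under their `$id` and rewrites fragment
/// references to those resources into absolute URIs.
fn canonicalize_sketch(schema: &mut SketchSchema) {
    // Collect (old key, $id) pairs first so the map is not mutated while iterating.
    let renames: Vec<(String, String)> = schema
        .defs
        .iter()
        .filter_map(|(key, id)| id.clone().map(|id| (key.clone(), id)))
        .collect();

    for (old_key, id) in renames {
        // Move the definition so that its key is its `$id`.
        if let Some(value) = schema.defs.remove(&old_key) {
            schema.defs.insert(id.clone(), value);
        }
        // Replace fragment pointers to the old key with the absolute URI.
        let pointer = format!("#/$defs/{old_key}");
        for reference in schema.refs.iter_mut() {
            if *reference == pointer {
                *reference = id.clone();
            }
        }
    }
}
```

Definitions without `$id` are left alone, so local helper definitions keep their original keys and fragment references.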

This change provides reference documentation for the new transforms.
Because the transforms are just invoking the newly defined, documented,
and tested schema utility methods, this change doesn't duplicate those
tests for the transforms.

This change introduces two new transforms specific to VS Code:

- `remove_vs_code_keywords()` to recursively remove keywords that belong
  to the VS Code extended vocabulary from a schema. This enables
  publishing canonical and canonically bundled forms of the schema.
- `urlencode_defs_keys()` to update the keys for entries in `$defs` that
  contain characters that need to be URL encoded for VS Code to
  correctly discover references. The JSON Schema integration in VS Code
  doesn't discover canonically bundled schema resources. Instead,
  references need to point to `#/$defs/<key>`. This transformer munges
  the definition keys and references to them so that VS Code can
  resolve the references.
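
To make the URL-encoding step concrete, here is a minimal standard-library sketch. It assumes that every byte outside the RFC 3986 unreserved set is percent-encoded; the exact character set the `urlencode_defs_keys()` transform encodes is not stated in this description, and `percent_encode` and `vscode_defs_ref` are hypothetical helper names.

```rust
/// Percent-encodes every byte outside the RFC 3986 unreserved set
/// (ASCII letters, digits, `-`, `.`, `_`, `~`).
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else becomes `%XX` with uppercase hex digits.
            other => out.push_str(&format!("%{other:02X}")),
        }
    }
    out
}

/// Builds a `#/$defs/<key>` fragment reference for an encoded key.
fn vscode_defs_ref(key: &str) -> String {
    format!("#/$defs/{}", percent_encode(key))
}
```

Under these assumptions, a key that is an absolute `$id` URI turns into a single path segment, so a reference such as `#/$defs/https%3A%2F%2F...` no longer contains raw `/` characters inside the key.
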

This change enhances the functionality of the `DscRepoSchema` trait for
exporting the schemas by defining the following associated functions
and methods:

- `default_export_schema_id_uri()` to emit the `$id` URI for the
  `vNext` version of the canonical-form schema on GitHub. This should
  be inserted into the JSON Schema for every DSC repo schema as the
  `$id` keyword, either with the `schemars` attribute or in the
  manual implementation for the `JsonSchema` trait.
- `default_export_meta_schema_uri()` to emit the `$schema` URI for the
  schema. Currently always returns the 2020-12 URI. In the future,
  when we support other meta schemas, this should be overridden.

  This should be inserted into the JSON Schema for every DSC repo
  schema as the `$schema` keyword, either with the `schemars` attribute
  or in the manual implementation for the `JsonSchema` trait.
- `generate_exportable_schema()` to return a `Schema` for a given
  version and form. The prefix URI for an exported schema is _always_
  the GitHub URI, because the `aka.ms` prefix is a user-convenience -
  the actual hosted schemas always live on GitHub.

  This method is a slightly more ergonomic API for exporting over using
  the `generate_schema()` method directly and always specifying the
  GitHub schema form.
- `generate_schema()` to return a `Schema` with the given version, form,
  and URI prefix.

  This associated function retrieves the schema with `schemars` and
  then:

  1. Sets the `$id` for the schema to match the version, form and URI
     prefix.
  1. Canonicalizes the references and definitions for the schema so
     that the key for bundled schema resources is always their `$id`
     URI and references to bundled schema resources are always absolute
     URIs instead of fragment pointers.
  1. Syncs the versions for all bundled resources to match the current
     version of the root schema.
  1. Munges the schema as needed for the given form by:

     - Removing VS Code keywords and bundled resources from canonical
       schemas.
     - Removing VS Code keywords from bundled schemas.
     - Munging references and definitions in VS Code schemas to enable
       VS Code to correctly resolve references to bundled schema
       resources.
  1. Returns the munged schema.
- `set_schema_id_uri()` to define the `$id` keyword for a given schema
  to the correct URI for a given version, form, and URI prefix.
- `get_schema_relative_path()` to return a `PathBuf` pointing to the
  export location for a given version and form of the schema, relative
  to the `schemas` folder. This enables exporting to disk.
- `set_enhanced_schema_id_uri()` as an ergonomic alternative to
  `set_schema_id_uri()` for a given version of the schema when working
  with the VS Code enhanced form.
- `set_canonical_schema_id_uri()` as an ergonomic alternative to
  `set_schema_id_uri()` for a given version of the schema when working
  with the canonical form.
- `set_bundled_schema_id_uri()` as an ergonomic alternative to
  `set_schema_id_uri()` for a given version of the schema when working
  with the canonically bundled form.
- `get_valid_schema_forms()` to return a vector containing the valid
  forms for a schema. The returned vector always includes the canonical
  form. If the schema is for a type that should provide a bundled form,
  the vector also includes the VS Code and Bundled forms.
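
The relationship between forms and export paths might look like the sketch below. The `Form` enum, the `bundled` subfolder, and the `.vscode.json` suffix are assumptions made for illustration, as are the function names and parameters; the actual `SchemaForm` type and the folder layout are defined by the crate, not by this example.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for the schema forms named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Form {
    Canonical,
    Bundled,
    VsCode,
}

/// The canonical form is always valid; types that provide a bundled form
/// also get the bundled and VS Code forms. The ordering is an assumption.
fn valid_forms(should_bundle: bool) -> Vec<Form> {
    if should_bundle {
        vec![Form::Canonical, Form::Bundled, Form::VsCode]
    } else {
        vec![Form::Canonical]
    }
}

/// Builds an export path relative to the `schemas` folder.
/// Assumed layout: `<version>/<name>.json` for canonical schemas and
/// `<version>/bundled/<name>[.vscode].json` for the other forms.
fn relative_path(version_folder: &str, form: Form, base_name: &str) -> PathBuf {
    let mut path = PathBuf::from(version_folder);
    match form {
        Form::Canonical => path.push(format!("{base_name}.json")),
        Form::Bundled => {
            path.push("bundled");
            path.push(format!("{base_name}.json"));
        }
        Form::VsCode => {
            path.push("bundled");
            path.push(format!("{base_name}.vscode.json"));
        }
    }
    path
}
```

An exporter could iterate over `valid_forms(..)` and write one file per form at `relative_path(..)`.
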

Prior to this change, we had no way to perform build tasks based on
the Rust code itself. We could compile the code for different targets,
but we couldn't use the `cargo build` subsystem to take any actions
_on_ our code because that subsystem runs before compilation.

This change adapts the [`cargo xtask` pattern][01] for our work by:

- Defining a top-level folder for a new `xtask` crate, which defines
  a binary for various developmental and maintenance tasks that need
  access to Rust code.
- Adding an alias to the cargo configuration so that you can invoke
  `cargo xtask <command> <subcommand>` to use the `xtask` binary.
- Defining an initial task for exporting the main DSC schemas from the
  `dsc-lib` crate to disk. In this initial implementation it only
  exports the schemas for `vNext`. The eventual intention is to be able
  to generate and export schemas as part of the build and release
  automation using xtasks.
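
For readers new to the pattern, the dispatch inside an `xtask` binary can be as small as the sketch below. It uses only the standard library and hypothetical names (`Task`, `parse_task`); the actual `xtask` crate in this change has its own argument parsing in `xtask/src/args.rs`, which may differ.

```rust
/// Hypothetical set of tasks; this change describes only a schema export task.
#[derive(Debug, PartialEq, Eq)]
enum Task {
    SchemaExport,
}

/// Maps `<command> <subcommand>` arguments to a task.
fn parse_task(args: &[&str]) -> Result<Task, String> {
    match args {
        ["schema", "export"] => Ok(Task::SchemaExport),
        [] => Err("expected a command, for example `schema export`".to_string()),
        other => Err(format!("unrecognized task: {}", other.join(" "))),
    }
}
```

In a real binary, `main` would collect `std::env::args().skip(1)` into a vector, pass the borrowed strings to a function like `parse_task`, and run the matching task. The cargo alias then makes `cargo xtask schema export` run that binary with `schema export` as its arguments.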

This change doesn't modify or affect general building, testing, and
releasing steps for the various projects.

[01]: https://github.com/matklad/cargo-xtask
Contributor

Copilot AI left a comment

Pull request overview

This pull request introduces initial support for exporting JSON Schemas from Rust code using the cargo xtask pattern. The implementation adds build task functionality that enables maintainers to export DSC schemas to disk with cargo xtask schema export, addressing issue #538 for canonicalizing schemas.

Changes:

  • Adds a new xtask workspace member implementing the cargo xtask pattern for build tasks
  • Implements schema transformation functions for canonicalizing, bundling, and VS Code-specific formatting
  • Extends DscRepoSchema trait with schema generation and export methods
  • Adds utility methods to SchemaUtilityExtensions for managing bundled schema resources

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `xtask/Cargo.toml` | Defines the new xtask package with build task dependencies |
| `xtask/src/main.rs` | Entry point for xtask CLI with schema export command |
| `xtask/src/args.rs` | CLI argument parsing for xtask subcommands |
| `xtask/src/schemas/export.rs` | Core export logic with macro for exporting multiple type schemas |
| `xtask/locales/en-us.toml` | Localized strings for xtask CLI messages |
| `xtask/.project.data.json` | Project metadata marking xtask as TestOnly |
| `lib/dsc-lib-jsonschema/src/vscode/transforms/urlencode_defs_keys.rs` | VS Code-specific transform to URL-encode `$defs` keys for compatibility |
| `lib/dsc-lib-jsonschema/src/vscode/transforms/remove_vs_code_keywords.rs` | Transform to remove VS Code-specific keywords from schemas |
| `lib/dsc-lib-jsonschema/src/vscode/transforms/mod.rs` | Module exports for VS Code transforms |
| `lib/dsc-lib-jsonschema/src/vscode/mod.rs` | Adds transforms module to vscode namespace |
| `lib/dsc-lib-jsonschema/src/transforms/remove_bundled_schema_resources.rs` | Public transform to remove bundled schema resources from `$defs` |
| `lib/dsc-lib-jsonschema/src/transforms/canonicalize_refs_and_defs.rs` | Public transform to standardize bundled resource references and definitions |
| `lib/dsc-lib-jsonschema/src/transforms/mod.rs` | Exports new public transform functions |
| `lib/dsc-lib-jsonschema/src/tests/vscode/transforms/*.rs` | Test infrastructure for VS Code transforms |
| `lib/dsc-lib-jsonschema/src/tests/schema_utility_extensions.rs` | Unit tests for new utility extension methods |
| `lib/dsc-lib-jsonschema/src/schema_utility_extensions.rs` | New trait methods for bundled resource management and schema canonicalization |
| `lib/dsc-lib-jsonschema/src/dsc_repo/mod.rs` | Helper functions for schema version management and bundled resource synchronization |
| `lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs` | Enhanced DscRepoSchema trait with schema generation and export methods |
| `data.build.json` | Registers xtask in build data |
| `Cargo.toml` | Adds xtask to workspace members and platform groups |
| `Cargo.lock` | Lock file updates for xtask dependencies |
| `.cargo/config.toml` | Cargo alias enabling cargo xtask command |

Comments suppressed due to low confidence (15)

xtask/src/schemas/export.rs:29

  • Using unwrap() on parent() assumes that the CARGO_MANIFEST_DIR always has a parent directory. While this is typically true, it's better to handle this case explicitly with an error message or expect() call that explains why the parent must exist.
    p.parent().unwrap().into()

xtask/src/schemas/export.rs:37

  • Using unwrap() on parent() can panic if the path has no parent. While this is unlikely given the path construction, consider using expect() with a descriptive message or proper error handling to make debugging easier if this assumption is violated.
    let folder = path.parent().unwrap();

lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:217

  • Typo in documentation: "the the" should be "the". This appears in multiple method doc comments for the set_*_schema_id_uri methods.
    /// Sets the `$id` for a schema to the the URI for the enhanced form of the schema with the
    /// default prefix for a given version.

lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:238

  • Typo in documentation: "the the" should be "the".
    /// Sets the `$id` for a schema to the the URI for the canonical form of the schema with the
    /// default prefix for a given version.

xtask/src/schemas/export.rs:45

  • Unnecessary explicit return statement. In Rust, the last expression in a function is automatically returned. This line should be simplified to just Ok(()) without the return keyword, which is the idiomatic Rust style.
    return Ok(())

lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:325

  • Typo in documentation: "Retuns" should be "Returns".
    /// Retuns a vector of the [`SchemaForm`]s that are valid for the type.

lib/dsc-lib-jsonschema/src/dsc_repo/mod.rs:233

  • The function clones the entire schema twice: once at line 225 and implicitly through canonicalize_refs_and_defs_for_bundled_resources which also clones. For large schemas with many bundled resources, this could be a performance concern. Consider optimizing by minimizing schema clones or using more efficient lookup strategies.
pub(crate) fn sync_bundled_resource_id_versions(schema: &mut Schema) {
    // First get the root ID so we can update the bundled dsc repo schema resources.
    let lookup_schema = &schema.clone();
    let Some(schema_version_folder) = get_schema_id_version(lookup_schema) else {
        return;
    };
    let replacement_pattern = regex::Regex::new(r"schemas/v(Next|\d+(\.\d+){0,2})/").unwrap();
    let replacement_value = &format!("schemas/{schema_version_folder}/");

    // Make sure we're working from canonicalized references and definitions:
    schema.canonicalize_refs_and_defs_for_bundled_resources();

lib/dsc-lib-jsonschema/src/schema_utility_extensions.rs:2000

  • The rename_defs_subschema implementation rebuilds the entire $defs map by cloning all entries, even when only one key needs to be renamed. For schemas with many definitions, this is inefficient. Consider using a more efficient approach, such as removing the old key, cloning only the value, and inserting it with the new key.
    fn rename_defs_subschema(&mut self, old_key: &str, new_key: &str) {
        let Some(defs) = self.get_defs_mut() else {
            return;
        };

        *defs = defs.iter_mut().map(|(k, v)| {
            if k.as_str() == old_key {
                (new_key.to_string(), v.clone())
            } else {
                (k.clone(), v.clone())
            }
        }).collect();
    }
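
One way to act on this suggestion is to move the single value instead of rebuilding the map. The sketch below uses `std::collections::BTreeMap` as a stand-in for the `$defs` map, and `rename_key` is a hypothetical helper, not the method under review.

```rust
use std::collections::BTreeMap;

/// Renames one key by moving its value; no other entry is cloned.
/// Returns `true` when `old_key` existed.
fn rename_key<V>(map: &mut BTreeMap<String, V>, old_key: &str, new_key: &str) -> bool {
    if old_key == new_key {
        // Nothing to move; report whether the key is present.
        return map.contains_key(old_key);
    }
    match map.remove(old_key) {
        Some(value) => {
            // Note: this overwrites any existing entry stored under `new_key`.
            map.insert(new_key.to_string(), value);
            true
        }
        None => false,
    }
}
```

If the underlying map preserves insertion order, a remove followed by an insert moves the renamed entry to the end, which may be why the original rebuilds the map in place. That trade-off is worth checking before adopting this approach.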

lib/dsc-lib-jsonschema/src/tests/vscode/transforms/urlencode_defs_keys.rs:2

  • The test file for urlencode_defs_keys is empty (only contains the copyright header). The vscodify_refs_and_defs function in urlencode_defs_keys.rs lacks unit tests, unlike the remove_vs_code_keywords transform which has comprehensive tests. Consider adding tests to verify the URL encoding behavior and reference updates.
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

lib/dsc-lib-jsonschema/src/transforms/mod.rs:16

  • The new public transform functions canonicalize_refs_and_defs and remove_bundled_schema_resources in the transforms module lack integration tests. While the underlying trait methods have unit tests, the public API functions should have integration tests following the pattern established by idiomaticize_externally_tagged_enum and idiomaticize_string_enum. Consider adding comprehensive integration test files for these transforms in tests/integration/transforms/.
mod canonicalize_refs_and_defs;
pub use canonicalize_refs_and_defs::canonicalize_refs_and_defs;
mod idiomaticize_externally_tagged_enum;
pub use idiomaticize_externally_tagged_enum::idiomaticize_externally_tagged_enum;
mod idiomaticize_string_enum;
pub use idiomaticize_string_enum::idiomaticize_string_enum;
mod remove_bundled_schema_resources;
pub use remove_bundled_schema_resources::remove_bundled_schema_resources;

xtask/src/main.rs:36

  • The match expression can be simplified. Instead of matching the result and wrapping it in XTaskError, you can use the ? operator directly since SchemaExportError already implements From for XTaskError. Change this to: export_schemas(RecognizedSchemaVersion::VNext)?; Ok(())
                match export_schemas(RecognizedSchemaVersion::VNext) {
                    Ok(_) => Ok(()),
                    Err(e) => Err(XTaskError::SchemaExport(e)),
                }
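
The simplification relies on the `?` operator calling `From::from` on the error. The sketch below shows the mechanism with stand-in definitions: the type and function names mirror the ones in this comment, but their bodies, including the `fail` parameter, are assumptions for illustration only.

```rust
/// Stand-in for the export error type named in the comment.
#[derive(Debug, PartialEq)]
struct SchemaExportError(String);

/// Stand-in for the top-level error type named in the comment.
#[derive(Debug, PartialEq)]
enum XTaskError {
    SchemaExport(SchemaExportError),
}

impl From<SchemaExportError> for XTaskError {
    fn from(error: SchemaExportError) -> Self {
        XTaskError::SchemaExport(error)
    }
}

/// Hypothetical export function; `fail` only simulates an error here.
fn export_schemas(fail: bool) -> Result<(), SchemaExportError> {
    if fail {
        Err(SchemaExportError("simulated failure".to_string()))
    } else {
        Ok(())
    }
}

/// `?` converts `SchemaExportError` into `XTaskError` through `From`.
fn run(fail: bool) -> Result<(), XTaskError> {
    export_schemas(fail)?;
    Ok(())
}
```

The behavior is the same as the explicit `match`; the conversion just happens implicitly at the `?`.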

lib/dsc-lib-jsonschema/src/dsc_repo/mod.rs:250

  • Using unwrap() on a regex that's expected to match schema bundled resources can panic if the assumption is violated. Consider using expect() with a descriptive message explaining that this should always match DSC repo schema resource IDs with version folders, or handle the error case explicitly.
        let definition = schema.get_defs_subschema_from_id_mut(resource_id).unwrap();

lib/dsc-lib-jsonschema/src/schema_utility_extensions.rs:1999

  • The rename_defs_subschema implementation clones all map entries even when only one needs to be renamed. This is inefficient for large schemas. Consider using BTreeMap or iterating once to find and rename only the matching key, or using entry API to modify in place.
        *defs = defs.iter_mut().map(|(k, v)| {
            if k.as_str() == old_key {
                (new_key.to_string(), v.clone())
            } else {
                (k.clone(), v.clone())
            }
        }).collect();

lib/dsc-lib-jsonschema/src/dsc_repo/dsc_repo_schema.rs:261

  • Typo in documentation: "the the" should be "the".
    /// Sets the `$id` for a schema to the the URI for the bundled form of the schema with the
    /// default prefix for a given version.

xtask/src/schemas/export.rs:28

  • Using unwrap() on std::env::var("CARGO_MANIFEST_DIR") can cause a panic if the environment variable is not set. While this is typically set by Cargo during build, it's better to handle the error case explicitly with proper error reporting. Consider returning a Result or providing a more helpful panic message that explains the expected environment.
    let p = PathBuf::from(std::env::var("CARGO_MANIFEST_DIR").unwrap());
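
A possible shape for the explicit error handling suggested here, and for the related `parent()` comments, is sketched below. `repo_root_from` is a hypothetical helper that takes the variable's value as a parameter so it can be exercised without touching the process environment; the error type is a plain `String` for brevity.

```rust
use std::path::{Path, PathBuf};

/// Resolves the parent folder of the manifest directory, reporting a
/// descriptive error instead of panicking.
fn repo_root_from(manifest_dir: Option<&str>) -> Result<PathBuf, String> {
    let dir = manifest_dir
        .ok_or_else(|| "CARGO_MANIFEST_DIR is not set; run this through cargo".to_string())?;
    Path::new(dir)
        .parent()
        .map(Path::to_path_buf)
        .ok_or_else(|| format!("`{dir}` has no parent directory"))
}
```

A caller could pass `std::env::var("CARGO_MANIFEST_DIR").ok().as_deref()` and surface the error through the task's own error type.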

@michaeltlombardi michaeltlombardi marked this pull request as ready for review February 25, 2026 18:50