Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions crates/weaver_semconv/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ glob = "0.3.3"
jsonschema = "0.33.0" # JSON Schema validation used to enhance error messages
saphyr = "0.0.6" # YAML parser preserving span information (location in file)
utoipa = { workspace = true, optional = true }
semver = "1.0.27"

[features]
openapi = ["utoipa"]
Expand Down
101 changes: 86 additions & 15 deletions crates/weaver_semconv/src/registry.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,37 @@ use crate::registry_repo::RegistryRepo;
use crate::semconv::{SemConvSpecV1WithProvenance, SemConvSpecWithProvenance};
use crate::stats::Stats;
use crate::Error;
use regex::Regex;
use std::collections::HashMap;
use std::path::Path;
use std::sync::LazyLock;
use weaver_common::result::WResult;

/// Extracts a semantic version from a path string.
///
/// This function searches through path segments (separated by '/') looking for
/// a segment that starts with 'v' followed by a valid semantic version.
///
/// # Arguments
///
/// * `path` - A path string that may contain a version segment (e.g., "/path/v1.26.0/file")
///
/// # Returns
///
/// * `Some(String)` - The version string including the 'v' prefix (e.g., "v1.26.0")
/// * `None` - If no valid semantic version is found in the path
fn extract_semconv_version(path: &str) -> Option<String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern here (and also one with the previous regex) is that the v is not required, see: https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/schemas#schema-url

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, gotcha. Reading the link you shared here, it also says that version is always the last part of the URL, so if I'm understanding correctly, I don't need to try to parse every segment and check if it's a valid semver... just the last segment is enough, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the documentation on schema_url is relevant here, because in this context Weaver does not derive a version from a schema_url, but from a semantic conventions registry (local, Git, or archive). For recent registries, Weaver retrieves the version directly from the manifest file, but for older registries this manifest does not exist, and in that case the version is derived from the archive path, for example. See the code of the from_semconv_specs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we need to be flexible. What I'm saying is that eventually I'd like for SchemaURL to work. Right now it's forcing the v as required, I think we should allow it as optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit lost here if in this PR we want to solve v being optional or not 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it wasn't optional before, you can leave it as is for now, I'll clean that up in a follow up.

For context - I'm working on new resolution specification where we can resolve "resolved schema" directly from a Schema URL - so I'll need to either merge in your PR and fix in my branch or have you fix it so my local fix doesn't get blown away.

I'd rather merge yours in and fix later if you're nervous about the amount of rust.

for segment in path.split('/') {
if segment.starts_with('v') && segment.chars().nth(1).is_some_and(|c| c.is_ascii_digit()) {
// Try to parse it with semver (strip the 'v' first)
if let Some(version_part) = segment.strip_prefix('v') {
if semver::Version::parse(version_part).is_ok() {
return Some(segment.to_owned());
}
}
}
}
None
}

/// A semantic convention registry is a collection of semantic convention
/// specifications indexed by group id.
#[derive(Default, Debug)]
Expand Down Expand Up @@ -125,10 +150,6 @@ impl SemConvRegistry {
registry_repo: &RegistryRepo,
semconv_specs: Vec<SemConvSpecWithProvenance>,
) -> Result<SemConvRegistry, Error> {
// ToDo We should use: https://docs.rs/semver/latest/semver/ and URL parser that can give us the last element of the path to send to the parser.
static VERSION_REGEX: LazyLock<Regex> =
LazyLock::new(|| Regex::new(r".*(v\d+\.\d+\.\d+).*").expect("Invalid regex"));

// Load all the semantic convention registry.
let mut registry = SemConvRegistry::new(registry_repo.id().as_ref());

Expand All @@ -137,16 +158,11 @@ impl SemConvRegistry {
}

if registry_repo.manifest().is_none() {
let mut semconv_version = "unversioned".to_owned();

// No registry manifest found.
// Try to infer the manifest from the registry path by detecting the
// presence of the following pattern in the registry path: v\d+\.\d+\.\d+.
if let Some(captures) = VERSION_REGEX.captures(registry_repo.registry_path_repr()) {
if let Some(captured_text) = captures.get(1) {
semconv_version = captured_text.as_str().to_owned();
}
}
// Try to infer the manifest from the registry path by detecting a
// valid semantic version segment (e.g., v1.26.0) in the path.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the comment here since semver accepts more than just v\d+\.\d+\.\d+, e.g. release candidates

let semconv_version = extract_semconv_version(registry_repo.registry_path_repr())
.unwrap_or_else(|| "unversioned".to_owned());

registry.set_manifest(RegistryManifest {
name: registry_repo.id().as_ref().to_owned(),
Expand Down Expand Up @@ -447,4 +463,59 @@ mod tests {
.collect::<Vec<_>>();
assert_eq!(groups.len(), 3);
}

#[test]
fn test_extract_semconv_version() {
use super::extract_semconv_version;

// Test with GitHub release URL pattern
assert_eq!(
extract_semconv_version(
"https://github.com/open-telemetry/semantic-conventions/releases/download/v1.26.0/semantic-conventions.zip"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a list of file format suffixes that Weaver allows? Semver won't be able to parse those suffixes, so I believe I'll have to manually remove this suffixes before parsing the version

),
Some("v1.26.0".to_owned())
);

// Test with local path containing version
assert_eq!(
extract_semconv_version("/some/path/v1.2.3/registry"),
Some("v1.2.3".to_owned())
);

// Test with version at the end of path
assert_eq!(
extract_semconv_version("/path/to/semantic-conventions-v2.0.0"),
None // "v2.0.0" is not a standalone segment
);

// Test with no version in path
assert_eq!(
extract_semconv_version("/some/path/without/version"),
None
);

// Test with invalid semver (missing patch version)
assert_eq!(extract_semconv_version("/path/v1.2/registry"), None);

// Test with pre-release version
assert_eq!(
extract_semconv_version("/path/v1.0.0-alpha.1/registry"),
Some("v1.0.0-alpha.1".to_owned())
);

// Test with build metadata
assert_eq!(
extract_semconv_version("/path/v1.0.0+build.123/registry"),
Some("v1.0.0+build.123".to_owned())
);

// Test empty path
assert_eq!(extract_semconv_version(""), None);

// Test path with only slashes
assert_eq!(extract_semconv_version("///"), None);

// Test version-like but not starting with 'v'
assert_eq!(extract_semconv_version("/path/1.2.3/registry"), None);
}
}
Loading