Fix/csv import authority separator parsing #544

Open
Hazel-0 wants to merge 2 commits into 4Science:main-cris from Hazel-0:fix/csv-import-authority-separator-parsing
Conversation

Hazel-0 commented Feb 17, 2026

  • Enhanced resolveValueAndAuthority() to handle authorities containing ::
  • Fixes NumberFormatException when parsing values like: value::will be referenced::ORCID::0000-0002-5474-1918::600
  • Properly handles 2-part, 3-part, and 4+ part formats
  • Maintains backward compatibility with existing CSV imports

References

Description

Fixes CSV metadata import failure when authority-controlled metadata values contain the authority separator (::) within the authority string itself, causing a NumberFormatException during parsing.

Instructions for Reviewers

This PR fixes a bug in the CSV metadata import functionality where the import fails with a NumberFormatException when processing metadata values where the authority string itself contains the authority separator (::). This commonly occurs with ORCID and ROR-ID authority references (e.g., Fischer, Frank::will be referenced::ORCID::0000-0002-5474-1918::600).

The resolveValueAndAuthority() method in MetadataImport.java assumed that the format is always value::authority::confidence (exactly 3 parts) and that the authority never contains the separator itself. When a value carries an authority like will be referenced::ORCID::0000-0002-5474-1918, splitting it by :: produces more than 3 parts, so a fragment of the authority ends up in the position where the numeric confidence is expected and parsing fails with a NumberFormatException.
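The failure mode can be reproduced outside DSpace with a minimal sketch. The class SeparatorSplitDemo and the method naiveConfidence() are hypothetical names used only for illustration; they stand in for the 3-part assumption described above and are not the actual code from MetadataImport.java.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of DSpace.
class SeparatorSplitDemo {
    // Assumption: the authority separator is the literal "::" used in this PR.
    static final String SEPARATOR = "::";

    // Split the raw CSV cell on the literal separator.
    static String[] split(String raw) {
        return raw.split(Pattern.quote(SEPARATOR));
    }

    // Stand-in for the old assumption: the third part is always the confidence.
    static int naiveConfidence(String raw) {
        String[] parts = split(raw);
        return Integer.parseInt(parts[2]);
    }

    public static void main(String[] args) {
        String raw = "Fischer, Frank::will be referenced::ORCID::0000-0002-5474-1918::600";
        // Five parts instead of the expected three.
        System.out.println(Arrays.toString(split(raw)));
        try {
            naiveConfidence(raw);
        } catch (NumberFormatException e) {
            // The third part is "ORCID", which is not a number.
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

The example value splits into five parts, and the part in the assumed confidence position is ORCID rather than 600.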

List of changes in this PR:

  • Enhanced resolveValueAndAuthority() method to correctly handle authorities containing separators by implementing logic to reconstruct authority strings from multiple parts
  • Changed minimum parts check from < 3 to < 2 to properly handle 2-part format (value::authority) which was previously ignored
  • Added explicit handling for 2-part format: sets authority and uses CF_ACCEPTED as default confidence (consistent with existing behavior when authority is provided)
  • Added logic to reconstruct authority strings that contain separators by combining middle parts when there are 4+ parts after splitting
  • Improved error handling with try-catch for confidence parsing to gracefully handle cases where the last part is not numeric
  • Added comprehensive code comments explaining the parsing logic for different part counts (2, 3, and 4+ parts)
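For reviewers who want the parsing rules in one place, the changes listed above can be sketched roughly as follows. This is a simplified standalone illustration, not the code in this PR: the class AuthorityParseSketch, the Parsed holder, and parse() are hypothetical names, the constants are assumed to mirror Choices.CF_ACCEPTED and Choices.CF_UNSET, and the confidence used when the last part is not numeric is an assumption, since the description only says that all parts are treated as authority.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical standalone sketch of the parsing rules; not the actual DSpace code.
class AuthorityParseSketch {
    static final String SEPARATOR = "::";
    // Assumed to mirror Choices.CF_ACCEPTED and Choices.CF_UNSET.
    static final int CF_ACCEPTED = 600;
    static final int CF_UNSET = -1;

    // Simple holder for the three components of a CSV cell.
    static final class Parsed {
        final String value;
        final String authority; // null when no authority was given
        final int confidence;

        Parsed(String value, String authority, int confidence) {
            this.value = value;
            this.authority = authority;
            this.confidence = confidence;
        }
    }

    static Parsed parse(String raw) {
        String[] parts = raw.split(Pattern.quote(SEPARATOR));
        if (parts.length < 2) {
            // Fewer than two parts: treat the whole string as a plain value (assumption).
            return new Parsed(raw, null, CF_UNSET);
        }
        String value = parts[0];
        if (parts.length == 2) {
            // value::authority, with CF_ACCEPTED as the default confidence.
            return new Parsed(value, parts[1], CF_ACCEPTED);
        }
        // 3 or more parts: try to read the last part as the confidence.
        String last = parts[parts.length - 1];
        try {
            int confidence = Integer.parseInt(last.trim());
            // Everything between the value and the confidence is the authority.
            String authority = String.join(SEPARATOR,
                    Arrays.copyOfRange(parts, 1, parts.length - 1));
            return new Parsed(value, authority, confidence);
        } catch (NumberFormatException e) {
            // Last part is not numeric: everything after the value is the authority.
            // Using CF_ACCEPTED here is an assumption of this sketch.
            String authority = String.join(SEPARATOR,
                    Arrays.copyOfRange(parts, 1, parts.length));
            return new Parsed(value, authority, CF_ACCEPTED);
        }
    }

    public static void main(String[] args) {
        Parsed p = parse("Fischer, Frank::will be referenced::ORCID::0000-0002-5474-1918::600");
        System.out.println(p.value + " | " + p.authority + " | " + p.confidence);
    }
}
```

Reading the confidence from the last part, rather than from a fixed index, is what lets the authority contain any number of separators while the standard 3-part format keeps parsing as before.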

How to test this PR:

  1. Prepare a CSV file with authority-controlled metadata values containing separators in the authority:

    dc.contributor.author,"Fischer, Frank::will be referenced::ORCID::0000-0002-5474-1918::600"
    dc.contributor.author,"Chemnitz University of Technology::will be referenced::ROR-ID::https://ror.org/00a208s56::600"
  2. Run the CSV metadata import via the DSpace admin interface or command line:

    [dspace]/bin/dspace metadata-import -f /path/to/test.csv -e admin@example.com
  3. Verify the import succeeds without throwing NumberFormatException

  4. Verify the metadata values are correctly imported with:

    • Correct authority strings preserved (e.g., will be referenced::ORCID::0000-0002-5474-1918)
    • Correct confidence values (e.g., 600)

Test Cases Covered:

  • Standard 3-part format: value::authority::600
  • 2-part format: value::authority (now correctly sets authority with CF_ACCEPTED)
  • ORCID authority containing the separator, e.g.: Fischer, Frank::will be referenced::ORCID::0000-0002-5474-1918::600
  • ROR-ID authority containing the separator, e.g.: Chemnitz University of Technology::will be referenced::ROR-ID::https://ror.org/00a208s56::600
  • Authority without numeric confidence: value::authority::with::separators (everything after the value is treated as the authority)

Backward Compatibility:
Fully backward compatible - the fix maintains all existing behavior for standard 3-part format while adding support for edge cases.

Checklist

  • My PR is created against the main branch of code (unless it is a backport or is fixing an issue specific to an older branch).
  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & integration tests). Exceptions may be made if previously agreed upon.
  • My PR passes Checkstyle validation based on the Code Style Guide.
  • My PR includes Javadoc for all new (or modified) public methods and classes. It also includes Javadoc for large or complex private methods.
  • My PR passes all tests and includes new/updated Unit or Integration Tests based on the Code Testing Guide.
  • My PR includes details on how to test it. I've provided clear instructions to reviewers on how to successfully test this fix or feature.
  • If my PR fixes an issue ticket, I've linked them together


Development

Successfully merging this pull request may close these issues.

metadata-import fails on person entities if CSV contains ROR-ID values