Refactor and optimize the Export metadata framework, especially for the data/variable-level metadata #11405
Labels
FY25 Sprint 24 (2025-05-21 - 2025-06-04); FY25 Sprint 25 (2025-06-04 - 2025-06-18); FY25 Sprint 26 (2025-06-18 - 2025-07-02); FY26 Sprint 1 (2025-07-02 - 2025-07-16); FY26 Sprint 2 (2025-07-16 - 2025-07-30); FY26 Sprint 3 (2025-07-30 - 2025-08-13); FY26 Sprint 4 (2025-08-13 - 2025-08-27); FY26 Sprint 5 (2025-08-27 - 2025-09-10); Size: 80 (a percentage of a sprint; 56 hours)
Historically, Dataverse has supported one format that encodes all the available DataVariable-level metadata from DataTable objects: the "full DDI". Our Exporters write the metadata into an OutputStream, rather than returning it as an in-memory object, precisely to accommodate potentially very large formats like the DDI. This export is still prohibitively expensive in terms of memory use, though, since it receives the entire dataset's worth of variable-level metadata from `ExportDataProvider.getDatasetFileDetails()` as one big JsonArray.

I would like to address this similarly to how we added offset-length parameters to the /versions and /versions//files APIs for the SPA, and make it possible for `getDatasetFileDetails()` to page through the files/datatables in smaller batches.

Once the DataExporter is refactored, the croissant exporter in the gdcc repo could be refactored as well. Just like the DDI, it encodes the datavariable information. It is a little worse, because it doesn't stream its output either; instead it accumulates the entire JSON object in memory, then writes it all at once as `outputStream.write(job.build().toString().getBytes("UTF8"));`. But it should be very doable to make it stream instead.

My secondary, less pressing concern is that I want an option for the main `getDatasetJson()` method to skip the files info. (From what I can tell, `InternalExportDataProvider.getDatasetJson()` always calls `JsonPrinter.jsonWithCitation(DatasetVersion dsv, boolean includeFiles)` with `includeFiles=true`.) If we are going to be exporting formats individually, this may be wasteful too: it packs thousands of files, or more, into the JSON when exporting a cheap format like oai_dc that does not need them.
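
To make the batching idea concrete, here is a rough sketch of how an exporter might page through the file details and stream them out as it goes. Everything in it is an assumption for illustration: the `PagedFileDetailsProvider` interface and its `getDatasetFileDetails(int offset, int limit)` signature are hypothetical (the existing method takes no paging parameters and returns a JsonArray), and the per-file entries are modeled as pre-serialized JSON strings so the example only needs the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PagedExportSketch {

    /** Hypothetical paged variant of the provider; not part of the current export SPI. */
    interface PagedFileDetailsProvider {
        /** Returns at most {@code limit} serialized per-file JSON objects, starting at {@code offset}. */
        List<String> getDatasetFileDetails(int offset, int limit);
    }

    /**
     * Writes all file details to {@code out} as one JSON array, fetching one batch at a time
     * so that only a single batch is held in memory. Returns the number of entries written.
     */
    static int streamFileDetails(PagedFileDetailsProvider provider, int batchSize, OutputStream out)
            throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int written = 0;
        out.write('[');
        while (true) {
            // The running count doubles as the offset of the next batch.
            List<String> batch = provider.getDatasetFileDetails(written, batchSize);
            for (String fileJson : batch) {
                if (written > 0) {
                    out.write(',');
                }
                out.write(fileJson.getBytes(StandardCharsets.UTF_8));
                written++;
            }
            // A short (or empty) batch means the provider has nothing more to return.
            if (batch.size() < batchSize) {
                break;
            }
        }
        out.write(']');
        out.flush();
        return written;
    }

    /** In-memory stand-in for a dataset, used only to exercise the sketch. */
    static PagedFileDetailsProvider inMemoryProvider(List<String> all) {
        return (offset, limit) -> {
            int from = Math.min(offset, all.size());
            int to = Math.min(from + limit, all.size());
            return new ArrayList<>(all.subList(from, to));
        };
    }

    public static void main(String[] args) throws IOException {
        List<String> files = List.of("{\"id\":1}", "{\"id\":2}", "{\"id\":3}");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int count = streamFileDetails(inMemoryProvider(files), 2, out);
        System.out.println(count + " entries: " + out.toString(StandardCharsets.UTF_8));
    }
}
```

The same loop shape would presumably apply to the croissant exporter: rather than building the whole object and calling `toString()` once, each batch gets written to the OutputStream as soon as it is fetched. A real implementation would more likely use a streaming JSON writer than string concatenation; that part is left out here to keep the sketch dependency-free.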