[df] Revert choice to change default Snapshot TTree compression settings
Commit 61088a3 made the deliberate choice to change the default compression
settings when calling Snapshot with TTree output format from 101 (ZLIB, level 1)
to 505 (ZSTD, level 5). This choice was the
result of internal discussion within the team, based on the empirical evidence
available up to that point that showed ZSTD outperforming ZLIB on all metrics
for the TTree datasets (as well as for RNTuple datasets).
This commit proposes to revert that choice based on new evidence, summarised at
https://github.com/vepadulano/ttree-lossless-compression-studies. The main
takeaway message from that study is that TTree datasets with branches of type
ROOT::RVec where many (if not all) of the collections are empty are compressed
better with ZLIB than with ZSTD. This case is quite relevant in practice: most
datasets are made of branches with collection types, and analysis steps may
skim these collections quite drastically. That is enough motivation to move the
default compression settings for TTree back to 101.
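
The settings 101 and 505 each pack a compression algorithm and a level into one integer. The following is a minimal sketch of that encoding, assuming the `algorithm * 100 + level` convention and the algorithm codes ZLIB = 1 and ZSTD = 5. The names `Algo`, `PackSettings`, `AlgoOf` and `LevelOf` are hypothetical stand-ins for illustration and are not ROOT API.

```cpp
#include <cassert>

// Hypothetical stand-ins for ROOT's algorithm codes. The numeric values are
// assumed to match ROOT's convention (ZLIB = 1, ZSTD = 5).
enum class Algo : int { kZLIB = 1, kZSTD = 5 };

// Pack algorithm and level into a single integer: algorithm * 100 + level.
constexpr int PackSettings(Algo algo, int level)
{
   return static_cast<int>(algo) * 100 + level;
}

// Recover the two parts from a packed value.
constexpr int AlgoOf(int settings) { return settings / 100; }
constexpr int LevelOf(int settings) { return settings % 100; }

// The two values discussed in the commit message.
static_assert(PackSettings(Algo::kZLIB, 1) == 101, "ZLIB at level 1");
static_assert(PackSettings(Algo::kZSTD, 5) == 505, "ZSTD at level 5");
```

Read this way, reverting the TTree default from 505 to 101 means going back from ZSTD at level 5 to ZLIB at level 1.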
This commit changes the default RSnapshotOptions compression settings to
'kUndefined' for the compression algorithm and '0' for the compression level.
When the 'kUndefined' compression algorithm is used, Snapshot picks the
settings according to the output format: 101 for TTree and 505 for RNTuple.
Add one test per output format to check that the default values are
respected.
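
The format-dependent defaults described above can be sketched in standalone C++. This is an illustration under stated assumptions, not the code in the commit: `Algo`, `OutputFormat`, `SnapshotOptionsSketch`, `ResolveCompressionSettings` and `CheckDefaults` are hypothetical names, and the behaviour for an explicitly chosen algorithm (pack it as `algorithm * 100 + level`) is assumed rather than taken from the diff.

```cpp
#include <cassert>

// Hypothetical stand-ins, not ROOT's real types. Only the ZLIB and ZSTD
// numeric codes matter for the packed values below.
enum class Algo : int { kZLIB = 1, kZSTD = 5, kUndefined };
enum class OutputFormat { kTTree, kRNTuple };

// Mirrors the new defaults from the commit: kUndefined algorithm, level 0.
struct SnapshotOptionsSketch {
   Algo fCompressionAlgorithm = Algo::kUndefined;
   int fCompressionLevel = 0;
};

// If the algorithm was left at kUndefined, choose the per-format default
// (101 for TTree, 505 for RNTuple) and ignore the level field.
// Otherwise (assumption) honour the explicit choice.
int ResolveCompressionSettings(const SnapshotOptionsSketch &opts, OutputFormat format)
{
   if (opts.fCompressionAlgorithm == Algo::kUndefined)
      return format == OutputFormat::kTTree ? 101 : 505;
   return static_cast<int>(opts.fCompressionAlgorithm) * 100 + opts.fCompressionLevel;
}

// One check per output format, in the spirit of the tests the commit adds.
void CheckDefaults()
{
   SnapshotOptionsSketch defaults;
   assert(ResolveCompressionSettings(defaults, OutputFormat::kTTree) == 101);
   assert(ResolveCompressionSettings(defaults, OutputFormat::kRNTuple) == 505);
}
```

Under this sketch, a caller who wants to keep ZSTD for TTree output would set both fields explicitly (algorithm `kZSTD`, level 5), which bypasses the `kUndefined` branch.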
README/ReleaseNotes/v640/index.md: 1 addition & 1 deletion
@@ -272,7 +272,7 @@ Given the risk of silently incorrect physics results, and the absence of known w

 ## RDataFrame

-- The message shown in ROOT 6.38 to inform users about change of default compression setting used by Snapshot (was 101 before 6.38, became 505 in 6.38) is now removed.
+- The change of default compression settings used by Snapshot for the TTree output data format introduced in 6.38 (was 101 before 6.38, became 505 in 6.38) is reverted. That choice was based on evidence available up to that point that indicated that ZSTD was outperforming ZLIB in all cases for the available datasets. New evidence demonstrated that this is not always the case, and in particular for the notable case of TTree branches made of collections where many (up to all) of them are empty. The investigation is described at https://github.com/vepadulano/ttree-lossless-compression-studies. The new default compression settings for Snapshot are respectively `kUndefined` for the compression algorithm and `0` for the compression level. When Snapshot detects `kUndefined` used in the options, it changes the compression settings to the new defaults of 101 (for TTree) and 505 (for RNTuple).
 - Signatures of the HistoND and HistoNSparseD operations have been changed. Previously, the list of input column names was allowed to contain an extra column for events weights. This was done to align the logic with the THnBase::Fill method. But this signature was inconsistent with all other Histo* operations, which have a separate function argument that represents the column to get the weights from. Thus, HistoND and HistoNSparseD both now have a separate function argument for the weights. The previous signature is still supported, but deprecated: a warning will be raised if the user passes the column name of the weights as an extra element of the list of input column names. In a future version of ROOT this functionality will be removed. From now on, creating a (sparse) N-dim histogram with weights should be done by calling `HistoN[Sparse]D(histoModel, inputColumns, weightColumn)`.
-<td>Compression algorithm for the output dataset</td>
+<td>Compression algorithm for the output dataset, defaults to ROOT::RCompressionSetting::EAlgorithm::EValues::kUndefined. This is converted to ZLIB by default for TTree and ZSTD by default for RNTuple</td>
 </tr>
 <tr>
 <td><code>fCompressionLevel</code></td>
 <td><code>int</code></td>
 <td>5</td>
-<td>Compression level for the output dataset</td>
+<td>Compression level for the output dataset, defaults to 0 (uncompressed). If the default value of `fCompressionAlgorithm` is not modified, the compression level is changed to 1 by default for TTree and 5 by default for RNTuple</td>
 </tr>
 <tr>
 <td><code>fOutputFormat</code></td>
@@ -184,9 +184,8 @@ struct RSnapshotOptions {
 }
 std::string fMode = "RECREATE"; ///< Mode of creation of output file
 ESnapshotOutputFormat fOutputFormat = ESnapshotOutputFormat::kDefault; ///< Which data format to write to
-ECAlgo fCompressionAlgorithm =
-   ROOT::RCompressionSetting::EAlgorithm::kZSTD; ///< Compression algorithm of output file
-int fCompressionLevel = 5; ///< Compression level of output file
+ECAlgo fCompressionAlgorithm = ECAlgo::kUndefined; ///< Compression algorithm of output file
+int fCompressionLevel = 0; ///< Compression level of output file
 bool fLazy = false; ///< Do not start the event loop when Snapshot is called
 bool fOverwriteIfExists = false; ///< If fMode is "UPDATE", overwrite object in output file if it already exists
 bool fVector2RVec = true; ///< If set to true will convert std::vector columns to RVec when saving to disk