[df] Revert choice to change default Snapshot TTree compression settings

vepadulano · vepadulano · commit 2a688e1da0e0 · 2026-03-31T16:12:33.000+02:00
Commit 61088a3 deliberately changed the default compression settings used by Snapshot with the TTree output format from 101 to 505. That choice followed internal discussion within the team and was based on the empirical evidence available at the time, which showed ZSTD outperforming ZLIB on all metrics for TTree datasets (as well as for RNTuple datasets).

This commit reverts that choice based on new evidence, summarised at https://github.com/vepadulano/ttree-lossless-compression-studies. The main takeaway of that study is that TTree datasets with branches of type ROOT::RVec in which many (if not all) of the collections are empty compress better with ZLIB than with ZSTD. This case is quite relevant: most datasets are made of branches with collection types, and analysis steps may skim these collections quite drastically. That is enough motivation to move the default compression settings for TTree back to 101.

This commit changes the default RSnapshotOptions values to 'kUndefined' for the compression algorithm and '0' for the compression level. When the compression algorithm is 'kUndefined', Snapshot chooses the settings according to the output format: 101 for TTree and 505 for RNTuple.

Add one test per output format to check that the default values are respected.
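The fallback described above can be modelled in a few lines. This is a minimal standalone sketch, not the ROOT implementation: `Algo`, `Format`, `Options`, `Pack` and `ResolveCompression` are hypothetical stand-ins for the ROOT types and helpers, and the numeric enum values are assumptions chosen so that ZLIB level 1 packs to 101 and ZSTD level 5 packs to 505.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for the ROOT enums. The numeric values are
// assumptions, picked so that ZLIB/1 packs to 101 and ZSTD/5 to 505.
enum class Algo { kZLIB = 1, kZSTD = 5, kUndefined = 6 };
enum class Format { kDefault, kTTree, kRNTuple };

// Mirrors the three RSnapshotOptions fields relevant here, with the
// new defaults introduced by this commit.
struct Options {
   Format fOutputFormat = Format::kDefault;
   Algo fCompressionAlgorithm = Algo::kUndefined;
   int fCompressionLevel = 0;
};

// Pack algorithm and level into one integer: algorithm * 100 + level.
constexpr int Pack(Algo algo, int level)
{
   return static_cast<int>(algo) * 100 + level;
}

// An explicitly chosen algorithm is used as given; kUndefined falls
// back to the default of the selected output format.
constexpr int ResolveCompression(const Options &opts)
{
   if (opts.fCompressionAlgorithm != Algo::kUndefined)
      return Pack(opts.fCompressionAlgorithm, opts.fCompressionLevel);

   switch (opts.fOutputFormat) {
   case Format::kDefault: // handled like TTree, as in the helper of this commit
   case Format::kTTree: return Pack(Algo::kZLIB, 1);
   case Format::kRNTuple: return Pack(Algo::kZSTD, 5);
   }
   throw std::invalid_argument("unrecognized output format");
}

// Untouched options resolve to the TTree default.
static_assert(ResolveCompression(Options{}) == 101, "TTree default is 101");

// A user who sets both fields explicitly keeps that choice for TTree output.
inline void CheckExplicitChoice()
{
   Options opts;
   opts.fOutputFormat = Format::kTTree;
   opts.fCompressionAlgorithm = Algo::kZSTD;
   opts.fCompressionLevel = 5;
   assert(ResolveCompression(opts) == 505);
}
```

In this model, as in the commit, only the `kUndefined` sentinel triggers the per-format default, so setting `fCompressionAlgorithm` and `fCompressionLevel` explicitly keeps whatever the user chose.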
diff --git a/README/ReleaseNotes/v640/index.md b/README/ReleaseNotes/v640/index.md
@@ -272,7 +272,7 @@ Given the risk of silently incorrect physics results, and the absence of known w
 
 ## RDataFrame
 
-- The message shown in ROOT 6.38 to inform users about change of default compression setting used by Snapshot (was 101 before 6.38, became 505 in 6.38) is now removed.
+- The change of default compression settings used by Snapshot for the TTree output data format introduced in 6.38 (was 101 before 6.38, became 505 in 6.38) is reverted. That choice was based on the evidence available at the time, which indicated that ZSTD outperformed ZLIB in all cases for the available datasets. New evidence shows that this is not always true, notably for TTree branches made of collections where many (up to all) of them are empty. The investigation is described at https://github.com/vepadulano/ttree-lossless-compression-studies. The new default values in the Snapshot options are `kUndefined` for the compression algorithm and `0` for the compression level. When Snapshot detects `kUndefined` in the options, it sets the compression settings to 101 (for TTree) or 505 (for RNTuple).
 - Signatures of the HistoND and HistoNSparseD operations have been changed. Previously, the list of input column names was allowed to contain an extra column for events weights. This was done to align the logic with the THnBase::Fill method. But this signature was inconsistent with all other Histo* operations, which have a separate function argument that represents the column to get the weights from. Thus, HistoND and HistoNSparseD both now have a separate function argument for the weights. The previous signature is still supported, but deprecated: a warning will be raised if the user passes the column name of the weights as an extra element of the list of input column names. In a future version of ROOT this functionality will be removed. From now on, creating a (sparse) N-dim histogram with weights should be done by calling `HistoN[Sparse]D(histoModel, inputColumns, weightColumn)`.
 
 ## Python Interface
diff --git a/tree/dataframe/inc/ROOT/RSnapshotOptions.hxx b/tree/dataframe/inc/ROOT/RSnapshotOptions.hxx
@@ -56,13 +56,13 @@ Note that for RNTuple, the defaults correspond to those set in RNTupleWriteOptio
 <td><code>fCompressionAlgorithm</code></td>
 <td><code>ROOT::RCompressionSetting::EAlgorithm</code></td>
 <td>Zstd</td>
-<td>Compression algorithm for the output dataset</td>
+<td>Compression algorithm for the output dataset. Defaults to ROOT::RCompressionSetting::EAlgorithm::EValues::kUndefined, which Snapshot converts to ZLIB for TTree and to ZSTD for RNTuple</td>
 </tr>
 <tr>
 <td><code>fCompressionLevel</code></td>
 <td><code>int</code></td>
 <td>5</td>
-<td>Compression level for the output dataset</td>
+<td>Compression level for the output dataset. Defaults to 0 (uncompressed). If the default value of <code>fCompressionAlgorithm</code> is not modified, the compression level is changed to 1 for TTree and to 5 for RNTuple</td>
 </tr>
 <tr>
 <td><code>fOutputFormat</code></td>
@@ -184,9 +184,8 @@ struct RSnapshotOptions {
    }
    std::string fMode = "RECREATE"; ///< Mode of creation of output file
    ESnapshotOutputFormat fOutputFormat = ESnapshotOutputFormat::kDefault; ///< Which data format to write to
-   ECAlgo fCompressionAlgorithm =
-      ROOT::RCompressionSetting::EAlgorithm::kZSTD; ///< Compression algorithm of output file
-   int fCompressionLevel = 5;                       ///< Compression level of output file
+   ECAlgo fCompressionAlgorithm = ECAlgo::kUndefined;                     ///< Compression algorithm of output file
+   int fCompressionLevel = 0;                                             ///< Compression level of output file
    bool fLazy = false;                              ///< Do not start the event loop when Snapshot is called
    bool fOverwriteIfExists = false;  ///< If fMode is "UPDATE", overwrite object in output file if it already exists
    bool fVector2RVec = true;         ///< If set to true will convert std::vector columns to RVec when saving to disk
diff --git a/tree/dataframe/src/RDFSnapshotHelpers.cxx b/tree/dataframe/src/RDFSnapshotHelpers.cxx
@@ -364,6 +364,28 @@ void SetBranchesHelper(TTree *inputTree, TTree &outputTree,
    throw std::logic_error(
       "RDataFrame::Snapshot: something went wrong when creating a TTree branch, please report this as a bug.");
 }
+
+auto GetSnapshotCompressionSettings(const ROOT::RDF::RSnapshotOptions &options)
+{
+   using CompAlgo = ROOT::RCompressionSetting::EAlgorithm::EValues;
+   using OutputFormat = ROOT::RDF::ESnapshotOutputFormat;
+
+   if (options.fOutputFormat == OutputFormat::kTTree || options.fOutputFormat == OutputFormat::kDefault) {
+      // The default compression settings for TTree are 101 (ZLIB, level 1)
+      if (options.fCompressionAlgorithm == CompAlgo::kUndefined) {
+         return ROOT::CompressionSettings(CompAlgo::kZLIB, 1);
+      }
+      return ROOT::CompressionSettings(options.fCompressionAlgorithm, options.fCompressionLevel);
+   } else if (options.fOutputFormat == OutputFormat::kRNTuple) {
+      // The default compression settings for RNTuple are 505 (ZSTD, level 5)
+      if (options.fCompressionAlgorithm == CompAlgo::kUndefined) {
+         return ROOT::CompressionSettings(CompAlgo::kZSTD, 5);
+      }
+      return ROOT::CompressionSettings(options.fCompressionAlgorithm, options.fCompressionLevel);
+   } else {
+      throw std::invalid_argument("RDataFrame::Snapshot: unrecognized output format");
+   }
+}
 } // namespace
 
 ROOT::Internal::RDF::RBranchData::RBranchData(std::string inputBranchName, std::string outputBranchName, bool isDefine,
@@ -535,8 +557,7 @@ void ROOT::Internal::RDF::UntypedSnapshotTTreeHelper::SetEmptyBranches(TTree *in
 void ROOT::Internal::RDF::UntypedSnapshotTTreeHelper::Initialize()
 {
    fOutputFile.reset(
-      TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/"",
-                  ROOT::CompressionSettings(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel)));
+      TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/"", GetSnapshotCompressionSettings(fOptions)));
    if (!fOutputFile)
       throw std::runtime_error("Snapshot: could not create output file " + fFileName);
 
@@ -774,9 +795,9 @@ void ROOT::Internal::RDF::UntypedSnapshotTTreeHelperMT::SetEmptyBranches(TTree *
 
 void ROOT::Internal::RDF::UntypedSnapshotTTreeHelperMT::Initialize()
 {
-   const auto cs = ROOT::CompressionSettings(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel);
    auto outFile =
-      std::unique_ptr<TFile>{TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/fFileName.c_str(), cs)};
+      std::unique_ptr<TFile>{TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/fFileName.c_str(),
+                                         GetSnapshotCompressionSettings(fOptions))};
    if (!outFile)
       throw std::runtime_error("Snapshot: could not create output file " + fFileName);
    fOutputFile = outFile.get();
@@ -929,7 +950,7 @@ void ROOT::Internal::RDF::UntypedSnapshotRNTupleHelper::Initialize()
    model->Freeze();
 
    ROOT::RNTupleWriteOptions writeOptions;
-   writeOptions.SetCompression(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel);
+   writeOptions.SetCompression(GetSnapshotCompressionSettings(fOptions));
    writeOptions.SetInitialUnzippedPageSize(fOptions.fInitialUnzippedPageSize);
    writeOptions.SetMaxUnzippedPageSize(fOptions.fMaxUnzippedPageSize);
    writeOptions.SetApproxZippedClusterSize(fOptions.fApproxZippedClusterSize);
@@ -1151,8 +1172,7 @@ ROOT::Internal::RDF::SnapshotHelperWithVariations::SnapshotHelperWithVariations(
 
    TDirectory::TContext fileCtxt;
    fOutputHandle = std::make_shared<SnapshotOutputWriter>(
-      TFile::Open(filename.data(), fOptions.fMode.c_str(), /*ftitle=*/"",
-                  ROOT::CompressionSettings(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel)));
+      TFile::Open(filename.data(), fOptions.fMode.c_str(), /*ftitle=*/"", GetSnapshotCompressionSettings(fOptions)));
    if (!fOutputHandle->fFile)
       throw std::runtime_error(std::string{"Snapshot: could not create output file "} + std::string{filename});
 
diff --git a/tree/dataframe/test/dataframe_snapshot.cxx b/tree/dataframe/test/dataframe_snapshot.cxx
@@ -247,6 +247,22 @@ TEST(RDFSnapshotMore, BasketSizePreservation)
    TestBasketSizePreservation();
 }
 
+// Test for default compression settings
+TEST(RDFSnapshotMore, DefaultCompressionSettings)
+{
+   struct FileGuardRAII {
+      std::string fFilename{"RDFSnapshotMore_default_compression_settings.root"};
+      std::string fTreeName{"tree"};
+      ~FileGuardRAII() { std::remove(fFilename.c_str()); }
+   } fileGuard;
+   ROOT::RDataFrame df{1};
+   df.Define("x", [] { return 42; }).Snapshot(fileGuard.fTreeName, fileGuard.fFilename, {"x"});
+
+   auto f = std::make_unique<TFile>(fileGuard.fFilename.c_str());
+   EXPECT_EQ(f->GetCompressionAlgorithm(), ROOT::RCompressionSetting::EAlgorithm::EValues::kZLIB);
+   EXPECT_EQ(f->GetCompressionLevel(), 1);
+}
+
 // fixture that provides fixed and variable sized arrays as RDF columns
 class RDFSnapshotArrays : public ::testing::Test {
 protected:
diff --git a/tree/dataframe/test/dataframe_snapshot_ntuple.cxx b/tree/dataframe/test/dataframe_snapshot_ntuple.cxx
@@ -170,6 +170,26 @@ TEST(RDFSnapshotRNTuple, WriteOpts)
    }
 }
 
+TEST(RDFSnapshotRNTuple, DefaultCompressionSettings)
+{
+   FileRAII fileGuard{"RDFSnapshotRNTuple_default_compression_settings.root"};
+   const std::vector<std::string> columns = {"x"};
+
+   auto df = ROOT::RDataFrame(25ull).Define("x", [] { return 10; });
+
+   RSnapshotOptions opts;
+   opts.fOutputFormat = ROOT::RDF::ESnapshotOutputFormat::kRNTuple;
+
+   auto sdf = df.Snapshot("ntuple", fileGuard.GetPath(), {"x"}, opts);
+
+   EXPECT_EQ(columns, sdf->GetColumnNames());
+
+   auto reader = RNTupleReader::Open("ntuple", fileGuard.GetPath());
+   auto compSettings = *reader->GetDescriptor().GetClusterDescriptor(0).GetColumnRange(0).GetCompressionSettings();
+   // The RNTuple default should be 505
+   EXPECT_EQ(505, compSettings);
+}
+
 TEST(RDFSnapshotRNTuple, Compression)
 {
    FileRAII fileGuard{"RDFSnapshotRNTuple_compression.root"};