A collection of tips for migrating away from deprecated features.
The issue with FixedLenStringArray is that it is unable to avoid copies.
Essentially, this class acts as a means to create a copy of the data in a
format suitable for writing fixed-length strings. Additionally, the class acts
as a tag for HighFive to overload on. Support for std::string in HighFive
has improved considerably: since 2.8.0 we can read and write std::string to
both fixed- and variable-length HDF5 strings.
Therefore, this class serves no purpose anymore. Any occurrence of it can be
replaced with an std::vector<std::string> (for example).
If desired, the deprecation warnings can be silenced by replacing
FixedLenStringArray with deprecated::FixedLenStringArray.
A "raw read" is when the user allocates sufficient bytes and provides HighFive with the pointer to the first byte. "Regular reads" take a detour via the inspector and might resize the container, etc.
The issue is that HighFive v2 had the following two overloads:

```cpp
template<class T>
DataSet::read(T& x, /* skipped */);

template<class T>
DataSet::read(T* x, /* skipped */);
```

and the analogous overloads for Attribute.
The problem is that the second overload will also match things like T** and
T[][]. For example, the following code used the removed overload:

```cpp
double x[2][3];
dset.read(x);
```

which is fine because x is a contiguous sequence of doubles. It's equivalent
to the following v3 code:

```cpp
double x[n][m];
dset.read_raw((double*) x);
```
We consider the example above to be accidentally using a raw read when it
could be performing a regular read. We suggest leaving such code unchanged:

```cpp
double x[n][m];
dset.read(x);
```

This continues to be correct in v3 and additionally checks that the
dimensions match. The inspector recognizes double[n][m] as a contiguous
array of doubles. Therefore, it'll use the shallow-copy buffer and avoid any
additional allocations or copies.
When genuinely performing a "raw read", one must replace read with
read_raw. For example:

```cpp
double* x = (double*) malloc(n*m * sizeof(double));
dset.read_raw(x);
```

is correct in v3.
The immediately preceding section is likely relevant.
In v2, raw pointers could be used to indicate dimensionality. For example:

```cpp
double* x = (double*) malloc(n*m * sizeof(double));
auto dset = file.createDataSet("foo", DataSpace({n, m}), ...);
dset.write((double**) x);
dset.read((double**) x);
```

was valid and would write the flat array x into the two-dimensional dataset
"foo". This must be modernized as follows:

```cpp
double* x = (double*) malloc(n*m * sizeof(double));
auto dset = file.createDataSet("foo", DataSpace({n, m}), ...);
dset.write_raw(x);
dset.read_raw(x);
```
In v3 the type T** will refer to a pointer to a pointer (as usual). The
following:

```cpp
size_t n = 2, m = 3;
double** x = (double**) malloc(n * sizeof(double*));
for (size_t i = 0; i < n; ++i) {
    x[i] = (double*) malloc(m * sizeof(double));
}

auto dset = file.createDataSet("foo", DataSpace({n, m}), ...);
dset.write(x);
dset.read(x);
```

is correct in v3 but would probably segfault in v2.
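The two layouts are not interchangeable in memory, which is why the old cast-based idiom had to go. A self-contained sketch (no HighFive) of the difference:

```cpp
#include <cstddef>
#include <cstdlib>

// Flat layout: one contiguous block of n*m doubles. In v3 this is what
// read_raw/write_raw expect.
double* alloc_flat(size_t n, size_t m) {
    return (double*) malloc(n * m * sizeof(double));
}

// Pointer-to-pointer layout: an array of n row pointers, each row
// allocated separately. In v3, dset.read(x)/dset.write(x) on a double**
// expects this layout; casting a flat block to double** (the v2 idiom)
// would misinterpret the first doubles as pointers.
double** alloc_rows(size_t n, size_t m) {
    double** rows = (double**) malloc(n * sizeof(double*));
    for (size_t i = 0; i < n; ++i) {
        rows[i] = (double*) malloc(m * sizeof(double));
    }
    return rows;
}
```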
In v3 we completely rewrote the CMake code of HighFive. Since HighFive is a
header-only library, it needs to perform two tasks:
- Copy the sources during installation.
- Export a target that sets -I ${HIGHFIVE_DIR} and links with HDF5.
We've removed all flags for optional dependencies, such as
-DHIGHFIVE_USE_BOOST. Instead, users that want to read/write optionally
supported containers include a header with the corresponding name and adjust
their CMake code to link with the dependency.
The C++ code should have:

```cpp
#include <highfive/boost.hpp>

// Code that reads or writes `boost::multi_array`.
```

and the CMake code would have:

```cmake
add_executable(app)

# These lines might work, but depend on how exactly the user intends to use
# Boost. They are not specific to HighFive, but previously added automatically
# (and sometimes correctly) by HighFive.
find_package(Boost)
target_link_libraries(app PUBLIC Boost::boost)

# For HighFive there's two options for adding `-I ${HIGHFIVE_DIR}` and the
# flags for HDF5.
#
# Option 1: HighFive is installed as a (system-wide) regular library:
find_package(HighFive)
target_link_libraries(app PUBLIC HighFive::HighFive)

# Option 2: HighFive is vendored as part of the project:
add_subdirectory(third_party/HighFive)
target_link_libraries(app PUBLIC HighFive::HighFive)
```
There are extensive examples of project integration in tests/cmake_integration,
including how those projects in turn can be included in other projects. If these
examples don't help, please feel free to open an Issue.
We've converted the enum DataSpace::DataSpaceType to an enum class. We've
added static constexpr members dataspace_null and dataspace_scalar to
DataSpace. This minimizes the risk of breaking user code.
Note that objects of type DataSpace::DataSpaceType, including the two
constants DataSpace::dataspace_{scalar,null}, will no longer silently
convert to an integer.
These have been deprecated to stick more closely to familiar HDF5 concepts.
The FileDriver is synonymous with FileAccessProps; and MPIOFileDriver is
the same as:

```cpp
auto fapl = FileAccessProps{};
fapl.add(MPIOFileAccess(mpi_comm, mpi_info));
```

We felt that the savings in typing effort weren't worth introducing the
concept of a "file driver". Removing the concept hopefully makes it easier
to add a better abstraction for handling property lists, when we discover
such an abstraction.
HighFive v2 had a feature that a dataset (or attribute) of shape [n, 1] could
be read into a one-dimensional array automatically.
The feature is prone to accidentally not failing. Consider an array of shape
[n, m] where in general both n, m > 0. Hence, one should always be reading
into a two-dimensional array, even if n == 1 or m == 1. However, due to
broadcasting, if one of the dimensions (accidentally) happens to be one,
the checks won't fail. This isn't a bug in itself, but it can hide a bug,
for example if the tests happen to use [n, 1] datasets together with a
one-dimensional array.
Broadcasting in HighFive was different from broadcasting in NumPy. For
reading into one-dimensional data, HighFive supported stripping all
dimensions equal to 1. When extending the feature to multi-dimensional
arrays it gets tricky: we can't strip from both the front and the back. If
we allowed stripping from both ends, an array such as [1, n, m] would read
into [n, m] if m > 1, but into [1, n] (instead of [n, 1]) if
(coincidentally) m == 1. For HighFive, avoiding being forced to read
[n, 1] into std::vector<std::vector<T>> is more important than avoiding it
for [1, n]: flattening the former requires copying everything, while the
latter can be made flat by just accessing the first value. Therefore,
HighFive preferred to strip from the right, while NumPy adds 1s to the
front/left of the shape.
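The two conventions can be sketched in a few lines (illustrative only, not HighFive's actual implementation):

```cpp
#include <cstddef>
#include <vector>

using Dims = std::vector<size_t>;

// HighFive-style preference: strip dimensions equal to 1 from the right.
Dims strip_right(Dims dims) {
    while (dims.size() > 1 && dims.back() == 1) {
        dims.pop_back();
    }
    return dims;
}

// NumPy pads shapes with 1s on the front/left; undoing that strips
// leading dimensions equal to 1.
Dims strip_left(const Dims& dims) {
    size_t i = 0;
    while (i + 1 < dims.size() && dims[i] == 1) {
        ++i;
    }
    return Dims(dims.begin() + i, dims.end());
}
```

For example, strip_right({4, 1}) yields {4}, matching the old [n, 1] behavior, while strip_right({1, 4, 1}) yields {1, 4}.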
In v3 we've removed broadcasting. Instead, users must use one of the two
alternatives: squeezing and reshaping. The examples below use datasets and
reading, but the same applies to attributes and writing.
Often we know that the kth dimension is 1, e.g. a column is [n, 1] and a
row is [1, m]. In this case it's convenient to state: remove dimension k.
The syntax to simultaneously remove the dimensions {0, 2} is:

```cpp
dset.squeezeMemSpace({0, 2}).read(array);
```

which will read a dataset with dimensions [1, n, 1] into an array of shape
[n].
Sometimes it's easier to state what the new shape must be. For this we have
the syntax:

```cpp
dset.reshapeMemSpace(dims).read(array);
```

This declares that array should have dimensions dims, even if
dset.getDimensions() is something different. For example:

```cpp
dset.reshapeMemSpace({dset.getElementCount()}).read(array);
```

reads into a one-dimensional array.
There's one safe case where enforcing this seems needlessly strict: if the dataset is a multi-dimensional array with a single element, one should be able to read it into (or write it from) a scalar.
The reverse, i.e. reading a scalar value in the HDF5 file into a
multi-dimensional array, isn't supported: if we want to support arrays with
runtime-defined rank, we can't deduce the correct shape, e.g. [1] vs.
[1, 1, 1], when reading into an array.
In v2, File::{ReadOnly,Truncate,...} was an anonymous member enum of
File. Effectively, its type was the same as an int.
To improve type-safety, we converted it into an enum class called
File::AccessMode. To reduce the migration effort, we retained the
ability to write File::ReadOnly.
Functions that accept a file access mode should be modernized as follows:

```cpp
// old
HighFive::File open(std::string name, int mode) {
    return HighFive::File(name, mode);
}

// new
HighFive::File open(std::string name, HighFive::File::AccessMode mode) {
    return HighFive::File(name, mode);
}
```
Note: There's a caveat: the shorthand notation File::ReadOnly doesn't have
an address. This means one can't take its address or bind a const-reference
to it (doing so results in a linker error about a missing symbol
File::ReadOnly). Use File::AccessMode::ReadOnly instead.
To our knowledge these could not be used meaningfully. Please create an issue if you relied on these.
The functionality was moved to Object::getAddress. Typically, replace:

```cpp
// old
dataset.getInfo().getAddress()
// new
dataset.getAddress()
```

Keep in mind that not all files (e.g. HSDS) have addresses; instead they
seem to have VOL tokens. Addresses are therefore best avoided in generic
code.