Commit aa06e25

isaacbrodsky and am2222 authored
archive contents function (#65)
* archive contents function
* fix ifndef
* Fix tests for Archive contents table (#67)
* fix: Update error message for empty zip file case in tests
* chore: Add a blank line for better readability in README
* fix: Update error message for empty zip file case to reflect DuckDB version differences
* fix: Improve comment clarity for error handling in read_csv tests
* fix: Enhance comment clarity for empty zip file case in tests

---------

Co-authored-by: Majid Hojati <mhojati@uwaterloo.ca>
1 parent 8bd418e commit aa06e25

16 files changed

Lines changed: 523 additions & 8 deletions

CMakeLists.txt

Lines changed: 4 additions & 1 deletion
@@ -23,7 +23,10 @@ set(EXTENSION_SOURCES
     src/zip_file_system.cpp
     src/archive_file_system.cpp
     src/raw_archive_file_system.cpp
-    src/noop_archive_file_system.cpp)
+    src/noop_archive_file_system.cpp
+    src/zip_contents.cpp
+    src/archive_contents.cpp
+    src/noop_archive_contents.cpp)

 build_static_extension(${TARGET_NAME} ${EXTENSION_SOURCES})
 build_loadable_extension(${TARGET_NAME} " " ${EXTENSION_SOURCES})

README.md

Lines changed: 17 additions & 0 deletions
@@ -4,6 +4,7 @@

 This is a [DuckDB](https://duckdb.org) extension that adds support for reading files from within [zip archives](https://en.wikipedia.org/wiki/ZIP_(file_format)) and other archive formats such as `tar`.

+
 # Get started

 Load from the [community extensions repository](https://community-extensions.duckdb.org/extensions/zipfs.html):
@@ -22,6 +23,11 @@ To read a file from azure blob storage (or other file system):
 SELECT * FROM 'zip://az://yourstorageaccount.blob.core.windows.net/yourcontainer/examples/a.zip/a.csv';
 ```

+To read the table of contents of a zip file:
+```SQL
+SELECT * FROM archive_contents('examples/a.zip');
+```
+
 ## File names

 | URL quick reference | Description
@@ -31,6 +37,11 @@ SELECT * FROM 'zip://az://yourstorageaccount.blob.core.windows.net/yourcontainer
 | `archive://a.tar.gz!!*.csv` | Local archive file named `a.tar.gz`, containing csv files.
 | `compressed://a.jsonl.bz2` | Local compressed ndjson file `a.jsonl.bz2`.

+| Function | Description
+| --- | ---
+| `zip_contents` | Read the table of contents of a zip file
+| `archive_contents` | Read the table of contents of an archive file
+
 File names passed into the `zip://` URL scheme are expected to end with `.zip`, which indicates the end of the zip file name. The path after
 that is taken to be the file path within the zip archive.

@@ -63,6 +74,12 @@ It is also possible to read from a variety of compressed file formats directly:
 SELECT * FROM read_json('compressed://examples/a.jsonl.bz2');
 ```

+## Archive vs zip
+
+This extension supports both zip files and archive files. Zip file support uses miniz, while archive file
+support uses libarchive. libarchive supports a wider range of compression algorithms and container formats.
+libarchive is not available on Windows, and using the archive features there results in an error.
+
 ## Performance considerations

 This extension is intended more for convenience than high performance. It does not implement a file metadata cache as `tarfs` (on which this

src/archive_contents.cpp

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+#include "archive_contents.hpp"
+#include "archive_file_system.hpp"
+
+#include "duckdb/common/exception.hpp"
+#include "duckdb/common/numeric_utils.hpp"
+#include "duckdb/common/file_opener.hpp"
+#include "duckdb/function/scalar/string_common.hpp"
+#include "duckdb/main/client_context.hpp"
+
+#ifdef ENABLE_LIBARCHIVE
+
+namespace duckdb {
+
+struct ReadArchiveFunctionData : public GlobalTableFunctionState {
+  ReadArchiveFunctionData() : finished(false) {}
+  bool finished;
+};
+
+struct ReadArchiveFunctionBindData : public TableFunctionData {
+  string file_path;
+};
+
+void ReadArchiveFunction(ClientContext &context, TableFunctionInput &data,
+                         DataChunk &output) {
+  auto bind_data = data.bind_data->Cast<ReadArchiveFunctionBindData>();
+  auto &global_data = data.global_state->Cast<ReadArchiveFunctionData>();
+  if (global_data.finished) {
+    return;
+  }
+  auto &zip_path = bind_data.file_path;
+
+  auto &fs = FileSystem::GetFileSystem(context);
+  if (!fs.FileExists(zip_path)) {
+    throw IOException("Archive file does not exist: %s", zip_path);
+  }
+
+  auto handle = fs.OpenFile(zip_path, FileOpenFlags::FILE_FLAGS_READ);
+  if (!handle) {
+    throw IOException("Failed to open file: %s", zip_path);
+  }
+
+  if (!handle->CanSeek()) {
+    throw IOException("Cannot seek");
+  }
+
+  idx_t size = handle->GetFileSize();
+  idx_t count = 0;
+
+  struct archive *archive = archive_read_new();
+  try {
+    if (archive_read_support_filter_all(archive)) {
+      throw IOException("Failed to init libarchive (filter all): %s",
+                        archive_error_string(archive));
+    }
+    if (archive_read_support_format_all(archive)) {
+      throw IOException("Failed to init libarchive (format all): %s",
+                        archive_error_string(archive));
+    }
+    unique_ptr<LibArchiveHandle> zipHandle =
+        make_uniq<LibArchiveHandle>(std::move(handle));
+    // TODO: Add skip?
+    if (archive_read_set_seek_callback(archive, FileSystemZipSeekFunc)) {
+      throw IOException("Failed to init libarchive (seek callback): %s",
+                        archive_error_string(archive));
+    }
+    if (archive_read_open(archive, zipHandle.get(), &FileSystemZipOpenFunc,
+                          &FileSystemZipReadFunc, &FileSystemZipCloseFunc)) {
+      throw IOException("Failed to init libarchive (read callback): %s",
+                        archive_error_string(archive));
+    }
+    struct archive_entry *entry = archive_entry_new2(archive);
+    try {
+      while (archive_read_next_header2(archive, entry) == ARCHIVE_OK) {
+        auto pathName = archive_entry_pathname(entry);
+        auto fileSize = archive_entry_size(entry);
+        auto fileType = archive_entry_filetype(entry);
+        auto isDir = fileType == AE_IFDIR;
+
+        idx_t col = 0;
+        output.SetValue(col++, count, Value(pathName));
+        output.SetValue(col++, count,
+                        Value::UBIGINT(NumericCast<uint64_t>(fileSize)));
+        output.SetValue(col++, count, Value::BOOLEAN(isDir));
+
+        count++;
+      }
+
+      archive_entry_free(entry);
+      archive_read_free(archive);
+    } catch (Exception &ex2) {
+      archive_entry_free(entry);
+      throw;
+    }
+  } catch (IOException &ex) {
+    archive_read_free(archive);
+    throw;
+  } catch (Exception &ex) {
+    archive_read_free(archive);
+    throw;
+  }
+
+  output.SetCardinality(count);
+  global_data.finished = true;
+}
+
+unique_ptr<FunctionData>
+ReadArchiveFunctionBind(ClientContext &context, TableFunctionBindInput &input,
+                        vector<LogicalType> &return_types,
+                        vector<string> &names) {
+  auto result = make_uniq<ReadArchiveFunctionBindData>();
+  result->file_path = input.inputs[0].GetValue<string>();
+
+  return_types.push_back(LogicalType::VARCHAR);
+  names.emplace_back("file_name");
+
+  return_types.push_back(LogicalType::UBIGINT);
+  names.emplace_back("file_size");
+
+  return_types.push_back(LogicalType::BOOLEAN);
+  names.emplace_back("is_directory");
+
+  return result;
+}
+
+unique_ptr<GlobalTableFunctionState>
+ReadArchiveFunctionInit(ClientContext &context, TableFunctionInitInput &input) {
+  return std::move(make_uniq<ReadArchiveFunctionData>());
+}
+
+} // namespace duckdb
+
+#endif // ENABLE_LIBARCHIVE
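
Note on the cleanup in `ReadArchiveFunction`: the nested try/catch blocks exist to free the libarchive handles on every exit path. One alternative would be to let `std::unique_ptr` with a custom deleter do that work. The sketch below illustrates the idea with stand-in types so that it compiles without libarchive. `FakeArchive`, `fake_read_new`, `fake_read_free`, and `ListEntries` are hypothetical placeholders (for `struct archive`, `archive_read_new`, `archive_read_free`, and the table function body) and are not part of this commit.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for libarchive's `struct archive`.
struct FakeArchive {};

// Counts live handles so that the cleanup can be observed.
static int live_handles = 0;

// Stand-ins for archive_read_new / archive_read_free.
FakeArchive *fake_read_new() {
  ++live_handles;
  return new FakeArchive();
}
void fake_read_free(FakeArchive *a) {
  --live_handles;
  delete a;
}

// Deleter that unique_ptr invokes on every exit path, including exceptions.
struct FakeArchiveDeleter {
  void operator()(FakeArchive *a) const { fake_read_free(a); }
};
using FakeArchivePtr = std::unique_ptr<FakeArchive, FakeArchiveDeleter>;

// Opens a handle, optionally fails part-way, and never frees manually.
void ListEntries(bool fail) {
  FakeArchivePtr archive(fake_read_new());
  if (fail) {
    throw std::runtime_error("simulated libarchive failure");
  }
  // ... iterate over entries here ...
}
```

With this shape the handle is released whether `ListEntries` returns normally or throws, so no catch block is needed purely for cleanup.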

src/include/archive_contents.hpp

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+#pragma once
+
+#ifdef ENABLE_LIBARCHIVE
+
+#include <archive.h>
+#include <archive_entry.h>
+#include "utils.hpp"
+
+namespace duckdb {
+
+void ReadArchiveFunction(ClientContext &context, TableFunctionInput &data,
+                         DataChunk &output);
+
+unique_ptr<FunctionData>
+ReadArchiveFunctionBind(ClientContext &context, TableFunctionBindInput &input,
+                        vector<LogicalType> &return_types,
+                        vector<string> &names);
+
+unique_ptr<GlobalTableFunctionState>
+ReadArchiveFunctionInit(ClientContext &context, TableFunctionInitInput &input);
+
+} // namespace duckdb
+
+#endif // ENABLE_LIBARCHIVE
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+#pragma once
+
+#ifndef ENABLE_LIBARCHIVE
+
+#include "utils.hpp"
+
+namespace duckdb {
+
+void NoopReadArchiveFunction(ClientContext &context, TableFunctionInput &data,
+                             DataChunk &output);
+
+unique_ptr<FunctionData> NoopReadArchiveFunctionBind(
+    ClientContext &context, TableFunctionBindInput &input,
+    vector<LogicalType> &return_types, vector<string> &names);
+
+unique_ptr<GlobalTableFunctionState>
+NoopReadArchiveFunctionInit(ClientContext &context,
+                            TableFunctionInitInput &input);
+
+} // namespace duckdb
+
+#endif // ENABLE_LIBARCHIVE

src/include/zip_contents.hpp

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+#pragma once
+
+#include <miniz/miniz.h>
+#include <miniz/miniz_zip.h>
+#include "utils.hpp"
+
+namespace duckdb {
+
+void ReadZipFunction(ClientContext &context, TableFunctionInput &data,
+                     DataChunk &output);
+
+unique_ptr<FunctionData> ReadZipFunctionBind(ClientContext &context,
+                                             TableFunctionBindInput &input,
+                                             vector<LogicalType> &return_types,
+                                             vector<string> &names);
+
+unique_ptr<GlobalTableFunctionState>
+ReadZipFunctionInit(ClientContext &context, TableFunctionInitInput &input);
+
+} // namespace duckdb

src/include/zip_file_system.hpp

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,11 @@

 namespace duckdb {

+auto const ZIP_SEPARATOR = "/";
+
+size_t FileSystemZipReadFunc(void *pOpaque, mz_uint64 file_ofs, void *pBuf,
+                             size_t n);
+
 class ZipFileHandle final : public FileHandle {
   friend class ZipFileSystem;
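
Note on `FileSystemZipReadFunc`: the declaration exported here has the parameter list of a miniz read callback (`mz_file_read_func`): given an opaque pointer, a byte offset, and a destination buffer, it returns the number of bytes copied. Its definition is not part of this diff, so the following is only a sketch of how a callback of that shape could behave, using an in-memory buffer. `MemoryFile` and `MemoryReadFunc` are hypothetical names, and `mz_uint64` is aliased locally so the example compiles without miniz.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Local alias so the sketch compiles without miniz; miniz defines mz_uint64
// as a 64-bit unsigned integer.
using mz_uint64 = std::uint64_t;

// Hypothetical backing store standing in for a DuckDB file handle.
struct MemoryFile {
  std::string bytes;
};

// Same parameter list as FileSystemZipReadFunc: copy up to n bytes starting
// at file_ofs into pBuf and return how many bytes were actually copied.
size_t MemoryReadFunc(void *pOpaque, mz_uint64 file_ofs, void *pBuf, size_t n) {
  auto *file = static_cast<MemoryFile *>(pOpaque);
  if (file_ofs >= file->bytes.size()) {
    return 0; // reading at or past the end yields no data
  }
  size_t available = file->bytes.size() - static_cast<size_t>(file_ofs);
  size_t to_copy = std::min(n, available);
  std::memcpy(pBuf, file->bytes.data() + file_ofs, to_copy);
  return to_copy;
}
```

In miniz, a function of this shape is installed through the `m_pRead` member of `mz_zip_archive`, with the opaque pointer supplied via `m_pIO_opaque`.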

src/noop_archive_contents.cpp

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+#include "archive_contents.hpp"
+#include "archive_file_system.hpp"
+
+#include "duckdb/common/exception.hpp"
+#include "duckdb/common/numeric_utils.hpp"
+#include "duckdb/common/file_opener.hpp"
+#include "duckdb/function/scalar/string_common.hpp"
+#include "duckdb/main/client_context.hpp"
+
+#ifndef ENABLE_LIBARCHIVE
+
+namespace duckdb {
+
+void NoopReadArchiveFunction(ClientContext &context, TableFunctionInput &data,
+                             DataChunk &output) {
+  throw NotImplementedException("duckdb-zipfs was not built with libarchive "
+                                "support. (Not supported on Windows)");
+}
+
+unique_ptr<FunctionData> NoopReadArchiveFunctionBind(
+    ClientContext &context, TableFunctionBindInput &input,
+    vector<LogicalType> &return_types, vector<string> &names) {
+  return_types.push_back(LogicalType::VARCHAR);
+  names.emplace_back("file_name");
+
+  return_types.push_back(LogicalType::UBIGINT);
+  names.emplace_back("file_size");
+
+  return_types.push_back(LogicalType::BOOLEAN);
+  names.emplace_back("is_directory");
+
+  return nullptr;
+}
+
+unique_ptr<GlobalTableFunctionState>
+NoopReadArchiveFunctionInit(ClientContext &context,
+                            TableFunctionInitInput &input) {
+  return nullptr;
+}
+
+} // namespace duckdb
+
+#endif // ENABLE_LIBARCHIVE
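
Note on the guards: `archive_contents.cpp` is wrapped in `#ifdef ENABLE_LIBARCHIVE` and `noop_archive_contents.cpp` in `#ifndef ENABLE_LIBARCHIVE`, so only one of the two implementations has a body in any given build. The code that registers the functions is not shown in this excerpt. The following is a minimal sketch of the same pattern reduced to a single function; `ListArchive` is a hypothetical name used for illustration only, and the libarchive branch returns a placeholder rather than real archive contents.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

#ifdef ENABLE_LIBARCHIVE
// Built with libarchive: a real implementation would enumerate entries here.
std::vector<std::string> ListArchive(const std::string &path) {
  return {path + "/placeholder-entry"}; // placeholder, not real output
}
#else
// Built without libarchive: same signature, but fails loudly when called.
std::vector<std::string> ListArchive(const std::string &path) {
  (void)path;
  throw std::logic_error("built without libarchive support");
}
#endif
```

Keeping the signature identical in both branches lets callers compile unchanged, while the failure is deferred to the moment the function is actually invoked.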
