
Commit 6b44365

changes to metrics output dir
1 parent 1248486 commit 6b44365

8 files changed: 182 additions & 46 deletions

.dockerignore

Lines changed: 4 additions & 4 deletions
@@ -2,10 +2,10 @@
 run_output/
 run_metrics/
 tsv_vcf_files/
-test_vcf_files/
-!test_vcf_files/test-100.vcf
-!test_vcf_files/test-1k.vcf
-!test_vcf_files/test-10k.vcf
+test/test_vcf_files/*
+!test/test_vcf_files/test-100.vcf
+!test/test_vcf_files/test-1k.vcf
+!test/test_vcf_files/test-10k.vcf
 out/
 tsv/
 *.jar

.gitignore

Lines changed: 4 additions & 4 deletions
@@ -4,10 +4,10 @@ scripts/node_modules
 !time-*
 
 run_output/
-test_vcf_files/*
-!test_vcf_files/test-100.vcf
-!test_vcf_files/test-1k.vcf
-!test_vcf_files/test-10k.vcf
+test/test_vcf_files/*
+!test/test_vcf_files/test-100.vcf
+!test/test_vcf_files/test-1k.vcf
+!test/test_vcf_files/test-10k.vcf
 RMLStreamer-v2.5.0-standalone.jar
 out/
 tsv/

README.md

Lines changed: 16 additions & 14 deletions
@@ -139,30 +139,32 @@ Outputs:
 - decompressed outputs (decompression mode default):
   - `./out/<sample>/<sample>.nt`
 - `./run_metrics` for logs and metrics
-  - `run_metrics/metrics.csv` includes both conversion and compression metrics per run
+  - each wrapper invocation creates a run-specific subdirectory: `run_metrics/<RUN_ID>/`
+    - example: `run_metrics/20260225T120434/`
+  - `run_metrics/<RUN_ID>/metrics.csv` includes both conversion and compression metrics for that run
   - compound-compression fields are explicit and separate from raw-RDF compression:
     - `gzip_on_hdt_*` (gzip applied to `.hdt`)
     - `brotli_on_hdt_*` (brotli applied to `.hdt`)
     - `hdt_source` (`generated` vs `existing` when reused)
   - conversion step artifacts:
-    - `run_metrics/conversion-time-<output_name>-<run_id>.txt`
-    - `run_metrics/conversion-metrics-<output_name>-<run_id>.json`
+    - `run_metrics/<RUN_ID>/conversion-time-<output_name>-<run_id>.txt`
+    - `run_metrics/<RUN_ID>/conversion-metrics-<output_name>-<run_id>.json`
   - compression step artifacts:
-    - `run_metrics/compression-time-<method>-<output_name>-<run_id>.txt`
-    - `run_metrics/compression-metrics-<output_name>-<run_id>.json`
+    - `run_metrics/<RUN_ID>/compression-time-<method>-<output_name>-<run_id>.txt`
+    - `run_metrics/<RUN_ID>/compression-metrics-<output_name>-<run_id>.json`
   - wrapper runtime artifacts:
-    - `run_metrics/wrapper_execution_times.csv` (one row per wrapper run with mode, elapsed time, status, and full-mode triple totals when available)
-    - `run_metrics/.wrapper_logs/wrapper-<timestamp>.log` stores detailed Docker/stdout/stderr command output
+    - `run_metrics/<RUN_ID>/wrapper_execution_times.csv` (one row for that run with mode, elapsed time, status, and full-mode triple totals when available)
+    - `run_metrics/<RUN_ID>/.wrapper_logs/wrapper-<run_id>.log` stores detailed Docker/stdout/stderr command output
 
 Small VCF fixtures for RDF size/inflation test runs:
-- `test_vcf_files/infl100.vcf` (100 total lines)
-- `test_vcf_files/infl1k.vcf` (1000 total lines)
-- `test_vcf_files/infl10k.vcf` (10000 total lines)
+- `test/test_vcf_files/test-100.vcf` (100 total lines)
+- `test/test_vcf_files/test-1k.vcf` (1000 total lines)
+- `test/test_vcf_files/test-10k.vcf` (10000 total lines)
 
 Example inflation check:
 ```bash
-python3 vcf_rdfizer.py --mode full --input test_vcf_files/infl1k.vcf --rdf-layout aggregate --compression none --keep-tsv --keep-rdf
-wc -l out/infl1k/infl1k.nt
+python3 vcf_rdfizer.py --mode full --input test/test_vcf_files/test-1k.vcf --rdf-layout aggregate --compression none --keep-tsv --keep-rdf
+wc -l out/test-1k/test-1k.nt
 ```
 
 ## How Dependencies Are Handled
@@ -195,7 +197,7 @@ The wrapper validates:
 - Docker runs as the host UID/GID by default to prevent root-owned output files on mounted volumes
 - If mounted output/metrics paths are not writable (e.g., stale root-owned files), the wrapper automatically attempts a one-time in-container permission repair before running
 - Raw command output is written to a hidden wrapper log file instead of printed directly to the terminal
-- A concise elapsed-time summary is printed at the end of each mode run and appended to `run_metrics/wrapper_execution_times.csv`
+- A concise elapsed-time summary is printed at the end of each mode run and written to `run_metrics/<RUN_ID>/wrapper_execution_times.csv`
 - Full mode prints triples produced per input (and total) when conversion metrics are available
 - Optional preflight storage estimate (`--estimate-size`) with a disk-space warning if the upper-bound estimate exceeds free space
@@ -242,7 +244,7 @@ Options:
 - `-b, --build`: force docker build
 - `-B, --no-build`: fail if image missing
 - `-n, --out-name` (default `rdf`): fallback output basename in full mode
-- `-M, --metrics` (default `./run_metrics`): metrics/log directory
+- `-M, --metrics` (default `./run_metrics`): metrics root directory (a `<RUN_ID>/` subdirectory is created per run)
 - `-c, --compression` (default `gzip,brotli,hdt`): compression methods (`gzip,brotli,hdt,hdt_gzip,hdt_brotli,none`)
 - `-k, --keep-tsv`: keep TSV intermediates (full mode)
 - `-R, --keep-rdf`: keep raw `.nt/.nq` RDF outputs after compression (full mode; default is delete)

test/test_vcf_rdfizer_unit.py

Lines changed: 57 additions & 7 deletions
@@ -61,7 +61,53 @@ def output_name_from_command(cmd):
     return None
 
 
+def latest_metrics_run_dir(metrics_root: Path) -> Path:
+    """Return the single/latest per-run metrics directory."""
+    run_dirs = sorted(
+        (
+            path
+            for path in metrics_root.iterdir()
+            if path.is_dir() and re.match(r"^\d{8}T\d{6}$", path.name)
+        ),
+        key=lambda path: path.name,
+    )
+    if not run_dirs:
+        raise AssertionError(f"No per-run metrics directories found under {metrics_root}")
+    return run_dirs[-1]
+
+
 class WrapperUnitTests(VerboseTestCase):
+    def test_print_summary_lists_all_selected_compression_sizes(self):
+        """Summary printer includes one size line per requested compression method."""
+        with tempfile.TemporaryDirectory() as td:
+            tmp_path = Path(td)
+            out_root = tmp_path / "out" / "sample"
+            out_root.mkdir(parents=True, exist_ok=True)
+            nt_path = tmp_path / "sample.nt"
+            nt_path.write_text("<s> <p> <o> .\n")
+            (out_root / "sample.hdt").write_text("hdt\n")
+            (out_root / "sample.nt.gz").write_text("gz\n")
+
+            out_buf = StringIO()
+            with redirect_stdout(out_buf):
+                vcf_rdfizer.print_nt_hdt_summary(
+                    output_root=out_root,
+                    nt_path=nt_path,
+                    hdt_path=out_root / "sample.hdt",
+                    selected_methods=["hdt", "gzip"],
+                    method_results={
+                        "hdt": {"output_size_bytes": 4, "exit_code": 0},
+                        "gzip": {"output_size_bytes": 3, "exit_code": 0},
+                    },
+                    indent=" ",
+                )
+
+            text = out_buf.getvalue()
+            self.assertIn("- HDT (.hdt):", text)
+            self.assertIn("- gzip (.nt.gz):", text)
+            self.assertIn(str(out_root / "sample.hdt"), text)
+            self.assertIn(str(out_root / "sample.nt.gz"), text)
+
     def test_update_metrics_csv_keeps_raw_and_hdt_compound_metrics_separate(self):
         """Metrics CSV keeps raw RDF gzip/brotli fields separate from gzip/brotli-on-HDT fields."""
         with tempfile.TemporaryDirectory() as td:
@@ -393,7 +439,8 @@ def fake_run(cmd, cwd=None, env=None):
 
             self.assertEqual(rc, 0)
             self.assertIn("Run time (compress mode):", out_buf.getvalue())
-            timings_csv = metrics_dir / "wrapper_execution_times.csv"
+            run_metrics_dir = latest_metrics_run_dir(metrics_dir)
+            timings_csv = run_metrics_dir / "wrapper_execution_times.csv"
             self.assertTrue(timings_csv.exists())
             with timings_csv.open() as handle:
                 rows = list(csv.DictReader(handle))
@@ -422,8 +469,9 @@ def fake_run(cmd, cwd=None, env=None):
             sample_dir.mkdir(parents=True, exist_ok=True)
             (sample_dir / f"{out_name}.nt").write_text("<s> <p> <o> .\n")
             payload = {"artifacts": {"output_triples": {"TOTAL": 17}}}
-            metrics_dir.mkdir(parents=True, exist_ok=True)
-            (metrics_dir / f"conversion-metrics-{out_name}-{run_id}.json").write_text(
+            run_metrics_dir = metrics_dir / run_id
+            run_metrics_dir.mkdir(parents=True, exist_ok=True)
+            (run_metrics_dir / f"conversion-metrics-{out_name}-{run_id}.json").write_text(
                 json.dumps(payload),
                 encoding="utf-8",
             )
@@ -469,7 +517,8 @@ def fake_run(cmd, cwd=None, env=None):
             self.assertIn("Total triples produced (full run): 17", output)
             self.assertIn("Run time (full mode):", output)
 
-            timings_csv = metrics_dir / "wrapper_execution_times.csv"
+            run_metrics_dir = latest_metrics_run_dir(metrics_dir)
+            timings_csv = run_metrics_dir / "wrapper_execution_times.csv"
             self.assertTrue(timings_csv.exists())
             with timings_csv.open() as handle:
                 rows = list(csv.DictReader(handle))
@@ -1035,15 +1084,16 @@ def fake_run(cmd, cwd=None, env=None):
                os.chdir(old_cwd)
 
            self.assertEqual(rc, 0)
-           metrics_csv = metrics_dir / "metrics.csv"
+           run_metrics_dir = latest_metrics_run_dir(metrics_dir)
+           metrics_csv = run_metrics_dir / "metrics.csv"
            self.assertTrue(metrics_csv.exists())
            csv_text = metrics_csv.read_text()
            self.assertIn("compression_methods", csv_text)
            self.assertIn("sample", csv_text)
            self.assertIn("hdt", csv_text)
 
-           json_files = list(metrics_dir.glob("compression-metrics-sample-*.json"))
-           time_files = list(metrics_dir.glob("compression-time-hdt-sample-*.txt"))
+           json_files = list(run_metrics_dir.glob("compression-metrics-sample-*.json"))
+           time_files = list(run_metrics_dir.glob("compression-time-hdt-sample-*.txt"))
            self.assertTrue(json_files)
            self.assertTrue(time_files)
