
Commit 047aeb6

Add Py4J comparison section and fix stale data in README
- Add "Previous Approach: Py4J Callback" section comparing PR apache#2430 architecture to current llmPredict built-in - Fix text identity table: reasoning 29/50 (58%) -> 33/50 (66%) - Fix reasoning analysis: APC root cause, not GPU non-determinism - Fix conclusions: all 39 divergent samples are APC - Fix cost totals: $0.114/$0.118 -> $0.108/$0.109 - Fix ROUGE scores: vLLM/SystemDS were swapped - Fix json_extraction sample counts (46 OpenAI, 50 GPU) - Regenerate summary.csv from current metrics.json
1 parent f1629ce commit 047aeb6

2 files changed: 144 additions & 108 deletions

scripts/staging/llm-bench/README.md: 119 additions & 93 deletions
@@ -3,7 +3,7 @@
 Benchmarking framework that compares LLM inference across three backends:
 OpenAI API, vLLM, and SystemDS JMLC with the native `llmPredict` built-in.
 Evaluated on 5 workloads (math, reasoning, summarization, JSON extraction,
-embeddings) with n=50 per workload (46 for json_extraction due to dataset size).
+embeddings) with n=50 per workload (46 for OpenAI json_extraction).

 ## Purpose

@@ -153,10 +153,11 @@ Python and Java identically.

 Both backends send identical model parameters (model, temperature,
 top_p, max_tokens, stream=false). Both receive the full JSON response
-at once. The run-order experiment (see below) showed that
-summarization accuracy follows run position (1st vs 2nd) due to vLLM
-APC, while reasoning varies across server sessions due to GPU
-floating-point non-determinism.
+at once. The run-order experiment (see below) showed that both
+summarization and reasoning text differences follow run position
+(1st vs 2nd) due to vLLM Automatic Prefix Caching (APC). For
+summarization this changes accuracy (25 vs 31); for reasoning the
+text differs but accuracy stays 29/50 in all 4 runs.

 ## Workloads

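For concreteness, here is a minimal Python sketch of the request shape described above, which both the vLLM and SystemDS clients send to an OpenAI-compatible `/v1/completions` endpoint. The endpoint URL, model id, prompt, and `max_tokens` value are illustrative placeholders; the fixed parameters (`temperature=0`, `top_p=0.9`, `stream=false`) match the ones listed in the hunk above.

```python
import requests

# Illustrative payload; both clients send these same fields.
payload = {
    "model": "Qwen/Qwen2.5-3B-Instruct",            # placeholder model id
    "prompt": "Summarize the following article: ...", # placeholder prompt
    "temperature": 0,
    "top_p": 0.9,
    "max_tokens": 512,                                # placeholder limit
    "stream": False,
}

# Placeholder endpoint for a locally running OpenAI-compatible server.
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```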
@@ -265,6 +266,52 @@ without deploying vLLM. The long-term vision is to replace this external
 server approach entirely with native DML transformer operations that
 run model inference directly inside SystemDS's matrix engine.

+### Previous Approach: Py4J Callback (PR #2430)
+
+The initial implementation (closed PR #2430) loaded HuggingFace models
+directly inside a Python worker process and used Py4J callbacks to bridge
+Java and Python:
+
+```
+Python worker (loads model into GPU memory)
+      ^
+      | Py4J callback: generateBatch(prompts)
+      v
+Java JMLC (PreparedScript.generateBatchWithMetrics)
+```
+
+This approach had several drawbacks:
+- **Tight coupling:** Model loading, tokenization, and inference all lived
+  in `llm_worker.py`, requiring Python-side changes for every model config.
+- **No standard API:** Used a custom Py4J callback protocol instead of the
+  OpenAI-compatible `/v1/completions` interface that vLLM and other servers
+  already provide.
+- **Limited optimization:** The Python worker reimplemented batching and
+  tokenization rather than leveraging vLLM's continuous batching, PagedAttention,
+  and KV cache management.
+- **Process lifecycle:** Java had to manage the Python worker process
+  (`loadModel()` / `releaseModel()`) with 300-second timeouts for large models.
+
+The current approach (this PR) replaces the Py4J callback with a native
+DML built-in (`llmPredict`) that issues HTTP requests to any
+OpenAI-compatible server:
+
+```
+DML script: llmPredict(prompts, url=..., model=...)
+  -> LlmPredictCPInstruction (Java HTTP client)
+  -> Any OpenAI-compatible server (vLLM, llm_server.py, etc.)
+```
+
+Benefits of the current approach:
+- **Decoupled:** Inference server is independent — swap vLLM for TGI, Ollama,
+  or any OpenAI-compatible endpoint without changing DML scripts or Java code.
+- **Standard protocol:** Uses the `/v1/completions` API, making benchmarks
+  directly comparable across backends.
+- **Server-side optimization:** vLLM handles batching, KV cache, PagedAttention,
+  and speculative decoding transparently.
+- **Simpler Java code:** `LlmPredictCPInstruction` is a single 216-line class
+  that builds JSON, sends HTTP, and parses the response — no process management.
+
 ## Benchmark Results

 ### Evaluation Methodology
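For readers unfamiliar with the Py4J callback pattern that the new "Previous Approach" section above describes, a minimal sketch follows. The interface name, entry-point method, and worker body are hypothetical placeholders, not the actual PR #2430 code.

```python
from py4j.java_gateway import JavaGateway, CallbackServerParameters

class LlmWorker(object):
    """Python-side worker that Java calls back into via Py4J."""

    def generateBatch(self, prompts):
        # A real worker would load the HuggingFace model and run inference here,
        # returning one generated string per prompt.
        return ["<generated text>" for _ in prompts]

    class Java:
        # Hypothetical callback interface name; the actual PR #2430 interface may differ.
        implements = ["org.apache.sysds.api.jmlc.LlmGenerateCallback"]

# auto_convert lets Py4J pass Python lists across the bridge as java.util.List.
gateway = JavaGateway(auto_convert=True,
                      callback_server_parameters=CallbackServerParameters())
# Hypothetical registration call on the Java entry point.
gateway.entry_point.registerWorker(LlmWorker())
```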
@@ -314,13 +361,12 @@ that returns true/false per sample. The accuracy percentage is

 **Notes:**

-- All three backends now use the same CoNLL-2003 NER dataset (46 samples)
-  for json_extraction with entity-level F1 scoring (threshold >= 0.5).
-  An earlier run used strict 90% field-match scoring, which was the wrong
-  metric for NER evaluation and reported 15% accuracy. The entity F1
-  scorer correctly evaluates partial entity matches across categories
-  (persons, organizations, locations, misc), yielding 65% accuracy for
-  the same model outputs.
+- All three backends use the CoNLL-2003 NER dataset for json_extraction
+  with entity-level F1 scoring (threshold >= 0.5). OpenAI ran with 46
+  samples (earlier dataset version); vLLM and SystemDS ran with 50.
+  The entity F1 scorer evaluates partial entity matches across categories
+  (persons, organizations, locations, misc), yielding 66% accuracy for
+  GPU backends (33/50) and 61% for OpenAI (28/46).
 - The vLLM and SystemDS backends previously sent different `top_p` values
   to the inference server (vLLM: server default 1.0, SystemDS: 0.9). This
   has been fixed -- all backends now explicitly send `top_p=0.9`. At
@@ -334,9 +380,12 @@ that returns true/false per sample. The accuracy percentage is
 ### Accuracy Gap Analysis (vLLM vs SystemDS)

 On 4/5 workloads (math, reasoning, json_extraction, embeddings),
-accuracy is identical because predictions are byte-for-byte identical.
-The remaining workload (summarization) diverges due to vLLM Automatic
-Prefix Caching (APC) — proven by the run-order experiment (see below).
+accuracy is identical. On 3 of these (math, json_extraction, embeddings),
+predictions are byte-for-byte identical. On reasoning, 17/50 samples
+produce different text due to APC but accuracy is still 29/50 in all 4
+runs. The remaining workload (summarization) has both different text
+(22/50) and different accuracy (25 vs 31) due to APC — proven by the
+run-order experiment (see below).

 **Note on labels:** In the committed results, "vLLM" ran first and
 "SystemDS" ran second. For summarization, these labels correspond to
@@ -436,7 +485,7 @@ text for all samples. This confirms that the SystemDS JMLC pipeline
 | math | 50/50 | 0 | **100%** |
 | json_extraction | 46/46 | 0 | **100%** |
 | embeddings | 50/50 | 0 | **100%** |
-| reasoning | 29/50 | 21 | 58% |
+| reasoning | 33/50 | 17 | 66% |
 | summarization | 28/50 | 22 | 56% |

 **Why do 3 workloads match perfectly but 2 don't?** The key factor is
@@ -448,7 +497,7 @@ output constraint level, not output length:
 | json_extraction | 150 chars | Structured (JSON fields from input, n=46) | 100% |
 | math | **1349 chars** | Arithmetic steps (one valid path) | **100%** |
 | summarization | 328 chars | Unconstrained (many valid phrasings) | 56% |
-| reasoning | 960 chars | Unconstrained (many valid phrasings) | 58% |
+| reasoning | 960 chars | Unconstrained (many valid phrasings) | 66% |

 Math produces the **longest** outputs (avg 1349 chars) yet achieves
 100% identity. This is because arithmetic is highly constrained: at each
@@ -463,9 +512,14 @@ phrasings. "The report found..." vs "A report revealed..." vs
 even small differences in server cache state (APC) or floating-point
 rounding can flip the selection at near-tied positions.

-**Two distinct root causes for the 43 divergent samples:**
+**Root cause for all 39 divergent samples: vLLM Automatic Prefix Caching (APC).**

-**1. Summarization (22 samples): vLLM Automatic Prefix Caching (APC).**
+The run-order experiment proves that ALL divergent samples (22 summarization
++ 17 reasoning) follow the same APC pattern: same-position runs are 100%
+identical across sessions, while cross-position runs diverge. The backend
+label is irrelevant — only cache position matters.
+
+**1. Summarization (22 samples):**

 vLLM 0.15.1 enables APC by default (`enable_prefix_caching=True`). APC
 stores KV cache tensors from previously processed prefixes and reuses
@@ -541,58 +595,35 @@ These are not random — the same cache state always produces the same
 output. With temperature=0, `CUBLAS_WORKSPACE_CONFIG`, and sequential
 requests, same prompt + same cache state → same code path → same output.

-**2. Reasoning (21 samples): GPU floating-point non-determinism.**
+**2. Reasoning (17 samples): also APC.**

-Unlike summarization, reasoning divergences do NOT follow the APC swap
-pattern. The run-order experiment reveals a completely different behaviour:
+Reasoning follows the same APC pattern as summarization. Within a
+session, 33/50 (66%) of predictions are byte-for-byte identical between
+1st and 2nd run. The remaining 17 samples diverge due to APC changing
+the KV cache state. Cross-session, same-position runs are 100% identical
+(1st vs 1st, 2nd vs 2nd) — proving position determines output, not the
+backend.

-```
-Prediction matching patterns for reasoning (50 samples, S1 vs S3):
-  24x  ov=os, rv=rs    (same within session, different across sessions)
-   9x  rv=rs only      (reverse session matches, original partially differs)
-   5x  ov=os only      (original session matches, reverse partially differs)
-  12x  all 4 different (every run produces unique text)
-```
+Unlike summarization, the accuracy impact is zero: all 4 runs score
+29/50. The 17 divergent samples produce different text but the same
+yes/no answer (or different wrong answers). Reasoning diverges later in
+the response (median divergence point ~400 chars) compared to
+summarization, because BoolQ chain-of-thought reasoning shares more
+common structure before branching.

-Key evidence:
-- **0/50 predictions identical between original and reverse runs** for
-  the *same* backend (e.g., vLLM original vs vLLM reverse = 0% match)
-- **Divergence starts from the 1st generated token** (43/50 samples
-  diverge within the first 2 characters across sessions)
-- **Within the same session**, 29-33/50 predictions match between
-  vLLM and SystemDS (they share the same server state)
-- Accuracy is unstable: vLLM scores 31 (original) vs 29 (reverse);
-  SystemDS scores 33 (original) vs 29 (reverse)
-
-This is cross-session GPU non-determinism: despite temperature=0 and
-`CUBLAS_WORKSPACE_CONFIG=:4096:8`, the GPU produces different logits
-across server restarts. BoolQ prompts are long passages (high token
-count) that accumulate floating-point rounding differences across
-attention layers, causing divergence from the very first generated
-token. This is a known limitation — `CUBLAS_WORKSPACE_CONFIG` makes
-cuBLAS operations deterministic within a session, but does not
-guarantee bit-identical results across process restarts (different
-memory layouts, kernel launch configurations, etc.).
-
-**Investigation: cuBLAS non-determinism.** cuBLAS uses algorithms where
-the order of floating-point additions varies between runs. Since FP
-addition is not associative, this produces slightly different logit
-values. When two token candidates have nearly equal logits (e.g.,
-5.00001 vs 5.00000), a tiny rounding change can flip the argmax.
-
-We ran vLLM with `CUBLAS_WORKSPACE_CONFIG=:4096:8` (forces deterministic
-cuBLAS algorithms). Constrained workloads (math, json_extraction,
-embeddings) became 100% byte-identical. Reasoning and summarization
-still diverged — cuBLAS determinism helps within a session but does not
-eliminate cross-session or APC-induced divergence.
+`CUBLAS_WORKSPACE_CONFIG=:4096:8` was used for all runs to force
+deterministic cuBLAS algorithms. The constrained workloads (math,
+json_extraction, embeddings) achieve 100% byte-identity. The
+unconstrained workloads (reasoning, summarization) diverge due to APC
+cache state, not cuBLAS non-determinism.

 **Why streaming was investigated (and why it is not the cause).**

 The original vLLM backend used `"stream": true` while SystemDS used
 `"stream": false`. Streaming was checked first as a potential source
 of byte-level corruption. The 150 byte-identical samples across math,
 json_extraction, and embeddings ruled out SSE corruption. Switching to
-`"stream": false` produced identical divergence counts (same 43
+`"stream": false` produced identical divergence counts (same 39
 samples), confirming streaming had no effect. Both backends now use
 non-streaming mode.

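The APC explanation above can be sanity-checked offline with vLLM's Python API by switching prefix caching off. This is only a sketch: the model id, prompts, and `max_tokens` value are placeholders, and the benchmark itself talks to the HTTP server rather than an in-process engine.

```python
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # same determinism setting used in the benchmark runs

from vllm import LLM, SamplingParams

# Placeholder model id; the benchmark serves Qwen 3B over HTTP instead.
llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", enable_prefix_caching=False)
params = SamplingParams(temperature=0, top_p=0.9, max_tokens=256)

prompts = ["<prompt from a divergent reasoning sample>"] * 2
first = llm.generate(prompts, params)
second = llm.generate(prompts, params)

# With APC disabled, repeated generations in the same process should stay byte-identical.
for a, b in zip(first, second):
    print(a.outputs[0].text == b.outputs[0].text)
```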
@@ -722,14 +753,11 @@ The accuracy comparison is the apples-to-apples metric since all backends
 process the same prompts with the same parameters.

 **SystemDS vs vLLM latency** (same server, same model, CUBLAS
-deterministic run, vLLM used `stream=true` at the time of this
-measurement): Latencies are within 1--6% of each other. These
-differences are within measurement noise for two reasons:
-(1) the runs were 6 minutes apart — server cache state and scheduling
-differ; (2) divergent samples generate different output lengths, and
-latency is dominated by output token count. A sample where vLLM
-generates 91 more characters (observed average in reasoning) simply
-does more work — it is not a sign that SystemDS is faster or slower.
+deterministic, non-streaming HTTP): Latencies are within 0--3% of each
+other for generation workloads. These differences are within measurement
+noise: the runs were ~6 minutes apart and divergent samples generate
+different output lengths. Latency is dominated by output token count —
+a sample where one run generates more tokens simply does more work.

 | Workload | vLLM | SystemDS | Difference |
 |----------|------|----------|------------|
@@ -749,12 +777,12 @@ latency is determined by output token count, not by which client sends
 the request.

 **Why output length differs between backends:**
-When two sequential runs diverge at a single token (due to APC or GPU
-non-determinism), the two autoregressive paths produce responses of
-different lengths — neither is reliably longer. Among the 21 divergent
-reasoning samples, the 1st-run was longer in 12 cases and the 2nd-run
-in 9 cases. On average the 1st-run was 91 chars longer only due to
-outliers — it is not a systematic property of either run position.
+When two sequential runs diverge at a single token (due to APC), the
+two autoregressive paths produce responses of different lengths — neither
+is reliably longer. Among the 17 divergent reasoning samples, neither
+run position is systematically longer. The difference in latency between
+backends on these samples reflects the output length difference, not a
+performance difference.

 ### Throughput (requests/second)

@@ -829,41 +857,39 @@ embeddings), the amortized cost is ~$0.00003/query vs OpenAI's
 | Backend | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 |
 |---------|-----------|-----------|-----------|
 | OpenAI | 0.270 | 0.066 | 0.201 |
-| vLLM Qwen 3B | 0.226 | 0.056 | 0.157 |
-| SystemDS Qwen 3B | 0.220 | 0.057 | 0.157 |
+| vLLM Qwen 3B | 0.220 | 0.057 | 0.157 |
+| SystemDS Qwen 3B | 0.226 | 0.056 | 0.157 |

 ## Conclusions

 1. **SystemDS `llmPredict` is a lossless pass-through**: On 3/5
    workloads (math, json_extraction, embeddings), every response is
-   byte-for-byte identical between vLLM and SystemDS — 196/246 samples
+   byte-for-byte identical between vLLM and SystemDS — 150/150 samples
    total. The JMLC pipeline (Py4J -> DML -> Java HTTP -> FrameBlock)
    introduces zero data loss or corruption.

-2. **The 43 divergent samples have two distinct root causes**:
-   - **Summarization (22 samples):** vLLM Automatic Prefix Caching (APC).
-     The run-order experiment proves all 22 follow the `1st-run = variant A,
-     2nd-run = variant B` pattern with zero exceptions. 1st-run scores
-     25/50, 2nd-run scores 31/50, regardless of which backend runs first.
-   - **Reasoning (21 samples):** GPU floating-point non-determinism
-     across server sessions. The same backend produces 0% identical
-     predictions across sessions; divergence starts from the 1st token.
-     The ±2 sample accuracy gap is noise (n=50).
+2. **All 39 divergent samples are caused by vLLM Automatic Prefix
+   Caching (APC)**: The run-order experiment proves that all divergent
+   samples (22 summarization + 17 reasoning) follow the same pattern:
+   same-position runs are 100% byte-identical across sessions, while
+   cross-position runs diverge. For summarization, this changes accuracy
+   (25/50 vs 31/50). For reasoning, the text differs but accuracy
+   remains 29/50 in all 4 runs.

 3. **JMLC overhead is negligible**: Latencies between SystemDS and
-   direct vLLM calls are within 1--6%, within measurement noise.
-   Neither backend is meaningfully faster.
+   direct vLLM calls are within 0--3% for generation workloads, within
+   measurement noise. Neither backend is meaningfully faster.

 4. **Both backends benefit equally from vLLM server optimizations**:
    PagedAttention, continuous batching, KV cache, and CUDA kernels all
    happen server-side. Both are HTTP clients to the same server.

 5. **Cost tradeoff depends on scale**: For this small benchmark (250
    sequential queries, ~3 min total inference), OpenAI API ($0.047) is
-   cheaper than local H100 ($0.114 vLLM / $0.118 SystemDS) because hardware
-   amortization ($2.00/hr) dominates at low utilization. At production
-   scale with concurrent requests, owned hardware becomes significantly
-   cheaper per query.
+   cheaper than local H100 ($0.108 vLLM / $0.109 SystemDS) because
+   hardware amortization ($2.00/hr) dominates at low utilization. At
+   production scale with concurrent requests, owned hardware becomes
+   significantly cheaper per query.

 6. **Model quality matters more than serving infrastructure**: The
    difference between OpenAI and Qwen 3B is model quality. The
