Benchmarking framework that compares LLM inference across three backends:
OpenAI API, vLLM, and SystemDS JMLC with the native `llmPredict` built-in.
Evaluated on 5 workloads (math, reasoning, summarization, JSON extraction,
embeddings) with n=50 per workload (46 for OpenAI json_extraction).

## Purpose

Both backends send identical model parameters (model, temperature,
top_p, max_tokens, stream=false). Both receive the full JSON response
at once. The run-order experiment (see below) showed that both
summarization and reasoning text differences follow run position
(1st vs 2nd) due to vLLM Automatic Prefix Caching (APC). For
summarization this changes accuracy (25 vs 31); for reasoning the
text differs but accuracy stays 29/50 in all 4 runs.
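
As a concrete reference, a minimal sketch of such a request against the
`/v1/completions` endpoint. The server URL, model identifier, and
`max_tokens` value are illustrative; the other parameter values follow the
settings described above:

```python
import json
import urllib.request

# Illustrative payload; both clients send the same fields with the same values.
payload = {
    "model": "Qwen/Qwen2.5-3B-Instruct",   # assumed model identifier ("Qwen 3B")
    "prompt": "What is 17 * 23?",
    "temperature": 0.0,
    "top_p": 0.9,
    "max_tokens": 512,                      # illustrative limit
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # assumed local server URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["text"])
```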

## Workloads

The long-term vision is to replace this external
server approach entirely with native DML transformer operations that
run model inference directly inside SystemDS's matrix engine.

### Previous Approach: Py4J Callback (PR #2430)

The initial implementation (closed PR #2430) loaded HuggingFace models
directly inside a Python worker process and used Py4J callbacks to bridge
Java and Python:

```
Python worker (loads model into GPU memory)
        ^
        | Py4J callback: generateBatch(prompts)
        v
Java JMLC (PreparedScript.generateBatchWithMetrics)
```

This approach had several drawbacks:
- **Tight coupling:** Model loading, tokenization, and inference all lived
  in `llm_worker.py`, requiring Python-side changes for every model config.
- **No standard API:** Used a custom Py4J callback protocol instead of the
  OpenAI-compatible `/v1/completions` interface that vLLM and other servers
  already provide.
- **Limited optimization:** The Python worker reimplemented batching and
  tokenization rather than leveraging vLLM's continuous batching, PagedAttention,
  and KV cache management.
- **Process lifecycle:** Java had to manage the Python worker process
  (`loadModel()` / `releaseModel()`) with 300-second timeouts for large models.

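For context, a minimal sketch of the general Py4J callback pattern this
design relied on. The Java interface name and the `registerWorker`
entry-point method are hypothetical; the real `llm_worker.py` also handled
model loading, batching, and metrics:

```python
from py4j.java_gateway import JavaGateway, CallbackServerParameters

class PythonLlmWorker(object):
    """Python-side callback object; the JVM invokes generateBatch via Py4J."""

    def generateBatch(self, prompts):
        # A real worker would run HuggingFace model inference here.
        return ";".join("generated: " + p for p in prompts)

    class Java:
        # Hypothetical Java interface implemented by this callback object.
        implements = ["org.apache.sysds.api.jmlc.LlmBatchCallback"]

# The callback server lets the JVM call back into this Python process.
gateway = JavaGateway(callback_server_parameters=CallbackServerParameters())
gateway.entry_point.registerWorker(PythonLlmWorker())  # hypothetical entry point
```
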
The current approach (this PR) replaces the Py4J callback with a native
DML built-in (`llmPredict`) that issues HTTP requests to any
OpenAI-compatible server:

```
DML script: llmPredict(prompts, url=..., model=...)
  -> LlmPredictCPInstruction (Java HTTP client)
  -> Any OpenAI-compatible server (vLLM, llm_server.py, etc.)
```

Benefits of the current approach:
- **Decoupled:** Inference server is independent; swap vLLM for TGI, Ollama,
  or any OpenAI-compatible endpoint without changing DML scripts or Java code
  (see the stub sketch after this list).
- **Standard protocol:** Uses the `/v1/completions` API, making benchmarks
  directly comparable across backends.
- **Server-side optimization:** vLLM handles batching, KV cache, PagedAttention,
  and speculative decoding transparently.
- **Simpler Java code:** `LlmPredictCPInstruction` is a single 216-line class
  that builds JSON, sends HTTP, and parses the response, with no process
  management.

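To illustrate the decoupling, a minimal stand-in for an OpenAI-compatible
`/v1/completions` endpoint that `llmPredict` could target. This is an
illustrative stub, not the real `llm_server.py`; it echoes the prompt where
a real server would run model inference:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CompletionsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        # Echo the prompt; a real server would generate a completion here.
        response = {
            "choices": [{"text": "stub completion for: " + request["prompt"]}]
        }
        body = json.dumps(response).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CompletionsHandler).serve_forever()
```
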
## Benchmark Results

### Evaluation Methodology

**Notes:**

- All three backends use the CoNLL-2003 NER dataset for json_extraction
  with entity-level F1 scoring (threshold >= 0.5). OpenAI ran with 46
  samples (earlier dataset version); vLLM and SystemDS ran with 50.
  The entity F1 scorer evaluates partial entity matches across categories
  (persons, organizations, locations, misc), yielding 66% accuracy for
  GPU backends (33/50) and 61% for OpenAI (28/46); a scoring sketch
  follows these notes.
- The vLLM and SystemDS backends previously sent different `top_p` values
  to the inference server (vLLM: server default 1.0, SystemDS: 0.9). This
  has been fixed -- all backends now explicitly send `top_p=0.9`.
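
A minimal sketch of this kind of thresholded entity-level F1 check. The
category names come from the note above; the exact normalization and
matching rules of the benchmark's scorer are assumptions:

```python
# Hypothetical sketch of entity-level F1 scoring; a sample counts as
# correct when its micro-averaged entity F1 is >= 0.5.
CATEGORIES = ("persons", "organizations", "locations", "misc")

def entity_f1(pred: dict, gold: dict) -> float:
    tp = fp = fn = 0
    for cat in CATEGORIES:
        pred_set = {e.strip().lower() for e in pred.get(cat, [])}
        gold_set = {e.strip().lower() for e in gold.get(cat, [])}
        tp += len(pred_set & gold_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def is_correct(pred: dict, gold: dict, threshold: float = 0.5) -> bool:
    return entity_f1(pred, gold) >= threshold
```
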
### Accuracy Gap Analysis (vLLM vs SystemDS)

On 4/5 workloads (math, reasoning, json_extraction, embeddings),
accuracy is identical. On 3 of these (math, json_extraction, embeddings),
predictions are byte-for-byte identical. On reasoning, 17/50 samples
produce different text due to APC but accuracy is still 29/50 in all 4
runs. The remaining workload (summarization) has both different text
(22/50) and different accuracy (25 vs 31) due to APC, as proven by the
run-order experiment (see below).
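
A sketch of how such a run-order check can be expressed. The result file
names and layout are illustrative; each file is assumed to hold a JSON list
of 50 prediction strings:

```python
import json

def load(path):
    with open(path) as f:
        return json.load(f)

runs = {
    "orig_1st_vllm":     load("results/original/vllm_reasoning.json"),
    "orig_2nd_systemds": load("results/original/systemds_reasoning.json"),
    "rev_1st_systemds":  load("results/reverse/systemds_reasoning.json"),
    "rev_2nd_vllm":      load("results/reverse/vllm_reasoning.json"),
}

def identical(a, b):
    return sum(x == y for x, y in zip(a, b))

# Same cache position, different sessions and backends (expected: 50/50):
print("1st vs 1st:", identical(runs["orig_1st_vllm"], runs["rev_1st_systemds"]))
print("2nd vs 2nd:", identical(runs["orig_2nd_systemds"], runs["rev_2nd_vllm"]))
# Different cache positions within one session (expected: divergent samples):
print("1st vs 2nd:", identical(runs["orig_1st_vllm"], runs["orig_2nd_systemds"]))
```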

**Note on labels:** In the committed results, "vLLM" ran first and
"SystemDS" ran second. For summarization, these labels correspond to
run position (1st = 25/50, 2nd = 31/50) rather than to the backend itself.

| Workload | Byte-identical | Divergent | Identity rate |
|----------|----------------|-----------|---------------|
| math | 50/50 | 0 | **100%** |
| json_extraction | 46/46 | 0 | **100%** |
| embeddings | 50/50 | 0 | **100%** |
| reasoning | 33/50 | 17 | 66% |
| summarization | 28/50 | 22 | 56% |

**Why do 3 workloads match perfectly but 2 don't?** The key factor is
output constraint level, not output length:

| Workload | Avg output length | Constraint type | Byte-identical |
|----------|-------------------|-----------------|----------------|
| json_extraction | 150 chars | Structured (JSON fields from input, n=46) | 100% |
| math | **1349 chars** | Arithmetic steps (one valid path) | **100%** |
| summarization | 328 chars | Unconstrained (many valid phrasings) | 56% |
| reasoning | 960 chars | Unconstrained (many valid phrasings) | 66% |

Math produces the **longest** outputs (avg 1349 chars) yet achieves
100% identity. This is because arithmetic is highly constrained: at each
step there is essentially one valid continuation. Summarization and
reasoning outputs have many valid phrasings ("The report found..." vs
"A report revealed..."), so even small differences in server cache state
(APC) or floating-point rounding can flip the selection at near-tied
positions.

**Root cause for all 39 divergent samples: vLLM Automatic Prefix Caching (APC).**

The run-order experiment proves that ALL divergent samples (22 summarization
+ 17 reasoning) follow the same APC pattern: same-position runs are 100%
identical across sessions, while cross-position runs diverge. The backend
label is irrelevant; only cache position matters.

**1. Summarization (22 samples):**

vLLM 0.15.1 enables APC by default (`enable_prefix_caching=True`). APC
stores KV cache tensors from previously processed prefixes and reuses

These are not random: the same cache state always produces the same
output. With temperature=0, `CUBLAS_WORKSPACE_CONFIG`, and sequential
requests, same prompt + same cache state → same code path → same output.

**2. Reasoning (17 samples): also APC.**

Reasoning follows the same APC pattern as summarization. Within a
session, 33/50 (66%) of predictions are byte-for-byte identical between
1st and 2nd run. The remaining 17 samples diverge due to APC changing
the KV cache state. Cross-session, same-position runs are 100% identical
(1st vs 1st, 2nd vs 2nd), proving position determines output, not the
backend.

Unlike summarization, the accuracy impact is zero: all 4 runs score
29/50. The 17 divergent samples produce different text but the same
yes/no answer (or different wrong answers). Reasoning diverges later in
the response (median divergence point ~400 chars) compared to
summarization, because BoolQ chain-of-thought reasoning shares more
common structure before branching.
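
The divergence point can be measured with a simple first-mismatch index
over each pair of responses; a minimal sketch (the run lists are assumed to
hold one response string per sample):

```python
from statistics import median

def divergence_index(a: str, b: str) -> int:
    """Index of the first differing character (min length if one is a prefix)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

def median_divergence(run1, run2):
    # Only samples whose text actually differs contribute a divergence point.
    points = [divergence_index(a, b) for a, b in zip(run1, run2) if a != b]
    return median(points) if points else None
```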

`CUBLAS_WORKSPACE_CONFIG=:4096:8` was used for all runs to force
deterministic cuBLAS algorithms. The constrained workloads (math,
json_extraction, embeddings) achieve 100% byte-identity. The
unconstrained workloads (reasoning, summarization) diverge due to APC
cache state, not cuBLAS non-determinism.

**Why streaming was investigated (and why it is not the cause).**

The original vLLM backend used `"stream": true` while SystemDS used
`"stream": false`. Streaming was checked first as a potential source
of byte-level corruption. The 150 byte-identical samples across math,
json_extraction, and embeddings ruled out SSE corruption. Switching to
`"stream": false` produced identical divergence counts (same 39
samples), confirming streaming had no effect. Both backends now use
non-streaming mode.

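A sketch of one way to sanity-check SSE reassembly against a non-streaming
response: send the same prompt with `stream=false` and with `stream=true`,
concatenate the streamed chunks, and compare the bytes. The URL, model
identifier, and prompt are illustrative:

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/completions"            # assumed local server
BASE = {"model": "Qwen/Qwen2.5-3B-Instruct",            # assumed model id
        "prompt": "Summarize: ...", "temperature": 0.0,
        "top_p": 0.9, "max_tokens": 256}

def post(payload):
    req = urllib.request.Request(URL, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

# Non-streaming: one JSON body.
full = json.load(post({**BASE, "stream": False}))["choices"][0]["text"]

# Streaming: concatenate the text deltas from each SSE "data:" line.
streamed = []
with post({**BASE, "stream": True}) as resp:
    for raw in resp:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        streamed.append(json.loads(data)["choices"][0]["text"])

print("byte-identical:", full == "".join(streamed))
```
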
The accuracy comparison is the apples-to-apples metric since all backends
process the same prompts with the same parameters.

**SystemDS vs vLLM latency** (same server, same model, CUBLAS
deterministic, non-streaming HTTP): Latencies are within 0--3% of each
other for generation workloads. These differences are within measurement
noise: the runs were ~6 minutes apart and divergent samples generate
different output lengths. Latency is dominated by output token count;
a sample where one run generates more tokens simply does more work.

| Workload | vLLM | SystemDS | Difference |
|----------|------|----------|------------|

latency is determined by output token count, not by which client sends
the request.

**Why output length differs between backends:**
When two sequential runs diverge at a single token (due to APC), the
two autoregressive paths produce responses of different lengths. Among
the 17 divergent reasoning samples, neither run position is
systematically longer. The difference in latency between backends on
these samples reflects the output length difference, not a performance
difference.

### Throughput (requests/second)

| Backend | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 |
|---------|------------|------------|------------|
| OpenAI | 0.270 | 0.066 | 0.201 |
| vLLM Qwen 3B | 0.220 | 0.057 | 0.157 |
| SystemDS Qwen 3B | 0.226 | 0.056 | 0.157 |
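
ROUGE F1 numbers of this kind can be computed with the standard
`rouge-score` package; a minimal sketch (whether the benchmark uses this
exact package and stemming setting is an assumption):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The city council approved the new budget on Tuesday."
prediction = "On Tuesday the council approved a new budget."

scores = scorer.score(reference, prediction)
for name, score in scores.items():
    # Per-sample F1; the table reports the average over all 50 samples.
    print(name, round(score.fmeasure, 3))
```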

## Conclusions

1. **SystemDS `llmPredict` is a lossless pass-through**: On 3/5
   workloads (math, json_extraction, embeddings), every response is
   byte-for-byte identical between vLLM and SystemDS (150/150 samples
   total). The JMLC pipeline (Py4J -> DML -> Java HTTP -> FrameBlock)
   introduces zero data loss or corruption.

2. **All 39 divergent samples are caused by vLLM Automatic Prefix
   Caching (APC)**: The run-order experiment proves that all divergent
   samples (22 summarization + 17 reasoning) follow the same pattern:
   same-position runs are 100% byte-identical across sessions, while
   cross-position runs diverge. For summarization, this changes accuracy
   (25/50 vs 31/50). For reasoning, the text differs but accuracy
   remains 29/50 in all 4 runs.
852878
8538793 . ** JMLC overhead is negligible** : Latencies between SystemDS and
854- direct vLLM calls are within 1--6%, within measurement noise.
855- Neither backend is meaningfully faster.
880+ direct vLLM calls are within 0--3% for generation workloads, within
881+ measurement noise. Neither backend is meaningfully faster.
856882
8578834 . ** Both backends benefit equally from vLLM server optimizations** :
858884 PagedAttention, continuous batching, KV cache, and CUDA kernels all
859885 happen server-side. Both are HTTP clients to the same server.
860886
8618875 . ** Cost tradeoff depends on scale** : For this small benchmark (250
862888 sequential queries, ~ 3 min total inference), OpenAI API ($0.047) is
863- cheaper than local H100 ($0.114 vLLM / $0.118 SystemDS) because hardware
864- amortization ($2.00/hr) dominates at low utilization. At production
865- scale with concurrent requests, owned hardware becomes significantly
866- cheaper per query.
889+ cheaper than local H100 ($0.108 vLLM / $0.109 SystemDS) because
890+ hardware amortization ($2.00/hr) dominates at low utilization. At
891+ production scale with concurrent requests, owned hardware becomes
892+ significantly cheaper per query.

6. **Model quality matters more than serving infrastructure**: The
   difference between OpenAI and Qwen 3B is model quality. The