Commit 7dc735d

#1602 - Upgrade dependencies
- Revert migration to Jackson and instead upgrade to modern JAXB
1 parent 8676525 commit 7dc735d

32 files changed

Lines changed: 523 additions & 361 deletions

JAXB_vs_Jackson_Analysis.md

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# JAXB vs Jackson Analysis for DKPro Core Migration

## Current Situation

We have migrated both XCES and BioC modules from `javax.xml.bind` (JAXB) to Jackson XML. The user is now questioning whether we should instead upgrade to **Jakarta JAXB** rather than cross-grading to Jackson.

## Original JAXB Approach

### XCES Module (Writing)
**Pattern**: Hybrid XMLEventWriter + JAXB Marshaller
```java
JAXBContext context = JAXBContext.newInstance(XcesBody.class);
Marshaller marshaller = context.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE);

// Manually write structure with XMLEventWriter
xmlEventWriter.add(xmlef.createStartElement("", "", "cesDoc"));
xmlEventWriter.add(xmlef.createStartElement("", "", "cesHeader"));
xmlEventWriter.add(xmlef.createEndElement("", "", "cesHeader"));
xmlEventWriter.add(xmlef.createStartElement("", "", "text"));

// Marshal body content directly to the event writer
marshaller.marshal(new JAXBElement<XcesBody>(new QName("body"), XcesBody.class, xb),
        xmlEventWriter);
```

**Key Advantage**: JAXB Marshaller can write **directly to XMLEventWriter** - NO string round-trip needed!

### BioC Module (Reading)
**Pattern**: XMLEventReader + JAXB Unmarshaller
```java
JAXBContext context = JAXBContext.newInstance(BioCDocument.class);
Unmarshaller unmarshaller = context.createUnmarshaller();

// Unmarshal directly from XMLEventReader
var document = unmarshaller.unmarshal(getXmlEventReader(), BioCDocument.class).getValue();
```

**Key Advantage**: JAXB Unmarshaller works **directly with XMLEventReader** - clean streaming!

## Current Jackson Approach

### XCES Module (Writing)
**Pattern**: XMLEventWriter + Jackson → String → XMLEventReader → XMLEventWriter
```java
XmlMapper xmlMapper = new XmlMapper();

// XmlMapper has no overload that writes to an XMLEventWriter, so serialize to a String first
String bodyXml = xmlMapper.writer().withRootName("body").writeValueAsString(xb);

// Must parse string back to events and inject into stream
XMLEventReader bodyReader = xif.createXMLEventReader(new StringReader(bodyXml));
while (bodyReader.hasNext()) {
    xmlEventWriter.add(bodyReader.nextEvent());
}
```

**Problem**: Requires a string round-trip because Jackson's `XmlMapper` cannot write to an `XMLEventWriter`. It can write to an `XMLStreamWriter`, but the XCES writer is built on the event API.
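
A minimal, JDK-only sketch of the event-copy half of this round-trip, assuming the body has already been serialized to a string (by Jackson or anything else). The names `FragmentSplice`, `splice`, and `wrapInCesDoc` are hypothetical and not part of DKPro Core. One detail the loop above glosses over: a reader created from a standalone string also emits `StartDocument` and `EndDocument` events, which probably should not be copied into the middle of an enclosing document, so the sketch filters them out.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, for illustration only.
final class FragmentSplice {

    // Copies the events of a serialized fragment into an already open event writer,
    // dropping the fragment's own document start/end events.
    static void splice(String fragmentXml, XMLEventWriter out) throws XMLStreamException {
        XMLEventReader in = XMLInputFactory.newFactory()
                .createXMLEventReader(new StringReader(fragmentXml));
        try {
            while (in.hasNext()) {
                XMLEvent event = in.nextEvent();
                if (event.isStartDocument() || event.isEndDocument()) {
                    continue;
                }
                out.add(event);
            }
        }
        finally {
            in.close();
        }
    }

    // Writes <cesDoc><text>...</text></cesDoc> by hand and splices the body in between.
    static String wrapInCesDoc(String bodyXml) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        XMLEventFactory ef = XMLEventFactory.newFactory();
        XMLEventWriter out = XMLOutputFactory.newFactory().createXMLEventWriter(sw);
        out.add(ef.createStartElement("", "", "cesDoc"));
        out.add(ef.createStartElement("", "", "text"));
        splice(bodyXml, out);
        out.add(ef.createEndElement("", "", "text"));
        out.add(ef.createEndElement("", "", "cesDoc"));
        out.flush();
        out.close();
        return sw.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(wrapInCesDoc("<?xml version=\"1.0\"?><body><p>hi</p></body>"));
    }
}
```

The whole fragment still sits in memory as a string before it is copied, which is the buffering cost this analysis attributes to the Jackson variant.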

### BioC Module (Reading)
**Pattern**: XMLStreamReader + Jackson XmlMapper
```java
XmlMapper mapper = new XmlMapper();

// Jackson supports XMLStreamReader (not XMLEventReader)
var document = mapper.readValue(getXmlStreamReader(), BioCDocument.class);
```

**Advantage**: Switched from XMLEventReader to XMLStreamReader - eliminated string buffering from an earlier bad approach
**Limitation**: Required changing the streaming API from events to cursor-based

## Jakarta JAXB Option

### Migration Path
- `javax.xml.bind` → `jakarta.xml.bind`
- Package rename: `javax.xml.bind.*` → `jakarta.xml.bind.*`
- Same API surface, just namespace change

### Dependencies
```xml
<dependency>
    <groupId>jakarta.xml.bind</groupId>
    <artifactId>jakarta.xml.bind-api</artifactId>
    <version>4.0.2</version>
</dependency>
<dependency>
    <groupId>org.glassfish.jaxb</groupId>
    <artifactId>jaxb-runtime</artifactId>
    <version>4.0.5</version>
    <scope>runtime</scope>
</dependency>
```

## Comparison Matrix

| Aspect | Jakarta JAXB | Jackson XML | Winner |
|--------|-------------|-------------|---------|
| **XCES Hybrid Writing** | ✅ Marshal directly to XMLEventWriter | ❌ Must round-trip through String | **JAXB** |
| **BioC Streaming Reading** | ✅ Unmarshal from XMLEventReader | ✅ ReadValue from XMLStreamReader | **Tie** |
| **Code Simplicity (XCES)** | Simple: 1 line marshal | Complex: 20+ lines string round-trip | **JAXB** |
| **Code Simplicity (BioC)** | Simple: 1 line unmarshal | Simple: 1 line readValue | **Tie** |
| **Memory Efficiency (XCES)** | ✅ Zero buffering | ❌ String buffering | **JAXB** |
| **Memory Efficiency (BioC)** | ✅ Zero buffering | ✅ Zero buffering | **Tie** |
| **Standardization** | Jakarta EE standard | De-facto JSON/XML library | **JAXB** |
| **Maintenance** | Stable, mature | Active development | **Tie** |
| **Learning Curve** | Known technology | New annotations | **JAXB** |
| **Performance** | Mature, optimized | Mature, optimized | **Tie** |
| **Annotations** | `@XmlElement`, `@XmlAttribute` | `@JsonProperty`, custom converters | **JAXB** |
| **Migration Effort** | Package rename only | Complete rewrite (done) | **JAXB** |

## Technical Deep Dive

### Why JAXB Works Better for XCES

The XCES format requires this structure:
```xml
<cesDoc>
  <cesHeader/>
  <text>
    <body>
      <!-- Content here via object mapping -->
    </body>
  </text>
</cesDoc>
```

**With JAXB**: The Marshaller can write directly to the middle of an XMLEventWriter stream:
```java
// Write structure manually
xmlEventWriter.add(startElement("cesDoc"));
xmlEventWriter.add(startElement("text"));

// JAXB writes <body> directly to the same stream
marshaller.marshal(bodyObject, xmlEventWriter);

// Continue writing structure
xmlEventWriter.add(endElement("text"));
xmlEventWriter.add(endElement("cesDoc"));
```

**With Jackson**: Cannot write to XMLEventWriter, must round-trip:
```java
// Write structure manually
xmlEventWriter.add(startElement("cesDoc"));
xmlEventWriter.add(startElement("text"));

// Jackson → String → Events → XMLEventWriter
String xml = xmlMapper.writeValueAsString(bodyObject);
XMLEventReader events = createReader(new StringReader(xml));
while (events.hasNext()) {
    xmlEventWriter.add(events.nextEvent()); // Copy all events
}

// Continue writing structure
xmlEventWriter.add(endElement("text"));
xmlEventWriter.add(endElement("cesDoc"));
```

### Why Both Work Similarly for BioC

BioC reads multi-document collections and extracts individual documents:

**With JAXB**:
```java
Unmarshaller unmarshaller = context.createUnmarshaller();
BioCDocument doc = unmarshaller.unmarshal(xmlEventReader, BioCDocument.class).getValue();
```

**With Jackson**:
```java
XmlMapper mapper = new XmlMapper();
BioCDocument doc = mapper.readValue(xmlStreamReader, BioCDocument.class);
```

Both are clean, both stream efficiently. The main difference is XMLEventReader vs XMLStreamReader, which is just API preference.
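
Both calls expect the reader to be positioned at the start of a `<document>` element. How the reader gets there is not shown in this analysis, so the following JDK-only sketch illustrates one possible way to seek with `XMLEventReader.peek()`. `DocumentSeek`, `seekTo`, `skipElement`, and `countDocuments` are hypothetical names, and `skipElement` merely stands in for the unmarshal call, which would consume the same subtree.

```java
import java.io.StringReader;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, for illustration only.
final class DocumentSeek {

    // Advances until the NEXT event is a start element with the given local name.
    // The start element itself is left unconsumed for whoever binds the subtree.
    static boolean seekTo(XMLEventReader reader, String localName) throws XMLStreamException {
        while (reader.hasNext()) {
            XMLEvent next = reader.peek();
            if (next.isStartElement()
                    && localName.equals(next.asStartElement().getName().getLocalPart())) {
                return true;
            }
            reader.nextEvent();
        }
        return false;
    }

    // Consumes one complete element, tracking nesting depth (stand-in for unmarshalling).
    static void skipElement(XMLEventReader reader) throws XMLStreamException {
        int depth = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            }
            else if (event.isEndElement() && --depth == 0) {
                return;
            }
        }
    }

    // Counts top-level <document> subtrees in a collection, one seek per document.
    static int countDocuments(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newFactory()
                .createXMLEventReader(new StringReader(xml));
        int count = 0;
        try {
            while (seekTo(reader, "document")) {
                skipElement(reader);
                count++;
            }
        }
        finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(countDocuments(
                "<collection><document><id>1</id></document>"
                        + "<document><id>2</id></document></collection>"));
    }
}
```

Peeking rather than consuming matters here: the binding call needs to see the start element of the subtree it is about to read.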

## Recommendation

### **Switch to Jakarta JAXB**

**Reasons**:

1. **XCES hybrid approach is MUCH cleaner** - eliminates the string round-trip entirely
2. **Same or better performance** - no string buffering in XCES writers
3. **Simpler code** - JAXB Marshaller/Unmarshaller integrate seamlessly with StAX streaming
4. **Less migration effort** - Jakarta JAXB is just a package rename from javax JAXB
5. **Standard approach** - Jakarta EE is the successor to Java EE, this is the "official" migration path
6. **Familiar API** - The model classes already have JAXB annotations (were there originally)
7. **No architectural compromises** - JAXB was designed specifically for XML with StAX integration

**Migration Effort**:
- Change dependencies: `javax.xml.bind` → `jakarta.xml.bind`
- Update imports: `javax.xml.bind.*` → `jakarta.xml.bind.*`
- Revert code to original JAXB approach (simpler than current Jackson code!)
- Tests should pass with minimal changes

**Conclusion**: Jakarta JAXB is the better choice. It solves the original problem (Java module system compatibility) without introducing architectural compromises. Jackson XML is excellent for many use cases, but for these hybrid StAX scenarios, JAXB's native integration is superior.

dkpro-core-io-bioc-asl/pom.xml

Lines changed: 4 additions & 8 deletions
@@ -28,16 +28,12 @@
   <url>https://dkpro.github.io/dkpro-core/</url>
   <dependencies>
     <dependency>
-      <groupId>com.fasterxml.jackson.core</groupId>
-      <artifactId>jackson-databind</artifactId>
+      <groupId>jakarta.xml.bind</groupId>
+      <artifactId>jakarta.xml.bind-api</artifactId>
     </dependency>
     <dependency>
-      <groupId>com.fasterxml.jackson.core</groupId>
-      <artifactId>jackson-annotations</artifactId>
-    </dependency>
-    <dependency>
-      <groupId>com.fasterxml.jackson.dataformat</groupId>
-      <artifactId>jackson-dataformat-xml</artifactId>
+      <groupId>org.glassfish.jaxb</groupId>
+      <artifactId>jaxb-runtime</artifactId>
     </dependency>
     <dependency>
       <groupId>org.apache.uima</groupId>

dkpro-core-io-bioc-asl/src/main/java/org/dkpro/core/io/bioc/BioCReader.java

Lines changed: 26 additions & 39 deletions
@@ -20,10 +20,8 @@
 import static org.dkpro.core.io.bioc.BioCComponent.addCollectionMetadataField;

 import java.io.IOException;
-import java.io.StringWriter;
 import java.util.Optional;

-import javax.xml.stream.XMLOutputFactory;
 import javax.xml.stream.XMLStreamException;

 import org.apache.uima.UimaContext;
@@ -38,10 +36,10 @@
 import org.dkpro.core.io.bioc.internal.BioCToCas;
 import org.dkpro.core.io.bioc.internal.model.BioCDocument;

-import com.fasterxml.jackson.databind.DeserializationFeature;
-import com.fasterxml.jackson.dataformat.xml.XmlMapper;
-
 import eu.openminted.share.annotations.api.DocumentationResource;
+import jakarta.xml.bind.JAXBContext;
+import jakarta.xml.bind.JAXBException;
+import jakarta.xml.bind.Unmarshaller;

 /**
  * Reader for the BioC format.
@@ -53,20 +51,26 @@
 public class BioCReader
     extends BioCReaderImplBase
 {
-    private XmlMapper mapper;
+    private JAXBContext context;
+    private Unmarshaller unmarshaller;
     private Optional<BioCDocument> nextDocument;

     @Override
     public void initialize(UimaContext aContext) throws ResourceInitializationException
     {
         super.initialize(aContext);
-        mapper = new XmlMapper();
-        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+
+        try {
+            context = JAXBContext.newInstance(BioCDocument.class);
+        }
+        catch (JAXBException e) {
+            throw new ResourceInitializationException(e);
+        }

         try {
             nextDocument = nextBioCDocument();
         }
-        catch (CollectionException | XMLStreamException | IOException e) {
+        catch (CollectionException | XMLStreamException | JAXBException | IOException e) {
             throw new ResourceInitializationException(e);
         }
     }
@@ -90,14 +94,14 @@ public void getNext(JCas aJCas) throws IOException, CollectionException
         // DocumentMetaData.get(aJCas).setDocumentId(document.getId());
         // }

-        var jb = new JCasBuilder(aJCas);
+        JCasBuilder jb = new JCasBuilder(aJCas);
         new BioCToCas().readDocument(jb, document);
         jb.close();

         try {
             nextDocument = nextBioCDocument();
         }
-        catch (XMLStreamException | IOException e) {
+        catch (XMLStreamException | JAXBException e) {
             throw new IOException(e);
         }
     }
@@ -109,7 +113,7 @@ public boolean hasNext() throws IOException, CollectionException
     }

     private Optional<BioCDocument> nextBioCDocument()
-        throws XMLStreamException, CollectionException, IOException
+        throws XMLStreamException, JAXBException, CollectionException, IOException
     {
         if (!isFileOpen()) {
             openNextFile();
@@ -127,43 +131,26 @@ private Optional<BioCDocument> nextBioCDocument()
     protected void openNextFile() throws IOException, XMLStreamException, CollectionException
     {
         super.openNextFile();
-        // no-op for XmlMapper-based parsing
+        try {
+            unmarshaller = context.createUnmarshaller();
+        }
+        catch (JAXBException e) {
+            throw new IOException(e);
+        }
     }

     @Override
     protected void closeFile()
     {
-        // mapper is reused, nothing to clear per-file
+        unmarshaller = null;
         super.closeFile();
     }

-    private Optional<BioCDocument> nextBioCDocumentInFile() throws XMLStreamException, IOException
+    private Optional<BioCDocument> nextBioCDocumentInFile() throws XMLStreamException, JAXBException
     {
         if (seekNextBioCDocumentInFile()) {
-            // Serialize the current <document> event sequence to a string and parse with XmlMapper
-            var sw = new StringWriter();
-            var outFactory = XMLOutputFactory.newFactory();
-            var xew = outFactory.createXMLEventWriter(sw);
-
-            int depth = 0;
-            while (getXmlEventReader().hasNext()) {
-                var e = getXmlEventReader().nextEvent();
-                xew.add(e);
-                if (e.isStartElement()) {
-                    depth++;
-                }
-                else if (e.isEndElement()) {
-                    depth--;
-                    if (depth == 0) {
-                        break;
-                    }
-                }
-            }
-            xew.flush();
-            xew.close();
-
-            var xml = sw.toString();
-            var document = mapper.readValue(xml, BioCDocument.class);
+            var document = unmarshaller.unmarshal(getXmlEventReader(), BioCDocument.class)
+                    .getValue();
             return Optional.of(document);
         }
