A Synthetic Dataset for Maritime Chartering Correspondence
| Property | Value |
|---|---|
| Language | English |
| Domain | Maritime chartering negotiations between brokers and charterers |
| Size | 19,817 email threads containing 103,705 individual messages |
| Thread Length | Mean 5.2 emails per thread (range: 1–35) |
| Email Length | Mean 42.5 words per email |
| Entity Coverage | Vessels, ports/locations, commodities, Incoterms, freight rates, laycan dates, demurrage terms |
| Format | JSON Lines (.jsonl.gz), with supplementary CSV and plain-text exports |
| License | CC BY-NC 4.0 |
By generation method:
| Method | Threads | Share |
|---|---|---|
| AttrPrompting | 4,777 | 24.1% |
| BARE | 10,118 | 51.1% |
| Few-Shot | 2,475 | 12.5% |
| Zero-Shot | 2,447 | 12.3% |
By model:
| Model | Threads | Share |
|---|---|---|
| Claude | 4,157 | 21.0% |
| DeepSeek | 3,753 | 18.9% |
| GPT-4 | 3,971 | 20.0% |
| Gemini | 3,549 | 17.9% |
| Mistral | 4,387 | 22.1% |
Label completeness:
| Status | Threads | Share |
|---|---|---|
| Full (all fields populated) | 4,313 | 21.8% |
| Partial | 12,617 | 63.7% |
| Empty | 2,887 | 14.6% |
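The three completeness categories can be derived from a record's labels object (see Data Format). The sketch below is one plausible reading, not necessarily the rule used to build the table: "Full" follows the definition given above, while treating "Empty" as "no field populated" and counting whitespace-only strings as unpopulated are assumptions. The function names are hypothetical.

```python
from typing import Any, Mapping


def is_populated(value: Any) -> bool:
    """Return True if a label value carries content (assumption: blank strings do not)."""
    return value is not None and str(value).strip() != ""


def completeness(labels: Mapping[str, Any]) -> str:
    """Classify a labels object as 'Full', 'Partial', or 'Empty'."""
    flags = [is_populated(value) for value in labels.values()]
    if flags and all(flags):
        return "Full"      # every field populated
    if any(flags):
        return "Partial"   # at least one field populated, but not all
    return "Empty"         # no field populated (assumed definition)
```

Counting the return values over all threads should give figures comparable to the table, provided the assumptions match how the statistics were computed.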
Emails were generated using five contemporary large language models (Mistral, DeepSeek, Claude, GPT-4, Gemini) under four prompting strategies:
- Attribute Prompting: Systematic template population with maritime-specific attributes controlling verbosity, formality, negotiation stage, sentiment, and writer role
- BARE (Base–Refine): A base model (Llama-3.1-8B or Llama-3.2-3B) generates diverse drafts, which an instruction-tuned model then refines for coherence
- Few-Shot: Guided by curated example chains
- Zero-Shot: No examples or structured attributes provided
Entity annotations were generated during email synthesis. Generation prompts instructed language models to produce both email conversations and structured entity labels simultaneously.
- For AttrPrompting, Few-Shot, and Zero-Shot generation, models created email text while populating the labels object.
- For BARE generations, the instruction-tuned model extracted entities during the refinement stage from base-generated text.
All annotations are provided as key-value pairs in the labels object (see Data Format).
All files are encoded in UTF-8 and distributed as compressed archives (.jsonl.gz).
Each record is a JSON object with two top-level keys:
- `email_chain`: Array of email objects representing a conversation thread
- `labels`: Object containing extracted maritime entities and negotiation details
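A minimal sketch of reading the records with the Python standard library, assuming one JSON object per line in a gzip-compressed UTF-8 file as described above. The helper names `iter_records` and `count_threads_and_messages` are hypothetical and not part of the dataset.

```python
import gzip
import json
from typing import Iterator, Tuple


def iter_records(path: str) -> Iterator[dict]:
    """Yield one thread record per non-blank line of a .jsonl.gz file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


def count_threads_and_messages(path: str) -> Tuple[int, int]:
    """Return (number of threads, total number of emails) in one file."""
    threads = 0
    messages = 0
    for record in iter_records(path):
        threads += 1
        messages += len(record.get("email_chain", []))
    return threads, messages
```

Streaming line by line avoids loading a whole split into memory; for example, `count_threads_and_messages("train.jsonl.gz")` would report the thread and message counts for the training split.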
Each email in the email_chain array contains:
| Field | Description |
|---|---|
| `from` | Sender's email address |
| `to` | Recipient's email address |
| `subject` | Email subject line (may evolve with Re:/Fwd: prefixes) |
| `timestamp` | Date and time in YYYY-MM-DD HH:MM format |
| `body` | Email content including salutation, message, and signature |
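The timestamp format maps directly onto `datetime.strptime`. The sketch below parses it and derives two simple thread-level checks. That emails within a chain appear in chronological order is an assumption, not something the dataset guarantees, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Mapping

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # YYYY-MM-DD HH:MM, as documented above


def parse_timestamp(value: str) -> datetime:
    """Parse an email timestamp; raises ValueError on malformed input."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def is_chronological(email_chain: List[Mapping[str, str]]) -> bool:
    """Check that each email is no earlier than the one before it."""
    times = [parse_timestamp(email["timestamp"]) for email in email_chain]
    return all(earlier <= later for earlier, later in zip(times, times[1:]))


def thread_duration(email_chain: List[Mapping[str, str]]) -> timedelta:
    """Time elapsed between the first and last email of a non-empty chain."""
    first = parse_timestamp(email_chain[0]["timestamp"])
    last = parse_timestamp(email_chain[-1]["timestamp"])
    return last - first
```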
The labels object contains annotated entities and key information. Fields may be empty strings if not present in the conversation.
| Field | Description |
|---|---|
| `broker` | Brokerage firm name |
| `commodity` | Type of cargo (e.g., Sugar, Wheat, Crude Oil) |
| `load_port` | Port of loading |
| `discharge_port` | Port of discharge |
| `cargo_size` | Shipment quantity with unit (e.g., "12277MT") |
| `incoterm` | International Commercial Term (FOB, CIF, CFR, DAP, DDP) |
| `vessel` | Ship name |
| `dwt` | Deadweight tonnage capacity |
| `loa` | Length overall of vessel |
| `starting_freight_quote_currency` | Currency of initial rate offer |
| `starting_freight_quote` | Initial freight rate quoted |
| `final_freight_quote_currency` | Currency of agreed rate |
| `final_freight_quote` | Final negotiated freight rate |
| `laytime_start_date` | Laycan window start date (YYYY-MM-DD) |
| `laytime_end_date` | Laycan window end date (YYYY-MM-DD) |
| `demurrage_currency` | Currency for demurrage charges |
| `demurrage` | Demurrage rate |
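Because label values are stored as strings and may be empty, downstream code usually needs to convert them defensively. The sketch below shows one way to do that for three fields. The helper names are hypothetical, and the handling of thousands separators and the inclusive day count are assumptions rather than documented conventions.

```python
import re
from datetime import date
from typing import Mapping, Optional, Tuple

# A number followed by an optional unit, e.g. "12277MT" (assumed pattern).
_CARGO_RE = re.compile(r"^\s*([\d.,]+)\s*([A-Za-z]*)\s*$")


def parse_cargo_size(value: str) -> Optional[Tuple[float, str]]:
    """Split a cargo size such as '12277MT' into (12277.0, 'MT')."""
    match = _CARGO_RE.match(value or "")
    if not match:
        return None
    try:
        # Assumption: commas are thousands separators, not decimal marks.
        return float(match.group(1).replace(",", "")), match.group(2).upper()
    except ValueError:
        return None


def freight_concession_pct(labels: Mapping[str, str]) -> Optional[float]:
    """Percentage drop from the starting to the final freight quote."""
    try:
        start = float(labels["starting_freight_quote"])
        final = float(labels["final_freight_quote"])
    except (KeyError, TypeError, ValueError):
        return None  # field missing, empty, or non-numeric
    if start == 0:
        return None
    return (start - final) / start * 100


def laycan_days(labels: Mapping[str, str]) -> Optional[int]:
    """Length of the laycan window in days, counting both end dates."""
    try:
        start = date.fromisoformat(labels["laytime_start_date"])
        end = date.fromisoformat(labels["laytime_end_date"])
    except (KeyError, TypeError, ValueError):
        return None
    return (end - start).days + 1
```

Returning `None` instead of raising keeps partially labelled threads usable in aggregate statistics.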
The following example illustrates a complete negotiation thread between a charterer and broker:
{
"email_chain": [
{
"from": "m.mason@sealinetrading.com",
"to": "n.rosas@globalmaritime.com",
"subject": "Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
"timestamp": "2014-08-13 09:23",
"body": "Hi Nate,\n\nLooking for a vessel to carry 12,277MT of sugar from Cadiz to Heiligenhafen. Need loading window around end of August.\n\nPlease advise suitable tonnage and CIF rate.\n\nBest regards,\nMia"
},
{
"from": "n.rosas@globalmaritime.com",
"to": "m.mason@sealinetrading.com",
"subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
"timestamp": "2014-08-13 10:45",
"body": "Dear Mia,\n\nThanks for yr inquiry. Can offer MV GEMMA for yr cargo. She is modern vessel with good sugar history.\n\nCan fix basis following terms:\n- Rate: EUR44 PMT CIF\n- Laycan: 25-30 August\n- L/D rate: 5000mt pwwd SHINC\n- Demurrage: EUR 15,000 pd pro rata\n\nVessel currently positioning well for dates.\n\nPls advise if interessted.\n\nBest rgds,\nNate"
},
{
"from": "m.mason@sealinetrading.com",
"to": "n.rosas@globalmaritime.com",
"subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
"timestamp": "2014-08-13 11:17",
"body": "Nate,\n\nRate seems bit high for this route. Can you check if EUR41 workable?\n\nAlso need vessel's main particulars and last 3 cargoes.\n\nRegards,\nMia"
},
{
"from": "n.rosas@globalmaritime.com",
"to": "m.mason@sealinetrading.com",
"subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
"timestamp": "2014-08-13 12:03",
"body": "Mia,\n\nOwners say can meet halfway at EUR42.5. Vessel particulars:\nDWT: 313,049\nLOA: 330m\nLast 3 cargoes: Sugar/Sugar/Grain\n\nVessel fresh from dd, all holds clean.\n\nCan hold rate until 15:00 today.\n\nRgds,\nNate"
},
{
"from": "m.mason@sealinetrading.com",
"to": "n.rosas@globalmaritime.com",
"subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
"timestamp": "2014-08-13 14:22",
"body": "Nate,\n\nEUR42.5 acceptable if you can extend laycan to 25-31 August for more flexibility.\n\nPlease confirm if possible.\n\nMia"
},
{
"from": "n.rosas@globalmaritime.com",
"to": "m.mason@sealinetrading.com",
"subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
"timestamp": "2014-08-13 14:45",
"body": "Mia,\n\nOwnrs confirm laycan extension ok. All other terms as discussed.\n\nShall we proceed with recap?\n\nRgds,\nNate"
},
{
"from": "m.mason@sealinetrading.com",
"to": "n.rosas@globalmaritime.com",
"subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
"timestamp": "2014-08-13 15:10",
"body": "Yes, please send recap.\n\nMia"
}
],
"labels": {
"broker": "Global Maritime Brokers",
"commodity": "Sugar",
"load_port": "Cadiz",
"discharge_port": "Heiligenhafen",
"cargo_size": "12277MT",
"incoterm": "CIF",
"vessel": "GEMMA",
"dwt": "313049",
"loa": "330m",
"starting_freight_quote_currency": "EUR",
"starting_freight_quote": "44",
"final_freight_quote_currency": "EUR",
"final_freight_quote": "42.5",
"laytime_start_date": "2014-08-25",
"laytime_end_date": "2014-08-31",
"demurrage_currency": "EUR",
"demurrage": "15000"
}
}

Each record includes metadata fields for provenance tracking (not shown in the example above):
| Field | Description |
|---|---|
| `thread_id` | Unique identifier for the email chain |
| `message_id` | Unique identifier for each individual email |
| `generation_method` | One of: AttrPrompting, BARE, Few-Shot, Zero-Shot |
| `model` | LLM used for generation: Mistral, DeepSeek, Claude, GPT-4, Gemini |
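The provenance fields make it straightforward to break the corpus down by generation method and model. The sketch below assumes that `generation_method` and `model` are stored as top-level keys of each record; the exact placement is not shown in the example, so adjust the lookups if the files differ. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple

Key = Tuple[str, str]


def provenance_counts(records: Iterable[Mapping]) -> Counter:
    """Count threads per (generation_method, model) pair."""
    counts: Counter = Counter()
    for record in records:
        # Assumption: both fields are top-level keys of the record.
        method = record.get("generation_method", "unknown")
        model = record.get("model", "unknown")
        counts[(method, model)] += 1
    return counts


def shares(counts: Mapping[Key, int]) -> Dict[Key, float]:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in counts.items()}
```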
Pre-defined train/dev/test splits are provided to prevent information leakage:
| Split | File | Threads | Share |
|---|---|---|---|
| Train | train.jsonl.gz | 14,862 | ≈75% |
| Dev | dev.jsonl.gz | 2,378 | ≈12% |
| Test | test.jsonl.gz | 2,577 | ≈13% |
Each thread ID appears in exactly one split, so no thread is shared between train, dev, and test.
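Users who re-split or subsample the data may want to re-verify this property. A minimal sketch, assuming the thread IDs of each split have already been collected into iterables; the function name is hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Mapping, Set, Tuple


def overlapping_thread_ids(
    splits: Mapping[str, Iterable[Hashable]],
) -> Dict[Tuple[str, str], Set[Hashable]]:
    """Return the thread IDs shared by each pair of splits.

    Only pairs with at least one shared ID are included, so an empty
    result means the splits are disjoint.
    """
    id_sets = {name: set(ids) for name, ids in splits.items()}
    overlaps: Dict[Tuple[str, str], Set[Hashable]] = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(id_sets.items(), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps
```

Calling it with a mapping such as `{"train": train_ids, "dev": dev_ids, "test": test_ids}` should return an empty dictionary for the released splits.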
Generation was conducted between March and June 2025 using hosted APIs. The total cost of generation and annotation amounted to approximately USD 158, distributed across models as follows: GPT-4 (35%), Claude (28%), Gemini (17%), Mistral (12%), DeepSeek (8%). Processing required roughly 42 GPU-hours equivalent (estimated at 1.5 kWh).
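The stated shares can be turned into approximate absolute figures. The arithmetic below uses only numbers reported in this document; since the total is itself approximate, the results are rough estimates.

```python
TOTAL_COST_USD = 158.0   # approximate total reported above
TOTAL_THREADS = 19_817   # dataset size reported in the overview table

COST_SHARE = {
    "GPT-4": 0.35,
    "Claude": 0.28,
    "Gemini": 0.17,
    "Mistral": 0.12,
    "DeepSeek": 0.08,
}


def cost_by_model(total: float, share: dict) -> dict:
    """Split a total cost across models according to their shares."""
    return {model: round(total * fraction, 2) for model, fraction in share.items()}


costs = cost_by_model(TOTAL_COST_USD, COST_SHARE)
cost_per_thread = TOTAL_COST_USD / TOTAL_THREADS

# GPT-4 comes to about USD 55.30 and DeepSeek to about USD 12.64;
# averaged over all threads, the cost is a little under one US cent each.
```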
MaritimEmails is designed for research on synthetic data generation, email processing, and domain-specific information extraction. It is not intended for commercial negotiation automation or for reproducing identifiable human communication styles.
As a synthetic resource, it contains no personally identifiable information. Nonetheless, LLM-based annotation may introduce minor inconsistencies and reflects the biases of the underlying models, most notably a documented positivity bias and limited lexical coverage of rare maritime terms.
The corpus simulates professional correspondence and may contain synthetic expressions of disagreement or negotiation tension. No real individuals or organizations are represented. To reduce misuse, each email includes a metadata flag "synthetic": true. Researchers are advised to clearly indicate synthetic provenance in downstream publications or applications.
The dataset is released under the CC BY-NC 4.0 license.
If you use this dataset, please cite:
@inproceedings{bruendler2026maritimemails,
title = {MaritimEmails: A Synthetic Dataset for Maritime Chartering Correspondence},
author = {Br{\"u}ndler, Kevin and Clematide, Simon},
booktitle = {Proceedings of LREC 2026},
year = {2026}
}