ZurichNLP/MaritimEmails

MaritimEmails

A Synthetic Dataset for Maritime Chartering Correspondence

Resource Overview

| Property | Value |
|---|---|
| Language | English |
| Domain | Maritime chartering negotiations between brokers and charterers |
| Size | 19,817 email threads containing 103,705 individual messages |
| Thread Length | Mean 5.2 emails per thread (range: 1–35) |
| Email Length | Mean 42.5 words per email |
| Entity Coverage | Vessels, ports/locations, commodities, Incoterms, freight rates, laycan dates, demurrage terms |
| Format | JSON Lines (.jsonl.gz), with supplementary CSV and plain-text exports |
| License | CC BY-NC 4.0 |

Corpus Composition

By generation method:

| Method | Threads | Share |
|---|---|---|
| AttrPrompting | 4,777 | 24.1% |
| BARE | 10,118 | 51.1% |
| Few-Shot | 2,475 | 12.5% |
| Zero-Shot | 2,447 | 12.3% |

By model:

| Model | Threads | Share |
|---|---|---|
| Claude | 4,157 | 21.0% |
| DeepSeek | 3,753 | 18.9% |
| GPT-4 | 3,971 | 20.0% |
| Gemini | 3,549 | 17.9% |
| Mistral | 4,387 | 22.1% |

Label completeness:

| Status | Threads | Share |
|---|---|---|
| Full (all fields populated) | 4,313 | 21.8% |
| Partial | 12,617 | 63.7% |
| Empty | 2,887 | 14.6% |

Generation Process

Emails were generated using five contemporary large language models (Mistral, DeepSeek, Claude, GPT-4, Gemini) under four prompting strategies:

  • Attribute Prompting: Systematic template population with maritime-specific attributes controlling verbosity, formality, negotiation stage, sentiment, and writer role
  • BARE (Base–Refine): A base model (Llama-3.1-8B or Llama-3.2-3B) generates diverse drafts, which an instruction-tuned model then refines for coherence
  • Few-Shot: Guided by curated example chains
  • Zero-Shot: No examples or structured attributes provided

Annotation Pipeline

Entity annotations were generated during email synthesis. Generation prompts instructed language models to produce both email conversations and structured entity labels simultaneously.

  • For AttrPrompting, Few-Shot, and Zero-Shot generation, models created email text while populating the labels object.
  • For BARE generations, the instruction-tuned model extracted entities during the refinement stage from base-generated text.

All annotations are provided as key-value pairs in the labels object (see Data Format).

Data Format

All files are UTF-8 encoded and distributed as gzip-compressed JSON Lines files (.jsonl.gz), one record per line.

Structure

Each record is a JSON object with two top-level keys:

  • email_chain: Array of email objects representing a conversation thread
  • labels: Object containing extracted maritime entities and negotiation details
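A minimal sketch of reading such a file with the Python standard library. The helper name `read_jsonl_gz` is illustrative and not part of the repository; the file name in the usage note comes from the Data Splits section.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a gzip-compressed JSON Lines file."""
    # "rt" opens the gzip stream in text mode so lines arrive as str, decoded as UTF-8.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Usage would look like `for record in read_jsonl_gz("train.jsonl.gz"): ...`, with `record["email_chain"]` and `record["labels"]` giving the two parts described above. Reading lazily keeps memory use flat regardless of file size.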

Email Chain Fields

Each email in the email_chain array contains:

| Field | Description |
|---|---|
| from | Sender's email address |
| to | Recipient's email address |
| subject | Email subject line (may evolve with Re:/Fwd: prefixes) |
| timestamp | Date and time in YYYY-MM-DD HH:MM format |
| body | Email content including salutation, message, and signature |
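As an illustration of the timestamp format, the sketch below parses it with `datetime.strptime` and measures how long a thread lasted. The function name `thread_duration` is hypothetical; only the `timestamp` field and its format come from the table above.

```python
from datetime import datetime, timedelta

# strptime pattern matching the documented "YYYY-MM-DD HH:MM" layout.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def thread_duration(email_chain: list) -> timedelta:
    """Return the time between the earliest and latest email in a thread."""
    times = [datetime.strptime(email["timestamp"], TIMESTAMP_FORMAT) for email in email_chain]
    if not times:
        return timedelta(0)
    return max(times) - min(times)


# First and last timestamps from the example record in this README.
chain = [{"timestamp": "2014-08-13 09:23"}, {"timestamp": "2014-08-13 15:10"}]
print(thread_duration(chain))  # 5:47:00
```

Using `max` and `min` rather than the first and last list elements avoids assuming that every chain is stored in chronological order.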

Label Fields

The labels object contains annotated entities and key information. Fields may be empty strings if not present in the conversation.

| Field | Description |
|---|---|
| broker | Brokerage firm name |
| commodity | Type of cargo (e.g., Sugar, Wheat, Crude Oil) |
| load_port | Port of loading |
| discharge_port | Port of discharge |
| cargo_size | Shipment quantity with unit (e.g., "12277MT") |
| incoterm | International Commercial Term (FOB, CIF, CFR, DAP, DDP) |
| vessel | Ship name |
| dwt | Deadweight tonnage capacity |
| loa | Length overall of vessel |
| starting_freight_quote_currency | Currency of initial rate offer |
| starting_freight_quote | Initial freight rate quoted |
| final_freight_quote_currency | Currency of agreed rate |
| final_freight_quote | Final negotiated freight rate |
| laytime_start_date | Laycan window start date (YYYY-MM-DD) |
| laytime_end_date | Laycan window end date (YYYY-MM-DD) |
| demurrage_currency | Currency for demurrage charges |
| demurrage | Demurrage rate |
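The label-completeness categories reported under Corpus Composition can be approximated from these fields. The sketch below assumes that "Empty" means no field is populated and "Partial" means anything in between; only the "Full" definition (all fields populated) is stated in this README, and the function names are illustrative.

```python
# The 17 label fields listed in the table above.
LABEL_FIELDS = [
    "broker", "commodity", "load_port", "discharge_port", "cargo_size",
    "incoterm", "vessel", "dwt", "loa",
    "starting_freight_quote_currency", "starting_freight_quote",
    "final_freight_quote_currency", "final_freight_quote",
    "laytime_start_date", "laytime_end_date",
    "demurrage_currency", "demurrage",
]


def is_populated(value) -> bool:
    """Treat None, missing keys, and blank strings as unpopulated."""
    return value is not None and str(value).strip() != ""


def label_status(labels: dict) -> str:
    """Classify a labels object as "Full", "Partial", or "Empty" (assumed definitions)."""
    populated = sum(is_populated(labels.get(field)) for field in LABEL_FIELDS)
    if populated == 0:
        return "Empty"
    if populated == len(LABEL_FIELDS):
        return "Full"
    return "Partial"
```

If the counts produced this way differ from the published table, the assumed definitions of "Partial" and "Empty" are the first thing to revisit.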

Example Record

The following example illustrates a complete negotiation thread between a charterer and a broker:

{
  "email_chain": [
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 09:23",
      "body": "Hi Nate,\n\nLooking for a vessel to carry 12,277MT of sugar from Cadiz to Heiligenhafen. Need loading window around end of August.\n\nPlease advise suitable tonnage and CIF rate.\n\nBest regards,\nMia"
    },
    {
      "from": "n.rosas@globalmaritime.com",
      "to": "m.mason@sealinetrading.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 10:45",
      "body": "Dear Mia,\n\nThanks for yr inquiry. Can offer MV GEMMA for yr cargo. She is modern vessel with good sugar history.\n\nCan fix basis following terms:\n- Rate: EUR44 PMT CIF\n- Laycan: 25-30 August\n- L/D rate: 5000mt pwwd SHINC\n- Demurrage: EUR 15,000 pd pro rata\n\nVessel currently positioning well for dates.\n\nPls advise if interessted.\n\nBest rgds,\nNate"
    },
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 11:17",
      "body": "Nate,\n\nRate seems bit high for this route. Can you check if EUR41 workable?\n\nAlso need vessel's main particulars and last 3 cargoes.\n\nRegards,\nMia"
    },
    {
      "from": "n.rosas@globalmaritime.com",
      "to": "m.mason@sealinetrading.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 12:03",
      "body": "Mia,\n\nOwners say can meet halfway at EUR42.5. Vessel particulars:\nDWT: 313,049\nLOA: 330m\nLast 3 cargoes: Sugar/Sugar/Grain\n\nVessel fresh from dd, all holds clean.\n\nCan hold rate until 15:00 today.\n\nRgds,\nNate"
    },
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 14:22",
      "body": "Nate,\n\nEUR42.5 acceptable if you can extend laycan to 25-31 August for more flexibility.\n\nPlease confirm if possible.\n\nMia"
    },
    {
      "from": "n.rosas@globalmaritime.com",
      "to": "m.mason@sealinetrading.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 14:45",
      "body": "Mia,\n\nOwnrs confirm laycan extension ok. All other terms as discussed.\n\nShall we proceed with recap?\n\nRgds,\nNate"
    },
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 15:10",
      "body": "Yes, please send recap.\n\nMia"
    }
  ],
  "labels": {
    "broker": "Global Maritime Brokers",
    "commodity": "Sugar",
    "load_port": "Cadiz",
    "discharge_port": "Heiligenhafen",
    "cargo_size": "12277MT",
    "incoterm": "CIF",
    "vessel": "GEMMA",
    "dwt": "313049",
    "loa": "330m",
    "starting_freight_quote_currency": "EUR",
    "starting_freight_quote": "44",
    "final_freight_quote_currency": "EUR",
    "final_freight_quote": "42.5",
    "laytime_start_date": "2014-08-25",
    "laytime_end_date": "2014-08-31",
    "demurrage_currency": "EUR",
    "demurrage": "15000"
  }
}
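Label values in the example are stored as strings, so downstream code typically needs to convert them. The following sketch shows one possible conversion for `cargo_size` and the laycan dates. The helper names are hypothetical, and it assumes that commas in quantities are thousands separators and that the laycan window is inclusive of both dates.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Digits with optional comma grouping and decimals, then an alphabetic unit, e.g. "12277MT".
CARGO_SIZE_PATTERN = re.compile(r"\s*(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)\s*")


def parse_cargo_size(value: str) -> Optional[Tuple[float, str]]:
    """Split a cargo_size string into (quantity, unit); return None if it does not match."""
    match = CARGO_SIZE_PATTERN.fullmatch(value)
    if match is None:
        return None
    quantity = float(match.group(1).replace(",", ""))  # assumption: comma = thousands separator
    return quantity, match.group(2).upper()


def laycan_days(labels: dict) -> Optional[int]:
    """Length of the laycan window in days, counting both end dates; None if unavailable."""
    start = labels.get("laytime_start_date", "")
    end = labels.get("laytime_end_date", "")
    if not start or not end:
        return None
    try:
        return (date.fromisoformat(end) - date.fromisoformat(start)).days + 1
    except ValueError:  # malformed date string
        return None


print(parse_cargo_size("12277MT"))  # (12277.0, 'MT')
print(laycan_days({"laytime_start_date": "2014-08-25", "laytime_end_date": "2014-08-31"}))  # 7
```

Both helpers return `None` rather than raising, since label fields may be empty strings when an entity is absent from the conversation.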

Generation Metadata

Each record includes metadata fields for provenance tracking (not shown in the example above):

| Field | Description |
|---|---|
| thread_id | Unique identifier for the email chain |
| message_id | Unique identifier for each individual email |
| generation_method | One of: AttrPrompting, BARE, Few-Shot, Zero-Shot |
| model | LLM used for generation: Mistral, DeepSeek, Claude, GPT-4, Gemini |
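These fields make it possible to recompute the composition tables per method or per model. The sketch below assumes that `generation_method` and `model` are top-level keys of each record, which this README does not state explicitly; the function name `composition` is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def composition(records: Iterable[dict], key: str) -> Dict[str, Tuple[int, float]]:
    """Map each value of `key` to (thread count, share in percent rounded to one decimal)."""
    # Assumption: `key` sits at the top level of the record; missing keys are grouped as "unknown".
    counts = Counter(record.get(key, "unknown") for record in records)
    total = sum(counts.values())
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}


# Toy input, not taken from the dataset.
sample = [
    {"generation_method": "BARE", "model": "Mistral"},
    {"generation_method": "BARE", "model": "Claude"},
    {"generation_method": "Few-Shot", "model": "Claude"},
]
print(composition(sample, "generation_method"))  # {'BARE': (2, 66.7), 'Few-Shot': (1, 33.3)}
```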

Data Splits

Pre-defined train/dev/test splits are provided to prevent information leakage:

| Split | File | Threads | Share |
|---|---|---|---|
| Train | train.jsonl.gz | 14,862 | ≈75% |
| Dev | dev.jsonl.gz | 2,378 | ≈12% |
| Test | test.jsonl.gz | 2,577 | ≈13% |

No thread ID appears in more than one split.
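A quick way to check the split boundaries after loading is sketched below. It assumes each record exposes `thread_id` as a top-level key, and the function name `overlapping_thread_ids` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlapping_thread_ids(splits: Dict[str, List[dict]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return, for each pair of splits, the thread IDs they share (pairs with none are omitted)."""
    # Assumption: every record carries a top-level "thread_id".
    ids = {name: {record["thread_id"] for record in records} for name, records in splits.items()}
    overlaps = {}
    for a, b in combinations(ids, 2):
        shared = ids[a] & ids[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Toy input, not taken from the dataset: "t2" leaks from train into test.
toy = {
    "train": [{"thread_id": "t1"}, {"thread_id": "t2"}],
    "dev": [{"thread_id": "t3"}],
    "test": [{"thread_id": "t2"}],
}
print(overlapping_thread_ids(toy))  # {('train', 'test'): {'t2'}}
```

An empty result means no thread occurs in more than one of the loaded splits.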

Compute and Cost Transparency

Generation was conducted between March and June 2025 using hosted APIs. The total cost of generation and annotation was approximately USD 158, distributed across models as follows: GPT-4 (35%), Claude (28%), Gemini (17%), Mistral (12%), DeepSeek (8%). Processing required roughly 42 GPU-hours equivalent (estimated at 1.5 kWh).
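For orientation, the percentages above can be turned into approximate per-model amounts. This is plain arithmetic on the rounded figures reported in this section, so the results are estimates and not audited costs; the variable names are illustrative.

```python
TOTAL_COST_USD = 158.0   # approximate total reported above
TOTAL_THREADS = 19_817   # corpus size from the Resource Overview

COST_SHARE = {
    "GPT-4": 0.35,
    "Claude": 0.28,
    "Gemini": 0.17,
    "Mistral": 0.12,
    "DeepSeek": 0.08,
}

# Approximate spend per model, rounded to cents.
cost_per_model = {model: round(TOTAL_COST_USD * share, 2) for model, share in COST_SHARE.items()}

# Average cost per 1,000 generated threads across all models.
cost_per_1000_threads = round(1000 * TOTAL_COST_USD / TOTAL_THREADS, 2)

for model, cost in cost_per_model.items():
    print(f"{model}: ~USD {cost:.2f}")
print(f"Per 1,000 threads: ~USD {cost_per_1000_threads:.2f}")
```

Under these figures the average comes to just under USD 8 per 1,000 threads.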

Intended Use and Limitations

MaritimEmails is designed for research on synthetic data generation, email processing, and domain-specific information extraction. It is not intended for commercial negotiation automation or for reproducing identifiable human communication styles.

As a synthetic resource, it contains no personally identifiable information. Nonetheless, LLM-based annotation may introduce minor inconsistencies, and the corpus reflects the biases of the underlying models, most notably a documented positivity bias and limited lexical coverage of rare maritime terms.

Ethical Considerations

The corpus simulates professional correspondence and may contain synthetic expressions of disagreement or negotiation tension. No real individuals or organizations are represented. To reduce misuse, each email includes a metadata flag "synthetic": true. Researchers are advised to clearly indicate synthetic provenance in downstream publications or applications.

License

CC BY-NC 4.0

Citation

If you use this dataset, please cite:

@inproceedings{bruendler2026maritimemails,
    title     = {MaritimEmails: A Synthetic Dataset for Maritime Chartering Correspondence},
    author    = {Br{\"u}ndler, Kevin and Clematide, Simon},
    booktitle = {Proceedings of LREC 2026},
    year      = {2026}
}

About

Repository for the Submission at LREC 2026.
