MaritimEmails

A Synthetic Dataset for Maritime Chartering Correspondence

Resource Overview

| Property | Value |
| --- | --- |
| Language | English |
| Domain | Maritime chartering negotiations between brokers and charterers |
| Size | 19,817 email threads containing 103,705 individual messages |
| Thread Length | Mean 5.2 emails per thread (range: 1–35) |
| Email Length | Mean 42.5 words per email |
| Entity Coverage | Vessels, ports/locations, commodities, Incoterms, freight rates, laycan dates, demurrage terms |
| Format | JSON Lines (`.jsonl.gz`), with supplementary CSV and plain-text exports |
| License | CC BY-NC 4.0 |
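The size and length figures above can be recomputed from the records themselves. The following is a minimal sketch, assuming each record has the `email_chain` layout described under Data Format and that a "word" is a whitespace-separated token. The source does not state how words were counted, so the result may differ slightly from the reported 42.5. The function name `corpus_stats` is hypothetical.

```python
from statistics import mean


def corpus_stats(records):
    """Summarise thread and email lengths for an iterable of parsed records."""
    thread_lengths = []
    word_counts = []
    for record in records:
        chain = record["email_chain"]
        thread_lengths.append(len(chain))
        # Assumption: words are whitespace-separated tokens of the email body.
        word_counts.extend(len(email["body"].split()) for email in chain)
    return {
        "threads": len(thread_lengths),
        "messages": sum(thread_lengths),
        "mean_thread_length": mean(thread_lengths),
        "min_thread_length": min(thread_lengths),
        "max_thread_length": max(thread_lengths),
        "mean_words_per_email": mean(word_counts),
    }
```

`records` can be any iterable of parsed JSON objects, for example the decoded lines of `full.jsonl.gz`.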

Corpus Composition

By generation method:

| Method | Threads | Share |
| --- | --- | --- |
| AttrPrompting | 4,777 | 24.1% |
| BARE | 10,118 | 51.1% |
| Few-Shot | 2,475 | 12.5% |
| Zero-Shot | 2,447 | 12.3% |

By model:

| Model | Threads | Share |
| --- | --- | --- |
| Claude | 4,157 | 21.0% |
| DeepSeek | 3,753 | 18.9% |
| GPT-4 | 3,971 | 20.0% |
| Gemini | 3,549 | 17.9% |
| Mistral | 4,387 | 22.1% |

Label completeness:

| Status | Threads | Share |
| --- | --- | --- |
| Full (all fields populated) | 4,313 | 21.8% |
| Partial | 12,617 | 63.7% |
| Empty | 2,887 | 14.6% |

Generation Process

Emails were generated using five contemporary large language models (Mistral, DeepSeek, Claude, GPT-4, Gemini) under four prompting strategies:

- Attribute Prompting (AttrPrompting): systematic template population with maritime-specific attributes that control verbosity, formality, negotiation stage, sentiment, and writer role.
- BARE (Base–Refine): a base model (Llama-3.1-8B or Llama-3.2-3B) generates diverse drafts, which instruction-tuned models then refine for coherence.
- Few-Shot: generation guided by curated example chains.
- Zero-Shot: no examples or structured attributes provided.

Annotation Pipeline

Entity annotations were generated during email synthesis. Generation prompts instructed language models to produce both email conversations and structured entity labels simultaneously.

- For AttrPrompting, Few-Shot, and Zero-Shot generation, the models wrote the email text and populated the `labels` object in the same pass.
- For BARE generation, the instruction-tuned model extracted the entities from the base-generated text during the refinement stage.

All annotations are provided as key-value pairs in the labels object (see Data Format).
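Because the labels come from the generating model rather than from a separate annotation pass, one possible sanity check is to test whether each non-empty label value can be found in the thread text. The sketch below is an illustration, not part of the released tooling: the helper names are hypothetical, and the matching rule (lowercase, letters and digits only) is an assumption chosen so that "12277MT" matches "12,277MT".

```python
import re


def _normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def ungrounded_labels(record):
    """Return the label fields whose non-empty value is not found in any email body."""
    haystack = _normalize(" ".join(email["body"] for email in record["email_chain"]))
    return sorted(
        field
        for field, value in record["labels"].items()
        if value and _normalize(str(value)) not in haystack
    )
```

A flagged field is a candidate for manual review, not necessarily an error. Values that the labels store in a normalized form (ISO dates such as "2014-08-25") or that only appear in sender addresses (the broker name) will be flagged by this simple rule.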

Data Format

All files are encoded in UTF-8 and distributed as gzip-compressed JSON Lines files (`.jsonl.gz`), with one record per line.

Structure

Each record is a JSON object whose content is held in two top-level keys (provenance fields are described under Generation Metadata):

- `email_chain`: array of email objects representing a conversation thread
- `labels`: object containing extracted maritime entities and negotiation details
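Reading the files needs only the Python standard library. The following is a minimal sketch: the function name `read_jsonl_gz` is hypothetical, and skipping blank lines is a defensive assumption rather than a documented property of the files.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Calling `next(read_jsonl_gz("train.jsonl.gz"))` would then return the first thread as a dictionary with `email_chain` and `labels` entries. Because the function is a generator, the file is streamed rather than loaded into memory at once.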

Email Chain Fields

Each email in the `email_chain` array contains:

| Field | Description |
| --- | --- |
| `from` | Sender's email address |
| `to` | Recipient's email address |
| `subject` | Email subject line (may evolve with Re:/Fwd: prefixes) |
| `timestamp` | Date and time in `YYYY-MM-DD HH:MM` format |
| `body` | Email content including salutation, message, and signature |
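The `timestamp` format maps directly onto `datetime.strptime`. The sketch below parses it and checks whether a thread is stored in chronological order. Whether every thread is actually ordered this way is not stated in this document, so `is_chronological` (a hypothetical helper) is a check, not a guarantee.

```python
from datetime import datetime

# strptime pattern for the documented YYYY-MM-DD HH:MM format.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def parse_timestamp(value):
    """Convert a timestamp string such as "2014-08-13 09:23" to a datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def is_chronological(email_chain):
    """Return True if no email is timestamped earlier than the one before it."""
    stamps = [parse_timestamp(email["timestamp"]) for email in email_chain]
    return all(earlier <= later for earlier, later in zip(stamps, stamps[1:]))
```

The timestamps carry no time zone, so the resulting `datetime` objects are naive.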

Label Fields

The `labels` object contains the annotated entities and key negotiation details. A field is an empty string when the corresponding information does not appear in the conversation.

| Field | Description |
| --- | --- |
| `broker` | Brokerage firm name |
| `commodity` | Type of cargo (e.g., Sugar, Wheat, Crude Oil) |
| `load_port` | Port of loading |
| `discharge_port` | Port of discharge |
| `cargo_size` | Shipment quantity with unit (e.g., "12277MT") |
| `incoterm` | International Commercial Term (FOB, CIF, CFR, DAP, DDP) |
| `vessel` | Ship name |
| `dwt` | Deadweight tonnage capacity |
| `loa` | Length overall of vessel |
| `starting_freight_quote_currency` | Currency of initial rate offer |
| `starting_freight_quote` | Initial freight rate quoted |
| `final_freight_quote_currency` | Currency of agreed rate |
| `final_freight_quote` | Final negotiated freight rate |
| `laytime_start_date` | Laycan window start date (YYYY-MM-DD) |
| `laytime_end_date` | Laycan window end date (YYYY-MM-DD) |
| `demurrage_currency` | Currency for demurrage charges |
| `demurrage` | Demurrage rate |
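Since missing values are stored as empty strings, the label completeness status of a thread can be derived from its `labels` object. The sketch below assumes that "Empty" means no field is populated and "Partial" means anything in between. Only "Full" is defined explicitly in this document, so the other two rules are assumptions, and `completeness` is a hypothetical helper.

```python
LABEL_FIELDS = (
    "broker", "commodity", "load_port", "discharge_port", "cargo_size",
    "incoterm", "vessel", "dwt", "loa",
    "starting_freight_quote_currency", "starting_freight_quote",
    "final_freight_quote_currency", "final_freight_quote",
    "laytime_start_date", "laytime_end_date",
    "demurrage_currency", "demurrage",
)


def completeness(labels):
    """Classify a labels object as "Full", "Partial", or "Empty"."""
    filled = 0
    for field in LABEL_FIELDS:
        value = labels.get(field)
        # Missing keys, None, and blank strings all count as unpopulated.
        if value is not None and str(value).strip():
            filled += 1
    if filled == len(LABEL_FIELDS):
        return "Full"
    if filled == 0:
        return "Empty"  # assumption: "Empty" means no populated field
    return "Partial"
```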

Example Record

The following example illustrates a complete negotiation thread between a charterer and broker:

{
  "email_chain": [
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 09:23",
      "body": "Hi Nate,\n\nLooking for a vessel to carry 12,277MT of sugar from Cadiz to Heiligenhafen. Need loading window around end of August.\n\nPlease advise suitable tonnage and CIF rate.\n\nBest regards,\nMia"
    },
    {
      "from": "n.rosas@globalmaritime.com",
      "to": "m.mason@sealinetrading.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 10:45",
      "body": "Dear Mia,\n\nThanks for yr inquiry. Can offer MV GEMMA for yr cargo. She is modern vessel with good sugar history.\n\nCan fix basis following terms:\n- Rate: EUR44 PMT CIF\n- Laycan: 25-30 August\n- L/D rate: 5000mt pwwd SHINC\n- Demurrage: EUR 15,000 pd pro rata\n\nVessel currently positioning well for dates.\n\nPls advise if interessted.\n\nBest rgds,\nNate"
    },
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 11:17",
      "body": "Nate,\n\nRate seems bit high for this route. Can you check if EUR41 workable?\n\nAlso need vessel's main particulars and last 3 cargoes.\n\nRegards,\nMia"
    },
    {
      "from": "n.rosas@globalmaritime.com",
      "to": "m.mason@sealinetrading.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 12:03",
      "body": "Mia,\n\nOwners say can meet halfway at EUR42.5. Vessel particulars:\nDWT: 313,049\nLOA: 330m\nLast 3 cargoes: Sugar/Sugar/Grain\n\nVessel fresh from dd, all holds clean.\n\nCan hold rate until 15:00 today.\n\nRgds,\nNate"
    },
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 14:22",
      "body": "Nate,\n\nEUR42.5 acceptable if you can extend laycan to 25-31 August for more flexibility.\n\nPlease confirm if possible.\n\nMia"
    },
    {
      "from": "n.rosas@globalmaritime.com",
      "to": "m.mason@sealinetrading.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 14:45",
      "body": "Mia,\n\nOwnrs confirm laycan extension ok. All other terms as discussed.\n\nShall we proceed with recap?\n\nRgds,\nNate"
    },
    {
      "from": "m.mason@sealinetrading.com",
      "to": "n.rosas@globalmaritime.com",
      "subject": "Re: Sugar Cargo - Cadiz to Heiligenhafen Inquiry",
      "timestamp": "2014-08-13 15:10",
      "body": "Yes, please send recap.\n\nMia"
    }
  ],
  "labels": {
    "broker": "Global Maritime Brokers",
    "commodity": "Sugar",
    "load_port": "Cadiz",
    "discharge_port": "Heiligenhafen",
    "cargo_size": "12277MT",
    "incoterm": "CIF",
    "vessel": "GEMMA",
    "dwt": "313049",
    "loa": "330m",
    "starting_freight_quote_currency": "EUR",
    "starting_freight_quote": "44",
    "final_freight_quote_currency": "EUR",
    "final_freight_quote": "42.5",
    "laytime_start_date": "2014-08-25",
    "laytime_end_date": "2014-08-31",
    "demurrage_currency": "EUR",
    "demurrage": "15000"
  }
}
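All label values in the example are strings, so numeric work requires a conversion step. The following sketch shows one way to do this for the formats visible in the example above (a quantity with a unit suffix, ISO dates, plain decimal quotes). The helper names are hypothetical, and other formats that may occur in the corpus, such as decimal commas, are not handled.

```python
import re
from datetime import date

# A number with optional thousands commas and decimals, then an optional unit.
_QUANTITY = re.compile(r"\s*(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]*)\s*")


def parse_quantity(text):
    """Split a value such as "12277MT" or "330m" into (number, unit), or return None."""
    match = _QUANTITY.fullmatch(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "")), match.group(2)


def laycan_days(labels):
    """Length of the laycan window in days, counting both the start and end date."""
    start = date.fromisoformat(labels["laytime_start_date"])
    end = date.fromisoformat(labels["laytime_end_date"])
    return (end - start).days + 1


def freight_change(labels):
    """Final minus starting quote, or None if currencies differ or a quote is not numeric."""
    if labels["starting_freight_quote_currency"] != labels["final_freight_quote_currency"]:
        return None
    try:
        return float(labels["final_freight_quote"]) - float(labels["starting_freight_quote"])
    except ValueError:
        return None


# Values copied from the labels of the example record.
example_labels = {
    "cargo_size": "12277MT",
    "starting_freight_quote_currency": "EUR",
    "starting_freight_quote": "44",
    "final_freight_quote_currency": "EUR",
    "final_freight_quote": "42.5",
    "laytime_start_date": "2014-08-25",
    "laytime_end_date": "2014-08-31",
}
print(parse_quantity(example_labels["cargo_size"]))  # (12277.0, 'MT')
print(laycan_days(example_labels))                   # 7
print(freight_change(example_labels))                # -1.5
```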

Generation Metadata

Each record includes metadata fields for provenance tracking (not shown in the example above):

| Field | Description |
| --- | --- |
| `thread_id` | Unique identifier for the email chain |
| `message_id` | Unique identifier for each individual email |
| `generation_method` | One of: AttrPrompting, BARE, Few-Shot, Zero-Shot |
| `model` | LLM used for generation: Mistral, DeepSeek, Claude, GPT-4, Gemini |
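These fields make it possible to recompute the Corpus Composition tables. The sketch below tallies records by one metadata field and reports counts with percentage shares. It assumes that `generation_method` and `model` are stored as top-level keys of each record, which this document does not state explicitly. The function name `composition` and the `"<missing>"` placeholder are hypothetical.

```python
from collections import Counter


def composition(records, key):
    """Map each value of a metadata field to (thread count, share in percent)."""
    counts = Counter(record.get(key, "<missing>") for record in records)
    total = sum(counts.values())
    return {value: (n, round(100 * n / total, 1)) for value, n in counts.items()}
```

Applied to the full corpus, `composition(records, "generation_method")` should reproduce the table by generation method, for example 10,118 BARE threads out of 19,817 (51.1%).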

Data Splits

Pre-defined train/dev/test splits are provided to prevent information leakage:

| Split | File | Threads | Share |
| --- | --- | --- | --- |
| Train | `train.jsonl.gz` | 14,862 | ≈75% |
| Dev | `dev.jsonl.gz` | 2,378 | ≈12% |
| Test | `test.jsonl.gz` | 2,577 | ≈13% |

Thread IDs are unique across splits: each thread appears in exactly one of the three files.
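The uniqueness of thread IDs across splits is straightforward to verify after download. A minimal sketch, assuming the `thread_id` values of each split have already been collected into a mapping from split name to IDs (`overlapping_thread_ids` is a hypothetical helper):

```python
from itertools import combinations


def overlapping_thread_ids(split_ids):
    """Return {(split_a, split_b): shared IDs} for every pair of splits that overlaps.

    split_ids maps a split name to an iterable of thread IDs.
    An empty result means the splits are disjoint.
    """
    id_sets = {name: set(ids) for name, ids in split_ids.items()}
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(id_sets.items(), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps
```

For the released splits the function is expected to return an empty dictionary.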

Compute and Cost Transparency

Generation was conducted between March and June 2025 using hosted APIs. The total cost of generation and annotation amounted to approximately USD 158, distributed across models as follows: GPT-4 (35%), Claude (28%), Gemini (17%), Mistral (12%), DeepSeek (8%). Processing required roughly 42 GPU-hours equivalent (estimated at 1.5 kWh).
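The percentage breakdown can be turned into approximate per-model amounts and a cost per thread. The arithmetic below uses only the figures stated in this document. Since the total is itself approximate, the derived values are rough estimates.

```python
TOTAL_COST_USD = 158  # approximate total reported above
COST_SHARE_PERCENT = {"GPT-4": 35, "Claude": 28, "Gemini": 17, "Mistral": 12, "DeepSeek": 8}
THREADS = 19_817  # corpus size from the Resource Overview

# Approximate spend per model in USD.
per_model_usd = {
    model: round(TOTAL_COST_USD * share / 100, 2)
    for model, share in COST_SHARE_PERCENT.items()
}

# Approximate cost of generating and annotating 1,000 threads.
cost_per_1000_threads = TOTAL_COST_USD / THREADS * 1000

print(per_model_usd)
print(round(cost_per_1000_threads, 2))  # 7.97
```

At this rate a thread costs well under one US cent on average. The per-model amounts are not directly comparable per thread without also taking the per-model thread counts into account.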

Intended Use and Limitations

MaritimEmails is designed for research on synthetic data generation, email processing, and domain-specific information extraction. It is not intended for commercial negotiation automation or for reproducing identifiable human communication styles.

As a synthetic resource, it contains no personally identifiable information. Nonetheless, LLM-based annotation may introduce minor inconsistencies, and the corpus reflects the biases of the underlying models, most notably a documented positivity bias and limited lexical coverage of rare maritime terms.

Ethical Considerations

The corpus simulates professional correspondence and may contain synthetic expressions of disagreement or negotiation tension. No real individuals or organizations are represented. To reduce misuse, each email includes a metadata flag `"synthetic": true`. Researchers are advised to clearly indicate synthetic provenance in downstream publications or applications.

License

CC BY-NC 4.0

Citation

If you use this dataset, please cite:

@inproceedings{bruendler2026maritimemails,
    title     = {MaritimEmails: A Synthetic Dataset for Maritime Chartering Correspondence},
    author    = {Br{\"u}ndler, Kevin and Clematide, Simon},
    booktitle = {Proceedings of LREC 2026},
    year      = {2026}
}
