3md Implementation Guide

For Developers and Tool Builders

This document provides technical guidance for implementing 3md parsers, validators, and conversion tools.

Parser Architecture
Error Handling and Validation
Output Formats
Parser Implementation Tips
Testing Requirements

Parser Architecture

Recommended Parsing Pipeline

3md source
    ↓
1. Extract YAML frontmatter (optional)
    ↓
2. Parse {{langs|si|ta|en}} header
    ↓
3. Split on blank lines (\n\n) → Blocks
    ↓
4. For each block:
   - Check for ෴ separator (block-level)
   - Check for ~ separator (inline)
   - If present: Multi Block (split by separator)
   - If absent: Mono Block
    ↓
5. Parse Markdown within each language variant
    ↓
6. Resolve entity references from frontmatter
    ↓
7. Generate output (HTML/JSON/per-language MD)

Parser State Machine

STATE: INITIAL
  ↓
  Read frontmatter (if present: ---...---)
  ↓
STATE: EXPECT_LANG_HEADER
  ↓
  Parse {{langs|si|ta|en}}
  Store language order
  ↓
STATE: PARSE_BLOCKS
  ↓
  For each block:
    - Detect separator type (෴ or ~)
    - Split into variants
    - Validate variant count matches language count
    - Parse Markdown in each variant
  ↓
STATE: COMPLETE
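
The states above can be sketched as a small enum and a linear walk through them. This is only an illustration under simplifying assumptions: `ParserState` and `run_states` are hypothetical names, the header checks are line-based, and block parsing is left out.

```python
from enum import Enum, auto


class ParserState(Enum):
    INITIAL = auto()
    EXPECT_LANG_HEADER = auto()
    PARSE_BLOCKS = auto()
    COMPLETE = auto()


def run_states(source: str) -> list[ParserState]:
    """Walk the states in order and return the trace (hypothetical helper)."""
    trace = [ParserState.INITIAL]
    lines = source.split('\n')
    pos = 0

    # INITIAL: skip optional frontmatter delimited by --- lines
    if lines and lines[0] == '---':
        try:
            pos = lines.index('---', 1) + 1
        except ValueError:
            raise ValueError('Unclosed frontmatter block') from None
    trace.append(ParserState.EXPECT_LANG_HEADER)

    # EXPECT_LANG_HEADER: the next non-blank line must be the declaration
    while pos < len(lines) and not lines[pos].strip():
        pos += 1
    if pos >= len(lines) or not lines[pos].startswith('{{langs|'):
        raise ValueError('Missing language declaration')
    trace.append(ParserState.PARSE_BLOCKS)

    # PARSE_BLOCKS: block handling is omitted in this sketch
    trace.append(ParserState.COMPLETE)
    return trace
```

Keeping the state explicit makes it easy to report which stage an error came from.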

Core Data Structures

Document AST:

interface Document {
  frontmatter?: Frontmatter;
  languages: LanguageCode[];
  blocks: Block[];
}

interface Block {
  type: 'multi' | 'mono';
  content: MultiContent | MonoContent;
}

interface MultiContent {
  variants: Map<LanguageCode, string>;
  separator: 'block' | 'inline';
  element: 'paragraph' | 'heading' | 'list' | 'blockquote' | 'table';
}

interface MonoContent {
  text: string;
  element: 'code' | 'paragraph';
}

Error Handling and Validation

For detailed error specifications, see ERRORS.md.

Key Validation Requirements

Document-Level:

Language declaration must be present and valid
ISO 639-1 language codes must be correct (si, ta, en)
Language order must be consistent throughout document
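
A minimal sketch of the document-level checks, assuming the declaration sits on its own line in the form `{{langs|si|ta|en}}` and that the three codes named above are the only valid ones. The function name matches the `parse_language_declaration` call used later in this guide, but the body, `LanguageDeclarationError`, and the duplicate check are assumptions. Whether a document may declare fewer than three languages is not stated here, so the sketch only rejects unknown and duplicate codes.

```python
import re

VALID_CODES = {'si', 'ta', 'en'}  # assumption: the codes named in this guide

# Searches the whole text; a stricter parser would require the declaration
# to be the first non-blank line after the frontmatter.
LANG_DECL = re.compile(r'^\{\{langs\|([^{}]*)\}\}[ \t]*$', re.MULTILINE)


class LanguageDeclarationError(ValueError):
    """Hypothetical error type for a missing or invalid declaration."""


def parse_language_declaration(text: str) -> tuple[list[str], str]:
    """Return (ordered language codes, text after the declaration)."""
    match = LANG_DECL.search(text)
    if match is None:
        raise LanguageDeclarationError('Missing {{langs|...}} declaration')

    codes = [code.strip() for code in match.group(1).split('|')]
    unknown = [code for code in codes if code not in VALID_CODES]
    if unknown:
        raise LanguageDeclarationError(f'Invalid language codes: {unknown}')
    if len(set(codes)) != len(codes):
        raise LanguageDeclarationError(f'Duplicate language codes: {codes}')

    return codes, text[match.end():].lstrip('\n')
```

The returned list preserves declaration order, which is what later variant splitting relies on.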

Block-Level:

Variant count must match declared language count
Separators must not be mixed within same block (can't use both ෴ and ~ in one block)
Entity references must be properly closed

Content-Level:

Escape sequences must be valid
Code blocks must be properly delimited
Table structures must be valid Markdown

Error Philosophy

3md parsers adopt a fail-fast approach with clear, actionable error messages:

ERROR at line 42:

සිංහල~தமிழ்

Found 2 variants but expected 3 (si, ta, en).

Possible fixes:
1. Add missing variant:
   සිංහල~தமிழ்~English

2. Use explicit empty marker:
   සිංහල~தமிழ்~{{empty}}
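
A message in this style could be assembled as follows. This is a sketch, not the reference implementation: `format_variant_mismatch` is a hypothetical name, and the naive `split` ignores escaped separators and code spans, which a real parser would have to respect.

```python
def format_variant_mismatch(line_no: int, block: str, langs: list[str],
                            separator: str = '~') -> str:
    """Build an actionable message for a variant-count mismatch."""
    found = block.split(separator)  # naive split, for illustration only
    lines = [
        f'ERROR at line {line_no}:',
        '',
        block,
        '',
        f'Found {len(found)} variants but expected {len(langs)} '
        f'({", ".join(langs)}).',
    ]
    missing = len(langs) - len(found)
    if missing > 0:
        # Suggest padding with the explicit empty marker shown above
        padded = separator.join(found + ['{{empty}}'] * missing)
        lines += ['', 'Possible fix: mark the missing variants as empty:',
                  f'   {padded}']
    return '\n'.join(lines)
```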

Validation Levels

Level 1: Critical Errors (Parsing Fails)

Missing language declaration
Invalid language codes
Mismatched variant counts
Malformed frontmatter YAML

Level 2: Warnings (Parsing Succeeds, May Need Review)

Single-language content without separators (potential mono block ambiguity)
Entity references to undefined entities
Inconsistent table structures across variants
Very long lines (>120 characters)

Level 3: Style Suggestions (Informational)

Mixed separator usage (using inline for some headings, block for others)
Inconsistent indentation
Missing translation status in frontmatter
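
One way to represent the three levels in code is a severity enum attached to each diagnostic, so that a caller can decide whether parsing failed. The names `Severity`, `Diagnostic`, `has_errors`, and `check_line_length` are assumptions made for this sketch; only the 120-character limit comes from the list above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    ERROR = 1       # Level 1: parsing fails
    WARNING = 2     # Level 2: parsing succeeds, may need review
    SUGGESTION = 3  # Level 3: informational


@dataclass(frozen=True)
class Diagnostic:
    severity: Severity
    line: int
    message: str


def has_errors(diagnostics: list[Diagnostic]) -> bool:
    """True if any diagnostic is a Level 1 error."""
    return any(d.severity == Severity.ERROR for d in diagnostics)


def check_line_length(lines: list[str], limit: int = 120) -> list[Diagnostic]:
    """Level 2 check: flag lines longer than `limit` characters."""
    return [
        Diagnostic(Severity.WARNING, number,
                   f'Line is {len(line)} characters (limit {limit})')
        for number, line in enumerate(lines, start=1)
        if len(line) > limit
    ]
```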

Output Formats

Per-Language Markdown

Extract single-language Markdown files:

3md source → parser → si.md, ta.md, en.md

Example output (si.md):

# හැඳින්වීම

මෙය සරල ලේඛනයකි.

## මූලධර්ම

1. පළමු මූලධර්මය
2. දෙවන මූලධර්මය
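
A minimal sketch of the extraction step, assuming blocks shaped like the JSON AST representation in this guide: a `variants` mapping for multi blocks, a `text` field for mono blocks, and an optional `level` on headings. `extract_language` is a hypothetical name, and only headings get special treatment here; other elements are passed through as stored.

```python
def extract_language(blocks: list[dict], lang: str) -> str:
    """Render one language's Markdown from AST-style block dicts."""
    parts = []
    for block in blocks:
        if block['type'] == 'mono':
            # Mono blocks are shared by every language
            parts.append(block['text'])
            continue
        text = block['variants'][lang]
        if block.get('element') == 'heading':
            # Re-add the heading marker for the stored level
            text = '#' * block.get('level', 1) + ' ' + text
        parts.append(text)
    return '\n\n'.join(parts) + '\n'
```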

JSON AST

Structured representation for programmatic access:

{
  "version": "0.1.0",
  "frontmatter": {
    "status": {
      "si": "synced",
      "ta": "fuzzy",
      "en": "source"
    }
  },
  "languages": ["si", "ta", "en"],
  "blocks": [
    {
      "type": "multi",
      "separator": "inline",
      "element": "heading",
      "level": 1,
      "variants": {
        "si": "හැඳින්වීම",
        "ta": "அறிமுகம்",
        "en": "Introduction"
      }
    },
    {
      "type": "multi",
      "separator": "block",
      "element": "paragraph",
      "variants": {
        "si": "මෙය සරල ලේඛනයකි.",
        "ta": "இது ஒரு எளிய ஆவணம்.",
        "en": "This is a simple document."
      }
    }
  ]
}
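
When serializing the AST from Python, note that `json.dumps` escapes non-ASCII characters by default, which turns Sinhala and Tamil text into `\uXXXX` sequences. A minimal sketch, with `ast_to_json` as a hypothetical helper name:

```python
import json


def ast_to_json(document: dict) -> str:
    """Serialize an AST dict, keeping Sinhala and Tamil text readable."""
    # ensure_ascii=False writes non-ASCII characters as-is
    return json.dumps(document, ensure_ascii=False, indent=2)
```

The output should then be written with an explicit UTF-8 encoding.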

HTML with Language Attributes

Generate multilingual HTML:

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Multilingual Document</title>
</head>
<body>
  <article lang="si">
    <h1>හැඳින්වීම</h1>
    <p>මෙය සරල ලේඛනයකි.</p>
  </article>

  <article lang="ta">
    <h1>அறிமுகம்</h1>
    <p>இது ஒரு எளிய ஆவணம்.</p>
  </article>

  <article lang="en">
    <h1>Introduction</h1>
    <p>This is a simple document.</p>
  </article>
</body>
</html>
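
The body of such a page could be generated along these lines. This is a sketch under stated assumptions: `render_articles` is a hypothetical name, blocks follow the JSON AST shape shown earlier, only multi blocks of type heading and paragraph are handled, and inline Markdown inside a variant is escaped rather than rendered.

```python
from html import escape


def render_articles(blocks: list[dict], langs: list[str]) -> str:
    """Emit one <article lang="..."> per language from AST-style blocks."""
    articles = []
    for lang in langs:
        body = []
        for block in blocks:
            # Escape text so that &, < and > in content cannot break markup
            text = escape(block['variants'][lang])
            if block['element'] == 'heading':
                level = block.get('level', 1)
                body.append(f'    <h{level}>{text}</h{level}>')
            else:
                body.append(f'    <p>{text}</p>')
        articles.append(
            f'  <article lang="{escape(lang)}">\n'
            + '\n'.join(body)
            + '\n  </article>'
        )
    return '\n\n'.join(articles)
```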

Parallel HTML View

Side-by-side rendering for editing/review:

<div class="parallel-view">
  <div class="lang-column" lang="si">
    <h1>හැඳින්වීම</h1>
    <p>මෙය සරල ලේඛනයකි.</p>
  </div>

  <div class="lang-column" lang="ta">
    <h1>அறிமுகம்</h1>
    <p>இது ஒரு எளிய ஆவணம்.</p>
  </div>

  <div class="lang-column" lang="en">
    <h1>Introduction</h1>
    <p>This is a simple document.</p>
  </div>
</div>

Word/DOCX Export

Generate Microsoft Word documents:

Option 1: Single multilingual document

Three columns, one per language
Synchronized scrolling
Style mappings for headings, emphasis, etc.

Option 2: Three separate documents

One .docx per language
Maintain cross-references via bookmarks

Implementation approaches:

Use pandoc with custom templates
Generate via docx library in Python/JavaScript
Use Office Open XML directly

InDesign Export

ICML (InCopy Markup Language):

<?xml version="1.0" encoding="UTF-8"?>
<Document>
  <Story StoryTitle="si-content">
    <ParagraphStyleRange AppliedParagraphStyle="Heading1">
      <Content>හැඳින්වීම</Content>
    </ParagraphStyleRange>
  </Story>
</Document>

IDML (InDesign Markup Language):

Full InDesign document structure
Includes styles, text frames, pages
More complex but provides complete layout control

Best practice:

Export each language variant as separate story
Include style mappings in export
Provide templates for designers

Parser Implementation Tips

1. Separator Detection

Robust separator detection:

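import re

def detect_separator(block_text: str) -> str:
    """
    Detect which separator is used in a block.
    Returns 'block', 'inline', or 'none'.
    """
    # Check for block separator (with surrounding newlines)
    if '\n෴\n' in block_text:
        return 'block'

    # Check for inline separator (not in code blocks)
    if '~' in block_text and not is_in_code(block_text, '~'):
        return 'inline'

    return 'none'

def is_in_code(text: str, char: str) -> bool:
    """
    Return True if char never appears outside code
    (this includes the case where it does not appear at all).
    """
    # Remove code blocks, check if char still exists
    text_no_code = remove_code_blocks(text)
    return char not in text_no_code

def remove_code_blocks(text: str) -> str:
    """
    Strip fenced and inline code. A minimal helper; a full parser
    would reuse its Markdown tokenizer here.
    """
    text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)
    return re.sub(r'`[^`]+`', '', text)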

Never mix separators:

# Tildes inside code (e.g. `~/bin`) are not separators, so only count the rest
if '\n෴\n' in block and not is_in_code(block, '~'):
    raise SeparatorMixError(
        "Cannot use both block (෴) and inline (~) separators in same block"
    )

2. Escape Sequence Handling

Process escapes before splitting:

def split_variants(text: str, separator: str) -> list[str]:
    """
    Split text on separator, respecting escape sequences.
    """
    # Replace escapes with placeholders. The escaped backslash goes first,
    # so that in `\\~` the tilde is still seen as a real separator.
    text = text.replace('\\\\', '\x00BACKSLASH\x00')
    text = text.replace('\\~', '\x00TILDE\x00')
    text = text.replace('\\෴', '\x00KUNDALIYA\x00')

    # Split on separator
    variants = text.split(separator)

    # Restore escaped characters
    variants = [
        v.replace('\x00TILDE\x00', '~')
         .replace('\x00KUNDALIYA\x00', '෴')
         .replace('\x00BACKSLASH\x00', '\\')
        for v in variants
    ]

    return variants
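
An alternative to placeholders is a single left-to-right scan, which sidesteps any question of replacement order. The sketch below is one possible approach, not the reference implementation: `split_escaped` is a hypothetical name, and it assumes a single-character separator and that a backslash escapes only `~`, `෴`, and `\` (other backslashes are left for the Markdown parser).

```python
def split_escaped(text: str, separator: str = '~') -> list[str]:
    """Split on unescaped separators in one pass over the text."""
    escapable = {'~', '෴', '\\'}
    variants: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '\\' and i + 1 < len(text) and text[i + 1] in escapable:
            # Escaped character: keep it literally and skip the backslash
            current.append(text[i + 1])
            i += 2
        elif ch == separator:
            # Unescaped separator: close the current variant
            variants.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    variants.append(''.join(current))
    return variants
```

Because each character is consumed exactly once, an escaped backslash followed by a separator splits as expected.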

3. Code Block Protection

Protect code blocks from parsing:

import re

def protect_code_blocks(text: str) -> tuple[str, dict[str, str]]:
    """
    Replace code blocks with placeholders, return mapping.
    """
    code_blocks: dict[str, str] = {}

    def stash(match: re.Match) -> str:
        # Number placeholders by insertion order so each one is unique
        placeholder = f'\x00CODE_{len(code_blocks)}\x00'
        code_blocks[placeholder] = match.group(0)
        return placeholder

    # Fenced code blocks first, so their backticks are not seen as inline code
    text = re.sub(r'```.*?```', stash, text, flags=re.DOTALL)

    # Inline code
    text = re.sub(r'`[^`]+`', stash, text)

    return text, code_blocks

def restore_code_blocks(text: str, code_blocks: dict) -> str:
    """
    Restore code blocks from placeholders.
    """
    for placeholder, code in code_blocks.items():
        text = text.replace(placeholder, code)
    return text

4. Entity Resolution

Build entity map from frontmatter:

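import re

class UndefinedEntityError(Exception):
    """Raised when a reference names an entity missing from frontmatter."""

class EntityResolver:
    def __init__(self, frontmatter: dict):
        # Tolerate documents without frontmatter or without an entities key
        self.entities = (frontmatter or {}).get('entities', {})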

    def resolve(self, entity_id: str, lang: str) -> str:
        """
        Resolve entity reference to display text.
        """
        if entity_id not in self.entities:
            raise UndefinedEntityError(f"Entity '{entity_id}' not defined")

        entity = self.entities[entity_id]

        # Return language-specific variant or primary
        return entity.get(lang, entity.get('primary', entity_id))

    def replace_references(self, text: str, lang: str) -> str:
        """
        Replace all [[entity-id|display]] references in text.
        """
        def replace(match):
            entity_id = match.group(1)
            display = match.group(2) if match.group(2) else None

            # If display specified, use it; otherwise resolve
            if display:
                return display
            else:
                return self.resolve(entity_id, lang)

        # Pattern: [[entity-id|display]] or [[entity-id]]
        pattern = r'\[\[([^|\]]+)(?:\|([^\]]+))?\]\]'
        return re.sub(pattern, replace, text)

5. HTML and Media Block Handling

HTML blocks for embeds and media

HTML content in fenced code blocks (videos, audio, iframes) must be:

Detected as fenced code blocks during parsing
Protected from separator processing (same as code protection)
Rendered as HTML, NOT wrapped in <pre><code> tags

Implementation:

import re

def detect_html_blocks(text: str) -> tuple[str, dict]:
    """
    Detect and extract HTML blocks from fenced code blocks.
    Returns text with placeholders and mapping of placeholders to HTML.
    """
    html_blocks = {}
    counter = 0

    # Pattern for fenced code blocks (with optional 'html' language hint)
    # Matches: ```\n<html>...</html>\n``` or ```html\n<html>...</html>\n```
    pattern = r'```(?:html)?\n((?:<[^>]+>[\s\S]*?</[^>]+>)|(?:<[^>]+\s*/>))\n```'

    def replace_html(match):
        nonlocal counter
        html_content = match.group(1).strip()

        # Check if content is HTML (starts with < and contains >)
        if is_html(html_content):
            placeholder = f'\x00HTML_{counter}\x00'
            html_blocks[placeholder] = html_content
            counter += 1
            return placeholder
        else:
            # Not HTML, keep as regular code block
            return match.group(0)

    text = re.sub(pattern, replace_html, text, flags=re.MULTILINE)
    return text, html_blocks

def is_html(content: str) -> bool:
    """
    Check if content appears to be HTML.
    Simple heuristic: starts with < and contains HTML-like tags.
    """
    content = content.strip()
    if not content.startswith('<'):
        return False

    # Check for common HTML patterns
    html_patterns = [
        r'<iframe[\s>]',
        r'<video[\s>]',
        r'<audio[\s>]',
        r'<svg[\s>]',
        r'<embed[\s>]',
        r'<img[\s>]',
        r'<div[\s>]',
        r'<span[\s>]',
    ]

    return any(re.search(pattern, content, re.IGNORECASE) for pattern in html_patterns)

def restore_html_blocks(text: str, html_blocks: dict) -> str:
    """
    Restore HTML blocks from placeholders without <pre><code> wrapping.
    """
    for placeholder, html in html_blocks.items():
        text = text.replace(placeholder, html)
    return text

Usage in parser pipeline:

def parse_3md_document(text: str) -> dict:
    """
    Complete parsing pipeline with HTML block handling.
    """
    # 1. Extract frontmatter
    frontmatter, text = parse_frontmatter(text)

    # 2. Parse language declaration
    langs, text = parse_language_declaration(text)

    # 3. Detect and extract HTML blocks first. This must run before code
    #    protection, which would otherwise replace the fences it matches on.
    text, html_blocks = detect_html_blocks(text)

    # 4. Protect the remaining code blocks
    text, code_blocks = protect_code_blocks(text)

    # 5. Split on blank lines into blocks
    blocks = text.split('\n\n')

    # 6. Parse each block (detect separators, create Multi/Mono blocks).
    #    Placeholders are still in place, so separators inside code are ignored.
    parsed_blocks = []
    for block in blocks:
        parsed_block = parse_block(block, langs)
        parsed_blocks.append(parsed_block)

    # 7. Return the placeholder maps with the blocks. The output stage calls
    #    restore_code_blocks / restore_html_blocks on each variant it renders
    #    (code as code, HTML as raw HTML).
    return {
        'frontmatter': frontmatter,
        'languages': langs,
        'blocks': parsed_blocks,
        'code_blocks': code_blocks,
        'html_blocks': html_blocks
    }

Rendering HTML blocks:

When generating output (HTML, Markdown), HTML blocks should be:

HTML output: Rendered directly as HTML (no escaping, no <pre><code> wrapper)
Markdown output: Kept in fenced code blocks
JSON AST: Marked with type: 'html_embed' for processors

Example output:

# Input 3md
text = '''
{{langs|si|ta|en}}

මෙය බලන්න:~இதைப் பார்க்கவும்:~Watch this:
'''

# HTML output for English variant
html_output = '''
<p>Watch this:</p>
'''

# Note: No <pre><code> wrapping, direct HTML rendering

**Image handling (no fencing required):**

Images use standard Markdown syntax and do NOT require fencing:

```python
import re
from typing import Optional

class VariantCountMismatch(ValueError):
    """Raised when alt-text variants do not match the declared languages."""

def parse_image(text: str, langs: list[str]) -> Optional[dict]:
    """
    Parse image with multilingual alt text.
    Syntax: ![alt1~alt2~alt3](image.png)
    Returns None when the text contains no image.
    """
    pattern = r'!\[([^\]]+)\]\(([^\)]+)\)'

    match = re.search(pattern, text)
    if not match:
        return None

    alt_text = match.group(1)
    image_path = match.group(2)

    # Split alt text by separator if present
    if '~' in alt_text:
        alt_variants = alt_text.split('~')
        if len(alt_variants) != len(langs):
            raise VariantCountMismatch(
                f"Expected {len(langs)} alt variants, got {len(alt_variants)}"
            )
    else:
        # Single alt text for all languages
        alt_variants = [alt_text] * len(langs)

    return {
        'type': 'image',
        'alt_variants': dict(zip(langs, alt_variants)),
        'src': image_path,
        'is_mono': True  # Image source is always language-invariant
    }
```

6. Frontmatter Parsing

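Use existing YAML parsers:

import re
import yaml  # PyYAML

class FrontmatterError(ValueError):
    """Raised when the frontmatter block is unclosed or not valid YAML."""

def parse_frontmatter(text: str) -> tuple[dict, str]:
    """
    Extract YAML frontmatter from document.
    Returns (frontmatter_dict, remaining_text).
    """
    if not text.startswith('---\n'):
        return {}, text

    # Find closing ---; searching from the opening line's newline (index 3)
    # also matches an empty frontmatter block
    end_match = re.search(r'\n---(?:\n|$)', text[3:])
    if not end_match:
        raise FrontmatterError("Unclosed frontmatter block")

    yaml_end = end_match.start() + 3   # where the YAML body stops
    end_pos = end_match.end() + 3      # first character after the closing ---

    # Extract and parse YAML
    yaml_text = text[4:yaml_end]
    try:
        # safe_load returns None for an empty block
        frontmatter = yaml.safe_load(yaml_text) or {}
    except yaml.YAMLError as e:
        raise FrontmatterError(f"Invalid YAML: {e}") from e

    remaining = text[end_pos:]
    return frontmatter, remaining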

Testing Requirements

Test Categories

1. Valid Documents

Minimal valid document
All block types (paragraphs, headings, lists, tables, blockquotes)
Both separator types (inline and block)
With and without frontmatter
All language orders (si-ta-en, en-si-ta, ta-en-si)

2. Invalid Documents

Missing language declaration
Invalid language codes
Mismatched variant counts
Mixed separators in same block
Malformed frontmatter
Unclosed entity references

3. Edge Cases

Empty blocks
Escaped separators
Separators in code blocks
Very long lines
Nested lists
Complex tables
Unicode edge cases (ZWJ, ZWNJ in Sinhala/Tamil)
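
For the Unicode edge cases, one useful test is that zero-width joiners survive the pipeline, since stripping them changes how Sinhala conjuncts render. A minimal sketch of such a check; `joiners_preserved` is a hypothetical helper, and the sample string is the Sinhala syllable "sri", spelled with escapes so the invisible joiner is explicit.

```python
ZWJ = '\u200d'   # ZERO WIDTH JOINER
ZWNJ = '\u200c'  # ZERO WIDTH NON-JOINER

# Sinhala "sri": sha, al-lakuna, ZWJ, ra, vowel sign ii
SRI = '\u0dc1\u0dca\u200d\u0dbb\u0dd3'


def joiners_preserved(source: str, output: str) -> bool:
    """True if output has as many ZWJ and ZWNJ characters as source."""
    return all(source.count(ch) == output.count(ch) for ch in (ZWJ, ZWNJ))
```

A test can run a document containing such text through parse and generate, then assert `joiners_preserved(source, output)`.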

4. Round-Trip Tests

3md → parse → AST → generate → 3md

Ensure generated 3md is semantically equivalent to input.
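
One way to check semantic equivalence is to compare ASTs rather than text, so that harmless differences such as extra blank lines do not fail the test. The sketch below assumes nothing about the real parser API: `assert_round_trip` is a hypothetical helper that takes the parse and generate functions as parameters.

```python
from typing import Callable


def assert_round_trip(source: str,
                      parse: Callable[[str], dict],
                      generate: Callable[[dict], str]) -> None:
    """Fail if parsing the regenerated 3md yields a different AST."""
    first = parse(source)
    second = parse(generate(first))
    assert first == second, 'AST changed after a 3md round trip'
```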

Reference Test Suite

A comprehensive test suite is maintained at:

tests/
  ├── valid/
  │   ├── minimal.3md
  │   ├── complete.3md
  │   └── all-features.3md
  ├── invalid/
  │   ├── missing-lang-header.3md
  │   ├── mismatched-variants.3md
  │   └── mixed-separators.3md
  ├── edge-cases/
  │   ├── escaped-separators.3md
  │   ├── code-with-separators.3md
  │   └── complex-tables.3md
  └── expected-output/
      ├── minimal.json
      ├── complete.html
      └── all-features-si.md


Performance Considerations

Streaming Parsing

For large documents (>10MB), implement streaming:

class StreamingParser:
    def parse_blocks(self, file_path: str):
        """
        Yield blocks one at a time without loading entire file.
        """
        with open(file_path, 'r', encoding='utf-8') as f:
            frontmatter, langs = self.parse_header(f)

            current_block = []
            for line in f:
                if line.strip() == '':
                    if current_block:
                        yield self.parse_block(current_block, langs)
                        current_block = []
                else:
                    current_block.append(line)

            # Don't forget last block
            if current_block:
                yield self.parse_block(current_block, langs)
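
One caveat with blank-line splitting in a streaming parser is that a fenced code block may itself contain blank lines. The sketch below is a hypothetical variant of the loop above that tracks whether the current line is inside a triple-backtick fence and only ends a block on a blank line outside one. `iter_blocks` is an assumed name, and it yields raw block text rather than parsed blocks.

```python
from typing import Iterable, Iterator

FENCE = '`' * 3  # triple backtick


def iter_blocks(lines: Iterable[str]) -> Iterator[str]:
    """Yield blank-line-separated blocks, keeping fenced code intact."""
    current: list[str] = []
    in_fence = False
    for line in lines:
        if line.lstrip().startswith(FENCE):
            # A fence line toggles between inside and outside code
            in_fence = not in_fence
        if line.strip() == '' and not in_fence:
            if current:
                yield ''.join(current).rstrip('\n')
                current = []
        else:
            current.append(line)
    # Last block when the input does not end with a blank line
    if current:
        yield ''.join(current).rstrip('\n')
```

Since it accepts any iterable of lines, an open file object can be passed in directly.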

Caching

Cache parsed ASTs for frequently accessed documents:

import os
from functools import lru_cache

@lru_cache(maxsize=100)
def parse_cached(file_path: str, mtime: float) -> Document:
    """
    Cache parsed documents, keyed by path and modification time.
    """
    return parse_file(file_path)

# Usage
mtime = os.path.getmtime(file_path)
doc = parse_cached(file_path, mtime)

Reference Implementations

Official Implementations


Python: 3md-py - Reference implementation
JavaScript: 3md-js - For web-based tools
Rust: 3md-rs - High-performance parser

Integration Libraries


Pandoc filter: Convert 3md via Pandoc
VS Code extension: Syntax highlighting and preview
InDesign plugin: Import 3md directly


Contributing

See CONTRIBUTING.md for guidelines on contributing to 3md parser implementations.

Document Version: 0.1.0
Last Updated: 2025-12-29
