doctemplate: CRLF templates produce extra blank lines around multiline $if$ / $for$ #157

@cderv

Description

On Windows with core.autocrlf=true, doctemplate templates with CRLF line endings render with extra blank lines around every multiline $if(...)$ and $for(...)$ directive. The symptom currently shows up only as 8 failing tests in quarto-doctemplate, but the underlying engine bug means a Windows author of any Quarto template gets visibly wrong output in every format the doctemplate engine drives. We don't have Windows CI yet, so this hasn't surfaced for the rest of the team in cargo nextest run.

I reproduced this on Windows at ebd04493.

Reproducer

let crlf_source = "before\r\n$if(show)$\r\ncontent\r\n$endif$\r\nafter\r\n";
let template = Template::compile(crlf_source).unwrap();
let mut ctx = TemplateContext::new();
ctx.insert("show", TemplateValue::Bool(true));
let result = template.render(&ctx).unwrap();
// expected: "before\ncontent\nafter\n"
// actual:   "before\r\n\r\ncontent\r\n\r\nafter\r\n"

Same shape on $for$ / $endfor$, nested $if$ / $for$, and $else$.

Root cause

normalize_multiline_directives (run after tree-sitter parsing) detects "directive on its own line" by checking the first character of the body Literal. The detection helpers hardcode '\n':

/// Check if the first node in a list is a Literal starting with '\n'.
fn first_node_is_newline_literal(nodes: &[TemplateNode]) -> bool {
    if let Some(TemplateNode::Literal(lit)) = nodes.first() {
        lit.text.starts_with('\n')
    } else {
        false
    }
}
/// Strip a leading '\n' from the first Literal node if present.
fn strip_leading_newline_from_nodes(nodes: &mut Vec<TemplateNode>) {
    if let Some(first) = nodes.first_mut() {
        strip_leading_newline_from_node(first);
        // If the node became empty, remove it
        if let TemplateNode::Literal(lit) = first
            && lit.text.is_empty()
        {
            nodes.remove(0);
        }
    }
}
/// Strip a leading '\n' from a node if it's a Literal starting with '\n'.
fn strip_leading_newline_from_node(node: &mut TemplateNode) {
    if let TemplateNode::Literal(lit) = node
        && lit.text.starts_with('\n')
    {
        lit.text = lit.text[1..].to_string();
    }
}

For CRLF input the body Literal starts with \r\n, so starts_with('\n') returns false, is_multiline stays false, and the branch that consumes the leading and trailing newlines around the directive never runs:

fn normalize_multiline_directives(nodes: &mut Vec<TemplateNode>) {
    // Process each node, with access to the next sibling for lookahead
    let mut i = 0;
    while i < nodes.len() {
        match &mut nodes[i] {
            TemplateNode::Conditional(cond) => {
                // Check if this is a multiline conditional
                let is_multiline = is_first_child_newline_literal(&cond.branches);
                if is_multiline {
                    // Strip leading newline from body of each branch
                    for (_condition, body) in &mut cond.branches {
                        strip_leading_newline_from_nodes(body);
                        // Recursively normalize nested directives
                        normalize_multiline_directives(body);
                    }
                    // Strip leading newline from else branch if present
                    if let Some(else_body) = &mut cond.else_branch {
                        strip_leading_newline_from_nodes(else_body);
                        normalize_multiline_directives(else_body);
                    }
                    // Strip leading newline from next sibling if it's a Literal
                    if i + 1 < nodes.len() {
                        strip_leading_newline_from_node(&mut nodes[i + 1]);
                    }
                } else {
                    // Still need to recursively normalize nested directives
                    for (_condition, body) in &mut cond.branches {
                        normalize_multiline_directives(body);
                    }
                    if let Some(else_body) = &mut cond.else_branch {
                        normalize_multiline_directives(else_body);
                    }
                }
            }
            TemplateNode::ForLoop(for_loop) => {
                // Check if this is a multiline for loop
                let is_multiline = first_node_is_newline_literal(&for_loop.body);
                if is_multiline {
                    // Strip leading newline from body
                    strip_leading_newline_from_nodes(&mut for_loop.body);
                    normalize_multiline_directives(&mut for_loop.body);
                    // Strip leading newline from separator if present
                    if let Some(sep) = &mut for_loop.separator {
                        strip_leading_newline_from_nodes(sep);
                        normalize_multiline_directives(sep);
                    }
                    // Strip leading newline from next sibling if it's a Literal
                    if i + 1 < nodes.len() {
                        strip_leading_newline_from_node(&mut nodes[i + 1]);
                    }
                } else {
                    // Still need to recursively normalize nested directives
                    normalize_multiline_directives(&mut for_loop.body);
                    if let Some(sep) = &mut for_loop.separator {
                        normalize_multiline_directives(sep);
                    }
                }
            }
            TemplateNode::Nesting(nesting) => {
                normalize_multiline_directives(&mut nesting.children);
            }
            TemplateNode::BreakableSpace(bs) => {
                normalize_multiline_directives(&mut bs.children);
            }
            // Other node types don't need processing
            TemplateNode::Literal(_)
            | TemplateNode::Variable(_)
            | TemplateNode::Partial(_)
            | TemplateNode::Comment(_) => {}
        }
        i += 1;
    }
}

Pandoc's doctemplates, by contrast, handles line endings in the parser itself. Its endline parser accepts all three conventions and returns whatever was matched, so multiline directive consumption is line-ending-agnostic and the output preserves the input convention:

pLineEnding = P.string "\n" <|> P.try (P.string "\r\n") <|> P.string "\r"
isSpacy '\r' = True
pLit = P.many1 (P.satisfy (\c -> c /= '$' && c /= '\n' && c /= '\r'))

https://github.com/jgm/doctemplates/blob/master/src/Text/DocTemplates/Parser.hs#L262-L263

Pandoc's --eol=crlf|lf|native is a separate writer option layered on top.

Constraints I see for q2 on Windows

CRLF input must render correctly. Output should preserve the input line-ending convention — silently rewriting bytes mid-pipeline diverges from Pandoc and surprises Windows users whose rest-of-file convention is CRLF. Source spans (node.start_byte(), start_position) need to keep mapping back to the on-disk file or diagnostics drift.

A one-line ingress normalize (CRLF→LF before parsing) is out: it loses the input convention and shifts every byte offset by one per preceding CRLF.
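To make the offset drift concrete, here is a minimal standard-library sketch over the reproducer's template string. It does not touch the quarto-doctemplate API; `offsets` is a hypothetical helper written only for this illustration.

```rust
/// Byte offset of `needle` in the on-disk CRLF text and in its LF-normalized copy.
/// (Hypothetical helper for illustration only.)
fn offsets(crlf: &str, needle: &str) -> Option<(usize, usize)> {
    // The one-line ingress normalize under discussion.
    let lf = crlf.replace("\r\n", "\n");
    Some((crlf.find(needle)?, lf.find(needle)?))
}

fn main() {
    let crlf = "before\r\n$if(show)$\r\ncontent\r\n$endif$\r\nafter\r\n";
    for needle in ["$if(show)$", "content", "$endif$", "after"] {
        let (on_disk, normalized) = offsets(crlf, needle).unwrap();
        // Each CRLF before the match shortens the normalized offset by one byte.
        let preceding_crlf = crlf[..on_disk].matches("\r\n").count();
        assert_eq!(on_disk - normalized, preceding_crlf);
        println!("{needle}: on-disk byte {on_disk}, normalized byte {normalized}");
    }
}
```

A diagnostic computed against the normalized buffer would therefore point further left of the real location the deeper it sits in the file.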

Approaches

We could teach the Rust normalization helpers to recognize \r\n and \r in addition to \n, and audit tree-sitter-doctemplate to see whether the grammar needs the same alternation or whether the Rust pass alone is enough. Bytes are preserved end-to-end. This is the same shape of work that #139 did for tree-sitter-qmd pipe tables, with a smaller scope.
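A rough sketch of what line-ending-aware helpers could look like. The `Literal` and `TemplateNode` types below are reduced stand-ins so the example compiles on its own, and `leading_line_ending_len` is a hypothetical name, not an existing function in quarto-doctemplate. It also assumes the grammar hands the Rust pass a Literal that starts with the line ending, which is exactly what the tree-sitter-doctemplate audit would need to confirm.

```rust
/// Stand-ins for the real AST types, reduced to what the helpers touch.
#[derive(Debug, PartialEq)]
struct Literal {
    text: String,
}

#[derive(Debug, PartialEq)]
enum TemplateNode {
    Literal(Literal),
    Variable(String),
}

/// Byte length of the line ending at the start of `text`: 2 for "\r\n",
/// 1 for a lone "\n" or "\r", 0 if `text` does not start with one.
/// (Hypothetical helper.)
fn leading_line_ending_len(text: &str) -> usize {
    if text.starts_with("\r\n") {
        2
    } else if text.starts_with('\n') || text.starts_with('\r') {
        1
    } else {
        0
    }
}

/// Check if the first node is a Literal starting with any line ending.
fn first_node_is_newline_literal(nodes: &[TemplateNode]) -> bool {
    matches!(
        nodes.first(),
        Some(TemplateNode::Literal(lit)) if leading_line_ending_len(&lit.text) > 0
    )
}

/// Strip one leading line ending from a Literal node, whatever its convention.
fn strip_leading_newline_from_node(node: &mut TemplateNode) {
    if let TemplateNode::Literal(lit) = node {
        let n = leading_line_ending_len(&lit.text);
        // '\r' and '\n' are single bytes, so `n` is always a char boundary.
        lit.text.replace_range(..n, "");
    }
}

/// Strip from the first node and drop it if it became empty.
fn strip_leading_newline_from_nodes(nodes: &mut Vec<TemplateNode>) {
    if let Some(first) = nodes.first_mut() {
        strip_leading_newline_from_node(first);
    }
    if matches!(nodes.first(), Some(TemplateNode::Literal(lit)) if lit.text.is_empty()) {
        nodes.remove(0);
    }
}

fn main() {
    for eol in ["\n", "\r\n", "\r"] {
        let mut body = vec![
            TemplateNode::Literal(Literal { text: format!("{eol}content{eol}") }),
            TemplateNode::Variable("x".to_string()),
        ];
        assert!(first_node_is_newline_literal(&body));
        strip_leading_newline_from_nodes(&mut body);
        // The trailing line ending keeps the input convention untouched.
        assert_eq!(
            body[0],
            TemplateNode::Literal(Literal { text: format!("content{eol}") })
        );
    }
}
```

Building the CRLF and CR inputs in-process, as `main` does here, is the same trick that would let a regression test run on Linux CI.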

We could also normalize CRLF→LF for the parser internally with a side-table mapping normalized→original byte positions for diagnostics, then re-emit the input convention on render. This is more machinery, and it is easy to forget the side-table when adding new diagnostics.
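For comparison, a minimal sketch of the side-table idea, covering only the offset mapping (re-emitting the input convention on render is not shown). `Normalized`, `normalize_crlf`, and `to_original` are hypothetical names chosen for this illustration. The sketch assumes only "\r\n" pairs are rewritten and lone "\r" is left alone.

```rust
/// LF-normalized text plus a side-table for mapping offsets back to the original.
/// (Hypothetical type for illustration.)
struct Normalized {
    text: String,
    /// Offsets in `text` at which a '\r' was dropped, in ascending order.
    removed_at: Vec<usize>,
}

fn normalize_crlf(original: &str) -> Normalized {
    let mut text = String::with_capacity(original.len());
    let mut removed_at = Vec::new();
    let mut chars = original.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            // Drop the '\r'; the '\n' that follows will land at `text.len()`.
            removed_at.push(text.len());
        } else {
            text.push(c);
        }
    }
    Normalized { text, removed_at }
}

impl Normalized {
    /// Map a byte offset in `text` back to the original buffer.
    /// A normalized '\n' maps to the start of its original "\r\n" pair.
    fn to_original(&self, offset: usize) -> usize {
        // Count the '\r' bytes dropped strictly before `offset`.
        offset + self.removed_at.partition_point(|&p| p < offset)
    }
}

fn main() {
    let original = "before\r\n$if(show)$\r\ncontent\r\n$endif$\r\nafter\r\n";
    let norm = normalize_crlf(original);
    assert_eq!(norm.text, "before\n$if(show)$\ncontent\n$endif$\nafter\n");

    // A span found in the normalized text maps back to the same bytes on disk.
    let start = norm.text.find("content").unwrap();
    let end = start + "content".len();
    let (orig_start, orig_end) = (norm.to_original(start), norm.to_original(end));
    assert_eq!(&original[orig_start..orig_end], "content");
    println!("normalized {start}..{end} -> original {orig_start}..{orig_end}");
}
```

Every diagnostic site would have to route its offsets through something like `to_original`, which is the "easy to forget" cost mentioned above.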

Open question

Is "preserve input line-ending convention end-to-end" the policy we want for q2 on Windows? Or would we rather always normalize to LF on output, or expose a writer-side option like Pandoc's --eol?

This is broader than quarto-doctemplate — pampa output, the JSON / native writers, and any future tree-sitter grammar will face the same question. Picking a policy here sets the precedent.

If the answer is "preserve input convention", I'll scope the tree-sitter-doctemplate audit, add a CRLF regression test that builds the input in-process so Linux CI catches future regressions (same pattern as pipe_table_crlf_matches_lf from #139), and update the 8 affected quarto-doctemplate tests. Internal tracker is bd-1d3e.
