
feat: add async job mode to Docling Serve converter#3371

Open
2830500285 wants to merge 3 commits into
deepset-ai:mainfrom
2830500285:codex/docling-serve-async-jobs-3345

Conversation

@2830500285

Summary

  • Add an async job mode for DoclingServeConverter that submits conversions to Docling Serve's async endpoints and polls until completion.
  • Support async jobs for both URL sources and uploaded file/ByteStream sources.
  • Preserve the existing synchronous mode as the default and cover serialization, failure handling, and sync/async execution paths with tests.

Closes #3345

Validation

  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uvx hatch run fmt-check src/haystack_integrations/components/converters/docling_serve/converter.py src/haystack_integrations/components/converters/docling_serve/__init__.py tests/test_converter.py
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uvx hatch run test:unit tests/test_converter.py
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uvx hatch run test:types

@2830500285 2830500285 requested a review from a team as a code owner June 1, 2026 05:09
@2830500285 2830500285 requested review from davidsbatista and removed request for a team June 1, 2026 05:09
@github-actions github-actions Bot added integration:docling-serve type:documentation Improvements or additions to documentation labels Jun 1, 2026

github-actions Bot commented Jun 1, 2026

Coverage report (docling_serve)

File: integrations/docling_serve/src/haystack_integrations/components/converters/docling_serve/converter.py
Lines missing: 212, 219-226, 231-232, 321-322, 332-334, 357-358, 371-372, 375-390, 428-429, 439-441, 454-456, 565, 588-589

This report was generated by python-coverage-comment-action

source=source,
error=e,
)
except DoclingServeConversionError as e:

The DoclingServeConversionError raised on timeout is caught by the same except block in run() / run_async() that also handles ordinary conversion failures.

If a job exceeds job_timeout (default 600s), the caller silently gets an empty documents list after waiting 10 minutes — indistinguishable from "file had no content". A timeout feels like a different signal from a conversion failure.

One option would be a dedicated subclass:

class DoclingServeTimeoutError(DoclingServeConversionError):
    """Raised when a job exceeds job_timeout."""

That way callers can catch it separately, and you could re-raise it in run() or at least use a more specific log message.

The same applies to run_async().
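
To illustrate, here is a minimal sketch of how a caller-side loop could tell the two cases apart. The exception classes are redefined locally so the snippet runs on its own, and `convert_all` / `convert_one` are hypothetical stand-ins for the per-source logic in run(), not the PR's actual code:

```python
import logging

logger = logging.getLogger(__name__)


class DoclingServeConversionError(Exception):
    """Local stand-in for the integration's conversion error."""


class DoclingServeTimeoutError(DoclingServeConversionError):
    """Raised when a job exceeds job_timeout."""


def convert_all(sources, convert_one):
    """Hypothetical run()-style loop: convert each source and skip failures."""
    documents = []
    for source in sources:
        try:
            documents.append(convert_one(source))
        except DoclingServeTimeoutError as e:
            # The subclass must be caught before its parent,
            # otherwise this branch would never be reached.
            logger.warning("Timed out waiting for conversion of %s: %s", source, e)
        except DoclingServeConversionError as e:
            logger.warning("Conversion failed for %s: %s", source, e)
    return documents
```

Because the timeout error subclasses the conversion error, existing `except DoclingServeConversionError` handlers keep working unchanged, while new code can opt in to the more specific branch or re-raise it.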

Conversion mode. `sync` uses DoclingServe's synchronous endpoints. `async` submits
conversion jobs to DoclingServe's async endpoints and polls until completion.
:param poll_interval:
Maximum server-side long-poll wait in seconds when `mode="async"`.

Suggested change
- Maximum server-side long-poll wait in seconds when `mode="async"`.
+ Controls both the server-side long-poll wait (?wait= parameter) and the maximum local sleep between polls. A higher value reduces round-trips; a lower value increases polling frequency.
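
For context, a rough sketch of the polling behaviour this wording describes. `fetch_status` is a hypothetical callable standing in for the HTTP status request, the terminal status names are assumptions, and the `clock` / `sleep` parameters exist only so the loop can be exercised without real waiting; the PR's implementation may be structured differently:

```python
import time


def poll_until_done(fetch_status, poll_interval=2.0, job_timeout=600.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status(wait) until a terminal status or until job_timeout elapses.

    `wait` is the value that would be sent as the server-side long-poll
    parameter; it never exceeds poll_interval or the time left before the deadline.
    """
    deadline = clock() + job_timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"Timed out after {job_timeout} seconds")
        wait = min(poll_interval, remaining)
        started = clock()
        status = fetch_status(wait)
        if status in ("success", "failure"):  # assumed terminal states
            return status
        # If the server answered before `wait` elapsed, sleep off the rest
        # locally so the loop never polls more often than poll_interval.
        leftover = min(wait - (clock() - started), deadline - clock())
        if leftover > 0:
            sleep(leftover)
```

In this sketch a single value bounds both the server-side wait and the local sleep, which is why a larger poll_interval means fewer round-trips.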

:param job_timeout:
Maximum time in seconds to wait for each async conversion job.
"""
if poll_interval <= 0:

Those are only used in async mode, so we can do:

if self.mode == ConversionMode.ASYNC:
    if poll_interval <= 0:
        msg = "poll_interval must be greater than 0."
        raise ValueError(msg)
    if job_timeout <= 0:
        msg = "job_timeout must be greater than 0."
        raise ValueError(msg)

@@ -184,6 +264,24 @@ def _post_file(self, client: httpx.Client, source: str | Path | ByteStream) -> d
response.raise_for_status()
return response.json()

This is a pre-existing issue, not introduced by this PR, but worth fixing.

The sync endpoints (_post_file, _post_url) return the raw response JSON without checking the status field in the body. This means if DoclingServe returns a 200 OK with {"status": "failure", "errors": [...]}, the error details are silently discarded — the code falls through to _extract_content, gets None, and logs "No content returned for source" with no indication of why.

The async job path handles this correctly via _raise_for_failed_conversion in _fetch_job_result.

Let's apply the same check to _post_file and _post_url to make the behaviour consistent.
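
A minimal sketch of what such a check could look like, assuming the response body carries `status` and `errors` keys as in the example above. The helper name and signature here are illustrative and may not match the existing _raise_for_failed_conversion, and the exception class is redefined locally so the snippet is self-contained:

```python
class DoclingServeConversionError(Exception):
    """Local stand-in for the integration's conversion error."""


def raise_for_failed_conversion(payload: dict, source: str) -> dict:
    """Hypothetical helper: reject a 200 OK body whose status reports failure."""
    if payload.get("status") == "failure":
        errors = payload.get("errors") or []
        # Surface the server-reported errors instead of discarding them.
        raise DoclingServeConversionError(f"Conversion of {source} failed: {errors}")
    return payload
```

Under that assumption, _post_file and _post_url could pass `response.json()` through a check like this before returning, so the error details reach the log message.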

assert len(result["documents"]) == 2

@pytest.mark.asyncio
async def test_run_async_async_mode_uses_job_endpoints(self):

There's a test for run_async() with ASYNC mode and a URL source, but no equivalent for a file source.

Could you add a test that verifies run_async() with mode=ASYNC and a file path hits /v1/convert/file/async and returns the converted document?

assert data["to_formats"] == "md"
assert data["target_type"] == "inbody"

def test_run_async_mode_skips_failed_job(self, caplog):

Could you add a test for the timeout path?

When job_timeout is exceeded, DoclingServeConversionError is raised and caught — the source should be skipped with a warning containing "Timed out".


Could you also add a test for the "skipped" status?

A job can complete with task_status: "success" while the conversion result itself still has status: "skipped".
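
To make the two levels explicit, here is a small sketch of the distinction such a test would exercise. `classify_outcome` and its return labels are hypothetical, and the field names follow the ones mentioned in this review rather than a confirmed schema:

```python
def classify_outcome(task_status: str, result: dict) -> str:
    """Hypothetical helper separating job-level status from result-level status."""
    if task_status != "success":
        # The job itself did not finish successfully.
        return "job_failed"
    status = result.get("status")
    if status == "skipped":
        # The job finished, but the document was not converted.
        return "skipped"
    if status == "failure":
        return "conversion_failed"
    return "converted"
```

A test for this path would feed a successful job with a "skipped" result and assert that no document is produced for that source.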



Development

Successfully merging this pull request may close these issues.

docling_serve: support the async-job endpoint for long-running conversions

2 participants