Docling OCR for Sources and Attachments

DocsGPT uses Docling as the default parser layer for many document formats. OCR is optional and controlled by two settings:


DOCLING_OCR_ENABLED=false
DOCLING_OCR_ATTACHMENTS_ENABLED=false

DOCLING_OCR_ENABLED: OCR behavior for Source Docs ingestion.
DOCLING_OCR_ATTACHMENTS_ENABLED: OCR behavior for chat attachments uploaded from the message box.
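
As a rough illustration of how boolean flags like these are typically read, the sketch below parses both settings from the environment. The `env_flag` helper and the set of accepted truthy strings are assumptions made for this example; DocsGPT's own settings loader may behave differently.

```python
import os

# Hypothetical helper, not part of DocsGPT: interpret an env var as a boolean.
# The accepted truthy spellings below are an assumption for this sketch.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Return True if the variable holds a truthy string, else False."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        # Unset variables fall back to the default (OCR off in this document).
        return default
    return raw.strip().lower() in _TRUTHY


# Both flags default to False when unset, matching the values shown above.
source_ocr = env_flag("DOCLING_OCR_ENABLED")
attachment_ocr = env_flag("DOCLING_OCR_ATTACHMENTS_ENABLED")
```

The two flags are read independently, so OCR can be enabled for source ingestion while staying off for chat attachments, or the reverse.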

Processing Flow

Source documents are uploaded through /api/upload.
Ingestion runs asynchronously in Celery (ingest_worker).
SimpleDirectoryReader parses files with get_default_file_extractor.
For PDFs and image formats, Docling parsers are used. OCR in this path is controlled by DOCLING_OCR_ENABLED.
Parsed text is chunked, embedded, and stored in the vector store.
Retrieval during chat uses this indexed text and returns source citations.
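
The chunk, embed, and store stages of the source flow above can be sketched with toy stand-ins. Everything here is hypothetical: `ToyIndex`, `chunk_text`, `toy_embed`, and the 40-character chunk size are illustrative only and do not reflect DocsGPT's actual chunker, embedding model, or vector store.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Stand-in for a vector store: keeps (chunk, vector, source) rows."""
    rows: list = field(default_factory=list)

    def add(self, chunk, vector, source):
        self.rows.append((chunk, vector, source))


def chunk_text(text, size=40):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(chunk):
    """Placeholder vector (length, vowel count); a real system calls a model."""
    vowels = sum(c in "aeiou" for c in chunk.lower())
    return [float(len(chunk)), float(vowels)]


def ingest(parsed_text, source, index):
    """Chunk, embed, and store parsed text; return the number of chunks."""
    chunks = chunk_text(parsed_text)
    for chunk in chunks:
        # The source name travels with each chunk so answers can cite it.
        index.add(chunk, toy_embed(chunk), source)
    return len(chunks)
```

The sketch also shows why OCR matters in this path: if parsing a scanned PDF yields an empty string, `ingest` stores nothing, and retrieval has nothing to cite.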

Chat attachments are uploaded through /api/store_attachment.
Celery task attachment_worker parses and stores the attachment in Postgres (attachments table).
OCR in this path is controlled by DOCLING_OCR_ATTACHMENTS_ENABLED.
Attachments are not vectorized and are not added to the source index.
During answer generation, selected attachment IDs are loaded and passed directly to the LLM pipeline.
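
The attachment flow differs from source ingestion in that parsed content is stored and later fetched by ID rather than searched. A minimal sketch, using an in-memory dictionary in place of the attachments table; the field names and both helper functions are assumptions for illustration, not DocsGPT's schema or API.

```python
import uuid

# In-memory stand-in for the attachments table (hypothetical field names).
ATTACHMENTS = {}


def store_attachment(filename, parsed_text):
    """Save parsed attachment content and return its generated ID."""
    attachment_id = str(uuid.uuid4())
    ATTACHMENTS[attachment_id] = {"filename": filename, "content": parsed_text}
    return attachment_id


def load_selected(attachment_ids):
    """Return stored rows for the selected IDs, skipping unknown IDs."""
    return [ATTACHMENTS[i] for i in attachment_ids if i in ATTACHMENTS]
```

There is no chunking or embedding step here: the rows returned by `load_selected` would be handed to the LLM pipeline as they are.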

Docling OCR behaves differently for PDFs and images:

PDF parser defaults to hybrid OCR:
- text regions: extracted directly
- bitmap/image regions: OCR only where needed
Image parser defaults to full-page OCR (the whole image is visual content).
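
The difference between hybrid and full-page OCR can be expressed as a region filter. This is a conceptual sketch only: the region dictionaries, the `kind` values, and `regions_to_ocr` are hypothetical and do not correspond to Docling's internal data model.

```python
def regions_to_ocr(regions, force_full_page=False):
    """Pick the page regions that should go through OCR.

    Each region is a dict with a "kind" of "text" or "bitmap".
    Hybrid mode (the PDF default described above) OCRs bitmap regions only;
    full-page mode (the image default) OCRs every region.
    """
    if force_full_page:
        return list(regions)
    return [r for r in regions if r["kind"] == "bitmap"]


# A PDF page with an embedded text layer and one scanned figure.
page = [
    {"id": 1, "kind": "text"},
    {"id": 2, "kind": "bitmap"},
    {"id": 3, "kind": "text"},
]
hybrid = regions_to_ocr(page)                           # only region 2
full_page = regions_to_ocr(page, force_full_page=True)  # regions 1, 2, 3
```

Hybrid mode avoids re-recognizing text that can already be extracted exactly, which is why it suits PDFs, while an image has no text layer to fall back on.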

By default, Docling parser classes use RapidOCR options (language default: english).

Note: Parser internals such as OCR language and forced full-page OCR are currently set by code defaults, not by separate .env settings.

When attachments are used in chat, behavior depends on the selected model/provider:

If the attachment's MIME type is supported by the provider, DocsGPT sends the file or image through the provider-native attachment API.
If the MIME type is unsupported, DocsGPT falls back to the parsed text content stored for the attachment.
For providers that support images but not native PDF attachments, PDF files are converted to images (synthetic PDF support).

This means OCR quality is especially important for text fallback paths and for models without native attachment support.
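
The three delivery paths described above can be summarized as a small decision function. The function name, the returned labels, and the check order are assumptions for illustration; the actual provider handling in DocsGPT may be structured differently.

```python
def choose_attachment_mode(mime_type, supported_mime_types):
    """Decide how an attachment reaches the model.

    Returns one of:
      "native"        - provider accepts this MIME type directly
      "pdf_as_images" - PDF rendered to images for an image-capable provider
      "parsed_text"   - fall back to the stored parsed (possibly OCR) text
    """
    if mime_type in supported_mime_types:
        return "native"
    accepts_images = any(m.startswith("image/") for m in supported_mime_types)
    if mime_type == "application/pdf" and accepts_images:
        return "pdf_as_images"
    return "parsed_text"


image_only = {"image/png", "image/jpeg"}
mode_for_pdf = choose_attachment_mode("application/pdf", image_only)
mode_for_docx = choose_attachment_mode(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    image_only,
)
```

Only the `"parsed_text"` branch depends on what the parser extracted earlier, so a scanned file stored with OCR disabled may reach the model as nearly empty text.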

For most OCR-enabled use cases, enable both flags:


DOCLING_OCR_ENABLED=true
DOCLING_OCR_ATTACHMENTS_ENABLED=true

After changing these settings, restart the API and Celery worker.

If Docling is unavailable, DocsGPT falls back to legacy parsers.
With OCR disabled, text-based PDFs can still parse, but scanned/image-heavy content may produce little text.
For image parsing without Docling OCR, the legacy image parser only extracts text when PARSE_IMAGE_REMOTE=true.
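
One reading of the fallback rules for image files is sketched below. The function and its return labels are hypothetical, and the sketch assumes that an image parsed without Docling OCR is handled by the legacy image parser; check the parser code before relying on this ordering.

```python
def image_text_source(docling_available, ocr_enabled, parse_image_remote):
    """Describe where text for an image file would come from.

    Assumed reading of the rules above, not DocsGPT's actual control flow.
    """
    if docling_available and ocr_enabled:
        return "docling_ocr"
    if parse_image_remote:
        # Legacy image parser, which extracts text only in remote mode.
        return "legacy_remote"
    return "no_text"
```

With every flag at its default, and assuming PARSE_IMAGE_REMOTE is also off, this returns `"no_text"`, which matches the warning that image-heavy content may produce little text.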