Docling OCR for Sources and Attachments
DocsGPT uses Docling as the default parser layer for many document formats. OCR is optional and controlled by two settings:
DOCLING_OCR_ENABLED=false
DOCLING_OCR_ATTACHMENTS_ENABLED=false
- DOCLING_OCR_ENABLED: OCR behavior for Source Docs ingestion.
- DOCLING_OCR_ATTACHMENTS_ENABLED: OCR behavior for chat attachments uploaded from the message box.
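As an illustration of how two independent boolean flags like these can be read, here is a minimal sketch. The helper names `env_flag` and `ocr_settings` are hypothetical and the accepted truthy spellings are an assumption; DocsGPT's actual settings loader may work differently.

```python
import os
from typing import Mapping, Optional


def env_flag(name: str, default: bool = False,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret an environment variable as a boolean (hypothetical helper)."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as "enabled"; anything else is "disabled".
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def ocr_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return the OCR switch for each ingestion path; both default to off."""
    return {
        "sources": env_flag("DOCLING_OCR_ENABLED", environ=environ),
        "attachments": env_flag("DOCLING_OCR_ATTACHMENTS_ENABLED", environ=environ),
    }
```

Because the flags are read separately, OCR can be enabled for Source Docs while staying off for chat attachments, or the other way round.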
Processing Flow
Source Docs flow (Upload and Train)
- Files are uploaded through /api/upload.
- Ingestion runs asynchronously in Celery (ingest_worker).
- SimpleDirectoryReader parses files with get_default_file_extractor.
- For PDFs and image formats, Docling parsers are used.
- OCR in this path is controlled by DOCLING_OCR_ENABLED.
- Parsed text is chunked, embedded, and stored in the vector store.
- Retrieval during chat uses this indexed text and returns source citations.
Attachment flow (Chat-only file context)
- Files are uploaded through /api/store_attachment.
- Celery task attachment_worker parses and stores the attachment in MongoDB (attachments collection).
- OCR in this path is controlled by DOCLING_OCR_ATTACHMENTS_ENABLED.
- Attachments are not vectorized and are not added to the source index.
- During answer generation, selected attachment IDs are loaded and passed directly to the LLM pipeline.
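The last step of the attachment flow can be sketched as a plain lookup by ID, with no vector search involved. This is only an illustration of the parsed-text case: a dict stands in for the MongoDB attachments collection, and the function name and the `filename` and `content` fields are hypothetical, not DocsGPT's actual schema.

```python
from typing import Dict, List


def build_attachment_context(store: Dict[str, dict], selected_ids: List[str]) -> str:
    """Join the parsed text of the selected attachments, in selection order.

    `store` is a stand-in for the attachments collection (assumption);
    IDs that are not present in the store are skipped.
    """
    records = [store[att_id] for att_id in selected_ids if att_id in store]
    return "\n\n".join(f"[{rec['filename']}]\n{rec['content']}" for rec in records)
```

Since the stored text is whatever the parser produced at upload time, a scanned PDF parsed with OCR disabled would contribute little or nothing here.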
How Docling OCR Works
Docling OCR behavior is different for PDFs vs images:
- PDF parser defaults to hybrid OCR:
  - text regions: extracted directly
  - bitmap/image regions: OCR only where needed
- Image parser defaults to full-page OCR (the whole image is visual content).
By default, Docling parser classes use RapidOCR options (language default: english).
Note: Parser internals like OCR language and force-full-page OCR are currently set by code defaults, not separate .env settings.
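The difference between hybrid and full-page OCR can be sketched with a toy model. This is not Docling's implementation: the `Region` type, the `extract_page` function, and the injected `ocr` callable are hypothetical and exist only to show the decision made per region.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    kind: str     # "text" (embedded, selectable text) or "bitmap" (pixels only)
    payload: str  # embedded text for "text" regions, an image reference otherwise


def extract_page(regions: List[Region], ocr: Callable[[Region], str],
                 force_full_page_ocr: bool = False) -> str:
    """Hybrid mode reads text regions directly and runs OCR on the rest.

    With force_full_page_ocr=True every region goes through OCR, which is how
    an image file behaves because it contains no embedded text at all.
    """
    parts = []
    for region in regions:
        if region.kind == "text" and not force_full_page_ocr:
            parts.append(region.payload)  # no OCR needed
        else:
            parts.append(ocr(region))     # bitmap, or OCR forced for the whole page
    return "\n".join(parts)
```

In hybrid mode the OCR engine is only called for the regions that need it, which keeps text-based PDFs fast and avoids recognition errors on text that can be read exactly.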
Attachment Behavior by Model Support
When attachments are used in chat, behavior depends on the selected model/provider:
- If a MIME type is supported, DocsGPT sends files/images through provider-native attachment APIs.
- If unsupported, DocsGPT falls back to the parsed text content stored for the attachment.
- For providers that support images but not native PDF attachments, PDF files are converted to images (synthetic PDF support).
This means OCR quality is especially important for text fallback paths and for models without native attachment support.
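The three cases above amount to a small decision function. The sketch below is illustrative only: the `Delivery` enum and `choose_delivery` are hypothetical names, and detecting image support by looking for an `image/` MIME prefix is an assumption rather than DocsGPT's actual check.

```python
from enum import Enum
from typing import Set


class Delivery(Enum):
    NATIVE = "native"                # provider-native attachment API
    PDF_AS_IMAGES = "pdf_as_images"  # PDF pages converted to images
    PARSED_TEXT = "parsed_text"      # fall back to the stored parsed text


def choose_delivery(mime_type: str, supported_mime_types: Set[str]) -> Delivery:
    """Pick how an attachment reaches the model, given what the provider accepts."""
    if mime_type in supported_mime_types:
        return Delivery.NATIVE
    # Assumption: a provider "supports images" if it accepts any image/* type.
    supports_images = any(m.startswith("image/") for m in supported_mime_types)
    if mime_type == "application/pdf" and supports_images:
        return Delivery.PDF_AS_IMAGES
    return Delivery.PARSED_TEXT
```

Only the `PARSED_TEXT` branch depends on what the parser extracted at upload time, so that is where the OCR setting for attachments has the most visible effect.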
Recommended Configuration
For most OCR-enabled use cases, enable both flags:
DOCLING_OCR_ENABLED=true
DOCLING_OCR_ATTACHMENTS_ENABLED=true
After changing these settings, restart the API and Celery worker.
Legacy Fallback Notes
- If Docling is unavailable, DocsGPT falls back to legacy parsers.
- With OCR disabled, text-based PDFs can still parse, but scanned/image-heavy content may produce little text.
- For image parsing without Docling OCR, the legacy image parser only extracts text when PARSE_IMAGE_REMOTE=true.
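One reading of these notes, for image files only, is summarized in the sketch below. The function name and return labels are hypothetical, and the real parser selection in DocsGPT may involve more conditions than the three inputs shown here.

```python
def image_text_source(docling_available: bool, docling_ocr_enabled: bool,
                      parse_image_remote: bool) -> str:
    """Say where text for an image file would come from (illustrative labels)."""
    if docling_available and docling_ocr_enabled:
        return "docling_ocr"
    # Without Docling OCR, the legacy image parser is the remaining option,
    # and it only extracts text when PARSE_IMAGE_REMOTE=true.
    if parse_image_remote:
        return "legacy_remote"
    return "no_text"
```

The `no_text` outcome corresponds to the case where an image is accepted but contributes no searchable or promptable content.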