Kinjal Shamjibhai Sorathiya
Fine-Tuning DocLayout-YOLO for Custom Document Layout
Document AI is becoming a critical component of modern enterprise applications, powering use cases such as intelligent search, Retrieval-Augmented Generation (RAG), information extraction, and automated document processing.
Before these systems can extract meaningful information, they must first understand the structure of a document by identifying elements such as titles, text blocks, tables, figures, and captions. This task is known as Document Layout Analysis (DLA).
While pre-trained models perform well on standard documents, they often struggle with domain-specific content such as technical manuals, engineering drawings, financial reports, healthcare records, and legal documents. Fine-tuning allows these models to adapt to the unique layouts and visual patterns found in real-world datasets.
In this article, we’ll walk through the end-to-end process of fine-tuning DocLayout-YOLO on custom document datasets and share practical lessons learned from real-world document AI projects.
The Problem: Why Generic Models Fall Short
Document Layout Analysis (DLA) is the task of detecting and classifying regions within a document image, such as titles, paragraphs, tables, figures, headers, footnotes, and captions. It’s the essential first step before OCR, information extraction, or RAG pipelines can work reliably.
Models like DocLayout-YOLO are pre-trained on the large synthetic DocSynth-300K corpus and released as checkpoints fine-tuned on public benchmarks such as DocStructBench, DocLayNet, and D4LA. They handle academic papers and standard PDFs fairly well, but the real world rarely gives you standard PDFs.
Who Is This Guide For?
This guide is aimed at ML engineers, computer vision practitioners, data scientists, and AI developers who:
- Need to extract structured information from domain-specific documents (HVAC technical documents, invoices, medical records, legal filings, engineering drawings)
- Are comfortable with Python and have basic familiarity with object detection concepts
What is DocLayout-YOLO?
DocLayout-YOLO is a document-specific object detection model designed for Document Layout Analysis (DLA). Built on the YOLO architecture, it is optimized to detect and classify document elements, including titles, text blocks, tables, figures, captions, and formulas.
Given a document page as input, the model predicts bounding boxes around these regions, enabling downstream tasks such as OCR, information extraction, semantic search, and Retrieval-Augmented Generation (RAG).
Key Terminology
| Term | Meaning |
| YOLO | Object detection architecture that processes the entire image in a single forward pass |
| DLA | Document Layout Analysis |
| IoU | Intersection over Union, measuring overlap between predicted and ground-truth boxes |
| mAP | Mean Average Precision, the primary object detection metric |
| Transfer Learning | Continuing training from pre-trained weights |
| Fine-Tuning | Adapting a pre-trained model to a specific dataset |
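IoU underpins most of the metrics used later in this guide, so it helps to see the arithmetic once. The snippet below is a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2) pixel corners; the function name and the sample boxes are illustrative and not part of DocLayout-YOLO.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (100, 100, 300, 200)      # hypothetical predicted table box
ground_truth = (150, 100, 350, 200)   # hypothetical annotated table box
print(round(iou(predicted, ground_truth), 3))  # 0.6
```

At the common IoU threshold of 0.5, this prediction would count as a correct detection.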
Architecture Overview
DocLayout-YOLO follows the standard YOLO pipeline with document-specific enhancements.
During fine-tuning, the backbone, neck, and detection head can all be updated to adapt the model to a custom document domain. The best strategy depends on dataset size and diversity: small datasets often benefit from freezing part of the backbone to reduce overfitting, while larger datasets generally benefit from full fine-tuning.
One of the strengths of DocLayout-YOLO is its ability to detect both small document elements, such as footnotes, and large structures, such as full-page diagrams, within the same model.
Fine-Tuning Pipeline
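A typical fine-tuning workflow consists of the following steps:
- Prepare and annotate the dataset
- Set up the environment and the dataset configuration
- Train from pre-trained weights
- Evaluate on held-out documents
- Run inference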
When Should You Fine-Tune?
Fine-tuning is recommended when:
- The pretrained model misses important document elements.
- Your documents differ significantly from public datasets.
- Detection quality directly impacts downstream OCR, retrieval, or information extraction workflows.
- You have access to a labeled dataset representative of your target documents.
Step 1 — Preparing Your Dataset
Dataset quality has a greater impact on performance than most hyperparameter choices. Before training begins, invest time in creating consistent annotations.
How Many Images Do You Need?
| Scenario | Recommended Images | Expected Outcome |
| Highly consistent template (one page format) | 50–100 | Good if layout is rigid |
| Moderate variance (multiple form versions) | 200–400 | Good |
| High variance (scanned docs, mixed fonts) | 500–1000 | Acceptable |
| Mixed document types | 800+ | Varies by class balance |
Annotation Tools
Popular annotation tools include:
- Label Studio — open source, web UI, YOLO export plugin
- Roboflow — easiest YOLO export, built-in augmentation, free tier available
- LabelImg — lightweight desktop tool, direct YOLO .txt export
All three support bounding box annotation and YOLO export.
Common Annotation Mistakes
- Inconsistent figure boundaries
- Partial flowchart annotations
- Ambiguous class definitions
- Different labeling standards across annotators
Annotation consistency is often the single biggest factor affecting model performance.
YOLO Label Format
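Each image has a corresponding .txt file with the same base name. Each line in that file describes one annotated region. The # comments below are for explanation only; real label files contain just the five numbers per line: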
# <class_id> <x_center> <y_center> <width> <height>
# All values normalized to [0, 1] relative to image dimensions
0 0.512 0.083 0.720 0.048 # class 0 = "title"
1 0.512 0.210 0.860 0.290 # class 1 = "text"
2 0.512 0.560 0.820 0.300 # class 2 = "table"
3 0.250 0.880 0.450 0.120 # class 3 = "figure"
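Annotation tools usually handle this conversion on export, but it is useful to be able to reproduce it when debugging labels. The following is a small sketch that converts a pixel box (x1, y1, x2, y2) into a YOLO label line and back; the helper names are illustrative, not part of any library.

```python
def to_yolo(box, img_w, img_h):
    """Pixel corners (x1, y1, x2, y2) -> normalized (x_center, y_center, width, height)."""
    x1, y1, x2, y2 = box
    return (
        (x1 + x2) / 2 / img_w,
        (y1 + y2) / 2 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    )


def from_yolo(yolo_box, img_w, img_h):
    """Normalized (x_center, y_center, width, height) -> pixel corners."""
    xc, yc, w, h = yolo_box
    return (
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )


def format_label(class_id, box, img_w, img_h):
    """Build one line of a YOLO label file."""
    values = to_yolo(box, img_w, img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


# A table (class 2) on a hypothetical 1000 x 1000 px page
print(format_label(2, (100, 400, 900, 700), 1000, 1000))
# 2 0.500000 0.550000 0.800000 0.300000
```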
Dataset Structure
dataset/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
├── labels/
│   ├── train/
│   ├── val/
│   └── test/
└── data.yaml
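One way to fill the train/val/test folders is to split by source document rather than by page, so that pages from the same PDF never land in both training and evaluation data. The sketch below only decides which file goes where; it assumes a hypothetical naming scheme of `<doc>_page_<n>.jpg`, and the 15% fractions are illustrative defaults.

```python
import random
from collections import defaultdict


def split_by_document(image_names, val_frac=0.15, test_frac=0.15, seed=42):
    """Assign whole documents to train/val/test so one PDF never straddles splits."""
    groups = defaultdict(list)
    for name in image_names:
        doc_id = name.rsplit("_page_", 1)[0]  # assumed naming: <doc>_page_<n>.jpg
        groups[doc_id].append(name)

    doc_ids = sorted(groups)
    random.Random(seed).shuffle(doc_ids)  # seeded, so the split is reproducible
    n_val = max(1, round(len(doc_ids) * val_frac))
    n_test = max(1, round(len(doc_ids) * test_frac))

    splits = {"train": [], "val": [], "test": []}
    for i, doc_id in enumerate(doc_ids):
        if i < n_val:
            key = "val"
        elif i < n_val + n_test:
            key = "test"
        else:
            key = "train"
        splits[key].extend(sorted(groups[doc_id]))
    return splits


# 20 hypothetical documents with 5 pages each
names = [f"doc{d:02d}_page_{p:04d}.jpg" for d in range(20) for p in range(5)]
splits = split_by_document(names)
print({split: len(files) for split, files in splits.items()})
```

Each image and its matching label file can then be copied into the corresponding images/ and labels/ subfolder.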
Converting PDF to Images (Common Pre-step)
Most document datasets start as PDFs. Convert them to high-resolution images before labeling. Use 150–200 DPI for typical documents and 300 DPI for small-text or form-heavy documents.
python
# pip install pdf2image (also requires poppler: apt-get install poppler-utils)
from pathlib import Path

from pdf2image import convert_from_path


def pdf_to_images(pdf_path: str, output_dir: str, dpi: int = 200) -> list[Path]:
    """Render every page of a PDF to a JPEG file and return the saved paths."""
    pages = convert_from_path(pdf_path, dpi=dpi, fmt="jpeg")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, page in enumerate(pages):
        fpath = out / f"page_{i:04d}.jpg"
        page.save(fpath, "JPEG", quality=95)
        saved.append(fpath)
    return saved


# Usage
images = pdf_to_images("documents_batch.pdf", "dataset/images/train")
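To put the DPI guidance in perspective, the sketch below computes the pixel size of an A4 page at a given DPI and the resolution that remains once the page is resized for the model. It assumes the data loader scales the longest side down to imgsz, as Ultralytics-style letterboxing does; the helper names are illustrative.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(width_in, height_in, dpi, imgsz=1024):
    """DPI that remains after the longest side is resized down to imgsz."""
    w, h = page_pixels(width_in, height_in, dpi)
    scale = min(1.0, imgsz / max(w, h))  # never upscale in this sketch
    return dpi * scale


A4 = (8.27, 11.69)  # width and height in inches
for dpi in (150, 200, 300):
    w, h = page_pixels(*A4, dpi)
    print(f"{dpi} DPI -> {w} x {h} px, ~{effective_dpi(*A4, dpi):.1f} DPI at imgsz=1024")
```

Under that assumption, an A4 page reaches the detector at roughly 88 DPI regardless of the rendering DPI, so the higher settings mainly pay off in annotation precision and in OCR on regions cropped from the original image.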
Step 2 — Environment Setup & Configuration
Install dependencies:
bash
python -m venv .venv
source .venv/bin/activate
pip install doclayout-yolo==0.0.3
Verify installation:
bash
python -c "from doclayout_yolo import YOLOv10; print('doclayout_yolo import OK')"
Dataset Configuration
The YAML file tells the trainer where your data lives and which classes you have. Keep class names consistent with your labeling tool.
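yaml
# data.yaml
path: dataset        # dataset root; an absolute path avoids ambiguity about how relative roots are resolved
train: images/train
val: images/val
test: images/test
nc: 6
names:
  0: title
  1: text
  2: table
  3: figure
  4: caption
  5: footer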
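Before training, it is worth checking that the label files agree with this configuration. The following is a minimal sketch that validates the content of one YOLO label file against the number of classes and counts instances per class; the function name and the sample text are illustrative.

```python
from collections import Counter

CLASS_NAMES = ["title", "text", "table", "figure", "caption", "footer"]


def check_label_text(text, num_classes):
    """Return (per-class counts, list of problems) for one label file's content."""
    counts, problems = Counter(), []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 5:
            problems.append(f"line {line_no}: expected 5 fields, got {len(parts)}")
            continue
        try:
            class_id = int(parts[0])
            coords = [float(v) for v in parts[1:]]
        except ValueError:
            problems.append(f"line {line_no}: value could not be parsed")
            continue
        if not 0 <= class_id < num_classes:
            problems.append(f"line {line_no}: class id {class_id} out of range")
        elif any(not 0.0 <= v <= 1.0 for v in coords):
            problems.append(f"line {line_no}: coordinate outside [0, 1]")
        else:
            counts[class_id] += 1
    return counts, problems


sample = "0 0.5 0.1 0.7 0.05\n7 0.5 0.5 0.2 0.2\n2 0.5 1.2 0.2 0.2\n1 0.5 0.5\n"
counts, problems = check_label_text(sample, len(CLASS_NAMES))
print({CLASS_NAMES[i]: n for i, n in counts.items()})
print(problems)
```

Running the same check over every file in labels/train and summing the counters also gives a quick view of class balance.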
Choosing the Right Model
| Variant | Training Dataset | Recommended Usage |
| doclayout_yolo_docstructbench_imgsz1024.pt | DocStructBench | General-purpose document layout detection |
| doclayout_yolo_ft_imgsz1024.pt (D4LA tuned) | D4LA | Complex and scanned document layouts |
For most projects, doclayout_yolo_docstructbench_imgsz1024.pt is a good starting point. If your documents contain scanned pages, noisy layouts, or domain-specific structures, fine-tuning on your own dataset typically provides larger gains than switching between pretrained checkpoints.
Step 3 — Training the Model
Minimal Training Script
DocLayout-YOLO uses the Ultralytics training API, which means you can kick off fine-tuning in very few lines:
python
from doclayout_yolo import YOLOv10

# Load pre-trained weights
model = YOLOv10('doclayout_yolo_docstructbench_imgsz1024.pt')

results = model.train(
    data='dataset/data.yaml',
    imgsz=1024,        # keep at 1024; the checkpoint was trained at this resolution
    epochs=100,        # upper bound; early stopping usually ends the run sooner
    batch=4,           # reduce to 2 if you hit OOM on <16 GB VRAM
    lr0=0.001,         # initial learning rate (lower for fine-tuning)
    lrf=0.01,          # final LR = lr0 * lrf
    warmup_epochs=3,   # warmup prevents loss explosion at start
    patience=20,       # early stopping: stop if no improvement for 20 epochs
    save_period=10,    # checkpoint every 10 epochs
    project='runs/doc_finetune',
    name='docs_v1',
    device=0,          # GPU index; use 'cpu' for CPU-only (slow)
    workers=4,
    # Augmentation: document-safe settings
    hsv_h=0.01,        # very slight hue jitter
    hsv_s=0.2,
    hsv_v=0.3,
    degrees=0.0,       # no rotation; enable only if rotated pages occur in production
    translate=0.05,
    scale=0.3,
    fliplr=0.0,        # no horizontal flip; it reverses reading direction
    mosaic=0.5,        # reduce from default 1.0
)
Note: Horizontal flipping is generally not recommended for document datasets because it reverses reading direction and changes document semantics. Rotation augmentation should only be used when rotated pages or diagrams are expected in production. When adapting YOLO configurations from natural-image tasks, carefully select document-specific augmentations.
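To see what lr0 and lrf mean over a full run, the sketch below reproduces the linear decay that upstream Ultralytics applies by default (without cosine scheduling). Warmup is not modelled, and the assumption is that the DocLayout-YOLO fork keeps the same schedule. Note also that upstream Ultralytics selects its own learning rate when the optimizer is left on 'auto', so it is worth confirming in the training log that your lr0 is actually used.

```python
def lr_at_epoch(epoch, epochs=100, lr0=0.001, lrf=0.01):
    """Linear decay from lr0 at epoch 0 to lr0 * lrf at the last epoch (warmup ignored)."""
    factor = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * factor


for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.2e}")
```

With the settings used above, the learning rate falls from 1e-3 to 1e-5 over 100 epochs.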
GPU Memory Reference
| GPU VRAM | Recommended Batch Size | imgsz | Notes |
| 8 GB (e.g., RTX 3070) | 2 | 1024 | Keep amp=True (mixed precision) |
| 16 GB (e.g., T4) | 4–6 | 1024 | Stable; good starting point |
| 24 GB+ (e.g., A10G, A100) | 8–16 | 1024 | Use larger batches for higher throughput |
| CPU only | 1 | 640 | Very slow; only for verification runs |
What Training Output Looks Like
Epoch     GPU_mem   box_loss   cls_loss   dfl_loss   Instances   Size
1/100     5.91G     2.847      3.421      1.203      87          1024
5/100     5.91G     1.983      2.104      1.051      91          1024
10/100    5.91G     1.612      1.580      0.974      89          1024
20/100    5.91G     1.341      1.203      0.891      92          1024
50/100    5.92G     1.089      0.901      0.811      88          1024
100/100   5.92G     0.934      0.762      0.774      90          1024
Results saved to runs/doc_finetune/docs_v1/
Watch cls_loss (classification loss) and box_loss (localization loss) together. If cls_loss plateaus but box_loss still drops, the model is finding boxes correctly but confusing class identities — often a sign of overlapping class definitions or insufficient per-class data.
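Rather than reading the console log, you can inspect the per-epoch CSV that Ultralytics-style trainers write into the run directory. The sketch below finds the epoch with the best validation mAP; the file layout and column names (for example `metrics/mAP50-95(B)`) follow upstream Ultralytics and are assumptions here, so check the header of your own results.csv first. The sample values are made up for illustration.

```python
import csv
import io


def best_epoch(csv_text, metric="metrics/mAP50-95(B)"):
    """Return (epoch, value) for the row with the highest value of `metric`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Some versions pad column names and values with spaces, so strip both
    rows = [{k.strip(): v.strip() for k, v in row.items()} for row in reader]
    best = max(rows, key=lambda r: float(r[metric]))
    return int(float(best["epoch"])), float(best[metric])


sample = """   epoch,  train/box_loss,  train/cls_loss,  metrics/mAP50-95(B)
       1,           2.847,           3.421,                0.212
       2,           1.983,           2.104,                0.455
       3,           1.612,           1.580,                0.431
"""
print(best_epoch(sample))  # (2, 0.455)
```

For a real run, pass the text of the results.csv file from your run directory instead of the sample string.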
Step 4 — Evaluating Performance
Load the best checkpoint:
python
from doclayout_yolo import YOLOv10

# Load the best checkpoint (saved automatically)
model = YOLOv10("runs/doc_finetune/docs_v1/weights/best.pt")

metrics = model.val(
    data="dataset/data.yaml",
    imgsz=1024,
    split="val",   # use "test" (with a test: entry in data.yaml) for final reporting
    plots=True,    # generates confusion matrix, PR curve
)

print(f"mAP@50: {metrics.box.map50:.4f}")
print(f"mAP@50-95: {metrics.box.map:.4f}")
print(f"Precision: {metrics.box.mp:.4f}")
print(f"Recall: {metrics.box.mr:.4f}")

# Per-class breakdown
for cls_name, ap in zip(metrics.names.values(), metrics.box.maps):
    print(f"  {cls_name:<15} AP@50-95: {ap:.4f}")
Note: Never report performance based solely on training metrics. Evaluation should be performed on unseen validation and test documents to measure generalization.
Key metrics
| Metric | Meaning |
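| Precision | Share of predicted regions that match a ground-truth region |
| Recall | Share of ground-truth regions that the model detects |
| mAP@50 | Mean AP across classes at an IoU threshold of 0.5 |
| mAP@50-95 | Mean AP averaged over IoU thresholds from 0.5 to 0.95; rewards tightly fitting boxes |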
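As a quick worked example of the first two rows, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The numbers describe a hypothetical set of pages, not a measured result.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical: 10 annotated tables, 9 predicted tables, 8 of them correct
p, r, f1 = detection_scores(tp=8, fp=1, fn=2)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.889 recall=0.800 f1=0.842
```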
Step 5 — Running Inference
python
from doclayout_yolo import YOLOv10

model = YOLOv10("runs/doc_finetune/docs_v1/weights/best.pt")

results = model.predict(
    source="document.jpg",
    imgsz=1024,
    conf=0.3,    # confidence threshold; lower it for recall-focused use cases
    iou=0.45,    # IoU threshold for NMS (Non-Maximum Suppression)
    save=True,   # saves an annotated copy under runs/ (runs/detect/predict by default in Ultralytics)
)

# Access structured results
for r in results:
    print("Detected regions:")
    for box in r.boxes:
        cls_id = int(box.cls)
        cls_name = model.names[cls_id]
        conf = float(box.conf)
        x1, y1, x2, y2 = [int(v) for v in box.xyxy[0]]
        print(f"  {cls_name:15s} conf={conf:.2f} box=[{x1},{y1},{x2},{y2}]")
Example output:
Detected regions:
  title           conf=0.94 box=[87,42,621,98]
  text            conf=0.91 box=[87,115,621,380]
  table           conf=0.88 box=[87,402,621,720]
  figure          conf=0.83 box=[430,740,590,850]
  footer          conf=0.79 box=[87,870,621,920]
Trade-offs and Common Mistakes
Decision Trade-off Table
| Decision | Option A | Option B | Recommendation |
| Image resolution | 640 (faster) | 1024 (pretrain default) | Use 1024 — changing res requires more epochs |
| Freeze backbone? | Freeze (less data needed) | Full fine-tune (more flexible) | Freeze first if < 1800 images; full FT for 2000+ |
| Learning rate | High (0.01+) | Low (0.0001–0.001) | Low LR for fine-tuning — preserves pretrained features |
| Class definitions | Use DocLayNet’s 11 classes | Define your own classes | Define only what your task needs |
| Mosaic augmentation | Enabled (mosaic=1.0) | Reduced (mosaic=0.5) | Reduce to 0.5 — mosaic breaks document spatial coherence |
| Confidence threshold | High (0.5+) | Low (0.2–0.3) | Tune to downstream need post-training |
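The last row can be made concrete with a threshold sweep on validation detections. The sketch below assumes you have already matched each detection to the ground truth at a fixed IoU threshold, so every detection is a (confidence, is_true_positive) pair; the helper names and the sample data are illustrative.

```python
def sweep_thresholds(detections, num_ground_truth, thresholds):
    """Precision and recall at each confidence threshold.

    detections: list of (confidence, is_true_positive) pairs.
    """
    rows = []
    for t in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        # Convention in this sketch: no detections kept means precision 1.0
        precision = tp / len(kept) if kept else 1.0
        recall = tp / num_ground_truth if num_ground_truth else 0.0
        rows.append((t, precision, recall))
    return rows


def lowest_threshold_meeting(rows, min_precision):
    """Lowest threshold whose precision reaches the target, or None."""
    for t, precision, recall in sorted(rows):
        if precision >= min_precision:
            return t, precision, recall
    return None


detections = [
    (0.95, True), (0.90, True), (0.80, True), (0.60, False),
    (0.55, True), (0.40, False), (0.30, True), (0.25, False),
]
rows = sweep_thresholds(detections, num_ground_truth=6, thresholds=[0.2, 0.3, 0.5, 0.7])
for t, p, r in rows:
    print(f"conf>={t:.1f}: precision={p:.3f} recall={r:.3f}")
print(lowest_threshold_meeting(rows, min_precision=0.8))
```

Choosing the lowest threshold that still meets the precision target keeps recall as high as the downstream task allows.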
Common Mistakes
Mistake 1: Wrong imgsz during inference
If you train at 1024 and run inference at 640, every element is downscaled relative to what the model learned, and small regions such as footnotes and captions are the first to be missed. Always use the same image size for training, validation, and inference. Ultralytics does not raise an error on this mismatch, so the issue can go unnoticed.
Mistake 2: Evaluating on the val split used for early stopping
Early stopping (the patience parameter) and the selection of best.pt are both driven by validation metrics, so val scores are optimistically biased. Always hold out a separate test set (not the val set) for final reporting.
Mistake 3: Overlapping or ambiguous class boundaries
If ‘header’ and ‘title’ look the same to an annotator, your model will learn noisy labels and plateau early. Collapse ambiguous classes or write a clear annotation guide before labeling.
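If you decide to collapse classes after labeling has started, the existing label files can be rewritten instead of re-annotated. The sketch below remaps class ids in the text of one YOLO label file; it assumes a hypothetical earlier scheme in which id 6 was 'header' and is now merged into id 0 ('title').

```python
def remap_label_text(text, id_map):
    """Rewrite class ids in YOLO label text.

    Ids missing from id_map are kept; ids mapped to None are dropped.
    Blank or malformed lines (not exactly 5 fields) are skipped.
    """
    out = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue
        old_id = int(parts[0])
        new_id = id_map.get(old_id, old_id)
        if new_id is None:
            continue
        out.append(" ".join([str(new_id)] + parts[1:]))
    return "\n".join(out) + ("\n" if out else "")


original = "6 0.5 0.05 0.8 0.04\n1 0.5 0.4 0.8 0.5\n"
print(remap_label_text(original, {6: 0}), end="")
# 0 0.5 0.05 0.8 0.04
# 1 0.5 0.4 0.8 0.5
```

Remember to update nc and names in the dataset YAML so they match the new id scheme.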
Edge case — multi-column documents
DocLayout-YOLO detects axis-aligned bounding boxes. For multi-column layouts, regions from different columns may spatially interleave. Include at least 30–50 multi-column pages if your target documents use this layout.
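Downstream OCR or RAG steps usually need the detected regions in reading order, which the detector itself does not provide. The sketch below is a simple heuristic, not part of DocLayout-YOLO: it assumes equal-width columns, treats boxes wider than 1.5 columns as page-spanning elements, and reads each band between spanning elements column by column.

```python
def reading_order(boxes, page_width, num_columns=2):
    """Sort (x1, y1, x2, y2) boxes top to bottom within columns, left column first."""
    col_width = page_width / num_columns

    def is_spanning(box):
        return (box[2] - box[0]) > 1.5 * col_width

    # Tops of page-spanning boxes (e.g. a title) split the page into vertical bands
    span_tops = sorted(b[1] for b in boxes if is_spanning(b))

    def sort_key(box):
        x1, y1, x2, y2 = box
        band = sum(1 for top in span_tops if top <= y1)
        if is_spanning(box):
            return (band, 0, 0, y1)  # a spanning box opens its band
        column = min(int((x1 + x2) / 2 / col_width), num_columns - 1)
        return (band, 1, column, y1)

    return sorted(boxes, key=sort_key)


title = (50, 40, 950, 100)
left_top, left_bottom = (50, 120, 480, 500), (50, 520, 480, 900)
right_top, right_bottom = (520, 120, 950, 400), (520, 420, 950, 900)

detected = [right_top, left_bottom, title, right_bottom, left_top]
for box in reading_order(detected, page_width=1000):
    print(box)
```

Pages with unequal column widths or sidebars will need a more careful column assignment, for example by clustering the box centers.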
Pro tip — freeze backbone for tiny datasets
If you have fewer than 100 annotated images, freeze the early backbone layers to prevent overfitting. Add freeze=10 to your model.train() call to freeze the first 10 layers.
Key Takeaways
If you remember only a few things from this guide, remember these:
- DocLayout-YOLO is an excellent starting point for document layout analysis.
- Annotation quality has a larger impact than hyperparameter tuning.
- Fine-tuning from pretrained weights beats training from scratch in almost every practical scenario. You need far less data (50–500 images) and fewer epochs than you think.
- Augmentation configuration matters more in document detection than in natural-image detection. Disable flipping and large rotations, and keep mosaic moderate (0.5).
- Data quality trumps data quantity. One hundred accurately labeled documents will outperform a thousand noisily labeled ones. Write a clear annotation guide before labeling.
- Always maintain a clean test set separate from your validation set for honest final evaluation. Val metrics during training are optimistically biased by early stopping.
- The confidence threshold (conf) is a deployment parameter, not a training parameter. Tune it post-training based on your precision/recall requirements.
- Document-specific augmentation strategies differ significantly from those used in natural-image detection tasks.
- Start small, iterate quickly, and continuously improve your dataset.
Conclusion
Fine-tuning DocLayout-YOLO is less about finding perfect hyperparameters and more about building a high-quality dataset. Better annotations, clear class definitions, and systematic analysis of failed predictions often drive the most significant improvements in document AI performance.
For organizations working with HVAC technical manuals, engineering drawings, flowcharts, and diagrams, a well-trained DocLayout-YOLO model can significantly improve layout detection, OCR accuracy, content retrieval, and information extraction.
At ACL Digital, we are exploring AI-driven solutions to improve how users interact with complex technical documentation. By leveraging DocLayout-YOLO, teams can build intelligent systems that quickly locate relevant figures, tables, and technical information, reducing manual effort and enhancing the overall user experience.
References
- DocLayout-YOLO-DocStructBench on Hugging Face (https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench): pretrained weights
- DocLayout-YOLO Official Repository: source code, release notes
- DocLayout-YOLO Paper (arXiv:2410.12628): original research paper describing the DocSynth-300K synthetic dataset and the GL-CRM module
- DocLayNet Dataset: one of the public benchmarks used to fine-tune and evaluate DocLayout-YOLO
- Ultralytics Training Documentation: full list of training hyperparameters and their defaults
- Ultralytics Export Documentation: ONNX, TensorRT, CoreML export options
- Label Studio Export Guide: how to export annotations in YOLO format
- Roboflow YOLOv10 Export Guide: using Roboflow for labeling and dataset management







