Kinjal Shamjibhai Sorathiya

June 30, 2026

Fine-Tuning DocLayout-YOLO for Custom Document Layout

Document AI is becoming a critical component of modern enterprise applications, powering use cases such as intelligent search, Retrieval-Augmented Generation (RAG), information extraction, and automated document processing.

Before these systems can extract meaningful information, they must first understand the structure of a document by identifying elements such as titles, text blocks, tables, figures, and captions. This task is known as Document Layout Analysis (DLA).

While pre-trained models perform well on standard documents, they often struggle with domain-specific content such as technical manuals, engineering drawings, financial reports, healthcare records, and legal documents. Fine-tuning allows these models to adapt to the unique layouts and visual patterns found in real-world datasets.

In this article, we’ll walk through the end-to-end process of fine-tuning DocLayout-YOLO on custom document datasets and share practical lessons learned from real-world document AI projects.

The Problem: Why Generic Models Fall Short

Document Layout Analysis (DLA) is the task of detecting and classifying regions within a document image, such as titles, paragraphs, tables, figures, headers, footnotes, and captions. It’s the essential first step before OCR, information extraction, or RAG pipelines can work reliably.

Models like DocLayout-YOLO are pre-trained on the synthetic DocSynth-300K corpus and then fine-tuned on public benchmarks such as DocLayNet and D4LA. They handle academic papers and standard PDFs fairly well, but the real world rarely gives you standard PDFs.

Who Is This Guide For?

This guide is aimed at ML engineers, computer vision practitioners, data scientists, and AI developers who:

Need to extract structured information from domain-specific documents (HVAC technical documents, invoices, medical records, legal filings, engineering drawings)
Are comfortable with Python and have basic familiarity with object detection concepts

What is DocLayout-YOLO?

DocLayout-YOLO is a document-specific object detection model designed for Document Layout Analysis (DLA). Built on the YOLO architecture, it is optimized to detect and classify document elements, including titles, text blocks, tables, figures, captions, and formulas.

Given a document page as input, the model predicts bounding boxes around these regions, enabling downstream tasks such as OCR, information extraction, semantic search, and Retrieval-Augmented Generation (RAG).

Key Terminology

Term	Meaning
YOLO	Object detection architecture that processes the entire image in a single forward pass
DLA	Document Layout Analysis
IoU	Intersection over Union, measuring overlap between predicted and ground-truth boxes
mAP	Mean Average Precision, the primary object detection metric
Transfer Learning	Continuing training from pre-trained weights
Fine-Tuning	Adapting a pre-trained model to a specific dataset
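
To make the IoU entry in the table concrete, here is a minimal sketch in plain Python. The function name `box_iou` and the (x1, y1, x2, y2) pixel-corner convention are choices made for this illustration; they are not part of the DocLayout-YOLO API.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (empty if the boxes do not touch)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (100, 100, 300, 200)
ground_truth = (150, 100, 350, 200)
print(round(box_iou(predicted, ground_truth), 3))  # 0.6
```

At the common IoU threshold of 0.5, this prediction would count as a match for the ground-truth box.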

Architecture Overview

DocLayout-YOLO follows the standard YOLO pipeline with document-specific enhancements.

During fine-tuning, the backbone, neck, and detection head can all be updated to adapt the model to a custom document domain. The best strategy depends on dataset size and diversity: smaller datasets often benefit from freezing part of the backbone to reduce overfitting, while larger datasets generally benefit from full fine-tuning.

One strength of DocLayout-YOLO is that a single model detects both small document elements, such as footnotes, and large structures, such as full-page diagrams.

Fine-Tuning Pipeline

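A typical fine-tuning workflow consists of the five steps covered in this guide: preparing the dataset, setting up the environment and dataset configuration, training the model, evaluating performance, and running inference.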

When Should You Fine-Tune?

Fine-tuning is recommended when:

The pretrained model misses important document elements.
Your documents differ significantly from public datasets.
Detection quality directly impacts downstream OCR, retrieval, or information extraction workflows.
You have access to a labeled dataset representative of your target documents.

Step 1 — Preparing Your Dataset

Dataset quality has a greater impact on performance than most hyperparameter choices. Before training begins, invest time in creating consistent annotations.

How Many Images Do You Need?

Scenario	Recommended Images	Expected Outcome
Highly consistent template (one page format)	50–100	Good if layout is rigid
Moderate variance (multiple form versions)	200–400	Good
High variance (scanned docs, mixed fonts)	500–1000	Acceptable
Mixed document types	800+	Varies by class balance

Annotation Tools

Popular annotation tools include:

Label Studio — open source, web UI, YOLO export plugin
Roboflow — easiest YOLO export, built-in augmentation, free tier available
LabelImg — lightweight desktop tool, direct YOLO .txt export

All three support bounding box annotation and YOLO export.

Common Annotation Mistakes

Inconsistent figure boundaries
Partial flowchart annotations
Ambiguous class definitions
Different labeling standards across annotators

Annotation consistency is often the single biggest factor affecting model performance.

YOLO Label Format

Each image has a corresponding .txt file with the same base name. Each line in that file describes one annotated region:

# <class_id> <x_center> <y_center> <width> <height>
# Coordinates are normalized to [0, 1] relative to image dimensions

0 0.512 0.083 0.720 0.048  # class 0 = "title"
1 0.512 0.210 0.860 0.290  # class 1 = "text"
2 0.512 0.560 0.820 0.300  # class 2 = "table"
3 0.250 0.880 0.450 0.120  # class 3 = "figure"

The trailing comments are for illustration only; real label files contain just the five numeric values on each line.
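
Annotation tools usually handle this conversion for you, but it helps to see the arithmetic once. The sketch below converts a pixel-space box to the normalized format and back. The function names `to_yolo` and `from_yolo` and the page size are assumptions made for this example.

```python
def to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corners to normalized (x_center, y_center, width, height)."""
    return (
        (x1 + x2) / 2 / img_w,
        (y1 + y2) / 2 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    )


def from_yolo(xc, yc, w, h, img_w, img_h):
    """Convert a normalized YOLO box back to pixel corners."""
    return (
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )


# A title box on a 1000 x 1400 pixel page (illustrative numbers)
box = to_yolo(152, 83, 872, 150, img_w=1000, img_h=1400)
print("0 " + " ".join(f"{v:.3f}" for v in box))  # 0 0.512 0.083 0.720 0.048
```

The printed line matches the "title" row in the label example above.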

Dataset Structure

dataset/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
│
├── labels/
│   ├── train/
│   ├── val/
│   └── test/
│
└── data.yaml
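
One way to fill the three folders is a reproducible random split. The sketch below only decides which file names go where; the function name `split_dataset` and the 70/15/15 ratio are assumptions for illustration, not a recommendation from the DocLayout-YOLO authors.

```python
import random


def split_dataset(image_names, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into train/val/test lists."""
    names = sorted(image_names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)
    n_test = round(len(names) * test_frac)
    n_val = round(len(names) * val_frac)
    return {
        "test": names[:n_test],
        "val": names[n_test:n_test + n_val],
        "train": names[n_test + n_val:],
    }


splits = split_dataset([f"page_{i:04d}.jpg" for i in range(200)])
print({name: len(files) for name, files in splits.items()})
```

If many pages come from the same source PDF, it is usually safer to split by document rather than by page, so near-identical pages do not end up on both sides of the split.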

Converting PDF to Images (Common Pre-step)

Most document datasets start as PDFs. Convert them to high-resolution images before labeling. Use 150–200 DPI for typical documents and 300 DPI for small-text or form-heavy documents.

python
# pip install pdf2image (also requires poppler: apt-get install poppler-utils)
from pathlib import Path

from pdf2image import convert_from_path


def pdf_to_images(pdf_path: str, output_dir: str, dpi: int = 200) -> list[Path]:
    """Render each page of a PDF to a JPEG file and return the saved paths."""
    pages = convert_from_path(pdf_path, dpi=dpi, fmt="jpeg")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem  # prefix file names so pages from different PDFs do not collide
    saved = []
    for i, page in enumerate(pages):
        fpath = out / f"{stem}_page_{i:04d}.jpg"
        page.save(fpath, "JPEG", quality=95)
        saved.append(fpath)
    return saved


# Usage
images = pdf_to_images("documents_batch.pdf", "dataset/images/train")
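
To see what the DPI choice means in pixels, the sketch below computes the rendered size of a page and how much it shrinks at imgsz=1024. It assumes A4 pages and the usual YOLO-style resize that scales the longer side to imgsz; the helper name `page_pixels` is made up for this example.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # A4 page size in millimetres (width, height)


def page_pixels(dpi, size_mm=A4_MM):
    """Pixel dimensions of a page rendered at the given DPI."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in size_mm)


for dpi in (150, 200, 300):
    w, h = page_pixels(dpi)
    print(f"{dpi} DPI -> {w} x {h} px, scaled by {1024 / h:.2f} at imgsz=1024")
```

Under these assumptions, a 200 DPI page is 1654 x 2339 pixels and is reduced to a bit under half its size before it reaches the detector.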

Step 2 — Environment Setup & Configuration

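Install dependencies (this assumes a virtual environment already exists at .venv):
bash
source .venv/bin/activate

pip install doclayout-yolo==0.0.3
Verify installation. The package installs under the name doclayout_yolo, so check that import rather than ultralytics:
bash
python -c "from doclayout_yolo import YOLOv10; print('doclayout_yolo import OK')"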

Dataset Configuration

The YAML file tells the trainer where your data lives and which classes you have. Keep class names consistent with your labeling tool.

yaml
# data.yaml

path: dataset  # dataset root; use an absolute path if the trainer cannot find your images

train: images/train
val: images/val
test: images/test

nc: 6

names:
  0: title
  1: text
  2: table
  3: figure
  4: caption
  5: footer
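
A quick sanity check before training is to confirm that every label line uses a class ID covered by nc and keeps its coordinates in range. The sketch below works on lines of text, so it can be run on any label file you read yourself; the function name `validate_label_lines` is hypothetical.

```python
def validate_label_lines(lines, num_classes):
    """Return a list of (line_number, problem) for malformed YOLO label lines."""
    problems = []
    for n, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # blank lines are harmless
        if len(parts) != 5:
            problems.append((n, f"expected 5 fields, got {len(parts)}"))
            continue
        try:
            cls_id = int(parts[0])
            coords = [float(v) for v in parts[1:]]
        except ValueError:
            problems.append((n, "non-numeric or non-integer value"))
            continue
        if not 0 <= cls_id < num_classes:
            problems.append((n, f"class id {cls_id} outside 0..{num_classes - 1}"))
        if any(not 0.0 <= v <= 1.0 for v in coords):
            problems.append((n, "coordinate outside [0, 1]"))
    return problems


sample = ["0 0.512 0.083 0.720 0.048", "6 0.5 0.5 0.2 0.2", "1 0.5 1.2 0.2 0.2"]
for line_no, problem in validate_label_lines(sample, num_classes=6):
    print(f"line {line_no}: {problem}")
```

With the six-class configuration above, the second sample line is flagged for its class ID and the third for an out-of-range coordinate.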

Choosing the Right Model

Variant	Training Dataset	Recommended Usage
doclayout_yolo_docstructbench_imgsz1024.pt	DocStructBench	General-purpose document layout detection
doclayout_yolo_ft_imgsz1024.pt (D4LA tuned)	D4LA	Complex and scanned document layouts

For most projects, doclayout_yolo_docstructbench_imgsz1024.pt is a good starting point. If your documents contain scanned pages, noisy layouts, or domain-specific structures, fine-tuning on your own dataset typically provides larger gains than switching between pretrained checkpoints.

Step 3 — Training the Model

Minimal Training Script

The doclayout_yolo package follows the Ultralytics training API, so you can start fine-tuning in a few lines:

python
from doclayout_yolo import YOLOv10

# Load pre-trained weights
model = YOLOv10('doclayout_yolo_docstructbench_imgsz1024.pt')

results = model.train(
    data='dataset/data.yaml',
    imgsz=1024,       # keep at 1024: the checkpoint was trained at this resolution
    epochs=100,       # start here; use early stopping
    batch=4,          # reduce to 2 if you run out of memory on <16GB VRAM
    # Set the optimizer explicitly: with the default optimizer='auto',
    # Ultralytics-style trainers may pick their own learning rate and ignore lr0.
    optimizer='AdamW',
    lr0=0.001,        # initial learning rate (lower for fine-tuning)
    lrf=0.01,         # final LR = lr0 x lrf
    warmup_epochs=3,  # warmup prevents loss explosion at start
    patience=20,      # early stopping: stop if no improvement for 20 epochs
    save_period=10,   # checkpoint every 10 epochs
    project='runs/doc_finetune',
    name='docs_v1',
    device=0,         # GPU index; use 'cpu' for CPU-only (slow)
    workers=4,

    # Augmentation: document-safe settings
    hsv_h=0.01,       # very slight hue jitter
    hsv_s=0.2,
    hsv_v=0.3,
    degrees=0.0,      # no rotation; enable only if rotated pages are expected
    translate=0.05,
    scale=0.3,
    fliplr=0.0,       # no horizontal flip; it reverses reading direction
    mosaic=0.5,       # reduced from the default 1.0
)

Note: Horizontal flipping is generally not recommended for document datasets because it reverses reading direction and changes document semantics. Rotation augmentation should only be used when rotated pages or diagrams are expected in production. When adapting YOLO configurations from natural-image tasks, carefully select document-specific augmentations.
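
The relationship between lr0 and lrf in the training script can be sketched numerically. This assumes the linear decay schedule that Ultralytics-style trainers use when cosine scheduling is off, and it ignores the warmup epochs; the function name `linear_lr` is made up for this example.

```python
def linear_lr(epoch, epochs=100, lr0=0.001, lrf=0.01):
    """Learning rate at a given epoch under linear decay from lr0 towards lr0 * lrf."""
    factor = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * factor


for epoch in (0, 50, 100):
    print(f"epoch {epoch:>3}: lr = {linear_lr(epoch):.6f}")
# epoch   0: lr = 0.001000
# epoch  50: lr = 0.000505
# epoch 100: lr = 0.000010
```

With lr0=0.001 and lrf=0.01, the learning rate falls by a factor of 100 over the run, which is why early stopping rarely triggers on noise late in training.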

GPU Memory Reference

GPU VRAM	Recommended Batch Size	imgsz	Notes
8 GB (e.g., RTX 3070)	2	1024	Keep mixed precision on (amp=True)
16 GB (e.g., T4)	4–6	1024	Stable; good starting point
24 GB+ (e.g., A10G, A100)	8–16	1024	Use larger batches for faster convergence
CPU only	1	640	Very slow; only for verifying that the pipeline runs

What Training Output Looks Like

Epoch   GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
1/100   5.91G    2.847     3.421     1.203     87         1024
5/100   5.91G    1.983     2.104     1.051     91         1024
10/100  5.91G    1.612     1.580     0.974     89         1024
20/100  5.91G    1.341     1.203     0.891     92         1024
50/100  5.92G    1.089     0.901     0.811     88         1024
100/100 5.92G    0.934     0.762     0.774     90         1024

Results saved to runs/doc_finetune/docs_v1/

Watch cls_loss (classification loss) and box_loss (localization loss) together. If cls_loss plateaus but box_loss still drops, the model is finding boxes correctly but confusing class identities, which is often a sign of overlapping class definitions or insufficient per-class data.
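
The "plateau" check described above can be automated with a simple rule of thumb. The sketch below flags a loss curve as flat when its last few recorded values improve by less than 1 percent; the function name, window, and tolerance are all assumptions chosen for illustration.

```python
def has_plateaued(values, window=3, min_rel_improvement=0.01):
    """True if the loss improved by less than the given fraction over the last `window` steps."""
    if len(values) <= window:
        return False  # not enough history to judge
    before, after = values[-window - 1], values[-1]
    if before <= 0:
        return True
    return (before - after) / before < min_rel_improvement


# Loss values copied from the sample log above (epochs 1, 5, 10, 20, 50, 100)
cls_loss = [3.421, 2.104, 1.580, 1.203, 0.901, 0.762]
box_loss = [2.847, 1.983, 1.612, 1.341, 1.089, 0.934]
print(has_plateaued(cls_loss), has_plateaued(box_loss))  # False False

# A curve that has stopped improving
print(has_plateaued([1.0, 0.9, 0.9, 0.9, 0.899]))  # True
```

In the sample log both losses are still falling, so neither is flagged. A run where cls_loss is flagged while box_loss is not would match the class-confusion pattern described above.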

Step 4 — Evaluating Performance

Load the best checkpoint:

python
from doclayout_yolo import YOLOv10

# Load the best checkpoint (saved automatically)
model = YOLOv10("runs/doc_finetune/docs_v1/weights/best.pt")

metrics = model.val(
    data="dataset/data.yaml",
    imgsz=1024,
    split="val",
    plots=True,   # generates confusion matrix, PR curve
)

print(f"mAP@50:      {metrics.box.map50:.4f}")
print(f"mAP@50-95:   {metrics.box.map:.4f}")
print(f"Precision:   {metrics.box.mp:.4f}")
print(f"Recall:      {metrics.box.mr:.4f}")

# Per-class breakdown
for cls_name, ap in zip(metrics.names.values(), metrics.box.maps):
    print(f"  {cls_name:<15} AP@50-95: {ap:.4f}")

Note: Never report performance based solely on training metrics. Evaluate on unseen validation and test documents to measure generalization.

Key metrics

Metric	Meaning
Precision	Share of predicted regions that are correct
Recall	Share of ground-truth regions that the model finds
mAP@50	Mean Average Precision at an IoU threshold of 0.5
mAP@50-95	Mean Average Precision averaged over IoU thresholds from 0.5 to 0.95
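
The difference between the two mAP rows is easiest to see with numbers. The sketch below averages per-threshold AP values for a single class; the AP values are made up for illustration and do not come from any real run.

```python
# IoU thresholds used by mAP@50-95: 0.50, 0.55, ..., 0.95
iou_thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Made-up AP values for one class at each threshold (illustration only)
ap_at_threshold = [0.90, 0.88, 0.85, 0.81, 0.76, 0.69, 0.60, 0.48, 0.32, 0.12]

ap50 = ap_at_threshold[0]
ap50_95 = sum(ap_at_threshold) / len(ap_at_threshold)
print(f"AP@50={ap50:.3f}  AP@50-95={ap50_95:.3f}")  # AP@50=0.900  AP@50-95=0.641
```

The reported mAP figures are these per-class values averaged over all classes. Because the stricter thresholds reward tight boxes, mAP@50-95 is always at or below mAP@50.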

Step 5 — Running Inference

python
from doclayout_yolo import YOLOv10

model = YOLOv10("runs/doc_finetune/docs_v1/weights/best.pt")

results = model.predict(
    source="document.jpg",
    imgsz=1024,
    conf=0.3,     # confidence threshold — lower for recall-focused use cases
    iou=0.45,     # IoU threshold for NMS (Non-Maximum Suppression)
    save=True,    # saves an annotated image under the run's output directory
)

# Access structured results
for r in results:
    print("Detected regions:")
    for box in r.boxes:
        cls_id = int(box.cls)
        cls_name = model.names[cls_id]
        conf = float(box.conf)
        x1, y1, x2, y2 = [int(v) for v in box.xyxy[0]]
        print(f"  {cls_name:15s} conf={conf:.2f}  box=[{x1},{y1},{x2},{y2}]")

Example output (illustrative; your values will differ):
Detected regions:
  title           conf=0.94  box=[87,42,621,98]
  text            conf=0.91  box=[87,115,621,380]
  table           conf=0.88  box=[87,402,621,720]
  figure          conf=0.83  box=[430,740,590,850]
  footer          conf=0.79  box=[87,870,621,920]

Trade-offs and Common Mistakes

Decision Trade-off Table

Decision	Option A	Option B	Recommendation
Image resolution	640 (faster)	1024 (pretrain default)	Use 1024 — changing res requires more epochs
Freeze backbone?	Freeze (less data needed)	Full fine-tune (more flexible)	Freeze first if < 1800 images; full FT for 2000+
Learning rate	High (0.01+)	Low (0.0001–0.001)	Low LR for fine-tuning — preserves pretrained features
Class definitions	Use DocLayNet’s 11 classes	Define your own classes	Define only what your task needs
Mosaic augmentation	Enabled (mosaic=1.0)	Reduced (mosaic=0.5)	Reduce to 0.5 — mosaic breaks document spatial coherence
Confidence threshold	High (0.5+)	Low (0.2–0.3)	Tune to downstream need post-training
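
Tuning the confidence threshold after training amounts to sweeping candidate values over a labeled hold-out set and reading off precision and recall. The sketch below assumes you have already matched each detection to the ground truth; the function name `sweep_thresholds` and the sample detections are hypothetical.

```python
def sweep_thresholds(detections, num_ground_truth, thresholds):
    """detections: (confidence, is_true_positive) pairs from a labeled hold-out set."""
    rows = []
    for t in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        precision = tp / len(kept) if kept else 1.0  # convention: no predictions, no errors
        recall = tp / num_ground_truth if num_ground_truth else 0.0
        rows.append((t, precision, recall))
    return rows


# Hypothetical matched detections against 5 ground-truth regions
dets = [(0.95, True), (0.90, True), (0.70, True), (0.55, False), (0.40, True), (0.25, False)]
for t, p, r in sweep_thresholds(dets, num_ground_truth=5, thresholds=[0.2, 0.3, 0.5]):
    print(f"conf>={t:.1f}: precision={p:.2f} recall={r:.2f}")
# conf>=0.2: precision=0.67 recall=0.80
# conf>=0.3: precision=0.80 recall=0.80
# conf>=0.5: precision=0.75 recall=0.60
```

In this toy example 0.3 is the better operating point: it removes a false positive without losing any correct detection, while 0.5 starts to cost recall.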

Common Mistakes

Mistake 1: Wrong imgsz during inference

If you train at 1024 and run inference at 640, accuracy usually drops, particularly on small elements such as footnotes and captions. Use the same image size for training, validation, and inference. Ultralytics does not raise an error on this mismatch, so the issue can go unnoticed.

Mistake 2: Reporting final results on the val split used for early stopping

Early stopping (the patience parameter) and the selection of best.pt are both driven by validation metrics, so val metrics are optimistically biased. Always hold out a separate test set (not the val set) for final reporting.

Mistake 3: Overlapping or ambiguous class boundaries

If ‘header’ and ‘title’ look the same to an annotator, your model will learn noisy labels and plateau early. Collapse ambiguous classes or write a clear annotation guide before labeling.

Edge case — multi-column documents

DocLayout-YOLO detects axis-aligned bounding boxes. For multi-column layouts, regions from different columns may spatially interleave. Include at least 30–50 multi-column pages if your target documents use this layout.
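
Because the detector returns boxes without any reading order, downstream code has to restore it for multi-column pages. The sketch below is a simple heuristic, not something DocLayout-YOLO provides: it assumes equal-width columns and assigns each box to a column by its horizontal center. The function name `reading_order` is hypothetical.

```python
def reading_order(boxes, page_width, num_columns=2):
    """Sort (x1, y1, x2, y2) boxes column by column, top to bottom within each column."""
    column_width = page_width / num_columns

    def sort_key(box):
        x_center = (box[0] + box[2]) / 2
        column = min(int(x_center // column_width), num_columns - 1)
        return (column, box[1])

    return sorted(boxes, key=sort_key)


# Four text blocks on a 1000 px wide two-column page, in arbitrary detection order
boxes = [(520, 100, 960, 400), (40, 500, 480, 900), (40, 100, 480, 450), (520, 450, 960, 900)]
for box in reading_order(boxes, page_width=1000):
    print(box)
```

Sorting purely by the top edge would interleave the two columns here. Full-width elements such as page titles span both columns and need separate handling, which this sketch does not attempt.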

Pro tip — freeze backbone for tiny datasets

If you have fewer than 100 annotated images, freeze the early backbone layers to reduce overfitting. Add freeze=10 to your model.train() call to freeze the first 10 layers.
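
One way to keep this rule out of hand-edited scripts is to derive the extra training arguments from the dataset size. The helper below is a hypothetical sketch; its name and defaults simply restate the 100-image and 10-layer figures from the tip above.

```python
def train_overrides(num_images, tiny_dataset_limit=100, freeze_layers=10):
    """Extra keyword arguments for model.train(), chosen from the dataset size."""
    overrides = {}
    if num_images < tiny_dataset_limit:
        overrides["freeze"] = freeze_layers  # freeze the first N layers
    return overrides


print(train_overrides(80))    # {'freeze': 10}
print(train_overrides(1500))  # {}
```

The result can be unpacked into the training call with `**train_overrides(num_images)` alongside the other arguments.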

Key Takeaways

If you remember only a few things from this guide, remember these:

DocLayout-YOLO is an excellent starting point for document layout analysis.
Annotation quality has a larger impact than hyperparameter tuning.
Fine-tuning from pretrained weights beats training from scratch in almost every practical scenario. You need far less data (50–500 images) and fewer epochs than you think.
Augmentation configuration matters more in document detection than in natural-image detection. Disable flipping and large rotations, and keep mosaic moderate (0.5).
Data quality trumps data quantity. One hundred accurately labeled documents will outperform a thousand noisily labeled ones. Write a clear annotation guide before labeling.
Always maintain a clean test set separate from your validation set for honest final evaluation. Val metrics during training are optimistically biased by early stopping.
The confidence threshold (conf) is a deployment parameter, not a training parameter. Tune it post-training based on your precision/recall requirements.
Document-specific augmentation strategies differ significantly from those used in natural-image detection tasks.
Start small, iterate quickly, and continuously improve your dataset.

Conclusion

Fine-tuning DocLayout-YOLO is less about finding perfect hyperparameters and more about building a high-quality dataset. Better annotations, clear class definitions, and systematic analysis of failed predictions often drive the most significant improvements in document AI performance.

For organizations working with HVAC technical manuals, engineering drawings, flowcharts, and diagrams, a well-trained DocLayout-YOLO model can significantly improve layout detection, OCR accuracy, content retrieval, and information extraction.

At ACL Digital, we are exploring AI-driven solutions to improve how users interact with complex technical documentation. By leveraging DocLayout-YOLO, teams can build intelligent systems that quickly locate relevant figures, tables, and technical information, reducing manual effort and enhancing the overall user experience.

References

DocLayout-YOLO-DocStructBench on Hugging Face (https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench): pretrained weights
DocLayout-YOLO Official Repository: source code, release notes
DocLayout-YOLO Paper (arXiv:2410.12628): original research paper describing the DocSynth-300K synthetic dataset and the Global-to-Local Controllable Receptive Module
DocLayNet Dataset: public layout benchmark on which DocLayout-YOLO is evaluated
Ultralytics Training Documentation: full list of training hyperparameters and their defaults
Ultralytics Export Documentation: ONNX, TensorRT, CoreML export options
Label Studio Export Guide: how to export annotations in YOLO format
Roboflow YOLOv10 Export Guide: using Roboflow for labeling and dataset management
