Introduction to Multi-Modal AI
Multi-modal AI represents a fundamental shift in how machines process documents. Unlike traditional OCR or text-based extraction, multi-modal models can simultaneously process visual and textual information, much like humans do when reading documents.
What Makes It "Multi-Modal"?
Multi-modal refers to the model's ability to process multiple types of input simultaneously:
- Visual modality: Images, layouts, formatting, spatial relationships
- Textual modality: Words, sentences, semantic meaning
- Structural modality: Tables, hierarchies, document organization
Evolution of Document Processing
Generation 1: Traditional OCR (1980s-2010s)
Pattern matching and character recognition
- Accuracy: 60-80% on complex documents
- Struggles with: Tables, merged cells, complex layouts
- Cannot understand context or meaning
Generation 2: Deep Learning OCR (2010s-2020)
Neural networks for text recognition
- Accuracy: 85-95% on clean documents
- Better at: Handwriting, varied fonts, rotated text
- Still limited: Structure understanding, semantic context
Generation 3: Multi-Modal AI (2023+)
Vision-language models with contextual understanding
- Accuracy: 95-99% on complex regulatory documents
- Excels at: Tables, charts, figures, nested structures
- Understands: Context, meaning, relationships, intent
The Document Intelligence Challenge
Regulatory documents present unique challenges that traditional extraction methods cannot handle effectively:
Why Regulatory PDFs Are Difficult
Complex Table Structures
- Multi-level headers (3-4 levels deep)
- Merged cells spanning rows and columns
- Nested tables within cells
- Rotated text in headers
- Variable cell borders and shading
Embedded Visualizations
- Charts as raster images (no extractable data)
- Multiple plots in a single figure
- Legends separate from the chart area
- Axis labels at various angles
- Overlapping data series
Scientific Notation
- Superscripts and subscripts
- Mathematical symbols (±, ≤, ≥)
- Special characters (α, β, μ)
- Chemical formulas
- Statistical annotations
Layout Complexity
- Multi-column layouts
- Text wrapping around figures
- Footnotes and endnotes
- Headers and footers with metadata
- Mixed landscape and portrait pages
Real-World Example: Stability Study Table
A typical ICH stability study table contains:
- 1. Hierarchical Headers: Time Point (months) → Storage Condition → Test Parameter → Specification → Results
- 2. Merged Cells: "0, 3, 6, 9, 12, 18, 24, 36 months" spans the entire row; "25°C/60% RH" spans multiple columns
- 3. Scientific Notation: "≥95.0%" for assay, "≤2.0%" for degradation products, "pH 6.8 ± 0.3"
- 4. Footnotes: Asterisks (*) linking to study conditions, batch numbers, and analytical methods below the table
Traditional OCR Result: Extracts text as unstructured lines, loses all table structure, cannot associate headers with data cells, misses merged-cell relationships → unusable output
Multi-Modal AI Result: Understands table structure, preserves all relationships, correctly interprets merged cells, maintains header hierarchy → 98%+ structural fidelity
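For illustration only, a structurally faithful extraction of a single cell from such a table might be represented as follows; the schema, field names, and values are hypothetical, not output from any particular model.

```python
# Illustrative (hypothetical) structured output for one cell of the stability
# table described above; the schema and values are examples, not real data.
extracted_cell = {
    "table_caption": "Table 1: Long-Term Stability Data",
    "header_path": ["12 months", "25°C/60% RH", "Assay"],  # full header hierarchy
    "value": "98.7%",
    "specification": "≥95.0%",
    "footnotes": ["*"],      # marker resolved to footnote text elsewhere
    "confidence": 0.97,
}
```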
Vision-Language Model Architecture
Modern multi-modal AI models use a sophisticated architecture that processes visual and textual information in parallel, then fuses them for comprehensive understanding.
Core Components
Vision Encoder
Processes the document image at pixel level to extract visual features
- Converts the image into patch embeddings (typically 16x16 or 32x32 pixels)
- Applies convolutional neural networks or vision transformers
- Captures spatial relationships, layouts, visual patterns
- Creates high-dimensional feature maps preserving positional information
Text Encoder
Processes extracted text to understand linguistic meaning and context
- Tokenizes text into subword units
- Generates contextual embeddings using transformer architecture
- Understands semantic relationships between words and phrases
- Captures domain-specific terminology and abbreviations
Multi-Modal Fusion Layer
Combines visual and textual representations for unified understanding
- Cross-attention mechanisms link visual patches to text tokens
- Learns which visual features correspond to which text elements
- Aligns spatial positions with semantic meaning
- Creates joint embeddings that capture both modalities
Decoder / Generation Layer
Produces structured output based on fused multi-modal understanding
- Generates structured representations (JSON, Markdown, HTML)
- Preserves table structure with proper row/column relationships
- Maintains formatting information (bold, italics, alignment)
- Outputs semantic annotations and metadata
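As a rough illustration of how the fusion layer ties the two encoders together, here is a minimal PyTorch-style sketch of cross-attention between text tokens and image patches. It is a simplified toy, not the architecture of any specific commercial model, and the class name, dimensions, and head count are assumptions.

```python
# Minimal illustration of cross-modal fusion with cross-attention (PyTorch).
# This is a simplified sketch, not the architecture of any specific model.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Text tokens attend over image patches (query = text, key/value = vision).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, vision_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens:    (batch, num_tokens,  dim) from the text encoder
        # vision_patches: (batch, num_patches, dim) from the vision encoder
        fused, _ = self.cross_attn(text_tokens, vision_patches, vision_patches)
        return self.norm(text_tokens + fused)  # residual connection

# Example with random embeddings standing in for encoder outputs.
fusion = CrossModalFusion()
joint = fusion(torch.randn(1, 128, 768), torch.randn(1, 1024, 768))
print(joint.shape)  # torch.Size([1, 128, 768])
```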
Training Paradigm
Multi-modal models are trained in multiple stages to develop comprehensive document understanding:
Stage 1: Pre-training on General Vision-Language Tasks
Models learn basic visual-linguistic alignment from millions of image-text pairs (e.g., image captions, visual question answering). This builds foundational understanding of how visual elements relate to textual descriptions.
Stage 2: Document-Specific Fine-tuning
Continued training on document images paired with ground-truth structured outputs. Models learn to recognize tables, charts, figures, and extract them with high fidelity. Includes diverse document types: scientific papers, financial reports, forms, invoices.
Stage 3: Domain Adaptation (Optional)
Further refinement on domain-specific documents (e.g., regulatory submissions, medical records). This stage is often done by end users on proprietary data to maximize accuracy for their specific use case.
How Multi-Modal AI "Sees" Documents
Understanding the model's perception process is key to leveraging its capabilities effectively.
Visual Processing Pipeline
- Step 1: Image Ingestion: Document page rendered as a high-resolution image (typically 300 DPI minimum)
- Step 2: Patch Extraction: Image divided into overlapping patches (e.g., 16x16 pixel squares with 50% overlap)
- Step 3: Feature Embedding: Each patch converted to a high-dimensional vector (768-1024 dimensions)
- Step 4: Spatial Encoding: Positional information added so the model knows where each patch is on the page
- Step 5: Attention Mechanism: Model learns which patches are related (e.g., header cells in the same table)
- Step 6: Structural Recognition: Patterns identified: table grids, chart axes, figure boundaries, text blocks
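A toy sketch of steps 1-4 is shown below, using pdf2image for rendering and NumPy for patch extraction. The non-overlapping 16x16 patches, the file name, and the grayscale simplification are illustrative assumptions; production models perform this inside the vision encoder.

```python
# Toy sketch of steps 1-4: render a page, cut it into patches, and record
# each patch's position. pdf2image requires poppler to be installed.
import numpy as np
from pdf2image import convert_from_path

page = convert_from_path("document.pdf", dpi=300)[0]    # step 1: high-res render
img = np.asarray(page.convert("L"), dtype=np.float32)   # grayscale pixel array

P = 16                                                  # illustrative patch size
h, w = (img.shape[0] // P) * P, (img.shape[1] // P) * P
patches = (
    img[:h, :w]
    .reshape(h // P, P, w // P, P)
    .transpose(0, 2, 1, 3)
    .reshape(-1, P * P)                                 # step 2: flattened patches
)

# Step 3 would project each patch to a 768-1024 dim embedding via a learned
# linear layer; step 4 attaches its (row, col) position on the page.
positions = [(r, c) for r in range(h // P) for c in range(w // P)]
print(patches.shape, len(positions))
```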
What the Model "Understands"
Visual Patterns
- Grid lines indicating table boundaries
- Cell borders (solid, dashed, or invisible)
- Shading and background colors
- Font sizes (headers typically larger)
- Text alignment (left, center, right)
- Bold/italic formatting
- Whitespace patterns separating sections
Semantic Patterns
- Table headers vs. data cells (based on content)
- Numeric patterns (percentages, ranges, p-values)
- Units of measurement
- Repeated structures (row/column patterns)
- Hierarchical relationships (nested headers)
- Caption text ("Table 1:", "Figure 2:")
- Footnote markers (*, †, ‡, a, b, c)
Spatial Reasoning Capabilities
One of the most powerful aspects of multi-modal AI is spatial reasoning, the ability to understand geometric relationships between elements:
- Above/Below Relationships: Understands that text directly above a table is likely its caption
- Left/Right Alignment: Recognizes that aligned elements likely belong to the same column
- Containment: Knows that cells are contained within tables, and text within cells
- Spanning: Identifies when a cell spans multiple rows or columns based on border patterns
- Grouping: Clusters related elements even without explicit borders (e.g., footnotes below a table)
Table Extraction Deep Dive
Table extraction is the most demanding task for document intelligence systems. Let's examine how multi-modal AI achieves 95%+ accuracy where traditional methods fail.
The Table Structure Problem
A table is not just text arranged in a grid. It's a complex hierarchical data structure with:
- Logical structure: Rows and columns forming a matrix
- Header hierarchy: Multi-level headers creating tree relationships
- Cell relationships: Data cells linked to all relevant headers
- Merged cells: Single cells spanning multiple logical positions
- Metadata: Captions, footnotes, units, statistical annotations
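One possible in-memory representation of this structure is sketched below; the dataclasses and field names are hypothetical, not a standard schema.

```python
# A minimal, illustrative representation of extracted table structure;
# field names are hypothetical, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class Cell:
    row: int
    col: int
    text: str
    row_span: int = 1            # merged cells span more than one logical position
    col_span: int = 1
    is_header: bool = False
    header_path: list[str] = field(default_factory=list)  # e.g. ["6 months", "Assay"]

@dataclass
class Table:
    caption: str
    cells: list[Cell]
    footnotes: dict[str, str] = field(default_factory=dict)  # marker -> text
```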
Multi-Modal Extraction Process
1. Table Detection
First, the model identifies table regions on the page:
- Recognizes visual patterns: grid lines, cell borders, regular spacing
- Distinguishes tables from other structured elements (lists, forms)
- Determines table boundaries (where the table starts and ends)
- Handles edge cases: borderless tables, tables within text flow
2. Row & Column Identification
Next, the logical grid structure is inferred:
- Detects horizontal separators (borders, whitespace) defining rows
- Detects vertical separators defining columns
- Handles irregular grids (varying row heights, column widths)
- Manages merged cells that span multiple rows/columns
3. Header Classification
Distinguishes headers from data cells:
- Visual cues: Bold text, larger font, shaded background, centered alignment
- Positional cues: Top rows and left columns are more likely to be headers
- Content cues: Generic labels ("Parameter", "Result") vs. specific values
- Builds the header hierarchy for multi-level headers
4. Cell Content Extraction
Reads the text within each cell:
- OCR with context awareness (knows the expected content type)
- Handles special characters, superscripts, subscripts
- Preserves formatting (bold, italics, underline)
- Detects empty cells vs. cells with no visible borders
5. Relationship Mapping
Establishes connections between cells:
- Links each data cell to its row and column headers
- For multi-level headers, creates the full header path for each cell
- Associates footnote markers with the corresponding footnote text
- Connects the table to its caption and any related figures
6. Structure Serialization
Outputs a structured representation:
- JSON: Nested object structure with full metadata
- Markdown: Human-readable table syntax
- HTML: Semantic markup with proper th/td elements
- CSV: Flattened tabular format (with header expansion)
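Building on the hypothetical Table/Cell structure sketched earlier, Markdown serialization might look roughly like this; a single header row is assumed, and merged cells would need expansion first.

```python
# Sketch of serializing the hypothetical Table structure from the earlier
# sketch to Markdown; assumes one header row and no unexpanded merged cells.
def to_markdown(table) -> str:
    # table: the Table dataclass sketched earlier
    n_cols = max(c.col for c in table.cells) + 1
    n_rows = max(c.row for c in table.cells) + 1
    grid = {(c.row, c.col): c.text for c in table.cells}

    lines = []
    for r in range(n_rows):
        cells = [grid.get((r, col), "") for col in range(n_cols)]
        lines.append("| " + " | ".join(cells) + " |")
        if r == 0:                      # separator after the header row
            lines.append("|" + "---|" * n_cols)
    return "\n".join(lines)
```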
Handling Edge Cases
Common challenges include borderless tables, cells merged across rows and columns, rotated header text, and tables that continue across page breaks; the detection and relationship-mapping steps above are what allow multi-modal models to handle these cases.
Chart & Figure Analysis
Unlike tables where text is embedded in the PDF, charts and figures are often raster images. Multi-modal AI can still extract meaningful information through visual analysis.
What Can Be Extracted from Charts
Metadata Extraction
- ✓ Chart title and subtitle
- ✓ Axis labels and units
- ✓ Legend items and colors
- ✓ Data series names
- ✓ Caption text
- ✓ Source attribution
- ✓ Statistical annotations (p-values, error bars)
Visual Understanding
- ✓ Chart type (line, bar, scatter, box plot, etc.)
- ✓ Number of data series
- ✓ Approximate data ranges
- ✓ Trends and patterns
- ✓ Peak/valley identification
- ✓ Comparative relationships
- ✓ Outlier detection
Note: While multi-modal AI can understand chart structure and read text labels, precise data point extraction from raster images is limited. For exact values, source data files or digitization tools are needed.
Figure Caption Association
A critical capability is linking figures to their captions and related text:
- 1. Caption Detection: Identifies text beginning with "Figure X:", "Fig. X", or similar patterns
- 2. Proximity Analysis: Associates the caption with the nearest image based on spatial distance and layout
- 3. Numbering Validation: Ensures figure numbers are sequential and match between captions and in-text references
- 4. Related Text Extraction: Finds paragraphs referring to the figure (e.g., "As shown in Figure 3...")
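Steps 1 and 2 can also be approximated outside the model; the sketch below uses a simple regular expression and a nearest-box heuristic, with hypothetical inputs (text blocks and image bounding boxes in page coordinates).

```python
# Sketch of caption detection and proximity-based association (steps 1-2).
# Bounding boxes are (x0, y0, x1, y1) in page coordinates; inputs are hypothetical.
import re

CAPTION_RE = re.compile(r"^(Figure|Fig\.)\s*(\d+)\s*[:.]", re.IGNORECASE)

def associate_captions(text_blocks, images):
    """text_blocks: list of (text, bbox); images: list of bbox."""
    links = []
    if not images:
        return links
    for text, (tx0, ty0, tx1, ty1) in text_blocks:
        m = CAPTION_RE.match(text.strip())
        if not m:
            continue
        # Pick the image whose bottom-left corner is closest to the caption's top-left.
        nearest = min(images, key=lambda b: abs(ty0 - b[3]) + abs(tx0 - b[0]))
        links.append({"figure_number": int(m.group(2)),
                      "caption": text,
                      "image_bbox": nearest})
    return links
```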
Chart Type Recognition
Multi-modal AI can classify charts into standard types, enabling appropriate handling:
Line Charts
Trends over time, pharmacokinetic profiles
Bar Charts
Categorical comparisons, efficacy endpoints
Scatter Plots
Correlations, dose-response relationships
Box Plots
Distribution summaries, statistical comparisons
Kaplan-Meier
Survival analysis, time-to-event data
Forest Plots
Meta-analysis, subgroup analyses
Model Comparison & Selection
Several multi-modal AI models are available for document intelligence. Each has strengths and trade-offs.
Leading Models (November 2025)
Claude
Anthropic
Strengths
- Excellent table structure preservation (98%+ accuracy)
- Superior handling of merged cells and nested headers
- Strong understanding of scientific notation
- Reliable footnote association
- Large context window (200K tokens)
Considerations
- Higher cost per API call vs. competitors
- Slightly slower inference (2-4 seconds per page)
- Rate limits on concurrent requests
Gemini
Google
Strengths
- Fast inference (1-2 seconds per page)
- Lower cost per request
- Excellent chart type recognition
- Strong visual reasoning for layouts
- Native integration with Google Cloud
Considerations
- Occasional issues with deeply nested tables
- Less consistent with footnote markers
- May struggle with borderless tables
OpenAI GPT
OpenAI
Strengths
- Broad availability and ecosystem support
- Good general-purpose vision capabilities
- Flexible output formatting
- Strong reasoning about document context
Considerations
- Best suited for standard document formats
- Complex nested tables may require validation
- Moderate speed and cost
Selection Criteria
Choose your model based on factors such as extraction accuracy on complex tables, inference speed, cost per page, context window size, rate limits, and how well it integrates with your existing infrastructure.
Quality Assurance & Validation
Even with 95%+ accuracy, validation is essential for regulatory submissions. Here's how to ensure quality.
Multi-Layer Validation Strategy
Layer 1: Structural Validation
- Verify the table has the expected number of rows and columns
- Check that all cells are populated (or intentionally empty)
- Validate that the header hierarchy is complete
- Ensure merged cells are properly represented
Layer 2: Content Validation
- Check data types match expectations (numeric cells contain valid numbers)
- Validate that units are preserved
- Verify special characters (≤, ±, etc.) are correct
- Confirm footnote markers match footnote text
Layer 3: Semantic Validation
- Cross-check extracted data against expected ranges
- Validate statistical consistency (e.g., mean within min-max range)
- Check temporal ordering (month 0 before month 6)
- Verify relationships (total equals sum of components)
Layer 4: Visual Comparison
- Side-by-side comparison of the original PDF and the extracted table
- Highlight differences for human review
- Flag low-confidence extractions for verification
- Generate a quality score based on validation results
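A minimal sketch of layer 1 and layer 2 checks, written against the hypothetical Table/Cell structure sketched earlier; the expected column count and the footnote markers checked are illustrative assumptions.

```python
# Sketch of layer 1 (structural) and layer 2 (content) checks on the
# hypothetical Table structure from the earlier sketch.
def validate(table, expected_cols: int) -> list[str]:
    issues = []
    # Structural: expected column count and no data cell without a header path.
    if max(c.col for c in table.cells) + 1 != expected_cols:
        issues.append("unexpected column count")
    if any(not c.is_header and not c.header_path for c in table.cells):
        issues.append("data cell with no associated headers")
    # Content: every footnote marker used in a cell must have matching footnote text.
    markers_used = {m for c in table.cells for m in ("*", "†", "‡") if m in c.text}
    for m in markers_used - set(table.footnotes):
        issues.append(f"footnote marker {m!r} has no matching footnote text")
    return issues
```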
Confidence Scoring
Assign confidence scores to extracted elements based on factors such as the model's self-reported certainty, the number of validation checks passed, and the quality of the source image.
Best Practice: Flag all low-confidence extractions for mandatory human review before inclusion in regulatory submissions.
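One way such a score might be assembled and used to trigger review is sketched below; the factors, weights, and 0.85 threshold are arbitrary assumptions, not recommended values.

```python
# Sketch of aggregating hypothetical confidence factors and flagging
# extractions for human review; weights and threshold are arbitrary.
def confidence_score(model_confidence: float, validation_issues: int, image_dpi: int) -> float:
    score = model_confidence
    score -= 0.05 * validation_issues      # penalize failed validation checks
    if image_dpi < 300:
        score -= 0.10                      # penalize low-resolution sources
    return max(0.0, min(1.0, score))

needs_review = confidence_score(0.92, validation_issues=2, image_dpi=200) < 0.85
print(needs_review)  # True -> route to a human reviewer
```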
Performance Optimization
Optimizing multi-modal AI pipelines involves balancing accuracy, speed, and cost.
Image Preprocessing
Proper image preparation significantly improves extraction quality:
- Resolution: Render PDFs at 300 DPI minimum for clear text and borders
- Contrast Enhancement: Adjust contrast for faded tables or low-quality scans
- Deskewing: Correct skewed pages to improve alignment detection
- Noise Reduction: Remove artifacts from scanning or compression
- Format Standardization: Convert to PNG or JPEG (avoid TIFF for API calls)
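A minimal preprocessing sketch using pdf2image and Pillow is shown below; deskewing and noise reduction are omitted and typically require additional tooling such as OpenCV, and the file names are placeholders.

```python
# Sketch of basic preprocessing: render at 300 DPI, boost contrast, save as PNG.
# pdf2image requires poppler; Pillow provides the image operations.
from pdf2image import convert_from_path
from PIL import ImageEnhance, ImageOps

pages = convert_from_path("submission.pdf", dpi=300)    # 300 DPI minimum
for i, page in enumerate(pages):
    img = ImageOps.grayscale(page)
    img = ImageEnhance.Contrast(img).enhance(1.5)       # help faded tables
    img.save(f"page_{i:03d}.png")                       # PNG for API upload
```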
Prompt Engineering
Well-crafted prompts dramatically improve extraction accuracy and consistency:
Effective Prompt Structure
- Specify the exact output format (JSON schema, Markdown table syntax)
- Define handling rules for edge cases (empty cells, merged cells)
- Request metadata (caption, footnotes, units)
- Ask for confidence scores on ambiguous elements
- Provide examples of the desired output format
Domain-Specific Instructions
- Include regulatory context (ICH stability study, clinical endpoint table)
- Specify expected headers or column structure when known
- Define abbreviations and terminology specific to your domain
- Request preservation of scientific notation and special characters
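An extraction prompt reflecting these guidelines might look like the following; the wording, JSON schema, and regulatory framing are illustrative, not a required template.

```python
# An example extraction prompt; the schema and rules below are illustrative.
PROMPT = """You are extracting a table from an ICH stability study page.
Return a single JSON object with keys: "caption", "headers" (one header path
per column), "rows" (list of lists of cell strings), and "footnotes"
(marker -> text).
Rules:
- Preserve scientific notation and special characters (≤, ≥, ±, °C) exactly.
- Represent merged cells by repeating the value in every spanned position.
- Use "" for intentionally empty cells; never invent values.
- Add a "confidence" value between 0 and 1 for any cell you are unsure about.
"""
```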
Batch Processing & Caching
Batch Optimization
- Group similar documents to leverage the model's learned patterns
- Process pages in parallel with rate limit management
- Prioritize critical tables for immediate processing
- Queue low-priority extractions for off-peak hours
Intelligent Caching
- Hash document pages to detect duplicates
- Cache extraction results keyed by image hash
- Reuse extractions for recurring table templates
- Invalidate the cache only when the source document changes
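A bare-bones sketch of hash-keyed caching is shown below; the in-memory dict and the extract_table callable are stand-ins for a real cache store and model call.

```python
# Sketch of hash-based caching of extraction results; extract_table() is a
# placeholder for the actual model call, and the dict stands in for a real store.
import hashlib

cache: dict[str, dict] = {}

def cached_extract(image_bytes: bytes, extract_table) -> dict:
    key = hashlib.sha256(image_bytes).hexdigest()   # duplicate pages share a key
    if key not in cache:
        cache[key] = extract_table(image_bytes)     # only call the API on a miss
    return cache[key]
```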
Cost Management
Strategies to minimize API costs while maintaining quality:
- Use cheaper models for simple tables; reserve premium models for complex ones
- Implement selective extraction (only extract tables/charts, skip plain text)
- Compress images to the minimum acceptable quality (balance quality vs. size)
- Monitor extraction quality and adjust model selection dynamically
- Set up spending alerts and quotas to prevent cost overruns
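A simple version of the first strategy, routing pages by a rough complexity estimate, might look like this; the model names and thresholds are placeholders.

```python
# Sketch of routing pages to a cheaper or premium model based on a rough
# complexity estimate; names and thresholds are placeholders, not recommendations.
def pick_model(n_cells: int, header_levels: int, has_merged_cells: bool) -> str:
    complex_table = n_cells > 100 or header_levels >= 3 or has_merged_cells
    return "premium-vision-model" if complex_table else "budget-vision-model"

print(pick_model(n_cells=40, header_levels=1, has_merged_cells=False))  # budget-vision-model
```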
Best Practices & Limitations
Implementation Best Practices
Start with High-Quality Sources
Digital PDFs (not scans) produce best results. If working with scans, ensure 300+ DPI resolution and clean contrast. Avoid faxed or photocopied documents when possible.
Establish a Ground Truth Dataset
Manually verify 50-100 representative tables to establish accuracy baseline. Use these as test cases when evaluating models or tuning prompts.
Implement Human-in-the-Loop Review
Never fully automate critical regulatory submissions. Flag low-confidence extractions for human review. Use AI to accelerate, not replace, expert oversight.
Version Control Extractions
Track which model version and prompt were used for each extraction. This enables traceability and rollback if issues are discovered later.
Monitor and Iterate
Continuously measure accuracy on new documents. As models improve or document types evolve, update prompts and validation rules accordingly.
Known Limitations
Heavily Degraded Documents
Low-resolution scans (<150 DPI), faded text, or significant noise can reduce accuracy to 70-80%. Pre-processing helps but cannot fully compensate for poor source quality.
Extremely Complex Nested Tables
Tables with 5+ levels of header hierarchy or deeply nested sub-tables may have 5-10% error rate. Consider simplifying table structure or manual verification.
Handwritten Annotations
Handwritten notes on printed tables are inconsistently recognized. For critical handwritten content, manual entry or specialized handwriting recognition may be needed.
Precise Data Point Extraction from Charts
While chart metadata is accurately extracted, precise numerical values from raster chart images have limited accuracy (±5-10%). Use source data files when exact values are required.
Non-Latin Scripts
Most models are optimized for Latin alphabets. Documents in Chinese, Japanese, Arabic, or other scripts may require specialized models or additional validation.
Future Directions
The field of multi-modal AI is rapidly evolving:
- Improved Chart Data Extraction: Next-generation models will extract precise data points from chart images with 95%+ accuracy
- Multi-Page Table Handling: Better detection and stitching of tables that span multiple pages
- Interactive Extraction: Models that can ask clarifying questions when structure is ambiguous
- Domain Fine-Tuning: Specialized models trained specifically on regulatory documents for even higher accuracy
- Real-Time Processing: Sub-second extraction times enabling interactive document analysis
Ready to Implement Multi-Modal AI?
See how DossiAIr uses multi-modal AI to extract regulatory tables with 95%+ accuracy