The extraction JSON schema defines what data to extract from documents using AI. It uses a group-first approach where extractions are organized into logical groups that can be processed in parallel based on their dependencies.

Schema Structure

The schema consists of three main sections:
{
  "config": {
    // Global configuration
  },
  "groups": {
    // All extraction groups
  },
  "definitions": {
    // Reusable schema definitions
  }
}
Documents are automatically processed using VLM (Vision Language Model) parsing with per-page chunking for optimal extraction quality across all document types.

Global Configuration

The config section defines default settings that apply to all groups unless overridden.
{
  "config": {
    "system_message": "Default extraction behavior",
    "reasoning_enabled": false,
    "extraction_title_prompt": "Generate a concise title summarizing the main subject matter (3-5 words)"
  }
}

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| system_message | string | null | Custom system instructions for the AI |
| reasoning_enabled | boolean | false | Enable reasoning mode for all extractions |
| extraction_title_prompt | string | null | Custom prompt for generating extraction titles |
reasoning_enabled is template-level only and cannot be overridden at the group level. It applies to all groups uniformly.

Groups

Groups are the primary organizing principle for extractions. Each group contains its own fields and can override template-level configuration.
All fields within a group are extracted together in a single LLM call, sharing the same document chunks. This makes groups ideal for semantically related fields — information that tends to appear together in the same sections of your documents.
{
  "groups": {
    "company_info": {
      "config": {
        "iterates_on": "companies.list"
      },
      "search_query": "company overview, legal name, industry sector",
      "extraction_prompt": "Extract core company details",
      "fields": {
        "name": {
          "type": "string",
          "extraction_prompt": "Extract the full legal company name"
        },
        "sector": {
          "type": "string",
          "extraction_prompt": "Extract the industry classification"
        }
      }
    },
    "financial_metrics": {
      "search_query": "financial metrics, revenue, annual figures",
      "fields": {
        "revenue": {
          "type": "number",
          "extraction_prompt": "Extract the annual revenue figure",
          "references": ["@{company_info.name}"]
        }
      }
    }
  }
}

Group-Level Configuration

Each group can have a config object with these options:
| Option | Type | Description |
| --- | --- | --- |
| system_message | string | Override template system message for this group |
| iterates_on | string | Path to an array field to iterate over (e.g., "companies.list") |
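
To make the override rules concrete, here is a minimal Python sketch of how a group's effective configuration could be derived from the template-level config. The function name `effective_config` is hypothetical and the merge logic is an assumption based only on the rules stated on this page; it is not the service's actual implementation.

```python
from typing import Any, Dict, Optional


def effective_config(
    template_config: Dict[str, Any],
    group_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge template-level defaults with one group's overrides (illustrative only)."""
    group_config = group_config or {}
    merged = dict(template_config)

    # system_message is the template option documented as overridable per group.
    if "system_message" in group_config:
        merged["system_message"] = group_config["system_message"]

    # iterates_on exists only at the group level.
    if "iterates_on" in group_config:
        merged["iterates_on"] = group_config["iterates_on"]

    # reasoning_enabled is template-level only, so any group-level value is ignored.
    merged["reasoning_enabled"] = template_config.get("reasoning_enabled", False)
    return merged
```

Under these assumptions, a group that sets its own `system_message` gets that message, while a group-level `reasoning_enabled` has no effect.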

Group-Level Properties

Properties available directly at the group level (outside config):
| Property | Type | Description |
| --- | --- | --- |
| search_query | string | Search instruction for RAG to find relevant document chunks |
| extraction_prompt | string | Extraction instruction for the AI |

Definitions

The definitions section contains reusable schema components:
{
  "definitions": {
    "monetary_amount": {
      "type": "number",
      "extraction_prompt": "Extract and normalize monetary value to a number"
    },
    "date_field": {
      "type": "string",
      "extraction_prompt": "Extract date in YYYY-MM-DD format"
    }
  }
}
Reference definitions in your fields using $ref:
{
  "fields": {
    "revenue": {
      "$ref": "#/definitions/monetary_amount"
    },
    "founding_date": {
      "$ref": "#/definitions/date_field"
    }
  }
}
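
For readers who want to preview what a field looks like once its $ref is expanded, the following Python sketch resolves a reference against the definitions section. The helper name `resolve_field` is hypothetical, and the rule that keys written next to $ref (such as atomic) override the definition is an assumption, not documented behavior.

```python
from typing import Any, Dict

REF_PREFIX = "#/definitions/"


def resolve_field(field: Dict[str, Any], definitions: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `field` with its $ref replaced by the referenced definition."""
    ref = field.get("$ref")
    if ref is None:
        return dict(field)
    if not ref.startswith(REF_PREFIX):
        raise ValueError(f"Unsupported $ref: {ref}")

    name = ref[len(REF_PREFIX):]
    if name not in definitions:
        raise KeyError(f"Unknown definition: {name}")

    resolved = dict(definitions[name])
    # Assumption: sibling keys of $ref take precedence over the definition.
    resolved.update({k: v for k, v in field.items() if k != "$ref"})
    return resolved
```

Running this over every field before submitting a schema is a cheap way to catch a misspelled definition name.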

Dependencies and Parallel Processing

Dependencies between groups are automatically computed based on:
  • Field references using mentions (@{group.field})
  • Iteration dependencies (iterates_on)
Groups are processed in parallel when their dependencies are satisfied.
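
The dependency computation can be approximated offline to check a schema before running it. The Python sketch below scans each group for @{group.field} mentions and iterates_on paths, then sorts the groups into waves that could run in parallel. The function names and the wave-by-wave scheduling model are assumptions made for illustration; the actual scheduler may start a group as soon as its own dependencies finish.

```python
import json
import re
from typing import Any, Dict, List, Set

# Matches @{group.field}; @{iterator} has no dot and is deliberately not matched.
MENTION = re.compile(r"@\{([A-Za-z_]\w*)\.[\w.]+\}")


def group_dependencies(groups: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Map each group to the set of other groups it depends on."""
    deps: Dict[str, Set[str]] = {}
    for name, group in groups.items():
        # Serialize the group so mentions in any string property are found.
        found = set(MENTION.findall(json.dumps(group)))
        iterates_on = group.get("config", {}).get("iterates_on")
        if iterates_on:
            found.add(iterates_on.split(".", 1)[0])
        found.discard(name)
        # Keep only groups that exist in this schema.
        deps[name] = found & set(groups)
    return deps


def execution_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Order groups into waves; groups in the same wave have no unmet dependencies."""
    remaining = {g: set(d) for g, d in deps.items()}
    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(g for g, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("Circular dependency among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for g in ready:
            del remaining[g]
    return waves


groups = {
    "company_info": {"fields": {"name": {"type": "string"}}},
    "financial_metrics": {"search_query": "revenue, @{company_info.name}", "fields": {}},
    "investment_rounds": {"fields": {"rounds": {"type": "array"}}},
    "round_details": {
        "config": {"iterates_on": "investment_rounds.rounds"},
        "search_query": "investment amount, @{iterator}",
        "fields": {},
    },
}
print(execution_waves(group_dependencies(groups)))
# [['company_info', 'investment_rounds'], ['financial_metrics', 'round_details']]
```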

Field References (Mentions)

Fields can reference values from other groups:
{
  "groups": {
    "calculations": {
      "fields": {
        "roi_percentage": {
          "type": "number",
          "extraction_prompt": "Calculate ROI using @{financials.investment} and @{financials.current_value}"
        }
      }
    }
  }
}
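
As a rough mental model, a mention can be thought of as a placeholder that is filled with the referenced value once the other group has finished. The Python sketch below performs that substitution on a prompt string. How the service actually injects referenced values is not described here, so `render_mentions` and its plain string replacement are assumptions for illustration.

```python
import re
from typing import Any, Dict

# Matches @{group.field} with a single-level field name.
MENTION = re.compile(r"@\{([A-Za-z_]\w*)\.(\w+)\}")


def render_mentions(text: str, results: Dict[str, Dict[str, Any]]) -> str:
    """Replace each @{group.field} in `text` with the value found in `results`."""

    def replace(match: "re.Match[str]") -> str:
        group, field = match.group(1), match.group(2)
        try:
            return str(results[group][field])
        except KeyError:
            raise KeyError(f"No extracted value for @{{{group}.{field}}}") from None

    return MENTION.sub(replace, text)


prompt = "Calculate ROI using @{financials.investment} and @{financials.current_value}"
results = {"financials": {"investment": 100, "current_value": 150}}
print(render_mentions(prompt, results))
# Calculate ROI using 100 and 150
```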

Iteration

Groups can iterate over arrays using iterates_on:
{
  "groups": {
    "investments": {
      "fields": {
        "companies": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    },
    "company_details": {
      "config": {
        "iterates_on": "investments.companies"
      },
      "search_query": "investment amount, @{iterator}",
      "fields": {
        "amount": {
          "type": "number",
          "extraction_prompt": "Extract the investment amount"
        }
      }
    }
  }
}
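
One way to picture iteration: the iterating group runs once per item of the source array, with @{iterator} standing for the current item. The Python sketch below expands a group's search_query accordingly. The function name `expand_iteration` and the substitution by simple string replacement are assumptions; the sample company names are made up.

```python
from typing import Any, Dict, List


def expand_iteration(group: Dict[str, Any], results: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return one search query per item of the array named by iterates_on."""
    path = group["config"]["iterates_on"]
    source_group, field = path.split(".", 1)
    items = results[source_group][field]
    if not isinstance(items, list):
        raise TypeError(f"{path} must point to an array field")

    template = group.get("search_query", "")
    return [template.replace("@{iterator}", str(item)) for item in items]


company_details = {
    "config": {"iterates_on": "investments.companies"},
    "search_query": "investment amount, @{iterator}",
}
# Hypothetical output of the "investments" group.
extracted = {"investments": {"companies": ["Acme", "Globex"]}}
print(expand_iteration(company_details, extracted))
# ['investment amount, Acme', 'investment amount, Globex']
```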

Reasoning Mode

When enabled, reasoning mode enhances extraction quality by wrapping fields with metadata that records reasoning and sources.

Enabling Reasoning Mode

{
  "config": {
    "reasoning_enabled": true
  }
}

Field Structure in Reasoning Mode

Eligible fields are wrapped with metadata:
{
  "field_name": {
    "metadata": {
      "sources": [{ "chunk_id": "...", "text": "...", "comment": "..." }],
      "reasoning": "Explanation of extraction logic..."
    },
    "value": "extracted value"
  }
}
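
Consumers that only need the extracted values can strip the wrappers after the fact. The Python sketch below walks a result and replaces every wrapper with its value. It assumes a wrapper is any object whose keys are exactly "metadata" and "value", so a genuine extracted object with exactly those two keys would be misread; `unwrap` is a hypothetical helper, not part of the product.

```python
from typing import Any


def unwrap(node: Any) -> Any:
    """Recursively replace {"metadata": ..., "value": ...} wrappers with their value."""
    if isinstance(node, dict):
        if set(node) == {"metadata", "value"}:
            return unwrap(node["value"])
        return {key: unwrap(child) for key, child in node.items()}
    if isinstance(node, list):
        return [unwrap(item) for item in node]
    return node


wrapped = {
    "name": {
        "metadata": {"sources": [], "reasoning": "Stated on the cover page"},
        "value": "Acme Corp",
    },
    "tags": [{"metadata": {"sources": [], "reasoning": "..."}, "value": "fintech"}],
}
print(unwrap(wrapped))
# {'name': 'Acme Corp', 'tags': ['fintech']}
```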

Atomic Fields

The atomic flag controls how fields are wrapped when reasoning mode is enabled.

Default Atomicity Rules

  • Simple types (such as string and number): wrapped by default unless atomic: false
  • Objects: not wrapped by default unless atomic: true, but their simple fields are wrapped
  • Arrays: not wrapped by default unless atomic: true, but their simple items are wrapped
  • References ($ref): follow the atomicity rule of their target types, unless overridden with the atomic flag

Examples

{
  "name": {
    "type": "string"
    // No atomic flag, will be wrapped by default
  },
  "description": {
    "type": "string",
    "atomic": false
    // Explicitly not wrapped
  }
}
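
The default atomicity rules can be expressed as a small decision function. The Python sketch below assumes the four rules apply, in order, to simple types, objects, arrays, and $ref references. The set of simple type names and the helper name `is_wrapped` are assumptions as well, and circular references are not handled.

```python
from typing import Any, Dict, Optional

# Assumption: these type names count as "simple"; only string and number appear on this page.
SIMPLE_TYPES = {"string", "number", "integer", "boolean"}


def is_wrapped(field: Dict[str, Any], definitions: Optional[Dict[str, Any]] = None) -> bool:
    """Decide whether the field itself is wrapped with metadata in reasoning mode."""
    definitions = definitions or {}

    # An explicit atomic flag always wins: true wraps the field, false does not.
    if "atomic" in field:
        return bool(field["atomic"])

    # A reference follows the rule of its target definition.
    if "$ref" in field:
        target_name = field["$ref"].rsplit("/", 1)[-1]
        return is_wrapped(definitions[target_name], definitions)

    # Simple types are wrapped by default; objects and arrays are not.
    return field.get("type") in SIMPLE_TYPES
```

For an unwrapped object or array, the same function can be applied to each nested field or item schema to see which inner values carry metadata.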

When to Use Atomicity Flags

Use atomic: true when

  • Treating a complex object or array as a single unit
  • Needing reasoning about the entire structure
  • The field represents a cohesive concept

Use atomic: false when

  • You don’t need reasoning metadata for a specific field
  • Optimizing output size
  • The field value is straightforward

Writing Effective Search Queries

The search_query property is used by RAG to find relevant document chunks. Writing effective queries is critical for extraction quality.
Write short, dense semantic phrases — NOT natural language sentences. Embedding models compute similarity based on meaning, and unnecessary grammar reduces signal-to-noise ratio.

Best Practices

1. Use Concise Noun Phrases

Avoid imperative verbs like “Find”, “Get”, “Extract” and question phrasing.
// ❌ Bad
{ "search_query": "Find the company's legal name and incorporation details" }

// ✅ Good
{ "search_query": "company legal name, incorporation details" }

2. Remove Stopwords

Words like “the”, “in”, “for”, “of” have negligible embedding value.
// ❌ Bad
{ "search_query": "The name of the CEO of the company" }

// ✅ Good
{ "search_query": "CEO name, company leadership" }

3. Add Domain Keywords

Include domain-specific terms for disambiguation.
// ❌ Bad
{ "search_query": "In the context of EU privacy law, what are the obligations?" }

// ✅ Good
{ "search_query": "GDPR, data portability obligations" }

4. Keep Queries Short (3-10 words)

Anything longer becomes noisy. Anything shorter lacks discriminative power.
// ❌ Too short
{ "search_query": "revenue" }

// ✅ Optimal
{ "search_query": "financial performance, revenue, profit margins" }

Quick Reference

| ❌ Avoid | ✅ Use Instead |
| --- | --- |
| "Find the company's legal name" | "company legal name" |
| "What is the total revenue?" | "total revenue, annual revenue" |
| "In the context of GDPR…" | "GDPR, |
| Single word: "revenue" | With context: "annual revenue, YoY growth" |
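
These guidelines are mechanical enough to check automatically before a schema is submitted. The Python sketch below flags queries that break them. The verb and stopword lists contain only the words named on this page, and `lint_search_query` is a hypothetical helper rather than a feature of the product.

```python
import re
from typing import List

IMPERATIVE_VERBS = {"find", "get", "extract"}
STOPWORDS = {"the", "in", "for", "of"}


def lint_search_query(query: str) -> List[str]:
    """Return a list of guideline violations; an empty list means the query passes."""
    words = re.findall(r"[A-Za-z0-9']+", query.lower())
    problems: List[str] = []

    if not 3 <= len(words) <= 10:
        problems.append(f"length is {len(words)} words; aim for 3-10")
    if words and words[0] in IMPERATIVE_VERBS:
        problems.append(f"starts with imperative verb '{words[0]}'")
    if "?" in query:
        problems.append("question phrasing")

    stopwords_found = sorted(set(words) & STOPWORDS)
    if stopwords_found:
        problems.append("stopwords: " + ", ".join(stopwords_found))
    return problems


print(lint_search_query("company legal name, incorporation details"))
# []
print(lint_search_query("revenue"))
# ['length is 1 words; aim for 3-10']
```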

Complex Schema Best Practices

The following practices apply when working with nested objects, arrays, or multiple $ref definitions.

Split Complex Groups

// ❌ Problematic: One group with many complex fields
{
  "groups": {
    "company_financials": {
      "fields": {
        "valuation": { "$ref": "#/definitions/amount" },
        "revenue": { "$ref": "#/definitions/amount" },
        "funding_rounds": { "type": "array", "items": { "$ref": "#/definitions/funding_round" } },
        "key_metrics": { "type": "array", "items": { "$ref": "#/definitions/metric" } }
      }
    }
  }
}

// ✅ Better: Split into focused groups
{
  "groups": {
    "valuation_info": {
      "search_query": "company valuation, valuation date",
      "fields": {
        "valuation": { "$ref": "#/definitions/amount" }
      }
    },
    "funding_history": {
      "search_query": "funding rounds, investments, Series A B C",
      "fields": {
        "rounds": { "type": "array", "items": { "$ref": "#/definitions/funding_round" } }
      }
    }
  }
}

Use atomic: true for Complex Definitions

{
  "fields": {
    "deal_value": {
      "$ref": "#/definitions/amount",
      "atomic": true  // Reasoning applies to the whole amount
    }
  }
}

Complete Example

{
  "config": {
    "system_message": "Extract information with high accuracy",
    "reasoning_enabled": true,
    "extraction_title_prompt": "Create a brief title for this financial document"
  },
  "definitions": {
    "monetary_value": {
      "type": "number",
      "extraction_prompt": "Extract and normalize monetary value"
    },
    "date_field": {
      "type": "string",
      "extraction_prompt": "Extract date in YYYY-MM-DD format"
    }
  },
  "groups": {
    "company_info": {
      "search_query": "company overview, legal name, founding date",
      "fields": {
        "name": {
          "type": "string",
          "extraction_prompt": "Extract full legal name"
        },
        "founding_date": {
          "$ref": "#/definitions/date_field"
        }
      }
    },
    "financial_metrics": {
      "search_query": "financial metrics, revenue, @{company_info.name}",
      "fields": {
        "revenue": {
          "$ref": "#/definitions/monetary_value"
        }
      }
    },
    "investment_rounds": {
      "search_query": "investment rounds, funding history",
      "fields": {
        "rounds": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    },
    "round_details": {
      "config": {
        "iterates_on": "investment_rounds.rounds"
      },
      "search_query": "investment amount, @{iterator}",
      "fields": {
        "amount": {
          "$ref": "#/definitions/monetary_value"
        }
      }
    }
  }
}