Extraction Schema

The extraction JSON schema defines what data the AI extracts from documents. It uses a group-first approach: extractions are organized into logical groups, and groups are processed in parallel once their dependencies are satisfied.

Schema Structure

The schema consists of three main sections:

{
  "config": {
    // Global configuration
  },
  "groups": {
    // All extraction groups
  },
  "definitions": {
    // Reusable schema definitions
  }
}

Documents are automatically processed using VLM (Vision Language Model) parsing with per-page chunking for optimal extraction quality across all document types.

Global Configuration

The config section defines default settings that apply to all groups unless overridden.

{
  "config": {
    "system_message": "Default extraction behavior",
    "reasoning_enabled": false,
    "extraction_title_prompt": "Generate a concise title summarizing the main subject matter (3-5 words)"
  }
}

Configuration Options

Option	Type	Default	Description
`system_message`	string	`null`	Custom system instructions for the AI
`reasoning_enabled`	boolean	`false`	Enable reasoning mode for all extractions
`extraction_title_prompt`	string	`null`	Custom prompt for generating extraction titles

reasoning_enabled is template-level only and cannot be overridden at the group level. It applies to all groups uniformly.

Groups

Groups are the primary organizing principle for extractions. Each group contains its own fields and can override the template-level system_message.

All fields within a group are extracted together in a single LLM call, sharing the same document chunks. This makes groups ideal for semantically related fields — information that tends to appear together in the same sections of your documents.

{
  "groups": {
    "company_info": {
      "config": {
        "iterates_on": "companies.list"
      },
      "search_query": "company overview, legal name, industry sector",
      "extraction_prompt": "Extract core company details",
      "fields": {
        "name": {
          "type": "string",
          "extraction_prompt": "Extract the full legal company name"
        },
        "sector": {
          "type": "string",
          "extraction_prompt": "Extract the industry classification"
        }
      }
    },
    "financial_metrics": {
      "search_query": "financial metrics, revenue, annual figures",
      "fields": {
        "revenue": {
          "type": "number",
          "extraction_prompt": "Extract the annual revenue figure",
          "references": ["@{company_info.name}"]
        }
      }
    }
  }
}

Group-Level Configuration

Each group can have a config object with these options:

Option	Type	Description
`system_message`	string	Override template system message for this group
`iterates_on`	string	Path to an array field to iterate over (e.g., `"companies.list"`)

Group-Level Properties

Properties available directly at the group level (outside config):

Property	Type	Description
`search_query`	string	Search instruction for RAG to find relevant document chunks
`extraction_prompt`	string	Extraction instruction for the AI

Definitions

The definitions section contains reusable schema components:

{
  "definitions": {
    "monetary_amount": {
      "type": "number",
      "extraction_prompt": "Extract and normalize monetary value to a number"
    },
    "date_field": {
      "type": "string",
      "extraction_prompt": "Extract date in YYYY-MM-DD format"
    }
  }
}

Reference definitions in your fields using $ref:

{
  "fields": {
    "revenue": {
      "$ref": "#/definitions/monetary_amount"
    },
    "founding_date": {
      "$ref": "#/definitions/date_field"
    }
  }
}
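
To make the effect of $ref concrete, here is a minimal Python sketch of how a reference might be expanded against definitions. The resolve_ref helper is hypothetical and not part of the product. It also assumes that keys written next to $ref (such as atomic) take precedence over keys in the definition, which this page does not state.

```python
from typing import Any

PREFIX = "#/definitions/"

def resolve_ref(field: dict[str, Any], definitions: dict[str, Any]) -> dict[str, Any]:
    """Expand a '#/definitions/<name>' reference (hypothetical helper)."""
    ref = field.get("$ref")
    if ref is None:
        return dict(field)  # nothing to expand
    if not ref.startswith(PREFIX):
        raise ValueError(f"unsupported reference: {ref}")
    name = ref[len(PREFIX):]
    if name not in definitions:
        raise KeyError(f"unknown definition: {name}")
    resolved = dict(definitions[name])
    # Assumption: sibling keys of $ref override the definition's keys.
    resolved.update({k: v for k, v in field.items() if k != "$ref"})
    return resolved

definitions = {
    "monetary_amount": {
        "type": "number",
        "extraction_prompt": "Extract and normalize monetary value to a number",
    }
}
revenue = resolve_ref({"$ref": "#/definitions/monetary_amount"}, definitions)
deal_value = resolve_ref({"$ref": "#/definitions/monetary_amount", "atomic": True}, definitions)
```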

Dependencies and Parallel Processing

Dependencies between groups are automatically computed based on:

Field references using mentions (@{group.field})
Iteration dependencies (iterates_on)

Groups are processed in parallel when their dependencies are satisfied.
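
The two dependency sources above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's scheduler: group_dependencies and parallel_waves are hypothetical names, mentions are detected with a simple regular expression, and groups are batched into waves, whereas a real scheduler may start each group as soon as its own dependencies finish.

```python
import re
from typing import Any, Iterator

# Matches @{group.field} and captures the group name; @{iterator} has no dot and is skipped.
MENTION = re.compile(r"@\{([A-Za-z_]\w*)\.[^}]+\}")

def collect_strings(node: Any) -> Iterator[str]:
    """Yield every string found anywhere inside nested dicts and lists."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from collect_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_strings(item)

def group_dependencies(groups: dict[str, Any]) -> dict[str, set[str]]:
    deps: dict[str, set[str]] = {}
    for name, group in groups.items():
        found: set[str] = set()
        for text in collect_strings(group):
            found.update(MENTION.findall(text))
        iterates_on = group.get("config", {}).get("iterates_on")
        if iterates_on:
            found.add(iterates_on.split(".")[0])
        found.discard(name)               # a group does not depend on itself
        deps[name] = found & set(groups)  # keep only groups that exist in the schema
    return deps

def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    remaining = {name: set(d) for name, d in deps.items()}
    waves: list[list[str]] = []
    while remaining:
        ready = sorted(n for n, d in remaining.items() if not d)
        if not ready:
            raise ValueError("circular dependency between groups")
        waves.append(ready)
        for name in ready:
            del remaining[name]
        for d in remaining.values():
            d.difference_update(ready)
    return waves

groups = {
    "company_info": {"fields": {"name": {"type": "string"}}},
    "financial_metrics": {
        "search_query": "financial metrics, revenue, @{company_info.name}",
        "fields": {"revenue": {"type": "number"}},
    },
    "investment_rounds": {"fields": {"rounds": {"type": "array"}}},
    "round_details": {
        "config": {"iterates_on": "investment_rounds.rounds"},
        "search_query": "investment amount, @{iterator}",
        "fields": {"amount": {"type": "number"}},
    },
}
deps = group_dependencies(groups)
waves = parallel_waves(deps)
```

In this sketch, company_info and investment_rounds have no dependencies and form the first wave; financial_metrics and round_details follow in the second.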

Field References (Mentions)

Fields can reference values from other groups:

{
  "groups": {
    "calculations": {
      "fields": {
        "roi_percentage": {
          "type": "number",
          "extraction_prompt": "Calculate ROI using @{financials.investment} and @{financials.current_value}"
        }
      }
    }
  }
}

Iteration

Groups can iterate over arrays using iterates_on:

{
  "groups": {
    "investments": {
      "fields": {
        "companies": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    },
    "company_details": {
      "config": {
        "iterates_on": "investments.companies"
      },
      "search_query": "investment amount, @{iterator}",
      "fields": {
        "amount": {
          "type": "number",
          "extraction_prompt": "Extract the investment amount"
        }
      }
    }
  }
}
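
One way to picture iteration is as one search query per array item. The Python sketch below assumes that @{iterator} stands for the current item and is filled in by plain string substitution; this page does not spell out either point, so treat both as assumptions. The expand_iterations helper and the company names are made up for illustration.

```python
def expand_iterations(search_query: str, items: list[str]) -> list[str]:
    """Return one search query per array item (hypothetical helper)."""
    # Assumption: @{iterator} is replaced by the current item's text.
    return [search_query.replace("@{iterator}", item) for item in items]

companies = ["Example Holdings", "Sample Ventures"]  # made-up extracted values
queries = expand_iterations("investment amount, @{iterator}", companies)
```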

Reasoning Mode

When enabled, reasoning mode enhances extraction quality by wrapping fields with metadata that records reasoning and sources.

Enabling Reasoning Mode

{
  "config": {
    "reasoning_enabled": true
  }
}

Field Structure in Reasoning Mode

Eligible fields are wrapped with metadata:

{
  "field_name": {
    "metadata": {
      "sources": [{ "chunk_id": "...", "text": "...", "comment": "..." }],
      "reasoning": "Explanation of extraction logic..."
    },
    "value": "extracted value"
  }
}
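
When consuming reasoning-mode output, it can be convenient to strip the wrappers and keep only the plain values. The following Python sketch shows one way to do that, assuming a wrapper is exactly an object with the two keys metadata and value, as in the structure above. The unwrap helper and the sample result are hypothetical; a user-defined object with those same two keys would be misread as a wrapper.

```python
from typing import Any

def is_wrapped(node: Any) -> bool:
    """True if node looks like a reasoning-mode wrapper."""
    return isinstance(node, dict) and set(node) == {"metadata", "value"}

def unwrap(node: Any) -> Any:
    """Recursively replace each wrapper with its value."""
    if is_wrapped(node):
        return unwrap(node["value"])
    if isinstance(node, dict):
        return {key: unwrap(value) for key, value in node.items()}
    if isinstance(node, list):
        return [unwrap(item) for item in node]
    return node

# Made-up extraction result, for illustration only.
result = {
    "name": {
        "metadata": {"sources": [], "reasoning": "Stated on the cover page"},
        "value": "Example Holdings",
    },
    "tags": [
        {"metadata": {"sources": [], "reasoning": "Listed under sectors"}, "value": "fintech"},
    ],
    "address": {"city": "Paris"},
}
plain = unwrap(result)
```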

Atomic Fields

The atomic flag controls how fields are wrapped when reasoning mode is enabled.

Default Atomicity Rules

Simple fields (string, number, boolean)

Wrapped by default unless atomic: false

Complex objects

Not wrapped by default unless atomic: true, but their simple fields are wrapped

Arrays

Not wrapped by default unless atomic: true, but their simple items are wrapped

References ($ref)

Follow the atomicity rule of their target type, unless overridden with the atomic flag
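
The default rules above can be summarized as a small decision function. This Python sketch is an interpretation of the rules as written, not the product's code; wraps_as_unit is a hypothetical name, and it only answers whether the field itself is wrapped, not how its children are treated.

```python
from typing import Any, Optional

SIMPLE_TYPES = {"string", "number", "boolean"}

def wraps_as_unit(field: dict[str, Any], definitions: Optional[dict[str, Any]] = None) -> bool:
    """Decide whether a field itself is wrapped with reasoning metadata."""
    # An explicit atomic flag always wins.
    if "atomic" in field:
        return bool(field["atomic"])
    # A $ref without a flag follows the rule of its target.
    ref = field.get("$ref")
    if ref is not None:
        target = (definitions or {})[ref.rsplit("/", 1)[-1]]
        return wraps_as_unit(target, definitions)
    # Simple types are wrapped by default; objects and arrays are not.
    return field.get("type") in SIMPLE_TYPES

definitions = {"monetary_amount": {"type": "number"}}
checks = {
    "plain string": wraps_as_unit({"type": "string"}),
    "string, atomic false": wraps_as_unit({"type": "string", "atomic": False}),
    "plain object": wraps_as_unit({"type": "object", "properties": {}}),
    "object, atomic true": wraps_as_unit({"type": "object", "atomic": True}),
    "plain array": wraps_as_unit({"type": "array", "items": {"type": "string"}}),
    "ref to number": wraps_as_unit({"$ref": "#/definitions/monetary_amount"}, definitions),
}
```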

Examples

The three examples below show, in order, simple fields, objects, and arrays.

{
  "name": {
    "type": "string"
    // No atomic flag, will be wrapped by default
  },
  "description": {
    "type": "string",
    "atomic": false
    // Explicitly not wrapped
  }
}

{
  "address": {
    "type": "object",
    // Object itself NOT wrapped
    "properties": {
      "street": { "type": "string" },  // WILL be wrapped
      "city": { "type": "string", "atomic": false }  // NOT wrapped
    }
  },
  "person": {
    "type": "object",
    "atomic": true,  // Object WILL be wrapped as a whole
    "properties": {
      "name": { "type": "string" },
      "age": { "type": "number" }
    }
  }
}

{
  "tags": {
    "type": "array",
    // Array itself NOT wrapped
    "items": {
      "type": "string"  // Items ARE wrapped
    }
  },
  "investments": {
    "type": "array",
    "atomic": true,  // Array WILL be wrapped as a whole
    "items": {
      "type": "object",
      "properties": {
        "company": { "type": "string" },
        "amount": { "type": "number" }
      }
    }
  }
}

When to Use Atomicity Flags

Use atomic: true when

You want to treat a complex object or array as a single unit
You need reasoning about the entire structure
The field represents a cohesive concept

Use atomic: false when

You don’t need reasoning metadata for a specific field
You want to reduce output size
The field value is straightforward

Writing Effective Search Queries

The search_query property is used by RAG to find relevant document chunks. Writing effective queries is critical for extraction quality.

Write short, dense semantic phrases, NOT natural language sentences. Embedding models compute similarity based on meaning, and unnecessary grammar lowers the signal-to-noise ratio of the query.

Best Practices

Use Concise Noun Phrases

Avoid imperative verbs like “Find”, “Get”, “Extract” and question phrasing.

// ❌ Bad
{ "search_query": "Find the company's legal name and incorporation details" }

// ✅ Good
{ "search_query": "company legal name, incorporation details" }

Remove Stopwords

Words like “the”, “in”, “for”, “of” have negligible embedding value.

// ❌ Bad
{ "search_query": "The name of the CEO of the company" }

// ✅ Good
{ "search_query": "CEO name, company leadership" }

Add Domain Keywords

Include domain-specific terms for disambiguation.

// ❌ Bad
{ "search_query": "In the context of EU privacy law, what are the obligations?" }

// ✅ Good
{ "search_query": "GDPR, data portability obligations" }

Keep Queries Short (3-10 words)

Anything longer becomes noisy. Anything shorter lacks discriminative power.

// ❌ Too short
{ "search_query": "revenue" }

// ✅ Optimal
{ "search_query": "financial performance, revenue, profit margins" }

Quick Reference

❌ Avoid	✅ Use Instead
"Find the company's legal name"	"company legal name"
"What is the total revenue?"	"total revenue, annual revenue"
"In the context of GDPR…"	"GDPR, "
Single word: "revenue"	With context: "annual revenue, YoY growth"
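
The guidelines in this section lend themselves to a quick automated check. The Python sketch below flags queries that break them; lint_search_query is a hypothetical helper, the word lists contain only the examples named above, and the tokenization is a simplification.

```python
import re

IMPERATIVE_VERBS = {"find", "get", "extract"}
STOPWORDS = {"the", "in", "for", "of"}

def lint_search_query(query: str) -> list[str]:
    """Return a list of rule codes the query violates (empty if none)."""
    words = re.findall(r"[\w']+", query.lower())
    problems = []
    if len(words) < 3:
        problems.append("too_short")
    if len(words) > 10:
        problems.append("too_long")
    if words and words[0] in IMPERATIVE_VERBS:
        problems.append("imperative_verb")
    if "?" in query:
        problems.append("question")
    if STOPWORDS.intersection(words):
        problems.append("stopwords")
    return problems

good = lint_search_query("company legal name, incorporation details")
bad = lint_search_query("Find the company's legal name and incorporation details")
```

A check like this could run before a schema is saved, so that weak queries are caught before they affect retrieval.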

Complex Schema Best Practices

The following practices apply when working with nested objects, arrays, or multiple $ref definitions.

Split Complex Groups

// ❌ Problematic: One group with many complex fields
{
  "groups": {
    "company_financials": {
      "fields": {
        "valuation": { "$ref": "#/definitions/amount" },
        "revenue": { "$ref": "#/definitions/amount" },
        "funding_rounds": { "type": "array", "items": { "$ref": "#/definitions/funding_round" } },
        "key_metrics": { "type": "array", "items": { "$ref": "#/definitions/metric" } }
      }
    }
  }
}

// ✅ Better: Split into focused groups
{
  "groups": {
    "valuation_info": {
      "search_query": "company valuation, valuation date",
      "fields": {
        "valuation": { "$ref": "#/definitions/amount" }
      }
    },
    "funding_history": {
      "search_query": "funding rounds, investments, Series A B C",
      "fields": {
        "rounds": { "type": "array", "items": { "$ref": "#/definitions/funding_round" } }
      }
    }
  }
}

Use atomic: true for Complex Definitions

{
  "fields": {
    "deal_value": {
      "$ref": "#/definitions/amount",
      "atomic": true  // Reasoning applies to the whole amount
    }
  }
}

Complete Example

{
  "config": {
    "system_message": "Extract information with high accuracy",
    "reasoning_enabled": true,
    "extraction_title_prompt": "Create a brief title for this financial document"
  },
  "definitions": {
    "monetary_value": {
      "type": "number",
      "extraction_prompt": "Extract and normalize monetary value"
    },
    "date_field": {
      "type": "string",
      "extraction_prompt": "Extract date in YYYY-MM-DD format"
    }
  },
  "groups": {
    "company_info": {
      "search_query": "company overview, legal name, founding date",
      "fields": {
        "name": {
          "type": "string",
          "extraction_prompt": "Extract full legal name"
        },
        "founding_date": {
          "$ref": "#/definitions/date_field"
        }
      }
    },
    "financial_metrics": {
      "search_query": "financial metrics, revenue, @{company_info.name}",
      "fields": {
        "revenue": {
          "$ref": "#/definitions/monetary_value"
        }
      }
    },
    "investment_rounds": {
      "search_query": "investment rounds, funding history",
      "fields": {
        "rounds": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    },
    "round_details": {
      "config": {
        "iterates_on": "investment_rounds.rounds"
      },
      "search_query": "investment amount, @{iterator}",
      "fields": {
        "amount": {
          "$ref": "#/definitions/monetary_value"
        }
      }
    }
  }
}
