Complete guide to the extraction JSON schema format for AI-powered data extraction
The extraction JSON schema defines what data to extract from documents using AI. It uses a group-first approach where extractions are organized into logical groups that can be processed in parallel based on their dependencies.
{
  "config": {
    // Global configuration
  },
  "groups": {
    // All extraction groups
  },
  "definitions": {
    // Reusable schema definitions
  }
}
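As a quick sanity check on the top-level layout, the sketch below inspects a schema that has already been loaded into a Python dict. The helper name `check_top_level` is hypothetical, and treating `config`, `groups`, and `definitions` as the only top-level keys is an assumption drawn from the outline above, not a documented rule.

```python
from typing import Any

# Assumption: these are the only top-level sections, as shown in the outline.
KNOWN_TOP_LEVEL_KEYS = {"config", "groups", "definitions"}


def check_top_level(schema: dict[str, Any]) -> list[str]:
    """Return a list of problems found in the top-level layout (hypothetical helper)."""
    problems: list[str] = []
    # Flag keys that the outline does not mention.
    for key in sorted(set(schema) - KNOWN_TOP_LEVEL_KEYS):
        problems.append(f"unexpected top-level key: {key}")
    # Each known section that is present should be a JSON object.
    for key in sorted(KNOWN_TOP_LEVEL_KEYS & set(schema)):
        if not isinstance(schema[key], dict):
            problems.append(f"'{key}' should be an object")
    return problems


print(check_top_level({"config": {}, "groups": {}, "definitions": {}}))
print(check_top_level({"groups": [], "extras": {}}))
```

An empty list means the layout matches the outline; it says nothing about whether the contents of each section are valid.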
Documents are automatically processed using VLM (Vision Language Model) parsing with per-page chunking for optimal extraction quality across all document types.
Groups are the primary organizing principle for extractions. Each group contains its own fields and can override template-level configuration.
All fields within a group are extracted together in a single LLM call, sharing the same document chunks. This makes groups ideal for semantically related fields — information that tends to appear together in the same sections of your documents.
{
  "groups": {
    "company_info": {
      "config": {
        "iterates_on": "companies.list"
      },
      "search_query": "company overview, legal name, industry sector",
      "extraction_prompt": "Extract core company details",
      "fields": {
        "name": {
          "type": "string",
          "extraction_prompt": "Extract the full legal company name"
        },
        "sector": {
          "type": "string",
          "extraction_prompt": "Extract the industry classification"
        }
      }
    },
    "financial_metrics": {
      "search_query": "financial metrics, revenue, annual figures",
      "fields": {
        "revenue": {
          "type": "number",
          "extraction_prompt": "Extract the annual revenue figure",
          "references": ["@{company_info.name}"]
        }
      }
    }
  }
}
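To illustrate how dependencies between groups could determine what runs in parallel, here is a minimal sketch that derives an execution order from `references` entries. It assumes that a reference of the form `@{group.field}` makes the referencing group depend on the named group; the function names `group_dependencies` and `parallel_batches` are hypothetical, and the real scheduler may work differently (for example, it may also take `iterates_on` into account, which this sketch ignores).

```python
import re
from typing import Any

# Assumed reference syntax, taken from the example above: "@{group.field}".
REF_PATTERN = re.compile(r"@\{([A-Za-z_]\w*)\.[^}]+\}")


def group_dependencies(schema: dict[str, Any]) -> dict[str, set[str]]:
    """Map each group to the set of other groups its fields reference."""
    groups = schema.get("groups", {})
    deps: dict[str, set[str]] = {name: set() for name in groups}
    for name, group in groups.items():
        for field in group.get("fields", {}).values():
            for ref in field.get("references", []):
                match = REF_PATTERN.fullmatch(ref)
                # Ignore malformed references, unknown groups, and self-references.
                if match and match.group(1) in groups and match.group(1) != name:
                    deps[name].add(match.group(1))
    return deps


def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Layer groups so that each batch depends only on earlier batches."""
    remaining = {name: set(d) for name, d in deps.items()}
    batches: list[list[str]] = []
    while remaining:
        # Groups with no unmet dependencies can run together.
        ready = sorted(name for name, d in remaining.items() if not d)
        if not ready:
            raise ValueError(f"circular references among: {sorted(remaining)}")
        batches.append(ready)
        for name in ready:
            del remaining[name]
        for d in remaining.values():
            d.difference_update(ready)
    return batches


schema = {
    "groups": {
        "company_info": {
            "fields": {"name": {"type": "string"}, "sector": {"type": "string"}}
        },
        "financial_metrics": {
            "fields": {
                "revenue": {"type": "number", "references": ["@{company_info.name}"]}
            }
        },
    }
}

batches = parallel_batches(group_dependencies(schema))
print(batches)  # [['company_info'], ['financial_metrics']]
```

Groups in the same inner list have no dependencies on each other and could be processed concurrently; each later list waits for the ones before it.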
The definitions section contains reusable schema components:
{
  "definitions": {
    "monetary_amount": {
      "type": "number",
      "extraction_prompt": "Extract and normalize monetary value to a number"
    },
    "date_field": {
      "type": "string",
      "extraction_prompt": "Extract date in YYYY-MM-DD format"
    }
  }
}
The search_query property is used by RAG to find relevant document chunks. Writing effective queries is critical for extraction quality.
Write short, dense semantic phrases, not natural-language sentences. Embedding models compute similarity based on meaning, so unnecessary grammar lowers the signal-to-noise ratio.
Avoid imperative verbs like “Find”, “Get”, “Extract” and question phrasing.
// ❌ Bad
{ "search_query": "Find the company's legal name and incorporation details" }

// ✅ Good
{ "search_query": "company legal name, incorporation details" }
Remove Stopwords
Words like “the”, “in”, “for”, “of” have negligible embedding value.
// ❌ Bad
{ "search_query": "The name of the CEO of the company" }

// ✅ Good
{ "search_query": "CEO name, company leadership" }
Add Domain Keywords
Include domain-specific terms for disambiguation.
// ❌ Bad
{ "search_query": "In the context of EU privacy law, what are the obligations?" }

// ✅ Good
{ "search_query": "GDPR, data portability obligations" }
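The guidelines above can be turned into a rough pre-flight check. The sketch below flags queries that start with an imperative verb, end with a question mark, or contain common stopwords. The function `lint_search_query` and both word lists are illustrative assumptions, not part of the schema or of any tooling that ships with it.

```python
import re

# Illustrative word lists only; they are assumptions, not part of the schema.
IMPERATIVE_VERBS = {"find", "get", "extract", "retrieve", "locate", "identify"}
STOPWORDS = {"the", "a", "an", "in", "for", "of", "to", "and", "on", "is", "are"}


def lint_search_query(query: str) -> list[str]:
    """Return warnings for a search_query that reads like a sentence."""
    warnings: list[str] = []
    words = re.findall(r"[a-z0-9']+", query.lower())
    # An imperative verb at the start of the query adds no retrieval signal.
    if words and words[0] in IMPERATIVE_VERBS:
        warnings.append(f"starts with imperative verb '{words[0]}'")
    # Question phrasing suggests a sentence rather than a keyword phrase.
    if query.strip().endswith("?"):
        warnings.append("phrased as a question")
    # Stopwords carry little embedding value.
    found = sorted(set(words) & STOPWORDS)
    if found:
        warnings.append("contains stopwords: " + ", ".join(found))
    return warnings


for q in (
    "Find the company's legal name and incorporation details",
    "company legal name, incorporation details",
):
    print(q, "->", lint_search_query(q))
```

A check like this only catches surface patterns. It cannot tell whether a query includes the right domain keywords, so an empty warning list is not a guarantee of good retrieval.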