
DSPydantic

Stop manually tuning prompts. Let your data optimize them.

DSPydantic automatically optimizes your Pydantic model prompts and field descriptions using DSPy. Extract structured data from text, images, and PDFs with higher accuracy and less effort.

The Problem

You've defined a Pydantic model. You're using an LLM to extract data. But:

  • Your prompts are guesswork—trial and error until something works
  • Accuracy varies wildly depending on input phrasing
  • Every new use case means more manual prompt engineering

The Solution

DSPydantic takes your examples and automatically finds the best prompts for your use case:

from pydantic import BaseModel, Field
from dspydantic import Prompter, Example

class Invoice(BaseModel):
    vendor: str = Field(description="Company that issued the invoice")
    total: str = Field(description="Total amount due")
    due_date: str = Field(description="Payment due date")

prompter = Prompter(model=Invoice, model_id="openai/gpt-4o-mini")

# Optimize with examples
result = prompter.optimize(examples=[
    Example(
        text="Invoice from Acme Corp. Total: $1,250.00. Due: March 15, 2024.",
        expected_output={"vendor": "Acme Corp", "total": "$1,250.00", "due_date": "March 15, 2024"}
    ),
])

# Extract with optimized prompts
invoice = prompter.run("Consolidated Energy Partners | Invoice Total $3,200 | Due 2024-05-30")

Typical improvement: 10-30% higher accuracy with the same LLM.

Installation

pip install dspydantic

Quick Start

Extract Data (No Optimization)

For simple cases, extract immediately:

from pydantic import BaseModel, Field
from dspydantic import Prompter

class Contact(BaseModel):
    name: str = Field(description="Person's full name")
    email: str = Field(description="Email address")

prompter = Prompter(model=Contact, model_id="openai/gpt-4o-mini")

contact = prompter.run("Reach out to Sarah Chen at sarah.chen@techcorp.io")
# Contact(name='Sarah Chen', email='sarah.chen@techcorp.io')

Optimize for Better Accuracy

When accuracy matters, optimize with examples:

from dspydantic import Example

examples = [
    Example(text="...", expected_output={...}),
    # 5-20 examples typically enough
]

result = prompter.optimize(examples=examples, verbose=True)
print(f"Accuracy: {result.baseline_score:.0%} → {result.optimized_score:.0%}")

Monitor progress in real-time with verbose=True to see:

  • Rich-formatted optimization progress
  • Actual optimized descriptions after each field optimization
  • Final summary with scores, API calls, and token usage

By default, optimization runs in single-pass mode: a single DSPy compile covers all fields, with reduced demo budgets for maximum speed. For better quality at the cost of more API calls, pass sequential=True to optimize each field description independently (deepest-nested first) before optimizing the prompts. In sequential mode, parallel_fields=True (the default) optimizes fields in parallel to recover some of that speed.
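To make the reported scores concrete, a field-level exact-match metric can be sketched as below. This is illustrative only, not DSPydantic's internal scoring: the function name and the strict-equality matching rule are assumptions, and the real metric may normalize values or weight fields differently.

```python
def field_accuracy(expected: dict, predicted: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly.

    Illustrative sketch only; DSPydantic's actual metric may differ.
    """
    if not expected:
        return 0.0
    matches = sum(
        1 for field, value in expected.items()
        if predicted.get(field) == value
    )
    return matches / len(expected)

expected = {"vendor": "Acme Corp", "total": "$1,250.00", "due_date": "March 15, 2024"}
predicted = {"vendor": "Acme Corp", "total": "$1,250.00", "due_date": "2024-03-15"}
print(f"{field_accuracy(expected, predicted):.0%}")  # 2 of 3 fields match exactly
```

A metric like this is what the optimizer maximizes across your validation examples, which is why differences in formatting (dates, currency) in expected_output matter.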

Deploy to Production

# Save optimized prompter
prompter.save("./invoice_prompter")

# Load in production
prompter = Prompter.load("./invoice_prompter", model=Invoice, model_id="openai/gpt-4o-mini")
invoice = prompter.run(new_document)

Why DSPydantic?

| Feature | DSPydantic | Manual Prompting |
| --- | --- | --- |
| Automatic optimization | ✅ Data-driven | ❌ Trial and error |
| Pydantic native | ✅ Full type safety | ⚠️ JSON only |
| Multi-modal | ✅ Text, images, PDFs | ⚠️ Text only |
| Production ready | ✅ Save/load, batch, async | ❌ Manual |
| Confidence scores | ✅ Per-extraction | ❌ No |

Built on: DSPy (Stanford's optimization framework) + Pydantic (Python data validation)

Input Types

# Text
Example(text="Invoice from Acme...", expected_output={...})

# Images
Example(image_path="receipt.png", expected_output={...})

# PDFs
Example(pdf_path="contract.pdf", expected_output={...})
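When building a mixed dataset, one convenient pattern is to dispatch on file extension to pick the right Example field. The helper below is hypothetical (not part of DSPydantic); it only assumes the three Example fields shown above and returns plain keyword arguments.

```python
from pathlib import Path

# Hypothetical helper (not part of DSPydantic): choose the Example
# keyword argument based on the input's file extension.
def build_example_kwargs(source: str, expected_output: dict) -> dict:
    suffix = Path(source).suffix.lower()
    if suffix in {".png", ".jpg", ".jpeg", ".webp"}:
        key = "image_path"
    elif suffix == ".pdf":
        key = "pdf_path"
    else:
        key = "text"  # treat anything else as raw text
    return {key: source, "expected_output": expected_output}

print(build_example_kwargs("receipt.png", {"total": "$5.00"}))
# {'image_path': 'receipt.png', 'expected_output': {'total': '$5.00'}}
```

The resulting dict can be splatted into the constructor, e.g. `Example(**build_example_kwargs(...))`.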

Optimization Options

# Focus on specific fields only
result = prompter.optimize(
    examples=examples,
    include_fields=["address", "total"],  # Only optimize these
)

# Exclude fields from scoring (still extracted)
result = prompter.optimize(
    examples=examples,
    exclude_fields=["metadata", "timestamp"],
)

# Sequential mode (field-by-field optimization)
result = prompter.optimize(
    examples=examples,
    sequential=True,
)

# Parallel field optimization (sequential mode with parallelization)
result = prompter.optimize(
    examples=examples,
    sequential=True,
    parallel_fields=True,
)

# Reduce validation set size for faster optimization
result = prompter.optimize(
    examples=examples,
    max_val_examples=5,
)

Production Features

# Caching (reduce API costs)
prompter = Prompter(model=Invoice, model_id="openai/gpt-4o-mini", cache=True)

# Batch processing
invoices = prompter.predict_batch(documents, max_workers=4)

# Async
invoice = await prompter.apredict(document)

# Confidence scores
result = prompter.predict_with_confidence(document)
if result.confidence > 0.9:
    process(result.data)
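For high throughput, the async API combines naturally with asyncio.gather to run many extractions concurrently. The sketch below stands a stub in for a real Prompter so it is self-contained; in practice you would pass the optimized prompter from above, whose apredict coroutine calls the LLM.

```python
import asyncio

# Stub standing in for a real Prompter (assumption for this sketch):
# a real apredict would call the LLM; this one just echoes its input.
class StubPrompter:
    async def apredict(self, document: str) -> dict:
        await asyncio.sleep(0)  # yield control, as a real API call would
        return {"source": document}

async def extract_all(prompter, documents):
    # Launch all extractions concurrently instead of awaiting one at a time.
    return await asyncio.gather(*(prompter.apredict(d) for d in documents))

results = asyncio.run(extract_all(StubPrompter(), ["doc-1", "doc-2"]))
print(results)  # [{'source': 'doc-1'}, {'source': 'doc-2'}]
```

With a real prompter, concurrency is bounded by the provider's rate limits, so you may want a semaphore around each apredict call.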

Documentation

Full documentation at davidberenstein1957.github.io/dspydantic

License

Apache 2.0

Contributing

Contributions welcome! Open an issue or submit a pull request.