Stop manually tuning prompts. Let your data optimize them.
DSPydantic automatically optimizes your Pydantic model prompts and field descriptions using DSPy. Extract structured data from text, images, and PDFs with higher accuracy and less effort.
You've defined a Pydantic model. You're using an LLM to extract data. But:
- Your prompts are guesswork: trial and error until something works
- Accuracy varies wildly depending on input phrasing
- Every new use case means more manual prompt engineering
DSPydantic takes your examples and automatically finds the best prompts for your use case:
```python
from pydantic import BaseModel, Field
from dspydantic import Prompter, Example

class Invoice(BaseModel):
    vendor: str = Field(description="Company that issued the invoice")
    total: str = Field(description="Total amount due")
    due_date: str = Field(description="Payment due date")

prompter = Prompter(model=Invoice, model_id="openai/gpt-4o-mini")

# Optimize with examples
result = prompter.optimize(examples=[
    Example(
        text="Invoice from Acme Corp. Total: $1,250.00. Due: March 15, 2024.",
        expected_output={"vendor": "Acme Corp", "total": "$1,250.00", "due_date": "March 15, 2024"}
    ),
])

# Extract with optimized prompts
invoice = prompter.run("Consolidated Energy Partners | Invoice Total $3,200 | Due 2024-05-30")
```

Typical improvement: 10-30% higher accuracy with the same LLM.
```shell
pip install dspydantic
```

For simple cases, extract immediately:
```python
from pydantic import BaseModel, Field
from dspydantic import Prompter

class Contact(BaseModel):
    name: str = Field(description="Person's full name")
    email: str = Field(description="Email address")

prompter = Prompter(model=Contact, model_id="openai/gpt-4o-mini")
contact = prompter.run("Reach out to Sarah Chen at sarah.chen@techcorp.io")
# Contact(name='Sarah Chen', email='sarah.chen@techcorp.io')
```

When accuracy matters, optimize with examples:
```python
from dspydantic import Example

examples = [
    Example(text="...", expected_output={...}),
    # 5-20 examples are typically enough
]

result = prompter.optimize(examples=examples, verbose=True)
print(f"Accuracy: {result.baseline_score:.0%} → {result.optimized_score:.0%}")
```

Monitor progress in real time with `verbose=True` to see:
- Rich-formatted optimization progress
- Actual optimized descriptions after each field optimization
- Final summary with scores, API calls, and token usage
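A score like the ones reported in the summary can be thought of as the fraction of fields extracted correctly. Here is a minimal exact-match sketch (hypothetical: DSPydantic's actual metric may normalize values or weight fields differently):

```python
def exact_match_score(expected: dict, predicted: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly.

    Illustrative only; the library's real scorer may differ.
    """
    if not expected:
        return 1.0
    hits = sum(
        1 for field, value in expected.items()
        if predicted.get(field) == value
    )
    return hits / len(expected)

expected = {"vendor": "Acme Corp", "total": "$1,250.00", "due_date": "March 15, 2024"}
predicted = {"vendor": "Acme Corp", "total": "$1,250.00", "due_date": "2024-03-15"}
print(f"{exact_match_score(expected, predicted):.0%}")  # 2 of 3 fields match, prints 67%
```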
By default, optimization runs in single-pass mode: one DSPy compile covers all fields, with reduced demo budgets for maximum speed. For better quality at the cost of more API calls, set `sequential=True` to optimize each field description independently (deepest-nested first), then the prompts. In sequential mode, `parallel_fields=True` (the default) optimizes fields in parallel for speed.
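The "deepest-nested first" ordering can be pictured as sorting field paths by nesting depth, so inner descriptions are settled before the models that contain them. A toy sketch in plain Python (illustration only: `optimization_order` and dotted field paths are not DSPydantic's API):

```python
def optimization_order(field_paths: list[str]) -> list[str]:
    """Return field paths deepest-nested first.

    Depth is the number of dots in the path; Python's sort is stable,
    so fields at the same depth keep their declaration order.
    """
    return sorted(field_paths, key=lambda path: path.count("."), reverse=True)

fields = ["vendor", "line_items.description", "line_items.product.sku", "total"]
print(optimization_order(fields))
# ['line_items.product.sku', 'line_items.description', 'vendor', 'total']
```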
```python
# Save the optimized prompter
prompter.save("./invoice_prompter")

# Load it in production
prompter = Prompter.load("./invoice_prompter", model=Invoice, model_id="openai/gpt-4o-mini")
invoice = prompter.run(new_document)
```

| Feature | DSPydantic | Manual Prompting |
|---|---|---|
| Automatic optimization | ✅ Data-driven | ❌ Trial and error |
| Pydantic native | ✅ Full type safety | ❌ |
| Multi-modal | ✅ Text, images, PDFs | ❌ |
| Production ready | ✅ Save/load, batch, async | ❌ Manual |
| Confidence scores | ✅ Per-extraction | ❌ No |
Built on: DSPy (Stanford's optimization framework) + Pydantic (Python data validation)
```python
# Text
Example(text="Invoice from Acme...", expected_output={...})

# Images
Example(image_path="receipt.png", expected_output={...})

# PDFs
Example(pdf_path="contract.pdf", expected_output={...})
```

```python
# Focus on specific fields only
result = prompter.optimize(
    examples=examples,
    include_fields=["address", "total"],  # Only optimize these
)

# Exclude fields from scoring (still extracted)
result = prompter.optimize(
    examples=examples,
    exclude_fields=["metadata", "timestamp"],
)

# Sequential mode (field-by-field optimization)
result = prompter.optimize(
    examples=examples,
    sequential=True,
)

# Parallel field optimization (sequential mode with parallelization)
result = prompter.optimize(
    examples=examples,
    sequential=True,
    parallel_fields=True,
)

# Reduce the validation set size for faster optimization
result = prompter.optimize(
    examples=examples,
    max_val_examples=5,
)
```

```python
# Caching (reduce API costs)
prompter = Prompter(model=Invoice, model_id="openai/gpt-4o-mini", cache=True)

# Batch processing
invoices = prompter.predict_batch(documents, max_workers=4)

# Async
invoice = await prompter.apredict(document)

# Confidence scores
result = prompter.predict_with_confidence(document)
if result.confidence > 0.9:
    process(result.data)
```

Full documentation at davidberenstein1957.github.io/dspydantic:
- Getting Started - First extraction in 5 minutes
- Configure Optimizations - Optimizers, single-pass/sequential modes, parallelization
- Field Inclusion & Exclusion - Focus optimization on specific fields
- API Reference - Full documentation
Apache 2.0
Contributions welcome! Open an issue or submit a pull request.