Tree-based, vectorless document RAG framework. Available for Python (3.10+) and Node.js (18+) with identical APIs and cross-compatible index format.
- Architecture Overview
- Pipeline Deep Dive
- Smart Hierarchy Detection
- API Reference
- LLM Backend System
- Token Management Strategy
- Benchmarks
- Case Studies
- Cross-Language Parity
- Error Handling
- Configuration & Tuning
- Limitations & Edge Cases
TreeDex is organized into the modules below. All except the TypeScript-only type definitions are implemented identically in Python and TypeScript:
┌─────────────────────────────────────────────────────┐
│                  Document Loaders                   │
│  PDFLoader · TextLoader · HTMLLoader · DOCXLoader   │
│         auto_loader() — auto-detect format          │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │       PDF Parser        │
          │  extract_pages()        │
          │  extract_toc()          │
          │  [H1][H2][H3] markers   │
          │  group_pages()          │
          └────────┬───────┬────────┘
                   │       │
         ┌─────────▼─┐    ┌▼──────────────────────┐
         │ ToC found │    │ No ToC → LLM path     │
         │ Zero LLM  │    │ Heading-guided        │
         │ calls     │    │ Capped continuation   │
         └─────┬─────┘    └─────────┬─────────────┘
               │                    │
          ┌────▼────────────────────▼─────┐
          │         Tree Builder          │
          │  toc_to_sections()            │
          │  repair_orphans()             │
          │  list_to_tree()               │
          │  assign_page_ranges()         │
          │  assign_node_ids()            │
          │  embed_text_in_tree()         │
          └──────────────┬────────────────┘
                         │
          ┌──────────────▼────────────────┐
          │         TreeDex Core          │
          │  .query()  .save()  .load()   │
          │  .stats()  .show_tree()       │
          └──────────────┬────────────────┘
                         │
          ┌──────────────▼────────────────┐
          │          QueryResult          │
          │  .context      .node_ids      │
          │  .page_ranges  .reasoning     │
          │  .answer (agentic mode)       │
          └───────────────────────────────┘
| Module | Python | TypeScript | LOC (Python / TypeScript) | Purpose |
|---|---|---|---|---|
| Core | `treedex/core.py` | `src/core.ts` | 303 / 398 | Main TreeDex class, indexing and querying |
| LLM Backends | `treedex/llm_backends.py` | `src/llm-backends.ts` | 700 / 738 | 18+ provider integrations |
| Loaders | `treedex/loaders.py` | `src/loaders.ts` | 153 / 190 | Format-specific document loading |
| PDF Parser | `treedex/pdf_parser.py` | `src/pdf-parser.ts` | 133 / 154 | PDF extraction, ToC, heading detection |
| Prompts | `treedex/prompts.py` | `src/prompts.ts` | 91 / 105 | LLM prompt templates |
| Tree Builder | `treedex/tree_builder.py` | `src/tree-builder.ts` | 120 / 156 | Flat sections → hierarchical tree |
| Tree Utils | `treedex/tree_utils.py` | `src/tree-utils.ts` | 152 / 188 | Traversal, serialization, JSON extraction |
| Types | — | `src/types.ts` | — / 42 | TypeScript type definitions |
TreeDex auto-detects format by extension and delegates to the appropriate loader:
| Format | Library (Python) | Library (Node.js) | Images | Headings |
|---|---|---|---|---|
| PDF | PyMuPDF (fitz) | pdfjs-dist | Base64 extraction | Font-size analysis |
| TXT/MD | stdlib | Node.js fs | — | — |
| HTML | HTMLParser (stdlib) | htmlparser2 / regex fallback | Alt text extraction | — |
| DOCX | python-docx | mammoth | Inline images | — |
Each loader returns a list of Page objects:
{
  "page_num": 0,
  "text": "Chapter 1: Introduction...",
  "token_count": 342,
  "images": [{"data": "base64...", "mime_type": "image/png"}]
}

Non-PDF formats split text into synthetic pages by character count (default 3,000 chars/page).
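A minimal sketch of the synthetic paging described above, assuming a plain fixed-width character split. The helper name `split_into_pages` and the rough four-characters-per-token estimate are illustrative assumptions, not TreeDex's loader code (which counts tokens with a real tokenizer).

```python
def split_into_pages(text: str, chars_per_page: int = 3000) -> list[dict]:
    """Split text into fixed-size synthetic pages (hypothetical helper)."""
    pages = []
    for page_num, start in enumerate(range(0, len(text), chars_per_page)):
        chunk = text[start:start + chars_per_page]
        pages.append({
            "page_num": page_num,
            "text": chunk,
            # Assumption: crude stand-in for a tokenizer, about 4 chars per token.
            "token_count": len(chunk) // 4,
            "images": [],
        })
    return pages


# A 7,000-character document becomes three pages: 3000 + 3000 + 1000 chars.
pages = split_into_pages("x" * 7000)
```

A fixed split can cut a sentence in half at a page boundary; that is acceptable here because the minimum retrieval unit is a page anyway.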
Before any LLM calls, TreeDex checks for PDF bookmarks:
# Python
toc = extract_toc("document.pdf")
# Returns: [{"level": 1, "title": "Introduction", "physical_index": 0}, ...]
# or None if the outline has fewer than 3 entries

// TypeScript
const toc = await extractToc("document.pdf");

How it works:
- Python: Uses `fitz.open(path).get_toc()`, which reads the PDF outline/bookmarks metadata
- TypeScript: Uses `pdfjs.getDocument().getOutline()` and recursively walks the outline tree, resolving page destinations
If a usable ToC is found (3+ entries), the entire LLM extraction step is skipped — the tree is built directly from ToC entries via toc_to_sections().
When no ToC is available, TreeDex analyzes font sizes to inject hierarchy markers:
Step 1: Analyze — Sample up to 50 pages, collect all font sizes weighted by character count.
Step 2: Classify — The most-used font size = body text. All sizes > body + 0.5pt = headings. Top 3 heading sizes → H1, H2, H3 (largest first).
Step 3: Annotate — Each line gets a marker based on its font size:
Without markers:            With markers:
─────────────────           ─────────────────
3 System Architecture       [H2] 3 System Architecture
TreeDex consists of...      TreeDex consists of...
3.1 Architecture            [H3] 3.1 Architecture overview
overview                    Figure 1 illustrates...
Figure 1 illustrates...
Overhead: about 2.7% more tokens (314 extra tokens on the 11,710-token benchmark document), in exchange for explicit heading levels that make the hierarchy much easier for the LLM to recover.
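The classify and annotate steps can be sketched as follows. This is an illustration under stated assumptions rather than TreeDex's parser code: `classify_heading_sizes` and `annotate_line` are hypothetical names, and the input is a character count per font size as Step 1 would collect it.

```python
from collections import Counter


def classify_heading_sizes(size_chars: Counter, tolerance: float = 0.5) -> dict[float, str]:
    """Map the three largest heading font sizes to [H1]/[H2]/[H3] markers."""
    body_size = size_chars.most_common(1)[0][0]
    # Any size more than `tolerance` points above body text counts as a heading.
    heading_sizes = sorted(
        (size for size in size_chars if size > body_size + tolerance),
        reverse=True,
    )
    # Largest size becomes H1, then H2, then H3; further sizes get no marker.
    return {size: f"[H{level}]" for level, size in enumerate(heading_sizes[:3], start=1)}


def annotate_line(text: str, size: float, markers: dict[float, str]) -> str:
    """Prefix a line with its heading marker, or return it unchanged."""
    marker = markers.get(round(size, 1))
    return f"{marker} {text}" if marker else text


# Hypothetical character counts per font size.
size_chars = Counter({10.0: 42000, 17.2: 60, 12.0: 400, 11.0: 900, 10.4: 120, 8.0: 300})
markers = classify_heading_sizes(size_chars)
heading = annotate_line("3 System Architecture", 12.0, markers)
body = annotate_line("TreeDex consists of...", 10.0, markers)
```

Weighting by character count rather than by span count keeps a document with many short headings from having a heading size mistaken for body text.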
Documents are split into token-budgeted groups for the LLM:
Pages: [P0, P1, P2, ..., P17]
max_tokens = 4000
Group 1: [P0, P1, P2, P3, P4, P5, P6] ← 3,868 tokens
                                  overlap ↕
Group 2: [P6, P7, P8, P9, P10, P11, P12] ← 3,926 tokens
                                  overlap ↕
Group 3: [P12, P13, P14, P15, P16, P17] ← 3,631 tokens
Each page is wrapped in XML tags: <physical_index_N>text</physical_index_N> so the LLM knows which page each section starts on.
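A rough sketch of token-budgeted grouping with page overlap and the XML page tags, assuming a simple greedy strategy. The function bodies below are illustrative and may differ from the real `group_pages()`; `render_group` is a hypothetical helper.

```python
def group_pages(pages: list[dict], max_tokens: int = 20000, overlap: int = 1) -> list[list[dict]]:
    """Greedily pack pages into groups that stay within max_tokens."""
    groups: list[list[dict]] = []
    current: list[dict] = []
    current_tokens = 0
    for page in pages:
        if current and current_tokens + page["token_count"] > max_tokens:
            groups.append(current)
            # Carry the last `overlap` pages into the next group.
            current = current[-overlap:] if overlap > 0 else []
            current_tokens = sum(p["token_count"] for p in current)
        current.append(page)
        current_tokens += page["token_count"]
    if current:
        groups.append(current)
    return groups


def render_group(group: list[dict]) -> str:
    """Wrap each page in <physical_index_N> tags so the LLM can cite pages."""
    return "\n".join(
        f"<physical_index_{p['page_num']}>{p['text']}</physical_index_{p['page_num']}>"
        for p in group
    )


pages = [{"page_num": n, "text": f"page {n}", "token_count": 600} for n in range(10)]
groups = group_pages(pages, max_tokens=2000, overlap=1)
prompt_text = render_group(groups[0])
```

With 600-token pages and a 2,000-token budget, each group holds three pages and shares its last page with the next group.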
Group 1 gets the full extraction prompt. The LLM returns:
[
  {"structure": "1", "title": "Introduction", "physical_index": 1},
  {"structure": "1.1", "title": "Background", "physical_index": 1},
  {"structure": "1.2", "title": "Methods", "physical_index": 3}
]

Groups 2+ get a continuation prompt with capped context:
| Document Size | Old Context (all sections) | New Context (capped) | Savings |
|---|---|---|---|
| 100 pages | 9,750 tokens | 4,800 tokens | 50.8% |
| 300 pages | 117,200 tokens | 19,200 tokens | 83.6% |
| 500 pages | 317,200 tokens | 31,200 tokens | 90.2% |
The capped context includes:
- All top-level sections (chapters) — to maintain overall structure awareness
- Last 30 detailed sections — to continue numbering correctly
- Total section count and last structure ID
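The three ingredients of the capped context can be sketched like this. The function name `capped_context` and the dictionary keys are assumptions made for illustration; only the selection rule (all top-level sections, the last 30 sections, a count, and the last structure ID) comes from the list above.

```python
def capped_context(sections: list[dict], recent: int = 30) -> dict:
    """Build a bounded continuation context from already-extracted sections."""
    # Top-level sections have no dot in their structure ID ("1", "2", ...).
    top_level = [s for s in sections if "." not in s["structure"]]
    return {
        "top_level_sections": top_level,
        "recent_sections": sections[-recent:],
        "total_sections": len(sections),
        "last_structure": sections[-1]["structure"] if sections else None,
    }


# Five chapters with twenty subsections each: 105 sections in total.
sections = []
for chapter in range(1, 6):
    sections.append({"structure": str(chapter), "title": f"Chapter {chapter}"})
    for sub in range(1, 21):
        sections.append({"structure": f"{chapter}.{sub}", "title": f"Section {chapter}.{sub}"})

context = capped_context(sections)
```

The context size now grows with the number of chapters rather than with the total number of sections, which is what keeps late page groups within the model's window.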
After extraction, repair_orphans() fixes broken hierarchy:
Before repair:              After repair:
─────────────               ─────────────
1 — Introduction            1 — Introduction
1.1 — Background            1.1 — Background
2.3.1 — Deep orphan         2 — Section 2           ← synthetic
                            2.3 — Section 2.3       ← synthetic
                            2.3.1 — Deep orphan
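One way to implement this repair is sketched below. It assumes sections arrive in document order with parents before children, and that a synthetic parent inherits the page index of the orphan that triggered it; both are assumptions, not confirmed TreeDex behavior.

```python
def repair_orphans(sections: list[dict]) -> list[dict]:
    """Insert synthetic parents for sections whose ancestors are missing."""
    seen: set[str] = set()
    repaired: list[dict] = []
    for section in sections:
        parts = section["structure"].split(".")
        # Walk the ancestor chain from the root down: "2.3.1" -> "2", "2.3".
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in seen:
                seen.add(ancestor)
                repaired.append({
                    "structure": ancestor,
                    "title": f"Section {ancestor}",
                    # Assumption: reuse the orphan's page as the parent's start page.
                    "physical_index": section["physical_index"],
                })
        seen.add(section["structure"])
        repaired.append(section)
    return repaired


broken = [
    {"structure": "1", "title": "Introduction", "physical_index": 0},
    {"structure": "1.1", "title": "Background", "physical_index": 0},
    {"structure": "2.3.1", "title": "Deep orphan", "physical_index": 4},
]
fixed = repair_orphans(broken)
```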
The flat section list becomes a hierarchical tree:
- `list_to_tree()` — Parent lookup by structure prefix ("1.2.3" → parent "1.2")
- `assign_page_ranges()` — Each node gets start/end page indices
- `assign_node_ids()` — DFS assigns "0001", "0002", etc.
- `embed_text_in_tree()` — Each node gets concatenated page text for its range
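The prefix lookup and the depth-first ID assignment can be sketched as below. The bodies are illustrative and assume parents precede their children in the flat list (which the orphan repair step is meant to guarantee); they are not copied from TreeDex.

```python
def list_to_tree(sections: list[dict]) -> list[dict]:
    """Nest flat sections by structure prefix: "1.2.3" hangs under "1.2"."""
    by_structure: dict[str, dict] = {}
    roots: list[dict] = []
    for section in sections:
        node = {**section, "nodes": []}
        by_structure[node["structure"]] = node
        # rpartition yields "" for top-level IDs such as "1", so they become roots.
        parent_key = node["structure"].rpartition(".")[0]
        parent = by_structure.get(parent_key)
        if parent is not None:
            parent["nodes"].append(node)
        else:
            roots.append(node)
    return roots


def assign_node_ids(tree: list[dict]) -> None:
    """Number nodes in depth-first order as zero-padded strings, in place."""
    counter = 0

    def visit(node: dict) -> None:
        nonlocal counter
        counter += 1
        node["node_id"] = f"{counter:04d}"
        for child in node["nodes"]:
            visit(child)

    for root in tree:
        visit(root)


flat = [{"structure": s} for s in ["1", "1.1", "1.2", "2", "2.1"]]
tree = list_to_tree(flat)
assign_node_ids(tree)
```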
result = index.query("What methods were used?")

Step 1: Strip text from tree (keep only structure/title/page_ranges/node_ids) to minimize tokens.
Step 2: Send stripped tree + question to LLM. LLM returns {node_ids: [...], reasoning: "..."}.
Step 3: Look up full text for selected nodes via the node map (O(1) per node).
Step 4 (agentic): Optionally generate an answer from the retrieved context.
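Steps 1 and 3 can be illustrated without calling an LLM. In the sketch below the LLM reply is simulated, and the two helper bodies are assumptions about how stripping and node lookup might work, not TreeDex's own code.

```python
import copy


def strip_text_from_tree(tree: list[dict]) -> list[dict]:
    """Return a copy of the tree without node text (Step 1)."""
    stripped = copy.deepcopy(tree)
    stack = list(stripped)
    while stack:
        node = stack.pop()
        node.pop("text", None)
        stack.extend(node["nodes"])
    return stripped


def create_node_mapping(tree: list[dict]) -> dict[str, dict]:
    """Flatten the tree into {node_id: node} for O(1) lookups (Step 3)."""
    mapping: dict[str, dict] = {}
    stack = list(tree)
    while stack:
        node = stack.pop()
        mapping[node["node_id"]] = node
        stack.extend(node["nodes"])
    return mapping


tree = [{
    "node_id": "0001", "title": "Introduction", "text": "Intro text", "nodes": [
        {"node_id": "0002", "title": "Methods", "text": "Methods text", "nodes": []},
    ],
}]

prompt_tree = strip_text_from_tree(tree)  # what the LLM would be shown
# Simulated Step 2 reply; "9999" is a deliberately invalid ID.
llm_reply = {"node_ids": ["0002", "9999"], "reasoning": "Methods section matches"}

node_map = create_node_mapping(tree)
# Unknown IDs are skipped rather than raising.
context = "\n\n".join(
    node_map[node_id]["text"] for node_id in llm_reply["node_ids"] if node_id in node_map
)
```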
1. PDF has bookmarks/outline?
   YES → extract_toc() → toc_to_sections() → build tree (0 LLM calls)
   NO ↓
2. PDF file?
   YES → detect_headings=True → font-size analysis → [H1][H2][H3] markers
   NO → plain text extraction
3. LLM extraction with heading-guided prompts
   → repair_orphans() → build tree
from collections import Counter

# Step 1: Collect font sizes from the first 50 pages (doc is an open fitz.Document)
size_chars = Counter()
for page_index in range(min(50, doc.page_count)):
    for block in doc[page_index].get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks carry no "lines" key
            for span in line["spans"]:
                # Weight each size by how many characters use it
                size_chars[round(span["size"], 1)] += len(span["text"])

# Step 2: Body = most common size
body_size = size_chars.most_common(1)[0][0]  # e.g., 10.0

# Step 3: Headings = larger sizes, sorted descending
# e.g., 17.2 → H1, 12.0 → H2, 11.0 → H3

# Step 4: Annotate lines
# "[H2] 3 System Architecture"
# "TreeDex consists of five core modules..."

PDF Outline:                 →  Structure Numbering:
─────────────                   ─────────────────────
Level 1: Introduction        →  "1"
  Level 2: Background        →  "1.1"
  Level 2: Motivation        →  "1.2"
Level 1: Methods             →  "2"
  Level 2: Data              →  "2.1"
    Level 3: Surveys         →  "2.1.1"
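Counters are maintained per level. When an entry appears at a given level, the counters for all deeper levels reset; for example, a new Level 1 entry resets the Level 2 and Level 3 counters.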
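The per-level counters can be sketched as a single list that is truncated whenever a shallower entry appears. This is an illustrative reimplementation under that assumption; the real `toc_to_sections()` may handle skipped levels differently.

```python
def toc_to_sections(toc: list[dict]) -> list[dict]:
    """Turn outline levels into dotted structure IDs ("1", "1.1", "2.1.1")."""
    counters: list[int] = []
    sections = []
    for entry in toc:
        level = entry["level"]
        # Grow the counter list if the outline goes deeper than seen so far.
        while len(counters) < level:
            counters.append(0)
        # Drop counters deeper than this level so they restart from 1 later.
        del counters[level:]
        counters[level - 1] += 1
        sections.append({
            "structure": ".".join(str(c) for c in counters),
            "title": entry["title"],
            "physical_index": entry["physical_index"],
        })
    return sections


toc = [
    {"level": 1, "title": "Introduction", "physical_index": 0},
    {"level": 2, "title": "Background", "physical_index": 0},
    {"level": 2, "title": "Motivation", "physical_index": 1},
    {"level": 1, "title": "Methods", "physical_index": 2},
    {"level": 2, "title": "Data", "physical_index": 2},
    {"level": 3, "title": "Surveys", "physical_index": 3},
]
sections = toc_to_sections(toc)
```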
| Method | Python | Node.js | Description |
|---|---|---|---|
| Build from file | `TreeDex.from_file(path, llm, **opts)` | `await TreeDex.fromFile(path, llm, opts?)` | Full pipeline: load → detect → extract → build |
| Build from pages | `TreeDex.from_pages(pages, llm, **opts)` | `await TreeDex.fromPages(pages, llm, opts?)` | From pre-extracted pages |
| Build from tree | `TreeDex.from_tree(tree, pages, llm)` | `TreeDex.fromTree(tree, pages, llm)` | From existing tree (no LLM) |
| Query | `index.query(q, llm=, agentic=)` | `await index.query(q, {llm?, agentic?})` | Retrieve relevant sections |
| Save | `index.save(path)` | `await index.save(path)` | Export to JSON |
| Load | `TreeDex.load(path, llm)` | `await TreeDex.load(path, llm)` | Import from JSON |
| Show tree | `index.show_tree()` | `index.showTree()` | Pretty-print |
| Stats | `index.stats()` | `index.stats()` | Return {total_pages, total_tokens, ...} |
| Find large | `index.find_large_sections(**opts)` | `index.findLargeSections(opts?)` | Nodes exceeding thresholds |
| Parameter | Python type | Node.js type | Default | Description |
|---|---|---|---|---|
| path | `str` | `string` | required | Document file path |
| llm | `BaseLLM` | `BaseLLM` | required | LLM backend instance |
| loader | `Loader` | `{load()}` | `None`/`undefined` | Custom loader (auto-detect if omitted) |
| max_tokens | `int` | `number` | 20000 | Token budget per page group |
| overlap | `int` | `number` | 1 | Page overlap between groups |
| verbose | `bool` | `boolean` | `True`/`true` | Print progress |
| extract_images | `bool` | `boolean` | `False`/`false` | Extract images from PDFs |
| Property | Python | Node.js | Type | Description |
|---|---|---|---|---|
| Context | `.context` | `.context` | `str` | Concatenated text from selected nodes |
| Node IDs | `.node_ids` | `.nodeIds` | `list[str]` | IDs of selected tree nodes |
| Page ranges | `.page_ranges` | `.pageRanges` | `list[tuple]` | `[(start, end), ...]` |
| Pages string | `.pages_str` | `.pagesStr` | `str` | `"pages 5-8, 12-15"` |
| Reasoning | `.reasoning` | `.reasoning` | `str` | LLM's selection explanation |
| Answer | `.answer` | `.answer` | `str` | Generated answer (agentic mode only) |
| Function | Python | Node.js | Description |
|---|---|---|---|
| Extract PDF ToC | `extract_toc(path)` | `await extractToc(path)` | Returns [{level, title, physical_index}] or None/null |
| ToC → sections | `toc_to_sections(toc)` | `tocToSections(toc)` | Convert ToC entries to numbered sections |
| Repair orphans | `repair_orphans(sections)` | `repairOrphans(sections)` | Insert synthetic parents for orphaned subsections |
| List to tree | `list_to_tree(sections)` | `listToTree(sections)` | Flat sections → hierarchical tree |
| Extract JSON | `extract_json(text)` | `extractJson(text)` | Robust JSON extraction from LLM output |
BaseLLM (abstract)
├── SDK-based (lazy import)
│ ├── GeminiLLM ─────── @google/generative-ai
│ ├── OpenAILLM ─────── openai
│ ├── ClaudeLLM ─────── @anthropic-ai/sdk
│ ├── MistralLLM ────── @mistralai/mistralai
│ └── CohereLLM ─────── cohere-ai
│
├── OpenAI-compatible (zero deps, fetch/urllib only)
│ ├── GroqLLM ─────────── api.groq.com
│ ├── TogetherLLM ─────── api.together.xyz
│ ├── FireworksLLM ────── api.fireworks.ai
│ ├── OpenRouterLLM ───── openrouter.ai
│ ├── DeepSeekLLM ─────── api.deepseek.com
│ ├── CerebrasLLM ─────── api.cerebras.ai
│ └── SambanovaLLM ────── api.sambanova.ai
│
├── Fetch-based (zero deps)
│ ├── HuggingFaceLLM ──── huggingface.co
│ └── OllamaLLM ──────── localhost:11434
│
├── Universal
│ ├── OpenAICompatibleLLM ── any URL + key
│ ├── LiteLLM ───────────── litellm (Python only)
│ └── FunctionLLM ────────── wrap any callable
│
└── BaseLLM ── subclass to build your own
Three backends support image description for the extract_images feature:
| Backend | Method | Image Format |
|---|---|---|
| GeminiLLM | `generate_content()` with inline_data | Base64 PNG/JPEG |
| OpenAILLM | Chat completion with image_url | Base64 data URI |
| ClaudeLLM | Messages API with image source | Base64 with media_type |
# Python
class MyLLM(BaseLLM):
    def generate(self, prompt: str) -> str:
        return my_api.call(prompt)  # my_api is a placeholder for your own client

# Or use FunctionLLM for quick wrapping:
llm = FunctionLLM(lambda prompt: my_api.call(prompt))

// TypeScript
const llm = new FunctionLLM((prompt: string) => myApi.call(prompt));

TreeDex uses cl100k_base (the encoding used by GPT-3.5 and GPT-4) for all token counting:
- Python: `tiktoken.get_encoding("cl100k_base")`
- Node.js: `gpt-tokenizer` (`encode` function)
| Stage | Token Concern | How TreeDex Handles It |
|---|---|---|
| Page extraction | Pre-compute per page | page.token_count field |
| Page grouping | Fit within LLM context | group_pages(max_tokens=20000) |
| Continuation context | Don't balloon the prompt | Capped: top-level + last 30 sections |
| Query (tree structure) | Keep tree JSON small | strip_text_from_tree() removes all node text |
| Query (retrieved context) | Return relevant text | Only selected nodes' text is concatenated |
# Small context model (8k)
index = TreeDex.from_file("doc.pdf", llm, max_tokens=6000)

# Large context model (128k)
index = TreeDex.from_file("doc.pdf", llm, max_tokens=100000)

Lower max_tokens = more groups = more LLM calls, but works with smaller context windows.
Higher max_tokens = fewer groups = faster, but needs a larger context window.
All benchmarks on research-paper.pdf (21 pages, 11,710 tokens, 41 ToC entries).
Node.js, measured with performance.now().
| Operation | Time | Notes |
|---|---|---|
| ToC extraction | 30.9 ms | Reads PDF outline metadata |
| Page extraction (no headings) | 298.9 ms | Plain text via pdfjs-dist |
| Page extraction (with headings) | 423.5 ms | +41.7% for font analysis |
| Heading token overhead | +314 tokens | +2.7% (12,024 vs 11,710) |
| max_tokens | Groups | Avg tokens/group | Time |
|---|---|---|---|
| 4,000 | 4 | 3,434 | 0.6 ms |
| 8,000 | 2 | 6,279 | 0.1 ms |
| 20,000 | 1 | 11,958 | 0.07 ms |
| 128,000 | 1 | 11,958 | 0.05 ms |
| Sections | Build Time | Nodes | Roots |
|---|---|---|---|
| 10 | 0.6 ms | 12 | 2 |
| 50 | 0.3 ms | 50 | 10 |
| 200 | 0.5 ms | 200 | 40 |
| 500 | 1.3 ms | 500 | 100 |
| Orphan Count | Time | Sections Added |
|---|---|---|
| 5 | 0.2 ms | 10 synthetic parents |
| 20 | 0.5 ms | 40 synthetic parents |
| 50 | 1.5 ms | 100 synthetic parents |
| 100 | 3.0 ms | 200 synthetic parents |
| ToC Entries | Time/call | 1000 calls |
|---|---|---|
| 10 | 13 μs | 12.8 ms |
| 50 | 44 μs | 43.7 ms |
| 200 | 156 μs | 155.6 ms |
| Metric | Value |
|---|---|
| Heap delta | +1.54 MB |
| RSS delta | +4.70 MB |
| External delta | +0.84 MB |
Problem: On a 21-page research paper, the old extraction would produce 41 root-level nodes with max depth 1 — every section treated as a top-level chapter.
After v0.1.5 (ToC extraction):
| Metric | Before (flat) | After (hierarchical) | Improvement |
|---|---|---|---|
| Root nodes | 41 | 10 | 75.6% reduction |
| Max depth | 1 | 3 | 3x deeper |
| Child nodes | 0 | 31 | Proper nesting |
| LLM calls | 1+ | 0 | 100% saved |
Tree output (after):
[0001] 1: Introduction (pages 1-1)
  [0002] 1.1: Background (pages 1-1)
  [0003] 1.2: Limitations of vector-based RAG (pages 1-1)
  [0004] 1.3: Our contribution (pages 2-2)
[0005] 2: Related Work (pages 2-4)
  [0006] 2.1: Retrieval-augmented generation (pages 2-2)
  [0007] 2.2: Document chunking strategies (pages 3-3)
  ...
[0011] 3: System Architecture (pages 5-8)
  [0012] 3.1: Architecture overview (pages 5-5)
  ...7 subsections...
Page 2 without headings (what the LLM used to see):
1 Introduction 1.1 Background Large Language Models (LLMs),
accessible primarily through web APIs...
Page 2 with headings (what the LLM sees now):
[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models (LLMs), accessible primarily through web
APIs...
The [H2] and [H3] markers make the hierarchy unambiguous. The LLM prompt instructs:
- `[H1]` = top-level chapters ("1", "2")
- `[H2]` = sections ("1.1", "1.2")
- `[H3]` = subsections ("1.1.1", "1.1.2")
Cost: Only 314 extra tokens across the entire 21-page document (+2.7%).
For a 500-page document generating ~976 sections across ~56 page groups:
Old approach (Group 50 of 56):
- Sends all 900+ previously extracted sections as JSON
- Context: 317,200 tokens — exceeds most model limits
- LLM truncates or hallucinates structure
New approach (Group 50 of 56):
- Sends: 15 top-level chapters + 30 most recent sections + metadata
- Context: 31,200 tokens — fits comfortably
- LLM has structural overview + continuation point
- 90.2% token savings
| Document Size | Old Context | Capped Context | Savings |
|---|---|---|---|
| 100 pages | 9,750 tok | 4,800 tok | 50.8% |
| 300 pages | 117,200 tok | 19,200 tok | 83.6% |
| 500 pages | 317,200 tok | 31,200 tok | 90.2% |
Scenario: LLM outputs 2.3.1 without ever producing 2 or 2.3:
Input (broken):             Output (repaired):
──────────────              ──────────────────
1 — Introduction            1 — Introduction
1.1 — Background            1.1 — Background
2.3.1 — Deep section        2 — Section 2             ← synthetic
3.1.2 — Another orphan      2.3 — Section 2.3         ← synthetic
4 — Conclusion              2.3.1 — Deep section
                            3 — Section 3             ← synthetic
                            3.1 — Section 3.1         ← synthetic
                            3.1.2 — Another orphan
                            4 — Conclusion
5 input sections → 9 after repair (4 synthetic parents added). The tree now has the correct hierarchy instead of 2 orphaned sections stranded at the root.
| Dimension | TreeDex | Vector DB (Chroma/Pinecone) |
|---|---|---|
| Indexing | LLM extracts structure (or PDF ToC) | Embedding model generates vectors |
| Storage | JSON file (human-readable) | Vector database (opaque) |
| Retrieval | LLM navigates tree | Cosine similarity search |
| Structure | Preserves chapters/sections/subsections | Flat chunks, no hierarchy |
| Attribution | Exact page ranges per node | Approximate chunk boundaries |
| Infrastructure | None (just JSON) | Database server required |
| Dependencies | 1 LLM API | 1 LLM API + 1 embedding API + 1 DB |
| Debugging | Inspect JSON tree directly | Query embedding space (impractical) |
| Cost per query | 1 LLM call (tree nav) + 1 optional (answer) | 1 embedding call + 1 LLM call |
| Scales to 1M+ tokens | Yes (page grouping + capped context) | Yes (vector DB handles scale) |
| Best for | Structured docs (papers, books, manuals) | Unstructured content (logs, chat, mixed) |
TreeDex maintains identical behavior across Python and Node.js. The JSON index format is fully cross-compatible — build in Python, query from Node.js, or vice versa.
| Concept | Python | Node.js |
|---|---|---|
| Functions | `snake_case` | `camelCase` |
| Classes | `PascalCase` | `PascalCase` |
| JSON fields | `snake_case` | `snake_case` (shared format) |
| File names | `snake_case.py` | `kebab-case.ts` |
| Feature | Python | TypeScript | Notes |
|---|---|---|---|
| PDF library | PyMuPDF (fitz) | pdfjs-dist | PyMuPDF is faster; pdfjs runs in browser |
| Token counting | tiktoken | gpt-tokenizer | Both use cl100k_base |
| Async | Synchronous | `async/await` | Python blocks; TS is non-blocking |
| LiteLLM | Supported | Not available | Python-only universal backend |
| Image extraction | Full base64 | Metadata only | pdfjs doesn't easily export encoded images |
| Heading detection | `page.get_text("dict")` | `getTextContent()` items | Both analyze font sizes |
{
  "version": "1.0",
  "framework": "TreeDex",
  "tree": [
    {
      "structure": "1",
      "title": "Introduction",
      "physical_index": 0,
      "node_id": "0001",
      "nodes": [
        {
          "structure": "1.1",
          "title": "Background",
          "physical_index": 0,
          "node_id": "0002",
          "nodes": []
        }
      ]
    }
  ],
  "pages": [
    {"page_num": 0, "text": "...", "token_count": 342}
  ]
}

LLM output is often imperfect. extract_json() tries 4 strategies:
- Direct `JSON.parse(text)` — clean output
- Code block extraction (```json ... ```) — markdown-wrapped
- Trailing comma fix (`, }` → `}`) — common LLM error
- Brace/bracket matching — find outermost `{...}` or `[...]`
Has alt text? → "[Image: alt text]"
Has vision LLM? → "[Image: LLM description]"
Vision fails? → "[Image present]"
No vision support? → "[Image present]"
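The fallback chain above maps naturally onto a small function. In this sketch `image_placeholder` is a hypothetical name and `describe` stands in for whatever vision call a backend provides; neither is part of the documented API.

```python
from typing import Callable, Optional


def image_placeholder(
    alt_text: Optional[str],
    describe: Optional[Callable[[bytes], str]],
    image_bytes: bytes,
) -> str:
    """Pick the best available text stand-in for an image."""
    if alt_text:
        return f"[Image: {alt_text}]"
    if describe is not None:
        try:
            return f"[Image: {describe(image_bytes)}]"
        except Exception:
            # A failed vision call degrades to the generic marker.
            pass
    return "[Image present]"


def failing_vision(_: bytes) -> str:
    raise RuntimeError("vision backend unavailable")


with_alt = image_placeholder("revenue chart", None, b"")
with_vision = image_placeholder(None, lambda data: "a bar chart", b"\x89PNG")
vision_failed = image_placeholder(None, failing_vision, b"\x89PNG")
no_vision = image_placeholder(None, None, b"\x89PNG")
```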
- HTTP errors include status code and response body
- Timeout: 120s default for fetch-based backends
- No automatic retries (user controls retry logic)
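Because retries are left to the caller, one option is to wrap the generate callable before handing it to a backend. The sketch below is an assumption about how a user might do this: `with_retries` is a hypothetical helper, and the `sleep` parameter is injectable only so the example runs without real delays.

```python
import time
from typing import Callable


def with_retries(
    generate: Callable[[str], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Wrap a prompt -> text callable with exponential-backoff retries."""
    def wrapped(prompt: str) -> str:
        for attempt in range(attempts):
            try:
                return generate(prompt)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * 2 ** attempt)
        raise RuntimeError("attempts must be at least 1")
    return wrapped


calls: list[str] = []
delays: list[float] = []


def flaky(prompt: str) -> str:
    """Simulated backend that fails twice, then succeeds."""
    calls.append(prompt)
    if len(calls) < 3:
        raise RuntimeError("HTTP 503")
    return "ok"


reply = with_retries(flaky, attempts=3, base_delay=0.5, sleep=delays.append)("hello")
```

Since `FunctionLLM` wraps any callable, a retrying function like this could presumably be passed to it directly, for example `FunctionLLM(with_retries(my_api.call))`.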
try:
    result = index.query("question")
except ValueError as e:
    # "No LLM provided. Pass llm= to query() or TreeDex constructor."
    print(e)

index = TreeDex.from_file("doc.pdf", llm, max_tokens=4000, overlap=2)

A small max_tokens means more page groups and more LLM calls, but each call fits within the context window. Increase overlap to 2 for better section boundary detection.

index = TreeDex.from_file("doc.pdf", llm, max_tokens=100000)

A large max_tokens means fewer groups and fewer LLM calls. A 300-page document might fit in 3-5 groups instead of 56.

llm = GeminiLLM(api_key)  # Must support vision
index = TreeDex.from_file("slides.pdf", llm, extract_images=True)

Images are described by the vision LLM and appended as [Image: description] to page text.
Heading detection is automatic when no ToC is found. To force it even with a custom loader:
from treedex.loaders import PDFLoader

loader = PDFLoader(detect_headings=True)
pages = loader.load("doc.pdf")
index = TreeDex.from_pages(pages, llm)

- Multi-column layouts — Font-size detection works but column order may be wrong
- Dense tables — Treated as regular text; no special table parsing
- RTL languages — Text extraction works but heading detection may miss markers
- Streaming — LLM responses must be complete (no streaming support)
- Concurrent queries — Not thread-safe; use separate `TreeDex` instances
- Context overflow — Retrieved context isn't capped; may exceed model limits for very large sections

- Single-page sections — `end_index` clamped to `>= start_index`
- Empty pages — Token count is 0; grouping skips gracefully
- No ToC — Falls back to LLM extraction with heading markers
- No headings detected — Falls back to plain text (no font-size variation)
- Orphaned sections — Synthetic parents auto-inserted
- Malformed LLM JSON — Multi-pass extraction with fallbacks
- Missing node IDs in query — Silently skipped (no crash)
- PDF without text — Empty pages; images can still be described via vision LLM

- Automatic section splitting — Large sections aren't auto-split; use `find_large_sections()` to identify them
- Query result deduplication — Overlapping page ranges aren't merged
- Incremental indexing — Re-index entire document on changes
- Page-level granularity — Minimum unit is a page; sub-page sections share the full page text
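Since overlapping page ranges in a query result are not merged, a caller who wants deduplicated ranges could post-process them. The helper below is a hypothetical sketch operating on `(start, end)` tuples in the shape of `page_ranges`; it is not part of TreeDex.

```python
def merge_page_ranges(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping or directly adjacent inclusive page ranges."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


merged = merge_page_ranges([(12, 15), (5, 8), (7, 10)])
```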
| Operation | Time | Space |
|---|---|---|
| `extract_pages()` | O(pages) | O(pages × text) |
| `extract_toc()` | O(toc entries) | O(entries) |
| `group_pages()` | O(pages) | O(groups × text) |
| `list_to_tree()` | O(n) | O(n) |
| `repair_orphans()` | O(n × depth) | O(inserts) |
| `assign_page_ranges()` | O(n) | O(1) in-place |
| `assign_node_ids()` | O(n) | O(1) in-place |
| `embed_text_in_tree()` | O(n × pages_per_node) | O(text) |
| `create_node_mapping()` | O(n) | O(n) |
| `query()` | O(1 LLM call + n) | O(context text) |
| `save()` | O(n + pages) | O(JSON size) |
| `load()` | O(JSON size) | O(n + pages) |
Where n = number of tree nodes, typically 10-500 for most documents.