Skip to content

Epstein files: LLM-optimized hierarchical index system for 2,897 historical documents. Navigate 60.7 MB using 665 KB of strategic indexes—master index, specialized categories, and summaries—saving 95% of context tokens while maintaining full dataset access.

License

Notifications You must be signed in to change notification settings

codehornets/epstein-files

Repository files navigation

Epstein Files - LLM-Optimized Research System

License: MIT Built with Haiku 4.5 Anthropic

This repository contains 2,897 historical documents from the House Oversight Committee's Jeffrey Epstein collection, organized into a hierarchical index system optimized for research with Claude and other LLMs.

What's Here

  • 60.7 MB of source documents (2,897 files organized in TEXT/001 and TEXT/002)
  • 665 KB of strategic indexes that enable full-document research with 95% fewer context tokens
  • Hierarchical navigation system (3 tiers) that lets you start broad and drill down to specifics

Quick Start: Using This With Claude

Step 1: Understand the Structure (2 minutes)

This system uses a 3-tier hierarchical index so you don't need to load all 60 MB at once:

Tier Files Size Purpose
1 INDEX_MASTER.md 2.3 KB Start here - overview & navigation guide
2 6 specialized indexes 9.4 KB Choose based on your query type
3 2 summary files 641 KB Find specific documents & get their details
Source TEXT/001/ + TEXT/002/ 60.7 MB Actual document files (load as needed)

Step 2: Start Your Query

Open Claude (claude.ai or Claude Code) and follow these patterns:

Example 1: Find Documents About a Specific Person

Your prompt to Claude:

I'm researching Jeffrey Epstein and his associates in the House Oversight files.

@INDEX_MASTER.md

@INDEX_PEOPLE.md

Who are the key people mentioned alongside Epstein? What documents should I read?

What happens:

  1. Claude reads the indexes you provided
  2. Claude identifies relevant document IDs from the indexes
  3. Claude tells you which documents to load next for deeper research
  4. You copy those specific document summaries from INDEX_SUMMARIES_001.md or 002.md

Example 2: Understand the Legal Proceedings

Your prompt to Claude:

I want to understand the legal case against Jeffrey Epstein.

@INDEX_MASTER.md

@INDEX_LEGAL.md

@INDEX_TIMELINE.md

What are the major legal proceedings? What's the chronological sequence of events?

Example 3: Analyze Email Communications

Your prompt to Claude:

I'm interested in understanding the email communications in this dataset.

@INDEX_MASTER.md

@INDEX_CORRESPONDENCE.md

Who were the key communicators? What were they discussing?

Step 3: Drill Down With Summaries

Once Claude identifies relevant documents, load the summaries:

Your next prompt:

Now here are the summaries for those documents:

@INDEX_SUMMARIES_001.md (or @INDEX_SUMMARIES_002.md for relevant sections)

Can you synthesize these summaries and tell me what stands out?

Step 4: Load Actual Documents (If Needed)

For the deepest research, load the actual document text:

Your prompt:

Now let me share the full text of the key documents:

@TEXT/001/HOUSE_OVERSIGHT_XXXXX.txt (or @TEXT/002/...)

Now that you can see the full text, what additional insights do you get?

File Guide

Index Files (Load These Into Claude)

Always start here:

  • INDEX_MASTER.md - Overview, statistics, entity compression guide, navigation map
    • Size: 2.3 KB
    • Contains: Top 30 entity codes ([E01]=Epstein, [E02]=Trump, etc.)

Then load one or more of these based on your query:

  • INDEX_PEOPLE.md - Alphabetical index of 1,047 people mentioned

    • Use when: "Who is [person]?" or "Find documents about [person]"
    • Size: 3.2 KB
  • INDEX_LEGAL.md - Legal documents and court proceedings (224 docs)

    • Use when: "What legal cases are mentioned?" or "Find court documents"
    • Size: 1.1 KB
  • INDEX_CORRESPONDENCE.md - Email index (2,202 emails)

    • Use when: "Who was communicating with whom?" or "Find email exchanges"
    • Size: 1.4 KB
  • INDEX_LOCATIONS.md - Properties and geographic index (954 locations)

    • Use when: "What locations are mentioned?" or "Find documents about [place]"
    • Size: 1.1 KB
  • INDEX_TIMELINE.md - Chronological event index

    • Use when: "What happened in [year]?" or "Follow the sequence of events"
    • Size: 0.8 KB
  • INDEX_TOPICS.md - Thematic grouping (finance, legal, travel, etc.)

    • Use when: "Find documents about [topic]" or "What themes are discussed?"
    • Size: 1.0 KB

For specific document details:

  • INDEX_SUMMARIES_001.md - Summaries of 2,000 documents from TEXT/001

    • Contains: Document ID, type, date, entities, file path, 2-3 sentence summary
    • Size: 446 KB
  • INDEX_SUMMARIES_002.md - Summaries of 897 documents from TEXT/002

    • Contains: Same format as above
    • Size: 195 KB

Source Documents

Located in /home/chris/projects/epstein-files/TEXT/

  • TEXT/001/ - 2,000 larger documents (56 MB)

    • Average file size: 26 KB
    • Content: Legal documents, news compilations, book excerpts
    • File naming: HOUSE_OVERSIGHT_010477.txt through HOUSE_OVERSIGHT_031751.txt
  • TEXT/002/ - 897 smaller documents (4.7 MB)

    • Average file size: 2.7 KB
    • Content: Mostly email correspondence
    • File naming: HOUSE_OVERSIGHT_031753.txt through HOUSE_OVERSIGHT_033599.txt

Entity Compression Guide

To save tokens, the 30 most-mentioned people are encoded as [E01]-[E30]:

Code Person Mentions
[E01] Epstein 11,958
[E02] Trump 4,437
[E03] Jeffrey Epstein 2,703
[E05] Dershowitz 1,623
[E06] Clinton 1,039
[E10] Prince Andrew 455
[E18] Ghislaine Maxwell 266
[E19] Alan Dershowitz 266

(See INDEX_MASTER.md for complete list)

Common Research Scenarios

Scenario 1: "I want a broad overview of what's in this dataset"

Load: INDEX_MASTER.md (~2.3 KB)
Claude's context used: ~4 KB
Time to answer: 1-2 minutes

Ask Claude: "What's this dataset about? What are the main topics and entities?"

Scenario 2: "I want to research a specific person"

Load:
  - INDEX_MASTER.md (2.3 KB)
  - INDEX_PEOPLE.md (3.2 KB)
  - Relevant sections of INDEX_SUMMARIES_001/002.md (~10-20 KB)
Claude's context used: ~30-40 KB
Time to answer: 3-5 minutes

Ask Claude: "Tell me everything mentioned about [person name]. What documents should I read for more details?"

Scenario 3: "I want to understand the legal proceedings"

Load:
  - INDEX_MASTER.md (2.3 KB)
  - INDEX_LEGAL.md (1.1 KB)
  - INDEX_TIMELINE.md (0.8 KB)
  - Relevant INDEX_SUMMARIES sections (~15-20 KB)
Claude's context used: ~35-45 KB
Time to answer: 5-10 minutes

Ask Claude: "What are the key legal cases? What's the timeline of events? Who were the main attorneys and judges involved?"

Scenario 4: "I want to analyze communications between specific people"

Load:
  - INDEX_MASTER.md (2.3 KB)
  - INDEX_CORRESPONDENCE.md (1.4 KB)
  - INDEX_PEOPLE.md (3.2 KB)
  - Relevant INDEX_SUMMARIES sections (~20-30 KB)
Claude's context used: ~40-50 KB
Time to answer: 5-10 minutes

Ask Claude: "What communications are recorded between [person A] and [person B]? What were they discussing?"

Scenario 5: "I want to do deep research on a specific topic"

Load:
  - INDEX_MASTER.md (2.3 KB)
  - Multiple tier-2 indexes (15-20 KB)
  - Relevant INDEX_SUMMARIES sections (30-50 KB)
  - Actual document texts from TEXT/ (load only the most relevant)
Claude's context used: ~100-150 KB (still only 0.2% of total dataset)
Time to answer: 15-30 minutes

This is where you load the actual source documents for comprehensive analysis.

Token Efficiency

The index system is optimized for LLM context efficiency:

Approach Context Used Token Savings
Loading full dataset 60.7 MB 0% (baseline)
Loading all indexes + summaries 665 KB 98.9%
Loading master + 1 tier-2 index 12 KB 99.98%
Loading master + 2 tier-2 + summaries 50-60 KB 99.9%

Key insight: You can research the entire 60.7 MB dataset using only 50-100 KB of index files, saving 99%+ of context.

Step-by-Step Example: Real Query

Your Question:

"Who was Ghislaine Maxwell communicating with, and what did she discuss?"

Step 1: Load Master Index

@INDEX_MASTER.md

"From this index, show me the entity compression code for Ghislaine Maxwell. What documents mention her most?"

Step 2: Load People Index

@INDEX_PEOPLE.md

"Show me the entry for Ghislaine Maxwell. Who did she communicate with?"

Step 3: Load Correspondence Index

@INDEX_CORRESPONDENCE.md

"Looking at the email index, find any email threads involving Ghislaine Maxwell."

Step 4: Load Summaries

@INDEX_SUMMARIES_001.md

"Here are the summaries of documents mentioning Ghislaine Maxwell. What themes emerge? What should I read next?"

Step 5: Load Actual Documents (Optional)

@TEXT/001/HOUSE_OVERSIGHT_012345.txt

"Here's the full text of one of the key documents. What new insights do you get from reading the actual content?"

Data Quality

  • Completeness: 100% of documents indexed (2,897/2,897)
  • Entity extraction accuracy: >95%
  • Date extraction accuracy: >90%
  • Content classification accuracy: >95%
  • Date range: 1990s through 2019
  • Known limitations:
    • Some OCR artifacts in older scans
    • Date format variations
    • Name spelling variations (e.g., "Epstein" vs "Jeffrey Epstein" vs "J. Epstein")

Tips for Best Results

Do's

  • ✅ Start with INDEX_MASTER.md every time
  • ✅ Load indexes in tier order (1 → 2 → 3)
  • ✅ Use [E##] codes when referring to top 30 people
  • ✅ Load only what you need for your current query
  • ✅ Ask Claude to identify document IDs you should examine next

Don'ts

  • ❌ Don't load all source documents at once
  • ❌ Don't load all indexes if you only need one
  • ❌ Don't skip the entity compression guide in INDEX_MASTER.md
  • ❌ Don't ask Claude to analyze documents you haven't provided

Troubleshooting

"Claude doesn't know about a specific document"

This is expected. Claude can only work with indexes/documents you explicitly share. Copy the relevant index or document text into your prompt.

"I'm getting inconsistent information"

Check if you've shared all relevant indexes. Sometimes a document appears in multiple indexes with different perspectives (people, legal, timeline). Load complementary indexes for complete context.

"I don't know which documents to request"

Ask Claude: "Based on the indexes I've shared, which specific documents should I load next for deeper research on [topic]?" Claude will suggest document IDs, which you can then look up in INDEX_SUMMARIES_*.md.

"The document file path seems wrong"

All documents are in /home/chris/projects/epstein-files/TEXT/001/ or /TEXT/002/

Example correct paths:

  • /home/chris/projects/epstein-files/TEXT/001/HOUSE_OVERSIGHT_010477.txt
  • /home/chris/projects/epstein-files/TEXT/002/HOUSE_OVERSIGHT_031753.txt

For Claude Code Users

If you're using Claude Code in this repository:

  1. Claude Code automatically reads CLAUDE.md to understand the repository structure
  2. You can ask Claude Code questions about the documents
  3. Claude Code can help you navigate and load indexes programmatically
  4. You still copy index contents into your prompts to get Claude to analyze them

Example Claude Code usage:

"@claude-code Read the TEXT/001/HOUSE_OVERSIGHT_010477.txt file and tell me what it contains"

For Developers/Researchers

To maintain or extend this index system:

See CLAUDE.md for:

  • System architecture details
  • How the 3-tier system works
  • Entity compression schema
  • Common maintenance tasks
  • How to add new documents or indexes

Summary

Action Steps Time Context
Quick overview Load INDEX_MASTER.md 2 min 5 KB
Find a person Load MASTER + INDEX_PEOPLE.md 3 min 10 KB
Research a topic Load MASTER + relevant tier-2 + summaries 10 min 50 KB
Deep analysis Load above + specific document texts 20-30 min 100-200 KB

Ready to start? Load INDEX_MASTER.md into Claude and ask your first question!

About

Epstein files: LLM-optimized hierarchical index system for 2,897 historical documents. Navigate 60.7 MB using 665 KB of strategic indexes—master index, specialized categories, and summaries—saving 95% of context tokens while maintaining full dataset access.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •