Lemmata

A multilingual LDA topic modeling platform for humanities researchers.

What is Lemmata?

Lemmata is a browser-based tool that lets humanities researchers perform LDA (Latent Dirichlet Allocation) topic modeling on literary and historical texts — without writing a single line of code.

Upload your texts, choose your language, adjust parameters, and download reproducible results. Everything runs in the browser.

Supported languages: Italian, English, German, French, Spanish

Quick start

No installation required. Open the platform in your browser:

lemmata.app

1. Select your language and POS filter in the sidebar.
2. Upload one or more files (.txt, .pdf, .odt, .docx, .epub) or paste text directly.
3. Click Run Analysis.
4. Explore the results across seven tabs (Overview, Topics, Topic Map, Heatmap, Distribution, Preprocessing, Export).
5. Download all outputs as a ZIP file or generate a PDF report.

Features

Feature	Description
Multilingual NLP	Five languages with dedicated spaCy pipelines and language-specific stopwords
POS filtering	Presets (content words, content + verbs, all open classes) or custom selection
Custom stopwords	Add domain-specific stopwords on top of the built-in lists
Coherence scoring	C_v coherence metric with interpretive guidance (good / fair / weak)
Interactive charts	Altair-based interactive visualizations with hover details
Topic Map	pyLDAvis visualization with graceful fallback
Preprocessing trace	Token-level table showing every step (original, lemma, kept/removed, reason)
Deterministic results	random_state=42 and batch learning ensure identical results on repeated runs
PDF report	Auto-generated analysis report suitable for course assignments or publications
ZIP export	All outputs (CSV, JSON, PNG, SVG, environment report) in one download
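
The POS filtering, custom stopword, and preprocessing trace rows above can be illustrated with a minimal sketch. It assumes tokens have already been tagged and lemmatized, as a spaCy pipeline would do; the `Token` class, the `CONTENT_WORDS` preset, and the `trace_tokens` function are hypothetical names chosen for illustration and are not taken from Lemmata's code.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a spaCy token: only the fields the trace needs.
@dataclass(frozen=True)
class Token:
    text: str
    lemma: str
    pos: str  # Universal POS tag, e.g. "NOUN", "VERB", "DET"


# Assumed preset; the platform's actual "content words" preset may differ.
CONTENT_WORDS = {"NOUN", "PROPN", "ADJ"}


def trace_tokens(tokens, allowed_pos, stopwords):
    """Return one trace row per token: original, lemma, kept flag, reason."""
    rows = []
    for tok in tokens:
        lemma = tok.lemma.lower()
        if tok.pos not in allowed_pos:
            kept, reason = False, f"POS {tok.pos} not in filter"
        elif lemma in stopwords:
            kept, reason = False, "stopword"
        else:
            kept, reason = True, ""
        rows.append(
            {"original": tok.text, "lemma": lemma, "kept": kept, "reason": reason}
        )
    return rows


tokens = [
    Token("The", "the", "DET"),
    Token("poets", "poet", "NOUN"),
    Token("wrote", "write", "VERB"),
    Token("things", "thing", "NOUN"),
]
# "thing" plays the role of a user-supplied custom stopword.
rows = trace_tokens(tokens, CONTENT_WORDS, stopwords={"thing"})
kept_lemmas = [r["lemma"] for r in rows if r["kept"]]
```

Every token produces a row whether it is kept or removed, so a reader can check why a given word did not reach the model.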

Local installation

1. Clone the repository

git clone https://github.com/oguzkoran-max/lemmata.git
cd lemmata

2. Create a virtual environment (recommended)

python -m venv .venv
source .venv/bin/activate   # Linux/macOS
.venv\Scripts\activate      # Windows

3. Install dependencies

pip install -r requirements.txt

4. Download spaCy language models

python -m spacy download it_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
python -m spacy download fr_core_news_sm
python -m spacy download es_core_news_sm

5. Run the application

streamlit run src/lemmata/app.py

The application opens at http://localhost:8501.

Architecture

Lemmata follows a modular design with strict separation of concerns. The UI layer (app.py) contains no business logic.

lemmata/
├── app.py              # Streamlit UI only
├── config.py           # Constants, language configs, slider ranges
├── preprocessing.py    # spaCy NLP pipeline, POS filtering, lemmatization, trace
├── modelling.py        # LDA (scikit-learn), coherence (Gensim), corpus stats
├── visualisation.py    # Charts, wordclouds, pyLDAvis (zero st.* calls)
├── file_io.py          # File readers, ZIP/PDF export, environment report
├── requirements.txt    # Pinned dependencies
├── ARCHITECTURE.md     # 200 design decisions
├── prompts/            # Vibe coding development logs
└── tests/              # Pytest test suite
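
To illustrate the separation between UI and logic, here is a minimal sketch of an export helper that makes no Streamlit calls and simply returns bytes. The function name `build_zip` and the file names are assumptions for illustration; the actual contents of file_io.py are not shown in this README.

```python
import io
import json
import zipfile


def build_zip(files):
    """Pack a mapping of archive name -> bytes into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Sort names so the archive layout does not depend on dict order.
        for name, payload in sorted(files.items()):
            archive.writestr(name, payload)
    return buffer.getvalue()


# Illustrative payloads only; real outputs would come from the analysis.
outputs = {
    "topics.csv": b"topic,word,weight\n0,love,0.12\n",
    "parameters.json": json.dumps({"random_state": 42}).encode("utf-8"),
}
zip_bytes = build_zip(outputs)

# Read the archive back to confirm what it contains.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
```

Because the helper returns plain bytes, a UI layer could hand the result to a widget such as Streamlit's `st.download_button` without the helper itself importing Streamlit.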

Design decisions:

scikit-learn LDA over Gensim: deterministic output with random_state=42.
Gensim CoherenceModel used independently for C_v evaluation.
Language configurations were reviewed and approved by a corpus linguistics specialist (Assoc. Prof. Dr. Hakan Cangır).

Reproducibility

Fixed random state: random_state=42 across all stochastic processes.
Batch learning: learning_method='batch' eliminates document-order effects.
Environment report: Every analysis exports Python version, package versions, and all parameters.
Same data + same parameters + same package versions = same results, verifiable against the exported environment report.
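
A minimal sketch of what an environment report could look like, using only the standard library. The function name `build_environment_report`, the report keys, and the parameter names are assumptions for illustration; the format Lemmata actually exports is not specified in this README.

```python
import json
import platform
from importlib import metadata


def build_environment_report(packages, parameters):
    """Collect Python version, installed package versions, and run parameters."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": platform.python_version(),
        "packages": versions,
        "parameters": dict(parameters),
    }


report = build_environment_report(
    packages=["scikit-learn", "gensim", "spacy"],
    # Hypothetical parameter names; the real analysis settings may differ.
    parameters={"n_topics": 5, "random_state": 42, "learning_method": "batch"},
)
report_json = json.dumps(report, indent=2, sort_keys=True)
```

Storing the report next to the results lets a later reader recreate the same package versions before trying to reproduce a run.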

How it was built: vibe coding

Lemmata was developed entirely through vibe coding — a researcher with no programming expertise communicated requirements to a large language model exclusively in natural language, and the LLM generated all code. Development was carried out using Claude Code (Anthropic).

Three safeguards ensured methodological rigor:

Full prompt-response documentation. Every prompt and response is archived in the prompts/ directory and published as supplementary material.
Expert validation at every decision point. Domain-specific decisions followed a human-in-the-loop model: corpus linguistics choices were reviewed by a specialist; literary interpretations were provided by a scholar. Technical code generation operated under a human-on-the-loop model: the LLM produced code, the principal investigator tested outputs.
Automated testing. A 120-test suite with determinism verification ensures correctness and reproducibility.

An accompanying article is in preparation for submission to Digital Scholarship in the Humanities (Oxford University Press).

Citation

@software{koran_lemmata_2026,
  author       = {Koran, Oğuz and Cangır, Hakan and Yücesan, Barış},
  title        = {Lemmata: A Multilingual LDA Topic Modeling Platform for Digital Humanities},
  year         = {2026},
  url          = {https://lemmata.app}
}

Click "Cite this repository" on the GitHub page for an auto-generated citation from CITATION.cff.

Contributing

Contributions are welcome. See CONTRIBUTING.md.

License

MIT License. See LICENSE.

Acknowledgments

Lemmata is developed at Ankara University, School of Foreign Languages (Italian Language and Literature) and Faculty of Languages, History and Geography — DTCF (Italian Language and Literature).

The entire codebase was generated through LLM-assisted development using Claude and Claude Code by Anthropic. Built with spaCy, scikit-learn, Gensim, Altair, and Streamlit.
