Lemmata — Architecture Decisions (200 Questions)

Date: 27 March 2026
Status: Pre-development planning complete

This document records 200 architectural, design, and strategic decisions made through structured dialogue before any code was written. It serves as both a development reference and a methodological artifact for the accompanying DSH article.

Core Architecture (Decisions 1-10)

Interface language: English v1, i18n-ready for v2
Help system: Tooltip (short, technical) + expander (pedagogical, collapsed by default)
File formats: txt, pdf, odt, docx, epub
Upload mode: Auto-detect — 1 file → chunk, multiple → separate documents
Result tabs: 7 — Overview (new), Topics, Topic Map, Heatmap, Distribution, Preprocessing, Export
Auto topic suggestion: Future version (sweep_coherence() infrastructure ready)
Theme: White/light, academic, clean
Error handling: Clear messages + empty result guide + progress bar + summary box
Report: Basic PDF (cover + params + summary + topics + heatmap + text excerpts + environment)
Comparative analysis: Future version (ZIP export for manual comparison)

Parameters (11-20)

Chunk size: Slider 300-3000, default 1000, guide info box
POS filter: User selectable, default NOUN/PROPN/ADJ (Hakan-approved)
Topic count: Slider 2-15, default 5
Stopwords: spaCy built-in + user custom (no pre-built academic list)
Topic visualization: Bar chart (default) + wordcloud toggle
pyLDAvis: Optional, try-except + 2D scatter fallback
Preprocessing trace: Summary (default) + detailed token table in expander
Metrics: C_v in Overview; C_v + Perplexity + Log-likelihood in Corpus Stats
Words per topic: Slider 5-30, default 10
Logo: Simple typographic, lambda (λ) symbol + "Lemmata" + tagline, SVG

Logging & Infrastructure (21-30)

Log production: Claude Code auto-generates, Oğuz verifies
Error logging: Meaningful errors logged (transparency)
Log structure: Per-session files + PROMPT_LOG_INDEX.md
Citation in app: Sidebar expander ("How to cite" + BibTeX)
min_df/max_df: Auto-adjust by corpus size + Advanced slider
First visit: Welcome screen + 3-step guide + TM explainer + sample data link
Deploy: Streamlit Cloud + local installation supported
Data source: File upload only (no URL fetching)
User profiles: Single mode, progressive disclosure
Data storage: In-memory during session, deleted on session end

Detail Decisions (31-40)

File size: 50MB/file, 100MB total. No limit on local.
Chunking: Word count target + sentence boundary respect (spaCy sentence detection)
Topic labels: "Topic 1 (vita, morte, uomo)" + user-editable
Chart interactivity: Altair interactive (hover). Wordcloud matplotlib static.
Analysis progress: st.status() step-by-step updates
Error language: English v1 (i18n v2)
Pre-spaCy cleaning: Line-end hyphen joining + multi-space + Unicode NFC
N-grams: Unigram only. Bigram future (NGRAM_RANGE constant ready).
Topic interpretation guide: Expander in Topics tab ("How to interpret topics")
PDF report content: Cover + params + summary + topics + heatmap + text excerpts + environment
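The pre-spaCy cleaning step (line-end hyphen joining, multi-space collapsing, Unicode NFC) might look roughly like this. It is a sketch using only the standard library; the function name `clean_text` and the exact regular expressions are assumptions, not the project's code.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    # Compose characters first so "e" + combining accent becomes one code point.
    text = unicodedata.normalize("NFC", raw)
    # Join a word split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs; newlines are left untouched.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text
```

One known limitation of this approach: a genuine hyphenated compound that happens to break at a line end is also joined, so the rule trades a few wrong joins for many repaired words.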

Accessibility & Edge Cases (41-50)

Coherence display: Color-coded background (green/yellow/red) + text + action suggestion. No emoji.
Document labels: Filename without extension. Chunks: [filename]_001 format.
Page title: "Lemmata — Multilingual Topic Modeling", page_icon="📊"
Re-run: Previous results cleared + warning ("Download first if needed")
Export options: ZIP (main) + PDF report button + individual file downloads
Environment report: Full detail (version + packages + params + seed + corpus)
Version display: Sidebar footer "v0.1.0 · GitHub · MIT License"
Navigation: Hamburger menu hidden via CSS, sidebar-focused
Analytics: None (privacy)
Testing: Pytest, per-module, determinism test critical
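The chunking rule recorded earlier (word count target, sentence boundaries respected) and the chunk label format above can be illustrated together. This sketch assumes sentences have already been split (in the app, by spaCy sentence detection); the function names are hypothetical.

```python
from pathlib import Path


def chunk_sentences(sentences, target_words=1000):
    """Group sentences into chunks near target_words without splitting a sentence."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the target.
        if current and count + n_words > target_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_labels(filename, n_chunks):
    """Labels in the [filename]_001 format, using the filename without extension."""
    stem = Path(filename).stem
    return [f"{stem}_{i:03d}" for i in range(1, n_chunks + 1)]
```

A single sentence longer than the target still becomes its own chunk, since sentences are never split.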

Advanced Technical (51-60)

Topic word detail: Bar chart hover + table (word + weight). Corpus frequency in expander.
spaCy models: sm only. lg future (MODEL_SIZE constant ready).
Model loading: On "Run Analysis", @st.cache_resource cached.
Multi-file processing: Each file preprocessed separately, merged in DTM.
Analysis trigger: Sidebar "Run Analysis" button (disabled without files).
CSV export: topic_words.csv + doc_topic_matrix.csv + preprocessing_summary.csv + metrics.json
Caching: @st.cache_resource (model) + @st.cache_data (results)
Language mismatch: Low token ratio (<10%) → automatic warning
Stopword transparency: Preprocessing summary shows "Stopwords removed: N (built-in: X, custom: Y)"
Feedback: GitHub Issues, sidebar link "Report a bug · Request a feature"
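The stopword transparency line could be produced by a small helper such as the one below. This is a sketch: the function name is hypothetical, and attributing a token that appears in both lists to the built-in list is an assumption.

```python
def stopword_summary(tokens, builtin, custom):
    """Count removed tokens by source and format the preprocessing summary line."""
    n_builtin = sum(1 for t in tokens if t in builtin)
    # A token found in both lists is attributed to the built-in list only.
    n_custom = sum(1 for t in tokens if t not in builtin and t in custom)
    total = n_builtin + n_custom
    return f"Stopwords removed: {total} (built-in: {n_builtin}, custom: {n_custom})"
```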

Platform Quality (61-70)

Accessibility: Color-blind-friendly palette (viridis/tableau10) + table alternatives
Mobile: Streamlit default responsive + small screen info note
Mixed language: Not supported, single language enforced, info box explanation
Empty file: Skip + warning, continue with remaining files
File preview: Metadata (size, word count) + first 200 words
Color palette: Categorical tableau10, heatmap viridis
Visual export: PNG 300 DPI + SVG (wordcloud PNG only)
Session state: st.session_state preserves results; F5 clears + warning note
Tab loading: Core tabs immediate, pyLDAvis lazy
Versioning: Semantic (0.1.0), CHANGELOG.md, Zenodo concept DOI

Svevo & DSH (71-76) — Deferred to article phase

71-76. Svevo pilot design, text cleaning, figures, supplementary material, MALLET comparison — to be decided during article writing phase.

Deployment & Strategy (77-90)

Streamlit Cloud: Free plan + lemmata.app landing page redirect
Pre-deploy check: Detailed per-module test + deploy checklist
Documentation: README sufficient for v1
CI/CD: GitHub Actions pytest on push, Streamlit auto-deploy on main
Sample data: examples/ folder with short public domain Italian texts
Feature flags: None, all features active in v1
Configuration: All from config.py (no .env, no CLI args)
Application logging: Python logging module, console only, INFO default
Vibe coding docs: README "How it was built" + prompts/ folder
Academic usage guide: Short expander in welcome screen
Landing page: Simple GitHub Pages single page
Prompt log privacy: Public, no sensitive info
Repo timing: Public as soon as platform ready
Release notes: Detailed, structured, Zenodo reads this
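The application logging decision (Python logging module, console only, INFO default) could be set up along these lines. The logger name `"lemmata"`, the function name, and the format string are assumptions.

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Console-only logger; safe to call again on every Streamlit rerun."""
    logger = logging.getLogger("lemmata")
    logger.setLevel(level)
    # Streamlit re-executes the script on each interaction, so avoid
    # attaching a second handler when one is already present.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False
    return logger
```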

Interaction Details (91-100)

Upload UX: Streamlit multi-uploader + text paste area (expander)
Pre-LDA analysis: Top 20 frequent lemmas bar chart in Overview
Topic-text matching: Representative document excerpt per topic. Color highlighting v2.
About section: Sidebar expander (who, how, why, GitHub link)
POS presets: Dropdown (Content words / Content+verbs / All open / Custom) + multiselect
Visual customization: None in v1. Users edit exported SVG.
Vectorization: CountVectorizer only. TF-IDF future (VECTORIZER_TYPE constant).
Security: File type validation, size limit, no st.markdown unsafe_allow_html
Memory: All text in memory (50MB limit prevents issues). Large corpus → local install.
Extension vision: method parameter in modelling.py for future NMF/BERTopic
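File type validation and the size limits could be checked before any parsing. The sketch below uses the accepted formats and the 50MB/100MB limits recorded in this document; the function name, the input shape, and reading "MB" as 1024 × 1024 bytes are assumptions.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".txt", ".pdf", ".odt", ".docx", ".epub"}
MAX_FILE_BYTES = 50 * 1024 * 1024    # 50MB per file (assumed binary megabytes)
MAX_TOTAL_BYTES = 100 * 1024 * 1024  # 100MB across all files


def validate_uploads(files):
    """files: list of (filename, size_in_bytes). Returns a list of error messages."""
    errors = []
    total = 0
    for name, size in files:
        if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
            errors.append(f"{name}: unsupported file type. Try txt, pdf, odt, docx, or epub.")
            continue
        if size > MAX_FILE_BYTES:
            errors.append(f"{name}: larger than 50MB. Try splitting it or a local install.")
            continue
        total += size
    if total > MAX_TOTAL_BYTES:
        errors.append("Upload exceeds 100MB in total. Try fewer files or a local install.")
    return errors
```

An empty list means every file passed; rejected files do not count toward the total.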

Competitor Analysis & USP (101-110)

Competitors: MALLET (CLI, no GUI), Voyant (no LDA), Gensim (code required). Lemmata unique: preprocessing trace + no-code + deterministic + documented development.
USP: "Browser-based topic modeling where you see exactly what happened to every word."
DTM transparency: Overview shows vocabulary size, terms removed by min_df/max_df, final DTM dimensions.
Document length imbalance: Informational only (show chunk counts per document).
spaCy error tolerance: Per-token try-except, log issues, continue.
Minimum corpus: Warning if <50 unique lemmas, no blocking.
Analysis history: Last analysis only. Previous → ZIP download.
Download points: Per-tab download icons + Export tab ZIP.
Tab customization: None, fixed 7 tabs.
Sub-corpus: Future version (metadata-based filtering).
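The DTM transparency figures (vocabulary size, terms removed by min_df/max_df, final dimensions) can be derived from document frequencies alone. This standard-library sketch mirrors the usual reading of the two thresholds, with min_df as a document count and max_df as a proportion of documents; the function name and default values are hypothetical.

```python
from collections import Counter


def dtm_report(docs, min_df=2, max_df=0.95):
    """docs: list of lemma lists. Returns the counts shown in the Overview."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    too_rare = {t for t, c in df.items() if c < min_df}
    too_common = {t for t, c in df.items() if c > max_df * n_docs}
    kept = set(df) - too_rare - too_common
    return {
        "vocabulary_size": len(df),
        "removed_min_df": len(too_rare),
        "removed_max_df": len(too_common),
        "dtm_shape": (n_docs, len(kept)),
    }
```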

Visual Details (111-120) — NO EMOJI except Run button, analysis message, coherence indicator

Wordcloud shape: Rectangle, white background, 800x400px.
Tokenization: spaCy default sufficient. Italian contractions handled correctly.
Chunk overlap: None. Sentence-boundary chunking sufficient.
Document classification: Dominant topic auto-assigned, shown in Distribution.
Diachronic view: Topic weight trend line chart (X=doc order, Y=weight, vertical lines at file boundaries).
Onboarding: FAQ expander sufficient (no interactive tutorial).
Download mechanism: st.download_button with key parameter + cache protection.
Additional metrics: v1 has C_v/Perplexity/Log-likelihood. Topic diversity future.
Machine-readable export: analysis_results.json in ZIP.
Tool integration: Standard CSV/JSON output. No tool-specific formats.
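Dominant topic assignment amounts to taking the highest weight in each document's row of the document-topic matrix. A minimal sketch, with a hypothetical function name and plain lists standing in for the matrix:

```python
def dominant_topics(doc_topic):
    """doc_topic: one list of topic weights per document.

    Returns the index of the highest weight per document; on a tie the
    lowest topic index wins.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
```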

Seed, Performance, Ordering (121-130)

Seed control: Editable in Advanced (number input), default 42.
Topic ordering: By corpus prevalence (descending average weight).
Performance target: Under 30 seconds for typical analysis.
Convergence: max_iter in Advanced + warning if model used all iterations.
Topic color identity: Fixed color per topic (tableau10), consistent across all tabs.
Lemmatization quality: spaCy default. Wrong lemmas → add to custom stopwords.
Screen layout: Streamlit wide layout, no extra responsive work.
Imbalanced corpus warning: 10x size difference → warning with suggestion.
Parameter reset: "Reset to defaults" link in sidebar.
Analysis naming: Auto: lemmata_{lang}_{n}topics_{date}_{time}.zip
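Ordering topics by corpus prevalence means sorting them by their average weight across all documents, highest first. The sketch below assumes the document-topic matrix is available as a list of rows; the function name is hypothetical.

```python
def order_topics_by_prevalence(doc_topic):
    """Return topic indices sorted by descending average weight over documents."""
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    mean_weight = [
        sum(row[k] for row in doc_topic) / n_docs for k in range(n_topics)
    ]
    return sorted(range(n_topics), key=lambda k: mean_weight[k], reverse=True)
```

Applying the resulting order once, before labels and colors are assigned, would keep "Topic 1" the most prevalent topic in every tab.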

Visual Design (131-140)

Logo: Lambda (λ) symbol + "Lemmata" (teal, medium weight) + "Multilingual Topic Modeling" (gray, small).
Typography: "sans serif" via config.toml (system default).
Background: Light gray #F8F9FA main, slightly darker #F0F2F6 sidebar.
Primary color: Teal #0F6E56 (buttons, sliders, active tabs).
Run button: st.button type="primary", use_container_width=True. Disabled without files.
Tab design: Streamlit default horizontal. NO emoji in tab names.
Analysis complete: st.success green box + short summary (topics, C_v, lemmas). No emoji.
Info boxes: st.info (blue background, left border) for pedagogical content.
Spacing: Streamlit default + st.divider between sections + st.container(border=True) for grouping.
File upload area: Streamlit default uploader + explanatory text above and below.

Sidebar Layout (141-150)

Sidebar sections: Bold headings + st.divider between. Logo top, Run button middle, footer bottom.
Sliders: Streamlit default with help="..." tooltip parameter.
POS filter: Preset dropdown + multiselect below.
Chart style: White background, minimal axes, light grid, sans-serif, tableau10 colors.
Heatmap values: Hover tooltip (Altair). No numbers in cells.
Chart sizes: Variable by type. Bar 600px, heatmap full width, wordcloud 600x400.
Long tables: st.dataframe height=400, virtual scrolling, sortable/filterable.
Wordcloud info: Topic name above, nothing below. Download via tab-level icon.
Topics layout: Topic selector (dropdown/buttons) → one topic at a time, full area.
Heatmap size: Dynamic height=max(200, n_docs*25), dynamic width=max(300, n_topics*60).
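The heatmap sizing rule can be written as a small helper, reading it as 25 pixels per document and 60 pixels per topic with the stated minimums. The function name is hypothetical.

```python
def heatmap_size(n_docs, n_topics):
    """Return (width, height) in pixels for the document-topic heatmap."""
    height = max(200, n_docs * 25)
    width = max(300, n_topics * 60)
    return width, height
```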

Landing Page & Strategy (151-160)

Landing page structure: Single page, scroll sections, anchor nav.
Landing page content: Hero (logo + tagline + screenshot + Launch button) → Features → How it works → Citation → Footer.
Landing page design: Same teal color, consistent identity, more whitespace.
SEO: Basic meta tags + OpenGraph. Google Scholar via DSH article.
Success metrics: GitHub stars + Zenodo downloads + citations. Realistic expectations.
Promotion: DSH article + DH conference poster + academic network sharing.
Monitoring: UptimeRobot free (5-min checks, email alerts).
Landing page language: English only.
Landing page tone: Academic + open-source: "Topic modeling for the humanities — no code required."
Landing page links: Launch Lemmata, Source code, Cite, Paper.

Error Recovery & Edge Cases (161-170)

Timeout: Estimated time display before analysis. Warning if >60s expected.
PDF errors: Three specific messages (protected, scanned/image, corrupted).
Duplicate files: Same filename warning, no blocking.
Topics > documents: Warning + dynamic slider max = document_count / 2.
Character set: Full UTF-8 support. Non-Latin scripts tokenized but not lemmatized.
Empty chunks: Silently removed, noted in preprocessing summary.
Encoding: Auto-detect (chardet) → UTF-8 → Latin-1 fallback. Shown in trace.
Missing spaCy model: Auto-download attempt, clear error message if it fails.
Session recovery: URL query params preserve parameters (not files).
Text hygiene: Auto-clean BOM, null bytes, control chars, normalize line endings.
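The last two steps of the encoding fallback and the text hygiene rules could be sketched with the standard library alone. The chardet detection step is deliberately left out here, and the function names and the exact set of control characters removed are assumptions.

```python
import re


def decode_bytes(data: bytes):
    """Return (text, encoding_used): UTF-8 first, Latin-1 as the fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return data.decode("latin-1"), "latin-1"


# Null bytes and control characters, keeping tab and newline.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_hygiene(text: str) -> str:
    text = text.lstrip("\ufeff")                           # byte order mark
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # line endings
    return _CONTROL.sub("", text)
```

Returning the encoding alongside the text is what allows it to be shown in the preprocessing trace.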

Maintenance & Sustainability (171-200)

Powered by: Landing page footer only: "Built with spaCy, scikit-learn, Gensim, Streamlit"
Legal: Short privacy note in About section and landing page footer. No separate ToS.
Copyright notice: Small note under file upload: user responsible for upload rights.
Maintenance: 6-month check + critical fixes on demand.
Dependency security: GitHub Dependabot enabled.
Maintenance responsibility: Oğuz primary. Community contributions welcome.
Ethics statement: DSH article Discussion section, not README.
Data processing: "Texts processed on server during session. No permanent storage. No third-party sharing."
Git branches: Direct push to main. Feature branches when multiple contributors.
Commit messages: "P001: short description" — links to prompt log.
Dependency locking: requirements.txt with >= minimum. Environment report has exact versions.
Integration test: Smoke test — upload sample, run analysis, verify outputs.
Gitignore: Standard Python + macOS + .streamlit/secrets.toml.
Data entry points: Upload + text paste. No third option.
Concurrency: Streamlit session isolation built-in. Free plan resource limits accepted.
Sidebar status: Post-analysis summary below Run button ("5 topics, C_v: 0.58, 1,247 lemmas").
Critical error screen: st.error user-friendly message + expander with technical traceback.
Error message format: What happened + what to try. No error codes.
Custom CSS: Minimal (~15 lines) via st.markdown. Hamburger menu hide, footer hide.
Community: GitHub Discussions after DSH publication.
Open source credits: README Acknowledgments expanded list.
Acceptance test: 10-point DEPLOY_CHECKLIST.md before v0.1.0 release.
Keyboard accessibility: Streamlit default (Tab navigation built-in).
Concurrency limits: Free plan accepted. Heavy use → local install recommendation.
JOSS: After DSH acceptance, consider separate software paper.
Getting started guide: README installation section sufficient.
Python versions: >=3.10 in pyproject.toml, CI tests 3.11 only.
PDF preview: No. Direct download only.
Technology migration: Architecture already framework-agnostic (only app.py is Streamlit).
CLAUDE.md size: Critical rules in CLAUDE.md (~100 lines). Full decisions in ARCHITECTURE.md.
