Date: 27 March 2026
Status: Pre-development planning complete
This document records 200 architectural, design, and strategic decisions made through structured dialogue before any code was written. It serves as both a development reference and a methodological artifact for the accompanying DSH article.
- Interface language: English v1, i18n-ready for v2
- Help system: Tooltip (short, technical) + expander (pedagogical, collapsed by default)
- File formats: txt, pdf, odt, docx, epub
- Upload mode: Auto-detect — 1 file → chunk, multiple → separate documents
- Result tabs: 7 — Overview (new), Topics, Topic Map, Heatmap, Distribution, Preprocessing, Export
- Auto topic suggestion: Future version (sweep_coherence() infrastructure ready)
- Theme: White/light, academic, clean
- Error handling: Clear messages + empty result guide + progress bar + summary box
- Report: Basic PDF (cover + params + summary + topics + heatmap + text excerpts + environment)
- Comparative analysis: Future version (ZIP export for manual comparison)
- Chunk size: Slider 300-3000, default 1000, guide info box
- POS filter: User selectable, default NOUN/PROPN/ADJ (Hakan-approved)
- Topic count: Slider 2-15, default 5
- Stopwords: spaCy built-in + user custom (no pre-built academic list)
- Topic visualization: Bar chart (default) + wordcloud toggle
- pyLDAvis: Optional, try-except + 2D scatter fallback
- Preprocessing trace: Summary (default) + detailed token table in expander
- Metrics: C_v in Overview; C_v + Perplexity + Log-likelihood in Corpus Stats
- Words per topic: Slider 5-30, default 10
- Logo: Simple typographic, lambda (λ) symbol + "Lemmata" + tagline, SVG
- Log production: Claude Code auto-generates, Oğuz verifies
- Error logging: Meaningful errors logged (transparency)
- Log structure: Per-session files + PROMPT_LOG_INDEX.md
- Citation in app: Sidebar expander ("How to cite" + BibTeX)
- min_df/max_df: Auto-adjust by corpus size + Advanced slider
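A minimal sketch of the auto-adjust logic above. The thresholds and the helper name `auto_df_params` are illustrative assumptions, not the final specification; the returned pair maps onto scikit-learn `CountVectorizer`'s `min_df`/`max_df` parameters.

```python
def auto_df_params(n_docs: int) -> tuple[int, float]:
    """Return (min_df, max_df) defaults scaled to corpus size (assumed cutoffs)."""
    if n_docs < 10:    # tiny corpus: keep almost everything
        return 1, 1.0
    if n_docs < 100:   # small corpus: drop hapax-only and near-ubiquitous terms
        return 2, 0.95
    return 5, 0.90     # larger corpus: stricter pruning
```

The Advanced slider would simply override whichever pair this returns.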
- First visit: Welcome screen + 3-step guide + TM explainer + sample data link
- Deploy: Streamlit Cloud + local installation supported
- Data source: File upload only (no URL fetching)
- User profiles: Single mode, progressive disclosure
- Data storage: In-memory during session, deleted on session end
- File size: 50MB/file, 100MB total. No limit on local.
- Chunking: Word count target + sentence boundary respect (spaCy sentence detection)
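The chunking decision above (word-count target, sentence boundaries respected) reduces to a greedy grouping over already-segmented sentences. This sketch assumes spaCy has done the sentence detection; the function name `chunk_sentences` is hypothetical.

```python
def chunk_sentences(sentences: list[str], target_words: int = 1000) -> list[str]:
    """Group sentences into chunks of roughly target_words words,
    never splitting inside a sentence."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > target_words:
            chunks.append(" ".join(current))  # flush before overshooting target
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A sentence longer than the target still lands in its own chunk intact, which is why chunks only approximate the slider value.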
- Topic labels: "Topic 1 (vita, morte, uomo)" + user-editable
- Chart interactivity: Altair interactive (hover). Wordcloud matplotlib static.
- Analysis progress: st.status() step-by-step updates
- Error language: English v1 (i18n v2)
- Pre-spaCy cleaning: Line-end hyphen joining + multi-space + Unicode NFC
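The pre-spaCy cleaning pass above can be sketched as three transforms; the exact regexes are assumptions, and newlines are deliberately left alone so sentence detection still sees them.

```python
import re
import unicodedata

def preclean(text: str) -> str:
    """Unicode NFC, line-end hyphen joining, and multi-space collapsing (assumed order)."""
    text = unicodedata.normalize("NFC", text)      # canonical composition
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # join words hyphenated at line ends
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces/tabs
    return text
```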
- N-grams: Unigram only. Bigram future (NGRAM_RANGE constant ready).
- Topic interpretation guide: Expander in Topics tab ("How to interpret topics")
- PDF report content: Cover + params + summary + topics + heatmap + text excerpts + environment
- Coherence display: Color-coded background (green/yellow/red) + text + action suggestion. No emoji.
- Document labels: Filename without extension. Chunks: [filename]_001 format.
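The labeling scheme above is a one-liner; `chunk_label` is an assumed helper name.

```python
from pathlib import Path

def chunk_label(filename: str, index: int) -> str:
    """Filename without extension plus a zero-padded chunk index."""
    return f"{Path(filename).stem}_{index:03d}"
```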
- Page title: "Lemmata — Multilingual Topic Modeling", page_icon="📊"
- Re-run: Previous results cleared + warning ("Download first if needed")
- Export options: ZIP (main) + PDF report button + individual file downloads
- Environment report: Full detail (version + packages + params + seed + corpus)
- Version display: Sidebar footer "v0.1.0 · GitHub · MIT License"
- Navigation: Hamburger menu hidden via CSS, sidebar-focused
- Analytics: None (privacy)
- Testing: Pytest, per-module, determinism test critical
- Topic word detail: Bar chart hover + table (word + weight). Corpus frequency in expander.
- spaCy models: sm only. lg future (MODEL_SIZE constant ready).
- Model loading: On "Run Analysis", @st.cache_resource cached.
- Multi-file processing: Each file preprocessed separately, merged in DTM.
- Analysis trigger: Sidebar "Run Analysis" button (disabled without files).
- CSV export: topic_words.csv + doc_topic_matrix.csv + preprocessing_summary.csv + metrics.json
- Caching: @st.cache_resource (model) + @st.cache_data (results)
- Language mismatch: Low token ratio (<10%) → automatic warning
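The mismatch heuristic above reduces to a ratio check on how many tokens survive the pipeline. The function name and the empty-input guard are assumptions.

```python
def language_mismatch(n_kept: int, n_total: int, threshold: float = 0.10) -> bool:
    """True when the share of surviving tokens is suspiciously low,
    suggesting the selected spaCy model does not match the text language."""
    return n_total > 0 and n_kept / n_total < threshold
```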
- Stopword transparency: Preprocessing summary shows "Stopwords removed: N (built-in: X, custom: Y)"
- Feedback: GitHub Issues, sidebar link "Report a bug · Request a feature"
- Accessibility: Color-blind-friendly palette (viridis/tableau10) + table alternatives
- Mobile: Streamlit default responsive + small screen info note
- Mixed language: Not supported, single language enforced, info box explanation
- Empty file: Skip + warning, continue with remaining files
- File preview: Metadata (size, word count) + first 200 words
- Color palette: Categorical tableau10, heatmap viridis
- Visual export: PNG 300 DPI + SVG (wordcloud PNG only)
- Session state: st.session_state preserves results; F5 clears + warning note
- Tab loading: Core tabs immediate, pyLDAvis lazy
- Versioning: Semantic (0.1.0), CHANGELOG.md, Zenodo concept DOI
- Decisions 71-76 (Svevo pilot design, text cleaning, figures, supplementary material, MALLET comparison): to be decided during the article writing phase.
- Streamlit Cloud: Free plan + lemmata.app landing page redirect
- Pre-deploy check: Detailed per-module test + deploy checklist
- Documentation: README sufficient for v1
- CI/CD: GitHub Actions pytest on push, Streamlit auto-deploy on main
- Sample data: examples/ folder with short public domain Italian texts
- Feature flags: None, all features active in v1
- Configuration: All from config.py (no .env, no CLI args)
- Application logging: Python logging module, console only, INFO default
- Vibe coding docs: README "How it was built" + prompts/ folder
- Academic usage guide: Short expander in welcome screen
- Landing page: Simple GitHub Pages single page
- Prompt log privacy: Public, no sensitive info
- Repo timing: Public as soon as platform ready
- Release notes: Detailed, structured, Zenodo reads this
- Upload UX: Streamlit multi-uploader + text paste area (expander)
- Pre-LDA analysis: Top 20 frequent lemmas bar chart in Overview
- Topic-text matching: Representative document excerpt per topic. Color highlighting v2.
- About section: Sidebar expander (who, how, why, GitHub link)
- POS presets: Dropdown (Content words / Content+verbs / All open / Custom) + multiselect
- Visual customization: None in v1. Users edit exported SVG.
- Vectorization: CountVectorizer only. TF-IDF future (VECTORIZER_TYPE constant).
- Security: File type validation, size limit, no st.markdown unsafe_allow_html
- Memory: All text in memory (50MB limit prevents issues). Large corpus → local install.
- Extension vision: method parameter in modelling.py for future NMF/BERTopic
- Competitors: MALLET (CLI, no GUI), Voyant (no LDA), Gensim (code required). Lemmata's differentiators: preprocessing trace + no-code + deterministic + documented development.
- USP: "Browser-based topic modeling where you see exactly what happened to every word."
- DTM transparency: Overview shows vocabulary size, terms removed by min_df/max_df, final DTM dimensions.
- Document length imbalance: Informational only (show chunk counts per document).
- spaCy error tolerance: Per-token try-except, log issues, continue.
- Minimum corpus: Warning if <50 unique lemmas, no blocking.
- Analysis history: Last analysis only. Previous → ZIP download.
- Download points: Per-tab download icons + Export tab ZIP.
- Tab customization: None, fixed 7 tabs.
- Sub-corpus: Future version (metadata-based filtering).
- Wordcloud shape: Rectangle, white background, 800x400px.
- Tokenization: spaCy default sufficient. Italian contractions handled correctly.
- Chunk overlap: None. Sentence-boundary chunking sufficient.
- Document classification: Dominant topic auto-assigned, shown in Distribution.
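Dominant-topic assignment above is an argmax over each document's row of topic weights; a pure-Python sketch, with `dominant_topics` as an assumed name:

```python
def dominant_topics(doc_topic_rows: list[list[float]]) -> list[int]:
    """For each document (a row of topic weights), the index of its top topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic_rows]
```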
- Diachronic view: Topic weight trend line chart (X=doc order, Y=weight, vertical lines at file boundaries).
- Onboarding: FAQ expander sufficient (no interactive tutorial).
- Download mechanism: st.download_button with key parameter + cache protection.
- Additional metrics: v1 has C_v/Perplexity/Log-likelihood. Topic diversity future.
- Machine-readable export: analysis_results.json in ZIP.
- Tool integration: Standard CSV/JSON output. No tool-specific formats.
- Seed control: Editable in Advanced (number input), default 42.
- Topic ordering: By corpus prevalence (descending average weight).
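Prevalence ordering above is a sort over per-topic average weights; a sketch under the assumption that the document-topic matrix is available as plain lists:

```python
def order_topics(doc_topic_rows: list[list[float]]) -> list[int]:
    """Topic indices sorted by descending average weight across documents."""
    n_topics = len(doc_topic_rows[0])
    avg = [sum(row[k] for row in doc_topic_rows) / len(doc_topic_rows)
           for k in range(n_topics)]
    return sorted(range(n_topics), key=lambda k: avg[k], reverse=True)
```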
- Performance target: Under 30 seconds for typical analysis.
- Convergence: max_iter in Advanced + warning if model used all iterations.
- Topic color identity: Fixed color per topic (tableau10), consistent across all tabs.
- Lemmatization quality: spaCy default. Wrong lemmas → add to custom stopwords.
- Screen layout: Streamlit wide layout, no extra responsive work.
- Imbalanced corpus warning: 10x size difference → warning with suggestion.
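The 10x rule above, sketched on per-document word counts; the zero-count filter and function name are assumptions.

```python
def is_imbalanced(word_counts: list[int], factor: int = 10) -> bool:
    """True when the largest document is at least `factor` times the smallest."""
    counts = [c for c in word_counts if c > 0]  # ignore empty (already skipped) files
    return len(counts) > 1 and max(counts) >= factor * min(counts)
```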
- Parameter reset: "Reset to defaults" link in sidebar.
- Analysis naming: Auto: lemmata_{lang}_{n}topics_{date}_{time}.zip
- Logo: Lambda (λ) symbol + "Lemmata" (teal, medium weight) + "Multilingual Topic Modeling" (gray, small).
- Typography: "sans serif" via config.toml (system default).
- Background: Light gray #F8F9FA main, slightly darker #F0F2F6 sidebar.
- Primary color: Teal #0F6E56 (buttons, sliders, active tabs).
- Run button: st.button type="primary", use_container_width=True. Disabled without files.
- Tab design: Streamlit default horizontal. NO emoji in tab names.
- Analysis complete: st.success green box + short summary (topics, C_v, lemmas). No emoji.
- Info boxes: st.info (blue background, left border) for pedagogical content.
- Spacing: Streamlit default + st.divider between sections + st.container(border=True) for grouping.
- File upload area: Streamlit default uploader + explanatory text above and below.
- Sidebar sections: Bold headings + st.divider between. Logo top, Run button middle, footer bottom.
- Sliders: Streamlit default with help="..." tooltip parameter.
- POS filter: Preset dropdown + multiselect below.
- Chart style: White background, minimal axes, light grid, sans-serif, tableau10 colors.
- Heatmap values: Hover tooltip (Altair). No numbers in cells.
- Chart sizes: Variable by type. Bar 600px, heatmap full width, wordcloud 600x400.
- Long tables: st.dataframe height=400, virtual scrolling, sortable/filterable.
- Wordcloud info: Topic name above, nothing below. Download via tab-level icon.
- Topics layout: Topic selector (dropdown/buttons) → one topic at a time, full area.
- Heatmap size: Dynamic height = max(200, n_docs * 25), dynamic width = max(300, n_topics * 60).
- Landing page structure: Single page, scroll sections, anchor nav.
- Landing page content: Hero (logo + tagline + screenshot + Launch button) → Features → How it works → Citation → Footer.
- Landing page design: Same teal color, consistent identity, more whitespace.
- SEO: Basic meta tags + OpenGraph. Google Scholar via DSH article.
- Success metrics: GitHub stars + Zenodo downloads + citations. Realistic expectations.
- Promotion: DSH article + DH conference poster + academic network sharing.
- Monitoring: UptimeRobot free (5-min checks, email alerts).
- Landing page language: English only.
- Landing page tone: Academic + open-source: "Topic modeling for the humanities — no code required."
- Landing page links: Launch Lemmata, Source code, Cite, Paper.
- Timeout: Estimated time display before analysis. Warning if >60s expected.
- PDF errors: Three specific messages (protected, scanned/image, corrupted).
- Duplicate files: Same filename warning, no blocking.
- Topic > document: Warning + dynamic slider max = document_count / 2.
- Character set: Full UTF-8 support. Non-Latin scripts tokenized but not lemmatized.
- Empty chunks: Silently removed, noted in preprocessing summary.
- Encoding: Auto-detect (chardet) → UTF-8 → Latin-1 fallback. Shown in trace.
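The decode chain above, minus the initial chardet detection step, can be sketched as follows; Latin-1 can decode any byte string, so the chain always terminates.

```python
def decode_bytes(raw: bytes) -> tuple[str, str]:
    """Try UTF-8, then fall back to Latin-1 (which never fails).
    Returns (text, encoding_used) so the trace can report the choice.
    The chardet detection used in the real pipeline is omitted here."""
    for enc in ("utf-8", "latin-1"):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
```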
- Missing spaCy model: Auto-download attempt, clear error message if fails.
- Session recovery: URL query params preserve parameters (not files).
- Text hygiene: Auto-clean BOM, null bytes, control chars, normalize line endings.
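A sketch of the hygiene pass above; the exact control-character range (everything below 0x20 except tab and newline, once line endings are normalized) is an assumption.

```python
import re

def clean_text(text: str) -> str:
    """Strip BOM and null bytes, drop control chars, normalize line endings."""
    text = text.lstrip("\ufeff")                           # byte-order mark
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    text = text.replace("\x00", "")                        # null bytes
    text = re.sub(r"[\x01-\x08\x0b\x0c\x0e-\x1f]", "", text)  # other control chars
    return text
```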
- Powered by: Landing page footer only: "Built with spaCy, scikit-learn, Gensim, Streamlit"
- Legal: Short privacy note in About section and landing page footer. No separate ToS.
- Copyright notice: Small note under file upload: user responsible for upload rights.
- Maintenance: 6-month check + critical fixes on demand.
- Dependency security: GitHub Dependabot enabled.
- Maintenance responsibility: Oğuz primary. Community contributions welcome.
- Ethics statement: DSH article Discussion section, not README.
- Data processing: "Texts processed on server during session. No permanent storage. No third-party sharing."
- Git branches: Direct push to main. Feature branches when multiple contributors.
- Commit messages: "P001: short description" — links to prompt log.
- Dependency locking: requirements.txt with >= minimum. Environment report has exact versions.
- Integration test: Smoke test — upload sample, run analysis, verify outputs.
- Gitignore: Standard Python + macOS + .streamlit/secrets.toml.
- Data entry points: Upload + text paste. No third option.
- Concurrency: Streamlit session isolation built-in. Free plan resource limits accepted.
- Sidebar status: Post-analysis summary below Run button ("5 topics, C_v: 0.58, 1,247 lemmas").
- Critical error screen: st.error user-friendly message + expander with technical traceback.
- Error message format: What happened + what to try. No error codes.
- Custom CSS: Minimal (~15 lines) via st.markdown. Hamburger menu hide, footer hide.
- Community: GitHub Discussions after DSH publication.
- Open source credits: README Acknowledgments expanded list.
- Acceptance test: 10-point DEPLOY_CHECKLIST.md before v0.1.0 release.
- Keyboard accessibility: Streamlit default (Tab navigation built-in).
- Concurrency limits: Free plan accepted. Heavy use → local install recommendation.
- JOSS: After DSH acceptance, consider separate software paper.
- Getting started guide: README installation section sufficient.
- Python versions: >=3.10 in pyproject.toml, CI tests 3.11 only.
- PDF preview: No. Direct download only.
- Technology migration: Architecture already framework-agnostic (only app.py is Streamlit).
- CLAUDE.md size: Critical rules in CLAUDE.md (~100 lines). Full decisions in ARCHITECTURE.md.