A parallel text processing tool that identifies and annotates Wikipedia references in text using the Aho-Corasick algorithm, implemented in C++11 with OpenMP.
This tool takes a large text file, splits it into 1KB blocks, and automatically identifies semantically meaningful words and phrases by matching them against an offline Wikipedia titles database. It then annotates the text with reference numbers and compiles a "References" section, similar to academic paper citations.
For detailed background and motivation, see the Project Proposal.
- Parallel text processing using the Aho-Corasick algorithm
- OpenMP-based parallelization
- Offline Wikipedia titles database matching
- JSON output format with reference annotations
- Efficient memory usage with immutable trie structure
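As a rough sketch of how the block-parallel design fits together, the following C++11/OpenMP program splits the input into 1KB blocks and scans each block against a shared, read-only pattern set. The `Match` record and `scanBlock` helper are illustrative, a naive substring search stands in for the Aho-Corasick trie, and the boundary handling (a match may start inside a block and extend past it) is one simple possibility, not necessarily what the tool does:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Match {                // illustrative result record
    std::size_t start, end;   // character offsets into the full text
    std::size_t titleIndex;   // index into the titles list
};

// Scan one block against the immutable, shared title list. A naive
// substring search stands in for the Aho-Corasick trie here; matches
// must *start* inside [begin, end) but may extend past the boundary.
static std::vector<Match> scanBlock(const std::string& text,
                                    std::size_t begin, std::size_t end,
                                    const std::vector<std::string>& titles) {
    std::vector<Match> found;
    for (std::size_t t = 0; t < titles.size(); ++t) {
        for (std::size_t pos = text.find(titles[t], begin);
             pos != std::string::npos && pos < end;
             pos = text.find(titles[t], pos + 1)) {
            found.push_back({pos, pos + titles[t].size(), t});
        }
    }
    return found;
}

int main() {
    const std::vector<std::string> titles = {"Helsinki", "August"};
    const std::string text = "A concert in Helsinki in August.";
    const std::size_t blockSize = 1024;  // 1KB blocks, as in the tool
    const std::size_t nBlocks = (text.size() + blockSize - 1) / blockSize;

    // One result bucket per block keeps threads from sharing mutable state.
    std::vector<std::vector<Match>> perBlock(nBlocks);
    #pragma omp parallel for schedule(dynamic)
    for (long long b = 0; b < static_cast<long long>(nBlocks); ++b) {
        const std::size_t begin = static_cast<std::size_t>(b) * blockSize;
        const std::size_t end = std::min(begin + blockSize, text.size());
        perBlock[b] = scanBlock(text, begin, end, titles);
    }

    // Merging in block order preserves the document order of references.
    for (const auto& block : perBlock)
        for (const auto& m : block)
            std::cout << m.start << "-" << m.end << ": "
                      << titles[m.titleIndex] << "\n";
    return 0;
}
```

Compile with `g++ -std=c++11 -fopenmp`; without `-fopenmp` the pragma is ignored and the loop runs sequentially, which mirrors the tool's sequential mode.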
The following tables show performance comparisons between parallel and sequential processing modes using different combinations of Wikipedia title databases and input text files.
Note: The tables below show processing time only (excluding the title-loading phase) so that parallel and sequential performance can be compared fairly. The loading phase is sequential in both modes and varies with database size:
- Small database (51 titles): Loading time ~6-13ms
- Large database (10,000 titles): Loading time ~66-79ms
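For reference, the two phases can be timed separately with std::chrono along these lines; `loadTitles` and `processText` are hypothetical placeholders for the tool's two phases, shown here only as a measurement skeleton:

```cpp
#include <chrono>
#include <iostream>

int main() {
    using Clock = std::chrono::steady_clock;

    const auto t0 = Clock::now();
    // loadTitles(...);   // sequential in both modes (hypothetical call)
    const auto t1 = Clock::now();
    // processText(...);  // the parallel or sequential phase (hypothetical)
    const auto t2 = Clock::now();

    const auto ms = [](Clock::duration d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::cout << "loading: " << ms(t1 - t0) << " ms, "
              << "processing: " << ms(t2 - t1) << " ms\n";
    return 0;
}
```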
Small database (51 titles):

| Input File | Mode | Processing Time | Speedup | References Found |
|---|---|---|---|---|
| sample_text.txt (2.2KB) | Sequential | 5.64ms | - | 52 |
| sample_text.txt (2.2KB) | Parallel | 13.95ms | 0.40x | 52 |
| book-war-and-peace.txt (3.1MB) | Sequential | 7863.76ms | - | 193 |
| book-war-and-peace.txt (3.1MB) | Parallel | 1296.74ms | 6.06x | 193 |
Large database (10,000 titles):

| Input File | Mode | Processing Time | Speedup | References Found |
|---|---|---|---|---|
| sample_text.txt (2.2KB) | Sequential | 1252.71ms | - | 3743 |
| sample_text.txt (2.2KB) | Parallel | 671.75ms | 1.76x | 3743 |
| book-war-and-peace.txt (3.1MB) | Sequential | 1766.43s | - | 5,084,135 |
| book-war-and-peace.txt (3.1MB) | Parallel | 285.80s | 6.18x | 5,084,135 |
- Small text files: Sequential processing is faster because parallelization overhead dominates (0.40x on the 2.2KB sample)
- Large text files: Parallel processing gives a consistent ~6x speedup on the 3.1MB book
- Loading time impact: The 10,000-title database takes roughly 5-13x longer to load than the 51-title database (~66-79ms vs ~6-13ms), but loading is a one-time cost
- Database impact: The 10,000-title database processes the same text roughly 200x slower than the 51-title database because it produces far more pattern matches
- Reference density: The larger database finds about 72x more references in the small text (3,743 vs 52) and about 26,000x more in the large text (5,084,135 vs 193)
- Scalability: The benefit of parallelism grows with text size; the largest test showed a 6.18x speedup (4.8 min vs 29.4 min of processing time)
- C++11 compatible compiler (GCC 4.8+ or Clang 3.3+)
- OpenMP support
- CMake 3.10+
- nlohmann/json library (included as submodule)
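To confirm the toolchain meets these requirements, a throwaway snippet like the one below should compile and report a thread count when built with `g++ -std=c++11 -fopenmp` (or `clang++` with the same flags):

```cpp
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
#ifdef _OPENMP
    // _OPENMP expands to the release date of the supported OpenMP spec
    std::cout << "OpenMP " << _OPENMP
              << ", max threads: " << omp_get_max_threads() << "\n";
#else
    std::cout << "Compiled without OpenMP support\n";
#endif
    return 0;
}
```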
```bash
# Clone the repository with submodules
git clone --recursive https://github.com/yourusername/text_reference_breaker.git
cd text_reference_breaker

# Create a build directory
mkdir build
cd build

# Configure and build
cmake ..
make
```

```bash
# Basic usage (parallel mode is the default)
./text_reference_breaker <wiki_titles_file> <input_text_file> <output_json_file>

# Sequential mode (for performance comparison)
./text_reference_breaker <wiki_titles_file> <input_text_file> <output_json_file> --sequential

# Example with sample data
./text_reference_breaker ../data/wiki_titles.txt ../data/sample_text.txt output.json
```

- `wiki_titles_file`: a text file with one Wikipedia title per line
- `input_text_file`: any text file you want to process
- `output_json_file`: path where the annotated JSON result is written
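Because the parallel mode is built on OpenMP, the thread count can normally be set with the standard `OMP_NUM_THREADS` environment variable (assuming the program does not override it internally):

```bash
# Run with 8 OpenMP threads (OMP_NUM_THREADS is standard OpenMP behavior)
OMP_NUM_THREADS=8 ./text_reference_breaker ../data/wiki_titles.txt ../data/sample_text.txt output.json
```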
The output is a JSON file with the following structure:
```json
{
  "text": "ColdPlay had a tour concert in Helsinki in August, 2024.",
  "references": [
    { "start": 0, "end": 8, "titleIndex": 301, "title": "Coldplay" },
    { "start": 31, "end": 39, "titleIndex": 2105, "title": "Helsinki" },
    { "start": 43, "end": 49, "titleIndex": 984, "title": "August" }
  ]
}
```
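Any JSON library can consume this output; below is a minimal sketch using the bundled nlohmann/json to render a numbered reference list (the file name `output.json` and the rendering format are illustrative):

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    std::ifstream in("output.json");
    if (!in) {
        std::cerr << "cannot open output.json\n";
        return 1;
    }
    const nlohmann::json doc = nlohmann::json::parse(in);

    // Print the annotated text followed by a numbered reference list.
    std::cout << doc["text"].get<std::string>() << "\n\nReferences:\n";
    int n = 1;
    for (const auto& ref : doc["references"]) {
        std::cout << "[" << n++ << "] " << ref["title"].get<std::string>()
                  << " (chars " << ref["start"].get<std::size_t>()
                  << "-" << ref["end"].get<std::size_t>() << ")\n";
    }
    return 0;
}
```

Dependencies and test data:

- Aho-Corasick: Header-only C++ implementation of the Aho-Corasick algorithm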
- nlohmann/json: JSON for Modern C++
- War and Peace text: Downloaded from NYU Economics 370 course materials
- Google 10,000 English words: Common English words list from first20hours/google-10000-english for testing large vocabulary scenarios
- Sample Wikipedia titles: Custom curated list of 51 popular Wikipedia titles for demonstration
This project makes use of several excellent open-source libraries and data sources:
- Aho-Corasick Algorithm Implementation: Thanks to cjgdev for the header-only C++ implementation of the Aho-Corasick string matching algorithm (aho_corasick)
- JSON Processing: nlohmann/json by Niels Lohmann for modern C++ JSON processing
- Test Data: War and Peace text sourced from NYU Economics 370 course repository maintained by mmcky
MIT License