BioScribe

An automated end-to-end pipeline for systematic literature review of traumatic brain injury (TBI) research papers. BioScribe screens, extracts, validates, and stores cognitive test data from scientific publications.

Overview

BioScribe streamlines the systematic review process with an AI-powered pipeline:

Screening - Automatically filters relevant TBI papers using LLM-based analysis
Extraction - Extracts structured cognitive test data (21 fields) from included papers
Validation - Validates extraction accuracy with confidence scoring and hallucination detection
Database - Stores results in DynamoDB using a two-table architecture
Human-Loop - Provides a web UI for expert review of medium- and low-confidence records

Reliability: Built-in network timeout protection with automatic retries and graceful degradation.
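
The retry-then-degrade behaviour can be pictured with a minimal sketch like the one below. It is not BioScribe's actual implementation: the function name `call_with_retries`, its parameters, and the choice of exponential backoff are all assumptions made for illustration.

```python
import time


def call_with_retries(fetch, retries=3, base_delay=1.0, fallback=None, sleep=time.sleep):
    """Call fetch(); on a network-style error, wait and retry, then fall back.

    Hypothetical helper: the name, defaults, and backoff schedule are
    illustrative assumptions, not taken from the BioScribe code base.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            if attempt < retries - 1:
                # Exponential backoff between attempts: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt)
    # Graceful degradation: return a fallback instead of crashing the run.
    return fallback
```

A caller would wrap each network request (for example a PubMed fetch) in `call_with_retries(lambda: fetch_paper(pmid))`, where `fetch_paper` stands in for whatever function performs the request, and treat a `None` result as "skip this paper and continue".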

Architecture

Input CSV → Screening → Extraction → Validation → DynamoDB → Human Review UI
                            ↓                          ↓
                    [PubMed Fetch]          [extraction table]
                    [Auto-retry]             [validation table]

Quick Start

Prerequisites

Python 3.8+
AWS account (or local DynamoDB)
OpenAI API key

Installation

# Clone the repository
git clone https://github.com/ece1786-2025/Bioscribe.git
cd Bioscribe

# Install dependencies
pip install -r requirements.txt

# Configure credentials
# 1. Add OpenAI API key to credentials/open_ai_key.txt
# 2. Add AWS credentials to credentials/aws_credentials.py
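
To check that the key file is readable before starting a long run, a small script along these lines may help. It assumes the file holds the key on a single line; `load_openai_key` is a hypothetical helper, and the pipeline itself may read the file differently.

```python
from pathlib import Path


def load_openai_key(path="credentials/open_ai_key.txt"):
    """Return the API key stored in a text file, without surrounding whitespace.

    Hypothetical helper for a pre-flight check; assumes a one-line key file.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key
```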

Running the Pipeline

# Run end-to-end pipeline (tables created automatically)
python main.py --input_csv inputs/test_papers.csv --output-dir outputs/end-to-end

The pipeline will:

✅ Check/create DynamoDB tables automatically
✅ Screen papers for relevance
✅ Extract cognitive test data
✅ Validate extractions with confidence scores
✅ Insert records to database
✅ Launch Human-Loop UI in browser

Project Structure

Bioscribe/
├── main.py                     # End-to-end pipeline orchestrator
├── screener_script/            # Paper screening module
│   └── end_to_end_screening.py
├── extractor_script/           # Data extraction module
│   └── extractor_script.py
├── validatior_script/          # Validation module
│   └── validator.py
├── database_script/            # DynamoDB operations
│   ├── create_tables.py        # Table creation
│   ├── extraction_database.py  # Extraction table manager
│   └── validation_database.py  # Validation table manager
├── apps/
│   └── human-loop/             # Web UI for human review
│       └── app.py
├── credentials/                # API keys and AWS credentials
├── inputs/                     # Input CSV files
└── outputs/                    # Pipeline outputs

Key Features

Automated Table Creation

DynamoDB tables are created automatically on first run - no manual setup required.

Confidence-Based Routing

High confidence (≥0.80): Auto-approve, no review needed
Medium confidence (0.60-0.79): Flag for review
Low confidence (<0.60): Priority review required
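
As a rough sketch, the three tiers above amount to two threshold comparisons. The function name `route` and the tier labels it returns are assumptions for illustration; the validator's own names and return values may differ.

```python
def route(confidence, high=0.80, medium=0.60):
    """Map a composite confidence score in [0, 1] to a review tier.

    Tier labels are hypothetical; only the cut-offs come from the list above.
    """
    if confidence >= high:
        return "auto_approve"
    if confidence >= medium:
        return "review"
    return "priority_review"
```

For example, `route(0.72)` falls in the medium tier and returns `"review"`, while `route(0.41)` returns `"priority_review"`.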

Two-Table Architecture

bioscribe-successful-entries: Extraction data (21 fields per record)
bioscribe-validations: Validation metadata (confidence scores, routing decisions)
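
One way to picture the split is a helper that turns a validated record into one item per table, linked by a shared identifier. This is a sketch only: the key name `record_id`, the helper `split_record`, and the example field names are assumptions, not the actual table schema.

```python
from decimal import Decimal

EXTRACTION_TABLE = "bioscribe-successful-entries"
VALIDATION_TABLE = "bioscribe-validations"


def split_record(record_id, extraction_fields, validation_meta):
    """Return {table_name: item}, with both items sharing the same record_id.

    The "record_id" key is a hypothetical link between the two tables.
    """
    return {
        EXTRACTION_TABLE: {"record_id": record_id, **extraction_fields},
        VALIDATION_TABLE: {"record_id": record_id, **validation_meta},
    }


# Example with made-up values; numbers are Decimal because boto3's
# DynamoDB resource API rejects Python floats.
items = split_record(
    "example-0001",
    {"test_name": "Trail Making Test"},
    {"confidence": Decimal("0.72")},
)
```

Keeping validation metadata in its own table means a reviewer's routing decision can be updated without rewriting the extracted fields.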

Human-Loop UI

Flask-based web interface for reviewing flagged records with:

Confidence tier filtering
Side-by-side source text comparison
In-place editing and approval workflow

Data Schema

Extraction Fields (21 total)

Study metadata: Population, N, age, gender, education, injury severity
Cognitive test: Test name, subtest, domain, outcome measure
Results: Mean, SD, median, IQR, min/max, range, statistics
Timing: Time since injury, follow-up duration

Validation Metadata

Composite confidence score
Field-level confidence breakdown
Hallucination risk assessment
Routing decision (approve/review/reject)

Configuration

Region Settings

Default AWS region: us-east-1

Modify in main.py:

ensure_tables_exist(region="us-east-2")  # Change region

Confidence Thresholds

Adjust in validatior_script/validator.py:

DEFAULT_THRESHOLDS = {
    "high": 0.80,    # Auto-approve threshold
    "medium": 0.60,  # Review threshold
    "low": 0.50      # Reject threshold
}

Outputs

Screening Output

screening_final.json: Final screening decisions with justifications

Extraction Output

extractions.json: All extracted records with metadata

Validation Output

validation_results.json: Validation results with confidence scores

Development

Running Individual Components

# Screen papers only
cd screener_script
python end_to_end_screening.py --input_csv ../inputs/data/screen/test_papers.csv

# Extract from specific paper
cd extractor_script
python extractor_script.py <PMID>

# Validate extractions
cd validatior_script
python validator.py --input extractions.json --output validation.json

# Create tables manually
python database_script/create_tables.py

Human-Loop UI Standalone

cd apps/human-loop
python app.py
# Visit http://127.0.0.1:5000

Troubleshooting

Missing Tables

If automatic table creation fails, create the tables manually:

python database_script/create_tables.py

AWS Connection Issues

Tables are created automatically on first run, which requires a reachable DynamoDB endpoint. For local development without an AWS account, run DynamoDB Local:

# Install and run local DynamoDB
docker run -p 8000:8000 amazon/dynamodb-local
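
With the container running, boto3 can be pointed at it through the `endpoint_url` argument. The sketch below shows the general pattern; the helper names are hypothetical, and whether BioScribe's database modules expose an endpoint override is not stated in this README, so adapting them may require a code change.

```python
def local_dynamodb_kwargs(port=8000, region="us-east-1"):
    """Keyword arguments that point boto3 at DynamoDB Local instead of AWS."""
    return {
        "endpoint_url": f"http://localhost:{port}",
        "region_name": region,
        # DynamoDB Local accepts any credentials, so placeholders suffice.
        "aws_access_key_id": "local",
        "aws_secret_access_key": "local",
    }


def connect_local(port=8000):
    """Return a boto3 DynamoDB resource bound to the local container."""
    import boto3  # imported lazily so the helper above works without boto3

    return boto3.resource("dynamodb", **local_dynamodb_kwargs(port))
```

Calling `connect_local().tables.all()` and iterating over the result is a quick way to confirm that the container is reachable.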

Missing Dependencies

pip install cloudscraper  # For web scraping
pip install boto3          # For AWS DynamoDB

License

MIT License

Contributors

ECE1786 2025 - University of Toronto
