BioScribe

An automated end-to-end pipeline for systematic literature review of traumatic brain injury (TBI) research papers. BioScribe screens, extracts, validates, and stores cognitive test data from scientific publications.

Overview

BioScribe runs the systematic review process as a five-stage, LLM-driven pipeline:

  1. Screening - Automatically filters relevant TBI papers using LLM-based analysis
  2. Extraction - Extracts structured cognitive test data (21 fields) from included papers
  3. Validation - Validates extraction accuracy with confidence scoring and hallucination detection
  4. Database - Stores results in DynamoDB using a two-table architecture
  5. Human-Loop - Provides a web UI for expert review of medium- and low-confidence records

Reliability: Built-in network timeout protection with automatic retries and graceful degradation.
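
The retry behaviour described above could look roughly like the following. This is a minimal sketch, not BioScribe's actual implementation: `fetch_with_retry`, its parameters, and the backoff schedule are hypothetical, and the exception types caught here are an assumption (a real HTTP client usually raises its own exception classes).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fetch` up to `max_attempts` times with exponential backoff.

    Returns None instead of raising when every attempt fails, so the
    caller can skip the paper and continue (graceful degradation).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                return None  # give up; the caller decides how to degrade
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; a caller that receives `None` can log the failure and move on to the next paper.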

Architecture

Input CSV → Screening → Extraction → Validation → DynamoDB → Human Review UI
                            ↓                          ↓
                    [PubMed Fetch]          [extraction table]
                    [Auto-retry]             [validation table]

Quick Start

Prerequisites

  • Python 3.8+
  • AWS account (or local DynamoDB)
  • OpenAI API key

Installation

# Clone the repository
git clone https://github.com/ece1786-2025/Bioscribe.git
cd Bioscribe

# Install dependencies
pip install -r requirements.txt

# Configure credentials
# 1. Add OpenAI API key to credentials/open_ai_key.txt
# 2. Add AWS credentials to credentials/aws_credentials.py

Running the Pipeline

# Run end-to-end pipeline (tables created automatically)
python main.py --input_csv inputs/test_papers.csv --output-dir outputs/end-to-end

The pipeline will:

  • ✅ Check/create DynamoDB tables automatically
  • ✅ Screen papers for relevance
  • ✅ Extract cognitive test data
  • ✅ Validate extractions with confidence scores
  • ✅ Insert records into the database
  • ✅ Launch Human-Loop UI in browser

Project Structure

Bioscribe/
├── main.py                     # End-to-end pipeline orchestrator
├── screener_script/            # Paper screening module
│   └── end_to_end_screening.py
├── extractor_script/           # Data extraction module
│   └── extractor_script.py
├── validatior_script/          # Validation module
│   └── validator.py
├── database_script/            # DynamoDB operations
│   ├── create_tables.py        # Table creation
│   ├── extraction_database.py  # Extraction table manager
│   └── validation_database.py  # Validation table manager
├── apps/
│   └── human-loop/             # Web UI for human review
│       └── app.py
├── credentials/                # API keys and AWS credentials
├── inputs/                     # Input CSV files
└── outputs/                    # Pipeline outputs

Key Features

Automated Table Creation

DynamoDB tables are created automatically on first run - no manual setup required.

Confidence-Based Routing

  • High confidence (≥0.80): Auto-approve, no review needed
  • Medium confidence (0.60-0.79): Flag for review
  • Low confidence (<0.60): Priority review required
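
The three tiers above can be expressed as a small function. This is an illustrative sketch: `confidence_tier` is a hypothetical name, not a function confirmed to exist in the validator module.

```python
def confidence_tier(score: float, high: float = 0.80, medium: float = 0.60) -> str:
    """Map a composite confidence score in [0, 1] to a review tier."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= high:
        return "high"    # auto-approve, no review needed
    if score >= medium:
        return "medium"  # flag for review
    return "low"         # priority review required
```

The sketch treats every score in the half-open interval [0.60, 0.80) as medium, so a value such as 0.795 is not left unassigned between the listed ranges.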

Two-Table Architecture

  • bioscribe-successful-entries: Extraction data (21 fields per record)
  • bioscribe-validations: Validation metadata (confidence scores, routing decisions)
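
One way to picture the split is a helper that turns a validated record into one item per table. The sketch below is an assumption about how the two tables relate: the shared `record_id` key and the `split_record` helper are hypothetical, and the real key schema is defined in `database_script/create_tables.py`.

```python
from typing import Any, Dict

EXTRACTION_TABLE = "bioscribe-successful-entries"
VALIDATION_TABLE = "bioscribe-validations"


def split_record(
    record_id: str,
    extraction: Dict[str, Any],
    validation: Dict[str, Any],
) -> Dict[str, Dict[str, Any]]:
    """Build one item per table, linked by a shared (hypothetical) record_id."""
    return {
        EXTRACTION_TABLE: {"record_id": record_id, **extraction},
        VALIDATION_TABLE: {"record_id": record_id, **validation},
    }
```

Keeping validation metadata in its own table lets reviewers update routing decisions without rewriting the extracted data.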

Human-Loop UI

Flask-based web interface for reviewing flagged records with:

  • Confidence tier filtering
  • Side-by-side source text comparison
  • In-place editing and approval workflow

Data Schema

Extraction Fields (21 total)

  • Study metadata: Population, N, age, gender, education, injury severity
  • Cognitive test: Test name, subtest, domain, outcome measure
  • Results: Mean, SD, median, IQR, min/max, range, statistics
  • Timing: Time since injury, follow-up duration

Validation Metadata

  • Composite confidence score
  • Field-level confidence breakdown
  • Hallucination risk assessment
  • Routing decision (approve/review/reject)
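
As a rough illustration, the metadata above could be held in a structure like the one below. The class name, attribute names, and risk labels are hypothetical, and computing the composite score as an unweighted mean of field scores is an assumption; the actual weighting is not described here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict


@dataclass
class ValidationMetadata:
    field_confidence: Dict[str, float]  # field-level confidence breakdown
    hallucination_risk: str             # assumed labels, e.g. "low" or "high"
    routing_decision: str = "review"    # approve / review / reject

    @property
    def composite_confidence(self) -> float:
        # Assumption: unweighted mean of the field-level scores.
        if not self.field_confidence:
            return 0.0
        return mean(self.field_confidence.values())
```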

Configuration

Region Settings

Default AWS region: us-east-1

Modify in main.py:

ensure_tables_exist(region="us-east-2")  # Change region

Confidence Thresholds

Adjust in validatior_script/validator.py:

DEFAULT_THRESHOLDS = {
    "high": 0.80,    # Auto-approve threshold
    "medium": 0.60,  # Review threshold
    "low": 0.50      # Reject threshold
}
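
When editing these values, it is easy to end up with thresholds that are out of order. The helper below is a hypothetical sketch (it is not part of `validator.py`) that merges overrides into the defaults shown above and rejects inconsistent settings.

```python
from typing import Dict, Optional

DEFAULT_THRESHOLDS = {"high": 0.80, "medium": 0.60, "low": 0.50}


def resolve_thresholds(overrides: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Merge overrides into the defaults and check that they are ordered."""
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    unknown = set(thresholds) - set(DEFAULT_THRESHOLDS)
    if unknown:
        raise KeyError(f"unknown threshold name(s): {sorted(unknown)}")
    low, medium, high = thresholds["low"], thresholds["medium"], thresholds["high"]
    if not 0.0 <= low <= medium <= high <= 1.0:
        raise ValueError(f"expected 0 <= low <= medium <= high <= 1, got {thresholds}")
    return thresholds
```

For example, `resolve_thresholds({"high": 0.90})` tightens auto-approval while leaving the other two values at their defaults.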

Outputs

Screening Output

  • screening_final.json: Final screening decisions with justifications

Extraction Output

  • extractions.json: All extracted records with metadata

Validation Output

  • validation_results.json: Validation results with confidence scores

Development

Running Individual Components

# Screen papers only
cd screener_script
python end_to_end_screening.py --input_csv ../inputs/data/screen/test_papers.csv

# Extract from specific paper
cd extractor_script
python extractor_script.py <PMID>

# Validate extractions
cd validatior_script
python validator.py --input extractions.json --output validation.json

# Create tables manually (run from the repository root)
python database_script/create_tables.py

Human-Loop UI Standalone

cd apps/human-loop
python app.py
# Visit http://127.0.0.1:5000

Troubleshooting

Missing Tables

If automatic table creation fails, create the tables manually:

python database_script/create_tables.py

AWS Connection Issues

Tables are created automatically on first run, which requires a working AWS connection. For local development without AWS, run DynamoDB Local:

# Pull and run DynamoDB Local (requires Docker)
docker run -p 8000:8000 amazon/dynamodb-local
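
To talk to the local container, a boto3 client has to be pointed at it through `endpoint_url`. How BioScribe's database modules accept such a setting is not documented here, so the helpers below are a hypothetical sketch of the wiring rather than the project's own code.

```python
from typing import Any, Dict


def dynamodb_resource_kwargs(local: bool, region: str = "us-east-1") -> Dict[str, Any]:
    """Keyword arguments for boto3.resource("dynamodb", ...)."""
    kwargs: Dict[str, Any] = {"region_name": region}
    if local:
        # DynamoDB Local listens on the port published by the docker command.
        kwargs["endpoint_url"] = "http://localhost:8000"
        # DynamoDB Local does not validate credentials, but boto3 still
        # needs some values to sign requests.
        kwargs["aws_access_key_id"] = "local"
        kwargs["aws_secret_access_key"] = "local"
    return kwargs


def get_dynamodb(local: bool = False):
    import boto3  # imported lazily so the helper above works without boto3

    return boto3.resource("dynamodb", **dynamodb_resource_kwargs(local))
```

Calling `get_dynamodb(local=True)` returns a resource bound to the container, while `get_dynamodb()` uses the normal AWS endpoint and credential chain.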

Missing Dependencies

pip install cloudscraper  # For web scraping
pip install boto3          # For AWS DynamoDB

License

MIT License

Contributors

ECE1786 2025 - University of Toronto

About

Multi-agent paper review system that aims to automate systematic reviews in the BRIDGE lab
