A Python utility for extracting structured data from research paper abstracts using LLMs.
- Extract structured data from paper abstracts using OpenAI or LlamaIndex
- Support for direct extraction from DOI references
- Schema validation and type conversion
- Error handling for LLM outputs
- Returns extracted data as a pandas DataFrame
```bash
pip install -r requirements.txt
```

```python
from paper_extractor import extract_data_from_abstract
# Example abstract text
abstract = """
Background: Long-lasting insecticidal nets (LLINs) are the primary malaria prevention approach globally. However,
insecticide resistance in vectors threatens the efficacy of insecticidal interventions, including LLINs.
Interceptor® G2 is a new LLIN that contains a mixture of two insecticides: alpha-cypermethrin and chlorfenapyr.
Methods: This study was conducted in Northeast Tanzania between December 2017 to January 2018...
"""
# Extract data using OpenAI
df = extract_data_from_abstract(
    abstract,
    llm_service="openai",
    api_key="your_openai_api_key"
)

# Or use LlamaIndex
df = extract_data_from_abstract(
    abstract,
    llm_service="llamaindex",
    api_key="your_api_key"
)
print(df)
```

```python
from paper_extractor import extract_data_from_doi
# Extract data from a DOI
df = extract_data_from_doi(
    "10.1186/s12936-019-2973-x",
    llm_service="openai",
    api_key="your_openai_api_key"
)
print(df)
```

The script extracts the following fields from research paper abstracts:
- Pub_year: Publication year (integer)
- Journal: Name of journal (string)
- Study_type: Hut trial, lab-based bioassay, or village trial (string)
- Net_type: Names of LLINs tested, comma-separated if multiple (string)
- Source: Whether mosquitoes were from the wild or lab - 'Wild' or 'Lab' (string)
- Country: Country where the study was conducted (string)
- Site: Specific geographic information (string)
- Start_date: Study start date in YYYY-MM format (string)
- End_date: Study end date in YYYY-MM format (string)
- Time_elapsed: Time elapsed in months (float)
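A minimal sketch of how this target schema and its type coercion might be represented (field names mirror the list above; the `SCHEMA` mapping and `to_dataframe` helper are illustrative assumptions, not the package's actual API):

```python
import pandas as pd

# Target fields and dtypes, mirroring the list above. SCHEMA and to_dataframe
# are illustrative only, not the package's actual API.
SCHEMA = {
    "Pub_year": "Int64",        # nullable integer
    "Journal": "string",
    "Study_type": "string",
    "Net_type": "string",
    "Source": "string",         # 'Wild' or 'Lab'
    "Country": "string",
    "Site": "string",
    "Start_date": "string",     # YYYY-MM
    "End_date": "string",       # YYYY-MM
    "Time_elapsed": "float64",  # months
}

def to_dataframe(record: dict) -> pd.DataFrame:
    """Coerce one extracted record into a single-row DataFrame matching SCHEMA."""
    df = pd.DataFrame([{field: record.get(field) for field in SCHEMA}])
    for field, dtype in SCHEMA.items():
        try:
            df[field] = df[field].astype(dtype)
        except (TypeError, ValueError):
            df[field] = pd.NA  # leave the value missing if conversion fails
    return df
```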
The implementation approach focused on creating a flexible, robust system for extracting structured data from academic paper abstracts:
- Dual LLM Integration: Support for both OpenAI and LlamaIndex provides flexibility, allowing users to choose their preferred LLM service.
- Structured Output Format: Carefully crafted prompts instruct the LLM to produce responses in valid JSON format to ensure consistent parsing (see the prompt and parsing sketch after this list).
- Schema Validation: A rigorous validation process converts extracted values to the appropriate data types and handles missing or invalid values.
- Modular Design: The codebase separates concerns into discrete functions for fetching abstracts, extracting data, and validating output.
- DOI Integration: Abstracts can be fetched directly from DOI references, eliminating the need for manual copying.
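As a rough illustration of the structured-output prompt and tolerant parsing described above (the wording of `PROMPT_TEMPLATE` and the `parse_llm_json` helper are assumptions, not the package's exact implementation):

```python
import json
import re

# Hypothetical prompt asking for strict JSON output with the schema's keys.
# Usage: prompt = PROMPT_TEMPLATE.format(abstract=abstract)
PROMPT_TEMPLATE = """Extract the following fields from the abstract below and return
ONLY a valid JSON object with exactly these keys: Pub_year, Journal, Study_type,
Net_type, Source, Country, Site, Start_date, End_date, Time_elapsed.
Use null for any field that cannot be determined.

Abstract:
{abstract}
"""

def parse_llm_json(response_text: str) -> dict:
    """Pull a JSON object out of an LLM response, tolerating surrounding prose."""
    try:
        return json.loads(response_text)  # clean responses parse directly
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", response_text, re.DOTALL)
        if match:
            return json.loads(match.group(0))  # fall back to the outermost {...}
        raise ValueError("No valid JSON object found in LLM response")
```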
Several challenges were encountered during development:
- LLM Output Variability: LLMs occasionally generate responses that don't strictly adhere to the requested format, requiring robust parsing logic to extract valid JSON.
- HTML Parsing Complexity: Different publishers format their paper pages differently, making it difficult to create a universal method for extracting abstracts from DOIs.
- Date Extraction: Dates in abstracts appear in many formats and require additional logic to standardize to YYYY-MM (see the date-handling sketch after this list).
- Inferring Time Elapsed: Calculating the time elapsed between dates often requires contextual understanding, as this information may not be stated explicitly.
- Type Conversion Edge Cases: Converting extracted text to specific data types (especially numerics) requires handling a variety of edge cases and formats.
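One plausible way to standardize dates and compute elapsed months, sketched under the assumption that abstract dates look like "December 2017" or "2017-12" (the `normalize_date` and `months_elapsed` helpers are hypothetical, not the package's code):

```python
from __future__ import annotations

from datetime import datetime

# Date layouts plausibly seen in abstracts; extend the list as new ones appear.
DATE_FORMATS = ["%B %Y", "%b %Y", "%Y-%m", "%m/%Y", "%Y"]

def normalize_date(text: str) -> str | None:
    """Convert a free-text date such as 'December 2017' to 'YYYY-MM'."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m")
        except ValueError:
            continue
    return None

def months_elapsed(start: str | None, end: str | None) -> float | None:
    """Whole months between two 'YYYY-MM' strings, e.g. '2017-12' to '2018-01' is 1.0."""
    if not (start and end):
        return None
    s = datetime.strptime(start, "%Y-%m")
    e = datetime.strptime(end, "%Y-%m")
    return float((e.year - s.year) * 12 + (e.month - s.month))
```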
Several enhancements could make the extraction more robust and efficient:
- Few-Shot Learning: Include examples of correct extractions in the prompt to guide the LLM toward more accurate outputs; demonstrating the expected format and reasoning should improve extraction accuracy (see the sketch after this list).
- Custom NER Model: Train a specialized Named Entity Recognition model for scientific papers to pre-process abstracts and identify key entities before LLM extraction.
- Cross-Validation: Implement a multi-LLM approach in which extractions from different models are compared and reconciled for higher confidence.
- Structured Reasoning: Break the extraction process into steps, asking the LLM to first identify relevant sections before extracting values.
- Caching Mechanism: Implement a caching layer for DOI fetching and LLM calls to avoid repeated work on the same queries (see the sketch after this list).
- Enhanced Publisher Integration: Develop dedicated parsers for major academic publishers to improve DOI-based abstract retrieval.
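Two of these ideas sketched briefly; the few-shot example values are invented purely for illustration, and `fetch_abstract_cached` assumes the public Crossref REST API rather than any function that exists in this package:

```python
from __future__ import annotations

from functools import lru_cache

import requests

# Few-shot prefix prepended to the extraction prompt. The example values are
# invented solely to demonstrate the expected output format.
FEW_SHOT_EXAMPLE = """Abstract: "A village trial of Example Net 2.0 was conducted in Benin from June 2015 to May 2016..."
JSON: {"Study_type": "village trial", "Net_type": "Example Net 2.0", "Country": "Benin",
       "Start_date": "2015-06", "End_date": "2016-05", "Time_elapsed": 11.0}
"""

@lru_cache(maxsize=256)
def fetch_abstract_cached(doi: str) -> str | None:
    """Fetch a paper's abstract via the Crossref REST API, memoizing repeated lookups."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    # Crossref returns the abstract (when available) as JATS-tagged XML.
    return resp.json()["message"].get("abstract")
```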