This repository provides tools and scripts for evaluating the LoCoMo and LongMemEval datasets with various models and APIs.
- Set the `PYTHONPATH` environment variable and change into the evaluation directory:

  ```bash
  export PYTHONPATH=../src
  cd evaluation
  ```
- Install the required dependencies:

  ```bash
  poetry install --with eval
  ```
- Create a `.env` file in the `evaluation/` directory and include the following environment variables:

  ```
  OPENAI_API_KEY="sk-xxx"
  OPENAI_BASE_URL="your_base_url"
  MEM0_API_KEY="your_mem0_api_key"
  MEM0_PROJECT_ID="your_mem0_proj_id"
  MEM0_ORGANIZATION_ID="your_mem0_org_id"
  MODEL="gpt-4o-mini" # or your preferred model
  EMBEDDING_MODEL="text-embedding-3-small" # or your preferred embedding model
  ZEP_API_KEY="your_zep_api_key"
  ```
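The evaluation scripts read these values from the environment. The snippet below is only a minimal sketch of how they are typically loaded, assuming the `python-dotenv` package is available; check the scripts themselves for the exact variables each one requires.

```python
# Minimal sketch (not the repository's exact loading code): read the variables
# defined in evaluation/.env, assuming the python-dotenv package is installed.
import os

from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current working directory

openai_api_key = os.getenv("OPENAI_API_KEY")
model = os.getenv("MODEL", "gpt-4o-mini")                  # chat model for responses/grading
embedding_model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")

if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is missing; check your evaluation/.env file")
```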
The smaller LoCoMo dataset is already included in this repository to make reproduction easier.
To download the LongMemEval dataset, run the following command:

```bash
huggingface-cli download --repo-type dataset --resume-download xiaowu0162/longmemeval --local-dir data/longmemeval
```

After downloading, rename the files as follows:

```
longmemeval_m.json
longmemeval_s.json
longmemeval_oracle.json
```
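If you prefer downloading from Python rather than the CLI, a sketch using `huggingface_hub.snapshot_download` is shown below; the expected filenames are the renamed ones listed above, and the final check simply confirms they exist.

```python
# Sketch: fetch the LongMemEval dataset with the huggingface_hub Python API
# instead of huggingface-cli, then confirm the renamed files are in place.
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = Path("data/longmemeval")
snapshot_download(
    repo_id="xiaowu0162/longmemeval",
    repo_type="dataset",
    local_dir=local_dir,
)

# After renaming, these three files should exist.
expected = ["longmemeval_m.json", "longmemeval_s.json", "longmemeval_oracle.json"]
missing = [name for name in expected if not (local_dir / name).exists()]
if missing:
    print(f"Still missing (did you rename the downloaded files?): {missing}")
else:
    print("LongMemEval files are ready.")
```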
To evaluate the LoCoMo dataset, execute the following scripts in order (a sketch that chains all five steps appears after the list):
- Ingest LoCoMo history into MemOS:

  ```bash
  python scripts/locomo/locomo_ingestion.py --lib memos
  ```
- Search memory for each QA pair in LoCoMo:

  ```bash
  python scripts/locomo/locomo_search.py --lib memos
  ```
- Generate responses from OpenAI with the provided context:

  ```bash
  python scripts/locomo/locomo_responses.py --lib memos
  ```
- Evaluate the generated answers:

  ```bash
  python scripts/locomo/locomo_eval.py --lib memos
  ```
- Calculate fine-grained scores for each category:

  ```bash
  python scripts/locomo/locomo_metric.py --lib memos
  ```
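The five stages can also be chained from a single script. The sketch below is a convenience wrapper, not part of the repository: it simply shells out to each command in order and stops on the first failure, assuming you run it from `evaluation/` with `PYTHONPATH` set as described above.

```python
# Hypothetical helper: run the full LoCoMo pipeline for one memory backend.
import subprocess
import sys

LIB = "memos"  # the --lib value shown above; other backends may need extra keys in .env

STAGES = [
    "scripts/locomo/locomo_ingestion.py",   # ingest conversation history
    "scripts/locomo/locomo_search.py",      # retrieve memory for each QA pair
    "scripts/locomo/locomo_responses.py",   # generate answers with the retrieved context
    "scripts/locomo/locomo_eval.py",        # judge the generated answers
    "scripts/locomo/locomo_metric.py",      # aggregate fine-grained per-category scores
]

for script in STAGES:
    print(f">>> python {script} --lib {LIB}")
    subprocess.run([sys.executable, script, "--lib", LIB], check=True)  # stop on first failure
```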
- **Add New Metrics**: When incorporating the evaluation of reflection duration, make sure to record the related data in `{lib}_locomo_judged.json`. For additional NLP metrics such as BLEU and ROUGE-L, adjust the `locomo_grader` function in `scripts/locomo/locomo_eval.py` (see the sketch after this list).
- **Intermediate Results**: Intermediate results such as `{lib}_locomo_search_results.json`, `{lib}_locomo_responses.json`, and `{lib}_locomo_judged.json` are provided for reproducibility. Contributors are encouraged to report final results in the PR description rather than editing these files directly; valuable modifications will be merged into an updated version of the evaluation code with revised intermediate results at regular intervals.
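As a starting point for the extra NLP metrics mentioned above, the sketch below shows one way to compute BLEU and ROUGE-L for a single prediction/reference pair with the `nltk` and `rouge-score` packages. How these scores are folded into `locomo_grader` and the judged-file fields they attach to are assumptions you will need to adapt to the actual data in `{lib}_locomo_judged.json`.

```python
# Sketch of lexical-overlap metrics that could be added to locomo_grader.
# The nltk and rouge-score packages are assumed to be installed separately.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

def lexical_scores(prediction: str, reference: str) -> dict:
    """Return sentence-level BLEU and ROUGE-L F1 for one answer pair."""
    smoother = SmoothingFunction().method1  # avoids zero scores on short answers
    bleu = sentence_bleu(
        [reference.split()], prediction.split(), smoothing_function=smoother
    )
    rouge_l = (
        rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        .score(reference, prediction)["rougeL"]
        .fmeasure
    )
    return {"bleu": bleu, "rouge_l": rouge_l}

# Example with a hypothetical answer pair:
print(lexical_scores("Paris is the capital of France", "The capital of France is Paris"))
```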