This repository provides tools and scripts for evaluating the LoCoMo and LongMemEval datasets with various models and APIs.
- Set the `PYTHONPATH` environment variable and change into the evaluation directory:

  ```bash
  export PYTHONPATH=../src
  cd evaluation
  ```
- Install the required dependencies:

  ```bash
  poetry install --with eval
  ```
- Create a `.env` file in the `evaluation/` directory and include the following environment variables:

  ```
  OPENAI_API_KEY="sk-xxx"
  OPENAI_BASE_URL="your_base_url"
  MEM0_API_KEY="your_mem0_api_key"
  MEM0_PROJECT_ID="your_mem0_proj_id"
  MEM0_ORGANIZATION_ID="your_mem0_org_id"
  MODEL="gpt-4o-mini" # or your preferred model
  EMBEDDING_MODEL="text-embedding-3-small" # or your preferred embedding model
  ZEP_API_KEY="your_zep_api_key"
  ```
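The evaluation scripts read these values from the environment. The snippet below is only a minimal sketch of how they are typically loaded, assuming the `python-dotenv` package is available; check the scripts themselves for the exact variables each one requires.

```python
# Minimal sketch (not the repository's exact loading code): read the variables
# defined in evaluation/.env, assuming the python-dotenv package is installed.
import os

from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current working directory

openai_api_key = os.getenv("OPENAI_API_KEY")
model = os.getenv("MODEL", "gpt-4o-mini")                  # chat model for responses/grading
embedding_model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")

if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is missing; check your evaluation/.env file")
```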
The smaller LoCoMo dataset is already included in this repository to make reproduction easier.
To download the LongMemEval dataset, run the following command:

```bash
huggingface-cli download --repo-type dataset --resume-download xiaowu0162/longmemeval --local-dir data/longmemeval
```

After downloading, rename the files as follows:

```
longmemeval_m.json
longmemeval_s.json
longmemeval_oracle.json
```
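If you prefer downloading from Python rather than the CLI, a sketch using `huggingface_hub.snapshot_download` is shown below; the expected filenames are the renamed ones listed above, and the final check simply confirms they exist.

```python
# Sketch: fetch the LongMemEval dataset with the huggingface_hub Python API
# instead of huggingface-cli, then confirm the renamed files are in place.
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = Path("data/longmemeval")
snapshot_download(
    repo_id="xiaowu0162/longmemeval",
    repo_type="dataset",
    local_dir=local_dir,
)

# After renaming, these three files should exist.
expected = ["longmemeval_m.json", "longmemeval_s.json", "longmemeval_oracle.json"]
missing = [name for name in expected if not (local_dir / name).exists()]
if missing:
    print(f"Still missing (did you rename the downloaded files?): {missing}")
else:
    print("LongMemEval files are ready.")
```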
To evaluate the LoCoMo dataset, execute the following scripts in order (a sketch that chains all five steps appears after the list):
- Ingest LoCoMo history into MemOS:

  ```bash
  python scripts/locomo/locomo_ingestion.py --lib memos
  ```
- Search memory for each QA pair in LoCoMo:

  ```bash
  python scripts/locomo/locomo_search.py --lib memos
  ```
- Generate responses from OpenAI with the provided context:

  ```bash
  python scripts/locomo/locomo_responses.py --lib memos
  ```
- Evaluate the generated answers:

  ```bash
  python scripts/locomo/locomo_eval.py --lib memos
  ```
- Calculate fine-grained scores for each category:

  ```bash
  python scripts/locomo/locomo_metric.py --lib memos
  ```
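The five stages can also be chained from a single script. The sketch below is a convenience wrapper, not part of the repository: it simply shells out to each command in order and stops on the first failure, assuming you run it from `evaluation/` with `PYTHONPATH` set as described above.

```python
# Hypothetical helper: run the full LoCoMo pipeline for one memory backend.
import subprocess
import sys

LIB = "memos"  # the --lib value shown above; other backends may need extra keys in .env

STAGES = [
    "scripts/locomo/locomo_ingestion.py",   # ingest conversation history
    "scripts/locomo/locomo_search.py",      # retrieve memory for each QA pair
    "scripts/locomo/locomo_responses.py",   # generate answers with the retrieved context
    "scripts/locomo/locomo_eval.py",        # judge the generated answers
    "scripts/locomo/locomo_metric.py",      # aggregate fine-grained per-category scores
]

for script in STAGES:
    print(f">>> python {script} --lib {LIB}")
    subprocess.run([sys.executable, script, "--lib", LIB], check=True)  # stop on first failure
```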
- **Add New Metrics**: When incorporating the evaluation of reflection duration, make sure to record the related data in `{lib}_locomo_judged.json`. For additional NLP metrics such as BLEU and ROUGE-L, adjust the `locomo_grader` function in `scripts/locomo/locomo_eval.py` (see the sketch after this list).
- **Intermediate Results**: Intermediate results such as `{lib}_locomo_search_results.json`, `{lib}_locomo_responses.json`, and `{lib}_locomo_judged.json` are provided for reproducibility. Contributors are encouraged to report final results in the PR description rather than editing these files directly; valuable modifications will be merged into an updated version of the evaluation code with revised intermediate results at regular intervals.
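As a starting point for the extra NLP metrics mentioned above, the sketch below shows one way to compute BLEU and ROUGE-L for a single prediction/reference pair with the `nltk` and `rouge-score` packages. How these scores are folded into `locomo_grader` and the judged-file fields they attach to are assumptions you will need to adapt to the actual data in `{lib}_locomo_judged.json`.

```python
# Sketch of lexical-overlap metrics that could be added to locomo_grader.
# The nltk and rouge-score packages are assumed to be installed separately.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

def lexical_scores(prediction: str, reference: str) -> dict:
    """Return sentence-level BLEU and ROUGE-L F1 for one answer pair."""
    smoother = SmoothingFunction().method1  # avoids zero scores on short answers
    bleu = sentence_bleu(
        [reference.split()], prediction.split(), smoothing_function=smoother
    )
    rouge_l = (
        rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        .score(reference, prediction)["rougeL"]
        .fmeasure
    )
    return {"bleu": bleu, "rouge_l": rouge_l}

# Example with a hypothetical answer pair:
print(lexical_scores("Paris is the capital of France", "The capital of France is Paris"))
```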