feat: Add context evaluation to eval framework [COG-1366] #586
Conversation
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
# Conflicts:
#   cognee/modules/retrieval/utils/code_graph_retrieval.py
Actionable comments posted: 0
🧹 Nitpick comments (4)
evals/eval_framework/answer_generation/answer_generation_executor.py (1)
2-9: Add Type import for better type annotations.

The imports look good, but to properly annotate class types you should add Type from the typing module, which is needed for accurate type hints.

```diff
-from typing import List, Dict, Callable, Awaitable
+from typing import List, Dict, Callable, Awaitable, Type
```

cognee/tests/unit/eval_framework/deepeval_adapter_test.py (3)
18-37: Consider adding context-specific assertions.

The test validates that EM and F1 metrics are returned but doesn't explicitly verify that the context is correctly utilized in the evaluation. Consider adding assertions that specifically test how different contexts affect the evaluation results.
39-91: Enhance test consistency with retrieval_context.

For consistency, consider adding the retrieval_context field to all test cases, particularly in test_unsupported_metric and test_missing_fields_in_answer. This would ensure comprehensive test coverage for the new field across all scenarios.
1-11: Consider adding a test for ContextualRelevancyMetric.

The AI summary mentions the introduction of a ContextualRelevancyMetric in the DeepEvalAdapter. Consider adding specific tests for this new metric to validate its behavior with different contexts.
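For illustration, a minimal sketch of such a test might look like the following. The adapter method name (evaluate_answers), the answer-dict keys, and the result shape are assumptions based on this review thread, not the actual cognee interfaces:

```python
import pytest

from cognee.eval_framework.evaluation.deep_eval_adapter import DeepEvalAdapter


@pytest.mark.asyncio
async def test_contextual_relevancy_metric():
    # Hypothetical sketch: evaluate_answers and the result keys are assumed.
    adapter = DeepEvalAdapter()
    answers = [
        {
            "question": "What is 2 + 2?",
            "answer": "4",
            "golden_answer": "4",
            "retrieval_context": "2 + 2 = 4",  # relevant context should score well
        }
    ]
    results = await adapter.evaluate_answers(answers, ["contextual_relevancy"])
    assert "contextual_relevancy" in results[0]["metrics"]
```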
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2 hunks)
- evals/eval_framework/answer_generation/answer_generation_executor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
🔇 Additional comments (6)
evals/eval_framework/answer_generation/answer_generation_executor.py (4)
12-19: Fix type annotations for retriever classes.

The current type annotation indicates instances of BaseRetriever, but you're storing class references rather than instances. This should be updated for clarity and correctness.

```diff
-retriever_options: Dict[str, BaseRetriever] = {
+retriever_options: Dict[str, Type[BaseRetriever]] = {
```
26-26: Revisit the type hint for retriever_cls.

The parameter is typed as BaseRetriever, but you're passing class constructors rather than instances, as shown by the instantiation at line 28.

```diff
-    retriever_cls: BaseRetriever,
+    retriever_cls: Type[BaseRetriever],
```
34-35: Good implementation of the context-aware retrieval process.

The two-step retrieval process (getting context first, then obtaining completion with that context) is a clean approach that properly separates concerns and allows for better evaluation of context quality.
42-42: Great addition of retrieval context to results.

Including the retrieval context in the output allows for better analysis of how different contexts affect answer quality, which is essential for context evaluation.
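As a sketch of that two-step flow, the shape could be roughly the following; the get_context/get_completion names follow this review's description of BaseRetriever, while the import path, exact signatures, and dict keys are assumptions:

```python
from typing import Dict, Type

from cognee.modules.retrieval.base_retriever import BaseRetriever  # assumed path


async def answer_with_context(
    retriever_cls: Type[BaseRetriever], query: str, golden_answer: str
) -> Dict[str, str]:
    """Sketch of the two-step flow: fetch context first, then complete with it."""
    retriever = retriever_cls()  # instantiate the configured retriever class
    context = await retriever.get_context(query)  # step 1: gather supporting context
    answer = await retriever.get_completion(query, context=context)  # step 2: answer
    return {
        "question": query,
        "answer": answer,
        "golden_answer": golden_answer,
        "retrieval_context": context,  # surfaced for downstream context evaluation
    }
```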
cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2)
25-25: Good addition of retrieval context for testing.

This change properly integrates context evaluation capabilities into the testing framework, aligning with the PR objective. The provided context "2 + 2 = 4" is relevant to the question and answer being evaluated.
81-81: Good test coverage for None values.

Extending the None value testing to include the new retrieval_context field ensures proper handling of null contexts in the adapter.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- evals/eval_framework/eval_config.py (1 hunks)
- evals/eval_framework/modal_run_eval.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- evals/eval_framework/modal_run_eval.py
⏰ Context from checks skipped due to timeout of 90000ms (23)
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: windows-latest
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_simple_example_test
- GitHub Check: docker-compose-test
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/eval_framework/evaluation/evaluation_executor.py (1)
6-10: Constructor signature updated to accept context evaluation parameter.

The __init__ method now accepts an evaluate_contexts parameter, with a default value of False. Note that this default differs from the True default in the EvalConfig class. While this shouldn't cause issues since the parameter is explicitly passed from configuration, consider aligning these defaults to avoid potential confusion in future maintenance.

```diff
-    def __init__(
-        self,
-        evaluator_engine: Union[str, EvaluatorAdapter, Any] = "DeepEval",
-        evaluate_contexts: bool = False,
-    ) -> None:
+    def __init__(
+        self,
+        evaluator_engine: Union[str, EvaluatorAdapter, Any] = "DeepEval",
+        evaluate_contexts: bool = True,
+    ) -> None:
```

cognee/eval_framework/evaluation/deep_eval_adapter.py (1)
34-34: Prevent potential KeyError for answer["retrieval_context"].

Currently, retrieval_context=[answer["retrieval_context"]] could raise a KeyError if retrieval_context is missing from the answer dictionary. Consider using .get("retrieval_context", default_value) or adding error handling.

```diff
-        retrieval_context=[answer["retrieval_context"]]
+        retrieval_context=[answer.get("retrieval_context", None)]
```
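For reference, the defensive construction could look like this sketch; LLMTestCase comes from DeepEval, while the answer-dict keys are assumptions from this thread:

```python
from deepeval.test_case import LLMTestCase

# Sketch: .get avoids a KeyError when an answer record predates the new field.
test_case = LLMTestCase(
    input=answer["question"],
    actual_output=answer["answer"],
    expected_output=answer["golden_answer"],
    retrieval_context=[answer.get("retrieval_context", "")],  # list of str
)
```

An empty string may be a safer default than None here, since retrieval_context is typed as a list of strings.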
cognee/eval_framework/answer_generation/answer_generation_executor.py (2)

3-9: Explicitly annotate imports as classes if needed.

Here you import retriever classes, but the type annotation for retriever_options is Dict[str, BaseRetriever], implying instances, not classes. Adjust to Dict[str, Type[BaseRetriever]] or instantiate them here if this is intentional.

```diff
-retriever_options: Dict[str, BaseRetriever] = {
+from typing import Type
+retriever_options: Dict[str, Type[BaseRetriever]] = {
```
26-28: Clarify the type of retriever_cls.

Since retriever_cls represents a constructor, consider updating its type annotation to Type[BaseRetriever]. Otherwise, the current signature may cause confusion, as it suggests an instance rather than a class.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cognee/eval_framework/answer_generation/answer_generation_executor.py (1 hunks)
- cognee/eval_framework/answer_generation/run_question_answering_module.py (2 hunks)
- cognee/eval_framework/eval_config.py (2 hunks)
- cognee/eval_framework/evaluation/deep_eval_adapter.py (3 hunks)
- cognee/eval_framework/evaluation/evaluation_executor.py (2 hunks)
- cognee/eval_framework/evaluation/run_evaluation_module.py (1 hunks)
- cognee/tests/unit/eval_framework/answer_generation_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/unit/eval_framework/answer_generation_test.py
⏰ Context from checks skipped due to timeout of 90000ms (31)
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
- GitHub Check: run_simple_example_test
🔇 Additional comments (11)
cognee/eval_framework/eval_config.py (2)
21-21: Added context evaluation configuration flag.

The new evaluating_contexts boolean flag is appropriately added to the EvalConfig class, defaulting to True. This enables context evaluation in the evaluation framework by default.
55-55: Updated to_dict method to include the new configuration parameter.

The to_dict method is properly updated to include the new evaluating_contexts parameter, maintaining consistency with the serialization of other configuration options. The comment provides helpful context about the parameter's purpose.
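Taken together, those two changes might look roughly like this sketch, assuming EvalConfig is a pydantic-style settings class (field names other than evaluating_contexts are illustrative):

```python
from pydantic_settings import BaseSettings  # assumption: pydantic-based config


class EvalConfig(BaseSettings):
    evaluating_contexts: bool = True  # new flag: context evaluation on by default

    def to_dict(self) -> dict:
        return {
            # Controls whether contextual_relevancy joins the metric list.
            "evaluating_contexts": self.evaluating_contexts,
            # ... other configuration fields elided
        }
```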
cognee/eval_framework/answer_generation/run_question_answering_module.py (2)

6-6: Import changed from function-based to class-based retriever approach.

The change from question_answering_engine_options to retriever_options reflects the architectural shift from function-based answer resolvers to class-based retrievers, which is consistent with modern design patterns.
51-52: Updated to use retriever class instead of answer resolver function.

The implementation now uses the retriever_cls parameter with the appropriate class from retriever_options based on the configured engine. This aligns with the architectural changes for using class-based retrievers.
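A condensed sketch of that call-site pattern; the "qa_engine" config key and the executor's method name are hypothetical, not the repo's actual API:

```python
from typing import Any, Dict, List


async def run_question_answering(
    config: Dict[str, Any], questions: List[dict], answer_generator: Any
) -> List[dict]:
    # Look up the retriever class by the configured engine name and pass the
    # class itself (not an instance); it is instantiated inside the executor.
    retriever_cls = retriever_options[config["qa_engine"]]
    return await answer_generator.question_answering(
        questions=questions,
        retriever_cls=retriever_cls,
    )
```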
cognee/eval_framework/evaluation/run_evaluation_module.py (1)

45-48: Added context evaluation parameter to EvaluationExecutor.

The EvaluationExecutor instantiation now correctly passes the evaluating_contexts configuration parameter, enabling the evaluation of contexts when configured.

cognee/eval_framework/evaluation/evaluation_executor.py (2)
21-21: Stored context evaluation setting as instance variable.

The constructor appropriately stores the evaluate_contexts parameter as an instance variable for later use.
24-25: Added conditional logic to include contextual relevancy metric.

The implementation now conditionally appends the "contextual_relevancy" metric to the list of evaluator metrics when context evaluation is enabled. This ensures that the evaluator will perform context evaluation only when configured to do so.
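That conditional might read like this sketch; the baseline metric names follow the EM/F1 metrics mentioned in this review, and the helper is illustrative:

```python
from typing import List


def build_metric_list(evaluate_contexts: bool) -> List[str]:
    """Sketch: extend the metric list only when context evaluation is enabled."""
    evaluator_metrics = ["EM", "f1"]  # baseline metrics per this review
    if evaluate_contexts:
        evaluator_metrics.append("contextual_relevancy")  # context evaluation on
    return evaluator_metrics
```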
cognee/eval_framework/evaluation/deep_eval_adapter.py (2)
8-8: Ensure dependency compatibility for the newly introduced metric.

You are importing ContextualRelevancyMetric from deepeval.metrics. Verify that the library version supports this metric and is pinned in the project's dependencies to avoid potential mismatched or missing attributes at runtime.

Would you like to run a script to confirm the availability of ContextualRelevancyMetric in your current environment?
17-17: New "contextual_relevancy" metric looks good.

Adding a new metric to g_eval_metrics is straightforward and maintains consistency with the other metrics. This cleanly extends the adapter for contextual evaluations.

cognee/eval_framework/answer_generation/answer_generation_executor.py (2)
12-19: Enhance reusability of retriever_options.

This dictionary neatly centralizes retriever classes. Ensure all retrievers share consistent interfaces and that their constructor parameters remain compatible if modifications are made in the future.
34-35: Capture retrieval context carefully.

Storing the retrieval context is beneficial for debugging and evaluation. However, ensure you handle any exceptions from get_context or get_completion gracefully, as network or data errors could cause the method to fail. Would you like to add a try-except block to handle potential retriever errors and ensure graceful fallback?

Also applies to: 42-42
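One possible shape for that fallback (a sketch, not the PR's code; the retriever method names follow this review, the rest is assumed):

```python
import logging
from typing import Tuple

logger = logging.getLogger(__name__)


async def answer_safely(retriever, query: str) -> Tuple[str, str]:
    """Sketch: degrade gracefully when retrieval or completion fails."""
    try:
        context = await retriever.get_context(query)
        answer = await retriever.get_completion(query, context=context)
    except Exception as error:  # network or data errors from the retriever
        logger.warning("Retrieval failed for %r: %s", query, error)
        context, answer = "", ""  # empty fallback keeps the eval run going
    return answer, context
```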
cognee/eval_framework/answer_generation/run_question_answering_module.py (resolved)
hajdul88
left a comment
I think in general we shouldn't change how we generate answers, etc., so the pipeline shouldn't change from a structural point of view. I believe the context can simply be added to the JSON that we are passing through the modules.
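In other words, each per-question record could simply gain a field; a sketch, with key names following the examples elsewhere in this thread:

```python
# The existing pipeline record, extended with one new key. Downstream modules
# that don't evaluate contexts can simply ignore "retrieval_context".
record = {
    "question": "What is 2 + 2?",
    "answer": "4",
    "golden_answer": "4",
    "retrieval_context": "2 + 2 = 4",
}
```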
hajdul88
left a comment
Tested it with dreamify and it works! Also, we talked about the changes, so I think it's good now. The dashboard will need some changes in the next PR. Good job!
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
Refactor
Tests