feat: Add context evaluation to eval framework [COG-1366] #586
Conversation
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
# Conflicts:
#   cognee/modules/retrieval/utils/code_graph_retrieval.py
Actionable comments posted: 0
🧹 Nitpick comments (4)
evals/eval_framework/answer_generation/answer_generation_executor.py (1)
2-9: Add Type import for better type annotations.

The imports look good, but to properly annotate class types you should add Type from the typing module, which is needed for accurate type hints.

```diff
-from typing import List, Dict, Callable, Awaitable
+from typing import List, Dict, Callable, Awaitable, Type
```

cognee/tests/unit/eval_framework/deepeval_adapter_test.py (3)
18-37: Consider adding context-specific assertions.

The test validates that EM and F1 metrics are returned but doesn't explicitly verify that the context is correctly utilized in the evaluation. Consider adding assertions that specifically test how different contexts affect the evaluation results.
39-91: Enhance test consistency with retrieval_context.

For consistency, consider adding the retrieval_context field to all test cases, particularly in test_unsupported_metric and test_missing_fields_in_answer. This would ensure comprehensive test coverage for the new field across all scenarios.
1-11: Consider adding a test for ContextualRelevancyMetric.

The AI summary mentions the introduction of a ContextualRelevancyMetric in the DeepEvalAdapter. Consider adding specific tests for this new metric to validate its behavior with different contexts.
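For illustration, a minimal sketch of such a test might look like the following. The adapter method name (evaluate_answers), the answer-dict keys, and the result shape are assumptions based on this review thread, not the actual cognee interfaces:

```python
import pytest

from cognee.eval_framework.evaluation.deep_eval_adapter import DeepEvalAdapter


@pytest.mark.asyncio
async def test_contextual_relevancy_metric():
    # Hypothetical sketch: evaluate_answers and the result keys are assumed.
    adapter = DeepEvalAdapter()
    answers = [
        {
            "question": "What is 2 + 2?",
            "answer": "4",
            "golden_answer": "4",
            "retrieval_context": "2 + 2 = 4",  # relevant context should score well
        }
    ]
    results = await adapter.evaluate_answers(answers, ["contextual_relevancy"])
    assert "contextual_relevancy" in results[0]["metrics"]
```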
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2 hunks)
- evals/eval_framework/answer_generation/answer_generation_executor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
🔇 Additional comments (6)
evals/eval_framework/answer_generation/answer_generation_executor.py (4)
12-19: Fix type annotations for retriever classes.

The current type annotation indicates instances of BaseRetriever, but you're storing class references rather than instances. This should be updated for clarity and correctness.

```diff
-retriever_options: Dict[str, BaseRetriever] = {
+retriever_options: Dict[str, Type[BaseRetriever]] = {
```
26-26: Revisit the type hint for retriever_cls.

The parameter is typed as BaseRetriever, but you're passing class constructors rather than instances, as shown by the instantiation at line 28.

```diff
-    retriever_cls: BaseRetriever,
+    retriever_cls: Type[BaseRetriever],
```
34-35: Good implementation of the context-aware retrieval process.

The two-step retrieval process (getting context first, then obtaining completion with that context) is a clean approach that properly separates concerns and allows for better evaluation of context quality.
42-42: Great addition of retrieval context to results.

Including the retrieval context in the output allows for better analysis of how different contexts affect answer quality, which is essential for context evaluation.
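As a sketch of that two-step flow, the shape could be roughly the following; the get_context/get_completion names follow this review's description of BaseRetriever, while the import path, exact signatures, and dict keys are assumptions:

```python
from typing import Dict, Type

from cognee.modules.retrieval.base_retriever import BaseRetriever  # assumed path


async def answer_with_context(
    retriever_cls: Type[BaseRetriever], query: str, golden_answer: str
) -> Dict[str, str]:
    """Sketch of the two-step flow: fetch context first, then complete with it."""
    retriever = retriever_cls()  # instantiate the configured retriever class
    context = await retriever.get_context(query)  # step 1: gather supporting context
    answer = await retriever.get_completion(query, context=context)  # step 2: answer
    return {
        "question": query,
        "answer": answer,
        "golden_answer": golden_answer,
        "retrieval_context": context,  # surfaced for downstream context evaluation
    }
```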
cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2)
25-25: Good addition of retrieval context for testing.

This change properly integrates context evaluation capabilities into the testing framework, aligning with the PR objective. The provided context "2 + 2 = 4" is relevant to the question and answer being evaluated.
81-81: Good test coverage for None values.

Extending the None value testing to include the new retrieval_context field ensures proper handling of null contexts in the adapter.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- evals/eval_framework/eval_config.py (1 hunks)
- evals/eval_framework/modal_run_eval.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- evals/eval_framework/modal_run_eval.py
⏰ Context from checks skipped due to timeout of 90000ms (23)
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: windows-latest
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_simple_example_test
- GitHub Check: docker-compose-test
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/eval_framework/evaluation/evaluation_executor.py (1)
6-10: Constructor signature updated to accept context evaluation parameter.

The __init__ method now accepts an evaluate_contexts parameter, with a default value of False. Note that this default differs from the True default in the EvalConfig class. While this shouldn't cause issues since the parameter is explicitly passed from configuration, consider aligning these defaults to avoid potential confusion in future maintenance.

```diff
-    def __init__(
-        self,
-        evaluator_engine: Union[str, EvaluatorAdapter, Any] = "DeepEval",
-        evaluate_contexts: bool = False,
-    ) -> None:
+    def __init__(
+        self,
+        evaluator_engine: Union[str, EvaluatorAdapter, Any] = "DeepEval",
+        evaluate_contexts: bool = True,
+    ) -> None:
```

cognee/eval_framework/evaluation/deep_eval_adapter.py (1)
34-34: Prevent potential KeyError for answer["retrieval_context"].

Currently, retrieval_context=[answer["retrieval_context"]] could raise a KeyError if retrieval_context is missing from the answer dictionary. Consider using .get("retrieval_context", default_value) or adding error handling.

```diff
-        retrieval_context=[answer["retrieval_context"]]
+        retrieval_context=[answer.get("retrieval_context", None)]
```
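For reference, the defensive construction could look like this sketch; LLMTestCase comes from DeepEval, while the answer-dict keys are assumptions from this thread:

```python
from deepeval.test_case import LLMTestCase

# Sketch: .get avoids a KeyError when an answer record predates the new field.
test_case = LLMTestCase(
    input=answer["question"],
    actual_output=answer["answer"],
    expected_output=answer["golden_answer"],
    retrieval_context=[answer.get("retrieval_context", "")],  # list of str
)
```

An empty string may be a safer default than None here, since retrieval_context is typed as a list of strings.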
cognee/eval_framework/answer_generation/answer_generation_executor.py (2)

3-9: Explicitly annotate imports as classes if needed.

Here you import retriever classes, but the type annotation for retriever_options is Dict[str, BaseRetriever], implying instances, not classes. Adjust to Dict[str, Type[BaseRetriever]] or instantiate them here if this is intentional.

```diff
-retriever_options: Dict[str, BaseRetriever] = {
+from typing import Type
+retriever_options: Dict[str, Type[BaseRetriever]] = {
```
26-28: Clarify the type of retriever_cls.

Since retriever_cls represents a constructor, consider updating its type annotation to Type[BaseRetriever]. Otherwise, the current signature may cause confusion, as it suggests an instance rather than a class.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cognee/eval_framework/answer_generation/answer_generation_executor.py (1 hunks)
- cognee/eval_framework/answer_generation/run_question_answering_module.py (2 hunks)
- cognee/eval_framework/eval_config.py (2 hunks)
- cognee/eval_framework/evaluation/deep_eval_adapter.py (3 hunks)
- cognee/eval_framework/evaluation/evaluation_executor.py (2 hunks)
- cognee/eval_framework/evaluation/run_evaluation_module.py (1 hunks)
- cognee/tests/unit/eval_framework/answer_generation_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/unit/eval_framework/answer_generation_test.py
⏰ Context from checks skipped due to timeout of 90000ms (31)
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
- GitHub Check: run_simple_example_test
🔇 Additional comments (11)
cognee/eval_framework/eval_config.py (2)
21-21: Added context evaluation configuration flag.

The new evaluating_contexts boolean flag is appropriately added to the EvalConfig class, defaulting to True. This enables context evaluation in the evaluation framework by default.
55-55: Updated to_dict method to include the new configuration parameter.

The to_dict method is properly updated to include the new evaluating_contexts parameter, maintaining consistency with the serialization of other configuration options. The comment provides helpful context about the parameter's purpose.
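Taken together, those two changes might look roughly like this sketch, assuming EvalConfig is a pydantic-style settings class (field names other than evaluating_contexts are illustrative):

```python
from pydantic_settings import BaseSettings  # assumption: pydantic-based config


class EvalConfig(BaseSettings):
    evaluating_contexts: bool = True  # new flag: context evaluation on by default

    def to_dict(self) -> dict:
        return {
            # Controls whether contextual_relevancy joins the metric list.
            "evaluating_contexts": self.evaluating_contexts,
            # ... other configuration fields elided
        }
```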
cognee/eval_framework/answer_generation/run_question_answering_module.py (2)

6-6: Import changed from function-based to class-based retriever approach.

The change from question_answering_engine_options to retriever_options reflects the architectural shift from function-based answer resolvers to class-based retrievers, which is consistent with modern design patterns.
51-52: Updated to use retriever class instead of answer resolver function.

The implementation now uses the retriever_cls parameter with the appropriate class from retriever_options based on the configured engine. This aligns with the architectural changes for using class-based retrievers.
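A condensed sketch of that call-site pattern; the "qa_engine" config key and the executor's method name are hypothetical, not the repo's actual API:

```python
from typing import Any, Dict, List


async def run_question_answering(
    config: Dict[str, Any], questions: List[dict], answer_generator: Any
) -> List[dict]:
    # Look up the retriever class by the configured engine name and pass the
    # class itself (not an instance); it is instantiated inside the executor.
    retriever_cls = retriever_options[config["qa_engine"]]
    return await answer_generator.question_answering(
        questions=questions,
        retriever_cls=retriever_cls,
    )
```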
cognee/eval_framework/evaluation/run_evaluation_module.py (1)

45-48: Added context evaluation parameter to EvaluationExecutor.

The EvaluationExecutor instantiation now correctly passes the evaluating_contexts configuration parameter, enabling the evaluation of contexts when configured.

cognee/eval_framework/evaluation/evaluation_executor.py (2)
21-21: Stored context evaluation setting as instance variable.

The constructor appropriately stores the evaluate_contexts parameter as an instance variable for later use.
24-25: Added conditional logic to include contextual relevancy metric.

The implementation now conditionally appends the "contextual_relevancy" metric to the list of evaluator metrics when context evaluation is enabled. This ensures that the evaluator will perform context evaluation only when configured to do so.
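That conditional might read like this sketch; the baseline metric names follow the EM/F1 metrics mentioned in this review, and the helper is illustrative:

```python
from typing import List


def build_metric_list(evaluate_contexts: bool) -> List[str]:
    """Sketch: extend the metric list only when context evaluation is enabled."""
    evaluator_metrics = ["EM", "f1"]  # baseline metrics per this review
    if evaluate_contexts:
        evaluator_metrics.append("contextual_relevancy")  # context evaluation on
    return evaluator_metrics
```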
cognee/eval_framework/evaluation/deep_eval_adapter.py (2)
8-8: Ensure dependency compatibility for the newly introduced metric.

You are importing ContextualRelevancyMetric from deepeval.metrics. Verify that the library version supports this metric and is pinned in the project's dependencies to avoid potential mismatched or missing attributes at runtime.

Would you like to run a script to confirm the availability of ContextualRelevancyMetric in your current environment?
17-17: New "contextual_relevancy" metric looks good.

Adding a new metric to g_eval_metrics is straightforward and maintains consistency with the other metrics. This cleanly extends the adapter for contextual evaluations.

cognee/eval_framework/answer_generation/answer_generation_executor.py (2)
12-19: Enhance reusability of retriever_options.

This dictionary neatly centralizes retriever classes. Ensure all retrievers share consistent interfaces and that their constructor parameters remain compatible if modifications are made in the future.
34-35: Capture retrieval context carefully.

Storing the retrieval context is beneficial for debugging and evaluation. However, ensure you handle any exceptions from get_context or get_completion gracefully, as network or data errors could cause the method to fail. Would you like to add a try-except block to handle potential retriever errors and ensure graceful fallback?

Also applies to: 42-42
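One possible shape for that fallback (a sketch, not the PR's code; the retriever method names follow this review, the rest is assumed):

```python
import logging
from typing import Tuple

logger = logging.getLogger(__name__)


async def answer_safely(retriever, query: str) -> Tuple[str, str]:
    """Sketch: degrade gracefully when retrieval or completion fails."""
    try:
        context = await retriever.get_context(query)
        answer = await retriever.get_completion(query, context=context)
    except Exception as error:  # network or data errors from the retriever
        logger.warning("Retrieval failed for %r: %s", query, error)
        context, answer = "", ""  # empty fallback keeps the eval run going
    return answer, context
```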
cognee/eval_framework/answer_generation/run_question_answering_module.py (resolved)
hajdul88
left a comment
I think in general we shouldn't change how we generate answers, etc., so the pipeline shouldn't change from a structural point of view. I believe the context can simply be added to the JSON that we are passing through the modules.
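In other words, each per-question record could simply gain a field; a sketch, with key names following the examples elsewhere in this thread:

```python
# The existing pipeline record, extended with one new key. Downstream modules
# that don't evaluate contexts can simply ignore "retrieval_context".
record = {
    "question": "What is 2 + 2?",
    "answer": "4",
    "golden_answer": "4",
    "retrieval_context": "2 + 2 = 4",
}
```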
hajdul88
left a comment
Tested it with dreamify and it works! Also, we talked about the changes, so I think it's good now. The dashboard will need some changes in the next PR. Good job!
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
Refactor
Tests