Feature/cog 1312 integrating evaluation framework into dreamify #562
Conversation
Walkthrough
This update refactors several modules by updating import paths from the old evals namespace to the new cognee namespace, changes the corpus-building, question-answering, and evaluation entry points to return their results instead of None, and moves metrics dashboard generation out of the evaluation module into run_eval.py (backed by a metrics_dashboard module).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as main()
    participant Corpus as run_corpus_builder
    participant QA as run_question_answering
    participant Eval as run_evaluation
    participant Dashboard as generate_metrics_dashboard
    Main->>Corpus: run_corpus_builder(params, chunk_size, chunker)
    Main->>QA: run_question_answering(params, system_prompt)
    Main->>Eval: run_evaluation(params)
    alt Dashboard requested
        Main->>Dashboard: generate_metrics_dashboard(json_data, output_file, benchmark)
    end
```
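A minimal sketch of how main() in run_eval.py might orchestrate these calls, based only on the order shown in the diagram; the parameter keys, the asyncio entry point, and the omission of the optional arguments (chunk_size, chunker, system_prompt) are assumptions, not code from this PR.

```python
import asyncio

# Import paths mirror the files touched in this PR; the exact layout may differ.
from cognee.eval_framework.corpus_builder.run_corpus_builder import run_corpus_builder
from cognee.eval_framework.answer_generation.run_question_answering_module import run_question_answering
from cognee.eval_framework.evaluation.run_evaluation_module import run_evaluation
from cognee.eval_framework.metrics_dashboard import generate_metrics_dashboard


async def main(params: dict) -> None:
    # Build the corpus; the function now returns the benchmark questions.
    questions = await run_corpus_builder(params)

    # Answer the questions; the function now returns the answers list.
    answers = await run_question_answering(params)

    print(f"{len(questions)} questions prepared, {len(answers)} answers generated")

    # Score the answers; metrics are persisted to params["metrics_path"].
    await run_evaluation(params)

    if params.get("dashboard"):
        # The diagram shows generate_metrics_dashboard(json_data, output_file, benchmark);
        # whether json_data is a loaded object or a file path is assumed here.
        generate_metrics_dashboard(
            json_data=params["metrics_path"],
            output_file=params.get("dashboard_path", "dashboard.html"),
            benchmark=params.get("benchmark", "Dummy"),
        )


if __name__ == "__main__":
    asyncio.run(main({"metrics_path": "metrics.json", "dashboard": False}))
```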
📜 Recent review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (30)
Actionable comments posted: 2
🧹 Nitpick comments (1)
cognee/eval_framework/evaluation/run_evaluation_module.py (1)
31-54: Add error handling around the metrics variable.
While the return type change is good, consider adding error handling around the metrics variable to ensure it is not None or empty before returning.
```diff
 metrics = await evaluator.execute(
     answers=answers, evaluator_metrics=params["evaluation_metrics"]
 )
+if not metrics:
+    logging.warning("No metrics were generated during evaluation")
+    return []
 with open(params["metrics_path"], "w", encoding="utf-8") as f:
     json.dump(metrics, f, ensure_ascii=False, indent=4)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- cognee/eval_framework/answer_generation/run_question_answering_module.py (3 hunks)
- cognee/eval_framework/benchmark_adapters/benchmark_adapters.py (1 hunks)
- cognee/eval_framework/benchmark_adapters/dummy_adapter.py (1 hunks)
- cognee/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1 hunks)
- cognee/eval_framework/benchmark_adapters/musique_adapter.py (1 hunks)
- cognee/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1 hunks)
- cognee/eval_framework/corpus_builder/corpus_builder_executor.py (1 hunks)
- cognee/eval_framework/corpus_builder/run_corpus_builder.py (3 hunks)
- cognee/eval_framework/corpus_builder/task_getters/default_task_getter.py (1 hunks)
- cognee/eval_framework/corpus_builder/task_getters/task_getters.py (1 hunks)
- cognee/eval_framework/evaluation/deep_eval_adapter.py (1 hunks)
- cognee/eval_framework/evaluation/evaluation_executor.py (1 hunks)
- cognee/eval_framework/evaluation/evaluator_adapters.py (1 hunks)
- cognee/eval_framework/evaluation/run_evaluation_module.py (3 hunks)
- cognee/eval_framework/run_eval.py (2 hunks)
- cognee/eval_framework/tests/unit/benchmark_adapters_test.py (1 hunks)
✅ Files skipped from review due to trivial changes (12)
- cognee/eval_framework/corpus_builder/task_getters/task_getters.py
- cognee/eval_framework/evaluation/evaluator_adapters.py
- cognee/eval_framework/benchmark_adapters/musique_adapter.py
- cognee/eval_framework/benchmark_adapters/hotpot_qa_adapter.py
- cognee/eval_framework/benchmark_adapters/benchmark_adapters.py
- cognee/eval_framework/evaluation/evaluation_executor.py
- cognee/eval_framework/benchmark_adapters/twowikimultihop_adapter.py
- cognee/eval_framework/corpus_builder/task_getters/default_task_getter.py
- cognee/eval_framework/benchmark_adapters/dummy_adapter.py
- cognee/eval_framework/corpus_builder/corpus_builder_executor.py
- cognee/eval_framework/evaluation/deep_eval_adapter.py
- cognee/eval_framework/tests/unit/benchmark_adapters_test.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: docker-compose-test
🔇 Additional comments (7)
cognee/eval_framework/corpus_builder/run_corpus_builder.py (2)
3-5: LGTM! Import changes align with the new return type and namespace.
The addition of the List type and the updated import path for CorpusBuilderExecutor reflect the namespace change from evals to cognee.

31-31: LGTM! Return type change improves function usability.
The function now returns the questions list instead of None, which is more useful for the evaluation pipeline and aligns with the return type changes in other modules.
Also applies to: 48-48
cognee/eval_framework/evaluation/run_evaluation_module.py (1)
3-4: LGTM! Import changes align with the new return type and namespace.
The addition of the List type and the updated import path for EvaluationExecutor reflect the namespace change from evals to cognee.

cognee/eval_framework/answer_generation/run_question_answering_module.py (2)

3-4: LGTM! Import changes align with the new return type and namespace.
The addition of the List type and the updated import paths for the answer generation modules reflect the namespace change from evals to cognee.

34-34: LGTM! Return type change improves function usability.
The function now returns the answers list instead of None, which is more useful for the evaluation pipeline and aligns with the return type changes in other modules.
Also applies to: 57-57
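The return-type pattern approved in these comments, reduced to a generic stand-in; the function name, body, and the output_path key below are illustrative, not the modules' actual code.

```python
import json
from typing import List


async def run_pipeline_step(params: dict) -> List[dict]:
    # Stand-in for run_corpus_builder / run_question_answering: produce the
    # results, persist them, and return the list so the caller can use it
    # directly (previously these functions returned None).
    results = [{"question": q, "answer": ""} for q in params.get("questions", [])]
    with open(params["output_path"], "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=4)
    return results
```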
cognee/eval_framework/run_eval.py (2)
4-4: LGTM! Import paths updated consistently.
The import statements have been properly updated to reflect the new cognee namespace, and the new import for metrics dashboard generation is appropriately placed.
Also applies to: 6-11
32-33: Details

✅ Verification successful

Verify the metrics flow between evaluation and dashboard generation.
Since the dashboard generation has been moved from run_evaluation_module.py to this file, please verify:
- That run_evaluation no longer attempts to generate the dashboard
- That the metrics data flows correctly between the evaluation and dashboard generation
🏁 Script executed:
```shell
#!/bin/bash
# Description: Verify the metrics flow and dashboard generation logic

# Check if run_evaluation still contains dashboard generation code
echo "Checking run_evaluation_module.py for dashboard generation..."
rg -A 5 "generate_metrics_dashboard" cognee/eval_framework/evaluation/run_evaluation_module.py

# Check the metrics path usage
echo "Checking metrics path usage..."
rg "metrics_path" cognee/eval_framework/
```

Length of output: 720
Dashboard Integration Verified – No Further Changes Required
The verification confirms that:
- The dashboard generation code (i.e., any calls to generate_metrics_dashboard) has been successfully removed from run_evaluation_module.py.
- The metrics_path is correctly defined in eval_config.py, passed via eval_params, and used appropriately in both run_evaluation_module.py and run_eval.py.

Everything appears to be in order with the metrics flow and dashboard integration.
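A small sketch of the metrics hand-off described above, with an illustrative metrics record; in the codebase the metrics_path key comes from eval_config.py via eval_params.

```python
import json

# Illustrative params; really supplied through eval_params / eval_config.py.
eval_params = {"metrics_path": "metrics.json"}

# Stand-in for the evaluator output produced inside run_evaluation.
metrics = [{"metric": "correctness", "score": 0.82}]

# Stage 1: run_evaluation persists the metrics (matching the json.dump call above).
with open(eval_params["metrics_path"], "w", encoding="utf-8") as f:
    json.dump(metrics, f, ensure_ascii=False, indent=4)

# Stage 2: run_eval reads the same file before generating the dashboard, so the
# dashboard step only depends on the persisted metrics file.
with open(eval_params["metrics_path"], "r", encoding="utf-8") as f:
    json_data = json.load(f)
print(f"Loaded {len(json_data)} metric records for the dashboard")
```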
Actionable comments posted: 1
🧹 Nitpick comments (4)
cognee/eval_framework/metrics_dashboard.py (4)
7-22: Consider merging multiple metrics into a combined subplot for clarity.
If the number of metrics grows large, displaying separate individual histograms may get unwieldy. Combining them into subplots or a single figure with different traces can improve readability and reduce clutter.
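Assuming the dashboard is built with Plotly (the mention of figures and traces suggests so), one way the suggestion could be realized with make_subplots; the metric names and scores below are placeholders.

```python
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Placeholder metric distributions; in the real module these would come from
# the evaluation results JSON.
metric_scores = {
    "correctness": [0.70, 0.80, 0.90, 0.85],
    "EM": [1.0, 0.0, 1.0, 1.0],
    "f1": [0.60, 0.75, 0.90, 0.80],
}

# One row of histograms, one subplot per metric, instead of separate figures.
fig = make_subplots(rows=1, cols=len(metric_scores), subplot_titles=list(metric_scores))
for col, (name, scores) in enumerate(metric_scores.items(), start=1):
    fig.add_trace(go.Histogram(x=scores, name=name, nbinsx=10), row=1, col=col)

fig.update_layout(title_text="Metric score distributions", showlegend=False)
html_fragment = fig.to_html(full_html=False, include_plotlyjs="cdn")
```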
25-50: Use a consistent color scheme for confidence interval bars.
Currently, the distribution histograms and the confidence bar charts may have differing colors, making them appear disjoint. Adopting a unified color scheme or legend can provide a more cohesive visual theme.
98-130: Include meta tags for improved HTML readability and responsiveness.
Adding <meta charset="UTF-8"> and <meta name="viewport" content="width=device-width, initial-scale=1"> inside <head> can improve accessibility and ensure the page correctly adapts for mobile devices.
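A hedged sketch of the kind of wrapper this asks for; the helper name is hypothetical and the figure HTML is assumed to be rendered elsewhere.

```python
def wrap_dashboard_html(body: str, title: str = "Evaluation Dashboard") -> str:
    # Hypothetical helper: wrap pre-rendered figure HTML in a full page that
    # includes the charset and viewport meta tags suggested above.
    return f"""<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>{title}</title>
</head>
<body>
{body}
</body>
</html>"""
```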
133-172: Add error handling or validation for JSON parsing.
Currently, the code assumes that both JSON files contain valid fields. In case of missing fields or malformed JSON, execution will fail abruptly. Consider gracefully handling such scenarios by validating or wrapping JSON operations in a try/except block.
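One possible shape for such a guard, as a standalone helper; the function name and logging choice are illustrative, not the module's actual code.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_json_safely(path: str, required_fields: tuple = ()):
    """Load a JSON file, returning None (with a warning) instead of raising
    on missing files, malformed JSON, or absent required fields."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError) as exc:
        logger.warning("Could not load %s: %s", path, exc)
        return None

    # Field validation assumes the top-level JSON value is a dict.
    missing = [field for field in required_fields if field not in data]
    if missing:
        logger.warning("%s is missing expected fields: %s", path, missing)
        return None
    return data
```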
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cognee/eval_framework/evaluation/run_evaluation_module.py (1 hunks)
- cognee/eval_framework/metrics_dashboard.py (1 hunks)
- cognee/eval_framework/modal_run_eval.py (1 hunks)
- cognee/tests/unit/eval_framework/dashboard_test.py (1 hunks)
- cognee/tests/unit/eval_framework/deepeval_adapter_test.py (1 hunks)
- cognee/tests/unit/eval_framework/metrics_test.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- cognee/eval_framework/evaluation/run_evaluation_module.py
- cognee/eval_framework/modal_run_eval.py
🚧 Files skipped from review as they are similar to previous changes (3)
- cognee/tests/unit/eval_framework/dashboard_test.py
- cognee/tests/unit/eval_framework/metrics_test.py
- cognee/tests/unit/eval_framework/deepeval_adapter_test.py
⏰ Context from checks skipped due to timeout of 90000ms (25)
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: windows-latest
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_simple_example_test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
Actionable comments posted: 1
🧹 Nitpick comments (1)
cognee/eval_framework/evaluation/run_evaluation_module.py (1)
5-5: Remove unused import.
create_dashboard is not referenced anywhere in this file and can be safely removed for clarity.

```diff
-from cognee.eval_framework.analysis.dashboard_generator import create_dashboard
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/eval_framework/evaluation/run_evaluation_module.py (1 hunks)
- cognee/eval_framework/run_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/eval_framework/run_eval.py
⏰ Context from checks skipped due to timeout of 90000ms (31)
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: lint (ubuntu-latest, 3.11.x)
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
- GitHub Check: run_simple_example_test
- GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (1)
cognee/eval_framework/evaluation/run_evaluation_module.py (1)
3-3: Looks good.
No issues noted with the updated import path for EvaluationExecutor.
soobrosa left a comment
LGTM, maybe we should keep PRs more agile.
Agree. This was put on the side because the priority was different, which is why we left it open for two weeks.
Description
This PR contains eval framework changes due to the autooptimizer integration
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit

New Features
- Metrics dashboard generation for evaluation results (metrics_dashboard.py), invoked from run_eval.py.

Refactor/Chore
- Import paths migrated from the old evals namespace to the cognee namespace; corpus building, question answering, and evaluation now return their results instead of None.

Tests
- Unit tests added for the dashboard, the DeepEval adapter, and metrics.