Feat/cog 1331 modal run eval #576
Conversation
Walkthrough

The changes refactor and expand the evaluation pipeline and dashboard generation. In the dashboard module, the old bootstrap-based CI calculation is removed, and new modular functions are introduced to create distribution plots, CI plots, detailed HTML, and an integrated dashboard. A new metrics calculator file provides functions for calculating confidence intervals, loading and processing metrics data, and saving aggregated results. The evaluation configuration has been updated with new attributes for metrics processing. Furthermore, the evaluation module now features an asynchronous execution function, and a new Modal-based evaluation pipeline is added for parallel evaluations.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant EvalProcess as execute_evaluation
    participant Executor as EvaluationExecutor
    participant Metrics as calculate_metrics_statistics
    participant Dashboard as create_dashboard

    Client->>EvalProcess: Invoke evaluation
    EvalProcess->>Executor: Process answers
    Executor-->>EvalProcess: Return evaluation results
    EvalProcess->>Metrics: Calculate metrics statistics
    Metrics-->>EvalProcess: Return metrics data and CI
    EvalProcess->>Dashboard: Generate dashboard
    Dashboard-->>EvalProcess: Return dashboard HTML
    EvalProcess-->>Client: Send complete evaluation output
```
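A minimal sketch of the orchestration the diagram above describes; the names (`execute_evaluation`, `EvaluationExecutor`, `calculate_metrics_statistics`, `create_dashboard`) come from the walkthrough, but their exact signatures and return values here are assumptions, not the PR's actual code:

```python
# Hypothetical glue mirroring the first diagram; real signatures may differ.
async def execute_evaluation_sketch(config):
    executor = EvaluationExecutor(config)      # assumed constructor
    results = await executor.execute()         # process answers

    # Aggregate per-metric scores and confidence intervals.
    metrics_data, ci_results = calculate_metrics_statistics(results)

    # Render distribution plots, CI plots, and details into one HTML page.
    return create_dashboard(metrics_data, ci_results)
```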
```mermaid
sequenceDiagram
    participant Main
    participant ModalEval as modal_run_eval
    participant Corpus as run_corpus_builder
    participant QA as run_question_answering
    participant Eval as run_evaluation
    participant Combine as read_and_combine_metrics

    Main->>ModalEval: Trigger modal evaluation
    ModalEval->>Corpus: Build corpus
    ModalEval->>QA: Execute Q&A step
    ModalEval->>Eval: Run evaluation
    Eval-->>ModalEval: Return evaluation output
    ModalEval->>Combine: Combine metrics from files
    Combine-->>ModalEval: Return aggregated metrics
    ModalEval-->>Main: Return final metrics results
```
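And a matching sketch of the Modal-side flow in the second diagram; the step names (`run_corpus_builder`, `run_question_answering`, `run_evaluation`, `read_and_combine_metrics`) are taken from the diagram, while their signatures and the surrounding glue are assumptions:

```python
# Hypothetical orchestration mirroring the second diagram; the real
# modal_run_eval in the PR may differ in signatures and error handling.
async def modal_run_eval_sketch(eval_params: dict):
    await run_corpus_builder(eval_params)        # build the corpus
    await run_question_answering(eval_params)    # generate answers for the QA set
    await run_evaluation(eval_params)            # score the answers

    # Merge per-run metrics files into one aggregated result.
    return read_and_combine_metrics(eval_params)
```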
Possibly related PRs
Suggested reviewers
Actionable comments posted: 0
🔭 Outside diff range comments (1)
evals/eval_framework/analysis/dashboard_generator.py (1)
98-129: ⚠️ Potential issue: Handle the case of an empty `figures` list.

`figures[-1]` can throw an error when `figures` is empty. Including a simple check will improve resiliency.

```diff
+    if not figures:
+        return "<h1>No available figures to display</h1>"
     dashboard_html = f"""
     ...
```
🧹 Nitpick comments (7)
evals/eval_framework/eval_config.py (1)
25-27: Consider clarifying default behavior for metrics calculation.

Providing `True` as the default for `calculate_metrics` is reasonable if metrics are a central aspect of all evaluations. However, if metrics calculation introduces significant overhead or requires additional runtime, you might want to ensure that users explicitly opt in. Otherwise, this is a sensible addition aligning with the rest of the configuration.
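For context, a minimal sketch of what the new configuration fields might look like; the attribute names (`evaluating_answers`, `calculate_metrics`, `aggregate_metrics_path`, `to_dict`) come from this review, but the dataclass structure and defaults shown here are assumptions rather than the project's actual `EvalConfig`:

```python
from dataclasses import dataclass, asdict

@dataclass
class EvalConfigSketch:
    # Fields mentioned in this review; defaults mirror the discussed behavior.
    evaluating_answers: bool = True
    calculate_metrics: bool = True                       # consider whether opt-in is preferable
    aggregate_metrics_path: str = "aggregate_metrics.json"

    def to_dict(self) -> dict:
        # Expose the new attributes alongside the rest of the config.
        return asdict(self)
```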
evals/eval_framework/evaluation/run_evaluation_module.py (1)

32-53: Consider non-blocking file I/O in an async context.

While this function correctly handles file reading and JSON decoding, `open(...)` in an async function may block the event loop if files are large. For better concurrency, consider using asynchronous file operations or a thread executor to prevent potential bottlenecks.
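One way to keep the event loop responsive, shown here as a generic sketch rather than the module's actual code, is to push the blocking read onto a worker thread with `asyncio.to_thread` (the helper name and file path are illustrative):

```python
import asyncio
import json

def _read_json_blocking(path: str) -> dict:
    # Plain blocking read; safe to run in a worker thread.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

async def read_json_non_blocking(path: str) -> dict:
    # Offload the blocking I/O so the event loop stays free.
    return await asyncio.to_thread(_read_json_blocking, path)

# Usage sketch:
# data = await read_json_non_blocking("answers.json")
```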
evals/eval_framework/analysis/dashboard_generator.py (2)

7-22: Consider parameterizing bin size for distribution plots.

The histogram uses a fixed bin count (`nbinsx=10`), which might be too coarse or too fine depending on the data. Allowing a configurable number of bins can enhance the flexibility of these plots.

```diff
 def create_distribution_plots(metrics_data: Dict[str, List[float]], nbins: int = 10) -> List[str]:
     ...
-    fig.add_trace(go.Histogram(x=scores, name=metric, nbinsx=10, marker_color="#1f77b4"))
+    fig.add_trace(go.Histogram(x=scores, name=metric, nbinsx=nbins, marker_color="#1f77b4"))
     ...
```
133-172: Refactor for clearer function naming.

`create_dashboard` reads files, processes data, and writes output, which is more than just “creating.” Renaming to something like `generate_and_save_dashboard` or similar can better reflect its scope.

evals/eval_framework/analysis/metrics_calculator.py (3)
7-18: Consider exposing a random seed parameter.

Repeated runs can produce slightly different intervals. Exposing a seed can help reproduce results in CI pipelines.

```diff
 def bootstrap_ci(scores, num_samples=10000, confidence_level=0.95, seed: Optional[int] = None):
+    if seed is not None:
+        np.random.seed(seed)
     ...
```
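For reference, a self-contained sketch of a percentile-bootstrap confidence interval with an optional seed; this is a generic illustration, not the module's actual `bootstrap_ci` implementation, and the NaN behavior for empty input simply mirrors what the tests described later expect:

```python
from typing import List, Optional, Tuple
import numpy as np

def bootstrap_ci_sketch(
    scores: List[float],
    num_samples: int = 10000,
    confidence_level: float = 0.95,
    seed: Optional[int] = None,
) -> Tuple[float, float, float]:
    # Return (mean, lower, upper); NaNs for empty input.
    if not scores:
        return float("nan"), float("nan"), float("nan")
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    data = np.asarray(scores, dtype=float)
    # Resample with replacement and collect the mean of each resample.
    resample_means = rng.choice(data, size=(num_samples, data.size), replace=True).mean(axis=1)
    alpha = (1.0 - confidence_level) / 2.0
    lower, upper = np.percentile(resample_means, [100 * alpha, 100 * (1 - alpha)])
    return float(data.mean()), float(lower), float(upper)
```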
56-74: Add concurrency-safe writes if needed.

For large-scale usage or parallel writes, consider locks or a temp file rename approach to ensure atomic updates of the aggregate file.
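A generic sketch of the temp-file-then-rename pattern mentioned above (the helper and file names are illustrative, not the project's code); `os.replace` makes the final step atomic for paths on the same filesystem:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, payload: dict) -> None:
    # Write to a temp file in the same directory, then atomically swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp_path, path)  # atomic rename over the target file
    except Exception:
        os.unlink(tmp_path)         # clean up the temp file on failure
        raise

# Usage sketch:
# atomic_write_json("aggregate_metrics.json", {"EM": {"mean": 0.42}})
```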
76-92: Combine input arguments for consistent usage.

`calculate_metrics_statistics` accepts `json_data` and `aggregate_output_path` separately. In some cases, combining or standardizing how these paths are passed might simplify usage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- evals/eval_framework/analysis/dashboard_generator.py (6 hunks)
- evals/eval_framework/analysis/metrics_calculator.py (1 hunks)
- evals/eval_framework/eval_config.py (2 hunks)
- evals/eval_framework/evaluation/run_evaluation_module.py (2 hunks)
- evals/eval_framework/modal_run_eval.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (24)
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: windows-latest
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (16)
evals/eval_framework/eval_config.py (3)
35-35: Default aggregate metrics path looks good.

Using a straightforward, centralized default (`"aggregate_metrics.json"`) aligns well with the rest of the configuration pattern and promotes consistency.
50-50: Ensuring config dictionary consistency with the new attribute.

Adding `"calculate_metrics"` to the output dictionary ensures the behavior is consistent across the application when serialized or accessed elsewhere.
55-55: Aggregated metrics path successfully exposed in the config dictionary.

Including `"aggregate_metrics_path"` in `to_dict()` ensures the file path is accessible for any external logic or serialization needs.

evals/eval_framework/evaluation/run_evaluation_module.py (2)
4-5: New imports for metrics and dashboard generation confirmed.

Bringing in `calculate_metrics_statistics` and `create_dashboard` is consistent with the newly modularized metrics and dashboard logic.
55-85: Evaluation pipeline flags effectively gate each step.

The conditional checks for evaluating answers, calculating metrics, and generating the dashboard provide a clear, modular approach. Logging is also consistent. This structure is readable, and the step-by-step approach is appreciated.
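To illustrate the kind of flag-gated flow this comment refers to, a generic sketch (the flag names `evaluating_answers` and `calculate_metrics` follow this review; the `dashboard` flag and the step callables are assumptions, not the module's actual API):

```python
import logging

logger = logging.getLogger(__name__)

async def run_pipeline_sketch(config, evaluate_answers, calculate_metrics_step, build_dashboard):
    # Each stage runs only when its flag on the config is enabled.
    if config.evaluating_answers:
        logger.info("Evaluating answers...")
        await evaluate_answers(config)

    if config.calculate_metrics:
        logger.info("Calculating metrics statistics...")
        await calculate_metrics_step(config)

    if getattr(config, "dashboard", False):  # "dashboard" flag name is assumed
        logger.info("Generating dashboard...")
        build_dashboard(config)
```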
evals/eval_framework/modal_run_eval.py (5)
1-13: Imports are properly organized for the Modal evaluation pipeline.The selection of libraries and modules aligns with the usage patterns in the subsequent logic. No issues noted.
14-31: Verify keys exist before file reads in `read_and_combine_metrics`.

This helper function cleanly merges metrics files, handling file and JSON errors gracefully. Consider verifying the presence of `metrics_path` and `aggregate_metrics_path` in `eval_params` or providing defaults to avoid a potential `KeyError`.
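A small, generic sketch of the defensive lookup being suggested; the parameter names come from this comment, while the helper and default values are illustrative only:

```python
from typing import Tuple

def get_metrics_paths(eval_params: dict) -> Tuple[str, str]:
    # Fall back to illustrative defaults instead of risking a KeyError.
    metrics_path = eval_params.get("metrics_path", "metrics.json")
    aggregate_path = eval_params.get("aggregate_metrics_path", "aggregate_metrics.json")
    return metrics_path, aggregate_path
```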
33-49: Modal app and image configuration appear consistent with project needs.

The Dockerfile-based approach, environment variable injections, and package installations look properly defined for the Modal environment. No concerns identified.
51-72: Ensure comprehensive param validation in `modal_run_eval`.

Skipping metrics collection when `evaluating_answers` or `calculate_metrics` is false makes sense. Just confirm that required parameters (e.g., paths) are provided to the evaluation even when these flags vary.
74-114: Parallel evaluation logic is well-structured.

Running each `EvalConfig` asynchronously, combining non-`None` results, and writing to a timestamped JSON file is a clear approach. This provides a smooth way to handle multiple evaluations in parallel without collisions.
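A generic sketch of the pattern described here: gather the async evaluations, drop `None` results, and write a timestamped file. The `run_one_eval` callable and file naming are assumptions, not the PR's actual code:

```python
import asyncio
import json
from datetime import datetime

async def run_all_evals_sketch(configs, run_one_eval):
    # Run every evaluation concurrently and keep only the non-None results.
    results = await asyncio.gather(*(run_one_eval(cfg) for cfg in configs))
    combined = [r for r in results if r is not None]

    # A timestamped output file avoids collisions between parallel runs.
    out_path = f"combined_metrics_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(combined, f, indent=2)
    return out_path
```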
evals/eval_framework/analysis/dashboard_generator.py (3)

3-4: Use of additional imports is appropriate.

The newly introduced type annotations and `defaultdict` import improve code clarity and reduce boilerplate.
25-49: Ensure `ci_results` is not empty before plotting.

When no metrics are available, attempting to plot CI bars might result in errors or an empty graph. Consider adding a safeguard for an empty `ci_results`.
52-96: Consider defensive checks for missing keys in entries.

Relying on “question”, “answer”, and “golden_answer” might lead to KeyErrors if the data is malformed. A validation step here or upstream can prevent runtime errors.

evals/eval_framework/analysis/metrics_calculator.py (3)
1-4: Check that NumPy is officially listed as a requirement.

Because this file depends on `numpy`, remember to ensure it’s included in your project dependencies (e.g., `requirements.txt` or `setup.py`).
20-29: Solid error handling for file operations.

Properly raises descriptive exceptions if the file is missing or invalid JSON. Good practice.
31-54: Optional validation of metric data.

If any `entry["metrics"]` is missing or incomplete, this might cause key errors. A gentle validation or fallback may help.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cognee/tests/unit/eval_framework/metrics_test.py (1)
65-73: Comprehensive basic test for the bootstrap_ci function

The test_bootstrap_ci_basic method provides a good validation of the bootstrap_ci function by:
- Testing with a diverse list of scores
- Verifying the mean is calculated correctly
- Confirming the confidence interval bounds are appropriately positioned around the mean
One improvement could be to add a fixed random seed to ensure deterministic test results:

```diff
 def test_bootstrap_ci_basic(self):
     scores = [1, 2, 3, 4, 5]
+    np.random.seed(42)  # Set fixed seed for reproducibility
     mean, lower, upper = bootstrap_ci(scores, num_samples=1000, confidence_level=0.95)
     self.assertAlmostEqual(mean, np.mean(scores), places=2)
     self.assertLessEqual(lower, mean)
     self.assertGreaterEqual(upper, mean)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cognee/tests/unit/eval_framework/answer_generation_test.py (2 hunks)
- cognee/tests/unit/eval_framework/dashboard_test.py (1 hunks)
- cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2 hunks)
- cognee/tests/unit/eval_framework/metrics_test.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (28)
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_simple_example_test
- GitHub Check: docker-compose-test
- GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (16)
cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2)
25-25: Appropriate addition of retrieval context field

The retrieval context field has been correctly added to match the expected input structure for the DeepEvalAdapter. This enhances the test by providing more realistic test data that includes context information alongside questions and answers.
81-81: Consistent test data model update

Good addition of the retrieval_context field with a None value in the none_values test case, maintaining consistency with the updated data model while properly testing null handling.
cognee/tests/unit/eval_framework/answer_generation_test.py (5)
14-16: Well-structured mock retriever implementation

The mock retriever correctly implements both required methods (get_context and get_completion) with appropriate AsyncMock return values, which better reflects the actual retriever behavior expected in the evaluation pipeline.
18-19: Good factory function pattern

Using a factory function for the mock_retriever_cls is the right approach, as it matches how retriever classes would be instantiated in production code.
24-24: Updated parameter name reflects architectural change

Changing from answer_resolver to retriever_cls correctly implements the architectural refactoring mentioned in the PR objectives, transitioning to a retriever-based approach.
27-27: Enhanced test verification

Good addition of an assertion to verify that the context retrieval method was called with the correct question parameter, which ensures the proper interaction between components.
36-36: Updated assertion to match new mock implementation

The assertion has been properly updated to match the new mock implementation's return value ("Mocked answer" instead of "mock_answer").
cognee/tests/unit/eval_framework/metrics_test.py (3)
5-7: Appropriate imports for new test class

Added necessary imports for unittest, numpy, and the bootstrap_ci function that will be tested in the new test class.
74-80: Good edge case test for identical values

This test effectively verifies that bootstrap_ci handles the case where all values are identical, correctly returning equal mean and CI bounds.
82-87: Important empty list edge case handling

The test_bootstrap_ci_empty_list properly verifies that the bootstrap_ci function handles empty lists by returning NaN values, which is a good practice for statistical functions.
cognee/tests/unit/eval_framework/dashboard_test.py (6)
6-12: Updated imports reflect modular dashboard architecture

The imports now correctly reference the separate, more granular functions from the dashboard_generator module, which aligns with the PR objective of splitting the dashboard into calculator and visualization components.
15-15: Descriptive class name reflects expanded functionality

Renaming the test class from TestGenerateMetricsDashboard to TestDashboardFunctions better represents its expanded scope testing multiple dashboard components rather than just metrics generation.
17-38: Well-structured test data setup

The setUp method now correctly initializes three separate data structures:
- metrics_data for raw metrics values
- ci_data for confidence interval information
- detail_data for individual evaluation details
This separation of concerns aligns with the modular dashboard architecture.
40-47: Comprehensive test for details HTML generation

The test_generate_details_html method effectively verifies that:
- The function produces expected HTML structure
- Metric-specific headers are included
- Table headers are properly generated
- Specific metric details are included in the output
This validates the details HTML generation component of the dashboard.
48-60: Effective integration test for dashboard HTML template

This test correctly verifies the full dashboard HTML generation process by:
- Creating distribution plots
- Creating CI plots
- Generating detail HTML
- Combining them using the dashboard template
- Verifying essential elements in the output
This ensures the components work together as expected.
61-88: Complete end-to-end test with file I/O

The test_create_dashboard method provides excellent coverage by:
- Setting up test files with JSON data
- Calling the create_dashboard function
- Verifying the output file exists
- Cleaning up test files
This ensures the full dashboard creation process works as expected, including file I/O operations.
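As a generic illustration of the file-I/O test pattern described above (not the actual test_create_dashboard code; a stub stands in for the function under test), a tempfile-based setup keeps such a test self-cleaning:

```python
import json
import os
import tempfile
import unittest

class FileIOTestPatternSketch(unittest.TestCase):
    """Generic pattern for tests that write input fixtures and verify an output file."""

    def test_output_file_is_created(self):
        with tempfile.TemporaryDirectory() as tmp_dir:
            # Arrange: write an input JSON fixture into the temp directory.
            metrics_path = os.path.join(tmp_dir, "metrics.json")
            with open(metrics_path, "w", encoding="utf-8") as f:
                json.dump([{"metrics": {"EM": {"score": 1.0}}}], f)

            output_path = os.path.join(tmp_dir, "dashboard.html")

            # Act: call the function under test (a stub here, in place of the real one).
            def render_stub(in_path, out_path):
                with open(out_path, "w", encoding="utf-8") as out:
                    out.write("<html></html>")

            render_stub(metrics_path, output_path)

            # Assert: the output exists; TemporaryDirectory cleans everything up afterwards.
            self.assertTrue(os.path.exists(output_path))
```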
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
Refactor
Bug Fixes