
Conversation

@lxobr (Collaborator) commented Feb 24, 2025

Description

  • Split metrics dashboard into two modules: calculator (statistics) and generator (visualization)
  • Added aggregate metrics as a new phase in evaluation pipeline
  • Created modal example to run multiple evaluations in parallel and collect results into a single combined output

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

  • New Features

    • Enhanced metrics reporting with improved visualizations, including histogram and confidence interval plots.
    • Introduced an asynchronous evaluation process that supports parallel execution and streamlined result aggregation.
    • Added new configuration options to control metrics calculation and aggregated output storage.
  • Refactor

    • Restructured dashboard generation and evaluation workflows into a more modular, maintainable design.
    • Improved error handling and logging for better feedback during evaluation processes.
  • Bug Fixes

    • Updated test cases to ensure accurate validation of the new dashboard generation and metrics calculation functionalities.

@lxobr self-assigned this Feb 24, 2025
@coderabbitai bot (Contributor) commented Feb 24, 2025

Warning

Rate limit exceeded

@lxobr has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 14 minutes and 7 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between 3616732 and 3283600.

📒 Files selected for processing (1)
  • evals/eval_framework/modal_run_eval.py (1 hunks)

Walkthrough

The changes refactor and expand the evaluation pipeline and dashboard generation. In the dashboard module, the old bootstrap-based CI calculation is removed, and new modular functions are introduced to create distribution plots, CI plots, detailed HTML, and an integrated dashboard. A new metrics calculator file provides functions for calculating confidence intervals, loading and processing metrics data, and saving aggregated results. The evaluation configuration has been updated with new attributes for metrics processing. Furthermore, the evaluation module now features an asynchronous execution function, and a new Modal-based evaluation pipeline is added for parallel evaluations.

Changes

File | Change Summary
evals/eval_framework/analysis/dashboard_generator.py | Removed bootstrap_ci; refactored dashboard generation by replacing generate_metrics_dashboard with a modular approach using new functions: create_distribution_plots, create_ci_plot, generate_details_html, get_dashboard_html_template, and create_dashboard.
evals/eval_framework/analysis/metrics_calculator.py | New file adding functions for metrics processing: bootstrap_ci, load_metrics_data, extract_metrics_and_details, save_aggregate_metrics, and calculate_metrics_statistics.
evals/eval_framework/eval_config.py | Updated EvalConfig to include new attributes: calculate_metrics (default True) and aggregate_metrics_path (default "aggregate_metrics.json"), with corresponding updates in the to_dict method.
evals/eval_framework/evaluation/run_evaluation_module.py | Added an asynchronous function execute_evaluation and modified run_evaluation to incorporate steps for metrics calculation and dashboard generation, with enhanced logging and streamlined control flow.
evals/eval_framework/modal_run_eval.py | New file for a Modal-based evaluation pipeline; introduces functions modal_run_eval and main to orchestrate evaluation steps, combine metrics from multiple JSON sources, and run evaluations in parallel.
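
For reference, a minimal sketch of the two new configuration options and their defaults described above. This is illustrative only: the real EvalConfig carries many more fields and may not be a plain dataclass.

    from dataclasses import asdict, dataclass

    @dataclass
    class EvalConfig:
        # Only the two options added in this PR are shown; other fields omitted.
        calculate_metrics: bool = True
        aggregate_metrics_path: str = "aggregate_metrics.json"

        def to_dict(self) -> dict:
            # The PR also exposes the new attributes through to_dict().
            return asdict(self)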

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant EvalProcess as execute_evaluation
    participant Executor as EvaluationExecutor
    participant Metrics as calculate_metrics_statistics
    participant Dashboard as create_dashboard

    Client->>EvalProcess: Invoke evaluation
    EvalProcess->>Executor: Process answers
    Executor-->>EvalProcess: Return evaluation results
    EvalProcess->>Metrics: Calculate metrics statistics
    Metrics-->>EvalProcess: Return metrics data and CI
    EvalProcess->>Dashboard: Generate dashboard
    Dashboard-->>EvalProcess: Return dashboard HTML
    EvalProcess-->>Client: Send complete evaluation output
sequenceDiagram
    participant Main
    participant ModalEval as modal_run_eval
    participant Corpus as run_corpus_builder
    participant QA as run_question_answering
    participant Eval as run_evaluation
    participant Combine as read_and_combine_metrics

    Main->>ModalEval: Trigger modal evaluation
    ModalEval->>Corpus: Build corpus
    ModalEval->>QA: Execute Q&A step
    ModalEval->>Eval: Run evaluation
    Eval-->>ModalEval: Return evaluation output
    ModalEval->>Combine: Combine metrics from files
    Combine-->>ModalEval: Return aggregated metrics
    ModalEval-->>Main: Return final metrics results
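
To make the first diagram concrete, here is a minimal, self-contained sketch of the execute_evaluation flow. The stub bodies and signatures are placeholders for illustration, not the repository's actual implementations.

    import asyncio

    async def run_executor(answers_file: str) -> list:
        # Stand-in for the EvaluationExecutor step.
        return [{"metrics": {"correctness": {"score": 1.0}}}]

    def calculate_metrics_statistics(results: list) -> dict:
        # Stand-in for aggregate metrics + confidence intervals.
        return {"correctness": {"mean": 1.0, "ci": (1.0, 1.0)}}

    def create_dashboard(statistics: dict) -> str:
        # Stand-in for the dashboard HTML generation.
        return "<html>dashboard</html>"

    async def execute_evaluation(answers_file: str) -> str:
        results = await run_executor(answers_file)
        statistics = calculate_metrics_statistics(results)
        return create_dashboard(statistics)

    print(asyncio.run(execute_evaluation("answers.json")))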

Possibly related PRs

  • Test: test eval dashboard generation [COG-1234] #570: That PR adds tests for generate_metrics_dashboard and bootstrap_ci, while this PR refactors generate_metrics_dashboard into modular functions and moves bootstrap_ci into metrics_calculator.py, so the two are directly related at the code level.

Suggested reviewers

  • hajdul88
  • borisarzentar

Poem

Oh, how my paws now swiftly code,
With modular hops down a lighter road.
New metrics and dashboards cheer my day,
Async flows lead the bright new way.
I nibble bytes with carrots in sight,
Hopping with joy through code tonight! 🥕🐰



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
evals/eval_framework/analysis/dashboard_generator.py (1)

98-129: ⚠️ Potential issue

Handle the case of an empty figures list.
figures[-1] raises an IndexError when figures is empty; a simple guard improves resilience.

 if not figures:
+    return "<h1>No available figures to display</h1>"

 dashboard_html = f"""
 ...
🧹 Nitpick comments (7)
evals/eval_framework/eval_config.py (1)

25-27: Consider clarifying default behavior for metrics calculation.

Providing True as the default for calculate_metrics is reasonable if metrics are a central aspect of all evaluations. However, if metrics calculation introduces significant overhead or requires additional runtime, you might want to ensure that users explicitly opt in. Otherwise, this is a sensible addition aligning with the rest of the configuration.

evals/eval_framework/evaluation/run_evaluation_module.py (1)

32-53: Consider non-blocking file I/O in an async context.

While this function correctly handles file reading and JSON decoding, open(...) in an async function may block the event loop if files are large. For better concurrency, consider using asynchronous file operations or a thread executor to prevent potential bottlenecks.
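
As an illustration of the thread-executor option, a small sketch assuming Python 3.9+ (for asyncio.to_thread); the function names and return type here are hypothetical, not the module's actual API.

    import asyncio
    import json

    def _read_json(path: str) -> dict:
        # Plain blocking read, kept synchronous on purpose.
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)

    async def load_json_without_blocking(path: str) -> dict:
        # Offload the blocking read to a worker thread so the event loop stays responsive.
        return await asyncio.to_thread(_read_json, path)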

evals/eval_framework/analysis/dashboard_generator.py (2)

7-22: Consider parameterizing bin size for distribution plots.
The histogram uses a fixed bin count (nbinsx=10), which might be too coarse or too fine depending on the data. Allowing a configurable number of bins can enhance the flexibility of these plots.

 def create_distribution_plots(metrics_data: Dict[str, List[float]], nbins: int = 10) -> List[str]:
     ...
-    fig.add_trace(go.Histogram(x=scores, name=metric, nbinsx=10, marker_color="#1f77b4"))
+    fig.add_trace(go.Histogram(x=scores, name=metric, nbinsx=nbins, marker_color="#1f77b4"))
     ...

133-172: Refactor for clearer function naming.
create_dashboard reads files, processes data, and writes output—more than just “creating.” Renaming to something like generate_and_save_dashboard or similar can better reflect its scope.

evals/eval_framework/analysis/metrics_calculator.py (3)

7-18: Consider exposing a random seed parameter.
Repeated runs can produce slightly different intervals. Exposing a seed can help reproduce results in CI pipelines.

 def bootstrap_ci(scores, num_samples=10000, confidence_level=0.95, seed: Optional[int] = None):
+    if seed is not None:
+        np.random.seed(seed)
     ...
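
For completeness, a fuller sketch of a seedable bootstrap_ci using NumPy's Generator API rather than the global np.random.seed; it is illustrative and may differ from the module's actual implementation.

    import numpy as np

    def bootstrap_ci(scores, num_samples=10000, confidence_level=0.95, seed=None):
        # Returns (mean, lower, upper); NaNs for an empty input.
        if len(scores) == 0:
            return float("nan"), float("nan"), float("nan")
        rng = np.random.default_rng(seed)  # seeded generator makes runs reproducible
        values = np.asarray(scores, dtype=float)
        resampled_means = rng.choice(
            values, size=(num_samples, values.size), replace=True
        ).mean(axis=1)
        alpha = (1.0 - confidence_level) / 2.0
        lower, upper = np.quantile(resampled_means, [alpha, 1.0 - alpha])
        return values.mean(), lower, upper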

56-74: Add concurrency-safe writes if needed.
For large-scale usage or parallel writes, consider locks or a temp file rename approach to ensure atomic updates of the aggregate file.
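
A minimal sketch of the temp-file-and-rename approach, assuming a JSON aggregate file; the function name echoes save_aggregate_metrics but the signature is an assumption.

    import json
    import os
    import tempfile

    def save_aggregate_metrics_atomically(statistics: dict, output_path: str) -> None:
        # Write to a temp file in the target directory, then atomically swap it in,
        # so concurrent readers never see a half-written aggregate file.
        directory = os.path.dirname(os.path.abspath(output_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(statistics, f, indent=2)
            os.replace(tmp_path, output_path)  # atomic on POSIX and Windows
        except BaseException:
            os.unlink(tmp_path)
            raise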


76-92: Combine input arguments for consistent usage.
calculate_metrics_statistics accepts json_data and aggregate_output_path separately. In some cases, combining or standardizing how these paths are passed might simplify usage.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e56744 and 416eed1.

📒 Files selected for processing (5)
  • evals/eval_framework/analysis/dashboard_generator.py (6 hunks)
  • evals/eval_framework/analysis/metrics_calculator.py (1 hunks)
  • evals/eval_framework/eval_config.py (2 hunks)
  • evals/eval_framework/evaluation/run_evaluation_module.py (2 hunks)
  • evals/eval_framework/modal_run_eval.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (24)
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: windows-latest
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: docker-compose-test
🔇 Additional comments (16)
evals/eval_framework/eval_config.py (3)

35-35: Default aggregate metrics path looks good.

Using a straightforward, centralized default ("aggregate_metrics.json") aligns well with the rest of the configuration pattern and promotes consistency.


50-50: Config dictionary kept consistent with the new attribute.

Adding "calculate_metrics" to the output dictionary ensures the behavior is consistent across the application when serialized or accessed elsewhere.


55-55: Aggregated metrics path successfully exposed in the config dictionary.

Including "aggregate_metrics_path" in to_dict() ensures the file path is accessible for any external logic or serialization needs.

evals/eval_framework/evaluation/run_evaluation_module.py (2)

4-5: New imports for metrics and dashboard generation confirmed.

Bringing in calculate_metrics_statistics and create_dashboard is consistent with the newly modularized metrics and dashboard logic.


55-85: Evaluation pipeline flags effectively gate each step.

The conditional checks for evaluating answers, calculating metrics, and generating the dashboard provide a clear, modular approach. Logging is also consistent. This structure is readable, and the step-by-step approach is appreciated.

evals/eval_framework/modal_run_eval.py (5)

1-13: Imports are properly organized for the Modal evaluation pipeline.

The selection of libraries and modules aligns with the usage patterns in the subsequent logic. No issues noted.


14-31: Verify keys exist before file reads in read_and_combine_metrics.

This helper function cleanly merges metrics files, handling file and JSON errors gracefully. Consider verifying the presence of metrics_path and aggregate_metrics_path in eval_params or providing defaults to avoid potential KeyError.
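
A small sketch of the suggested guard; the fallback for metrics_path is invented for illustration, while "aggregate_metrics.json" matches the EvalConfig default noted earlier.

    def resolve_metric_paths(eval_params: dict) -> tuple:
        # Use .get with defaults instead of direct indexing to avoid KeyError.
        metrics_path = eval_params.get("metrics_path", "metrics_output.json")
        aggregate_path = eval_params.get("aggregate_metrics_path", "aggregate_metrics.json")
        return metrics_path, aggregate_path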


33-49: Modal app and image configuration appear consistent with project needs.

The Dockerfile-based approach, environment variable injections, and package installations look properly defined for the Modal environment. No concerns identified.


51-72: Ensure comprehensive param validation in modal_run_eval.

Skipping metrics collection when evaluating_answers or calculate_metrics is false makes sense. Just confirm that required parameters (e.g., paths) are provided to the evaluation even when these flags vary.


74-114: Parallel evaluation logic is well-structured.

Running each EvalConfig asynchronously, combining non-None results, and writing to a timestamped JSON file is a clear approach. This provides a smooth way to handle multiple evaluations in parallel without collisions.
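
A compact sketch of the described pattern: gather the runs concurrently, drop None results, and write a timestamped JSON file. The helper names and config labels are hypothetical.

    import asyncio
    import json
    from datetime import datetime
    from typing import Optional

    async def run_single_eval(config_name: str) -> Optional[dict]:
        # Stand-in for one modal_run_eval call; a skipped run would return None.
        return {"config": config_name, "metrics": {}}

    async def run_all(config_names: list) -> str:
        results = await asyncio.gather(*(run_single_eval(name) for name in config_names))
        combined = [r for r in results if r is not None]  # keep only completed runs
        output_path = f"combined_metrics_{datetime.now():%Y%m%d_%H%M%S}.json"
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(combined, f, indent=2)
        return output_path

    print(asyncio.run(run_all(["baseline_config", "graph_config"])))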

evals/eval_framework/analysis/dashboard_generator.py (3)

3-4: Use of additional imports is appropriate.
The newly introduced type annotations and defaultdict import improve code clarity and reduce boilerplate.


25-49: Ensure ci_results is not empty before plotting.
When no metrics are available, attempting to plot CI bars might result in errors or an empty graph. Consider adding a safeguard for an empty ci_results.


52-96: Consider defensive checks for missing keys in entries.
Relying on “question”, “answer”, and “golden_answer” might lead to KeyErrors if the data is malformed. A validation step here or upstream can prevent runtime errors.
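
One possible defensive variant, using entry.get with placeholders for the keys named above; the helper name and HTML shape are assumptions, not the module's actual code.

    def format_detail_row(entry: dict) -> str:
        # Fall back to a placeholder instead of raising KeyError on malformed entries.
        question = entry.get("question", "N/A")
        answer = entry.get("answer", "N/A")
        golden = entry.get("golden_answer", "N/A")
        return f"<tr><td>{question}</td><td>{answer}</td><td>{golden}</td></tr>"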

evals/eval_framework/analysis/metrics_calculator.py (3)

1-4: Check that NumPy is officially listed as a requirement.
Because this file depends on numpy, remember to ensure it’s included in your project dependencies (e.g., requirements.txt or setup.py).


20-29: Solid error handling for file operations.
Properly raises descriptive exceptions if the file is missing or invalid JSON. Good practice.
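
A sketch of what such handling can look like, reusing the load_metrics_data name from the change summary; the exact messages and return type are guesses.

    import json

    def load_metrics_data(path: str) -> list:
        # Re-raise with descriptive messages for the two common failure modes.
        try:
            with open(path, "r", encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError as exc:
            raise FileNotFoundError(f"Metrics file not found: {path}") from exc
        except json.JSONDecodeError as exc:
            raise ValueError(f"Metrics file is not valid JSON: {path}") from exc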


31-54: Optional validation of metric data.
If any entry["metrics"] is missing or incomplete, this might cause KeyErrors. A lightweight validation step or fallback would help.

@lxobr requested a review from hajdul88 on February 25, 2025, 15:13

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
cognee/tests/unit/eval_framework/metrics_test.py (1)

65-73: Comprehensive basic test for bootstrap_ci function

The test_bootstrap_ci_basic method provides a good validation of the bootstrap_ci function by:

  1. Testing with a diverse list of scores
  2. Verifying the mean is calculated correctly
  3. Confirming the confidence interval bounds are appropriately positioned around the mean

One improvement would be to add a fixed random seed so the test is deterministic:

    def test_bootstrap_ci_basic(self):
        scores = [1, 2, 3, 4, 5]
+       np.random.seed(42)  # Set fixed seed for reproducibility
        mean, lower, upper = bootstrap_ci(scores, num_samples=1000, confidence_level=0.95)

        self.assertAlmostEqual(mean, np.mean(scores), places=2)
        self.assertLessEqual(lower, mean)
        self.assertGreaterEqual(upper, mean)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f3cbd6 and 3616732.

📒 Files selected for processing (4)
  • cognee/tests/unit/eval_framework/answer_generation_test.py (2 hunks)
  • cognee/tests/unit/eval_framework/dashboard_test.py (1 hunks)
  • cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2 hunks)
  • cognee/tests/unit/eval_framework/metrics_test.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (28)
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-13
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: windows-latest
  • GitHub Check: test
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: run_simple_example_test
  • GitHub Check: docker-compose-test
  • GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (16)
cognee/tests/unit/eval_framework/deepeval_adapter_test.py (2)

25-25: Appropriate addition of retrieval context field

The retrieval context field has been correctly added to match the expected input structure for the DeepEvalAdapter. This enhances the test by providing more realistic test data that includes context information alongside questions and answers.


81-81: Consistent test data model update

Good addition of the retrieval_context field with None value in the none_values test case, maintaining consistency with the updated data model while properly testing null handling.

cognee/tests/unit/eval_framework/answer_generation_test.py (5)

14-16: Well-structured mock retriever implementation

The mock retriever correctly implements both required methods (get_context and get_completion) with appropriate AsyncMock return values, which better reflects the actual retriever behavior expected in the evaluation pipeline.


18-19: Good factory function pattern

Using a factory function for the mock_retriever_cls is the right approach, as it matches how retriever classes would be instantiated in production code.


24-24: Updated parameter name reflects architectural change

Changing from answer_resolver to retriever_cls correctly implements the architectural refactoring mentioned in the PR objectives, transitioning to a retriever-based approach.


27-27: Enhanced test verification

Good addition of an assertion to verify that the context retrieval method was called with the correct question parameter, which ensures the proper interaction between components.


36-36: Updated assertion to match new mock implementation

The assertion has been properly updated to match the new mock implementation's return value ("Mocked answer" instead of "mock_answer").

cognee/tests/unit/eval_framework/metrics_test.py (3)

5-7: Appropriate imports for new test class

Added necessary imports for unittest, numpy, and the bootstrap_ci function that will be tested in the new test class.


74-80: Good edge case test for identical values

This test effectively verifies that bootstrap_ci handles the case where all values are identical, correctly returning equal mean and CI bounds.


82-87: Important empty list edge case handling

The test_bootstrap_ci_empty_list properly verifies that the bootstrap_ci function handles empty lists by returning NaN values, which is a good practice for statistical functions.

cognee/tests/unit/eval_framework/dashboard_test.py (6)

6-12: Updated imports reflect modular dashboard architecture

The imports now correctly reference the separate, more granular functions from the dashboard_generator module, which aligns with the PR objective of splitting the dashboard into calculator and visualization components.


15-15: Descriptive class name reflects expanded functionality

Renaming the test class from TestGenerateMetricsDashboard to TestDashboardFunctions better represents its expanded scope testing multiple dashboard components rather than just metrics generation.


17-38: Well-structured test data setup

The setUp method now correctly initializes three separate data structures:

  1. metrics_data for raw metrics values
  2. ci_data for confidence interval information
  3. detail_data for individual evaluation details

This separation of concerns aligns with the modular dashboard architecture.


40-47: Comprehensive test for details HTML generation

The test_generate_details_html method effectively verifies that:

  1. The function produces expected HTML structure
  2. Metric-specific headers are included
  3. Table headers are properly generated
  4. Specific metric details are included in the output

This validates the details HTML generation component of the dashboard.


48-60: Effective integration test for dashboard HTML template

This test correctly verifies the full dashboard HTML generation process by:

  1. Creating distribution plots
  2. Creating CI plots
  3. Generating detail HTML
  4. Combining them using the dashboard template
  5. Verifying essential elements in the output

This ensures the components work together as expected.


61-88: Complete end-to-end test with file I/O

The test_create_dashboard method provides excellent coverage by:

  1. Setting up test files with JSON data
  2. Calling the create_dashboard function
  3. Verifying the output file exists
  4. Cleaning up test files

This ensures the full dashboard creation process works as expected, including file I/O operations.

@lxobr requested a review from alekszievr on March 3, 2025, 12:46
@lxobr merged commit bee04ca into dev on Mar 3, 2025
36 of 37 checks passed
@lxobr deleted the feat/COG-1331-modal-run-eval branch on March 3, 2025, 13:22
