test: answer generation [COG-1234] #569
Conversation
Walkthrough
This pull request introduces a new unit test for the answer generation functionality within the evaluation framework. The test file sets up a dummy data adapter, uses an asynchronous mock answer resolver, and invokes the executor's non-parallel question-answering flow to check the generated answers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test File
    participant Adapter as DummyAdapter
    participant Executor as AnswerGeneratorExecutor
    participant Resolver as Async AnswerResolver
    Test->>Adapter: Load corpus and question-answer pairs
    Test->>Resolver: Setup AsyncMock for answer resolution
    Test->>Executor: Invoke question_answering_non_parallel(questions, resolver)
    Executor->>Resolver: Process and resolve question
    Resolver-->>Executor: Return "mock_answer"
    Executor-->>Test: Return generated answers list
```
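To make the flow above concrete, here is a minimal sketch of such a test. It is illustrative only: the import paths, the pytest-asyncio marker, and the exact constructor signatures are assumptions, while the class and method names follow the diagram and diffs in this review.

```python
from unittest.mock import AsyncMock

import pytest

# Illustrative import paths -- the real module locations may differ.
from evals.eval_framework.benchmark_adapters.dummy_adapter import DummyAdapter
from evals.eval_framework.answer_generation_executor import AnswerGeneratorExecutor


@pytest.mark.asyncio
async def test_answer_generation_non_parallel():
    # Load a tiny corpus and its question-answer pairs from the dummy adapter.
    corpus_list, qa_pairs = DummyAdapter().load_corpus(limit=1)

    # Mock the answer resolver so the test never calls a real LLM.
    qa_engine = AsyncMock(return_value="mock_answer")

    answers = await AnswerGeneratorExecutor().question_answering_non_parallel(
        questions=qa_pairs,
        answer_resolver=qa_engine,
    )

    # One generated answer is expected per question-answer pair.
    assert len(answers) == len(qa_pairs)
```

Mocking the resolver with AsyncMock keeps the test hermetic: no model call is made, and only the executor's orchestration is exercised.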
Actionable comments posted: 1
🔭 Outside diff range comments (1)
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1)
24-26: 🛠️ Refactor suggestion: Add timeout to `requests.get` call.
The request to download the dataset should have a timeout to prevent hanging:
```diff
-        response = requests.get(self.dataset_info["URL"])
+        response = requests.get(self.dataset_info["URL"], timeout=30)
         response.raise_for_status()
         corpus_json = response.json()
```
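For context, a self-contained sketch of what the download step could look like with the timeout applied; the helper name `_download_corpus` and the exception handling are illustrative additions, not part of the suggested diff.

```python
import requests


def _download_corpus(dataset_url: str, timeout: int = 30) -> list[dict]:
    """Fetch the benchmark corpus, failing fast instead of hanging indefinitely."""
    try:
        # A bounded timeout prevents the test run from stalling on a slow mirror.
        response = requests.get(dataset_url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException as exc:
        raise RuntimeError(f"Failed to download corpus from {dataset_url}") from exc
    return response.json()
```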
🧹 Nitpick comments (4)
cognee/tests/unit/eval_framework/answer_generation_test.py (2)
16-19: Consider adding error handling for system preparation. While the system preparation steps look good, consider adding error handling and assertions to verify that each step completed successfully.
```diff
-    await cognee.prune.prune_data()
-    await cognee.prune.prune_system(metadata=True)
-    await cognee.add(corpus_list)
-    await cognee.cognify()
+    try:
+        await cognee.prune.prune_data()
+        await cognee.prune.prune_system(metadata=True)
+        await cognee.add(corpus_list)
+        await cognee.cognify()
+    except Exception as e:
+        pytest.fail(f"System preparation failed: {str(e)}")
```
21-35: Consider adding more edge cases to the assertions. The current assertions cover the basic functionality well. Consider adding more edge cases:
- Test with empty questions list
- Test with invalid answer resolver
- Verify answer format/structure

```diff
+    # Test empty questions list
+    empty_answers = await answer_generator.question_answering_non_parallel(
+        questions=[],
+        answer_resolver=qa_engine,
+    )
+    assert len(empty_answers) == 0, "Empty questions list should return empty answers"
+
     answers = await answer_generator.question_answering_non_parallel(
         questions=qa_pairs,
         answer_resolver=qa_engine,
     )
     assert len(answers) == len(qa_pairs)
+
+    # Verify answer structure
+    assert isinstance(answers[0], dict), "Answer should be a dictionary"
+    assert all(key in answers[0] for key in ["question", "answer", "golden_answer"]), \
+        "Answer dictionary missing required keys"
```
cognee/tests/unit/eval_framework/corpus_builder_test.py (1)
8-16: Consider adding more test cases for corpus loading. The test covers basic functionality but could be enhanced with more edge cases:
- Test with limit=0
- Test with negative limit
- Test with limit larger than available data

```diff
 @pytest.mark.parametrize("benchmark", benchmark_options)
-def test_corpus_builder_load_corpus(benchmark):
+@pytest.mark.parametrize("limit", [2, 0, -1, 1000])
+def test_corpus_builder_load_corpus(benchmark, limit):
-    limit = 2
     corpus_builder = CorpusBuilderExecutor(benchmark, "Default")
     raw_corpus, questions = corpus_builder.load_corpus(limit=limit)
     assert len(raw_corpus) > 0, f"Corpus builder loads empty corpus for {benchmark}"
-    assert len(questions) <= 2, (
+    expected_limit = max(0, limit) if limit != 1000 else float('inf')
+    assert len(questions) <= expected_limit, (
         f"Corpus builder loads {len(questions)} for {benchmark} when limit is {limit}"
     )
```
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1)
31-33: Consider using numpy for random sampling. For better performance with large datasets, consider using numpy's random sampling:

```diff
+import numpy as np
+
 if limit is not None and 0 < limit < len(corpus_json):
-    random.seed(seed)
-    corpus_json = random.sample(corpus_json, limit)
+    rng = np.random.default_rng(seed)
+    indices = rng.choice(len(corpus_json), size=limit, replace=False)
+    corpus_json = [corpus_json[i] for i in indices]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- .github/workflows/test_python_3_10.yml (1 hunks)
- .github/workflows/test_python_3_11.yml (1 hunks)
- .github/workflows/test_python_3_12.yml (1 hunks)
- cognee/tests/unit/eval_framework/answer_generation_test.py (1 hunks)
- cognee/tests/unit/eval_framework/benchmark_adapters_test.py (1 hunks)
- cognee/tests/unit/eval_framework/corpus_builder_test.py (1 hunks)
- evals/eval_framework/benchmark_adapters/dummy_adapter.py (1 hunks)
- evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (2 hunks)
- evals/eval_framework/benchmark_adapters/musique_adapter.py (2 hunks)
- evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: lint (ubuntu-latest, 3.11.x)
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: docker-compose-test
- GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (12)
evals/eval_framework/benchmark_adapters/dummy_adapter.py (1)
9-9: LGTM! Type hint simplification looks good. The simplified return type `tuple[list[str], list[dict[str, str]]]` is more accurate and cleaner than the previous version with `Union[LiteralString, str]`, as the method only returns regular strings.
cognee/tests/unit/eval_framework/answer_generation_test.py (1)
10-14: LGTM! Test setup looks good. Good use of pytest's parametrization to test multiple QA engines. The limit of 1 is appropriate for unit testing as it keeps the test focused and fast.
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1)
5-5: LGTM! Type hint simplification looks good. The simplified imports and return type `tuple[list[str], list[dict[str, Any]]]` are more accurate and cleaner than the previous version with `Union[LiteralString, str]`.
Also applies to: 17-17
evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1)
5-5: LGTM! Type hints simplified correctly. The simplified type hints better reflect the actual types used in the code. The `corpus_list` only contains strings, making `list[str]` more accurate than the previous union type.
Also applies to: 19-19
cognee/tests/unit/eval_framework/benchmark_adapters_test.py (1)
66-69: LGTM! Test assertion correctly focuses on QA pairs limit. The test now properly focuses on validating the limit for QA pairs, which is the primary concern. This change acknowledges that different adapters may handle corpus entries differently, making the previous corpus list length assertion too restrictive.
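For illustration, the relaxed check could look roughly like this; the function and variable names are assumptions, not the test's actual identifiers.

```python
def check_adapter_output(corpus_list: list[str], qa_pairs: list[dict], limit: int) -> None:
    # Only the number of QA pairs is bounded by the limit; the corpus list just
    # has to be non-empty, since adapters may expand corpus entries differently.
    assert len(qa_pairs) <= limit, (
        f"Adapter returned {len(qa_pairs)} QA pairs, expected at most {limit}"
    )
    assert len(corpus_list) > 0, "Adapter returned an empty corpus"
```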
evals/eval_framework/benchmark_adapters/musique_adapter.py (2)
4-4: LGTM! Type hints simplified correctly. The simplified type hints better reflect the actual types used in the code.
67-68: ❓ Verification inconclusive
Verify impact of paragraph handling change.
The change from concatenating paragraphs to appending them individually improves granularity but might affect downstream processing that expects concatenated paragraphs.
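As a rough sketch of the behavioural difference: the field name mirrors the adapter's `paragraph["paragraph_text"]` access, while the sample data and the "before" join are assumptions, not the real code.

```python
# Illustrative only: sample item with two supporting paragraphs.
item = {
    "paragraphs": [
        {"paragraph_text": "First supporting paragraph."},
        {"paragraph_text": "Second supporting paragraph."},
    ]
}

# Before (concatenated): downstream code sees one combined string per question.
corpus_concatenated = [" ".join(p["paragraph_text"] for p in item["paragraphs"])]

# After (appended individually): downstream code sees one entry per paragraph.
corpus_per_paragraph = [p["paragraph_text"] for p in item["paragraphs"]]

assert len(corpus_concatenated) == 1
assert len(corpus_per_paragraph) == 2
```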
Run this script to check for potential impacts:
🏁 Script executed:
```bash
#!/bin/bash
# Search for code that processes the corpus entries
rg -A 5 "corpus_list" --glob '!**/*test*.py'
```
Length of output: 5338
Review: Verify Downstream Impact of Modified Paragraph Handling in `musique_adapter.py`

The change now appends each individual `paragraph["paragraph_text"]` to `corpus_list` rather than concatenating multiple texts. This increased granularity may alter the behavior expected by downstream processing, especially since similar adapters (e.g., `twowikimultihop_adapter.py` and `hotpot_qa_adapter.py`) join sentence fragments into a single string.

Action items:
- Validate downstream consumers: Confirm that any functions or modules processing `corpus_list` in `musique_adapter.py` can handle multiple discrete entries instead of a combined text.
- Ensure consistency: Reassess whether the differing handling across adapters is intentional. If a uniform output format is desired across benchmarks, consider aligning the logic.
- Review integration tests: Verify that tests covering the `musique_adapter` structure and its downstream usage remain valid with this change.

.github/workflows/test_python_3_12.yml (1)
51-51: LGTM! Dependencies and environment variables properly configured. The changes correctly:
- Install evaluation framework dependencies via the `evals` extra
- Provide the necessary `LLM_API_KEY` for unit tests

Also applies to: 57-60
.github/workflows/test_python_3_10.yml (2)
50-50: Enhanced Dependency Installation Command.
The updated command now includes the additional `evals` extra alongside `docs`, ensuring that all required dependencies for evaluation features are installed. This update promotes consistency with the other workflow files.
57-59: Updated Environment Variables in Unit Test Step.
The "Run unit tests" step now explicitly setsENV: 'dev'and introducesLLM_API_KEYfrom GitHub secrets. This ensures that the unit tests have access to the necessary runtime configurations and credentials..github/workflows/test_python_3_11.yml (2)
51-51: Refined Dependency Installation Command.
The installation step now uses `poetry install --no-interaction -E docs -E evals`, which adds the `evals` extra to the dependency management. This adjustment aligns with other Python version workflows and maintains consistency.
59-61: Consistent Environment Configuration for Unit Tests.
The "Run unit tests" step has been updated to include the environment variablesENV: 'dev'andLLM_API_KEY(sourced from GitHub secrets). This change ensures that the unit tests operate with the same configuration across different Python versions.
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit