Cog 813 source code chunks #383
Merged
Commits (19)
682ddfc fix: pass the list of all CodeFiles to enrichment task (lxobr)
d40fc12 feat: introduce SourceCodeChunk, update metadata (lxobr)
aea7382 feat: get_source_code_chunks code graph pipeline task (lxobr)
1c5ca84 feat: integrate get_source_code_chunks task, comment out summarize_code (lxobr)
ccd5cb6 Fix code summarization (#387) (alekszievr)
5deb83e feat: update data models (lxobr)
ace45ef feat: naive parse long strings in source code (lxobr)
b524e94 Merge branch 'dev' into COG-813-source-code-chunks (lxobr)
c1539f6 fix: get_non_py_files instead of get_non_code_files (lxobr)
d2911c1 fix: limit recursion, add comment (lxobr)
c6f4eb1 Merge branch 'dev' into COG-813-source-code-chunks (lxobr)
762df11 handle embedding empty input error (#398) (alekszievr)
ce6f730 feat: robustly handle CodeFile source code (lxobr)
c50d0c7 refactor: sort imports (lxobr)
07dcf73 todo: add support for other embedding models (lxobr)
35071b5 feat: add custom logger (lxobr)
68a9d27 feat: add robustness to get_source_code_chunks (lxobr)
cf63dbc feat: improve embedding exceptions (lxobr)
f5fa3ec refactor: format indents, rename module (lxobr)
New file (140 lines added):

```python
from typing import AsyncGenerator, Generator
from uuid import NAMESPACE_OID, uuid5

import parso
import tiktoken

from cognee.infrastructure.engine import DataPoint
from cognee.shared.CodeGraphEntities import CodePart, SourceCodeChunk, CodeFile
from cognee.tasks.repo_processor import logger


def _count_tokens(tokenizer: tiktoken.Encoding, source_code: str) -> int:
    return len(tokenizer.encode(source_code))
```
```python
def _get_naive_subchunk_token_counts(
    tokenizer: tiktoken.Encoding, source_code: str, max_subchunk_tokens: int = 8000
) -> list[tuple[str, int]]:
    """Splits source code into subchunks of up to max_subchunk_tokens and counts tokens."""
    token_ids = tokenizer.encode(source_code)
    subchunk_token_counts = []

    for start_idx in range(0, len(token_ids), max_subchunk_tokens):
        subchunk_token_ids = token_ids[start_idx: start_idx + max_subchunk_tokens]
        token_count = len(subchunk_token_ids)
        subchunk = ''.join(
            tokenizer.decode_single_token_bytes(token_id).decode('utf-8', errors='replace')
            for token_id in subchunk_token_ids
        )
        subchunk_token_counts.append((subchunk, token_count))

    return subchunk_token_counts
```
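The naive splitter simply windows the token-id list into fixed-size slices. A minimal standalone sketch of the same windowing, using plain integers in place of tiktoken token ids (the function name here is hypothetical, not part of the PR):

```python
def naive_subchunk_token_counts(token_ids, max_subchunk_tokens=4):
    # Window the token ids into fixed-size subchunks, mirroring the
    # range(0, len(token_ids), max_subchunk_tokens) loop above.
    out = []
    for start in range(0, len(token_ids), max_subchunk_tokens):
        window = token_ids[start:start + max_subchunk_tokens]
        out.append((window, len(window)))
    return out

# Ten "tokens" with a limit of 4 yield windows of 4, 4 and 2 tokens.
chunks = naive_subchunk_token_counts(list(range(10)), max_subchunk_tokens=4)
print([count for _, count in chunks])  # → [4, 4, 2]
```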
```python
def _get_subchunk_token_counts(
    tokenizer: tiktoken.Encoding, source_code: str, max_subchunk_tokens: int = 8000
) -> list[tuple[str, int]]:
    """Splits source code into subchunks and counts tokens for each subchunk."""
    try:
        module = parso.parse(source_code)
    except Exception as e:
        logger.error(f"Error parsing source code: {e}")
        return []

    if not module.children:
        logger.warning("Parsed module has no children (empty or invalid source code).")
        return []

    if len(module.children) <= 2:
        # A single top-level node (plus endmarker): descend into it so the
        # recursive call below does not re-parse the same code forever.
        module = module.children[0]

    subchunk_token_counts = []
    for child in module.children:
        subchunk = child.get_code()
        token_count = _count_tokens(tokenizer, subchunk)

        if token_count == 0:
            continue

        if token_count <= max_subchunk_tokens:
            subchunk_token_counts.append((subchunk, token_count))
            continue

        if child.type == 'string':
            subchunk_token_counts.extend(_get_naive_subchunk_token_counts(tokenizer, subchunk, max_subchunk_tokens))
            continue

        subchunk_token_counts.extend(_get_subchunk_token_counts(tokenizer, subchunk, max_subchunk_tokens))

    return subchunk_token_counts
```
```python
def _get_chunk_source_code(
    code_token_counts: list[tuple[str, int]], overlap: float, max_tokens: int
) -> tuple[list[tuple[str, int]], str]:
    """Generates a chunk of source code from tokenized subchunks with overlap handling."""
    current_count = 0
    cumulative_counts = []
    current_source_code = ''

    for child_code, token_count in code_token_counts:
        current_count += token_count
        cumulative_counts.append(current_count)
        if current_count > max_tokens:
            break
        current_source_code += f"\n{child_code}"

    if current_count <= max_tokens:
        return [], current_source_code.strip()

    cutoff = 1
    for i, cum_count in enumerate(cumulative_counts):
        if cum_count > (1 - overlap) * max_tokens:
            break
        cutoff = i

    return code_token_counts[cutoff:], current_source_code.strip()
```
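The overlap arithmetic can be exercised in isolation: the tail carried into the next chunk starts at the last subchunk whose cumulative count is still at or below `(1 - overlap) * max_tokens`. Below is a standalone copy of the helper above, run on toy inputs (no tokenizer needed, since it only sees precomputed counts):

```python
def chunk_with_overlap(code_token_counts, overlap, max_tokens):
    # Greedily fill one chunk, then keep the subchunks past the
    # (1 - overlap) * max_tokens cutoff as the next chunk's prefix.
    current_count = 0
    cumulative_counts = []
    current_source_code = ''
    for child_code, token_count in code_token_counts:
        current_count += token_count
        cumulative_counts.append(current_count)
        if current_count > max_tokens:
            break
        current_source_code += f"\n{child_code}"

    if current_count <= max_tokens:
        return [], current_source_code.strip()

    cutoff = 1
    for i, cum_count in enumerate(cumulative_counts):
        if cum_count > (1 - overlap) * max_tokens:
            break
        cutoff = i
    return code_token_counts[cutoff:], current_source_code.strip()

# Four 30-token subchunks, max_tokens=100, overlap=0.25: the chunk takes
# "a", "b", "c" (90 tokens); the cutoff is 75 tokens, so "b" onward carries over.
remaining, chunk = chunk_with_overlap(
    [("a", 30), ("b", 30), ("c", 30), ("d", 30)], overlap=0.25, max_tokens=100
)
print(chunk)                         # → a\nb\nc (newline-joined)
print([c for c, _ in remaining])     # → ['b', 'c', 'd']
```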
```python
def get_source_code_chunks_from_code_part(
    code_file_part: CodePart,
    max_tokens: int = 8192,
    overlap: float = 0.25,
    granularity: float = 0.1,
    model_name: str = "text-embedding-3-large"
) -> Generator[SourceCodeChunk, None, None]:
    """Yields source code chunks from a CodePart object, with configurable token limits and overlap."""
    tokenizer = tiktoken.encoding_for_model(model_name)
    max_subchunk_tokens = max(1, int(granularity * max_tokens))
    subchunk_token_counts = _get_subchunk_token_counts(tokenizer, code_file_part.source_code, max_subchunk_tokens)

    previous_chunk = None
    while subchunk_token_counts:
        subchunk_token_counts, chunk_source_code = _get_chunk_source_code(subchunk_token_counts, overlap, max_tokens)
        if not chunk_source_code:
            continue
        current_chunk = SourceCodeChunk(
            id=uuid5(NAMESPACE_OID, chunk_source_code),
            code_chunk_of=code_file_part,
            source_code=chunk_source_code,
            previous_chunk=previous_chunk
        )
        yield current_chunk
        previous_chunk = current_chunk
```
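Chunks form a linked list through `previous_chunk`, and ids are deterministic: `uuid5(NAMESPACE_OID, chunk_source_code)` maps identical text to the same id across runs. A minimal sketch of that linking, using a hypothetical plain dataclass in place of cognee's SourceCodeChunk:

```python
from dataclasses import dataclass
from typing import Optional
from uuid import NAMESPACE_OID, UUID, uuid5


@dataclass
class Chunk:  # hypothetical stand-in for SourceCodeChunk
    id: UUID
    source_code: str
    previous_chunk: Optional["Chunk"] = None


def link_chunks(texts):
    # Mirror the previous_chunk threading in the generator above.
    previous = None
    chunks = []
    for text in texts:
        current = Chunk(id=uuid5(NAMESPACE_OID, text), source_code=text, previous_chunk=previous)
        chunks.append(current)
        previous = current
    return chunks


chunks = link_chunks(["def a(): ...", "def b(): ..."])
# The second chunk points back at the first; equal text always gets the same id.
assert chunks[0].previous_chunk is None
assert chunks[1].previous_chunk is chunks[0]
assert chunks[0].id == uuid5(NAMESPACE_OID, "def a(): ...")
```

Deterministic ids mean re-ingesting unchanged source produces the same chunk nodes rather than duplicates.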
```python
async def get_source_code_chunks(
    data_points: list[DataPoint], embedding_model="text-embedding-3-large"
) -> AsyncGenerator[list[DataPoint], None]:
    """Processes code graph data points, creating SourceCodeChunk data points."""
    for data_point in data_points:
        yield data_point
        if not isinstance(data_point, CodeFile):
            continue
        if not data_point.contains:
            continue
        for code_part in data_point.contains:
            yield code_part
            for source_code_chunk in get_source_code_chunks_from_code_part(code_part, model_name=embedding_model):
                yield source_code_chunk
```
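The task is an async generator that re-yields each incoming data point and interleaves derived items immediately after their parent. A toy sketch of the same yield order, with plain dicts standing in for cognee data points:

```python
import asyncio


async def expand(items):
    # Yield each item, then any derived sub-items, in stream order,
    # mirroring get_source_code_chunks' parent-then-children interleaving.
    for item in items:
        yield item
        for derived in item.get("parts", []):
            yield derived


async def main():
    stream = expand([
        {"name": "file.py", "parts": ["part1", "part2"]},
        {"name": "other.py"},
    ])
    return [x async for x in stream]


result = asyncio.run(main())
# Each file is followed by its own parts before the next file appears.
```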