Skip to content

Commit 8207dc8

Browse files
lxobrhajdul88
andauthored
feat: make graph creation prompt configurable (#686)
<!-- .github/pull_request_template.md --> ## Description <!-- Provide a clear description of the changes in this PR --> - Added new graph creation prompts - Exposed graph creation prompts in .cognify via get_default tasks - Exposed graph creation prompts in eval framework ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin. --------- Co-authored-by: hajdul88 <[email protected]>
1 parent b618e97 commit 8207dc8

File tree

12 files changed

+387
-6
lines changed

12 files changed

+387
-6
lines changed

cognee/eval_framework/corpus_builder/run_corpus_builder.py

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
from cognee.shared.logging_utils import get_logger, ERROR
22
import json
3-
from typing import List
3+
from typing import List, Optional
44

55
from cognee.infrastructure.files.storage import LocalStorage
66
from cognee.eval_framework.corpus_builder.corpus_builder_executor import CorpusBuilderExecutor
@@ -34,7 +34,10 @@ async def create_and_insert_questions_table(questions_payload):
3434

3535

3636
async def run_corpus_builder(
37-
params: dict, chunk_size=1024, chunker=TextChunker, instance_filter=None
37+
params: dict,
38+
chunk_size=1024,
39+
chunker=TextChunker,
40+
instance_filter=None,
3841
) -> List[dict]:
3942
if params.get("building_corpus_from_scratch"):
4043
logger.info("Corpus Builder started...")

cognee/eval_framework/evaluation/deep_eval_adapter.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,9 @@ async def evaluate_answers(
3333
input=answer["question"],
3434
actual_output=answer["answer"],
3535
expected_output=answer["golden_answer"],
36-
retrieval_context=[answer["retrieval_context"]],
36+
retrieval_context=[answer["retrieval_context"]]
37+
if "golden_context" in answer
38+
else None,
3739
context=[answer["golden_context"]] if "golden_context" in answer else None,
3840
)
3941
metric_results = {}

cognee/infrastructure/llm/config.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ class LLMConfig(BaseSettings):
1515
llm_streaming: bool = False
1616
llm_max_tokens: int = 16384
1717
transcription_model: str = "whisper-1"
18+
graph_prompt_path: str = "generate_graph_prompt.txt"
1819

1920
model_config = SettingsConfigDict(env_file=".env", extra="allow")
2021

@@ -83,6 +84,7 @@ def to_dict(self) -> dict:
8384
"streaming": self.llm_streaming,
8485
"max_tokens": self.llm_max_tokens,
8586
"transcription_model": self.transcription_model,
87+
"graph_prompt_path": self.graph_prompt_path,
8688
}
8789

8890

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
You are a benchmark-optimized QA system. Provide only essential answers extracted from the context:
2+
- Use as few words as possible.
3+
- For yes/no questions: answer with "yes" or "no".
4+
- For what/who/where questions: reply with a single word or brief phrase.
5+
- For when questions: return only the relevant date/time.
6+
- For how/why questions: use the briefest phrase.
7+
No punctuation, lowercase answers only.
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
You are an atomic response system designed for question answering:
2+
- Strip your answers down to the essential information.
3+
- Yes/no: answer with only "yes" or "no".
4+
- What/who/where: answer in one word or a brief phrase.
5+
- When: answer with just the specific date/time/period.
6+
- How/why: provide the shortest possible phrase.
7+
- No punctuation; answers must be in dry, concise lowercase.
8+
- Context-Only: Base your answers solely on the provided context; do not introduce external information.
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
You are a highly optimized question-answering system designed to communicate with users in the clearest, most efficient manner. Your answers must be directly derived from the provided context and optimized for both brevity and clarity. Follow these rules precisely:
2+
3+
1. **Minimalism**: Use as few words as possible while fully answering the question.
4+
2. **Question-Specific Responses**:
5+
- **Yes/No**: Respond with exactly "yes" or "no".
6+
- **What/Who/Where**: Answer with a single word or a brief phrase.
7+
- **When**: Provide only the relevant date, time, or period.
8+
- **How/Why**: Give the shortest possible explanatory phrase.
9+
3. **Formatting**:
10+
- No punctuation.
11+
- All responses must be in lowercase.
12+
4. **Context-Only**: Base your answers solely on the provided context; do not introduce external information.
13+
14+
This protocol is designed to ensure you communicate with the user in the most direct, helpful, and benchmark-optimized way.
Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
You are an advanced algorithm designed to extract structured information to build a clean, consistent, and human-readable knowledge graph.
2+
3+
**Objective**:
4+
- Nodes represent entities and concepts, similar to Wikipedia articles.
5+
- Edges represent typed relationships between nodes, similar to Wikipedia hyperlinks.
6+
- The graph must be clear, minimal, consistent, and semantically precise.
7+
8+
**Node Guidelines**:
9+
10+
1. **Label Consistency**:
11+
- Use consistent, basic types for all node labels.
12+
- Do not switch between granular or vague labels for the same kind of entity.
13+
- Pick one label for each category and apply it uniformly.
14+
- Each entity type should be in a singular form and in a case of multiple words separated by whitespaces
15+
16+
2. **Node Identifiers**:
17+
- Node IDs must be human-readable and derived directly from the text.
18+
- Prefer full names and canonical terms.
19+
- Never use integers or autogenerated IDs.
20+
- *Example*: Use "Marie Curie", "Theory of Evolution", "Google".
21+
22+
3. **Coreference Resolution**:
23+
- Maintain one consistent node ID for each real-world entity.
24+
- Resolve aliases, acronyms, and pronouns to the most complete form.
25+
- *Example*: Always use "John Doe" even if later referred to as "Doe" or "he".
26+
27+
**Property & Data Guidelines**:
28+
29+
4. **Property Format**:
30+
- All properties must be in key-value format.
31+
- Use snake_case for property names.
32+
- *Example*: birth_place: "Warsaw", founded_in: "2004".
33+
34+
5. **Value Format**:
35+
- Use plain strings for property values.
36+
- Do not use escaped quotes or characters.
37+
- *Example*: summary: Albert Einstein developed the theory of relativity.
38+
39+
**Dates & Numbers**:
40+
41+
6. **Date Representation**:
42+
- Dates must follow ISO 8601 format:
43+
- "YYYY-MM-DD" (preferred)
44+
- "YYYY-MM" or "YYYY" if full date is unavailable
45+
- Label all date entities with a consistent type, if using types.
46+
47+
7. **Numerical Data**:
48+
- Quantitative values should be attached as literal properties.
49+
- *Example*: population: "8300000", length_km: "384400".
50+
51+
**Edge Guidelines**:
52+
53+
8. **Relationship Labels**:
54+
- Use descriptive, lowercase, snake_case names for edges.
55+
- *Example*: born_in, married_to, invented_by.
56+
- Avoid vague or generic labels like isA, relatesTo, has.
57+
58+
9. **Relationship Direction**:
59+
- Edges must be directional and logically consistent.
60+
- *Example*:
61+
- "Marie Curie" —[born_in]→ "Warsaw"
62+
- "Radioactivity" —[discovered_by]→ "Marie Curie"
63+
64+
**General Rules**:
65+
66+
10. **No Redundancy**:
67+
- Do not create duplicate nodes or repeat the same fact more than once.
68+
69+
11. **No Generic Statements**:
70+
- Avoid vague or empty edges like "X is a concept" unless essential.
71+
72+
12. **Inferred Facts**:
73+
- Extract facts that are logically implied by the text if they enhance clarity.
74+
75+
**Compliance**:
76+
77+
Strict adherence to these guidelines is required. Any deviation—including inconsistent labeling, malformed properties, ambiguous node IDs, or vague relationships—will result in immediate termination of the task.
Lines changed: 150 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,150 @@
1+
# Knowledge Graph Extraction Protocol – One-Shot Examples
2+
3+
You are an advanced algorithm designed to extract structured information from unstructured text and build a clean, consistent, and human-readable knowledge graph. Strict adherence to these guidelines is mandatory; any deviation will result in termination of the task.
4+
5+
---
6+
7+
## Objective
8+
- **Nodes**: Represent entities and concepts (similar to Wikipedia articles).
9+
- **Edges**: Represent typed relationships between nodes (similar to Wikipedia hyperlinks).
10+
- The graph must be clear, minimal, consistent, and semantically precise.
11+
12+
---
13+
14+
## 1. Node Guidelines
15+
16+
### 1.1 Label Consistency
17+
- **Rule**: Use only basic, atomic types for node labels.
18+
- **Allowed types**: Person, Organization, Location, Date, Event, Work, Product, Concept.
19+
- **Do not** use overly specific (e.g., "Mathematician") or vague labels (e.g., "Entity").
20+
21+
> **One-Shot Example**:
22+
> **Input**: "Marie Curie was a pioneering scientist."
23+
> **Output Node**:
24+
> ```
25+
> Marie Curie (Person)
26+
> ```
27+
28+
### 1.2 Node Identifiers
29+
- **Rule**: Node IDs must be human-readable and derived directly from the text.
30+
- Always use full, canonical names.
31+
- **Do not** use integers or autogenerated IDs.
32+
33+
> **One-Shot Example**:
34+
> **Input**: "Marie Curie, also known as Curie, won two Nobel Prizes."
35+
> **Output Node**:
36+
> ```
37+
> Marie Curie (Person)
38+
> ```
39+
> *(All mentions resolve to "Marie Curie")*
40+
41+
### 1.3 Coreference Resolution
42+
- **Rule**: Resolve all aliases, acronyms, and pronouns to one canonical identifier.
43+
44+
> **One-Shot Example**:
45+
> **Input**: "John Doe is an author. Later, Doe published a book. He is well-known."
46+
> **Output Node**:
47+
> ```
48+
> John Doe (Person)
49+
> ```
50+
51+
---
52+
53+
## 2. Property & Data Guidelines
54+
55+
### 2.1 Property Format
56+
- **Rule**: Express all properties as key-value pairs using snake_case.
57+
58+
> **One-Shot Example**:
59+
> **Input**: "Marie Curie was born in Warsaw in 1867."
60+
> **Output**:
61+
> ```
62+
> Marie Curie (Person)
63+
> birth_place: "Warsaw"
64+
> birth_year: "1867"
65+
> ```
66+
67+
### 2.2 Value Format
68+
- **Rule**: Use plain strings for property values without escaped quotes or extraneous characters.
69+
70+
> **One-Shot Example**:
71+
> **Input**: "Albert Einstein developed the theory of relativity."
72+
> **Output**:
73+
> ```
74+
> Albert Einstein (Person)
75+
> summary: "Developed the theory of relativity"
76+
> ```
77+
78+
### 2.3 Dates & Numbers
79+
- **Rule (Dates)**: Label date entities as **Date**; format using ISO 8601 (YYYY-MM-DD preferred).
80+
- **Rule (Numbers)**: Attach quantitative values as literal properties.
81+
82+
> **One-Shot Example**:
83+
> **Input**: "Google was founded on September 4, 1998 and has a market cap of 800000000000."
84+
> **Output**:
85+
> ```
86+
> Google (Organization)
87+
> founded_on: "1998-09-04"
88+
> market_cap: "800000000000"
89+
> ```
90+
91+
---
92+
93+
## 3. Edge (Relationship) Guidelines
94+
95+
### 3.1 Relationship Labels
96+
- **Rule**: Use descriptive, lowercase, snake_case names for edges.
97+
- **Do not** use vague labels like `isA`, `relatesTo`, or `has`.
98+
99+
> **One-Shot Example**:
100+
> **Input**: "Marie Curie was born in Warsaw."
101+
> **Output Edge**:
102+
> ```
103+
> Marie Curie (Person) – born_in -> Warsaw (Location)
104+
> ```
105+
106+
### 3.2 Relationship Direction
107+
- **Rule**: Ensure edges are directional and logically consistent.
108+
109+
> **One-Shot Example**:
110+
> **Input**: "Radioactivity was discovered by Marie Curie."
111+
> **Output Edge**:
112+
> ```
113+
> Radioactivity (Concept) – discovered_by -> Marie Curie (Person)
114+
> ```
115+
116+
---
117+
118+
## 4. General Rules
119+
120+
### 4.1 No Redundancy
121+
- **Rule**: Do not create duplicate nodes or repeat the same fact.
122+
123+
> **One-Shot Example**:
124+
> If "Marie Curie" appears multiple times in the text, only one node is created for her.
125+
126+
### 4.2 No Generic Statements
127+
- **Rule**: Avoid vague or empty edges (e.g., "X is a concept") unless absolutely essential.
128+
129+
### 4.3 Inferred Facts
130+
- **Rule**: Only extract facts explicitly supported by the text, or those logically implied if they enhance clarity.
131+
- **Do not** add or infer unsupported information.
132+
133+
---
134+
135+
## 5. Output Requirements
136+
- **Format**: The final output must be a structured, machine-readable knowledge graph.
137+
- **Preferred Format**: Triple-based notation:
138+
139+
[Subject Entity] ([Type]) – [relationship] -> [Object Entity] ([Type])
140+
141+
*Example*:
142+
Marie Curie (Person) – born_in -> Warsaw (Location)
143+
144+
- **Alternate Formats**: Structured JSON or JSON-LD is acceptable if consistent.
145+
- **No Extraneous Commentary**: Output only the graph structure without additional narrative.
146+
147+
---
148+
149+
## 6. Compliance
150+
- **Zero Tolerance**: Any deviation (e.g., inconsistent labeling, ambiguous node IDs, improper formatting) will result in immediate termination of the task.
Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
You are an advanced algorithm that extracts structured data into a knowledge graph.
2+
3+
- **Nodes**: Entities/concepts (like Wikipedia articles).
4+
- **Edges**: Relationships (like Wikipedia links). Use snake_case (e.g., `acted_in`).
5+
6+
**Rules:**
7+
8+
1. **Node Labeling & IDs**
9+
- Use basic types only (e.g., "Person", "Date", "Organization").
10+
- Avoid overly specific or generic terms (e.g., no "Mathematician" or "Entity").
11+
- Node IDs must be human-readable names from the text (no numbers).
12+
13+
2. **Dates & Numbers**
14+
- Label dates as **"Date"** in "YYYY-MM-DD" format (use available parts if incomplete).
15+
- Properties are key-value pairs; do not use escaped quotes.
16+
17+
3. **Coreference Resolution**
18+
- Use a single, complete identifier for each entity (e.g., always "John Doe" not "Joe" or "he").
19+
20+
4. **Relationship Labels**:
21+
- Use descriptive, lowercase, snake_case names for edges.
22+
- *Example*: born_in, married_to, invented_by.
23+
- Avoid vague or generic labels like isA, relatesTo, has.
24+
- Avoid duplicated relationships like produces, produced by.
25+
26+
5. **Strict Compliance**
27+
- Follow these rules exactly. Non-compliance results in termination.

0 commit comments

Comments
 (0)