Merged
Changes from 1 commit
Commits
121 commits
d57c367
enh(preprocessing): Add split_markdown_by_headings.
daavoo Jan 22, 2025
fe93f74
Add benchmark
daavoo Jan 20, 2025
92c70a7
Move to structured_qa. Add entrypoint
daavoo Jan 20, 2025
70ef785
Move back outside
daavoo Jan 20, 2025
16ff8bd
Fix main
daavoo Jan 20, 2025
539898e
Update questions
daavoo Jan 20, 2025
ed71947
Update model and prompt
daavoo Jan 20, 2025
fd4fb95
Update
daavoo Jan 20, 2025
5add514
Update
daavoo Jan 20, 2025
9f8c755
fix
daavoo Jan 20, 2025
bec2ef1
Add system_instruction
daavoo Jan 20, 2025
08cad02
Update ratio
daavoo Jan 20, 2025
b7ce84e
Add more wait
daavoo Jan 20, 2025
6fc48fe
Fix return
daavoo Jan 20, 2025
8929e9e
Fix URLs
daavoo Jan 20, 2025
4a9e75e
Update download name
daavoo Jan 20, 2025
41ffc23
Update
daavoo Jan 20, 2025
4390852
Update
daavoo Jan 20, 2025
68621eb
Update with upper
daavoo Jan 20, 2025
422e5d5
Cast to str
daavoo Jan 20, 2025
3040978
Extend
daavoo Jan 20, 2025
bc0d8ce
Add benchmark
daavoo Jan 20, 2025
03e0e60
Fix
daavoo Jan 20, 2025
c19738e
fix
daavoo Jan 20, 2025
3cd7b24
Drop export
daavoo Jan 21, 2025
22df32b
Updates
daavoo Jan 21, 2025
b35dc23
Update default model
daavoo Jan 21, 2025
6cf13d7
Update
daavoo Jan 21, 2025
ad1ef9b
Use info
daavoo Jan 21, 2025
f237b89
Update with None
daavoo Jan 21, 2025
a34f4e2
Add answer type
daavoo Jan 21, 2025
291e376
Refactor
daavoo Jan 21, 2025
d7e99e7
Add fallback for out of context
daavoo Jan 21, 2025
0f381bb
Update with debugging info
daavoo Jan 21, 2025
a0391a4
Update
daavoo Jan 21, 2025
c3182cb
Update with mit-1
daavoo Jan 22, 2025
20b1651
test unsloth
daavoo Jan 22, 2025
0dd98da
Add , skip_special_tokens = True
daavoo Jan 22, 2025
6ac29aa
Update
daavoo Jan 22, 2025
95b3d57
Updates
daavoo Jan 22, 2025
d946f81
Add full_context
daavoo Jan 22, 2025
4ea1f7d
Update full context
daavoo Jan 22, 2025
a4888f2
update
daavoo Jan 22, 2025
e0f3a82
Add load and clean
daavoo Jan 22, 2025
906c8d9
Update
daavoo Jan 22, 2025
bb2afe5
Update
daavoo Jan 22, 2025
51c31f7
print
daavoo Jan 22, 2025
c5e0ac4
Update
daavoo Jan 22, 2025
cc10a9d
Add load_gemini_model
daavoo Jan 22, 2025
1560c71
Add sleep
daavoo Jan 22, 2025
94e7580
Update get_response
daavoo Jan 22, 2025
e7b5d5b
Update
daavoo Jan 22, 2025
5f6443b
Log error
daavoo Jan 22, 2025
819c6b2
fix
daavoo Jan 22, 2025
5625c39
Make the more info check more flexible
daavoo Jan 23, 2025
d125b79
Add gemini_full_context notebook
daavoo Jan 23, 2025
88a9357
typo
daavoo Jan 23, 2025
d929a80
Check for API key
daavoo Jan 23, 2025
9e718b3
Update with outputs
daavoo Jan 23, 2025
9027567
Add ragatouille
daavoo Jan 23, 2025
d2a3d98
Fix
daavoo Jan 23, 2025
17942ca
Update notebooks
daavoo Jan 24, 2025
fcdd953
Update gemini notebooks
daavoo Jan 24, 2025
bfdacea
Extend structured_qa. Add perfect_context.
daavoo Jan 27, 2025
a7d8dc5
Add gemini_perfect_context
daavoo Jan 27, 2025
308ab91
Update
daavoo Jan 27, 2025
704050b
fix line
daavoo Jan 27, 2025
67b8f80
fix line
daavoo Jan 27, 2025
a6bfe34
Update perfect_context
daavoo Jan 28, 2025
39a17ae
Add missing perfect context
daavoo Jan 28, 2025
ae325d3
Updates
daavoo Jan 28, 2025
56d8620
Update gemini_ragatouille
daavoo Jan 28, 2025
eb00902
Update gemini_fra
daavoo Jan 28, 2025
1d06d2c
Update
daavoo Jan 28, 2025
8ac9201
Update
daavoo Jan 28, 2025
0352173
Drop some log
daavoo Jan 28, 2025
0b8e5cf
Update
daavoo Jan 28, 2025
e2c5457
Update gemini_perfect_context with results
daavoo Jan 29, 2025
36350ee
Use rapidfuzz
daavoo Jan 29, 2025
215226e
Use question_part
daavoo Jan 29, 2025
5d4d961
Fix
daavoo Jan 29, 2025
1223b03
break when no section_names
daavoo Jan 29, 2025
08c0b85
Update prompt
daavoo Jan 29, 2025
7b9c96c
Add qwen perfect context
daavoo Jan 29, 2025
c056bdc
Update gemini_find_retrieve_answer
daavoo Jan 30, 2025
b726447
Update qwen perfect context
daavoo Jan 30, 2025
036f8a3
Add qwen RAGatouille
daavoo Jan 30, 2025
6b0a0c1
Update qwen notebooks
daavoo Jan 30, 2025
c60fe3e
Update
daavoo Jan 30, 2025
d12fa72
Update prompt
daavoo Jan 30, 2025
38d2530
Update qwen notebooks
daavoo Jan 30, 2025
1360437
Cleanup
daavoo Jan 30, 2025
6906991
Cleanup
daavoo Jan 30, 2025
8abcfb1
Add DeepSeek-R1-Distill-Qwen-7B
daavoo Jan 31, 2025
034fe29
Debug current calls. Set to 9 before reset
daavoo Feb 1, 2025
a2d301f
Add qwen find retrieve answer
daavoo Feb 1, 2025
8300573
Extend benchmark
daavoo Feb 3, 2025
4f8f82a
Update
daavoo Feb 3, 2025
2de0bfb
Add max_sections_to_check
daavoo Feb 3, 2025
8f7d173
Default to None
daavoo Feb 3, 2025
7ff95ff
Default to half of sections
daavoo Feb 3, 2025
d05d992
Update
daavoo Feb 3, 2025
db63dc9
fix
daavoo Feb 3, 2025
20f9e3f
Fix
daavoo Feb 3, 2025
c5ee8e6
Add qwen full context
daavoo Feb 3, 2025
a4da649
Update qwen_full_context
daavoo Feb 3, 2025
4ea56e2
Update gemini_full_context
daavoo Feb 3, 2025
82f37f3
Add statistics
daavoo Feb 3, 2025
a02ffd7
Update prompt
daavoo Feb 4, 2025
8af98df
Update with type
daavoo Feb 4, 2025
97049d6
Update gemini prompt and count
daavoo Feb 4, 2025
6555304
Update results with same prompts
daavoo Feb 4, 2025
0ab4688
Update with same prompt
daavoo Feb 4, 2025
5276d16
Update results
daavoo Feb 4, 2025
476bbe1
Bring back llama-cpp-python
daavoo Feb 5, 2025
fdafdc3
Update prompts
daavoo Feb 5, 2025
2ac1f61
Reduce notebook size
daavoo Feb 5, 2025
c99adb0
Update pre-commit
daavoo Feb 5, 2025
a114fe5
Update docstrings
daavoo Feb 5, 2025
df394cc
Merge branch 'main' into 5-add-benchmark
daavoo Feb 5, 2025
eec44b0
Update test
daavoo Feb 5, 2025
Add benchmark
daavoo committed Jan 22, 2025
commit fe93f7426da8c9f7c21c59b4a93f4e367108b6b5
80 changes: 80 additions & 0 deletions benchmark/gemini.py
@@ -0,0 +1,80 @@
import datetime
import json
import os
import time

import google.generativeai as genai
from loguru import logger

SYSTEM_PROMPT = """
You are given an input document and a question.
You can only answer the question based on the information in the document.
You will return a JSON object with two keys: "section" and "answer".
In `"section"`, you will return the name of the section where you found the answer.
In `"answer"`, you will return the answer either as Yes/No (for boolean questions) or as a single number (for numeric questions).
Example response:
{
"section": "1. Introduction",
"answer": "No"
}
"""


def gemini_process_document(document_file, document_data):
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])

    logger.info("Uploading file")
    file = genai.upload_file(document_file, mime_type="application/pdf")
    while file.state.name == "PROCESSING":
        logger.debug("Waiting for file to be processed.")
        time.sleep(2)
        file = genai.get_file(file.name)

    logger.info("Creating cache")
    cache = genai.caching.CachedContent.create(
        model="models/gemini-1.5-flash-8b-latest",
        display_name="cached file",  # used to identify the cache
        system_instruction=SYSTEM_PROMPT,
        contents=[file],
        ttl=datetime.timedelta(minutes=15),
    )

    logger.info("Creating model")
    model = genai.GenerativeModel.from_cached_content(
        cached_content=cache,
        generation_config={
            "temperature": 1,
            "top_p": 0.95,
            "top_k": 40,
            "max_output_tokens": 8192,
            "response_mime_type": "application/json",
        },
    )

    logger.info("Predicting")
    n = 0
    answers = {}
    sections = {}
    for index, row in document_data.iterrows():
        # Stay under the free-tier rate limit: pause after every 13 requests.
        if n > 0 and n % 13 == 0:
            logger.info("Waiting for 60 seconds")
            time.sleep(60)
        question = row["question"]
        logger.debug(f"Question: {question}")
        chat_session = model.start_chat(history=[])

        # Send the actual question (the exported code left an
        # "INSERT_INPUT_HERE" placeholder here).
        response = chat_session.send_message(question)
        logger.debug(response.text)
        response_json = json.loads(response.text)
        answers[index] = response_json["answer"]
        sections[index] = response_json["section"]
        n += 1
    return answers, sections
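The prediction loop trusts `json.loads(response.text)` to succeed and to contain both keys that SYSTEM_PROMPT asks for. A minimal stdlib sketch of that parsing step, wrapped in a hypothetical fallback helper (`parse_structured_answer` is not part of this commit; later commits in this PR add similar out-of-context fallbacks):

```python
import json


def parse_structured_answer(raw_response):
    """Parse a model reply into (section, answer).

    Hypothetical helper, not part of this commit: falls back to
    (None, None) when the reply is not the JSON object that
    SYSTEM_PROMPT asks for.
    """
    try:
        response_json = json.loads(raw_response)
        return response_json["section"], response_json["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None, None


# The example response embedded in SYSTEM_PROMPT parses cleanly:
print(parse_structured_answer('{"section": "1. Introduction", "answer": "No"}'))
# ('1. Introduction', 'No')

# A free-text reply degrades gracefully instead of raising:
print(parse_structured_answer("I could not find the answer."))
# (None, None)
```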
43 changes: 43 additions & 0 deletions benchmark/run_benchmark.py
@@ -0,0 +1,43 @@
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd
from fire import Fire
from loguru import logger


from gemini import gemini_process_document


def download_document(url, output_file):
    if not Path(output_file).exists():
        urlretrieve(url, output_file)
        logger.debug(f"Downloaded {url} to {output_file}")
    else:
        logger.debug(f"File {output_file} already exists")


@logger.catch(reraise=True)
def run_benchmark(input_data: str, output_file: str, model: str):
    logger.info("Loading input data")
    data = pd.read_csv(input_data)
    data["pred_answer"] = [None] * len(data)
    data["pred_section"] = [None] * len(data)

    for document_link, document_data in data.groupby("document"):
        logger.info(f"Downloading document {document_link}")
        downloaded_document = Path(f"example_data/{Path(document_link).name}.pdf")
        download_document(document_link, downloaded_document)

        if model == "gemini":
            answers, sections = gemini_process_document(downloaded_document, document_data)
        else:
            raise ValueError(f"Unknown model: {model}")

        for index in document_data.index:
            data.loc[index, "pred_answer"] = answers[index]
            data.loc[index, "pred_section"] = sections[index]

    data.to_csv(output_file)


if __name__ == "__main__":
    Fire(run_benchmark)
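`run_benchmark` stops at writing `pred_answer`/`pred_section` into the output CSV; this commit does not yet score the predictions. A hypothetical pure-Python sketch of how a scoring step could recover each row's gold label, assuming (as the CSV below suggests) that exactly one of the three answer columns is filled per question:

```python
def gold_answer(row):
    """Pick the gold label from a structured_qa.csv row.

    Hypothetical scoring helper (not part of this commit): exactly one
    of the three answer columns is assumed to be filled per row, the
    other two being empty strings.
    """
    for column in ("bool_answer", "num_answer", "multi_choice_answer"):
        value = row.get(column, "")
        if value != "":
            return column, value
    return None, None


# A boolean question from the Transformer paper rows:
row = {"bool_answer": "1", "num_answer": "", "multi_choice_answer": ""}
print(gold_answer(row))  # ('bool_answer', '1')
```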
61 changes: 61 additions & 0 deletions benchmark/structured_qa.csv
@@ -0,0 +1,61 @@
document,section,question,bool_answer,num_answer,multi_choice_answer
https://arxiv.org/pdf/1706.03762,3 Model Architecture,Does the model use an encoder-only architecture,0,,
https://arxiv.org/pdf/1706.03762,3 Model Architecture,Does the model use a decoder-only architecture,0,,
https://arxiv.org/pdf/1706.03762,3 Model Architecture,Does the model use an encoder-decoder architecture,1,,
https://arxiv.org/pdf/1706.03762,3 Model Architecture,How many layers compose the encoder,,6,
https://arxiv.org/pdf/1706.03762,3 Model Architecture,How many layers compose the decoder,,6,
https://arxiv.org/pdf/1706.03762,3 Model Architecture,How many parallel attention heads are used,,8,
https://arxiv.org/pdf/1706.03762,3 Model Architecture,Does the model use learned embeddings for the input and output tokens,1,,
https://arxiv.org/pdf/1706.03762,3 Model Architecture,Does the model use learned positional embeddings,0,,
https://arxiv.org/pdf/1706.03762,5 Training,How many GPUs were used for training,,8,
https://arxiv.org/pdf/1706.03762,5 Training,Was the model trained on NVIDIA A100 GPUs,0,,
https://arxiv.org/pdf/1706.03762,5 Training,Was the model trained on NVIDIA P100 GPUs,1,,
https://arxiv.org/pdf/1706.03762,5 Training,Was the SGD optimizer used,0,,
https://arxiv.org/pdf/1706.03762,5 Training,Was the AdamW optimizer used,0,,
https://arxiv.org/pdf/1706.03762,5 Training,Was the Adam optimizer used,1,,
https://arxiv.org/pdf/1706.03762,5 Training,Was a fixed learning rate used,0,,
https://arxiv.org/pdf/1706.03762,5 Training,Was a varied learning rate used,1,,
https://arxiv.org/pdf/1706.03762,5 Training,How many warmup steps were used,,4000,
https://arxiv.org/pdf/1706.03762,5 Training,Was the label dropout regularization used during training,1,,
https://arxiv.org/pdf/1706.03762,5 Training,What was the dropout rate used for the base model,,0.1,
https://arxiv.org/pdf/1706.03762,5 Training,Was the label smoothing regularization used during training,1,,
https://arxiv.org/pdf/2210.05189,2.1 Fully Connected Networks,How many layers are in the toy model (y = x^2)?,,3,
https://arxiv.org/pdf/2210.05190,2.1 Fully Connected Networks,Does the model use Sigmoid activation function?,0,,
https://arxiv.org/pdf/2210.05191,3 Experimental Results,How many parameters are in the y = x^2 toy model tree?,,14,
https://arxiv.org/pdf/2210.05192,2.4 Recurrent Networks,Can recurrent networks also be converted to decision trees?,1,,
https://arxiv.org/pdf/2210.05193,3 Experimental Results,How many layers are in the half-moon neural network?,,3,
https://arxiv.org/pdf/2210.05194,3 Experimental Results,"What is the main computational advantage of decision trees? A: Less storage memory, B: Fewer operations, C: Lower accuracy",,,B
https://arxiv.org/pdf/2106.09685v2.pdf,4 Our Method,Does LoRA work with any neural network containing dense layers?,1,,
https://arxiv.org/pdf/2106.09685v2.pdf,5.5 Scaling Up to GPT-3,"How much memory is saved when training GPT-3 175B with LoRA compared to full fine-tuning? A: 850GB, B: 100GB, C: 5GB",,,A
https://arxiv.org/pdf/2106.09685v2.pdf,Abstract,"By how much can LoRA reduce GPU memory requirements during training? A: 10x, B: 5x, C: 3x",,,C
https://arxiv.org/pdf/2106.09685v2.pdf,1. Introduction,"In billions, how many trainable parameters does GPT-3 have?",,175,
https://arxiv.org/pdf/2106.09685v2.pdf,1. Introduction,Does LoRA introduce additional inference latency compared to full fine-tuning?,0,,
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401689,Prohibited AI Practices (Article 5),"Which type of AI systems are banned by the AI Act? (A) High-risk systems, (B) Manipulative systems, (C) Real-time biometric systems in public spaces",,,C
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401690,Requirements for High-Risk AI Systems (Article 10),"what is a requirement for datasets used in high-risk AI systems? (A) Exclusively open-source datasets, (B) Datasets ensuring quality and diversity, (C) Datasets not exceeding 1 GB in size",,,B
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401691,Classification rules (article 51),"What is the threshold, measured in floating point operations, that leads to a presumption that a general-purpose AI model has systemic risk?
A) 10^15, B) 10^20, C) 10^25",,,C
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401692,"TRANSPARENCY OBLIGATIONS FOR PROVIDERS AND DEPLOYERS OF CERTAIN AI SYSTEMS
(Article 50)","What should providers of AI systems that generate synthetic content ensure?
A) That the content is not marked in any way. B) That the outputs are marked in a machine-readable format and detectable as artificially generated or manipulated. C) That there is no way to detect that the content is synthetic.",,,B
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401693,Sharing of information on serious incidents (article 73),How long does a market surveillance authority have to take appropriate measures after receiving notification of a serious incident? A) 3 days B) 7 days C) 14 days,,,B
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401694,Testing of high-risk AI systems in real world conditions outside AI regulatory sandboxes (Article 60),"What is the maximum duration of testing in real-world conditions? a) 3 months b) 6 months, with a possible extension of an additional 6 months. c) 12 months",,,B
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401695,Penalties (Article 99),"What is the maximum fine for supplying incorrect, incomplete, or misleading information to notified bodies or national competent authorities? A) 7,500,000 EUR or 1% of annual turnover, whichever is higher. B) 5,000,000 EUR or 0.5 % of annual turnover, whichever is higher C) 10,000,000 EUR or 5% of annual turnover, whichever is higher",,,A
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401696,Code of practice (article 56),By what date should codes of practice be ready? a) 2 May 2025 b) 2 May 2024 c) 2 August 2025,,,A
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401697,"Compliant AI systems which present a risk (article 82)
",What is the time period for a market surveillance authority to inform the Commission of a finding related to a non-compliant AI system? a) 1 month b) 2 months c) Immediately,,,C
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401698,EU declaration of conformity (article 47),"How long after a high-risk AI system has been placed on the market or put into service must the authorized representative keep the technical documentation, EU declaration of conformity and certificates available for competent authorities? a) 5 years b) 10 years c) 15 years",,,B
https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401699,Establishment and structure of the European Artificial Intelligence Board (article 65),"How long is the term of office for a Member State representative on the European Artificial Intelligence Board? a) 2 years, renewable once b) 3 years, renewable once c) 4 years, renewable once",,,B
https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Introduction,"According to the guide, what is the typical license used to grant reuse rights with libre open access? a) GNU General Public License b) Creative Commons license c) MIT license",,,B
https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 5 Where do you want to make your work available?,"how many peer-reviewed open access journals are indexed by the Directory of Open Access Journals (DOAJ)? a) Over 10,000 b) Over 20,000 c) Exactly 30,000",,,A
https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 2 Benefits of Open Access,what is the term of office for members of the advisory board of the Authors Alliance? a) The source does not specify a term of office for the advisory board. b) 2 years c) 4 years,,,A
https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Introduction,Does open access eliminate price barriers?,1,,
https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 1 What is this guide and who is it for,Are publication fees required for all open access journals?,0,,
https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 3 Open Access Policies,In what year did the Bill and Melinda Gates Foundation implement an open access policy?,,2015,
https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 5 Where do you want to make your work available?,Are Gold Open Access and Green Open Access mutually exclusive?,0,,
https://arxiv.org/pdf/2201.11903,3 Arithmetic Reasoning,Is arithmetic reasoning a task that language models often find very easy?,0,,
https://arxiv.org/pdf/2201.11904,3.1 Experimental Setup,How many large language models were evaluated?,,5,
https://arxiv.org/pdf/2201.11905,3.1 Experimental Setup,How many benchmarks were used to evaluate arithmetic reasoning?,,5,
https://arxiv.org/pdf/2201.11906,5 Symbolic Reasoning,Is symbolic reasoning usually simple for humans but challenging for language models?,1,,
https://arxiv.org/pdf/2201.11907,3.4 Robustness of Chain of Thought,How many annotators provided independent chains of thought?,,3,
https://arxiv.org/pdf/2201.11908,3.2 Results,How many random samples were examined to understand model errors?,,50,
https://arxiv.org/pdf/2201.11909,5 Symbolic Reasoning,"Which symbolic reasoning task is used as an out-of-domain evaluation? A: Coin Flip, B: Tower of Hanoi, C: Chess puzzles",,,A
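Several rows of structured_qa.csv contain quoted fields with embedded newlines (e.g. the Article 50 and Article 82 rows above). Standard CSV parsers handle quoted newlines, so `pd.read_csv` in run_benchmark.py reads these as single records; a small stdlib demonstration with a sample row in the same shape (the sample data here is invented for illustration, not taken from the file):

```python
import csv
import io

# Two physical header/record lines; the record's question field contains
# a quoted embedded newline, as several real rows in the CSV do.
sample = (
    "document,section,question,bool_answer,num_answer,multi_choice_answer\n"
    'doc1,Intro,"Line one\nLine two",1,,\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows))  # 1  -- the quoted newline does not split the record
print(rows[0]["question"] == "Line one\nLine two")  # True
```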