Merged
Changes from 1 commit
Commits
121 commits
d57c367
enh(preprocessing): Add split_markdown_by_headings.
daavoo Jan 22, 2025
fe93f74
Add benchmark
daavoo Jan 20, 2025
92c70a7
Move to structured_qa. Add entrypoint
daavoo Jan 20, 2025
70ef785
Move back outside
daavoo Jan 20, 2025
16ff8bd
Fix main
daavoo Jan 20, 2025
539898e
Update questions
daavoo Jan 20, 2025
ed71947
Update model and prompt
daavoo Jan 20, 2025
fd4fb95
Update
daavoo Jan 20, 2025
5add514
Update
daavoo Jan 20, 2025
9f8c755
fix
daavoo Jan 20, 2025
bec2ef1
Add system_instruction
daavoo Jan 20, 2025
08cad02
Update ratio
daavoo Jan 20, 2025
b7ce84e
Add more wait
daavoo Jan 20, 2025
6fc48fe
Fix return
daavoo Jan 20, 2025
8929e9e
Fix URLs
daavoo Jan 20, 2025
4a9e75e
Update download name
daavoo Jan 20, 2025
41ffc23
Update
daavoo Jan 20, 2025
4390852
Update
daavoo Jan 20, 2025
68621eb
Update with upper
daavoo Jan 20, 2025
422e5d5
Cast to str
daavoo Jan 20, 2025
3040978
Extend
daavoo Jan 20, 2025
bc0d8ce
Add benchmark
daavoo Jan 20, 2025
03e0e60
Fix
daavoo Jan 20, 2025
c19738e
fix
daavoo Jan 20, 2025
3cd7b24
Drop export
daavoo Jan 21, 2025
22df32b
Updates
daavoo Jan 21, 2025
b35dc23
Update default model
daavoo Jan 21, 2025
6cf13d7
Update
daavoo Jan 21, 2025
ad1ef9b
Use info
daavoo Jan 21, 2025
f237b89
Update with None
daavoo Jan 21, 2025
a34f4e2
Add answer type
daavoo Jan 21, 2025
291e376
Refactor
daavoo Jan 21, 2025
d7e99e7
Add fallback for out of context
daavoo Jan 21, 2025
0f381bb
Update with debugging info
daavoo Jan 21, 2025
a0391a4
Update
daavoo Jan 21, 2025
c3182cb
Update with mit-1
daavoo Jan 22, 2025
20b1651
test unsloth
daavoo Jan 22, 2025
0dd98da
Add , skip_special_tokens = True
daavoo Jan 22, 2025
6ac29aa
Update
daavoo Jan 22, 2025
95b3d57
Updates
daavoo Jan 22, 2025
d946f81
Add full_context
daavoo Jan 22, 2025
4ea1f7d
Update full context
daavoo Jan 22, 2025
a4888f2
update
daavoo Jan 22, 2025
e0f3a82
Add load and clean
daavoo Jan 22, 2025
906c8d9
Update
daavoo Jan 22, 2025
bb2afe5
Update
daavoo Jan 22, 2025
51c31f7
print
daavoo Jan 22, 2025
c5e0ac4
Update
daavoo Jan 22, 2025
cc10a9d
Add load_gemini_model
daavoo Jan 22, 2025
1560c71
Add sleep
daavoo Jan 22, 2025
94e7580
Update get_response
daavoo Jan 22, 2025
e7b5d5b
Update
daavoo Jan 22, 2025
5f6443b
Log error
daavoo Jan 22, 2025
819c6b2
fix
daavoo Jan 22, 2025
5625c39
Make the more info check more flexible
daavoo Jan 23, 2025
d125b79
Add gemini_full_context notebook
daavoo Jan 23, 2025
88a9357
typo
daavoo Jan 23, 2025
d929a80
Check por API KEY
daavoo Jan 23, 2025
9e718b3
Update with outputs
daavoo Jan 23, 2025
9027567
Add ragatouille
daavoo Jan 23, 2025
d2a3d98
Fix
daavoo Jan 23, 2025
17942ca
Update notebooks
daavoo Jan 24, 2025
fcdd953
Update gemini notebooks
daavoo Jan 24, 2025
bfdacea
Extend structured_qa. Add perfect_context.
daavoo Jan 27, 2025
a7d8dc5
Add gemini_perfect_context
daavoo Jan 27, 2025
308ab91
Update
daavoo Jan 27, 2025
704050b
fix line
daavoo Jan 27, 2025
67b8f80
fix line
daavoo Jan 27, 2025
a6bfe34
Update perfect_context
daavoo Jan 28, 2025
39a17ae
Add missing perfect context
daavoo Jan 28, 2025
ae325d3
Updates
daavoo Jan 28, 2025
56d8620
Update gemini_ragatouille
daavoo Jan 28, 2025
eb00902
Update gemini_fra
daavoo Jan 28, 2025
1d06d2c
Update
daavoo Jan 28, 2025
8ac9201
Update
daavoo Jan 28, 2025
0352173
Drop some log
daavoo Jan 28, 2025
0b8e5cf
Update
daavoo Jan 28, 2025
e2c5457
Update gemini_perfect_context with results
daavoo Jan 29, 2025
36350ee
Use rapizfuzz
daavoo Jan 29, 2025
215226e
Use question_part
daavoo Jan 29, 2025
5d4d961
Fix
daavoo Jan 29, 2025
1223b03
break when no section_names
daavoo Jan 29, 2025
08c0b85
Update prompt
daavoo Jan 29, 2025
7b9c96c
Add qwen perfect context
daavoo Jan 29, 2025
c056bdc
Update gemini_find_retrieve_answer
daavoo Jan 30, 2025
b726447
Update qwen perfect context
daavoo Jan 30, 2025
036f8a3
Add qwen RAGatouille
daavoo Jan 30, 2025
6b0a0c1
Update qwen notebooks
daavoo Jan 30, 2025
c60fe3e
Update
daavoo Jan 30, 2025
d12fa72
Update prompt
daavoo Jan 30, 2025
38d2530
Update qwen notebooks
daavoo Jan 30, 2025
1360437
Cleanup
daavoo Jan 30, 2025
6906991
Cleanup
daavoo Jan 30, 2025
8abcfb1
Add DeepSeek-R1-Distill-Qwen-7B
daavoo Jan 31, 2025
034fe29
Debug current calls. Set to 9 before reset
daavoo Feb 1, 2025
a2d301f
Add qwen find retrieve answer
daavoo Feb 1, 2025
8300573
Extend benchmark
daavoo Feb 3, 2025
4f8f82a
Update
daavoo Feb 3, 2025
2de0bfb
Add max_sections_to_check
daavoo Feb 3, 2025
8f7d173
Default to None
daavoo Feb 3, 2025
7ff95ff
Default to half of sections
daavoo Feb 3, 2025
d05d992
Update
daavoo Feb 3, 2025
db63dc9
fix
daavoo Feb 3, 2025
20f9e3f
Fix
daavoo Feb 3, 2025
c5ee8e6
Add qwen full context
daavoo Feb 3, 2025
a4da649
Update qwen_full_context
daavoo Feb 3, 2025
4ea56e2
Update gemini_full_context
daavoo Feb 3, 2025
82f37f3
Add statistics
daavoo Feb 3, 2025
a02ffd7
Update prompt
daavoo Feb 4, 2025
8af98df
Update with type
daavoo Feb 4, 2025
97049d6
Update gemini prompt and count
daavoo Feb 4, 2025
6555304
Update results with same prompts
daavoo Feb 4, 2025
0ab4688
Update with same prompt
daavoo Feb 4, 2025
5276d16
Update results
daavoo Feb 4, 2025
476bbe1
Bring back llama-cpp-python
daavoo Feb 5, 2025
fdafdc3
Update prompts
daavoo Feb 5, 2025
2ac1f61
Reduce notebook size
daavoo Feb 5, 2025
c99adb0
Update pre-commit
daavoo Feb 5, 2025
a114fe5
Update docstrings
daavoo Feb 5, 2025
df394cc
Merge branch 'main' into 5-add-benchmark
daavoo Feb 5, 2025
eec44b0
Update test
daavoo Feb 5, 2025
Update with upper
daavoo committed Jan 22, 2025
commit 68621eb4aa71327a8bacaa73aefb31d7c946a781
2 changes: 1 addition & 1 deletion benchmark/run_benchmark.py
@@ -35,7 +35,7 @@ def run_benchmark(input_data: str, output_file: str, model: str):
     )

     for index in document_data.index:
-        data.loc[index, "pred_answer"] = answers[index]
+        data.loc[index, "pred_answer"] = answers[index].upper()
         data.loc[index, "pred_section"] = sections[index]

     data.to_csv(output_file)
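The one-line change above uppercases each predicted answer before it is written to the output CSV, so that predictions compare case-insensitively against the gold answers (which this same commit normalizes to YES/NO). A minimal sketch of the idea, assuming a pandas DataFrame with `answer` and `pred_answer` columns; the mini-frame and the `normalize_answer` helper are illustrative, not part of the repository:

```python
import pandas as pd

def normalize_answer(answer) -> str:
    # Cast to str first (CSV round-trips can turn Yes/No into other types),
    # then strip and uppercase so "Yes", "yes", and "YES" all compare equal.
    return str(answer).strip().upper()

# Hypothetical mini-frame mirroring the benchmark's answer columns.
data = pd.DataFrame(
    {
        "answer": ["YES", "NO", "B"],
        "pred_answer": ["yes", "No ", "b"],
    }
)
data["pred_answer"] = data["pred_answer"].map(normalize_answer)
accuracy = float((data["pred_answer"] == data["answer"]).mean())
```

Without the `.upper()` call, an exact string comparison would count "Yes" vs "YES" as a miss and understate accuracy.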
26 changes: 13 additions & 13 deletions benchmark/structured_qa.csv
@@ -3,24 +3,24 @@ https://arxiv.org/pdf/1706.03762,3 Model Architecture,"What type of architecture
 https://arxiv.org/pdf/1706.03762,3 Model Architecture,"How many layers compose the encoder?",6
 https://arxiv.org/pdf/1706.03762,3 Model Architecture,"How many layers compose the decoder?",6
 https://arxiv.org/pdf/1706.03762,3 Model Architecture,"How many parallel attention heads are used?",8
-https://arxiv.org/pdf/1706.03762,3 Model Architecture,"Does the final model use learned embeddings for the input and output tokens?",Yes
-https://arxiv.org/pdf/1706.03762,3 Model Architecture,"Does the final model use learned positional embeddings?",No
+https://arxiv.org/pdf/1706.03762,3 Model Architecture,"Does the final model use learned embeddings for the input and output tokens?",YES
+https://arxiv.org/pdf/1706.03762,3 Model Architecture,"Does the final model use learned positional embeddings?",NO
 https://arxiv.org/pdf/1706.03762,5 Training,"How many GPUs were used for training?",8
 https://arxiv.org/pdf/1706.03762,5 Training,"What type of GPUs were used for training? -A: NVIDIA A100 -B: NVIDIA P100 -C: NVIDIA T4",B
-https://arxiv.org/pdf/1706.03762,5 Training,"What optimizer was used for trainin? -A: AdamW -B: Adam -C: SGD",A
+https://arxiv.org/pdf/1706.03762,5 Training,"What optimizer was used for training? -A: AdamW -B: Adam -C: SGD",A
 https://arxiv.org/pdf/1706.03762,5 Training,"How many warmup steps were used?",4000
 https://arxiv.org/pdf/1706.03762,5 Training,"What was the dropout rate used for the base model?",0.1
 https://arxiv.org/pdf/2210.05189,2.1 Fully Connected Networks,"How many layers are in the toy model (y = x^2)?",3
-https://arxiv.org/pdf/2210.05189,2.1 Fully Connected Networks,"Does the model use Sigmoid activation function?",No
+https://arxiv.org/pdf/2210.05189,2.1 Fully Connected Networks,"Does the model use Sigmoid activation function?",NO
 https://arxiv.org/pdf/2210.05189,3 Experimental Results,"How many parameters are in the y = x^2 toy model tree?",14
-https://arxiv.org/pdf/2210.05189,2.4 Recurrent Networks,"Can recurrent networks also be converted to decision trees?",Yes
+https://arxiv.org/pdf/2210.05189,2.4 Recurrent Networks,"Can recurrent networks also be converted to decision trees?",YES
 https://arxiv.org/pdf/2210.05189,3 Experimental Results,"How many layers are in the half-moon neural network?",3
 https://arxiv.org/pdf/2210.05189,3 Experimental Results,"What is the main computational advantage of decision trees? -A: Less storage memory, -B: Fewer operations, -C: Lower accuracy",B
-https://arxiv.org/pdf/2106.09685v2.pdf,4 Our Method,Does LoRA work with any neural network containing dense layers?,Yes
-https://arxiv.org/pdf/2106.09685v2.pdf,5.5 Scaling Up to GPT-3,"How much memory is saved when training GPT-3 175B with LoRA compared to full fine-tuning? -A: 850GB, -B: 100GB, -C: 5GB",A
+https://arxiv.org/pdf/2106.09685v2.pdf,4 Our Method,Does LoRA work with any neural network containing dense layers?,YES
+https://arxiv.org/pdf/2106.09685v2.pdf,5.5 Scaling Up to GPT-3,"How much memory is saved (in GB) when training GPT-3 175B with LoRA compared to full fine-tuning?",850
 https://arxiv.org/pdf/2106.09685v2.pdf,Abstract,"By how much can LoRA reduce GPU memory requirements during training? -A: 10x, -B: 5x, -C: 3x",C
 https://arxiv.org/pdf/2106.09685v2.pdf,1. Introduction,"In billions, how many trainable parameters does GPT-3 have?",175
-https://arxiv.org/pdf/2106.09685v2.pdf,1. Introduction,Does LoRA introduce additional inference latency compared to full fine-tuning?,No
+https://arxiv.org/pdf/2106.09685v2.pdf,1. Introduction,Does LoRA introduce additional inference latency compared to full fine-tuning?,NO
 https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401689,Prohibited AI Practices (Article 5),"Which type of AI systems are banned by the AI Act? -A: High-risk systems, -B: Manipulative systems, -C: Real-time biometric systems in public spaces",C
 https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401689,Requirements for High-Risk AI Systems (Article 10),"what is a requirement for datasets used in high-risk AI systems? -A: Exclusively open-source datasets -B: Datasets ensuring quality and diversity -C: Datasets not exceeding 1 GB in size",B
 https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401689,Classification rules (article 51),"What is the threshold, measured in floating point operations, that leads to a presumption that a general-purpose AI model has systemic risk? -A: 10^1 -B: 10^20 -C: 10^25",C
@@ -35,14 +35,14 @@ https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401689,Establish
 https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Introduction,"According to the guide, what is the typical license used to grant reuse rights with libre open access? -A: GNU General Public License -B: Creative Commons license -C: MIT license",B
 https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 5 Where do you want to make your work available?,"how many peer-reviewed open access journals are indexed by the Directory of Open Access Journals (DOAJ)? -A: Over 10,000 -B: Over 20,000 -C: Exactly 30,000",A
 https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 2 Benefits of Open Access,what is the term of office for members of the advisory board of the Authors Alliance? -A: The source does not specify a term of office for the advisory board. -B: 2 years -C: 4 years,A
-https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Introduction,Does open access eliminate price barriers?,Yes
-https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 1 What is this guide and who is it for,Are publication fees required for all open access journals?,No
+https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Introduction,Does open access eliminate price barriers?,YES
+https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 1 What is this guide and who is it for,Are publication fees required for all open access journals?,NO
 https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 3 Open Access Policies,In what year did the Bill and Melinda Gates Foundation implement an open access policy?,2015
-https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 5 Where do you want to make your work available?,Are Gold Open Access and Green Open Access mutually exclusive?,No
-https://arxiv.org/pdf/2201.11903,3 Arithmetic Reasoning,Is arithmetic reasoning a task that language models often find very easy?,No
+https://authorsalliance.org/wp-content/uploads/Documents/Guides/Authors%20Alliance%20-%20Understanding%20Open%20Access.pdf,Chapter 5 Where do you want to make your work available?,Are Gold Open Access and Green Open Access mutually exclusive?,NO
+https://arxiv.org/pdf/2201.11903,3 Arithmetic Reasoning,Is arithmetic reasoning a task that language models often find very easy?,NO
 https://arxiv.org/pdf/2201.11903,3.1 Experimental Setup,How many large language models were evaluated?,5
 https://arxiv.org/pdf/2201.11903,3.1 Experimental Setup,How many benchmarks were used to evaluate arithmetic reasoning?,5
-https://arxiv.org/pdf/2201.11903,5 Symbolic Reasoning,Is symbolic reasoning usually simple for humans but challenging for language models?,Yes
+https://arxiv.org/pdf/2201.11903,5 Symbolic Reasoning,Is symbolic reasoning usually simple for humans but challenging for language models?,YES
 https://arxiv.org/pdf/2201.11903,3.4 Robustness of Chain of Thought,How many annotators provided independent chains of thought?,3
 https://arxiv.org/pdf/2201.11903,3.2 Results,How many random samples were examined to understand model errors?,50
 https://arxiv.org/pdf/2201.11903,5 Symbolic Reasoning,"Which symbolic reasoning task is used as an out-of-domain evaluation? -A: Coin Flip -B: Tower of Hanoi -C: Chess puzzles",A
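The CSV rows above follow a fixed four-column shape: document URL, section name, question (optionally with inline -A/-B/-C options), and gold answer. A hedged sketch of how such a header-less file could be loaded and scored against a `pred_answer` column; the column names, the `load_benchmark`/`score` helpers, and the header-less layout are assumptions inferred from the rows shown, not code from the repository:

```python
import io

import pandas as pd

COLUMNS = ["document", "section", "question", "answer"]

def load_benchmark(csv_text: str) -> pd.DataFrame:
    # The rows shown carry no header line, so names= supplies one.
    return pd.read_csv(io.StringIO(csv_text), names=COLUMNS)

def score(data: pd.DataFrame) -> float:
    # Compare case-insensitively, matching the upper-casing of answers
    # done elsewhere in this branch.
    preds = data["pred_answer"].astype(str).str.upper()
    gold = data["answer"].astype(str).str.upper()
    return float((preds == gold).mean())

# Two illustrative rows in the same shape as benchmark/structured_qa.csv.
rows = (
    'https://arxiv.org/pdf/1706.03762,5 Training,'
    '"How many GPUs were used for training?",8\n'
    'https://arxiv.org/pdf/2201.11903,5 Symbolic Reasoning,'
    '"Which symbolic reasoning task is used as an out-of-domain evaluation?",A\n'
)
benchmark = load_benchmark(rows)
benchmark["pred_answer"] = ["8", "a"]
```

Keeping multiple-choice options inline in the question field (as the dataset does) lets a single-letter gold answer like `A` be compared with the same uppercase-and-match rule as YES/NO answers.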