Merged
Changes from 1 commit
121 commits
d57c367
enh(preprocessing): Add split_markdown_by_headings.
daavoo Jan 22, 2025
fe93f74
Add benchmark
daavoo Jan 20, 2025
92c70a7
Move to structured_qa. Add entrypoint
daavoo Jan 20, 2025
70ef785
Move back outside
daavoo Jan 20, 2025
16ff8bd
Fix main
daavoo Jan 20, 2025
539898e
Update questions
daavoo Jan 20, 2025
ed71947
Update model and prompt
daavoo Jan 20, 2025
fd4fb95
Update
daavoo Jan 20, 2025
5add514
Update
daavoo Jan 20, 2025
9f8c755
fix
daavoo Jan 20, 2025
bec2ef1
Add system_instruction
daavoo Jan 20, 2025
08cad02
Update ratio
daavoo Jan 20, 2025
b7ce84e
Add more wait
daavoo Jan 20, 2025
6fc48fe
Fix return
daavoo Jan 20, 2025
8929e9e
Fix URLs
daavoo Jan 20, 2025
4a9e75e
Update download name
daavoo Jan 20, 2025
41ffc23
Update
daavoo Jan 20, 2025
4390852
Update
daavoo Jan 20, 2025
68621eb
Update with upper
daavoo Jan 20, 2025
422e5d5
Cast to str
daavoo Jan 20, 2025
3040978
Extend
daavoo Jan 20, 2025
bc0d8ce
Add benchmark
daavoo Jan 20, 2025
03e0e60
Fix
daavoo Jan 20, 2025
c19738e
fix
daavoo Jan 20, 2025
3cd7b24
Drop export
daavoo Jan 21, 2025
22df32b
Updates
daavoo Jan 21, 2025
b35dc23
Update default model
daavoo Jan 21, 2025
6cf13d7
Update
daavoo Jan 21, 2025
ad1ef9b
Use info
daavoo Jan 21, 2025
f237b89
Update with None
daavoo Jan 21, 2025
a34f4e2
Add answer type
daavoo Jan 21, 2025
291e376
Refactor
daavoo Jan 21, 2025
d7e99e7
Add fallback for out of context
daavoo Jan 21, 2025
0f381bb
Update with debugging info
daavoo Jan 21, 2025
a0391a4
Update
daavoo Jan 21, 2025
c3182cb
Update with mit-1
daavoo Jan 22, 2025
20b1651
test unsloth
daavoo Jan 22, 2025
0dd98da
Add , skip_special_tokens = True
daavoo Jan 22, 2025
6ac29aa
Update
daavoo Jan 22, 2025
95b3d57
Updates
daavoo Jan 22, 2025
d946f81
Add full_context
daavoo Jan 22, 2025
4ea1f7d
Update full context
daavoo Jan 22, 2025
a4888f2
update
daavoo Jan 22, 2025
e0f3a82
Add load and clean
daavoo Jan 22, 2025
906c8d9
Update
daavoo Jan 22, 2025
bb2afe5
Update
daavoo Jan 22, 2025
51c31f7
print
daavoo Jan 22, 2025
c5e0ac4
Update
daavoo Jan 22, 2025
cc10a9d
Add load_gemini_model
daavoo Jan 22, 2025
1560c71
Add sleep
daavoo Jan 22, 2025
94e7580
Update get_response
daavoo Jan 22, 2025
e7b5d5b
Update
daavoo Jan 22, 2025
5f6443b
Log error
daavoo Jan 22, 2025
819c6b2
fix
daavoo Jan 22, 2025
5625c39
Make the more info check more flexible
daavoo Jan 23, 2025
d125b79
Add gemini_full_context notebook
daavoo Jan 23, 2025
88a9357
typo
daavoo Jan 23, 2025
d929a80
Check for API KEY
daavoo Jan 23, 2025
9e718b3
Update with outputs
daavoo Jan 23, 2025
9027567
Add ragatouille
daavoo Jan 23, 2025
d2a3d98
Fix
daavoo Jan 23, 2025
17942ca
Update notebooks
daavoo Jan 24, 2025
fcdd953
Update gemini notebooks
daavoo Jan 24, 2025
bfdacea
Extend structured_qa. Add perfect_context.
daavoo Jan 27, 2025
a7d8dc5
Add gemini_perfect_context
daavoo Jan 27, 2025
308ab91
Update
daavoo Jan 27, 2025
704050b
fix line
daavoo Jan 27, 2025
67b8f80
fix line
daavoo Jan 27, 2025
a6bfe34
Update perfect_context
daavoo Jan 28, 2025
39a17ae
Add missing perfect context
daavoo Jan 28, 2025
ae325d3
Updates
daavoo Jan 28, 2025
56d8620
Update gemini_ragatouille
daavoo Jan 28, 2025
eb00902
Update gemini_fra
daavoo Jan 28, 2025
1d06d2c
Update
daavoo Jan 28, 2025
8ac9201
Update
daavoo Jan 28, 2025
0352173
Drop some log
daavoo Jan 28, 2025
0b8e5cf
Update
daavoo Jan 28, 2025
e2c5457
Update gemini_perfect_context with results
daavoo Jan 29, 2025
36350ee
Use rapidfuzz
daavoo Jan 29, 2025
215226e
Use question_part
daavoo Jan 29, 2025
5d4d961
Fix
daavoo Jan 29, 2025
1223b03
break when no section_names
daavoo Jan 29, 2025
08c0b85
Update prompt
daavoo Jan 29, 2025
7b9c96c
Add qwen perfect context
daavoo Jan 29, 2025
c056bdc
Update gemini_find_retrieve_answer
daavoo Jan 30, 2025
b726447
Update qwen perfect context
daavoo Jan 30, 2025
036f8a3
Add qwen RAGatouille
daavoo Jan 30, 2025
6b0a0c1
Update qwen notebooks
daavoo Jan 30, 2025
c60fe3e
Update
daavoo Jan 30, 2025
d12fa72
Update prompt
daavoo Jan 30, 2025
38d2530
Update qwen notebooks
daavoo Jan 30, 2025
1360437
Cleanup
daavoo Jan 30, 2025
6906991
Cleanup
daavoo Jan 30, 2025
8abcfb1
Add DeepSeek-R1-Distill-Qwen-7B
daavoo Jan 31, 2025
034fe29
Debug current calls. Set to 9 before reset
daavoo Feb 1, 2025
a2d301f
Add qwen find retrieve answer
daavoo Feb 1, 2025
8300573
Extend benchmark
daavoo Feb 3, 2025
4f8f82a
Update
daavoo Feb 3, 2025
2de0bfb
Add max_sections_to_check
daavoo Feb 3, 2025
8f7d173
Default to None
daavoo Feb 3, 2025
7ff95ff
Default to half of sections
daavoo Feb 3, 2025
d05d992
Update
daavoo Feb 3, 2025
db63dc9
fix
daavoo Feb 3, 2025
20f9e3f
Fix
daavoo Feb 3, 2025
c5ee8e6
Add qwen full context
daavoo Feb 3, 2025
a4da649
Update qwen_full_context
daavoo Feb 3, 2025
4ea56e2
Update gemini_full_context
daavoo Feb 3, 2025
82f37f3
Add statistics
daavoo Feb 3, 2025
a02ffd7
Update prompt
daavoo Feb 4, 2025
8af98df
Update with type
daavoo Feb 4, 2025
97049d6
Update gemini prompt and count
daavoo Feb 4, 2025
6555304
Update results with same prompts
daavoo Feb 4, 2025
0ab4688
Update with same prompt
daavoo Feb 4, 2025
5276d16
Update results
daavoo Feb 4, 2025
476bbe1
Bring back llama-cpp-python
daavoo Feb 5, 2025
fdafdc3
Update prompts
daavoo Feb 5, 2025
2ac1f61
Reduce notebook size
daavoo Feb 5, 2025
c99adb0
Update pre-commit
daavoo Feb 5, 2025
a114fe5
Update docstrings
daavoo Feb 5, 2025
df394cc
Merge branch 'main' into 5-add-benchmark
daavoo Feb 5, 2025
eec44b0
Update test
daavoo Feb 5, 2025
Extend benchmark
daavoo committed Feb 3, 2025
commit 8300573a6fee37f061120a588eb9b76ccb02cf4b
100 changes: 100 additions & 0 deletions benchmark/perfect_context/2.1 Pre-training Data
@@ -0,0 +1,100 @@
2 Approach
Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

2.1 Pre-training Data
Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available and compatible with open sourcing. This leads to the following mixture of data and the percentage they represent in the training set:

English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages, and filters low quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia vs. randomly sampled pages, and discarded pages not classified as references.

C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as the presence of punctuation marks or the number of words and sentences in a webpage.

Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicate the resulting dataset at the file level, with exact matches.

Wikipedia [4.5%]. We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

ArXiv [2.5%]. We process arXiv LaTeX files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).

Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.

Overall, our entire training dataset contains roughly 1.4T tokens after tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.

Table 1: Pre-training data. Data mixtures used for pre-training; for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.

Dataset         Sampling prop.  Epochs  Disk size
CommonCrawl     67.0%           1.10    3.3 TB
C4              15.0%           1.06    783 GB
Github           4.5%           0.64    328 GB
Wikipedia        4.5%           2.45     83 GB
Books            4.5%           2.23     85 GB
ArXiv            2.5%           1.06     92 GB
StackExchange    2.0%           1.03     78 GB
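The Tokenizer paragraph above maps directly onto SentencePiece training options. Below is a minimal, hypothetical sketch (not the paper's actual training script): the corpus path, output prefix, and vocabulary size are placeholder assumptions, while digit splitting and byte fallback correspond to the behaviour described above.

```python
# Hypothetical sketch of the tokenizer configuration described above.
# "corpus.txt", the output prefix, and the 32k vocabulary are assumptions,
# not values taken from the paper's training setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path to raw training text
    model_prefix="bpe_tok",    # hypothetical output name
    model_type="bpe",          # byte-pair encoding (Sennrich et al., 2015)
    vocab_size=32000,          # assumed vocabulary size
    split_digits=True,         # split all numbers into individual digits
    byte_fallback=True,        # decompose unknown UTF-8 characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="bpe_tok.model")
print(sp.encode("Trained on roughly 1.4T tokens", out_type=str))
```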
9 changes: 9 additions & 0 deletions benchmark/perfect_context/2.3 Optimizer
@@ -0,0 +1,9 @@
Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95. We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. We use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).
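For readers who want to see these hyper-parameters wired together, here is a minimal PyTorch sketch, not the paper's training code: the tiny stand-in model, `peak_lr`, and `total_steps` are assumptions, since the paper varies learning rate and batch size by model size.

```python
# Minimal PyTorch sketch of the optimizer setup described above (assumptions noted inline).
import math
import torch

model = torch.nn.Linear(16, 16)                            # stand-in for the transformer
peak_lr, total_steps, warmup_steps = 3e-4, 10_000, 2_000   # peak_lr/total_steps are assumed

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                                # linear warmup over 2,000 steps
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * progress)) # cosine decay to 10% of peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean()             # dummy forward pass
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping of 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```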
36 changes: 36 additions & 0 deletions benchmark/perfect_context/3 Main results.txt
@@ -0,0 +1,36 @@
Following previous work (Brown et al., 2020), we consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks:

• Zero-shot. We provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.
• Few-shot. We provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options.

We compare LLaMA with other foundation models, namely the non-publicly available language models GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022) and PaLM (Chowdhery et al., 2022), as well as the open-sourced OPT models (Zhang et al., 2022), GPT-J (Wang and Komatsuzaki, 2021), and GPT-Neo (Black et al., 2022). In Section 4, we also briefly compare LLaMA with instruction-tuned models such as OPT-IML (Iyer et al., 2022) and Flan-PaLM (Chung et al., 2022).

We evaluate LLaMA on free-form generation tasks and multiple choice tasks. In the multiple choice tasks, the objective is to select the most appropriate completion among a set of given options, based on a provided context. We select the completion with the highest likelihood given the provided context. We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020) and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion | context) / P(completion | “Answer:”).
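The two scoring rules in the last paragraph can be written down in a few lines. The sketch below is illustrative only; `log_likelihood(context, completion)` is a hypothetical helper (to be implemented with whatever causal LM is being evaluated) that returns the summed log-probability of the completion tokens given the context.

```python
# Hedged sketch of the two completion-scoring rules described above.
# `log_likelihood` is a hypothetical callable, not an API from the paper.
from typing import Callable, Sequence

def pick_completion(
    context: str,
    options: Sequence[str],
    log_likelihood: Callable[[str, str], float],
    normalize_by_answer_prior: bool = False,
) -> str:
    scores = []
    for option in options:
        if normalize_by_answer_prior:
            # OpenBookQA/BoolQ-style: P(completion|context) / P(completion|"Answer:"),
            # i.e. a difference of log-likelihoods.
            score = log_likelihood(context, option) - log_likelihood("Answer:", option)
        else:
            # Default rule: likelihood normalized by completion length in characters.
            score = log_likelihood(context, option) / max(1, len(option))
        scores.append(score)
    # Return the option with the highest score.
    return options[max(range(len(options)), key=scores.__getitem__)]
```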
16 changes: 16 additions & 0 deletions benchmark/perfect_context/5 Bias, Toxicity and Misinformation.txt
@@ -0,0 +1,16 @@
Large language models have been shown to reproduce and amplify biases that exist in the training data (Sheng et al., 2019; Kurita et al., 2019), and to generate toxic or offensive content (Gehman et al., 2020). As our training dataset contains a large proportion of data from the Web, we believe that it is crucial to determine the potential for our models to generate such content. To understand the potential harm of LLaMA-65B, we evaluate it on different benchmarks that measure toxic content production and stereotype detection. While we have selected some of the standard benchmarks that are used by the language model community to indicate some of the issues with these models, these evaluations are not sufficient to fully understand the risks associated with these models.
32 changes: 32 additions & 0 deletions benchmark/perfect_context/5.2 CrowS-Pairs.txt
@@ -0,0 +1,32 @@
Category              LLaMA  GPT3   OPT
Gender                70.6   62.6   65.7
Religion              79.0   73.3   68.6
Race/Color            57.0   64.7   68.6
Sexual orientation    81.0   76.2   78.6
Age                   70.1   64.4   67.8
Nationality           64.2   61.6   62.9
Disability            66.7   76.7   76.7
Physical appearance   77.8   74.6   76.2
Socioeconomic status  71.5   73.8   76.2
Average               66.6   67.2   69.5

Table 12: CrowS-Pairs. We compare the level of biases contained in LLaMA-65B with OPT-175B and GPT3-175B. Higher score indicates higher bias.

5.2 CrowS-Pairs
We evaluate the biases in our model on the CrowS-Pairs dataset (Nangia et al., 2020). This dataset makes it possible to measure biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance and socioeconomic status. Each example is composed of a stereotype and an anti-stereotype; we measure the model's preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting. Higher scores thus indicate higher bias. We compare with GPT-3 and OPT-175B in Table 12.

LLaMA compares slightly favorably to both models on average. Our model is particularly biased in the religion category (+10% compared to OPT-175B), followed by age and gender. We expect these biases to come from CommonCrawl despite multiple filtering steps.
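As a rough illustration of the zero-shot preference measurement described above (not the paper's evaluation code; GPT-2 is used here only as a small stand-in model, whereas the paper evaluates LLaMA-65B):

```python
# Hedged sketch: lower perplexity on the stereotypical sentence counts as a
# "biased" preference; the benchmark score is the fraction of pairs where this
# happens. GPT-2 is a stand-in model, not the one evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))

def prefers_stereotype(stereotype: str, anti_stereotype: str) -> bool:
    return perplexity(stereotype) < perplexity(anti_stereotype)

print(prefers_stereotype("Example stereotypical sentence.", "Example anti-stereotypical sentence."))
```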
54 changes: 54 additions & 0 deletions benchmark/perfect_context/Accountability and responsibility.txt
@@ -0,0 +1,54 @@
Accountability and responsibility

Ensuring accountability for generative AI means that individuals and organisations can be held accountable for the AI systems they develop, deploy, or use, and that human oversight is maintained. To establish accountable practices across the AI lifecycle, you should consider three key elements.

• Answerability: you should establish a chain of human responsibility across the generative AI project lifecycle, including responsibility throughout the supply chain. In cases of harm or errors caused by generative AI, recourse and feedback mechanisms need to be established for affected individuals. Identifying the specific actors involved in generative AI systems is vital to answerability. This includes model developers, application developers, policymakers, regulators, system operators and end-users. The roles and responsibilities of each must be clearly defined and aligned with legal and ethical standards.
• Auditability: you should demonstrate the responsibility and trustworthiness of the development and deployment practices by upholding robust reporting and documentation protocols, and retaining traceability throughout the AI lifecycle. This refers to the process by which all stages of the generative AI innovation lifecycle, from data collection and base model training to implementation, fine-tuning, system deployment, updating, and retirement, are documented in a way that is accessible to relevant stakeholders and easily understood.
• Liability: you should make sure that all parties involved in the generative AI project lifecycle, from vendors and technical teams to system users, are acting lawfully and understand their respective legal obligations.

As an end-user, being accountable means taking responsibility for a system’s outputs and generated content and its potential consequences. This includes checking that these are factual, truthful, non-discriminatory, non-harmful, and do not violate existing legal provisions, guidelines, policies or the providers’ terms of use. It entails putting the necessary oversight and human-in-the-loop processes in place to validate output in situations with high impact or risk. Where these risks are too high, you must consider if generative AI should be used.

Ultimately, responsibility for any output or decision made or supported by an AI system always rests with the public organisation. Where generative AI is bought commercially, ensure that vendors understand their responsibilities and liabilities, put the required risk mitigations in place and share all relevant information. Refer to the Buying generative AI section for further guidance.

Practical recommendations

• Follow existing legal provisions, guidelines and policies as well as the provider’s terms of use when developing, deploying or using generative AI.
• As an end-user, assume responsibility for output produced by generative AI tools when used to support everyday tasks, such as drafting emails and reports.
• Clearly define responsibilities, accountability, and liability across all actors involved in the AI lifecycle. Where the generative AI is bought commercially, define detailed responsibilities and liability contractually.
• Nominate a Senior Responsible Owner who will be accountable for the use of generative AI in a specific project.
• Where generative AI is used in situations of high impact or risk, establish a human-in-the-loop to oversee and validate outputs.
• Adopt a risk-based approach to the use of AI-generated content and put strategies in place to minimise the risk of inaccurate or harmful outputs. Where the potential risks and harmful impacts are too high, consider whether human-in-the-loop approaches offer sufficient mitigation or if generative AI should be used.
• Provide routes for appeal and actionable redress, and put feedback channels into place.
• Use assurance techniques to evaluate the performance of generative AI systems. The CDEI AI assurance guide provides a useful starting point, and the CDEI portfolio of AI assurance techniques offers real-world examples.

This file was deleted.

39 changes: 0 additions & 39 deletions benchmark/perfect_context/Codes of practice.txt

This file was deleted.

This file was deleted.
