Merged
Changes from 1 commit
121 commits
d57c367
enh(preprocessing): Add split_markdown_by_headings.
daavoo Jan 22, 2025
fe93f74
Add benchmark
daavoo Jan 20, 2025
92c70a7
Move to structured_qa. Add entrypoint
daavoo Jan 20, 2025
70ef785
Move back outside
daavoo Jan 20, 2025
16ff8bd
Fix main
daavoo Jan 20, 2025
539898e
Update questions
daavoo Jan 20, 2025
ed71947
Update model and prompt
daavoo Jan 20, 2025
fd4fb95
Update
daavoo Jan 20, 2025
5add514
Update
daavoo Jan 20, 2025
9f8c755
fix
daavoo Jan 20, 2025
bec2ef1
Add system_instruction
daavoo Jan 20, 2025
08cad02
Update ratio
daavoo Jan 20, 2025
b7ce84e
Add more wait
daavoo Jan 20, 2025
6fc48fe
Fix return
daavoo Jan 20, 2025
8929e9e
Fix URLs
daavoo Jan 20, 2025
4a9e75e
Update download name
daavoo Jan 20, 2025
41ffc23
Update
daavoo Jan 20, 2025
4390852
Update
daavoo Jan 20, 2025
68621eb
Update with upper
daavoo Jan 20, 2025
422e5d5
Cast to str
daavoo Jan 20, 2025
3040978
Extend
daavoo Jan 20, 2025
bc0d8ce
Add benchmark
daavoo Jan 20, 2025
03e0e60
Fix
daavoo Jan 20, 2025
c19738e
fix
daavoo Jan 20, 2025
3cd7b24
Drop export
daavoo Jan 21, 2025
22df32b
Updates
daavoo Jan 21, 2025
b35dc23
Update default model
daavoo Jan 21, 2025
6cf13d7
Update
daavoo Jan 21, 2025
ad1ef9b
Use info
daavoo Jan 21, 2025
f237b89
Update with None
daavoo Jan 21, 2025
a34f4e2
Add answer type
daavoo Jan 21, 2025
291e376
Refactor
daavoo Jan 21, 2025
d7e99e7
Add fallback for out of context
daavoo Jan 21, 2025
0f381bb
Update with debugging info
daavoo Jan 21, 2025
a0391a4
Update
daavoo Jan 21, 2025
c3182cb
Update with mit-1
daavoo Jan 22, 2025
20b1651
test unsloth
daavoo Jan 22, 2025
0dd98da
Add , skip_special_tokens = True
daavoo Jan 22, 2025
6ac29aa
Update
daavoo Jan 22, 2025
95b3d57
Updates
daavoo Jan 22, 2025
d946f81
Add full_context
daavoo Jan 22, 2025
4ea1f7d
Update full context
daavoo Jan 22, 2025
a4888f2
update
daavoo Jan 22, 2025
e0f3a82
Add load and clean
daavoo Jan 22, 2025
906c8d9
Update
daavoo Jan 22, 2025
bb2afe5
Update
daavoo Jan 22, 2025
51c31f7
print
daavoo Jan 22, 2025
c5e0ac4
Update
daavoo Jan 22, 2025
cc10a9d
Add load_gemini_model
daavoo Jan 22, 2025
1560c71
Add sleep
daavoo Jan 22, 2025
94e7580
Update get_response
daavoo Jan 22, 2025
e7b5d5b
Update
daavoo Jan 22, 2025
5f6443b
Log error
daavoo Jan 22, 2025
819c6b2
fix
daavoo Jan 22, 2025
5625c39
Make the more info check more flexible
daavoo Jan 23, 2025
d125b79
Add gemini_full_context notebook
daavoo Jan 23, 2025
88a9357
typo
daavoo Jan 23, 2025
d929a80
Check for API KEY
daavoo Jan 23, 2025
9e718b3
Update with outputs
daavoo Jan 23, 2025
9027567
Add ragatouille
daavoo Jan 23, 2025
d2a3d98
Fix
daavoo Jan 23, 2025
17942ca
Update notebooks
daavoo Jan 24, 2025
fcdd953
Update gemini notebooks
daavoo Jan 24, 2025
bfdacea
Extend structured_qa. Add perfect_context.
daavoo Jan 27, 2025
a7d8dc5
Add gemini_perfect_context
daavoo Jan 27, 2025
308ab91
Update
daavoo Jan 27, 2025
704050b
fix line
daavoo Jan 27, 2025
67b8f80
fix line
daavoo Jan 27, 2025
a6bfe34
Update perfect_context
daavoo Jan 28, 2025
39a17ae
Add missing perfect context
daavoo Jan 28, 2025
ae325d3
Updates
daavoo Jan 28, 2025
56d8620
Update gemini_ragatouille
daavoo Jan 28, 2025
eb00902
Update gemini_fra
daavoo Jan 28, 2025
1d06d2c
Update
daavoo Jan 28, 2025
8ac9201
Update
daavoo Jan 28, 2025
0352173
Drop some log
daavoo Jan 28, 2025
0b8e5cf
Update
daavoo Jan 28, 2025
e2c5457
Update gemini_perfect_context with results
daavoo Jan 29, 2025
36350ee
Use rapidfuzz
daavoo Jan 29, 2025
215226e
Use question_part
daavoo Jan 29, 2025
5d4d961
Fix
daavoo Jan 29, 2025
1223b03
break when no section_names
daavoo Jan 29, 2025
08c0b85
Update prompt
daavoo Jan 29, 2025
7b9c96c
Add qwen perfect context
daavoo Jan 29, 2025
c056bdc
Update gemini_find_retrieve_answer
daavoo Jan 30, 2025
b726447
Update qwen perfect context
daavoo Jan 30, 2025
036f8a3
Add qwen RAGatouille
daavoo Jan 30, 2025
6b0a0c1
Update qwen notebooks
daavoo Jan 30, 2025
c60fe3e
Update
daavoo Jan 30, 2025
d12fa72
Update prompt
daavoo Jan 30, 2025
38d2530
Update qwen notebooks
daavoo Jan 30, 2025
1360437
Cleanup
daavoo Jan 30, 2025
6906991
Cleanup
daavoo Jan 30, 2025
8abcfb1
Add DeepSeek-R1-Distill-Qwen-7B
daavoo Jan 31, 2025
034fe29
Debug current calls. Set to 9 before reset
daavoo Feb 1, 2025
a2d301f
Add qwen find retrieve answer
daavoo Feb 1, 2025
8300573
Extend benchmark
daavoo Feb 3, 2025
4f8f82a
Update
daavoo Feb 3, 2025
2de0bfb
Add max_sections_to_check
daavoo Feb 3, 2025
8f7d173
Default to None
daavoo Feb 3, 2025
7ff95ff
Default to half of sections
daavoo Feb 3, 2025
d05d992
Update
daavoo Feb 3, 2025
db63dc9
fix
daavoo Feb 3, 2025
20f9e3f
Fix
daavoo Feb 3, 2025
c5ee8e6
Add qwen full context
daavoo Feb 3, 2025
a4da649
Update qwen_full_context
daavoo Feb 3, 2025
4ea56e2
Update gemini_full_context
daavoo Feb 3, 2025
82f37f3
Add statistics
daavoo Feb 3, 2025
a02ffd7
Update prompt
daavoo Feb 4, 2025
8af98df
Update with type
daavoo Feb 4, 2025
97049d6
Update gemini prompt and count
daavoo Feb 4, 2025
6555304
Update results with same prompts
daavoo Feb 4, 2025
0ab4688
Update with same prompt
daavoo Feb 4, 2025
5276d16
Update results
daavoo Feb 4, 2025
476bbe1
Bring back llama-cpp-python
daavoo Feb 5, 2025
fdafdc3
Update prompts
daavoo Feb 5, 2025
2ac1f61
Reduce notebook size
daavoo Feb 5, 2025
c99adb0
Update pre-commit
daavoo Feb 5, 2025
a114fe5
Update docstrings
daavoo Feb 5, 2025
df394cc
Merge branch 'main' into 5-add-benchmark
daavoo Feb 5, 2025
eec44b0
Update test
daavoo Feb 5, 2025
Extend benchmark
daavoo committed Feb 3, 2025
commit 8300573a6fee37f061120a588eb9b76ccb02cf4b
100 changes: 100 additions & 0 deletions benchmark/perfect_context/2.1 Pre-training Data
@@ -0,0 +1,100 @@
2 Approach
Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

2.1 Pre-training Data
Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available and compatible with open sourcing. This leads to the following mixture of data and the percentage they represent in the training set:

English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages, and filters low quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia vs. randomly sampled pages, and discarded pages not classified as references.

C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as the presence of punctuation marks or the number of words and sentences in a webpage.

Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicate the resulting dataset at the file level, with exact matches.

Wikipedia [4.5%]. We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

ArXiv [2.5%]. We process arXiv LaTeX files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).

Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.

Overall, our entire training dataset contains roughly 1.4T tokens after tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.

Table 1: Pre-training data. Data mixtures used for pre-training; for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.

Dataset         Sampling prop.  Epochs  Disk size
CommonCrawl     67.0%           1.10    3.3 TB
C4              15.0%           1.06    783 GB
Github           4.5%           0.64    328 GB
Wikipedia        4.5%           2.45     83 GB
Books            4.5%           2.23     85 GB
ArXiv            2.5%           1.06     92 GB
StackExchange    2.0%           1.03     78 GB
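The Tokenizer paragraph above maps directly onto SentencePiece training options. Below is a minimal, hypothetical sketch (not the paper's actual training script): the corpus path, output prefix, and vocabulary size are placeholder assumptions, while digit splitting and byte fallback correspond to the behaviour described above.

```python
# Hypothetical sketch of the tokenizer configuration described above.
# "corpus.txt", the output prefix, and the 32k vocabulary are assumptions,
# not values taken from the paper's training setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path to raw training text
    model_prefix="bpe_tok",    # hypothetical output name
    model_type="bpe",          # byte-pair encoding (Sennrich et al., 2015)
    vocab_size=32000,          # assumed vocabulary size
    split_digits=True,         # split all numbers into individual digits
    byte_fallback=True,        # decompose unknown UTF-8 characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="bpe_tok.model")
print(sp.encode("Trained on roughly 1.4T tokens", out_type=str))
```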
9 changes: 9 additions & 0 deletions benchmark/perfect_context/2.3 Optimizer
@@ -0,0 +1,9 @@
Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95. We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. We use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).
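For readers who want to see these hyper-parameters wired together, here is a minimal PyTorch sketch, not the paper's training code: the tiny stand-in model, `peak_lr`, and `total_steps` are assumptions, since the paper varies learning rate and batch size by model size.

```python
# Minimal PyTorch sketch of the optimizer setup described above (assumptions noted inline).
import math
import torch

model = torch.nn.Linear(16, 16)                            # stand-in for the transformer
peak_lr, total_steps, warmup_steps = 3e-4, 10_000, 2_000   # peak_lr/total_steps are assumed

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                                # linear warmup over 2,000 steps
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * progress)) # cosine decay to 10% of peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean()             # dummy forward pass
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping of 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```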
36 changes: 36 additions & 0 deletions benchmark/perfect_context/3 Main results.txt
@@ -0,0 +1,36 @@
Following previous work (Brown et al., 2020), we consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks:

• Zero-shot. We provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.
• Few-shot. We provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options.

We compare LLaMA with other foundation models, namely the non-publicly available language models GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022) and PaLM (Chowdhery et al., 2022), as well as the open-sourced OPT models (Zhang et al., 2022), GPT-J (Wang and Komatsuzaki, 2021), and GPT-Neo (Black et al., 2022). In Section 4, we also briefly compare LLaMA with instruction-tuned models such as OPT-IML (Iyer et al., 2022) and Flan-PaLM (Chung et al., 2022).

We evaluate LLaMA on free-form generation tasks and multiple choice tasks. In the multiple choice tasks, the objective is to select the most appropriate completion among a set of given options, based on a provided context. We select the completion with the highest likelihood given the provided context. We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020) and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion | context) / P(completion | “Answer:”).
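The two scoring rules in the last paragraph can be written down in a few lines. The sketch below is illustrative only; `log_likelihood(context, completion)` is a hypothetical helper (to be implemented with whatever causal LM is being evaluated) that returns the summed log-probability of the completion tokens given the context.

```python
# Hedged sketch of the two completion-scoring rules described above.
# `log_likelihood` is a hypothetical callable, not an API from the paper.
from typing import Callable, Sequence

def pick_completion(
    context: str,
    options: Sequence[str],
    log_likelihood: Callable[[str, str], float],
    normalize_by_answer_prior: bool = False,
) -> str:
    scores = []
    for option in options:
        if normalize_by_answer_prior:
            # OpenBookQA/BoolQ-style: P(completion|context) / P(completion|"Answer:"),
            # i.e. a difference of log-likelihoods.
            score = log_likelihood(context, option) - log_likelihood("Answer:", option)
        else:
            # Default rule: likelihood normalized by completion length in characters.
            score = log_likelihood(context, option) / max(1, len(option))
        scores.append(score)
    # Return the option with the highest score.
    return options[max(range(len(options)), key=scores.__getitem__)]
```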
16 changes: 16 additions & 0 deletions benchmark/perfect_context/5 Bias, Toxicity and Misinformation.txt
@@ -0,0 +1,16 @@
Large language models have been shown to reproduce and amplify biases that exist in the training data (Sheng et al., 2019; Kurita et al., 2019), and to generate toxic or offensive content (Gehman et al., 2020). As our training dataset contains a large proportion of data from the Web, we believe that it is crucial to determine the potential for our models to generate such content. To understand the potential harm of LLaMA-65B, we evaluate it on different benchmarks that measure toxic content production and stereotype detection. While we have selected some of the standard benchmarks that are used by the language model community to indicate some of the issues with these models, these evaluations are not sufficient to fully understand the risks associated with these models.
32 changes: 32 additions & 0 deletions benchmark/perfect_context/5.2 CrowS-Pairs.txt
@@ -0,0 +1,32 @@
Category              LLaMA  GPT3   OPT
Gender                70.6   62.6   65.7
Religion              79.0   73.3   68.6
Race/Color            57.0   64.7   68.6
Sexual orientation    81.0   76.2   78.6
Age                   70.1   64.4   67.8
Nationality           64.2   61.6   62.9
Disability            66.7   76.7   76.7
Physical appearance   77.8   74.6   76.2
Socioeconomic status  71.5   73.8   76.2
Average               66.6   67.2   69.5

Table 12: CrowS-Pairs. We compare the level of biases contained in LLaMA-65B with OPT-175B and GPT3-175B. Higher score indicates higher bias.

5.2 CrowS-Pairs
We evaluate the biases in our model on the CrowS-Pairs dataset (Nangia et al., 2020). This dataset makes it possible to measure biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance and socioeconomic status. Each example is composed of a stereotype and an anti-stereotype; we measure the model's preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting. Higher scores thus indicate higher bias. We compare with GPT-3 and OPT-175B in Table 12.

LLaMA compares slightly favorably to both models on average. Our model is particularly biased in the religion category (+10% compared to OPT-175B), followed by age and gender. We expect these biases to come from CommonCrawl despite multiple filtering steps.
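As a rough illustration of the zero-shot preference measurement described above (not the paper's evaluation code; GPT-2 is used here only as a small stand-in model, whereas the paper evaluates LLaMA-65B):

```python
# Hedged sketch: lower perplexity on the stereotypical sentence counts as a
# "biased" preference; the benchmark score is the fraction of pairs where this
# happens. GPT-2 is a stand-in model, not the one evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))

def prefers_stereotype(stereotype: str, anti_stereotype: str) -> bool:
    return perplexity(stereotype) < perplexity(anti_stereotype)

print(prefers_stereotype("Example stereotypical sentence.", "Example anti-stereotypical sentence."))
```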
54 changes: 54 additions & 0 deletions benchmark/perfect_context/Accountability and responsibility.txt
@@ -0,0 +1,54 @@
Accountability and responsibility

Ensuring accountability for generative AI means that individuals and organisations can be held accountable for the AI systems they develop, deploy, or use, and that human oversight is maintained. To establish accountable practices across the AI lifecycle, you should consider three key elements.

• Answerability: you should establish a chain of human responsibility across the generative AI project lifecycle, including responsibility throughout the supply chain. In cases of harm or errors caused by generative AI, recourse and feedback mechanisms need to be established for affected individuals. Identifying the specific actors involved in generative AI systems is vital to answerability. This includes model developers, application developers, policymakers, regulators, system operators and end-users. The roles and responsibilities of each must be clearly defined and aligned with legal and ethical standards.
• Auditability: you should demonstrate the responsibility and trustworthiness of the development and deployment practices by upholding robust reporting and documentation protocols, and retaining traceability throughout the AI lifecycle. This refers to the process by which all stages of the generative AI innovation lifecycle, from data collection and base model training to implementation, fine-tuning, system deployment, updating, and retirement, are documented in a way that is accessible to relevant stakeholders and easily understood.
• Liability: you should make sure that all parties involved in the generative AI project lifecycle, from vendors and technical teams to system users, are acting lawfully and understand their respective legal obligations.

As an end-user, being accountable means taking responsibility for a system’s outputs and generated content and its potential consequences. This includes checking that these are factual, truthful, non-discriminatory, non-harmful, and do not violate existing legal provisions, guidelines, policies or the providers’ terms of use. It entails putting the necessary oversight and human-in-the-loop processes in place to validate output in situations with high impact or risk. Where these risks are too high, you must consider if generative AI should be used.

Ultimately, responsibility for any output or decision made or supported by an AI system always rests with the public organisation. Where generative AI is bought commercially, ensure that vendors understand their responsibilities and liabilities, put the required risk mitigations in place and share all relevant information. Refer to the Buying generative AI section for further guidance.

Practical recommendations

• Follow existing legal provisions, guidelines and policies as well as the provider’s terms of use when developing, deploying or using generative AI.
• As an end-user, assume responsibility for output produced by generative AI tools when used to support everyday tasks, such as drafting emails and reports.
• Clearly define responsibilities, accountability, and liability across all actors involved in the AI lifecycle. Where the generative AI is bought commercially, define detailed responsibilities and liability contractually.
• Nominate a Senior Responsible Owner who will be accountable for the use of generative AI in a specific project.
• Where generative AI is used in situations of high impact or risk, establish a human-in-the-loop to oversee and validate outputs.
• Adopt a risk-based approach to the use of AI-generated content and put strategies in place to minimise the risk of inaccurate or harmful outputs. Where the potential risks and harmful impacts are too high, consider whether human-in-the-loop approaches offer sufficient mitigation or if generative AI should be used.
• Provide routes for appeal and actionable redress, and put feedback channels into place.
• Use assurance techniques to evaluate the performance of generative AI systems. The CDEI AI assurance guide provides a useful starting point, and the CDEI portfolio of AI assurance techniques offers real-world examples.

This file was deleted.

39 changes: 0 additions & 39 deletions benchmark/perfect_context/Codes of practice.txt

This file was deleted.

This file was deleted.
