Merged
Changes from 1 commit
121 commits
d57c367
enh(preprocessing): Add split_markdown_by_headings.
daavoo Jan 22, 2025
fe93f74
Add benchmark
daavoo Jan 20, 2025
92c70a7
Move to structured_qa. Add entrypoint
daavoo Jan 20, 2025
70ef785
Move back outside
daavoo Jan 20, 2025
16ff8bd
Fix main
daavoo Jan 20, 2025
539898e
Update questions
daavoo Jan 20, 2025
ed71947
Update model and prompt
daavoo Jan 20, 2025
fd4fb95
Update
daavoo Jan 20, 2025
5add514
Update
daavoo Jan 20, 2025
9f8c755
fix
daavoo Jan 20, 2025
bec2ef1
Add system_instruction
daavoo Jan 20, 2025
08cad02
Update ratio
daavoo Jan 20, 2025
b7ce84e
Add more wait
daavoo Jan 20, 2025
6fc48fe
Fix return
daavoo Jan 20, 2025
8929e9e
Fix URLs
daavoo Jan 20, 2025
4a9e75e
Update download name
daavoo Jan 20, 2025
41ffc23
Update
daavoo Jan 20, 2025
4390852
Update
daavoo Jan 20, 2025
68621eb
Update with upper
daavoo Jan 20, 2025
422e5d5
Cast to str
daavoo Jan 20, 2025
3040978
Extend
daavoo Jan 20, 2025
bc0d8ce
Add benchmark
daavoo Jan 20, 2025
03e0e60
Fix
daavoo Jan 20, 2025
c19738e
fix
daavoo Jan 20, 2025
3cd7b24
Drop export
daavoo Jan 21, 2025
22df32b
Updates
daavoo Jan 21, 2025
b35dc23
Update default model
daavoo Jan 21, 2025
6cf13d7
Update
daavoo Jan 21, 2025
ad1ef9b
Use info
daavoo Jan 21, 2025
f237b89
Update with None
daavoo Jan 21, 2025
a34f4e2
Add answer type
daavoo Jan 21, 2025
291e376
Refactor
daavoo Jan 21, 2025
d7e99e7
Add fallback for out of context
daavoo Jan 21, 2025
0f381bb
Update with debugging info
daavoo Jan 21, 2025
a0391a4
Update
daavoo Jan 21, 2025
c3182cb
Update with mit-1
daavoo Jan 22, 2025
20b1651
test unsloth
daavoo Jan 22, 2025
0dd98da
Add , skip_special_tokens = True
daavoo Jan 22, 2025
6ac29aa
Update
daavoo Jan 22, 2025
95b3d57
Updates
daavoo Jan 22, 2025
d946f81
Add full_context
daavoo Jan 22, 2025
4ea1f7d
Update full context
daavoo Jan 22, 2025
a4888f2
update
daavoo Jan 22, 2025
e0f3a82
Add load and clean
daavoo Jan 22, 2025
906c8d9
Update
daavoo Jan 22, 2025
bb2afe5
Update
daavoo Jan 22, 2025
51c31f7
print
daavoo Jan 22, 2025
c5e0ac4
Update
daavoo Jan 22, 2025
cc10a9d
Add load_gemini_model
daavoo Jan 22, 2025
1560c71
Add sleep
daavoo Jan 22, 2025
94e7580
Update get_response
daavoo Jan 22, 2025
e7b5d5b
Update
daavoo Jan 22, 2025
5f6443b
Log error
daavoo Jan 22, 2025
819c6b2
fix
daavoo Jan 22, 2025
5625c39
Make the more info check more flexible
daavoo Jan 23, 2025
d125b79
Add gemini_full_context notebook
daavoo Jan 23, 2025
88a9357
typo
daavoo Jan 23, 2025
d929a80
Check for API KEY
daavoo Jan 23, 2025
9e718b3
Update with outputs
daavoo Jan 23, 2025
9027567
Add ragatouille
daavoo Jan 23, 2025
d2a3d98
Fix
daavoo Jan 23, 2025
17942ca
Update notebooks
daavoo Jan 24, 2025
fcdd953
Update gemini notebooks
daavoo Jan 24, 2025
bfdacea
Extend structured_qa. Add perfect_context.
daavoo Jan 27, 2025
a7d8dc5
Add gemini_perfect_context
daavoo Jan 27, 2025
308ab91
Update
daavoo Jan 27, 2025
704050b
fix line
daavoo Jan 27, 2025
67b8f80
fix line
daavoo Jan 27, 2025
a6bfe34
Update perfect_context
daavoo Jan 28, 2025
39a17ae
Add missing perfect context
daavoo Jan 28, 2025
ae325d3
Updates
daavoo Jan 28, 2025
56d8620
Update gemini_ragatouille
daavoo Jan 28, 2025
eb00902
Update gemini_fra
daavoo Jan 28, 2025
1d06d2c
Update
daavoo Jan 28, 2025
8ac9201
Update
daavoo Jan 28, 2025
0352173
Drop some log
daavoo Jan 28, 2025
0b8e5cf
Update
daavoo Jan 28, 2025
e2c5457
Update gemini_perfect_context with results
daavoo Jan 29, 2025
36350ee
Use rapidfuzz
daavoo Jan 29, 2025
215226e
Use question_part
daavoo Jan 29, 2025
5d4d961
Fix
daavoo Jan 29, 2025
1223b03
break when no section_names
daavoo Jan 29, 2025
08c0b85
Update prompt
daavoo Jan 29, 2025
7b9c96c
Add qwen perfect context
daavoo Jan 29, 2025
c056bdc
Update gemini_find_retrieve_answer
daavoo Jan 30, 2025
b726447
Update qwen perfect context
daavoo Jan 30, 2025
036f8a3
Add qwen RAGatouille
daavoo Jan 30, 2025
6b0a0c1
Update qwen notebooks
daavoo Jan 30, 2025
c60fe3e
Update
daavoo Jan 30, 2025
d12fa72
Update prompt
daavoo Jan 30, 2025
38d2530
Update qwen notebooks
daavoo Jan 30, 2025
1360437
Cleanup
daavoo Jan 30, 2025
6906991
Cleanup
daavoo Jan 30, 2025
8abcfb1
Add DeepSeek-R1-Distill-Qwen-7B
daavoo Jan 31, 2025
034fe29
Debug current calls. Set to 9 before reset
daavoo Feb 1, 2025
a2d301f
Add qwen find retrieve answer
daavoo Feb 1, 2025
8300573
Extend benchmark
daavoo Feb 3, 2025
4f8f82a
Update
daavoo Feb 3, 2025
2de0bfb
Add max_sections_to_check
daavoo Feb 3, 2025
8f7d173
Default to None
daavoo Feb 3, 2025
7ff95ff
Default to half of sections
daavoo Feb 3, 2025
d05d992
Update
daavoo Feb 3, 2025
db63dc9
fix
daavoo Feb 3, 2025
20f9e3f
Fix
daavoo Feb 3, 2025
c5ee8e6
Add qwen full context
daavoo Feb 3, 2025
a4da649
Update qwen_full_context
daavoo Feb 3, 2025
4ea56e2
Update gemini_full_context
daavoo Feb 3, 2025
82f37f3
Add statistics
daavoo Feb 3, 2025
a02ffd7
Update prompt
daavoo Feb 4, 2025
8af98df
Update with type
daavoo Feb 4, 2025
97049d6
Update gemini prompt and count
daavoo Feb 4, 2025
6555304
Update results with same prompts
daavoo Feb 4, 2025
0ab4688
Update with same prompt
daavoo Feb 4, 2025
5276d16
Update results
daavoo Feb 4, 2025
476bbe1
Bring back llama-cpp-python
daavoo Feb 5, 2025
fdafdc3
Update prompts
daavoo Feb 5, 2025
2ac1f61
Reduce notebook size
daavoo Feb 5, 2025
c99adb0
Update pre-commit
daavoo Feb 5, 2025
a114fe5
Update docstrings
daavoo Feb 5, 2025
df394cc
Merge branch 'main' into 5-add-benchmark
daavoo Feb 5, 2025
eec44b0
Update test
daavoo Feb 5, 2025
Extend structured_qa. Add perfect_context.
daavoo committed Jan 27, 2025
commit bfdacea8db390d9b3ab240676ea3b9e51276cd87
49 changes: 49 additions & 0 deletions benchmark/perfect_context/1 INTRODUCTION.txt
@@ -0,0 +1,49 @@
Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as in the original model. As larger models are trained every few months, this changes from a mere “inconvenience” for GPT-2 (Radford et al., b) or RoBERTa large (Liu et al., 2019) to a critical deployment challenge for GPT-3 (Brown et al., 2020) with 175 billion trainable parameters.
Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed. However, existing techniques often introduce inference latency (Houlsby et al., 2019; Rebuffi et al., 2017) by extending model depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020), which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3 175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) suffices even when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efficient.
LoRA possesses several key advantages.
• A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B in Figure 1, reducing the storage requirement and task-switching overhead significantly.
• LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, we only optimize the injected, much smaller low-rank matrices.
• Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
• LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning. We provide an example in Appendix E.
Terminologies and Conventions We make frequent references to the Transformer architecture and use the conventional terminologies for its dimensions. We call the input and output dimension size of a Transformer layer d_model. We use W_q, W_k, W_v, and W_o to refer to the query/key/value/output projection matrices in the self-attention module. W or W_0 refers to a pre-trained weight matrix and ∆W its accumulated gradient update during adaptation. We use r to denote the rank of a LoRA module. We follow the conventions set out by (Vaswani et al., 2017; Brown et al., 2020) and use Adam (Loshchilov & Hutter, 2019; Kingma & Ba, 2017) for model optimization and use a Transformer MLP feedforward dimension d_ffn = 4 × d_model.
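
To make the low-rank update concrete, below is a minimal PyTorch sketch (not the paper's reference implementation) of a LoRA-adapted linear layer: the pre-trained weight W_0 stays frozen while only the rank-r factors B and A are trained, and B A can be folded back into W_0 for inference. The class name, initialization scales and the alpha/r scaling are illustrative assumptions.

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (illustrative, not the reference code)."""

    def __init__(self, in_features: int, out_features: int, r: int = 2, alpha: float = 1.0):
        super().__init__()
        # Frozen pre-trained weight W0: gradients are never computed for it.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.normal_(self.weight, std=0.02)
        # Trainable low-rank factors: delta_W = B @ A with rank r much smaller than d.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: delta_W starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W0 x + scaling * (B A) x: only A and B receive gradients during adaptation.
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

    def merge(self) -> torch.Tensor:
        # For deployment the update can be merged into the frozen weight,
        # so inference uses a single matmul and adds no latency.
        return self.weight + self.scaling * self.lora_B @ self.lora_A


layer = LoRALinear(in_features=16, out_features=16, r=2)
x = torch.randn(4, 16)
print(layer(x).shape)       # torch.Size([4, 16])
print(layer.merge().shape)  # torch.Size([16, 16])

Switching tasks then amounts to swapping the small (A, B) pair while the shared frozen weight stays in memory, which is the storage and task-switching advantage listed above.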
16 changes: 16 additions & 0 deletions benchmark/perfect_context/1.2.1. Internal partitions and doors.txt
@@ -0,0 +1,16 @@
Fire resistance of partitions:
- Copy rooms must have vertical partitions with a fire resistance of EI 30 and doors must have
a fire resistance of EI1 30 and close automatically (linked to the fire detection system).
Door retaining devices:
- Certain fire doors for rooms which are accessed or traversed very frequently are kept open
using magnetic retainers linked to the fire detection system (e.g. entrance halls and lift
lobbies, corridor compartment doors, kitchenette doors and doors of copy rooms).
- As a minimum, rooms accommodating kitchenettes must have doors which close
automatically (linked to the fire detection system).
Door closers:
- In addition to the requirements set out in the applicable legislation, access doors to
toilets/washrooms, kitchenettes, copy rooms, etc. must also be fitted with door closers.
Horizontal communication between two buildings:
- In the case of doors forming an airlock between two buildings, an intermittent red light
signal should be placed above or beside the door frames. This signal should light up on the
non-dangerous side to indicate the danger when the alarm is raised.
22 changes: 22 additions & 0 deletions benchmark/perfect_context/15.3. API Fundamentals.txt
@@ -0,0 +1,22 @@
Graph memory nodes are graph nodes representing either memory allocation or free actions. As a shorthand, nodes that allocate memory are called allocation nodes. Likewise, nodes that free memory are called free nodes. Allocations created by allocation nodes are called graph allocations. CUDA assigns virtual addresses for the graph allocation at node creation time. While these virtual addresses are fixed for the lifetime of the allocation node, the allocation contents are not persistent past the freeing operation and may be overwritten by accesses referring to a different allocation.
Graph allocations are considered recreated every time a graph runs. A graph allocation’s lifetime, which differs from the node’s lifetime, begins when GPU execution reaches the allocating graph node and ends when one of the following occurs:
▶ GPU execution reaches the freeing graph node
▶ GPU execution reaches the freeing cudaFreeAsync() stream call
▶ immediately upon the freeing call to cudaFree()
Note: Graph destruction does not automatically free any live graph-allocated memory, even though it ends the lifetime of the allocation node. The allocation must subsequently be freed in another graph, or using cudaFreeAsync()/cudaFree().
Just like other Graph Structure, graph memory nodes are ordered within a graph by dependency edges. A program must guarantee that operations accessing graph memory:
▶ are ordered after the allocation node
▶ are ordered before the operation freeing the memory
Graph allocation lifetimes begin and usually end according to GPU execution (as opposed to API invocation). GPU ordering is the order that work runs on the GPU as opposed to the order that the work is enqueued or described. Thus, graph allocations are considered ‘GPU ordered’.
8 changes: 8 additions & 0 deletions benchmark/perfect_context/2.1. Toilets.txt
@@ -0,0 +1,8 @@
Toilets must be installed on each level containing office rooms and for each structural unit; they must be distributed uniformly and located in a central area. Sinks must be supplied exclusively with cold water.
Accessibility for persons with reduced mobility (PRM)
In the event that a new office building is constructed upon request by the Commission, one toilet which is accessible for persons with reduced mobility must be installed on each level containing office rooms or similar. In other cases, the requirements of the applicable legislation must be observed.
59 changes: 59 additions & 0 deletions benchmark/perfect_context/2.4 Recurrent Networks.txt
@@ -0,0 +1,59 @@
As recurrent neural networks (RNNs) can be unrolled to
feed-forward representation, RNNs can also be equivalently
represented as decision trees. We study following recurrent
neural network. Note that we simply omit the bias terms as
they can be represented by concatenating a 1 value to input
vectors.
h(t) = σ(WT h(t−1) + UT x(t))
o(t) = VT h(t) (12)
Similar to previous analysis, one can rewrite h(t) as fol-
lows.
h(t) = a(t) (WT h(t−1) + UT x(t)) (13)
Eq. 13 can be rewritten follows.
h(t) = a(t) (
1∏
j=(t−1)
(WT a(j)))WT h(0)
+a(t)
t∑
i=1
(
i∏
j=(t−1)
(WT a(j)))UT x(i)
(14)
Note that in Eq. 14, the product operator stands for ma-
trix multiplication, its steps are −1 and we consider the out-
put of product operator to be 1 when i = t. One can rewrite
Eq. 14 by introducing cj ˆWj as follows.
h(t) = a(t) c1 ˆW1WT h(0) + a(t)
t∑
i=1
ci ˆWiUT x(i)
ci ˆWT
i =
i∏
j=(t−1)
(WT a(j)
Combining Eq. 15 and Eq. 12, one can write o(t) as
follows.
o(t) = a(t) ˆVT
c1 ˆW1WT h(0) +a(t) ˆVT t∑
i=1
ci ˆWiUT x(i) (16)
Eq. 16 can be further simplified to the following.
o(t) = c1 ˆZT
1 WT h(0) +
t∑
i=1
ci ˆZiUT x(i) (17)
In Eq. 17, ci ˆZT
i = a(t) ˆVT
ci ˆWi .As one can observe from
Eq. 17, the RNN output only depends on the categoriza-
tion vector ci, which enables the tree equivalence -similar
to previous analysis.
Note that for RNNs, a popular choice for σ in Eq. 12
is tanh. As mentioned in Section 2.3, in order to provide
finite trees, one might consider using a piece-wise linear
approximation of tanh.
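
As a hedged illustration of Eqs. 13-15 (assuming a leaky-ReLU in place of tanh so that σ is piece-wise linear), the NumPy sketch below rolls out a small, randomly initialized RNN, records the slope vectors a(t), and then rebuilds h(T) from the effective linear maps implied by that activation pattern. The dimensions and weights are arbitrary assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 3, 2, 4                    # hidden size, input size, sequence length (arbitrary)
W = rng.normal(size=(d_h, d_h))          # recurrent weights
U = rng.normal(size=(d_x, d_h))          # input weights
x = [None] + [rng.normal(size=d_x) for _ in range(T)]   # 1-based x(1..T)
h0 = rng.normal(size=d_h)
slope = 0.3                              # leaky-ReLU negative slope (piece-wise linear sigma)

# Non-linear rollout, recording the slope vector a(t) of Eq. 13 at every step.
h, a = h0.copy(), [None]
for t in range(1, T + 1):
    pre = W.T @ h + U.T @ x[t]
    a.append(np.where(pre > 0, 1.0, slope))   # activation pattern = one branch of the tree
    h = a[t] * pre
rollout = h

# With the pattern fixed, h(T) is linear in h(0) and x(1..T) (Eqs. 14-15):
# h(T) = M @ h(0) + sum_i N_i @ x(i), where M and N_i are products of diag(a(t)), W^T and U^T.
M = np.eye(d_h)
N = [None] * (T + 1)
for t in range(1, T + 1):
    A_t = np.diag(a[t])
    M = A_t @ W.T @ M
    for i in range(1, t):
        N[i] = A_t @ W.T @ N[i]
    N[t] = A_t @ U.T
reconstructed = M @ h0 + sum(N[i] @ x[i] for i in range(1, T + 1))

print(np.allclose(rollout, reconstructed))   # True: the output depends only on the pattern sequence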
21 changes: 21 additions & 0 deletions benchmark/perfect_context/23.1. What is Lazy Loading?.txt
@@ -0,0 +1,21 @@
Lazy Loading delays loading of CUDA modules and kernels from program initialization to closer to kernel execution. If a program does not use every single kernel it has included, then some kernels will be loaded unnecessarily. This is very common, especially if you include any libraries. Most of the time, programs only use a small number of the kernels from the libraries they include.
Thanks to Lazy Loading, programs are able to load only the kernels they are actually going to use, saving time on initialization. This reduces memory overhead, both on GPU memory and host memory.
Lazy Loading is enabled by setting the CUDA_MODULE_LOADING environment variable to LAZY.
Firstly, the CUDA Runtime will no longer load all modules during program initialization, with the exception of modules containing managed variables. Each module will be loaded on first usage of a variable or a kernel from that module. This optimization is only relevant to CUDA Runtime users; CUDA Driver users who use cuModuleLoad are unaffected. This optimization shipped in CUDA 11.8. The behavior for CUDA Driver users who use cuLibraryLoad to load module data into memory can be changed by setting the CUDA_MODULE_DATA_LOADING environment variable.
Secondly, loading a module (the cuModuleLoad*() family of functions) will not load kernels immediately; instead it will delay loading of a kernel until cuModuleGetFunction() is called. There are certain exceptions here: some kernels have to be loaded during cuModuleLoad*(), such as kernels whose pointers are stored in global variables. This optimization is relevant to both CUDA Runtime and CUDA Driver users. CUDA Runtime will only call cuModuleGetFunction() when a kernel is used/referenced for the first time. This optimization shipped in CUDA 11.7.
Both of these optimizations are designed to be invisible to the user, assuming the CUDA Programming Model is followed.
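
In a Python process the same switch can be flipped from code, as long as it happens before CUDA is initialized; the snippet below is a small sketch under that assumption, with torch standing in for any CUDA-backed library (the import is illustrative, not something the source prescribes).

import os

# CUDA_MODULE_LOADING must be set before the CUDA Runtime/Driver is initialized
# in this process, i.e. before importing or initializing any CUDA-backed library.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch  # illustrative stand-in for a CUDA-backed library

if torch.cuda.is_available():
    torch.zeros(1, device="cuda")  # modules/kernels now load on first use instead of up front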
6 changes: 6 additions & 0 deletions benchmark/perfect_context/3 Arithmetic Reasoning.txt
@@ -0,0 +1,6 @@
We begin by considering math word problems of the form in Figure 1, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chain-of-thought prompting when used with the 540B parameter language model performs comparably with task-specific finetuned models on several tasks, even achieving new state of the art on the challenging GSM8K benchmark (Cobbe et al., 2021).
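
For concreteness, a chain-of-thought prompt simply prepends worked examples whose answers spell out the intermediate reasoning before the final answer; the short Python sketch below builds such a prompt. The exemplar wording is invented here for illustration and is not taken from the paper's prompts or from GSM8K.

# A few-shot exemplar that shows its reasoning before stating the final answer.
cot_exemplar = (
    "Q: A box holds 4 rows of 6 apples. 5 apples are removed. How many apples remain?\n"
    "A: There are 4 x 6 = 24 apples. Removing 5 leaves 24 - 5 = 19. The answer is 19.\n\n"
)
new_question = "Q: A baker made 24 muffins and sold 9. How many muffins are left?\nA:"
prompt = cot_exemplar + new_question   # send `prompt` to the language model
print(prompt)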
91 changes: 91 additions & 0 deletions benchmark/perfect_context/3 Experimental Results.txt
@@ -0,0 +1,91 @@
First, we make a toy experiment where we fit a neural network to the y = x^2 equation. The neural network has 3 dense layers with 2 filters each, except for the last layer which has 1 filter. The network uses leaky-ReLU activations after fully connected layers, except for the last layer which has no post-activation. We have used a negative slope of 0.3 for leaky-ReLU, which is the default value in Tensorflow [1]. The network was trained with 5000 (x, y) pairs where x was regularly sampled from the [−2.5, 2.5] interval. Fig. 2 shows the decision tree corresponding to the neural network. In the tree, every black rectangle box indicates a rule, the left child from the box means the rule does not hold, and the right child means the rule holds. For better visualization, the rules are obtained via converting w^T x + β > 0 to direct inequalities acting on x. This can be done for the particular regression y = x^2, since x is a scalar. In every leaf, the network applies a linear function -indicated by a red rectangle- based on the decisions so far. We have avoided writing these functions explicitly due to limited space. At first glance, the tree representation of a neural network in this example seems large due to the 2^(∑_i^{n−2} m_i) = 2^4 = 16 categorizations. However, we notice that a lot of the rules in the decision tree are redundant, and hence some paths in the decision tree become invalid. An example of a redundant rule is checking x < 0.32 after the x < −1.16 rule holds. This directly creates the invalid left child for this node. Hence, the tree can be cleaned via removing the left child in this case, and merging the categorization rule to the stricter one: x < −1.16 in the particular case. Via cleaning the decision tree in Fig. 2, we obtain the simpler tree in Fig. 3a, which only consists of 5 categories instead of 16. The 5 categories are directly visible also from the model response in Fig. 3b. The interpretation of the neural network is thus straightforward: for each region whose boundaries are determined via the decision tree representation, the network approximates the non-linear y = x^2 equation by a linear equation. One can clearly interpret and moreover make deductions from the decision tree, some of which are as follows. The neural network is unable to grasp the symmetrical nature of the regression problem, which is evident from the fact that the decision boundaries are asymmetrical. The region below −1.16 and above 1 is unbounded, and thus neural decisions lose accuracy as x goes beyond these boundaries.

              y = x^2                      Half-Moon
        Param.  Comp.  Mult./Add.    Param.  Comp.  Mult./Add.
Tree      14     2.6       2           39     4.1      8.2
NN        13     4        16           15     5       25
Table 1. Computation and memory analysis of toy problems

Next, we investigate another toy problem of classifying half-moons and analyse the decision tree produced by a neural network. We train a fully connected neural network with 3 layers with leaky-ReLU activations, except for the last layer which has sigmoid activation. Each layer has 2 filters except for the last layer, which has 1. The cleaned decision tree induced by the trained network is shown in Fig. 4. The decision tree finds many categories whose boundaries are determined by the rules in the tree, where each category is assigned a single class. In order to better visualize the categories, we illustrate them with different colors in Fig. 5. One can make several deductions from the decision tree, such as: some regions are very well-defined and bounded, and the classifications they make are perfectly in line with the training data, thus these regions are very reliable. There are unbounded categories which help obtaining accurate classification boundaries, yet fail to provide a compact representation of the training data; these may correspond to inaccurate extrapolations made by neural decisions. There are also some categories that emerged although none of the training data falls into them.
Besides the interpretability aspect, the decision tree representation also provides some computational advantages. In Table 1, we compare the number of parameters, floating-point comparisons and multiplication or addition operations of the neural network and the tree induced by it. Note that the comparisons, multiplications and additions in the tree representation are given as expected values, since the depth of the tree is different for each category. As the induced tree is an unfolding of the neural network, it covers all possible routes and keeps all possible effective filters in memory. Thus, as expected, the number of parameters in the tree representation of a neural network is larger than that of the network. In the induced tree, in every layer i, a maximum of m_i filters are applied directly on the input, whereas in the neural network always m_i filters are applied on the previous feature, which is usually much larger than the input in the feature dimension. Thus, computation-wise, the tree representation is advantageous compared to the neural network one.
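
A rough sketch of the mechanism behind these toy experiments, using an untrained, randomly initialized 1-2-2-1 leaky-ReLU network rather than the trained models and figures reported above (the weights, seed and sampling grid are assumptions): sampling x over [−2.5, 2.5] and hashing the activation pattern shows how few of the 2^4 = 16 possible categories actually occur, and that the network is exactly linear inside each one.

import numpy as np

rng = np.random.default_rng(1)
slope = 0.3  # leaky-ReLU negative slope, as in the experiment

# Untrained 1 -> 2 -> 2 -> 1 dense network (random weights; same shape as the toy model).
Ws = [rng.normal(size=(1, 2)), rng.normal(size=(2, 2)), rng.normal(size=(2, 1))]
bs = [rng.normal(size=2), rng.normal(size=2), rng.normal(size=1)]

def forward_with_pattern(x):
    """Return the scalar output and the activation pattern (the tree category) for input x."""
    h, pattern = np.array([x], dtype=float), []
    for i, (W, b) in enumerate(zip(Ws, bs)):
        pre = h @ W + b
        if i < len(Ws) - 1:                          # the last layer has no post-activation
            pattern.append(tuple(bool(v) for v in pre > 0))
            h = np.where(pre > 0, 1.0, slope) * pre
        else:
            h = pre
    return float(h[0]), tuple(pattern)

xs = np.linspace(-2.5, 2.5, 2001)
outs, cats = zip(*(forward_with_pattern(x) for x in xs))
regions = sorted(set(cats))
print(f"{len(regions)} of the 2**4 = 16 possible categories occur on [-2.5, 2.5]")

# Inside one region the network is exactly linear in x: second differences vanish.
outs, labels = np.array(outs), np.array([regions.index(c) for c in cats])
for r in range(len(regions)):
    idx = np.where(labels == r)[0]
    if len(idx) > 4 and np.all(np.diff(idx) == 1):   # contiguous samples from a single region
        print(r, bool(np.allclose(np.diff(outs[idx], n=2), 0.0, atol=1e-8)))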
9 changes: 9 additions & 0 deletions benchmark/perfect_context/3 Model Architecture.txt
@@ -0,0 +1,9 @@
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
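
As a small sketch of the auto-regressive loop described above (the step_fn interface and the random stand-in scores are assumptions, not the paper's model), each step feeds every previously generated symbol back in and picks the next one greedily:

import torch

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """Generic auto-regressive decoding: each step consumes all symbols generated so far."""
    ys = [bos_id]
    for _ in range(max_len):
        scores = step_fn(ys)             # scores over the next symbol, given the prefix
        next_id = int(torch.argmax(scores))
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys

# Stand-in "decoder": a fixed random score table keyed on the last symbol, just to run the loop.
torch.manual_seed(0)
table = torch.randn(10, 10)
print(greedy_decode(lambda ys: table[ys[-1]], bos_id=0, eos_id=1))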
15 changes: 15 additions & 0 deletions benchmark/perfect_context/3.1 Encoder and Decoder Stacks.txt
@@ -0,0 +1,15 @@
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
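
A minimal PyTorch sketch of one encoder layer built around LayerNorm(x + Sublayer(x)) with d_model = 512. The head count of 8, the inner feed-forward width of 2048, the ReLU and the use of nn.MultiheadAttention are assumptions made here for illustration (they are not specified in this excerpt); a decoder layer would add a third, encoder-attending sub-layer plus a causal mask on its self-attention.

import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and position-wise FFN, each wrapped as LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)   # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)            # residual connection, then layer normalization
        x = self.norm2(x + self.ffn(x))         # position-wise feed-forward sub-layer
        return x


encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # N = 6 identical layers
tokens = torch.randn(2, 10, 512)                               # (batch, sequence, d_model)
print(encoder(tokens).shape)                                   # torch.Size([2, 10, 512])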