Merged
Changes from 1 commit
121 commits
d57c367
enh(preprocessing): Add split_markdown_by_headings.
daavoo Jan 22, 2025
fe93f74
Add benchmark
daavoo Jan 20, 2025
92c70a7
Move to structured_qa. Add entrypoint
daavoo Jan 20, 2025
70ef785
Move back outside
daavoo Jan 20, 2025
16ff8bd
Fix main
daavoo Jan 20, 2025
539898e
Update questions
daavoo Jan 20, 2025
ed71947
Update model and prompt
daavoo Jan 20, 2025
fd4fb95
Update
daavoo Jan 20, 2025
5add514
Update
daavoo Jan 20, 2025
9f8c755
fix
daavoo Jan 20, 2025
bec2ef1
Add system_instruction
daavoo Jan 20, 2025
08cad02
Update ratio
daavoo Jan 20, 2025
b7ce84e
Add more wait
daavoo Jan 20, 2025
6fc48fe
Fix return
daavoo Jan 20, 2025
8929e9e
Fix URLs
daavoo Jan 20, 2025
4a9e75e
Update download name
daavoo Jan 20, 2025
41ffc23
Update
daavoo Jan 20, 2025
4390852
Update
daavoo Jan 20, 2025
68621eb
Update with upper
daavoo Jan 20, 2025
422e5d5
Cast to str
daavoo Jan 20, 2025
3040978
Extend
daavoo Jan 20, 2025
bc0d8ce
Add benchmark
daavoo Jan 20, 2025
03e0e60
Fix
daavoo Jan 20, 2025
c19738e
fix
daavoo Jan 20, 2025
3cd7b24
Drop export
daavoo Jan 21, 2025
22df32b
Updates
daavoo Jan 21, 2025
b35dc23
Update default model
daavoo Jan 21, 2025
6cf13d7
Update
daavoo Jan 21, 2025
ad1ef9b
Use info
daavoo Jan 21, 2025
f237b89
Update with None
daavoo Jan 21, 2025
a34f4e2
Add answer type
daavoo Jan 21, 2025
291e376
Refactor
daavoo Jan 21, 2025
d7e99e7
Add fallback for out of context
daavoo Jan 21, 2025
0f381bb
Update with debugging info
daavoo Jan 21, 2025
a0391a4
Update
daavoo Jan 21, 2025
c3182cb
Update with mit-1
daavoo Jan 22, 2025
20b1651
test unsloth
daavoo Jan 22, 2025
0dd98da
Add , skip_special_tokens = True
daavoo Jan 22, 2025
6ac29aa
Update
daavoo Jan 22, 2025
95b3d57
Updates
daavoo Jan 22, 2025
d946f81
Add full_context
daavoo Jan 22, 2025
4ea1f7d
Update full context
daavoo Jan 22, 2025
a4888f2
update
daavoo Jan 22, 2025
e0f3a82
Add load and clean
daavoo Jan 22, 2025
906c8d9
Update
daavoo Jan 22, 2025
bb2afe5
Update
daavoo Jan 22, 2025
51c31f7
print
daavoo Jan 22, 2025
c5e0ac4
Update
daavoo Jan 22, 2025
cc10a9d
Add load_gemini_model
daavoo Jan 22, 2025
1560c71
Add sleep
daavoo Jan 22, 2025
94e7580
Update get_response
daavoo Jan 22, 2025
e7b5d5b
Update
daavoo Jan 22, 2025
5f6443b
Log error
daavoo Jan 22, 2025
819c6b2
fix
daavoo Jan 22, 2025
5625c39
Make the more info check more flexible
daavoo Jan 23, 2025
d125b79
Add gemini_full_context notebook
daavoo Jan 23, 2025
88a9357
typo
daavoo Jan 23, 2025
d929a80
Check for API KEY
daavoo Jan 23, 2025
9e718b3
Update with outputs
daavoo Jan 23, 2025
9027567
Add ragatouille
daavoo Jan 23, 2025
d2a3d98
Fix
daavoo Jan 23, 2025
17942ca
Update notebooks
daavoo Jan 24, 2025
fcdd953
Update gemini notebooks
daavoo Jan 24, 2025
bfdacea
Extend structured_qa. Add perfect_context.
daavoo Jan 27, 2025
a7d8dc5
Add gemini_perfect_context
daavoo Jan 27, 2025
308ab91
Update
daavoo Jan 27, 2025
704050b
fix line
daavoo Jan 27, 2025
67b8f80
fix line
daavoo Jan 27, 2025
a6bfe34
Update perfect_context
daavoo Jan 28, 2025
39a17ae
Add missing perfect context
daavoo Jan 28, 2025
ae325d3
Updates
daavoo Jan 28, 2025
56d8620
Update gemini_ragatouille
daavoo Jan 28, 2025
eb00902
Update gemini_fra
daavoo Jan 28, 2025
1d06d2c
Update
daavoo Jan 28, 2025
8ac9201
Update
daavoo Jan 28, 2025
0352173
Drop some log
daavoo Jan 28, 2025
0b8e5cf
Update
daavoo Jan 28, 2025
e2c5457
Update gemini_perfect_context with results
daavoo Jan 29, 2025
36350ee
Use rapidfuzz
daavoo Jan 29, 2025
215226e
Use question_part
daavoo Jan 29, 2025
5d4d961
Fix
daavoo Jan 29, 2025
1223b03
break when no section_names
daavoo Jan 29, 2025
08c0b85
Update prompt
daavoo Jan 29, 2025
7b9c96c
Add qwen perfect context
daavoo Jan 29, 2025
c056bdc
Update gemini_find_retrieve_answer
daavoo Jan 30, 2025
b726447
Update qwen perfect context
daavoo Jan 30, 2025
036f8a3
Add qwen RAGatouille
daavoo Jan 30, 2025
6b0a0c1
Update qwen notebooks
daavoo Jan 30, 2025
c60fe3e
Update
daavoo Jan 30, 2025
d12fa72
Update prompt
daavoo Jan 30, 2025
38d2530
Update qwen notebooks
daavoo Jan 30, 2025
1360437
Cleanup
daavoo Jan 30, 2025
6906991
Cleanup
daavoo Jan 30, 2025
8abcfb1
Add DeepSeek-R1-Distill-Qwen-7B
daavoo Jan 31, 2025
034fe29
Debug current calls. Set to 9 before reset
daavoo Feb 1, 2025
a2d301f
Add qwen find retrieve answer
daavoo Feb 1, 2025
8300573
Extend benchmark
daavoo Feb 3, 2025
4f8f82a
Update
daavoo Feb 3, 2025
2de0bfb
Add max_sections_to_check
daavoo Feb 3, 2025
8f7d173
Default to None
daavoo Feb 3, 2025
7ff95ff
Default to half of sections
daavoo Feb 3, 2025
d05d992
Update
daavoo Feb 3, 2025
db63dc9
fix
daavoo Feb 3, 2025
20f9e3f
Fix
daavoo Feb 3, 2025
c5ee8e6
Add qwen full context
daavoo Feb 3, 2025
a4da649
Update qwen_full_context
daavoo Feb 3, 2025
4ea56e2
Update gemini_full_context
daavoo Feb 3, 2025
82f37f3
Add statistics
daavoo Feb 3, 2025
a02ffd7
Update prompt
daavoo Feb 4, 2025
8af98df
Update with type
daavoo Feb 4, 2025
97049d6
Update gemini prompt and count
daavoo Feb 4, 2025
6555304
Update results with same prompts
daavoo Feb 4, 2025
0ab4688
Update with same prompt
daavoo Feb 4, 2025
5276d16
Update results
daavoo Feb 4, 2025
476bbe1
Bring back llama-cpp-python
daavoo Feb 5, 2025
fdafdc3
Update prompts
daavoo Feb 5, 2025
2ac1f61
Reduce notebook size
daavoo Feb 5, 2025
c99adb0
Update pre-commit
daavoo Feb 5, 2025
a114fe5
Update docstrings
daavoo Feb 5, 2025
df394cc
Merge branch 'main' into 5-add-benchmark
daavoo Feb 5, 2025
eec44b0
Update test
daavoo Feb 5, 2025
Extend structured_qa. Add perfect_context.
daavoo committed Jan 27, 2025
commit bfdacea8db390d9b3ab240676ea3b9e51276cd87
49 changes: 49 additions & 0 deletions benchmark/perfect_context/1 INTRODUCTION.txt
@@ -0,0 +1,49 @@
Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as in the original model. As larger models are trained every few months, this changes from a mere “inconvenience” for GPT-2 (Radford et al., b) or RoBERTa large (Liu et al., 2019) to a critical deployment challenge for GPT-3 (Brown et al., 2020) with 175 billion trainable parameters.
Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed. However, existing techniques often introduce inference latency (Houlsby et al., 2019; Rebuffi et al., 2017) by extending model depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020), which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3 175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) suffices even when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efficient.
LoRA possesses several key advantages.
• A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B in Figure 1, reducing the storage requirement and task-switching overhead significantly.
• LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, we only optimize the injected, much smaller low-rank matrices.
• Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
• LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning. We provide an example in Appendix E.
Terminologies and Conventions We make frequent references to the Transformer architecture and use the conventional terminologies for its dimensions. We call the input and output dimension size of a Transformer layer d_model. We use W_q, W_k, W_v, and W_o to refer to the query/key/value/output projection matrices in the self-attention module. W or W_0 refers to a pre-trained weight matrix and ∆W its accumulated gradient update during adaptation. We use r to denote the rank of a LoRA module. We follow the conventions set out by (Vaswani et al., 2017; Brown et al., 2020) and use Adam (Loshchilov & Hutter, 2019; Kingma & Ba, 2017) for model optimization and use a Transformer MLP feedforward dimension d_ffn = 4 × d_model.
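
To make the low-rank update concrete, below is a minimal PyTorch sketch (not the paper's reference implementation) of a LoRA-adapted linear layer: the pre-trained weight W_0 stays frozen while only the rank-r factors B and A are trained, and B A can be folded back into W_0 for inference. The class name, initialization scales and the alpha/r scaling are illustrative assumptions.

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (illustrative, not the reference code)."""

    def __init__(self, in_features: int, out_features: int, r: int = 2, alpha: float = 1.0):
        super().__init__()
        # Frozen pre-trained weight W0: gradients are never computed for it.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.normal_(self.weight, std=0.02)
        # Trainable low-rank factors: delta_W = B @ A with rank r much smaller than d.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: delta_W starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W0 x + scaling * (B A) x: only A and B receive gradients during adaptation.
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

    def merge(self) -> torch.Tensor:
        # For deployment the update can be merged into the frozen weight,
        # so inference uses a single matmul and adds no latency.
        return self.weight + self.scaling * self.lora_B @ self.lora_A


layer = LoRALinear(in_features=16, out_features=16, r=2)
x = torch.randn(4, 16)
print(layer(x).shape)       # torch.Size([4, 16])
print(layer.merge().shape)  # torch.Size([16, 16])

Switching tasks then amounts to swapping the small (A, B) pair while the shared frozen weight stays in memory, which is the storage and task-switching advantage listed above.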
16 changes: 16 additions & 0 deletions benchmark/perfect_context/1.2.1. Internal partitions and doors.txt
@@ -0,0 +1,16 @@
Fire resistance of partitions:
- Copy rooms must have vertical partitions with a fire resistance of EI 30 and doors must have
a fire resistance of EI1 30 and close automatically (linked to the fire detection system).
Door retaining devices:
- Certain fire doors for rooms which are accessed or traversed very frequently are kept open
using magnetic retainers linked to the fire detection system (e.g. entrance halls and lift
lobbies, corridor compartment doors, kitchenette doors and doors of copy rooms).
- As a minimum, rooms accommodating kitchenettes must have doors which close
automatically (linked to the fire detection system).
Door closers:
- In addition to the requirements set out in the applicable legislation, access doors to
toilets/washrooms, kitchenettes, copy rooms, etc. must also be fitted with door closers.
Horizontal communication between two buildings:
- In the case of doors forming an airlock between two buildings, an intermittent red light
signal should be placed above or beside the door frames. This signal should light up on the
non-dangerous side to indicate the danger when the alarm is raised.
22 changes: 22 additions & 0 deletions benchmark/perfect_context/15.3. API Fundamentals.txt
@@ -0,0 +1,22 @@
Graph memory nodes are graph nodes representing either memory allocation or free actions. As a shorthand, nodes that allocate memory are called allocation nodes. Likewise, nodes that free memory are called free nodes. Allocations created by allocation nodes are called graph allocations. CUDA assigns virtual addresses for the graph allocation at node creation time. While these virtual addresses are fixed for the lifetime of the allocation node, the allocation contents are not persistent past the freeing operation and may be overwritten by accesses referring to a different allocation.
Graph allocations are considered recreated every time a graph runs. A graph allocation’s lifetime, which differs from the node’s lifetime, begins when GPU execution reaches the allocating graph node and ends when one of the following occurs:
▶ GPU execution reaches the freeing graph node
▶ GPU execution reaches the freeing cudaFreeAsync() stream call
▶ immediately upon the freeing call to cudaFree()
Note: Graph destruction does not automatically free any live graph-allocated memory, even though it ends the lifetime of the allocation node. The allocation must subsequently be freed in another graph, or using cudaFreeAsync()/cudaFree().
Just like other Graph Structure, graph memory nodes are ordered within a graph by dependency edges. A program must guarantee that operations accessing graph memory:
▶ are ordered after the allocation node
▶ are ordered before the operation freeing the memory
Graph allocation lifetimes begin and usually end according to GPU execution (as opposed to API invocation). GPU ordering is the order that work runs on the GPU as opposed to the order that the work is enqueued or described. Thus, graph allocations are considered ‘GPU ordered’.
8 changes: 8 additions & 0 deletions benchmark/perfect_context/2.1. Toilets.txt
@@ -0,0 +1,8 @@
Toilets must be installed on each level containing office rooms and for each structural unit; they must be distributed uniformly and located in a central area. Sinks must be supplied exclusively with cold water.
Accessibility for persons with reduced mobility (PRM)
In the event that a new office building is constructed upon request by the Commission, one toilet which is accessible for persons with reduced mobility must be installed on each level containing office rooms or similar. In other cases, the requirements of the applicable legislation must be observed.
59 changes: 59 additions & 0 deletions benchmark/perfect_context/2.4 Recurrent Networks.txt
@@ -0,0 +1,59 @@
As recurrent neural networks (RNNs) can be unrolled to
feed-forward representation, RNNs can also be equivalently
represented as decision trees. We study following recurrent
neural network. Note that we simply omit the bias terms as
they can be represented by concatenating a 1 value to input
vectors.
h(t) = σ(WT h(t−1) + UT x(t))
o(t) = VT h(t) (12)
Similar to previous analysis, one can rewrite h(t) as fol-
lows.
h(t) = a(t) (WT h(t−1) + UT x(t)) (13)
Eq. 13 can be rewritten follows.
h(t) = a(t) (
1∏
j=(t−1)
(WT a(j)))WT h(0)
+a(t)
t∑
i=1
(
i∏
j=(t−1)
(WT a(j)))UT x(i)
(14)
Note that in Eq. 14, the product operator stands for ma-
trix multiplication, its steps are −1 and we consider the out-
put of product operator to be 1 when i = t. One can rewrite
Eq. 14 by introducing cj ˆWj as follows.
h(t) = a(t) c1 ˆW1WT h(0) + a(t)
t∑
i=1
ci ˆWiUT x(i)
ci ˆWT
i =
i∏
j=(t−1)
(WT a(j)
Combining Eq. 15 and Eq. 12, one can write o(t) as
follows.
o(t) = a(t) ˆVT
c1 ˆW1WT h(0) +a(t) ˆVT t∑
i=1
ci ˆWiUT x(i) (16)
Eq. 16 can be further simplified to the following.
o(t) = c1 ˆZT
1 WT h(0) +
t∑
i=1
ci ˆZiUT x(i) (17)
In Eq. 17, ci ˆZT
i = a(t) ˆVT
ci ˆWi .As one can observe from
Eq. 17, the RNN output only depends on the categoriza-
tion vector ci, which enables the tree equivalence -similar
to previous analysis.
Note that for RNNs, a popular choice for σ in Eq. 12
is tanh. As mentioned in Section 2.3, in order to provide
finite trees, one might consider using a piece-wise linear
approximation of tanh.
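
As a hedged illustration of Eqs. 13-15 (assuming a leaky-ReLU in place of tanh so that σ is piece-wise linear), the NumPy sketch below rolls out a small, randomly initialized RNN, records the slope vectors a(t), and then rebuilds h(T) from the effective linear maps implied by that activation pattern. The dimensions and weights are arbitrary assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 3, 2, 4                    # hidden size, input size, sequence length (arbitrary)
W = rng.normal(size=(d_h, d_h))          # recurrent weights
U = rng.normal(size=(d_x, d_h))          # input weights
x = [None] + [rng.normal(size=d_x) for _ in range(T)]   # 1-based x(1..T)
h0 = rng.normal(size=d_h)
slope = 0.3                              # leaky-ReLU negative slope (piece-wise linear sigma)

# Non-linear rollout, recording the slope vector a(t) of Eq. 13 at every step.
h, a = h0.copy(), [None]
for t in range(1, T + 1):
    pre = W.T @ h + U.T @ x[t]
    a.append(np.where(pre > 0, 1.0, slope))   # activation pattern = one branch of the tree
    h = a[t] * pre
rollout = h

# With the pattern fixed, h(T) is linear in h(0) and x(1..T) (Eqs. 14-15):
# h(T) = M @ h(0) + sum_i N_i @ x(i), where M and N_i are products of diag(a(t)), W^T and U^T.
M = np.eye(d_h)
N = [None] * (T + 1)
for t in range(1, T + 1):
    A_t = np.diag(a[t])
    M = A_t @ W.T @ M
    for i in range(1, t):
        N[i] = A_t @ W.T @ N[i]
    N[t] = A_t @ U.T
reconstructed = M @ h0 + sum(N[i] @ x[i] for i in range(1, T + 1))

print(np.allclose(rollout, reconstructed))   # True: the output depends only on the pattern sequence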
21 changes: 21 additions & 0 deletions benchmark/perfect_context/23.1. What is Lazy Loading?.txt
@@ -0,0 +1,21 @@
Lazy Loading delays loading of CUDA modules and kernels from program initialization to closer to kernel execution. If a program does not use every single kernel it has included, then some kernels will be loaded unnecessarily. This is very common, especially if you include any libraries. Most of the time, programs only use a small number of the kernels from the libraries they include.
Thanks to Lazy Loading, programs are able to load only the kernels they are actually going to use, saving time on initialization. This reduces memory overhead, both on GPU memory and host memory.
Lazy Loading is enabled by setting the CUDA_MODULE_LOADING environment variable to LAZY.
Firstly, the CUDA Runtime will no longer load all modules during program initialization, with the exception of modules containing managed variables. Each module will be loaded on first usage of a variable or a kernel from that module. This optimization is only relevant to CUDA Runtime users; CUDA Driver users who use cuModuleLoad are unaffected. This optimization shipped in CUDA 11.8. The behavior for CUDA Driver users who use cuLibraryLoad to load module data into memory can be changed by setting the CUDA_MODULE_DATA_LOADING environment variable.
Secondly, loading a module (the cuModuleLoad*() family of functions) will not load kernels immediately; instead it will delay loading of a kernel until cuModuleGetFunction() is called. There are certain exceptions here: some kernels have to be loaded during cuModuleLoad*(), such as kernels whose pointers are stored in global variables. This optimization is relevant to both CUDA Runtime and CUDA Driver users. CUDA Runtime will only call cuModuleGetFunction() when a kernel is used/referenced for the first time. This optimization shipped in CUDA 11.7.
Both of these optimizations are designed to be invisible to the user, assuming the CUDA Programming Model is followed.
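
In a Python process the same switch can be flipped from code, as long as it happens before CUDA is initialized; the snippet below is a small sketch under that assumption, with torch standing in for any CUDA-backed library (the import is illustrative, not something the source prescribes).

import os

# CUDA_MODULE_LOADING must be set before the CUDA Runtime/Driver is initialized
# in this process, i.e. before importing or initializing any CUDA-backed library.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch  # illustrative stand-in for a CUDA-backed library

if torch.cuda.is_available():
    torch.zeros(1, device="cuda")  # modules/kernels now load on first use instead of up front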
6 changes: 6 additions & 0 deletions benchmark/perfect_context/3 Arithmetic Reasoning.txt
@@ -0,0 +1,6 @@
We begin by considering math word problems of the form in Figure 1, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chain-of-thought prompting when used with the 540B parameter language model performs comparably with task-specific finetuned models on several tasks, even achieving new state of the art on the challenging GSM8K benchmark (Cobbe et al., 2021).
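
For concreteness, a chain-of-thought prompt simply prepends worked examples whose answers spell out the intermediate reasoning before the final answer; the short Python sketch below builds such a prompt. The exemplar wording is invented here for illustration and is not taken from the paper's prompts or from GSM8K.

# A few-shot exemplar that shows its reasoning before stating the final answer.
cot_exemplar = (
    "Q: A box holds 4 rows of 6 apples. 5 apples are removed. How many apples remain?\n"
    "A: There are 4 x 6 = 24 apples. Removing 5 leaves 24 - 5 = 19. The answer is 19.\n\n"
)
new_question = "Q: A baker made 24 muffins and sold 9. How many muffins are left?\nA:"
prompt = cot_exemplar + new_question   # send `prompt` to the language model
print(prompt)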
91 changes: 91 additions & 0 deletions benchmark/perfect_context/3 Experimental Results.txt
@@ -0,0 +1,91 @@
First, we make a toy experiment where we fit a neural network to the y = x^2 equation. The neural network has 3 dense layers with 2 filters each, except for the last layer which has 1 filter. The network uses leaky-ReLU activations after fully connected layers, except for the last layer which has no post-activation. We have used a negative slope of 0.3 for leaky-ReLU, which is the default value in Tensorflow [1]. The network was trained with 5000 (x, y) pairs where x was regularly sampled from the [−2.5, 2.5] interval. Fig. 2 shows the decision tree corresponding to the neural network. In the tree, every black rectangle box indicates a rule, the left child from the box means the rule does not hold, and the right child means the rule holds. For better visualization, the rules are obtained via converting w^T x + β > 0 to direct inequalities acting on x. This can be done for the particular regression y = x^2, since x is a scalar. In every leaf, the network applies a linear function -indicated by a red rectangle- based on the decisions so far. We have avoided writing these functions explicitly due to limited space. At first glance, the tree representation of a neural network in this example seems large due to the 2^(∑_i^{n−2} m_i) = 2^4 = 16 categorizations. However, we notice that a lot of the rules in the decision tree are redundant, and hence some paths in the decision tree become invalid. An example of a redundant rule is checking x < 0.32 after the x < −1.16 rule holds. This directly creates the invalid left child for this node. Hence, the tree can be cleaned via removing the left child in this case, and merging the categorization rule to the stricter one: x < −1.16 in the particular case. Via cleaning the decision tree in Fig. 2, we obtain the simpler tree in Fig. 3a, which only consists of 5 categories instead of 16. The 5 categories are directly visible also from the model response in Fig. 3b. The interpretation of the neural network is thus straightforward: for each region whose boundaries are determined via the decision tree representation, the network approximates the non-linear y = x^2 equation by a linear equation. One can clearly interpret and moreover make deductions from the decision tree, some of which are as follows. The neural network is unable to grasp the symmetrical nature of the regression problem, which is evident from the fact that the decision boundaries are asymmetrical. The region below −1.16 and above 1 is unbounded, and thus neural decisions lose accuracy as x goes beyond these boundaries.

              y = x^2                      Half-Moon
        Param.  Comp.  Mult./Add.    Param.  Comp.  Mult./Add.
Tree      14     2.6       2           39     4.1      8.2
NN        13     4        16           15     5       25
Table 1. Computation and memory analysis of toy problems

Next, we investigate another toy problem of classifying half-moons and analyse the decision tree produced by a neural network. We train a fully connected neural network with 3 layers with leaky-ReLU activations, except for the last layer which has sigmoid activation. Each layer has 2 filters except for the last layer, which has 1. The cleaned decision tree induced by the trained network is shown in Fig. 4. The decision tree finds many categories whose boundaries are determined by the rules in the tree, where each category is assigned a single class. In order to better visualize the categories, we illustrate them with different colors in Fig. 5. One can make several deductions from the decision tree, such as: some regions are very well-defined and bounded, and the classifications they make are perfectly in line with the training data, thus these regions are very reliable. There are unbounded categories which help obtaining accurate classification boundaries, yet fail to provide a compact representation of the training data; these may correspond to inaccurate extrapolations made by neural decisions. There are also some categories that emerged although none of the training data falls into them.
Besides the interpretability aspect, the decision tree representation also provides some computational advantages. In Table 1, we compare the number of parameters, floating-point comparisons and multiplication or addition operations of the neural network and the tree induced by it. Note that the comparisons, multiplications and additions in the tree representation are given as expected values, since the depth of the tree is different for each category. As the induced tree is an unfolding of the neural network, it covers all possible routes and keeps all possible effective filters in memory. Thus, as expected, the number of parameters in the tree representation of a neural network is larger than that of the network. In the induced tree, in every layer i, a maximum of m_i filters are applied directly on the input, whereas in the neural network always m_i filters are applied on the previous feature, which is usually much larger than the input in the feature dimension. Thus, computation-wise, the tree representation is advantageous compared to the neural network one.
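
A rough sketch of the mechanism behind these toy experiments, using an untrained, randomly initialized 1-2-2-1 leaky-ReLU network rather than the trained models and figures reported above (the weights, seed and sampling grid are assumptions): sampling x over [−2.5, 2.5] and hashing the activation pattern shows how few of the 2^4 = 16 possible categories actually occur, and that the network is exactly linear inside each one.

import numpy as np

rng = np.random.default_rng(1)
slope = 0.3  # leaky-ReLU negative slope, as in the experiment

# Untrained 1 -> 2 -> 2 -> 1 dense network (random weights; same shape as the toy model).
Ws = [rng.normal(size=(1, 2)), rng.normal(size=(2, 2)), rng.normal(size=(2, 1))]
bs = [rng.normal(size=2), rng.normal(size=2), rng.normal(size=1)]

def forward_with_pattern(x):
    """Return the scalar output and the activation pattern (the tree category) for input x."""
    h, pattern = np.array([x], dtype=float), []
    for i, (W, b) in enumerate(zip(Ws, bs)):
        pre = h @ W + b
        if i < len(Ws) - 1:                          # the last layer has no post-activation
            pattern.append(tuple(bool(v) for v in pre > 0))
            h = np.where(pre > 0, 1.0, slope) * pre
        else:
            h = pre
    return float(h[0]), tuple(pattern)

xs = np.linspace(-2.5, 2.5, 2001)
outs, cats = zip(*(forward_with_pattern(x) for x in xs))
regions = sorted(set(cats))
print(f"{len(regions)} of the 2**4 = 16 possible categories occur on [-2.5, 2.5]")

# Inside one region the network is exactly linear in x: second differences vanish.
outs, labels = np.array(outs), np.array([regions.index(c) for c in cats])
for r in range(len(regions)):
    idx = np.where(labels == r)[0]
    if len(idx) > 4 and np.all(np.diff(idx) == 1):   # contiguous samples from a single region
        print(r, bool(np.allclose(np.diff(outs[idx], n=2), 0.0, atol=1e-8)))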
9 changes: 9 additions & 0 deletions benchmark/perfect_context/3 Model Architecture.txt
@@ -0,0 +1,9 @@
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
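
As a small sketch of the auto-regressive loop described above (the step_fn interface and the random stand-in scores are assumptions, not the paper's model), each step feeds every previously generated symbol back in and picks the next one greedily:

import torch

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """Generic auto-regressive decoding: each step consumes all symbols generated so far."""
    ys = [bos_id]
    for _ in range(max_len):
        scores = step_fn(ys)             # scores over the next symbol, given the prefix
        next_id = int(torch.argmax(scores))
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys

# Stand-in "decoder": a fixed random score table keyed on the last symbol, just to run the loop.
torch.manual_seed(0)
table = torch.randn(10, 10)
print(greedy_decode(lambda ys: table[ys[-1]], bos_id=0, eos_id=1))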
15 changes: 15 additions & 0 deletions benchmark/perfect_context/3.1 Encoder and Decoder Stacks.txt
@@ -0,0 +1,15 @@
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
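
A minimal PyTorch sketch of one encoder layer built around LayerNorm(x + Sublayer(x)) with d_model = 512. The head count of 8, the inner feed-forward width of 2048, the ReLU and the use of nn.MultiheadAttention are assumptions made here for illustration (they are not specified in this excerpt); a decoder layer would add a third, encoder-attending sub-layer plus a causal mask on its self-attention.

import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and position-wise FFN, each wrapped as LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)   # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)            # residual connection, then layer normalization
        x = self.norm2(x + self.ffn(x))         # position-wise feed-forward sub-layer
        return x


encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # N = 6 identical layers
tokens = torch.randn(2, 10, 512)                               # (batch, sequence, d_model)
print(encoder(tokens).shape)                                   # torch.Size([2, 10, 512])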