diff --git a/README.md b/README.md index 171f5406..7d298284 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,10 @@ # Algorithmica v3 -Algorithmica is a free and open web book about Computer Science. +Algorithmica is an open-access web book dedicated to the art and science of computing. -If you are concerned with editing, please read the [contributing guide](https://ru.algorithmica.org/contributing/) (in Russian). +You can contribute via [Prose](https://prose.io/) by clicking on the pencil icon on the top right on any page or by editing its source directly on GitHub. We use a slightly different Markdown dialect, so if you are not sure that the change is correct (for example, editing an intricate LaTeX formula), you can install [Hugo](https://gohugo.io/) and build the site locally — or just create a pull request, and a preview link will be automatically generated for you. + +If you happen to speak Russian, please also read the [contributing guidelines](https://ru.algorithmica.org/contributing/). --- @@ -16,11 +18,11 @@ Key technical changes from the [previous version](https://github.com/algorithmic * Rich metadata support (language, sections, TOCs, authors...) * Automated global table of contents * Theming support +* Search support (Lunr) Short-term todo list: -* Search with lunr -* Themes (especially a better dark theme) -* Minor style adjustments for mobile and print versions +* Style adjustments for mobile and print versions * A pdf version of the whole website +* Meta-information support (for Google Scholar and social media) * [Sticky table of contents](https://css-tricks.com/table-of-contents-with-intersectionobserver/) diff --git a/assets/slides.sass b/assets/slides.sass index e69de29b..671ababe 100644 --- a/assets/slides.sass +++ b/assets/slides.sass @@ -0,0 +1,50 @@ +$font-text: 'Source Sans', serif !default +$font-code: 'Inconsolata', monospace !default +$font-headings: 'Garamond', serif !default + +$borders: 1px solid #eaecef !default + +/* fonts */ +@font-face + font-family: 'CMU' + src: url(fonts/cmu.woff2) + +@font-face + font-family: 'Merriweather' + src: url(fonts/merriweather.woff2) + +@font-face + font-family: 'Inconsolata' + src: url(fonts/inconsolata.woff2) + +@font-face + font-family: 'Garamond' + src: url(fonts/garamond.woff2) + +@font-face + font-family: "Open Sans" + src: url(fonts/opensans.woff2) + +@font-face + font-family: "Source Sans" + src: url(fonts/sourcesans.ttf) + +@font-face + font-family: "Crimson" + src: url(fonts/crimson.ttf) + +body + font-family: $font-text + font-size: 24px + +h1 + font-size: 2em + text-align: center + margin-top: 0 + margin-bottom: 20px + +h2 + font-size: 1.5em + +h3 + font-size: 1.25em diff --git a/config.yaml b/config.yaml index 7e4ca1b7..1f196de4 100644 --- a/config.yaml +++ b/config.yaml @@ -8,6 +8,15 @@ outputFormats: baseName: index mediaType: text/html isHTML: true + SearchIndex: + mediaType: "application/json" + baseName: "searchindex" + isPlainText: true + notAlternative: true +outputs: + home: + - HTML + - SearchIndex markup: goldmark: footnote: false # katex conflict @@ -33,8 +42,8 @@ languages: params: repo: "https://github.com/algorithmica-org/algorithmica" reveal_hugo: - theme: white + #theme: white slide_number: true transition: none - #custom_theme: "slides.sass" - #custom_theme_compile: true + custom_theme: "slides.sass" + custom_theme_compile: true diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md index 5bb1fe60..9b6aa606 100644 --- a/content/english/hpc/_index.md 
+++ b/content/english/hpc/_index.md @@ -33,17 +33,17 @@ A "release" for an open-source book like this essentially means: - mostly freezing the table of contents (except for the case studies), - doing one final round of heavy copyediting (hopefully, with the help of a professional editor — I still haven’t figured out how commas work in English), - drawing illustrations (I stole a lot of those that are currently displayed), -- making a print-optimized pdf and figuring out the best way to distribute it. +- making a print-optimized PDF and figuring out the best way to distribute it. After that, I will mostly be fixing errors and only doing some minor edits reflecting the changes in technology or new algorithm advancements. The e-book/printed editions will most likely be sold on a "pay what you want" basis, and in any case, the web version will always be fully available online. **Pre-ordering / financially supporting the book.** Due to my unfortunate citizenship and place of birth, you can't — that is, until I find a way that at the same time complies with international sanctions, doesn't sponsor [the war](https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine), and won't put me in prison for tax evasion. -So, don't bother. If you want to support this book, just share the articles you like on link aggregators and social media and help fix typos — that would be enough. +So, don't bother. If you want to support this book, just share it and help fix typos — that would be enough. **Translations.** The website has a separate functionality for creating and managing translations — and I've already been contacted by some nice people willing to translate the book into Italian and Chinese (and I will personally translate at least some of it into my native Russian). -However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when a centralized translation process starts. +However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when centralized translation starts. **"Translating" the Russian version.** The articles hosted at [ru.algorithmica.org/cs/](https://ru.algorithmica.org/cs/) are not about advanced performance engineering but mostly about classical computer science algorithms — without discussing how to speed them up beyond asymptotic complexity. Most of the information there is not unique and already exists in English on some other places on the internet: for example, the similar-spirited [cp-algorithms.com](https://cp-algorithms.com/). @@ -51,7 +51,7 @@ However, as the book is still evolving, it is probably not the best idea to star There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, but [one of them](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming) is 50 years old, and [the other](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) is 30 years old, and [computers have changed a lot](/hpc/complexity/hardware) since then. Asymptotic complexity is not the sole deciding factor anymore. 
In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs. -And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat the hardware like something from the 90s. +And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat modern hardware like something from the 1990s. What I really want to achieve is that performance engineering becomes taught right after introduction to algorithms. Writing the first comprehensive textbook on the subject is a large part of it, and this is why I rush to finish it by the summer so that the colleges can pick it up in the next academic year. But creating a new course requires more than that: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools for *teaching* performance engineering — and I'm looking forward to collaborating with other people who want to make it a reality as well. @@ -76,7 +76,7 @@ Competitive programming is, in my opinion, misguided. They are doing useless thi The first part covers the basics of computer architecture and optimization of single-threaded algorithms. -It walks through the main CPU optimization topics such as caching, SIMD and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure. +It walks through the main CPU optimization topics such as caching, SIMD, and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure. Planned table of contents: @@ -94,7 +94,7 @@ Planned table of contents: 1.4. Functions and Recursion 1.5. Indirect Branching 1.6. Machine Code Layout - 1.7. Interrupts and System Calls + 1.7. System Calls 1.8. Virtualization 3. Instruction-Level Parallelism 3.1. Pipeline Hazards @@ -163,11 +163,11 @@ Planned table of contents: 9.11. AoS and SoA 10. SIMD Parallelism 10.1. Intrinsics and Vector Types - 10.2. Loading and Writing Data - 10.3. Sums and Other Reductions + 10.2. Moving Data + 10.3. Reductions 10.4. Masking and Blending 10.5. In-Register Shuffles - 10.6. Auto-Vectorization + 10.6. Auto-Vectorization and SPMD 11. 
Algorithm Case Studies 11.1. Binary GCD (11.2. Prime Number Sieves) @@ -178,20 +178,22 @@ Planned table of contents: 11.7. Number-Theoretic Transform 11.8. Argmin with SIMD 11.9. Prefix Sum with SIMD - 11.10. Reading and Writing Integers -(11.11. Reading and Writing Floats) -(11.12. String Searching) - 11.13. Sorting - 11.14. Matrix Multiplication + 11.10. Reading Decimal Integers + 11.11. Writing Decimal Integers +(11.12. Reading and Writing Floats) +(11.13. String Searching) + 11.14. Sorting + 11.15. Matrix Multiplication 12. Data Structure Case Studies 12.1. Binary Search 12.2. Static B-Trees - 12.3. Segment Trees -(12.4. Search Trees) -(12.5. Range Minimum Query) - 12.6. Hash Tables -(12.7. Bitmaps) -(12.8. Probabilistic Filters) +(12.3. Search Trees) + 12.4. Segment Trees +(12.5. Tries) +(12.6. Range Minimum Query) + 12.7. Hash Tables +(12.8. Bitmaps) +(12.9. Probabilistic Filters) ``` Among the cool things that we will speed up: @@ -201,18 +203,47 @@ Among the cool things that we will speed up: - 5-10x faster segment trees (compared to Fenwick trees) - 5x faster hash tables (compared to `std::unordered_map`) - 2x faster popcount (compared to repeatedly calling `popcnt`) -- 2x faster parsing series of integers (compared to `scanf`) +- 35x faster parsing series of integers (compared to `scanf`) - ?x faster sorting (compared to `std::sort`) - 2x faster sum (compared to `std::accumulate`) - 2-3x faster prefix sum (compared to naive implementation) - 10x faster argmin (compared to naive implementation) - 10x faster array searching (compared to `std::find`) +- 15x faster search tree (compared to `std::set`) - 100x faster matrix multiplication (compared to "for-for-for") - optimal word-size integer factorization (~0.4ms per 60-bit integer) - optimal Karatsuba Algorithm - optimal FFT -This work is largely based on blog posts, research papers, conference talks and other work authored by a lot of people: +Volume: 450-600 pages +Release date: Q3 2022 + +### Part II: Parallel Algorithms + +Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting. + +Volume: 150-200 pages +Release date: 2023-2024? + +### Part III: Distributed Computing + + + +Networking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing. + +Release date: ??? (more likely to be completed than not) + +### Part IV: Software & Hardware + + + +LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators. + +Release date: ???
(less likely to be completed than not) + +### Acknowledgements + +The book is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people: - [Agner Fog](https://agner.org/optimize/) - [Daniel Lemire](https://lemire.me/en/#publications) @@ -236,35 +267,23 @@ This work is largely based on blog posts, research papers, conference talks and - [Geoff Langdale](https://branchfree.org/) - [Matt Kulukundis](https://twitter.com/JuvHarlequinKFM) - [Georg Sauthoff](https://gms.tf/) +- [Danila Kutenin](https://danlark.org/author/kutdanila/) +- [Ivica Bogosavljević](https://johnysswlab.com/author/ibogi/) +- [Matt Pharr](https://pharr.org/matt/) +- [Jan Wassenberg](https://research.google/people/JanWassenberg/) - [Marshall Lochbaum](https://mlochbaum.github.io/publications.html) +- [Pavel Zemtsov](https://pzemtsov.github.io/) +- [Gustavo Duarte](https://manybutfinite.com/) +- [Nyaan](https://nyaannyaan.github.io/library/) - [Nayuki](https://www.nayuki.io/category/programming) +- [Konstantin](http://const.me/) +- [InstLatX64](https://twitter.com/InstLatX64) - [ridiculous_fish](https://ridiculousfish.com/blog/) +- [Z boson](https://stackoverflow.com/users/2542702/z-boson) - [Creel](https://www.youtube.com/c/WhatsACreel) -Volume: 450-600 pages -Release date: Q2 2022 - -### Part II: Parallel Algorithms - -Concurrency, models of parallelism, green threads and concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking and graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication and sorting. - -Volume: 150-200 pages -Release date: 2023? - -### Part III: Distributed Computing - -Communication-constrained algorithms, message passing, actor model, partitioning, MapReduce, consistency and reliability at scale, storage, compression, scheduling and cloud computing, distributed deep learning. - -Release date: ??? (more likely to be completed than not) - -### Part IV: Compilers and Domain-Specific Architectures - -LLVM IR, compiler optimizations, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++ and oneAPI, XLA, Verilog, FPGAs, ASICs, TPUs and other AI accelerators. - -Release date: ??? (less likely to be completed than not) - ### Disclaimer: Technology Choices -The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles we aim to convey are not specific to them. +The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them. -To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust, LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed. +To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust / [Carbon?](https://github.com/carbon-language/carbon-lang), LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed. 
diff --git a/content/english/hpc/algorithms/argmin.md b/content/english/hpc/algorithms/argmin.md index ccd9f140..2089d083 100644 --- a/content/english/hpc/algorithms/argmin.md +++ b/content/english/hpc/algorithms/argmin.md @@ -3,7 +3,7 @@ title: Argmin with SIMD weight: 7 --- -Computing the *minimum* of an array [easily vectorizable](/hpc/simd/reduction), as it is not different from any other reduction: in AVX2, you just need to use a convenient `_mm256_min_epi32` intrinsic as the inner operation. It computes the minimum of two 8-element vectors in one cycle — even faster than in the scalar case, which requires at least a comparison and a conditional move. +Computing the *minimum* of an array is [easily vectorizable](/hpc/simd/reduction), as it is not different from any other reduction: in AVX2, you just need to use a convenient `_mm256_min_epi32` intrinsic as the inner operation. It computes the minimum of two 8-element vectors in one cycle — even faster than in the scalar case, which requires at least a comparison and a conditional move. Finding the *index* of that minimum element (*argmin*) is much harder, but it is still possible to vectorize very efficiently. In this section, we design an algorithm that computes the argmin (almost) at the speed of computing the minimum and ~15x faster than the naive scalar approach. @@ -164,7 +164,7 @@ int argmin(int *a, int n) { The compiler [optimized the machine code layout](/hpc/architecture/layout), and the CPU is now able to execute the loop at around 2 GFLOPS — a slight but sizeable improvement from 1.5 GFLOPS of the non-hinted loop. -Here is the idea: if we are only updating the minimum a dozen or so times during the entire computation, we can ditch all the vector-blending and index updating and just maintain the minimum and regularly check if it has changed. Inside this check, we can use however slow method of updating the argmin we want because it will only be called a few times. +Here is the idea: if we are only updating the minimum a dozen or so times during the entire computation, we can ditch all the vector-blending and index updating and just maintain the minimum and regularly check if it has changed. Inside this check, we can use whatever slow method of updating the argmin we want because it will only be called a few times. To implement it with SIMD, all we need to do on each iteration is a vector load, a comparison, and a test-if-zero: diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 4ff8061d..b900eb8c 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -1,48 +1,74 @@ --- title: Integer Factorization weight: 3 -draft: true +published: true --- -Integer factorization is interesting because of RSA problem. +The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs. -"How big are your numbers?" determines the method to use: +In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$.
Atypically for this book, in this one you may actually learn an asymptotically better algorithm: we start with a few basic approaches and gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms, which is ~3 times faster than the previous state-of-the-art. -- Less than 2^16 or so: Lookup table. -- Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm. -- Less than 10^50: Lenstra elliptic curve factorization -- Less than 10^100: Quadratic Sieve -- More than 10^100: General Number Field Sieve + +### Benchmark -and do other computations such as computing the greatest common multiple (given that it is not even so that ) (since $\gcd(n, r) = 1$) - -For all methods, we will implement `find_factor` function which returns one divisor ot 1. You can apply it recurively to get the factorization, so whatever asymptotic you had won't affect it: +For all methods, we will implement a `find_factor` function that takes a positive integer $n$ and returns any of its non-trivial divisors (or `1` if the number is prime): ```c++ -typedef uint32_t u32; -typedef uint64_t u64; +// I don't feel like typing "unsigned long long" each time +typedef __uint16_t u16; +typedef __uint32_t u32; +typedef __uint64_t u64; typedef __uint128_t u128; +u64 find_factor(u64 n); +``` + +To find the full factorization, you can apply it to $n$, reduce it, and continue until a new factor can no longer be found: + +```c++ vector<u64> factorize(u64 n) { - vector res; - while (int d = find_factor(n); d > 1) // does it work? - res.push_back(d); - return res; + vector<u64> factorization; + u64 d = find_factor(n); + while (d != 1) { + factorization.push_back(d); + n /= d; + d = find_factor(n); + } + if (n > 1) + factorization.push_back(n); + return factorization; } ``` -## Trial division +After each removed factor, the problem becomes considerably smaller, so the worst-case running time of full factorization is equal to the worst-case running time of a `find_factor` call. + +For many factorization algorithms, including those presented in this section, the running time scales with the smaller prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. We generate a $k$-bit semiprime as the product of two random $\lfloor k / 2 \rfloor$-bit primes. + +Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of false-negative errors (when `find_factor` returns `1` despite the number $n$ being composite), although this rate can be reduced to almost zero without significant performance penalties. + +### Trial division + + + +The most basic approach is to try every integer smaller than $n$ as a divisor: + +```c++ +u64 find_factor(u64 n) { + for (u64 d = 2; d < n; d++) + if (n % d == 0) + return d; + return 1; +} +``` -The smallest divisor has to be a prime number. -We remove the factor from the number, and repeat the process. -If we cannot find any divisor in the range $[2; \sqrt{n}]$, then the number itself has to be prime. +We can notice that if $n$ is divisible by $d < \sqrt n$, then it is also divisible by $\frac{n}{d} > \sqrt n$, and there is no need to check for it separately.
This lets us stop trial division early and only check for potential divisors that do not exceed $\sqrt n$: ```c++ u64 find_factor(u64 n) { @@ -53,13 +79,43 @@ u64 find_factor(u64 n) { } ``` +In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second — while taking a full 20 seconds to factorize a single 60-bit number. + +### Lookup Table + +Nowadays, you can type `factor 57` in your Linux terminal or Google search bar to get the factorization of any number. But before computers were invented, it was more practical to use *factorization tables:* special books containing factorizations of the first $N$ numbers. + +We can also use this approach to compute these lookup tables [during compile time](/hpc/compilation/precalc/). To save space, we can store only the smallest divisor of a number. Since the smallest divisor does not exceed $\sqrt n$, we need just one byte per 16-bit integer: + +```c++ +template <int N = (1 << 16)> +struct Precalc { + unsigned char divisor[N]; + + constexpr Precalc() : divisor{} { + for (int i = 0; i < N; i++) + divisor[i] = 1; + for (int i = 2; i * i < N; i++) + if (divisor[i] == 1) + for (int k = i * i; k < N; k += i) + divisor[k] = i; + } +}; + +constexpr Precalc<> P{}; + +u64 find_factor(u64 n) { + return P.divisor[n]; +} +``` + +With this approach, we can process 3M 16-bit integers per second, although it would probably [get slower](/hpc/cpu-cache/bandwidth/) for larger numbers. While it requires just a few milliseconds and 64KB of memory to calculate and store the divisors of the first $2^{16}$ numbers, it does not scale well for larger inputs. + ### Wheel factorization -This is an optimization of the trial division. -The idea is the following. -Once we know that the number is not divisible by 2, we don't need to check every other even number. -This leaves us with only $50\%$ of the numbers to check. -After checking 2, we can simply start with 3 and skip every other number. +To save paper space, pre-computer era factorization tables typically excluded numbers divisible by $2$ and $5$, making the factorization table ½ × ⅘ = 0.4 of its original size. In the decimal numeral system, you can quickly determine whether a number is divisible by $2$ or $5$ (by looking at its last digit) and keep dividing the number $n$ by $2$ or $5$ while it is possible, eventually arriving at some entry in the factorization table. + +We can apply a similar trick to trial division by first checking if the number is divisible by $2$ and then only considering odd divisors: ```c++ u64 find_factor(u64 n) { @@ -72,24 +128,29 @@ u64 find_factor(u64 n) { } ``` -This method can be extended. -If the number is not divisible by 3, we can also ignore all other multiples of 3 in the future computations. -So we only need to check the numbers $5, 7, 11, 13, 17, 19, 23, \dots$. -We can observe a pattern of these remaining numbers. -We need to check all numbers with $d \bmod 6 = 1$ and $d \bmod 6 = 5$. -So this leaves us with only $33.3\%$ percent of the numbers to check. -We can implement this by checking the primes 2 and 3 first, and then start checking with 5 and alternatively skip 1 or 3 numbers. +With 50% fewer divisions to perform, this algorithm works twice as fast. + +This method can be extended: if the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors.
The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them as they follow an irregular pattern — unless the number of primes is small. + +For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check: + +```center +(1,) 7, 11, 13, 17, 19, 23, 29, +31, 37, 41, 43, 47, 49, 53, 59, +61, 67, 71, 73, 77, 79, 83, 89… +``` + +You can notice a pattern: the sequence repeats itself every $30$ numbers. This is not surprising since the remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ numbers with specific remainders out of every $30$, proportionally improving the performance: ```c++ u64 find_factor(u64 n) { for (u64 d : {2, 3, 5}) if (n % d == 0) return d; - u64 increments[] = {0, 4, 6, 10, 12, 16, 22, 24}; - u64 sum = 30; - for (u64 d = 7; d * d <= n; d += sum) { - for (u64 k = 0; k < 8; k++) { - u64 x = d + increments[k]; + u64 offsets[] = {0, 4, 6, 10, 12, 16, 22, 24}; + for (u64 d = 7; d * d <= n; d += 30) { + for (u64 offset : offsets) { + u64 x = d + offset; if (n % x == 0) return x; } @@ -98,98 +159,290 @@ u64 find_factor(u64 n) { } ``` -We can extend this even further. -Here is an implementation for the prime number 2, 3 and 5. -It's convenient to use an array to store how much we have to skip. +As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved further by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$ but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory. -### Lookup table +### Precomputed Primes -We will choose to store smallest factors of first $2^16$ — because this way they all fit in just one byte, so we are sort of saving on memory here. +If we keep increasing the number of primes in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we don't need this array of offsets but just the array of primes: ```c++ -template -struct Precalc { - char divisor[N]; +const int N = (1 << 16); - constexpr Precalc() : divisor{} { - for (int i = 0; i < N; i++) - divisor[i] = 1; - for (int i = 2; i * i < N; i++) - if (divisor[i] == 1) - for (int k = i * i; k < N; k += i) - divisor[k] = i; +struct Precalc { + u16 primes[6542]; // # of primes under N=2^16 + + constexpr Precalc() : primes{} { + bool marked[N] = {}; + int n_primes = 0; + + for (int i = 2; i < N; i++) { + if (!marked[i]) { + primes[n_primes++] = i; + for (int j = 2 * i; j < N; j += i) + marked[j] = true; + } + } } }; -constexpr Precalc precalc{}; +constexpr Precalc P{}; u64 find_factor(u64 n) { - return precalc.divisor[n]; + for (u16 p : P.primes) + if (n % p == 0) + return p; + return 1; } ``` +This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. + +Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but constant fraction of divisors. 
If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$. + +All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advance and allow for some additional precomputation. In our case, it is suitable to use [the Lemire division check](/hpc/arithmetic/division/#lemire-reduction): + +```c++ +// ...precomputation is the same as before, +// but we store the reciprocal instead of the prime number itself +u64 magic[6542]; +// for each prime i: +magic[n_primes++] = u64(-1) / i + 1; + +u64 find_factor(u64 n) { + for (u64 m : P.magic) + if (m * n < m) + return u64(-1) / m + 1; + return 1; +} +``` + +This makes the algorithm ~18x faster: we can now factorize **~350k** 30-bit numbers per second, which is actually the most efficient algorithm we have for this number range. While it can probably be optimized even further by performing these checks in parallel with [SIMD](/hpc/simd), we will stop there and try a different, asymptotically better approach. + ### Pollard's Rho Algorithm -The algorithm is probabilistic. This means that it may or may not work. You would also need to + + +Pollard's rho is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): + +> One only needs to draw $d = \Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability. + +The reasoning behind it is that each of the $d$ added element has a $\frac{d}{n}$ chance of colliding with some other element, implying that the expected number of collisions is $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, then this ratio grows to zero as $n \to \infty$, and to infinity otherwise. + +Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes. + +Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. In functional graphs, the "trajectory" of any element — the path we walk if we start from that element and keep following the edges — is a path that eventually loops around (because the set of vertices is limited, and at some point, we have to go to a vertex we have already visited). + +![The trajectory of an element resembles the greek letter ρ (rho), which is what the algorithm is named after](../img/rho.jpg) + +Consider a trajectory of some particular element $x_0$: + +$$ +x_0, \; f(x_0), \; f(f(x_0)), \; \ldots +$$ + +Let's make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$. -> В мультимножество нужно добавить $O(\sqrt{n})$ случайных чисел от 1 до $n$, чтобы какие-то два совпали. +**Lemma.** The expected length of the reduced sequence before it turns into a cycle is $O(\sqrt[4]{n})$. -## $\rho$-алгоритм Полларда +**Proof:** Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a "deterministically-random" function). 
The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numbers until we get a collision and thus enter a loop. -Если мы найдём цикл в такой последовательности — то есть такие $i$ и $j$, что $f^i(x_0) \equiv f^j(x_0) \pmod p$ — то мы сможем найти и какой-то делитель $n$, а именно $\gcd(|f^i(x_0) - f^j(x_0)|, n)$ — это число меньше $n$ и делится на $p$. +Since we don't know $p$, this mod-$p$ sequence is only imaginary, but if we find a cycle in it — that is, $i$ and $j$ such that $$ f^i(x_0) \equiv f^j(x_0) \pmod p $$ then we can also find $p$ itself as $$ p = \gcd(|f^i(x_0) - f^j(x_0)|, n) $$ +The algorithm itself just finds this cycle and $p$ using this GCD trick and Floyd's "[tortoise and hare](https://en.wikipedia.org/wiki/Cycle_detection#Floyd's_tortoise_and_hare)" algorithm: we maintain two pointers $i$ and $j = 2i$ and check that $$ \gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1 $$ which is equivalent to comparing $f^i(x_0)$ and $f^j(x_0)$ modulo $p$. Since $j$ (hare) is increasing at twice the rate of $i$ (tortoise), their difference is increasing by $1$ each iteration and eventually will become equal to (or a multiple of) the cycle length, with $i$ and $j$ pointing to the same elements. And as we proved half a page ago, reaching a cycle would only require $O(\sqrt[4]{n})$ iterations: ```c++ +u64 f(u64 x, u64 mod) { + return ((u128) x * x + 1) % mod; +} + +u64 diff(u64 a, u64 b) { + // a and b are unsigned and so is their difference, so we can't just call abs(a - b) + return a > b ? a - b : b - a; +} + +const u64 SEED = 42; + +u64 find_factor(u64 n) { + u64 x = SEED, y = SEED, g = 1; + while (g == 1) { + x = f(f(x, n), n); // advance x twice + y = f(y, n); // advance y once + g = gcd(diff(x, y), n); + } + return g; +} +``` + +While it processes only ~25k 30-bit integers per second — which is almost 15 times slower than checking each prime using a fast division trick — it dramatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second.
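For a quick sanity check, the Floyd-based `find_factor` above can be exercised on a known semiprime. The driver below is a hypothetical sketch, not part of the diff: it assumes the `u64` typedef, `f`, `diff`, `SEED`, and `find_factor` exactly as in the listing above, and it supplies the 64-bit `gcd` helper that the listing calls unqualified (which must be visible before it); the two primes are just an example.

```c++
#include <cstdio>

// hypothetical glue: the listing above calls an unqualified gcd(),
// so a 64-bit helper like this must be declared before it
u64 gcd(u64 a, u64 b) { return b == 0 ? a : gcd(b, a % b); }

int main() {
    u64 p = 1000000007, q = 998244353; // two well-known 30-bit primes
    u64 n = p * q;                     // a ~60-bit semiprime
    u64 d = find_factor(n);            // expected to return p or q
    printf("%llu = %llu * %llu\n", (unsigned long long) n,
           (unsigned long long) d, (unsigned long long) (n / d));
    return 0;
}
```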
+ +### Pollard-Brent Algorithm -Алгоритм по сути находит цикл в этой последовательности, используя для этого стандартный алгоритм («черепаха и заяц»): будем поддерживать два удаляющихся друг от друга указателя $i$ и $j$ ($i = 2j$) и проверять, что $f^i(x_0) \equiv f^j(x_0) \pmod p$, что эквивалентно проверке $\gcd(|f^i(x_0) - f^j(x_0)|, n) \not \in \{ 1, n \}$. +Floyd's cycle-finding algorithm has a problem in that it moves iterators more than necessary: at least half of the vertices are visited one additional time by the slower iterator. + +One way to solve it is to memorize the values $x_i$ that the faster iterator visits and, every two iterations, compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$. But it can also be done without extra memory using a different principle: the tortoise doesn't move on every iteration, but it gets reset to the value of the faster iterator when the iteration number becomes a power of two. This lets us save additional iterations while still using the same GCD trick to compare $x_i$ and $x_{2^{\lfloor \log_2 i \rfloor}}$ on each iteration: ```c++ -typedef long long ll; - -inline ll f(ll x) { return (x+1)*(x+1); } - -ll find_divisor(ll n, ll seed = 1) { - ll x = seed, y = seed; - ll divisor = 1; - while (divisor == 1 || divisor == n) { - // двигаем первый указатель на шаг - y = f(y) % n; - // а второй -- на два - x = f(f(x) % n) % n; - // пытаемся найти общий делитель - divisor = __gcd(abs(x-y), n); +u64 find_factor(u64 n) { + u64 x = SEED; + + for (int l = 256; l < (1 << 20); l *= 2) { + u64 y = x; + for (int i = 0; i < l; i++) { + x = f(x, n); + if (u64 g = gcd(diff(x, y), n); g != 1) + return g; + } } - return divisor; + + return 1; } ``` -Так как алгоритм рандомизированный, при полной реализации нужно учитывать разные детали. Например, что иногда делитель не находится (нужно запускать несколько раз), или что при попытке факторизовать простое число он будет работать за $O(\sqrt n)$ (нужно добавить отсечение по времени). +Note that we also set an upper limit on the number of iterations so that the algorithm finishes in a reasonable amount of time and returns `1` if $n$ turns out to be a prime. + +It actually does *not* improve performance and even makes the algorithm ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the time requirement of this algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it. + +Instead of [optimizing the GCD itself](../gcd), we will optimize the number of its invocations. We can use the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can group the GCD calculations into groups of $M = O(\log n)$ and remove the $\log n$ factor from the asymptotic: ```c++ +const int M = 1024; + +u64 find_factor(u64 n) { + u64 x = SEED; + + for (int l = M; l < (1 << 20); l *= 2) { + u64 y = x, p = 1; + for (int i = 0; i < l; i += M) { + for (int j = 0; j < M; j++) { + x = f(x, n); + p = (u128) p * diff(x, y) % n; + } + if (u64 g = gcd(p, n); g != 1) + return g; + } + } + + return 1; +} +``` + +Now it performs 425 factorizations per second, bottlenecked by the speed of modulo. + +### Optimizing the Modulo + +The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/). Since the modulo is constant, we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap: ```c++ +struct Montgomery { + u64 n, nr; + + Montgomery(u64 n) : n(n) { + nr = 1; + for (int i = 0; i < 6; i++) + nr *= 2 - n * nr; + } + + u64 reduce(u128 x) const { + u64 q = u64(x) * nr; + u64 m = ((u128) q * n) >> 64; + return (x >> 64) + n - m; + } + + u64 multiply(u64 x, u64 y) const { + return reduce((u128) x * y); + } +}; + +u64 f(u64 x, u64 a, const Montgomery &m) { + return m.multiply(x, x) + a; +} + +const int M = 1024; + +u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { + Montgomery m(n); + u64 x = x0; + + for (int l = M; l < (1 << 20); l *= 2) { + u64 y = x, p = 1; + for (int i = 0; i < l; i += M) { + for (int j = 0; j < M; j++) { + x = f(x, a, m); + p = m.multiply(p, diff(x, y)); + } + if (u64 g = gcd(p, n); g != 1) + return g; + } + } + + return 1; +} +``` + +This implementation can process around 3k 60-bit integers per second, which is ~3x faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath's `factor`](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html) / `cat semiprimes.txt | time factor` achieve. + +### Further Improvements + +**Optimizations.** There is still a lot of potential for optimization in our implementation of Pollard's algorithm: + +- We could probably use a better cycle-finding algorithm, exploiting the fact that the graph is random. For example, there is little chance that we enter the loop within the first few iterations (the length of the cycle and the path we walk before entering it should be equal in expectation since before we loop around, we choose the vertex of the path we've walked independently), so we may just advance the iterator for some time before starting the trials with the GCD trick. +- Our current approach is bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we are waiting for it to complete, we could perform more than just one trial using the previous values. +- If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (the reasoning is similar to the Birthday paradox; try to prove it yourself). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could concurrently run two or three of the same operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel. + +I would not be surprised to see another 3x improvement and throughput of ~10k/sec. If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/). -### Brent's Method +**Errors.** Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate for 60-bit integers, and it grows higher if the numbers are smaller. These errors come from three main sources: -Another idea is to accumulate the product and instead of calculating GCD on each step to calculate it every log n steps. +- A cycle simply not being found (the algorithm is inherently random, and there is no guarantee that it will be found).
In this case, we need to perform a primality test and optionally start again. +- The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one by one. +- Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kind of statements to deal with overflows. -This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it. +**Larger numbers.** These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general, the optimal approach should depend on the size of the numbers: -... +- Smaller than $2^{16}$: use a lookup table; +- Smaller than $2^{32}$: use a list of precomputed primes with a fast divisibility check; +- Smaller than $2^{64}$ or so: use Pollard's rho algorithm with Montgomery multiplication; +- Smaller than $10^{50}$: switch to [Lenstra elliptic curve factorization](https://en.wikipedia.org/wiki/Lenstra_elliptic-curve_factorization); +- Smaller than $10^{100}$: switch to [Quadratic Sieve](https://en.wikipedia.org/wiki/Quadratic_sieve); +- Larger than $10^{100}$: switch to [General Number Field Sieve](https://en.wikipedia.org/wiki/General_number_field_sieve). -## Further optimizations + -Существуют также [субэкспоненциальные](https://ru.wikipedia.org/wiki/%D0%A4%D0%B0%D0%BA%D1%82%D0%BE%D1%80%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D1%8F_%D1%86%D0%B5%D0%BB%D1%8B%D1%85_%D1%87%D0%B8%D1%81%D0%B5%D0%BB#%D0%A1%D1%83%D0%B1%D1%8D%D0%BA%D1%81%D0%BF%D0%BE%D0%BD%D0%B5%D0%BD%D1%86%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5_%D0%B0%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D1%8B), но не полиномиальные алгоритмы факторизации. Человечество [умеет](https://en.wikipedia.org/wiki/Integer_factorization_records) факторизовывать числа порядка $2^{200}$. +The last three approaches are very different from what we've been doing and require much more advanced number theory, and they deserve an article (or a full-length university course) of their own. diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md index 59e55f10..6a4f8ca7 100644 --- a/content/english/hpc/algorithms/gcd.md +++ b/content/english/hpc/algorithms/gcd.md @@ -14,7 +14,7 @@ $$ \gcd(a, b) = \max_{g: \; g|a \, \land \, g | b} g $$ -You probably already know this algorithm from a CS textbook, but let me briefly remind it anyway. It is based on the following formula, assuming that $a > b$: +You probably already know this algorithm from a CS textbook, but I will summarize it here. It is based on the following formula, assuming that $a > b$: $$ \gcd(a, b) = \begin{cases} @@ -135,7 +135,7 @@ int gcd(int a, int b) { Let's run it, and… it sucks. The difference in speed compared to `std::gcd` is indeed 2x, but on the other side of the equation. This is mainly because of all the branching needed to differentiate between the cases. Let's start optimizing. -First, let's replace all divisions by 2 with divisions by whichever highest power of 2 we can. We can do it efficiently with `__builtin_ctz`, the "count trailing zeros" instruction available on modern CPUs. 
Whenever we are supposed to divide by 2 in the original algorithm, we will call this function instead, which will give us the exact amount to right-shift the number by. Assuming that the we are dealing with large random numbers, this is expected to decrease the number of iterations by almost a factor 2, because $1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \ldots \to 2$. +First, let's replace all divisions by 2 with divisions by whichever highest power of 2 we can. We can do it efficiently with `__builtin_ctz`, the "count trailing zeros" instruction available on modern CPUs. Whenever we are supposed to divide by 2 in the original algorithm, we will call this function instead, which will give us the exact number of bits to right-shift the number by. Assuming that we are dealing with large random numbers, this is expected to decrease the number of iterations by almost a factor 2, because $1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \ldots \to 2$. Second, we can notice that condition 2 can now only be true once — in the very beginning — because every other identity leaves at least one of the numbers odd. Therefore we can handle this case just once in the beginning and not consider it in the main loop. @@ -186,7 +186,7 @@ loop: Let's draw the dependency graph of this loop: -@@ + +![](../img/gcd-dependency1.png) Modern processors can execute many instructions in parallel, essentially meaning that the true "cost" of this computation is roughly the sum of latencies on its critical path. In this case, it is the total latency of `diff`, `abs`, `ctz`, and `shift`. -We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a negative number divisible by $2^k$ still has $k$ zeros at the end. This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this: +We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a [negative number](/hpc/arithmetic/integer/#signed-integers) divisible by $2^k$ still has $k$ zeros at the end of its binary representation. This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this: -@@ + +![](../img/gcd-dependency2.png) Hopefully you will be less confused when you think about how the final code will be executed: @@ -248,9 +252,9 @@ int gcd(int a, int b) { } ``` -It runs in 91ns — which is good enough to leave it there. +It runs in 91ns, which is good enough to leave it there. -If somebody wants to try to shove off a few more nanoseconds by re-writing assembly by hand or trying a lookup table to save a few last iterations, please [let me know](http://sereja.me/). +If somebody wants to try to shave off a few more nanoseconds by rewriting the assembly by hand or trying a lookup table to save a few last iterations, please [let me know](http://sereja.me/).
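For reference, the general shape of the `ctz`-based binary GCD that this diff discusses looks roughly like the sketch below. This is not the article's final 91ns version — only a minimal, self-contained illustration of the identities and the `__builtin_ctz` trick, with an illustrative function name:

```c++
#include <algorithm> // std::swap

// A simplified sketch (not the tuned version from the article): strip the
// common power of two once, keep `a` odd, and use __builtin_ctz to skip
// all intermediate divisions by 2 at once.
int binary_gcd(int a, int b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int common = __builtin_ctz(a | b); // largest k such that 2^k divides both
    a >>= __builtin_ctz(a);            // make a odd
    do {
        b >>= __builtin_ctz(b);        // make b odd
        if (a > b)
            std::swap(a, b);
        b -= a;                        // difference of two odd numbers is even
    } while (b != 0);
    return a << common;
}
```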
### Acknowledgements

diff --git a/content/english/hpc/algorithms/img/column-major.jpg b/content/english/hpc/algorithms/img/column-major.jpg new file mode 100644 index 00000000..675d0b85 Binary files /dev/null and b/content/english/hpc/algorithms/img/column-major.jpg differ
diff --git a/content/english/hpc/algorithms/img/gcd-dependency1.png b/content/english/hpc/algorithms/img/gcd-dependency1.png new file mode 100644 index 00000000..4e58904c Binary files /dev/null and b/content/english/hpc/algorithms/img/gcd-dependency1.png differ
diff --git a/content/english/hpc/algorithms/img/gcd-dependency2.png b/content/english/hpc/algorithms/img/gcd-dependency2.png new file mode 100644 index 00000000..b045ada4 Binary files /dev/null and b/content/english/hpc/algorithms/img/gcd-dependency2.png differ
diff --git a/content/english/hpc/algorithms/img/mm-blas.svg b/content/english/hpc/algorithms/img/mm-blas.svg new file mode 100644 index 00000000..5027faef --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-blas.svg @@ -0,0 +1,1570 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/mm-blocked-barplot.svg b/content/english/hpc/algorithms/img/mm-blocked-barplot.svg new file mode 100644 index 00000000..93334ac1 --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-blocked-barplot.svg @@ -0,0 +1,1402 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/mm-blocked-plot.svg b/content/english/hpc/algorithms/img/mm-blocked-plot.svg new file mode 100644 index 00000000..87dda835 --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-blocked-plot.svg @@ -0,0 +1,1474 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/mm-kernel-barplot.svg b/content/english/hpc/algorithms/img/mm-kernel-barplot.svg new file mode 100644 index 00000000..834d8b39 --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-kernel-barplot.svg @@ -0,0 +1,1277 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/mm-kernel-plot.svg b/content/english/hpc/algorithms/img/mm-kernel-plot.svg new file mode 100644 index 00000000..99f9315a --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-kernel-plot.svg @@ -0,0 +1,1385 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/mm-noalloc.svg b/content/english/hpc/algorithms/img/mm-noalloc.svg new file mode 100644 index 00000000..a4911ea0 --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-noalloc.svg @@ -0,0 +1,1344 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg b/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg new file mode 100644 index 00000000..610d8276 --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg @@ -0,0 +1,1140 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/mm-vectorized-plot.svg b/content/english/hpc/algorithms/img/mm-vectorized-plot.svg new file mode 100644 index 00000000..7374f73f --- /dev/null +++ b/content/english/hpc/algorithms/img/mm-vectorized-plot.svg @@ -0,0 +1,1379 @@ [new Matplotlib v3.5.1 SVG plot; markup omitted]
diff --git a/content/english/hpc/algorithms/img/rho.jpg b/content/english/hpc/algorithms/img/rho.jpg new file mode 100644 index 00000000..d7f01ad8 Binary files /dev/null and b/content/english/hpc/algorithms/img/rho.jpg
differ diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index be5bd07d..cf976045 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -1,426 +1,485 @@ --- title: Matrix Multiplication -weight: 4 -draft: true +weight: 20 --- + -## Case Study: Distance Product +In this case study, we will design and implement several algorithms for matrix multiplication. -(We are going to speedrun "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course) +We start with the naive "for-for-for" algorithm and incrementally improve it, eventually arriving at a version that is 50 times faster and matches the performance of BLAS libraries while being under 40 lines of C. -Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as: +All implementations are compiled with GCC 13 and run on a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2) CPU clocked at 2GHz. -$(D \circ D)_{ij} = \min_k(D_{ik} + D_{kj})$ +## Baseline ----- +The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is defined as an $l \times m$ matrix $C$ such that: -Graph interpretation: -find shortest paths of length 2 between all vertices in a fully-connected weighted graph +$$ +C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj} +$$ -![](https://i.imgur.com/Zf4G7qj.png) +For simplicity, we will only consider *square* matrices, where $l = m = n$. ----- +To implement matrix multiplication, we can simply transfer this definition into code, but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays to be explicit about pointer arithmetic: -A cool thing about distance product is that if if we iterate the process and calculate: +```c++ +void matmul(const float *a, const float *b, float *c, int n) { + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i * n + j] += a[i * n + k] * b[k * n + j]; +} +``` -$D_2 = D \circ D, \;\; -D_4 = D_2 \circ D_2, \;\; -D_8 = D_4 \circ D_4, \;\; -\ldots$ +For reasons that will become apparent later, we will only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations remain correct for all others. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although all implementations can be easily [generalized](#generalizations) to other data types and operations. -Then we can find all-pairs shortest distances in $O(\log n)$ steps +Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. To put it in perspective, this is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication, which doesn't look that good yet. -(but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it) +## Transposition ---- +In general, when optimizing an algorithm that processes large quantities of data — and $1920^2 \times 3 \times 4 \approx 42$ MB clearly is a large quantity as it can't fit into any of the [CPU caches](/hpc/cpu-cache) — one should always start with memory before optimizing arithmetic, as it is much more likely to be the bottleneck. -## V0: Baseline +The field $C_{ij}$ can be thought of as the dot product of row $i$ of matrix $A$ and column $j$ of matrix $B$. 
As we increment `k` in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa) as sequential iteration. -Implement the definition of what we need to do, but using arrays instead of matrices: +One [well-known](/hpc/external-memory/oblivious/#matrix-multiplication) optimization that tackles this problem is to store matrix $B$ in *column-major* order — or, alternatively, to *transpose* it before the matrix multiplication. This requires $O(n^2)$ additional operations but ensures sequential reads in the innermost loop: -```cpp -const float infty = std::numeric_limits::infinity(); + + +```c++ +void matmul(const float *a, const float *_b, float *c, int n) { + float *b = new float[n * n]; + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + b[i * n + j] = _b[j * n + i]; + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i * n + j] += a[i * n + k] * b[j * n + k]; // <- note the indices } ``` -Compile with `g++ -O3 -march=native -std=c++17` +This code runs in ~12.4s, or about 30% faster. -On our Intel Core i5-6500 ("Skylake", 4 cores, 3.6 GHz) with $n=4000$ it runs for 99s, -which amounts to ~1.3B useful floating point operations per second +As we will see in a bit, there are more important benefits to transposing it than just the sequential memory reads. ---- +## Vectorization -## Theoretical Performance +Now that all we do is just sequentially read the elements of `a` and `b`, multiply them, and add the result to an accumulator variable, we can use [SIMD](/hpc/simd/) instructions to speed it all up. It is pretty straightforward to implement using [GCC vector types](/hpc/simd/intrinsics/#gcc-vector-extensions) — we can [memory-align](/hpc/cpu-cache/alignment/) matrix rows, pad them with zeros, and then compute the multiply-sum as we would normally compute any other [reduction](/hpc/simd/reduction/): -$$ -\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) -$$ +```c++ +// a vector of 256 / 32 = 8 floats +typedef float vec __attribute__ (( vector_size(32) )); -RAM bandwidth: 34.1 GB/s (or ~10 bytes per cycle) - +// a helper function that allocates n vectors and initializes them with zeros +vec* alloc(int n) { + vec* ptr = (vec*) std::aligned_alloc(32, 32 * n); + memset(ptr, 0, 32 * n); + return ptr; +} ---- +void matmul(const float *_a, const float *_b, float *c, int n) { + int nB = (n + 7) / 8; // number of 8-element vectors in a row (rounded up) + + vec *a = alloc(n * nB); + vec *b = alloc(n * nB); + + // move both matrices to the aligned region + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + a[i * nB + j / 8][j % 8] = _a[i * n + j]; + b[i * nB + j / 8][j % 8] = _b[j * n + i]; // <- b is still transposed + } + } + + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + vec s{}; // initialize the accumulator with zeros + + // vertical summation + for (int k = 0; k < nB; k++) + s += a[i * nB + k] * b[j * nB + k]; + + // horizontal summation + for (int k = 0; k < 8; k++) + c[i * n + j] += s[k]; + } + } -## OpenMP + std::free(a); + std::free(b); +} +``` -* We have 4 cores, so why don't we use them? 
-* There are low-level ways of creating threads, but they involve a lot of code -* We will use a high-level interface called OpenMP -* (We will talk about multithreading in much more detail on the next lecture) +The performance for $n = 1920$ is now around 2.3 GFLOPS — or another ~4 times higher compared to the transposed but not vectorized version. -![](https://www.researchgate.net/profile/Mario_Storti/publication/231168223/figure/fig2/AS:393334787985424@1470789729707/The-master-thread-creates-a-team-of-parallel-threads.png =400x) +![](../img/mm-vectorized-barplot.svg) ----- +This optimization looks neither too complex nor specific to matrix multiplication. Why can't the compiler [auto-vectorizee](/hpc/simd/auto-vectorization/) the inner loop by itself? -## Multithreading Made Easy +It actually can; the only thing preventing that is the possibility that `c` overlaps with either `a` or `b`. To rule it out, you can communicate to the compiler that you guarantee `c` is not [aliased](/hpc/compilation/contracts/#memory-aliasing) with anything by adding the `__restrict__` keyword to it: -All you need to know for now is the `#pragma omp parallel for` directive + -```cpp -#pragma omp parallel for -for (int i = 0; i < 10; ++i) { - do_stuff(i); +```c++ +void matmul(const float *a, const float *_b, float * __restrict__ c, int n) { + // ... } ``` -It splits iterations of a loop among multiple threads +Both manually and auto-vectorized implementations perform roughly the same. -There are many ways to control scheduling, -but we'll just leave defaults because our use case is simple - + -## Warning: Data Races +## Memory efficiency -This only works when all iterations can safely be executed simultaneously -It's not always easy to determine, but for now following rules of thumb are enough: +What is interesting is that the implementation efficiency depends on the problem size. -* There must not be any shared data element that is read by X and written by Y -* There must not be any shared data element that is written by X and written by Y +At first, the performance (defined as the number of useful operations per second) increases as the overhead of the loop management and the horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). -E. g. sum can't be parallelized this way, as threads would modify a shared variable - +![](../img/mm-vectorized-plot.svg) ---- +It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and even slightly better because it doesn't need to perform a transposition. 
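In these plots, performance means the number of useful multiplications ($n^3$) divided by the wall-clock running time. Here is a minimal sketch of such a measurement, assuming `matmul` is any one of the implementations above (the actual benchmarks may differ in details such as warm-up and averaging):

```c++
#include <chrono>
#include <cstdio>
#include <cstdlib>

// assumed: any of the matmul implementations from this article
void matmul(const float *a, const float *b, float *c, int n);

int main() {
    const int n = 1920;

    float *a = new float[n * n];
    float *b = new float[n * n];
    float *c = new float[n * n](); // zero-initialized

    for (int i = 0; i < n * n; i++) {
        a[i] = float(rand()) / RAND_MAX;
        b[i] = float(rand()) / RAND_MAX;
    }

    auto start = std::chrono::steady_clock::now();
    matmul(a, b, c, n);
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();

    // n^3 useful multiplications per call
    printf("%.2f GFLOPS\n", double(n) * n * n / seconds / 1e9);

    return 0;
}
```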
-## Parallel Baseline - -OpenMP is included in compilers: just add `-fopenmp` flag and that's it - -```cpp -void step(float* r, const float* d, int n) { - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - float v = infty; - for (int k = 0; k < n; ++k) { - float x = d[n*i + k]; - float y = d[n*k + j]; - float z = x + y; - v = std::min(v, z); - } - r[n*i + j] = v; - } - } -} -``` +One might think that there would be some general performance gain from doing sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` indeed takes more time, but the next 15 column reads will be in the same cache lines as the first one, so they will be cached anyway — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for any practical matrix sizes. -Runs ~4x times faster, as it should +Instead, the performance deteriorates on only a few specific matrix sizes due to the effects of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is a multiple of a large power of two, we are fetching the addresses of `b` that all likely map to the same cache line, which reduces the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$. ---- +So, counterintuitively, transposing the matrix doesn't help with caching — and in the naive scalar implementation, we are not really bottlenecked by the memory bandwidth anyway. But our vectorized implementation certainly is, so let's work on its I/O efficiency. -## Memory Bottleneck +## Register reuse -![](https://i.imgur.com/z4d6aez.png =450x) +Using a Python-like notation to refer to submatrices, to compute the cell $C[x][y]$, we need to calculate the dot product of $A[x][:]$ and $B[:][y]$, which requires fetching $2n$ elements, even if we store $B$ in column-major order. -(It is slower on macOS because of smaller page sizes) + ----- +To compute $C[x:x+2][y:y+2]$, a $2 \times 2$ submatrix of $C$, we would need two rows from $A$ and two columns from $B$, namely $A[x:x+2][:]$ and $B[:][y:y+2]$, containing $4n$ elements in total, to update *four* elements instead of *one* — which is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. -## Virtual Memory + -## V1: Linear Reading +To avoid fetching data more than once, we need to iterate over these rows and columns in parallel and calculate all $2 \times 2$ possible combinations of products. 
Here is a proof of concept: -Just transpose it, as we did with matrices +```c++ +void kernel_2x2(int x, int y) { + int c00 = 0, c01 = 0, c10 = 0, c11 = 0; -```cpp -void step(float* r, const float* d, int n) { - std::vector t(n*n); - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - t[n*j + i] = d[n*i + j]; - } - } + for (int k = 0; k < n; k++) { + // read rows + int a0 = a[x][k]; + int a1 = a[x + 1][k]; - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - float v = std::numeric_limits::infinity(); - for (int k = 0; k < n; ++k) { - float x = d[n*i + k]; - float y = t[n*j + k]; - float z = x + y; - v = std::min(v, z); - } - r[n*i + j] = v; - } + // read columns + int b0 = b[k][y]; + int b1 = b[k][y + 1]; + + // update all combinations + c00 += a0 * b0; + c01 += a0 * b1; + c10 += a1 * b0; + c11 += a1 * b1; } + + // write the results to C + c[x][y] = c00; + c[x][y + 1] = c01; + c[x + 1][y] = c10; + c[x + 1][y + 1] = c11; } ``` ----- +We can now simply call this kernel on all 2x2 submatrices of $C$, but we won't bother evaluating it: although this algorithm is better in terms of I/O operations, it would still not beat our SIMD-based implementation. Instead, we will extend this approach and develop a similar *vectorized* kernel right away. -![](https://i.imgur.com/UwxcEG7.png =600x) + ---- +## Designing the kernel -## V2: Instruction-Level Parallelism +Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ from scratch, we will declare a function that *updates* it using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$. For now, this seems like an over-generalization, but this function interface will prove useful later. -We can apply the same trick as we did with array sum earlier, so that instead of: + -```cpp -v0 = min(v0, z0); -v1 = min(v1, z1); -v0 = min(v0, z2); -v1 = min(v1, z3); -v0 = min(v0, z4); -... -v = min(v0, v1); -``` +To determine $h$ and $w$, we have several performance considerations: ----- +- In general, to compute an $h \times w$ submatrix, we need to fetch $2 \cdot n \cdot (h + w)$ elements. To optimize the I/O efficiency, we want the $\frac{h \cdot w}{h + w}$ ratio to be high, which is achieved with large and square-ish submatrices. +- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instruction available on all modern x86 architectures. As you can guess from the name, it performs the `c += a * b` operation — which is the core of a dot product — on 8-element vectors in one go, which saves us from executing vector multiplication and addition separately. +- To achieve better utilization of this instruction, we want to make use of [instruction-level parallelism](/hpc/pipelining/). On Zen 2, the `fma` instruction has a latency of 5 and a throughput of 2, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to saturate its execution ports. +- We want to avoid register spill (move data to and from registers more than necessary), and we only have $16$ logical vector registers that we can use as accumulators (minus those that we need to hold temporary values). -![](https://i.imgur.com/ihMC6z2.png) +For these reasons, we settle on a $6 \times 16$ kernel. This way, we process $96$ elements at once that are stored in $6 \times 2 = 12$ vector registers. 
To update them efficiently, we use the following procedure: -Our memory layout looks like this now + - #pragma omp parallel for - for (int j = 0; j < n; ++j) { - for (int i = 0; i < n; ++i) { - d[nab*j + i] = d_[n*j + i]; - t[nab*j + i] = d_[n*i + j]; - } - } +```c++ +// update 6x16 submatrix C[x:x+6][y:y+16] +// using A[x:x+6][l:r] and B[l:r][y:y+16] +void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) { + vec t[6][2]{}; // will be zero-filled and stored in ymm registers - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - // vv[0] = result for k = 0, 4, 8, ... - // vv[1] = result for k = 1, 5, 9, ... - // vv[2] = result for k = 2, 6, 10, ... - // vv[3] = result for k = 3, 7, 11, ... - float vv[nb]; - for (int kb = 0; kb < nb; ++kb) { - vv[kb] = infty; - } - for (int ka = 0; ka < na; ++ka) { - for (int kb = 0; kb < nb; ++kb) { - float x = d[nab*i + ka * nb + kb]; - float y = t[nab*j + ka * nb + kb]; - float z = x + y; - vv[kb] = std::min(vv[kb], z); - } - } - // v = result for k = 0, 1, 2, ... - float v = infty; - for (int kb = 0; kb < nb; ++kb) { - v = std::min(vv[kb], v); - } - r[n*i + j] = v; + for (int k = l; k < r; k++) { + for (int i = 0; i < 6; i++) { + // broadcast a[x + i][k] into a register + vec alpha = vec{} + a[(x + i) * n + k]; // converts to a broadcast + // multiply b[k][y:y+16] by it and update t[i][0] and t[i][1] + for (int j = 0; j < 2; j++) + t[i][j] += alpha * b[(k * n + y) / 8 + j]; // converts to an fma } } + + // write the results back to C + for (int i = 0; i < 6; i++) + for (int j = 0; j < 2; j++) + c[((x + i) * n + y) / 8 + j] += t[i][j]; } ``` ----- +We need `t` so that the compiler stores these elements in vector registers. We could just update their final destinations in `c`, but, unfortunately, the compiler re-writes them back to memory, causing a slowdown (wrapping everything in `__restrict__` keywords doesn't help). -![](https://i.imgur.com/5uHVRL4.png =600x) +After unrolling these loops and hoisting `b` out of the `i` loop (`b[(k * n + y) / 8 + j]` does not depend on `i` and can be loaded once and reused in all 6 iterations), the compiler generates something more similar to this: ---- - -## V3: Vectorization + -![](https://i.imgur.com/EG0WjHl.png =400x) +```c++ +for (int k = l; k < r; k++) { + __m256 b0 = _mm256_load_ps((__m256*) &b[k * n + y]; + __m256 b1 = _mm256_load_ps((__m256*) &b[k * n + y + 8]; + + __m256 a0 = _mm256_broadcast_ps((__m128*) &a[x * n + k]); + t00 = _mm256_fmadd_ps(a0, b0, t00); + t01 = _mm256_fmadd_ps(a0, b1, t01); ----- + __m256 a1 = _mm256_broadcast_ps((__m128*) &a[(x + 1) * n + k]); + t10 = _mm256_fmadd_ps(a1, b0, t10); + t11 = _mm256_fmadd_ps(a1, b1, t11); -```cpp -static inline float8_t min8(float8_t x, float8_t y) { - return x < y ? x : y; + // ... } +``` -void step(float* r, const float* d_, int n) { - // elements per vector - constexpr int nb = 8; - // vectors per input row - int na = (n + nb - 1) / nb; - - // input data, padded, converted to vectors - float8_t* vd = float8_alloc(n*na); - // input data, transposed, padded, converted to vectors - float8_t* vt = float8_alloc(n*na); - - #pragma omp parallel for - for (int j = 0; j < n; ++j) { - for (int ka = 0; ka < na; ++ka) { - for (int kb = 0; kb < nb; ++kb) { - int i = ka * nb + kb; - vd[na*j + ka][kb] = i < n ? d_[n*j + i] : infty; - vt[na*j + ka][kb] = i < n ? 
d_[n*i + j] : infty; - } - } - } +We are using $12+3=15$ vector registers and a total of $6 \times 3 + 2 = 20$ instructions to perform $16 \times 6 = 96$ updates. Assuming that there are no other bottleneks, we should be hitting the throughput of `_mm256_fmadd_ps`. - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - float8_t vv = f8infty; - for (int ka = 0; ka < na; ++ka) { - float8_t x = vd[na*i + ka]; - float8_t y = vt[na*j + ka]; - float8_t z = x + y; - vv = min8(vv, z); - } - r[n*i + j] = hmin8(vv); - } +Note that this kernel is architecture-specific. If we didn't have `fma`, or if its throughput/latency were different, or if the SIMD width was 128 or 512 bits, we would have made different design choices. Multi-platform BLAS implementations ship [many kernels](https://github.com/xianyi/OpenBLAS/tree/develop/kernel), each written in assembly by hand and optimized for a particular architecture. + +The rest of the implementation is straightforward. Similar to the previous vectorized implementation, we just move the matrices to memory-aligned arrays and call the kernel instead of the innermost loop: + +```c++ +void matmul(const float *_a, const float *_b, float *_c, int n) { + // to simplify the implementation, we pad the height and width + // so that they are divisible by 6 and 16 respectively + int nx = (n + 5) / 6 * 6; + int ny = (n + 15) / 16 * 16; + + float *a = alloc(nx * ny); + float *b = alloc(nx * ny); + float *c = alloc(nx * ny); + + for (int i = 0; i < n; i++) { + memcpy(&a[i * ny], &_a[i * n], 4 * n); + memcpy(&b[i * ny], &_b[i * n], 4 * n); // we don't need to transpose b this time } - std::free(vt); - std::free(vd); + for (int x = 0; x < nx; x += 6) + for (int y = 0; y < ny; y += 16) + kernel(a, (vec*) b, (vec*) c, x, y, 0, n, ny); + + for (int i = 0; i < n; i++) + memcpy(&_c[i * n], &c[i * ny], 4 * n); + + std::free(a); + std::free(b); + std::free(c); } ``` ----- +This improves the benchmark performance, but only by ~40%: -![](https://i.imgur.com/R3OvLKO.png =600x) +![](../img/mm-kernel-barplot.svg) ---- +The speedup is much higher (2-3x) on smaller arrays, indicating that there is still a memory bandwidth problem: -## V4: Register Reuse - -* At this point we are actually bottlenecked by memory -* It turns out that calculating one $r_{ij}$ at a time is not optimal -* We can reuse data that we read into registers to update other fields - ----- - -![](https://i.imgur.com/ljvD0ba.png =400x) - ----- - -```cpp -for (int ka = 0; ka < na; ++ka) { - float8_t y0 = vt[na*(jc * nd + 0) + ka]; - float8_t y1 = vt[na*(jc * nd + 1) + ka]; - float8_t y2 = vt[na*(jc * nd + 2) + ka]; - float8_t x0 = vd[na*(ic * nd + 0) + ka]; - float8_t x1 = vd[na*(ic * nd + 1) + ka]; - float8_t x2 = vd[na*(ic * nd + 2) + ka]; - vv[0][0] = min8(vv[0][0], x0 + y0); - vv[0][1] = min8(vv[0][1], x0 + y1); - vv[0][2] = min8(vv[0][2], x0 + y2); - vv[1][0] = min8(vv[1][0], x1 + y0); - vv[1][1] = min8(vv[1][1], x1 + y1); - vv[1][2] = min8(vv[1][2], x1 + y2); - vv[2][0] = min8(vv[2][0], x2 + y0); - vv[2][1] = min8(vv[2][1], x2 + y1); - vv[2][2] = min8(vv[2][2], x2 + y2); -} +![](../img/mm-kernel-plot.svg) + +Now, if you've read the section on [cache-oblivious algorithms](/hpc/external-memory/oblivious/), you know that one universal solution to these types of things is to split all matrices into four parts, perform eight recursive block matrix multiplications, and carefully combine the results together. 
This solution is okay in practice, but there is some [overhead to recursion](/hpc/architecture/functions/), and it also doesn't allow us to fine-tune the algorithm, so instead, we will follow a different, simpler approach. + +## Blocking + +The *cache-aware* alternative to the divide-and-conquer trick is *cache blocking*: splitting the data into blocks that can fit into the cache and processing them one by one. If we have more than one layer of cache, we can do hierarchical blocking: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on. This approach requires knowing the cache sizes in advance, but it is usually easier to implement and also faster in practice. + +Cache blocking is less trivial to do with matrices than with arrays, but the general idea is this: + +- Select a submatrix of $B$ that fits into the L3 cache (say, a subset of its columns). +- Select a submatrix of $A$ that fits into the L2 cache (say, a subset of its rows). +- Select a submatrix of the previously selected submatrix of $B$ (a subset of its rows) that fits into the L1 cache. +- Update the relevant submatrix of $C$ using the kernel. + +Here is a good [visualization](https://jukkasuomela.fi/cache-blocking-demo/) by Jukka Suomela (it features many different approaches; you are interested in the last one). + +Note that the decision to start this process with matrix $B$ is not arbitrary. During the kernel execution, we are reading the elements of $A$ much slower than the elements of $B$: we fetch and broadcast just one element of $A$ and then multiply it with $16$ elements of $B$. Therefore, we want $B$ to be in the L1 cache while $A$ can stay in the L2 cache and not the other way around. + +This sounds complicated, but we can implement it with just three more outer `for` loops, which are collectively called *macro-kernel* (and the highly optimized low-level function that updates a 6x16 submatrix is called *micro-kernel*): + +```c++ +const int s3 = 64; // how many columns of B to select +const int s2 = 120; // how many rows of A to select +const int s1 = 240; // how many rows of B to select + +for (int i3 = 0; i3 < ny; i3 += s3) + // now we are working with b[:][i3:i3+s3] + for (int i2 = 0; i2 < nx; i2 += s2) + // now we are working with a[i2:i2+s2][:] + for (int i1 = 0; i1 < ny; i1 += s1) + // now we are working with b[i1:i1+s1][i3:i3+s3] + // and we need to update c[i2:i2+s2][i3:i3+s3] with [l:r] = [i1:i1+s1] + for (int x = i2; x < std::min(i2 + s2, nx); x += 6) + for (int y = i3; y < std::min(i3 + s3, ny); y += 16) + kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny); ``` -Ugly, but worth it +Cache blocking completely removes the memory bottleneck: ----- +![](../img/mm-blocked-barplot.svg) -![](https://i.imgur.com/GZvIt8J.png =600x) +The performance is no longer (significantly) affected by the problem size: ---- +![](../img/mm-blocked-plot.svg) -## V5: More Register Reuse +Notice that the dip at $1536$ is still there: cache associativity still affects the performance. To mitigate this, we can adjust the step constants or insert holes into the layout, but we will not bother doing that for now. -![](https://i.imgur.com/amUznoQ.png =400x) +## Optimization ----- +To approach closer to the performance limit, we need a few more optimizations: -![](https://i.imgur.com/24nBJ1Y.png =600x) +- Remove memory allocation and operate directly on the arrays that are passed to the function. 
Note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use an [unaligned](/hpc/simd/moving/#aligned-loads-and-stores) `store` for `c` as we only use it rarely, so our only concern is reading `b`. +- Get rid of the `std::min` so that the size parameters are (mostly) constant and can be embedded into the machine code by the compiler (which also lets it [unroll](/hpc/architecture/loops/) the micro-kernel loop more efficiently and avoid runtime checks). +- Rewrite the micro-kernel by hand using 12 vector variables (the compiler seems to struggle with keeping them in registers and writes them first to a temporary memory location and only then to $C$). ---- +These optimizations are straightforward but quite tedious to implement, so we are not going to list [the code](https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc) here in the article. It also requires some more work to effectively support "weird" matrix sizes, which is why we only run benchmarks for sizes that are multiple of $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. -## V6: Software Prefetching + -## V7: Temporal Cache Locality +These individually small improvements compound and result in another 50% improvement: -![](https://i.imgur.com/29vTLKJ.png) +![](../img/mm-noalloc.svg) ----- +We are actually not that far from the theoretical performance limit — which can be calculated as the SIMD width times the `fma` instruction throughput times the clock frequency: -### Z-Curve +$$ +\underbrace{8}_{SIMD} \cdot \underbrace{2}_{thr.} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10}) +$$ -![](https://i.imgur.com/0optLZ3.png) +It is more representative to compare against some practical library, such as [OpenBLAS](https://www.openblas.net/). The laziest way to do it is to simply [invoke matrix multiplication from NumPy](/hpc/complexity/languages/#blas). There may be some minor overhead due to Python, but it ends up reaching 80% of the theoretical limit, which seems plausible (a 20% overhead is okay: matrix multiplication is not the only thing that CPUs are made for). ----- +![](../img/mm-blas.svg) -![](https://i.imgur.com/U3GaO5b.png) +We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C. ---- +Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS level of performance (assuming that we're in 2050 and using GCC version 35, which finally stopped screwing up with register spilling): + +```c++ +for (int i3 = 0; i3 < n; i3 += s3) + for (int i2 = 0; i2 < n; i2 += s2) + for (int i1 = 0; i1 < n; i1 += s1) + for (int x = i2; x < i2 + s2; x += 6) + for (int y = i3; y < i3 + s3; y += 16) + for (int k = i1; k < i1 + s1; k++) + for (int i = 0; i < 6; i++) + for (int j = 0; j < 2; j++) + c[x * n / 8 + i * n / 8 + y / 8 + j] + += (vec{} + a[x * n + i * n + k]) + * b[n / 8 * k + y / 8 + j]; +``` + +There is also an approach that performs asymptotically fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway. 
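To make that trade-off concrete, here is what one level of Strassen's recursion looks like (this is the standard textbook formulation, applied to $\frac{n}{2} \times \frac{n}{2}$ blocks): seven block products are computed instead of eight,

$$
\begin{aligned}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &= (A_{21} + A_{22}) \cdot B_{11} \\
M_3 &= A_{11} \cdot (B_{12} - B_{22}) \\
M_4 &= A_{22} \cdot (B_{21} - B_{11}) \\
M_5 &= (A_{11} + A_{12}) \cdot B_{22} \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}
$$

and then combined into the result:

$$
\begin{aligned}
C_{11} &= M_1 + M_4 - M_5 + M_7 \\
C_{12} &= M_3 + M_5 \\
C_{21} &= M_2 + M_4 \\
C_{22} &= M_1 - M_2 + M_3 + M_6
\end{aligned}
$$

Applying this recursively yields $O(n^{\log_2 7}) \approx O(n^{2.81})$ multiplications, but the extra block additions and the worse memory locality are exactly where the large constant factor comes from.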
+ +## Generalizations + +FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. If you can guarantee that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be faster to just convert them to and from floats. + +This approach can be also applied to some similar-looking computations. One example is the "min-plus matrix multiplication" defined as: + +$$ +(A \circ B)_{ij} = \min_{1 \le k \le n} (A_{ik} + B_{kj}) +$$ + +It is also known as the "distance product" due to its graph interpretation: when applied to itself $(D \circ D)$, the result is the matrix of shortest paths of length two between all pairs of vertices in a fully-connected weighted graph specified by the edge weight matrix $D$. + +A cool thing about the distance product is that if we iterate the process and calculate + +$$ +D_2 = D \circ D \\ +D_4 = D_2 \circ D_2 \\ +D_8 = D_4 \circ D_4 \\ +\ldots +$$ + +…we can find all-pairs shortest paths in $O(\log n)$ steps: + +```c++ +for (int l = 0; l < logn; l++) + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + d[i][j] = min(d[i][j], d[i][k] + d[k][j]); +``` + +This requires $O(n^3 \log n)$ operations. If we do these two-edge relaxations in a particular order, we can do it with just one pass, which is known as the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm): + +```c++ +for (int k = 0; k < n; k++) + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + d[i][j] = min(d[i][j], d[i][k] + d[k][j]); +``` + +Interestingly, similarly vectorizing the distance product and executing it $O(\log n)$ times ([or possibly fewer](https://arxiv.org/pdf/1904.01210.pdf)) in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot. + +As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because now there is a logical dependency between the iterations, and you need to perform updates in a particular order, but it is still possible to design [a similar kernel and a block iteration order](https://github.com/sslotin/amh-code/blob/main/floyd/blocked.cc) that achieves a 30-50x total speedup. + +## Acknowledgements -## Summary +The final algorithm was originally designed by Kazushige Goto, and it is the basis of GotoBLAS and OpenBLAS. The author himself describes it in more detail in "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)". -* Deal with memory problems first (make sure data fits L3 cache) -* SIMD can get you ~10x speedup -* ILP can get you 2-3x speedup -* Multi-core parallelism can get you $NUM_CORES speedup - (and it can be just one `#pragma omp parallel for` away) +The exposition style is inspired by the "[Programming Parallel Computers](http://ppc.cs.aalto.fi/)" course by Jukka Suomela, which features a [similar case study](http://ppc.cs.aalto.fi/ch2/) on speeding up the distance product. 
diff --git a/content/english/hpc/algorithms/parsing.md b/content/english/hpc/algorithms/parsing.md deleted file mode 100644 index c189e66a..00000000 --- a/content/english/hpc/algorithms/parsing.md +++ /dev/null @@ -1,5 +0,0 @@ ---- -title: Parsing with SIMD -weight: 5 -draft: true ---- diff --git a/content/english/hpc/algorithms/prefix.md b/content/english/hpc/algorithms/prefix.md index 5e31570d..43bfd560 100644 --- a/content/english/hpc/algorithms/prefix.md +++ b/content/english/hpc/algorithms/prefix.md @@ -61,7 +61,7 @@ for (int l = 0; l < logn; l++) We can prove that this algorithm works by induction: if on $k$-th iteration every element $a_i$ is equal to the sum of the $(i - 2^k, i]$ segment of the original array, then after adding $a_{i - 2^k}$ to it, it will be equal to the sum of $(i - 2^{k+1}, i]$. After $O(\log n)$ iterations, the array will turn into its prefix sum. -To implement it in SIMD, we could use [permutations](/hpc/simd/shuffles) to place $i$-th element against $(i-2^k)$-th, but they are too slow. Instead, we will use the `sll` ("shift lanes left") instruction that does exactly that and also replaces the unmatched elements with zeros: +To implement it in SIMD, we could use [permutations](/hpc/simd/shuffling) to place $i$-th element against $(i-2^k)$-th, but they are too slow. Instead, we will use the `sll` ("shift lanes left") instruction that does exactly that and also replaces the unmatched elements with zeros: ```c++ typedef __m128i v4i; @@ -76,7 +76,7 @@ v4i prefix(v4i x) { // x = 1, 3, 5, 7 // + 0, 0, 1, 3 // = 1, 3, 6, 10 - return s; + return x; } ``` @@ -91,7 +91,7 @@ v8i prefix(v8i x) { x = _mm256_add_epi32(x, _mm256_slli_si256(x, 8)); x = _mm256_add_epi32(x, _mm256_slli_si256(x, 16)); // <- this does nothing // x = 1, 3, 6, 10, 5, 11, 18, 26 - return s; + return x; } ``` @@ -146,7 +146,7 @@ Another interesting data point: if we only execute the `prefix` phase, the perfo ### Blocking -So, we have a memory bandwidth problem for large arrays. We can avoid re-fetching the entire array from the RAM if we split it into blocks that fit in the cache and process them separately. All we need to pass to the next block is the sum of the previous ones, so we can design a `local_prefix` function with an interface similar to `accumulate`: +So, we have a memory bandwidth problem for large arrays. We can avoid re-fetching the entire array from RAM if we split it into blocks that fit in the cache and process them separately. All we need to pass to the next block is the sum of the previous ones, so we can design a `local_prefix` function with an interface similar to `accumulate`: ```c++ const int B = 4096; // <- ideally should be slightly less or equal to the L1 cache diff --git a/content/english/hpc/algorithms/reading-integers.md b/content/english/hpc/algorithms/reading-integers.md new file mode 100644 index 00000000..de9da4e9 --- /dev/null +++ b/content/english/hpc/algorithms/reading-integers.md @@ -0,0 +1,59 @@ +--- +title: Reading Decimal Integers +weight: 10 +draft: true +--- + +I wrote a new integer parsing algorithm that is ~35x faster than scanf. + +(No, this is not an April Fools' joke — although it does sound ridiculous.) + +Zen 2 @ 2GHz. The compiler is Clang 13. + +Ridiculous. 
+ +### Iostream + +### Scanf + +### Syncronization + +### Getchar + +### Buffering + +### SIMD + +http://0x80.pl/notesen/2014-10-12-parsing-decimal-numbers-part-1-swar.html + + +### Serial + +### Transpose-based approach + +### Instruction-level parallelism + + +### Modifications + +ILP benefits would not be that huge. + +One huge asterisk. We get the integers, and we can even do other parsing algorithms on them. + +1.75 cycles per byte. + +AVX-512 both due to larger SIMD lane size and dedicated operations for filtering. + +It accounts for ~2% of all time, but it can be optimized by using special procedures. Pad buffer with any digits. + +### Future work + +Next time, we will be *writing* integers. + +You can create a string searcing algorithm by computing hashes in rabin-karp algorithm — although it does not seem to be possible to make an *exact* algorithm for that. + +## Acknowledgements + +http://0x80.pl/articles/simd-parsing-int-sequences.html + +https://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2/25627536#25627536 diff --git a/content/english/hpc/architecture/assembly.md b/content/english/hpc/architecture/assembly.md index 20a018c7..de94e4cf 100644 --- a/content/english/hpc/architecture/assembly.md +++ b/content/english/hpc/architecture/assembly.md @@ -19,7 +19,7 @@ Jumping right into it, here is how you add two numbers (`*c = *a + *b`) in Arm a ldr w0, [x0] ; load 4 bytes from wherever x0 points into w0 ldr w1, [x1] ; load 4 bytes from wherever x1 points into w1 add w0, w0, w1 ; add w0 with w1 and save the result to w0 -str w0, [x2] ; write contents of w0 to wherever x2 points/ +str w0, [x2] ; write contents of w0 to wherever x2 points ``` Here is the same operation in x86 assembly: @@ -33,7 +33,7 @@ mov DWORD PTR [rdx], eax ; write contents of eax to wherever rdx points Assembly is very simple in the sense that it doesn't have many syntactical constructions compared to high-level programming languages. From what you can observe from the examples above: -- A program is a sequence of instructions, each written as its name followed by a variable amount of operands. +- A program is a sequence of instructions, each written as its name followed by a variable number of operands. - The `[reg]` syntax is used for "dereferencing" a pointer stored in a register, and on x86 you need to prefix it with size information (`DWORD` here means 32 bit). - The `;` sign is used for line comments, similar to `#` and `//` in other languages. @@ -49,15 +49,15 @@ Since there are far more differences between the architectures than just this on For historical reasons, instruction mnemonics in most assembly languages are very terse. Back when people used to write assembly by hand and repeatedly wrote the same set of common instructions, one less character to type was one step away from insanity. -For example, `mov` is for "store/load a word", `inc` is for "increment by 1", `mul` is for "multiply", and `idiv` is for "integer division". You can look up the description of an instruction by its name in [one of x86 references](https://www.felixcloutier.com/x86/), but most instructions do what you'd think they do. +For example, `mov` is for "store/load a word," `inc` is for "increment by 1," `mul` is for "multiply," and `idiv` is for "integer division." You can look up the description of an instruction by its name in [one of x86 references](https://www.felixcloutier.com/x86/), but most instructions do what you'd think they do. 
Most instructions write their result into the first operand, which can also be involved in the computation like in the `add eax, [rdi]` example we saw before. Operands can be either registers, constant values, or memory locations. -**Registers** are named `rax`, `rbx`, `rcx`, `rdx`, `rdi`, `rsi`, `rbp`, `rsp`, and `r8`-`r15` for a total of 16 of them. The "letter" ones are named like that for historical reasons: `rax` is "accumulator", `rcx` is "counter", `rdx` is "data" and so on, but, of course, they don't have to be used only for that. +**Registers** are named `rax`, `rbx`, `rcx`, `rdx`, `rdi`, `rsi`, `rbp`, `rsp`, and `r8`-`r15` for a total of 16 of them. The "letter" ones are named like that for historical reasons: `rax` is "accumulator," `rcx` is "counter," `rdx` is "data" and so on — but, of course, they don't have to be used only for that. -There are also 32-, 16-bit and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). They are not fully separate but *aliased*: the first 32 bits of `rax` are `eax`, the first 16 bits of `eax` are `ax`, and so on. This is made to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free. +There are also 32-, 16-bit and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). They are not fully separate but *aliased*: the lowest 32 bits of `rax` are `eax`, the lowest 16 bits of `eax` are `ax`, and so on. This is made to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free. -These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../jumps), but we'll get there in time. +These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../loops), but we'll get there in time. **Constants** are just integer or floating-point values: `42`, `0x2a`, `3.14`, `6.02e23`. They are more commonly called *immediate values* because they are embedded right into the machine code. Because it may considerably increase the complexity of the instruction encoding, some instructions don't support immediate values or allow just a fixed subset of them. In some cases, you have to load a constant value into a register and then use it instead of an immediate value. @@ -117,20 +117,18 @@ There are actually multiple *assemblers* (the programs that produce machine code These syntaxes are also sometimes called *GAS* and *NASM* respectively, by the names of the two primary assemblers that use them (*GNU Assembler* and *Netwide Assembler*). -We used Intel syntax in this chapter and will continue to preferably use it for the rest of the book. For comparison, here is what the summation loop looks like in AT&T asm: +We used Intel syntax in this chapter and will continue to preferably use it for the rest of the book. 
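To illustrate that last point: most x86 arithmetic instructions only encode immediate operands of up to 32 bits (sign-extended to 64), so adding a full 64-bit constant requires placing it in a register first. A small sketch, using an arbitrary constant:

```c++
#include <cstdint>

// "add rdi, 0x123456789abcdef0" cannot be encoded,
// so the compiler emits something along the lines of:
//
//     mov rax, 0x123456789abcdef0   ; the 64-bit immediate form of mov
//     add rax, rdi
//     ret
//
uint64_t add_big_constant(uint64_t x) {
    return x + 0x123456789abcdef0;
}
```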
For comparison, here is how the same `*c = *a + *b` example looks like in AT&T asm: ```asm -loop: - addl (%rax), %edx - addq $4, %rax - cmpq %rcx, %rax - jne loop +movl (%rsi), %eax +addl (%rdi), %eax +movl %eax, (%rdx) ``` The key differences can be summarized as follows: 1. The *last* operand is used to specify the destination. -2. Register names and constants need to be prefixed by `%` and `$` respectively. +2. Registers and constants need to be prefixed by `%` and `$` respectively (e.g., `addl $1, %rdx` increments `rdx`). 3. Memory addressing looks like this: `displacement(%base, %index, scale)`. 4. Both `;` and `#` can be used for line comments, and also `/* */` can be used for block comments. diff --git a/content/english/hpc/architecture/functions.md b/content/english/hpc/architecture/functions.md index ec8631f0..3f98a381 100644 --- a/content/english/hpc/architecture/functions.md +++ b/content/english/hpc/architecture/functions.md @@ -1,6 +1,7 @@ --- title: Functions and Recursion weight: 3 +published: true --- To "call a function" in assembly, you need to [jump](../loops) to its beginning and then jump back. But then two important problems arise: @@ -15,9 +16,9 @@ Both of these concerns can be solved by having a dedicated location in memory wh The hardware stack works the same way software stacks do and is similarly implemented as just two pointers: - The *base pointer* marks the start of the stack and is conventionally stored in `rbp`. -- The *stack pointer* marks the last element on the stack and is conventionally stored in `rsp`. +- The *stack pointer* marks the last element of the stack and is conventionally stored in `rsp`. -When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e. g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers. +When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances; e.g., when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers. -By convention, a function should take its arguments in `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (and the rest in the stack if that wasn't enough), put the return value into `rax`, and then return. Thus, `square`, being a simple one-argument function, can be implemented like this: +By convention, a function should take its arguments in `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (and the rest in the stack if those weren't enough), put the return value into `rax`, and then return. Thus, `square`, being a simple one-argument function, can be implemented like this: ```nasm square: ; x = edi, ret = eax @@ -189,7 +190,7 @@ distance: ret ``` -This is better, but we are still implicitly accessing stack memory: you need to push and pop the instruction pointer on each function call. In simple cases like this, we can *inline* function calls by stitching callee's code into the caller and resolving conflicts over registers. 
In our example: +This is better, but we are still implicitly accessing stack memory: you need to push and pop the instruction pointer on each function call. In simple cases like this, we can *inline* function calls by stitching the callee's code into the caller and resolving conflicts over registers. In our example: ```nasm distance: @@ -229,7 +230,7 @@ Equivalent assembly: ```nasm ; n = edi, ret = eax factorial: - test edi, edi ; test if a value if zero + test edi, edi ; test if a value is zero jne nonzero ; (the machine code of "cmp rax, 0" would be one byte longer) mov eax, 1 ; return 1 ret diff --git a/content/english/hpc/architecture/indirect.md b/content/english/hpc/architecture/indirect.md index ce6e86b8..1bd96c06 100644 --- a/content/english/hpc/architecture/indirect.md +++ b/content/english/hpc/architecture/indirect.md @@ -102,11 +102,11 @@ There are many ways to implement this behavior, but C++ does it using a *virtual For all concrete implementations of `Animal`, compiler pads all their methods (that is, their instruction sequences) so that they have the exact same length for all classes (by inserting some [filler instructions](../layout) after `ret`) and then just writes them sequentially somewhere in the instruction memory. Then it adds a *run-time type information* field to the structure (that is, to all its instances), which is essentially just the offset in the memory region that points to the right implementation of the virtual methods of the class. -During a virtual method call, that offset field is fetched from the instance of a structure, and a normal function call is made with it, using the fact that all methods and other fields of every derived class have exactly the same offsets. +With a virtual method call, that offset field is fetched from the instance of a structure and a normal function call is made with it, using the fact that all methods and other fields of every derived class have exactly the same offsets. Of course, this adds some overhead: -- You may need to spend another 15 cycles or so for the same pipeline flushing reasons as for [branch misprediction](../pipelining). +- You may need to spend another 15 cycles or so for the same pipeline flushing reasons as for [branch misprediction](/hpc/pipelining). - The compiler most likely won't be able to inline the function call itself. - Class size increases by a couple of bytes or so (this is implementation-specific). - The binary size itself increases a little bit. diff --git a/content/english/hpc/architecture/isa.md b/content/english/hpc/architecture/isa.md index d109b359..b902f69c 100644 --- a/content/english/hpc/architecture/isa.md +++ b/content/english/hpc/architecture/isa.md @@ -14,7 +14,7 @@ Abstractions help us in reducing all this complexity down to a single *interface Hardware engineers love abstractions too. An abstraction of a CPU is called an *instruction set architecture* (ISA), and it defines how a computer should work from a programmer's perspective. Similar to software interfaces, it gives computer engineers the ability to improve on existing CPU designs while also giving its users — us, programmers — the confidence that things that worked before won't break on newer chips. -An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, ISA importantly defines counts, sizes, and purposes of registers, the memory model, and the input/output model. 
Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance. +An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, an ISA also defines the counts, sizes, and purposes of registers, the memory model, and the input/output model. Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance. ### RISC vs CISC @@ -23,7 +23,7 @@ Historically, there have been many competing ISAs in use. But unlike [character - **Arm** chips, which are used in almost all mobile devices, as well as other computer-like devices such as TVs, smart fridges, microwaves, [car autopilots](https://en.wikipedia.org/wiki/Tesla_Autopilot), and so on. They are designed by a British company of the same name, as well as a number of electronics manufacturers including Apple and Samsung. - **x86**[^x86] chips, which are used in almost all servers and desktops, with a few notable exceptions such as Apple's M1 MacBooks, AWS's Graviton processors, and the current [world's fastest supercomputer](https://en.wikipedia.org/wiki/Fugaku_(supercomputer)), all of which use Arm-based CPUs. They are designed by a duopoly of Intel and AMD. -[^x86]: Modern 64-bit versions of x86 are known as "AMD64", "Intel 64", or by the more vendor-neutral names of "x86-64" or just "x64". A similar 64-bit extension of Arm is called "AArch64" or "ARM64". In this book, we will just use plain "x86" and "Arm" implying the 64-bit versions. +[^x86]: Modern 64-bit versions of x86 are known as "AMD64," "Intel 64," or by the more vendor-neutral names of "x86-64" or just "x64." A similar 64-bit extension of Arm is called "AArch64" or "ARM64." In this book, we will just use plain "x86" and "Arm" implying the 64-bit versions. The main difference between them is that of architectural complexity, which is more of a design philosophy rather than some strictly defined property: diff --git a/content/english/hpc/architecture/layout.md b/content/english/hpc/architecture/layout.md index 1ab39c82..df414512 100644 --- a/content/english/hpc/architecture/layout.md +++ b/content/english/hpc/architecture/layout.md @@ -1,6 +1,7 @@ --- title: Machine Code Layout weight: 10 +published: true --- Computer engineers like to mentally split the [pipeline of a CPU](/hpc/pipelining) into two parts: the *front-end*, where instructions are fetched from memory and decoded, and the *back-end*, where they are scheduled and finally executed. Typically, the performance is bottlenecked by the execution stage, and for this reason, most of our efforts in this book are going to be spent towards optimizing around the back-end. @@ -15,7 +16,7 @@ During the **fetch** stage, the CPU simply loads a fixed-size chunk of bytes fro -Next comes the **decode** stage: the CPU looks at this chunk of bytes, discards everything that comes before the instruction pointer, and splits the rest of them into instructions. Machine instructions are encoded using a variable amount of bytes: something simple and very common like `inc rax` takes one byte, while some obscure instruction with encoded constants and behavior-modifying prefixes may take up to 15. 
So, from a 32-byte block, a variable number of instructions may be decoded, but no more than a certain machine-dependant limit called the *decode width*. On my CPU (a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2)), the decode width is 4, which means that on each cycle, up to 4 instructions can be decoded and passed to the next stage. +Next comes the **decode** stage: the CPU looks at this chunk of bytes, discards everything that comes before the instruction pointer, and splits the rest of them into instructions. Machine instructions are encoded using a variable number of bytes: something simple and very common like `inc rax` takes one byte, while some obscure instruction with encoded constants and behavior-modifying prefixes may take up to 15. So, from a 32-byte block, a variable number of instructions may be decoded, but no more than a certain machine-dependent limit called the *decode width*. On my CPU (a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2)), the decode width is 4, which means that on each cycle, up to 4 instructions can be decoded and passed to the next stage. The stages work in a pipelined fashion: if the CPU can tell (or [predict](/hpc/pipelining/branching/)) which instruction block it needs next, then the fetch stage doesn't wait for the last instruction in the current block to be decoded and loads the next one right away. @@ -29,7 +30,7 @@ Loop Stream Detector (LSD) ### Code Alignment -Other things being equal, compilers typically prefer instructions with shorter machine code, because this way more instructions can fit in a single 32B fetch block, and also because it reduces the size of the binary. But sometimes the reverse advice applies, caused by the fact that the fetched instructions blocks have to be aligned. +Other things being equal, compilers typically prefer instructions with shorter machine code, because this way more instructions can fit in a single 32B fetch block, and also because it reduces the size of the binary. But sometimes the reverse is prefereable, due to the fact that the fetched instructions' blocks must be aligned. Imagine that you need to execute an instruction sequence that starts on the last byte of a 32B-aligned block. You may be able to execute the first instruction without additional delay, but for the subsequent ones, you have to wait for one additional cycle to do another instruction fetch. If the code block was aligned on a 32B boundary, then up to 4 instructions could be decoded and then executed concurrently (unless they are extra long or interdependent). @@ -45,15 +46,15 @@ In GCC, you can use `-falign-labels=n` flag to specify a particular alignment po The instructions are stored and fetched using largely the same [memory system](/hpc/cpu-cache) as for the data, except maybe the lower layers of cache are replaced with a separate *instruction cache* (because you wouldn't want a random data read to kick out the code that processes it). -The instruction cache is crucial in situations when you either +The instruction cache is crucial in situations when you either: - don't know what instructions you are going to execute next, and need to fetch the next block with [low latency](/hpc/cpu-cache/latency), -- or executing a long sequence of verbose-but-quick-to-process instructions, and need [high bandwidth](/hpc/cpu-cache/bandwidth). +- or are executing a long sequence of verbose-but-quick-to-process instructions, and need [high bandwidth](/hpc/cpu-cache/bandwidth). 
The memory system can therefore become the bottleneck for programs with large machine code. This consideration limits the applicability of the optimization techniques we've previously discussed: - [Inlining functions](../functions) is not always optimal, because it reduces code sharing and increases the binary size, requiring more instruction cache. -- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of loops is known during compile-time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth. +- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of iterations is known during compile time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth. - Huge [code alignments](#code-alignment) increase the binary size, again requiring more instruction cache. Spending one more cycle on fetch is a minor penalty compared to missing the cache and waiting for the instructions to be fetched from the main memory. Another aspect is that placing frequently used instruction sequences on the same [cache lines](/hpc/cpu-cache/cache-lines) and [memory pages](/hpc/cpu-cache/paging) improves [cache locality](/hpc/external-memory/locality). To improve instruction cache utilization, you should group hot code with hot code and cold code with cold code, and remove dead (unused) code if possible. If you want to explore this idea further, check out Facebook's [Binary Optimization and Layout Tool](https://engineering.fb.com/2018/06/19/data-infrastructure/accelerate-large-scale-applications-with-bolt/), which was recently [merged](https://github.com/llvm/llvm-project/commit/4c106cfdf7cf7eec861ad3983a3dd9a9e8f3a8ae) into LLVM. @@ -126,7 +127,7 @@ normal: ret swap: xchg edi, esi - jump normal + jmp normal ``` This technique is quite handy when handling exceptions cases in general, and in high-level code, you can give the compiler a [hint](/hpc/compilation/situational) that a certain branch is more likely than the other: @@ -152,7 +153,7 @@ length: ret ``` -This is a very important issue, and we will spend [much of the next chapter](/hpc/pipelining/branching) discussing it in more detail. +Eliminating branches is an important topic, and we will spend [much of the next chapter](/hpc/pipelining/branching) discussing it in more detail. @@ -47,7 +47,7 @@ Since running machine code in an interpreter doesn't make sense, this makes a to - Compiled languages with a runtime, such as Java, C#, or Erlang (and languages that work on their VMs, such as Scala, F#, or Elixir). - Compiled native languages, such as C, Go, or Rust. -There is no "right" way of executing computer programs: each approach has its own gains and drawbacks. Interpreters and virtual machines provide flexibility and enable some nice high-level programming features such as dynamic typing, run-time code alteration, and automatic memory management, but this comes with some unavoidable performance trade-offs, which we will now talk about. +There is no "right" way of executing computer programs: each approach has its own gains and drawbacks. 
Interpreters and virtual machines provide flexibility and enable some nice high-level programming features such as dynamic typing, run time code alteration, and automatic memory management, but these come with some unavoidable performance trade-offs, which we will now talk about. ### Interpreted languages @@ -94,7 +94,7 @@ This is not surprising if you consider the things that Python needs to do to fig - looks up its type, figures out that it's a `float`, and fetches the method implementing `*` operator; - does the same things for `b` and `c` and finally add-assigns the result to `c[i][j]`. -Granted, the interpreters of widely-used languages such as Python are well-optimized, and they can skip through some of these steps on repeated execution of the same code. But still, some quite significant overhead is unavoidable due to the language design. If we get rid of all this type checking and pointer chasing, perhaps we can get cycles per multiplication ratio closer to 1, or whatever the "cost" of native multiplication is? +Granted, the interpreters of widely used languages such as Python are well-optimized, and they can skip through some of these steps on repeated execution of the same code. But still, some quite significant overhead is unavoidable due to the language design. If we get rid of all this type checking and pointer chasing, perhaps we can get cycles per multiplication ratio closer to 1, or whatever the "cost" of native multiplication is? ### Managed Languages @@ -204,7 +204,7 @@ print(duration) Now it takes ~0.12 seconds: a ~5x speedup over the auto-vectorized C version and ~5250x speedup over our initial Python implementation! -You don't typically see such dramatic improvements. For now, we are not ready to tell you exactly how this is achieved. Implementations of dense matrix multiplication in OpenBLAS are typically [5000 lines of handwritten assembly](https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_16x2_haswell.S) tailored separately for *each* architecture. In later chapters, we will explain all the relevant techniques one by one, and then return to this example and develop our own BLAS-level implementation using just under 40 lines of C. +You don't typically see such dramatic improvements. For now, we are not ready to tell you exactly how this is achieved. Implementations of dense matrix multiplication in OpenBLAS are typically [5000 lines of handwritten assembly](https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_16x2_haswell.S) tailored separately for *each* architecture. In later chapters, we will explain all the relevant techniques one by one, and then [return](/hpc/algorithms/matmul) to this example and develop our own BLAS-level implementation using just under 40 lines of C. ### Takeaway diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md index d0757754..9a792917 100644 --- a/content/english/hpc/complexity/levels.md +++ b/content/english/hpc/complexity/levels.md @@ -30,13 +30,55 @@ You get especially frustrated if you had a competitive programming experience. Y Programmers can be put in several "levels" in terms of their software optimization abilities: -0. "Newbie". Those who don't think about performance at all. They usually write in high-level languages, sometimes in declarative / functional languages. Most "programmers" stay there (and there is nothing wrong with it). -1. "Undergraduate student". 
Those who know about Big O notation and are familiar with basic data structures and approaches. LeetCode and CodeForces folks are there. This is also the requirement in getting into big companies — they have a lot of in-house software, large scale, and they are looking for people in the long term, so asking things like programming language. -2. "Graduate student". Those who know that not all operations are created equal; know other cost models such as external memory model (B-tree, external sorting), word model (bitset,) or parallel computing, but still in theory. -3. "Professional developer". Those who know actual timings of these operations. Aware that branch mispredictions are costly, memory is split into cache lines. Knows some basic SIMD techniques. -4. "Performance engineer". Know exactly what happens inside their hardware. Know the difference between latency and bandwidth, know about ports. Knows how to use SIMD and the rest of instruction set effectively. Can read assembly and use profilers. -5. "Intel employee". Knows microarchitecture-specific details. This is outside of the purview of normal engineers. +0. *Newbie*. Those who don't think about performance at all. They usually write in high-level languages, sometimes in declarative / functional languages. Most "programmers" stay there (and there is nothing wrong with it). +1. *Undergraduate student*. Those who know about Big O notation and are familiar with basic data structures and approaches. LeetCode and CodeForces folks are there. This is also roughly what is required to get into big tech companies: they have a lot of large-scale in-house software and hire for the long term, so they test for these fundamentals rather than for knowledge of any particular programming language. +2. *Graduate student*. Those who know that not all operations are created equal and are familiar with other cost models, such as the external memory model (B-trees, external sorting), the word-level model (bitsets), or parallel computing, but still only in theory. +3. *Professional developer*. Those who know the actual timings of these operations, are aware that branch mispredictions are costly and that memory is split into cache lines, and know some basic SIMD techniques. +4. *Performance engineer*. Those who know exactly what happens inside their hardware: the difference between latency and bandwidth, the execution ports, how to use SIMD and the rest of the instruction set effectively. They can read assembly and use profilers. +5. *Intel employee*. Knows microarchitecture-specific details. This is outside of the purview of normal engineers. In this book, we expect that the average reader is somewhere around stage 1, and hopefully by the end of it will get to 4. You should also go through these levels when designing algorithms. First get it working in the first place, then select a bunch of reasonably asymptotically optimal algorithm. Then think about how they are going to work in terms of their memory operations or ability to execute in parallel (even if you consider single-threaded programs, there is still going to be plenty of parallelism inside a core, so this model is extremely ), and then proceed toward actual implementation. Avoid premature optimization, as Knuth once said. + +--- + +For most web services, efficiency doesn't matter, but *latency* does. + +Increasing efficiency is usually not how these services are improved nowadays. + +A pageview usually generates somewhere on the order of 0.1 to 1 cent of revenue. This is a typical rate at which you monetize user attention.
Say, if I simply installed AdSense, I'd be getting something like that — depending on where most of my readers are from and how many of them are using an ad blocker. + +At the same time, a server with a dedicated core and 1GB of RAM (which is an absurdly large amount of resources for a simple web service) costs around a millionth of a dollar per second when amortized. You could fetch 100 photos with that. + +Amazon once ran an experiment where they A/B tested their service with artificial delays and found out that an extra 100ms of latency noticeably decreased revenue. The same holds for most other services: if, say, Twitter breaks your "flow," you are likely to start thinking about something else and leave. If the delay at Google is more than a few seconds, people will just think that Google isn't working and quit. + +Latency can usually be reduced by parallelizing and scaling the system out, which is why distributed systems focus so much on scalability. This part of the book is concerned with improving the *efficiency* of algorithms, which lowers latency as a by-product. + +However, there are still use cases where there is a direct trade-off between quality and the cost of servers: + +- Search is hierarchical. There are usually many layers of more accurate but slower models. The more documents you rank on each layer, the better the final quality. +- Games. They are more enjoyable at a larger scale, but the required computational power grows with it. This includes game AI. +- AI workloads — those that involve large quantities of data, such as language models. Heavier models require more compute, and the bottleneck is usually not the amount of data but the efficiency of the computation. + +There are also inherently sequential algorithms and cases where the resources are constrained: Ctrl+F'ing a large PDF is painful, and so is factoring a large number. + +## Estimating the Impact + +Sometimes the optimization needs to happen in the calling layer. + +simdjson speeds up JSON parsing, but it may be better to not use JSON in the first place. + +Protobuf or other flat binary formats are often a better fit in such cases. + +There is also a chicken-and-egg problem: people don't use an approach much because it is currently too slow to be feasible. + +Optimization also has its own costs: implementation time, bugs, and maintainability. It is perfectly fine that most software in the world is inefficient. + +What does it mean to be a better programmer? Faster programs? Faster development? Fewer bugs? It is a combination of all of those. + +Implementing compiler optimizations or databases is an example of a high-leverage activity because their performance acts as a tax on everything else — which is why you see more people writing books on these particular topics than on software optimization in general. + +--- + +Factorization is kind of useless by itself, but it helps with understanding how to optimize number-theoretic computations in general. The same goes for sorting and binary trees: the specific problems matter less than the general, transferable techniques they teach. diff --git a/content/english/hpc/cpu-cache/_index.md b/content/english/hpc/cpu-cache/_index.md index 484a39dc..ef1bbd6f 100644 --- a/content/english/hpc/cpu-cache/_index.md +++ b/content/english/hpc/cpu-cache/_index.md @@ -5,7 +5,7 @@ weight: 9 In the [previous chapter](../external-memory), we studied computer memory from a theoretical standpoint, using the [external memory model](../external-memory/model) to estimate the performance of memory-bound algorithms.
-While it is more or less accurate for computations involving HDDs and network storage, where in-memory arithmetic is negligibly fast compared to the external I/O operations, it is too imprecise for lower levels in the cache hierarchy, where the costs of these operations become comparable. +While the external memory model is more or less accurate for computations involving HDDs and network storage, where cost of arithmetic operations on in-memory values is negligible compared to external I/O operations, it is too imprecise for lower levels in the cache hierarchy, where the costs of these operations become comparable. To perform more fine-grained optimization of in-memory algorithms, we have to start taking into account the many specific details of the CPU cache system. And instead of studying loads of boring Intel documents with dry specs and theoretically achievable limits, we will estimate these parameters experimentally by running numerous small benchmark programs with access patterns that resemble the ones that often occur in practical code. @@ -34,7 +34,7 @@ Although the CPU can be clocked at 4.1GHz in boost mode, we will perform most ex --> -Due to difficulties in [refraining the compiler from cheating](/hpc/profiling/noise/), the code snippets in this article are slightly simplified for exposition purposes. Check the [code repository](https://github.com/sslotin/amh-code/tree/main/cpu-cache) if you want to reproduce them yourself. +Due to difficulties in [preventing the compiler from optimizing away unused values](/hpc/profiling/noise/), the code snippets in this article are slightly simplified for exposition purposes. Check the [code repository](https://github.com/sslotin/amh-code/tree/main/cpu-cache) if you want to reproduce them yourself. ### Acknowledgements diff --git a/content/english/hpc/cpu-cache/alignment.md b/content/english/hpc/cpu-cache/alignment.md index 32c54b6d..e9c5f4d3 100644 --- a/content/english/hpc/cpu-cache/alignment.md +++ b/content/english/hpc/cpu-cache/alignment.md @@ -33,7 +33,7 @@ struct alignas(64) Data { }; ``` -Whenever an instance of `Data` is allocated, it will be at the beginning of a cache line. The downside is that the effective size of the structure will be rounded up to the nearest multiple of 64 bytes. This has to be done so that, e. g. when allocating an array of `Data`, not just the first element is properly aligned. +Whenever an instance of `Data` is allocated, it will be at the beginning of a cache line. The downside is that the effective size of the structure will be rounded up to the nearest multiple of 64 bytes. This has to be done so that, e.g., when allocating an array of `Data`, not just the first element is properly aligned. ### Structure Alignment @@ -77,7 +77,7 @@ This potentially wastes space but saves a lot of CPU cycles. This trade-off is m ### Optimizing Member Order -Padding is only inserted before a not-yet-aligned member or at the end of the structure. By changing the ordering of members in a structure, it is possible to change the required amount of padding bytes and the total size of the structure. +Padding is only inserted before a not-yet-aligned member or at the end of the structure. By changing the ordering of members in a structure, it is possible to change the required number of padding bytes and the total size of the structure. 
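To make the padding effect concrete, here is a small self-contained sketch (the `Bad` and `Good` structs are hypothetical, not the article's running example) showing how member order alone changes the size reported by `sizeof` on a typical 64-bit platform:

```c++
#include <cstdint>
#include <cstdio>

struct Bad {       // 1 + 7 (padding) + 8 + 4 + 4 (tail padding) = 24 bytes
    uint8_t a;
    uint64_t b;
    uint32_t c;
};

struct Good {      // 8 + 4 + 1 + 3 (tail padding) = 16 bytes
    uint64_t b;
    uint32_t c;
    uint8_t a;
};

int main() {
    printf("%zu %zu\n", sizeof(Bad), sizeof(Good)); // typically prints "24 16"
    return 0;
}
```

The exact numbers depend on the ABI, but the greedy largest-to-smallest ordering described next removes all internal padding here.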
In the previous example, we could reorder the structure members like this: @@ -94,7 +94,7 @@ Now, each of them is aligned without any padding, and the size of the structure As a rule of thumb, place your type definitions from largest data types to smallest — this greedy algorithm is guaranteed to work unless you have some weird non-power-of-two type sizes such as the [10-byte](/hpc/arithmetic/ieee-754#float-formats) `long double`[^extended]. -[^extended]: The 80-bit `long double` takes *at least* 10 bytes, but the exact format is up to the compiler — e. g. it may pad it to 12 or 16 bytes to minimize alignment issues (64-bit GCC and Clang use 16 bytes by default; you can override this by specifying one of `-mlong-double-64/80/128` or `-m96/128bit-long-double` [options](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html)). +[^extended]: The 80-bit `long double` takes *at least* 10 bytes, but the exact format is up to the compiler — for example, it may pad it to 12 or 16 bytes to minimize alignment issues (64-bit GCC and Clang use 16 bytes by default; you can override this by specifying one of `-mlong-double-64/80/128` or `-m96/128bit-long-double` [options](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html)). + +## B− Tree + +Instead of making small incremental improvements like we usually do in other case studies, in this article, we will implement just one data structure that we name *B− tree*, which is based on the [B+ tree](../s-tree/#b-tree-layout-1), with a few minor differences: + +- Nodes in the B− tree do not store pointers or any metadata except for the pointers to internal node children (while the B+ tree leaf nodes store a pointer to the next leaf node). This lets us perfectly place the keys in the leaf nodes on cache lines. +- We define key $i$ to be the *maximum* key in the subtree of the child $i$ instead of the *minimum* key in the subtree of the child $(i + 1)$. This lets us not fetch any other nodes after we reach a leaf (in the B+ tree, all keys in the leaf node may be less than the search key, so we need to go to the next leaf node to fetch its first element). + +We also use a node size of $B=32$, which is smaller than typical. The reason why it is not $16$, which was [optimal for the S+ tree](../s-tree/#modifications-and-further-optimizations), is because we have the additional overhead associated with fetching the pointer, and the benefit of reducing the tree height by ~20% outweighs the cost of processing twice the elements per node, and also because it improves the running time of the `insert` query that needs to perform a costly node split every $\frac{B}{2}$ insertions on average. + + + +### Memory Layout + +Although this is probably not the best approach in terms of software engineering, we will simply store the entire tree in a large pre-allocated array, without discriminating between leaves and internal nodes: + +```c++ +const int R = 1e8; +alignas(64) int tree[R]; +``` + +We also pre-fill this array with infinities to simplify the implementation: + +```c++ +for (int i = 0; i < R; i++) + tree[i] = INT_MAX; +``` + +(In general, it is technically cheating to compare against `std::set` or other structures that use `new` under the hood, but memory allocation and initialization are not the bottlenecks here, so this does not significantly affect the evaluation.) 
+ +Both nodes types store their keys sequentially in sorted order and are identified by the index of its first key in the array: + +- A leaf node has up to $(B - 1)$ keys but is padded to $B$ elements with infinities. +- An internal node has up to $(B - 2)$ keys padded to $B$ elements and up to $(B - 1)$ indices of its child nodes, also padded to $B$ elements. + +These design decisions are not arbitrary: + +- The padding ensures that leaf nodes occupy exactly 2 cache lines and internal nodes occupy exactly 4 cache lines. +- We specifically use [indices instead of pointers](/hpc/cpu-cache/pointers/) to save cache space and make moving them with SIMD faster. + (We will use "pointer" and "index" interchangeably from now on.) +- We store indices right after the keys even though they are stored in separate cache lines because [we have reasons](/hpc/cpu-cache/aos-soa/). +- We intentionally "waste" one array cell in leaf nodes and $2+1=3$ cells in internal nodes because we need it to store temporary results during a node split. + +Initially, we only have one empty leaf node as the root: + +```c++ +const int B = 32; + +int root = 0; // where the keys of the root start +int n_tree = B; // number of allocated array cells +int H = 1; // current tree height +``` + +To "allocate" a new node, we simply increase `n_tree` by $B$ if it is a leaf node or by $2 B$ if it is an internal node. + +Since new nodes can only be created by splitting a full node, each node except for the root will be at least half full. This implies that we need between 4 and 8 bytes per integer element (the internal nodes will contribute $\frac{1}{16}$-th or so to that number), the former being the case when the inserts are sequential, and the latter being the case when the input is adversarial. When the queries are uniformly distributed, the nodes are ~75% full on average, projecting to ~5.2 bytes per element. + +B-trees are very memory-efficient compared to the pointer-based binary trees. For example, `std::set` needs at least three pointers (the left child, the right child, and the parent), alone costing $3 \times 8 = 24$ bytes, plus at least another $8$ bytes to store the key and the meta-information due to [structure padding](/hpc/cpu-cache/alignment/). + +### Searching + +It is a very common scenario when >90% of operations are lookups, and even if this is not the case, every other tree operation typically begins with locating a key anyway, so we will start with implementing and optimizing the searches. + +When we implemented [S-trees](../s-tree/#optimization), we ended up storing the keys in permuted order due to the intricacies of how the blending/packs instructions work. For the *dynamic tree* problem, storing the keys in permuted order would make inserts much harder to implement, so we will change the approach instead. + +An alternative way to think about finding the would-be position of the element `x` in a sorted array is not "the index of the first element that is not less than `x`" but "the number of elements that are less than `x`." This observation generates the following idea: compare the keys against `x`, aggregate the vector masks into a 32-bit mask (where each bit can correspond to any element as long as the mapping is bijective), and then call `popcnt` on it, returning the number of elements less than `x`. 
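Before vectorizing it, here is a scalar sketch of the same counting idea, assuming the $B = 32$ node layout described above; it is only a reference, not the code we will actually use:

```c++
// node points to the 32 keys of one node (sorted and padded with INT_MAX)
int rank_scalar(int x, const int *node) {
    unsigned mask = 0;
    for (int i = 0; i < 32; i++)
        mask |= unsigned(node[i] < x) << i; // bit i is set iff key i is less than x
    return __builtin_popcount(mask);        // the number of keys less than x
}
```

The vectorized version below does exactly this, except that the comparisons and the mask aggregation are done 8 elements at a time.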
+ +This trick lets us perform the local search efficiently and without requiring any shuffling: + +```c++ +typedef __m256i reg; + +reg cmp(reg x, int *node) { + reg y = _mm256_load_si256((reg*) node); + return _mm256_cmpgt_epi32(x, y); +} + +// returns how many keys are less than x +unsigned rank32(reg x, int *node) { + reg m1 = cmp(x, node); + reg m2 = cmp(x, node + 8); + reg m3 = cmp(x, node + 16); + reg m4 = cmp(x, node + 24); + + // take lower 16 bits from m1/m3 and higher 16 bits from m2/m4 + m1 = _mm256_blend_epi16(m1, m2, 0b01010101); + m3 = _mm256_blend_epi16(m3, m4, 0b01010101); + m1 = _mm256_packs_epi16(m1, m3); // can also use blendv here, but packs is simpler + + unsigned mask = _mm256_movemask_epi8(m1); + return __builtin_popcount(mask); +} +``` + +Note that, because of this procedure, we have to pad the "key area" with infinities, which prevents us from storing metadata in the vacated cells (unless we are also willing to spend a few cycles to mask it out when loading a SIMD lane). + +Now, to implement `lower_bound`, we can descend the tree just like we did in the S+ tree, but fetching the pointer after we compute the child number: + +```c++ +int lower_bound(int _x) { + unsigned k = root; + reg x = _mm256_set1_epi32(_x); + + for (int h = 0; h < H - 1; h++) { + unsigned i = rank32(x, &tree[k]); + k = tree[k + B + i]; + } + + unsigned i = rank32(x, &tree[k]); + + return tree[k + i]; +} +``` + +Implementing search is easy, and it doesn't introduce much overhead. The hard part is implementing insertion. + +### Insertion + +On the one side, correctly implementing insertion takes a lot of code, but on the other, most of that code is executed very infrequently, so we don't have to care about its performance that much. Most often, all we need to do is to reach the leaf node (which we've already figured out how to do) and then insert a new key into it, moving some suffix of the keys one position to the right. Occasionally, we also need to split the node and/or update some ancestors, but this is relatively rare, so let's focus on the most common execution path first. + +To insert a key into an array of $(B - 1)$ sorted elements, we can load them in vector registers and then [mask-store](/hpc/simd/masking) them one position to the right using a [precomputed](/hpc/compilation/precalc/) mask that tells which elements need to be written for a given `i`: + +```c++ +struct Precalc { + alignas(64) int mask[B][B]; + + constexpr Precalc() : mask{} { + for (int i = 0; i < B; i++) + for (int j = i; j < B - 1; j++) + // everything from i to B - 2 inclusive needs to be moved + mask[i][j] = -1; + } +}; + +constexpr Precalc P; + +void insert(int *node, int i, int x) { + // need to iterate right-to-left to not overwrite the first element of the next lane + for (int j = B - 8; j >= 0; j -= 8) { + // load the keys + reg t = _mm256_load_si256((reg*) &node[j]); + // load the corresponding mask + reg mask = _mm256_load_si256((reg*) &P.mask[i][j]); + // mask-write them one position to the right + _mm256_maskstore_epi32(&node[j + 1], mask, t); + } + node[i] = x; // finally, write the element itself +} +``` + +This [constexpr magic](/hpc/compilation/precalc/) is the only C++ feature we use. + +There are other ways to do it, some possibly more efficient, but we are going to stop there for now. 
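For reference, one of those other ways is a plain scalar shift with `memmove`; this is a sketch assuming the same `B = 32` node layout, simpler but without SIMD:

```c++
#include <cstring>

constexpr int B = 32; // node size, as defined earlier

void insert_scalar(int *node, int i, int x) {
    // shift keys i..(B - 2) one position to the right (the last cell is padding)
    std::memmove(&node[i + 1], &node[i], (B - 1 - i) * sizeof(int));
    node[i] = x;
}
```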
+ +When we split a node, we need to move half of the keys to another node, so let's write another primitive that does it: + +```c++ +// move the second half of a node and fill it with infinities +void move(int *from, int *to) { + const reg infs = _mm256_set1_epi32(INT_MAX); + for (int i = 0; i < B / 2; i += 8) { + reg t = _mm256_load_si256((reg*) &from[B / 2 + i]); + _mm256_store_si256((reg*) &to[i], t); + _mm256_store_si256((reg*) &from[B / 2 + i], infs); + } +} +``` + +With these two vector functions implemented, we can now very carefully implement insertion: + +```c++ +void insert(int _x) { + // the beginning of the procedure is the same as in lower_bound, + // except that we save the path in case we need to update some of our ancestors + unsigned sk[10], si[10]; // k and i on each iteration + // ^------^ We assume that the tree height does not exceed 10 + // (which would require at least 16^10 elements) + + unsigned k = root; + reg x = _mm256_set1_epi32(_x); + + for (int h = 0; h < H - 1; h++) { + unsigned i = rank32(x, &tree[k]); + + // optionally update the key i right away + tree[k + i] = (_x > tree[k + i] ? _x : tree[k + i]); + sk[h] = k, si[h] = i; // and save the path + + k = tree[k + B + i]; + } + + unsigned i = rank32(x, &tree[k]); + + // we can start computing the is-full check before insertion completes + bool filled = (tree[k + B - 2] != INT_MAX); + + insert(tree + k, i, _x); + + if (filled) { + // the node needs to be split, so we create a new leaf node + move(tree + k, tree + n_tree); + + int v = tree[k + B / 2 - 1]; // new key to be inserted + int p = n_tree; // pointer to the newly created node + + n_tree += B; + + for (int h = H - 2; h >= 0; h--) { + // ascend and repeat until we reach the root or find a the node is not split + k = sk[h], i = si[h]; + + filled = (tree[k + B - 3] != INT_MAX); + + // the node already has a correct key (the right one) + // and a correct pointer (the left one) + insert(tree + k, i, v); + insert(tree + k + B, i + 1, p); + + if (!filled) + return; // we're done + + // create a new internal node + move(tree + k, tree + n_tree); // move keys + move(tree + k + B, tree + n_tree + B); // move pointers + + v = tree[k + B / 2 - 1]; + tree[k + B / 2 - 1] = INT_MAX; + + p = n_tree; + n_tree += 2 * B; + } + + // if reach here, this means we've reached the root, + // and it was split into two, so we need a new root + tree[n_tree] = v; + + tree[n_tree + B] = root; + tree[n_tree + B + 1] = p; + + root = n_tree; + n_tree += 2 * B; + H++; + } +} +``` + +There are many inefficiencies, but, luckily, the body of `if (filled)` is executed very infrequently — approximately every $\frac{B}{2}$ insertions — and the insertion performance is not really our top priority, so we will just leave it there. + +## Evaluation + +We have only implemented `insert` and `lower_bound`, so this is what we will measure. + +We want the evaluation to take a reasonable time, so our benchmark is a loop that alternates between two steps: + +- Increase the structure size from $1.17^k$ to $1.17^{k+1}$ using individual `insert`s and measure the time it took. +- Perform $10^6$ random `lower_bound` queries and measure the time it took. + +We start at the size $10^4$ and end at $10^7$, for around $50$ data points in total. We generate the data for both query types uniformly in the $[0, 2^{30})$ range and independently between the stages. 
Since the data generation process allows for repeated keys, we compared against `std::multiset` and `absl::btree_multiset`[^absl], although we still refer to them as `std::set` and `absl::btree` for brevity. We also enable [hugepages](/hpc/cpu-cache/paging) on the system level for all three runs. + +[^absl]: If you also think that only comparing with Abseil's B-tree is not convincing enough, [feel free](https://github.com/sslotin/amh-code/tree/main/b-tree) to add your favorite search tree to the benchmark. + + + +The performance of the B− tree matches what we originally predicted — at least for the lookups: + +![](../img/btree-absolute.svg) + +The relative speedup varies with the structure size — 7-18x/3-8x over STL and 3-7x/1.5-2x over Abseil: + +![](../img/btree-relative.svg) + +Insertions are only 1.5-2 faster than for `absl::btree`, which uses scalar code to do everything. My best guess why insertions are *that* slow is due to data dependency: since the tree nodes may change, the CPU can't start processing the next query before the previous one finishes (the [true latency](../s-tree/#comparison-with-stdlower_bound) of both queries is roughly equal and ~3x of the reciprocal throughput of `lower_bound`). + +![](../img/btree-absl.svg) + +When the structure size is small, the [reciprocal throughput](../s-tree/#comparison-with-stdlower_bound) of `lower_bound` increases in discrete steps: it starts with 3.5ns when there is only the root to visit, then grows to 6.5ns (two nodes), and then to 12ns (three nodes), and then hits the L2 cache (not shown on the graphs) and starts increasing more smoothly, but still with noticeable spikes when the tree height increases. + +Interestingly, B− tree outperforms `absl::btree` even when it only stores a single key: it takes around 5ns stalling on [branch misprediction](/hpc/pipelining/branching/), while (the search in) the B− tree is entirely branchless. + +### Possible Optimizations + +In our previous endeavors in data structure optimization, it helped a lot to make as many variables as possible compile-time constants: the compiler can hardcode these constants into the machine code, simplify the arithmetic, unroll all the loops, and do many other nice things for us. + +This would not be a problem at all if our tree were of constant height, but it is not. It is *largely* constant, though: the height rarely changes, and in fact, under the constraints of the benchmark, the maximum height was only 6. + +What we can do is pre-compile the `insert` and `lower_bound` functions for several different compile-time constant heights and switch between them as the tree grows. The idiomatic C++ way is to use virtual functions, but I prefer to be explicit and use raw function pointers like this: + +```c++ +void (*insert_ptr)(int); +int (*lower_bound_ptr)(int); + +void insert(int x) { + insert_ptr(x); +} + +int lower_bound(int x) { + return lower_bound_ptr(x); +} +``` + +We now define template functions that have the tree height as a parameter, and in the grow-tree block inside the `insert` function, we change the pointers as the tree grows: + +```c++ +template +void insert_impl(int _x) { + // ... +} + +template +void insert_impl(int _x) { + // ... + if (/* tree grows */) { + // ... 
+ insert_ptr = &insert_impl; + lower_bound_ptr = &lower_bound_impl; + } +} + +template <> +void insert_impl<10>(int x) { + std::cerr << "This depth was not supposed to be reached" << std::endl; + exit(1); +} +``` + + + +I tried but could not get any performance improvement with this, but I still have high hope for this approach because the compiler can (theoretically) remove `sk` and `si`, completely removing any temporary storage and only reading and computing everything once, greatly optimizing the `insert` procedure. + +Insertion can also probably be optimized by using a larger block size as node splits would become rare, but this comes at the cost of slower lookups. We could also try different node sizes for different layers: leaves should probably be larger than the internal nodes. + +**Another idea** is to move extra keys on insert to a sibling node, delaying the node split as long as possible. + +One such particular modification is known as the B* tree. It moves the last key to the next node if the current one is full, and when both nodes become full, it jointly splits both of them, producing three nodes that are ⅔ full. This reduces the memory overhead (the nodes will be ⅚ full on average) and increases the fanout factor, reducing the height, which helps all operations. + +This technique can even be extended to, say, three-to-four splits, although further generalization would come at the cost of a slower `insert`. + +**And yet another idea** is to get rid of (some) pointers. For example, for large trees, we can probably afford a small [S+ tree](../s-tree) for $16 \cdot 17$ or so elements as the root, which we rebuild from scratch on each infrequent occasion when it changes. You can't extend it to the whole tree, unfortunately: I believe there is a paper somewhere saying that we can't turn a dynamic structure fully implicit without also having to do $\Omega(\sqrt n)$ operations per query. + +We could also try some non-tree data structures, such as the [skip list](https://en.wikipedia.org/wiki/Skip_list). There has even been a [successful attempt to vectorize it](https://doublequan.github.io/) — although the speedup was not that impressive. I have low hope that skip-list, in particular, can be improved, although it may achieve a higher total throughput in the concurrent setting. + +### Other Operations + +To *delete* a key, we can similarly locate and remove it from a node with the same mask-store trick. After that, if the node is at least half-full, we're done. Otherwise, we try to borrow a key from the next sibling. If the sibling has more than $\frac{B}{2}$ keys, we append its first key and shift its keys one to the left. Otherwise, both the current node and the next node have less than $\frac{B}{2}$ keys, so we can merge them, after which we go to the parent and iteratively delete a key there. + +Another thing we may want to implement is *iteration*. Bulk-loading each key from `l` to `r` is a very common pattern — for example, in `SELECT abc ORDER BY xyz` type of queries in databases — and B+ trees usually store pointers to the next node in the data layer to allow for this type of rapid iteration. In B− trees, as we're using a much smaller node size, we can experience [pointer chasing](/hpc/cpu-cache/latency/) problems if we do this. Going to the parent and reading all its $B$ pointers is probably faster as it negates this problem. 
Therefore, a stack of ancestors (the `sk` and `si` arrays we used in `insert`) can serve as an iterator and may even be better than separately storing pointers in nodes. + +We can easily implement almost everything that `std::set` does, but the B− tree, like any other B-tree, is very unlikely to become a drop-in replacement to `std::set` due to the requirement of pointer stability: a pointer to an element should remain valid unless the element is deleted, which is hard to achieve when we split and merge nodes all the time. This is a major problem not only for search trees but most data structures in general: having both pointer stability and high performance at the same time is next to impossible. + + + +## Acknowledgements + +Thanks to [Danila Kutenin](https://danlark.org/) from Google for meaningful discussions of applicability and the usage of B-trees in Abseil. + + diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 61aec502..6426ddde 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -1,15 +1,18 @@ --- title: Binary Search weight: 1 +published: true --- + + While improving the speed of user-facing applications is the end goal of performance engineering, people don't really get excited over 5-10% improvements in some databases. Yes, this is what software engineers are paid for, but these types of optimizations tend to be too intricate and system-specific to be readily generalized to other software. Instead, the most fascinating showcases of performance engineering are multifold optimizations of textbook algorithms: the kinds that everybody knows and deemed so simple that it would never even occur to try to optimize them in the first place. These optimizations are simple and instructive and can very much be adopted elsewhere. And they are surprisingly not as rare as you'd think. -In this article, we focus on such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code. +In this section, we focus on one such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code. The first algorithm achieves that by removing [branches](/hpc/pipelining/branching), and the second also optimizes the memory layout to achieve better [cache system](/hpc/cpu-cache) performance. This technically disqualifies it from being a drop-in replacement for `std::lower_bound` as it needs to permute the elements of the array before it can start answering queries — but I can't recall a lot of scenarios where you obtain a sorted array but can't afford to spend linear time on preprocessing. @@ -20,7 +23,7 @@ The first algorithm achieves that by removing [branches](/hpc/pipelining/branchi --> -The usual disclaimer: the CPU is a [Zen 2](https://www.7-cpu.com/cpu/Zen2.html), the RAM is a [DDR4-2666](http://localhost:1313/hpc/cpu-cache/), and the compiler we will be using by default is Clang 10. The performance on your machine may be different, so I highly encourage to [go and test it](https://godbolt.org/z/14rd5Pnve) for yourself. 
+The usual disclaimer: the CPU is a [Zen 2](https://www.7-cpu.com/cpu/Zen2.html), the RAM is a [DDR4-2666](/hpc/cpu-cache/), and the compiler we will be using by default is Clang 10. The performance on your machine may be different, so I highly encourage to [go and test it](https://godbolt.org/z/14rd5Pnve) for yourself. + +Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely. + +As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the function is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays: -As typical for predication, this trick is very fragile to compiler optimizations. It doesn't make a difference on Clang — for some reason, it replaces the ternary operator with a branch anyway — but it works fine on GCC (9.3), yielding a 2.5-3x improvement on small arrays: + ![](../img/search-branchless.svg) @@ -162,20 +202,22 @@ int lower_bound(int x) { int *base = t, len = n; while (len > 1) { int half = len / 2; - __builtin_prefetch(&base[(len - half) / 2]); - __builtin_prefetch(&base[half + (len - half) / 2]); - base = (base[half] < x ? &base[half] : base); len -= half; + __builtin_prefetch(&base[len / 2 - 1]); + __builtin_prefetch(&base[half + len / 2 - 1]); + base += (base[half - 1] < x) * half; } - return *(base + (*base < x)); + return *base; } ``` + + With prefetching, the performance on large arrays becomes roughly the same: ![](../img/search-branchless-prefetch.svg) -The graph still grows faster as the branchy version also prefetches "grandchildren", "grand-grandchildren", and so on — although the usefulness of each new speculative read diminishes exponentially as the prediction is less and less likely to be correct. +The graph still grows faster as the branchy version also prefetches "grandchildren," "great-grandchildren," and so on — although the usefulness of each new speculative read diminishes exponentially as the prediction is less and less likely to be correct. In the branchless version, we could also fetch ahead by more than one layer, but the number of fetches we'd need also grows exponentially. Instead, we will try a different approach to optimize memory operations. @@ -248,7 +290,7 @@ Apart from being compact, it has some nice properties, like that all even-number Here is how this layout looks when applied to binary search: -![](../img/eytzinger.png) +![Note that the tree is slightly imbalanced (because of the last layer is continuous)](../img/eytzinger.png) When searching in this layout, we just need to start from the first element of the array, and then on each iteration jump to either $2 k$ or $(2k + 1)$, depending on how the comparison went: @@ -278,15 +320,17 @@ void eytzinger(int k = 1) { } ``` -This function takes the current node number `k`, recursively writes out all elements to the left of the middle of the search interval, writes out the current element we'd compare against, and then recursively writes out all the elements on the right. 
It seems a bit complicated, but to convince ourselves that it works, we only need three observations: +This function takes the current node number `k`, recursively writes out all elements to the left of the middle of the search interval, writes out the current element we'd compare against, and then recursively writes out all the elements on the right. It seems a bit complicated, but to convince yourself that it works, you only need three observations: - It writes exactly `n` elements as we enter the body of `if` for each `k` from `1` to `n` just once. - It writes out sequential elements from the original array as it increments the `i` pointer each time. -- By the time we write the element at node `k`, we have already written all the elements to its left (exactly `i`). +- By the time we write the element at node `k`, we will have already written all the elements to its left (exactly `i`). + +Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time. Maintaining the permutation is both logically and computationally harder to maintain though: adding an element to a sorted array only requires shifting a suffix of its elements one position to the right, while Eytzinger array practically needs to be rebuilt from scratch. -Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time. +Note that this traversal and the resulting permutation are not exactly equivalent to the "tree" of vanilla binary search: for example, the left child subtree may be larger than the right child subtree — up to twice as large — but it doesn't matter much since both approaches result in the same $\lceil \log_2 n \rceil$ tree depth. -Note that the Eytzinger array is one-indexed — this will be important for performance later. You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`). +Also note that the Eytzinger array is one-indexed — this will be important for performance later. You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`). ### Search Implementation @@ -298,22 +342,35 @@ while (k <= n) k = 2 * k + (t[k] < x); ``` -The only problem arises when we need to restore the index of the resulting element, as $k$ may end up not pointing to a leaf node. Here is an example of how that can happen: +The only problem arises when we need to restore the index of the resulting element, as $k$ does not directly point to it. Consider this example (its corresponding tree is listed above): -``` - array: 1 2 3 4 5 6 7 8 -eytzinger: 4 2 5 1 6 3 7 8 -1st range: --------------- k := 1 -2nd range: ------- k := 2*k (=2) -3rd range: --- k := 2*k + 1 (=5) -4th range: - k := 2*k + 1 (=11) -``` + + +
+    array:  0 1 2 3 4 5 6 7 8 9
+eytzinger:  6 3 7 1 5 8 9 0 2 4
+1st range:  ------------?------  k := 2*k     = 2   (6 ≥ 3)
+2nd range:  ------?------        k := 2*k     = 4   (3 ≥ 3)
+3rd range:  --?----              k := 2*k + 1 = 9   (1 < 3)
+4th range:      ?--              k := 2*k + 1 = 19  (2 < 3)
+5th range:        !
+
+ + -Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $4$, $2$, and $5$, go left-right-right, and end up with $k = 11$, which isn't even a valid array index. +Here we query the array of $[0, …, 9]$ for the lower bound of $x=3$. We compare it against $6$, $3$, $1$, and $2$, go left-left-right-right, and end up with $k = 19$, which isn't even a valid array index. -The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (i. e. leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns. +The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we go left exactly once and then keep going right until we reach a leaf (because we will only be comparing $x$ against lesser elements). Therefore, to restore the answer, we just need to "cancel" some number of right turns and then one more. -This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that amount. To do this, we can invert the number (`~k`) and call the "find first set" instruction: +This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing 1s in the binary representation and right-shift $k$ by exactly that number of bits plus one. To do this, we can invert the number (`~k`) and call the "find first set" instruction: ```c++ int lower_bound(int x) { @@ -359,9 +416,9 @@ This observation extends to the grand-children of node $k$ — they are also sto \end{aligned} --> -Their cache line can also be fetched with one instruction. Interesting… what if we continue this, and instead of fetching direct children, we fetch ahead as many descendants as we can cramp into one cache line? That would be $\frac{64}{4} = 16$ elements, our grand-grand-grandchildren with indices from $16k$ to $(16k + 15)$. +Their cache line can also be fetched with one instruction. Interesting… what if we continue this, and instead of fetching direct children, we fetch ahead as many descendants as we can cramp into one cache line? That would be $\frac{64}{4} = 16$ elements, our great-great-grandchildren with indices from $16k$ to $(16k + 15)$. -Now, if we prefetch just one of these 16 elements, we will probably only get some but not all of them, as they may cross a cache line boundary. We can prefetch the first *and* the last element, but to get away with just one memory request, we need to notice that the index of the first element, $16k$, is divisible by $16$, so its memory address will be the base address of the array plus something divisible by $16 \cdot 4 = 64$, the cache line size. If the array were to begin on a cache line, then these $16$ grand-gran-grandchildren elements will be guaranteed to be on a single cache line, which is just what we needed. +Now, if we prefetch just one of these 16 elements, we will probably only get some but not all of them, as they may cross a cache line boundary. 
We can prefetch the first *and* the last element, but to get away with just one memory request, we need to notice that the index of the first element, $16k$, is divisible by $16$, so its memory address will be the base address of the array plus something divisible by $16 \cdot 4 = 64$, the cache line size. If the array were to begin on a cache line, then these $16$ great-great-grandchildren elements will be guaranteed to be on a single cache line, which is just what we needed. Therefore, we only need to [align](/hpc/cpu-cache/alignment) the array: @@ -399,7 +456,7 @@ Also, note that the last few prefetch requests are actually not needed, and in f This prefetching technique allows us to read up to four elements ahead, but it doesn't really come for free — we are effectively trading off excess memory [bandwidth](/hpc/cpu-cache/bandwidth) for reduced [latency](/hpc/cpu-cache/latency). If you run more than one instance at a time on separate hardware threads or just any other memory-intensive computation in the background, it will significantly [affect](/hpc/cpu-cache/sharing) the benchmark performance. -But we can do better. Instead of fetching four cache lines at a time, we could fetch four times *fewer* cache lines. And in the [next article](../s-tree), we will explore the approach. +But we can do better. Instead of fetching four cache lines at a time, we could fetch four times *fewer* cache lines. And in the [next section](../s-tree), we will explore the approach. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/content/english/hpc/data-structures/img/btree-absolute.svg b/content/english/hpc/data-structures/img/btree-absolute.svg new file mode 100644 index 00000000..6709908f --- /dev/null +++ b/content/english/hpc/data-structures/img/btree-absolute.svg @@ -0,0 +1,1430 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
diff --git a/content/english/hpc/data-structures/img/btree-relative.svg b/content/english/hpc/data-structures/img/btree-relative.svg
new file mode 100644
index 00000000..e40210ff
--- /dev/null
+++ b/content/english/hpc/data-structures/img/btree-relative.svg
@@ -0,0 +1,1505 @@
+ [new SVG figure; vector image data omitted]
diff --git a/content/english/hpc/data-structures/img/eytzinger.png b/content/english/hpc/data-structures/img/eytzinger.png
index 97237c73..901efdd2 100644
Binary files a/content/english/hpc/data-structures/img/eytzinger.png and b/content/english/hpc/data-structures/img/eytzinger.png differ
diff --git a/content/english/hpc/data-structures/img/eytzinger_old.png b/content/english/hpc/data-structures/img/eytzinger_old.png
new file mode 100644
index 00000000..97237c73
Binary files /dev/null and b/content/english/hpc/data-structures/img/eytzinger_old.png differ
diff --git a/content/english/hpc/data-structures/img/src/eytzinger.svg b/content/english/hpc/data-structures/img/src/eytzinger.svg
new file mode 100644
index 00000000..da565f0d
--- /dev/null
+++ b/content/english/hpc/data-structures/img/src/eytzinger.svg
@@ -0,0 +1,454 @@
+ [source SVG of the Eytzinger layout figure (node labels 0-9); vector markup omitted]
diff --git a/content/english/hpc/data-structures/s-tree.md b/content/english/hpc/data-structures/s-tree.md
index 216ba4bb..875f72ec 100644
--- a/content/english/hpc/data-structures/s-tree.md
+++ b/content/english/hpc/data-structures/s-tree.md
@@ -3,9 +3,9 @@ title: Static B-Trees
weight: 2
---

-This article is a follow-up to the [previous one](../binary-search), where we optimized binary search by the means of removing branching and improving the memory layout. Here, we will also be searching over sorted arrays, but this time we are not limited to fetching and comparing only one element at a time.
+This section is a follow-up to the [previous one](../binary-search), where we optimized binary search by the means of removing branching and improving the memory layout. Here, we will also be searching in sorted arrays, but this time we are not limited to fetching and comparing only one element at a time.
-In this article, we generalize the techniques we developed for binary search to *static B-trees* and accelerate them further using [SIMD instructions](/hpc/simd). In particular, we develop two new implicit data structures: +In this section, we generalize the techniques we developed for binary search to *static B-trees* and accelerate them further using [SIMD instructions](/hpc/simd). In particular, we develop two new implicit data structures: - The [first](#b-tree-layout) is based on the memory layout of a B-tree, and, depending on the array size, it is up to 8x faster than `std::lower_bound` while using the same space as the array and only requiring a permutation of its elements. - The [second](#b-tree-layout-1) is based on the memory layout of a B+ tree, and it is up to 15x faster than `std::lower_bound` while using just 6-7% more memory — or 6-7% **of** the memory if we can keep the original sorted array. @@ -102,7 +102,19 @@ int i = __builtin_ffs(mask) - 1; // now i is the number of the correct child node ``` -Unfortunately, the compilers are not smart enough yet to auto-vectorize this code, so we need to manually vectorize it with intrinsics: +Unfortunately, the compilers are not smart enough to [auto-vectorize](/hpc/simd/auto-vectorization/) this code yet, so we have to optimize it manually. In AVX2, we can load 8 elements, compare them against the search key, producing a [vector mask](/hpc/simd/masking/), and then extract the scalar mask from it with `movemask`. Here is a minimized illustrated example of what we want to do: + +```center + y = 4 17 65 103 + x = 42 42 42 42 + y ≥ x = 00000000 00000000 11111111 11111111 + ├┬┬┬─────┴────────┴────────┘ +movemask = 0011 + ┌─┘ + ffs = 3 +``` + +Since we are limited to processing 8 elements at a time (half our block / cache line size), we have to split the elements into two groups and then combine the two 8-bit masks. To do this, it will be slightly easier to swap the condition for `x > y` and compute the inverted mask instead: ```c++ typedef __m256i reg; @@ -114,7 +126,7 @@ int cmp(reg x_vec, int* y_ptr) { } ``` -This function works for 8-element vectors, which is half our block / cache line size. To process the entire block, we need to call it twice and then combine the masks: +Now, to process the entire block, we need to call it twice and combine the masks: ```c++ int mask = ~( @@ -123,7 +135,7 @@ int mask = ~( ); ``` -Now, to descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier: +To descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier: ```c++ int i = __builtin_ffs(mask) - 1; @@ -301,7 +313,7 @@ It doesn't feel very satisfying so far, but we will reuse these optimization ide There are two main problems with the current implementation: - The `update` procedure is quite costly, especially considering that it is very likely going to be useless: 16 out of 17 times, we can just fetch the result from the last block. -- We do a non-constant number of iterations, causing branch prediction problems similar to how it did for the [Eytzinger binary search](/binary-search/#removing-the-last-branch); you can also see it on the graph this time, but the latency bumps have a period of $2^4$. 
+- We do a non-constant number of iterations, causing branch prediction problems similar to how it did for the [Eytzinger binary search](../binary-search/#removing-the-last-branch); you can also see it on the graph this time, but the latency bumps have a period of $2^4$.

To address these problems, we need to change the layout a little bit.

@@ -325,7 +337,7 @@ The disadvantage is that this layout is not *succinct*: we need some additional

### Implicit B+ Tree

-To be more explicit with pointer arithmetic, we will store the entire tree in a single one-dimensional array. To minimize index computations during run-time, we will store each layer sequentially in this array and use compile-time computed offsets to address them: the keys of the node number `k` on layer `h` start with `btree[offset(h) + k * B]`, and its `i`-th child will at `btree[offset(h - 1) + (k * (B + 1) + i) * B]`.
+To be more explicit with pointer arithmetic, we will store the entire tree in a single one-dimensional array. To minimize index computations during run time, we will store each layer sequentially in this array and use compile-time computed offsets to address them: the keys of the node number `k` on layer `h` start with `btree[offset(h) + k * B]`, and its `i`-th child will be at `btree[offset(h - 1) + (k * (B + 1) + i) * B]`.

To implement all that, we need slightly more `constexpr` functions:

@@ -335,7 +347,7 @@ constexpr int blocks(int n) {
    return (n + B - 1) / B;
}

-// number of keys on the layer pervious to one with n element
+// number of keys on the layer previous to one with n keys
constexpr int prev_keys(int n) {
    return (blocks(n) + B) / (B + 1) * B;
}
@@ -345,7 +357,7 @@ constexpr int height(int n) {
    return (n <= B ? 1 : height(prev_keys(n)) + 1);
}

-// where the layer h starts (0 is the largest)
+// where the layer h starts (layer 0 is the largest)
constexpr int offset(int h) {
    int k = 0, n = N;
    while (h--) {
@@ -467,7 +479,7 @@ A lot of the performance boost of the S+ tree comes from removing branching and

-Although nobody except maybe the HFT people cares about real latency, and everybody actually measures throughput even when using the word "latency", this nuance is still something to take into account when predicting the possible speedup in user applications.
+Although nobody except maybe the HFT people cares about real latency, and everybody actually measures throughput even when using the word "latency," this nuance is still something to take into account when predicting the possible speedup in user applications.

### Modifications and Further Optimizations

@@ -548,6 +560,7 @@ Other possible minor optimizations include:

- Rewriting the whole thing in assembly, as the compiler seems to struggle with pointer arithmetic.
- Using [blending](/hpc/simd/masking) instead of `packs`: you can odd-even shuffle node keys (`[1 3 5 7] [2 4 6 8]`), compare against the search key, and then blend the low 16 bits of the first register mask with the high 16 bits of the second. Blending is slightly faster on many architectures, and it may also help to alternate between packing and blending as they use different subsets of ports. (Thanks to Const-me from HackerNews for [suggesting](https://news.ycombinator.com/item?id=30381912) it.)
- Using [popcount](/hpc/simd/shuffling/#shuffles-and-popcount) instead of `tzcnt`: the index `i` is equal to the number of keys less than `x`, so we can compare `x` against all keys, combine the vector mask any way we want, call `maskmov`, and then calculate the number of set bits with `popcnt`.
This removes the need to store the keys in any particular order, which lets us skip the permutation step and also use this procedure on the last layer as well. +- Defining the key $i$ as the *maximum* key in the subtree of child $i$ instead of the *minimum* key in the subtree of child $(i + 1)$. The correctness doesn't change, but this guarantees that the result will be stored in the last node we access (and not in the first element of the next neighbor node), which lets us fetch slightly fewer cache lines. Note that the current implementation is specific to AVX2 and may require some non-trivial changes to adapt to other platforms. It would be interesting to port it for Intel CPUs with AVX-512 and Arm CPUs with 128-bit NEON, which may require some [trickery](https://github.com/WebAssembly/simd/issues/131) to work. @@ -583,7 +596,7 @@ My next priorities is to adapt it to segment trees, which I know how to do, and Of course, this comparison is not fair, as implementing a dynamic search tree is a more high-dimensional problem. -We'd also need to implement the update operation, which will not be that efficient, and for which we'd need to sacrifice the fanout factor. But it still seems possible to implement a 10-20x faster `std::set` and a 3-5x faster `absl::btree_set`, depending on how you define "faster" — and this is one of the things we'll attempt to do next. +We'd also need to implement the update operation, which will not be that efficient, and for which we'd need to sacrifice the fanout factor. But it still seems possible to implement a 10-20x faster `std::set` and a 3-5x faster `absl::btree_set`, depending on how you define "faster" — and this is one of the things we'll [attempt to do next](../b-tree). @@ -249,7 +250,7 @@ Apart from requiring much less memory, which is good for fitting into the CPU ca To improve the performance further, we can: -- manually optimize the index arithmetic (e. g. noticing that we need to multiply `v` by `2` either way), +- manually optimize the index arithmetic (e.g., noticing that we need to multiply `v` by `2` either way), - replace division by two with an explicit binary shift (because [compilers aren't always able to do it themselves](/hpc/compilation/contracts/#arithmetic)), - and, most importantly, get rid of [recursion](/hpc/architecture/functions) and make the implementation fully iterative. @@ -329,7 +330,7 @@ int sum(int l, int r) { int s = 0; while (l <= r) { if ( l & 1) s += t[l++]; // l is a right child: add it and move to a cousin - if (~r & 1) s += t[r--]; // r is a light child: add it and move to a cousin + if (~r & 1) s += t[r--]; // r is a left child: add it and move to a cousin l >>= 1, r >>= 1; } return s; @@ -530,7 +531,7 @@ Repeatedly adding the lowest set bit to `k` makes it "more even" and lifts it to ![A path for an update query in a Fenwick tree](../img/fenwick-update.png) -Now, if we leave all the code as it is, it works correctly even when $n$ is not a power of two. In this case, the Fenwick tree is not equivalent to a segment tree fo size $n$ but to a *forest* of up to $O(\log n)$ segment trees of power-of-two sizes — or to a single segment tree padded with zeros to a large power of two, if you like to think this way. In either case, all procedures remain working correctly as they never touch anything outside the $[1, n]$ range. +Now, if we leave all the code as it is, it works correctly even when $n$ is not a power of two. 
In this case, the Fenwick tree is not equivalent to a segment tree of size $n$ but to a *forest* of up to $O(\log n)$ segment trees of power-of-two sizes — or to a single segment tree padded with zeros to a large power of two, if you like to think this way. In either case, all procedures still work correctly as they never touch anything outside the $[1, n]$ range. @@ -592,8 +593,8 @@ constexpr int height(int n) { constexpr int offset(int h) { int s = 0, n = N; while (h--) { - s += (n + B - 1) / B * B; - n /= B; + n = (n + B - 1) / B; + s += n * B; } return s; } @@ -602,14 +603,14 @@ constexpr int H = height(N); alignas(64) int t[offset(H)]; // an array for storing nodes ``` -This way we effectively reduce the height of the tree by approximately $\frac{\log_B n}{\log_2 n} = \log_2 B$ times ($\sim4$ times if $B = 16$), but it becomes non-trivial to implement in-node operations efficiently. For our problem, we have two main options: +This way, we effectively reduce the height of the tree by approximately $\frac{\log_B n}{\log_2 n} = \log_2 B$ times ($\sim4$ times if $B = 16$), but it becomes non-trivial to implement in-node operations efficiently. For our problem, we have two main options: 1. We could store $B$ *sums* in each node (for each of its $B$ children). 2. We could store $B$ *prefix sums* in each node (the $i$-th being the sum of the first $(i + 1)$ children). If we go with the first option, the `add` query would be largely the same as in the bottom-up segment tree, but the `sum` query would need to add up to $B$ scalars in each node it visits. And if we go with the second option, the `sum` query would be trivial, but the `add` query would need to add `x` to some suffix on each node it visits. -In either case, one operation will perform $O(\log_B n)$ operations, touching just one scalar in each node, while the other will perform $O(B \cdot \log_B n)$ operations, touching up to $B$ scalars in each node. However, it is 21st century, and we can use [SIMD](/hpc/simd) to accelerate the slower operation. Since there are no fast [horizontal reductions](/hpc/simd/reduction) in SIMD instruction sets, but it is easy to add a vector to a vector, we will choose the second approach and store prefix sums in each node. +In either case, one operation would perform $O(\log_B n)$ operations, touching just one scalar in each node, while the other would perform $O(B \cdot \log_B n)$ operations, touching up to $B$ scalars in each node. We can, however, use [SIMD](/hpc/simd) to accelerate the slower operation, and since there are no fast [horizontal reductions](/hpc/simd/reduction) in SIMD instruction sets, but it is easy to add a vector to a vector, we will choose the second approach and store prefix sums in each node. This makes the `sum` query extremely fast and easy to implement: @@ -622,7 +623,7 @@ int sum(int k) { } ``` -The `add` query is more complicated and slower. We need to add a number to only a suffix of a node, and we can do this by [masking out](/hpc/simd/masking) the positions that need not be modified. +The `add` query is more complicated and slower. We need to add a number only to a suffix of a node, and we can do this by [masking out](/hpc/simd/masking) the positions that should not be modified. 
We can pre-calculate a $B \times B$ array corresponding to $B$ such masks that tell, for each of $B$ positions within a node, whether a certain prefix sum value needs to be updated or not: @@ -724,7 +725,7 @@ This makes both queries much slower — especially the reduction — but this sh **Minimum** is a nice exception where the update query can be made slightly faster if the new value of the element is less than the current one: we can skip the horizontal reduction part and just update $\log_B n$ nodes using a scalar procedure. -This works very fast when we mostly have such updates, which is the case e. g. for the sparse-graph Dijkstra algorithm when we have more edges than vertices. For this problem, the wide segment tree can serve as an efficient fixed-universe min-heap. +This works very fast when we mostly have such updates, which is the case, e.g., for the sparse-graph Dijkstra algorithm when we have more edges than vertices. For this problem, the wide segment tree can serve as an efficient fixed-universe min-heap. **Lazy propagation** can be done by storing a separate array for the delayed operations in a node. To propagate the updates, we need to go top to bottom (which can be done by simply reversing the direction of the `for` loop and using `k >> (h * b)` to calculate the `h`-th ancestor), [broadcast](/hpc/simd/moving/#broadcast) and reset the delayed operation value stored in the parent of the current node, and apply it to all values stored in the current node with SIMD. diff --git a/content/english/hpc/external-memory/_index.md b/content/english/hpc/external-memory/_index.md index d7c1612c..0af587b3 100644 --- a/content/english/hpc/external-memory/_index.md +++ b/content/english/hpc/external-memory/_index.md @@ -19,7 +19,7 @@ When you fetch anything from memory, the request goes through an incredibly comp --> -When you fetch anything from memory, there is always some non-zero latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through an incredibly complex system of address translation units and caching layers designed to both help in memory management and reduce the latency. +When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce latency. Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored: @@ -27,7 +27,7 @@ Therefore, the only correct answer to this question is "it depends" — primaril - If it was accessed recently, it is probably *cached* and will take less than that to fetch, depending on how long ago it was accessed — it could be ~50 cycles for the slowest layer of cache and around 4-5 cycles for the fastest. - But it could also be stored on some type of *external memory* such as a hard drive, and in this case, it will take around 5ms, or roughly $10^7$ cycles (!) to access it. -Such high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind. 
+Such a high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind. ![](img/memory-vs-compute.png) @@ -41,7 +41,7 @@ It becomes ever more important to optimize Modern computers grow ever more powerful, but their memory systems can't quite pick up with the increase in computing power, because they don't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. -If a CPU core has a frequency of 3 GHz, it roughly means that it is capable of executing up to $3 \cdot 10^9$ operations per second, depending on what constitutes an "operation". This is the baseline: on modern architectures, it can be increased by techniques such as SIMD and instruction-level parallelism up to $10^{11}$ operations per second, if the computation allows it. +If a CPU core has a frequency of 3 GHz, it roughly means that it is capable of executing up to $3 \cdot 10^9$ operations per second, depending on what constitutes an "operation." This is the baseline: on modern architectures, it can be increased by techniques such as SIMD and instruction-level parallelism up to $10^{11}$ operations per second, if the computation allows it. But for many algorithms, the CPU is not the bottleneck. Before trying to optimize performance above that baseline, we need to learn not to drop below it, and the number one reason for this is memory. diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md index 35670da9..26dfc144 100644 --- a/content/english/hpc/external-memory/hierarchy.md +++ b/content/english/hpc/external-memory/hierarchy.md @@ -40,8 +40,8 @@ Everything up to the RAM level is called *volatile memory* because it does not p From fastest to slowest: -- **CPU registers**, which are the zero-time access data cells CPU uses to store all its intermediate values, can also be thought of as a memory type. There is only a limited number of them (e. g. 16 "general purpose" ones), and in some cases, you may want to use all of them for performance reasons. -- **CPU caches.** Modern CPUs have multiple layers of cache (L1, L2, often L3, and rarely even L4). The lowest layer is shared between cores and is usually scaled with their number (e. g. a 10-core CPU should have around 10M of L3 cache). +- **CPU registers**, which are the zero-time access data cells CPU uses to store all its intermediate values, can also be thought of as a memory type. There is only a limited number of them (e.g., just 16 "general purpose" ones), and in some cases, you may want to use all of them for performance reasons. +- **CPU caches.** Modern CPUs have multiple layers of cache (L1, L2, often L3, and rarely even L4). The lowest layer is shared between cores and is usually scaled with their number (e.g., a 10-core CPU should have around 10M of L3 cache). - **Random access memory,** which is the first scalable type of memory: nowadays you can rent machines with half a terabyte of RAM on the public clouds. This is the one where most of your working data is supposed to be stored. The CPU cache system has an important concept of a *cache line*, which is the basic unit of data transfer between the CPU and the RAM. 
The size of a cache line is 64 bytes on most architectures, meaning that all main memory is divided into blocks of 64 bytes, and whenever you request (read or write) a single byte, you are also fetching all its 63 cache line neighbors whether you want them or not.

@@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data.

### Non-Volatile Memory

-While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to be persisted for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms.
+While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them to collide with silicon atoms.

diff --git a/content/english/hpc/external-memory/list-ranking.md b/content/english/hpc/external-memory/list-ranking.md
index 07b33c71..6d7c0053 100644
--- a/content/english/hpc/external-memory/list-ranking.md
+++ b/content/english/hpc/external-memory/list-ranking.md
@@ -50,11 +50,11 @@ List ranking is especially useful in graph algorithms.

For example, we can obtain the Euler tour of a tree in external memory by constructing a linked list from the tree that corresponds to its Euler tour and then applying the list ranking algorithm — the rank of each node will be the same as its index $tin_v$ in the Euler tour. To construct this list, we need to:

-- split each undirected tree edge into two directed ones;
-- duplicate the parent node for each up-edge (because list nodes can only have one incoming edge, but we visit some tree vertices multiple times);
-- route each such node either to the "next sibling", if it has one, or otherwise to its own parent;
+- split each undirected edge into two directed ones;
+- duplicate the parent node for each up-edge (because list nodes can only have one incoming edge, but we visit some vertices multiple times);
+- route each such node either to the "next sibling," if it has one, or otherwise to its own parent;
- and then finally break the resulting cycle at the root.

This general technique is called *tree contraction*, and it serves as the basis for a large number of tree algorithms.

-Exactly the same approach can be applied to parallel algorithms, and we will convert that much more deeply in part 2.
diff --git a/content/english/hpc/external-memory/locality.md b/content/english/hpc/external-memory/locality.md index 8607506d..e61cb5a3 100644 --- a/content/english/hpc/external-memory/locality.md +++ b/content/english/hpc/external-memory/locality.md @@ -23,7 +23,7 @@ In this article, we continue designing algorithms for the external memory model In this context, we can talk about the degree of cache reuse primarily in two ways: -- *Temporal locality* refers to the repeated access of the same data within a relatively small time duration, such that the data likely remains cached between the requests. +- *Temporal locality* refers to the repeated access of the same data within a relatively small time period, such that the data likely remains cached between the requests. - *Spatial locality* refers to the use of elements relatively close to each other in terms of their memory locations, such that they are likely fetched in the same memory block. In other words, temporal locality is when it is likely that this same memory location will soon be requested again, while spatial locality is when it is likely that a nearby location will be requested right after. @@ -34,8 +34,8 @@ In this section, we will do some case studies to show how these high-level conce Consider a divide-and-conquer algorithm such as merge sorting. There are two approaches to implementing it: -- We can implement it recursively, or "depth-first", the way it is normally implemented: sort the left half, sort the right half and then merge the results. -- We can implement it iteratively, or "breadth-first": do the lowest "layer" first, looping through the entire dataset and comparing odd elements with even elements, then merge the first two elements with the second two elements, the third two elements with the fourth two elements and so on. +- We can implement it recursively, or "depth-first," the way it is normally implemented: sort the left half, sort the right half and then merge the results. +- We can implement it iteratively, or "breadth-first:" do the lowest "layer" first, looping through the entire dataset and comparing odd elements with even elements, then merge the first two elements with the second two elements, the third two elements with the fourth two elements and so on. It seems like the second approach is more cumbersome, but faster — because recursion is always slow, right? @@ -47,44 +47,51 @@ In practice, there is still some overhead associated with the recursion, and for ### Dynamic Programming -Similar reasoning can be applied to the implementations of dynamic programming algorithms but leading to the reverse result. Consider the classic knapsack problem, where we got $n$ items with integer costs $c_i$, and we need to pick a subset of items with the maximum total cost that does not exceed a given constant $w$. +Similar reasoning can be applied to the implementations of dynamic programming algorithms but leading to the reverse result. Consider the classic *knapsack problem:* given $N$ items with positive integer costs $c_i$, pick a subset of items with the maximum total cost that does not exceed a given constant $W$. -The way to solve it is to introduce the *state* $f[i, k]$, which corresponds to the maximum total cost not exceeding $k$ that can be achieved having already considered and excluded the first $i$ items. The state can be updated in $O(1)$ time per entry if consider either taking or not taking the $i$-th item and using further states of the dynamic to compute the optimal decision for each state. 
+The way to solve it is to introduce the *state* $f[n, w]$, which corresponds to the maximum total cost not exceeding $w$ that can be achieved using only the first $n$ items. These values can be computed in $O(1)$ time per entry if we consider either taking or not taking the $n$-th item and use the previous states of the dynamic to make the optimal decision.

-Python has a handy `lru_cache` decorator, which can be used for implementing it with memoized recursion:
+Python has a handy `lru_cache` decorator which can be used for implementing it with memoized recursion:

```python
@lru_cache
-def f(i, k):
-    if i == n or k == 0:
+def f(n, w):
+    # check if we have no items to choose
+    if n == 0:
        return 0
-    if w[i] > k:
-        return f(i + 1, k)
-    return max(f(i + 1, k), c[i] + f(i + 1, k - w[i]))
+
+    # check if we can't pick the last item (note zero-based indexing)
+    if c[n - 1] > w:
+        return f(n - 1, w)
+
+    # otherwise, we can either pick the last item or not
+    return max(f(n - 1, w), c[n - 1] + f(n - 1, w - c[n - 1]))
```

-When computing $f[n, w]$, the recursion may visit up to $O(n \cdot w)$ different states, which is asymptotically efficient, but rather slow in reality. Even after nullifying the overhead of Python recursion and all the hash table queries required for the LRU cache to work, it would still be slow because it does random I/O throughout most of the execution.
+When computing $f[N, W]$, the recursion may visit up to $O(N \cdot W)$ different states, which is asymptotically efficient, but rather slow in reality. Even after nullifying the overhead of Python recursion and all the [hash table queries](../policies/#implementing-caching) required for the LRU cache to work, it would still be slow because it does random I/O throughout most of the execution.

What we can do instead is to create a two-dimensional array for the dynamic and replace the recursion with a nice nested loop like this:

```cpp
-int f[N + 1][W + 1];
+int f[N + 1][W + 1] = {0}; // this zero-fills the array

-for (int i = n - 1; i >= 0; i++)
-    for (int k = 0; k <= W; k++)
-        f[i][k] = w[i] > k ? f[i + 1][k] : max(f[i + 1][k], c[i] + f[i + 1][k - w[i]]);
+for (int n = 1; n <= N; n++)
+    for (int w = 0; w <= W; w++)
+        f[n][w] = c[n - 1] > w ?
+                  f[n - 1][w] :
+                  max(f[n - 1][w], c[n - 1] + f[n - 1][w - c[n - 1]]);
```

-Notice that we are only using the previous layer of the dynamic to calculate the next one. This means that if we can store one layer in the cache, we would only need to write $O(\frac{n \cdot w}{B})$ blocks in external memory.
+Notice that we are only using the previous layer of the dynamic to calculate the next one. This means that if we can store one layer in the cache, we would only need to write $O(\frac{N \cdot W}{B})$ blocks in external memory.

-Moreover, if we only need the answer, we don't actually have to store the whole 2d array but only the last layer. This lets us use just $O(w)$ memory by maintaining a single array of $w$ values. To simplify the code, we can slightly change the dynamic to store a binary value: whether it is possible to get the sum of exactly $k$ using the items that we have already considered. This dynamic is even faster to compute:
+Moreover, if we only need the answer, we don't actually have to store the whole 2d array but only the last layer. This lets us use just $O(W)$ memory by maintaining a single array of $W$ values.
To simplify the code, we can slightly change the dynamic to store a binary value: whether it is possible to get the sum of exactly $w$ using the items that we have already considered. This dynamic is even faster to compute: ```cpp -bool f[W + 1] = {}; // this zero-fills the array +bool f[W + 1] = {0}; f[0] = 1; -for (int i = 0; i < n; i++) - for (int x = W - a[i]; x >= 0; x--) - f[x + a[i]] |= f[x]; +for (int n = 0; n < N; n++) + for (int x = W - c[n]; x >= 0; x--) + f[x + c[n]] |= f[x]; ``` As a side note, now that it only uses simple bitwise operations, it can be optimized further by using a bitset: @@ -92,8 +99,8 @@ As a side note, now that it only uses simple bitwise operations, it can be optim ```cpp std::bitset b; b[0] = 1; -for (int i = 0; i < n; i++) - b |= b << c[i]; +for (int n = 0; n < N; n++) + b |= b << c[n]; ``` Surprisingly, there is still some room for improvement, and we will come back to this problem later. @@ -129,7 +136,7 @@ $$ t[k][i] = \min(t[k-1][i], t[k-1][i+2^{k-1}]) $$ -Now, there are two design choices to make: whether the log-size $k$ should be the first or the second dimension, and whether to iterate over $k$ and then $i$ or the other way around. This means that there are of $2×2=4$ ways to build it, and here is the optimal one: +Now, there are two design choices to make: whether the log-size $k$ should be the first or the second dimension, and whether to iterate over $k$ and then $i$ or the other way around. This means that there are $2×2=4$ ways to build it, and here is the optimal one: ```cpp int mn[logn][maxn]; @@ -167,7 +174,7 @@ The AoS layout is usually preferred for data structures, but SoA still has good This difference in design is important in data processing applications. For example, databases can be either *row-* or *column-oriented* (also called *columnar*): -- *Row-oriented* storage formats are used when you need to search for a limited amount of objects in a large dataset and fetch all or most of their fields. Examples: PostgreSQL, MongoDB. +- *Row-oriented* storage formats are used when you need to search for a limited number of objects in a large dataset and/or fetch all or most of their fields. Examples: PostgreSQL, MongoDB. - *Columnar* storage formats are used for big data processing and analytics, where you need to scan through everything anyway to calculate certain statistics. Examples: ClickHouse, Hbase. Columnar formats have the additional advantage that you can only read the fields that you need, as different fields are stored in separate external memory regions. diff --git a/content/english/hpc/external-memory/model.md b/content/english/hpc/external-memory/model.md index 35cba4ea..9ab86eba 100644 --- a/content/english/hpc/external-memory/model.md +++ b/content/english/hpc/external-memory/model.md @@ -18,7 +18,7 @@ Similar in spirit, in the *external memory model*, we simply ignore every operat In this model, we measure the performance of an algorithm in terms of its high-level *I/O operations*, or *IOPS* — that is, the total number of blocks read or written to external memory during execution. -We will mostly focus on the case where the internal memory is RAM and external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes. 
+We will mostly focus on the case where the internal memory is RAM and the external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes. ### Array Scan diff --git a/content/english/hpc/external-memory/oblivious.md b/content/english/hpc/external-memory/oblivious.md index 5e4650b2..93c4f2fc 100644 --- a/content/english/hpc/external-memory/oblivious.md +++ b/content/english/hpc/external-memory/oblivious.md @@ -118,7 +118,7 @@ It seems like we can't do better, but it turns out we can. ### Algorithm -Cache-oblivious matrix multiplication relies on essentially the same trick as the transposition. We need to divide the data until it fits into lowest cache (i. e. $N^2 \leq M$). For matrix multiplication, this equates to using this formula: +Cache-oblivious matrix multiplication relies on essentially the same trick as the transposition. We need to divide the data until it fits into lowest cache (i.e., $N^2 \leq M$). For matrix multiplication, this equates to using this formula: $$ \begin{pmatrix} @@ -198,7 +198,7 @@ $$ T(N) = O\left(\frac{(\sqrt{M})^2}{B} \cdot \left(\frac{N}{\sqrt M}\right)^3\right) = O\left(\frac{N^3}{B\sqrt{M}}\right) $$ -This is better than just $O(\frac{N^3}{B})$ and by quite a lot. +This is better than just $O(\frac{N^3}{B})$, and by quite a lot. ### Strassen Algorithm @@ -237,7 +237,7 @@ $$ You can verify these formulas with simple substitution if you feel like it. -As far as I know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are some prototype implementations that are efficient for matrices larger than 4000 or so. +As far as I know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are [some prototype implementations](https://arxiv.org/pdf/1605.01078.pdf) that are efficient for matrices larger than 2000 or so. This technique can and actually has been extended multiple times to reduce the asymptotic even further by considering more submatrix products. As of 2020, current world record is $O(n^{2.3728596})$. Whether you can multiply matrices in $O(n^2)$ or at least $O(n^2 \log^k n)$ time is an open problem. diff --git a/content/english/hpc/external-memory/policies.md b/content/english/hpc/external-memory/policies.md index 1ff0e724..4cb36bdd 100644 --- a/content/english/hpc/external-memory/policies.md +++ b/content/english/hpc/external-memory/policies.md @@ -33,7 +33,7 @@ $$ The main idea of the proof is to consider the worst case scenario. For LRU it would be the repeating series of $\frac{M}{B}$ distinct blocks: each block is new and so LRU has 100% cache misses. Meanwhile, $OPT_{M/2}$ would be able to cache half of them (but not more, because it only has half the memory). Thus $LRU_M$ needs to fetch double the number of blocks that $OPT_{M/2}$ does, which is basically what is expressed in the inequality, and anything better for $LRU$ would only weaken it. -![Dimmed are the blocks cached by OPT (but note cached by LRU)](../img/opt.png) +![Dimmed are the blocks cached by OPT (but not cached by LRU)](../img/opt.png) This is a very relieving result. 
It means that, at least in terms of asymptotic I/O complexity, you can just assume that the eviction policy is either LRU or OPT — whichever is easier for you — do complexity analysis with it, and the result you get will normally transfer to any other reasonable cache replacement policy. diff --git a/content/english/hpc/external-memory/sorting.md b/content/english/hpc/external-memory/sorting.md index 6ac13ae0..299da78f 100644 --- a/content/english/hpc/external-memory/sorting.md +++ b/content/english/hpc/external-memory/sorting.md @@ -1,6 +1,7 @@ --- title: External Sorting weight: 4 +published: true --- Now, let's try to design some actually useful algorithms for the new [external memory model](../model). Our goal in this section is to slowly build up more complex things and eventually get to *external sorting* and its interesting applications. @@ -33,17 +34,17 @@ So far the examples have been simple, and their analysis doesn't differ too much In the standard RAM model, the asymptotic complexity would be multiplied $k$, since we would need to perform $O(k)$ comparisons to fill each next element. But in the external memory model, since everything we do in-memory doesn't cost us anything, its asymptotic complexity would not change as long as we can fit $(k+1)$ full blocks in memory, that is, if $k = O(\frac{M}{B})$. -Remember [the $M \gg B$ assumption](../model) when we introduced the computational model? If we have $M \geq B^{1+ε}$ for $\epsilon > 0$, then we can fit any sub-polynomial amount of blocks in memory, certainly including $O(\frac{M}{B})$. This condition is called *tall cache assumption*, and it is usually required in many other external memory algorithms. +Remember [the $M \gg B$ assumption](../model) when we introduced the computational model? If we have $M \geq B^{1+ε}$ for $\epsilon > 0$, then we can fit any sub-polynomial number of blocks in memory, certainly including $O(\frac{M}{B})$. This condition is called *tall cache assumption*, and it is usually required in many other external memory algorithms. ### Merge Sorting -The "normal" complexity of the standard mergesort algorithm is $O(N \log_2 N)$: on each of its $O(\log_2 N)$ "layers", the algorithms need to go through all $N$ elements in total and merge them in linear time. +The "normal" complexity of the standard mergesort algorithm is $O(N \log_2 N)$: on each of its $O(\log_2 N)$ "layers," the algorithms need to go through all $N$ elements in total and merge them in linear time. -In the external memory model, when we read a block of size $M$, we can sort its elements "for free", since they are already in memory. This way we can split the arrays into $O(\frac{N}{M})$ blocks of consecutive elements and sort them separately as the base step, and only then merge them. +In the external memory model, when we read a block of size $M$, we can sort its elements "for free," since they are already in memory. This way we can split the arrays into $O(\frac{N}{M})$ blocks of consecutive elements and sort them separately as the base step, and only then merge them. ![](../img/k-way.png) -This effectively means that, in terms of IO operations, the first $O(\log M)$ layers of mergesort are free, and there are only $O(\log_2 \frac{N}{B})$ non-zero-cost layers, each mergeable in $O(\frac{N}{B})$ IOPS in total. 
This brings total I/O complexity to
+This effectively means that, in terms of I/O operations, the first $O(\log M)$ layers of mergesort are free, and there are only $O(\log_2 \frac{N}{M})$ non-zero-cost layers, each mergeable in $O(\frac{N}{B})$ IOPS in total. This brings total I/O complexity to

$$
O\left(\frac{N}{B} \log_2 \frac{N}{M}\right)
$$

@@ -57,7 +58,7 @@ Half of a page ago we have learned that in the external memory model, we can mer

Let's sort each block of size $M$ in-memory just as we did before, but during each merge stage, we will split sorted blocks not just in pairs to be merged, but take as many blocks we can fit into our memory during a $k$-way merge. This way the height of the merge tree would be greatly reduced, while each layer would still be done in $O(\frac{N}{B})$ IOPS.

-How many sorted arrays can we merge at once? Exactly $k = \frac{M}{B}$, since we need memory for one block for each array. Since the total amount of layers will be reduced to $\log_{\frac{M}{B}} \frac{N}{M}$, the total complexity will be reduced to
+How many sorted arrays can we merge at once? Exactly $k = \frac{M}{B}$, since we need memory for one block for each array. Since the total number of layers will be reduced to $\log_{\frac{M}{B}} \frac{N}{M}$, the total complexity will be reduced to

$$
SORT(N) \stackrel{\text{def}}{=} O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{M} \right)
$$

@@ -106,15 +107,28 @@ fclose(input);

What is left now is to merge them together. The bandwidth of modern HDDs can be quite high, and there may be a lot of parts to merge, so the I/O efficiency of this stage is not our only concern: we also need a faster way to merge $k$ arrays than by finding minima with $O(k)$ comparisons. We can do that in $O(\log k)$ time per element if we maintain a min-heap for these $k$ elements, in a manner almost identical to heapsort.

-Here is how to implement it. First, we need to initialize some variables:
+Here is how to implement it. First, we are going to need a heap (`priority_queue` in C++):

-```cpp
+```c++
+struct Pointer {
+    int key, part; // the element itself and the number of its part
+
+    bool operator<(const Pointer& other) const {
+        return key > other.key; // std::priority_queue is a max-heap by default
+    }
+};
+
+std::priority_queue<Pointer> q;
+```
+
+Then, we need to allocate and fill the buffers:
+
+```c++
const int nparts = parts.size();
-std::priority_queue< std::pair<int, int> > q; // the heap itself (element + part number)
-auto buffers = new int[nparts][B];            // buffers for each part
-int *l = new int[nparts],                     // # of already processed buffer elements
-    *r = new int[nparts];                     // buffer size (in case it isn't full)
+auto buffers = new int[nparts][B]; // buffers for each part
+int *l = new int[nparts],          // # of already processed buffer elements
+    *r = new int[nparts];          // buffer size (in case it isn't full)

// now we fill the buffer for each part and add their elements to the heap
for (int part = 0; part < nparts; part++) {
diff --git a/content/english/hpc/external-memory/virtual.md b/content/english/hpc/external-memory/virtual.md
index 6535283d..92bb454c 100644
--- a/content/english/hpc/external-memory/virtual.md
+++ b/content/english/hpc/external-memory/virtual.md
@@ -19,7 +19,7 @@ Virtual memory gives each process the impression that it fully controls a contig

To achieve this, the memory address space is divided into *pages* (typically 4KB in size), which are the base units of memory that the programs can request from the operating system.
The memory system maintains a special hardware data structure called the *page table*, which contains the mappings of virtual page addresses to the physical ones. When a process accesses data using its virtual memory address, the memory system calculates its page number (by right-shifting it by $12$ if $4096=2^{12}$ is the page size), looks up in the page table that its physical address is, and forwards the read or write request to where that data is actually stored. -Since the address translation needs to be done for each memory request, and the number of memory pages itself may be large (e. g. 16G RAM / 4K page size = 4M pages), address translation poses a difficult problem in itself. One way to speed it up is to use a special cache for the page table itself called *translation lookaside buffer* (TLB), and the other is to [increase the page size](/hpc/cpu-cache/paging) so that the total number of memory pages is made smaller at the cost of reduced granularity. +Since the address translation needs to be done for each memory request, and the number of memory pages itself may be large (e.g., 16G RAM / 4K page size = 4M pages), address translation poses a difficult problem in itself. One way to speed it up is to use a special cache for the page table itself called *translation lookaside buffer* (TLB), and the other is to [increase the page size](/hpc/cpu-cache/paging) so that the total number of memory pages is made smaller at the cost of reduced granularity. diff --git a/content/english/hpc/number-theory/cryptography.md b/content/english/hpc/number-theory/cryptography.md index 87f58124..0b8c6b76 100644 --- a/content/english/hpc/number-theory/cryptography.md +++ b/content/english/hpc/number-theory/cryptography.md @@ -1,6 +1,6 @@ --- title: Cryptography -weight: 6 +weight: 7 draft: true --- @@ -22,15 +22,15 @@ To calculate $d$ and restore the message, the attacker would need to repeat step When doing actual communication, people first exchange their public keys (in any, possibly unsecure way) and then use it to encrypt messages. -This is what web browsers do when establishing connection "https". You can also do it by hand with GPG. +This is what web browsers do when establishing connection "https." You can also do it by hand with GPG. ### Man-in-the-middle There is an issue when establishing initial communication that the attacker could replace it and control the communication. -Between your browser and a bank. "Hey this is a message from a bank". +Between your browser and a bank. "Hey this is a message from a bank." -Trust networks. E. g. everyone can trust Google or whoever makes the device or operating system. +Trust networks. E.g., everyone can trust Google or whoever makes the device or operating system. 
## Symmetric Cryptography

diff --git a/content/english/hpc/number-theory/error-correction.md b/content/english/hpc/number-theory/error-correction.md
index 91f1f472..e8774ed8 100644
--- a/content/english/hpc/number-theory/error-correction.md
+++ b/content/english/hpc/number-theory/error-correction.md
@@ -1,6 +1,6 @@
---
title: Error Correction
-weight: 4
+weight: 6
draft: true
---

diff --git a/content/english/hpc/number-theory/euclid-extended.md b/content/english/hpc/number-theory/euclid-extended.md
new file mode 100644
index 00000000..a37c1b29
--- /dev/null
+++ b/content/english/hpc/number-theory/euclid-extended.md
@@ -0,0 +1,100 @@
+---
+title: Extended Euclidean Algorithm
+weight: 3
+---
+
+[Fermat’s theorem](../modular/#fermats-theorem) allows us to calculate modular multiplicative inverses through [binary exponentiation](../exponentiation/) in $O(\log n)$ operations, but it only works with prime moduli. There is a generalization of it, [Euler's theorem](https://en.wikipedia.org/wiki/Euler%27s_theorem), stating that if $m$ and $a$ are coprime, then
+
+$$
+a^{\phi(m)} \equiv 1 \pmod m
+$$
+
+where $\phi(m)$ is [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) defined as the number of positive integers $x < m$ that are coprime with $m$. In the special case when $m$ is prime, all the $m - 1$ nonzero residues are coprime with it and $\phi(m) = m - 1$, yielding Fermat's theorem.
+
+This lets us calculate the inverse of $a$ as $a^{\phi(m) - 1}$ if we know $\phi(m)$, but in turn, calculating it is not so fast: you usually need to obtain the [factorization](/hpc/algorithms/factorization/) of $m$ to do it. There is a more general method that works by modifying the [Euclidean algorithm](/hpc/algorithms/gcd/).
+
+### Algorithm
+
+The *extended Euclidean algorithm*, apart from finding $g = \gcd(a, b)$, also finds integers $x$ and $y$ such that
+
+$$
+a \cdot x + b \cdot y = g
+$$
+
+which solves the problem of finding the modular inverse if we substitute $b$ with $m$ and $g$ with $1$:
+
+$$
+a^{-1} \cdot a + k \cdot m = 1
+$$
+
+Note that, if $a$ is not coprime with $m$, there is no solution since no integer combination of $a$ and $m$ can yield anything that is not a multiple of their greatest common divisor.
+
+The algorithm is also recursive: it calculates the coefficients $x'$ and $y'$ for $\gcd(b, a \bmod b)$ and restores the solution for the original number pair. If we have a solution $(x', y')$ for the pair $(b, a \bmod b)$
+
+$$
+b \cdot x' + (a \bmod b) \cdot y' = g
+$$
+
+then, to get the solution for the initial input, we can rewrite the expression $(a \bmod b)$ as $(a - \lfloor \frac{a}{b} \rfloor \cdot b)$ and substitute it into the aforementioned equation:
+
+$$
+b \cdot x' + (a - \Big \lfloor \frac{a}{b} \Big \rfloor \cdot b) \cdot y' = g
+$$
+
+Now we rearrange the terms, grouping by $a$ and $b$, to get
+
+$$
+a \cdot \underbrace{y'}_x + b \cdot \underbrace{(x' - \Big \lfloor \frac{a}{b} \Big \rfloor \cdot y')}_y = g
+$$
+
+Comparing it with the initial expression, we infer that we can just use the coefficients of $a$ and $b$ for the initial $x$ and $y$.
+
+### Implementation
+
+We implement the algorithm as a recursive function.
Since its output is not one but three integers, we pass the coefficients to it by reference: + +```c++ +int gcd(int a, int b, int &x, int &y) { + if (a == 0) { + x = 0; + y = 1; + return b; + } + int x1, y1; + int d = gcd(b % a, a, x1, y1); + x = y1 - (b / a) * x1; + y = x1; + return d; +} +``` + +To calculate the inverse, we simply pass $a$ and $m$ and return the $x$ coefficient the algorithm finds. Since we pass two positive numbers, one of the coefficient will be positive and the other one is negative (which one depends on whether the number of iterations is odd or even), so we need to optionally check if $x$ is negative and add $m$ to get a correct residue: + +```c++ +int inverse(int a) { + int x, y; + gcd(a, M, x, y); + if (x < 0) + x += M; + return x; +} +``` + +It works in ~160ns — 10ns faster than inverting numbers with [binary exponentiation](../exponentiation). To optimize it further, we can similarly turn it iterative ­— which takes 135ns: + +```c++ +int inverse(int a) { + int b = M, x = 1, y = 0; + while (a != 1) { + y -= b / a * x; + b %= a; + swap(a, b); + swap(x, y); + } + return x < 0 ? x + M : x; +} +``` + +Note that, unlike binary exponentiation, the running time depends on the value of $a$. For example, for this particular value of $m$ ($10^9 + 7$), the worst input happens to be 564400443, for which the algorithm performs 37 iterations and takes 250ns. + +**Exercise**. Try to adapt the same technique for the [binary GCD](/hpc/algorithms/gcd/#binary-gcd) (it won't give performance speedup though unless you are better than me at optimization). diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md new file mode 100644 index 00000000..8806257d --- /dev/null +++ b/content/english/hpc/number-theory/exponentiation.md @@ -0,0 +1,109 @@ +--- +title: Binary Exponentiation +weight: 2 +--- + +In modular arithmetic (and computational algebra in general), you often need to raise a number to the $n$-th power — to do [modular division](../modular/#modular-division), perform [primality tests](../modular/#fermats-theorem), or compute some combinatorial values — ­and you usually want to spend fewer than $\Theta(n)$ operations calculating it. + +*Binary exponentiation*, also known as *exponentiation by squaring*, is a method that allows for computation of the $n$-th power using $O(\log n)$ multiplications, relying on the following observation: + +$$ +\begin{aligned} + a^{2k} &= (a^k)^2 +\\ a^{2k + 1} &= (a^k)^2 \cdot a +\end{aligned} +$$ + +To compute $a^n$, we can recursively compute $a^{\lfloor n / 2 \rfloor}$, square it, and then optionally multiply by $a$ if $n$ is odd, corresponding to the following recurrence: + +$$ +a^n = f(a, n) = \begin{cases} + 1, && n = 0 +\\ f(a, \frac{n}{2})^2, && 2 \mid n +\\ f(a, n - 1) \cdot a, && 2 \nmid n +\end{cases} +$$ + +Since $n$ is at least halved every two recursive transitions, the depth of this recurrence and the total number of multiplications will be at most $O(\log n)$. 
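+
+For example, for $n = 13$, the recurrence unfolds into
+
+$$
+a^{13} = (a^{6})^2 \cdot a = ((a^3)^2)^2 \cdot a = ((a^2 \cdot a)^2)^2 \cdot a
+$$
+
+which takes just five modular multiplications (three squarings and two extra multiplications by $a$) instead of the twelve that naive repeated multiplication would need.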
+
+### Recursive Implementation
+
+As we already have a recurrence, it is natural to implement the algorithm as a case-matching recursive function:
+
+```c++
+const int M = 1e9 + 7; // modulo
+typedef unsigned long long u64;
+
+u64 binpow(u64 a, u64 n) {
+    if (n == 0)
+        return 1;
+    if (n % 2 == 1)
+        return binpow(a, n - 1) * a % M;
+    else {
+        u64 b = binpow(a, n / 2);
+        return b * b % M;
+    }
+}
+```
+
+In our benchmark, we use $n = m - 2$ so that we compute the [multiplicative inverse](../modular/#modular-division) of $a$ modulo $m$:
+
+```c++
+u64 inverse(u64 a) {
+    return binpow(a, M - 2);
+}
+```
+
+We use $m = 10^9+7$, which is a modulo value commonly used in competitive programming to calculate checksums in combinatorial problems — because it is prime (allowing inverse via binary exponentiation), sufficiently large, not overflowing `int` in addition, not overflowing `long long` in multiplication, and easy to type as `1e9 + 7`.
+
+Since we use it as a compile-time constant in the code, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once and use them for fast reduction).
+
+The execution path — and consequently the running time — depends on the value of $n$. For this particular $n$, the baseline implementation takes around 330ns per call. As recursion introduces some [overhead](/hpc/architecture/functions/), it makes sense to unroll the implementation into an iterative procedure.
+
+### Iterative Implementation
+
+The result of $a^n$ can be represented as the product of $a$ to some powers of two — those that correspond to 1s in the binary representation of $n$. For example, if $n = 42 = 32 + 8 + 2$, then
+
+$$
+a^{42} = a^{32+8+2} = a^{32} \cdot a^8 \cdot a^2
+$$
+
+To calculate this product, we can iterate over the bits of $n$ maintaining two variables: the value of $a^{2^k}$ and the current product after considering the $k$ lowest bits of $n$. On each step, we multiply the current product by $a^{2^k}$ if the $k$-th bit of $n$ is set, and, in either case, square $a^{2^k}$ to get $a^{2^k \cdot 2} = a^{2^{k+1}}$ that will be used on the next iteration.
+
+```c++
+u64 binpow(u64 a, u64 n) {
+    u64 r = 1;
+
+    while (n) {
+        if (n & 1)
+            r = r * a % M;
+        a = a * a % M;
+        n >>= 1;
+    }
+
+    return r;
+}
+```
+
+The iterative implementation takes about 180ns per call. The heavy calculations are the same; the improvement mainly comes from the reduced dependency chain: `a = a * a % M` needs to finish before the loop can proceed, and it can now execute concurrently with `r = r * a % M`.
+
+The performance also benefits from $n$ being a constant, [making all branches predictable](/hpc/pipelining/branching/) and letting the scheduler know what needs to be executed in advance. The compiler, however, does not take advantage of it and does not unroll the `while(n) n >>= 1` loop. We can rewrite it as a `for` loop that performs a constant 30 iterations:
+
+```c++
+u64 inverse(u64 a) {
+    u64 r = 1;
+
+    #pragma GCC unroll(30)
+    for (int l = 0; l < 30; l++) {
+        if ( (M - 2) >> l & 1 )
+            r = r * a % M;
+        a = a * a % M;
+    }
+
+    return r;
+}
+```
+
+This forces the compiler to generate only the instructions we need, shaving off another 10ns and making the total running time ~170ns.
+
+Note that the performance depends not only on the binary length of $n$, but also on the number of binary 1s.
If $n$ is $2^{30}$, it takes around 20ns less as we don't have to to perform any off-path multiplications. diff --git a/content/english/hpc/number-theory/finite.md b/content/english/hpc/number-theory/finite.md index fbef0015..cae2f2ef 100644 --- a/content/english/hpc/number-theory/finite.md +++ b/content/english/hpc/number-theory/finite.md @@ -1,6 +1,6 @@ --- title: Finite Fields -weight: 3 +weight: 5 draft: true --- diff --git a/content/english/hpc/number-theory/hashing.md b/content/english/hpc/number-theory/hashing.md index 0484d173..294573a1 100644 --- a/content/english/hpc/number-theory/hashing.md +++ b/content/english/hpc/number-theory/hashing.md @@ -12,7 +12,7 @@ Hash function is any function that is: * Computed fast — at least in linear time, that is. * Has a limited image — say, 64-bit values. -* "Deterministically-random": if it takes $n$ different values, then the probability of collision of two random hashes is $\frac{1}{n}$ and can't be predicted well without knowing the hash function. +* "Deterministically-random:" if it takes $n$ different values, then the probability of collision of two random hashes is $\frac{1}{n}$ and can't be predicted well without knowing the hash function. One good test is that can't create a collision in any better time than by birthday paradox. Square root of the hash space. diff --git a/content/english/hpc/number-theory/img/clock.gif b/content/english/hpc/number-theory/img/clock.gif new file mode 100644 index 00000000..0d0c6555 Binary files /dev/null and b/content/english/hpc/number-theory/img/clock.gif differ diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/inverse.md deleted file mode 100644 index dbfe1676..00000000 --- a/content/english/hpc/number-theory/inverse.md +++ /dev/null @@ -1,187 +0,0 @@ ---- -title: Modular Inverse -weight: 1 ---- - -```c++ -mint inv() const { - uint t = x; - uint res = 1; - while (t != 1) { - uint z = mod / t; - res = (ull) res * (mod - z) % mod; - t = mod - t * z; - } - return res; -} -``` - -In this section, we are going to discuss some preliminaries before discussing more advanced topics. - -In computers, we use the 1st of January, 1970 as the start of the "Unix era", and all time computations are usually done relative to that timestamp. - -We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. At the moment of writing, approximately 63882260594 seconds have passed since 0 AD. - -But for daily tasks, we do not really need that information. Depending on the situation, the relevant part may be that it is 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. What we do is instead of using a timestamp we use its remainder, which contains just the information we need. And the beautiful thing about it is that remainders are small and cyclic. Think the hour clock: after 12 there comes 1 again, so the number is always small. - -![](../img/clock.gif) - -It is much easier to deal with 1- or 2-digit numbers than 11-digit ones. If we encode each day of the weak starting with Monday from 0 to 6 inclusive, Thursday is going to get number 3. But what day of the week is it going to be in one year? We need to add 365 to it and then reduce modulo 7. It is convenient that `365 % 7` is 1, so we will know that it's Friday unless it is a leap year (in which case it will be Saturday). 
- -Modular arithmetic studies the way these sets of remainders behave, and it has fundamental applications in number theory, cryptography and data compression. - - -Consider the following problem: our "week" now consists of $m$ days, and we cycle through it with a steps of $a > 0$. How many distinct days there will be? - -Let's assume that the first day is always Monday. At some point the sequence of day is going to cycle. The days will be representable as $k a \mod m$, so we need to find the first $k$ such as $k a$ is divisible by $m$. In the case of $m=7$, $m$ is prime, so the cycle length will be 7 exactly for any $a$. - -Now, if $m$ is not prime, but it is still coprime with $a$. For $ka$ to be divisible by $m$, $k$ needs to be divisible by $m$. In general, the answer is $\frac{m}{gcd(a, m)}$. For example, if the week is 10 days long, if the starting number is even, then it will cycle through all even numbers, and if the number is 5, then it will only cycle between 0 and 5. Otherwise it will go through all 10 remainders. - -### Fermat's Theorem - -Now, consider what happens if instead of adding a number $a$, we repeatedly multiply by it, that is, write numbers in the form $a^n \mod m$. Since these are all finite numbers there is going to be a cycle, but what will its length be? If $p$ is prime, it turns out, all of them. - -**Theorem.** $a^p \equiv a \pmod p$ for all $a$ that are not multiple of $p$. - -**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k}{\prod (x_i!)}$ be the *multinomial coefficient*, that is, the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ would appear after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then - -$$ -\begin{aligned} -a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p & -\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)} -\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)} -\\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)} -\\\ &= a -\end{aligned} -$$ - -and then dividing by $a$ gives us the Fermat's theorem. - -Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that - -$$ -a^{\phi(m)} \equiv 1 \pmod m -$$ - -where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case. - -### Primality Testing - -These theorems have a lot of applications. One of them is checking whether a number $n$ is prime or not faster than factoring it. You can pick any base $a$ at random and try to raise it to power $a^{p-1}$ modulo $n$ and check if it is $1$. Such base is called *witness*. - -Such probabilistic tests are therefore returning either "no" or "maybe". It may be the case that it just happened to be equal to $1$ but in fact $n$ is composite, in which case you need to repeat the test until you are okay with the false positive probability. Moreover, there exist carmichael numbers, which are composite numbers $n$ that satisfy $a^n \equiv 1 \pmod n$ for all $a$. These numbers are rare, but still [exist](https://oeis.org/A002997). - -Unless the input is provided by an adversary, the mistake probability will be low. 
This test is adequate for finding large primes: there are roughly $\frac{n}{\ln n}$ primes among the first $n$ numbers, which is another fact that we are not going to prove. These primes are distributed more or less evenly, so one can just pick a random number and check numbers in sequence, and after checking $O(\ln n)$ numbers one will probably be found. - -### Binary Exponentiation - -To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. We can use the fact that multiplication is associative: - -$$ -\begin{aligned} - a^{2k} &= (a^k)^2 -\\ a^{2k + 1} &= (a^k)^2 \cdot a -\end{aligned} -$$ - -We essentially group it like this: - -$$ -a^8 = (aaaa) \cdot (aaaa) = ((aa)(aa))((aa)(aa)) -$$ - -This allows using only $O(\log n)$ operations (or, more specifically, at most $2 \cdot \log_2 n$ modular multiplications). - -```c++ -int binpow(int a, int n) { - int res = 1; - while (n) { - if (n & 1) - res = res * a % mod; - a = a * a % mod; - n >>= 1; - } - return res; -} -``` - -This helps if `n` or `mod` is a constant. - -### Modular Division - -"Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, но $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$. - -To perform division, we need to find an element that will behave itself like the reciprocal $\frac{1}{a} = a^{-1}$, and instead of "division" multiply by it. This element is called a *modular inverse*. - -If the modulo is a prime number, then the solution is $a^{-1} \equiv a^{p-2}$, which follows directly from Fermat's theorem by dividing the equivalence by $a$: - -$$ -a^p \equiv a \implies a^{p-1} \equiv 1 \implies a^{p-2} \equiv a^{-1} -$$ - -This means that $a^{p-2}$ "behaves" like $a^{-1}$ which is what we need. - -You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation: - -```c++ -int inv(int x) { - return binpow(x, mod - 2); -} -``` - -If the modulo is not prime, then we can still get by calculating $\phi(m)$ and invoking Euler's theorem. But calculating $\phi(m)$ is as difficult as factoring it, which is not fast. There is a more general method. - -### Extended Euclidean Algorithm - -*Extended Euclidean algorithm* apart from finding $g = \gcd(a, b)$ also finds integers $x$ and $y$ such that - -$$ -a \cdot x + b \cdot y = g -$$ - -which solves the problem of finding modular inverse if we substitute $b$ with $m$ and $g$ with $1$: - -$$ -a^{-1} \cdot a + k \cdot m = 1 -$$ - -Note that if $a$ is not coprime with $m$, then there will be no solution. We can still find *some* element, but it will not work for any dividend. - -The algorithm is also recursive. It makes a recursive call, calculates the coefficients $x'$ and $y'$ for $\gcd(b, a \bmod b)$, and restores the general solution. 
If we have a solution $(x', y')$ for pair $(b, a \bmod b)$: - -$$ -b \cdot x' + (a \bmod b) \cdot y' = g -$$ - -To get the solution for the initial input, rewrite the expression $(a \bmod b)$ as $(a - \lfloor \frac{a}{b} \rfloor \cdot b)$ and subsitute it into the aforementioned equality: - -$$ -b \cdot x' + (a - \Big \lfloor \frac{a}{b} \Big \rfloor \cdot b) \cdot y' = g -$$ - -Now let's rearrange the terms (grouping by $a$ and $b$) to get - -$$ -a \cdot \underbrace{y'}_x + b \cdot \underbrace{(x' - \Big \lfloor \frac{a}{b} \Big \rfloor \cdot y')}_y = g -$$ - -Comparing it with initial expression, we infer that we can just use coefficients by $a$ and $b$ for the initial $x$ and $y$. - -```c++ -int gcd(int a, int b, int &x, int &y) { - if (a == 0) { - x = 0; - y = 1; - return b; - } - int x1, y1; - int d = gcd(b % a, a, x1, y1); - x = y1 - (b / a) * x1; - y = x1; - return d; -} -``` - -Another application is the exact division modulo $2^k$. - -**Exercise**. Try to adapt the technique for binary GCD. diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md new file mode 100644 index 00000000..3d05e2f9 --- /dev/null +++ b/content/english/hpc/number-theory/modular.md @@ -0,0 +1,140 @@ +--- +title: Modular Arithmetic +weight: 1 +--- + + + +Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time. + +We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 1 AD — [6th century Eastern Roman monks' best estimate](https://en.wikipedia.org/wiki/Anno_Domini) of the day Jesus Christ was born. + +But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now, and it's time to go to dinner; or that it's Thursday, and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones. + +**Problem.** Today is Thursday. What day of the week will be exactly in a year? + +If we enumerate each day of the week, starting with Monday, from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday). + +### Residues + +**Definition.** Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference: + +$$ +m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m +$$ + +For example, the 42nd day of the year is the same weekday as the 161st since $(161 - 42) = 119 = 17 \times 7$. + +Congruence modulo $m$ is an equivalence relation that splits all integers into equivalence classes called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$). + + + +*Modular arithmetic* studies these sets of residues, which are fundamental for number theory. 
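To make the residue arithmetic concrete, here is a minimal sketch (the `mod` helper is ours; unlike the plain `%` operator in C++, it always returns the smallest nonnegative representative, even for negative inputs):

```c++
#include <cstdio>

// smallest nonnegative representative of x modulo m
int mod(long long x, int m) {
    return int(((x % m) + m) % m);
}

int main() {
    printf("%d\n", mod(3 + 365, 7));      // 4 -> Friday, one year after a Thursday
    printf("%d\n", (161 - 42) % 7 == 0);  // 1 -> the 42nd and the 161st days are congruent mod 7
}
```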
+ +**Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week there will be among one, two, three and so on whole years from now? + +For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero, and after each year, it changes to + +$$ +d_{k + 1} = (d_k + a) \bmod m +$$ + +After $k$ years, it will be + +$$ +d_k = k \cdot a \bmod m +$$ + +Since there are only $m$ days in a week, at some point, it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that + +$$ +k \cdot a \equiv 0 \pmod m +$$ + +First of all, if $a \equiv 0$, it will be eternal Monday. Now, assuming the non-trivial case of $a \not \equiv 0$: + +- For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ weekdays. +- If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster. +- If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders. + +Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$ is the [greatest common divisor](/hpc/algorithms/gcd/) of $a$ and $m$. + +### Fermat's Theorem + +Now, consider what happens if, instead of adding a number $a$, we repeatedly multiply by it, writing out a sequence of + +$$ +d_n = a^n \bmod m +$$ + +Again, since there is a finite number of residues, there is going to be a cycle. But what will its length be? Turns out, if $m$ is prime, it will span all $(m - 1)$ non-zero residues. + +**Theorem.** For any $a$ and a prime $p$: + +$$ +a^p \equiv a \pmod p +$$ + +**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k}{\prod (x_i!)}$ be the *multinomial coefficient:* the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ appears after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then: + +$$ +\begin{aligned} +a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p & +\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by definition)} +\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)} +\\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)} +\\\ &= a +\end{aligned} +$$ + +Note that this is only true for prime $p$. We can use this fact to test whether a given number is prime faster than by factoring it: we can pick a number $a$ at random, calculate $a^{p} \bmod p$, and check whether it is equal to $a$ or not. 
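In code, the check might look like this (a sketch of our own: the `binpow` helper, the number of trials, and the assumption that $p$ fits into 32 bits so that the products do not overflow 64-bit integers are all illustrative choices):

```c++
#include <cstdlib>

typedef unsigned long long u64;

// a^n mod p (assumes p < 2^32 so that the products fit in 64 bits)
u64 binpow(u64 a, u64 n, u64 p) {
    u64 r = 1;
    while (n) {
        if (n & 1)
            r = r * a % p;
        a = a * a % p;
        n >>= 1;
    }
    return r;
}

// "false" means definitely composite, "true" means probably prime
bool probably_prime(u64 p, int trials = 10) {
    if (p < 4)
        return p == 2 || p == 3;
    for (int i = 0; i < trials; i++) {
        u64 a = 2 + rand() % (p - 3); // a random base in [2, p - 2]
        if (binpow(a, p, p) != a)
            return false;
    }
    return true;
}
```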
+ +This is called *Fermat primality test*, and it is probabilistic — only returning either "no" or "maybe" — since it may be that $a^p$ just happened to be equal to $a$ despite $p$ being composite, in which case you need to repeat the test with a different random $a$ until you are satisfied with the false positive probability. + +Primality tests are commonly used to generate large primes (for cryptographic purposes). There are roughly $\frac{n}{\ln n}$ primes among the first $n$ numbers (a fact that we are not going to prove), and they are distributed more or less evenly. One can just pick a random number from the required range, perform a primality check, and repeat until a prime is found, performing $O(\ln n)$ trials on average. + +An extremely bad input to the Fermat test is the [Carmichael numbers](https://en.wikipedia.org/wiki/Carmichael_number), which are composite numbers $n$ that satisfy $a^{n-1} \equiv 1 \pmod n$ for all relatively prime $a$. But these are [rare](https://oeis.org/A002997), and the chance of randomly bumping into it is low. + +### Modular Division + +Implementing most "normal" arithmetic operations with residues is straightforward. You only need to take care of integer overflows and remember to take modulo: + +```c++ +c = (a + b) % m; +c = (a - b + m) % m; +c = a * b % m; +``` + +But there is an issue with division: we can't just bluntly divide two residues. For example, $\frac{8}{2} = 4$, but + +$$ +\frac{8 \bmod 5}{2 \bmod 5} = \frac{3}{2} \neq 4 +$$ + +To perform modular division, we need to find an element that "acts" like the reciprocal $\frac{1}{a} = a^{-1}$ and multiply by it. This element is called a *modular multiplicative inverse*, and Fermat's theorem can help us find it when the modulo $p$ is a prime. When we divide its equivalence twice by $a$, we get: + +$$ +a^p \equiv a \implies a^{p-1} \equiv 1 \implies a^{p-2} \equiv a^{-1} +$$ + +Therefore, $a^{p-2}$ is like $a^{-1}$ for the purposes of multiplication, which is what we need from a modular inverse of $a$. diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md index e784dfaf..0eeef0b0 100644 --- a/content/english/hpc/number-theory/montgomery.md +++ b/content/english/hpc/number-theory/montgomery.md @@ -1,102 +1,208 @@ --- title: Montgomery Multiplication -weight: 2 +weight: 4 +published: true --- -When we talked about [integers](../integer) in general, we discussed how to perform division and modulo by multiplication, and, unsurprisingly, in modular arithmetic 90% of its time is spent calculating modulo. Apart from using the general tricks described in the previous article, there is another method specifically for modular arithmetic, called *Montgomery multiplication*. +Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size. -As all other fast reduction methods, it doesn't come for free. It works only in *Montgomery space*, so we need to transform our numbers in and out of it before doing the multiplications. This means that on top of doing some compile-time computations, we would also need to do some operations before the multiplication. 
+The best way to deal this nuisance is to avoid modulo operation altogether, delaying or replacing it with [predication](/hpc/pipelining/branchless), which can be done, for example, when calculating modular sums: -For the space we need a positive integer $r \ge n$ coprime to $n$. In practice we always choose $r$ to be $2^m$ (with $m$ usually being equal 32 or 64), since multiplications, divisions and modulo $r$ operations can then be efficiently implemented using shifts and bitwise operations. Therefore $n$ needs to be an odd number so that every power of $2$ will be coprime to $n$. And if it is not, we can make it odd (?). +```cpp +const int M = 1e9 + 7; -The representative $\bar x$ of a number $x$ in the Montgomery space is defined as +// input: array of n integers in the [0, M) range +// output: sum modulo M +int slow_sum(int *a, int n) { + int s = 0; + for (int i = 0; i < n; i++) + s = (s + a[i]) % M; + return s; +} + +int fast_sum(int *a, int n) { + int s = 0; + for (int i = 0; i < n; i++) { + s += a[i]; // s < 2 * M + s = (s >= M ? s - M : s); // will be replaced with cmov + } + return s; +} + +int faster_sum(int *a, int n) { + long long s = 0; // 64-bit integer to handle overflow + for (int i = 0; i < n; i++) + s += a[i]; // will be vectorized + return s % M; +} +``` + +However, sometimes you only have a chain of modular multiplications, and there is no good way to eel out of computing the remainder of the division — other than with the [integer division tricks](../hpc/arithmetic/division/) requiring a constant modulo and some precomputation. + +But there is another technique designed specifically for modular arithmetic, called *Montgomery multiplication*. + +### Montgomery Space + +Montgomery multiplication works by first transforming the multipliers into *Montgomery space*, where modular multiplication can be performed cheaply, and then transforming them back when their actual values are needed. Unlike general integer division methods, Montgomery multiplication is not efficient for performing just one modular reduction and only becomes worthwhile when there is a chain of modular operations. + +The space is defined by the modulo $n$ and a positive integer $r \ge n$ coprime to $n$. The algorithm involves modulo and division by $r$, so in practice, $r$ is chosen to be $2^{32}$ or $2^{64}$, so that these operations can be done with a right-shift and a bitwise AND respectively. + + + +**Definition.** The *representative* $\bar x$ of a number $x$ in the Montgomery space is defined as $$ \bar{x} = x \cdot r \bmod n $$ -Note that the transformation is actually such a multiplication that we want to optimize, so it is still an expensive operation. However, we will only need to transform a number into the space once, perform as many operations as we want efficiently in that space and at the end transform the final result back, which should be profitable if we are doing lots of operations modulo $n$. +Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we only use this method when the overhead of transforming numbers to and from the Montgomery space is worth it and not for general modular multiplication. 
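To make the definition concrete, here is a tiny worked example with toy numbers of our own (not the values used elsewhere in this article): take $n = 7$ and $r = 2^4 = 16$. The representative of $x = 3$ is

$$
\bar{x} = 3 \cdot 16 \bmod 7 = 48 \bmod 7 = 6
$$

and transforming it back, $6 \cdot r^{-1} \bmod 7 = 6 \cdot 4 \bmod 7 = 3$ (here $r^{-1} = 4$ because $16 \cdot 4 = 64 \equiv 1 \pmod 7$), recovers the original value.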
+ + + +Inside the Montgomery space, addition, substraction, and checking for equality is performed as usual: + +$$ +x \cdot r + y \cdot r \equiv (x + y) \cdot r \bmod n +$$ -Inside the Montgomery space addition, substraction and checking for equality is performed as usual ($x \cdot r + y \cdot r \equiv (x + y) \cdot r \bmod n$). However, this is not the case for multiplication. Denoting multiplication in Montgomery space as $*$ and normal multiplication as $\cdot$, we expect the result to be: +However, this is not the case for multiplication. Denoting multiplication in the Montgomery space as $*$ and the "normal" multiplication as $\cdot$, we expect the result to be: $$ \bar{x} * \bar{y} = \overline{x \cdot y} = (x \cdot y) \cdot r \bmod n $$ -But the normal multiplication will give us: +But the normal multiplication in the Montgomery space yields: $$ \bar{x} \cdot \bar{y} = (x \cdot y) \cdot r \cdot r \bmod n $$ -Therefore the multiplication in the Montgomery space is defined as +Therefore, the multiplication in the Montgomery space is defined as $$ \bar{x} * \bar{y} = \bar{x} \cdot \bar{y} \cdot r^{-1} \bmod n $$ -This means that whenever we multiply two numbers, after the multiplication we need to *reduce* them. Therefore, we need to have an efficient way of calculating $x \cdot r^{-1} \bmod n$. +This means that, after we normally multiply two numbers in the Montgomery space, we need to *reduce* the result by multiplying it by $r^{-1}$ and taking the modulo — and there is an efficent way to do this particular operation. ### Montgomery reduction -Assume that $r=2^{64}$, the modulo $n$ is 64-bit and the number $x$ we need to reduce (multiply by $r^{-1}$) is 128-bit (the product of two 64-bit numbers). +Assume that $r=2^{32}$, the modulo $n$ is 32-bit, and the number $x$ we need to reduce is 64-bit (the product of two 32-bit numbers). Our goal is to calculate $y = x \cdot r^{-1} \bmod n$. -Because $\gcd(n, r) = 1$, we know that there are two numbers $r^{-1}$ and $n'$ in the $[0, n)$ range such that +Since $r$ is coprime with $n$, we know that there are two numbers $r^{-1}$ and $n^\prime$ in the $[0, n)$ range such that $$ -r \cdot r^{-1} + n \cdot n' = 1 +r \cdot r^{-1} + n \cdot n^\prime = 1 $$ -and both $r^{-1}$ and $n'$ can be computed using the extended Euclidean algorithm. +and both $r^{-1}$ and $n^\prime$ can be computed, e.g., using the [extended Euclidean algorithm](../euclid-extended). -Using this identity we can express $r \cdot r^{-1}$ as $(-n \cdot n' + 1)$ and write $x \cdot r^{-1}$ as +Using this identity, we can express $r \cdot r^{-1}$ as $(1 - n \cdot n^\prime)$ and write $x \cdot r^{-1}$ as $$ \begin{aligned} x \cdot r^{-1} &= x \cdot r \cdot r^{-1} / r -\\ &= x \cdot (-n \cdot n^{\prime} + 1) / r -\\ &= (-x \cdot n \cdot n^{\prime} + x) / r -\\ &\equiv (-x \cdot n \cdot n^{\prime} + l \cdot r \cdot n + x) / r \bmod n -\\ &\equiv ((-x \cdot n^{\prime} + l \cdot r) \cdot n + x) / r \bmod n +\\ &= x \cdot (1 - n \cdot n^{\prime}) / r +\\ &= (x - x \cdot n \cdot n^{\prime} ) / r +\\ &\equiv (x - x \cdot n \cdot n^{\prime} + k \cdot r \cdot n) / r &\pmod n &\;\;\text{(for any integer $k$)} +\\ &\equiv (x - (x \cdot n^{\prime} - k \cdot r) \cdot n) / r &\pmod n \end{aligned} $$ -The equivalences hold for any integer $l$. This means that we can add or subtract an arbitrary multiple of $r$ to $x \cdot n'$, or in other words, we can compute $q = x \cdot n'$ modulo $r$. 
+Now, if we choose $k$ to be $\lfloor x \cdot n^\prime / r \rfloor$ (the upper 64 bits of the $x \cdot n^\prime$ product), then $(x \cdot n^{\prime} - k \cdot r)$ will simply be equal to $x \cdot n^{\prime} \bmod r$ (the lower 32 bits of $x \cdot n^\prime$), implying:
+
+$$
+x \cdot r^{-1} \equiv (x - (x \cdot n^{\prime} \bmod r) \cdot n) / r
+$$
+
+The algorithm itself just evaluates this formula, performing two multiplications to calculate $q = x \cdot n^{\prime} \bmod r$ and $m = q \cdot n$, and then subtracts $m$ from $x$ and right-shifts the result to divide it by $r$.
+
+The only remaining thing to handle is that the result may not be in the $[0, n)$ range; but since
+
+$$
+x < n \cdot n < r \cdot n \implies x / r < n
+$$
+
+and
+
+$$
+m = q \cdot n < r \cdot n \implies m / r < n
+$$
+
+it is guaranteed that
+
+$$
+-n < (x - m) / r < n
+$$
+
+Therefore, we can simply check if the result is negative and, in that case, add $n$ to it, giving the following algorithm:
-This gives us the following algorithm to compute $x \cdot r^{-1} \bmod n$:
+```c++
+typedef __uint32_t u32;
+typedef __uint64_t u64;
-```python
-def reduce(x):
-    q = (x % r) * nr % r
-    a = (x - q * n) / r
-    if a < 0:
-        a += n
-    return a
+const u32 n = 1e9 + 7, nr = inverse(n, 1ull << 32);
+
+u32 reduce(u64 x) {
+    u32 q = u32(x) * nr;      // q = x * n' mod r
+    u64 m = (u64) q * n;      // m = q * n
+    u32 y = (x - m) >> 32;    // y = (x - m) / r
+    return x < m ? y + n : y; // if y < 0, add n to bring it into the [0, n) range
+}
```
-Since $x < n \cdot n < r \cdot n$ (as $x$ is a product of multiplicatio) and $q \cdot n < r \cdot n$, we know that $-n < (x - q \cdot n) / r < n$. Therefore the final modulo operation can be implemented using a single bound check and addition.
+This last check is relatively cheap, but it is still on the critical path. If we are fine with the result being in the $[0, 2 \cdot n - 2]$ range instead of $[0, n)$, we can remove it and add $n$ to the result unconditionally:
+
+```c++
+u32 reduce(u64 x) {
+    u32 q = u32(x) * nr;
+    u64 m = (u64) q * n;
+    u32 y = (x - m) >> 32;
+    return y + n;
+}
+```
+
+We can also move the `>> 32` operation one step earlier in the computation graph and compute $\lfloor x / r \rfloor - \lfloor m / r \rfloor$ instead of $(x - m) / r$. This is correct because the lower 32 bits of $x$ and $m$ are equal anyway since
+
+$$
+m \equiv x \cdot n^\prime \cdot n \equiv x \pmod r
+$$
+
+But why would we voluntarily choose to perform two right-shifts instead of just one? This is beneficial because for `((u64) q * n) >> 32` we need to do a 32-by-32 multiplication and take the upper 32 bits of the result (which the x86 `mul` instruction [already writes](/hpc/arithmetic/integer/#128-bit-integers) in a separate register, so it doesn't cost anything), and the other right-shift `x >> 32` is not on the critical path.
+ +```c++ +u32 reduce(u64 x) { + u32 q = u32(x) * nr; + u32 m = ((u64) q * n) >> 32; + return (x >> 32) + n - m; +} +``` -Here is an equivalent C implementation for 64-bit integers: +One of the main advantages of Montgomery multiplication over other modular reduction methods is that it doesn't require very large data types: it only needs a $r \times r$ multiplication that extracts the lower and higher $r$ bits of the result, which [has special support](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=7395,7392,7269,4868,7269,7269,1820,1835,6385,5051,4909,4918,5051,7269,6423,7410,150,2138,1829,1944,3009,1029,7077,519,5183,4462,4490,1944,5055,5012,5055&techs=AVX,AVX2&text=mul) on most hardware also makes it easily generalizable to [SIMD](../hpc/simd/) and larger data types: ```c++ -u64 reduce(u128 x) { +typedef __uint128_t u128; + +u64 reduce(u128 x) const { u64 q = u64(x) * nr; u64 m = ((u128) q * n) >> 64; - u64 xhi = (x >> 64); - if (xhi >= m) - return (xhi - m); - else - return (xhi - m) + n; + return (x >> 64) + n - m; } ``` -We also need to implement calculating calculating the inverse of $n$ (`nr`) and transformation of numbers in and our of Montgomery space. Before providing complete implementation, let's discuss how to do that smarter, although they are just done once. +Note that a 128-by-64 modulo is not possible with general integer division tricks: the compiler [falls back](https://godbolt.org/z/fbEE4v4qr) to calling a slow [long arithmetic library function](https://github.com/llvm-mirror/compiler-rt/blob/69445f095c22aac2388f939bedebf224a6efcdaf/lib/builtins/udivmodti4.c#L22) to support it. + +### Faster Inverse and Transform -To transfer a number back from the Montgomery space we can just use Montgomery reduction. +Montgomery multiplication itself is fast, but it requires some precomputation: -### Fast inverse +- inverting $n$ modulo $r$ to compute $n^\prime$, +- transforming a number *to* the Montgomery space, +- transforming a number *from* the Montgomery space. -For computing the inverse $n' = n^{-1} \bmod r$ more efficiently, we can use the following trick inspired from the Newton's method: +The last operation is already efficiently performed with the `reduce` procedure we just implemented, but the first two can be slightly optimized. + +**Computing the inverse** $n^\prime = n^{-1} \bmod r$ can be done faster than with the extended Euclidean algorithm by taking advantage of the fact that $r$ is a power of two and using the following identity: $$ a \cdot x \equiv 1 \bmod 2^k @@ -106,7 +212,7 @@ a \cdot x \cdot (2 - a \cdot x) 1 \bmod 2^{2k} $$ -This can be proven this way: +Proof: $$ \begin{aligned} @@ -119,47 +225,69 @@ a \cdot x \cdot (2 - a \cdot x) \end{aligned} $$ -This means we can start with $x = 1$ as the inverse of $a$ modulo $2^1$, apply the trick a few times and in each iteration we double the number of correct bits of $x$. - -### Fast transformation +We can start with $x = 1$ as the inverse of $a$ modulo $2^1$ and apply this identity exactly $\log_2 r$ times, each time doubling the number of bits in the inverse — somewhat reminiscent of [the Newton's method](../hpc/arithmetic/newton/). 
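As a standalone sketch (for $r = 2^{32}$; the function name is ours, and the same five-iteration update appears in the constructor of the complete implementation below):

```c++
typedef __uint32_t u32;

// n' = n^(-1) mod 2^32 for an odd n
u32 inverse32(u32 n) {
    u32 x = 1;                  // x is the inverse of n modulo 2^1
    for (int i = 0; i < 5; i++) // 2 -> 4 -> 8 -> 16 -> 32 correct bits
        x = x * (2 - n * x);    // unsigned arithmetic wraps around 2^32, which is exactly what we need
    return x;
}
```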
-Although we can just multiply a number by $r$ and compute one modulo the usual way, there is a faster way that makes use of the following relation:
+**Transforming** a number into the Montgomery space can be done by multiplying it by $r$ and computing modulo [the usual way](/hpc/arithmetic/division/), but we can also take advantage of this relation:
 
 $$
 \bar{x} = x \cdot r \bmod n = x * r^2
 $$
 
-Transforming a number into the space is just a multiplication inside the space of the number with $r^2$. Therefore we can precompute $r^2 \bmod n$ and just perform a multiplication and reduction instead.
+Transforming a number into the space is just a multiplication by $r^2$. Therefore, we can precompute $r^2 \bmod n$ and perform a multiplication and reduction instead — which may or may not be actually faster because multiplying a number by $r=2^{k}$ can be implemented with a left-shift, while multiplication by $r^2 \bmod n$ can not.
 
 ### Complete Implementation
 
+It is convenient to wrap everything into a single `constexpr` structure:
+
 ```c++
-// TODO fix me and prettify me
-struct montgomery {
-    u64 n, nr;
+struct Montgomery {
+    u32 n, nr;
     
-    montgomery(u64 n) : n(n) {
-        nr = 1;
-        for (int i = 0; i < 6; i++)
+    constexpr Montgomery(u32 n) : n(n), nr(1) {
+        // log2(32) = 5 iterations are enough for a 32-bit inverse
+        for (int i = 0; i < 5; i++)
             nr *= 2 - n * nr;
     }
 
-    u64 reduce(u128 x) {
-        u64 q = u64(x) * nr;
-        u64 m = ((u128) q * n) >> 64;
-        u64 xhi = (x >> 64);
-        if (xhi >= m)
-            return (xhi - m);
-        else
-            return (xhi - m) + n;
+    u32 reduce(u64 x) const {
+        u32 q = u32(x) * nr;
+        u32 m = ((u64) q * n) >> 32;
+        return (x >> 32) + n - m;
+        // returns a number in the [0, 2 * n - 2] range
+        // (add a "x < n ? x : x - n" type of check if you need a proper modulo)
     }
 
-    u64 mult(u64 x, u64 y) {
-        return reduce((u128) x * y);
+    u32 multiply(u32 x, u32 y) const {
+        return reduce((u64) x * y);
     }
 
-    u64 transform(u64 x) {
-        return (u128(x) << 64) % n;
+    u32 transform(u32 x) const {
+        return (u64(x) << 32) % n;
+        // can also be implemented as multiply(x, r^2 mod n)
     }
 };
 ```
+
+To test its performance, we can plug Montgomery multiplication into the [binary exponentiation](../exponentiation/):
+
+```c++
+constexpr Montgomery space(M);
+
+int inverse(int _a) {
+    u64 a = space.transform(_a);
+    u64 r = space.transform(1);
+    
+    #pragma GCC unroll(30)
+    for (int l = 0; l < 30; l++) {
+        if ( (M - 2) >> l & 1 )
+            r = space.multiply(r, a);
+        a = space.multiply(a, a);
+    }
+
+    return space.reduce(r);
+}
+```
+
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns if we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+
+**Exercise.** Implement efficient *modular* [matrix multiplication](/hpc/algorithms/matmul).
diff --git a/content/english/hpc/parallel/concurrency/fibers.md b/content/english/hpc/parallel/concurrency/fibers.md
index 2ec2806c..cce7b860 100644
--- a/content/english/hpc/parallel/concurrency/fibers.md
+++ b/content/english/hpc/parallel/concurrency/fibers.md
@@ -28,4 +28,4 @@ func main() {
 The way they work is that the language maintains a group of threads ready to pick up from where they left. This is called N:M scheduling.
 
-Similar runtimes exist for other languages, e. g. for C++ and Rust.
+Similar runtimes exist for other languages, e.g., for C++ and Rust. diff --git a/content/english/hpc/parallel/gpu/_index.en.md b/content/english/hpc/parallel/gpu/_index.en.md index aafb7ba1..ac2a4aa9 100644 --- a/content/english/hpc/parallel/gpu/_index.en.md +++ b/content/english/hpc/parallel/gpu/_index.en.md @@ -73,7 +73,7 @@ CUDA is available for many languages. Nice documentation can be found here: https://documen.tician.de/pycuda/index.html -If you are on Colab, go to Runtime -> Change runtime type -> Hardware accelerator and set it to "GPU". +If you are on Colab, go to Runtime -> Change runtime type -> Hardware accelerator and set it to "GPU." ```python @@ -167,7 +167,7 @@ There is also `drv.InOut` function, which makes it available for both reading an Most of the operations here are memory operations, so measuring performance here is useless. Don't worry, we will get to more complex examples soon enough. -GPUs have very specific operations. However, in case of NVIDIA GPUs managing it is quite simple: the cards have *compute capabilities* (1.0, 1.1, 1.2, 1.3, 2.0, etc.) and all features added at capability $x$ is also available at later versions. These can be checked at run-time or compile-time. +GPUs have very specific operations. However, in case of NVIDIA GPUs managing it is quite simple: the cards have *compute capabilities* (1.0, 1.1, 1.2, 1.3, 2.0, etc.) and all features added at capability $x$ is also available at later versions. These can be checked at run time or compile time. You can check differences in this Wikipedia article: https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications @@ -195,7 +195,7 @@ Some tasks, especially in cryptography, cannot be parallelized. But some can. ## Summing arrays in $O(\log n)$ time -Assume we want to perform some associative (i. e. $A*(B*C) = (A*B)*C$) operation on an array of $n$ elements. Say, sum it up. +Assume we want to perform some associative (i.e., $A*(B*C) = (A*B)*C$) operation on an array of $n$ elements. Say, sum it up. Normally, we would do that with a simple loop: @@ -418,7 +418,7 @@ Intrinsics for that. Now, a lot of value comes from cryptocurrency and deep learning. The latter relies on two specific operations: matrix multiplications for linear layers and convolutions for convolutional layers used in computer vision. -First, they introduced "multiply-accumulate" operation (e. g. `x += y * z`) per 1 GPU clock cycle. +First, they introduced "multiply-accumulate" operation (e.g., `x += y * z`) per 1 GPU clock cycle. Google uses Tensor Processing Units. Nobody really knows how they work (proprietary hardware that they rent, not sell). @@ -431,7 +431,7 @@ Well, you don't really need anything more precise than that for deep learning an It is called mixed precision because input matrices are fp16 but multiplication result and accumulator are fp32 matrices. -Probably, the proper name would be "4x4 matrix cores", however NVIDIA marketing team decided to use "tensor cores". +Probably, the proper name would be "4x4 matrix cores," however NVIDIA marketing team decided to use "tensor cores." So, see, this is not exactly fair comparison. 
diff --git a/content/english/hpc/pipelining/_index.md b/content/english/hpc/pipelining/_index.md index 3d7d49b5..aab72d79 100644 --- a/content/english/hpc/pipelining/_index.md +++ b/content/english/hpc/pipelining/_index.md @@ -5,7 +5,7 @@ weight: 3 When programmers hear the word *parallelism*, they mostly think about *multi-core parallelism*, the practice of explicitly splitting a computation into semi-independent *threads* that work together to solve a common problem. -This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware), and parallel algorithm design is becoming an increasingly more important area, for now, we will consider the use of more than one CPU core cheating. +This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware) and parallel algorithm design is becoming an increasingly important area, for now, we will limit ourselves to considering only a single CPU core. But there are other types of parallelism, already existing inside a CPU core, that you can use *for free*. @@ -19,9 +19,9 @@ Parallelism helps in reducing *latency*. It is important, but for now, our main Sharing computations is an art in itself, but for now, we want to learn how to use resources that we already have more efficiently. -While multi-core parallelism is "cheating", many form of parallelism exist "for free". +While multi-core parallelism is "cheating," many form of parallelism exist "for free." -Adapting algorithms for parallel hardware is important for achieving *scalability*. In the first part of this book, we will consider this technique "cheating". We only do optimizations that are truly free, and preferably don't take away resources from other processes that might be running concurrently. +Adapting algorithms for parallel hardware is important for achieving *scalability*. In the first part of this book, we will consider this technique "cheating." We only do optimizations that are truly free, and preferably don't take away resources from other processes that might be running concurrently. --> @@ -42,16 +42,16 @@ Pipelining does not reduce *actual* latency but functionally makes it seem like Having this in mind, hardware manufacturers prefer to use *cycles per instruction* (CPI) instead of something like "average instruction latency" as the main performance indicator for CPU designs. It is a [pretty good metric](/hpc/profiling/benchmarking) for algorithm designs too, if we only consider *useful* instructions. -CPI of a perfectly pipelined processor should tend to one, but it can actually be even lower if we make each stage of the pipeline "wider" by duplicating it, so that more than one instruction can be processed at a time. Because the cache and most of the ALU can be shared, this ends up being cheaper than adding a fully separate core. Such architectures, capable of executing more than one instruction per cycle, are called *superscalar*, and most modern CPUs are. 
+The CPI of a perfectly pipelined processor should tend to one, but it can actually be even lower if we make each stage of the pipeline "wider" by duplicating it, so that more than one instruction can be processed at a time. Because the cache and most of the ALU can be shared, this ends up being cheaper than adding a fully separate core. Such architectures, capable of executing more than one instruction per cycle, are called *superscalar*, and most modern CPUs are. -You can only take advantage of superscalar processing if the stream of instructions contains groups of logically independent operations that can be processed separately. The instructions don't always arrive in the most convenient order, so, when possible, modern CPUs can execute them *out-of-order* to improve overall utilization and minimize pipeline stalls. How this magic works is a topic for a more advanced discussion, but for now, you can assume that the CPU maintains a buffer of pending instructions up to some distance in the future, and executes them as soon as the values of its operands are computed and there is an execution unit available. +You can only take advantage of superscalar processing if the stream of instructions contains groups of logically independent operations that can be processed separately. The instructions don't always arrive in the most convenient order, so, when possible, modern CPUs can execute them *out of order* to improve overall utilization and minimize pipeline stalls. How this magic works is a topic for a more advanced discussion, but for now, you can assume that the CPU maintains a buffer of pending instructions up to some distance in the future, and executes them as soon as the values of its operands are computed and there is an execution unit available. ### An Education Analogy Consider how our education system works: 1. Topics are taught to groups of students instead of individuals as broadcasting the same things to everyone at once is more efficient. -2. An intake of students is split into groups lead by different teachers; assignments and other course materials are shared between groups. +2. An intake of students is split into groups led by different teachers; assignments and other course materials are shared between groups. 3. Each year the same course is taught to a new intake so that the teachers are kept busy. These innovations greatly increase the *throughput* of the whole system, although the *latency* (time to graduation for a particular student) remains unchanged (and maybe increases a little bit because personalized tutoring is more effective). @@ -62,7 +62,7 @@ You can find many analogies with modern CPUs: 2. There are multiple execution units that can process these instructions simultaneously while sharing other CPU facilities (usually 2-4 execution units). 3. Instructions are processed in pipelined fashion (saving roughly the same number of cycles as the number of years between kindergarten and PhD). - + In addition to that, several other aspects also match: diff --git a/content/english/hpc/pipelining/branching.md b/content/english/hpc/pipelining/branching.md index 849e75a0..08d7887d 100644 --- a/content/english/hpc/pipelining/branching.md +++ b/content/english/hpc/pipelining/branching.md @@ -45,17 +45,17 @@ body: jmp counter ``` -Our goal is to simulate a completely unpredictable branch, and we successfully achieve it: the code takes ~14 CPU cycles per element. 
For a very rough estimate of what it is supposed to be, we can assume that the branches alternate between "<" and ">=", and the pipeline is mispredicted every other iteration. Then, every two iterations: +Our goal is to simulate a completely unpredictable branch, and we successfully achieve it: the code takes ~14 CPU cycles per element. For a very rough estimate of what it is supposed to be, we can assume that the branches alternate between `<` and `>=`, and the pipeline is mispredicted every other iteration. Then, every two iterations: -- We discard the pipeline, which is 19 cycles deep on Zen 2 (i. e. it has 19 stages, each taking one cycle). +- We discard the pipeline, which is 19 cycles deep on Zen 2 (i.e., it has 19 stages, each taking one cycle). - We need a memory fetch and a comparison, which costs ~5 cycles. We can check the conditions of even and odd iterations concurrently, so let's assume we only pay it once per 2 iterations. -- In the case of the "<" branch, we need another ~4 cycles to add `a[i]` to a volatile (memory-stored) variable `s`. +- In the case of the `<` branch, we need another ~4 cycles to add `a[i]` to a volatile (memory-stored) variable `s`. Therefore, on average, we need to spend $(4 + 5 + 19) / 2 = 14$ cycles per element, matching what we measured. ### Branch Prediction -We can replace the hardcoded `50` with a tweakable parameter `P` that effectively sets the probability of the "<" branch: +We can replace the hardcoded `50` with a tweakable parameter `P` that effectively sets the probability of the `<` branch: ```c++ for (int i = 0; i < N; i++) @@ -69,7 +69,7 @@ Now, if we benchmark it for different values of `P`, we get an interesting-looki Its peak is at 50-55%, as expected: branch misprediction is the most expensive thing here. This graph is asymmetrical: it takes just ~1 cycle to only check conditions that are never satisfied (`P = 0`), and ~7 cycles for the sum if the branch is always taken (`P = 100`). -This graph is not unimodal: there is another local minimum at around 85-90%. We spend ~6.15 cycles per element there or about 10-15% faster than when we always take the branch, accounting for the fact that we need to perform fewer additions. Branch misprediction stops affecting the performance at this point because when it happens, not the whole instruction buffer is discarded, but only the operations that were speculatively scheduled. Essentially, that 10-15% mispredict rate is the equilibrium point where we can see far enough in the pipeline not to stall but still save 10-15% on taking the cheaper ">=" branch. +This graph is not unimodal: there is another local minimum at around 85-90%. We spend ~6.15 cycles per element there or about 10-15% faster than when we always take the branch, accounting for the fact that we need to perform fewer additions. Branch misprediction stops affecting the performance at this point because when it happens, not the whole instruction buffer is discarded, but only the operations that were speculatively scheduled. Essentially, that 10-15% mispredict rate is the equilibrium point where we can see far enough in the pipeline not to stall but still save 10-15% on taking the cheaper `>=` branch. Note that it costs almost nothing to check for a condition that never or almost never occurs. This is why programmers use runtime exceptions and base case checks so profusely: if they are indeed rare, they don't really cost anything. 
@@ -86,9 +86,9 @@ for (int i = 0; i < N; i++) std::sort(a, a + n); ``` -We are still processing the same elements, but in a different order, and instead of 14 cycles, it now runs in a little bit more than 4, which is exactly the average of the cost of the pure "<" and ">=" branches. +We are still processing the same elements, but in a different order, and instead of 14 cycles, it now runs in a little bit more than 4, which is exactly the average of the cost of the pure `<` and `>=` branches. -The branch predictor can pick up on much more complicated patterns than just "always left, then always right" or "left-right-left-right". If we just decrease the size of the array $N$ to 1000 (without sorting it), then the branch predictor memorizes the entire sequence of comparisons, and the benchmark again measures at around 4 cycles — in fact, even slightly fewer than in the sorted array case, because in the former case branch predictor needs to spend some time flicking between the "always yes" and "always no" states. +The branch predictor can pick up on much more complicated patterns than just "always left, then always right" or "left-right-left-right." If we just decrease the size of the array $N$ to 1000 (without sorting it), then the branch predictor memorizes the entire sequence of comparisons, and the benchmark again measures at around 4 cycles — in fact, even slightly fewer than in the sorted array case, because in the former case branch predictor needs to spend some time flicking between the "always yes" and "always no" states. ### Hinting Likeliness of Branches diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md index 62f0aa2f..31bd5a39 100644 --- a/content/english/hpc/pipelining/branchless.md +++ b/content/english/hpc/pipelining/branchless.md @@ -28,11 +28,11 @@ for (int i = 0; i < N; i++) s += (a[i] < 50) * a[i]; ``` -Suddenly, the loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains constant if we change `50` to some other threshold, so it doesn't depend on the branch probability. +The loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains constant if we change `50` to some other threshold, so it doesn't depend on the branch probability. But wait… shouldn't there still be a branch? How does `(a[i] < 50)` map to assembly? -There are no boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then the highest bit of the result will be set to one, which we can then extract using a right-shift. +There are no Boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then the highest bit of the result will be set to one, which we can then extract using a right-shift. 
```nasm mov ebx, eax ; t = x @@ -41,7 +41,7 @@ sar ebx, 31 ; t >>= 31 imul eax, ebx ; x *= t ``` -Another, more complicated way to implement this whole sequence is to convert this sign bit into a mask and then use bitwise `and` instead of multiplication: `((a[i] - 50) >> 1 - 1) & a`. This makes the whole sequence one cycle faster, considering that unlike other instructions, `imul` takes 3 cycles: +Another, more complicated way to implement this whole sequence is to convert this sign bit into a mask and then use bitwise `and` instead of multiplication: `((a[i] - 50) >> 31 - 1) & a[i]`. This makes the whole sequence one cycle faster, considering that, unlike other instructions, `imul` takes 3 cycles: ```nasm mov ebx, eax ; t = x @@ -91,9 +91,9 @@ $$ This way you can eliminate branching, but this comes at the cost of evaluating *both* branches and the `cmov` itself. Because evaluating the ">=" branch costs nothing, the performance is exactly equal to [the "always yes" case](../branching/#branch-prediction) in the branchy version. -### When It Is Beneficial +### When Predication Is Beneficial -Using predication eliminates [a structural hazard](../hazard) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for `cmov` to be resolved and not flush the entire pipeline in case of a mispredict. +Using predication eliminates [a control hazard](../hazards) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for `cmov` to be resolved and not flush the entire pipeline in case of a mispredict. However, there are many situations when it is more efficient to leave branchy code as it is. This is the case when the cost of computing *both* branches instead of just *one* outweighs the penalty for the potential branch mispredictions. @@ -101,9 +101,9 @@ In our example, the branchy code wins when the branch can be predicted with a pr ![](../img/branchy-vs-branchless.svg) -This 75% threshold is commonly used by the compilers as a heuristic for determining whether to use the `cmov` or not. Unfortunately, this probability is usually unknown at the compile-time, so it needs to be provided in one of several ways: +This 75% threshold is commonly used by the compilers as a heuristic for determining whether to use the `cmov` or not. Unfortunately, this probability is usually unknown at the compile time, so it needs to be provided in one of several ways: -- We can use [profile-guided optimization](/hpc/compilation/pgo) which will decide for itself whether to use predication or not. +- We can use [profile-guided optimization](/hpc/compilation/situational/#profile-guided-optimization) which will decide for itself whether to use predication or not. - We can use [likeliness attributes](../branching#hinting-likeliness-of-branches) and [compiler-specific intrinsics](/hpc/compilation/situational) to hint at the likeliness of branches: `__builtin_expect_with_probability` in GCC and `__builtin_unpredictable` in Clang. - We can rewrite branchy code using the ternary operator or various arithmetic tricks, which acts as sort of an implicit contract between programmers and compilers: if the programmer wrote the code this way, then it was probably meant to be branchless. 
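For illustration, the hints from the list above might be applied to our running example like this (a sketch reusing `a`, `s`, and `N` from before; the builtin is the GCC one named in the second bullet, and the 0.75 probability is an arbitrary placeholder):

```c++
for (int i = 0; i < N; i++)
    if (__builtin_expect_with_probability(a[i] < 50, 1, 0.75))
        s += a[i];
```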
@@ -180,11 +180,11 @@ int abs(int a) { ### Larger Examples -**Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated char array (also known as "C-string") allocated somewhere on the heap and one integer containing the string size. +**Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated `char` array (also known as a "C-string") allocated somewhere on the heap and one integer containing the string size. -A very common value for strings is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings. +A common value for a string is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic approach is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings. -However, this requires a separate branch, which is costly unless most strings are empty. What we can do to get rid of it is to allocate a "zero C-string", which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction. +However, this requires a separate branch, which is costly (unless the majority of strings are either empty or non-empty). To remove the check and thus also the branch, we can allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction. **Binary search.** The standard binary search [can be implemented](/hpc/data-structures/binary-search) without branches, and on small arrays (that fit into cache) it works ~4x faster than the branchy `std::lower_bound`: @@ -193,10 +193,10 @@ int lower_bound(int x) { int *base = t, len = n; while (len > 1) { int half = len / 2; - base = (base[half] < x ? &base[half] : base); + base += (base[half - 1] < x) * half; // will be replaced with a "cmov" len -= half; } - return *(base + (*base < x)); + return *base; } ``` @@ -216,9 +216,9 @@ That there are no substantial reasons why compilers can't do this on their own, --> -**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications, including GPU programming, because they don't have branching in the first place. +**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications because they don't have branching in the first place. -In our array sum example, if you remove the `volatile` type qualifier from the accumulator, the compiler becomes able to [vectorize](/hpc/simd/auto-vectorization) the loop: +In our array sum example, removing the `volatile` type qualifier from the accumulator allows the compiler to [vectorize](/hpc/simd/auto-vectorization) the loop: ```c++ /* volatile */ int s = 0; @@ -230,7 +230,7 @@ for (int i = 0; i < N; i++) It now works in ~0.3 per element, which is mainly [bottlenecked by the memory](/hpc/cpu-cache/bandwidth). 
-The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/shuffling). +The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific small deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/shuffling). -Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp). Most execution units have their own little pipelines, and can take another instruction just one or two cycles after the previous one. If a certain instruction is frequently used, it makes sense to duplicate its execution unit also, and also place frequently jointly used instructions on the same execution unit: e. g. not using the same for arithmetic and memory operation. +Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp). Most execution units have their own little pipelines, and can take another instruction just one or two cycles after the previous one. If a certain instruction is frequently used, it makes sense to duplicate its execution unit also, and also place frequently jointly used instructions on the same execution unit: e.g., not using the same for arithmetic and memory operation. ### Microcode @@ -22,9 +22,9 @@ While complex instruction sets had the benefit, with superscalar processors you Instructions are microcoded. -uOps ("micro-ops", the first letter is meant to be greek letter mu as in us (microsecond), but nobody cares enough to type it). +uOps ("micro-ops," the first letter is meant to be greek letter mu as in us (microsecond), but nobody cares enough to type it). -Each architecture has its own set of "ports", each capable of executing its own set of instructions (uOps, to be more exact). +Each architecture has its own set of "ports," each capable of executing its own set of instructions (uOps, to be more exact). But still, when you use it, it appears and feels like a single instruction. How does CPU achieve that? diff --git a/content/english/hpc/pipelining/tables.md b/content/english/hpc/pipelining/tables.md index d18d99c6..ad90c400 100644 --- a/content/english/hpc/pipelining/tables.md +++ b/content/english/hpc/pipelining/tables.md @@ -14,7 +14,7 @@ In this context, it makes sense to use two different "[costs](/hpc/complexity)" -You can get latency and throughput numbers for a specific architecture from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf). 
Here are some samples values for my Zen 2 (all specified for 32-bit operands, if there is any difference): +You can get latency and throughput numbers for a specific architecture from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf). Here are some sample values for my Zen 2 (all specified for 32-bit operands, if there is any difference): | Instruction | Latency | RThroughput | |-------------|---------|:------------| @@ -30,11 +30,11 @@ You can get latency and throughput numbers for a specific architecture from spec Some comments: -- Because our minds are so used to the cost model where "more" means "worse", people mostly use *reciprocals* of throughput instead of throughput. +- Because our minds are so used to the cost model where "more" means "worse," people mostly use *reciprocals* of throughput instead of throughput. - If a certain instruction is especially frequent, its execution unit could be duplicated to increase its throughput — possibly to even more than one, but not higher than the [decode width](/hpc/architecture/layout). - Some instructions have a latency of 0. This means that these instruction are used to control the scheduler and don't reach the execution stage. They still have non-zero reciprocal throughput because the [CPU front-end](/hpc/architecture/layout) still needs to process them. -- Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is the [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all. -- Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), latency is usually specified for the best case (an L1 cache hit). +- Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all. +- Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), the latency is usually specified for the best case (an L1 cache hit). There are many more important little details, but this mental model will suffice for now. diff --git a/content/english/hpc/pipelining/throughput.md b/content/english/hpc/pipelining/throughput.md index ffb6b762..0b596404 100644 --- a/content/english/hpc/pipelining/throughput.md +++ b/content/english/hpc/pipelining/throughput.md @@ -6,7 +6,7 @@ weight: 4 Optimizing for *latency* is usually quite different from optimizing for *throughput*: - When optimizing data structure queries or small one-time or branchy algorithms, you need to [look up the latencies](../tables) of its instructions, mentally construct the execution graph of the computation, and then try to reorganize it so that the critical path is shorter. 
-- When optimizing hot loops and large-dataset algorithms, you need to look up the throughputs of its instructions, count how many times each one is used per iteration, determine which of them is the bottleneck, and then try to restructure the loop so that it is used less often. +- When optimizing hot loops and large-dataset algorithms, you need to look up the throughputs of their instructions, count how many times each one is used per iteration, determine which of them is the bottleneck, and then try to restructure the loop so that it is used less often. The last advice only works for *data-parallel* loops, where each iteration is fully independent of the previous one. When there is some interdependency between consecutive iterations, there may potentially be a pipeline stall caused by a [data hazard](../hazards) as the next iteration is waiting for the previous one to complete. @@ -21,7 +21,7 @@ for (int i = 0; i < n; i++) s += a[i]; ``` -Let's assume for a moment that the compiler doesn't [vectorize](/hpc/simd) this loop, [the memory bandwidth](/hpc/memory/bandwidth) isn't a concern, and that the loop is [unrolled](/hpc/architecture/loops) so that we don't pay any additional cost associated with maintaining the loop variables. In this case, the computation becomes very simple: +Let's assume for a moment that the compiler doesn't [vectorize](/hpc/simd) this loop, [the memory bandwidth](/hpc/cpu-cache/bandwidth) isn't a concern, and that the loop is [unrolled](/hpc/architecture/loops) so that we don't pay any additional cost associated with maintaining the loop variables. In this case, the computation becomes very simple: ```c++ int s = 0; @@ -64,7 +64,7 @@ If an instruction has a latency of $x$ and a throughput of $y$, then you would n This technique is mostly used with [SIMD](/hpc/simd) and not in scalar code. You can [generalize](/hpc/simd/reduction) the code above and compute sums and other reductions faster than the compiler. -In general, when optimizing loops, you usually have just one or a few *execution ports* that you want to utilize to their fullest, and you engineer the rest of the loop around them. As different instructions may use different sets of ports, it is not always clear which one is going to be the overused. In situations like this, [machine code analyzers](/hpc/profiling/mca) can be very helpful for finding bottlenecks of small assembly loops. +In general, when optimizing loops, you usually have just one or a few *execution ports* that you want to utilize to their fullest, and you engineer the rest of the loop around them. As different instructions may use different sets of ports, it is not always clear which one is going to be overused. In situations like this, [machine code analyzers](/hpc/profiling/mca) can be very helpful for finding the bottlenecks of small assembly loops. + *Instrumentation* is an overcomplicated term that means inserting timers and other tracking code into programs. The simplest example is using the `time` utility in Unix-like systems to measure the duration of execution for the whole program. More generally, we want to know *which parts* of the program need optimization. 
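To round off the throughput discussion before we turn to profiling: as a concrete instance of the "you need $x \cdot y$ parallel instances" rule, the scalar array sum can be split into two independent chains of additions, so that a new addition can be issued while the previous one is still completing. This is only a sketch, and it assumes the compiler does not already vectorize or re-associate the loop on its own:

```c++
// Two independent accumulators hide the 1-cycle latency of scalar addition:
// the adds to s0 and s1 do not depend on each other and can overlap.
int sum_interleaved(const int *a, int n) {
    int s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)        // odd-length arrays leave one element over
        s0 += a[i];
    return s0 + s1;
}
```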
There are tools shipped with compilers and IDEs that can time designated functions automatically, but it is more robust to do it by hand using any methods of interacting with time that the language provides: @@ -77,4 +80,4 @@ void query() { This way we can remove the need to sample a new random number on each invocation, only resetting the counter when we choose to calculate statistics. -Techniques like that are frequently by library algorithm developers inside large projects to collect profiling data without affecting the performance of the end program too much. +Techniques like that are frequently used by library algorithm developers inside large projects to collect profiling data without affecting the performance of the end program too much. diff --git a/content/english/hpc/profiling/mca.md b/content/english/hpc/profiling/mca.md index 4634ba25..99cfe2ed 100644 --- a/content/english/hpc/profiling/mca.md +++ b/content/english/hpc/profiling/mca.md @@ -40,7 +40,7 @@ First, it outputs general information about the loop and the hardware: - It "ran" the loop 100 times, executing 400 instructions in total in 108 cycles, which is the same as executing $\frac{400}{108} \approx 3.7$ [instructions per cycle](/hpc/complexity/hardware) on average (IPC). - The CPU is theoretically capable of executing up to 6 instructions per cycle ([dispatch width](/hpc/architecture/layout)). - Each cycle in theory can be executed in 0.8 cycles on average ([block reciprocal throughput](/hpc/pipelining/tables)). -- The "uOps" here are the micro-operations that CPU splits each instruction into (e. g. fused load-add is composed of two uOps). +- The "uOps" here are the micro-operations that the CPU splits each instruction into (e.g., fused load-add is composed of two uOps). Then it proceeds to give information about each individual instruction: diff --git a/content/english/hpc/profiling/noise.md b/content/english/hpc/profiling/noise.md index c530c160..b1b186ae 100644 --- a/content/english/hpc/profiling/noise.md +++ b/content/english/hpc/profiling/noise.md @@ -1,6 +1,7 @@ --- title: Getting Accurate Results weight: 10 +published: true --- It is not an uncommon for there to be two library algorithm implementations, each maintaining its own benchmarking code, and each claiming to be faster than the other. This confuses everyone involved, especially the users, who have to somehow choose between the two. @@ -11,7 +12,7 @@ Situations like these are usually not caused by fraudulent actions by their auth There are many things that can introduce bias into benchmarks. -**Differing datasets.** There are many algorithms whose performance somehow depends on the dataset distribution. In order to define, for example, what the fastest sorting, shortest path, or binary search algorithms are, you have to fixing the dataset on which the algorithm is run. +**Differing datasets.** There are many algorithms whose performance somehow depends on the dataset distribution. In order to define, for example, what the fastest sorting, shortest path, or binary search algorithms are, you have to fix the dataset on which the algorithm is run. This sometimes applies even to algorithms that process a single piece of input. For example, it is not a good idea to feed GCD implementations sequential numbers because it makes branches very predictable: @@ -87,7 +88,7 @@ for (int i = 0; i < N; i++) checksum ^= lower_bound(checksum ^ q[i]); ``` -It usually makes the most difference in algorithms with possible pipeline stall issues, e. g. 
when comparing branchy and branch-free algorithms. +It usually makes the most difference in algorithms with possible pipeline stall issues, e.g., when comparing branchy and branch-free algorithms. **Cold cache.** Another source of bias is the *cold cache effect*, when memory reads initially take longer time because the required data is not in cache yet. @@ -111,7 +112,7 @@ for (int i = 0; i < N; i++) checksum ^= lower_bound(q[i]); ``` -It is also sometimes convenient to combine the warm-up run with answer validation, it if is more complicated than just computing some sort of checksum. +It is also sometimes convenient to combine the warm-up run with answer validation, if it is more complicated than just computing some sort of checksum. **Over-optimization.** Sometimes the benchmark is outright erroneous because the compiler just optimized the benchmarked code away. To prevent the compiler from cutting corners, you need to add checksums and either print them somewhere or add the `volatile` qualifier, which also prevents any sort of interleaving of loop iterations. @@ -127,10 +128,10 @@ https://github.com/sosy-lab/benchexec The issues we've described produce *bias* in measurements: they consistently give advantage to one algorithm over the other. There are other types of possible problems with benchmarking that result in either unpredictable skews or just completely random noise, thus increasing *variance*. -These type of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling: +These types of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling: - If you benchmark a compute-bound algorithm, measure its performance in cycles using `perf stat`: this way it will be independent of clock frequency, fluctuations of which is usually the main source of noise. -- Otherwise, set core frequency to the what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e. g. `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it. +- Otherwise, set core frequency to what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e.g., `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it. - If applicable, turn hyper-threading off and attach jobs to specific cores. Make sure no other jobs are running on the system, turn off networking and try not to fiddle with the mouse. You can't remove noises and biases completely. Even a program's name can affect its speed: the executable's name ends up in an environment variable, environment variables end up on the call stack, and so the length of the name affects stack alignment, which can result in data accesses slowing down due to crossing cache line or memory page boundaries. 
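Putting several of these recommendations together, a bare-bones benchmarking harness might look something like this. It is only a sketch: it assumes that `lower_bound`, the query array `q`, and `N` are defined as in the snippets above, and for compute-bound code you would measure cycles with `perf stat` instead of wall-clock time:

```c++
#include <chrono>
#include <cstdio>

// A warm-up pass that also brings the data into cache, a checksum that keeps
// the compiler from optimizing the calls away, and timing amortized over N queries.
void run_benchmark() {
    int checksum = 0;

    for (int i = 0; i < N; i++)       // warm-up run (not timed)
        checksum ^= lower_bound(q[i]);

    auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < N; i++)       // measured run
        checksum ^= lower_bound(q[i]);

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();

    printf("%.2f ns per query (checksum: %d)\n", (double) ns / N, checksum);
}
```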
diff --git a/content/english/hpc/profiling/simulation.md b/content/english/hpc/profiling/simulation.md index 2f6c6dc6..75401b8a 100644 --- a/content/english/hpc/profiling/simulation.md +++ b/content/english/hpc/profiling/simulation.md @@ -50,7 +50,7 @@ Mispred rate: 22.0% ( 22.5% + 0.0% ) We've fed Cachegrind exactly the same example code as in [the previous section](../events): we create an array of a million random integers, sort it, and then perform a million binary searches on it. Cachegrind shows roughly the same numbers as perf does, except that that perf's measured numbers of memory reads and branches are slightly inflated due to [speculative execution](/hpc/pipelining): they really happen in hardware and thus increment hardware counters, but are discarded and don't affect actual performance, and thus ignored in the simulation. -Cachegrind only models the first (`D1` for data, `I1` for instructions) and the last (`LL`, unified) levels of cache, the characteristics of which are inferred from the system. It doesn't limit you in any way as you can also set them from the command line, e. g. to model the L2 cache: `--LL=,,`. +Cachegrind only models the first (`D1` for data, `I1` for instructions) and the last (`LL`, unified) levels of cache, the characteristics of which are inferred from the system. It doesn't limit you in any way as you can also set them from the command line, e g., to model the L2 cache: `--LL=,,`. It seems like it only slowed down our program so far and hasn't provided us any information that `perf stat` couldn't. To get more out of it than just the summary info, we can inspect a special file with profiling info, which it dumps by default in the same directory named as `cachegrind.out.`. It is human-readable, but is expected to be read via the `cg_annotate` command: diff --git a/content/english/hpc/simd/_index.md b/content/english/hpc/simd/_index.md index 988e83e8..50f6e3ed 100644 --- a/content/english/hpc/simd/_index.md +++ b/content/english/hpc/simd/_index.md @@ -29,7 +29,7 @@ Now, let's add the following magic directive in the very beginning: When compiled and run in the same environment, it finishes in 1.24 seconds. This is almost twice as fast, and we didn't change a single line of code or the optimization level. -What happened here is we provided a little bit of info about the computer on which this code is supposed to be run. Specifically, we told the compiler that the target CPU supports an extension to the x86 instruction set called "AVX2". AVX2 is one of the many so-called "SIMD extensions" for x86. These extensions include instructions that operate on special registers capable of holding 128, 256, or even 512 bits of data using the "single instruction, multiple data" (SIMD) approach. Instead of working with a single scalar value, SIMD instructions divide the data in registers into blocks of 8, 16, 32, or 64 bits and perform the same operation on them in parallel, yielding a proportional increase in performance[^power]. +What happened here is we provided a little bit of info about the computer on which this code is supposed to be run. Specifically, we told the compiler that the target CPU supports an extension to the x86 instruction set called "AVX2." AVX2 is one of the many so-called "SIMD extensions" for x86. These extensions include instructions that operate on special registers capable of holding 128, 256, or even 512 bits of data using the "single instruction, multiple data" (SIMD) approach. 
Instead of working with a single scalar value, SIMD instructions divide the data in registers into blocks of 8, 16, 32, or 64 bits and perform the same operation on them in parallel, yielding a proportional increase in performance[^power]. [^power]: On some CPUs, especially heavy SIMD instructions consume more energy and thus [require downclocking](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/) to balance off the total power consumption, so the real-time speedup is not always proportional. @@ -43,6 +43,6 @@ In particular, AVX2 has instructions for working with 256-bit registers, while b ![](img/intel-extensions.webp) -Compilers often do a good job rewriting simple loops with SIMD instructions, like in the case above. This optimization is called [auto-vectorization](auto-vectorization), and it is the preferred way to use SIMD. +Compilers often do a good job rewriting simple loops with SIMD instructions, like in the case above. This optimization is called [auto-vectorization](auto-vectorization), and it is the most popular way of using SIMD. The problem is that it only works with certain types of loops, and even then it often yields suboptimal results. To understand its limitations, we need to get our hands dirty and explore this technology on a lower level, which is what we are going to do in this chapter. diff --git a/content/english/hpc/simd/auto-vectorization.md b/content/english/hpc/simd/auto-vectorization.md index 5fc568c3..b7b8a45f 100644 --- a/content/english/hpc/simd/auto-vectorization.md +++ b/content/english/hpc/simd/auto-vectorization.md @@ -1,15 +1,17 @@ --- -title: Auto-Vectorization +title: Auto-Vectorization and SPMD weight: 10 --- -SIMD-parallelism is most often used for *embarrassingly parallel* computations: the kinds where all you do is apply some elementwise function to all elements of an array and write it back somewhere else. In this setting, you don't even need to know how SIMD works: the compiler is perfectly capable of optimizing such loops by itself — you just need to be aware that such optimization exists and that it usually yields a 5-10x speedup. +SIMD parallelism is most often used for *embarrassingly parallel* computations: the kinds where all you do is apply some elementwise function to all elements of an array and write it back somewhere else. In this setting, you don't even need to know how SIMD works: the compiler is perfectly capable of optimizing such loops by itself — you just need to be aware that such optimization exists and that it usually yields a 5-10x speedup. -Doing nothing and relying on auto-vectorization is actually the preferred way of using SIMD. Whenever you can, you should always stick with the scalar code for its simplicity and maintainability. But often even the loops that seem straightforward to vectorize are not optimized because of some technical nuances. [As in many other cases](/hpc/compilation/contracts), the compiler may need some additional input from the programmer as he may know a bit more about the problem than what can be inferred from static analysis. +Doing nothing and relying on auto-vectorization is actually the most popular way of using SIMD. In fact, in many cases, it even advised to stick with the plain scalar code for its simplicity and maintainability. + +But often even the loops that seem straightforward to vectorize are not optimized because of some technical nuances. 
[As in many other cases](/hpc/compilation/contracts), the compiler may need some additional input from the programmer as he may know a bit more about the problem than what can be inferred from static analysis. ### Potential Problems -Consider the "a + b" example: +Consider the "a + b" example we [started with](../intrinsics/#simd-intrinsics): ```c++ void sum(int *a, int *b, int *c, int n) { @@ -47,8 +49,18 @@ for (int i = 0; i < n; i++) To help the compiler eliminate this corner case, we can use the `alignas` specifier on static arrays and the `std::assume_aligned` function to mark pointers aligned. -**Checking if vectorization happened.** In either case, it is useful to check if the compiler vectorized the loop the way you intended. You can either [compiling it to assembly](/hpc/compilation/stages) and look for blocks for instructions that start with a "v" or add the `-fopt-info-vec-optimized` compiler flag so that the compiler indicates where auto-vectorization is happening and what SIMD width is being used. If you swap `optimized` for `missed` or `all`, you may also get some reasoning behind why it is not happening in other places. +**Checking if vectorization happened.** In either case, it is useful to check if the compiler vectorized the loop the way you intended. You can either [compile it to assembly](/hpc/compilation/stages) and look for blocks of instructions that start with a "v" or add the `-fopt-info-vec-optimized` compiler flag so that the compiler indicates where auto-vectorization is happening and what SIMD width is being used. If you swap `optimized` for `missed` or `all`, you may also get some reasoning behind why it is not happening in other places. ---- +There are [many other ways](https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf) of telling the compiler exactly what we mean, but in especially complex cases — e.g., when there are a lot of branches or function calls inside the loop — it is easier to go one level of abstraction down and vectorize manually. + +### SPMD + +There is a neat compromise between auto-vectorization and the manual use of SIMD intrinsics: "single program, multiple data" (SPMD). This is a model of computation in which the programmer writes what appears to be a regular serial program, but that is actually executed in parallel on the hardware. + +The programming experience is largely the same, and there is still the fundamental limitation in that the computation must be data-parallel, but SPMD ensures that the vectorization will happen regardless of the compiler and the target CPU architecture. It also allows for the computation to be automatically parallelized across multiple cores and, in some cases, even offloaded to other types of parallel hardware. + +There is support for SPMD in some modern languages ([Julia](https://docs.julialang.org/en/v1/base/base/#Base.SimdLoop.@simd)), multiprocessing APIs ([OpenMP](https://www.openmp.org/spec-html/5.0/openmpsu42.html)), and specialized compilers (Intel [ISPC](https://ispc.github.io/)), but it has seen the most success in the context of GPU programming where both problems and hardware are massively parallel.
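For a taste of what the SPMD style looks like from C++, here is a hedged sketch using OpenMP's `simd` directive (one of the options listed above; compile with `-fopenmp` or `-fopenmp-simd`). The loop body is ordinary scalar code describing what happens to a single element, and the directive asks the compiler to map it onto SIMD lanes:

```c++
// SPMD-flavored code: the body is written element-by-element,
// and the pragma turns the loop into a data-parallel one.
void saxpy(float a, const float *x, float *y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```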
+ +We will cover this model of computation in much more depth in Part 2 -There are [many other ways](https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf) of telling the compiler what we meant exactly, but in especially complex cases — when inside the loop there are a lot of branches or some functions are called — it is easier to go down to the intrinsics level and write it yourself. + diff --git a/content/english/hpc/simd/intrinsics.md b/content/english/hpc/simd/intrinsics.md index 0b2b8d32..4e9c6804 100644 --- a/content/english/hpc/simd/intrinsics.md +++ b/content/english/hpc/simd/intrinsics.md @@ -95,7 +95,7 @@ for (int i = 0; i < 100; i += 4) { The main challenge of using SIMD is getting the data into contiguous fixed-sized blocks suitable for loading into registers. In the code above, we may in general have a problem if the length of the array is not divisible by the block size. There are two common solutions to this: -1. We can "overshoot" by iterating over the last incomplete segment either way. To make sure we don't segfault by trying to read from or write to a memory region we don't own, we need to pad the arrays to the nearest block size (typically with some "neutral" element, e. g. zero). +1. We can "overshoot" by iterating over the last incomplete segment either way. To make sure we don't segfault by trying to read from or write to a memory region we don't own, we need to pad the arrays to the nearest block size (typically with some "neutral" element, e.g., zero). 2. Make one iteration less and write a little loop in the end that calculates the remainder normally (with scalar operations). Humans prefer #1 because it is simpler and results in less code, and compilers prefer #2 because they don't really have another legal option. @@ -135,13 +135,13 @@ Also, some of the intrinsics don't map to a single instruction but a short seque ### GCC Vector Extensions -If you feel like the design of C intrinsics is terrible, you are not alone. are all generated by cats walking on keyboards. I've spent hundreds of hours writing SIMD code and reading the Intel Intrinsics Guide, and I still can't remember whether I need to type `_mm256` or `__m256`. +If you feel like the design of C intrinsics is terrible, you are not alone. I've spent hundreds of hours writing SIMD code and reading the Intel Intrinsics Guide, and I still can't remember whether I need to type `_mm256` or `__m256`. Intrinsics are not only hard to use but also neither portable nor maintainable. In good software, you don't want to maintain different procedures for each CPU: you want to implement it just once, in an architecture-agnostic way. @@ -156,7 +156,7 @@ typedef int v8si __attribute__ (( vector_size(32) )); Unfortunately, this is not a part of the C or C++ standard, so different compilers use different syntax for that. -There is somewhat of a naming convention, which is to include size and type of elements into the name of the type: in the example above, we defined a "vector of 8 signed integers". But you may choose any name you want, like `vec`, `reg` or whatever. The only thing you don't want to do is to name it `vector` because of how much confusion there would be because of `std::vector`. +There is somewhat of a naming convention, which is to include size and type of elements into the name of the type: in the example above, we defined a "vector of 8 signed integers." But you may choose any name you want, like `vec`, `reg` or whatever. 
The only thing you don't want to do is to name it `vector` because of how much confusion there would be because of `std::vector`. The main advantage of using these types is that for many operations you can use normal C++ operators instead of looking up the relevant intrinsic. @@ -185,4 +185,13 @@ for (int i = 0; i < 100/4; i++) c[i] = a[i] + b[i]; ``` -As you can see, vector extensions are much cleaner compared to the nightmare we have with intrinsic functions. But some things that we may want to do are just not expressible with native C++ constructs, so we will still need intrinsics. Luckily, this is not an exclusive choice, because vector types support zero-cost conversion to the `_mm` types and back. We will, however, try to avoid doing so as much as possible and stick to vector extensions when we can. +As you can see, vector extensions are much cleaner compared to the nightmare we have with intrinsic functions. Their downside is that some things that we may want to do are just not expressible with native C++ constructs, so we will still need intrinsics for them. Luckily, this is not an exclusive choice, because vector types support zero-cost conversion to the `_mm` types and back: + +```c++ +v8f x; +int mask = _mm256_movemask_ps((__m256) x); +``` + +There are also many third-party libraries for different languages that provide a similar capability to write portable SIMD code, also implement some common operations, and just in general are nicer to use than both intrinsics and built-in vector types. Notable examples for C++ are [Highway](https://github.com/google/highway), [Expressive Vector Engine](https://github.com/jfalcou/eve), [Vector Class Library](https://github.com/vectorclass/version2), and [xsimd](https://github.com/xtensor-stack/xsimd). + +Using a well-established SIMD library is recommended as it greatly improves the developer experience. In this book, however, we will try to keep close to the hardware and mostly use intrinsics directly, occasionally switching to the vector extensions for simplicity when we can. diff --git a/content/english/hpc/simd/masking.md index 332597c1..dbe71575 100644 --- a/content/english/hpc/simd/masking.md +++ b/content/english/hpc/simd/masking.md @@ -67,7 +67,7 @@ for (int i = 0; i < N; i += 8) { } ``` -This loop performs slightly faster because on this particular CPU, the vector `and` take one cycle less than `blend`. +This loop performs slightly faster because on this particular CPU, the vector `and` takes one cycle less than `blend`. Several other instructions support masks as inputs, most notably: diff --git a/content/english/hpc/simd/moving.md index e2cf3035..72cbbd33 100644 --- a/content/english/hpc/simd/moving.md +++ b/content/english/hpc/simd/moving.md @@ -1,5 +1,5 @@ --- -title: Loading and Writing Data +title: Moving Data aliases: [/hpc/simd/vectorization] weight: 2 --- @@ -13,7 +13,7 @@ While using the elementwise instructions is easy, the largest challenge with SIMD ### Aligned Loads and Stores -Operations of reading and writing the contents of a SIMD register into memory have two versions each: `load` / `loadu` and `store` / `storeu`. The letter "u" here stands for "unaligned".
The difference is that the former ones only work correctly when the read / written block fits inside a single [cache line](/hpc/cpu-cache/cache-lines) (and crash otherwise), while the latter work either way, but with a slight performance penalty if the block crosses a cache line. +Operations of reading and writing the contents of a SIMD register into memory have two versions each: `load` / `loadu` and `store` / `storeu`. The letter "u" here stands for "unaligned." The difference is that the former ones only work correctly when the read / written block fits inside a single [cache line](/hpc/cpu-cache/cache-lines) (and crash otherwise), while the latter work either way, but with a slight performance penalty if the block crosses a cache line. Sometimes, especially when the "inner" operation is very lightweight, the performance difference becomes significant (at least because you need to fetch two cache lines instead of one). As an extreme example, this way of adding two arrays together: @@ -39,7 +39,7 @@ for (int i = 0; i < n; i += 8) { In the first version, assuming that arrays `a`, `b` and `c` are all 64-byte *aligned* (the addresses of their first elements are divisible by 64, and so they start at the beginning of a cache line), roughly half of reads and writes will be "bad" because they cross a cache line boundary. -Note that the performance difference is caused by the cache system and not by the instructions themselves. On most modern architectures, the `loadu` / `storeu` intrinsics should be equally as fast as `load` / `store` given that in both cases the blocks only span one cache line. The advantage of the latter is that they can act as free run-time assertions that all reads and writes are aligned. +Note that the performance difference is caused by the cache system and not by the instructions themselves. On most modern architectures, the `loadu` / `storeu` intrinsics should be equally as fast as `load` / `store` given that in both cases the blocks only span one cache line. The advantage of the latter is that they can act as free run time assertions that all reads and writes are aligned. This makes it important to properly [align](/hpc/cpu-cache/alignment) arrays and other data on allocation, and it is also one of the reasons why compilers can't always [auto-vectorize](../auto-vectorization) efficiently. For most purposes, we only need to guarantee that any 32-byte SIMD block will not cross a cache line boundary, and we can specify this alignment with the `alignas` specifier: diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md index 28fb4d9c..89678103 100644 --- a/content/english/hpc/simd/reduction.md +++ b/content/english/hpc/simd/reduction.md @@ -1,9 +1,9 @@ --- -title: Sums and Other Reductions +title: Reductions weight: 3 --- -*Reduction* (also known as *folding* in functional programming) is the action of computing the value of some associative and commutative operation (i.e. $(a \circ b) \circ c = a \circ (b \circ c)$ and $a \circ b = b \circ a$) over a range of arbitrary elements. +*Reduction* (also known as *folding* in functional programming) is the action of computing the value of some associative and commutative operation (i.e., $(a \circ b) \circ c = a \circ (b \circ c)$ and $a \circ b = b \circ a$) over a range of arbitrary elements. 
The simplest example of reduction is calculating the sum of an array: @@ -46,58 +46,64 @@ int sum_simd(v8si *a, int n) { } ``` -You can use this approach for for other reductions, such as for finding the minimum or the xor-sum of an array. - -### Horizontal Summation - -The last part, where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum, is called "horizontal summation". - -Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a [special instruction](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&text=_mm256_hadd_epi32&expand=2941) that adds together pairs of adjacent elements in a register. - -![Horizontal summation in SSE/AVX. Note how the output is stored: the (a b a b) interleaving is common for reducing operations](../img/hsum.png) - -Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway: - -```c++ -int hsum(__m256i x) { - __m128i l = _mm256_extracti128_si256(x, 0); - __m128i h = _mm256_extracti128_si256(x, 1); - l = _mm_add_epi32(l, h); - l = _mm_hadd_epi32(l, l); - return _mm_extract_epi32(l, 0) + _mm_extract_epi32(l, 1); -} -``` - -There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e. g. for integer multiplication or calculating absolute differences between adjacent elements (used in image processing). - -There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps. +You can use this approach for other reductions, such as for finding the minimum or the xor-sum of an array. ### Instruction-Level Parallelism -Our implementation matches what the compiler produces automatically, but it is actually [suboptimal](/hpc/pipelining/throughput): when we use just one accumulator, we have to wait one cycle between the loop iterations for vector addition to complete, while its throughput is 2 on this microarchitecture. +Our implementation matches what the compiler produces automatically, but it is actually suboptimal: when we use just one accumulator, [we have to wait](/hpc/pipelining/throughput) one cycle between the loop iterations for a vector addition to complete, while the [throughput](/hpc/pipelining/tables/) of the corresponding instruction is 2 on this microarchitecture. If we again divide the array in $B \geq 2$ parts and use a *separate* accumulator for each, we can saturate the throughput of vector addition and increase the performance twofold: ```c++ -const int B = 2; +const int B = 2; // how many vector accumulators to use int sum_simd(v8si *a, int n) { v8si b[B] = {0}; - for (int i = 0; i < n / 8; i += B) + for (int i = 0; i + (B - 1) < n / 8; i += B) for (int j = 0; j < B; j++) b[j] += a[i + j]; - + + // sum all vector accumulators into one for (int i = 1; i < B; i++) b[0] += b[i]; int s = 0; + // sum 8 scalar accumulators into one for (int i = 0; i < 8; i++) s += b[0][i]; + // add the remainder of a + for (int i = n / (8 * B) * (8 * B); i < n; i++) + s += ((int*) a)[i]; + + return s; } ``` -If you have more than 2 relevant execution ports, you can increase `B` accordingly.
But the n-fold performance increase will only apply to arrays that fit L1 cache — [memory bandwidth](/hpc/cpu-cache/bandwidth) will be the bottleneck for anything larger. +If you have more than 2 relevant execution ports, you can increase the `B` constant accordingly, but the $n$-fold performance increase will only apply to arrays that fit into L1 cache — [memory bandwidth](/hpc/cpu-cache/bandwidth) will be the bottleneck for anything larger. + +### Horizontal Summation + +The part where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum is called "horizontal summation." + +Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a [special instruction](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&text=_mm256_hadd_epi32&expand=2941) that adds together pairs of adjacent elements in a register. + +![Horizontal summation in SSE/AVX. Note how the output is stored: the (a b a b) interleaving is common for reducing operations](../img/hsum.png) + +Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway: + +```c++ +int hsum(__m256i x) { + __m128i l = _mm256_extracti128_si256(x, 0); + __m128i h = _mm256_extracti128_si256(x, 1); + l = _mm_add_epi32(l, h); + l = _mm_hadd_epi32(l, l); + return _mm_extract_epi32(l, 0) + _mm_extract_epi32(l, 1); +} +``` + +There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing). + +There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps. diff --git a/content/english/hpc/simd/shuffling.md b/content/english/hpc/simd/shuffling.md index f2a2cd15..6ff3b749 100644 --- a/content/english/hpc/simd/shuffling.md +++ b/content/english/hpc/simd/shuffling.md @@ -175,7 +175,7 @@ The general idea of our algorithm is as follows: - use this mask to index a lookup table that returns a permutation moving the elements that satisfy the predicate to the beginning of the vector (in their original order); - use the `_mm256_permutevar8x32_epi32` intrinsic to permute the values; - write the whole permuted vector to the buffer — it may have some trailing garbage, but its prefix is correct; -- calculate the population count of the scalar mask and move the buffer pointer by that amount. +- calculate the population count of the scalar mask and move the buffer pointer by that number. First, we need to precompute the permutations: @@ -225,7 +225,9 @@ The vectorized version takes some work to implement, but it is 6-7x faster than ![](../img/filter.svg) -This operation is considerably faster on AVX-512: it has a special "[compress](_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines. 
+The loop performance is still relatively low — taking 4 CPU cycles per iteration — because, on this particular CPU (Zen 2), `movemask`, `permute`, and `store` have low throughput and all have to go through the same execution port (P2). On most other x86 CPUs, you can expect it to be ~2x faster. + +Filtering can also be implemented considerably faster on AVX-512: it has a special "[compress](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=7395,7392,7269,4868,7269,7269,1820,1835,6385,5051,4909,4918,5051,7269,6423,7410,150,2138,1829,1944,3009,1029,7077,519,5183,4462,4490,1944,1395&text=_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines, such as quicksort. +- Clock speed is volatile, so counting cycles is more useful for analytical purposes + +---- + +![](https://external-preview.redd.it/6PIp0RLbdWFGFUOT6tFuufpMlplgWdnXWOmjuqkpMMU.jpg?auto=webp&s=9bed495f3dbb994d7cdda33cc114aba1cebd30e2 =400x) + +http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ + +---- + +### Asymptotic complexity + +![](https://en.algorithmica.org/hpc/complexity/img/complexity.jpg =400x) + +For sufficiently large $n$, we only care about asymptotic complexity: $O(n) = O(1000 \cdot n)$ + +$\implies$ The costs of basic ops don't matter since they don't affect complexity + +But can we handle "sufficiently large" $n$? + +--- + +When complexity theory was developed, computers were different + +![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Eniac.jpg/640px-Eniac.jpg =500x) + +Bulky, costly, and fundamentally slow (due to speed of light) + +---- + +![](https://researchresearch-news-wordpress-media-live.s3.eu-west-1.amazonaws.com/2022/02/microchip_fingertip-738x443.jpg =500x) + +Micro-scale circuits allow signals to propagate faster + +---- + + + +
+ +
+ +![](https://en.algorithmica.org/hpc/complexity/img/lithography.png =450x) + +
+ +
+ +Microchips are "printed" on a slice of silicon using a process called [photolithography](https://en.wikipedia.org/wiki/Photolithography): + +1. grow and slice a [very pure silicon crystal](https://en.wikipedia.org/wiki/Wafer_(electronics)) +2. cover it with a layer of [photoresist](https://en.wikipedia.org/wiki/Photoresist) +3. hit it with photons in a set pattern +4. chemically [etch](https://en.wikipedia.org/wiki/Etching_(microfabrication)) the exposed parts +5. remove the remaining photoresist + +(…plus another 40-50 steps over several months to complete the rest of the CPU) + +
+ +
+ +---- + +The development of microchips and photolithography enabled: + +- higher clock rates +- the ability to scale the production +- **much** lower material and power usage (= lower cost) + +---- + +![](https://upload.wikimedia.org/wikipedia/commons/4/49/MOS_6502AD_4585_top.jpg =500x) + +MOS Technology 6502 (1975), Atari 2600 (1977), Apple II (1977), Commodore 64 (1982) + +---- + +Also a clear path to improvement: just make lenses stronger and chips smaller + +**Moore’s law:** transistor count doubles every two years. + +---- + +**Dennard scaling:** reducing die dimensions by 30% + +- doubles the transistor density ($0.7^2 \approx 0.5$) +- increases the clock speed by 40% ($\frac{1}{0.7} \approx 1.4$) +- leaves the overall *power density* the same + (we have a mechanical limit on how much heat can be dissipated) + +$\implies$ Each new "generation" should have roughly the same total cost, but 40% higher clock and twice as many transistors + +(which can be used, e.g., to add new instructions or increase the word size) + +---- + +Around 2005, Dennard scaling stopped — due to *leakage* issues: + +- transistors became very small +- $\implies$ their magnetic fields started to interfere with the neighboring circuitry +- $\implies$ unnecessary heating and occasional bit flipping +- $\implies$ have to increase voltage to fix it +- $\implies$ have to reduce clock frequency to balance off power consumption + +---- + +![](https://en.algorithmica.org/hpc/complexity/img/dennard.ppm =600x) + +A limit on the clock speed + +--- + +Clock rates have plateaued, but we still have more transistors to use: + +- **Pipelining:** overlapping the execution of sequential instructions to keep different parts of the CPU busy +- **Out-of-order execution:** no waiting for the previous instructions to complete +- **Superscalar processing:** adding duplicates of execution units +- **Caching:** adding layers of faster memory on the chip to speed up RAM access +- **SIMD:** adding instructions that handle a block of 128, 256, or 512 bits of data +- **Parallel computing:** adding multiple identical cores on a chip +- **Distributed computing:** multiple chips in a motherboard or multiple computers +- **FPGAs** and **ASICs:** using custom hardware to solve a specific problem + +---- + +![](https://en.algorithmica.org/hpc/complexity/img/die-shot.jpg =500x) + +For modern computers, the “let’s count all operations” approach for predicting algorithm performance is off by several orders of magnitude + +--- + +### Matrix multiplication + +```python +import random + +n = 1024 + +a = [[random.random() + for row in range(n)] + for col in range(n)] + +b = [[random.random() + for row in range(n)] + for col in range(n)] + +c = [[0 + for row in range(n)] + for col in range(n)] + +for i in range(n): + for j in range(n): + for k in range(n): + c[i][j] += a[i][k] * b[k][j] +``` + +630 seconds or 10.5 minutes to multiply two $1024 \times 1024$ matrices in plain Python + +~880 cycles per multiplication + +---- + +```java +import java.util.Random; + +public class Matmul { + static int n = 1024; + static double[][] a = new double[n][n]; + static double[][] b = new double[n][n]; + static double[][] c = new double[n][n]; + + public static void main(String[] args) { + Random rand = new Random(); + + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + a[i][j] = rand.nextDouble(); + b[i][j] = rand.nextDouble(); + c[i][j] = 0; + } + } + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i][j] += a[i][k] * b[k][j]; + } +} +``` + +Java needs
10 seconds, 63 times faster + +~13 cycles per multiplication + +---- + +```c +#define n 1024 +double a[n][n], b[n][n], c[n][n]; + +int main() { + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + a[i][j] = (double) rand() / RAND_MAX; + b[i][j] = (double) rand() / RAND_MAX; + } + } + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i][j] += a[i][k] * b[k][j]; + + return 0; +} +``` + +`GCC -O3` needs 9 seconds, but if we include `-march=native` and `-ffast-math`, the compiler vectorizes the code, and it drops down to 0.6s. + +---- + +```python +import time +import numpy as np + +n = 1024 + +a = np.random.rand(n, n) +b = np.random.rand(n, n) + +start = time.time() + +c = np.dot(a, b) + +duration = time.time() - start +print(duration) +``` + +BLAS needs ~0.12 seconds +(~5x over auto-vectorized C and ~5250x over plain Python) diff --git a/content/english/hpc/slides/_index.md b/content/english/hpc/slides/_index.md new file mode 100644 index 00000000..794e67a6 --- /dev/null +++ b/content/english/hpc/slides/_index.md @@ -0,0 +1,10 @@ +--- +title: Slides +ignoreIndexing: true +weight: 1000 +draft: true +--- + +This is an attempt to make a university course out of the book. + +Work in progress. diff --git a/content/english/hpc/stats.md b/content/english/hpc/stats.md index 2961f4d5..15d81e39 100644 --- a/content/english/hpc/stats.md +++ b/content/english/hpc/stats.md @@ -18,7 +18,7 @@ A **random variable** is any variable whose value depends on an outcome of a ran 2. $\forall x \in X, 0 \leq P \leq 1$. 3. $\sum_{x \in X} P(x) = 1$. -For example, consider a random variable $X$ with $k$ discrete states (e. g. the result of a die toss). We can place a *uniform distribution* on $X$ — that is, make each of its states equally likely — by setting its probability distribution to: +For example, consider a random variable $X$ with $k$ discrete states (e.g., the result of a die toss). We can place a *uniform distribution* on $X$ — that is, make each of its states equally likely — by setting its probability distribution to: $$ P(x=x_i) = \frac{1}{k} @@ -121,7 +121,7 @@ The last transition is true because it is a sum of harmonic series. ### Order Statistics -There is a slight modification of quicksort called quickselect that allows finding the $k$-th smallest element in $O(n)$ time, which is useful when we need to quickly compute order statistics, e. g. medians or 75-th quantiles. +There is a slight modification of quicksort called quickselect that allows finding the $k$-th smallest element in $O(n)$ time, which is useful when we need to quickly compute order statistics; e.g., medians or 75-th quantiles. 1. Select a random element $p$ from the array. 2. Partition the array into two arrays $L$ and $R$ using the predicate $a_i > p$. @@ -193,7 +193,7 @@ f(n, m) &= 1 \times (1-\frac{1}{m}) \times (1-\frac{2}{m}) \times ... \times (1- \end{aligned} $$ -This product shrinks pretty quickly with $n$, but it is not clear what value of $m$ is needed to be "safe". Turns out, if $n = O(\sqrt m)$, the probability of collision tends to zero, and anything asymptotically larger guarantees a collision. One can show this with calculus, but we will choose the probability theory way. +This product shrinks pretty quickly with $n$, but it is not clear what value of $m$ is needed to be "safe." Turns out, if $n = O(\sqrt m)$, the probability of collision tends to zero, and anything asymptotically larger guarantees a collision. 
One can show this with calculus, but we will choose the probability theory way. Let's go back to the idea of counting pairs of birthdays and introduce $\frac{n \cdot (n-1)}{2}$ indicators $I_{ij}$ — one for each pair $(i, j)$ of persons — each being equal to $1$ if the birthdays match. The probability and expectation of each indicator is $\frac{1}{m}$. diff --git a/content/russian/cs/algebra/binpow.md b/content/russian/cs/algebra/binpow.md index 5c7d2d43..4126061d 100644 --- a/content/russian/cs/algebra/binpow.md +++ b/content/russian/cs/algebra/binpow.md @@ -6,7 +6,7 @@ authors: weight: -10 --- -*Бинарное возведение в степень* — приём, позволяющий возводить любое число в $n$-ую степень за $O(\log n)$ умножений (вместо n умножений при обычном подходе). +*Бинарное возведение в степень* — приём, позволяющий возводить любое число в $n$-ую степень за $O(\log n)$ умножений (вместо $n$ умножений при обычном подходе). ## Основная идея diff --git a/content/russian/cs/algebra/matmul.md b/content/russian/cs/algebra/matmul.md index bc5ca593..8a633bea 100644 --- a/content/russian/cs/algebra/matmul.md +++ b/content/russian/cs/algebra/matmul.md @@ -188,7 +188,7 @@ matrix binpow(matrix a, int p) { Эту технику можно применить и к другим динамикам, где нужно посчитать количество способов что-то сделать — иногда очень неочевидными способами. -Например, можно решить такую задачу: найти количество строк длины $k \approx 10^{18}$, не содержащих данные маленькие запрещённые подстроки. Для этого нужно построить граф «легальных» переходов в [Ахо-Корасике](/cs/automata/aho-corasick), возвести его матрицу смежности в $k$-тую степень и просуммировать в нём первую строчку. +Например, можно решить такую задачу: найти количество строк длины $k \approx 10^{18}$, не содержащих данные маленькие запрещённые подстроки. Для этого нужно построить граф «легальных» переходов в [Ахо-Корасике](/cs/string-structures/aho-corasick), возвести его матрицу смежности в $k$-тую степень и просуммировать в нём первую строчку. В некоторых изощрённых случаях в матричном умножении вместо умножения и сложения нужно использовать другие операции, которые ведут себя как умножение и сложение. Пример задачи: «найти путь от $s$ до $t$ с минимальным весом ребра, использующий ровно $k$ переходов»; здесь нужно возводить в $(k-1)$-ую степень матрицу весов графа, и вместо и сложения, и умножения использовать минимум из двух весов. diff --git a/content/russian/cs/basic-structures/iterators.md b/content/russian/cs/basic-structures/iterators.md index b2d8269f..c048e0b6 100644 --- a/content/russian/cs/basic-structures/iterators.md +++ b/content/russian/cs/basic-structures/iterators.md @@ -71,7 +71,7 @@ for (int x : c) ### Алгоритмы из STL -Например, итераторы `std::vector` относятся к `random_access_iterator`, и если вызвать функцию `lower_bound` из стандартной библиотеки, то она произведет [бинарный поиск](../../ordered-search/binary-search) по элементам (предполагая, что они отсортированы в порядке неубывания): +Например, итераторы `std::vector` относятся к `random_access_iterator`, и если вызвать функцию `lower_bound` из стандартной библиотеки, то она произведет [бинарный поиск](/cs/interactive/binary-search/) по элементам (предполагая, что они отсортированы в порядке неубывания): ```cpp vector a = {1, 2, 3, 5, 8, 13}; @@ -93,4 +93,4 @@ array a = {4, 2, 1, 3}; cout << *min_element(a.begin(), a.end()) << endl; ``` -Подробнее про разные полезные алгоритмы STL можно прочитать в [ликбезе по C++](../../programming/cpp). 
+ diff --git a/content/russian/cs/decomposition/scanline.md b/content/russian/cs/decomposition/scanline.md index 6ea7e2e7..3bc99afd 100644 --- a/content/russian/cs/decomposition/scanline.md +++ b/content/russian/cs/decomposition/scanline.md @@ -1,14 +1,15 @@ --- title: Сканирующая прямая authors: -- Сергей Слотин + - Сергей Слотин prerequisites: -- /cs/range-queries -- /cs/segment-tree + - /cs/range-queries + - /cs/segment-tree weight: 1 +published: true --- -Метод сканирующей прямой (англ. *scanline*) заключается в сортировке точек или каких-то абстрактных *событий* (англ. *event*) и последующему проходу по ним. +Метод сканирующей прямой (англ. *scanline*) заключается в сортировке точек на координатной прямой либо каких-то абстрактных «событий» по какому-то признаку и последующему проходу по ним. Он часто используется для решения задач на структуры данных, когда все запросы известны заранее, а также в геометрии для нахождения объединений фигур. @@ -22,7 +23,7 @@ weight: 1 Это решение можно улучшить. Отсортируем интересные точки по возрастанию координаты и пройдем по ним слева направо, поддерживая количество отрезков `cnt`, которые покрывают данную точку. Если в данной точке начинается отрезок, то надо увеличить `cnt` на единицу, а если заканчивается, то уменьшить. После этого пробуем обновить ответ на задачу текущим значением `cnt`. -Как такое писать: нужно представить интересные точки в виде структур с полями «координата» и «тип» (начало / конец) и отсортировать со своим компаратором. Удобно начало отрезка обозначать +1, а конец -1, чтобы просто прибавлять к `cnt` это значение и на разбирать случае. +Как такое писать: нужно представить интересные точки в виде структур с полями «координата» и «тип» (начало / конец) и отсортировать со своим компаратором. Удобно начало отрезка обозначать +1, а конец -1, чтобы просто прибавлять к `cnt` это значение и не разбивать на случаи. Единственный нюанс — если координаты двух точек совпали, чтобы получить правильный ответ, сначала надо рассмотреть все начала отрезков, а только потом концы (чтобы при обновлении ответа в этой координате учлись и правые, и левые граничные отрезки). @@ -62,15 +63,15 @@ int scanline(vector> segments) { **Задача.** Дан набор из $n$ отрезков на прямой, заданных координатами начал и концов $[l_i, r_i]$. Требуется найти суммарную длину их объединения. -Как и в прошлой задаче, отсортируем интересные точки и при проходе будем поддерживать число отрезков, покрывающих данную точку. Если оно больше 0, то отрезок который мы прошли с прошлой рассмотренной точки принадлежит объединению, и его длину нужно прибавить к ответу: +Как и в прошлой задаче, отсортируем все интересные точки и при проходе будем поддерживать число отрезков, покрывающих текущую точку. 
Если оно больше 0, то отрезок, который мы прошли с прошлой рассмотренной точки, принадлежит объединению, и его длину нужно прибавить к ответу: ```cpp int cnt = 0, res = 0, prev = -inf; for (event e : events) { - cnt += e.type; if (prev != -inf && cnt > 0) - res += prev - e.x; + res += e.x - prev; // весь отрезок [prev, e.x] покрыт cnt отрезками + cnt += e.type; prev = e.x; } ``` diff --git a/content/russian/cs/factorization/eratosthenes.md b/content/russian/cs/factorization/eratosthenes.md index 02e72c0e..acf47749 100644 --- a/content/russian/cs/factorization/eratosthenes.md +++ b/content/russian/cs/factorization/eratosthenes.md @@ -12,10 +12,10 @@ published: true Основная идея соответствует названию алгоритма: запишем ряд чисел $1, 2,\ldots, n$, а затем будем вычеркивать -* сначала числа, делящиеся на $2$, кроме самого числа $2$, -* потом числа, делящиеся на $3$, кроме самого числа $3$, -* с числами, делящимися на $4$, ничего делать не будем — мы их уже вычёркивали, -* потом продолжим вычеркивать числа, делящиеся на $5$, кроме самого числа $5$, +- сначала числа, делящиеся на $2$, кроме самого числа $2$, +- потом числа, делящиеся на $3$, кроме самого числа $3$, +- с числами, делящимися на $4$, ничего делать не будем — мы их уже вычёркивали, +- потом продолжим вычеркивать числа, делящиеся на $5$, кроме самого числа $5$, …и так далее. @@ -23,10 +23,10 @@ published: true ```c++ vector sieve(int n) { - vector is_prime(n+1, true); + vector is_prime(n + 1, true); for (int i = 2; i <= n; i++) if (is_prime[i]) - for (int j = 2*i; j <= n; j += i) + for (int j = 2 * i; j <= n; j += i) is_prime[j] = false; return is_prime; } @@ -49,7 +49,6 @@ $$ У исходного алгоритма асимптотика должна быть ещё лучше. Чтобы найти её точнее, нам понадобятся два факта про простые числа: 1. Простых чисел от $1$ до $n$ примерно $\frac{n}{\ln n}$ . - 2. Простые числа распределены без больших «разрывов» и «скоплений», то есть $k$-тое простое число примерно равно $k \ln k$. Мы можем упрощённо считать, что число $k$ является простым с «вероятностью» $\frac{1}{\ln n}$. Тогда, время работы алгоритма можно более точнее оценить как @@ -65,11 +64,11 @@ $$ ## Линейное решето -Основная проблема решета Эратосфена состоит в том, что некоторые числа мы будем помечать как составные несколько раз — а именно столько раз, сколько у них различных простых делителей. Чтобы достичь линейного времени работы, нам нужно придумать способ, как рассматривать все составные числа ровно один раз. +Основная проблема решета Эратосфена состоит в том, что некоторые числа мы будем помечать как составные несколько раз — столько, сколько у них различных простых делителей. Чтобы достичь линейного времени работы, нам нужно придумать способ, как рассматривать все составные числа ровно один раз. Обозначим за $d(k)$ минимальный простой делитель числа $k$ и заметим следующий факт: у составного числа $k$ есть единственное представление $k = d(k) \cdot r$, и при этом у числа $r$ нет простых делителей меньше $d(k)$. -Идея оптимизации состоит в том, чтобы перебирать этот $r$, и для каждого перебирать только нужные множители — а именно все от $2$ до $d(r)$ включительно. +Идея оптимизации состоит в том, чтобы перебирать этот $r$, и для каждого перебирать только нужные множители — а именно, все от $2$ до $d(r)$ включительно. 
### Алгоритм diff --git a/content/russian/cs/geometry-basic/polygons.md b/content/russian/cs/geometry-basic/polygons.md index 7537e591..e0a3c5e7 100644 --- a/content/russian/cs/geometry-basic/polygons.md +++ b/content/russian/cs/geometry-basic/polygons.md @@ -80,7 +80,7 @@ $$ В более общем случае есть два популярных подхода, оба за $O(n)$. -Первый заключается в подсчете углов. Пройдемся по всем вершинам в порядке обхода и будем последовательно рассматривать углы с вершиной в точке $P$ и лучами, проходящими через соседние вершины многоугольника. Если просуммировать эти ориентированные углы, то получится какая-то величина $\theta$. Если точка $P$ лежит внутри многоугольника, то $\theta = \pm 2 \theta$, иначе $\theta = 0$. +Первый заключается в подсчете углов. Пройдемся по всем вершинам в порядке обхода и будем последовательно рассматривать углы с вершиной в точке $P$ и лучами, проходящими через соседние вершины многоугольника. Если просуммировать эти ориентированные углы, то получится какая-то величина $\theta$. Если точка $P$ лежит внутри многоугольника, то $\theta = \pm 2 \pi$, иначе $\theta = 0$. Второй заключается в подсчете, сколько раз луч, выпущенный из $P$, пересекает ребра многоугольника. diff --git a/content/russian/cs/geometry-basic/products.md b/content/russian/cs/geometry-basic/products.md index a4e1a3d5..488dbca6 100644 --- a/content/russian/cs/geometry-basic/products.md +++ b/content/russian/cs/geometry-basic/products.md @@ -1,6 +1,7 @@ --- title: Скалярное и векторное произведение weight: 2 +published: true --- Помимо очевидных сложения, вычитания и умножения на константу, у векторов можно ввести и свои особенные операции, которые нам упростят жизнь. @@ -40,9 +41,9 @@ $$ a \times b = |a| \cdot |b| \cdot \sin \theta = x_a y_b - y_a x_b $$ -Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$. +Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$. -Геометрически, это ориентированный объем параллелограмма, натянутого на вектора $a$ и $b$: +Геометрически, это ориентированная площадь параллелограмма, натянутого на вектора $a$ и $b$: ![](../img/cross.jpg) @@ -65,7 +66,7 @@ int operator^(r a, r b) { return a.x*b.y - b.x*a.y; } Скалярное и векторное произведения тесно связаны с углами между векторами и могут использоваться для подсчета величин вроде ориентированных углов и площадей, которые обычно используются для разных проверок. -Когда они уже реализованы, использовать произведения гораздо проще, чем опираться на алгебру. Например, можно легко угол между двумя векторами, подставив в знакомый нам `atan2` векторное и скалярное произведение: +Когда они уже реализованы, использовать произведения гораздо проще, чем опираться на алгебру. 
Например, можно легко вычислить угол между двумя векторами, подставив в знакомый нам `atan2` векторное и скалярное произведение: ```c++ double angle(r a, r b) { diff --git a/content/russian/cs/geometry-basic/vectors.md b/content/russian/cs/geometry-basic/vectors.md index 05051396..ee1a052a 100644 --- a/content/russian/cs/geometry-basic/vectors.md +++ b/content/russian/cs/geometry-basic/vectors.md @@ -1,6 +1,7 @@ --- -title: Точки и векторы +title: Точки и вектора weight: 1 +published: true --- Отрезок, для которого указано, какой из его концов считается началом, а какой концом, называется *вектором*. Вектор на плоскости можно задать двумя числами — его координатами по горизонтали и вертикали. diff --git a/content/russian/cs/graph-traversals/connectivity.md b/content/russian/cs/graph-traversals/connectivity.md index 45ceec28..17628308 100644 --- a/content/russian/cs/graph-traversals/connectivity.md +++ b/content/russian/cs/graph-traversals/connectivity.md @@ -31,7 +31,7 @@ void dfs(int v, int num) { int num = 0; for (int v = 0; v < n; v++) if (!component[v]) - dfs(v, num++); + dfs(v, ++num); ``` После этого переменная `num` будет хранить число компонент связности, а массив `component` — номер компоненты для каждой вершины, который, например, можно использовать, чтобы быстро проверять, существует ли путь между заданной парой вершин. diff --git a/content/russian/cs/graph-traversals/cycle.md b/content/russian/cs/graph-traversals/cycle.md index 5347e9cd..7a274da1 100644 --- a/content/russian/cs/graph-traversals/cycle.md +++ b/content/russian/cs/graph-traversals/cycle.md @@ -60,6 +60,7 @@ int dfs(int v, int p = -1) { } } } + return -1; } ``` diff --git a/content/russian/cs/interactive/answer-search.md b/content/russian/cs/interactive/answer-search.md index 28e4b4bc..0b38ce24 100644 --- a/content/russian/cs/interactive/answer-search.md +++ b/content/russian/cs/interactive/answer-search.md @@ -66,7 +66,7 @@ int solve() { Здесь, в отличие от предыдущей задачи, кажется, существует прямое решение с формулой. Но вместо того, чтобы о нем думать, можно просто свести задачу к обратной. Давайте подумаем, как по числу минут $t$ (ответу) понять, сколько листов напечатается за это время? Очень легко: $$ -\lfloor\frac{t}{x}\rfloor + \lfloor\frac{t}{y}\rfloor +\left \lfloor \frac{t}{x} \right \rfloor + \left \lfloor \frac{t}{y} \right \rfloor $$ -Ясно, что за $0$ минут $n$ листов распечатать нельзя, а за $xn$ минут один только первый принтер успеет напечатать $n$ листов. Поэтому $0$ и $xn$ — это подходящие изначальные границы для бинарного поиска. +Ясно, что за $0$ минут $n$ листов распечатать нельзя, а за $x \cdot n$ минут один только первый принтер успеет напечатать $n$ листов. Поэтому $0$ и $xn$ — это подходящие изначальные границы для бинарного поиска. diff --git a/content/russian/cs/layer-optimizations/_index.md b/content/russian/cs/layer-optimizations/_index.md index 492473b5..2456aa4c 100644 --- a/content/russian/cs/layer-optimizations/_index.md +++ b/content/russian/cs/layer-optimizations/_index.md @@ -10,10 +10,7 @@ date: 2021-08-29 **Задача.** Даны $n$ точек на прямой, отсортированные по своей координате $x_i$. Нужно найти $m$ отрезков, покрывающих все точки, минимизировав при этом сумму квадратов их длин. -**Базовое решение** — это следующая динамика: - -- $f[i, j]$ = минимальная стоимость покрытия $i$ первых точек, используя не более $j$ отрезков. 
-- Переход — перебор всех возможных последних отрезков, то есть +**Базовое решение** — определить состояние динамики $f[i, j]$ как минимальную стоимость покрытия $i$ первых точек используя не более $j$ отрезков. Пересчитывать её можно перебором всех возможных последних отрезков: $$ f[i, j] = \min_{k < i} \{f[k, j-1] + (x_{i-1}-x_k)^2 \} @@ -30,7 +27,7 @@ int cost(int i, int j) { } for (int i = 0; i <= m; i++) - f[0][k] = 0; // если нам не нужно ничего покрывать, то всё и так хорошо + f[0][i] = 0; // если нам не нужно ничего покрывать, то всё и так хорошо // все остальные f предполагаем равными бесконечности for (int i = 1; i <= n; i++) diff --git a/content/russian/cs/layer-optimizations/divide-and-conquer.md b/content/russian/cs/layer-optimizations/divide-and-conquer.md index a7731f49..c5e218db 100644 --- a/content/russian/cs/layer-optimizations/divide-and-conquer.md +++ b/content/russian/cs/layer-optimizations/divide-and-conquer.md @@ -8,44 +8,43 @@ published: true *Эта статья — одна из [серии](../). Рекомендуется сначала прочитать все предыдущие.* -Посмотрим на формулу пересчета динамики для базового решения: +Посмотрим на формулу пересчета динамики из базового решения: $$ f[i, j] = \min_{k < i} \{f[k, j-1] + (x_{i-1}-x_k)^2 \} $$ -Обозначим за $opt[i, j]$ оптимальный $k$ для данного состояния — то есть от выражения выше. Для однозначности, если оптимальный индекс не один, то выберем среди них самый правый. +Обозначим за $opt[i, j]$ оптимальный $k$ для данного состояния — то есть аргминимум от выражения выше. Для однозначности, если оптимальный индекс не один, то выберем среди них самый правый. -Конкретно в задаче покрытия точек отрезками, можно заметить следующее: +Конкретно в задаче покрытия точек отрезками можно заметить следующее: $$ -opt[i, j] \leq opt[i, j+1] +opt[i + 1, j] \leq opt[i, j] $$ -Интуиция такая: если у нас появился дополнительный отрезок, то последний отрезок нам не выгодно делать больше, а скорее наоборот его нужно «сжать». +Интуация такая: если нам нужно покрыть больший префикс точек, то начало последнего отрезка точно не будет раньше. -### Идея +### Алгоритм -Пусть мы уже знаем $opt[i, l]$ и $opt[i, r]$ и хотим посчитать $opt[i, j]$ для какого-то $j$ между $l$ и $r$. Тогда, воспользовавшись неравенством выше, мы можем сузить отрезок поиска оптимального индекса для $j$ со всего отрезка $[0, i-1]$ до $[opt[i, l], opt[i, r]]$. +Пусть мы уже знаем $opt[l, k]$ и $opt[r, k]$ и хотим посчитать $opt[i, k]$ для какого-то $i$ между $l$ и $r$. Тогда, воспользовавшись неравенством выше, мы можем сузить отрезок поиска оптимального индекса для $i$ со всего отрезка $[0, i - 1]$ до $[opt[l, k], opt[r, k]]$. -Будем делать следующее: заведем рекурсивную функцию, которая считает динамики для отрезка $[l, r]$, зная, что их $opt$ лежат между $l'$ и $r'$. Эта функция просто берет середину отрезка $[l, r]$ и линейным проходом считает ответ для неё, а затем рекурсивно запускается от половин, передавая в качестве границ $[l', opt]$ и $[opt, r']$ соответственно. - -### Реализация - -Один $k$-тый слой целиком пересчитывается из $(k-1)$-го следующим образом: +Будем делать следующее: заведем рекурсивную функцию, которая считает динамики для отрезка $[l, r]$ на $k$-том слое, зная, что их $opt$ лежат между $l'$ и $r'$. 
Эта функция просто берет середину отрезка $[l, r]$ и линейным проходом считает ответ для неё, а затем рекурсивно запускается от половин, передавая в качестве границ $[l', opt]$ и $[opt, r']$ соответственно: ```c++ +// [ l, r] -- какие динамики на k-том слое посчитать +// [_l, _r] -- где могут быть их ответы void solve(int l, int r, int _l, int _r, int k) { if (l > r) return; // отрезок пустой -- выходим int opt = _l, t = (l + r) / 2; + // считаем ответ для f[t][k] for (int i = _l; i <= min(_r, t); i++) { int val = f[i + 1][k - 1] + cost(i, t - 1); if (val < f[t][k]) f[t][k] = val, opt = i; } - solve(l, t - 1, _l, opt, k); - solve(t + 1, r, opt, _r, k); + solve(l, t - 1, _l, opt, k); + solve(t + 1, r, opt, _r, k); } ``` @@ -56,8 +55,6 @@ for (int k = 1; k <= m; k++) solve(0, n - 1, 0, n - 1, k); ``` -### Асимптотика - Так как отрезок $[l, r]$ на каждом вызове уменьшается примерно в два раза, глубина рекурсии будет $O(\log n)$. Так как отрезки поиска для всех элементов на одном «уровне» могут пересекаться разве что только по границам, то суммарно на каждом уровне поиск проверит $O(n)$ различных индексов. Соответственно, пересчет всего слоя займет $O(n \log n)$ операций вместо $O(n^2)$ в базовом решении. -Таким образом, мы улучшили асимптотику до $O(n m \log n)$. +Таким образом, мы улучшили асимптотику до $O(n \cdot m \cdot \log n)$. diff --git a/content/russian/cs/layer-optimizations/knuth.md b/content/russian/cs/layer-optimizations/knuth.md index 5c49dbe6..8a184d2d 100644 --- a/content/russian/cs/layer-optimizations/knuth.md +++ b/content/russian/cs/layer-optimizations/knuth.md @@ -9,13 +9,13 @@ prerequisites: Предыдущий метод оптимизации опирался на тот факт, что $opt[i, j] \leq opt[i, j + 1]$. -Асимптотику можно ещё улучшить, заметив, что $opt$ монотонен ещё и по первому параметру: +Асимптотику можно ещё улучшить, заметив, что $opt$ монотонен также и по второму параметру: $$ -opt[i-1, j] \leq opt[i, j] \leq opt[i, j+1] +opt[i - 1, j] \leq opt[i, j] \leq opt[i, j + 1] $$ -В задаче про покрытие отрезками это выполняется примерно по той же причине: если нам нужно покрывать меньше точек, то новый оптимальный последний отрезок будет начинаться не позже старого. +В задаче про покрытие отрезками это выполняется примерно по той же причине: если нам доступно больше отрезков, то последний отрезок в оптимальном решении точно не будет длиннее, чем раньше. ### Алгоритм diff --git a/content/russian/cs/matching/matching-problems.md b/content/russian/cs/matching/matching-problems.md index cedfe69d..cd14e54e 100644 --- a/content/russian/cs/matching/matching-problems.md +++ b/content/russian/cs/matching/matching-problems.md @@ -81,6 +81,6 @@ $$ Пусть у вершин левой доли есть какие-то веса, и нам нужно набрать максимальное паросочетание минимального веса. -Выясняется, что можно просто отсортировать вершины левой доли по весу и пытаться в таком порядке добавлять их в паросочетание стандартным алгоритмом Куна. Для доказательства этого факта читатель может прочитать про [жадный алгоритм Радо-Эдмондса](/cs/greedy/matroid), частным случаем которого является такая модификация алгоритма Куна. +Выясняется, что можно просто отсортировать вершины левой доли по весу и пытаться в таком порядке добавлять их в паросочетание стандартным алгоритмом Куна. Для доказательства этого факта читатель может прочитать про [жадный алгоритм Радо-Эдмондса](/cs/combinatorial-optimization/matroid), частным случаем которого является такая модификация алгоритма Куна. 
Аналогичную задачу, но когда у *ребер* есть веса, проще всего решать сведением к нахождению [потока минимальной стоимости](/cs/flows/mincost-maxflow). diff --git a/content/russian/cs/modular/reciprocal.md b/content/russian/cs/modular/reciprocal.md index 5d0e34e9..7b966de3 100644 --- a/content/russian/cs/modular/reciprocal.md +++ b/content/russian/cs/modular/reciprocal.md @@ -99,7 +99,7 @@ $$ ax + my = 1 \iff ax \equiv 1 \iff x \equiv a^{-1} \pmod m $$ int inv(int a, int m) { if (a == 1) return 1; - return (1 - inv(m % a, a) * m) / a + m; + return (1 - 1ll * inv(m % a, a) * m) / a + m; } ``` diff --git a/content/russian/cs/numerical/newton.md b/content/russian/cs/numerical/newton.md index 248e1b4e..5426cff5 100644 --- a/content/russian/cs/numerical/newton.md +++ b/content/russian/cs/numerical/newton.md @@ -66,9 +66,9 @@ double sqrt(double n) { Запустим метод Ньютона для поиска квадратного корня $2$, начиная с $x_0 = 1$, и посмотрим, сколько первых цифр оказались правильными после каждой итерации: -
-1
-1.5
+
+1.0000000000000000000000000000000000000000000000000000000000000
+1.5000000000000000000000000000000000000000000000000000000000000
 1.4166666666666666666666666666666666666666666666666666666666675
 1.4142156862745098039215686274509803921568627450980392156862745
 1.4142135623746899106262955788901349101165596221157440445849057
diff --git a/content/russian/cs/persistent/persistent-array.md b/content/russian/cs/persistent/persistent-array.md
index e476c355..018c287a 100644
--- a/content/russian/cs/persistent/persistent-array.md
+++ b/content/russian/cs/persistent/persistent-array.md
@@ -2,8 +2,9 @@
 title: Структуры с откатами
 weight: 1
 authors:
-- Сергей Слотин
-date: 2021-09-12
+  - Сергей Слотин
+date: {}
+published: true
 ---
 
 Состояние любой структуры как-то лежит в памяти: в каких-то массивах, или в более общем случае, по каким-то определенным адресам в памяти. Для простоты, пусть у нас есть некоторый массив $a$ размера $n$, и нам нужно обрабатывать запросы присвоения и чтения, а также иногда откатывать изменения обратно.
@@ -20,7 +21,7 @@ int a[N];
 stack< pair<int, int> > s;
 
 void change(int k, int x) {
-    l.push({k, a[k]});
+    s.push({k, a[k]});
     a[k] = x;
 }
 
@@ -84,7 +85,7 @@ void rollback() {
 
 ```cpp
 int t = 0;
-vector versions[N];
+vector< pair<int, int> > versions[N];
 
 void change(int k, int x) {
     versions[k].push_back({t++, x});
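Для полноты — набросок (не из статьи): как по массиву `versions[k]` ответить на запрос «чему равнялся элемент $k$ на момент времени $T$» бинарным поиском по версиям. Предполагается, что для каждого индекса заранее добавлена начальная версия `{0, начальное значение}`:

```c++
int get(int k, int T) {
    auto &v = versions[k]; // версии отсортированы по времени создания
    int l = 0, r = (int) v.size() - 1;
    while (l < r) {
        int m = (l + r + 1) / 2;
        if (v[m].first <= T)
            l = m;      // версия m создана не позже T
        else
            r = m - 1;
    }
    return v[l].second; // последняя версия со временем <= T
}
```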
diff --git a/content/russian/cs/programming/bayans.md b/content/russian/cs/programming/bayans.md
index 7d8d773b..d7b42267 100644
--- a/content/russian/cs/programming/bayans.md
+++ b/content/russian/cs/programming/bayans.md
@@ -4,11 +4,12 @@ weight: 100
 authors:
 - Сергей Слотин
 created: 2017-2019
+date: 2022-07-17
 ---
 
 Везде, где не указано — время работы $O(n)$, а если есть конкретные числа, то TL 1 секунда.
 
-Задачи идут в порядке вспоминания, то есть в весьма рандомном.
+Задачи идут в порядке вспоминания/придумывания, то есть в весьма рандомном.
 
 ## Попугаи
 
@@ -121,12 +122,24 @@ int lower_bound(int x) {
 
 ## Нулевая сумма
 
-Дано  мультимножество из $n$ целых чисел. Найдите любое его подмножество, сумма чисел которого делится на $n$.
+Дано мультимножество из $n$ целых чисел. Найдите любое его непустое подмножество, сумма чисел которого делится на $n$.
 
 ## Мета-задача
 
 В задаче дана произвольная строка, по которой известным только авторам способом генерируется ответ yes/no. В задаче 100 тестов. У вас есть 20 попыток. В качестве фидбэка вам доступны вердикты на каждом тесте. Вердикта всего два: OK (ответ совпал) и WA. Попытки поделить на ноль, выделить терабайт памяти и подобное тоже считаются как WA. «Решите» задачу.
 
+## Мета-задача 2
+
+Условие как в «Мета-задаче», но сообщается только число пройденных тестов.
+
+100 тестов, 70 попыток.
+
+## Мета-задача 3
+
+Условие как в «Мета-задаче», но сообщается только номер первого не пройденного теста.
+
+10 тестов, 100 попыток.
+
 ## Ниточка
 
 В плоскую доску вбили $n$ гвоздей радиуса $r$, причём так, что соответствующие точки на плоскости образуют вершины выпуклого многоугольника. На эти гвозди натянули ниточку, причём ниточка «огибает» по кругу гвозди. Найдите длину ниточки, то есть периметр этого многоугольника с учётом закругления.
@@ -302,3 +315,56 @@ def query(y):
 ```
 
 Ваша задача — отгадать число, используя не более 10000 попыток.
+
+## Коммивояжер
+
+Даны $3 \cdot 10^5$ точек на плоскости. Выберите среди них любое подмножество из 500 точек и решите для него задачу коммивояжера: найдите минимальный по длине цикл, проходящий через все эти точки.
+
+## Анаграммы
+
+Найдите в строке $s$ первую подстроку, являющуюся анаграммой (перестановкой символов) строки $t$ за $O(n)$.
+
+## Функциональный граф
+
+Дан ориентированный граф из $n < 10^5$ вершин, в котором из каждой вершины ведет ровно одно ребро. Требуется ответить на $q < 10^5$ запросов «в какую вершину мы попадем, если начнем в вершине $v_i$ и сделаем $k_i < 10^{18}$ переходов» за время $O(q + n)$.
+
+## Асинхронная шляпа
+
+Серёжа и его $(n - 1)$ друзей решили поиграть в «шляпу», в которой один игрок должен за ограниченное время объяснить как можно больше слов, чтобы его партнер их отгадал.
+
+Каждый игрок должен пообщаться с любым другим по разу; обычно игра проводится так:
+
+- 1-й игрок объясняет в течение минуты слова 2-му,
+- 2-й игрок объясняет слова 3-му,
+- ...,
+- $n$-й игрок объясняет слова 1-му,
+- 1-й игрок объясняет слова 3-му,
+- 2-й игрок объясняет слова 4-му…
+
+…и так далее, пока $(n-1)$-й игрок не закончит объяснять слова $(n-2)$-ому.
+
+Если друзей собралось много, то игра может занять приличное время. Серёжу интересует, какое минимальное время она может длиться, если разрешить парам участников общаться между собой одновременно и в любом порядке.
+
+Для данного $n \le 500$, найдите минимальное количество времени $k$ и соответствующее ему расписание.
+
+## Random coffee
+
+В компании, в которой вы работаете, устроено неизвестное число людей — от одного до бесконечности с равной вероятностью. Для борьбы с одиночеством, каждый сотрудник участвует в «random coffee»: каждую неделю вы встречаетесь со случайным человеком из компании, чтобы попить кофе и обсудить что угодно.
+
+Вы участвовали в random coffee $n$ раз и пообщались с $k$ разными людьми (с некоторыми — более одного раза). Какое наиболее вероятное число человек работает в компании?
+
+## Мафия
+
+В «мафию» играют 13 человек, из которых 10 мирных и 3 мафии. Все роли розданы с помощью стандартной колоды игральных карт: заранее выбрали и перемешали 10 красных и 3 чёрные карты, кто вытянул черную — мафия. Все карты различны и известны всем. Игра начинается с дневного голосования.
+
+Как мирным гарантированно победить?
+
+
+
+
diff --git a/content/russian/cs/programming/stress-test.md b/content/russian/cs/programming/stress-test.md
index b20c77b6..c67d1237 100644
--- a/content/russian/cs/programming/stress-test.md
+++ b/content/russian/cs/programming/stress-test.md
@@ -151,12 +151,12 @@ _, f1, f2, gen, iters = sys.argv
 
 for i in range(int(iters)):
     print('Test', i + 1)
-    os.popen('python3 %s > test.txt' % gen)
-    v1 = os.popen('./%s < test.txt' % f1).read()
-    v2 = os.popen('./%s < test.txt' % f2).read()
+    os.system(f'python3 {gen} > test.txt')
+    v1 = os.popen(f'./{f1} < test.txt').read()
+    v2 = os.popen(f'./{f2} < test.txt').read()
     if v1 != v2:
         print("Failed test:")
-        print(open("text.txt").read())
+        print(open("test.txt").read())
         print(f'Output of {f1}:')
         print(v1)
         print(f'Output of {f2}:')
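Запускается такой скрипт (назовём его условно `stress.py`) примерно так: `python3 stress.py solution brute gen.py 100`, где `solution` и `brute` — два скомпилированных решения, `gen.py` — генератор тестов, а последний аргумент — число итераций; имена файлов здесь условные.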
diff --git a/content/russian/cs/range-queries/fenwick.md b/content/russian/cs/range-queries/fenwick.md
index f07a1ed4..9e37fc8d 100644
--- a/content/russian/cs/range-queries/fenwick.md
+++ b/content/russian/cs/range-queries/fenwick.md
@@ -84,7 +84,7 @@ int sum (int r1, int r2) {
     int res = 0;
     for (int i = r1; i > 0; i -= i & -i)
         for (int j = r2; j > 0; j -= j & -j)
-            ans += t[i][j];
+            res += t[i][j];
     return res;
 }
 ```
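Для полноты — набросок (не из статьи): сумма по произвольному прямоугольнику собирается из четырёх таких «префиксных» сумм включением-исключением (имя функции условное, индексация с единицы, как в дереве Фенвика):

```c++
// сумма на прямоугольнике [x1, x2] x [y1, y2]
int rect_sum(int x1, int y1, int x2, int y2) {
    return sum(x2, y2) - sum(x1 - 1, y2)
         - sum(x2, y1 - 1) + sum(x1 - 1, y1 - 1);
}
```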
diff --git a/content/russian/cs/range-queries/img/prefix-sum.png b/content/russian/cs/range-queries/img/prefix-sum.png
new file mode 100644
index 00000000..4e00190a
Binary files /dev/null and b/content/russian/cs/range-queries/img/prefix-sum.png differ
diff --git a/content/russian/cs/range-queries/prefix-sum.md b/content/russian/cs/range-queries/prefix-sum.md
index 861200a1..f4e02570 100644
--- a/content/russian/cs/range-queries/prefix-sum.md
+++ b/content/russian/cs/range-queries/prefix-sum.md
@@ -52,13 +52,15 @@ $$
 
 Для ответа на запрос поиска суммы на произвольном полуинтервале нужно просто вычесть друг из друга две предподсчитанные префиксные суммы.
 
-@@
+
+
+![](../img/prefix-sum.png)
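Для наглядности — короткий набросок (не из статьи; константа и имена условные):

```c++
const int N = 1e5;
int a[N], s[N + 1]; // s[i] — сумма первых i элементов, s[0] = 0

void build(int n) {
    for (int i = 0; i < n; i++)
        s[i + 1] = s[i] + a[i];
}

int sum(int l, int r) { // сумма на полуинтервале [l, r)
    return s[r] - s[l];
}
```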
 
 ### Другие операции
 
diff --git a/content/russian/cs/range-queries/sqrt-structures.md b/content/russian/cs/range-queries/sqrt-structures.md
index bac0da16..25fe3b5e 100644
--- a/content/russian/cs/range-queries/sqrt-structures.md
+++ b/content/russian/cs/range-queries/sqrt-structures.md
@@ -1,10 +1,10 @@
 ---
 title: Корневые структуры
 authors:
-- Сергей Слотин
-- Иван Сафонов
+  - Сергей Слотин
+  - Иван Сафонов
 weight: 6
-date: 2021-09-13
+date: 2022-08-16
 ---
 
 Корневые оптимизации можно использовать много для чего, в частности в контексте структур данных.
@@ -23,16 +23,15 @@ date: 2021-09-13
 ```c++
 // c это и количество блоков, и также их размер; оно должно быть чуть больше корня
 const int maxn = 1e5, c = 330;
-int a[maxn], b[c];
-int add[c];
+int a[maxn], b[c], add[c];
 
 for (int i = 0; i < n; i++)
     b[i / c] += a[i];
 ```
 
-Заведем также массив `add` размера $\sqrt n$, который будем использовать для отложенной операции прибавления на блоке. Будем считать, что реальное значение $i$-го элемента равно `a[i] + add[i / c]`.
+Заведем также массив `add` размера $\sqrt n$, который будем использовать для отложенной операции прибавления на блоке: будем считать, что реальное значение $i$-го элемента равно `a[i] + add[i / c]`.
 
-Теперь мы можем отвечать на запросы первого типа за $O(\sqrt n)$ на запрос:
+Теперь мы можем отвечать на запросы первого типа за $O(\sqrt n)$ операций на запрос:
 
 1. Для всех блоков, лежащих целиком внутри запроса, просто возьмём уже посчитанные суммы и сложим.
 2. Для блоков, пересекающихся с запросом только частично (их максимум два — правый и левый), проитерируемся по нужным элементам и поштучно прибавим к ответу.
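Для наглядности — набросок такого запроса (не из статьи; предполагается, что `b` хранит суммы блоков без учёта отложенных прибавлений `add`, то есть реальное значение элемента равно `a[i] + add[i / c]`):

```c++
int query(int l, int r) { // сумма на отрезке [l, r]
    int res = 0;
    while (l <= r) {
        if (l % c == 0 && l + c - 1 <= r) {
            // блок целиком внутри запроса
            res += b[l / c] + add[l / c] * c;
            l += c;
        } else {
            // граничный блок — поэлементно
            res += a[l] + add[l / c];
            l++;
        }
    }
    return res;
}
```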
@@ -68,6 +67,7 @@ void upd(int l, int r, int x) {
             l += c;
         }
         else {
+            b[l / c] += x;
             a[l] += x;
             l++;
         }
@@ -111,8 +111,8 @@ vector< vector<int> > blocks;
 // возвращает индекс блока и индекс элемента внутри блока
 pair<int, int> find_block(int pos) {
     int idx = 0;
-    while (blocks[idx].size() >= pos)
-        pos -= blocks[idx--].size();
+    while (blocks[idx].size() <= pos)
+        pos -= blocks[idx++].size();
     return {idx, pos};
 }
 ```
diff --git a/content/russian/cs/sequences/_index.md b/content/russian/cs/sequences/_index.md
index d02ed49b..6888831d 100644
--- a/content/russian/cs/sequences/_index.md
+++ b/content/russian/cs/sequences/_index.md
@@ -1,7 +1,6 @@
 ---
 title: Последовательности
 weight: 4
-draft: true
 ---
 
-В этой главе рассматриваются некоторые алгоритмы на неотсортированных последовательностях.
+В этой главе рассматриваются алгоритмы для неотсортированных последовательностей.
diff --git a/content/russian/cs/sequences/compression.md b/content/russian/cs/sequences/compression.md
index 332011b3..5b469fec 100644
--- a/content/russian/cs/sequences/compression.md
+++ b/content/russian/cs/sequences/compression.md
@@ -3,46 +3,64 @@ title: Сжатие координат
 authors:
 - Сергей Слотин
 weight: -1
-draft: true
+date: 2022-04-20
 ---
 
+Часто бывает полезно преобразовать последовательность чисел либо каких-то других объектов в промежуток последовательных целых чисел — например, чтобы использовать её элементы как индексы в массиве либо какой-нибудь другой структуре.
 
-## Сжатие координат
-Это общая идея, которая может оказаться полезной. Пусть, есть $n$ чисел $a_1,\ldots,a_n$. Хотим, преобразовать $a_i$ так, чтобы равные остались равными, разные остались разными, но все они были от 0 до $n-1$. Для этого надо отсортировать числа, удалить повторяющиеся и заменить каждое $a_i$ на его индекс в отсортированном массиве.
+Эта задача эквивалентна нумерации элементов множества, что можно сделать за $O(n)$ через хеш-таблицу:
 
+```c++
+vector<int> compress(vector<int> a) {
+    unordered_map<int, int> m;
 
-```
-int a[n], all[n];
-for (int i = 0; i < n; ++i) {
-    cin >> a[i];
-    all[i] = a[i];
+    for (int &x : a) {
+        if (m.count(x))
+            x = m[x];
+        else
+            m[x] = m.size();
+    }
+
+    return a;
 }
-sort(all, all + n);
-m = unique(all, all + n) - all; // теперь m - число различных координат
-for (int i = 0; i < n; ++i)
-    a[i] = lower_bound(all, all + m, x[i]) - all;
 ```
 
-```cpp
+Элементам будут присвоены номера в порядке их первого вхождения в последовательность. Если нужно сохранить *порядок*, присвоив меньшим элементам меньшие номера, то задача становится чуть сложнее, и её можно решить разными способами.
+
+Как вариант, можно отсортировать массив, а затем два раза пройтись по нему с хэш-таблицей — в первый раз заполняя её, а во второй раз сжимая сам массив:
+
+```c++
 vector<int> compress(vector<int> a) {
+    vector<int> b = a;
+    sort(b.begin(), b.end());
+
     unordered_map<int, int> m;
-    for (int x : a)
-        if (m.count(x))
+
+    for (int x : b)
+        if (!m.count(x))
             m[x] = m.size();
+
     for (int &x : a)
         x = m[x];
+
     return a;
 }
 ```
 
+Также можно выкинуть из отсортированного массива дупликаты (за линейное время), а затем использовать его для нахождения индекса каждого элемента исходного массива бинарным поиском:
 
-```cpp
+```c++
 vector<int> compress(vector<int> a) {
     vector<int> b = a;
+
     sort(b.begin(), b.end());
     b.erase(unique(b.begin(), b.end()), b.end());
+
     for (int &x : a)
         x = int(lower_bound(b.begin(), b.end(), x) - b.begin());
+
     return a;
 }
 ```
+
+Оба подхода работают за $O(n \log n)$. Используйте тот, который больше нравится.
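Небольшой пример использования (добавлен для иллюстрации):

```c++
vector<int> a = {30, -5, 30, 7};
vector<int> b = compress(a);
// обе версии с сортировкой вернут {2, 0, 2, 1},
// версия без сортировки — {0, 1, 0, 2}
```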
diff --git a/content/russian/cs/sequences/inversions.md b/content/russian/cs/sequences/inversions.md
index f18d1f4a..2fbec7d9 100644
--- a/content/russian/cs/sequences/inversions.md
+++ b/content/russian/cs/sequences/inversions.md
@@ -4,13 +4,18 @@ title: Число инверсий
 weight: 5
 authors:
 - Сергей Слотин
+draft: true
 ---
 
-Пусть у нас есть некоторая перестановка $p$ (какая-то последовательность чисел от $1$ до $n$, где все числа встречаются ровно один раз). *Инверсией* называется пара индексов $i$ и $j$ такая, что $i < j$ и $p_i > p_j$. Требуется найти количество инверсий в данной перестановке.
+**Определение.** *Инверсией* в перестановке $p$ называется пара индексов $i$ и $j$ такая, что $i < j$ и $p_i > p_j$.
 
-## Наивный алгоритм
+Например:
 
-Эта задача легко решается за $O(n^2)$ обычным перебором всех пар индексов и проверкой каждого на инверсию:
+- в перестановке $[1, 2, 3]$ инверсий нет,
+- в $[1, 3, 2]$ одна инверсия ($3 \leftrightarrow 2$),
+- в $[3, 2, 1]$ три инверсии ($3 \leftrightarrow 2$, $3 \leftrightarrow 1$ и $2 \leftrightarrow 1$).
+
+В этой статье мы рассмотрим, как находить количество инверсий в перестановке. Эта задача легко решается за $O(n^2)$ обычным перебором всех пар индексов и проверкой каждого на инверсию:
 
 ```cpp
 int count_inversions(int *p, int n) {
@@ -23,6 +28,8 @@ int count_inversions(int *p, int n) {
 }
 ```
 
+Решить её быстрее сложнее.
+
 ## Сортировкой слиянием
 
 Внезапно эту задачу можно решить сортировкой слиянием, слегка модифицировав её.
diff --git a/content/russian/cs/sequences/quickselect.md b/content/russian/cs/sequences/quickselect.md
index b1606bbd..7e83a267 100644
--- a/content/russian/cs/sequences/quickselect.md
+++ b/content/russian/cs/sequences/quickselect.md
@@ -1,12 +1,12 @@
 ---
-# TODO: реализация
 title: Порядковые статистики
 weight: 4
+draft: true
 ---
 
 Если в [начале предыдущей главы](/cs/interactive/binary-search) мы искали число элементов массива, меньших $x$ — также известное как индекс этого элемента в отсортированном массиве — то теперь нас интересует обратная задача: узнать, какой элемент $k$-тый по возрастанию.
 
-Если массив уже отсортирован, то задача тривиальная — просто берем $k$-тый элемент. Иначе мы его можем отсортировать, но на это потребуется $O(n \log n)$ операций — и мы знаем, что используя только сравнения быстрее не получится.
+Если массив уже отсортирован, то задача тривиальная: просто берем $k$-тый элемент. Иначе мы его можем отсортировать, но на это потребуется $O(n \log n)$ операций — и мы знаем, что если мы используем только сравнения, быстрее не получится.
 
 Есть другой подход — мы можем модифицировать алгоритм быстрой сортировки.
 
@@ -26,4 +26,17 @@ weight: 4
 
 Подумав над тем, что размер отрезка каждый раз убывает приблизительно в 2 раза, над ограниченностью суммы $n + \frac{n}{2} + \frac{n}{4} + \ldots = 2 \cdot n$, и немного помахав руками, получаем, что алгоритм работает за $O(n)$. 
 
+
+
 В C++ этот алгоритм уже реализован и доступен как `nth_element`.
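Для иллюстрации — минимальный пример вызова (набросок):

```c++
vector<int> a = {5, 1, 4, 2, 3};
nth_element(a.begin(), a.begin() + 2, a.end());
cout << a[2]; // выведет 3 — третий по возрастанию элемент
```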
diff --git a/content/russian/cs/set-structures/dsu.md b/content/russian/cs/set-structures/dsu.md
index 6c9a4d80..ee437a43 100644
--- a/content/russian/cs/set-structures/dsu.md
+++ b/content/russian/cs/set-structures/dsu.md
@@ -66,7 +66,7 @@ int leader(int v) {
 
 Следующие две эвристики похожи по смыслу и стараются оптимизировать высоту дерева, выбирая оптимальный корень для переподвешивания.
 
-**Ранговая эвристика**. Будем хранить для каждой вершины её *ранг* — высоту её поддереа. При объединении деревьев будем делать корнем нового дерева ту вершину, у которой ранг больше, и пересчитывать ранги (ранг у лидера должен увеличиться на единицу, если он совпадал с рангом другой вершины). Эта эвристика оптимизирует высоту дерева напрямую.
+**Ранговая эвристика**. Будем хранить для каждой вершины её *ранг* — высоту её поддерева. При объединении деревьев будем делать корнем нового дерева ту вершину, у которой ранг больше, и пересчитывать ранги (ранг у лидера должен увеличиться на единицу, если он совпадал с рангом другой вершины). Эта эвристика оптимизирует высоту дерева напрямую.
 
 ```cpp
 void unite(int a, int b) {
diff --git a/content/russian/cs/sorting/bubble.md b/content/russian/cs/sorting/bubble.md
index 2d9af9b5..38fa5c8a 100644
--- a/content/russian/cs/sorting/bubble.md
+++ b/content/russian/cs/sorting/bubble.md
@@ -1,9 +1,10 @@
 ---
 title: Сортировка пузырьком
 weight: 1
+published: true
 ---
 
-Наш первый подход будет заключаться в следующем: обозначим за $n$ длину массива и $n$ раз пройдёмся раз пройдемся по нему слева направо, меняя два соседних элемента, если первый больше второго.
+Наш первый подход будет заключаться в следующем: обозначим за $n$ длину массива и $n$ раз пройдёмся по нему слева направо, меняя два соседних элемента, если первый больше второго.
 
 Каждую итерацию максимальный элемент «всплывает» как пузырек к концу массива — отсюда и название.
 
diff --git a/content/russian/cs/sorting/quicksort.md b/content/russian/cs/sorting/quicksort.md
index f3a6a5d6..e6494cd3 100644
--- a/content/russian/cs/sorting/quicksort.md
+++ b/content/russian/cs/sorting/quicksort.md
@@ -7,13 +7,18 @@ draft: true
 Быстрая сортировка заключается в том, что на каждом шаге мы находим опорный элемент, все элементы, которые меньше его кидаем в левую часть, остальные в правую, а затем рекурсивно спускаемся в обе части.
 
 ```cpp
+// partition — функция, разбивающая элементы
+// на меньшие и больше/равные опорного элемента a[p]
+// и возвращающая границу разбиения
+// (возможная реализация — см. набросок ниже)
+int partition(int l, int r, int p) {
+
+}
+
 void quicksort(int l, int r){
     if (l < r){
         int index = (l + r) / 2; /* index - индекс опорного элемента для 
         начала сделаем его равным середине отрезка*/
-        index = divide(l, r, index); /* divide - функция разбивающие элементы 
-        на меньшие и больше/равные a[index], 
-        при этом функция возвращает границу разбиения*/
+        index = partition(l, r, index);
         quicksort(l, index);
         quicksort(index + 1, r);
     }
@@ -25,8 +30,6 @@ void quicksort(int l, int r){
 
 Существуют несколько выходов из этой ситуации :
 
-2. Давайте если быстрая сортировка работает долго, то запустим любую другую сортировку за $NlogN$.
-
-3. Давайте делить массив не на две, а на три части(меньше, равны, больше).
-
-4. Чтобы избавиться от проблемы с максимумом/минимумом в середине, давайте **брать случайный элемент**.
+1. Давайте, если быстрая сортировка работает долго, запустим любую другую сортировку за $N \log N$.
+2. Давайте делить массив не на две, а на три части (меньше, равны, больше).
+3. Чтобы избавиться от проблемы с максимумом/минимумом в середине, давайте **брать случайный элемент**.
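Возможная реализация заглушки `partition` из кода выше — разбиение по схеме Хоара (набросок; предполагается глобальный массив `a`, как в остальном коде статьи, и то, что в качестве опорного передаётся не правая граница — в статье передаётся середина отрезка):

```c++
int partition(int l, int r, int p) {
    int pivot = a[p];
    int i = l - 1, j = r + 1;
    while (true) {
        do i++; while (a[i] < pivot); // ищем слева элемент >= опорного
        do j--; while (a[j] > pivot); // ищем справа элемент <= опорного
        if (i >= j)
            return j; // a[l..j] <= pivot <= a[j+1..r]
        swap(a[i], a[j]);
    }
}
```

Такое разбиение согласуется с рекурсией `quicksort(l, index); quicksort(index + 1, r);` из статьи.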
diff --git a/content/russian/cs/sorting/selection.md b/content/russian/cs/sorting/selection.md
index b47f2320..30854b5f 100644
--- a/content/russian/cs/sorting/selection.md
+++ b/content/russian/cs/sorting/selection.md
@@ -1,6 +1,7 @@
 ---
 title: Сортировка выбором
 weight: 2
+published: true
 ---
 
 Похожим методом является **сортировка выбором** (минимума или максимума).
@@ -10,7 +11,7 @@ weight: 2
 ```cpp
 void selection_sort(int *a, int n) {
     for (int k = 0; k < n - 1; k++)
-        for (j = k + 1; j < n; j++)
+        for (int j = k + 1; j < n; j++)
             if (a[k] > a[j])
                 swap(a[j], a[k]);
 }
diff --git a/content/russian/cs/spanning-trees/kruskal.md b/content/russian/cs/spanning-trees/kruskal.md
index ddb9cabf..1f4c98a4 100644
--- a/content/russian/cs/spanning-trees/kruskal.md
+++ b/content/russian/cs/spanning-trees/kruskal.md
@@ -34,4 +34,4 @@ for (auto [a, b, w] : edges) {
 }
 ```
 
-Раз остовные деревья являются частным случаем [матроида](/cs/greedy/matroid), то алгоритм Краскала является частным случаем алгоритма Радо-Эдмондса.
+Раз остовные деревья являются частным случаем [матроида](/cs/combinatorial-optimization/matroid), то алгоритм Краскала является частным случаем алгоритма Радо-Эдмондса.
diff --git a/content/russian/cs/spanning-trees/prim.md b/content/russian/cs/spanning-trees/prim.md
index d9a00c6e..ff250c70 100644
--- a/content/russian/cs/spanning-trees/prim.md
+++ b/content/russian/cs/spanning-trees/prim.md
@@ -2,7 +2,8 @@
 title: Алгоритм Прима
 weight: 2
 prerequisites:
-- safe-edge
+  - safe-edge
+published: true
 ---
 
 Лемма о безопасном ребре говорит, что мы можем строить минимальный остов постепенно, добавляя по одному ребра, про которые мы точно знаем, что они минимальные для соединения какого-то разреза.
@@ -47,7 +48,7 @@ min_edge[0] = 0;
 
 for (int i = 0; i < n; i++) {
     int v = -1;
-    for (int u = 0; u < n; j++)
+    for (int u = 0; u < n; u++)
         if (!used[u] && (v == -1 || min_edge[u] < min_edge[v]))
             v = u;
 
diff --git a/content/russian/cs/spanning-trees/safe-edge.md b/content/russian/cs/spanning-trees/safe-edge.md
index cc7138c9..19f97006 100644
--- a/content/russian/cs/spanning-trees/safe-edge.md
+++ b/content/russian/cs/spanning-trees/safe-edge.md
@@ -24,4 +24,4 @@ weight: 1
 - Если веса всех рёбер различны, то остов будет уникален.
 - Минимальный остов является также и остовом с минимальным произведением весов рёбер (замените веса всех рёбер на их логарифмы).
 - Минимальный остов является также и остовом с минимальным весом самого тяжелого ребра.
-- Остовные деревья — частный случай [матроидов](/cs/greedy/matroid).
+- Остовные деревья — частный случай [матроидов](/cs/combinatorial-optimization/matroid).
diff --git a/content/russian/cs/string-searching/manacher.md b/content/russian/cs/string-searching/manacher.md
index 8954b653..16d32ccb 100644
--- a/content/russian/cs/string-searching/manacher.md
+++ b/content/russian/cs/string-searching/manacher.md
@@ -32,7 +32,7 @@ vector pal_array(string s) {
 
 Тот же пример $s = aa\dots a$ показывает, что данная реализация работает за $O(n^2)$.
 
-Для оптимизации применим идею, знакомую из алгоритма [z-функции](string-searching): при инициализации $t_i$ будем пользоваться уже посчитанными $t$. А именно, будем поддерживать $(l, r)$ — интервал, соответствующий самому правому из найденных подпалиндромов. Тогда мы можем сказать, что часть наибольшего палиндрома с центром в $s_i$, которая лежит внутри $s_{l:r}$, имеет радиус хотя бы $\min(r-i, \; t_{l+r-i})$. Первая величина равна длине, дальше которой произошел бы выход за пределы $s_{l:r}$, а вторая — значению радиуса в позиции, зеркальной относительно центра палиндрома $s_{l:r}$.
+Для оптимизации применим идею, знакомую из алгоритма [z-функции](/cs/string-searching/z-function/): при инициализации $t_i$ будем пользоваться уже посчитанными $t$. А именно, будем поддерживать $(l, r)$ — интервал, соответствующий самому правому из найденных подпалиндромов. Тогда мы можем сказать, что часть наибольшего палиндрома с центром в $s_i$, которая лежит внутри $s_{l:r}$, имеет радиус хотя бы $\min(r-i, \; t_{l+r-i})$. Первая величина равна длине, дальше которой произошел бы выход за пределы $s_{l:r}$, а вторая — значению радиуса в позиции, зеркальной относительно центра палиндрома $s_{l:r}$.
 
 ```c++
 
diff --git a/content/russian/cs/string-structures/aho-corasick.md b/content/russian/cs/string-structures/aho-corasick.md
index 369f5171..2ca1da65 100644
--- a/content/russian/cs/string-structures/aho-corasick.md
+++ b/content/russian/cs/string-structures/aho-corasick.md
@@ -1,10 +1,11 @@
 ---
 title: Алгоритм Ахо-Корасик
 authors:
-- Сергей Слотин
+  - Сергей Слотин
 weight: 2
 prerequisites:
-- trie
+  - trie
+published: true
 ---
 
 Представим, что мы работаем журналистами в некотором авторитарном государстве, контролирующем СМИ, и в котором время от времени издаются законы, запрещающие упоминать определенные политические события или использовать определенные слова. Как эффективно реализовать подобную цензуру программно?
@@ -36,7 +37,7 @@ prerequisites:
 
 **Определение.** *Суффиксная ссылка* $l(v)$ ведёт в вершину $u \neq v$, которая соответствует наидлиннейшему принимаемому бором суффиксу $v$.
 
-**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую минимальному принимаемому бором суффиксу строки $v + c$.
+**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую максимальному принимаемому бором суффиксу строки $v + c$.
 
 **Наблюдение.** Если переход и так существует в боре (будем называть такой переход *прямым*), то автоматный переход будет вести туда же.
 
diff --git a/content/russian/cs/string-structures/palindromic-tree.md b/content/russian/cs/string-structures/palindromic-tree.md
index 3d70c76b..9b57534a 100644
--- a/content/russian/cs/string-structures/palindromic-tree.md
+++ b/content/russian/cs/string-structures/palindromic-tree.md
@@ -19,7 +19,7 @@ weight: 3
 
 Будем поддерживать наибольший суффикс-палиндром. Когда мы будем дописывать очередной символ $c$, нужно найти наибольший суффикс этого палиндрома, который может быть дополнен символом $c$ — это и будет новый наидлиннейший суффикс-палиндром.
 
-Для этого поступим аналогично [алгоритму Ахо-Корасик](aho-corasick): будем поддерживать для каждого палиндрома суффиксную ссылку $l(v)$, ведущую из $v$ в её наибольший суффикс-палиндром. При добавлении очередного символа, будем подниматься по суффиксным ссылкам, пока не найдём вершину, из которой можно совершить нужный переход.
+Для этого поступим аналогично [алгоритму Ахо-Корасик](../aho-corasick): будем поддерживать для каждого палиндрома суффиксную ссылку $l(v)$, ведущую из $v$ в её наибольший суффикс-палиндром. При добавлении очередного символа, будем подниматься по суффиксным ссылкам, пока не найдём вершину, из которой можно совершить нужный переход.
 
 Если в подходящей вершине этого перехода не существовало, то нужно создать новую вершину, и для неё тоже понадобится своя суффиксная ссылка. Чтобы найти её, будем продолжать подниматься по суффиксным ссылкам предыдущего суффикс-палиндрома, пока не найдём второе такое место, которое мы можем дополнить символом $c$.
 
diff --git a/content/russian/cs/string-structures/suffix-array.md b/content/russian/cs/string-structures/suffix-array.md
index 80d2b129..a7b90768 100644
--- a/content/russian/cs/string-structures/suffix-array.md
+++ b/content/russian/cs/string-structures/suffix-array.md
@@ -22,7 +22,7 @@ weight: 100
 
 ![Сортировка всех суффиксов строки «mississippi$»](../img/sa-sort.png)
 
-**Где это может быть полезно.** Пусть вы хотите основать ещё один поисковик, и чтобы получить финансирование, вам нужно сделать хоть что-то минимально работающее — хотя бы просто научиться искать по ключевому слову документы, включающие его, а также позиции их вхождения (в 90-е это был бы уже довольно сильный MVP). Простыми алгоритмами — [полиномиальными хешами](/cs/hashing), [z- и префикс-функцией](/cs/string-searching) и даже [Ахо-Корасиком](/cs/automata/aho-corasick) — это сделать быстро нельзя, потому что на каждый раз нужно проходиться по всем данным, а суффиксными структурами — можно.
+**Где это может быть полезно.** Пусть вы хотите основать ещё один поисковик, и чтобы получить финансирование, вам нужно сделать хоть что-то минимально работающее — хотя бы просто научиться искать по ключевому слову документы, включающие его, а также позиции их вхождения (в 90-е это был бы уже довольно сильный MVP). Простыми алгоритмами — [полиномиальными хешами](/cs/hashing), [z- и префикс-функцией](/cs/string-searching) и даже [Ахо-Корасиком](../aho-corasick) — это сделать быстро нельзя, потому что на каждый раз нужно проходиться по всем данным, а суффиксными структурами — можно.
 
 В случае с суффиксным массивом можно сделать следующее: сконкатенировать все строки-документы с каким-нибудь внеалфавитным разделителем (`$`), построить по ним суффиксный массив, а дальше для каждого запроса искать бинарным поиском первый суффикс в суффиксном массиве, который меньше искомого слова, а также последний, который меньше. Все суффиксы между этими двумя будут включать искомую строку как префикс.
 
@@ -132,11 +132,11 @@ vector suffix_array(vector &s) {
 
 Тогда есть мотивация посчитать массив `lcp$` в котором окажутся наибольшие общие префиксы соседних суффиксов, а после как-нибудь считать минимумы на отрезках в этом массиве (например, с помощью [разреженной таблицы](/cs/range-queries/sparse-table)).
 
-Осталось придумать способ быстро посчитать массив `lcp`. Можно воспользоваться идеей из построения суффиксного массива за $O(n \log^2 n)$: с помощью [хешей](hashing) и бинпоиска находить `lcp` для каждой пары соседей. Такой метод работает за $O(n \log n)$, но является не самым удобным и популярным.
+Осталось придумать способ быстро посчитать массив `lcp`. Можно воспользоваться идеей из построения суффиксного массива за $O(n \log^2 n)$: с помощью [хешей](/cs/hashing/polynomial/) и бинпоиска находить `lcp` для каждой пары соседей. Такой метод работает за $O(n \log n)$, но является не самым удобным и популярным.
 
 ### Алгоритм Касаи, Аримуры, Арикавы, Ли, Парка
 
-Алгоритм в реальности называется как угодно, но не исходным способом (*алгоритм Касаи*, *алгоритм пяти корейцев*, и т. д.). Используется для подсчета $lcp$ за линейное время. Автору алгоритм кажется чем-то похожим на [z-функцию](string-searching) по своей идее.
+Алгоритм в реальности называется как угодно, но не исходным способом (*алгоритм Касаи*, *алгоритм пяти корейцев*, и т. д.). Используется для подсчета $lcp$ за линейное время. Автору алгоритм кажется чем-то похожим на [z-функцию](/cs/string-searching/z-function) по своей идее.
 
 **Утверждение.** Пусть мы уже построили суфмасс и посчитали $lcp[i]$. Тогда:
 
diff --git a/content/russian/cs/tree-structures/treap.md b/content/russian/cs/tree-structures/treap.md
index dd3417dd..ad11c794 100644
--- a/content/russian/cs/tree-structures/treap.md
+++ b/content/russian/cs/tree-structures/treap.md
@@ -100,7 +100,7 @@ $$
 
 Примечательно, что ожидаемая глубина вершин зависит от их позиции: вершина из середины должна быть примерно в два раза глубже, чем крайняя.
 
-**Упражнение.** Выведите по аналогии с этим рассуждением асимптотику [quicksort](/cs/sorting/quicksort).
+**Упражнение.** Выведите по аналогии с этим рассуждением асимптотику quicksort.
 
 ## Реализация
 
@@ -199,7 +199,7 @@ struct Node {
 Вместо того, чтобы модифицировать и `merge`, и `split` под наши хотелки, напишем вспомогательную функцию `upd`, которую будем вызывать при обновлении детей вершины:
 
 ```c++
-void sum(Node* v) { return v ? v->sum : 0; }
+int sum(Node* v) { return v ? v->sum : 0; }
 // обращаться по пустому указателю нельзя -- выдаст ошибку
 
 void upd(Node* v) { v->sum = sum(v->l) + sum(v->r) + v->val; }
diff --git a/netlify.toml b/netlify.toml
index 1b5ed16e..fb612037 100644
--- a/netlify.toml
+++ b/netlify.toml
@@ -2,7 +2,7 @@
 command = "hugo --gc --minify"
 
 [context.production.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 HUGO_ENV = "production"
 HUGO_ENABLEGITINFO = "true"
 
@@ -10,20 +10,20 @@ HUGO_ENABLEGITINFO = "true"
 command = "hugo --gc --minify --enableGitInfo"
 
 [context.split1.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 HUGO_ENV = "production"
 
 [context.deploy-preview]
 command = "hugo --gc --minify --buildFuture -b $DEPLOY_PRIME_URL"
 
 [context.deploy-preview.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 
 [context.branch-deploy]
 command = "hugo --gc --minify -b $DEPLOY_PRIME_URL"
 
 [context.branch-deploy.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 
 [context.next.environment]
 HUGO_ENABLEGITINFO = "true"
diff --git a/scripts/check-links.sh b/scripts/check-links.sh
new file mode 100644
index 00000000..9f87cefd
--- /dev/null
+++ b/scripts/check-links.sh
@@ -0,0 +1,2 @@
+# hugo serve
+wget --spider -r -nd -nv http://localhost:1313/
diff --git a/scripts/list-files.sh b/scripts/list-files.sh
new file mode 100644
index 00000000..47259b5c
--- /dev/null
+++ b/scripts/list-files.sh
@@ -0,0 +1 @@
+find ./ -type f -name "*.md" -exec wc {} +
diff --git a/themes/algorithmica/assets/dark.sass b/themes/algorithmica/assets/dark.sass
index c26997ba..b5a53b28 100644
--- a/themes/algorithmica/assets/dark.sass
+++ b/themes/algorithmica/assets/dark.sass
@@ -1,24 +1,22 @@
-$font-color: rgb(206, 177, 150)
-$background: black
-$borders: 1px solid #d4ae8d
+$font-color: #DDD
+$background: #222
+$borders: 1px solid rgb(57, 57, 57)
 
-$code-background: #222
-$code-border: 1px solid #333
-$quote-line-color: 0.25em #d4ae8d solid
+$code-background: #333
+$code-border: 1px solid #444
+$quote-line-color: 0.25em #444 solid
 
-$dimmed: #cea163
-$section-headers: #c77d0f
-$headers-color: rgb(200, 160, 130)
+$dimmed: rgb(179, 179, 179)
+$section-headers: rgb(239, 239, 239)
+$headers-color: rgb(239, 239, 239)
 $scrollbar1: #444
 $scrollbar2: #555
 $scrollbar3: #666
 
-$link-color: #ac7625
-$link-hover-color: #eb9a20
+$link-color: #80acd3
+$link-hover-color: #5490c5
 
 @import style.sass
 
 img
-  //filter: invert(100%) sepia(100%) saturate(0%) hue-rotate(288deg) brightness(102%) contrast(102%)
-  filter: invert(100%) sepia(20%) saturate(36.4%) hue-rotate(29deg) brightness(85%)
-  
\ No newline at end of file
+  filter: invert(85%) sepia(20%) saturate(100%) hue-rotate(29deg) brightness(85%)
diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
index fe3ebaeb..00a420cf 100644
--- a/themes/algorithmica/assets/style.sass
+++ b/themes/algorithmica/assets/style.sass
@@ -157,6 +157,11 @@ body
       &::before
         content: counter(chapter-counter) "." counter(section-counter) ". "
         font-weight: bold
+  
+  .draft, .draft a
+    color: $dimmed
+
+    
 
 #wrapper
   width: 100%
@@ -182,10 +187,10 @@ menu
   display: flex
   font-family: $font-headings
   
-  height: 30px
+  height: 26px
   background-color: $background
   justify-content: space-between
-  padding: 12px
+  padding: 14px
   margin: 0
   text-align: center
 
@@ -217,7 +222,37 @@ menu
     .title
       opacity: 1
       transition: opacity 0.1s
-    
+
+#search
+  display: none
+  font-family: $font-interface
+
+  input
+    width: 100%
+    padding: 6px
+
+    color: $font-color
+
+    background: $code-background
+    border: $code-border
+
+    &:focus
+      outline: 1px solid $dimmed
+
+  #search-count
+    margin-top: 8px
+    color: $dimmed
+  
+  #search-results
+    margin-top: 6px
+    border-bottom: $borders
+
+    li
+      list-style: none
+      margin: 12px 6px
+
+    p
+      margin-top: 0
 
 /*
   .github
@@ -460,7 +495,13 @@ pre
   padding-left: 8px
   font-size: 0.85em
   text-align: left
-  
+
+pre.center-pre
+  text-align: center
+  font-size: 1em
+  background: none
+  border: none
+
 .highlight
   margin: 0px
 
diff --git a/themes/algorithmica/i18n/en.toml b/themes/algorithmica/i18n/en.toml
index d58a7924..6fa12340 100644
--- a/themes/algorithmica/i18n/en.toml
+++ b/themes/algorithmica/i18n/en.toml
@@ -15,6 +15,15 @@ other = "updated"
 [sections]
 other = "sections"
 
+[search]
+other = "Search this book…"
+
+[searchCountPrefix]
+other = "Found"
+
+[searchCountSuffix]
+other = "pages"
+
 [prerequisites]
 other = "prerequisites"
 
@@ -22,7 +31,7 @@ other = "prerequisites"
 other = "translations"
 
 [copyright1]
-other = "Copyright 2021 Sergey Slotin"
+other = "Copyright 2021–2022 Sergey Slotin"
 
 [copyright2]
 other = " " # Content is distributed under CC BY-NC
diff --git a/themes/algorithmica/i18n/ru.toml b/themes/algorithmica/i18n/ru.toml
index a25a0c27..08d47b66 100644
--- a/themes/algorithmica/i18n/ru.toml
+++ b/themes/algorithmica/i18n/ru.toml
@@ -21,6 +21,15 @@ other = "обновлено"
 [sections]
 other = "статьи раздела"
 
+[search]
+other = "Поиск по сайту…"
+
+[searchCountPrefix]
+other = "Найдено"
+
+[searchCountSuffix]
+other = "страниц"
+
 [prerequisites]
 other = "пререквизиты"
 
@@ -28,7 +37,7 @@ other = "пререквизиты"
 other = "переводы"
 
 [copyright1]
-other = "Copyleft 2017–2021 Тинькофф Образование" # {{ .Count / . }}
+other = "Copyleft 2017–2022 Algorithmica.org" # {{ .Count / . }}
 
 [copyright2]
 other = "Материалы распространяются под CC BY-SA"
diff --git a/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html b/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html
new file mode 100644
index 00000000..d263bb5a
--- /dev/null
+++ b/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html
@@ -0,0 +1,3 @@
+[markup not recoverable]
+{{.Inner}}
+[markup not recoverable]
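For context: Hugo dispatches fenced code blocks to render hooks by their language tag, so this new template handles blocks tagged `center`; the markup stripped from it presumably wraps `{{.Inner}}` in a `<pre class="center-pre">` element, matching the `pre.center-pre` rules added to style.sass above. A hypothetical block in an article's Markdown would look like:

    ```center
    n → n/2 → n/4 → … → 1
    ```
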
diff --git a/themes/algorithmica/layouts/_default/baseof.html b/themes/algorithmica/layouts/_default/baseof.html
index f9056521..dbe71ede 100644
--- a/themes/algorithmica/layouts/_default/baseof.html
+++ b/themes/algorithmica/layouts/_default/baseof.html
@@ -6,6 +6,7 @@
     {{- partial "buttons.html" . -}}
+    {{ partial "search.html" . }}
     {{- partial "header.html" . -}}
     {{- block "main" . }}{{- end }}
diff --git a/themes/algorithmica/layouts/_default/list.searchindex.json b/themes/algorithmica/layouts/_default/list.searchindex.json
new file mode 100644
index 00000000..6310c263
--- /dev/null
+++ b/themes/algorithmica/layouts/_default/list.searchindex.json
@@ -0,0 +1,5 @@
+{{- $.Scratch.Add "searchindex" slice -}}
+{{- range $index, $element := .Site.Pages -}}
+    {{- $.Scratch.Add "searchindex" (dict "id" $index "title" $element.Title "path" $element.RelPermalink "content" $element.Plain) -}}
+{{- end -}}
+{{- $.Scratch.Get "searchindex" | jsonify -}}
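The `list.searchindex.json` template above flattens every page of the site into an array of `{id, title, path, content}` records; `toggleSearch()` in head.html below fetches it from `/searchindex.json` and feeds it to Lunr. Its output looks roughly like this (titles, paths, and text are illustrative):

    [
      { "id": 0, "title": "Algorithmica", "path": "/", "content": "..." },
      { "id": 1, "title": "Binary Search", "path": "/hpc/data-structures/binary-search/", "content": "In this article, we ..." }
    ]
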
diff --git a/themes/algorithmica/layouts/partials/buttons.html b/themes/algorithmica/layouts/partials/buttons.html
index ce9d5728..265b63d9 100644
--- a/themes/algorithmica/layouts/partials/buttons.html
+++ b/themes/algorithmica/layouts/partials/buttons.html
@@ -3,16 +3,21 @@
 {{ with .File }}{{ $path = .Path }}{{ end }}
     {{.Title}}
+    [five added lines of markup not recoverable]
@@ -20,7 +25,9 @@
-    [removed markup not recoverable]
+    [added markup not recoverable]
diff --git a/themes/algorithmica/layouts/partials/head.html b/themes/algorithmica/layouts/partials/head.html
index f87a8873..c5013dba 100644
--- a/themes/algorithmica/layouts/partials/head.html
+++ b/themes/algorithmica/layouts/partials/head.html
@@ -10,6 +10,11 @@
+    [five added lines not recoverable — presumably the <script> includes for Lunr and its language plugins]
 {{ $dark := resources.Get "dark.sass" | toCSS | minify | fingerprint }}
@@ -18,22 +23,101 @@
     console.log("Toggling sidebar visibility")
     var sidebar = document.getElementById('sidebar')
     var wrapper = document.getElementById('wrapper')
-    if (sidebar.classList.contains('sidebar-toggled') || window.getComputedStyle(sidebar).display == 'block') {
+    if (sidebar.classList.contains('sidebar-toggled') || window.getComputedStyle(sidebar).display == 'block') {
       sidebar.classList.toggle('sidebar-hidden')
       wrapper.classList.toggle('sidebar-hidden')
     }
     sidebar.classList.add('sidebar-toggled')
     wrapper.classList.add('sidebar-toggled')
   }
+
   function switchTheme(theme) {
     console.log("Changing theme:", theme)
     document.getElementById('theme').href = (theme == 'dark' ? "{{ $dark.RelPermalink }}" : "")
     document.getElementById('syntax-theme').href = (theme == 'dark' ? '/syntax-dark.css' : '/syntax.css')
     localStorage.setItem('theme', theme)
   }
+
+  async function toggleSearch() {
+    console.log("Toggling search")
+
+    var searchDiv = document.getElementById('search')
+    if (window.getComputedStyle(searchDiv).display == 'none') {
+      searchDiv.style.display = 'block'
+      window.scrollTo({ top: 0 });
+      document.getElementById('search-bar').focus()
+    } else {
+      searchDiv.style.display = 'none'
+    }
+
+    if (!index) {
+      console.log("Fetching index")
+      const response = await fetch('/searchindex.json')
+      const pages = await response.json()
+      index = lunr(function() {
+        this.use(lunr.multiLanguage('en', 'ru'))
+        this.field('title', {
+          boost: 5
+        })
+        this.field('content', {
+          boost: 1
+        })
+        pages.forEach(function(doc) {
+          this.add(doc)
+          articles.push(doc)
+        }, this)
+      })
+      console.log("Ready to search")
+    }
+  }
+
+  var articles = []
+  var index = undefined
+
+  function search() {
+    var query = document.getElementById('search-bar').value
+    var resultsDiv = document.getElementById('search-results')
+    var countDiv = document.getElementById('search-count')
+
+    if (query == '') {
+      resultsDiv.innerHTML = ''
+      countDiv.innerHTML = ''
+      return
+    }
+
+    var results = index.search(query)
+
+    countDiv.innerHTML = '{{ T "searchCountPrefix" }} ' + results.length + ' {{ T "searchCountSuffix" }}'
+
+    let resultList = ''
+
+    for (const n in results) {
+      const item = articles[results[n].ref]
+      resultList += '<li><a href="' + item.path + '">' + item.title + '</a><p>'
+
+      const text = item.content
+
+      const contextLimit = 80
+
+      if (text.includes(query)) {
+        const start = text.indexOf(query)
+        if (start > contextLimit)
+          resultList += '…'
+        resultList += text.substring(start - contextLimit, start)
+                    + '<b>' + query + '</b>' + text.substring(start + query.length, start + query.length + contextLimit)
+      } else {
+        resultList += text.substring(0, contextLimit * 2)
+      }
+      resultList += '…</p></li>'
+    }
+
+    resultsDiv.innerHTML = resultList
+  }
+
   if (localStorage.getItem('theme') == 'dark') {
     switchTheme('dark')
   }
+
   window.addEventListener('load', function() {
     var el = document.getElementById("active-element")
     //console.log(el)
@@ -46,6 +130,7 @@
       toggleSidebar()
     }*/
   })
+
   window.addEventListener('scroll', function() {
     var menu = document.getElementById('menu')
     if (window.scrollY < 120) {
@@ -56,8 +141,10 @@
       menu.classList.add('scrolled')
     }
   })
+
   window.addEventListener('keydown', function(e) {
     if (e.altKey) { return }
+    if (document.activeElement.tagName == 'INPUT') { return }
     if (e.key == 'ArrowLeft') {
       document.getElementById('prev-article').click()
     } else if (e.key == 'ArrowRight') {
diff --git a/themes/algorithmica/layouts/partials/search.html b/themes/algorithmica/layouts/partials/search.html
new file mode 100644
index 00000000..ee853dfa
--- /dev/null
+++ b/themes/algorithmica/layouts/partials/search.html
@@ -0,0 +1,6 @@
+    [six lines of markup not recoverable]
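The six lines of search.html are not recoverable (nor are the HTML fragments inside `search()`'s string literals above, which are shown as presumed). Judging by the element ids used in head.html and the `#search` rules added to style.sass, the partial is presumably along these lines, with a button — likely among the stripped markup in buttons.html — invoking `toggleSearch()`; attribute details are guesses:

    <div id="search">
        <input id="search-bar" type="text" placeholder='{{ T "search" }}' oninput="search()">
        <div id="search-count"></div>
        <ul id="search-results"></ul>
    </div>
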
diff --git a/themes/algorithmica/layouts/partials/sidebar.html b/themes/algorithmica/layouts/partials/sidebar.html
index 2276957a..652a1f1b 100644
--- a/themes/algorithmica/layouts/partials/sidebar.html
+++ b/themes/algorithmica/layouts/partials/sidebar.html
@@ -24,13 +24,13 @@
       {{ if isset .Params "part" }}
         {{.Params.Part}}
       {{ end }}
-      [removed markup not recoverable]
+      [added markup not recoverable]
       {{ .Title }}
       {{ if .IsSection }}
         {{ range .Pages }}
-          [removed markup not recoverable]
+          [added markup not recoverable]
           {{ .Title }}
         {{ end }}
diff --git a/themes/algorithmica/static/scripts/lunr.multi.min.js b/themes/algorithmica/static/scripts/lunr.multi.min.js
new file mode 100644
index 00000000..6f417304
--- /dev/null
+++ b/themes/algorithmica/static/scripts/lunr.multi.min.js
@@ -0,0 +1 @@
+[one minified line, not reproduced: a vendored lunr-languages bundle that appears to define lunr.multiLanguage, Russian language support (lunr.ru), and the lunr.stemmerSupport/trimmerSupport helpers]
\ No newline at end of file