1 change: 1 addition & 0 deletions src/literals.rs
@@ -15,6 +15,7 @@ use memchr::{memchr, memchr2, memchr3};
use syntax;

use freqs::BYTE_FREQUENCIES;

use simd_accel::teddy128::Teddy;

/// A prefix extracted from a compiled regular expression.
142 changes: 92 additions & 50 deletions src/simd_accel/teddy128.rs
@@ -1,11 +1,12 @@
/*!
Teddy is a simd accelerated multiple substring matching algorithm. The name
and the core ideas in the algorithm were learned from the Hyperscan[1]
and the core ideas in the algorithm were learned from the [Hyperscan][1_u]
project.


Background
----------

The key idea of Teddy is to do *packed* substring matching. In the literature,
packed substring matching is the idea of examining multiple bytes in a haystack
at a time to detect matches. Implementations of, for example, memchr (which
@@ -15,20 +16,20 @@ extended to substring matching. The PCMPESTRI instruction (and its relatives),
for example, implements substring matching in hardware. It is, however, limited
to substrings of length 16 bytes or fewer, but this restriction is fine in a
regex engine, since we rarely care about the performance difference between
searching for a 16 byte literal and a 16 + N literal---16 is already long
searching for a 16 byte literal and a 16 + N literal; 16 is already long
enough. The key downside of the PCMPESTRI instruction, on current (2016) CPUs
at least, is its latency and throughput. As a result, it is often faster to do
substring search with a Boyer-Moore variant and a well placed memchr to quickly
skip through the haystack.

There are fewer results from the literature on packed substring matching,
and even fewer for packed multiple substring matching. Ben-Kiki et al.[2]
and even fewer for packed multiple substring matching. Ben-Kiki et al. [2]
describes use of PCMPESTRI for substring matching, but is mostly theoretical
and hand-waves performance. There is other theoretical work done by Bille[3]
and hand-waves performance. There is other theoretical work done by Bille [3]
as well.

The rest of the work in the field, as far as I'm aware, is by Faro and Kulekci
and is generally focused on multiple pattern search. Their first paper[4a]
and is generally focused on multiple pattern search. Their first paper [4a]
introduces the concept of a fingerprint, which is computed for every block of
N bytes in every pattern. The haystack is then scanned N bytes at a time and
a fingerprint is computed in the same way it was computed for blocks in the
@@ -44,13 +45,13 @@ presumably because of how the algorithm uses certain SIMD instructions. This
essentially makes it useless for general purpose regex matching, where a small
number of short patterns is far more likely.

Faro and Kulekci published another paper[4b] that is conceptually very similar
Faro and Kulekci published another paper [4b] that is conceptually very similar
to [4a]. The key difference is that it uses the CRC32 instruction (introduced
as part of SSE 4.2) to compute fingerprint values. This also enables the
algorithm to work effectively on substrings as short at 7 bytes with 4 byte
algorithm to work effectively on substrings as short as 7 bytes with 4 byte
windows. 7 bytes is unfortunately still too long. The window could be
technically shrunk to 2 bytes, thereby reducing minimum length to 3, but the
small window size ends up negating most performance benefits---and it's likely
small window size ends up negating most performance benefits, and it's likely
the common case in a general purpose regex engine.

Faro and Kulekci also published [4c] that appears to be intended as a
@@ -59,18 +60,19 @@ the high throughput/latency time of PCMPESTRI and therefore chooses other SIMD
instructions that are faster. While this approach works for short substrings,
I personally couldn't see a way to generalize it to multiple substring search.

Faro and Kulekci have another paper[4d] that I haven't been able to read
Faro and Kulekci have another paper [4d] that I haven't been able to read
because it is behind a paywall.


Teddy
-----

Finally, we get to Teddy. If the above literature review is complete, then it
appears that Teddy is a novel algorithm. More than that, in my experience, it
completely blows away the competition for short substrings, which is exactly
what we want in a general purpose regex engine. Again, the algorithm appears
to be developed by the authors of Hyperscan[1]. Hyperscan was open sourced late
2015, and no earlier history could be found. Therefore, tracking the exact
to be developed by the authors of [Hyperscan][1_u]. Hyperscan was open sourced
late 2015, and no earlier history could be found. Therefore, tracking the exact
provenance of the algorithm with respect to the published literature seems
difficult.

@@ -142,8 +144,8 @@ How do we perform lookup though? It turns out that SSSE3 introduced a very cool
instruction called PSHUFB. The instruction takes two SIMD vectors, `A` and `B`,
and returns a third vector `C`. All vectors are treated as 16 8-bit integers.
`C` is formed by `C[i] = A[B[i]]`. (This is a bit of a simplification, but true
for the purposes of this algorithm. For full details, see Intel's Intrinsics
Guide[5].) This essentially lets us use the values in `B` to lookup values in
for the purposes of this algorithm. For full details, see [Intel's Intrinsics
Guide][5_u].) This essentially lets us use the values in `B` to look up values in
`A`.
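
As a rough illustration, the simplified PSHUFB behaviour described above can be
modelled in scalar Rust. This is only a sketch: `pshufb_model` is a hypothetical
name that does not appear in this module, and plain arrays stand in for the
SIMD registers that the real instruction operates on.

```rust
/// Scalar model of PSHUFB on 16 lanes of 8 bits: `C[i] = A[B[i]]`.
///
/// Hypothetical helper for illustration only; it is not part of this module.
fn pshufb_model(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut c = [0u8; 16];
    for i in 0..16 {
        c[i] = if b[i] & 0x80 != 0 {
            // The part the text simplifies away: when the high bit of the
            // index byte is set, the output lane is zeroed.
            0
        } else {
            // Otherwise only the low four bits of the index select a lane.
            a[(b[i] & 0x0F) as usize]
        };
    }
    c
}
```

Because only the low four bits of each index are used, a haystack byte cannot
index a 16 lane table directly. It first has to be reduced to a 4-bit value.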

If we could somehow cause `B` to contain our 16 byte block from the haystack,
@@ -241,6 +243,7 @@ haystack.

Implementation notes
--------------------

The problem with the algorithm as described above is that it uses a single byte
for a fingerprint. This will work well if the fingerprints are rare in the
haystack (e.g., capital letters or special characters in normal English text),
@@ -268,15 +271,52 @@ The way to extend it is:

The implementation below is commented to fill in the nitty gritty details.
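
To make the alignment step for 2 byte fingerprints concrete, here is a scalar
sketch of the lane shift that the comments in `find2` describe as
`(prev0 << 15) | (res0 >> 1)`. The names `shift_in_prev` and `candidates2` are
hypothetical, and plain arrays stand in for `u8x16`. The actual implementation
uses a byte shuffle followed by a lane replacement instead.

```rust
/// Moves every lane of `res0` one position to the right and fills lane 0
/// with the last lane of `prev0` (the first-byte result from the previous
/// block). Hypothetical helper for illustration only.
fn shift_in_prev(prev0: [u8; 16], res0: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0] = prev0[15];
    out[1..].copy_from_slice(&res0[..15]);
    out
}

/// ANDs the shifted first-byte bitsets with the second-byte bitsets. A
/// non-zero lane `i` means that some bucket matched the first fingerprint
/// byte at position `i - 1` and the second fingerprint byte at position `i`.
fn candidates2(prev0: [u8; 16], res0: [u8; 16], res1: [u8; 16]) -> [u8; 16] {
    let shifted = shift_in_prev(prev0, res0);
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = shifted[i] & res1[i];
    }
    out
}
```

Starting with `prev0` set to all ones, as `find2` does, means lane 0 of the
first block is decided by the second-byte bitset alone. Any such candidate
still has to pass verification.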

[1] - https://github.com/01org/hyperscan
[2a] - http://drops.dagstuhl.de/opus/volltexte/2011/3355/pdf/37.pdf
[2b] - http://www.cs.haifa.ac.il/~oren/Publications/bpsm.pdf
[3] - http://www.sciencedirect.com/science/article/pii/S1570866710000353
[4a] - http://www.dmi.unict.it/~faro/papers/conference/faro32.pdf
[4b] - https://pdfs.semanticscholar.org/fed7/ca62dc469314f3552017d0da7ebd669d4649.pdf
[4c] - http://arxiv.org/pdf/1209.6449.pdf
[4d] - http://www.sciencedirect.com/science/article/pii/S1570866714000471
[5] - https://software.intel.com/sites/landingpage/IntrinsicsGuide
References
----------

- **[1]** [Hyperscan on GitHub](https://github.com/01org/hyperscan),
[webpage](https://01.org/hyperscan)
- **[2a]** Ben-Kiki, O., Bille, P., Breslauer, D., Gasieniec, L., Grossi, R.,
& Weimann, O. (2011).
_Optimal packed string matching_.
In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 13).
Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
DOI: 10.4230/LIPIcs.FSTTCS.2011.423.
[PDF](http://drops.dagstuhl.de/opus/volltexte/2011/3355/pdf/37.pdf).
- **[2b]** Ben-Kiki, O., Bille, P., Breslauer, D., Ga̧sieniec, L., Grossi, R.,
& Weimann, O. (2014).
_Towards optimal packed string matching_.
Theoretical Computer Science, 525, 111-129.
DOI: 10.1016/j.tcs.2013.06.013.
[PDF](http://www.cs.haifa.ac.il/~oren/Publications/bpsm.pdf).
- **[3]** Bille, P. (2011).
_Fast searching in packed strings_.
Journal of Discrete Algorithms, 9(1), 49-56.
DOI: 10.1016/j.jda.2010.09.003.
[PDF](http://www.sciencedirect.com/science/article/pii/S1570866710000353).
- **[4a]** Faro, S., & Külekci, M. O. (2012, October).
_Fast multiple string matching using streaming SIMD extensions technology_.
In String Processing and Information Retrieval (pp. 217-228).
Springer Berlin Heidelberg.
DOI: 10.1007/978-3-642-34109-0_23.
[PDF](http://www.dmi.unict.it/~faro/papers/conference/faro32.pdf).
- **[4b]** Faro, S., & Külekci, M. O. (2013, September).
_Towards a Very Fast Multiple String Matching Algorithm for Short Patterns_.
In Stringology (pp. 78-91).
[PDF](http://www.dmi.unict.it/~faro/papers/conference/faro36.pdf).
- **[4c]** Faro, S., & Külekci, M. O. (2013, January).
_Fast packed string matching for short patterns_.
In Proceedings of the Meeting on Algorithm Engineering & Experiments
(pp. 113-121).
Society for Industrial and Applied Mathematics.
[PDF](http://arxiv.org/pdf/1209.6449.pdf).
- **[4d]** Faro, S., & Külekci, M. O. (2014).
_Fast and flexible packed string matching_.
Journal of Discrete Algorithms, 28, 61-72.
DOI: 10.1016/j.jda.2014.07.003.

[1_u]: https://github.com/01org/hyperscan
[5_u]: https://software.intel.com/sites/landingpage/IntrinsicsGuide
*/

// TODO: Extend this to use AVX2 instructions.
@@ -306,7 +346,7 @@ pub struct Match {
pub pat: usize,
/// The start byte offset of the match.
pub start: usize,
/// The end byte offset of the match. This is always start + pat.len().
/// The end byte offset of the match. This is always `start + pat.len()`.
pub end: usize,
}

@@ -325,7 +365,7 @@ pub struct Teddy {
}

/// A list of masks. This has length equal to the length of the fingerprint.
/// The length of the fingerprint is always `max(3, len(smallest substring))`.
/// The length of the fingerprint is always `min(3, len(smallest_substring))`.
#[derive(Debug, Clone)]
struct Masks(Vec<Mask>);

@@ -339,9 +379,9 @@ struct Mask {
}

impl Teddy {
/// Create a new Teddy multi substring matcher.
/// Create a new `Teddy` multi substring matcher.
///
/// If a Teddy matcher could not be created (e.g., `pats` is empty or has
/// If a `Teddy` matcher could not be created (e.g., `pats` is empty or has
/// an empty substring), then `None` is returned.
pub fn new(pats: &syntax::Literals) -> Option<Teddy> {
let pats: Vec<_> = pats.literals().iter().map(|p|p.to_vec()).collect();
@@ -369,7 +409,7 @@ impl Teddy {
})
}

/// Returns all of the substrings matched by this Teddy.
/// Returns all of the substrings matched by this `Teddy`.
pub fn patterns(&self) -> &[Vec<u8>] {
&self.pats
}
@@ -384,7 +424,7 @@ impl Teddy {
self.pats.iter().fold(0, |a, b| a + b.len())
}

/// Searches `haystack` for the substrings in this Teddy. If a match was
/// Searches `haystack` for the substrings in this `Teddy`. If a match was
/// found, then it is returned. Otherwise, `None` is returned.
pub fn find(&self, haystack: &[u8]) -> Option<Match> {
// If our haystack is smaller than the block size, then fall back to
@@ -403,7 +443,7 @@ impl Teddy {
}
}

/// find1 is used when there is only 1 mask. This is the easy case and is
/// `find1` is used when there is only 1 mask. This is the easy case and is
/// pretty much as described in the module documentation.
#[inline(always)]
fn find1(&self, haystack: &[u8]) -> Option<Match> {
@@ -413,7 +453,7 @@ impl Teddy {
debug_assert!(len >= BLOCK_SIZE);
while pos <= len - BLOCK_SIZE {
let h = unsafe { u8x16::load_unchecked(haystack, pos) };
// N.B. res0 is our `C` in the module documentation.
// N.B. `res0` is our `C` in the module documentation.
let res0 = self.masks.members1(h);
// Only do expensive verification if there are any non-zero bits.
if res0.ne(zero).any() {
@@ -426,7 +466,7 @@ impl Teddy {
self.slow(haystack, pos)
}

/// find2 is used when there are 2 masks, e.g., the fingerprint is 2 bytes
/// `find2` is used when there are 2 masks, e.g., the fingerprint is 2 bytes
/// long.
#[inline(always)]
fn find2(&self, haystack: &[u8]) -> Option<Match> {
@@ -440,12 +480,12 @@ impl Teddy {
);
let zero = u8x16::splat(0);
let len = haystack.len();
// The previous value of C (from the module documentation) for the
// The previous value of `C` (from the module documentation) for the
// *first* byte in the fingerprint. On subsequent iterations, we take
// the last bitset from the previous C and insert it into the first
// position of the current C, shifting all other bitsets to the right
// one lane. This causes C for the first byte to line up with C for the
// second byte, so that they can be AND'd together.
// the last bitset from the previous `C` and insert it into the first
// position of the current `C`, shifting all other bitsets to the right
// one lane. This causes `C` for the first byte to line up with `C` for
// the second byte, so that they can be `AND`'d together.
let mut prev0 = u8x16::splat(0xFF);
let mut pos = 1;
debug_assert!(len >= BLOCK_SIZE);
@@ -455,17 +495,19 @@ impl Teddy {

// The next three lines are essentially equivalent to
//
// (prev0 << 15) | (res0 >> 1)
// ```rust,ignore
// (prev0 << 15) | (res0 >> 1)
// ```
//
// ... if SIMD vectors could shift across lanes. There is the
// PALIGNR instruction, but apparently LLVM doesn't expose it as
// `PALIGNR` instruction, but apparently LLVM doesn't expose it as
// a proper intrinsic. Thankfully, it appears the following
// sequence does indeed compile down to a PALIGNR.
// sequence does indeed compile down to a `PALIGNR`.
let prev0byte0 = prev0.extract(15);
let res0shiftr8 = res0.shuffle_bytes(res0shuffle);
let res0prev0 = res0shiftr8.replace(0, prev0byte0);

// AND's our C values together.
// `AND`'s our `C` values together.
let res = res0prev0 & res1;
prev0 = res0;
if res.ne(zero).any() {
@@ -482,12 +524,12 @@ impl Teddy {
self.slow(haystack, pos.checked_sub(1).unwrap())
}

/// find3 is used when there are 3 masks, e.g., the fingerprint is 3 bytes
/// `find3` is used when there are 3 masks, e.g., the fingerprint is 3 bytes
/// long.
///
/// N.B. This is a straight-forward extrapolation of find2. The only
/// difference is that we need to keep track of two previous values of
/// C, since we now need to align for three bytes.
/// N.B. This is a straightforward extrapolation of `find2`. The only
/// difference is that we need to keep track of two previous values of `C`,
/// since we now need to align for three bytes.
#[inline(always)]
fn find3(&self, haystack: &[u8]) -> Option<Match> {
let zero = u8x16::splat(0);
@@ -571,7 +613,7 @@ impl Teddy {
///
/// If a match exists, it returns the first one.
///
/// offset is an additional byte offset to add to the position before
/// `offset` is an additional byte offset to add to the position before
/// substring match verification.
#[inline(always)]
fn verify_64(
@@ -673,17 +715,17 @@ impl Masks {
}

/// Adds the given pattern to the given bucket. The bucket should be a
/// power of 2 <= 2^7.
/// power of 2 that is `<= 2^7`.
fn add(&mut self, bucket: u8, pat: &[u8]) {
for (i, mask) in self.0.iter_mut().enumerate() {
mask.add(bucket, pat[i]);
}
}

/// Finds the fingerprints that are in the given haystack block. i.e., this
/// returns C as described in the module documentation.
/// returns `C` as described in the module documentation.
///
/// More specifically, for i in 0..16 and j in 0..8, C[i][j] == 1 if and
/// More specifically, for `i in 0..16` and `j in 0..8`, `C[i][j] == 1` if and
/// only if `haystack_block[i]` corresponds to a fingerprint that is part
/// of a pattern in bucket `j`.
#[inline(always)]
@@ -710,8 +752,8 @@ impl Masks {
(res0, res1)
}

/// Like members1, but computes C for the first, second and third bytes in
/// the fingerprint.
/// Like `members1`, but computes `C` for the first, second and third bytes
/// in the fingerprint.
#[inline(always)]
fn members3(&self, haystack_block: u8x16) -> (u8x16, u8x16, u8x16) {
let masklo = u8x16::splat(0xF);