aduston/textsim

Benchmarks on a 2.3 GHz Intel Core i7 running OSX:

BenchmarkTokenizeRabin-8            	   20000	     64687 ns/op
BenchmarkTokenizeFnv-8              	   30000	     46575 ns/op
BenchmarkTokenizeSpooky-8           	   20000	     62806 ns/op
BenchmarkConvertToShinglesRabin-8   	  100000	     21595 ns/op
BenchmarkConvertToShinglesFnv-8     	   30000	     43278 ns/op
BenchmarkPermutationFnv-8           	  100000	     19496 ns/op
BenchmarkPermutationLinear-8        	   20000	     65613 ns/op

Supposing that the text in this benchmark is representative, a permutation takes about 20 microseconds (the BenchmarkPermutationFnv figure), which works out to about 2 milliseconds per document, assuming the document signature is calculated using 100 permutations. At that rate, calculating minhashes for one million pages takes about 2,000 seconds, or roughly 33 minutes.

Playing around with text similarity using minhash in golang

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages