VIBE

Vector Index Benchmark for Embeddings (VIBE) is an extensible benchmark for approximate nearest neighbor search methods, or vector indexes, using modern embedding datasets.

Modern vector index benchmark with embedding datasets
Includes datasets for both in-distribution and out-of-distribution settings
Includes the most comprehensive collection of state-of-the-art vector search algorithms
Support for quantized datasets in both 8-bit integer and binary precision
Support for GPU algorithms
Support for HPC environments with Slurm
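
As a rough illustration of what the two quantized precisions mean, the sketch below converts a float embedding to 8-bit integers and to a packed binary code. The symmetric max-abs scaling and the sign-based binarization are assumptions chosen for illustration and are not necessarily the schemes VIBE uses to produce its quantized datasets. The function names are hypothetical.

```python
def quantize_int8(vector):
    """Symmetric scalar quantization to the int8 range [-127, 127] (illustrative)."""
    max_abs = max(abs(x) for x in vector)
    if max_abs == 0.0:
        return [0] * len(vector), 1.0
    scale = max_abs / 127.0
    return [int(round(x / scale)) for x in vector], scale


def quantize_binary(vector):
    """Keep one bit per dimension (1 if the value is positive) and pack into bytes."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            # Most significant bit first within each byte.
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


embedding = [0.5, -0.25, 0.0, 1.0, -1.0, 0.125, 0.75, -0.5]
codes, scale = quantize_int8(embedding)
bits = quantize_binary(embedding)
print(codes, scale)
print(bits.hex())
```

The 8-bit form stores one byte per dimension instead of four, and the binary form stores one bit per dimension, which is why both are attractive for large embedding collections.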

Results

The current VIBE results can be viewed on our website:

https://vector-index-bench.github.io

The website also features several other tools and visualizations to explore the results.

Credits

The evaluation code and some algorithm implementations in VIBE are based on the ann-benchmarks project.

Datasets

Name	Type	n	d	Distance
agnews-mxbai-1024-euclidean	Text	769,382	1024	euclidean
arxiv-nomic-768-normalized	Text	1,344,643	768	any
gooaq-distilroberta-768-normalized	Text	1,475,024	768	any
imagenet-clip-512-normalized	Image	1,281,167	512	any
landmark-nomic-768-normalized	Image	760,757	768	any
yahoo-minilm-384-normalized	Text	677,305	384	any
celeba-resnet-2048-cosine	Image	201,599	2048	cosine
ccnews-nomic-768-normalized	Text	495,328	768	any
codesearchnet-jina-768-cosine	Code	1,374,067	768	cosine
glove-200-cosine	Word	1,192,514	200	cosine
landmark-dino-768-cosine	Image	760,757	768	cosine
simplewiki-openai-3072-normalized	Text	260,372	3072	any
coco-nomic-768-normalized	Text-to-Image	282,360	768	any
imagenet-align-640-normalized	Text-to-Image	1,281,167	640	any
laion-clip-512-normalized	Text-to-Image	1,000,448	512	any
yandex-200-cosine	Text-to-Image	1,000,000	200	cosine
yi-128-ip	Attention	187,843	128	IP
llama-128-ip	Attention	256,921	128	IP
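
The datasets whose names end in -normalized list their distance as "any". A plausible reading (an assumption here, not stated in the table) is that their vectors have unit length, in which case Euclidean distance, cosine similarity, and inner product all rank neighbors identically, because ||a - b||^2 = 2 - 2(a · b) for unit vectors. The sketch below checks this on random toy data. The helper names are hypothetical and the data is not taken from any VIBE dataset.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)
dim, n = 16, 50
corpus = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(n)]
query = normalize([random.gauss(0, 1) for _ in range(dim)])

# Rank the corpus by ascending Euclidean distance and by descending inner product.
by_euclidean = sorted(range(n), key=lambda i: squared_euclidean(query, corpus[i]))
by_inner_product = sorted(range(n), key=lambda i: -dot(query, corpus[i]))

# For unit vectors the cosine similarity equals the inner product,
# so comparing these two orderings covers all three measures.
print(by_euclidean[:5] == by_inner_product[:5])
```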

Algorithms

Method	Version
ANNOY	1.17.3
FALCONN++	git+5fd3f17
FlatNav	0.1.2
CAGRA	25.4.0
GGNN	0.9
GLASS	1.0.5
HNSW	0.8.0
IVF (Faiss)	1.11.0
IVF-PQ (Faiss)	1.11.0
LVQ (SVS)	0.0.7
LeanVec (SVS)	0.0.7
LoRANN	0.2
MLANN	git+40848e7
MRPT	2.0.1
NGT-ONNG	git+83d5896
NGT-QG	git+83d5896
NSG	1.11.0
PUFFINN	git+fd86b0d
PyNNDescent	0.5.13
RoarGraph	git+f2b49b6
ScaNN	1.4.0
SymphonyQG	git+32a0019
Vamana (DiskANN)	0.7.0

Getting started

Requirements

Apptainer (or Singularity)
Python 3.6+

Some algorithms may require that the CPU supports AVX-512 instructions. Most GPU algorithms assume that an NVIDIA GPU is available.

Building library images

All library images can be built with:

./install.sh

The script can be used to either build images for all available libraries (./install.sh) or an image for a single library (e.g. ./install.sh --algorithm faiss).

Tip

install.sh takes an argument --build-dir that specifies the temporary build directory. For example, to speed up the build in a cluster environment, you can set the build directory to a location on an SSD while the project files are on a slower storage medium.
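
When several images are built from a script, the install.sh invocation can be assembled programmatically. The sketch below is a minimal example that uses only the two flags described above (--algorithm and --build-dir). The helper name and the /tmp/vibe-build path are hypothetical, and the commands are printed rather than executed.

```python
import shlex


def build_install_command(algorithm=None, build_dir=None):
    """Return the argv list for install.sh using only the flags described above."""
    command = ["./install.sh"]
    if algorithm is not None:
        command += ["--algorithm", algorithm]
    if build_dir is not None:
        command += ["--build-dir", build_dir]
    return command


# Hypothetical selection of libraries and a hypothetical SSD-backed build directory.
libraries = ["faiss"]
commands = [build_install_command(name, build_dir="/tmp/vibe-build") for name in libraries]

for command in commands:
    # Print instead of executing; pass the list to subprocess.run(command, check=True) to run it.
    print(shlex.join(command))
```

Passing the command as a list avoids shell quoting problems if a build directory contains spaces.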

Tip

See an example Slurm job for building the libraries using Slurm.

Running benchmarks

The benchmarks for a single dataset can be run using run.py. For example:

python3 run.py --dataset agnews-mxbai-1024-euclidean

Common options for run.py:

--parallelism n: Use n processes for benchmarking.
--algorithm algo: Run the benchmark for only algo.
--count k: Run the benchmarks using k nearest neighbors (default 100).
--gpu: Run the benchmark in GPU mode.

With --parallelism set above 8, the benchmark for a single dataset should finish in under 24 hours. We recommend at least 16 GB of memory per process.
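
To run several datasets in sequence, the run.py command line can be generated from the options listed above, and the memory recommendation can be turned into a total for a given level of parallelism. The following is a sketch under those assumptions: the helper names are hypothetical, only the documented flags are used, and the commands are printed rather than executed.

```python
import shlex

MEMORY_PER_PROCESS_GB = 16  # per-process recommendation quoted above


def build_run_command(dataset, parallelism=None, algorithm=None, count=None, gpu=False):
    """Assemble a run.py invocation from the options listed above."""
    command = ["python3", "run.py", "--dataset", dataset]
    if parallelism is not None:
        command += ["--parallelism", str(parallelism)]
    if algorithm is not None:
        command += ["--algorithm", algorithm]
    if count is not None:
        command += ["--count", str(count)]
    if gpu:
        command.append("--gpu")
    return command


def recommended_memory_gb(parallelism):
    """Lower bound on total memory implied by the per-process recommendation."""
    return parallelism * MEMORY_PER_PROCESS_GB


# Dataset names taken from the Datasets table.
datasets = ["agnews-mxbai-1024-euclidean", "glove-200-cosine"]
parallelism = 16
for dataset in datasets:
    print(shlex.join(build_run_command(dataset, parallelism=parallelism, count=10)))
print(f"Plan for at least {recommended_memory_gb(parallelism)} GB of memory")
```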

Tip

See an example Slurm job for running the benchmark using Slurm.

Plotting results

To plot the results, you must first build the plot.sif image:

singularity build plot.sif plot.def

The results can then be plotted with:

./plot.sh

Creating datasets from scratch

The benchmark code downloads precomputed embedding datasets. However, the datasets can also be recreated from scratch, and it is also possible to create new datasets by modifying the datasets.py file.

Creating the datasets can be done using create_dataset.sh. It first requires that dataset.sif is built:

singularity build dataset.sif dataset.def

The VIBE_CACHE environment variable should be set to a cache directory with at least 200 GB of free space when creating image embeddings using the Landmark or ImageNet datasets. Datasets can then be created using the --dataset argument (the --nv argument specifies that an available GPU can be used):

export VIBE_CACHE=$LOCAL_SCRATCH
./create_dataset.sh "--bind $LOCAL_SCRATCH:$LOCAL_SCRATCH --nv" --dataset agnews-mxbai-1024-euclidean
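
Before starting a long embedding job, it can help to confirm that the cache directory actually has the free space mentioned above. The sketch below is a minimal check using Python's standard library. The helper names and the fallback to the current directory are assumptions for illustration and are not part of VIBE.

```python
import os
import shutil

REQUIRED_FREE_GB = 200  # figure quoted above for the Landmark and ImageNet image embeddings


def free_space_gb(path):
    """Free space on the filesystem containing path, in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def cache_is_large_enough(path, required_gb=REQUIRED_FREE_GB):
    """True if the filesystem holding path has at least required_gb free."""
    return free_space_gb(path) >= required_gb


# Fall back to the current directory if VIBE_CACHE is not set (illustrative default only).
cache_dir = os.environ.get("VIBE_CACHE") or "."
if not os.path.isdir(cache_dir):
    print(f"{cache_dir}: directory does not exist")
elif cache_is_large_enough(cache_dir):
    print(f"{cache_dir}: enough free space for the image datasets")
else:
    print(f"{cache_dir}: less than {REQUIRED_FREE_GB} GB free")
```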

Tip

See an example Slurm job for creating datasets using Slurm.

License

VIBE is available under the MIT License (see LICENSE). The pyyaml library is also distributed in the vibe folder under the MIT License.
