- Modern vector index benchmark with embedding datasets
- Includes datasets for both in-distribution and out-of-distribution settings
- Includes the most comprehensive collection of state-of-the-art vector search algorithms
- Support for quantized datasets in both 8-bit integer and binary precision
- Support for GPU algorithms
- Support for HPC environments with Slurm
The current VIBE results can be viewed on our website:
https://vector-index-bench.github.io
The website also features several other tools and visualizations to explore the results.
The evaluation code and some algorithm implementations in VIBE are based on the ann-benchmarks project.
| Name | Type | n | d | Distance |
|---|---|---|---|---|
| agnews-mxbai-1024-euclidean | Text | 769,382 | 1024 | euclidean |
| arxiv-nomic-768-normalized | Text | 1,344,643 | 768 | any |
| gooaq-distilroberta-768-normalized | Text | 1,475,024 | 768 | any |
| imagenet-clip-512-normalized | Image | 1,281,167 | 512 | any |
| landmark-nomic-768-normalized | Image | 760,757 | 768 | any |
| yahoo-minilm-384-normalized | Text | 677,305 | 384 | any |
| celeba-resnet-2048-cosine | Image | 201,599 | 2048 | cosine |
| ccnews-nomic-768-normalized | Text | 495,328 | 768 | any |
| codesearchnet-jina-768-cosine | Code | 1,374,067 | 768 | cosine |
| glove-200-cosine | Word | 1,192,514 | 200 | cosine |
| landmark-dino-768-cosine | Image | 760,757 | 768 | cosine |
| simplewiki-openai-3072-normalized | Text | 260,372 | 3072 | any |
| coco-nomic-768-normalized | Text-to-Image | 282,360 | 768 | any |
| imagenet-align-640-normalized | Text-to-Image | 1,281,167 | 640 | any |
| laion-clip-512-normalized | Text-to-Image | 1,000,448 | 512 | any |
| yandex-200-cosine | Text-to-Image | 1,000,000 | 200 | cosine |
| yi-128-ip | Attention | 187,843 | 128 | IP |
| llama-128-ip | Attention | 256,921 | 128 | IP |
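
The `any` entries in the Distance column presumably mean that the embeddings are normalized to unit length, so that Euclidean distance, cosine similarity, and inner product all give the same neighbor ranking. That reading is an assumption and is not stated in the table. It rests on the identity ||x - y||² = 2 - 2⟨x, y⟩ for unit vectors, which the sketch below checks on random toy vectors in plain Python (the helper names are illustrative and not part of VIBE):

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def sq_euclidean(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by(score, query, corpus, reverse=False):
    """Return corpus indices sorted by score(query, item)."""
    return sorted(range(len(corpus)),
                  key=lambda i: score(query, corpus[i]),
                  reverse=reverse)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
corpus = [normalize([rng.gauss(0, 1) for _ in range(16)]) for _ in range(50)]
query = normalize([rng.gauss(0, 1) for _ in range(16)])

# Smallest distance first vs. largest inner product first.
by_l2 = rank_by(sq_euclidean, query, corpus)
by_ip = rank_by(dot, query, corpus, reverse=True)
print(by_l2[:5] == by_ip[:5])
```
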
| Method | Version |
|---|---|
| ANNOY | 1.17.3 |
| FALCONN++ | git+5fd3f17 |
| FlatNav | 0.1.2 |
| CAGRA | 25.4.0 |
| GGNN | 0.9 |
| GLASS | 1.0.5 |
| HNSW | 0.8.0 |
| IVF (Faiss) | 1.11.0 |
| IVF-PQ (Faiss) | 1.11.0 |
| LVQ (SVS) | 0.0.7 |
| LeanVec (SVS) | 0.0.7 |
| LoRANN | 0.2 |
| MLANN | git+40848e7 |
| MRPT | 2.0.1 |
| NGT-ONNG | git+83d5896 |
| NGT-QG | git+83d5896 |
| NSG | 1.11.0 |
| PUFFINN | git+fd86b0d |
| PyNNDescent | 0.5.13 |
| RoarGraph | git+f2b49b6 |
| ScaNN | 1.4.0 |
| SymphonyQG | git+32a0019 |
| Vamana (DiskANN) | 0.7.0 |
- Apptainer (or Singularity)
- Python 3.6+
Some algorithms may require that the CPU supports AVX-512 instructions. Most GPU algorithms assume that an NVIDIA GPU is available.
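The requirements above can be checked before starting a long build. The following is a minimal sketch, assuming a Linux host where CPU features are listed on the `flags` lines of `/proc/cpuinfo` (AVX-512 Foundation appears there as `avx512f`). The function names are hypothetical and not part of VIBE, and the check does not cover GPU availability.

```python
import shutil
import sys


def find_container_runtime():
    """Return 'apptainer' or 'singularity' if found on PATH, else None."""
    for name in ("apptainer", "singularity"):
        if shutil.which(name):
            return name
    return None


def has_avx512(cpuinfo_text):
    """True if a /proc/cpuinfo dump lists the avx512f flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return "avx512f" in line.split(":", 1)[1].split()
    return False


def python_ok(version_info=sys.version_info):
    """True if the interpreter is Python 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        cpuinfo = ""  # not Linux, or the file is unreadable
    print("container runtime:", find_container_runtime())
    print("Python 3.6+:", python_ok())
    print("AVX-512:", has_avx512(cpuinfo))
```

A missing AVX-512 flag only matters for the algorithms that need it, so a negative result here is a warning rather than a hard failure.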
Building all library images can be done using
./install.sh
The script can be used to either build images for all available libraries (./install.sh) or an image for a single library (e.g. ./install.sh --algorithm faiss).
Tip
install.sh takes an argument --build-dir that specifies the temporary build directory. For example, to speed up the build in a cluster environment, you can set the build directory to a location on an SSD while the project files are on a slower storage medium.
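As an illustration of combining the two documented options, the sketch below assembles an `install.sh` command line from Python. It assumes that `--build-dir` takes the directory as the following argument and that it can be combined with `--algorithm`. The path `/tmp/vibe-build` is a placeholder and `install_command` is a hypothetical helper, not part of VIBE.

```python
import shlex


def install_command(algorithm=None, build_dir=None):
    """Build the argument list for install.sh from the documented options."""
    cmd = ["./install.sh"]
    if algorithm is not None:
        cmd += ["--algorithm", algorithm]   # build a single library image
    if build_dir is not None:
        cmd += ["--build-dir", build_dir]   # temporary build directory
    return cmd


# Print a copy-pasteable command line (quoting any unusual characters).
cmd = install_command(algorithm="faiss", build_dir="/tmp/vibe-build")
print(" ".join(shlex.quote(arg) for arg in cmd))
# prints: ./install.sh --algorithm faiss --build-dir /tmp/vibe-build
```
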
Tip
See an example Slurm job for building the libraries using Slurm.
The benchmarks for a single dataset can be run using run.py. For example:
python3 run.py --dataset agnews-mxbai-1024-euclidean
Common options for run.py:
- --parallelism n: Use n processes for benchmarking.
- --algorithm algo: Run the benchmark for only algo.
- --count k: Run the benchmarks using k nearest neighbors (default 100).
- --gpu: Run the benchmark in GPU mode.
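
When benchmarking several datasets, the command lines can be generated rather than typed by hand. The sketch below uses only the options listed above. `run_command` is a hypothetical helper, the parallelism and count values are arbitrary, and the dataset names are taken from the dataset table.

```python
import shlex


def run_command(dataset, parallelism=None, algorithm=None, count=None, gpu=False):
    """Build the argument list for run.py from the documented options."""
    cmd = ["python3", "run.py", "--dataset", dataset]
    if parallelism is not None:
        cmd += ["--parallelism", str(parallelism)]
    if algorithm is not None:
        cmd += ["--algorithm", algorithm]
    if count is not None:
        cmd += ["--count", str(count)]
    if gpu:
        cmd.append("--gpu")
    return cmd


datasets = ["agnews-mxbai-1024-euclidean", "yi-128-ip"]
for name in datasets:
    cmd = run_command(name, parallelism=16, count=10)
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.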
The benchmark should take less than 24 hours to run for a given dataset when using a parallelism greater than 8. We recommend at least 16 GB of memory per benchmark process.
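These recommendations translate into simple arithmetic when sizing a job: a parallelism above 8 means at least 9 processes, and at 16 GB each that is at least 144 GB of memory. The sketch below works in both directions. The 16 GB default comes from the recommendation above, while the function names and the 256 GB node are illustrative.

```python
def memory_needed_gb(parallelism, per_process_gb=16):
    """Memory to request for a given number of benchmark processes."""
    return parallelism * per_process_gb


def max_parallelism(total_memory_gb, per_process_gb=16):
    """Largest parallelism that fits in the available memory."""
    return int(total_memory_gb // per_process_gb)


print(memory_needed_gb(9))    # prints: 144
print(max_parallelism(256))   # prints: 16
```
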
Tip
See an example Slurm job for running the benchmark using Slurm.
To plot the results, you must first build the plot.sif image:
singularity build plot.sif plot.def
The results can then be plotted with:
./plot.sh
The benchmark code downloads precomputed embedding datasets. However, the datasets can also be recreated from scratch, and it is also possible to create new datasets by modifying the datasets.py file.
Creating the datasets can be done using create_dataset.sh. It first requires that dataset.sif is built:
singularity build dataset.sif dataset.def
The VIBE_CACHE environment variable should be set to a cache directory with at least 200 GB of free space when creating image embeddings using the Landmark or ImageNet datasets. Datasets can then be created using the --dataset argument (the --nv argument specifies that an available GPU can be used):
export VIBE_CACHE=$LOCAL_SCRATCH
./create_dataset.sh "--bind $LOCAL_SCRATCH:$LOCAL_SCRATCH --nv" --dataset agnews-mxbai-1024-euclidean
Tip
See an example Slurm job for creating datasets using Slurm.
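Before creating the Landmark or ImageNet embeddings, it can be worth confirming that the cache directory has the recommended 200 GB free. The following is a minimal sketch using only the Python standard library. It interprets 200 GB as 200 × 10⁹ bytes, which is an assumption, and the helper names are hypothetical.

```python
import os
import shutil

REQUIRED_BYTES = 200 * 10**9  # 200 GB, assuming decimal gigabytes


def free_bytes(path):
    """Free space on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def cache_has_room(path, required=REQUIRED_BYTES):
    """True if the filesystem holding path has at least `required` bytes free."""
    return free_bytes(path) >= required


cache_dir = os.environ.get("VIBE_CACHE")
if cache_dir is None:
    print("VIBE_CACHE is not set")
elif not os.path.isdir(cache_dir):
    print("VIBE_CACHE does not point to a directory:", cache_dir)
else:
    free_gb = free_bytes(cache_dir) / 10**9
    status = "ok" if cache_has_room(cache_dir) else "too small"
    print(f"{cache_dir}: {free_gb:.0f} GB free ({status})")
```
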
VIBE is available under the MIT License (see LICENSE). The pyyaml library is also distributed in the vibe folder under the MIT License.
