Magesh Ravi

Artist | Techie | Entrepreneur

Recommended default values for the metrics

Metric           Value
m                16
ef_construction  200
ef_search        100
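
Here is a rough sketch of how these values could be applied. It assumes PostgreSQL with the pgvector extension, which these posts do not name explicitly. The table name `emails`, the column name `embedding` and the cosine operator class are placeholders of mine, so swap in whatever your schema uses. The helpers only build SQL text; they do not connect to a database.

```python
# Sketch only: builds SQL strings, nothing is executed.
# Assumes PostgreSQL + pgvector; `emails` and `embedding` are hypothetical names.

def hnsw_index_sql(table, column, m=16, ef_construction=200,
                   opclass="vector_cosine_ops"):
    """Return a pgvector CREATE INDEX statement for an HNSW index."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ef_search_sql(ef_search=100):
    """Return the statement that sets the search breadth for the session."""
    return f"SET hnsw.ef_search = {ef_search};"


if __name__ == "__main__":
    print(hnsw_index_sql("emails", "embedding"))
    print(ef_search_sql())
```

m and ef_construction are fixed when the index is built, while ef_search is a session setting that can be changed per query without rebuilding.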

Search Time Breadth ef_search: The size of the dynamic candidate list during search (how many nodes are explored to find the nearest neighbours).

This metric affects the query time, not the indexing time.
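
To get a feel for what the candidate list does, here is a toy best-first search over a single graph layer. It is a simplified sketch of the idea, not the code any particular library runs. The four 1-D points and their edges are made up for illustration, and `ef` plays the role of ef_search.

```python
import heapq
import math


def search_layer(graph, vectors, query, entry, ef):
    """Best-first search over one graph layer, keeping at most `ef` results."""
    visited = {entry}
    d0 = math.dist(query, vectors[entry])
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap (negated): farthest kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # closest candidate is worse than every kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(query, vectors[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest result
    return sorted((-d, n) for d, n in results)


# Toy 1-D data (hypothetical): node 3 is nearest to the query,
# but it is only reachable through node 2, which looks unpromising.
vectors = {0: (0.0,), 1: (6.0,), 2: (3.0,), 3: (9.0,)}
graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
query = (10.0,)

narrow = search_layer(graph, vectors, query, entry=0, ef=1)
wide = search_layer(graph, vectors, query, entry=0, ef=2)
print("ef=1:", narrow)
print("ef=2:", wide)
```

With ef=1 the search settles on node 1 and never looks past node 2. With ef=2 it keeps node 2 around long enough to reach node 3, the true nearest neighbour, at the price of more distance computations.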

Construction Time Search Width ef_construction: The number of candidates considered while inserting a node into the graph.

Higher ef_construction means better selection of neighbours, at the cost of longer indexing time.

Maximum Connections m: The maximum number of bi-directional edges permitted for each node in the graph.

A higher m means a denser graph and hence better recall, at the cost of more memory and slower indexing.
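
A back-of-the-envelope way to see the memory cost of m. This is my own rough arithmetic, and it rests on conventions from the original HNSW design that a given implementation may not follow exactly: the bottom layer allows up to 2*m links per node, upper layers allow up to m, and a node reaches upper layer k with probability m**-k.

```python
def expected_link_slots_per_node(m):
    """Upper bound on neighbour slots one node may use, averaged over nodes.

    Assumptions (not from the posts): layer 0 allows up to 2*m links,
    upper layers allow up to m, and a node appears on upper layer k
    with probability m**-k, so it sits on 1/(m-1) upper layers on average.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    return 2 * m + m / (m - 1)


for m in (8, 16, 32):
    print(m, round(expected_link_slots_per_node(m), 2))
```

Under these assumptions the link storage grows roughly linearly with m, and almost all of it sits in the bottom layer.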

Next, choosing a value for each of:

  1. Maximum connections m
  2. Construction Time Search Width ef_construction and
  3. Search Time Breadth ef_search

I created vector embeddings for emails from a real-life civil dispute case and sampled a few vectors using DBeaver. Many of the values are close to zero, so I'm assuming the dataset is sparse and going ahead with an HNSW index.

Dense vector: Most (or all) dimensions have non-zero values.

Example: [0.12, -0.93, 0.55, 0.08, 0.23]

Sparse vector: Most dimensions are zero (or near zero).

Example: [0, 0, 0.01, 0, 0, 0.2, 0, 0]

How do I know if my vector dataset is dense or sparse?

A quick way is to sample a few vectors and inspect their values.
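
Eyeballing works, but the same check can be put into a few lines of Python. This is a minimal sketch. The function name and the `tol` threshold for "near zero" are my own choices, and the two samples are the example vectors from the dense/sparse definitions above.

```python
def near_zero_fraction(vectors, tol=1e-3):
    """Fraction of all components whose absolute value is at most `tol`."""
    total = sum(len(v) for v in vectors)
    if total == 0:
        raise ValueError("no values to inspect")
    near_zero = sum(1 for v in vectors for x in v if abs(x) <= tol)
    return near_zero / total


dense_sample = [[0.12, -0.93, 0.55, 0.08, 0.23]]
sparse_sample = [[0, 0, 0.01, 0, 0, 0.2, 0, 0]]

print(near_zero_fraction(dense_sample))
print(near_zero_fraction(sparse_sample))
print(near_zero_fraction(sparse_sample, tol=0.05))
```

A fraction close to 1 suggests a sparse dataset and one close to 0 a dense one. Where to draw the line, and what counts as "near zero", is a judgement call.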

Another aspect to consider is the size and density of the vector dataset.

  • IVFFlat is better for large, dense datasets.
  • HNSW is better where initial data is sparse or data accumulates gradually (our use case).

HNSW creates a multi-layered graph for the vectors. A search query navigates the layers (or zoom levels) to find the nearest neighbours. Compared with IVFFlat, this requires longer indexing times and more memory, but it gives a better speed-recall trade-off.
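
To picture the layers, here is a small simulation of how nodes might be spread across them. It assumes the level rule from the original HNSW design, floor(-ln(U) / ln(m)) with U uniform in (0, 1]. A specific implementation may differ, and the seed and sample size are arbitrary.

```python
import math
import random
from collections import Counter


def random_level(m, rng):
    """Draw the top layer for a new node: floor(-ln(U) / ln(m)), U in (0, 1]."""
    u = 1.0 - rng.random()          # random() is in [0, 1), so u is in (0, 1]
    return int(-math.log(u) / math.log(m))


rng = random.Random(42)             # fixed seed so the sketch is repeatable
levels = Counter(random_level(16, rng) for _ in range(10_000))
for level in sorted(levels):
    print(f"layer {level}: {levels[level]} nodes have this as their top layer")
```

Most nodes stay on the bottom layer and only a handful reach the upper ones. That is what makes the upper layers work like a zoomed-out map: a search crosses long distances there first, then descends for the fine-grained lookup.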
