Magesh Ravi
Artist | Techie | Entrepreneur
Recommended default values for these parameters
| Parameter | Value |
| --- | --- |
| m | 16 |
| ef_construction | 200 |
| ef_search | 100 |
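Assuming pgvector on PostgreSQL (where the IVFFlat and HNSW index types come from), and a hypothetical emails table with an embedding vector column, a rough sketch of applying these defaults looks like this:

```python
import psycopg2

# Hypothetical connection details and table/column names.
conn = psycopg2.connect("dbname=disputes user=postgres")

with conn, conn.cursor() as cur:
    # m and ef_construction are fixed at index-build time.
    cur.execute(
        """
        CREATE INDEX IF NOT EXISTS emails_embedding_hnsw
        ON emails USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 200)
        """
    )

# ef_search is a query-time knob; set it for the session (or per transaction).
with conn, conn.cursor() as cur:
    cur.execute("SET hnsw.ef_search = 100")
```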
Search Time Breadth ef_search
: The size of the dynamic candidate list during search (how many nodes are explored to find the nearest neighbours).
This parameter affects query time, not indexing time.
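As a sketch of what that means in practice (same assumptions: pgvector, a hypothetical emails table, and a stand-in query vector), ef_search can be raised for a single transaction with SET LOCAL so only that lookup pays the extra cost:

```python
import psycopg2

conn = psycopg2.connect("dbname=disputes user=postgres")  # hypothetical DSN

query_vec = [0.12, -0.93, 0.55]  # stand-in; a real embedding has many more dimensions
vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

with conn, conn.cursor() as cur:
    # SET LOCAL applies only to the current transaction.
    cur.execute("SET LOCAL hnsw.ef_search = 100")
    cur.execute(
        "SELECT id FROM emails ORDER BY embedding <=> %s::vector LIMIT 10",
        (vec_literal,),
    )
    nearest_ids = [row[0] for row in cur.fetchall()]
```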
Construction Time Search Width ef_construction
: The number of candidates considered while inserting a node into the graph.
Higher ef_construction means better selection of neighbours, at the cost of more work at indexing time.
Maximum Connections m
: The maximum number of bi-directional edges permitted for each node in the graph.
Higher m means a denser graph and hence better recall, but also more memory usage and slower indexing.
Next, choosing a value for m, ef_construction and ef_search.
I created vector embeddings for emails from an ongoing (real-time) civil dispute case and sampled a few of the vectors using DBeaver. Most of the values are close to zero, so I'm assuming the dataset is sparse and going ahead with the HNSW index.
Dense vector: Most (or all) dimensions have non-zero values.
Example: [0.12, -0.93, 0.55, 0.08, 0.23]
Sparse vector: Most dimensions are zero (or near zero).
Example: [0, 0, 0.01, 0, 0, 0.2, 0, 0]
How do I know if my vector dataset is dense or sparse?
A quick way is to sample a few vectors and inspect their values.
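Here is a rough sketch of that inspection; the emails table, embedding column, sample size and the 1e-3 threshold for "near zero" are all assumptions to adapt:

```python
import numpy as np
import psycopg2

NEAR_ZERO = 1e-3  # arbitrary cut-off for "near zero"; adjust to your data

conn = psycopg2.connect("dbname=disputes user=postgres")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # pgvector renders a vector as text like "[0.12,-0.93,...]"
    cur.execute("SELECT embedding::text FROM emails LIMIT 5")
    rows = cur.fetchall()

for (text_vec,) in rows:
    vec = np.array([float(x) for x in text_vec.strip("[]").split(",")])
    near_zero_fraction = np.mean(np.abs(vec) < NEAR_ZERO)
    print(f"dims={vec.size}  near-zero fraction={near_zero_fraction:.1%}")
```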
Another aspect to consider is the size and density of the vector dataset.
IVFFlat is better for large, dense datasets. HNSW is better where the initial data is sparse or data accumulates gradually (our use case).
HNSW creates a multi-layered graph for the vectors. A search query navigates the layers (or zoom levels) to find the nearest neighbours. This requires longer indexing times and more memory, but it outperforms IVFFlat on speed-recall metrics.
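For contrast, a sketch of the IVFFlat side (same hypothetical emails table). IVFFlat derives its list centroids from the rows that already exist at build time, which is why it wants a large, representative dataset up front, whereas HNSW can be built early and updated as data trickles in:

```python
import psycopg2

conn = psycopg2.connect("dbname=disputes user=postgres")  # hypothetical DSN

with conn, conn.cursor() as cur:
    # Build only after a representative amount of data is loaded,
    # since the list centroids are derived from the existing rows.
    cur.execute(
        """
        CREATE INDEX IF NOT EXISTS emails_embedding_ivfflat
        ON emails USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100)
        """
    )

# At query time, ivfflat.probes plays a role similar to hnsw.ef_search.
with conn, conn.cursor() as cur:
    cur.execute("SET ivfflat.probes = 10")
```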