Magesh Ravi
Artist | Techie | Entrepreneur
Someone from blazenode.io messaged me today.
Old DNS records for one of my domains point to an IP address (206.189.132.211) that's currently assigned to their active web server.
So, it's messing up their traffic/analytics.
Recommended default values for the metrics
| Metric | Value |
|---|---|
| m | 16 |
| ef_construction | 200 |
| ef_search | 100 |
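
As a rough sketch, the values in the table could be applied like this, assuming the index lives in PostgreSQL with the pgvector extension (an assumption; the database isn't named here). The table name `emails`, the column name `embedding`, and the cosine operator class are hypothetical placeholders. The helper only builds SQL strings and does not connect to a database.

```python
# Sketch only: builds the SQL that would apply the values from the table.
# Assumes PostgreSQL + pgvector; table and column names are hypothetical.

HNSW_PARAMS = {"m": 16, "ef_construction": 200, "ef_search": 100}


def build_hnsw_statements(table="emails", column="embedding", params=HNSW_PARAMS):
    """Return (create_index_sql, set_ef_search_sql) for a pgvector HNSW index."""
    create_index = (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {params['m']}, ef_construction = {params['ef_construction']});"
    )
    # ef_search is a query-time setting, so it is set per session, not on the index.
    set_ef_search = f"SET hnsw.ef_search = {params['ef_search']};"
    return create_index, set_ef_search


create_sql, search_sql = build_hnsw_statements()
print(create_sql)
print(search_sql)
```

Keeping `ef_search` out of the `CREATE INDEX` statement means it can be tuned per query session without rebuilding the index.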
Search Time Breadth (ef_search): The size of the dynamic candidate list during search (how many nodes are explored to find the nearest neighbours).
This metric affects the query time, not the indexing time.
Construction Time Search Width (ef_construction): The number of candidates considered while inserting a node into the graph.
A higher ef_construction means better selection of neighbours, at the cost of slower indexing.
Maximum Connections (m): The maximum number of bi-directional edges permitted for each node in the graph.
A higher m means a denser graph and hence better recall, but also more memory usage and slower indexing.
Next, choosing a value for m, ef_construction and ef_search.
I created vector embeddings for emails from a real-time civil dispute case. I sampled a few vectors using DBeaver. Too many values are close to zero. I'm assuming the dataset is sparse and going ahead with an HNSW index.
Dense vector: Most (or all) dimensions have non-zero values.
Example: [0.12, -0.93, 0.55, 0.08, 0.23]
Sparse vector: Most dimensions are zero (or near zero).
Example: [0, 0, 0.01, 0, 0, 0.2, 0, 0]
How do I know if my vector dataset is dense or sparse?
A quick way is to sample a few vectors and inspect their values.
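
One way to make that inspection less eyeball-driven is to count how many components are near zero. This is a minimal sketch, not a standard test: the function name `near_zero_fraction` and the `0.01` threshold are assumptions chosen for illustration, and the sample vectors are the dense and sparse examples from this note.

```python
# Sketch: estimate how sparse sampled vectors are by counting near-zero components.
# The 0.01 threshold is an arbitrary assumption, not a standard cutoff.

def near_zero_fraction(vectors, threshold=0.01):
    """Fraction of components whose absolute value is <= threshold."""
    total = sum(len(v) for v in vectors)
    if total == 0:
        return 0.0
    near_zero = sum(1 for v in vectors for x in v if abs(x) <= threshold)
    return near_zero / total


dense_sample = [[0.12, -0.93, 0.55, 0.08, 0.23]]
sparse_sample = [[0, 0, 0.01, 0, 0, 0.2, 0, 0]]

print(near_zero_fraction(dense_sample))   # 0.0
print(near_zero_fraction(sparse_sample))  # 0.875
```

A fraction close to 1 suggests a mostly-zero dataset, while a fraction close to 0 suggests a dense one. Where to draw the line in between is a judgment call.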
Another aspect to consider is the size and density of the vector dataset.
IVFFlat is better for large, dense datasets.
HNSW is better where the initial data is sparse or data accumulates gradually (our use case).