Magesh Ravi
Artist | Techie | Entrepreneur
Next: choosing values for m, ef_construction and ef_search.

I created vector embeddings for emails from a real-time civil dispute case and sampled a few vectors using DBeaver. Too many values are close to zero, so I'm assuming the dataset is sparse and going ahead with the HNSW index.
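Since HNSW is the pick, here is a minimal sketch of what the index looks like in pgvector SQL. The table and column names (emails, embedding) are placeholders, and the parameter values shown are pgvector's documented defaults, not tuned choices:

```sql
-- Build an HNSW index on the embedding column.
-- m: max connections per node per layer (pgvector default: 16)
-- ef_construction: candidate list size while building (default: 64)
CREATE INDEX ON emails USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- ef_search: candidate list size at query time (default: 40);
-- raising it trades query speed for recall.
SET hnsw.ef_search = 40;
```

Larger m and ef_construction improve recall at the cost of build time and memory, which matches the trade-off described above.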
Dense vector: Most (or all) dimensions have non-zero values.
Example: [0.12, -0.93, 0.55, 0.08, 0.23]
Sparse vector: Most dimensions are zero (or near zero).
Example: [0, 0, 0.01, 0, 0, 0.2, 0, 0]
How do I know if my vector dataset is dense or sparse?
A quick way is to sample a few vectors and inspect their values.
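Eyeballing values in DBeaver works, but the same check is easy to script. A rough sketch in Python with NumPy; the 1e-2 threshold and the toy vectors are arbitrary assumptions, not anything from pgvector:

```python
import numpy as np

def near_zero_fraction(vectors, threshold=1e-2):
    """Fraction of vector components whose magnitude is below threshold."""
    arr = np.asarray(vectors, dtype=float)
    return float(np.mean(np.abs(arr) < threshold))

# Toy samples: a dense-looking vector vs a sparse-looking one
dense = [[0.12, -0.93, 0.55, 0.08, 0.23]]
sparse = [[0, 0, 0.01, 0, 0, 0.2, 0, 0]]

print(near_zero_fraction(dense))   # 0.0  -> dense
print(near_zero_fraction(sparse))  # 0.75 -> sparse
```

A high fraction across a decent sample suggests a sparse dataset; run it on a few hundred sampled rows rather than one or two.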
Another aspect to consider is the size and density of the vector dataset.
IVFFlat is better for large, dense datasets. HNSW is better where initial data is sparse or data accumulates gradually (our use case).

HNSW creates a multi-layered graph for the vectors. A search query navigates the layers (or zoom levels) to find the nearest neighbours. This requires longer indexing times and more memory but outperforms IVFFlat on speed-recall metrics.
IVFFlat divides the vectors into clusters. At query time, the clusters closest to the search vector are scanned. It's a straightforward implementation promising faster indexing times and lower memory usage.
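For comparison, a sketch of the IVFFlat equivalent in pgvector SQL (again with placeholder table and column names; lists = 100 is just an illustrative starting point):

```sql
-- Cluster the vectors into 100 lists; pgvector's docs suggest
-- roughly rows / 1000 as a starting point for smaller tables.
CREATE INDEX ON emails USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- probes: how many clusters to scan per query (default: 1);
-- more probes means better recall but slower queries.
SET ivfflat.probes = 10;
```

Note that IVFFlat computes its clusters from the rows present at index-build time, so it wants representative data in the table up front; that is part of why it suits a gradually accumulating dataset poorly.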
I'm using pgvector with Postgres for Exhibit AI. I now have to choose an indexing method for the VectorField: IVFFlat or HNSW.