Magesh Ravi

Artist | Techie | Entrepreneur


Next, choosing values for:

  1. maximum connections (m),
  2. construction-time search width (ef_construction), and
  3. search-time breadth (ef_search).
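In pgvector, the first two are storage parameters fixed when the index is built, while ef_search is a runtime setting you can tune per session. A minimal sketch, assuming a table emails with an embedding column (names are placeholders; the values shown are pgvector's defaults):

```sql
-- Build-time parameters: set once, when the index is created.
CREATE INDEX emails_embedding_hnsw_idx
    ON emails
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time parameter: raise for better recall, lower for speed.
SET hnsw.ef_search = 40;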

I created vector embeddings for emails from a real civil dispute case and sampled a few of the vectors using DBeaver. Many of the values are close to zero, so I'm assuming the dataset is sparse and going ahead with the HNSW index.


Dense vector: Most (or all) dimensions have non-zero values.

Example: [0.12, -0.93, 0.55, 0.08, 0.23]

Sparse vector: Most dimensions are zero (or near zero).

Example: [0, 0, 0.01, 0, 0, 0.2, 0, 0]


How do I know if my vector dataset is dense or sparse?

A quick way is to sample a few vectors and inspect their values.
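For example, a rough check is to count near-zero components in a few sampled vectors (the threshold and the vectors below, taken from the definitions above, are illustrative):

```python
def sparsity(vector, eps=1e-3):
    """Fraction of components whose magnitude is below eps."""
    near_zero = sum(1 for x in vector if abs(x) < eps)
    return near_zero / len(vector)

# The two example vectors from the dense/sparse definitions.
dense = [0.12, -0.93, 0.55, 0.08, 0.23]
sparse = [0, 0, 0.01, 0, 0, 0.2, 0, 0]

print(sparsity(dense))   # 0.0  -> every component is comfortably non-zero
print(sparsity(sparse))  # 0.75 -> six of eight components are (near) zero
```

If most sampled vectors score high on this measure, the dataset leans sparse.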


Another aspect to consider is the size and density of the vector dataset.

  • IVFFlat is better for large, dense datasets.
  • HNSW is better where initial data is sparse or data accumulates gradually (our use case).

HNSW builds a multi-layered graph over the vectors. A search query navigates the layers (or zoom levels) to find the nearest neighbours. This costs longer indexing times and more memory, but it outperforms IVFFlat on the speed-recall trade-off.


IVFFlat divides the vectors into clusters. At query time, only the clusters closest to the search vector are scanned. It's a straightforward implementation promising faster indexing times and lower memory usage.
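In pgvector terms that looks like the sketch below (table and column names are placeholders; lists controls how many clusters are built, probes how many are scanned per query):

```sql
CREATE INDEX emails_embedding_ivfflat_idx
    ON emails
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Higher probes = better recall, slower queries.
SET ivfflat.probes = 10;
```

One caveat: the clusters are learned from rows already in the table, so an IVFFlat index is best created after the data has been loaded.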


I'm using pgvector with Postgres for Exhibit AI. I now have to choose between IVFFlat and HNSW as the indexing method for the VectorField.
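Assuming the VectorField comes from pgvector's Django integration, the choice surfaces in the model's Meta.indexes. A sketch with an HNSW index (the model, field name, dimensions, and parameter values are illustrative, not Exhibit AI's actual schema):

```python
from django.db import models
from pgvector.django import VectorField, HnswIndex

class Email(models.Model):
    body = models.TextField()
    embedding = VectorField(dimensions=1536)  # match your embedding model

    class Meta:
        indexes = [
            HnswIndex(
                name="email_embedding_hnsw_idx",
                fields=["embedding"],
                m=16,
                ef_construction=64,
                opclasses=["vector_cosine_ops"],
            ),
        ]
```

Swapping in IvfflatIndex instead would give the other option without touching the field definition.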


New blog post: Announcing Exhibit AI

  1. Then they imitate, remix and repeat.
  2. Finally, they share their version with the intended tribe.

Depending on how remarkable the work is, the tribe loves it, hates it, or, worse, ignores it.
