82 comments
  • ignoramous1y

    Some more resources:

    - A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703

    - Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021

    - What are Embeddings, https://raw.githubusercontent.com/veekaybee/what_are_embeddi...

    ---

    h/t: https://twitter.com/eatonphil/status/1745524630624862314 and https://twitter.com/ChristophMolnar/status/17457316026829826...

  • ripley121y

    The "Do you need a Dedicated Vector Database?" slide is quite interesting, but doesn't really answer its own question! This is something I've been wondering myself, if anyone has any guidelines or rules of thumb I would appreciate it.

    I've recently been using Simon Willison's (hi simonw) excellent `llm` tool that can help with embeddings, and it takes the simplest approach possible: just store embeddings in SQLite with a few UDFs for calculating distance etc.

    The simplicity of that approach is very appealing, but presumably at some level of traffic+data an application will outgrow it and need a more specialized database. Does anyone have a good intuition for where that cutoff might be?
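
    For context, the simplest approach looks roughly like this (a toy sketch with JSON-encoded vectors and a Python UDF, not the actual implementation `llm` uses):

        import json, math, sqlite3

        db = sqlite3.connect("embeddings.db")
        db.execute("CREATE TABLE IF NOT EXISTS embeddings (id TEXT PRIMARY KEY, vector TEXT)")

        # UDF: cosine similarity between two JSON-encoded float arrays
        def cosine(a_json, b_json):
            a, b = json.loads(a_json), json.loads(b_json)
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm

        db.create_function("cosine", 2, cosine)

        # Brute-force "search": score every row, sort, take the top 10
        query_vec = json.dumps([0.1, 0.3, 0.5])  # embedding of the query text
        rows = db.execute(
            "SELECT id, cosine(vector, ?) AS score FROM embeddings ORDER BY score DESC LIMIT 10",
            (query_vec,),
        ).fetchall()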

    • estebarb1y

      Some years ago I wrote a vector index that used LSH. Search was implemented in the most trivial way: scanning everything and comparing using Hamming distance (xor and popcount). I was able to scan 200k hashes and find the most similar in less than 10ms on a single core of a 2011 MBP.
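
      For the curious, the scan itself is tiny - something like this (sketch with made-up 64-bit hashes; needs Python 3.10+ for int.bit_count):

          import random

          hashes = [random.getrandbits(64) for _ in range(200_000)]  # stand-in LSH codes
          query = random.getrandbits(64)

          # Brute-force scan: xor gives the differing bits, popcount gives the Hamming distance
          best = min(hashes, key=lambda h: (h ^ query).bit_count())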

      • winrid1y

        That's cool, but 200k is pretty low. Also, in 10ms with O(log n) lookups and the index hot in memory, you should be able to search through a billion or so index keys at least.

    • summarity1y

      I’ve had some success scaling a quick and dirty engine (that powers findsight.ai) to tens of millions of vectors, details in the talk here: https://youtu.be/elNrRU12xRc?t=1556

      That’s maybe 1kLOC so I didn’t need an external one after all.

    • tristanho1y

      Honestly, just loading all vectors in memory (and stuff like sqlite, pgvector) is totally fine when you're dealing with O(100k) vectors, but beyond that all the workable options like pinecone get gnarly, slow, and ridiculously expensive.

      The best option by far that I know of is turbopuffer.com, which is like 100x cheaper than pinecone and seems to actually scale.

      Since it's not listed in the suggested vector dbs section of the slides, wanted to lob it in as a solid suggestion :)

    • darkteflon1y

      YMMV, of course, but this[1] article - posted here a few weeks back - really helped us to focus on how to think about what we actually needed: in short, a search engine.

      [1] https://news.ycombinator.com/item?id=38703943

    • jide_tracc1y

      Thanks for the feedback! I'll make this the topic of my next article and do a deeper dive on this

  • jide_tracc1y

    A few months ago, I taught a class on Vector Databases for a private TGE Data client and then decided to record it as a short course for a wider audience.

    The course is a mix of theory and demos discussing some of the underlying concepts of Vectors, Vector Databases, Indexing, and Similarity Search, ending with demos specifically for the Pinecone and Weaviate databases.

    • CharlesW1y

      Cool! Is there a video as well, then?

      • jide_tracc1y

        There is, but it's a Udemy course unfortunately. I don't want to spam the group with links, but the link is on the last page of the document.

  • PaulDavisThe1st1y

    A nice introduction. However, I feel that, like many such introductions, it skips too quickly over the question of feature selection.

    This is the step that introduces, rather subtly and sometimes almost ignorably, human judgement into what otherwise feels like a very automated and "it's just math" system.

    Let's take audio. What features will you select from it to make your N-dimensional vector? An easy answer might appear to be "as many as possible". However, your first problem is that you may not have access to feature characterization even for just the features you can trivially name. Your second problem is that without deep knowledge of the domain, you may not even be aware of potential features that you should probably be using. Your third problem is that even with deep knowledge of the domain, you may still be unaware of potential features that you probably should be using.

    A practical example may help. Perhaps you're a fan of Reichian phase-shifting minimalist music. You (indirectly) interrogate a vector database to find music similar to a canonical piece from this genre (say, Piano Phase). The database uses many audio & music related features, including but not limited to: dominant frequency, note onset intervals, volume, frequency-distribution-based timbral characteristics, apparent root notes and scales etc etc etc.

    Unfortunately for you, the feature set in the database does not include "note-to-note intervals over time is constant". Your query will manage to find things that are timbrally, harmonically, melodically and rythmically similar, but it will be by sheer good luck that any of them share the key characteristic: steady changes in the relative phase of two different (or the same) melodic lines.

    This is just one example. It's not hard to cook up equivalents for visual, textual, numerical ... any data at all.

    Now of course, none of this makes vector databases and feature classification useless. It does mean that when you do or do not find matching patterns in a given data set, one of the early questions you need to ask yourself is whether the feature set has any strong guarantee of being complete, and if not, how it may need to be expanded.

    • yunyu1y

      Vector databases are oriented around search and retrieval. The usual approach to generating vectors is to fine-tune a large pretrained model and extract its inner representations. Because the dataset contains successful queries and retrieval results, all you need to do is optimize a loss function (usually contrastive, or an approximated version of a common ranking function) over the raw inputs on the similarity objective supported by the vector database. For common modalities like tabular/text/image/audio data, there is basically no human judgement involved in feature selection - just apply attention.

      Note: current state-of-the-art text-to-vector models like E5-Mistral (note: very different from the original E5) don't even require human curation of the dataset
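
      To make "optimize a loss function contrastively" concrete, here's a rough sketch of an in-batch InfoNCE objective (illustrative only, not E5's exact training recipe):

          import torch
          import torch.nn.functional as F

          def info_nce(query_emb, doc_emb, temperature=0.05):
              # Normalize so the dot product matches the cosine similarity the vector DB will use
              q = F.normalize(query_emb, dim=-1)
              d = F.normalize(doc_emb, dim=-1)
              logits = q @ d.T / temperature       # (batch, batch) similarity matrix
              labels = torch.arange(q.size(0))     # i-th doc is the positive for the i-th query
              return F.cross_entropy(logits, labels)

          # query_emb / doc_emb would come from the pretrained encoder being fine-tuned
          loss = info_nce(torch.randn(32, 768), torch.randn(32, 768))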

      • PaulDavisThe1st1y

        This is also just eliding the work that humans still have to do.

        I give you an audio file (let's say just regular old PCM wav format). You cannot do anything with this without making some decisions about what happens next to the data. For audio, you're minimally faced with the question of whether to transform it into the frequency domain. If you don't do that, there's a ton of feature classification that can never be done. No audio-to-vector model can make that sort of decision for itself - humans have to make that possible.
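
        Even that "minimal" frequency-domain step is a pile of human choices - window size, hop length, windowing function - before any model sees a feature. A crude sketch of what I mean:

            import numpy as np

            def spectrogram(samples, frame=1024, hop=512):
                # Every parameter here is a human decision about what the model gets to "see"
                window = np.hanning(frame)
                frames = np.stack([samples[i:i + frame] * window
                                   for i in range(0, len(samples) - frame, hop)])
                return np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrum per frame

            pcm = np.random.randn(48000)        # one second of (fake) audio at 48 kHz
            spec = spectrogram(pcm)             # shape: (num_frames, frame // 2 + 1)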

        Raw inputs are suitable for some things, but essentially E5 is just one model that already has a large number of assumptions built into it, assumptions that happen to give pretty good results. Nevertheless, were you interested for some weird reason in a very strange metric of text similarity, nothing prebuilt, even E5, is going to give that to you. Let's look at what E5 does:

        > The primary purpose of embedding models is to convert discrete symbols, such as words, into continuous-valued vectors. These vectors are designed in such a way that similar words or entities have vectors that are close to each other in the vector space, reflecting their semantic similarity.

        This is great, but useful for only a particular type of textual similarity consideration.

        And oh, what's this:

        > However, the innovation doesn’t stop there. To further elevate the model’s performance, supervised fine-tuning was introduced. This involved training the E5 embeddings with labeled data, effectively incorporating human knowledge into the learning process. The outcome was a consistent improvement in performance, making E5 a promising approach for advancing the field of embeddings and natural language understanding.

        Hmmm ....

        Anyway, my point still stands: choosing how to transform raw data into "features" is a human activity, even if the actual transformation itself is automated.

    • ravetcofx1y

      Great point about not knowing what you don't know. Also wonderful to see another Steve Reich aficionado. What are some pieces similar to Music for 18 Musicians, which I adore?

      • PaulDavisThe1st1y

        There's nothing quite like Mf18M.

        However, I do love several of the pieces by Marc Mellits; check out "Real Quiet plays the music of Marc Mellits". All rather short, but some totally lovely.

        You may enjoy an 80's album by Andrew Poppy, in particular the track "The Object is a Hungry Wolf".

        Do be sure to check out "Shift" by Chris Hughes, a set of re-instrumentations of several early Reich pieces including Piano Phase.

        It's quite different in many ways from Mf18M, and you may have heard it, but I adore "The Chairman Dances" by John Adams. It couples the "chugging along" of Mf18M with much less minimalist melodic and harmonic structures. YMMV.

        I'm sure I am going to remember a bunch more after I send this, so check back here in a couple of days, and I'll reply again with any extras.

        Do you have any "similars" ?

  • danjl1y

    Great overview, but the final section doesn't address the obvious question of how to decide between a "vector store" like Postgres+pgvector and a "vector database" like Pinecone. I'd love to see another presentation which discusses the various tradeoffs, like query speed, insertion/index-building speed, ease of use, and others, to help guide people trying to decide which of the options is best for their application.

    • jide_tracc1y

      Thanks for the feedback! I'll explore this topic for upcoming articles for a deeper dive

    • Beefin1y

      i started compiling comparisons between a bunch of them a while back: http://vectorsearch.dev/

      welcome any PRs

      • danjl1y

        Although excellent for vector databases, unfortunately this does not include postgres with pgvector.

    • esafak1y

      I'd call the former a vector extension. A database is a store with bells and whistles.

  • simonw1y

    Since digitaloceanspaces.com is an S3-style hosting provider, it would be neat if Hacker News could special-case it to display something like tge-data-web.nyc3.digitaloceanspaces.com as the domain instead of just digitaloceanspaces.com

    Though it looks like S3 has the same problem: https://news.ycombinator.com/item?id=38876761

    There's precedent for this elsewhere though - sites on a x.github.io subdomain have special treatment here: https://news.ycombinator.com/from?site=lfranke.github.io

  • nostrebored1y

    From looking at this, I think it's a very risky starting point for an engineer.

    Things like mentioning they’re clustered by meaning and optimized for analytics are questionable.

    The clustering depends on the embedding you calculate. If you think the embedding is a good semantic approximation of the data, then maybe this is a fine way of thinking about it. But it's not hard to imagine embeddings that violate this — e.g. if I run an audio file and a text file that are identical in meaning through the same embedding process, unless it is multimodal they will likely be distant in the embedding vector space.

    I fully expect to see embeddings that put things close together in the vector space based on utilization rather than semantic similarity. If I'm creating a recommender system, I don't want to group different varieties of one-off purchases closely. For instance, the most semantically similar flight is going to be another flight to the same destination at a different time, or a flight to a nearby airport. But I would want to group hotels often purchased by people who have previously bought the flight.

    Vector databases also allow you to provide extra dimensionality into the data, like time awareness. Nothing is forcing you to use a vector that encodes semantic meaning.

    And from this, you can see that vector databases are optimized for lookups or searches based on an input vector. This is not analogous to OLAP queries. This is more akin to Elasticsearch than Snowflake. If you are using a vector database thinking it's going to give you reporting or large-scale analytics on the vector space, afaik there isn't a readily available offering.

    • chasd001y

      calculating the embeddings is still a mystery to me. I get going from a picture of an apple to a vector representing "appleness" and then comparing that vector to other vectors using all the usual math. What I don't get is who/what takes the image as input and outputs the vector. Same goes for documents. Let's say I want to add a dimension (another number in the array): what part of the vector database do I modify to include this dimension in the vector calculation? Or is going from doc/image/whatever to the vector representation done outside the database in some other way?

      edit: it seems like calculating embeddings would be something an ML algorithm would do, but then again, you have to train that one first. ...it's training all the way down.

      • nostrebored1y

        Yup it happens outside of the system — but there are a number of perks to being able to store that data in a db — including easily adding metadata, updating entries, etc.
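
        Roughly, the division of labour looks like this (sketch; the model name is just an example, and the upsert call is a hypothetical client API):

            from sentence_transformers import SentenceTransformer

            model = SentenceTransformer("all-MiniLM-L6-v2")   # pretrained embedding model, outside the DB
            vec = model.encode("a photo of an apple")          # -> numpy array (384 dims for this model)

            # The database only ever sees the finished vector plus metadata:
            # vector_db.upsert(id="doc-1", vector=vec, metadata={"kind": "caption"})   # hypothetical call

            # Want an extra dimension? You change the model (or concatenate another feature)
            # and re-embed everything - the database itself is not involved in that step.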

        I think in 10y we will see retail systems heavily utilizing vector dbs, and many embedding-as-a-service products that take into account things like conversion. In this model you can add metadata about products to the vector db and direct program flow, instead of querying back out to one or more databases to retrieve relevant metadata.

        They’ll also start to enable things like search via image for features like “show us your favorite outfit” pulling up a customized wardrobe based on individual items extracted from the photo and run through the embedder.

        Just one of many ways these products will exist outside of RAG. I think we’ll actually see a lot of the opposite — GAR.

  • jhj1y

    I don't know why PQ is listed as an "indexing strategy". It's a vector compression/quantization technique, not a means of partitioning the search space. You could encode vectors with PQ when using a brute-force/flat index, an IVF index, or HNSW (all of which are present in Faiss with PQ encoding as IndexPQ, IndexIVFPQ and IndexHNSWPQ respectively), or even k-D trees or ANNOY if someone wanted to do that.

    "Use HNSW or Annoy for very large datasets where query speed is more important than precision": Graph-based methods have huge memory overhead and construction cost, and they aren't practical for billion-scale datasets. Also, they will usually be more accurate and faster than IVF techniques (as you would need to visit a large number of IVF cells to get comparable accuracy), though IVF can scale to trillion-sized databases without much overhead yet with reasonable speed/accuracy tradeoffs unlike other techniques. I'd say "use for medium-scale datasets where query speed is important, yet high accuracy is still desired and flat/brute-force indexing is impractical".

    • m00x1y

      It turns a continuous space into a discrete one. You would first do PQ, then do KNN over the new discrete vectors. This way you can compress the vocabulary to a fixed size.

  • stefanha1y

    In the table on slide 15 the Indexing & Search Efficiency cells for Traditional Databases and Vector Databases appear to be swapped.

    • kevindamm1y

      Yes that last row looks swapped to me as well.

  • dimatura1y

    Any recommendations for an embedded embedding database (heh)? Embedded, as in sqlite. For smaller-scale problems, but hopefully more convenient than say, LMDB + FAISS.

  • tluyben21y

    I found the code of Qdrant very readable (with nice comments at the top) even if you are not (quite) aware of how vector databases (exactly) work. You need to be a programmer, but with that as a given, it reads surprisingly easily for a production database (one that companies like OpenAI etc use).

  • AtlasBarfed1y

    ANNOY is essentially a binary space partitioning tree for points, without having to worry about sets of points (aka polygons)?

  • perone1y

    I find it interesting how everyone ignores EuclidesDB (https://euclidesdb.readthedocs.io), which came before Milvus and the others in 2018 and is free and open-source. Same goes for all the presentations from the major DBs.

  • esafak1y

    Missing coverage of hybrid search (vector + lexical).

    • swyx1y

      this is quite important. every vector db will provide lexical features, every traditional db will provide vector features. everything trends hybrid.

  • gk11y

    This is really cool, thanks for making this and sharing with the community!

  • anotheraccount91y

    If I read organized data from a traditional db, and add new tables based on potential similarities from the existing data/tables, with scores, would that be a vector db?

  • kgeist1y

    When does a simple linear search become an issue? For example, a cosine similarity function running over even a million vectors is surprisingly fast. We can further shard vectors by user (i.e. if they can only search the data they uploaded themselves), or even by document (for things like "chat with this PDF"); then it's likely that most searches will happen over small ranges of vectors.

    I suspect the main downsides would be increased CPU usage and having to load vectors to RAM all the time. However, for small projects with low RPS you could probably get away without a special database?

    Does anyone know of any benchmarks where they compare vector databases to a plain linear search? How they scale in regards to increased vector count and RPS, etc.
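
    For reference, the kind of brute-force search I have in mind is just a matrix-vector product (numpy sketch; vectors normalized up front so the dot product is the cosine similarity):

        import numpy as np

        N, d = 1_000_000, 384
        vectors = np.random.rand(N, d).astype(np.float32)           # ~1.5 GB of float32
        vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # normalize once, at load time

        def search(query, k=10):
            q = query / np.linalg.norm(query)
            scores = vectors @ q                        # all N similarities in one pass
            top = np.argpartition(-scores, k)[:k]       # unordered top-k
            return top[np.argsort(-scores[top])]        # sorted top-k indices

        hits = search(np.random.rand(d).astype(np.float32))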

    • Radim1y

      Brute force search is both exact (100% accurate) and pleasantly linear – a predictable algorithm. CPUs and caches like that, so performance is much better than you might otherwise expect.

      From my https://rare-technologies.com/performance-shootout-of-neares... benchmark of kNN libs:

      "Brute force doesn’t care about the number of neighbours, so its performance is 679ms/query regardless of “k”. When run in batch mode (issuing all 100 queries at once), average query time drops to 354ms/query."

      This was 500 dimensions over a 3.7M-document dataset (the English Wikipedia), in 2014. So, ~700ms/search, or half that if you can batch several searches together at once. YMMV.

    • rolisz1y

      Do you mean using brute force (exact search) and not approximate nearest neighbor search? From my unscientific benchmarks, doing an exact search with 100k vectors takes about 1 second. Usually chatbots take much longer to generate the text, so that's still an acceptable time for doing exact search, which also means you will always find the best matches.

  • ok1234561y

    Are these dedicated vector databases doing anything more complicated than what can be accomplished using Postgres with the Cube extension?

  • dmezzetti1y

    It's also important to understand why we'd even want vectors in the first place.

    This article covers the advantages of semantic search over keyword search - https://medium.com/neuml/getting-started-with-semantic-searc...

  • yagami_takayuki1y

    For recognizing features such as hair and skin color, which would do a better job? Machine learning with image classification? Or a vector database?

    I've had Weaviate return a dog given a human as input when doing an image similarity search, so I was wondering if there's some way to improve the results or whether I'm barking up the wrong tree.

    • dimatura1y

      For those two in particular? You'd definitely get a better result with an ML model such as a convolutional neural net. In some sense, using an image similarity query is a kind of ML model - nearest neighbor - which can work in some scenarios. But for this specifically, I'd recommend a CNN.

    • bfeynman1y

      I would think you could improve your embedding space to address that issue, partially. Similarity search (as a result of some contrastive loss) definitely suffers at the tails, and out-of-distribution (OOD) performance is pretty bad. That being said, you're more likely to get higher recall than with a more classical technique.

    • Kubuxu1y

      You could consider something like BLIP2. There are multiple ways you could use it: embed images and match them against embeddings of text descriptors, train custom embedding of descriptors, or a classifier on top of the embedding (linear layer on top of the image embedding network).

      The approaches increase in complexity. It also allows for dataset bootstrap:

      Let's say you want to classify cats by breed. You could start by embedding images and text descriptors and distance-matching the embedded descriptors to the images. This gives you a dataset that might be 90% correct; you can then clean it up, which would be easier to do than manually labelling it. Based on that improved dataset, you can train a custom embedding for the labels or a classification layer on top of the image embedding network.
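
      A rough sketch of that first descriptor-matching step, with CLIP standing in for BLIP2 (same embed-and-match idea, just simpler to show):

          import torch
          from PIL import Image
          from transformers import CLIPModel, CLIPProcessor

          model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
          processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

          labels = ["a siamese cat", "a persian cat", "a maine coon"]   # text descriptors
          image = Image.open("cat.jpg")                                  # your unlabeled image

          inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
          with torch.no_grad():
              probs = model(**inputs).logits_per_image.softmax(dim=-1)
          print(labels[probs.argmax().item()])   # bootstrap label, to be cleaned up by hand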

    • m00x1y

      You don't use vector databases independently; you need to input the embeddings from an ML model.

      For your use-case, it should be pretty simple. You could use a CNN and train it, or use YOLO, Deepface, or other face detection algos, then within the face, find the hair and find the skin.

      From there you can use a vector database to get the colors that resemble other inputs, or you can use a simple CNN to classify the hair and skin to the closest label.

  • huqedato1y

    Excellent introduction (brief course)! 10/10! Please make more like this on transformers, LLM, DL, ML, AI etc.

  • dejli1y

    I watched a talk from PostgresML yesterday (https://www.youtube.com/watch?v=D8YMMOhAeBs) and found it very practical and to the point on how you could use Postgres to build a vector database.

  • takinola1y

    When I started playing around with AI applications and learnt about RAG techniques, it took me a while to grok vector databases, and it was a bit of a pain to learn how to set one up. So I built my own little pet project, RagTag [1], which abstracts the vector database behind a simple CRUD API. I simply POST documents to RagTag and they are automatically converted into embeddings and made available to be queried for similarity searches.

    [1] RagTag - https://ragtag.weaveapi.com

  • jpasmore1y

    helpful - thx - building Latimer.ai using Pinecone

  • koreankid2271y

    Nice resource! One observation: the last row of the table in slide 15 (comparing traditional DBs with Vector DBs) seems flipped?

  • cubefox1y

    What are vector databases mainly used for in practice? Similarity search? Showing "related" entries?

  • m3kw91y

    Is this how most AI services implement RAG?

    • gk11y

      Yes, most RAG is done with a vector database. Or something that approximates a vector database, like a traditional database with a vector-index add-on.
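
      The flow is usually just: embed the question, pull the nearest chunks, stuff them into the prompt. Sketch below, with hypothetical embed() and llm() helpers and brute-force search standing in for the vector database:

          import numpy as np

          def rag_answer(question, chunks, embed, llm, k=3):
              doc_vecs = np.stack([embed(c) for c in chunks])   # normally precomputed and stored in the DB
              scores = doc_vecs @ embed(question)               # the "vector database" step
              context = "\n\n".join(chunks[i] for i in np.argsort(-scores)[:k])
              return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")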

    • charcircuit1y

      Keyword or text-matching-based search is likely still more popular due to how long it's been around, its simplicity, and the tooling built around it. Most companies who have internal search are most likely not using vector / semantic search, but are doing something more basic.

  • itsthecourier1y

    The table with index types in the "comparison" slide is inverted.

  • derrickrburns1y

    Well done!

  • cranberryturkey1y

    is surrealdb any good as a vector database?

    • esafak1y

      That doesn't make sense. SurrealDB doesn't even tout itself as having good performance. If you don't want a custom vector database, you might as well use postgres.

    • m11171y

      Pinecone is surreal!