6 comments
  • TheNewAndy (7y)

    I'm a bit confused by this (I've only read the summary that is linked to).

    As I understand it, the "raw" neural bloom filter is basically a function which you can query "with what certainty can you say that item X is in the set". Unlike a classic bloom filter, it will never be able to say "item X is definitely not in the set". Once you add thresholds, this turns into a function which can say "might be a member" or "might not be a member".

    So to avoid having false negatives, it sounds like they have a backup "classic bloom filter". e.g. from the summary, this is the full algorithm:

    > 1. If the Neural Bloom Filter predicts membership with probability greater than the chosen threshold, return “(might be a) member”
    > 2. Otherwise check the secondary Bloom filter, if it indicates the element might be a member, return “(might be a) member”
    > 3. Otherwise return “not a member”
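    In Python, that three-step lookup might look something like this (the `neural_score` stub and `TinyBloom` class here are illustrative stand-ins, not the paper's actual components):

```python
import hashlib

def neural_score(item):
    # Placeholder for the trained model's membership probability;
    # here we just pretend items containing "a" look like members.
    return 0.9 if "a" in item else 0.1

class TinyBloom:
    """Minimal classic Bloom filter (illustrative, not optimized)."""
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.bits = 0
    def _positions(self, item):
        # k salted hashes mapped to bit positions.
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]
    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p
    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def query(item, backup, threshold=0.5):
    # 1. Neural filter says "member" with high enough probability.
    if neural_score(item) > threshold:
        return "might be a member"
    # 2. Otherwise fall back to the classic Bloom filter.
    if backup.might_contain(item):
        return "might be a member"
    # 3. Both say no, so the element is definitely not in the set.
    return "not a member"

backup = TinyBloom()
backup.add("emu")                    # a set element the neural filter misses
print(query("cat", backup))          # -> might be a member (via neural filter)
print(query("emu", backup))          # -> might be a member (via backup filter)
```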

    If the secondary (classic) bloom filter is the same size as the one they are comparing against, then there can't be any space savings - it is strictly an addition.

    If the secondary bloom filter is smaller than it would otherwise have been, then it will introduce more false positives. And since a negative from the neural bloom filter just means "listen to the secondary bloom filter", the combined error can only ever be worse than using the secondary bloom filter alone (the neural bloom filter might get something wrong that the secondary bloom filter would have got right).

    • geocar (7y)

      I can imagine how the secondary (classic) bloom filter can be smaller: it only needs to hold the elements that are actually in the set but fail the neural bloom filter. That means once you're done training the neural bloom filter, you can revisit your training set to find the items the primary misses, so you know exactly what goes in the secondary.
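      A rough sketch of that construction (with a stub `neural_score` standing in for the trained model, and a toy Bloom filter for the backup):

```python
import hashlib

def neural_score(item):
    # Stand-in for the trained neural filter's membership probability.
    return 0.9 if "a" in item else 0.1

class TinyBloom:
    """Minimal classic Bloom filter used as the backup."""
    def __init__(self, m, k=3):
        self.m, self.k = m, k
        self.bits = 0
    def _positions(self, item):
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]
    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p
    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def build_backup(stored_set, threshold=0.5):
    # Replay the stored set through the trained neural filter and keep
    # only its false negatives.
    misses = [x for x in stored_set if neural_score(x) <= threshold]
    # Size the backup for just the misses (~10 bits per element is a
    # common rule of thumb for roughly 1% false positives).
    backup = TinyBloom(m=max(8, 10 * len(misses)))
    for x in misses:
        backup.add(x)
    return backup

# Only "emu" and "ox" (the neural filter's misses) occupy the backup.
backup = build_backup(["apple", "banana", "emu", "ox"])
```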

  • egloo (7y)

    Neat, but I'm not sure this has practical value - being 400x slower without a GPU for a 3-40x space savings puts this into a different bucket of usefulness than regular bloom filters. Space is far cheaper than GPU compute.

    Also, going from 0.02 ms latency to 5.1 ms (CPU) - 13 ms (GPU) is a huge con to this approach.

    That said, I think it's really cool and might be a jumping-off point for other innovations.

  • closetCS (7y)

    It's pretty smart to use neural networks to exploit the non-uniformity of the data, but I thought they were going to replace just the hash function part of a bloom filter, instead of the entire algorithm. I wonder if that could preserve the speed of the bloom filter while still exploiting potentially non-uniform data.

    • stochastic_monk (7y)

      There's probably something worth using in this tradeoff space, though this isn't it.

      • sitkack (7y)

        Maybe the network can be used to construct the hash function after it has learned the shape of the data.