This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.
A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.
(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.
Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."
Code: https://github.com/yichuan-w/LEANN Paper: https://arxiv.org/abs/2405.08051
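If it helps to picture it, here's a toy sketch of the core trick (not our actual implementation; the model name, chunks, and graph below are placeholders): keep the chunk text plus a small pruned neighbor graph on disk, and recompute embeddings on the fly only for the nodes a query actually visits.

    # Toy sketch only, assuming a sentence-transformers model is available locally.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # any local embedding model

    chunks = ["chunk 0 text ...", "chunk 1 text ...", "chunk 2 text ..."]
    graph = {0: [1], 1: [0, 2], 2: [1]}   # pruned neighbor lists: a few ints per chunk

    def embed(texts):
        return model.encode(texts, normalize_embeddings=True)

    def search(query, entry=0, max_steps=10):
        """Greedy graph walk; no stored vectors, only on-the-fly embeddings."""
        q = embed([query])[0]
        current = entry
        best = float(embed([chunks[current]])[0] @ q)
        for _ in range(max_steps):
            neigh = graph[current]
            scores = embed([chunks[n] for n in neigh]) @ q   # recompute a handful
            i = int(np.argmax(scores))
            if scores[i] <= best:
                break                        # local optimum: stop here
            current, best = neigh[i], float(scores[i])
        return chunks[current]

The real index does the pruning and traversal far more carefully, but the storage win comes from exactly this: nothing proportional to num_chunks x dim x 4 bytes ever has to hit disk.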
> A vector database for years of emails can easily exceed 50GB.
In 2025 I would consider this a relatively meager requirement.
Yeah, that's a fair point at first glance. 50GB might not sound like a huge burden for a modern SSD.
However, the 50GB figure was just a starting point for emails. A true "local Jarvis" would need to index everything: all your code repositories, documents, notes, and chat histories. That raw data can easily be hundreds of gigabytes.
For a 200GB text corpus, a traditional vector index can swell to >500GB. At that point, it's no longer a "meager" requirement. It becomes a heavy "tax" on your primary drive, which is often non-upgradable on modern laptops.
The goal for practical local AI shouldn't just be that it's possible, but that it's also lightweight and sustainable. That's the problem we focused on: making a comprehensive local knowledge base feasible without forcing users to dedicate half their SSD to a single index.
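For a rough sense of where the >500GB number comes from, here's the back-of-envelope (every parameter below is an illustrative assumption; the exact multiplier depends on chunk size, embedding dimension, and index overhead):

    corpus_bytes   = 200e9      # 200 GB of raw text
    chunk_bytes    = 1_000      # assume ~1 KB chunks
    dim, float_sz  = 1024, 4    # assume 1024-dim float32 embeddings
    link_overhead  = 256        # rough per-vector graph/link cost in bytes

    n_chunks   = corpus_bytes / chunk_bytes                  # ~200M chunks
    index_size = n_chunks * (dim * float_sz + link_overhead)
    print(f"{index_size / 1e9:.0f} GB")                      # ~870 GB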
You already need very high-end hardware to run useful local LLMs, so I don't know if a 200GB vector database will be the dealbreaker in that scenario. But I wonder how small you could get it with compression and quantization on top.
I'm no dev either, and I still set up remote SSH login to be able to use LaTeX on my home PC from my laptop.
Also, with many games and dual boot on my gaming PC I still have some space left on my 2TB NVMe SSD. And my non-enthusiast motherboard could fit two more.
It took so much time to install LaTeX and its packages, and also so much space, that my 128GB drive couldn't handle it.
I've worked in other domains my whole career, so I was astonished this week when we put a million 768-dimensional embeddings into a vector db and it was only a few GB. Napkin math said ~25 GB and intuition said a long list of widely distributed floats would be fairly incompressible. HNSW is pretty cool.
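(Doing the arithmetic after the fact, and assuming the database keeps float32 vectors plus HNSW neighbor lists with M=16, a few GB is roughly what you'd expect before any quantization:)

    n, dim  = 1_000_000, 768
    vectors = n * dim * 4          # float32 vectors: ~3.1 GB
    links   = n * 2 * 16 * 4       # ~2*M neighbor ids per node at the base layer
    print(f"{(vectors + links) / 1e9:.1f} GB")   # ~3.2 GB; upper layers add a little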
You can already do A LOT with an SLM running on commodity consumer hardware. Also it's important to consider that the bigger an embedding is, the more bandwidth you need to use it at any reasonable speed. And while storage may be "cheap", memory bandwidth absolutely is not.
> You already need very high end hardware to run useful local LLMs
A basic MacBook can run gpt-oss-20b and it's quite useful for many tasks. And fast. Of course Macs have a huge advantage for local LLM inference due to their shared memory architecture.
The mid-spec 2025 iPhone can run “useful local LLMs” yet has 256GB of total storage.
(Sure, this is a spec distortion due to Apple’s market-segmentation tactics, but due to the sheer install-base, it’s still a configuration you might want to take into consideration when talking about the potential deployment-targets for this sort of local-first tech.)
Question: would it be possible to invert the problem? I.e., rather than decreasing the size of the RAG index, use the RAG index to compress everything other than the index itself.
E.g., design a filesystem so that the RAG index is part of / managed internally within the metadata of the filesystem itself; and then, for each FS inode data-extent, give it two polymorphic on-disk representations:
1. extents hold raw data; rag-vectors are derivatives and updated after extent is updated (as today)
2. rag-vectors are canonical; extents hold residuals from a predictive-coding model that took the rag-vectors as input and tried to regenerate the raw data of the extent. When the extent is read [or partially overwritten], use the predictive-coding model to generate data from the vectors and then repair it with the residual (as in modern video-codec p-frame generation). A rough sketch follows at the end of this comment.
———
Of course, even if this did work (in the sense of providing a meaningful decrease in storage use), this storage model would only really be practical for document files that are read entirely on open and atomically overwritten/updated (think Word and Excel docs, PDFs, PSDs, etc), not for files meant to be streamed.
But, luckily, the types of files that are amenable to this technique are exactly the types of files that a “user’s documents” RAG would have any hope of indexing in the first place!
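To make option 2 concrete, a hypothetical userland sketch (embed and generate stand in for the embedding model and the predictive-coding generator; a real version would live inside the filesystem, and every name here is made up):

    import zlib

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        n = max(len(a), len(b))
        a, b = a.ljust(n, b"\0"), b.ljust(n, b"\0")
        return bytes(x ^ y for x, y in zip(a, b))

    def write_extent(raw: bytes, embed, generate):
        """Keep only the rag-vector plus a compressed residual; drop the raw bytes."""
        vec = embed(raw)                     # canonical representation
        predicted = generate(vec)            # the model's guess at the raw data
        residual = zlib.compress(xor_bytes(raw, predicted))
        return vec, residual, len(raw)

    def read_extent(vec, residual, length, generate):
        predicted = generate(vec)            # regenerate, then repair with residual
        return xor_bytes(predicted, zlib.decompress(residual))[:length]

The better the generator predicts the extent from its vectors, the more zero-heavy (and therefore compressible) the residual gets; if it predicts badly, you have effectively stored the whole file plus a vector and lost.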
The DGX Spark being just $3-4,000 with 4TB of storage, 128GB unified memory, etc (or the Mac Studio tbh) is a great indicator that Local AI can soon be cheap and, along with the emerging routing and expert mixing strategies, incredibly performant for daily needs.
While your aims are undoubtedly sincere, in practice the 'local AI' target audience, people building their own rigs, usually has 4TB or more of fast SSD storage.
The bottom tier (not meant disparagingly) are people running diffusion models, as these do not have the high VRAM requirements. They generate tons of images or video, going from a one-click install like Easydiffusion to very sophisticated workflows in ComfyUI.
Those going the LLM route, which would be your target audience, quickly run into the problem that the hardware and software requirements and expertise grow exponentially once you go beyond toying around with small, highly quantized models and small context windows.
In light of the typical enthusiast investments in this space, the few TB of fast storage will pale in comparison to the rest of the expenses.
Again, your work is absolutely valuable, it is just that the storage space requirement for the vector store in this particular scenario is not your strongest card to play.
Everyone benefits from focusing on efficiency and finding better ways of doing things. Those people with 4TB+ of fast storage can now do more than they could before as can the "bottom tier."
It's a breath of fresh air anytime someone finds a way to do more with less rather than just wait for things to get faster and cheaper.
Of course. And I am not arguing against that at all. Just like if someone makes an inference runtime that is 4% faster, I'll take that win. But would it be the decisive factor in my choice? Only if that was my bottleneck, my true constraint.
All I tried to convey was that for most of the people in the presented scenario (personal emails etc.), a 50 or even 500GB storage requirement is not going to be the primary constraint. So the suggestion was that the marketing for this use case might be better off spotlighting something else as well.
You are glossing over the fact that for RAG you need to search over those 500GB+ which will be painfully slow and CPU-intensive. The goal is fast retrieval to add data to the LLM context. Storage space is not the sole reason to minimize the DB size.
You’re not searching over 500GB, you’re searching an index of the vectors. That’s the magic of embeddings and vector databases.
Same way you might have a 50TB relational database but “select id, name from people where country=‘uk’ and name like ‘benj%’” might only touch a few MB of storage at most.
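For example, with hnswlib (any ANN library behaves the same way; the numbers here are arbitrary), a query walks a small neighborhood of the graph rather than scanning every stored vector:

    import hnswlib
    import numpy as np

    dim, n = 768, 100_000
    data = np.random.rand(n, dim).astype(np.float32)

    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=n, ef_construction=200, M=16)
    index.add_items(data, np.arange(n))
    index.set_ef(64)          # search breadth: bigger = better recall, more work

    labels, distances = index.knn_query(data[:1], k=10)   # visits a few hundred nodes, not all n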
That’s precisely the point I tried to clear up in the previous comment.
The LEANN author proposes to create a 9GB index for a 500GB archive, and the other poster argued that it is not helpful because “storage is cheap”.
Speak for yourself! If it took me 500GB to store my vectors, on top of all my existing data, it would be a huge barrier for me.
A 4TB external drive is £100. A 1TB SD card or USB stick is a similar cost.
Maybe I'm too old to appreciate what “fast” means, but storage doesn't seem an enormous cost once you stripe it.
This "...doesn't seem an enormous cost once you stripe it." gave me an idea. I KNOW that I will come back to link a blog post about it in the future.
Maybe time to update your storage?
Take whatever you're indexing and make it 16-20x and that’s a good approximation of what the vector db’s total size is going to be.
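Back-of-envelope for where a multiplier like that comes from (every number below is an assumption; bigger chunks or quantized vectors pull the ratio way down):

    chunk_chars   = 200        # small chunks for precise retrieval, ~200 bytes of text
    dim, float_sz = 768, 4     # float32 embeddings
    per_chunk = dim * float_sz + 128     # vector plus rough graph-link overhead
    print(per_chunk / chunk_chars)       # 16.0, i.e. ~16x the text being indexed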
Why is it like that, currently? There is no information added by a vector index compared to the original text. And the text is highly redundant and compressible with even lossless functions. Furthermore a vector index is already lossy and approximate. So conceptually it is at least possible to have an index that would be a fraction of the size of what is indexed?
That can't be the correct paper...
I think you meant this: https://arxiv.org/abs/2506.08276
No no, getting your entire workflow local requires solving P=NP.
Wait how?
Just some sarcasm. You can safely disregard if you didn’t get a chuckle.
Yeah that's it. My bad lol
It feels weird that the search index is bigger than the underlying data, weren't search indexes supposed to be efficient formats giving fast access to the underlying data?
Exactly. That's because instead of just mapping keywords, vector search stores the meaning of the text as dense, high-dimensional embeddings, which become massive data structures, and LEANN is our solution to that paradoxical inefficiency.
Good point! Maybe indexing is a bad term here, and it's more like feature extraction (and since embeddings are high dimensional we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.
Why would the embeddings be higher dimensional than the data? I imagine the embeddings would contain relatively higher entropy (and thus lower redundancy) than many types of source data.
depends on the chunk-size used to create the embedding.
I guess for semantic search (rather than keyword search), the index is larger than the text because we need to embed it into a huge semantic space, which makes sense to me.
Nonclustered indexes in RDBMS can be larger than the tables. It’s usually poor design or indexing a very simple schema in a non-trivial way, but the ultimate goal of the index is speed, not size. As long as you can select and use only a subset of the index based on its ordering it’s still a win.
Why is it considered relevant to have a RAG of people's digital traces burdening them in every single interaction they have with a computer?
Having this kind of grounding distributed locally is one thing. Pushing everyone too much into their own information bubble is another, orthogonal topic.
When someone's mind recalls that email from years before, having the option to find it again in a few instants can be interesting. But when the device starts to funnel you through past traces, it doesn't matter much whether the solution is local or remote: the spontaneous flow of thought is hijacked.
In mindset dystopia, the device prompts you.
Since it’ll be local, this behavior can be controlled. I for one find the option of it digging through my personal files to give me valuable personal information attractive.
Thank you for the pointer to LEANN! I've been experimenting with RAGs and missed this one.
I am particularly excited about using RAG as the knowledge layer for LLM agents/pipelines/execution engines to make it feasible for LLMs to work with large codebases. It seems like the current solution is already worth a try. It really makes it easier that your RAG solution already has Claude Code integration![1]
Has anyone tried the above challenge (RAG + some LLM for working with large codebases)? I'm very curious how it goes (thinking it may require some careful system-prompting to push agent to make heavy use of RAG index/graph/KB, but that is fine).
I think I'll give it a try later (using cloud frontier model for LLM though, for now...)
[1]: https://github.com/yichuan-w/LEANN/blob/main/packages/leann-...
I'm gonna put it here for visibility: Use patchright instead of Playwright: https://github.com/Kaliiiiiiiiii-Vinyzu/patchright
What problem does patchright solve?
Not being detected by things like bot detection.
I know next to nothing about embeddings.
Are there projects that implement this same “pruned graph” approach for cloud embeddings?
It’s in the works… been meaning to do a Show HN moment to see if it flies or I fall on my face.
This looks incredibly useful for making large-scale local AI truly practical.
This is annoyingly Apple-only though. Even though my main dev machine is a Macbook, this would be a LOT more useful if it was a Docker container.
I'd still take a Docker container over an Apple container, because even though docker is not VM-level-secure, it's good enough for running local AI generated code. You don't need DEFCON Las Vegas levels of security for that.
And also because Docker runs on my windows gaming machine with a fast GPU with WSL ubuntu, and my linux VPS in the cloud running my website, etc etc. And most people have already memorized all the basic Docker commands.
This would be a LOT better if it was just a single docker command we can copy paste, run it a few times, and then delete if necessary.
I’m no expert on these things, but since Apple Containerization uses OCI images, I’d think you’d be able to sub in Docker (or Podman, etc) as the runtime pretty trivially. Like Podman, it uses a very similar command line interface to Docker’s.
Edit: Oh, I see now that Coderunner is Apple Containerization-specific.
I have 26TB hard drives; 50GB doesn't scare me. Or should it?
I think you'd want things in RAM for performance reasons but would love to be corrected by people with more knowledge/experience on the subject
Oh, the number was memory space? That changes the maths a little bit. But I do have 50GB available for a model, no problem whatsoever. 384GB is the new 32GB.