8B coefficients are packed into 53B transistors, about 6.6 transistors per coefficient. A two-input NAND gate takes 4 transistors, and a register takes about the same. So one coefficient gets processed (multiplied by its input, with the result added to a sum) with fewer than two two-input NAND gates.
I think they used block quantization: one can enumerate all possible blocks up to permutation (that is, all sorted tuples of coefficients) and, for each layer, place only those blocks that are needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 different blocks are needed.
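The 330 figure can be checked by counting sorted 4-tuples over 8 levels, i.e. multisets of size 4 drawn from 8 values. A minimal Python sketch, assuming "3-bit" means 8 distinct levels and that blocks differing only in coefficient order count as one (the names below are illustrative, not from any real tool):

```python
from itertools import combinations_with_replacement
from math import comb

LEVELS = 8       # assumption: a 3-bit coefficient has 2**3 distinct values
BLOCK_SIZE = 4   # coefficients per block, as in the comment above


def count_sorted_blocks(levels: int, block_size: int) -> int:
    """Count blocks by brute force, one per sorted tuple of coefficient levels."""
    return sum(1 for _ in combinations_with_replacement(range(levels), block_size))


enumerated = count_sorted_blocks(LEVELS, BLOCK_SIZE)
# Closed form for multisets of size k from n items: C(n + k - 1, k).
closed_form = comb(LEVELS + BLOCK_SIZE - 1, BLOCK_SIZE)
print(enumerated, closed_form)
```

Both counts agree, and the closed form makes it easy to see how quickly the block library grows with larger blocks or more bits.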
Matrices in Llama 3.1 are 4096x4096, about 16M coefficients. They can be compressed into only 330 blocks, assuming every sorted combination of coefficients occurs, plus a network that applies the correct permutations to the inputs and outputs.
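To illustrate the permutation idea: any block of coefficients can be mapped to its sorted form plus a routing of the inputs, and the dot product is unchanged. This is only a sketch of my reading of the scheme, with made-up weights and inputs; `canonicalize` and `dot` are hypothetical helpers:

```python
def canonicalize(block):
    """Return (sorted_block, perm) such that sorted_block[i] == block[perm[i]]."""
    perm = sorted(range(len(block)), key=lambda i: block[i])
    return tuple(block[i] for i in perm), perm


def dot(weights, inputs):
    """Plain multiply-accumulate over one block."""
    return sum(w * x for w, x in zip(weights, inputs))


weights = [5, 0, 7, 2]            # made-up 3-bit coefficient levels
inputs = [1.0, -2.0, 0.5, 3.0]    # made-up activations

canon, perm = canonicalize(weights)
# Route the inputs with the same permutation so each input still meets its weight.
routed = [inputs[i] for i in perm]

original = dot(weights, inputs)
via_canonical = dot(canon, routed)
print(canon, perm, original, via_canonical)
```

The sorted block is what would be shared across the chip; the permutation is what the routing network would have to encode per use.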
Assuming that the blocks are the most area-consuming part, we have a per-block budget of about 250 thousand transistors, or about 30 thousand two-input NAND gates per block.
250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
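The back-of-envelope numbers can be reproduced in a few lines of Python. All constants are the figures quoted in this thread (not measured values), and the variable names are just for illustration:

```python
TOTAL_TRANSISTORS = 53e9         # quoted chip transistor count
TOTAL_COEFFS = 8e9               # quoted model size
MATRIX_DIM = 4096                # quoted matrix dimension
BLOCKS = 330                     # distinct sorted blocks (3-bit, block size 4)
TRANSISTORS_PER_BLOCK = 250_000  # assumed per-block budget from the estimate
TRANSISTORS_PER_NAND2 = 4        # two-input NAND gate

# Whole-chip budget per coefficient, in transistors and in NAND2 equivalents.
chip_per_coeff = TOTAL_TRANSISTORS / TOTAL_COEFFS
nand2_per_coeff = chip_per_coeff / TRANSISTORS_PER_NAND2

# Block-library budget spread over one 4096x4096 matrix.
matrix_coeffs = MATRIX_DIM ** 2
block_per_coeff = TRANSISTORS_PER_BLOCK * BLOCKS / matrix_coeffs

print(f"{chip_per_coeff:.2f} transistors/coeff on the chip")
print(f"{nand2_per_coeff:.2f} NAND2 equivalents/coeff")
print(f"{block_per_coeff:.2f} transistors/coeff from the block library")
```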
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
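One way to read "in disguise": assuming FP4 here means the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) used by the microscaling formats, the 16 codes have only 8 distinct magnitudes, so the magnitude is a 3-bit quantity and the sign could be handled as add versus subtract. A small Python sketch under that assumption (`decode_e2m1` is a hypothetical helper):

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (assumed layout: s eem, exponent bias 1)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        mag = man * 0.5                           # subnormal: 0 or 0.5
    else:
        mag = (1 + man * 0.5) * 2.0 ** (exp - 1)  # normal values
    return sign * mag


# Collect the distinct magnitudes over all 16 codes.
magnitudes = sorted({abs(decode_e2m1(c)) for c in range(16)})
print(magnitudes)
```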
I'm looking forward to the model.toVHDL() method in PyTorch.
Ugh, quick, everyone start panic-buying FPGAs now.
The largest FPGAs have on the order of tens of millions of logic cells/elements. They’re not even remotely big enough to emulate these designs, except to validate small parts of them at a time, and unlike memory chips or GPUs, companies don’t need millions of them to scale infrastructure.
(The chips also cost tens of thousands of dollars each)
They also aren't power-friendly.
Pretty close to what you describe: https://github.com/fastmachinelearning/hls4ml
Deep Differentiable Logic Gate Networks
I see you and I raise approximate logic synthesis [1] [2].
[1] https://www.sciencedirect.com/science/article/pii/S138376212...
[2] https://arxiv.org/abs/2506.22772
You can synthesize a logic circuit that is only as complex as it needs to be to reach a given accuracy.
Deep differentiable logic networks, in my experience, do not scale well to larger logic elements (ones with more inputs). One still has to apply logic optimization and synthesis afterwards. So why not synthesize one's own approximate circuit to the accuracy one desires?
Is this a thing?
I gave a short talk about compiling PyTorch to Verilog at Latte '22. Back then we were just looking at a simple dot product operation, but the approach could theoretically scale up to whole models.
https://capra.cs.cornell.edu/latte22/paper/2.pdf
https://www.youtube.com/watch?v=QxwZpYfD60g
They mentioned that they are using strong quantization (IIRC 3-bit) and that the model was degraded by that. Also, they don't have to use transistors to store the bits.
I think they are talking about the transistors that apply the weights to the inputs.
gpt-oss is FP4. They're saying they'll try a mid-size model next (I'm guessing gpt-oss-20b) and then a large one (I'm guessing gpt-oss-120b), as their hardware is FP4-friendly.
What's the theoretical full wafer-scale model they could produce?