I am really puzzled by TPUs. I've been reading everywhere that TPUs are powerful and a great alternative to NVIDIA.
I have been playing with TPUs for a couple of months now, and to be honest I don't understand how people can use them in production for inference:
- almost no resources online showing how to run modern generative models like Mistral, Yi 34B, etc. on TPUs
- poor compatibility between JAX and PyTorch
- very hard to understand the memory consumption of the TPU chips (no nvidia-smi equivalent)
- rotating IP addresses on TPU VMs
- almost impossible to get my hands on a TPU v5
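On the memory point: there is a partial substitute for nvidia-smi from inside a JAX process. This is a sketch assuming a recent JAX install; `jax.Device.memory_stats()` is supported on TPU and GPU backends but may return None (or be unsupported) on others, so it's guarded here:

```python
import jax


def device_memory_report():
    """Collect (device, stats) pairs for all local devices.

    On TPU/GPU backends, memory_stats() returns a dict with keys such as
    'bytes_in_use' and 'bytes_limit'; elsewhere it may return None or be
    unimplemented, so both cases are handled.
    """
    report = []
    for d in jax.local_devices():
        try:
            stats = d.memory_stats()
        except NotImplementedError:
            stats = None
        report.append((d, stats))
    return report


for dev, stats in device_memory_report():
    if stats is None:
        print(f"{dev}: memory stats not available on this backend")
    else:
        used = stats.get("bytes_in_use", 0) / 2**30
        limit = stats.get("bytes_limit", 0) / 2**30
        print(f"{dev}: {used:.2f} GiB in use of {limit:.2f} GiB")
```

It only sees the current process's allocations, so it isn't a true host-wide nvidia-smi replacement, but it's usually enough to debug OOMs.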
Is it only me? Or did I miss something?
I totally understand that TPUs can be useful for training though.
> I think the tech stack needs another 12-18 months to mature
Google has been doing AI before any other company even thought about it. They are on the 6th generation of TPU hardware.
I don't think there is any maturity issue, just an availability issue because they are all being used internally.
100% agree. If I had access to the TPU team internally, it would be very easy to use in production.
If you aren't internal, the documentation, support, and even just general bug fixing is impossible to get.
(Has an expert team dedicated solely to optimizing for exotic hardware) = an option
(Doesn't have a team like that) = stick to mass-use, commodity hardware
That's generally been the trade-off since ~1970. And usually, the performance isn't worth the people-salaries.
How many examples of successful hardware that isn't well-documented and doesn't have drop-in 1:1 SDK coverage vs (more popular solution) are there?
It seems like a heavy-lift to even get something that does have parity in those ways adopted, given you're fighting market inertia.
Google sells access to TPUs in its cloud platform, so you'd think they would be more open about sharing development and tooling frameworks for TPUs. It's like Borg (closed source, never used outside Google, made them no profit) vs. Kubernetes (open source, used everywhere, makes them profit).
> Google has been doing AI before any other company even thought about it
This is not even remotely true. SRI was working on AI in various forms long before Google existed.
Who or what is SRI?
next to NASA, probably the most innovative organization in human history
https://www.sri.com/timeline-of-innovation/
See http://www.sri.com
I feel like I have been hearing that since the v1 TPU. I think it works inside Google because there are teams whose job is to take a model and TPUify it. Elsewhere there is no such team, so it's no fun.
I agree with that, and I'm not sure they'll be able to improve the stack dramatically by themselves without the open-source community being more involved.