Happy that this paper finally made it to arXiv. The biggest reason for writing it was to showcase some of the breadth of applications we see for really high-quality, first-class AD support at the language level. There are several communities that need this technology, so it makes sense to try to build one system that can address all of them and share tricks. I'm also hoping this gives people a sense of why our list of feature requirements for this system is so extensive (pervasive custom gradients, higher-order AD, mixed mode, ultra-fast scalar AD, etc.). These days AD is often discussed only in the context of deep learning, but as the frontiers of deep learning are pushed and hybrid models become more and more popular, I'm expecting our focus on generality to pay off here.
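To give a flavor of what I mean by pervasive custom gradients and higher-order AD, here's a toy sketch (mine, just for illustration, not an example from the paper) using Zygote's @adjoint macro:

    using Zygote
    using Zygote: @adjoint

    # Define a primitive with a hand-written gradient rule:
    mylog(x) = log(x)
    @adjoint mylog(x) = mylog(x), ȳ -> (ȳ / x,)

    # Higher-order AD composes with it: differentiate the derivative.
    f(x)   = mylog(x^2)
    df(x)  = gradient(f, x)[1]    # analytically 2/x
    d2f(x) = gradient(df, x)[1]   # analytically -2/x^2

    df(3.0), d2f(3.0)             # ≈ (0.667, -0.222)

The custom rule is ordinary Julia code, so the second gradient call can differentiate straight through the hand-written pullback.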
Keno: first of all, let me give a big public thank you to you and your colleagues. (For those here who don't know, Keno is listed as one of the authors of the paper, and works closely with Mike Innes, lead author and also lead developer of Zygote. Mike is also an active member of HN.)
Second, let me bring up what I think is a significant issue. My perception is that most deep learning researchers and practitioners -- that includes me -- tend to iterate very rapidly over new ideas. We want to test ideas in code as quickly as possible, and in practice we will not invest the time and effort necessary to figure out how to write cache-optimized kernels (e.g., for GPUs) every time we might need one. In fact, I'd say the default attitude is that it doesn't even make sense for us to use (what we perceive as slow) automated kernel-writing tools to search for and compile optimized kernels. In practice, it always seems faster and easier (from a developer-time standpoint) to write code that leverages existing, inflexible, prebuilt, hand-tuned kernels that have already proven to work well, are fast, and are nicely integrated with frameworks like PyTorch and TensorFlow.
There was a good discussion of this topic in the forums a few weeks ago[a] in response to a recent paper published by some folks at Google in which they make a compelling case that this issue is holding back AI research.[b] As an example, they use capsule networks[c], the implementations of which have proven difficult to optimize (e.g., they copy a lot more data around than is strictly necessary, in order to slice and reshape data in a manner that is compatible with preexisting hardware-accelerator kernels).
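(To make the copying issue concrete, here's a rough Julia sketch of my understanding -- not code from any of these papers: rearranging data into the layout a prebuilt kernel expects materializes a copy, while a zero-copy view generally isn't accepted by hand-tuned kernels like BLAS GEMM.)

    # Illustrative only: reordering data for a prebuilt kernel copies memory.
    A = rand(Float32, 64, 128, 32)        # e.g. (features, capsules, batch)
    B = permutedims(A, (2, 1, 3))         # allocates a fresh 128x64x32 copy
    Bv = PermutedDimsArray(A, (2, 1, 3))  # zero-copy view, but hand-tuned
                                          # GEMM kernels want contiguous data
    C = reshape(B, 128, :) * rand(Float32, 64 * 32, 10)  # fast BLAS matmul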
What are your thoughts on this? Will Zygote provide any improvements or advantages on this front?
--
[a] https://news.ycombinator.com/item?id=20301619
[b] https://dl.acm.org/citation.cfm?id=3321441
[c] https://arxiv.org/abs/1710.09829
Zygote is an orthogonal piece of technology on this front and relies on a good optimizing compiler behind it to target actual hardware. Its focus is primarily on expressiveness. We've been talking about automatic kernel generation for a while (when I say kernel generation, what I mean is basically a search over access patterns), but note that it's not quite as bad a problem in Julia, because you can use higher-order functions to get a lot of composability (e.g., if there's a hand-optimized parameterized matmul, fusing in inputs and outputs is just a matter of passing in extra functions -- see the sketch below). There's some promising academic work at the Julia Lab. We're also in discussions with commercial partners who care about this kind of thing to see if somebody is willing to fund it, but so far existing solutions have been good enough while we work on higher-priority items. I do agree that being limited to a few hand-tuned kernels is a significant drag on research, so I hope we can do something here.
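Concretely, the kind of composability I mean looks something like this (hypothetical kernel, just to show the shape of the API; a real implementation would tile and vectorize the inner loops):

    # A matmul kernel parameterized by element-wise functions, so fusing
    # input/output transformations is just passing extra arguments.
    function fused_matmul!(C, A, B; pre = identity, post = identity)
        m, k = size(A); n = size(B, 2)
        @inbounds for j in 1:n, i in 1:m
            acc = zero(eltype(C))
            for l in 1:k
                acc += pre(A[i, l]) * pre(B[l, j])
            end
            C[i, j] = post(acc)
        end
        return C
    end

    # Fuse `abs` on the inputs and `tanh` on the output in a single pass:
    A, B = rand(4, 5), rand(5, 3)
    fused_matmul!(zeros(4, 3), A, B; pre = abs, post = tanh)

Because Julia specializes the kernel on the function arguments, `pre` and `post` get inlined into the inner loop rather than costing an indirect call per element.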
In terms of trying to break free of dependence on hand-optimized kernels: a few people, myself included, have been working on some theoretical approaches to generating cache-efficient rearrangements for neural-net-like problems. We've worked it out for convolution-like problems [1] and have some upcoming results generalizing these techniques to other problems. Please feel free to email if you'd like to talk.
[1] https://arxiv.org/abs/1802.06905
Thank you for your response. Makes sense, and I'm happy to hear you and others are aware of the issue.
PS. I now realize I asked my question without thinking it through first; sorry about that. I temporarily "forgot" that Zygote is a source-to-source AD package because, as someone who is developing and iterating over deep learning models for potential deployment to production, I naturally tend to think in terms of monolithic software stacks -- e.g., "the TensorFlow stack," "the PyTorch stack," "the nascent Julia stack," and so on.
There are people in the Julia Lab working on high level tensor operation languages and compilers. It's a hard problem but one that many are interested in solving with Julia.
Chris: As a user of these tools, I cannot tell you how thankful I am for the work you, Keno, Mike and others do. (For those who don't know, Chris works closely with Mike, Keno, and others in the Julia team.)
I recognize that this is a hard problem.[a]
FWIW, I read or heard (can't remember which) that there are people working with Chris Lattner seeking to use predictive AI (instead of search and heuristics) to address this issue in MLIR. Let me add that my understanding of how that would work is very limited, though.
[a] Only superficially. As you can imagine, I'm dealing at a very different level of abstraction with my own set of problems and frustrations.
I'm not familiar with that part of the MLIR work, though I wouldn't be surprised if they are working on it. Google Brain is doing some of the finest work on ML for systems programming, so this'd be right up their alley. If it works well and they come up with some useful models, we'll be more than happy to incorporate them (or just use MLIR directly -- there's a talk at JuliaCon next week from a Google engineer on using Julia as an MLIR frontend: https://pretalx.com/juliacon2019/talk/3YBZLC/).
Congratulations on some great work. I like the diversity shown in the examples. In particular, it's nice that you threw those working with stochastic processes a bone.
Does the AD algorithm support functions with varying input sizes? For example, what if the input to a function is an array whose size isn't known in advance?
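To illustrate what I mean, something like this (a sketch using Zygote's exported gradient function):

    using Zygote

    f(x) = sum(abs2, x)          # defined for a vector of any length
    gradient(f, rand(3))[1]      # 3-element gradient
    gradient(f, rand(10))[1]     # 10-element gradient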