Hacker News

fheinsen•12h

Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation arxiv.org

83 comments

thomasahle•11h
There's a graveyard of 100s of papers with "approximate near linear time attention."
They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.
Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214
- jcarreiro•10h
  The paper says that:
  > In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.
  i.e., the claim is that this method reproduces the results of conventional attention, up to float16 numerical precision.
  - kristjansson•7h
    > approximately the same magnitude
    and they really do mean that, their results show +/- 1 on log10 plots.
    - cptroot•2h
      I don't think this is an accurate characterization of the error magnitude? Their error plots (from appendix 3) are all showing `log_10(|Y - \dot{Y}|)` as having a median of ~-3 (difference of 0.001) and a max of ~-1.5 (difference of 0.035), and this is with only 3 Taylor terms.
  - fheinsen•9h
    The method is more general. The github repository's first example is with eight Taylor terms (P = 8).
    - torginus•2h
      I'm clueless about this whole thing, but from my EE education I remember that in general:
      Taylor approximations converge slowly in terms of error if the function they're representing is discontinuous (the error disappears quadratically if continuous, linearly if not), and they tend to create highly energetic swings near discontinuities (similarly to Fourier series with Gibbs oscillations).
      Moreover, Taylor series are inherently nonlinear, and much of the mathematical toolset around AI assumes general linearity (cue linear algebra), with the exception of sigmoids, and going beyond cubic approximations tends to make errors worse (as expressed in SNR).
  - energy123•9h
    It converges on conventional attention as P goes up
- kristjansson•10h
  > self-attention is efficiently computable to arbitrary precision with constant cost per token
  This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.
  - logicchains•10h
    It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must hence necessarily lose some information and be unable to support perfect recall on longer sequences.
    - orlp•7h
      > N tokens looking at N tokens is quadratic
      Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
      Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:
      total = 0
      for i in range(len(a)):
          for j in range(len(b)):
              total += a[i] * b[j]
      This can be computed in linear time as sum(a) * sum(b).
      Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.
      - CrazyStat•3h
        One of my favorite bits of my PhD dissertation was factoring an intractable 3-dimensional integral
        \iiint f(x, y, z) dx dy dz = \int [\int g(x, y) dx]*[\int h(y, z) dz] dy
        which greatly accelerated numerical integration (O(n^2) rather than O(n^3)).
        My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.
      - anvuong•6h
        This brings me back to DSP class, man learning about FFT was eye-opening.
      - noosphr•2h
        Convolution is a local operation.
        Attention is a global operation.
      - logicchains•5h
        That's like saying sorting can be done in O(n) because radix sort exists. If you assume some structure, you lose generality, i.e. there'll be some problems it's no longer able to solve. It can no longer approximate any arbitrary function that needs perfect memory over the sequence.
    - hellohello2•9h
      I'm not saying if the paper is correct or not (since I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:
      Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must hence necessarily lose some information.
      - nine_k•27m
        Integer multiplication x * y can be trivially done in O(k): k = log₂(min(x, y)). This is because we can do addition in constant time, adding all bits in parallel.
        By combining many more adding units, we can do (fixed-size) multiplication in constant time, too: https://en.wikipedia.org/wiki/Dadda_multiplier
      - sifar•13m
        Multiplication can be sub-quadratic using Karatsuba's algorithm.
      - actionfromafar•8h
        Doesn't that have to do with how many bits you allow in the actual calculation in physical reality?
        hellohello2•7h
        Well, for multiplication, complexity is defined directly in terms of the number of digits/bits. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in reality it doesn't matter since there is no value in anything over float16 for machine learning.
      - logicchains•4h
        Multiplication has some properties like being cumulative. If we assume the sequence has any specific properties then we no longer have a general sequence model.
        direwolf20•3h
        I think you meant commutative.
        Attention also has some specific properties.
        And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? https://news.ycombinator.com/item?id=44055347
    - naasking•8h
      Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.
      - logicchains•5h
        It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.
        direwolf20•3h
        Neural nets are structured as matrix multiplication, yet, they are universal approximators.
        noosphr•2h
        You're missing the non-linear activations.
        naasking•4h
        But language does have structure, as does logic and reasoning. Universal approximation is great when you don't know the structure and want to brute force search to find an approximate solution. That's not optimal by any stretch of the imagination though.
    - oasisaimlessly•9h
      That argument could also be used to say that the FFT's time complexity of O(n log n) should be impossible.
  - energy123•10h
    It's like claims of room temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.
    - SeanAnderson•10h
      Well, we solved one of the Millennium Prize problems (honestly kinda quickly) so maybe there's hope :)
- fheinsen•10h
  As the error via linear approximation approaches similar magnitude as numerical error via quadratic computation, don’t the two start becoming comparable in practice?
  I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.
  Numerical error, in fact, may be a key factor as to why quadratic attention, in practice, exhibits context rot as context gets longer, analogous to an RNN:
  https://www.anthropic.com/engineering/effective-context-engi...
- antirez•6h
  I agree with the fundamental idea that attention must be O(N^2), with the exception of the recent DeepSeek sparse attention approach (DSA), which does not escape N^2 but attempts to lower the constant factor so much that N^2 is more acceptable, by creating a much faster layer that predicts high-scoring tokens.
- cobolexpert•10h
  Dumb question: is the quadratic time complexity for training, inference, or both?
  - dave_universetf•9h
    Both, with caveats. The attention computation is fundamentally quadratic: for every token in the sequence, you're doing a computation that has to compute over every other token in the sequence. So it's O(N) per token, O(N^2) for the whole sequence.
    The big mitigation for this is that in causal transformers (i.e. all the chatbot type applications, where each token is only allowed to see tokens before it), you're running inference repeatedly on the same prefix in order to grow it by one token at a time. So if you cache the computations for tokens 0..N-1, on each inference pass you only have to compute O(N) for the newly added token at the end of the sequence.
    That's why caching (and caching charges) appear so prominently everywhere in the pricing of inference.
    In practice, caching is most beneficial at inference time, because you typically have relatively long conversations that start with the same cacheable prefix (the system prompt). At training time the same optimization can apply, but you're typically not pushing the same prefixes through the model repeatedly so you end up paying the quadratic cost more often.
    The quadratic cost of attention is the fundamental compute bottleneck for transformer architectures, which is why there's research like this trying to find shortcuts in computing attention, as well as research into completely new primitives to replace attention (e.g. SSM, which is O(N) on a cold cache and O(1) on a warm cache).
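A toy sketch of the caching argument above, assuming single-head causal attention in pure Python where each token's embedding doubles as its query, key, and value. The helper names (`attend`, `decode_with_cache`, `decode_without_cache`) and the data are hypothetical, not taken from any library. It counts query-key dot products to show that a cache makes each new token cost O(N), while the total over the sequence stays quadratic:

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over the given keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]

def decode_with_cache(tokens):
    """Process tokens one at a time, reusing cached keys/values."""
    k_cache, v_cache, outputs, dot_products = [], [], [], 0
    for x in tokens:
        k_cache.append(x)  # toy assumption: key = value = query = embedding
        v_cache.append(x)
        outputs.append(attend(x, k_cache, v_cache))
        dot_products += len(k_cache)  # O(N) work for the new token only
    return outputs, dot_products

def decode_without_cache(tokens):
    """Recompute attention for the whole prefix at every step."""
    outputs, dot_products = [], 0
    for n in range(1, len(tokens) + 1):
        prefix = tokens[:n]
        outputs = [attend(prefix[i], prefix[: i + 1], prefix[: i + 1]) for i in range(n)]
        dot_products += n * (n + 1) // 2
    return outputs, dot_products

tokens = [[0.1 * i, 1.0 - 0.05 * i] for i in range(16)]
cached_out, cached_cost = decode_with_cache(tokens)
full_out, full_cost = decode_without_cache(tokens)
print(cached_cost, full_cost)
```

Both paths produce the same outputs; only the amount of repeated work differs.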
  - omneity•10h
    Attention is calculated during the forward pass of the model, which happens in both inference (forward only) and training (forward & backward).
    - SubiculumCode•10h
      Dumb question: Can inference be done in a reverse pass? Outputs predicting inputs?
      - dave_universetf•8h
        Strictly speaking: no. The "forward pass" terminology does not imply that there exists a "reverse pass" that does the same kind of computation. Rather, it's describing two different kinds of computation, and the direction they occur in.
        The forward pass is propagating from inputs to outputs, computing the thing the model was trained for. The reverse/backwards pass is propagating from outputs back to inputs, but it's calculating the gradients of parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether it makes the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.
        The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.
        And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.
        direwolf20•3h
        I do want to see ChatGPT running upwards on my screen now, predicting earlier and earlier words in a futile attempt to explain a nonsense conclusion. We could call it ChatJeopardy.
      - gpm•9h
        Not as trivially as the forwards direction, unsurprisingly information is lost, but better than you might expect. See for example https://arxiv.org/pdf/2405.15012
      - root_axis•10h
        Sounds like a great premise for a sci-fi short story.
        anu7df•9h
        Sci-fi ? You mean historical fiction!
- wetwater•4h
  I agree. This is from the paper mill for the paper mill.
- WhitneyLand•9h
  The 2023 paper even if true doesn’t preclude the 2026 paper from being true, it just sets constraints on how a faster attention solution would have to work.
- naasking•10h
  I think any kind of innovation here will have to take advantage of some structure inherent to the problem, like eliminating attention in favour of geometric structures like Grassmann flows [1].
  [1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428
  - findalex•10h
    Right - e.g., if you're modeling a physical system it makes sense to bake in some physics - like symmetry.
    - naasking•9h
      Indeed, and I think natural language and reasoning will have some kind of geometric properties as well. Attention is just a sledgehammer that lets us brute force our way around not understanding that structure well. I think the next step change in AI/LLM abilities will be exploiting this geometry somehow [1,2].
      [1] GrokAlign: Geometric Characterisation and Acceleration of Grokking, https://arxiv.org/abs/2510.09782
      [2] The Geometry of Reasoning: Flowing Logics in Representation Space, https://arxiv.org/abs/2506.12284
- quotemstr•3h
  You can't stuff O(N) bits in O(1) space, so any scheme that purports, in general, to do constant-time inference on unbounded context is snake oil, like a perpetual motion machine. Every such scheme must decay somehow. All you can do is choose how it decays.
- cubefox•11h
  I think DeepSeek V3.2 is sub n^2, but it clearly performs quite well, refuting the alleged lower bounds in the paper.
  - andy12_•10h
    It really isn't sub N^2. The main attention is only O(Nk), but only thanks to a lightning indexer that still has complexity O(N^2). So overall it still has the same complexity; just with a smaller constant factor [1]
    > DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus
    [1] https://arxiv.org/pdf/2512.02556
    - cubefox•8h
      Okay, then let's see whether we are going to see real linear architectures, like Gated DeltaNet or Mamba-3, in some larger models. I don't believe there is a "lower bound" which states that those can never get to (or exceed) the real-world performance of quadratic attention. (Perfect recall in unrealistic needle-in-haystack tests doesn't count.)
      - andy12_•4h
        I'm also sure that some kind of linear architecture is possible. After all, humans don't have N^2 perfect recall either.
amluto•9h
I skimmed the paper, and I think I completely lost the plot.
Sections 2.1 through 2.4 talk about decomposing the per-token-pair attention (key vector from the ith token with query vector from the jth token, where, in inference, the jth token is the one being sampled) into an approximation that is only mildly outrageously exponential in size compared to the original exponential-of-a-dot product. And they get something that's a polynomial (in the mathematical sense -- you're literally evaluating a polynomial) and has a size that's manageable at 4th order.
Okay, great, they took something simple and made it bigger and nastier but less transcendental without losing too much precision. (As far as I know, there is really nothing special about the exp in attention in the first place, so trying to approximate it well seems mostly useful insofar as it will keep existing models working.)
But the reason that attention is quadratic is that each token gets evaluated with respect to each other token. They haven't changed this at all. Section 2.5 seems like it's deferring this to an appendix. Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).
So what did the paper gain? Is there some detail that I missed or that the paper completely glossed over that explains why there is any gain of efficiency at all?
For what it's worth, the paper's overall claim is, in some sense, impossible. You can think of attention as being a sort of vector database, and this gets more accurate the sharper you make the exponential. If you replace softmax with actual max, a query locates the key that is the closest match to the query and returns the associated value. This operation is a plain linear search, it's possible (in principle anyway) to do lots of queries and recover the entire contents of the database, and I think that any paper claiming to do it faster than linear time should explain how it's compressing the data and where the loss is.
In language model terms, imagine a prompt like so:
```
    1: [string 1]
    2: [string 2]
    3: [string 3]
    ...
    n: [string n]
    
    Tell me the string associated with the number k.
```
As long as there's enough precision and enough query/key space to fit some embedding of the number k that will match the right thing (and there is a lot of room in high-dimensional spaces), one might expect a transformer to be able to answer this question. But this obviously requires memory with size linear in the prompt length. If you try to get rid of that, you necessarily lose something. (This is not to say that nice attention scaling is impossible -- one could imagine schemes where it takes the model multiple tokens to answer the question, and the number of tokens needed could scale, say, logarithmically with prompt size. But you still need that linear memory.)
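A minimal sketch of the "attention as a vector database" picture above, assuming a hypothetical 8-entry store with unit-norm 2-D keys and scalar values. The function name `soft_lookup` and the `sharpness` multiplier on the logits are illustrative choices, not anything from the paper:

```python
import math

def soft_lookup(query, keys, values, sharpness):
    """Softmax-weighted average of values; `sharpness` scales the logits."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / z

# Hypothetical "database": unit-norm keys on a circle, arbitrary stored values.
n = 8
keys = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)] for i in range(n)]
values = [5.0, -2.0, 9.0, 30.0, 1.0, -7.0, 4.0, 12.0]

query = keys[3]  # exact match for entry 3, whose stored value is 30.0
for sharpness in (1.0, 10.0, 100.0):
    print(sharpness, soft_lookup(query, keys, values, sharpness))
```

At low sharpness the result is a blend of many entries; as the sharpness grows it approaches the stored value of the best-matching key, which is the max-lookup limit described above.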
- fheinsen•9h
  This is a form of linear attention (https://arxiv.org/abs/2006.16236) that approximates standard scaled dot-product attention to arbitrary precision, by adding Taylor terms in an efficient manner. Each additional Taylor term improves the approximation. Efficiency is achieved by exploiting certain mathematical symmetries that become evident only after decomposing the standard formulation of attention into an expression over chains of tensor products. The github repository's README walks through examples. The first example is with 8 Taylor terms.
- yorwba•8h
  > But the reason that attention is quadratic is that each token gets evaluated with respect to each other token. They haven't changed this at all. Section 2.5 seems like it's deferring this to an appendix.
  They defer it to the appendix because it's a standard construction (Q'K)V = Q'(KV), where Q'K is an n×n matrix and requires O(n²) to compute, but KV has a constant size and can be computed in O(n) time, and the multiplication with Q' can also be done in O(n) time.
  > Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).
  Actually, their hidden state has a (large) constant size, so strike the words "per token" from section 2.6. In normal attention, the total state is n(d_v + d_k), but their state is basically (d_v + 1)D_k, where D_k is much larger than d_k, but independent of n. The +1 is because they also need to compute the normalization factor for the softmax.
  It's true that a constant state size implies that you cannot use it to losslessly store arbitrarily large databases, but LLMs in practice cannot do this either, so there's no loss of capability in that sense. (In fact, if you use enough terms in the Taylor expansion to get the same result as standard attention to within machine precision, the resulting constant state size should give you an upper bound for the amount of data the LLM can effectively retrieve from its context.)
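A hedged sketch of the reordering described above, assuming a second-order Taylor kernel 1 + q.k + (q.k)^2 / 2 and scalar values (d_v = 1). The feature map `phi` here uses the full outer product, so it contains redundant entries; it is not the paper's symmetry-aware construction, and all names and data are hypothetical:

```python
import math

def phi(x):
    """Feature map with phi(q) . phi(k) == 1 + q.k + (q.k)**2 / 2."""
    quad = [xi * xj / math.sqrt(2.0) for xi in x for xj in x]
    return [1.0] + list(x) + quad

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quadratic_form(queries, keys, values):
    """Reference: weights every (query, key) pair explicitly, O(n^2)."""
    out = []
    for q in queries:
        w = [1.0 + dot(q, k) + dot(q, k) ** 2 / 2.0 for k in keys]
        out.append(sum(wi * v for wi, v in zip(w, values)) / sum(w))
    return out

def constant_state_form(queries, keys, values):
    """Same result via a state whose size does not depend on n."""
    feat_dim = len(phi(keys[0]))
    s = [0.0] * feat_dim  # sum of phi(k) * v  (numerator state)
    z = [0.0] * feat_dim  # sum of phi(k)      (normalizer, the "+1")
    for k, v in zip(keys, values):
        f = phi(k)
        for i, fi in enumerate(f):
            s[i] += fi * v
            z[i] += fi
    return [dot(phi(q), s) / dot(phi(q), z) for q in queries]

n, d = 6, 3
keys = [[math.sin(1.0 + i + j) for j in range(d)] for i in range(n)]
queries = [[math.cos(0.5 * i - j) for j in range(d)] for i in range(n)]
values = [float(i) for i in range(n)]
ref = quadratic_form(queries, keys, values)
fast = constant_state_form(queries, keys, values)
print(max(abs(a - b) for a, b in zip(ref, fast)))
```

The state here has (d_v + 1) * D_k entries with D_k = 1 + d + d^2, independent of the number of tokens.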
- csense•9h
  This paper combines two different insights, the second one is buried in the appendix.
  Let's say you consider the 3 most-recent tokens. The first insight is that you can use a Taylor approximation: At token position 3 you compute A_3 = ((q1, q2, q3) . (k1, k2, k3))^1, B_3 = ((q1, q2, q3) . (k1, k2, k3))^2, C_3 = ((q1, q2, q3) . (k1, k2, k3))^3, etc. [1] [2]
  The second insight is that you can compute e.g. B_{i+1} incrementally from B_i, with much fewer FLOPS than computing B_{i+1} from scratch. [3]
  [1] I'd buy that it's empirically "good enough" that you don't need to go beyond D_3 (fourth degree polynomial).
  [2] I'd also buy that it's empirically "good enough" to assume the inputs aren't extreme enough for E_3, F_3 etc. to matter. I agree with other posters that radius of convergence worries aren't addressed. I find it plausible that these issues don't sink the paper. I'd not be surprised to learn that either it doesn't matter in practice, or workarounds can be implemented without much performance impact.
  [3] The author's choice to bury this insight in an appendix rather than putting it front and center is a baffling pedagogical choice but it's a small issue in the grand scheme of things. Perhaps that second insight is prior work (possibly by others) that experts in the latest LLM linear algebra could reasonably be expected to be familiar with, but is included as an appendix because it's not universally known in e.g. HN comment sections?
  - fheinsen•9h
    [3] is linear attention, https://arxiv.org/abs/2006.16236, a well-known result with ~3K citations: https://scholar.google.com/scholar_lookup?arxiv_id=2006.1623...
- jsenn•9h
  > Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention
  This is where you’ve gone off track. The “hidden state” for their model is a fixed size thing, like in an RNN, not per token. For a transformer, the “hidden state” is called the KV cache, and it grows with sequence length. This is why their method is linear not quadratic.
  The Taylor Series they derive isn’t just for softmax (after all, real implementations of softmax will likely already use the Taylor series!), it’s for the entire tensor-level softmax(QK) computation.
riemannzeta•9h
Neat result. The symmetry exploitation here reminds me of recent work connecting neural network training dynamics to renormalization group theory. Charles Martin's SETOL paper https://arxiv.org/abs/2507.17912 shows that well-trained layers converge to something like an RG fixed point—the eigenvalue spectrum of the weight matrix develops power-law tails with exponent α ≈ 2, which is the signature of scale invariance. At this fixed point, the "effective correlation space" is low-dimensional: you can truncate the SVD aggressively and recover nearly identical test accuracy.
I wonder if there's a connection to your Taylor truncation order. In RG terms, higher-order polynomial interactions are "irrelevant operators"—they get suppressed as you flow toward the fixed point. If trained attention heads are sitting near this fixed point, that might explain why modest truncation orders work: the network has already learned to concentrate its computation in the lower-order terms. A testable prediction: layers with α closer to 2 (measurable via weightwatcher https://github.com/CalculatedContent/WeightWatcher) might need fewer Taylor terms for accurate approximation than layers with α far from 2. If true, you could potentially use the spectral statistics to adaptively choose truncation order per-head.
- charleshmartin•15m
  Right. If the dynamics of training are governed by RG flow, then the best optimization path should remove redundant directions, as specified by the RG operator(s)
- fheinsen•8h
  Yes, there must be a connection. While adaptive truncation may prove impractical, it should be possible to measure spectral statistics on sample data, and specify a different fixed truncation order per layer, per head, etc. The github repository lists many other possible improvements: https://github.com/glassroom/sata_attention#proof-of-concept
bluecoconut•11h
I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention/the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sorta implies this), and this is where the Taylor expression would fail to represent the values well. This just... softens attention's ability to attend?
(I'm imagining that if in the context there's ~4-8 "similar" attention-targets that should be sharp, and regular attention learns to select the correct one, this Taylor approximation version would wash out any difference and they'd all loosely be attended to, and it'd fail to isolate the correct signal)
Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.
But also, maybe this combined with some other selective (sparse attention) trick, means that the hybrid model gets the "fuzzy long tail" of attention well represented as well as the sharpness well represented, and all together it could actually be a part of the larger solution.
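A small numeric sketch of the softening concern above, assuming one target key whose logit exceeds seven equal distractors by a hypothetical `logit_gap`, and assuming "four Taylor terms" means degrees 0 through 3. It uses only non-negative logits, where the truncated polynomial stays positive. Whether real attention heads reach gaps this large is not something this example settles:

```python
import math

def taylor_exp(x, terms):
    """Sum of the first `terms` terms of the Taylor series of exp(x) at 0."""
    return sum(x ** p / math.factorial(p) for p in range(terms))

def weight_on_target(logit_gap, num_distractors, kernel):
    """Normalized weight of one key whose logit exceeds the rest by `logit_gap`."""
    target = kernel(logit_gap)
    others = num_distractors * kernel(0.0)
    return target / (target + others)

results = {}
for gap in (1.0, 4.0, 8.0, 12.0):
    exact = weight_on_target(gap, 7, math.exp)
    approx = weight_on_target(gap, 7, lambda x: taylor_exp(x, 4))
    results[gap] = (exact, approx)
    print(gap, exact, approx)
```

For small gaps the two weights nearly agree; for large gaps the truncated series puts visibly less weight on the target. Raising `terms` narrows the difference.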
- energy123•11h
  > this is where the taylor expression would fail to represent the values well.
  "In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"
  - seanhunter•11h
    I read that too, but I wondered whether elementwise error is the right metric. Surely the actual error metric should be to evaluate model performance for a conventional transformer model and then the same model with the attention mechanism replaced by this 4th order Taylor approximation?
    - vlovich123•10h
      Bounded error weights by definition is a more strict evaluation criterion than “performance” metrics through running the model.
      - ehsanu1•6h
        To spell it out for myself and others: approaching equivalent calculations for each individual attention block means we also approach equivalent performance for the combination of them. And with an error bar approaching floating point accuracy, the performance should be practically identical to regular attention. Elementwise errors of this magnitude can't lead to any noteworthy changes in the overall result, especially given how robust LLM networks seem to be to small deviations.
- mapontosevenths•11h
  > This just... softens attentions ability to attend?
  I think this does soften, but not linearly. That is to say the fixed state size limitation means that it softens more as it gets further into the past.
- tehsauce•11h
  Right, and when they compare to floating point accuracy they seem to be using the number of decimals supported by the mantissa, but the exponent is important no?
  - seanhunter•11h
    When someone says the error is of a certain magnitude they mean the absolute value of the difference between the two things, so what they're saying is that the values they produced with their approximation are about as wrong as the difference between the actual values and those values cast to float16. The exponent is most definitely important and would be included in that.
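To make "the difference between the actual values and those values cast to float16" concrete, here is a short sketch using the standard library's half-precision format code `"e"` in `struct`. The sample values are arbitrary, and `to_float16` is a hypothetical helper name:

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

samples = (0.001, 0.1, 1.0, 100.0)
for x in samples:
    err = abs(x - to_float16(x))
    print(x, err, err / x)  # absolute error, then relative error
```

For values in float16's normal range the relative rounding error is at most 2^-11, so the absolute error scales with the magnitude of the value, which is where the exponent enters.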
Kubuxu•7h
A paper on the same topic: On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective, Gabriel Mongaras, Eric C. Larson, https://arxiv.org/abs/2507.23632
Video presentation if someone prefers it: https://www.youtube.com/watch?v=PN3nYBowSvM
Linear attention is a first-degree approximation of Softmax attention, and model performance gets better as you increase the degree of the Taylor approximation.
I'm thinking about adapting an existing model to Taylor-approximated attention. I think it should be possible with some model surgery and rehabilitation training.
abeppu•10h
I haven't tried to follow the math closely but should there not be some concern about the region of convergence? It looks like they don't specifically discuss it. Or is there some reason this isn't a problem in this context?
- reactordev•10h
  I fear they have completely overlooked it.
- measurablefunc•8h
  The Taylor series for the exponential is convergent everywhere so what radius of convergence are you talking about? All the functions they're approximating are convergent everywhere & you can easily prove that compositions of functions that are convergent everywhere are still convergent everywhere.
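A sketch that separates the two questions in this sub-thread: the series for exp converges for every x, but the number of terms needed for a given accuracy grows with |x|. The helper `terms_needed` and the 1e-3 relative tolerance are arbitrary choices for illustration:

```python
import math

def terms_needed(x, rel_tol=1e-3, max_terms=200):
    """Smallest number of Taylor terms of exp(x) at 0 within rel_tol of exp(x)."""
    exact = math.exp(x)
    partial, term = 0.0, 1.0
    for n in range(1, max_terms + 1):
        partial += term  # adds x**(n-1) / (n-1)!
        if abs(partial - exact) <= rel_tol * exact:
            return n
        term *= x / n  # next term: x**n / n!
    return None

xs = (0.5, 1.0, 2.0, 4.0, 8.0)
counts = [terms_needed(x) for x in xs]
for x, c in zip(xs, counts):
    print(x, c)
```

So a fixed truncation order is accurate only over a bounded range of inputs, even though there is no finite radius of convergence.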
alyxya•10h
The best and proven linear attention is the Gated DeltaNet or variations of it, used by Kimi and Qwen. Anyone who thinks linear attention can't work is forgetting that models are a fixed size so attention should always be compressible to be linear. Another way to think of the feasibility of linear attention is that the standard attention mechanism can be made linear simply by removing the softmax so the kv cache stores the kv product as a constant size matrix instead. Softmax just normalizes attention, but it's not theoretically required.
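A minimal sketch of the softmax-free case described above, assuming unnormalized causal attention and tiny made-up vectors. The function names are hypothetical. The running `state` is the constant-size matrix that replaces the growing KV cache:

```python
import math

def quadratic_causal(queries, keys, values):
    """Unnormalized causal attention without softmax: sum over j <= i of (q_i . k_j) v_j."""
    out = []
    for i, q in enumerate(queries):
        acc = [0.0] * len(values[0])
        for k, v in zip(keys[: i + 1], values[: i + 1]):
            score = sum(qa * ka for qa, ka in zip(q, k))
            acc = [a + score * vb for a, vb in zip(acc, v)]
        out.append(acc)
    return out

def recurrent_causal(queries, keys, values):
    """Same outputs from a running d_k x d_v state; per-token work is independent of n."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of outer(k_j, v_j)
    out = []
    for q, k, v in zip(queries, keys, values):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += k[a] * v[b]
        out.append([sum(q[a] * state[a][b] for a in range(d_k)) for b in range(d_v)])
    return out

n, d_k, d_v = 10, 3, 2
queries = [[math.sin(i + j) for j in range(d_k)] for i in range(n)]
keys = [[math.cos(2 * i - j) for j in range(d_k)] for i in range(n)]
values = [[math.sin(0.3 * i + j) for j in range(d_v)] for i in range(n)]
slow = quadratic_causal(queries, keys, values)
fast = recurrent_causal(queries, keys, values)
print(max(abs(a - b) for x, y in zip(slow, fast) for a, b in zip(x, y)))
```

The two functions agree up to floating-point rounding; the second never looks back at earlier tokens, only at the accumulated state.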
mapontosevenths•11h
This uses the Taylor approximation to approximate softmax, but that IS only an approximation. I wonder exactly how much that trade-off costs in terms of accuracy vs performance? I note that they say it's close to float16 with four Taylor terms.
My other concern would be that Taylor itself is fairly complex. I wonder how well GPUs handle this in comparison to good old fashioned softmax? The last time I used Taylor with a custom Triton kernel it was still very slow. That could just have been my own jank vibe-coded implementation though.
- slashdave•8h
  If the model learns by using the approximate softmax, then why does it matter? We only need the behavior of softmax, not an exact numerical solution.
  - mapontosevenths•7h
    I guess that what I'm saying is I'd love to see an LLM actually have its attention mechanism replaced with this and get benchmarked on real world tasks in comparison to quadratic attention. They don't seem to have done that here. They claim that it's close to being the same, but my experience tells me that it needs to do better than get "pretty close."
    They also haven't tried to write a high performance kernel for Triton yet. If it goes the way my last experiment with Taylor did they're in for some bad news.
    I'm just a hobbyist though, it's certainly possible that people with more time/resources could outperform me without much effort. I just want to see it tested on something familiar and benchmark-able.
spacewhales•11h
Github here: https://github.com/glassroom/sata_attention
observationist•11h
This could turbocharge ByT5 and other tokenless architectures, whose big downside was the increase in compute over longer sequences. It's easy to imagine a bunch of strategies with variable levels of "focus" and so on with a fixed compute budget assigned on the fly with learned optimizers informing the distribution.
andes314•11h
Linear time attention doesn’t work, by principle. Dead end pursuit. Much great research on more efficient quadratic time inference
- smokel•10h
  What about n log n?
yanosh_kunsh•11h
So does that mean that LLM inference could go down significantly in price and/or context length would dramatically increase?
physicsguy•7h
With this, they've not provided an upper bound on the error of the kernel expanded with N terms, which I think is a big missing piece.
NedCode•9h
Reference implementation: https://github.com/glassroom/sata_attention
rvz•11h
> Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
Now this is a very interesting paper, which hopefully should address AI's chronic lack of efficient methods for reducing its significant computational and energy demands, which are off the charts.
> These factors penalize performance relative to what a fused, hardware-optimized implementation could achieve, and the reported runtime results should therefore be interpreted conservatively.
It's still early with several limitations, but the need for wasting billions on GPUs will begin to not make any sense soon.