There is something deep in this observation. When I reflect on how I write code, sometimes it happens backwards. Sometimes I start with the data and work back out to the outer functions, unnesting as I go. Sometimes I start with the final return and work back to the inputs. I notice that LLMs sometimes ought to work this way but can't, so they end up rewriting from the start.
Makes me wonder if future LLMs will compose nonlinear structures, either working temporarily in non-token-order spaces or having some way to map their output back to linear token order. I do know that nonlinear thinking is common while writing code, though; current LLMs might be hiding that deficit behind a large and perfect context window.
Yes, there are already diffusion language models, which start with paragraphs of gibberish and evolve them into a refined response as a whole unit.
Right, but that resolves everything smoothly(ish) at the same time. That might be sufficient, but it isn't actually replicating the thought process described above. That non-linear thinking is different from diffuse thinking. Resolving in a web around a foundation seems like it would be useful for coding (and for other structured thinking in general).
With enough resolution and appropriately chosen transformation steps, it is equivalent. E.g., the diffusion could focus on one region and then later focus on another, and it's allowed to undo the effort it did in one region. Nothing architecturally prohibits that solution style from emerging.
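To make that concrete, here is a minimal toy sketch of an iterative-refinement loop where each step is free to pick any region, including one it already resolved, and rewrite it; `denoise_step` is an invented stand-in for the learned model, not any real API. Nothing in the loop forces left-to-right resolution.

    import random

    MASK = "<mask>"

    def denoise_step(tokens, positions):
        # Invented stand-in for a learned denoiser: (re)write the chosen
        # positions. A real model would condition on the whole buffer.
        return [f"tok{i}" if i in positions else t for i, t in enumerate(tokens)]

    def refine(length=16, steps=8, region_size=4):
        tokens = [MASK] * length
        for _ in range(steps):
            # Focus on one region per step; regions resolved in earlier
            # steps may be revisited and overwritten later.
            start = random.randrange(0, length - region_size + 1)
            region = set(range(start, start + region_size))
            tokens = denoise_step(tokens, region)
        return tokens

    print(refine())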
The choice of transformation steps to facilitate this specific diffuse approach seems like a non-trivial problem. It doesn't follow that such an organic solution would emerge at all, now, does it?
The pattern ", now, " is indicative of a sort of patronization I don't normally engage with, but, yes, you're correct.
In some measure of agreement with you: for other classes of models we know for a fact that there exist problems which those architectures can solve but which can't be learned with current training techniques. It doesn't feel like a huge stretch that such training-resistant data might exist for diffusion models.
That said, I still see three problems. First, the current ancestral chain of inquiry seems to care about the model and not the training process, so the point is moot. Second, in other similar domains (like soft circuits) those organic solutions do seem to emerge, suggesting (but not proving) that the training process _is_ up to par. Lastly, in other related domains, when such a solution doesn't emerge it is ordinarily because some simpler methodology achieves better results; so even with individual data points suggesting that diffusion models don't capture that sort of non-linearity, you would still need to do a little work to show that the observation actually matters.
The process of developing software involves this kind of non-linear code editing. When you learn to do something (and the same should go for code, even if people don't always get this critical level of instruction), you don't just look at the final result: you watch people construct the result. Constructing code is a temporally linear sequence of operations on a text file, but your cursor bounces around as you issue commands that move it through the file. We don't have the same kind of copious training data for that, which is why what we really need to do is train models not on code, but on all of the input that goes into a text editor. (If we concentrate on software developers who are used to doing their work entirely in a terminal, this gets a bit easier, as we can then essentially train the model on all of the keystrokes they press.)
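As a rough sketch of what that kind of training data could look like (field names invented for illustration, not any real editor protocol): each example is a trace of editor operations rather than the final file, and replaying the trace reconstructs the file.

    from dataclasses import dataclass

    @dataclass
    class EditEvent:
        kind: str       # "insert" or "delete"
        offset: int     # cursor position in the buffer
        text: str = ""  # inserted text, or the text being deleted

    def replay(events, buffer=""):
        # Replaying the trace reconstructs the final file; the trace
        # itself is what the model would be trained on.
        for e in events:
            if e.kind == "insert":
                buffer = buffer[:e.offset] + e.text + buffer[e.offset:]
            elif e.kind == "delete":
                buffer = buffer[:e.offset] + buffer[e.offset + len(e.text):]
        return buffer

    trace = [
        EditEvent("insert", 0, "def f():\n    return x\n"),
        # Non-linear step: jump back to the signature and add the
        # parameter after the body that uses it was already written.
        EditEvent("insert", 6, "x"),
    ]
    print(replay(trace))  # def f(x): ... return x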
I think that, long term, LLMs should directly generate Abstract Syntax Trees. But this is hard right now because all the training data is text code.
The training data is text code that can be compiled, though, so it can just as easily be turned into an Abstract Syntax Tree.
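For instance (a minimal illustration using Python's standard library, which already ships a parser), any compilable snippet can be lifted into a tree and serialized however a training pipeline wants:

    import ast  # requires Python 3.9+ for indent= and ast.unparse

    source = "def add(a, b):\n    return a + b\n"
    tree = ast.parse(source)

    print(ast.dump(tree, indent=2))  # the same program as a syntax tree
    print(ast.unparse(tree))         # and back to (normalized) source text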
It's possible that LLMs build ASTs internally for programming. I have no first-hand data on this, but it would not surprise me at all.
LLMs don't have memory, so they can't build anything. Insofar as they produce correct results, they have implicit structures corresponding to ASTs built into their networks during training time.
There's a fair amount of experimental work happening that tries different parsing and resolution procedures so that the training data reflects an AST and/or predicts nodes in an AST as an in-filling capability.
Do you know if any such experimental work is using a special tokenizer? For example, in Lisp, a dedicated token for the left and right parentheses?
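To pin down what I mean, something like this toy tokenizer (entirely invented, not taken from any of that work) that emits a dedicated token for each parenthesis instead of letting a BPE vocabulary fold them into larger fragments:

    import re

    def lisp_tokens(src):
        # Dedicated structural tokens for the parens; everything else is
        # split on whitespace. Purely illustrative.
        for tok in re.findall(r"\(|\)|[^\s()]+", src):
            if tok == "(":
                yield "<LPAREN>"
            elif tok == ")":
                yield "<RPAREN>"
            else:
                yield tok

    print(list(lisp_tokens("(define (square x) (* x x))")))
    # ['<LPAREN>', 'define', '<LPAREN>', 'square', 'x', '<RPAREN>', ...]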
> Sometimes I start with the final return and work back to the inputs.
Shouldn't be hard to train a coding LLM to do this too by doubling the training time: train the LLM both forwards and backwards across the training data.
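A rough sketch of the data-doubling part, just to pin down what 'backwards' means here (reverse each token sequence; everything else about training stays the same):

    def augment_with_reversed(sequences):
        # For every training sequence, also yield its token-reversed twin,
        # roughly doubling the data (and the training time).
        for seq in sequences:
            yield list(seq)
            yield list(reversed(seq))

    example = [["def", "f", "(", ")", ":", "return", "42"]]
    for s in augment_with_reversed(example):
        print(s)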
GP is talking about the nonlinear way that software engineers think, reason, and write down code. Simply doing the same thing but backwards provides no benefit.