Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts; we started the project a few months ago because we wanted to better understand how the generation process works. Our background is in probabilistic, relational, and symbolic programming.
Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: a regular expression has an equivalent deterministic finite automaton (DFA) representation. We can turn this DFA into a generative model: in each state we get the list of symbols that correspond to completions which partially match the regular expression. We mask the other symbols in the logits returned by the large language model, sample a new symbol, and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.
Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.
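Here is a rough sketch of what a single sampling step looks like in plain Python. It is illustrative only, not the actual Outlines code, and it assumes the token-level FSM has already been compiled into a dict mapping each state to its allowed token ids and their successor states:

    import math
    import random

    def sample_constrained(logits, fsm, state):
        # fsm: dict mapping state -> {allowed token id: next state},
        # built in one pass over the model's vocabulary.
        allowed = fsm[state]                             # one dictionary lookup per step
        masked = [
            logit if token_id in allowed else -math.inf  # forbid tokens that leave the regex
            for token_id, logit in enumerate(logits)
        ]
        max_logit = max(masked)
        weights = [math.exp(l - max_logit) for l in masked]  # masked tokens get weight 0
        token_id = random.choices(range(len(logits)), weights=weights, k=1)[0]
        return token_id, allowed[token_id]

    # Toy example: regex "ab" with a two-token vocabulary, 0 -> "a", 1 -> "b"
    fsm = {0: {0: 1}, 1: {1: 2}}
    print(sample_constrained([0.3, 1.2], fsm, state=0))  # always picks token 0 in state 0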
From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.
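To give an idea of the kind of structure this supports, here is a toy Pydantic model (made up for this post, not taken from the documentation) with an optional field, a union, nesting, and an array; the JSON schema derived from it is what gets compiled down to a regex and then an FSM:

    from typing import List, Optional, Union
    from pydantic import BaseModel

    class Weapon(BaseModel):
        name: str
        damage: int

    class Spell(BaseModel):
        name: str
        mana_cost: int

    class Character(BaseModel):
        name: str
        age: Optional[int] = None                  # optional type
        inventory: List[Union[Weapon, Spell]]      # array of a union of nested models

    print(Character.model_json_schema())           # .schema() on Pydantic v1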
I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.
I look forward to feedback, bug reports, feature requests and discussions!
Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702
I'm not sure why you would want to use raw llama-2, though, when there are a million super strong instruction fine-tuned versions of llama-2 on the HF Hub that would do the job a million times better. Like Stability AI's Beluga-2. See https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
About your second point: the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output tokens can and cannot be used.
Don't rely too much on automated benchmarks for LLMs. They are often gamed, made to overfit, and result in worse performance in the general case.
Human evaluation is the gold standard, and the Llama 2 paper gave significant evidence that Llama 2 70B chat is on par with, if not better than, ChatGPT on that metric, so I tend to stick with it unless there is good reason not to.
The problem with the Llama 2 chat versions is that they have been RLHF-ed to death. You can't ask questions without getting a sermon about how your question may be inappropriate for this or that reason.
I think it's worse on the smaller models, but still present in the 70B one.
Apologies if you'd already seen this and were only trying to make a point, but you might like this article from a week or two ago that talks about how to run Llama 2 "uncensored" locally; it seems to do a decent job of mitigating the sermons!
Article: https://ollama.ai/blog/run-llama2-uncensored-locally
Discussion: https://news.ycombinator.com/item?id=36973584
When you encounter "uncensored" in a Llama model (1 or 2), what that means is that the fine-tuning datasets used have had all refusals to respond removed. There's no way to uncensor the pre-trained model itself, and fine-tuning only changes the style of the output.
For sure, that's a good reason for using the uncensored fine-tuned versions. There are other good reasons too, like expanded context size, codegen, and story writing/RP. Just be careful of extraordinary benchmarks.
Btw, have you tried changing the default Llama 2 chat prompt? Meta tried to fine-tune it so that if you remove the safety part from the prompt, safety won't be applied[1]. Not sure how well it works myself, but worth a shot I guess
[1] can be found in the Llama 2 paper
> I'm not sure why you would want to use raw llama-2
Sure. My concern was not specific to llama-2, and was only using it as a placeholder example of a decent pre-trained base model. Replace it with your favorite base model, which you want to use for guided generation. My question is more fundamental - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?
> About your second point: the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output tokens can and cannot be used.
Mechanistically, yes. I am not arguing that. The whole point is to generate JSON that is "useful".
I'm quite impressed with Llama 2 13B - the more time I spend with it the more I think it might be genuinely useful for more than just playing around with local LLMs.
I'm using the MLC version (since that works with a GPU on my M2 Mac) via my https://github.com/simonw/llm-mlc plugin.
Even the 7B model is shockingly good! I've been hacking on a project also built on MLC (but the web runtime) and the completions I'm seeing from Llama 2 7B, just running on my laptop's browser, have been really impressive. There's a demo page here: https://ad-llama.vercel.app/
That demo is really cool!
What are your use cases?
The thing I really want to get working is retrieval augmented generation - so effectively answering questions based on a blob of context that I pass in, and being able to do good-enough summarization.
I haven't quite proved this to myself yet but I think it's going to work pretty well.
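The pattern I have in mind is just stuffing retrieved chunks into the prompt, something like this sketch (the retrieval step and the actual model call are left out, and the wording of the instructions is very much not settled):

    def build_rag_prompt(question, chunks):
        # chunks: passages retrieved by e.g. an embeddings search
        context = "\n\n".join(chunks)
        return (
            "Answer the question using only the context below. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )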
Not simonw, but I've been using Llama2-13B for search re-ranking very successfully.
search re-ranking?
Do a search, then re-order the results based on some criterion. Easy when the criterion is easy to code, less so when it isn't. But it turns out LLMs are pretty good at interpreting re-ranking instructions.
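Simplified, it looks something like this; `llm` here is a stand-in for whatever model call you use, not a real API:

    def rerank(results, criterion, llm):
        # Ask the model to score each result against the criterion, then sort.
        def score(result):
            prompt = (
                f"Criterion: {criterion}\n"
                f"Result: {result}\n"
                "On a scale of 0 to 10, how well does this result match the criterion? "
                "Reply with a single number."
            )
            try:
                return float(llm(prompt).strip())
            except ValueError:
                return 0.0  # model went off-script; rank it last
        return sorted(results, key=score, reverse=True)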
In our experience, at least for code generation, base models can be improved significantly by guiding token-level generation.
In our paper "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), we propose Monitor Guided Decoding, which interfaces LLMs with static analysis and guides the model to generate type-consistent code. Without any kind of fine-tuning, we show that using static analysis to guide token-level generation at specific points leads to significantly better generated code, both in terms of compilability and match with ground truth. Even very small models (1.1B) are able to generate more compilable code than much larger models (175B), while also improving on match with ground truth.
Thanks for the reference, Lakshya. Looks very cool!
(Just thinking out loud next)
If you allow me to be a little imprecise, guided generation is prompting "just-in-time", unlike the other kind of prompting where you provide all the reference tokens "ahead-of-time". Now there's work [1] showing that smaller models rely much more on prompting than larger models do, i.e. smaller models are more faithful to the tokens in the prompt than larger models, which just do whatever they were going to do anyway.
Your results seem very much in line with this kind of qualitative result: you show that with MGD, CodeGen-350M outperforms CodeGen-6B, and CodeGen-6B outperforms text-davinci-003. Smaller models perhaps respond more strongly to certain kinds of prompting strategies than larger models do.
[1]: https://arxiv.org/pdf/2307.13702.pdf
It is an interesting paper. Any idea when the code/data will be released? It appears it has been almost two months since the paper was submitted, but the link given leads to a random Bing page :-(
> ...given an instruction-tuned model, post-hoc masking of the state-space during generation then amounts to just changing the generation distribution...
Isn't that what we did with test driven development?
The primary difference was that our generator functions were humans instead of LLMs. Why not cut out the middle-human?
Yes. And if that human was smart and knowledgeable, they would use property-based testing to automatically generate test inputs. Most libraries make it trivial to do for custom data types and can even reduce a failing test case to a minimal input. I have been using this since 2008, and it was around before that.
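For example, with Hypothesis in Python (a minimal sketch; the Order type and the property are made up for illustration):

    from dataclasses import dataclass
    from hypothesis import given, strategies as st

    @dataclass
    class Order:
        quantity: int
        unit_price: float

    def total(order):
        return order.quantity * order.unit_price

    # Inputs for the custom type are generated automatically; any failing
    # case is shrunk to a minimal example before being reported.
    @given(st.builds(
        Order,
        quantity=st.integers(min_value=0, max_value=1_000_000),
        unit_price=st.floats(min_value=0, max_value=1e6, allow_nan=False),
    ))
    def test_total_is_non_negative(order):
        assert total(order) >= 0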
I think what I am saying is tangential to TDD. I am not really even concerned about the ability of LLM to function as desired, and its verification.
I was rather concerned about a broader fundamental question - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?
> you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.
The instruction-tuning part is "trivial"... it's dealing with the edge cases that gets me.
With classic code, edge cases are, well, insignificant edge cases. With an LLM you never know what will make it go off on a tangent, and the parsing code needs to deal with that chaos.
Or put differently, the percentage of cases that are edge cases seems to have gone up dramatically.