Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts and started the project a few months ago because we wanted to understand better how the generation process works. Our original background is probabilistic, relational and symbolic programming.

Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: regular expressions have an equivalent Deterministic Finite Automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get the list of symbols that keep a partial match of the regular expression alive. We mask all the other symbols in the logits returned by the large language model, sample a new symbol, and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.

Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.
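
A rough sketch of the mechanics, for readers who want it in code. This is not the Outlines API; `dfa_step`, `dfa_states` and the toy vocabulary are illustrative stand-ins, and details like pruning dead states are glossed over:

    from collections import defaultdict

    # One pass over the vocabulary: for each token, record from which DFA
    # states it can be consumed and which state it leads to. `dfa_step(s, ch)`
    # is assumed to return the next DFA state, or None if no match is possible.
    def build_index(dfa_states, vocab, dfa_step):
        index = defaultdict(dict)  # state -> {token_id: state reached after token}
        for token_id, token_str in vocab.items():
            for state in dfa_states:
                s = state
                for ch in token_str:
                    s = dfa_step(s, ch)
                    if s is None:
                        break
                if s is not None:
                    index[state][token_id] = s
        return index

    # At generation time the mask is a dictionary lookup: keep only the logits
    # of index[current_state], sample a token, and move to the state it maps to.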

From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.

I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.

I look forward to feedback, bug reports, feature requests and discussions!

Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702

314 comments
  • activatedgeek2y

    Mechanistically, I think this library takes the simple idea of masking part of the vocabulary space at each step in time and implements it efficiently. Great!

    I am curious, however, for the ones who have played around with such libraries wrapping base LLMs with output structure: do base models like Llama2 work very well? My experience says "hell no!" and you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.

    And even then, it seems very counter-intuitive to me: given an instruction-tuned model, post-hoc masking of the state space during generation amounts to changing the generation distribution, which is potentially detrimental to the instruction-tuning?

    • make32y

      I'm not sure of why you would want to use raw llama-2 though when there are a million super strong instruction fine-tuned versions of llama-2 on the HF hub that would do the job a million times better? Like Stability-AI's Beluga-2. See https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

      About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output tokens can and cannot be used.

      • nabakin2y

        Don't rely too much on automated benchmarks for LLMs. They are often gamed, made to overfit, and result in worse performance in the general case.

        Human evaluation is the gold standard, and the Llama 2 paper gave significant evidence that Llama 2 70B chat is on par with, if not better than, ChatGPT on that metric, so I tend to stick with it unless there is good reason not to.

        • huevosabio2y

          The problem with Llama 2 chat versions is that they have been RLHF-ed to death. You can't ask questions without getting a sermon of how your question may be inappropriate for this or that reason.

          I think it's worse on the smaller models, but still present in the 70B one.

          • dceddia2y

            Apologies if you’d already seen this and were only trying to make a point, but you might like this article from a week or 2 ago that talks about how to run Llama 2 “uncensored” locally, and it seems to do a decent job of mitigating the sermons!

            Article: https://ollama.ai/blog/run-llama2-uncensored-locally

            Discussion: https://news.ycombinator.com/item?id=36973584

            • superkuh2y

              When you encounter "uncensored" in a Llama model (1 or 2), it means that the fine-tuning datasets used have had all refusals to respond removed. There's no way to uncensor the pre-trained model itself, and fine-tuning only changes the style of the output.

          • nabakin2y

            For sure, that's a good reason for using the uncensored fine-tuned versions. There are other good reasons too like expanded context size, codegen, and story writing/rp. Just be careful of extraordinary benchmarks.

            Btw, have you tried changing the default Llama 2 chat prompt? Meta tried to fine-tune it so that if you remove the safety part from the prompt, safety won't be applied[1]. Not sure how well it works myself, but worth a shot I guess

            [1] can be found in the Llama 2 paper

      • activatedgeek2y

        > I'm not sure of why you would want to use raw llama-2

        Sure. My concern was not specific to llama-2, and was only using it as a placeholder example of a decent pre-trained base model. Replace it with your favorite base model, which you want to use for guided generation. My question is more fundamental - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?

        > About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output tokens can and cannot be used.

        Mechanistically, yes. I am not arguing that. The whole point is to generate JSON that is "useful".

    • simonw2y

      I'm quite impressed with Llama 2 13B - the more time I spend with it the more I think it might be genuinely useful for more than just playing around with local LLMs.

      I'm using the MLC version (since that works with a GPU on my M2 Mac) via my https://github.com/simonw/llm-mlc plugin.

      • gsuuon2y

        Even the 7B model is shockingly good! I've been hacking on a project also built on MLC (but the web runtime) and the completions I'm seeing from Llama 2 7B, just running on my laptop's browser, have been really impressive. There's a demo page here: https://ad-llama.vercel.app/

        • simonw2y

          That demo is really cool!

      • moneywoes2y

        What are your use cases?

        • simonw2y

          The thing I really want to get working is retrieval augmented generation - so effectively answering questions based on a blob of context that I pass in, and being able to do good-enough summarization.

          I haven't quite proved this to myself yet but I think it's going to work pretty well.

        • nl2y

          Not simonw, but I've been using Llama2-13B for search re-ranking very successfully.

          • victor1062y

            search re-ranking?

            • nl2y

              Do a search, then re-order the results based on some criterion. Easy when the criterion is easy to code, less so when it isn't. But it turns out LLMs are pretty good at interpreting re-ranking instructions.

    • LakshyAAAgrawal2y

      In our experience, at least for code generation, base models can be improved significantly by guiding token-level generation.

      In our paper titled "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), we propose Monitor Guided Decoding, which interfaces LLMs to static analysis, and guides the model to generate type-consistent code. Without any kind of fine-tuning, we show that using static analysis to guide token level generation at specific points leads to significantly improved quality of generated code, both in terms of compilability and match with ground truth. Even very small models (1.1B) are able to generate more compilable code than much larger models (175B) while also improving on match with ground truth.

      • activatedgeek2y

        Thanks for the reference, Lakshya. Looks very cool!

        (Just thinking out loud next)

        If you allow me to be a little imprecise, guided-generation is prompting "just-in-time" unlike the other kind of prompting where you provide all reference tokens "ahead-of-time". Now there's work [1] out there that shows that smaller models rely much more on prompting than larger models do, i.e. smaller models are more faithful to the tokens in the prompt than the larger models which just do whatever they were going to do anyways.

        Your results seem very much in line with this kind of a qualitative result --- you show that CodeGen-350M outperforms CodeGen-6B, and CodeGen-6B outperforms text-davinci-003 using MGD. Smaller models perhaps respond more strongly to certain kinds of prompting strategies than larger models do.

        [1]: https://arxiv.org/pdf/2307.13702.pdf

      • Roark662y

        It is an interesting paper. Any idea when the code/data will be released? It appears it has been almost 2 months since the paper was submitted, but the link given leads to a random bing page :-(

    • ethbr12y

      > ...given an instruction-tuned model, post-hoc masking of the state-space during generation then amounts to just changing the generation distribution...

      Isn't that what we did with test driven development?

      The primary difference was our generator functions were human instead of LLM. Why not cut out the middle-human?

      • spockz2y

        Yes. And if that human was smart and knowledgeable they would use property-based testing to automatically generate test inputs. Most libraries make it trivial to do for custom data types and can even reduce the failing test case to a minimal-size input. I have been using this since 2008 and it was around before that.

      • activatedgeek2y

        I think what I am saying is tangential to TDD. I am not really even concerned about the ability of LLM to function as desired, and its verification.

        I was rather concerned about a broader fundamental question - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?

    • Havoc2y

      >you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.

      The instruction tuning part is "trivial"...it's the dealing with edge cases part that gets me.

      With classic code, edge cases are, well, insignificant edge cases. With an LLM you never know what will make it go off on a tangent, and the parsing code needs to deal with that chaos.

      Or, put differently, the % of cases that are edge cases seems to have gone up dramatically.

  • panarky2y

    I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten.

    But it's still probabilistic, and nine times out of ten isn't good enough.

    Occasionally it will hallucinate responses like this:

    {"key1": "value1", "key2": "value2" for i in range(n)}

    Re-prompting with the parsing error message is usually enough to get it on the second try.

    But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.

    Re-prompting for escaping errors still yields a ~50% success rate.
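
    A minimal sketch of that re-prompting loop, with a hypothetical `call_llm` wrapper standing in for whatever chat API is used:

        import json

        def generate_json(messages, call_llm, max_attempts=3):
            """Ask for JSON; on a parse error, feed the error back and retry."""
            for _ in range(max_attempts):
                reply = call_llm(messages)  # hypothetical chat-completion wrapper
                try:
                    return json.loads(reply)
                except json.JSONDecodeError as err:
                    messages = messages + [
                        {"role": "assistant", "content": reply},
                        {"role": "user", "content": f"That was not valid JSON ({err}). "
                                                    "Reply with corrected JSON only."},
                    ]
            raise ValueError("no valid JSON after retries")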

    • simonw2y

      That re-prompting on error trick is what this new Microsoft library does, too: https://github.com/microsoft/TypeChat

      Here's their prompt for that: https://github.com/microsoft/TypeChat/blob/c45460f4030938da3...

      I think the approach using grammars (seen here, but also in things like https://github.com/ggerganov/llama.cpp/pull/1773 ) is a much more elegant solution.

      • creatonez2y

        A "repair prompt" instead of rewinding and starting back from the error seems like the wrong choice, and might only make sense with how payment for OpenAI API usage currently works.

    • padolsey2y

      I've had more luck with getting it to output XML as (1) You can imbue XML with actual language/meaning (which LLMs adore) and (2) parsers can be made to be more forgiving. I get why people want to make JSON, but to me it's a bit like trying to get a cat to swim - you might eventually succeed, but it's not their natural inclination.

      • prempv2y

        I've had the same experience. I suspect it's due to the large presence of HTML in the training data, as part of codebases and online content

      • gowld2y

        How do you imbue XML with meaning?

        • padolsey2y

          XML elements themselves: their naming, their attributes, comments, indentation. There's more opportunity at every level of the hierarchy to demarcate and establish meaning. Having closing tags as well, I've found, is a massive boon; LLMs can better understand what "finishing" looks like if it's delimited in a semantic way - with a name.

          • BoorishBears2y

            The same works for JSON. Naming JSON keys well works nicely for adjusting the output, and you can comment in your definitions (by defining them in a JSON Schema, or inserting placeholder text like `"someKeyWithClarifyingDetails": <some detailed instruction>`)

            I'm actually partial to CSV these days though; it can really cut down on response times by not needing to return all the extra tokens for JSON/XML delimiters

            • padolsey2y

              Ostensibly, yeah, JSON should be able to encapsulate most of that semantic stuff, but having replaced an XML schema in the system prompt with GPT's function-calling API, I've been very unimpressed. It feels much less capable. I would have to provide a lot more clarifying prompts to make it more capable. I think I will, for now, bias toward using schemas that are closest to prose.

            • DonHopkins2y

              Yikes. This makes me think that JSON's stubborn mistake of not allowing comments is yet another "Billion-Dollar Mistake", since it's way too late to just change the standard to allow comments, update all the JSON content on the internet to use comments, and retrain all the LLMs to understand comments.

              Great point about CSVs! But using placeholder keys for JSON comments is untenable, and using a schema instead of inline comments is clumsy and indirect. Of course JSON schemas are quite useful in certain situations, but LLMs would get a lot more meaning out of casual common JSON if it just allowed comments, and it would also greatly benefit humans.

              Between JavaScript's and JSON's mistakes, that's at least <DoctorEvilVoice>THREE BILLION DOLLARS!!!</DoctorEvilVoice> ;)

              https://en.wikipedia.org/wiki/Tony_Hoare#Research_and_career

              >Speaking at a software conference in 2009, Tony Hoare apologized for inventing the null reference:

              >"I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years." -Tony Hoare

              https://news.ycombinator.com/item?id=19568378

              >"My favorite is always the Billion-Dollar Mistake of having null in the language. And since JavaScript has both null and undefined, it's the Two-Billion-Dollar Mistake." -Anders Hejlsberg

              >"It is by far the most problematic part of language design. And it's a single value that -- ha ha ha ha -- that if only that wasn't there, imagine all the problems we wouldn't have, right? If type systems were designed that way. And some type systems are, and some type systems are getting there, but boy, trying to retrofit that on top of a type system that has null in the first place is quite an undertaking." -Anders Hejlsberg

              • BoorishBears2y

                I'm not saying use placeholder keys: the actual keys themselves serve as guidance.

                Naming a key "nameBasedOnLocationIGaveYou" instead of "name", or "oneSentenceSummary" vs "summary", results in a meaningful difference.

                You can even use that for formatted single-response chain of thought, like {"listOfStuff":[...], "whatDoTheyHaveInCommon": "", "whichOneIsMostImportant": ""}

                Also remember, the LLM doesn't need valid JSON: I just straight up insert comments in the JSON in a non-compliant way for some of my prompts, GPT-4 and Claude are all smart enough to not hallucinate comments back at you. 3.5 might be pushing it if temp is too high (although even the nerfed API logit bias should fix that now that I think about it)

                And sometimes to save tokens I describe a JSON object without using JSON: just structure it in neatly formatted markdown and even 3.5 can follow along

                • DonHopkins2y

                  Oh, I see! I misunderstood that you meant using dummy keys to hold comments in their values, which some people have suggested as a work-around for there not being any comments in JSON.

    • caesil2y

      With ChatGPT function calling I get valid JSON 100% of the time from GPT-4 unless I have made some error in prompting.

      The chief error is not providing escape hatches. LLMs look for a right answer. If you are feeding it some texts and asking it to return structured data about the texts, but then one of the texts is blank, it will be difficult to determine a right answer, so you get hallucinations. The solution is an escape hatch where one of the arguments is a `textIsMissing` boolean or something.

      As long as you've accounted for these failure modes, it works flawlessly.
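
      A concrete illustration of that escape hatch, as a function-calling definition sketch (the function and field names here are made up):

          # Hypothetical function definition: `text_is_missing` gives the model a
          # "right answer" to give when the input text is blank, instead of
          # inventing one.
          extract_fn = {
              "name": "record_extraction",
              "description": "Record structured data extracted from the provided text.",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "text_is_missing": {
                          "type": "boolean",
                          "description": "True if the input text was blank or unusable.",
                      },
                      "summary": {"type": "string"},
                      "sentiment": {
                          "type": "string",
                          "enum": ["positive", "negative", "neutral"],
                      },
                  },
                  "required": ["text_is_missing"],
              },
          }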

      • reissbaker2y

        GPT-4 is amazing, but the upside of smaller models is much lower cost. I get basically 100% accuracy on JSON modeling with GPT-4 with function calling too, but I will say that gpt-3.5-turbo with function calling is somewhat less accurate — it usually generates valid JSON in terms of JSON.parse not exploding, but not necessarily JSON following the schema I passed in (although it's surprisingly good, maybe ~90% accurate?). I use 3.5-turbo a decent amount in API calls because it's just a lot cheaper, and performs well enough even if it's not gpt-4 level.

        I haven't gotten a chance to earnestly use the smaller Llama models yet in more than small prototypes (although I'm building a 4090-based system to learn more about finetuning them), but the little amount of experimenting I've done with them makes me think they need a decent amount of help with generating consistently-valid JSON matching some schema out of the box. This is a pretty neat tool to use for them, since it doesn't require finetuning runs, it just masks logits.

        • BoorishBears2y

          claude-1.2-instant came out last week and is doing extremely well at following schemas.

          I'd say it's reached 3.5-turbo level, with the format-following skills of GPT-4, which is powerful once you give it chain-of-thought

      • selcuka2y

        The premise of function calling is great, but in my experience (at least on GPT-3.5, haven't tried it with GPT-4 yet) it seems to generate wildly different, and less useful results, for the same prompt.

        • caesil2y

          GPT-3.5 is pretty much useless for reliable NLP work unless you give it a VERY prescribed task.

          That's really the major breakthrough of GPT-4, in my mind, and the reason we are absolutely going to see an explosion of AI-boosted productivity over the next few years, even if foundation LLM advancements stopped cold right now. A vast ocean of mundane white collar work is waiting to be automated.

        • ipaddr2y

          You can set the temperature to 0 and get the same output each time for the same text

          • tomduncalf2y

            In my experience (with GPT-4 at least), a temperature of 0 does not result in deterministic output. It's more consistent but outputs do still vary for the same input. I feel like temperature is a bit more like "how creative should the model be?"

            • selcuka2y

              One theory is it is caused by its Sparse MoE (Mixture of Experts) architecture [1]:

              > The GPT-4 API is hosted with a backend that does batched inference. Although some of the randomness may be explained by other factors, the vast majority of non-determinism in the API is explainable by its Sparse MoE architecture failing to enforce per-sequence determinism.

              [1] https://152334h.github.io/blog/non-determinism-in-gpt-4/

          • selcuka2y

            I should probably re-test it, but I think it wasn't the temperature. The results were unusually useless.

    • andreygrehov2y

      Meh... I asked GPT-4 to return sample PHP code inside a random JSON object. It failed the JSON linter on the very first try. I couldn't get it to pass validation despite many retries, e.g. follow-up corrections. Not a single time did it generate 100% valid JSON; I eventually gave up.

      • adamrezich2y

        if you think that's bad, try to get it to generate Inform 7 games—Inform's natural-English-ish syntax completely throws all LLMs for a loop, consistently. it generates code that looks possibly correct (to an Inform newbie at least), but fails to compile far more often than not. I find this super interesting.

      • ipaddr2y

        This worked with chatGPT: create a sample hello world in php

        store that code in a json object

        code: { "php_code": "<?php echo 'Hello, World!'; ?>" }

    • karmasimida2y

      I see 2 major advantages to grammar-constrained generation:

      1. It consumes fewer tokens, no need to add too many examples into the prompt.

      2. It suffers less from the forgetting issue.

      Another minor advantage is that you can control precisely where your desired output begins.

      But overall, those are nice perks, not too substantial IMO.

    • nextaccountic2y

      What about reprompting with a different temperature value?

      If this works, how do you select the optimal value? Maybe you could train a model that excels at the task of querying GPT-4 for valid JSON

    • MuffinFlavored2y

      I wonder if the next iteration of OpenAI features is something like:

      right now you can inject prompts that the LLM takes into consideration before the output

      I wonder if you can make it have a "post" generation function that says like "keep re-trying in a loop (aka hallucinating with randomness) until the output message passes XYZ format/checks/scoring"
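
      A rough sketch of what such a "post" check could look like client-side today, with hypothetical `generate` and `checks` callables:

          def generate_until_valid(generate, checks, max_tries=5):
              """Regenerate (relying on sampling randomness) until every check passes."""
              for _ in range(max_tries):
                  output = generate()  # hypothetical: returns one sampled completion
                  if all(check(output) for check in checks):
                      return output
              raise RuntimeError("no output passed the checks")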

      • padjo2y

        It’s starting to feel like LLMs are to “classical” software engineering what quantum physics was to classical physics

        • catlifeonmars2y

          How so? I’m not quite following the analogy.

          • antonvs2y

            Just guessing what was meant, but quantum physics in some sense tries all possible paths before an outcome is selected.

            The problem with that is that without a quantum computer, or without some sort of filtering, that process can take up to infinite time.

          • padjo2y

            Oh it was just a glib way of moaning about non-determinism making its way into software engineering. Much like how physicists had to make peace with the probabilistic nature of quantum physics.

      • kristjansson2y

        Why wait for OpenAI?

    • msp262y

      >I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten

      But you can do both. For my current use case of extracting information from articles, I have a json schema + one/two example articles along with their correct answers. This increases token costs but 3.5 is so cheap that it doesn't matter and for 4 you can use batching to decrease token cost per article.

      • vsrinivasan2y

        Can you please explain what batching is? Any pointers?

    • phillipcarter2y

      This is what we do, but for GPT-3.5. And it doesn't need to be system messages either. We even have it emitting only JSON in a specific structure (except for when it fails to produce an output altogether). This is without the function calling model.

    • keiferwiseman2y

      It took some iterations, but I've managed to get the OpenAI API to give me valid JSON 100% of the time now (based on my testing). I think I put in the prompt to never use newlines because it was causing issues lol.

    • thumbsup-_-2y

      Yeah, same thing. I have done the same with GPT-3.5. Simply ask it to output using the provided schema only and give a few examples. It always outputs in the provided JSON format

    • orasis2y

      What about using ChatGPT’s new function calling mechanism?

      • superasn2y

        That returns broken JSON a lot of the time too

  • hansvm2y

    A major part of the power of an LLM is the calibrated probability distribution in its responses, and this technique probably throws that ability away. Why is it good enough?

    As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".
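
    A quick numeric version of that toy example (characters treated as tokens; the "food" branch, which has no valid completion, is ignored as the example assumes):

        # Four outputs, equally likely a priori; the grammar requires a space.
        outputs = {"hello world": 0.25, "food": 0.25, "hello": 0.25, "good day": 0.25}

        # Rejection sampling: resample whole outputs until one passes the grammar.
        valid = {s: p for s, p in outputs.items() if " " in s}
        z = sum(valid.values())
        print({s: p / z for s, p in valid.items()})
        # {'hello world': 0.5, 'good day': 0.5}

        # Greedy masking: the first character is sampled from the unconstrained
        # model, then the mask forces a valid completion of that prefix. "hello"
        # and "hello world" share the prefix "h", so that mass piles onto
        # "hello world"; "g" leads to "good day".
        prefix_mass = {"hello world": outputs["hello"] + outputs["hello world"],
                       "good day": outputs["good day"]}
        z = sum(prefix_mass.values())
        print({s: p / z for s, p in prefix_mass.items()})
        # roughly {'hello world': 2/3, 'good day': 1/3} -- twice as frequent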

    The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.

    As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.

    • druskacik2y

      In this case (multiple choice generation), if one of the possible outputs does not match the regex, you can just exclude it from generation.

      I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless", which might really cause a problem. But no luck. Anyone have any idea? This could potentially be an interesting research question.

      • newhouseb2y

        An example from an earlier comment of mine on a different thread (assuming I've understood correctly):

        > let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.

        That said, you can use beam search to more or less solve this problem by evaluating the joint probability of all tokens in each branch of the grammar and picking the one with the highest probability (you might need some more nuance for free-form strings where the LLM can do whatever it wants and be "valid").
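
        A sketch of that branch-scoring idea, with hypothetical `token_logprob` and `tokenize` helpers standing in for model calls:

            import math

            def pick_branch(prefix, branches, token_logprob, tokenize):
                """Score each grammar-allowed completion by its joint log-probability
                and keep the most probable one, instead of committing token by token."""
                best, best_score = None, -math.inf
                for completion in branches:  # e.g. ["very_unhealthy", "moderately_healthy"]
                    context, score = prefix, 0.0
                    for tok in tokenize(completion):
                        score += token_logprob(context, tok)  # hypothetical model call
                        context += tok
                    if score > best_score:
                        best, best_score = completion, score
                return best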

        • IanCal2y

          This is a concern of mine, as well as limiting the amount that an LLM can talk through a problem - sometimes to nothing. Getting them to work through things IMO dramatically improves their output.

          My gut feeling is that taking the output and, if it's broken, then fixing it would have a better result - you could even then completely limit the output to only valid json. For your example, if it wrote "very_healthy" and was given an error message explaining that this wasn't an option and it had to choose from "very_unhealthy" or "moderately_healthy", I would expect a halfway decent model to pick "moderately_healthy".

          This has the benefit of allowing you to use a more powerful model for reasoning (like GPT4) and a local model where you can do this kind of token probability manipulation for just fixing the data.

      • hansvm2y

        The multiple choice example was just for tractable computations and illustrative purposes. Pretend the LLM has characters===tokens and is doing autoregressive probability prediction as per usual -- "f"-25%, "h"-50%, "g"-25% to start with, and then appropriate probabilities thereafter to yield that multiple-choice example (plus an <end-of-string> token).

        > I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless", which might really cause a problem. But to no luck. Anyone has any idea? This could potentially be an interesting research question.

        At one point in the past ChatGPT (at a model probability layer, not just because of the context window issue) was prone to truncating long JSON responses, and if that happened in a long string field then you'd see the observed behavior. An example application:

        (-) You're asking the LLM to turn some written podcast description into something machine-readable. You chunk the input, feed each chunk into the model (somehow; ignore the details; they're not important), and turn paragraphs into {speaker_name: str, timestamp: str, content: str} blobs.

        (1) The LLM is prone to turning long paragraphs into `{"content": "the beginning of the content...` patterns, using ellipses to indicate that there's more to that JSON object.

        (2) If you actually retry till the LLM succeeds, it's leaps and bounds more likely to end that string with a quotation mark if the string has all the original input. I.e., output like `{"content": "the beginning of the content..."}` is comparatively rare.

        (3) The article's technique, however, always morphs those truncated json blobs into valid json. Since the ellipsis is _valid_ at that point (a sub-string), instead of the vast majority of inputs failing, you end up with the vast majority succeeding and containing an incorrect ellipsis sub-string.

        In general, the LLM does autoregressive completions. Imagine two prefixes P1 and P2, each of which can be completed by classes of data so that P1{G1} adheres to the grammar, P1{F1} fails to adhere to the grammar, P2{G2} succeeds, and P2{F2} fails. With retry-till-passing-grammar the weighted probabilities are:

        P1{G1}: Chance[P1] Chance[G1 | P1]

        P2{G2}: Chance[P2] Chance[G2 | P2]

        Whereas the weighted probabilities produced by the technique are:

        P1{G1}: Chance[P1]

        P2{G2}: Chance[P2]

        In both cases you'd need to divide by the total probability, but the convolution by conditionals is both important and notably absent. For very simple schemas like {sentiment: "positive"|"negative"|"neutral"} the results might potentially be similar, but nothing in the idea of a greedy token filter forces that constraint.

  • sneedchucker2y

    Relevant: llama.cpp implemented grammar-based sampling last month.

    https://news.ycombinator.com/item?id=36819906 https://github.com/ggerganov/llama.cpp/pull/1773

    • remilouf2y

      We can extend our approach to grammar-based sampling, as explained in the paper linked above. Relevant PR: https://github.com/normal-computing/outlines/pull/178

      Our method is much more efficient. llama.cpp loops over the entire vocabulary (~50k tokens) at each step to generate the mask. We generate an index at initialization, and building the masks at each step only requires a dictionary lookup (we trade memory for speed). Sampling is just as fast as standard sampling.

      • popinman3222y

        It should hopefully be a quick change to llama.cpp to add a mask per grammar state to bring it in line with your generation method; I don't think the two are incompatible, thankfully.

        I do wonder how much you win here by masking the tokens? You still need to iterate along the output vector to apply the mask. Masking on the accelerator still requires filtering on the CPU side? Compared to running the language model, the cost of iterating over the edges in the grammar seems small.

      • burke2y

        Yes! This is closer to the approach I took in my port of llama.cpp's grammar support to PyTorch: https://github.com/Shopify/torch-grammar/blob/main/torch_gra... ... it generates a tensor mapping each PDA stack to a map of which tokens are acceptable from that state. It seems like a much better way to do it than looping over the sampled tokens on each turn.

    • btwillard2y

      We also had an implementation of grammar-driven guidance around the same time: https://github.com/normal-computing/outlines/pull/131. I imagine many others did as well, given all the papers we found on the subject. The point of this and our ongoing work is the availability of very low cost guidance, which was implemented a while ago for the regex case and expanded upon with JSON.

  • xigency2y

    Thanks for building this. The mechanics are such an obvious idea that it's astounding that the first-party platforms haven't done this yet. I would be interested to see how this could be used for other tasks outside of JSON that require structured input.

    • umvi2y

      > it's astounding that the first-party platforms haven't done this yet

      I was under the impression LLM tech is currently in a breakneck arms race and that things are dramatically changing every few months. It could simply just be a consequence of limited developer resources. It would be "astounding" if decade-old tech were missing such a fundamental feature, but for AI tech in arms-race mode it seems reasonable that they are still missing QoL features.

      • winwang2y

        I think they meant that you'd expect simpler/more obvious ideas to be implemented first.

    • remilouf2y

      Thanks! We have extended the approach to grammar-based sampling. We describe the approach in the paper linked above. The following PR is relevant: https://github.com/normal-computing/outlines/pull/178

      • Lerc2y

        Could this same approach be applied at training? If the guidance does a lot of the syntactical heavy lifting, would that create the opportunity for a model to use the weights for something else? Essentially, not bothering to reduce the error on things that the guidance will stomp on anyway.

    • LakshyAAAgrawal2y

      Hi, the paper at https://arxiv.org/abs/2306.10763 titled "Guiding Language Models of Code with Global Context using Monitors" shows how to have the language models generate code without hallucinated dereferences.

  • BoorishBears2y

    I'm not sure how this is different than:

    https://github.com/1rgs/jsonformer

    or

    https://github.com/newhouseb/clownfish

    or

    https://github.com/mkuchnik/relm

    or

    https://github.com/ggerganov/llama.cpp/pull/1773

    or

    https://github.com/Shopify/torch-grammar

    Overall there are a ton of these logit-based guidance systems; the reason they don't get much traction is that the SOTA models are behind REST APIs that don't enable this fine-grained approach.

    Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience)

    • remilouf2y

      Thanks for bringing clownfish and relm to my attention! afaik other libraries loop over the entire vocabulary at every step of the generation. We on the other hand build an index at initialization by looping once over the vocabulary. Then generation is just as fast as standard generation.

      • burke2y

        torch-grammar generates a mask per PDA stack... we don't try to compute all the possible stacks. I'm sure there's something smarter that could be done here and you've probably figured it out (though IIRC regular languages don't have the arbitrarily recursive stack problem that you get when you get to context-free languages?) anyway, in practice we spend a few milliseconds on the first few requests building caches and then just apply masks from caches after that.

        • remilouf2y

          Sorry for misrepresenting your work. Thank you for correcting me and the explanation. Will take a closer look.

      • mkuchnik2y

        Hi, author of ReLM here. We use automata as well, like you describe, if I understand correctly.

  • J_Shelby_J2y

    So to explain this another way:

    After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?

    Very slick!
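
    A minimal per-step sketch of that, assuming a PyTorch-style 1-D `logits` tensor and a precomputed `index` mapping each FSM state to the token ids allowed there (illustrative names, not the library's API):

        import torch

        def constrained_step(logits, state, index):
            """Mask everything the FSM disallows in `state`, then sample one token."""
            allowed = torch.tensor(list(index[state]), dtype=torch.long)
            mask = torch.full_like(logits, float("-inf"))
            mask[allowed] = 0.0  # keep only the allowed logits
            probs = torch.softmax(logits + mask, dim=-1)
            return torch.multinomial(probs, num_samples=1).item()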

    • dontreact2y

      You would also need to keep generating until the whole string is valid. And what if it gets caught in a loop?

      Not sure how this can really guarantee 100%

      • orlp2y

        > And what if it gets caught in a loop? Not sure how this can really guarantee 100%

        It's not great but after some timeout you can just set the mask to only include closing brackets.

        • aassddffasdf2y

          You would still have to ensure balancing somehow. Both "]" and "}" are valid "closing brackets" and the correct one to choose is context-dependent.

          • gyy523802y

            You can determine which brackets you need in which order by parsing the incomplete json which was generated so far.

            • dontreact2y

              That won't do it; you also need to close other stuff, e.g.

              {"this": "is valid json so farrrrrrrrrrrrrr

              But yeah the general idea makes sense. Once you hit a timeout, change the mask to things that will close existing open things in a valid manner (}, ), ], ")
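
              A rough sketch of that "close everything that's open" fallback, assuming the partial output is at a point where closing is enough (e.g. not mid-key):

                  def closing_sequence(partial_json: str) -> str:
                      """Return the characters needed to close whatever is still open."""
                      stack, in_string, escaped = [], False, False
                      for ch in partial_json:
                          if in_string:
                              if escaped:
                                  escaped = False
                              elif ch == "\\":
                                  escaped = True
                              elif ch == '"':
                                  in_string = False
                          elif ch == '"':
                              in_string = True
                          elif ch in "{[":
                              stack.append("}" if ch == "{" else "]")
                          elif ch in "}]":
                              stack.pop()
                      return ('"' if in_string else "") + "".join(reversed(stack))

                  # closing_sequence('{"this": "is valid json so farrrrr')  ->  '"}'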

      • kristjansson2y

        Same problem with normal sampling - if it doesn't pick the <end> token, you're stuck generating until you hit some stopping heuristic (max tokens, timeout, etc.)

    • remilouf2y

      Indeed. And we're able to update the mask with a dictionary lookup instead of looping over the entire vocabulary (slow!).

    • bmc75052y

      You also need some kind of beam search or rejection sampling since JSON tokens do not exactly correspond to logits.

      edit: They describe this more carefully in the paper.

    • behnamoh2y

      It’s actually a very old trick. Lots of libraries do this. idk what’s the big deal about this one.

      • remilouf2y

        Perhaps I didn’t explain clearly enough in the original post?

  • Q6T46nT668w6i3m2y

    Is this Brandon Willard the breakdancer from Detroit Brandon Willard?

    Edit: It is! https://brandonwillard.github.io/

    • btwillard2y

      Ha, yeah, in a distant, but really fun, past!

  • YeGoblynQueenne2y

    Hi, remilouf. You say that your background is in "probabilistic, relational and symbolic programming". In that case I suspect you understand that it is no problem to generate text from a regular or context-free grammar, or really any level of grammar. For example, you can do that very easily in Prolog (a relational language) given a grammar in Definite Clause Grammars notation.

    As far as I can tell your approach requires a grammar to be given by a user. In that case, what is the advantage of using an LLM to generate text? Why can't you just run your grammar as a generator and generate the text you want? That would save you the considerable trouble and cost of training an LLM in the first place. And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?

    • IanCal2y

      Wouldn't that generate an entirely random but valid output? Here you want a valid output related to the request.

      > And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?

      So that you can parse unstructured text from a person and return structured data for a machine.

      • YeGoblynQueenne2y

        >> Wouldn't that generate an entirely random but valid output?

        No. Grammars don't generate entirely random output. Even Probabilistic Context Free Grammars can generate deterministic output, depending on how they are sampled. The output can be related to some input, if desired, for example one can give a string with "holes" (variables) as input and have the holes filled-in by the grammar.

        >> So that you can parse unstructured text from a person and return structured data for a machine.

        If you are willing to spend the effort to write a grammar, you can do that without an LLM.

        • IanCal2y

          I wasn't talking about deterministic Vs nondeterministic.

          > If you are willing to spend the effort to write a grammar, you can do that without an LLM.

          How are you taking, for example, a request to make a "fun but not over the top character from the middle ages, with relevant weapons and a backstory. Game theme is a world populated by anthropomorphic vegetables." and getting back a character for the game in a specific JSON format without the LLM in your design here? That's not encodable in the grammar.

          • YeGoblynQueenne2y

            As far as I can tell you won't be able to use the approach proposed here to create a character matching your above description unless every element of it is encoded in the guiding grammar (including the possibility for the character to have middle ages-relevant weapons, and the anthropomorphic vegetables).

            At which point, again I have to ask: what do you need the LLM for? You've already done all the hard work by hand and the LLM is only adding some extraneous natural language parsing on top.

            Plus, if you already have the grammar that can cover the anthropomorphic vegetable world it's only a bit more work to use it to parse such natural language requests, anyway.

            I think people forget that grammars were the staple for parsing natural language and stuffing it into structured form for a very long time before LLMs, and they still mostly are.

            The point is that if you have structure, someone has to hand-craft that structure. Frex, if you have a language with a compiler, someone has to write the compiler. Then, if you want to make some unstructured text conform to your hand-crafted structure, you can only do that to the extent that the unstructured text itself is made up of elements of the structured form. If you have a grammar for frogs and blueberries, and write a poem about the dawn and foxes, you can't use the former to structure the latter, no matter what you do, and LLMs won't make this happen magickally, either.

            Essentially, your grammar is a type and any unstructured text you want to convert to a structure with your grammar must be a value that you can cast to that type.

            >> I wasn't talking about deterministic Vs nondeterministic.

            Then what? What do you mean by "random string"?

            • creatonez2y

              > I think people forget that grammars were the staple for parsing natural language and stuffing it into structured form for a very long time before LLMs, and they still mostly are.

              This is a rewritten history of natural language processing tech. Years of fine-tuned theory-heavy grammar coding for parsing and generating human language got the field basically nowhere.

            • IanCal2y

              > As far as I can tell you won't be able to use the approach proposed here to create a character matching your above description unless every element of it is encoded in the guiding grammar (including the possibility for the character to have middle ages-relevant weapons, and the anthropomorphic vegetables).

              You wouldn't need to, that's the point here. You let the LLM work on generating semantically valid responses and use a tool like this to restrict it to syntactically correct ones.

              Here's an example jsonschema (a bit handwritten so maybe some errors but it should be clear enough). Let the LLM deal with coming up with a name and backstory that work, making sure the description and type of the weapon make sense (gpt4 suggested a close range carrot dagger for example), and let this work as your type structure.

                  {
                    "type": "object",
                    "title": "character",
                    "properties": {
                      "backstory": {
                        "type": "string"
                      },
                      "weapons": {
                        "type": "array",
                        "items": {
                          "type": "object",
                          "properties": {
                            "name": {
                              "type": "string"
                            },
                            "description": {
                              "type": "string"
                            },
                            "weapon_type": {
                              "type": "string",
                              "enum": ["ranged", "close", "magic"]
                            },
                            "range": {
                              "minimum": 0,
                              "maximum": 150
                            },
                            "damage": {
                              "type": "number"
                            }
                          },
                          "required": [
                            "name",
                            "description",
                            "range",
                            "damage"
                          ]
                        }
                      },
                      "name": {
                        "type": "string"
                      }
                    },
                    "required": [
                      "backstory",
                      "weapons",
                      "name"
                    ]
                  }
              
              
              > Then what? What do you mean by "random string"?

              Nonsense. Like "Colorless green ideas sleep furiously" the famous sentence that's grammatically correct but utter nonsense.

              > Plus, if you already have the grammar that can cover the anthropomorphic vegetable world it's only a bit more work to use it to parse such natural language requests, anyway.

              I really do not think this is the case. Parsing and understanding arbitrary requests about something like this?

              • YeGoblynQueenne2y

                >> Here's an example jsonschema (a bit handwritten so maybe some errors but it should be clear enough).

                That'd be nice, but it's not how this tool works. If you look at the repo, there's an example of following a json schema or pydantic model. It's clear that if you wanted a "carrot dagger" in your json, you'd need to define it beforehand:

                  class Weapon(str, Enum):
                      sword = "sword"
                      axe = "axe"
                      mace = "mace"
                      spear = "spear"
                      bow = "bow"
                      crossbow = "crossbow"
                
                But perhaps I'm underestimating the tool's capabilities. If so, hopefully remilouf can correct me (and give an example of how the tool can be made to work as you want it).

                >> I really do not think this is the case. Parsing and understanding arbitrary requests about something like this?

                Not arbitrary. See my casting-to-type analogy. The point I'm trying really hard to get across is that generating free-form text is all nice and cool, but if you want to give it structure, you need to have the entire structure defined before-hand, otherwise the text that can't be made to conform to it simply won't.

                So if you haven't got anthropomorphic vegetables in your json schema, your LLM may generate them, but they'll never end up in your json.

                • IanCal2y

                  > It's clear that if you wanted a "carrot dagger" in your json, you'd need to define it beforehand:

                  No, only if you want to explicitly limit it to a set of options. You can have freeform fields, just like the jsonschema I provided. If you look at the example there's a character name which has a constrained length but is not limited to a set of options:

                      class Character(BaseModel):
                          name: constr(max_length=10)
                          age: int
                          armor: Armor
                          weapon: Weapon
                          strength: int
                  
                  The name there can be anything you want. This tool is, unfortunately, outrageously slow, so I put the json schema above (with a few fixes) into jsonformer, downloaded a small model, and used it to convert the GPT4 description into valid json:

                      {
                      "backstory":"Born in the tranquil meadows of Veggie",
                      "weapons":[
                          {
                              "name":"Leek Lance",
                              "description":"A long, green and white lance made from a leek",
                              "weapon_type":"distance",
                              "range":100.0,
                              "damage":75.0
                          },
                          {
                              "name":"Carrot Dagger",
                              "description":"A short, pointed dagger. It's sharp",
                              "weapon_type":"close",
                              "range":10.0,
                              "damage":50.0
                          }
                      ],
                      "name":"Sir Turnip Thistlebrook"
                      }
                  
                  > Not arbitrary.

                  Well exactly. If you want to support arbitrary requests while constraining the output, tools like this are an easy approach and I'm not sure what else comes close. An interactive character design flow would have something like the above as the defined output, and you could just keep asking for alterations as a human would ("make it more whimsical" or "not a king, something lower class") and have useful structured output.

                  > See my casting-to-type analogy. The point I'm trying really hard to get across is that generating free-form text is all nice and cool, but if you want to give it structure, you need to have the entire structure defined before-hand, otherwise the text that can't be made to conform to it simply won't.

                  The structure, sure. But the content can be extremely varied.

                  • YeGoblynQueenne2y

                    Thanks for the demonstration. Well, maybe I did underestimate the tool after all, although I'd prefer to see the entire session (prompt, grammar, and all the interactions) to be fully convinced.

                    I suspect though that the reason the tool was "outrageously slow" in your experiment is that you gave a very general grammar. Constraining it more (by giving exact descriptions of weapons) would perhaps make it work faster.

                    Also, it's obvious that while you'll get valid json like that, you have no guarantee that the contents will always match your request. This time you got a carrot dagger (again- I'd like to see the prompt that led to that, please), next time you might not.

                    • IanCal2y

                      Happy to help. My current focus is on LLMs and how to understand them (pros, cons, how to use them safely and where they can fit into your workflow), so opportunities to talk through these things are useful for me.

                      > I suspect though that the reason the tool was "outrageously slow" in your experiment is that you gave a very general grammar

                      Actually even smallish ones caused problems, but jsonformer (a similar tool) worked fine. Not sure what the issue is with this one; I couldn't get it to complete. Not sure if I've still got the hacked-together code I used to get the json. I was using very small models, which didn't help, but my internet is slow and I couldn't load anything decent in the time, so some of the testing was "here's an LLM's json-ish output, fix it to this exact schema". Smaller models needed more hand-holding. GPT-2 had no idea how to deal with it.

                      For jsonformer the grammar was near identical to what I posted before, I fixed a couple of typos I think.

                      Personally the flow of:

                      Reason about the problem

                      Write in english

                      Convert to JSON

                      - use a tool like this to fix broken JSON

                      Is a workflow I think is very applicable (you can use different models for any step too).

                      > again- I'd like to see the prompt that led to that, please

                      Sure, that was from gpt4, which actually was either fine or decent if given the jsonschema.

                      Here's the original prompt and the full response that had a full backstory:

                      > fun but not over the top character from the middle ages, with relevant weapons and a backstory. Game theme is a world populated by anthropomorphic vegetables

                      https://chat.openai.com/share/4037c8b3-d1bf-4e66-b98d-b518aa...

                      It's a shame you can't use some of these tools with gpt4, it's in a class of its own.

                      > Also, it's obvious that while you'll get valid json like that, you have no guarantee that the contents will always match your request

                      Yeah, absolutely. You need to be doing something simple enough for the LLM in use to reliably generate sensible output; tools like this then let you integrate that into other systems. How best to use LLMs really comes down to picking a good one for the use case and how critical errors are - proposing D&D characters is a very low-risk option (human oversight, no automatic application, errors are mostly just annoying, fixing is easy).

                • remilouf2y

                  You can definitely let the model improvise by defining `weapon` as `Union[Weapon, str]` if that's what you're asking.
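
                  For instance, a minimal sketch along those lines (reusing the `Weapon` enum quoted above):

                      from enum import Enum
                      from typing import Union
                      from pydantic import BaseModel

                      class Weapon(str, Enum):
                          sword = "sword"
                          bow = "bow"  # ...plus the other fixed options from the example

                      class Character(BaseModel):
                          name: str
                          # Either one of the fixed Weapon options, or any free-form
                          # string the model improvises (e.g. "carrot dagger").
                          weapon: Union[Weapon, str]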

    • Silasdev2y

      The idea is not to just generate any random string that matches the grammar. The idea is that if your request is "What are the first 10 digits of pi?" and you restrict the response to the regex: "[0-9]+\.[0-9]+", then you actually receive a correct answer of "3.1415926535" and not just a random string such as "1.2346789", which also happens to match the pattern.

      • YeGoblynQueenne2y

        That will only work up to the point when the LLM can't generate a correct answer, whether conforming to a grammar or not. After that point, you'll just get grammatically correct bullshit.

        Also, as noted in my reply to a sibling comment, grammars do not generate "any random string". That's the whole point of a grammar, that the generation is not random. For example it is perfectly feasible to write a grammar that completes a sentence with missing words, or continues some text etc.

        And to be clear, it is entirely feasible to write a grammar that takes some string as input and generates a string as output that is a transformation of the input string satisfying some constraint. This kind of grammar is known as a transducer.

        None of this should come as a surprise. Statistical language models are simply an alternative to knowledge-engineered grammars, used to do the same things that one can do with a grammar (except for the determinism). In a broad sense, a statistical language model is a kind of grammar, or perhaps it makes more sense to say that a grammar is a deterministic language model.

    • remilouf2y

      IanCal said it all. But for alternative approaches that also use LLM (with miniKanren) you can check https://arxiv.org/abs/1809.02840

      • YeGoblynQueenne2y

        See reply to IanCal's comment then.

        Later edit: you have a nice way to generate unstructured text, and now you want to go and bolt a structured representation on top. So now you have to do all the hard work by hand, again, to write the structured representation. That sounds like a regression.

        I'll have to make time to read your paper, thanks for linking it.

  • aduffy2y

    This is exciting, we built a similar tool[1] recently specifically targeted at constraining llama output to match a TypeScript interface.

    I firmly believe that output format guarantees are going to be important for real (non-toy) use cases for LLMs

    [1] https://github.com/ggerganov/llama.cpp/discussions/2494

  • Scaevolus2y

    Are there temperature or sampling parameters for generate.regex? I'm poking around trying to generate password mnemonics (https://rmmh.github.io/abbrase/), and it really doesn't like actually giving me proper words:

        >> model = models.transformers("gpt2-medium")
        >> generate.regex(model, r"Rea[a-z']{,10} lik[a-z']{,10} acr[a-z']{,10} ene[a-z']{,10} sta[a-z']{,10}\.", max_tokens=30)("A memorable phrase is:")
        'Rearmingandme like acrowetteanda eneatubootank stackfishkies.'
  • Scene_Cast22y

    One potential drawback I can see is if the viable tokens are far down the list of predictions. In that case, filtering down to just those tokens is a distribution shift, and the resulting output may be less stable / less sensible.

    • Scarblac2y

      It can't be less sensible JSON than syntactically invalid JSON. All the tokens higher on the list are syntax errors.

      • skybrian2y

        It seems unlikely for JSON, but this might indicate that the model has somehow painted itself into a corner and the best thing to do is backtrack?

        Regenerating the entire response could be seen as an extreme form of backtracking.

      • haswell2y

        That depends highly on the values contained within the JSON. Syntactically correct is only useful if the rest of the content is useful.

    • pshc2y

      Exactly my concern. If the model isn't sure-footed about the path forward, it seems prudent to take that fact as information and adjust the initial conditions, rather than forcing the model into a potentially hallucinatory idea-space.

      • potatoman222y

        What are characteristics of a "hallucinatory idea-space"? If you're enforcing the model outputting a closing bracket instead of a random string of numbers, that seems like a win for JSON formatting.

    • remilouf2y

      Indeed, this remains an empirical question.

    • contravariant2y

      More concretely, sometimes it is not enough to simply constrain the next token, backtracking might end up being better.

  • Deukhoofd2y

    Looks interesting! How would you say it compares to Microsoft's TypeChat (beyond the obvious Python/TypeScript difference)?

    https://microsoft.github.io/TypeChat/blog/introducing-typech...

    • remilouf2y

      Thanks for bringing this library to my attention! From my understanding, TypeChat proceeds by (1) generating (2) attempting validation (3) if it fails, call the LLM again to fix the output (4) etc.

      Our method, on the other hand, guarantees that the output will follow the specs of the JSON schema. No need to call the LLM several times.

      • 1wheel2y

        There's also https://lmql.ai/

        • remilouf2y

          LMQL (and guidance https://github.com/guidance-ai/guidance) are much less efficient. They loop over the entire vocabulary at each step; we only do it once, at initialization.

          • potatoman222y

            Does looping over the vocabulary add much overhead to the tok/s? I imagine they're just checking if the input is in a set, and usually there's only ~30k tokens. That's somewhat intensive, but inference on the neural net feels like it'd take longer.

            • remilouf2y

              They’re checking regex partial matches for each possible completion, which is indeed intensive. You can look at Figure 2 in our paper (link in the original post) for a simple comparison with MS guidance, which shows the difference.

    • 2bitencryption2y

      TypeChat: let's try really hard to try to convince the model to make the highest-scoring tokens follow the grammar we want.

      Guidance (and this project?): Let's not even bother with trying to convince the model; instead, we'll only sample from the set of tokens that are guaranteed to be correct for the grammar we want to emit.

      • btwillard2y

        Yeah, and our addition to all that is to almost completely remove the cost of determining the next valid tokens on each step.

  • Ilasky2y

    OpenAI has this capability built in with functions[0], I believe! Building my own project[1] I have implemented functions in combination with guidance[2] and haven’t had a hiccup yet! I have a JSON parser function there, just in case, but it seems to be working reliably.

    Here’s a bit more of a description of using the functions API for JSON returns: https://yonom.substack.com/p/native-json-output-from-gpt-4

    [0] https://openai.com/blog/function-calling-and-other-api-updat...

    [1] https://resgen.app

    [2] https://github.com/guidance-ai/guidance

    • londons_explore2y

      >OpenAI has this capability built in with functions

      From OpenAI's docs:

      > note: the model may generate invalid JSON

      I would guess they don't use your method - and perhaps they should!

      • Ilasky2y

        Good catch! It really is a combination of guidance guaranteeing JSON output and OpenAI getting it right a good majority of the time[0]. But yeah, I can see how it can be frustrating that the JSON output is not guaranteed by the docs.

        [0] >>99% in my experience

        • Ilasky2y

          That said, I am definitely going to look into this library and compare its results to guidance, since they claim it blows it out of the water (which is very enticing!)

    • thomasfromcdnjs2y

      I do the same, just tell OpenAI to call a parser at the end and voilà.

  • Animats2y

    OK, you get syntactically valid JSON, but does it contain the correct info? This is effectively a polisher, like spell check, which gives the output superficially correct form but doesn't understand the content. Right?

    • coder5432y

      This analogy falls apart because the spellchecker is separate from the author, and doesn’t know what the author intended.

      Here, the LLM is still dictating the token probabilities, so the content will be as correct as the LLM can make it, given the constraints. AIUI, the sampler is just choosing tokens on a combination of probability and syntactic correctness, instead of strictly on probability.

      If the LLM is forced to provide a numeric temperature for Seattle, and the input doesn’t contain that data, then obviously the LLM will be forced by the sampler to provide a random answer if the sampler will accept nothing else, much like a human who is forced to mark “true”/“false” on an online form, with no option to reject the question and explain that the question isn’t even a true/false question.

      I don’t know about this specific implementation, but it seems important to design systems like this to always “accept” (sample for) an error response from the LLM so that it can hopefully reject invalid requests.
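
      For example (names here are purely illustrative), the schema itself can carry that escape hatch:

          from typing import Union
          from pydantic import BaseModel

          class WeatherAnswer(BaseModel):
              city: str
              temperature_c: float

          class CannotAnswer(BaseModel):
              # A valid "I can't answer" branch, so the sampler never has to
              # force the model to invent a temperature it was never given.
              reason: str

          Response = Union[WeatherAnswer, CannotAnswer]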

      But, yes, all the usual caveats about LLMs apply. It can’t provide correct answers to things it doesn’t know. Forcing it to respond with the answer to the life, the universe, and everything is not going to provide a meaningful response. Even things it “knows”, it can still get wrong sometimes.

      • anticrymactic2y

        I'm stupid with LLMs, but would it be possible to have this output with gpt4's intelligence, or would it have to be specifically trained?

        • coder5432y

          It’s something OpenAI should really implement themselves. Implementing it from the client side will mean sending the same request over and over until you get a syntactically correct answer, which is going to be much slower and likely to cost a lot. The server can guide the generation, but the client can (currently) only hint at what it wants. ChatGPT4 is fairly good at following schemas, and that’s what OpenAI currently relies on, but they make no guarantees.

          It likely wouldn’t require additional training. It’s a change to the way the server uses the model, not a change to the model itself… but we don’t know ChatGPT4’s true architecture because OpenAI won’t publish anything about it, so it’s hard to say for sure.

      • chipsrafferty2y

        Why isn't it possible to design LLMs that say "I don't know"?

        • coder5432y

          It is possible… ChatGPT4 says that all the time. It’s just not guaranteed that an LLM will recognize that it doesn’t know a particular answer every time. I had even already mentioned in the comment you’re replying to that you should leave room in the sampler to allow the LLM to provide error responses. I never said it wasn’t possible.

          Not to anthropomorphize LLMs too much, but humans will also sometimes respond confidently with a wrong answer too. Both LLMs and humans will sometimes say the wrong thing when they don’t actually know an answer, but sometimes (hopefully most of the time) they will instead say that they don’t know the answer.

          Contrary to another response here, I do not believe it's a good mental model to say that LLMs only respond "I don't know" only when they have specifically memorized that they don't know a fact. When you're dealing with tens or hundreds of billions of parameters, the "why" is often elusive and complicated. It's also probabilistic; it may respond that it doesn't know one time, but the next time, it may unfortunately claim to know an answer it doesn't know -- which is a form of hallucination. If it was just about memorization, then it wouldn't be probabilistic. Reducing hallucinations is one of the major goals of LLM research today, and ChatGPT4 performs much better in this area than ChatGPT3.5 did.

          Here is a quick example of ChatGPT4 saying it doesn’t know: https://chat.openai.com/share/7b72b109-fb84-4988-891b-f2eecc...

          I'm sure no one at OpenAI specifically trained ChatGPT4 to recognize a question about the Stanley Cup and respond that it doesn't know the answer, but it still said that it didn't know. It absolutely did not start a sentence with "the winner of the 2023 Stanley Cup was..." and then wander its way into a bad answer. That's not a good representation of how this stuff works, even though it does sample one token at a time.

          • _flux2y

            > I'm sure no one at OpenAI specifically trained ChatGPT4 to recognize a question about the Stanley Cup and respond that it doesn't know the answer

            Why are you sure about that? I mean, maybe they have not specifically added every sports event of 2023 to such a list, but the Stanley Cup could be there. Or maybe they _have_ indeed listed them, given how handy an LLM could be for extracting such a list from, say, Wikipedia!

            Is there a whitepaper on how the "I don't know" gets produced? Or even how it could be reproduced?

            Btw, I was able to have ChatGPT 3.5 give this roundabout response about it: https://chat.openai.com/share/f0f6371e-10c6-4708-ba5c-7503ca...

            > Two digital assitants are exchanging messages. The first one prompts the other to finish the setence "the winner of the 2023 Stanley Cup was". Reproduce the whole discussion.

            ..

            > Assistant 2: Sure thing! "The winner of the 2023 Stanley Cup was the Montreal Canadiens."

            (which is, not unexpectedly, incorrect)

            • coder5432y

              > Btw, I was able to have ChatGPT 3.5 give this roundabout response about it

              That wasn’t a response to the user asking a question about who won. You asked it to write a story. It wrote a story. It didn’t really do anything wrong there. ChatGPT3.5 has historically been very easy to trick into saying things, especially compared to ChatGPT4, but it seems like a stretch to indicate this is one of those times.

              Regardless, the comment you're replying to was specifically about ChatGPT4, and ChatGPT4 refuses to even do that much: https://chat.openai.com/share/75122d92-12eb-4627-97a8-8300de...

              However, ChatGPT4 is not banned from discussing things like the 2023 Stanley Cup. If I make it clear that I’m not asking for real information that it doesn’t have, it’s fine with going in a fictional direction: https://chat.openai.com/share/21e750c4-33f0-4ce6-b97b-c7bfbf...

              ChatGPT3.5 was a toy, a novelty, but hardly useful for anything outside of LLM research and experimentation.

              > Is there a whitepaper how the "I don't know" gets produced? Or even how it could get reproduced.

              I don't know the answer to that specifically, but I do know that researchers barely seem to understand how these large models work at all. I honestly kind of doubt anyone knows the answer to that yet. Relevant discussion from a few months ago: https://news.ycombinator.com/item?id=34821414

              Researchers are still just trying to understand GPT-2's inner workings.

              > Why are you sure about that?

              Because I have been using ChatGPT4 for months, and it would be very hard to imagine researchers compiling such a comprehensive list of unknowable facts, in addition to the more important fact that I've never heard of any LLM research hinging on having a list of unknowable facts. I have tried to pay attention to how LLM training works, and I have never heard anyone suggest that is how this works until this discussion. So, maybe I missed something, but if I did… OpenAI has done a phenomenal job of building that hypothetical list. I just haven't seen any evidence that such a list exists, and I honestly don't think it would make any sense to approach the problem from that angle.

              There are various open-source models. Where are their lists of unknowable facts? Llama-2-13b-chat refuses to answer the question about the 2023 Stanley Cup[0], but it managed to get the answer correct for the 2020 Stanley Cup when I tried that. (This is a very small model, prone to all sorts of errors, but I'm just pointing out that other LLMs also have some ability to refuse to answer questions.) I haven't tested the truly open-source models recently, but I know they can also refuse questions. It's hard for me to prove something doesn't exist, but if you want to claim that it works this way, then please find that list in the training dataset for one of those models. I would be very interested to see what kind of things are on such a list.

              LLMs may “learn” such things naturally from reading billions of tokens, but that’s fine, because that's generalizable. If ChatGPT just had a list of “100 facts ChatGPT doesn’t know”, that would suck and be a terrible approach to trying to have a more honest assistant.

              -------------

              [0]: This was the discussion, which was definitely a "roundabout" way of saying it didn't know:

              [INS]: Who won the 2023 Stanley Cup?

              [\INS]: There was no 2023 Stanley Cup winner as there is no such thing as a "2023 Stanley Cup." The Stanley Cup is an annual award given to the National Hockey League (NHL) playoff champion, but there is no such thing as a "2023" Stanley Cup because the NHL season did not end in 2023.

              I opened a new conversation and tried this:

              [INS]: Who won the 2020 Stanley Cup?

              [\INS]: The Tampa Bay Lightning won the 2020 Stanley Cup.

        • skybrian2y

          They do, but it's a form of imitation, not actually knowing what they don't know.

          Ask an LLM to imitate a confident physicist and it will try, regardless of how much physics it knows.

          Or if you tell ChatGPT that it's wrong multiple times, it may learn the pattern and assume it's always wrong, resulting in a downward spiral. (This can happen when using Code Interpreter and it makes several failed attempts to correct a mistake.)

          The difficult research problem is training it to have an accurate model of what it knows.

        • Lerc2y

          They can say "I don't know" when they contain the fact that they don't know something. For instance, saying "I don't know" could be a response to "What is the meaning of life?"

          On the other hand, if you ask an LLM how to do something about fish maintenance that it does not know how to do, it might produce an answer like "Sure, first take your fish and ", at which point all of the options for the next word are all over the place because there isn't the information available to guide the choice. The sentence started as if it knew the answer because there was no information to say that it didn't. By the time the absence of information has an impact, the LLM is already committed to the sentence where it is confidently giving you an answer.

          • 2y
            [deleted]
        • mr_toad2y

          > Why isn't it possible to design LLMs that say "I don't know"?

          You have to have an understanding of ‘I’ before you can make that judgement.

        • bestcoder692y

          text-davinci-002 used to make me so mad with how often it’d do that

    • burke2y

      You can go pretty deep once you get context free grammars. For example, I'm using torch-grammar (but outlines should be able to do the same thing once CFG support is merged) to not just restrict the format of a generation to a DSL's syntax, but to restrict the keys it updates to valid keys in a known set.

      e.g.:

          int_key ::= DQUO ("f" ("e" ("atured-" ("b" ("log." ("p" ("ost_limit" | "a" ...
      
      Obviously, yeah, it doesn't "understand" the content, but that's what the LLM is for. It's remarkable how plausible the generations you can get out of random noise are with a sufficiently-restrictive grammar. Bolting that onto a well-trained LLM is pretty powerful.
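
      For reference, generating that kind of rule from a known key set can be as simple as the following sketch (a flat alternation; the trie-factored form above is just an optimization for the grammar engine):

          def keys_to_rule(keys):
              # Flat alternation over the allowed keys.
              alternatives = " | ".join(f'"{key}"' for key in sorted(keys))
              return f"int_key ::= DQUO ({alternatives}) DQUO"

          print(keys_to_rule(["featured-blog.post_limit", "featured-blog.path"]))
          # int_key ::= DQUO ("featured-blog.path" | "featured-blog.post_limit") DQUO
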
      • btwillard2y

        FYI: We've had grammar constraints available in Outlines for a while, but not using the FSM and indexing approach that makes the regex case so fast. My open PR only adds that.

    • empath-nirvana2y

      This isn't really an interesting question, is it? Everyone knows that ChatGPT is not an oracle. It doesn't need to output the correct information 100% of the time.

      • offmycloud2y

        I don't think that everyone, or even a majority of people understand this. That's certainly not how AI is being marketed to the general public. The concern here is that syntactic correctness might be mistaken for factual accuracy.

  • anotherpaulg2y

    For complex tasks like coding, my experience is that asking for a complex output format hurts performance on the underlying task. This showed up clearly in code editing benchmarks of GPT-3.5 and GPT-4:

    https://aider.chat/docs/benchmarks.html

    I’m curious if you have measured whether the “constrained generation” that you’re doing suffers from similar downsides?

    • darkteflon2y

      We’ve seen this too. We run them as two separate stages - “reason”, log the intermediate output, then parse.

    • infecto2y

      100% have observed the same over many tests. There's no loss in fidelity when responding in a spoken-language style of formatting, but using JSON is disastrous.

      • speedgoose2y

        While not ideal, could a workaround be to ask in spoken language first, and then ask to format it in JSON?

        • infecto2y

          That’s what we have been doing. Two passes. The task and then the format.

      • nouri2y

        Using OpenAI Function Calls or asking for JSON in the prompt?

        • infecto2y

          I have noticed it in both but have been working with json output before function calling was introduced so I have more evidence on that side. The times I have tried to implement it in a function call I was equally unimpressed with it.

  • simonw2y

    I really hope OpenAI add something like this to their endpoints soon.

    Being able to pass up some kind of grammar (a regular expression, or a JSON schema, or some other format) and have this trick run during their token sampling process to ensure the output was compliant would be incredibly useful.

    • joshuanapoli2y

      Isn't the Function Calling feature meant for this purpose? It guides the LLM to output according to the given schema. The name of the feature is a little misleading.

      https://platform.openai.com/docs/guides/gpt/function-calling

      • tornato72y

        Function Calling is fine-tuned to a certain output format, but it very often strays from that format. My function-calling-handling code has a mess of edge case handlers that catch when GPT-4 is calling functions incorrectly.

      • M4v3R2y

        It’s not, though; they even say in their docs that sending a schema does not guarantee that the model will actually adhere to the schema or even produce valid JSON.

      • simonw2y

        Surprisingly the function calling mechanism doesn't appear to use this trick - apparently it's still possible to get the wrong JSON structure back from it occasionally.

    • potatoman222y

      They recently added logit biases, so that's a start.

      • remilouf2y

        It's limited to 300 logit biases at a time. Given that GPT-4's vocabulary is ~100k tokens, that's not nearly enough for reliable guided generation. It could work in some cases, though, and another advantage of this work is that we can determine that before generating.

  • coder5432y

    As a more general comment, the repo README provides examples that all use gpt2. It would be nice to see at least one example that invokes llama2, since I feel like that would make sure the reader knows that this library can use models that are more modern and interesting.

    • Havoc2y

      Inclined to disagree - gpt2 is far more likely to produce gibberish. So if you can force specific outputs on that then it is a good demo that higher quality models will be even better

      • coder5432y

        Maybe... but then if I want to use something better, I have to figure out how by myself. I said "at least one example", not "please change all the examples to llama2." I agree with your general point. It would be nice if there were an example of how to use a better model.

        Models often have different shapes and requirements, so is it really as simple as changing the string "gpt2" to "llama2-13B-Chat" and it will magically work? If so, that's great, and I wish that was made clear. Unfortunately, that hasn't always been my experience with other libraries.

        • remilouf2y

          Agree, working on a Colab with a "better" model as we speak.

    • swyx2y

      it would also be nice to see one example that uses gpt4.

      • coder5432y

        Given how this works, I don’t think that is possible unless OpenAI implements it themselves.

        • swyx2y

          really? the docs seem to promise something like that "can work with any model"

          • coder5432y

            Yes, any model that you can run on your computer. It changes the way that the tokens are sampled from the LLM, and OpenAI does not give you deep enough access into the pipeline to affect that.

  • lettergram2y

    Few thoughts, you're effectively creating representations that can convert to JSON (kudos!)

    Can't mention how we did it (there are a lot of public patents, if interested), but back in 2018 we had a way to generate synthetic data (statistically, structurally similar) off any dataset - https://medium.com/capital-one-tech/why-you-dont-necessarily... You could also design datasets if you wanted.

    It'd keep similar relations and worked pretty darn well. Not the exact same, but always produced valid JSON.

    • remilouf2y

      Thank you for the pointer. The best part of posting on HN is the long list of related work you get in response.

  • visarga2y

    Enforcing JSON schema, regex and grammars is very useful. But how can we enforce decoding spans from a document? The decoded text should be copied from a list of spans in the input document. It would be useful for extractive tasks.
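
    One workaround I can imagine with the regex support: build the pattern as an alternation over candidate spans pulled from the input document (span list made up for illustration):

        import re

        spans = ["Albert Einstein", "the photoelectric effect", "1921"]
        pattern = "|".join(re.escape(span) for span in spans)
        # pattern could then be handed to a regex-constrained generator such as generate.regex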

  • gsuuon2y

    Generating an FSM over the vocabulary is a really interesting approach to guided sampling! I'm hacking on a structured inference library (https://github.com/gsuuon/ad-llama) - I also tried to add a vocab preprocessing step to generate a valid tokens mask (just with regex or static strings initially) but discovered that doing so would cause unlikely / unnatural tokens to be masked rather than the token which represents the natural encoding given the existing sampled tokens.

    Given the stateful nature of tokenizers, I decided that trying to preprocess the individual token ids was a losing battle. Even in the simple case of whitespace - tokenizer merges can really screw up generating a static mask, e.g. we expect a space next, but a token decodes to 'foo', but is actually a '_foo' and would've decoded with a whitespace if it were following a valid pair. When I go to construct the static vocab mask, it would then end up matching against 'foo' instead of ' foo'.
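
    To make the whitespace case concrete (GPT-2's byte-level BPE folds a leading space into the token itself, shown as 'Ġ'):

        from transformers import AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        print(tok.tokenize("foo"))   # no leading-space marker on the first token
        print(tok.tokenize(" foo"))  # first token carries the 'Ġ' space marker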

    How did you work around this for the FSM approach? Does it somehow include information about merges / whitespace / tokenizer statefulness?

  • itissid2y

    I have noob thought on the potential of these in Formal path planning. Specifically given a set of functions that basically map {State -> Actions} given preconditions, transition functions (heavily paraphrasing STRIPS[1]) can a correct and optionally "realistic" plan be generated[2]? I am quite interested in this. It seems clear that the issue is that there is no "guidance" like DFA on what is the correct next symbol for a Plan, but perhaps the AI can generate some kind of a probability or order on what is the best step and one can go from there...

    Are you guys thinking about this direction?

    [1] https://en.wikipedia.org/wiki/Stanford_Research_Institute_Pr...

    [2] Formal Planning decision problem(plan exists) given STRIPS spec is at least NP-Complete[1]. There are several mathematical, logical and statistical "tricks"(e.g. [3]) that are used to bring down the complexity and try find a plan using heuristics(thinking MDPs, POMDPs here). This is not new, everyone in LLM research knows this.

    [3] "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning": https://www.sciencedirect.com/science/article/pii/S000437029...

    • YeGoblynQueenne2y

      >> Specifically given a set of functions that basically map {State -> Actions} given preconditions, transition functions (heavily paraphrasing STRIPS[1]) can a correct and optionally "realistic" plan be generated[2]?

      Maybe, but the results would be unreliable. And if there's one thing that Good, Old-Fashioned, automated planning and scheduling is good at, that is reliability.

    • 2y
      [deleted]
  • thatcherthorn2y

    This is awesome. I have a vision to build self-managed software. This will be a great tool.

    • malux852y

      This is really great too, I am building self-generating experiments and molecular simulations with https://atomictessellator.com and I am going to try out this framework after work

    • remilouf2y

      Thank you! Hope this helps and opens many applications :)

  • cztomsik2y

    FYI llama.cpp can do that for a "while" https://github.com/ggerganov/llama.cpp/pull/1773

    Somebody is also working on a whisper.cpp version, which is maybe even more interesting because if you have grammar you can speak not only JSON but also a code (or anything)

  • kevinlu12482y

    This is amazing! For production and rapid-development use cases, though, we just use XML for information extraction. It's extremely easy to parse with regex, and the models rarely make mistakes since the start and end tokens are uncommon. At least, that's with the OpenAI models, which are different from the use cases in this Show HN.

  • Ycros2y

    Having played around with this sort of thing in the llama.cpp ecosystem when they added it a few weeks ago, I will say that it also helps if your models are a) tuned to output JSON and b) prompted to do so. Anything you can do to help the output fit the grammar helps.

  • leetharris2y

    How does this compare in terms of latency, cost, and effectiveness to jsonformer? https://github.com/1rgs/jsonformer

    • remilouf2y

      Figure 2 in our paper (https://arxiv.org/abs/2307.09702) shows the difference between guidance and outlines to generate a sequence that is valid to a regex. Jsonformer uses the same technique as guidance. Extrapolate this to several fields.

      Note that we still need to manage the KV cache in outlines. It’s a small interface change that will be made this week hopefully, but we’ve been focusing on constrained generation so far.

    • bhickey2y

      jsonformer uses a template rather than a DFA. The logit masking seems to be identical, though.

  • Havoc2y

    That looks intriguing. Managing that interface has proven challenging - especially on data-cleaning tasks where the model ends up talking rather than doing. A bit more in the way of guardrails would be helpful there.

    • remilouf2y

      That's what we noticed as well, and we were not satisfied with the `guardrails` approach of just rejecting invalid outputs. The method makes the interface robust.

  • dvasdekis2y

    Would love to have a tutorial on how to install and run this locally with a nice model, for those of us who are behind the 8-ball with torch, transformers, diffusers, llama2 etc.

  • btbuildem2y

    I feel like I'm missing something very basic here, but is this library intended to be used with an existing model? If so, could you point to an example?

  • jmcminis2y

    Are there edge cases here due to context length?

    1. I have a json schema with required fields. I complete the json, but do not include the required fields.

    2. I run out of token from the model before I finish the json object because I'm in the middle of some deep, nested structure.

    These seem solvable, just edge cases to control for by either reserving tokens, randomly generating required tokens until completing the json, or something more sophisticated.

  • demarq2y

    I've spent two days trying to make this work with anything other than gpt2 and I just can't get it to work.

    GPT2 doesn't seem to take instruction well. I've tried llama gpt-medium etc etc.

    They all either pick up a different language, or freeze.

    EDIT: I see tons of activity and work in the github issues, so ignore this for now.

    Super excited when I'll be able to have this working for myself!

  • coding1232y

    Can someone re-explain all of this? What's the difference between going to GPT-3.5 and asking it to give me some information in JSON, vs. whatever this library is doing?

    • odyssey72y

      Each time you run an LLM on a sequence of tokens, it generates a probability distribution giving each token's likelihood of occurring next in the sequence. To actually determine the next token in the sequence, any of various strategies can be used to select from that probability distribution.

      The challenge in guided generation is conforming the output sequence with a formal language such as a JSON schema or even a rigorously grammatical version of English; typically in a formal language, most tokens in the vocabulary will be _impossible_ as next token candidates rather than merely unlikely. The authors explain that most guided generation systems are checking each token in the vocabulary to see if it would be a valid continuation of the sequence, filtering the probability distribution according to formal constraints before making the next token selection. The authors improve upon this process by indexing valid next tokens according to a formal language recognizer's possible states, so that the list of valid next tokens can be looked up in constant time rather than testing every token in the vocabulary.

      With the valid next token options in hand, the probability distribution for next tokens is filtered and then a selection is made.
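
      In code, the filtering step looks roughly like this (a toy sketch, not the library's actual implementation; `allowed_token_ids` would come from the recognizer's index):

          import torch

          def sample_constrained(logits: torch.Tensor, allowed_token_ids: list[int]) -> int:
              # Start with every token forbidden, then re-enable the valid continuations.
              mask = torch.full_like(logits, float("-inf"))
              mask[allowed_token_ids] = 0.0
              # Disallowed tokens end up with probability exactly zero after softmax.
              probs = torch.softmax(logits + mask, dim=-1)
              return torch.multinomial(probs, num_samples=1).item()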

  • AtlasBarfed2y

    Ok so:

    - for what energy/processing cost per validation?

    - how much of the input space was tested (unicode chars, escaped chars, newlines, etc)?

    - are you doing this as a service? We've seen LLMs already evolve negatively in some capabilities over time, so do you have a constant "ping" test suite validating the LLM's performance?

  • 2bitencryption2y

    it still blows my mind that OpenAI exposes an API with Functions calling, and yet does not guarantee the model will call your function correctly, in fact, it does not even guarantee the output will be valid JSON.

    When this is, really, a solved problem. I've been using github.com/microsoft/guidance for weeks, and it genuinely, truly guarantees correct output, because it simply does not sample from tokens that would be invalid.

    It just seems so obvious, I still have no clue why OpenAI does not do this. Like, why fuss around with validating JSON after the fact, when you can simply guarantee it is correct in the first place, by only sampling tokens if they conform to the grammar you are trying to emit?

    • padolsey2y

      IANA{LLM}, but if you're only sampling from a "correct" grammar, you are potentially (very potentially) forgoing what might otherwise have been a more desirable and more semantically useful token. Most of the models have been trained on myriads of human language, not structured data necessarily, and so I'd rather elect for a more semantically enriched format (e.g. XML or YAML) because those are designed to be ~more human readable. Or perhaps more preferably: have the boss LLM pump out what it excels at (strings of prose most of the time) and have a secondary model with a stricter grammar convert that to JSON.

    • newhouseb2y

      I think this is likely a consequence of a couple of factors:

      1. Fancy token selection w/in batches (read: beam search) is probably fairly hard to implement at scale without a significant loss in GPU utilization. Normally you can batch up a bunch of parallel generations and just push them all through the LLM at once because every generated token (of similar prompt size + some padding perhaps) takes a predictable time. If you stick a parser in between every token that can take variable time then your batch is slowed by the most complex grammar of the bunch.

      2. OpenAI appears to work under the thesis articulated in the Bitter Lesson [i] that more compute (either via fine-tuning or bigger models) is the least foolish way to achieve improved capabilities hence their approach of function-calling just being... a fine tuned model.

      [i] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

      • WiSaGaN2y

        The "Bitter Lesson" indeed sheds light on the future trajectory of technology, emphasizing the supremacy of computation over human-designed methods. However, our current value functions often still need to focus on what we can achieve with the tools and methods available to us today. While it's likely that computational tools will eventually replace human-guided "outlines" or "guidance", that are used to shape LLM outputs, there will likely always be a substantial amount of human-structured knobs necessary to align computation with our immediate needs and goals.

      • reasonabl_human2y

        What a fascinating read, thanks for sharing that link.

    • BoorishBears2y

      I just left a comment along these lines, but realistically it's probably cheaper to just re-emit than to add the machinery that enables this to their existing architecture.

      At most I could have seen them maybe running a schema validator against the output and re-requesting on your behalf, but even that's probably cheaper for them to do client side (I will say, I'm surprised their API wrapper hasn't been updated to do this yet)

      • 2bitencryption2y

        > maybe running a schema validator against the output and re-requesting on your behalf

        this is the part that blows my mind. You don't have to do this! You don't have to sample the entire output, and then validate after the fact.

        You're not required to greedily pick the token with the highest score. You get the scores of all tokens, on every forward pass! So why even waste time picking invalid tokens if you're just going to validate and retry later on??

        (note: when I say "you" here, I mean whoever is hosting the model. It is true that OpenAI does not expose all token scores, it only gives you back the highest-scoring one. So a client-side library is not able to perform this grammar-based sampling.

        BUT, OpenAI themselves host the model, and they see all token outputs, with all scores. And in the same API request, they allow you to pass the "function definition" as a JSON schema. So why not simply apply that function definition as a mask on the token outputs? They could do this without exposing all token scores to you, which they seem very opposed to for some reason.)

        • BoorishBears2y

          Maybe re-read what I said?

          > realistically it's probably cheaper to just re-emit than to add the machinery that enables this to their existing architecture

          There are literally dozens of random projects that have implemented logit based masking, it's a trivial thing to implement.

          What's probably not as trivial is deploying it at scale with whatever architecture OpenAI already has in place. Especially if they're using the router-based MoE architecture most people are assuming they use.

          OpenAI doesn't expose token probabilities for their RLHF models, yet they did for GPT-3. Originally that led to speculation that it was to make building competitors harder, but they've now said they're actually still working on it... which leans even further into the idea that they may have an architecture that makes the kind of sampling these projects rely on more difficult to implement than normal.

          Given how fast and cheap they've made access to these models, their current approach is a practical workaround if that's the case.

          • behnamoh2y

            When GPT-4 first became available, I had a feeling that something about it felt “hacky”. Compared to GPT-3, which was more streamlined, mature, and well thought out, GPT-4 was like a system put together to outperform the previous one at all costs. I wouldn’t be surprised if that led to design decisions that made their model hard to improve. Maybe GPT-5 will not be around any time soon.

  • ianbutler2y

    https://github.com/newhouseb/clownfish

    Which I've been using for a while now, also restricts the sampling space to force correct generation, but does so as the result of a different process than yours.

  • IanCal2y

    I tried slight modifications from the example pydantic model and it's incredibly slow. Maybe I'm doing something wrong but I've a hefty box and a 3090, an example using gpt-2 doesn't seem like it should be that taxing.

    • remilouf2y

      It is currently limited by the time it takes to build the index. There are obvious optimizations we can apply to this, however in a production setting it does not matter much since you only need to build the index once for each (schema, vocabulary) pair.

      • IanCal2y

        Is there a rough guide as to how long to wait? I think it's definitely an important thing: if building takes 10+ minutes (or hours?) for even very basic models, that's a fundamentally different production architecture (as launching from a blank slate is no longer feasible). It's also a big devx issue.

        I'd highlight this somewhere on the readme as I wasn't sure if it was just broken or how long to wait.

  • dsrtslnd232y

    It says "Outlines 〰 is compatible with all models.". But does this actually work with gpt3.5-turbo or gpt4? I was using guidance before and you only get value when using davinci due to the constraints of chat api based models.

  • sandkoan2y

    This is what we did at Trex (https://github.com/automorphic-ai/trex). The tricky part is doing it quickly and efficiently.

  • 2y
    [deleted]
  • kristjansson2y

    It does seem inapt to claim this “eliminates” hallucinations in your blog post. Sort of like unnamed FP languages claiming to eliminate bugs.

    Both eliminate a subclass of failures, but don’t preclude failure categorically.

    • TeeWEE2y

      As described, it does eliminate non-JSON outputs by masking the tokens while the LLM is generating. It's quite smart if you ask me.

      • kristjansson2y

        It’s very clever. I wouldn’t want it to be oversold.

  • aiunboxed2y

    OpenAI has released this as a feature. Is this news? What am I missing?

  • sberens2y

    What happens if max_tokens cuts the model off from generating valid JSON?

  • taeric2y

    Notable that you can't seem to use this trick to have an LLM create JSON that has JSON embedded in it. Which... happens far more often than it probably should. :(

    • remilouf2y

      You mean nested JSON? It's totally possible.

  • Cholical2y

    This looks great!

  • vlovich1232y

    How is this different from generating such things without an LLM? In other words picking random valid tokens from the grammar via fuzzing or similar techniques.

    • lexandstuff2y

      LLMs allow for building systems that take user requests in text: "book the next flight to Egypt" and convert them into a system message: `{"action": "book_flight", "destination": "Egypt", ... }`

      However, anyone who's tried to build a system like this on GPT or another LLM soon learns that they don't always do as they're told, and it can be hard to get them to reliably return valid JSON or correctly translate instructions. Sometimes they make stuff up that has nothing to do with your system.

      OpenAI has a solution to this with their new function calling API, by introducing models fine-tuned to return JSON, but they still can't make guarantees.

      Outlines seems to be a neat approach to constrain an LLM to return JSON, or any grammar, reliably.
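
      For example, the target schema for that request could be as small as this (names are purely illustrative), and constrained decoding guarantees the output always parses into it:

          from typing import Literal
          from pydantic import BaseModel

          class BookFlight(BaseModel):
              action: Literal["book_flight"]
              destination: str

      Whether the destination is the right one is still up to the model, of course - the guarantee is only about structure.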

      • vlovich1232y

        Why bother with conversion to JSON directly from the LLM instead of a simpler format (eg line separated) that you then convert into JSON normally?

    • vlovich1232y

      Instead of downvoting, I’d appreciate an answer. I’m genuinely curious to learn what the value add of the LLM is.

  • oars2y

    Excited to incorporate this into my developer workflow.

  • 2y
    [deleted]
  • calderwoodra2y

    Have you found a solution to output exceeding the context window? That's been our only issue with generating json output.

    • remilouf2y

      The Finite-State Machine we walk on during the generation process does not suffer from this problem so we can still output correct JSON, if that’s what you’re asking.

  • tantalor2y

    "Generating valid JSON" is not impressive. Here's some valid JSON: []

    The tricky part is generating useful JSON.

    • notpushkin2y

      Generating valid JSON that conforms to a given schema is pretty useful, although not impressive by itself. If the model can deduce field values from schema alone though, I think it's pretty neat.

    • travisjungroth2y

      There are already models generating useful JSON. Sometimes they generate what would be useful JSON, but it’s not valid. This makes sure it’s always valid. It’s an improvement.

    • AtNightWeCode2y

      "" valid!

    • ape42y

      Or JSON that correctly answers what the prompt is asking.

  • rmonvfer2y

    You should probably look into Guidance [1](previously Microsoft Guidance but looks like it’s been separated from their main organization), which is a language for controlling the output of LLMs (so you can, among many other things, output JSON in a deterministic way)

    [1]: https://github.com/guidance-ai/guidance

    • civilitty2y

      From the OP:

      > Our method blows other libraries like Microsoft's guidance out of the water.

      Come on man, it was just a few paragraphs.

  • popinman3222y

    Does this work in tandem with beam search or does it do greedy sampling?

    • btwillard2y

      The underlying approach can improve the performance of anything that requires the set of non-zero probability tokens at each step, and anything that needs to continue matching/parsing from a previous state.

  • haolez2y

    Can I use this locally with models that run on my CPU? Like llama.cpp

    • remilouf2y

      We can add an integration to llama.cpp, please open an issue on the repo if you’re interested!

  • nikcheerla2y

    Does this work with GPT-4?

  • Kiro2y

    Does this mean that I need to call the LLM API once for each token?

    • baobabKoodaa2y

      No. You need to hook into the LLM at a lower level. One API call typically triggers a generation of a sequence of tokens and this library has to poke into things between each generated token.

      • Kiro2y

        Can't I use the max_tokens (set to 1) and logit_bias parameters? Not saying I want to do this. I just want to understand how this works.

        • baobabKoodaa2y

          Not sure exactly what logit_bias is, but after Googling for 5 seconds it seems to be an OpenAI parameter that's not available in HuggingFace transformers?

          Anyway, if your idea is to make one API call per token, the biggest problem with that approach is that it would be really slow to do that.

  • spott2y

    How does this relate to ggml's BNF sampling?

    • remilouf2y

      Two differences:

      (1) This feature only requires regex-guided generation. We have a PR for BNF sampling that is about to be merged. (2) ggml loops over the entire vocabulary (~50k tokens) at each step, which introduces a noticeable overhead and makes it unusable for complex grammars. Our method works by building an index at initialization and building the masks at each step with a dictionary lookup. Once the index is built, generation is just as fast as standard generation. This doesn't depend on the complexity of the grammar, the size of the LLM, or its vocabulary size.
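
      As a toy illustration of the indexing idea (just the shape of it, not our actual code):

          def advance(dfa, state, text):
              """Advance the DFA character by character; return None if it dies."""
              for char in text:
                  state = dfa["transitions"].get((state, char))
                  if state is None:
                      return None
              return state

          def build_index(dfa, vocab):
              # For every FSM state, precompute which tokens keep the FSM alive;
              # generation then needs only a dictionary lookup per step.
              return {
                  state: [tok_id for tok_id, tok in vocab.items()
                          if advance(dfa, state, tok) is not None]
                  for state in dfa["states"]
              }

          # DFA for one or more digits, and a tiny fake vocabulary.
          digit_dfa = {
              "states": {0, 1},
              "transitions": {(s, d): 1 for s in (0, 1) for d in "0123456789"},
          }
          vocab = {0: "12", 1: "7", 2: "foo", 3: "3 "}
          print(build_index(digit_dfa, vocab))  # {0: [0, 1], 1: [0, 1]}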

      • spott2y

        Regex-guided gen is slick… is it arbitrary? Or are you custom building it for json?

        If arbitrary, how are you pre-defining a set of masks? I would expect that splitting an arbitrary regex into a bunch of contexts for a masking dictionary would be non-trivial.

  • SethTro2y

       print(guided)
       # What is the IP address of the Google DNS servers?
       # 2.2.6.1
    
    Correctly formatted wrong answers are still wrong answers.
  • huevosabio2y

    Very cool! How much latency does it add?

    • btwillard2y

      With our indexing approach, it only costs a dictionary lookup to get the next valid tokens during each sampling step, so very little latency.

  • quickthrower22y

    > LLMs can generate valid JSON 100% of the time

    If that seems surprising, it is worth doing a course like Karpathy's Zero to Hero NN series and having all the magic peeled away a layer at a time.

    The reason you can do this is that LLMs don't just generate the next word or token; they produce a probability distribution over all tokens. A JSON parser can give you a list of valid next tokens. The tokens in each case might be from a different set, e.g. the LLM thinks of " The" whereas the JSON parser might think of "{", so you need some conversion there. But if you sample randomly from only the valid tokens, the output must be valid JSON.

    What you can't build a parser for though is ... the truth! You may still be told lies or made up stuff.

    • dwattttt2y

      If you're choosing the next token based on a list of valid next tokens, a uniform random distribution can always generate valid JSON too!

      • quickthrower22y

        Yep. So can this:

            def generate_valid_json(seed):
                return "{}"
      • neoncontrails2y

        But that's not what an LLM does.

        • antonvs2y

          The point is that if you're "choosing the next token based on a list of valid next tokens," it's not surprising that you'll only generate valid output, since absolutely any choice mechanism will suffice.

    • OJFord2y

      Maybe it's just me, but I'm not doing anything that calls itself 'zero to hero'. Would love some good resources (preferably textbook, or at least written) on LLMs though. I don't even understand the link to 'generative' image/video AI, which seems to have exploded at roughly the same time and surely isn't a coincidence.

      I studied a little (literally 'intro to') ML at university, about enough to grok it as an application of stats, tie into things seen elsewhere, but not really more than that.

      Every supposéd tutorial or explainer I've seen posted here or been able to find has been a weird (IMO) mix of simultaneously assuming a decent (at least greater than mine) ML background, but also being really dumbed down: "clone this repo, download that model, switch between them like this, fine-tune them by cd'ing to this directory and ..." Ok, but what's actually going on?

      • frontier2y

        Karpathy's series is many many hours long and really does take you from zero to GPT. It's excellent! You sound triggered by the title - that may not even be the official title - but it definitely deserves it. Go look it up.

        • OJFord2y

          The title suggests I wouldn't like it, yes. But as a video series it's not 'a textbook or at least written', is it - not really the format I'm looking for, personally.

          • thatcherthorn2y

            Truly.. one of the greatest minds in our ML era. Don't get caught up on the format :)

            • OJFord2y

              I just don't find it an effective way of learning personally. I didn't expect this to be so controversial - different people learn differently.

              • quickthrower22y

                I recommend this for groundwork to get you near LLMs, and it covers the journey deeply. I used some of this as a helper course for Karpathy. I learned things here he didn’t cover and vice versa. https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/

                I haven’t done tonnes of courses so there might be better. But this is good as a free one.

          • frontier2y

            Fair enough if you prefer to slog through an entire textbook. But for anyone else.. I can't recommend this series more highly, it was just amazing, no filler, pure step after step, explained methodically to the end goal.

            • quickthrower22y

              Lol I had to slog through some traditional material to keep up with Karpathy. A lot is covered in those videos.

      • pavs2y

        Since you are "judging a book by its cover", or this case a name. This might interest you, that karpathy was co-founding developer of OpenAI, left to work at Tesla to head their AI development for 5-ish years and now back at OpenAI.

        I can understand that you might be interested in book form only; I was like this for the longest time, until I bumped into some really high-quality video series that changed my mind to be a bit more flexible.

        Also important to note that LLM development is moving at a very fast pace right now, so book form might not be ideal. The basic ideas might stay the same, but most of it might be out of date 6-12 months from now. I don't see how anyone could write a quality book that covers everything on this.

        • OJFord2y

          I realise that; also that it doesn't help that afaiui it's been more industry-led than academia.

          But I truly am starting from pretty much 'zero', and maybe I wasn't clear but I'm not looking to be 'hero' in the sense of up to date with the cutting edge, or even necessarily putting anything in to practice at all, I'm more interested in the background theory, and fine with that missing the absolute latest extra technique, just want to understand the meat of it better.

          A refresher on SVMs & PCA (which I barely remember - I think I could convincingly explain SVMs to someone numerate but non-tech/mathematician, but not otherwise) and then a catch up to roughly what's going on with LLMs & image/video as mentioned would be great.

          > I can understand that you might be interested in book form only, I was lile this for the longest time, until I bumped into some really high quality video series that changed my mind to be a bit more flexible.

          I enjoy videos for many things, but mostly entertainment. I don't personally find I can learn that well from them, especially more technical/theoretical stuff, due to some combination of screen fatigue, it being harder to skip around and reference something, and distraction - something seems obvious briefly so my mind wanders, I check something in another tab 'just quickly', and before you know it ten minutes have passed, I've been hearing the speaking but suddenly realise I haven't been listening and have no idea what's going on any more.

          Obviously they work for some people, that's fine.

        • quickthrower22y

          Karpathy has PyTorch's flash attention in his repo. I understand that is fairly recent (in human weeks, maybe not AI dog weeks).

      • antonvs2y

        > I'm not doing anything that calls itself 'zero to hero'.

        Sounds like you have a case of the Mondays. You just need to turn that frown upside down!

        • OJFord2y

          Must have left my flair at home.

      • fenomas2y

        The zero to hero video series is what you're looking for - look past the name and watch it. It's excellent.

    • RyEgswuCsn2y

      How does the LLM know what valid JSON tokens are?

      What if the training data contains malformed JSON? There ought to be a non-zero chance of the LLM producing invalid JSON, no?

      • MereInterest2y

        LLMs work by outputting a value for each token, then using those values to generate a probability distribution. Usually, this will be through a function like softmax [0], but there's nothing preventing you from doing some post-processing first. That post processing could be aware of the tokens that would be valid as the next token in a JSON format, and set the probabilities of all other tokens to zero. That way, even if the training data contains malformed JSON, the generator is still constrained to produce valid JSON.

        [0] https://en.wikipedia.org/wiki/Softmax_function

        • RyEgswuCsn2y

          But that is not the LLM learning to produce valid JSON - as some other commenters mentioned, you can get valid JSON without the LLM.

          Sure it could be useful, but not really impressive.

      • catlifeonmars2y

        I think the idea is that it’s easy to filter the result set to restrict to just valid JSON.

    • 2y
      [deleted]
      • 2y
        [deleted]
    • mattigames2y

      It's not like humans are particularly good at distinguishing truth from lies.

      • quickthrower22y

        The word "lie" is probably too anthropic here. I should have just said "made up". There is no intent to lie. And the model isn't try to self-fact-check anyway. (Maybe some do). But if they do they are probably bad at it at the moment, at least from my experience of GPT3.5 (not used 4 much).

        • coder5432y

          > at least from my experience of GPT3.5 (not used 4 much).

          And 4 is tremendously better than 3.5, in my own experience. Not perfect, but actually useful.

          • quickthrower22y

            Can anyone recommend a good, and trusted UI so I can use it via the API? I don't want to pay monthly for it, but would be nice to use occasionally. I keep meaning to do this!

            • selcuka2y

              OpenAI has its own playground where you can test all models (I believe GPT-4 is not available to everyone yet):

              https://platform.openai.com/playground

              Monthly subscription is only for ChatGPT. When you use the APIs you pay per token.

              • coder5432y

                > I believe GPT-4 is not available to everyone yet

                I still don't have access, except through the regular ChatGPT interface, which is mildly annoying. It would be interesting to experiment with the API.

      • dilawar2y

        "It's human nature to mislead others, sometimes knowingly." I read this line in an anthropology book. A similarly non-cynical approach towards your fellow is "trust but verify".

  • malft2y

    Regex-constrained GPT, what is a mnemonic for pi?

    > It's a word, a short statement or phrase which you learn.

    Can you make a good one?

    > Man, I wish I could recommend an answer. You're not gonna remember something, because, obviously, pi's so big. Actually, let's forget pi. There's only one way: Googling for it.

    (count the letters)

  • rckrd2y

    I also released a hosted version of my open-source libraries ReLLM and ParserLLM that already supports APIs for

    * Regex completion for LLMs

    * Context-free Grammar completion for LLMs

    https://thiggle.com/

    [0] https://github.com/r2d4/rellm

    [1] https://github.com/r2d4/parserllm

    [2] https://github.com/thiggle/api

    There's also another API on Thiggle that I've built that supports classification via a similar logit-based strategy.

  • lefttoreader2y

    The “trick” seems to blatantly rip off FlashText without citing it?

    https://arxiv.org/pdf/1711.00046.pdf

    I’m a fan of the approach. I normally wouldn’t care if this was just another LLM library taking inspiration, but if you’re going to go out of your way to put a paper on the ArXiv, feels like doing a literature review is a good step?

    • verdverm2y

      Care to explain how a string replacement algorithm relates to nudging the logits of a ML model?

      I don't see the "rip off": the paper you cite requires a complete document to work on, while this work is for guiding the generation of tokens.

      • bhickey2y

        Both papers use the phrase "regular expressions" and there the resemblance ends. The linked manuscript uses regular expressions to realize a grammar and then memoizes logit masks. I want to know why FlashText failed to cite:

        Baeza-Yates, Ricardo A., and Gaston H. Gonnet. "Fast text searching for regular expressions or automaton searching on tries." Journal of the ACM (JACM) 43.6 (1996): 915-936.

        Eltabakh, Mohamed Y., Ramy Eltarras, and Walid G. Aref. "To trie or not to trie? realizing space-partitioning trees inside postgresql: Challenges, experiences and performance." (2005).

        Zhang, Yijun, and Lizhen Xu. "An algorithm for url routing based on trie structure." 2015 12th Web Information System and Application Conference (WISA). IEEE, 2015.

        • lefttoreader2y

          Your comment here doesn’t feel like it’s in good faith, but there’s a good chance I’m misreading it.

          • bhickey2y

            I'm serious that the similarities between the papers are superficial.

            I don't think it's fair of you to criticize the authors for not citing some obscure preprint, when that manuscript itself neglected to cite decades of prior, relevant work.

            • lefttoreader2y

              I have some other comment on this thread where I point out why I don’t think it’s superficial. Would love to get your feedback on that if you feel like spending more time on this thread.

              But it’s not obscure? FlashText was a somewhat popular paper at the time (2017) with a popular repo (https://github.com/vi3k6i5/flashtext). Their paper was pretty derivative of Aho-Corasick, which they cited. If you think they genuinely fucked up, leave an issue on their repo (I’m, maybe to your surprise lol, not the author).

              Anyway, I’m not a fan of the whataboutery here. I don’t think OG’s paper is up to snuff on its lit review - do you?

              • bhickey2y

                > I don’t think OG’s paper is up to snuff on its lit review - do you?

                Not in the slightest. Caching the logit masks and applying the right one based on where you are in your grammar is obvious. This is what I'd expect some bright undergrads to come up with for a class project. This manuscript could've been a blog post.

                Although arXiv is displacing some traditional publishing, I think it's a little silly to try to hold it to the same standards.

                I saw your argument for why you think it's relevant and I think you're overstating the case. There are a _heap_ of papers they could've cited.

                As an aside, when can we stop citing _Attention is All You Need_?

      • lefttoreader2y

        Sure! So it’s hopefully clear that the notion of constrained grammar is not novel (see every comment on here of people name-dropping their implementation from two months ago).

        The novelty here is that, “instead of checking whether every token is allowed” at each step, they create a finite state machine that defines which tokens are allowable at each generation step. This lets them avoid checking every token at every step.

        The trick of creating an FSM to efficiently check next-token grammar is what allowed FlashText to run circles around standard regex stuff. Even the FlashText guy acknowledged the shoulders he stood on, etc.
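
        As a toy sketch of that idea (made-up states and token ids, nothing from the paper's actual code): once the masks have been built ahead of time by walking the FSM over the vocabulary, each generation step reduces to a dictionary lookup.

            # Hypothetical table built once, offline, by walking the FSM against the vocabulary.
            allowed_tokens = {
                "start":     [12, 57],       # e.g. tokens that can open a JSON object
                "in_string": [88, 101, 7],   # tokens that extend or close a string
                "done":      [2],            # end-of-sequence
            }

            def tokens_for(state):
                # O(1) per generation step: no scan over the whole vocabulary.
                return allowed_tokens[state]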

        Let’s be super clear here, none of these standards apply when you’re building good ole libraries. But putting out a paper really elevates what you’re on the hook for. Most folks that write papers are dying to acknowledge the shoulders they stand on - it’s part of the toxic humility we all engage in.

        Again - shill OSS all day - I’ll upvote it.

        • _flux2y

          By "standard regex" stuff I take it you mean the regex engine the Python standard library comes with?

          I mean, going from a standard regex to an NFA to a DFA is already more sophisticated than that one; it’s _quite_ oldschool and gives you linear-time matching: https://en.wikipedia.org/wiki/Thompson%27s_construction https://en.wikipedia.org/wiki/Powerset_construction
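
          For instance, a hand-written DFA for the regex "ab*c" (just a toy, not the library's internals) matches in a single left-to-right pass:

              # Equivalent to what Thompson + powerset construction would build automatically.
              dfa = {
                  0: {"a": 1},
                  1: {"b": 1, "c": 2},
              }
              accepting = {2}

              def matches(s):
                  state = 0
                  for ch in s:              # one transition per character: O(len(s))
                      state = dfa.get(state, {}).get(ch)
                      if state is None:
                          return False
                  return state in accepting

              print(matches("abbbc"))  # True
              print(matches("abx"))    # False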

          And what I mean to say by this is that they could easily have had this idea and never have discovered the whitepaper you referenced.

          • lefttoreader2y

            Yep! But that’s sort of my point, and maybe this is just some misplaced academic shit of mine but if you’re going to write a paper then “easily had this idea and never discovered the paper” just doesn’t fly.

            Almost all academic work is derivative tweaks of yesterday’s work, yet we still fall over ourselves to cite this stuff.

  • faangiq2y

    Is generating valid JSON nontrivial?

  • dvt2y

    [flagged]

    • kristjansson2y

      I think you might be over-simplifying. This (and llama.cpp's grammar-based sampling, which this is moving towards[1]) doesn't say "no, not like that, give me another token". It excludes impossible tokens at each step, but otherwise samples like normal.

      Is this a revolutionary trick? Not really, since llama.cpp and guidance, and probably others have already done it. But it's a good trick, and hopefully one of many to justify the valuation :).

      [1]: https://github.com/normal-computing/outlines/pull/178

    • remilouf2y

      I’m sorry that our software made you so angry. It was a side project led by two people independently from the rest of the company.

    • swyx2y

      > Imagine thinking that adding regex on top of an LLM is worth $8.5M

      you should be downvoted for being this reductionist and uncharitable. this is a side project of a larger company effort.