33 comments
  • canyon289 7d

    Hey folks, I'm on the Gemma team. We released new model(s) just recently, and I saw many questions here about function calling, so we just published docs to detail this more. In short, Gemma 3's prompted instruction following is quite good for the larger models, and that's how you use the feature.
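
    To make that concrete, here's a minimal sketch of the prompted pattern. Nothing below is the exact format from the docs, and generate() is a stand-in for whatever inference call you use (AI Studio, Ollama, transformers, etc.):

        import json

        # Hypothetical tool schema and prompt wording; the exact format in the docs may differ.
        tools = [{
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }]

        prompt = (
            "You have access to these functions:\n"
            + json.dumps(tools, indent=2)
            + "\n\nIf a function is needed, reply with only a JSON object of the form "
            + '{"name": ..., "arguments": {...}}.\n\n'
            + "User: What is the weather in Paris?"
        )

        reply = generate(prompt)   # generate() is a placeholder for your inference call
        call = json.loads(reply)   # e.g. {"name": "get_weather", "arguments": {"city": "Paris"}}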

    You don't need to take our word for it! We were waiting for external and independent validation from the Berkeley team, and they just published their results. You can use their metrics to get a rough sense of performance, and of course try it out yourself in AI Studio or locally with your own prompts.

    https://gorilla.cs.berkeley.edu/leaderboard.html

    Hope you all enjoy the models!

    • brianjking 6d

      Thanks! Gemma is fantastic, and it's great that it supports function calling.

    • chadash 4d

      So if I'm reading this correctly, it's essentially prompt engineering and there's no guarantee on the output. Why not enforce a guaranteed output structure by restricting the allowed logits at each step (e.g. what the outlines library does)?

      • canyon289 4d

        So, in short, there's no guarantee on the output of any LLM, whether it's Gemma or any other (ignoring details like setting a random seed or temperature to 0). As you mentioned, though, libraries like outlines can constrain the output, and hosted models often already include this in their API, but they can do so because what's being served is a model plus some server-side code.

        With Gemma, or any open model, you can use open libraries alongside the model to get what you want. Some inference frameworks, like Ollama, include structured output as part of their functionality.

        But you already mentioned all of this in your question, so I feel like I'm missing something. Let me know!
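
        If it helps, here's a toy sketch of what "restricting the allowed logits" means mechanically. model_logits_fn, tokenizer, and is_valid_prefix are stand-ins, and real libraries like outlines do this far more efficiently with a compiled grammar rather than a brute-force loop:

            import math

            def constrained_decode(model_logits_fn, tokenizer, is_valid_prefix, max_tokens=64):
                # Greedy decoding where any token that would break the target format is skipped.
                text = ""
                for _ in range(max_tokens):
                    logits = model_logits_fn(text)  # one score per vocabulary token
                    best_score, best_piece = -math.inf, None
                    for token_id, score in enumerate(logits):
                        piece = tokenizer.decode([token_id])
                        if score > best_score and is_valid_prefix(text + piece):
                            best_score, best_piece = score, piece
                    if best_piece is None:  # nothing legal left to emit
                        break
                    text += best_piece
                return text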

        • programmarchy 3d

          With OpenAI models, my understanding is that the token output is restricted so that each next token must conform to the specified grammar (i.e. a JSON schema), so you're guaranteed to get either a function call or an error.

          Edit: per simonw’s sibling comment, ollama also has this feature.

          • canyon289 3d

            Ah, there's a distinction here between the model and the inference framework. The Ollama inference framework supports token output restriction. Gemma in AI Studio also does, as does Gemini (there's a toggle in the right-hand panel), but that's because both of those models are being served through an API where the functionality lives on the server.

            The Gemma model by itself does not, though, nor does any "raw" model, but many open libraries exist that you can plug into whatever local framework you decide to use.

      • simonw 4d

        If you run Gemma via Ollama (as recommended in the Gemma docs) you get exactly that feature, because Ollama provides that for any model that they run for you: https://ollama.com/blog/structured-outputs

        Under the hood, it is using the llama.cpp grammars mechanism that restricts allowed logits at each step, similar to Outlines.
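
        Roughly, following the pattern in that blog post (the model tag and schema here are made up, so treat this as a sketch):

            from ollama import chat
            from pydantic import BaseModel

            class WeatherReport(BaseModel):
                city: str
                temperature_c: float
                conditions: str

            response = chat(
                model="gemma3",  # whichever Gemma 3 tag you've pulled
                messages=[{"role": "user", "content": "Describe the current weather in Paris."}],
                format=WeatherReport.model_json_schema(),  # Ollama constrains decoding to this schema
            )
            print(WeatherReport.model_validate_json(response.message.content))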

        • refulgentis 3d

          I've been working on tool calling in llama.cpp for Phi-4 and have a client that can switch between local and remote models for agentic work/search/etc., so I learned a lot about this situation recently:

          - We can constrain the output with a JSON grammar (old-school llama.cpp).

          - We can format the inputs to make sure they match the model's expected format.

          - Combining both of these is what llama.cpp does, via @ochafik's work in, inter alia, https://github.com/ggml-org/llama.cpp/pull/9639.

          - Ollama isn't plugged into this system AFAIK.

          To OP's question, specifying the format unlocks the training the model specifically had on function calling: what I sometimes call an "agentic loop", i.e. we're dramatically increasing the odds that we're singing the right tune for the model to do the right thing in this situation.
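
          For the curious, here's roughly what the grammar side looks like through llama-cpp-python; this is a sketch, the GGUF path is made up, and the grammar is far cruder than what the JSON-schema machinery generates:

              from llama_cpp import Llama, LlamaGrammar

              # Crude GBNF grammar that only admits {"name": "...", "arguments": {...}} objects.
              grammar = LlamaGrammar.from_string(r'''
              root   ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"arguments\"" ws ":" ws object ws "}"
              object ::= "{" ws (string ws ":" ws string (ws "," ws string ws ":" ws string)*)? ws "}"
              string ::= "\"" [^"]* "\""
              ws     ::= [ \t\n]*
              ''')

              llm = Llama(model_path="gemma-3-4b-it-Q4_K_M.gguf")  # hypothetical local GGUF path
              out = llm(
                  "Call the weather tool for Paris. Reply with a JSON function call only.\n",
                  grammar=grammar,
                  max_tokens=128,
              )
              print(out["choices"][0]["text"])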

          • anon373839 3d

            Do you have thoughts on the code-style agents recommended by Hugging Face? The pitch for them is compelling, since structuring complex tasks as code is something very natural for LLMs. But then, I don't see as much about this approach outside of HF.

    • jampekka 4d

      Is the format used in the examples the same one that's used in the function-calling instruction training, i.e. should it be the optimal prompt for function calling?

      I find it a bit frustrating when details of the training are not known and one has to guess what kinds of prompts the model has been tuned with.

      • canyon289 3d

        We feel this model excels at instructability, which is why we're recommending bringing your own prompt! Benchmark-wise, you can see this performance from BFCL directly: they (independently) ran their eval using their own prompted format, and the larger Gemma models performed quite well, if you ask me.

        Specifically, though, I want to thank you for leaving a comment. We're reading all this feedback, and it's informing what we can do next to reduce frustration and create the best model experience for the community.

        • jampekka 3d

          Do you mean that the exact prompt for tool use shouldn't matter? Has this been tested? Is the tool use trained with a variety of prompt styles?

          I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.

          • canyon289 3d

            Ah, I see where the confusion might come from.

            I don't mean that the exact prompt shouldn't matter, but I am saying that we noticed this series of models picked up on the tool-call format quite readily in our various tests, which is what we express in the docs. We tested internally, and I hope the independent BFCL results speak for themselves! All their code and evals are fully public.

            > I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.

            This is absolutely true. I showed this in a tutorial last year where Gemma 2 is finetuned for a specific format, and with some targeted SFT it produces JSON output more readily. https://www.youtube.com/watch?v=YxhzozLH1Dk

            So this is all to say, Gemma is designed to be a great model for multiple types of users. If you want to use the "out of the box" weights with your own format, go ahead! We hope that makes it easier to integrate with whatever tooling you're using with minimal headache.

            If you need specific performance on your bespoke format, finetune the model to make it your own! Finetuning is supported across many frameworks, so pick whichever library you like best.
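
            Purely as an illustration (the tags and field names below are invented, not anything Gemma was trained on), the SFT pairs for a bespoke format might look like:

                # Purely illustrative SFT pairs for a bespoke tool-call format; names are made up.
                train_examples = [
                    {
                        "prompt": "User: Turn the living room lights off.",
                        "completion": '<tool_call>{"name": "set_lights", "arguments": {"room": "living_room", "state": "off"}}</tool_call>',
                    },
                    {
                        "prompt": "User: What is 12% of 340?",
                        "completion": '<tool_call>{"name": "calculator", "arguments": {"expression": "0.12 * 340"}}</tool_call>',
                    },
                ]
                # Feed pairs like these to the finetuning framework of your choice (TRL, Axolotl, Keras, ...).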

            All of which is to say: we hope Gemma is flexible and usable for folks like yourself along a variety of dimensions. For myself, I'm learning that there's big interest in a specific prompt format. Again, I can't thank you enough for the feedback here.

        • troupo 3d

          > We feel this model excels at instructability which is why we're recommending bringing your own prompt!

          Sigh. Taps the sign:

          --- start quote ---

          To put it succinctly, prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:

          - training set

          - weights

          - constraints on the model

          - layers between you and the model that transform both your input and the model's output, which can change at any time

          - availability of compute for your specific query

          - and definitely some more details I haven't thought of

          "Prompt engineers" will tell you that some specific ways of prompting some specific models will result in a "better result"... without any criteria for what a "better result" might signify.

          https://dmitriid.com/prompting-llms-is-not-engineering

          --- end quote ---

          • canyon289 3d

            With open models this isn't as true: the weights are local, you bring your own compute, and there's nothing between you and the model. Regarding what a better result is, personally I encourage you to define what a better result means for you in an eval set and then optimize against that. I agree that having no criteria is not a great situation to be in.
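
            As a sketch of that workflow (evals.jsonl and call_model() are placeholders for your own data and inference call):

                import json

                def score(expected, actual):
                    # Swap in whatever "better" means for your app (exact match, a judge model, a regex, ...).
                    return float(expected.strip() == actual.strip())

                cases = [json.loads(line) for line in open("evals.jsonl")]  # [{"prompt": ..., "expected": ...}, ...]
                results = [score(c["expected"], call_model(c["prompt"])) for c in cases]
                print(f"pass rate: {sum(results) / len(results):.1%}")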

            This was the main point of a tutorial I did about a month ago, showing how to make a simple AI app using Gemma, though the principles hold for any LLM.

            https://www.youtube.com/live/9zM_93mYdu8

            Hope this helps!

            • troupo 3d

              Even then, the "You MUST" and "You SHOULD NOT" directives are just magical incantations that may (and will) randomly fail.

          • nyrikki 3d

            In the field of AI, wishful mnemonics are the rule, so arguments about prescription are of limited value.

            CoT scratch space extends LLMs from DLOGTIME-uniform TC0 to PTIME, given polynomially sized scratch space.

            https://arxiv.org/abs/2502.02393

            Yes, prompt engineering is probably better thought of as prompt augmentation or steering.

            But the system identification problem and Rice's theorem rigorously debunk the above link's core claims.

            It is a craft that can improve domain specificity and usefulness.

            All models are wrong (even formalized engineering ones), but some are useful.

            The price one has to pay when resorting to what is fundamentally compression, as PAC learning is, is that it is fundamentally unstable under perturbations.

            You are basically searching through a haystack with a magnet, and making sure that at least one of the needles you find is the correct one is a semantic property. Guiding the approximate retrieval process to improve your results will always be a craft.

            The snake oil is mostly on the side that claims that unrestricted natural language is a possibility. We still only have NLP, and true human-level NLU is still thought to be beyond the limits of computation, IMHO.

            Thus prompt augmentation is a consequence of the argument that link was trying to make.

    • 3d
      [deleted]
    • attentive 4d

      ToolACE-2-8B and watt-tool-8B have impressive scores for their size on that leaderboard.

    • 42lux 3d

      Don’t wanna be that guy but you guys have too many models. I love Gemini and Gemma but it’s way too crowded atm.

  • minimaxir 4d

    The example of function calling/structured output here is the cleanest example of how it works behind the scenes, incorporating prompt engineering and a JSON schema.

    With the advent of agents/MCP, the low level workflow has only become more confusing.

    • canyon289 3d

      This is me speculating along with you, so don't take this as fact, but my sense is that the LLM tool stack is getting "layerized", like network layer architectures.

      Right now the space is moving fast, so new concepts are being introduced all the time and the ecosystem hasn't settled.

      https://en.wikipedia.org/wiki/OSI_model#Layer_architecture

      But as with all other things in computing (shells, terminals, GUIs, etc.), we're getting there. Just faster than ever.

      • lioeters 3d

        That's insightful. Thank you for sharing your work and the patient responses to everyone's questions.

        Yesterday I started exploring a smaller Gemma3 model locally with Ollama, and it's clearly a level up from the previous model I was using (Llama3) in terms of instruction comprehension and the sophistication of responses. It's faster, smaller, and smarter.

        I very much appreciate how such innovative technology is available for non-experts to benefit from and participate in. I think one of the best things about the emergence and evolution of LLMs is the power of open source, open standards, and the ideal of democratizing artificial intelligence and access to it. The age-old dream of machines augmenting the human intellect (Vannevar Bush, Doug Engelbart, et al.) is being realized in a surprising way, and seeing the foundational layers being developed in real time is wonderful.

        • canyon289 3d

          Of course! Glad you found models that work well for you, and we're all learning together. Even on the "expert side" we're learning from what folks like yourself are doing and taking notes so we can shape these models to be better for you all.

  • nurettin 3d

    So it's just a prompt? Well then you can do function calling with pretty much any model from this quarter.

  • zellyn 4d

    Am I getting slightly different use-cases mixed up, or would it be better if everything just spoke MCP?

    • PufPufPuf 4d

      MCP is the wire protocol; it doesn't say anything about how the LLM's output is structured and parsed.

    • simonw 4d

      You need function calling support in the models in order to layer MCP over the top of them.

  • mentalgear 4d

    Great, your work on open-source SLMs is much appreciated! (btw: it seems like the Google page does not respect the device theme "auto" setting)

    • canyon289 3d

      Thank you! Community vibes motivate us to code up more for you all. Really appreciate the note.

      Regarding the device theme in the browser, I'll ask some folks what's going on there.

  • sunrabbit 3d

    It's honestly frightening to see how fast it's evolving. It hasn't even been that many years since GPT was first released.

  • behnamoh 4d

    I'm glad this exists. It ruins the day for Trelis who took the open-source and free Llama and made it commercial by giving it function calling abilities: https://huggingface.co/Trelis/Meta-Llama-3-70B-Instruct-func...