Hey folks, I'm on the Gemma team. We released new models just recently, and I saw many questions here about function calling, so we just published docs detailing this more. In short, Gemma 3's prompted instruction following is quite good for the larger models, and that prompted approach is how you use the feature.
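As a rough sketch of what that can look like (the function, schema, and output format below are just illustrative, not an official spec):

```python
# Minimal sketch of prompted function calling: describe the tools in the
# prompt and ask the model to reply with a JSON call. Everything here
# (function name, schema, output format) is illustrative, not an official spec.
PROMPT = """You have access to the following function:

get_weather(city: str) -> str  # returns the current weather for a city

If the user's request needs this function, reply with ONLY a JSON object:
{"name": "<function name>", "arguments": {...}}

User: What's the weather like in Paris right now?
"""
# Send PROMPT to Gemma 3 (AI Studio or a local runtime), then parse the
# returned JSON and execute the matching function yourself.
```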
You don't need to take our word for it! We were waiting for external, independent validation from the Berkeley team, and they just published their results. You can use their metrics to get a rough sense of performance, and of course try it out yourself in AI Studio or locally with your own prompts.
https://gorilla.cs.berkeley.edu/leaderboard.html
Hope you all enjoy the models!
Thanks! Gemma is fantastic, and it's great that it supports function calling.
So if I'm reading this correctly, it's essentially prompt engineering here and there's no guarantee on the output. Why not enforce a guaranteed output structure by restricting the allowed logits at each step (e.g. what the outlines library does)?
So in short, there's no guarantee on the output of any LLM, whether it's Gemma or any other (ignoring details like setting a random seed or temperature to 0). Like you mentioned, though, libraries like outlines can constrain the output. Hosted models often already include this in their API, but they can do so because what's served is a model plus some server-side code.
With Gemma, or any open model, you can pair the model with those open libraries to get what you want. Some inference frameworks, like Ollama, include structured output as part of their functionality.
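As a rough sketch of the library route (outlines' API has shifted between versions, and the tool-call schema here is made up):

```python
from pydantic import BaseModel
import outlines

# Made-up schema for the tool call we want the model to emit.
class WeatherArgs(BaseModel):
    city: str

class ToolCall(BaseModel):
    name: str
    arguments: WeatherArgs

# Constrained decoding: at each step, only tokens that keep the output a
# valid ToolCall JSON object are allowed. (Pre-1.0 outlines API shown.)
model = outlines.models.transformers("google/gemma-3-1b-it")
generate = outlines.generate.json(model, ToolCall)

call = generate("Call a function to get the weather in Paris.")
print(call)  # ToolCall(name=..., arguments=WeatherArgs(city=...))
```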
But you mentioned all of this already in your question, so I feel like I'm missing something. Let me know!
With OpenAI models, my understanding is that token output is restricted so that each next token must conform to the specified grammar (i.e. a JSON schema), so you're guaranteed to get either a function call or an error.
Edit: per simonw’s sibling comment, ollama also has this feature.
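To illustrate the OpenAI side (from memory, so exact field shapes may differ slightly):

```python
from openai import OpenAI

client = OpenAI()

# Ask for output constrained to a JSON schema; the server-side decoder only
# allows tokens that keep the output valid against this schema.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "tool_call",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "arguments": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                        "additionalProperties": False,
                    },
                },
                "required": ["name", "arguments"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```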
Ah, there's a distinction here between the model and the inference framework. The Ollama inference framework supports token output restriction. Gemma in AI Studio also does, as does Gemini (there's a toggle in the right-hand panel), but that's because both of those models are served behind an API where the functionality lives on the server.
The Gemma model by itself does not though, nor does any "raw" model, but many open libraries exist for you to plug into whatever local framework you decide to use.
If you run Gemma via Ollama (as recommended in the Gemma docs) you get exactly that feature, because Ollama provides that for any model that they run for you: https://ollama.com/blog/structured-outputs
Under the hood, it is using the llama.cpp grammars mechanism that restricts allowed logits at each step, similar to Outlines.
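For example, roughly following the Ollama structured outputs docs (the tool-call schema here is made up):

```python
from ollama import chat
from pydantic import BaseModel

# Made-up schema for the tool call we want the model to emit.
class ToolCall(BaseModel):
    name: str
    arguments: dict

# Ollama's `format` parameter takes a JSON schema; under the hood the
# llama.cpp grammar machinery restricts the allowed tokens at each step.
response = chat(
    model="gemma3",
    messages=[{"role": "user",
               "content": "Call a function to get the weather in Paris."}],
    format=ToolCall.model_json_schema(),
)
print(ToolCall.model_validate_json(response.message.content))
```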
I've been working on tool calling in llama.cpp for Phi-4 and have a client that can switch between local models and remote ones for agentic work/search/etc., so I learned a lot about this situation recently:
- We can constrain the output with a JSON grammar (old-school llama.cpp).
- We can format inputs to make sure they match the model's expected format.
- Both of these combined is what llama.cpp does, via @ochafik's work in, inter alia, https://github.com/ggml-org/llama.cpp/pull/9639.
- ollama isn't plugged into this system AFAIK
To OP's question, specifying a format the model was trained on unlocks the training the model specifically had on function calling, in what I sometimes call an "agentic loop": we're dramatically increasing the odds we're singing the right tune for the model to do the right thing in this situation.
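A minimal sketch of the grammar-constrained side with llama-cpp-python (toy grammar, placeholder model path):

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar forcing output of the shape {"name":"...","arguments":{...}}.
# Real grammars (e.g. llama.cpp's json.gbnf) are much more thorough.
GRAMMAR = r'''
root ::= "{\"name\":\"" [a-z_]+ "\",\"arguments\":{" [^{}]* "}}"
'''

llm = Llama(model_path="gemma-3-12b-it-Q4_K_M.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GRAMMAR)

out = llm(
    "Call a function to get the weather in Paris.\n",
    grammar=grammar,   # restricts the allowed logits at each decoding step
    max_tokens=128,
)
print(out["choices"][0]["text"])
```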
Do you have thoughts on the code-style agents recommended by huggingface? The pitch for them is compelling, since structuring complex tasks in code is something very natural for LLMs. But then, I don’t see as much about this approach outside of HF.
Is the format used in the examples the same one used in the function-calling instruction training, i.e. should it be the optimal prompt for function calling?
I find it a bit frustrating when details of the training are not known and one has to guess what kinds of prompts the model has been tuned with.
We feel this model excels at instructability, which is why we're recommending bringing your own prompt! Benchmark-wise you can see this performance from BFCL directly: they (independently) ran their eval using their own prompted format, and the larger Gemma models performed quite well, if you ask me.
Specifically though, I want to thank you for leaving a comment. We're reading all this feedback, and it's informing what we can do next to reduce frustration and create the best model experience for the community.
Do you mean that the exact prompt for tool use shouldn't matter? Has this been tested? Is the tool use trained with a variety of prompt styles?
I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.
Ah, I see where the confusion might come from.
I don't mean the exact prompt shouldn't matter, but I am saying that we noticed this series of models picked up on tool-call formats quite readily in our various tests, which is what we express in the docs. We tested internally, and I hope the independent BFCL results speak for themselves! All their code and evals are fully public.
> I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.
This is absolutely true. I showed this in a tutorial last year where Gemma 2 is finetuned for a specific format, and with some targeted SFT it produces JSON output more readily: https://www.youtube.com/watch?v=YxhzozLH1Dk
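The rough shape of that kind of targeted SFT, if you want to roll your own (exact trl arguments vary across versions, and the dataset here is a toy stand-in):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy stand-in dataset: prompts paired with the exact JSON tool-call format
# we want the model to produce. A real run needs far more examples.
examples = [
    {"text": 'User: weather in Paris?\nModel: {"name": "get_weather", "arguments": {"city": "Paris"}}'},
    {"text": 'User: convert 3 EUR to USD\nModel: {"name": "convert_currency", "arguments": {"amount": 3, "from": "EUR", "to": "USD"}}'},
]
dataset = Dataset.from_list(examples)

# Targeted SFT: a short finetune that nudges the model toward one specific
# output format.
trainer = SFTTrainer(
    model="google/gemma-2-2b-it",
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemma-tool-call-sft"),
)
trainer.train()
```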
So this is all to say, Gemma is designed to be a great model for multiple types of users. If you want to use the "out of the box" weights with your own format, go ahead! We hope that makes it easier to integrate with whatever tooling you're using with minimal headache.
If you need specific performance on your bespoke format, finetune the model to make it your own! Finetuning is supported across many frameworks, so pick whatever library you like best.
All this to say: we hope Gemma is flexible and usable for folks like yourself along a variety of dimensions. For my part, I'm learning there's big interest in a specific prompt format. Again, I can't thank you enough for the feedback here.
> We feel this model excels at instructability which is why we're recommending bringing your own prompt!
Sigh. Taps the sign:
--- start quote ---
To put it succinctly, prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:
- training set
- weights
- constraints on the model
- layers between you and the model that transform both your input and the model's output, and that can change at any time
- availability of compute for your specific query
- and definitely some more details I haven't thought of
"Prompt engineers" will tell you that some specific ways of prompting some specific models will result in a "better result"... without any criteria for what a "better result" might signify.
https://dmitriid.com/prompting-llms-is-not-engineering
--- end quote ---
With open models this isn't as true. The weights are local, you bring your own compute, and there's nothing between you and the model. Regarding what a "better result" is, I personally encourage you to define it in an evalset and then optimize against that. Agreed that having no criteria is not a great situation to be in.
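A toy version of that evalset idea (call_model is a placeholder for however you invoke the model, and the cases are made up):

```python
import json

# Placeholder: swap in however you actually call the model (Ollama, AI Studio, ...).
def call_model(prompt: str) -> str:
    raise NotImplementedError

# Made-up evalset: user request -> the tool call we expect back.
EVALSET = [
    {"question": "What's the weather in Paris?",
     "expected": {"name": "get_weather", "arguments": {"city": "Paris"}}},
    {"question": "Convert 3 EUR to USD.",
     "expected": {"name": "convert_currency",
                  "arguments": {"amount": 3, "from": "EUR", "to": "USD"}}},
]

def score(prompt_template: str) -> float:
    """Fraction of eval cases where the model returns the expected JSON call."""
    hits = 0
    for case in EVALSET:
        raw = call_model(prompt_template.format(question=case["question"]))
        try:
            hits += json.loads(raw) == case["expected"]
        except json.JSONDecodeError:
            pass  # malformed JSON counts as a miss
    return hits / len(EVALSET)

# Try a few prompt variants and keep whichever scores best on YOUR cases.
```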
This was the main point of a tutorial I did about a month ago showing how to make a simple AI app using Gemma, though the principles hold for any LLM.
https://www.youtube.com/live/9zM_93mYdu8
Hope this helps!
Even then the "You MUST" and "You SHOULD NOT" are just magical incantations that may (and will) randomly fail.
In the field of AI, wishful mnemonics are the rule, so arguments about prescription are of limited value.
CoT scratch space extends LLMs from DLOGTIME-uniform TC^0 to PTIME, given a polynomial-sized scratch space.
https://arxiv.org/abs/2502.02393
Yes, prompt engineering is probably better thought of as prompt augmentation or steering.
But the system identification problem and Rice's theorem rigorously debunk the above link's core claims.
It is a craft that can improve domain specificity and usefulness.
All models are wrong (even formalized engineering ones), but some are useful.
The price one has to pay when resorting to what is fundamentally compression, as PAC learning is, is that it is fundamentally unstable under perturbations.
You are basically searching through a haystack with a magnet, and making sure that at least one of the needles you find is the correct one is a semantic property. Guiding the approximate retrieval process to improve your results will always be a craft.
The snake oil is mostly on the side that claims that unrestricted natural language is a possibility. We still only have NLP, and true human level NLU is still thought to be beyond the limits of computation IMHO.
Thus prompt augmentation is a consequence of the argument that link was trying to make.
ToolACE-2-8B and watt-tool-8B have impressive scores for their size on that leaderboard.
Don’t wanna be that guy but you guys have too many models. I love Gemini and Gemma but it’s way too crowded atm.