Hey folks, I'm on the Gemma team. We released new models just recently, and I saw many questions here about function calling, so we just published docs detailing this more. In short, Gemma 3's prompted instruction following is quite good for the larger models, and that prompted approach is how you use the feature.
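As a rough sketch of what that can look like (the function, schema, and output format below are just illustrative, not an official spec):

```python
# Minimal sketch of prompted function calling: describe the tools in the
# prompt and ask the model to reply with a JSON call. Everything here
# (function name, schema, output format) is illustrative, not an official spec.
PROMPT = """You have access to the following function:

get_weather(city: str) -> str  # returns the current weather for a city

If the user's request needs this function, reply with ONLY a JSON object:
{"name": "<function name>", "arguments": {...}}

User: What's the weather like in Paris right now?
"""
# Send PROMPT to Gemma 3 (AI Studio or a local runtime), then parse the
# returned JSON and execute the matching function yourself.
```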
You don't need to take our word for it! We were waiting for external, independent validation from the Berkeley team, and they just published their results. You can use their metrics to get a rough sense of performance, and of course try it out yourself in AI Studio or locally with your own prompts.
https://gorilla.cs.berkeley.edu/leaderboard.html
Hope you all enjoy the models!
Thanks! Gemma is fantastic, and it's great that it supports function calling.
So if I'm reading this correctly, it's essentially prompt engineering here and there's no guarantee on the output. Why not enforce a guaranteed output structure by restricting the allowed logits at each step (e.g. what the outlines library does)?
So in short, there's no guarantee on the output of any LLM, whether it's Gemma or any other (ignoring details like setting a random seed or temperature to 0). Like you mentioned, though, libraries like outlines can constrain the output. Hosted models often already include this in their API, but they can do so because what's served is a model plus some server-side code.
With Gemma, or any open model, you can pair the model with those open libraries to get what you want. Some inference frameworks, like Ollama, include structured output as part of their functionality.
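As a rough sketch of the library route (outlines' API has shifted between versions, and the tool-call schema here is made up):

```python
from pydantic import BaseModel
import outlines

# Made-up schema for the tool call we want the model to emit.
class WeatherArgs(BaseModel):
    city: str

class ToolCall(BaseModel):
    name: str
    arguments: WeatherArgs

# Constrained decoding: at each step, only tokens that keep the output a
# valid ToolCall JSON object are allowed. (Pre-1.0 outlines API shown.)
model = outlines.models.transformers("google/gemma-3-1b-it")
generate = outlines.generate.json(model, ToolCall)

call = generate("Call a function to get the weather in Paris.")
print(call)  # ToolCall(name=..., arguments=WeatherArgs(city=...))
```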
But you mentioned all of this already in your question, so I feel like I'm missing something. Let me know!
With OpenAI models, my understanding is that token output is restricted so that each next token must conform to the specified grammar (i.e. a JSON schema), so you're guaranteed to get either a function call or an error.
Edit: per simonw’s sibling comment, ollama also has this feature.
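To illustrate the OpenAI side (from memory, so exact field shapes may differ slightly):

```python
from openai import OpenAI

client = OpenAI()

# Ask for output constrained to a JSON schema; the server-side decoder only
# allows tokens that keep the output valid against this schema.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "tool_call",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "arguments": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                        "additionalProperties": False,
                    },
                },
                "required": ["name", "arguments"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```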
Ah, there's a distinction here between the model and the inference framework. The Ollama inference framework supports token output restriction. Gemma in AI Studio also does, as does Gemini (there's a toggle in the right-hand panel), but that's because both of those models are served behind an API where the functionality lives on the server.
The Gemma model by itself does not though, nor does any "raw" model, but many open libraries exist for you to plug into whatever local framework you decide to use.
If you run Gemma via Ollama (as recommended in the Gemma docs) you get exactly that feature, because Ollama provides that for any model that they run for you: https://ollama.com/blog/structured-outputs
Under the hood, it is using the llama.cpp grammars mechanism that restricts allowed logits at each step, similar to Outlines.
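For example, roughly following the Ollama structured outputs docs (the tool-call schema here is made up):

```python
from ollama import chat
from pydantic import BaseModel

# Made-up schema for the tool call we want the model to emit.
class ToolCall(BaseModel):
    name: str
    arguments: dict

# Ollama's `format` parameter takes a JSON schema; under the hood the
# llama.cpp grammar machinery restricts the allowed tokens at each step.
response = chat(
    model="gemma3",
    messages=[{"role": "user",
               "content": "Call a function to get the weather in Paris."}],
    format=ToolCall.model_json_schema(),
)
print(ToolCall.model_validate_json(response.message.content))
```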
I've been working on tool calling in llama.cpp for Phi-4 and have a client that can switch between local models and remote ones for agentic work/search/etc., so I learned a lot about this situation recently:
- We can constrain the output with a JSON grammar (old-school llama.cpp).
- We can format inputs to make sure they match the model's expected format.
- Both of these combined is what llama.cpp does, via @ochafik's work in, inter alia, https://github.com/ggml-org/llama.cpp/pull/9639.
- ollama isn't plugged into this system AFAIK
To OP's question, specifying a format the model was trained on unlocks the training the model specifically had on function calling, in what I sometimes call an "agentic loop": we're dramatically increasing the odds we're singing the right tune for the model to do the right thing in this situation.
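A minimal sketch of the grammar-constrained side with llama-cpp-python (toy grammar, placeholder model path):

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar forcing output of the shape {"name":"...","arguments":{...}}.
# Real grammars (e.g. llama.cpp's json.gbnf) are much more thorough.
GRAMMAR = r'''
root ::= "{\"name\":\"" [a-z_]+ "\",\"arguments\":{" [^{}]* "}}"
'''

llm = Llama(model_path="gemma-3-12b-it-Q4_K_M.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GRAMMAR)

out = llm(
    "Call a function to get the weather in Paris.\n",
    grammar=grammar,   # restricts the allowed logits at each decoding step
    max_tokens=128,
)
print(out["choices"][0]["text"])
```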
Do you have thoughts on the code-style agents recommended by huggingface? The pitch for them is compelling, since structuring complex tasks in code is something very natural for LLMs. But then, I don’t see as much about this approach outside of HF.
Is the format used in the examples the same one used in the function-calling instruction training, i.e. should it be the optimal prompt for function calling?
I find it a bit frustrating when details of the training are not known and one has to guess what kinds of prompts the model has been tuned with.
We feel this model excels at instructability, which is why we're recommending bringing your own prompt! Benchmark-wise you can see this performance from BFCL directly: they (independently) ran their eval using their own prompted format, and the larger Gemma models performed quite well, if you ask me.
Specifically though, I want to thank you for leaving a comment. We're reading all this feedback, and it's informing what we can do next to reduce frustration and create the best model experience for the community.
Do you mean that the exact prompt for tool use shouldn't matter? Has this been tested? Is the tool use trained with a variety of prompt styles?
I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.
Ah, I see where the confusion might come from.
I don't mean the exact prompt shouldn't matter, but I am saying that we noticed this series of models picked up on tool-call formats quite readily in our various tests, which is what we express in the docs. We tested internally, and I hope the independent BFCL results speak for themselves! All their code and evals are fully public.
> I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.
This is absolutely true. I showed this in a tutorial last year where Gemma 2 is finetuned for a specific format, and with some targeted SFT it produces JSON output more readily: https://www.youtube.com/watch?v=YxhzozLH1Dk
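The rough shape of that kind of targeted SFT, if you want to roll your own (exact trl arguments vary across versions, and the dataset here is a toy stand-in):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy stand-in dataset: prompts paired with the exact JSON tool-call format
# we want the model to produce. A real run needs far more examples.
examples = [
    {"text": 'User: weather in Paris?\nModel: {"name": "get_weather", "arguments": {"city": "Paris"}}'},
    {"text": 'User: convert 3 EUR to USD\nModel: {"name": "convert_currency", "arguments": {"amount": 3, "from": "EUR", "to": "USD"}}'},
]
dataset = Dataset.from_list(examples)

# Targeted SFT: a short finetune that nudges the model toward one specific
# output format.
trainer = SFTTrainer(
    model="google/gemma-2-2b-it",
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemma-tool-call-sft"),
)
trainer.train()
```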
So this is all to say, Gemma is designed to be a great model for multiple types of users. If you want to use the "out of the box" weights with your own format, go ahead! We hope that makes it easier to integrate with whatever tooling you're using with minimal headache.
If you need specific performance on your bespoke format, finetune the model to make it your own! Finetuning is supported across many frameworks, so pick whatever library you like best.
All this to say: we hope Gemma is flexible and usable for folks like yourself along a variety of dimensions. For my part, I'm learning there's big interest in a specific prompt format. Again, I can't thank you enough for the feedback here.
> We feel this model excels at instructability which is why we're recommending bringing your own prompt!
Sigh. Taps the sign:
--- start quote ---
To put it succinctly, prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:
- training set
- weights
- constraints on the model
- layers between you and the model that transform both your input and the model's output, and that can change at any time
- availability of compute for your specific query
- and definitely some more details I haven't thought of
"Prompt engineers" will tell you that some specific ways of prompting some specific models will result in a "better result"... without any criteria for what a "better result" might signify.
https://dmitriid.com/prompting-llms-is-not-engineering
--- end quote ---
With open models this isn't as true. The weights are local, you bring your own compute, and there's nothing between you and the model. Regarding what a "better result" is, I personally encourage you to define it in an evalset and then optimize against that. Agreed that having no criteria is not a great situation to be in.
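A toy version of that evalset idea (call_model is a placeholder for however you invoke the model, and the cases are made up):

```python
import json

# Placeholder: swap in however you actually call the model (Ollama, AI Studio, ...).
def call_model(prompt: str) -> str:
    raise NotImplementedError

# Made-up evalset: user request -> the tool call we expect back.
EVALSET = [
    {"question": "What's the weather in Paris?",
     "expected": {"name": "get_weather", "arguments": {"city": "Paris"}}},
    {"question": "Convert 3 EUR to USD.",
     "expected": {"name": "convert_currency",
                  "arguments": {"amount": 3, "from": "EUR", "to": "USD"}}},
]

def score(prompt_template: str) -> float:
    """Fraction of eval cases where the model returns the expected JSON call."""
    hits = 0
    for case in EVALSET:
        raw = call_model(prompt_template.format(question=case["question"]))
        try:
            hits += json.loads(raw) == case["expected"]
        except json.JSONDecodeError:
            pass  # malformed JSON counts as a miss
    return hits / len(EVALSET)

# Try a few prompt variants and keep whichever scores best on YOUR cases.
```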
This was the main point of a tutorial I did about a month ago showing how to make a simple AI app using Gemma, though the principles hold for any LLM.
https://www.youtube.com/live/9zM_93mYdu8
Hope this helps!
Even then the "You MUST" and "You SHOULD NOT" are just magical incantations that may (and will) randomly fail.
In the field of AI, wishful mnemonics are the rule, so arguments about prescription are of limited value.
CoT scratch space extends LLMs from DLOGTIME-uniform TC^0 to PTIME, given a polynomial-sized scratch space.
https://arxiv.org/abs/2502.02393
Yes, prompt engineering is probably better thought of as prompt augmentation or steering.
But the system identification problem and Rice's theorem rigorously debunk the above link's core claims.
It is a craft that can improve domain specificity and usefulness.
All models are wrong (even formalized engineering ones), but some are useful.
The price one has to pay when resorting to what is fundamentally compression, as PAC learning is, is that it is fundamentally unstable under perturbations.
You are basically searching through a haystack with a magnet, and making sure that at least one of the needles you find is the correct one is a semantic property. Guiding the approximate retrieval process to improve your results will always be a craft.
The snake oil is mostly on the side that claims that unrestricted natural language is a possibility. We still only have NLP, and true human level NLU is still thought to be beyond the limits of computation IMHO.
Thus prompt augmentation is a consequence of the argument that link was trying to make.
ToolACE-2-8B and watt-tool-8B have impressive scores for their size on that leaderboard.
Don’t wanna be that guy but you guys have too many models. I love Gemini and Gemma but it’s way too crowded atm.