To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get
25 t/s prompt processing
63 t/s token generation
Overall processing time per image is ~15 secs, no matter what size the image is. The small 4B model already has very decent output, describing different images pretty well. Steps to reproduce:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 12 --clean-first
# download model and mmproj files...
build/bin/llama-server \
--model gemma-3-4b-it-Q4_K_M.gguf \
--mmproj mmproj-model-f16.gguf
Then open http://127.0.0.1:8080/ for the web interface. Note: if you are not using -hf, you must include the --mmproj switch, otherwise the web interface gives an error message that multimodal is not supported by the model.
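For comparison, a sketch of the -hf route mentioned above (this assumes a recent enough build, and that -hf fetches the matching mmproj from the Hugging Face repo automatically; if it doesn't in your version, fall back to the explicit --model/--mmproj form):
build/bin/llama-server -hf ggml-org/gemma-3-4b-it-GGUF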
I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.
For every image I try, I get the same response:
> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.
No, none of these things are in the images.
I don't even know how to begin debugging that.
I get the same as well, except I get this message, no matter which image I upload: "This is a humorous meme that uses the phrase 'one does not get it' in a mocking way. It's a joke about people getting frustrated when they don't understand the context of a joke or meme."
Not sure why it's not working
Ok, following this comment in the thread fixed the issue: https://news.ycombinator.com/item?id=43943624
Means it can't see the actual image. It's not loading for some reason.
I’m having a hard time imagining how failure to see an image would result in such a misleadingly specific wrong output instead of e.g. “nothing” or “it’s nonsense with no significant visual interpretation”. That sounds awful to work with.
LLMs have a very hard time saying "I am useless in this situation", because they are explicitly trained to be a helpful assistant.
So instead of saying "I can't help you with this picture", the thing hallucinates something.
That is the expected behavior by now. Not hard to imagine at all.
No controls in the training data?
Fun fact: you can prompt the LLMs with no input and random nonsense will come out of them.
And if you set the temperature to zero, you'll get the same output every time!
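(A minimal sketch of that, assuming you launch the server yourself: pass the sampling flag at startup so decoding is greedy.)
build/bin/llama-server \
    --model gemma-3-4b-it-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf \
    --temp 0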
Hmm, I'm getting the same results - but I see that on M1 with a 7B model we should expect ~10x faster prompt processing:
https://github.com/ggml-org/llama.cpp/discussions/4167
I wonder if it's the encoder that isn't optimized?
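One way to narrow that down is to benchmark the text-only path with llama-bench from the same build (a sketch; it measures prompt processing and token generation without the image encoder, so a big gap versus the multimodal numbers would point at the encoder):
build/bin/llama-bench -m gemma-3-4b-it-Q4_K_M.gguf -p 512 -n 128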
Are those numbers for the 4/8 bit quants or the full fp16?
It is a 4-bit quant gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as prompt or "short description" if I want less verbose output.
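For what it's worth, a minimal sketch of scripting that same prompt against the running server (assuming the OpenAI-compatible /v1/chat/completions endpoint accepts base64 image_url parts in your build; photo.jpg is a placeholder filename):
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "describe"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$(base64 < photo.jpg | tr -d '\n')"'"}}
      ]
    }]
  }'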
Since you are a photographer, I ran a picture from your website through Gemma 4B, and it produced the following:
"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."
This description is pretty spot on.
The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) on zamadatix's website.
I can neither claim to be a photographer nor claim that https://www.dansmithphotography.com/ is my website, but I appreciate the example! The specific photo, for others' reference, based on the filename: https://payload.cargocollective.com/1/15/509333/14386490/L-o...
That said, I'm not as impressed by the description. The structure has some wood but it's certainly not just wooden, and there are distant mountains but not much in the way of rolling hills to speak of. The dress is flowing but the waist is not knotted - the more striking note might have been the sleeves.
For 4 GB of model I'm not going to ding it too badly though. The question about which quant was mainly about the tokens/second angle (Q4 requires 1/4 of the memory bandwidth the full model would) rather than the quality angle. As a note: a larger multimodal model gets all of these points right (e.g. "wooden and stone rustic structure"); they aren't just things I noted myself.
n.b. the image processing is done by a separate model; it basically has to load the image and generate ~1000 tokens
(source: vision was available in llama.cpp before, but it was Very Hard; I've been maintaining an implementation)
(n.b. it's great work, extremely welcome, and new in that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)
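If you want to exercise that separate image path directly, here's a sketch using the mtmd CLI that ships alongside the server (assuming the llama-mtmd-cli binary and its --image flag are present in your build; photo.jpg is a placeholder):
build/bin/llama-mtmd-cli \
    -m gemma-3-4b-it-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf \
    --image photo.jpg \
    -p "describe"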
Wait, sorry, can you explain how this works? I thought Gemma 3 used SigLIP, which can output all 256 embeddings in parallel.
(also, would you mind sharing a code pointer if you have any handy? I found this https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd... but not sure if that's the codepath taken)
do you have any example images it generated based on your prompts?
want to have a look before I try
To be clear, this model isn't generating images, it's describing images that are sent to it.