More Gemma4 fixes in the past 24 hours
**Reasoning budget fix** (merged): [https://github.com/ggml-org/llama.cpp/pull/21697](https://github.com/ggml-org/llama.cpp/pull/21697)
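If you want to poke at the budget per request rather than via config, llama-server (run with `--jinja`) accepts `chat_template_kwargs` in the request body. A minimal sketch, assuming the template exposes `enable_thinking` and `reasoning_budget` as kwargs (same names as in the config further down) and a server on port 8080 — both assumptions, adjust to your setup:

```bash
# Sketch only: port and kwarg names are assumptions (see above).
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize this repo."}],
    "chat_template_kwargs": {"enable_thinking": true, "reasoning_budget": 4096}
  }'
```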
**New chat templates from Google to fix tool calling:**
31B: [https://huggingface.co/google/gemma-4-31B-it/blob/main/chat\_template.jinja](https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja)
26B: [https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat\_template.jinja](https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat_template.jinja)
E4B: [https://huggingface.co/google/gemma-4-E4B-it/blob/main/chat\_template.jinja](https://huggingface.co/google/gemma-4-E4B-it/blob/main/chat_template.jinja)
E2B: [https://huggingface.co/google/gemma-4-E2B-it/blob/main/chat\_template.jinja](https://huggingface.co/google/gemma-4-E2B-it/blob/main/chat_template.jinja)
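To fetch a template locally, pull the raw file from Hugging Face: swap `/blob/` for `/resolve/` in the links above. The local filename here just mirrors the config below, so match it to whatever your own config points at:

```bash
# /resolve/ serves the raw file; /blob/ serves the HTML viewer.
# The local filename is arbitrary -- match whatever your config points at.
mkdir -p /models/gemma4
curl -L -o /models/gemma4/gemma4_chat_template_26B.jinja \
  https://huggingface.co/google/gemma-4-26B-A4B-it/resolve/main/chat_template.jinja
```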
Please correct me if I'm wrong, but you should use these new templates unless you've redownloaded a GGUF that was updated in the past 24 hours to embed the new template.
You can point llama.cpp at a specific template file with the `--chat-template-file` argument (pair it with `--jinja` so the Jinja template is actually applied, as in the config below):

`--chat-template-file /models/gemma4/gemma4_chat_template_26B.jinja`
My current llama-swap/llama.cpp config, 26B example (testing on 16GB VRAM, so the context window is limited):
"Gemma4-26B-IQ4_XS":
ttl: 300 # Automatically unloads after 5 mins of inactivity
cmd: >
/usr/local/bin/llama-server
--port ${PORT}
--host 127.0.0.1
--model /models/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf
--mmproj /models/gemma4/gemma-4-26B-A4B-it.mmproj-q8_0.gguf
--chat-template-file /models/gemma4/gemma4_chat_template_26B_09APR2026.jinja
--cache-type-k q8_0
--cache-type-v q8_0
--n-gpu-layers 99
--parallel 1
--batch-size 2048
--ubatch-size 512
--ctx-size 16384
--image-min-tokens 300
--image-max-tokens 512
--flash-attn on
--jinja
--cache-ram 2048
-ctxcp 2
filters:
stripParams: "temperature, top_p, top_k, min_p, presence_penalty, repeat_penalty"
setParamsByID:
"${MODEL_ID}:thinking":
chat_template_kwargs:
enable_thinking: true
reasoning_budget: 4096
temperature: 1.0
top_p: 0.95
top_k: 64
min_p: 0.0
presence_penalty: 0.0
repeat_penalty: 1.0
"${MODEL_ID}:thinking-coding":
chat_template_kwargs:
enable_thinking: true
reasoning_budget: 4096
temperature: 1.5
top_p: 0.95
top_k: 65
min_p: 0.0
presence_penalty: 0.0
repeat_penalty: 1.0
"${MODEL_ID}:instruct":
chat_template_kwargs:
enable_thinking: false
temperature: 1.0
top_p: 0.95
top_k: 64
min_p: 0.0
presence_penalty: 0.0
repeat_penalty: 1.0"
