Conversation
chat_template
The chat_template strategy uses a jinja2 template to convert a list of messages into a prompt. It supports using the tokenizer's template, one of the supported built-in templates, or a custom jinja2 template.
data.jsonl
{"messages": [{"role": "...", "content": "..."}, {"role": "...", "content": "..."}, ...]}
See the config reference for full configs and supported templates.
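To preview what a given tokenizer's template produces for this message format, you can render it directly with the transformers library. A minimal sketch; the model name is only an example:

from transformers import AutoTokenizer

# Any model whose tokenizer_config.json ships a chat_template works here.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]

# Render the conversation into the exact prompt string the template produces.
print(tokenizer.apply_chat_template(messages, tokenize=False))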
Examples
Training on last message
(Legacy) Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only the last message.
datasets:
  - path: ...
    type: chat_template
    roles_to_train:
    train_on_eos:
If you receive an error like “chat_template choice is tokenizer_default but tokenizer’s chat_template is null.”, it means the tokenizer does not have a default chat_template. Follow the examples below instead to set a custom chat_template.
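You can check this up front by inspecting the tokenizer; a quick sketch, with the model path as a placeholder:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/or/hub-id-of-your-model")

# `None` here means there is no default template, so you must supply one
# via `chat_template` or `chat_template_jinja` in the axolotl config.
print(tokenizer.chat_template)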
Overriding default chat template
Using the gemma chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.
chat_template: gemma # this overwrites the tokenizer's chat_template
datasets:
  - path: ...
    type: chat_template
    roles_to_train: ["assistant"] # default value
If you want to use the tokenizer's built-in chat_template, set chat_template: tokenizer_default (this is the default).
Using default chat template with fallback
Using the tokenizer_config.json's chat template, with chatml as a fallback if the tokenizer does not define one, on OpenAI messages format, training on all assistant messages.
chat_template: tokenizer_default_fallback_chatml # use the tokenizer's chat_template, falling back to chatml if it is missing
datasets:
  - path: ...
    type: chat_template
Custom Jinja template
Using a custom jinja template on OpenAI messages format, training on all assistant messages.
# chat_template: jinja # `jinja` is implied if `chat_template_jinja` is set and this field is empty
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}"
datasets:
  - path: ...
    type: chat_template
Please make sure that your tokenizer.eos_token is the same as the EOS (End-of-Sequence) token in the template. Otherwise, set eos_token under special_tokens:.
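You can sanity-check a custom template outside of training by rendering it with jinja2 directly. A minimal sketch using the template above; the bos_token value is an assumption for illustration:

from jinja2 import Template

# Same template as in the config above, kept as raw strings so the
# '\n' escapes are interpreted by jinja2 rather than by Python.
chat_template_jinja = (
    r"{{ bos_token }}{% for message in messages %}"
    r"{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}"
    r"{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}"
    r"{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}"
    r"{% endif %}{% endfor %}"
)

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]

# Check that the rendered end-of-turn marker matches your tokenizer.eos_token.
print(Template(chat_template_jinja).render(bos_token="<s>", messages=messages))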
Using a template with different EOT and EOS tokens
- If you are using a template whose EOT (End-of-Turn) token differs from the EOS token, or that uses multiple EOT tokens (like Mistral V7 Tekken), set the eot_tokens: config. The handling of EOT tokens follows train_on_eot:, which falls back to train_on_eos: and defaults to turn.
eot_tokens:
  - "[/INST]"
  # - "[/SYSTEM_PROMPT]"
datasets:
  - path: ...
    type: chat_template
    # optional
    train_on_eot: turn # defaults read from train_on_eos (which defaults to turn)
See config documentation for detailed explanations of “turn”, “last”, and “all” options for training on tokens.
Using eot_tokens requires each of those tokens to be a single token in the tokenizer. Otherwise, the tokenizer will split the token and cause unexpected behavior. You can add those tokens as new tokens under tokens: or (recommended) override unused added_tokens via added_tokens_overrides:. See the config reference for more details. A quick way to check this is shown in the sketch at the end of this section.
- Continuing from the previous example, if you want to train on the EOT token of every trainable turn but only on the last EOS token, set train_on_eos: last.
eot_tokens:
  - "[/INST]"
  # ...
datasets:
  - path: ...
    type: chat_template
    train_on_eos: last
    train_on_eot: turn
If the EOS token only appears at the end of a prompt, train_on_eos: last is equivalent to train_on_eos: turn, so you can generally leave both options at their defaults and omit them.
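To verify that each entry in eot_tokens is a single token in your tokenizer, a quick sketch; the model name is only an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

for marker in ["[/INST]"]:
    ids = tokenizer(marker, add_special_tokens=False)["input_ids"]
    # One id means the marker is a single token; more than one means the
    # tokenizer will split it, so add or override it as described above first.
    print(marker, ids, "single token" if len(ids) == 1 else "will be split")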
Using tools
Instead of passing tools via the system prompt, an alternative is to keep the tools in a separate column and load them via the chat_template, letting the template build the tool prompt dynamically.
{
  "tools": [
    {
      "type": "...",
      "function": {
        "name": "...",
        "description": "...",
        "parameters": {
          "type": "...",
          "properties": {
            // ...
          },
          "required": ["..."],
        },
      },
    },
  ],
  "messages": [
    // ...
    {
      "role": "assistant", // call the function via assistant
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "...",
            "arguments": {
              "...": "...",
            }
          }
        }
      ]
    },
    {
      "role": "tool",
      "name": "...",
      "content": "..."
    },
  ],
}
Tools need to follow JSON schema.
chat_template: llama4
datasets:
  - path: ...
    type: chat_template
    # field_tools: tools # default is `tools`
Look into the chat_template you are using to see if it supports tools and what role it expects for the tool answer. In the example above, the tool answer is expected to be in the tool or ipython role for the llama4 template.
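To preview how a template renders tools and tool calls, recent transformers tokenizers accept a tools argument in apply_chat_template. A minimal sketch; the model path and the tool definition are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/or/hub-id-of-a-tool-capable-model")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [
            {"type": "function", "function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
        ],
    },
    {"role": "tool", "name": "get_weather", "content": "22C and sunny"},
]

# Render the full conversation, letting the template build the tool prompt.
print(tokenizer.apply_chat_template(messages, tools=tools, tokenize=False))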
Using fine-grained control over token masking
(Advanced) Using fine-grained control over which tokens and turns in a conversation to train on.
For a data sample that looks like:
data.jsonl
{
  "conversations": [
    {"from": "system", "value": "You are an AI assistant.", "train": false},
    {"from": "human", "value": "Hello", "train": false},
    {"from": "assistant", "value": "Hello", "train": true},
    {"from": "human", "value": "How are you?", "train": true},
    {
      "from": "assistant",
      "value": "I'm doing very well, thank you!",
      "train_detail": [
        {"begin_offset": 0, "end_offset": 8, "train": false},
        {"begin_offset": 9, "end_offset": 18, "train": true},
        {"begin_offset": 19, "end_offset": 30, "train": false}
      ]
    },
    {
      "from": "human",
      "value": "I'm doing very well, thank you!",
      "train": true
    },
    {"from": "assistant", "value": "Hi there!", "train": true}
  ]
}
The configuration would look like:
datasets:
  - path: ...
    type: chat_template
    chat_template: tokenizer_default
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles_to_train: []
    train_on_eos: turn
    message_field_training: train
    message_field_training_detail: train_detail
It is not necessary to set both message_field_training and message_field_training_detail at once.
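Conceptually, the per-character offsets are mapped onto a token-level label mask. A rough sketch of the idea (not axolotl's exact implementation), using the tokenizer's offset mapping; the model path is a placeholder:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/or/hub-id-of-your-model")

text = "I'm doing very well, thank you!"
train_detail = [
    {"begin_offset": 0, "end_offset": 8, "train": False},
    {"begin_offset": 9, "end_offset": 18, "train": True},
    {"begin_offset": 19, "end_offset": 30, "train": False},
]

enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)
labels = []
for token_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    # Train on a token only if it falls entirely inside a trainable character span.
    trainable = any(
        d["train"] and d["begin_offset"] <= start and end - 1 <= d["end_offset"]
        for d in train_detail
    )
    labels.append(token_id if trainable else -100)  # -100 is ignored by the loss
print(labels)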
Reasoning split
(For Qwen3 template only) Enable reasoning split, where the reasoning is split from the content and passed as a separate field into the template.
datasets:
- path: ...
type: chat_template
chat_template: qwen3
split_thinking: true
For example, a message's content can look like:
{
  "content": "<think>Some thinking outputs</think>Output after thinking."
}
After split, it will look like:
{
  "reasoning_content": "Some thinking outputs",
  "content": "Output after thinking."
}
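The transformation itself amounts to pulling the <think>...</think> block into a separate field. A rough sketch of the idea, not the exact implementation:

import re

def split_thinking(content: str) -> dict:
    # Move the <think>...</think> block into reasoning_content and keep
    # whatever follows it as the visible content.
    match = re.match(r"<think>(.*?)</think>(.*)", content, flags=re.DOTALL)
    if not match:
        return {"content": content}
    return {
        "reasoning_content": match.group(1).strip(),
        "content": match.group(2).strip(),
    }

print(split_thinking("<think>Some thinking outputs</think>Output after thinking."))
# {'reasoning_content': 'Some thinking outputs', 'content': 'Output after thinking.'}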
pygmalion
data.jsonl
{"conversations": [{"role": "...", "value": "..."}]}