The framework supports the input types required for SLM inference, including audio-only input, audio + text instruction input (NOT RECOMMENDED), and pure text input, as well as pure text input for LLM inference. For SLMs, we recommend audio-only input to simulate end-to-end dialogue scenarios.
All configurations can be modified under `registry/*/*.yaml`.
- `infer_task` needs to be specified for both the inference and evaluation stages.
- `dataset`, `template`, and `model` refer to the custom dataset, the data-processing template, and the model name, respectively. Please refer to their individual documentation for specific definitions.
- `model`, `save_pred_audio`, and `eval_task` can be controlled through global variables in `main.py`; these global variables override the settings in the YAML files.
- `reverse_spkr` and `use_model_history` are for multi-turn tests and are not needed in single-turn tasks.
  - If the data includes two speakers (one providing the ground truth), enabling `reverse_spkr` swaps their roles.
  - If `use_model_history` is disabled, the model uses the ground-truth text as history instead of its own output. (Note: some SLMs only support using their own history.)
```yaml
your_infer_task_name: # the name of your own infer_task; it should be unique
  class: src.config.InferTaskCfg
  args:
    dataset: your_dataset_name
    template: your_template_name
    model: your_model_name
    save_pred_audio: False
    eval_task: your_eval_task_name
    # reverse_spkr: False # for multiturn
    # use_model_history: True # for multiturn
    # save_latest_only: True # for multiturn
```
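For example, the global overrides might look like the following in `main.py`; the exact variable names there are assumptions based on the description above.

```python
# Hypothetical sketch of the globals in main.py; the real variable names and
# structure may differ. These values override the corresponding YAML settings.
infer_task = "your_infer_task_name"  # required for both inference and evaluation
model = "your_model_name"
save_pred_audio = False
eval_task = "your_eval_task_name"
```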
- For the list of supported evaluation tasks, see Available eval_task.
- The `eval_task` has only two parameters: `evaluator`, which specifies the evaluator name, and `summarizer`, which defines how the scores are processed. See their respective documentation for details.
```yaml
your_eval_task_name:
  class: src.config.EvalTaskCfg
  args:
    evaluator: your_evaluator_name
    summarizer: your_summarizer_name
```
- The `BatchLoader` supports two formats: Parquet files from HuggingFace and local JSONL files.
  - When using HuggingFace, audio is decoded and saved to `temp_dir` (auto-deleted after the task). If `save_query_audio_dir` is set, audio will be saved there and kept for future reuse.
  - For faster access, it is recommended to convert Parquet to local JSONL and WAV files using `tools/parquet2jsonl.py`.
- `key_col`, `ref_col`, `query_col`, and `extra_col` control what extra info is saved in `${save_dir}/prediction/${model}/${infer_task}.jsonl`. These fields are used for evaluation. Usually, `extra_col` is not needed.
- `batch_size` sets how many samples are processed at once. For multi-turn tests, keep it at 1 to avoid OOM errors.
```yaml
dataset-name:
  class: src.dataset.BatchLoader
  args:
    file: path/to/huggingface # or path/to/*.jsonl or path/to/dataset
    ref_col: answer # the reference answer column name in the file
    query_col: query # question column for the logger
    batch_size: 1
    # key_col: key # For private data, this lets you set the key column; the default is "key".
    # extra_col: ["xxx", "xxx"] # List-type
    # save_query_audio_dir: test_data/audios # if set, decodes and saves test WAVs when generating from HuggingFace; not required for JSONL data
    # is_local: True # if True, uses the Parquet files at your local path (defined in 'file')
```
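As a reference for the local JSONL format, a single line might look like the sketch below. The `key`, `query`, and `answer` names mirror the defaults in the config above; the `audio` path column is a hypothetical name, since the exact column for WAV paths depends on your data.

```jsonl
{"key": "sample_0001", "query": "What's the weather like today?", "answer": "It's sunny.", "audio": "test_data/audios/sample_0001.wav"}
```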
- To add a custom model, implement your own model class under `src/models`. You only need to define two functions: `generate_once` for single-turn inference and `generate_multiturn` for multi-turn inference (a minimal sketch follows below). The return format should be:

```python
return {
    "pred": ...,        # model's text output
    "pred_audio": ...,  # model's audio output path (if save_pred_audio is True)
    "cache": ...,       # model's kv_cache or generate_id (for multiturn)
    "his": ...,         # model's history text, if it differs from the text output (for multiturn; only a few models need this)
}
```

- `cache` and `his` are optional. If both are missing, the model uses the default history (usually the last output).
  - `cache` holds temporary states updated each turn (e.g., kv_cache or token_ids).
  - `his` is for output history. Use this if the model requires a different history format than plain output text; it will be accumulated into `assistant_history`.
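Putting this together, a custom model class might look roughly like the sketch below; the method signatures and the `_infer` helper are illustrative assumptions, since only the function names and return keys are specified above.

```python
# Hypothetical sketch of a custom model class under src/models. Only the
# generate_once / generate_multiturn names and the return keys come from the
# docs; the method signatures and helper below are illustrative assumptions.
class YourModelClass:
    def __init__(self, path, sample_params=None):
        # `path` and `sample_params` mirror the args in the model config below.
        self.path = path
        self.sample_params = sample_params or {}

    def _infer(self, query, cache=None):
        # Placeholder for the actual forward pass of your model.
        return "model text output", "path/to/output.wav", cache

    def generate_once(self, query):
        # Single-turn inference: return at least the text prediction.
        text, audio_path, _ = self._infer(query)
        return {
            "pred": text,
            "pred_audio": audio_path,  # only used when save_pred_audio is True
        }

    def generate_multiturn(self, query, cache=None):
        # Multi-turn inference: `cache` carries per-dialogue state across turns.
        text, audio_path, cache = self._infer(query, cache)
        return {
            "pred": text,
            "pred_audio": audio_path,
            "cache": cache,  # e.g., kv_cache or generated token ids
            # "his": ...,    # only if the model needs a non-default history format
        }
```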
- If GPU memory is limited, you can use `load_model_with_auto_device_map` from `model_utils.py` to split the model across multiple GPUs by layer; see the `split_device` method in `kimi_audio.py` for an example (a rough sketch also follows the config below).
- After implementing a custom model, just add a new model config. The parameters in `args` can be customized as needed and should match the `__init__` method of your model class.
```yaml
model_name:
  class: src.models.your_model_class # your model class defined in model/xxx.py
  args:
    path: path/to/model
    # other_path: another path, e.g., a TTS module, if needed
    sample_params:
      gen_type: greedy
```
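For illustration, calling the memory-splitting helper might look like the snippet below; the actual signature of `load_model_with_auto_device_map` and its import path are assumptions, so check `model_utils.py` and `split_device` in `kimi_audio.py` before use.

```python
# Hypothetical sketch: the real signature is defined in model_utils.py, and
# the split_device method in kimi_audio.py shows a concrete usage pattern.
from src.models.model_utils import load_model_with_auto_device_map  # assumed path

model = load_model_with_auto_device_map("path/to/model")  # split layers across GPUs
```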
- Implement your evaluator class in `src/evaluator`. The returned **Dict** must include at least `key` and `score`; other fields depend on the `summarizer` requirements.
- Use `@parallel_batch` on `evaluate` to handle single-batch evaluation; the framework will manage multithreading automatically (see the sketch after this list).
- Add a new evaluator config after implementation. If `max_workers` is missing, `default_workers` is used.
- LLM API evaluation requires a `key`. See `registry/evaluator/llm.yaml` for examples.
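A minimal evaluator might look like the sketch below; the `@parallel_batch` decorator and the required `key`/`score` fields come from the docs above, while the import path and the `evaluate` signature are assumptions.

```python
# Hypothetical sketch of a custom evaluator under src/evaluator. The
# @parallel_batch decorator and the required key/score fields come from the
# docs above; the import path and the evaluate signature are assumptions.
from src.evaluator import parallel_batch  # assumed import path

class YourEvaluatorName:
    @parallel_batch
    def evaluate(self, sample):
        # Assumed to score one sample at a time; per the docs, the framework
        # manages multithreading across the batch automatically.
        correct = sample["pred"].strip() == sample["ref"].strip()
        return {"key": sample["key"], "score": 1.0 if correct else 0.0}
```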
- `template` is a Jinja data-processing template. Different `role`s (e.g., system, instruct, user) need to be separated. Text data uses the `text` field, and audio data uses the `audio` field.
- `summarizer` defines how evaluation results are processed and must match the `evaluator`.
  - For percentage scores, `AvgInfo` is recommended.
  - For subjective evaluations, `AvgThreshold` is recommended.
  - To customize a `summarizer`, implement a class with a `statistic` method under `src/summarizer` (a minimal sketch follows this list).
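To illustrate the customization point above, a minimal summarizer might look like this sketch; the `statistic` signature and the shape of its input are assumptions, since only the method name is given above.

```python
# Hypothetical sketch of a custom summarizer under src/summarizer. Only the
# statistic method name comes from the docs; its signature is an assumption.
class YourSummarizerName:
    def statistic(self, results):
        # `results` is assumed to be the list of per-sample dicts returned by
        # the evaluator, each containing at least "key" and "score".
        scores = [r["score"] for r in results]
        return {"avg_score": sum(scores) / max(len(scores), 1)}
```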