[⭐️ICASSP 2026] Fake-HR1: Rethinking Reasoning of Vision-Language Models for Synthetic Image Detection
Changjiang Jiang1, Xinkuan Sha2, Fengchang Yu1,*, Jingjing Liu2,*, Jian Liu2,*, Mingqi Fang2, Chenfeng Zhang3, Wei Lu1
1 Wuhan University 2 Ant Group 3 Zhejiang University
* Corresponding author
Install the dependencies:

```bash
pip install -r requirements.txt
```

Training uses two datasets: FakeClue and GenImage. Prepare both before running the scripts below.

Run supervised fine-tuning with ms-swift:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
MASTER_ADDR=${MASTER_ADDR} \
SIZE_FACTOR=8 \
NPROC_PER_NODE=16 \
swift sft \
--model "./Qwen2.5-VL-7B-Instruct" \
--model_type "qwen2_5_vl" \
--train_type full \
--dataset "./GenImage/train_cleaned_0915.json" \
"./FakeClue/data_json/train_clean_0916.json" \
--split_dataset_ratio 0.001 \
--max_length 8192 \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 2 \
--freeze_vit false \
--freeze_llm false \
--freeze_aligner false \
--save_steps 1000 \
--save_total_limit 2 \
--logging_steps 200 \
--output_dir ./checkpoints/fakeclue_0917 \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--attn_impl flash_attention_2 \
--deepspeed zero2 \
--system "You are a helpful assistant for AI-generated image detection. Inspect the image and decide if it is real or fake.\nReasoning mode:\n- If the image shows **obvious AI-generation traces** and is easy to detect, give no think steps.\n- Otherwise, provide **careful, step-by-step** reasoning. If the image is easy to detect, output no think steps: <think>\n\n</think>\n\nreal or fake. If the user asks for an explanation of the reasoning or the image is hard to detect, output the think steps: \n<think>\n[Your reasoning here]\n</think>\n\nreal or fake." \
--response_prefix "<think>\n" \
--max_pixels 12845056 \
--gradient_checkpointing true
```
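The contents of `train_cleaned_0915.json` / `train_clean_0916.json` are not shown in this snippet. Below is a minimal sketch of one training entry, assuming ms-swift's standard multimodal `messages`/`images` format; the question text and image path are hypothetical, and the assistant response follows the think-tag style described in the system prompt above:

```json
{
  "messages": [
    {"role": "user", "content": "<image>Is this image real or fake?"},
    {"role": "assistant", "content": "<think>\n...\n</think>\n\nfake"}
  ],
  "images": ["GenImage/train/fake/000001.jpg"]
}
```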
Then run GRPO training with verl (a minimal sketch of the `reward.py` scoring function follows the command):

```bash
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
MASTER_ADDR=${MASTER_ADDR} \
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
custom_reward_function.name="compute_score" \
custom_reward_function.path="./reward.py" \
data.train_files=$train_file_path \
data.train_batch_size=32 \
data.max_prompt_length=8192 \
data.max_response_length=2048 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.image_key=images \
actor_rollout_ref.model.path=$model_path \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.04 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.free_cache_engine=False \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.004 \
trainer.default_local_dir=$model_output_path \
trainer.critic_warmup=0 \
trainer.logger=['console','tensorboard'] \
trainer.project_name='verl_grpo_example_geo3k' \
trainer.experiment_name='test' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.test_freq=50 \
trainer.total_epochs=1
```
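The repository's `reward.py` is referenced above but not shown here. The following is a minimal sketch, assuming verl's custom-reward interface (a `compute_score(data_source, solution_str, ground_truth, extra_info=None)` function returning a float, matching the `custom_reward_function.path`/`custom_reward_function.name` flags in the command) and a simple exact-match rule on the final real/fake verdict. The scoring rule is an assumption, not the paper's actual reward:

```python
# reward.py -- illustrative sketch, not the repo's actual reward function.
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the final verdict matches the label, else 0.0 (rule assumed)."""
    # Drop the optional <think>...</think> block so only the verdict remains.
    verdict = re.sub(r"<think>.*?</think>", "", solution_str, flags=re.DOTALL)
    verdict = verdict.strip().lower()
    label = str(ground_truth).strip().lower()
    return 1.0 if verdict == label else 0.0
```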
To evaluate, first get inference results. You need to organize your dataset in the following JSONL format:
```json
{"file_path": "image_0.jpg", "label": "fake"}
```

Each line is one JSON object; `label` is either `fake` or `real`.
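If your images are not yet in this format, a small helper like the one below can build the JSONL. It is not part of the repo and assumes a hypothetical directory layout with `real/` and `fake/` subfolders:

```python
# build_eval_jsonl.py -- illustrative helper, not part of the repo.
import json
from pathlib import Path

def build_jsonl(root: str, out_path: str) -> None:
    # Walk real/ and fake/ subfolders and emit one JSON object per image.
    with open(out_path, "w", encoding="utf-8") as f:
        for label in ("real", "fake"):
            for img in sorted(Path(root, label).glob("*.jpg")):
                f.write(json.dumps({"file_path": str(img), "label": label}) + "\n")

if __name__ == "__main__":
    build_jsonl("eva_images", "eva_total_ood_test.jsonl")  # paths are examples
```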
Then run the evaluation script:

```bash
cd eva

infer_path=eva_total_ood_test.jsonl
model_path=Qwen3-VL-30B-A3B-Instruct
to_path=qwen25_vl_infer.jsonl

python qwen25_vl_infer.py \
--root "$infer_path" \
--to_path "$to_path" \
--model_path "$model_path"
```

`$to_path` will contain the inference results.
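To turn the inference results into a number, a minimal accuracy check might look like the sketch below. It assumes each output line keeps the ground-truth `label` and stores the model's verdict under a `pred` key, which is an assumption about `qwen25_vl_infer.py`'s output format:

```python
# score_results.py -- illustrative accuracy check, not part of the repo.
import json

def accuracy(path: str) -> float:
    total = correct = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            total += 1
            # "pred" is an assumed field name for the model's verdict.
            correct += rec["pred"].strip().lower() == rec["label"].strip().lower()
    return correct / max(total, 1)

if __name__ == "__main__":
    print(f"accuracy = {accuracy('qwen25_vl_infer.jsonl'):.4f}")
```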