The modles are available at ModelScope.
All fine-tuned LoRA adapters are available via Google Drive.
- [Nov. 2025]: ITDR has been accepted to KDD2026!
Large language models (LLMs) have demonstrated outstanding performance in natural language processing tasks. However, in the field of recommendation systems, due to the structural differences between user behavior data and natural language, LLMs struggle to effectively model the associations between user preferences and items. Although prompt-based methods can generate recommendation results, their inadequate understanding of recommendation tasks leads to constrained performance. To address this gap, in this work, we construct a sufficient instruction tuning dataset, ITDR, which encompasses 7 subtasks across two core root tasks—useritem interaction and user-item understanding. The dataset integrates data from 13 public recommendation datasets and is built using manually crafted standardized templates, comprising approximately 200,000 instances. Experimental results demonstrate that ITDR significantly enhances the performance of mainstream open-source LLMs such as GLM-4, Qwen2.5, Qwen2.5-Instruct and LLaMA-3.2 on recommendation tasks. Furthermore, we analyze the correlations between tasks and explore the impact of task descriptions and data scale on instruction tuning effectiveness. Finally, we perform comparative experiments against closed-source LLMs with substantial parameters.
| Dataset Name | Link |
|---|---|
| Anime Dataset 2023 | https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset |
| MovieLens 1M/32M | https://grouplens.org/datasets/movielens/ |
| Amazon Reviews 2023 | https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main |
| MicroLens | https://github.com/westlake-repl/MicroLens |
| PixelRec | https://github.com/westlake-repl/PixelRec |
| BookCrossing | https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset |
| Amazon Books Reviews | https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv |
| MIND | https://msnews.github.io/ |
| Steam | https://github.com/kang205/SASRec?tab=readme-ov-file |
| Yelp | https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset |
| Last.FM 360K | http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html |
| Last.FM 1K | https://yann.lecun.com/exdb/mnist/ |
We use LLaMA-Factory for model fine-tuning. Below is an example of fine-tuning GLM-4:
CUDA_VISIBLE_DEVICES=0 python src/train.py \
--model_name_or_path /root/shared-nvme/models/glm-4-9b/ZhipuAI/glm-4-9b \
--trust_remote_code \
--stage sft \
--do_train \
--dataset train \
--template glm4 \
--finetuning_type lora \
--output_dir saves/glm-4-9b/sft \
--overwrite_cache \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 100 \
--save_steps 500 \
--learning_rate 1e-4 \
--num_train_epochs 2.0 \
--plot_loss \
--bf16We use vLLM for deployment and inference. Below is an example:
CUDA_VISIBLE_DEVICES=0 vllm serve /root/ITDR-GLM-4-9B \
--host 0.0.0.0 \
--port 8098 \
--max-model-len 8192 \
--tensor-parallel-size 1 \
--disable-log-requests \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--trust-remote-code