export VLLM_USE_FLASHINFER_SAMPLER=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /your_path/Qwen/Qwen3-32B \
--tensor-parallel-size 4 \
--served-model-name "qwen" \
--gpu-memory-utilization 0.8 \
--host 0.0.0.0 \
--port 8002

The core implementation is in local_search (offline) and web_tools (online).
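Before moving on, it can help to confirm that the model server started by the `vllm serve` command above is reachable. The following is a minimal sketch, assuming the server runs on the same machine and that `curl` is installed; `VLLM_HOST`, `VLLM_PORT`, `models_url`, and `check_server` are hypothetical names used only for illustration and are not part of the repository. It queries the model listing endpoint of vLLM's OpenAI-compatible server.

```shell
#!/usr/bin/env sh
# Assumption: the server launched above listens on port 8002 of this machine.
# VLLM_HOST and VLLM_PORT are hypothetical variables, not defined by the repo.
VLLM_HOST="${VLLM_HOST:-localhost}"
VLLM_PORT="${VLLM_PORT:-8002}"

# Build the URL of the OpenAI-compatible model listing endpoint.
models_url() {
  printf 'http://%s:%s/v1/models\n' "$1" "$2"
}

# Query the endpoint; -f makes curl return non-zero on HTTP errors,
# -s suppresses the progress meter.
check_server() {
  curl -sf "$(models_url "$VLLM_HOST" "$VLLM_PORT")"
}

# Usage (run manually once the server has finished loading the model):
#   check_server || echo "server not reachable yet"
```

If the server is up, the JSON response should list the served model name passed via `--served-model-name` (`qwen` in the command above).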
Navigate to /Inference/Base.
Configure and deploy the Executor:
bash deploy.sh

Configure the run script run.sh:

DATASET: test set path, ending with .json or .jsonl
OUTPUT_PATH: output path
MODEL_PATH: tokenizer model path
TEST_CACHE_DIR: image search cache path (required for multimodal tasks)
SERVICE_URL: search service URL
MAX_LLM_CALL_PER_RUN: maximum number of tool interaction rounds
JUDGE_URL: Judger service URL
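The variables listed above can be sanity-checked before launching a run. The sketch below is one possible way to do that and is not taken from the repository's run.sh: `is_valid_dataset` and `missing_vars` are hypothetical helpers, and how run.sh actually reads these variables is an assumption. It checks the documented `.json` / `.jsonl` suffix for `DATASET` and reports any variables that are unset or empty.

```shell
#!/usr/bin/env sh
# Hypothetical helpers for illustration; not part of the repository.

# Return 0 if the path ends with .json or .jsonl, 1 otherwise.
is_valid_dataset() {
  case "$1" in
    *.json|*.jsonl) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the name of every listed variable that is unset or empty.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Usage (after setting the variables, before calling run.sh):
#   is_valid_dataset "$DATASET" || echo "DATASET must end with .json or .jsonl"
#   missing_vars DATASET OUTPUT_PATH MODEL_PATH SERVICE_URL MAX_LLM_CALL_PER_RUN JUDGE_URL
```

`TEST_CACHE_DIR` is left out of the usage line because it is only required for multimodal tasks; add it to the list when running those.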
In react_agent.py, configure the text search method (choose one):
from tool_search_local import * # offline text search
from tool_serper import * # online text search

Run:
bash run.sh