
HiMoE-VLA:
Hierarchical Mixture-of-Experts for Generalist Vision–Language–Action Policies

arXiv Project Page

HiMoE-VLA is a new vision–language–action (VLA) framework built to effectively handle the pronounced heterogeneity in modern large-scale robotic datasets. Existing VLA models struggle with the diversity of embodiments, action spaces, sensor setups, and control frequencies found in robotic demonstrations. HiMoE-VLA introduces a Hierarchical Mixture-of-Experts (HiMoE) action module that progressively abstracts away these differences across layers, enabling unified learning of shared robot behaviors. Across both simulated and real-world platforms, HiMoE-VLA consistently outperforms prior VLA baselines and exhibits stronger generalization to new robots and action spaces.

✅ To-Do List

  • Release the evaluation code
  • Release the dataset
  • Release the base and fine-tuned model checkpoints
  • Release the fine-tuning code
  • Release the multi-dataset sampler and pre-training code

🔑 Installation

When cloning this repository, remember to initialize the submodules:

git clone --recurse-submodules git@github.com:ZhiyingDu/HiMoE-VLA.git

# If you've already cloned the project, you can fetch the submodules with:
git submodule update --init --recursive

First, install uv using the following command:

wget -qO- https://astral.sh/uv/install.sh | sh

Once uv is installed, create the environment and install all dependencies:

GIT_LFS_SKIP_SMUDGE=1 uv sync

After the environment has been created, replace the relevant packages with our modified versions:

cp -r third_party/lerobot .venv/lib/python3.11/site-packages/
cp third_party/modeling_gemma.py .venv/lib/python3.11/site-packages/transformers/models/gemma

🤖 Model Checkpoints

We provide the following pretrained models:

| Model | Description | Download |
| --- | --- | --- |
| Base model | Pretrained on OXE and ALOHA | Download |
| Calvin D | Finetuned on Calvin D Joint Angle | Download |
| Libero 10 | Finetuned on Libero 10 | Download |
| Libero Goal | Finetuned on Libero Goal | Download |
| Libero Object | Finetuned on Libero Object | Download |
| Libero Spatial | Finetuned on Libero Spatial | Download |

🏋️‍♂️ Training

Preparing data

We use the LeRobot dataset format, so you should convert your own data into the LeRobot format. We provide example scripts for reference, such as examples/calvin/convert_calvin_data_to_lerobot_joint.py. You can modify it to convert your own data and run the script with:

uv run examples/calvin/convert_calvin_data_to_lerobot_joint.py --data_dir /path/to/your/calvin_d/data
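Conceptually, a conversion script packs each timestep of an episode into a flat per-frame dict. The sketch below illustrates that structure only; the feature names, shapes, and camera keys here are hypothetical and must match whatever your actual conversion script and robot setup define.

```python
import numpy as np

# Hypothetical episode-packing skeleton; the real field names and shapes
# must match your convert_*_to_lerobot script and robot configuration.
def episode_to_frames(images, joint_states, actions, task):
    """Pack raw per-step arrays into LeRobot-style per-frame dicts."""
    frames = []
    for img, state, action in zip(images, joint_states, actions):
        frames.append({
            "observation.images.top": img,  # HxWx3 uint8 camera frame
            "observation.state": state,     # joint angles at this step
            "action": action,               # commanded action at this step
            "task": task,                   # language instruction
        })
    return frames

episode = episode_to_frames(
    images=[np.zeros((224, 224, 3), dtype=np.uint8)] * 2,
    joint_states=[np.zeros(7, dtype=np.float32)] * 2,
    actions=[np.zeros(7, dtype=np.float32)] * 2,
    task="push the blue block",
)
print(len(episode))  # 2
```

The actual LeRobot writer APIs then serialize such frames episode by episode; consult the example script for the exact calls.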

Defining your own training config

Here, we use calvin_d_joint as an example. You need to update the following components:

Note: The data directory is resolved as os.path.join(assets_base_dir, repo_id)
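To make the note above concrete, here is how the dataset directory is resolved; the paths and repo_id below are placeholders for illustration, not values from the repository.

```python
import os

# Placeholder values; substitute the paths from your own training config.
assets_base_dir = "/path/to/assets"
repo_id = "your_name/calvin_d_joint"

# The dataset directory is the join of the assets base dir and the repo id.
data_dir = os.path.join(assets_base_dir, repo_id)
print(data_dir)  # /path/to/assets/your_name/calvin_d_joint
```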

After completing the steps above, you need to compute normalization statistics for your own data. Run the script below with the name of your training config:

uv run scripts/compute_norm_stats.py --config-name calvin_d_joint

Note: The dataset_mixture of calvin_d_joint must contain only one dataset.
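For intuition, normalization statistics are typically per-dimension summaries (such as mean and standard deviation) computed over the dataset's states and actions, which the policy uses to normalize inputs and outputs. The sketch below shows the idea on a toy action array; the actual script may compute different statistics (e.g., quantiles).

```python
import numpy as np

# Toy stand-in for a dataset's actions: shape (num_steps, action_dim).
actions = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], dtype=np.float32)

# Per-dimension mean and standard deviation over the whole dataset.
mean = actions.mean(axis=0)
std = actions.std(axis=0) + 1e-6  # epsilon avoids division by zero

# Normalized actions have roughly zero mean and unit variance per dimension.
normalized = (actions - mean) / std
print(mean)  # [2. 3.]
```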

Training

Now, you can run training using the command below:

accelerate launch scripts/train.py --deepspeed=src/moevla/training/zero2.json --config=calvin_d_joint --exp-name=calvin_d_joint

Note: If you want to use wandb, please update the wandb key on line 7 of train.py.

⚖️ Evaluation

To manage evaluation environments efficiently, we use a server-client setup. First, launch a model server with the command below:

uv run scripts/serve_policy.py --env CALVIN_D_FINETUNE --port 9000

You can then launch a client to query the server. See the CALVIN README for more details.

For real-world deployment, you can run inference directly in Python:

from moevla.policies import policy_config as _policy_config
from moevla.training import config as _config

# specify these parameters
train_config = ""
dataset_config = ""
checkpoint_dir = ""

policy = _policy_config.create_trained_policy(
    _config.get_training_config(train_config),    
    _config.get_dataset_config(dataset_config), 
    checkpoint_dir, 
    default_prompt=None
)

# Run inference on a dummy example.
example = {
    "observation/exterior_image_1_left": ...,
    "observation/wrist_image_left": ...,
    ...
    "prompt": "fold clothes"
}

action_chunk = policy.infer(example)["actions"]

Although the code above can run inference directly, we still recommend the server-client setup for deployment.

🤝 Acknowledgements

We are deeply grateful for the development of openpi, LeRobot and DeepSeekMoE, from which our project draws extensively. We extend our sincere thanks to all contributors to these libraries for their hard work and dedication.

📜 Citation

If you find our work useful in your research, please consider citing our paper:

@article{du2025himoe,
  title={HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies},
  author={Du, Zhiying and Liu, Bei and Liang, Yaobo and Shen, Yichao and Cao, Haidong and Zheng, Xiangyu and Feng, Zhiyuan and Wu, Zuxuan and Yang, Jiaolong and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2512.05693},
  year={2025}
}

About

Official repo for paper "HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies"
