
SoundCLIP: Can Sound Replace Vision in LLaVA With Token Substitution?

[Code is coming soon; stay tuned.]

🌐 Live Demo: https://ali-vosoughi.github.io/SoundCLIP/

📄 Paper: Can Sound Replace Vision in LLaVA With Token Substitution? (arXiv)

📊 Dataset: AVE-2 on HuggingFace

Project Overview

This is the official project webpage for "Can Sound Replace Vision in LLaVA With Token Substitution?". It features an interactive demonstration of our SoundCLIP framework and of the fundamental trade-off between cross-modal retrieval and text generation.

Authors

  • Ali Vosoughi - University of Rochester (Website)
  • Jing Bi - University of Rochester (Website)
  • Pinxin Liu - University of Rochester (Website)
  • Yunlong Tang - University of Rochester (Website)
  • Chenliang Xu - University of Rochester (Website)


Key Contributions

1. AVE-2 Dataset

  • 570,138 audio-visual clips, each with 5-dimensional alignment annotations
  • Now available on HuggingFace with comprehensive documentation and usage examples
  • Systematic scoring across five dimensions: Temporal Alignment, Spatial Coherence, Contextual Relevance, Physical Causality, and Sound Source Visibility
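The five annotation dimensions above can be modeled as a small record type. This is an illustrative sketch only: the field names, value types, and the aggregate helper are our assumptions, not the dataset's actual HuggingFace schema.

```python
from dataclasses import dataclass, fields

@dataclass
class AlignmentScores:
    """One AVE-2 clip's five alignment annotations (illustrative schema)."""
    temporal_alignment: float
    spatial_coherence: float
    contextual_relevance: float
    physical_causality: float
    sound_source_visibility: float

    def mean(self) -> float:
        # Average across the five dimensions as a single alignment summary.
        return sum(getattr(self, f.name) for f in fields(self)) / 5
```

A downstream filter could, for instance, keep only clips whose mean alignment exceeds a threshold before training.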

2. SoundCLIP Framework

  • Token substitution approach: Replace CLIP's [CLS] token with audio tokens in LLaVA
  • Two alignment strategies:
    • Projected: MLP projection to CLIP space (maximizes I(A;V), better retrieval)
    • Raw: Padded audio features (preserves H(A|V), better generation)
  • Lightweight integration: Only 1.9M parameters for projection layer
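The token-substitution idea and the two alignment strategies above can be sketched in a few lines. This is a minimal NumPy illustration, not the released implementation: the dimensions, weight initialization, and function names are assumptions (the paper reports only that the projection layer has ~1.9M parameters).

```python
import numpy as np

# Hypothetical dimensions; the actual sizes in SoundCLIP may differ.
AUDIO_DIM, HIDDEN_DIM, CLIP_DIM = 512, 1024, 1024

rng = np.random.default_rng(0)
W1 = rng.normal(size=(HIDDEN_DIM, AUDIO_DIM)) * 0.02
W2 = rng.normal(size=(CLIP_DIM, HIDDEN_DIM)) * 0.02

def project_audio(a: np.ndarray) -> np.ndarray:
    """'Projected' strategy: a small MLP maps audio features into CLIP space."""
    return W2 @ np.maximum(W1 @ a, 0.0)

def pad_audio(a: np.ndarray) -> np.ndarray:
    """'Raw' strategy: zero-pad audio features to the CLIP token width."""
    out = np.zeros(CLIP_DIM)
    out[: a.shape[0]] = a
    return out

def substitute_cls(visual_tokens: np.ndarray, audio_token: np.ndarray) -> np.ndarray:
    """Replace row 0 (CLIP's [CLS] token) with the audio token; patch tokens stay."""
    out = visual_tokens.copy()
    out[0] = audio_token
    return out
```

Note that with these (assumed) dimensions the projection holds roughly 512×1024 + 1024×1024 ≈ 1.6M weights, on the same order as the 1.9M parameters cited above.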

3. Fundamental Trade-off Discovery

  • Retrieval vs. generation: an empirical linear relationship, y = 0.163x + 11.867
  • Each percentage-point gain in retrieval incurs a ~0.163-point loss in generation quality
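The fitted line above can be evaluated directly. A minimal sketch; the function name and the reading of the units (percentage points on both axes) are our interpretation of the README, not code from the paper.

```python
def predicted_generation_drop(retrieval_gain: float,
                              slope: float = 0.163,
                              intercept: float = 11.867) -> float:
    """Evaluate the fitted trade-off line y = 0.163x + 11.867.

    retrieval_gain: improvement in retrieval, in percentage points (x).
    Returns the predicted generation-quality drop (y) under this fit.
    """
    return slope * retrieval_gain + intercept
```

For example, a 10-point retrieval gain maps to 0.163 × 10 + 11.867 = 13.497 on the fitted line.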

Citation

If you use SoundCLIP or the AVE-2 dataset in your research, please cite our paper:

@article{vosoughi2025soundclip,
  title={Can Sound Replace Vision in LLaVA With Token Substitution?},
  author={Vosoughi, Ali and Bi, Jing and Liu, Pinxin and Tang, Yunlong and Xu, Chenliang},
  journal={ArXiv},
  year={2025}
}

About

Audio-Visual Event Evaluation (AVE-2) dataset
