Intel XPU Differential Gaussian Rasterization

A high-performance implementation of Differential Gaussian Rasterization for Intel Arc GPUs using SYCL/DPC++. This project provides a drop-in replacement for CUDA-based implementations, optimized for Intel XPU architecture.

Overview

This implementation is based on the paper "3D Gaussian Splatting for Real-Time Radiance Field Rendering" and provides:

SYCL/DPC++ kernels optimized for Intel Arc GPUs (tested on B580)
PyTorch integration for seamless use in training and inference pipelines
Forward pass implementation with tile-based rendering optimizations
Production-ready performance for real-time rendering

Hardware Requirements

GPU: Intel Arc GPU (A-series, B-series) with Level Zero support
Drivers: Native Support on Linux Kernel 6.7+. Ensure current Mesa.
Memory: 4GB+ VRAM recommended

Tested Configuration

GPU: Intel Arc B580 (12GB VRAM, 20 Xe-cores)
OS: Ubuntu 25.04 LTS / Arch Linux
Compiler: Intel DPC++ 2025.0.4 (Arch) / PyTorch XPU 2.8.0 (Ubuntu)

Performance Benchmarks

Tested on Intel Arc B580 at 800x800 resolution:

Gaussians	Frame Time	FPS	Notes
100	6.11ms	163.6	Minimal scene
1,000	5.84ms	171.3	Typical object
10,000	5.85ms	170.9	Complex object
50,000	7.02ms	142.4	Dense scene
100,000	11.08ms	90.2	Very dense
150,000	11.57ms	86.4	Stress test

Key Findings:

Consistent ~170 FPS for scenes up to 10K Gaussians
Graceful degradation for complex scenes
Memory bandwidth saturates around 150K Gaussians
Forward-only rendering (backward pass not yet implemented)

Comparison with CPU Implementation

Implementation	1K Gaussians	10K Gaussians	100K Gaussians	150K Gaussians
C CPU (Pure)	4.93 ms	19.12 ms	173.30 ms	256.83 ms
Intel Arc B580	5.84 ms	5.85 ms	11.08 ms	11.57 ms
Speedup	0.84× (≈ same)	3.27×	15.6×	22.2×

Installation

Method 1: Pre-built Binary Wheel (Recommended - Ubuntu/Arch)

Download the pre-built wheel from the Releases page:

# 1. Install PyTorch XPU
pip install torch torchvision --index-url https://download.pytorch.org/whl/xpu

# OR for CPU-only PyTorch
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# 2. Install the pre-built wheel
pip install intel_diff_gaussian_rasterization-0.1.0-cp310-cp310-linux_x86_64.whl

# 3. Set up environment (Ubuntu only)
export LD_LIBRARY_PATH="$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))')"

# 4. Test installation
python -c "import intel_diff_gaussian_rasterization; print('✓ Success')"

Platform Compatibility:

Python: 3.10
OS: Ubuntu 22.04/24.04/25.04, Arch Linux
Architecture: x86_64 only
GPU: Intel Arc (A/B-series) or compatible integrated graphics

Method 2: From Source Tarball (No Compilation Required)

Download the source tarball from Releases:

# 1. Install PyTorch XPU
pip install torch torchvision --index-url https://download.pytorch.org/whl/xpu

# 2. Extract tarball
tar xzf intel_diff_gaussian_rasterization-py3.10-20251107.tar.gz
cd intel_diff_gaussian_rasterization

# 3. Set up environment
export LD_LIBRARY_PATH="$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))')"
export PYTHONPATH="$(pwd):${PYTHONPATH}"

# 4. Test
python -c "import intel_diff_gaussian_rasterization; print('✓ Success')"

Or use the provided helper script:

source setup_env.sh
python your_script.py

Method 3: Build from Source (Arch Linux)

Prerequisites:

Arch Linux
Intel oneAPI DPC++ compiler
PyTorch (CPU or XPU version)

# 1. Install dependencies
sudo pacman -S intel-oneapi-dpcpp-cpp intel-oneapi-compiler-shared-runtime
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# 2. Clone repository
git clone https://github.com/andomeder/intel_diff_gaussian_rasterization.git
cd intel_diff_gaussian_rasterization

# 3. Build
bash build.sh

# This creates:
# - intel_diff_gaussian_rasterization/_C.cpython-310-*.so (compiled extension)
# - intel_diff_gaussian_rasterization-py3.10-*.tar.gz (distributable tarball)
# - dist/*.whl (binary wheel, if you run build_wheel.sh)

Quick Start After Installation

import torch
import intel_diff_gaussian_rasterization as idgr

# Move tensors to Intel GPU
device = torch.device("xpu")  # Intel Arc GPU

# Your Gaussian splatting code here...
rasterizer = idgr.GaussianRasterizer(settings)
image, radii = rasterizer(
    means3D=means.to(device),
    means2D=means2d.to(device),
    # ... other parameters
)

System Requirements

Minimum

OS: Ubuntu 22.04+ or Arch Linux
Python: 3.8-3.13
PyTorch: 2.0.0+
GPU: Intel Arc A380 or newer (4GB VRAM)

Build System Requirements (for building from source)

OS: Arch Linux
Compiler: Intel oneAPI DPC++ 2025.0.4+
CMake: 3.20+
Disk: 5GB free (for Intel oneAPI installation)

Downloading Pre-built Packages

Visit the Releases page to download:

Binary Wheels (.whl): For pip install, no compilation needed
- intel_diff_gaussian_rasterization-0.1.0-cp310-cp310-linux_x86_64.whl (Python 3.10)
Source Tarballs (.tar.gz): Pre-compiled .so file included
- intel_diff_gaussian_rasterization-py3.10-20251107.tar.gz

Choose the package matching your Python version.

Troubleshooting Installation

Issue: "No module named 'intel_diff_gaussian_rasterization'"

Solution: Ensure PYTHONPATH is set if using tarball:

export PYTHONPATH="/path/to/intel_diff_gaussian_rasterization:${PYTHONPATH}"

Issue: "libsycl.so.8: cannot open shared object file"

Solution: Set LD_LIBRARY_PATH to PyTorch's libraries:

export LD_LIBRARY_PATH="$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))')"

Issue: "LIBUR_LOADER version not found"

Solution: Run the above command or Remove conflicting Intel oneAPI installation on Ubuntu(ONLY IF YOU NEED TO):

# remove oneapi to unset vars(you can re-install later)
sudo rm -rf /opt/intel/oneapi
# Then restart your shell

Issue: "pip install" fails with "error: unrecognized command-line option '-fsycl'"

Explanation: Building from source requires the Intel DPC++ compiler, which is only properly set up on Arch Linux in our build system.

Solution: Use a pre-built wheel or tarball from Releases instead of building from source.

Usage

Basic Example

import torch
import intel_diff_gaussian_rasterization as idgr

# Setup rasterization settings
raster_settings = idgr.GaussianRasterizationSettings(
    image_height=800,
    image_width=800,
    tanfovx=0.5,
    tanfovy=0.5,
    bg=torch.tensor([0.0, 0.0, 0.0]).xpu(),
    scale_modifier=1.0,
    viewmatrix=view_matrix.xpu(),
    projmatrix=proj_matrix.xpu(),
    sh_degree=3,
    campos=camera_position.xpu(),
    prefiltered=False,
    debug=False
)

# Create rasterizer
rasterizer = idgr.GaussianRasterizer(raster_settings)

# Rasterize Gaussians
rendered_image, radii = rasterizer(
    means3D=gaussian_means.xpu(),
    means2D=gaussian_means_2d.xpu(),
    opacities=gaussian_opacities.xpu(),
    shs=spherical_harmonics.xpu(),
    scales=gaussian_scales.xpu(),
    rotations=gaussian_rotations.xpu()
)

Drop-in Replacement for CUDA Version

# Instead of:
# from diff_gaussian_rasterization import GaussianRasterizationSettings, GaussianRasterizer

# Use:
from intel_diff_gaussian_rasterization import GaussianRasterizationSettings, GaussianRasterizer

# Replace .cuda() with .xpu()
tensor = tensor.xpu()  # Instead of tensor.cuda()

Helper Scripts

The package includes wrapper scripts for proper environment setup:

# Run Python with correct library paths
./run_with_intel_raster.sh your_script.py

# Or source the environment
source setup_env.sh
python your_script.py

Architecture Details

Build System

This was built using a two-stage build process:

Built on Arch Linux with Intel oneAPI DPC++ compiler 2025.0.4
Deployed on Ubuntu using PyTorch XPU's bundled SYCL runtime

This avoided version conflicts between oneAPI and PyTorch on the deployment side.

Kernel Pipeline

Preprocessing Kernel (parallel per Gaussian)
- Frustum culling
- 3D to 2D projection via perspective transformation
- 3D covariance → 2D covariance projection
- Spherical harmonics to RGB conversion
Binning Phase
- Tile assignment (16x16 pixel tiles)
- Depth-based sorting (currently CPU-based)
- Tile range identification
Rendering Kernel (parallel per pixel)
- Tile-based processing with shared memory
- Alpha blending with early ray termination
- Sub-group optimizations for Intel Arc

Optimizations

Work-group size: 16×16 (256 work-items) for optimal Arc GPU occupancy
Memory coalescing: Aligned memory access patterns
Shared memory: Cooperative loading of Gaussian data per tile
Early termination: Skip pixels when accumulated alpha < 0.0001
RPATH: Binaries link directly to PyTorch's SYCL libraries

Current Limitations

✅ Forward pass fully implemented and tested
❌ Backward pass not implemented (training not supported)
⚠️ Sorting done on CPU (GPU radix sort planned)
⚠️ Device may reset with >150K Gaussians (driver stability)
⚠️ WebGPU export not yet available

Roadmap

Short Term

Optimize tile sorting with GPU-based radix sort
Improve memory management for large scenes
Add batch rendering support

Medium Term

Implement backward pass for training
Multi-GPU support via SYCL device selection
Integrate with Intel Extension for PyTorch (IPEX)

Long Term

WebGPU SPIR-V export for web deployment
Numba-dpex alternative implementation
Extended platform support (AMD via hipSYCL)

More Troubleshooting

GPU Not Detected

# Check SYCL devices
python -c "import torch; print(torch.xpu.is_available()); print(torch.xpu.device_count())"

# Check Level Zero
ls /usr/lib/x86_64-linux-gnu/libze_loader.so*

# Verify drivers
clinfo | grep Intel

Library Version Conflicts

If you see LIBUR_LOADER version errors:

# Remove any system-installed oneAPI (on Ubuntu)
sudo rm -rf /opt/intel/oneapi

# Ensure PyTorch's libraries are first in path
export LD_LIBRARY_PATH="$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))'):$LD_LIBRARY_PATH"

Device Lost Errors

The "Device Lost" error with very large scenes (>150K Gaussians) is a known driver limitation. Workarounds:

Reduce scene complexity
Render in tiles
Update to latest Intel GPU drivers

Citation

@Article{kerbl3Dgaussians,
    author       = {Kerbl, Bernhard and Kopanas, Georgios and Leimk{\"u}hler, Thomas and Drettakis, George},
    title        = {3D Gaussian Splatting for Real-Time Radiance Field Rendering},
    journal      = {ACM Transactions on Graphics},
    number       = {4},
    volume       = {42},
    month        = {July},
    year         = {2023},
    url          = {https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/}
}

License

This project inherits the license from the original Gaussian Splatting implementation. See LICENSE.md for details.

Acknowledgments

Original Gaussian Splatting by INRIA GRAPHDECO research group
C CPU implementation by MrSecant
Intel oneAPI and SYCL teams
PyTorch XPU backend team

Contributing

Contributions welcome! Priority areas:

Backward pass implementation
GPU-based sorting algorithms
Performance profiling and optimization
Documentation and examples

For questions or issues, please open a GitHub issue.

Project Status: Production-ready for inference, research-stage for training.
Maintainer: William Obino
Last Updated: November 2025

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
intel_diff_gaussian_rasterization		intel_diff_gaussian_rasterization
sycl_rasterizer		sycl_rasterizer
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE.md		LICENSE.md
README.md		README.md
build.sh		build.sh
build_wheel.sh		build_wheel.sh
ext.cpp		ext.cpp
pyproject.toml		pyproject.toml
rasterize_points.cpp		rasterize_points.cpp
run_with_torch_libs.sh		run_with_torch_libs.sh
setup.py		setup.py
setup_env.sh		setup_env.sh
test_rasterizer.py		test_rasterizer.py

License

andomeder/intel_diff_gaussian_rasterization

Folders and files

Latest commit

History

Repository files navigation

Intel XPU Differential Gaussian Rasterization

Overview

Hardware Requirements

Tested Configuration

Performance Benchmarks

Comparison with CPU Implementation

Installation

Method 1: Pre-built Binary Wheel (Recommended - Ubuntu/Arch)

Method 2: From Source Tarball (No Compilation Required)

Method 3: Build from Source (Arch Linux)

Quick Start After Installation

System Requirements

Minimum

Recommended

Build System Requirements (for building from source)

Downloading Pre-built Packages

Troubleshooting Installation

Issue: "No module named 'intel_diff_gaussian_rasterization'"

Issue: "libsycl.so.8: cannot open shared object file"

Issue: "LIBUR_LOADER version not found"

Issue: "pip install" fails with "error: unrecognized command-line option '-fsycl'"

Usage

Basic Example

Drop-in Replacement for CUDA Version

Helper Scripts

Architecture Details

Build System

Kernel Pipeline

Optimizations

Current Limitations

Roadmap

Short Term

Medium Term

Long Term

More Troubleshooting

GPU Not Detected

Library Version Conflicts

Device Lost Errors

Citation

License

Acknowledgments

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Languages

Packages