Conversation

@orionpapadakis (Collaborator)

Currently working for Llama; other models are WIP.

mikepapadim and others added 25 commits December 4, 2025 19:44
# Conflicts:
#	src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java
…0Byte` kernels for Q8_0 matrix-vector computations
…thSiLUAndGLUActivationQ8_0Byte` kernels for byte-based Q8_0 computations
Copilot finished reviewing on behalf of mikepapadim December 5, 2025 12:47
Copilot AI (Contributor) left a comment

Pull request overview

This work-in-progress PR refactors Q8_0 quantized tensor handling to use Tornado's ByteArray type instead of separate arrays for quantized values and scales. The new approach stores Q8_0 blocks (2-byte FP16 scale + 32-byte quantized values) contiguously in ByteArrays, with new kernels that dequantize on-the-fly during computation. The changes are currently functional for Llama models, with other models still under development.

Key Changes:

  • New Q8_0 kernel implementations using ByteArray format with inline dequantization
  • Addition of modelType() to Configuration interface to distinguish FP16 vs Q8_0 models
  • New activation conversion layer supporting FP16-to-FP32 and Q8_0-to-FP32 transformations
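
To make the layout concrete, here is a minimal plain-Java sketch of the on-the-fly dequantization this format enables. It is illustrative only, not the PR's kernel code: the real kernels operate on Tornado's ByteArray inside KernelContext methods, and the helper name here is hypothetical. It assumes the 34-byte block layout described above (2-byte little-endian FP16 scale followed by 32 signed int8 values) and Java 20+ for Float.float16ToFloat:

    // Sketch: dequantize one logical element from a contiguous Q8_0 byte buffer.
    static float dequantizeQ8_0(byte[] data, int elementIndex) {
        final int BLOCK_SIZE = 32;   // quantized values per block
        final int BLOCK_BYTES = 34;  // 2-byte FP16 scale + 32-byte quants
        int block = elementIndex / BLOCK_SIZE;
        int base = block * BLOCK_BYTES;
        // Read the little-endian FP16 scale and widen it to FP32 (Java 20+).
        short scaleBits = (short) ((data[base] & 0xFF) | (data[base + 1] << 8));
        float scale = Float.float16ToFloat(scaleBits);
        // Apply the block scale to the signed 8-bit quantized value.
        byte q = data[base + 2 + (elementIndex % BLOCK_SIZE)];
        return q * scale;
    }

Storing the scale and quants together in one buffer keeps each block contiguous in memory, which is what lets the new kernels dequantize inline instead of gathering from two separate arrays.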

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 32 comments.

Summary per file:

  • TransformerComputeKernelsLayered.java: Adds new Q8_0Byte kernel variants for matrix operations with inline dequantization
  • TransformerComputeKernels.java: Implements conversion kernels for FP16 and Q8_0 to FP32 format
  • Q8_0TornadoTensor.java: Adds ByteArray constructor and factory method; removes old unpacking methods
  • TornadoTensor.java: Adds asByteArray() method for Q8_0 tensor access
  • Configuration.java + implementations: Adds modelType() method to distinguish FP16 vs Q8_0 models
  • AbstractModelLoader.java: Implements readModelType() to map GGUF file types to model type strings
  • ModelLoader.java: Simplifies tensor loading by removing FP32 conversion helper
  • State.java + implementations: Adds embeddingX field and buffer allocation methods for quantized embeddings
  • Activation.java: Refactors to perform format conversion based on model type
  • InferenceCore.java: Updates token embedding copying to handle FP16 and Q8_0 formats
  • Various FFN layer files: Updates to use new ByteArray-based kernel APIs
  • LogitsQ8_0Layer.java: Updates to use new ByteArray-based kernel API
  • Various loader files: Removes loadTornadoTensorAsFP32 usage in favor of unified loading
Comments suppressed due to low confidence (1)

src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java:49

  • The method getSize() returns size, which will be -1 if the tensor was created using the new Q8_0TornadoTensor(ByteArray) constructor. This will cause incorrect behavior for any code calling this method. The size should be calculated from the ByteArray if tornadoNativeArray is not null.
    public int getSize() {
        return size;
    }
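
A possible fix along the lines the comment suggests, sketched under the assumption that ByteArray.getSize() returns the length in bytes and that the 34-byte block layout above holds:

    public int getSize() {
        if (tornadoNativeArray != null) {
            // Derive the logical element count from the raw byte length:
            // 34 bytes per block, 32 quantized values per block.
            return (tornadoNativeArray.getSize() / 34) * 32;
        }
        return size;
    }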

Comment on lines 35 to 41
public Q8_0TornadoTensor(ByteArray byteArray) {
    this.size = -1;
    this.scales = null;
    this.quants = null;
    this.segment = null;
    this.tornadoNativeArray = byteArray;
}
Copilot AI Dec 5, 2025

This new constructor lacks documentation. Consider adding a JavaDoc comment explaining when this constructor should be used and the expected format of the ByteArray (Q8_0 format with 2-byte scales + 32-byte quants per block).
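
For example, a JavaDoc along these lines (a sketch of what the comment asks for; the wording is illustrative):

    /**
     * Creates a Q8_0 tensor backed by a single contiguous {@link ByteArray}.
     * The array must hold raw Q8_0 blocks: a 2-byte FP16 scale followed by
     * 32 signed 8-bit quantized values per block (34 bytes total). Intended
     * for the byte-based kernels that dequantize on the fly; the legacy
     * scales/quants fields are left null and size is unset (-1).
     */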

switch (config.modelType()) {
    case "FP16" -> fields.createActivationFP16(config.dim());
    case "Q8_0" -> fields.createActivationQ8_0(config.dim());
    default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
Copilot AI Dec 5, 2025

The error message is incomplete: it names the format but never states that it is unsupported. Consider: "Unsupported quantization format: " + config.modelType()

Suggested change
default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
default -> throw new UnsupportedOperationException("Unsupported quantization format: " + config.modelType());

Comment on lines +1131 to +1149
public static void matrixVectorGenericWithResidualQ8_0Byte(KernelContext context, FloatArray x, FloatArray hb, ByteArray w, int n, int d, int localWorkGroupSize) {
    // One row per workgroup (not per thread)
    int rowId = context.groupIdx;
    int localId = context.localIdx;
    int localSize = localWorkGroupSize;

    // Early exit if this workgroup is beyond our output dimension
    if (rowId >= d) {
        return;
    }

    float sum = matrixVectorRowMajorOptimizedQ8_0Byte(context, localSize, x, w, n);

    // Thread 0 in each workgroup writes the final result
    if (localId == 0) {
        float result = hb.get(rowId) + sum;
        hb.set(rowId, result);
    }
}
Copilot AI Dec 5, 2025

This new public method lacks documentation. Consider adding a JavaDoc comment explaining its purpose, how it differs from the previous implementation, and the residual connection behavior.
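
Something along these lines would address it (a sketch; the wording is illustrative, based on the code shown above):

    /**
     * Computes one row of a Q8_0-quantized matrix-vector product and adds it
     * into {@code hb} as a residual connection: hb[row] += dot(w[row], x).
     * Unlike the HalfFloatArray variant, the weights arrive as a raw Q8_0
     * ByteArray (2-byte FP16 scale + 32 quants per block) and are dequantized
     * on the fly. One workgroup handles one output row; thread 0 writes back.
     */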

// Thread 0 in each workgroup writes the final result
if (localId == 0) {
    float silu = siluActivation(sum1); // Using the new SiLU method
Copilot AI Dec 5, 2025

The comment "Using the new SiLU method" appears to be copy-pasted from the original method and is now misleading. Since this is duplicated in a new Q8_0Byte variant, this comment doesn't add value and could be removed or updated to reflect the Q8_0Byte-specific context.

Suggested change
float silu = siluActivation(sum1); // Using the new SiLU method
float silu = siluActivation(sum1);

switch (config.modelType()) {
    case "FP16" -> fields.createActivationFP16(config.dim());
    case "Q8_0" -> fields.createActivationQ8_0(config.dim());
    default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
Copilot AI Dec 5, 2025

The error message is incomplete: it names the format but never states that it is unsupported. Consider: "Unsupported quantization format: " + config.modelType()

Suggested change
default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
default -> throw new UnsupportedOperationException("Unsupported quantization format: " + config.modelType());

Comment on lines 17 to 19
@Override public String modelType() {
    return type;
}
Copilot AI Dec 5, 2025

Inconsistent formatting compared to other Configuration classes. In other files (Qwen2Configuration, MistralConfiguration), the @Override annotation is on the same line as public String modelType(). Consider using consistent formatting across all configuration classes.

Comment on lines +25 to +28
public static void convertFP16toFP32(KernelContext context, HalfFloatArray x, FloatArray wrapX) {
    int i = context.globalIdx;
    wrapX.set(i, x.get(i).getFloat32());
}
Copilot AI Dec 5, 2025

This new public method lacks documentation. Consider adding a JavaDoc comment explaining its purpose, parameters, and that it converts FP16 (half precision) to FP32 (single precision) format for GPU processing.
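
For instance (a sketch of the requested JavaDoc; wording is illustrative):

    /**
     * Converts FP16 (half precision) values to FP32 (single precision).
     * Launched with one thread per element: thread i reads x[i] and writes
     * the widened value to wrapX[i], producing the FP32 working buffer that
     * the downstream kernels expect.
     */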

Comment on lines +39 to +40
int blockSize = 32;
int Q8_0_BLOCK_BYTES = 34; // 2 bytes scale + 32 bytes quants
Copilot AI Dec 5, 2025

Magic numbers 32 and 34 are used directly. Consider extracting these as named constants (e.g., Q8_0_BLOCK_SIZE and Q8_0_BLOCK_BYTES) at the class level to improve maintainability and make the code more self-documenting.
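
For example (a sketch; the names follow the comment's suggestion, the visibility is illustrative):

    /** Number of quantized values per Q8_0 block. */
    private static final int Q8_0_BLOCK_SIZE = 32;

    /** Bytes per Q8_0 block: 2-byte FP16 scale + 32 quantized values. */
    private static final int Q8_0_BLOCK_BYTES = Q8_0_BLOCK_SIZE + 2;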

@Override
public ByteArray asByteArray() {
    return tornadoNativeArray;
Copilot AI Dec 5, 2025

The method asByteArray() returns tornadoNativeArray which will be null if the tensor was created using the old constructor Q8_0TornadoTensor(int, HalfFloatArray, Int8Array, MemorySegment). This will cause a NullPointerException when the new Q8_0 byte-based kernels try to use tensors created the old way. Since line 32 sets it to null, this is a critical compatibility issue.

Suggested change
return tornadoNativeArray;
if (tornadoNativeArray != null) {
    return tornadoNativeArray;
} else if (segment != null) {
    return ByteArray.fromSegmentShallow(segment);
} else {
    throw new IllegalStateException("No ByteArray or MemorySegment available for this tensor");
}

final TornadoWeights weights = (TornadoWeights) model.weights();

MemorySegment.copy(weights.getTokenEmbeddingTable().asFloatArray().getSegment(), (long) token * configuration.dim() * Float.BYTES, state.wrapX.getSegment(), 0, configuration.dim() * Float.BYTES);
MemorySegment.copy(weights.getTokenEmbeddingTable().asHalfFloatArray().getSegment(), (long) token * configuration.dim() * Short.BYTES, state.embeddingX.getSegment(), 0, configuration.dim() * Short.BYTES);
Copilot AI Dec 5, 2025

Line 586 (the copy into embeddingX via asHalfFloatArray()) appears to be dead code that should be removed: the switch statement immediately below overwrites it with different logic based on the weight type, which suggests it is leftover from refactoring.

Suggested change
MemorySegment.copy(weights.getTokenEmbeddingTable().asHalfFloatArray().getSegment(), (long) token * configuration.dim() * Short.BYTES, state.embeddingX.getSegment(), 0, configuration.dim() * Short.BYTES);

@mikepapadim (Member)

/rerun all

github-actions bot commented Dec 5, 2025

🚀 Workflow rerun started

Mode: all
Triggered by: @mikepapadim

View Actions

github-actions bot commented Dec 5, 2025

Workflow rerun success

View Actions
