[WIP] Manipulation of Q8_0 tensors with Tornado ByteArrays
#79
Conversation
Commits:
- …n in TornadoVM acceleration.
- …el loaders for consistent tensor loading. (# Conflicts: src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java)
- …ray.fromSegmentShallow`
- …0Byte` kernels for Q8_0 matrix-vector computations
- …trix-vector computations
- …thSiLUAndGLUActivationQ8_0Byte` kernels for byte-based Q8_0 computations
- … compute kernels
Pull request overview
This work-in-progress PR refactors Q8_0 quantized tensor handling to use Tornado's ByteArray type instead of separate arrays for quantized values and scales. The new approach stores Q8_0 blocks (2-byte FP16 scale + 32-byte quantized values) contiguously in ByteArrays, with new kernels that dequantize on the fly during computation. The changes currently work for Llama models; other models are still under development.
Key Changes:
- New Q8_0 kernel implementations using ByteArray format with inline dequantization (a block-layout sketch follows this list)
- Addition of `modelType()` to the Configuration interface to distinguish FP16 vs Q8_0 models
- New activation conversion layer supporting FP16-to-FP32 and Q8_0-to-FP32 transformations
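For illustration, here is a minimal plain-Java sketch of the block layout described above; the method name and standalone form are hypothetical (the PR's actual kernels operate on Tornado ByteArrays), and it assumes little-endian FP16 scales and Java 20+'s `Float.float16ToFloat`:

```java
// Hypothetical sketch: each Q8_0 block is 34 bytes,
// a 2-byte FP16 scale followed by 32 signed int8 quantized values.
static void dequantizeQ8_0Block(byte[] data, int blockIndex, float[] out, int outOffset) {
    int base = blockIndex * 34; // Q8_0 block stride in bytes
    // Read the FP16 scale (little-endian) and widen it to float.
    short scaleBits = (short) ((data[base] & 0xFF) | ((data[base + 1] & 0xFF) << 8));
    float scale = Float.float16ToFloat(scaleBits); // Java 20+
    // Dequantize the 32 int8 values on the fly.
    for (int i = 0; i < 32; i++) {
        out[outOffset + i] = data[base + 2 + i] * scale;
    }
}
```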
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 32 comments.
| File | Description |
|---|---|
| TransformerComputeKernelsLayered.java | Adds new Q8_0Byte kernel variants for matrix operations with inline dequantization |
| TransformerComputeKernels.java | Implements conversion kernels for FP16 and Q8_0 to FP32 format |
| Q8_0TornadoTensor.java | Adds ByteArray constructor and factory method; removes old unpacking methods |
| TornadoTensor.java | Adds asByteArray() method for Q8_0 tensor access |
| Configuration.java + implementations | Adds modelType() method to distinguish FP16 vs Q8_0 models |
| AbstractModelLoader.java | Implements readModelType() to map GGUF file types to model type strings |
| ModelLoader.java | Simplifies tensor loading by removing FP32 conversion helper |
| State.java + implementations | Adds embeddingX field and buffer allocation methods for quantized embeddings |
| Activation.java | Refactors to perform format conversion based on model type |
| InferenceCore.java | Updates token embedding copying to handle FP16 and Q8_0 formats |
| Various FFN layer files | Updates to use new ByteArray-based kernel APIs |
| LogitsQ8_0Layer.java | Updates to use new ByteArray-based kernel API |
| Various loader files | Removes loadTornadoTensorAsFP32 usage in favor of unified loading |
Comments suppressed due to low confidence (1)

src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java:49

The method `getSize()` returns `size`, which will be `-1` if the tensor was created using the new `Q8_0TornadoTensor(ByteArray)` constructor. This will cause incorrect behavior for any code calling this method. The size should be calculated from the ByteArray if `tornadoNativeArray` is not null.

```java
public int getSize() {
    return size;
}
```
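A minimal sketch of the fix this comment suggests, assuming `ByteArray.getSize()` reports the byte count and that each 34-byte block encodes 32 elements:

```java
public int getSize() {
    if (size == -1 && tornadoNativeArray != null) {
        // Derive the element count from the raw Q8_0 bytes:
        // 34 bytes per block, 32 quantized elements per block.
        int numBlocks = tornadoNativeArray.getSize() / 34;
        return numBlocks * 32;
    }
    return size;
}
```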
```java
public Q8_0TornadoTensor(ByteArray byteArray) {
    this.size = -1;
    this.scales = null;
    this.quants = null;
    this.segment = null;
    this.tornadoNativeArray = byteArray;
}
```

Copilot AI · Dec 5, 2025
This new constructor lacks documentation. Consider adding a JavaDoc comment explaining when this constructor should be used and the expected format of the ByteArray (Q8_0 format with 2-byte scales + 32-byte quants per block).
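One possible wording, as a sketch rather than the PR's actual text:

```java
/**
 * Creates a Q8_0 tensor backed directly by a Tornado ByteArray holding raw
 * Q8_0 blocks, each a 2-byte FP16 scale followed by 32 quantized int8 values.
 * Intended for the byte-based kernels that dequantize on the fly; the
 * separate scales/quants views and the memory segment are left unset.
 */
public Q8_0TornadoTensor(ByteArray byteArray) {
    this.size = -1;
    this.scales = null;
    this.quants = null;
    this.segment = null;
    this.tornadoNativeArray = byteArray;
}
```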
```java
switch (config.modelType()) {
    case "FP16" -> fields.createActivationFP16(config.dim());
    case "Q8_0" -> fields.createActivationQ8_0(config.dim());
    default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
```

Copilot AI · Dec 5, 2025
The error message is incomplete. It should include the actual modelType value to help with debugging. Consider: "Unsupported quantization format: " + config.modelType()
Suggested change:
```diff
- default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
+ default -> throw new UnsupportedOperationException("Unsupported quantization format: " + config.modelType());
```
```java
public static void matrixVectorGenericWithResidualQ8_0Byte(KernelContext context, FloatArray x, FloatArray hb, ByteArray w, int n, int d, int localWorkGroupSize) {
    // One row per workgroup (not per thread)
    int rowId = context.groupIdx;
    int localId = context.localIdx;
    int localSize = localWorkGroupSize;

    // Early exit if this workgroup is beyond our output dimension
    if (rowId >= d) {
        return;
    }

    float sum = matrixVectorRowMajorOptimizedQ8_0Byte(context, localSize, x, w, n);

    // Thread 0 in each workgroup writes the final result
    if (localId == 0) {
        float result = hb.get(rowId) + sum;
        hb.set(rowId, result);
    }
}
```

Copilot AI · Dec 5, 2025
This new public method lacks documentation. Consider adding a JavaDoc comment explaining its purpose, how it differs from the previous implementation, and the residual connection behavior.
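A sketch of what such a JavaDoc might say, based on the code above (the wording is a suggestion, not the PR's text):

```java
/**
 * Q8_0Byte variant of the generic matrix-vector kernel with a residual add.
 * Each workgroup computes one output row of the d-by-n matrix {@code w},
 * dequantizing the row's Q8_0 blocks on the fly; thread 0 then adds the
 * row's dot product with {@code x} to the existing value in {@code hb},
 * which implements the residual connection.
 */
public static void matrixVectorGenericWithResidualQ8_0Byte(KernelContext context, FloatArray x,
        FloatArray hb, ByteArray w, int n, int d, int localWorkGroupSize) {
    // ... body as shown above ...
}
```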
```java
// Thread 0 in each workgroup writes the final result
if (localId == 0) {
    float silu = siluActivation(sum1); // Using the new SiLU method
```

Copilot AI · Dec 5, 2025
The comment "Using the new SiLU method" appears to be copy-pasted from the original method and is now misleading. Since this is duplicated in a new Q8_0Byte variant, this comment doesn't add value and could be removed or updated to reflect the Q8_0Byte-specific context.
Suggested change:
```diff
- float silu = siluActivation(sum1); // Using the new SiLU method
+ float silu = siluActivation(sum1);
```
```java
@Override public String modelType() {
    return type;
}
```

Copilot AI · Dec 5, 2025
Inconsistent formatting compared to other Configuration classes. In other files (Qwen2Configuration, MistralConfiguration), the @Override annotation is on the same line as public String modelType(). Consider using consistent formatting across all configuration classes.
```java
public static void convertFP16toFP32(KernelContext context, HalfFloatArray x, FloatArray wrapX) {
    int i = context.globalIdx;
    wrapX.set(i, x.get(i).getFloat32());
}
```

Copilot AI · Dec 5, 2025
This new public method lacks documentation. Consider adding a JavaDoc comment explaining its purpose, parameters, and that it converts FP16 (half precision) to FP32 (single precision) format for GPU processing.
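A hedged sketch of the suggested documentation (parameter descriptions inferred from the code above):

```java
/**
 * Converts FP16 (half precision) activations to FP32 (single precision) for
 * GPU processing: each global thread reads one half-float from {@code x}
 * and writes its float32 value at the same index of {@code wrapX}.
 *
 * @param context kernel context supplying the global thread index
 * @param x       input activations in FP16
 * @param wrapX   output buffer receiving the FP32 values
 */
public static void convertFP16toFP32(KernelContext context, HalfFloatArray x, FloatArray wrapX) {
    int i = context.globalIdx;
    wrapX.set(i, x.get(i).getFloat32());
}
```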
```java
int blockSize = 32;
int Q8_0_BLOCK_BYTES = 34; // 2 bytes scale + 32 bytes quants
```

Copilot AI · Dec 5, 2025
Magic numbers 32 and 34 are used directly. Consider extracting these as named constants (e.g., Q8_0_BLOCK_SIZE and Q8_0_BLOCK_BYTES) at the class level to improve maintainability and make the code more self-documenting.
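For instance, the extracted class-level constants could read (names as the comment proposes):

```java
private static final int Q8_0_BLOCK_SIZE = 32;   // quantized elements per block
private static final int Q8_0_BLOCK_BYTES = 34;  // 2-byte FP16 scale + 32 quant bytes
```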
```java
@Override
public ByteArray asByteArray() {
    return tornadoNativeArray;
}
```

Copilot AI · Dec 5, 2025
The method asByteArray() returns tornadoNativeArray which will be null if the tensor was created using the old constructor Q8_0TornadoTensor(int, HalfFloatArray, Int8Array, MemorySegment). This will cause a NullPointerException when the new Q8_0 byte-based kernels try to use tensors created the old way. Since line 32 sets it to null, this is a critical compatibility issue.
Suggested change:
```diff
- return tornadoNativeArray;
+ if (tornadoNativeArray != null) {
+     return tornadoNativeArray;
+ } else if (segment != null) {
+     return ByteArray.fromSegmentShallow(segment);
+ } else {
+     throw new IllegalStateException("No ByteArray or MemorySegment available for this tensor");
+ }
```
```java
final TornadoWeights weights = (TornadoWeights) model.weights();

MemorySegment.copy(weights.getTokenEmbeddingTable().asFloatArray().getSegment(), (long) token * configuration.dim() * Float.BYTES, state.wrapX.getSegment(), 0, configuration.dim() * Float.BYTES);
MemorySegment.copy(weights.getTokenEmbeddingTable().asHalfFloatArray().getSegment(), (long) token * configuration.dim() * Short.BYTES, state.embeddingX.getSegment(), 0, configuration.dim() * Short.BYTES);
```

Copilot AI · Dec 5, 2025
The first line (586) appears to be dead code that should be removed. It copies to embeddingX using asHalfFloatArray(), but then immediately the switch statement below overwrites this with different logic based on the weight type. This suggests line 586 is leftover from refactoring.
Suggested change:
```diff
- MemorySegment.copy(weights.getTokenEmbeddingTable().asHalfFloatArray().getSegment(), (long) token * configuration.dim() * Short.BYTES, state.embeddingX.getSegment(), 0, configuration.dim() * Short.BYTES);
```
/rerun all

🚀 Workflow rerun started

✅ Workflow rerun success
… associated usages.

Currently working for Llama; other models are WIP.