Conversation

@orionpapadakis (Collaborator)

Currently working for Llama; other models are WIP.

mikepapadim and others added 25 commits December 4, 2025 19:44
# Conflicts:
#	src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java
…0Byte` kernels for Q8_0 matrix-vector computations
…thSiLUAndGLUActivationQ8_0Byte` kernels for byte-based Q8_0 computations
Copilot finished reviewing on behalf of mikepapadim December 5, 2025 12:47
Copilot AI (Contributor) left a comment

Pull request overview

This work-in-progress PR refactors Q8_0 quantized tensor handling to use Tornado's ByteArray type instead of separate arrays for quantized values and scales. The new approach stores Q8_0 blocks (2-byte FP16 scale + 32-byte quantized values) contiguously in ByteArrays, with new kernels that dequantize on-the-fly during computation. The changes are currently functional for Llama models, with other models still under development.

Key Changes:

  • New Q8_0 kernel implementations using ByteArray format with inline dequantization
  • Addition of modelType() to Configuration interface to distinguish FP16 vs Q8_0 models
  • New activation conversion layer supporting FP16-to-FP32 and Q8_0-to-FP32 transformations
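
To make the layout concrete, here is a minimal plain-Java sketch of the on-the-fly dequantization this format enables. It is illustrative only, not the PR's kernel code: the real kernels operate on Tornado's ByteArray inside KernelContext methods, and the helper name here is hypothetical. It assumes the 34-byte block layout described above (2-byte little-endian FP16 scale followed by 32 signed int8 values) and Java 20+ for Float.float16ToFloat:

    // Sketch: dequantize one logical element from a contiguous Q8_0 byte buffer.
    static float dequantizeQ8_0(byte[] data, int elementIndex) {
        final int BLOCK_SIZE = 32;   // quantized values per block
        final int BLOCK_BYTES = 34;  // 2-byte FP16 scale + 32-byte quants
        int block = elementIndex / BLOCK_SIZE;
        int base = block * BLOCK_BYTES;
        // Read the little-endian FP16 scale and widen it to FP32 (Java 20+).
        short scaleBits = (short) ((data[base] & 0xFF) | (data[base + 1] << 8));
        float scale = Float.float16ToFloat(scaleBits);
        // Apply the block scale to the signed 8-bit quantized value.
        byte q = data[base + 2 + (elementIndex % BLOCK_SIZE)];
        return q * scale;
    }

Storing the scale and quants together in one buffer keeps each block contiguous in memory, which is what lets the new kernels dequantize inline instead of gathering from two separate arrays.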

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 32 comments.

Summary per file:

  • TransformerComputeKernelsLayered.java: Adds new Q8_0Byte kernel variants for matrix operations with inline dequantization
  • TransformerComputeKernels.java: Implements conversion kernels for FP16 and Q8_0 to FP32 format
  • Q8_0TornadoTensor.java: Adds ByteArray constructor and factory method; removes old unpacking methods
  • TornadoTensor.java: Adds asByteArray() method for Q8_0 tensor access
  • Configuration.java + implementations: Adds modelType() method to distinguish FP16 vs Q8_0 models
  • AbstractModelLoader.java: Implements readModelType() to map GGUF file types to model type strings
  • ModelLoader.java: Simplifies tensor loading by removing FP32 conversion helper
  • State.java + implementations: Adds embeddingX field and buffer allocation methods for quantized embeddings
  • Activation.java: Refactors to perform format conversion based on model type
  • InferenceCore.java: Updates token embedding copying to handle FP16 and Q8_0 formats
  • Various FFN layer files: Updates to use new ByteArray-based kernel APIs
  • LogitsQ8_0Layer.java: Updates to use new ByteArray-based kernel API
  • Various loader files: Removes loadTornadoTensorAsFP32 usage in favor of unified loading
Comments suppressed due to low confidence (1)

src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java:49

  • The method getSize() returns size, which will be -1 if the tensor was created using the new Q8_0TornadoTensor(ByteArray) constructor. This will cause incorrect behavior for any code calling this method. The size should be calculated from the ByteArray if tornadoNativeArray is not null.
    public int getSize() {
        return size;
    }
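
A possible fix along the lines the comment suggests, sketched under the assumption that ByteArray.getSize() returns the length in bytes and that the 34-byte block layout above holds:

    public int getSize() {
        if (tornadoNativeArray != null) {
            // Derive the logical element count from the raw byte length:
            // 34 bytes per block, 32 quantized values per block.
            return (tornadoNativeArray.getSize() / 34) * 32;
        }
        return size;
    }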

Comment on lines 35 to 41
public Q8_0TornadoTensor(ByteArray byteArray) {
    this.size = -1;
    this.scales = null;
    this.quants = null;
    this.segment = null;
    this.tornadoNativeArray = byteArray;
}
Copilot AI Dec 5, 2025

This new constructor lacks documentation. Consider adding a JavaDoc comment explaining when this constructor should be used and the expected format of the ByteArray (Q8_0 format with 2-byte scales + 32-byte quants per block).
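
For example, a JavaDoc along these lines (a sketch of what the comment asks for; the wording is illustrative):

    /**
     * Creates a Q8_0 tensor backed by a single contiguous {@link ByteArray}.
     * The array must hold raw Q8_0 blocks: a 2-byte FP16 scale followed by
     * 32 signed 8-bit quantized values per block (34 bytes total). Intended
     * for the byte-based kernels that dequantize on the fly; the legacy
     * scales/quants fields are left null and size is unset (-1).
     */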

switch (config.modelType()) {
    case "FP16" -> fields.createActivationFP16(config.dim());
    case "Q8_0" -> fields.createActivationQ8_0(config.dim());
    default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
Copilot AI Dec 5, 2025

The error message is incomplete: it names the format but never states that it is unsupported. Consider: "Unsupported quantization format: " + config.modelType()

Suggested change
default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
default -> throw new UnsupportedOperationException("Unsupported quantization format: " + config.modelType());

Comment on lines +1131 to +1149
public static void matrixVectorGenericWithResidualQ8_0Byte(KernelContext context, FloatArray x, FloatArray hb, ByteArray w, int n, int d, int localWorkGroupSize) {
    // One row per workgroup (not per thread)
    int rowId = context.groupIdx;
    int localId = context.localIdx;
    int localSize = localWorkGroupSize;

    // Early exit if this workgroup is beyond our output dimension
    if (rowId >= d) {
        return;
    }

    float sum = matrixVectorRowMajorOptimizedQ8_0Byte(context, localSize, x, w, n);

    // Thread 0 in each workgroup writes the final result
    if (localId == 0) {
        float result = hb.get(rowId) + sum;
        hb.set(rowId, result);
    }
}
Copilot AI Dec 5, 2025

This new public method lacks documentation. Consider adding a JavaDoc comment explaining its purpose, how it differs from the previous implementation, and the residual connection behavior.
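
Something along these lines would address it (a sketch; the wording is illustrative, based on the code shown above):

    /**
     * Computes one row of a Q8_0-quantized matrix-vector product and adds it
     * into {@code hb} as a residual connection: hb[row] += dot(w[row], x).
     * Unlike the HalfFloatArray variant, the weights arrive as a raw Q8_0
     * ByteArray (2-byte FP16 scale + 32 quants per block) and are dequantized
     * on the fly. One workgroup handles one output row; thread 0 writes back.
     */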

// Thread 0 in each workgroup writes the final result
if (localId == 0) {
    float silu = siluActivation(sum1); // Using the new SiLU method
Copilot AI Dec 5, 2025

The comment "Using the new SiLU method" appears to be copy-pasted from the original method and is now misleading. Since this is duplicated in a new Q8_0Byte variant, this comment doesn't add value and could be removed or updated to reflect the Q8_0Byte-specific context.

Suggested change
float silu = siluActivation(sum1); // Using the new SiLU method
float silu = siluActivation(sum1);

switch (config.modelType()) {
    case "FP16" -> fields.createActivationFP16(config.dim());
    case "Q8_0" -> fields.createActivationQ8_0(config.dim());
    default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
Copilot AI Dec 5, 2025

The error message is incomplete: it names the format but never states that it is unsupported. Consider: "Unsupported quantization format: " + config.modelType()

Suggested change
default -> throw new UnsupportedOperationException("Quantization format " + config.modelType());
default -> throw new UnsupportedOperationException("Unsupported quantization format: " + config.modelType());

Comment on lines 17 to 19
@Override public String modelType() {
    return type;
}
Copilot AI Dec 5, 2025

Inconsistent formatting compared to other Configuration classes. In other files (Qwen2Configuration, MistralConfiguration), the @Override annotation is on the same line as public String modelType(). Consider using consistent formatting across all configuration classes.

Comment on lines +25 to +28
public static void convertFP16toFP32(KernelContext context, HalfFloatArray x, FloatArray wrapX) {
    int i = context.globalIdx;
    wrapX.set(i, x.get(i).getFloat32());
}
Copilot AI Dec 5, 2025

This new public method lacks documentation. Consider adding a JavaDoc comment explaining its purpose, parameters, and that it converts FP16 (half precision) to FP32 (single precision) format for GPU processing.
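
For instance (a sketch of the requested JavaDoc; wording is illustrative):

    /**
     * Converts FP16 (half precision) values to FP32 (single precision).
     * Launched with one thread per element: thread i reads x[i] and writes
     * the widened value to wrapX[i], producing the FP32 working buffer that
     * the downstream kernels expect.
     */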

Comment on lines +39 to +40
int blockSize = 32;
int Q8_0_BLOCK_BYTES = 34; // 2 bytes scale + 32 bytes quants
Copilot AI Dec 5, 2025

Magic numbers 32 and 34 are used directly. Consider extracting these as named constants (e.g., Q8_0_BLOCK_SIZE and Q8_0_BLOCK_BYTES) at the class level to improve maintainability and make the code more self-documenting.
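
For example (a sketch; the names follow the comment's suggestion, the visibility is illustrative):

    /** Number of quantized values per Q8_0 block. */
    private static final int Q8_0_BLOCK_SIZE = 32;

    /** Bytes per Q8_0 block: 2-byte FP16 scale + 32 quantized values. */
    private static final int Q8_0_BLOCK_BYTES = Q8_0_BLOCK_SIZE + 2;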

@Override
public ByteArray asByteArray() {
    return tornadoNativeArray;
Copilot AI Dec 5, 2025

The method asByteArray() returns tornadoNativeArray which will be null if the tensor was created using the old constructor Q8_0TornadoTensor(int, HalfFloatArray, Int8Array, MemorySegment). This will cause a NullPointerException when the new Q8_0 byte-based kernels try to use tensors created the old way. Since line 32 sets it to null, this is a critical compatibility issue.

Suggested change
return tornadoNativeArray;
if (tornadoNativeArray != null) {
    return tornadoNativeArray;
} else if (segment != null) {
    return ByteArray.fromSegmentShallow(segment);
} else {
    throw new IllegalStateException("No ByteArray or MemorySegment available for this tensor");
}

final TornadoWeights weights = (TornadoWeights) model.weights();

MemorySegment.copy(weights.getTokenEmbeddingTable().asFloatArray().getSegment(), (long) token * configuration.dim() * Float.BYTES, state.wrapX.getSegment(), 0, configuration.dim() * Float.BYTES);
MemorySegment.copy(weights.getTokenEmbeddingTable().asHalfFloatArray().getSegment(), (long) token * configuration.dim() * Short.BYTES, state.embeddingX.getSegment(), 0, configuration.dim() * Short.BYTES);
Copilot AI Dec 5, 2025

Line 586 (the copy into embeddingX via asHalfFloatArray()) appears to be dead code that should be removed: the switch statement immediately below overwrites it with different logic based on the weight type, which suggests it is leftover from refactoring.

Suggested change
MemorySegment.copy(weights.getTokenEmbeddingTable().asHalfFloatArray().getSegment(), (long) token * configuration.dim() * Short.BYTES, state.embeddingX.getSegment(), 0, configuration.dim() * Short.BYTES);

@mikepapadim (Member)

/rerun all

github-actions bot commented Dec 5, 2025

🚀 Workflow rerun started

Mode: all
Triggered by: @mikepapadim

View Actions

github-actions bot commented Dec 5, 2025

Workflow rerun success

View Actions
