Float 16 to float 32 looks incorrect #233

@daniel-emotech

This doesn't look correct. The sign plus exponent of a 16-bit float take up 6 bits (1 + 5), while for a 32-bit float they take up 9 bits (1 + 8). If you simply shift the bits, part of the half-precision mantissa ends up in the full-precision exponent (and vice versa in the other direction).

impl Into<f32> for BFloat16 {
    fn into(self) -> f32 {
        unsafe {
            // Assumes that the architecture uses IEEE-754 natively for floats
            // and twos-complement for integers.
            mem::transmute::<u32, f32>((self.0 as u32) << 16)
        }
    }
}

impl From<f32> for BFloat16 {
    fn from(value: f32) -> Self {
        unsafe {
            // Assumes that the architecture uses IEEE-754 natively for floats
            // and twos-complement for integers.
            BFloat16((mem::transmute::<f32, u32>(value) >> 16) as u16)
        }
    }
}
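
To make the bit-layout question concrete, here is a minimal sketch that applies the same 16-bit shift to two different 16-bit encodings of 1.0. The helper names `widen_by_shift` and `narrow_by_shift` are hypothetical and not part of the crate; they use the safe `f32::from_bits` / `f32::to_bits` in place of `mem::transmute`. The sketch assumes two candidate layouts: IEEE 754 binary16 (1 sign, 5 exponent, 10 mantissa bits) and bfloat16 (1 sign, 8 exponent, 7 mantissa bits). Which one applies depends on what the `BFloat16` type is meant to store; the name suggests the latter, but that is an assumption here.

```rust
/// Widen a 16-bit pattern to f32 by placing it in the upper half of a u32.
/// Hypothetical helper mirroring the shift used in the quoted `into`.
pub fn widen_by_shift(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow an f32 to 16 bits by keeping only its upper half (truncation).
/// Hypothetical helper mirroring the shift used in the quoted `from`.
pub fn narrow_by_shift(value: f32) -> u16 {
    (value.to_bits() >> 16) as u16
}

/// 1.0 encoded as bfloat16: sign 0, exponent 0111_1111, mantissa 000_0000.
pub const BF16_ONE: u16 = 0x3F80;

/// 1.0 encoded as IEEE 754 binary16: sign 0, exponent 0_1111, mantissa 00_0000_0000.
pub const HALF_ONE: u16 = 0x3C00;
```

With these definitions, `widen_by_shift(BF16_ONE)` yields `1.0` and `narrow_by_shift(1.0)` yields `0x3F80`, because bfloat16 has the same 9 sign-plus-exponent bits as `f32`. In contrast, `widen_by_shift(HALF_ONE)` yields `0.0078125` (2^-7), which is the mismatch described above when the input is a binary16 value.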
