Commit fbe549e

committed
[self] post
1 parent e44afc5 commit fbe549e

1 file changed

+98
-7
lines changed

_posts/data_rep/2020-06-21-data-rep-float.md

Lines changed: 98 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -68,11 +68,13 @@ sign exponent fraction
6868
31 30 .... 23 22 ....................... 0
6969
```
7070

71-
- The _sign_ part took 1 bit to indicate the sign of the floats
72-
- The _exponent_ part took 8 bits and represent a signed integer in _biased form_.
73-
It's a variant of _excess-128_ since it took out the `-127` (all 0s) and `128`
74-
(all 1s) for special numbers, so instead of unsigned `128`, the `u127` represent
75-
the actual `0`, and ranges `[-126, 127]` instead of `[-127, 128]`.
71+
- The _sign_ part took 1 bit to indicate the sign of the floats. (`0` for `+`
72+
and `1` for `-`.) This is the same treatment as [sign-magnitude](2020-06-19-data-rep-int.md##sign-magnitude-原码).
73+
- The _exponent_ part took 8 bits and used [_offset-binary (biased) form_](2020-06-19-data-rep-int.md#offset-binary-移码) to represent a signed integer.
74+
It's a variant form since it took out the `-127` (all 0s) for zero and `+128`
75+
(all 1s) for non-numbers, thus it ranges only `[-126, 127]` instead of
76+
`[-127, 128]`. Then, it chooses the zero offset of `127` among these 254 values (like
77+
using `128` in _excess-128_), a.k.a the _exponent bias_ in the standard.
7678
- The _fraction_ part took 23 bits with an _implicit leading bit_ `1` and
7779
represents the actual _significand_ with a total precision of 24 bits.
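
The three fields described above can be pulled out of a `float` with a few shifts and masks. This is only a minimal sketch: `FloatFields`, `decode` and `unbiased_exponent` are hypothetical names (not from this post), and it assumes `float` is an IEEE-754 binary32 on the target platform.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three raw fields of a binary32 float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, still biased
    std::uint32_t fraction;  // 23 bits, without the implicit leading 1
};

// Split a float into its raw fields. Assumes `float` is IEEE-754 binary32.
inline FloatFields decode(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to read the bit pattern
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Remove the exponent bias of 127. Only meaningful for normal numbers,
// i.e. biased exponents in the range 1..254.
inline int unbiased_exponent(const FloatFields& p) {
    assert(p.exponent >= 1 && p.exponent <= 254);
    return static_cast<int>(p.exponent) - 127;
}
```

For `1.0f`, `decode` yields sign `0`, biased exponent `127` and fraction `0`, so `unbiased_exponent` returns `0`.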
7880

@@ -98,10 +100,11 @@ S F × E = R
98100
Aha! It's the real number `1`!
99101
Recall that the `E = 0b0111 1111 = 0` because it used a biased representation!
100102

103+
We will add more non-trivial examples later.
101104

102105

103-
Code Sample
104-
-----------
106+
Demoing Floats in C/C++
107+
-----------------------
105108

106109
Writing sample code that converts between binaries (in hex) and floats is not
107110
as straightforward as it is for integers. Luckily, there are still some hacks to
@@ -168,6 +171,79 @@ std::cout << f; // 9
168171
```
169172

170173

174+
Representation of Non-Numbers
175+
-----------------------------
176+
177+
There is more in IEEE-754!
178+
179+
Real numbers don't satisfy the [closure property](https://en.wikipedia.org/wiki/Closure_(mathematics))
180+
the way integers do: notably, the set of real numbers is NOT closed under division! It
181+
could produce non-number results such as **infinity** (`1/0`) or even
182+
**NaN (Not-a-Number)** (taking the square root of a negative number).
183+
184+
- [NaN](https://en.wikipedia.org/wiki/NaN)
185+
186+
It would be desirable for the set of floating-point numbers to be closed under any
187+
floating-point arithmetic. That streamlines the machine representation a lot.
188+
So the IEEE made it so and squeezed those non-number values into the same
189+
representation.
190+
191+
We will also include _zero_ in the table since it's special (the two zeros
192+
both use the `0x00` exponent).
193+
194+
```cpp
195+
(binary) (hex)
196+
0 00000000 00000000000000000000000 = 0000 0000 = 0
197+
1 00000000 00000000000000000000000 = 8000 0000 = −0
198+
199+
0 11111111 00000000000000000000000 = 7f80 0000 = infinity
200+
1 11111111 00000000000000000000000 = ff80 0000 = −infinity
201+
202+
_ 11111111 10000000000000000000001 = ffc0 0001 = qNaN (on x86 and ARM processors)
203+
_ 11111111 00000000000000000000001 = ff80 0001 = sNaN (on x86 and ARM processors)
204+
```
205+
206+
```cpp
207+
(8 bits) (23 bits)
208+
sign exponent fraction
209+
0 00 0 ...0 0 = +0
210+
1 00 0 ...0 0 = -0
211+
0 FF 0 ...0 0 = +infinity
212+
1 FF 0 ...0 0 = -infinity
213+
_ FF 1 ...0 1 = qNaN
214+
_ FF 0 ...0 1 = sNaN
215+
```
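
The rows of the tables above can be checked by building floats from raw bit patterns and asking `<cmath>` what they are. This is a rough sketch: `float_from_bits` and `describe` are hypothetical helper names, and it assumes `float` is an IEEE-754 binary32 and that the code is not compiled with options like `-ffast-math` that disable NaN handling.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float. Assumes IEEE-754 binary32.
inline float float_from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: name the kind of value a bit pattern encodes.
inline const char* describe(std::uint32_t bits) {
    const float f = float_from_bits(bits);
    if (std::isnan(f)) return "NaN";
    if (std::isinf(f)) return std::signbit(f) ? "-infinity" : "+infinity";
    if (f == 0.0f)     return std::signbit(f) ? "-0" : "+0";
    return "finite non-zero";
}
```

For instance, `describe(0x80000000)` reports `-0` and `describe(0x7f800000)` reports `+infinity`. Telling a qNaN from an sNaN needs a look at the fraction bits, since `std::isnan` is true for both.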
216+
217+
Encodings of qNaN and sNaN are not strictly mandated by IEEE 754 and are implemented
218+
differently on different processors. Luckily, both the x86 and ARM families use the
219+
"most significant bit of fraction" to indicate quietness.
220+
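
The quiet bit can be tested directly on the bit pattern. The sketch below assumes the x86/ARM convention described here (fraction MSB set means quiet); the constant and function names are hypothetical, and other processors may use a different convention.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kExponentMask = 0x7F800000u;  // 8 exponent bits
constexpr std::uint32_t kFractionMask = 0x007FFFFFu;  // 23 fraction bits
constexpr std::uint32_t kQuietBit     = 0x00400000u;  // MSB of the fraction (bit 22)

// NaN: exponent all ones and a non-zero fraction.
constexpr bool is_nan_bits(std::uint32_t b) {
    return (b & kExponentMask) == kExponentMask && (b & kFractionMask) != 0;
}

// Assumes the x86/ARM convention: quiet bit set means qNaN.
constexpr bool is_quiet_nan_bits(std::uint32_t b) {
    return is_nan_bits(b) && (b & kQuietBit) != 0;
}

constexpr bool is_signaling_nan_bits(std::uint32_t b) {
    return is_nan_bits(b) && (b & kQuietBit) == 0;
}

// Under this convention, quieting a NaN amounts to setting that one bit;
// the remaining fraction bits are left untouched.
inline std::uint32_t quieted(std::uint32_t b) {
    assert(is_nan_bits(b));
    return b | kQuietBit;
}
```

With the two NaN rows from the table, `is_quiet_nan_bits(0xffc00001)` and `is_signaling_nan_bits(0xff800001)` both hold, and `quieted(0xff800001)` gives `0xffc00001`.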
221+
### More on NaN
222+
223+
If we look carefully into the IEEE 754-2008 spec, on _page 35, 6.2.1_, it
224+
actually defines anything with exponent `FF` that is not an infinity (i.e. with
225+
a non-zero fraction) as a NaN!
226+
227+
> All binary NaN bit strings have all the bits of the biased exponent field E set to 1 (see 3.4). A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand field being 0.
228+
229+
That means we actually have `2 ** 24 - 2` NaNs among the 32-bit floats!
230+
The `24` comes from the `1` sign bit plus the `23` fraction bits, and the `2` comes from
231+
the `+/- inf`.
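
The count above is small enough to verify by brute force. The following sketch walks every pattern whose exponent is `FF` and counts those that `std::isnan` accepts. `count_nan_patterns` is a hypothetical name, and the code assumes an IEEE-754 binary32 `float` and a build without `-ffast-math`-style options.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Count the bit patterns with exponent 0xFF that std::isnan() reports as NaN.
inline std::uint32_t count_nan_patterns() {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t count = 0;
    for (std::uint32_t sign = 0; sign < 2; ++sign) {
        for (std::uint32_t fraction = 0; fraction < (1u << 23); ++fraction) {
            const std::uint32_t bits = (sign << 31) | (0xFFu << 23) | fraction;
            float f;
            std::memcpy(&f, &bits, sizeof f);
            if (std::isnan(f)) ++count;
        }
    }
    assert(count <= (1u << 24));  // cannot exceed the number of patterns tried
    return count;
}
```

The loop tries `2 ** 24` patterns in total; the two with an all-zero fraction are the infinities, so the result should equal `(1u << 24) - 2`.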
232+
233+
The contiguous 22 bits inside the fraction look like quite a waste, and there
234+
would be 51 of them in the `double`! We will see how to make them useful
235+
in later episodes (spoiler: they are known as the _NaN payload_).
236+
237+
It's also worth noting that it's weird to use the MSB of the fraction instead of the sign bit for
238+
NaN quietness/signalness:
239+
240+
> It seems strange to me that the bit which signifies whether or not the NaN is signaling is the top bit of the mantissa rather than the sign bit; perhaps something about how floating point pipelines are implemented makes it less natural to use the sign bit to decide whether or not to raise a signal.
241+
> -- <https://anniecherkaev.com/the-secret-life-of-nan>
242+
243+
I guess it might be related to the CPU pipeline.
244+
245+
246+
171247
IEEE-754 64-bits Double-Precision Floats
172248
----------------------------------------
173249

@@ -185,6 +261,21 @@ sign exponent fraction
185261
```
186262

187263

264+
IEEE-754-2008 16-bits Short Floats
265+
----------------------------------------
266+
267+
The 2008 edition of IEEE-754 also standardizes the `short float`, which is
268+
in neither the C nor the C++ standard, though compiler extensions might include it.
269+
270+
It looks like:
271+
272+
```cpp
273+
1 sign bit | 5 exponent bits | 10 fraction bits
274+
S E E E E E M M M M M M M M M M
275+
```
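
Without compiler support, the layout above can still be decoded by hand. The sketch below converts a 16-bit pattern into a `float`; `half_to_float` is a hypothetical name, and the exponent bias of `15` is taken from the IEEE 754-2008 binary16 definition rather than from this post.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an IEEE 754-2008 binary16 bit pattern into a float.
// Assumption: exponent bias of 15, as defined for binary16 in the standard.
inline float half_to_float(std::uint16_t h) {
    const int sign     = (h >> 15) & 0x1;
    const int exponent = (h >> 10) & 0x1F;  // 5 bits, biased by 15
    const int fraction = h & 0x3FF;         // 10 bits
    float magnitude;
    if (exponent == 0x1F) {
        // All-ones exponent: infinity when the fraction is zero, NaN otherwise.
        magnitude = fraction == 0 ? std::numeric_limits<float>::infinity()
                                  : std::numeric_limits<float>::quiet_NaN();
    } else if (exponent == 0) {
        // Zero or subnormal: no implicit leading 1, value = fraction * 2^-24.
        magnitude = std::ldexp(static_cast<float>(fraction), -24);
    } else {
        // Normal: (1 + fraction / 2^10) * 2^(exponent - 15).
        magnitude = std::ldexp(static_cast<float>(fraction + 1024), exponent - 25);
    }
    return sign ? -magnitude : magnitude;
}
```

For example, `half_to_float(0x3C00)` is `1.0f` and `half_to_float(0x7BFF)` is `65504.0f`, the largest finite value of the format.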
276+
277+
278+
188279
References
189280
----------
190281
