float16 provides IEEE 754 half-precision format (binary16) with correct conversions to/from float32

License: MIT License


float16's Introduction

Float16 (Binary16) in Go/Golang

The x448/float16 package provides the IEEE 754 half-precision floating-point format (binary16) and uses IEEE 754 default rounding for conversions. IEEE 754-2008 refers to this 16-bit floating-point format as binary16.

IEEE 754 default rounding ("Round-to-Nearest RoundTiesToEven") is generally considered the most accurate rounding mode and gives a statistically unbiased estimate of the true result.

All 4,294,967,296 possible float32 to float16 conversions and all 65,536 possible float16 to float32 conversions with this library are verified to be correct.

Lowercase "float16" refers to IEEE 754 binary16, and capitalized "Float16" refers to the exported Go data type.

Features

Current features include:

  • float16 to float32 conversions use lossless conversion.
  • float32 to float16 conversions use IEEE 754-2008 "Round-to-Nearest RoundTiesToEven".
  • conversions using pure Go take about 2.65 ns/op on a desktop amd64.
  • unit tests provide 100% code coverage and check all possible 4+ billion conversions.
  • other functions include: IsInf(), IsNaN(), IsNormal(), PrecisionFromfloat32(), String(), etc.
  • all functions in this library use zero allocs except String().

Status

This library is used by fxamacker/cbor and is ready for production use on supported platforms. The version number < 1.0 indicates more functions and options are planned but not yet published.

Current status:

  • Core API is done and breaking API changes are unlikely.
  • 100% of unit tests pass:
    • short mode (go test -short) tests around 65765 conversions in 0.005s.
    • normal mode (go test) tests all possible 4+ billion conversions in about 95s.
  • 100% code coverage with both short mode and normal mode.
  • Tested on amd64, arm64, ppc64le, and s390x.

Roadmap:

  • Add functions for fast batch conversions leveraging SIMD when supported by hardware.
  • Speed up unit test when verifying all possible 4+ billion conversions.

Float16 to Float32 Conversion

Conversions from float16 to float32 are lossless conversions. All 65536 possible float16 to float32 conversions (in pure Go) are confirmed to be correct.

Unit tests take a fraction of a second to check all 65536 expected values for float16 to float32 conversions.

Float32 to Float16 Conversion

Conversions from float32 to float16 use IEEE 754 default rounding ("Round-to-Nearest RoundTiesToEven"). All 4294967296 possible float32 to float16 conversions (in pure Go) are confirmed to be correct.

Unit tests in normal mode take about 1-2 minutes to check all 4+ billion float32 input values and results for Fromfloat32(), FromNaN32ps(), and PrecisionFromfloat32().

Unit tests in short mode use a small subset (around 229 float32 inputs) and finish in under 0.01 second while still reaching 100% code coverage.

Usage

Install with go get github.com/x448/float16.

// Convert float32 to float16
pi := float32(math.Pi)
pi16 := float16.Fromfloat32(pi)

// Convert float16 to float32
pi32 := pi16.Float32()

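// PrecisionFromfloat32() is inlined, so it costs less than a function call.
// This example only converts if there's no data loss and input is not a subnormal.
// Note the plain assignment: redeclaring pi16 with := inside the if block would
// create an unused variable and fail to compile.
if float16.PrecisionFromfloat32(pi) == float16.PrecisionExact {
    pi16 = float16.Fromfloat32(pi)
}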

Float16 Type and API

Float16 (capitalized) is a Go type with uint16 as its underlying type. There are 6 exported functions and 9 exported methods.

package float16 // import "github.com/x448/float16"

// Exported types and consts
type Float16 uint16
const ErrInvalidNaNValue = float16Error("float16: invalid NaN value, expected IEEE 754 NaN")

// Exported functions
Fromfloat32(f32 float32) Float16   // Float16 number converted from f32 using IEEE 754 default rounding
                                   // with identical results to AMD and Intel F16C hardware. NaN inputs
                                   // are converted with quiet bit always set on, to be like F16C.

FromNaN32ps(nan float32) (Float16, error)   // Float16 NaN without modifying quiet bit.
                                            // The "ps" suffix means "preserve signaling".
                                            // Returns sNaN and ErrInvalidNaNValue if nan isn't a NaN.
                                 
Frombits(b16 uint16) Float16       // Float16 number corresponding to b16 (IEEE 754 binary16 rep.)
NaN() Float16                      // Float16 of IEEE 754 binary16 not-a-number
Inf(sign int) Float16              // Float16 of IEEE 754 binary16 infinity according to sign

PrecisionFromfloat32(f32 float32) Precision  // quickly indicates exact, ..., overflow, underflow
                                             // (inline and < 1 ns/op)

// Exported methods
(f Float16) Float32() float32      // float32 number converted from f using lossless conversion
(f Float16) Bits() uint16          // the IEEE 754 binary16 representation of f
(f Float16) IsNaN() bool           // true if f is not-a-number (NaN)
(f Float16) IsQuietNaN() bool      // true if f is a quiet not-a-number (NaN)
(f Float16) IsInf(sign int) bool   // true if f is infinite based on sign (-1=NegInf, 0=any, 1=PosInf)
(f Float16) IsFinite() bool        // true if f is not infinite or NaN
(f Float16) IsNormal() bool        // true if f is not zero, infinite, subnormal, or NaN.
(f Float16) Signbit() bool         // true if f is negative or negative zero
(f Float16) String() string        // string representation of f to satisfy fmt.Stringer interface

See API at godoc.org for more info.
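Because the state is a single uint16, the classification methods listed above can be expressed as bit masks on the binary16 layout (sign 0x8000, exponent 0x7c00, fraction 0x03ff). The following is a rough standard-library sketch of that idea. The type `half` and its methods are hypothetical stand-ins that mirror the documented behavior; they are not copied from the package.

```go
package main

import "fmt"

// half is a hypothetical stand-in for a uint16-backed binary16 value.
type half uint16

const (
	signMask  = 0x8000
	expMask   = 0x7c00
	fracMask  = 0x03ff
	quietMask = 0x0200 // most significant fraction bit
)

// IsNaN reports whether the exponent is all ones and the fraction is non-zero.
func (h half) IsNaN() bool { return h&expMask == expMask && h&fracMask != 0 }

// IsQuietNaN reports whether h is a NaN with the quiet bit set.
func (h half) IsQuietNaN() bool { return h.IsNaN() && h&quietMask != 0 }

// IsInf reports infinity: sign > 0 for +Inf, sign < 0 for -Inf, 0 for either.
func (h half) IsInf(sign int) bool {
	switch {
	case sign > 0:
		return h == 0x7c00
	case sign < 0:
		return h == 0xfc00
	default:
		return h&^signMask == 0x7c00
	}
}

// IsFinite reports whether h is neither infinite nor NaN.
func (h half) IsFinite() bool { return h&expMask != expMask }

// IsNormal reports whether h is not zero, subnormal, infinite, or NaN.
func (h half) IsNormal() bool {
	e := h & expMask
	return e != 0 && e != expMask
}

// Signbit reports whether the sign bit is set (negative or negative zero).
func (h half) Signbit() bool { return h&signMask != 0 }

func main() {
	for _, h := range []half{0x3c00, 0x0001, 0x8000, 0x7c00, 0x7e00, 0x7d00} {
		fmt.Printf("%#04x normal=%t finite=%t nan=%t quiet=%t neg=%t\n",
			uint16(h), h.IsNormal(), h.IsFinite(), h.IsNaN(), h.IsQuietNaN(), h.Signbit())
	}
}
```

Each predicate is a couple of integer operations with no allocation, which is consistent with the zero-alloc claim for everything except String().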

Benchmarks

Conversions (in pure Go) are around 2.65 ns/op for float16 -> float32 and float32 -> float16 on amd64. Speeds can vary depending on input value.

All functions have zero allocations except float16.String().

FromFloat32pi-2  2.59ns ± 0%    // speed using Fromfloat32() to convert a float32 of math.Pi to Float16
ToFloat32pi-2    2.69ns ± 0%    // speed using Float32() to convert a float16 of math.Pi to float32
Frombits-2       0.29ns ± 5%    // speed using Frombits() to cast a uint16 to Float16

PrecisionFromFloat32-2  0.29ns ± 1%  // speed using PrecisionFromfloat32() to check for overflows, etc.

System Requirements

  • Go 1.12 (or newer).
  • amd64, arm64, ppc64le, or s390x.

Other architectures and Go versions may work, but are not tested regularly.

Special Thanks

Special thanks to Kathryn Long (starkat99) for creating half-rs, a very nice Rust implementation of float16.

License

Copyright © 2019-present Montgomery Edwards⁴⁴⁸ and Faye Amacker.

x448/float16 is licensed under the MIT License. See LICENSE for the full license text.

float16's People

Contributors

deeglaze, dependabot[bot], fxamacker, x448

float16's Issues

Add PrecisionFromfloat32() to return precision without performing conversion

It would be useful to know the precision of converting IEEE binary32 to binary16, if the function can be inlined.

PrecisionFromfloat32 should return Precision without performing the conversion. Conversions from both Infinity and NaN values will always report PrecisionExact, even if the NaN payload or NaN quiet bit is lost.

If this is too complex to be inlined by Go, then make it an extra return value as part of conversion functions.

// Precision indicates whether the conversion to Float16 is
// exact, inexact, underflow, or overflow.
type Precision int

const (
       PrecisionExact Precision = iota
       PrecisionInexact
       PrecisionUnderflow
       PrecisionOverflow
)

func PrecisionFromfloat32(f32 float32) Precision 

Add Fromfloat32ex returning both Float16 and the precision of conversion.

Add Fromfloat32ex as an extended version of Fromfloat32 that returns more info.

Fromfloat32ex returns:

  • Float16 converted from specified float32
  • Integer indicating precision of conversion

Precision returned could be one of:

  • PrecisionExact (0) -- OK
  • PrecisionQuietNaN (16) -- converted input NaN to float16 NaN
  • PrecisionInexact (32) -- sometimes OK, depends on requirements
  • PrecisionUnderflow (48) -- undesirable
  • PrecisionOverflow (49) -- undesirable

Add (f Float16) Bits() uint16

Add the method (f Float16) Bits() uint16 to improve API symmetry.

Float32 method is the reverse of Fromfloat32 function.
Bits method will be the reverse of Frombits function.

This addition will not increase bloat because calling Bits should inline as a simple type cast.

Add FromNaN32ps() to convert NaN with preserved signal and payload

Fromfloat32() is 100% compatible with AMD and Intel F16C instructions by producing identical results for all 4+ billion conversions. Unfortunately, this means NaN input values are converted to NaN with quiet bit always set.

It can be useful to preserve the original NaN signaling status, so provide FromNaN32ps() to convert 32-bit NaN to 16-bit NaN while preserving both signal and payload.

Additionally, implement the function so it can inline and perform faster than Fromfloat32().

// ErrInvalidNaNValue indicates a NaN was not received.
var ErrInvalidNaNValue = errors.New("float16: invalid NaN value, expected IEEE 754 NaN")

// FromNaN32ps converts nan to IEEE binary16 NaN while preserving both 
// signaling and payload. Unlike Fromfloat32(), which can only return
// qNaN because it sets quiet bit = 1, this can return both sNaN and qNaN.
// If the result is infinity (sNaN with empty payload), then the 
// lowest bit of payload is set to make the result a valid sNaN.
// This function was kept simple to be able to inline.
func FromNaN32ps(nan float32) (Float16, error)

Support for bfloat16

Thank you for making this very useful and well-tested library! Are you planning to add support for the bfloat16 format, which is used in the ML field? It has different bit widths for the mantissa and exponent, but the other rules are the same as in IEEE 754 formats.

ErrInvalidNaNValue should be a const

-var ErrInvalidNaNValue = errors.New("float16: invalid NaN value, expected IEEE 754 NaN")
+const ErrInvalidNaNValue = float16Error("float16: invalid NaN value, expected IEEE 754 NaN")
+
+type float16Error string
+
+func (e float16Error) Error() string { return string(e) }

Rewrite tests for PrecisionFromfloat32()

Also consider having two subnormal precisions instead of one:

  • PrecisionSubnormal____ - no bits are dropped during conversion to float16 and is subnormal
  • PrecisionSubnormal____ - bits were dropped during conversion to float16 and is subnormal

Probably shouldn't name the 1st one with "Exact" suffix because some of those won't round-trip back to float32.
