
weblas's Introduction


GPU accelerated Javascript. Numerical computing in your browser with performance comparable to native.

Currently includes hundreds of unit tests, which verify correctness on hundreds of millions of data points.

Operations

Our focus is on numerical operations useful for neural networks and machine learning. So far, we've got 32-bit versions of each of these:

  • sscal - Matrix (and Vector) Scale (with addition)
  • sgemm - Matrix Multiply
  • sdwns - Matrix (and Image) Downsample (for Max Pooling)
  • sclmp - Matrix clamp (for ReLU)

Don't see what you need? Give a 👍 to an existing issue or create a new one!
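
For reference, here are plain-JavaScript (CPU) sketches of what the two element-wise operations compute; the argument names and order here are illustrative, not the exact weblas signatures.

```javascript
// CPU reference versions of the element-wise operations (for clarity only;
// the weblas GPU versions operate on row-major Float32Array data).

// sscal: scale with addition, out[i] = a * X[i] + b
function sscalRef(a, b, X) {
  return Float32Array.from(X, function (x) { return a * x + b; });
}

// sclmp: clamp every element to [lo, hi]; lo = 0, hi = Infinity gives ReLU
function sclmpRef(lo, hi, X) {
  return Float32Array.from(X, function (x) {
    return Math.min(Math.max(x, lo), hi);
  });
}

var X = new Float32Array([-2, -1, 0, 1, 2]);
sscalRef(2.0, 1.0, X); // [-3, -1, 1, 3, 5]
sclmpRef(0.0, Infinity, X); // [0, 0, 0, 1, 2] (ReLU)
```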

Usage

First, include the weblas.js file (from a release or the dist directory).

<script type="text/javascript" src="weblas.js"></script>

Then use it like this.

<script>


// create Matrices as arrays
var height_A = 1024, width_A = 1024,
    height_B = 1024, width_B = 1024;

var A = new Float32Array(height_A * width_A);
var B = new Float32Array(height_B * width_B);

// fill A and B with science

var M = height_A,
    N = width_B,
    K = height_B; // must match width_A

var alpha = 1.0;
var beta = 0.0;
var C = new Float32Array(width_B); // specialized for neural net bias calculation

// result will contain the matrix product of A and B (times alpha)
var result = weblas.sgemm(M, N, K, alpha, A, B, beta, C);

</script>

Pipeline Mode

Pipeline mode gives (sometimes very large) increases in performance by leaving data in GPU memory. A demo illustrating performance on a deep neural net can be found here.

Here's a basic example:

// create Tensor containers for interacting directly with GPU memory
var t0 = new weblas.pipeline.Tensor([M, K], data0);
// second matrix must be transposed
var t1 = new weblas.pipeline.Tensor([N, K], weblas.util.transpose(K, N, data1));
var t2 = new weblas.pipeline.Tensor([1, N], data2);
var alpha = 1.0;
var beta = 0.5;

/* NOTE: pipeline.sgemm takes a transpose matrix in the
  second slot (t1 here)
  (this requirement allows for improved performance)
 */
var t3 = weblas.pipeline.sgemm(alpha, t0, t1, beta, t2);

// result is a Float32Array
var result = t3.transfer();

More information can be found on the wiki Pipeline page.

Testing

Unit tests and benchmarks both require browserify and testling.

Install with:

npm install -g browserify
npm install -g testling

Unit Tests

All operations have unit test coverage. Unit tests use data generated outside the browser (to verify correctness). Generating the data requires python and the modules in requirements.txt.

With pip installed run:

pip install -r requirements.txt

Then, to generate the data, run:

npm run data

Then, run the unit tests with:

npm test

OS Setup

If the tests won't run, try this (it restores the default npm browser setting):

OSX

npm config set browser open

Linux

npm config set browser xdg-open

Windows

npm config set browser start

Benchmarks

After installing browserify and testling, run the benchmarks with:

npm run benchmark

Results

[email protected]

TAP version 13
ok 1 sgemm: 128x128 . 128x128
# 1.032 GFlops/sec  ±3.71%  n = 50 µ = 4ms
ok 2 sgemm: 128x256 . 256x128
# 1.745 GFlops/sec  ±2.89%  n = 44 µ = 5ms
ok 3 sgemm: 256x256 . 256x256
# 5.061 GFlops/sec  ±2.89%  n = 42 µ = 7ms
ok 4 sgemm: 512x256 . 256x512
# 15.454 GFlops/sec  ±3.86%  n = 51 µ = 9ms
ok 5 sgemm: 256x512 . 512x256
# 10.262 GFlops/sec  ±2.76%  n = 47 µ = 7ms
ok 6 sgemm: 512x512 . 512x512
# 22.231 GFlops/sec  ±3.54%  n = 50 µ = 12ms
ok 7 sgemm: 513x513 . 513x513
# 14.474 GFlops/sec  ±4.51%  n = 43 µ = 19ms
ok 8 sgemm: 1024x512 . 512x1024
# 41.859 GFlops/sec  ±3.38%  n = 43 µ = 26ms
ok 9 sgemm: 512x1024 . 1024x512
# 31.353 GFlops/sec  ±2.60%  n = 46 µ = 17ms
ok 10 sgemm: 1024x1024 . 1024x1024
# 45.545 GFlops/sec  ±3.99%  n = 31 µ = 47ms
ok 11 sgemm: 2048x2048 . 2048x2048
# 62.159 GFlops/sec  ±28.88%  n = 13 µ = 276ms

1..11
# tests 11
# pass  11

# ok

More information about benchmarks (including test configuration) can be found on the wiki.

Donations

Want to see more happen here? Contribute on

Patreon

weblas's People

Contributors

singularperturbation, waylonflinn


weblas's Issues

iOS support?

Hi.

I'm sorry to report that your nice weblas does not work on iOS Safari, since iOS Safari does not support WEBGL_color_buffer_float.
Would it be hard for weblas to support half precision?
iOS Safari seems to support half floats fully.
Please give me your suggestion.

Thanks.

Enable Development on Windows

Development thus far has been exclusive to Linux and OSX. Some work remains to make unit testing and benchmarking work on Windows.

  • fix npm test
  • fix npm run benchmark

Can't generate data

Unfortunately, I get an error when trying to generate the test data. I updated node.js and pointed to python2.7 but no dice. This is on OSX.

bash-3.2# npm run data

> [email protected] data /Users/quax/projects/weblas
> node test/data/generate.js

Traceback (most recent call last):
  File "generate.py", line 84, in <module>
    for i in range(len(names)):

Native browser support for Linear Algebra Library or ML

Hi, I am from the browser world. I've been looking at the different libraries implementing ML on the web. This is the first one that has found its way to GPU support. Have you encountered things you don't have support for yet?

GPU support across different kinds of machines

I'm doing very computationally-heavy calculations in the browser for data visualizations. Fortunately, I'm using Electron (Chrome), so the browser features are a known quantity. However, people will be installing this on lots of different desktop machines. If those machines don't have the right GPU support and/or have slow GPUs, the interactions will fail because CPU fallback computations will be too slow.

Does Weblas run or run quickly on a majority of desktop computers? Or is there a lot of variability? Are there guidelines or checks to see if a GPU on a machine is capable of running Weblas fast?

I'm wondering if I should be thinking of this like the old days of wondering whether browsers support a certain feature. Now we've got pretty good browser feature parity (especially with build tools), but as we push computation to the GPU, how do we evaluate GPU support?

Thank you for building this library.

Refactor Calculators to Use More StackGL

This should reduce code duplication in the Calculator classes significantly (eliminating bindUniforms and potentially bindInputTexture).

  //Bind shader
  shader.bind();

...

  //Set uniforms
  shader.uniforms.M = M_out;
  shader.uniforms.N = N_out;
  shader.uniforms.pad = pad_out;

from gl-shader

  shader.uniforms.texture = texture.bind()

from gl-texture2d

Other potentially useful classes:

Unpacked rgba format

Cheers for a great lib; weblas is very good stuff. I learned quite a lot about WebGL thanks to it.

I introduced some modifications to better suit my needs, but before beginning PRs I wanted to let you know the direction I am taking them: take a look at the initial weblas fork and the later weblas-unpacked, which evolved into a plugin depending on weblas.

The main modification concerns the texture format. Weblas-unpacked opts for a single-value-per-texel format. This avoids packing and unpacking when transferring both to and from the GPU, at the cost of increased memory (which can be recovered to an extent). The aim is to simplify shader logic and packing/unpacking operations when algorithms encourage multiple stages within the GPU. It also provides direct reading of float values, avoiding the encoding step. This does mean that shaders cannot be shared between the two formats (facilities to convert between them are provided, though). Check the project for more details.

Currently, integration of the two is a bit hacky (overwriting some class methods, bad I know); I might focus a bit on easing this and suggest some PRs in that direction. Hopefully I am not deviating too much from your aims, and my very lean coding experience won't boil your head (really, don't look at my gl-compute at all; this is mainly a hobby for me).

Thanks again for this great library.

Implement DGEMM

The 64 bit floating point version of Generalized Matrix Multiply.

requires #8

Simple GEMM Wrapper

Provide a simple top-level function that decides which GEMM implementation to use, sets it up and calls it.

Decision points and alternatives

  • Small Matrices -> Javascript
  • Float Texture Support -> GEMMFloatCalculator

Prerequisite: #1

Unit Test Sequential Pipeline Operations

Performing multiple pipeline operations sequentially might produce a new class of error. This should be tested with unit tests and the effects on precision determined.

error in $ npm run test

$ npm run test

> [email protected] test C:\Users\cgath\dev\railsagainstignorance\weblas
> browserify test/*.js | testling -x $npm_config_browser

Error: Cannot find module 'C:\Users\cgath\dev\railsagainstignorance\weblas\test\*.js' from 'C:\Users\cgath\dev\railsagainstignorance\weblas'
    at C:\Users\cgath\AppData\Roaming\npm\node_modules\browserify\node_modules\browser-resolve\node_modules\resolve\lib\async.js:55:21
    at load (C:\Users\cgath\AppData\Roaming\npm\node_modules\browserify\node_modules\browser-resolve\node_modules\resolve\lib\async.js:69:43)
    at onex (C:\Users\cgath\AppData\Roaming\npm\node_modules\browserify\node_modules\browser-resolve\node_modules\resolve\lib\async.js:92:31)
    at C:\Users\cgath\AppData\Roaming\npm\node_modules\browserify\node_modules\browser-resolve\node_modules\resolve\lib\async.js:22:47
    at FSReqWrap.oncomplete (fs.js:123:15)
'http:' is not recognized as an internal or external command,
operable program or batch file.
Command undefined terminated with non-zero exit code

full tutorial ?

Hey, I want to build reinforcement ML. I already have the basics of JavaScript.

How can I implement it with your library?

feedforward -> dot product (inputs * weights of input to hidden) && dot product (hidden * weights of hidden to output). Also, is it possible to generate random float numbers in GLSL between -1 and 1 (like 0.4884484844, -0.848484848, etc.)?

Use Binary Array Data Tests

Tests currently use JSON. This makes both test data generation and testing slow. Storing and loading binary data for the arrays would likely improve this a great deal.

Optimize SLOKN

The GLSL shader for slokn is very complex. It also currently accounts for roughly 17% of the runtime of the NN demo.

mobile with gpu problem

I tried simple matrix multiplications(1x1 with 1x1, 2x2 with 2x2, 1x2 with 2x1...) with weblas.sgemm on Android 7.0 with:

  • Chrome Mobile 58.0.3029.83
  • Firefox Mobile 53.0.2
  • Opera Mobile 42.7.2246.114996

but it always produces the same number, 4.000000476837158, no matter what the values of the matrices are (e.g. 2x2 with 2x2 produces a 2x2 matrix where all 4 numbers are identical).

It produces correct results on desktop browsers, so I suspect :) that it may be related to encode_float. What could the problem be?

lib/test.js should not be part of main package

Importing weblas from npm package gives the following error, as arrayloader is a devDependency:

ERROR in ./~/weblas/lib/test.js
Module not found: Error: Can't resolve 'arrayloader' in node_modules/weblas/lib'
 @ ./~/weblas/lib/test.js 2:10-32
 @ ./~/weblas/index.js
 @ ./src/Tensor.js
 @ ./src/index.js
 @ multi main

Minified file does not contain license

dist/weblas.js does not have a @license header. This makes it difficult to comply with your MIT license that requires the copyright notice be shipped with the code.

If you put the following at the tops of your files, then your minifier tools should be smart enough to know what to do:

/**
 * @license
 * Weblas - GPU accelerated Javascript
 * Copyright (c) 2015 Waylon Flinn
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
 */

transfer() or tranfer()

Help me please. I can't figure out which is correct.

In one example, transfer ()
https://github.com/waylonflinn/weblas/wiki/Pipeline

in another tranfer ()
https://github.com/waylonflinn/weblas

Neither option works:

Uncaught TypeError: Cannot read property 'transfer' of undefined at weblas.html:20

<script type="text/javascript" src="./dist/weblas.js"></script>

<script>

var height_A = 52, width_A = 52,
    height_B = 52, width_B = 52;

var A = new Float32Array(height_A * width_A);
var B = new Float32Array(height_B * width_B);

	for (let k = 0; k < height_A*width_A; k++) 
	{
		A[k] =Math.random();
		B[k] =Math.random();
	}
	
var shape = [M, N]; // rows, columns
var t0 = weblas.pipeline.Tensor(shape, A);	
//var t1 = weblas.pipeline.sscal(1, 1, t0);

var result = t0.transfer();  // <<----<<----<<----<<----<<----
console.log('t0 = ', result);
	
var M = height_A,
    N = width_B,
    K = height_B; // must match width_A

var alpha = 1;
var beta = 1;
var C = new Float32Array(width_B)      // specialized for neural net bias calculation

//result = weblas.sgemm(M, N, K, alpha, A, B, beta, C);
//console.log(result);
</script>

Basic Pipelining

Moving data to and from GPU memory is a large bottleneck. Good performance for target applications (Machine Learning, Neural Networks) will not be had without a mechanism for easily using results of previous calculations, already resident in GPU memory, in subsequent operations.

Here's a potential syntax for this:

var P = weblas.pipeline(),
    O = P.output(); // special pipeline output variable;

var min = 0.0,
    width = 3,
    stride = 2;

// Convolution -> ReLU -> Pool
P.saxpy(m*n, a, A, y)                   // Conv.1
  .sgemm(m, n, k, alpha, O, B, beta, C) // Conv.2
  .max(O, min)                          // ReLU
  .maxPatch(O, width, stride)           // Pool
  .execute(function(err, result){       // async execution
    console.log(result);
  });

WebGL Warning for Non-Power of Two Render Target

This warning occurs on every call to GEMM.calculate:

WebGL: drawElements: texture bound to texture unit 2 is not renderable.
It maybe non-power-of-2 and have incompatible texture filtering or is not
'texture complete'. Or the texture is Float or Half Float type with linear
filtering while OES_float_linear or OES_half_float_linear extension is not enabled.

Add Unit Tests - Comprehensive

More extensive set of unit tests, with a larger upper bound on runtime (< 5 min)

  1. large matrices (> 1024)
  2. larger range of numbers (0.0000001 - 1000000.0)

The purpose of the third Matrix C?

What is the purpose of the third matrix C? From the examples:

var C = new Float32Array(w2)      // specialized for neural net bias calculation

// result will contain matrix multiply of A x B (times alpha)
result = weblas.sgemm(M, N, K, alpha, A, B, beta, C);

In my tests, nothing happens to C, even when I change the beta value; it remains whatever it was initialized to.
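
For anyone else wondering: following BLAS convention, sgemm computes alpha * (A x B) + beta * C, but weblas returns the combined result as a new array rather than writing into C, which is why the input C never changes. C here is a single row of length N that is broadcast across every row of the product (hence "specialized for neural net bias calculation"). A plain-JavaScript sketch of that semantic (my reading of the behavior, not weblas's actual implementation):

```javascript
// Reference semantics of sgemm: returns alpha * (A x B) + beta * C, where
// A is M x K and B is K x N (row-major) and C is a length-N row that is
// broadcast over every row of the result. Inputs are left untouched.
function sgemmRef(M, N, K, alpha, A, B, beta, C) {
  var out = new Float32Array(M * N);
  for (var i = 0; i < M; i++) {
    for (var j = 0; j < N; j++) {
      var sum = 0;
      for (var k = 0; k < K; k++) {
        sum += A[i * K + k] * B[k * N + j];
      }
      out[i * N + j] = alpha * sum + beta * (C ? C[j] : 0);
    }
  }
  return out;
}
```

So with beta = 0 the C argument is ignored entirely, and with beta = 1 each row of A x B gets the bias row C added to it.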

Optimize Reshape

The GLSL shader for Tensor.reshape is very complex. It also currently accounts for ~15% of the execution time of the NN demo.

error in $npm run data

Hi, I tried stepping through the Testing instructions and bumped into this:

$ npm run data

> [email protected] data C:\Users\cgath\dev\railsagainstignorance\weblas
> node test/data/generate.js

Traceback (most recent call last):
  File "generate.py", line 84, in <module>
    for i in range(len(names)):
TypeError: object of type 'map' has no len()

Add Diagnostic Output to Unit Tests

Add output to unit tests to help pinpoint which platforms have which problems.

  • show expected and actual values
  • browser version information
  • operating system
  • highp support
  • highp precision
  • OES_texture_float support

Unit Test Float Encode Exhaustively

The floating point encode step is a tricky bit to get right. Issue #10 shows that numerical stability issues can make a big difference in some edge cases.

Encoding each unique floating point value as an element of a 4096-square matrix should require only a few hundred of these matrices. This should be doable in a reasonable amount of time.
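
The arithmetic works out: there are 2^32 distinct 32-bit patterns, so at one value per matrix element a few hundred large matrices cover all of them. A sketch, using a power-of-two side of 4096 for illustration:

```javascript
// How many S x S matrices does it take to cover every 32-bit pattern once?
var S = 4096;
var totalPatterns = Math.pow(2, 32);
var perMatrix = S * S; // 16,777,216 elements per matrix
var matricesNeeded = Math.ceil(totalPatterns / perMatrix); // 256

// Fill matrix number m with consecutive bit patterns, reinterpreted as floats.
function encodeTestMatrix(m, side) {
  var ints = new Uint32Array(side * side);
  for (var i = 0; i < ints.length; i++) {
    ints[i] = m * side * side + i;
  }
  return new Float32Array(ints.buffer); // view the same bits as float32
}
```

Note that some patterns decode to NaN or denormals, which a test harness would need to treat specially when comparing results.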

References

sgemm inaccuracies (some elements scaled by 4) in VMware Workstation 12 Pro (Ubuntu 18.04/Windows 10)

I wanted to report what may be some incompatibility issue specific to VMware. The same code operates fine on physical machines using the same operating systems, but exhibits the same calculation failures across multiple operating systems using the same VMWare Workstation Pro environment. I'm not expecting resolution to this issue, but would be happy to help debug it if there was anything you thought I could try.

The issue occurs regularly when using sgemm with the net result that certain elements in the resultant matrix are scaled by 4. Here is some code that exhibits the behavior:

var M = 5;
var N = 5;
var K = 5;
var alpha = 1.0;
var beta = 0.0;

var A = Float32Array.from({length: M * K}, (v,i) => i * 1); // [ 0, 1, 2, ... M*K ]
var B = Float32Array.from({length: K * N}, (v,i) => i % (K + 1) ? 0 : 1); // identity
var C = null;

result = weblas.sgemm(M, N, K, alpha, A, B, beta, C);
var asInt = Float32Array.from({length: M * N}, (v,i) => Math.round(result[i]));

Here are the contents of A, B, and result (aka asInt):

//            V         V                                             V
// asInt: [0, 4, 2, 3, 16, 20, 24, 28, 8, 9, 10, 11, 12, 13, 14, 15, 64, 68, 72, 76, 80, 84, 88, 92, 96]
//     A: [0, 1, 2, 3,  4,  5,  6,  7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
//     B: [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

I have tried enabling and disabling GPU acceleration for the VM through VMware's configuration menu but get the same results. I'm using VMware Workstation Pro version 12.5.9 build-7535481. I see there is a new release (version 14), and will try to upgrade when I get the opportunity. result and asInt contain basically the same results, but result is just harder to read as floating point.

sdwns channel count

While attempting to max pool with weblas.sdwns, I first got an error (this one) saying the channel count must be a multiple of COMPONENTS_PER_PIXEL (4, in my case).

But I wasn't using 4-channel data. Is this a limitation? What if I am not pooling rgba data, but data of another dimension?
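
Until that limitation is lifted, one workaround (my own sketch, not a weblas API) is to zero-pad the channel dimension up to the next multiple of 4 before calling sdwns, then ignore the padded channels in the output:

```javascript
// Pad an H x W x channels array (channels-last, row-major) so the channel
// count becomes a multiple of 4, as sdwns requires. Padding is zero-filled.
function padChannels(data, height, width, channels) {
  var padded = (channels + 3) & ~3; // round up to a multiple of 4
  if (padded === channels) return data;
  var out = new Float32Array(height * width * padded);
  for (var p = 0; p < height * width; p++) {
    for (var c = 0; c < channels; c++) {
      out[p * padded + c] = data[p * channels + c];
    }
  }
  return out;
}

var padded = padChannels(new Float32Array([1, 2, 3, 4, 5, 6]), 1, 2, 3);
// padded is [1, 2, 3, 0, 4, 5, 6, 0]
```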

Runtime failures during 'strict mode'

Running weblas in a browser's 'strict mode' will throw runtime errors.

Repro: see https://jsfiddle.net/u7by8azp/, created by pasting dist/weblas.js into a script with 'use strict' at the top.
Expected: no console errors in devtools
Actual:

Uncaught ReferenceError: p is not defined
    at new DownsampleCalculator

This makes it challenging to use weblas as an external dependency when building a project that bundles or concatenates it with other scripts that have strict mode enabled.

It seems that if we patch a couple of instances in the weblas code, adding var before the variable declarations, this should work:
https://jsfiddle.net/cft3aqvu/
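
The failure mode is easy to reproduce in isolation: under strict mode, assigning to an undeclared variable (like the p in DownsampleCalculator) throws instead of silently creating a global, which is exactly what happens when dist/weblas.js gets concatenated after a script with 'use strict' at the top.

```javascript
// Without `var`, this assignment throws under strict mode; in sloppy
// mode it would silently create a global named p instead.
function missingVar() {
  'use strict';
  p = 4; // ReferenceError: p is not defined
}

var threw = false;
try {
  missingVar();
} catch (e) {
  threw = e instanceof ReferenceError;
}
console.log(threw); // true
```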

Your example does not work for me.

Hi @waylonflinn,

It all looks very interesting and I would like to see your solution combined with neocortex.
I could not run your example in Chrome on Ubuntu. I have a built-in i5 GPU and get a "Bound framebuffer is not complete." error.

The error is in webgl.js:
if( gl.checkFramebufferStatus(gl.FRAMEBUFFER) != gl.FRAMEBUFFER_COMPLETE)
throw new Error("Bound framebuffer is not complete.");

Am I doing something wrong there?

Document All Operations

A short description of what each operation does and the type and function of each of its arguments. Maybe autogenerated from comments.

This should also include things like expected memory layout of input (row-major order) and a brief discussion of how higher dimensions (channels) can be consistently handled.
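
As a starting point, here is the row-major convention the existing examples assume, plus one consistent way (an illustration, not an established weblas convention) to fold a channel dimension in:

```javascript
// Row-major layout: element (row, col) of an M x N matrix lives at
// index row * N + col in the flat Float32Array.
function index2d(row, col, N) {
  return row * N + col;
}

// One consistent way to handle a third (channel) dimension: interleave
// channels within each row, giving a height x (width * channels) matrix.
function index3d(row, col, channel, width, channels) {
  return (row * width + col) * channels + channel;
}
```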

Sparse matrices

Given the limited memory and bandwidth in a browser, sparse matrices could speed up transmission times and lower the memory usage.

Some example use cases: (1) If a model inefficiently creates many zero weights (e.g. a ReLU nonlinearity), then this could be useful. I speak hypothetically, not from experience here. (2) Representing graphs of nodes and edges like in a network would benefit from a formal sparse format since networks often have many zeros. Analyzing reachability in a graph requires multiplying a matrix with itself. (3) In language models there can be very large one-hot vectors representing input data which could be more memory efficient. If a matrix multiply could handle a sparse input vector and dense matrix/array weight vector that would be cool.
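
For concreteness, the standard CSR (compressed sparse row) layout maps directly onto typed arrays, and a sparse-matrix times dense-vector product is short (a plain-JavaScript sketch, independent of weblas):

```javascript
// CSR representation of a sparse M x N matrix:
//   values - the nonzero entries, row by row
//   colIdx - the column index of each nonzero
//   rowPtr - rowPtr[i]..rowPtr[i+1] delimit row i's nonzeros
function csrMatVec(rowPtr, colIdx, values, x) {
  var M = rowPtr.length - 1;
  var y = new Float32Array(M);
  for (var i = 0; i < M; i++) {
    var sum = 0;
    for (var k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
      sum += values[k] * x[colIdx[k]];
    }
    y[i] = sum;
  }
  return y;
}

// 2x3 matrix [[0, 5, 0], [1, 0, 2]] times the vector [1, 2, 3]
var y = csrMatVec([0, 1, 3], [1, 0, 2], [5, 1, 2], [1, 2, 3]);
// y = [10, 7]
```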

Examples of existing javascript versions:
http://mathjs.org/examples/sparse_matrices.js.html
http://numericjs.com/wordpress/?p=26

Existing codebase:
https://www.tensorflow.org/api_docs/python/sparse_ops/
http://faculty.cse.tamu.edu/davis/suitesparse.html
https://cran.r-project.org/web/packages/Matrix/Matrix.pdf
