The need for speed
Coming from a real-time world (games and graphics), current machine learning training is a shock. Even small, simple training tasks take a long time, which gives me plenty of time to think about the performance of my graphs.
Currently, most of the open source deep learning stacks consist of a C++ and CUDA back-end driven by a Python/R/Lua front-end. The CUDA code tends to be fairly generic, which makes sense from an HPC and experimental point of view.
It's important to note that HPC code is very optimised but tends to rely on standard interfaces with relatively generic back-ends. For example, BLAS is an old FORTRAN standard for linear algebra; it has numerous back-ends, including optimised x64, CUDA, OpenCL, etc. However, it only accelerates the tensor multiplies in a few data formats; other, more data-specific optimisations, like overlapping conversions, aren't in its remit.
It's quite a different optimisation strategy from the real-time world, which tends to be less portable but makes wider use of the hardware. It makes sense given the different markets and issues: HPC code runs on large clusters that are largely out of the data scientist's control, while real-time code is likely running on a few different sets of hardware, where even a few percent gain is worth extra effort.
So the obvious question for me is: how fast could a custom path be? If we were to change the data formats, look at cache usage, etc., how fast could we make an end-to-end training path for a single graph?
It's unlikely that you'd get any speed-up from pure ALU optimisations; most HPC back-ends will be issuing ALU ops as fast as they can with the data formats they are given. Any gains are going to come from memory, format and overlapping optimisations.
The HPC world traditionally uses double (64-bit) floating point numbers. GPUs really don't like doubles; even the best (the latest NVIDIA Pascal chip) is significantly slower for doubles than for the smaller floating point formats. Deep learning is relatively immune to precision problems, so using smaller floats is an obvious win. It's the reason the Pascal chip doubles its performance with each step down in size (floats (32-bit) are 2x and halfs (16-bit) are 4x as fast as doubles).
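To get a feel for what each step down in size costs in precision, here is a minimal sketch using only Python's standard struct module. The weight value is made up for illustration; the point is just the byte count and round-trip error of each IEEE 754 format.

```python
import struct

def roundtrip(value, fmt):
    """Pack value into the given IEEE 754 format and unpack it again.

    Returns (size in bytes, value after the round trip).
    """
    packed = struct.pack(fmt, value)
    return len(packed), struct.unpack(fmt, packed)[0]

weight = 0.1234567891234  # hypothetical weight, chosen only for illustration

for name, fmt in (("double", "<d"), ("float", "<f"), ("half", "<e")):
    size, restored = roundtrip(weight, fmt)
    print(f"{name:6s} {size} bytes, error {abs(restored - weight):.2e}")
```

Whether the error at 16 bits matters depends on the network and the data, so treat this as a way to eyeball the trade-off rather than a verdict.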
However, this isn't necessarily the limit of format optimisations. With limited-range inputs in many fields, it raises the question of whether integer maths might be a better option. Many inputs and weights are normalised to 0 to 1 or -1 to 1, which might allow fixed point integers to be used. It's no instant win, but it's worth investigating on some datasets/platforms.
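As a rough sketch of the fixed point idea, the snippet below quantises values in the -1 to 1 range to 16-bit Q15 integers and does a dot product with integer maths only. The Q15 format and all the names here are my own choices for illustration, not something any particular ML stack does.

```python
FRACTION_BITS = 15            # Q15: 1 sign bit, 15 fractional bits
SCALE = 1 << FRACTION_BITS    # 32768

def to_fixed(x):
    """Quantise a float in [-1, 1] to a Q15 integer, clamped to the int16 range."""
    return max(-SCALE, min(SCALE - 1, round(x * SCALE)))

def to_float(q):
    """Convert a Q15 integer back to a float."""
    return q / SCALE

def fixed_dot(a, b):
    """Dot product of two Q15 vectors using integer maths only.

    Each product is Q30, so accumulate at full width and shift back
    to Q15 once at the end to limit rounding error.
    """
    acc = sum(x * y for x, y in zip(a, b))
    return acc >> FRACTION_BITS

# Hypothetical normalised inputs and weights.
inputs = [0.5, -0.25, 0.75, 1.0]
weights = [0.1, 0.2, -0.3, 0.4]

exact = sum(x * w for x, w in zip(inputs, weights))
approx = to_float(fixed_dot([to_fixed(x) for x in inputs],
                            [to_fixed(w) for w in weights]))
print(exact, approx)
```

Note that 1.0 itself does not fit in Q15 and gets clamped to the largest representable value, which is one of the small costs that would need measuring on a real dataset.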
For example, I'm doing some work where the input is many single bits from a bitmap, but floats are used in the neural network layers. The output is ultimately three probabilities, and the highest of those is selected. I suspect there are some nice optimisations to be had if I froze the neural graph and selected the best formats through it, taking into account cache sizes etc.
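To put a number on the bitmap case, here is a small sketch comparing one float per pixel against packing eight pixels per byte. The 64x64 checkerboard is a made-up stand-in for real input, and the helper names are hypothetical.

```python
from array import array

def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 per byte, most significant bit first."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)

def unpack_bit(packed, i):
    """Read bit i back out of the packed bytes."""
    return (packed[i // 8] >> (7 - i % 8)) & 1

# Hypothetical 64x64 checkerboard bitmap, flattened row by row.
bitmap = [(x ^ y) & 1 for y in range(64) for x in range(64)]

as_floats = array("f", bitmap)   # one single-precision float per pixel
as_bits = pack_bits(bitmap)      # one bit per pixel

print(len(as_floats) * as_floats.itemsize, "bytes as floats")
print(len(as_bits), "bytes packed")
```

With 4-byte floats that is 16384 bytes against 512, a factor of 32 before any cleverness in the layers themselves.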
Memory use is a real Achilles heel of most current ML toolkits, with essentially no real attempts to work with smaller memory footprints. Hence a GPU with 2 gigabytes of RAM is barely usable, with 8-24 GiB being standard for GPUs. CPU memory is bigger by a factor of 10 at the least.
The usual reason given is 'Big Data', but it's worth looking at the savings we could make. Smaller memory footprints may give us performance increases, as well as the obvious advantage of fitting large data sets on smaller hardware. Apart from using smaller data types throughout and not storing extra copies (harder than it sounds, as the front-ends tend to be in garbage-collected languages), a non-obvious thing to investigate is data compression.
Much of the data used in ML is likely highly compressible, and with fast codecs it may actually be a win to store it compressed in memory and decompress it as it is used.
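A minimal sketch of the idea, assuming zlib at its fastest level as a stand-in for a properly fast codec: the data is held as independently compressed chunks and each chunk is only expanded when it is needed. The chunk size and the sparse test data are invented for illustration.

```python
import zlib
from array import array

CHUNK = 1024  # values per chunk; a hypothetical tuning knob

def compress_chunks(values):
    """Store a float32 array as a list of independently compressed chunks."""
    return [zlib.compress(array("f", values[i:i + CHUNK]).tobytes(), 1)
            for i in range(0, len(values), CHUNK)]

def read_chunk(chunks, index):
    """Decompress a single chunk just before it is needed."""
    out = array("f")
    out.frombytes(zlib.decompress(chunks[index]))
    return out

# Hypothetical sparse input: mostly zeros with the occasional one.
data = [1.0 if i % 97 == 0 else 0.0 for i in range(16 * CHUNK)]

chunks = compress_chunks(data)
raw_bytes = len(data) * 4
packed_bytes = sum(len(c) for c in chunks)
print(raw_bytes, "bytes raw,", packed_bytes, "bytes compressed")
```

How well this works on real data, and whether the decompression cost is paid back by the smaller footprint, is exactly the thing that would need measuring.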
Combining and Overlapping
Due to memory constraints and ease of use, there is very little pipelining at the macro scale in ML; each operation is treated as a separate task. Combining and overlapping different operations may make more efficient use of the hardware. The non-linear function at every neural layer could be applied before the tensor is stored, rather than storing it and re-reading it, or could even use look-up tables instead of float ALUs.
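Here is a sketch of both ideas together, under my own assumptions: pre-activations are quantised to int8, the sigmoid is precomputed into a 256-entry table, and the layer applies the table as each output is produced so the pre-activation tensor is never stored. INPUT_SCALE, the uint8 output range and the function names are all hypothetical.

```python
import math

# Assumed quantisation step: real pre-activation = q * INPUT_SCALE,
# so int8 values cover roughly -8.0 to 8.0.
INPUT_SCALE = 8.0 / 128

# Sigmoid for every possible int8 input, stored as a uint8 (0..255).
SIGMOID_LUT = [round(255 / (1 + math.exp(-q * INPUT_SCALE)))
               for q in range(-128, 128)]

def sigmoid_lut(q):
    """Map an int8 pre-activation to a uint8 activation by table look-up."""
    return SIGMOID_LUT[q + 128]

def fused_layer(inputs, weight_rows, shift):
    """Integer dot product per output, with the activation applied before storing.

    The accumulator is shifted and clamped to int8, then passed straight
    through the table, so only the activated value is ever written out.
    """
    out = []
    for row in weight_rows:
        acc = sum(x * w for x, w in zip(inputs, row))
        q = max(-128, min(127, acc >> shift))
        out.append(sigmoid_lut(q))
    return out

print(fused_layer([10, -20], [[3, 1], [0, 0]], 0))
```

The table is only 256 bytes, so it should sit comfortably in cache; whether it beats a float ALU will depend on the platform.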
This is also where certain platforms might win; for example, overlapping compression with ALU work may be able to hide the cost completely and use cache more efficiently. You could also potentially use underused components (such as the CPU in most GPU platforms). In real-time graphics this isn't uncommon, with fixed-function texture decompression units and custom decompression shaders used for just this reason.
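As a CPU-only sketch of the overlapping idea, the function below decompresses the next chunk on a worker thread while the current one is being processed. It uses zlib and a single-worker thread pool purely as stand-ins; on real hardware the decode would more likely sit on a spare core or a dedicated unit. The function name and the compute callback are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def process_overlapped(chunks, compute):
    """Decompress chunk i+1 on a worker thread while computing on chunk i."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(zlib.decompress, chunks[0])
        for nxt in chunks[1:]:
            current = pending.result()                   # wait for this chunk
            pending = pool.submit(zlib.decompress, nxt)  # start decoding the next
            results.append(compute(current))             # overlaps with the decode
        results.append(compute(pending.result()))
    return results

# Hypothetical data: five compressed chunks, each 1000 copies of one byte value.
chunks = [zlib.compress(bytes([i]) * 1000) for i in range(5)]
print(process_overlapped(chunks, sum))
```

If the compute step takes at least as long as the decode, the decompression cost is hidden apart from the very first chunk.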
The idea of custom ML acceleration hardware is an obvious one, and several companies produce such products. GPUs like NVIDIA Pascal are adding features specifically for ML techniques, and FPGAs have been used for experiments in this field. It's something I've thought about a few times; I know enough RTL to be dangerous with an FPGA ;) but I have too many other things on at the moment.
Hopefully Pascal will sell enough to encourage the design of large MLPUs (Machine Learning Processing Units) separate from GPUs. It's likely they will share some parts with GPUs due to economics (at least in the medium term), but adding some specific hardware for ML would be awesome.
I did some work in a previous life on custom servers, and I can see some good possibilities. A hybrid CPU, GPU and FPGA on a fast bus with the right software could be a potential ML winner. Intel could easily use MIC instead of GPUs. I suspect there is a unicorn start-up there! :D