Friday, 13 May 2016

Machine Learning Performance


The need for speed

Coming from the real-time world (games and graphics), I find current machine learning training a shock. Even small, simple training tasks take a long time, which gives me plenty of time to think about the performance of my graphs.
Currently most of the open source deep learning stacks consist of a C++ and CUDA back-end driven by a Python/R/Lua front-end. The CUDA code tends to be fairly generic, which makes sense from an HPC and experimental point of view.
It's important to note that HPC code is highly optimised, but it tends to rely on standard interfaces with relatively generic back-ends. For example, BLAS is an old FORTRAN standard for linear algebra with numerous back-ends, including optimised x64, CUDA and OpenCL implementations. However it only accelerates the matrix multiplies, and only in a few data formats; other data-specific optimisations, like overlapping format conversions with compute, aren't in its remit.
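As a concrete illustration (a minimal sketch using SciPy's BLAS bindings, nothing to do with any particular deep learning framework's internals): the GEMM call itself is heavily optimised, but getting the data into one of the handful of supported formats happens outside the library, as an extra pass over memory.

    import numpy as np
    from scipy.linalg.blas import sgemm   # single precision GEMM from whichever BLAS SciPy links against

    # BLAS only understands a few dense formats; any conversion (here float64 -> float32)
    # is a separate pass over memory that the library knows nothing about.
    a64 = np.random.rand(1024, 1024)
    b64 = np.random.rand(1024, 1024)
    a32 = a64.astype(np.float32)          # extra memory traffic, outside BLAS's remit
    b32 = b64.astype(np.float32)

    c32 = sgemm(alpha=1.0, a=a32, b=b32)  # the only part BLAS actually accelerates
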
It's quite a different optimisation strategy from the real-time world, which tends to be less portable but makes wider use of the hardware. That makes sense given the different markets and constraints: HPC runs on large clusters that are relatively out of the control of the data scientist, whilst real-time is likely running on a few known sets of hardware where even a few percent gain is worth the extra effort.
So the obvious question for me is: how fast could a custom path be? If we were to change the data formats, look at cache usage, etc., how fast could we make an end-to-end training path for a single graph?
It's unlikely you'd get any speed-up from pure ALU optimisations; most HPC back-ends will be issuing ALU ops as fast as they can with the data formats they are given. Any gains are going to come from memory, format and overlapping optimisations.

Data Format 

The HPC world traditionally uses double (64 bit) floating point numbers, but GPUs really don't like doubles; even the best (the latest NVIDIA Pascal chip) is significantly slower with doubles than with the smaller floating point formats. Deep learning is relatively immune to precision problems, so using smaller floats is an obvious win. It's the reason the Pascal chip doubles in performance with each step down in size: floats (32 bit) run 2x and halfs (16 bit) 4x as fast as doubles.
However this isn't necessarily the limit of format optimisations. With limited-range inputs in many fields, it begs the question of whether integer maths might be a better option. Many inputs and weights are normalised to 0 to 1 or -1 to 1, which might allow fixed point integers to be used. It's no instant win, but it's worth investigating on some datasets/platforms.
For example, I'm doing some work where the input is many single bits from a bitmap, but floats are used in the neural network layers. The output is ultimately three probabilities, and the highest of those is selected. I suspect there are some nice optimisations to be had if I froze the neural graph and selected the best formats through it, taking into account cache sizes etc.
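To make the fixed point idea concrete, here is a minimal sketch (plain NumPy with my own made-up quantise helpers, not something from any ML library) of storing values normalised to -1..1 as 8-bit integers and doing the multiply-accumulate in integer maths:

    import numpy as np

    SCALE = 127  # map [-1, 1] floats onto signed 8-bit fixed point

    def quantise(x):
        # clamp then scale to int8; the precision loss is the trade-off for size and bandwidth
        return np.clip(np.round(x * SCALE), -127, 127).astype(np.int8)

    def fixed_point_layer(inputs_q, weights_q):
        # accumulate in int32 so the sums can't overflow, then convert back to float once at the end
        acc = inputs_q.astype(np.int32) @ weights_q.astype(np.int32)
        return acc.astype(np.float32) / (SCALE * SCALE)

    weights = np.random.uniform(-1, 1, (784, 64)).astype(np.float32)
    inputs = np.random.uniform(-1, 1, (32, 784)).astype(np.float32)

    approx = fixed_point_layer(quantise(inputs), quantise(weights))
    exact = inputs @ weights
    print("max abs error:", np.abs(approx - exact).max())

Whether the int8 path is actually faster depends entirely on the hardware; on some platforms the conversion cost eats the gain, which is exactly why it needs measuring per dataset/platform.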

Memory

Memory use is a real Achilles heel of most current ML toolkits, with essentially no real attempt to work with smaller memory footprints. Hence a GPU with 2 GiB of RAM is barely usable, with 8-24 GiB being standard for GPUs; CPU setups are at least a factor of 10 bigger.
The usual reason given is 'Big Data', but it's worth looking at the savings we could make. Smaller memory footprints may give us performance increases, as well as the obvious advantage of fitting larger data sets onto smaller hardware. Apart from using smaller data types throughout and not storing extra copies (harder than it sounds, as the front-ends tend to be in a garbage collected language), a non-obvious thing to investigate is data compression.
Much of the data used in ML is likely highly compressible, and with fast codecs it may actually be a win to store it compressed in memory and convert it as it is used.
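A minimal sketch of the idea (NumPy plus Python's built-in zlib; a real system would want a much faster codec such as LZ4): keep the training set compressed in RAM and decompress one batch at a time as it is needed.

    import zlib
    import numpy as np

    def compress_batches(data, batch_size):
        # store each batch as a compressed blob plus enough info to rebuild the array
        blobs = []
        for start in range(0, len(data), batch_size):
            batch = np.ascontiguousarray(data[start:start + batch_size])
            blobs.append((zlib.compress(batch.tobytes(), 1), batch.shape, batch.dtype))
        return blobs

    def decompress_batch(blob):
        raw, shape, dtype = blob
        return np.frombuffer(zlib.decompress(raw), dtype=dtype).reshape(shape)

    # sparse-ish data (lots of zeros) compresses well; dense random noise would not
    data = (np.random.rand(10000, 784) > 0.9).astype(np.float32)
    blobs = compress_batches(data, 256)

    print("compression ratio: %.1fx" % (data.nbytes / sum(len(b[0]) for b in blobs)))
    batch0 = decompress_batch(blobs[0])   # decompress only when the batch is actually used
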

Combining and Overlapping

Due to memory constraints and ease of use, there is very little pipelining at the macro scale in ML; each operation is treated as a separate task. Combining and overlapping different operations may make more efficient use of the hardware. The non-linear function at every neural layer, for example, could be applied before storing the tensor and re-reading it, or even implemented with look-up tables instead of float ALUs.
This is also where certain platforms might win; for example, overlapping compression with ALU work may hide the cost completely and use cache more efficiently. You could also potentially use under-used components (such as the CPU in most GPU platforms). In real-time graphics this isn't uncommon, with fixed-function texture decompression units and custom decompression shaders used for just this reason.
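As a sketch of the look-up table idea (plain NumPy, purely illustrative; whether it beats the float ALUs depends on the hardware and how much precision the network can tolerate): pre-tabulate the non-linearity once and apply it as the layer output is produced, rather than as a separate pass over a stored tensor.

    import numpy as np

    # build a 1024-entry sigmoid table covering the range we expect pre-activations to fall in
    LUT_MIN, LUT_MAX, LUT_SIZE = -8.0, 8.0, 1024
    lut_x = np.linspace(LUT_MIN, LUT_MAX, LUT_SIZE)
    sigmoid_lut = (1.0 / (1.0 + np.exp(-lut_x))).astype(np.float32)

    def lut_sigmoid(x):
        # quantise the input into a table index; accuracy is limited by LUT_SIZE
        idx = np.clip(((x - LUT_MIN) / (LUT_MAX - LUT_MIN) * (LUT_SIZE - 1)).astype(np.int32),
                      0, LUT_SIZE - 1)
        return sigmoid_lut[idx]

    def fused_layer(inputs, weights):
        # 'fused': the non-linearity is applied to the matmul result while it is still hot,
        # instead of writing the tensor out and re-reading it for a separate activation pass
        return lut_sigmoid(inputs @ weights)

    x = np.random.randn(32, 784).astype(np.float32)
    w = np.random.randn(784, 64).astype(np.float32)
    exact = 1.0 / (1.0 + np.exp(-(x @ w)))
    print("max abs LUT error:", np.abs(fused_layer(x, w) - exact).max())
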

Custom Hardware?

The idea of custom ML acceleration hardware is an obvious one, and several companies produce such products. GPUs like NVIDIA Pascal are adding features specifically for ML techniques, and FPGAs have been used for experiments in this field. It's something I've thought about a few times (I know enough RTL to be dangerous with an FPGA ;) ) but there are too many other things on at the moment.
Hopefully Pascal will sell enough to encourage the design of proper MLPUs (Machine Learning Processing Units) separate from GPUs. It's likely they will share some parts with GPUs for economic reasons (at least in the medium term), but adding some ML-specific hardware would be awesome.
I did some work in a previous life on custom servers, and I can see some good possibilities. A hybrid CPU, GPU and FPGA on a fast bus with the right software could be a potential ML winner; Intel could easily use MIC instead of GPUs. I suspect there is a unicorn start-up in there! :D


Monday, 9 May 2016

A large scale archival storage system - ERC Reinforcement

About 5 years ago, I came close to starting a tech company selling archival-level storage systems. Part of the idea was an algorithm and technique I developed that stores large amounts of data reliably and efficiently.

I wrote up a patent, even pushed it through the 1st phase of the UK patent system, but then decided not to go ahead. Effectively the not-patent is now open for anyone.

It's designed for large scale systems, thousands of storage devices, and isn't necessarily fast. However for large archival storage it saves a significant number of storage devices over traditional RAID systems. So if data retention is really important, this is an idea that might be worth investigating...

The best example of its utility is that a 1000-storage-unit system would have higher resilience and more disks available for actual storage than a RAID71 system (the best conventional scheme) of similar size.

Anyway, I found one of the drafts so thought I'd publish it here (including typos!). I never actually published it beyond the entry to the patent office, so figured I might as well do it now!



Wednesday, 20 April 2016

Flu + teaching AI to play games

Flu and strange Ideas

I have a cold/flu thing at the moment and feel rotten. Due to interactions with my general health, when I get even a mild cold or flu I get pain everywhere, and because of the level of painkillers I already take, I just have to grin and bear it. The way I tend to cope is to keep my mind occupied to try not to think about it. Strangely this is often a creative time in terms of random thoughts; I guess the body pushes more natural drugs into me to try and counteract the pain, leaving me a bit 'drugged'.

Thinking about teaching (not training) AIs

Last night I was deep in thought about deep AIs, which as you may have noticed from my recent blog posts is something I'm really enjoying, and TBH I can see myself working in the field. Before the fame of the recent AlphaGo wins, DeepMind were tackling other, simpler games on the Atari 2600. The paper "Human-level control through deep reinforcement learning" successfully learnt to play a number of Atari 2600 games via unsupervised iterations of playing. Essentially it learnt from nothing how the controls worked and what they did in terms of the game.

Playing Breakout for example, it gradually learns that the left button moves the little graphic at the bottom of the screen to the left. Every now and again it would randomly be under the ball, which bounces back and hits the blocks, which gives you some points. The AI uses the game score as its learning signal, always aiming to improve its score.
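For reference, the standard Q-learning update that DQN approximates with a neural network looks like this, where s is the screen state, a the joystick action, r the change in score and gamma the discount on future rewards:

    Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

The score is the only supervision signal; everything else the network has to work out for itself.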

Which, if it sounds like how humans learn to play games without instructions, is exactly the point!

This was published in the prestigious journal "Nature" last year and is generally considered a major step towards 'General AI', that is, AIs that learn things by observation and trial and error, which is one of the major parts of mammalian intelligence like our own. It does however have limitations on the complexity of games it can learn directly, as it relies on initial random chance and short-term goal rewards. It's a building block rather than the end of the line.

For example, if you take a more complex game, such as a graphic adventure like 'The Secret of Monkey Island' which uses a mouse 'point and click' control system, it's unlikely to do well, as the odds of randomly moving the mouse and clicking something that increases your score are tiny. Similarly for a hypothetical Minecraft AI: it would probably learn to explore and avoid/fight enemies, but actual crafting is unlikely (how likely is it that it randomly learns to punch trees, make planks, make a crafting bench and then make a pickaxe?). The probability of a random string of actions producing a direct score increase limits the complexity of the games the system can handle as it stands.
AlphaGo is much more advanced, and included playing itself to teach itself new strategies.

It dawned on me that a simple modification of the Q-learning AI for Atari 2600 games might improve learning rates and extend the complexity of games it could play. It's a technique nature uses, so is likely a good strategy: the idea of 'teaching'.
This is fundamentally different from supervised learning, even though at first glance they seem the same.

Supplement Training with Teaching

Teaching is a form of showing inputs to the AI, versus supervised learning which shows the input and the desired output. Just like in nature, many mammalian brains are shown how to do things whilst young. A litter of cats will learn to hunt via play supervised by their parents; this increases the kittens' chance of surviving and breeding, even though it puts more stress on the parents, who have to feed and look after them until they are trained enough to go it alone.
It can be thought of as a form of indirect neural copying: the parent moulds their young's neural network towards known good strategies, but whether the kitten takes on that advice as-is, or modifies or even rejects it, is purely the result of its own neural network seeing the benefit. This has several evolutionary advantages:

  1. As they are indirectly replaying rather than directly copying, new mutations and modifications will occur as the child incorporates its own input stimuli.
  2. Obsolete knowledge will die out over generations, as it won't bring the reward it had for the parents.
  3. The young short-cut basic training by learning from their parent, which should allow quicker adaptation when a rare useful event occurs.
In direct terms of the game-playing AI, this postulates that we can improve the adaptation rate and/or the ability to perform complex actions by showing the AI how to play the game.

In the Breakout case, if for the first N games we, in the role of parental human, take over the controls (fully observed by the AI) and move the bat back and forward, hitting the ball a few times, we should short-cut the generations the AI spends just learning how to control the bat and what the point of the game actually is.

I'm sure I'm not the only person to have thought of the idea, but so far I haven't found any published papers on it (please let me know if you know of some). It's similar to genetic algorithms in terms of generational transfer, but in a fundamentally different way: rather than passing down genes, this is passing down successful strategies to the young.
The next step is to hack together a small modification of the Atari 2600 Q-learning code to let me take over the controls for some number of initial games.
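A minimal sketch of what that modification might look like (generic Python around a hypothetical DQN agent; env, agent and get_human_action are stand-ins for whatever the real Q-learning code provides, not actual DeepMind/ALE API): for the first N episodes the human supplies the action, but every transition still goes into the replay memory exactly as if the agent had chosen it.

    HUMAN_EPISODES = 20   # length of the 'teaching' phase, a knob to experiment with

    def run_episode(env, agent, get_human_action, teacher_controls):
        # one game: if teacher_controls is set a human picks the actions,
        # but the agent observes and learns from every transition regardless
        state = env.reset()
        done = False
        while not done:
            if teacher_controls:
                action = get_human_action()          # read the keyboard/joystick from the human 'parent'
            else:
                action = agent.select_action(state)  # normal epsilon-greedy Q-learning choice
            next_state, reward, done = env.step(action)
            agent.replay_memory.add(state, action, reward, next_state, done)
            agent.train_step()
            state = next_state

    def train(env, agent, get_human_action, total_episodes=500):
        for episode in range(total_episodes):
            run_episode(env, agent, get_human_action,
                        teacher_controls=(episode < HUMAN_EPISODES))

The interesting experiments are then how large HUMAN_EPISODES needs to be, and whether the taught agent overtakes an untaught one in score per training frame.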

If the trends show the idea has merit, adding a more formalised teaching path (recording inputs perhaps) may let us explore more complex games; for example, if we show enough Minecraft sessions where you make a pickaxe, it's possible it might learn that and so move into a completely new phase.

More abstractly, it also brings in a social element to AIs that so far hasn't existed. In this model AIs have teachers, likely a form of parental figure that guides them until some point in their life...

Monday, 18 April 2016

The greatest lie ever told: Premature Optimisation is the root of all evil!

It's a lie because it implies there is a 'premature' phase of writing code, a phase of the research/project where you shouldn't worry about performance or power.

It's simply not true. There are times when performance isn't the greatest priority, but thinking about how your code affects these areas is never wasted.

At some fundamental level, software is something that turns power (electricity) into math. Any time you don't do that optimally, you waste power. The only time that doesn't matter is when you have a small problem and lots of power; in practice, few interesting problems are lucky enough to get away with that.

If you work in machine learning or big data, the likely limit to what you can do is how many processors (whether CPU, GPU or FPGA) you can throw at the problem. Assuming you can't scale infinitely, the results you can get out of a finite set of hardware will largely be determined by how performant your code is.

When you've designed your latest ANN that will scale across a few hundred GPUs, figuring out how to maximise memory usage or processing power can enable dramatic savings. Even 10% per machine adds up to a lot of extra performance, and that means bigger problems can be solved.

Many problems are of the type: perform N operations, where N is large. To really get the best result, those operations need to be thought about with performance as a priority; there is no such thing as premature optimisation there! Whilst there is lots of code where it's not as important, the core code should be thought about in terms of optimisation from the earliest phase.

At the extreme that even means custom hardware: optimising the hardware that will run the core algorithms to run just those algorithms very well. In deep learning, the 'accelerator' (clusters of GPUs or FPGAs) is vital, and it seems likely that at some point custom ASICs will enter the datacenter just to run the AI code at the heart of many big data algorithms.


Saturday, 2 April 2016

Teaching machines to render

I've been studying AI tech a fair bit recently for a variety of reasons. There are lots of areas where I want to explore using AI as solvers/approximators, but as someone who is generally employed to do graphics stuff, there's always an interest in applying AI technology to rendering. Currently, except for a few papers on copying artistic styles onto photos, it's not yet a major discipline.

The real big thing to take away from AI like deep neural nets is that they are approximate function solvers that are taught by showing them data rather than by being explicitly programmed. Once trained, they evaluate the inputs through their learnt 'function' and give you a result.
With enough nodes and layers in the network and a lot of training, they can give a solution to any equation you have existing data for.

In rendering, the classic Kajiya rendering equation is what real-time and offline renderers attempt to solve. The reason rendering takes so much compute power is that solving the equation directly is infeasibly complex, and the approximations we use have massive dimensional complexity.
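For reference, the rendering equation in its usual form says the outgoing radiance at a point is its emission plus all incoming radiance weighted by the surface's BRDF; the recursion hidden in the incoming term is what makes direct solutions so expensive:

    L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o) \, L_i(x, \omega_i) \, (\omega_i \cdot n) \, d\omega_i
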

So the question is can we replace parts (all?) of the Kajiya equation with a trained deep AI?

A rule of thumb in deep AI is that if humans can do something in 0.1 seconds, then it's tractable with the layer and node counts we have now. We know that many artists are capable of producing a good approximation to Kajiya in real time, which seems to imply that a neural network might be able to do parts of it.

Teaching an AI to paint/shade is a pretty unexplored area; at this point, most uses of deep AI reduce complex data, whereas paint/shade is data expansion.

TensorFlow


TensorFlow is Google's open source machine learning system; it consists of a Python/C++/CUDA framework for manipulating and training AI systems. My explorations of AI required me to get some underlying idea of how it works at the machine level, so contrary to most introductions I wrote a small library in C++ to get a feel for the data at the low level. Whilst in practice I won't use it for anything serious and am using TensorFlow instead, going through the exercise of writing my own models, data normalisation and training at a low level helps me understand how TensorFlow and other libraries actually work.


The TensorFlow tutorials cover using it for the MNIST classifier problem. MNIST is a database of handwritten digits, and the AI is trained to output the actual number each image represents. Each entry in the training set consists of a 28x28 greyscale image and the number it represents. A deep ANN (Artificial Neural Network), usually with convolutional nets as well, is then trained, and the score shows how well the AI is doing.

This is classic deep AI: taking large structured data (pictures, audio, words) and reducing it down to a simpler classification. In the MNIST case it takes 784-dimensional data (the input image) and reduces it down to the 10 numeral digits.
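The simplest version from the tutorial boils down to something like this (a sketch of the classic softmax regression in the TensorFlow Python API, hyper-parameters roughly as per the beginner tutorial): 784 inputs in, 10 class probabilities out.

    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

    x = tf.placeholder(tf.float32, [None, 784])      # flattened 28x28 image
    y_ = tf.placeholder(tf.float32, [None, 10])      # one-hot digit label

    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)           # 784 dimensions squashed down to 10

    cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        for _ in range(1000):
            batch_x, batch_y = mnist.train.next_batch(100)
            sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
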

This is a relatively simple problem for AI, hence its use as a classic tutorial, but even so training can take a while. TensorFlow has a GPU back-end, but it currently only supports CUDA under Linux, so using an Apple AMD GPU on OS X rules the GPU back-end out completely!

What I'm currently working on is the first step in my paint/shade AI idea. I'm creating a backwards MNIST: the aim is to teach the AI that, given a numeric digit, it should output a 28x28 image. TensorFlow's Python interface makes it easy to try different models out, so I'm hopefully converging on a working solution.
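One possible shape for such a network (a sketch only, not necessarily the model I'll end up with; the layer sizes are arbitrary): a one-hot digit goes in, a hidden layer expands it, and a sigmoid output layer produces the 784 pixel values, trained against real MNIST images of that digit with a simple squared error.

    import tensorflow as tf

    digit = tf.placeholder(tf.float32, [None, 10])      # one-hot digit in
    target = tf.placeholder(tf.float32, [None, 784])    # a real 28x28 image of that digit

    W1 = tf.Variable(tf.truncated_normal([10, 256], stddev=0.1))
    b1 = tf.Variable(tf.zeros([256]))
    hidden = tf.nn.relu(tf.matmul(digit, W1) + b1)

    W2 = tf.Variable(tf.truncated_normal([256, 784], stddev=0.1))
    b2 = tf.Variable(tf.zeros([784]))
    image = tf.nn.sigmoid(tf.matmul(hidden, W2) + b2)   # pixel intensities in 0..1

    loss = tf.reduce_mean(tf.square(image - target))    # squared error against the real image
    train_step = tf.train.AdamOptimizer(0.001).minimize(loss)

One caveat: a plain squared error tends to learn the 'average' image of each digit, which is part of why trying different models and losses is the interesting bit.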

The compute problem makes me tempted to try cloud GPU vendors; even a few hours of 4x NV Titans would make my life much easier. If anyone has any recommendations I'd be happy to hear them. I'm deliberately trying to avoid getting sucked into seeing how fast I can make TensorFlow on my machines, as I want to work on the algorithm side of things rather than the innards of the math kernels, but when it takes hours for even simple training it's tempting to see how I could improve things...

Where I want to go


Where I want it to go is to train an AI that, given some basic scene data as input, is able to render an image without being explicitly programmed. I seriously doubt it will be faster than traditional rendering by a wide margin, but it would offer the interesting possibility of copying the look of movie or offline rendering onto new data sets. Take something like ambient occlusion, which is extremely expensive to render: if we can train an AI to produce a 'best guess', it may lead to new approaches to the hard problems the Kajiya equation gives us.

Of course I know most people will think it's nuts and the wrong way around of working, but it's interesting to me and I like thinking laterally :)


Sunday, 6 March 2016

Skinning Roger Rabbit AKA Are we there yet?

I was lucky enough to grow up in the era when home computers and consoles were new, and a movie using CG was something worth talking about if anybody actually managed to use it (I'm looking at you, Max Headroom and Tron, both designed to look like CG but mostly made with old fashioned effects because computers weren't fast enough). One of my favourite movies didn't (AFAIK) use CG but would use it a lot today: Who Framed Roger Rabbit.


Roger Rabbit DVD Cover
Roger Rabbit is a clever comedy/whodunit about a murder, with the main suspect being the eponymous Roger Rabbit. The twist is that Roger Rabbit is a rabbit, a cartoon rabbit. The world is set up so that Hollywood has a district called Toon Town, where cartoons actually exist: Bugs Bunny and Mickey Mouse are real and act in TV shows and movies just like other actors. The star (apart from Roger) is Eddie Valiant, played by Bob Hoskins, a classic 1920s/30s grizzled private eye who hates Toons after one kills his partner.

The part that stands out for me (even today) was the way the Toons were able to exist in our world and we could go into theirs. So a movie set has Dumbo walking around the real world, just doing what a cartoon elephant would do if placed in the real world. Mixing cartoons with real footage is not a new technique, but the world Roger Rabbit presents feels so much more interesting than our normal, boring, all-reality world.

As a 13 year old and a computer nerd, I realised that one day we should be able to do that: capture reality, edit it and show it to a player/viewer in real-time, bringing Toon Town to our real world. Of course neither the technology nor my ability to execute were anywhere near the quality of the film. However, now nearly 30 years on, I feel it might be possible.

Today we would call this AR - Augmented Reality, mixing reality with CG in a real-time view of the world. HoloLens from MS appears to be the obvious machine to do it, as it comes with depth cameras, visual cameras and a display that mixes reality and CG together.

It would be an interesting mix of advanced realtime CG and AI, but there are several stages before a complete 'Roger Rabbit' like world could exist.

Dr Seuss 3D movie graphics
The simplest path towards 'Toon Town' is skinning the real world: visually morphing the real world view into a non-reality-based view. This is probably the easiest part, as it's mostly a visual problem and less about understanding the physical properties of the world. Whilst it would be cool to walk down the street seeing it as a cartoon world, or in another art style, it would be just that, an illusion, without the other parts able to understand the world and insert action into it.

The next phase is to use a Toon avatar to represent someone you are remote-chatting with. It would likely need to animate to match the other person's emotional state and facial expressions, and also intelligently place itself into the real world; having someone stuck inside a wall would destroy the illusion.
An additional complexity is how to deal with the dynamism of the real world; luckily Toons don't have to respond in a normal manner. If a door opens into a Toon, it's okay to play a comedic reaction of a flattened Toon that pops back to 3D in a volume that can hold it.
In terms of programming complexity, most of the hard work is in understanding 3D space and giving the Toon body logic enough brains to create the illusion of interaction.

Understanding the physical world in real-time is going to be a tricky and complex problem that games haven't really had to tackle before. The physics sims we are used to are about applying semi-realistic motion to CG worlds; for good AR we need to understand the physical properties gathered through the cameras and inject them into the physics sim first, then run the sim, and then show the results.

Another stage would be to add basic AI Toons to the world: find free space and insert avatars into it. It's an approach I think of as "3D Clippy". Clippy, the bane of a thousand jokes, was a cartoon paper clip that popped up when using Office 95 and gave you 'handy' advice on doing things.

It would also need to identify things in the real world and react to them. 3D Clippy might be a helper like its 2D counterpart and might suggest you eat more fruit, so it would have to identify fruit in the world, point out a banana (for example) and tell you that you haven't had your '5 a day'.

Beyond this we start getting into philosophy, Turing AIs and all that. Will it be possible to run an AI Toon 24/7, existing in 3D space and only visible if you use the AR set? At this level we start to blur the lines between reality, AR and VR. If you throw AI-controlled robots into the real world, 'real' and 'virtual' become more and more arbitrary labels rather than useful distinctions.

Toon Town and the connected technologies and ideas you can derive from it seem to be the AR killer app. Stuck somewhere in the middle of winter? Simply load up the 'Summer in California' app and boom, instant sunny days are here. Need to do some plumbing? Load up a Plumber Toon app and get advice and instruction on how to do it, whilst being entertained as your assistant goes through all the fun a Toon would have working around water pipes! :)

It's clear that good AR is going to be about extracting data from the real world. I've been reading a lot on AI and data mining, trying to figure out how the current state of the art can be converted into the form I want. Unfortunately I don't have access to a HoloLens yet to start really exploring the idea.

HoloLens Minecraft AR
The early things will just use basic collision with the real world, but expect the next decade (or longer) of Siggraph papers to have a lot about turning data from depth and visual cameras into physics sims; we will also see a lot more 'AI' coming into rendering and physics simulation. Our physics and rendering sims are about to get a lot fuzzier, if AR takes off :)


Wednesday, 2 March 2016

Multi-frequency Shading and VR.


Our peripheral vision is low resolution but high frequency; our focused vision is high resolution but updates at a lower frequency. This is one of the reasons 60Hz isn't good enough for VR: at the edges most people can still consciously see the flicker.

Additionally, of course, VR is stereoscopic, so it requires two views of everything and low latency response. Whilst you can just run everything at 90Hz or even 144Hz, this is expensive in both performance and power.

Multi-frequency shading solves the issue by sharing, where possible, some calculations across time and space (viewports). Of course, for this to work, we need to break the rendering down into parts that are the same (or close enough) to be shared.

Perhaps the oldest split is diffuse versus specular. Diffuse lighting is only dependent on the light position and the surface being lit, so camera changes can be ignored; this has been used in lightmaps for a long time. For VR this means diffuse lighting can be shared between the eyes. Except for shadowing and GI, it also only needs updating at the rate objects move, which may be infrequently.

So let's start with that split and decouple diffuse lighting from the rest of the equation. The first problem is then how we store these diffuse light values so they can be quickly picked up when required. A classic approach is to use a hash over the spatial position as a key into a fixed sized cache. The diffuse update runs asynchronously at, for example, 30Hz; this is then looked up at full display frequency (say 120Hz), with specular and other high frequency effects added on top.

This is in essence what light maps do, but with a 0Hz update rate and the 3D spatial lookup replaced with a surface-based 2D lookup.

The question then becomes history management: how do you look up a diffuse value in space (hash storage?), and do you need some form of cache replacement to ensure that stale diffuse values get kicked out?
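A minimal sketch of the sort of structure I mean (written in Python for readability; a real implementation would live on the GPU): quantise the world position to a cell, hash it into a fixed-size table, and stamp each entry with the update it was written on so stale values can be evicted.

    CELL_SIZE = 0.25          # world-space quantisation of the diffuse cache
    TABLE_SIZE = 1 << 16      # fixed-size cache, power of two for cheap masking
    MAX_AGE = 4               # entries older than this many diffuse updates count as stale

    cache = [None] * TABLE_SIZE   # each slot holds (cell, diffuse_rgb, update_written)

    def cell_of(position):
        return tuple(int(c // CELL_SIZE) for c in position)

    def slot_of(cell):
        # any decent integer hash will do; Python's built-in hash is fine for the sketch
        return hash(cell) & (TABLE_SIZE - 1)

    def store_diffuse(position, rgb, update_index):
        cell = cell_of(position)
        cache[slot_of(cell)] = (cell, rgb, update_index)   # direct-mapped: collisions simply overwrite

    def lookup_diffuse(position, update_index):
        cell = cell_of(position)
        entry = cache[slot_of(cell)]
        if entry is None or entry[0] != cell or update_index - entry[2] > MAX_AGE:
            return None           # miss or stale: the caller falls back to recomputing diffuse
        return entry[1]
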

Again, the key to high quality and fast VR seems to me to lie in changing how we define a frame: decouple the display rate from the rest of the renderer, and let the brain and other effects paper over the differences. Since any error is only visible for a tiny amount of time, it's likely you won't notice that it's actually wrong sometimes.

However, whether this is true depends on the wet-ware in our heads, so the only way to know is to try it and see if anyone pukes or gets a splitting headache.

Deano out