Show HN: NN-512 – Generate standalone C code for neural nets (nn-512.com)
111 points by 37ef_ced3 on Dec 3, 2020 | 28 comments


NN-512 is an open-source Go program that generates fully AVX-512 vectorized, human-readable, stand-alone C implementations of convolutional neural nets

The generated C code is an example of AVX-512 programming using GCC's AVX-512 intrinsics. AVX-512 is exciting because masking simplifies edge cases (partial loads, partial stores, etc.), there are 32 wide vector registers, and the shuffle/permutation instructions are excellent (in particular, the two-input permute by variable). Recent versions of GCC produce very good object code from C intrinsics
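
For anyone who hasn't used these intrinsics, here is a minimal sketch (mine, not taken from the generated code) of the two features mentioned above: masked loads/stores for ragged edges, and the two-input variable permute:

    #include <immintrin.h>

    /* Copy n <= 16 floats using a mask instead of a scalar tail loop */
    static void copy_tail(const float *src, float *dst, int n) {
        __mmask16 m = (__mmask16)((1u << n) - 1);  /* low n lanes enabled */
        __m512 v = _mm512_maskz_loadu_ps(m, src);  /* masked load, zeros elsewhere */
        _mm512_mask_storeu_ps(dst, m, v);          /* masked store */
    }

    /* Select 16 floats from the 32-float concatenation of a and b,
       per-lane, according to the indices in idx (vpermt2ps) */
    static __m512 pick(__m512 a, __m512i idx, __m512 b) {
        return _mm512_permutex2var_ps(a, idx, b);
    }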

The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud instances. For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)

As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation


Is there the option to use AVX2 (256-bit) instead of AVX-512 (which can cause thermal throttling on basically every chip[a])? Now, if Intel can get AVX-512 working well, that’s something else.

Side question: do any Ryzen processors support AVX-512? AFAIK, they only support up to AVX2.

[a]: IIRC, some tests show AVX2 code actually being faster than the equivalent AVX-512 code because AVX2 doesn’t cause the processor to throttle so hard


No support for AVX2. AVX-512 is the first really nice (from the programmer's perspective) SIMD instruction set on x86-64 CPUs, and it's significantly different from what came before. AVX-512 is not just a wider version of AVX2

NN-512 explores the way Winograd and Fourier convolutions can be done when you have 32 512-bit vector registers. Four 8x8 Winograd tiles simultaneously, four 8x8 Fourier tiles interleaved to form a 16x16 tile for strided convolutions, and so on. These multi-tile operations don't work with AVX2: too few registers, and each register is too narrow

GCC has only very recently come into a state where it properly supports AVX-512. For example, before GCC 9.1, GCC would split FNMADD into xor-negation followed by FMADD (doing an extra xor, using an extra register for the negation constant, instead of just using the FNMADD instruction).
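
A trivial illustration of that case (my sketch, not NN-512 code): with GCC 9.1 or later this compiles to a single vfnmadd instruction, while older GCC negated one operand with an extra vxorps (and a register holding the sign-bit constant) before an FMADD:

    #include <immintrin.h>

    static __m512 fnmadd_example(__m512 a, __m512 b, __m512 c) {
        return _mm512_fnmadd_ps(a, b, c);  /* computes -(a * b) + c */
    }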

Eventually the hardware will mature, too


AVX2 also causes various degrees of thermal throttling depending on which chip you're using (Haswell, for example). AVX-512 on Ice Lake (client), however, is quite good and causes very little throttling versus its sustained speeds (e.g. my Ice Lake laptop @ 3.6GHz or whatever it is). Ultimately you have to do the benchmarking yourself, and if you care about inferences/sec you'll have to think about this stuff.

That said, AVX2 support would also be cool simply because it's more readily available to use on more platforms, not just my laptop...


AVX-512 causes just as much if not more throttling in my experience? In what case are you finding that AVX-512 runs at lower voltages than AVX2?


There's no way an AVX2 version of this would run as fast, even taking throttling into account.

The throttling is aggressive in mixed code paths, but not really all that bad if you're almost exclusively running vector code.

AVX2 throttled a fair bit in its first implementation, and the penalty has generally been reduced since then. I'd expect something similar here; it's just that Intel chips have barely changed in three years.

And as the author says, 512 is a game changer, not just a wider AVX2. It's much more flexible.


Out of curiosity, have you benchmarked this against some of the more standard NN libraries running on CPU?


No, the convolution generators were written to saturate the hardware

For example, under ideal conditions (no out-of-register memory access, no waiting on dependencies) you can sustain 27 single-precision FMADDs per CPU-core per cycle on a particular Skylake-X (i.e., approx 1.7 _mm512_fmadd_ps per cycle, each yielding 16 multiply-adds)

As soon as you start accessing memory, that number drops to about 20 FMADDs. With direct convolution methods (1x1 and arbitrary), the best you can do is achieve that, and NN-512 comes close

With the Fourier and Winograd convolutions, you start being limited by memory bandwidth, but the reduction in FMADDs that these methods provide means you end up ahead: your "effective" FMADD rate is much higher than what is possible through direct convolution. For example, NN-512 can exceed 48 effective FMADDs per cycle (on the 27 peak FMADD machine) with Winograd-Cook-Toom-Lavin, if the tensor is deep enough (enough channels)
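
For a rough sense of where the 48 figure sits, here's my back-of-the-envelope, assuming the 8x8 tiles are Lavin-style Winograd F(6x6, 3x3) for 3x3 filters (my assumption, not something stated in the NN-512 docs):

    #include <stdio.h>

    int main(void) {
        /* direct 3x3 convolution: 9 multiplies per output point
           F(6x6, 3x3): 8*8 = 64 multiplies per 6*6 = 36 output points */
        double direct = 9.0;            /* multiplies per output, direct 3x3 */
        double winograd = 64.0 / 36.0;  /* multiplies per output, F(6x6, 3x3) */
        printf("theoretical reduction: %.2fx\n", direct / winograd); /* ~5.06x */
        printf("realized gain: %.2fx\n", 48.0 / 27.0);               /* ~1.78x */
        return 0;
    }

So 48 effective FMADDs per cycle on the 27-FMADD machine is a realized ~1.78x, well under the ~5.06x ceiling once the tile transforms and memory traffic are paid for
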

So, NN-512 succeeds in saturating the hardware. Essentially all the time is spent in the matrix multiplications, doing FMADDs, or being blocked bringing half-precision weights into register for Fourier or Winograd

Until I generate a table of comparisons, you can use the previously stated number to do rough comparisons against the literature or other software packages: 18 DenseNet121 inferences per CPU-core per second on a cheap Skylake-X cloud instance


> For example, NN-512 can exceed 48 effective FMADDs per cycle (on the 27 peak FMADD machine) with Winograd-Cook-Toom-Lavin, if the tensor is deep enough (enough channels)

Roughly how many channels do you need for this approach to be worthwhile?


Enough that the data panel of the input tensor fills the thread's share of the L2 cache, and the output tensor is of similar depth

So it depends on the cache size, but you can think of it as being about 512 channels in, 512 channels out, something like that


Hi, how does it compare to https://github.com/pytorch/FBGEMM?


Great idea.


I've heard that heavy sustained AVX3 / AVX-512 workloads have the potential to damage CPUs [0][1]. Would running/testing/playing around with this software potentially risk damaging my CPU? Or is this only a risk when overclocking?

I'm not being facetious with this question at all - more genuine curiosity / responsible preventative caution for my own machines, because I'm very much interested in playing around with this if it's reasonable to do so.

0: https://news.ycombinator.com/item?id=22382946

1: https://news.ycombinator.com/item?id=14426798


The reality is that AMD's CPUs don't yet properly support AVX-512, and Intel's CPUs provide good implementations (e.g., 3 cycle latency for the useful AVX-512 shuffle/permutes) with a big downclocking caveat

AVX-512 will be great, eventually


    Copyright (C) 2019 [ 37ef ced3 3727 60b4 3c29 f9c6 dc30 d518 f4f3 4106 6964 cab4 a06f c1a3 83fd 090e ]

37ef_ced3, I'm curious: is this a signature? Is the idea to assert copyright while maintaining anonymity, until proof of ownership is required?


That's right, SHA-256 of my (unimportant) identity, with salt


Very nice. If you've tried using TFLite-Micro, you've encountered what C++ looks like in the hands of academics.

Any plans to support RNN layers?


Thanks. RNN layers later, if I can find the time


This is an interesting project. I made a somewhat similar NN-to-C "compiler", but that was for feedforward networks only. Could this in theory be extended to support models saved in common formats (such as H5)?


The network parameters are passed in as a struct of float arrays. This struct can simply be read from a file or a socket, or filled in from some other, more complicated format. This is explained in every generated header file
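
As a sketch of what that can look like (the names here are illustrative placeholders, not the actual identifiers in a generated header):

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the generated struct of float arrays; the real one has
       one array per layer parameter, with sizes fixed by the graph */
    typedef struct {
        float conv1Weights[64 * 3 * 7 * 7];
        float conv1BnMeans[64];
        /* ... */
    } ExampleParams;

    /* Read the whole struct as raw floats from a file */
    static ExampleParams *loadParams(const char *path) {
        ExampleParams *p = malloc(sizeof *p);
        FILE *f = p ? fopen(path, "rb") : NULL;
        if (!f || fread(p, sizeof *p, 1, f) != 1) {
            if (f) fclose(f);
            free(p);
            return NULL;
        }
        fclose(f);
        return p;
    }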


Very, very nice project.

Both style and substance.

I'm very glad I found this. If only all software projects on the internets had the same "to-the-point-edness" ...


Very cool project, I’d be interested in something like this for RNN/LSTM inference. Nice work!


I think this is a very interesting project.

I am curious what direction you are going.

Does it support dilated convolutions, like Wavenet uses?

Can it implement transformers?


As for dilated (and also grouped, like ResNeXt) convolutions, there is full support

The dilated (general-purpose, fallback) convolution algorithm executes FMADDs not much slower than the 1x1 convolution. It makes slow im2col-style approaches unnecessary. Example here: https://nn-512.com/example/4

The idea is to split the input and dilated weight tensors up according to the stride, and then do 1x1 convolutions with accumulation at heightwise and widthwise offsets
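
A naive scalar reference of the underlying idea (illustrative only, single output channel, "valid" padding, nothing like the generated code): each weight tap behaves like a 1x1 convolution applied at a dilated offset, accumulated into the output

    /* Dilated KxK convolution over a CxHxW input, one output channel */
    void dilated_conv_ref(const float *in, const float *w, float *out,
                          int C, int H, int W, int K, int dilation) {
        int OH = H - (K - 1) * dilation;  /* "valid" output height */
        int OW = W - (K - 1) * dilation;  /* "valid" output width  */
        for (int y = 0; y < OH; y++)
            for (int x = 0; x < OW; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)
                    for (int kh = 0; kh < K; kh++)
                        for (int kw = 0; kw < K; kw++)
                            acc += in[(c * H + y + kh * dilation) * W
                                      + x + kw * dilation]
                                 * w[(c * K + kh) * K + kw];
                out[y * OW + x] = acc;
            }
    }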

No transformers at the moment. You can see what is supported here: https://nn-512.com/docs/graph


Are there any examples of doing training available?


Train on a GPU with any tool, then write the weights, biases, and batchnorm parameters to a file (just a sequence of floats, no other formatting). NN-512 does inference only


Does this run on GPUs and/or TPUs?


No.



