Show HN: NN-512 – Generate standalone C code for neural nets (nn-512.com)
111 points by 37ef_ced3 on Dec 3, 2020 | 28 comments


NN-512 is an open-source Go program that generates fully AVX-512 vectorized, human-readable, stand-alone C implementations of convolutional neural nets

The generated C code is an example of AVX-512 programming using GCC's AVX-512 intrinsics. AVX-512 is exciting because masking simplifies edge cases (partial loads, partial stores, etc.), there are 32 wide vector registers, and the shuffle/permutation instructions are excellent (in particular, the two-input permute by variable). Recent versions of GCC produce very good object code from C intrinsics
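
For anyone who hasn't used these intrinsics, here is a minimal sketch (mine, not taken from the generated code) of the two features mentioned above: masked loads/stores for ragged edges, and the two-input variable permute:

    #include <immintrin.h>

    /* Copy n <= 16 floats using a mask instead of a scalar tail loop */
    static void copy_tail(const float *src, float *dst, int n) {
        __mmask16 m = (__mmask16)((1u << n) - 1);  /* low n lanes enabled */
        __m512 v = _mm512_maskz_loadu_ps(m, src);  /* masked load, zeros elsewhere */
        _mm512_mask_storeu_ps(dst, m, v);          /* masked store */
    }

    /* Select 16 floats from the 32-float concatenation of a and b,
       per-lane, according to the indices in idx (vpermt2ps) */
    static __m512 pick(__m512 a, __m512i idx, __m512 b) {
        return _mm512_permutex2var_ps(a, idx, b);
    }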

The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud instances. For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)

As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation


Is there the option to use AVX2 (256-bit) instead of AVX-512 (which can cause thermal throttling on basically every chip[a])? Now, if Intel can get AVX-512 working well, that’s something else.

Side question: do any Ryzen processors support AVX-512? AFAIK, they only support up to AVX2.

[a]: IIRC, some tests show AVX2 code actually being faster than the equivalent AVX-512 code because AVX2 doesn’t cause the processor to throttle so hard


No support for AVX2. AVX-512 is the first really nice (from the programmer's perspective) SIMD instruction set on x86-64 CPUs, and it's significantly different from what came before. AVX-512 is not just a wider version of AVX2

NN-512 explores the way Winograd and Fourier convolutions can be done when you have 32 512-bit vector registers. Four 8x8 Winograd tiles simultaneously, four 8x8 Fourier tiles interleaved to form a 16x16 tile for strided convolutions, and so on. These multi-tile operations don't work with AVX2: too few registers, and each register is too narrow

GCC has only very recently come into a state where it properly supports AVX-512. For example, before GCC 9.1, GCC would split FNMADD into xor-negation followed by FMADD (doing an extra xor, using an extra register for the negation constant, instead of just using the FNMADD instruction).
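
A trivial illustration of that case (my sketch, not NN-512 code): with GCC 9.1 or later this compiles to a single vfnmadd instruction, while older GCC negated one operand with an extra vxorps (and a register holding the sign-bit constant) before an FMADD:

    #include <immintrin.h>

    static __m512 fnmadd_example(__m512 a, __m512 b, __m512 c) {
        return _mm512_fnmadd_ps(a, b, c);  /* computes -(a * b) + c */
    }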

Eventually the hardware will mature, too


AVX2 also causes various degrees of thermal throttling depending on which chip you're using (Haswell, for example). AVX-512 on Ice Lake (client), however, is quite good and causes very little throttling versus its sustained speeds (e.g. my Ice Lake laptop @ 3.6GHz or whatever it is). Ultimately you have to do the benchmarking yourself, and if you care about inferences/sec you'll have to think about this stuff.

That said, AVX2 support would also be cool simply because it's more readily available to use on more platforms, not just my laptop...


AVX-512 causes just as much if not more throttling in my experience? In what case are you finding that AVX-512 runs at lower voltages than AVX2?


There's no way an AVX2 version of this would run as fast, even taking throttling into account.

The throttling is aggressive in mixed code paths, but not really all that bad if you're almost exclusively running vector code.

AVX2 throttled a fair bit in its first implementation, and the penalty has generally been reduced since then. I'd expect something similar here; it's just that Intel chips have barely changed in three years.

And as the author says, 512 is a game changer, not just a wider AVX2. It's much more flexible.


Out of curiosity, have you benchmarked this against some of the more standard NN libraries running on CPU?


No, the convolution generators were written to saturate the hardware

For example, under ideal conditions (no out-of-register memory access, no waiting on dependencies) you can sustain 27 single-precision FMADDs per CPU-core per cycle on a particular Skylake-X (i.e., approx 1.7 _mm512_fmadd_ps per cycle, each yielding 16 multiply-adds)

As soon as you start accessing memory, that number drops to about 20 FMADDs. With direct convolution methods (1x1 and arbitrary), the best you can do is achieve that, and NN-512 comes close

With the Fourier and Winograd convolutions, you start being limited by memory bandwidth, but the reduction in FMADDs that these methods provide means you end up ahead: your "effective" FMADD rate is much higher than what is possible through direct convolution. For example, NN-512 can exceed 48 effective FMADDs per cycle (on the 27 peak FMADD machine) with Winograd-Cook-Toom-Lavin, if the tensor is deep enough (enough channels)
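
For a rough sense of where the 48 figure sits, here's my back-of-the-envelope, assuming the 8x8 tiles are Lavin-style Winograd F(6x6, 3x3) for 3x3 filters (my assumption, not something stated in the NN-512 docs):

    #include <stdio.h>

    int main(void) {
        /* direct 3x3 convolution: 9 multiplies per output point
           F(6x6, 3x3): 8*8 = 64 multiplies per 6*6 = 36 output points */
        double direct = 9.0;            /* multiplies per output, direct 3x3 */
        double winograd = 64.0 / 36.0;  /* multiplies per output, F(6x6, 3x3) */
        printf("theoretical reduction: %.2fx\n", direct / winograd); /* ~5.06x */
        printf("realized gain: %.2fx\n", 48.0 / 27.0);               /* ~1.78x */
        return 0;
    }

So 48 effective FMADDs per cycle on the 27-FMADD machine is a realized ~1.78x, well under the ~5.06x ceiling once the tile transforms and memory traffic are paid for
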

So, NN-512 succeeds in saturating the hardware. Essentially all the time is spent in the matrix multiplications, doing FMADDs, or being blocked bringing half-precision weights into register for Fourier or Winograd

Until I generate a table of comparisons, you can use the previously stated number to do rough comparisons against the literature or other software packages: 18 DenseNet121 inferences per CPU-core per second on a cheap Skylake-X cloud instance


> For example, NN-512 can exceed 48 effective FMADDs per cycle (on the 27 peak FMADD machine) with Winograd-Cook-Toom-Lavin, if the tensor is deep enough (enough channels)

Roughly how many channels do you need for this approach to be worthwhile?


Enough that the data panel of the input tensor fills the thread's share of the L2 cache, and the output tensor is of similar depth

So it depends on the cache size, but you can think of it as being about 512 channels in, 512 channels out, something like that


Hi, how does it compare to https://github.com/pytorch/FBGEMM?


Great idea.


I've heard that heavy sustained AVX3 / AVX-512 workloads have the potential to damage CPUs [0][1]. Would running/testing/playing around with this software potentially risk damaging my CPU? Or is this only a risk when overclocking?

I'm not being facetious with this question at all - more genuine curiosity / responsible preventative caution for my own machines, because I'm very much interested in playing around with this if it's reasonable to do so.

0: https://news.ycombinator.com/item?id=22382946

1: https://news.ycombinator.com/item?id=14426798


The reality is that AMD's CPUs don't yet properly support AVX-512, and Intel's CPUs provide good implementations (e.g., 3 cycle latency for the useful AVX-512 shuffle/permutes) with a big downclocking caveat

AVX-512 will be great, eventually


    Copyright (C) 2019 [ 37ef ced3 3727 60b4 3c29 f9c6 dc30 d518 f4f3 4106 6964 cab4 a06f c1a3 83fd 090e ]

37ef_ced3, I'm curious: is this a signature? Is the idea to assert copyright while maintaining anonymity, until proof of ownership is required?


That's right, SHA-256 of my (unimportant) identity, with salt


Very nice. If you've tried using TFLite-Micro, you've encountered what C++ looks like in the hands of academics.

Any plans to support RNN layers?


Thanks. RNN layers later, if I can find the time


This is an interesting project. I made a somewhat similar NN-to-C "compiler", but that was for feedforward networks only. Could this in theory be extended to support models saved in common formats (such as H5)?


The network parameters are passed in as a struct of float arrays. This struct can simply be read from a file or a socket, or filled in from some other, more complicated format. This is explained in every generated header file
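
As a sketch of what that can look like (the names here are illustrative placeholders, not the actual identifiers in a generated header):

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the generated struct of float arrays; the real one has
       one array per layer parameter, with sizes fixed by the graph */
    typedef struct {
        float conv1Weights[64 * 3 * 7 * 7];
        float conv1BnMeans[64];
        /* ... */
    } ExampleParams;

    /* Read the whole struct as raw floats from a file */
    static ExampleParams *loadParams(const char *path) {
        ExampleParams *p = malloc(sizeof *p);
        FILE *f = p ? fopen(path, "rb") : NULL;
        if (!f || fread(p, sizeof *p, 1, f) != 1) {
            if (f) fclose(f);
            free(p);
            return NULL;
        }
        fclose(f);
        return p;
    }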


Very, very nice project.

Both style and substance.

I'm very glad I found this. If only all software projects on the internets had the same "to-the-point-edness" ...


Very cool project, I’d be interested in something like this for RNN/LSTM inference. Nice work!


I think this is a very interesting project.

I am curious what direction you are going.

Does it support dilated convolutions, like Wavenet uses?

Can it implement transformers?


As for dilated (and also grouped, like ResNeXt) convolutions, there is full support

The dilated (general-purpose, fallback) convolution algorithm executes FMADDs not much slower than the 1x1 convolution. It makes slow im2col-style approaches unnecessary. Example here: https://nn-512.com/example/4

The idea is to split the input and dilated weight tensors up according to the stride, and then do 1x1 convolutions with accumulation at heightwise and widthwise offsets
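
A naive scalar reference of the underlying idea (illustrative only, single output channel, "valid" padding, nothing like the generated code): each weight tap behaves like a 1x1 convolution applied at a dilated offset, accumulated into the output

    /* Dilated KxK convolution over a CxHxW input, one output channel */
    void dilated_conv_ref(const float *in, const float *w, float *out,
                          int C, int H, int W, int K, int dilation) {
        int OH = H - (K - 1) * dilation;  /* "valid" output height */
        int OW = W - (K - 1) * dilation;  /* "valid" output width  */
        for (int y = 0; y < OH; y++)
            for (int x = 0; x < OW; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)
                    for (int kh = 0; kh < K; kh++)
                        for (int kw = 0; kw < K; kw++)
                            acc += in[(c * H + y + kh * dilation) * W
                                      + x + kw * dilation]
                                 * w[(c * K + kh) * K + kw];
                out[y * OW + x] = acc;
            }
    }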

No transformers at the moment. You can see what is supported here: https://nn-512.com/docs/graph


Are there any examples of doing training available?


Train on a GPU with any tool, then write the weights, biases, and batchnorm parameters to a file (just a sequence of floats, no other formatting). NN-512 does inference only


Does this run on GPUs and/or TPUs?


No.



