I first read about it a few weeks ago and found it very interesting.
Now that I have done more than enough CPU design inside FPGAs, I wanted to try something new: some computation-heavy workload that could benefit from an FPGA. Does anyone here know how feasible it would be to implement something like that on an FPGA? I only have rather small chips (an Artix-7 35T and a PolarFire SoC with 95k logic slices), so I know I won't be able to squeeze a full LLM into them, but something should be possible.
Maybe I should refresh the fundamentals first, though, and start with MNIST. But the real question is: what is a realistic goal that I could reach with these small FPGAs? Performance might be secondary; I am more interested in what's possible in terms of complexity/features on a small device.
Also, has anyone here compiled OpenCL (or GL?) kernels for FPGAs and can give me a starting point? I was wondering whether it's possible to build a working backend for something like tinygrad[1]. I think this would be a good way to learn how the different layers of such frameworks actually work.
I've had the same idea. One way to go about it would be to modify an existing RISC-V CPU to include ternary math ops that accelerate bitnet operations, plus vector/matrix extensions built on those. Your LLM would then be implemented in RISC-V assembly using those extensions. (It would be possible to do some work on the LLVM backend so you could use a C implementation of the LLM, but that starts to be a lot of work. Also, we'd need 2-bit signed int types in C.)
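To make the "ternary math ops" idea concrete, here is a rough software model of what such an instruction would compute: a dot product where each weight is -1, 0, or +1, so every term is an add, a subtract, or a skip. The 2-bit encoding and the function names are my own assumptions for illustration, not anything taken from the bitnet paper or an existing RISC-V extension.

```python
# Ternary weights {-1, 0, +1} packed four per byte, 2 bits each.
# Assumed encoding (illustrative only): 0b00 = 0, 0b01 = +1, 0b11 = -1,
# i.e. 2-bit two's complement with 0b10 left unused.
ENCODE = {0: 0b00, 1: 0b01, -1: 0b11}


def pack_ternary(weights):
    """Pack a list of ternary weights into bytes, four weights per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= ENCODE[w] << (2 * (i % 4))
    return bytes(packed)


def ternary_dot(packed, activations):
    """Dot product using only add, subtract, or skip; no multiplier needed."""
    acc = 0
    for i, x in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 0b01:
            acc += x
        elif code == 0b11:
            acc -= x
        # code 0b00 contributes nothing to the sum
    return acc


weights = [1, -1, 0, 1, -1, 0]
activations = [3, 5, 7, -2, 4, 9]
print(ternary_dot(pack_ternary(weights), activations))  # 3 - 5 - 2 - 4 = -8
```

In hardware, the loop body would become a small mux in front of an adder tree, which is why this maps well onto FPGA fabric without using DSP slices.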
A completely different approach is differentiable logic networks. You end up with a logic-gate network after training. This logic gate network would be very easy to translate into Verilog or VHDL. https://github.com/Felix-Petersen/difflogic
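As a sketch of how simple that translation step could be: assuming the trained network has already been dumped into a plain list of two-input gates, a few lines of Python can emit the Verilog. The netlist format and the function name below are hypothetical; this is not difflogic's own export format or API.

```python
# Hypothetical netlist format (NOT difflogic's export format):
# each gate is (op, a, b), where a and b index into the signal list
# [input 0, ..., input n-1, gate 0 output, gate 1 output, ...].
VERILOG_OPS = {
    "and": "{a} & {b}",
    "or": "{a} | {b}",
    "xor": "{a} ^ {b}",
    "nand": "~({a} & {b})",
}


def gates_to_verilog(name, n_inputs, gates, outputs):
    """Emit a purely combinational Verilog module for a gate list."""

    def sig(i):
        # Indices below n_inputs are module inputs, the rest are gate outputs.
        return f"x[{i}]" if i < n_inputs else f"g[{i - n_inputs}]"

    lines = [
        f"module {name} (",
        f"    input  wire [{n_inputs - 1}:0] x,",
        f"    output wire [{len(outputs) - 1}:0] y",
        ");",
        f"    wire [{len(gates) - 1}:0] g;",
    ]
    for k, (op, a, b) in enumerate(gates):
        expr = VERILOG_OPS[op].format(a=sig(a), b=sig(b))
        lines.append(f"    assign g[{k}] = {expr};")
    for k, src in enumerate(outputs):
        lines.append(f"    assign y[{k}] = {sig(src)};")
    lines.append("endmodule")
    return "\n".join(lines)


# Half adder as a toy example: sum = a ^ b, carry = a & b
print(gates_to_verilog("half_adder", 2, [("xor", 0, 1), ("and", 0, 1)], [2, 3]))
```

A real trained network would have thousands of gates and use more of the 16 possible two-input functions, but the generated file would look the same: one `assign` per gate, which the synthesis tool then packs into LUTs.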
Couldn't you implement a bitnet kernel, and use that as a co-processor to a PC? Or is the I/O bandwidth so low that it won't be worth it?
You gain potential parallelism with an FPGA, so with very small "at the edge" models it could speed things up, right? But the models are always going to be large, so memory bandwidth is going to be a bottleneck unless some very fancy FPGA memory "fabric" is possible. Perhaps for extremely low-latency classification tasks? I'm having trouble picturing that application though.
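A back-of-the-envelope way to see the bandwidth problem: at batch size 1, generating a token reads every weight once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The numbers below are made-up placeholders, not measurements of any particular board, and the estimate ignores activations, the KV cache, and any weights that fit in on-chip memory.

```python
def tokens_per_second_upper_bound(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decode speed if every weight is streamed once per token."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / model_bytes


# Illustrative assumptions only: a 7B-parameter model stored at 2 bits per
# weight (1.75 GB) behind a hypothetical 3.5 GB/s memory interface.
print(tokens_per_second_upper_bound(7e9, 2, 3.5e9))  # 2.0 tokens/s at best
```

However much parallel logic the FPGA offers, it cannot beat this bound unless the weights stay on chip, which is why tiny models are the interesting case here.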
The code itself is surprisingly small/tight. I've been playing with llama.cpp for the last few days. The CPU-only archive is something like 8 MB on GitHub, and there is no memory allocation at run time. My ancient laptop (as in 2014!) is sweating but producing spookily good output with quantized 7B models.
(I'm mainly commenting to have someone correct me, by the way, since I'm interested in this question too!)