WHY ARE VERSAL BOARDS SO EXPENSIVE (i had to rant somewhere)
I’m waiting for a cost reduction like the one that happened with UltraScale+ devices, which eventually gave us something like the ZUBoard.
Versal "Edge" VE2302 boards are coming from multiple vendors, with much better pricing.
I'm guessing they will be available in a month or so; they were supposed to ship in Q2 but seem to be running a little late (as is typical).
It depends on where you get them from. A lot of the dev boards come with extra tooling and, unfortunately, a healthy chunk of "dev tax". Luckily you can find far more barebones boards if you know where to look.
https://www.en.alinx.com/Product/SoC-Development-Boards/Vers...
The Versal AI Edge SOMs are mildly overpriced. The boards are worth it, but in the embedded space Nvidia offers the cheapest solutions, so an FPGA-based application will always need to justify the additional cost, and slightly worse performance, by arguing that the application has latency requirements a GPU cannot meet.
GPUs tend to perform worse when you have small batches and frequent kernel launches. This is especially annoying in cases where a simple grid-wide synchronization barrier would solve your problem, but CUDA expects you not to synchronize like that within a kernel; you're supposed to launch a sequence of kernels one after the other. That's not a good solution when a for loop over n iterations turns into n kernel launches.
CUDA offers grid-wide cooperative groups, which can synchronize pretty efficiently. And there are also CUDA graphs if you know the kernels you're launching ahead of time.
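To illustrate the pattern, here's a minimal sketch using Numba's CUDA cooperative-groups support (assumes a GPU, driver and Numba version that allow cooperative launch; the kernel body is a placeholder, not a real workload):

    import numpy as np
    from numba import cuda, int32

    @cuda.jit((int32[::1], int32))
    def iterate(buf, n_iters):
        i = cuda.grid(1)
        g = cuda.cg.this_grid()      # grid-wide cooperative group
        for _ in range(n_iters):     # n iterations, one kernel launch
            if i < buf.size:
                buf[i] += 1          # stand-in for the real per-step work
            g.sync()                 # grid-wide barrier, inside the kernel

    threads = 256
    # Cooperative launches cap how many blocks can be resident at once;
    # the compiled kernel can report its limit:
    kernel = iterate.overloads[(int32[::1], int32)]
    blocks = kernel.max_cooperative_grid_blocks(threads)
    buf = cuda.to_device(np.zeros(blocks * threads, dtype=np.int32))
    iterate[blocks, threads](buf, 100)
    print(buf.copy_to_host()[:3])    # -> [100 100 100]

The whole n-iteration loop stays inside one launch, which is exactly the case where n back-to-back kernel calls would hurt.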
> FPGA based application will always need to justify the additional cost for slightly worse performance
Do you mean FPGAs have slightly worse performance? Care to elaborate?
It's not just that the boards are expensive; you'll also need a Vivado license to create any designs for them. That license is at least several thousand dollars for the Versal devices.
It's taken many years of reverse engineering, but there's now an efficient OSS toolchain for the smaller Artix-7 FPGA family, https://antmicro.com/blog/2020/05/multicore-vex-in-litex/
This blog post doesn't seem to talk about an OSS toolchain. LiteX/VexRiscv are very neat, but they don't replace Vivado, right?
Like all open source, it's an ongoing effort. Bunnie has a comparison: https://www.bunniestudios.com/blog/2017/litex-vs-vivado-firs...
> Thanks to the extensive work of the MiSoC and LiteX crowd, there’s already IP cores for DRAM, PCI express, ethernet, video, a softcore CPU (your choice of or1k or lm32) and more.. LiteX produces a design that uses about 20% of an XC7A50 FPGA with a runtime of about 10 minutes, whereas Vivado produces a design that consumes 85% of the same FPGA with a runtime of about 30-45 minutes.. LiteX, in its current state, is probably best suited for people trained to write software who want to design hardware, rather than for people classically trained in circuit design who want a tool upgrade.
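For a taste of that software-flavoured workflow, here's a hypothetical Migen sketch (Migen is the Python eDSL that LiteX builds on); a toy LED blinker, not something from bunnie's post:

    from migen import Module, Signal, If
    from migen.fhdl import verilog

    class Blinker(Module):
        """Toggle an LED every 2**24 clock cycles."""
        def __init__(self):
            self.led = Signal()
            counter = Signal(24)
            # self.sync describes clocked logic: the Python runs once at
            # elaboration time and emits registers, not runtime code.
            self.sync += [
                counter.eq(counter + 1),
                If(counter == 0, self.led.eq(~self.led)),
            ]

    m = Blinker()
    print(verilog.convert(m, ios={m.led}))  # emit synthesizable Verilog

You write ordinary Python, and the toolchain hands you Verilog; that's the "software person designing hardware" pitch in a nutshell.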
I think transpute likely meant to link F4PGA[1] or one of the projects it makes use of (Yosys, nextpnr, Project IceStorm, Project X-Ray, etc).
[1] https://f4pga.org/
Is this workgroup currently funded?
Thanks for the pointer! DARPA ERI investment was initially directed to US academic teams, while Yosys and related decentralized OSS efforts were running on conviction fumes in the wilderness. Glad to see this umbrella ecosystem structure from the LF CHIPS Alliance. Next we need a cultural step change in commercial EDA tools.
Artix-7 is simplistic compared to any of the Versal chips. You buy an expensive FPGA and then try using an "open-source" toolchain that exposes 25% of the FPGA's potential. Not a great trade-off, eh?
Are the boards significantly more expensive than the devices?
We need something like this $160 ZUBoard as an entry point to the 5-figure Zynq market: https://news.avnet.com/press-releases/press-release-details/...
> the smallest, lowest power, and most cost-optimized member of the Zynq UltraScale+ family.. jump-start.. MPSoC-based end systems like miniaturized, compute-intensive edge applications in industrial and healthcare IoT systems, embedded vision cameras, AV-over-IP 4K and 8K-ready streaming, hand-held test equipment, consumer, medical applications and more.. board is ideal for design engineers, software engineers, system architects, hobbyists, makers and even students
Can someone with more knowledge of AMD explain if these are useful for real AI work? Without CUDA does it feel like working in the dark ages?
They are useful for AI, but it's a completely different beast than a GPU.
F = Field
P = Programmable
G = Gate <---- important
A = Array
You aren't "programming", you're "wiring gates together". In other words, you can build custom hardware to solve a problem without using a generic CPU (or GPU) to do it.
FPGAs are implemented as a fabric of LUTs (Look-Up Tables) which take 4 or 6 (or more) inputs and produce an output, allowing arbitrary Boolean functions to be computed (see the toy model below). The tools you use (Vivado / ISE / Yosys / etc.) take your intended design, written in an HDL (Hardware Description Language) such as Verilog or VHDL, and turn it into a configuration file which is loaded into the FPGA, configuring it into the hardware you want (if you've done it right).
FPGAs are a stepping stone between generic hardware such as a CPU or GPU and a custom ASIC. They win when you can express the problem in specialized hardware much better than writing code for a CPU/GPU; parallelization is the key to many FPGA designs. Also, you don't have to spend >$1M on a mask set to have an ASIC fabricated by TSMC, etc.
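To make the LUT idea concrete: a k-input LUT is nothing more than a 2^k-entry truth table, and "configuring" the FPGA means filling in those bits. A toy model in Python (purely illustrative, not how any real toolchain represents it):

    def make_lut4(truth_table):
        """Model a 4-input LUT: 16 configuration bits, one per input combination."""
        assert len(truth_table) == 16
        def lut(a, b, c, d):
            index = (d << 3) | (c << 2) | (b << 1) | a
            return truth_table[index]
        return lut

    # "Configure" the LUT to compute (a AND b) XOR (c OR d):
    bits = [(a & b) ^ (c | d)
            for d in (0, 1) for c in (0, 1) for b in (0, 1) for a in (0, 1)]
    f = make_lut4(bits)
    assert f(1, 1, 0, 0) == 1   # (1 & 1) ^ (0 | 0) = 1

An FPGA is, to first order, tens of thousands of these plus programmable routing between them.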
Ah, I wasn't aware AMD had a line of FPGAs.
Given the density of the PDF, I saw AMD and AI in the title and assumed the scientific community was trying to get AMD GPUs to work. This makes more sense.
> line of FPGAs
Via the $49B Xilinx acquisition, https://www.crn.com/news/components-peripherals/amd-complete...
Interesting to see Astron developing a radio astronomy accelerator that handles 200 Gbps streams with modest power consumption. The FPGA + MISD approach seems well-matched to the problem domain. Curious how this compares to other astronomy processing architectures in terms of FLOPS/watt metrics.
IIRC, the European Extremely Large Telescope (love the name) is using Nvidia GPUs to handle adaptive optics.
This ASTRON project also uses Nvidia GPUs, just a stage further down the processing chain.
https://youtu.be/RpXTbcBRiRw?si=0yTCNmPZuK29Cf1-
The title is editorialized: this has nothing to do with the NPU (the term does not appear in the PDF), which is the term of art for the version of these cores sold in laptops.
Author ported their software between near-identical AMD AIE and NPU platforms, https://www.hackster.io/tina/tina-running-non-nn-algorithms-...
> The PFB is found in many different application domains such as radio astronomy, wireless communication, radar, ultrasound imaging and quantum computing.. the authors worked on the evaluation of a PFB on the AIE.. [developing] a performant dataflow implementation.. which made us curious about the AMD Ryzen NPU.
> The [NPU] PFB figure shows.. speedup of circa 9.5x compared to the Ryzen CPU.. TINA allows running a non-NN algorithm on the NPU with just two extra operations or approximately 20 lines of added code.. on [Nvidia] GPUs CUDA memory is a limiting factor.. This limitation is alleviated on the AMD Ryzen NPU since it shares the same memory with the CPU providing up to 64GB of memory.
Consumer Ryzen NPU hardware is more accessible to students and hackers than industrial Versal AIE products.
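The trick TINA describes, recasting a DSP kernel as stock NN operators so an NN-only accelerator will run it, is easy to illustrate: an FIR filter (the building block of a PFB, alongside the FFT) is just a 1-D convolution. A hedged PyTorch sketch of the idea, not TINA's actual code:

    import torch
    import torch.nn.functional as F

    taps = torch.tensor([0.25, 0.5, 0.25])   # FIR filter coefficients
    x = torch.randn(1, 1, 1024)              # (batch, channels, samples)

    # conv1d computes cross-correlation, so flip the taps to get a true
    # convolution; the accelerator only ever sees a standard conv op.
    y = F.conv1d(x, taps.flip(0).view(1, 1, -1))

Once the algorithm is phrased this way, any runtime that can place a convolution on the NPU can run it there.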
FYI, AMD has some prototype alternative programming models for these NPU engines now, although they are certainly very immature: https://github.com/Xilinx/mlir-aie/tree/main/programming_gui...
The Versal AI Engine is the NPU. The Ryzen CPUs' NPU is almost exactly a Versal AI Engine IP block, to the point that in the Linux kernel they share the same driver (amdxdna), and the reference material the kernel docs link to for the Ryzen NPUs is the Versal SoC's AI Engine architecture reference manual:
https://docs.kernel.org/next/accel/amdxdna/amdnpu.html
My issue with your comment is that you're acting as if you're clarifying something, but you're just replacing it with another confusion.
There are three generations of AI Engines: AIE, AIE-ML and AIE-MLv2.
The latter two are known as XDNA and XDNA2, and are what's available in laptops and in the 8000G series on desktops. The original AIE is exclusively available on select FPGAs specialising in DSP using single-precision floating point.
The AI-focused FPGAs use AIE-MLv2 and are therefore identical to XDNA2.
The cores/arches themselves are referred to by a bazillion different names (AIE1, AIE2, AIE-ML, Phoenix, Strix, blah blah), and *DNA refers to the driver/runtime, not the core/arch itself, but "NPU" exclusively refers to consumer edge SoC products.