While the search feature is nice, the reference itself still lacks some details about what an instruction actually does. Take, for example, [1], and compare it with, say, [2] (with diagram), [3] (ditto), or [4] (only pseudocode, but helpful nonetheless). Of course, all the alternatives mentioned only cater to x86, but it'd still be great if this site followed the approach taken by the other three.
[1]: https://simd.info/c_intrinsic/_mm256_permute_pd
[2]: https://www.felixcloutier.com/x86/vpermilpd
[3]: https://officedaytime.com/simd512e/simdimg/si.php?f=vpermilp...
[4]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
https://github.com/dzaima/intrinsics-viewer is like Intel's Guide, but also for Arm, RISC-V and wasm.
RISC-V and wasm are hosted here: https://dzaima.github.io/intrinsics-viewer/
You need to download it yourself if you want to use the others.
Hi, I'm one of the SIMD.info team, thanks for your feedback.
We would actually like to include more information, but our goal is to complement the official documentation, not replace it. We already provide links to Felix Cloutier's and Intel's sites, and the same for Arm and Power, where we can.
The biggest problem is generating the diagrams; we're investigating a way to produce them in a common manner for all architectures, but this will take time.
https://dougallj.github.io/asil/ is like officedaytime but for SVE.
The ISA extension tags are mostly incorrect. According to that web site, all SSE2, SSE3, SSSE3, and SSE4.1 intrinsics are part of SSE 4.2, and all FMA3 intrinsics are part of AVX2. BTW there’s one processor which supports AVX2 but lacks FMA3: https://en.wikipedia.org/wiki/List_of_VIA_Eden_microprocesso...
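To illustrate why the distinction matters: AVX2 and FMA3 are separate CPUID feature bits, so runtime dispatch has to test both rather than infer one from the other. A minimal sketch, assuming GCC or Clang (which provide __builtin_cpu_supports):

```c
#include <stdio.h>

int main(void) {
    // Separate feature bits: neither implies the other.
    int avx2 = __builtin_cpu_supports("avx2");
    int fma3 = __builtin_cpu_supports("fma");

    printf("AVX2: %s, FMA3: %s\n", avx2 ? "yes" : "no", fma3 ? "yes" : "no");
    if (avx2 && !fma3)
        puts("AVX2 without FMA3 (the VIA Eden case): don't dispatch to _mm256_fmadd_* paths");
    return 0;
}
```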
The search is less than ideal. Search for FMA and it will find multiple pages of NEON intrinsics, but no AMD64 ones like _mm256_fmadd_pd.
Hi, thanks for your feedback, we are being "incorrect" on purpose: all intrinsics up to and including SSE4.2 are listed as part of SSE4.2. We have no intention of providing full granularity for every ISA extension, especially one that is 20 years old. For the same reason, we list VSX as included in Power ISA 3.0, rather than e.g. Altivec or Power7/Power8 VSX separately. If you need such granularity, you are better off visiting the Intel Intrinsics Guide or the ISA manuals. So x86 is split into 3 groups: SSE4.2 (up to and including), AVX2 (including AVX) and AVX512 (also including some, but not all, variants), similar to the x86_64-v1, x86_64-v2, etc. levels used by compilers. We will probably add finer granularity in the future, listing the exact extension in the description, but not as part of the categorization.
Now, the search is indeed less than ideal; we're working on replacing our search engine with a much more robust one that doesn't favour one architecture over another, especially for terms like these.
In any case, thank you for your feedback. It's still in beta but it is already very useful for us, as we're actually using it for development on our own projects.
Note that this issue also affects NEON. Two examples are vmull_p64(), which requires the Crypto extension -- notably absent on RPi3/4 -- and vqrdmlah_s32(), which requires FEAT_RDM, not guaranteed until ARMv8.1. Unlike Intel, ARM doesn't do a very good job of surfacing this in their intrinsics guide.
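For reference, a minimal sketch of checking those two features at run time, assuming Linux on AArch64 (where the corresponding HWCAP bits are defined):

```c
#include <stdio.h>
#include <sys/auxv.h>    // getauxval
#include <asm/hwcap.h>   // HWCAP_PMULL, HWCAP_ASIMDRDM (Linux/AArch64)

int main(void) {
    unsigned long hwcap = getauxval(AT_HWCAP);

    if (hwcap & HWCAP_PMULL)
        puts("Crypto/PMULL present: vmull_p64() is safe to use");
    else
        puts("No PMULL (e.g. Raspberry Pi 3/4): need a scalar or table-based fallback");

    if (hwcap & HWCAP_ASIMDRDM)
        puts("FEAT_RDM present (ARMv8.1+): vqrdmlah_s32() is safe to use");
    else
        puts("No FEAT_RDM: fall back to vqrdmulh followed by a saturating add");

    return 0;
}
```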
Would also be nice to remove empty categories from tree view. For example, right now you can uncheck VSX and still see "Memory Operations - VSX Unaligned ..." full of empty tags.
Thank you for this comment, will be taken into consideration.
I clicked the “go” button just to see the typical format, and it gave… zero results. Because the example is “e.g. integer vector addition” and it doesn't strip away the “e.g.” part!
Apart from that, I find the search results too sparse (they don't contain the prototype) and the result page too verbose (way too much fluff in the description, and way too much setup in the example; honestly, who cares about <stdio.h>[1]), so I'll probably stick to the existing x86/Arm references.
[1] Also, the contrast is set so low that I literally cannot read all of the example.
You make some good points. I represent Vectorcamp (creators of simd.info). It's still in Beta status, because we know there are some limitations currently, but we are already using it in production for our own projects. Now to comment on your points:
1. Empty string -> zero results: obviously a bug, we'll add some default value.
2. The sparse results are because of VSX: VSX provides multiple prototypes per intrinsic, which we thought would bloat the results a bit too much. Including the prototypes in the results is not a problem, but on the other hand we don't want so much information that it becomes hard for the developer to find the relevant intrinsic. We'll take another look at this.
The description actually is rather bare; we intend to include a lot more information, like diagrams, pseudocode for the operation, etc.
Examples are meant to be run as self-contained compilation units, in Compiler Explorer or locally, to demonstrate the intrinsic, hence the extra setup. This will not change.
We also think that nothing will replace the official ISA references; we include links to those anyway.
3. Regarding the contrast, we're already working on a light/dark theme.
Thank you for your comments.
I don't think it's that it's not stripping "e.g.", but that the search criteria are empty. The empty result set is prefaced by "Search results for:".
I actually like that the example is a complete, standalone program that you can compile or send to Compiler Explorer.
Neat idea; the 'search' feature is a bit odd though if you don't know which instruction you are looking for. E.g. searching for 'SHA' shows autocomplete entries for platforms not selected and then 0 results due to the filters (they haven't been added for SSE/AVX yet), but searching for 'hash' gets you 100 results like '_mm256_castsi256_ph', which have nothing to do with the search.
Thanks for your comment. We have noticed some strange behavior with the “search” feature, you are right to mention it, and we are currently trying to improve its performance. Regarding SHA, you don’t get any results when filtering out NEON or VSX because the AVX512 SHA intrinsics haven’t been added yet (under development at the moment). When searching for “HASH”, the first 3 results you get are correct (NEON); the other ones, as mentioned before, are bad behavior of the search component - it must have found some similarity.
Neat tool.
It is interesting how often SIMD stuff is discussed on here. Are people really directly dealing with SIMD calls a lot?
I get the draw -- this sort of to-the-metal hyper-optimization is legitimately fun and intellectually rewarding -- but I suspect that in the overwhelming majority of cases simply using the appropriate library, ideally one that is cross-platform and utilizes whatever SIMD a given target hosts, is a far better choice than bothering with the esoterica of every platform and generation of SIMD offerings.
I kinda agree with the main point, but keep in mind those libraries with SIMD optimizations don't just appear out of nowhere... people write those. Also, it's pretty common for someone to write software for an org that has 10^5 or more identical cores running in a datacenter (or datacenters)... some specialized optimization can easily be cost-effective in those situations. Then there's crazy distributed systems stuff, where a small latency reduction in the right place can have a significant impact on an entire cluster. And on and on....
Point being, while not everyone is in a position where this stuff is relevant (and not everyone who sometimes finds this stuff relevant can say it's relevant often), it's more widely applicable than you're suggesting.
For sure there are obviously developers building those computation libraries like numpy, compilers, R, and so on. These people exist and are grinding out great code and abstractions for the rest of us to use, and many of them are regulars on HN. But these people are seldom the target of the "learn SIMD" content that appears on here regularly.
If you're an average developer building a game, a corporate information or financial system, or even a neural network implementation, and you're touching SIMD code directly, you're probably approaching things in a less than optimal fashion, and there are much better ways to utilize whatever features your hardware, or hardware in the future, may offer up.
This is not entirely accurate... think about it this way. Every time you issue a scalar floating-point addition or multiplication, you're using an 8th or a 16th of your CPU's theoretical performance. Of course, it's a bit more complicated than that, but that's the general gist of it. Compilers won't generate SIMD code for you (autovectorisation) except in the simplest cases, and they certainly won't do the data-layout transformations (AoS to SoA or AoSoA) necessary to use SIMD efficiently.
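To make the layout point concrete, here is a minimal sketch in C (a made-up particle update; an x86-64 target with AVX2 and FMA is assumed):

```c
#include <immintrin.h>
#include <stddef.h>

// Array-of-Structures: the x, y, z of one particle are adjacent in memory, so a
// vector load grabs a mix of fields; compilers usually leave this loop scalar
// (or resort to gathers/shuffles).
struct ParticleAoS { float x, y, z, pad; };

void step_aos(struct ParticleAoS *p, const struct ParticleAoS *v, size_t n, float dt) {
    for (size_t i = 0; i < n; ++i) {
        p[i].x += v[i].x * dt;
        p[i].y += v[i].y * dt;
        p[i].z += v[i].z * dt;
    }
}

// Structure-of-Arrays: each field is contiguous, so eight x-coordinates can be
// advanced with one fused multiply-add.
struct ParticlesSoA { float *x, *y, *z; };

void step_soa(struct ParticlesSoA *p, const struct ParticlesSoA *v, size_t n, float dt) {
    __m256 vdt = _mm256_set1_ps(dt);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 px = _mm256_loadu_ps(p->x + i);
        __m256 vx = _mm256_loadu_ps(v->x + i);
        _mm256_storeu_ps(p->x + i, _mm256_fmadd_ps(vx, vdt, px));
        // y and z are handled the same way (omitted for brevity).
    }
    for (; i < n; ++i)
        p->x[i] += v->x[i] * dt;   // scalar tail
}
```

The rewrite from the first struct to the second is exactly the transformation a compiler won't do for you; it has to be done by hand or come from a library that already did it.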
Now of course, many of these transformations can be wrapped in a higher-level API (think of sums over an array, reductions, string length, string encoding, etc.) but not all of them.
Of course, multithreading also exists to improve performance, but for many tasks it's more worthwhile to run on one core without the sync overhead, especially with data-parallel algorithms where you're doing the exact same thing on all of the data and you have a fairly large dataset. Or even better, you can combine the two: partition the data between multiple cores into a few smaller sets, then use a SIMD "kernel" to process each one. With embarrassingly parallel problems, you can achieve 1000x speedups this way, not exaggerating. A typical speedup is much smaller, but still, using your machine well can easily produce an order-of-magnitude difference in performance. If you read up on the ISPC benchmarks, you'll find that even for existing, very branchy, not particularly SIMD-friendly code, they regularly got a free 4x speedup without changing the behaviour or the result of the program.
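A rough sketch of that "partition between cores, SIMD kernel per slice" combination (C with OpenMP; AVX2 and FMA assumed; the sum-of-squares kernel is just a stand-in):

```c
#include <immintrin.h>
#include <stddef.h>

// SIMD "kernel": sum of squares over one contiguous slice.
static double sum_sq_kernel(const double *x, size_t n) {
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(x + i);
        acc = _mm256_fmadd_pd(v, v, acc);        // acc += v * v
    }
    double buf[4];
    _mm256_storeu_pd(buf, acc);
    double s = buf[0] + buf[1] + buf[2] + buf[3];
    for (; i < n; ++i)
        s += x[i] * x[i];                        // scalar tail
    return s;
}

// Outer level: threads take disjoint slices, so there is no synchronisation
// inside the hot loop, only the final reduction.
double sum_sq(const double *x, size_t n, int nthreads) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int t = 0; t < nthreads; ++t) {
        size_t lo = (n * (size_t)t) / (size_t)nthreads;
        size_t hi = (n * (size_t)(t + 1)) / (size_t)nthreads;
        total += sum_sq_kernel(x + lo, hi - lo);
    }
    return total;
}
```

Something like `-O2 -mavx2 -mfma -fopenmp` would build it; the exact flags depend on your toolchain.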
Seriously, it's not that using SIMD-powered libraries automatically sets you up for performance; if you have a holistic view of your system's performance, you can do really amazing things.
I'm not advocating against using vectorization/SIMD. Of course people should eke every bit of performance out.
The point is that the average developer's job in achieving this is not knowing every SIMD primitive for every platform. It is almost always folly to ever touch a platform's SIMD instructions unless you're one of the aforementioned library/compiler developers.
Instead, the ordinary developer is usually best served by using suitably equipped libraries and understanding how to structure code so it can be vectorized. Whether it's Eigen, or even the C++ compiler (which, in my experience, is often excellent at vectorizing code), or the Accelerate library on Apple platforms. These things mean that your code can adapt to and exploit every possible platform it's deployed to, versus Jimmy SIMD, who carefully crafted his SSE4.2 instructions and whose code was then lost in time.
I agree. Most developers don't need to write SIMD code. Some do, however, and those need good documentation to write good SIMD code, even more so when it comes to porting.
Interesting that you mention Eigen. Long before I started my company VectorCamp, I wrote the original Altivec/VSX, Arm and Z ports for Eigen, and it took me a lot longer to do a proper port back then (iirc I started that effort in 2008) than it would take me now, because the tools are far, far better now. I started this company to provide SIMD optimizations for all architectures, and this tool, SIMD.info, began because I wanted to help other developers find the information that I wish I had back then. It's that simple.
For me and my company, anything performance-critical enough to be written in a compiled language like C, C++, Rust (and now Zig) is worth optimizing. How much depends on your time, money and skills. Not everything needs SIMD, of course. But it's definitely NOT only for library and compiler developers. You would be surprised how much SIMD code is out there even in application code. It all depends on the expectations and the required performance. Also, some of the libraries don't utilize all available instructions.
My 2c.
I agree, it's always best to use something that already exists and is optimized for your platform, unless it doesn't exist or you need extra features that are not covered. In those cases you need to read large ISA manuals, use each vendor's intrinsic site or use our tool SIMD.info :)
The link to SIMD.AI is interesting. I didn't have a perfect experience trying to get Claude to convert scalar code to AVX512.
Claude seems to enjoy storing 16-bit masks in 512-bit vectors, but the compiler will find that easily.
The biggest issue I encountered was that when converting nested if statements into mask operations, it would frequently forget to AND the inner and outer masks together.
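For the curious, a minimal sketch of that pattern (C, AVX-512F assumed; the scalar logic is a made-up example) showing the AND that tends to get dropped:

```c
#include <immintrin.h>
#include <stddef.h>

// Scalar logic being translated:
//   if (x[i] > 0.0f) {
//       y[i] += 1.0f;
//       if (y[i] < 100.0f)    // only ever evaluated inside the outer branch
//           y[i] *= 2.0f;
//   }
void nested_if_masked(const float *x, float *y, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);

        __mmask16 outer = _mm512_cmp_ps_mask(vx, _mm512_setzero_ps(), _CMP_GT_OQ);
        vy = _mm512_mask_add_ps(vy, outer, vy, _mm512_set1_ps(1.0f));

        __mmask16 inner = _mm512_cmp_ps_mask(vy, _mm512_set1_ps(100.0f), _CMP_LT_OQ);
        // The inner branch is only reachable when the outer branch was taken,
        // so the two masks must be combined; using `inner` alone is the bug.
        vy = _mm512_mask_mul_ps(vy, (__mmask16)(outer & inner), vy, _mm512_set1_ps(2.0f));

        _mm512_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i) {          // scalar tail, same logic
        if (x[i] > 0.0f) {
            y[i] += 1.0f;
            if (y[i] < 100.0f)
                y[i] *= 2.0f;
        }
    }
}
```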
Getting an LLM to translate code is very tricky; we haven't included AVX2 and AVX512 in SIMD.ai yet because they require a lot more work. However, translating code between similarly sized vector engines became doable once we fine-tuned the LLM on our own data. We tested both ChatGPT and Claude (and more), but none could do even the simplest translations between e.g. SSE4.2 and Neon or VSX, so trying something harder like AVX512 felt like a bit of a stretch. But we're working on it.
It used to be the case that if you wanted to write code once and run it on multiple platforms you'd use a library, and if you wanted to avoid writing code which was ISA specific you used a compiler. Now we use an LLM. This is progress. Probably. It's definitely different anyway.
You still have to use the library, and it will still work the same way for normal scalar C code. The whole point is that the vectorization is difficult to write and an LLM just might be able to help with some cases, not all.
This is pretty useful! Any plan for adding ARM SVE and RISC-V V extension?
A response from the SIMD.info folks:
Yeah, the plan is to get all SIMD engines in there; RVV is the hardest though (20k intrinsics). Currently we're doing IBM Z, which should be done probably within the month? It still needs some work, and progress is slow because we're just using our own funds. The plan is IBM Z (currently being worked on), Loongson LSX/LASX, MIPS MSA, ARM SVE/SVE2 and finally RVV 1.0. LSX/LASX and MSA are very easy. Ideally, I'd like to open source everything, but I can't just now, as I would just hand over all the data to big players like OpenAI. Once I manage to ensure adequate funding, we're going to open source the data (SIMD.info) and probably the model itself (SIMD.ai).
> RVV is the hardest though (20k intrinsics)
A bit late to this comment, but most of these intrinsics are overloads of different LMUL and SEW on a single instruction. I'm pretty sure the actual number of RVV instructions is far smaller. So maybe you could consolidate overloads of the same instruction onto the same page or something.
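To illustrate the consolidation idea, a small sketch using the ratified RVV 1.0 C intrinsics: all of the following map to the same vadd.vv instruction, varying only the element width (SEW) and register grouping (LMUL), which is what inflates the count.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Four of the many overloads generated from one instruction, vadd.vv:
vint8m1_t  add_i8_m1 (vint8m1_t a,  vint8m1_t b,  size_t vl) { return __riscv_vadd_vv_i8m1 (a, b, vl); }
vint8m2_t  add_i8_m2 (vint8m2_t a,  vint8m2_t b,  size_t vl) { return __riscv_vadd_vv_i8m2 (a, b, vl); }
vint32m1_t add_i32_m1(vint32m1_t a, vint32m1_t b, size_t vl) { return __riscv_vadd_vv_i32m1(a, b, vl); }
vint32m8_t add_i32_m8(vint32m8_t a, vint32m8_t b, size_t vl) { return __riscv_vadd_vv_i32m8(a, b, vl); }

// ...and so on across i8/i16/i32/i64 and the unsigned types, LMUL from mf8 to m8,
// plus the masked (_m) and tail/mask-policy (_tu, _tum, _tumu, ...) variants --
// hundreds of names, one "vadd" page's worth of semantics.
```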
Maybe std::simd could be worked into this.
Integration with std::simd might be doable, but there are no plans, at least on our side, to integrate with any particular library.
SIMD from MCUs would also be awesome!
Do you mean Helium from Arm? Yes, that would be nice to include and relatively easy as it's mostly the same as Neon.
No, something more basic, like https://github.com/mberntsen/STM32-Libraries/blob/master/CMS...
Yeah, that's definitely doable, thanks for the pointer, we will add it in the list :)