Const-me 3 days ago

I would rather conclude that automatic vectorizers are still less than ideal, despite SIMD instructions having been widely available in commodity processors for 25 years now.

The language is actually great with SIMD, you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions for FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
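
To give a flavor of the intrinsics route, here's a minimal sketch (not code from that library) of a 4-wide SSE loop:

    #include <immintrin.h>

    // out[i] = a[i] * b[i] + c, assuming n is a multiple of 4
    void muladd4(float *out, const float *a, const float *b, float c, int n) {
        __m128 vc = _mm_set1_ps(c);               // broadcast scalar to 4 lanes
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);      // unaligned 4-float load
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
        }
    }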

  • nine_k 3 days ago

    Why does FORTRAN, which is older than C, have no trouble with auto-vectorization?

    Mostly because of a very different memory model, of course. Arrays are first-class things, not hacks on top of pointers, and can't alias each other.

    • nitwit005 2 days ago

      When I tested this before (some years ago), adding "restrict" to the C pointers resulted in the same behavior, so the aliasing default seems to be the key issue.
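
      For reference, the kind of change that means (a sketch):

          // with restrict, the compiler may assume dst and src never overlap,
          // so it is free to load and store several elements at a time
          void add_arrays(float *restrict dst, const float *restrict src, int n) {
              for (int i = 0; i < n; i++)
                  dst[i] += src[i];
          }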

      • AlotOfReading 2 days ago

        Your compiler very likely assumes strict aliasing at whatever optimization level you're already using, so restrict usually won't do much. GCC does at O2, Keil (and Clang?) at O1.

        • legobmw99 2 days ago

          “strict aliasing” refers to disallowing pointers of different types to alias. In the context of a function that accepts two arrays of floats, for example, it doesn’t apply

        • nitwit005 a day ago

          That looks correct from playing around in godbolt.

      • kazinator 2 days ago

        But restrict says "I promise that these things don't overlap, on penalty of my program going to hell". Is that what's going on in Fortran?

        • cozzyd 2 days ago

          Yes, Fortran forbids referring to the same memory location by multiple names.

          • pklausler 2 days ago

            Nope. There’s lots of ways to alias storage in Fortran.

            • cozzyd 2 days ago

              Sorry, I meant in the case of function parameters, though I'm far from a Fortran expert, so there may be exceptions.

    • TinkersW 2 days ago

      The article seems to conflate C and C++ in its very first sentence, though in reality these are very different languages. In C++ you can use std::array, and you can make your SIMD containers use restrict to make it more like Fortran.

      Anyway, auto-vectorization will never compete with intrinsics for performance, so this all seems rather silly. Try expressing pshufb in an auto-vectorizer. Or a bunch of optimal loads/swizzles to outperform gather.

    • Const-me 2 days ago

      I believe it’s the ecosystem. For example, Python is a very high-level language; I’m not sure it even has a memory model. But it has libraries like NumPy which support all these vectorized exponents when processing long arrays of numbers.

      • goku12 2 days ago

        Everything that's heavily vectorized in the Python ecosystem, including NumPy, achieves it using optimized backend code written in other languages - Fortran in particular. Python is only a thin veneer over those backends. In fact, you're constantly reminded to offload the control flow as much as possible to those backends for the sake of performance, instead of doing things like looping in Python. If that's enough to consider Python good at vectorization, I can just link high-performance Fortran libraries with C, handle non-vector control flow from there, and call it a day. I guarantee you that this arrangement will be far more performant than what Python can ever achieve. I have to strongly agree with the other commenter's observation that the memory model is the key to vector performance.

        And of course Python has a memory model. While that model is not as well understood as C's, it is the key to Python's success and popularity as a generic programming language and as a numeric/scientific programming language. Python's memory model, unlike C's or Fortran's, isn't designed for high performance. It's designed for rich abstractions, high ergonomics, and high interoperability with those performant languages. For most people, the processing time lost executing Python code is an acceptable tradeoff for the highly expressive control that Python gives them over the scheduling of lower-level operations.

        • ilayn 2 days ago

          NumPy has had no Fortran code for quite a long time now. SciPy has some, and it is being rewritten. What you mention is the ufunc machinery underneath, which is all C. NumPy also has SIMD support (albeit limited to certain functions). BLAS is also C/assembly, but only LAPACK is F77 (which is too much code to be rewritten).

          This does not mean Fortran is bad (obligatory disclaimer for Fortran fans).

          • nine_k 2 days ago

            Also, Fortran has progressed immensely since F77 days. (I actually wrote some F77 code back in the day, when it was already ancient.) C has also progressed quite a bit since K&R days, but, to my mind, not nearly as much.

            • holdenweb 19 hours ago

              Oh, my God, I actually wrote some code in FORTRAN II back in 1969! It's certainly a much better language now!

            • cozzyd 2 days ago

              Right, the problem for SciPy developers, I believe, is that not enough of them know Fortran, whereas C knowledge is necessary to hack on CPython native extensions.

        • holdenweb 19 hours ago

          I wouldn't say Python is good _at_ vectorization so much as good _with_ it. It's also good with AI, web systems, etc, as a result of its "play well with others" philosophy, which pragmatically accepted from the start that no one language could be best for all purposes.

          One of the best improvements to Python's memory model was extending it to allow sensible access to blobs of memory containing regular structures such as vectors and matrices, making it relatively easy to interface with and control already-available, well-tried, highly-optimised algorithms.

      • vrighter 2 days ago

        any Python library that offers any sort of real performance is written in C

        • holdenweb 19 hours ago

          As well they should be.

    • MisterTea 3 days ago

      > Arrays are first-class things, not hacks on top of pointers, and can't alias each other.

      "Hacks" feels like the wrong term to use here. Comparing managed objects surrounded by runtime to memory isn't a fair comparison.

      • nine_k 2 days ago

        There's no language-level difference between int[] and int* in C. Indexing is just pointer arithmetic. No support for (non-jagged) multidimensional arrays either.

        A proper array would at least know its length. It does not require a lot of runtime to do, and most C code requires libc anyway for basic stuff like malloc().

        • jjmarr 2 days ago

          This isn't true. int[] decays into an int* but is a different type.

          An array member of a struct will have its data allocated contiguously with other struct members (subject to compiler padding). An int* would point outside the struct. This is possible even with variable length arrays. Specifically, you can declare the final member of a struct as an int[] which makes it a "flexible array member". Then you have to malloc the struct to be the sizeof the struct + whatever size you want the array to be.

          https://en.wikipedia.org/wiki/Flexible_array_member

          This is used in struct sockaddr for generic APIs where the actual struct can be of different sizes.

          https://lwn.net/Articles/997094/

          This is rather pedantic but the memory model for arrays is important here. If you're iterating over multiple struct members that are arrays, the fact the arrays are stored contiguously in memory is going to matter to SIMD.
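
          Concretely, the pattern looks like this (a sketch; buf and buf_new are made-up names):

              #include <stdlib.h>

              struct buf {
                  size_t len;
                  int data[];    // flexible array member: must be the last member
              };

              struct buf *buf_new(size_t n) {
                  struct buf *b = malloc(sizeof *b + n * sizeof b->data[0]);
                  if (b) b->len = n;
                  return b;
              }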

          • charleslmunger 2 days ago

            >Then you have to malloc the struct to be the sizeof the struct + whatever size you want the array to be.

            Ultra pedantic, but you actually do not need to alloc struct+size; you want something like:

                // A flexible array member may have lower alignment requirements than the struct
                // overall - in that case, it can overlap with the trailing padding of the rest
                // of the struct, and a naive sizeof(base) + sizeof(flex) * count calculation
                // will not take into account that overlap, and allocate more than is required.
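                // (This assumes a MAX macro and offsetof from <stddef.h> are in scope.)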
                #define SIZEOF_FLEX(type, member, count) MAX(sizeof(type), offsetof(type, member[count]))
            
            I'm sure there's a more elegant way to calculate this, but I am no macro master.
          • Gibbon1 2 days ago

            The C language spec isn't what's important here; it's that the ABI doesn't know the difference.

            • jjmarr 2 days ago

              If using an int* over an int[] changes the memory layout of a struct, that necessarily implies a difference in the ABI.

              As an example, C++’s std::array can be implemented as a struct containing a fixed-size C-style array. This can be passed by value. This means that returning a std::array can be more performant than returning a std::vector (which might be implemented as an int* that is reallocated when you add too many elements), because a std::array’s elements live on the stack while a std::vector’s elements live on the heap.

              I was bit by this once when returning a std::unordered_map because I was doing a heap-allocation in a performance-critical section.

              • Gibbon1 a day ago

                The memory layout is the ABI, not the language spec.

          • nine_k 2 days ago

            This is fair, thanks!

        • MisterTea 2 days ago

          > There's no language-level difference between int[] and int* in C

          sizeof(int[n]) vs sizeof(int*). Then there's: 'int getint(int **I){return **I;}' which won't let me pass '&int[n]' but I can 'int* = int[n]' then pass &int*. It's not so much a hack but a very primitive implementation on top of pointers (alright, maybe it is a bit hacky.)

          > A proper array would at least know its length. It does not require a lot of runtime to do, and most C code requires libc anyway for basic stuff like malloc().

          Something like a fat pointer 'typedef struct {long len; void *data;} MyArray;' then make a libMyArray to operate on MyArray objects. But now you have a libMyArray dependency. You also lose type information unless you add fields to MyArray with an enum, or have a MyArray for each type, which gets messy. Then you can't do shit like just use an identifier in a conditional, e.g. 'while(MyArray[n++])', unless you change how C works, which makes it notC (uh, maybe pick another name.)
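
          Sketched out (MyArray and MYARRAY_AT are made-up names):

              typedef struct { long len; void *data; } MyArray;

              // element access has to go through a cast or a macro:
              #define MYARRAY_AT(a, type, i) (((type *)(a).data)[i])

              long sum_ints(MyArray a) {
                  long s = 0;
                  for (long i = 0; i < a.len; i++)
                      s += MYARRAY_AT(a, int, i);
                  return s;
              }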

          I feel that C is fine as it's an old machine oriented language. Leave it alone. I would look to reach for another tool when the primitive design of C makes life difficult.

          • MisterTea 2 days ago

            Oh crap, markdown strikes again and ate the extra asterisk in getint(**I)

      • pjmlp 2 days ago

        Fortran is not a managed language, and has a runtime as big as C's.

  • Laiho 2 days ago

    I've had great success with autovectorization in Rust. Majority of non-branching loops seem to consistently generate great assembly.

  • pjmlp 2 days ago

    None of that is ISO C though, and any language can have similar extensions, no need to be stuck with C for that.

  • Keyframe 3 days ago

    isn't DirectXMath C++, not C?

    • npalli 3 days ago

      Yes, it is in C++.

    • Const-me 2 days ago

      Yeah, but if you need C versions of low-level SIMD algorithms from there, copy-paste the implementation and it will probably compile as C code. That’s why I mentioned the MIT license: copy-pasting from GPL libraries may or may not work depending on the project, but the MIT license is copy-paste friendly.

AnimalMuppet 6 days ago

The argument seems to be that C isn't suited to SIMD because some functions (exp, for instance) are not. But it then talks about how exp is implemented in hardware, in a way that forces it to be a non-SIMD calculation. That's not a C issue, that's a CPU issue. No language can save you there (unless it uses a software-only function for exp).

Is this article really confused, or did I misunderstand it?

The thing that makes C/C++ a good language for SIMD is how easily it lets you control memory alignment.

  • woooooo 3 days ago

    Not in the article, but I've read that the way C does pointers - a single address with the length implicit - makes it hard for compilers to assert that two arrays don't overlap/alias, and this is an obstacle to generating SIMD instructions.

    • alkh 3 days ago

      I thought you could simply add `restrict`[1] to indicate that only one pointer points to a specific object? Shouldn't this help?

      [1]https://en.cppreference.com/w/c/language/restrict

      • aidenn0 3 days ago

        You can add that qualifier, but nothing will enforce that objects don't overlap. So you replaced slow correct code with fast incorrect code.

        • TheNewAndy 2 days ago

          If you believed the pointers didn't alias and they actually did then you replaced slow incorrect code with fast incorrect code

          • aidenn0 2 days ago

            That doesn't follow? If you replaced every call of memcpy with memmove (to use an example from the standard library), then your program is no less correct than it was before (and possibly more). The converse is that adding "restrict" qualifiers to existing functions can only make your program less correct, not more.

            • ablob 2 days ago

              In this context it follows. If you believe the pointers don't alias (and want to use SIMD), then there should be no data dependencies between indices. (If there are data dependencies, it is quite likely that you have designed your algorithm to accommodate that, i.e. to make it suitable for vectorized access; parallel prefix-sum would be an example of this.) And without data dependencies, vectorized code is exactly equal to unvectorized code in all circumstances. If, however, you do have data dependencies, then your algorithm is wrong. Vectorizing can surface this error, while keeping it unvectorized may hide it. In my previous example (parallel prefix-sum), if you had a direct data dependency on the previous index, your program would've been wrong - regardless of vectorization.

              In short, while your statement is true in general, I believe that it is not applicable in the context of the discussion.
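
              Concretely (a sketch):

                  // if x and y alias (say y == x), each iteration reads a value
                  // written by the previous one - a loop-carried dependency. A
                  // 4-wide vectorized version that loads 4 old values at once
                  // computes a different result, surfacing the aliasing bug.
                  void add_prev(float *x, const float *y, int n) {
                      for (int i = 1; i < n; i++)
                          x[i] += y[i - 1];
                  }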

    • wayoverthecloud 3 days ago

      In C++ there is the aligned_alloc specifier. https://en.cppreference.com/w/c/memory/aligned_alloc

      Not sure for C

      • ryanpetrich 3 days ago

        aligned_alloc is a C function. It doesn't help the compiler prove that two pointers can't alias. restrict is the keyword that does.
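
        For completeness, usage looks like this (C11; the size must be a multiple of the alignment):

            #include <stdlib.h>

            float *alloc_avx_floats(size_t n) {
                size_t bytes = (n * sizeof(float) + 31) & ~(size_t)31; // round up to 32
                return aligned_alloc(32, bytes);  // 32-byte alignment suits AVX loads
            }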

        • bobmcnamara 12 hours ago

          The compiler is allowed to assume the results of malloc and certain other allocation functions do not alias.

          Restrict does too.

    • bee_rider 3 days ago

      When compiling a library, it is generally impossible for the compiler to know whether or not the pointers will be aliased, right? That decision is made after the library is already compiled.

      • drwu 3 days ago

        The function declaration in the header file of the library can carry the required 'restrict'. It works for C++ callers too, as most C++ compilers also support and check __restrict__ for plain old pointer types.

    • 1718627440 3 days ago

      Not really. Pointers from different allocations are guaranteed to never alias. If they do, it is UB.

      • AnimalMuppet 3 days ago

        Right, but the compiler in general has no way to know that pointers are from different allocations - especially several function calls away from the allocations, with some pointer arithmetic in between.

        • bee_rider 3 days ago

          In general, the information as to whether or not the pointers will alias might not even be known, right? Like we could be compiling a library that will take in some pointers from the user.

        • 1718627440 3 days ago

          Yes, that is why restrict was introduced.

  • ta3401 2 days ago

    > But it then talks about how exp is implemented in hardware, in a way that forces it to be a non-SIMD calculation.

    The fact that exp is implemented in hardware is not the argument. The argument is that exp is a library function, compiled separately, and thus the compiler cannot inline the function and fuse it with an array-wide loop, where it could later detect an opportunity to generate the SIMD instructions.

    It is true, however, that exp is implemented in hardware in the x86 world, and to be fair, perhaps a C compiler only needs to represent that function as an intrinsic instead, to give itself a chance to later replace it either with the function call or some SIMD instruction; but I guess that the standard doesn't provide that possibility?

    • AlotOfReading 2 days ago

      The compiler can (and does) know enough about the stdlib functions to treat them specially. The C standard doesn't require that libm exist as a separately linked library either, that's just an implementation detail from an earlier time that got enshrined in POSIX.

    • vmchale 2 days ago

      > perhaps a C compiler only needs to represent that function as an intrinsic instead, to give itself a chance to later replace it either with the function call or some SIMD instruction

      I gesture to this in the blog post:

      > In C, one writes a function, and it is exported in an object file. To appreciate why this is special, consider sum :: Num a => [a] -> a in Haskell. This function exists only in the context of GHC.

      > ...

      > perhaps there are more fluent methods for compilation (and better data structures for export à la object files).

      • ta3401 2 days ago

        I now realize that I somewhat rephrased what was said in the blog. How strange!

  • adgjlsfhk1 3 days ago

    > But it then talks about how exp is implemented in hardware, in a way that forces it to be a non-SIMD calculation.

    This is especially surprising given that very few architectures have an `exp` instruction in hardware. It's almost always done in software.

  • stabbles 3 days ago

    I think the interpretation is more conceptual about functions, reusability and composability.

    The function exp has an abstract mathematical definition as a mapping from numbers to numbers, and you could implement that generically if the language allowed for it, but in C you cannot, because it's bound to the signature `double exp(double)` and fixed like that at compile time. You cannot use this function in a different context and pass e.g. __m256d.

    Basically because C does not have function dispatch, it's ill-suited for generic programming. You cannot write an algorithm on scalar, and now pass arrays instead.
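
    Sketched in code: a SIMD variant needs a separate function with a different fixed signature (exp4 here is a made-up name, not a real intrinsic):

        #include <immintrin.h>
        #include <math.h>

        // double exp(double) is fixed at scalar double; a __m256d version must
        // be a separate function. This one just loops over scalar exp - a real
        // vector exp would evaluate a polynomial with SIMD arithmetic.
        __m256d exp4(__m256d v) {
            double tmp[4];
            _mm256_storeu_pd(tmp, v);
            for (int i = 0; i < 4; i++)
                tmp[i] = exp(tmp[i]);
            return _mm256_loadu_pd(tmp);
        }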

    • Someone 3 days ago

      > You cannot write an algorithm on scalar, and now pass arrays instead.

      It is limited to cases where you know all overloads beforehand, and limited because of C’s weak type system (you can’t have an overload for arrays of int and one for pointers to int, for example), but you can. https://en.cppreference.com/w/c/language/generic:

        #include <math.h>
        #include <stdio.h>
       
        // Possible implementation of the tgmath.h macro cbrt
        #define cbrt(X) _Generic((X),     \
                    long double: cbrtl, \
                        default: cbrt,  \
                          float: cbrtf  \
                    )(X)
       
        int main(void)
        {
          double x = 8.0;
          const float y = 3.375;
          printf("cbrt(8.0) = %f\n", cbrt(x));    // selects the default cbrt
          printf("cbrtf(3.375) = %f\n", cbrt(y)); // converts const float to float,
                                                  // then selects cbrtf
        }
      
      It also, IMO, is a bit ugly, but that fits in nicely with the rest of C, as seen through modern eyes.

      There is also the drawback, in lots of cases, that you cannot overload operators. Implementing vector addition as ‘+’ isn’t possible, for example.

      Many compilers partially support that as a language extension, though, see for example https://clang.llvm.org/docs/LanguageExtensions.html#vectors-....
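
      For example, with those extensions (GCC and Clang both accept this):

          typedef float v4f __attribute__((vector_size(16)));  // 4 floats, 16 bytes

          v4f vadd(v4f a, v4f b) {
              return a + b;   // '+' works elementwise on vector types
          }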

      • uecker 3 days ago

        I am not sure what you mean by "weak" type system. C's type system is not weak IMHO, and it also differentiates between arrays and pointers:

            _Generic(typeof(x), int*: 1, int[]: 2)

        _Generic with a type argument is a new feature (C23); it is a bit less elegant without it:

            _Generic((typeof(x)*)0, typeof(int*)*: 1, typeof(int[])*: 2)

        • glouwbug 2 days ago

          Any pointer converts to void* without warning. That's pretty weak.

          • uecker a day ago

            This is the point of void*, one does not have to use it.

            • glouwbug a day ago

              void* malloc(size_t size);

              void free(void *ptr);

              You use it a lot more than you think ;)

              • uecker a day ago

                This is not a type safety issue though. For malloc, the size has to agree. But in any case, you could wrap it in a type-aware allocator, demonstrating that this is not a limitation of the type system.

        • ta3401 2 days ago

          Perhaps not in the sense of weak vs. strong, but rather weak vs. powerful. C's type system isn't very powerful if you compare it to languages from the scientific community (Python is an exception here, IMHO). It is simple and probably good enough for what it was designed for (systems programming), but it's not very good for scientific computing.

          • uecker a day ago

            Why? I use it for scientific computing and find it very useful for exactly this purpose. But maybe you could be more precise and explain what exactly about the type system is not powerful enough.

            • ta3401 12 hours ago

              IMHO, a good type system for scientific computing will let you check that your computations are well-formed. For instance, if one tries to apply a matrix to a vector of the wrong size, the type system should report it, or rather, one should be able to express this sort of constraint using the type system to the same effect.

              You are however correct, that doesn't mean that C cannot be useful for scientific computing, but I think that there are better alternatives. Although, one can write the core functionalities in C for speed and efficiency, and use another language as a driver.

    • uecker 3 days ago

      You can definitely implement generic algorithms in C, and there are different ways to do this. For example, you create an abstract data type "ring" with addition, multiplication, and multiplication by a scalar, and pass objects of this type and functions for these three operations to a generic exponential function.
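
      A rough sketch of that shape (all names here are made up; termination criteria and error handling are omitted):

          #include <stddef.h>
          #include <string.h>

          // abstract "ring": the element type is opaque, its operations are supplied
          typedef struct {
              size_t size;                                  // sizeof one element
              void (*add)(void *out, const void *a, const void *b);
              void (*mul)(void *out, const void *a, const void *b);
              void (*scale)(void *out, double s, const void *a);
              void (*one)(void *out);                       // multiplicative identity
          } ring;

          // exp(x) as a truncated Taylor series, written once for any ring;
          // out, term and tmp are caller-provided element-sized buffers
          void ring_exp(const ring *r, void *out, void *term, void *tmp,
                        const void *x, int n)
          {
              r->one(out);                      // out  = 1
              r->one(term);                     // term = 1
              for (int k = 1; k <= n; k++) {
                  r->mul(tmp, term, x);         // tmp  = term * x
                  r->scale(term, 1.0 / k, tmp); // term = term * x / k
                  r->add(tmp, out, term);       // tmp  = out + term
                  memcpy(out, tmp, r->size);    // out += term
              }
          }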

  • leecarraher 3 days ago

    I think the article is confused. I write quite a bit of SIMD code for HPC, and C is my go-to for low-level computing because of the low-level memory allocation in C. In fact, that tends to be the lion's share of the work to implement SIMD operations where memory isn't the bottleneck. If only it were as simple as calling vfmadd132ps on two datatypes.

PaulHoule 6 days ago

So are those functional languages. The really cool SIMD code I see people writing now is by people like Lemire, using the latest extensions, doing clever things like decoding UTF-8, which I think will always take assembly language to express unless you are getting an SMT solver to write it for you.

  • janwas 2 days ago

    Lemire and collaborators often write in C++ intrinsics, or thin platform-specific wrappers on top of them.

    I count ~8 different implementations [1], which demonstrates considerable commitment :) Personally, I prefer to write once with portable intrinsics.

    https://github.com/simdutf/simdutf/tree/1d5b5cd2b60850954df5...

    • mananaysiempre 2 days ago

      I don’t think that really works beyond 128 bits. AVX/AVX2 with its weird tiered 2×128-bit structure occasionally forces you down some really awkward paths. AVX-512 with its masks is powerful and quite elegant, but the idioms it enables don’t really translate to other ISAs. And even in 128-bit land, the difference between good code for SSE and NEON can be quite significant, all because of a single instruction (PMOVMSKB) the latter lacks[1].

      Portable instructions and “scalable” length-agnostic vectors are usually fine for straightforward mathy (type-1[2]) SIMD code. But real SIMD trickery, the one that starts as soon as you take your variable byte shuffle out of the stable, is rarely so kind.

      [1] https://branchfree.org/2019/04/01/fitting-my-head-through-th...

      [2] https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u...

      • janwas 2 days ago

        Valid points, but I have been doing exactly that full-time for the past few years :) Including compression and quicksort, which are nontrivial uses of SIMD.

        The 2x128 is handled by adopting AVX2 semantics for our InterleaveLower/Upper op, and when we want the extra fixup, there is also InterleaveWholeLower. I agree AVX-512 is awesome, it's my favorite target. Some of its useful ops (compress[store]) are fine to emulate. What idioms did you have in mind that don't translate? Spot on about PMOVMSKB, that's a pain point. We provide several ops to replace it with more relaxed semantics that are easier to emulate on other platforms, for example AllTrue, CountTrue, AllFalse. (I think DanilaK's NEON VSHRN trick is faster than the one from your first link?)

  • kibwen 3 days ago

    > So are those functional languages.

    The author is suggesting array languages as the solution, which are a separate category from functional languages.

  • throwawaymaths 3 days ago

    not even an SMT solver, that's not what they are for.

    • 1propionyl 3 days ago

      No? SMT solvers underpin the majority of program synthesis research tools.

      When you get down to it, you're optimizing (searching) for some program that maximizes/minimizes some objective function with terms for error relative to specification/examples and size of synthesized program, while applying some form of cleverness to pruning the search space.

      This is absolutely within the wheelhouse of SMT solvers, and something that they are used for.

      SMT doesn't have to be used, but for implementation it enables iterating on the concept more quickly (see also, more broadly: prototyping in Prolog), and in other cases it's simply the most effective tool for the job. So, it tends to get a lot of play.

      • throwawaymaths 2 days ago

        If SMT solvers were all you needed then, for example, LLVM wouldn't need a polyhedral optimizer. And a polyhedral optimizer is a subset of the sort of machinery you would need to do the kind of thing you're asking for automatically.

        • dzaima 2 days ago

          SMT solvers might not be feasible for large-scale things, but they can certainly be used to find a sequence of a couple instructions doing a specific thing if you give enough computational resources, and that's what OP was talking about. (whether it'll complete in a reasonable time (i.e. less than like an hour) for something with more than like 4 instructions is a separate question; but 4 instructions is enough for some fancy stuff).

          Like, just playing around not particularly seriously, I've gotten SMT to spit out [1] for narrowing & packing four int32x8 vectors into one int8x32 vector, assuming no wrapping necessary (4 minutes via Bitwuzla), among other things; nothing one couldn't figure out by doing a bit of thinking, but it does work! And could conceivably get more feasible for larger tasks over time.

          [1]: https://dzaima.github.io/paste/#0KytITM4uLi5PUqjMzTXRAZEGYNK...

npalli 3 days ago

C++ does OK.

Google's highway [1]

Microsoft DirectXMath [2]

[1] https://github.com/google/highway

[2] https://github.com/microsoft/DirectXMath

  • vmchale 2 days ago

    From highway:

    > Does what you expect: Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. The resulting code is more predictable and robust to code changes/compiler updates than autovectorization.

    So C compilers are not a good place to start if one wants to write a compiler for an array language (which naturally expresses SIMD calculations). Which is what I point out in the last paragraph of the blog post:

    > To some extent this percolates compilers textbooks. Array languages naturally express SIMD calculations; perhaps there are more fluent methods for compilation (and better data structures for export à la object files).

  • pjmlp 2 days ago

    It is more like, ISO C++ with compiler extensions does ok.

commandlinefan 2 days ago

This seems more of a generalization of Richard Stevens' observation:

"Modularity is the enemy of performance"

If you want optimal performance, you have to collapse the layers. Look at DeepSeek, for example.

mlochbaum 3 days ago

Several comments seem confused about this point: the article is not about manual SIMD, which of course C is perfectly capable of with intrinsics. It's discussing problems in compiling architecture-independent code to SIMD instructions, which in practice C compilers very often fail to do (so, exp having scalar hardware support may force other arithmetic not to be vectorized). An alternative mentioned is array programming, where operations are run one at a time on all the data; these languages serve as a proof of concept that useful programs can be run in a way that uses SIMD nearly all the time it's applicable, but at the cost of having to write every intermediate result to memory. So the hope is that "more fluent methods for compilation" can generate SIMD code without losing the advantages of scalar compilation.
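
To make the tradeoff concrete, here is the one-whole-array-operation-at-a-time style transliterated into C (a sketch):

    #include <math.h>

    // each pass is a simple, easily vectorized loop, but the intermediate
    // result tmp costs an extra round trip through memory
    void sigmoid_array_style(float *out, const float *x, float *tmp, int n) {
        for (int i = 0; i < n; i++) tmp[i] = expf(-x[i]);             // pass 1
        for (int i = 0; i < n; i++) out[i] = 1.0f / (1.0f + tmp[i]);  // pass 2
    }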

As an array implementer I've thought about the issue a lot and have been meaning to write a full page on it. For now I have some comments at https://mlochbaum.github.io/BQN/implementation/versusc.html#... and the last paragraph of https://mlochbaum.github.io/BQN/implementation/compile/intro....

  • mlochbaum 3 days ago

    Finding it increasingly funny how many people have come out of the woodwork to "defend" C by offering this or that method of explicit vectorization. What an indictment! C programmers can no longer even conceive of a language that works from a description of what should be done and not which instructions should be used!

  • brundolf 3 days ago

    Thank you. The title is confusing

  • pjmlp 2 days ago

    Compiler-specific C is capable of it; ISO C has no SIMD support of any form.

notorandit 3 days ago

C was invented when CPUs, the few available, were single-core and single-threaded.

Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use (inline) assembler functions. Or design your own language.

  • dragontamer 3 days ago

    Fortran is older, but its array manipulation maps naturally to SIMD.

  • LiamPowell 3 days ago

    That's missing the point of the article, but it's also not true, at least not at the time of C89. This can be easily verified by a Google search. As an aside, Ada also had multi-threading support before C89 was released, but this article is about SIMD, not multi-threading.

    • johnnyjeans 3 days ago

      C89 is 19 years into C's life

      • LiamPowell 3 days ago

        Yes, but before standardisation there were many implementations all doing different things based on a book with no particularly formal specifications, so picking 1989 makes more sense than 1970 in my opinion. Multi-threading also existed in 1970 but wasn't as widespread.

      • pjmlp 2 days ago

        You had plenty of places outside Bell Labs doing concurrency stuff, Concurrent Pascal and Solo OS is one example among many.

camel-cdr 3 days ago

This depends entirely on compiler support. Intel's ICX compiler can easily vectorize a sigmoid loop by calling SVML's vectorized expf function: https://godbolt.org/z/no6zhYGK6

If you implement a scalar expf in a vectorizer-friendly way, and it's visible to the compiler, then it can also be vectorized: https://godbolt.org/z/zxTn8hbEe
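
For reference, the shape of loop being discussed (a sketch, not the exact code from the godbolt links):

    #include <math.h>

    // with icx (or gcc/clang plus the right flags), the expf call can be
    // replaced by a call into a vectorized math library such as SVML
    void sigmoid(float *out, const float *in, int n) {
        for (int i = 0; i < n; i++)
            out[i] = 1.0f / (1.0f + expf(-in[i]));
    }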

  • dzaima 3 days ago

    gcc and clang are also capable of it, given certain compiler flags: https://godbolt.org/z/z766hc64n

    • camel-cdr 3 days ago

      Thanks, I didn't know about this. Interesting that it seems to require fast-math.

      • ComputerGuru 3 days ago

        That means it’ll never be used. -ffast-math is verboten in most serious codebases.

      • dzaima 3 days ago

        Seems "-fno-math-errno" is enough for clang. gcc needs a whole "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only".

        • Avamander 3 days ago

          So is the optimization "wrong" or "unsafe" even in the case of Intel's ICX compiler? Is it that you can't express the right (error) semantics in C?

          I'm just wondering why those two require the flag and the other doesn't.

          • dzaima 3 days ago

            ICX appears to just default to fast-math: https://godbolt.org/z/jzPazGjoh

            Requiring -fno-math-errno is sane enough; essentially no one needs math errno anyway (and that flag is needed to vectorize even sqrt, for which there's a proper full hardware SIMD instruction, but which obviously doesn't set errno on a negative input or whatever).

            Probably depends on the vector math library used whether it handles near-infinity or inf/NaN values properly. And there's also the potential concern that the scalar and vector exp() likely give different results, leading to weird behavior, which might be justification for -funsafe-math-optimizations.

  • bee_rider 3 days ago

    I’m not sure what “suited to SIMD” means exactly in this context. I mean, it is clearly possible for a compiler to apply some SIMD optimizations. But the program is essentially expressed as a sequential thing, and then the compiler discovers the SIMD potential. Of course, we write programs that we hope will make it easy to discover that potential. But it can be difficult to reason about how a compiler is going to optimize, for anything other than a simple loop.

    • camel-cdr 3 days ago

      Suited for SIMD means you write the scalar equivalent of what you'd do on a single element in a SIMD implementation.

      E.g. you avoid lookup tables when you can, or only use smaller ones you know will fit in one or two SIMD registers. gcc and clang can't vectorize it as is, but they do if you remove the branches that handle infinity and over/under-flow.

      In the godbolt link I copied the musl expf implementation, and icx was able to vectorize it, even though it uses a LUT too large for SIMD registers.

      #pragma omp simd and equivalents will encourage the compiler to vectorize a specific loop and produce a warning if a loop isn't vectorized.
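
      E.g. (a sketch; built with -fopenmp-simd or similar):

          void scale(float *x, float a, int n) {
              #pragma omp simd   // request vectorization of this loop
              for (int i = 0; i < n; i++)
                  x[i] *= a;
          }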

      • bee_rider 3 days ago

        I shouldn’t have started my comment with the sort of implied question or note of confusion. Sorry, that was unclear communication.

        I agree that it is possible to write some C programs that some compilers will be able to discover the parallel potential of. But it isn’t very ergonomic or dependable. So, I think this is not a strong counter-argument to the theory of the blog post. It is possible to write SIMD friendly C, but often it is easier for the programmer to fall back to intrinsics to express their intent.

    • sifar 3 days ago

      It means auto-vectorization. Write scalar code that can be automatically vectorized by the compiler by using SIMD instructions

vkaku 2 days ago

I also think that vectorizers and compilers can detect parallel memory adds/moves/subs, yet many people do not take the time to provide adequate hints to the compiler about it.

Some people have vectorized successfully with C, even with all the hacks/pointers/union/opaque business. It requires careful programming, for sure. The ffmpeg cases are super good examples of how compiler misses happen, and how to optimize for full throughput in those cases. Worth a look for all compiler engineers.

Vosporos 3 days ago

Woo, very happy to see more posts by McHale these days!

exitcode0000 2 days ago

Of course C easily allows you to directly write SIMD routines via intrinsic instructions or inline assembly:

```ada

  generic
     type T is private;
     Aligned : Bool := True;
  function Inverse_Sqrt_T (V : T) return T;
  function Inverse_Sqrt_T (V : T) return T is
    Result : aliased T;
    THREE         : constant Real   := 3.0;
    NEGATIVE_HALF : constant Real   := -0.5;
    VMOVPS        : constant String := (if Aligned then "vmovaps" else "vmovups");
    begin
      Asm (Clobber  => "xmm0, xmm1, xmm2, xmm3, memory",
           Inputs   => (Ptr'Asm_Input ("r", Result'Address),
                        Ptr'Asm_Input ("r", V'Address),
                        Ptr'Asm_Input ("r", THREE'Address),
                        Ptr'Asm_Input ("r", NEGATIVE_HALF'Address)),                     
           Template => VMOVPS & "       (%1), %%xmm0         " & E & --   xmm0 ← V
                       " vrsqrtps     %%xmm0, %%xmm1         " & E & --   xmm1 ← Reciprocal sqrt of xmm0
                       " vmulps       %%xmm1, %%xmm1, %%xmm2 " & E & --   xmm2 ← xmm1 \* xmm1
                       " vbroadcastss   (%2), %%xmm3         " & E & --   xmm3 ← NEGATIVE_HALF
                       " vfmsub231ps  %%xmm2, %%xmm0, %%xmm3 " & E & --   xmm3 ← (V - xmm2) \* NEGATIVE_HALF
                       " vbroadcastss   (%3), %%xmm0         " & E & --   xmm0 ← THREE
                       " vmulps       %%xmm0, %%xmm1, %%xmm0 " & E & --   xmm0 ← THREE \* xmm1
                       " vmulps       %%xmm3, %%xmm0, %%xmm0 " & E & --   xmm0 ← xmm3 \* xmm0
                       VMOVPS & "     %%xmm0,   (%0)         ");     -- Result ← xmm0
      return Result;
    end;

  function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_2D, Aligned => False);
  function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_3D, Aligned => False);
  function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_4D); 
```

```C

  vector_3d vector_inverse_sqrt(const vector_3d* v) {
  ...
  vector_4d vector_inverse_sqrt(const vector_4d* v) {
    vector_4d out;
    static const float THREE = 3.0f;          // 0x40400000
    static const float NEGATIVE_HALF = -0.5f; // 0xbf000000

    __asm__ (
        // Load the input vector into xmm0
        "vmovaps        (%1), %%xmm0\n\t"
        "vrsqrtps     %%xmm0, %%xmm1\n\t"
        "vmulps       %%xmm1, %%xmm1, %%xmm2\n\t"
        "vbroadcastss   (%2), %%xmm3\n\t"
        "vfmsub231ps  %%xmm2, %%xmm0, %%xmm3\n\t"
        "vbroadcastss   (%3), %%xmm0\n\t"
        "vmulps       %%xmm0, %%xmm1, %%xmm0\n\t"
        "vmulps       %%xmm3, %%xmm0, %%xmm0\n\t"
        "vmovups      %%xmm0,   (%0)\n\t"                         // Output operand
        :
        : "r" (&out), "r" (v), "r" (&THREE), "r" (&NEGATIVE_HALF) // Input operands
        : "xmm0", "xmm1", "xmm2", "memory"                        // Clobbered registers
    );

    return out;
}
```

jiehong 3 days ago

It seems that Zig is well suited for writing SIMD [0].

If only GPU makers could standardise an extended ISA like AVX on CPUs, then we could all run SIMD or SIMT code without needing any libraries, just our compilers.

[0]: https://zig.guide/language-basics/vectors/

  • dundarious 3 days ago

    I like zig, but no, what it provides pales in comparison to what intrinsics offer.

TinkersW 3 days ago

This reads like gibberish.

C functions can't be vectorized? WTF are you talking about? You can certainly pass vector registers to functions.

Exp can also be vectorized; AVX-512 even includes specific instructions to make it easier. (There is no direct exp instruction on most hardware; it is generally a sequence of instructions.)