That's a nice trick, but unlike function statics, it is susceptible to SIOF. This kind of optimization only pays off on extraordinarily hot paths, so I wouldn't generally recommend it.
> On ARM, such atomic load incurs a memory barrier---a fairly expensive operation.
Not quite: it is just a load-acquire, which is almost as cheap as a normal load. And on x86 there's no difference, because ordinary loads already have acquire semantics there.
One thing both GCC and Clang seem to be quite bad at is code layout: even in the article's example, the slow path is largely inlined. It would be much better to have just a load, a compare, and a jump to the slow path in a cold section. In my experience, in some rare cases reimplementing the lazy initialization explicitly (especially when it's possible to use a sentinel value, so a single load serves as both value and guard) did produce a noticeable win.
> That's a nice trick, but contrary to function statics, it is susceptible to SIOF.
For those (like me) who don’t recognize that abbreviation, “The static initialization order fiasco (ISO C++ FAQ) refers to the ambiguity in the order that objects with static storage duration in different translation units are initialized in” (https://en.cppreference.com/w/cpp/language/siof.html)
Yes, thanks for the clarification. What I probably should have said is that the trick is basically syntactic sugar for declaring a scoped global static variable, and as such it inherits all the problems of global statics.
FDO/PGO seem to really improve optimizations for hot/cold functions. I wonder if it does the kind of thing you're suggesting.
Not with any of the Clang versions I tried, but last time I checked it was a couple of years ago.
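The sentinel-value approach mentioned above can be sketched roughly like this (names and the computed value are illustrative; it assumes 0 is never a valid result, so one acquire load serves as both guard and value):

```cpp
#include <atomic>
#include <cstdint>

// Sentinel-based lazy init: 0 means "not yet computed".
static std::atomic<std::uint64_t> g_value{0};

// Slow path kept out of line, in a cold section.
[[gnu::noinline, gnu::cold]] static std::uint64_t init_slow() {
    std::uint64_t v = 12345;  // stand-in for the expensive computation
    // If two threads race here, both compute the same value; that's
    // benign as long as the computation is idempotent.
    g_value.store(v, std::memory_order_release);
    return v;
}

std::uint64_t get_value() {
    std::uint64_t v = g_value.load(std::memory_order_acquire);
    if (v != 0) return v;  // fast path: one load, one compare
    return init_slow();
}
```

Note the trade-off: unlike the compiler-generated guard, racing threads may each run the initializer, so this only works when re-running it is harmless.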
The way block-scope statics are handled in C++ is a mistake. Block-scope statics that don't depend on any non-static local variables should be initialized when the program starts up. E.g.:

    void fun(int arg)
    {
        static obj foo(arg); // delayed until the function is called (depends on arg)
        static obj bar;      // could be inited at program start (no dependency on arg)
    }

In other words, any static that can be inited at program startup should be, leaving only the troublesome cases that depend on run-time context.

FTA:
“Dynamic initialization of a block-scope variable with static storage duration or thread storage duration is performed the first time control passes through its declaration
[…]
this would initialise everything correctly: by the time foo() is called, its b has already been initialised.”
I guess that means this trick can change program behavior, especially if the function containing the static is never called in a program’s execution.
Funnily enough, I recently wrote my own hack using this linker feature in C, to implement an array of static counter definitions that can be declared anywhere and then written out (e.g., to Prometheus) in one place.
Note that, as I later found out, this doesn't work with macOS's linker, so you need some separate incantations there.
I wrote a portable abstraction for this that works across Linux, MacOS, and Windows: https://github.com/protocolbuffers/protobuf/blob/4917ec250d3...
I call them "linker arrays". They are great when you need to globally register a set of things and the order between them isn't significant.
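For reference, the GNU-linker flavor of this trick looks roughly like the sketch below (ELF only; the section name and the Counter type are made up for illustration). Because the section name is a valid C identifier, GNU ld synthesizes `__start_<section>` and `__stop_<section>` symbols bounding it:

```cpp
#include <cstdio>
#include <cstring>

struct Counter { const char* name; long value; };

// Each counter definition lands in the "my_counters" section; "used"
// keeps the otherwise-unreferenced variable from being discarded.
#define DEFINE_COUNTER(ident) \
    static Counter ident __attribute__((used, section("my_counters"))) = {#ident, 0}

DEFINE_COUNTER(requests_total);
DEFINE_COUNTER(errors_total);

// Symbols provided by the GNU linker, bounding the section's contents.
extern Counter __start_my_counters[];
extern Counter __stop_my_counters[];

void dump_counters() {
    for (Counter* c = __start_my_counters; c != __stop_my_counters; ++c)
        std::printf("%s %ld\n", c->name, c->value);
}
```

On Mach-O, the equivalent uses `section("__DATA,my_counters")` plus the `section$start`/`section$end` symbol syntax, which is why a separate incantation is needed there.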
The rabbit hole I just went down is called C/C++ Statement Expressions [1], which are a GCC extension:

    #define FAST_STATIC(T) \
        *({ /* statements separated by semicolons; the last one is the value */ \
            reinterpret_cast<T *>(ph.buf); \
        })

The reinterpret_cast<T*>(...) statement is a conventional C++ expression statement, but when enclosed in ({ }), GCC considers the whole kit and caboodle a Statement Expression that propagates a value.

There is no standard C equivalent, but in C++, since C++11, you can achieve the same effect with an immediately invoked lambda:

    auto value = [](){ return 12345; }();

As noted in the linked SO discussion, this is analogous to a JS Immediately-Invoked Function Expression (IIFE).

[1] https://stackoverflow.com/questions/76890861/what-is-called-...

Lambdas are not fully equivalent, since a return statement in a statement expression returns at the function level, whereas a return statement in a lambda only returns at the lambda level.

Just for clarity, since I didn't understand that on first reading:

    int foo() {
        int result = ({
            if (some_condition)
                return -1; // returns from foo(), not just the statement expression
            42;
        });
        return result; // this line might never be reached
    }
>Even after the static variable has been initialised, the overhead of accessing it is still considerable: a function call to __cxa_guard_acquire(), plus atomic_load_explicit(&__b_guard, memory_order::acquire) in __cxa_guard_acquire().
No. The lock calls are only done during initialization, in case two threads run the initialization concurrently while the guard variable is 0. Once the variable is initialized, this will always be skipped by "je .L3".
Right, I was scratching my head for exactly that reason too. Even if the analysis were correct, it would still be a solution to a problem that doesn't exist.
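In other words, once initialization has completed, the generated code is morally equivalent to this sketch (names are illustrative; real code funnels the slow path through __cxa_guard_acquire/__cxa_guard_release so exactly one thread runs the initializer):

```cpp
#include <atomic>

static int g_widget_count;                 // the "static local" payload
static std::atomic<bool> g_inited{false};  // the guard byte

int& widget_count() {
    // Fast path after initialization: one load-acquire of the guard and a
    // branch -- the "je .L3" in the quoted disassembly. No locking.
    if (!g_inited.load(std::memory_order_acquire)) {
        g_widget_count = 42;  // stand-in for the constructor call
        g_inited.store(true, std::memory_order_release);
    }
    return g_widget_count;
}
```

So the per-call cost after the first call is the load-acquire plus a predictable branch, not a function call.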
Looks similar to absl::NoDestructor
https://github.com/abseil/abseil-cpp/blob/master/absl/base/n...
Which is basically the only usage of std::launder I have seen
`NoDestructor` just ensures that the destructor is not called on the wrapped object, but you still need to manage the lifetime. If you look at the example, its recommended usage is with a function static. In other words, it's a utility to implement leaky Meyers' singletons.
std::launder is a bit weird here. Technically it should be used every time you use placement new but access the object by casting the pointer to its storage (which NoDestructor does). However, very little code actually uses it. For example, every implementation of std::optional should use it? But when you do, it actually prevents important compiler optimizations that make std::optional a zero-cost abstraction (or it did last time I looked into this).
std::launder should probably be used more than it is in low-level code if you care about correctness, even though it doesn’t always bite you in the ass. It is a logical no-op. std::launder is a hint to the compiler to forget everything it thinks it knows about the type instance, sort of like marking it “volatile” only for a specific moment in time.
The use of std::launder should be more common than it is; I've seen a few bugs in optimized builds when it wasn't used. Compilers have been somewhat forgiving about omitting it where you should use it, because it hasn't always existed, but rigorous code should use it instead of relying on the compiler's leniency.
In database engine code it definitely gets used in the storage layers.
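A minimal sketch of the pattern under discussion (placement new into raw storage, then accessing the object through the storage pointer); the type and member names are made up for illustration:

```cpp
#include <new>

struct Config { int verbosity; };

struct NoDestroy {
    alignas(Config) unsigned char buf[sizeof(Config)];
    NoDestroy() { ::new (static_cast<void*>(buf)) Config{1}; }
    // buf's declared type is unsigned char[]; std::launder tells the
    // compiler a Config object really lives at this address, making the
    // cast-based access formally correct.
    Config& get() { return *std::launder(reinterpret_cast<Config*>(buf)); }
    // No destructor: the Config is intentionally leaked, as in NoDestructor.
};

static NoDestroy g_config;
```

Without the launder call this compiles and usually works, which is exactly why the omission rarely bites until an optimizer gets aggressive about type-based assumptions.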
TIL about encapsulation symbols.
Why not just use constinit (iff applicable), construct_at, or lessen the cost with -fno-threadsafe-statics?
> For this we need a certain old, but little-known feature of UNIX linkers
STOP WRITING NON-PORTABLE CODE YOU BASTARDS.
The correct answer is, as always, “stop using mutable global variables you bastard”.
Signed: someone who is endlessly annoyed with people who incorrectly think Unix is the only platform their code will run on. Write standard C/C++ that doesn’t rely on obscure tricks. Your co-workers will hate you less.
Every time someone ships successful code that's hard to port to Windows the world becomes a better place.
> Every time someone ships successful code that's hard to port to Windows
Until your boss tells you to port your so-far Linux-only code to Windows, and you run into that struggle.
Signed, someone who spent the past year or so porting Linux code to Windows and macOS because the business direction changed and the company saw what was the money-maker.
P.S. Not the parent commenter, because I just realised they, too, had a paragraph beginning with 'signed, ...'
Looks like you're complaining about too much job security. Also, it's your choice to accept that task and make the world a slightly worse place again.
I tell my coworkers, "Hey, we need this coded up as a Windows service!" and I get crickets.
So I spin up a Debian VM and POSIX the hell out of it. If they dare to complain, I tell 'em to do their damn jobs and not leave all the hard stuff to the guy that only programs on UNIX.
To be fair to your coworkers, coding a Windows service and setting up logging for it is surprisingly complicated. I'm the only person at my place of work who can do it, and even then only if I can use a compile-to-native language or .NET.
Their tasks would be less hard if the UNIX guy would stop writing non-portable POSIX code =P