1. For ck_pr_cas_foo_value, let the inline assembly save the observed
value in a register, and store to the output reference in C.
This lets the C optimiser eliminate the memory access once the
CAS function is inlined.
2. Specify the result of the CAS as a condition code in EFLAGS instead
of executing SETcc in inline assembly, when possible. GCC gained
this functionality in GCC 6; CAS loops can now directly branch on
the condition code, without SETcc / TEST.
TESTED=existing regression tests.
* Implement ck_pr_dec_is_zero family of functions
* include/ck_pr.h: add ck_pr_{dec,inc}_is_zero and implement ck_pr_{dec,inc}_zero in terms of the new functions. Convert the architecture-specific implementations of ck_pr_foo_zero for x86 and x86-64 to ck_pr_foo_is_zero.
* regressions/ck_pr/validate: add smoke tests for ck_pr_dec_{,is_}zero and ck_pr_inc_{,is_}zero
* doc: document ck_pr_inc_is_zero
A note has also been added around some ambiguity with respect to WC
memory and relaxed memory semantics (so, heavier-weight mfence semantics
for strict acquire-release interface).
All fences related to atomic operations have been removed as they were
just unnecessary, and so, confusing.
These primitives are meant to be used in lock implementations
where control dependency ordering is sufficient to enforce
ordering of critical section. At the moment, this only affects
PPC. Currently, we rely on lwsync for entry into critical sections
which is insufficient. sync is rather heavy-weight, and assuming
we aren't falling victim into compiler re-ordering, isync should
be sufficient.
There is follow-up work to be done in ARM, as we may have cheaper
(but target-specialized) ISB-tricks for load-load ordering.
We use some macro trickery to enforce that ck_pr_store_* is actually
storing the correct type into the target variable, without any actual
side effects--by making the assignment into an rvalue and using a
comma expression, the compiler should optimize it away.
On the load side, we simply cast the result to the type of the target
variable for pointer loads.
There is an unsafe version of the store_ptr macro called
ck_pr_store_ptr_unsafe for those times when you are _really_ sure that
you know what you're doing.
This commit also updates some of the source files (ck_ht, ck_hs,
ck_rhs): ck_ht now uses the unsafe macro, as its conversion between
uintptr_t and void * is invalid under the new macros. ck_hs and ck_rhs
have had some casts added to preserve validity.
This was accidentally grouped into previous commit.
The destiny of this interface for internal use is still unclear (in context of
utilization in built-in data structures). The interface is enabled by default
on x86, as it is compatible with read-side prefetch* operations and
binary-compatible with 3DNow! extension. Older compilers will waste an
additional byte (they generate 3DNow! variant) on this, but older compilers
waste more on spillage if we encode instruction. Power support is coming soon.
These add unnecessary complexity to the ck_pr_fence interface.
Instead, it can be safely assumed that developers will use
ck_pr_fence_X to enforce X -> X ordering.
These operations serialize atomic-RMW operations with respect
to each other, loads and stores. In addition to this, the
load_depends implementations have been removed.
ck_pr_fence_{load_load,store_store,load_store,store_load} operations
have been added. In addition to this, it is no longer the responsibility
of architecture ports to determine when to emit a specific fence. Instead,
the underlying port will always emit the necessary instructions to
enforce strict ordering. The higher-level include/ck_pr implementation will
enforce whether or not a fence is necessary to be emitted according to
the memory model specified by ck_md (CK_MD_{TSO,RMO,PSO}).
In other words, only ck_pr_fence_strict_* is implemented by the MD-ck_pr
port.
build:
- configure step will generate relevant CFLAGS.
- build profiles are for convenience (developers can use themu
for cross-compilation).
regressions:
- Renamed ck_barrier unit tests to work-around behavior
of Solaris linker.
- Adopted use of a PTHREAD_CFLAGS variable.
ck_cc:
- Added internal CK_CC_IMM macro for compilers that are
verbose against impossible inline constraints (or limited
optimizers).
ck_pr/x86*:
- Adopted CK_CC_IMM macro.
- Dropped redundant constraints.
This work was mostly completed by Theo Schlossnagle
<jesus@omniti.com>, much thanks to him. He has
also provided access to a machine with Sun Studio 12.
There is a bug first generation AMD Opteron processors'
with cpuid family 0Fh and models less than 40h when it
comes to read-modify write operations after load/store
sequence. Not worth supporting this processor.
If you are on this processor, you can find more information
at: http://bugzilla.kernel.org/show_bug.cgi?id=11305#c2