There is a bug first generation AMD Opteron processors'
with cpuid family 0Fh and models less than 40h when it
comes to read-modify write operations after load/store
sequence. Not worth supporting this processor.
If you are on this processor, you can find more information
at: http://bugzilla.kernel.org/show_bug.cgi?id=11305#c2
The barriers have been restructured into individual file
per implementation. Some micro-optimizations were implemented
for some barriers (caching common computations in the barrier).
State subsription is now explicit with the TID counter allocated
on a per-barrier basis.
Tournament barriers remaining and then another round will be done
for correctness and algorithmic improvements.
These are the tournament and mcs barriers from "Algorithms for Scalable
Synchronization on Shared-Memory Multiprocessors." Validation tests have
also been added for these barriers to regressions/ck_barrier/validate.
ck_pr_load_32_2 (and thus ck_pr_load_ptr_2) were previously implemented in
terms of lock cmpxchg8b, which is considerably slower than just using movq.
Relevant tests making use of load_ptr_2 still pass, so I'm confident this
change is correct.
C casts to unsigned int by default, so we were experiencing some negative
undefined behavior in the 1 << 31 case. x86 now works; bts and btc are
both passing.
It turns out that the "p" constraint doesn't work on clang, so we have to get
rid of that. This means that we may need to require GCC 4.3+ if it turns out
that GCC 4.1 / 4.2 still run out of registers compiling this version.
Fix a typo that was causing several validation tests to hang.
(Doing cmpxchg8b (%eax) isn't going to work very well.) I am
wondering if something is wrong with the general implementation
of ck_pr_bts_64 and ck_pr_btc_64 because it's pretty clear that
with the stack tests passing, ck_pr_cas_32_2_value works fine.