Don't attempt to be too smart; just follow the algorithm. Failing to
do so may lead a thread to wrongly believe it owns the lock when it
does not.
This should fix the random failures reported on PPC with many threads.
This only affects RMO. This adds stricter semantics for critical section
serialization. In addition, asymmetric synchronization primitives now
provide load ordering with respect to readers.
This also modifies locked operations to have acquire semantics
(they exist for elision predicates, and this change doesn't impact
them in any way). Several performance improvements are included as
well (a redundant fence, left over from the days of wanting to support
Alpha, was removed).
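
As a rough illustration of what acquire semantics on a locked operation
means, here is a minimal sketch using C11 atomics. The sketch_* names and
the lock layout are hypothetical and are not taken from this change; the
actual implementation may use different primitives and fences.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration only. */
struct sketch_lock {
	atomic_uint value;	/* 0 = unlocked, 1 = locked */
};

static inline void
sketch_lock_init(struct sketch_lock *l)
{
	atomic_init(&l->value, 0);
}

/* Try to take the lock; acquire ordering on success. */
static inline bool
sketch_trylock(struct sketch_lock *l)
{
	unsigned int expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->value,
	    &expected, 1, memory_order_acquire, memory_order_relaxed);
}

/* Release the lock; prior writes become visible to the next acquirer. */
static inline void
sketch_unlock(struct sketch_lock *l)
{
	atomic_store_explicit(&l->value, 0, memory_order_release);
}

/*
 * Predicate with acquire semantics: loads issued after this call
 * cannot be reordered before the read of the lock word.
 */
static inline bool
sketch_locked(struct sketch_lock *l)
{
	return atomic_load_explicit(&l->value, memory_order_acquire) != 0;
}

/* Single-threaded usage walk-through. */
static inline void
sketch_usage(void)
{
	struct sketch_lock l;

	sketch_lock_init(&l);
	assert(!sketch_locked(&l));
	assert(sketch_trylock(&l));
	assert(sketch_locked(&l));
	sketch_unlock(&l);
}
```

With a relaxed load in sketch_locked, a weakly ordered target would be
free to satisfy later loads before the lock word is read, which is the
kind of reordering an acquire load rules out.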