ck_hs_hash takes two arguments, the hash set and the key, and computes the
hash value. The performance gain from CK_HS_HASH is negligible for most
workloads.
Basic test coverage in serial.c.
Shoutout to Theo Schlossnagle <jesus@circonus.com>, who provided the initial
patch.
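A minimal usage sketch, assuming a hash set already initialized with
ck_hs_init (the lookup helper below is illustrative, not part of CK):

    #include <ck_hs.h>

    /* Compute the hash once via the set's hash function and seed, then
     * reuse it for the lookup (and for any later put or remove). */
    static void *
    lookup(struct ck_hs *hs, const void *key)
    {
        unsigned long h = ck_hs_hash(hs, key);

        return ck_hs_get(hs, h, key);
    }
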
Less undefined behaviour is always good. In practice, compilers seem
to handle either form correctly, but some static analysis tools have
trouble with (size_t)&((T *)0)->M.
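For illustration, the two forms in question (the struct is just an example):

    #include <stddef.h>

    struct entry {
        int key;
        struct entry *next;
    };

    /* Hand-rolled form: formally relies on dereferencing a null pointer,
     * which some static analysers flag. */
    #define ENTRY_NEXT_OFFSET ((size_t)&((struct entry *)0)->next)

    /* Standard form: offsetof from <stddef.h> yields the same value
     * without the undefined behaviour. */
    #define ENTRY_NEXT_OFFSET_STD offsetof(struct entry, next)
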
These were always a no-op on TSO. For non-TSO architectures, we assume
data-dependent ordering. Next-to-data ordering is not required, but
data-to-next ordering is. However, a re-ordering here is benign, in that
a stale load will observe an initialized NULL value.
We would rather not pay the load cost in the common case.
Write-side fences are unchanged.
From discussions with markj@freebsd.org.
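A sketch of the pattern in question (illustrative types; the ck_pr calls
are the generic ones from ck_pr.h):

    #include <ck_pr.h>
    #include <stddef.h>

    struct node {
        int value;
        struct node *next;
    };

    /* Writer: order "data before next" with a store fence, then publish.
     * This side is unchanged. */
    static void
    push(struct node **head, struct node *n, int v)
    {
        n->value = v;
        n->next = ck_pr_load_ptr(head);
        ck_pr_fence_store();
        ck_pr_store_ptr(head, n);
    }

    /* Reader: the dereference depends on the loaded pointer, so on
     * architectures with data-dependent load ordering no load fence is
     * needed; at worst a stale load observes the initialized NULL. */
    static int
    peek(struct node **head)
    {
        struct node *n = ck_pr_load_ptr(head);

        return n != NULL ? n->value : 0;
    }
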
1. For ck_pr_cas_foo_value, let the inline assembly save the observed
value in a register, and store to the output reference in C.
This lets the C optimiser eliminate the memory access once the
CAS function is inlined.
2. Specify the result of the CAS as a condition code in EFLAGS instead
of executing SETcc in inline assembly, when possible. GCC gained
this functionality in GCC 6; CAS loops can now directly branch on
the condition code, without SETcc / TEST.
TESTED=existing regression tests.
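A stand-alone illustration of both points (not CK's actual code; names are
mine): the observed value is written back from C, and the GCC 6 flag-output
constraint exposes ZF directly, so a retry loop can branch on the boolean
return value without SETcc / TEST.

    #include <stdbool.h>
    #include <stdint.h>

    static inline bool
    cas_32_value(uint32_t *target, uint32_t expect, uint32_t desired,
        uint32_t *observed)
    {
        bool result;

        /* ZF from the cmpxchg is returned via the "=@ccz" flag output
         * (GCC 6+); the observed value is left in EAX. */
        __asm__ __volatile__("lock cmpxchgl %3, %0"
            : "+m" (*target), "=@ccz" (result), "+a" (expect)
            : "r" (desired)
            : "memory");

        /* Store the observed value from C so the optimiser can eliminate
         * the store once the function is inlined. */
        *observed = expect;
        return result;
    }
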
This new interface allows for slot reservation, avoiding additional copy
overhead from the consumer. The primary use case is the type-specialized
variant of ck_ring. The initial patch set does not migrate enqueue and
dequeue to be implemented in terms of reserve and commit; that will be
done in a future commit.
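A sketch of the reserve/commit pattern for a single-producer ring (names
and layout are illustrative, not the ck_ring API): the producer builds the
entry directly in the reserved slot, avoiding the extra copy, and the
commit makes it visible.

    #include <ck_pr.h>
    #include <stddef.h>
    #include <stdint.h>

    struct slot {
        uint64_t value;
    };

    struct spsc_ring {
        unsigned int p_head;    /* producer cursor                */
        unsigned int c_head;    /* consumer cursor                */
        unsigned int mask;      /* size - 1, size a power of two  */
        struct slot *buffer;
    };

    /* Reserve a slot; the caller constructs the payload in place.
     * Single producer, one outstanding reservation at a time. */
    static struct slot *
    enqueue_reserve(struct spsc_ring *r)
    {
        unsigned int p = r->p_head;
        unsigned int c = ck_pr_load_uint(&r->c_head);

        if (p - c > r->mask)
            return NULL;                /* ring is full */
        return &r->buffer[p & r->mask];
    }

    /* Commit the reserved slot; only now does it become visible. */
    static void
    enqueue_commit(struct spsc_ring *r)
    {
        ck_pr_fence_store();
        ck_pr_store_uint(&r->p_head, r->p_head + 1);
    }
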
ck_ec implements 32-bit and (on 64-bit platforms) 64-bit event
counts. Event counts let us easily integrate OS-level blocking (e.g.,
futexes) in lock-free protocols. Waking up waiters only involves locking
inside the OS kernel, and does not happen at all when no waiter is blocked.
Waiters only block conditionally, if the event count's value is
still equal to some prior value.
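A conceptual sketch of that protocol; this is not ck_ec's API. C11 atomics
and a Linux futex stand in for the real primitives, and all names below
are illustrative.

    #include <limits.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct eventcount {
        _Atomic uint32_t counter;
        _Atomic uint32_t waiters;
    };

    static void
    futex_wait(_Atomic uint32_t *addr, uint32_t expected)
    {
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
            NULL, NULL, 0);
    }

    static void
    futex_wake(_Atomic uint32_t *addr)
    {
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX,
            NULL, NULL, 0);
    }

    /* Producer: bump the count; the kernel is entered only if a waiter
     * is actually blocked. */
    static void
    ec_inc(struct eventcount *ec)
    {
        atomic_fetch_add(&ec->counter, 1);
        if (atomic_load(&ec->waiters) != 0)
            futex_wake(&ec->counter);
    }

    /* Consumer: block only while the count still equals the previously
     * observed value; FUTEX_WAIT rechecks that equality in the kernel,
     * so a concurrent increment cannot cause a lost wakeup. */
    static void
    ec_wait(struct eventcount *ec, uint32_t old_value)
    {
        atomic_fetch_add(&ec->waiters, 1);
        while (atomic_load(&ec->counter) == old_value)
            futex_wait(&ec->counter, old_value);
        atomic_fetch_sub(&ec->waiters, 1);
    }
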
ck_ec supports multiple producers (wakers) and consumers (waiters),
and, on x86-TSO, has a more efficient specialisation for single
producer mode. In the latter mode, the overhead compared to a version
counter is on the order of 2-3 cycles and 1-2 instructions, in the
fast path. The slow path, when there are threads blocked on the event
count, consists of one additional atomic instruction and a futex
syscall.
Similarly, the fast path for consumers, when an update comes quickly,
has no overhead compared to spinning on a read-only counter. After
a few thousand cycles, consumers (waiters) enter the slow path with
one atomic instruction and a few blocking syscalls.
The single-producer specialisation requires the x86-TSO memory model,
x86's non-atomic read-modify-write instructions, and, ideally, a
futex-like OS abstraction. On !x86/x86_64 platforms, single-producer
increments fall back to the multiple-producer code path.
Fixes https://github.com/concurrencykit/ck/issues/79
On FreeBSD, atomic operations in the kernel must access the nucleus
address space. Userland may either use the atomic instructions that take
no ASI (address space identifier) or specify the primary address space.
To avoid hardcoding the address space here, we grab the corresponding
identifier from the appropriate machine header, but only for the kernel,
so the namespace doesn't get polluted for userland.
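Schematically (the CK-side macro name below is made up; only the kernel
path includes the machine header):

    /* Illustrative sketch only. */
    #if defined(__FreeBSD__) && defined(_KERNEL)
    #include <machine/asi.h>        /* provides ASI_N, the nucleus ASI */
    #define CK_PR_CAS_ASI ASI_N
    #else
    #define CK_PR_CAS_ASI 0x80      /* SPARC V9 primary address space  */
    #endif
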
Don't attempt to be too smart; just follow the algorithm. Failing to do
so may lead a thread to wrongly believe it owns the lock when it does
not.
This should fix the random failures reported on PPC with many threads.
These new macros are very convenient for modifying a SLIST after
using CK_SLIST_FOREACH_PREVPTR to find an element to remove
or a position to insert.
FreeBSD sys/queue.h already has SLIST_REMOVE_PREVPTR.
I would like to use the new macros in a change that I am planning
for the FreeBSD kernel:
https://reviews.freebsd.org/D16016
https://reviews.freebsd.org/D15905
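A usage sketch following the FreeBSD sys/queue.h convention for the
PREVPTR macros (entry type and field names are illustrative):

    #include <ck_queue.h>

    struct entry {
        int key;
        CK_SLIST_ENTRY(entry) link;
    };

    CK_SLIST_HEAD(entry_list, entry);

    /* Walk the list while tracking a pointer to the previous element's
     * next pointer, then unlink the match without re-walking. */
    static void
    remove_key(struct entry_list *head, int key)
    {
        struct entry *e, **prevp;

        CK_SLIST_FOREACH_PREVPTR(e, prevp, head, link) {
            if (e->key == key) {
                CK_SLIST_REMOVE_PREVPTR(prevp, e, link);
                break;
            }
        }
    }
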
* Implement ck_pr_dec_is_zero family of functions
* include/ck_pr.h: add ck_pr_{dec,inc}_is_zero and implement ck_pr_{dec,inc}_zero in terms of the new functions. Convert the architecture-specific implementations of ck_pr_foo_zero for x86 and x86-64 to ck_pr_foo_is_zero.
* regressions/ck_pr/validate: add smoke tests for ck_pr_dec_{,is_}zero and ck_pr_inc_{,is_}zero
* doc: document ck_pr_inc_is_zero
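For example (a hedged sketch, assuming the uint variant of the new family),
a reference-count release becomes a single call:

    #include <ck_pr.h>
    #include <stdlib.h>

    struct object {
        unsigned int ref;
        /* ... payload ... */
    };

    static void
    object_release(struct object *o)
    {
        /* Atomically decrement and test the result in one step, rather
         * than threading a bool out-parameter through ck_pr_dec_uint_zero. */
        if (ck_pr_dec_uint_is_zero(&o->ref))
            free(o);
    }
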
In an attempt to prevent gcc from emitting warnings, ck_pr_md_load_ptr and
ck_pr_md_store_ptr were made wrong in commit
5ae12a19d0.
load_ptr would return target instead of *target, and store would store the
value in target instead of in *target.
This is an attempt at fixing that, while still trying to avoid the warnings.
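To illustrate the intended semantics (simplified, without the
warning-avoidance contortions of the real macros):

    #include <stdint.h>

    /* Correct: dereference target and return the pointer stored there. */
    static void *
    load_ptr(const void *target)
    {
        return *(void *const *)target;
    }

    /* The regression returned the address itself, i.e. the equivalent
     * of the following. */
    static void *
    load_ptr_broken(const void *target)
    {
        return (void *)(uintptr_t)target;
    }
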
This primarily affects the FreeBSD kernel, where the popcount builtin
can be problematic (relies on compiler-provided libraries). See the
history of __POPCNT__ for details [1].
- A new flag, CK_MD_CC_BUILTIN_DISABLE, can be set to indicate that CK
should avoid relying on compiler builtins where possible.
- ck_cc_clz has been removed; it was unused.
- ck_internal_bsf has been removed; it was a broken duplicate of ck_cc_ffs
and has been replaced by ck_cc_ffs (see the sketch after this list).
Previous consumers were using the bsf instruction either way.
- ck_{rhs,hs,ht} have been updated to use ck_cc_ffs*.
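For reference, a sketch of ck_cc_ffs usage; the assertions assume
ffs(3)-style semantics (1-based index of the least significant set bit,
0 when no bit is set):

    #include <assert.h>
    #include <ck_cc.h>

    int
    main(void)
    {
        assert(ck_cc_ffs(0x0) == 0);    /* no bit set               */
        assert(ck_cc_ffs(0x1) == 1);    /* least significant bit    */
        assert(ck_cc_ffs(0x18) == 4);   /* lowest set bit of 0b11000 */
        return 0;
    }
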
If FreeBSD requires the builtins for performance reasons, we will lift the
appropriate detection into ck_md (at least the bt* and bs* families of
functions don't have the same problems as popcount on most targets).
1: https://lists.freebsd.org/pipermail/svn-src-head/2015-March/069663.html
A note has also been added about some ambiguity with respect to WC
memory and relaxed memory semantics (hence the heavier-weight mfence
semantics for the strict acquire-release interface).
All fences related to atomic operations have been removed, as they were
unnecessary and therefore confusing.
Memoize the map into ck_hs_iterator_t to make iteration safer in the face
of growth or shrinkage of the map. Tests are included for the same.
Work from Riley Berton.
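The iteration interface itself is unchanged; a minimal sketch (the callback
and naming are illustrative), which with the memoized map is intended to
stay valid even if the hash set grows or shrinks concurrently:

    #include <ck_hs.h>

    static void
    visit_all(struct ck_hs *hs, void (*cb)(void *))
    {
        ck_hs_iterator_t it;
        void *entry;

        ck_hs_iterator_init(&it);
        while (ck_hs_next(hs, &it, &entry) == true)
            cb(entry);
    }
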
With preemption, it is possible for _ck_ring_enqueue_mp to have a
snapshot of p_head so stale with respect to the later snapshot of
c_head that a comparison modulo (small) ring size will erroneously
conclude that the ring is full.
Detect that case and retry rather than failing. We only retry when
the enqueuers have made global forward progress, so the first loop
is as lock-free as it ever was.
Bonus: the new condition should be marginally faster.
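Schematically (simplified, with illustrative field names and a
power-of-two ring size; not the actual _ck_ring_enqueue_mp code):

    #include <ck_pr.h>
    #include <stdbool.h>

    struct ring {
        unsigned int p_head;    /* enqueuers' reservation cursor */
        unsigned int c_head;    /* dequeuers' cursor             */
        unsigned int size;      /* power of two                  */
    };

    static bool
    ring_reserve(struct ring *r, unsigned int *slot)
    {
        unsigned int p, c;

        for (;;) {
            p = ck_pr_load_uint(&r->p_head);
            c = ck_pr_load_uint(&r->c_head);

            /* "Full" is trusted only if p_head has not moved since our
             * snapshot; if other enqueuers made progress, the snapshot
             * was stale, so retry instead of failing. */
            if (p - c >= r->size) {
                if (ck_pr_load_uint(&r->p_head) != p)
                    continue;
                return false;       /* genuinely full */
            }

            if (ck_pr_cas_uint(&r->p_head, p, p + 1) == true) {
                *slot = p & (r->size - 1);
                return true;
            }
        }
    }
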