There is an off-by-one for slot ID sizeof(bytelock->readers) + 1.
This patch fixes the handling of this slot ID. Based off a patch
submitted by Albi Kavo <albi.kavo@gma....>.
Add a read-mostly mode, in which entries won't share cache lines with
associated datas (probes, probe_bound, etc).
This makes write operations slower, but make get faster.
This will serve to drive both pointer-based and type-specialized ring
implementations. Some full fences have also been removed as they are
unnecessary with the introduction of X_Y fence interface.
Remove the ring buffer from the struct ck_ring, it is now required to
explicitely pass the ring buffer for enqueue/dequeue. That can be useful for
systems with multiple address spaces.
This operation is short-hand notation for rebuilding
a hash table. This rebuild can occur in the presence
of concurrent readers and will require twice the amount
of memory of the existing hash table until completion.
This borrows from a technique described by Purcell and Harris
in "Non-blocking hashtables with open addressing" technical report.
Essentially, every slot will have an associated local probe maxim,
including tombstones. Highly aggressive workloads may still require
occassional garbage collection.
This function allows for faster insertions into tombstone-heavy
probe sequences by short-circuiting on tombstones rather than
continuing to probe. The user must already guarantee that the
entry being inserted is unique. If a non-unique key is inserted
with this operation, undefined behavior will result.
Could not find suitable use-case and generally doesn't
appear interesting to academics in the existing
form. Maybe it will make a come-back in the future with
fewer memory and latency compromises.
This adds support for CAS_64{_VALUE}, CAS_PTR_2{_VALUE},
LOAD_64, STORE_64 and other primitives built on universal
CAS primitive.
Patch submitted by Olivier Houchard <cognet@FreeBSD>.
The array is optimized for SPMC and fast iteration (though MPMC
transformation is also possible). This is an extremely simple
implementation with support for atomic in-place modification
through put -> remove elimination.
Besides implementing Thumb 2 supports, this fixes incorrect usage of
"cmp" where "cmpeq" was meant.
Patch submitted by Olivier Houchard <cognet@freebsd>.
This operation moves ownership from one hash set object
to another and re-assigns callback functions to developer-specified
values. This allows for dynamic configuration of allocation
callbacks and is necessary for use-cases involving executable code
which may be unmapped underneath the hash set.
The developer is responsible for enforcing barriers and enforcing
the visibility of the new hash set.
When spinning on global counters, it cannot be assumed that is_locked
functions will guarantee atomic to load ordering, an explicit fence
is necessary. is_locked will only guarantee load ordering.
These come in the form of CK_ELIDE_ADAPTIVE_PROTOTYPE,
CK_ELIDE_LOCK_ADAPTIVE and CK_ELIDE_UNLOCK_ADAPTIVE.
Primarily pushing this for the few that are playing with
master.
This is inspired by Andi Kleen's work for adaptive behavior
in the Linux kernel's RTM locks implementation. There are
various differences in the state machine, however. Specifically,
the concept of a retry and a busy-wait has been unified due
to state machine simplification such that any exhausted busy-wait
cycle reverts to a forfeit (a busy-wait is a specialized retry).
Follow-up work will involve allowing for is_locked behavior
to yield what users expect, if called from with-in a transaction
through the wrapper. It is warned that this will come at a performance
penalty.
This is an example limitation of fence_X_Y variant. I am
considering extending this to include an acquire extension.
Use a memory fence to force total order in a manner that
will be clearer to other developers who read this.
This did not manifest as a problem on any target architectures
due to their handling of atomic operations (SPARC models it as
both a load and a store, while Power atomic_load ordering was
enforced through a full barrier).
It is possible this will be moved to a self-contained file.
For a majority of architectures, RTM is an unnecessary
implementation-specific optimization.
I accidentally removed ck_pr_fence implicit compiler
barrier semantics in re-structure of ck_pr_fence.
This does affect the correctness of any data structures
in ck_pr_fence or the correctness of consumers of ck_pr
operations where ck_pr serves as linearization points.
The reason it does not affect any CK data structures is
that explicit compiler barriers (whether they are store/load
operations or atomic ready-modify-write operations) always
serve as linearization points.
However, if consumers are doing tricky things like using
these barriers to serialize aliased locations for correctness,
then it is possible for compiler re-ordering to bite them in
the ass.
These add unnecessary complexity to the ck_pr_fence interface.
Instead, it can be safely assumed that developers will use
ck_pr_fence_X to enforce X -> X ordering.
This includes fixing acquire semantics on mcs_lock fast path.
This represents an additional fence on the fast path for
acquire semantics post-acquisition.
More specifically, note that in memory models where atomic
operations do not have serializing effects that atomic
read-modify-write operations are modeled as store operations.
These operations serialize atomic-RMW operations with respect
to each other, loads and stores. In addition to this, the
load_depends implementations have been removed.
ck_pr_fence_{load_load,store_store,load_store,store_load} operations
have been added. In addition to this, it is no longer the responsibility
of architecture ports to determine when to emit a specific fence. Instead,
the underlying port will always emit the necessary instructions to
enforce strict ordering. The higher-level include/ck_pr implementation will
enforce whether or not a fence is necessary to be emitted according to
the memory model specified by ck_md (CK_MD_{TSO,RMO,PSO}).
In other words, only ck_pr_fence_strict_* is implemented by the MD-ck_pr
port.
This function allows for explicit execution of all
deferred callbacks in an epoch_record. The primary
motivation is currently for performance profiling
but there are other use-cases where best-effort
semantics could be applied.
Several counter-examples were found which break in the
presence of store-to-load re-ordering. Strict fence
semantics are necessary.
Thanks to Paul McKenney for helpful discussions.
These variants of ck_ring_enqueue_* return the snapshot of queue
length with respect to the linearization point. This can be used to
extract ring size without incurring additional cacheline invalidation
overhead from the writer.
I still need to implement benchmark tests and write documentation. The reader-writer cohort locks also required that I add a method to the existing ck_cohort framework to determine whether or not a cohort lock is currently in a locked state.
John Wittrock has contributed a phase-fair reader-writer
lock implementation. These locks allow phase fairness
guarantees between readers and writers. This work includes
additional changes and clean-up.
Follow-up work is expected.
Thanks to John Wittrock for patches and Professor Gabriel
Parmer (http://www.seas.gwu.edu/~gparmer/) for advising.
Upon popular request, added a variant of the ticket spinlock
with trylock support. This is pending additional verification
on other architectures besides x86*. It is still unclear whether
this implementation will be the default as it is has slower
fast path.
Add trylock support to the ck_spinlock validation tests.
It currently only tests ck_spinlock_ticket_t trylock
functionality if available.
CK_LIST_INSERT_HEAD was incorrectly managing prev
pointer on insertion to non-empty list. This bug
would cause erroneous behavior on CK_LIST_REMOVE
to non-head elements. Unit test will be updated
for this regression.
An off-by-one was introduced in downgrade path from writer.
This can cause deadlock if a writer downgrades from a write lock.
Pointed out by Jeffrey Birnbaum <jmb...@...>.
Both LLVM-backed compilers and GCC incorrectly treat
a barrier-sandwiched load as a loop invariant in dequeue_spmc.
Forcing volatile atomic load semantics generates the right
thing.
Thanks to Devon O'Dell and Abel Mathew for help in catching
this issue.
The distinction between additive/exponential implementation
and geometric implementation does little but confuse users.
The terminology used in ck_backoff now reflects terminology
used in literature.
ck_backoff_gb has been removed.
This operation is of format:
CK_S*LIST_MOVE(a, b, linkage) and is equivalent to intializing
a with the contents of b. This is done in a manner that is atomic
with respect to readers. Read-only operations are still valid in
b, but behavior is undefined for write-side operations on b after
a MOVE operation.
I had the pleasure of spending a significant amount of time at the most
recent LPC with Mathieu Desnoyers and Paul McKenney. In discussing
RCU semantics in relation to epoch reclamation, it was argued that
epoch reclamation is a specialisation of RCU (rather than a generalization).
In light of this discussion, I thought it would make more sense to not expose
write-side synchronization semantics aside from ck_epoch_call (similar to
RCU call), ck_epoch_poll (identical to tick), ck_epoch_barrier and
ck_epoch_synchronization (similar to ck_epoch_synchronization). Writers will
now longer have to use write-side epoch sections but can instead rely on
epoch_barrier/synchronization for blocking semantics and ck_epoch_poll
for old tick semantics.
One advantage of this is we can avoid write-side recursion for certain workloads.
Additionally, for infrequent writes, epoch_barrier and epoch_synchronization both
allow for blocking semantics to be used so you don't have to pay the cost of
epoch_entry for non-blocking dispatch.
Example usage:
e = stack_pop(mystack);
ck_epoch_synchronize(...);
free(e);
read_begin and read_end has been replaced with ck_epoch_begin and ck_epoch_end.
If multiple writers need SMR guarantees, then they can also use ck_epoch_begin
and ck_epoch_end. Any dispatch in presence of multiple writers should be done
with-in an epoch section (for now).
There are some follow-up commits to come.