I still need to implement benchmark tests and write documentation. The reader-writer cohort locks also required that I add a method to the existing ck_cohort framework to determine whether or not a cohort lock is currently in a locked state.
I added an extra argument to the CK_COHORT_INIT macro to allow users to specify custom pass limits
when using it. I also added a reference to the paper on which the cohort implementation was based.
The constants and macros in ck_cohort.h didn't conform to CK naming conventions. Additionally, I was able to remove a lot of unnecessary atomic operations and memory fences, which reduced latency by 10-20% and increased throughput using ticket locks by nearly an order of magnitude.
John Wittrock has contributed a phase-fair reader-writer
lock implementation. These locks allow phase fairness
guarantees between readers and writers. This work includes
additional changes and clean-up.
Follow-up work is expected.
Thanks to John Wittrock for patches and Professor Gabriel
Parmer (http://www.seas.gwu.edu/~gparmer/) for advising.
Upon popular request, added a variant of the ticket spinlock
with trylock support. This is pending additional verification
on other architectures besides x86*. It is still unclear whether
this implementation will be the default as it is has slower
fast path.
Add trylock support to the ck_spinlock validation tests.
It currently only tests ck_spinlock_ticket_t trylock
functionality if available.
This primarily involved changing the configure script and adding
several utility functions to regressions/common.h for unit testing.
Signed-off-by: Samy Al Bahra <sbahra@appnexus.com>
The distinction between additive/exponential implementation
and geometric implementation does little but confuse users.
The terminology used in ck_backoff now reflects terminology
used in literature.
ck_backoff_gb has been removed.
Execution history exists such that first thread will dequeue
from an uninitialized ring buffer. The unit test has been
(un)fortunate in that it would busy-wait during dequeue.
Full barrier semantics will enforce visibility.
Reported by: Maxime Henrion <mhenrion@gma....>.
This operation is of format:
CK_S*LIST_MOVE(a, b, linkage) and is equivalent to intializing
a with the contents of b. This is done in a manner that is atomic
with respect to readers. Read-only operations are still valid in
b, but behavior is undefined for write-side operations on b after
a MOVE operation.
I had the pleasure of spending a significant amount of time at the most
recent LPC with Mathieu Desnoyers and Paul McKenney. In discussing
RCU semantics in relation to epoch reclamation, it was argued that
epoch reclamation is a specialisation of RCU (rather than a generalization).
In light of this discussion, I thought it would make more sense to not expose
write-side synchronization semantics aside from ck_epoch_call (similar to
RCU call), ck_epoch_poll (identical to tick), ck_epoch_barrier and
ck_epoch_synchronization (similar to ck_epoch_synchronization). Writers will
now longer have to use write-side epoch sections but can instead rely on
epoch_barrier/synchronization for blocking semantics and ck_epoch_poll
for old tick semantics.
One advantage of this is we can avoid write-side recursion for certain workloads.
Additionally, for infrequent writes, epoch_barrier and epoch_synchronization both
allow for blocking semantics to be used so you don't have to pay the cost of
epoch_entry for non-blocking dispatch.
Example usage:
e = stack_pop(mystack);
ck_epoch_synchronize(...);
free(e);
read_begin and read_end has been replaced with ck_epoch_begin and ck_epoch_end.
If multiple writers need SMR guarantees, then they can also use ck_epoch_begin
and ck_epoch_end. Any dispatch in presence of multiple writers should be done
with-in an epoch section (for now).
There are some follow-up commits to come.
Though a new implementation is in the works, roll in
some performance improvements in the mean time.
The probe routines have been broken out into separate
reader/writer variants. These variants are much less
branch-intensive (and don't involve predict to stall
in many cases).
New implementation is attempting to deal with
interface-induced overheads.
Documentation and regressions tests have been updated to reflect this.
This functionality allows for individual hash tables use to different
allocation functions. Thanks to Wez Furlong for pointing out the necessary
documentation update for ck_ht.
Migrate available block list to CK_LIST.
New blocks are only allocated when the available list is exhausted.
Remove bag->avail_tail.
Print out number of writer iterations for unit test.
Lengthen duration of unit test.
This changes comes at the cost of clear linearizability, which
is suitable for my use-case. Users can easily implement linereazability
through an additional level of indirection to the ck_bitmap object.
Add necessary load fence to iterator.
Initialize iterator appropriately for empty bags.
Improve unit test.
Fix bag linkage bug for non x86_64 targets.
Fix block accounting on removal.
Specifically, any platform that has CK support for 64-bit
load/store operations.
Additional improvements have been made to the unit tests
to disambiguate put/get failures.
This is a hash table that is optimized for architectures that
implement total store ordering and workloads that are read-heavy
involving a single writer and multiple readers. Unlike traditional
non-blocking multi-producer/multi-consumer hash table
implementations this version allows for immediate re-use of deleted
buckets (no need for explicit reclamation cycles) and is more
conducive to traditional safe memory reclamation schemes used in
unmanaged languages (otherwise, we would require key duplication).
It is relatively heavy-weight for MPMC workloads on architectures
which do not implement TSO in comparison to Click's MPMC hash
table. However, it still has better performance characteristics
than a blocking hash table.
The committed version currently only provides x86_64 support. This is
being committed for review by peers and for a silent release that will
allow us to test ck_ht_spmc under high production workloads.
Next public release will include additional documentation as well as
support for other architectures.
In the mean time, please see the unit tests for example usage. Included in
this commit: Dropped -Wbad-function-cast from GCC port.
Writer-side synchronization is still necessary. My current use-cases call for
SLIST and LIST implementations, and as such, I've only implemented support
for these. TAILQ facilities will be developed when the time comes that I require
them or if there is sufficient user-demand.
Several users in the past have noted it was difficult for them
to decide what spinlock implementation to use. In light of this,
a light-weight greedy default is chosen (currently ck_spinlock_fas).
build:
- configure step will generate relevant CFLAGS.
- build profiles are for convenience (developers can use themu
for cross-compilation).
regressions:
- Renamed ck_barrier unit tests to work-around behavior
of Solaris linker.
- Adopted use of a PTHREAD_CFLAGS variable.
ck_cc:
- Added internal CK_CC_IMM macro for compilers that are
verbose against impossible inline constraints (or limited
optimizers).
ck_pr/x86*:
- Adopted CK_CC_IMM macro.
- Dropped redundant constraints.
This work was mostly completed by Theo Schlossnagle
<jesus@omniti.com>, much thanks to him. He has
also provided access to a machine with Sun Studio 12.
ck_epoch_reclaim is now the replacement for ck_epoch_flush.
ck_epoch_purge guarantees that all entries are reclaimed
for the provided record before exiting.
n_peak counter has been added, which provides the peak number
of items across all reclamation lists. n_reclamations provides
the number of reclamations across the lifetime of the record.
These are cleared on unregister.
ck_epoch_update has been renamed to ck_epoch_tick.
Hazardous sections which mutate shared structures are now
expected to begin with ck_epoch_write_begin and end with
ck_epoch_end.
Hazardous sections which read shared structures are now
expected to begin with ck_epoch_read_begin and end with
ck_epoch_end.
ck_hp_free is now more aggressive. It will attempt a
reclamation cycle any time the pending count is long.
I should probably add a ck_hp_retire to have a version
which allows for bulk updates to local reclamation lists.
The barriers have been restructured into individual file
per implementation. Some micro-optimizations were implemented
for some barriers (caching common computations in the barrier).
State subsription is now explicit with the TID counter allocated
on a per-barrier basis.
Tournament barriers remaining and then another round will be done
for correctness and algorithmic improvements.
These are the tournament and mcs barriers from "Algorithms for Scalable
Synchronization on Shared-Memory Multiprocessors." Validation tests have
also been added for these barriers to regressions/ck_barrier/validate.
This is the software combining tree barrier from the MCS paper. Currently,
it uses a binary tree; it may be changed later to use an n-ary tree.
Validation (combining_validation.c in regressions/ck_barrier/validate)
has also been added.
Instead of executing a join on the threads, aggregate existing state
of their counters. This can mean one missed increment per thread,
which is not a big deal.
Moved rdtsc and affinity logic to a single file which other
regression tests use. Single point of reference will ease
porting these to future architectures and platforms. Removed
invalid Copyright statement.
Added CK_CC_USED to force some code generation that I found
useful for debugging.
Added ck_stack latency tests and a modified version of djoseph's
modifications to benchmark.h for spinlock latency tests.