I had the pleasure of spending a significant amount of time at the most
recent LPC with Mathieu Desnoyers and Paul McKenney. In discussing
RCU semantics in relation to epoch reclamation, it was argued that
epoch reclamation is a specialisation of RCU (rather than a generalization).
In light of this discussion, I thought it would make more sense to not expose
write-side synchronization semantics aside from ck_epoch_call (similar to
RCU call), ck_epoch_poll (identical to tick), ck_epoch_barrier and
ck_epoch_synchronization (similar to ck_epoch_synchronization). Writers will
now longer have to use write-side epoch sections but can instead rely on
epoch_barrier/synchronization for blocking semantics and ck_epoch_poll
for old tick semantics.
One advantage of this is we can avoid write-side recursion for certain workloads.
Additionally, for infrequent writes, epoch_barrier and epoch_synchronization both
allow for blocking semantics to be used so you don't have to pay the cost of
epoch_entry for non-blocking dispatch.
Example usage:
e = stack_pop(mystack);
ck_epoch_synchronize(...);
free(e);
read_begin and read_end has been replaced with ck_epoch_begin and ck_epoch_end.
If multiple writers need SMR guarantees, then they can also use ck_epoch_begin
and ck_epoch_end. Any dispatch in presence of multiple writers should be done
with-in an epoch section (for now).
There are some follow-up commits to come.
Some people might be confused as far as lack of
fencing in the lock. Add a comment to clarify that
old values should not be equal to new values
of current position (where acquiring the current position
already has a global ordering).
As ck_pr semantics were still not molded, I was designing
under the assumption I would potentially go towards
acq/req interface. Since RMO will be the semantic norm for
the ck_pr model from now on, enforce stricter ordering
requirements on rwlock.
ck_rwlock_write_unlock function will now also serialize both
loads and stores.
I was actually unsure of the exact memory model
I wanted for atomic RMW operations. It was
made apparent with time that I had to adopt RMO
if I didn't want to sacrifice performance. Make
sure we can assume RMO for the stack.
I accidentally swapped head/tail load in ck_hp_fifo (not in
ck_fifo, however). We must acquire head snapshot before tail snapshot.
An example execution history which could cause an incorrect update to occur
is below.
- tail <- fifo.tail / fifo.head != fifo.tail
- dequeue to empty (until final CAS which renders fifo.head = fifo.tail)
- head <- fifo.head / (head != tail)
- next <- fifo.head->next / next = NULL
- As head != tail, update to next pointer (where next is NULL).
However, if
- head <- fifo.head / (fifo.head != fifo.tail)
- dequeue to empty (until final CAS which renders fifo.head = fifo.tail)
- tail <- fifo.tail / fifo.head != fifo.tail
- next <- fifo.head->next / next = NULL
If we caught tail in final transition, the by the time we read next pointer,
head would have also changed forcing us to re-read. Thanks to Hendrik Donner
for reporting this.
Documentation and regressions tests have been updated to reflect this.
This functionality allows for individual hash tables use to different
allocation functions. Thanks to Wez Furlong for pointing out the necessary
documentation update for ck_ht.
Pointer packing is now disabled by default for x86_64 targets.
Jeffrey M. Birnbaum <jmbny.@...> told me that according to his
discussions with Intel engineers, Haswell will be bumping up
VMA bits to 56 bits from 48.
If you control the hardware that CK is deployed to and don't
envision a migration to 48-bits anytime soon, then you may
enable old behavior (resulting in significant memory savings
for some data structures, namely ck_ht) by passing the
--enable-pointer-packing flag to configure.
Migrate available block list to CK_LIST.
New blocks are only allocated when the available list is exhausted.
Remove bag->avail_tail.
Print out number of writer iterations for unit test.
Lengthen duration of unit test.
This changes comes at the cost of clear linearizability, which
is suitable for my use-case. Users can easily implement linereazability
through an additional level of indirection to the ck_bitmap object.
Add necessary load fence to iterator.
Initialize iterator appropriately for empty bags.
Improve unit test.
Fix bag linkage bug for non x86_64 targets.
Fix block accounting on removal.
Specifically, any platform that has CK support for 64-bit
load/store operations.
Additional improvements have been made to the unit tests
to disambiguate put/get failures.
This is a hash table that is optimized for architectures that
implement total store ordering and workloads that are read-heavy
involving a single writer and multiple readers. Unlike traditional
non-blocking multi-producer/multi-consumer hash table
implementations this version allows for immediate re-use of deleted
buckets (no need for explicit reclamation cycles) and is more
conducive to traditional safe memory reclamation schemes used in
unmanaged languages (otherwise, we would require key duplication).
It is relatively heavy-weight for MPMC workloads on architectures
which do not implement TSO in comparison to Click's MPMC hash
table. However, it still has better performance characteristics
than a blocking hash table.
The committed version currently only provides x86_64 support. This is
being committed for review by peers and for a silent release that will
allow us to test ck_ht_spmc under high production workloads.
Next public release will include additional documentation as well as
support for other architectures.
In the mean time, please see the unit tests for example usage. Included in
this commit: Dropped -Wbad-function-cast from GCC port.
We shouldn't offload the responsibility of the read_begin flush
for shared data mutations to the user. read_end requires a load
barrier at the least, not a store barrier.
Writer-side synchronization is still necessary. My current use-cases call for
SLIST and LIST implementations, and as such, I've only implemented support
for these. TAILQ facilities will be developed when the time comes that I require
them or if there is sufficient user-demand.
Several users in the past have noted it was difficult for them
to decide what spinlock implementation to use. In light of this,
a light-weight greedy default is chosen (currently ck_spinlock_fas).
build:
- configure step will generate relevant CFLAGS.
- build profiles are for convenience (developers can use themu
for cross-compilation).
regressions:
- Renamed ck_barrier unit tests to work-around behavior
of Solaris linker.
- Adopted use of a PTHREAD_CFLAGS variable.
ck_cc:
- Added internal CK_CC_IMM macro for compilers that are
verbose against impossible inline constraints (or limited
optimizers).
ck_pr/x86*:
- Adopted CK_CC_IMM macro.
- Dropped redundant constraints.
This work was mostly completed by Theo Schlossnagle
<jesus@omniti.com>, much thanks to him. He has
also provided access to a machine with Sun Studio 12.