
Lock-Free Data Structures in Java

2015-08-19 21:57
Today I was reading a book that covers lock-free data structures in Java. Looking at the implementation of CopyOnWriteArrayList, I found that reads take no lock, while every write creates a new copy of the backing array whose capacity is larger by one, performs the modification under a lock, and finally swaps the new array in; the locking is done with a ReentrantLock throughout. So why is something that clearly uses an explicit lock described as lock-free? A simplified sketch of the copy-on-write scheme follows, and after that the definition of lock-free that I looked up.
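
Below is my own minimal sketch of that scheme, just to make the description concrete; it is not the real java.util.concurrent.CopyOnWriteArrayList source, and the class and method names are made up for illustration.

import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Simplified copy-on-write list sketch; names and details are illustrative only.
class CowList<E> {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile Object[] elements = new Object[0];

    // Read path: no lock, just an index into the current array snapshot.
    @SuppressWarnings("unchecked")
    public E get(int index) {
        return (E) elements[index];
    }

    // Write path: lock, copy into an array whose capacity is one larger,
    // append the element, then publish the new array.
    public void add(E e) {
        lock.lock();
        try {
            Object[] old = elements;
            Object[] copy = Arrays.copyOf(old, old.length + 1);
            copy[old.length] = e;
            elements = copy;
        } finally {
            lock.unlock();
        }
    }
}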

Lock-free data structures

For a data structure to qualify as lock-free, more than one thread must be able to access the data structure concurrently. They don't have to be able to do the same operations; a lock-free queue might allow one thread to push and one to pop but break if two threads try to push new items at the same time. Not only that, but if one of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.

Algorithms that use compare/exchange operations on the data structure often have loops in them. The reason for using a compare/exchange operation is that another thread might have modified the data in the meantime, in which case the code will need to redo part of its operation before trying the compare/exchange again. Such code can still be lock-free if the compare/exchange would eventually succeed if the other threads were suspended. If it wouldn't, you'd essentially have a spin lock, which is nonblocking but not lock-free.

Lock-free algorithms with such loops can result in one thread being subject to starvation. If another thread performs operations with the "wrong" timing, the other thread might make progress while the first thread continually has to retry its operation. Data structures that avoid this problem are wait-free as well as lock-free.

Types of nonblocking data structures

Back in chapter 5, we implemented a basic mutex using std::atomic_flag as a spin lock. The code is reproduced in the following listing.

class spinlock_mutex
{
    std::atomic_flag flag;
public:
    spinlock_mutex():
        flag(ATOMIC_FLAG_INIT)
    {}
    void lock()
    {
        while(flag.test_and_set(std::memory_order_acquire));
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};

This code doesn’t call any blocking functions; lock() just keeps looping until the call

to test_and_set() returns false. This is why it gets the name spin lock—the code

“spins” around the loop. Anyway, there are no blocking calls, so any code that uses this

mutex to protect shared data is consequently nonblocking. It’s not lock-free, though. It’s

still a mutex and can still be locked by only one thread at a time. Let’s look at the definition

of lock-free so you can see what kinds of data structures are covered.

Wait-free data structures

A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads. Algorithms that can involve an unbounded number of retries because of clashes with other threads are thus not wait-free.

Writing wait-free data structures correctly is extremely hard. In order to ensure that every thread can complete its operations within a bounded number of steps, you have to ensure that each operation can be performed in a single pass and that the steps performed by one thread don't cause an operation on another thread to fail. This can make the overall algorithms for the various operations considerably more complex.

Given how hard it is to get a lock-free or wait-free data structure right, you need some pretty good reasons to write one; you need to be sure that the benefit outweighs the cost. Let's therefore examine the points that affect the balance.

A nonblocking data structure means threads are never blocked: when a request cannot be satisfied right away, the thread keeps retrying in a loop, that is, it spins. The definition of lock-free is less clear-cut; roughly speaking, taking a lock-free queue as the example, it may allow one writer and several readers at the same time but not several writers, and the writing thread does not have to wait just because some other thread is blocked, which is to say there is always some thread able to keep running. A wait-free data structure means that every thread can complete its operation within a bounded number of steps.
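
To make the distinction concrete in Java terms (this example is mine, not from the book), compare a plain volatile read, which is wait-free, with a compare-and-set retry loop, which is lock-free but not wait-free:

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: wait-free versus lock-free operations on a shared counter.
class ProgressDemo {
    private final AtomicInteger value = new AtomicInteger();

    // Wait-free: a single volatile read completes in a bounded number of steps,
    // no matter what the other threads are doing.
    int read() {
        return value.get();
    }

    // Lock-free but not wait-free: if other threads keep winning the CAS race,
    // this loop can retry an unbounded number of times; still, every failed CAS
    // means some other thread succeeded, so the system as a whole makes progress.
    int increment() {
        while (true) {
            int current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return current + 1;
            }
        }
    }
}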

This still did not feel clear enough, so I went back to Andrei Alexandrescu's article (what follows is reposted from http://www.drdobbs.com/184401865):

Lock-free data structures guarantee the progress of at least one thread when executing multithreaded procedures, thereby helping you avoid deadlock.

Andrei Alexandrescu is a graduate student in Computer Science at the University of Washington and author of Modern C++ Design. He can be contacted at andrei@metalanguage.com.

After skipping an installment of "Generic<Programming>" (it's naive, I know, to think that grad school asks for anything less than 100 percent of one's time), there has been an embarrassment of riches as far as topic candidates for this article. One topic
candidate was a discussion of constructors; in particular, forwarding constructors, handling exceptions, and two-stage object construction. One other topic candidate—and another glimpse into the Yaslander technology [2]—was creating containers (such as lists,
vectors, or maps) of incomplete types, something that is possible with the help of an interesting set of tricks, but not guaranteed by the standard containers.

While both candidates are intriguing, they couldn't stand a chance against lock-free data structures, which are all the rage in the multithreaded programming community. At this year's Programming Language Design and Implementation conference (http://www.cs.umd.edu/~pugh/pldi04/),
Michael Maged presented the world's first lock-free memory allocator [7], which in many tests surpasses other, more complex, carefully designed lock-based allocators.

This is the most recent of many lock-free data structures and algorithms that have appeared in the recent past.

What Do You Mean, "lock-free?"

That's exactly what I would have asked only a while ago. As the bona-fide mainstream multithreaded programmer that I was, lock-based multithreaded algorithms were familiar to me. In classic lock-based programming, whenever you need to share some data, you
need to serialize access to it. The operations that change data must appear as atomic, such that no other thread intervenes to spoil your data's invariant. Even a simple operation such as ++count_ (where count_ is an integral
type) must be locked. Incrementing is really a three-step (read, modify, write) operation that isn't necessarily atomic.
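
Translated into Java (my illustration, not code from the article), the classic lock-based answer is to wrap all three steps of the increment in an explicit mutex, for example a ReentrantLock:

import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: the three-step increment made atomic by an explicit lock.
class LockedCounter {
    private final ReentrantLock mutex = new ReentrantLock();
    private int count;

    void increment() {
        mutex.lock();
        try {
            int tmp = count; // read
            tmp = tmp + 1;   // modify
            count = tmp;     // write
        } finally {
            mutex.unlock();
        }
    }
}

Without the lock, two threads can both read the same value and one of the increments is lost.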

In short, with lock-based multithreaded programming, you need to make sure that any operation on shared data that is susceptible to race conditions is made atomic by locking and unlocking a mutex. On the bright side, as long as the mutex is locked, you can
perform just about any operation, in confidence that no other thread will trample on your shared state.

It is exactly this "arbitrary"-ness of what you can do while a mutex is locked that's also problematic. You could, for example, read the keyboard or perform some slow I/O operation, which means that you delay any other threads waiting for the same mutex.
Worse, you could decide you want access to some other piece of shared data and attempt to lock its mutex. If another thread has already locked that last mutex and wants access to the first mutex that your thread already holds, both threads hang faster than
you can say "deadlock."

Enter lock-free programming. In lock-free programming, you can't do just about anything atomically. There is only a precious small set of things that you can do atomically, a limitation that makes lock-free programming way harder. (In fact, there must be
around half a dozen lock-free programming experts around the world, and yours truly is not among them. With luck, however, this article will provide you with the basic tools, references, and enthusiasm to help you become one.) The reward of such a scarce framework
is that you can provide much better guarantees about thread progress and the interaction between threads. But what's that "small set of things" that you can do atomically in lock-free programming? In fact, what would be the minimal set of atomic primitives
that would allow implementing any lock-free algorithm—if there's such a set?

If you believe that's a fundamental enough question to award a prize to the answerer, so did others. In 2003, Maurice Herlihy was awarded the Edsger W. Dijkstra Prize in Distributed Computing for his seminal 1991 paper "Wait-Free Synchronization" (see http://www.podc.org/dijkstra/2003.html,
which includes a link to the paper, too). In his tour-de-force paper, Herlihy proves which primitives are good and which are bad for building lock-free data structures. That brought some seemingly hot hardware architectures to instant obsolescence, while clarifying
what synchronization primitives should be implemented in future hardware.

For example, Herlihy's paper gave impossibility results, showing that atomic operations such as test-and-set, swap, fetch-and-add, or even atomic queues (!) are insufficient for properly synchronizing more than two threads. (That's surprising because queues
with atomic push and pop operations would seem to provide quite a powerful abstraction.) On the bright side, Herlihy also gave universality results, proving that some simple constructs are enough for implementing any lock-free algorithm for any number of threads.

The simplest and most popular universal primitive, and the one that I use throughout, is the compare-and-swap (CAS) operation:

template <class T>
bool CAS(T* addr, T expected, T value) {
    if (*addr == expected) {
        *addr = value;
        return true;
    }
    return false;
}


CAS compares the content of a memory address with an expected value, and if the comparison succeeds, replaces the content with a new value. The entire procedure is atomic. Many modern processors implement CAS or equivalent primitives for different bit lengths (the reason for which we've made it a template, assuming an implementation uses metaprogramming to restrict possible Ts). As a rule of thumb, the more bits a CAS can compare-and-swap atomically, the easier it is to implement lock-free data structures with it. Most of today's 32-bit processors implement 64-bit CAS; for example, Intel's assembler calls it CMPXCHG8B (you gotta love those assembler mnemonics).

A Word of Caution

Usually a C++ article is accompanied by C++ code snippets and examples. Ideally, that code is Standard C++, and "Generic<Programming>" strives to live up to that ideal.

When writing about multithreaded code, however, giving Standard C++ code samples is simply impossible. Threads don't exist in Standard C++, and you can't code something that doesn't exist. Therefore, the code for this article is "pseudocode" and not meant
as Standard C++ code for portable compilation. Take memory barriers, for example. Real code would need to be either assembly language translations of the algorithms described herein, or at least sprinkle C++ code with some so-called "memory barriers"—processor-dependent
magic that forces proper ordering of memory reads and writes. I don't want to spread the discussion too thin by explaining memory barriers in addition to lock-free data structures. If you are interested, refer to Butenhof's excellent book [3] or to a short
introduction [6]. For purposes here, I assume that the compiler and the hardware don't introduce funky optimizations (such as eliminating some "redundant" variable reads, a valid optimization under a single-thread assumption). Technically, that's called a
"sequentially consistent" model in which reads and writes are performed and seen in the exact order in which the source code does them [8].

Wait-Free and Lock-Free versus Locked

A "wait-free" procedure can complete in a finite number of steps, regardless of the relative speeds of other threads.

A "lock-free" procedure guarantees progress of at least one of the threads executing the procedure. That means some threads can be delayed arbitrarily, but it is guaranteed that at least one thread makes progress at each step. So the system as a whole always
makes progress, although some threads might make slower progress than others. Lock-based programs can't provide any of the aforementioned guarantees. If any thread is delayed while holding a lock to a mutex, progress cannot be made by threads that wait for
the same mutex; and in the general case, locked algorithms are prey to deadlock—each waits for a mutex locked by the other—and livelock—each tries to dodge the other's locking behavior, just like two dudes in the hallway trying to go past one another but end
up doing that social dance of swinging left and right in synchronicity. We humans are pretty good at ending that with a laugh; processors, however, often enjoy doing it until rebooting sets them apart.

Wait-free and lock-free algorithms enjoy more advantages derived from their definitions:

Thread-killing Immunity: Any thread forcefully killed in the system won't delay other threads.
Signal Immunity: The C and C++ Standards prohibit signals or asynchronous interrupts from calling many system routines such as malloc. If an interrupt handler calls malloc at the same time as the interrupted thread, that could cause deadlock. With lock-free routines, there's no such problem anymore: Threads can freely interleave execution.
Priority Inversion Immunity: Priority inversion occurs when a low-priority thread holds a lock to a mutex needed by a high-priority thread. Such tricky conflicts must be resolved by the OS kernel. Wait-free and lock-free algorithms are immune to such problems.

A Lock-Free WRRM Map

Column writing offers the perk of defining acronyms, so let's define WRRM (Write Rarely Read Many) maps as maps that are read a lot more than they are mutated. Examples include object factories [1], many instances of the Observer design pattern [5], mappings
of currency names to exchange rates that are looked up many, many times but are updated only by a comparatively slow stream, and various other look-up tables.

WRRM maps can be implemented via std::map or the post-standard unordered_map (http://www.open-std.org/jtcl/sc22/wg21/docs/papers/2004/n1647.pdf), but as I argue in Modern C++ Design, assoc_vector (a sorted vector of pairs) is a good candidate for WRRM maps because it trades update speed for lookup speed. Whatever structure is used, our lock-free aspect is orthogonal to it; let's just call the back-end Map<Key, Value>. Also, for the purposes of this article, iteration is irrelevant—maps are only tables that provide a means to look up a key or update a key-value pair.

To recap how a locking implementation would look, let's combine a Map object with a Mutex object like so:

// A locking implementation of WRRMMap
template <class K, class V>
class WRRMMap {
    Mutex mtx_;
    Map<K, V> map_;
public:
    V Lookup(const K& k) {
        Lock lock(mtx_);
        return map_[k];
    }
    void Update(const K& k, const V& v) {
        Lock lock(mtx_);
        map_[k] = v;
    }
};


To avoid ownership issues and dangling references (that could bite us harder later), Lookup returns its result by value. Rock-solid—but at a cost. Every lookup locks/unlocks the Mutex, although (1) parallel lookups don't
need to interlock, and (2) by the spec, Update is much less often called than Lookup. Ouch! Let's now try to provide a better WRRMMap implementation.

Garbage Collector, Where Art Thou?

The first shot at implementing a lock-free WRRMMap rests on this idea:

Reads have no locking at all.
Updates make a copy of the entire map, update the copy, and then try to CAS it with the old map. While the CAS operation does not succeed, the copy/update/CAS process is tried again in a loop.
Because CAS is limited in how many bytes it can swap, WRRMMap stores the Map as a pointer and not as a direct member of WRRMMap.

// 1st lock-free implementation of WRRMMap
// Works only if you have GC
template <class K, class V>
class WRRMMap {
    Map<K, V>* pMap_;
public:
    V Lookup(const K& k) {
        // Look, ma, no lock
        return (*pMap_)[k];
    }
    void Update(const K& k, const V& v) {
        Map<K, V>* pOld = 0; // declared outside the loop so the CAS can see it
        Map<K, V>* pNew = 0;
        do {
            pOld = pMap_;
            delete pNew;
            pNew = new Map<K, V>(*pOld); // copy the current map...
            (*pNew)[k] = v;              // ...and update the copy
        } while (!CAS(&pMap_, pOld, pNew));
        // DON'T delete pMap_;
    }
};


It works! In a loop, the Update routine makes a full-blown copy of the map, adds the new entry to it, and then attempts to swap the pointers. It is important to do CAS and not a simple assignment; otherwise, the following
sequence of events could corrupt the map:

Thread A copies the map.
Thread B copies the map as well and adds an entry.
Thread A adds some other entry.
Thread A replaces the map with its version of the map—a version that does not contain whatever B added.

With CAS, things work pretty neatly because each thread says something like, "assuming the map hasn't changed since I last looked at it, copy it. Otherwise, start all over again."

This makes Update lock-free but not wait-free, by my definitions. If many threads call Update concurrently, any particular thread might loop indefinitely, but at all times some thread will be guaranteed to update the structure
successfully, thus global progress is being made at each step. Luckily, Lookup is wait-free.

In a garbage-collected environment, we'd be done, and this article would end on an upbeat note. Without garbage collection, however, there is much pain to come. This is because you cannot simply dispose of the old pMap_ willy-nilly; what
if, just as you are trying to delete it, many other threads are frantically looking for things inside pMap_ via the Lookup function? You see, a garbage collector would have access to all threads' data and private stacks; it
would have a good perspective on when the unused pMap_ pointers aren't perused anymore, and would nicely scavenge them. Without a garbage collector, things get harder. Much harder, actually, and it turns out that deterministic memory freeing
is quite a fundamental problem in lock-free data structures.
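
It is worth noting, from the Java angle of this post, that the JVM is exactly such a garbage-collected environment, so the first lock-free implementation carries over directly. The sketch below is my own rough Java counterpart (the class name WrrmMap and the choice of AtomicReference plus HashMap are mine, not the article's):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative Java counterpart of the GC-dependent lock-free WRRMMap above.
class WrrmMap<K, V> {
    private final AtomicReference<Map<K, V>> ref =
            new AtomicReference<Map<K, V>>(new HashMap<K, V>());

    // Lookup: one volatile read of the reference, then a read-only map access.
    V lookup(K key) {
        return ref.get().get(key);
    }

    // Update: copy the map, modify the copy, CAS it in; retry if another updater won.
    void update(K key, V value) {
        while (true) {
            Map<K, V> old = ref.get();
            Map<K, V> copy = new HashMap<>(old);
            copy.put(key, value);
            if (ref.compareAndSet(old, copy)) {
                return;
            }
        }
    }
}

Readers only perform a volatile read of the reference, and old map copies are reclaimed by the garbage collector once no reader can still reach them, which is precisely the role the article wishes a collector would play.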

Write-Locked WRRM Maps

To understand the viciousness of our adversary, it is instructive to try a classic reference-counting implementation and see where it fails. So, think of associating a reference count with the pointer to map, and have WRRMMap store a pointer
to the thusly formed structure:

template <class K, class V>
class WRRMMap {
    typedef std::pair<Map<K, V>*, unsigned> Data;
    Data* pData_;
    ...
};


Sweet. Now, Lookup increments pData_->second, searches through the map all it wants, then decrements pData_->second. When the reference count hits zero, pData_->first can be deleted, and then so can pData_ itself. Sounds foolproof, except...except it's "foolish" (or whatever the antonym to "foolproof" is). Imagine that right at the time some thread notices the refcount is zero and proceeds on deleting pData_,
another thread...no, better: A bazillion threads have just loaded the moribund pData_ and are about to read through it! No matter how smart a scheme is, it hits this fundamental Catch-22—to read the pointer to the data, you need to increment
a reference count; but the counter must be part of the data itself, so it can't be read without accessing the pointer first. It's like an electric fence that has the turn-off button up on top of it: To safely climb the fence you need to disable it first, but
to disable it you need to climb it.

So let's think of other ways to delete the old map properly. One solution would be to wait, then delete. You see, the old pMap_ objects will be looked up by fewer and fewer threads as processor eons (milliseconds) go by; this is because new lookups use the new maps; as soon as the lookups that were active just before the CAS finish, the pMap_ is ready to go to Hades. Therefore, a solution would be to queue up old pMap_ values
to some "boa serpent" thread that, in a loop, sleeps for, say, 200 milliseconds, wakes up and deletes the least recent map, and then goes back to sleep for digestion.

This is not a theoretically safe solution (although it practically could well be within bounds). One nasty thing is that if, for whatever reason, a lookup thread is delayed, the boa serpent thread can delete the map under that thread's feet.
This could be solved by always assigning the boa serpent thread a lower priority than any other's, but as a whole, the solution has a stench that is hard to remove. If you agree with me that it's hard to defend this technique with a straight face, let's move
on.

Other solutions [4] rely on an extended DCAS atomic instruction, which is able to compare-and-swap two noncontiguous words in memory:

template <class T1, class T2>
bool DCAS(T1* p1, T2* p2,
          T1 e1, T2 e2,
          T1 v1, T2 v2) {
    if (*p1 == e1 && *p2 == e2) {
        *p1 = v1; *p2 = v2;
        return true;
    }
    return false;
}


Naturally, the two locations would be the pointer and the reference count itself. DCAS has been implemented (very inefficiently) by the Motorola 68040 processors, but not by other processors. Because of that, DCAS-based
solutions are considered of primarily theoretical value.

The first shot at a solution with deterministic destruction is to rely on the less-demanding CAS2. Again, many 32-bit machines implement a 64-bit CAS, often dubbed as CAS2. (Because it only operates on contiguous
words, CAS2 is obviously less powerful than DCAS.) For starters, let's store the reference count next to the pointer that it guards:

template <class K, class V>
class WRRMMap {
    typedef std::pair<Map<K, V>*, unsigned> Data;
    Data data_;
    ...
};


(Notice that this time the count sits next to the pointer that it protects, a setup that eliminates the Catch-22 problem mentioned earlier. You'll see the cost of this setup in a minute.)

Then, let's modify Lookup to increment the reference count before accessing the map, and decrement it after. In the following code snippets, I ignore exception safety issues (which can be taken care of with standard techniques) for the sake
of brevity.

V Lookup(const K& k) {
    Data old;
    Data fresh;
    do {
        old = data_;
        fresh = old;
        ++fresh.second;
    } while (!CAS(&data_, old, fresh)); // retry until the increment is published
    V temp = (*fresh.first)[k];
    do {
        old = data_;
        fresh = old;
        --fresh.second;
    } while (!CAS(&data_, old, fresh)); // retry until the decrement is published
    return temp;
}


Finally, Update replaces the map with a new one—but only in the window of opportunity when the reference count is 1.

void Update(const K& k, const V& v) {
    Data old;
    Data fresh;
    old.second = 1;
    fresh.first = 0;
    fresh.second = 1;
    Map<K, V>* last = 0;
    do {
        old.first = data_.first;
        if (last != old.first) {
            delete fresh.first;
            fresh.first = new Map<K, V>(*old.first);
            fresh.first->insert(make_pair(k, v));
            last = old.first;
        }
    } while (!CAS(&data_, old, fresh));
    delete old.first; // whew
}


Here's how Update works. It defines the now-familiar old and fresh variables. But this time old.second (the count) is never assigned from data_.second; it is always 1. This means that Update loops until it has a window of opportunity to replace a pointer with a counter of 1, with another pointer having a counter of 1. In plain English, the loop says "I'll replace the old map with a new, updated one, and I'll be on the lookout for any other updates of the map, but I'll only do the replacement when the reference count of the existing map is one." The variable last and its associated code are only an optimization: Avoid rebuilding the map over and over again if the old map hasn't been replaced (only the count).

Neat, huh? Not that much. Update is now locked: It needs to wait for all Lookups to finish before it has a chance to update the map. Gone with the wind are all the nice properties of lock-free data structures. In particular,
it is easy to starve Update to death: Just look up the map at a high-enough rate—and the reference count never goes down to 1. So what you really have so far is not a WRRM (Write-Rarely-Read-Many) map, but a WRRMBNTM (Write-Rarely-Read-Many-But-Not-Too-Many)
one instead.

Conclusion

Lock-free data structures are promising. They exhibit good properties with regards to thread killing, priority inversion, and signal safety. They never deadlock or livelock. In tests, recent lock-free data structures surpass their locked counterparts by
a large margin [9]. However, lock-free programming is tricky, especially with regards to memory deallocation. A garbage-collected environment is a plus because it has the means to stop and inspect all threads, but if you want deterministic destruction, you
need special support from the hardware or the memory allocator. In the next installment of "Generic<Programming>," I'll look into ways to optimize WRRMMap such that it stays lock-free while performing deterministic destruction.

And if this installment's garbage-collected map and WRRMBNTM map dissatisfied you, here's a money saving tip: Don't go watch the movie Alien vs. Predator, unless you like "so bad it's funny" movies.

Acknowledgments

David B. Held, Michael Maged, Larry Evans, and Scott Meyers provided very helpful feedback. Also, Michael graciously agreed to coauthor the next "Generic<Programming>" installment, which will greatly improve on our WRRMMap implementation.