RSS 2.0

Personal Info:

Joe Send mail to the author(s) is a lead architect on an OS incubation project at Microsoft, and was the architect for Parallel Extensions to .NET. He is an author and frequent speaker.

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2010, Joe Duffy

 
Searched for : plinq

That immutability facilitates increased degrees of concurrency is an oft-cited dictum. But is it true? And either way, why?

My view on this matter may be a controversial one. Immutability is an important foundational tool in the toolkit for building concurrent – in addition to reliable and predictable – software. But it is not the only one that matters. Making all your data immutable isn’t going to instantly lead to a massively scalable program. Natural isolation is also critically important, perhaps more so. And, as it turns out, sometimes mutability is just what the doctor ordered, as with large-scale data parallelism.

Isolation first; immutability second; synchronization last

Stepping back for a moment, the recipe for concurrency is rather simple. Say you’ve got multiple concurrent pieces of work running simultaneously (or have a goal of getting there); for discussion’s sake, call them tasks. Take two tasks. The first critical decision has two cases: either these tasks concurrently access overlapping data in shared-memory, or they do not. If they do not, they are isolated, and no precautions associated with racing memory updates are needed. If they do share data, on the other hand, then something else must give. If all concurrently accessed data is immutable, or all functions used to interact with data are pure, then dangerous concurrency hazards are avoided. All is well. If some data is mutable, however, then this is where things get tricky, and higher-level synchronization is needed to make accesses safe. This decision tree is straightforward and clear.

I have listed those four attributes – isolated, immutable and pure, and synchronization – in a very intentional order. Thankfully, this order mirrors the natural top-down hierarchical architecture of most modern object- and component-based programs: we have large containers that communicate through well-defined interfaces, each comprised of layers of such containers, and somewhere towards the leaves, a fair amount of intimate commingling of knowledge regarding data and invariants.

This order also reflects the order of complexity and execution-time costs, from least to most. Isolation is simple, because components depend on each other in loosely-coupled ways, and in fact scales superiorly in a concurrent program because no synchronization is necessary, the “right” data structure may be chosen for the job – immutable or not – and locality is part-and-parcel to the architecture. Immutability at least avoids the morass of synchronization, which can affect programs immensely in complexity, runtime overheads, and write-contention for shared data. It is clear that synchronization is something to avoid at all costs, particularly anything done in an ad-hoc manner like locks.

Making the concurrency

But where did all this concurrency come from, anyway?

It came from two things:

  1. The coarse-grained breakdown of a program into isolated pieces.
  2. The fine-grained data parallelism.

On #1: Program fragments that are isolated are already half-way down the road to running concurrently as tasks. The second half of this journey, of course, is teaching them to interact with one another asynchronously, most frequently through message-passing or by sticking them into a pipeline. The details of course depend on what programming language you are using. It may be through agents, actors, active objects, COM objects, EJBs, CCR receivers, web-services, something ad-hoc built with .NET tasks, or some other reification. Nevertheless the isolation is common to all these.

On #2: Data parallelism, it turns out, often works best with mutable data structures. These structures must be partitionable, of course, so that tasks comprising the data parallel operations may operate with logically isolated chunks of this data safely, even if they are parts of the same physical data structure. So chunks of them are isolated even though they don’t appear to be. This is trivially achievable with many important parallel-friendly data structures like arrays, vectors, and matrices. Capturing this isolation in the type system is of course no small task, though region typing gets close (see UIUC’s Data Parallel Java).

But you usually don’t want these structures to be immutable, because they can be modified in constant-time and space if they are their classic simple mutable forms. Programmers doing HPC-style data-parallelism a la FORTRAN, vectorization, and GPGPU know this quite well. Compare this a world where we are doing data-parallelism over immutable data structures, where modifications often necessitate allocations or more complicated big-oh times due to clever techniques meant to avoid such allocations, as with persistent immutable data structures. This is likely less ideal. It is true that some data parallel operations are not in-place against mutable data – as with PLINQ – at which point purity, but not immutability, is key. The two are related but not identical: immutability pervades the construction of data structures, whereas purity pervades the construction of functions. But if you can get by with one copy of the data, why not do it? Particularly since most datasets amenable to parallel speedups are quite large.

Immutability: the bricks, not the mortar

Notice that the concurrency did not actually come from immutable data structures in either case, however. So what are they good for?

One obvious use, which has little to do with concurrency, is to enforce characteristics of particular data structures in a program. A translation lookup table may not have been meant to be written to except for initialization time, and using an immutable data structure is a wonderful way to enforce this intent.

What about concurrency? Immutable data structures facilitate sharing data amongst otherwise isolated tasks in an efficient zero-copy manner. No synchronization necessary. This is the real payoff.

For example, say we’ve got a document-editor and would like to launch a background task that does spellchecking in parallel. How will the spellchecker concurrently access the document, given that the user may continue editing it simultaneously? Likely we will use an immutable data structure to hold some interesting document state, such as storing text in a piece-table. OneNote, Visual Studio, and many other document-editors use this technique. This is zero-cost snapshot isolation.

Not having immutability in this particular scenario is immensely painful. Isolation won’t work very well. You could model the document as a task, and require the spellchecker to interact with it using messages. Chattiness would be a concern. And, worse, the spellchecker’s messages may now interleave with other messages, like a user editing the document. Those kinds of message-passing races are non-trivial to deal with. Synchronization won’t work well either. Clearly we don’t want to lock the user out of editing his or her document just because spellchecking is occurring. Such a boneheaded design is what leads to spinning donuts, bleached-white screens, and “(Not Responding)” title bars. But clearly we don’t want to acquire a lock and then make a full copy of the entire document. Perhaps we’d try to copy just what is visible on the screen. This is a dangerous game to play.

Immutability does not solve all of the problems in this scenario, however. Snapshots of any kind lead to a subtle issue that is familiar to those with experience doing multimaster, in which multiple parties have conflicting views on what “the” data ought to be, and in which these views must be reconciled.

In this particular case, the spellchecker sends the results back to the task which spawned it, and presumably owns the document, when it has finished checking some portion of the document. Because the spellchecker was working with an immutable snapshot, however, its answer may now be out-of-date. We have turned the need to deal with message-level interleaving – as described above – into the need to deal with all of the messages that may have interleaved within a window of time. This is where multimaster techniques, such as diffing and merging come into play. Other techniques can be used, of course, like cancelling and ignoring out-of-date results. But it is clear something intentional must be done.

In conclusion

It is safe to say that immutability facilitates important concurrent architectures and algorithms. It can really help big time, for sure. But it is clearly no panacea. Whether mutability or immutability is the right choice for a particular data structure in your program, as with all things, depends.

It could be the case that choosing a piece-table for storing your text facilitates large-scales of concurrency in version two of your software application, but that in version one you have no use for it. Making that call ahead of time may pay in spades down the road, even if it comes at a marginal cost up-front. Or it could be that choosing an immutable data structure costs you in time and space, and you never end up exploiting the fact that you could have shared that particular structure in a zero-cost way across agents in your program.

One thing’s for sure: I’m glad to be programming in languages like C#, F#, Clojure, and Scala, where I’ve got a choice.

7/11/2010 8:18:30 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [6]

Rewind the clock to mid-2004.  Around this time awareness about the looming “concurrency sea change” was rapidly growing industry-wide, and indeed within Microsoft an increasing number of people – myself included – began to focus on how the .NET Framework, CLR, and Visual C++ environments could better accommodate first class concurrent programming.  Of course, our colleagues in MSR and researchers in the industry more generally had many years’ head start on us, in some cases dating back 3+decades.  It is safe to say that there was no shortage of prior art to understand and learn from.

One piece of prior art was particularly influential on our thoughts: software transactional memory.  (STM, or, in short just TM.)  In fact, right around that time, Tim Harris’s TM work grew in notoriety (my first exposure arriving by way of OOPSLA’03’s proceedings, which contained the “Language Support for Lightweight Transactions” paper).  TM was immediately fascinating, and simultaneously promising.  For a number of reasons:

  • TM hid sophisticated synchronization mechanisms under a simple veil.
  • It could be implemented using sophisticated (and scalable) techniques, again under a simple veil.
  • It built on decades of experience in building scalable and parallel transactional databases.
  • Among others.  But most of all, it was a bright shiny light in a sea of complexity.
  • And how fortunate: Tim was a colleague in our neighboring MSR Cambridge offices (and still is).

In a nutshell, TM offered declarative concurrency-safety.  You declare what you’d like in as few simple words as possible, and you get what you want.  In this case, those simple words are ‘atomic { S; }’.

Many people latched onto TM rapidly and simultaneously, both inside and outside of Microsoft.  I hacked together a little prototype built atop SSCLI (“Rotor”), and another architect on our team built an even more feature-rich prototype using MSIL rewriting.  We compared notes, began jointly exploring the design space, and talking more regularly with other colleagues like Tim in MSR.  Soon thereafter we kicked off a small working group with about a dozen architects and researchers from around the company, aiming to articulate what a real productized TM might look like.  Fun times.

We were eventually given the OK for an official “incubation” project, and multiple years’ of exploration and hard work ensued.  In fact, the fruits of a team of many’s labor recently got released in the form of a Community Technology Preview -- a good conduit for experimentation, but with no commitment to add it to any of Microsoft’s products.  To be clear, I had only a small part to play in this ambitious project, and mostly towards the start.  Partway through, I stepped away to do PLINQ and Parallel Extensions to .NET, both of which are now part of the .NET Framework 4.0.  Dozens of amazing people played a significant role in the project over the years.  But I am getting way ahead of myself…

I’ve been away from the nitty-gritty day-to-day details of TM for about 3 years now, which feels sufficiently long to develop a healthy perspective on the project.  So here it is.  What follows is of course in no way Microsoft’s “official position” on the technology, but rather my own personal one.  I’ve interspersed generalizations with specific details because that’s just how my brain thinks about TM.

Towards the North Star

A wondrous property of concurrent programming is the sheer number and diversity of programming models developed over the years.  Actors, message-passing, data parallel, auto-vectorization, ...; the titles roll off the tongue, and yet none dominates and pervades.  In fact, concurrent programming is a multi-dimensional space with a vast number of worthy points along its many axes.

This rich history is simultaneously a blessing and a daunting curse.  But in any case can make for some very interesting multi-year-long immersion.  My UW talk from 1 1/2 years ago just barely touches on the sheer breadth.

TM’s greatest virtue is the first word in its name: transactional.  It turns out that, no matter your concurrent programming model du jour, three fundamental concepts crop up again and again: isolation (of state), atomicity (of state transitions), and consistency (of those atomic transitions).  We use locks in shared-memory programming, coarse grained messages in message-passing, and functional programming to achieve all of these things in different ways.  Transactions are another such mechanism, sure, but more than that, transactions are an all-encompassing way of thinking about how programs behave at their most fundamental core.  Transaction is a religion.

Not everybody believes this, and of course why would they: it is an immensely subjective and qualitative statement.  Some will claim that models like message passing entirely avoid the likes of “race conditions,” and such, but this is clearly false: state transitions are made, complicated state invariants are erected amongst a sea of smaller isolated states, and care must be taken, just as in shared memory.  Even Argus, a beautiful early incarnation of message-passing (via promises) demands that messages are atomic in nature.  This property is not checked and, if done improperly, leads to “races in the large.”  Even Argus introduced the notion of transactions and persistence in the form of guardians.

Of course, message passing helps push you in the right direction.  It is not, however, a panacea.

I was reading my ICFP proceedings recently and was reminded of research done in the context of Erlang that supports this assertion.  In it, they apply CHESS-like techniques (with clever search space culling) to find race conditions.  Indeed we use similar techniques very successfully for our message-passing programming models on my team here at Microsoft.

Transactions are terrific because they are “automatic”.  You declare the boundaries, and the transactional machinery takes care of the rest.  This is true of databases and also TM.  Countless developers in the wild write massively concurrent programs by issuing operations against databases: they can do this so easily because they grok the simple façade that transactions provide.  Numerous server-side state-based applications use transactions to shield programmers from the pitfalls of concurrency.  Behold MSDTC.  The bet we were making is that similar models would scale down just as well “in the small”.

The canonical syntactic footprint of TM is also beautiful and simple.  You say:

atomic {

    … concurrency-safe code goes here …

}

And everything in that block is magically concurrency-safe.  (Well, you still need to ensure the consistency part, but isolation and atomicity are built-in.  Mix this with Eiffel- or Spec#-style contracts and assertions like those in .NET 4.0, run at the end of each transaction, and you’re well on your way to verified consistency also.  The ‘check E’ work in Haskell was right along these lines.)  You can read and write memory locations, call other methods, all without worrying about whether concurrency-safety will be at risk.

For example, consider three transactions running concurrently:

int x = 0, y = 0, z = 0;

 

atomic {      atomic {      atomic {

    x++;          y++;          z++;

}                 x++;          y++;

              }                 x++;

                           }

No matter the order in which these run, the end result will be x == 3, y == 2, z == 1.

Contrast this elegant simplicity with the many pitfalls of locks:

  • Data races.  Like forgetting to hold a lock when accessing a certain piece of data.  And other flavors of data races, such as holding the wrong lock when accessing a certain piece of data.  Not only do these issues not exist, but the solution is not to add countless annotations associating locks with the data they protect; instead, you declare the scope of atomicity, and the rest is automatic.
  • Reentrancy.  Locks don’t compose.  Reentrancy and true recursive acquires are blurred together.  If a locked region expects reentrancy, usually due to planned recursion, life is good; if it doesn’t, life is bad.  This often manifests as virtual calls that reenter the calling subsystem while invariants remain broken due to a partial state transition.  At that point, you’re hosed.
  • Performance.  The tension between fine-grained locking (better scalability) versus coarse-grained locking (simplicity and superior performance due to fewer lock acquire/release calls) is ever-present.  This tension tugs on the cords of correctness, because if a lock is not held for long enough, other threads may be able to access data while invariants are still broken.  Scalability pulls you to engage in a delicate tip-toe right up to the edge of the cliff.
  • Deadlocks.  This one needs no explanation.

In a nutshell, locks are not declarative.  Not even close.  They are not associated with the data protected by those locks, but rather the code that accesses said data.  (For example: in the above code snippet, do we need three locks?  Or one?  Or …?  Imagine we choose three: one for each variable, x, y, and z.  What if we increment z, release its associated lock, and some other thread can now see the newly incremented z before the y and x get incremented.  Whether this is acceptable depends on the program.)  Sure, you can achieve atomicity and isolation, but only by intimately reasoning about your code by understanding the way they are implemented.  And if you care about performance, you are also going to need to think about hardware esoterica such as CMPXCHG, spin waiting, cache contention, optimistic techniques with version counters and memory models, ABA, and so on.

The contrast is stark.  Atomic-block-style transactions provide automatic serializability of whole regions of code, no matter what that code does, and the TM infrastructure does the rest, choosing between: optimistic, pessimistic, coarse, fine, etc.  The linearization point of a transaction is clear: the end of the atomic block.  TM can even adjust strategies based on the surrounding environment: hardware, dynamic program behavior, etc.  (“Policy”.)  In comparison to locks, TM is an order of magnitude simpler.  There have even been studies whose conclusions support this assertion.

(Transactions unfortunately do not address one other issue, which turns out to be the most fundamental of all: sharing.  Indeed, TM is insufficient – indeed, even dangerous – on its own is because it makes it very easy to share data and access it from multiple threads; first class isolation is far more important to achieve than faux isolation.  This is perhaps one major difference between client and server transactions.  Most server sessions are naturally isolated, and so transactions only interfere around the edges.  I’m showing my cards too early, and will return to this point much, much later in this essay.)

TM also has the attractive quality of automatic rollback of partial state updates.  (How did I get this far without discussing rollback?)  Concurrency aside, this avoids needing to write backout code to run in the face unhandled exceptions.  In retrospect this capability alone is almost enough to justify TM in limited quantities.  Reams of code “out there” contain brittle, untested, and, therefore, incorrect error handling code.  We have seen such code lead to problems ranging in severity: reliability issues leading to data loss, security exploits, etc.  Were we to replace all those try/catch/rethrow blocks of code with transactions, we could do away with this error prone spaghetti.  We’d also eliminate try/filter exploits thanks to Windows/Win32 2-pass SEH.  Sometimes I wish we focused on this simple step forward, forgot about concurrency-safety, and baby stepped our way forward.  Likely it wouldn’t have been enough, but I still wonder to this day.

We also toyed with the ability to replace reliability-oriented CER blocks with transactions.  As you go through a transaction, there is a log of forward progress and how to undo it.  So no matter the kind of failure, including OOM, you can rollback the partial state updates with zero allocation required.

At some point we began describing an ‘atomic’ block as though the program used a single global lock for all its concurrency operations.  This would be grossly inefficient, of course, and fails to capture the precise isolation and rollback properties, but nevertheless conveys the basic idea.  It also, as an aside, foreshadows a few of the difficult problems that lie ahead, namely strong vs. weak atomicity.  Even though there is only one, if you forget to hold this one global lock while accessing shared data, you’ve still got a data race on your hands.  This model won’t save you.  We will return to this later on.

Tough Decisions: Life as a Starving Artist

We faced some programming model decisions requiring artistic license early on.

One that we quickly decided was whether to automatically roll back a transaction in response to an unhandled exception thrown from within.  Such as with this code:

atomic {

    x++;

    if (p)

        throw new Exception(“Whoops”);

}

If p evaluates to true, and hence an unhandled exception thrown, should that x++ be rolled back?

Most on the team said “Yes” as a gut reaction, whereas some argued we should require the programmer to catch-and-rollback by hand.  We settled on the automatic approach because it seemed to do what you would expect in all the cases we looked at.  Your transaction failed to complete normally and consistently.  We also debated whether to support a unilateral “Transaction.Abort()” capability; while we agreed a “Transaction.Commit()” would be silly – the only way to commit a transaction being to reach its end non-exceptionally – the jury remained split on unilateral abort.  We eventually found that, particularly when nesting is involved, the ability to detect a dire problem with the universe and bail unilaterally can be useful.

And we also hit some tough snags early on.  Some were trivial, like what happens when an exception is thrown out of an atomic block.  Of course that exception was likely constructed within the atomic block (‘throw new SomeException()’ being the most common form of ‘throw’), so we decided we probably need to smuggle at least some of that exception’s state out of the transaction.  Like its stack trace.  And perhaps its message string.  I wrote the initial incarnation of the CLR exception subsystem support, and stopped at shallow serialization across the boundary.  But this was a slippery slope, and eventually the general problem was seen, leading to more generalized nesting models (which I shall describe briefly below).  Another snag, which was quite non-trivial, was the impact to the debugging experience.  Depending on various implementation choices – like in-place versus buffered writes – you may need to teach the debugger about TM intrinsically.  And some of the challenges were fundamental to building a first class TM implementation.  Clearly the GC needed to know about TM and its logging, because it needs to keep both the “before” and “after” state of the transaction alive, in case it needed to roll back.  The JIT compiler was very quickly splayed open and on the surgery table.  And so on.

Throughout, it became abundantly clear that TM, much like generics, was a systemic and platform-wide technology shift.  It didn’t require type theory, but the road ahead sure wasn’t going to be easy.

So we knocked down many early snags, and kept plowing forward, eagerly and excitedly.  None of these challenges were insurmountable.  We remained hopeful and happy (perhaps even blissful) to continue exploring the space of possible solutions.  More irksome snags lurked right around the corner, however.  And little did we know that some decisions we were about to make would subject us to some of the biggest such snags.  TM’s greatest feature – slap an atomic around a block of code and it just gets better – would turn out to be its greatest challenge… but alas, I am again jumping ahead; more on that later.

Turtles, but How Far Down?  Or, Bounded vs. Unbounded Transactions

Not all transactions are equal.  There is a broad spectrum of TMs, ranging from those that are bounded to updating, say, 4 memory locations or cache lines, to those that are entirely unbounded.  Indeed TM blurs together with other hardware-accelerated synchronization techniques, like speculative lock elision (SLE).  The more constrained TM models are often hardware-hybrids, and the limitations imposed are typically due to physical hardware constraints.  Models can be pulled along other axes, however, such as whether memory locations must be tagged in order to be used in a transaction or not, etc.  Haskell requires this tagging (via TVars) so that side-effects are evident in the type system as with any other kind of monad.

We quickly settled on unbounded transactions.  Everything else looked like multi-word CAS and, although we knew multi-word CAS would be immensely useful for developing new lock-free algorithms, our aim was to build something radically new and with broader appeal.  If we ended up with a hardware-hybrid, we would expect the software to pick up the slack; you’d get nice acceleration within the hardware constraints, and then “fall off the silent cliff” to software emulation thereafter.  Thus the unbounded approach was chosen.

In hindsight, this was a critical decision that had far-reaching implications.  And to be honest, I now frequently doubt that it was the right call.  We had our hearts in the right places, and the entire industry was trekking down the same path at the same time (with the notable exception of Haskell).  But with the wisdom of age and hindsight, I do believe limited forms of TM could be wildly successful at particular tasks and yet would have avoided many of the biggest challenges with unbounded TM.

And believe me: many such challenges arose in the ensuing months.

An example of one challenge that didn’t threaten the model of TM per se, but sure did make our lives more difficult, is the compilation strategy we were forced to adopt.  Transactions cost something.  To transact a read or write entails a non-trivial amount of extra work; we spent a lot of time optimizing away redundant work, and developing new optimizations that reduced the overhead of TM.  But at the end of the day, the cost is not zero – and in fact, the common case is far from it.  Imagine you have an unbounded transaction model and are faced with compiling a particular method from MSIL to native code.  A simple separate-module -based compiler (i.e. not whole-program) will not necessarily know whether this method will get called from a transaction, or from non-transactional code, such that in the worst case the method must be prepared for transactional access.  There are a variety of techniques to use to produce code that supports both: the two extremes are (1) cloning, or (2) sharing w/ conditional dynamic checking.  Neither extreme is particularly attractive, and this choice represents a classic space-time tradeoff that entails finding a reasonable middle ground.  A JIT compiler can dynamically produce the version that is needed at the moment, but offline compilers – like the CLR’s NGEN – do not have this luxury.  And within Microsoft at least, and among shrink-wrap ISVs, offline compilation is of greater importance than JIT compilation.  For better or for worse.

The model of unbounded transactions is the hard part.  You surround any arbitrary code with ‘atomic { … }’ and everything just works.  It sounds beautiful.  But just think about what can happen within a transaction: memory operations, calls to other methods, P/Invokes to Win32, COM Interop, allocation of finalizable objects, I/O calls (disk, network, databases, console, …), GUI operations, lock acquisitions, CAS operations, …, the list goes on and on.  Versus bounded transactions, where we could say something like: if you do more than N things, the transaction will fail to run – deterministically.

Unbounded really was the golden nugget.   But we should not be shy about what this decision implies.

Implementing the Idea

This leads me to a brief tangent on implementation.  Given that we didn’t implement TM with a single global lock, as the naïve mental model above suggests, you may wonder how we actually did do it.  Three main approaches were seriously considered:

  • IL rewriting.  Use a tool that passes over the IL post-compilation to inject transaction calls.
  • Hook the (JIT) compiler.  The runtime itself would intrinsically know how to inject such calls.
  • Library-based.  All transactional operations would be explicit calls into the TM infrastructure.

Approaches #1 and #2 would look similar, but the latter would be quite different.  Instead of:

atomic {

    x++;

}

Or:

Atomic.Run(() => {

    x++;

});

You might say something like:

Atomic.Run(() => {

    Atomic.Write(Atomic.Read(ref x) + 1);

});

With enough language work, we could have tried to desugar the latter into the former, but when you start crossing method boundaries, everything gets more complicated.  (Do you create transactional clones of every method, and rewrite calls from ordinary methods to the transactional clone?  This is easy to do with a rewriter or compiler, but quite difficult with a pure library approach.)   We also knew we’d need to do some very sophisticated compiler optimizations to get TM’s performance to the point of acceptable.  So we chose approach #2 for our “real” prototype, and never looked back.

After this architectural approach was decided, a vast array of interesting implementation choices remained.

We moved on to building the primitive library with all the TM APIs that the JIT would introduce calls into.  We quickly settled an approach much like Harris’s (and, at the time, pretty much the industry/research standard): optimistic reads, in-place pessimistic writes, and automatic retry.  That means reads do not acquire locks of any sort, and instead, once the end of the transaction has been reached, all reads are validated; if any locations read have been modified concurrently (or an uncommitted value was read), the whole transaction is thrown away and reattempted from the start.  Writes work like locks.  This approach makes reads cheap: a single read consists of reading the value, and a version number whose address is at a statically known offset.  No interlockeds.  This is great since reads typically far outnumber writes.  Down the line, we explored adding more sophisticated policy than this, which I will detail in brief below.

So the compiler would inject hooks for the above code like so:

while (true) {

    TX tx = new TX();

    try {

        // x++;

        tx.OpenReadOptimistic(ref x);

        int tmp = x;

        tx.OpenWritePessimistic(ref x);

        x = (tmp + 1);

 

        if (!tx.Validate())

            continue;

 

        tx.Commit();

    }

    catch {

        tx.Rollback();

        throw;

    }

}

Notice there are some obvious overheads in here:

  • The atomic block becomes a loop (to support automated retry).
  • A new transaction must be allocated and likely placed in TLS (if methods are called).
  • A try/catch block is used to initiate rollback on unhandled exceptions.
  • Each unique location read in a block requires at least one call to OpenReadOptimistic.
  • Each unique location written requires at least one call to OpenWritePessimistic.
  • Each location read must be validated (at Validate), and finally the transaction is committed (at Commit).

Much of the work in the compiler was meant to reduce these overheads.  For example, if the same location is read multiple times, there’s no need to call OpenReadOptimistic more than once.  If the compiler can statically detect this, it may elide some of the calls.  If the same location is read and then written – as in the above example – only the write lock must be acquired.  If no methods are called, the transaction object can be enregistered, and we needn’t add it to TLS so long as the exception trap code knows how to move it from register to TLS on demand.  Et cetera.

There are other overheads that are not so obvious.  Optimistic reads mandate that there is a version number for each location somewhere, and pessimistic writes mandate that there is a lock for each location somewhere.

A straightforward technique is to use a hashing scheme to associate locations with this auxiliary data: each address is hashed to index into a table of version numbers and locks.  This leads to false sharing, of course, but reduces space overhead and makes lookup fast.  Unfortunately, in a garbage collected environment, addresses are not stable and therefore hashing becomes complicated.  You can use object hash codes for this purpose, but .NET hash codes are overridable; and generating them is not nearly as cheap as using the memory location’s address, which by definition is already in-hand.  Other alternatives of course exist.  You can associate version numbers and locks with the objects themselves, just like monitors and object headers/sync-blocks in the CLR: this provides object-granularity locking.  Ahh, the age old tension of fine vs. coarse grained locking comes up again.

We eventually realized we’d want both optimistic and pessimistic reads, the latter of which worked a lot like reader/writer locks.  We crammed all these into a clever little word-sized data structure which worked a lot like Vista’s SRWL data structure.  Except that it also contained a version number.

It was always surprising to me what strange things in the runtime we bumped up against.  We realized a nice GC optimization: instead of keeping strong references to all intermediary states in a transaction log, we could keep weak references to all but the “before” and “after” state.  This is important when transacting synthetic situations like this:

static BigHonkinFoo s_f;

atomic {

    for (int i = 0; i < 1000000; i++)

        s_f = new BigHonkinFoo();

}

Of course you wouldn’t write that code exactly.  But there’s no need to keep alive all but the s_f that existed prior to entering the atomic block and the current one at any given time.  But this leads to particularly hairy finalization issues.  If a finalizable object is allocated within a transaction (say BigHonkinFoo), and is then reclaimed, its Finalize() method will be scheduled to run on a separate thread.  Yet the transaction log may contain references to it.  Thus there is a race between the transaction’s final outcome and the invocation of the finalizer.  We came up with a clever solution for this, but there were countless other clever solutions for various things not worth diving too deep into.

Hacking is fun.  However, it was not going to be what made or broke TM as a model.

Disillusionment Part I: the I/O Problem

It wasn’t long before we realized another sizeable, and more fundamental, challenge with unbounded transactions.  Finalizers touched on this.  What do we do with atomic blocks that do not simply consist of pure memory reads and writes?  (In other words, the majority of blocks of code written today.)  This was not just a pesky question of how to compile a piece of code, but rather struck right at the heart of the TM model.

You already saw the OpenReadOptimistic, OpenWritePessimistic, Validate, Commit, and Rollback pseudo-TM infrastructure calls, each of which operated on memory locations.  But what about a read or write from a single block or entire file on the filesystem?  Or output to the console?  Or an entry in the Event Log?  What about a web service call across the LAN?  Allocation of some native memory?  And so on.  Ordinarily these kinds of operations will be composed with other memory operations, with some interesting invariant relationship holding between the disparate states.  A transaction comprised of a mixture still ought to remain atomic and isolated.

The answer seemed clear.  At least in theory.  The transaction literature, including Reuter and Gray’s classic, had a simple answer for such things: on-commit and on-rollback actions, to perform or compensate the logical action being performed, respectively.  (Around this same time, the Haskell folks were creating just this capability in their STM, where ‘onCommit’ and ‘onRollback’ would take arbitrary lambdas to execute the said operation at the proper time.)  Because we were working primarily in .NET – with a side project targeting C++ -- we decided to use the new System.Transactions technology in 2.0 to hook into inherently transactional resources, like transacted NTFS, registry, and, of course, databases.

(Digging through my blog, I found this article written back in June 2006 about building a volatile resource manager for memory allocation/free operations, just as an example.)

This worked, though we were quite obviously swimming upstream.  Numerous challenges confronted us.

A significant problem was that not all operations are inherently transactional, so in many cases we were faced with the need to add faux transactions on top of existing non-transactional services.  (Already-transactional services were easy, like databases.  Except that mixing fine-grain TM transactions with distributed DTC transactions makes my skin crawl.)  For example, how would you undo a write to the console?  Well, you can’t, really.  So we decided maybe the right default for Console.WriteLine was to use an on-commit action to perform the actual write only once the transaction had committed.

But in even thinking this thought, we realized we were standing on shaky ground.  What if the WriteLine was followed by something like a ReadLine, for example, where the program was meant to wait for the user to enter something into the console (likely in response to the prompt output by WriteLine)?  (This example is a toy, of course, but represents a more fundamental pattern common in networked programs.)  The basic problem was immediately clear.  Adding isolation to an existing non-isolated operation is not always behavior-preserving, particularly when I/O is involved.  Sometimes it is necessary to step outside of the isolation that would otherwise get poured on top by a simple transactional model.

This particular problem isn’t specific to traditional I/O per se.

Foreign function interface calls through.NET’s P/Invoke suffer from like problems.  A call to CreateEvent may be compensatable (via an on-rollback action) with a call to CloseHandle.  But this is flawed.  Once that event’s HANDLE is requested, and/or it is passed to other Win32 APIs like MsgWaitForMultipleObjects, then the isolation of the faux transaction is broken, and real state must be provided to the Win32 APIs.  And if another thread were to look up that HANDLE – perhaps through a name given to it in the call to CreateEvent – it may be able to see and interact with that event before the enclosing transaction has been committed.  The abstraction leaks.  And even if the abstraction is perfect, it is obvious there’s quite a bit of work to be had in order to transact all the touch points between .NET and Win32, of which there are many.  And I mean many.

Other issues wait just around the corner.  For example, how would you treat a lock block that was called from within a transaction?  (You might say “that’s just not supported”, but when adding TM to an existing, large ecosystem of software, you’ve got to draw the compatibility line somewhere.  If you draw it too carelessly, large swaths of existing software will not work; and in this case, that often meant that we claimed to provide unbounded transactions, and yet we would be placing bounds on them such that a lot of existing software could not be composed with transactions.  Not good.)  A seemingly straightforward answer is to treat a lock block like an atomic block.  So if you encounter:

atomic {

    lock (obj) { … }

}

it is logically transformed into:

atomic {

    atomic { … }

}

On the face of it, this looks okay.  (Forget problems like freeform use of Monitor.Enter/Exit for now.)  We’re strengthening the atomicity and isolation, so what could go wrong?  Well, it turns out that examples like this can also suffer from the “too much isolation” problem.  Adding transactions to a lock-block extends the lifetime of the isolation of that particular block’s effects, possibly leading to lack of forward progress.  In fact, you don’t need locks to illustrate the problem.  Imagine a simple lock-free algorithm that communicates between threads using shared variables:

volatile int flag = 0;

flag = 1;                   while (flag != 1) ;

while (flag == 1) ;         flag = 2;

If you invoke this code from within a transaction (on each thread), you’re apt to lead to deadlock.  Both transactions’ effects will be isolated from the others’, whereas we are quite obviously intending to publish the updates to the flag variable immediately.

Anyway, the whole lock thing is a bit of a digression.  The simple fact is that very little .NET code would actually run inside an atomic block but for things like collections and pure computations due to the I/O problem.  You can develop one-off solutions for each problem that arises – and indeed we did so for many of them – and even hang those solutions underneath one general framework – like System.Transactions – but you cannot help but eventually become overwhelmed by the totality of the situation.  The team experimented with static checking to turn these dynamic failures into static ones, but this only marginally improved matters.

I could go on and on about the I/O problem, its various incarnations, and what we did about it.  Instead I will sum it up: this problem was, and still is, the “elephant in the room” threatening unbounded TM’s broader adoption.

The question ultimately boils down to this: is the world going to be transactional, or is it not?

Whether unbounded transactions foist unto the world will succeed, I think, depends intrinsically on the answer to this question.  It sure looked like the answer was going to be “Yes” back when transactional NTFS and registry was added to Vista.  But the momentum appears to have slowed dramatically.

Nesting

Let’s get back to some fun, less depressing material.  There are more surprises lurking ahead.

I already mentioned a great virtue of transactions is their ability to nest.  But I neglected to say how this works.  And in fact when we began, we only recognized one form of nesting.  You’re in one atomic block and then enter into another one.  What happens if that inner transaction commits or rolls back, before the fate of the outer transaction is known?  Intuition guided us to the following answer:

  • If the inner transaction rolls back, the outer transaction does not necessarily do so.  However, no matter what the outer transaction does, the effects of the inner will not be seen.
  • If the inner transaction commits, the effects remain isolated in the outer transaction.  It “commits into” the outer transaction, we took to saying.  Only if the outer transaction subsequently commits will the inner’s effects be visible; if it rolls back, they are not.

For example, consider this code:

void f() {                  void g() {

    atomic { // Tx0             atomic { // Tx1

        x++;                       y++;

        try {                      if (p1)

            g();                  throw new BarException();

        } catch {                         }

            if (p0)               }

                throw;            }

        }

        if (p2)

            throw new FooException();

    }

}

Imagine x = y = 0 at the start, and we invoke f.  Many outcomes are possible.

  • If p1 is true, g will throw an exception, aborting Tx1’s write to y. There are then two possibilities.  (1)If p0 is true, the exception is repropagated and Tx0 will also abort, rolling back its write to x; this leaves x == y == 0.  (2) If p0 is false, the exception is swallowed, and Tx0 proceeds to committing its write to x; this leaves x == 1, whereas y == 1.
  • If p1 is false, on the other hand, g will not throw anything.  Tx1 will commit its write to y “into” the outer transaction Tx0.  One of two outcomes will now occur depending on the value of p2.  (1) If p2 is true, an exception is thrown out of f, and Tx0 rolls back both the inner transaction Tx1’s effects and its own, leaving x == y == 0.  (2) Else, f completes ordinarily, and Tx0 commits both Tx1’s and its own effects, leading to x == y == 1.

We expected most peoples’ intuition to match this behavior.

The canonical working example was a BankAccount class:

class BankAccount
{

    decimal m_balance;

 

    public void Deposit(decimal delta) {

        atomic { m_balance += delta; }

    }

 

    public static void Transfer(

            BankAccount a, BankAccount b, decimal delta) {

        atomic {

            a.Deposit(-delta);

            b.Deposit(delta);

        }

    }

}

This was an illustrative and beautiful example.  It made beautiful slide-ware.  We are composing the Deposit operations of two separate bank accounts into a single Transfer method.  Of course doing the a.Deposit(-delta) and b.Deposit(delta) must be made atomic, else a failure could either lead to missing money, and/or someone could witness the world with the money in transit (and nowhere except for one a thread’s stack) rather than having been transferred atomically.  And building the same thing with locks is frustratingly difficult: using fine-grained per-account locks can lead to deadlock very quickly.

Intuitively we walked down many variants of this mode of nesting.  We reacquainted ourselves with Moss’s great dissertation on the topic, and remembered this intuitive nesting mode as closed nested transactions.  And we shortly recognized the need for another mode: open nested transactions.

To motivate the need for open nesting, imagine we’ve got a hashtable whose physical storage is independent from its logical storage.  Resizing the table of buckets, for example, has little to do with whether a particular {key, value} pair exists within those buckets.  The resizing operation, in fact, is logically idempotent and isolated: the same set of keys will exist within the table both before and after such an operation.  So we can actually commit the physical effects of such an operation eagerly.  With a naïve TM implementation, two independent keys hashing to the same bucket will conflict, and the reads and writes for such operations will live as long as the enclosing user-level transactions.  Instead, we can serialize logical operations with respect to one another at a “higher level” than physically independent operations do, leading to greater concurrency.  Two transactions will only conflict in long-running transactions if they truly operate on the same keys, rather than just happening to hash to the same bucket.

Open nesting forced us to contemplate the sharing of state between outer and inner transactions more deliberately, and gave us some troubles syntactically.  We had wanted to say:

atomic { // ordinary closed nesting.

    Foo f = new Foo();

    atomic(open) { /// open nesting.

        … f? …

    }

}

But is it really legal for the inner transaction here to access the ‘f’, which has been constructed and is presumably uncommitted in the outer transaction?  With closed nested transactions there is lock compatibility between the outer and inner transactions.  An inner closed nested transaction can of course read a memory location write-locked by the outer transaction, for example.  However, the same must not true of open nesting, because an open nested transaction commits “to the world” rather than into its outer transaction.  Allowing it to read and then potentially publish uncommitted state would violate serializability.  It’s possible that the inner open nested transaction will commit, whereas the outer will roll back.  (The reverse situation is equally problematic.)  And yet it’s darn useful to pass state from an outer to an inner transaction – and indeed, often impossible to do anything otherwise – yet what if the key itself were a complicated object graph rather than value, and the key bleeds across transaction boundaries?

Many issues like this arose.  Our straightforward answer was that only pass-by-value worked across such a boundary.  I don’t think we ever found nirvana here.

We developed other transaction modes also.

As we added data parallel operations within a nested transaction, we realized that we’d need something a lot like closed nesting but with special accommodation for intra-transaction parallelism.  This led us to parallel nested transactions, enabling lock sharing from a parent to its many data parallel children.  These children could not communicate with one another other than to “commit into” the parent, and subsequently reforking, thereby ensuring non-interference between them.  Of course children could share read-locks amongst one another, just not write locks.

And we continued to reject the temptation of adding weakened serializability modes a la relational databases (unrepeatable reads, etc).  Although we expected this to arise out of necessity with time, it never did; the various nesting modes we provided seemed to satisfy the typical needs.

A Better Condition Variable

Here’s a brief aside on one of TM’s bonus features.

Some TM variants also provide for “condition variable”-like facilities for coordination among threads.  I think Haskell was the first such TM to provide a ‘retry’ and ‘orElse’ capability.  When a ‘retry’ is encountered, the current transaction is rolled back, and restarted once the condition being sought becomes true.  How does the TM subsystem know when that might be?  This is an implementation detail, but one obvious choice is to monitor the reads that occurred leading up to the ‘retry’ – those involved in the evaluation of the predicate – and once any of them changes, to reschedule that transaction to run.  Of course, it will reevaluate the predicate and, if it has become false, the transaction will ‘retry’ again.

A simple blocking queue could be written this way.  For example:

object TakeOne()

{

    atomic {

        if (Count == 0)

            retry;

        return Pop();

    }

}

If, upon entering the atomic block, Count is witnessed as being zero, we issue a retry.  The transaction subsystem notices we read Count with a particular version number, and then blocks the current transaction until Count’s associated version number changes.  The transaction is then rescheduled, and races to read Count once again.  After Count is seen as non-zero, the Pop is attempted.  The Pop, of course, may fail because of a race – i.e. we read Count optimistically without blocking out writers – but the usual transaction automatic-reattempt logic will kick in to mask the race in that case.

The ‘orElse’ feature is a bit less obvious, though still rather useful.  It enables choice among multiple transactions, each of which may end up issuing a ‘retry’.  I don’t think I’ve seen it in any TMs except for ours and Haskell’s.

To illustrate, imagine we’ve got 3 blocking queues like the one above.  Now imagine we’d like to take from the first of those three that becomes non-empty.  ‘orElse’ makes this simple:

BlockingQueue bq1 = …, bq2 = …, bq3 = …;

atomic {

    object obj =

        orElse {

            bq1.TakeOne(),

            bq2.TakeOne(),

            bq3.TakeOne()

    };

}

While ‘orElse’ is perhaps an optional feature, you simply can’t write certain kinds of multithreaded algorithms without ‘retry’.  Anything that requires cross-thread communication would need to use spin variables.

Deliberate Plans of Action: Policy

I waved my hands a bit above perhaps without you even knowing it.  When I talk about optimistic, pessimistic, and automatic retry, I am baking in a whole lot of policy.  It turns out there is a wide array of techniques.  The simplest question we faced early on was, when an optimistic read fails to validate at the end of a transaction, when should we reattempt execution of that transaction?

The naïve answer is “immediately”.  But obviously that would lead to livelock under some conditions.  A more reasonable answer is “spin for N cycles and then retry”.  But this too can lead to livelock.  A better answer is to either choose some random strategy, or to make an intelligent adaptive choice.  We experimented with many such variants, including random backoff, sophisticated waiting and signaling based on the memory locations in question, among others.  We even played games like giving transactions karma points for cooperatively acquiescing to other competing transactions, and allowing those transactions with the most karma points to make more forward progress before interrupting them.

A few good papers supplied useful (and entertaining) reading material on the topic, but to be honest, nobody had a good answer at the time.  Thankfully these are all implementation details.  So we were free to experiment.

Deadlock breaking also requires policy.  Thankfully we can actually roll back the effects of transactions engaged in a deadly embrace with TM, so we merely need to know how often to run the deadlock detection algorithm.  There was a similar problem when deciding to back off outer layers of nesting, and in fact this becomes more complicated when deadlocks are involved.  Imagine:

atomic {             atomic {

    x++;                 y++;

    atomic {             atomic {

        y++;                 x++;

    }                    }

}                    }

This deadlock-prone example is tricky because rolling back the inner-most transactions won’t be sufficient to break the deadlock that may occur.  Instead the TM policy manager needs to detect that multiple levels of nesting are involved and must be blown away in order to unstick forward progress.

Another variant that went beyond deciding when to favor one transaction over another was to upgrade to pessimistic locking if optimistic let us down.  The whole justification behind optimistic is that, …well, we’re optimistic that conflicts won’t happen.  So it seems reasonable that, if they do occur, we fall back to something more, …well, pessimistic.  There is a dial here too.  Perhaps you only want to fall back to pessimistic after failing optimistically N times in a row, where N > 1.  As I mentioned above, our single-word lock associated with each object supported both locking and versioning cheaply.

Disillusionment Part II: Weak or Strong Atomicity?

All along, we had this problem nipping at our heels.  What happens if code accesses the same memory locations from inside and outside a transaction?  We certainly expected this to happen over the life of a program: state surely transitions from public and shared among threads to private to a single thread regularly.  But if some location were to be accessed transactionally and non-transactionally concurrently, at once, we’d (presumably) have a real mess on our hands.  A supposedly atomic, isolated, etc. transaction would no longer be protected from the evils of racey code.

For example:

atomic { // Tx0            x++; // No-Tx

    x++;

}

Can we make any statements about the value of x after Tx0 commits (or rolls back)?  Not really.  It depends on the way the particular TM being used has been implemented.  An in-place model that rolls back could not only roll back Tx0’s but also the unprotected x++’s write.  And so on.

On one hand, this code is racey.  So you could explain away the undefined behavior as being a race condition.  On the other hand, it was also troublesome.  All those problems with locks begin cropping up all over the place.  It would have been ideal if we could notify developers that they made a mistake.  Then we could have made the assertion that data races are simply not possible with TM.

(Except for consistency-related ones, of course.)

At the same time, many hardware models were being explored.  And of course in hardware you’ve got the physical addresses that variables resolve to and needn’t worry about aliasing.  So it was actually possible to issue a fault if a location was used transactionally and non-transactionally at once.  But given that our solution was software-based, we were uncomfortable betting the farm on hardware support.

Another approach was static analysis.  We could require transactional locations to be tagged, for example.  This had the unfortunate consequence of making reusable data structures less, well, reusable.  Collections for example presumably need to be usable from within and outside transactions alike.  After-the-fact analysis could be applied without tagging, but false positives were common.  We never really took a hard stance on this problem, but always assumed the combination of static analysis, tooling, and, perhaps someday, hardware detection would make this problem more diagnosable.  But I think we generally resolved ourselves to the fact that our TM would suffer from weak atomicity problems.

We thought this was explainable.  Sadly it led to something that surely was not.

Disillusionment Part III: the Privatization Problem

I still remember the day like it was yesterday.  A regular weekly team meeting, to discuss our project’s status, future, hard problems, and the like.  A summer intern on board from a university doing pioneering work in TM, sipping his coffee.  Me, sipping my tea.  Then that same intern’s casual statement pointing out an Earth-shattering flaw that would threaten the kind of TM we (and most of the industry at the time) were building.  We had been staring at the problem for over a year without having seen it.  It is these kinds of moments that frighten me and make me a believer in formal computer science.

Here it is in a nutshell:

bool itIsOwned = false;

MyObj x = new MyObj();

atomic { // Tx0                          atomic { // Tx1

    // Claim the state for my use:           if (!itIsOwned)

    itIsOwned = true;                            x.field += 42;

}                                        }

int z = x.field;

...

The Tx0 transaction changes itIsOwned to true, and then commits.  After it has committed, it proceeds to using whatever state was claimed (in this case an object referred to by variable x) outside of the purview of TM.  Meanwhile, another transaction Tx1 has optimistically read itIsOwned as false, and has gone ahead to use x.  An update in-place system will allow that transaction to freely change the state of x.  Of course, it will roll back here, because isItOwned changed to true.  But by then it is too late: the other thread using x outside of a transaction will see constantly changing state – torn reads even – and who knows what will happen from there.   A known flaw in any weakly atomic, update in-place TM.

If this example appears contrived, it’s not.  It shows up in many circumstances.  The first one in which we noticed it was when one transaction removes a node from a linked list, while another transaction is traversing that same list.  If the former thread believes it “owns” the removed element simply because it took it out of the list, someone’s going to be disappointed when its state continues to change.

This, we realized, is just part and parcel of an optimistic TM system that does in-place writes.  I don’t know that we ever fully recovered from this blow.  It was a tough pill to swallow.  After that meeting, everything changed: a somber mood was present and I think we all needed a drink.  Nevertheless we plowed forward.

We explored a number of alternatives.  And so did the industry at large, because that intern in question published a paper on the problem.  One obvious solution is to have a transaction that commits a change to a particular location wait until all transactions that have possibly read that location have completed – a technique we called quiescence.  We experimented with this approach, but it was extraordinarily complicated, for obvious reasons.

We experimented with blends of pessimistic operations instead of optimistic, alternative commit protocols, like using a “commit ticket” approach that serializes transaction commits, each of which tended to sacrifice performance greatly.  Eventually the team decided to do buffered writes instead of in-place writes, because any concurrent modifications in a transaction will simply not modify the actual memory being used outside of the transaction unless that transaction successfully commits.

This, however, led to still other problems, like the granular loss of atomicity problem.  Depending on the granularity of your buffered writes – we chose object-level – you can end up with false sharing of memory locations between transactional and non-transactional code.  Imagine you update two separate fields of an object from within and outside a transaction, respectively, concurrently.  Is this legal?  Perhaps not.  The transaction may bundle state updates to the whole object, rather than just one field.

All these snags led to the realization that we direly needed a memory model for TM.

Disillusionment Part IV: Where is the Killer App?

Throughout all of this, we searched and searched for the killer TM app.  It’s unfair to pin this on TM, because the industry as a whole still searches for a killer concurrency app.  But as we uncovered more successes in the latter, I became less and less convinced that the killer concurrency apps we will see broadly deployed in the next 5 years needed TM.  Most enjoyed natural isolation, like embarrassingly parallel image processing apps.  If you had sharing, you were doing something wrong.

In Conclusion

I eventually shifted focus to enforcing coarse-grained isolation through message-passing, and fine-grained isolation through type system support a la Haskell’s state monad.  This would help programmers to realize where they accidentally had sharing, I thought, rather than merely masking this sharing and making it all work (albeit inefficiently).

I took this path not because I thought TM had no place in the concurrency ecosystem.  But rather because I believed it did have a place, but that several steps would be needed before getting there.

I suspected that, just like with Argus, you’d want transactions around the boundaries.  And that you’d probably want something like open nesting for fine-grained scalable data structures, like shared caches.  These are often choke points in a coarse-grained locking system, and often cannot be fully isolated, at least in the small.  Ironically I am just now arriving there.  In the system I work on I see these issues actually staring us in the face.

This is just my own personal view on TM.  You may also be interested in reading the current STM.NET team’s views also, available on their MSDN blog.

For me the TM project was particularly enjoyable.  And it was a great learning experience.  I worked with some amazing people, and it was a privilege.  You really had the sense that something big was right around the corner, and every day was a rush of enjoyment.  Despite running as fast as we could, it seemed like we could just barely keep pace with the research community.  Over time more and more researchers turned to TM, and I distinctly recall reading at least one new TM paper per week.

This was also the first time I realized that Microsoft, at its core, really does operate like a collection of many startups.  Our TM work was a grassroots movement, and there was no official sponsorship for our effort at the start.  It was just a group of people independently getting together to discuss how TM might fit into the direction the industry was headed.  Eventually TM started showing up on slide decks in presentations to management, followed by dedicated TM reviews, and even a BillG review.  I will never forget, a couple years after that review – during an overall concurrency review – Bill standing up at the whiteboard, drawing the code “atomic { … }” and asking something to the effect: “Why can’t you just use transactional memory for that?”  I guess the idea stuck with him too.

Who knows.  Maybe in 10 years, the world will be transactional after all.

1/3/2010 11:05:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [12]

Well, Visual Studio 2010 Beta 2 is out on the street.  It contains plenty of neat new things to keep one busy for at least a rainy Saturday.  I proved this today.

Of course, Parallel Extensions is in the box.  .NET 4.0's Task and Task<T> abstractions are used to implement such things as PLINQ and Parallel.For loops, but of course they are great for representing asynchronous work too.  The FromAsync adapters move you from the dark ages of IAsyncResult to the glitzy new space age of tasks.

Not only are tasks tastier than hamburgers, but they enable complex dataflow graphs of asynchronous work to unfold dynamically at runtime, thanks to the ContinueWith method.  From a Task<T> you can get a Task<U> that was computed based on the T; ad infinitum.  We like dataflow.  It is the key to unlocking parallelism, or more accurately, boiling away all else except for dataflow is the key.  But what about control flow, you might ask?  We like it less.  But you can do it, so long as you put in some work.  F#'s async workflows make this sort of thing a tad easier, but the raw libraries in .NET 4.0 don't come with any sort of loops or conditional capabilities.  Perhaps in the future they will.  Nevertheless, in this post I shall demonstrate how to build a couple simple ones.

Not because the lack of them is going to cause unprecidented and unheard of horrors, but rather because in doing so we'll see some neat features of tasks.

The two methods I will illustrate in this post are:

public static class TaskControlFlow

{

    public static Task For(int from, int to, Func<int, Task> body, int width)

    public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)

}

Notice that each body is given the iteration index and is expected to launch asynchronous work and return a Task.  The parameters that these methods take are probably obvious.  Well, except for the last one.  The "width" indicates how many outstanding asynchronous bodies should be in flight at once.  The Task returned by For and While won't be considered done until all iterations are done, and any exceptions will be propagated as you might hope.  It would be pretty useless otherwise.

For example, we could write a while loop that does something very silly:

TaskControlFlow.While(

    i => i < 100,

    i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },

    4

).Wait();

This just prints returns a "timer task" that completes after 250ms and prints out the iteration to the console. We pass a width of 4, so only four tasks will be outstanding at any given time.  Notice we call Wait at the end, since both For and While return tasks representing the in flight work.  This could have instead been written using a For loop as follows:

TaskControlFlow.For(0, 100,

    i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },

    4

).Wait();

The CreateTimerTask method, by the way, looks like this:

private static Task CreateTimerTask(int ms)

{

    var tcs = new TaskCompletionSource<bool>();

    new Timer(x => ((TaskCompletionSource<bool>)x).SetResult(true), tcs, ms, -1);

    return tcs.Task;

}

As something more realistic, imagine we wanted to do something with a large number of files, and don't want to block a whole bunch of threads in the process.  The following "simple" expression will count up all of the bytes for all of the files in a particular directory, without once blocking the thread -- well, except for the initial call to Directory.GetFiles:

string win = "c:\\...\\";

string[] files = Directory.GetFiles(win);

int total = 0;

TaskControlFlow.For(0, files.Length,

    i => {

        bool eof = false;

        int offset = 0;

        byte[] buff = new byte[4096];

        FileStream fs = File.OpenRead(files[i]);

        return TaskControlFlow.While(

            j => !eof,

            j => Task<int>.Factory.

                FromAsync<byte[],int,int>(

                    fs.BeginRead, fs.EndRead, buff, offset, buff.Length,

                    null, TaskCreationOptions.None

                ).

                ContinueWith(v => {

                    if (eof = v.Result < buff.Length) {

                        fs.Close();

                    }

                    offset += v.Result;

                    Interlocked.Add(ref total, v.Result);

                }),

                1

        );

    },

    8

).Wait();

Console.WriteLine(total);

Pretty neat.  We've somewhat arbitrarily chosen a width of 8 for this loop.  And notice something very subtle but important here: we've chosen a width of 1 for the inner loop that plows through the bytes of a file.  This is because we're sharing state, and it would not be safe to launch numerous iterations at once.  The same byte[], eof variable, and so forth, would become corrupt.  I will mention in passing that it's unfortunate that we've got that interlocked stuck in there to add to the total.  Refactoring this so that we could just do a LINQ reduce over the whole thing would be nice.  Indeed, it can be done.

We can do away with the For implementation very quickly.  It is just implemented in terms of While:

public static Task For(int from, int to, Func<int, Task> body, int width)

{

    return While(i => from + i < to, body, width);

}

And it turns out that the While implementation is not terribly complicated either.  Here it is:

public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)

{

    var tcs = new TaskCompletionSource<bool>();

 

    int currIx = -1; // Current shared index.

    int currCount = width; // The number of outstanding tasks.

 

    int canceled = 0; // 1 if at least one body was cancelled.

    ConcurrentBag<Exception> exceptions = null; // A collection of exceptions, if any.

 

    // Generate a continuation action: this fires for each body that completes.

    Action<Task> fcont = null;

    fcont = tsk => {

        if (tsk.IsFaulted) {

            // Accumulate exceptions.

            LazyInitializer.EnsureInitialized(ref exceptions);

            foreach (Exception inner in tsk.Exception.InnerExceptions) {

                exceptions.Add(inner);

            }

        }

        else if (tsk.IsCanceled) {

            // Mark that cancellation has occurred.

            canceled = 1;

        }

        else if (canceled == 0 && exceptions == null) {

            // If no cancellations / exceptions are found, attempt to kick off more work.

            int ix = Interlocked.Increment(ref currIx);

            if (condition(ix)) {

                // Generate a new body task, handling exceptions.  Then make sure we

                // tack on the continuation on that new task, so we can keep on going...

                // If the condition yielded 'false', we'll simply fall through and try to finish.

                Task btsk;

                try {

                    btsk = body(ix);

                }

                catch (Exception ex) {

                    btsk = AlreadyFaulted(ex);

                }

                btsk.ContinueWith(fcont);

                return;

            }

        }

 

        // If this is the last task, signal completion.

        if (Interlocked.Decrement(ref currCount) == 0) {

            if (exceptions != null) {

                tcs.SetException(exceptions);

            }

            else if (canceled == 1) {

                tcs.SetCanceled();

            }

            else {

                tcs.SetResult(true);

            }

        }

    };

 

    // Fire off the right number of starting tasks.

    for (int i = 0; i < width; i++) {

        AlreadyDone.ContinueWith(fcont);

    }

 

    return tcs.Task;

}

I've commented the code inline to illustrate what is going on.  The only other part that isn't shown are the AlreadyDone and AlreadyFaulted members, which simply give Tasks that are already in a final state.  This isn't strictly necessary, but come in handy in a number of situations:

internal static Task AlreadyDone;

 

static TaskControlFlow()

{

    var tcs = new TaskCompletionSource<bool>();

    tcs.SetResult(true);

    AlreadyDone = tcs.Task;

}

 

private static Task AlreadyFaulted(Exception ex)

{

    var tcs = new TaskCompletionSource<bool>();

    tcs.SetException(ex);

    return tcs.Task;

}

And that's it.  I'm done for now.  Hope you enjoyed it.  I've got a few other posts in the works -- primarily the result of a day full of hacking (I got in the office at 7am this morning, and have been here ever since, 14 hours later) -- demonstrating how to do speculative asynchronous work for if/else branches.  Finally, I also have a neat example that illustrates how to do deep dataflow-based speculation without having to wait for work to complete.  This combines the new .NET 4.0 dynamic capabilities with parallelism, so I'm pretty excited to get it working and write about it.

10/31/2009 8:06:24 PM (Pacific Daylight Time, UTC-08:00)  #    Comments [1]

Embarrassingly, I neglected to write about the oldest trick in the book in my last post: designing the producer/consumer data structure to reduce false sharing.  As I've written about several times previously (e.g. see here), and more so in the book, false sharing is always deadly and must be avoided.

As a simple example, consider a program that merely increments a shared counter over and over again.  If we give P threads their own separate counters, and ask them to increment the respective counter an equal number of times.  Each thread can of course do this without synchronization, because the counters are distinct: no locks or even interlocked operations are necessary.  Naively, one might expect that running P of them in parallel leads to no interference, and hence perfect parallelization.  However, when I run a little benchmark on my 8-way machine, the numbers for increasing values of P tell a very different story:

1 = 22425789

2 = 42023726   (187%)

4 = 175828522  (784%)

8 = 333906288  (1489%)

It is clear that the throughput drops dramatically as P increases.  The reason?  Each counter, being only 8 bytes wide, shares a cache line with as many as 7 other counters -- or 15 if we're on a machine with 128 byte cache lines.  A simple change to the counter's layout, so that individual counters do not share the same cache line, will remedy the situation.  The numbers improve dramatically.  In fact, they remain constant no matter the value of P:

1 = 21914250

2 = 21900392   (100%)

4 = 21865781   (100%)

8 = 21934008   (100%)

This perfect scaling isn't always possible due to memory bandwidth, but because we're just incrementing a single counter per core this doesn't manifest as a problem.

For what it's worth, the machine I am running these tests on is an 8-way, dual-socket, quad-core.  Pairs of cores share an L1 cache, and all cores in a socket share an L2 cache.  So the pairs {0,1}, {2,3}, {4,5}, and {6,7} are each expected to have distinct L1 caches and the groups {0,1,2,3} and {4,5,6,7} are expect to have distinct L2 caches.  The 2 number above is run with two threads affinitized such that they share the same L1 cache.  If we force them apart, however, we get slightly different results:

2 = 42023726   (187%) -- same L1 cache

2 = 54706505   (244%) -- same L2 cache

2 = 75030977   (335%) -- separate sockets

As expected, the more distance in the cache hierarchy, the greater the slowdown due to the increased ping pong paths.

The specific results are of course unique to my machine, but nevertheless the conclusion is clear: reducing sharing leads to substantial performance gains, particularly with large numbers of threads hammering on the shared lines.  Often more so than eliminating other sources of wasted cycles, like interlocked operations.  Eliminating those sources is clearly important too, but it really is amazing how deadly and yet difficult to discover false sharing can be: few cases are as obvious as this one.

One aside is worth mentioning before winding down.  When I first ran this experiment, I had done it two ways: (1) with fields of a shared object, then using StructLayout(LayoutKind=Explicit) to keep fields apart, and (2) with counters crammed into an array, which then contains padding elements to eliminate the false sharing.  The former is shown above.  If you try the latter, you may be surprised.  The layout of arrays on the CLR is such that an array's length resides before the first element.  So unless you pad the first element of the array, all accesses will perform bounds checking that touches the first element's line.  Because this line is being mutated by the thread incrementing the first counter, terrible false sharing results.  To solve this, we must pad the first element too.

For example, here are the array numbers with false sharing:

1 = 27366202

2 = 125264714  (458%)

4 = 1383953372 (7969%)

8 = 3136996731 (11463%)

Notice the P = 8 case is over 100x slower!  Yowzas.  After fixing things, with the first element padded, we again observe perfect scaling:

1 = 27393869

2 = 27465999   (100%)

4 = 27370901   (100%)

8 = 27408631   (100%)

Clearly false sharing is not merely a theoretical concern.  In fact, during our Beta1 performance milestone in Parallel Extensions, most of our performance problems came down to stamping out false sharing in numerous places: the partitioning logic of parallel for loops, polling cancellation token flags, enumerators allocated at the beginning of a PLINQ query and constantly mutated during its execution, and even in our examples (e.g. see Herb's matrix multiplication example), etc.  It is terribly simple to make a mistake and, in a complicated system, terribly difficult to pinpoint the origin of what can be a truly crippling scalability bottleneck.

In the next post, we will go back and take a look at our single-producer / single-consumer buffer, and redesign it to have substantially better cache behavior.

~

For reference, here's the basic program used for a lot of these tests:

//#define CACHE_FRIENDLY

//#define USE_ARRAY

 

#pragma warning disable 0169

 

using System;

using System.Diagnostics;

using System.Runtime.InteropServices;

using System.Threading;

 

class Program

{

    const int P = 1;

 

#if USE_ARRAY

    class Counters

    {

        long[] m_longs;

 

        internal Counters(int n) {

#if CACHE_FRIENDLY

            m_longs = new long[(n+1)*16];

#else

            m_longs = new long[n];

#endif

        }

 

        public void Increment(int i) {

#if CACHE_FRIENDLY

            m_longs[(i+1)*16]++;

#else

            m_longs[i]++;

#endif

        }       

    }

#else // USE_ARRAY

#if CACHE_FRIENDLY

    [StructLayout(LayoutKind.Explicit)]

#endif

    struct Counters

    {

#if CACHE_FRIENDLY

        [FieldOffset(0)]

#endif

        public long a;

#if CACHE_FRIENDLY

        [FieldOffset(128)]

#endif

        public long b;

#if CACHE_FRIENDLY

        [FieldOffset(256)]

#endif

        public long c;

#if CACHE_FRIENDLY

        [FieldOffset(384)]

#endif

        public long d;

#if CACHE_FRIENDLY

        [FieldOffset(512)]

#endif

        public long e;

#if CACHE_FRIENDLY

        [FieldOffset(640)]

#endif

        public long f;

#if CACHE_FRIENDLY

        [FieldOffset(768)]

#endif

        public long g;

#if CACHE_FRIENDLY

        [FieldOffset(896)]

#endif

        public long h;

    }

 

    static Counters s_c = new Counters();

#endif // USE_ARRAY

 

    public static void Main(string[] args)

    {

        int p = int.Parse(args[0]);

        const int iterations = int.MaxValue / 4;

        ManualResetEvent mre = new ManualResetEvent(false);

 

#if USE_ARRAY

        Counters c = new Counters(p);

#endif

 

        Thread[] ts = new Thread[p];

        for (int i = 0; i < ts.Length; i++) {

            int tid = i;

            ts[i] = new Thread(delegate() {

                SetThreadAffinityMask(GetCurrentThread(), new UIntPtr(1u << tid));

                mre.WaitOne();

                for (int j = 0; j < iterations; j++)

#if USE_ARRAY

                    c.Increment(tid);

#else

                    switch (tid) {

                        case 0: s_c.a++; break;

                        case 1: s_c.b++; break;

                        case 2: s_c.c++; break;

                        case 3: s_c.d++; break;

                        case 4: s_c.e++; break;

                        case 5: s_c.f++; break;

                        case 6: s_c.g++; break;

                        case 7: s_c.h++; break;

                    }

#endif

            });

            ts[i].Start();

        }

 

        Stopwatch sw = Stopwatch.StartNew();

        mre.Set();

        foreach (Thread t in ts) t.Join();

        Console.WriteLine(sw.ElapsedTicks);

    }

 

    [System.Runtime.InteropServices.DllImport("kernel32.dll")]

    static extern IntPtr GetCurrentThread();

 

    [System.Runtime.InteropServices.DllImport("kernel32.dll")]

    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

}

10/19/2009 5:59:20 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [9]

Commonly two threads must communicate with one another, typically to exchange some piece of information.  This arises in low-level shared memory synchronization as in PLINQ’s asynchronous data merging, in the implementation of higher level patterns like message passing, inter-process communication, and in countless other situations.  If only two agents partake in this arrangement, however, it is possible to implement a highly efficient exchange protocol.  Although the situation is rather special, exploiting this opportunity can lead to some interesting performance benefits.

The standard technique for shared-memory situations is to use a ring buffer.  This buffer is ordinarily an array of fixed length that may become full or empty.  The two threads in this arrangement assume the role of producer and consumer: the producer adds data to the buffer and may make it full, whereas the consumer removes data from the buffer and may make it empty.  It is possible to generalize this to multi-consumers or multi-producers, with some added cost to synchronization.  What is described below is for the two thread case.

We will call this a ProducerConsumerRendezvousBuffer<T>, and its basic structure looks like this:

using System;

using System.Threading;

 

public class ProducerConsumerRendezvousPoint<T>

{

    private T[] m_buffer;

    private volatile int m_consumerIndex;

    private volatile int m_consumerWaiting;

    private AutoResetEvent m_consumerEvent;

    private volatile int m_producerIndex;

    private volatile int m_producerWaiting;

    private AutoResetEvent m_producerEvent;

 

    public ProducerConsumerRendezvousPoint(int capacity)

    {

        if (capacity < 2) throw new ArgumentOutOfRangeException("capacity");

        m_buffer = new T[capacity];

        m_consumerEvent = new AutoResetEvent(false);

        m_producerEvent = new AutoResetEvent(false);

    }

 

    private int Capacity

    {

        get { return m_buffer.Length; }

    }

 

    private bool IsEmpty

    {

        get { return (m_consumerIndex == m_producerIndex); }

    }

 

    private bool IsFull

    {

        get { return (((m_producerIndex + 1) % Capacity) == m_consumerIndex); }

    }

 

    public void Enqueue(T value)

    {

        ...

    }

 

    public T Dequeue()

    {

        ...

    }

}

There are some basic invariants to call out:

  • Our buffer holds our elements, producer index says at what position the next element enqueued will be stored, and the consumer index says from what position the next request to dequeue an element will retrieve its value.
  • We reserve one element in our buffer to differentiate between fullness and emptiness.  This is why we demand that capacity be >= 2.  We could alternatively know how to distinguish between a free slot and a used one, such as checking for null, but keep things simple for now.
  • Thus, IsEmpty is true when the consumer and producer index are the same.  Whereas IsFull is true when the consumer is one ahead of the producer, such that producing would make the producer index collide with the consumer index (otherwise leading to IsEmpty).
  • It should be obvious that our intent is to block consumption when IsEmpty == true and production when IsFull == true.  This is the point of the waiting flags and events.

Now let us look at the implementation first of Enqueue and then Dequeue, paying special attention to the necessary synchronization operations.  They look nearly identical:

    public void Enqueue(T value)

    {

        if (IsFull) {

            WaitUntilNonFull();

        }

 

        m_buffer[m_producerIndex] = value;

        Interlocked.Exchange(ref m_producerIndex, (m_producerIndex + 1) % Capacity);

 

        if (m_consumerWaiting == 1) {

            m_consumerEvent.Set();

        }

    }

Enqueue begins, as expected, by checking whether the queue is full.  Notice that we have not yet issued any memory fences yet.  The only thread that may make the buffer full is the current one, which will obviously not occur before proceeding, and therefore we needn’t perform any expensive synchronization operation for this check.  The value seen may of course be stale but we can deal with that possibility inside the slow path, WaitUntilNonFull.  We’ll look at that momentarily.

We then proceed to placing the value in the buffer at the current producer’s index.  Only the current thread will update the producer index and a consumer will not read from the current value so long as the producer index refers to it.  The value may not even be written atomically, e.g. for T’s that are greater than a pointer sized word.  This is okay: only the act of incrementing the index allows a consumer to access the element in question.  Writes on the CLR 2.0 memory model are retired in order and the reading side will use an acquire load of the index before accessing the element’s words.  Indeed we could use complicated multipart value types that are comprised of lengthy buffers, header words, and so on.

We then increment the producer index, handling the possibility of wrap-around by modding with the capacity.  This uses an Interlocked.Exchange for one simple reason: we are about to read a consumer waiting flag, and must prevent the load of that flag from moving prior to the producer index write.  The consumer sets this flag when it notices the queue is empty and waits.  This enables us to use a “Dekker style” check to minimize synchronization.  We could have alternatively just unconditionally set the event, doing away with the interlocked operation altogether.  But that call, if it involves kernel transitions, which is quite likely, is going to be much more expensive and would occur on every call to Enqueue.  And any event of this kind that doesn’t require kernel transitions is going to at least require an interlocked operation for the same reason we require one here.  An alternative technique involves setting when we transition the buffer from empty to non-empty or full to non-full, but this wastes a possibly expensive signal if the other party isn't even currently waiting.  If full or empty is a rare situation, then full or empty and with a peer actually physically waiting is even rarer.

Let’s now look at the WaitUntilNonFull method.  It’s really the reverse of what the consumer does, so based on everything said till this point, I am guessing it’s obvious:

    private void WaitUntilNonFull()

    {

        Interlocked.Exchange(ref m_producerWaiting, 1);

        try {

            while (IsFull) {

                m_producerEvent.WaitOne();

            }

        }

        finally {

            m_producerWaiting = 0;

        }

    }

We begin by issuing a memory fence and setting the producer waiting flag.  This memory fence is necessary to advertise that we are about to wait, and also to ensure the subsequent check of IsFull is synchronized.  The consumer does something very much like the producer does (above) after taking an element: if the producer is waiting, the consumer has made space for it and therefore must signal.  But it could be the case that the consumer has already made the queue non-full before it could notice the producer’s waiting flag.  We catch this by ensuring the producer’s check of IsFull cannot go before setting the producer waiting; similarly, the consumer cannot make IsFull false without subsequently noticing that the producer is waiting; this avoids deadlock.

Everything else is self explanatory.  Well, almost.  We need a loop here to catch one subtle situation.  Imagine a producer enters into this method thinking the buffer is full.  It sets its flag, and then immediately notices that the buffer is not full anymore.  A consumer has generated a new item of interest.  But imagine that consumer noticed that the producer was waiting and hence set its event.  This is an auto-reset event, so the next time the producer must wait, the event will have already been set and it’ll likely wake up before IsFull has become true.  An alternative way of dealing with this is to call Reset on the event if we didn’t actually wait on the event, but again we keep things simple.

At this point, the consumer side is going to look very familiar:

    public T Dequeue()

    {

        if (IsEmpty) {

            WaitUntilNonEmpty();

        }

 

        T value = m_buffer[m_consumerIndex];

        m_buffer[m_consumerIndex] = default(T);

        Interlocked.Exchange(ref m_consumerIndex, (m_consumerIndex + 1) % Capacity);

 

        if (m_producerWaiting == 1) {

            m_producerEvent.Set();

        }

 

        return value;

    }

 

    private void WaitUntilNonEmpty() {

        Interlocked.Exchange(ref m_consumerWaiting, 1);

        try {

            while (IsEmpty) {

                m_consumerEvent.WaitOne();

            }

        }

        finally {

            m_consumerWaiting = 0;

        }

    }

This is near-identical to Enqueue and WaitUntilNonFull, and so needs little explanation.  The acquire load inside IsEmpty of the producer index ensures that we observe the producer index for this particular value being beyond the current consumer’s index before loading the value itself, thereby ensuring we see the whole set of written words.  The one other thing to point out is that we “null out” the element consumed which, for large buffers, helps to avoid space leaks that would have otherwise been possible.

There are certainly some opportunities for improving this.

For example, we might add a little bit of spinning in the wait cases.  This would be worthwhile for cases that exchange data at very fast rates and have small buffers, meaning that the chance of hitting empty and full conditions is quite high.  Avoiding the context switch thrashing is likely to lead to hotter caches, because threads will remain runnable for longer, and the raw costs of switching themselves.

Additionally, we technically could use a single event if we wanted to spend the effort.  We’d have to handle a few tricky cases, however: namely, the case where a producer or consumer ends up waiting on an event because it “just missed” the event of interest, thus satisfying the event.  Indeed both threads could actually end up waiting on the event simultaneously and we need to somehow ensure the right one eventually gets awakened.  This leads to some chatter and probably isn’t worth the added complexity.

Here is a peek at some rough numbers from a little benchmark that has two threads enqueuing and dequeuing elements as fast as humanly (or computerly) possible.  This is a particularly unique and unlikely situation, but stresses the implementation in a few interesting ways.  In particular, it will stress the contentious slow paths; although these are expected to be rarer, the fast paths are just so easy to get right in this data structure that they are mostly uninteresting to stress performance-wise.  There are then a few variants, each based on the original version shown above:

  • 2 element capacity, which means we’ll be transitioning from empty to full and back a lot.
  • 1024 element capacity, which means we won’t.
  • With spinning, using .NET 4.0’s new System.Threading.SpinWait type.
  • An implementation that overuses interlocked operations as many naïve programmers would do.

The 2 element capacity situation is common in some message passing systems, e.g. Ada rendezvous, Comega joins, and the like.  Whereas the 1,024 element capacity situation is common for more general purpose channels, where some amount of pipelining is anticipated.

I whipped together a benchmark -- so quickly that we can barely trust it, I might add -- to measure these things.  Here’s a small table, showing the observed relative costs:

                     2 capacity    1024 capacity

As-is  No-spin       100.00%       1.93%

       Spin          56.41%        1.66%

Naïve  No-spin       101.20%       2.09%

       Spin          67.73%        1.87%

As with most microbenchmarks, take the results with a grain of salt.  And there are certainly more interesting variants to compare this against, including a monitor-based implementation that locks around access to the buffer itself.  Nevertheless, we can draw a few conclusions from this: as expected, the version that uses a single interlocked on enqueue and single interlocked on dequeue is faster than the naïve version that uses multiple (surprise!); spinning makes a much more interesting difference on the 2 element capacity situation, as expected, because it reduces the number of context switches dramatically; and, finally, the larger capacity enables a producer to race ahead of the consumer, hence avoiding far fewer transitions from full to empty to full and so forth.

This post was more of a case study than anything else.  There is nothing conclusive or groundbreaking here, and I suppose I should have said that would be the case up front.  That said, I’ve seen this technique used in over a dozen situations in actual product code now, so I figured I’d write a little about it, with a focus on how to minimize the synchronization operations.  We even contemplated shipping such a type in Parallel Extensions to .NET, but it’s just too darn specialized to warrant it.  So the closest thing we provided is BlockingCollection<T>.  Enjoy.

10/4/2009 6:03:17 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [57]

I wrote this memo over 2 1/2 years ago about what to do with concurrent exceptions in Parallel Extensions to .NET.  Since Beta1 is now out, I thought posting it may provide some insight into our design decisions.  (And yes, most design discussions start this way.  Somebody develops a personal itch, dives deep into it, and emerges with a proposal for others to vote up, shoot down, or, as is typically the case, somewhere in the middle (provide constructive feedback, iterate, etc).)  I've made only a few slight edits (like replacing code- and type-names), but it's mainly in original form.  I still agree with much of what I wrote, although I'd definitely write it differently today.  And in retrospect, I would have driven harder to get deeper runtime integration.  Perhaps in the next release.

~~~

Concurrency and Exceptions
October, 2006

Exceptions raised inside of concurrent workers must be dealt with in a deliberate way.  Failures can happen concurrently, and yet often the programmer is working with an API that appears to them as though it’s sequential.  The basic question is, then, how do we communicate failure?

The problem

Fork/join concurrency, in which a single “master” thread forks and coordinates with N separate parallel workers, is an incredibly common instance of one of these sequential-looking concurrent operations.  The same callback is run by many threads at once, and may fail zero, one, or multiple times.  The exception propagation problem is inescapable here and comes with a lot of expectations, because the programmer is presented a traditional stack-based function calling interface papered on top of data or task parallelism underneath.

I am faced with the need for a solution to this problem for PLINQ right now and, while I could invent a one-off solution, we owe it to our customers to come up with a common platform-wide approach (or at least ManyCore-wide).  Any solution should compose well across the stack, so that somebody invoking a PLINQ query from within their TPL task that was spawned from a thread pool thread yields the expected and consistent result.  And I would like for us to reach consensus for both managed and native programming models.

Before moving on, there is one non-goal to call out.  Long-running tasks not under the category of fork/join also deserve some attention, because of the ease with which stack traces can be destroyed and the corresponding impact to debugging, but I will ignore them for now.  The problem is not new, exists with the IAsyncResult pattern, and PLINQ doesn’t use this sort of singular asynchronous concurrency.  These cases can typically be trivially solved using existing mechanisms, like standard exception marshaling.

No errors, one error, many errors

To understand the core of the issue, imagine we have an API ‘void ForAll<T>(Action<T> a, T[] d)’.  It takes a delegate and an array, and for every element ‘e’ in ‘d’ invokes the delegate, passing the element, i.e. ‘a(e)’.  If multiple processors are available, the implementation of ForAll may use some heuristic to distribute work among several OS threads, for instance by partitioning the array, probably running one partition on the caller’s thread, and finally joining with these threads before returning so that the caller knows that all of the work is complete when the API returns.

ForAll is not fictitious, and is similar to a number of PLINQ APIs: Where, Select, Join, Sort, etc.  It is also exposed directly by the TPL runtime’s Parallel class which intelligently forks and joins with workers.

‘a’ is a user-specified delegate and can do just about anything.  That includes, of course, throwing an exception.  What’s worse, because ‘a’ is run in several threads concurrently, there may be more than one exception thrown.  In fact, there are three distinct possibilities:

  1. No errors: No invocations of ‘a’ throw an exception.
  2. One error: A single invocation of ‘a’ throws an exception.
  3. Many errors: Concurrent invocations of ‘a’ on separate threads throw exceptions.

Clearly letting an exception crash whichever thread the problematic ‘a(e)’ happened to be run on is problematic and confusing.  If for no other reason than the IAsyncResult pattern has established precedent.  But realistically, the developer would be forced to devise his or her own scheme to marshal the failure back to the calling thread in order for any sort of chance at recovery.  They would get it wrong and it would lead to incompatible and poorly composing silos over time.  A Byzantine model that fully prohibits exceptions passing fork/join barriers goes against the simple, familiar, and understandable (albeit often deceptively so) model of exceptions.

(That said, marshaling leads to a crappy debugging experience.  An already attached debugger will get a break-on-throw notification at the exception on the origin thread, but since we catch, marshal, and (presumably) rethrow, the first and second chances for unhandled exceptions won’t happen until after the exception been marshaled.  This breaks the first pass, and by the time the debugger breaks in, or a crash dump is taken, the stack associated with the origin thread is apt to have gone away, been reused for another task (in the case of the thread pool), etc.  We generally try to avoid breaking the first pass in the .NET Framework, but do it in plenty of places: the BCL today already contains tons of try { … } catch { /*cleanup */ throw; }-style exception handlers, for example.  For this reason I’m not terribly distraught over the implications of doing it ourselves.  And sans deeper integration with the exception subsystem – something we ought to consider – there aren’t many reasonable alternatives.)

What makes this problem really bad is that ForAll appears as though it’s synchronous:

void f() {

    // do some stuff

    ForAll(…, …);

    // do some more stuff, ‘ForAll’ is completely done

}

The method call to ForAll itself is synchronous, but of course its internal execution is not.  But still, to the developer, the call to this function represents one task, one logical piece of work, regardless of the fact that the implementation uses multiple threads for execution.  As higher level APIs are built atop things like ForAll, the low level parallel infrastructure problem becomes a higher level library or application problem.  A Sort that is internally parallel must now decide what exception(s) it will tell callers it may throw.

Nondeterministic exception ordering

We assume the ForAll API stops calling ‘a(e)’ on any given thread when it first encounters an exception.  That is, each thread just does something like this:

for (int i = start_idx; i < end_idx; i++) {

    a(d[i]);

}

The for loop terminates when any single iteration throws an exception.  Imagine our array contains 2048 elements and that ForAll smears the data across 8 threads, partitioning the array into 256-element sized chunks of contiguous elements.  So partition 0 gets elements [0…256), partition 1 gets [256…512), …, and partition 7 gets [1792…2048).  Now imagine that ‘a’ throws an exception whenever fed a null element, and that every 256th element in ‘d’, starting at element 10, is null.  What can a developer reasonably expect to happen?

On one hand, if we’re trying to preserve the illusion of sequential execution, we would only want to surface the exception from the 10th element.  With a sequential loop, this would have prevented the 266th, 522nd, and so on, elements from even being passed to ‘a’.  So we might simply say that the “left most” exception (based on ordinal index) is the one that gets propagated.  The obvious problem with this is there are races involved: subsequent iterations indeed may have actually run.  Alternatively, we might consider only letting the “first” propagate.  Unfortunately, that doesn’t work either, because we unfortunately can’t necessarily determine, for a set of concurrent exceptions, which got thrown first.  Even if they have timestamps, they could occur in parallel at indistinguishably close times.  Nor does this really matter, because it feels fundamentally wrong.

The reason is that we can’t simply throw away failures without true recoverability in the system, a la STM.  The execution of code leading up to the exception did actually happen, after all, and there could be residual effects.  We might be masking a terrible problem by throwing failures away, possibly leading to (more) state corruption and (prolonged, perhaps unrecoverable) damage.  What if the 10th element was a simple ArgumentNullException that the caller chooses to tolerate, but the 266th element’s exception was in response to a catastrophic error from which the application can’t recover?  We can’t choose to propagate the 10th but swallow the 266th.  Broadly accepted exceptions best practices suggest that app and library devs never catch and swallow exceptions they cannot reasonably handle.  We should do our best to follow the spirit of this guidance too.

Re-propagation

We could employ an approach similar to the IAsyncResult pattern, with some slight tweaks.

If each concurrent copy of ForAll caught any unhandled exceptions and marshaled them to the forking thread, including any exceptions that happen on the forking thread itself, we could then propagate all of them together after the join completes.  The question is then: what exactly do we propagate?

If there is just a single exception, it’s tempting to just rethrow it.  But I don’t believe this is a good approach for two primary reasons:

  1. This will destroy the stack trace of the original exception.  This means no information about the actual source of the error inside ‘a’ is available.  With some help from the CLR team, we might be able to get a special type of ‘rethrow’ that copied the original stack trace before recreating a new one.  This is already done for remoted exceptions, and the Exception base class will prefix the original remoted stack trace to the new stack trace.
  2. This doesn’t scale to handle multiple exceptions.  If we could solve #1, it might be attractive because it appears as-if things happened sequentially, but we can’t escape #2, no matter what we do.  We could have different behavior in these two cases, but I believe it’s better to remain consistent instead.  Otherwise, developers will need to write their exception handles two ways: one way to handle singular cases, and the other way to handle multiple cases, where the same API may do either nondeterministically.

Given that we need to propagate multiple exceptions, we should wrap them in an aggregate exception object, and propagate that instead.  At least this way, the original exceptions will be preserved, stack trace and all.  Of course the original exceptions themselves might be other aggregates, handling arbitrary composition.

For sake of discussion, call this aggregate exception System.AggregateException, which of course derives from System.Exception.  It exposes the raw Exception objects thrown by the threads, via an ‘Exception[] InnerExceptions’ property, and additional meta-data about each exception: from which thread it was thrown, and any API specific information about the concurrent operation itself.  This last part is just to help debuggability.  For instance, we might tell the developer that the ArgumentNullException was thrown from a thread pool thread with ID 1011, and that it occurred while invoking the 266th element ‘e’ of array ‘d’.  We might also guarantee the exceptions will be stored in the order in which they were marshaled back to the forking thread, just to help the developer (as much as we can) piece together the sequence of events leading to failures.

(Editor’s note: we decided against storing this meta-data information for various reasons.)

Now the dev can do whatever he or she wishes in response to the exception.  Previously they might have written:

try {

    ForEach(a, d);

} catch (FileNotFoundException fex) {

    // Handler(fex);

}

And now they would have to instead write:

try {

    ForAll(a, d);

} catch (AggregateException pex) {

    List<Exception> unhandled = new List<Exception>();

 

    foreach (Exception e in pex.InnerExceptions) {

        FileNotFoundException fex = e as FileNotFoundException;

        if (fex == null) {

            unhandled.Add(fex);

        } else {

            // Handler(fex);

        }

    }

 

    if (unhandled.Count > 0)

        throw new AggregateException(unhandled);

}

In other words, they would catch the AggregateException, enumerate over the inner exceptions, and react to any FileNotFoundExceptions as they would have normally.  (Taking into consideration that there might have been multiple.)  At the end, if there are any non-FileNotFoundExceptions left over, we propagate a new AggregateException with the handled FileNotFoundExceptions removed.  If there was only one remaining, we could, I suppose, try to rethrow just that, but this has the same nondeterminism problems mentioned above.

Very few people will write this code.  But one of the most vocal arguments against it is: just throw one singular exception, such as ForAllException, and let it crash, because no developer will handle it.  Well, that scheme is no better than throwing the AggregateException.  At least the aggregation model lets people write backout and recovery code if they have the patience to deal with the reality that multiple exceptions occurred.

To make this slightly easier, we could expose an API, ‘void Handle(Func<Exception, bool> a) where T : Exception’, that effectively encapsulates the same logic as shown above, repropagating the exception at the end if all the exceptions weren’t handled (i.e. some weren’t of type T):

try {

    ForAll(a, d);

} catch (AggregateException pex) {

    pex.Handle(delegate(Exception ex) {

        FileNotFoundException fex = ex as FileNotFoundException;

        if (fex != null) {

            // Handle(fex);

            return true;

        }

        return false;

    });

}

(One problem with this approach is that the ‘throw’ inside of Handle will destroy the original stack trace for ‘pex’.  An alternative might be for Handle to modify the AggregateException in place, keeping the stack trace intact, returning a bool that the caller switches on and does a ‘throw’ if it returns false; this is unattractive because it’s error prone and could lead to accidentally swallowing, but in the end might help debuggability.)

If we cared about eliminating unnecessary catch/rethrows, we could use 1st pass filters instead, but this would only be available to VB and C++/CLI programmers, as C# doesn’t expose filters.  For example, in pseudo-code:

try {

    ForAll(a, d);

} catch (fex.InnerExceptions.Contains<FileNotFoundException>()) {

    // Handle …

}

Although interesting, we’re trying to move away from our two pass model.  So let’s forget about this for now.

This approach suffers when composing with non-aggregate exception aware code.  For it to work well, everybody on the call stack needs to be looking inside the aggregate for “their” exception, handling it, and possibly repropagating.  If we want existing BCL APIs to start using data parallelism internally, we would have to be careful here, not to break AppCompat because we start throwing AggregateExceptions instead of the originals.

This is probably where there’s an opportunity for better CLR and tool integration.  For instance, you could imagine a world where the CLR automatically unravels the parallel failures, matching and running handlers for specific exceptions inside the aggregate as it goes, but repropagating if all exceptions weren’t handled.  This is very hand-wavy and fundamentally changes the way exceptions work, so it would require a lot more thought.  A catch block that swallows an exception (today) is just about guaranteed—asynchronous exceptions aside—that the IP will soon reach the next instruction after the try/catch block.  This is a pretty basic invariant.  With this proposal, that wouldn’t be the case, and would be bound to break large swaths of code.  Sticking with the library approach (with all its imperfections) seems like the best plan of attack for now.

Waiting for the “join” to finish

There was something implicit in the design mentioned above.  The ForAll API, and others like it, wouldn’t actually propagate exceptions until the fate of all threads was known.

Imagine we have the scenario described earlier (2048 elements, 8 threads), but slightly different: the 0th element causes an exception, but no other.  It turns out this is probably a common case, i.e. that only a subset of the partitions will yield an exception.  In this case, we would still have to wait for 7*256 = 1,792 elements to be run through ‘a’ before this exception is propagated.  Imagine a slightly different case.  The 0th element throws a catastrophic exception, and the application is going to terminate as soon as it propagates.  ‘a’ simply can’t be run any more, and will keep reporting back this same exception.  But it will take 8 of these exceptions to actually stop the application, i.e. by calling ‘a’ on the 0th, 256th, 512th, etc. elements, if we wait for all tasks to complete.  If each exception corresponds to some failed attempt at forward progress, one that possibly corrupts state, then the damage is O(N) times “worse” (for some measurement) than in the sequential program, where N is the number of concurrent tasks.

Instead of waiting helplessly, we could try to aggressively shut down these concurrent workers.

At first, you might be tempted to employ CLR asynchronous thread aborts, but this is fraught with peril.  Almost all .NET Framework code today is taught that thread abort == AppDomain unload, and reacts accordingly.  State corruption stemming from libraries as fundamental as the BCL would be just about guaranteed.  Changing this state of mind and the state of our software would be quite the undertaking.

Instead, we can have the concurrent API itself periodically check an ‘abort’ flag shared among all workers.  The first thread to propagate an exception would set this flag.  And whenever a worker has seen that it has been set, it voluntarily returns instead of finishing processing data:

for (int i = start_idx; i < end_idx && !aborted; i++) {

    a(d[i]);

}

This increases the responsiveness of exception propagation, but clearly isn’t foolproof.  There will still be a delay for long-running callbacks.  Thankfully, with PLINQ, TPL, and I hope most of our parallel libraries, the units of work will be individually fine-grained, and therefore this technique should suffice.

If a concurrent worker is blocked, there’s not a whole lot we can do.  Much like thread aborts, you might be tempted to use Thread.Interrupt to remove it from the wait condition.  Unfortunately this will leave state corruption in its wake, because plenty of code does things like WaitHandle.WaitOne(Timeout.Infinite) without checking the return value or expecting a ThreadInterruptionException.  The same argument applies to, say, user-mode APCs.  Eventually you might also be tempted to use IO cancellation in Windows Vista to cancel errant, runaway network or disk IO requests.  This would be great.  But this also generally has the same problem as interruption, so until we find a general solution to that, we can’t do any of this.

(Editor’s note: We eventually solved this problem by coming up with a unified cancellation framework.)

One last note

This path forward seems best for now, but it leaves me wanting more.

In the end, this feels like a more fundamental problem.  An API like ForAll gives the illusion of an ordinary, old sequential caller/callee relationship.  But the callee doesn’t use a stack-based calling approach: instead, it distributes work among many concurrent workers, turning the linear stack into a sort of dynamically unfolding cactus stack (or tree).  And SEH exceptions are fundamentally linear stack-based creatures.

In this world, it’s just a simple fact that data all over the place can become corrupt simultaneously.  Many things can fail at once because many things are happening at once.  It’s inescapable.  Recovery is disastrously difficult, so most failures will end in crashes.  STM’s promise for automatic recovery offers a glimmer of hope, but without it, I worry that papering a sequential “feel” on top of data/task parallelism is a dangerous game to play.

6/23/2009 8:13:52 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [7]

As we were hard at work creating PFX, we had a sister team of great talent working with us every step of the way.  Their job?  To do to Visual Studio 2010 what PFX did to .NET 4.0, by substantially improving the development experience for parallel programming on Windows.  This includes both diagnostics & debugging, as well as profiling.

Daniel Moth, the program manager for a lot of the IDE features, just wrote up a comprehensive blog post on the new tasks window:

Parallel Tasks - new Visual Studio 2010 debugger window

The new window gives you a view into all of the tasks in your process, their statuses, and where they are running:

Because both TPL and PLINQ use tasks for execution, it supports both quite nicely.  And it has (consistent!) support for both .NET and C++ tasks.  The parallel stacks window is also an impressive new feature, and I'm guessing Daniel will also cover that in the coming weeks.  Keep your eyes peeled. If all goes well, you'll even get to try them out first-hand, once Beta1 is available.

And if that weren't enough to entice you to visit his blog, check out this nasty machine that Daniel uses to run his kitchen appliances:

Oh, the insanity.  I am thinking Task Manager will need revising in Windows 8.

5/16/2009 5:27:35 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [0]

The enumeration pattern in .NET unfortunately implies some overhead that makes it difficult to compete with ordinary for loops.  In fact, the difference between

T[] a = …;

for (int i = 0, c = a.Length; i < c; i++) …action(a[i])…;

and

T[] a = …;

IEnumerator<int> ae = ((IEnumerable<T>)a).GetEnumerator();

while (ae.MoveNext()) …action(ae.Current)…;

is about 3X.  That is, the former is 1/3rd the expense of the latter, in terms of raw enumeration overhead.  Clearly as action becomes more expensive the significance of this overhead lessens.  But if your plan is to invoke a small action over a large number of elements, using an enumerator instead of indexing directly into the array could in fact cause your algorithm to take 3X longer to finish.

There are many reasons for this problem.  They are probably obvious.  Using an enumerator requires at least two interface method calls just to extract a single element from the array.  Because there are O(length) number of these operations, the overhead imposed will be O(length) as well.  Contrast that with the nice, compact for loop, which emits ldarg IL instructions that access the array directly.  This will end up computing some offset (e.g., i * sizeof(T)) and dereferencing right into the array memory.  The enumerator needs to do that, of course, but only after the two interface calls are made.  Additionally, it is possible for the JIT compiler to omit the bounds check on the array access if it knows ‘c’ in the predicate ‘i < c’ was computed from ‘a.Length’, because arrays in .NET are immutable and their size cannot change.

(Strangely, it appears going through IList<T> is even slower than enumeration.  In fact, it appears to be more than 3X the cost of going through IList<T>’s enumerator, and over 10X that of indexing into the array using true ldarg instructions instead of interface calls to IList<T>’s element indexer.)

All of this actually makes it somewhat difficult for those on my team building PLINQ to compete with hand written programs.  That’s true of LINQ generally.  In fact, LINQ tends to be worse, because you string several enumerators together to form a query, often leading to even more overhead attributed to enumeration.  So you might reasonably wonder: if people care about performance, then why would they willingly start off 3X “in the hole” in hopes that they will eventually gain it back when they use machines with >= 4 cores?  It’s a completely fair criticism (although you must recall that everything I’m talking about is “pure overhead” and once you begin to have sizable computations in the per-element action it matters less and less).  We continually do a lot of work to try to recoup these costs.

There are actually many alternative enumeration models, and I think .NET needs to change direction in the future.  In addition to the overhead associated with the pattern, .NET’s enumeration pattern is a “pull” model (versus “push”), which makes it incredibly hard to tolerate blocking within calls to MoveNext.  Over time, I think we will need to pursue the push model more seriously.

I’ve thrown together a few different examples of alternative enumeration techniques.  To cut to the chase, here is a simple micro-benchmark test that enumerates over 1,000,000 elements 25 times, invoking an empty (non-inlineable) method for each element.  The per-element work here is quite small (although not empty) and so the results are a bit more extreme than a real workload would show:

For loop (int[])        739255 tcks     % of baseline

For loop (IList<int>)   7534609 tcks    1019.216%

ForEach loop (int[])    829617 tcks     112.2234%

int[] IEnumerator<int>  2152414 tcks    291.1599%

IEnumerator<int>        2062876 tcks    279.048%

IFastEnumerator<int>    1758992 tcks    237.9412%

IForEachable<int> [s]   1103745 tcks    149.305%

IForEachable<int> [i]   976742 tcks     132.1252%

IForEachable2<int>      957883 tcks     129.5741%

These are:

  • “For loop (int[])” is an ordinary for loop over the array directly.
  • “For loop (IList<int>)” is an ordinary for loop over the array’s IList<T> interface.
  • “ForEach loop (int[])” is an ordinary foreach loop over the array directly.
  • “int[] IEnumerator<int>” uses the array’s implementation of IEnumerator<T>.
  • “IEnumerator<int>” is a custom IEnumerator<T> implementation.
  • “IFastEnumerator<int>” is an implementation of new pull interface (defined below).
  • “IForEachable<int>” is an implementation of a new push interface (defined below) that uses delegates to represent the per-element action.  The only difference between the “[s]” and “[i]” variants are that the delegate is bound to a static method for “[s]” and an instance method for “[i]”.
  • “IForEachable2<int>” is a slight variant of IForEachable<T> (also defined below).

Notice that with IForEachable2<T>, we’ve gotten within 30% of the efficient for loop.  Unfortunately, I do get somewhat different numbers when compiling with the /o+ switch:

For loop (int[])        777746 tcks     % of baseline

For loop (IList<int>)   7569517 tcks    973.2634%

ForEach loop (int[])    735846 tcks     94.61264%

int[] IEnumerator<int>  2340361 tcks    300.9159%

IEnumerator<int>        2063039 tcks    265.2587%

IFastEnumerator<int>    1806568 tcks    232.2825%

IForEachable<int> [s]   1090644 tcks    140.2314%

IForEachable<int> [i]   946090 tcks     121.6451%

IForEachable2<int>      1234201 tcks    158.6895%

For comparison purposes, I get numbers like this if the loop body is completely empty except for accessing the current element:

For loop (int[])        452039 tcks     % of baseline

For loop (IList<int>)   422732 tcks     93.51671%

ForEach loop (int[])    461274 tcks     102.043%

int[] IEnumerator<int>  1958711 tcks    433.3058%

IEnumerator<int>        1730502 tcks    382.8214%

IFastEnumerator<int>    1372421 tcks    303.6068%

IForEachable<int> [s]   1091720 tcks    241.5101%

IForEachable<int> [i]   958401 tcks     212.0173%

IForEachable2<int>      664572 tcks     147.0165%

And this (with /o+):

For loop (int[])        262146 tcks     % of baseline

For loop (IList<int>)   263302 tcks     100.441%

ForEach loop (int[])    372924 tcks     142.2581%

int[] IEnumerator<int>  1889132 tcks    720.6412%

IEnumerator<int>        1635837 tcks    624.0175%

IFastEnumerator<int>    1479579 tcks    564.4103%

IForEachable<int> [s]   1096712 tcks    418.3592%

IForEachable<int> [i]   962261 tcks     367.0706%

IForEachable2<int>      698340 tcks     266.3935%

These numbers aren’t quite as meaningful because we have no idea what’s being optimized away by the C# and JIT compilers.  For example, they may notice we’re not using the current element at all and therefore eliminate the access altogether.  Nevertheless, the relative ranking of efficiency has remained nearly the same (with the notable exception of the array’s IList<T> test being much less worse).

(All of these numbers were gathered on a 32-bit OS on a 64-bit machine.  Because the JIT compilers for 32-bit and 64-bit are so different, you can expect vastly different results across architectures.)

Anyway, here is what IFastEnumerator<T>, IForEachable<T>, and IForEachable2<T> look like:

interface IFastEnumerable<T>

{

    IFastEnumerator<T> GetEnumerator();

}

 

interface IFastEnumerator<T>

{

    bool MoveNext(ref T elem);

}

 

interface IForEachable<T>

{

    void ForEach(Action<T> action);

}

 

interface IForEachable2<T>

{

    void ForEach(Functor<T> functor);

}

 

abstract class Functor<T>

{

    public abstract void Invoke(T t);

}

I also have a data type called SimpleList<T> that implements each of these, including IEnumerable<T>.  This is what the test harness uses for its benchmarking.  So any boneheaded mistakes I’ve made in the implementation of this class could cause us to draw the wrong conclusions about the interfaces themselves.  Hopefully there are none:

class SimpleList<T> :

    IEnumerable<T>, IFastEnumerable<T>, IForEachable<T>, IForEachable2<T>

{

    private T[] m_array;

 

    public SimpleList(T[] array) { m_array = array; }

 

       // Etc …

}

The class of course implements IEnumerable<T> in the standard way:

IEnumerator<T> IEnumerable<T>.GetEnumerator()

{

    return new ClassicEnumerable(m_array);

}

 

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()

{

    return new ClassicEnumerable(m_array);

}

 

class ClassicEnumerable : IEnumerator<T>

{

    private T[] m_a;

    private int m_index = -1;

    internal ClassicEnumerable(T[] a) { m_a = a; }

 

    public bool MoveNext() { return ++m_index < m_a.Length; }

    public T Current { get { return m_a[m_index]; } }

    object System.Collections.IEnumerator.Current { get { return Current; } }

    public void Reset() { m_index = -1; }

    public void Dispose() { }

}

The idea behind IFastEnumerable<T> (and specifically IFastEnumerator<T>) is to return the current element during the call to MoveNext itself.  This cuts the number of interface method calls necessary to enumerate a list in half.  The impact to performance isn’t huge, but it was enough to cut our overhead from about 3X to 2.3X.  Every little bit counts:

IFastEnumerator<T> IFastEnumerable<T>.GetEnumerator()

{

    return new FastEnumerable(m_array);

}

 

class FastEnumerable : IFastEnumerator<T>

{

    private T[] m_a;

    private int m_index = -1;

    internal FastEnumerable(T[] a) { m_a = a; }

 

    public bool MoveNext(ref T elem)

    {

        if (++m_index >= m_a.Length)

            return false;

        elem = m_a[m_index];

        return true;

    }

}

(Update: after writing the blog post, I made a couple slight optimizations that make this a bit tighter (fewer field fetches):

class FastEnumerable : IFastEnumerator<T>

{

    private T[] m_a;

    private int m_index = -1;

    internal FastEnumerable(T[] a) { m_a = a; }

 

    public bool MoveNext(ref T elem)

    {

        T[] a = m_a;

        int i;

        if ((i = ++m_index) >= a.Length)

            return false;

        elem = a[i];

        return true;

    }

}

The impact to performance isn't huge, but does improve the performance to about 2.1X of the baseline.)

The IForEachable<T> interface is a push model in the sense that the caller provides a delegate and the ForEach method is responsible for invoking it once per element in the collection.  ForEach doesn’t return until this is done.  In addition to having far fewer method calls to enumerate a collection, there isn’t a single interface method call.  Delegate dispatch is also much faster than interface method dispatch.  The result is nearly twice as fast as the classic IEnumerator<T> pattern (when /o+ isn’t defined).  Now we’re really getting somewhere!

void IForEachable<T>.ForEach(Action<T> action)

{

    T[] a = m_array;

    for (int i = 0, c = a.Length; i < c; i++)

        action(a[i]);

}

Delegate dispatch still isn’t quite the speed of virtual method dispatch.  And delegates bound to static methods are actually slightly slower than those bound to instance methods, which is why you’ll notice a slight difference in the original “[s]” versus “[i]” measurements.  The reason is subtle.  There is a delegate dispatch stub that is meant to call the target method: when the delegate refers to an instance method, the ‘this’ reference pushed in EAX points to the delegate object when it is invoked and the stub can simply replace it with the target object and jump; for static methods, however, all of the arguments need to be “shifted” downward, because there is no ‘this’ reference to be passed and therefore the first actual argument to the static method must take the place of the current value in EAX.

The IForEachable2<T> interface just replaces delegate calls with virtual method calls.  Somebody calling it will pass an instance of the Functor<T> class with the Invoke method overridden.  The implementation of ForEach then looks quite a bit like IForEachable<T>’s, just with virtual method calls in place of delegate calls:

void IForEachable2<T>.ForEach(Functor<T> functor)

{

    T[] a = m_array;

    for (int i = 0, c = a.Length; i < c; i++)

        functor.Invoke(a[i]);

}

And that’s it.  Here is the program that drives the little micro-benchmark tests that I showed output for at the beginning:

class Program

{

    public static void Main()

    {

        const int size = 2500000;

 

        Random r = new Random();

        int[] array = new int[size];

        for (int i = 0; i < size; i++) array[i] = r.Next();

 

        SimpleList<int> list = new SimpleList<int>(array);

 

        const int iters = 25;

 

        long baseline = 0;

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // Regular for loop

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                for (int j = 0, c = array.Length; j < c; j++)

                    DoNothing(array[j]);

            }

            baseline = sw.ElapsedTicks;

            Console.WriteLine("For loop (int[])\t{0} tcks\t% of baseline", baseline);

        }

 

        // Regular for loop (IList<int>)

        {

            Stopwatch sw = Stopwatch.StartNew();

            IList<int> ia = array;

            for (int i = 0; i < iters; i++)

            {

                for (int j = 0, c = ia.Count; j < c; j++)

                    DoNothing(ia[j]);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("For loop (IList<int>)\t{0} tcks\t{1}%",

                elapsed, 100*(elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // Regular foreach loop

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                foreach (int x in array)

                    DoNothing(x);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("ForEach loop (int[])\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // Regular foreach loop

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                IEnumerator<int> e = ((IEnumerable<int>)array).GetEnumerator();

                while (e.MoveNext())

                    DoNothing(e.Current);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("int[] IEnumerator<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        // IEnumerator<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                IEnumerator<int> e = ((IEnumerable<int>)list).GetEnumerator();

                while (e.MoveNext())

                    DoNothing(e.Current);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IEnumerator<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IFastEnumerator<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                int x = 0;

                IFastEnumerator<int> e = ((IFastEnumerable<int>)list).GetEnumerator();

                while (e.MoveNext(ref x))

                    DoNothing(x);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IFastEnumerator<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IForEachable<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                Action<int> act = new Action<int>(DoNothing);

                ((IForEachable<int>)list).ForEach(act);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IForEachable<int> [s]\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IForEachable<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                DoNothingClosure dnc = new DoNothingClosure();

                Action<int> act = new Action<int>(dnc.DoNothing);

                ((IForEachable<int>)list).ForEach(act);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IForEachable<int> [i]\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IForEachable2<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                DoNothingFunctor dnf = new DoNothingFunctor();

                ((IForEachable2<int>)list).ForEach(dnf);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IForEachable2<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

    }

 

    [System.Runtime.CompilerServices.MethodImpl(

        System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]

    private static void DoNothing(int x) { }

 

    class DoNothingClosure

    {

        [System.Runtime.CompilerServices.MethodImpl(

            System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]

        public void DoNothing(int x) { }

    }

 

    class DoNothingFunctor : Functor<int>

    {

        public override void Invoke(int x) { DoNothing(x); }

    }

}

To summarize, .NET enumeration costs something over typical for loops that index straight into arrays.  Most programs needn’t worry about these kinds of overheads.  If you’re accessing a database, manipulating a large complicated object, or what have you, inside of the individual iterations, then the overheads we’re talking about here are miniscule.  In fact, walking 1,000,000 elements is in the microsecond range for all of the benchmarks I showed, even the slowest ones.  So none of this is anything to lose sleep over.  But if you have a closed system that controls all of its enumeration, it may be worth doing some targeted replacement of enumerators with the more efficient patterns, particularly if you tend to enumerate lots and lots of elements lots and lots of times in your program.

9/21/2008 12:14:58 AM (Pacific Daylight Time, UTC-07:00)  #    Comments [9]

I just submitted the final manuscript for Concurrent Programming on Windows to Addison-Wesley.

This marks the exciting transition from things happening on my timetable to things happening on AW’s timetable.

A lot has changed for me since I decided to write this book.  You might be surprised to hear that I actually signed the contract for it on November 29th, 2005.  That’s 2 years and 7 months ago.  It’s almost unbelievable that this book took so long to finish.  By comparison, my first one took just a little over a year.  The road has been a long one, full of personal ups and downs, but it’s no doubt been an exciting trip.

I’ve been at Microsoft the whole time.  At the outset, I was a PM on the CLR Team, hacking on software transactional memory and PLINQ as an evening activity.  Then I transitioned to doing it full time, but still as a PM.  Then I joined the Parallel Computing team as the dev for PLINQ.  Then I kicked off the whole Parallel Extensions effort (which is 20 members and growing strong), became the dev lead, and here I am today.  It’s pretty strange to say this, but without the book very little of that would have happened.  I can’t think of a better way to get entrenched in a technology, experience the breadth, and force yourself to learn every little intricate and often enlightening detail.  If you can afford the impact to mental health and personal relationships ;), it’s an activity I highly recommend to anybody wanting to master a technology...  not that one can actually master the concurrency beast, but y’know...

In retrospect, it should have taken a year.  Maybe next time.

The good news is that you will have the book in your hands soon.  (Well, if you decide to buy a copy, that is.)  If you manage to make it to my PDC 2008 pre-con session, I’m hoping we will have some copies available.  No promises, since I missed my final deadline by a couple weeks, but my fingers are crossed.

Oh yeah, and you can expect me to pick up blogging again now that I’ll have some free time.  Hmm, free time?  What will I do with myself!

Laissez les bon temps roulez!

6/23/2008 12:34:18 AM (Pacific Daylight Time, UTC-07:00)  #    Comments [12]

We sat down last week with Charles from Channel9 to discuss the new CTP.  Both parts got posted today:

We focus on the new aspects of the stack, incl. the new scheduler and CDS, and also discuss what's changed in PLINQ and TPL.

If you have ideas for future videos, or any feedback/questions, you know where to send 'em.  joedu AT youknowwhere DOT com.

6/5/2008 5:14:47 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [1]

We just released a new CTP of Parallel Extensions to .NET: get it here.

Some relevant information is up on our team blog:

I'm really excited to see our entire stack finally shipping as one cohesive unit: the data structures we use throughout the implementation exposed publicly (what we now call CDS), a new scheduler built from the ground up, TPL and PLINQ better together, and lots more.  We're still very far from the end of the road, and have plenty of cool stuff still in the works, but we've made a ton of progress in just 6 months since our last CTP.

Have fun and, as always, feedback is much appreciated.

6/2/2008 9:00:26 AM (Pacific Daylight Time, UTC-07:00)  #    Comments [5]

PDC'08 is officially on for October 27-30th this year: http://microsoftpdc.com/.

My team will certainly have some really fun stuff to show off, and just glancing at the preliminary list of teaser sessions, it's going to be a blast.

A few of us have teamed up to give a PreCon.  That's code-word for a full day of neverending concurrency goodness.  From http://microsoftpdc.com/agenda/Preconference.aspx:

Concurrent, Multi-core Programming on Windows and .NET
Presenter(s): David Callahan, Joe Duffy, Stephen Toub
The leap from single-core to multi-core technology is altering computing as we know it, enabling increased productivity, powerful energy-efficient performance, and leading-edge advanced computing experiences. The good news is that Windows and .NET offer rich support for threading and synchronization to take advantage of these new platforms. This session, presented by David Callahan, Microsoft distinguished engineer, Joe Duffy, author of “Concurrent Programming on Windows” (Addison-Wesley), and Stephen Toub, program manager lead for the Concurrency Development Platform team at Microsoft, will cover a broad range of topics, including mechanisms to create, coordinate, and synchronize among threads; best practices for concurrent libraries and apps; and techniques for improving scalability, including lock-free algorithms. Focus will be on .NET programming, including the next generation of parallel programming support within the Framework, but Windows internals and C++ nuggets will be discussed too.

About the presenter(s):
David Callahan joined Microsoft in 2005. He is a Distinguished Engineer leading the Parallel Computing Platform Team within Visual Studio® focused on incubating technology for the coming manycore processors. This team is producing exciting new technologies as part of Visual Studio and also driving the Parallel Computing Initiative, a company wide effort to deliver customer value from the power of future high-performance processors. David’s background is in programming languages, parallel programming techniques, and compilation techniques focused on expressing and exploiting concurrency.

Stephen Toub is a Senior Program Manager Lead on the Parallel Computing Platform team at Microsoft, where he spends his days focusing on the next generation of programming models for concurrency. Stephen is also a Contributing Editor for MSDN® Magazine, for which he writes the .NET Matters column, and he’s an avid speaker at conferences like TechEd and DevConnections. Prior to working on the Parallel Computing Platform, Stephen designed and built enterprise applications for companies such as GE, McGraw-Hill, BankOne, and JetBlue. He was a developer for Microsoft Outlook as well as for the Microsoft Office Solution Accelerators. In his spare time, Stephen loves to sing, and he spends as much time as possible with his beautiful wife Tamara.

Joe Duffy leads development for Microsoft's Parallel Extensions to .NET technology, a set of library and runtime technologies for concurrent and parallel computing. He founded the project in 2006 with Parallel Language Integrated Query (aka PLINQ), an innovative declarative parallel query analysis and execution engine. Prior to Parallel Extensions, Joe worked on transactional memory, library and VM support for concurrency in the Common Language Runtime (CLR) team, and has written 3 functional language compilers (Scheme, Common LISP, and Haskell). He has written two books, including Concurrent Programming on Windows (Addison-Wesley, 2008), and in his spare time reads and writes (code and text), plays guitar, and studies music theory.

Be there, or be square.  The slides aren't even finalized yet, so let me know if you have particular topics you'd like to learn more about.

5/29/2008 11:19:11 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [2]

A long time ago, I wrote that you’d never need to write another finalizer again.  I’m sorry to say, my friends, that I may have (unintentionally) lied.  In my defense, the blog title where I proclaimed this fact did end with “well, almost never.”

Finalizers have historically been used to ensure reclamation of resources that are finite or outside of the purview of the CLR’s GC.  Native memory and Windows kernel HANDLEs immediately come to mind.  Without a finalizer, resources would leak; server apps would die, client apps would page like crazy, and life would be a mess.  For such resources, properly authored frameworks also provide IDisposable implementations to eagerly and deterministically reclaim the resources when they are definitely done.  Three years ago, I wrote a lengthy treatise on the subject.

The finalizer is there as a backstop.  It is often meant to clean up after bugs , such as when a developer forgets to call Dispose in the first place, tried to but failed due to some runtime execution path skipping it (often exceptions-related), or a framework or library author hasn’t respect the transitive IDisposable rule, meaning that eager reclamation isn’t even possible.  It also avoids tricky ref-counting situations as are prevalent in native code:  since the GC handles tracking references, you, the programmer, can avoid needing to worry about such low-level mechanics.  In all honesty, the finalizer’s main purpose is probably that we wanted to facilitate a RAD and VB-like development experience on .NET, where programmers don’t need to think about resource management at all, unlike C++ where it’s in your face.  While one could reasonable argue that IDisposable is all you need (the C++ argument), that would have gone against this goal.

Concurrency changes things a little bit.  A thread is just another resource outside of the purview of the CLR’s GC, and is actually backed by a kernel object and associated resources like non-pageable memory for the kernel stack, some data for the TEB and TLS, and 1MB of user-mode stack, to name a few.  They also add pressure to the thread scheduler.  Threads are fairly expensive to keep around, and “user” code is responsible for creating and destroying them.

Now, it’s true that we are moving towards a world where threads and logical tasks are not one and the same.  This is a ThreadPool model.  But it’s also true that a task that is running on a thread is effectively keeping that thread alive, and perhaps more concerning, preventing other tasks from running on it.  Use of a resource is a kind of actual resource itself, although more difficult to quantify.

So, what does all of this have to do with finalization?

If some object kicks off a bunch of asynchronous work and then becomes garbage—i.e. the consumer of that object no longer needs to access it’s information—then it’s possible (or even likely) that any outstanding asynchronous operations ought to be killed as soon as possible.  Otherwise they will continue to use up system resources (like threads, the CPU, IO, system locks, virtual memory, and so on), all in the name of producing results that will never be needed.  The only reason this task stays alive is because the scheduler itself has a reference to it.

Just as with everything discussed above pertaining to non-GC resources, we’d like it to be the case that such a component would offer two methods of cleanup:

  1. Dispose: to get rid of associated asynchronous computations immediately when the caller knows they no longer need the object.
  2. Finalization: to get rid of associated resources that are still outstanding when the GC collects the root object that is responsible for managing those asynchronous computations.

You’ll notice that we support cancelation in a first class and deeply-ingrained way in the Task Parallel Library.  While not exposed in PLINQ (yet), there is actually cancelation support built-in (though not as fundamental as we’d like (yet)).  This is a useful hook to allow us to build support for both resource reclamation models.  In this sense, cancelation as a pattern of stopping expensive things from happening is quite similar to resource cleanup.  Clearly they aren't identical, but we will need to figure out the specific deltas.

I should also point out that we will prefer and push structured parallelism for many reasons.  Parallel.For is an example, where the API looks synchronous but is internally highly parallel.  One reason we like this model is that the point at which concurrency begins and ends is very specific.  The call won’t return until all work is accounted for and completed.  It’s only when you bleed computations into the background after a call returns that everything stated above becomes an issue.  This is obviously nice for failures (e.g. you are forced to deal with them right away), but also because it alleviates this problem nicely.

I don’t think we’re at a point where we can recommend definite tried-and-true best practices for cancelation of asynchronous work and how it pertains to resource management.  I do think we need to get there by the time we ship Parallel Extensions V1.0.  And I think we will.  Here’s a snapshot summary of my current thinking, however, and I would love to get feedback on it:

  1. We should tell people to implement IDisposable and to Cancel tasks inside Dispose, when their classes own unstructured asynchronous computations.
  2. We may or may not want people to implement a finalizer to do the same.  I currently believe we will.
  3. I am undecided about whether these cancelations should be synchronous.  In some sense, they should be since you’d like to know that all resources have definitely been reclaimed.  But this would mean blocking (possibly indefinitely) on the finalizer thread.  That’s a definite no-no.  Blocking in Dispose would mean blocking (possibly indefinitely) inside a finally block.  That’s also a no-no, although it’s less severe of one than the finalizer.  It just means hosts can’t take over threads as easily when they need to abort them.  Thankfully we offer the Task.Cancel method which is non-blocking.  Possibly we should suggest synchronous cancelation inside of Dispose, and asynchronous inside of the finalizer.
  4. If we did do synchronous anywhere, presumably with Task.CancelAndWait, we’d need to recommend a practice for communicating failures.  Throwing from Dispose is discouraged, but so is swallowing failures.  The kind of code usually run inside of Dispose is much less likely to generate exceptions than running arbitrary tasks full of user code.  Catch-22.
  5. There are some cases we can do the cancelation thing ourselves.  Whether we do or not is subject to debate, but I believe we should.  If we ensured the scheduler’s references are weak, then once all other code in the process drops the reference, we would not schedule it.  This implies that tasks are seldom executed “for effect”, which is certainly a judgment call.  It might be worth exposing an option that allows “for effect” tasks to be created not subject to this rule.
  6. The trickiest case is when a task is already running.  For short-running tasks, this may not be a huge concern, but a lot of such tasks do recursively queue up additional ones.  It would be nice if the fact that its results are no longer needed somehow flowed automatically to the task, perhaps through cancelation.   This also means waking tasks from blocking calls.

It’s interesting to point out that 5 and 6 were part of the original motivation for the inventors of the future abstraction.  They noted that representing computations as futures, and allowing the GC to collect them before they run once they’ve become unreachable, effectively makes computations garbage-collectable.  This, I think, is a neat idea, particularly if your program uses futures pervasively.

In any case, I wanted to point these subtleties out, and hear any feedback folks out there might have.  What I find particularly interesting about concurrency, as we move forward on things like Parallel Extensions, is that there are a lot of subtle implications to the way programs are written.  This includes fundamental things like exceptions and resource management.  There are other subtle impacts, like whether the ordering of results coming out of a computation matters.  PLINQ surfaced this early on, and I didn’t expect the pervasive nature of the issue.   Debugging and profiling are also extraordinarily different.  I suspect we’ll continue running into many such things throughout the evolution to highly parallel software.

2/17/2008 8:29:07 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6]

Today is an extraordinarily exciting day for me.  After about two years of work by several great people across the company, the first Parallel Extensions (a.k.a. Parallel FX) CTP has been posted to MSDN.   Check out Soma’s blog post for an overview, and the new MSDN parallel computing dev center for more details.  Keep an eye on the team’s new blog too, as we’ll be posting a lot of content there as we make progress on the library; in fact, thanks to Steve (who writes blog posts in his sleep), there’s already a bunch of reading to catch up on!

I began kicking the tires on PLINQ back in October of 2005.  In September of 2006, I described PLINQ as “a fully functional prototype” and “research.”  Well, it’s come a long way since then, and we’re finally ready for real human beings to start hammering on it.  Not only that, but we’ve expanded the scope of the original project significantly, from PLINQ to Parallel FX, to include new imperative data parallel APIs (for situations where expressing a problem in LINQ is unnatural), powerful task APIs that offer waiting and cancelation, all supported by a common work scheduler based on CILK-style work-stealing techniques developed in collaboration with Microsoft Research.  And there’s even more to come.  Daniel Moth spilled some beans in his screencast on Channel9 when he described the additional data structures, like synchronization primitives and scalable collections, which will come online later.  Some of them are even in the new CTP, but have remained internal for now.

The shift to parallel computing will have an industry-wide impact, and will undoubtedly take several phases and many years to tame completely.  We have focused on the lowest hanging fruit and the most important foundational shifts in direction we can incite—like encouraging the over-representation of latent parallelism to aid in future scalability—but there are certainly things that the current CTP doesn’t fully address.  GPGPUs, verifiable thread safety, automatic parallelism, great tools support, etc., are all topics that are of great interest to us.  We have a lot of work to do for the final release of Parallel FX, and expect a whole lot of feedback from the community on specific features and general direction.  So let us have it!  You can use our Connect site, or even just email me directly at joedu AT you-know-where DOT com.

Consider this an early Christmas present.  Now you have something fun to do, in the privacy of your own office, when trying to avoid family members during the holidays.  Whoops—did I say that out loud?  Enjoy!

11/29/2007 5:32:35 PM (Pacific Standard Time, UTC-08:00)  #    Comments [16]

Two articles about ParallelFX (PFX) are in the October issue of MSDN magazine and have been posted online:

  1. Parallel LINQ: Running Queries on Multi-Core Processors.  An overview of an implementation of LINQ-to-Objects and -XML which automagically uses data parallelism internally to execute declarative language queries.  It supports the full set of LINQ operators, and several ways of consuming output in parallel.
  2. Parallel Performance: Optimize Managed Code for Multi-Core Machines.  Describes the Task Parallel Library (TPL), a new "thread pool on steroids" with cancelation, waiting, and pool isolation support, among many other things.  Uses dynamic work stealing techniques (see here and here) for superior scalability.

As noted in the article, there's a PFX CTP planned for 2007*.  Watch my blog for more details when it's available.

*Note: some might wonder why we released the articles before the CTP was actually online.  When we originally put the articles in the magazine's pipeline, our intent was that they would in fact line up.  And both were meant to align with PDC'07.  But when PDC was canceled, we also delayed our CTP so that we had more time to make progress on things that would have otherwise been cut.  It's less than ideal, but I'm still confident this was the right choice.

9/15/2007 9:30:29 AM (Pacific Daylight Time, UTC-07:00)  #    Comments [36]

Somebody I respect a lot on our team said something interesting the other day: paraphrasing, "parallelism is about taking one trick and applying it to as many things as possible."  Well, what's the trick?  The trick is breaking a problem into successively smaller pieces on which disjoint subsets of the overall computation can run concurrently.  Pieces in this sense can be little bits of data or instructions, or both.  It seems so obvious, but that really is all there is to it.  That's not to say it's easy, of course, though some people believe it is.  One of the nice things about PLINQ is that you express a big computation and we hide the tricks.  But the tricks aren't impossible to do on your own... today, even.

4/22/2007 11:45:55 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [6]

Prior to coming to Microsoft, I had been a developer for about 6 years.  Back in 2004, when contemplating moving to a different company, I decided I was ready for a shift in focus.  Namely, I wanted to move away from the details of writing code and move towards higher-level architecture, design, and strategy.  So I joined Microsoft as a program manager (PM).

Now, less than 3 years later, I’ve moved back to being a software development engineer (SDE, i.e. developer).  Is it forever?  Who knows?  I didn’t move back because I disliked the PM job (as some might assume).  In fact, I thoroughly enjoyed most aspects of being a PM.  After a few months in my new role I actually find myself missing some of them from time to time.  There wasn’t any single thing that spurred the decision: it was fairly complicated, and involved a mixture of pursuing a specific opportunity, seeking growth, realizing I was missing many things that I enjoyed doing as a developer, and, I suppose, some degree of relapsing back to something I am more comfortable with.

I've written up some details about why I decided to change.  Who knows, maybe it will help somebody out there choose the right role for themselves?

The specific opportunity

The opportunity I’m referring to was, of course, PLINQ, and it came about at just the right time.  I spent the year leading up to my decision gradually focusing more and more on PLINQ.  At first, PLINQ was just a crazy idea.  Then I did some prototyping in my free time, had some quick successes, and was amazed at how simple it was to write parallel programs with LINQ queries.  I showed some people, and most really latched on to the idea; they could understand the value of the system with just a 30 second elevator pitch.  These positive reactions motivated me to stick with the idea, and to keep driving hard.  Then there was interest from the C# team, interest from executives, a BillG ThinkWeek paper, and finally, somebody wanted to do it for real.

In some sense, it was like a start-up.  I had an idea, figured it was an idea worth pursuing, did some prototyping, sold it among my peers, and finally convinced a VC team to fund the execution of the idea.  Well, if it truly were a start-up, what would I then want to do?  Code all night long, obviously!  I wanted to architect and design the actual product version, implement the darned thing, and feel the whole process from start to finish from several perspectives.  Doing PLINQ as a technical lead/developer seemed like a natural next step.

I should also note that I had considered taking developer positions before the PLINQ job.  But in all of those instances, I chose not to for two reasons: I was having a blast as the concurrency PM on the CLR team and because the right opportunity hadn't arisen to steal me away.

This was also a great growth opportunity for me.  Though I was a developer in the past, it was seldom my own idea that I was implementing, and this time it’s different.  I had set the ship sailing in a direction that I was happy with, and experiencing the consequences of that direction is something I seriously wanted.  It’s less about control than it is about feeling more ownership and responsibility for the direction in which our team is headed.  It’s also a way of putting my butt on the line, hopefully winning more trust and respect in the process, showing that I can actually follow through on good ideas (rather than just having the idea itself).  Most PMs certainly get a lot more of the influence and coarse-grained ownership opportunity than most SDEs, but my particular situation was such that my high-level contributions, leadership skills, and domain-experience will be more valuable and influential where I am now.  This last part turned out to be less of a SDE vs. PM thing (i.e. I could have probably had those things as a PLINQ PM too), but it played a huge role in my decision to change teams.

I missed writing code

During the many long nights of prototyping PLINQ, it also kept reminding me how much I loved coding and how much I missed it as a PM.  Put simply: I love programming.  Anybody who’s done something they love and gotten lost in it knows what “flow” is.  I used to experience flow when I played music, but I haven’t done that for so long that programming is now the only time I really feel it.  Maybe it has to do with the fact that I started coding when my brain was still in development (13 years old), and so I tend to be comfortable thinking in code.  You know: no conscious thought, just a direct conduit from your brain to your fingertips.  There is a sense of time standing still, the code simply appearing on the screen in a magical blur, and at some point the thing actually works.  You can get up, stretch, make some tea, map out the next few hours for a moment, and then you’re back into it.  Lost in the fun of it all.  It’s just great, and nothing I ever did as a PM gave me this feeling.

There’s an art to writing code.  Sure, there’s a non-negligible scientific and mathematical component to programming, but I like to think of development as the artful application of those formal techniques.  Inventing, structuring, and reifying abstractions, naming them, the careful placement of comments and whitespace, traversing the state space in your head, using intuition to make certain trade-offs, and so on.

Perhaps the most creative and artistic activity as a PM that I really enjoyed was writing.  But to be honest, I’ve always enjoyed the kind of writing I do on my own time—writing blog essays and books—more than the more prevalent style of writing that a PM tends to do (say, design specifications).  I would get the chance to write a strategic or influential technical document from time to time, but the rate at which I will contribute such things as a SDE seems like it will be the same as when I was a PM.  It has been so far, at least.  If you have the insight and know how, when, and with whom to share and present it, people will listen, regardless of your title.

Designing new things was also of course a creative activity I loved as a PM, but I’m doing more design work as a developer than when I was a PM.  Not all SDEs design higher-level things, but I do regularly and work closely with many architects, distinguished engineers, and technical fellows in the process.  Developers also get to design more at a slightly lower level of abstraction.  But everything's just an abstraction anyway, so the "level" doesn't matter much (to me) anyway.

Prior to PLINQ, my Sencha interpreter/compiler (Scheme at first and later, Common LISP) was my hobby development project.  (And, of course, there were many late nights spent on TopCoder.)  I was basically spending my days as a PM and my nights as a developer.  To be honest, I enjoy doing both things and, aside from the time it took to do both, it made me very happy.  But in retrospect, it turned out to be rather unhealthy.  Writing code was my sole hobby, and I evicted all of my pre-Microsoft hobbies from my system to make room for it.  Now I get to write code during the day, do the components of PM that I enjoyed most, and still preserve my sanity.  I’m just getting back into the things I used to love doing: writing more than just technical blog essays and books, playing guitar, sports, working out, food and wine, and basically just getting outside and feeling healthier, both mentally and physically.  Maybe I'm getting old.

Fractured role structure

There are plenty of things I miss as a PM.  But I associate the absence of many such things with where we are with the PLINQ project.  High-level planning, becoming familiar with the competition and pertinent research in the area, thinking about product strategy, etc. are all certainly crucial activities throughout the project, but they are much more prevalent and involved at the outset.  As the project shifts into execution mode, there’s a lot of thinking about release plans, customer interaction, and team building and management.  I enjoy those things too, but I also love the design, architecture, and implementation of software.  I still get to do them, just not as frequently.  As a developer, you tend not to be involved in as many management discussions and decisions, but I can cope with this knowing I have a solid PM and management team working with me.  I know that if there's anything that requires my attention or where my feedback would be valuable, they won't hesitate to bring it to me.

To be completely honest, I dislike the completely fractured role structure that most of Microsoft employs.  (And “dislike” is putting it mildly.)  I do believe that people need to specialize, particularly as they grow in their career.  You simply can’t do it all and expect to move up to the next level of your career without abstracting some details.  But when you get down to it, there really isn’t much of a difference between PM, SDE, and SDE/T.  I've been doing interviews for PM and SDE lately, and we expect one to have a better business- and customer-sense, while the other needs to be solid algorithmically and implementation-wise, but there's a substantial overlap.  There are obviously certain responsibilities which are unique to each role, but more often than not, the roles are fluid and (when things are going well) people are able to fill in whatever gaps happen to arise that they are good at filling.  I just don’t think this kind of specialization necessarily needs to be forced.  Maybe I'm influenced by my own style of sitting on the edge between the two roles.  Each project needs its own unique structure, and some dedicated managers, but it doesn’t happen according to some magical formula.  As I understand things, this is how it works at Google, and just about every research department I’ve ever known (including MSR).

I should also mention that the grass isn't all a vibrant shade of green on the other side of the fence.  A PM gets to operate at a higher-level, seldom spending more than 10 minutes triaging and talking to a developer about any particular bug, for example.  In contrast, a developer must focus on details most of the time.  This might mean spending anywhere from 1 minute to 1 week on just a single bug.  Some people enjoy getting lost in details (in fact, it can help lead to flow, though bug fixing is more investigative work than anything else, which isn't as conducive).  Checking in a bug makes it really easy to quantify “success” and some people need this.  (Not me.)  I prefer more ambiguous tasks which usually means design or coding an entirely new algorithm, and I find fixing bugs tedious.  This means the late-in-the-game product cycle will be a little rough for me.  But that’s OK: the details are of course part of the engineering process, so I do it and don’t ever find myself thinking “wow, I dislike this” while I'm doing it.  A lot of developers tend to be tools geeks, too.  They love building complex ecosystems of engineering support tools.  I’m not into that.  I value solid engineering practices, but prefer things to be as simple as possible such that nothing gets in between me, the code, and the compiler.  Reality isn’t quite always like this, and tools are often the most effective means to ensure certain quality properties of your product, but I still strive to eliminate (or avoid adding) as much unnecessary obstructions as possible.  A wise man once said "keep it as simple as possible, but no simpler."

Some final words

At the end of the day, when I look at influential people I've bonded with across the company, most (but certainly not all) rose up through the ranks as developers.  That’s clearly very subjective and personal: that’s not to say there aren’t great PMs (past or present), only that most architects, distinguished engineers, technical fellows, and researchers have or are coders at heart.   Moreover, I simply see myself being more successful as a SDE.  Not only have I done it for longer, but the career skills required to move from developer to architect to distinguished engineer (in Microsoft) align more closely with the skills I already have.  While PMs can move to architecture with time, it simply looks like a much longer road from where I stand.

And after saying all of this, I truly believe job title doesn’t matter that much at all.  Ability and good intuition matter.  We’re all just shipping software.  At the end of the day, we do what needs to get done for that to happen, regardless of what hat we happen to be wearing at the time.

3/24/2007 5:19:17 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [3]

[Update 2/19/07: the search for Jim has been called off after just about three weeks.  Thanks to everybody who helped, and best wishes to Jim's family.]

Jim Gray, a Microsoft technical fellow who is best known for his seminal work on database transactions and data management, has been missing for almost a week.  He took his 40-foot sailboat, Tenacious, last Sunday to scatter the remains of his late mother near the Farallon Islands off the coast of California, and has been missing ever since.  This article has additional details.

Although the Coast Guard has called off the search, friends and family are chartering private aircraft, scouring satellite imagery, and NASA even made a special run of its ER-2 aircraft to generate additional data.  Werner Vogels has more information on his website, including how you can help with the search.

Jim actually provided important feedback while I was thinking and writing about PLINQ, and was one of the most supportive and excited reviewers of the project.  Not to mention all of the papers he wrote or was involved with that inspired a lot of the direction.  I’d be very sad to see his life cut short, particularly in this way.

2/3/2007 6:06:00 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4]

I will be giving a PLINQ talk at the Declarative Aspects of Multicore Programming (DAMP) workshop at POPL next week:

[Update: 1/23/07 -- added a link to the slide deck below.]

PLINQ: A Query Language for Data Parallel Programming

Microsoft’s language integrated query (LINQ) technology provides an expressive syntax and suite of APIs which facilitate classic data parallel operations: filtering, mapping, reductions, loop tiling, sorts, nested loops parallelism, prefix scans, and more.  In this talk, we look at a new set of extensions to the LINQ technology, called parallel LINQ (PLINQ), which automatically optimizes and parallelizes query operations based on dynamic runtime information.  We believe that the use of a familiar SQL-like syntax will broaden the reach of PLINQ in industry, and provides programmers with a more declarative and flexible way of expressing data-intensive computations.

Download presentation (PPT).

The workshop is being chaired by Guy Blelloch from CMU, is located in Nice, France, and features some interesting recent research in the field of parallel programming.  If anybody will be in town and wants to meet up, please drop me a line at joedu AT microsoft.

1/8/2007 1:09:41 PM (Pacific Standard Time, UTC-08:00)  #    Comments [27]

I recently left the CLR team to join a new team focusing on parallelism on Microsoft's platform in the 3-10 year timeframe.

I am leading design and development of the Parallel LINQ (PLINQ) project that I alluded to here.

We're looking for supersmart technical people to join the team and help change the face of programming for anybody writing code on the CLR or VC++. PLINQ isn't the only project. Solid CS skills are a must, but you don't necessarily have to be a concurrency guru (right away).

These job postings have some detail:

http://members.microsoft.com/careers/search/results.aspx?FromCP=Y&JobCategoryCodeID=&JobLocationCodeID=&JobProductCodeID=&JobTitleCodeID=&Divisions=&TargetLevels=&Keywords=2012%20&JobCode=&ManagerAlias=&Interval=10

If something catches your eye, respond on the jobs site or send me your resume directly at joedu AT you-know-where DOT com (i.e. microsoft DOT com).

I'll of course still be blogging about everything concurrency related, so you won't notice much of a difference content-wise. And I'm still happy to answer any CLR related questions and help you find the right folks inside the team to listen to your feedback.

12/1/2006 8:54:32 PM (Pacific Standard Time, UTC-08:00)  #    Comments [11]

LINQ coaxes developers into writing declarative queries that specify what is to be computed instead of how to compute the results. This is in contrast to the lion's share of imperative programs written today, which are huge rat nests of for-loops, switch statements, and function calls. The result of this new direction? Computationally intensive filters, projections, reductions, sorts, and joins can be evaluated in parallel... transparently... with little-to-no extra input from the developer. The more data the better.

If you buy the hypothesis--still unproven--that developers will write large swaths of code using LINQ, then by inference, they will now also be writing large swaths of implicitly data parallel code. This, my friends, is very good for taking advantage of multi-core processors.

If you want to get a little glimpse of what I've been spending my time working on, check out these (brief) stories about Parallel LINQ (aka PLINQ), a parallel query execution engine for LINQ:

We've spent many, many months now cranking out a fully functional prototype. The numbers were impressive enough to catch the eye of some key people around the company. And the rest is history... (well, not quite yet...)

I'll no doubt be disclosing more about this in the coming weeks.

(Note: I am in no way committing to any sort of product or release timeframe. This technology is quite early in the lifecycle, and, while unlikely, might never actually make the light of day... Label this puppy as "research" for now.)

9/13/2006 4:48:33 AM (Pacific Daylight Time, UTC-07:00)  #    Comments [18]

 

Recent Entries:

Search:

Browse by Date:
<September 2010>
SunMonTueWedThuFriSat
2930311234
567891011
12131415161718
19202122232425
262728293012
3456789

Browse by Category:

Notables: