Optimal Number of Workers for Parallel Julia


How many workers do you choose when running a parallel job in Julia? The answer is easy, right? The number of physical cores. We always default to that number. For my Core i7 4770K, that means 4, not 8, since 8 would count the hyperthreads. On my FX8350 there are 8 cores, but only 4 floating-point units (FPUs) to do the math, so for mathematical projects I should use 4, right? I want to demonstrate that it’s not that simple.

Where the Intuition Comes From

Most of the time when doing scientific computing you are doing parallel programming without even knowing it. This is because many vectorized operations are “implicitly parallel”, meaning they are multi-threaded behind the scenes to make everything faster. The same holds in other languages like Python, MATLAB, and R. Fire up MATLAB and run

A = randn(10000,10000);
B = randn(10000,10000);
A.*B;

and you will see that all of your cores are used. Threads are a recent introduction to Julia, so as of version 0.5 this will be the case there as well.
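To get a feel for what that implicit parallelism is doing, here is a hand-rolled sketch of a multi-threaded elementwise multiply in Julia. This is only a minimal illustration in present-day syntax (start Julia with several threads, e.g. julia -t 4), not the actual implementation that Base uses:

# A hand-rolled multi-threaded elementwise multiply, mimicking what
# "implicitly parallel" vectorized operations do behind the scenes.
function threaded_mul!(C, A, B)
    Threads.@threads for i in eachindex(C, A, B)
        C[i] = A[i] * B[i]   # each thread works through its own chunk of indices
    end
    return C
end

A = randn(10000, 10000)
B = randn(10000, 10000)
C = similar(A)
threaded_mul!(C, A, B)   # watch all of your cores light up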

Another large place where implicit parallelization comes up is in linear algebra. When you do a matrix multiplication, it is almost surely calling an underlying implementation of BLAS. BLAS (Basic Linear Algebra Subprograms) is, as the name suggests, a set of routines for basic linear algebra operations. These are written in C or Fortran and are heavily optimized. They are well-studied, and many smart people have meticulously crafted “perfect code” that minimizes cache misses and all of that other low-level stuff, all to make this very common operation run smoothly.

Because BLAS (and LAPACK, the Linear Algebra PACKage that builds higher-level routines such as factorizations on top of it) is so optimized, people say you should always make sure that it knows exactly how many “real” processors it has to work with. So in my case, with a Core i7 that has 4 physical cores and 4 more from hyperthreading, forget the hyperthreading: 4 threads. With the FX8350, there are only 4 FPUs for doing the math, so again 4 threads. That is the standard advice, and it’s worth checking that it actually holds.
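You can set this knob yourself and time the result. Here is a quick sketch; in modern Julia the BLAS controls live in the LinearAlgebra standard library (in v0.5 they were in Base):

using LinearAlgebra

BLAS.set_num_threads(4)   # tell BLAS to use the 4 "real" cores
A = randn(5000, 5000)
B = randn(5000, 5000)
@time A * B               # an optimized, multi-threaded BLAS matrix multiply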

What about for your code?

Most likely this does not apply to your code. You didn’t carefully manage all of your allocations or tune exactly what stays in cache; you just wrote some beautiful Julia code. So how many workers do you choose?

Let’s take my case. I have 4 real cores, so do I choose 4? Or do I make 3 workers, leaving 1 core free to “command” the others? Or do I go to 7 or 8 because of hyperthreading?

I decided to test this on a non-trivial example similar to what’s shown in this benchmark (I forget exactly which SDE it was, since that notebook got deleted, but it was some simple SDE). I used a high-order adaptive solver for stochastic differential equations and solved the same problem 1000 times. This is interesting because a stochastic model is probabilistic, so the repeated solves give the approximate distribution of the result. The code sets up the problem and then calls pmap to do a Monte Carlo simulation, solving the equation 1000 times in parallel. The code is mostly math, but there is a slight twist: some values are stored on stacks (a very lightweight data structure). To make sure I could trust the times, I ran the whole experiment 1000 times and took the average, min, and max times.
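In spirit, the code looked something like the following sketch. The actual adaptive solver isn’t shown here, so a simple Euler-Maruyama loop on a made-up SDE stands in for it, and the Distributed incantations are the modern equivalents of what lived in Base in v0.5:

using Distributed
addprocs(4)   # the knob under test: the number of worker processes

@everywhere function solve_sde(u0 = 0.5, dt = 1/1000, T = 1.0)
    u = u0
    for t in dt:dt:T
        # Euler-Maruyama step for the toy SDE du = 1.5u dt + 0.3u dW
        u = u + 1.5u*dt + 0.3u*sqrt(dt)*randn()
    end
    return u
end

# Monte Carlo: solve the SDE 1000 times in parallel; the collected
# results approximate the distribution of the solution at time T.
@time sols = pmap(i -> solve_sde(), 1:1000)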

So in this case, what was best? The results speak for themselves.

Average wall times vs. number of worker processes, 1000 iterations (times in seconds)

Workers | Average Wall Time | Max Wall Time | Min Wall Time
--------|-------------------|---------------|--------------
      1 |           62.8732 |       64.3445 |       61.4971
      3 |           25.7490 |       26.6989 |       25.1143
      4 |           22.4782 |       23.2046 |       21.8322
      7 |           19.7411 |       20.2904 |       19.1305
      8 |           19.0709 |       20.1682 |       18.5846
      9 |           18.3677 |       18.9592 |       17.6000
     10 |           18.1857 |       18.9801 |       17.6823
     11 |           18.1267 |       18.7089 |       17.5099
     12 |           17.9848 |       18.5083 |       17.5529
     13 |           17.8873 |       18.4358 |       17.3664
     14 |           17.4543 |       17.9513 |       16.9258
     15 |           16.5952 |       17.2566 |       16.1435
     16 |           17.5426 |       18.4232 |       16.2633
     17 |           16.9270 |       17.5298 |       16.4492

Note there are two “1000”s here: I ran the Monte Carlo simulation (each run itself solving the SDE 1000 times) 1000 times, and recorded the mean, max, and min wall times. From these numbers it’s very clear that the minimum sits somewhere around 15 workers. 15!

What’s going on? My guess is that this comes from the time that is not spent on the actual mathematics. Parts of the code are doing logic, checking conditionals, allocating new memory as the stacks grow, and so on. Although it is a math problem, there is more than just math in this code! Thus it seems the scheduler is able to let the processes compete and more fully utilize the CPU by shuffling the jobs around. This can only go so far: with too many workers you start to get cache misses, and then the computational time starts to increase. Indeed, at 10 workers I could already see signs of strain in the Windows Resource Monitor.

Overload at 10 workers, as seen in the Windows Resource Monitor

However, at these lower worker counts, letting one process stall on re-allocating memory (or whatever else it’s doing) while another takes over the core seems to be a good tradeoff, even at the cost of an occasional cache miss. Thus, for this code, the optimal number of workers is far above the number of physical cores.

Moral of the Story

The moral is: test your code. If your code is REALLY efficient, then sure, making sure nothing interferes with your perfect code is optimal. If your code isn’t at that level (i.e. it’s just some pretty good Julia code that you want to parallelize), try some higher numbers of workers. You may be shocked by what happens. In this case, the compute time dropped by more than 30% (25.75 seconds with the conventional 3 workers plus a master versus 16.60 seconds with 15 workers) by oversubscribing the number of workers.
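If you want a concrete way to run that test, here is a sketch; heavy_sim is a hypothetical stand-in for your actual per-simulation work, and the first timing at each pool size includes compilation, so repeat the runs for cleaner numbers:

using Distributed

for n in (1, 4, 8, 15)
    nprocs() > 1 && rmprocs(workers())   # tear down the previous worker pool
    addprocs(n)
    @everywhere heavy_sim(i) = sum(abs2, randn(10^6))   # stand-in workload
    t = @elapsed pmap(heavy_sim, 1:1000)
    println("$n workers: $(round(t, digits = 3)) seconds")
end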

7 thoughts on “Optimal Number of Workers for Parallel Julia”

  1. anon says:

    It is not surprising that hyperthreading helps in imperfect code. I am very surprised that you profit from more threads than logical cores (i.e. 15 > 8 = 4 * 2). Is your code using a lot of syncing? That is, if one thread/process needs to wait on a different one before it can continue, then it is of course very helpful to have more threads.

    But your original problem description looked embarrassingly parallel (optimally you would use 1e6 cores for a linear speedup of 1e6 for the computation).


    • Yes, the original problem is embarrassingly parallel. The point is that worrying about things like cache efficiency assumes that your algorithm is actually optimized enough to be using every cycle perfectly. If you’re talking about BLAS, sure. But if you write something yourself that’s more than a single summation loop then it could be more efficient to push a few more computations in there, and the only way to find out is to test.


  2. Ricardo Cruz says:

    I have also noticed this and posted about it over on Stack Overflow.

    As you can see, I was confused. I thought at first it had to do with my operating system not avoiding cache misses, so that there was no trade-off. But I guess if I had done more proper empirical work such as yours, I would have seen something like what you noticed.


  3. Jiahao Chen says:

    Amdahl’s Law applies to every parallel program. However, the proportion of the program it estimates as not parallelizable need not be limited to the lines of code that you write specifically.

    Conventional wisdom applies to statically compiled languages with manually managed memory, where there isn’t so much runtime overhead to worry about. The fact that your code may look “seemingly” parallel but doesn’t exhibit perfect scaling is measuring the non-parallelizable overhead of everything you are using, including the Julia runtime and the specifics of how we implement pmap at present.


  4. Jiahao Chen says:

    Congratulations! You have just demonstrated Amdahl’s Law. Amdahl would have interpreted your results as saying that about 75% of your code can be run in parallel and 25% can only be run in serial.

    Without any further details on your code, yes, you are probably measuring some overhead associated with the Julia runtime, such as the GC. But it is also likely that some part of your code is intrinsically not parallelizable, for example the overhead associated with scheduling, polling, synchronization, and serialization in pmap.


    • I can’t go into detail on the specific adaptive SDE solver code until it’s published, but you can get similar (though not as big) results using simple Euler-Maruyama code. Since it’s just solving a bunch of SDEs via Monte Carlo, Euler-Maruyama just independently runs, M different times, “for i=1:numberofsteps; u = u + f(u)*dt + g(u)*sqrt(dt)*randn(); end” (the sqrt(dt) scales the noise to the step size). The whole point is that Monte Carlo simulations are the obvious “embarrassingly parallel” case, where you would just pmap this 10000 times and return a vector of u’s at the end. Mathematically this is “perfectly parallel”, but there’s enough “computer stuff” in the background (the things you mention) that pushing the number of workers above the number of cores can have a positive effect on the runtime.

      I mention this because the common wisdom is always “choose the number of workers to match the number of cores”. Try it on your own code, even the stuff that is almost 100% parallel. I find that adding a few extra workers usually helps.

      (Also, Amdahl’s Law doesn’t really apply here, since in the sense of Amdahl the program is “seemingly” 100% parallel: the code spends almost 100% of its time in the pmap part, just doing math in a loop. For this kind of thing, if you double the cores, you get almost precisely half the runtime because you’re just running twice as many simulations at the same time. Here I’m not even changing the underlying resources; I am adding more workers than the number of cores, essentially just playing with how the work is scheduled rather than letting it use more processors.)

