Why Numba and Cython are not substitutes for Julia


August 6 2018 in Julia, Programming | Tags: | Author: Christopher Rackauckas

Sometimes people ask: why does Julia need to be a new language? What about Julia is truly different from tools like Cython and Numba? The purpose of this blog post is to describe how Julia’s design gives a very different package development experience than something like Cython, and how that can lead to many more optimizations. What I really want to show is:

  1. Julia’s compilation setup is built for specialization of labor which is required for scientific progress
  2. Composition of Julia codes can utilize the compilation process to build new programs which are greater than the sum of the parts

I will also explain some of the engineering tradeoffs which were made to make this happen. There are many state-of-the-art scientific computing and data science packages in Julia and what I want to describe is how these are using the more “hardcore language-level features” to get there.

Point 1: Any sufficiently competent IR generator will hit the optimization limit on small examples, but that doesn’t necessarily generalize to full package solutions

You will find benchmarks on small scripts (microbenchmarks) where Numba and Cython do as well as Julia. Can these tools optimize as well as Julia? I will give you even more than that. I will go as far as saying that any sufficient competent IR generating mechanism will be efficient on a single function (or in Julia terminology, method). It’s actually not that difficult of a problem in 2018. If you build an LLVM IR with concrete types then LLVM will compile it well, and pretty much any decent representation of that IR will get you to around the same spot. Cython, Numba, etc. all will work just as well as Julia if you have a single function with known input types and all of that because there’s just a limit to how optimal you can make your computations and most of the optimizations are just standard LLVM passes on the IR. It’s not hard to get a loop written in terms of simple basics (well okay, it is quite hard but I mean “not hard” as in “give enough programmers a bunch of time and they’ll get it right”). Cool, we’re all the same. And microbenchmark comparisons between Cython/Numba/Julia will point out 5% gains here and 2% losses here and try to extrapolate to how that means entire package ecosystems will collapse into chaos, when in reality most of this is likely due to different compiler versions and go away as the compilers themselves update. So let’s not waste time on these single functions.

Let’s talk about large code bases. Scaling code to real applications is what matters. Here’s the issue: LLVM cannot optimize a Python interpreter which sits in the middle between two of your optimized function calls, and this can HURT. Take a look at this example which introduces the DifferentialEquations.jl bindings in Python and R. In it we show how JIT compiling a function with Numba only moderately helps the ODE solver (i.e. 50% performance gain).

import numpy as np
from scipy.integrate import odeint
import timeit
import numba
 
def f(u, t, sigma, rho, beta):
    x, y, z = u
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]
 
u0 = [1.0,0.0,0.0]
tspan = (0., 100.)
t = np.linspace(0, 100, 1001)
sol = odeint(f, u0, t, args=(10.0,28.0,8/3))
 
def time_func():
    odeint(f, u0, t, args=(10.0,28.0,8/3),rtol = 1e-8, atol=1e-8)
 
time_func()
timeit.Timer(time_func).timeit(number=100)/100 # 0.07502969299999997 seconds
 
numba_f = numba.jit(f,nopython=True)
def time_func():
    odeint(numba_f, u0, t, args=(10.0,28.0,8/3),rtol = 1e-8, atol=1e-8)
 
time_func()
timeit.Timer(time_func).timeit(number=100)/100 # 0.04437951900000001 seconds

But if you allow the function to be a Julia code it can optimize it a lot further:

from julia import Main
from diffeqpy import de
import numpy as np
jul_f = Main.eval("""
function f(du,u,p,t)
  x, y, z = u[1], u[2], u[3]
  sigma, rho, beta = p[1], p[2], p[3]
  du[1] = sigma * (y - x)
  du[2] = x * (rho - z) - y
  du[3] = x * y - beta * z
end""")
u0 = [1.0,0.0,0.0]
tspan = (0., 100.)
t = np.linspace(0, 100, 1001)
p = [10.0,28.0,8/3]
prob = de.ODEProblem(jul_f, u0, tspan, p)
sol = de.solve(prob,saveat=t,abstol=1e-8,reltol=1e-8)
 
def time_func():
    sol = de.solve(prob,saveat=t,abstol=1e-8,reltol=1e-8)
 
time_func()
timeit.Timer(time_func).timeit(number=100)/100 # 0.0033111710000000016 seconds

with the “Julia called from Python” solution which is about 13x faster than the SciPy+Numba code, which was really just Fortran+Numba vs a full Julia solution. The main issue is that Fortran+Numba still has Python context switches in there because the two pieces were independently compiled and it’s this which becomes the remaining bottleneck that cannot be erased. And it’s clear why this fact will not be of influence in simple one-function microbenchmarks but it is an extremely important difference when trying to build an optimized full ecosystem of scientific computing tools. I don’t care about these little 5% compiler version differences when there’s a 13x difference the moment I throw a real problem at it, and this is the kind of technical issue which you see requires a completely different architectural structure to fully solve.

To further showcase the difference, let’s now solve the differential equation in pure Julia:

using DifferentialEquations, BenchmarkTools
function lorenz!(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0;0.0;0.0]
p = [10.0,28.0,8/3]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz!,u0,tspan,p)
@btime solve(prob,saveat=0.1,reltol=1e-8,abstol=1e-8) # 2.467 ms (13436 allocations: 1.00 MiB)
@btime solve(prob,Tsit5(),saveat=0.1,reltol=1e-8,abstol=1e-8) # 2.904 ms (1081 allocations: 155.70 KiB)

That’s now close to 2.5 miliseconds, or about 18x faster than SciPy+Numba. That’s just with the simplest way of solving the equations! Now if we follow the code optimization tutorial we get:

using StaticArrays
function lorenz_static(u,p,t)
    @inbounds begin
        dx = p[1]*(u[2]-u[1])
        dy = u[1]*(p[2]-u[3]) - u[2]
        dz = u[1]*u[2] - p[3]*u[3]
    end
    @SVector [dx,dy,dz]
end
u0 = @SVector [1.0,0.0,0.0]
p  = @SVector [10.0,28.0,8/3]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz_static,u0,tspan,p)
@btime solve(prob,saveat=0.1,reltol=1e-8,abstol=1e-8) # 2.102 ms (13714 allocations: 12.97 MiB)
@btime solve(prob,Tsit5(),saveat=0.1,reltol=1e-8,abstol=1e-8) # 2.010 ms (56 allocations: 60.19 KiB)

That’s right: in a real application we see the Julia solution 22x faster than SciPy+Numba. Sure, when things are just a straight loop, Numba is okay. But when packages and real applications are timed, there’s a not so subtle difference. Please go run it on your computer and see what you get. You can just copy and paste it into the REPL and see for yourself. As a computational scientist, this matters a lot more to me than than a microbenchmark.

More Cross Language ODE Solver Benchmarks

For more in-depth cross language ODE solver benchmarks which compare speed and accuracy together, see the ODE Solver Multi-Language Wrapper Package Work-Precision Benchmarks (MATLAB, SciPy, Julia, deSolve (R)).

Can you avoid these performance issues in Cython?

You can work past this with Cython. You can design the entire package yourself as one monolithic code base. You write the whole thing in Cython and don’t use person X’s C++ nonlinear solver library or person Y’s Numba nonlinear optimization tool and don’t use person Z’s CUDA kernel because you cannot optimize them together, oh and you don’t use person W’s Cython code without modification because you needed your Cython compilation to be aware of the existence of their Cython-able object before you do the compilation.

The problem is that monolithic architectures become maintenance nightmares which decrease programming productivity since you’re having to repeat the coding of many algorithms which have already been done. But there’s an even larger issue in the context of scientific computing: it is the intersection of high performance computational utilities with complex mathematical algorithms that gives you strong performance. Write a modern EPIRK ODE integrator with Krylov exponential approximation (one of the state-of-the-art stiff ODE solvers with few implementations) in pure Python using objects to describe your scientific model and your problem will be bogged down due to the computational structures that are used. Write a fast and simple Runge-Kutta order 4 integrator in Cython and your simulation will be bogged down since the choice of mathematical algorithm is unoptimized and will require many more function calls than necessary. It’s the intersection: good algorithm plus efficient data structures and compiled code, that produces efficient large-scale scientific software.

Monolithic programming structures are antithetical to the knowledge specialization that’s required in higher level mathematics. It’s simply impossible for someone to be an expert in all areas of numerical mathematics, let alone have the know-how and time to create optimized implementations of the newest algorithms from every discipline. In fact, there are very few people in each mathematical discipline that can even know the state-of-the-art in any detail! I still haven’t seen a fixed-leading coefficient Nordsieck BDF formulation in Python at all so native Python is behind ecosystems like SUNDIALS algorithmically, and that’s not even Cython, and that’s not using GPUs, etc. Thinking you can quickly/productively write all of this yourself in one giant codebase is hubris.

Rewriting every single difficult algorithm from scratch in order to utilize the most modern methods with the most efficient structures is not a style of programming that scales well. This is 2018: we’re past this. We have packages and package managers now, and we want the ability to have separate packages work fully together. This solution already exists: it’s Julia.

Point 2: Julia uses fully-dependent compilation

Julia avoids this through its generic algorithms and dependent compilation: write the integrator once and recompile it to new data structures via multiple dispatch. Let’s explain this approach in some detail. If you’re new to the Julia-sphere and you haven’t read this explanation of the Julia compilation process, you may want to do a little pre-reading.

The idea is that if you know all of the code in its original un-compiled form then you can see you have literals (constants) on the top and propagate them throughout this code as true constants at compile time. You’d have to have no dynamicness in the middle, no interpreter, etc. since otherwise you wouldn’t be able to guarantee const-ness. You’d have to compile the entire package together. In order to propagate these constants and inline function calls into another package, you’d even need to compile different package calls together at the same. This is what Julia does and it’s why it’s able to do full interprodcedural optimization in a way that combines functions from different packages.

Take for example ForwardDiff.jl which can take arbitrary pure Julia code and do forward-mode automatic differentiation to it (even Julia’s Base library). When using ForwardDiff.jl inside of a large package like DifferentialEquations.jl, it doesn’t use a compiled version of ForwardDiff.jl operations inside of DifferentialEquations.jl, but instead it generates on-demand the full functions it needs to compile so that way the function’s value and its derivative are computed simultaneously. While doing this it inlines the ForwardDiff.jl arithmetic operations (along with any small first class functions which were passed into the routines). Julia then compiles the whole call together. And it’s not just packages: since Julia’s Base library and its standard library are written in Julia as well, it takes all of the Base library code as well to build a single typed script and can compile all of this together to a fully optimized form. The result is you get a full LLVM IR where ideas like “dual numbers” are abstractions which can be completely eliminated, and from there LLVM passes like common subexpression eliminated (CSE) can reduce the repeated arithmetic.

I say “can” because Julia’s compiler uses a cost model to decide what to keep separate as purely a function call, and then inserts the function call if it determines that the function call wouldn’t significantly change the runtime. By using a function call, it can use a separately compiled version of the function in order to reduce the amount of compilation time. This separately compiled version though is still a type-dependent compiled form. Note though that by using `@inline` you can bump the weight in the inlining heuristic to almost force it to happen (i.e. copy/paste the code in and compile in full instead of compiling the separate function).

This highlights a key difference between the Cython approach and the Julia approach, and it highlights the tradeoff. In Cython you have separately compiled functions and packages, much like static compilation to shared libraries in C++, and then you put function calls between them. In a few cases where you compile parts together it can inline, but generally you have separate packages/modules/etc. compile separately. This cuts down on compile time and makes it easier to generate a static binary but adds runtime costs.

In contrast, with Julia you have fully dependent compilation. Packages which call other packages can take control of the full code before compilation and then choices have to be made at how to separate it in a meaningful way. Yes, this means that managing compilation times is much more difficult in Julia as seen by the use of inlining cost models and “no-specialization” catches and tricks. If you go through the latency tag you can see that core developers are finding ways to automatically reduce specializations without cutting runtimes. Also, this means that static compilation is much more difficult than in other languages though Julia developers have already made large headway into making it a reality. There are package-level examples of this as well. In this SO post I describe how an ODE solver call can take a 2-3 second compilation to compile a version of the entire integrator program specifically to your ODE model which can reduce the runtime cost by about 4x-5x (which really matters in parameter estimation!), and how we have developed high level ways to turn this off to give users a choice to remove this extra specialization.

You can see that once we have fully dependent compilation, we want language level features in order to fully utilize and optimize this process. This is my next point.

Point 3: Compile-time control, code generation, and optimization require language-level compilation control features

Once you have the dependent compilation process as a large feature, you need/want language level features to be able to control this process. This begs the question: other than behind-the-scenes optimizations that this can add, are there any extra optimizations that can be had by allowing programmers control during this dependent compilation process?

Being able to fully optimize dependently JIT compiled code depends on having strong language level tools and utilizes to control the compilation process. You cannot even discuss these optimization opportunities from a Cython/Numba-perspective because you have a “Sapir-Whorf hypothesis problem”: Python does not have this dependent compilation setup so it doesn’t have the language or language-level tools for handling behavior at this phase. So let’s look at what you gain from dependent compilation and how Julia’s language level features interact with it.

This is where things like macros come in handy. Macros and metaprogramming apply at compile-time, so before functions get compiled you can modify what function you will be compilation based on an expression. But another huge fact is that “function” doesn’t even mean the same thing as in Python/Cython. In Julia, a function is a collection of methods and the method chosen to be used in the final compiled code is dependent on the input types. Since input types matter, you may want to specialize an algorithm for arrays of 64-bit floating point numbers separately from how you’d treat arrays of 32-bit floating point numbers. This is where Julia’s parametric typing comes in: you can specify Vector{Float64} vs Vector{Float32} and dispatch to separate algorithms which are optimal in each of the cases. You can even put these ideas together with generated functions which are “functions” where you can programmatically build a function expression during the compilation stages depending on the input types that you see. In fact, since the entire function compilation is at your control, you can build tools like Cassette.jl which allows you to take control of anyone else’s function and “overdub” it in the compilation process to change what it’s doing.

So what optimizations can you do with these tools? First of all, since you can see the entire code of a function, you can use the dependent compilation to build alternative output functions at compile-time. A type-based dispatch approach utilizes a wrapper type and have it re-write the internal function calls using Julia’s multiple dispatch. Once again, ForwardDiff.jl is a great example to point out. It uses a wrapper into Dual numbers and then on-demand compiles new versions of functions in a way that does automatic differentiation. For example, you can make an entire ODE solver be reconfigured at compile-time to be essentially two parallel ODE solvers which then computes the derivative simultaneous to the equation, with the parallel derivative part generated automatically by the compile-time controls. It’s not even that you can, it’s that it takes no more than 2 lines for a user to set it up. Actually, nobody had to write this functionality for it to exist in Julia since it’s a result of the compilation process! Let me repeat that: nobody had to implement this ability to efficiently autodifferentiate through the ODE solvers: this all happens due to Julia’s compilation process and ForwardDiff’s function overloads. The result is that packages compose nicely and efficiently.

This is that interaction of efficient data structures and difficult algorithms. The ODE solvers in DifferentialEquations.jl are generically-typed algorithms written independently by differential equation solving experts and include many optimized versions of the newest methods. ForwardDiff.jl was created by autodiff experts. The combination, fast autodifferentiation through Runge-Kutta-Chebyshev, EPIRK, Nordsieck BDF, etc. integrators is something that no one could create and optimize on their own, but it’s a combination that Julia can create and optimize! Autodifferentiation is much less work than numerical differentiation and is more accurate, so this is a good combination algorithm to have! Yes, in some cases like fully continuous ODEs you can improve upon this via forward/adjoint sensitivity analysis, but the code of this type-based compilation approach applies to ODEs/SDEs/DAEs/DDEs/SDAEs/mixed Gillespie + ODE/SDE, etc. and many of these cases the sensitivity analysis derivation has never been done and would be a bear to implement. And again, ForwardDiff.jl utilizes value-types (i.e. it’s pointer-free unlike objects) for its Dual numbers and inlines the small function calls, so it builds a very efficient AD runtime in other libraries. Separation of labor without overhead leads to huge productivity and efficiency gains!

So that’s a nice optimization which you may have missed because there’s no code to look at and see how this combination works in full. It works as the free result of compilation controls: the ODE solvers written generically and the autodifferentiation library reconfiguring algorithms using their number type overloads.

Composition of Julia codes gives new free and efficiently implemented features!

This is far from the only example. While Python has an uncertainties package with a type that calculates uncertainties, you cannot make a Cython kernel with NumPy linear algebra and throw this into SciPy ODE solvers and get an a fully optimized function call with uncertainty propagation. You would need to (A) recompile your Cython code to take into account this object (possible, but not automatic and it won’t automatically do this through all dependent packages even if they were Cython), (B) recompile the NumPy linear algebra kernels to use this object in their Fortran code (good luck), and (C) recompile the SciPy ODE solvers to utilize all of this type information internally to propagate it through all of the internal linear combinations. I have never seen cross-package type-dependent optimized auto-recompilation in Python, let alone one that crosses language barriers into the C/C++/Fortran code. But what about Julia? How much programming was needed to be done to solve this enormous problem? A user notified me in a Discourse post that it worked without having to do anything. Cool, that’s literally infinitely less work! Oh, and once again Julia’s dependent compilation process optimizes it.

I can keep going. Just see Cassette’s recent video for a bunch of compile-time optimizations and context-dependent compilation to take control of other people’s packages/code and recompile it to be distributed/gpu/etc.

Another interesting case of this is Julia’s broadcast system. You probably think of “broadcast” as synonymous with “vectorization”, but let me describe it in a very different way to highlight how it can be used to a much greater effect. Broadcast defines full expressions for “element-wise” generically using a lazy type-building system. It does this through a type promotion system plus function overloading. If a package uses broadcast for its element-wise kernels, this means that broadcast overloads allow you to essentially overdub these kernels. I describe in a fair bit of detail how this is used within the DifferentialEquations.jl architecture to allow for heterogeneous scientific models being solved on heterogeneous architecture (GPUs/multithreaded/distributed) to get specialized compilation for the DiffEq solvers. Notice though that this composition is greater than adding GPU functions to ODE solver calls. Instead, by overloading broadcast, the array type’s implementation takes control of the internal loops of the ODE solver and reconfigures/recompiles them to be OpenCL/CUDA kernels on the GPU. Every internal operation is now GPUitized, not just ones from the user passed in functions. And nothing in the DiffEq code was written for GPUs to make this work: this is all context and compilation controls.

Again, in Cython you could write an ODE solver which directly utilizes the structure of your mathematical model and puts pieces on the GPU as necessary, but this is far different than having an efficient combination built automatically from generic algorithms by the dependent compilation of different interacting parts of existing packages!

If you never thought about doing this, if you always believed you had to write code to build a software to solve your problem, then this is very Sapir-Whorf. While I will agree with you that these tools may not be what the vast majority of Julia users are using, this is the Julia that core developers of the base language and its package ecosystem are using to build the tools which are unrivaled in Python/Cython/Numba.

So what is the identity of Julia?

But Julia showcases itself as a simple language for R/Python/MATLAB users, and broadcast is described as a tool for vectorization?! Yes, this is because using any of these compilation control features is optional. I would probably say that most Julia programmers are not using it in full, and that’s fine. Even if the end user of your package doesn’t make use of all of these tools, someone else’s package can still optimize over your code if you have written Julia code. Remember, since we can dependent compile entire function calls, if all of the code is in Julia then we can do all of our powerful stuff like AD through your code even if you don’t know how these compilation controls work. This accumulation effect is in some sense like how as Google generates more data it gets more accurate predictions. In Julia, as the package ecosystem replaces ccall/PyCall packages with native Julia packages, the amount of compilation control and the available optimizations on larger scientific projects grows. So Julia does well by presenting a simple form of the language to users who want a “scripting language but faster”, i.e. it looks like it’s a Cython thing, to get everyone writing pure Julia code.

Julia is really not a simple language if you take the time to utilize all of the language’s features together. I would instead think about it like this. The FFTW package is well-known as the fastest open-source fast Fourier Transform library out there. It does this not by writing an FFT code in C, but with OCaml code which generates C code based on aspects of the problem you are trying to FFT. Because of the overwhelming success of code generation for the purpose of optimizing code, you can ask the question: what if we built a scripting language that is designed so we can do these kinds of things on a user’s scientific computing code? I don’t think it’s a surprise that the author of FFTW is now one of the core contributors of Julia. Regardless of first intentions or how Julia has been marketed, Julia has become a language of generic programming and compile-time optimization to its hardcore users and there is an entire rabbit hole of dynamic compilation control features to explore.

Conclusion

What Julia offers is different because of a full language level solution, which has its own tradeoffs that the Julia developers are working on. Julia as a language has parametric typing to make it multiple dispatch mechanism more powerful and easier to use because controlling compilation through different type structures is a way to hijack downstream/upstream code/packages in order to make the code more optimized as a whole. The language-level features like macros, generated functions, and the newer tools like Cassette.jl (which needed compiler changes to work in full) allow you to utilize the entire code and do modifications dynamically in order to take CPU code and optimize it or throw parts out to GPUs/TPUs/distributed, even if you don’t “own” the code. Dependent compilation fully eliminates the overhead that exists when different libraries are separately compiled. Broadcast overrides allow you to dictate how internal structures of scientific computing codes should be implemented and optimized on your specific model. These are all unnecessary if you could write your algorithm as a simple script, but this is necessary if you want to have packages play nicely together and automatically build optimized combinations of features. This is a whole different world than “write fast code”. Instead, you may never need to write the best features of your package, and they will still come out optimized. This is not something in the realm of Cython/Numba.

There is a tradeoff that I alluded to here and it’s compile-time. For this system to be used interactively, you have to take a step back and find out where you want to stop specializing and where to put up artificial walls. The core Julia developers have made a lot of headway just in the latest part of the Julia v0.7 release candidate in terms of latency, and this is what’s seen in the pretty new latency label on the Julialang/julia Github page. Parts like the Julia REPL can be compiled separately and more controls can be added. There’s a ton to do here. Also, getting full static compilation of libraries is just beginning to show up. Makie.jl is Julia’s next generation plotting library and it statically compiles. Again, there’s no such thing as “fully statically compiling” in Julia because everything is so extendable: you’d have to compile new functions dependent on the input types if these are “new types from packages” (example: think back to the numbers with uncertainties). This is not something that a Python plotting package would deal with (if you create a new primitive type in Cython and throw it to matplotlib it’ll give you a weird look and say “this Float64 doesn’t look right!”). Since Makie.jl is a Julia library, it has extension tie-ins via recipes which allow taking control of the internal function dispatching to change the datatype conversions, but this requires recompilation of specific internals for the new types and so mixing cached native precompilation and static compilation with this dependent compilation is an engineering challenge. But again, Julia and its packages are already making great headway into solving these issues, so I see a very bright future ahead of us.

This might not be the Julia most users are seeing, but if you want a programming language that gives you a massive rabbit hole to explore then this is just a peek at what’s available.

Edit: 10/20/2018

I wanted to add a link to this paper on Zygote.jl which explains how it uses Julia’s dependent compilation process to build and compile a derivative function for other functions, including those in separate packages, using the Julia Abstract Syntax Tree (AST) that’s presented to it at the first call (compilation-time). Unlike approaches like Flux.jl’s Tracker or Python’s Autograd, this does not take a tracing approach and instead autodifferentiates all branches, giving something that can be optimized and compiled once, while directly supporting language features like loops and conditionals. These are the kinds of tools that are available when compilation of functions gets to see the full code below it, with of course the engineering trade-off that a higher level function needs to compile it and many times its internal calls on its first call.

Edit: 12/4/2019

Updated timings and added pure Julia timings. Also, by popular demand I benchmarked torchdiffeq. From the timings, torchdiffeq is 30,000x slower than DifferentialEquations.jl. An investigation of this can be found here, which includes a bunch of other equations like 27-body equations and PDE discretizations to show orders of magnitude performance differences across the board. Please run the scripts yourself and see the difference in action.

Edit: 8/10/2020

A similar demonstration of a 100x performance difference in training neural ODEs is demonstrated here. Additionally, over 1000x performance advantage for Julia over PyTorch in SDEs is demonstrated here.

20 thoughts on “Why Numba and Cython are not substitutes for Julia

  1. Juan

    says:

    How does Julia compare to PyPy or Cython in terms of multithreading?


  2. Alec Kalinin

    says:

    Dear Christopher,

    Below are my explanations how to solve the problem with ODE performance in Python and some philosophical notes.

    1. Introduction

    I am a Python scientific developer. Sometimes I love Python, but sometime I hate Python and try to migrate to another language. Julia is the really great try to change the scientific calculations world! But I am still in Python. And I don’t see alternatives to the Python in the near future. Why?

    2. Problem with the Python performance. ODE in Python.

    At this time there are no performance problems in the Python world. In the article you mentioned the performance degradation due to inter numpy/scipy calls on the level of the slow python interpreter.

    This is the well known problem in the Python world. I think the best solutions is the

    Google Jax
    https://jax.readthedocs.io/en/latest/
    JAX is NumPy on the CPU, GPU, and TPU, with great automatic differentiation for high-performance machine learning research. With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy code. JAX even lets you just-in-time compile your own Python functions into XLA-optimized kernels using a one-function API.

    And here some examples with really high performance:

    Lorentz ODE solver on the cloud TPU (!):
    https://github.com/google/jax/blob/master/cloud_tpu_colabs/README.md
    Here the solution of the ODE performance in Python

    https://cloud.google.com/blog/products/ai-machine-learning/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer
    Google breaks AI performance records in MLPerf with world’s fastest training supercomputer with the JAX and TensorFlow.

    3. Julia vs Python in modern numerical computing world

    Let add some philosophy to the modern scientific calculations. We will think as

    Thesis, antithesis, synthesis
    https://en.wikipedia.org/wiki/Thesis,_antithesis,_synthesis

    [I] Thesis. Python world.
    Python is slow on the interpreter level. The solution in general is the NumPy arrays and vectorized style of the code. You try to prepare big chunk of data as NumPy arrays and call number of the native optimized NumPy procedures.

    But here we meet some problems: a) not always easy to convert code to the NumPy vectorized style and b) you need to call NumPy functions from the python slow interpreter level.

    [II] Antithesis. Julia world.
    We want to keep high level scientific source code with the low level optimizations. To do it we use “function” with type system and multiple dispatch and LLVM compilation as a building blocks for our software code. It is not easy to do, bug Julia can!!! Win!!! Or not???

    [III] Synthesis. Numerical calculation graphs world.
    Yes and no…

    Why yes? Because we can really achieve high performance with the low level iterations and high level source code.

    Why no? Because this solution is not scalable for the modern heterogeneous hardware CPU, GPU, TPU, … This is must be the compiler work (!!!) to take our code and compile it to the CPU, GPU and TPU.

    And to do it we have to escape from the low level iteration style and rise to defining the numerical calculations graph. Yes, like in TensorFlow.

    And for this task Python with its high level and flexibility is the best choice.

    With the Julia it is a bit of temptation to write low level functional style code. Because it works fast!

    With the Python you have to define only numerical computation graphs if you want to achieve the high performance. You have to change you mind to some kind of functional style of programming. But at the end you can run your code with absolutely best performance on the cloud TPU.

    Best,
    Alec Kalinin


    • FWIW, using code from one of the Jax developers (https://twitter.com/shoyer/status/1217615543005413376) we see Julia about 2x faster than Jax here. This was:

      import jax.numpy as jnp
      from jax.experimental.ode import odeint
      import timeit
      
      def f(u, t, sigma, rho, beta):
          x, y, z = u
          return jnp.array([sigma * (y - x),
                            x * (rho - z) - y,
                            x * y - beta * z])
       
      u0 = jnp.array([1.0, 0.0, 0.0])
      tspan = (0., 100.)
      t = jnp.linspace(0, 100, 1001)
      sol = odeint(f, u0, t, 10.0, 28.0, 8/3, rtol=1e-8, atol=1e-8)
      
      def time_func():
          odeint(f, u0, t, 10.0, 28.0, 8/3, rtol=1e-8, atol=1e-8).block_until_ready()
      
      timeit.Timer(time_func).timeit(number=100)/100 # 3.1 ms
      

      so 3.1 ms vs the Julia DP5 of 1.6 ms. So yes, Jax does quite a bit better than the other Python optimizers here, but still quite a ways away from Julia.

      And indeed you can compile Julia code to TPUs (good resource on that: https://arxiv.org/pdf/1810.09868.pdf). That said, it’s not very useful for differential equations because BFloat16 arithmetic is far too lossy for anything that’s not extremely non-stiff.


  3. Maxx

    says:

    This is all really wrong.

    First off you are comparing 100 iterations to 1. Timeit.Timer(time_func).timeit(number=100) gives you the total time not the average time of 100 iterations. Correcting for this you get about 110ms and 51ms using native python and numba jit respectively. That 51ms includes the jit compilation for the first run so you can count it even less if you want to consistently compare to @btime.

    Second, if performance is your goal (I don’t mind waiting an addition 45ms for an ODE generally speaking) you would likely use a more robust package like Scikits.odes.

    Using SciKits.odes I get:
    8.78ms to solve without numba.
    6.15 ms with numba.
    3.5 ms with julia.

    So a true comparison shows Julia is offering a 1.8x performance improvement to python but is still far more limited in package availability, community, etc.

    Here is my python code using SciKits.odes if anyone wants to run their own tests:

    “`
    import numpy as np
    from scikits.odes.odeint import odeint
    import numba

    def f(t, u, udot):
    sigma=10.0
    rho=28.0
    beta=8/3

    udot[0] = sigma * (u[1] – u[0])
    udot[1] = u[0] * (rho – u[2]) – u[1]
    udot[2] = u[0] * u[1] – beta * u[2]

    u0 = [1.0, 0.0, 0.0]
    tspan = (0., 100.)
    numba_f = numba.jit(f,nopython=True)

    #Run once to jit compile
    odeint(numba_f, tspan, u0)

    %timeit odeint(f, tspan, u0)
    %timeit odeint(numba_f, tspan, u0)
    “`


    • Hey,
      Thank you for the suggestions. Indeed, timeit worked differently than I had thought, and the post has been corrected.

      >So a true comparison shows Julia is offering a 1.8x performance improvement to python

      This is an overestimate because the code that you show does not set the tolerances. From the Julia side I can test Sundials (I wasn’t able to get SciKits.odes installed. It seems they don’t support Windows?):

      using Sundials, BenchmarkTools
      function lorenz!(du,u,p,t)
          du[1] = p[1]*(u[2]-u[1])
          du[2] = u[1]*(p[2]-u[3]) - u[2]
          du[3] = u[1]*u[2] - p[3]*u[3]
      end
      u0 = [1.0;0.0;0.0]
      p = [10.0,28.0,8/3]
      tspan = (0.0,100.0)
      prob = ODEProblem(lorenz!,u0,tspan,p)
      @btime solve(prob,CVODE_BDF(),saveat=0.1,reltol=1e-6,abstol=1e-12)
      @btime solve(prob,CVODE_BDF(),saveat=0.1,reltol=1e-8,abstol=1e-8)
      

      which shows that the defaults that odes uses is more than 2x slower than the tolerances used in the tests here. This suggests that your using SciKits.odes with Numba is about 4x slower than using Julia, even after going out of ones way to use a special package and use a JIT compiler! I think that very much highlights the point of the article. If you’re okay with things running slower, that’s fine! But if you’re doing things like parameter estimation, this difference will be a larger real-time code because of how many times the ODE needs to be solved.

      >but is still far more limited in package availability, community, etc.

      In terms of the intangibles like package availability, community, etc., Python definitely has a disadvantage in this area of scientific computing. Notice how you had to go to a underdocumented and feature-deprived package (for example: GPU support? Windows support? Neural network support? BandedBlockBandedMatrix specializations? Sensitivity analysis?) just to get a slower version than the Julia standard. It’s also something where the community isn’t that lively, judging by the total of 6 StackOverflow posts in its history and only one mention ever on SciComp (January 24th, 2020), let alone the lack of tutorials. Compare that to the differential equation talk in Julia for StackOverflow SciComp, along with tutorial libraries and very active chat channels and that’s a very quantitative way to showcase the difference in community activity. So at least in scientific computing in the area of differential equations, I do not see good evidence that Python packages have a very feature-filled offering (in fact, Python seems to lacking in comparison to even Mathematica and Maple) nor a very strong community, whereas other languages like Julia and Mathematica do seem to have a very strong community.


  4. Juan

    says:

    What about memory usage (and garbage collection) when dealing with large datasets?


  5. Tony Xiang

    says:

    Hi,

    I think you are right. According to numba documentation you can only mutate existing array using cfunc, instead of returning new array. And scipy ode does not support a C function call back which mutates the existing array.

    Indeed, in order to bypass the Python interpreter. Both scipy and numba has to support the same format of low level functions. This is the case for integration (as shown in the numba documentation). But this is not universally possible.

    So you made a good point of Julia being able to JIT compile the whole thing in one go.

    Regards,
    Tony Xiang


  6. Royi

    says:

    Chris,

    Could it be that you should use what they do in http://numba.pydata.org/numba-doc/latest/user/cfunc.html#dealing-with-pointers-and-array-memory

    In order to output array in the function?


    • That doesn’t show how to output an array. None of the examples in the documentation show how to output an array from a cfunc. It shows how to modify an input array, which isn’t something I can do with the ODE integrators of SciPy. The function has to match their f(t,u) and output an array format. Please, if it’s so easy to do, just hand me a code and I’ll change the results. At this point it’s been hours and 10’s of people, so if you know how to do it, just give me code. Otherwise this feels like a wild goose chase, where people are pointing to a magical speedup doing something undocumented but won’t actually pony up any working code. It should be like 10 lines, right? Just paste it here and I’ll change the post.


  7. Tony Xiang

    says:

    Hi,

    I have a doubt on the benchmark method you use for the example of ODE.

    It seems that you misused the numba decorator here. Indeed, numba.jit is meant to create a Python extension with the decorated functions. And everytime you call it (in your example through the scipy ode solver), it has to go through e Python intepreter.

    However, numba provides a different decorator to generate a C callback function (pointer) to be used by another compiled caller (scipy ode). It is called numba.cfunc. You can pass the actuall compiled function pointer to the scipy ode. See the example in the numba documentation.
    http://numba.pydata.org/numba-doc/latest/user/cfunc.html

    I think if you re-run the benchmark with the actual cfunc function pointer. The results will be different.

    Regards,
    Tony Xiang


    • Yes, in fact when trying to find someone who could make this work, someone found in the Numba docs:

      >Arrays can be passed in to a function in nopython mode, but not returned. Arrays can only be returned in object mode.

      So if you think it’s possible to accelerate this more, please show me how. I couldn’t find anyone who could, and have spent a few days trying to get this, so at some point I need to give up.


    • Hey, sorry about that! As you know, sometimes Numba can be hard to use correctly, so I double checked with Python experts around MIT and that’s the best solution they could find. I couldn’t find someone who could get Numba to compile the cfunc. Could you help get this cfunc example compiling? Here’s where I am at:

      from numba import cfunc, types, carray
      c_sig = numba.float64[::1](types.CPointer(types.double),
                             types.CPointer(types.double),
                             types.float64, types.float64, types.float64)
      
      @cfunc(c_sig)
      def my_callback(in_, t, sigma, rho, beta):
          u = carray(in_, (3,1))
          x = u[1]
          y = u[2]
          z = u[3]
          return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]
      

      which throws a big error:

      Traceback (most recent call last):
        File "", line 1, in 
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\decorators.py", line 245, in wrapper
          res.compile()
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler_lock.py", line 32, in _acquire_compile_lock
          return func(*args, **kwargs)
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\ccallback.py", line 68, in compile
          cres = self._compile_uncached()
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\ccallback.py", line 81, in _compile_uncached
          cres = self._compiler.compile(sig.args, sig.return_type)
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\dispatcher.py", line 83, in compile
          pipeline_class=self.pipeline_class)
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 955, in compile_extra
          return pipeline.compile_extra(func)
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 377, in compile_extra
          return self._compile_bytecode()
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 886, in _compile_bytecode
          return self._compile_core()
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 873, in _compile_core
          res = pm.run(self.status)
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler_lock.py", line 32, in _acquire_compile_lock
          return func(*args, **kwargs)
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 254, in run
          raise patched_exception
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 245, in run
          stage()
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 501, in stage_nopython_frontend
          self.locals)
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\compiler.py", line 1105, in type_inference_stage
          infer.propagate()
        File "C:\Users\accou\AppData\Local\Programs\Python\Python37\lib\site-packages\numba\typeinfer.py", line 915, in propagate
          raise errors[0]
      numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
      No conversion from list(array(float64, 1d, C)) to array(float64, 1d, C) for '$0.34', defined at None
      
      File "", line 7:
      
      
      [1] During: typing of assignment at  (7)
      
      File "", line 7:
      
      

  8. juan

    says:

    Julia lacks of some important packages: multiple imputation and meta-analysis.


  9. Anon

    says:

    Does it matter that the python function returns a newly allocated Any list vs mutating a numeric array in place?


    • It does a little bit. Probably around 2x. There are a few confounding things in there, but I don’t think it really matters to the underlying explanation in the article which just needed a quick example (that I pulled from a previous blog post on a different topic) to show that integration benchmarks can give very different behavior. There’s a lot more ways you can optimize the DiffEq code there as well, a whole article’s worth probably, but I was just showing the basics.


  10. ZJ

    says:

    What would be a powerful supporting argument is a simple piece of code that Julia can optimise but not by Numba. The example involves ODE and not everyone knows what an ODE is so something that everyone can relate to would be really powerful


Write a Reply or Comment

Your email address will not be published. Required fields are marked *


*

This site uses Akismet to reduce spam. Learn how your comment data is processed.