Title: Learning Epidemic Models That Extrapolate

Speaker: Chris Rackauckas, https://chrisrackauckas.com/

Abstract:

Modern techniques of machine learning are uncanny in their ability to automatically learn predictive models directly from data. However, they do not tend to work beyond their original training dataset. Mechanistic models utilize characteristics of the problem to ensure accurate qualitative extrapolation but can lack in predictive power. How can we build techniques which integrate the best of both approaches? In this talk we will discuss the body of work around universal differential equations, a technique which mixes traditional differential equation modeling with machine learning for accurate extrapolation from small data. We will showcase how incorporating different variations of the technique, such as Bayesian symbolic regression and optimizing the choice of architectures, can lead to the recovery of predictive epidemic models in a robust way. The numerical difficulties of learning potentially stiff and chaotic models will highlight how most of the adjoint techniques used throughout machine learning are inappropriate for learning scientific models, and techniques which mitigate these numerical ills will be demonstrated. We end by showing how these improved stability techniques have been automated and optimized by the software of the SciML organization, allowing practitioners to quickly scale these techniques to real-world applications.

See more on: https://ai4pandemics.org/

The post Learning Epidemic Models That Extrapolate, AI4Pandemics appeared first on Stochastic Lifestyle.

First of all, let's come up with a concrete idea of what a quasi-static algorithm is. It's the space of algorithms which can in some way be re-expressed as a static algorithm. Think of a "static algorithm" as one which has a simple mathematical description that does not require a full computer description, i.e. no loops, no rewriting to memory, etc. As a starting point, let's take a look at an example from the Jax documentation. The following is something that the Jax JIT works on:

```python
from jax import jit

@jit
def f(x):
    for i in range(3):
        x = 2 * x
    return x

print(f(3))
```

Notice that it's represented by something with control flow, i.e. it is code with a loop, but *the loop is not necessary*: we can also understand this computation as 2\*2\*2\*x, or simply 8\*x. The demonstrated example of where the JIT will fail by default is:

```python
from jax import jit

@jit
def f(x):
    if x < 3:
        return 3. * x ** 2
    else:
        return -4 * x

# This will fail!
try:
    f(2)
except Exception as e:
    print("Exception {}".format(e))
```

In this case, we can see that there are essentially two compute graphs split at x < 3, and so as stated this does not have a single mathematical statement that describes the computation. You can get around this by doing lax.cond(x < 3, 3. * x ** 2, -4 * x), but notice this is a fundamentally different computation: the lax.cond form always computes both sides of the if statement before choosing which one to carry forward, while the true if statement changes its computation based on the conditional. The reason the lax.cond form works with Jax's JIT compilation system is thus that it is quasi-static: the computations that will occur are fixed, even if the result is not, while the original if statement changes what is computed based on the input values. This limitation exists because Jax traces through a program to attempt to build the static compute graph under the hood, and it then attempts to do its actual transformations on this graph.

Are there other kinds of frameworks that do something similar? It turns out that the set of algorithms which are transformable into purely symbolic languages is exactly the set of quasi-static algorithms, so something like Symbolics.jl also has a form of quasi-staticness manifest in the behaviors of its algorithms. And it's for the same reason: in symbolic algorithms you define symbolic variables like "x" and "y", and then trace through a program to build a static compute graph for "2x^2 + 3y" which you then treat symbolically. In the frequently asked questions, there is a question about what happens when a conversion of a function to symbolic form fails. Take a look at the example:
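To make the tracing idea concrete, here is a minimal sketch in plain Python (my own toy, nothing like Jax's actual implementation) of how tracing through the earlier loop example recovers a static expression, and why a data-dependent branch breaks it:

```python
class Tracer:
    """Records operations symbolically instead of computing with numbers."""
    def __init__(self, expr):
        self.expr = expr

    def __rmul__(self, c):
        # `2 * x` on a traced value just builds a bigger expression.
        return Tracer(f"({c} * {self.expr})")

    def __lt__(self, other):
        # Branching needs a concrete boolean, which a traced value cannot
        # provide: this mirrors Jax's error on `if x < 3:` inside @jit.
        raise TypeError("cannot branch on a traced (abstract) value")

def f(x):
    for i in range(3):  # the loop bound is static, so tracing unrolls it
        x = 2 * x
    return x

print(f(Tracer("x")).expr)  # -> (2 * (2 * (2 * x)))
```

The loop disappears entirely from the traced result, which is exactly what makes the original example quasi-static, while any comparison against a traced value raises immediately.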

```julia
using Symbolics

function factorial(x)
    out = x
    while x > 1
        x -= 1
        out *= x
    end
    out
end

@variables x
factorial(x)
```

You can see that the reason for this is that the algorithm is not representable as a single mathematical expression: the factorial cannot be written as a fixed number of multiplications because the number of multiplications depends on the very value x you're trying to compute x! for! The error that the symbolic language throws is "ERROR: TypeError: non-boolean (Num) used in boolean context", which is saying that it does not know how to symbolically expand out "while x > 1" to be able to represent it statically. And this is not necessarily "fixable": it's fundamental to the fact that this algorithm cannot be represented by a fixed computation and necessarily needs to change the computation based on the input.
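The value dependence is easy to see by counting the work. Here is the same loop instrumented to count its multiplications (in Python, purely for illustration):

```python
def factorial_with_count(x):
    # Same factorial loop as above, but counting multiplications performed.
    out, muls = x, 0
    while x > 1:
        x -= 1
        out *= x
        muls += 1
    return out, muls

# The number of multiplications is data-dependent, so no fixed expression
# with a fixed number of multiply nodes can represent this algorithm:
print(factorial_with_count(5))   # -> (120, 4)
print(factorial_with_count(10))  # -> (3628800, 9)
```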

The "solution" is to define a new primitive to the graph via "@register factorial(x)", so that this function itself is a fixed node that does not try to be symbolically expanded. This is the same concept as defining a Jax primitive or a Tensorflow primitive, where an algorithm simply is not quasi-static and so the way to get a quasi-static compute graph is to treat the dynamic block just as a function "y = f(x)" that is preordained to exist. In the context of both symbolic languages and machine learning frameworks, for this to work in full you also need to define derivatives of said function. That last part is the catch. If you take another look at the depths of the documentation of some of these tools, you'll notice that many of these primitives representing non-static control flow fall outside of the realm that is fully handled.

Right there in the documentation it notes that you can replace a while loop with lax.while_loop, but that is not amenable to reverse-mode automatic differentiation. The reason is that its reverse-mode AD implementation assumes that such a quasi-static algorithm exists and uses this for two purposes: first for generating the backpass, and second for generating the XLA ("Tensorflow") description of the algorithm to then JIT compile and optimize. XLA wants the static compute graph, which, again, does not necessarily exist for this case, hence the fundamental limitation. The way to get around this, of course, is to define your own primitive with its own fast gradient calculation, and this problem goes away...

Or does it?

There are machine learning frameworks which do not make the assumption of quasi-staticness but still optimize, and most of these, like Diffractor.jl, Zygote.jl, and Enzyme.jl, are in the Julia programming language (note PyTorch does not assume quasi-static representations, though TorchScript's JIT compilation does). This got me thinking: are there actual machine learning algorithms for which this is a real limitation? This is a good question, because if you pull up your standard methods, a convolutional neural network is a fixed-function kernel call with a good derivative defined, and a recurrent neural network is a fixed-size for loop. If you want to break this assumption, you have to go to a space where an algorithm fundamentally cannot know "the amount of computation" until it knows the specific values in the problem, and equation solvers are of this form.

How many steps does it take for Newton's method to converge? How many steps does an adaptive ODE solver take? These are not questions that can be answered a priori: they are fundamentally questions which require knowing:

- What equation are we solving?
- What is the initial condition?
- Over what time span?
- With what solver tolerance?
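As a concrete illustration of that data dependence, here is a small Newton's method sketch in plain Python (my own toy, not any framework's implementation): the same equation and tolerance, but different initial conditions, produce different amounts of computation.

```python
def newton(f, fprime, x0, tol=1e-10, maxiter=100):
    # The iteration count depends on the equation, the initial condition,
    # and the tolerance: it cannot be known before running.
    x, steps = x0, 0
    while abs(f(x)) > tol and steps < maxiter:
        x -= f(x) / fprime(x)
        steps += 1
    return x, steps

# Solving x^2 - 2 = 0 from two different starting points:
root1, steps1 = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
root2, steps2 = newton(lambda x: x * x - 2, lambda x: 2 * x, 100.0)
print(steps1, steps2)  # the far-away start takes more iterations
```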

For this reason, people who work in Python frameworks have been looking for the "right" way to treat equation solving (ODE solving, finding roots f(x)=0, etc.) as a blackbox representation. If you take another look at the Neural Ordinary Differential Equations paper, one of the big things it was proposing was the treatment of neural ODEs as a blackbox with a derivative defined by the ODE adjoint. The reason, of course, is that adaptive ODE solvers necessarily iterate to tolerance, so there is necessarily something like "while t < tend" which depends on whether the current computations have reached tolerance. Since that kind of loop is not optimized in the frameworks they were working in, the blackbox treatment was required to make the algorithm work.

But no, it's not fundamental to have to treat such algorithms as a blackbox. In fact, we had a rather popular paper a few years ago showing that neural stochastic differential equations can be trained with forward and reverse mode automatic differentiation directly via some Julia AD tools. The reason is that these AD tools (Zygote, Diffractor, Enzyme, etc.) do not necessarily assume quasi-static forms due to how they do direct source-to-source transformations, and so they can differentiate the adaptive solvers directly and spit out the correct gradients. So you do not necessarily have to do it in the "define a Tensorflow op" style, but which is better?

It turns out that "better" can be really hard to define because the two algorithms are not necessarily the same and can compute different values. You can boil this down to: do you want to differentiate the solver of the equation, or do you want to differentiate the equation and apply a solver to that? The former, which is equivalent to automatic differentiation of the algorithm, is known as discrete sensitivity analysis or discrete-then-optimize. The latter is continuous sensitivity analysis or optimize-then-discretize approaches. Machine learning is not the first field to come up against this problem, so the paper on universal differential equations and the scientific machine learning ecosystem has a rather long description that I will quote:
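To make the distinction concrete, here is a toy forward-sensitivity sketch on the linear ODE du/dt = -θu (my own illustration, not from the paper). For fixed-step Euler the two approaches happen to produce the identical recursion, which is exactly why the trade-offs above only show up with adaptive, implicit, or otherwise more sophisticated solvers:

```python
import math

# Toy problem: du/dt = -theta*u, u(0) = 1, loss L = u(T).
# Exact: u(T) = exp(-theta*T), so dL/dtheta = -T*exp(-theta*T).
def euler_with_sensitivity(theta, T=1.0, n=1000):
    h = T / n
    u, s = 1.0, 0.0  # s approximates du/dtheta, with s(0) = 0
    for _ in range(n):
        # "Differentiate the solver": differentiating the Euler update
        # u += h*(-theta*u) w.r.t. theta gives s += h*(-u - theta*s).
        # "Differentiate the equation": the continuous sensitivity ODE is
        # ds/dt = -u - theta*s, and Euler on it gives the same recursion,
        # so for this fixed-step linear example the two coincide.
        u, s = u + h * (-theta * u), s + h * (-u - theta * s)
    return u, s

theta = 0.7
u, s = euler_with_sensitivity(theta)
print(s, -math.exp(-theta))  # gradient estimate vs exact value
```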

"""

Previous research has shown that the discrete adjoint approach is more stable than continuous adjoints in some cases [41, 37, 42, 43, 44, 45] while continuous adjoints have been demonstrated to be more stable in others [46, 43] and can reduce spurious oscillations [47, 48, 49]. This trade-off between discrete and continuous adjoint approaches has been demonstrated on some equations as a trade-off between stability and computational efficiency [50, 51, 52, 53, 54, 55, 56, 57, 58]. Care has to be taken as the stability of an adjoint approach can be dependent on the chosen discretization method [59, 60, 61, 62, 63], and our software contribution helps researchers switch between all of these optimization approaches in combination with hundreds of differential equation solver methods with a single line of code change.

"""

Or, tl;dr: there's tons of prior research which generally shows that continuous adjoints are less stable than discrete adjoints, but they can be faster. We have done recent follow-ups which show these claims hold on modern problems with modern software. Specifically, this paper on stiff neural ODEs shows why discrete adjoints are more stable than continuous adjoints when training on multiscale data, but we also recently showed continuous adjoints can be much faster at gradient computations than (some) current AD techniques for discrete adjoints.

So okay, there's a true benefit to using discrete adjoint techniques if you're handling these hard stiff differential equations, differentiating partial differential equations, etc. and this has been known since the 80's in the field of control theory. But other than that, it's a wash, and so it's not clear whether differentiating such algorithms is better in machine learning, right?

This now brings us to how the recent ICML paper fits into this narrative. Is there a non-quasi-static algorithm that is truly useful for standard machine learning? The answer turns out to be yes, but getting there requires a few slick tricks. First, the setup. Neural ODEs can be an interesting method for machine learning because they use an adaptive ODE solver to essentially choose the number of layers for you, so it's like a recurrent neural network (or more specifically, like a residual neural network) that automatically finds the "correct" number of layers, where the number of layers is the number of steps the ODE solver decides to take. In other words, neural ODEs for image processing are an algorithm that automatically does hyperparameter optimization. Neat!
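Here is a hypothetical miniature of that idea (a crude adaptive Euler/Heun pair in plain Python, not an actual neural ODE solver): the number of accepted steps, i.e. the "depth", is decided by the dynamics, not by the programmer.

```python
def adaptive_depth(f, x0, t0=0.0, t1=1.0, h0=0.1, rtol=1e-4):
    # Crude adaptive integrator with an embedded Euler/Heun pair.
    t, x, h, steps = t0, x0, h0, 0
    while t < t1:
        h = min(h, t1 - t)
        k1 = f(t, x)
        k2 = f(t + h, x + h * k1)
        err = abs(h * (k2 - k1) / 2)        # Heun minus Euler estimate
        if err <= rtol * max(1.0, abs(x)):  # accept the step
            x, t, steps = x + h * (k1 + k2) / 2, t + h, steps + 1
            h *= 1.5
        else:                               # reject and shrink the step
            h /= 2
    return steps

# Faster dynamics force more steps, i.e. a deeper "network":
print(adaptive_depth(lambda t, x: -x, 1.0))
print(adaptive_depth(lambda t, x: -20 * x, 1.0))
```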

But... what is the "correct" number of layers? For hyperparameter optimization you'd assume that would be "the least number of layers needed to make predictions accurately". However, by default neural ODEs will not give you that number of layers: they will give you whatever they feel like. In fact, if you look at the original neural ODE paper, as the neural ODE trains it keeps increasing the number of layers it uses (see the figure in that paper).

So is there a way to change the neural ODE to make it define the "correct number of layers" as the "least number of layers"? In the work Learning Differential Equations that are Easy to Solve, they did just that, by regularizing the training process of the neural ODE. They looked at the solution and noted that ODEs which have more changes going on are necessarily harder to solve, so you can transform the training process into hyperparameter optimization by adding a regularization term that says "make the higher order derivative terms as small as possible". The rest of the paper is how to enact this idea. How was that done? Well, if you have to treat the algorithm as a blackbox, you need some blackbox way of defining higher order derivatives, which leads to Jesse's pretty cool formulation of Taylor-mode automatic differentiation. But no matter how you put it, that's going to be an expensive object to compute: computing the gradient is more expensive than the forward pass, the second derivative more so than the gradient, the third more still, and so an algorithm that wants 6th derivatives is going to be nasty to train. With some pretty heroic work they got a formulation of this blackbox operation which takes twice as long to train but successfully does the hyperparameter optimization.

End of story? Far from it!

Is there a way to make automatic hyperparameter optimization via neural ODEs train faster? Yes, and our paper makes them not only train faster than that other method, but makes it train faster than the vanilla neural ODE. We can make layer hyperparameter optimization less than free: we can make it cheaper than not doing the optimization! But how? The trick is to open the blackbox. Let me show you what a step of the adaptive ODE solver looks like:
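The original post shows the solver internals at this point; as a stand-in, here is a minimal sketch of one embedded-pair step (a simple Euler-inside-Heun pair, my own simplification, not the actual method used in the paper):

```python
def adaptive_ode_step(f, t, x, h, rtol=1e-2):
    k1 = f(t, x)
    k2 = f(t + h, x + h * k1)
    x_high = x + h * (k1 + k2) / 2  # 2nd-order (Heun) result
    x_low = x + h * k1              # 1st-order (Euler) result, embedded
    err = abs(x_high - x_low)       # local error estimate, a free byproduct
    accept = err <= rtol * max(abs(x), abs(x_high))
    return x_high, err, accept

# One step of du/dt = -u from u = 1 with h = 0.1:
x_new, err, accept = adaptive_ode_step(lambda t, u: -u, 0.0, 1.0, 0.1)
print(x_new, err, accept)
```

The key point is the `err` line: the step's error estimate costs nothing extra, because both results are built from stage values the solver already computed.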

Notice that the adaptive ODE solver chooses whether a time step is appropriate by using an error estimate. **The ODE algorithm is actually constructed so that the error estimate, the estimate of "how hard this ODE is to solve", is computed for free**. What if we use this free error estimate as our regularization technique? It turns out that is 10x faster to train than before, while similarly automatically performing hyperparameter optimization.

Notice where we have ended up: the resulting algorithm is necessarily not quasi-static. This error estimate is computed by the actual steps of the adaptive ODE solver: to compute this error estimate, you have to do the same computations, the same while loop, as the ODE solver. In this algorithm, you cannot avoid directly differentiating the ODE solver because pieces of the ODE solver's internal calculations are now part of the regularization. This is something that is fundamentally not optimized by methods that require quasi-static compute graphs (Jax, Tensorflow, etc.), and it is something that makes hyperparameter optimization cheaper than not doing hyperparameter optimization since the regularizer is computed for free. I just find this result so cool!

So yes, the paper is a machine learning paper on how to do hyperparameter optimization for free using a trick on neural ODEs, but I think the general software context this sits in highlights the true finding of the paper. This is the first algorithm that I know of where there is both a clear incentive for it to be used in modern machine learning, but also, there is a fundamental reason why common machine learning frameworks like Jax and Tensorflow will not be able to treat them optimally. Even PyTorch's TorchScript will fundamentally, due to the assumptions of its compilation process, not work on this algorithm. Those assumptions were smartly chosen because most algorithms can satisfy them, but this one cannot. Does this mean machine learning is algorithmically stuck in a rut? Possibly, because I thoroughly believe that someone working within a toolset that does not optimize this algorithm would have never found it, which makes it very thought-provoking to me.

What other algorithms are out there which are simply better than our current approaches but are worse only because of the current machine learning frameworks? I cannot wait until Diffractor.jl's release to start probing this question deeper.

The post Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow appeared first on Stochastic Lifestyle.

Everything in the SciML organization is built around a principle of confederated modular development: let other packages influence the capabilities of your own. This is highlighted in a paper about the package structure of DifferentialEquations.jl. The underlying principle is that not everyone wants or needs to be a developer of the package, but still may want to contribute. For example, it's not uncommon that a researcher in ODE solvers wants to build a package that adds one solver to the SciML ecosystem. They can do this in their own package for their own academic credit, but with the free bonus that it now exists in the multiple dispatch world. In the design of DifferentialEquations.jl, solve(prob,IRKGL16()) now exists because of their package, and so we add it to the documentation. Some of this work is not even inside the organization, but we still support it. The philosophy is to include every researcher as a budding artist in the space of computational research, including all of the possible methods, and building an infrastructure that promotes a free research atmosphere in the methods. Top level defaults and documentation may lead people to the most stable aspects of the ecosystem, but with a flip of a switch you can be testing out the latest research.

The Modelica ecosystem (open standard, OpenModelica, multiple commercial implementations), which started from the simple idea of equation-oriented modeling, has had a huge impact on industry and solved lots of difficult real industrial problems. But the modern simulation system designer wants much more from their language and compiler stack. For example, in the Modelica language there is no reference to what transformations are being done to your models in order to make them "simulatable". People know about the Pantelides algorithm and "singularity elimination", but this is outside the language. It's something that the compiler may give you a few options for, but not something the user or the code actively interacts with. Every compiler is different, advances in one compiler do not help your model when you use another compiler, and the whole world is siloed. By this design, it is extremely difficult for an external user to write compiler passes in Modelica which affect this model lowering process. You can tweak knobs, write a new compiler, or fork OpenModelica and hack on the whole compiler just to make the change you wanted. The barrier to entry can be significantly lowered, as the Julia compiler ecosystem has shown.

I think in many cases the set of symbolic transformations in Modelica may not be sufficient, and budding system designers might want to write their own. For example, for SDEs there's a Lamperti transformation which transforms general SDEs into SDEs with additive noise. It doesn't always apply, but when it does it can greatly enhance solver speed and stability. This is niche enough that it may never be in a commercial Modelica compiler (in fact, Modelica doesn't have SDEs at this moment), but it's something that some user might want to be able to add to the process.

So the starting goal of ModelingToolkit is to give an open and modular transformation system on which a whole modeling ecosystem can thrive. My previous blog post exemplified how unfamiliar use cases for code transformations can solve many difficult mathematical problems, and my goal is to give this power to the whole development community. `structural_simplify` is something built into ModelingToolkit to do "the standard transformations" on the standard systems, but the world of transformations is so much larger. Log-transforming a few variables? Exponentiating a few to ensure positivity? Lamperti transforms of SDEs? Transforming to the sensitivity equations? And not just transformations, but functionality for inspecting and analyzing models. Are the equations linear? Which parameters are structurally identifiable?
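As a flavor of what one such transformation does mathematically, here is a numeric stand-in in Python for the log-transform case (illustrative only; in ModelingToolkit this would be a symbolic pass on the equations, and `log_transform_rhs` is a hypothetical name, not an MTK function):

```python
import math

def log_transform_rhs(f):
    # Given the right-hand side f of du/dt = f(u), return the RHS of the
    # system in v = log(u): dv/dt = u'/u = f(exp(v)) / exp(v).
    # The transformed state can never produce a negative u, since u = exp(v).
    def g(v):
        u = math.exp(v)
        return f(u) / u
    return g

# Example: logistic growth du/dt = u*(1 - u)
g = log_transform_rhs(lambda u: u * (1 - u))
print(g(0.0))  # at v = 0 (i.e. u = 1): f(1)/1 = 0.0
```

A symbolic system automates exactly this kind of rewrite, but on the equations themselves rather than on numeric closures, so solvers and analysis passes can see the transformed structure.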

ModelingToolkit is a deconstruction of what a modeling language is. It pulls it down to its component pieces and then makes it easy to build new modeling languages like Catalyst.jl which internally use ModelingToolkit for all of the difficult transformations. The deconstructed form is a jumping point for building new domain-based languages, along with new transformations which optimize the compiler for specific models. And then in the end, everybody who builds off of it gets improved stability, performance, and parallelism as the core MTK passes improve.

Now there are two major aspects that need to be handled to fully achieve such a vision, though. If you want people to be able to reuse code between transformations, you need to expose how you are changing code. To achieve this goal, a new Computer Algebra System (CAS), Symbolics.jl, was created for ModelingToolkit.jl. The idea is that if we want everyone writing code transformations, they should all have easy access to a general mathematical toolset for doing such code transformations. We shouldn't have everyone building new code for differentiation, simplification, and substitution. And we shouldn't have everyone relying on undocumented internals of ModelingToolkit.jl either: this should be something that is open, well-tested, documented, and well-known, so that everyone can easily become a "ModelingToolkit compiler developer". By building a CAS and making it a Julia standard, we can bridge that developer gap, because now everyone knows how to easily manipulate models: they are just Symbolics.jl expressions.
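To see why a shared symbolic layer matters, here is a toy sketch of the three operations in question (differentiate, substitute, evaluate) on a deliberately tiny expression type in Python, nothing like Symbolics.jl's actual representation:

```python
# Expression trees as nested tuples:
# ("sym", name), ("const", c), ("add", a, b), ("mul", a, b).
def diff(e, x):
    tag = e[0]
    if tag == "sym":
        return ("const", 1 if e[1] == x else 0)
    if tag == "const":
        return ("const", 0)
    if tag == "add":
        return ("add", diff(e[1], x), diff(e[2], x))
    if tag == "mul":  # product rule
        return ("add", ("mul", diff(e[1], x), e[2]),
                       ("mul", e[1], diff(e[2], x)))

def subs(e, x, val):
    tag = e[0]
    if tag == "sym":
        return ("const", val) if e[1] == x else e
    if tag == "const":
        return e
    return (tag, subs(e[1], x, val), subs(e[2], x, val))

def evalc(e):
    tag = e[0]
    if tag == "const":
        return e[1]
    if tag == "add":
        return evalc(e[1]) + evalc(e[2])
    if tag == "mul":
        return evalc(e[1]) * evalc(e[2])

# d/dx (x * x) at x = 3  ->  2x = 6
x = ("sym", "x")
expr = ("mul", x, x)
print(evalc(subs(diff(expr, "x"), "x", 3)))  # -> 6
```

Every transformation author needs these same few operations; the point of a shared, documented CAS is that nobody has to rebuild this layer, and everyone's passes speak the same expression type.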

The second major aspect is to achieve a natural embedding into the host language. Modelica is not a language in which people can write compiler passes, which introduces a major gap between the modeler and the developer of extensions to the modeling language. If we want to bridge this gap, we need to ensure the whole modeling language is used from a host which is a complete imperative programming language. And you need to do so in a language that is interactive, high performance, and has a well-developed ecosystem for modeling and simulation. Martin and Hilding had seen this fact as the synthesis for Modia with how Julia uniquely satisfies this need, but I think we can take it a step further. To really make the embedding natural, you should be able to on the fly automatically convert code to and from the symbolic form. In the previous blog post I showcased how ModelingToolkit.jl could improve people's code by automatically parallelizing it and performing index reduction even if the code was not written in ModelingToolkit.jl. This grows the developer audience of the transformation language from "anyone who wants to transform models" to "anyone who wants to automate improving models and general code". This expansion of the audience is thus pulling in developers who are interested in things like automating parallelism and GPU codegen and bringing them into the MTK developer community.

In turn, since all of these advances then apply to the MTK internals and code generation tools such as Symbolics.jl's build_function, new features are coming all of the time because of how the community is composed. The CTarget build_function was first created to transpile Julia code to C, and thus ModelingToolkit models can generate C outputs for compiling into embedded systems. This is serendipity when seeing one example, but it's design when you notice that this is how the entire system is growing so fast.

Now, one of the questions we received early on was: won't you be unable to match the performance of a specialized compiler which was only made to work on Modelica? While at face value it may seem like hyperspecialization could be beneficial, the true effect of hyperspecialization is that algorithms are simply less efficient because less work has been put into them. Symbolics.jl has become a phenomenon of its own, with multiple hundred-comment threads digging through many aspects of the pros and cons of its designs, and that's not even including the 200-person chat channel which has had tens of thousands of messages in the less than 2 months since the CAS was released. Tons of people are advising how to improve every single plus and multiply operation.

So it shouldn't be a surprise that there are many details that have quickly been added which are still years away from a Modelica implementation. It automatically multithreads tree traversals and rewrite rules. It automatically generates fast parallelized code, and can do so in a way that composes with tearing of nonlinear equations. It lets users define their own high-performance and parallelized functions, register them, and stick them into the right-hand side. And that is even excluding the higher level results, like the fact that it is fully differentiable, thus allowing the training of neural networks decomposed within the models and the automatic discovery of equations from data.

Just at the very basic level we can see that the CAS is transforming the workflows of scientists and engineers in many aspects of the modeling process. By distributing the work of improving symbolic computing, we have already taken examples which were essentially unobtainable and made them instant with Symbolics.jl.

We are building out a full benchmarking system for the symbolic ecosystem to track performance over time and ensure it reaches the top level. It's integrating pieces from the OSCAR project, getting lots of people tracking performance in their own work, and building a community. Each step is another major improvement, and this ecosystem is making these steps fast.

A natural question is how this interacts with the Modelica ecosystem: there are a lot of models already written in Modelica, and it would be a shame for us to not be able to connect with that ecosystem. I will hint that there is tooling coming as part of JuliaSim for connecting to many pre-existing model libraries. In addition, we hope that making use of tooling like Modia.jl and TinyModia.jl, and collaboration with the Modelica community, will help us build a bridge.

The composability and distributed development nature of ModelingToolkit.jl is its catalyst. This is why ModelingToolkit.jl looks like it has rocket shoes on: it is fast and it is moving fast. And it's because of the thought put into the design. It's because ModelingToolkit.jl is including the entire research community as its asset instead of just its user. I plan to keep moving forward from here, looking back to learn from the greats, but building it in our own image. We're taking the idea of a modeling language, distributing it throughout one of the most active developer communities in modeling and simulation, in a language which is made to build fast and parallelized code. And you're invited.

I'm just going to post a self-explanatory recent talk by Jonathan at the NASA Launch Services Program, who saw a 15,000x acceleration by moving from Simulink to ModelingToolkit.jl.

Enough said. While we don’t expect every application will see this kind of speedup (although we wish they will!), we would love to hear about your experiences with ModelingToolkit.

Christopher Rackauckas, ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia, The Winnower 8:e162133.39054 (2021). DOI: 10.15200/winn.162133.39054

This post is open to read and review on The Winnower.

The post ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia appeared first on Stochastic Lifestyle.

The post Generalizing Automatic Differentiation to Automatic Sparsity, Uncertainty, Stability, and Parallelism appeared first on Stochastic Lifestyle.

What I want to dig into in this blog post is a simple question: what is the trick behind automatic differentiation, why is it always differentiation, and are there other mathematical problems we can be focusing this trick towards? While very technical discussions on this can be found in our recent paper titled "ModelingToolkit: A Composable Graph Transformation System For Equation-Based Modeling" and descriptions of methods like intrusive uncertainty quantification, I want to give a high-level overview that really describes some of the intuition behind the technical thoughts. Let's dive in!

To understand automatic differentiation in practice, you need to understand that it's at its core a code transformation process. While mathematically it comes down to being about Jacobian-vector products and Jacobian-transpose-vector products for forward and reverse mode respectively, I think sometimes that mathematical treatment glosses over the practical point that it's really about code.

Take for example ``sin(x)``. If we want to take the derivative of this, then we could compute a numerical approximation like ``(sin(x + ε) - sin(x)) / ε`` for a small ``ε``, but this misses the information that we actually know analytically how to define the derivative! Using the principle that algorithm efficiency comes from problem information, we can improve this process by directly embedding that analytical solution into our process. So we come to the first principle of automatic differentiation:

If you know the analytical solution to the derivative, then replace the function with its derivative

So if you see sin(x) and someone calls ``derivative(f,x)``, you can do a quick little lookup to a table of rules, known as primitives, and if it's in your table then boom, you're done. Swap it in, call it a day.

This already shows you that, with automatic differentiation, we cannot think of sin(x) as just a function, just a thing that takes in values, but we have to know something about what it means semantically. We have to look at it and identify "this is sin" in order to know "replace it with cos". This is the fundamental limitation of automatic differentiation: it has to know something about your code, more information than it takes to call or run your code. This is why many automatic differentiation libraries are tied to specific implementations of underlying numerical primitives. PyTorch understands ``torch.sin`` as sin, but it does not understand ``tf.sin`` as sin, which is why if you place a TensorFlow function into a PyTorch training loop you will get an error thrown about the derivative calculation. This semantic mapping is the reason for libraries like ChainRules.jl, which define semantic mappings for the Julia Base library and allow extensions: by directly knowing this mapping on all of standard Julia Base, you can cover the language and achieve "differentiable programming", i.e. all programs automatically can get derivatives.

But we're not done. Let's say we have f(g(x)). The answer is not to add this new function to the table by deriving it by hand: instead we have to come up with a way to generate a derivative code whenever f and g are in our lookup table. The answer comes from the chain rule. I'm going to describe the forward application of the chain rule as it's a bit simpler to derive, but a full derivation of how this is done in the reverse form is described in these lecture notes. The chain rule tells us that (f(g(x)))' = f'(g(x)) g'(x). Thus in order to calculate the derivative of f(g(x)), we need to know two things: g'(x) and f' evaluated at g(x). If we calculate both the value and the derivative at every stage of our code, it doesn't matter how deep the composition goes, we will have all of the information that is required to reconstruct the result of the chain rule.

What this means is that automatic differentiation on this function can be thought of as the following translation process:

- Transform g to the tuple (g(x), g'(x)) and evaluate at x
- Transform f to the tuple (f(u), f'(u)) and evaluate at u = g(x)
- Transform f(g(x)) to (f(g(x)), f'(g(x)) g'(x)). Now the second portion of the tuple is the solution to the derivative

This translation process, "transform every primitive function into a tuple of (function, derivative), and transform every other function into a chain rule application using the two pieces", is **non-standard interpretation**. This is the process where an interpreter of a code or language runs under different semantics. An interpreter written to do this process acts on the same code but interprets it differently: it changes each operation f(x) into a tuple of the solution and its derivative, (f(x), f'(x)), instead of just the solution f(x).
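To make this concrete, here is a minimal operator-overloading sketch of such an interpreter. This is illustrative Python, not any particular library's API; the names ``Dual``, ``sin``, and ``derivative`` are mine:

```python
import math

class Dual:
    """A (value, derivative) pair: the non-standard "number" type."""
    def __init__(self, val, der):
        self.val, self.der = val, der
    def __add__(self, other):
        return Dual(self.val + other.val, self.der + other.der)
    def __mul__(self, other):
        # product rule: (f*g)' = f'*g + f*g'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

def sin(x):
    # primitive rule from the lookup table: sin -> cos, chained with x's derivative
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)

def derivative(f, x):
    return f(Dual(x, 1.0)).der  # seed dx/dx = 1

# d/dx sin(x*x) at x = 2 is 2*2*cos(4)
print(derivative(lambda x: sin(x * x), 2.0))
```

Running the user's function on ``Dual`` inputs instead of floats is exactly the "different semantics": the same code, reinterpreted.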

Thus the non-standard interpretation version of the problem of calculating derivatives is to reimagine the problem as "at this step of the code, how should I be transforming it so that I have the information to calculate derivatives?" There are many ways to perform this non-standard interpretation: operator overloading, prior static analysis that generates a new source code, etc. But there's one question we should bring up: what is special about differentiation that makes this trick work?

One way to start digging into this question is to answer a related question people pose to me often: if we have automatic differentiation, why do we not have automatic integration? While at face value it seems like the two should be analogues, digging deeper exposes what's special about differentiation. If we wanted to do the integral of sin(x), then yes, we can replace this with -cos(x). The heart of the question is to ask about the chain rule: what's the integral of f(g(x))? It turns out that there is no general rule for an "anti-chain rule". A commonly known result is that the standard Gaussian probability density, e^(-x^2), does not have an antiderivative that can be written with elementary functions, and that's just the composition of f(x) = e^x with g(x) = -x^2. While that is true, I don't think that captures all that is different about integrals.

When I said "we can replace this with -cos(x)" I was actually wrong: the antiderivative of sin(x) is not -cos(x), it's -cos(x) + C. There is no unique solution without imposing some external context or some global information like "and F(0) = 0". Differentiation is special because it's purely local: knowing f in an arbitrarily small neighborhood of x is enough to know the derivative of f at x. Integration is a well-known example of a non-local operation in mathematics: in order to know the antiderivative at a value x, you might need to know information at some other value y, and sometimes it's not necessarily obvious what that value should even be. This nonlocality manifests in other ways as well: while e^(-x^2) is not integrable in elementary functions, 2x e^(-x^2) is easy to solve via a u-substitution, making u = -x^2 and cancelling out the x in front of the exponential. So there is no chain rule not because some things don't have an antiderivative, but because of nonlocality: f(g(x)) can be non-integrable while g'(x) f(g(x)) is. There is no chain rule because you can't look at small pieces and transform them; instead you have to look at the problem holistically.

But this gives us a framework in order to judge whether a mathematical problem is amenable to being solved in the framework of non-standard interpretation: it must be local so that we can define a step-by-step transformation algorithm, or we need to include/impose some form of context if we have alternative information.

Let's look at a few related problems that can be solved with this trick.

Recall that the Jacobian is the matrix of partial derivatives, i.e. for y = f(x) where x and y are vectors, it's the matrix of terms ∂f_i/∂x_j. This matrix shows up in tons of mathematical algorithms, and in many cases it's sparse, so it's a common problem to try to compute the sparsity pattern of a Jacobian. But what does this sparsity pattern mean? If you write out the analytical solution to f, a zero in the Jacobian at (i,j) means that f_i is not a function of x_j. In other words, x_j has no influence on the output y_i. For an arbitrary program f, can we use non-standard interpretation to calculate whether x_j influences y_i?

It turns out that if we make this question a little simpler then it has a simple solution. Let's instead ask: can we use non-standard interpretation to calculate whether x_j *can* influence y_i? The reason for this change is that the previous question was non-local: an output like x + y - y is programmatically dependent on y, but mathematically the y terms cancel, something you could only notice with good global knowledge of the program. So "does it influence this output" is hard, but "can it influence the output" is easy. "Can influence" is just the question of "does x_j show up in the calculation of y_i at all?"

So we can come up with a non-standard interpretation formulation to solve this problem. Instead of computing values, we can compute "influencer sets". The output of x is influenced by {x}. The output of x*y is influenced by {x, y}. For a unary function like sin(x), the influencer set of the output is the same as the influencer set of x. So our non-standard interpretation is to replace variables by influencer sets, and whenever two collide in a binary function like multiplication, we make the new influencer set be the union of the two. Otherwise we keep propagating forward. The result of running the program this way is outputs that say "these are all of the variables which can be influencing this output variable". If x_j never shows up at any stage of the computation of y_i, then there is no way it could ever influence it, and therefore ∂f_i/∂x_j = 0. So the sparsity pattern is bounded by the influencer sets.
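A toy version of this set-propagating interpreter fits in a few lines. This is illustrative Python with hypothetical names, not the real Julia tooling:

```python
# Non-standard interpretation for sparsity: values are replaced by the set
# of inputs that *can* influence them; binary ops take the union.

class Tracked:
    def __init__(self, influencers):
        self.s = frozenset(influencers)
    def __add__(self, other):
        return Tracked(self.s | other.s)   # binary op: union the sets
    __mul__ = __add__                      # same rule for *
    def sin(self):
        return Tracked(self.s)             # unary op: set passes through

def f(x1, x2, x3):
    y1 = x1 * x2 + x1      # touches x1 and x2
    y2 = x3.sin()          # touches only x3
    return y1, y2

y1, y2 = f(Tracked({"x1"}), Tracked({"x2"}), Tracked({"x3"}))
print(sorted(y1.s), sorted(y2.s))  # a bound on the Jacobian sparsity pattern
```

Since x3 never shows up in the computation of y1, the (1,3) Jacobian entry is guaranteed zero.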

This is the process behind the automated sparsity tooling of the Julia programming language, which is now embedded as part of Symbolics.jl. There is a bit more you have to do if you see branching: you have to take all branches and take the union of the influencer sets (so it's clear this cannot be generally solved with just operator overloading, because you need non-local control of control flow). Details on the full process are described in Sparsity Programming: Automated Sparsity-Aware Optimizations in Differentiable Programming, along with extensions to things like Hessians, which are all about tracking sets of linear dependence.

Let's say we didn't actually know the value x to put into our program, and instead it was some uncertain measurement x ± σ_x. What could we do then? In a standard physics class you probably learned a few rules for uncertainty propagation: for independent quantities, σ_{x+y} = sqrt(σ_x² + σ_y²), and so on. It's good to understand where this all comes from. If you said that x was a random variable from a normal distribution with mean μ_x and standard deviation σ_x, and y was an independent random variable from a normal distribution with mean μ_y and standard deviation σ_y, then x + y would have mean μ_x + μ_y and standard deviation sqrt(σ_x² + σ_y²). This means there are some local rules for propagating normally distributed random variables! What do you do for sin(x) for a normally distributed input? You could approximate sin with its linear approximation: sin(x + δ) ≈ sin(x) + cos(x)δ (this is another way to state that the derivative gives the tangent line at x). At a given value of x we then just have a linear equation aδ for the scalar a = cos(x), in which case we use the rule σ_{ax} = |a|σ_x. This gives a non-standard interpretation for approximately computing with normal distributions! Now all we have to do is replace function calls with automatic differentiation around the mean and then propagate forward our error bounds.
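Sketched in code, the non-standard "number" is now a (mean, std) pair. This is illustrative Python with made-up helper names, ignoring correlations entirely:

```python
import math

def add(x, y):
    # independent normals: means add, variances add
    (mx, sx), (my, sy) = x, y
    return (mx + my, math.sqrt(sx**2 + sy**2))

def sin(x):
    # linearize: sin(m + d) ~ sin(m) + cos(m)*d, so the std scales by |cos(m)|
    m, s = x
    return (math.sin(m), abs(math.cos(m)) * s)

x = (1.0, 0.1)   # mean 1.0, standard deviation 0.1
y = (2.0, 0.2)
print(sin(add(x, y)))
```

Every primitive gets a rule for the mean (just evaluate) and a rule for the std (scale by the absolute derivative), which is exactly the tuple-transformation pattern again.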

This is a very crude description which you can expand to linear error propagation theory, where you can more accurately treat the nonlinear propagation of variance. However, this is still missing out on whether two variables are dependent: x - x is exactly 0, there's no uncertainty there, so you need to treat that in a special way! If you think about it, "dependence propagation" is very similar to "propagating influencer sets", which you can use to even more accurately propagate the variance terms. This gives rise to the package Measurements.jl, which transforms code to make it additionally do propagation of normally distributed uncertainties.

I note in passing that Interval Arithmetic is very similarly formulated as an alternative interpretation of a program. David Sanders has a great tutorial on what this is all about and how to make use of it, so check that out for more information.

Now let's look at solving something a little bit deeper: simulating a pendulum. I know you'll say "but I solved pendulum equations by hand in physics", but sorry to break it to you: you didn't. In an early physics class you will say "all pendulums are one dimensional" and "the angle is small, so sin(x) is approximately x" and arrive at a beautiful linear ODE that you analytically solve. But the world isn't that beautiful. So let's look at the full pendulum. Instead you have the location (x, y) of the swinger and its velocity (v_x, v_y). But you also have a non-constant tension T, and if we have a rigid rod we know that the distance of the swinger to the origin is constant. So the evolution of the system is in full given by:
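One standard way to write this index-3 system (with tension T, gravitational acceleration g, and rod length L; sign conventions vary between presentations):

```latex
\begin{aligned}
x' &= v_x \\
y' &= v_y \\
v_x' &= -T x \\
v_y' &= -T y - g \\
0 &= x^2 + y^2 - L^2
\end{aligned}
```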

There are differential equation solvers that can handle constraint equations; these are methods for solving differential-algebraic equations (DAEs). But if you take this code and give it to pretty much any differential equation solver, it will fail. It's not because the differential equation solver is bad, but because of the properties of this equation. See, the derivative of the last equation with respect to the tension T is zero, so you end up getting a singularity in the Newton solve that makes the stepping unstable. This singularity of the algebraic equation with respect to the algebraic variables (i.e. the ones not defined by derivative terms) is known as "higher index". DAE solvers generally only work on index-1 DAEs, and this is an index-3 DAE. What does index-3 mean? It means if you take the last equation, differentiate it twice, and then do a substitution, you get:
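Concretely: differentiating the constraint 0 = x² + y² - L² once gives 0 = x v_x + y v_y, and differentiating again and substituting the velocity equations gives an algebraic equation in which T now appears:

```latex
\begin{aligned}
x' &= v_x \\
y' &= v_y \\
v_x' &= -T x \\
v_y' &= -T y - g \\
0 &= v_x^2 + v_y^2 - T\,(x^2 + y^2) - g\,y
\end{aligned}
```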

This is mathematically the same system of equations, but this formulation of the equations doesn't have the index issue, and so if you give this to a numerical solver it will work.

It turns out that you can reimagine this algorithm as something that is also solved by a form of non-standard interpretation. This is one of the nice unique features of ModelingToolkit.jl, spelled out in ModelingToolkit: A Composable Graph Transformation System For Equation-Based Modeling. While algorithms have been written before for symbolic equation-based modeling systems, it turns out you can use a non-standard interpretation process to extract the formulation of the equations, solve a graph algorithm to determine how many times you should differentiate which equations, do the differentiation using rules and structures from automatic differentiation libraries, and then regenerate the code for "the better version". As shown in a ModelingToolkit.jl tutorial on this feature, if you do this on the pendulum equations, you can change a system that is too unstable to solve into one that is easy enough to solve by the simplest solvers.

And then you can go even further. As I described in the JuliaCon 2020 talk on automated code optimization, now that one is regenerating the code, you can step in and construct a graph of dependencies and automatically compute independent portions simultaneously in the generated code. Thus with no cost to the user, a non-standard interpretation into symbolic graph building can be used to reconstruct and automatically parallelize code. The ModelingToolkit.jl paper takes this even further by showing how a code which is not amenable to parallelism can, after context-specific equation changes like the DAE index reduction, be transformed into an alternative variant that is suddenly embarrassingly parallel.

All of these features require a bit of extra context as to "what equations you're solving" and information like "do you care about maintaining the same exact floating point result", but by adding in the right amount of context, we can extend mathematical non-standard interpretation to solve these alternative problems in new domains.

By the way, if this excites you and you want to see more updates like this, please star ModelingToolkit.jl.

Let me end by going a little bit more mathematical. You can transform code about scalars into code about functions by using the vector-space interpretation of a function. In mathematical terms, functions are vectors in Banach spaces. Or, in the space of functions, functions are points: if you have a function f and a function g, then f + g is a function too, and so is 2f. You can do computation on these "points" by working out their expansions. For example, you can write f and g in their Fourier series expansions: f(x) = a_0/2 + Σ_{n≥1} (a_n cos(nx) + b_n sin(nx)). Approximating with finitely many expansion terms, you can represent f via [a[1:n]; b[1:n]], and the same for g. The representation of f + g can be worked out from the finite truncation (just add the coefficients), and so can f * g. So you can transform your computation about "functions" to "arrays of coefficients representing functions", and derive the results for what +, *, etc. do on these values. This is a non-standard interpretation of a program that transforms it into an equivalent program about functions and measures as inputs.
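As a tiny illustration (Python, using a pure sine expansion rather than the full Fourier basis to keep it short): the "scalars" of the reinterpreted program are coefficient arrays, and + on functions becomes elementwise + on coefficients.

```python
import math

def evaluate(coeffs, x):
    # interpret coeffs as a truncated sine expansion: f(x) = sum_n c[n] sin((n+1) x)
    return sum(c * math.sin((n + 1) * x) for n, c in enumerate(coeffs))

f = [1.0, 0.5, 0.0]    # sin(x) + 0.5 sin(2x)
g = [0.0, 0.25, 2.0]   # 0.25 sin(2x) + 2 sin(3x)
h = [a + b for a, b in zip(f, g)]   # "f + g" is coefficient-wise addition

x = 0.7
print(abs(evaluate(h, x) - (evaluate(f, x) + evaluate(g, x))))  # ~0
```

Multiplication of functions needs a convolution of coefficients rather than an elementwise product, which is exactly the kind of per-operation rule the non-standard interpreter has to supply.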

Now it turns out you can formally use this to do cool things. A partial differential equation (PDE) is like an ODE where instead of your values being scalars at each time t, your values are functions at each time t. So what if you represent those "functions" as "scalars" via their representation in a Sobolev space, and then put those "scalars" into the ODE solver? You automatically transform your ODE solver code into a PDE solver code. Formally, this is using a branch of PDE theory known as semigroup theory and making it a computable object.
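A toy version of this idea (illustrative Python, a far cry from the real packages): represent the solution of the heat equation u_t = u_xx on [0, π] with zero boundary values by its sine coefficients. In coefficient space each coefficient obeys the ODE c_n'(t) = -n² c_n(t), so a plain ODE stepper (here, forward Euler) "becomes" a PDE solver.

```python
import math

def heat_euler(c0, t_end, dt):
    # step the coefficient ODEs c_n' = -(n+1)^2 c_n with forward Euler
    c = list(c0)
    for _ in range(round(t_end / dt)):
        c = [cn - dt * (n + 1) ** 2 * cn for n, cn in enumerate(c)]
    return c

c0 = [1.0, 0.5]    # u(0, x) = sin(x) + 0.5 sin(2x)
c = heat_euler(c0, 1.0, 1e-4)
exact = [c0[n] * math.exp(-(n + 1) ** 2) for n in range(2)]
print(c, exact)
```

The numerical coefficients land very close to the exact decay rates e^(-n² t), with the error set by the Euler step size.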

It turns out this is something you can do. ApproxFun.jl defines the type ``Fun``, which represents functions as scalars in a Banach space, and defines a bunch of operations that are allowed on such functions. I showed in a previous talk that you can slap these into DifferentialEquations.jl to have it reinterpret into this function-based differential equation solver, and then start to solve PDEs via this representation.

Automatic differentiation gets all of the attention, but it's really this idea of non-standard interpretation that we should be looking at. "How do I change the semantics of this program to solve another problem?" is a very powerful approach. In computer science it's often used for debugging: recompile this program into the debugging version. And in machine learning it's often used to recompile programs into derivative calculators. But uncertainty quantification, fast sparsity detection, automatic stabilization and parallelization of differential-algebraic equations, and automatic generation of PDE solvers all arise from the same little trick. That can't be all there is out there: the formalism and theory of non-standard interpretation seems like it could be a much richer field.

- ModelingToolkit: A Composable Graph Transformation System For Equation-Based Modeling
- Sparsity Programming: Automated Sparsity-Aware Optimizations in Differentiable Programming
- Uncertainty propagation with functionally correlated quantities

Christopher Rackauckas, Generalizing Automatic Differentiation to Automatic Sparsity, Uncertainty, Stability, and Parallelism, The Winnower 8:e162133.38896 (2021). DOI: 10.15200/winn.162133.38896

This post is open to read and review on The Winnower.

The post Generalizing Automatic Differentiation to Automatic Sparsity, Uncertainty, Stability, and Parallelism appeared first on Stochastic Lifestyle.


Under the hood it's using the DifferentialEquations.jl package and the SciML stack, but it's abstracted from users so much that Julia is essentially an alternative to Rcpp with easier interactive development. The following example really brings the seamless integration home:

```r
install.packages("diffeqr")
library(diffeqr)
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
sol <- de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=10000,saveat=0.01)
```

This example does the following:

- Automatically installs Julia
- Automatically installs DifferentialEquations.jl
- Automatically installs CUDA (via CUDA.jl)
- Automatically installs ModelingToolkit.jl and DiffEqGPU.jl
- JIT transpiles the R function to Julia via ModelingToolkit
- Uses KernelAbstractions (in DiffEqGPU) to generate a CUDA kernel from the Julia code
- Solves the ODE 10,000 times with different parameters, about 350x faster than deSolve

What a complicated code! Well maybe it would shock you to know that the source code for the diffeqr package is only 150 lines of code. Of course, it's powered by a lot of Julia magic in the backend, and so can your next package. For more details, see the big long post about differential equation solving in R with Julia.

The post JuliaCall Update: Automated Julia Installation for R Packages appeared first on Stochastic Lifestyle.

The post GPU-Accelerated ODE Solving in R with Julia, the Language of Libraries appeared first on Stochastic Lifestyle.

This is definitely not the first time this question was asked. The statistics libraries in Julia were developed by individuals like Douglas Bates, who built some of R's most widely used packages like lme4 and Matrix. Doug had written a blog post in 2018 showing how to get top-notch performance in linear mixed effects model fitting via JuliaCall. In 2018 the JuliaDiffEq organization had written a blog post demonstrating the use of DifferentialEquations.jl in R and Python (the Jupyter of Differential Equations). Now rebranded as SciML for Scientific Machine Learning, we looked to expand our mission and bring automated model discovery and acceleration to other languages like R and Python, with Julia as the base.

With the release of diffeqr v1.0, we can now demonstrate many advances in R through the connection to Julia. Specifically, I would like to use this blog post to showcase:

- The new direct wrapping interface of diffeqr
- JIT compilation and symbolic analysis of ODEs and SDEs in R using Julia and ModelingToolkit.jl
- GPU-accelerated simulations of ensembles using Julia's DiffEqGPU.jl

Together we will demonstrate how models in R can be accelerated by 1000x without a user ever having to write anything but R.

Before continuing on with showing all of the features, I wanted to ask for support so that we can continue developing these bridged libraries. Specifically, I would like to be able to support developers interested in providing a fully automated Julia installation and static compilation so that calling into Julia libraries is just as easy as any Rcpp library. To show support, the easiest thing to do is to star our libraries. The work of this blog post is built on DifferentialEquations.jl, diffeqr, ModelingToolkit.jl, and DiffEqGPU.jl. Thank you for your patience and now back to our regularly scheduled program.

First let me start with the new direct wrappers of differential equation solvers in R. In the previous iterations of diffeqr, we had relied on specifically designed high-level functions, like "ode_solve", to compensate for the fact that one could not use Julia's original DifferentialEquations.jl interface directly from R. However, the new diffeqr v1.0 directly exposes the entirety of the Julia library in an easy-to-use framework.

To demonstrate this, let's see how to define the Lorenz ODE with diffeqr. In Julia's DifferentialEquations.jl, we would start by defining an "ODEProblem" that contains the initial condition u0, the time span, the parameters, and the f in terms of `u' = f(u,p,t)` that defines the derivative. In Julia, this would look like:

```julia
using DifferentialEquations
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
sol = solve(prob,saveat=1.0)
```

With the new diffeqr, diffeq_setup() is a function that does a few things:

- It instantiates a Julia process to utilize as its underlying compute engine
- It first checks if the correct Julia libraries are installed and, if not, it installs them for you
- Then it exposes all of the functions of DifferentialEquations.jl into its object

What this means is that the following is the complete diffeqr v1.0 code for solving the Lorenz equation:

```r
library(diffeqr)
de <- diffeqr::diffeq_setup()
f <- function(u,p,t) {
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  return(c(du1,du2,du3))
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(f, u0, tspan, p)
sol <- de$solve(prob,saveat=1.0)
```

This then carries on through SDEs, DDEs, DAEs, and more. Through this direct exposure, the whole library of DifferentialEquations.jl is at the fingertips of any R user, making it a truly cross-language platform.

(Note that diffeq_setup installs Julia for you if it's not already installed!)

The reason for Julia is speed (well and other things, but here, SPEED!). Using the pure Julia library, we can solve the Lorenz equation 100 times in about 0.05 seconds:

```julia
@time for i in 1:100 solve(prob,saveat=1.0) end
# 0.048237 seconds (156.80 k allocations: 6.842 MiB)
# 0.048231 seconds (156.80 k allocations: 6.842 MiB)
# 0.048659 seconds (156.80 k allocations: 6.842 MiB)
```

Using the diffeqr-connected version, we get:

```r
lorenz_solve <- function (i){
  de$solve(prob,saveat=1.0)
}

> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.81    0.02    6.83
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   7.09    0.00    7.10
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.78    0.00    6.79
```

That's not good: that's about a 100x difference! In this blog post I described how interpreter overhead and context switching are the main causes of this issue. We've also demonstrated that ML accelerators like PyTorch generally do not perform well in this regime, since those kinds of accelerators rely on heavy array operations, unlike the scalarized nonlinear interactions seen in a lot of differential equation modeling. For this reason we cannot just slap any old JIT compiler onto the f call and then put it into the solver, since there would still be context-switching overhead left over. So we need to do something a bit tricky.

In my JuliaCon 2020 talk, Automated Optimization and Parallelism in DifferentialEquations.jl, I demonstrated how ModelingToolkit.jl can be used to trace functions and generate highly optimized sparse and parallel code for scientific computing, all in an automated fashion. It turns out that JuliaCall can do a form of tracing on R functions, something that was exploited to allow autodiffr to automatically differentiate R code with Julia's AD libraries. Thus it turns out that the same modelingtoolkitization methods used in AutoOptimize.jl can be used on a subset of R codes which includes a large majority of differential equation models.

In short, we can perform automated acceleration of R code by turning it into sparse parallel Julia code. This was exposed in diffeqr v1.0 as the `jitoptimize_ode(de,prob)` function (also `jitoptimize_sde(de,prob)`). Let's try it out on this example. All you need to do is give it the ODEProblem which you wish to accelerate. Let's take the last problem and turn it into a pure Julia defined problem and then time it:

```r
fastprob <- diffeqr::jitoptimize_ode(de,prob)
fast_lorenz_solve <- function (i){
  de$solve(fastprob,saveat=1.0)
}

> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.05    0.00    0.04
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06
```

And there you go: an R user can get the full benefits of Julia's optimizing JIT compiler without having to write a lick of Julia code! This function also did a few other things, like automatically defining the Jacobian code to make implicit solving of stiff ODEs much faster, and it can perform sparsity detection and automatically optimize computations with that.

To see how much of an advance this is, note that this Lorenz equation is the same from the deSolve examples page. So let's take their example and see how well it performs:

```r
library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.03    0.03    5.07
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.42    0.00    5.44
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.41    0.00    5.41
```

Thus we see 100x acceleration over the leading R library without users having to write anything but R code. This is the true promise in action of a "language of libraries" helping to extend all other high level languages!

What about writing C code and directly calling it with deSolve? It turns out that's still not as efficient as this JIT. Following the tutorial from deSolve on how to write an optimized Lorenz function, we first define the following function in C:

```c
/* file lorenz.c */
#include <R.h>
static double parms[3];
#define a parms[0]
#define b parms[1]
#define c parms[2]

/* initializer */
void initmod(void (* odeparms)(int *, double *)) {
  int N = 3;
  odeparms(&N, parms);
}

/* Derivatives */
void derivs (int *neq, double *t, double *y, double *ydot, double *yout, int *ip) {
  ydot[0] = a * y[0] + y[1] * y[2];
  ydot[1] = b * (y[1] - y[2]);
  ydot[2] = - y[0] * y[1] + c * y[1] - y[2];
}
```

Then we use ``system("R CMD SHLIB lorenz.c")`` in R in order to compile this function to a shared library (a .dll on Windows). Now we can call it from R:

```r
library(deSolve)
dyn.load("lorenz.dll")
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  ode(state, times, func = "derivs", parms = parameters,
      dllname = "lorenz", initfunc = "initmod")
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   0.09    0.00    0.09
```

**Notice that even when rewriting the function to C, this still is almost 2x as slow as the direct JIT compiled R code!** This means that users, with less work, can get faster than what they had before!

Can we go deeper? Yes we can. In many cases, like in optimization and sensitivity analysis of models for pharmacology, users need to solve the same ODE thousands or millions of times to understand the behavior over a large parameter space. To solve this problem well in Julia, we built DiffEqGPU.jl, which transforms the pure Julia function into a .ptx kernel and parallelizes the ODE solver over it. What this looks like is the following, which solves the Lorenz equation 100,000 times with randomized initial conditions and parameters:

```julia
using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(3).*u0,p=rand(3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
```

Notice how this is only two lines of code different from what we had before, and now everything is GPU accelerated! The requirement for this to work is that the ODE/SDE/DAE function has to be written in Julia... but diffeqr::jitoptimize_ode(de,prob) accelerates the ODE solving in R by generating a Julia function, so could that mean...?

Yes, it does mean we can use DiffEqGPU directly on ODEs defined in R. Let's see this in action. Once again, we will write almost exactly the same code as in Julia, except with `de$` and with diffeqr::jitoptimize_ode(de,prob) to JIT compile our ODE definition. What this looks like is the following:

```r
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
sol = de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0)
```

Note that `diffeqr::diffeqgpu_setup()` does the following:

- It sets up the drivers and installs the right version of CUDA for the user if not already available
- It installs the DiffEqGPU library
- It exposes the pieces of the DiffEqGPU library for the user to then call onto ensembles

This means that this portion of the library is fully automated, all the way down to the installation of CUDA! Let's time this out a bit. 100,000 ODE solves in serial:

```julia
@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,saveat=1.0f0)
# 15.045104 seconds (18.60 M allocations: 2.135 GiB, 4.64% gc time)
# 14.235984 seconds (16.10 M allocations: 2.022 GiB, 5.62% gc time)
```

100,000 ODE solves on the GPU in Julia:

```julia
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
# 2.071817 seconds (6.56 M allocations: 1.077 GiB)
# 2.148678 seconds (6.56 M allocations: 1.077 GiB)
```

Now let's check R in serial:

```r
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  24.16    1.27   25.42
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  25.45    0.94   26.44
```

and R on GPUs:

```r
> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  12.39    1.51   13.95
> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  12.55    1.36   13.90
```

R doesn't quite reach the level of Julia here, and if you profile you'll see it's because the `prob_func`, i.e. the function that tells the ensemble which problems to solve, is still a function written in R, and it becomes the bottleneck as the rest of the computation gets faster and faster. Thus you will get closer and closer to Julia's speed with longer and harder ODEs, but it still means there's work to be done. Another detail is that the Julia code can be further accelerated by using 32-bit numbers. Let's see that in action:

```julia
using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = Float32[1.0,1.0,1.0]
tspan = (0.0f0,100.0f0)
p = Float32[10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(Float32,3).*u0,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
# 1.781718 seconds (6.55 M allocations: 918.051 MiB)
# 1.873190 seconds (6.56 M allocations: 917.875 MiB)
```

Right now the Julia to R bridge converts all 32-bit numbers back to 64-bit numbers so this doesn't seem to be possible without the user writing some Julia code, but we hope to get this fixed in one of our coming releases.

To figure out where that leaves us, let's use deSolve to solve that same Lorenz equation 100 and 1,000 times:

```r
library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.06    0.00    5.13
> system.time({ lapply(1:1000,desolve_lorenz_solve) })
   user  system elapsed
  55.68    0.03   55.75
```

We see the expected linear scaling of a scalar code, so we can extrapolate out and see that to solve 100,000 ODEs it would take deSolve 5000 seconds, as opposed to the 14 seconds of diffeqr or the 1.8 seconds of Julia. In summary:

- Pure R diffeqr offers a **350x acceleration over deSolve**
- Pure Julia DifferentialEquations.jl offers a **2777x acceleration over deSolve**
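For the curious, those headline numbers follow directly from the timings above; here is a quick sanity check (the variable names are just for illustration, and the 5000-second figure is the rounded extrapolation used in the text):

```julia
# Sanity check on the quoted speedups, using the timings reported above.
desolve_100k = 5000.0   # extrapolated deSolve time for 100,000 solves (seconds)
diffeqr_gpu  = 14.0     # diffeqr EnsembleGPUArray elapsed time in R (seconds)
julia_gpu    = 1.8      # pure Julia Float32 GPU time (seconds)

speedup_r     = round(Int, desolve_100k / diffeqr_gpu)  # ~350x class speedup
speedup_julia = round(Int, desolve_100k / julia_gpu)    # ~2777x class speedup
```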

And deSolve is not shabby: it's a library that calls Fortran libraries under the hood!

We hope that the R community has enjoyed this release and will enjoy our future releases as well. We hope to continue building further connections to Python as well. Together this will make Julia a true language of libraries that can be used to accelerate scientific computation in the surrounding higher level scientific ecosystem.

Christopher Rackauckas, GPU-Accelerated ODE Solving in R with Julia, the Language of Libraries, The Winnower 8:e162133.38974 (2021). DOI: 10.15200/winn.162133.38974

This post is open to read and review on The Winnower.

The post GPU-Accelerated ODE Solving in R with Julia, the Language of Libraries appeared first on Stochastic Lifestyle.

Chris Rackauckas

Applied Mathematics Instructor, MIT

Senior Research Analyst, University of Maryland, Baltimore School of Pharmacy

This was a seminar talk given to the COVID modeling journal club on scientific machine learning for epidemic modeling.

Resources:

https://sciml.ai/

https://diffeqflux.sciml.ai/dev/

https://datadriven.sciml.ai/dev/

https://docs.sciml.ai/latest/

https://safeblues.org/

The post COVID-19 Epidemic Mitigation via Scientific Machine Learning (SciML) appeared first on Stochastic Lifestyle.

Cheap But Effective: Instituting Effective Pandemic Policies Without Knowing Who's Infected

Chris Rackauckas

MIT Applied Mathematics Instructor

One way to find out how many people are infected is to figure out who's infected, but that's working too hard! In this talk we will look into cheaper alternatives for effective real-time policy making. To this end we introduce SafeBlues, a project that simulates fake virus strands over Bluetooth and utilizes deep neural networks mixed within differential equations to accurately approximate infection statistics weeks before updated statistics are available. We then introduce COEXIST, a quarantine policy which utilizes inexpensive "useless" tests to perform accurate regional case isolation. This work is all being done as part of the Microsoft Pandemic Modeling Project, where the Julia SciML tooling has accelerated the COEXIST simulations by 36,000x and quantitative systems pharmacology simulations for Pfizer by 175x in support of the efforts against COVID-19.

The post Cheap But Effective: Instituting Effective Pandemic Policies Without Knowing Who's Infected appeared first on Stochastic Lifestyle.

Differentiable programming is a subset of modeling where you model with a program in which each of the steps is differentiable, for the purpose of being able to find the correct program by parameter fitting using said derivatives. Just like any modeling domain, different problems have different code styles which must be optimized in different ways. Traditional scientific computing code makes use of mutable buffers, writes out nonlinear scalar operations, and avoids memory allocations in order to keep top performance. On the other hand, many machine learning libraries allocate a ton of temporary arrays due to using out-of-place matrix multiplications, [which is fine because dense linear algebra costs grow much faster than the costs of the allocations](https://www.stochasticlifestyle.com/when-do-micro-optimizations-matter-in-scientific-computing/). Some need sparsity everywhere, others just need to fuse and build the fastest dense kernels possible. Some algorithms do great on GPUs, while some do not. This intersection between scientific computing and machine learning, i.e. scientific machine learning and other applications of differentiable programming, is too large of a domain for one approach to make everyone happy. And if an AD system is unable to reach top-notch performance for a specific subdomain, it's simply better for the hardcore package author to not use the AD system and instead write their own adjoints.

Even worse is the fact that mathematically there are many cases where you should write your own adjoints, since differentiating through the code is very suboptimal. Any iterative algorithm is of this sort: a nonlinear solve f(x)=0 may use Newton's method to arrive at the solution x*, but the adjoint is defined at x* using only f'(x*), so there's no need to ever differentiate through the Newton iterations themselves. So we should all be writing adjoints! Does this mean that the story of differentiable programming is destroyed? Is it just always better to not do differentiable programming, so any hardcore library writer will ignore it?
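To make that point concrete, here is a toy sketch (not library code; the function names are made up for illustration) of differentiating a nonlinear solve via the implicit function theorem rather than through the solver's iterations. For f(x, p) = x^2 - p, Newton's method finds x* = sqrt(p), and the sensitivity is dx*/dp = -(∂f/∂x)^(-1) ∂f/∂p evaluated at x*:

```julia
# Toy example: differentiate through a nonlinear solve f(x, p) = 0
# without ever differentiating the iterative solver itself.
f(x, p)  = x^2 - p        # solve f(x, p) = 0 for x, given p
fx(x, p) = 2x             # ∂f/∂x
fp(x, p) = -1.0           # ∂f/∂p

# Newton's method: an iterative solver whose steps we do NOT differentiate.
function newton(p; x = 1.0, iters = 50)
    for _ in 1:iters
        x -= f(x, p) / fx(x, p)
    end
    return x
end

# Implicit function theorem: dx*/dp = -(∂f/∂x)^(-1) * ∂f/∂p at the solution.
function solve_with_sensitivity(p)
    xstar = newton(p)
    dxdp  = -fp(xstar, p) / fx(xstar, p)
    return xstar, dxdp
end
```

For p = 4 this returns x* = 2 and dx*/dp = 1/(2*sqrt(p)) = 0.25, the same answer a reverse pass through all fifty Newton iterations would produce, at a fraction of the cost.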

Instead of falling into that despair, let's follow down that road with a positive light. Let's assume that the best way to do differentiable programming is to write adjoints on library code. Then what's the purpose of a differentiable programming system? It's to help your adjoints get written and be useful. In machine learning, it's just matrix multiplications: if the majority of the code is in some optimized kernel, then you don't need to worry about the performance of the rest, you just want it to work. With differentiable programming, if 99% of the computation is in the DiffEq/NLsolve/FFTW/etc. adjoints, what we need from a differentiable programming system is something that will get the rest of the adjoint done and be very easy to make correct. The way to facilitate this kind of workflow would be for the differentiable programming system to:

- Have very high coverage of the language. Sacrifice some speed if it needs to, that's okay, because if 99% of the compute time is in my adjoint, then I don't want that 1% to be hard to develop. It should just work, however it works.
- Be easy to debug and profile. Stacktraces should point to real code. Profiles should point to real lines of code. Errors should be legible.
- Have a language-wide system for defining adjoints. We can't have walled gardens if the way to "get good" is to have adjoints for everything: we need everyone to plug in and distribute the work. Not to just the developers of one framework, and not just to the users of one framework, but to every scientific developer in the entire programming language.
- Make it easy to swap out AD systems. More constrained systems may be more optimized, and if I don't want to define an adjoint, at least I can fallback to something that (a) works on my code and (b) matches its assumptions.

Thus what I think we're looking for is not one differentiable programming system that is the best in all aspects, but instead we're looking for a differentiable programming system that can glue together everything that's out there. "Differentiate all of the things, but also tell me how to do things better". We're looking for a glue AD.
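To make the "write your own adjoints" workflow above concrete, here is a minimal sketch of the value-plus-pullback convention that adjoint-rule systems use (hypothetical names, not the real ChainRules.jl API): each rule returns the primal value together with a closure that maps output cotangents to input cotangents.

```julia
# A minimal sketch of the "value + pullback" convention behind adjoint rules.
# Each rule returns the primal value and a closure mapping the output
# cotangent dy back to an input cotangent.
rrule_sin(x)    = (sin(x), dy -> dy * cos(x))
rrule_square(x) = (x^2,    dy -> dy * 2 * x)

# Composing rules: the forward pass collects pullbacks, and the reverse pass
# applies them in opposite order -- exactly what a glue AD would orchestrate
# across user code and hand-written library adjoints.
function grad_square_of_sin(x)
    y1, pb1 = rrule_sin(x)
    y2, pb2 = rrule_square(y1)
    return pb1(pb2(1.0))   # seed the output cotangent with 1
end
```

Here d/dx sin(x)^2 = 2 sin(x) cos(x), which `grad_square_of_sin` recovers without any source transformation; a real rule system dispatches on function types so that any participating AD backend can pick the rules up automatically.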

Zygote is surprisingly close to being this perfect glue AD. Its stacktraces and profiling are fairly good because they point to the pieces generating the backpasses. It just needs some focus on this goal if it wants to obtain it. For (1), it would need to get higher coverage of the language, focusing on its expanse more so than doing everything as fast as possible. Of course, it should do as well as it can, but for example, if it needs to sacrifice a bit of speed to get full support for mutation today, that might be a good trade-off if the goal is to be a glue AD. Perfect? No, but that would give it the coverage to tell users who need more performance on a particular piece of code to seek out more. To seek out more performance, users could just have Zygote call ReverseDiff.jl on a function and have that compile the tape (or other specialized AD systems which will be announced more broadly soon), or they may want to write a partial adjoint.

So (4) is really the kicker. If I were to hit slow mutating code today inside of a differential equation, it would probably be something perfect for ModelingToolkit.jl to handle, so the best thing to do is to build hyper-optimized adjoints of that differential equation using ModelingToolkit.jl. At that level, I can handle the code symbolically and generate code that a compiler cannot, because I can make a lot of extra assumptions in my mathematical context, like cos^2(x) + sin^2(x) = 1. I can move code around, auto-parallelize it, etc. easily because of the simple static graph I'm working on. Wouldn't it be a treat to just `@ModelingToolkitAdjoint f` and bingo, now it's using ModelingToolkit on a portion of code? `@ForwardDiffAdjoint f` to tell it "you should use forward mode here". Yota.jl is a great reverse-mode project, so `@YotaAdjoint f` and boom, that could be more optimized than Zygote in some cases. `@ReverseDiff f` and let it compile the tape, and it'll get fairly optimal in the places where ReverseDiff.jl is applicable.

Julia is the perfect language to develop such a system for because its AST is so nice and constrained for mathematical contexts that all of these AD libraries work not on a special DSL like TensorFlow graphs or torch.numpy, but directly on the language itself and its original existing libraries. With ChainRules.jl allowing for adjoint overloads that apply to all AD packages, focusing on these "glue AD" properties could really open up the playing field, allowing Zygote to be at the center of an expansive differentiable programming world that works everywhere, maybe making some compromises to do so, but then giving other developers a system to make assumptions, easily define adjoints, and plug alternative AD systems into the whole game. This is a true mixed mode which incorporates not just forward and reverse, but also different implementations with different performance profiles (and this can be implemented just through ChainRules overloads!). Zygote would then facilitate this playing field with a solid debugging and profiling experience, along with a very high chance of working on your code on your first try. That, plus buy-in by package authors, would be a true solution to differentiable programming.

The post Glue AD for Full Language Differentiable Programming appeared first on Stochastic Lifestyle.

Chris Rackauckas (MIT), "Generalized Physics-Informed Learning through Language-Wide Differentiable Programming"

Scientific computing is increasingly incorporating the advancements in machine learning to allow for data-driven physics-informed modeling approaches. However, re-targeting existing scientific computing workloads to machine learning frameworks is both costly and limiting, as scientific simulations tend to use the full feature set of a general purpose programming language. In this manuscript we develop an infrastructure for incorporating deep learning into existing scientific computing code through Differentiable Programming (∂P). We describe a ∂P system that is able to take gradients of full Julia programs, making Automatic Differentiation a first class language feature and compatibility with deep learning pervasive. Our system utilizes the one-language nature of Julia package development to augment the existing package ecosystem with deep learning, supporting almost all language constructs (control flow, recursion, mutation, etc.) while generating high-performance code without requiring any user intervention or refactoring to stage computations. We showcase several examples of physics-informed learning which directly utilizes this extension to existing simulation code: neural surrogate models, machine learning on simulated quantum hardware, and data-driven stochastic dynamical model discovery with neural stochastic differential equations.

Code is available at https://github.com/MikeInnes/zygote-paper

AAAI 2020 Spring Symposium on Combining Artificial Intelligence and Machine Learning with Physics Sciences, March 23-25, 2020 (https://sites.google.com/view/aaai-mlps)

https://figshare.com/articles/presentation/Generalized_Physics-Informed_Learning_through_Language-Wide_Differentiable_Programming/12751934

The post Generalized Physics-Informed Learning through Language-Wide Differentiable Programming (Video) appeared first on Stochastic Lifestyle.
