
The post ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia appeared first on Stochastic Lifestyle.

There is a major philosophical difference which is seen in both the development and usage of the tools. Everything in the SciML organization is built around a principle of confederated modular development: let other packages influence the capabilities of your own. This is highlighted in a paper about the package structure of DifferentialEquations.jl. The underlying principle is that not everyone wants or needs to be a developer of the package, but they still may want to contribute. For example, it's not uncommon that a researcher in ODE solvers wants to build a package that adds one solver to the SciML ecosystem. They can do this in their own package for their own academic credit, with the free bonus that it now exists in the multiple dispatch world. In the design of DifferentialEquations.jl, solve(prob,IRKGL16()) now exists because of their package, and so we add it to the documentation. Some of this work is not even inside the organization, but we still support it. The philosophy is to include every researcher as a budding artist in the space of computational research, including all of the possible methods, and building an infrastructure that promotes a free research atmosphere in the methods. Top-level defaults and documentation may lead people to the most stable aspects of the ecosystem, but with the flip of a switch you can be testing out the latest research.

When approaching modeling languages like Modelica, I noticed this idea was completely foreign to them. Modelica is created by a committee, but the implementations that people use are closed like Dymola, or monolithic like OpenModelica. This is not a coincidence but instead a fact of the design of the language. In the Modelica language, there is no reference to what transformations are being done to your models in order to make them "simulatable". People know about the Pantelides algorithm and "singularity elimination", but this is outside the language. It's something that the compiler maybe gives you a few options for, but not something the user or the code actively interacts with. Every compiler is different, advances in one compiler do not help your model when you use another compiler, and the whole world is siloed. By this design, it is impossible for an external user to write compiler passes in Modelica which affect this model-lowering process. You can tweak knobs, or write a new compiler. Or fork OpenModelica and hack on the whole compiler to just do the change you wanted.

I do not think that the symbolic transformations that are performed by Modelica are the complete set that everyone will need for all models for all time. I think in many cases you might want to write your own. For example, on SDEs there's a Lamperti transformation which transforms general SDEs into SDEs with additive noise. It doesn't always apply, but when it does it can greatly enhance solver speed and stability. This is niche enough that it'll never be in a commercial Modelica compiler (in fact, they don't even have SDEs), but it's something that some user might want to be able to add to the process.

So the starting goal of ModelingToolkit is to give an open and modular transformation system on which a whole modeling ecosystem can thrive. My previous blog post exemplified how unfamiliar use cases for code transformations can solve many difficult mathematical problems, and my goal is to give this power to the whole development community. `structural_simplify` is something built into ModelingToolkit to do "the standard transformations" on the standard systems, but the world of transformations is so much larger. Log-transforming a few variables? Exponentiating a few to ensure positivity? Lamperti transforms of SDEs? Transforming to the sensitivity equations? And not just transformations, but functionality for inspecting and analyzing models. Are the equations linear? Which parameters are structurally identifiable?

From that you can see that Modia was a major inspiration for ModelingToolkit, but Modia did not go in this direction of decomposing the modeling language: it essentially is a simplified Modelica compiler in Julia. But ModelingToolkit is a deconstruction of what a modeling language is. It pulls it down to its component pieces and then makes it easy to build new modeling languages like Catalyst.jl which internally use ModelingToolkit for all of the difficult transformations. The deconstructed form is a jumping point for building new domain-based languages, along with new transformations which optimize the compiler for specific models. And then in the end, everybody who builds off of it gets improved stability, performance, and parallelism as the core MTK passes improve.

Now there are two major aspects that need to be handled to fully achieve such a vision. If you want people to be able to reuse code between transformations, you want to expose how you are changing code. To achieve this goal, a new Computer Algebra System (CAS), Symbolics.jl, was created for ModelingToolkit.jl. The idea being: if we want everyone writing code transformations, they should all have easy access to a general mathematical toolset for doing such code transformations. We shouldn't have everyone building new code for differentiation, simplification, and substitution. And we shouldn't have everyone relying on undocumented internals of ModelingToolkit.jl either: this should be something that is open, well-tested, documented, and well-known, so that everyone can easily become a "ModelingToolkit compiler developer". By building a CAS and making it a Julia standard, we can bridge that developer gap, because now everyone knows how to easily manipulate models: they are just Symbolics.jl expressions.

The second major aspect is to achieve a natural embedding into the host language. Modelica is not a language in which people can write compiler passes, which introduces a major gap between the modeler and the developer of extensions to the modeling language. If we want to bridge this gap, we need to ensure the whole modeling language is used from a host which is a complete imperative programming language. And you need to do so in a language that is interactive, high performance, and has a well-developed ecosystem for modeling and simulation. Martin and Hilding had seen this fact as the synthesis for Modia with how Julia uniquely satisfies this need, but I think we need to take it a step further. To really make the embedding natural, you should be able to on the fly automatically convert code to and from the symbolic form. In the previous blog post I showcased how ModelingToolkit.jl could improve people's code by automatically parallelizing it and performing index reduction even if the code was not written in ModelingToolkit.jl. This grows the developer audience of the transformation language from "anyone who wants to transform models" to "anyone who wants to automate improving models and general code". This expansion of the audience is thus pulling in developers who are interested in things like automating parallelism and GPU codegen and bringing them into the MTK developer community.

In turn, since all of these advances then apply to the MTK internals and code generation tools such as Symbolics.jl's build_function, new features are coming all of the time because of how the community is composed. The CTarget build_function was first created to transpile Julia code to C, and thus ModelingToolkit models can generate C outputs for compiling into embedded systems. This is serendipity when seeing one example, but it's design when you notice that this is how the entire system is growing so fast.

Now, one of the questions we received early on was: won't you be unable to match the performance of a specialized compiler that was only made to work on Modelica? While at face value it may seem like hyperspecialization could be beneficial, the true effect of hyperspecialization is that algorithms are simply less efficient because less work has been put into them. Symbolics.jl has become a phenomenon of its own, with multiple hundred-comment threads digging through many aspects of the pros and cons of its design, and that's not even including the 200-person chat channel which has had tens of thousands of messages in the less than two months since the CAS was released. Tons of people are advising how to improve every single plus and multiply operation.

So it shouldn't be a surprise that there are many details that have quickly been added which are still years away from a Modelica implementation. It automatically multithreads tree traversals and rewrite rules. It automatically generates fast parallelized code, and can do so in a way that composes with tearing of nonlinear equations. It lets users define their own high-performance and parallelized functions, register them, and stick them into the right-hand side. And that is even excluding the higher-level results, like the fact that it is fully differentiable and thus allows training neural networks decomposed within the models and automatically discovering equations from data.

Just at the very basic level we can see that the CAS is transforming the workflows of scientists and engineers in many aspects of the modeling process. By distributing the work of improving symbolic computing, we have already taken examples which were essentially unobtainable and made them instant with Symbolics.jl.

We are building out a full benchmarking system for the symbolic ecosystem to track performance over time and ensure it reaches the top level. It's integrating pieces from the OSCAR project, getting lots of people tracking performance in their own work, and building a community. Each step is another major improvement and this ecosystem is making these steps fast. It will be hard for a few people working on the internals of a single Modelica compiler to keep up with such an environment, let alone repeating this work for every new Modelica-based project.

This raises a rather good question: there are a lot of models already written in Modelica, and it would be a shame for us to not be able to connect with that ecosystem. I will hint that there is tooling coming as part of JuliaSim for connecting to many pre-existing model libraries. In addition, we hope that tooling like Modia.jl and TinyModia.jl will help us build a bridge.

The composability and distributed development nature of ModelingToolkit.jl is its catalyst. This is why ModelingToolkit.jl looks like it has rocket shoes on: it is fast and it is moving fast. And it's because of the thought put into the design. It's because ModelingToolkit.jl is including the entire research community as its asset instead of just its users. I plan to keep moving forward from here, looking back to learn from the greats, but building it in our own image. We're taking the idea of a modeling language, distributing it throughout one of the most active developer communities in modeling and simulation, in a language which is made to build fast and parallelized code. And you're invited.

I'm just going to post a self-explanatory recent talk by Jonathan at the NASA Launch Services Program who saw a 15,000x acceleration by moving from Simulink to ModelingToolkit.jl.

Enough said.



The post Generalizing Automatic Differentiation to Automatic Sparsity, Uncertainty, Stability, and Parallelism appeared first on Stochastic Lifestyle.

What I want to dig into in this blog post is a simple question: what is the trick behind automatic differentiation, why is it always differentiation, and are there other mathematical problems we can be focusing this trick towards? While very technical discussions on this can be found in our recent paper titled "ModelingToolkit: A Composable Graph Transformation System For Equation-Based Modeling" and descriptions of methods like intrusive uncertainty quantification, I want to give a high-level overview that really describes some of the intuition behind the technical thoughts. Let's dive in!

To understand automatic differentiation in practice, you need to understand that it's at its core a code transformation process. While mathematically it comes down to being about Jacobian-vector products and Jacobian-transpose-vector products for forward and reverse mode respectively, I think sometimes that mathematical treatment glosses over the practical point that it's really about code.

Take for example $\sin(x)$. If we want to take the derivative of this, then we could do $\frac{\sin(x + \Delta x) - \sin(x)}{\Delta x}$ for a small $\Delta x$, but this misses the information that we actually know analytically how to define the derivative! Using the principle that algorithm efficiency comes from problem information, we can improve this process by directly embedding that analytical solution into our process. So we come to the first principle of automatic differentiation:

If you know the analytical solution to the derivative, then replace the function with its derivative

So if you see $\sin(x)$ and someone calls ``derivative(f,x)``, you can do a quick little lookup to a table of rules, known as primitives, and if it's in your table then boom you're done. Swap it in, call it a day.
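To make that concrete, here is a minimal sketch of such a rule table in Python (the names ``DERIVATIVE_RULES`` and ``derivative`` are illustrative, not any particular AD library's API):

```python
import math

# A tiny lookup table of primitives: function -> its analytical derivative.
DERIVATIVE_RULES = {
    math.sin: math.cos,
    math.cos: lambda x: -math.sin(x),
    math.exp: math.exp,
}

def derivative(f, x):
    """If f is in the rule table, swap in its known derivative and evaluate at x."""
    rule = DERIVATIVE_RULES.get(f)
    if rule is None:
        raise KeyError("no derivative rule registered for this function")
    return rule(x)

print(derivative(math.sin, 0.0))  # cos(0) = 1.0
```

Of course this only covers functions that appear verbatim in the table, which is exactly the limitation the next paragraphs address.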

This already shows you that, with automatic differentiation, we cannot think of $\sin$ as just a function, just a thing that takes in values, but we have to know something about what it means semantically. We have to look at it and identify "this is $\sin$" in order to know "replace it with $\cos$". This is the fundamental limitation of automatic differentiation: it has to know something about your code, more information than it takes to call or run your code. This is why many automatic differentiation libraries are tied to specific implementations of underlying numerical primitives. PyTorch understands ``torch.sin`` as $\sin$, but it does not understand ``tf.sin`` as $\sin$, which is why if you place a TensorFlow function into a PyTorch training loop you will get an error thrown about the derivative calculation. This semantic mapping is the reason for libraries like ChainRules.jl which define semantic mappings for the Julia Base library and allow extensions: by directly knowing this mapping on all of standard Julia Base, you can cover the language and achieve "differentiable programming", i.e. all programs automatically can get derivatives.

But we're not done. Let's say we have $f(x) = \sin(\cos(x))$. The answer is not to add this new function to the table by deriving it by hand: instead we have to come up with a way to generate derivative code whenever $\sin$ and $\cos$ are in our lookup table. The answer comes from the chain rule. I'm going to describe the forward application of the chain rule as it's a bit simpler to derive, but a full derivation of how this is done in the reverse form is described in these lecture notes. The chain rule tells us that $\frac{d}{dx} f(g(x)) = f'(g(x)) g'(x)$. Thus in order to calculate $\frac{d}{dx} \sin(\cos(x))$, we need to know two things: $g'(x) = \cos'(x) = -\sin(x)$ and $f'(g(x)) = \sin'(\cos(x)) = \cos(\cos(x))$. If we calculate both of these quantities at every stage of our code, it doesn't matter how deep the composition goes, we will have all of the information that is required to reconstruct the result of the chain rule.

What this means is that automatic differentiation on this function can be thought of as the following translation process:

- Transform $\cos(x)$ to $(\cos(x), -\sin(x))$ and evaluate at $x$
- Transform $\sin(x)$ to $(\sin(x), \cos(x))$ and evaluate at $\cos(x)$
- Transform the composition to $(\sin(\cos(x)),\ \cos(\cos(x)) \cdot (-\sin(x)))$. Now the second portion is the solution to the derivative

This translation process, "transform every primitive function into a tuple of (function, derivative), and transform every other function into a chain rule application using the two pieces", is **non-standard interpretation**. This is the process where an interpreter of a code or language runs under different semantics. An interpreter written to do this process acts on the same code but interprets it differently: it changes each operation to a tuple of the solution $f(x)$ and its derivative $f'(x)$, instead of just the solution $f(x)$.
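A minimal sketch of this non-standard interpretation in Python, carrying the (solution, derivative) tuple through $\sin(\cos(x))$ via a hypothetical ``Dual`` type (illustrative only; production forward-mode AD systems are far more elaborate):

```python
import math

class Dual:
    """Non-standard interpretation: every value is a (value, derivative) pair."""
    def __init__(self, val, der):
        self.val, self.der = val, der

def sin(d):
    # primitive rule plus chain rule: (sin(g), cos(g) * g')
    return Dual(math.sin(d.val), math.cos(d.val) * d.der)

def cos(d):
    # primitive rule plus chain rule: (cos(g), -sin(g) * g')
    return Dual(math.cos(d.val), -math.sin(d.val) * d.der)

def f(d):
    return sin(cos(d))

x = 1.0
result = f(Dual(x, 1.0))  # seed with dx/dx = 1
analytical = -math.cos(math.cos(x)) * math.sin(x)
print(result.der, analytical)  # the two derivative values agree
```

No matter how deep the composition goes, each step only needs the tuple from the step before it, which is exactly the locality property discussed below.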

Thus the non-standard interpretation version of the problem of calculating derivatives is to reimagine the problem as "at this step of the code, how should I be transforming it so that I have the information to calculate derivatives"? There are many ways to carry out this non-standard interpretation: operator overloading, prior static analysis that generates new source code, etc. But there's one question we should bring up.

One way to start digging into this question is to answer a related question people pose to me often: if we have automatic differentiation, why do we not have automatic integration? While at face value it seems like the two should be analogues, digging deeper exposes what's special about differentiation. If we wanted to take the integral of $\cos(x)$, then yes we can replace this with $\sin(x)$. The heart of the question is to ask about the chain rule: what's the integral of $f(g(x))$? It turns out that there is no general rule for the "anti-chain rule". A commonly known result is that the density of the standard Gaussian probability distribution, $e^{-x^2}$ (up to normalization), does not have an antiderivative that can be written with elementary functions, and that's just the case of $f(x) = e^x$ and $g(x) = -x^2$. While that is true, I don't think that captures all that is different about integrals.

When I said "we can replace this with $\sin(x)$" I was actually wrong: the antiderivative of $\cos(x)$ is not $\sin(x)$, it's $\sin(x) + C$ for an arbitrary constant $C$. There is no unique solution without imposing some external context or some global information like "and $F(0) = 0$". Differentiation is special because it's purely local: only knowing the values of $f$ near $x$, I can know the derivative of $f$ at $x$. Integration is a well-known example of a non-local operation in mathematics: in order to know the antiderivative at a value of $x$, you might need to know information about some other value $y$, and sometimes it's not necessarily obvious what that value should even be. This nonlocality manifests in other ways as well: while $e^{-x^2}$ is not integrable in elementary terms, $x e^{-x^2}$ is easy to solve via a u-substitution, making $u = -x^2$ and cancelling out the $x$ in front of the $e^{-x^2}$. So there is no chain rule not because some things don't have an antiderivative, but because you have nonlocality, so $f(g(x))$ can be non-integrable while $g'(x) f(g(x))$ is. There is no chain rule because you can't look at small pieces and transform them; instead you have to look at the problem holistically.

But this gives us a framework in order to judge whether a mathematical problem is amenable to being solved in the framework of non-standard interpretation: it must be local so that we can define a step-by-step transformation algorithm, or we need to include/impose some form of context if we have alternative information.

Let's look at a few related problems that can be solved with this trick.

Recall that the Jacobian is the matrix of partial derivatives, i.e. for $f(x) = y$ where $x$ and $y$ are vectors, it's the matrix of terms $J_{ij} = \frac{\partial f_i}{\partial x_j}$. This matrix shows up in tons of mathematical algorithms, and in many cases it's sparse, so it's a common problem to try and compute the sparsity pattern of a Jacobian. But what does this sparsity pattern mean? If you write out the analytical solution to $f$, a zero at $J_{ij}$ in the Jacobian means that $f_i$ is not a function of $x_j$. In other words, $x_j$ has no influence on $f_i$. For an arbitrary program $f$, can we use non-standard interpretation to calculate whether $x_j$ influences $f_i$?

It turns out that if we make this question a little simpler then it has a simple solution. Let's instead ask: can we use non-standard interpretation to calculate whether $x_j$ *can* influence $f_i$? The reason for this change is because the previous question was non-local: an expression like $x + y - y$ is programmatically dependent on $y$, but mathematically you could cancel it out if you have good global knowledge of the program. So "does it influence this output" is hard, but "can it influence the output" is easy. "Can influence" is just the question of "does $x_j$ show up in the calculation of $f_i$ at all?"

So we can come up with a non-standard interpretation formulation to solve this problem. Instead of computing values, we can compute "influencer sets". The output of $x + y$ is influenced by $\{x, y\}$. The output of $x \cdot y$ is influenced by $\{x, y\}$. For $\sin(x)$, the influencer set of the output is the same as the influencer set of $x$. So our non-standard interpretation is to replace variables by influencer sets, and whenever they collide via a binary function like multiplication, we make the new influencer set be the union of the two. Otherwise we keep propagating it forward. The result of this way of running the program is outputs that say "these are all of the variables which can be influencing this output variable". If $x_j$ never shows up at any stage of the computation of $f_i$, then there is no way it could ever influence it, and therefore $J_{ij} = 0$. So the sparsity pattern is bounded by the influencer sets.
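A tiny Python sketch of this influencer-set interpretation (the ``Influenced`` type is hypothetical and handles only straight-line code):

```python
class Influenced:
    """Non-standard interpretation: values are replaced by sets of input names."""
    def __init__(self, influencers):
        self.influencers = frozenset(influencers)
    # Binary operations take the union of the two influencer sets.
    def __add__(self, other):
        return Influenced(self.influencers | other.influencers)
    def __mul__(self, other):
        return Influenced(self.influencers | other.influencers)

def sin(v):
    # Unary functions keep the same influencer set as their argument.
    return Influenced(v.influencers)

def f(x, y, z):
    out1 = sin(x) * y   # can be influenced by x and y
    out2 = z + z        # can be influenced only by z
    return out1, out2

o1, o2 = f(Influenced({"x"}), Influenced({"y"}), Influenced({"z"}))
print(sorted(o1.influencers), sorted(o2.influencers))
```

Here the first output's Jacobian row is known to be zero in the $z$ column without ever computing a numeric value.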

This is the process behind the automated sparsity tooling of the Julia programming language, which is now embedded as part of Symbolics.jl. There is a bit more you have to do if you see branching: i.e. you have to take all branches and take the union of the influencer sets (so it's clear this cannot generally be solved with just operator overloading because you need non-local control of control flow). Details on the full process are described in Sparsity Programming: Automated Sparsity-Aware Optimizations in Differentiable Programming, along with extensions to things like Hessians, which are all about tracking sets of linear dependence.

Let's say we didn't actually know the value $x$ to put into our program, and instead it was some $x \pm \sigma_x$. What could we do then? In a standard physics class you probably learned a few rules for uncertainty propagation: $\sigma_{x+y} = \sqrt{\sigma_x^2 + \sigma_y^2}$ and so on. It's good to understand where this all comes from. If you said that $x$ was a random variable from a normal distribution with mean $\mu_x$ and standard deviation $\sigma_x$, and $y$ was an independent random variable from a normal distribution with mean $\mu_y$ and standard deviation $\sigma_y$, then $x + y$ would have mean $\mu_x + \mu_y$ and standard deviation $\sqrt{\sigma_x^2 + \sigma_y^2}$. This means there are some local rules for propagating normally distributed random variables! What do you do for $f(x)$ for a normally distributed input? You could approximate $f$ with its linear approximation: $f(x) \approx f(\mu_x) + f'(\mu_x)(x - \mu_x)$ (this is another way to state that the derivative is the tangent vector at $\mu_x$). At a given value of $\mu_x$ we then just have a linear equation $ax + b$ for some scalar $a$, in which case we use the rule $\sigma_{ax} = |a| \sigma_x$. This gives a non-standard interpretation implementation for approximately computing with normal distributions! Now all we have to do is replace function calls with automatic differentiation around the mean and then propagate forward our error bounds.
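As a sketch, the same operator-overloading style can propagate (mean, standard deviation) pairs; this hypothetical Python ``Normal`` type applies the linearization rules above and assumes all variables are independent:

```python
import math

class Normal:
    """Propagate (mean, std) through code via linear error propagation."""
    def __init__(self, mean, std):
        self.mean, self.std = mean, std
    def __add__(self, other):
        # independent sum: means add, variances add
        return Normal(self.mean + other.mean,
                      math.sqrt(self.std**2 + other.std**2))

def sin(v):
    # linearize around the mean: sigma_out ~= |f'(mu)| * sigma_in
    return Normal(math.sin(v.mean), abs(math.cos(v.mean)) * v.std)

x = Normal(0.0, 0.1)
y = Normal(1.0, 0.2)
s = x + y
print(s.mean, s.std)   # 1.0 and sqrt(0.01 + 0.04)
print(sin(x).std)      # |cos(0)| * 0.1 = 0.1
```

The independence assumption is exactly the flaw pointed out next: this naive version would assign nonzero uncertainty to an expression whose terms are perfectly correlated.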

This is a very crude description which you can expand to linear error propagation theory where you can more accurately treat the nonlinear propagation of variance. However, this is still missing out on whether two variables are dependent: $x - x$ is exactly $0$, there's no uncertainty there, so you need to treat that in a special way! If you think about it, "dependence propagation" is very similar to "propagating influencer sets", which you can use to even more accurately propagate the variance terms. This gives rise to the package Measurements.jl which transforms code to make it additionally do propagation of normally distributed uncertainties.

I note in passing that Interval Arithmetic is very similarly formulated as an alternative interpretation of a program. David Sanders has a great tutorial on what this is all about and how to make use of it, so check that out for more information.

Now let's look at solving something a little bit deeper: simulating a pendulum. I know you'll say "but I solved pendulum equations by hand in physics", but sorry to break it to you: you didn't. In an early physics class you will say "all pendulums are one-dimensional" and "the angle is small, so $\sin(x)$ is approximately $x$", and arrive at a beautiful linear ODE that you analytically solve. But the world isn't that beautiful. So let's look at the full pendulum. Instead you have the location $(x, y)$ of the swinger and its velocity $(v_x, v_y)$. But you also have a non-constant tension $T$, and if we have a rigid rod we know that the distance of the swinger to the origin is constant. So the evolution of the system is in full given by:
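A sketch of that system, assuming the standard Cartesian formulation with tension $T$ (scaled by mass), gravitational acceleration $g$, and rod length $L$, following the usual index-3 pendulum DAE:

```latex
\begin{aligned}
x' &= v_x \\
y' &= v_y \\
v_x' &= T x \\
v_y' &= T y - g \\
0 &= x^2 + y^2 - L^2
\end{aligned}
```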

There are differential equation solvers that can handle constraint equations; these are methods for solving differential-algebraic equations (DAEs). But if you take this code and give it to pretty much any differential equation solver it will fail. It's not because the differential equation solver is bad, but because of the properties of this equation. See, the derivative of the last equation with respect to $T$ is zero, so you end up getting a singularity in the Newton solve that makes the stepping unstable. This singularity of the algebraic equation with respect to the algebraic variables (i.e. the ones not defined by derivative terms) is known as "higher index". DAE solvers generally only work on index-1 DAEs, and this is an index-3 DAE. What does index-3 mean? It means if you take the last equation, differentiate it twice, and then do a substitution, you get:
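Sketching that calculation with the Cartesian pendulum variables (writing $T$ for tension, $g$ for gravity, $L$ for rod length): differentiating $0 = x^2 + y^2 - L^2$ twice and substituting the differential equations turns the constraint into

```latex
0 = v_x^2 + v_y^2 + x v_x' + y v_y'
  = v_x^2 + v_y^2 + T (x^2 + y^2) - g y
  = v_x^2 + v_y^2 + T L^2 - g y,
```

so the algebraic equation now does involve $T$ directly, giving an index-1 formulation in the same variables.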

This is mathematically the same system of equations, but this formulation of the equations doesn't have the index issue, and so if you give this to a numerical solver it will work.

It turns out that you can reimagine this algorithm as something that is also solved by a form of non-standard interpretation. This is one of the nice unique features of ModelingToolkit.jl, spelled out in ModelingToolkit: A Composable Graph Transformation System For Equation-Based Modeling. While algorithms have been written before for symbolic equation-based modeling systems, it turns out you can use a non-standard interpretation process to extract the formulation of the equations, solve a graph algorithm to determine how many times you should differentiate which equations, do the differentiation using rules and structures from automatic differentiation libraries, and then regenerate the code for "the better version". As shown in a ModelingToolkit.jl tutorial on this feature, if you do this on the pendulum equations, you can change a code that is too unstable to solve into a formulation that is easy enough to solve by the simplest solvers.

And then you can go even further. As I described in the JuliaCon 2020 talk on automated code optimization, now that one is regenerating the code, you can step in and construct a graph of dependencies and automatically compute independent portions simultaneously in the generated code. Thus with no cost to the user, a non-standard interpretation into symbolic graph building can be used to reconstruct and automatically parallelize code. The ModelingToolkit.jl paper takes this even further by showing how a code which is not amenable to parallelism can, after context-specific equation changes like the DAE index reduction, be transformed into an alternative variant that is suddenly embarrassingly parallel.

All of these features require a bit of extra context as to "what equations you're solving" and information like "do you care about maintaining the same exact floating point result", but by adding in the right amount of context, we can extend mathematical non-standard interpretation to solve these alternative problems in new domains.

By the way, if this excites you and you want to see more updates like this, please star ModelingToolkit.jl.

Let me end by going a little bit more mathematical. You can transform code about scalars into code about functions by using the vector-space interpretation of a function. In mathematical terms, functions are vectors in Banach spaces. Or, in the space of functions, functions are points. If you have a function $f$ and a function $g$, then $f + g$ is a function too, and so is $f \cdot g$. You can do computation on these "points" by working out their expansions. For example, you can write $f$ and $g$ in their Fourier series expansions: $f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos(nx) + b_n \sin(nx) \right)$. Approximating with finitely many expansion terms, you can represent $f$ via ``[a[1:n];b[1:n]]``, and same for $g$. The representation of $f + g$ can be worked out from the finite truncation (just add the coefficients), and so can $f \cdot g$. So you can transform your computation about "functions" to "arrays of coefficients representing functions", and derive the results for what $+$, $\cdot$, etc. do on these values. This is a non-standard interpretation of a program that transforms it into an equivalent program about functions and measures as inputs.
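A small Python sketch of this idea, representing $2\pi$-periodic functions by truncated Fourier coefficient vectors so that $+$ on functions becomes $+$ on arrays (the helper names are made up for illustration):

```python
import math

N = 8  # truncation: keep coefficients a[0..N-1], b[0..N-1]

def coeffs(f):
    """Numerically project a 2*pi-periodic f onto cos/sin modes 0..N-1."""
    M = 2048
    xs = [2 * math.pi * k / M for k in range(M)]
    a = [2 / M * sum(f(x) * math.cos(n * x) for x in xs) for n in range(N)]
    b = [2 / M * sum(f(x) * math.sin(n * x) for x in xs) for n in range(N)]
    return a, b

def add(fc, gc):
    # f + g as "points": just add the coefficient vectors
    (a1, b1), (a2, b2) = fc, gc
    return ([u + v for u, v in zip(a1, a2)], [u + v for u, v in zip(b1, b2)])

def evaluate(fc, x):
    a, b = fc
    return a[0] / 2 + sum(a[n] * math.cos(n * x) + b[n] * math.sin(n * x)
                          for n in range(1, N))

f = coeffs(math.sin)
g = coeffs(math.cos)
h = add(f, g)
print(evaluate(h, 0.7), math.sin(0.7) + math.cos(0.7))  # agree to truncation error
```

Multiplication of two such "points" can be derived similarly (it becomes a convolution of coefficients), and that is exactly the kind of operation a library can define once for everyone.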

Now it turns out you can formally use this to do cool things. A partial differential equation (PDE) is an ODE where, instead of your values being scalars at each time $t$, your values are now functions at each time $t$. So what if you represent those "functions" as "scalars" via their representation in the Sobolev space, and then put those "scalars" into the ODE solver? You automatically transform your ODE solver code into a PDE solver code. Formally, this is using a branch of PDE theory known as semigroup theory and making it a computable object.

It turns out this is something you can do. ApproxFun.jl defines types ``Fun`` which represent functions as scalars in a Banach space, and defines a bunch of operations that are allowed for such functions. I showed in a previous talk that you can slap these into DifferentialEquations.jl to have it reinterpret into this function-based differential equation solver, and then start to solve PDEs via this representation.

Automatic differentiation gets all of the attention, but it's really this idea of non-standard interpretation that we should be looking at. "How do I change the semantics of this program to solve another problem?" is a very powerful approach. In computer science it's often used for debugging: recompile this program into the debugging version. And in machine learning it's often used to recompile programs into derivative calculators. But uncertainty quantification, fast sparsity detection, automatic stabilization and parallelization of differential-algebraic equations, and automatic generation of PDE solvers all arise from the same little trick. That can't be all there is out there: the formalism and theory of non-standard interpretation seems like it could be a much richer field.

- ModelingToolkit: A Composable Graph Transformation System For Equation-Based Modeling
- Sparsity Programming: Automated Sparsity-Aware Optimizations in Differentiable Programming
- Uncertainty propagation with functionally correlated quantities



The post JuliaCall Update: Automated Julia Installation for R Packages appeared first on Stochastic Lifestyle.

Under the hood it's using the DifferentialEquations.jl package and the SciML stack, but it's abstracted from users so much that Julia is essentially an alternative to Rcpp with easier interactive development. The following example really brings the seamless integration home:

```r
install.packages("diffeqr")
library(diffeqr)
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function(u, p, t) {
  du1 = p[1] * (u[2] - u[1])
  du2 = u[1] * (p[2] - u[3]) - u[2]
  du3 = u[1] * u[2] - p[3] * u[3]
  c(du1, du2, du3)
}
u0 <- c(1.0, 1.0, 1.0)
tspan <- c(0.0, 100.0)
p <- c(10.0, 28.0, 8/3)
prob <- de$ODEProblem(lorenz, u0, tspan, p)
fastprob <- diffeqr::jitoptimize_ode(de, prob)
prob_func <- function(prob, i, rep) {
  de$remake(prob, u0 = runif(3) * u0, p = runif(3) * p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy = FALSE)
sol <- de$solve(ensembleprob, de$Tsit5(), degpu$EnsembleGPUArray(),
                trajectories = 10000, saveat = 0.01)
```

This example does the following:

- Automatically installs Julia
- Automatically installs DifferentialEquations.jl
- Automatically installs CUDA (via CUDA.jl)
- Automatically installs ModelingToolkit.jl and DiffEqGPU.jl
- JIT transpiles the R function to Julia via ModelingToolkit
- Uses KernelAbstractions (in DiffEqGPU) to generate a CUDA kernel from the Julia code
- Solves the ODE 10,000 times with different parameters, about 350x faster than deSolve

What complicated code! Well, maybe it would shock you to know that the source code for the diffeqr package is only 150 lines of code. Of course, it's powered by a lot of Julia magic in the backend, and so can your next package be. For more details, see the big long post about differential equation solving in R with Julia.

The post JuliaCall Update: Automated Julia Installation for R Packages appeared first on Stochastic Lifestyle.

This is definitely not the first time this question was asked. The statistics libraries in Julia were developed by individuals like Douglas Bates, who built some of R's most widely used packages like lme4 and Matrix. Doug had written a blog post in 2018 showing how to get top-notch performance in linear mixed effects model fitting via JuliaCall. In 2018 the JuliaDiffEq organization had written a blog post demonstrating the use of DifferentialEquations.jl in R and Python (the Jupyter of Differential Equations). Now rebranded as SciML for Scientific Machine Learning, we looked to expand our mission and bring automated model discovery and acceleration to other languages like R and Python, with Julia as the base.

With the release of diffeqr v1.0, we can now demonstrate many advances in R through the connection to Julia. Specifically, I would like to use this blog post to showcase:

- The new direct wrapping interface of diffeqr
- JIT compilation and symbolic analysis of ODEs and SDEs in R using Julia and ModelingToolkit.jl
- GPU-accelerated simulations of ensembles using Julia's DiffEqGPU.jl

Together we will demonstrate how models in R can be accelerated by 1000x without a user ever having to write anything but R.

Before continuing on with showing all of the features, I wanted to ask for support so that we can continue developing these bridged libraries. Specifically, I would like to be able to support developers interested in providing a fully automated Julia installation and static compilation so that calling into Julia libraries is just as easy as any Rcpp library. To show support, the easiest thing to do is to star our libraries. The work of this blog post is built on DifferentialEquations.jl, diffeqr, ModelingToolkit.jl, and DiffEqGPU.jl. Thank you for your patience, and now back to our regularly scheduled program.

First let me start with the new direct wrappers of the differential equation solvers in R. In previous iterations of diffeqr, we had relied on specifically designed high-level functions, like "ode_solve", to compensate for the fact that one could not use Julia's original DifferentialEquations.jl interface directly from R. However, the new diffeqr v1.0 directly exposes the entirety of the Julia library in an easy-to-use framework.

To demonstrate this, let's see how to define the Lorenz ODE with diffeqr. In Julia's DifferentialEquations.jl, we would start by defining an "ODEProblem" that contains the initial condition u0, the time span, the parameters, and the f in terms of `u' = f(u,p,t)` that defines the derivative. In Julia, this would look like:

```julia
using DifferentialEquations
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
sol = solve(prob,saveat=1.0)
```

With the new diffeqr, diffeq_setup() is a function that does a few things:

- It instantiates a Julia process to utilize as its underlying compute engine
- It checks whether the correct Julia libraries are installed and, if not, installs them for you
- Then it exposes all of the functions of DifferentialEquations.jl into its object

What this means is that the following is the complete diffeqr v1.0 code for solving the Lorenz equation:

```r
library(diffeqr)
de <- diffeqr::diffeq_setup()
f <- function(u,p,t) {
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  return(c(du1,du2,du3))
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(f, u0, tspan, p)
sol <- de$solve(prob,saveat=1.0)
```

This then carries on through SDEs, DDEs, DAEs, and more. Through this direct exposing form, the whole library of DifferentialEquations.jl is at the fingertips of any R user, making it a truly cross-language platform.

(Note that diffeq_setup installs Julia for you if it's not already installed!)

The reason for Julia is speed (well and other things, but here, SPEED!). Using the pure Julia library, we can solve the Lorenz equation 100 times in about 0.05 seconds:

```julia
@time for i in 1:100 solve(prob,saveat=1.0) end
# 0.048237 seconds (156.80 k allocations: 6.842 MiB)
# 0.048231 seconds (156.80 k allocations: 6.842 MiB)
# 0.048659 seconds (156.80 k allocations: 6.842 MiB)
```

Using the diffeqr-connected version, we get:

```r
lorenz_solve <- function (i){
  de$solve(prob,saveat=1.0)
}

> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.81    0.02    6.83
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   7.09    0.00    7.10
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.78    0.00    6.79
```

That's not good: that's about a 100x difference! In this blog post I described how interpreter overhead and context switching are the main causes of this issue. We've also demonstrated that ML accelerators like PyTorch generally do not perform well in this regime, since those kinds of accelerators rely on heavy array operations, unlike the scalarized nonlinear interactions seen in a lot of differential equation modeling. For this reason we cannot just slap any old JIT compiler onto the f call and then put it into the function, since there would still be overhead left over. So we need to do something a bit tricky.

In my JuliaCon 2020 talk, Automated Optimization and Parallelism in DifferentialEquations.jl, I demonstrated how ModelingToolkit.jl can be used to trace functions and generate highly optimized sparse and parallel code for scientific computing, all in an automated fashion. It turns out that JuliaCall can do a form of tracing on R functions, something that was exploited to allow autodiffr to automatically differentiate R code with Julia's AD libraries. Thus it turns out that the same modelingtoolkitization methods used in AutoOptimize.jl can be used on a subset of R code which includes a large majority of differential equation models.
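To give a flavor of what tracing means here, below is a toy sketch (in Python, with invented names; the real machinery lives in JuliaCall and ModelingToolkit.jl) in which calling an ordinary function on symbolic stand-ins records the expression it computes:

```python
# Tracing via operator overloading: pass symbolic placeholders through a
# plain numeric function, and the overloaded operators record an
# expression tree instead of computing numbers.
class Sym:
    def __init__(self, expr):
        self.expr = expr
    def __add__(self, o):  return Sym(f"({self.expr} + {to_expr(o)})")
    def __sub__(self, o):  return Sym(f"({self.expr} - {to_expr(o)})")
    def __mul__(self, o):  return Sym(f"({self.expr} * {to_expr(o)})")
    def __rmul__(self, o): return Sym(f"({to_expr(o)} * {self.expr})")

def to_expr(x):
    return x.expr if isinstance(x, Sym) else repr(x)

def model(u, p):
    # an ordinary scalar model, written with no symbolic code in mind
    return p * (u * u) - u

traced = model(Sym("u"), Sym("p"))
print(traced.expr)   # ((p * (u * u)) - u)
```

Once the model exists as an expression rather than opaque interpreted code, a system like ModelingToolkit can simplify it, compute Jacobians symbolically, and emit fast compiled code from it.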

In short, we can perform automated acceleration of R code by turning it into sparse parallel Julia code. This was exposed in diffeqr v1.0 as the `jitoptimize_ode(de,prob)` function (also `jitoptimize_sde(de,prob)`). Let's try it out on this example. All you need to do is give it the ODEProblem which you wish to accelerate. Let's take the last problem, turn it into a pure Julia-defined problem, and then time it:

```r
fastprob <- diffeqr::jitoptimize_ode(de,prob)
fast_lorenz_solve <- function (i){
  de$solve(fastprob,saveat=1.0)
}

> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.05    0.00    0.04
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06
```

And there you go: an R user can get the full benefits of Julia's optimizing JIT compiler without having to write a lick of Julia code! This function also did a few other things, like automatically defining the Jacobian code to make implicit solving of stiff ODEs much faster, and it can perform sparsity detection and automatically optimize computations based on that.

To see how much of an advance this is, note that this Lorenz equation is the same as the one on the deSolve examples page. So let's take their example and see how well it performs:

```r
library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.03    0.03    5.07
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.42    0.00    5.44
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.41    0.00    5.41
```

Thus we see 100x acceleration over the leading R library without users having to write anything but R code. This is the true promise in action of a "language of libraries" helping to extend all other high-level languages!

What about writing C code and directly calling it with deSolve? It turns out that's still not as efficient as this JIT compilation. Following the tutorial from deSolve on how to write an optimized Lorenz function, we first define the following function in C:

```c
/* file lorenz.c */
#include <R.h>

static double parms[3];
#define a parms[0]
#define b parms[1]
#define c parms[2]

/* initializer */
void initmod(void (* odeparms)(int *, double *)) {
  int N = 3;
  odeparms(&N, parms);
}

/* Derivatives */
void derivs(int *neq, double *t, double *y, double *ydot, double *yout, int *ip) {
  ydot[0] = a * y[0] + y[1] * y[2];
  ydot[1] = b * (y[1] - y[2]);
  ydot[2] = - y[0] * y[1] + c * y[1] - y[2];
}
```

Then we use system("R CMD SHLIB lorenz.c") in R in order to compile this function to a .dll. Now we can call it from R:

```r
library(deSolve)
dyn.load("lorenz.dll")
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  ode(state, times, func = "derivs", parms = parameters,
      dllname = "lorenz", initfunc = "initmod")
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   0.09    0.00    0.09
```

**Notice that even when rewriting the function in C, this is still almost 2x slower than the directly JIT-compiled R code!** This means that users, with less work, can get faster code than what they had before!

Can we go deeper? Yes we can. In many cases, like in optimization and sensitivity analysis of models for pharmacology, the users need to solve the same ODE thousands or millions of times to understand the behavior over a large parameter space. To solve this problem well in Julia, we built DiffEqGPU.jl, which transforms the pure Julia function into a .ptx kernel and then parallelizes the ODE solver over it. What this looks like is the following, which solves the Lorenz equation 100,000 times with randomized initial conditions and parameters:

```julia
using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(3).*u0,p=rand(3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
```

Notice how this is only two lines of code different from what we had before, and now everything is GPU accelerated! The requirement for this to work is that the ODE/SDE/DAE function has to be written in Julia... but diffeqr::jitoptimize_ode(de,prob) accelerates the ODE solving in R by generating a Julia function, so could that mean...?

Yes, it does mean we can use DiffEqGPU directly on ODEs defined in R. Let's see this in action. Once again, we will write almost exactly the same code as in Julia, except with `de$` and with diffeqr::jitoptimize_ode(de,prob) to JIT compile our ODE definition. What this looks like is the following:

```r
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
sol = de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0)
```

Note that diffeqr::diffeqgpu_setup() does the following:

- It sets up the drivers and installs the right version of CUDA for the user if not already available
- It installs the DiffEqGPU library
- It exposes the pieces of the DiffEqGPU library for the user to then call onto ensembles

This means that this portion of the library is fully automated, all the way down to the installation of CUDA! Let's time this out a bit. 100,000 ODE solves in serial:

```julia
@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,saveat=1.0f0)
# 15.045104 seconds (18.60 M allocations: 2.135 GiB, 4.64% gc time)
# 14.235984 seconds (16.10 M allocations: 2.022 GiB, 5.62% gc time)
```

100,000 ODE solves on the GPU in Julia:

```julia
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
# 2.071817 seconds (6.56 M allocations: 1.077 GiB)
# 2.148678 seconds (6.56 M allocations: 1.077 GiB)
```

Now let's check R in serial:

```r
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  24.16    1.27   25.42
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  25.45    0.94   26.44
```

and R on GPUs:

```r
> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  12.39    1.51   13.95
> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  12.55    1.36   13.90
```

R doesn't reach quite the level of Julia here, and if you profile you'll see it's because the `prob_func`, i.e. the function that tells you which problems to solve, is still a function written in R and this becomes the bottleneck as the computation becomes faster and faster. Thus you will get closer and closer to the Julia speed with longer and harder ODEs, but it still means there's work to be done. Another detail is that the Julia code is able to be further accelerated by using 32-bit numbers. Let's see that in action:

```julia
using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = Float32[1.0,1.0,1.0]
tspan = (0.0f0,100.0f0)
p = Float32[10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(Float32,3).*u0,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
# 1.781718 seconds (6.55 M allocations: 918.051 MiB)
# 1.873190 seconds (6.56 M allocations: 917.875 MiB)
```

Right now the Julia to R bridge converts all 32-bit numbers back to 64-bit numbers so this doesn't seem to be possible without the user writing some Julia code, but we hope to get this fixed in one of our coming releases.

To figure out where that leaves us, let's use deSolve to solve that same Lorenz equation 100 and 1,000 times:

```r
library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.06    0.00    5.13
> system.time({ lapply(1:1000,desolve_lorenz_solve) })
   user  system elapsed
  55.68    0.03   55.75
```

We see the expected linear scaling of a scalar code, so we can extrapolate out and see that to solve 100,000 ODEs it would take deSolve 5000 seconds, as opposed to the 14 seconds of diffeqr or the 1.8 seconds of Julia. In summary:

- Pure R diffeqr offers a **350x acceleration over deSolve**
- Pure Julia DifferentialEquations.jl offers a **2777x acceleration over deSolve**

And deSolve is not shabby: it's a library that calls Fortran libraries under the hood!
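As a quick sanity check of those summary numbers, here is a back-of-the-envelope sketch (in Python; the figures are taken from the timings above, with the ~5000 s deSolve figure being the rounded extrapolation):

```python
# Extrapolate deSolve's linear scaling and compare against the measured
# GPU-accelerated times quoted above.
desolve_100k = 5000.0   # s, extrapolated from 55.75 s per 1,000 solves (~5,575 s, rounded down)
diffeqr_100k = 14.0     # s, R + jitoptimize_ode + EnsembleGPUArray
julia_100k   = 1.8      # s, pure Julia with Float32 on the GPU

print(round(desolve_100k / diffeqr_100k))  # 357  -> the "~350x" figure
print(round(desolve_100k / julia_100k))    # 2778 -> the "~2777x" figure
```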

We hope that the R community has enjoyed this release and will enjoy our future releases as well. We hope to continue building further connections to Python as well. Together this will make Julia a true language of libraries that can be used to accelerate scientific computation in the surrounding higher level scientific ecosystem.

The post GPU-Accelerated ODE Solving in R with Julia, the Language of Libraries appeared first on Stochastic Lifestyle.

Chris Rackauckas

Applied Mathematics Instructor, MIT

Senior Research Analyst, University of Maryland, Baltimore School of Pharmacy

This was a seminar talk given to the COVID modeling journal club on scientific machine learning for epidemic modeling.

Resources:

https://sciml.ai/

https://diffeqflux.sciml.ai/dev/

https://datadriven.sciml.ai/dev/

https://docs.sciml.ai/latest/

https://safeblues.org/

The post COVID-19 Epidemic Mitigation via Scientific Machine Learning (SciML) appeared first on Stochastic Lifestyle.

Cheap But Effective: Instituting Effective Pandemic Policies Without Knowing Who's Infected

Chris Rackauckas

MIT Applied Mathematics Instructor

One way to find out how many people are infected is to figure out who's infected, but that's working too hard! In this talk we will look into cheaper alternatives for effective real-time policy making. To this end we introduce SafeBlues, a project that simulates fake virus strands over Bluetooth and utilizes deep neural networks mixed within differential equations to accurately approximate infection statistics weeks before updated statistics are available. We then introduce COEXIST, a quarantine policy which utilizes inexpensive "useless" tests to perform accurate regional case isolation. This work is all being done as part of the Microsoft Pandemic Modeling Project, where the Julia SciML tooling has accelerated the COEXIST simulations by 36,000x and quantitative systems pharmacology simulations for Pfizer by 175x in support of the efforts against COVID-19.

The post Cheap But Effective: Instituting Effective Pandemic Policies Without Knowing Who's Infected appeared first on Stochastic Lifestyle.

Differentiable programming is a subset of modeling where you model with a program in which each of the steps is differentiable, for the purpose of being able to find the correct program via parameter fitting using said derivatives. Just like any modeling domain, different problems have different code styles which must be optimized in different ways. Traditional scientific computing code makes use of mutable buffers, writing out nonlinear scalar operations and avoiding memory allocations in order to keep top performance. On the other hand, many machine learning libraries allocate a ton of temporary arrays due to using out-of-place matrix multiplications, [which is fine because dense linear algebra costs grow much faster than the costs of the allocations](https://www.stochasticlifestyle.com/when-do-micro-optimizations-matter-in-scientific-computing/). Some need sparsity everywhere, others just need to fuse and build the fastest dense kernels possible. Some algorithms do great on GPUs, while some do not. This intersection between scientific computing and machine learning, i.e. scientific machine learning and other applications of differentiable programming, is too large of a domain for one approach to make everyone happy. And if an AD system is unable to reach top-notch performance for a specific subdomain, it's simply better for the hardcore package author to not use the AD system and instead write their own adjoints.

Even worse is the fact that mathematically there are many cases where you should write your own adjoints, since differentiating through the code is very suboptimal. Any iterative algorithm is of this sort: a nonlinear solve f(x)=0 may use Newton's method to find the root x* with f(x*)=0, but the adjoint only needs f'(x*) at that solution, so there's no need to ever differentiate through Newton's iterations. So we should all be writing adjoints! Does this mean that the story of differentiable programming is destroyed? Is it just always better to not do differentiable programming, so any hardcore library writer will ignore it?
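To spell out the Newton example, here is a small sketch (in Python, with invented helper names; this is the implicit-function-theorem idea, not any particular library's API) where the solve runs Newton's method but the derivative comes from the solution point alone, never touching the iterations:

```python
# Differentiate through a nonlinear solve f(x, p) = 0 WITHOUT
# differentiating through Newton's iterations. By the implicit function
# theorem, dx*/dp = -f_p / f_x evaluated at the solution x*.
# Toy instance: f(x, p) = x^2 - p, so x* = sqrt(p) and dx*/dp = 1/(2 sqrt(p)).

def newton_solve(f, fx, x0, p, tol=1e-12):
    x = x0
    for _ in range(100):            # plain Newton iteration
        step = f(x, p) / fx(x, p)
        x -= step
        if abs(step) < tol:
            break
    return x

f  = lambda x, p: x * x - p
fx = lambda x, p: 2 * x             # df/dx
fp = lambda x, p: -1.0              # df/dp

p = 4.0
xstar = newton_solve(f, fx, 1.0, p)      # root: 2.0
dxdp = -fp(xstar, p) / fx(xstar, p)      # adjoint rule, uses only x*
print(xstar, dxdp)                       # 2.0  0.25  (= 1/(2*sqrt(4)))
```

However many Newton steps the solver takes, the derivative computation is one linear solve at x*, which is exactly why a hand-written adjoint beats tracing the loop.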

Instead of falling into that despair, let's just follow down that road with a positive light. Let's assume that the best way to do differentiable programming is to write adjoints on library code. Then what's the purpose of a differentiable programming system? It's to help your adjoints get written and be useful. It's just matrix multiplications in machine learning. If the majority of the code is in some optimized kernel, then you don't need to worry about the performance of much of the rest: you just want it to work. With differentiable programming, if 99% of the computation is in the DiffEq/NLsolve/FFTW/etc. adjoints, what we need from a differentiable programming system is something that will get the rest of the adjoint done and be very easy to make correct. The way to facilitate this kind of workflow would be for the differentiable programming system to:

- Have very high coverage of the language. Sacrifice some speed if it needs to, that's okay, because if 99% of the compute time is in my adjoint, then I don't want that 1% to be hard to develop. It should just work, however it works.
- Be easy to debug and profile. Stacktraces should point to real code. Profiles should point to real lines of code. Errors should be legible.
- Have a language-wide system for defining adjoints. We can't have walled gardens if the way to "get good" is to have adjoints for everything: we need everyone to plug in and distribute the work. Not to just the developers of one framework, and not just to the users of one framework, but to every scientific developer in the entire programming language.
- Make it easy to swap out AD systems. More constrained systems may be more optimized, and if I don't want to define an adjoint, at least I can fallback to something that (a) works on my code and (b) matches its assumptions.

Thus what I think we're looking for is not one differentiable programming system that is the best in all aspects, but instead we're looking for a differentiable programming system that can glue together everything that's out there. "Differentiate all of the things, but also tell me how to do things better". We're looking for a glue AD.

Zygote is surprisingly close to being this perfect glue AD. Its stacktraces and profiling are fairly good because they point to the pieces generating the backpasses. It just needs some focus on this goal if it wants to attain it. For (1), it would need to get higher coverage of the language, focusing on its expanse more so than doing everything as fast as possible. Of course, it should do as well as it can, but, for example, if it needs to sacrifice a bit of speed to get full support for mutability today, that might be a good trade-off if the goal is to be a glue AD. Perfect? No, but that would give you the coverage to then tell the user that, if they need more on a particular piece of code, they should seek out more. To seek out more performance, users could just have Zygote call ReverseDiff.jl on a function and have that compile the tape (or other specialized AD systems which will be announced more broadly soon), or may want to write a partial adjoint.

So (4) is really the kicker. If I were to hit slow mutating code today inside of a differential equation, it would probably be something perfect for ModelingToolkit.jl to handle, so the best thing to do is to build hyper-optimized adjoints of that differential equation using ModelingToolkit.jl. At that level, I can handle it symbolically and generate code that a compiler couldn't, because I can make a lot of extra assumptions, like cos^2(x) + sin^2(x) = 1, in my mathematical context. I can move code around, auto-parallelize it, etc. easily because of the simple static graph I'm working on. Wouldn't it be a treat to just `@ModelingToolkitAdjoint f` and bingo, now it's using ModelingToolkit on a portion of code? `@ForwardDiffAdjoint f` to tell it "you should use forward mode here". Yota.jl is a great reverse-mode project, so `@YotaAdjoint f` and boom, that could be more optimized than Zygote in some cases. `@ReverseDiff f` and let it compile the tape, and it'll get fairly optimal in the places where ReverseDiff.jl is applicable.

Julia is the perfect language to develop such a system for because its AST is so nice and constrained for mathematical contexts that all of these AD libraries do not work on a special DSL language like TensorFlow graphs or torch.numpy, but instead work directly on the language itself and its original existing libraries. With ChainRules.jl allowing for adjoint overloads that apply to all AD packages, focusing on these "Glue AD" properties could really open up the playing field, allowing Zygote to be at the center of an expansive differentiable programming world that works everywhere, maybe makes some compromises to do so, but then gives a system for other developers to make assumptions, easily define adjoints, and plug alternative AD systems into the whole game. This is a true mixed mode which incorporates not just forward and reverse, but also different implementations with different performance profiles (and this can be implemented just through ChainRules overloads!). Zygote would then facilitate this playing field with a solid debugging and profiling experience, along with a very high chance of working on your code on your first try. That, plus buy-in by package authors, would be a true solution to differentiable programming.
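One way to picture this ChainRules-style "glue" is a registry of user-supplied adjoints that any reverse-mode engine can consult instead of tracing inside library functions. The sketch below (in Python, with all names invented) shows the shape of the idea, not any real API:

```python
# A registry maps functions to hand-written vjp (vector-Jacobian product)
# rules; a tiny reverse-mode engine looks rules up instead of tracing
# inside the function. Library authors contribute rules; engines consume them.
import math

ADJOINTS = {}                         # function -> vjp rule

def defadjoint(fn, vjp):
    ADJOINTS[fn] = vjp

class Var:
    def __init__(self, val, grad_fn=None):
        self.val, self.grad = val, 0.0
        self.grad_fn = grad_fn        # propagates an upstream gradient

def call(fn, x):
    vjp = ADJOINTS[fn]                # consult the registry
    y = Var(fn(x.val))
    def grad_fn(g):
        x.grad += vjp(x.val, g)       # apply the author's rule, no tracing
        if x.grad_fn:
            x.grad_fn(x.grad)
    y.grad_fn = grad_fn
    return y

# a "library" function with a hand-written adjoint: sin is never traced
defadjoint(math.sin, lambda x, g: g * math.cos(x))

x = Var(0.0)
y = call(math.sin, x)
y.grad_fn(1.0)                        # seed dL/dy = 1
print(y.val, x.grad)                  # 0.0 1.0  (cos(0) = 1)
```

Because the rule lives in a shared registry rather than in one engine, a second engine with a different performance profile could consume the very same rules, which is the mixed-mode point made above.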

The post Glue AD for Full Language Differentiable Programming appeared first on Stochastic Lifestyle.

Chris Rackauckas (MIT), "Generalized Physics-Informed Learning through Language-Wide Differentiable Programming"

Scientific computing is increasingly incorporating the advancements in machine learning to allow for data-driven physics-informed modeling approaches. However, re-targeting existing scientific computing workloads to machine learning frameworks is both costly and limiting, as scientific simulations tend to use the full feature set of a general purpose programming language. In this manuscript we develop an infrastructure for incorporating deep learning into existing scientific computing code through Differentiable Programming (∂P). We describe a ∂P system that is able to take gradients of full Julia programs, making Automatic Differentiation a first class language feature and compatibility with deep learning pervasive. Our system utilizes the one-language nature of Julia package development to augment the existing package ecosystem with deep learning, supporting almost all language constructs (control flow, recursion, mutation, etc.) while generating high-performance code without requiring any user intervention or refactoring to stage computations. We showcase several examples of physics-informed learning which directly utilizes this extension to existing simulation code: neural surrogate models, machine learning on simulated quantum hardware, and data-driven stochastic dynamical model discovery with neural stochastic differential equations.

Code is available at https://github.com/MikeInnes/zygote-paper

AAAI 2020 Spring Symposium on Combining Artificial Intelligence and Machine Learning with Physics Sciences, March 23-25, 2020 (https://sites.google.com/view/aaai-mlps)

https://figshare.com/articles/presentation/Generalized_Physics-Informed_Learning_through_Language-Wide_Differentiable_Programming/12751934

The post Generalized Physics-Informed Learning through Language-Wide Differentiable Programming (Video) appeared first on Stochastic Lifestyle.

]]>Colloquium with Chris Rackauckas

Department of Mathematics

Massachusetts Institute of Technology

"Universal Differential Equations for Scientific Machine Learning"

Feb 19, 2020, 3:30 p.m., 499 DSL

https://arxiv.org/abs/2001.04385

Abstract:

In the context of science, the well-known adage "a picture is worth a thousand words" might well be "a model is worth a thousand datasets." Scientific models, such as Newtonian physics or biological gene regulatory networks, are human-driven simplifications of complex phenomena that serve as surrogates for the countless experiments that validated the models. Recently, machine learning has been able to overcome the inaccuracies of approximate modeling by directly learning the entire set of nonlinear interactions from data. However, without any predetermined structure from the scientific basis behind the problem, machine learning approaches are flexible but data-expensive, requiring large databases of homogeneous labeled training data. A ... READ MORE

The post Universal Differential Equations for Scientific Machine Learning (Video) appeared first on Stochastic Lifestyle.

]]>Colloquium with Chris Rackauckas

Department of Mathematics

Massachusetts Institute of Technology

"Universal Differential Equations for Scientific Machine Learning"

Feb 19, 2020, 3:30 p.m., 499 DSL

https://arxiv.org/abs/2001.04385

Abstract:

In the context of science, the well-known adage "a picture is worth a thousand words" might well be "a model is worth a thousand datasets." Scientific models, such as Newtonian physics or biological gene regulatory networks, are human-driven simplifications of complex phenomena that serve as surrogates for the countless experiments that validated the models. Recently, machine learning has been able to overcome the inaccuracies of approximate modeling by directly learning the entire set of nonlinear interactions from data. However, without any predetermined structure from the scientific basis behind the problem, machine learning approaches are flexible but data-expensive, requiring large databases of homogeneous labeled training data. A central challenge is reconciling data that is at odds with simplified models without requiring "big data". In this work we develop a new methodology, universal differential equations (UDEs), which augments scientific models with machine-learnable structures for scientifically-based learning. We show how UDEs can be utilized to discover previously unknown governing equations, accurately extrapolate beyond the original data, and accelerate model simulation, all in a time and data-efficient manner. This advance is coupled with open-source software that allows for training UDEs which incorporate physical constraints, delayed interactions, implicitly-defined events, and intrinsic stochasticity in the model. Our examples show how a diverse set of computationally-difficult modeling issues across scientific disciplines, from automatically discovering biological mechanisms to accelerating climate simulations by 15,000x, can be handled by training UDEs.

The post Universal Differential Equations for Scientific Machine Learning (Video) appeared first on Stochastic Lifestyle.

]]>- Machine learning models require big data to train
- Machine learning models cannot extrapolate well outside of their training data
- Machine learning models are not interpretable

However, in our recent paper, we have shown that this does not have to be the case. In Universal Differential Equations for Scientific Machine Learning, we start by showing the following figure:

Indeed, it shows that by only seeing the tiny first part of the time series, we can automatically learn the equations in such a manner that it predicts the time series will be cyclic in the future, in a way that even gets ... READ MORE

The post Scientific Machine Learning: Interpretable Neural Networks That Accurately Extrapolate From Small Data appeared first on Stochastic Lifestyle.

]]>- Machine learning models require big data to train
- Machine learning models cannot extrapolate well outside of their training data
- Machine learning models are not interpretable

However, in our recent paper, we have shown that this does not have to be the case. In Universal Differential Equations for Scientific Machine Learning, we start by showing the following figure:

Indeed, it shows that by only seeing the tiny first part of the time series, we can automatically learn the equations in such a manner that it predicts the time series will be cyclic in the future, in a way that even gets the periodicity correct. Not only that, but our result is not a neural network; rather, the program itself spits out the LaTeX for the differential equation:

with the correct coefficients, which is exactly how we generated the data.

Rather than just explaining the method, what I really want to convey in this blog post is why it intuitively makes sense that our methods work. Intuitively, **we utilize all of the known scientific structure to embed as much prior knowledge as possible**. This is the opposite of most modern machine learning, which tries to use black-box architectures to fit as wide a range of behaviors as possible. Instead, what we do is look at our problem and ask: what do I know has to be true about the system, and how can I constrain the neural network so that the parameter search only considers cases where it holds? In the context of science, we do so by directly embedding neural networks into existing scientific simulations, essentially saying that the model is at least approximately accurate, so our neural network should only learn what the scientific model didn't cover or simplified away. Our approach has many more applications than what we show in the paper, and if you know the underlying idea, it is quite straightforward to apply to your own work.

The starting point for universal differential equations is the now-classic work on neural ordinary differential equations. Neural ODEs are defined by the equation

$$u' = \mathrm{NN}_\theta(u, t),$$

i.e. it's an arbitrary function described as the solution to an ODE whose right-hand side is a neural network. The reason the authors went down this route is that it's a continuous form of a recurrent neural network, which makes it natural for handling irregularly-spaced time series data.
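To make this concrete, here is a minimal sketch of the neural ODE idea in Python with NumPy (standing in for the actual Julia tooling): the right-hand side of the ODE is a tiny hand-rolled network, and we integrate it with forward Euler. The network sizes and random weights here are arbitrary placeholders, not a trained model, and real solvers use adaptive Runge-Kutta methods rather than Euler.

```python
import numpy as np

def neural_net(u, W1, b1, W2, b2):
    # A tiny two-layer network standing in for the learned right-hand side.
    return W2 @ np.tanh(W1 @ u + b1) + b2

def solve_neural_ode(u0, params, t0, t1, dt=0.01):
    # Forward-Euler integration of u' = NN(u, t); enough to show the idea.
    u, t = np.array(u0, dtype=float), t0
    while t < t1:
        u = u + dt * neural_net(u, *params)
        t += dt
    return u

rng = np.random.default_rng(0)
params = (0.1 * rng.standard_normal((8, 2)), np.zeros(8),
          0.1 * rng.standard_normal((2, 8)), np.zeros(2))
u_final = solve_neural_ode([1.0, 0.0], params, 0.0, 1.0)
```

Training would then adjust `params` so the ODE's solution matches observed time series points, which is exactly where adjoint methods and fast ODE solvers come into play.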

However, this formulation can have another interesting purpose. ODEs, and differential equations in general, are well-studied because they are the language of science. Newtonian physics is described in terms of differential equations. So are Einstein's equations, and quantum mechanics. Not only physics, but also biological models of chemical reactions in cells, population sizes in ecology, the motion of fluids, etc.: differential equations are really the language of science.

Thus it's no surprise that in many cases we already know some differential equations. They may be an approximate model, but we know this approximation is "true" in some sense. This is the jumping-off point for the **universal differential equation**. Instead of trying to make the entire differential equation a neural-network-like object, since science is encoded in differential equations, it would be scientifically informative to learn the differential equation itself. But in any scientific context we already know parts of the differential equation, so **we might as well hard code that information as a form of structural prior knowledge**. This gives us the form of the universal differential equation:

$$u' = f(u, t, U_\theta(u, t)),$$

where $U_\theta$ is an arbitrary universal approximator, i.e. a finite-parameter object that can represent "any possible function". It just so happens that neural networks are universal approximators, but other forms, like Chebyshev polynomials, have this property as well; neural networks simply do well in high dimensions and on irregular grids (properties we utilize in some of our other examples).

What happens when we describe a differential equation in this form is that **the trained neural network becomes a tangible numerical approximation of the missing function**. By doing this, we can train a program that has the exact same input/output behavior as the missing term of our model. And this is precisely what we do. We assume we only know part of the differential equation:

$$x' = \alpha x + U_1(x, y), \qquad y' = -\delta y + U_2(x, y),$$

and train the neural networks so that the embedded networks define a universal ODE that fits our data. When trained, each neural network is a numerical approximation to a missing function. But since it's just a simple function, it's fairly straightforward to plot it and say "hey! We were missing a quadratic term", and there you go: interpreted back to the original generating equations. In the paper we describe how to make use of the SInDy technique to make this more rigorous through a sparse regression against a basis of possible terms, but the story is the same: in the end we learn exactly the differential equations that generated the data, and hence the extrapolation accuracy even beyond the original time series and the nice picture:
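The sparse-regression step can be sketched in a few lines. This is not the actual DataDrivenDiffEq implementation, just a NumPy illustration of the idea: we sample a stand-in for the trained network (here the true missing Lotka-Volterra interaction term, with a made-up coefficient of -0.9), regress it against a library of candidate terms, and threshold small coefficients to zero, leaving only the term that generated the data.

```python
import numpy as np

# Stand-in for the trained neural network; the -0.9*x*y form and its
# coefficient are hypothetical, chosen only to illustrate the recovery.
def learned_term(x, y):
    return -0.9 * x * y

# Sample on a grid and regress against a basis of candidate terms.
xg, yg = np.meshgrid(np.linspace(0.5, 2.0, 20), np.linspace(0.5, 2.0, 20))
x, y = xg.ravel(), yg.ravel()
basis = np.column_stack([x, y, x**2, x * y, y**2])  # candidate library
target = learned_term(x, y)

coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
coef[np.abs(coef) < 0.05] = 0.0  # sparsify: keep only significant terms
# coef now isolates the x*y column, interpreting the network symbolically
```

The surviving nonzero coefficient names the missing mechanism in closed form, which is what makes the trained model interpretable rather than a black box.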

Trying to approximate data can be a much harder problem than trying to understand the processes that create the data. Indeed, the Lotka-Volterra equations are a simple set of equations defined by 4 interpretable terms. The first simply states that the number of rabbits would grow exponentially if there weren't predators eating them. The second states that the number of rabbits goes down when they are eaten by predators (and more predators means more eating). The third states that more prey means more food and thus growth of the wolf population. Finally, the wolves die off with an exponential decay due to old age.

That's it: a simple set of quadratic equations describing 4 mechanisms of interaction. Each mechanism is interpretable and can be independently verified. Meanwhile, that cyclic solution that is the data set? The time series itself is such a complicated function that you can prove there is no way to even express its analytical solution! This phenomenon is known as **emergence: simple mechanisms can give rise to complex behavior**. From this it should be clear that a method trying to predict a time series has a much more difficult problem to solve than one trying to learn mechanisms!
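Those four mechanisms can be written down directly. Below is a small Python sketch (the parameter values are illustrative, not those from the paper) showing that this simple quadratic right-hand side produces the complex cyclic time series when integrated:

```python
import numpy as np

def lotka_volterra(u, a=1.5, b=1.0, c=3.0, d=1.0):
    # Four interpretable mechanisms (coefficients chosen for illustration):
    x, y = u  # x: prey (rabbits), y: predators (wolves)
    dx = a * x - b * x * y   # exponential prey growth minus predation
    dy = c * x * y - d * y   # predator growth from eating minus natural death
    return np.array([dx, dy])

def rk4_step(f, u, dt):
    # Classic fourth-order Runge-Kutta step.
    k1 = f(u)
    k2 = f(u + dt / 2 * k1)
    k3 = f(u + dt / 2 * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

u = np.array([1.0, 1.0])
traj = [u]
for _ in range(2000):          # integrate to t = 20
    u = rk4_step(lotka_volterra, u, 0.01)
    traj.append(u)
prey = np.array(traj)[:, 0]    # the emergent cyclic prey population
```

Four one-line mechanisms, yet the resulting trajectory rises and crashes in cycles with no elementary closed form, which is the emergence point in miniature.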

One way to really solidify this idea is our next example, where we showcase how reconstructing a partial differential equation can be done straightforwardly through universal approximators embedded within partial differential equations, what we call a universal PDE. If you want the full details behind what I will handwave here, take a look at the MIT 18.337 Scientific Machine Learning course notes or the MIT 18.S096 Applications of Scientific Machine Learning course notes. In them there is a derivation showcasing how one can interpret partial differential equations as large systems of ODEs. Specifically, the Fisher-KPP equation that we look at in our paper:

$$\rho_t = D\rho_{xx} + r\rho(1 - \rho)$$

can be interpreted as a system of ODEs:

$$\rho_i' = \frac{D}{\Delta x^2}\left(\rho_{i+1} - 2\rho_i + \rho_{i-1}\right) + r\rho_i(1 - \rho_i).$$

However, the spatial term, $\frac{D}{\Delta x^2}\left(\rho_{i+1} - 2\rho_i + \rho_{i-1}\right)$, is known as a stencil operation. Basically, you go to each point, sum the left and right neighbors, and subtract twice the middle. Sound familiar? It turns out that a convolutional layer from convolutional neural networks is actually just a parameterized form of a stencil. For example, a picture of a two-dimensional stencil looks like:

where this stencil is applying the operation:

1 0 1

0 1 0

1 0 1

A convolutional layer is just a parameterized form of a stencil operation:

w1 w2 w3

w4 w5 w6

w7 w8 w9

Thus one way to approach learning spatiotemporal data which we think may come from such a generating process is:

$$\rho_i' = w_1\rho_{i+1} + w_2\rho_i + w_3\rho_{i-1} + \mathrm{NN}(\rho_i),$$

i.e., the discretization of the second spatial derivative, what's known as diffusion, is physically represented as the stencil of weights $[1, -2, 1]$ (scaled by $D/\Delta x^2$). Notice that in this form, **the entire spatiotemporal dynamics are described by a 1-input 1-output neural network plus 3 parameters**. Globally, over the array of all the $\rho_i$, this becomes:

$$\rho' = \mathrm{CNN}(\rho) + \mathrm{NN}(\rho),$$

i.e. it's a universal differential equation with a 3-parameter CNN and (the same) small neural network applied at each spatial point.
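The claim that the [1, -2, 1] stencil computes a second derivative, and that rescaling the weights corresponds to changing the diffusion constant, can be checked in a few lines. This is a NumPy sketch independent of the actual CNN implementation:

```python
import numpy as np

dx = 0.01
x = np.arange(0.0, 1.0 + dx, dx)
u = x**2  # a test function whose second derivative is exactly 2

# The [1, -2, 1] stencil divided by dx^2 approximates d^2u/dx^2;
# applying it is literally a 1D convolution over the grid.
stencil = np.array([1.0, -2.0, 1.0]) / dx**2
u_xx = np.convolve(u, stencil, mode="valid")  # interior points only

# Rescaling the same three weights by D models diffusion with a
# different diffusion constant -- no retraining required.
D = 0.5
diffusion = np.convolve(u, D * stencil, mode="valid")  # ~ D * u_xx
```

Since `u = x**2`, the stencil output is 2 at every interior point, and the rescaled weights give `D * 2 = 1`, which is exactly the "just rescale the convolutional weights" trick described below for predicting a different fluid.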

Can this tiny neural network actually fit the data? It does. But not only does it fit the data, it also is interpretable:

You see, not only did it fit and accurately match the data, it also tells us exactly the PDE that generated the data. Notice that figure (C) shows that in the fitted equation the learned stencil weights satisfy $w_1 \approx w_3$ and $w_2 \approx -2w_1$. This means that the convolutional neural network learned to be a rescaling of the stencil $[1, -2, 1]$, exactly as the theory would predict if the only spatial physical process was diffusion. But secondly, figure (D) shows that the neural network that represented the 1-dimensional behavior seems to be quadratic. Indeed, remember from the PDE discretization that the reaction term

$$r\rho_i(1 - \rho_i) = r\rho_i - r\rho_i^2$$

really is quadratic. This tells us that the only physical generating process that could have given us this data is a diffusion equation with a quadratic reaction, i.e.

$$\rho_t = D\rho_{xx} + r\rho(1 - \rho),$$

thus **interpreting the neural networks to precisely recover a PDE governing the evolution of the spatiotemporal data**. Not only that, this trained form can predict beyond its data set: if we wished for it to predict the behavior of a fluid with a different diffusion constant $D$, we know exactly how to change that term without retraining the neural networks, since the weights in the convolutional layer are proportional to $D$. We'd simply rescale those weights and suddenly have a neural network that predicts for a different underlying fluid.

Small neural network, small data, trained in an interpretable form that extrapolates, even on hard problems like spatiotemporal data.

From this explanation, it should be very clear that our approach is general, but in every application, it's specific. We utilize prior knowledge of differential equations, like known physics or biological interactions, to try and hard code as much of the equation as possible. Then the neural networks are just stand-ins for the little pieces that are leftover. Thus the neural networks have a very easy job! They don't have to learn very much! They just have to learn the leftovers! Thus the problem becomes easy since we imposed so much knowledge in how the neural infrastructure was made by utilizing the differential equation form to its fullest.

I like to think of it as follows. There is a certain amount of knowledge that is required to effectively learn a problem. That knowledge can come from prior information or it can come from data. Either way, you need enough of it to effectively learn the model and make accurate predictions. Machine learning has gone the route of relying almost entirely on data, but that doesn't need to be the case. We know how physics works, and how time series relate to derivatives, so there's no reason to force a neural network to learn these parts. Instead, by writing small neural networks inside of differential equations, we can embed everything that we know about the physics as true structurally-imposed prior knowledge, and then what's left is a simple training problem. That way, a lot of prior knowledge plus a small amount of data still adds up to enough to learn from, and that's how you make a "neural network" accurately extrapolate from only a small amount of training data. And now that it's only learning a simple function, what it learned is easily interpretable through sparse regression techniques.

Once we had this idea of embedding structure, all of the hard work began. Essentially, in big-neural-network machine learning, you can get away with a lot of performance issues if 99.9% of your time is spent in the neural network's calculations. But once we got into the regime of small-data, small-neural-network structured machine learning, the neural networks were no longer the big time consumer, which meant every little detail mattered. Thus we needed to hyper-optimize the solution of small ODE solves to make this a reality. As a result of our optimizations, we have easily reproducible benchmarks which showcase a 50,000x acceleration over the torchdiffeq neural ODE library. In fact, benchmarks show across-the-board orders-of-magnitude performance advantages over SciPy, MATLAB, and R's deSolve as well. This is not a small detail, as training neural networks within scientific simulations is a costly undertaking which takes many millions of ODE solves, and therefore these performance optimizations changed the problem from "impractical" to "reality". Again, when very large neural networks are involved this may be masked by the cost of the neural network passes themselves, but in the context of small-network scientific machine learning, this change was a godsend.

But the even larger difficulty we noticed was that traditional numerical analysis ideas like stability really come into play once real physical models get involved. There is a property of ODEs called stiffness, and when it comes into play, the simple Runge-Kutta or Adams-Bashforth-Moulton methods are no longer stable enough to accurately solve the equations. Thus when looking at the universal partial differential equations, we had to make use of a set of ODE solvers which have package implementations in Julia and Fortran. Additionally, any form of backwards solving is unconditionally unstable on the diffusion-advection equation, meaning that this is a practical case where simple adjoint methods, like the backsolve approach of the original neural ODEs paper and torchdiffeq, actually diverge to infinity in finite time for any tolerance on the ODE solver. Thus we had to implement a number of different (checkpointed) adjoint implementations in order to accurately and efficiently train neural networks within these equations. Then, once we had a partial differential equation form, we had to build tools that would integrate with automatic differentiation to automatically specialize on sparsity. The result was a full set of advanced methods for efficiently handling stiffness that is fully compatible with neural network backpropagation. It was only when all of this came together that the most difficult examples of what we showed actually worked. Now, our software DifferentialEquations.jl with DiffEqFlux.jl is able to handle:

- Stiffness and ill-conditioned problems
- Universal ordinary differential equations
- Universal stochastic differential equations
- Universal delay differential equations
- Universal differential algebraic equations
- Universal partial differential equations
- Universal (event-driven) hybrid differential equations

all with GPUs, sparse and structured Jacobians, preconditioned Newton-Krylov, and the list of features just keeps going. This is the limitation: when real scientific models get involved, the numerical complexity drastically increases. But now this is something that at least Julia libraries have solved.
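The stiffness problem mentioned above is easy to demonstrate in miniature. The following Python sketch uses the standard linear test equation (a toy problem, not one of the paper's models) to show why explicit methods fail: on $y' = -1000y$, explicit Euler with $dt = 0.01$ multiplies the solution by $-9$ each step and explodes, while implicit Euler damps it toward zero as the true solution does:

```python
# Toy stiffness demonstration on y' = lam * y with y(0) = 1.
lam, dt, steps = -1000.0, 0.01, 100

# Explicit Euler: y_{n+1} = (1 + lam*dt) * y_n; here 1 + lam*dt = -9,
# so the iteration is unstable and blows up despite the decaying solution.
y_explicit = 1.0
for _ in range(steps):
    y_explicit = (1.0 + lam * dt) * y_explicit

# Implicit (backward) Euler: y_{n+1} = y_n / (1 - lam*dt); here the
# per-step factor is 1/11, so the iteration decays like the true solution.
y_implicit = 1.0
for _ in range(steps):
    y_implicit = y_implicit / (1.0 - lam * dt)
```

An explicit method would need $dt < 0.002$ just to stay stable here, which is why stiff universal PDEs demand implicit solvers and the stable adjoints described above.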

The takeaway is that **not only do you need to use all of the scientific knowledge available, but you also need to make use of all of the numerical analysis knowledge**. When you combine all of this knowledge with the most recent advances in machine learning, you get small neural networks that train on small data in a way that is interpretable and accurately extrapolates. So yes, it's not magic: we just replaced the big data requirement with the requirement of having some prior scientific knowledge, and if you go talk to any scientist you'll know that knowledge exists. I think it's time we use it in machine learning.

The code to reproduce our results is in this GitHub repository. However, I would like to see people try other examples. All of the tooling is now in the open-source DifferentialEquations.jl and DiffEqFlux.jl packages. The implementation of SInDy is in the DataDrivenDiffEq package (this one isn't quite released yet, but the tools used for these examples are released, and it does spit out ModelingToolkit expressions which you can call Latexify on; documentation on this will come soon!). The final example in the paper is a library call in NeuralNetDiffEq.jl with the algorithm choice of LambaEM().

For examples on how to use these tools in your own packages, I would consult this part of the DiffEqFlux.jl README.

The post Scientific Machine Learning: Interpretable Neural Networks That Accurately Extrapolate From Small Data appeared first on Stochastic Lifestyle.

]]>