install.packages("diffeqr")
library(diffeqr)
de <- diffeqr::diffeq_setup()
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
sol <- de$solve(fastprob,de$Tsit5(),saveat=0.01)
Under the hood it's using the DifferentialEquations.jl package and the SciML stack, but it's abstracted from users so much that Julia is essentially an alternative to Rcpp with easier interactive development. The following example really brings the seamless integration home:
install.packages("diffeqr")
library(diffeqr)
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
sol <- de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=10000,saveat=0.01)
This example sets up Julia and the GPU backend, defines the Lorenz system in R, JIT-compiles it into an optimized Julia function, builds an ensemble problem whose initial conditions and parameters are randomized for each trajectory, and then solves 10,000 trajectories on the GPU.
Quite a lot of machinery! Well, maybe it would shock you to know that the source code for the diffeqr package is only 150 lines of code. Of course, it's powered by a lot of Julia magic in the backend, and so can your next package be. For more details, see the big long post about differential equation solving in R with Julia.
The post JuliaCall Update: Automated Julia Installation for R Packages appeared first on Stochastic Lifestyle.
This is definitely not the first time this question was asked. The statistics libraries in Julia were developed by individuals like Douglas Bates, who built some of R's most widely used packages like lme4 and Matrix. Doug had written a blog post in 2018 showing how to get top-notch performance in linear mixed effects model fitting via JuliaCall. In 2018 the JuliaDiffEq organization had written a blog post demonstrating the use of DifferentialEquations.jl in R and Python (the Jupyter of Differential Equations). Now rebranded as SciML for Scientific Machine Learning, we looked to expand our mission and bring automated model discovery and acceleration to other languages like R and Python, with Julia as the base.
With the release of diffeqr v1.0, we can now demonstrate many advances in R through the connection to Julia. Specifically, I would like to use this blog post to showcase a few things: the new direct wrappers over the full DifferentialEquations.jl solver interface, automated JIT compilation and optimization of R-defined ODE models, and GPU-accelerated ensemble solving, all driven entirely from R.
Together we will demonstrate how models in R can be accelerated by 1000x without a user ever having to write anything but R.
Before continuing on with showing all of the features, I wanted to ask for support so that we can continue developing these bridged libraries. Specifically, I would like to be able to support developers interested in providing a fully automated Julia installation and static compilation so that calling into Julia libraries is just as easy as any Rcpp library. To show support, the easiest thing to do is to star our libraries. The work of this blog post is built on DifferentialEquations.jl, diffeqr, ModelingToolkit.jl, and DiffEqGPU.jl. Thank you for your patience, and now back to our regularly scheduled program.
First let me start with the new direct wrappers of the differential equation solvers in R. In previous iterations of diffeqr, we had relied on specifically designed high-level functions, like "ode_solve", to compensate for the fact that one could not use Julia's original DifferentialEquations.jl interface directly from R. However, the new diffeqr v1.0 directly exposes the entirety of the Julia library in an easy-to-use framework.
To demonstrate this, let's see how to define the Lorenz ODE with diffeqr. In Julia's DifferentialEquations.jl, we would start by defining an "ODEProblem" that contains the initial condition u0, the time span, the parameters, and the f in terms of `u' = f(u,p,t)` that defines the derivative. In Julia, this would look like:
using DifferentialEquations
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
sol = solve(prob,saveat=1.0)
With the new diffeqr, diffeq_setup() is a function that does a few things: it installs Julia if it is not already installed, installs DifferentialEquations.jl and the other required Julia packages if they are missing, and returns an object (here de) whose fields directly expose the functions of DifferentialEquations.jl from R.
What this means is that the complete diffeqr v1.0 code for solving the Lorenz equation is:
library(diffeqr)
de <- diffeqr::diffeq_setup()
f <- function(u,p,t) {
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  return(c(du1,du2,du3))
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(f, u0, tspan, p)
sol <- de$solve(prob,saveat=1.0)
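The returned solution object holds the saved time points and states. As a sketch of how to get them into native R structures for analysis or plotting (this assumes the `sol$u`/`sol$t` fields of the diffeqr return, following the pattern in the diffeqr documentation):

```r
# Convert the list of state vectors in sol$u into a data frame,
# one row per saved time point (column names V1..V3 are R defaults).
mat <- sapply(sol$u, identity)
udf <- as.data.frame(t(mat))
udf$t <- sol$t
# The attractor can then be visualized, e.g. with plotly:
# plotly::plot_ly(udf, x = ~V1, y = ~V2, z = ~V3, type = 'scatter3d', mode = 'lines')
```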
This then carries on through SDEs, DDEs, DAEs, and more. Through this direct exposure, the whole library of DifferentialEquations.jl is at the fingertips of any R user, making it a truly cross-language platform.
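To sketch how the same pattern extends beyond ODEs, here is a stochastic Lorenz system with diagonal noise (this mirrors the SDE interface shown in the diffeqr documentation; the second function g gives the per-state noise amplitudes, and the 0.3 values are illustrative):

```r
library(diffeqr)
de <- diffeqr::diffeq_setup()
f <- function(u,p,t) {
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
# Diagonal noise: each state gets an independent Brownian term scaled by 0.3
g <- function(u,p,t) c(0.3,0.3,0.3)
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$SDEProblem(f, g, u0, tspan, p)
sol <- de$solve(prob, saveat=0.05)
```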
(Note that diffeq_setup installs Julia for you if it's not already installed!)
The reason for Julia is speed (well and other things, but here, SPEED!). Using the pure Julia library, we can solve the Lorenz equation 100 times in about 0.05 seconds:
@time for i in 1:100 solve(prob,saveat=1.0) end

0.048237 seconds (156.80 k allocations: 6.842 MiB)
0.048231 seconds (156.80 k allocations: 6.842 MiB)
0.048659 seconds (156.80 k allocations: 6.842 MiB)
Using the diffeqr-connected version, we get:
lorenz_solve <- function (i){
  de$solve(prob,saveat=1.0)
}

> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.81    0.02    6.83
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   7.09    0.00    7.10
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.78    0.00    6.79
That's not good: that's about a 100x difference! In this blog post I described how interpreter overhead and context switching are the main causes of this issue. We've also demonstrated that ML accelerators like PyTorch generally do not perform well in this regime, since those kinds of accelerators rely on heavy array operations, unlike the scalarized nonlinear interactions seen in a lot of differential equation modeling. For this reason we cannot just slap any old JIT compiler onto the f call and put it into the solver, since there would still be overhead left over. So we need to do something a bit tricky.
In my JuliaCon 2020 talk, Automated Optimization and Parallelism in DifferentialEquations.jl, I demonstrated how ModelingToolkit.jl can be used to trace functions and generate highly optimized sparse and parallel code for scientific computing, all in an automated fashion. It turns out that JuliaCall can do a form of tracing on R functions, something that was exploited to allow autodiffr to automatically differentiate R code with Julia's AD libraries. Thus the same modelingtoolkitization methods used in AutoOptimize.jl can be used on a subset of R code, which includes a large majority of differential equation models.
In short, we can perform automated acceleration of R code by turning it into sparse parallel Julia code. This was exposed in diffeqr v1.0 as the `jitoptimize_ode(de,prob)` function (also `jitoptimize_sde(de,prob)`). Let's try it out on this example. All you need to do is give it the ODEProblem which you wish to accelerate. Let's take the last problem and turn it into a pure Julia defined problem and then time it:
fastprob <- diffeqr::jitoptimize_ode(de,prob)
fast_lorenz_solve <- function (i){
  de$solve(fastprob,saveat=1.0)
}

> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.05    0.00    0.04
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06
And there you go: an R user can get the full benefits of Julia's optimizing JIT compiler without having to write a lick of Julia code! This function also did a few other things, like automatically defining the Jacobian code to make implicit solving of stiff ODEs much faster as well, and it can perform sparsity detection and automatically optimize computations based on it.
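To sketch the stiff-ODE side of that claim, here is the classic Robertson chemical kinetics problem, a standard stiff test case, solved with an implicit method after JIT optimization (this follows the stiff-ODE example in the diffeqr documentation; the automatically generated Jacobian is what the implicit solver exploits):

```r
library(diffeqr)
de <- diffeqr::diffeq_setup()
# Robertson problem: three species with rate constants spanning many
# orders of magnitude, which makes the system stiff.
f <- function(u,p,t) {
  du1 = -0.04*u[1] + 1e4*u[2]*u[3]
  du2 =  0.04*u[1] - 1e4*u[2]*u[3] - 3e7*u[2]^2
  du3 =  3e7*u[2]^2
  c(du1,du2,du3)
}
u0 <- c(1.0,0.0,0.0)
tspan <- c(0.0,1e5)
prob <- de$ODEProblem(f, u0, tspan)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
# An implicit Rosenbrock method; the JIT step supplies its Jacobian.
sol <- de$solve(fastprob, de$Rosenbrock23())
```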
To see how much of an advance this is, note that this Lorenz equation is the same as the one on the deSolve examples page. So let's take their example and see how well it performs:
library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.03    0.03    5.07
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.42    0.00    5.44
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.41    0.00    5.41
Thus we see a 100x acceleration over the leading R library without users having to write anything but R code. This is the true promise of a "language of libraries" in action: helping to extend all other high-level languages!
What about writing C code and directly calling it with deSolve? It turns out that's still not as efficient as this JIT. Following the tutorial from deSolve on how to write an optimized Lorenz function, we first define the following function in C:
/* file lorenz.c */
#include <R.h>
static double parms[3];
#define a parms[0]
#define b parms[1]
#define c parms[2]

/* initializer */
void initmod(void (* odeparms)(int *, double *)) {
  int N = 3;
  odeparms(&N, parms);
}

/* Derivatives */
void derivs (int *neq, double *t, double *y, double *ydot, double *yout, int *ip) {
  ydot[0] = a * y[0] + y[1] * y[2];
  ydot[1] = b * (y[1] - y[2]);
  ydot[2] = - y[0] * y[1] + c * y[1] - y[2];
}
Then we use system("R CMD SHLIB lorenz.c") in R in order to compile this function into a shared library (a .dll on Windows). Now we can call it from R:
library(deSolve)
dyn.load("lorenz.dll")
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(state, times, func = "derivs", parms = parameters,
           dllname = "lorenz", initfunc = "initmod")
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  ode(state, times, func = "derivs", parms = parameters,
      dllname = "lorenz", initfunc = "initmod")
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   0.09    0.00    0.09
Notice that even when rewriting the function in C, this is still almost 2x slower than the JIT-compiled R code! This means that users can, with less work, get something faster than what they had before!
Can we go deeper? Yes we can. In many cases, like in optimization and sensitivity analysis of models for pharmacology, the user needs to solve the same ODE thousands or millions of times to understand the behavior over a large parameter space. To solve this problem well in Julia, we built DiffEqGPU.jl, which transforms the pure Julia function into a .ptx kernel and then parallelizes the ODE solver over it. What this looks like is the following, which solves the Lorenz equation 100,000 times with randomized initial conditions and parameters:
using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(3).*u0,p=rand(3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
Notice how this is only two lines of code different from what we had before, and now everything is GPU accelerated! The requirement for this to work is that the ODE/SDE/DAE function has to be written in Julia... but diffeqr::jitoptimize_ode(de,prob) accelerates the ODE solving in R by generating a Julia function, so could that mean...?
Yes, it does mean we can use DiffEqGPU directly on ODEs defined in R. Let's see this in action. Once again, we will write almost exactly the same code as in Julia, except with `de$` and with diffeqr::jitoptimize_ode(de,prob) to JIT compile our ODE definition. What this looks like is the following:
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
sol = de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0)
Note that diffeqr::diffeqgpu_setup() installs the Julia GPU stack (including the CUDA libraries and DiffEqGPU.jl) if it is not already present, and returns an object, here degpu, exposing the GPU-based ensemble algorithms such as degpu$EnsembleGPUArray().
This means that this portion of the library is fully automated, all the way down to the installation of CUDA! Let's time this out a bit. 100,000 ODE solves in serial:
@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,saveat=1.0f0)

15.045104 seconds (18.60 M allocations: 2.135 GiB, 4.64% gc time)
14.235984 seconds (16.10 M allocations: 2.022 GiB, 5.62% gc time)
100,000 ODE solves on the GPU in Julia:
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)

2.071817 seconds (6.56 M allocations: 1.077 GiB)
2.148678 seconds (6.56 M allocations: 1.077 GiB)
Now let's check R in serial:
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  24.16    1.27   25.42
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  25.45    0.94   26.44
and R on GPUs:
> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  12.39    1.51   13.95
> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  12.55    1.36   13.90
R doesn't quite reach the level of Julia here, and if you profile you'll see it's because the `prob_func`, i.e. the function that tells you which problems to solve, is still a function written in R, and this becomes the bottleneck as the computation gets faster and faster. Thus you will get closer and closer to the Julia speed with longer and harder ODEs, but it still means there's work to be done. Another detail is that the Julia code can be further accelerated by using 32-bit numbers. Let's see that in action:
using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = Float32[1.0,1.0,1.0]
tspan = (0.0f0,100.0f0)
p = Float32[10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(Float32,3).*u0,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
# 1.781718 seconds (6.55 M allocations: 918.051 MiB)
# 1.873190 seconds (6.56 M allocations: 917.875 MiB)
Right now the Julia to R bridge converts all 32-bit numbers back to 64-bit numbers so this doesn't seem to be possible without the user writing some Julia code, but we hope to get this fixed in one of our coming releases.
To figure out where that leaves us, let's use deSolve to solve that same Lorenz equation 100 and 1,000 times:
library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
desolve_lorenz_solve <- function (i){
  state <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}

> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.06    0.00    5.13
> system.time({ lapply(1:1000,desolve_lorenz_solve) })
   user  system elapsed
  55.68    0.03   55.75
We see the expected linear scaling of a scalar code, so we can extrapolate out: at roughly 0.055 seconds per solve, solving 100,000 ODEs would take deSolve about 5000 seconds, as opposed to the 14 seconds of diffeqr on the GPU or the 1.8 seconds of pure Julia. In summary: about 5000 seconds for deSolve in serial, about 14 seconds for diffeqr with GPU acceleration, and about 1.8 seconds for pure 32-bit Julia on the GPU.
And deSolve is not shabby: it's a library that calls Fortran libraries under the hood!
We hope that the R community has enjoyed this release and will enjoy our future releases as well. We hope to continue building further connections to Python as well. Together this will make Julia a true language of libraries that can be used to accelerate scientific computation in the surrounding higher level scientific ecosystem.
The post GPU-Accelerated ODE Solving in R with Julia, the Language of Libraries appeared first on Stochastic Lifestyle.
Chris Rackauckas
Applied Mathematics Instructor, MIT
Senior Research Analyst, University of Maryland, Baltimore School of Pharmacy
This was a seminar talk given to the COVID modeling journal club on scientific machine learning for epidemic modeling.
Resources:
https://sciml.ai/
https://diffeqflux.sciml.ai/dev/
https://datadriven.sciml.ai/dev/
https://docs.sciml.ai/latest/
https://safeblues.org/
The post COVID-19 Epidemic Mitigation via Scientific Machine Learning (SciML) appeared first on Stochastic Lifestyle.
Cheap But Effective: Instituting Effective Pandemic Policies Without Knowing Who's Infected
Chris Rackauckas
MIT Applied Mathematics Instructor
One way to find out how many people are infected is to figure out who's infected, but that's working too hard! In this talk we will look into cheaper alternatives for effective real-time policy making. To this end we introduce SafeBlues, a project that simulates fake virus strands over Bluetooth and utilizes deep neural networks mixed within differential equations to accurately approximate infection statistics weeks before updated statistics are available. We then introduce COEXIST, a quarantine policy which utilizes inexpensive "useless" tests to perform accurate regional case isolation. This work is all being done as part of the Microsoft Pandemic Modeling Project, where the Julia SciML tooling has accelerated the COEXIST simulations by 36,000x and quantitative systems pharmacology simulations for Pfizer by 175x in support of the efforts against COVID-19.
The post Cheap But Effective: Instituting Effective Pandemic Policies Without Knowing Who's Infected appeared first on Stochastic Lifestyle.
Differentiable programming is a subset of modeling where you model with a program whose steps are each differentiable, for the purpose of being able to find the correct program by parameter fitting using said derivatives. Just like any modeling domain, different problems have different code styles which must be optimized in different ways. Traditional scientific computing code makes use of mutable buffers, writing out nonlinear scalar operations and avoiding memory allocations in order to keep top performance. On the other hand, many machine learning libraries allocate a ton of temporary arrays due to using out-of-place matrix multiplications, which is fine because dense linear algebra costs grow much faster than the costs of the allocations (https://www.stochasticlifestyle.com/when-do-micro-optimizations-matter-in-scientific-computing/). Some need sparsity everywhere, others just need to fuse and build the fastest dense kernels possible. Some algorithms do great on GPUs, while some do not. This intersection between scientific computing and machine learning, i.e. scientific machine learning and other applications of differentiable programming, is too large of a domain for one approach to make everyone happy. And if an AD system is unable to reach top-notch performance for a specific subdomain, it's simply better for the hardcore package author to not use the AD system and instead write their own adjoints.
Even worse is the fact that mathematically there are many cases where you should write your own adjoints, since differentiating through the code is very suboptimal. Any iterative algorithm is of this sort, where the derivative of a nonlinear solve f(x)=0 may use Newton's method to get f(x*)=0, but the adjoint is only defined at x* with f'(x*), so there's no need to ever differentiate through Newton's method. So we should all be writing adjoints! Does this mean that the story of differentiable programming is destroyed? Is it just always better to not do differentiable programming, so any hardcore library writer will ignore it?
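To make the Newton's-method point concrete, here is the standard implicit-function-theorem argument (a sketch added for exposition, using g for the residual function): if x*(p) solves g(x, p) = 0, then differentiating the identity g(x*(p), p) = 0 with respect to p gives

```latex
\frac{\partial g}{\partial x}\bigg|_{x^*}\,\frac{dx^*}{dp} + \frac{\partial g}{\partial p}\bigg|_{x^*} = 0
\quad\Longrightarrow\quad
\frac{dx^*}{dp} = -\left(\frac{\partial g}{\partial x}\bigg|_{x^*}\right)^{-1}\frac{\partial g}{\partial p}\bigg|_{x^*}
```

so the adjoint only needs Jacobians evaluated at the converged solution x*, never the individual Newton iterates.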
Instead of falling into that despair, let's just follow down that road with a positive light. Let's assume that the best way to do differentiable programming is to write adjoints on library code. Then what's the purpose of a differentiable programming system? It's to help your adjoints get written and be useful. It's just like the matrix multiplications in machine learning: if the majority of the code is in some optimized kernel, then you don't need to worry about the performance of much of the rest; you just want it to work. With differentiable programming, if 99% of the computation is in the DiffEq/NLsolve/FFTW/etc. adjoints, what we need from a differentiable programming system is something that will get the rest of the adjoint done and be very easy to make correct. The way to facilitate this kind of workflow would be for the differentiable programming system to: (1) cover as much of the language as possible so the backpass can always be completed, (2) give clear stacktraces and debugging information when something goes wrong, (3) make it easy to profile and find the slow parts of the backpass, and (4) make it easy to swap in hand-written or specialized adjoints wherever they exist.
Thus what I think we're looking for is not one differentiable programming system that is the best in all aspects, but instead we're looking for a differentiable programming system that can glue together everything that's out there. "Differentiate all of the things, but also tell me how to do things better". We're looking for a glue AD.
Zygote is surprisingly close to being this perfect glue AD. Its stacktraces and profiling are fairly good because they point to the pieces generating the backpasses. It just needs some focus on this goal if it wants to attain it. For (1), it would need to get higher coverage of the language, focusing on its expanse more so than doing everything as fast as possible. Of course, it should do as well as it can, but, for example, if it needs to sacrifice a bit of speed to get full support for mutation today, that might be a good trade-off if the goal is to be a glue AD. Perfect? No, but that would give you the coverage to then tell the user that, if they need more on a particular piece of code, they should seek out more. To seek out more performance, users could just have Zygote call ReverseDiff.jl on a function and have that compile the tape (or other specialized AD systems which will be announced more broadly soon), or they may want to write a partial adjoint.
So (4) is really the kicker. If I were to hit a slow mutating code today inside of a differential equation, it would probably be something perfect for ModelingToolkit.jl to handle, so the best thing to do is to build hyper-optimized adjoints of that differential equation using ModelingToolkit.jl. At that level, I can handle it symbolically and generate code that a compiler couldn't, because I can make a lot of extra assumptions, like cos^2(x) + sin^2(x) = 1 in my mathematical context. I can move code around, auto-parallelize it, etc. easily because of the simple static graph I'm working on. Wouldn't it be a treat to just `@ModelingToolkitAdjoint f` and bingo, now it's using ModelingToolkit on a portion of code? `@ForwardDiffAdjoint f` to tell it "you should use forward mode here". Yota.jl is a great reverse mode project, so `@YotaAdjoint f` and boom, that could be more optimized than Zygote in some cases. `@ReverseDiff f` and let it compile the tape, and it'll get fairly optimal in the places where ReverseDiff.jl is applicable.
Julia is the perfect language to develop such a system for because its AST is so nice and constrained for mathematical contexts that all of these AD libraries do not work on a special DSL language like TensorFlow graphs or torch.numpy, but instead work directly on the language itself and its original existing libraries. With ChainRules.jl allowing for adjoint overloads that apply to all AD packages, focusing on these "Glue AD" properties could really open up the playing field, allowing Zygote to be at the center of an expansive differentiable programming world that works everywhere, maybe makes some compromises to do so, but then gives a system for other developers to make assumptions, easily define adjoints, and plug alternative AD systems into the whole game. This is a true mixed mode which incorporates not just forward and reverse, but also different implementations with different performance profiles (and this can be implemented just through ChainRules overloads!). Zygote would then facilitate this playing field with a solid debugging and profiling experience, along with a very high chance of working on your code on your first try. That, plus buy-in by package authors, would be a true solution to differentiable programming.
The post Glue AD for Full Language Differentiable Programming appeared first on Stochastic Lifestyle.
Chris Rackauckas (MIT), "Generalized Physics-Informed Learning through Language-Wide Differentiable Programming"
Scientific computing is increasingly incorporating the advancements in machine learning to allow for data-driven physics-informed modeling approaches. However, re-targeting existing scientific computing workloads to machine learning frameworks is both costly and limiting, as scientific simulations tend to use the full feature set of a general purpose programming language. In this manuscript we develop an infrastructure for incorporating deep learning into existing scientific computing code through Differentiable Programming (∂P). We describe a ∂P system that is able to take gradients of full Julia programs, making Automatic Differentiation a first class language feature and compatibility with deep learning pervasive. Our system utilizes the one-language nature of Julia package development to augment the existing package ecosystem with deep learning, supporting almost all language constructs (control flow, recursion, mutation, etc.) while generating high-performance code without requiring any user intervention or refactoring to stage computations. We showcase several examples of physics-informed learning which directly utilizes this extension to existing simulation code: neural surrogate models, machine learning on simulated quantum hardware, and data-driven stochastic dynamical model discovery with neural stochastic differential equations.
Code is available at https://github.com/MikeInnes/zygote-paper
AAAI 2020 Spring Symposium on Combining Artificial Intelligence and Machine Learning with Physics Sciences, March 23-25, 2020 (https://sites.google.com/view/aaai-mlps)
https://figshare.com/articles/presentation/Generalized_Physics-Informed_Learning_through_Language-Wide_Differentiable_Programming/12751934
The post Generalized Physics-Informed Learning through Language-Wide Differentiable Programming (Video) appeared first on Stochastic Lifestyle.
]]>Colloquium with Chris Rackauckas
Department of Mathematics
Massachusetts Institute of Technology
"Universal Differential Equations for Scientific Machine Learning"
Feb 19, 2020, 3:30 p.m., 499 DSL
https://arxiv.org/abs/2001.04385
Abstract:
In the context of science, the well-known adage "a picture is worth a thousand words" might well be "a model is worth a thousand datasets." Scientific models, such as Newtonian physics or biological gene regulatory networks, are human-driven simplifications of complex phenomena that serve as surrogates for the countless experiments that validated the models. Recently, machine learning has been able to overcome the inaccuracies of approximate modeling by directly learning the entire set of nonlinear interactions from data. However, without any predetermined structure from the scientific basis behind the problem, machine learning approaches are flexible but data-expensive, requiring large databases of homogeneous labeled training data. A central challenge is reconciling data that is at odds with simplified models without requiring "big data". In this work we develop a new methodology, universal differential equations (UDEs), which augments scientific models with machine-learnable structures for scientifically-based learning. We show how UDEs can be utilized to discover previously unknown governing equations, accurately extrapolate beyond the original data, and accelerate model simulation, all in a time and data-efficient manner. This advance is coupled with open-source software that allows for training UDEs which incorporate physical constraints, delayed interactions, implicitly-defined events, and intrinsic stochasticity in the model. Our examples show how a diverse set of computationally-difficult modeling issues across scientific disciplines, from automatically discovering biological mechanisms to accelerating climate simulations by 15,000x, can be handled by training UDEs.
The post Universal Differential Equations for Scientific Machine Learning (Video) appeared first on Stochastic Lifestyle.
However, in our recent paper, we have shown that this does not have to be the case. In Universal Differential Equations for Scientific Machine Learning, we start by showing the following figure:
Indeed, it shows that by only seeing the tiny first part of the time series, we can automatically learn the equations in such a manner that it predicts the time series will be cyclic in the future, in a way that even gets the periodicity correct. Not only that, but our result is not a neural network, rather the program itself spits out the LaTeX for the differential equation:
with the correct coefficients, which is exactly how we generated the data.
Rather than just explaining the method, what I really want to convey in this blog post is why it intuitively makes sense that our methods work. Intuitively, we utilize all of the known scientific structure to embed as much prior knowledge as possible. This is the opposite of most modern machine learning, which tries to use blackbox architectures to fit as wide a range of behaviors as possible. Instead, we look at our problem and ask: what do I know has to be true about the system, and how can I constrain the neural network to force the parameter search to only look at cases where that holds? In the context of science, we do so by directly embedding neural networks into existing scientific simulations, essentially saying that the model is at least approximately accurate, so our neural network should only learn what the scientific model didn't cover or simplified away. Our approach has many more applications than what we show in the paper, and if you know the underlying idea, it is quite straightforward to apply to your own work.
The starting point for universal differential equations is the now classic work on neural ordinary differential equations. Neural ODEs are defined by the equation u' = NN(u, p, t), for a neural network NN with weights p:
i.e. it's an arbitrary function described as the solution to an ODE defined by a neural network. The reason why the authors went down this route was because it's a continuous form of a recurrent neural network that then makes it natural for handling irregularly-spaced time series data.
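As a concrete (and deliberately simplified) sketch of this idea, here is a hand-rolled neural ODE: a forward Euler loop through a tiny random network. The network weights, step size, and integration horizon are all illustrative assumptions, not the DiffEqFlux API:

```julia
# Sketch: a "neural ODE" u' = NN(u) integrated with forward Euler.
# The two-layer network here is random and untrained -- purely illustrative.
using Random
Random.seed!(1)
W1, b1 = 0.1 .* randn(8, 2), zeros(8)
W2, b2 = 0.1 .* randn(2, 8), zeros(2)
nn(u) = W2 * tanh.(W1 * u .+ b1) .+ b2   # the ODE right-hand side
function neural_ode(u0; dt = 0.01, T = 1.0)
  u = copy(u0)
  for _ in 1:round(Int, T / dt)
    u = u .+ dt .* nn(u)                 # Euler step through the network
  end
  u
end
neural_ode([1.0, 0.0])                   # state after integrating to T = 1
```

A real implementation would of course use an adaptive solver and train the weights against data; the point here is only that the right-hand side is a network.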
However, this formulation can have another interesting purpose. ODEs, and differential equations in general, are well-studied because they are the language of science. Newtonian physics is described in terms of differential equations. So are Einstein's equations, and quantum mechanics. Not only physics, but also biological models of chemical reactions in cells, population sizes in ecology, the motion of fluids, etc.: differential equations are really the language of science.
Thus it's not a surprise that in many cases we already have and know some differential equations. They may be an approximate model, but we know this approximation is "true" in some sense. This is the jumping-off point for the universal differential equation. Instead of trying to make the entire differential equation be a neural-network-like object, since science is encoded in differential equations, it would be scientifically informative to actually learn the differential equation itself. But in any scientific context we already know parts of the differential equation, so we might as well hard code that information as a form of prior structural information. This gives us the form of the universal differential equation, u' = f(u, t, U(u, t)),
where U is an arbitrary universal approximator, i.e. a finite-parameter object that can represent "any possible function". It just so happens that neural networks are universal approximators, and while other forms, like Chebyshev polynomials, also have this property, neural networks do well in high dimensions and on irregular grids (properties we utilize in some of our other examples).
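To illustrate the universal-approximator point with a non-neural example, here is a least-squares fit in a Chebyshev basis; the target function (cos) and the basis size are arbitrary choices for this sketch:

```julia
# Least-squares fit of cos(x) on [-1, 1] in the Chebyshev basis T0..T4,
# built via the recurrence T_k(x) = 2x*T_{k-1}(x) - T_{k-2}(x).
xs = range(-1, 1; length = 200)
T = ones(length(xs), 5)
T[:, 2] = xs
for k in 3:5
  T[:, k] = 2 .* xs .* T[:, k-1] .- T[:, k-2]
end
c = T \ cos.(xs)                   # basis coefficients
maximum(abs.(T * c .- cos.(xs)))   # max fit error: tiny with only 5 terms
```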
What happens when we describe a differential equation in this form is that the trained neural network becomes a tangible numerical approximation of the missing function. By doing this, we can train a program that has the same exact input/output behavior as the missing term of our model. And this is precisely what we do. We assume we only know part of the differential equation, say only the growth and death terms of the Lotka-Volterra system, x' = αx + U1(x, y), y' = -δy + U2(x, y),
and train the embedded neural networks U1 and U2 so that the resulting universal ODE fits our data. When trained, the neural network is a numerical approximation to the missing function. But since it's just a simple function, it's fairly straightforward to plot it and say "hey! We were missing a quadratic term", and there you go: interpreted back to the original generating equations. In the paper we describe how to make use of the SInDy technique to make this more rigorous through a sparse regression to a basis of possible terms, but it's the same story: in the end we learn exactly the differential equations that generated the data, and hence the extrapolation accuracy even beyond the original time series and the nice picture:
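The sparse-regression step can be sketched in a few lines. This is a simplified STLSQ-style pass, not the actual DataDrivenDiffEq implementation, and the "learned" samples are synthetic:

```julia
# Pretend the trained network's input/output samples trace out 0.5*u^2,
# then regress them onto a library of candidate terms and sparsify.
u = collect(range(-2, 2; length = 200))
y = 0.5 .* u .^ 2
Θ = [u .^ 0 u u .^ 2 u .^ 3]   # candidate basis: 1, u, u^2, u^3
ξ = Θ \ y                      # least-squares coefficients
ξ[abs.(ξ) .< 1e-8] .= 0        # drop negligible terms
ξ                              # only the u^2 coefficient survives
```

The surviving nonzero coefficient names the missing term symbolically, which is exactly the "interpreted back to the generating equations" step described above.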
Trying to approximate data might be a much harder problem than trying to understand the processes that create the data. Indeed, the Lotka-Volterra equations are a simple set of equations that are defined by 4 interpretable terms. The first simply states that the number of rabbits would grow exponentially if there wasn't a predator eating them. The second term just states that the number of rabbits goes down when they are eaten by predators (and more predators means more eating). The third term is that more prey means more food and growth of the wolf population. Finally, the wolves die off with an exponential decay due to old age.
That's it: a simple quadratic system that describes 4 mechanisms of interaction. Each mechanism is interpretable and can be independently verified. Meanwhile, that cyclic solution that is the data set? The time series itself is such a complicated function that you can prove that there is no way to even express its analytical solution! This phenomenon is known as emergence: simple mechanisms can give rise to complex behavior. From this it should be clear that a method which is trying to predict a time series has a much more difficult problem to solve than one that is trying to learn mechanisms!
One way to really solidify this idea is our next example, where we showcase how reconstructing a partial differential equation can be straightforwardly done through universal approximators embedded within partial differential equations, what we call a universal PDE. If you want the full details behind what I will handwave here, take a look at the MIT 18.337 Scientific Machine Learning course notes or the MIT 18.S096 Applications of Scientific Machine Learning course notes. In them, there is a derivation that showcases how one can interpret partial differential equations as large systems of ODEs. Specifically, the Fisher-KPP equation that we look at in our paper, ρ_t = D ρ_xx + r ρ(1 - ρ),
can be semi-discretized in space into a system of ODEs, one per grid point: dρ_i/dt = (D/Δx^2)(ρ_{i+1} - 2ρ_i + ρ_{i-1}) + r ρ_i(1 - ρ_i).
However, the term in front, (ρ_{i+1} - 2ρ_i + ρ_{i-1})/Δx^2, is known as a stencil. Basically, you go to each point, sum the left and the right terms, and subtract twice the middle. Sounds familiar? It turns out that a convolutional layer from convolutional neural networks is actually just a parameterized form of a stencil. For example, a picture of a two-dimensional stencil looks like:
where this stencil is applying the operation:
1 0 1
0 1 0
1 0 1
A convolutional layer is just a parameterized form of a stencil operation:
w1 w2 w3
w4 w5 w6
w7 w8 w9
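As a quick sanity check that the [1, -2, 1] stencil really computes a second derivative, here is a pure-Julia sketch (the test function sin is an arbitrary choice):

```julia
# Apply the [1, -2, 1] stencil to samples of sin(x) and compare against
# the exact second derivative, -sin(x).
xs = range(0, 2π; length = 101)
dx = step(xs)
u = sin.(xs)
# stencil at each interior point: (u[i-1] - 2u[i] + u[i+1]) / dx^2
d2u = [(u[i-1] - 2u[i] + u[i+1]) / dx^2 for i in 2:length(u)-1]
maximum(abs.(d2u .+ sin.(xs[2:end-1])))   # O(dx^2) discretization error
```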
Thus one way to approach learning spatiotemporal data which we think may come from such a generating process is to posit, at each grid point, dρ_i/dt = (w1 ρ_{i-1} + w2 ρ_i + w3 ρ_{i+1}) + NN(ρ_i),
i.e., the discretization of the second spatial derivative, what's known as diffusion, is physically represented as the stencil of weights [1 -2 1]. Notice that in this form, the entire spatiotemporal data is described by a 1-input 1-output neural network + 3 parameters. Globally, on the array of all the ρ_i, this becomes:
i.e. it's a universal differential equation with a 3 parameter CNN and (the same) small neural network applied at each spatial point.
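Putting the two pieces together, one possible right-hand side on a 1-D grid can be sketched as follows; the stencil weights and the stand-in "network" are illustrative values, not trained ones:

```julia
# One UDE right-hand side on a 1-D grid: a 3-weight "convolution" for the
# spatial term plus the same 1-in/1-out function applied at every point.
w = [1.0, -2.0, 1.0]        # the 3-parameter CNN
nn(ρ) = ρ * (1 - ρ)         # stand-in for the small neural network
function rhs(ρ, D, dx)
  lap = [(w[1]*ρ[i-1] + w[2]*ρ[i] + w[3]*ρ[i+1]) / dx^2 for i in 2:length(ρ)-1]
  D .* lap .+ nn.(ρ[2:end-1])
end
rhs(fill(0.5, 10), 0.01, 0.1)   # constant field: the diffusion term vanishes
```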
Can this tiny neural network actually fit the data? It does. But not only does it fit the data, it also is interpretable:
You see, not only did it fit and accurately match the data, it also tells us exactly the PDE that generated the data. Notice that figure (C) says that in the fitted equation w1 ≈ w3, and that w2 ≈ -(w1 + w3). This means that the convolutional neural network learned to be the diffusion stencil (D/Δx^2)[1, -2, 1], exactly as the theory would predict if the only spatial physical process was diffusion. But secondly, figure (D) shows that the neural network that represented the 1-dimensional behavior seems to be quadratic. Indeed, remember from the PDE discretization that the reaction term is r ρ_i(1 - ρ_i) = r ρ_i - r ρ_i^2:
it really is quadratic. This tells us that the only physical generating process that could have given us this data is a diffusion equation with a quadratic reaction, i.e. ρ_t = D ρ_xx + r ρ(1 - ρ),
thus interpreting the neural networks to precisely recover a PDE governing the evolution of the spatiotemporal data. Not only that, this trained form can predict beyond its data set: if we wished for it to predict the behavior of a fluid with a different diffusion constant, we know exactly how to change that term without retraining the neural networks, since the weights in the convolutional neural network are (D/Δx^2)[1, -2, 1]. We'd simply rescale those weights and suddenly have a neural network that predicts for a different underlying fluid.
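The rescaling argument is just arithmetic on the weights. A sketch with made-up values for D, the new constant D2, and Δx:

```julia
# If the learned stencil is (D/Δx^2)*[1, -2, 1], moving to a new diffusion
# constant D2 only rescales the weights -- no retraining required.
dx = 0.04
D, D2 = 0.01, 0.025
w_trained  = D / dx^2 .* [1.0, -2.0, 1.0]
w_rescaled = (D2 / D) .* w_trained
w_rescaled ≈ D2 / dx^2 .* [1.0, -2.0, 1.0]   # the rescaled stencil matches
```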
Small neural network, small data, trained in an interpretable form that extrapolates, even on hard problems like spatiotemporal data.
From this explanation, it should be very clear that our approach is general, but in every application, it's specific. We utilize prior knowledge of differential equations, like known physics or biological interactions, to try and hard code as much of the equation as possible. Then the neural networks are just stand-ins for the little pieces that are leftover. Thus the neural networks have a very easy job! They don't have to learn very much! They just have to learn the leftovers! Thus the problem becomes easy since we imposed so much knowledge in how the neural infrastructure was made by utilizing the differential equation form to its fullest.
I like to think of it as follows. There is a certain amount of knowledge that is required to effectively learn the problem. Knowledge can come from prior information or it can come from data. Either way, you need enough knowledge to effectively learn the model and do accurate predicting. Machine learning has gone the route of effectively relying entirely on data, but that doesn't need to be the case. We know how physics works, and how time series relate to derivatives, so there's no reason to force a neural network to have to learn these parts. Instead, by writing small neural networks inside of differential equations, we can embed everything that we know about the physics as true structurally-imposed prior knowledge, and then what's left is a simple training problem. That way a big amount of prior knowledge plus a small amount of data still gives you enough total knowledge, and that's how you make a "neural network" accurately extrapolate from only a small amount of training data. And now that it's only learning a simple function, what it learned is easily interpretable through sparse regression techniques.
Once we had this idea of wanting to embed structure, then all of the hard work came. Essentially, in big neural network machine learning, you can get away with a lot of performance issues if 99.9% of your time is spent in the neural network's calculations. But once we got into the regime of small data small neural network structured machine learning, our neural networks were not the big time consumer, which meant every little detail mattered. Thus we needed to hyper-optimize the solution of small ODE solves to make this a reality. As a result of our optimizations, we have easily reproducible benchmarks which showcase a 50,000x acceleration over the torchdiffeq neural ODE library. In fact, benchmarks show across the board orders of magnitude performance advantages over SciPy, MATLAB, and R's deSolve as well. This is not a small detail, as this problem of training neural networks within scientific simulations is a costly project which takes many millions of ODE solves, and therefore these performance optimizations changed the problem from "impractical" to "reality". Again, when very large neural networks are involved this may be masked by the cost of neural network passes itself, but in the context of small network scientific machine learning, this change was a godsend.
But the even larger difficulty that we noticed was that traditional numerical analysis ideas like stability really came into play once real physical models got involved. There is this property of ODEs called stiffness, and when it comes into play, the simple Runge-Kutta or Adams-Bashforth-Moulton methods are no longer stable enough to accurately solve the equations. Thus when looking at the universal partial differential equations, we had to make use of a set of stiff ODE solvers which have package implementations in Julia and Fortran. Additionally, any form of backwards solving is unconditionally unstable on the diffusion-advection equation, meaning this is a practical case where simple adjoint methods like the backsolve approach of the original neural ODEs paper and torchdiffeq actually diverge to infinity in finite time for any tolerance on the ODE solver. Thus we had to implement a bunch of different (checkpointed) adjoint implementations in order to accurately and efficiently train neural networks within these equations. Then, once we had a partial differential equation form, we had to build tools that would integrate with automatic differentiation to automatically specialize on sparsity. The result was a full set of advanced methods for efficiently handling stiffness that was fully compatible with neural network backpropagation. It was only when all of this came together that the most difficult examples of what we showed actually worked. Now, our software DifferentialEquations.jl with DiffEqFlux.jl is able to handle:
all with GPUs, sparse and structured Jacobians, preconditioned Newton-Krylov, and the list of features just keeps going. This is the limitation: when real scientific models get involved, the numerical complexity drastically increases. But now this is something that at least Julia libraries have solved.
The takeaway there is that not only do you need to use all of the scientific knowledge available, but you also need to make use of all of the numerical analysis knowledge. When you combine all of this knowledge with the most recent advances of machine learning, then you get small neural networks that train on small data in a way that is interpretable and accurately extrapolates. So yes, it's not magic: we just replaced the big data requirement with the requirement of having some prior scientific knowledge, and if you go talk to any scientist you'll know this data source exists. I think it's time we use it in machine learning.
The code to reproduce our results is in this Github repository. However, I would like to see people try other examples. All of the tooling is now in the open source DifferentialEquations.jl and DiffEqFlux.jl packages. The implementation of SInDy is in the DataDrivenDiffEq package (this one isn't quite released yet, but the tools used for these examples are released, and it does spit out ModelingToolkit expressions which you can call Latexify on. Documentation on this will come soon!). The final example in the paper is a library call in NeuralNetDiffEq.jl with the algorithm choice of LambaEM().
For examples on how to use these tools in your own packages, I would consult this part of the DiffEqFlux.jl README.
The post Scientific Machine Learning: Interpretable Neural Networks That Accurately Extrapolate From Small Data appeared first on Stochastic Lifestyle.
Recently I have been pulling in a lot of technical colleagues to help with the development of next generation QSP tooling. For those without a background in biological modeling, I found it difficult to explain the "how" and "why" of pharmacological modeling. Why is it differential equations, and where do these "massively expensive global optimization" runs come from? What kinds of problems can you solve with such models when you know that they are only approximate?
To solve these questions, I took a step back and tried to explain a decision making scenario with a simple model, to showcase how playing with a model can allow one to distinguish between intervention strategies and uncover a way forward. This is my attempt. Instead of talking about something small and foreign like chemical reaction concentrations, let's talk about something mathematically equivalent that's easy to visualize: ecological intervention.
Let's take everyone's favorite ecology model: the Lotka-Volterra model. The model is the following:
Left alone, the rabbit population will grow exponentially
Rabbits are eaten by wolves in proportion to the number of wolves (number of mouths to feed), and in proportion to the number of rabbits (ease of food access: you eat more at a buffet!)
Wolf populations grow exponentially, as long as there is a proportional amount of food around (rabbits)
Wolves die over time of old age, and any generation dies at a similar age (no major wolf medical discoveries)
The model is then the ODE:
using OrdinaryDiffEq, Plots
function f(du,u,p,t)
  du[1] = dx = p[1]*u[1] - p[2]*u[1]*u[2]
  du[2] = dy = -p[3]*u[2] + p[4]*u[1]*u[2]
end
u0 = [1.0;1.0]
tspan = (0.0,10.0)
p = [1.5,1.0,3.0,1.0]
prob = ODEProblem(f,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol,label=["Rabbits" "Wolves"])
Except, me showing you that picture glossed over a major detail: every piece of the model is not only mechanistic, but also contains a parameter. For example, rabbits grow exponentially, but what's the growth rate? To make that plot I chose a value for that growth rate (1.5), but in reality we need to get that from data, since the results can be wildly different:
p = [0.1,1.0,3.0,1.0]
prob = ODEProblem(f,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
Here the exponential growth rate of rabbits is too low to sustain a wolf population, so the wolf population dies out. But then the rabbits have no predators and grow exponentially, which is a common route of ecological collapse since they will then destroy the local ecosystem. More on that later.
But okay, we need parameters from data, but no single data source is great. One gives us a noisy sample of the population yearly, another every month for the first two years and only on the wolves, etc.:
function f_true(du,u,p,t)
  du[1] = dx = p[1]*u[1] - p[2]*u[1]*u[2] - p[5]*u[1]^2
  du[2] = dy = -p[3]*u[2] + p[4]*u[1]*u[2]
end
p = [1.0,1.0,3.0,1.0,0.1]
prob = ODEProblem(f_true,u0,tspan,p)
sol = solve(prob,Tsit5())
data1 = sol(0:1:10)
data2 = sol(0:0.1:2;idxs=2)
scatter(data1)
scatter!(data2)
Oh, and notice that ODE is not the Lotka-Volterra model, but instead also adds a term p[5]*u[1]^2 for a rabbit disease which requires high rabbit density.
The local population is very snobby and wants the rabbit population decreased. You're trying to find out how to best intervene with the populations so that you decrease the rabbit population to always stay below 4 million rabbits, but without causing population collapse. What should you be targeting? The rabbit birth rate? The ability for predators to find rabbits?
(In systems pharmacology, this is: which reactions should I interact with in order to achieve my target goals while not introducing toxic side effects?)
In a complex system, these will all act differently, so you need to simulate what happens under uncertainty with the model. For example, if I attack birth rate too hard, we already saw that we can cause population collapse, but is birth rate a more robust target than wolf lifespan (i.e., could I change wolf lifespan more and get the same effect, but with less chance of collapse)?
But one caveat: in order to do these simulations you need to know the model and its parameters, since you want to investigate what happens when you change the model's parameters. But you just have "your best model" and "data". So you need to find out how to get "the best model you can" and the parameters of said model, since once you have that you can assess the targeting effects.
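That kind of what-if scan can be sketched as follows, using a hand-rolled RK4 loop instead of the OrdinaryDiffEq calls used elsewhere in this post, and made-up parameter values:

```julia
# Scan the rabbit birth rate and record the peak rabbit population from
# each simulated intervention scenario.
lv(u, p) = [p[1]*u[1] - p[2]*u[1]*u[2], -p[3]*u[2] + p[4]*u[1]*u[2]]
function peak_rabbits(p; u0 = [1.0, 1.0], dt = 1e-3, T = 10.0)
  u, peak = copy(u0), u0[1]
  for _ in 1:round(Int, T / dt)
    k1 = lv(u, p); k2 = lv(u .+ dt/2 .* k1, p)
    k3 = lv(u .+ dt/2 .* k2, p); k4 = lv(u .+ dt .* k3, p)
    u = u .+ dt/6 .* (k1 .+ 2 .* k2 .+ 2 .* k3 .+ k4)
    peak = max(peak, u[1])
  end
  peak
end
[peak_rabbits([b, 1.0, 3.0, 1.0]) for b in (0.5, 1.0, 1.5)]
```

In practice you would run such scans with a proper adaptive solver and with parameter uncertainty included, but the decision question ("which target keeps the peak below the threshold?") has this shape.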
Let's start with the model we have:
function f(du,u,p,t)
  du[1] = dx = p[1]*u[1] - p[2]*u[1]*u[2]
  du[2] = dy = -p[3]*u[2] + p[4]*u[1]*u[2]
end
u0 = [1.0;1.0]
tspan = (0.0,10.0)
p = ones(4)
prob = ODEProblem(f,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
Here I took all of the parameters to be 1, since my only guess is their relative order of magnitude, which should be about correct (maybe?).
To get some parameter values, I then need do some parameter fitting. Let's define a cost function as a difference of our result against the data sources we have:
using LinearAlgebra
function cost(p)
  # Needed for the optimizer to reject parameters out of the domain
  any(x->x<0,p) && return Inf
  # Solve the ODE with current parameters `p`
  prob = ODEProblem(f,u0,tspan,p)
  sol = solve(prob,Tsit5(),abstol=1e-8,reltol=1e-8)
  # Check the difference from the data
  norm(sol(0:1:10) - data1) + norm(data2 - sol(0:0.1:2;idxs=2))
end
cost(ones(4))
10.917779941345977
Yeah, those original parameters are pretty bad. But now let's optimize them:
using Optim
opt = optimize(cost,ones(4),BFGS())
p = Optim.minimizer(opt)
prob = ODEProblem(f,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
scatter!(data1)
scatter!(data2)
Oh wow, that fit is pretty bad! Well generally, for large models (the models are usually >200 lines long), this is pretty standard for a local optimizer. So okay, we need to use a global optimizer.
using BlackBoxOptim
bound = Tuple{Float64, Float64}[(0, 10),(0, 10),(0, 10),(0, 10)]
result = bboptimize(cost;SearchRange = bound, MaxSteps = 21e3)
Starting optimization with optimizer BlackBoxOptim.DiffEvoOpt{BlackBoxOptim.FitPopulation{Float64},BlackBoxOptim.RadiusLimitedSelector,BlackBoxOptim.AdaptiveDiffEvoRandBin{3},BlackBoxOptim.RandomBound{BlackBoxOptim.ContinuousRectSearchSpace}}
0.00 secs, 0 evals, 0 steps
0.50 secs, 521 evals, 382 steps, improv/step: 0.317 (last = 0.3168), fitness=6.101131069
1.00 secs, 1419 evals, 1253 steps, improv/step: 0.241 (last = 0.2078), fitness=4.441611914
1.50 secs, 2430 evals, 2264 steps, improv/step: 0.204 (last = 0.1573), fitness=3.039137898
2.00 secs, 3391 evals, 3225 steps, improv/step: 0.190 (last = 0.1592), fitness=2.959508579
2.50 secs, 4700 evals, 4535 steps, improv/step: 0.195 (last = 0.2069), fitness=2.936981210
3.01 secs, 5985 evals, 5821 steps, improv/step: 0.192 (last = 0.1827), fitness=2.936675847
3.51 secs, 7211 evals, 7048 steps, improv/step: 0.192 (last = 0.1874), fitness=2.936639907
4.01 secs, 8400 evals, 8238 steps, improv/step: 0.186 (last = 0.1555), fitness=2.936639382
4.51 secs, 9470 evals, 9309 steps, improv/step: 0.184 (last = 0.1662), fitness=2.936639323
5.01 secs, 10567 evals, 10406 steps, improv/step: 0.184 (last = 0.1814), fitness=2.936639322
5.51 secs, 11665 evals, 11504 steps, improv/step: 0.185 (last = 0.1940), fitness=2.936639322
6.01 secs, 12659 evals, 12498 steps, improv/step: 0.187 (last = 0.2113), fitness=2.936639322
6.51 secs, 13866 evals, 13705 steps, improv/step: 0.185 (last = 0.1698), fitness=2.936639322
7.01 secs, 15059 evals, 14898 steps, improv/step: 0.179 (last = 0.1106), fitness=2.936639322
7.51 secs, 16209 evals, 16048 steps, improv/step: 0.170 (last = 0.0478), fitness=2.936639322
8.02 secs, 17394 evals, 17233 steps, improv/step: 0.160 (last = 0.0211), fitness=2.936639322
8.52 secs, 18612 evals, 18458 steps, improv/step: 0.151 (last = 0.0286), fitness=2.936639322
9.02 secs, 19813 evals, 19970 steps, improv/step: 0.142 (last = 0.0337), fitness=2.936639322
Optimization stopped after 21001 steps and 9.23 seconds
Termination reason: Max number of steps (21000) reached
Steps per second = 2275.05
Function evals per second = 2201.39
Improvements/step = 0.13638
Total function evaluations = 20321
Best candidate found: [0.766198, 1.31019, 2.92441, 1.12677]
Fitness: 2.936639322
p = result.archive_output.best_candidate
prob = ODEProblem(f,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
scatter!(data1)
scatter!(data2)
Oh dang, still no good. This means our model is misspecified. We see that we overshoot over time, so there's some kind of decay missing. Thus we go and talk to our colleagues a bit more and find out that there's this weird bunny disease that is a huge problem every few years: that may be an effect that is required to produce the data!
So then we change our model. What if this disease is just old age related? Then we would have decay of rabbits unrelated to wolves, so the model would be like:
function f2(du,u,p,t)
  du[1] = dx = p[1]*u[1] - p[2]*u[1]*u[2] - p[5]*u[1]
  du[2] = dy = -p[3]*u[2] + p[4]*u[1]*u[2]
end
So now let's optimize with this:
function cost2(p)
  # Needed for the optimizer to reject parameters out of the domain
  any(x->x<0,p) && return Inf
  # Solve the ODE with current parameters `p`
  prob = ODEProblem(f2,u0,tspan,p)
  sol = solve(prob,Tsit5(),abstol=1e-8,reltol=1e-8)
  # Check the difference from the data
  norm(sol(0:1:10) - data1) + norm(data2 - sol(0:0.1:2;idxs=2))
end
bound = Tuple{Float64, Float64}[(0, 10),(0, 10),(0, 10),(0, 10),(0, 10)]
result = bboptimize(cost2;SearchRange = bound, MaxSteps = 21e3)
Starting optimization with optimizer BlackBoxOptim.DiffEvoOpt{BlackBoxOptim.FitPopulation{Float64},BlackBoxOptim.RadiusLimitedSelector,BlackBoxOptim.AdaptiveDiffEvoRandBin{3},BlackBoxOptim.RandomBound{BlackBoxOptim.ContinuousRectSearchSpace}}
0.00 secs, 0 evals, 0 steps
0.50 secs, 772 evals, 672 steps, improv/step: 0.257 (last = 0.2574), fitness=5.195430548
1.01 secs, 1633 evals, 1533 steps, improv/step: 0.187 (last = 0.1312), fitness=4.538806718
1.51 secs, 2744 evals, 2644 steps, improv/step: 0.148 (last = 0.0954), fitness=3.632220569
2.02 secs, 3772 evals, 3672 steps, improv/step: 0.136 (last = 0.1031), fitness=3.120794160
2.52 secs, 4962 evals, 4862 steps, improv/step: 0.137 (last = 0.1395), fitness=2.945033178
3.02 secs, 6061 evals, 5961 steps, improv/step: 0.137 (last = 0.1365), fitness=2.938642172
3.52 secs, 7226 evals, 7126 steps, improv/step: 0.141 (last = 0.1648), fitness=2.936764046
4.03 secs, 8494 evals, 8395 steps, improv/step: 0.141 (last = 0.1418), fitness=2.936657641
4.53 secs, 9641 evals, 9542 steps, improv/step: 0.141 (last = 0.1386), fitness=2.936639983
5.03 secs, 10755 evals, 10656 steps, improv/step: 0.145 (last = 0.1759), fitness=2.936639353
5.53 secs, 11947 evals, 11848 steps, improv/step: 0.143 (last = 0.1334), fitness=2.936639326
6.03 secs, 13181 evals, 13082 steps, improv/step: 0.145 (last = 0.1548), fitness=2.936639322
6.53 secs, 14524 evals, 14425 steps, improv/step: 0.144 (last = 0.1355), fitness=2.936639322
7.03 secs, 15717 evals, 15619 steps, improv/step: 0.144 (last = 0.1516), fitness=2.936639322
7.53 secs, 16975 evals, 16877 steps, improv/step: 0.144 (last = 0.1447), fitness=2.936639322
8.03 secs, 18020 evals, 17923 steps, improv/step: 0.144 (last = 0.1424), fitness=2.936639322
8.53 secs, 18811 evals, 18714 steps, improv/step: 0.143 (last = 0.1087), fitness=2.936639322
9.04 secs, 19671 evals, 19574 steps, improv/step: 0.139 (last = 0.0477), fitness=2.936639322
9.54 secs, 20122 evals, 20025 steps, improv/step: 0.136 (last = 0.0377), fitness=2.936639322
10.04 secs, 20940 evals, 20843 steps, improv/step: 0.132 (last = 0.0220), fitness=2.936639322
Optimization stopped after 21001 steps and 10.10 seconds
Termination reason: Max number of steps (21000) reached
Steps per second = 2078.48
Function evals per second = 2088.08
Improvements/step = 0.13081
Total function evaluations = 21098
Best candidate found: [8.17578, 1.31019, 2.92441, 1.12677, 7.40958]
Fitness: 2.936639322
p = result.archive_output.best_candidate
prob = ODEProblem(f2,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
scatter!(data1)
scatter!(data2)
Okay, that model is not right yet, but it's better. Let's try the local one and see what we get:
opt = optimize(cost2,ones(5),BFGS())
p = Optim.minimizer(opt)
prob = ODEProblem(f2,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
scatter!(data1)
scatter!(data2)
So we go back to our colleagues who know the local ecology and tell them that the model is giving us junk. Then we learn from one of the resident biologists that the death rate is population dependent: this extra death only occurs when the population is high! So we need to add a bunny-to-bunny interaction. A simple model for this is to assume that two bunnies always have to die together, so:
function f3(du,u,p,t)
  du[1] = dx = p[1]*u[1] - p[2]*u[1]*u[2] - p[5]*u[1]^2
  du[2] = dy = -p[3]*u[2] + p[4]*u[1]*u[2]
end
function cost3(p)
  # Needed for the optimizer to reject parameters out of the domain
  any(x->x<0,p) && return Inf
  # Solve the ODE with current parameters `p`
  prob = ODEProblem(f3,u0,tspan,p)
  sol = solve(prob,Tsit5(),abstol=1e-8,reltol=1e-8)
  # Check the difference from the data
  norm(sol(0:1:10) - data1) + norm(data2 - sol(0:0.1:2;idxs=2))
end
opt = optimize(cost3,ones(5),BFGS())
p = Optim.minimizer(opt)
prob = ODEProblem(f3,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
scatter!(data1)
scatter!(data2)
That might just be an optimization issue? So let's run our big model with global optimization on the cluster:
bound = Tuple{Float64, Float64}[(0, 10),(0, 10),(0, 10),(0, 10),(0, 10)]
result = bboptimize(cost3;SearchRange = bound, MaxSteps = 21e3)
Starting optimization with optimizer BlackBoxOptim.DiffEvoOpt{BlackBoxOptim.FitPopulation{Float64},BlackBoxOptim.RadiusLimitedSelector,BlackBoxOptim.AdaptiveDiffEvoRandBin{3},BlackBoxOptim.RandomBound{BlackBoxOptim.ContinuousRectSearchSpace}}
0.00 secs, 0 evals, 0 steps
0.50 secs, 1558 evals, 1421 steps, improv/step: 0.246 (last = 0.2456), fitness=3.862852782
1.00 secs, 3159 evals, 3022 steps, improv/step: 0.195 (last = 0.1499), fitness=1.992867792
1.50 secs, 4336 evals, 4199 steps, improv/step: 0.180 (last = 0.1410), fitness=1.629497857
2.00 secs, 5310 evals, 5173 steps, improv/step: 0.175 (last = 0.1530), fitness=0.373072771
2.50 secs, 6785 evals, 6648 steps, improv/step: 0.162 (last = 0.1186), fitness=0.219904053
3.01 secs, 7974 evals, 7837 steps, improv/step: 0.162 (last = 0.1590), fitness=0.083573607
3.51 secs, 9415 evals, 9278 steps, improv/step: 0.162 (last = 0.1610), fitness=0.019814280
4.01 secs, 10222 evals, 10085 steps, improv/step: 0.162 (last = 0.1660), fitness=0.006743899
4.66 secs, 11008 evals, 10871 steps, improv/step: 0.161 (last = 0.1527), fitness=0.003019186
5.16 secs, 11850 evals, 11713 steps, improv/step: 0.162 (last = 0.1651), fitness=0.003007867
5.66 secs, 12903 evals, 12767 steps, improv/step: 0.162 (last = 0.1679), fitness=0.002607472
6.16 secs, 13571 evals, 13436 steps, improv/step: 0.163 (last = 0.1719), fitness=0.002523329
6.66 secs, 14991 evals, 14856 steps, improv/step: 0.165 (last = 0.1831), fitness=0.002451722
7.17 secs, 16627 evals, 16494 steps, improv/step: 0.165 (last = 0.1679), fitness=0.002448788
7.67 secs, 18439 evals, 18306 steps, improv/step: 0.164 (last = 0.1518), fitness=0.002448722
8.18 secs, 19939 evals, 19806 steps, improv/step: 0.168 (last = 0.2160), fitness=0.002448716
Optimization stopped after 21001 steps and 8.59 seconds
Termination reason: Max number of steps (21000) reached
Steps per second = 2445.96
Function evals per second = 2461.33
Improvements/step = 0.16895
Total function evaluations = 21133
Best candidate found: [0.999909, 1.00034, 3.00008, 1.00015, 0.0999369]
Fitness: 0.002448716
p = result.archive_output.best_candidate
prob = ODEProblem(f3,u0,tspan,p)
sol = solve(prob,Tsit5())
plot(sol)
scatter!(data1)
scatter!(data2)
Oh bingo! That looks good! So okay, is it safe to reduce the rabbit birth rate, and by how much would we need to reduce it to get the population under 4?
test_p = p - [0.5,0,0,0,0]
prob = ODEProblem(f3,u0,tspan,test_p)
sol = solve(prob,Tsit5())
plot(sol)
test_p = p - [0.58,0,0,0,0]
prob = ODEProblem(f3,u0,tspan,test_p)
sol = solve(prob,Tsit5())
plot(sol)
test_p = p - [0.7,0,0,0,0]
prob = ODEProblem(f3,u0,tspan,test_p)
sol = solve(prob,Tsit5())
plot(sol)
Ouch. We see that by affecting the birth rate, we would have to hit a narrow reduction range of roughly [0.55, 0.7] to get the effect we want without introducing population collapse. What about affecting wolf population growth?
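The hand-tuned scan above can be automated. Here is a hedged sketch that sweeps birth-rate reductions and reports the first one that keeps the peak rabbit population under the target without collapse. `f3` is the model from above; the fitted parameter vector, the [1.0, 1.0] initial condition, the (0, 10) time span, and both thresholds are illustrative assumptions, not values from the original post.

```julia
using DifferentialEquations

# Model from above (with the bunny-to-bunny death term)
function f3(du,u,p,t)
  du[1] = p[1]*u[1] - p[2]*u[1]*u[2] - p[5]*u[1]^2
  du[2] = -p[3]*u[2] + p[4]*u[1]*u[2]
end

# Sweep reductions of the birth rate p[1]; `target` and `collapse`
# thresholds are assumptions for illustration
function scan_birth_rate(p, u0, tspan; target=4.0, collapse=1e-3)
  for reduction in 0.0:0.01:p[1]
    test_p = p - [reduction, 0, 0, 0, 0]
    sol = solve(ODEProblem(f3, u0, tspan, test_p), Tsit5())
    rabbits = [u[1] for u in sol.u]
    if maximum(rabbits) < target && minimum(rabbits) > collapse
      return reduction  # first reduction meeting both criteria
    end
  end
  return nothing  # no single reduction satisfied both criteria
end

p_fit = [0.999909, 1.00034, 3.00008, 1.00015, 0.0999369]  # best candidate from above
scan_birth_rate(p_fit, [1.0, 1.0], (0.0, 10.0))
```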
test_p = p + [0.0,0,0.0,0.25,0]
prob = ODEProblem(f3,u0,tspan,test_p)
sol = solve(prob,Tsit5())
plot(sol)
test_p = p + [0.0,0,0.0,0.5,0]
prob = ODEProblem(f3,u0,tspan,test_p)
sol = solve(prob,Tsit5())
plot(sol)
test_p = p + [0.0,0,1.5,2.0,0]
prob = ODEProblem(f3,u0,tspan,test_p)
sol = solve(prob,Tsit5())
plot(sol)
Wow, notice that in 3 very different parameter scenarios ("virtual populations") it's very easy to intervene this way and get the rabbit population below 4 million without a population collapse! This means that for this model, the most robust intervention is clearly to affect the wolf population growth factor.
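The three scenarios can also be pushed into a systematic "virtual population" check: jitter the fitted parameters, apply a wolf-side intervention, and count how often the rabbit population stays below 4 without collapsing. This is a sketch under assumptions (the 10% jitter, the +0.5 intervention, the thresholds, and the initial condition are all illustrative); `f3` is the model from above.

```julia
using DifferentialEquations, Random

# Model from above
function f3(du,u,p,t)
  du[1] = p[1]*u[1] - p[2]*u[1]*u[2] - p[5]*u[1]^2
  du[2] = -p[3]*u[2] + p[4]*u[1]*u[2]
end

function intervention_success_rate(p, u0, tspan; n=100, jitter=0.1)
  Random.seed!(1)  # reproducible virtual populations
  successes = 0
  for _ in 1:n
    vp = p .* (1 .+ jitter .* (2 .* rand(length(p)) .- 1))  # one virtual population
    test_p = vp + [0.0, 0.0, 0.0, 0.5, 0.0]                 # wolf-side intervention
    sol = solve(ODEProblem(f3, u0, tspan, test_p), Tsit5())
    rabbits = [u[1] for u in sol.u]
    successes += (maximum(rabbits) < 4.0 && minimum(rabbits) > 1e-3)
  end
  successes / n
end
```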
From these results you go to your board meeting and recommend:
Do not attempt to decrease the rabbit population by killing baby rabbits. This seems like the obvious approach, but the models show that it can easily lead to population collapse.
However, making it easier for wolves to feast on rabbits is a robust change to the ecosystem that gives the people what they want.
We recommend clearing a lot of bushes to give rabbits a harder time hiding, along with long-term investment in research for wolf binoculars.
As we go through our results, we tell our colleagues to never use a local optimizer.
During our talk, we tell our colleagues that the global optimization takes forever, so we should probably parallelize it, or do other things, like port our models to Julia to use its faster tools.
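One low-tech way to act on that advice, without giving up the local optimizer entirely, is an embarrassingly parallel multistart: launch many BFGS runs from random starting points across threads and keep the best. A sketch, where the number of starts and the [0, 10] search box are assumptions:

```julia
using Optim

function multistart(cost; nstarts=32, lo=0.0, hi=10.0, dim=5)
  results = Vector{Any}(undef, nstarts)
  Threads.@threads for i in 1:nstarts
    x0 = lo .+ (hi - lo) .* rand(dim)   # random start inside the search box
    results[i] = optimize(cost, x0, BFGS())
  end
  best = argmin(Optim.minimum.(results))  # smallest objective value wins
  Optim.minimizer(results[best])
end
```

With `cost3` from above, `p = multistart(cost3)` would stand in for the single `optimize(cost3, ones(5), BFGS())` call.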
We have a cancer pathway, and our team is synthesizing molecules to target different aspects of the pathway. Our goal as the modeler is to help guide the choice of which part of the pathway we should target. So we:
1. Talk to biologists and consult the literature to learn about the pathway
2. Build a model
3. Find some data in the literature about how the pathway should generally act
4. Try to fit the data
5. If we don't fit well, go back to (1)
Great, we now have a good enough model. Investigate what happens to the cancer patient if we target X vs. Y. Does X introduce toxic effects? Can we tell whether it only produces toxic effects in men?
Gather these plots to showcase that yes, the model gives a reliable fit, and the analysis shows that targeting X will cause ...
So other than the interpretation being different, it's this same workflow. It's this same mantra. Get a model that fits, and then understand the general behavior of the model to learn the best intervention strategy.
Given this process, the possible improvements to tooling are:
Solve differential equations faster.
Do global optimization in fewer steps.
Parallelize the global optimization.
Automatically find models from data for people?
However, any tooling needs to respect the interactive aspect between the modeler, the biology, the data, and the interpretation of the results.
Christopher Rackauckas, How Inexact Models Can Guide Decision Making in Quantitative Systems Pharmacology, The Winnower 7:e158508.80560 (2020). DOI: 10.15200/winn.158508.80560
This post is open to read and review on The Winnower.
The post How Inexact Models and Scientific Machine Learning Can Guide Decision Making in Quantitative Systems Pharmacology appeared first on Stochastic Lifestyle.
Recent Advancements in Differential Equation Solver Software
Since the ancient Fortran methods like dop853 and DASSL were created, many advancements in numerical analysis, computational methods, and hardware have accelerated computing. However, many applications of differential equations still rely on the same older software, possibly to their own detriment. In this talk we will describe the recent advancements being made in differential equation solver software, focusing on the Julia-based DifferentialEquations.jl ecosystem. We will show how high order Rosenbrock and IMEX methods have proven advantageous over traditional BDF implementations in certain problem domains, and the types of issues that give rise to the general performance characteristics of the methods. Extensions of these solver methods to adaptive high order methods for stochastic differential-algebraic and delay differential-algebraic equations will be demonstrated, and the potential use cases of these new solvers will be discussed. Acceleration and generalization of adjoint sensitivity analysis through source-to-source reverse-mode automatic differentiation and GPU compatibility will be demonstrated on neural differential equations: differential equations which incorporate trainable latent neural networks into their derivative functions to automatically learn dynamics from data.
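A minimal sketch of the neural differential equation idea mentioned here: the derivative function is a small hand-rolled network whose weights would be trained against data. The layer sizes and random weights are illustrative assumptions; a real workflow would use the SciML training tooling and the adjoint sensitivities described above for the gradients.

```julia
using DifferentialEquations

# Tiny 2 -> 8 -> 2 network standing in for unknown dynamics (sizes assumed)
W1, b1 = 0.1 .* randn(8, 2), zeros(8)
W2, b2 = 0.1 .* randn(2, 8), zeros(2)
nn(u) = W2 * tanh.(W1 * u .+ b1) .+ b2

# The "model" is now the network; training would adjust W1, b1, W2, b2
function neural_ode!(du, u, p, t)
  du .= nn(u)
end

prob = ODEProblem(neural_ode!, [1.0, 1.0], (0.0, 1.0))
sol = solve(prob, Tsit5())
```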
Edit 11/22/2019: Pointing to a new version of the video.
The post Recent advancements in differential equation solver software appeared first on Stochastic Lifestyle.