Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow


In some previous blog posts we described in detail how one can generalize automatic differentiation to automatically provide stability enhancements and all sorts of other niceties by incorporating graph transformations into code generation. However, one thing we didn’t go into much is the limitation of these types of algorithms. This limitation is what we have termed “quasi-static”, which is the property that an algorithm can be reinterpreted as some static algorithm. It turns out that for very fundamental reasons, this is the same limitation that some major machine learning frameworks, such as Jax or Tensorflow, impose on the code that they can fully optimize. This led us to the question: are there algorithms which are not optimizable within this mindset, and why? The answer is now published at ICML 2021, so let’s dig into this higher level concept.

The Space of Quasi-Static Algorithms

First of all, let’s come up with a concrete idea of what a quasi-static algorithm is. It’s the space of algorithms which in some way can be re-expressed as a static algorithm. Think of a “static algorithm” as one which has a simple mathematical description that does not require a full computer description, i.e. no loops, rewriting to memory, etc. As an example, let’s take a look at a function from the Jax documentation. The following is something that the Jax JIT works on:

from jax import jit

@jit
def f(x):
  for i in range(3):
    x = 2 * x
  return x
 
print(f(3))

Notice that it’s represented by something with control flow, i.e. it is code represented with a loop, but the loop is not necessary. We can also understand this method as 2*2*2*x or 8*x. The demonstrated example of where the JIT will fail by default is:

@jit
def f(x):
  if x < 3:
    return 3. * x ** 2
  else:
    return -4 * x
 
# This will fail!
try:
  f(2)
except Exception as e:
  print("Exception {}".format(e))

In this case, we can see that there are essentially two compute graphs split at x < 3, and so as stated this does not have a single mathematical statement that describes the computation. You can get around this by doing lax.cond(x < 3, lambda x: 3. * x ** 2, lambda x: -4 * x, x), but notice this is a fundamentally different computation: the lax.cond form traces both sides of the if statement and puts both into the compute graph before the value of x is known (and when batched with vmap it evaluates both sides and selects between them), while the true if statement changes its computation based on the conditional. The reason the lax.cond form works with Jax’s JIT compilation system is that it is quasi-static. The computations that can occur are fixed, even if the result is not, while the original if statement will change what is computed based on the input values. This limitation exists because Jax traces through a program to build the static compute graph under the hood, and it then does its actual transformations on this graph.

Are there other kinds of frameworks that do something similar? It turns out that the set of algorithms which are transformable into purely symbolic languages is also the set of quasi-static algorithms, so something like Symbolics.jl also has a form of quasi-staticness manifest in the behaviors of its algorithms. And it’s for the same reason: in symbolic algorithms you define symbolic variables like “x” and “y”, and then trace through a program to build a static compute graph for “2x^2 + 3y” which you then treat symbolically. In the frequently asked questions, there is a question about what happens when the conversion of a function to symbolic form fails. If you take a look at the example:

using Symbolics

function factorial(x)
  out = x
  while x > 1
    x -= 1
    out *= x
  end
  out
end
 
@variables x
factorial(x)

You can see that the reason this fails is that the algorithm is not representable as a single mathematical expression: the factorial cannot be written as a fixed number of multiplications because the number of multiplications depends on the value x you’re trying to compute x! for! The error that the symbolic language throws is “ERROR: TypeError: non-boolean (Num) used in boolean context”, which is saying that it does not know how to symbolically expand out “while x > 1” to be able to represent it statically. And this is not something that is necessarily “fixable”: it’s fundamental to the fact that this algorithm cannot be represented by a fixed computation and necessarily needs to change the computation based on the input.
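To make the tracing idea concrete, here is a minimal sketch of a tracer in plain Python. It is not how Jax or Symbolics.jl are implemented: the Sym class and the _show helper are made up for illustration, and the only thing it tries to show is why a fixed-size loop traces into one expression while a value-dependent while loop cannot.

```python
class Sym:
    """Hypothetical symbolic value: records arithmetic instead of doing it."""

    def __init__(self, expr):
        self.expr = expr

    def __mul__(self, other):
        return Sym(f"({self.expr} * {_show(other)})")

    def __rmul__(self, other):
        return Sym(f"({_show(other)} * {self.expr})")

    def __sub__(self, other):
        return Sym(f"({self.expr} - {_show(other)})")

    def __gt__(self, other):
        # A comparison of symbols is itself just another symbol.
        return Sym(f"({self.expr} > {_show(other)})")

    def __bool__(self):
        # The tracer has no value to branch on, so control flow must stop here.
        raise TypeError("symbolic value used in boolean context")


def _show(value):
    return value.expr if isinstance(value, Sym) else repr(value)


def double_three_times(x):
    for i in range(3):      # trip count is known without looking at x
        x = 2 * x
    return x


def factorial(x):
    out = x
    while x > 1:            # trip count depends on the value of x
        x = x - 1
        out = out * x
    return out


print(double_three_times(Sym("x")).expr)   # (2 * (2 * (2 * x)))
print(factorial(5))                        # 120

try:
    factorial(Sym("x"))
except TypeError as err:
    print("Tracing failed:", err)
```

The first function is quasi-static: the loop disappears during tracing and what is left is a single expression. The second is not, and the tracer fails at the “while x > 1” line for the same reason as the error message quoted above.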

Handling Non-Quasi-Static Algorithms in Symbolics and Machine Learning

The “solution” is to define a new primitive to the graph via “@register factorial(x)”, so that this function itself is a fixed node that does not try to be symbolically expanded. This is the same concept as defining a Jax primitive or a Tensorflow primitive, where an algorithm simply is not quasi-static and so the way to get a quasi-static compute graph is to treat the dynamic block just as a function “y = f(x)” that is preordained to exist. In the context of both symbolic languages and machine learning frameworks, for this to work in full you also need to define derivatives of said function. That last part is the catch. If you take another look at the depths of the documentation of some of these tools, you’ll notice that many of these primitives representing non-static control flow fall outside of the realm that is fully handled.

Right there in the Jax documentation it notes that you can replace a while loop with lax.while_loop, but that this is not amenable to reverse-mode automatic differentiation. The reason is that its reverse-mode AD implementation assumes that such a quasi-static algorithm exists and uses this for two purposes: first for generating the backpass, and second for generating the XLA (“Tensorflow”) description of the algorithm to then JIT compile and optimize. XLA wants the static compute graph, which again does not necessarily exist for this case, hence the fundamental limitation. The way to get around this, of course, is to define your own primitive with its own fast gradient calculation, and then this problem goes away…

Or does it?

Where Can We Find The Limit Of Quasi-Static Optimizers?

There are machine learning frameworks which do not make the assumption of quasi-staticness but also optimize, and most of these are things like Diffractor.jl, Zygote.jl, and Enzyme.jl in the Julia programming language (note PyTorch does not assume quasi-static representations, though TorchScript’s JIT compilation does). This got me thinking: are there actual machine learning algorithms for which this is a real limitation? This is a good question, because if you pull up your standard methods like convolutional neural networks, that’s a fixed function kernel call with a good derivative defined, or a recurrent neural network, that’s a fixed size for loop. If you want to break this assumption, you have to go to a space that is fundamentally about an algorithm where you cannot know “the amount of computation” until you know the specific values in the problem, and equation solvers are something of this form.

How many steps does it take for Newton’s method to converge? How many steps does an adaptive ODE solver take? These are not questions that can be answered a priori: they are fundamentally questions which require knowing:

  1. What equation are we solving?
  2. What is the initial condition?
  3. Over what time span?
  4. With what solver tolerance?

For this reason, people who work in Python frameworks have been looking for the “right” way to treat equation solving (ODE solving, finding roots f(x)=0, etc.) as a blackbox representation. If you take another look at the Neural Ordinary Differential Equations paper, one of the big things it was proposing was the treatment of neural ODEs as a blackbox with a derivative defined by the ODE adjoint. The reason, of course, is that adaptive ODE solvers necessarily iterate to tolerance, so there is necessarily something like “while t < tend” which is dependent on whether the current computations are computed to tolerance. Because such a loop is not optimized in the frameworks they were working in, the blackbox treatment was required to make the algorithm work.
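As a small illustration of both points, here is a sketch using Newton’s method on the toy equation g(x, p) = x^2 - p = 0. The function names, the tolerance, and the choice of equation are assumptions made for the example, not anything taken from the papers mentioned. It shows that the iteration count depends on the values in the problem, and that the solve can still be given a derivative as a blackbox (here via the implicit function theorem) without differentiating the loop itself.

```python
def newton_sqrt(p, x0=1.0, rtol=1e-12, max_iter=100):
    """Solve g(x, p) = x**2 - p = 0 for x with Newton's method."""
    x = x0
    iters = 0
    # The trip count of this loop depends on the values of p, x0, and rtol.
    while abs(x * x - p) > rtol * p and iters < max_iter:
        x = x - (x * x - p) / (2.0 * x)
        iters += 1
    return x, iters


def newton_sqrt_grad(p):
    """dx/dp defined for the blackbox, without differentiating the loop.

    Differentiating g(x(p), p) = 0 gives g_x * dx/dp + g_p = 0, so
    dx/dp = -g_p / g_x = 1 / (2 * x).
    """
    x, _ = newton_sqrt(p)
    return 1.0 / (2.0 * x)


_, iters_small = newton_sqrt(2.0)
_, iters_large = newton_sqrt(1.0e6)
print(iters_small, iters_large)   # the second solve needs more iterations
print(newton_sqrt_grad(4.0))      # close to 0.25
```

This is the same pattern as defining a primitive with a hand-written gradient: the framework only ever sees “y = f(x)” and its derivative rule, never the while loop inside.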

Should You Treat Equation Solvers As a Quasi-Static Blackbox?

No, it’s not fundamental to have to treat such algorithms as a blackbox. In fact, we had a rather popular paper a few years ago showing that neural stochastic differential equations can be trained with forward and reverse mode automatic differentiation directly via some Julia AD tools. The reason is that these AD tools (Zygote, Diffractor, Enzyme, etc.) do not necessarily assume quasi-static forms due to how they do direct source-to-source transformations, and so they can differentiate the adaptive solvers directly and spit out the correct gradients. So you do not necessarily have to do it in the “define a Tensorflow op” style, but which is better?

It turns out that “better” can be really hard to define because the two algorithms are not necessarily the same and can compute different values. You can boil this down to: do you want to differentiate the solver of the equation, or do you want to differentiate the equation and apply a solver to that? The former, which is equivalent to automatic differentiation of the algorithm, is known as discrete sensitivity analysis or the discretize-then-optimize approach. The latter is known as continuous sensitivity analysis or the optimize-then-discretize approach. Machine learning is not the first field to come up against this problem, so the paper on universal differential equations and the scientific machine learning ecosystem has a rather long description that I will quote:

“””
Previous research has shown that the discrete adjoint approach is more stable than continuous adjoints in some cases [41, 37, 42, 43, 44, 45] while continuous adjoints have been demonstrated to be more stable in others [46, 43] and can reduce spurious oscillations [47, 48, 49]. This trade-off between discrete and continuous adjoint approaches has been demonstrated on some equations as a trade-off between stability and computational efficiency [50, 51, 52, 53, 54, 55, 56, 57, 58]. Care has to be taken as the stability of an adjoint approach can be dependent on the chosen discretization method [59, 60, 61, 62, 63], and our software contribution helps researchers switch between all of these optimization approaches in combination with hundreds of differential equation solver methods with a single line of code change.
“””

Or, tl;dr: there’s tons of prior research which generally shows that continuous adjoints are less stable than discrete adjoints, but they can be faster. We have done recent follow-ups which show these claims are true on modern problems with modern software. Specifically, this paper on stiff neural ODEs shows why discrete adjoints are more stable than continuous adjoints when training on multiscale data, but we also recently showed continuous adjoints can be much faster at gradient computations than (some) current AD techniques for discrete adjoints.
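The difference between “differentiate the solver” and “differentiate the equation, then solve” can be seen on a toy problem. The sketch below uses u' = -p*u with explicit Euler; the function names, the step counts, and the use of forward sensitivities rather than adjoints are all assumptions chosen to keep the example short. The discrete version propagates the derivative through the exact steps the solver took, while the continuous version solves the sensitivity equation s' = -u - p*s on its own grid, the way a separately controlled solve would.

```python
import math


def euler_solve(p, u0=1.0, T=1.0, n=10):
    """Explicit Euler for u' = -p*u on [0, T] with n fixed steps."""
    h = T / n
    u = u0
    for _ in range(n):
        u = u + h * (-p * u)
    return u


def discrete_sensitivity(p, u0=1.0, T=1.0, n=10):
    """Differentiate the solver: push du/dp through each Euler step."""
    h = T / n
    u, s = u0, 0.0
    for _ in range(n):
        # d/dp of u_next = u * (1 - h*p), using the value of u before the step
        s = s * (1.0 - h * p) - h * u
        u = u * (1.0 - h * p)
    return s


def continuous_sensitivity(p, u0=1.0, T=1.0, n=40):
    """Differentiate the equation: solve s' = -u - p*s on a different grid."""
    h = T / n
    u, s = u0, 0.0
    for _ in range(n):
        u, s = u + h * (-p * u), s + h * (-u - p * s)
    return s


p = 2.0
exact = -math.exp(-p)   # d/dp of exp(-p*T) at T = 1
print(discrete_sensitivity(p), continuous_sensitivity(p), exact)
```

Both numbers approximate the true derivative, but they are not equal to each other. Only the discrete one is the exact gradient of the value that euler_solve actually returned, which is one way to think about why the two approaches can behave differently during training.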

So okay, there’s a true benefit to using discrete adjoint techniques if you’re handling these hard stiff differential equations, differentiating partial differential equations, etc. and this has been known since the 80’s in the field of control theory. But other than that, it’s a wash, and so it’s not clear whether differentiating such algorithms is better in machine learning, right?

Honing In On An Essentially Non-Quasi-Static Algorithm Which Accelerates Machine Learning

This now brings us to how the recent ICML paper fits into this narrative. Is there a non-quasi-static algorithm that is truly useful for standard machine learning? The answer turns out to be yes, but how to get there requires a few slick tricks. First, the setup. Neural ODEs can be an interesting method for machine learning because they use an adaptive ODE solver to essentially choose the number of layers for you, so it’s like a recurrent neural network (or more specifically, like a residual neural network) that automatically finds the “correct” number of layers, where the number of layers is the number of steps the ODE solver decides to take. In other words, Neural ODEs for image processing are an algorithm that automatically does hyperparameter optimization. Neat!

But… what is the “correct” number of layers? For hyperparameter optimization you’d assume that would be “the least number of layers to make predictions accurately”. However, by default neural ODEs will not give you that number of layers: they will give you whatever they feel like. In fact, if you look at the original neural ODE paper, as the neural ODE trains it keeps increasing the number of layers it uses.

So is there a way to change the neural ODE to make it define “correct number of layers” as “least number of layers”? In the work Learning Differential Equations that are Easy to Solve they did just that, by regularizing the training process of the neural ODE. They looked at the solution and noted that ODEs which have more changes going on are necessarily harder to solve, so you can transform the training process into hyperparameter optimization by adding a regularization term that says “make the higher order derivative terms as small as possible”. The rest of the paper is how to enact this idea. How was that done? Well, if you have to treat the algorithm as a blackbox, you need to define some blackbox way of defining high order derivatives, which then leads to Jesse’s pretty cool formulation of Taylor-mode automatic differentiation. But no matter how you put it, that’s going to be an expensive object to compute: computing the gradient is more expensive than the forward pass, the second derivative more so than the gradient, the third more so again, and so on, so an algorithm that wants 6th derivatives is going to be nasty to train. With some pretty heroic work they got a formulation of this blackbox operation which takes twice as long to train but successfully does the hyperparameter optimization.

End of story? Far from it!

The Better Way to Make Neural ODEs An Automatic Hyperparameter Optimizing Algorithm

Is there a way to make automatic hyperparameter optimization via neural ODEs train faster? Yes, and our paper makes them not only train faster than that other method, but also train faster than the vanilla neural ODE. We can make layer hyperparameter optimization less than free: we can make it cheaper than not doing the optimization! But how? The trick is to open the blackbox. To see how, consider what a step of the adaptive ODE solver looks like.

Notice that the adaptive ODE solver chooses whether a time step is appropriate by using an error estimate. The ODE algorithm is actually constructed so that the error estimate, the estimate of “how hard this ODE is to solve”, is computed for free. What if we use this free error estimate as our regularization technique? It turns out that this is 10x faster to train than before, while similarly automatically performing hyperparameter optimization.
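Here is a minimal sketch of such an adaptive step, using the Heun-Euler embedded pair on a scalar ODE. The solver choice, the step size controller constants, and the way the error estimates are summed are assumptions for illustration only; the solvers and the exact regularizer used in the paper may differ. The point is just to show where the error estimate comes from and that it reuses the stages the step already computed.

```python
import math


def adaptive_heun_euler(f, u0, t0, t1, rtol=1e-3, atol=1e-6, h=0.1):
    """Integrate u' = f(u, t) from t0 to t1 with an embedded 2(1) pair.

    Returns the final value, the number of accepted steps, and the sum of
    the error estimates of the accepted steps.
    """
    t, u = t0, u0
    accepted_steps = 0
    error_sum = 0.0
    while t < t1:                            # trip count depends on f, u0, and tolerances
        h = min(h, t1 - t)
        k1 = f(u, t)
        k2 = f(u + h * k1, t + h)
        u_high = u + 0.5 * h * (k1 + k2)     # second order (Heun)
        u_low = u + h * k1                   # first order (Euler)
        err = abs(u_high - u_low)            # error estimate from k1 and k2 only
        scale = atol + rtol * max(abs(u), abs(u_high))
        ratio = err / scale
        if ratio <= 1.0:                     # accept the step
            t += h
            u = u_high
            accepted_steps += 1
            error_sum += err
        # Simple controller: shrink after a large error, grow after a small one.
        factor = 0.9 * (1.0 / ratio) ** 0.5 if ratio > 0.0 else 5.0
        h *= min(5.0, max(0.2, factor))
    return u, accepted_steps, error_sum


u, steps, error_sum = adaptive_heun_euler(lambda u, t: -u, 1.0, 0.0, 1.0)
print(u, math.exp(-1.0))      # numerical vs exact solution
print(steps, error_sum)

_, steps_hard, _ = adaptive_heun_euler(lambda u, t: -20.0 * u, 1.0, 0.0, 1.0)
print(steps_hard)             # the faster-changing ODE needs more steps
```

Note that error_sum costs no extra evaluations of f, but it only exists inside the while loop: you cannot get at it without running (and, for training, differentiating) the solver’s own steps.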

Notice where we have ended up: the resulting algorithm is necessarily not quasi-static. This error estimate is computed by the actual steps of the adaptive ODE solver: to compute this error estimate, you have to do the same computations, the same while loop, as the ODE solver. In this algorithm, you cannot avoid directly differentiating the ODE solver because pieces of the ODE solver’s internal calculations are now part of the regularization. This is something that is fundamentally not optimized by methods that require quasi-static compute graphs (Jax, Tensorflow, etc.), and it is something that makes hyperparameter optimization cheaper than not doing hyperparameter optimization since the regularizer is computed for free. I just find this result so cool!

Conclusion: Finding the Limitations of Our Tools

So yes, the paper is a machine learning paper on how to do hyperparameter optimization for free using a trick on neural ODEs, but I think the general software context this sits in highlights the true finding of the paper. This is the first algorithm that I know of where there is both a clear incentive for it to be used in modern machine learning and a fundamental reason why common machine learning frameworks like Jax and Tensorflow will not be able to treat it optimally. Even PyTorch’s TorchScript will fundamentally, due to the assumptions of its compilation process, not work on this algorithm. Those assumptions were smartly chosen because most algorithms can satisfy them, but this one cannot. Does this mean machine learning is algorithmically stuck in a rut? Possibly, because I thoroughly believe that someone working within a toolset that does not optimize this algorithm would have never found it, which makes it very thought-provoking to me.

What other algorithms are out there which are simply better than our current approaches but are worse only because of the current machine learning frameworks? I cannot wait until Diffractor.jl’s release to start probing this question deeper.

2 thoughts on “Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow”

  1. Alban says:

    Hi,

    I am curious what prevents someone from modifying the blackbox corresponding to the ODE solver and make that error estimate another output of the blackbox.
    And then implement a manual gradient formula for that new output.

    I’m sure that if the AD system is able to provide gradients for that error estimate, this error estimate is computed by a chain of differentiable functions and so someone could write down the gradient formula corresponding to that chain of differentiable functions.


    • That is a good question. We address this in the paper only briefly. The regularization would be equivalent to a linear combination of the internal stage values (the “k’s”), and thus you’d have to calculate the derivative with respect to each of them. But the derivative with respect to all of the k’s is precisely discrete sensitivity analysis, and it’s equivalent to doing AD through the solver algorithm. People have worked it all out, see FATODE. So you can do this by defining the solve and then its gradient, where the gradient is calculated by writing out all of the steps to automatic differentiation yourself, putting in the lax.while_loop in both the forward and reverse to avoid the XLA constraint on static memory requirements of generated reverse passes (since it’s no longer generated).

      Even if you do this though, you still won’t achieve the full performance you want because XLA still would opt out of many optimizations because of the lax.while_loop you wrote into both the forward and reverse passes. For example, it won’t fuse kernels, so GPU performance would be slow. This likely wouldn’t play nicely with primitives like vmap either, so it still stands that it’s not something that would optimize well. So yes, you could do a few primitives to get further along, but there are still some roadblocks you’d run into because of the quasi-static assumptions made by XLA which Jax uses to optimize. Of course, I wouldn’t go as far as to say Jax is doomed or anything silly like that just because we found one useful algorithm that falls outside of what it assumed people would want to do, but it’s at least interesting to note that such an algorithm does indeed exist.

