7. Automatic Differentiation#

7.1. Overview#

In this lecture, we discuss auto-differentiation in Julia and introduce some key packages.

using LinearAlgebra, Statistics
using ForwardDiff, Enzyme, Test

7.2. Introduction to Differentiable Programming#

The promise of differentiable programming is that we can move towards taking the derivatives of almost arbitrarily complicated computer programs, rather than simply thinking about the derivatives of mathematical functions. Differentiable programming is the natural evolution of automatic differentiation (AD, sometimes called algorithmic differentiation).

Stepping back, there are three ways to calculate the gradient or Jacobian

  • Analytic derivatives / Symbolic differentiation

    • You can sometimes calculate the derivative on pen-and-paper, and potentially simplify the expression.

    • In effect, repeated applications of the chain rule, product rule, etc.

    • It is sometimes the most accurate and fastest option if there are algebraic simplifications.

    • Sometimes, symbolic computation on the computer is a good solution if the package can handle your functions. Doing algebra by hand is tedious and error-prone, but sometimes invaluable.

  • Finite differences

    • Evaluate the function at least \(N+1\) times to get the gradient – Jacobians are even worse.

    • A large \(\Delta\) is numerically stable but inaccurate, while a very small \(\Delta\) is more accurate but numerically unstable (see the short sketch after this list).

    • Choosing the \(\Delta\) is hard, so use packages such as DiffEqDiffTools.jl.

    • If a function is \(R^N \to R\) for a large \(N\), this requires \(O(N)\) function evaluations.

\[ \partial_{x_i}f(x_1,\ldots x_N) \approx \frac{f(x_1,\ldots x_i + \Delta,\ldots x_N) - f(x_1,\ldots x_i,\ldots x_N)}{\Delta} \]
  • Automatic Differentiation

    • The same as analytic/symbolic differentiation, but where the chain rule is calculated numerically rather than symbolically.

    • Just as with analytic derivatives, we can establish rules for the derivatives of individual functions (e.g. \(d\left(\sin(x)\right) = \cos(x)\,dx\)) as intrinsic derivatives.
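
To make the step-size trade-off concrete, here is a minimal sketch (the test function, helper name, and \(\Delta\) values are illustrative choices, not taken from any package). The error is large when \(\Delta\) is too big (truncation error) and grows again when \(\Delta\) is too small (roundoff error).

# forward difference of sin at x = 1.0; the exact derivative is cos(1.0)
forward_diff(f, x, Δ) = (f(x + Δ) - f(x)) / Δ
for Δ in (1e-1, 1e-5, 1e-13)
    println("Δ = $Δ, error = ", abs(forward_diff(sin, 1.0, Δ) - cos(1.0)))
end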

AD has two basic approaches, which are variations on the order of evaluating the chain rule: reverse and forward mode (although mixed mode is possible).

  1. If a function is \(R^N \to R\), then reverse-mode AD can find the gradient in \(O(1)\) sweeps (where a “sweep” is \(O(1)\) function evaluations).

  2. If a function is \(R \to R^N\), then forward-mode AD can find the Jacobian in \(O(1)\) sweeps.

We will explore two types of automatic differentiation in Julia (and discuss a few packages which implement them). For both, remember the chain rule

\[ \frac{dy}{dx} = \frac{dy}{dw} \cdot \frac{dw}{dx} \]

Forward mode starts the calculation from the input side, computing \(\frac{dw}{dx}\) first and then multiplying by \(\frac{dy}{dw}\). Reverse mode, on the other hand, starts from the output with \(\frac{dy}{dw}\) and works backwards towards the inputs.

Take as an example a function built from fundamental operations with known analytical derivatives,

\[ f(x_1, x_2) = x_1 x_2 + \sin(x_1) \]

and rewrite it as a function composed of a sequence of simple operations and temporary variables.

function f(x_1, x_2)
    w_1 = x_1
    w_2 = x_2
    w_3 = w_1 * w_2
    w_4 = sin(w_1)
    w_5 = w_3 + w_4
    return w_5
end
f (generic function with 1 method)

Here we can identify all of the underlying functions (*, sin, +), and see if each has an intrinsic derivative. While these are obvious, with Julia we could come up with all sorts of differentiation rules for arbitrarily complicated combinations and compositions of intrinsic operations. In fact, there is even a package for registering more.

7.2.1. Forward-Mode Automatic Differentiation#

In forward-mode AD, you first fix the variable you are interested in (called “seeding”), and then evaluate the chain rule in left-to-right order.

For example, with our \(f(x_1, x_2)\) example above, if we wanted to calculate the derivative with respect to \(x_1\), then we can seed the setup accordingly: \(\frac{\partial w_1}{\partial x_1} = 1\), since this is the variable we are differentiating with respect to, while \(\frac{\partial w_2}{\partial x_1} = 0\).

Following through with these, redo all of the calculations for the derivative in parallel with the function itself.

\[\begin{split} \begin{array}{l|l} f(x_1, x_2) & \frac{\partial f(x_1,x_2)}{\partial x_1} \\ \hline w_1 = x_1 & \frac{\partial w_1}{\partial x_1} = 1 \text{ (seed)}\\ w_2 = x_2 & \frac{\partial w_2}{\partial x_1} = 0 \text{ (seed)} \\ w_3 = w_1 \cdot w_2 & \frac{\partial w_3}{\partial x_1} = w_2 \cdot \frac{\partial w_1}{\partial x_1} + w_1 \cdot \frac{\partial w_2}{\partial x_1} \\ w_4 = \sin w_1 & \frac{\partial w_4}{\partial x_1} = \cos w_1 \cdot \frac{\partial w_1}{\partial x_1} \\ w_5 = w_3 + w_4 & \frac{\partial w_5}{\partial x_1} = \frac{\partial w_3}{\partial x_1} + \frac{\partial w_4}{\partial x_1} \end{array} \end{split}\]

Since the two columns can be evaluated at the same time, we say only “one pass” is required for this calculation.

Generalizing a little, if the function were vector-valued, then that single pass would give the derivatives of every output with respect to \(x_1\), i.e. one column of the Jacobian. Hence, for an \(R^N \to R^M\) function, forward-mode AD requires \(N\) passes to fill a dense Jacobian.

How can you implement forward-mode AD? With a generic programming language it turns out to be fairly easy to write a simple implementation (while the devil is in the details for a high-performance one).

7.2.2. Forward-Mode with Dual Numbers#

One way to implement forward-mode AD is to use dual numbers.

Instead of working with just a real number, e.g. \(x\), we will augment each with an infinitesimal \(\epsilon\) and use \(x + \epsilon\).

From Taylor’s theorem,

\[ f(x + \epsilon) = f(x) + f'(x)\epsilon + O(\epsilon^2) \]

where we will define the infinitesimal such that \(\epsilon^2 = 0\).

With this definition, we can write a general rule for differentiation of \(g(x,y)\) as the chain rule for the total derivative

\[ g(x + \epsilon, y + \epsilon) = g(x, y) + (\partial_x g(x,y) + \partial_y g(x,y))\epsilon \]

But note that if we keep track of the coefficients on the \(\epsilon\) terms (e.g. an \(x'\) and a \(y'\)), then

\[ g(x + x'\epsilon, y + y'\epsilon) = g(x, y) + (\partial_x g(x,y)x' + \partial_y g(x,y)y')\epsilon \]

This is simply the chain rule. A few more examples

\[\begin{split} \begin{aligned} (x + x'\epsilon) + (y + y'\epsilon) &= (x + y) + (x' + y')\epsilon\\ (x + x'\epsilon)\times(y + y'\epsilon) &= (xy) + (x'y + y'x)\epsilon\\ \exp(x + x'\epsilon) &= \exp(x) + (x'\exp(x))\epsilon\\ \end{aligned} \end{split}\]

Using generic programming in Julia, it is easy to define a new dual-number type which encapsulates the pair \((x, x')\) and provides definitions for all of the basic operations. Each definition then has the chain rule built into it.

With this approach, the “seed” process is simply the creation of the \(\epsilon\) for the underlying variable.

So if we have the function \(f(x_1, x_2)\) and we wanted to find the derivative \(\partial_{x_1} f(3.8, 6.9)\), we would seed them with the dual numbers \(x_1 \to (3.8, 1)\) and \(x_2 \to (6.9, 0)\).

If you then follow all of the same scalar operations above with a seeded dual number, it will calculate both the function value and the derivative in a single “sweep” and without modifying any of your (generic) code.

7.2.3. ForwardDiff.jl#

Dual-numbers are at the heart of one of the AD packages we have already seen.
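
As a quick illustration of the seeding described above, we can construct ForwardDiff's dual numbers directly and push them through the \(f(x_1, x_2)\) defined earlier. This is only a sketch for intuition (the untagged Dual constructor is used here purely for demonstration); in practice you would use the higher-level interface shown below.

# seed x_1 with derivative 1 and x_2 with derivative 0
x_1 = ForwardDiff.Dual(3.8, 1.0)
x_2 = ForwardDiff.Dual(6.9, 0.0)
f(x_1, x_2)  # carries both the value and the partial with respect to x_1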

h(x) = sin(x[1]) + x[1] * x[2] + sinh(x[1] * x[2]) # multivariate
x = [1.4 2.2]
@show ForwardDiff.gradient(h, x) # use AD, seeds from x

#Or, can use complicated functions of many variables
f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x)
g = (x) -> ForwardDiff.gradient(f, x); # g() is now the gradient
g(rand(5)) # gradient at a random point
# ForwardDiff.hessian(f,x') # or the hessian
ForwardDiff.gradient(h, x) = [26.35476496103098 16.663053156992287]

5-element Vector{Float64}:
 1.0223895336213846
 0.6381842072139465
 0.9702628358435225
 0.7435237800711872
 1.1844121694582914
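
The commented-out line above hints at Hessians as well; as a small sketch, here is the Hessian of the bivariate h at a plain vector input.

ForwardDiff.hessian(h, [1.4, 2.2])  # 2×2 matrix of second derivatives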

We can even auto-differentiate complicated functions with embedded iterations.

function squareroot(x) #pretending we don't know sqrt()
    z = copy(x) # Initial starting point for Newton’s method
    while abs(z * z - x) > 1e-13
        z = z - (z * z - x) / (2z)
    end
    return z
end
squareroot(2.0)
1.4142135623730951
dsqrt(x) = ForwardDiff.derivative(squareroot, x)
dsqrt(2.0)
0.35355339059327373
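
As a quick check (using the Test package loaded at the top), compare against the analytic derivative \(\frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}}\).

@test dsqrt(2.0) ≈ 1 / (2 * sqrt(2.0))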

7.2.4. Reverse-Mode AD#

Unlike forward-mode auto-differentiation, reverse-mode is very difficult to implement efficiently, and many variations on the best approach exist.

Many reverse-mode packages are connected to machine-learning packages, since the efficient gradients of \(R^N \to R\) loss functions are necessary for the gradient descent optimization algorithms used in machine learning.

At this point, Julia lacks a single, consistently usable reverse-mode AD package without rough edges. However, a few key options to consider are:

  • ReverseDiff.jl, a relatively dependable but limited package. Not really intended for standard ML-pipeline usage or scientific computing.

  • Zygote.jl, which is flexible but buggy and less reliable. It is gradually being deprecated but remains a common choice.

  • Mooncake.jl, a promising Julia-native AD package, which might be a cleaner and better maintained alternative to Zygote.jl.

  • Enzyme.jl, which is the most promising and supports both forward and reverse mode. However, as it operates at a lower compiler level, it cannot support all Julia code. In particular, it favors in-place functions over “pure” functions.

7.3. Introduction to Enzyme#

Enzyme.jl is a high-performance automatic differentiation (AD) tool that operates at the LLVM level (compiler IR) rather than the source code level. This allows it to differentiate through low-level optimizations, mutation, and foreign function calls where other AD tools might fail.

Note

Caution: Enzyme.jl is under active development, and some of the patterns shown here may change in future releases. In practice, you may find an LLM very valuable for navigating Enzyme.jl’s perplexing error messages. Compilation times can be very slow, and performance intuition is not always straightforward.

However, this power does make usage somewhat more challenging, as you need to ensure your code conforms to patterns that the compiler can differentiate.

It supports both Forward Mode (best for functions with few inputs and many outputs, e.g. \(R \to R^N\)) and Reverse Mode (best for gradients of scalar loss functions, \(R^N \to R\)), as well as nested differentiation (e.g., Hessians).

While ForwardDiff.jl is often easier for simple problems, Enzyme is capable of high-performance differentiation typically used in scientific computing and scientific machine learning (e.g., differentiable simulations and sensitivity analysis of differential equations).

7.3.1. Comparison to the Python Ecosystem#

Relative to JAX and PyTorch, Enzyme is often faster and more flexible for specialized algorithms in scientific computing (e.g., ODE solvers, physical simulations) because it differentiates at the LLVM level rather than the operation level. However, Enzyme itself is a low-level tool; it lacks the high-level layers for managing neural network state and batching found in PyTorch.

Note

One current advantage of Enzyme is its handling of mutation, which is essential in many scientific computing applications but has traditionally been difficult in JAX. The JAX ecosystem has been making progress on this limitation through mechanisms like “hijax” support in JAX NNX. This mechanism allows users to write standard Python classes with in-place mutation, which the library then automatically transforms into JAX-compatible functional code during compilation. However, this remains experimental, whereas Enzyme handles mutation directly.

7.3.2. Lux.jl: Neural Networks in Julia#

While not designed for standard deep learning pipelines out of the box, Enzyme can be paired with frameworks like Lux.jl to handle deep learning tasks and implement neural networks.

Unlike PyTorch and the default behavior of JAX’s Flax NNX, the Lux.jl framework does not use implicit differentiable arguments within the neural network layers.

Instead, the differentiable parameters are explicitly passed and separated from non-differentiable arguments. This is consistent with the split/merge pattern in JAX NNX.
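
To make the explicit-parameter pattern concrete, here is a minimal sketch using Lux.jl (Lux and Random are not loaded at the top of this lecture, so this assumes they are installed; the layer sizes and data are illustrative):

using Lux, Random

model = Dense(2 => 3, tanh)  # the layer object holds no parameters itself
ps, st = Lux.setup(Random.default_rng(), model)  # explicit parameters and state
x = randn(Float32, 2, 4)  # 4 observations with 2 features each
y, st = model(x, ps, st)  # parameters and state are passed explicitly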

7.3.3. Reactant.jl: The Bridge to TPUs and XLA#

A recent addition to this ecosystem is Reactant.jl.

While Enzyme optimizes code for the CPU (and CUDA via LLVM), Reactant is designed to compile Julia code for high-performance accelerators like TPUs, as well as multi-node GPU clusters.

In particular, Reactant is a compiler frontend that lowers Julia code (and Enzyme gradients) into MLIR (Multi-Level Intermediate Representation) and StableHLO. This targets the exact same compiler stack as JAX: JAX traces Python code to generate HLO/StableHLO, which is then compiled by XLA (Accelerated Linear Algebra), and Reactant does the same for Julia, allowing it to emit the same IR that JAX generates.

This allows Julia users to leverage the XLA compiler’s massive optimizations for linear algebra and convolution, run natively on Google TPUs, and execute on large-scale distributed clusters, all while writing standard Julia code. However, it is in the early stages of development and should not be considered a general replacement for standard Python ML frameworks and use-cases.

See the Lux.jl documentation for an example combining Lux, Reactant, and Enzyme for a single Neural Network.

7.3.4. Writing Enzyme-Differentiable Code#

Enzyme-friendly code looks like ordinary Julia with a few discipline rules: mutate into preallocated buffers, avoid hidden allocations, and keep inputs type-stable.

In general, these practices also lead to very high-performance Julia code, so making code Enzyme-differentiable and high-performance often go hand-in-hand. However, these tend to be more advanced patterns than an introductory Julia user might be used to.

A common pattern is f!(out, inputs..., cache) where cache holds temporary work arrays passed last.

In many cases the biggest change is to use in-place linear algebra operations, many of which have corresponding highly optimized BLAS/LAPACK routines. For example:

  • mul!(y, A, x) computes y = A * x in-place, without allocating

  • Use the 5-arg form mul!(Y, A, B, α, β) to compute Y = α * A * B + β * Y in-place (see mul! docs)

  • Use ldiv!(y, A, x) for the in-place solve corresponding to y = A \ x, typically with A a precomputed factorization such as lu(A) (see ldiv! docs)

  • Use column-wise access or @views when slicing to avoid copies

  • Pass buffers explicitly to keep the function allocation-free; even a simple cache = (; tmp = similar(x)) helps for temporary workspaces.

  • When in doubt, use a loop. Since it operates on compiled Julia code, Enzyme can differentiate through loops efficiently.

# in-place linear step with an explicit workspace
function step!(x_next, x, A, cache)
    @views mul!(cache.tmp, A, x) # cache.tmp = A * x
    @inbounds for i in eachindex(cache.tmp)
        # loops are encouraged, but be careful to avoid temp allocations
        cache.tmp[i] += 0.1 * i
    end
    copy!(x_next, cache.tmp) # x_next = cache.tmp
    return nothing
end

A = [0.9 0.1 0.0
     0.0 0.8 0.1
     0.0 0.0 0.7]
x = randn(3)
x_next = similar(x)
# Pass preallocated cache/buffers as individual arguments,
# as named tuples, or as concretely typed structs.
cache = (; tmp = zeros(3))

step!(x_next, x, A, cache)
x_next
3-element Vector{Float64}:
 1.0961466536012916
 0.39386860518948097
 0.307118720351144

Even without Enzyme, checking type-stability and allocations helps catch AD pain points early.

using BenchmarkTools
# warm-up
step!(x_next, x, A, cache)

# Should have no type instability warnings
@show @inferred step!(x_next, x, A, cache)

# Should allocate zero bytes
function count_allocs()
    return step!(x_next, x, A, cache)
end
@btime count_allocs()
#= In[7]:6 =# @inferred(step!(x_next, x, A, cache)) = nothing
  
68.040 ns (0 allocations: 0 bytes)

This should show that 0 bytes are allocated.

Note

Outside of a notebook environment, such as in the REPL, you can use @allocated step!(x_next, x, A, cache) directly. However, in notebook environments it can give misleading results since it may include notebook cell overhead.

The same patterns apply to more complex routines: keep buffers explicit, avoid temporary slices, and rely on in-place linear algebra to minimize allocations that can break reverse-mode AD.
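
For completeness, here is a small sketch of the other in-place routines mentioned above, reusing the A defined earlier (the variable names are illustrative; note that ldiv! expects a factorization such as lu(A) rather than a raw matrix).

# 5-arg mul!: Y = α * A * B + β * Y, without allocating
B = Matrix(1.0I, 3, 3)
Y = zeros(3, 3)
mul!(Y, A, B, 2.0, 0.0)  # Y = 2 * A * B

# in-place solve with a precomputed factorization (the lu call allocates once)
F = lu(A)
b = [1.0, 2.0, 3.0]
y_sol = similar(b)
ldiv!(y_sol, F, b)  # y_sol = A \ b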

7.3.5. Core Concepts & Terminology#

To use Enzyme effectively, you must manually annotate your function arguments to tell the compiler how to handle them for a particular derivative.

7.3.6. The Primal vs. The Shadow#

  • Primal: The values or variables used in the main computation (e.g., your input vector x or buffer b).

  • Shadow: The separate memory location where derivatives are accumulated (e.g., dx or db).

7.3.7. Fundamental Rules of Activity#

When calling autodiff, every argument needs an “Activity” wrapper to tell Enzyme how to handle it.

  1. Are you differentiating with respect to this argument?

    • No: Use Const(x). This tells Enzyme the value is constant and its derivative is zero.

    • Yes (Scalar): Use Active(x). Enzyme will return the derivative directly.

    • Yes (Array/Struct): Use Duplicated(x, dx). You must provide the shadow memory dx.

  2. Is the argument mutated (written to) inside the function?

    • Yes, and I need to know the value: Use Duplicated(x, dx). Enzyme needs the shadow dx to store intermediate adjoints during the reverse pass.

    • Yes, and I do not need to access the value: You could still use Duplicated(x, dx), or DuplicatedNoNeed(x, dx), which may let Enzyme skip some calculations needed to produce the primal value x if you are not using it directly.

This last point is important. Since your functions mutate preallocated buffers rather than allocating, you need to ensure there is a “shadow” for every argument that is written to, even if it is just a temporary buffer.
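
As a small sketch of the Active (scalar) case, which does not appear in the array-based examples below: for scalar arguments the derivatives are returned rather than accumulated into shadows (g_scalar is just an illustrative name).

g_scalar(x, y) = x * y + sin(x)
# the result contains the tuple of derivatives (∂g/∂x, ∂g/∂y) = (y + cos(x), x)
autodiff(Reverse, g_scalar, Active(2.0), Active(3.0))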

7.3.8. Argument Wrappers (Quick Reference)#

| Wrapper | Usage | Primal | Shadow | Docs |
|---|---|---|---|---|
| Const(x) | Constants / Config | Read-only | None | Ref |
| Active(x) | Scalars (Float64) | Read-only | Returned | Ref |
| Duplicated(x, dx) | Arrays / Mutated Buffers | Read/Write | Explicit | Ref |
| DuplicatedNoNeed(x, dx) | Mutated Buffers, ignoring x | Read/Write | Explicit | Ref |
| BatchDuplicated | Vectorized Derivatives | Read/Write | Tuple of Shadows | Ref |


7.3.9. Examples#

7.3.10. Convenience Functions#

Enzyme provides convenience functions to create zero-initialized shadows for you and to call autodiff with common patterns.

Following the Enzyme documentation, we define a scalar-valued function of a vector input.

When using reverse-mode AD, the forward pass must calculate the “primal” (i.e., function) value anyway, and we can request that it be returned along with the gradient.

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Use Reverse-mode AD
x = [1.0, 2.0]
@show gradient(Reverse, rosenbrock, x)
# Return a tuple with the "primal" (i.e., function value) and grad
@show gradient(ReverseWithPrimal, rosenbrock, x);
gradient(Reverse, rosenbrock, x) = ([-400.0, 200.0],)
gradient(ReverseWithPrimal, rosenbrock, x) = 
(derivs = ([-400.0, 200.0],), val = 100.0)

With a preallocated gradient vector, we can use an in-place version

dx = Enzyme.make_zero(x) # or just zeros(size(x)), but this is more bulletproof
gradient!(Reverse, dx, rosenbrock, x)
@show dx;
dx = [-400.0, 200.0]

Similarly, we can execute forward-mode AD to get the gradient. Unlike Reverse mode, this calls autodiff for each input dimension.

@show gradient(Forward, rosenbrock, [1.0, 2.0])
@show gradient(ForwardWithPrimal, rosenbrock, [1.0, 2.0]);
gradient(Forward, rosenbrock, [1.0, 2.0]) = ([-400.0, 200.0],)
gradient(ForwardWithPrimal, rosenbrock, [1.0, 2.0]) = 
(derivs = ([-400.0, 200.0],), val = 100.0)

In the case of vector-valued functions, we can fill a Jacobian matrix. If calling with Forward, each column of the Jacobian is filled in a separate pass. If calling with Reverse, each row is filled in a separate pass.

f(x) = [(1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2;
        x[1] * x[2]]
@show jacobian(Forward, f, x)
@show jacobian(Reverse, f, x)
@show jacobian(ReverseWithPrimal, f, x);
jacobian(Forward, f, x) = ([-400.0 200.0; 2.0 1.0],)
jacobian(Reverse, f, x) = 
([-400.0 200.0; 2.0 1.0],)
jacobian(ReverseWithPrimal, f, x) = (derivs = ([-400.0 200.0; 2.0 1.0],), val = [100.0, 2.0])

See here for more examples such as Hessian-vector products.

7.3.11. Simple Forward Mode#

Forward mode propagates derivatives alongside the primal calculation. It is ideal for low-dimensional inputs.

using Enzyme, LinearAlgebra

f(x, y) = sum(x .* y)

x = [1.0, 2.0, 3.0]
y = [0.5, 1.0, 1.5]

# We want derivatives with respect to x (holding y constant).
# 1. Create a shadow for x. In forward mode the shadow is the *seed*
#    (a tangent direction), so seeding with ones gives the directional
#    derivative along [1, 1, 1], not the full gradient.
dx = ones(size(x))

# 2. Call autodiff with respect to x
# Note: y is Const, x is Duplicated (Array)
# The directional derivative is part of autodiff's return value;
# dx is not overwritten since f does not mutate x.
autodiff(Forward, f, Duplicated(x, dx), Const(y))

# dx still holds the (unchanged) seed
print("seed dx = ", dx)
seed dx = 
[1.0, 1.0, 1.0]

Note that in forward mode the derivative is carried in the return value of autodiff (and in the shadows of any arguments the function mutates), not in the input shadows. For full gradients of \(R^N \to R\) functions like this one, reverse mode is usually the better fit.
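
To recover the full gradient with forward mode, you would seed one coordinate at a time; equivalently, you can reuse the gradient convenience function from above with a closure over the constant y. A minimal sketch (the closure is treated as constant data):

# one forward pass per input dimension under the hood
gradient(Forward, x -> f(x, y), x)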

7.3.12. Reverse Mode#

Reverse mode computes the gradient of the output with respect to all inputs in a single pass.

# Define function
calc(x, y) = sum(x .^ 2 .+ y)

x = [1.0, 2.0, 3.0]
y = [0.5, 0.5, 0.5]

# 1. Initialize Shadows with Zeros
dx = Enzyme.make_zero(x) # ∂f/∂x will accumulate here
dy = Enzyme.make_zero(y) # ∂f/∂y will accumulate here

# 2. Call autodiff
# In this case we find the gradient for all arguments
autodiff(Reverse, calc, Duplicated(x, dx), Duplicated(y, dy))

@show dx  # Should be 2*x -> [2.0, 4.0, 6.0]
@show dy;  # Should be 1.0 -> [1.0, 1.0, 1.0]
dx = [2.0, 4.0, 6.0]
dy = [1.0, 1.0, 1.0]

Since reverse mode calculates the primal in the forward pass anyway, you can request it to be returned, as with the convenience functions, and then examine the shadows. Note that the derivatives of Duplicated arguments are accumulated into their shadows rather than returned, which is why the derivative entries below are (nothing, nothing); only Active arguments have their derivatives returned directly.

dx = Enzyme.make_zero(x)
dy = Enzyme.make_zero(y)
autodiff(ReverseWithPrimal, calc, Duplicated(x, dx), Duplicated(y, dy))
((nothing, nothing), 15.5)

7.3.13. Handling Mutation and Buffers#

This is the most common pitfall. If a function modifies an argument (like out or a workspace buffer), both the Primal and Shadow must be valid.

# 1. A mutating helper: overwrites y with A * x in-place.
function matvec!(y, A, x)
    mul!(y, A, x)
    return nothing
end

# 2. The loss calls the mutating helper, but returns a scalar (the loss).
function compute_loss!(y, A, x)
    matvec!(y, A, x)
    return sum(y) # We want the gradient of THIS scalar
end

# Setup Data
A = [2.0 0.0; 0.0 3.0]
x = [1.0, 1.0]
y = zeros(2)

# Setup Shadows
# Enzyme will calculate ∂(sum)/∂y and populate dy for us.
dx = Enzyme.make_zero(x)
dA = Enzyme.make_zero(A)
dy = Enzyme.make_zero(y)

# 3. Calculate Gradient
#    We differentiate 'compute_loss!'.
#    Since it returns a scalar (Active), Enzyme automatically seeds it with 1.0.
autodiff(Reverse, compute_loss!,
         Duplicated(y, dy),   # Intermediate buffer (Enzyme handles the backprop!)
         Duplicated(A, dA),   # Parameter we want gradient for
         Duplicated(x, dx))

@show dx   # ∂(sum(y))/∂x
@show dA;  # ∂(sum(y))/∂A
dx = [2.0, 3.0]
dA = [1.0 1.0; 1.0 1.0]
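
As a quick sanity check (with the Test package loaded at the top): since the loss is \(\sum_i (A x)_i\), the analytic gradients are \(A^{\top} \mathbf{1}\) with respect to x and \(\mathbf{1} x^{\top}\) with respect to A.

@test dx ≈ A' * ones(2)
@test dA ≈ ones(2) * x'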

7.4. Exercises#

7.4.1. Exercise 1#

Implementing forward-mode auto-differentiation is straightforward in Julia thanks to generic programming. In this exercise, you will fill in a few of the operations required for a simple AD implementation.

First, we need to provide a type to hold the dual.

struct DualNumber{T} <: Real
    val::T
    ϵ::T
end

Here we have made it a subtype of Real so that it can pass through functions expecting Reals.

We can add a variety of chain rule definitions by importing the appropriate functions and adding DualNumber versions. For example

import Base: +, *, -, ^, exp
+(x::DualNumber, y::DualNumber) = DualNumber(x.val + y.val, x.ϵ + y.ϵ)  # dual addition
+(x::DualNumber, a::Number) = DualNumber(x.val + a, x.ϵ)  # i.e. scalar addition, not dual
+(a::Number, x::DualNumber) = DualNumber(x.val + a, x.ϵ)  # i.e. scalar addition, not dual
+ (generic function with 237 methods)

With that, we can seed a dual number and find simple derivatives,

f(x, y) = 3.0 + x + y

x = DualNumber(2.0, 1.0)  # x -> 2.0 + 1.0ϵ
y = DualNumber(3.0, 0.0)  # i.e. y = 3.0, no derivative

# seeded duals calculate both the function value and the d/dx derivative!
f(x, y)
DualNumber{Float64}(8.0, 1.0)

For this assignment:

  1. Add AD rules for the other operations: *, -, ^, exp.

  2. Come up with some examples of univariate and multivariate functions combining those operations and use your AD implementation to find the derivatives.