19. Numerical Linear Algebra and Factorizations#

You cannot learn too much linear algebra. – Benedict Gross

19.1. Overview#

In this lecture, we examine the structure of matrices and linear operators (e.g., dense, sparse, symmetric, tridiagonal, banded) and discuss how the structure can be exploited to radically increase the performance of solving large problems.

We build on applications discussed in previous lectures: linear algebra, orthogonal projections, and Markov chains.

The methods in this section are called direct methods, and they are qualitatively similar to performing Gaussian elimination to factor matrices and solve systems of equations. In iterative methods and sparsity we examine a different approach, using iterative algorithms, where we can think of more general linear operators.

The list of specialized packages for these tasks is enormous and growing, but some of the important organizations to look at are JuliaMatrices, JuliaSparse, and JuliaMath.

NOTE: As this section uses advanced Julia techniques, you may wish to review multiple-dispatch and generic programming in introduction to types, and consider further study on generic programming.

The theme of this lecture, and numerical linear algebra in general, comes down to three principles:

  1. Identify matrix structure (e.g., symmetric, sparse, diagonal) in order to use specialized algorithms.

  2. Do not lose structure by applying the wrong numerical linear algebra operations at the wrong times (e.g., a sparse matrix becoming dense).

  3. Understand the computational complexity of each algorithm, given the structure of the inputs.

using LinearAlgebra, Statistics, BenchmarkTools, SparseArrays, Random
Random.seed!(42);  # seed random numbers for reproducibility

19.1.1. Computational Complexity#

Ask yourself whether the following operations are computationally expensive as the matrix size increases:

  • Multiplying two matrices?

    • Answer: It depends. Multiplying two diagonal matrices is trivial.

  • Solving a linear system of equations?

    • Answer: It depends. If the matrix is the identity, the solution is the vector itself.

  • Finding the eigenvalues of a matrix?

    • Answer: It depends. The eigenvalues of a triangular matrix are the diagonal elements.
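
To confirm the last point, here is a minimal check (the example matrix is arbitrary) that Julia's specialized method reads the eigenvalues of a triangular matrix straight off the diagonal,

U = UpperTriangular([1.0 2.0 3.0; 0.0 4.0 5.0; 0.0 0.0 6.0])
sort(eigvals(U)) ≈ sort(diag(U))  # the eigenvalues are just the diagonal entries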

As the goal of this section is to move toward numerical methods with large systems, we need to understand how well algorithms scale with the size of matrices, vectors, etc. This is known as computational complexity. As we saw in the answer to the questions above, the algorithm - and hence the computational complexity - changes based on matrix structure.

While this notion of complexity can work at various levels, such as the number of significant digits for basic mathematical operations, the amount of memory and storage required, or the amount of time, we will typically focus on the time complexity.

For time complexity, the size \(N\) is usually the dimensionality of the problem, although occasionally the key will be the number of non-zeros in the matrix or the width of bands. For our applications, time complexity is best thought of as the number of floating point operations (e.g., addition, multiplication) required.

19.1.1.1. Notation#

Complexity of algorithms is typically written in Big O notation, which provides bounds on the scaling of the computational complexity with respect to the size of the inputs.

Formally, if the number of operations required for a problem size \(N\) is \(f(N)\), we can write this as \(f(N) = O(g(N))\) for some \(g(N)\) - typically a polynomial.

The interpretation is that there exist some constants \(M\) and \(N_0\) such that

\[ f(N) \leq M g(N), \text{ for } N > N_0 \]

For example, the complexity of finding an LU Decomposition of a dense matrix is \(O(N^3)\), which should be read as there being a constant where eventually the number of floating point operations required to decompose a matrix of size \(N\times N\) grows cubically.

Keep in mind that these are asymptotic results intended for understanding the scaling of the problem, and the constant can matter for a given fixed size.

For example, the number of operations required for an LU decomposition of a dense \(N \times N\) matrix is \(f(N) = \frac{2}{3} N^3\), ignoring the \(N^2\) and lower terms. Other methods of solving a linear system may have different constants of proportionality, even if they have the same scaling, \(O(N^3)\).
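
To get a rough feel for this cubic scaling (the sizes below are arbitrary and the measured ratio will only be approximate), we can time the factorization at two sizes and compare,

N = 500
A1 = rand(N, N)
A2 = rand(2N, 2N)
t1 = @belapsed lu($A1)  # seconds for an N x N factorization
t2 = @belapsed lu($A2)  # seconds for a 2N x 2N factorization
t2 / t1                 # should be very roughly 2^3 = 8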

19.1.2. Rules of Computational Complexity#

You will sometimes need to think through how combining algorithms changes complexity. For example, if you use

  1. an \(O(N^3)\) operation \(P\) times, then it simply changes the constant. The complexity remains \(O(N^3)\).

  2. one \(O(N^3)\) operation and one \(O(N^2)\) operation, then you take the max. The complexity remains \(O(N^3)\).

  3. a repetition of an \(O(N)\) operation that itself uses an \(O(N)\) operation, you take the product. The complexity becomes \(O(N^2)\).

With this, we have an important word of caution: Dense-matrix multiplication is an expensive operation for unstructured matrices. The naive version is \(O(N^3)\) while the fastest-known algorithms (e.g., Coppersmith-Winograd) are roughly \(O(N^{2.37})\). In practice, it is reasonable to crudely approximate with \(O(N^3)\) when doing an analysis, in part since the higher constant factors of the better scaling algorithms dominate the better complexity until matrices become very large.

Of course, modern libraries use highly tuned and numerically stable algorithms to multiply matrices and exploit the computer architecture, memory cache, etc., but this simply lowers the constant of proportionality and they remain roughly approximated by \(O(N^3)\).

A consequence is that, since many algorithms require matrix-matrix multiplication, it is often not possible to go below that order without further matrix structure.

That is, changing the constant of proportionality for a given size can help, but in order to achieve better scaling you need to identify matrix structure (e.g., tridiagonal, sparse) and ensure that your operations do not lose it.
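
For example (sizes chosen only for illustration), multiplying two unstructured dense matrices costs roughly \(O(N^3)\), while multiplying two Diagonal matrices of the same size only touches the \(N\) diagonal entries,

N = 1000
A_dense, B_dense = rand(N, N), rand(N, N)
A_diag, B_diag = Diagonal(rand(N)), Diagonal(rand(N))
@btime $A_dense * $B_dense  # unstructured: roughly O(N^3)
@btime $A_diag * $B_diag;   # exploits structure: O(N)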

19.1.3. Losing Structure#

As a first example of a structured matrix, consider a sparse array.

A = sprand(10, 10, 0.45)  # random sparse 10x10, 45 percent filled with non-zeros

@show nnz(A)  # counts the number of non-zeros
invA = sparse(inv(Array(A)))  # Julia won't invert sparse, so convert to dense with Array.
@show nnz(invA);
nnz(A) = 46
nnz(invA) = 100

This increase from less than 50 to 100 percent dense demonstrates that significant sparsity can be lost when computing an inverse.

The results can be even more extreme. Consider a tridiagonal matrix of size \(N \times N\) that might come out of a Markov chain or a discretization of a diffusion process,

N = 5
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
5×5 Tridiagonal{Float64, Vector{Float64}}:
 0.8  0.2   ⋅    ⋅    ⋅ 
 0.1  0.8  0.1   ⋅    ⋅ 
  ⋅   0.1  0.8  0.1   ⋅ 
  ⋅    ⋅   0.1  0.8  0.1
  ⋅    ⋅    ⋅   0.2  0.8

The number of non-zeros here is approximately \(3 N\), i.e., linear in \(N\), which scales well for huge matrices into the millions or billions.

But consider the inverse

inv(A)
5×5 Matrix{Float64}:
  1.29099      -0.327957     0.0416667  -0.00537634   0.000672043
 -0.163978      1.31183     -0.166667    0.0215054   -0.00268817
  0.0208333    -0.166667     1.29167    -0.166667     0.0208333
 -0.00268817    0.0215054   -0.166667    1.31183     -0.163978
  0.000672043  -0.00537634   0.0416667  -0.327957     1.29099

Now, the matrix is fully dense and has \(N^2\) non-zeros.

This also applies to the \(A' A\) operation when forming the normal equations of linear least squares.

A = sprand(20, 21, 0.3)
@show nnz(A) / 20^2
@show nnz(A' * A) / 21^2;
nnz(A) / 20 ^ 2 = 0.34
nnz(A' * A) / 21 ^ 2 = 0.9229024943310657

We see that a 30 percent dense matrix becomes almost fully dense after the product is taken.

Sparsity and structure are not just about storage, although the storage savings can matter on their own (e.g., a 1 million by 1 million tridiagonal matrix needs to store only about 3 million numbers, roughly 24MB of memory at 8 bytes per Float64, whereas a dense one requires 1 trillion numbers, roughly 8TB of memory).
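
As a rough check of those storage numbers (using Base.summarysize and assuming 8-byte Float64 entries),

N = 1_000_000
A_tri = Tridiagonal(fill(0.1, N - 1), fill(0.8, N), fill(0.1, N - 1))
@show Base.summarysize(A_tri) / 1e6  # megabytes of storage for the tridiagonal
@show 8 * N^2 / 1e12;                # terabytes a dense Float64 matrix of this size would need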

But, as we will see, the main purpose of considering sparsity and matrix structure is that it enables specialized algorithms, which typically have a lower computational order than unstructured dense, or even unstructured sparse, operations.

First, create a convenient function for benchmarking linear solvers

using BenchmarkTools
function benchmark_solve(A, b)
    println("A\\b for typeof(A) = $(string(typeof(A)))")
    @btime $A \ $b
end
benchmark_solve (generic function with 1 method)

Then, take away structure to see the impact on performance,

N = 1000
b = rand(N)
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
A_sparse = sparse(A)  # sparse but losing tridiagonal structure
A_dense = Array(A)    # dropping the sparsity structure, dense 1000x1000

# benchmark solution to system A x = b
benchmark_solve(A, b)
benchmark_solve(A_sparse, b)
benchmark_solve(A_dense, b);
A\b for typeof(A) = Tridiagonal{Float64, Vector{Float64}}
  24.735 μs (20 allocations: 47.34 KiB)
A\b for typeof(A) = SparseMatrixCSC{Float64, Int64}
  541.800 μs (95 allocations: 1.03 MiB)
A\b for typeof(A) = Matrix{Float64}
  11.827 ms (9 allocations: 7.64 MiB)

This example shows what is at stake: using a structured tridiagonal matrix may be 10-20 times faster than using a sparse matrix, which is 100 times faster than using a dense matrix.

In fact, the difference becomes more extreme as the matrices grow. Solving a tridiagonal system is \(O(N)\), while solving with a dense matrix without any structure is \(O(N^3)\). The complexity of a sparse solution is more complicated and depends in part on the number of non-zeros.

19.1.4. Matrix Multiplication#

While we write matrix multiplications freely in our algebra, in practice the computational operation scales very poorly without any matrix structure.

Matrix multiplication is so important to modern computers that the constant of scaling is small using proper packages, but the order is still roughly \(O(N^3)\) in practice (although smaller in theory, as discussed above).

Sparse matrix multiplication, on the other hand, is \(O(N M_A M_B)\) where \(M_A\) is the number of nonzeros per row of \(A\) and \(M_B\) is the number of non-zeros per column of \(B\).

By the rules of computational order, that means any algorithm requiring a matrix multiplication of dense matrices requires at least \(O(N^3)\) operations.
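
As a rough illustration (the density and size here are arbitrary), compare a sparse-sparse product, which scales with the number of non-zeros, to the dense product of the same data,

N = 1000
A_s = sprand(N, N, 0.005)  # roughly 5 non-zeros per row
B_s = sprand(N, N, 0.005)
A_d, B_d = Array(A_s), Array(B_s)
@btime $A_s * $B_s   # sparse-sparse product: scales with the non-zeros
@btime $A_d * $B_d;  # dense product of the same data: roughly O(N^3)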

The other important question is the structure of the resulting matrix. For example, multiplying an upper triangular matrix by a lower triangular matrix,

N = 5
U = UpperTriangular(rand(N, N))
5×5 UpperTriangular{Float64, Matrix{Float64}}:
 0.894532  0.620631  0.238642  0.493049  0.418376
  ⋅        0.691248  0.640302  0.843674  0.0508756
  ⋅         ⋅        0.187509  0.148339  0.732873
  ⋅         ⋅         ⋅        0.190347  0.502406
  ⋅         ⋅         ⋅         ⋅        0.664839
L = U'
5×5 LowerTriangular{Float64, Adjoint{Float64, Matrix{Float64}}}:
 0.894532   ⋅          ⋅         ⋅         ⋅ 
 0.620631  0.691248    ⋅         ⋅         ⋅ 
 0.238642  0.640302   0.187509   ⋅         ⋅ 
 0.493049  0.843674   0.148339  0.190347   ⋅ 
 0.418376  0.0508756  0.732873  0.502406  0.664839

But the product is fully dense (e.g., think of a Cholesky multiplied by itself to produce a covariance matrix)

L * U
5×5 Matrix{Float64}:
 0.800188  0.555175  0.213473  0.441048  0.374251
 0.555175  0.863008  0.590717  0.88919   0.294825
 0.213473  0.590717  0.502096  0.685683  0.269839
 0.441048  0.88919   0.685683  1.01312   0.453547
 0.374251  0.294825  0.269839  0.453547  1.40915

On the other hand, a tridiagonal matrix times a diagonal matrix is still tridiagonal - and can use specialized \(O(N)\) algorithms.

A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
D = Diagonal(rand(N))
D * A
5×5 Tridiagonal{Float64, Vector{Float64}}:
 0.080983   0.0202457   ⋅          ⋅          ⋅ 
 0.0244522  0.195618   0.0244522   ⋅          ⋅ 
  ⋅         0.0760396  0.608317   0.0760396   ⋅ 
  ⋅          ⋅         0.0858704  0.686963   0.0858704
  ⋅          ⋅          ⋅         0.0113042  0.0452167

19.2. Factorizations#

When you tell a numerical analyst you are solving a linear system using direct methods, their first question is “which factorization?”.

Just as you can factor a number (e.g., \(6 = 3 \times 2\)) you can factor a matrix as the product of other, more convenient matrices (e.g., \(A = L U\) or \(A = Q R\), where \(L, U, Q,\) and \(R\) have properties such as being triangular, orthogonal, etc.).

19.2.1. Inverting Matrices#

On paper, since the Invertible Matrix Theorem tells us that a unique solution is equivalent to \(A\) being invertible, we often write the solution to \(A x = b\) as

\[ x = A^{-1} b \]

What if we do not (directly) use a factorization?

Take a simple linear system of a dense matrix,

N = 4
A = rand(N, N)
b = rand(N)
4-element Vector{Float64}:
 0.1692573631849904
 0.8286640042337545
 0.05607569601441986
 0.5958786724371172

On paper, we try to solve the system \(A x = b\) by inverting the matrix,

x = inv(A) * b
4-element Vector{Float64}:
 -0.4172203630414311
  0.01047937208058601
  9.118506847017617
 -4.99195905422948

As we will see throughout, inverting matrices should be used for theory, not for code. The classic advice that you should never invert a matrix may be slightly exaggerated, but is generally good advice.

Solving a system by inverting a matrix is always a little slower, is potentially less accurate, and will sometimes lose crucial sparsity compared to using factorizations. Moreover, the methods used by libraries to invert matrices are frequently the same factorizations used for solving a system of equations.

Even if you need to solve a system with the same matrix multiple times, you are better off factoring the matrix and using the solver rather than calculating an inverse.

N = 100
A = rand(N, N)
M = 30
B = rand(N, M)
function solve_inverting(A, B)
    A_inv = inv(A)
    X = similar(B)
    for i in 1:size(B, 2)
        X[:, i] = A_inv * B[:, i]
    end
    return X
end

function solve_factoring(A, B)
    X = similar(B)
    A = factorize(A)
    for i in 1:size(B, 2)
        X[:, i] = A \ B[:, i]
    end
    return X
end

@btime solve_inverting($A, $B)
@btime solve_factoring($A, $B)

# even better, use the built-in feature for multiple RHS
@btime $A \ $B;
  173.853 μs (132 allocations: 207.09 KiB)
  157.414 μs (159 allocations: 157.50 KiB)
  86.111 μs (8 allocations: 102.62 KiB)

19.2.2. Triangular Matrices and Back/Forward Substitution#

Some matrices are already in a convenient form and require no further factoring.

For example, consider solving a system with an UpperTriangular matrix,

b = [1.0, 2.0, 3.0]
U = UpperTriangular([1.0 2.0 3.0; 0.0 5.0 6.0; 0.0 0.0 9.0])
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 1.0  2.0  3.0
  ⋅   5.0  6.0
  ⋅    ⋅   9.0

This system is especially easy to solve using back substitution. In particular, \(x_3 = b_3 / U_{33}, x_2 = (b_2 - x_3 U_{23})/U_{22}\), etc.

U \ b
3-element Vector{Float64}:
 0.0
 0.0
 0.3333333333333333

A LowerTriangular matrix has similar properties and can be solved with forward substitution.

The computational order of back substitution and forward substitution is \(O(N^2)\) for dense matrices. Those fast algorithms are a key reason that factorizations target triangular structures.
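
To make back substitution concrete, here is a minimal, unoptimized sketch; the backsubstitution function below is purely for illustration, while U \ b dispatches to the optimized library routine,

function backsubstitution(U, b)
    N = length(b)
    x = zeros(N)
    for i in N:-1:1  # solve from the last row upwards
        x[i] = (b[i] - dot(U[i, (i + 1):N], x[(i + 1):N])) / U[i, i]
    end
    return x
end
backsubstitution(U, b) ≈ U \ b  # matches the built-in triangular solver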

19.2.3. LU Decomposition#

The \(LU\) decomposition finds a lower triangular matrix \(L\) and an upper triangular matrix \(U\) such that \(L U = A\).

For a general dense matrix without any other structure (i.e., not known to be symmetric, tridiagonal, etc.) this is the standard approach to solve a system and exploit the speed of back and forward substitution using the factorization.

The computational order of LU decomposition itself for a dense matrix is \(O(N^3)\) - the same as Gaussian elimination - but it tends to have a better constant term than others (e.g., half the number of operations of the QR decomposition). For structured or sparse matrices, that order drops.

We can see which algorithm Julia will use for the \ operator by looking at the factorize function for a given matrix.

N = 4
A = rand(N, N)
b = rand(N)

Af = factorize(A)  # chooses the right factorization, LU here
LU{Float64, Matrix{Float64}, Vector{Int64}}
L factor:
4×4 Matrix{Float64}:
 1.0       0.0       0.0       0.0
 0.415843  1.0       0.0       0.0
 0.837544  0.412374  1.0       0.0
 0.230398  0.451915  0.320159  1.0
U factor:
4×4 Matrix{Float64}:
 0.95462  0.30487   0.196935    0.658974
 0.0      0.334293  0.0898636   0.476012
 0.0      0.0       0.763321   -0.213677
 0.0      0.0       0.0         0.183613

In this case, it provides an \(L\) and \(U\) factorization (with pivoting).

With the factorization complete, we can solve for different right-hand-side vectors b.

Af \ b
4-element Vector{Float64}:
  0.5483309544545236
  2.4706281021373653
 -0.5369034286036096
 -0.8253373998820174
b2 = rand(N)
Af \ b2
4-element Vector{Float64}:
  0.3734712417577136
 -2.072936783435963
 -0.08980746803909484
  1.8007954892470772

In practice, the decomposition also includes a \(P\) which is a permutation matrix such that \(P A = L U\).

Af.P * A ≈ Af.L * Af.U
true

We can also directly calculate an LU decomposition with lu but without the pivoting,

L, U = lu(A, Val(false))  # Val(false) gives a solution without permutation matrices (newer Julia versions use lu(A, NoPivot()))
LU{Float64, Matrix{Float64}, Vector{Int64}}
L factor:
4×4 Matrix{Float64}:
 1.0        0.0     0.0     0.0
 0.554051   1.0     0.0     0.0
 2.40475   23.5446  1.0     0.0
 2.01409   15.6821  0.5334  1.0
U factor:
4×4 Matrix{Float64}:
 0.396972   0.461071    0.171758   0.750041
 0.0       -0.0341433   0.235206   0.0665844
 0.0        0.0        -5.75392   -2.71239
 0.0        0.0         0.0       -0.573506

And we can verify the decomposition

A ≈ L * U
true

To see roughly how the solver works, note that we can write the problem \(A x = b\) as \(L U x = b\). Let \(U x = y\), which breaks the problem into two sub-problems.

\[\begin{split} \begin{aligned} L y &= b\\ U x &= y \end{aligned} \end{split}\]

As we saw above, this is the solution to two triangular systems, which can be efficiently done with back or forward substitution in \(O(N^2)\) operations.

To demonstrate this, we first solve \(L y = b\) for \(y\) with forward substitution, and then \(U x = y\) for \(x\) with back substitution,

y = L \ b
4-element Vector{Float64}:
  0.6455522128314688
 -0.2655926740726696
  5.327941090151101
  0.47333580651201723
x = U \ y
x ≈ A \ b  # Check identical
true

The LU decomposition also has specialized algorithms for structured matrices, such as a Tridiagonal

N = 1000
b = rand(N)
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
factorize(A) |> typeof
LU{Float64, Tridiagonal{Float64, Vector{Float64}}, Vector{Int64}}

This factorization is the key to the performance of the A \ b in this case. For Tridiagonal matrices, the LU decomposition can be computed in \(O(N)\) operations.

Finally, just as a dense matrix without any structure uses an LU decomposition to solve a system, so do the sparse solvers.

A_sparse = sparse(A)
factorize(A_sparse) |> typeof  # dropping the tridiagonal structure to just become sparse
SparseArrays.UMFPACK.UmfpackLU{Float64, Int64}
benchmark_solve(A, b)
benchmark_solve(A_sparse, b);
A\b for typeof(A) = Tridiagonal{Float64, Vector{Float64}}
  24.807 μs (20 allocations: 47.34 KiB)
A\b for typeof(A) = SparseMatrixCSC{Float64, Int64}
  540.084 μs (95 allocations: 1.03 MiB)

With sparsity, the computational order is related to the number of non-zeros rather than the size of the matrix itself.

19.2.4. Cholesky Decomposition#

For real, symmetric, positive semi-definite matrices, a Cholesky decomposition is a specialized example of an LU decomposition where \(L = U'\).

The Cholesky is directly useful on its own (e.g., Classical Control with Linear Algebra), but it is also an efficient factorization to use in solving symmetric positive semi-definite systems.

As always, symmetry allows specialized algorithms.

N = 500
B = rand(N, N)
A_dense = B' * B  # an easy way to generate a symmetric positive semi-definite matrix
A = Symmetric(A_dense)  # flags the matrix as symmetric

factorize(A) |> typeof
BunchKaufman{Float64, Matrix{Float64}, Vector{Int64}}

Here, the \(A\) decomposition is Bunch-Kaufman rather than Cholesky, because Julia doesn’t know that the matrix is positive semi-definite. We can manually factorize with a Cholesky,

cholesky(A) |> typeof
Cholesky{Float64, Matrix{Float64}}
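
A quick check confirms that the factor satisfies \(L L' = A\) up to floating-point error (the .U field is simply the adjoint of .L),

A_chol = cholesky(A)
norm(A_chol.L * A_chol.U - A)  # should be near zero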

Benchmarking,

b = rand(N)
cholesky(A) \ b  # use the factorization to solve

benchmark_solve(A, b)
benchmark_solve(A_dense, b)
@btime cholesky($A, check = false) \ $b;
A\b for typeof(A) = Symmetric{Float64, Matrix{Float64}}
  2.150 ms (14 allocations: 2.16 MiB)
A\b for typeof(A) = Matrix{Float64}
  2.609 ms (9 allocations: 1.92 MiB)
  1.423 ms (6 allocations: 1.91 MiB)

19.2.5. QR Decomposition#

Previously, we learned about applications of the QR decomposition to solving linear least-squares problems.

While in principle the solution to the least-squares problem

\[ \min_x \| Ax -b \|^2 \]

is \(x = (A'A)^{-1}A'b\), in practice note that \(A'A\) becomes dense and calculating the inverse is rarely a good idea.

The QR decomposition is a decomposition \(A = Q R\) where \(Q\) is an orthogonal matrix (i.e., \(Q'Q = Q Q' = I\)) and \(R\) is an upper triangular matrix.

In that derivation, we showed that the least-squares problem can be written as the solution to

\[ R x = Q' b \]

where, as discussed above, the upper-triangular structure of \(R\) can be solved easily with back substitution.

The \ operator solves the linear least-squares problem whenever the given A is rectangular

N = 10
M = 3
x_true = rand(3)

A = rand(N, M) .+ randn(N)
b = rand(N)
x = A \ b
3-element Vector{Float64}:
 -0.5238169732680477
 -0.12063059869195494
  0.8918137920710699

To manually use the QR decomposition in solving linear least squares:

Af = qr(A)
@show Af.Q * Af.R ≈ A
x = Af.R \ (Matrix(Af.Q)' * b)  # simplified QR solution for least squares
Af.Q * Af.R ≈ A = true

3-element Vector{Float64}:
 -0.5238169732680475
 -0.12063059869195454
  0.8918137920710694

See here for more, though keep in mind that a more specialized algorithm would be more efficient.

In some cases, if an LU is not available for a particular matrix structure, the QR factorization can also be used to solve systems of equations (i.e., not just LLS). This tends to be about 2 times slower than the LU but is of the same computational order.

Deriving the approach, where we can now use the inverse since the system is square and we assumed \(A\) was non-singular,

\[\begin{split} \begin{aligned} A x &= b\\ Q R x &= b\\ Q^{-1} Q R x &= Q^{-1} b\\ R x &= Q' b \end{aligned} \end{split}\]

where the last step uses the fact that \(Q^{-1} = Q'\) for an orthogonal matrix.

Given the decomposition, the solution for dense matrices is of computational order \(O(N^2)\). To see this, look at the order of each operation.

  • Since \(R\) is an upper-triangular matrix, it can be solved quickly through back substitution with computational order \(O(N^2)\)

  • A transpose operation is of order \(O(N^2)\)

  • A matrix-vector product is also \(O(N^2)\)

In all cases, the order would drop depending on the sparsity pattern of the matrix (and corresponding decomposition). A key benefit of a QR decomposition is that it tends to maintain sparsity.

Without implementing the full process, you can form a QR factorization with qr and then use it to solve a system

N = 5
A = rand(N, N)
b = rand(N)
@show A \ b
@show qr(A) \ b;
A \ b = [-1.0375649582146846, -12.139453854554883, 6.356088063778175, 3.8473591088663284, 2.045532886550066]
qr(A) \ b = [-1.0375649582146882, -12.13945385455488, 6.356088063778179, 3.8473591088663257, 2.045532886550066]

19.2.6. Spectral Decomposition#

A spectral decomposition, also known as an eigendecomposition, finds all of the eigenvectors and eigenvalues to decompose a square matrix A such that

\[ A = Q \Lambda Q^{-1} \]

where \(Q\) is a matrix made of the eigenvectors of \(A\) as columns, and \(\Lambda\) is a diagonal matrix of the eigenvalues. Only square, diagonalizable matrices have an eigendecomposition (where a matrix is not diagonalizable if it does not have a full set of linearly independent eigenvectors).

In Julia, whenever you ask for a full set of eigenvectors and eigenvalues, it decomposes using an algorithm appropriate for the matrix type. For example, symmetric, Hermitian, and tridiagonal matrices have specialized algorithms.

To see this,

A = Symmetric(rand(5, 5))  # symmetric matrices have real eigenvectors/eigenvalues
A_eig = eigen(A)
Λ = Diagonal(A_eig.values)
Q = A_eig.vectors
norm(Q * Λ * inv(Q) - A)
3.9801437366711736e-15

Keep in mind that a real matrix may have complex eigenvalues and eigenvectors, so if you attempt to check Q * Λ * inv(Q) - A - even for a positive-definite matrix - the result may not be exactly a real matrix due to numerical inaccuracy.
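
For instance, a real matrix with no special structure can have genuinely complex eigenvalues; a rotation matrix is a simple example,

A_rot = [0.0 -1.0; 1.0 0.0]  # 90-degree rotation
eigvals(A_rot)               # a complex-conjugate pair, approximately ±im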

19.3. Continuous-Time Markov Chains (CTMCs)#

In the lecture on discrete-time Markov chains, we saw that the transition probability between state \(x\) and state \(y\) was summarized by the matrix \(P(x, y) := \mathbb P \{ X_{t+1} = y \,|\, X_t = x \}\).

As a brief introduction to continuous time processes, consider the same state space as in the discrete case: \(S\) is a finite set with \(n\) elements \(\{x_1, \ldots, x_n\}\).

A Markov chain \(\{X_t\}\) on \(S\) is a sequence of random variables on \(S\) that have the Markov property.

In continuous time, the Markov Property is more complicated, but intuitively is the same as the discrete-time case.

That is, knowing the current state is enough to know probabilities for future states. Or, for realizations \(x(\tau)\in S, \tau \leq t\),

\[ \mathbb P \{ X(t+s) = y \,|\, X(t) = x, X(\tau) = x(\tau) \text{ for } 0 \leq \tau \leq t \} = \mathbb P \{ X(t+s) = y \,|\, X(t) = x\} \]

Heuristically, consider a time period \(t\) and a small step forward, \(\Delta\). Then the probability to transition from state \(i\) to state \(j\) is

\[\begin{split} \mathbb P \{ X(t + \Delta) = j \,|\, X(t) \} = \begin{cases} q_{ij} \Delta + o(\Delta) & i \neq j\\ 1 + q_{ii} \Delta + o(\Delta) & i = j \end{cases} \end{split}\]

where the \(q_{ij}\) are “intensity” parameters governing the transition rate, and \(o(\Delta)\) is little-o notation. That is, \(\lim_{\Delta\to 0} o(\Delta)/\Delta = 0\).

Just as in the discrete case, we can summarize these parameters by an \(N \times N\) matrix, \(Q \in \mathbb{R}^{N\times N}\).

Recall that in the discrete case every element is weakly positive and every row must sum to one. With continuous time, however, the rows of \(Q\) sum to zero, where the diagonal contains the negative value of jumping out of the current state. That is,

  • \(q_{ij} \geq 0\) for \(i \neq j\)

  • \(q_{ii} \leq 0\)

  • \(\sum_{j} q_{ij} = 0\)

The \(Q\) matrix is called the intensity matrix, or the infinitesimal generator of the Markov chain. For example,

\[\begin{split} Q = \begin{bmatrix} -0.1 & 0.1 & 0 & 0 & 0 & 0\\ 0.1 &-0.2 & 0.1 & 0 & 0 & 0\\ 0 & 0.1 & -0.2 & 0.1 & 0 & 0\\ 0 & 0 & 0.1 & -0.2 & 0.1 & 0\\ 0 & 0 & 0 & 0.1 & -0.2 & 0.1\\ 0 & 0 & 0 & 0 & 0.1 & -0.1\\ \end{bmatrix} \end{split}\]

In the above example, transitions occur only between adjacent states with the same intensity (except for a "bouncing back" of the bottom and top states).

Implementing the \(Q\) using its tridiagonal structure

using LinearAlgebra
alpha = 0.1
N = 6
Q = Tridiagonal(fill(alpha, N - 1), [-alpha; fill(-2alpha, N - 2); -alpha],
                fill(alpha, N - 1))
6×6 Tridiagonal{Float64, Vector{Float64}}:
 -0.1   0.1    ⋅     ⋅     ⋅     ⋅ 
  0.1  -0.2   0.1    ⋅     ⋅     ⋅ 
   ⋅    0.1  -0.2   0.1    ⋅     ⋅ 
   ⋅     ⋅    0.1  -0.2   0.1    ⋅ 
   ⋅     ⋅     ⋅    0.1  -0.2   0.1
   ⋅     ⋅     ⋅     ⋅    0.1  -0.1

Here we can use Tridiagonal to exploit the structure of the problem.
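
As a quick sanity check of the intensity-matrix properties listed above,

@show sum(Q, dims = 2)     # each row sums to zero (up to floating point)
@show all(diag(Q) .<= 0);  # diagonal entries are non-positive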

Consider a simple payoff vector \(r\) associated with each state, and a discount rate \(\rho\). Then we can solve for the expected present discounted value in a way similar to the discrete-time case.

\[ \rho v = r + Q v \]

or rearranging slightly, solving the linear system

\[ (\rho I - Q) v = r \]

For our example, exploiting the tridiagonal structure,

r = range(0.0, 10.0, length = N)
rho = 0.05

A = rho * I - Q
6×6 Tridiagonal{Float64, Vector{Float64}}:
  0.15  -0.1     ⋅      ⋅      ⋅      ⋅ 
 -0.1    0.25  -0.1     ⋅      ⋅      ⋅ 
   ⋅    -0.1    0.25  -0.1     ⋅      ⋅ 
   ⋅      ⋅    -0.1    0.25  -0.1     ⋅ 
   ⋅      ⋅      ⋅    -0.1    0.25  -0.1
   ⋅      ⋅      ⋅      ⋅    -0.1    0.15

Note that this \(A\) matrix is maintaining the tridiagonal structure of the problem, which leads to an efficient solution to the linear problem.

v = A \ r
6-element Vector{Float64}:
  38.15384615384615
  57.23076923076923
  84.92307692307693
 115.07692307692311
 142.76923076923077
 161.84615384615384

The \(Q\) is also used to calculate the evolution of the Markov chain, in direct analogy to the \(\psi_{t+k} = \psi_t P^k\) evolution with the transition matrix \(P\) of the discrete case.

In the continuous case, this becomes the system of linear differential equations

\[ \dot{\psi}(t) = Q(t)^T \psi(t) \]

given the initial condition \(\psi(0)\) and where the \(Q(t)\) intensity matrix is allowed to vary with time. In the simplest case of a constant \(Q\) matrix, this is a simple constant-coefficient system of linear ODEs with coefficients \(Q^T\).

If a stationary equilibrium exists, note that \(\dot{\psi}(t) = 0\), and the stationary solution \(\psi^{*}\) needs to satisfy

\[ 0 = Q^T \psi^{*} \]

Notice that this is of the form \(0 \psi^{*} = Q^T \psi^{*}\) and hence is equivalent to finding the eigenvector associated with the \(\lambda = 0\) eigenvalue of \(Q^T\).

With our example, we can calculate all of the eigenvalues and eigenvectors

lambda, vecs = eigen(Array(Q'))
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
6-element Vector{Float64}:
 -0.3732050807568874
 -0.29999999999999993
 -0.19999999999999998
 -0.09999999999999995
 -0.026794919243112274
  0.0
vectors:
6×6 Matrix{Float64}:
 -0.149429  -0.288675   0.408248   0.5          -0.557678  0.408248
  0.408248   0.57735   -0.408248   1.38778e-16  -0.408248  0.408248
 -0.557678  -0.288675  -0.408248  -0.5          -0.149429  0.408248
  0.557678  -0.288675   0.408248  -0.5           0.149429  0.408248
 -0.408248   0.57735    0.408248   7.63278e-16   0.408248  0.408248
  0.149429  -0.288675  -0.408248   0.5           0.557678  0.408248

Indeed, there is a \(\lambda = 0\) eigenvalue, which is associated with the last column of the eigenvector matrix. To turn that into a probability, we need to normalize it.

vecs[:, N] ./ sum(vecs[:, N])
6-element Vector{Float64}:
 0.16666666666666657
 0.16666666666666657
 0.1666666666666667
 0.16666666666666682
 0.16666666666666685
 0.16666666666666663
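
As a check, this normalized vector satisfies the stationarity condition \(Q^T \psi^{*} = 0\) up to numerical error,

psi_star = vecs[:, N] ./ sum(vecs[:, N])
@show norm(Q' * psi_star);  # numerically zero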

19.3.1. Multiple Dimensions#

A frequent case in discretized models is dealing with Markov chains with multiple “spatial” dimensions (e.g., wealth and income).

After discretizing a process to create a Markov chain, you can always take the Cartesian product of the set of states in order to enumerate it as a single state variable.

To see this, consider states \(i\) and \(j\) governed by infinitesimal generators \(Q\) and \(A\).

function markov_chain_product(Q, A)
    M = size(Q, 1)
    N = size(A, 1)
    Q = sparse(Q)
    Qs = blockdiag(fill(Q, N)...)  # create diagonal blocks of every operator
    As = kron(A, sparse(I(M)))
    return As + Qs
end

alpha = 0.1
N = 4
Q = Tridiagonal(fill(alpha, N - 1), [-alpha; fill(-2alpha, N - 2); -alpha],
                fill(alpha, N - 1))
A = sparse([-0.1 0.1
            0.2 -0.2])
M = size(A, 1)
L = markov_chain_product(Q, A)
L |> Matrix  # display as a dense matrix
8×8 Matrix{Float64}:
 -0.2   0.1   0.0   0.0   0.1   0.0   0.0   0.0
  0.1  -0.3   0.1   0.0   0.0   0.1   0.0   0.0
  0.0   0.1  -0.3   0.1   0.0   0.0   0.1   0.0
  0.0   0.0   0.1  -0.2   0.0   0.0   0.0   0.1
  0.2   0.0   0.0   0.0  -0.3   0.1   0.0   0.0
  0.0   0.2   0.0   0.0   0.1  -0.4   0.1   0.0
  0.0   0.0   0.2   0.0   0.0   0.1  -0.4   0.1
  0.0   0.0   0.0   0.2   0.0   0.0   0.1  -0.3

This provides the combined Markov chain for the \((i,j)\) process. To see the sparsity pattern,

using Plots

spy(L, markersize = 10)

To calculate a simple dynamic valuation, suppose that the payoff of being in state \((i,j)\) is \(r_{ij} = i + 2j\)

r = [i + 2.0j for i in 1:N, j in 1:M]
r = vec(r)  # vectorize it since stacked in same order
8-element Vector{Float64}:
 3.0
 4.0
 5.0
 6.0
 5.0
 6.0
 7.0
 8.0

Solving the equation \(\rho v = r + L v\)

rho = 0.05
v = (rho * I - L) \ r
reshape(v, N, M)
4×2 Matrix{Float64}:
  87.8992   93.6134
  96.1345  101.849
 106.723   112.437
 114.958   120.672

The reshape helps to rearrange it back to being two-dimensional.

To find the stationary distribution, we calculate the eigenvalues and choose the eigenvector associated with \(\lambda=0\). In this case, we can verify that it is the last one.

L_eig = eigen(Matrix(L'))
@assert norm(L_eig.values[end]) < 1E-10

psi = L_eig.vectors[:, end]
psi = psi / sum(psi)
8-element Vector{Float64}:
 0.16666666666666669
 0.1666666666666665
 0.16666666666666669
 0.1666666666666668
 0.08333333333333327
 0.08333333333333344
 0.08333333333333331
 0.0833333333333334

Reshape this to be two-dimensional if it is helpful for visualization.

reshape(psi, N, size(A, 1))
4×2 Matrix{Float64}:
 0.166667  0.0833333
 0.166667  0.0833333
 0.166667  0.0833333
 0.166667  0.0833333

19.3.2. Irreducibility#

As with discrete-time Markov chains, a key question is whether a CTMC is irreducible, i.e., whether all of its states communicate. The problem is isomorphic to determining whether the directed graph of the Markov chain is strongly connected.

using Graphs
alpha = 0.1
N = 6
Q = Tridiagonal(fill(alpha, N - 1), [-alpha; fill(-2alpha, N - 2); -alpha],
                fill(alpha, N - 1))
6×6 Tridiagonal{Float64, Vector{Float64}}:
 -0.1   0.1    ⋅     ⋅     ⋅     ⋅ 
  0.1  -0.2   0.1    ⋅     ⋅     ⋅ 
   ⋅    0.1  -0.2   0.1    ⋅     ⋅ 
   ⋅     ⋅    0.1  -0.2   0.1    ⋅ 
   ⋅     ⋅     ⋅    0.1  -0.2   0.1
   ⋅     ⋅     ⋅     ⋅    0.1  -0.1

We can verify that it is possible to move between every pair of states in a finite number of steps with

Q_graph = DiGraph(Q)
@show is_strongly_connected(Q_graph);  # i.e., can follow directional edges to get to every state
is_strongly_connected(Q_graph) = true

Alternatively, as an example of a reducible Markov chain where states \(1\) and \(2\) cannot jump to state \(3\).

Q = [-0.2 0.2 0
     0.2 -0.2 0
     0.2 0.6 -0.8]
Q_graph = DiGraph(Q)
@show is_strongly_connected(Q_graph);
is_strongly_connected(Q_graph) = false

19.4. Banded Matrices#

A tridiagonal matrix has 3 non-zero diagonals: the main diagonal, the first sub-diagonal (i.e., below the main diagonal), and also the first super-diagonal (i.e., above the main diagonal).

This is a special case of a more general type called a banded matrix, where the number of sub- and super-diagonals can be greater than 1. The total width of main-, sub-, and super-diagonals is called the bandwidth. For example, a tridiagonal matrix has a bandwidth of 3.

An \(N \times N\) banded matrix with bandwidth \(P\) has about \(N P\) nonzeros in its sparsity pattern.

These can be created directly as a dense matrix with diagm. For example, with a bandwidth of three and a zero diagonal,

diagm(1 => [1, 2, 3], -1 => [4, 5, 6])
4×4 Matrix{Int64}:
 0  1  0  0
 4  0  2  0
 0  5  0  3
 0  0  6  0

Or as a sparse matrix,

spdiagm(1 => [1, 2, 3], -1 => [4, 5, 6])
4×4 SparseMatrixCSC{Int64, Int64} with 6 stored entries:
 ⋅  1  ⋅  ⋅
 4  ⋅  2  ⋅
 ⋅  5  ⋅  3
 ⋅  ⋅  6  ⋅

Or directly using BandedMatrices.jl

using BandedMatrices
BandedMatrix(1 => [1, 2, 3], -1 => [4, 5, 6])
4×4 BandedMatrix{Int64} with bandwidths (1, 1):
 0  1  ⋅  ⋅
 4  0  2  ⋅
 ⋅  5  0  3
 ⋅  ⋅  6  0

There is also a convenience function for generating random banded matrices

A = brand(7, 7, 3, 1)  # 7x7 matrix, 3 subdiagonals, 1 superdiagonal
7×7 BandedMatrix{Float64} with bandwidths (3, 1):
 0.249137  0.849849   ⋅         ⋅          ⋅         ⋅         ⋅ 
 0.607815  0.253132  0.371965   ⋅          ⋅         ⋅         ⋅ 
 0.724436  0.121212  0.415992  0.25258     ⋅         ⋅         ⋅ 
 0.965024  0.437153  0.729472  0.336554   0.664299   ⋅         ⋅ 
  ⋅        0.976969  0.314399  0.0272443  0.397334  0.919194   ⋅ 
  ⋅         ⋅        0.403747  0.147311   0.74152   0.602738  0.913819
  ⋅         ⋅         ⋅        0.583009   0.578108  0.560479  0.579793

And, of course, specialized algorithms will be used to exploit the structure when solving linear systems. In particular, the complexity of solving is roughly \(O(N P_L P_U)\) for lower and upper bandwidths \(P_L\) and \(P_U\).

@show factorize(Symmetric(A)) |> typeof
A \ rand(7)
factorize(Symmetric(A)) |> typeof = LDLt{Float64, Symmetric{Float64, BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}}}

7-element Vector{Float64}:
 -1.6369592693831976
  1.6204127338847327
  1.6129642421666923
  2.2654589446799656
 -1.015461108463952
 -1.0012568753722138
  0.9146445743712437

The factorization algorithm uses a specialized LU decomposition for banded matrices.

19.5. Implementation Details and Performance#

Recall the famous quote from Knuth: “97% of the time, premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” The most common example of premature optimization is trying to use your own mental model of a compiler while writing your code, worried about the efficiency of code, and (usually incorrectly) second-guessing the compiler.

Concretely, the lessons in this section are:

  1. Don’t worry about optimizing your code unless you need to. Code clarity is your first-order concern.

  2. If you use other people’s packages, they can worry about performance and you don’t need to.

  3. If you absolutely need that “critical 3%,” your intuition about performance is usually wrong on modern CPUs and GPUs, so let the compiler do its job.

  4. Benchmarking (e.g., @btime) and profiling are the tools to figure out performance bottlenecks. If 99% of computing time is spent in one small function, then there is no point in optimizing anything else.

  5. If you benchmark to show that a particular part of the code is an issue, and you can’t find another library that does a better job, then you can worry about performance.

You will rarely get to step 3, let alone step 5.

However, there is also a corollary: “don’t pessimize prematurely.” That is, don’t make choices that lead to poor performance without any tradeoff in improved code clarity. For example, writing your own algorithms when a high-performance algorithm exists in a package or Julia itself, or lazily making a matrix dense and carelessly dropping its structure.

19.5.1. Implementation Difficulty#

Numerical analysts sometimes refer to the lowest level of code for basic operations (e.g., a dot product, matrix-matrix product, convolutions) as kernels.

That sort of code is difficult to write, and performance depends on the characteristics of the underlying hardware, such as the instruction set available on the particular CPU, the size of the CPU cache, and the layout of arrays in memory.

Typically, these operations are written in a BLAS library, organized into different levels. The levels roughly correspond to the computational order of the operations: BLAS Level 1 covers \(O(N)\) operations such as dot products and vector addition, Level 2 covers \(O(N^2)\) operations such as matrix-vector products, and Level 3 covers roughly \(O(N^3)\) operations such as general matrix-matrix products.

An example of a BLAS library is OpenBLAS, which is used by default in Julia, or the Intel MKL, which is used in Matlab (and in Julia if the MKL.jl package is installed).

On top of BLAS are LAPACK operations, which are higher-level kernels, such as matrix factorizations and eigenvalue algorithms, and are often in the same libraries (e.g., MKL has both BLAS and LAPACK functionality).
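
To connect these levels to everyday Julia code (the sizes and variable names here are arbitrary and purely illustrative): a dot product is a Level 1 kernel, an in-place matrix-vector product is Level 2, an in-place matrix-matrix product is Level 3, and lu sits at the LAPACK layer on top of them,

x_v, y_v = rand(1000), rand(1000)
A_mm, B_mm, C_mm = rand(1000, 1000), rand(1000, 1000), zeros(1000, 1000)
dot(x_v, y_v)           # BLAS Level 1: an O(N) operation
mul!(y_v, A_mm, x_v)    # BLAS Level 2: an O(N^2) matrix-vector product, overwriting y_v
mul!(C_mm, A_mm, B_mm)  # BLAS Level 3: an O(N^3) matrix-matrix product, overwriting C_mm
lu(A_mm);               # a LAPACK-level factorization built on top of the BLAS kernels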

The details of these packages are not especially relevant, but if you are talking about performance, people will inevitably start discussing these different packages and kernels. There are a few important things to keep in mind:

  1. Leave writing kernels to the experts. Even simple-sounding algorithms can be very complicated to implement with high performance.

  2. Your intuition about performance of code is probably going to be wrong. If you use high quality libraries rather than writing your own kernels, you don’t need to use your intuition.

  3. Don’t get distracted by the jargon or acronyms above if you are reading about performance.

19.5.2. Row- and Column-Major Ordering#

There is a practical performance issue which may influence your code. Since memory in a CPU is linear, dense matrices need to be stored by either stacking columns (called column-major order) or rows (called row-major order).

The reason this matters is that compilers can generate better performance if they work in contiguous chunks of memory, and this becomes especially important with large matrices due to the interaction with the CPU cache. Choosing the wrong order when there is no benefit in code clarity is an example of premature pessimization. The performance difference can be orders of magnitude in some cases, and nothing in others.

One option is to use functions that let Julia choose the most efficient way to traverse memory: generic iteration tools such as eachindex (and enumerate over such an iterator) will pick an appropriate order. If you need to choose the looping order yourself, you might want to experiment with going through columns first versus going through rows first, and benchmark both.

Julia, Fortran, and Matlab all use column-major order, while C/C++ and Python use row-major order. This means that if you find an algorithm written for C/C++/Python, you will sometimes need to make small changes if performance is an issue.
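
As a small illustration (the size and helper functions below are arbitrary), summing a matrix with the inner loop running down a column touches contiguous memory in Julia, while running the inner loop across a row does not; benchmarking both shows the difference,

function sum_columns_first(X)  # inner loop walks down a column: contiguous access
    total = 0.0
    for j in axes(X, 2), i in axes(X, 1)
        total += X[i, j]
    end
    return total
end

function sum_rows_first(X)  # inner loop walks across a row: strided access
    total = 0.0
    for i in axes(X, 1), j in axes(X, 2)
        total += X[i, j]
    end
    return total
end

X = rand(2000, 2000)
@btime sum_columns_first($X)
@btime sum_rows_first($X);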

19.5.3. Digression on Allocations and In-place Operations#

While we have usually not considered optimizing code for performance (and have focused on the choice of algorithms instead), when matrices and vectors become large we need to be more careful.

The most important thing to avoid are excess allocations, which usually occur due to the use of temporary vectors and matrices when they are not necessary. Sometimes those extra temporary values can cause enormous degradations in performance.

However, caution is suggested, since excess allocations are never relevant for scalar values, and allocations can sometimes produce faster code for small matrices/vectors because they can lead to better cache locality.

To see this, a convenient tool is the @btime macro from the BenchmarkTools package,

using BenchmarkTools
A = rand(10, 10)
B = rand(10, 10)
C = similar(A)
function f!(C, A, B)
    D = A * B
    C .= D .+ 1
end
@btime f!($C, $A, $B)
  367.529 ns (2 allocations: 944 bytes)
10×10 Matrix{Float64}:
 3.88622  3.89353  3.77256  3.03683  …  2.3747   2.06317  2.85813  4.12785
 5.00088  4.89479  4.5616   2.93174     2.86965  2.58255  3.32457  5.01006
 4.16559  3.89196  3.48728  2.36875     2.39433  2.35954  2.6446   3.69582
 4.72342  4.26309  4.10953  2.58485     2.80941  2.30148  2.6341   3.95345
 3.83682  4.552    3.991    2.91925     3.08619  2.3282   2.58817  4.06832
 4.50285  4.63918  4.62707  2.32979  …  2.79878  2.68404  3.12501  4.82882
 4.57254  4.50141  4.25523  3.01381     2.91585  2.27358  3.06085  4.17532
 4.642    4.00792  4.38117  2.09846     2.6599   2.62388  3.17824  4.34158
 3.69463  3.4852   3.2082   2.19226     2.29474  2.21371  2.33719  3.25337
 4.76994  4.89471  4.73363  3.13832     2.90939  2.59762  3.38554  4.96909

The ! on the f! is an informal way to say that the function is mutating, and the first argument (C here) is by convention the modified variable.

In the f! function, notice that D is a temporary matrix that is allocated just to hold the product A * B. But since C is modified directly, there is no need to create the temporary D matrix at all.

This is an example of where an in-place version of the matrix multiplication can help avoid the allocation.

function f2!(C, A, B)
    mul!(C, A, B)  # in-place multiplication
    C .+= 1
end
A = rand(10, 10)
B = rand(10, 10)
C = similar(A)
@btime f!($C, $A, $B)
@btime f2!($C, $A, $B)
  366.316 ns (2 allocations: 944 bytes)
  327.436 ns (0 allocations: 0 bytes)
10×10 Matrix{Float64}:
 3.49484  3.39944  2.78868  3.30176  …  3.95507  3.03323  2.69935  2.94465
 2.55706  1.86998  2.24931  2.39319     3.04391  2.22918  1.87232  2.5985
 4.09158  3.5013   3.62429  3.80519     4.60324  3.11816  3.16512  3.63433
 3.67359  4.13121  3.88012  4.55375     5.39446  3.35861  3.33792  3.67154
 3.39581  4.40599  3.42146  4.39003     5.07657  3.47102  3.28428  3.77209
 3.79119  3.59088  3.16204  4.45079  …  4.98626  3.28275  3.15713  3.47816
 3.85483  4.33821  3.60923  3.93005     5.00665  3.97537  3.35012  4.00407
 4.89039  5.28359  4.77468  5.43633     6.72286  4.54196  4.0695   4.80744
 3.75726  3.34515  3.26204  3.78259     4.45989  3.6188   2.89155  3.41107
 4.08829  4.03342  3.21527  4.17448     4.33974  3.37887  3.08931  3.02885

Note that in the output of the benchmarking, the f2! is non-allocating and is using the pre-allocated C variable directly.

Another example of this is solutions to linear equations, where for large systems you may pre-allocate and reuse the solution vector.

A = rand(10, 10)
y = rand(10)
z = A \ y  # creates temporary

A = factorize(A)  # in-place requires factorization
x = similar(y)  # pre-allocate
ldiv!(x, A, y)  # in-place left divide, using factorization
10-element Vector{Float64}:
  2.3013974055322075
  0.0846873817770798
  0.7473201607390058
 -2.747417513346127
 -1.693278785858056
  0.12375574392037846
 -1.4428955275936808
  1.3411435637041682
  2.761463941320383
  0.07288425762859393

However, if you benchmark carefully, you will see that this is sometimes slower. Avoiding allocations is not always a good idea - and worrying about it prior to benchmarking is premature optimization.

There are a variety of other non-allocating versions of functions. For example,

A = rand(10, 10)
B = similar(A)

transpose!(B, A)  # non-allocating version of B = transpose(A)
10×10 Matrix{Float64}:
 0.811601    0.899243  0.725708   0.284021  …  0.354106  0.112426  0.339154
 0.00790333  0.86226   0.251181   0.474599     0.34251   0.222795  0.0186245
 0.123037    0.668323  0.0619222  0.582055     0.959484  0.32293   0.214821
 0.0012518   0.733197  0.617966   0.151617     0.171723  0.74272   0.209474
 0.0911146   0.822027  0.910671   0.294019     0.832758  0.578819  0.129152
 0.425627    0.822726  0.355305   0.197185  …  0.172408  0.447427  0.823456
 0.668306    0.17692   0.765922   0.667631     0.873703  0.25864   0.0134306
 0.388655    0.931105  0.20858    0.383996     0.799093  0.164938  0.395565
 0.304053    0.695924  0.0553802  0.989633     0.928237  0.739555  0.334263
 0.710061    0.613563  0.883984   0.619658     0.13839   0.282556  0.891416

Finally, a common source of unnecessary allocations is taking slices or portions of matrices. For example, the following allocates a new vector B and copies the values.

A = rand(5, 5)
B = A[2, :]  # extract a vector
5-element Vector{Float64}:
 0.9230450911751324
 0.08136956335528056
 0.8791975622688366
 0.3124595901751288
 0.47680030576508337

To see that these are separate arrays, note that

A[2, 1] = 100.0
@show A[2, 1]
@show B[1];
A[2, 1] = 100.0
B[1] = 0.9230450911751324

Instead of allocating a new array, you can take a view of the matrix with the @view macro, which provides an AbstractArray referencing the original data without copying it.

A = rand(5, 5)
B = @view A[2, :]  #  does not copy the data

A[2, 1] = 100.0
@show A[2, 1]
@show B[1];
A[2, 1] = 100.0
B[1] = 100.0

But again, you will often find that using @view leads to slower code. Benchmark instead, and generally reserve views for large matrices and for contiguous chunks of memory (e.g., columns rather than rows).

19.6. Exercises#

19.6.1. Exercise 1#

This exercise is for practice on writing low-level routines (i.e., “kernels”), and to hopefully convince you to leave low-level code to the experts.

The formula for matrix multiplication is deceptively simple. For example, with the product of square matrices \(C = A B\) of size \(N \times N\), the \(i,j\) element of \(C\) is

\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]

Alternatively, you can take a row \(A_{i,:}\) and column \(B_{:, j}\) and use an inner product

\[ C_{ij} = A_{i,:} \cdot B_{:,j} \]

Note that the inner product in a discrete space is simply a sum, and has the same complexity as the sum (i.e., \(O(N)\) operations).

For a dense matrix without any structure and using a naive multiplication algorithm, this also makes it clear why the complexity is \(O(N^3)\): You need to evaluate it for \(N^2\) elements in the matrix and do an \(O(N)\) operation each time.

For this exercise, implement matrix multiplication yourself and compare performance in a few permutations.

  1. Use the built-in function in Julia (i.e., C = A * B, or, for a better comparison, the in-place version mul!(C, A, B), which works with pre-allocated data).

  2. Loop over each \(C_{ij}\) by the row first (i.e., the i index) and use a for loop for the inner product.

  3. Loop over each \(C_{ij}\) by the column first (i.e., the j index) and use a for loop for the inner product.

  4. Do the same but use the dot product instead of the sum.

  5. Choose your best implementation of these and, for matrices of a few different sizes (N=10, N=1000, etc.), compare the ratio of its performance to that of the built-in BLAS library.

A few more hints:

  • You can just use random matrices (e.g., A = rand(N, N)).

  • For all of them, pre-allocate the \(C\) matrix beforehand with C = similar(A) or something equivalent.

  • To compare performance, put your code in a function and use the @btime macro to time it.

19.6.2. Exercise 2a#

Here we will calculate the evolution of the pdf of a discrete-time Markov chain, \(\psi_t\), given the initial condition \(\psi_0\).

Start with a simple symmetric tridiagonal matrix

N = 100
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
A_adjoint = A';

  1. Pick some large T and use the initial condition \(\psi_0 = \begin{bmatrix} 1 & 0 & \ldots & 0\end{bmatrix}\)

  2. Write code to calculate \(\psi_t\) to some \(T\) by iterating the map for each \(t\), i.e.,

\[ \psi_{t+1} = A' \psi_t \]
  3. What is the computational order of calculating \(\psi_T\) using this iteration approach for \(T < N\)?

  4. What is the computational order of \((A')^T = (A' \ldots A')\) and then \(\psi_T = (A')^T \psi_0\) for \(T < N\)?

  5. Benchmark calculating \(\psi_T\) with the iterative calculation above as well as the direct \(\psi_T = (A')^T \psi_0\) to see which is faster. You can take the matrix power with just A_adjoint^T, which uses specialized algorithms that are faster and more accurate than repeated matrix multiplication (but with the same computational order).

  6. Check the same if \(T = 2 N\)

Note: The algorithm used in Julia to take matrix powers depends on the matrix structure, as always. In the symmetric case, it can use an eigendecomposition, whereas with a general dense matrix it uses squaring and scaling.

19.6.3. Exercise 2b#

With the same setup as in Exercise 2a, do an eigendecomposition of A_adjoint. That is, use eigen to factor the adjoint \(A' = Q \Lambda Q^{-1}\), where \(Q\) is the matrix of eigenvectors and \(\Lambda\) is the diagonal matrix of eigenvalues. Calculate \(Q^{-1}\) from the results.

Use the factored matrix to calculate the sequence of \(\psi_t = (A')^t \psi_0\) using the relationship

\[ \psi_t = Q \Lambda^t Q^{-1} \psi_0 \]

where matrix powers of diagonal matrices are simply the element-wise power of each element.

Benchmark the speed of calculating the sequence of \(\psi_t\) up to T = 2N using this method. In principle, the factorization and easy calculation of the power should give you benefits, compared to simply iterating the map as we did in Exercise 2a. Explain why it does or does not, using the computational order of each approach.