13. The Need for Speed#
13.1. Overview#
Computer scientists often classify programming languages according to the following two categories.
High level languages aim to maximize productivity by
being easy to read, write and debug
automating standard tasks (e.g., memory management)
being interactive, etc.
Low level languages aim for speed and control, which they achieve by
being closer to the metal (direct access to CPU, memory, etc.)
requiring a relatively large amount of information from the user (e.g., all data types must be specified)
Traditionally we understand this as a trade-off:
high productivity or high performance
optimized for humans or optimized for machines
One of the great strengths of Julia is that it pushes out the curve, achieving both high productivity and high performance with relatively little fuss.
The word “relatively” is important here, however…
In simple programs, excellent performance is often trivial to achieve.
For longer, more sophisticated programs, you need to be aware of potential stumbling blocks.
This lecture covers the key points.
13.1.1. Requirements#
You should read our earlier lecture on types, methods and multiple dispatch before this one.
using LinearAlgebra, Statistics
13.2. Understanding Multiple Dispatch in Julia#
This section provides more background on how methods, functions, and types are connected.
13.2.1. Methods and Functions#
The precise data type is important, for reasons of both efficiency and mathematical correctness.
For example consider 1 + 1 vs. 1.0 + 1.0 or [1 0] + [0 1].
On a CPU, integer and floating point addition are different things, using a different set of instructions.
Julia handles this problem by storing multiple, specialized versions of functions like addition, one for each data type or set of data types.
These individual specialized versions are called methods.
When an operation like addition is requested, the Julia compiler inspects the type of data to be acted on and hands it out to the appropriate method.
This process is called multiple dispatch.
Like all “infix” operators, + can also be called with function syntax, so 1 + 1 has the alternative form +(1, 1)
+(1, 1)
2
This operator + is itself a function with multiple methods.
We can investigate them using the @which macro, which shows the method to which a given call is dispatched
x, y = 1.0, 1.0
@which +(x, y)
We see that the operation is sent to the + method that specializes in adding floating point numbers.
Here’s the integer case
x, y = 1, 1
@which +(x, y)
This output says that the call has been dispatched to the + method responsible for handling integer values.
(We’ll learn more about the details of this syntax below)
Here’s another example, with complex numbers
x, y = 1.0 + 1.0im, 1.0 + 1.0im
@which +(x, y)
Again, the call has been dispatched to a + method specifically designed for handling the given data type.
13.2.1.1. Adding Methods#
It’s straightforward to add methods to existing functions.
For example, we can’t at present add an integer and a string in Julia (i.e., 100 + "100" throws a MethodError).
This is sensible behavior, but if you want to change it there’s nothing to stop you.
import Base: + # enables adding methods to the + function
+(x::Integer, y::String) = x + parse(Int, y)
@show +(100, "100")
@show 100 + "100"; # equivalent
100 + "100" = 200
100 + "100" = 200
Be careful with redefinitions like this in real projects: they can trigger method invalidations, which are a source of compile-time latency.
13.2.2. Understanding the Compilation Process#
We can now be a little bit clearer about what happens when you call a function on given types.
Suppose we execute the function call f(a, b) where a and b are of concrete types S and T respectively.
The Julia interpreter first queries the types of a and b to obtain the tuple (S, T).
It then parses the list of methods belonging to f, searching for a match.
If it finds a method matching (S, T) it calls that method.
If not, it looks to see whether the pair (S, T) matches any method defined for immediate parent types.
For example, if S is Float64 and T is ComplexF32 then the immediate parents are AbstractFloat and Number respectively
supertype(Float64)
AbstractFloat
supertype(ComplexF32)
Number
Hence the interpreter looks next for a method of the form f(x::AbstractFloat, y::Number).
If the interpreter can’t find a match in immediate parents (supertypes) it proceeds up the tree, looking at the parents of the last type it checked at each iteration.
If it eventually finds a matching method, it invokes that method.
If not, we get an error.
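To make this traversal concrete, here is a small sketch; the helper supertype_chain is our own illustration, not a built-in, and simply collects the chain of supertypes that dispatch walks through:

```julia
# Collect the chain of supertypes from a concrete type up to Any,
# mirroring the path the dispatcher searches for matching methods.
function supertype_chain(T::Type)
    chain = Type[T]
    while T != Any
        T = supertype(T)
        push!(chain, T)
    end
    return chain
end

supertype_chain(Float64)  # [Float64, AbstractFloat, Real, Number, Any]
```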
This is the process that leads to the following error (since we only added the + method for Integer and String above)
@show (typeof(100.0) <: Integer) == false
100.0 + "100"
(typeof(100.0) <: Integer) == false = true
MethodError: no method matching +(::Float64, ::String)
The function `+` exists, but no method is defined for this combination of argument types.
Closest candidates are:
+(::Any, ::Any, ::Any, ::Any...)
@ Base operators.jl:596
+(::Real, ::Complex{Bool})
@ Base complex.jl:322
+(::Number, ::UniformScaling)
@ LinearAlgebra /opt/hostedtoolcache/julia/1.11.1/x64/share/julia/stdlib/v1.11/LinearAlgebra/src/uniformscaling.jl:145
...
Stacktrace:
[1] top-level scope
@ In[9]:2
Because the dispatch procedure starts from concrete types and works upwards, dispatch always invokes the most specific method available.
For example, if you have methods for function f that handle
(Float64, Int64) pairs
(Number, Number) pairs
and you call f with f(0.5, 1) then the first method will be invoked.
This makes sense because (hopefully) the first method is optimized for exactly this kind of data.
The second method is probably more of a “catch all” method that handles other data in a less optimal way.
Here’s another simple example, involving a user-defined function
function q(x) # or q(x::Any)
println("Default (Any) method invoked")
end
function q(x::Number)
println("Number method invoked")
end
function q(x::Integer)
println("Integer method invoked")
end
q (generic function with 3 methods)
Let’s now run this and see how it relates to our discussion of method dispatch above
q(3)
Integer method invoked
q(3.0)
Number method invoked
q("foo")
Default (Any) method invoked
Since typeof(3) == Int64 and Int64 <: Integer <: Number, the call q(3) proceeds up the tree to Integer and invokes q(x::Integer).
On the other hand, 3.0 is a Float64, which is not a subtype of Integer.
Hence the call q(3.0) continues up to q(x::Number).
Finally, q("foo") is handled by the function operating on Any, since String is not a subtype of Number or Integer.
13.2.3. Analyzing Function Return Types#
For the most part, time spent “optimizing” Julia code to run faster is about ensuring the compiler can correctly deduce types for all functions.
The macro @code_warntype gives us a hint
x = [1, 2, 3]
f(x) = 2x
@code_warntype f(x)
MethodInstance for f(::Vector{Int64})
from f(x) @ Main In[14]:2
Arguments
#self#::Core.Const(Main.f)
x::Vector{Int64}
Body::Vector{Int64}
1 ─
%1 = (
2 * x)::Vector{Int64}
└── return %1
The @code_warntype macro compiles f(x) using the type of x as an example – i.e., the [1, 2, 3] is used as a prototype for analyzing the compilation, rather than simply calculating the value.
Here, Body::Vector{Int64} tells us that the return type of the function, when called with arguments like [1, 2, 3], is always a vector of integers.
In contrast, consider a function potentially returning nothing, as in this lecture
f(x) = x > 0.0 ? x : nothing
@code_warntype f(1)
MethodInstance for f(::Int64)
from f(x) @ Main In[15]:1
Arguments
#self#::Core.Const(Main.f)
x::Int64
Body::Union{Nothing, Int64}
1 ─ %1 = Main.:>::Core.Const(>)
│ %2 = (
%1)(x, 0.0)::Bool
└── goto #3 if not %2
2 ─ %4 = x::Int64
└── return %4
3 ─ %6 = Main.nothing::Core.Const(nothing)
└── return %6
This states that when the function is called with an integer (like 1), the compiler determines that the return type could be one of two different types: Body::Union{Nothing, Int64}.
A final example is a variation on the above, which returns the maximum of x and 0.
f(x) = x > 0.0 ? x : 0.0
@code_warntype f(1)
MethodInstance for f(::Int64)
from f(x) @ Main In[16]:1
Arguments
#self#::Core.Const(Main.f)
x::Int64
Body::Union{Float64, Int64}
1 ─ %1 = (x > 0.0)::Bool
└── goto #3 if not %1
2 ─ %3 = x::Int64
└── return %3
3 ─ return 0.0
This shows that, when called with an integer, the return type could be that integer or the floating point 0.0.
On the other hand, if we change the function to return 0 when x <= 0, it becomes type-unstable for floating point arguments.
f(x) = x > 0.0 ? x : 0
@code_warntype f(1.0)
MethodInstance for f(::Float64)
from f(x) @ Main In[17]:1
Arguments
#self#::Core.Const(Main.f)
x::Float64
Body::Union{Float64, Int64}
1 ─ %1 = (x > 0.0)::Bool
└── goto #3 if not %1
2 ─ %3 = x::Float64
└── return %3
3 ─ return 0
The solution is to use the zero(x) function, which returns the additive identity for the type of x.
@show zero(2.3)
@show zero(4)
@show zero(2.0 + 3im)
f(x) = x > 0.0 ? x : zero(x)
@code_warntype f(1.0)
zero(2.3) = 0.0
zero(4) = 0
zero(2.0 + 3im) = 0.0 + 0.0im
MethodInstance for f(::Float64)
from f(x) @ Main In[18]:5
Arguments
#self#::Core.Const(Main.f)
x::Float64
Body::Float64
1 ─ %1 = (x > 0.0)::Bool
└── goto #3 if not %1
2 ─ %3 = x::Float64
└── return %3
3 ─ %5 = Main.zero(x)::Core.Const(0.0)
└── return %5
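Besides inspecting @code_warntype output by eye, the @inferred macro from the Test standard library can guard against such instabilities in a test suite: it throws an error whenever the inferred return type is wider than the concrete type actually returned. A minimal sketch (f_stable and f_unstable are our illustrative names):

```julia
using Test

f_stable(x) = x > 0 ? x : zero(x)   # always returns typeof(x)
f_unstable(x) = x > 0 ? x : 0       # Int fallback: unstable for floats

@inferred f_stable(1.0)             # passes: return type is concretely Float64

# @inferred throws here because inference gives Union{Float64, Int64}
unstable_detected = try
    @inferred f_unstable(1.0)
    false
catch
    true
end
```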
13.3. Foundations#
Let’s think about how quickly code runs, taking as given
hardware configuration
algorithm (i.e., set of instructions to be executed)
We’ll start by discussing the kinds of instructions that machines understand.
13.3.1. Machine Code#
All instructions for computers end up as machine code.
Writing fast code — expressing a given algorithm so that it runs quickly — boils down to producing efficient machine code.
You can do this yourself, by hand, if you want to.
Typically this is done by writing assembly, which is a symbolic representation of machine code.
Here’s some assembly code implementing a function that takes arguments \(a, b\) and returns \(2a + 8b\)
pushq %rbp
movq %rsp, %rbp
addq %rdi, %rdi
leaq (%rdi,%rsi,8), %rax
popq %rbp
retq
nopl (%rax)
Note that this code is specific to one particular piece of hardware that we use — different machines require different machine code.
If you ever feel tempted to start rewriting your economic model in assembly, please restrain yourself.
It’s far more sensible to give these instructions in a language like Julia, where they can be easily written and understood.
function f(a, b)
y = 2a + 8b
return y
end
f (generic function with 2 methods)
or Python
def f(a, b):
y = 2 * a + 8 * b
return y
or even C
int f(int a, int b) {
int y = 2 * a + 8 * b;
return y;
}
In any of these languages we end up with code that is much easier for humans to write, read, share and debug.
We leave it up to the machine itself to turn our code into machine code.
How exactly does this happen?
13.3.2. Generating Machine Code#
The process for turning high level code into machine code differs across languages.
Let’s look at some of the options and how they differ from one another.
13.3.2.1. AOT Compiled Languages#
Traditional compiled languages like Fortran, C and C++ are a reasonable option for writing fast code.
Indeed, the standard benchmark for performance is still well-written C or Fortran.
These languages compile down to efficient machine code because users are forced to provide a lot of detail on data types and how the code will execute.
The compiler therefore has ample information for building the corresponding machine code ahead of time (AOT) in a way that
organizes the data optimally in memory and
implements efficient operations as required for the task in hand
At the same time, the syntax and semantics of C and Fortran are verbose and unwieldy when compared to something like Julia.
Moreover, these low level languages lack the interactivity that’s so crucial for scientific work.
13.3.2.2. Interpreted Languages#
Interpreted languages like Python generate machine code “on the fly”, during program execution.
This allows them to be flexible and interactive.
Moreover, programmers can leave many tedious details to the runtime environment, such as
specifying variable types
memory allocation/deallocation, etc.
But all this convenience and flexibility comes at a cost: it’s hard to turn instructions written in these languages into efficient machine code.
For example, consider what happens when Python adds a long list of numbers together.
Typically the runtime environment has to check the type of these objects one by one before it figures out how to add them.
This involves substantial overheads.
There are also significant overheads associated with accessing the data values themselves, which might not be stored contiguously in memory.
The resulting machine code is often complex and slow.
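You can reproduce this overhead inside Julia itself by deliberately erasing type information: summing a Vector{Any} forces a type check and dynamic dispatch on every element, much as an interpreter must do. A sketch (the exact timings are machine-dependent):

```julia
x_typed = rand(1_000_000)        # Vector{Float64}: concrete element type
x_any = Vector{Any}(x_typed)     # same values, element types erased

@time sum(x_typed)   # tight machine code, no per-element type checks
@time sum(x_any)     # boxing and dynamic dispatch on every element
```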
13.3.2.3. Just-in-time compilation#
Just-in-time (JIT) compilation is an alternative approach that marries some of the advantages of AOT compilation and interpreted languages.
The basic idea is that functions for specific tasks are compiled as requested.
As long as the compiler has enough information about what the function does, it can in principle generate efficient machine code.
In some instances, all the information is supplied by the programmer.
In other cases, the compiler will attempt to infer missing information on the fly based on usage.
Through this approach, computing environments built around JIT compilers aim to
provide all the benefits of high level languages discussed above and, at the same time,
produce efficient instruction sets when functions are compiled down to machine code
13.4. JIT Compilation in Julia#
JIT compilation is the approach used by Julia.
In an ideal setting, all information necessary to generate efficient native machine code is supplied or inferred.
In such a setting, Julia will be on par with machine code from low level languages.
13.4.1. An Example#
Consider the function
function f(a, b)
y = (a + 8b)^2
return 7y
end
f (generic function with 2 methods)
Suppose we call f with integer arguments (e.g., z = f(1, 2)).
The JIT compiler now knows the types of a and b.
Moreover, it can infer types for other variables inside the function (e.g., y will also be an integer).
It then compiles a specialized version of the function to handle integers and stores it in memory.
We can view the corresponding machine code using the @code_native macro
@code_native f(1, 2)
.text
.file "f"
.globl julia_f_9573 # -- Begin function julia_f_9573
.p2align 4, 0x90
.type julia_f_9573,@function
julia_f_9573: # @julia_f_9573
; Function Signature: f(Int64, Int64)
; ┌ @ In[20]:1 within `f`
# %bb.0: # %top
; │ @ In[20] within `f`
#DEBUG_VALUE: f:a <- $rdi
#DEBUG_VALUE: f:b <- $rsi
push rbp
; │ @ In[20]:2 within `f`
; │┌ @ int.jl:87 within `+`
lea rcx, [rdi + 8*rsi]
mov rbp, rsp
; │└
; │┌ @ intfuncs.jl:370 within `literal_pow`
; ││┌ @ int.jl:88 within `*`
imul rcx, rcx
; │└└
; │ @ In[20]:3 within `f`
; │┌ @ int.jl:88 within `*`
lea rax, [8*rcx]
sub rax, rcx
; │└
pop rbp
ret
.Lfunc_end0:
.size julia_f_9573, .Lfunc_end0-julia_f_9573
; └
# -- End function
.section ".note.GNU-stack","",@progbits
If we now call f again, but this time with floating point arguments, the JIT compiler will once more infer types for the other variables inside the function (e.g., y will also be a float).
It then compiles a new version to handle this type of argument.
@code_native f(1.0, 2.0)
.text
.file "f"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3, 0x0 # -- Begin function julia_f_9675
.LCPI0_0:
.quad 0x4020000000000000 # double 8
.LCPI0_1:
.quad 0x401c000000000000 # double 7
.text
.globl julia_f_9675
.p2align 4, 0x90
.type julia_f_9675,@function
julia_f_9675: # @julia_f_9675
; Function Signature: f(Float64, Float64)
; ┌ @ In[20]:1 within `f`
# %bb.0: # %top
; │ @ In[20] within `f`
#DEBUG_VALUE: f:a <- $xmm0
#DEBUG_VALUE: f:b <- $xmm1
push rbp
movabs rax, offset .LCPI0_0
mov rbp, rsp
; │ @ In[20]:2 within `f`
; │┌ @ promotion.jl:430 within `*` @ float.jl:493
vmulsd xmm1, xmm1, qword ptr [rax]
movabs rax, offset .LCPI0_1
; │└
; │┌ @ float.jl:491 within `+`
vaddsd xmm0, xmm1, xmm0
; │└
; │┌ @ intfuncs.jl:370 within `literal_pow`
; ││┌ @ float.jl:493 within `*`
vmulsd xmm0, xmm0, xmm0
; │└└
; │ @ In[20]:3 within `f`
; │┌ @ promotion.jl:430 within `*` @ float.jl:493
vmulsd xmm0, xmm0, qword ptr [rax]
; │└
pop rbp
ret
.Lfunc_end0:
.size julia_f_9675, .Lfunc_end0-julia_f_9675
; └
# -- End function
.type ".L+Core.Float64#9677",@object # @"+Core.Float64#9677"
.section .rodata,"a",@progbits
.p2align 3, 0x0
".L+Core.Float64#9677":
.quad ".L+Core.Float64#9677.jit"
.size ".L+Core.Float64#9677", 8
.set ".L+Core.Float64#9677.jit", 140705808973840
.size ".L+Core.Float64#9677.jit", 8
.section ".note.GNU-stack","",@progbits
Subsequent calls using either floats or integers are now routed to the appropriate compiled code.
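You can observe this caching directly with @time: the first call with a given set of argument types pays the compilation cost, while later calls with the same types reuse the stored specialization. A sketch (h is our stand-in for the function above):

```julia
h(a, b) = 7 * (a + 8b)^2

@time h(1, 2)       # first Int call: timing includes JIT compilation
@time h(1, 2)       # same types: compiled code is reused, much faster
@time h(1.0, 2.0)   # new argument types trigger a fresh specialization
```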
13.4.2. Potential Problems#
In some senses, what we saw above was a best case scenario.
Sometimes the JIT compiler produces messy, slow machine code.
This happens when type inference fails or the compiler has insufficient information to optimize effectively.
The next section looks at situations where these problems arise and how to get around them.
13.5. Fast and Slow Julia Code#
To summarize what we’ve learned so far, Julia provides a platform for generating highly efficient machine code with relatively little effort by combining
JIT compilation
Optional type declarations and type inference to pin down the types of variables and hence compile efficient code
Multiple dispatch to facilitate specialization and optimization of compiled code for different data types
But the process is not flawless, and hiccups can occur.
The purpose of this section is to highlight potential issues and show you how to circumvent them.
13.5.1. BenchmarkTools#
The main Julia package for benchmarking is BenchmarkTools.jl.
Below, we’ll use the @btime macro it exports to evaluate the performance of Julia code.
As mentioned in an earlier lecture, we can also save benchmark results to a file and guard against performance regressions in code.
For more, see the package docs.
13.5.2. Global Variables#
Global variables are names assigned to values outside of any function or type definition.
They are convenient, and novice programmers typically use them with abandon.
But global variables are also dangerous, especially in medium to large size programs, since
they can affect what happens in any part of your program
they can be changed by any function
This makes it much harder to be certain about what some small part of a given piece of code actually does.
Here’s a useful discussion on the topic.
When it comes to JIT compilation, global variables create further problems.
The reason is that the compiler can never be sure of the type of the global variable, or even that the type will stay constant while a given function runs.
To illustrate, consider this code, where b is global
b = 1.0
function g(a)
global b
tmp = a
for i in 1:1_000_000
tmp = tmp + a + b
end
return tmp
end
g (generic function with 1 method)
The code executes relatively slowly and uses a huge amount of memory.
using BenchmarkTools
@btime g(1.0)
43.307 ms
(3000001 allocations: 45.78 MiB)
2.000001e6
If you look at the corresponding machine code you will see that it’s a mess.
@code_native g(1.0)
.text
.file "g"
.globl julia_g_11121 # -- Begin function julia_g_11121
.p2align 4, 0x90
.type julia_g_11121,@function
julia_g_11121: # @julia_g_11121
; Function Signature: g(Float64)
; ┌ @ In[23]:2 within `g`
# %bb.0: # %top
; │ @ In[23] within `g`
#DEBUG_VALUE: g:a <- $xmm0
push rbp
mov rbp, rsp
push r15
push r14
push r13
push r12
push rbx
sub rsp, 72
vmovsd qword ptr [rbp - 56], xmm0 # 8-byte Spill
vxorps xmm0, xmm0, xmm0
vmovups ymmword ptr [rbp - 112], ymm0
mov qword ptr [rbp - 80], 0
#APP
mov rax, qword ptr fs:[0]
#NO_APP
lea rdi, [rbp - 112]
movabs rbx, 140705808973840
; │ @ In[23]:2 within `g`
mov esi, 752
mov edx, 16
mov rcx, qword ptr [rax - 8]
mov qword ptr [rbp - 112], 12
mov rax, qword ptr [rcx]
mov qword ptr [rbp - 48], rcx # 8-byte Spill
mov qword ptr [rbp - 104], rax
mov qword ptr [rcx], rdi
movabs rax, offset ijl_gc_pool_alloc_instrumented
mov rdi, qword ptr [rcx + 16]
mov rcx, rbx
vzeroupper
call rax
vmovsd xmm0, qword ptr [rbp - 56] # 8-byte Reload
# xmm0 = mem[0],zero
mov qword ptr [rax - 8], rbx
mov ebx, 1000000
lea r15, [rbp - 72]
mov r12, rax
vmovsd qword ptr [rax], xmm0
.p2align 4, 0x90
.LBB0_1: # %L2
# =>This Inner Loop Header: Depth=1
; │ @ In[23]:6 within `g`
movabs rax, offset ".L*Main.b#11123.jit"
mov r14, qword ptr [rax]
test r14, r14
je .LBB0_4
# %bb.2: # %ok
# in Loop: Header=BB0_1 Depth=1
mov rax, qword ptr [rbp - 48] # 8-byte Reload
mov qword ptr [rbp - 80], r14
mov qword ptr [rbp - 88], r12
movabs r13, 140705808973840
; │┌ @ operators.jl:596 within `+`
mov esi, 752
mov edx, 16
mov rcx, r13
mov rdi, qword ptr [rax + 16]
movabs rax, offset ijl_gc_pool_alloc_instrumented
call rax
vmovsd xmm0, qword ptr [rbp - 56] # 8-byte Reload
# xmm0 = mem[0],zero
mov qword ptr [rax - 8], r13
mov qword ptr [rbp - 72], r12
movabs r12, offset ".Ljl_global#11127.jit"
mov edx, 2
movabs r13, offset ijl_apply_generic
mov qword ptr [rbp - 96], rax
mov qword ptr [rbp - 64], rax
mov rsi, r15
mov rdi, r12
vmovsd qword ptr [rax], xmm0
call r13
mov edx, 2
mov qword ptr [rbp - 72], rax
mov qword ptr [rbp - 96], rax
mov qword ptr [rbp - 64], r14
mov rdi, r12
mov rsi, r15
call r13
; │└
; │ @ In[23]:7 within `g`
; │┌ @ range.jl:908 within `iterate`
; ││┌ @ promotion.jl:639 within `==`
dec rbx
; │└└
; │ @ In[23]:6 within `g`
; │┌ @ operators.jl:596 within `+`
mov r12, rax
; │└
; │ @ In[23]:7 within `g`
jne .LBB0_1
# %bb.3: # %L17
mov rax, qword ptr [rbp - 104]
mov rcx, qword ptr [rbp - 48] # 8-byte Reload
mov qword ptr [rcx], rax
; │ @ In[23]:8 within `g`
mov rax, r12
add rsp, 72
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
.LBB0_4: # %err
; │ @ In[23]:6 within `g`
movabs rdi, offset ".Ljl_sym#b#11124.jit"
movabs rsi, offset ".Ljl_global#11125.jit"
movabs rax, offset ijl_undefined_var_error
call rax
.Lfunc_end0:
.size julia_g_11121, .Lfunc_end0-julia_g_11121
; └
# -- End function
.set ".Ljl_global#11127.jit", 140705759554272
.size ".Ljl_global#11127.jit", 8
.set ".Ljl_global#11125.jit", 140705804171376
.size ".Ljl_global#11125.jit", 8
.set ".L+Core.Float64#11128.jit", 140705808973840
.size ".L+Core.Float64#11128.jit", 8
.set ".L*Main.b#11123.jit", 140705891223344
.size ".L*Main.b#11123.jit", 8
.set ".Ljl_sym#b#11124.jit", 140705956314776
.size ".Ljl_sym#b#11124.jit", 8
.section ".note.GNU-stack","",@progbits
If we eliminate the global variable like so
function g(a, b)
tmp = a
for i in 1:1_000_000
tmp = tmp + a + b
end
return tmp
end
g (generic function with 2 methods)
then execution speed improves dramatically. Furthermore, the number of allocations has dropped to zero.
@btime g(1.0, 1.0)
1.850 ms
(0 allocations: 0 bytes)
2.000001e6
Note that if you called @time instead, the first call would be slower, since it includes the time needed to compile the function. In contrast, @btime discards the first call and runs the function multiple times.
More information is available from the @benchmark macro
@benchmark g(1.0, 1.0)
BenchmarkTools.Trial: 2694 samples with 1 evaluation.
Range (min … max): 1.850 ms … 1.911 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.853 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.855 ms ± 5.031 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅ █ ▃
█▂█▁█▃▃▄▃▃▂▂▂▂▂▂▂▂▁▁▄▅▄█▅▅▅▄▄▃▃▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.85 ms Histogram: frequency by time 1.87 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Also, the machine code is simple and clean
@code_native g(1.0, 1.0)
.text
.file "g"
.globl julia_g_12656 # -- Begin function julia_g_12656
.p2align 4, 0x90
.type julia_g_12656,@function
julia_g_12656: # @julia_g_12656
; Function Signature: g(Float64, Float64)
; ┌ @ In[26]:1 within `g`
# %bb.0: # %top
; │ @ In[26] within `g`
#DEBUG_VALUE: g:a <- $xmm0
#DEBUG_VALUE: g:b <- $xmm1
push rbp
mov eax, 1000000
vmovapd xmm2, xmm0
mov rbp, rsp
.p2align 4, 0x90
.LBB0_1: # %L2
# =>This Inner Loop Header: Depth=1
; │ @ In[26]:4 within `g`
; │┌ @ operators.jl:596 within `+` @ float.jl:491
vaddsd xmm2, xmm2, xmm0
; │└
; │ @ In[26]:5 within `g`
; │┌ @ range.jl:908 within `iterate`
; ││┌ @ promotion.jl:639 within `==`
add rax, -160
; │└└
; │ @ In[26]:4 within `g`
; │┌ @ operators.jl:596 within `+` @ float.jl:491
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
vaddsd xmm2, xmm2, xmm0
vaddsd xmm2, xmm2, xmm1
; │└
; │ @ In[26]:5 within `g`
jne .LBB0_1
# %bb.2: # %L16
; │ @ In[26]:6 within `g`
vmovapd xmm0, xmm2
pop rbp
ret
.Lfunc_end0:
.size julia_g_12656, .Lfunc_end0-julia_g_12656
; └
# -- End function
.type ".L+Core.Float64#12658",@object # @"+Core.Float64#12658"
.section .rodata,"a",@progbits
.p2align 3, 0x0
".L+Core.Float64#12658":
.quad ".L+Core.Float64#12658.jit"
.size ".L+Core.Float64#12658", 8
.set ".L+Core.Float64#12658.jit", 140705808973840
.size ".L+Core.Float64#12658.jit", 8
.section ".note.GNU-stack","",@progbits
Now the compiler is certain of types throughout execution of the function and hence can optimize accordingly.
If global variables are strictly needed (and they almost never are), you can declare them with const,
which tells Julia that the type never changes (the value still can). For example,
const b_const = 1.0
function g_const(a)
global b_const
tmp = a
for i in 1:1_000_000
tmp = tmp + a + b_const
end
return tmp
end
@btime g_const(1)
1.850 ms
(0 allocations: 0 bytes)
2.000001e6
Now the compiler can again generate efficient machine code.
However, using global variables inside a function is almost always a bad idea. Instead, b_const
should be passed as a parameter to the function.
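The recommended pattern can be sketched as follows (g_param is a hypothetical name mirroring g_const above):

```julia
# A sketch of the recommended pattern: pass b as an argument rather than
# reading a global. The function is then self-contained and type-stable.
function g_param(a, b)
    tmp = a
    for i in 1:1_000_000
        tmp = tmp + a + b
    end
    return tmp
end

g_param(1, 1.0)  # 2.000001e6, the same result as g_const(1)
```

Because both argument types are known at the call site, the compiler can specialize this function just as effectively as the const-global version, without any global state.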
13.5.3. Composite Types with Abstract Field Types#
Another scenario that trips up the JIT compiler is when composite types have fields with abstract types.
We met this issue earlier, when we discussed AR(1) models.
Let’s experiment, using, respectively,
an untyped field
a field with abstract type, and
parametric typing
As we’ll see, the last of these options gives us the best performance, while still maintaining significant flexibility.
Here’s the untyped case
struct Foo_any
a
end
Here’s the case of an abstract type on the field a
struct Foo_abstract
a::Real
end
Finally, here's the parametrically typed case (where the {T <: Real}
is not necessary for performance, and could simply be {T})
struct Foo_concrete{T <: Real}
a::T
end
Now we generate instances
fg = Foo_any(1.0)
fa = Foo_abstract(1.0)
fc = Foo_concrete(1.0)
Foo_concrete{Float64}(1.0)
In the last case, concrete type information for the fields is embedded in the object
typeof(fc)
Foo_concrete{Float64}
This is significant because such information is detected by the compiler.
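One way to see what the compiler can detect is to query the field types directly. A minimal sketch, using hypothetical struct names so as not to clash with the definitions above:

```julia
# fieldtype reveals what the compiler knows about the field `a` in each case.
struct AnyField; a; end                  # untyped: field type is Any
struct AbstractField; a::Real; end       # abstract annotation
struct ParamField{T <: Real}; a::T; end  # parametric: concrete once instantiated

fieldtype(AnyField, :a)                 # Any
fieldtype(AbstractField, :a)            # Real
fieldtype(ParamField{Float64}, :a)      # Float64
isconcretetype(fieldtype(ParamField{Float64}, :a))  # true
```

Only in the parametric case is the field type concrete, which is what lets the compiler store the field unboxed and generate specialized code.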
13.5.3.1. Timing#
Here’s a function that uses the field a
of our objects
function f(foo)
tmp = foo.a
for i in 1:1_000_000
tmp = i + foo.a
end
return tmp
end
f (generic function with 2 methods)
Let’s try timing our code, starting with the case without any constraints:
@btime f($fg)
25.457 ms
(1999489 allocations: 30.51 MiB)
1.000001e6
The timing is not very impressive.
Here’s the nasty looking machine code
@code_native f(fg)
.text
.file "f"
.globl julia_f_13184 # -- Begin function julia_f_13184
.p2align 4, 0x90
.type julia_f_13184,@function
julia_f_13184: # @julia_f_13184
; Function Signature: f(Main.Foo_any)
; ┌ @ In[36]:1 within `f`
# %bb.0: # %top
#DEBUG_VALUE: f:foo <- [DW_OP_deref] [$rdi+0]
push rbp
mov rbp, rsp
push r15
push r14
push r13
push r12
push rbx
sub rsp, 56
vxorps xmm0, xmm0, xmm0
vmovaps xmmword ptr [rbp - 64], xmm0
mov qword ptr [rbp - 48], 0
#APP
mov rax, qword ptr fs:[0]
#NO_APP
lea rdx, [rbp - 64]
mov ebx, 1
movabs r14, offset ".Ljl_global#13188.jit"
movabs r12, offset ijl_apply_generic
lea r15, [rbp - 88]
mov rcx, qword ptr [rax - 8]
mov qword ptr [rbp - 64], 4
mov rax, qword ptr [rcx]
mov qword ptr [rbp - 72], rcx # 8-byte Spill
mov qword ptr [rbp - 56], rax
mov qword ptr [rcx], rdx
mov r13, qword ptr [rdi]
.p2align 4, 0x90
.LBB0_1: # %L2
# =>This Inner Loop Header: Depth=1
; │ @ In[36]:4 within `f`
movabs rax, offset ijl_box_int64
mov rdi, rbx
call rax
mov edx, 2
mov qword ptr [rbp - 88], rax
mov qword ptr [rbp - 48], rax
mov qword ptr [rbp - 80], r13
mov rdi, r14
mov rsi, r15
call r12
; │ @ In[36]:5 within `f`
; │┌ @ range.jl:908 within `iterate`
inc rbx
; ││┌ @ promotion.jl:639 within `==`
cmp rbx, 1000001
; │└└
jne .LBB0_1
# %bb.2: # %L17
mov rcx, qword ptr [rbp - 56]
mov rdx, qword ptr [rbp - 72] # 8-byte Reload
mov qword ptr [rdx], rcx
; │ @ In[36]:6 within `f`
add rsp, 56
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
.Lfunc_end0:
.size julia_f_13184, .Lfunc_end0-julia_f_13184
; └
# -- End function
.set ".Ljl_global#13188.jit", 140705759554272
.size ".Ljl_global#13188.jit", 8
.section ".note.GNU-stack","",@progbits
The abstract case is almost identical:
@btime f($fa)
25.565 ms
(1999489 allocations: 30.51 MiB)
1.000001e6
Note the large memory footprint.
The machine code is also long and complex, although we omit details.
Finally, let’s look at the parametrically typed version
@btime f($fc)
3.085 ns
(0 allocations: 0 bytes)
1.000001e6
This is improbably small: a runtime of a few nanoseconds with zero allocations suggests that no computation actually took place at runtime.
A hint is in the simplicity of the corresponding machine code
@code_native f(fc)
.text
.file "f"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3, 0x0 # -- Begin function julia_f_13278
.LCPI0_0:
.quad 0x412e848000000000 # double 1.0E+6
.text
.globl julia_f_13278
.p2align 4, 0x90
.type julia_f_13278,@function
julia_f_13278: # @julia_f_13278
; Function Signature: f(Main.Foo_concrete{Float64})
; ┌ @ In[36]:1 within `f`
# %bb.0: # %top
#DEBUG_VALUE: f:foo <- [DW_OP_deref] [$rdi+0]
push rbp
; │ @ In[36]:4 within `f`
; │┌ @ promotion.jl:429 within `+` @ float.jl:491
vmovsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
movabs rax, offset .LCPI0_0
mov rbp, rsp
vaddsd xmm0, xmm0, qword ptr [rax]
; │└
; │ @ In[36]:6 within `f`
pop rbp
ret
.Lfunc_end0:
.size julia_f_13278, .Lfunc_end0-julia_f_13278
; └
# -- End function
.type ".L+Core.Float64#13280",@object # @"+Core.Float64#13280"
.section .rodata,"a",@progbits
.p2align 3, 0x0
".L+Core.Float64#13280":
.quad ".L+Core.Float64#13280.jit"
.size ".L+Core.Float64#13280", 8
.set ".L+Core.Float64#13280.jit", 140705808973840
.size ".L+Core.Float64#13280.jit", 8
.section ".note.GNU-stack","",@progbits
This machine code has none of the hallmark instructions associated with a loop; in particular, loops (e.g. for
in Julia) end up as conditional jumps in the machine code (e.g. jne
).
Here, the compiler was smart enough to realize that only the final step of the loop matters, i.e. it could generate the equivalent of f(foo) = 1_000_000 + foo.a,
return that directly, and skip the loop entirely. These sorts of optimizations are only possible when the compiler has precise type information.
Finally, note that if we compile a slightly different version of the function, which doesn’t actually return the value
function f_no_return(foo)
for i in 1:1_000_000
tmp = i + foo.a
end
end
@code_native f_no_return(fc)
.text
.file "f_no_return"
.globl julia_f_no_return_13287 # -- Begin function julia_f_no_return_13287
.p2align 4, 0x90
.type julia_f_no_return_13287,@function
julia_f_no_return_13287: # @julia_f_no_return_13287
; Function Signature: f_no_return(Main.Foo_concrete{Float64})
; ┌ @ In[42]:1 within `f_no_return`
# %bb.0: # %top
#DEBUG_VALUE: f_no_return:foo <- [DW_OP_deref] [$rdi+0]
push rbp
mov rbp, rsp
; │ @ In[42]:4 within `f_no_return`
pop rbp
ret
.Lfunc_end0:
.size julia_f_no_return_13287, .Lfunc_end0-julia_f_no_return_13287
; └
# -- End function
.section ".note.GNU-stack","",@progbits
We see that the code is even simpler. In effect, the compiler figured out that because tmp
is not returned, and reading foo.a
has no side effects (since it knows the type of a
), it does not need to execute any of the code in the function body.
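This kind of dead-code elimination is also worth keeping in mind when benchmarking: if a loop's result never escapes the function, the compiler is free to delete the loop entirely, producing misleadingly fast timings. A minimal sketch (the function names are hypothetical):

```julia
# `looks_busy` does work whose result never escapes, so the compiler may
# remove it entirely; `actually_busy` returns the accumulator, keeping
# the work observable.
function looks_busy(n)
    for i in 1:n
        tmp = i * 2  # dead: tmp is never used outside the loop body
    end
    return nothing
end

function actually_busy(n)
    s = 0
    for i in 1:n
        s += i * 2
    end
    return s
end

actually_busy(10)  # 110
```

When timing code, make sure the computed value is returned or otherwise used, so you measure real work rather than an optimized-away loop.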
13.5.4. Type Inference#
Consider the following function, which essentially does the same job as Julia's sum()
function but restricts its argument to a vector of numbers
function sum_float_array(x::AbstractVector{<:Number})
sum = 0.0
for i in eachindex(x)
sum += x[i]
end
return sum
end
sum_float_array (generic function with 1 method)
Calls to this function run very quickly
x_range = range(0, 1, length = 100_000)
x = collect(x_range)
typeof(x)
Vector{Float64} (alias for Array{Float64, 1})
@btime sum_float_array($x)
92.552 μs
(0 allocations: 0 bytes)
50000.0
When Julia compiles this function, it knows that the data passed in as x
will be an array of 64 bit floats. Hence it’s known to the compiler that the relevant method for +
is always addition of floating point numbers.
But consider a version without that type annotation
function sum_array(x)
sum = 0.0
for i in eachindex(x)
sum += x[i]
end
return sum
end
@btime sum_array($x)
92.542 μs
(0 allocations: 0 bytes)
50000.0
Note that this has the same running time as the version with the explicit type annotation. In Julia, there is (almost) never a performance gain from declaring argument types, and if anything they can make things worse by limiting the potential for specialized algorithms. See generic programming for more.
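One payoff from leaving the signature untyped is that the same code works, at full speed, for any indexable container of numbers, because Julia compiles a specialized method per concrete input type. A sketch (sum_array_generic is a hypothetical name mirroring sum_array):

```julia
# The untyped version handles Int vectors, Float32 vectors, and ranges
# alike; Julia compiles a specialized method for each concrete type.
function sum_array_generic(x)
    s = 0.0
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

sum_array_generic([1, 2, 3])         # 6.0
sum_array_generic(Float32[1, 2, 3])  # 6.0
sum_array_generic(1:3)               # 6.0
```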
As an example within Julia itself, look at the built-in sum applied to an array
@btime sum($x)
12.453 μs
(0 allocations: 0 bytes)
50000.00000000001
Versus the underlying range
@btime sum($x_range)
14.656 ns
(0 allocations: 0 bytes)
50000.0
Note that the difference in speed is enormous, suggesting it is better to keep data in its more structured form (here, a range rather than a collected vector) for as long as possible. You can check the underlying source used for this with @which sum(x_range)
to see the specialized algorithm used.
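To see why summing a range can take nanoseconds, note that an evenly spaced range has a closed-form sum. The following sketch mirrors the idea (it is not Base's actual implementation):

```julia
# Closed-form sum of an evenly spaced range: n * (first + last) / 2.
# No loop is needed, so the cost is independent of the range's length.
range_sum(r) = length(r) * (first(r) + last(r)) / 2

r = range(0, 1, length = 100_000)
range_sum(r)  # 50000.0, matching sum(r)
```

This is only possible because the range type guarantees even spacing; a plain Vector{Float64} carries no such structure, so it must be summed element by element.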
The reason the untyped sum_array()
is just as fast as the annotated version is that, when it is first called on a vector of a given
data type, a newly compiled version of the function is produced to handle that
type.
In this case, since we’re calling the function on a vector of floats, we get a compiled version of the function with essentially the same internal representation as sum_float_array()
.
13.5.4.2. An Abstract Container#
Things get tougher for the compiler when the data type within the array is imprecise.
For example, the following snippet creates an array where the element type is Any
x = Any[1 / i for i in 1:1e6];
eltype(x)
Any
Now summation is much slower and memory management is less efficient.
@btime sum_array($x)
19.265 ms
(1000000 allocations: 15.26 MiB)
14.392726722864989
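One remedy, when an abstractly typed container is unavoidable at construction time, is to convert it to a concrete element type before the hot loop. A sketch:

```julia
# Converting Any-typed data to a concrete Vector{Float64} lets subsequent
# loops run without per-element boxing and dispatch.
x_any = Any[1 / i for i in 1:1000]
x_f64 = Float64.(x_any)       # broadcast conversion to Vector{Float64}
eltype(x_any), eltype(x_f64)  # (Any, Float64)
```

The one-time conversion cost is usually repaid many times over once the data is used in tight loops.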
13.6. Further Comments#
Here are some final comments on performance.
13.6.1. Summary and Tips#
Use functions to segregate operations into logically distinct blocks.
Data types will be determined at function boundaries.
If types are not supplied then they will be inferred.
If types are stable and can be inferred effectively, your functions will run fast.
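A compact illustration of type stability (unstable and stable are hypothetical names): a function is type-stable when the return type depends only on the argument types, not their values.

```julia
# `unstable` returns an Int or a Float64 depending on the *value* of x,
# forcing the compiler to handle both possibilities downstream;
# `stable` always returns a Float64.
unstable(x) = x > 0 ? 1 : 1.0
stable(x) = x > 0 ? 1.0 : 0.0

typeof(unstable(1)), typeof(unstable(-1))  # (Int, Float64): unstable
typeof(stable(1)), typeof(stable(-1))      # (Float64, Float64): stable
```

Tools like @code_warntype can flag such instabilities, which are a common cause of the slowdowns discussed above.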
13.6.2. Further Reading#
A good next stop for further reading is the relevant part of the Julia documentation.