Chapter 20 Optimisation with Flambda

20.1 Overview

Flambda is the term used to describe a series of optimisation passesprovided by the native code compilers as of OCaml 4.03.

Flambda aims to make it easier to write idiomatic OCaml code withoutincurring performance penalties.

To use the Flambda optimisers it is necessary to pass the -flambdaoption to the OCaml configure script. (There is no support for asingle compiler that can operate in both Flambda and non-Flambda modes.)Code compiled with Flambdacannot be linked into the same program as code compiled without Flambda.Attempting to do this will result in a compiler error.

Whether or not a particular ocamlopt uses Flambda may bedetermined by invoking it with the -config option and lookingfor any line starting with “flambda:”. If such a line is presentand says “true”, then Flambda is supported, otherwise it is not.

Flambda provides full optimisation across different compilation units,so long as the .cmx files for the dependencies of the unit currentlybeing compiled are available. (A compilation unit corresponds to asingle .ml source file.) However it does not yet act entirely asa whole-program compiler: for example, elimination of dead code acrossa complete set of compilation units is not supported.

Optimisation with Flambda is not currently supported when generatingbytecode.

Flambda should not in general affect the semantics of existing programs.Two exceptions to this rule are: possible elimination of pure codethat is being benchmarked (see section 20.14) and changes inbehaviour of code using unsafe operations (see section 20.15).

Flambda does not yet optimise array or string bounds checks. Neitherdoes it take hints for optimisation from any assertions written by theuser in the code.

Consult the Glossary at the end of this chapter for definitions oftechnical terms used below.

20.2 Command-line flags

The Flambda optimisers provide a variety of command-line flags that maybe used to control their behaviour. Detailed descriptions of each flagare given in the referenced sections. Those sections also describe anyarguments which the particular flags take.

Commonly-used options:

  • -O2
  • Perform more optimisation than usual. Compilationtimes may be lengthened. (This flag is an abbreviation for a certainset of parameters described in section 20.5.)
  • -O3
  • Perform even more optimisation than usual, possiblyincluding unrolling of recursive functions. Compilation times may besignificantly lengthened.
  • -Oclassic
  • Make inlining decisions at the point ofdefinition of a function rather than at the call site(s). This mirrorsthe behaviour of OCaml compilers not using Flambda. Compared to compilationusing the new Flambda inlining heuristics (for example at -O2) itproducessmaller .cmx files, shorter compilation times and code that probablyruns rather slower. When using -Oclassic, only the following optionsdescribed in this section are relevant: -inlining-report and-inline. If any other of the options described in this section areused, the behaviour is undefined and may cause an error in future versionsof the compiler.
  • -inlining-report
  • Emit .inlining files (one perround of optimisation) showing all of the inliner’s decisions.Less commonly-used options:

  • -remove-unused-arguments

  • Remove unused function argumentseven when the argument is not specialised. This may have a smallperformance penalty.See section 20.10.3.
  • -unbox-closures
  • Pass free variables via specialised argumentsrather than closures (an optimisation for reducing allocation). Seesection 20.9.3. This may have a small performance penalty.Advanced options, only needed for detailed tuning:

  • -inline

  • The behaviour depends on whether -Oclassicis used.
    • When not in -Oclassic mode, -inline limits the totalsize of functions considered for inlining during any speculative inliningsearch. (See section 20.3.6.) Note that this parameter doesnot control the assessment as to whether any particular function maybe inlined. Raising it to excessive amounts will not necessarily causemore functions to be inlined.
    • When in -Oclassic mode, -inline behaves as inprevious versions of the compiler: it is the maximum size of function tobe considered for inlining. See section 20.3.1.
  • -inline-toplevel
  • The equivalent of -inline but usedwhen speculative inlining starts at toplevel. Seesection 20.3.6.Not used in -Oclassic mode.
  • -inline-branch-factor
  • Controls how the inliner assesseswhether a code path is likely to be hot or cold. Seesection 20.3.5.
  • -inline-alloc-cost,-inline-branch-cost,-inline-call-cost
  • Controls how the inliner assesses the runtimeperformance penalties associated with various operations. Seesection 20.3.5.
  • -inline-indirect-cost,-inline-prim-cost
  • Likewise.
  • -inline-lifting-benefit
  • Controls inlining of functorsat toplevel. See section 20.3.5.
  • -inline-max-depth
  • The maximum depth of anyspeculative inlining search. See section 20.3.6.
  • -inline-max-unroll
  • The maximum depth of any unrolling ofrecursive functions during any speculative inlining search.See section 20.3.6.
  • -no-unbox-free-vars-of-closures
  • Do not unbox closure variables. See section 20.9.1.
  • -no-unbox-specialised-args
  • Do not unbox arguments to which functions have been specialised. Seesection 20.9.2.
  • -rounds
  • How many rounds of optimisation to perform.See section 20.2.1.
  • -unbox-closures-factor
  • Scaling factor for benefitcalculation when using -unbox-closures. Seesection 20.9.3.
Notes
  • The set of command line flags relating to optimisation should typicallybe specified to be the same across an entire project. Flambda does notcurrently record the requested flags in the .cmx files. As such,inlining of functions from previously-compiled units will subject their codeto the optimisation parameters of the unit currently being compiled, ratherthan those specified when they were previously compiled. It is hoped torectify this deficiency in the future.
  • Flambda-specific flags do not affect linking with the exception ofaffecting the optimisation of code in the startup file (containinggenerated functions such as currying helpers). Typically such optimisationwill not be significant, so eliding such flags at link time might bereasonable.
  • Flambda-specific flags are silently accepted even when the-flambda option was not provided to the configure script.(There is no means provided to change this behaviour.)This is intended to make it morestraightforward to run benchmarks with and without the Flambda optimisersin effect.
  • Some of the Flambda flags may be subject to change in futurereleases.

20.2.1 Specification of optimisation parameters by round

Flambda operates in rounds: one round consists of a certain sequenceof transformations that may then be repeated in order to achieve moresatisfactory results. The number of rounds can be set manually using the-rounds parameter (although this is not necessary when usingpredefined optimisation levels such as with -O2 and -O3).For high optimisation the number of rounds might be set at 3 or 4.

Command-line flags that may apply per round, for example those with-cost in the name, accept arguments of the form:

n | round=n[,…]

  • If the first form is used, with a single integer specified,the value will apply to all rounds.
  • If the second form is used, zero-based round integers specifyvalues which are to be used only for those rounds.The flags -Oclassic, -O2 and -O3 are applied before allother flags, meaning that certain parameters may be overridden withouthaving to specify every parameter usually invoked by the given optimisationlevel.

20.3 Inlining

Inlining refers to the copying of the code of a function to aplace where the function is called.The code of the function will be surrounded by bindings of its parametersto the corresponding arguments.

The aims of inlining are:

  • to reduce the runtime overhead caused by function calls (includingsetting up for such calls and returning afterwards);
  • to reduce instruction cache misses by expressing frequently-takenpaths through the program using fewer machine instructions; and
  • to reduce the amount of allocation (especially of closures).These goals are often reached not just by inlining itself but also byother optimisations that the compiler is able to perform as a result ofinlining.

When a recursive call to a function (within the definition of that functionor another in the same mutually-recursive group) is inlined, the procedure isalso known as unrolling. This is somewhat akin to loop peeling.For example, given the following code:

  1. let rec fact x =
  2. if x = 0 then
  3. 1
  4. else
  5. x * fact (x - 1)
  6.  
  7. let n = fact 4

unrolling once at the call site fact 4 produces (with the body offact unchanged):

  1. let n =
  2. if 4 = 0 then
  3. 1
  4. else
  5. 4 * fact (4 - 1)

This simplifies to:

  1. let n = 4 * fact 3

Flambda provides significantly enhanced inlining capabilities relative toprevious versions of the compiler.

Aside: when inlining is performed

Inlining is performed together with all of the other Flambda optimisationpasses, that is to say, after closure conversion. This has three particularadvantages over a potentially more straightforward implementation prior toclosure conversion:

  • It permits higher-order inlining, for example when a non-inlinablefunction always returns the same function yet with different environmentsof definition. Not all such cases are supported yet, but it is intendedthat such support will be improved in future.
  • It is easier to integrate with cross-module optimisation, sinceimported information about other modules is already in the correctintermediate language.
  • It becomes more straightforward to optimise closure allocations sincethe layout of closures does not have to be estimated in any way: it isknown. Similarly,it becomes more straightforward to control which variables end upin which closures, helping to avoid closure bloat.

20.3.1 Classic inlining heuristic

In -Oclassic mode the behaviour of the Flambda inlinermimics previous versionsof the compiler. (Code may still be subject to further optimisations notperformed by previous versions of the compiler: functors may be inlined,constants are lifted and unused code is eliminated all as described elsewherein this chapter. See sections 20.3.3, 20.8.1 and 20.10.At the definition site of a function, the body of thefunction is measured. It will then be marked as eligible for inlining(and hence inlined at every direct call site) if:

  • the measured size (in unspecified units) is smaller than that of afunction call plus the argument of the -inline command-line flag; and
  • the function is not recursive.Non-Flambda versions of the compiler cannot inline functions thatcontain a definition of another function. However -Oclassic doespermit this. Further, non-Flambda versions also cannot inline functionsthat are only themselves exposed as a result of a previous pass of inlining,but again this is permitted by -Oclassic.For example:
  1. module M : sig
  2. val i : int
  3. end = struct
  4. let f x =
  5. let g y = x + y in
  6. g
  7. let h = f 3
  8. let i = h 4 (* h is correctly discovered to be g and inlined *)
  9. end

All of this contrasts with the normal Flambda mode, that is to saywithout -Oclassic, where:

  • the inlining decision is made at the call site; and
  • recursive functions can be handled, by specialisation (seebelow).The Flambda mode is described in the next section.

20.3.2 Overview of “Flambda” inlining heuristics

The Flambda inlining heuristics, used whenever the compiler is configuredfor Flambda and -Oclassic was not specified, make inlining decisionsat call sites. This helps in situations where the context is important.For example:

  1. let f b x =
  2. if b then
  3. x
  4. else
  5. ... big expression ...
  6.  
  7. let g x = f true x

In this case, we would like to inline f into g, because aconditional jump can be eliminated and the code size should reduce. If theinlining decision has been made after the declaration of f withoutseeing the use, its size would have probably made it ineligible forinlining; but at the call site, its final size can be known. Further,this function should probably not be inlined systematically: if bis unknown, or indeed false, there is little benefit to trade offagainst a large increase in code size. In the existing non-Flambda inlinerthis isn’t a great problem because chains of inlining were cut off fairlyquickly. However it has led to excessive use of overly-large inliningparameters such as -inline 10000.

In more detail, at each call site the following procedure is followed:

  • Determine whether it is clear that inlining would be beneficialwithout, for the moment, doing any inlining within the function itself.(The exact assessment of benefit is described below.) If so, thefunction is inlined.
  • If inlining the function is not clearly beneficial, then inliningwill be performed speculatively inside the function itself. Thesearch for speculative inlining possibilities is controlled by twoparameters: the inlining threshold and the inlining depth.(These are described in more detail below.)
    • If such speculation shows that performing some inlining inside thefunction would be beneficial, then such inlining is performed and theresulting function inlined at the original call site.
    • Otherwise, nothing happens.Inlining within recursive functions of calls to otherfunctions in the same mutually-recursive group is kept in check byan unrolling depth, described below. This ensures that functions arenot unrolled to excess. (Unrolling is only enabledif -O3 optimisation level is selected and/or the-inline-max-unrollflag is passed with an argument greater than zero.)

20.3.3 Handling of specific language constructs

Functors

There is nothing particular about functors that inhibits inlining comparedto normal functions. To the inliner, these both look the same, exceptthat functors are marked as such.

Applications of functors at toplevel are biased in favour of inlining.(This bias may be adjusted:see the documentation for -inline-lifting-benefit below.)

Applications of functors not at toplevel, for example in a local moduleinside some other expression, are treated by the inliner identically tonormal function calls.

First-class modules

The inliner will be able to consider inlining a call to a function in a firstclass module if it knows which particular function is going to be called.The presence of the first-class module record that wraps the set of functionsin the module does not per se inhibit inlining.

Objects

Method calls to objects are not at present inlined by Flambda.

20.3.4 Inlining reports

If the -inlining-report option is provided to the compiler then a filewill be emitted corresponding to each round of optimisation. For theOCaml source file basename.ml the filesare named basename.round.inlining.org,with round azero-based integer. Inside the files, which are formatted as “org mode”,will be found English prose describing the decisions that the inliner took.

20.3.5 Assessment of inlining benefit

Inlining typicallyresults in an increase in code size, which if left unchecked, may not onlylead to grossly large executables and excessive compilation times but alsoa decrease in performance due to worse locality. As such, theFlambda inliner trades off the change in code size againstthe expected runtime performance benefit, with the benefit being computedbased on the number of operations that the compiler observes may be removedas a result of inlining.

For example given the following code:

  1. let f b x =
  2. if b then
  3. x
  4. else
  5. ... big expression ...
  6.  
  7. let g x = f true x

it would be observed that inlining of f would remove:

  • one direct call;
  • one conditional branch.Formally, an estimate of runtime performance benefit is computed byfirst summingthe cost of the operations that are known to be removed as a result of theinlining and subsequent simplification of the inlined body.The individual costs for the various kinds of operations may be adjustedusing the various -inline-…-cost flags as follows. Costs arespecified as integers. All of these flags accept a single argumentdescribing such integers using the conventionsdetailed in section 20.2.1.

  • -inline-alloc-cost

  • The cost of an allocation.
  • -inline-branch-cost
  • The cost of a branch.
  • -inline-call-cost
  • The cost of a direct function call.
  • -inline-indirect-cost
  • The cost of an indirect function call.
  • -inline-prim-cost
  • The cost of a primitive. Primitivesencompass operations including arithmetic and memory access.(Default values are described in section 20.5 below.)

The initial benefit value is then scaled by a factor that attempts tocompensate for the fact that the current point in the code, if under somenumber of conditional branches, may be cold. (Flambda does not currentlycompute hot and cold paths.) The factor—the estimated probability thatthe inliner really is on a hot path—is calculated as1/(1 + f)d, where f is set by-inline-branch-factor and d is the nesting depth of branchesat the current point. As the inliner descends into more deeply-nestedbranches, the benefit of inlining thus lessens.

The resulting benefit value is known as the estimated benefit.

The change in code size is also estimated: morally speaking it should be thechange in machine code size, but since that is not available to the inliner,an approximation is used.

If the estimated benefit exceeds the increase in code size then the inlinedversion of the function will be kept. Otherwise the function will not beinlined.

Applications of functors at toplevel will be givenan additional benefit (which may be controlled by the-inline-lifting-benefit flag) to bias inlining in such situationstowards keeping the inlined version.

20.3.6 Control of speculation

As described above, there are three parameters that restrict the searchfor inlining opportunities during speculation:

  • the inlining threshold;
  • the inlining depth;
  • the unrolling depth.These parameters are ultimately bounded by the arguments provided tothe corresponding command-line flags (or their default values):

  • -inline (or, if the call site that triggered speculation isat toplevel, -inline-toplevel);

  • -inline-max-depth;
  • -inline-max-unroll.Note in particular that -inline does not have the meaning thatit has in the previous compiler or in -Oclassic mode. In both of thosesituations -inline was effectively some kind of basic assessment ofinlining benefit. However in Flambda inlining mode it corresponds to aconstraint on the search; the assessment of benefit is independent, asdescribed above.

When speculation starts the inlining threshold starts at the value setby -inline (or -inline-toplevel if appropriate, see above).Upon making a speculative inlining decision thethreshold is reduced by the code size of the function being inlined.If the threshold becomes exhausted, at or below zero, no further speculationwill be performed.

The inlining depth starts at zeroand is increased by one every time the inlinerdescends into another function. It is then decreased by one every time theinliner leaves such function. If the depth exceeds the value set by-inline-max-depth then speculation stops. This parameter is intendedas a general backstop for situations where the inliningthreshold does not control the search sufficiently.

The unrolling depth applies to calls within the same mutually-recursivegroup of functions. Each time an inlining of such a call is performedthe depth is incremented by one when examining the resulting body. If thedepth reaches the limit set by -inline-max-unroll then speculationstops.

20.4 Specialisation

The inliner may discover a call site to a recursive function wheresomething is known about the arguments: for example, they may be equal tosome other variables currently in scope. In this situation it may bebeneficial to specialise the function to those arguments. This isdone by copying the declaration of the function (and any others involvedin any same mutually-recursive declaration) and noting the extra informationabout the arguments. The arguments augmented by this information are knownas specialised arguments. In order to try to ensure that specialisationis not performed uselessly, arguments are only specialised if it can be shownthat they are invariant: in other words, during the execution of therecursive function(s) themselves, the arguments never change.

Unless overridden by an attribute (see below), specialisation of a functionwill not be attempted if:

  • the compiler is in -Oclassic mode;
  • the function is not obviously recursive;
  • the function is not closed.The compiler can prove invariance of function arguments across multiplefunctions within a recursive group (although this has some limitations,as shown by the example below).

It should be noted that the unboxing of closures pass (see below)can introduce specialised arguments on non-recursive functions. (No otherplace in the compiler currently does this.)

Example: the well-known List.iter function

This function might be written like so:

  1. let rec iter f l =
  2. match l with
  3. | [] -> ()
  4. | h :: t ->
  5. f h;
  6. iter f t

and used like this:

  1. let print_int x =
  2. print_endline (Int.to_string x)
  3.  
  4. let run xs =
  5. iter print_int (List.rev xs)

The argument f to iter is invariant so the function may bespecialised:

  1. let run xs =
  2. let rec iter' f l =
  3. (* The compiler knows: f holds the same value as foo throughout iter'. *)
  4. match l with
  5. | [] -> ()
  6. | h :: t ->
  7. f h;
  8. iter' f t
  9. in
  10. iter' print_int (List.rev xs)

The compiler notes down that for the function iter’, the argumentf is specialised to the constant closure print_int. Thismeans that the body of iter’ may be simplified:

  1. let run xs =
  2. let rec iter' f l =
  3. (* The compiler knows: f holds the same value as foo throughout iter'. *)
  4. match l with
  5. | [] -> ()
  6. | h :: t ->
  7. print_int h; (* this is now a direct call *)
  8. iter' f t
  9. in
  10. iter' print_int (List.rev xs)

The call to print_int can indeed be inlined:

  1. let run xs =
  2. let rec iter' f l =
  3. (* The compiler knows: f holds the same value as foo throughout iter'. *)
  4. match l with
  5. | [] -> ()
  6. | h :: t ->
  7. print_endline (Int.to_string h);
  8. iter' f t
  9. in
  10. iter' print_int (List.rev xs)

The unused specialised argument f may now be removed, leaving:

  1. let run xs =
  2. let rec iter' l =
  3. match l with
  4. | [] -> ()
  5. | h :: t ->
  6. print_endline (Int.to_string h);
  7. iter' t
  8. in
  9. iter' (List.rev xs)
Aside on invariant parameters.

The compiler cannot currentlydetect invariance in cases such as the following.

  1. let rec iter_swap f g l =
  2. match l with
  3. | [] -> ()
  4. | 0 :: t ->
  5. iter_swap g f l
  6. | h :: t ->
  7. f h;
  8. iter_swap f g t

20.4.1 Assessment of specialisation benefit

The benefit of specialisation is assessed in a similar way as for inlining.Specialised argument information may mean that the body of the functionbeing specialised can be simplified: the removed operations are accumulatedinto a benefit. This, together with the size of the duplicated (specialised)function declaration, is then assessed against the size of the call to theoriginal function.

20.5 Default settings of parameters

The default settings (when not using -Oclassic) are for oneround of optimisation using the following parameters.

ParameterSetting
-inline10
-inline-branch-factor0.1
-inline-alloc-cost7
-inline-branch-cost5
-inline-call-cost5
-inline-indirect-cost4
-inline-prim-cost3
-inline-lifting-benefit1300
-inline-toplevel160
-inline-max-depth1
-inline-max-unroll0
-unbox-closures-factor10

20.5.1 Settings at -O2 optimisation level

When -O2 is specified two rounds of optimisation are performed.The first round uses the default parameters (see above). The second usesthe following parameters.

ParameterSetting
-inline25
-inline-branch-factorSame as default
-inline-alloc-costDouble the default
-inline-branch-costDouble the default
-inline-call-costDouble the default
-inline-indirect-costDouble the default
-inline-prim-costDouble the default
-inline-lifting-benefitSame as default
-inline-toplevel400
-inline-max-depth2
-inline-max-unrollSame as default
-unbox-closures-factorSame as default

20.5.2 Settings at -O3 optimisation level

When -O3 is specified three rounds of optimisation are performed.The first two rounds are as for -O2. The third round usesthe following parameters.

ParameterSetting
-inline50
-inline-branch-factorSame as default
-inline-alloc-costTriple the default
-inline-branch-costTriple the default
-inline-call-costTriple the default
-inline-indirect-costTriple the default
-inline-prim-costTriple the default
-inline-lifting-benefitSame as default
-inline-toplevel800
-inline-max-depth3
-inline-max-unroll1
-unbox-closures-factorSame as default

20.6 Manual control of inlining and specialisation

Should the inliner prove recalcitrant and refuse to inline a particularfunction, or if the observed inlining decisions are not to the programmer’ssatisfaction for some other reason, inlining behaviour can be dictated by theprogrammer directly in the source code.One example where this might be appropriate is when the programmer,but not the compiler, knows that a particular function call is on a coldcode path. It might be desirable to prevent inlining of the function sothat the code size along the hot path is kept smaller, so as to increaselocality.

The inliner is directed using attributes.For non-recursive functions (and one-step unrolling of recursive functions,although @unroll is more clear for this purpose)the following are supported:

  • @@inline always or @@inline never
  • Attachedto a declaration of a function or functor, these direct the inliner toeitheralways or never inline, irrespective of the size/benefit calculation. (Ifthe function is recursive then the body is substituted and no specialaction is taken for the recursive call site(s).)@@inline with no argument is equivalent to@@inline always.
  • @inlined always or @inlined never
  • Attachedto a function application, these direct the inliner likewise. Theseattributes at call sites override any other attribute that may be presenton the corresponding declaration.@inlined with no argument is equivalent to@inlined always.For recursive functions the relevant attributes are:

  • @@specialise always or @@specialise never

  • Attached to a declaration of a functionor functor, this directs the inliner to either always or neverspecialise the function solong as it has appropriate contextual knowledge, irrespective of thesize/benefit calculation.@@specialise with no argument is equivalent to@@specialise always.
  • @specialised always or @specialised never
  • Attached to a function application, thisdirects the inliner likewise. This attribute at a call site overrides anyother attribute that may be present on the corresponding declaration.(Note that the function will still only be specialised if there existone or more invariant parameters whose values are known.)@specialised with no argument is equivalent to@specialised always.
  • @unrolled n
  • This attribute is attached to a functionapplication and always takes an integer argument. Each time the inliner seesthe attribute it behaves as follows:
    • If n is zero or less, nothing happens.
    • Otherwise the function being called is substituted at the call sitewith its body having been rewritten such that any recursive calls to that function orany others in the same mutually-recursive group are annotated with theattribute unrolled(n − 1). Inlining may continue on that body.As such, n behaves as the “maximum depth of unrolling”.A compiler warning will be emitted if it was found impossible to obey anannotation from an @inlined or @specialised attribute.
Example showing correct placement of attributes
  1. module F (M : sig type t end) = struct
  2. let[@inline never] bar x =
  3. x * 3
  4.  
  5. let foo x =
  6. (bar [@inlined]) (42 + x)
  7. end [@@inline never]
  8.  
  9. module X = F [@inlined] (struct type t = int end)

20.7 Simplification

Simplification, which is run in conjunction with inlining,propagates information (known as approximations) about whichvariables hold what values at runtime. Certain relationships betweenvariables and symbols are also tracked: for example, some variable may beknown to always hold the same value as some other variable; or perhapssome variable may be known to always hold the value pointed to by somesymbol.

The propagation can help to eliminate allocations in cases such as:

  1. let f x y =
  2. ...
  3. let p = x, y in
  4. ...
  5. ... (fst p) ... (snd p) ...

The projections from p may be replaced by uses of the variablesx and y, potentially meaning that p becomes unused.

The propagation performed by the simplification pass is also important fordiscovering which functions flow to indirect call sites. This can enablethe transformation of such call sites into direct call sites, which makesthem eligible for an inlining transformation.

Note that no information is propagated about the contents of strings,even in safe-string mode, because it cannot yet be guaranteedthat they are immutable throughout a given program.

20.8 Other code motion transformations

20.8.1 Lifting of constants

Expressions found to be constant will be lifted to symbolbindings—that is to say, they will be statically allocated in theobject file—whenthey evaluate to boxed values. Such constants may be straightforward numericconstants, such as the floating-point number 42.0, or more complicatedvalues such as constant closures.

Lifting of constants to toplevel reduces allocation at runtime.

The compiler aims to share constants lifted to toplevel such that thereare no duplicate definitions. However if .cmx files are hiddenfrom the compiler then maximal sharing may not be possible.

Notes about float arrays

The following language semantics apply specifically to constant float arrays.(By “constant float array” is meant an array consisting entirely of floatingpoint numbers that are known at compile time. A common case is a literalsuch as [| 42.0; 43.0; |].

  • Constant float arrays at the toplevel are mutable and never shared.(That is to say, for eachsuch definition there is a distinct symbol in the data section of the objectfile pointing at the array.)
  • Constant float arrays not at toplevel are mutable and are created eachtime the expression is evaluated. This can be thought of as an operation thattakes an immutable array (which in the source code has no associated name; letus call it the initialising array) andduplicates it into a fresh mutable array.
    • If the array is of size four or less, the expression will create afresh block and write the values into it one by one. There is no referenceto the initialising array as a whole.
    • Otherwise, the initialising array is lifted out and subject to thenormal constant sharing procedure;creation of the array consists of bulk copying the initialising arrayinto a fresh value on the OCaml heap.

20.8.2 Lifting of toplevel let bindings

Toplevel let-expressions may be lifted to symbol bindings to ensurethat the corresponding bound variables are not captured by closures. If thedefining expression of a given binding is found to be constant, it is boundas such (the technical term is a let-symbol binding).

Otherwise, the symbol is bound to a (statically-allocated)preallocated block containing one field. At runtime, the definingexpression will be evaluated and the first field of the block filled withthe resulting value. This initialise-symbol bindingcauses one extra indirection but ensures, byvirtue of the symbol’s address being known at compile time, that uses of thevalue are not captured by closures.

It should be noted that the blocks corresponding to initialise-symbolbindings are kept alive forever, by virtue of them occurring in a statictable of GC roots within the object file. This extended lifetime ofexpressions may on occasion be surprising. If it is desired to createsome non-constant value (for example when writing GC tests) that does nothave thisextended lifetime, then it may be created and used inside a function,with the application point of that function (perhaps at toplevel)—orindeed the function declaration itself—markedas to never be inlined. This technique prevents lifting of the definitionof the value in question (assuming of course that it is not constant).

20.9 Unboxing transformations

The transformations in this section relate to the splitting apart ofboxed (that is to say, non-immediate) values. They are largelyintended to reduce allocation, which tends to result in a runtimeperformance profile with lower variance and smaller tails.

20.9.1 Unboxing of closure variables

This transformation is enabled unless-no-unbox-free-vars-of-closures is provided.

Variables that appear in closure environments may themselves be boxedvalues. As such, they may be split into further closure variables, eachof which corresponds to some projection from the original closure variable(s).This transformation is called unboxing of closure variables orunboxing of free variables of closures. It is only applied whenthere isreasonable certainty that there are no uses of the boxed free variable itselfwithin the corresponding function bodies.

Example:

In the following code, the compiler observes thatthe closure returned from the function f contains a variable pair(free in the body of f) that may be split into two separate variables.

  1. let f x0 x1 =
  2. let pair = x0, x1 in
  3. Printf.printf "foo\n";
  4. fun y ->
  5. fst pair + snd pair + y

After some simplification one obtains:

  1. let f x0 x1 =
  2. let pair_0 = x0 in
  3. let pair_1 = x1 in
  4. Printf.printf "foo\n";
  5. fun y ->
  6. pair_0 + pair_1 + y

and then:

  1. let f x0 x1 =
  2. Printf.printf "foo\n";
  3. fun y ->
  4. x0 + x1 + y

The allocation of the pair has been eliminated.

This transformation does not operate if it would cause the closure tocontain more than twice as many closure variables as it did beforehand.

20.9.2 Unboxing of specialised arguments

This transformation is enabled unless-no-unbox-specialised-args is provided.

It may become the case during compilation that one or more invariant argumentsto a function become specialised to a particular value. When such values arethemselves boxed the corresponding specialised arguments may be split intomore specialised arguments corresponding to the projections out of the boxedvalue that occur within the function body. This transformation is calledunboxing of specialised arguments. It is only applied when there isreasonable certainty that the boxed argument itself is unused within thefunction.

If the function in question is involved in a recursive group then unboxingof specialised arguments may be immediately replicated across the groupbased on the dataflow between invariant arguments.

Example:

Having been given the following code, the compilerwill inline loop into f, and then observe invbeing invariant and always the pair formed by adding 42 and 43to the argument x of the function f.

  1. let rec loop inv xs =
  2. match xs with
  3. | [] -> fst inv + snd inv
  4. | x::xs -> x + loop2 xs inv
  5. and loop2 ys inv =
  6. match ys with
  7. | [] -> 4
  8. | y::ys -> y - loop inv ys
  9.  
  10. let f x =
  11. Printf.printf "%d\n" (loop (x + 42, x + 43) [1; 2; 3])

Since the functions have sufficiently few arguments, more specialisedarguments will be added. After some simplification one obtains:

  1. let f x =
  2. let rec loop' xs inv_0 inv_1 =
  3. match xs with
  4. | [] -> inv_0 + inv_1
  5. | x::xs -> x + loop2' xs inv_0 inv_1
  6. and loop2' ys inv_0 inv_1 =
  7. match ys with
  8. | [] -> 4
  9. | y::ys -> y - loop' ys inv_0 inv_1
  10. in
  11. Printf.printf "%d\n" (loop' [1; 2; 3] (x + 42) (x + 43))

The allocation of the pair within f has been removed. (Since thetwo closures for loop’ and loop2’ are constant they will also belifted to toplevel with no runtime allocation penalty. Thiswould also happen without having run the transformation to unboxspecialise arguments.)

The transformation to unbox specialised arguments never introduces extraallocation.

The transformation will not unbox arguments if it would result in theoriginal function having sufficiently many arguments so as to inhibittail-call optimisation.

The transformation is implemented by creating a wrapper function thataccepts the original arguments. Meanwhile, the original function is renamedand extra arguments are added corresponding to the unboxed specialisedarguments; this new functionis called from the wrapper. The wrapper will then be inlinedat direct call sites. Indeed, all call sites will be direct unless-unbox-closures is being used, since they will have been generatedby the compiler when originally specialising the function. (In the caseof -unbox-closures other functions may appear with specialisedarguments; in this case there may be indirect calls and these will incura small penalty owing to having to bounce through the wrapper. The techniqueof direct call surrogates used for -unbox-closures is notused by the transformation to unbox specialised arguments.)

20.9.3 Unboxing of closures

This transformation is not enabled by default. It may be enabledusing the -unbox-closures flag.

The transformation replaces closure variables by specialised arguments.The aim is to cause more closures to become closed. It is particularlyapplicable, as a means of reducing allocation, where the function concernedcannot be inlined or specialised. For example, some non-recursive functionmight be too large to inline; or some recursive function might offerno opportunities for specialisation perhaps because its only argument isone of type unit.

At present there may be a small penalty in terms of actual runtimeperformance when this transformation is enabled, although more stableperformance may be obtained due to reduced allocation. It is recommendedthat developers experiment to determine whether the option is beneficialfor their code. (It is expected that in the future it will be possiblefor the performance degradation to be removed.)

Simple example:

In the following code (which might typicallyoccur when g is too large to inline) the value of x would usuallybe communicated to the application of the + function via the closureof g.

  1. let f x =
  2. let g y =
  3. x + y
  4. in
  5. (g [@inlined never]) 42

Unboxing of the closure causes the value for x inside g tobe passed as an argument to g rather than through its closure. Thismeans that the closure of g becomes constant and may be lifted totoplevel, eliminating the runtime allocation.

The transformation is implemented by adding a new wrapper function in themanner of that used when unboxing specialised arguments. The closurevariables are still free in the wrapper, but the intention is that whenthe wrapper is inlined at direct call sites, the relevant values arepassed directly to the main function via the new specialised arguments.

Adding such a wrapper will penalise indirect calls to the function(which might exist in arbitrary places; remember that this transformationis not for example applied only on functions the compiler has producedas a result of specialisation) since such calls will bounce throughthe wrapper. Tomitigate this, if a function is small enough when weighed up againstthe number of free variables being removed, it will be duplicated by thetransformation to obtain two versions: the original (used for indirect calls,since we can do no better) and the wrapper/rewritten function pair asdescribed in the previous paragraph. The wrapper/rewritten function pairwill only be used at direct call sites of the function. (The wrapper inthis case is known as a direct call surrogate, sinceit takes the place of another function—the unchanged version used forindirect calls—at direct call sites.)

The -unbox-closures-factor command line flag, which takes aninteger, may be used to adjust the point at which a function is deemedlarge enough to be ineligible for duplication. The benefit ofduplication is scaled by the integer before being evaluated against thesize.

Harder example:

In the following code, there are two closurevariables that would typically cause closure allocations. One is calledfv and occurs inside the function baz; the other is calledz and occurs inside the function bar.In this toy (yet sophisticated) example we again use an attribute tosimulate the typical situation where the first argument of baz istoo large to inline.

  1. let foo c =
  2. let rec bar zs fv =
  3. match zs with
  4. | [] -> []
  5. | z::zs ->
  6. let rec baz f = function
  7. | [] -> []
  8. | a::l -> let r = fv + ((f [@inlined never]) a) in r :: baz f l
  9. in
  10. (map2 (fun y -> z + y) [z; 2; 3; 4]) @ bar zs fv
  11. in
  12. Printf.printf "%d" (List.length (bar [1; 2; 3; 4] c))

The code resulting from applying -O3 -unbox-closures to this codepasses the free variables via function arguments inorder to eliminate all closure allocation in this example (aside from anythat might be performed inside printf).

20.10 Removal of unused code and values

20.10.1 Removal of redundant let expressions

The simplification pass removes unused let bindings so long astheir corresponding defining expressions have “no effects”. Seethe section “Treatment of effects” below for the precise definition ofthis term.

20.10.2 Removal of redundant program constructs

This transformation is analogous to the removal of let-expressionswhose defining expressions have no effects. It operates instead on symbolbindings, removing those that have no effects.

20.10.3 Removal of unused arguments

This transformation is only enabled by default for specialised arguments.It may be enabled for all arguments using the -remove-unused-argumentsflag.

The pass analyses functions to determine which arguments are unused.Removal is effected by creating a wrapper function, which will be inlinedat every direct call site, that accepts the original arguments and thendiscards the unused ones before calling the original function. As aconsequence, this transformation may be detrimental if the originalfunction is usually indirectly called, since such calls will now bouncethrough the wrapper. (The technique of direct call surrogates usedto reduce this penalty during unboxing of closure variables (see above)does not yet apply to the pass that removes unused arguments.)

20.10.4 Removal of unused closure variables

This transformation performs an analysis acrossthe whole compilation unit to determine whether there exist closure variablesthat are never used. Such closure variables are then eliminated. (Note thatthis has to be a whole-unit analysis because a projection of a closurevariable from some particular closure may have propagated to an arbitrarylocation within the code due to inlining.)

20.11 Other code transformations

20.11.1 Transformation of non-escaping references into mutable variables

Flambda performs a simple analysis analogous to that performed elsewherein the compiler that can transform refs into mutable variablesthat may then be held in registers (or on the stack as appropriate) ratherthan being allocated on the OCaml heap. This only happens so long as thereference concerned can be shown to not escape from its defining scope.

20.11.2 Substitution of closure variables for specialised arguments

This transformation discovers closure variables that are known to beequal to specialised arguments. Such closure variables are replaced bythe specialised arguments; the closure variables may then be removed bythe “removal of unused closure variables” pass (see below).

20.12 Treatment of effects

The Flambda optimisers classify expressions in order to determine whetheran expression:

  • does not need to be evaluated at all; and/or
  • may be duplicated.This is done by forming judgements on the effects and the _coeffects_that might be performed were the expression to be executed. Effects talkabout how the expression might affect the world; coeffects talk about howthe world might affect the expression.

Effects are classified as follows:

  • No effects:
  • The expression does not change the observable stateof the world. For example, it must not write to any mutable storage,call arbitrary external functions or change control flow (e.g. by raisingan exception). Note that allocation is not classed as having“no effects” (see below).
    • It is assumed in the compiler that expressions with noeffects, whose results are not used, may be eliminated. (This typicallyhappens where the expression in question is the defining expression of alet; in such cases the let-expression will beeliminated.) It is furtherassumed that such expressions with no effects may beduplicated (and thus possibly executed more than once).
    • Exceptions arising from allocation points, for example“out of memory” orexceptions propagated from finalizers or signal handlers, are treated as“effects out of the ether” and thus ignored for our determination hereof effectfulness. The same goes for floating point operations that maycause hardware traps on some platforms.
  • Only generative effects:
  • The expression does not change theobservable state of the world save for possibly affecting the state ofthe garbage collector by performing an allocation. Expressionsthat only have generative effects and whose results are unusedmay be eliminated by the compiler. However, unlike expressions with“no effects”, such expressions will never be eligible for duplication.
  • Arbitrary effects:
  • All other expressions.There is a single classification for coeffects:

  • No coeffects:

  • The expression does not observe the effects (inthe sense described above) of other expressions. For example, it must notread from any mutable storage or call arbitrary external functions.It is assumed in the compiler that, subject to data dependencies,expressions with neither effects nor coeffects may be reordered withrespect to other expressions.

20.13 Compilation of statically-allocated modules

Compilation of modules that are able to be statically allocated (for example,the module corresponding to an entire compilation unit, as opposed to a firstclass module dependent on values computed at runtime) initially follows thestrategy used for bytecode. A sequence of let-bindings, which may beinterspersed with arbitrary effects, surrounds a record creation that becomesthe module block. The Flambda-specific transformation follows: these bindingsare lifted to toplevel symbols, as described above.

20.14 Inhibition of optimisation

Especially when writing benchmarking suites that run non-side-effectingalgorithms in loops, it may be found that the optimiser entirelyelides the code being benchmarked. This behaviour can be prevented byusing the Sys.opaque_identity function (which indeed behaves as anormal OCaml function and does not possess any “magic” semantics). Thedocumentation of the Sys module should be consulted for further details.

20.15 Use of unsafe operations

The behaviour of the Flambda simplification pass means that certain unsafeoperations, which may without Flambda or when using previous versions ofthe compiler be safe, must not be used. This specifically refers tofunctions found in the Obj module.

In particular, it is forbidden to change any value (for example usingObj.set_field or Obj.set_tag) that is not mutable.(Values returned from C stubsare always treated as mutable.) The compiler will emit warning 59 if itdetects such a write—but it cannot warn in all cases. Here is an exampleof code that will trigger the warning:

  1. let f x =
  2. let a = 42, x in
  3. (Obj.magic a : int ref) := 1;
  4. fst a

The reason this is unsafe is because the simplification pass believes thatfst a holds the value 42; and indeed it must, unless typesoundness has been broken via unsafe operations.

If it must be the case that code has to be written that triggers warning 59,but the code is known to actually be correct (for some definition ofcorrect), then Sys.opaque_identity may be used to wrap the valuebefore unsafe operations are performed upon it. Great care must be takenwhen doing this to ensure that the opacity is added at the correct place.It must be emphasised that this use of Sys.opaque_identity is onlyfor exceptional cases. It should not be used in normal code or totry to guide the optimiser.

As an example, this code will return the integer 1:

  1. let f x =
  2. let a = Sys.opaque_identity (42, x) in
  3. (Obj.magic a : int ref) := 1;
  4. fst a

However the following code will still return 42:

  1. let f x =
  2. let a = 42, x in
  3. Sys.opaque_identity (Obj.magic a : int ref) := 1;
  4. fst a

High levels of inlining performed by Flambda may expose bugs in codethought previously to be correct. Take care, for example, notto add type annotations that claim some mutable value is always immediateif it might be possible for an unsafe operation to update it to a boxedvalue.

20.16 Glossary

The following terminology is used in this chapter of the manual.

  • Call site
  • See direct call site and indirect call site below.
  • Closed function
  • A function whose body has no free variablesexcept its parameters and any to which are bound other functions withinthe same (possibly mutually-recursive) declaration.
  • Closure
  • The runtime representation of a function. Thisincludes pointers to the code of the functiontogether with the values of any variables that are used in the body ofthe function but actually defined outside of the function, in theenclosing scope.The values of such variables, collectively known as theenvironment, are required because the function may beinvoked from a place where the original bindings of such variables areno longer in scope. A group of possiblymutually-recursive functions defined using let rec all share asingle closure. (Note to developers: in the Flambda source code aclosure always corresponds to a single function; aset of closures refers to a group of such.)
  • Closure variable
  • A member of the environment held within theclosure of a given function.
  • Constant
  • Some entity (typically an expression) the value of whichis known by the compiler at compile time. Constantness may be explicit fromthe source code or inferred by the Flambda optimisers.
  • Constant closure
  • A closure that is statically allocated in anobject file. It is almost always the case that the environment portion ofsuch a closure is empty.
  • Defining expression
  • The expression e in let x = e in e’.
  • Direct call site
  • A place in a program’s code where a function iscalled and it is known at compile time which function it will always be.
  • Indirect call site
  • A place in a program’s code where a functionis called but is not known to be a direct call site.
  • Program
  • A collection of symbol bindings forming thedefinition of a single compilation unit (i.e. .cmx file).
  • Specialised argument
  • An argument to a function that is knownto always hold a particular value at runtime. These are introduced by theinliner when specialising recursive functions; and the unbox-closurespass. (See section 20.4.)
  • Symbol
  • A name referencing a particular place in an object fileor executable image. At that particular place will be some constant value.Symbols may be examined using operating system-specific tools (forexample objdump on Linux).
  • Symbol binding
  • Analogous to a let-expression but workingat the level of symbols defined in the object file. The address of a symbol isfixed, but it may be bound to both constant and non-constant expressions.
  • Toplevel
  • An expression in the current program which is notenclosed within any function declaration.
  • Variable
  • A named entity to which some OCaml value is bound by alet expression, pattern-matching construction, or similar.

20.17 Module Spacetime : Profiling of a program’s space behaviour over time.

Currently only supported on x86-64 platforms running 64-bit code.

To use the functions in this module you must:

  • configure the compiler with "-spacetime";
  • compile to native code.Without these conditions being satisfied the functions in this modulewill have no effect.Instead of manually taking profiling heap snapshots with this module it ispossible to use an automatic snapshot facility that writes profilinginformation at fixed intervals to a file. To enable this, all that needs tobe done is to build the relevant program using a compiler configured with-spacetime; and set the environment variable OCAML_SPACETIME_INTERVAL to aninteger number of milliseconds giving the interval between profiling heapsnapshots. This interval should not be made excessively small relative tothe running time of the program. A typical interval to start with might be1/100 of the running time of the program. The program must exit "normally"(i.e. by calling exit, with whatever exit code, rather than beingabnormally terminated by a signal) so that the snapshot file iscorrectly completed.

When using the automatic snapshot mode the profiling output is writtento a file called "spacetime-<pid>" where <pid> is the process ID of theprogram. (If the program forks and continues executing then multiplefiles may be produced with different pid numbers.) The profiling outputis by default written to the current working directory when the programstarts. This may be customised by setting the OCAML_SPACETIME_SNAPSHOT_DIRenvironment variable to the name of the desired directory.

If using automatic snapshots the presence of thesave_event_for_automatic_snapshots function, below, should be noted.

The functions in this module are thread safe.

For functions to decode the information recorded by the profiler,see the Spacetime offline library in otherlibs/.

  1. val enabled : bool

enabled is true if the compiler is configured with spacetime and falseotherwise
  1. module Series :

sig

  1. type t
Type representing a file that will hold a series of heap snapshotstogether with additional information required to interpret thosesnapshots.
  1. val create : path:string -> t
create ~path creates a series file at path.
  1. val save_event : ?time:float -> t -> event_name:string -> unit
save_event writes an event, which is an arbitrary string, into thegiven series file. This may be used for identifying particular pointsduring program execution when analysing the profile.The optional time parameter is as for Spacetime.Snapshot.take[20.17].
  1. val save_and_close : ?time:float -> t -> unit
save_and_close series writes information into series required forinterpreting the snapshots that series contains and then closes theseries file. This function must be called to produce a valid seriesfile.The optional time parameter is as for Spacetime.Snapshot.take[20.17].

end

  1. module Snapshot :

sig

  1. val take : ?time:float -> Spacetime.Series.t -> unit
take series takes a snapshot of the profiling annotations on the valuesin the minor and major heaps, together with GC stats, and write theresult to the series file. This function triggers a minor GC but doesnot allocate any memory itself.If the optional time is specified, it will be used instead of theresult of Sys.time[??] as the timestamp of the snapshot. Such timesshould start from zero and be monotonically increasing. This parameteris intended to be used so that snapshots can be correlated against wallclock time (which is not supported in the standard library) rather thanelapsed CPU time.

end

  1. val save_event_for_automatic_snapshots : event_name:string -> unit

Like Spacetime.Series.save_event[20.17], but writes to the automatic snapshot file.This function is a no-op if OCAML_SPACETIME_INTERVAL was not set.