Originally posted on the Backtrace I/O tech blog.
All long-lived programs are either implemented in dynamic languages,1 or eventually Greenspun themselves into subverting static programming languages to create a dynamic system (e.g., Unix process trees). The latter approach isn’t a bad idea, but it’s easy to introduce more flexibility than intended (e.g., data-driven JNDI lookups) when we add late binding features piecemeal, without a holistic view of how all the interacting components engender a weird program modification language.
At Backtrace, we mostly implement late (re)binding by isolating subtle logic in dedicated executables with short process lifetimes: we can replace binaries on disk atomically, and their next invocation will automatically pick up the change. In a pinch, we sometimes edit template or Lua source files and hot reload them in nginx. We prefer this to first-class programmatic support for runtime modification because Unix has a well understood permission model around files, and it’s harder to bamboozle code into overwriting files when that code doesn’t perform any disk I/O.
However, these patterns aren’t always sufficient. For example, we
sometimes wish to toggle code that’s deep in performance-sensitive
query processing loops, or tightly coupled with such logic. That’s
when we rely on our dynamic_flag library.
This library lets us tweak flags at runtime, but flags can only take boolean values (enabled or disabled), so the dynamism it introduces is hopefully bounded enough to avoid unexpected emergent complexity. The functionality looks like classic feature flags, but thanks to the flags’ minimal runtime overhead coupled with the ability to flip them at runtime, there are additional use cases, such as disabling mutual exclusion logic during single-threaded startup or toggling log statements. The library has also proved invaluable for crisis management, since we can leave flags (enabled by default) in well-trodden pieces of code without agonising over their impact on application performance. These flags can serve as ad hoc circuit breakers around complete features or specific pieces of code when new inputs tickle old latent bugs.
The secret behind this minimal overhead? Cross-modifying machine code!
Intel tells us we’re not supposed to do that, at least not without
pausing threads… yet the core of the
dynamic_flag C library
has been toggling branches on thousands of machines for years, without
any issue. It’s available under the
for other adventurous folks.
Runtime efficiency is an essential feature in dynamic_flag, essential
enough to justify mutating machine code while it’s executing on other
cores: not only because it unlocks additional use cases, but, more
importantly, because it frees programmers from worrying about the
performance impact of branching on a flag in the most obvious
location, even if that’s in the middle of a hot inner loop.
With the aim of encouraging programmers to spontaneously protect code
with flag checks, without prodding during design or code review, we
designed dynamic_flag to minimise the amount of friction and mental
overhead of adding a new feature flag. That’s why we care so much
about all forms of overhead, not just execution time. For example,
there’s no need to break one’s flow and register flags separately from
their use points. Adding a feature flag should not feel like a chore.
However, we’re also aware that feature flags tend to stick around
forever. We try to counteract this inertia with static registration:
all DF_* expansions in an executable appear in its
dynamic_flag_list section, and the library
enumerates them at runtime. Periodic audits will reveal flags that
have become obsolete, and flags are easy to find:
each flag’s full name includes its location in the source code.
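As a rough illustration of static registration, here is a minimal sketch of the general linker-section technique. It is not the library's actual expansion: `DEMO_REGISTER`, the `demo_flag_list` section, and the helper functions are hypothetical names, and the sketch assumes GCC or Clang targeting ELF (e.g., Linux), where the linker synthesises `__start_`/`__stop_` symbols for sections named like C identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_STR_(x) #x
#define DEMO_STR(x) DEMO_STR_(x)

/*
 * Hypothetical registration macro (not the library's): stores a pointer
 * to a "kind:name@file:line" string in the "demo_flag_list" section.
 */
#define DEMO_REGISTER(kind, name)                                    \
    do {                                                             \
        static const char *demo_entry_                               \
            __attribute__((section("demo_flag_list"), used)) =       \
            #kind ":" #name "@" __FILE__ ":" DEMO_STR(__LINE__);     \
        (void)demo_entry_;                                           \
    } while (0)

/* Section bounds provided by GNU ld or lld on ELF targets. */
extern const char *__start_demo_flag_list[];
extern const char *__stop_demo_flag_list[];

/* Number of registered use points in the whole executable. */
static size_t demo_flag_count(void)
{
    return (size_t)(__stop_demo_flag_list - __start_demo_flag_list);
}

/* Returns 1 if some registered full name starts with `prefix`. */
static int demo_flag_registered(const char *prefix)
{
    assert(prefix != NULL);
    for (const char **it = __start_demo_flag_list;
         it < __stop_demo_flag_list; it++) {
        if (strncmp(*it, prefix, strlen(prefix)) == 0)
            return 1;
    }
    return 0;
}

/* Use points register themselves; there is no central table to edit. */
int demo_handle(int x)
{
    DEMO_REGISTER(my_module, fast_path);
    DEMO_REGISTER(my_module, extra_logging);
    return x + 1;
}
```

Walking from `__start_demo_flag_list` to `__stop_demo_flag_list` is what makes an audit possible without any registration call ever running: the entries exist as soon as the executable is linked.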
We find value in
dynamic_flag because its runtime overhead is
negligible for all but the most demanding
code,2 while the interface lets us easily
make chunks of code toggleable at runtime without having to worry
about things like “where am I supposed to register this new option?”
The same system is efficient and ergonomic enough for all teams in
all contexts, avoids contention in our source tree, and guarantees
discoverability for whoever happens to be on call.
How to use
All dynamic flags have a “kind” (namespace) string, and a name. We
often group all flags related to an experimental module or feature in
the same “kind,” and use the name to describe the specific functionality
in the feature guarded by the flag. A dynamic flag can be disabled by
default (like a feature flag), or enabled by default, and evaluating a
dynamic flag’s value implicitly defines and registers it with the
dynamic_flag library.
A dynamic flag introduced with the DF_FEATURE macro
is disabled (evaluates to false) by
default, and instructs the compiler to optimise for that default value.
We can instead enable code by default and optimise for cases where the
flag is enabled with the DF_DEFAULT macro.
Each DF_* condition in the source is actually its own flag;
a flag’s full name includes its kind, its name, and its source location (e.g.,
my_module:flag_name@<stdin>:15), and each condition has its own
state record. It’s thus safe, if potentially confusing, to define flags
of different types (feature or default) with the same kind and
name. These macros may appear in inline
functions: each instantiation will get its own metadata block, and an
arbitrary number of blocks can share the same full name.
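The shape of a use point might look like the sketch below. The `(kind, name)` argument form is an assumption inferred from full names such as my_module:flag_name@<stdin>:15, and the two `#define`s are stand-ins that only model each macro's default value so the snippet compiles without the library; the flag names and `process_item` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-ins, NOT the library's definitions: they only model the default
 * value of each macro. The real macros expand to patchable machine code.
 */
#ifndef DF_FEATURE
#define DF_FEATURE(kind, name) (false) /* disabled unless activated */
#endif
#ifndef DF_DEFAULT
#define DF_DEFAULT(kind, name) (true) /* enabled unless deactivated */
#endif

/* Hypothetical hot-path function with two flagged blocks. */
int process_item(int item)
{
    assert(item >= 0);

    /* Experimental logic: off by default, like a classic feature flag. */
    if (DF_FEATURE(my_module, double_items)) {
        item *= 2;
    }

    /* Well-trodden logic: on by default, can be cut out in a crisis. */
    if (DF_DEFAULT(my_module, validate_input)) {
        if (item > 1000)
            return -1;
    }

    return item + 1;
}
```

With the defaults modelled here, `process_item(1)` returns 2 and `process_item(2000)` is rejected by the validation block.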
Before manipulating these dynamic flags,
applications must call
dynamic_flag_init_lib to initialise the library’s state.
Once the library is initialised, interactive or configuration-driven
usage typically toggles flags by calling dynamic_flag_activate and
dynamic_flag_deactivate with POSIX extended regexes that match
the flags’ full names.
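To get a feel for how such patterns select flags, the sketch below applies POSIX extended regexes to a hypothetical list of full names with the standard `regcomp`/`regexec` functions. It does not call the library; `count_matching` and `demo_names` are illustrative, and treating the pattern as an unanchored search is an assumption.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical full names, in the kind:name@file:line style. */
static const char *const demo_names[] = {
    "my_module:flag_name@<stdin>:15",
    "my_module:other_flag@<stdin>:42",
    "request_tracing:log_headers@server.c:101",
};

/*
 * Counts the names matched by a POSIX extended regex.
 * Returns -1 if the pattern fails to compile.
 */
static int count_matching(const char *pattern,
                          const char *const *names, size_t n)
{
    regex_t re;
    int matches = 0;

    assert(pattern != NULL && names != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (regexec(&re, names[i], 0, NULL, 0) == 0)
            matches++;
    }

    regfree(&re);
    return matches;
}
```

A pattern like `^my_module:` selects every flag of one kind, while `flag_name@` narrows the match to a single name regardless of where it appears in the source.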
DF_FEATURE and DF_DEFAULT directly map to classic feature flags, but the
dynamic_flag library still has more to offer. Applications can
programmatically enable and disable blocks of code to implement a
restricted form of aspect-oriented programming: guarded blocks cannot
be inserted post hoc, and must instead be defined inline in the
source, but may be toggled at runtime by unrelated code.
For example, an application could let individual HTTP requests opt
into detailed tracing with a query string parameter, and set
request->tracing_mode = true in its internal request object when
it accepts such a request. Environments where fewer than one request
in a million enables tracing could easily spend more aggregate time
evaluating if (request->tracing_mode == true) than they do in the
tracing logic itself.
One could try to reduce the overhead by coalescing the trace code in
fewer conditional blocks, but that puts more distance between the
tracing code and the traced logic it’s supposed to record, which
tends to cause the two to desynchronise and adds to development
friction.
It’s tempting to instead optimise frequent checks for the common case
(no tracing) with a dynamic flag that is enabled whenever at least one
in-flight request has opted into tracing. That’s why the
DF_OPT (for opt-in logic) macro exists.
DF_OPT instructs the compiler to assume the flag is disabled, but leaves the
flag enabled (i.e., the conditional always evaluates
request->tracing_mode) until the library is initialised with
dynamic_flag_init_lib.3 After initialisation,
the flag acts like a
DF_FEATURE (i.e., the overhead is a test
instruction that falls through without any conditional branching)
until it is explicitly enabled again.
With this flag-before-check pattern, it’s always safe to enable
request_tracing flags: in the worst case, we’ll just look at the
request object, see that
request->tracing_mode == false, and skip
the tracing logic. Of course, that’s not ideal for performance. When
we definitely know that no request has asked for tracing, we want to
disable all request_tracing flags and not even look at the request
object.
Whenever the application receives a request that opts into tracing, it
can enable all flags with kind request_tracing.
When that same request leaves the system (e.g., when the application
has fully sent a response back), the application
undoes the activation with a matching deactivation.
Activation and deactivation calls actually increment and decrement
counters associated with each instance of a
DF_... macro, so this scheme works correctly when
multiple requests with overlapping lifetimes opt into tracing:
tracing blocks will check whether
request->tracing_mode == true
whenever at least one in-flight request has
tracing_mode == true, and
skip these conditionals as soon as no such request exists.
Practical considerations for programmatic manipulation
Confirming that a flag is set to its expected value (disabled for
DF_OPT, enabled for
DF_DEFAULT) is fast…
because we shifted all the complexity to the flag flipping
code. Changing the value for a set of flags is extremely slow
(milliseconds of runtime and several IPIs for
mprotect(2) calls), so it only makes sense to use dynamic flags when they are
rarely activated or deactivated (e.g., less often than once a minute
or even less often than once an hour).
We have found programmatic flag manipulation to be useful not just for
opt-in request tracing or to enable log statements, but also to
minimise the impact of complex logic on program phases that do not
require them. For example, mutual exclusion and
safe memory reclamation deferral
may be redundant while a program is in a single-threaded
startup mode; we can guard such code behind DF_OPT(steady_state, ...)
to accelerate startup, and enable the
steady_state flags just before spawning worker threads.
It can also make sense to guard slow paths with DF_OPT when
a program only enters phases that need this slow path logic every few minutes. That was the case for a
software transactional memory system with batched updates.
Most of the time, no update is in flight, so readers never have to
check for concurrent writes. These checks can be guarded with
DF_OPT(stm, ...) conditions, as long as the program enables the stm
flags around batches of updates. Enabling and disabling all these
flags can take a while (milliseconds), but, as long as updates are
infrequent enough, the improved common case (getting rid of a memory
load and a conditional jump for a read barrier) means the tradeoff
is worthwhile.
Even when flags are controlled programmatically, it can be useful to
work around bugs by manually forcing some flags to remain enabled or
disabled. In the tracing example above, we could find a crash in one
of the tracing blocks, and wish to prevent requests from
exercising that block of code.
It’s easy to force a flag into an active state: flag activations
are counted, so it suffices to activate it
manually, once. However, we want it to be safe to issue ad hoc deactivations
without wedging the system in a weird state, so activation counts don’t go negative.
Unfortunately, this means we can’t use deactivations
to prevent, e.g., a crashy request tracing block from being
activated.
Flags can instead be “unhooked” dynamically. While unhooked,
increments to a flag’s activation count are silently disregarded.
Unhooking applies to DF_* conditions whose full name matches an extended POSIX regular expression passed as an argument.
When a flag has been
unhooked more often than it has been
rehooked, attempts to activate it will
silently no-op. Once a flag has been unhooked, we can
deactivate it until its activation count reaches 0.
At that point, the flag is disabled, and will remain disabled
until it is rehooked.
The core implementation trick
The introduction of
asm goto in GCC 4.5
made it possible to implement control operators in inline assembly.
When the condition actually varies at runtime, it usually
makes more sense to set an output variable with a condition code, but
dynamic_flag conditions are actually static in machine code: each
DF_* macro expands to one 5-byte instruction, either a
test eax, imm32 instruction
that falls through to the common case when that’s the flag’s value
(i.e., enabled for
DF_DEFAULT, disabled for
DF_FEATURE and DF_OPT), or a 32-bit relative
jmp rel32 to the unexpected path
(disabled for DF_DEFAULT, enabled for
DF_FEATURE and DF_OPT) otherwise. Activating and deactivating dynamic flags toggles the
corresponding target instructions between
test eax, imm32 (0xA9) and
jmp rel32 (0xE9).
The DF_... macros expand into a
lot more inline assembly than just that one instruction;
the rest of the expansion is a lot of noise to
register everything with structs and pointers in dedicated
sections. Automatic static registration is mostly orthogonal to the
performance goals, but is key to the (lazy-)programmer-friendly
interface.
We use test eax, imm32 instead of a nop because it’s exactly five
bytes, just like
jmp rel32, and because its 4-byte immediate is in
the same place as the 4-byte offset of
jmp rel32. We can thus encode
the jump offset at assembly-time, and flip between falling through to
the common path (test) and jumping to the unexpected path (jmp) by
overwriting the opcode byte (0xA9 for
test, 0xE9 for jmp).
Updating a single byte for each dynamic flag avoids questions
around the correct order for writes. This single-byte
cross-modification (we overwrite instruction bytes while other threads
may be executing the mutated machine code) also doesn’t affect the
size of the instruction (both
test eax and
jmp rel span 5 bytes),
which should hopefully suffice to avoid sharp edges around instruction
decoding in hardware, despite our disregard for
Intel’s recommendations regarding cross-modifying code in Section 8.1.3 of the SDM.4
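The byte-level effect of the opcode flip is easy to show on a plain buffer. The sketch below only edits an array; it does not patch live code, and the `demo_*` helpers are hypothetical rather than part of the library. The opcodes are the ones named above: 0xA9 for test eax, imm32 and 0xE9 for jmp rel32.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    DEMO_OP_TEST_EAX_IMM32 = 0xA9, /* test eax, imm32 */
    DEMO_OP_JMP_REL32 = 0xE9,      /* jmp rel32 */
    DEMO_INSN_LEN = 5              /* 1 opcode byte + 4 operand bytes */
};

/*
 * Emits the fall-through form. The 4 operand bytes already hold the
 * little-endian displacement to the unexpected path, so a later flip
 * to jmp needs no other write.
 */
static void demo_encode_site(uint8_t insn[DEMO_INSN_LEN], int32_t rel32)
{
    uint32_t bits = (uint32_t)rel32;

    insn[0] = DEMO_OP_TEST_EAX_IMM32;
    for (int i = 0; i < 4; i++)
        insn[1 + i] = (uint8_t)(bits >> (8 * i));
}

/* Flips between falling through and jumping with a single byte store. */
static void demo_set_jump(uint8_t insn[DEMO_INSN_LEN], bool jump)
{
    assert(insn[0] == DEMO_OP_TEST_EAX_IMM32 ||
           insn[0] == DEMO_OP_JMP_REL32);
    insn[0] = jump ? DEMO_OP_JMP_REL32 : DEMO_OP_TEST_EAX_IMM32;
}

/* Reads back the shared operand bytes (immediate or displacement). */
static uint32_t demo_operand_bits(const uint8_t insn[DEMO_INSN_LEN])
{
    uint32_t bits = 0;

    for (int i = 0; i < 4; i++)
        bits |= (uint32_t)insn[1 + i] << (8 * i);
    return bits;
}
```

Because the operand bytes never change, the instruction is a valid 5-byte encoding both before and after the store, whichever opcode byte a concurrent reader observes.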
The library does try to protect against code execution exploits by
relaxing and reinstating page protection with mprotect(2) around
all cross-modification writes. Since switching from
Read-Write-eXecute permissions to Read-eXecute acts as a
membarrier (issues IPIs) on Linux/x86-64,
we can also know that the updated code is globally visible by the time
a call to
dynamic_flag_activate, etc., returns.
It’s not practical to bounce page protection for each DF_ site,
especially with inlining (some users have hundreds of inlined calls to
flagged functions, e.g., to temporarily paper over
use-after-frees by nopping out a few calls to
free(3)). Most of the complexity in
dynamic_flag.c is simply in
gathering metadata records for all
DF_ sites that should be activated or deactivated, and in amortising
mprotect calls for stretches of
DF_ sites on contiguous pages.
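One way to amortise page protection changes is to coalesce patch sites into page-aligned ranges first. The following is a sketch of that idea only, not the logic in dynamic_flag.c: `demo_coalesce`, `struct demo_range`, and the sorted-input requirement are assumptions, and the function computes ranges without calling mprotect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open, page-aligned address range [begin, end). */
struct demo_range {
    uintptr_t begin;
    uintptr_t end;
};

/*
 * Groups patch sites into runs of contiguous pages. `sites` holds the
 * address of each opcode byte, sorted in ascending order; `out` must
 * have room for `n` ranges. Returns the number of ranges written.
 * Only the opcode byte is rewritten, so each site touches one page.
 */
static size_t demo_coalesce(const uintptr_t *sites, size_t n,
                            uintptr_t page_size, struct demo_range *out)
{
    size_t n_ranges = 0;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    for (size_t i = 0; i < n; i++) {
        uintptr_t begin = sites[i] & ~(page_size - 1);
        uintptr_t end = begin + page_size;

        assert(i == 0 || sites[i - 1] <= sites[i]);
        if (n_ranges > 0 && begin <= out[n_ranges - 1].end) {
            /* Same or adjacent page: extend the current range. */
            if (end > out[n_ranges - 1].end)
                out[n_ranges - 1].end = end;
        } else {
            out[n_ranges++] = (struct demo_range){ begin, end };
        }
    }
    return n_ranges;
}
```

Each resulting range would then need a single pair of protection changes (relax, write every opcode byte in the range, reinstate) instead of one pair per site.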
Sometimes, code is just done
The dynamic_flag library is an updated interface for the core implementation of an earlier library,
and reflects years of experience with that functionality. We’re happy
to share it, but aren’t looking for feature requests or contributions.
There might be some small clean-ups as we add support for ARM or RISC-V,
or let the library interoperate with a Rust implementation.
However, we don’t expect changes to the interface, i.e., the DF_ macros
and the activation/deactivation functions, nor to its core structure,
especially given the contemporary tastes for hardening (for example,
the cross-modification approach is completely incompatible with
OpenBSD’s and OS X’s strict
W^X policies). The library works for our
target platforms, and we don’t wish to take on extra complexity that
is of no benefit to us.
Of course, it’s Apache licensed,
so anyone can fork the library and twist it beyond
recognition. However, if you’re interested in powerful patching
capabilities, dynamic languages (e.g., Erlang, Common Lisp, or even
Python and Ruby), or tools like Live++
and Recode may be more appropriate.5
We want dynamic_flag to remain simple and just barely flexible
enough for our usage patterns.
Thank you, Jacob, Josh, and Per, for feedback on earlier versions.
It’s no accident that canonical dynamic languages like Smalltalk, Forth, and Lisp are all image-based: how would an image-based system even work if it were impossible to redefine functions or types? ↩
Like guaranteed optimisations in Lisps, the predictable performance impact isn’t important because all code is performance sensitive, but because performance is a cross-cutting concern, and a predictably negligible overhead makes it easier to implement new abstractions, especially with the few tools available in C. In practice, the impact of considering a code path reachable in case a flag is flipped from its expected value usually dwarfs that of the single test instruction generated for the dynamic flag itself. ↩
Or if the dynamic_flag library isn’t aware of that DF_OPT, maybe because the function surrounding that DF_OPT conditional was loaded dynamically. ↩
After a few CPU-millennia of production experience, the cross-modification logic hasn’t been associated with any “impossible” bug, or with any noticeable increase in the rate of hardware hangs or failures. ↩
The industry could learn a lot from game development practices, especially for stateful non-interactive backend servers and slow batch computations. ↩