Originally posted on the Backtrace I/O tech blog.
All long-lived programs are either implemented in dynamic languages,1 or eventually Greenspun themselves into subverting static programming languages to create a dynamic system (e.g., Unix process trees). The latter approach isn’t a bad idea, but it’s easy to introduce more flexibility than intended (e.g., data-driven JNDI lookups) when we add late binding features piecemeal, without a holistic view of how all the interacting components engender a weird program modification language.
At Backtrace, we mostly implement late (re)binding by isolating subtle logic in dedicated executables with short process lifetimes: we can replace binaries on disk atomically, and their next invocation will automatically pick up the change. In a pinch, we sometimes edit template or Lua source files and hot reload them in nginx. We prefer this to first-class programmatic support for runtime modification because Unix has a well-understood permission model around files, and it’s harder to bamboozle code into overwriting files when that code doesn’t perform any disk I/O.
However, these patterns aren’t always sufficient. For example, we
sometimes wish to toggle code that’s deep in performance-sensitive
query processing loops, or tightly coupled with such logic. That’s
when we rely on our dynamic_flag
library.
This library lets us tweak flags at runtime, but flags can only take boolean values (enabled or disabled), so the dynamism it introduces is hopefully bounded enough to avoid unexpected emergent complexity. The functionality looks like classic feature flags, but thanks to the flags’ minimal runtime overhead coupled with the ability to flip them at runtime, there are additional use cases, such as disabling mutual exclusion logic during single-threaded startup or toggling log statements. The library has also proved invaluable for crisis management, since we can leave flags (enabled by default) in well-trodden pieces of code without agonising over their impact on application performance. These flags can serve as ad hoc circuit breakers around complete features or specific pieces of code when new inputs tickle old latent bugs.
The secret behind this minimal overhead? Cross-modifying machine code!
Intel tells us we’re not supposed to do that, at least not without
pausing threads… yet the core of the dynamic_flag
C library
has been toggling branches on thousands of machines for years, without
any issue. It’s available under the
Apache license
for other adventurous folks.
Overhead matters
Runtime efficiency is an essential feature in dynamic_flag
—
enough to justify mutating machine code while it’s executing on other
cores
—not only because it unlocks additional use cases, but, more
importantly, because it frees programmers from worrying about the
performance impact of branching on a flag in the most obvious
location, even if that’s in the middle of a hot inner loop.
With the aim of encouraging programmers to spontaneously protect code
with flag checks, without prodding during design or code review, we
designed dynamic_flag
to minimise the amount of friction and mental
overhead of adding a new feature flag. That’s why we care so much
about all forms of overhead, not just execution time. For example,
there’s no need to break one’s flow and register flags separately from
their use points. Adding a feature flag should not feel like a chore.
However, we’re also aware that feature flags tend to stick around
forever. We try to counteract this inertia with static registration:
all the DF_*
expansions in an executable appear in
its dynamic_flag_list
section, and
the dynamic_flag_list_state
function
enumerates them at runtime. Periodic audits will reveal flags that
have become obsolete, and flags are easy to find:
each flag’s full name includes its location in the source code.
We find value in dynamic_flag
because its runtime overhead is
negligible for all but the most demanding
code,2 while the interface lets us easily
make chunks of code toggleable at runtime without having to worry
about things like “where am I supposed to register this new option?”
The same system is efficient and ergonomic enough for all teams in
all contexts, avoids contention in our source tree, and guarantees
discoverability for whoever happens to be on call.
How to use dynamic_flag
All dynamic flags have a “kind” (namespace) string, and a name. We
often group all flags related to an experimental module or feature in
the same “kind,” and use the name to describe the specific functionality
in the feature guarded by the flag. A dynamic flag can be disabled by
default (like a feature flag), or enabled by default, and evaluating a
dynamic flag’s value implicitly defines and registers it with the
dynamic_flag
library.
A dynamic flag introduced with the DF_FEATURE
macro,
as in the code snippet below, is disabled (evaluates to false) by
default, and instructs the compiler to optimise for that default value.
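A minimal sketch of such a use site might look like this (the kind, name, and planner functions here are illustrative):

```c
#include "dynamic_flag.h"

struct query;
void stable_planner(struct query *);        /* hypothetical helpers */
void experimental_planner(struct query *);

void
plan_query(struct query *query)
{
	/* Disabled by default: the compiler optimises for the flag-off
	 * path, so the experimental branch stays out of line.
	 */
	if (DF_FEATURE(my_module, flag_name)) {
		experimental_planner(query);
		return;
	}

	stable_planner(query);
}
```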
We can instead enable code by default and optimise for cases where the
flag is enabled with the DF_DEFAULT
macro.
Each DF_*
condition
in the source is actually its own flag;
a flag’s full name looks like kind:name@source_file:line_number
(e.g.,
my_module:flag_name@<stdin>:15
), and each condition has its own
state record. It’s thus safe, if potentially confusing, to define flags
of different types (feature or default) with the same kind and
name. These macros may appear in inline
or
static inline
functions: each instantiation will get its own metadata block, and an
arbitrary number of blocks can share the same full name.
Before manipulating these dynamic flags,
applications must call dynamic_flag_init_lib
to initialise the library’s state.
Once the library is initialised, interactive or configuration-driven
usage typically toggles flags by calling
dynamic_flag_activate
and dynamic_flag_deactivate
with POSIX extended regexes that match
the flags’ full names.
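For example, a configuration reload handler might drive the library roughly as follows (a sketch; we assume, per the description above, that the activation calls each take a single extended-regex string):

```c
#include <stdbool.h>
#include "dynamic_flag.h"

void
flags_setup(void)
{
	/* One-time initialisation, before any flag is manipulated. */
	dynamic_flag_init_lib();
}

void
flags_reconfigure(bool enable)
{
	/* The pattern is a POSIX extended regex over full flag names;
	 * this one matches every flag in the my_module kind.
	 */
	if (enable)
		dynamic_flag_activate("^my_module:");
	else
		dynamic_flag_deactivate("^my_module:");
}
```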
Using dynamic_flag
programmatically
The DF_FEATURE
and DF_DEFAULT
macros
directly map to classic feature flags, but the
dynamic_flag
library still has more to offer. Applications can
programmatically enable and disable blocks of code to implement a
restricted form of aspect oriented programming:
“advice” cannot
be inserted post hoc, and must instead be defined inline in the
source, but may be toggled at runtime by unrelated code.
For example, an application could let individual HTTP requests opt
into detailed tracing with a query string parameter ?tracing=1
, and
set request->tracing_mode = true
in its internal request object when
it accepts such a request. Environments where fewer than one request
in a million enables tracing could easily spend more aggregate time
evaluating if (request->tracing_mode == true)
than they do in the
tracing logic itself.
One could try to reduce the overhead by coalescing the trace code in
fewer conditional blocks, but that puts more distance between the
tracing code and the traced logic it’s supposed to record, which
tends to cause the two to desynchronise and adds to development
friction.
It’s tempting to instead optimise frequent checks for the common case
(no tracing) with a dynamic flag that is enabled whenever at least one
in-flight request has opted into tracing. That’s why
the DF_OPT
(for opt-in logic) macro exists.
The
DF_OPT
macro
instructs the compiler to assume the flag is disabled, but leaves the
flag enabled (i.e., the conditional always evaluates
request->tracing_mode
) until the library is initialised with
dynamic_flag_init_lib
.3 After initialisation,
the flag acts like a DF_FEATURE
(i.e., the overhead is a test eax
instruction that falls through without any conditional branching)
until it is explicitly enabled again.
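For the tracing example, the guarded check might look like the following sketch (the flag name and helper functions are hypothetical):

```c
#include <stdbool.h>
#include "dynamic_flag.h"

struct request {
	bool tracing_mode;
	/* ... */
};

void trace_query_plan(struct request *);    /* hypothetical helpers */
void execute_query(struct request *);

void
handle_request(struct request *request)
{
	/* Compiles to a fall-through test instruction while no request
	 * is being traced; only inspects request->tracing_mode once a
	 * request_tracing flag has been activated.
	 */
	if (DF_OPT(request_tracing, query_plan)) {
		if (request->tracing_mode)
			trace_query_plan(request);
	}

	execute_query(request);
}
```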
With this flag-before-check pattern, it’s always safe to enable
request_tracing
flags: in the worst case, we’ll just look at the
request object, see that request->tracing_mode == false
, and skip
the tracing logic. Of course, that’s not ideal for performance. When
we definitely know that no request has asked for tracing, we want to
disable request_tracing
flags and not even look at the request
object’s tracing_mode
field.
Whenever the application receives a request that opts into tracing, it
can enable all flags with kind request_tracing
by
executing dynamic_flag_activate_kind(request_tracing, NULL)
.
When that same request leaves the system (e.g., when the application
has fully sent a response back), the application
undoes the activation with dynamic_flag_deactivate_kind(request_tracing, NULL)
.
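Continuing the sketch above, hypothetical request entry and exit hooks could pair the two calls like so:

```c
void
on_request_start(struct request *request)
{
	if (request->tracing_mode)
		dynamic_flag_activate_kind(request_tracing, NULL);
}

void
on_request_done(struct request *request)
{
	if (request->tracing_mode)
		dynamic_flag_deactivate_kind(request_tracing, NULL);
}
```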
Activation and deactivation calls actually increment and decrement
counters associated with each instance of a DF_...
macro, so this scheme works correctly when
multiple requests with overlapping lifetimes opt into tracing:
tracing blocks will check whether request->tracing_mode == true
whenever at least one in-flight request has tracing_mode == true
, and
skip these conditionals as soon as no such request exists.
Practical considerations for programmatic manipulation
Confirming that a flag is set to its expected value (disabled for
DF_FEATURE
and DF_OPT
, enabled for DF_DEFAULT
) is fast…
because we shifted all the complexity to the flag flipping
code. Changing the value for a set of flags is extremely slow
(milliseconds of runtime and several IPIs for
multiple mprotect(2)
calls), so it only makes sense to use dynamic flags when they are
rarely activated or deactivated (e.g., less often than once a minute
or even less often than once an hour).
We have found programmatic flag manipulation to be useful not just for
opt-in request tracing or to enable log statements, but also to
minimise the impact of complex logic on program phases that do not
require them. For example, mutual exclusion and
safe memory reclamation deferral (PDF)
may be redundant while a program is in a single-threaded
startup mode; we can guard such code behind DF_OPT(steady_state, ...)
to accelerate startup,
and enable steady_state
flags just before spawning worker threads.
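For instance, a sketch of guarding lock acquisition this way (the cache type and helpers are hypothetical):

```c
#include <pthread.h>
#include "dynamic_flag.h"

struct cache {
	pthread_mutex_t lock;
	/* ... */
};

void
cache_insert(struct cache *cache, void *entry)
{
	/* DF_OPT flags read as false after dynamic_flag_init_lib until
	 * explicitly activated, so the lock is skipped while the
	 * program is still single-threaded.
	 */
	if (DF_OPT(steady_state, cache_lock))
		pthread_mutex_lock(&cache->lock);

	/* ... actual insertion elided ... */
	(void)entry;

	if (DF_OPT(steady_state, cache_lock))
		pthread_mutex_unlock(&cache->lock);
}

void
enter_steady_state(void)
{
	/* Flip the flags while still single-threaded, just before
	 * spawning workers, so the lock and unlock checks can't
	 * disagree in the middle of a critical section.
	 */
	dynamic_flag_activate_kind(steady_state, NULL);
	/* ... spawn worker threads ... */
}
```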
It can also make sense to guard slow paths with DF_OPT
when
a program only enters phases that need this slow-path logic every few minutes. That was the case for a
software transactional memory system with batched updates.
Most of the time, no update is in flight, so readers never have to
check for concurrent writes. These checks can be guarded with
DF_OPT(stm, ...)
conditions, as long as the program enables stm
flags around batches of updates. Enabling and disabling all these
flags can take a while (milliseconds), but, as long as updates are
infrequent enough, the improved common case (getting rid of a memory
load and a conditional jump for a read barrier) means the tradeoff
is favourable.
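A sketch of what such a guarded read barrier might look like (all names here are illustrative, not from the system described):

```c
struct stm_batch;
void stm_check_for_concurrent_write(const void *);  /* hypothetical */
void apply_updates(struct stm_batch *);              /* hypothetical */

/* Read side: skipped entirely (no load of shared state, no taken
 * branch) while no update batch is in flight.
 */
static inline void
stm_read_barrier(const void *object)
{
	if (DF_OPT(stm, read_barrier))
		stm_check_for_concurrent_write(object);
}

/* Write side: turn the read barriers on around a batch of updates. */
void
stm_apply_batch(struct stm_batch *batch)
{
	dynamic_flag_activate_kind(stm, NULL);
	apply_updates(batch);
	dynamic_flag_deactivate_kind(stm, NULL);
}
```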
Even when flags are controlled programmatically, it can be useful to
work around bugs by manually forcing some flags to remain enabled or
disabled. In the tracing example above, we could find a crash in one
of the tracing blocks, and wish to prevent requests with request->tracing_mode set from exercising that block of code.
It’s easy to force a flag into an active state: flag activations
are counted, so it suffices to activate it
manually, once. However, we want it to be safe to issue ad hoc dynamic_flag_deactivate
calls
without wedging the system in a weird state, so activation counts don’t go negative.
Unfortunately, this means we can’t use deactivations
to prevent, e.g., a crashy request tracing block from being
activated.
Flags can instead be “unhooked” dynamically. While unhooked,
increments to a flag’s activation count are silently disregarded.
The dynamic_flag_unhook
function
unhooks DF_*
conditions when their full name matches the extended POSIX regular expression it received as an argument.
When a flag has been unhooked more often than it has been rehooked,
attempts to activate it will
silently no-op. Once a flag has been unhooked, we can
issue dynamic_flag_deactivate
calls
until its activation count reaches 0.
At that point, the flag is disabled, and will remain disabled
until rehooked.
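Putting that together for the crashy tracing block, a sketch with an illustrative flag location (a flag that was activated several times would need the deactivation repeated until its count reaches zero):

```c
#include "dynamic_flag.h"

void
quarantine_crashy_tracing_block(void)
{
	/* POSIX extended regex over full flag names; the source
	 * location here is illustrative.
	 */
	const char *pattern = "request_tracing:query_plan@src/handler.c:123";

	/* Silently ignore any future activation of this flag. */
	dynamic_flag_unhook(pattern);

	/* Decrement the activation count; repeat as needed until it
	 * hits zero, after which the flag stays disabled until rehooked.
	 */
	dynamic_flag_deactivate(pattern);
}
```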
The core implementation trick
The introduction of asm goto
in GCC 4.5
made it possible to implement control operators in inline assembly.
When the condition actually varies at runtime, it usually
makes more sense to set an output variable with a condition code,
but dynamic_flag
conditions are actually static in machine code:
each DF_* macro expands to one 5-byte instruction, a test eax, imm32 that falls through to the common path when the flag holds its expected value (i.e., enabled for DF_DEFAULT, disabled for DF_FEATURE and DF_OPT), or a 32-bit relative jmp rel32 to the unexpected path (disabled for DF_DEFAULT, enabled for DF_FEATURE and DF_OPT) otherwise. Activating and deactivating dynamic flags toggles the corresponding target instruction between test eax, imm32 (opcode 0xA9) and jmp rel32 (opcode 0xE9).
The DF_...
macros expand into a
lot more inline assembly than just that one instruction;
the rest of the expansion is a lot of noise to
register everything with structs and pointers in dedicated
sections. Automatic static registration is mostly orthogonal to the
performance goals, but is key to the (lazy-)programmer-friendly
interface.
We use test eax, imm32
instead of a nop because it’s exactly five
bytes, just like jmp rel32
, and because its 4-byte immediate is in
the same place as the 4-byte offset of jmp rel32
. We can thus encode
the jump offset at assembly-time, and flip between falling through to
the common path (test
) and jumping to the unexpected path (jmp
) by
overwriting the opcode byte (0xA9 for test
, 0xE9 for jmp
).
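Here is a self-contained sketch of the idea (not the library’s actual macro expansion) for x86-64 Linux with a GCC-compatible compiler; it patches its own test/jmp opcode byte at runtime:

```c
/* Build with: cc -O2 -o sketch sketch.c */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Address of the patchable instruction, recorded by the asm below. */
extern uint8_t *patch_site;

static __attribute__((noinline)) int
flag_is_enabled(void)
{
	__asm__ goto (
	    "1:\n\t"
	    ".byte 0xA9\n\t"                    /* test %eax, imm32 */
	    ".long %l[enabled] - (1b + 5)\n\t"  /* imm32 doubles as rel32 */
	    ".pushsection .data\n\t"
	    ".balign 8\n\t"
	    ".globl patch_site\n\t"
	    "patch_site: .quad 1b\n\t"
	    ".popsection"
	    : /* no outputs */ : /* no inputs */ : "cc" : enabled);
	return 0;

enabled:
	return 1;
}

static void
set_flag(int enable)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uintptr_t page = (uintptr_t)patch_site & ~(uintptr_t)(page_size - 1);

	/* The real library is more careful about W^X; RWX keeps this short. */
	mprotect((void *)page, page_size, PROT_READ | PROT_WRITE | PROT_EXEC);
	*patch_site = enable ? 0xE9 : 0xA9;  /* jmp rel32 : test %eax, imm32 */
	mprotect((void *)page, page_size, PROT_READ | PROT_EXEC);
}

int
main(void)
{
	printf("default: %d\n", flag_is_enabled());  /* 0 */
	set_flag(1);
	printf("enabled: %d\n", flag_is_enabled());  /* 1 */
	set_flag(0);
	printf("reset:   %d\n", flag_is_enabled());  /* 0 */
	return 0;
}
```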
Updating a single byte for each dynamic flag avoids questions
around the correct order for writes. This single-byte
cross-modification (we overwrite instruction bytes while other threads
may be executing the mutated machine code) also doesn’t affect the
size of the instruction (both test eax
and jmp rel
span 5 bytes),
which should hopefully suffice to avoid sharp edges around instruction
decoding in hardware, despite our disregard for
Intel’s recommendations regarding cross-modifying code in Section 8.1.3 of the SDM.4
The library does try to protect against code execution exploits by
relaxing and reinstating page protection with mprotect(2) around all cross-modification writes. Since mprotect
-ing from
Read-Write-eXecute permissions to Read-eXecute acts as a
membarrier
(issues IPIs) on Linux/x86-64,
we also know that the updated code is globally visible by the time
a call to dynamic_flag_activate
, etc., returns.
It’s not practical to bounce page protection for each DF_
expansion,
especially with inlining (some users have hundreds of inlined calls to
flagged functions, e.g., to temporarily paper over
use-after-frees by nopping out a few calls to free(3)
). Most of the
complexity in dynamic_flag.c
is simply in
gathering metadata records for all DF_
sites that should be activated or deactivated, and in
amortising mprotect
calls for stretches of DF_
sites on contiguous pages.
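A sketch of that batching idea, assuming a plain array of patch addresses (none of these names come from the library):

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static int
cmp_addr(const void *vx, const void *vy)
{
	uintptr_t x = (uintptr_t)*(uint8_t *const *)vx;
	uintptr_t y = (uintptr_t)*(uint8_t *const *)vy;

	return (x > y) - (x < y);
}

/*
 * Overwrite the opcode byte at each patch site, issuing one mprotect
 * pair per run of contiguous pages instead of one pair per site.
 */
static void
patch_sites(uint8_t **sites, size_t n, uint8_t opcode)
{
	const uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);

	qsort(sites, n, sizeof(sites[0]), cmp_addr);
	for (size_t i = 0; i < n; ) {
		uintptr_t begin = (uintptr_t)sites[i] & ~(page - 1);
		uintptr_t end = begin + page;
		size_t j = i;

		/* Grow the range while sites land on covered or adjacent pages. */
		while (j < n && (uintptr_t)sites[j] < end + page) {
			if ((uintptr_t)sites[j] >= end)
				end += page;
			j++;
		}

		mprotect((void *)begin, end - begin,
		    PROT_READ | PROT_WRITE | PROT_EXEC);
		for (size_t k = i; k < j; k++)
			*sites[k] = opcode;  /* 0xA9 (test) or 0xE9 (jmp) */
		mprotect((void *)begin, end - begin, PROT_READ | PROT_EXEC);

		i = j;
	}
}
```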
Sometimes, code is just done
The dynamic_flag
library
is an updated interface for the core implementation of the
6-year-old an_hook
,
and reflects years of experience with that functionality. We’re happy
to share it, but aren’t looking for feature requests or contributions.
There might be some small clean-ups as we add support for ARM or
RISC-V, or let the library interoperate with a Rust implementation.
However, we don’t expect changes to the interface, i.e., the DF_
macros
and the activation/deactivation functions, nor to its core structure,
especially given the contemporary tastes for hardening (for example,
the cross-modification approach is completely incompatible with
OpenBSD’s and OS X’s strict W^X
policies). The library works for our
target platforms, and we don’t wish to take on extra complexity that
is of no benefit to us.
Of course, it’s Apache licensed,
so anyone can fork the library and twist it beyond
recognition. However, if you’re interested in powerful patching
capabilities, dynamic languages (e.g., Erlang, Common Lisp, or even
Python and Ruby), or tools like Live++
and Recode may be more appropriate.5
We want dynamic_flag
to remain simple and just barely flexible
enough for our usage patterns.
Thank you, Jacob, Josh, and Per, for feedback on earlier versions.
-
It’s no accident that canonical dynamic languages like Smalltalk, Forth, and Lisp are all image-based: how would an image-based system even work if it were impossible to redefine functions or types? ↩
-
Like guaranteed optimisations in Lisps, the predictable performance impact isn’t important because all code is performance sensitive, but because performance is a cross-cutting concern, and a predictably negligible overhead makes it easier to implement new abstractions, especially with the few tools available in C. In practice, the impact of considering a code path reachable in case a flag is flipped from its expected value usually dwarfs that of the single test instruction generated for the dynamic flag itself. ↩
-
Or if the dynamic_flag library isn’t aware of that DF_OPT, maybe because the function surrounding that DF_OPT conditional was loaded dynamically. ↩
-
After a few CPU-millennia of production experience, the cross-modification logic hasn’t been associated with any “impossible” bug, or with any noticeable increase in the rate of hardware hangs or failures. ↩
-
The industry could learn a lot from game development practices, especially for stateful non-interactive backend servers and slow batch computations. ↩