This post started out of the need to provide a little more clarification after
a long and heated discussion on Twitter (initial discussion and follow up) about the origins
of Control-Flow Integrity (CFI), the contributions of academia, and the
precision, performance, and compatibility of different existing
implementations.
Without memory corruption, only a single target is valid for each executed
indirect control-flow transfer, but due to the imprecision of the static
CFI policy, all targets in the set must be accepted (i.e., the implementations
are neither context nor path sensitive). Similarly, due to imprecision in the
analysis, the target sets are an over-approximation. In practice, most
implementations use a type-based analysis and group all functions with the same
prototype into a target set. These two imprecisions allow the attacker to
replace the original target with another one from the target set without
triggering the defense. The power of CFI as a mitigation is therefore tied to
the size of the sets. Intuitively, the smaller the sets, the stronger the
defense.
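To illustrate, here is a hypothetical example (the function names are made up): both functions share the prototype void(int) and therefore land in the same target set under a type-based policy.
/* Hypothetical example: both functions have the prototype void(int) and are
 * therefore members of the same type-based target set. */
void log_event(int id)  { /* ... */ }
void drop_privs(int id) { /* ... */ }

void dispatch(void (*cb)(int), int id) {
  /* The CFI check only verifies that cb points to *some* void(int) function;
   * an attacker who corrupts cb can swap log_event for drop_privs without
   * triggering the defense. */
  cb(id);
}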
For the backward edge, a set check is insufficient due to the massive amount of
over-approximation. In its most precise form, set checking leverages the
function symbol to match the return sites for each call of that function. Most
implementations similarly leverage the function prototype to match the return
site. This means that a void(*)(void) function may return to the instruction
after any call to a void(*)(void) function, often resulting in huge
target sets. These over-approximations (both for function name and function
prototype matching) result in an opportunity for the attacker to short-circuit
different parts of the program. For example, two calls to printf in the same
function allow the attacker to implement a loop: the return address of the
second call is overwritten to point back to the return site of the first call.
This shortcutting becomes a powerful primitive, often resulting in arbitrary
computation (see Control-Flow Bending for details). To protect
the backward edge, we therefore recommend stack integrity through a shadow
stack or safe stack. Intel CET is a promising
upcoming hardware implementation of a shadow stack and the LLVM shadow stack is a strong software
implementation for aarch64.
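To make the shortcut primitive concrete, consider this hypothetical function with two printf calls; under symbol- or prototype-based return matching, either return site is a valid target for either call.
#include <stdio.h>

/* Hypothetical victim: both calls target printf, so a symbol- or
 * prototype-based backward-edge check accepts either return site. */
void report(const char *a, const char *b) {
  printf("%s\n", a);  /* return site 1 */
  printf("%s\n", b);  /* return site 2: redirecting this call's return to
                         return site 1 causes the second call to execute
                         again and again, a loop the programmer never wrote. */
}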
History of taming control flow
Multiple sources claim ownership of the idea of restricting control-flow based
on set checks. The original 2005 CFI paper
coined the term control-flow integrity, formalized the property for forward and
backward edge, and evaluated a prototype implementation (that was not released
openly). The same authors revisited CFI in 2007 and proposed
a shadow stack to protect return targets.
In 2003 PaXTeam wrote in their future ideas
text file about protecting function pointers and returns. For function pointers
(c1), they propose to make the function pointers themselves read-only. For the
pointers it covers, this approach would fundamentally solve the problem:
read-only function pointers are precise, as an attacker can no longer overwrite
the function pointer itself. Unfortunately, the approach does not scale due to
transitivity: an attacker can still modify a pointer to the function pointer.
Similar to how non-executable
data pages did not prohibit ROP, read-only function pointers do not stop
control-flow hijacking. Control-flow hijacking becomes harder as the attacker
now must control an additional layer of indirection. For returns (c2), they
propose a set check based on the function prototype, similar to the original CFI
paper. Unfortunately, this idea was not implemented and the future ideas text
file was not well known and therefore not cited in the Abadi CFI paper.
After we were made aware of this text file in about 2016, we started giving
credit for the idea of restricting dynamic control flow. Given the very short
description, the exclusive focus on returns for the control-flow check, and the lack
of an implementation or evaluation, the future ideas file can serve as
inspiration but does not provide a convincing case for CFI and should therefore
not serve as the main citation for CFI. The PaX future ideas text file may be
credited for the idea of restricting control-flow as a defense together with
other related work from the same time such as program shepherding
from 2002.
But as it turns out, the idea of taming control-flow is much older than PaX.
Hoelzle, Chambers, and Ungar proposed polymorphic inline caches in 1991, which
replace indirect calls with a type check and a direct call, similar to the CFI set
check. A type mismatch could be detected and result in termination of the
program. Going back even further, Deutsch and Schiffman explored inline caches
for Smalltalk-80 in 1984. (Thanks
to Stefan Brunthaler for the references and discussion.)
The idea of taming indirect control flow is ancient with research in runtime
systems, programming languages, hardware architectures, compilers, and defenses.
This old legacy is rarely cited in the newer papers but may deserve a revisit.
It may be worthwhile to start conducting computer science archaeology as
academics regularly miss related work.
Evaluating CFI
The defense power of CFI is program dependent. Under a powerful attacker
model, the usefulness of CFI depends on whether the target set contains
the gadget(s) the attacker needs. In addition to
security, two other properties that can be evaluated are performance and
compatibility. LLVM-CFI has negligible (less than 1%) performance overhead on
standard benchmarks such as SPEC CPU2006 and we will not repeat the measurements
here.
We evaluate LLVM-CFI (version 4.0.1) and RAP (via the 4.9.24-test7 patch), two
CFI mechanisms that implement prototype-based set checks for the forward edge.
RAP also offers prototype-based set checks on the backward edge but, due to the
security concerns mentioned above, we will restrict the evaluation to the
forward edge. Our tests focus on user-space code.
Using LLVM-CFI
LLVM-CFI is extremely easy to use. Install your favorite
(recent) LLVM through your favorite package manager, e.g., through apt install
clang-3.8 for the current Debian default. To activate LLVM-CFI runtime
checking, simply compile your software with clang -flto -fsanitize=cfi
test.c. There's also plenty of documentation available if you need to know
more.
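As a quick, hypothetical way to convince yourself that the instrumentation is active, a deliberately mismatched indirect call should abort when compiled as above (a minimal sketch, not part of the official documentation):
/* test.c -- compile with: clang -flto -fsanitize=cfi test.c
 * A minimal sketch to check that LLVM-CFI is active. */
#include <stdio.h>

static int add(int a, int b) { return a + b; }

int main(void) {
  int (*ok)(int, int) = &add;
  printf("%d\n", ok(1, 2));                 /* matching prototype: allowed */

  void (*bad)(void) = (void (*)(void))&add; /* prototype mismatch */
  bad();                                    /* LLVM-CFI aborts the program here */
  return 0;
}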
Using RAP
The story for RAP is a little more involved and compiling the underlying GCC
plugin is slightly more complicated (yes, this is sarcastic). The first
challenge is to discover the actual source. The history behind RAP is a little
obscure as no official write-up or code repository exists. The only hints are
@paxteam on Twitter and an H2H presentation. Without an
official release, I resorted to searching the old PaX Linux patches as they
apparently contained a "public" version of RAP. The most recent patch I found
was pax-linux-4.9.24-test7.patch. Download
the patch, apply it against the linux-4.9.24.tar.gz tarball, enter the
extracted source tree, and run make menuconfig to select a reasonable configuration.
For some reason, the patch did not change Kconfig and the options for RAP did
not show up. I therefore had to manually edit the .config file and make sure
the following entries were selected:
CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_GCC_PLUGINS=y
CONFIG_PAX_RAP=y
After selecting this configuration, the plugin is compiled as a side effect of
building the kernel through make. After the compilation finishes, you can
grab the plugin from
linux-4.9.24-pax/scripts/gcc-plugins/rap_plugin/rap_plugin.so and RAP is
ready to use, provided you know the command line arguments. As no documentation exists
for RAP, I once again resorted to the code and discovered the following command
line switches in the Makefiles:
CFLAGS=-DRAP_PLUGIN -fplugin-arg-rap_plugin-typecheck=call,ret
CFLAGS+=-fplugin-arg-rap_plugin-hash=abs-finish
CFLAGS+=-fplugin-arg-rap_plugin-hash=abs-ops
CFLAGS+=-fplugin-arg-rap_plugin-hash=abs-attr
CFLAGS+=-fplugin-arg-rap_plugin-report=func,fptr,abs
CFLAGS+=-DX86_RAP_CALL_VECTOR=0x82
CFLAGS+=-DX86_RAP_RET_VECTOR=0x83
CFLAGS+= '-fplugin-arg-rap_plugin-callabort=int $$0x82'
CFLAGS+= '-fplugin-arg-rap_plugin-retabort=int $$0x83'
CFLAGS+= -DRAP_PLUGIN
My educated guess is that the first line activates RAP for calls and returns
while lines two through four activate higher precision depending on parts of the
function prototype. The report switch on line five prints debug information
about the hashes (and can be incredibly helpful when debugging RAP). The remaining
lines select how traps are handled. With this information, we are ready to use
RAP for test software.
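Putting this together, an invocation for a user-space test file could look roughly like the following. This is my reconstruction based on the flags above and GCC's generic -fplugin mechanism, not a documented command line (the $$ in the Makefile snippets is Make escaping for a literal $):
gcc -fplugin=./rap_plugin.so \
    -fplugin-arg-rap_plugin-typecheck=call,ret \
    '-fplugin-arg-rap_plugin-callabort=int $0x82' \
    '-fplugin-arg-rap_plugin-retabort=int $0x83' \
    -O3 -o test test.c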
Getting to this stage required about 10-15 hours of software archaeology and
pointers from @lazytyped, @raistolo, and @stevecheckoway who reached out to help after a heated
discussion on Twitter.
Orthogonally, I received a lot of love from the Twitter "hacker" community,
e.g., I now know that I'm a useless incompetent academic who is too stupid to
read code and several other things that I was not aware of before. Thanks folks,
I love you too! A core aspect of hacker culture is to share information and to help, not to
attack others who are trying to reproduce and understand.
Precision measurements
Precision is an important property for CFI. Strict CFI defenses may result in
false positives where an execution of a program is stopped even without an
attacker modifying memory. For example, if a function pointer is cast to a
different type, dereferencing it may result in a CFI violation. LLVM-CFI
enforces strict prototype checking and may therefore cause incompatibilities
with existing code that relies on loose function-pointer casts.
Both LLVM-CFI and RAP implement CFI checks based on function prototype matching:
function prototypes are encoded into a type mask (in the case of RAP) or into
an ID (for LLVM-CFI). Both policies detect if any aspect of the function
prototype is changed, e.g., the type of a parameter or the return type. Both
policies are precise and check for specific pointer types.
There is one key difference between RAP and LLVM-CFI: LLVM-CFI only allows calls
to functions that are address-taken, i.e., only functions whose address is taken
through the address-of operator can be called indirectly. Functions that are
never address-taken cannot be targets. The underlying idea behind this check is
that only a small subset of all functions is ever called indirectly, and
LLVM-CFI reduces the size of the target sets this way (instead of all
void(*)(void) functions, only the void(*)(void) functions that have their
address taken are in the set of valid targets). This vastly reduces
the size of the sets. The power to distinguish between address-taken and
not-address-taken functions comes at a price: LLVM-CFI requires -flto (link
time optimization) to decide if a function is address taken anywhere in the
program. RAP here trades precision for simplicity.
Both LLVM-CFI and RAP fail to detect a compromise of the function pointer if it
is overwritten with the address of another function of the same prototype (for
LLVM-CFI, one that is also address-taken). Note that this imprecision is
expected from any type-based CFI mechanism.
To conclude, both LLVM-CFI and RAP implement CFI through type-based prototype
matching. LLVM-CFI only matches address-taken functions while RAP matches all
functions (resulting in some imprecision).
Performance
We run the performance test by indirectly dispatching a tiny function many
times and leveraging the rdtsc time stamp counter to count the number of cycles
per dispatch (including the execution of the function). We run the
benchmark on a mobile Intel Core i7-7500 at 2.7GHz running the performance
governor. We compile the code with -O3 and average 5 executions after one warmup. The
dispatch code is:
__attribute__((noinline)) int quickfun(int a, int b) {
  __asm__ volatile("nop\n");
  return a*b;
}
...
unsigned long lo, hi, start, end;   /* 64-bit so the shifted high word is kept */
int (*ptr)(int, int) = &quickfun;
__asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
start = lo | (hi << 32);
for (unsigned long i = 0; i < NRSPEED; i++)
  ptr(a, b);
__asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
end = lo | (hi << 32);
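The per-dispatch numbers reported below are then simply the measured cycle delta divided by the number of loop iterations:
double cycles_per_dispatch = (double)(end - start) / (double)NRSPEED;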
Note that this microbenchmark only provides a very rough estimate of the
performance for one kind of dispatch. As a microbenchmark, it tries to show
worst case performance overhead for the protected dispatch.
Amazingly, RAP has no measurable performance overhead. Without RAP, gcc
compiles the code to execute in 4.172 cycles per dispatch. With RAP, the same
code executes in 4.173 cycles per dispatch. This is less than 0.04% overhead.
LLVM-CFI results in some overhead. Without CFI, LLVM compiles the code to
execute in 4.179 cycles per dispatch. With LLVM-CFI, the code executes in 5.012
cycles per dispatch. This is about 19.93% overhead for a pure dispatch.
Let's dive into the compiled code to see where the overhead comes from. Let's
start with RAP. The hot loop is compiled to:
2181: 0f 31 rdtsc
2183: 48 c1 e2 20 shl $0x20,%rdx
2187: bb 00 ca 9a 3b mov $0x3b9aca00,%ebx
218c: 48 09 c2 or %rax,%rdx
218f: 49 89 d6 mov %rdx,%r14
2192: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
2198: 48 81 7d f8 a8 2f bd cmpq $0x17bd2fa8,-0x8(%rbp)
219f: 17
21a0: 75 61 jne 2203 <speedfun+0x93>
21a2: 44 89 e6 mov %r12d,%esi
21a5: 44 89 ef mov %r13d,%edi
21a8: ff d5 callq *%rbp
21aa: 48 83 eb 01 sub $0x1,%rbx
21ae: 75 e8 jne 2198 <speedfun+0x28>
21b0: 0f 31 rdtsc
We see how the type check is executed in each loop iteration through the cmpq
instruction, dispatching after a successful comparison. The target in *%rbp
directly points to the function quickfun.
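In C-like pseudocode, the RAP check corresponds roughly to the following sketch (based on the disassembly above, not on RAP's documented internals; the constant is the prototype hash emitted for int(int, int) in this particular build):
/* Sketch: a per-prototype hash is stored in the 8 bytes just before each
 * compatible function, and every instrumented call site checks it. */
if (*(const long *)((const char *)ptr - 8) != 0x17bd2fa8L)
  __asm__ volatile("int $0x82");  /* call-site abort, cf. the callabort flag */
ptr(a, b);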
LLVM-CFI reshuffles code quite a bit. The hot loop is compiled to:
4013cc: 0f 31 rdtsc
4013ce: 49 89 d6 mov %rdx,%r14
4013d1: b9 18 15 40 00 mov $0x401518,%ecx
4013d6: 49 39 cc cmp %rcx,%r12
4013d9: 75 79 jne 401454 <main+0x244>
4013db: 49 c1 e6 20 shl $0x20,%r14
4013df: 49 09 c6 or %rax,%r14
4013e2: bb 00 ca 9a 3b mov $0x3b9aca00,%ebx
4013e7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
4013ee: 00 00
4013f0: 44 89 ff mov %r15d,%edi
4013f3: 89 ee mov %ebp,%esi
4013f5: 41 ff d4 callq *%r12
4013f8: 48 ff cb dec %rbx
4013fb: 75 f3 jne 4013f0 <main+0x1e0>
4013fd: 0f 31 rdtsc
Interestingly, the check itself is hoisted outside of the loop. The target
0x401518 contains a single jump to the real implementation of quickfun.
0000000000401518 <quickfun>:
401518: e9 03 fc ff ff jmpq 401120 <quickfun.cfi>
40151d: cc int3
40151e: cc int3
40151f: cc int3
LLVM-CFI implements the CFI check as a range check that maps into a table of
trampolines. All address-taken functions of the same type are translated into a
range where their trampolines are placed next to each other. A CFI dispatch is
then checked against this range and redirected through an 8-byte-aligned
jump-table entry for the corresponding type. The double indirection results in
a 1 cycle penalty for each dispatch.
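In C-like pseudocode, the forward-edge check corresponds roughly to this sketch (not LLVM's exact code generation; with a single valid target, as in the disassembly above, it collapses into one equality comparison against the trampoline address; table_base and table_entries are assumed names for illustration):
/* Sketch: trampolines for one prototype are 8-byte aligned and laid out
 * contiguously starting at table_base. */
unsigned long off = (unsigned long)ptr - table_base;
if ((off & 7) != 0 || (off >> 3) >= table_entries)
  __builtin_trap();   /* ptr is not a valid target of this prototype */
ptr(a, b);            /* dispatch through the trampoline */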
Given this simple measurement on a single CPU, the RAP-style instrumentation is
about 1 cycle per dispatch faster than the LLVM-CFI based one. The LLVM-CFI
instrumentation has the advantage that all valid targets are tightly encoded in
a single table, although this may also help a reverse engineer enumerate them. There is no reason why
LLVM-CFI could not implement simple prototype-based checking that is encoded
inline next to the function as proposed in the original CFI paper.
Summary
In summary, in the current implementations, RAP is slightly faster due to direct
encoding of target sets but LLVM-CFI is more precise by limiting indirect
dispatch to only address-taken functions. Both mechanisms implement a
prototype-based CFI policy.
Just looking at how easy it is to run LLVM-CFI and how complicated it is to run
RAP, I don't think any reasonable person would use RAP in their code, due to
concerns about compatibility and long-term maintainability: it is not clear if
or how RAP will be supported in the future. Also, I have no way of knowing how
complete the RAP plugin is or whether it is even the most recent version, as PaXTeam
turned closed-source and no longer openly shares the plugin (but offered to run
my tests on his machine).
I wonder why LLVM-CFI relies on a trampoline table to dispatch targets but does
not prepend each function with a type identifier. The double indirection for
LLVM-CFI seems excessive and unnecessary.
Get the full code for the benchmark and play with it yourself.