Side channel attacks such as Spectre or Meltdown allow data leakage from an
unwilling process. Until now, transient execution side channel attacks
primarily leveraged cache-based side channels to leak information. The very
purpose of a cache, that of providing faster access to a subset of data, enables
information leakage. While the world focused on a string of exploits leveraging
caches (and the memory hierarchy pyramid) and defenders tried to block data
leakage through it, we look at the core tenet enabling the channel: contention.
Contention in a CPU is not limited to cache capacity; it manifests itself in a
variety of forms when resources are shared. Freely sharing resources among
untrusted entities allows an attacker process to infer when another (victim)
process is contending for the resource, thereby slowing down the attacker.
A less obvious form of contention arises in Simultaneously Multi-Threaded (SMT)
cores. The latter lay the foundation for nearly all modern x86 CPUs, IBM POWER8,
Oracle T5 and Cavium ThunderX2/X3. In SMT, scheduling units called ports are
shared among threads of execution which can be exploited to leak information.
Port contention as a phenomenon has been previously discussed by Anders Fogh in
his 2016 blog post.
We precisely characterize the port-induced side channel (that we call SMoTher)
and demonstrate that it is possible to detect a sequence as small as a single
(schedulable) instruction tied at design time to a specific subset of ports by
leveraging contention. Leveraging SMoTher (instead of a cache-based side
channel), we present a powerful, practical transient execution attack to leak
secrets that may be held in registers or the closely-coupled L1 cache, called
SMoTherSpectre. The full paper is on arXiv,
the work is a collaboration between the EPFL HexHive
and PARSA labs, and IBM Research
Zurich and joint work between Atri
Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro
Sorniotti, Babak
Falsafi, Mathias
Payer, and Anil Kurmus.
Simultaneous multi-threading and scheduling
A Simultaneously Multi-threaded (SMT) CPU fetches and executes instructions for
more than one thread on the same core. To the operating system/user, it appears
as a greater number of logical cores than physical cores. The former is used
to denote the capability to execute a thread, while the latter is the physical
implementation of a unit (execution pipeline, registers, caches) called a core.
These (colocated) threads have a few dedicated components per thread (fetch unit
and architectural registers), while sharing the rest of the pipeline (branch
predictors, reservation station, ports, execution units). Implementations differ
in which components are shared or dedicated and the number of threads per
physical core.
A typical modern, out-of-order processor schedules micro-ops from an unified
reservation station to specialized execution units. See this
presentation
for an overview of scheduling on recent Intel microarchitectures. Core-series
processors contain 5-8 ports to perform this scheduling. Each port is
responsible for a fixed subset of execution units. Intel Skylake processors,
specifically, contain eight ports. Four of them (0,1,5 and 6) are used to
schedule operations to integer, floating-point, vector execution units among
others. The other four ports handle loads, stores, and address generation
operations. The execution units for the most commonly executed micro-ops are
replicated and associated with multiple ports. With SMT, micro-ops from both
co-located threads may reside in the same reservation station(s). In each cycle,
a single micro-op from either thread may be scheduled by each port. See
scheduling Ports on Intel Skylake
processors
for details of Intel Skylake scheduling.
SMoTher
When SMT threads have ready micro-ops which can use the same port, they must
contend for it each cycle. Each thread would need to wait for a few cycles when
the port under contention chooses to schedule a micro-op from another thread,
causing a slowdown. This slowdown is detectable (by taking timestamps using
rdtsc
on Intel's CPUs), and allows a specially crafted thread to measure the
co-located thread's utilization of a port.
Suppose threads A (Attacker) and V (Victim) run on the same physical Skylake
core, where crc32
is scheduled by port 1 and fadd
is scheduled by port 5. If
a thread V only uses other ports (example, running fadd
s), thread A running 20
one-cycle crc32
instructions should require 20 cycles. However, if the
reservation station also contains a single ready crc32
instruction from
thread V, there should be one cycle where port 1 chooses it over the micro-ops
from A. Overall, thread A now runs the same instruction in 21 cycles, which is a
5% slowdown. Longer sequences with contention have lead to attacker slowdown
up to 35% in our experiments.
Each instruction in a sequence of code can be scheduled on specific ports. This
allows us to create a port-fingerprint for every sequence, consisting of the
expected utilization of each of the ports while scheduling that sequence. For a
pair of victim sequences, V_a and V_b, with different port-fingerprints, a
carefully crafted attacker thread can identify which victim sequence is
concurrently run. Essentially, the attacker chooses one or more ports for which
the victim sequences differ in the signature. By timing instructions
specifically scheduled on these ports, the attacker can measure contention.
Higher contention means that a concurrent victim is using the same ports (and
vice versa), identifying the sequence. We call such pairs of instruction
sequences SMoTher-differentiable.
To leak information, we look at data-dependent control flow (conditional
branches) leading to SMoTher-differentiable sequences (as branch target and
fallthrough). We call this a SMoTher-gadget. The attacker can identify the
sequence following the branch, and thereby infer the outcome of the condition.
The information leaked depends on the branch condition. Common examples include
specific bits in registers or memory (for example, TEST 0x1, al; jz TGT
, or
CMPB 0x0, (rdx); jl TGT
).
SMoTherSpectre
SMoTherSpectre is a speculative code-reuse attack. Speculative execution at
particular points of a victim's execution is influenced to execute
SMoTher-gadgets, leaking information.
As an example, an attacker can use branch target injection (BTI) to redirect the
speculative execution following an indirect jump/call (until the target is
calculated) on a co-located victim process (shared branch predictor). This code
sequence, including the indirect branch, is named the BTI gadget. At this
point, we require a register or a memory location (with a pointer to it) to hold
the secret to be leaked. The code below is an example BTI gadget, where a secret
is loaded into rdi
and the indirect jump will eventually branch to the
pointer target loaded into rax
.
BTI gadget:
load rdi, (secret)
load rax, (pointer)
jmp [rax]
The poisoned target leads the victim to execute a different data-dependent
control-flow sequence--i.e., a conditional branch--somewhere in the victim code.
The condition must use the secret, so that the branch leaks the outcome of the
condition if the attacker can figure out which instructions the victim executed
just afterwards (i.e., either the branch target or the fall-through). For
SMoTher-gadgets, conditional branches where the two subsequent paths are
SMoTher-differentiable, the attacker can figure out if the branch was taken or
not taken by introducing contention on one or more ports which the target and
fallthrough use for different number of cycles. In the SMoTher-gadget shown
below, crc32
is scheduled by port 1 on Skylake, while ror
is scheduled on
ports 0 and 6.
SMoTher gadget:
cmp rdi, 0
jl <mark>
crc32
crc32
...
mark:
ror
ror
...
An attacker running a sequence of crc32
instructions will contend on port 1
if the victim branch falls though and runs crc32
instructions too. The
slowdown due to contention can be detected by the attacker, using rdtsc
timestamps to count the number of cycles taken to run its sequence, and allows
it to infer that the victim's secret is not less than 0.
Gadgets
SMoTherSpectre requires two gadgets in the victim code base:
- A BTI gadget (to trigger speculation):
C/C++ compilers typically use indirect call instructions to implement calls
using function pointers/virtual-function calls in code which does not deploy
repoline defences. OpenSSL's EVP library uses such a pointer to, e.g.,
encrypt/decrypt using the selected cipher.
i = ctx->cipher->do_cipher(ctx, out, in, inl);
Further, a secret argument may be held in registers. In the example above,
register rdx
holds a pointer to the secret plaintext being encrypted.
- A SMoTher gadget (to leak the secret):
Every conditional branch in a victim's binary is a potential SMoTher gadget.
In fact, even unintended sequences which may be interpreted as a conditional
jump can be used (similar to ROP). However, an attacker needs to be able to
differentiate between the target and fallthrough sequences using port contention.
It turns out that hundreds to thousands of such sequences exist in
glibc
(with
different degrees of SMoTher-differentiability). The paper describes in more
detail our methodology for finding and ranking SMoTher-gadgets.
The vast availability of SMoTher gadgets makes SMoTherSpectre such a powerful
attack. While the first stage is similar to other speculative execution attacks,
the side channel to leak information is different and more readily available
than cache-based side channels. While each step leaks only one bit of
information (the conditional branch that depends on the secret value was taken
or not taken), SMoTher-gadgets are more readily available and can be combined
the leak information.
Evaluation
We release the proof of concept
code to enable other researchers to
reproduce, evaluate, and assess this side channel. It uses the gadgets described
above. Separate processes are used for the attacker and victim. Per iteration of
the experiment, a randomly-generated bit (representing the victim's secret) is
written by the victim process to a file, while the attacker writes its guess to
another file. For each secret, we repeat the attack multiple times to allow the
attacker to get multiple samples. Multiple samples allows the attacker to
eliminate some of the noise that invariably creeps into a timing experiment at
such fine granularity. We run 1,000 iterations and compare the files in
post-processing to calculate the attacker's accuracy in guessing the secret.
The histogram of the attacker's timing for the crc32
sequence in the SMoTher
phase shows separate plots for when the actual secret is zero or one.
Specifically, the attacker's timing is the average over 9 runs with the same
secret. This figure clearly shows that an attacker can use a threshold of 94
cycles to make a guess of the secret with high confidence.
Overall, our attacker was able to guess the secret with an accuracy from 60%
(with one sample) to 98% (with 9 samples).
OpenSSL exploit
We also created a concept exploit for OpenSSL's (commit f1d49ed
, dated
27-Nov-2018) high level EnVeloP (EVP) API. We modelled a victim program using
OpenSSL to encrypt data. Specifically, the program calls EVP_EncryptUpdate
to
encrypt chunks of data. An indirect call in the function serves as our BTI
gadget.
At the BTI gadget, the register rdx
holds a pointer to the plaintext being
encrypted (victim secret). A SMoTher-gadget from glibc
is used, comparing the
first byte in memory referenced by rdx
against zero. In effect, this is
leaking information about the first byte of the plaintext.
The distribution of the attacker's timing with different secret values are
distinguishable by statistical tests such as Student's t-test, implying that an
attacker which is able to run multiple encryption runs of the same plaintext can
identify the secret.
Mitigation
Mitigating SMoTherSpectre is possible by mitigating the port contention side
channel or the transient execution side channel (BTI in our PoC).
There are a range of mitigations for BTI (commonly known as Spectre v2
mitigations), including enabling the Single Thread Indirect Branch Predictors
(STIBP) feature on Intel processors. We saw that microcode updates
(https://downloadcenter.intel.com/search?keyword=linux+microcode) dated
2017-07-07 and after as published by Intel prevented BTI on our released PoC
code. All security-critical userspace programs should be compiled with
retpolines.
However, BTI is one of multiple avenues for influencing indirect branches (or
returns) on victim processes. Newer Spectre variants continue to propose
alternate methods for influencing branch speculation. Therefore, defense against
SMoTher is required to fully mitigate this transient execution attack.
The general idea of preventing SMoTher leaking information is to ensure that two
threads with different privileges (in the general sense) do not compete for the
same execution port. An obvious scenario is threads from separate users sharing
a physical core. However, in certain cases, threads from the same Linux user
can represent different mutually-untrusting entities. Therefore, the strongest
defence is disabling simultaneous multi-threading.
Disclosure
We discovered the SMoTher side channel in June 2018 and developed the
SMoTherSpectre speculative side channel proof of concept in November 2018. The
co-authors at IBM Research disclosed the findings internally to IBM. We
disclosed the vulnerability to Intel (on December 05, 2018) and to OpenSSL (on
December 05, 2018). AMD was also notified as part of IBM's internal disclosure
process. After the acknowledgement of receiving our PoC, we did not receive any
feedback from Intel or OpenSSL. The IBM internal disclosure process completed on
February 28, 2019 and we are releasing the details of the vulnerability on March
06, 2019.
The full paper is on arXiv. The PoC
code enables reproduction. Contact:
Mathias Payer or @gannimo on
Twitter.