A week in Rome: what could be better than sitting all day in a conference room?
POPL (Principles Of Programming Languages) is one of these great
conferences that have been around forever and sound very interesting to
any systems person. The definition of POPL according to the SIGPLAN
homepage is: "Principles of Programming Languages symposium addresses
fundamental principles and important innovations in the design,
definition, analysis, and implementation of programming languages,
programming systems, and programming interfaces." During my PhD I
was in a very (computer) systems and compiler oriented group and POPL
always sounded like a fun conference. So I was quite happy when I got
some of my older PhD work accepted for the PPREW (Program Protection
and Reverse Engineering Workshop) that was co-located with POPL this
year. The 2013 POPL venue was in Rome, Italy which was even better!
The POPL conference
Before the conference I quickly browsed through the conference program,
recognized some names (hi Vijay D'Silva, Daniel Kroening, Ben Livshits,
and Domagoj Babic), and found that some of the papers sounded quite intriguing
(although I feared that I would have some trouble understanding the
overly theoretical papers and talks).
The keynote on the first day by Georges Gonthier (MS Research) was
mostly about proving theorems in Coq (a theorem prover) and examples
of how to use Coq (which did help me understand the whole theorem
proving world).
The highlight of the first day was Vijay D'Silva's talk about
Abstract Conflict Driven Learning (unfortunately I did not find
a free PDF version of this paper). I really enjoyed his presentation
(he's a great speaker) and the paper was added towards the top of my
reading stack.
The second keynote by Shriram Krishnamurthi (Brown University)
argued that JavaScript and HTML are basically evil languages with no
built-in safety concepts. Missing language abstractions and the full
implicit trust in all external library functions (e.g., any included
JavaScript library such as jQuery has full access to all data
of a website; all code runs in the same privilege domain with the
same priorities) make it hard for websites to reuse existing libraries.
Shriram discussed sandboxing approaches and other language approaches to
overcome this problem.
The highlight talk of the second day was Ben Livshits' (MS
Research) talk Towards fully automatic placement of security
sanitizers and declassifiers. A sanitizer is a function that cleans
a user-supplied input string of any potentially malicious/bad
character sequences (e.g., escape characters if the string is used in
an SQL statement, or JavaScript tags if the string is used as part of
an HTML page). A problem is that most sanitizers are neither idempotent
(i.e., they cannot be applied repeatedly to the same data without
changing the data) nor reversible (i.e., removing bad characters cannot
be undone). Ben argues that programmers are unable to place these
sanitizers at the correct locations and that an automatic analysis tool will
do a better job. The core of their analysis piggybacks on existing
compiler data-flow techniques and uses both a static and a dynamic
component. The static component uses node-based placement of
sanitizers for most of the statically detected input sources. The
remaining input sources are tagged dynamically and use dynamic data-flow
tracking to add sanitizers at runtime. So for most of the input
flows the algorithm is able to determine the location of the sanitizer
statically, while for the remaining ones the algorithm falls back
to a dynamic approach.
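To illustrate the idempotence problem, here is a tiny Python sketch (my own toy example, not from the paper): a typical HTML-escaping sanitizer applied twice double-escapes the data, which is exactly why placing one sanitizer too many is just as harmful as placing one too few.

```python
# A typical HTML sanitizer is not idempotent: escaping already-escaped
# data mangles it. (Toy example, not taken from the paper.)
import html

user_input = "<script>alert(1)</script>"

once = html.escape(user_input)
twice = html.escape(once)

print(once)    # &lt;script&gt;alert(1)&lt;/script&gt;
print(twice)   # &amp;lt;script&amp;gt;alert(1)&amp;lt;/script&amp;gt; (double-escaped)
```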
I really liked the presentation and when Ben visited Berkeley one week
after POPL I took the opportunity to talk to him a bit before his
(re-)presentation. What bugs me a bit is that the approach is limited to
strings (so no full data tracking for all kinds of data flows) and that
the strings remain static during the process. If strings are
used/concatenated during the execution then some additional special
sanitizers must be added that tag parts of the string (e.g., if an
SQL string is being built up from multiple sources, then the approach
cannot handle partial strings or concatenated strings).
The keynote of the third day was by Noah Goodman (Stanford) and he
talked about principles and practice of probabilistic programming. The
talk focused on how we can deal with uncertainty: an easy solution is
to extend the current value system and add a probabilistic component to
it. After the motivation he moved on to the lambda calculus and explained
how the basic lambda calculus can be extended with stochastic methods. I
really liked his presentation and was able to follow most of the
talk, although its focus was not exactly in my comfort zone.
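To make the idea a bit more concrete, here is a minimal Python sketch instead of the lambda calculus (the rain/sprinkler model and the naive rejection-sampling "inference" are my own toy choices, not from the talk): ordinary values are replaced by random choices, and conditioning on an observation turns running the program into an inference problem.

```python
# Minimal probabilistic-programming sketch: random choices as values,
# rejection sampling as a (very naive) inference strategy.
import random

def flip(p: float) -> bool:
    return random.random() < p

def model():
    rainy = flip(0.3)                          # latent random choice
    sprinkler = flip(0.1)                      # latent random choice
    wet_lawn = rainy or sprinkler or flip(0.05)
    return rainy, wet_lawn

# Condition on the observation "the lawn is wet" by discarding samples
# that contradict it, then estimate P(rainy | wet lawn), roughly 0.75.
kept = [rainy for rainy, wet in (model() for _ in range(100_000)) if wet]
print(f"P(rainy | wet lawn) ~ {sum(kept) / len(kept):.2f}")
```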
The highlight talk of the third day was Sigma*: Symbolic Learning of
Stream Filters by Matko Botincan and Domagoj Babic. They
presented a nice concept of how to learn a symbolic
model of an application. They present an iterative process that
generates both under-approximations (using dynamic symbolic execution)
and over-approximations (using counter-example guided abstraction
refinement) of an application. During each step they use an SMT solver
to test for equivalence. If the two approximations are not equal then
they either extend the under-approximation or shorten the
over-approximation until they reach a fixpoint.
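My rough mental model of that loop, as a toy Python sketch (this is my own reconstruction, not the paper's implementation: a parity check over 4-bit "streams" plays the black-box filter, and a brute-force search stands in for symbolic execution, abstraction refinement, and the SMT equivalence check):

```python
# Toy version of the refine-until-equal loop: keep an under- and an
# over-approximation of the accepted set and resolve disagreements
# with counterexamples until both coincide.
DOMAIN = [f"{i:04b}" for i in range(16)]      # all 4-bit input "streams"

def target_filter(s: str) -> bool:
    """The black-box stream filter to learn (accepts even parity)."""
    return s.count("1") % 2 == 0

def find_disagreement(under: set, over: set):
    """Stand-in for the SMT equivalence query between the two models."""
    for s in DOMAIN:
        if (s in over) != (s in under):
            return s
    return None

under, over = set(), set(DOMAIN)              # accept nothing vs. accept everything
while (cex := find_disagreement(under, over)) is not None:
    if target_filter(cex):
        under.add(cex)                        # extend the under-approximation
    else:
        over.discard(cex)                     # shrink the over-approximation

print(under == over)                          # True: fixpoint reached
```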
The PPREW workshop
After three days of POPL I was looking forward to a more low-level,
computer-systems-related, and machine-code-oriented workshop, and
PPREW (Program Protection and Reverse Engineering Workshop) met and
exceeded my expectations.
First of all, (Jeffrey) Todd McDonald gave a great introduction and
welcomed us all to the 2nd incarnation of this workshop. He hopes to
grow the workshop in the future and to attract more academic work in the
areas of program protection and reverse engineering.
Arun Lakhotia (University of Louisiana at Lafayette) gave the keynote
and talked about Fast Location of Similar Code Fragments Using
Semantic 'Juice' (BinJuice). His argument: we have to move away from the old
question whether a binary is 'bad'/malicious or not and move towards the
question of who is actually responsible for the malicious code. His idea is
to recast malware analysis as an information retrieval problem: find the closest
match of a new (malware) sample to existing code in a large code
library; or partition all your samples inside the library into classes
and families. As a next step it is possible to use machine learning to
classify similarity. Code is related through code evolution (bug fixes,
capabilities, code reuse, shared libraries).
A prerequisite is defining features that can be used to classify
individual executables. As part of this research they evolved the
features using different forms of abstraction: a sequence of n
bytes; disassembled text; disassembled text without relocations;
or n-mnemonic words like (je push) (push movl).
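As a concrete illustration of the last feature type, here is a small Python sketch that extracts 2-mnemonic words from a made-up disassembly snippet (the instructions are invented; any disassembler output would do):

```python
# Extract n-mnemonic "words" (n = 2) from disassembled text.
disassembly = [
    "cmp eax, 0x10",
    "je 0x401040",
    "push ebp",
    "mov ebp, esp",
    "call 0x401200",
]

mnemonics = [line.split()[0] for line in disassembly]
n = 2
features = [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]
print(features)   # [('cmp', 'je'), ('je', 'push'), ('push', 'mov'), ('mov', 'call')]
```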
An important finding is that current malware performs only block-level
transformations. The extracted semantics catch any changes inside a block; a
remaining problem is when unused registers are used for bogus computation
across blocks.
The first talk was by Jack W. Davidson (University of Virginia)
about Software Protection for Dynamically-Generated Code. The
focus of this work is program protection: slowing down reverse
engineering of a piece of code. If code is available then reverse
engineering will always be possible; the question is how fast the 'bad
guys' will be able to crack a software protection. Jack proposes
process-level virtualization (PVM): using a dynamic binary translator
that decrypts encrypted portions of a binary executable at
runtime. A static compilation process adds additional guards into the
executable, encodes specific parts of the binary, and encrypts them using some key.
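A heavily simplified sketch of the decrypt-at-runtime idea (my own toy, not the presented system: the "protected code" is Python source instead of machine code, the "encryption" is a single-byte XOR, and a tiny dispatcher stands in for the dynamic translator):

```python
# Protected regions are stored encrypted and only decrypted right
# before they are executed by the dispatcher.
KEY = 0x5A

def xor(blob: bytes) -> bytes:
    return bytes(b ^ KEY for b in blob)

# Shipped inside the "executable" in encrypted form.
protected = {
    "check_license": xor(b"print('license check ran')"),
    "main_logic":    xor(b"print('main logic ran')"),
}

def dispatch(region: str) -> None:
    exec(xor(protected[region]).decode())   # decrypt just in time, then run

dispatch("check_license")
dispatch("main_logic")
```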
The second talk was by Mathias Payer (aka your humble author, currently a
PostDoc at UC Berkeley) about String Oriented Programming:
When ASLR is Not Enough. In this talk Mathias discusses protection
mechanisms of current systems and their drawbacks. Data Execution
Prevention (DEP) is a great solution against code injection, while
code-reuse attacks like Return Oriented Programming (ROP) are
still possible. Address Space Layout Randomization (ASLR) and stack
canaries are probabilistic protections against data overwrite attacks
that rely on a secret that is shared with all code in the process. Any
information leak can be used to reveal that secret and to break the
probabilistic protection. String Oriented Programming is an approach
that exploits format string bugs to write to specific static regions in
regularly compiled executables, redirecting control flow to carefully
prepared invocation frames on the stack without triggering a stack
canary mismatch and circumventing ASLR and DEP as well.
In the third talk Viviane Zwanger talked about Kernel Mode API
Spectroscopy for Incident Response and Digital Forensics. In the
talk Viviane presented her ideas on "API spectroscopy": she measures API
function calls of kernel drivers, classifies these function calls into
different groups, counts the number of calls per group, and uses a
feature vector based on these counts to classify
potentially malicious drivers. The analysis is based on a static
approach where she uses kernel debugging functionality to get a dump of
the kernel memory image to extract individual drivers. The static
analysis then counts each function call and generates the feature
vector.
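Roughly how I picture the feature extraction, as a small Python sketch (the API-to-group table and the example call list are invented for illustration; the real work operates on driver code extracted from a kernel memory dump):

```python
# Map API calls to coarse groups, count per group, and emit the
# resulting "spectrum" as a feature vector.
from collections import Counter

API_GROUPS = {                      # hypothetical grouping
    "ZwCreateFile": "filesystem",
    "ZwWriteFile": "filesystem",
    "ObRegisterCallbacks": "callbacks",
    "PsSetCreateProcessNotifyRoutine": "callbacks",
    "KeAttachProcess": "process",
    "ZwQuerySystemInformation": "introspection",
}
GROUPS = ("filesystem", "callbacks", "process", "introspection", "other")

def spectrum(api_calls):
    counts = Counter(API_GROUPS.get(name, "other") for name in api_calls)
    return [counts.get(group, 0) for group in GROUPS]

# API calls found statically in a hypothetical driver:
calls = ["ZwCreateFile", "ZwWriteFile", "ZwWriteFile",
         "PsSetCreateProcessNotifyRoutine", "KeAttachProcess"]
print(dict(zip(GROUPS, spectrum(calls))))
# {'filesystem': 3, 'callbacks': 1, 'process': 1, 'introspection': 0, 'other': 0}
```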
In the last talk Martial Bourquin presented BinSlayer: Accurate
Comparison of Binary Executables. He told us that there is no good
way to compare two different binaries. The only tool that is readily
available is Halvar Flake's BinDiff tool, which relies mostly on a
bunch of heuristics. Martial extends BinDiff with a better similarity
metric by analyzing the call graph and collapsing similar entries.
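To show what comparing binaries via their call graphs even means, here is a back-of-the-envelope Python sketch (this is not BinSlayer's algorithm, just a crude Jaccard overlap over the call edges of two hypothetical program versions):

```python
# Crude call-graph similarity: flatten caller -> [callees] maps into
# edge sets and compute their Jaccard overlap.
def edge_set(call_graph):
    return {(caller, callee) for caller, callees in call_graph.items()
            for callee in callees}

def similarity(g1, g2):
    e1, e2 = edge_set(g1), edge_set(g2)
    return len(e1 & e2) / len(e1 | e2)

# Call graphs recovered from two hypothetical versions of a binary:
v1 = {"main": ["parse", "process"], "process": ["log"]}
v2 = {"main": ["parse", "process"], "process": ["log", "encrypt"]}
print(f"call-graph similarity: {similarity(v1, v2):.2f}")   # 0.75
```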