All paths lead to Rome: POPL and PPREW from a syssec perspective

A week in Rome: what could be better than sitting all day in a conference room?

POPL (Principles Of Programming Languages) is one of these great conferences that have been around forever and sound very interesting to any systems person. The definition of POPL according to the SIGPLAN homepage is: "Principles of Programming Languages symposium addresses fundamental principles and important innovations in the design, definition, analysis, and implementation of programming languages, programming systems, and programming interfaces.". During my PhD I was in a very (computer) systems and compiler oriented group and POPL always sounded like a fun conference. So I was quite happy when I got some of my older PhD work accepted for the PPREW (Program Protection and Reverse Engineering Workshop) that was co-located with POPL this year. The 2013 POPL venue was in Rome, Italy which was even better!

The POPL conference

Before the conference I quickly browsed through the conference program, recognized some names (hi Vijay D'Silva, Daniel Kroening, Ben Livshits, and Domagoj Babic) and some of the papers sounded quite intriguing (although I feared that I would have some trouble understanding the overly theoretical papers and talks).

The keynote on the first day by Georges Gonthier (MS Research) was mostly about proving theorems in Coq (a theorem prover) and examples on how to user Coq (which did help me understand the whole theorem proving world).

The highlight of the first day was Vijay D'Silva's talk about Abstract conflict driven learning (unfortunately I did not find a free PDF version of this paper). I really enjoyed his presentation (he's a great speaker) and the paper was added towards the top of my reading stack.

The second keynote by Shriram Krishnamurthi (Brown University) discussed that JavaScript and HTML are basically evil languages that have no built in safety concepts. Missing language abstractions and full implicit trust of all external library functions (e.g., any included JavaScript library like the JQuery library has full access to all data of a website; all code runs under the same privilege domain with the same priorities) makes it hard for websites to reuse existing libraries. Shriram discussed sandboxing approaches and other language approaches to overcome this problem.

The highlight talk of the second day was Ben Livshits (MS Research) talk Towards fully automatic placement of security sanitizers and declassifiers A sanitizer is a function that cleans an user-defined input string from any potentially malicious/bad character sequences (e.g., escape characters if the string is used for an SQL statement, or JavaScript tags, if the string is used as part of an HTML page). A problem is that most sanitizers are neither idempotent (i.e., the cannot be executed repeatedly on the same data without chaning the data) nor reversible (i.e., removing bad characters cannot be undone). Ben argues that programmers are unable to place these sanitizers at the correct location and an automatic analysis tool will do a better job. The core of their analysis piggy backs on existing compiler data-flow techniques and uses both a static and a dynamic component. The static component uses node-based placement of security classifiers for most of the statically detected input sources. The remaining input sources are tagged dynamically and use a dynamic data flow tracking to add sanitizers at runtime. So for most of the input flows the algorithm is able to determine the location of the sanitizer statically while for some remaining sanitizers the algorithm falls back to a dynamic approach.

I really liked the presentation and when Ben visited Berkeley one week after POPL I took the opportunity to talk to him a bit before his (re-)presentation. What bugs me a bit is that the approach is limited to strings (so no full data tracking for all kinds of data flows) and that the strings remain static during the process. If strings are used/concatenated during the execution then some additional special sanitizers must be added that tag part of the string (i.e., assume that an SQL string is being built up from multiple sources: then the approach cannot handle partial strings or concatenated strings).

The keynote of the third day was by Noah Goodman (Stanford) and he talked about principles and practice of probabilistic programming. The talk focused about how we can deal with uncertainty: an easy solution is to extend the current value system and add a probabilistic component to it. After the motivation he moved towards Lambda calculus and explained how basic Lambda calculus can be extended with stochastic methods. I really liked his presentation and was able to follow through most of the talk although the focus of the talk was not exactly in my comfort zone.

The highlight talk of the third day was Sigma*: Symbolic Learning of Stream Filters by Matko Botincan and Domagoj Babic. They presented a nice concept of how they can the learning of a symbolic model of an application. They present an iterative process that generates both under-approximations (using dynamic symbolic execution) and over-approximations (using counter-example guided abstraction refinement) of an application. During each step they use an SMT solver to test for equivalence. If the two approximations are not equal then they either extend the under-approximation or shorten the over-approximation until they reach a fixpoint.

The PPREW workshop

After three days of POPL I was looking forward to a more low-level, computer-systems related, and machine-code oriented workshop and PPREW (Program Protection and Reverse Engineering Workshop) met and exceeded my expectations.

First of all, (Jeffrey) Todd McDonald gave a great introduction and welcomed us all to the 2nd incarnation of this workshop. He hopes to grow the workshop in the future and to attract more academic work in the areas of program protection and reverse engineering.

Arun Lakhotia (University of Lousiana at Lafayette) gave the keynote and talked about Fast Location of Similar Code Fragments Using Semantic 'Juice' (BinJuice). We have to move away from the old question wheter a binary is 'bad'/malicious or not and move towards the question who is actually responsible for the malicious code. His idea is to remap malware as an information retrieval problem: find the closest match of a new (malware) sample to existing code in a large code library; or partition all your samples inside the library into classes and families. As a next step it is possible to use machine learning to classify similarity. Code is related through code evolution (bug fixes, capabilities, code reuse, shared libraries).

A prerequisite is defining features that can be used to classify individual executables. As part of this research they evolved the features using different forms of abstractions: use a sequence of n bytes; use disassembled text; use disassembled text without relocations, or use n-mnemonic words (je push) (push movl).

An important finding is that current malware does only block level transformations. Semantics catch any changes inside a block, a remaining problem is if unused registers are used for bogus computation across control blocks.

The first talk was by Jack W. Davidson (University of Virginia) about Software Protection for Dynamically-Generated Code. The focus of this work is program protection: slowing down reverse engineering of a piece of code. If code is available then reverse engineering will always be possible, the question is how fast the 'bad guys' will be able to crack a software protection. Jack proposes Process-level virtualization (PVM): using a binary-dynamic translator that dynamically decrypts encrypted portions of a binary executable at runtime. A static compilation process adds additional guards into the executable, encodes specific the binary and encrypts it using some key.

The second talk was Mathias Payer's (aka your humble author, currently PostDoc at UC Berkeley) talk about String Oriented Programming: When ASLR is Not Enough. In this talk Mathias discusses protection mechanisms of current systems and their drawbacks. Data Execution Prevention (DEP) is a great solution against code injection while other forms of code-reuse attacks like Return Oriented Programming (ROP) are still possible. Address Space Layout Randomization (ASLR) and Stack canaries are a probabilistic protection against data overwrite attacks that rely on a secret that is shared with all code in the process. Any information leak can be used to reveal that secret and to break the probabilistic protection. String Oriented Programming is an approach that exploits format string bugs to write special static regions in regularly compiled executables to redirect control flow to carefully prepared invocation frames on the stack without triggering a stack canary mismatch and circumventing ASLR and DEP as well.

In the third talk Viviane Zwanger talked about Kernel Mode API Spectroscopy for Incident Response and Digital Forensics. In the talk Viviane presented her ideas on "API spectroscopy": she measures API function calls of kernel drivers, classifies these function calls into different groups, counts number of calls per group, and uses the a feature vector based on the number of function calls to classify potentially malicious drivers. The analysis is based on a static approach where she uses kernel debugging functionality to get a dump of the kernel memory image to extract individual drivers. The static analysis then counts each function call and generates the feature vector.

In the last talk Martial Bourquin presented BinSlayer: Accurate Comparison of Binary Executables. He told us that there is no good way to compare two different binaries. The only tool that is readily available is Halvar Flake's BinDiff tool which relies mostly on a bunch of heuristics. Martial extends BinDiff with a better similarity metric by analyzing the callgraph and collapsing similar entries.

Final remarks

All in all I must say that I enjoyed my stay in Rome. Both conferences where interesting: meeting people during POPL and peeking into the theorem proofing and formal verification world was certainly challenging and broadening my view. PPREW on the other hand felt like home: I was aware of the related work, knew my area, and I was surrounded by friendly like-minded people.