This year at CSET yours truly had the pleasure of organizing a round table on rigor in experimentation, with Geoff Voelker, Micah Sherr, and Adam Doupé as panelists. After a quick introduction and mission statements, we discussed rigor in experimentation along several dimensions. The most interesting aspects were open source in academia and maintenance of code, design of experiments and comparison to other sciences, published results and reproducibility, and how hard it is to define metrics for security.
Geoff Voelker is well known for his focus on computer systems, networking, and network security, with both experimental and empirical research. Two hallmarks of his work are challenging Internet-wide experiments and measurements of volatile data. During his mission statement he discussed the problem of reproducing one's own results, and argued that computer science research labs need to develop strategies and guidelines for experiments, including detailed lab notes and the exact details of how benchmarks are run. He shared the anecdote that some of his papers include the individual SQL queries used to generate specific graphs and data, thereby documenting the required resources. Students must label, back up, log, and detail all their results. Whenever a student produces a graph and brings it to a discussion, he encourages them to challenge the validity of the data, which has helped uncover programming errors on several occasions. Another important lesson he teaches students is the difficulty of correctly setting up an experiment. New students must first reproduce an existing experiment, and during this process they learn how easy it is to make mistakes. Because they know what the data should look like, it is easier to spot mistakes and improve in the process.
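As a minimal sketch of that documentation practice (this is my illustration, not code from the panel or from Geoff's group), one could store the exact query next to the figure it produced; the table, column names, and file names below are hypothetical:

```python
# Illustrative sketch: keep the exact SQL that generated a figure alongside the figure,
# so the graph can be regenerated later from the same provenance.
import sqlite3
import matplotlib.pyplot as plt

QUERY = """
SELECT day, COUNT(*) AS scans
FROM probe_log          -- hypothetical table, for illustration only
GROUP BY day
ORDER BY day;
"""

def plot_with_provenance(db_path: str, out_png: str) -> None:
    """Run the query, plot the result, and store the query text next to the figure."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(QUERY).fetchall()

    days = [row[0] for row in rows]
    scans = [row[1] for row in rows]

    plt.plot(days, scans)
    plt.xlabel("day")
    plt.ylabel("scans observed")
    plt.savefig(out_png)

    # Record the exact SQL used for this graph next to the output file.
    with open(out_png + ".sql", "w") as f:
        f.write(QUERY)

if __name__ == "__main__":
    plot_with_provenance("measurements.db", "scans_per_day.png")
```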
Adam Doupé's research centers on web security; he builds automated tools to discover web vulnerabilities and infer bugs. By measuring the effectiveness of web vulnerability scanners, he developed metrics to compare different tools and analyzed their weaknesses. Building on this framework, he looked into developing defenses against server-side vulnerabilities. He is best known for playing with Shellphish and leveraging capture-the-flag hacking games to educate security students. Many prototypes from his work are available on GitHub. In his mission statement, Adam brought up the importance of reproducibility. As a community (and inside each research group) we need to push towards making our research reproducible. Open-sourcing code and sharing data sets are crucial prerequisites for reproducibility. He also noted that it may not always be possible to open-source prototypes, especially when companies have business interests in a given prototype.
Micah Sherr focuses on privacy-preserving technology, e-voting security, eavesdropping, and wiretapping. By his own description, he likes to break stuff and fix things. In the past, he worked on creating testbeds for Tor, systematizing challenges in security research for cyber-physical systems, and measurement experiments. In his statement, Micah brought up the incentive problems in the publication process. To publish a paper, a prototype needs to be improved until its performance (according to whatever metric) surpasses that of the prior system. As soon as the prototype improves upon the related work, the process stops and the results are published. An interesting challenge is the difference between rigor and quality: while quality addresses how to handle insufficient work, rigor defines how to do good science. Students must be rigorous in analyzing results, and reproducibility may not be an inherent criterion or requirement for good science (or may not even be possible in certain cases).
After the mission statements we branched into a discussion of different topics, with lively interaction from the audience. Most prominent was the discussion about open source and whether it should be a requirement for publication. Open source is the first step towards reproducibility: it is a necessary precondition, but open source alone does not guarantee reproducibility or even repeatability. Reproducibility comes in different flavors. If the benchmarks can be rerun with the same code and the same data, the experiment can be repeated. Full reproducibility requires reimplementing the system according to the description in the paper and running it both on the same and additional data, to ensure that the system does not just cover a small hand-crafted set of examples. If the code is well documented, results may be repeatable; full reproducibility requires a lot of additional legwork.
If source code is released at all, the quality may not be stellar. Students often write their prototypes under time pressure, and as soon as the benchmarks run, development stops. Interestingly, the last commit of open-source research prototypes often aligns with the publication date of the paper. An interesting discussion point was that open-source prototypes should be consumed as-is. While the authors may help with documentation and some requests, it is not their responsibility to maintain the code or port it to other systems. Maintenance is the job of the code consumers; other researchers who try to replicate the results have to address portability themselves. Note that it is always a nice gesture (and good form) to help wherever you can, but it is not an obligation. Open-source research prototypes should not be considered production ready.
Artifact evaluation goes hand in hand with reproducibility and may be an interesting complementary aspect. Program committees are starting to include more and more artifact evaluations. Papers with accompanying artifacts that are successfully evaluated (usually by graduate students who build the prototype and repeat the results within a couple of hours) receive an artifact evaluation award or badge. Such badges may serve as incentives for authors to open-source their prototypes, as they show that the results in the paper can at least be repeated.
The last topic was responsible disclosure and the legal aspects of security research. While responsible disclosure (letting software or hardware vendors privately know about any discovered vulnerabilities and giving them time to patch) is well accepted in the community, adversarial research may still be controversial. Open-sourced adversarial research, e.g., bug-finding or exploitation tools, may lead to legal issues, as these tools could be classified as weapons. As the legal frameworks are still being developed, security researchers must be aware of such possible pitfalls.
Overall, we had a great time discussing these different aspects of rigor in security research. Although we talked for an hour, time went by very quickly, and we agreed that we could have continued for a couple more hours! The audience joined in on the discussion, and we were happy with all the great questions. Thanks to the great panelists, we covered a wide range of expertise, and I believe the audience enjoyed our chat.