This year at CSET yours truly had the pleasure of organizing a round table on
rigor in experimentation, with Geoff Voelker, Micah Sherr, and Adam Doupé as
panelists. After a quick introduction and mission statements, we discussed rigor
in experimentation along several dimensions. The most interesting aspects were
open source in academia and the maintenance of code, the design of experiments
and comparisons to other sciences, published results and reproducibility, and
the difficulty of defining metrics for security.
Geoff Voelker is well known for his focus on computer systems, networking, and
network security, combining experimental and empirical research. Two notable
aspects of his research are challenging Internet-wide experiments and
measurements of volatile data. During his mission statement he discussed the
problem of reproducing one's own results and argued that computer science
research labs need to develop strategies and guidelines for experiments,
including detailed lab notes and the exact details of how benchmarks are run.
He shared the anecdote that some of his papers include the individual SQL
queries used to generate specific graphs and data, thereby documenting the
required resources. His students must label, back up, log, and detail all their
results. Whenever a student produces a graph and brings it to a discussion, he
encourages them to challenge the validity of the data, which has helped find
programming errors on several occasions. Another important lesson he teaches
students is the difficulty of correctly setting up an experiment. New students
must first reproduce an existing experiment, and during this process they learn
how easy it is to make mistakes. Since they know what the data should look
like, it is easier to spot mistakes and improve along the way.
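To make this kind of bookkeeping concrete, here is a minimal sketch of what
recording the exact query behind a graph could look like. It is a hypothetical
Python script, not anything the panelists showed: the database path, table
schema, query, and file names are all made up for illustration. The idea is
simply that the figure, the query that produced it, and a fingerprint of the
data are archived together, so the graph can later be traced back to its
source.

```python
# Hypothetical sketch: generate a figure and archive the exact SQL query
# that produced it, so the graph can be traced back to its data later.
# The database path, schema, and query below are illustrative only.
import hashlib
import sqlite3
from datetime import datetime, timezone

import matplotlib.pyplot as plt

DB_PATH = "experiments.db"        # assumed results database
FIGURE = "figure3_latency.pdf"    # assumed output figure
QUERY = """
    SELECT run_id, AVG(latency_ms) AS avg_latency
    FROM measurements
    GROUP BY run_id
    ORDER BY run_id;
"""  # the exact query is the artifact we want to preserve

# Run the query and draw the plot.
with sqlite3.connect(DB_PATH) as conn:
    rows = conn.execute(QUERY).fetchall()

run_ids = [r[0] for r in rows]
latencies = [r[1] for r in rows]
plt.plot(run_ids, latencies, marker="o")
plt.xlabel("run id")
plt.ylabel("average latency (ms)")
plt.savefig(FIGURE)

# Log the query, a hash of the database, and a timestamp next to the figure.
db_digest = hashlib.sha256(open(DB_PATH, "rb").read()).hexdigest()
with open(FIGURE + ".provenance.txt", "w") as log:
    log.write(f"generated: {datetime.now(timezone.utc).isoformat()}\n")
    log.write(f"database:  {DB_PATH} (sha256 {db_digest})\n")
    log.write("query:\n" + QUERY + "\n")
```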
Adam Doupé's research centers on web security; he has built automated tools to
discover web vulnerabilities and infer bugs. By measuring the effectiveness
of web vulnerability scanners, he developed metrics to compare different tools
and analyzed their weaknesses. Building on this framework, he looked into
developing defenses against server-side vulnerabilities. He is best known for
playing with Shellphish and leveraging capture-the-flag hacking games to educate
security students. Many of the prototypes from his work are on GitHub. In his
mission statement, Adam brought up the importance of reproducibility. As a
community (and inside each research group) we need to push towards making our
research reproducible. Open-sourcing code and sharing data sets are crucial
requirements for reproducibility. He also pointed out that it may not always
be possible to open-source prototypes, especially when companies have
business interests in a given prototype.
Micah Sherr focuses on privacy-preserving technology, e-voting security,
eavesdropping, and wiretapping. By his own description, he likes to break stuff
and fix things. In the past, he has worked on creating testbeds for Tor,
systematizing challenges in security research for cyber-physical systems, and
measurement experiments. In his statement, Micah brought up the incentive
challenges in the publication process. To publish a paper, a prototype needs to
be improved until its performance (according to whatever metric) surpasses that
of the prior system. As soon as the prototype improves upon the related work,
the process stops and the results are published. An interesting point was the
distinction between rigor and quality: while quality addresses how to handle
insufficient work, rigor defines how to do good science. Students must be
rigorous in analyzing results, and reproducibility may not be an inherent
criterion or requirement for good science (or may not even be possible in
certain cases).
After the mission statements we branched into a discussion of different topics
with lively interaction from the audience. Most prominent was the discussion
about open source and whether open-sourcing should be a requirement for
publication. Open source is the first step towards reproducibility: as a
precondition it must be satisfied, but open source alone does not inherently
enable reproducibility or even repeatability. Reproducibility comes in
different flavors. If the benchmarks can be rerun with the same code and the
same data, the experiment can be repeated. Full reproducibility requires
reimplementing the system according to the description in the paper and running
it both on the same and on additional data, to ensure that the system did not
just cover a small hand-crafted set of examples. If the code is well
documented, results may be repeated; reproducibility, however, requires a lot
of additional legwork.
If source code is released at all, its quality may not be stellar. Students
often write their prototypes under time pressure, and as soon as the benchmarks
run, development stops. Interestingly, the last commit of open-source research
prototypes often aligns with the publication date of the paper. Another
discussion point was that open-source prototypes should be consumed as-is.
While the authors may help with documentation and some requests, it is not
their responsibility to maintain the code or port it to other systems.
Maintenance is the job of the code consumers; other researchers who try to
replicate the results have to address portability themselves. Note that it is
always a nice gesture and good form to help wherever you can, but it is not an
obligation. Open-source research prototypes should not be considered production
ready.
Artifact evaluation goes along with reproducibility and may be an interesting
complementary aspect. Program committees are starting to include more and more
artifact evaluations. Papers with accompanying artifacts that are successfully
evaluated (usually by graduate students who build the prototype and repeat the
results in a couple of hours) receive an artifact evaluation award or badge.
Such badges may serve as incentives for authors to open-source their
prototypes, as they show that the results in the paper can at least be
repeated.
The last topic was responsible disclosure and the legal aspects of security
research. While responsible disclosure (letting software or hardware vendors
privately know about any discovered vulnerabilities and giving them time to
patch them) is well accepted in the community, adversarial research may still
be controversial. Open-sourced adversarial research, e.g., bug-finding or
exploitation tools, may lead to legal issues, as these tools could be
classified as weapons. As these legal frameworks are still being developed,
security researchers must be aware of such possible pitfalls.
Overall we had a great time discussing these different aspects of rigor in
security research. Although we talked for an hour, time went by very quickly and
we agreed that we could have continued to talk for a couple more hours! The
audience joined in on the discussion and we were happy with all the great
questions. Thanks to the great panelists, we had a wide range of expertise on
stage, and I believe that the audience enjoyed our chat.