VEE conference ramblings

As you might know, I spent the last couple of days at the VEE 2011 conference in Newport Beach/LA. If you are interested in more information about the talks, you can read my notes below.

Conference details:

In total there were 84 abstracts, 64 full submissions, and 20 papers selected for presentation. Corporate sponsors are: VMWare, Intel, Google, Microsoft Research, IBM Research.

Keynote:

Virtualization in the Age of Heterogeneous Machines, *David F. Bacon* (IBM Research, known for thin locks, http://www.research.ibm.com/liquidmetal/ )

Motivation: It's the multicore era! But what about performance? Three different models of computation exist: the CPU is general purpose, the GPU wins at gflops/$ (raw power), and the FPGA wins at gflops/$/watt. The drawback is that they are heterogeneous. A possible solution would be to virtualize these heterogeneous systems.

There were basically 2 original ideas in computer science: hashing and indirection; everything else is a combination of those. Virtualization falls into the indirection category. There are two forms of virtualization, namely System VM: virtualize the environment (VMWare, QEMU - different machines) and Language VM: virtualize the ISA (MAME, QEMU - different architectures). The current VM model is usually the accelerator model: send work from the CPU to the GPU/FPGA for computation and get a nice chunk of data back.

What is the solution to get over this heterogeneity? Use virtualization! David introduces Lime, the Liquid Metal programming language: a single language with multiple backends (CPU, GPU, WSP, and FPGA) that compiles down to the different architectures. The CPU backend must compile any code; all other backends can decide not to compile a piece of code, e.g., code that is not deeply pipelineable (with increased latency) can be rejected by the GPU compiler. The approach for FPGAs uses an artifact store that has solutions for common problems. These artifacts are then stitched together to form the compiled program; otherwise the compilation overhead would be way too large. LmVM, the Lime Virtual Machine, is introduced as an implementation of the Lime principle. Code originally starts on the CPU and evolves (or can be forced to evolve) to other platforms.
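
The backend rule from the keynote (the CPU backend must accept everything, the other backends may refuse) can be sketched as a simple fallback scheme. This is only my illustration: the function names, the `pipelineable` flag, and the `artifact_store` lookup are assumptions and not the actual Lime/LmVM interface.

```python
# Hypothetical sketch of "the CPU must compile, other backends may reject".
# None of these names come from Lime or LmVM; they are made up for illustration.

def cpu_backend(task):
    # The CPU backend has to accept every piece of code.
    return "cpu:" + task["name"]

def gpu_backend(task):
    # Assumed rule: reject code that is not deeply pipelineable.
    if not task.get("pipelineable", False):
        return None
    return "gpu:" + task["name"]

def fpga_backend(task):
    # Assumed rule: only accept code that has a ready-made artifact
    # in the artifact store.
    if task["name"] not in task.get("artifact_store", ()):
        return None
    return "fpga:" + task["name"]

def compile_task(task, accelerators=(fpga_backend, gpu_backend)):
    """Always build the CPU version; add accelerator versions if accepted."""
    artifacts = [cpu_backend(task)]
    for backend in accelerators:
        result = backend(task)
        if result is not None:
            artifacts.append(result)
    return artifacts
```

Because the CPU artifact always exists, the runtime can start every task on the CPU and move it to another backend later, which matches the "code starts on the CPU and evolves" idea.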

The programming approach is as follows: Java is a subset of Lime. A programmer starts with a Java program and extends it with different Lime features. Many new types are introduced in Lime to adhere to the hardware peculiarities of the different machines. [Insert long and lengthy discussion about language features here].

Performance is evaluated using the following scheme: write 4 benchmarks and 4 different versions of each benchmark to compare the different implementations. Baseline is a naive Java implementation. This baseline is compared to a handwritten expert implementation and the automatic Lime compilation. Total man power needed to develop this approach: 8 man years.

Session 1: Performance Monitoring

Performance Profiling of Virtual Machines (Jiaqing Du, Nipun Sehrawat and Willy Zwaenepoel, EPFL Lausanne)

PerfCTR incurs only low overhead and is a lot faster than binary instrumentation. The drawback is that support for virtual machines is missing. There are three different profiling modes: native profiling (os<->cpu), guest-wide (os<->cpu, without VMM, only the guest is profiled), and system-wide (os<->VMM<->cpu, both VMM and guest are profiled). The authors implement performance counters for para-virtualization (Xen), hardware assistance (KVM), and binary translation (QEMU) for both guest-wide and system-wide profiling.

A challenging problem for guest-wide profiling is that the context must be saved on all context switches (e.g., client 1 to VMM, VMM to client 2). The overhead of the implemented approach is low, about 0.4% for the additional counters in all cases. Native overhead in contrast is about 0.04%, so the additional VMM increases the overhead by 10x. An analysis of the accuracy shows that the deviation increases for virtual machines but is still very low for compute-intensive benchmarks. For memory-intensive benchmarks QEMU has a much higher cache miss rate due to the binary translation overhead.

Questions: What about profiling across different VMs? (in the VMM?) Is PEBS supported?

Perfctr-Xen: A Framework for Performance Counter Virtualization (Ruslan Nikolaev and Godmar Back, Virginia Tech Blacksburg)

Perfctr-Xen is an implementation of performance counter virtualization using the perfctr library in Xen. This removes the need for architecture-specific code inside the Xen core to support PMUs. Two new drivers needed to be implemented (the Xen host perfctr driver and the Xen guest perfctr driver), and the perfctr library needed to be changed as well.

Questions: What kind of changes are needed in the user-space library, and why is the Xen guest driver needed? Is PEBS supported? Ruslan did not convince me with his answers to the questions.

Dynamic Cache Contention Detection in Multi-threaded Applications (Qin Zhao, David Koh, Syed Raza, Derek Bruening (Google), Weng-Fai Wong and Saman Amarasinghe, MIT)

The motivation of this talk is to detect cache contention in multi-threaded applications (e.g., false sharing between arrays across multiple threads). The approach uses dynamic instrumentation to keep track of individual memory locations using a bitmap and shadow memory. The ownership bitmap of each cache line records which of up to 32 threads own that line. If a thread that accesses a cache line is not the single owner then we have a potential data sharing problem. Depending on the performance counters we can detect cache contention. The implementation is on top of Umbra, which uses DynamoRIO.
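
To make the ownership bitmap idea a bit more concrete, here is a minimal sketch of how I picture it. It is not the authors' code: the 64-byte line size, the dict used as shadow memory, and the rule that a write resets the bitmap to the writer (modelled loosely on cache invalidation) are my assumptions.

```python
# Sketch of per-cache-line ownership bitmaps (assumptions: 64-byte lines,
# thread ids 0..31, a dict standing in for shadow memory).

CACHE_LINE_SIZE = 64
MAX_THREADS = 32

class OwnershipTracker:
    def __init__(self):
        self.shadow = {}       # cache line index -> ownership bitmap
        self.contention = {}   # cache line index -> contended accesses

    def access(self, thread_id, address, is_write):
        """Record one access; return True if it looks like contention."""
        if not 0 <= thread_id < MAX_THREADS:
            raise ValueError("thread id does not fit into the bitmap")
        line = address // CACHE_LINE_SIZE
        my_bit = 1 << thread_id
        owners = self.shadow.get(line, 0)
        if is_write:
            # A write is contended if any other thread holds the line.
            contended = (owners & ~my_bit) != 0
            new_owners = my_bit            # assumed: other copies are invalidated
        else:
            # A read is contended if others hold the line and we do not yet.
            contended = owners != 0 and (owners & my_bit) == 0
            new_owners = owners | my_bit
        if contended:
            self.contention[line] = self.contention.get(line, 0) + 1
        self.shadow[line] = new_owners
        return contended
```

Two threads that repeatedly write to different offsets of the same line keep stealing ownership from each other, so the counter for that line grows even though they never touch the same bytes. That is the false sharing pattern the talk was about.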

Questions: What tool did you use for BT? How can you know that you measure the real overhead and not some distortion through the instrumentation interface?

Session 2: Configuration

Rethink the Virtual Machine Template (Kun Wang, Chengzhong Xu and Jia Rao)

The main objective is to reduce the startup overhead of system images down to 1 sec. The problem is that the overhead of VM creation is large. Cloning copies files, and other solutions are limited to the same physical machine. The idea is to concentrate on the 'substrate' of the virtual machine and capture only the smallest possible state (e.g., app/OS state). This small substrate can then easily be copied to other machines and restored to full online VM images.

Dolly: Virtualization-driven Database Provisioning for the Cloud (Emmanuel Cecchet, Rahul Singh, Upendra Sharma and Prashant Shenoy, UMASS CS)

Emmanuel used the tagline "Virtual Sex in the Cloud". This work tries to solve the problem of adding database replicas to spread the load across database backends. The problem is that the VM cannot just be cloned; a complete database backup and restore must be carried out so that the DBs can be synced across different nodes. It is hard to replicate state and, e.g., instantiate a consistent copy/replica of a database. When the replica is actually ready it misses a couple of updates that happened while the replica was being generated. The idea of this work is to dynamically scale database backends in the cloud by generating new VM clones and using DB restore in the background. Parameters are the snapshot interval and the update frequency of the database, which in turn determines the size of the replay log that must be recovered. The evaluation part contains a detailed analysis of different predictors of when to take snapshots, what the replay overhead from the snapshot to the current state is, and how much it would cost on Amazon EC2.

ReHype: Enabling VM Survival Across Hypervisor Failures (Highlight) (Michael Le and Yuval Tamir, UCLA)

The VMM is a single point of failure for VMs (due to hardware faults or faults in the virtualization software). The problem also is that system reboots (of the host) are too slow. ReHype detects failures and pauses VMs in place. The VMM is then micro-rebooted. Paused VMs are then integrated into the new VMM instance and unpaused. The related work 'Otherworld' reboots the Linux kernel after a failure and keeps processes (applications) in memory. ReHype recovers a failed VMM while preserving the states of VMs. Possible VMM failures are crash, hang, or silent (no crash/hang detected but VMs fail). Crash: the VMM panic handler is executed; hang: the VMM watchdog handler is executed. The system was evaluated using fault injections into the VMM state.

ReHype can only recover from software failures; the hardware is still the same, so it does not protect against persistent hardware failures. Logs are not kept to fix the bugs later on. But in theory this system could also be used to upgrade VMMs.

Questions: What about the size of the system (LOC)?

Session 3: Recovery

Fast and Space Efficient Virtual Machine Checkpointing (Eunbyung Park, Bernhard Egger and Jaejin Lee, Seoul University, South Korea)

Checkpointing can be used for faster VM scaling, high availability, and debugging/forensics. Checkpoint stores volatile state of the VM. A large part of the snapshot data does not need to be saved, e.g., file cache in the Linux kernel. Goal is to make checkpointing faster and to reduce these redundant pages and remove them from the snapshot. A mapping between memory pages and disk blocks is added to the VMM. Problem: how to detect dirty/written pages in memory? It is necessary to check the shadow page-table of the guest. Result: 81% reduction in stored data and 74% reduction in the checkpoint time for para-virtualized guests, 66% reduction in data and 55% reduction in checkpoint time for fully-virtualized guests.
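
The core trick, as I understood it, is that a memory page which is still an unmodified copy of a disk block does not need to be written into the snapshot. A minimal sketch, assuming the VMM already provides the page-to-block mapping and the set of dirty pages (both are plain Python containers here, and all names are mine):

```python
# Sketch: leave out pages that are clean copies of disk blocks.
# page_to_block and dirty_pages are assumed to be maintained by the VMM.

def build_checkpoint(memory, page_to_block, dirty_pages):
    """memory: {page_no: bytes}; page_to_block: {page_no: block_no};
    dirty_pages: pages written since they were read from disk."""
    saved_pages = {}   # pages that must be stored in the snapshot
    block_refs = {}    # pages that can be re-read from disk on restore
    for page_no, data in memory.items():
        block = page_to_block.get(page_no)
        if block is not None and page_no not in dirty_pages:
            block_refs[page_no] = block
        else:
            saved_pages[page_no] = data
    return saved_pages, block_refs

def restore_checkpoint(saved_pages, block_refs, disk):
    """Rebuild memory from the stored pages plus the referenced disk blocks."""
    memory = dict(saved_pages)
    for page_no, block in block_refs.items():
        memory[page_no] = disk[block]
    return memory
```

The snapshot then only contains `saved_pages` plus the small `block_refs` table, which is where the reduction in stored data would come from.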

Fast Restore of Checkpointed Memory using Working Set Estimation (Highlight) (Irene Zhang, Yury Baskakov, Alex Garthwaite and Kenneth C. Barr, VMWare)

Reduce time to restore a checkpoint from disk. Current schemes: (1) eager restore; restores all pages to memory. (2) lazy restore; restores only the CPU/device state and only restores memory in the background. If the guest accesses pages that are not yet restored then the VMM must stop the VM and restore that specific page (this can lead to thrashing). How to measure restore performance? Time-to-responsiveness measures the time until the VM is usable. How big is the share of the restore process of the total time? (Mutator utilization, a metric from the GC community that measures GC overhead).

New feature: working-set-restore that prefetches the current working set to reduce VM performance degradation. Working set is estimated using either access-bit scanning or memory tracing. The memory tracing runs alongside the VM all the time (overhead around 0.002%) and keeps track of the working set. When the checkpoint is restored then this set of pages is restored first and the VM is started at the point the checkpoint was taken, so all the pages will be accessed again. (Of course no external I/O may be executed during the checkpointing). Question: Do you do any linearization of the list of the pages that need to be restored? I/O allowed during lazy checkpointing? Working-set-predictor and peeking into the future?
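
A toy model of why prefetching the working set helps: count how often the guest touches a page that is not yet in memory. This is my own simplification (it ignores the background restore thread entirely), and the function names are made up.

```python
# Toy model: each access to a page that is not yet restored stalls the VM once.

def restore_order(all_pages, working_set):
    """Working-set pages first, then everything else in original order."""
    first = [p for p in all_pages if p in working_set]
    rest = [p for p in all_pages if p not in working_set]
    return first + rest

def count_restore_faults(access_trace, prefetched):
    """Number of accesses that hit a page which is not resident yet."""
    resident = set(prefetched)
    faults = 0
    for page in access_trace:
        if page not in resident:
            faults += 1
            resident.add(page)   # the VMM stops the VM and restores the page
    return faults
```

With a purely lazy restore nothing is prefetched, so every first touch of a page stalls the guest. With the working set prefetched, only accesses outside the estimated set still fault.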

Fast and Correct Performance Recovery of Operating Systems Using a Virtual Machine Monitor (Kenichi Kourai, Institute of Technology, Japan)

A reboot is often the only solution to get over a fault in the system. After a reboot performance is still degraded due to many page misses. A new form of reboot with a warm page cache is proposed. The page cache is kept in memory and can be reused after a reboot. A cache consistency mechanism is added to keep track of the caching information. Saving a couple of seconds after a reboot leads to constant overhead during the runtime due to the implementation. Does this really make sense? Reboots should be infrequent for servers. Does it really make sense to keep the page cache around?

Session 4: Migration

Evaluation of Delta Compression techniques for Efficient Live Migration of Large Virtual Machines (Petter Svard, Benoit Hudzia, Johan Tordsson and Erik Elmroth, Umea University Sweden)

A problem with current solutions is that more pages can turn dirty than pages are transferred to the other host. At one point in time the VM is stopped and the remaining pages are transferred. This leads to a long downtime. Depending on the transfer link it makes more sense to compress, transfer, decompress than to just transfer pages because compression and decompression is faster than the transfer of the full uncompressed page. A special remark is that only pages that were already transferred are compressed. If a page was already transferred and is in the cache of the sender and turns dirty then the delta is constructed, compressed, and sent. Otherwise the plain page is sent. Petter did some live demos.
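
The "delta if cached, plain page otherwise" rule can be sketched like this. The XOR delta and the use of `zlib` are stand-ins that I picked for illustration, not necessarily the compression scheme from the paper; the function names and the dict-based caches are also mine.

```python
import zlib

PAGE_SIZE = 4096

def encode_page(page, sender_cache, page_no):
    """Return ("delta", payload) if an old copy is cached, else ("full", page)."""
    old = sender_cache.get(page_no)
    sender_cache[page_no] = page          # remember what the receiver will hold
    if old is None:
        return ("full", page)
    # XOR against the old version: unchanged bytes become zero and compress well.
    delta = bytes(a ^ b for a, b in zip(old, page))
    return ("delta", zlib.compress(delta))

def decode_page(kind, payload, receiver_pages, page_no):
    """Apply a full page or a compressed delta on the destination host."""
    if kind == "full":
        page = payload
    else:
        delta = zlib.decompress(payload)
        page = bytes(a ^ b for a, b in zip(receiver_pages[page_no], delta))
    receiver_pages[page_no] = page
    return page
```

If only a few bytes of a page changed, the delta is mostly zeros and the compressed payload is much smaller than the full page, which is what pays off on slow links.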

CloudNet: Dynamic Pooling of Cloud Resources by Live WAN Migration of Virtual Machines (Highlight) (Timothy Wood, KK Ramakrishnan, Prashant Shenoy and Jacobus van der Merwe)

Problem: cloud resources are isolated from one another and the enterprise. The interesting question is how to manage these different isolated machines and how to secure data transfers between the different machines and across multiple data-centers. Use VPNs to connect different data centers and use common migration tools.

Workload-Aware Live Storage Migration for Clouds (Jie Zheng, T. S. Eugene Ng and Kunwadee Sripanidkulchai, Rice University)

Storage migration in a wide-area VM migration contributes the largest part of the data that needs to be transferred. No shared file storage is available, so disk image must be synchronized somehow (based on block migration).

Session 5: Security

Patch Auditing in Infrastructure as a Service Clouds (Highlight; Read paper) (Lionel Litty and David Lie, VMWare / University of Toronto)

Apply your patches! But not everybody does it. Even automatic patch application is not a solution. Also, monitoring on the OS level is not continuous or systematic, and different applications have different update mechanisms. There is a need for a better tool to automate the update mechanism and to monitor the vulnerable state of systems. Additional challenges are VMs that might be powered down or unavailable to the infrastructure administrator. Solution: add patch monitoring to the VMM infrastructure and report to a central tool. Use the VMM to detect application updates (binary and text only) and analyze different patches. Use executable bits to detect all live executed code on the host VM. Check that executed code is OK.

Patagonix (only binary code detected) -> P2 (extended executable code (bash script, python, executable) detected).

Fine-Grained User-Space Security Through Virtualization (Mathias Payer and Thomas R. Gross, ETH Zurich Switzerland)

My talk. See my paper for details.

Session 6: Virtualization Techniques

Minimal-overhead Virtualization of a Large Scale Supercomputer (Jack Lange, Kevin Pedretti, Peter Dinda, Patrick Bridges, Chang Bae, Philip Soltero and Alexander Merritt, University of Pittsburgh)

Palacios (an OS-independent embeddable VMM) and Kitten (a lightweight supercomputing OS) for HPC. Key concepts for minimal overhead virtualization are that (1) I/O is passed through, e.g., direct I/O access with no virtualization overhead; (2) virtual paging is optimized for nested and shadow paging; (3) preemption is controlled to reduce host OS noise. The VMM trusts the guest (e.g., to do DMA correctly). Bugs in the guest could bring down the complete system. Symbiotic virtualization is a new approach that uses cooperation.

Virtual WiFi: Bring Virtualization from Wired to Wireless (Highlight) (Lei Xia, Sanjay Kumar, Xue Yang, Praveen Gopalakrishnan, York Liu, Sebastian Schoenberg and Xingang Guo, Northwestern University)

A new approach that enables WiFi virtualization. One physical WiFi interface is virtualized and can be used in multiple VMs. The current approach is to virtualize an Ethernet device inside the GuestVM, which strips all the WiFi functionality. The new approach virtualizes the complete WiFi functionality in the VM. The same Intel WiFi driver is used in the GuestVM as in the HostVM. Each VM gets its own vMAC, the HostVM distributes packets according to the vMAC, and all other capabilities are directly forwarded to the VMs and can be set by the VMs as well.
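
The vMAC-based packet distribution is basically a table lookup. A tiny sketch of that step, with a made-up frame representation (destination MAC plus payload) and made-up VM names:

```python
# Sketch of vMAC dispatch in the host; the frame format and the names
# are hypothetical and only illustrate the lookup.

def dispatch(frames, vmac_table):
    """Deliver each (dst_mac, payload) frame to the VM that owns the vMAC."""
    queues = {vm: [] for vm in vmac_table.values()}
    dropped = []
    for dst_mac, payload in frames:
        vm = vmac_table.get(dst_mac.lower())
        if vm is None:
            dropped.append((dst_mac, payload))   # no VM owns this address
        else:
            queues[vm].append(payload)
    return queues, dropped
```

What happens to frames that match no vMAC (drop them, or hand them to the host) is a policy decision that this sketch simply records as `dropped`.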

Questions: Promiscuous? Rate limited? Multiple vMACs supported in VM as well?

SymCall: Symbiotic Virtualization Through VMM-to-Guest Upcalls (Jack Lange and Peter Dinda)

Semantic gap: semantic information is lost between the HW and the emulated guest HW, and the guest OS state is unknown to the VMM. Two approaches to find out about the guest: black box (monitor external guest interactions) and gray box (reverse engineer guest state).

Symbiotic Virtualization: design both the guest OS and the VMM to minimize the semantic gap, but also offer a fallback to a black-box guest OS. There are two interfaces: SymSpy, a passive interface that uses asynchronous communication to get information about hidden state, and SymCall, which uses upcalls into the guest during exit handling.

SymSpy: uses a shared memory page between the OS and the VMM to offer structured data exchange between VMM and OS
SymCall: similar to system calls. The VMM requests services from the OS.
Restrictions: only 1 SymCall active at a time, SymCalls run to completion (no blocking, no context switches, no exceptions or interrupts), SymCalls cannot wait on locks (deadlocks).

SwapBypass is an optimization that pushes swapping from the guest to the VMM. SwapBypass uses a shadow copy of the page tables of the guest VM. The VM does not swap out any pages and caching/swapping only happens in the VMM but never in the guest VM to reduce I/O pollution. Page fault happens in VMM and not in host.

Session 7: Memory Management

Overdriver: Handling Memory Overload in an Oversubscribed Cloud (Highlight) (Dan Williams, Hani Jamjoom, Yew-Huey Liu and Hakim Weatherspoon, Cornell University)

Peak loads are very rare and utilization in data centers is below 15%. But on the other hand peak loads are unpredictable and oversubscription can lead to overload. Memory oversubscription is kind of critical because overload carries a high penalty due to swapping costs. The focus of this work is to research if the performance degradation due to memory overload can be managed, reduced, or eliminated.

Analysis of different memory overloads shows that most overload is transient (96% are less than 1min), some overload is sustained (2% last longer than 10min). Two techniques used to address memory overload: VM migration (migrates VM to another machine), and network memory that sends swapped pages not to disk but to another swapping machine over the network. Network Memory may be used for transient overloads and VM migration for sustained overloads.

OverDriver uses network memory and VM migration to handle overload. OverDriver collects swap/overload statistics for each VM. Use overload profiles to decide when to switch from network memory to VM migration. Question: Decision on when to migrate is static, what about adaptive checks/analysis for migration? What other predictors could you use? (Sounds like future work)
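
The switch from network memory to migration can be pictured as a threshold on how long a VM has been overloaded. The sampling interval and the threshold below are placeholder values I picked, not numbers from the paper, and the function is only a sketch of the static decision that was discussed in the Q&A.

```python
# Sketch of a static "network memory first, migrate if sustained" policy.
# interval_s and migrate_after_s are assumed values, not OverDriver's.

def mitigation_timeline(overloaded_samples, interval_s=10, migrate_after_s=120):
    """Map per-interval overload flags to the action taken in each interval."""
    actions = []
    streak = 0   # seconds of uninterrupted overload so far
    for overloaded in overloaded_samples:
        streak = streak + interval_s if overloaded else 0
        if streak == 0:
            actions.append("none")
        elif streak < migrate_after_s:
            actions.append("network_memory")   # transient: swap over the network
        else:
            actions.append("migrate")          # sustained: move the VM away
    return actions
```

An adaptive variant would replace the fixed `migrate_after_s` with a value derived from the collected overload profiles, which is roughly what the question was getting at.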

Selective Hardware/Software Memory Virtualization (Xiaolin Wang, Jiarui Zang, Zhenlin Wang, Yingwei Luo and Xiaoming Li, Peking University)

3 possibilities for memory virtualization: MMU para-virtualization, shadow page tables, and EPT/NPT. Idea: use dynamic switching between hardware assisted paging and shadow paging. Question: how and when to switch?

Hybrid Binary Rewriting for Memory Access Instrumentation (Highlight; Read paper) (Amitabha Roy, Steven Hand and Tim Harris, University of Cambridge UK)

Scaling inside multi-threaded shared memory programs can be problematic (scalability, races, atomicity violations). Run existing x86 binaries and analyze synchronization primitives (locks). Dynamic binary rewriting used to analyze lock primitives.

It is hard to decide statically whether a lock is taken or not; static rewriting alone either overinstruments or is unsound. Therefore dynamic BT is needed. Hybrid binary rewriting uses static binary rewriting and dynamic binary rewriting as a fallback. A persistent instrumentation cache (PIC) is stored between different runs of the same program, so the translated code can be reused.
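
The persistent instrumentation cache is easiest to picture as a lookup table that survives between runs and only calls the dynamic rewriter on a miss. The sketch below stores strings in a JSON file; the real PIC of course holds translated machine code, and the class, file format, and `translate` callback are my own placeholders.

```python
import json
import os

class InstrumentationCache:
    """Toy persistent cache: block address -> instrumented code (a string here)."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        self.translations = 0   # how often the dynamic rewriter had to run
        if os.path.exists(path):
            with open(path) as f:
                self.entries = json.load(f)

    def lookup(self, block_addr, translate):
        """Return cached code, falling back to dynamic translation on a miss."""
        key = hex(block_addr)
        if key not in self.entries:
            self.entries[key] = translate(block_addr)
            self.translations += 1
        return self.entries[key]

    def save(self):
        """Persist the cache so the next run can reuse the translations."""
        with open(self.path, "w") as f:
            json.dump(self.entries, f)
```

On a second run of the same program the blocks are already in the file, so the fallback translator is not invoked for them again.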

HBR used for two case-studies:

Profiling: interested in understanding how suitable programs are for applying STM transformations.
Speculative Lock Elision: remove locks and turn them into stm_start, stm_commit, and instrument reads and writes. STAMP used to evaluate this dynamic instrumentation. Problem is that STAMP uses private data that is accessed inside transactions and there is manual optimization for STMs that get rid of the additional read and write operations. Dynamic instrumentation instruments all reads and writes and has bad performance for these cases. Private Data Tracking uses a special tracking of private data to reduce the amount of instrumentation and reduces the overhead to reasonable numbers.

Question: Static binary rewriting: no runtime overhead (no translation overhead), but there can be artifacts/overhead through the translation process. Translation overhead for DBT is <1%. What about hierarchical transactions?

Peek into the future:

VEE 2012 will be in London, UK, general chair will be Steve Hand. VEE'12 is colocated with ASPLOS again, Saturday 3rd of March and Sunday 4th of March.