As you might know, I attended the VEE'11 conference in Newport
Beach/LA over the last couple of days. If you are interested in more
information about the talks, you can read my notes below.
Conference details:
In total there were 84 abstracts, 64 full submissions, and 20 papers
selected for presentation.
Corporate sponsors were: VMWare, Intel, Google, Microsoft Research, IBM
Research.
Keynote:
Virtualization in the Age of Heterogeneous Machines, *David F. Bacon*
(IBM Research, known for thin locks,
http://www.research.ibm.com/liquidmetal/)
Motivation:
It's the multicore era! But what about performance? Three different
models of computation exist: CPUs are general purpose, GPUs win at
gflops/$ (raw power), and FPGAs win at gflops/$/watt. The drawback is
that they are heterogeneous. A possible solution is to virtualize these
heterogeneous systems.
There were basically 2 original ideas in computer science, Hashing &
Indirection; everything else is a combination of those. Virtualization
falls into the indirection category. There are two forms of
virtualization, namely System VMs that virtualize the environment
(VMWare, QEMU - different machines) and Language VMs that virtualize the
ISA (MAME, QEMU - different architectures). The current VM model usually
is the accelerator model: send stuff from the CPU to the GPU/FPGA for
computation and get a nice chunk of data back.
What is the solution to get over this heterogeneity? Use virtualization!
David introduces LIME: the Liquid Metal programming language, a single
language with multiple backends: CPU, GPU, WSP, & FPGA. This single
language compiles down to different architectures. The CPU backend must
compile any code; all other backends can decide not to compile a piece
of code, e.g., code that is not deeply pipelineable (with increased
latency) can be rejected by the GPU compiler. The approach for FPGAs
uses an artifact store that holds solutions for common problems. These
artifacts are then stitched together to form the compiled program,
otherwise the compilation overhead would be way too large. LmVM, the
Lime Virtual Machine, is introduced as an implementation of the LIME
principle. Code originally starts on the CPU and evolves (or can be
forced to evolve) to other platforms.
The programming approach is as follows:
Java is a subset of Lime. A programmer starts with a Java program and
extends it with different Lime features. Many new types are introduced
in Lime to adhere to the hardware peculiarities of the different
machines. [Insert long and lengthy discussion about language features
here].
Performance is evaluated using the following scheme: write 4 benchmarks
and 4 different versions of each benchmark to compare the different
implementations. The baseline is a naive Java implementation. This
baseline is compared to a handwritten expert implementation and the
automatic Lime compilation.
Total manpower needed to develop this approach: 8 man-years.
Session 1: Performance Monitoring
Performance Profiling of Virtual Machines
(Jiaqing Du, Nipun Sehrawat and Willy Zwaenepoel, EPFL Lausanne)
Performance counters (perfctr) incur only low overhead and are a lot
faster than binary instrumentation. The drawback is that support for
virtual machines is missing. There are three different profiling modes:
native profiling (OS<->CPU), guest-wide profiling (OS<->CPU, without the
VMM; only the guest is profiled), and system-wide profiling
(OS<->VMM<->CPU; both VMM and guest are profiled). They implement
performance counters for para-virtualization (Xen), hardware assistance
(KVM), and binary translation (QEMU), for both guest-wide and
system-wide profiling.
A challenging problem for guest-wide profiling is that the counter
context must be saved on all context switches (e.g., guest 1 to VMM, VMM
to guest 2). The overhead of the implemented approach is low, about 0.4%
for the additional counters in all cases. The native overhead in
contrast is about 0.04%, so the additional VMM increases the overhead by
10x. An analysis of the accuracy shows that the deviation increases for
virtual machines but is still very low for compute-intensive benchmarks.
For memory-intensive benchmarks QEMU has a much higher cache miss rate
due to the binary translation overhead.
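As a side note, here is what low-overhead native counter access looks like on Linux via perf_event_open; this is only my own illustration of the kind of counters being virtualized here, not the authors' Xen/KVM/QEMU code:

```c
/* Minimal native-profiling sketch: count retired instructions for this
 * process with perf_event_open. Illustrative only. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;                  /* user space only */

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile int i = 0; i < 1000000; i++) ;  /* the measured work */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) count = 0;
    printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```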
Questions: What about profiling across different VMs? (in the VMM?) Is
PEBS supported?
Perfctr-Xen: A Framework for Performance Counter Virtualization
(Ruslan Nikolaev and Godmar Back, Virginia Tech Blacksburg)
Perfctr-Xen is an implementation of performance counter virtualization
using the perfctr library in Xen. This removes the need for
architecture-specific code inside the Xen core to support PMUs. Two new
drivers had to be implemented, a Xen host perfctr driver and a Xen guest
perfctr driver, and the perfctr library needed to be changed as well.
Questions: What kind of changes are needed in the user-space library,
and why is the Xen guest driver needed? Is PEBS supported? Ruslan did
not convince me with his answers to these questions.
Dynamic Cache Contention Detection in Multi-threaded Applications
(Qin Zhao, David Koh, Syed Raza, Derek Bruening (Google), Weng-Fai Wong
and Saman Amarasinghe, MIT)
The motivation of this talk is to detect cache contention in
multi-threaded applications (e.g., false sharing between arrays across
multiple threads). The tool uses dynamic instrumentation to keep track
of individual memory locations using a bitmap and shadow memory. The
ownership bitmap for each cache line stores which of up to 32 threads
own that cache line. If a thread that accesses a cache line is not the
single owner then we have a potential data sharing problem. Combined
with the performance counters we can then detect cache contention. The
implementation is on top of Umbra, which uses DynamoRIO.
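My rough reconstruction of the ownership-bitmap check (the real tool keeps this state in shadow memory maintained by Umbra/DynamoRIO; the names and sizes below are my own):

```c
/* Sketch of a per-cache-line thread-ownership bitmap as I understood the
 * talk. A standalone toy, not the authors' implementation. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6                  /* 64-byte cache lines */
#define NLINES     (1u << 20)         /* toy shadow table */

static uint32_t ownership[NLINES];    /* bit i set => thread i touched line */

/* Called (conceptually) by the instrumentation on every memory access. */
static void on_access(uintptr_t addr, int thread_id)
{
    uint32_t line = (uint32_t)(addr >> LINE_SHIFT) % NLINES;
    uint32_t me   = 1u << thread_id;  /* supports up to 32 threads */
    uint32_t owners = ownership[line];

    if (owners != 0 && owners != me) {
        /* Another thread already owns this line: potential (false) sharing.
         * The real tool correlates such events with performance counters. */
        printf("possible contention on line %u (owners=0x%x, thread=%d)\n",
               line, (unsigned)owners, thread_id);
    }
    ownership[line] = owners | me;
}

int main(void)
{
    on_access(0x1000, 0);   /* thread 0 touches a line */
    on_access(0x1008, 1);   /* thread 1 touches the same line -> flagged */
    return 0;
}
```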
Questions: What tool did you use for BT? How can you know that you
measure the real overhead and not some distortion through the
instrumentation interface?
Session 2: Configuration
Rethink the Virtual Machine Template
(Kun Wang, Chengzhong Xu and Jia Rao)
The main objective is to reduce the startup overhead of system images
down to 1 second. The problem is that the overhead of VM creation is
large: cloning copies files, and other solutions are limited to the same
physical machine. The idea is to concentrate on the 'substrate' of the
virtual machine and keep only the smallest possible state (e.g., app/OS
state). This small substrate can then easily be copied to other machines
and restored to full online VM images.
Dolly: Virtualization-driven Database Provisioning for the Cloud
(Emmanuel Cecchet, Rahul Singh, Upendra Sharma and Prashant Shenoy,
UMASS CS)
Emmanuel used the tagline "Virtual Sex in the Cloud". This work tries to
solve the problem of adding database replicas to handle the load on
database backends. The problem is that the VM cannot just be cloned; a
complete database backup and restore must be carried out so that the DBs
can be synced across different nodes. It is hard to replicate state and,
e.g., instantiate a consistent copy/replica of a database. When the
replica is finally ready it misses a couple of updates that happened
during the process of generating the replica. The idea of this work is
to dynamically scale database backends in the cloud by generating new VM
clones and using DB restore in the background. Parameters are the
snapshot interval and the update frequency of the database, which in
turn determine the size of the replay log that must be recovered. The
evaluation part contains a detailed analysis of different predictors of
when to take snapshots, what the replay overhead from the snapshot to
the current state is, and how much it would cost on Amazon EC2.
ReHype: Enabling VM Survival Across Hypervisor Failures (Highlight)
(Michael Le and Yuval Tamir, UCLA)
The VMM is a single point of failure for VMs (due to hardware faults or
faults in the virtualization software). Another problem is that system
reboots (of the host) are too slow. ReHype detects failures and pauses
VMs in place. The VMM is then micro-rebooted. The paused VMs are then
integrated into the new VMM instance and unpaused. The related work
'Otherworld' reboots the Linux kernel after a failure and keeps
processes (applications) in memory. ReHype recovers a failed VMM while
preserving the state of the VMs. Possible VMM failures are crash, hang,
or silent failures (no crash/hang detected but VMs fail). On a crash the
VMM panic handler is executed; on a hang the VMM watchdog handler is
executed. The system was evaluated using fault injection into the VMM
state.
ReHype can only recover from software failures; the hardware stays the
same, so persistent HW failures are not covered. Logs are not kept to
fix the bugs later on. But in theory this system could also be used to
upgrade VMMs.
Questions: What about the size of the system (LOC)?
Session 3: Recovery
Fast and Space Efficient Virtual Machine Checkpointing
(Eunbyung Park, Bernhard Egger and Jaejin Lee, Seoul University, South
Korea)
Checkpointing can be used for faster VM scaling, high availability, and
debugging/forensics. A checkpoint stores the volatile state of the VM. A
large part of the snapshot data does not need to be saved, e.g., the
file cache in the Linux kernel. The goal is to make checkpointing faster
by identifying these redundant pages and removing them from the
snapshot. A mapping between memory pages and disk blocks is added to the
VMM. Problem: how to detect dirty/written pages in memory? It is
necessary to check the shadow page table of the guest. Result: 81%
reduction in stored data and 74% reduction in checkpoint time for
para-virtualized guests; 66% reduction in data and 55% reduction in
checkpoint time for fully-virtualized guests.
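A minimal sketch of the redundant-page filter as I understood it, assuming the VMM keeps a page-to-disk-block mapping plus a dirty flag (the struct and function names are mine, not the paper's):

```c
/* Sketch: a page that still mirrors its backing disk block (e.g. clean file
 * cache) can be skipped in the checkpoint and restored from disk later. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NPAGES 1024

struct page_info {
    int64_t disk_block;   /* backing block, or -1 if none */
    bool    dirty;        /* written since it was loaded from disk */
};

static struct page_info page_map[NPAGES];

/* Decide whether a page must go into the checkpoint image. */
static bool must_checkpoint(const struct page_info *p)
{
    /* A clean page with a known backing block can be re-read from disk,
     * so storing it again would be redundant. */
    if (p->disk_block >= 0 && !p->dirty)
        return false;
    return true;
}

int main(void)
{
    page_map[0] = (struct page_info){ .disk_block = 42, .dirty = false };
    page_map[1] = (struct page_info){ .disk_block = -1, .dirty = true  };

    for (int i = 0; i < 2; i++)
        printf("page %d: %s\n", i,
               must_checkpoint(&page_map[i]) ? "store"
                                             : "skip (restore from disk)");
    return 0;
}
```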
Fast Restore of Checkpointed Memory using Working Set Estimation
(Highlight)
(Irene Zhang, Yury Baskakov, Alex Garthwaite and Kenneth C. Barr,
VMWare)
This work reduces the time to restore a checkpoint from disk. Current
schemes: (1) eager restore, which restores all pages to memory before
the VM starts; (2) lazy restore, which restores only the CPU/device
state and restores memory in the background. If the guest accesses pages
that are not yet restored then the VMM must stop the VM and restore that
specific page (this can lead to thrashing). How to measure restore
performance? Time-to-responsiveness measures the time until the VM is
usable: how big is the share of the restore process of the total time?
(Mutator utilization, a notion from the GC community that measures GC
overhead.)
The new feature is working-set restore, which prefetches the current
working set to reduce VM performance degradation. The working set is
estimated using either access-bit scanning or memory tracing. The memory
tracing runs alongside the VM all the time (overhead around 0.002%) and
keeps track of the working set. When the checkpoint is restored, this
set of pages is restored first; since the VM is resumed at the point the
checkpoint was taken, all these pages will be accessed again. (Of course
no external I/O may be executed during the checkpointing.)
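A small sketch of working-set estimation by access-bit scanning, with a simplified, hypothetical PTE layout (a real VMM would walk the actual guest or shadow page tables):

```c
/* Sketch: periodically scan page-table entries, record pages whose accessed
 * bit is set, then clear the bit so the next scan sees fresh accesses. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NPAGES   1024
#define PTE_A    (1u << 5)    /* accessed bit, as on x86 */

static uint32_t pte[NPAGES];      /* toy page table */
static bool working_set[NPAGES];  /* pages to prefetch on restore */

static void scan_accessed_bits(void)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pte[i] & PTE_A) {
            working_set[i] = true;   /* page was touched since last scan */
            pte[i] &= ~PTE_A;        /* reset for the next sampling period */
        }
    }
}

int main(void)
{
    pte[3] |= PTE_A;                 /* pretend the guest touched page 3 */
    scan_accessed_bits();
    printf("page 3 in working set: %d\n", working_set[3]);
    return 0;
}
```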
Questions: Do you do any linearization of the list of pages that need to
be restored? Is I/O allowed during lazy checkpointing? Working-set
predictor and peeking into the future?
Fast and Correct Performance Recovery of Operating Systems Using a
Virtual Machine Monitor
(Kenichi Kourai, Institute of Technology, Japan)
A reboot is often the only solution to get over a fault in the system.
After a reboot, performance is still degraded due to many page cache
misses. A new form of reboot with a warm page cache is proposed: the
page cache is kept in memory and can be reused after the reboot. A cache
consistency mechanism is added to keep track of the caching information.
Saving a couple of seconds after a reboot leads to constant overhead
during the runtime due to the implementation. Does this really make
sense? Reboots should be infrequent for servers. Does it really make
sense to keep the page cache around?
Session 4: Migration
Evaluation of Delta Compression techniques for Efficient Live
Migration of Large Virtual Machines
(Petter Svard, Benoit Hudzia, Johan Tordsson and Erik Elmroth, Umea
University Sweden)
A problem with current solutions is that more pages can turn dirty than
pages are transferred to the other host. At some point the VM is stopped
and the remaining pages are transferred, which leads to a long downtime.
Depending on the transfer link it can make more sense to compress,
transfer, and decompress than to just transfer pages, because
compression and decompression are faster than the transfer of the full
uncompressed page. A special remark is that only pages that were already
transferred are compressed: if a page was already transferred, is still
in the cache of the sender, and turns dirty, then the delta is
constructed, compressed, and sent. Otherwise the plain page is sent.
Petter did some live demos.
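My sketch of the transfer decision described in the talk, assuming a sender-side page cache and a placeholder for the actual compression codec:

```c
/* Sketch: if a dirtied page was already sent and is still cached on the
 * sender, send the (highly compressible) XOR delta; otherwise send the
 * plain page. The send/compress functions are placeholders. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE 4096

static void send_plain(const uint8_t *page)         { (void)page; }
static void send_compressed_delta(const uint8_t *d) { (void)d; }

/* cached: copy of the page as it was when first transferred, or NULL. */
static void transfer_dirty_page(const uint8_t *page, const uint8_t *cached)
{
    if (cached == NULL) {             /* never sent before: no delta possible */
        send_plain(page);
        return;
    }
    uint8_t delta[PAGE_SIZE];
    size_t changed = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        delta[i] = page[i] ^ cached[i];    /* unchanged bytes become zero */
        if (delta[i]) changed++;
    }
    /* Mostly-zero deltas compress very well, so compressing and sending the
     * delta beats retransmitting the whole page on slow links. */
    printf("delta has %zu changed bytes of %d\n", changed, PAGE_SIZE);
    send_compressed_delta(delta);
}

int main(void)
{
    static uint8_t cached[PAGE_SIZE], page[PAGE_SIZE];
    memcpy(page, cached, PAGE_SIZE);
    page[17] = 0xff;                    /* page dirtied after first transfer */
    transfer_dirty_page(page, cached);  /* delta path */
    transfer_dirty_page(page, NULL);    /* plain path */
    return 0;
}
```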
CloudNet: Dynamic Pooling of Cloud Resources by Live WAN Migration of
Virtual Machines (Highlight)
(Timothy Wood, KK Ramakrishnan, Prashant Shenoy and Jacobus van der
Merwe)
Problem: cloud resources are isolated from one another and from the
enterprise. The interesting question is how to manage these different
isolated machines and how to secure data transfers between the different
machines and across multiple data centers. The approach uses VPNs to
connect different data centers together with common migration tools.
Workload-Aware Live Storage Migration for Clouds
(Jie Zheng, T. S. Eugene Ng and Kunwadee Sripanidkulchai, Rice
University)
Storage migration contributes the largest part of the data that needs to
be transferred in a wide-area VM migration. No shared file storage is
available, so the disk image must be synchronized somehow (based on
block migration).
Session 5: Security
Patch Auditing in Infrastructure as a Service Clouds (Highlight; Read
paper)
(Lionel Litty and David Lie, VMWare / University of Toronto)
Apply your patches! But not everybody does it, and even automatic patch
application is not a solution. Monitoring at the OS level is also not
continuous or systematic; different applications have different update
mechanisms. There is a need for a better tool to automate the update
mechanism and to monitor the vulnerability state of systems. Additional
challenges are VMs that might be powered down or unavailable to the
infrastructure administrator. Solution: add patch monitoring to the VMM
infrastructure and report to a central tool. Use the VMM to detect
application updates (binary and text only) and analyze different
patches. Use executable bits to detect all live executed code in a VM
and check that the executed code is OK.
Patagonix (detects only binary code) -> P2 (also detects extended
executable code: bash scripts, Python, executables).
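A rough sketch of the executable-page check, assuming pages start non-executable and are hashed on their first execution fault; the FNV hash and the tiny whitelist are mine, purely illustrative:

```c
/* Sketch: on the first execution fault for a page, hash it and look it up
 * in a database of code pages from known (patched/unpatched) builds. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096

static uint64_t fnv1a(const uint8_t *data, size_t len)
{
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ull;
    }
    return h;
}

/* Hypothetical database of hashes of code pages from known builds. */
static const uint64_t known_good[] = { 0x0123456789abcdefull };

static bool page_is_known(const uint8_t *page)
{
    uint64_t h = fnv1a(page, PAGE_SIZE);
    for (size_t i = 0; i < sizeof(known_good) / sizeof(known_good[0]); i++)
        if (known_good[i] == h)
            return true;
    return false;   /* unknown code => possibly unpatched or unexpected */
}

int main(void)
{
    static uint8_t page[PAGE_SIZE];   /* stands in for a faulting code page */
    printf("executed page recognized: %s\n",
           page_is_known(page) ? "yes" : "no");
    return 0;
}
```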
Fine-Grained User-Space Security Through Virtualization
(Mathias Payer and Thomas R. Gross, ETH Zurich Switzerland)
My talk. See my paper for details.
Session 6: Virtualization Techniques
Minimal-overhead Virtualization of a Large Scale Supercomputer
(Jack Lange, Kevin Pedretti, Peter Dinda, Patrick Bridges, Chang Bae,
Philip Soltero and Alexander Merritt, University of Pittsburgh)
Palacios (an OS-independent embeddable VMM) and Kitten (a lightweight
supercomputing OS) for HPC. Key concepts for minimal-overhead
virtualization are that (1) I/O is passed through, i.e., direct I/O
access with no virtualization overhead; (2) virtual paging is optimized
for nested and shadow paging; (3) preemption is controlled to reduce
host OS noise. The VMM trusts the guest (e.g., to do DMA correctly);
bugs in the guest could bring down the complete system. Symbiotic
virtualization is introduced as a new approach that uses cooperation.
Virtual WiFi: Bring Virtualization from Wired to Wireless
(Highlight)
(Lei Xia, Sanjay Kumar, Xue Yang, Praveen Gopalakrishnan, York Liu,
Sebastian Schoenberg and Xingang Guo, Northwestern University)
This is a new approach to virtualization that enables WiFi
virtualization: one physical WiFi interface is virtualized and can be
used in multiple VMs. The current approach is to virtualize an Ethernet
device inside the guest VM, which strips away all the WiFi
functionality. The new approach virtualizes the complete WiFi
functionality in the VM. The same Intel WiFi driver is used in the guest
VM as in the host VM. Each VM gets its own vMAC, the host VM distributes
packets according to the vMAC, and all other capabilities are directly
forwarded to the VMs and can be set by the VMs as well.
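A minimal sketch of how the host VM might demultiplex received frames by per-guest vMAC; the struct names and the lookup are my own simplification:

```c
/* Sketch: for each frame received on the physical WiFi NIC, match the
 * destination MAC against the guests' vMACs and deliver accordingly. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MAX_GUESTS 8

struct guest {
    uint8_t vmac[6];       /* virtual MAC assigned to this guest VM */
    int     id;
};

static struct guest guests[MAX_GUESTS];
static int nguests;

static void deliver_to_guest(int id, const uint8_t *frame, size_t len)
{
    (void)frame; (void)len;
    printf("frame delivered to guest %d\n", id);
}

/* Called by the host for each received frame. */
static void demux_frame(const uint8_t *frame, size_t len)
{
    if (len < 6) return;
    for (int i = 0; i < nguests; i++) {
        if (memcmp(frame, guests[i].vmac, 6) == 0) {  /* dst MAC matches */
            deliver_to_guest(guests[i].id, frame, len);
            return;
        }
    }
    /* No matching vMAC: handled by the host itself (or dropped). */
}

int main(void)
{
    guests[nguests++] = (struct guest){ .vmac = {2,0,0,0,0,1}, .id = 1 };
    uint8_t frame[64] = {2,0,0,0,0,1};
    demux_frame(frame, sizeof(frame));
    return 0;
}
```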
Questions: Promiscuous? Rate limited? Multiple vMACs supported in VM as
well?
SymCall: Symbiotic Virtualization Through VMM-to-Guest Upcalls
(Jack Lange and Peter Dinda)
Semantic gap: loss of semantic information between the HW and the
emulated guest; HW and guest OS state is unknown to the VMM. Two
approaches to find out about the guest: black box (monitor external
guest interactions) and gray box (reverse engineer guest state).
Symbiotic virtualization: design both the guest OS and the VMM to
minimize the semantic gap, but also offer a fallback to a black-box
guest OS. SymSpy is a passive interface that uses asynchronous
communication to learn about hidden state, and SymCall uses upcalls into
the guest during exit handling.
- SymSpy: uses a shared memory page between the OS and the VMM to offer
structured data exchange between VMM and OS (a rough sketch follows
after this list).
- SymCall: similar to system calls, but the VMM requests services from the OS.
- Restrictions: only 1 SymCall can be active at a time; SymCalls run to
completion (no blocking, no context switches, no exceptions or
interrupts); SymCalls cannot wait on locks (deadlocks).
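To make the SymSpy idea concrete, here is a hypothetical layout of such a shared page; the field names are my guesses, see the paper for the real interface:

```c
/* Sketch of a SymSpy-style shared page: a memory page mapped into both
 * guest and VMM for structured, asynchronous data exchange. All field
 * names are hypothetical. */
#include <stdint.h>

struct symspy_page {
    uint32_t magic;           /* lets the VMM detect a symbiotic guest */
    uint32_t version;
    /* guest -> VMM: hidden state the VMM cannot easily infer */
    uint64_t guest_cr3_hint;  /* e.g. current address space */
    uint32_t swapping_active; /* e.g. whether the guest is swapping */
    /* VMM -> guest: entry point for synchronous upcalls (SymCall) */
    uint64_t symcall_entry;   /* guest virtual address of the handler */
};

int main(void)
{
    /* In a real system this page would be registered with the VMM (e.g.
     * via a hypercall); here we only show the layout. Both sides simply
     * read and write the fields, so no exits are needed for the passive
     * SymSpy direction. */
    static struct symspy_page page = { .magic = 0x53594D42 /* "SYMB" */ };
    (void)page;
    return 0;
}
```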
SwapBypass is an optimization that pushes swapping from the guest to the
VMM. SwapBypass uses a shadow copy of the page tables of the guest VM.
The VM does not swap out any pages; caching/swapping only happens in the
VMM but never in the guest VM, to reduce I/O pollution. Page faults
happen in the VMM and not in the host.
Session 7: Memory Management
Overdriver: Handling Memory Overload in an Oversubscribed Cloud
(Highlight)
(Dan Williams, Hani Jamjoom, Yew-Huey Liu and Hakim Weatherspoon,
Cornell University)
Peak loads are very rare and utilization in data centers is below 15%.
On the other hand, peak loads are unpredictable and oversubscription can
lead to overload. Memory oversubscription is especially critical because
overload carries a high penalty due to swapping costs. The focus of this
work is to research whether the performance degradation due to memory
overload can be managed, reduced, or eliminated.
An analysis of different memory overloads shows that most overload is
transient (96% last less than 1 min) and some overload is sustained (2%
last longer than 10 min). Two techniques are used to address memory
overload: VM migration (migrating the VM to another machine) and network
memory, which sends swapped pages not to disk but to another machine
over the network. Network memory may be used for transient overloads and
VM migration for sustained overloads.
OverDriver uses network memory and VM migration to handle overload. It
collects swap/overload statistics for each VM and uses overload profiles
to decide when to switch from network memory to VM migration.
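A sketch of an OverDriver-style decision between network memory and migration; the threshold and function names are illustrative assumptions, not the paper's actual policy:

```c
/* Sketch: absorb transient overloads with network memory (remote swap)
 * and fall back to VM migration once an overload looks sustained. */
#include <stdio.h>

#define SUSTAINED_THRESHOLD_SEC 60   /* most overloads last < 1 min */

enum action { USE_NETWORK_MEMORY, MIGRATE_VM };

static enum action handle_overload(int overload_duration_sec)
{
    if (overload_duration_sec < SUSTAINED_THRESHOLD_SEC)
        return USE_NETWORK_MEMORY;   /* cheap, absorbs transient spikes */
    return MIGRATE_VM;               /* sustained overload: move the VM */
}

int main(void)
{
    printf("%d\n", handle_overload(10));   /* transient -> network memory */
    printf("%d\n", handle_overload(700));  /* sustained -> migration */
    return 0;
}
```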
Question: The decision on when to migrate is static; what about adaptive
checks/analysis for migration? What other predictors could you use?
(Sounds like future work.)
Selective Hardware/Software Memory Virtualization
(Xiaolin Wang, Jiarui Zang, Zhenlin Wang, Yingwei Luo and Xiaoming Li,
Peking University)
There are 3 possibilities for memory virtualization: MMU
para-virtualization, shadow page tables, and EPT/NPT. Idea: use dynamic
switching between hardware-assisted paging and shadow paging. Question:
how and when to switch?
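One possible switching policy of the kind the paper investigates, sketched with illustrative counters and thresholds (not the authors' actual heuristic):

```c
/* Sketch: shadow paging is expensive when the guest updates page tables a
 * lot (every update traps to the VMM); nested paging (EPT/NPT) is expensive
 * when TLB misses dominate (two-dimensional page walks). */
#include <stdio.h>

enum mode { HARDWARE_ASSISTED, SHADOW_PAGING };

static enum mode choose_mode(unsigned long pt_updates_per_sec,
                             unsigned long tlb_misses_per_sec)
{
    if (pt_updates_per_sec > 100000)        /* page-table-write heavy */
        return HARDWARE_ASSISTED;
    if (tlb_misses_per_sec > 1000000)       /* TLB-miss heavy */
        return SHADOW_PAGING;
    return HARDWARE_ASSISTED;               /* default */
}

int main(void)
{
    printf("%d\n", choose_mode(200000, 1000));
    printf("%d\n", choose_mode(100, 5000000));
    return 0;
}
```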
Hybrid Binary Rewriting for Memory Access Instrumentation (Highlight;
Read paper)
(Amitabha Roy, Steven Hand and Tim Harris, University of Cambridge UK)
Scaling of multi-threaded shared memory programs can be problematic
(scalability, races, atomicity violations). The goal is to run existing
x86 binaries and analyze their synchronization primitives (locks).
Dynamic binary rewriting is used to analyze the lock primitives.
It is hard to decide statically whether a lock is taken or not; the
result is either over-instrumentation or unsoundness, which is why
dynamic BT is needed. Hybrid binary rewriting uses static binary
rewriting with dynamic binary rewriting as a fallback. A persistent
instrumentation cache (PIC) is stored between different runs of the same
program so the translated code can be reused.
HBR is used for two case studies:
- Profiling: understanding how suitable programs are for applying STM
transformations.
- Speculative lock elision: remove locks, turn them into stm_start and
stm_commit, and instrument reads and writes (a rough sketch of this
rewrite follows after this list). STAMP is used to evaluate this dynamic
instrumentation. The problem is that STAMP uses private data that is
accessed inside transactions, and there is manual optimization for STMs
that gets rid of the additional read and write operations. Dynamic
instrumentation instruments all reads and writes and therefore performs
badly in these cases. Private data tracking uses special tracking of
private data to reduce the amount of instrumentation and brings the
overhead down to reasonable numbers.
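Here is the lock-elision rewrite sketched in C with hypothetical stm_* stand-ins (the paper works on x86 binaries and a real STM runtime):

```c
/* Sketch: a lock-protected critical section is turned into a transaction,
 * and shared reads/writes inside it are instrumented. The stm_* functions
 * are stand-ins, not a specific library API. */
#include <stdint.h>
#include <stdio.h>

/* Trivial stand-in STM runtime so the sketch compiles and runs; a real STM
 * would do logging, conflict detection, and aborts here. */
static void     stm_start(void)                        { }
static void     stm_commit(void)                       { }
static intptr_t stm_read(intptr_t *addr)               { return *addr; }
static void     stm_write(intptr_t *addr, intptr_t v)  { *addr = v; }

static intptr_t shared_counter;

/* Original code:
 *     pthread_mutex_lock(&m);
 *     shared_counter++;
 *     pthread_mutex_unlock(&m);
 * After the rewrite the lock is elided and the accesses are instrumented: */
static void increment_elided(void)
{
    stm_start();                                 /* replaces lock()    */
    intptr_t v = stm_read(&shared_counter);      /* instrumented read  */
    stm_write(&shared_counter, v + 1);           /* instrumented write */
    stm_commit();                                /* replaces unlock()  */
}

int main(void)
{
    increment_elided();
    printf("counter = %ld\n", (long)shared_counter);
    return 0;
}
```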
Questions: Static binary rewriting has no runtime overhead (no
translation overhead), but there can be artifacts/overhead from the
translation process. The translation overhead for DBT is <1%. What about
hierarchical transactions?
Peek into the future:
VEE 2012 will be in London, UK; the general chair will be Steve Hand.
VEE'12 is co-located with ASPLOS again, on Saturday March 3rd and Sunday
March 4th.