As you might know I've been presenting at the IEEE ISPASS'11
conference in Austin, Texas. The conference went from April 10th to
April 12th. If you are interested you can read my ramblings about the
individual talks below.
(David Brooks & Rajeev Balasubramonian)
64 submissions in total, 4 PC reviews per paper, accepted via consensus, 24
papers accepted.
Keynote: The Era of Heterogeneity: Are we prepared?
(Ravi Iyer, Intel)
Shift from client/server to smart devices (tablets, smart phones, ...)
Integrate GPU and IP blocks into the CPU for power efficiency; it's no
longer just about cores but also about the accelerators that we integrate
into the chip.
Why heterogeneity? Because the workloads are heterogeneous and one
single solution (a general-purpose core) will not work. Small cores scale
well and offer good power/performance; big cores are needed for single-threaded
performance. The talk sounds a lot like ISCA'09 where they proposed a
heterogeneous architecture with one big core and multiple small cores.
Intel's Idea: SPECS (Scalability, Programmability, Energy,
Questions for cores and accelerators are:
How to mix and prioritize heterogeneous cores? Should all cores have the
same ISA (e.g., SSE)? How should we structure the cache architecture?
Solution for mixed ISA: Use a co-op model and run applications on any
core. If we get an unsupported-opcode exception on the smaller core,
then the OS must move the application to the bigger core. This sounds a
little like Albert Noll's VM for the Cell. What about hardware tricks
for context switches?
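The fault-and-migrate policy from the keynote can be sketched in a few lines. This is my own toy model of the idea (not Intel's design): run on the small core until an unsupported opcode faults, then let the "OS" move the work to the big core. All names and the fake ISAs are illustrative.

```python
# Toy model of mixed-ISA scheduling: small core has a subset ISA,
# an unsupported-opcode fault triggers migration to the big core.
BIG_CORE_ISA = {"add", "mul", "sse_op"}   # superset ISA (illustrative)
SMALL_CORE_ISA = {"add", "mul"}           # subset ISA

def run(program, core_isa):
    """Execute until completion or until an opcode faults."""
    for pc, op in enumerate(program):
        if op not in core_isa:
            return ("fault", pc)          # unsupported-opcode exception
    return ("done", len(program))

def schedule(program):
    """OS policy: start on the small core, migrate on an ISA fault."""
    status, pc = run(program, SMALL_CORE_ISA)
    if status == "fault":
        run(program[pc:], BIG_CORE_ISA)   # resume on the big core
        return "big"
    return "small"
```

The open question from my notes (hardware tricks for context switches) is exactly the cost hidden inside that migration step.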
Session 1: Best Paper Nominees (David Christie, AMD)
Characterization and Dynamic Mitigation of Intra-Application Cache Interference
(Carole-Jean Wu, Margaret Martonosi, Princeton University)
Intra-application cache interference is a challenging problem. The paper
measures and characterizes the cache behavior of applications using a
two-fold setup: an Intel Nehalem with perfmon2 and Simics/GEMS for a
simulated system. Measurements show that cache lines brought in by system
code are usually not reused (source: mostly TLB misses), so these misses
pollute the application's cache lines.
Propose new cache policies that exploit the fact that cache lines from
a system context are not reused as often as cache lines from a user
context.
Questions: Measurement done on 64/32b system? Are there differences due
to different page placement?
A Semi-Preemptive Garbage Collector for Solid State Drives
(Junghee Lee*, Youngjae Kim, Galen M. Shipman, Sarp Oral, Jongman
Kim*, Feiyi Wang, Oak Ridge National Laboratory, Georgia Institute of
Technology*)
Block replacement strategies and how to cope with flash problems: SSDs
offer fast access, but performance degrades during garbage collection.
The paper proposes a new, semi-preemptive form of GC inside the SSD.
PRISM: Zooming in Persistent RAM Storage Behavior
(Ju-Young Jung, Sangyeun Cho, University of Pittsburgh)
Block-oriented FS for PRAM. Not my field.
Evaluation and Optimization of Multicore Performance Bottlenecks in
Supercomputing Applications
(Jeff Diamond, Martin Burtscher, John D. McCalpin, Byoung-Do Kim,
Stephen W. Keckler, James C. Browne, University of Texas Austin, Texas
State University, and NVIDIA)
Moore's law of supercomputing: scale the number of cores! Motivation
for this paper: inter-chip scalability. New compiler optimizations for
multi-core CPUs, depending on cache layout and coordination. Use AMD
performance counters to measure HPC performance.
The most important performance factors for multi-core are: L3 miss rate
(cache contention), off-chip bandwidth, and DRAM contention (DRAM page
miss rates). (Great talk!)
Session 2: Memory Hierarchies (Suzanne Rivoire, Sonoma State University)
Minimizing Interference through Application Mapping in Multi-Level
Buffer Caches
(Christina M. Patrick, Nicholas Voshell, Mahmut Kandemir, Pennsylvania
State University)
Storage paper that handles a switched network with a complicated node
hierarchy. The paper introduces interference predictors for the I/O
route through the network and analyzes buffer cache placement.
-ETOOMANYFORMULAS (for me)
Analyzing the Impact of Useless Write-Backs on the Endurance and
Energy Consumption of PCM Main Memory
(Santiago Bock, Bruce R. Childers, Rami G. Melhem, Daniel Mosse, Youtao
Zhang, University of Pittsburgh)
20-40% of energy consumption is due to the memory system. Use Phase
Change Memory (PCM) instead of DRAM: low static power (non-volatile),
read performance comparable to DRAM, scales better than DRAM, but high
energy cost for writes and limited write endurance. Observation: a
write-back is useless if the data is not used again later on. Use
application information from the allocator, control-flow analysis, or
the stack pointer. Focus: how many useless write-backs can be avoided
using these metrics? Three different regions are analyzed: heap: use
malloc / free; globals: control-flow analysis; stack: stack pointer.
What about DRAM, would that make sense as well (reducing the number of
write-backs), e.g., for cache coherence in multi-cores?
His solution: the application tells the HW which regions are dead/alive.
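The stack-pointer heuristic is the easiest of the three region checks to picture. Here is my own toy version (not the paper's implementation): with a downward-growing stack, a dirty line lying entirely below the current stack pointer belongs to a dead frame, so writing it back to PCM is useless. Line size and addresses are illustrative.

```python
# Dead-stack heuristic: a dirty cache line wholly below SP belongs to a
# popped stack frame, so its write-back can be dropped.
LINE_SIZE = 64  # bytes per cache line (illustrative)

def writeback_is_useless(line_addr: int, stack_pointer: int) -> bool:
    """True if the whole line sits in dead stack space (below SP)."""
    return line_addr + LINE_SIZE <= stack_pointer
```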
Access Pattern-Aware DRAM Performance Model for Multi-Core Systems
(Hyojin Choi, Jongbok Lee*, Wonyong Sung, Seoul National University,
Latency between different banks, very low level/HW.
Characterizing Multi-threaded Applications based on Shared-Resource
Contention
(Tanima Dey, Wei Wang, Jack Davidson, Mary Lou Soffa, University of
Virginia)
Check/measure intra-application contention and inter-application
contention for L1/L2/Front side bus.
Session 3: Tracing (Tom Wenisch, University of Michigan)
Trace-driven Simulation of Multithreaded Applications
(Alejandro Rico*, Alejandro Duran*, Felipe Cabarcas*, Alex Ramirez,
Yoav Etsion*, Mateo Valero*, Barcelona Supercomputing Center*,
Universitat Politecnica de Catalunya)
How to simulate multi-threaded applications using traces? Capture traces
for sequential code sections and capture the calls to parallel operations
(parops), but do not capture the execution of the parops themselves.
Interesting but not my topic.
Efficient Memory Tracing by Program Skeletonization
(Alain Ketterlin, Philippe Clauss, Universite de Strasbourg & INRIA)
We want to get the minimum amount of code needed to reproduce the memory
access behavior of an application. Full instrumentation is expensive but
useful as a baseline. To improve from there, find loops in the binary
code, try to recognize access patterns, and generate the access sequences
directly to remove the instrumentation. Work on machine code and find
memory accesses such as movl %eax, (%ebx,%ecx,8).
Program skeletonization extracts what is useful to compute the memory
addresses.
Questions: Do you also track direct registers (e.g., when the address
computation happens earlier)? You decouple the memory recording and the
application, so recording happens with loose correlation to the
application; how do you handle threads/concurrent memory accesses?
Exceptions? (Great talk!)
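The pattern-recognition step can be sketched as follows. This is my own toy version, not the paper's algorithm (function names are mine): if the addresses recorded for one instruction form a constant-stride sequence, the instrumentation can be replaced by the closed form base + i*stride in the skeleton.

```python
# Detect whether a recorded address trace is affine (base + i*stride),
# in which case the skeleton can regenerate it without instrumentation.
def fits_affine(addresses):
    """Return (base, stride) if addresses follow base + i*stride, else None."""
    if len(addresses) < 2:
        return None
    base, stride = addresses[0], addresses[1] - addresses[0]
    for i, addr in enumerate(addresses):
        if addr != base + i * stride:
            return None
    return (base, stride)
```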
Portable Trace Compression through Instruction Interpretation
(Svilen Kanev, Robert Cohn*, Harvard University, Intel*)
If you are reliably able to predict a byte stream you do not need to
store it explicitly.
Reception & Poster Session
VMAD: A Virtual Machine for Advanced Dynamic Analysis of Programs
(Alexandra Jimborean, Matthieu Herrmann*, Vincent Loechner, Philippe
Clauss, INRIA, Universite de Strasbourg*)
Interesting work on LLVM that adds different alternatives and tries
reverse compilation to turn, e.g., while loops into for loops and
adaptively optimize them (for C/C++ code). Interesting work; maybe
forward Oliver's work to her.
Performance Characterization of Mobile-Class Nodes: Why Fewer Bits is
Better
(Michelle McDaniel, Kim Hazelwood, University of Virginia)
For netbooks, 32-bit code is faster than 64-bit code. What kind of GCC
settings did you use? Mention Acovea; also, her master's thesis is about
padding, so give her a pointer to my work.
Keynote II: Integrated Modeling Challenges in Extreme-Scale Computing
(Pradip Bose, IBM)
Exa-scale computing is 10^18 FLOPS, which is 1000x peta-scale computing.
What is the wall: power or reliability?
Power wall: We need to reduce the power needed in chips; dozens of cores
per chip that are each allowed to use 1/1000 of the power. Idea: different
processing modes. Storage mode: turn off parallel cores; computing mode:
turn off storage controllers and I/O. Reliability wall: MTTF and
reliability drop with the increased number of transistors. Problem: with
millions of cores/CPUs the MTTF is so low that supercomputers are not
even able to complete Linpack benchmark runs between failures. MTTR vs.
MTTF (mean time to repair vs. mean time to failure).
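A back-of-the-envelope calculation shows why the reliability wall bites. The numbers below are my own illustration, not from the keynote: assuming independent exponential failures, N components with per-component MTTF m give a system MTTF of m / N.

```python
# System MTTF shrinks linearly with the number of independent components
# (exponential failure model).
def system_mttf_hours(component_mttf_hours: float, n_components: int) -> float:
    return component_mttf_hours / n_components

# A 5-year (~43800 h) per-node MTTF across a million nodes leaves only
# minutes between failures on average:
minutes_between_failures = system_mttf_hours(43800.0, 1_000_000) * 60
```

At roughly 2.6 minutes between failures, even a single Linpack run cannot finish without checkpointing, which is exactly the problem stated in the talk.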
Session 4: Emerging Workloads (Derek Chiou, UT Austin)
Where is the Data? Why you Cannot Debate CPU vs. GPU Performance
Without the Answer
(Chris Gregg, Kim Hazelwood, University of Virginia)
GPU computation is fast but data transfer from/to the GPU is a
bottleneck; GPU speedup numbers are misleading without describing the
data transfer costs.
Questions: What about algorithms with a dual-use approach where the CPU
does not idle during kernel execution? What about compression? (Great talk!)
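The talk's point fits in one formula (my illustration, numbers are made up, not from the paper): an honest GPU speedup must include host-to-device transfer time, not just the kernel time.

```python
# End-to-end GPU speedup once transfers are included.
def effective_speedup(t_cpu: float, t_kernel: float, t_transfer: float) -> float:
    """CPU time divided by GPU kernel time plus transfer time."""
    return t_cpu / (t_kernel + t_transfer)

# A kernel that is 20x faster in isolation (100 s -> 5 s) shrinks to 2x
# end-to-end once 45 s of transfers are counted:
speedup = effective_speedup(t_cpu=100.0, t_kernel=5.0, t_transfer=45.0)  # 2.0
```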
Accelerating Search and Recognition Workloads with SSE 4.2 String and
Text Processing Instructions
(Guangyu Shi, Min Li, Mikko Lipasti, UW-Madison)
STTNI can be used to implement a broad set of search and recognition
applications; embrace the newly available instructions to speed up
classical algorithms. pcmpestri: packed compare explicit length strings,
return index. The new instructions can be used for arbitrary data
comparisons. Depending on the data structure, different algorithms are
needed: easy for arrays; tree structures need B-tree-like handling
similar to strings; hash tables are more complicated, but collisions can
be resolved with STTNI.
What about aligned loads or loop unrolling for this code? The example was
a single static loop that used unaligned loads (expensive) and no manual
loop unrolling. The speaker only compared against GCC, not ICC or other
compilers.
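To make the pcmpestri idea concrete, here is a plain-Python emulation of its "equal any" aggregation mode as I understand it (my sketch, not Intel's reference semantics): return the index of the first byte in a 16-byte fragment that matches any byte of the set, or 16 if none match, then scan 16 bytes per "instruction".

```python
# Emulate pcmpestri's "equal any" mode on one 16-byte fragment.
def pcmpestri_equal_any(byte_set: bytes, fragment: bytes) -> int:
    """Index of first byte in fragment matching any byte of byte_set, else 16."""
    for i, b in enumerate(fragment[:16]):
        if b in byte_set[:16]:
            return i
    return 16

def find_any(byte_set: bytes, data: bytes) -> int:
    """strpbrk-style scan: process 16 bytes per 'instruction'."""
    for off in range(0, len(data), 16):
        idx = pcmpestri_equal_any(byte_set, data[off:off + 16])
        if idx < 16:
            return off + idx
    return -1
```

The hardware version does the whole 16x16 comparison in one instruction, which is where the speedup over byte-at-a-time scanning comes from.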
A Comprehensive Analysis and Parallelization of an Image Retrieval
Algorithm
(Zhenman Fang, Weihua Zhang, Haibo Chen, Binyu Zang, Fudan University)
You shall not use Comic, Sans Serif, Courier, and Serif fonts on one
slide.
Performance Evaluation of Adaptivity in Transactional Memory
(Mathias Payer, Thomas R. Gross, ETH Zurich)
My talk. See: https://nebelwelt.net/publications/11ISPASS/
Transactional memory (TM) is an attractive platform for parallel
programs, and several software transactional memory (STM) designs have
been presented. We explore and analyze several optimization
opportunities to adapt STM parameters to a running program.
This paper uses adaptSTM, a flexible STM library with a non-adaptive
baseline common to current fast STM libraries to evaluate different
performance options. The baseline is extended by an online evaluation
system that enables the measurement of key runtime parameters like read-
and write-locations, or commit- and abort-rate. The performance data is
used by a thread-local adaptation system to tune the STM configuration.
The system adapts different important parameters like write-set
hash-size, hash-function, and write strategy based on runtime statistics
on a per-thread basis.
We discuss different self-adapting parameters, especially their
performance implications and the resulting trade-offs. Measurements show
that local per-thread adaptation outperforms global system-wide
adaptation. We position local adaptivity as an extension to existing
systems.
Using the STAMP benchmarks, we compare adaptSTM to two other STM
libraries, TL2 and tinySTM. Comparing adaptSTM and the adaptation system
to TL2 results in an average speedup of 43% for 8 threads and 137% for
16 threads. adaptSTM offers performance that is competitive with tinySTM
for low-contention benchmarks; for high-contention benchmarks adaptSTM
outperforms tinySTM.
Thread-local adaptation alone increases performance on average by 4.3%
for 16 threads, and up to 10% for individual benchmarks, compared to
adaptSTM without active adaptation.
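The per-thread adaptation loop can be sketched in a few lines. This is my own toy illustration of the idea, not the adaptSTM code (names and the threshold are illustrative): each thread samples its own commit/abort counters and grows its write-set hash table when the abort rate crosses a threshold.

```python
# Per-thread adaptation sketch: resize the write-set hash from runtime
# commit/abort statistics.
def adapt_hash_size(commits: int, aborts: int, current_size: int,
                    abort_threshold: float = 0.2) -> int:
    """Double the per-thread write-set hash when aborts dominate."""
    total = commits + aborts
    if total == 0 or aborts / total <= abort_threshold:
        return current_size
    return current_size * 2   # fewer collisions -> fewer false conflicts
```

Because each thread adapts from its own counters, no global synchronization is needed, which is why local adaptation can beat system-wide adaptation.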
Session 5: Simulation and Modeling (David Murrell, Freescale)
Scalable, accurate NoC simulation for the 1000-core era
(Mieszko Lis, Omer Khan, MIT)
Yet another cycle-accurate instruction simulator.
A Single-Specification Principle for Functional-to-Timing Simulator
(David A. Penry, Brigham Young University)
Designing simulators. Problem: depending on the level of information
needed, there is a huge performance difference between simulators.
Idea: define a high-level interface and automatically generate low-level
interfaces that offer faster simulation.
WiLIS: Architectural Modeling of Wireless Systems
(Kermin Fleming, Man Cheuk Ng, Sam Gross, Arvind, MIT)
Simulator for wireless protocols implemented in hardware (FPGA) for
better/more accurate analysis.
Detecting Race Conditions in Asynchronous DMA Operations with
Full-System Simulation (Michael Kistler, Daniel Brokenshire IBM)
Using heavy-weight simulation helps in finding DMA bugs on architectures
like Cell that rely on explicit, software-managed memory instead of
transparent caches. This work can also be used for the analysis of cache
protocols. (Great talk!)
Mechanistic-Empirical Processor Performance Modeling for Constructing
CPI Stacks on Real Hardware
(Stijn Eyerman, Kenneth Hoste, Lieven Eeckhout, Ghent University)
Analyze different types of architectures and compare performance and
different HW features.
Session 6: Power and Reliability (Bronis de Supinski, LLNL)
Power Signature Analysis of the SPECpower_ssj2008 Benchmark
(Chung-Hsing Hsu, Stephen W. Poole, ORNL)
Use many available measurements and analyze the signatures to develop a
better predictor for different CPU models.
Analyzing Throughput of GPGPUs Exploiting Within-Die Core-to-Core
(Jung Seob Lee, Nam Sung Kim, University of Wisconsin, Madison)
Scaling of HW down to very small structures leads to new problems and
variation between the cores on a die.
Universal Rules Guided Design Parameter Selection for Soft Error
Resilient Processors
(Lide Duan, Ying Zhang, Bin Li, Lu Peng, LSU)
Reduce soft errors in processors through an analysis of architectural
design parameters.
A Dynamic Energy Management in Multi-Tier Data Centers
(Seung-Hwan Lim, Bikash Sharma, Byung Chul Tak, Chita R. Das, The
Pennsylvania State University)
How to save energy in data centers.