Conference remarks: ISPASS 2011

As you might know, I presented at the IEEE ISPASS'11 conference in Austin, Texas, which ran from April 10th to April 12th. If you are interested, you can read my ramblings about the talks below.

Conference information

(David Brooks & Rajeev Balasubramonian)

75+ registrations. 64 submissions in total, 4 PC reviews per paper, acceptance via consensus, 24 accepted papers.

Keynote: The Era of Heterogeneity: Are we prepared?

(Ravi Iyer, Intel)

Shift from client/server to smart devices (tablets, smart phones, ...). Integrate GPU and IP blocks into the CPU core for power efficiency; it's no longer just about cores but also about the accelerators that we integrate into the CPU.

Why heterogeneity? Because the workloads are heterogeneous and one single solution (a general-purpose core) will not work. Small cores scale well and offer good performance per watt; big cores are needed for single-threaded performance. The talk sounds a lot like ISCA'09, where they proposed a heterogeneous architecture with one big core and multiple small cores. Intel's idea: SPECS (Scalability, Programmability, Energy, Compatibility, Scheduling/Management).

Questions for cores and accelerators are: How to mix and prioritize heterogeneous cores? Should all cores have the same ISA (e.g., SSE)? How should we structure the cache architecture? Solution for mixed ISA: use a co-op model and run applications on any core. If we get an unsupported opcode exception on the smaller core, then the OS must move the application to the bigger core. This sounds a little like Albert Noll's VM for the Cell. What about hardware tricks for context switches? (Great talk!)
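To make the co-op model concrete for myself, here is a toy sketch of the trap-and-migrate idea. It is not Intel's design; the classes, opcode names, and the "stay on the big core after migrating" policy are all assumptions for illustration.

```python
# Toy model of the co-op idea: small cores trap on opcodes they lack,
# and the OS reacts by migrating the task to a big core.
# All names (Core, UnsupportedOpcode, run_task) are hypothetical.

class UnsupportedOpcode(Exception):
    """Raised when a core meets an instruction outside its ISA subset."""


class Core:
    def __init__(self, name, supported):
        self.name = name
        self.supported = frozenset(supported)

    def execute(self, opcode):
        if opcode not in self.supported:
            raise UnsupportedOpcode(opcode)
        return f"{self.name}:{opcode}"


def run_task(opcodes, small, big):
    """Start on the small core; migrate to the big core on the first trap."""
    core, log = small, []
    for op in opcodes:
        try:
            log.append(core.execute(op))
        except UnsupportedOpcode:
            core = big  # the "OS" moves the task, then retries the opcode
            log.append(core.execute(op))
    return log


small = Core("small", {"add", "load", "store"})
big = Core("big", {"add", "load", "store", "simd_mul"})
trace = run_task(["add", "simd_mul", "add"], small, big)
print(trace)
```

In this sketch the task never moves back to the small core; a real scheduler would presumably need a policy for that, which is exactly where the scheduling/management question comes in.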

Session 1: Best Paper Nominees (David Christie, AMD)

Characterization and Dynamic Mitigation of Intra-Application Cache Interference (Carole-Jean Wu, Margaret Martonosi, Princeton University)

Intra-application cache interference is a challenging problem. Measure and characterize the cache behavior of applications. The paper uses two-fold measurements: an Intel Nehalem using perfmon2, and Simics/GEMS for an artificial system. Measurements show that system cache lines are usually not reused (source: mostly TLB misses), so these misses pollute the application cache lines.

They propose new cache designs that exploit the fact that cache lines from a system context are not reused as often as cache lines from a user context.
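One way to picture such a design is a replacement policy that inserts system lines at the eviction end of a set. The sketch below is my own illustration of that idea, not the mechanism from the paper; the class name and the insertion policy are assumptions.

```python
from collections import OrderedDict


class ContextAwareSet:
    """One cache set; the first key is the next eviction victim (LRU end)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> context ("user" or "system")

    def access(self, tag, context):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # promote to MRU on reuse
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict from the LRU end
        self.lines[tag] = context
        if context == "system":
            # Assumed policy: system lines are rarely reused, so park them
            # at the LRU end where they are evicted first.
            self.lines.move_to_end(tag, last=False)
        return False


cache = ContextAwareSet(ways=2)
accesses = [("u1", "user"), ("u2", "user"), ("s1", "system"),
            ("s2", "system"), ("u2", "user")]
hits = [cache.access(tag, ctx) for tag, ctx in accesses]
print(hits)
```

With plain LRU the two system fills would push u2 out of this two-way set; with the assumed insertion policy the second system line replaces the first one and the final access to u2 still hits.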

Questions: Measurement done on 64/32b system? Are there differences due to different page placement?

A Semi-Preemptive Garbage Collector for Solid State Drives (Junghee Lee*, Youngjae Kim, Galen M. Shipman, Sarp Oral, Jongman Kim*, Feiyi Wang, Oak Ridge National Laboratory, Georgia Institute of Technology*)

Block replacement strategies and how to cope with flash problems. Implement some form of GC for flash blocks. Fast access speed but performance degradation due to garbage collection. New form of GC inside the SSD.

PRISM: Zooming in Persistent RAM Storage Behavior (Ju-Young Jung, Sangyeun Cho, University of Pittsburgh)

Block-oriented FS for PRAM. Not my field.

Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications (Jeff Diamond, Martin Burtscher, John D. McCalpin, Byoung-Do Kim, Stephen W. Keckler, James C. Browne, University of Texas Austin, Texas State University, and NVIDIA)

Moore's law of supercomputing: scale the number of cores! Motivation for this paper: inter-chip scalability. New compiler optimizations for multi-core CPUs, depending on cache layout and coordination. Use AMD performance counters to measure HPC performance.

The most important performance factors for multi-core are L3 miss rates (cache contention), off-chip bandwidth, and DRAM contention (DRAM page miss rates). (Great talk!)

Session 2: Memory Hierarchies (Suzanne Rivoire, Sonoma State University)

Minimizing Interference through Application Mapping in Multi-Level Buffer Caches (Christina M. Patrick, Nicholas Voshell, Mahmut Kandemir, Pennsylvania State University)

Storage paper that handles a switched network with a complicated node hierarchy. The paper introduces interference predictors for the I/O route through the network and analyzes buffer cache placement. -ETOOMANYFORMULAS (for me)

Analyzing the Impact of Useless Write-Backs on the Endurance and Energy Consumption of PCM Main Memory (Santiago Bock, Bruce R. Childers, Rami G. Melhem, Daniel Mosse, Youtao Zhang, University of Pittsburgh)

20-40% of energy consumption is due to the memory system. Use phase-change memory (PCM) instead of DRAM. Low static power (non-volatile), read performance comparable to DRAM, scales better than DRAM, but high energy cost for writes and limited write endurance. Observation: a write-back is useless if the data is not used again later on. Use application information from the allocator, control flow analysis, or the stack pointer. Focus: how many useless write-backs can be avoided using these metrics? 3 different regions analyzed: heap: use malloc / free; global: control flow analysis; stack: stack pointer.

Questions: Would reducing the number of write-backs make sense for DRAM as well, e.g., for cache coherency in multi-cores? His solution: the application tells the HW which regions are dead / alive.
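For the heap case, the idea can be sketched as a small model in which the allocator reports freed ranges and the cache skips write-backs of dirty lines that lie entirely inside a dead range. This is only my reading of the talk; the interface (on_free, on_malloc, evict) and the 64-byte line size are assumptions, not the authors' design.

```python
class DeadRegionTracker:
    """Tracks heap ranges that the allocator has reported as dead (freed)."""

    def __init__(self):
        self.dead = []  # list of (start, end) half-open address ranges

    def on_free(self, start, size):
        self.dead.append((start, start + size))

    def on_malloc(self, start, size):
        end = start + size
        # Conservative: any dead range overlapping the new block becomes
        # live again. This can only lose savings, never correctness.
        self.dead = [(s, e) for s, e in self.dead if e <= start or s >= end]

    def is_dead(self, addr, line_size=64):
        # The whole cache line must lie inside one dead range.
        return any(s <= addr and addr + line_size <= e for s, e in self.dead)


def evict(addr, dirty, tracker, stats):
    """Count whether an evicted line is written back or safely dropped."""
    if not dirty:
        return
    if tracker.is_dead(addr):
        stats["skipped"] += 1  # useless write-back avoided
    else:
        stats["written"] += 1


tracker = DeadRegionTracker()
stats = {"skipped": 0, "written": 0}
tracker.on_free(0x1000, 256)
evict(0x1040, True, tracker, stats)   # inside the freed block
evict(0x2000, True, tracker, stats)   # live data
tracker.on_malloc(0x1000, 64)         # block is reused
evict(0x1040, True, tracker, stats)   # conservatively treated as live
print(stats)
```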

Access Pattern-Aware DRAM Performance Model for Multi-Core Systems (Hyojin Choi, Jongbok Lee*, Wonyong Sung, Seoul National University, Hansung University*)

Latency between different banks, very low level/HW.

Characterizing Multi-threaded Applications based on Shared-Resource Contention (Tanima Dey, Wei Wang, Jack Davidson, Mary Lou Soffa, University of Virginia)

Check/measure intra-application contention and inter-application contention for L1/L2/Front side bus.

Session 3: Tracing (Tom Wenisch, University of Michigan)

Trace-driven Simulation of Multithreaded Applications (Alejandro Rico*, Alejandro Duran*, Felipe Cabarcas*, Alex Ramirez, Yoav Etsion*, Mateo Valero*, Barcelona Supercomputing Center*, Universitat Politecnica de Catalunya)

How to simulate multi-threaded applications using traces? Capture traces for sequential code sections, capture calls to parops but do not capture the execution of parops. Interesting but not my topic.

Efficient Memory Tracing by Program Skeletonization (Alain Ketterlin, Philippe Clauss, Universite de Strasbourg & INRIA)

We want to get the minimum amount of code to reproduce the memory layout of an application. Instrumentation is expensive but useful as a baseline. To improve from there we need to find loops in binary code, try to recognize patterns, and generate access sequences to remove instrumentation. Work on machine code and find register accesses, e.g., movl %eax, [%ebx, %ecx, 8].

Program skeletonization extracts what is useful to compute the memory addresses.
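The payoff of recognizing loop patterns is easiest to see on the trace itself: a base + index * scale access inside a loop produces a constant-stride run, so the run parameters replace the individual addresses. The sketch below works on an already recorded trace, which is a simplification of mine; the actual tool analyzes the binary, and the function names are hypothetical.

```python
def compress(addresses):
    """Collapse a trace into (start, stride, count) runs of constant stride."""
    runs = []
    i, n = 0, len(addresses)
    while i < n:
        start = addresses[i]
        if i + 1 == n:
            runs.append((start, 0, 1))  # lone trailing access
            break
        stride = addresses[i + 1] - start
        count = 2
        while (i + count < n
               and addresses[i + count] - addresses[i + count - 1] == stride):
            count += 1
        runs.append((start, stride, count))
        i += count
    return runs


def expand(runs):
    """Regenerate the full address sequence from the runs."""
    return [start + k * stride
            for start, stride, count in runs
            for k in range(count)]


# 100 eight-byte accesses from one loop, then three four-byte accesses.
trace = [0x1000 + 8 * i for i in range(100)] + [0x5000, 0x5004, 0x5008]
runs = compress(trace)
print(runs)
```

The 103 recorded addresses shrink to two runs, and expand(runs) reproduces the original trace exactly.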

Do you also track direct registers (e.g., the address computation happens before)? You decouple the memory recording and the application, so recording happens with loose correlation to the application. How do you handle threads/concurrent memory accesses? Exceptions? (Great talk!)

Portable Trace Compression through Instruction Interpretation (Svilen Kanev, Robert Cohn*, Harvard University, Intel*)

If you are reliably able to predict a byte stream, you do not need to record it.
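A minimal sketch of prediction-based compression: compressor and decompressor run the same deterministic predictor, and only the mispredicted bytes are stored. Note that I swapped in a simple stride predictor here; the paper predicts by interpreting instructions, which this toy does not attempt.

```python
def predict(history):
    """Deterministic stride predictor shared by both sides (an assumption)."""
    if len(history) < 2:
        return 0
    return (2 * history[-1] - history[-2]) & 0xFF


def compress(data):
    """Keep only the (index, byte) pairs that the predictor gets wrong."""
    misses, history = [], []
    for i, b in enumerate(data):
        if predict(history) != b:
            misses.append((i, b))
        history.append(b)
    return len(data), misses


def decompress(length, misses):
    """Replay the predictor and patch in the recorded mispredictions."""
    fixes, out = dict(misses), []
    for i in range(length):
        out.append(fixes[i] if i in fixes else predict(out))
    return bytes(out)


data = bytes(range(0, 200, 2))  # a perfectly strided stream
length, misses = compress(data)
print(length, misses)
restored = decompress(length, misses)
```

For the strided stream only one byte needs to be recorded (the predictor has to learn the stride once); unpredictable data degrades to storing almost every byte, so the quality of the predictor is everything.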

Reception & Poster Session

VMAD: A Virtual Machine for Advanced Dynamic Analysis of Programs (Alexandra Jimborean, Matthieu Hermann*, Vincent Loechner, Philippe Clauss, INRIA, Universite Strasbourg*)

Interesting work on LLVM that adds different alternatives and tries reverse compilation to turn, e.g., while loops into for loops and adaptively optimize them (for C/C++ code). Maybe forward her Oliver's work.

Performance Characterization of Mobile-Class Nodes: Why Fewer Bits is Better

(Michelle McDaniel, Kim Hazelwood, University of Virginia)

For netbooks, 32-bit code is faster than 64-bit code. What kind of GCC settings did you use? Mention Acovea. Also, her master's thesis is about padding; give her a pointer to my work.

Keynote II: Integrated Modeling Challenges in Extreme-Scale Computing

(Pradip Bose, IBM)

Exa-scale computing is 10^18 operations per second, which is 1000x peta-scale computing (10^15). What is the wall: power or reliability?

Power wall: We need to reduce the power needed in chips; dozens of cores per chip that are allowed to use 1/1000 of the power. Idea: different processing modes. Storage mode: turn off parallel cores; computing mode: turn off storage controllers and I/O. Reliability wall: MTTF and reliability drop with the increased number of transistors. Problem: with millions of cores/CPUs, MTTF is so low that supercomputers are not even able to complete Linpack benchmarks between failures. Key ratio: MTTR/MTTF (mean time to repair vs. mean time to failure).
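A back-of-the-envelope model shows why the reliability wall bites. It assumes independent, exponentially distributed node failures and a system that fails as soon as any node fails; the 5-year node MTTF and the 30-minute repair time are made-up numbers, not figures from the talk.

```python
HOURS_PER_YEAR = 24 * 365


def system_mttf_hours(node_mttf_hours, nodes):
    """MTTF of a system that fails as soon as any one node fails.

    Assumes independent, exponentially distributed node failures.
    """
    return node_mttf_hours / nodes


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


node_mttf = 5 * HOURS_PER_YEAR  # hypothetical: 5 years per node
for nodes in (1_000, 100_000, 1_000_000):
    mttf = system_mttf_hours(node_mttf, nodes)
    print(f"{nodes:>9} nodes: MTTF {mttf * 60:8.1f} min, "
          f"availability {availability(mttf, 0.5):.3f}")
```

Under these assumptions a million nodes give a system MTTF of only a few minutes, far shorter than a Linpack run, and the MTTR/MTTF ratio decides how much useful time is left.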

Session 4: Emerging Workloads (Derek Chiou, UT Austin)

Where is the Data? Why you Cannot Debate CPU vs. GPU Performance Without the Answer (Chris Gregg, Kim Hazelwood, University of Virginia)

GPU computation is fast but data transfer from/to the GPU is a bottleneck. GPU speedup is misleading without describing the data transfer necessities.
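The point is simple arithmetic: the speedup that matters divides CPU time by kernel time plus both transfers. The timings below are invented for illustration only.

```python
def end_to_end_speedup(cpu_s, kernel_s, to_gpu_s=0.0, from_gpu_s=0.0):
    """CPU time divided by total GPU time, including both data copies."""
    return cpu_s / (to_gpu_s + kernel_s + from_gpu_s)


# Hypothetical timings in seconds, chosen only for illustration.
kernel_only = end_to_end_speedup(cpu_s=1.0, kernel_s=0.05)
with_copies = end_to_end_speedup(cpu_s=1.0, kernel_s=0.05,
                                 to_gpu_s=0.1, from_gpu_s=0.1)
print(f"kernel only: {kernel_only:.1f}x, with transfers: {with_copies:.1f}x")
```

With these numbers a headline speedup of about 20x shrinks to about 4x once the copies are counted, which is why a speedup number without the data location tells you very little.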

Questions: What about algorithms with dual-use approach where the CPU does not idle during kernel? What about compression? (Great talk!)

Accelerating Search and Recognition Workloads with SSE 4.2 String and Text Processing Instructions (Guangyu Shi, Min Li, Mikko Lipasti, UW-Madison)

STTNI can be used to implement a broad set of search and recognition applications; embrace newly available instructions to speed up classical algorithms. pcmpestri: packed compare explicit length strings, return index. The new instructions can be used for any data comparisons. Depending on the data structure, different algorithms are needed. Easy for arrays; tree structures need some B-Tree and similar handling as strings; hash tables are more complicated, but collisions can be resolved with STTNI.
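To remind myself what pcmpestri computes, here is a pure-Python model of its "equal any" mode on byte elements, returning the least significant matching index. It only models the semantics as I understand them (16 means no match in this block); it is obviously not fast, and the helper names are hypothetical.

```python
def pcmpestri_equal_any(charset, text):
    """Index of the first byte of text[:16] that occurs in charset[:16].

    Returns 16 when no byte matches, mirroring the byte-element case.
    """
    chunk, wanted = text[:16], set(charset[:16])
    for i, byte in enumerate(chunk):
        if byte in wanted:
            return i
    return 16


def find_any(charset, text):
    """strpbrk-like scan that steps through the text 16 bytes at a time."""
    for base in range(0, len(text), 16):
        idx = pcmpestri_equal_any(charset, text[base:base + 16])
        if idx < 16:
            return base + idx
    return -1


position = find_any(b",;", b"abcdefghijklmnopqrstuvwxyz;rest")
print(position)
```

In real code each 16-byte step would be a single instruction, which is where the speedup for delimiter scanning and similar searches comes from.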

What about aligned loads, or loop unrolling for this code? Example was a single static loop that did use unaligned loads (expensive) and no manual loop unrolling. Speaker only compared to GCC, not ICC or other compilers.

A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm (Zhenman Fang, Weihua Zhang, Haibo Chen, Binyu Zang, Fudan University)

You shall not use Comic, Sans Serif, Courier, and Serif fonts on one slide!

Performance Evaluation of Adaptivity in Transactional Memory (Mathias Payer, Thomas R. Gross, ETH Zurich)

My talk. See: https://nebelwelt.net/publications/files/11ISPASS.pdf

Transactional memory (TM) is an attractive platform for parallel programs, and several software transactional memory (STM) designs have been presented. We explore and analyze several optimization opportunities to adapt STM parameters to a running program. This paper uses adaptSTM, a flexible STM library with a non-adaptive baseline common to current fast STM libraries to evaluate different performance options. The baseline is extended by an online evaluation system that enables the measurement of key runtime parameters like read- and write-locations, or commit- and abort-rate. The performance data is used by a thread-local adaptation system to tune the STM configuration. The system adapts different important parameters like write-set hash-size, hash-function, and write strategy based on runtime statistics on a per-thread basis.

We discuss different self-adapting parameters, especially their performance implications and the resulting trade-offs. Measurements show that local per-thread adaptation outperforms global system-wide adaptation. We position local adaptivity as an extension to existing systems.

Using the STAMP benchmarks, we compare adaptSTM to two other STM libraries, TL2 and tinySTM. Comparing adaptSTM and the adaptation system to TL2 results in an average speedup of 43% for 8 threads and 137% for 16 threads. adaptSTM offers performance that is competitive with tinySTM for low-contention benchmarks; for high-contention benchmarks adaptSTM outperforms tinySTM.

Thread-local adaptation alone increases performance on average by 4.3% for 16 threads, and up to 10% for individual benchmarks, compared to adaptSTM without active adaptation.
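As a rough illustration of thread-local adaptation, the sketch below tunes a single parameter, the write-set hash size, from per-thread commit statistics. The window length, load threshold, and doubling rule are invented for this example and are not the policy implemented in adaptSTM.

```python
class ThreadLocalAdapter:
    """Per-thread tuner for one parameter: the write-set hash table size.

    Thresholds and the doubling rule are illustrative assumptions.
    """

    def __init__(self, hash_size=64, max_size=4096, window=100, max_load=0.5):
        self.hash_size = hash_size
        self.max_size = max_size
        self.window = window      # commits per measurement window
        self.max_load = max_load  # tolerated average table load
        self.commits = 0
        self.writes = 0

    def on_commit(self, write_locations):
        """Record one committed transaction and adapt at window boundaries."""
        self.commits += 1
        self.writes += write_locations
        if self.commits < self.window:
            return
        avg_writes = self.writes / self.commits
        if (avg_writes / self.hash_size > self.max_load
                and self.hash_size < self.max_size):
            self.hash_size *= 2   # table too crowded: grow it
        self.commits = self.writes = 0


# Each thread would own one adapter; here two threads are simulated in turn.
writer_heavy, writer_light = ThreadLocalAdapter(), ThreadLocalAdapter()
for _ in range(200):
    writer_heavy.on_commit(40)  # large write sets
    writer_light.on_commit(4)   # small write sets
print(writer_heavy.hash_size, writer_light.hash_size)
```

Because every thread keeps its own statistics, the thread with large write sets grows its table while the other keeps the small default, which is the intuition behind local adaptation beating a single global setting.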

Session 5: Simulation and Modeling (David Murrell, Freescale)

Scalable, accurate NoC simulation for the 1000-core era (Mieszko Lis, Omer Khan, MIT)

Yet another cycle-accurate instruction simulator.

A Single-Specification Principle for Functional-to-Timing Simulator Interface Design (David A. Penry, Brigham Young University)

Designing simulators. Problem: depending on the level of information that is needed, there is a huge performance difference between simulators. Idea: define a high-level interface and automatically generate low-level interfaces that offer faster simulation.

WiLIS: Architectural Modeling of Wireless Systems (Kermin Fleming, Man Cheuk Ng, Sam Gross, Arvind, MIT)

Simulator for wireless protocols implemented in hardware (FPGA) for better/more accurate analysis.

Detecting Race Conditions in Asynchronous DMA Operations with Full-System Simulation (Michael Kistler, Daniel Brokenshire, IBM)

Using heavy-weight simulation helps in finding DMA bugs for light cache protocols like Cell that have no explicit cache management. This work can also be used for the analysis of cache protocols. (Great talk!)

Mechanistic-Empirical Processor Performance Modeling for Constructing CPI Stacks on Real Hardware (Stijn Eyerman, Kenneth Hoste, Lieven Eeckhout, Ghent University)

Analyze different types of architectures and compare performance and different HW features.

Session 6: Power and Reliability (Bronis de Supinski, LLNL)

Power Signature Analysis of the SPECpower_ssj2008 Benchmark (Chunghsing Hsu, Stephen W. Poole, ORNL)

Use many available measurements and analyze the signatures to develop a better predictor for different CPU models.

Analyzing Throughput of GPGPUs Exploiting Within-Die Core-to-Core Frequency Variation (Jung Seob Lee, Nam Sung Kim, University of Wisconsin, Madison)

Scaling of HW down to very small structures leads to new problems and characteristics.

Universal Rules Guided Design Parameter Selection for Soft Error Resilient Processors (Lide Duan, Ying Zhang, Bin Li, Lu Peng, LSU)

Reduce soft errors in processors through an analysis of architectural weaknesses.

A Dynamic Energy Management in Multi-Tier Data Centers (Seung-Hwan Lim, Bikash Sharma, Byung Chul Tak, Chita R. Das, The Pennsylvania State University)

How to save energy in data centers.

Final remarks

Jeff Diamond won the best paper award. No other remarks.