SC²S Colloquium - July 23, 2013

Date: July 23, 2013
Room: 02.07.023
Time: 2:30 pm, s.t.

Roman Karlstetter: Parameter-Optimization and Parallelization of a Regressor based on spatially-adaptive Sparse Grids for huge Datasets

Extracting knowledge from vast datasets is a central challenge in data-driven applications today. Sparse grids provide a numerical method for both classification and regression in data mining that scales only linearly in the number of data points and is thus well suited for huge amounts of data. On top of an existing parallelization for multi-core CPUs with vector units, GPUs, and hybrid systems, an MPI version for compute clusters was developed within SGpp. Several communication strategies were developed and evaluated with strong-scaling benchmarks on various clusters. Furthermore, a performance model for two of the strategies was derived, using microbenchmarks to estimate the expected communication times. In addition, the existing codebase was refactored to be more modular, which makes it easy to port existing vectorizations to the cluster version.
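As a minimal illustration of the data parallelism behind such strong-scaling runs (a hypothetical sketch, not SGpp's actual API): each MPI rank is assigned a contiguous block of the training data, with the remainder spread over the first ranks, and partial results are later combined, e.g. with `MPI_Allreduce`.

```c
#include <stddef.h>

/* Hypothetical block partitioning of n data points across `size` MPI ranks.
 * Rank r gets a contiguous chunk; the first (n % size) ranks take one extra
 * point so the remainder is distributed evenly. */
void partition(size_t n, int rank, int size, size_t *start, size_t *count) {
    size_t base = n / (size_t)size;   /* minimum chunk per rank */
    size_t rem  = n % (size_t)size;   /* leftover points */
    size_t r    = (size_t)rank;
    *count = base + (r < rem ? 1 : 0);
    *start = r * base + (r < rem ? r : rem);
}
```

With 10 data points on 4 ranks this yields chunks of sizes 3, 3, 2, 2 starting at offsets 0, 3, 6, 8, so every point is covered exactly once regardless of divisibility.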

Wolfgang Hölzl: Vectorization and GPGPU-Acceleration of an augmented Riemann solver for the shallow water equations

This work presents results of the vectorization and CUDA optimization of an augmented Riemann solver for the shallow water equations over variable topography with steady states and inundation. The underlying solver was presented by David L. George; scalar versions are implemented in ClawPack and in the SWE Teaching Code developed at TUM. The vectorization is done with SSE4.1 and AVX1 intrinsics and can be adapted to wider vector registers in the future. The thesis explains how this vectorization was accomplished, with particular attention to the handling of divergent branches within the control flow, as well as the macro-based approach, which eases the adaptation to the wider vector units of future architectures. Furthermore, the solver has been implemented in CUDA to take advantage of the parallelization capabilities of modern GPGPUs. Performance results are given; they show speedups close to the achievable peak, as well as the limits of Sandy Bridge CPUs with respect to AVX. Finally, an outline is given of how to adapt the vectorization to the Intel MIC architecture.
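The standard SIMD treatment of divergent branches, which the thesis applies, is to evaluate both sides of a conditional for all vector lanes and then select per lane with a mask. The sketch below (an illustrative example, not the thesis code; the wet/dry criterion is simplified) shows this for a wet/dry-cell style condition using only SSE2 intrinsics, composing the blend from and/andnot/or; SSE4.1's `_mm_blendv_ps` does the same in one instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Per-lane select: where a mask lane is all-ones take a, else take b. */
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

/* Branch-free sketch: zero the flux in dry cells (water depth h <= 0).
 * Both "branches" (flux and zero) exist for every lane; the comparison
 * mask picks the right one, so the control flow never diverges.
 * n is assumed to be a multiple of 4 for brevity. */
void flux_wet_dry(const float *h, const float *flux_in, float *flux_out, int n) {
    const __m128 zero = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 depth = _mm_loadu_ps(h + i);
        __m128 flux  = _mm_loadu_ps(flux_in + i);
        __m128 wet   = _mm_cmpgt_ps(depth, zero);  /* lane mask: h > 0 */
        _mm_storeu_ps(flux_out + i, select_ps(wet, flux, zero));
    }
}
```

Wrapping the lane width and the load/compare/select intrinsics in macros, as the thesis does, lets the same kernel be retargeted to AVX (8 lanes) or wider units by swapping the macro definitions.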