Weekly outline

  • General

    The seminar covers optimization and parallelization techniques for modern multi- and manycore systems. The topics are chosen from interesting contemporary problems in High Performance Computing on modern hardware like multicore processors, accelerators (e.g., GPGPUs), and clusters.

    Lecturer: Prof. G. Wellein, Martensstr. 1, Room 01.131. Phone -28136
    Location: 2.037 (e-Studio), Martensstr. 1, 2nd floor
    Time: Tuesday 8:30-10:00

    Either 2.5 or 5 ECTS credits will be granted, depending on whether the student gives one or two talks. In either case, a written seminar report is mandatory.

    Possible topics:

      1. Seven-point stencil smoothers on Xeon Phi: Implementation, optimization, and performance modeling.
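
        A minimal OpenMP baseline of such a smoother (sketched here as a damped-Jacobi variant; array names, grid size, and the damping factor are placeholders, not part of the assignment) could serve as the starting point for optimization and modeling on Xeon Phi:

          #include <cstddef>
          #include <vector>

          // One damped-Jacobi sweep of the 7-point stencil on an N^3 grid
          // (interior points only); "in" and "out" are flat arrays of length N*N*N.
          void smooth(const std::vector<double>& in, std::vector<double>& out,
                      int N, double omega)
          {
              auto idx = [N](int i, int j, int k) { return (std::size_t(i) * N + j) * N + k; };
              #pragma omp parallel for schedule(static)
              for (int i = 1; i < N - 1; ++i)
                  for (int j = 1; j < N - 1; ++j)
                      for (int k = 1; k < N - 1; ++k)
                          out[idx(i,j,k)] = (1.0 - omega) * in[idx(i,j,k)]
                                          + (omega / 6.0) * ( in[idx(i-1,j,k)] + in[idx(i+1,j,k)]
                                                            + in[idx(i,j-1,k)] + in[idx(i,j+1,k)]
                                                            + in[idx(i,j,k-1)] + in[idx(i,j,k+1)] );
          }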

      2. Relaxed thread synchronization for multi-core architectures: Effect of replacing OpenMP barriers by relaxed synchronization constructs, e.g., locks (see the sketch after this list). Target architectures: 8-core Sandy Bridge; AMD Interlagos; Intel Xeon Phi. Applications/kernels:

        1. Gauß-Seidel

        2. SIP Solver (Strongly Implicit Procedure after Stone)
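
        To make the idea concrete, here is a minimal sketch of one possible relaxed construct: a per-row point-to-point synchronization replacing a global barrier in a pipelined Gauss-Seidel sweep. C++11 atomics are used only for a portable illustration; the seminar work would compare the actual OpenMP constructs (barriers, locks, etc.) on the target architectures. All names and the column-block partitioning are assumptions of this sketch.

          #include <atomic>
          #include <vector>
          #include <omp.h>

          // Pipelined Gauss-Seidel sweep over an n x n grid, columns split among
          // threads. Instead of a global "#pragma omp barrier" after every row,
          // each thread only waits until its left neighbour has finished that row.
          void gs_sweep(std::vector<double>& u, int n)
          {
              const int nt = omp_get_max_threads();
              std::vector<std::atomic<int>> row_done(nt);
              for (auto& r : row_done) r.store(0);

              #pragma omp parallel
              {
                  const int t   = omp_get_thread_num();
                  const int jlo = 1 + t * (n - 2) / nt;        // my column block [jlo, jhi)
                  const int jhi = 1 + (t + 1) * (n - 2) / nt;

                  for (int i = 1; i < n - 1; ++i) {
                      // relaxed synchronization: spin until the left neighbour is done with row i
                      if (t > 0)
                          while (row_done[t - 1].load(std::memory_order_acquire) < i) { }

                      for (int j = jlo; j < jhi; ++j)
                          u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                               + u[i * n + j - 1]   + u[i * n + j + 1]);

                      row_done[t].store(i, std::memory_order_release);  // signal progress
                  }
              }
          }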

      3. Speeding up the DUNE framework by SIMD vectorization. Evaluate potential SIMD vectorization strategies for selected DUNE kernels with high computational requirements (a short SIMD intrinsics sketch follows the sub-list below).

        1. Performance model (instruction throughput / cache bandwidth)

        2. Determine guidelines for code generators

        3. Speed-up using single precision (SP) computations

        4. Target architectures I: Intel Sandy Bridge / AMD Interlagos

        5. Target architectures II: GPGPUs+OpenCL (Porting and optimization for kernels only + performance projection for complete code if those kernels are accelerated)
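
        As a starting point for the SP-versus-DP question, a minimal AVX intrinsics sketch (a generic axpy kernel assumed for illustration, not actual DUNE code) showing that single precision doubles the number of elements per 256-bit register on Sandy Bridge:

          #include <immintrin.h>
          #include <cstddef>

          // y[i] += a * x[i], double precision: 4 elements per 256-bit AVX register
          void daxpy_avx(double a, const double* x, double* y, std::size_t n)
          {
              __m256d va = _mm256_set1_pd(a);
              for (std::size_t i = 0; i + 4 <= n; i += 4) {
                  __m256d vy = _mm256_loadu_pd(y + i);
                  __m256d vx = _mm256_loadu_pd(x + i);
                  _mm256_storeu_pd(y + i, _mm256_add_pd(vy, _mm256_mul_pd(va, vx)));
              }
              for (std::size_t i = n & ~std::size_t(3); i < n; ++i) y[i] += a * x[i];  // remainder
          }

          // the same kernel in single precision: 8 elements per register
          void saxpy_avx(float a, const float* x, float* y, std::size_t n)
          {
              __m256 va = _mm256_set1_ps(a);
              for (std::size_t i = 0; i + 8 <= n; i += 8) {
                  __m256 vy = _mm256_loadu_ps(y + i);
                  __m256 vx = _mm256_loadu_ps(x + i);
                  _mm256_storeu_ps(y + i, _mm256_add_ps(vy, _mm256_mul_ps(va, vx)));
              }
              for (std::size_t i = n & ~std::size_t(7); i < n; ++i) y[i] += a * x[i];  // remainder
          }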

      4. ILBDC (lattice-Boltzmann) kernel:

        1. GPGPU implementation in OpenCL/CUDA

        2. Evaluate impact of list ordering

      5. ILBDC (lattice-Boltzmann) kernel:

        1. SIMD Vectorized TRT/MRT kernel

        2. Code generator for automatic SIMD vectorization

      6. Evaluation of the OpenACC directives on CRAY XE6@HLRS. OpenACC aims to standardize the use of compiler directives for programming accelerator devices such as GPGPUs. It is available, e.g., on recent CRAY supercomputers like the HERMIT system at HLRS Stuttgart (an OpenACC Jacobi sketch follows the list below).

        1. STREAM benchmarks

        2. Jacobi solver

        3. spMVM
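
        For orientation, a minimal OpenACC sketch of the Jacobi kernel (grid size, variable names, and the scaled right-hand side f are placeholders); the seminar task would benchmark such directive-based versions against OpenMP and CUDA/OpenCL implementations:

          // "sweeps" Jacobi iterations on an n x n grid, offloaded via OpenACC;
          // the data region keeps the arrays resident on the accelerator so that
          // only the final result is copied back to the host.
          void jacobi(double* u, double* unew, const double* f, int n, int sweeps)
          {
              #pragma acc data copy(u[0:n*n]) create(unew[0:n*n]) copyin(f[0:n*n])
              {
                  for (int s = 0; s < sweeps; ++s) {
                      #pragma acc parallel loop collapse(2) present(u[0:n*n], unew[0:n*n], f[0:n*n])
                      for (int i = 1; i < n - 1; ++i)
                          for (int j = 1; j < n - 1; ++j)
                              unew[i*n + j] = 0.25 * ( u[(i-1)*n + j] + u[(i+1)*n + j]
                                                     + u[i*n + j - 1] + u[i*n + j + 1]
                                                     - f[i*n + j] );   // f holds h^2 * rhs
                      #pragma acc parallel loop collapse(2) present(u[0:n*n], unew[0:n*n])
                      for (int i = 1; i < n - 1; ++i)
                          for (int j = 1; j < n - 1; ++j)
                              u[i*n + j] = unew[i*n + j];
                  }
              }
          }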

      7. Iterative solvers for sparse matrix problems: Implementing a sparse CG (conjugate gradient) solver on GPUs and CPUs with OpenMP and CUDA/OpenCL. Comparison of CPU and GPU performance, performance analysis, and modeling of the complete solver. Suitable real-world test problems from physics will be available.
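
        The central building block of such a solver is the sparse matrix-vector multiply; a minimal CRS/CSR version with OpenMP (array names are placeholders) is shown below, with the CUDA/OpenCL ports and the performance model to be built on top of it:

          #include <vector>

          // y = A * x for a matrix stored in CRS format:
          // val - nonzero values, col - their column indices,
          // rowPtr[r] .. rowPtr[r+1] delimit the nonzeros of row r.
          void spmvm_crs(const std::vector<double>& val,
                         const std::vector<int>&    col,
                         const std::vector<int>&    rowPtr,
                         const std::vector<double>& x,
                         std::vector<double>&       y)
          {
              const int nrows = static_cast<int>(rowPtr.size()) - 1;
              #pragma omp parallel for schedule(static)
              for (int r = 0; r < nrows; ++r) {
                  double tmp = 0.0;
                  for (int j = rowPtr[r]; j < rowPtr[r + 1]; ++j)
                      tmp += val[j] * x[col[j]];
                  y[r] = tmp;
              }
          }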

      8. Implementation of selected algorithms using the Global Arrays toolkit:

        a) Jacobi and/or Gauss-Seidel (pipelined parallel processing)
        b) 3D Lattice-Boltzmann flow solver

      9. Strongly Implicit Procedure (SIP) solver after Stone

        a) OpenMP implementation from scratch. The goal is to provide a benchmark code that can be used in the RRZE benchmark suite.
        b) Implementation in CoArray Fortran and benchmark testing with different compilers and environments (LiMa, Cray XE6).

      10. Asynchronous MPI communication: Using explicit threading ("task mode") to overlap communication and computation in different solvers (see the sketch after this list).

        a) MPI-parallel Lattice-Boltzmann flow solver on CPUs
        b) MPI-parallel Jacobi solver on GPUs
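
        A minimal sketch of the "task mode" pattern under MPI_THREAD_FUNNELED (buffer names and the work routines are placeholders): the OpenMP master thread drives the halo exchange while the remaining threads update the interior, and the boundary cells are updated once both are done.

          #include <mpi.h>
          #include <omp.h>

          // placeholder work routines: the workers share the interior points among
          // themselves; the boundary update needs the received halo
          void compute_interior(double* /*u*/, int /*worker*/, int /*nworkers*/) {}
          void compute_boundary(double* /*u*/) {}

          // One time step in "task mode": thread 0 communicates, the rest compute.
          void timestep(double* u, double* halo_send, double* halo_recv,
                        int halo_len, int left, int right)
          {
              #pragma omp parallel
              {
                  const int tid = omp_get_thread_num();
                  const int nth = omp_get_num_threads();
                  if (tid == 0) {                       // communication "task"
                      MPI_Request req[2];
                      MPI_Irecv(halo_recv, halo_len, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
                      MPI_Isend(halo_send, halo_len, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
                      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                  } else {                              // computation overlaps communication
                      compute_interior(u, tid - 1, nth - 1);
                  }
                  #pragma omp barrier                   // halo and interior are now ready
                  #pragma omp single
                  compute_boundary(u);
              }
          }

          int main(int argc, char** argv)
          {
              int provided;
              MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
              // ... domain setup, neighbour ranks, repeated calls to timestep() ...
              MPI_Finalize();
          }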

      11. Stepanov Test "reloaded": Development of a modern test for the optimization capabilities of compilers, including SIMD vectorization and auto-parallelization (see the sketch after this list):

        a) for C++ (abstractions, expression templates, overloaded operators,...)
        b) for Fortran 95 (array syntax, slices, Fortran pointers, ...)
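
        In the spirit of the original Stepanov benchmark, a minimal C++ sketch that compares a raw summation loop with the same computation behind a trivial abstraction layer (a wrapper class plus std::accumulate); the class name and problem size are arbitrary, and the seminar version would extend this towards expression templates, SIMD, and auto-parallelization:

          #include <cstddef>
          #include <cstdio>
          #include <chrono>
          #include <numeric>
          #include <vector>

          struct Wrapped {                         // trivial abstraction around double
              double v;
              Wrapped(double x = 0.0) : v(x) {}
          };
          inline Wrapped operator+(Wrapped a, Wrapped b) { return Wrapped(a.v + b.v); }

          template <class F>
          double time_ms(F f)                      // crude wall-clock timing helper
          {
              auto t0 = std::chrono::steady_clock::now();
              f();
              auto t1 = std::chrono::steady_clock::now();
              return std::chrono::duration<double, std::milli>(t1 - t0).count();
          }

          int main()
          {
              const std::size_t n = 10000000;
              std::vector<double>  raw(n, 1.0);
              std::vector<Wrapped> wrap(n, Wrapped(1.0));
              double s1 = 0.0; Wrapped s2;

              double t_raw  = time_ms([&] { for (double x : raw) s1 += x; });
              double t_wrap = time_ms([&] { s2 = std::accumulate(wrap.begin(), wrap.end(), Wrapped(0.0)); });

              // identical sums; the ratio of the two times is the abstraction penalty
              std::printf("raw: %.1f ms (sum %.0f)   wrapped: %.1f ms (sum %.0f)\n",
                          t_raw, s1, t_wrap, s2.v);
          }
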
      12. Evaluation of optimization strategies for matrix-matrix multiply on modern processors. Set up an automatic framework that generates unrolling and blocking strategies for matrix-matrix multiplication. Evaluate the efficiency of those strategies and their interaction with compiler optimizations.
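
        For illustration, one hand-written instance of the kernel variants such a framework would generate (block size and loop order are arbitrary choices here; the framework would enumerate them automatically):

          // C += A * B for square n x n row-major matrices, blocked for cache
          // with a fixed block size BS; the innermost loops are what an
          // auto-tuning framework would additionally unroll and vectorize.
          constexpr int BS = 64;   // example block size, to be tuned

          void dgemm_blocked(const double* A, const double* B, double* C, int n)
          {
              for (int ii = 0; ii < n; ii += BS)
                  for (int kk = 0; kk < n; kk += BS)
                      for (int jj = 0; jj < n; jj += BS)
                          for (int i = ii; i < ii + BS && i < n; ++i)
                              for (int k = kk; k < kk + BS && k < n; ++k) {
                                  const double a = A[i * n + k];
                                  for (int j = jj; j < jj + BS && j < n; ++j)
                                      C[i * n + j] += a * B[k * n + j];
                              }
          }
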
      13. Evaluation of short vector sums on modern architectures. Benchmark and evaluate the vector sum on multicore CPUs, GPGPUs, and Xeon Phi. This involves an analysis of the overhead introduced by the necessary reduction and synchronization.
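
        The CPU reference point is the straightforward OpenMP reduction shown below; for short vectors, the fork/join overhead and the final reduction can dominate the runtime, which is exactly the effect to be quantified on the different architectures:

          #include <cstddef>

          // Parallel sum of a (short) vector; the implicit barrier and the
          // reduction at the end of the loop are the overheads of interest.
          double vector_sum(const double* x, std::size_t n)
          {
              double s = 0.0;
              #pragma omp parallel for reduction(+ : s) schedule(static)
              for (std::size_t i = 0; i < n; ++i)
                  s += x[i];
              return s;
          }
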
      14. Evaluation of sorting of a float array. Benchmark, evaluate, and/or implement fast sorting on modern multicore and accelerator architectures. Instead of a full sort, this can also be done for the n-th element selection operation, which is very common in business analytics.
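
        For the selection variant, a serial baseline already exists in the C++ standard library; a minimal usage sketch of std::nth_element (example data only), against which parallel and accelerator implementations would be compared:

          #include <algorithm>
          #include <cstddef>
          #include <cstdio>
          #include <vector>

          int main()
          {
              std::vector<float> data = {7.f, 1.f, 9.f, 3.f, 5.f, 2.f, 8.f};
              const std::size_t n = 3;              // 0-based rank of the wanted element

              // partially reorders "data" so that data[n] is the element that would
              // end up at position n in a fully sorted array (average O(N) time)
              std::nth_element(data.begin(), data.begin() + n, data.end());
              std::printf("element of rank %zu: %g\n", n, data[n]);
          }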

  • 13 October - 19 October

  • 20 October - 26 October

  • 27 October - 2 November

  • 3 November - 9 November

  • 10 November - 16 November

  • 17 November - 23 November

  • 24 November - 30 November

  • 1 December - 7 December

  • 8 December - 14 December

  • 15 December - 21 December

  • 22 December - 28 December

  • 29 December - 4 January

  • 5 January - 11 January

  • 12 January - 18 January

  • 19 January - 25 January