Åbo Akademi University, Information Technology Department
Instructor: Alireza Olama
Due Date: 08/05/2025
Points: 100
Welcome to the second homework assignment of the Parallel Programming course! In Assignment 1, you implemented a naive matrix multiplication using a triple nested loop. In this assignment, you will optimize the performance of your naive implementation using two techniques:
- Cache Optimization via Blocked Matrix Multiplication: Improve data locality to reduce cache misses.
- Parallel Matrix Multiplication using OpenMP: Parallelize the computation across multiple threads.
Your task is to implement both optimizations in the provided C++ main.cpp file, measure their performance, and compare the
wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds
on your Assignment 1 code, so ensure your naive implementation is correct before starting.
Why Cache Optimization?
Modern CPUs rely on cache memory to reduce the latency of accessing data from main memory. Cache memory is faster but smaller, organized in cache lines (typically 64 bytes). When a CPU accesses a memory location, it fetches an entire cache line. Matrix multiplication can suffer from poor performance if memory accesses are not cache-friendly, leading to frequent cache misses.
The naive matrix multiplication (with triple nested loops) accesses memory in a way that may not exploit spatial and temporal locality:
- Spatial Locality: Accessing consecutive memory locations (e.g., elements in the same cache line).
- Temporal Locality: Reusing the same data multiple times while it’s still in the cache.
Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache misses and improving performance.
Blocked Matrix Multiplication Pseudocode
Assume matrices A (m × n), B (n × p), and C (m × p) are stored in row-major order. The blocked matrix multiplication processes submatrices of size `block_size` × `block_size`:
```cpp
// C = A * B
for (ii = 0; ii < m; ii += block_size)
    for (jj = 0; jj < p; jj += block_size)
        for (kk = 0; kk < n; kk += block_size)
            // Process block: C[ii:ii+block_size, jj:jj+block_size] +=
            //   A[ii:ii+block_size, kk:kk+block_size] * B[kk:kk+block_size, jj:jj+block_size]
            for (i = ii; i < min(ii + block_size, m); i++)
                for (j = jj; j < min(jj + block_size, p); j++)
                    for (k = kk; k < min(kk + block_size, n); k++)
                        C[i * p + j] += A[i * n + k] * B[k * p + j];
```

- `block_size`: Chosen to ensure the block fits in the cache (e.g., 32, 64, or 128, depending on the system).
- Outer loops (ii, jj, kk): Iterate over blocks.
- Inner loops (i, j, k): Compute within a block, reusing data in the cache.
Task: Implement the blocked_matmul function in the provided main.cpp. Experiment with different block sizes (e.g.,
16, 32, 64) and report the best performance.
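A minimal sketch of how `blocked_matmul` could look is given below; it follows the pseudocode directly. The signature (row-major `double` arrays with explicit dimensions) is an assumption based on Assignment 1, so adapt it to the declarations in the provided main.cpp.

```cpp
#include <algorithm>  // std::min

// Sketch of a blocked (cache-tiled) multiply: C (m x p) += A (m x n) * B (n x p),
// all matrices in row-major order. Assumes C is zero-initialized by the caller.
// The signature is illustrative; match it to the provided main.cpp.
void blocked_matmul(const double* A, const double* B, double* C,
                    int m, int n, int p, int block_size) {
    for (int ii = 0; ii < m; ii += block_size) {
        for (int jj = 0; jj < p; jj += block_size) {
            for (int kk = 0; kk < n; kk += block_size) {
                const int i_end = std::min(ii + block_size, m);
                const int j_end = std::min(jj + block_size, p);
                const int k_end = std::min(kk + block_size, n);
                // Multiply the current blocks while their data stays resident in cache.
                for (int i = ii; i < i_end; ++i)
                    for (int j = jj; j < j_end; ++j)
                        for (int k = kk; k < k_end; ++k)
                            C[i * p + j] += A[i * n + k] * B[k * p + j];
            }
        }
    }
}
```

Reordering the loops inside a block (e.g., i-k-j instead of i-j-k) can further improve spatial locality on B and is worth experimenting with alongside the block size.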
Why OpenMP?
OpenMP is a portable API for parallel programming in shared-memory systems. It allows you to parallelize loops with
minimal code changes, distributing iterations across multiple threads. In matrix multiplication, the outer loop(s) can
be parallelized, as each element of the output matrix C can be computed independently.
Parallelizing with OpenMP
Use OpenMP to parallelize the outer loop(s) of the naive matrix multiplication. For example, parallelize the loop over rows of C:

```cpp
#pragma omp parallel for
for (i = 0; i < m; i++)
    for (j = 0; j < p; j++)
        for (k = 0; k < n; k++)
            C[i * p + j] += A[i * n + k] * B[k * p + j];
```

- The `#pragma omp parallel for` directive tells OpenMP to distribute iterations of the loop across the available threads.
- Ensure thread safety: since each iteration writes to a distinct element of C, this loop is safe to parallelize without locks.
- Use `omp_get_wtime()` to measure wall clock time for accurate performance comparisons.
Task: Implement the parallel_matmul function in the provided main.cpp using OpenMP. Test with different numbers of
threads (e.g., 2, 4, 8) by setting the environment variable OMP_NUM_THREADS.
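A minimal sketch of what `parallel_matmul` might look like follows; the signature mirrors the blocked sketch above and is an assumption, so align it with the declarations in the provided main.cpp.

```cpp
#include <omp.h>

// Sketch of an OpenMP-parallel multiply: C (m x p) = A (m x n) * B (n x p), row-major.
// Each outer-loop iteration writes only to row i of C, so no locking is needed.
void parallel_matmul(const double* A, const double* B, double* C,
                     int m, int n, int p) {
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < p; ++j) {
            double sum = 0.0;  // private accumulator: one write to C per element
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * p + j];
            C[i * p + j] = sum;
        }
    }
}
```

Accumulating into a local `sum` rather than updating `C[i * p + j]` inside the `k` loop keeps each thread working mostly in registers and reduces traffic on the shared output array.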
For each test case (0 through 9 in the data folder):
- Measure the wall clock time for:
  - Naive matrix multiplication (`naive_matmul`).
  - Cache-optimized matrix multiplication (`blocked_matmul`).
  - Parallel matrix multiplication (`parallel_matmul`).
- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time (a short timing sketch follows the example table below).
- Report the times in a table in your submission README.md, including:
  - Test case number.
  - Matrix dimensions (m × n × p).
  - Wall clock time for each implementation (in seconds).
  - Speedup of the blocked and parallel implementations over the naive implementation.
Example table format:
| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
|---|---|---|---|---|---|---|
| 0 | 512 × 512 × 512 | 2.345 | 0.987 | 0.543 | 2.38× | 4.32× |
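As a reference, here is a minimal timing sketch using `omp_get_wtime()`; `some_matmul` and the helper name are placeholders for whatever main.cpp actually uses.

```cpp
#include <cstdio>
#include <omp.h>

// Placeholder for one of your implementations (naive_matmul, blocked_matmul, parallel_matmul).
void some_matmul(const double* A, const double* B, double* C, int m, int n, int p);

// Times a single call and returns the elapsed wall clock time in seconds.
double time_matmul(const double* A, const double* B, double* C, int m, int n, int p) {
    double start = omp_get_wtime();
    some_matmul(A, B, C, m, n, p);
    double elapsed = omp_get_wtime() - start;
    printf("elapsed: %.6f s\n", elapsed);
    return elapsed;
}
```

The speedup columns are then simply naive_time / blocked_time and naive_time / parallel_time.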
- Continue using row-major order for all matrices, as in Assignment 1.
- Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`).
- Do not use STL containers or smart pointers.
- Use the same input/output format as Assignment 1:
  - Input files: `data/<case>/input0.raw` (matrix A) and `input1.raw` (matrix B).
  - Output file: `data/<case>/result.raw` (matrix C).
  - Reference file: `data/<case>/output.raw` for validation.
- The executable accepts a case number (0–9) as a command-line argument.
- Validate correctness by comparing `result.raw` with `output.raw` for each implementation (a comparison sketch follows this list).
- Use the provided `CMakeLists.txt` to build the project.
- Additional Requirements:
  - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
  - The provided CMake file includes OpenMP support.
- Windows Users:
  - Use CLion or Visual Studio with CMake.
  - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
- Linux/Mac Users:
  - Make sure the gcc compiler is installed (`brew install gcc` on Mac).
  - Configure cmake to use the correct compiler: `cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .`
  - Run `cmake .` to generate a Makefile, then `make`.
- Testing OpenMP:
  - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
  - Test with different thread counts to find the best performance.
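As referenced above, here is a minimal sketch of an element-wise check of a computed matrix against the reference. It assumes both matrices have already been loaded into `double` arrays with your Assignment 1 I/O code; the function name and tolerance are illustrative.

```cpp
#include <cmath>
#include <cstdio>

// Compares a computed m x p matrix against the reference, element by element.
// The tolerance is an arbitrary illustrative choice for floating-point output.
bool matrices_match(const double* result, const double* reference,
                    int m, int p, double tol = 1e-6) {
    for (int i = 0; i < m * p; ++i) {
        if (std::fabs(result[i] - reference[i]) > tol) {
            printf("Mismatch at index %d: %f vs %f\n", i, result[i], reference[i]);
            return false;
        }
    }
    return true;
}
```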
- Fork the Assignment 2 repository (provided separately).
- Clone your fork:
```bash
git clone https://github.com/parallelcomputingabo/Homework-2.git
cd Homework-2
git checkout -b student-name
```

- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`.
- Update `README.md` with your performance results table.
- Commit and push your changes:

```bash
git add .
git commit -m "student-name: Implemented optimized matrix multiplication"
git push origin student-name
```

- Create a pull request from your branch to the base repository’s `main` branch.
- Include a description of your optimizations and any challenges faced.
| Subtask | Points |
|---|---|
| Correct implementation of `blocked_matmul` | 30 |
| Correct implementation of `parallel_matmul` | 30 |
| Accurate performance measurements | 20 |
| Performance results table in README.md | 10 |
| Code clarity, commenting, and organization | 10 |
| Total | 100 |
- Cache Optimization:
  - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
  - Use a block size that balances cache usage without excessive overhead.
- OpenMP:
  - Test with different thread counts to find the optimal number for your system.
  - Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
- Performance Measurement:
  - Run multiple iterations for each test case and report the average time to reduce variability (see the sketch at the end of these tips).
  - Ensure no other heavy processes are running during measurements.
- Debugging:
  - Validate each implementation against `output.raw` to ensure correctness before optimizing.
  - Use small test cases to debug your blocked and parallel implementations.
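As mentioned in the performance-measurement tip, averaging over several runs is a small extension of the earlier timing sketch; `some_matmul`, the helper name, and the run count are illustrative only.

```cpp
#include <omp.h>

// Placeholder for one of your implementations (naive_matmul, blocked_matmul, parallel_matmul).
void some_matmul(const double* A, const double* B, double* C, int m, int n, int p);

// Averages the wall clock time over `runs` repetitions of one implementation.
// If your implementation accumulates into C (C += ...), zero C between runs.
double average_time(const double* A, const double* B, double* C,
                    int m, int n, int p, int runs) {
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        double start = omp_get_wtime();
        some_matmul(A, B, C, m, n, p);
        total += omp_get_wtime() - start;
    }
    return total / runs;
}
```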
Good luck, and enjoy optimizing your matrix multiplication!