Åbo Akademi University, Information Technology Department
Instructor: Alireza Olama
Due Date: 08/05/2025
Points: 100
Welcome to the second homework assignment of the Parallel Programming course! In Assignment 1, you implemented a naive matrix multiplication using a triple nested loop. In this assignment, you will optimize the performance of your naive implementation using two techniques:
- Cache Optimization via Blocked Matrix Multiplication: Improve data locality to reduce cache misses.
- Parallel Matrix Multiplication using OpenMP: Parallelize the computation across multiple threads.
Your task is to implement both optimizations in the provided C++ main.cpp file, measure their performance, and compare the
wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds
on your Assignment 1 code, so ensure your naive implementation is correct before starting.
Why Cache Optimization?
Modern CPUs rely on cache memory to reduce the latency of accessing data from main memory. Cache memory is faster but smaller, organized in cache lines (typically 64 bytes). When a CPU accesses a memory location, it fetches an entire cache line. Matrix multiplication can suffer from poor performance if memory accesses are not cache-friendly, leading to frequent cache misses.
The naive matrix multiplication (with triple nested loops) accesses memory in a way that may not exploit spatial and temporal locality:
- Spatial Locality: Accessing consecutive memory locations (e.g., elements in the same cache line).
- Temporal Locality: Reusing the same data multiple times while it’s still in the cache.
Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache misses and improving performance.
Blocked Matrix Multiplication Pseudocode
Assume matrices A (m × n), B (n × p), and C (m × p) are stored in row-major order. The blocked matrix multiplication processes submatrices of size `block_size` × `block_size`:
```cpp
// C = A * B
for (ii = 0; ii < m; ii += block_size)
    for (jj = 0; jj < p; jj += block_size)
        for (kk = 0; kk < n; kk += block_size)
            // Process block: C[ii:ii+block_size, jj:jj+block_size] +=
            //   A[ii:ii+block_size, kk:kk+block_size] * B[kk:kk+block_size, jj:jj+block_size]
            for (i = ii; i < min(ii + block_size, m); i++)
                for (j = jj; j < min(jj + block_size, p); j++)
                    for (k = kk; k < min(kk + block_size, n); k++)
                        C[i * p + j] += A[i * n + k] * B[k * p + j];
```

- `block_size`: Chosen to ensure the block fits in the cache (e.g., 32, 64, or 128, depending on the system).
- Outer loops (ii, jj, kk): Iterate over blocks.
- Inner loops (i, j, k): Compute within a block, reusing data in the cache.
Task: Implement the blocked_matmul function in the provided main.cpp. Experiment with different block sizes (e.g.,
16, 32, 64) and report the best performance.
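A minimal sketch of how `blocked_matmul` could look is given below; it follows the pseudocode directly. The signature (row-major `double` arrays with explicit dimensions) is an assumption based on Assignment 1, so adapt it to the declarations in the provided main.cpp.

```cpp
#include <algorithm>  // std::min

// Sketch of a blocked (cache-tiled) multiply: C (m x p) += A (m x n) * B (n x p),
// all matrices in row-major order. Assumes C is zero-initialized by the caller.
// The signature is illustrative; match it to the provided main.cpp.
void blocked_matmul(const double* A, const double* B, double* C,
                    int m, int n, int p, int block_size) {
    for (int ii = 0; ii < m; ii += block_size) {
        for (int jj = 0; jj < p; jj += block_size) {
            for (int kk = 0; kk < n; kk += block_size) {
                const int i_end = std::min(ii + block_size, m);
                const int j_end = std::min(jj + block_size, p);
                const int k_end = std::min(kk + block_size, n);
                // Multiply the current blocks while their data stays resident in cache.
                for (int i = ii; i < i_end; ++i)
                    for (int j = jj; j < j_end; ++j)
                        for (int k = kk; k < k_end; ++k)
                            C[i * p + j] += A[i * n + k] * B[k * p + j];
            }
        }
    }
}
```

Reordering the loops inside a block (e.g., i-k-j instead of i-j-k) can further improve spatial locality on B and is worth experimenting with alongside the block size.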
Why OpenMP?
OpenMP is a portable API for parallel programming in shared-memory systems. It allows you to parallelize loops with
minimal code changes, distributing iterations across multiple threads. In matrix multiplication, the outer loop(s) can
be parallelized, as each element of the output matrix C can be computed independently.
Parallelizing with OpenMP
Use OpenMP to parallelize the outer loop(s) of the naive matrix multiplication. For example, parallelize the loop over rows of C:

```cpp
#pragma omp parallel for
for (i = 0; i < m; i++)
    for (j = 0; j < p; j++)
        for (k = 0; k < n; k++)
            C[i * p + j] += A[i * n + k] * B[k * p + j];
```

- The `#pragma omp parallel for` directive tells OpenMP to distribute iterations of the loop across the available threads.
- Ensure thread safety: since each iteration writes to a distinct element of C, this loop is safe to parallelize without locks.
- Use `omp_get_wtime()` to measure wall clock time for accurate performance comparisons.
Task: Implement the parallel_matmul function in the provided main.cpp using OpenMP. Test with different numbers of
threads (e.g., 2, 4, 8) by setting the environment variable OMP_NUM_THREADS.
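A minimal sketch of what `parallel_matmul` might look like follows; the signature mirrors the blocked sketch above and is an assumption, so align it with the declarations in the provided main.cpp.

```cpp
#include <omp.h>

// Sketch of an OpenMP-parallel multiply: C (m x p) = A (m x n) * B (n x p), row-major.
// Each outer-loop iteration writes only to row i of C, so no locking is needed.
void parallel_matmul(const double* A, const double* B, double* C,
                     int m, int n, int p) {
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < p; ++j) {
            double sum = 0.0;  // private accumulator: one write to C per element
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * p + j];
            C[i * p + j] = sum;
        }
    }
}
```

Accumulating into a local `sum` rather than updating `C[i * p + j]` inside the `k` loop keeps each thread working mostly in registers and reduces traffic on the shared output array.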
For each test case (0 through 9 in the data folder):
- Measure the wall clock time for:
  - Naive matrix multiplication (`naive_matmul`).
  - Cache-optimized matrix multiplication (`blocked_matmul`).
  - Parallel matrix multiplication (`parallel_matmul`).
- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time (a short timing sketch follows the example table below).
- Report the times in a table in your submission README.md, including:
  - Test case number.
  - Matrix dimensions (m × n × p).
  - Wall clock time for each implementation (in seconds).
  - Speedup of the blocked and parallel implementations over the naive implementation.
Example table format:
| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
|---|---|---|---|---|---|---|
| 0 | 512 × 512 × 512 | 2.345 | 0.987 | 0.543 | 2.38× | 4.32× |
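As a reference, here is a minimal timing sketch using `omp_get_wtime()`; `some_matmul` and the helper name are placeholders for whatever main.cpp actually uses.

```cpp
#include <cstdio>
#include <omp.h>

// Placeholder for one of your implementations (naive_matmul, blocked_matmul, parallel_matmul).
void some_matmul(const double* A, const double* B, double* C, int m, int n, int p);

// Times a single call and returns the elapsed wall clock time in seconds.
double time_matmul(const double* A, const double* B, double* C, int m, int n, int p) {
    double start = omp_get_wtime();
    some_matmul(A, B, C, m, n, p);
    double elapsed = omp_get_wtime() - start;
    printf("elapsed: %.6f s\n", elapsed);
    return elapsed;
}
```

The speedup columns are then simply naive_time / blocked_time and naive_time / parallel_time.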
- Continue using row-major order for all matrices, as in Assignment 1.
- Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`).
- Do not use STL containers or smart pointers.
- Use the same input/output format as Assignment 1:
  - Input files: `data/<case>/input0.raw` (matrix A) and `input1.raw` (matrix B).
  - Output file: `data/<case>/result.raw` (matrix C).
  - Reference file: `data/<case>/output.raw` for validation.
- The executable accepts a case number (0–9) as a command-line argument.
- Validate correctness by comparing `result.raw` with `output.raw` for each implementation (a comparison sketch follows this list).
- Use the provided `CMakeLists.txt` to build the project.
- Additional Requirements:
  - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
  - The provided CMake file includes OpenMP support.
- Windows Users:
  - Use CLion or Visual Studio with CMake.
  - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
- Linux/Mac Users:
  - Make sure the gcc compiler is installed (`brew install gcc` on Mac).
  - Configure cmake to use the correct compiler: `cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .`
  - Run `cmake .` to generate a Makefile, then `make`.
- Testing OpenMP:
  - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
  - Test with different thread counts to find the best performance.
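As referenced above, here is a minimal sketch of an element-wise check of a computed matrix against the reference. It assumes both matrices have already been loaded into `double` arrays with your Assignment 1 I/O code; the function name and tolerance are illustrative.

```cpp
#include <cmath>
#include <cstdio>

// Compares a computed m x p matrix against the reference, element by element.
// The tolerance is an arbitrary illustrative choice for floating-point output.
bool matrices_match(const double* result, const double* reference,
                    int m, int p, double tol = 1e-6) {
    for (int i = 0; i < m * p; ++i) {
        if (std::fabs(result[i] - reference[i]) > tol) {
            printf("Mismatch at index %d: %f vs %f\n", i, result[i], reference[i]);
            return false;
        }
    }
    return true;
}
```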
- Fork the Assignment 2 repository (provided separately).
- Clone your fork:
```bash
git clone https://github.com/parallelcomputingabo/Homework-2.git
cd Homework-2
git checkout -b student-name
```

- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`.
- Update `README.md` with your performance results table.
- Commit and push your changes:

```bash
git add .
git commit -m "student-name: Implemented optimized matrix multiplication"
git push origin student-name
```

- Create a pull request from your branch to the base repository’s `main` branch.
- Include a description of your optimizations and any challenges faced.
| Subtask | Points |
|---|---|
| Correct implementation of `blocked_matmul` | 30 |
| Correct implementation of `parallel_matmul` | 30 |
| Accurate performance measurements | 20 |
| Performance results table in README.md | 10 |
| Code clarity, commenting, and organization | 10 |
| Total | 100 |
- Cache Optimization:
  - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
  - Use a block size that balances cache usage without excessive overhead.
- OpenMP:
  - Test with different thread counts to find the optimal number for your system.
  - Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
- Performance Measurement:
  - Run multiple iterations for each test case and report the average time to reduce variability (see the sketch at the end of these tips).
  - Ensure no other heavy processes are running during measurements.
- Debugging:
  - Validate each implementation against `output.raw` to ensure correctness before optimizing.
  - Use small test cases to debug your blocked and parallel implementations.
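As mentioned in the performance-measurement tip, averaging over several runs is a small extension of the earlier timing sketch; `some_matmul`, the helper name, and the run count are illustrative only.

```cpp
#include <omp.h>

// Placeholder for one of your implementations (naive_matmul, blocked_matmul, parallel_matmul).
void some_matmul(const double* A, const double* B, double* C, int m, int n, int p);

// Averages the wall clock time over `runs` repetitions of one implementation.
// If your implementation accumulates into C (C += ...), zero C between runs.
double average_time(const double* A, const double* B, double* C,
                    int m, int n, int p, int runs) {
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        double start = omp_get_wtime();
        some_matmul(A, B, C, m, n, p);
        total += omp_get_wtime() - start;
    }
    return total / runs;
}
```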
Good luck, and enjoy optimizing your matrix multiplication!