15-418/618 Spring 2020 Exercise 1
The aim of this study is to determine the flexibility and force parameters of sport climbers from among second-grade primary school students who exercise
adsPart of the document
1
Problems
Consider the following code where each line within the function represents a single instruction.
typedef struct{
floatx;
floaty;
} point;
inline voidinnerProduct(point*a, point*b,float*result)
{
floatx1 = a->x;// Uses a load instruction
floatx2 = b->x;
floatproduct1 = x1*x2;
floaty1 = a->y;
floaty2 = b->y;
floatproduct2 = y1*y2;
floatinner = product1 + product2;
*
result = inner;// Uses a store instruction
}
voidcomputeInnerProduct(point A[], point B[],floatresult[],intN)
{
for(inti = 0; i < N; i++)
innerProduct(&A[i], &B[i], &result[i]);
}
In the following questions, you can assume the following:
Nis very large (>106).
The machines described have modern CPUs, providing out-of-order execution, speculative execution,
branch prediction, etc.
-There are ample resources for fetching, decoding, and committing instructions. The only per-
formance limitations are due to the number, capabilities, and latencies of the execution units.
-The branch prediction is perfect.
There are no cache misses.
The overhead of updating the loop indexiis negligible.
The load/store units perform any necessary address arithmetic.
The overhead due to procedure calls, as well as starting and ending loops, is negligible.
2
Problem 1: Instruction-Level Parallelism
Suppose you have a machineM1with two load/store units that can each load or store a single value on
each clock cycle, and one arithmetic unit that can perform one arithmetic operation (e.g., multiplication or
addition) on each clock cycle.
A.
Assume that the load/store and arithmetic units ha velatencies of one c ycle.Ho wman yclock c ycles
would be required to executecomputeInnerProductas a function ofN? Explain what limits the
performance.
B.
No wassume that the load/store and arithmetic unit ha velatencies of 10 clock c ycles,b utthe yare fully
pipelined, able to initiate new operations every clock cycle. How many clock cycles would be required
to executecomputeInnerProductas a function ofN? Explain how this relates to your answer to
part A.
3
Problem 2: SIMD with ISPC
Consider running the following ISPC code.
export voidcomputeInnerProductISPC(uniformpoint[] A,
uniformpoint[] B,
uniform float[] result,
uniform intN)
{
foreach(i = 0 ... N)
{
result[i] = A[i].x
*B[i].x + A[i].y*B[i].y;
}
}
Suppose machineM2has one 8-wide SIMD load/store unit, and one 8-wide SIMD arithmetic unit. Both
have latencies of one clock cycle.
A.
Ho wman yclock c yclesw ouldbe required to e xecutecomputeInnerProductISPCas a function
ofN? Explain what limits the performance.
B.
If we were to run the code sho wnin computeInnerProductISPCon a five-core machineM3,
where each core has the same SIMD capabilities asM2, what would be the best speedup it could achieve
over the single-core performance of part A? Explain.
4
Problem 3: SIMD, Multicore, and Multi-Threaded Parallelism with ISPC
A.
Consider the fi ve-coremachine M3described in Problem 2B. Suppose you could write multi-threaded
code where there is no overhead associated with the threads. Each thread would run the function
computeInnerProductISPCto compute some subset of theNelements. What is the maximum
speedup you could achieve relative to the single-threaded code running on machineM2?
B.
No wsuppose we ha vea machine M4, identical toM3, except that each core supports three-way simul-
taneous multithreading. What is the maximum speedup your multithreaded code could achieve relative
to what it achieved running on machineM3.
C.
Let N= 106. Running on machineM3, if we were to write a Pthreads program that spawns 250 threads,
each computing 4000 inner products usingcomputeInnerProductISPC, would this program get
an overall performance improvement over one that uses a single thread to compute allNinner prod-
ucts? (Use your own intution about the cost of spawning new threads in Pthreads when answering this
question.) Explain.
5
D.Let N= 106. Running on machineM3, if we were to write an ISPC program that launches 250 tasks,
each computing 4000 inner products usingcomputeInnerProductISPC, would this program get
an overall performance improvement over one that uses a single task to compute allNinner products?
(Consider what you know about the performance characteristics of ISPC tasks.) Explain.
6