In a distributed SIMD system (in Flynn's taxonomy) implementing the linear convolution of two n-vectors, the resulting time complexity is:
- a. 188n
- b. n log(n)
- c. n
- d. n²
- e. log(n)
- f. 3(n/2)
- g. 10n³

- e. 8
- f. 15
- g. 16

Consider a processor operating at 2 GHz connected to a DRAM with a latency of L = 200 ns, initially without caches. Assume that the processor has a 4-path multi-pipeline architecture and 2 multiply-add units. The dot-product computation performs one multiply-add on a single pair of vector elements, so each floating-point operation requires one data fetch. The limit on the peak speed of this computation (per floating-point operation), expressed as [nanoseconds, MFLOPS], is:
- a. [100, 8]
- b. [200, 5]
- c. [100, 2]
- d. [400, 6]
- e. [200, 4]
- f. [200, 8]

Consider a processor operating at 2 GHz connected to a DRAM with a latency of L = 200 ns, initially without caches. Assume that the processor has a 4-path multi-pipeline architecture and 2 multiply-add units. Consider introducing a cache of size 32 KB with a latency of 1 ns. We need to multiply two matrices [A] and [B] of dimensions 32 x 32. The total number of operations needed to multiply the two matrices is:
- a. 132 M
- b. 48 M
- c. 32 M
- d. 48 K
- e. 32 K
- f. 64 K

- d. (n/2) + 2w
- e. n²·w
- f. 3w + 2n
- g. n·w

Consider a processor operating at 2 GHz connected to a DRAM with a latency of L = 200 ns, initially without caches. Assume that the processor has a 4-path multi-pipeline architecture and 2 multiply-add units. Assume a block size of 1 word. The number of cycles the processor must wait before it can process the data, each time a memory request is made, is:
- a. 300 cycles
- b. 200 cycles
- c. 400 cycles
- d. 500 cycles
- e. 100 cycles
- f. 800 cycles

Consider a processor operating at 2 GHz connected to a DRAM with a latency of L = 200 ns, initially without caches. Assume that the processor has a 4-path multi-pipeline architecture and 2 multiply-add units. Consider introducing a cache of size 32 KB with a latency of 1 ns. We need to multiply two matrices [A] and [B] of dimensions 32 x 32. The time needed to fetch the two matrices into the cache is:
- a. 100 ns
- b. 200 ns
- c. 400 µs
- d. 800 µs
- e. 200 µs

- f. 8.25
- g. 17.2

- e. 12.5
- f. 13.25
- g. 18.5

For a 16-point FFT, the speedup in the number of complex multiplications, compared with computing the DFT directly, is:
- a. 12.25
- b. 10.7
- c. 8.25
- d. 9.4
- e. 16.2
- f. 7
- g. 17.2

Consider a processor operating at 2 GHz connected to a DRAM with a latency of L = 200 ns, initially without caches. Assume that the processor has a 4-path multi-pipeline architecture and 2 multiply-add units. Consider introducing a cache of size 32 KB with a latency of 1 ns. We need to multiply two matrices [A] and [B] of dimensions 32 x 32. The number of words that must be fetched into the cache is:
- a. 1 K words
- b. 2 K words
- c. 1 M words
- d. 4 K words
- e. 2 M words
- f. 4 M words

For an 8-point distributed FFT implementation, the O() complexity of the real multiplications is:
- a. 14
- b. 27
- c. 32
- d. 24
- e. 9
- f. 12
- g. 48

Consider a processor operating at 2 GHz connected to a DRAM with a latency of L = 200 ns, initially without caches. Assume that the processor has a 4-path multi-pipeline architecture and 2 multiply-add units. Consider introducing a cache of size 32 KB with a latency of 1 ns. We need to multiply two matrices [A] and [B] of dimensions 32 x 32. The time needed to perform this matrix multiplication at a rate of 4 instructions per cycle is approximately:
- a. 16 ns
- b. 8 µs
- c. 4 ns
- d. 4 µs
- e. 2 ms

Consider a processor operating at 2 GHz connected to a DRAM with a latency of L = 200 ns, initially without caches. Assume that the processor has a 4-path multi-pipeline architecture and 2 multiply-add units. The peak processor rating is:
- a. 8 GFLOPS
- b. 2 GFLOPS
- c. 4 GFLOPS
- d. 12 GFLOPS
- e. 6 GFLOPS
- f. 10 GFLOPS
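
The processor and memory questions above all rest on the same few pieces of arithmetic, which the following Python sketch works through. It is an illustration under assumptions the questions do not spell out, not an answer key: it assumes that "4-path" means four instructions issued per cycle, that one word is fetched per memory request at the full DRAM latency, that multiplying two n x n matrices costs 2n³ operations (each multiply-add counted as two), and that K = 1024. All function names are hypothetical.

```python
# Back-of-the-envelope memory-latency model for the quiz parameters.
# Assumptions (not stated in the questions): 4 instructions issued per cycle,
# one word per memory request, 2*n^3 operations per n x n matrix multiply.

CLOCK_HZ = 2e9           # 2 GHz processor clock
DRAM_LATENCY_NS = 200.0  # DRAM latency L
ISSUE_WIDTH = 4          # assumed meaning of "4-path": instructions per cycle
N = 32                   # matrix dimension


def cycle_time_ns(clock_hz):
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def peak_gflops(clock_hz, issue_width):
    """Peak rating if every issue slot retires one floating-point operation."""
    return clock_hz * issue_width / 1e9


def stall_cycles(latency_ns, clock_hz):
    """Cycles spent waiting for one memory request with a one-word block."""
    return round(latency_ns / cycle_time_ns(clock_hz))


def dot_product_mflops(latency_ns):
    """One operation per fetch: one FLOP every latency_ns nanoseconds."""
    return 1e3 / latency_ns


def words_to_fetch(n):
    """Two n x n input matrices, one word per element."""
    return 2 * n * n


def fetch_time_us(n, latency_ns):
    """Time to bring both matrices into the cache, one word per request."""
    return words_to_fetch(n) * latency_ns / 1e3


def operation_count(n):
    """n^3 multiply-adds, each counted as two operations."""
    return 2 * n ** 3


def compute_time_us(n, clock_hz, issue_width):
    """Time to execute all operations at issue_width operations per cycle."""
    cycles = operation_count(n) / issue_width
    return cycles * cycle_time_ns(clock_hz) / 1e3


if __name__ == "__main__":
    print("peak rating (GFLOPS):", peak_gflops(CLOCK_HZ, ISSUE_WIDTH))
    print("stall per request (cycles):", stall_cycles(DRAM_LATENCY_NS, CLOCK_HZ))
    print("dot product (MFLOPS):", dot_product_mflops(DRAM_LATENCY_NS))
    print("words to fetch:", words_to_fetch(N))
    print("fetch time (us):", fetch_time_us(N, DRAM_LATENCY_NS))
    print("operations:", operation_count(N))
    print("compute time (us):", compute_time_us(N, CLOCK_HZ, ISSUE_WIDTH))
```

Under these assumptions the fetch time comes out near 410 µs and the compute time near 8.2 µs, so the rounded figures in the options (for example 400 µs and 8 µs) correspond to treating K as roughly 1000. Changing the counting convention, such as treating a multiply-add as a single operation, changes the operation count and the compute time accordingly.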