Performance Assignment (SIMD)

This assignment requires you to write a function, using inline assembly, that takes three arbitrary-length vectors of double-precision floating point numbers a, b, and c, and perform the following operation:

      for (int i = 0; i < length; i++) {
        a[i] += b[i] * c[i];
      }
    

The code above will calculate the correct result, but it performs slowly. Yours will need to be faster. Fortunately, we have hardware support for this; modern processors have a set of vector extensions called AVX2 that will allow us to perform calculations on up to four doubles at a time. In fact, AVX2 adds support for exactly this purpose through the use of the new FMA3 fused multiply-add instructions.
Distribution code for creating random vectors and testing the performance of your code against a naïve scalar implementation can be found here: https://bitbucket.org/wuchangfeng/cs201_simd

What to Submit

Within your homework directory, a single file with the title submission.c should contain your solution. This file should contain, at minimum, a function with the following signature:

      bool vector_fma(struct doubleVector * a, const struct doubleVector * b, const struct doubleVector * c)
    

This function should work with the provided main.c and Makefile; there is no need to write or include your own. The function also should not print output to the screen or take input from the user.

The return value should indicate whether or not the computation was successful. The only case that I can think of off the top of my head that would cause this to fail is if the vectors had varying lengths.

Grading Criteria

Assignments will be evaluated based on the following criteria:


The file main.c contains code that will be used to test and benchmark your submission. Running this program should give you a good idea of whether or not your code is correct.

Notes and Hints

To complete this assignment, you'll need to work on a processor that supports AVX2 instructions. I would recommend linuxlab.cs.pdx.edu, which is also where submissions will be tested.

Included in the distribution code is a file called vectors.s. This file contains a couple of assembly routines which are used to copy and compare vectors of doubles. Looking at these may be helpful, as they contain usage examples of most – if not all – of the vector instructions you need.

You may notice that the scalar implementation is also written in assembly, and uses the vfmadd132sd instruction. This is necessary because FMA instructions are actually also more precise than a separate multiplication and addition, since we only need to round once. If you're attempting to debug your code in GDB or similar by checking its results against expressions like a->data[0] + b->data[0] * c->data[0], you may notice minute differences in the smallest decimal digits; this is why.

Vector Instructions

The two vector instructions described below are minimally sufficient to complete this assignment.

vmovupd

vmovupd, or "Vector move unaligned packed doubles", is used to move multiple doubles at a time to or from floating-point registers. It is analogous to a regular mov instruction, and should therefore be pretty familiar. Examples of usage can be found in vectors.c.

The number of doubles is determined by the size of the source/destination registers. If we vmovupd from memory to a 128-bit register such as %xmm0, we'll move two doubles; if we vmovupd to a 256-bit register such as %ymm0, we'll move four doubles.

vfmadd231pd

vfmadd231pd performs a fused multiply-add operation on packed doubles. The 231 refers to the fact that in Intel syntax, this multiplies together the second and third operands and adds the product to the first. Unfortunately, we're using AT&T syntax for our assembly, which means that our arguments are in the opposite order. Consider the following usage example:

      vfmadd231pd %ymm0, %ymm1, %ymm2
    

This multiplies the packed doubles stored in %ymm0 and %ymm1, and adds them to the packed doubles in %ymm2. Since in AT&T syntax, the destination comes last, the result is also stored in %ymm2.