Which hardware PMU events should be used to calculate FLOPS on the Intel(R) Xeon Phi(TM) coprocessor?

FLOPS means floating-point operations per second, a measure commonly used in High Performance Computing. In general, Intel(R) VTune(TM) Amplifier XE
provides only a metric named Cycles Per Instruction (average CPI), which measures performance for general programs.

In this article, I use matrix1.c as an example and show which events can be used to calculate FLOPS on different platforms.
First of all, I will use the Intel(R) C++ compiler with different switches to generate binaries for legacy x87, SSE, and AVX on an Intel(R) Xeon
processor, and vector instructions on an Intel(R) Xeon Phi(TM) coprocessor, then use events to calculate FLOPS.

(I work on a 2nd Generation Intel(R) Core(TM) architecture (Sandy Bridge) processor; the CPU frequency is 3.4 GHz, with a 64-bit operating system.)
(I also work on an Intel Xeon Phi coprocessor; the CPU frequency is 1.09 GHz.)

1. Generate legacy x87 floating-point instructions and calculate FLOPS

Build:
gcc -g -mno-sse matrix1.c -o matrix1.x87

Run VTune:
amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.X87:sa=10000 -- ./matrix1.x87

amplxe-cl -report hw-events

Function  Module       Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:FP_COMP_OPS_EXE.X87 (K)
--------  -----------  -------------------------------------------------  --------------------------------------------
multiply  matrix1.x87  36,782,055                                         2,160,570

There were 2,160,570,000 counts of FP_COMP_OPS_EXE.X87
Elapsed time of multiply() = 36,782,055,000 / 3,400,000,000 = 10.818 seconds
FLOPS = 2,160,570,000 / 1,000,000 / 10.818 = 199.719 Mflops
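The arithmetic above can be sketched as a small helper. This is only an illustration: the function names `elapsed_seconds` and `mflops` are hypothetical, and the sketch assumes, as the article does, that the reference-cycle count divided by the nominal 3.4 GHz frequency gives the elapsed time of multiply().

```python
def elapsed_seconds(ref_cycles: int, cpu_hz: float) -> float:
    """Convert a reference-cycle count to seconds at a fixed clock frequency."""
    return ref_cycles / cpu_hz


def mflops(fp_event_count: int, ref_cycles: int, cpu_hz: float) -> float:
    """Millions of counted floating-point events per second."""
    return fp_event_count / 1_000_000 / elapsed_seconds(ref_cycles, cpu_hz)


# Values taken from the report row above (the report prints them in thousands).
X87_COUNT = 2_160_570_000    # FP_COMP_OPS_EXE.X87
REF_CYCLES = 36_782_055_000  # CPU_CLK_UNHALTED.REF_TSC
CPU_HZ = 3.4e9               # nominal frequency stated in the article

seconds = elapsed_seconds(REF_CYCLES, CPU_HZ)
rate = mflops(X87_COUNT, REF_CYCLES, CPU_HZ)
print(f"elapsed = {seconds:.3f} s, rate = {rate:.1f} Mflops")
# prints: elapsed = 10.818 s, rate = 199.7 Mflops
```

The result differs from the hand calculation only in the last digits, because the hand calculation rounds the elapsed time to 10.818 seconds before dividing.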

2. Use SSE registers by building with the Intel C++ compiler and SSE options enabled, then calculate FLOPS

Build:
icc -g -fno-inline -xSSE4.1 matrix1.c -o matrix1.SSE41

Run VTune:
amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.X87:sa=10000 -- ./matrix1.SSE41

amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE:sa=10000,FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE:sa=10000 -- ./matrix1.SSE41

amplxe-cl -report hw-events

Function  Module         Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE (K)  Hardware Event Count:FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE (K)
--------  -------------  -------------------------------------------------  ----------------------------------------------------------  ----------------------------------------------------------
multiply  matrix1.SSE41  1,100,002                                          0                                                           1,185,800

There were 1,185,800,000 counts of FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE
Elapsed time of multiply() = 1,100,002,000 / 3,400,000,000 = 0.3235s
FLOPS = 1,185,800,000 / 1,000,000 / 0.3235 = 3665.53 Mflops
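The figure above treats each counted event as one operation. A packed SSE instruction works on two doubles at once, so it may be useful to also look at a weighted total. The sketch below does that under a clearly labeled assumption: the per-event weights in `OPS_PER_EVENT` (1 for scalar double, 2 for 128-bit packed double) are not taken from this article, and the helper name `weighted_mflops` is hypothetical.

```python
# Assumed double-precision operations per counted event (not from the article):
# a scalar operation handles 1 double, a 128-bit SSE packed operation handles 2.
OPS_PER_EVENT = {
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2,
}


def weighted_mflops(counts, ref_cycles, cpu_hz, weights):
    """Sum event counts scaled by per-event weights and convert to Mflops."""
    seconds = ref_cycles / cpu_hz
    total_ops = sum(weights[name] * count for name, count in counts.items())
    return total_ops / 1_000_000 / seconds


# Counts from the report row above (printed there in thousands).
counts = {
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 0,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 1_185_800_000,
}
REF_CYCLES = 1_100_002_000
CPU_HZ = 3.4e9

# Weight of 1 everywhere reproduces the event-count-based figure in the text.
raw = weighted_mflops(counts, REF_CYCLES, CPU_HZ, {name: 1 for name in counts})
# Assumed weights give an estimate of individual double-precision operations.
scaled = weighted_mflops(counts, REF_CYCLES, CPU_HZ, OPS_PER_EVENT)
print(f"raw = {raw:.0f} Mflops, weighted = {scaled:.0f} Mflops")
```

With a weight of 1 for every event, the helper gives the same event-based rate as the hand calculation, apart from rounding of the elapsed time. The weighted value is only as good as the assumed weights.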

3. Use AVX registers by building with the Intel C++ compiler and the AVX option enabled, then calculate FLOPS

Build:
icc -g -fno-inline -xAVX matrix1.c -o matrix1.AVX

Run VTune:
amplxe-cl -collect-with runsa -knob event-config=SIMD_FP_256.PACKED_DOUBLE:sa=10000 -- ./matrix1.AVX

amplxe-cl -report hw-events

Function  Module       Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:SIMD_FP_256.PACKED_DOUBLE (K)
--------  -----------  -------------------------------------------------  --------------------------------------------------
multiply  matrix1.AVX  1,486,002                                          777,070

There were 777,070,000 counts of SIMD_FP_256.PACKED_DOUBLE
Elapsed time of multiply() = 1,486,002,000 / 3,400,000,000 = 0.437s
FLOPS = 777,070,000 / 1,000,000 / 0.437 = 1778.19 Mflops
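When several builds are compared, copying the counts by hand gets tedious. The following is a minimal sketch of reading one row of the hw-events report. It assumes the row consists of whitespace-separated fields (function, module, then counts with thousands separators) and that the counts are printed in thousands, as the (K) in the column headers indicates. The helper name `parse_report_row` is hypothetical, and function names that contain spaces would need different handling.

```python
def parse_report_row(line: str, unit: int = 1_000):
    """Split a report row into (function, module, counts).

    Counts are multiplied by `unit` because the report prints them
    in thousands (K) or millions (M).
    """
    function, module, *raw_counts = line.split()
    counts = [int(field.replace(",", "")) * unit for field in raw_counts]
    return function, module, counts


row = "multiply matrix1.AVX 1,486,002 777,070"
function, module, (ref_cycles, packed_double) = parse_report_row(row)

seconds = ref_cycles / 3.4e9  # nominal frequency stated in the article
print(function, module, ref_cycles, packed_double)
# prints: multiply matrix1.AVX 1486002000 777070000
print(f"{packed_double / 1_000_000 / seconds:.0f} Mflops")
```

For the Intel Xeon Phi report, where the columns are printed in millions (M), the same helper could be called with `unit=1_000_000`.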

4. Use vector instructions by building a native program for the Intel Xeon Phi coprocessor with the Intel C++ compiler, then calculate FLOPS

Build:
icc -g -fno-inline -mmic -O3 matrix1.c -o matrix1.MIC

The application's floating-point operations are processed by the vector processing unit (VPU), which provides data parallelism. VTune supports the following VPU events:
VPU_DATA_READ
VPU_DATA_WRITE

Run VTune:
amplxe-cl -target-system=mic-native:0 -collect-with runsa -knob event-config=VPU_DATA_READ,VPU_DATA_WRITE -search-dir=. -- /root/matrix1.MIC

amplxe-cl -R hw-events
Function  Module       Hardware Event Count:VPU_DATA_READ (M)  Hardware Event Count:VPU_DATA_WRITE (M)  Hardware Event Count:CPU_CLK_UNHALTED (M)
--------  -----------  --------------------------------------  ---------------------------------------  -----------------------------------------
multiply  matrix1.MIC  176                                     134                                      2,152

There were (176+134)=310M counts of VPU_DATA_READ & VPU_DATA_WRITE
Elapsed time of multiply() = 2,152,000,000 / 1,090,000,000 = 1.974s
FLOPS = 310,000,000 / 1,000,000 / 1.974 = 157.04 Mflops

Please note that my example is a single-threaded application running on one core; you may develop a multithreaded application that runs on multiple cores of the Intel Xeon Phi coprocessor.
