FLOPS (floating-point operations per second) is a performance measure used in High Performance Computing. By default, Intel(R) VTune(TM) Amplifier XE
only provides a metric named Cycles Per Instruction (average CPI), which measures performance for general programs; it does not report FLOPS directly.
In this article, I use matrix1.c as an example and show which events can be used to calculate FLOPS for code on different platforms.
First of all, I use the Intel(R) C++ compiler (and gcc for the x87 build) with different switches to generate binaries that use legacy x87, SSE, and AVX instructions on an Intel(R) Xeon
processor and vector instructions on an Intel(R) Xeon Phi(TM) coprocessor; then I use events to calculate FLOPS.
(I work on a 2nd Generation Intel(R) Core(TM) Architecture (Sandy Bridge) processor; the CPU frequency is 3.4 GHz and the operating system is 64-bit.)
(I also work on an Intel Xeon Phi coprocessor; its CPU frequency is 1.09 GHz.)
1. Generate legacy x87 instructions for traditional FP code, calculate FLOPS
Build:
gcc -g -mno-sse matrix1.c -o matrix1.x87
Run VTune:
amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.X87:sa=10000 -- ./matrix1.x87
amplxe-cl -report hw-events
Function  Module       Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:FP_COMP_OPS_EXE.X87 (K)
--------  -----------  -------------------------------------------------  --------------------------------------------
multiply  matrix1.x87                                         36,782,055                                     2,160,570
There were 2,160,570,000 counts of FP_COMP_OPS_EXE.X87
Elapsed time of multiply() = 36,782,055,000 / 3,400,000,000 = 10.818 seconds
FLOPS = 2,160,570,000 / 1,000,000 / 10.818 = 199.719 Mflops
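The arithmetic above can be wrapped in a small helper. The following Python sketch is not part of VTune; the function name mflops and its parameters are my own. It assumes, as the calculation above does, that each counted event is one floating-point operation and that reference cycles divided by the nominal CPU frequency give the elapsed time.

```python
def mflops(event_count, ref_cycles, cpu_hz):
    """Return MFLOPS, treating each counted event as one FP operation (assumption)."""
    elapsed_s = ref_cycles / cpu_hz  # elapsed time in seconds
    return event_count / 1e6 / elapsed_s


# Values from the x87 report above; the report shows counts in thousands (K).
x87_events = 2_160_570 * 1000
ref_cycles = 36_782_055 * 1000
cpu_hz = 3.4e9  # 3.4 GHz host processor

print(f"{mflops(x87_events, ref_cycles, cpu_hz):.1f} Mflops")
```

Because the helper does not round the elapsed time before dividing, its result can differ from the hand calculation in the last digits.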
2. Use SSE registers by building with the Intel C++ compiler's SSE options, calculate FLOPS
Build:
icc -g -fno-inline -xSSE4.1 matrix1.c -o matrix1.SSE41
Run VTune:
amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.X87:sa=10000 -- ./matrix1.SSE41
amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE:sa=10000,FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE:sa=10000 -- ./matrix1.SSE41
amplxe-cl -report hw-events
Function  Module         Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE (K)  Hardware Event Count:FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE (K)
--------  -------------  -------------------------------------------------  ----------------------------------------------------------  ----------------------------------------------------------
multiply  matrix1.SSE41                                          1,100,002                                                           0                                                   1,185,800
There were 1,185,800,000 counts of FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE
Elapsed time of multiply() = 1,100,002,000 / 3,400,000,000 = 0.3235s
FLOPS = 1,185,800,000 / 1,000,000 / 0.3235 = 3665.53 Mflops
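When a binary mixes scalar and packed instructions, the counts of several events have to be combined. Here is a minimal Python sketch, assuming a hypothetical helper total_ops with an optional per-event weight (operations per counted event). With the default weight of 1 it follows the same convention as the calculation above. Whether a packed event should be weighted by its vector width (for example, 2 doubles per 128-bit SSE register) depends on how the event is defined for your processor, so treat the weight shown here as an assumption to check against the event documentation.

```python
def total_ops(counts, weights=None):
    """Sum event counts, optionally scaling each by operations per event."""
    weights = weights or {}
    return sum(n * weights.get(name, 1) for name, n in counts.items())


def mflops(ops, ref_cycles, cpu_hz):
    """MFLOPS from an operation count and reference cycles at the nominal frequency."""
    return ops / 1e6 / (ref_cycles / cpu_hz)


# Counts from the SSE4.1 report above (reported in thousands).
sse_counts = {
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 0,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 1_185_800 * 1000,
}
ref_cycles = 1_100_002 * 1000
cpu_hz = 3.4e9

# Default weights of 1: each event counts as one operation, as in the article.
as_counted = mflops(total_ops(sse_counts), ref_cycles, cpu_hz)

# Hypothetical weighting: 2 doubles per packed 128-bit operation.
weighted = mflops(
    total_ops(sse_counts, {"FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2}),
    ref_cycles,
    cpu_hz,
)
print(f"{as_counted:.0f} Mflops as counted, {weighted:.0f} Mflops weighted")
```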
3. Use AVX registers by building with the Intel C++ compiler's AVX option, calculate FLOPS
Build:
icc -g -fno-inline -xAVX matrix1.c -o matrix1.AVX
Run VTune:
amplxe-cl -collect-with runsa -knob event-config=SIMD_FP_256.PACKED_DOUBLE:sa=10000 -- ./matrix1.AVX
amplxe-cl -report hw-events
Function  Module       Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:SIMD_FP_256.PACKED_DOUBLE (K)
--------  -----------  -------------------------------------------------  --------------------------------------------------
multiply  matrix1.AVX                                          1,486,002                                             777,070
There were 777,070,000 counts of SIMD_FP_256.PACKED_DOUBLE
Elapsed time of multiply() = 1,486,002,000 / 3,400,000,000 = 0.437s
FLOPS = 777,070,000 / 1,000,000 / 0.437 = 1778.19 Mflops
4. Use vector instructions by building a native program for the Intel Xeon Phi coprocessor with the Intel C++ compiler, calculate FLOPS
Build:
icc -g -fno-inline -mmic -O3 matrix1.c -o matrix1.MIC
The application's FP operations are processed by the vector processing unit (VPU), which provides data parallelism. VTune supports the following VPU events:
VPU_DATA_READ
VPU_DATA_WRITE
Run VTune:
amplxe-cl -target-system=mic-native:0 -collect-with runsa -knob event-config=VPU_DATA_READ,VPU_DATA_WRITE -search-dir=. -- /root/matrix1.MIC
amplxe-cl -R hw-events
Function  Module       Hardware Event Count:VPU_DATA_READ (M)  Hardware Event Count:VPU_DATA_WRITE (M)  Hardware Event Count:CPU_CLK_UNHALTED (M)
--------  -----------  --------------------------------------  ---------------------------------------  -----------------------------------------
multiply  matrix1.MIC                                     176                                      134                                      2,152
There were (176+134)=310M counts of VPU_DATA_READ & VPU_DATA_WRITE
Elapsed time of multiply() = 2,152,000,000 / 1,090,000,000 = 1.974s
FLOPS = 310,000,000 / 1,000,000 / 1.974 = 157.04 Mflops
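Note that this report gives counts in millions (M), while the reports for the host processor use thousands (K). The Python sketch below, with the hypothetical names UNIT and scaled, converts a displayed count and its unit suffix to a raw count before the elapsed time is computed, so the two report styles are not mixed up. It keeps the same assumption as the calculation above, namely that the VPU read and write events together stand in for the FP operation count.

```python
# Multipliers for the unit suffix shown in the report's column header.
UNIT = {"": 1, "K": 1_000, "M": 1_000_000}


def scaled(text, unit):
    """Convert a displayed count such as '2,152' with unit 'M' to a raw count."""
    return int(text.replace(",", "")) * UNIT[unit]


# Row for multiply() from the Xeon Phi report above.
vpu_read = scaled("176", "M")
vpu_write = scaled("134", "M")
cycles = scaled("2,152", "M")

phi_hz = 1.09e9  # 1.09 GHz coprocessor clock
elapsed_s = cycles / phi_hz
vpu_events = vpu_read + vpu_write

print(f"{vpu_events:,} VPU events in {elapsed_s:.3f} s")
print(f"{vpu_events / 1e6 / elapsed_s:.2f} Mflops")
```

The same scaled helper works for the host reports by passing "K" as the unit.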
Please note that my example is a single-threaded application running on one core; you can develop a multithreaded application that runs on multiple cores of the Intel Xeon Phi coprocessor.