
Miracast* on Windows* 8.1 Desktop


Download PDF
 
Code Sample

Executive Summary

Many features of the Intel WiDi extensions library have migrated into Microsoft's Miracast implementation, which is part of Windows* 8.1. This white paper discusses several techniques for supporting Miracast in Windows 8.1 desktop applications using the Intel® Media SDK and OpenGL*. Supporting Miracast in Windows Store apps is not covered, as those require a completely different framework.

System Requirements

The sample code was written in Visual Studio* 2013 to demonstrate two things: (1) Miracast and (2) Intel® Media SDK / OpenGL* texture sharing, in which decoded surfaces are shared with OpenGL textures without any copying, significantly improving efficiency. The MJPEG decoder is hardware accelerated on 4th generation Intel® Core™ processors (codenamed Haswell) and later; on earlier processors the Intel Media SDK automatically falls back to its software decoder. Either way, an MJPEG-capable camera (onboard or USB) is required.

Aside from identifying the Miracast connection type, most of the techniques used in the sample code and this white paper also work with Visual Studio 2012. The sample is based on the Intel Media SDK 2014 for Clients, which can be downloaded from http://software.intel.com/sites/default/files/MediaSDK2014Clients.zip. Installing the Intel Media SDK automatically sets up a set of environment variables so Visual Studio can find the correct paths for headers and libraries.

Application Overview

The application takes camera input as MJPEG, decodes the MJPEG video, encodes the stream to H264, and then decodes the H264. Both the decoded MJPEG camera stream and the final decoded H264 stream are displayed in an MFC-based GUI. On a Haswell system, the two decoders and one encoder (at 1080p resolution) run sequentially for readability, but hardware acceleration makes them fast enough that the camera speed becomes the limiting factor on fps. In a real application the encoder and decoders would run in separate threads, so performance would not be an obstacle.

On a single-monitor configuration, the camera feed is displayed as a PIP above the decoded H264 video in the OpenGL-based GUI (Figure 1). When Miracast is connected, the software automatically identifies the Miracast-connected display and plays the decoded H264 video there in a full-screen window, while the GUI shows the raw camera video, making the difference between the original and the encoded video clearly visible. Finally, the View->Monitor Topology menu not only shows the current monitor topology as radio buttons but can also change it. Unfortunately, it cannot start a Miracast connection; that can only be done from the OS charm menu (swipe from the right -> Devices -> Project), and there is currently no API that creates a Miracast connection. Interestingly, setting the monitor topology to internal-only disconnects Miracast. If multiple monitors are connected by wire, the menu can change the topology at any time.


Figure 1. Single-monitor topology. The MJPEG camera stream is shown at the bottom right. The H264-encoded video plays in the GUI. When multiple displays are enabled (e.g., Miracast), the software detects the change and automatically sends the MJPEG camera video and the H264-encoded video to separate displays.

Detecting Display Topology Changes

When the OS detects a change in the display configuration, such as an external display being added or removed (Miracast connecting or disconnecting), it sends a WM_DISPLAYCHANGE message to the top-level window. In the sample code, the top-level window is the CMainFrame class, and its OnDisplayChange member function handles the message. Because multi-monitor transitions involve a short delay, the OnDisplayChange handler first disables all activity that updates internal data structures, such as the camera feed and everything downstream, then starts a timer to allow enough time for the display configuration to switch. The QueryDisplayConfig API is used to learn the topology; it provides an array of display information (including the position and size of each display, which is essential if you want a full-screen window on a particular display) as well as the topology type (internal, clone, extend, external, etc.). These functions are wrapped in the CDisplayHelper class, which is used by the OnTimer function started by the OnDisplayChange handler. Once the topology has been reconfigured, the handler restarts internal activity and resumes the camera feed.
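The pause-then-resume flow described above can be sketched as a small state machine. This is an illustrative stand-in, not the sample's actual CMainFrame/CDisplayHelper code: the class and member names below are hypothetical, and the injected query function stands in for QueryDisplayConfig.

```cpp
#include <functional>
#include <utility>

// On a display change we pause the pipeline immediately; a timer fires
// later (after the OS has settled), re-queries the topology, and resumes.
class TopologyWatcher {
public:
    explicit TopologyWatcher(std::function<int()> queryTopology)
        : query_(std::move(queryTopology)) {}

    // WM_DISPLAYCHANGE analogue: stop the camera feed and downstream stages.
    void OnDisplayChange() { pipelineRunning_ = false; timerArmed_ = true; }

    // Timer analogue: reconfigure from the new topology and resume.
    void OnTimer() {
        if (!timerArmed_) return;
        topology_ = query_();    // QueryDisplayConfig stand-in
        timerArmed_ = false;
        pipelineRunning_ = true; // restart the camera feed
    }

    bool running() const { return pipelineRunning_; }
    int topology() const { return topology_; }

private:
    std::function<int()> query_;
    bool pipelineRunning_ = true;
    bool timerArmed_ = false;
    int topology_ = 0;
};
```

In the real sample the "pause" also stops the camera thread and the Media SDK sessions; here a flag simply marks the pipeline state.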

Changing the Display Topology

To change the display topology, call SetDisplayConfig (not QueryDisplayConfig). This generates a series of events, including WM_DISPLAYCHANGE, which is handled by OnDisplayChange just as if a display had been physically connected or disconnected. The function is wrapped in CDisplayHelper::SetCurrentTopology and is used, for example, in the CMainFrame::OnMonitortopologyRange handler when the user clicks a radio menu item.

Caveats for Multi-Monitor Topology Changes

In theory, showing another window on the external display and controlling it from topology-change detection seems straightforward. In practice, it takes time for the OS to start the transition, complete the display configuration, and show the content. When encoders/decoders/D3D/OpenGL and their quirks are involved, debugging the timing between internal processing and the GUI gets very complicated. For example, if the camera keeps feeding the playback, decode, and encode pipeline while no actual display is connected, the system can crash in ways that are hard to recover from. This sample tries to reuse most of the pipeline across the switch, but it would be simpler to shut the whole pipeline down and restart it, because anything can go wrong during the 10+ seconds it takes to add a display, even over an HDMI or VGA connection.

Future Work

This white paper handles video well on multiple displays, including Miracast. However, it does not handle audio when the external display has its own speakers, and a Miracast display is typically a big-screen TV with built-in speakers. We plan to add audio switching in the future.

 

Intel, the Intel logo, and Core are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.

Copyright © 2014 Intel Corporation. All rights reserved.


  • Sharing Textures from the Intel® Media SDK to OpenGL*


    Code Sample

    Executive Summary

    On Windows* OS, video processing is usually done with Direct3D. Many applications, however, use OpenGL* for its excellent cross-platform capability, providing the same GUI and look and feel on different platforms. Recent Intel graphics drivers support sharing surfaces from D3D to OpenGL through the NV_DX_interop extension, and this can be used together with the Intel® Media SDK. With the Intel Media SDK configured to use Direct3D and NV_DX_interop added, OpenGL can consume the Media SDK's frame buffers without the expensive round trip of copying textures from the GPU to the CPU and back to the GPU. This sample code and white paper demonstrate how to set up the Intel Media SDK to encode and decode with D3D, color-convert from the NV12 color space (the Media SDK's native format) to the RGBA color space (OpenGL's native format), and then map the D3D surface to an OpenGL texture. The pipeline completely bypasses copying textures from the GPU to the CPU, formerly the biggest bottleneck when using OpenGL with the Intel Media SDK.

    System Requirements

    The sample code was written in Visual Studio* 2013 to (1) demonstrate Miracast and (2) implement Intel® Media SDK / OpenGL texture sharing, in which decoded surfaces are shared with OpenGL textures without any copying, significantly improving efficiency. The MJPEG decoder is hardware accelerated on Haswell and later processors; on earlier processors the Media SDK automatically falls back to its software decoder. Either way, an MJPEG-capable camera (onboard or USB) is required.
    Aside from identifying the Miracast connection type, most of the techniques in the sample code and this white paper also work with Visual Studio 2012. The sample is based on the Intel Media SDK 2014 for Clients, which can be downloaded from https://software.intel.com/sites/default/files/MediaSDK2014Clients.zip. Installing the Media SDK creates a set of environment variables so Visual Studio can find the correct paths for headers and libraries.

    Application Overview

    The application takes camera input as MJPEG, decodes the MJPEG video, encodes the stream to H264, and then decodes the H264. Both the decoded MJPEG camera stream and the final decoded H264 stream are displayed in an MFC-based GUI. On a Haswell system, the two decoders and one encoder (at 1080p resolution) run sequentially for readability, but hardware acceleration makes them fast enough that the camera speed becomes the only limiting factor on fps. In a real application the encoder and decoders would run in separate threads, so performance would not be an obstacle.

    On a single-monitor configuration, the camera feed is displayed as a PIP above the decoded H264 video in the OpenGL-based GUI (Figure 1). When Miracast is connected, the software automatically identifies the Miracast-connected display and plays the decoded H264 video there in a full-screen window, while the GUI shows the raw camera video, making the difference between the original and the encoded video clearly visible. Finally, the View->Monitor Topology menu not only shows the current monitor topology but can also change it. Unfortunately, it cannot start a Miracast connection; that can only be done from the OS charm menu (swipe from the right -> Devices -> Project), and there is currently no API that creates a Miracast connection. Interestingly, setting the monitor topology to internal-only disconnects Miracast. If multiple monitors are connected by wire, the menu can change the topology at any time.

    Figure 1. Single-monitor topology. The MJPEG camera is shown at the bottom right. The H264-encoded video plays in the GUI. When multiple displays are enabled (e.g., Miracast), the software detects the change and automatically sends the MJPEG camera video and the H264-encoded video to separate displays.

    Main Entry Point for Pipeline Setup

    The sample code is MFC based, and the main entry point for setting up the pipeline is CChildView::OnCreate(), which starts the camera, the MJPEG-to-H264 transcoder, and the H264 decoder, and binds the textures from the transcoder and decoder to the OpenGL renderer. The transcoder is simply a subclass of the decoder that adds an encoder on top of the base decoder. Finally, OnCreate starts the thread that continuously grabs serialized camera feeds. After a camera feed is read in the worker thread, a message is posted to the OnCamRead function, which decodes the MJPEG, encodes to H264, decodes the H264, and updates the textures for the OpenGL renderer. At the top level, the whole pipeline is clean, simple, and easy to follow.

    Initializing the Decoder/Transcoder

    Both the decoder and the transcoder are initialized with D3D9Ex. The Intel® Media SDK can be configured to use software, D3D9, or D3D11; in this sample, D3D9 is used to simplify color conversion. The Media SDK's native color format is NV12, and either IDirect3DDevice9::StretchRect or IDirectXVideoProcessor::VideoProcessBlt can convert the color space to RGBA. For simplicity this white paper uses StretchRect, but in general VideoProcessBlt is recommended because of its additional post-processing capabilities. Unfortunately, D3D11 does not support StretchRect, so color conversion there can be quite involved. Also, the decoder and transcoder in this article use separate D3D devices for experimentation, such as mixing software and hardware, but a single D3D device could be shared between them to save memory. With the pipeline set up this way, the decoder output is of type (mfxFrameSurface1 *). This is just a wrapper for D3D9: mfxFrameSurface1->Data.MemId can be cast to (IDirect3DSurface9 *) and used after decoding with StretchRect or VideoProcessBlt in the CDecodeD3d9::ColorConvert function. The Media SDK's output surfaces cannot be shared directly, but color conversion is required before OpenGL can use them anyway, so a shared surface is created to hold the color-converted result.
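    For intuition about what that conversion computes, here is the per-pixel BT.601 limited-range NV12-to-RGBA arithmetic in plain C++. This is only an illustration of the math; in the sample the conversion runs on the GPU via StretchRect/VideoProcessBlt, and the helper names below are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// CPU illustration of BT.601 limited-range YUV (NV12) -> RGBA.
struct Rgba { uint8_t r, g, b, a; };

static uint8_t Clamp8(float v) {
    // Clamp to [0, 255] and round to the nearest integer.
    return static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, v)) + 0.5f);
}

Rgba Nv12ToRgba(uint8_t y, uint8_t u, uint8_t v) {
    const float c = 1.164f * (static_cast<int>(y) - 16); // scale luma from [16,235]
    const float d = static_cast<float>(u) - 128.0f;      // center chroma
    const float e = static_cast<float>(v) - 128.0f;
    return Rgba{ Clamp8(c + 1.596f * e),                 // R
                 Clamp8(c - 0.392f * d - 0.813f * e),    // G
                 Clamp8(c + 2.017f * d),                 // B
                 255 };                                  // opaque alpha
}
```

Nominal black (Y=16, U=V=128) maps to (0, 0, 0) and nominal white (Y=235, U=V=128) to (255, 255, 255), which is the behavior the hardware blit reproduces per pixel.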

    Initializing the Transcoder

    The transcoder's decode output is fed directly into the encoder; make sure the surfaces are allocated with MFX_MEMTYPE_FROM_DECODE.

    Binding Textures Between D3D and OpenGL

    The texture-binding code is in the CRenderOpenGL::BindTexture function. Make sure WGLEW_NV_DX_interop is defined, then use wglDXOpenDeviceNV, wglDXSetResourceShareHandleNV, and wglDXRegisterObjectNV. This binds the D3D surface to an OpenGL texture. It does not, however, update the texture automatically; calling wglDXLockObjectsNV / wglDXUnlockObjectsNV performs the update (see CRenderOpenGL::UpdateCamTexture and CRenderOpenGL::UpdateDecoderTexture). Once updated, the texture can be used like any other OpenGL texture.

    Caveats for Multi-Monitor Topology Changes

    In theory, putting another window on the external display and controlling it from topology-change detection seems straightforward. In practice, it takes time for the OS to start the transition, complete the display configuration, and show the content. When encoders/decoders/D3D/OpenGL and their quirks are involved, debugging gets very complicated. This sample tries to reuse most of the pipeline across the switch, but in practice it is simpler to shut the whole pipeline down and restart it, because anything can go wrong during the 10+ seconds it takes to add a display, even over an HDMI or VGA connection.

    Future Work

    The sample code in this white paper targets D3D9 and does not support a D3D11 implementation. It is not yet clear what the most efficient way is to convert the color space from NV12 to RGBA without StretchRect or VideoProcessBlt. The white paper and code will be updated once a D3D11 implementation is released.

    Acknowledgements

    Thanks to Petter Larsson, Michel Jeronimo, Thomas Eaton, and Piotr Bialecki for their contributions to this article.

     

    Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and other countries.
    * Other names and brands may be claimed as the property of others.

Copyright © 2013 Intel Corporation. All rights reserved.

  • Android* Tutorial: How to Write a Multithreaded Application with Intel® Threading Building Blocks


    We recently published the “Windows* 8 Tutorial: How to Write a Multithreaded Application for the Windows Store* with Intel® Threading Building Blocks.” In that guide we noted that the parallel compute engine could easily be ported to other mobile or desktop platforms. Android is a good example of such a mobile platform.

    In a recently published stable release of Intel Threading Building Blocks (Intel® TBB), we added experimental support for Android applications, that is, Intel TBB libraries for use in Android applications through the JNI interface. The release can be downloaded from threadingbuildingblocks.org.

    To get started on a Linux* host, unpack the Intel TBB source distribution, source the <unpacked_dir>/build/android_setup.csh script, and build the libraries. Building them is necessary because development releases are distributed in source form only. The file <unpacked_dir>/build/index.android.html contains instructions for configuring the environment and building the library on Linux.

    Assuming gnu make 3.81 is on %PATH% (on a Microsoft Windows* host) or $PATH (on a Linux host), issue the following command in the NDK environment to build the Intel TBB libraries for Android:

    gmake tbb tbbmalloc target=android

    That is all it takes to build the library; now we can move on to building the example with Eclipse*. For the example below I use Android SDK Tools Rev. 21 and Android NDK Rev. 8C on Windows* to illustrate the cross-platform development process.

    We create a project from the default “New Android Application” template. For simplicity we call it “app1”, the same name as in the previous guide:

    Select FullscreenActivity as the Activity. That is all for the template. Note that com.example* is not an acceptable package name for Google Play*, but it will do for our example.

    Next, add a couple of buttons to the main frame. After adding them, the main layout XML (app1/res/layout/activity_fullscreen.xml) looks like this:

    <FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
        xmlns:tools="http://schemas.android.com/tools"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:background="#0099cc"
        tools:context=".FullscreenActivity">

        <TextView
            android:id="@+id/fullscreen_content"
            android:layout_width="match_parent"
            android:layout_height="match_parent"
            android:gravity="center"
            android:keepScreenOn="true"
            android:text="@string/dummy_content"
            android:textColor="#33b5e5"
            android:textSize="50sp"
            android:textStyle="bold" />

        <FrameLayout
            android:layout_width="match_parent"
            android:layout_height="match_parent"
            android:fitsSystemWindows="true">

            <LinearLayout
                android:id="@+id/fullscreen_content_controls"
                style="?buttonBarStyle"
                android:layout_width="match_parent"
                android:layout_height="74dp"
                android:layout_gravity="bottom|center_horizontal"
                android:background="@color/black_overlay"
                android:orientation="horizontal"
                tools:ignore="UselessParent">

                <Button
                    android:id="@+id/dummy_button1"
                    style="?buttonBarButtonStyle"
                    android:layout_width="0dp"
                    android:layout_height="wrap_content"
                    android:layout_weight="1"
                    android:text="@string/dummy_button1"
                    android:onClick="onClickSR" />

                <Button
                    android:id="@+id/dummy_button2"
                    style="?buttonBarButtonStyle"
                    android:layout_width="0dp"
                    android:layout_height="wrap_content"
                    android:layout_weight="1"
                    android:text="@string/dummy_button2"
                    android:onClick="onClickDR" />

            </LinearLayout>
        </FrameLayout>
    </FrameLayout>

    And the strings file (app1/res/values/strings.xml) looks like this:

    <?xml version="1.0" encoding="utf-8"?>
    <resources>
        <string name="app_name">Sample</string>
        <string name="dummy_content">Reduce sample</string>
        <string name="dummy_button1">Simple Reduce</string>
        <string name="dummy_button2">Deterministic Reduce</string>
    </resources>

    Next, add the button handlers:

    // JNI functions
    private native float onClickDRCall();
    private native float onClickSRCall();

    public void onClickDR(View myView) {
        TextView tv = (TextView)(this.findViewById(R.id.fullscreen_content));
        float res = onClickDRCall();
        tv.setText("Result DR is \n" + res);
    }

    public void onClickSR(View myView) {
        TextView tv = (TextView)(this.findViewById(R.id.fullscreen_content));
        float res = onClickSRCall();
        tv.setText("Result SR is \n" + res);
    }

    and the libraries are loaded in FullscreenActivity.java:

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        …
        System.loadLibrary("gnustl_shared");
        System.loadLibrary("tbb");
        System.loadLibrary("jni-engine");
    }

    The "tbb" library should be self-explanatory; the "gnustl_shared" library is needed to support TBB's use of C++ language features. The "jni-engine" library, however, needs more explanation.

    "jni-engine" es una biblioteca de ?++ que implementa un motor de cálculo y exporta las interfaces C para llamadas a JNI de nombre onClickSRCall() y onClickSRCall().

    Following the NDK development rules, create a “jni” folder inside the workspace and three files in it specific to our "jni-engine" library.

    These files are:

    Android.mk (text between angle brackets <> must be replaced with actual values)

    LOCAL_PATH := $(call my-dir)
    TBB_PATH :=

    include $(CLEAR_VARS)
    LOCAL_MODULE    := jni-engine
    LOCAL_SRC_FILES := jni-engine.cpp
    LOCAL_CFLAGS += -DTBB_USE_GCC_BUILTINS -std=c++11 -I$(TBB_PATH)/include
    LOCAL_LDLIBS := -ltbb -L./ -L$(TBB_PATH)/
    include $(BUILD_SHARED_LIBRARY)

    include $(CLEAR_VARS)
    LOCAL_MODULE    := libtbb
    LOCAL_SRC_FILES := libtbb.so
    include $(PREBUILT_SHARED_LIBRARY)

    Application.mk

    APP_ABI := x86
    APP_GNUSTL_FORCE_CPP_FEATURES := exceptions rtti
    APP_STL := gnustl_shared

    jni-engine.cpp:

    #include <jni.h>

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    float SR_Click()
    {
        int N = 10000000;
        float fr = 1.0f/(float)N;
        float sum = tbb::parallel_reduce(
            tbb::blocked_range<int>(0, N), 0.0f,
            [=](const tbb::blocked_range<int>& r, float sum)->float
            {
                for( int i=r.begin(); i!=r.end(); ++i )
                    sum += fr;
                return sum;
            },
            []( float x, float y )->float
            {
                return x+y;
            }
        );
        return sum;
    }

    float DR_Click()
    {
        int N = 10000000;
        float fr = 1.0f/(float)N;
        float sum = tbb::parallel_deterministic_reduce(
            tbb::blocked_range<int>(0, N), 0.0f,
            [=](const tbb::blocked_range<int>& r, float sum)->float
            {
                for( int i=r.begin(); i!=r.end(); ++i )
                    sum += fr;
                return sum;
            },
            []( float x, float y )->float
            {
                return x+y;
            }
        );
        return sum;
    }

    extern "C" JNIEXPORT jfloat JNICALL Java_com_example_app1_FullscreenActivity_onClickDRCall(JNIEnv *env, jobject obj)
    {
        return DR_Click();
    }

    extern "C" JNIEXPORT jfloat JNICALL Java_com_example_app1_FullscreenActivity_onClickSRCall(JNIEnv *env, jobject obj)
    {
        return SR_Click();
    }

    We use the same algorithms as in the previous guide.
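    A side note on why the example exercises both parallel_reduce and parallel_deterministic_reduce: floating-point addition is not associative, so the runtime-chosen chunking of parallel_reduce can change the rounded result from run to run, while the deterministic variant always groups partial sums the same way. A TBB-free C++ sketch of the underlying effect:

```cpp
// Floating-point addition is not associative: the grouping of the same
// three addends changes the rounding, which is why a nondeterministic
// parallel reduction can differ slightly between runs.
double sum_left()  { return (0.1 + 0.2) + 0.3; }  // 0.6000000000000001
double sum_right() { return 0.1 + (0.2 + 0.3); }  // 0.6
```

The difference is only in the last bits, which is exactly the kind of run-to-run variation parallel_deterministic_reduce eliminates at some cost in scheduling flexibility.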

    When we build with the NDK, it compiles the libraries into the corresponding folders, including our libjni-engine.so, libgnustl_shared.so, and libtbb.so.

    Next, return to Eclipse and build app1.apk. The application is now ready to install on the AVD or on real hardware. On the AVD it looks like this:

     

    And we are done! This simple application is complete and should be a good first step toward writing a more complex parallel application for Android. And for those who used the code from the previous guide, the application ported to Android successfully.

    * Other names and brands may be claimed as the property of others.

    WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode


    Overview

    This document demonstrates the best methods to obtain, build and run the WRF model on multiple nodes in symmetric mode on Intel® Xeon Phi™ Coprocessors and Intel® Xeon processors. This document also describes the WRF software configuration and affinity settings to extract the best performance from multiple node symmetric mode operation when using Intel Xeon Phi Coprocessor and an Intel Xeon processor.

    Introduction

    The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, application scientists, etc. Please see http://www.wrf-model.org/index.php for more details on this system. The source code and input files can be downloaded from the NCAR website. The latest version as of this writing is WRFV3.6. In this article, we use the conus2.5km benchmark.

    WRF is used by many private and public organizations across the world for weather and climate prediction.

    WRF has a relatively flat profile on Intel Architecture over many functions for atmospheric dynamics and physics: advection, microphysics, etc.

    Technology (Hardware/Software)

    System: Intel® Xeon® E5-2697 v2 @ 2.7 GHz
    Coprocessor: Intel® Xeon Phi™ coprocessor 7120A @ 1.23 GHz
    Intel® MPI: 4.1.1.036
    Intel® Compiler: composer_xe_2013_sp1.1.106
    Intel® MPSS: 6720-21

    We used the above hardware and software configuration for all of our testing.

    Note: This document assumes that you are running the workload on the aforementioned hardware configuration. If you are using Intel Xeon Phi coprocessor model 7110 cards, please run the following instructions on 8 nodes instead of 4. To run the workload on 4 nodes, you need Intel Xeon Phi coprocessors with 16 GB of memory; since the 7110 model coprocessors have 8 GB, you will need more than 4 coprocessor cards.

    Note: Please use netcdf-3.6.3 and pnetcdf-1.3.0 for I/O.

    Multi Node Symmetric Intel Xeon + Intel Xeon Phi coprocessor (4 Nodes)

    Compile WRF for the Coprocessor

    1. Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
    2. Source the Intel MPI for intel64 and Intel Compiler
      1. source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
      2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
    3. On bash, export the paths for netcdf and pnetcdf built for the coprocessor. Having netcdf and pnetcdf built for the Intel Xeon Phi coprocessor is a prerequisite.
      1. export NETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/netcdf/mic/
      2. export PNETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/pnetcdf/mic/
    4. Turn on Large file IO support
      1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
    5. Cd into the ../WRFV3/ directory and run ./configure and select the option to build with Xeon Phi (MIC architecture) (option 17). On the next prompt for nesting options, hit return for the default, which is 1.
    6. In the configure.wrf that is created, delete -DUSE_NETCDF4_FEATURES and replace -O3 with -O2
    7. Replace "!DEC$ vector always" with "!DEC$ SIMD" on line 7578 of the dyn_em/module_advect_em.F source file.
    8. Run ./compile wrf >& build.mic
    9. This will build a wrf.exe in the ../WRFV3/main folder.
    10. For a new, clean build, run ./clean -a and repeat the process.

    Compile WRF for Intel Xeon processor-based host

    1. Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
    2. Source the latest Intel MPI for intel64 and latest Intel Compiler (as an example below)
      1. source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh
      2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
    3. Export the path for the host netcdf and pnetcdf. Having netcdf and pnetcdf built for the host is a prerequisite.
      1. export NETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/netcdf/xeon/
      2. export PNETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/pnetcdf/xeon/
    4. Turn on Large file IO support
      1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
    5. Cd into the WRFV3 directory created in step #1 and run ./configure and select option 21: "Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc (dm+sm)". On the next prompt for nesting options, hit return for the default, which is 1.
    6. In the configure.wrf that is created, delete -DUSE_NETCDF4_FEATURES and replace -O3 with -O2
    7. Replace "!DEC$ vector always" with "!DEC$ SIMD" on line 7578 of the dyn_em/module_advect_em.F source file.
    8. Run ./compile wrf >& build.snb.avx. This will build a wrf.exe in the ../WRFV3/main folder. (Note: to speed up compiles, set the environment variable J to "-j 4" or however many parallel make tasks you wish to use.)
    9. For a new, clean build, run ./clean -a and repeat the process.

    Run WRF Conus2.5km in Symmetric Mode

    1. Download the CONUS2.5_rundir from http://www2.mmm.ucar.edu/WG2bench/conus_2.5_v3/
    2. Follow the READ-ME.txt to build the wrf input files.
    3. The namelist.input has to be altered. The changes are as follows:
      1. In the &time_control section, edit the values as below:
        1. restart_interval           =360,
        2. io_form_history          =2,
        3. io_form_restart           =2,
        4. io_form_input             =2,
        5. io_form_boundary       =2,
      2. Remove "perturb_input =.true." from the &domains section and replace with "nproc_x =8,"
      3. Add "tile_strategy =2," under the &domains section.
      4. Add "use_baseparam_fr_nml =.true." under the &dynamics section.
    4. Create a new directory called CONUS2.5_rundir. In CONUS2.5_rundir, create two directories, "mic" and "x86", and copy the contents of ../WRFV3/run/ into each of them.
    5. Copy the Intel Xeon Phi coprocessor binary into the CONUS2.5_rundir/mic directory and the Intel Xeon binary into the CONUS2.5_rundir/x86 directory.
    6. Cd into CONUS2.5_rundir and execute WRF as follows on 4 nodes (i.e., 4 coprocessors + 4 Intel Xeon processors) in symmetric mode. To run conus2.5km, you need access to 4 nodes (example shown below).
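    Applied together, the edits from step 3 yield a namelist fragment along these lines (only the entries touched above are shown; the rest of the stock namelist.input is unchanged, and in stock WRF namelists the first section is named &time_control):

```fortran
&time_control
 restart_interval     = 360,
 io_form_history      = 2,
 io_form_restart      = 2,
 io_form_input        = 2,
 io_form_boundary     = 2,
/

&domains
 nproc_x              = 8,
 tile_strategy        = 2,
/

&dynamics
 use_baseparam_fr_nml = .true.,
/
```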

    Script to run on Xeon-Phi + Xeon (symmetric mode)

    The nodes I am using are: node01 node02 node03 node04

    When you request nodes, make sure you set a large stack size: MIC_ULIMIT_STACKSIZE=365536

    
    source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
    source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
    
    export I_MPI_DEVICE=rdssm
    export I_MPI_MIC=1
    export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
    export I_MPI_PIN_MODE=pm
    export I_MPI_PIN_DOMAIN=auto
    
    ./run.symmetric
    
    
    

    Below is the run.symmetric to run the code in symmetric mode:

    run.symmetric script

    
    #!/bin/sh
    mpiexec.hydra
     -host node01 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node02 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node03 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node04 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node01-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    : -host node02-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    : -host node03-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    : -host node04-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    
    
    

    In ../CONUS2.5_rundir/mic, create a wrf.sh file as below.

    Below is the wrf.sh that is needed for the Xeon Phi part of the runscript.

    wrf.sh script

    
    export LD_LIBRARY_PATH=/opt/intel/compiler/2013_sp1.1.106/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH
    /path/to/CONUS2.5_rundir/mic/wrf.exe
    
    
    
    • You will have 80 rsl.error.* and 80 rsl.out.* files in your CONUS2.5_rundir directory.
    • Do a 'tail -f rsl.error.0000' and when you see 'wrf: SUCCESS COMPLETE WRF' your run is successful.
    • After the run, compute the total time taken to simulate with the scripts below. The mean value, which is the Average Time Step (ATS), is the figure of interest for WRF (lower is better).

    Parsing scripts

    gettiming.sh – is the parsing script

    
    grep 'Timing for main' rsl.out.0000 | sed '1d' | head -719 | awk '{print $9}' | awk -f stats.awk
    bash-4.1$ cat stats.awk 
    BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
    {
    i ++
    a += $1
    if ( $1 > max ) max = $1
    if ( $1 < min ) min = $1
    }
    END{ printf("---\n%10s %8d\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }

    Validation

    To validate that a completed WRF run is correct, check the following:

    • It should generate a wrf_output file.
    • diffwrf your_output wrfout_reference > diffout_tag
    • The 'DIGITS' column should have a high value (>3); if so, the WRF run is considered valid.

    Compiler Options

    • -mmic : build an application that runs natively on the Intel® Xeon Phi™ coprocessor
    • -openmp : enable the compiler to generate multi-threaded code based on OpenMP* directives (same as -fopenmp)
    • -O3 : enable aggressive compiler optimizations
    • -opt-streaming-stores always : generate streaming stores
    • -fimf-precision=low : use low precision for higher performance
    • -fimf-domain-exclusion=15 : generate the lowest-precision sequences for single and double precision
    • -opt-streaming-cache-evict=0 : turn off all cache-line evicts

    Conclusion

    This document shows how to compile and run the WRF Conus2.5km workload on an Intel-based cluster containing both Intel Xeon processor-based systems and Intel Xeon Phi coprocessors, and showcases the benefit of a 4-node symmetric-mode run over a homogeneous Intel Xeon processor-based installation.

    About the Author

    Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel SSG).

  • Optimizing Cyberlink PowerDVD 10* Improves Battery Life


    Download PDF

    Authors:
    Manuj Sabharwal and Gael Hofemeier, Software Engineers, Software Solutions Group, Intel Corporation

    Introduction

    Low battery life is one of the most serious issues currently plaguing mobile devices in general and Ultrabook™ devices and tablets specifically. Users have become accustomed to streaming multimedia content to their mobile devices “on-demand” from content servers in the cloud. Because these devices have limited battery capacity, energy efficiency is important. Cyberlink PowerDVD 10* (PowerDVD*) is one of the industry's top players for HD and 3D movie playback, and the app is often pre-bundled by OEMs. In this case study, we show how Intel and Cyberlink collaborated to optimize PowerDVD to give a best-in-class experience on Intel devices.

    First, we’ll talk about the challenges that Cyberlink encountered when adding content streaming features to PowerDVD and the tools and techniques Intel used to improve the power consumption of PowerDVD.

    Then, we’ll discuss the power consumption profile of a Cyberlink PowerDVD streaming media application and its impact on battery life for mobile devices. We also provide an analysis of PowerDVD behavior to identify issues such as decoding on CPU, large numbers of context switches, high interrupt rates, etc., causing increased power consumption. Finally, we’ll provide the data that shows the reduced power consumption following optimization.

    The optimization was a huge success. The Intel team was able to make the following improvements to PowerDVD:

    • Package C0 reduced to 20% from 100% during media playback
    • Reduced SoC power from ~6 W to ~1.8 W, measured using Intel® Power Gadget
    • Intel® VTune™ analyzer reported CPU utilization of 25% down from 70%
    • Windows* Performance Analyzer showed wakeups dropping from every 5 msec to every 10 msec during local or streaming media playback

    Definitions

    Acronym: Definition
    BLA: Battery Life Analyzer
    GPU: Graphics processing unit
    WPA: Windows Performance Analyzer
    DLNA Server: Digital Living Network Alliance Server
    HD: High definition
    SoC: System on Chip
    FPS: Frames per second
    SDK: Software development kit
    SKU: Stock Keeping Unit

     

    The Challenges of Optimizing Battery Life

    PowerDVD offers new features for media organization, streaming, mobile devices, and social media. In addition to functioning as a client, the latest software can turn a device into a DLNA server and stream multimedia content from a PC across a network to other devices. It can also stream content from external content servers. Adding content streaming came with a price, however. New capabilities, such as HD streaming, required running more processes, consuming much more memory and many more CPU cycles. This took a toll on battery life. We needed to answer the following questions:

    1. What is the power consumption from PowerDVD during a 1080p streaming media playback?
    2. Why was PowerDVD able to play back only an hour of media on a fully charged battery?

    After two months and three iterations of analysis and validation, the engineering teams improved battery life by making the following changes:

    • Offloaded graphics to the GPU (using the Intel® Media SDK)
    • Removed the sleep loop calls from two threads
    • Used an overlay to reduce extra memory copies

    The following describes the process and tools that resulted in the optimized version of PowerDVD.

    Optimization of Cyberlink PowerDVD for Power Consumption

    Test System Configuration:

    • 4th generation Intel® Core™ i7 processor
    • Lenovo Yoga* 2 Pro
    • CPU speed: 1.4 GHz (non-turbo frequency)
    • Memory: 4 GB
    • Display: 1920x1080 HD panel
    • Cyberlink PowerDVD 10 and Cyberlink PowerDVD 12

    Validation and analysis showed:

    • Package C0 was pegged at 100% during media playback, while we expected it to be at ~20%.
    • Intel Power Gadget showed SoC power to be ~6 W. It should be ~1.7 W on a 4th generation Intel processor.
    • Intel VTune results revealed no offloading of graphics to the GPU and high CPU utilization of 70% (we expected about 10%)
    • The Windows Performance Analyzer tests revealed frequent wakeups (5 msec). The normal frequency is 10 msec with audio playback.

    First Step - Validation

    To understand and address PowerDVD’s impact on battery life, we used Intel Power Gadget and Battery Life Analyzer (BLA) to validate the application’s SoC power usage. Figure 1 shows the Intel Power Gadget’s UI on a Windows platform.

     


    Figure 1. Intel® Power Gadget UI on Windows* Platform

    As part of our validation of PowerDVD, we used Intel Power Gadget to determine power impacts during playback. Figure 2 shows the power output Intel Power Gadget recorded.

    PowerDVD’s power usage was ~6 W of SoC power during playback. Intel recommends a maximum of ~2.0 W on 4th generation Intel processors (low power processors typically used in Ultrabook devices).


    Figure 2. Processor Power Usage during PowerDVD* Playback

    To gain deeper insight into what other activities were affecting power, we used the Battery Life Analyzer (BLA) tool to understand the impact of media playback on residencies. Understanding residency is important as changing the SoC SKU can impact power.

    BLA is a power management analysis tool developed by Intel to identify issues that impact battery life. BLA helps to identify a wide range of issues during software analysis such as:

    • Software CPU utilization
    • OS timer resolution changes
    • Frequent C state transitions
    • Excessive ISR/DPC activity


    Figure 3 shows package residency during 1080p HD video playback using Cyberlink PowerDVD.


    Figure 3. Package Residency during 1080p HD Video Playback using PowerDVD*

    The package residency includes CPU, Graphics, and UnCore events. More time in package C0 results in higher SoC power. Expected package C0 for Cyberlink PowerDVD 1080p playback is ~20% on a 4th generation U-series processor. As we can see from Figure 3, package residency is far higher than it should be.

    Both Intel Power Gadget and BLA confirmed the higher power usage: ~4 hrs of battery life on a 42 Whr (Watt-hour) battery, with ~6 W SoC + ~3 W display + ~2 W for other components.
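    The battery-life figure follows directly from dividing battery capacity by total platform power. A minimal sketch of that arithmetic, using the capacity and per-component wattages quoted above:

```cpp
// Estimated battery life (hours) = battery capacity (Wh) / total platform power (W).
double battery_hours(double capacity_wh, double platform_watts) {
    return capacity_wh / platform_watts;
}

// Before optimization: 42 Wh / (6 W SoC + 3 W display + 2 W other) ~= 3.8 hours.
// After optimization:  42 Wh / (1.8 W SoC + 3 W display + 2 W other) ~= 6.2 hours.
```

    With the measured numbers, the ~3.8-hour result matches the ~4 hrs observed before optimization, and the same formula predicts roughly 6 hours once SoC power drops to ~1.8 W.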

    Our next step was to analyze the application for power optimization.

    Second Step - Analysis

    For the analysis phase, we used two tools: the Intel® VTune™ analyzer and the Windows* Performance Analyzer (WPA).

    The following tables summarize the results of the analysis, which showed definite room for improvement.

    Table 1. Intel® Power Gadget and BLA Results

    • Actual: Package C0 is pegged at 100% during media playback. Expected: Package C0 should be at 20%.
    • Actual: SoC power measured with Intel® Power Gadget is ~6 W. Expected: ~1.7 W on a 4th generation Intel processor.

     

    Table 2. Intel® VTune™ and WPA Results

    • Intel VTune results: (1) since no codecs ran on the GPU, there was no offloading to graphics; (2) high CPU utilization (70% vs. the expected 10%).
    • Windows Performance Analyzer: frequent wakeups (5 msec) occurred; the expected frequency is 10 msec with audio playback.

     

    The next figures provide a walkthrough of some of the important screenshots from our analysis.

    Intel VTune analyzer was used to validate the PowerDVD application for the presence of spin waits, the presence of hardware acceleration, and hotspots (a micro-architecture issue). Figure 4 shows the steps for collecting the graphics call stacks.


    Figure 4. VTune™ UI for Analyzing DirectX* Pipeline Events

    Figure 5 shows the VTune summary with significant time spent in a spin loop. GPU usage shows no codec activity; most of the time spent on the GPU is for display and other pre-processing algorithms during playback.


    Figure 5. VTune™ Summary showing Spin Loop time

    Digging deeper into the analysis, Intel VTune shows high CPU utilization during media playback, and instances where VSync (the red highlights in Figure 5) and GPU software queue are not occurring every ~33 msec (30 FPS playback). This analysis shows software glitches during media playback.


    Figure 6. VTune™ Summary Report

    Looking at Figure 7, the summary report confirms an inconsistent frame rate over time: for 30 FPS movie playback, the FPS varies between 0 and 60. The chart shows the total number of frames executed in an application at each frame rate. A high number of slow or fast frames signals a performance bottleneck. The goal is to optimize the code to keep the frame rate constant at the target rate, for example 30 or 60 FPS.


    Figure 7. VTune™ analysis of Frame Rates
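    The frame-pacing consistency that Figure 7 visualizes can also be checked numerically: compute the delta between successive frame timestamps and count the ones that miss the ~33 msec budget of 30 FPS playback. A small sketch of that check (the tolerance value is illustrative, not from the paper):

```cpp
#include <cstddef>
#include <vector>

// Count frame intervals that deviate from the target period by more than the
// given tolerance (all times in milliseconds). For 30 FPS playback the target
// period is ~33 ms; a large count signals glitchy, inconsistent pacing.
std::size_t count_missed_frames(const std::vector<double>& timestamps_ms,
                                double target_period_ms, double tolerance_ms) {
    std::size_t missed = 0;
    for (std::size_t i = 1; i < timestamps_ms.size(); ++i) {
        double delta = timestamps_ms[i] - timestamps_ms[i - 1];
        if (delta > target_period_ms + tolerance_ms ||
            delta < target_period_ms - tolerance_ms)
            ++missed;
    }
    return missed;
}
```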

    Next, we used the Windows Performance Analyzer (WPA) tool to analyze the application for wakeup activities, interrupts, and context switches. Figure 8 shows H264 decode running on the CPU using Intel® SSE instructions. It is more efficient to offload this work to the GPU than to run it on the CPU.


    Figure 8. WPA Analysis of Wakeup Activities, Interrupts, and Context Switches

    WPA also shows wakeup activities from PowerDVD during playback. Figure 9 displays the two PowerDVD threads, both running at 10 msec. The two threads are not coalesced, which causes the overall system to wake up at a 5 msec timer interval. Figure 10 shows the call stack with sleep loop Win32* API being called every 10 msec interval.


    Figure 9. WPA thread analysis


    Figure 10. WPA call stack with sleep loop analysis

    Table 3 reveals significant reduction in package residency after optimization.

    Table 3. Validating Package Residency after Optimization

    C-state Counter      Average (%) Before Optimization      Average (%) After Optimization
    Package C0-C1        100%                                 20.18%
    Package C2           0%                                   8.29%
    Package C3           0%                                   0.19%
    Package C6           0%                                   1.91%
    Package C7           0%                                   69.43%

     

    Optimization Results/Validation

    The following tables show the “before” and “after” results:

    Table 4. Intel® Power Gadget and BLA: Before and After1

    • Before: Package C0 is pegged at 100% during media playback. After: Package C0 is reduced to 20%.
    • Before: SoC power is ~6 W. After: SoC power reduced to ~1.8 W on the test system.

     

    Table 5. Intel® VTune™ Amplifier and WPA Results: Before and After1

    • Intel® VTune™ Amplifier. Before: no offloading to graphics (the app ran no codecs on the GPU); high CPU utilization (70% vs. the expected 10%). After: video codecs now reported; CPU utilization decreased by ~25%.
    • Windows Performance Analyzer. Before: frequent wakeups (5 msec); expected frequency is 10 msec with audio playback. After: sleep thread removed, reducing wakeups by 2x (from every 5 msec to every 10 msec).
    • Battery Life Analyzer. Before: package residency 100%. After: package residency ~20%.

     

    1 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

    We optimized by:

    1. Offloading to Intel® HD Graphics using Intel Media SDK
    2. Optimizing Win32 API calls that cause periodic wakeup on CPU
    3. Using an overlay to save one memory copy per frame

    The first task was to use the Intel Media SDK to offload decode to graphics, which provides better performance per watt by using Intel HD Graphics. The pseudo code in Figure 11 provides an example of a simple use of the Intel Media SDK to offload a stream of frames to graphics.


    Figure 11. Intel® Media SDK code snippet – offloading a frame to graphics.

    Once we offloaded to graphics using the Intel Media SDK, we ran PowerDVD and measured the results using Intel VTune Amplifier. Compared to Figure 5 where we didn’t see any codec usage, we now see Video Enhancement in the summary (Figure 12).


    Figure 12. Intel® VTune™ Amplifier Summary result

    Examining other Intel VTune graphics views, we verified that, with the Intel Media SDK in use, frames were decoded on the GPU rather than on the CPU. Figure 13 shows a batch of frames being decoded after ~20 msec on the GPU. Offloading the decode work to the GPU helped to reduce CPU utilization by ~25% on the test system.


    Figure 13. Frame decoding after ~20 msec on the GPU

    To verify our optimization of offloading graphics, we ran Intel Power Gadget. Compared to the baseline result shown in Figure 2, we saw ~2 W of power saving just by performing graphics offloading (Figure 14).


    Figure 14. Power Savings resulting from Graphics Offload

    We made some good progress, but ~4 W was not low enough. As stated earlier, the goal for streaming media 1080p playback is ~1.7 W of SoC/package power.

    The next step was to find other CPU-based optimizations. Initial analysis showed sleep loop calls from two non-coalesced threads waking the CPU every 5 msec. CyberLink engineers needed to remove the sleep loops from their application. However, this was one of the most difficult changes since it required modifying the structure of the application. Figure 15 shows the wakeup interval increasing to 10 msec after the periodic activities were removed.


    Figure 15. Optimized Cyberlink PowerDVD* after removing periodic activities
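    The structural change can be illustrated in portable terms: instead of two threads each sleeping on their own 10 msec schedule (offset by 5 msec, so the package wakes every 5 msec), both workers block on one shared tick and wake together. This is only a sketch of the coalescing idea in standard C++; the actual PowerDVD fix was made with Win32 primitives inside CyberLink's code.

```cpp
#include <condition_variable>
#include <mutex>

// One shared tick replaces the per-thread sleep loops: a single timer thread
// calls tick() once per period, and every worker blocks in wait_next() until
// that tick fires, so all workers wake at the same instant.
class CoalescedTicker {
public:
    void tick() {
        { std::lock_guard<std::mutex> lk(m_); ++generation_; }
        cv_.notify_all();
    }
    // Blocks until the generation advances past the last one this worker saw,
    // then returns the new generation number.
    unsigned wait_next(unsigned last_seen) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return generation_ != last_seen; });
        return generation_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned generation_ = 0;
};
```

    Because the predicate re-checks the generation counter, a worker that arrives after a tick has already fired returns immediately instead of oversleeping.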

    Removing the periodic activities yielded a ~800 mW saving. With the optimizations so far, SoC power for 1080p HD streaming playback went from ~6 W to ~2.8 W, but additional optimizations were still needed to reach the 1.7 W goal seen in best-in-class applications.


    Figure 16. Power Optimizations down to ~2.8 W

    The next step was to reduce extra memory copies using an overlay. With the overlay, the overall package power was reduced by ~400 mW. Figure 17 shows power was reduced to ~1.8 W from ~6 W.


    Figure 17. Cyberlink PowerDVD* at final Power Consumption (1.8 W)
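    The ~400 mW saved by the overlay is consistent with the memory bandwidth an extra per-frame copy consumes: each 1080p ARGB frame is 1920 × 1080 × 4 bytes ≈ 8.3 MB, so one additional copy per frame at 30 FPS moves roughly 250 MB/s through memory. A quick sketch of that arithmetic:

```cpp
#include <cstdint>

// Bytes moved per second by one extra frame copy:
// width * height * bytes-per-pixel * frames-per-second.
std::uint64_t copy_bandwidth_bytes(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t bytes_per_pixel,
                                   std::uint64_t fps) {
    return width * height * bytes_per_pixel * fps;
}

// 1080p ARGB at 30 FPS: 1920 * 1080 * 4 * 30 = 248,832,000 bytes/s (~250 MB/s).
```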

    With that, the most important optimization goals had been achieved, and Intel and Cyberlink engineers deemed the project a success.

    Close collaboration between Cyberlink and Intel helped to complete the optimization in two months with full validation. The final product with all optimizations was released to OEMs six months from when we started.

    Conclusion

    The Intel and PowerDVD engineers used several tools including Intel VTune and Microsoft Windows Performance Analyzer to reach the optimum low-power playback. The collaboration included knowledge sharing on tools with weekly analysis/meetings to meet the battery life goal before the release deadline.

    Several iterations were completed before the team was satisfied with the results (PowerDVD consumes ~1.8 W, down from ~6 W). Intel and Cyberlink engineers faced the challenge of keeping the quality of playback the same before and after optimization. Each optimization required a validation and analysis process before it could pass the Cyberlink team’s internal quality tests. Thus, every change was tracked, and user experience metrics (power and performance) were evaluated.

    The following optimizations were found to work the best for achieving the optimization goals, but as noted above, these were accomplished over several iterations:

    • Offloading graphics to the GPU (using the Intel Media SDK)
    • Removing sleep loop calls from two threads
    • Using an overlay to reduce extra memory copies

    The combined efforts between the Intel and CyberLink PowerDVD team resulted in optimizing their streaming media playback application to reach the best-in-class goal.

    About the Authors

    Manuj Sabharwal is a Software Engineer in the Software Solutions Group at Intel. Manuj has been involved in exploring power enhancement opportunities for idle and active software workloads. He has significant research experience in power efficiency and has delivered tutorials and technical sessions in the industry. He also works on enabling client platforms through software optimization techniques.

     

     

    Gael Hofemeier has worked for Intel since 2000 as an Application Engineer in the Software Solutions Group at Intel. Gael’s current focus is in Technology Evangelism for Business Client Apps and Technologies.

     

     

     

    References

    1. Windows Performance Analyzer: http://www.microsoft.com/en-us/download/details.aspx?id=30652
    2. Battery Life Analyzer: http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=19351
    3. Intel® Power Gadget: https://software.intel.com/en-us/articles/intel-power-gadget-20
    4. Cyberlink PowerDVD: http://www.cyberlink.com/products/powerdvd-ultra/features_en_US.html?&r=1
    5. Intel® Media SDK: https://software.intel.com/en-us/vcsource/tools/media-sdk-clients

    Relevant Intel Links

    Energy Efficient Software Development: https://software.intel.com/en-us/energy-efficient-software
    Power Analysis Guide for Windows*: https://software.intel.com/en-us/articles/power-analysis-guide-for-windows
    Windows 8* Software Power Optimization: https://software.intel.com/en-us/articles/windows-8-software-power-optimization
    Intel processor numbers: http://www.intel.com/products/processor_number/

     

    Notices and Disclaimers

    http://legal.intel.com/Marketing/notices+and+disclaimers.htm

     

    Intel, the Intel logo, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and other countries.
    *Other names and brands may be claimed as the property of others
    Copyright© 2014 Intel Corporation. All rights reserved.

    Optimizing an Augmented Reality Pipeline using Intel® IPP Asynchronous

    Using Intel® GPUs to Optimize the Performance and Power Consumption of Total Immersion's D'Fusion* Augmented Reality Pipeline

    Michael Jeronimo, Intel (michael.jeronimo@intel.com)
    Pascal Mobuchon, Total Immersion (pascal.mobuchon@t-immersion.com)

    Executive Summary

    This case study details the optimization of Total Immersion's D'Fusion* Augmented Reality pipeline, using the Intel® Integrated Performance Primitives (Intel® IPP) Asynchronous to execute key parts of the pipeline on the GPU. The paper explains the Total Immersion pipeline, the goals and strategy for the optimization, the results achieved, and the lessons learned.

    Intel IPP Asynchronous

    The Intel IPP Asynchronous (Intel IPP-A) library—available for Windows* 7, Windows 8, Linux*, and Android*—is a companion to the traditional CPU-based Intel IPP library. This library extends the successful Intel IPP acceleration library model to the GPU, providing a set of GPU-accelerated primitive functions that can be used to build visual computing algorithms. Intel IPP-A is a simple host-callable C API consisting of a set of functions that operate on matrix data, the basic data type used to represent image and video data. The functions provided by Intel IPP-A are low-, medium-, and high-level building blocks for video analysis algorithms. The library includes low-level functions such as basic math and Boolean logic operations; mid-level functions like filtering operations, morphological operations, edge detection algorithms; and high level functions including HAAR classification, optical flow, and Harris and Fast9 feature detection.

    When a client application calls a function in the Intel IPP-A API, the library loads and executes the corresponding GPU kernel. The application does not explicitly manage GPU kernels; at application run time the library loads the correct highly optimized kernels for the specific processor. The Intel IPP-A library supports third generation Intel® Core™ processors (code named Ivy Bridge) and higher, and Intel® Atom™ processors, like the Bay Trail SoC, that include Intel® Processor Graphics. Allowing the library implementation to manage kernel selection, loading, dispatch, and synchronization simplifies the task of using the GPU for visual computing functionality. The Intel IPP-A library also includes a CPU-optimized implementation for fallback on legacy systems or application-level CPU/GPU balancing.

    Like the traditional CPU-based Intel IPP library, when code is implemented using the Intel IPP-A API, the code does not need to be updated to take advantage of the additional resources provided by future Intel processors. For example, when a processor providing additional GPU execution units (EUs) is released, the existing Intel IPP-A kernels can automatically scale performance, taking advantage of the additional EUs. Or, if a future Intel processor provides new hardware acceleration blocks for video analysis operations, a new Intel IPP-A library implementation will use the accelerators while keeping the Intel IPP-A interface constant. Developers can simply recompile and relink with the new library implementation. Intel IPP-A provides a convenient abstraction layer for GPU-based visual computing that provides automatic performance scaling across processor generations.

    It is easy to integrate Intel IPP-A code with the existing CPU-based code, so developers can take an incremental approach to optimization. They can identify key pixel processing hotspots and target those for offload to the GPU. But they must take care when offloading to the GPU so as not to introduce data transfer overhead. Instead, developers should create an algorithm pipeline that allows significant work to be performed on the GPU before the results are required by the CPU code, minimizing inter-processor data transfer.

    Benefits of GPU Offload

    Offloading time consuming pixel processing operations to the GPU can result in significant power and performance benefits. In particular, the GPU:

    • Has a lower operating frequency– the GPU runs at a lower clock frequency than the CPU, consuming less power for the same computation.
    • Has more hardware threads– the GPU has significantly more hardware threads, providing better performance for operations where performance scales with an increasing number of threads, such as the visual processing operations in Intel IPP-A.
    • Has the potential to run more complex algorithms– due to the better power and performance provided by the GPU, developers can use more computationally intensive algorithms to achieve improved results and/or process more pixels than they could otherwise using the CPU only.
    • Can free the CPU for other tasks – by moving processing to the GPU, developers can reduce CPU utilization, freeing up the CPU processing resources for other tasks.

    The benefits offered by Intel IPP-A programming on the GPU can be applied in a variety of market segments to help ISVs reach specific goals. For example, in Digital Security and Surveillance (DSS), the primary metric is the number of channels of input video that a platform can process (the "channel density"), while in Augmented Reality, decreasing the time to acquire targets to track and increasing the number of objects that can be simultaneously tracked are key.

    Augmented Reality

    Augmented Reality (AR) enhances a user's perception with computer-generated input such as sound, video, or graphics data. AR merges the real world with computer-generated elements, either meta information or virtual objects, resulting in a composite that presents more information and capabilities than an un-augmented experience. AR applications usually overlay information about the environment and objects on a real-time video stream, making the virtual objects interactive. AR technology can be applied to many market segments including retail, medicine, entertainment, and education. For example:

    • Mobile augmented reality systems combine a mobile platform's camera, GPS, and compass sensors with its Internet connectivity to pinpoint the user's location, detect device orientation, and provide information about the scene, overlaying content on the screen.
    • Virtual dressing rooms allow customers to virtually try on clothes, shoes, jewelry, or watches, either in-store or at home, automatically sizing the item to the user in a 3D view on the device.
    • Construction managers can view and monitor work in progress, in real time, through Augmented Reality markers placed throughout a site.

    Total Immersion

    Total Immersion is an augmented reality company, founded in 1998, based in Suresnes, France. Through its patented D'Fusion software solution, Total Immersion combines the virtual world and the real world by integrating real-time interactive 3D graphics into a live video stream. The company maintains offices in Europe, North America, and Asia and supports the world's largest augmented reality partner network, with over 130 solution providers.

    Today, mobile technology is everywhere. Total Immersion (TI) is developing compelling AR experiences for tablets and phones. Intel, recognizing Total Immersion as a leader in Augmented Reality, initiated a collaboration with TI to optimize the D'Fusion software for Intel processors, including GPU offloading. They aimed to improve the AR experience when running on Intel products that power mobile platforms, such as the Intel Atom SoC Z3680.

    Optimization Goals and Strategy

    Augmented Reality applications rely on computer vision algorithms to detect, recognize, and track objects in input video streams. While a large part of the AR processing doesn't deal directly with pixels, the pixel processing required is a computationally intensive, data parallel task appropriate for GPU offload. Intel and Total Immersion planned to offload the pixel processing to the GPU, using Intel IPP-A, so that the pipeline handled the pixel processing—from capture to rendering—and only the metadata about the pixel information would be returned to the CPU as input for higher-level AR operations. By offloading all of the pixel processing to the GPU, the application achieved better performance with less power consumption, making D'Fusion-based applications run efficiently on mobile platforms while conserving battery life.

    The D'Fusion AR Pipeline

    The core of the D'Fusion software is a processing pipeline that consists of the following stages:

    The D'Fusion AR Pipeline
    Figure 1 – The D'Fusion AR Pipeline

    • Capture – The first step in the pipeline is capturing input video from the camera. The video can be captured in a variety of formats, such as RGB24, NV12, or YUY2, depending on the specific camera. Frames are captured at the full frame rate, typically 30 FPS, and passed to the next stage in the pipeline. Each captured frame has an associated time stamp that specifies the precise time of capture.
    • Preparation – Computer vision algorithms usually operate on grayscale images, and the TI AR pipeline is no exception. The first step after Capture is to convert the color format of the captured image to grayscale. Next, because computer vision algorithms often do not require the full frame size to operate effectively, input frames can be downscaled to a lower resolution. The reduced number of pixels to process saves computational resources. Then, depending on the orientation of the image, mirroring may also be required. Finally, in addition to the grayscale image required by the computer vision processing, a color image must also be sent down the pipeline so that the scene can eventually be rendered along with the AR-generated information. This requires a second color format conversion from the camera input format, like NV12, to a format appropriate for display, such as ARGB. All of the operations in the Preparation stage are pixel-intensive operations appropriate to target for offload to the GPU.
    • Detection – Once a frame is prepared, the pipeline applies a feature detection algorithm, either Harris or Fast9, to the reduced-size grayscale input image. The algorithm returns a list of feature points detected in the image. The feature detection algorithm can be controlled by various parameters, including the threshold level. These parameters continuously adjust the feature point detection to return an optimal number of feature points and to adapt to changing ambient conditions, such as the brightness of the input scene. Non-maximal suppression is applied to the feature point calculation to get a better distribution of feature points, avoiding local "clustering." Both feature detection and non-maximal suppression are targeted for offload to the GPU.
    • Recognition – Once the features are generated by the Detection stage of the pipeline, the FERNS algorithm is used to match the features against a database of known objects. Instead of operating on the feature points directly, the FERNS algorithm uses a patch, a square region of pixels centered on the feature point. The patches are taken from a filtered version of the frame that has been convolved with a smoothing filter. Each of the patches is associated with a timestamp of the frame from which they were derived. Since the processing of each patch by the FERNS algorithm is an independent operation, it is easily parallelizable and a candidate for GPU offload. The frame smoothing can also happen on the GPU.
    • Tracking - Many image processing algorithms operate on multi-resolution images called image pyramids, where each level of the pyramid is a further downscaled version of the original input frame. The Tracking stage of the pipeline provides the image pyramid to the Lucas-Kanade optical flow algorithm to track the objects in the scene. Both the image pyramid generation and the optical flow are good candidates to run on the GPU.
    • Rendering – Rendering is the final stage of the pipeline. In this stage, the AR results are combined with the color video and rendered on the output, in this case using OpenGL*. The application renders the color video as an OpenGL texture and uses OpenGL functions to draw the graphics output, based on the video analysis, on top of the video frame.
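    The Preparation stage's core pixel operations are simple and data-parallel, which is exactly what makes them good GPU candidates. As a CPU-side illustration of what the offloaded kernels compute, here is a sketch of BT.601 grayscale conversion and a 2× box downscale; the exact coefficients and scaling filter used by D'Fusion are not specified in this paper, so these are common defaults:

```cpp
#include <cstdint>
#include <vector>

// BT.601 integer luma approximation: Y = (299*R + 587*G + 114*B) / 1000.
inline std::uint8_t rgb_to_gray(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint8_t>((299u * r + 587u * g + 114u * b) / 1000u);
}

// 2x downscale of a grayscale image by averaging each 2x2 block.
// Width and height are assumed even for brevity.
std::vector<std::uint8_t> downscale2x(const std::vector<std::uint8_t>& src,
                                      int width, int height) {
    std::vector<std::uint8_t> dst((width / 2) * (height / 2));
    for (int y = 0; y < height / 2; ++y)
        for (int x = 0; x < width / 2; ++x) {
            unsigned sum = src[(2 * y) * width + 2 * x]
                         + src[(2 * y) * width + 2 * x + 1]
                         + src[(2 * y + 1) * width + 2 * x]
                         + src[(2 * y + 1) * width + 2 * x + 1];
            dst[y * (width / 2) + x] = static_cast<std::uint8_t>(sum / 4);
        }
    return dst;
}
```

    Every output pixel depends only on a small, fixed neighborhood of input pixels, so the work parallelizes across the GPU's many hardware threads with no cross-pixel synchronization.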

    Optimization Strategy

    Initial profiling of the TI application confirmed that the pixel processing operations mentioned in the prior section were the primary bottlenecks in the AR pipeline. However, other bottlenecks existed, including a CPU-based copy of the color image data to an OpenGL texture.

    To simplify collaboration, Intel delivered the optimizations to Total Immersion as a library to be incorporated into the TI software. The library, dubbed PixelFlow, encapsulates the pixel processing required by the TI AR pipeline and is implemented using Intel IPP-A library. Intel and Total Immersion decided that PixelFlow would target the Preparation, Detection, and Rendering bottlenecks first, while also providing information required for the Recognition and Tracking stages. Moving the first stages of the pipeline to the GPU would be a milestone towards the eventual goal of handling all pixel processing operations on the GPU.

    To implement the Preparation and Detection stages, the operations performed by PixelFlow on the GPU included color format conversion, resizing, mirroring, Fast9 and Harris feature point detection, and non-maximal suppression. To support the Recognition and Tracking stages, the library provides a smoothed frame to be used by the FERNS algorithm and an image pyramid of the input to be used by the optical flow algorithm. Finally, PixelFlow also provides a GPU texture of the color input frame suitable for use in OpenGL.

    Implementation

    The PixelFlow framework was conceived as a flexible framework for analysis of multiple video input streams derived from a single video capture source. The PixelFlow pipeline runs on the GPU, operating asynchronously with the CPU. Each video capture source serves frames to one or more logical video streams, where the color format and resolution of each stream is independently configurable. Each stream runs on a separate thread and can use Intel IPP-A to analyze the video frames, producing meta information. The following diagram shows the general design of the framework.

    Design of the PixelFlow Framework
    Figure 2 – The Design of the PixelFlow Framework

    The TI Augmented Reality pipeline is comprised of two video streams: the Analytics Stream and the Graphics Stream. The Analytics Stream processes a grayscale input frame, performing feature detection with non-maximal suppression, image pyramid generation, and smoothing of the input frame. The Graphics Stream converts the color camera input to ARGB for display. In both cases, the resulting data is placed in a queue for access by the CPU-based code. The following diagram shows the basic organization of the pipeline and the functions targeted for offload to the GPU.

    PixelFlow implementation for the TI AR pipeline
    Figure 3 – The PixelFlow implementation for the TI AR pipeline

    The information on each queue has a timestamp of the original frame capture, allowing the CPU software to correlate each frame with the corresponding data produced by the analytics stream.
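    That correlation can be sketched as a simple keyed lookup: each queue entry carries its capture timestamp, and the CPU code pairs an analytics result with the color frame that shares that timestamp. A minimal illustration; the real PixelFlow types are not shown in this paper, so the structs here are hypothetical:

```cpp
#include <cstdint>
#include <map>

// Hypothetical per-frame analytics output, keyed by capture timestamp.
struct AnalyticsResult {
    int feature_count = 0;   // e.g., number of detected feature points
};

// Pairs a color frame (identified by its capture timestamp) with the
// analytics result produced from the same captured frame, if present.
class FrameCorrelator {
public:
    void add_result(std::uint64_t timestamp, AnalyticsResult r) {
        results_[timestamp] = r;
    }
    // Returns true and fills 'out' when analytics for the frame are available.
    bool match(std::uint64_t frame_timestamp, AnalyticsResult& out) {
        auto it = results_.find(frame_timestamp);
        if (it == results_.end()) return false;
        out = it->second;
        results_.erase(it);   // each result is consumed once
        return true;
    }
private:
    std::map<std::uint64_t, AnalyticsResult> results_;
};
```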

    Implementation Challenges

    Several challenges were encountered during the implementation of the PixelFlow framework:

    • Separate kernels for frame preparation – The initial PixelFlow implementation used separate Intel IPP-A functions for resizing, color format conversion, and mirroring. Because these functions did not support multi-channel images, preparing the ARGB output for the Analytics Stream required one Intel IPP-A function to split the input image into separate channels, followed by calls to other functions to resize and mirror each channel individually before combining them back into an interleaved format. To minimize kernel overhead and simplify programming, the Intel IPP-A team developed a single hppiAdvancedResize function that combines the resize, color format conversion, and mirroring into a single GPU kernel, allowing the frame to be prepared for the Analytics Stream or the Graphics Stream with a single function call.
    • Direct-to-GPU-memory video input – The intention of the PixelFlow pipeline was to have the entire pipeline, from video capture to graphics rendering, on the GPU. However, the graphics drivers for the targeted platforms did not yet support direct-to-GPU-memory video capture. Instead, each frame was captured to system memory and then copied to GPU memory. To minimize the impact of the copy, the PixelFlow implementation took advantage of the Fast Copy feature supported by the Intel IPP-A library. Using a 4K-aligned system memory buffer, the GPU kernel is able to use shared physical memory to access the data, thus avoiding a copy.
    • NMS, weights, and orientation for Fast9 – The results produced by the Intel IPP-A Fast9 algorithm did not initially match the CPU-based function that it replaced. An investigation revealed that the TI code was also applying non-maximal suppression to the results of the Fast9 calculation. In addition, the TI code also calculated a weight and orientation value for each detected feature point. The team updated the Intel IPP-A Fast9 function to add NMS as an option and to return the weight and orientation values.
    • OpenGL surface sharing and DX9 surface import/export – OpenGL is used for rendering in this pipeline. The video frame is rendered as an OpenGL texture, and other virtual elements are added by calling OpenGL drawing primitives. In the Frame Preparation stage of the pipeline, Intel IPP-A's AdvancedResize function converts the video frame from the input format (NV12, YUY2, etc.) to ARGB. A CPU-based copy of this image into an OpenGL texture was one of the top bottlenecks. The Intel IPP-A team added an import/export capability so that a DX9 surface handle could be extracted from an existing Intel IPP-A matrix, or an Intel IPP-A matrix could be created from an existing DX9 surface. This enabled the use of the OpenGL surface sharing capability in the Intel OpenGL driver. With this functionality, a DX9 surface could be shared with OpenGL as a texture, avoiding the CPU-based copy and keeping the data on the GPU.
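    The non-maximal suppression that the TI code applied to the Fast9 results, and that the updated Intel IPP-A function now performs as an option, can be sketched as a 3x3 strict-maximum filter over a corner-score map. This is an illustrative CPU version under assumed conventions (score > 0 marks a detection), not the IPP-A kernel:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// 3x3 non-maximal suppression over a corner-score map: a pixel survives
// only if its score strictly exceeds every neighbor's score. Border pixels
// are skipped for simplicity.
std::vector<std::pair<int, int>> nms3x3(const std::vector<int>& score,
                                        int w, int h) {
    std::vector<std::pair<int, int>> keep;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int s = score[y * w + x];
            if (s <= 0) continue;            // not a detection
            bool isMax = true;
            for (int dy = -1; dy <= 1 && isMax; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    if (score[(y + dy) * w + (x + dx)] >= s) {
                        isMax = false;       // a neighbor ties or beats it
                        break;
                    }
                }
            if (isMax) keep.emplace_back(x, y);
        }
    return keep;
}
```

    Fusing this filter into the detection kernel, as the updated Fast9 function does, avoids a second pass over the score map.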

    Additional Non-PixelFlow Optimizations

    After implementing the optimizations described in the previous section, a trace performed in the VTune™ analyzer showed that when tracking nine targets, with input video and analytics resolution at 1024x768, several hotspots remained in the computer vision module:

    Remaining Hotspots – Ivy Bridge
    Function                                                     % of CV   Description
    dcvGroupFernsRecognizer::RecognizeAll                        18.95     Using x87 floating point. Should try using SIMD floating point instructions such as Intel® SSE3 or Intel® AVX.
    dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim       16.76     General code generation issues. Expect these would be improved by using the Intel® compiler.
    dcvPolynomSolver::solve_deg3                                 10.20     General code generation issues. Expect these would be improved by using the Intel compiler.

     

    After rebuilding the computer vision module with the Intel® compiler with Intel® AVX instructions enabled, these hotspots were eliminated.

    Remaining Hotspots – Ivy Bridge
    Function                                                     % of CV   Description
    dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim       33.56     Image pyramid generation.
    dcvCorrelationsDetectorLite::ComputerIntegralImage           16.83     Integral image computation.
    dcvKtlOptim::__CalcOpticalFlowPyrLK_Optim_ResizeNN_levels    13.0      LK optical flow.

    The second trace uncovered an instance in the code that still used the old CPU-based image pyramid calculation; it was updated to use the image pyramid calculated by PixelFlow. The remaining hotspots were operations not yet included in PixelFlow: integral image computation and LK optical flow. The team will target these functions first when extending the PixelFlow functionality.
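    Integral image computation, one of those remaining hotspots, is a natural next candidate for a GPU kernel. A minimal CPU reference of the summed-area table it produces (illustrative only, not the dcv implementation):

```cpp
#include <cassert>
#include <vector>

// Integral image (summed-area table): ii(x,y) holds the sum of all pixels
// at or above-left of (x,y), so any rectangular sum can later be read with
// four lookups. Computed in a single row-major pass.
std::vector<long long> integralImage(const std::vector<int>& img,
                                     int w, int h) {
    std::vector<long long> ii(static_cast<std::size_t>(w) * h, 0);
    for (int y = 0; y < h; ++y) {
        long long rowSum = 0;                 // running sum along this row
        for (int x = 0; x < w; ++x) {
            rowSum += img[y * w + x];
            ii[y * w + x] = rowSum + (y ? ii[(y - 1) * w + x] : 0);
        }
    }
    return ii;
}
```

    The row-wise and column-wise accumulations are separable, which is what makes the operation amenable to a GPU prefix-sum formulation.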

    Results – Performance and Power

    The resulting AR pipeline offloads its initial stages to the GPU and provides data for subsequent stages of AR processing. To analyze the PixelFlow implementation of the AR pipeline, the team used a test application from Total Immersion, the "AR Player." This configurable test application allows the user to set operating parameters like the number of targets to track, the video capture resolution and format, the analytics processing resolution, and so on. In addition to the power and performance statistics, the team was interested in the feasibility and impact of increasing the analytics resolution. For the pre-optimized CPU-based flow, the TI AR software used a 320x240 analytics resolution. The additional performance provided by the GPU offload allowed us to experiment with higher resolutions and the resulting impact on responsiveness and quality. The team tested PixelFlow implementation on Ivy Bridge and Bay Trail platforms.

    Results: Ivy Bridge

    We tested the software on the following Ivy Bridge platform:

    Ivy Bridge Platform Details
    Item                    Description
    Computer                HP EliteBook* 8470p
    Processor               Intel® Core™ i7 processor 3720QM
    Clock Speed             2.6 GHz (3.6 GHz Max Turbo Frequency)
    # Cores, Threads        4, 8
    L1, L2, L3 Cache        256 KB, 1 MB, 6 MB
    RAM                     8 GB
    Graphics                Intel® HD Graphics 4000
    # of Execution Units    16
    Graphics Driver         Igdumdim64, 9.18.10.3257, Win7 64-bit
    OS                      Windows* 7 Pro (Build 7601), 64-bit, SP1

    The first test scenario tracked nine targets simultaneously, with both a video capture resolution and an analytics resolution of 640x480.

    Test Scenario #1

    Metric                  Value
    Number of targets       9
    Capture resolution      640x480
    Analytics resolution    640x480

    Performance Results – Ivy Bridge, Test Scenario #1
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              60              60
    Analytics FPS              30              30
    Tracking FPS               30              30
    Frame Preprocessing        0.399           0.088            -0.311            -77.83
    Tracking                   1.412           1.355            -0.057            -4.03
      Construct Pyramid        0.548           0.025            -0.523            -95.44
    Recognition                3.322           1.477            -1.846            -55.55
      Compute Interest Points  1.358           0.035            -1.323            -97.43
      Smooth Image             0.693           0.001            -0.692            -99.89

    The second test scenario also tracks nine targets, but increases the video capture resolution to 1024x768 with an analytics resolution of 640x480.

    Test Scenario #2

    Metric                  Value
    Number of targets       9
    Capture resolution      1024x768
    Analytics resolution    640x480

    Performance Results – Ivy Bridge, Test Scenario #2
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              60              60
    Analytics FPS              30              30
    Tracking FPS               30              30
    Frame Preprocessing        0.391           0.094            -0.297            -75.99
    Tracking                   1.355           0.900            -0.455            -33.58
      Construct Pyramid        0.532           0.024            -0.508            -95.58
    Recognition                2.844           0.917            -1.927            -67.77
      Compute Interest Points  1.225           0.027            -1.199            -97.83
      Smooth Image             0.708           0.001            -0.707            -99.93

    Results: Bay Trail

    Similar tests were run on the following Bay Trail platform:

    Bay Trail Platform Details
    Item                    Description
    Computer                Intel® Atom™ (Bay Trail) Tablet PR1.1B
    Processor               Intel® Atom™ processor Z3770
    Clock Speed             1.46 GHz
    # Cores, Threads        4, 4
    L1, L2 Cache            128 KB, 2048 KB
    RAM                     2 GB
    Graphics                Intel® HD Graphics
    # of Execution Units    4
    Graphics Driver         Igdumdim32.dll, 10.18.10.3341, Win8 32-bit
    OS                      Windows* 8 (Build 9431), 32-bit

    The test scenario is slightly different than the first test scenario run on the Ivy Bridge platform due to the different resolutions supported by the camera on the Bay Trail system.

    Test Scenario #1
    Metric                  Value
    Number of targets       9
    Capture resolution      640x360
    Analytics resolution    640x360

    Performance Results – Bay Trail, Test Scenario #1
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              55              35
    Analytics FPS              30              30
    Tracking FPS               15              15
    Frame Preprocessing        5.215           0.385            -4.830            -92.62
    Tracking                   15.484          10.411           -5.074            -32.77
      Construct Pyramid        6.081           0.122            -5.985            -97.99
    Recognition                28.389          15.590           -12.799           -45.09
      Compute Interest Points  9.235           0.365            -8.870            -96.04
      Smooth Image             7.236           0.011            -7.225            -99.85

    The second scenario for Bay Trail tests the video capture resolution at 1280x720, while the analytics resolution remains at 640x360.

    Test Scenario #2
    Metric                  Value
    Number of targets       9
    Capture resolution      1280x720
    Analytics resolution    640x360

    Performance Results – Bay Trail, Test Scenario #2
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              12              30
    Analytics FPS              30              25
    Tracking FPS               8               12
    Frame Preprocessing        4.865           0.408            -4.458            -91.62
    Tracking                   16.158          9.718            -6.440            -39.86
      Construct Pyramid        5.995           0.122            -5.872            -97.96
    Recognition                32.398          14.532           -17.865           -55.14
      Compute Interest Points  8.864           0.376            -8.488            -95.76
      Smooth Image             7.337           0.013            -7.324            -99.82

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    For more complete information about performance and benchmark results, visit Performance Test Disclosure

    Power Analysis

    After implementing GPU offload using the PixelFlow pipeline, investigations into the power savings achieved by the GPU offload yielded unexpected results; instead of achieving a significant power savings from offloading the processing to the GPU from the CPU, the power consumption of the PixelFlow implementation was on par with the CPU-only implementation. The following GPUView trace shows why this occurred.

    GPUView trace of the processing for a single frame
    Figure 4 – GPUView trace of the processing for a single frame

    The application dispatched the work to the GPU in separate chunks: CPU setup, GPU operation, wait for completion, CPU setup, GPU operation, wait for completion, etc. This approach impacted power consumption, causing the processor package to be continually active and not allowing the processor to enter deeper sleep states.

    Instead, the pipeline should consolidate GPU operations and maximize CPU/GPU concurrency. The following diagram illustrates the ideal situation to achieve maximum power savings: GPU operations consolidated into a single block, executing concurrently with CPU threads and leaving a period of inactivity that allows the processor package to achieve deeper sleep states.
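    A minimal sketch of the consolidated pattern, with std::async standing in for a real GPU command queue: all GPU stages are submitted as one batch, CPU work runs concurrently, and there is a single synchronization point per frame rather than one per kernel. The function and its workload are illustrative only.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// One consolidated submit, concurrent CPU work, one wait: the shape that
// leaves the package an idle window for deeper sleep states (Figure 5).
int processFrame(const std::vector<int>& pixels) {
    // Submit the whole "GPU" batch once (all per-pixel stages fused).
    auto gpuBatch = std::async(std::launch::async, [&] {
        return std::accumulate(pixels.begin(), pixels.end(), 0);
    });
    // CPU-side work proceeds concurrently instead of idling between kernels.
    int cpuSide = static_cast<int>(pixels.size());
    // Single synchronization point for the frame.
    return gpuBatch.get() + cpuSide;
}
```

    The contrast with Figure 4 is the number of submit/wait pairs: one per frame here, versus one per kernel in the original dispatch pattern.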

    Ideal pattern to maximize power savings
    Figure 5 – Ideal pattern to maximize power savings

    Conclusion

    Moving the key pixel processing bottlenecks of the Total Immersion AR pipeline to the GPU resulted in performance gains on Intel processors, allowing the application to use a larger input frame size for video analysis, find targets faster, track more targets, and track them more smoothly. We expect similar gains can be achieved for similar video analysis pipelines.

    While achieving performance benefits using Intel IPP-A is fairly straightforward, achieving power benefits requires careful design of the processing pipeline. The best design consolidates the GPU operations and maximizes CPU/GPU concurrency to allow the processor to reach deeper sleep states. GPU-capable diagnostic and profiling tools, like GPUView and the Intel VTune analyzer, are essential because they can help identify power-related problems in the pipeline. Consider using these tools during development to verify the power efficiency of a pipeline and avoid having to re-architect it to address power-related issues.

    The PixelFlow pipeline offloaded several of the pixel processing bottlenecks in the TI pipeline. Work remains to move additional operations to the GPU such as integral image, optical flow, FERNS, etc. Once these operations are included in PixelFlow, all of the pixel processing will occur on the GPU with these operations returning metadata to the CPU as input for higher-level operations. The success of the current PixelFlow implementation, which uses IPP-A-based GPU offload, indicates that further gains are possible with additional offloading of pixel processing operations.

    Finally, power and performance optimization extends beyond the vision processing algorithms to other areas such as video input, codecs, and graphics output. Intel IPP-A allows DX9-based surface sharing with related Intel technologies such as the Intel® Media SDK for codecs and the OpenGL graphics driver. Understanding the optimization opportunities with these related technologies is also important, as it allows developers to create entire GPU-based processing pipelines.

    Author Biographies

    Michael Jeronimo is a software architect and applications engineer in Intel's Software and Solutions Division (SSG), focused on helping customers to accelerate computer vision workloads using the GPU.

    Pascal Mobuchon is the VP of Engineering at Total Immersion.

    References

    Item                                     Location
    Total Immersion web site                 http://www.t-immersion.com/
    Total Immersion Wikipedia page           http://en.wikipedia.org/wiki/Total_Immersion_(augmented_reality)
    Augmented Reality – Wikipedia page       http://en.wikipedia.org/wiki/Augmented_reality
    Intel® VTune™ Amplifier XE               https://software.intel.com/en-us/intel-vtune-amplifier-xe
    Intel® Graphics Performance Analyzers    https://software.intel.com/en-us/vcsource/tools/intel-gpa
    GPUView                                  http://msdn.microsoft.com/en-us/library/windows/hardware/ff570133(v=vs.85).aspx
    Intel® IPP-A web site                    https://software.intel.com/en-us/intel-ipp-preview
    NAMD* for Intel® Xeon Phi™ Coprocessor


    Purpose

    This code recipe describes how to get, build, and use the NAMD* Scalable Molecular Dynamics code for the Intel® Xeon Phi™ Coprocessor.

    Introduction

    NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++* parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 200,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER*, CHARMM*, and X-PLOR*.

    NAMD is distributed free of charge with source code. Users can build NAMD or download binaries for a wide variety of platforms. Tutorials show how to use NAMD and VMD* for biomolecular modeling. Find out more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

    Code Support for Intel® Xeon Phi™ Coprocessor

    NAMD 2.10 with Intel® Xeon Phi™ Coprocessor support is expected to be released in early to mid 2014. With support for the Intel® Many Integrated Core (MIC) architecture, Intel expects to push NAMD performance and scalability to higher limits on Intel® architecture. The code remains in development, but it can be compiled from nightly source code builds. Pre-built binaries are not available at this time.

    NAMD code for Intel Xeon Phi Coprocessor continues to evolve. Intel developers are diligently working on known issues in order to achieve the project goals of performance and scalability on Intel Xeon Phi Coprocessor.

    Code Access

    To get access to the NAMD for Intel Xeon Phi Coprocessor code:

    1. Download the original code at http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD and select Source Code under Version Nightly Build.

    Build Directions

    To build NAMD, you also need the following libraries:

    1. TCL (http://www.tcl.tk/);
    2. FFTW (http://www.fftw.org/): use the fftw2 version (you can also try the fftw3 version):

      ./configure --enable-float --enable-type-prefix --enable-static --prefix=<fftwBaseDirHere> --disable-fortran CC=icc

      make CFLAGS=" -O2 " clean install

    3. Charm++ (http://charm.cs.uiuc.edu/software/) can be built in two ways:
      1. Infiniband (verbs-linux-x86_64-smp-iccstatic) version:

        ./build charm++ verbs-linux-x86_64 smp iccstatic --with-production

        Note: check where your ibverbs library is; if it is not in the /opt/ofed/lib64 or /usr/local/ofed/lib64 directories, you need to change the [charmDir]/src/arch/verbs-linux-x86_64/conv-mach.sh file
      2. MPI (mpi-linux-x86_64-smp-mpicxx) version: ./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE -DMPICH_IGNORE_CXX_SEEK

    NAMD build instructions for the Intel Xeon Phi Coprocessor version are essentially the same as compiling standard NAMD, with the following changes:

    Note: You can obtain Intel® Composer XE Version 13 from https://registrationcenter.intel.com/regcenter/register.aspx, or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy.

    Note: using make's "-j" option will speed up compilation significantly.

    Running NAMD Workloads on Intel Xeon Phi Coprocessor

    Running NAMD on Intel Xeon Phi Coprocessor is much like running the standard NAMD code, with the following exceptions:

    1. Source the Intel® compiler environment so its libraries can be found.
    2. Set up the following extra environment variables:
      export KMP_AFFINITY=granularity=fine,compact
      export MIC_ENV_PREFIX=MIC
      export MIC_OMP_NUM_THREADS=240
      export MIC_KMP_AFFINITY=granularity=fine,balanced
    3. To execute NAMD, on the namd2 command line, add +devices xxx, where xxx is a list of devices (e.g. "0,1" for the first two devices on a node). If the user omits the "+devices xxx" option at runtime, the application will attempt to use all available devices on a given node.
    4. The number of PEs per node must be greater than the number of MICs in the node, and there must be at least one patch per PE.

      Host threads and PEs are specified with the command-line options traditionally used.

    Some examples of running NAMD workloads:

    1. Ibverbs:

      $BIN_DIR/charmrun ++nodelist $NODEFILE +p $NUM_PROCS ++ppn $PPN $BIN_DIR/wrapper.sh $BIN_DIR/$BIN $WORKLOAD_DIR/$CONFIG_FILE +pemap 1-$PPN +commap 0 "+devices 0,1"

      PPN – for best results use one less than the number of available cores, for example PPN=23 if you have 24 cores per node (or PPN=47 if you use hyper-threading5)

      NUM_PROCS = $PPN * $NODECOUNT

    2. MPI:

      mpiexec.hydra -perhost 1 -n $NODECOUNT $BIN_DIR/$BIN +ppn $PPN $WORKLOAD_DIR/$CONFIG_FILE +pemap 1-$PPN +commap 0 +devices 0,1

      Notes: "+pemap 1-$PPN +commap 0" more effective than "+setcpuaffinity"

    Performance Testing2,3

    The following results show performance on a single node and cluster.

    Single-node Performance Testing

    Note: Single-node performance uses the multi-core build of NAMD (no network layers are used).

    Single-node Platform Configurations4

    The following hardware and software were used for the above recipe and performance testing.

    Server Configuration (Intel® Xeon® processor E5 V2 family):

    • 2-socket/24 cores:
    • Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading5
    • Operating System: Red Hat Enterprise Linux* 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
    • Memory: 64GB
    • Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading5, Memory: 15872 MB
    • Intel® Many-core Platform Software Stack Version 2.1.6720-15
    • Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

    Server Configuration (Intel® Xeon® processor E5 family):

    • 2-socket/16 cores:
    • Processor: Intel® Xeon® processor E5 @ 2.60GHz (8 cores) with Intel® Hyper-Threading5
    • Operating System: Red Hat Enterprise Linux* 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
    • Memory: 64GB
    • Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading5, Memory: 15872 MB
    • Intel® Many-core Platform Software Stack Version 2.1.6720-13
    • Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

    NAMD

    • NAMD: Linux-x64_64-icc
    • Charm++: multicore-linux64-icc
    • Configuration parameters were modified to achieve optimal performance4

    Cluster Performance Testing2,3

    Note: Cluster results use Infiniband*.

    Cluster Platform Configuration4

    The following hardware and software were used for the above recipe and performance testing.

    Endeavor Cluster Configuration:

    • 2-socket/24 cores:
    • Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading5
    • Operating System: Red Hat Enterprise Linux* 2.6.32-358.6.2.el6.x86_64.crt1 #4 SMP Fri May 17 15:33:33 MDT 2013 x86_64 x86_64 x86_64 GNU/Linux
    • Memory: 64GB
    • Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading5, Memory: 15872 MB
    • Intel® Many-core Platform Software Stack Version 2.1.6720-16
    • Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

    NAMD

    • NAMD: Linux-x64_64-icc
    • Charm++: verbs-linux-x86_64-smp-iccstatic
    • Configuration parameters were modified to achieve optimal performance4

    DISCLAIMERS:

    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

    A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

    Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

    The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

    Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

    Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

    2. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    3. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

    Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

    Notice revision #20110804

    4. For more information go to http://www.intel.com/performance

    5. Available on select Intel® processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

    Intel, the Intel logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the US and/or other countries.

    *Other names and brands may be claimed as the property of others.

    Copyright © 2014 Intel Corporation. All rights reserved.

    Contest Winner Integrates Augmented Reality with an Encyclopedia into ARPedia*


    By Garret Romaine

    The interfaces of the future are already being tried out in a lab or on a test screen somewhere, waiting to become fully developed examples and demos. The winners of the Creative User Experience category in Phase 2 of the Intel® Perceptual Computing Challenge, announced at CES 2014, are proof of that. Zhongqian Su and a group of graduate students used the Intel® Perceptual Computing SDK and the Creative Interactive Gesture Camera Kit to integrate augmented reality (AR) and an ordinary encyclopedia into ARPedia*, a blend of augmented reality and Wikipedia*. ARPedia is a new kind of knowledge base that users navigate with hand gestures rather than keystrokes.

    The six-person team from Beijing University of Technology developed the application in two months using a variety of tools. They used Maya* 3D to create the 3D models, Unity* 3D to render the 3D scenes and develop the application logic, and the Intel Perceptual Computing SDK Unity 3D plug-in (included in the SDK) to tie all the components together. The demo combines 3D models and animated video to create a new way of interacting with a virtual world. The application encourages users to explore an unknown world digitally by moving their bodies and using gesture, voice, and touch, and the team's future work looks very promising.

    All About Dinosaurs


    With AR visual effects, ARPedia is effectively a game for authoring and experiencing stories. As users grow accustomed to seamless interactive experiences, many technologies are being used to create interactivity, even when that interactivity is quite simple. In a PC game, a mouse and keyboard or a touch screen are the usual ways to interact with an application. ARPedia uses none of these. In an AR application, a natural user interface is essential. ARPedia users control the action with bare-hand gestures and facial movement, thanks to the Creative Senz3D* camera. Many engaging gestures enhance the game experience, such as grabbing, waving, pointing, lifting, and pressing. These gestures make players the true controllers of the game and of the virtual dinosaur world.


    Figure 1: ARPedia* combines augmented reality and a wiki-based encyclopedia, letting users navigate the interface with hand gestures.

    Team leader Zhongqian Su had built an educational application around a small Tyrannosaurus rex character for a previous assignment, so he cast the well-known dinosaur as the star of the ARPedia application. Players reach out with hand movements to pick up a small dinosaur image and place it at various points on the screen. Depending on where the dinosaur is placed, users learn about the creature's diet, habits, and other characteristics.

    Figure 2: Users interact with the small T. rex to learn about fossils, paleontology, and geology.

    According to team member Liang Zhang, the team had already written an AR application for the education market using the dinosaur 3D model before the contest. Although they had an application as a starting point, the contest requirements demanded substantial rework. For example, their earlier camera code used a different 3D technology, so they had to rewrite it (see Figure 3) to work with the newer Creative Interactive Gesture Camera Kit. That also meant getting up to speed quickly on the Intel Perceptual Computing SDK.

    
    // Revised "open hand" test: the hand counts as open only when at least
    // two distinct fingertips are detected, so a wrist or fist misread as a
    // single point no longer registers as an open hand.
    bool isHandOpen(PXCMGesture.GeoNode[] data)
    	{
    		int n = 1;
    		for(int i=1;i<6;i++)	// nodes 1..5 are the finger geo-nodes
    		{
    			if(data[i].body==PXCMGesture.GeoNode.Label.LABEL_ANY)
    				continue;	// this finger was not detected
    			bool got = false;
    			for(int j=0;j<i;j++)	// ignore nodes duplicating an earlier position
    			{
    				if(data[j].body==PXCMGesture.GeoNode.Label.LABEL_ANY)
    					continue;
    				Vector3 dif = new Vector3();
    				dif.x = data[j].positionWorld.x-data[i].positionWorld.x;
    				dif.y = data[j].positionWorld.y-data[i].positionWorld.y;
    				dif.z = data[j].positionWorld.z-data[i].positionWorld.z;
    				if(dif.magnitude<1e-5)
    					got = true;	// same world position as an earlier node
    			}
    			if(got)
    				continue;
    			n++;	// one more distinct finger
    		}
    		return (n>2);	// open: at least two distinct fingers detected
    	}

    Figure 3: ARPedia* rewrote its camera code to work with the Creative Interactive Gesture Camera.

    Fortunately, Zhang said, his company is eager to invest time and effort in learning new technologies. "We have developed many applications," he said. "We keep an eye on new hardware and software improvements our company can use. Before this contest, we used natural body interaction with the Microsoft Kinect*. When we discovered this camera, we were excited and wanted to try it. We also saw the contest as a chance to improve our technical skills, so why not give it a try?"

    Smart Decisions Up Front


    With the contest's limited time frame, the team had to ramp up on the new technology quickly. Zhang spent two weeks learning the Intel Perceptual Computing SDK, and the team then designed in as many of the interaction techniques he had identified as possible.

    Meanwhile, the writers began drafting stories and feasible scenarios the team could code. They met to discuss the options, with Zhang pointing out strengths and weaknesses based on his knowledge of the SDK. He understood the technical details well enough to make informed decisions, so the team confidently chose what he described as "...the best story and the most interesting, best-fitting interactions."

    One of the most important early decisions, Zhang said, was to keep players fully engaged in the game. For example, in the early hatching stage, the player takes on a god-like role, performing actions such as creating the earth, making it rain, and raising the sun. Players have to set up and learn a number of gestures.

    In another stage, the player has to catch the dinosaur. Zhang set up the system so the user holds a piece of meat in hand, and the dinosaur comes forward to snatch it (Figure 4). The action lets players interact with the dinosaur and builds engagement. "We want to keep players immersed in the virtual world," he said.

    Figure 4: Feeding the baby dinosaur immerses users and creates interaction.

    Carrying those plans forward, however, took more work. The demo includes many new gestures for users to learn. "When I talked to people playing the game at the Intel booth at CES, I found they weren't quite sure how to play, because each stage has various levels of gestures," Zhang said. "We found they weren't as intuitive as we had imagined, which convinced us that when we add new interaction methods, the design has to be more intuitive. We will definitely keep that in mind for our next project."

    The ARPedia team introduced two primary gestures. One is "both hands open"; the other is "one hand open, fingers spread." The both-hands-open gesture, used to launch the application, was simple and straightforward to code. Coding the second gesture took more work.

    Figure 5: The team worked to ensure the camera would not detect the wrist as a point on the palm.

    "The initial open-hand pose was not very accurate," Zhang explained. "Sometimes the wrist was detected as a point on the palm, or a fist was detected as a finger, and the system would then recognize the hand as open, which was wrong. So we designed a new open-hand pose that requires at least two extended fingers before the hand is recognized as open." The team then added text prompts on screen to guide users (Figure 5).

    Intel® Perceptual Computing SDK


    The ARPedia team used the 2013 Intel Perceptual Computing SDK, citing in particular its ease of use for camera calibration, application debugging, speech recognition support, facial analysis, close-range depth tracking, and AR. The SDK lets multiple perceptual computing applications share the input device and displays a privacy notification when the RGB and depth cameras are on. It makes it easy to add more usage modes, add new input hardware, support new game engines and custom algorithms, and support new programming languages.

    The utilities include C/C++ components such as PXCUPipeline (C) and UtilPipeline (C++), used mainly to set up and manage pipeline sessions. The framework and session ports include ports for Unity 3D, Processing, other frameworks and game engines, and programming languages such as C# and Java*. The SDK interfaces include the core framework APIs, I/O classes, and algorithms. A perceptual computing application interacts with the SDK through these three main functional blocks.

    "The Intel [Perceptual Computing] SDK helped a lot," Zhang said. "We had no problems developing this application. We were able to get a great deal of work done in a very short time."

    Intel® RealSense™ Technology

    Developers around the world are learning about Intel® RealSense™ technology. At CES 2014, Intel announced Intel RealSense technology as the new name and brand for what was previously Intel® Perceptual Computing technology. The intuitive new user interface builds on capabilities such as gesture and voice that Intel brought to market in 2013. With Intel RealSense technology, users gain additional new capabilities, including scanning, modifying, printing, and sharing in 3D, plus major advances in AR interfaces. With these capabilities, users can naturally manipulate and play with scanned 3D objects in games and applications using advanced hand and finger sensing.

    Zhang can now see firsthand how other developers are working with AR technology. At CES 2014, he studied demos from around the world. While each demo was unique and aimed at different goals, he saw the advantages that rapidly advancing 3D camera technology brings. "Including gesture detection in the SDK is very helpful. People can still use the camera in different ways, but the SDK already gives them a broad foundation. I suggest developers use this technology for their own projects and find the capabilities to fully develop their ideas."

    With advanced hand and finger tracking, developers can let users control devices through complex 3D manipulation, with greater precision and simpler commands. With natural-language speech technology and accurate facial recognition, devices can better understand what their users want.

    Depth sensing enables more immersive gaming, and accurate hand and finger tracking brings better control to any virtual adventure. Games become more lifelike and more fun. With AR and finger-sensing technology, developers can blend the real and virtual worlds.

    Zhang believes the upcoming Intel RealSense 3D camera will be a great fit for the scenarios he knows well. "From what I know, it will be even better — more accurate, more capable, more intuitive. We're really looking forward to it. It will also add 3D face tracking and other great features. It's the first 3D camera for laptops used as a motion-sensing device, though it's different from Kinect, and it delivers the same capabilities as an integrated 3D camera. I think the new Intel camera is a better device for manufacturers to integrate into laptops and tablets, and as a tiny user-interface device it has great portability advantages. With this camera, we'll surely build many great projects in the future."

    Maya 3D


    The ARPedia team used Autodesk Maya 3D modeling software to continue developing its signature small, lifelike model — the baby Tyrannosaurus rex. Once the right model was built, complete with lifelike motion and fine-grained color, the rest of the application fell into place.

    Maya is the gold standard for creating 3D computer animation, modeling, simulation, and rendering. It is a highly extensible production platform that supports next-generation display technology, accelerates modeling workflows, and handles complex data. The team was already familiar with Maya and could easily update and integrate it with their existing graphics. Zhang says his team spent extra time on graphics development: "We spent nearly a month designing and revising the graphics to polish everything and improve the interactions."

    Unity 3D


    The team chose the Unity engine as the foundation of the application. Unity is a powerful rendering engine for creating interactive 3D and 2D content. The Unity tool set is both an application builder and an application development tool, known for being intuitive, easy to use, and supporting multi-platform development. For first-time and experienced users alike, it is an ideal solution for building simulations, casual and large-scale games, and applications for the web, mobile, or consoles.

    Zhang says the choice of Unity was never in doubt. "We build all of our AR applications with Unity, including this one. We know the tool, and we trust it to do everything we need." He could quickly and easily import meshes from Maya as native 3D application files, saving both time and effort.

    Today's Information, Tomorrow's Games


    ARPedia opens many promising directions for future work. For starters, the team sees big opportunities in games and other applications that build on its Intel Perceptual Computing Challenge results. "We've talked with many interested organizations," Zhang says. "They want us to polish this version further. Hopefully we can find a place in the market. We'll add more dinosaurs to the game and bring in everything that is known about them to attract more users. It's a fun environment, and we'll design more interesting interactions around it."

    "We're also planning a pet game in which users raise their own virtual dinosaurs. They can build a personal collection and show it off to each other. We'll make it a networked game, too, and we're adding more scenes in the new version."

    The team's win came as a surprise, because they weren't familiar with the work of other development teams around the world. "We didn't know what anyone else was doing," Zhang says. "We focused on our own work and had few chances to see what others were building." Now they know where they stand and are ready for the next challenge. "The contest gave us the motivation to prove ourselves, and the chance to compare notes and communicate with other developers. We're grateful to Intel for the opportunity. We now know much more about the leading technology worldwide, and we'll be more confident building augmented reality applications in the future."

    Resources


    Intel® Developer Zone
    Intel® Perceptual Computing Challenge
    Intel® RealSense™ Technology
    Intel® Perceptual Computing SDK
    Check the compatibility guide in the Perceptual Computing documentation to make sure your existing applications will work with the Intel® RealSense™ 3D camera.
    Intel® Perceptual Computing SDK 2013 R7 Release Notes
    Maya* software overview
    Unity*
    Unity*

  • ARPedia
  • Creative Senz3D
  • Autodesk Maya
  • Unity 3D
  • Gesture Recognition
  • RealSense
  • Développeurs
  • Microsoft Windows* 8
  • Windows*
  • Intermédiaire
  • Intel® Perceptual Computing SDK
  • Informatique perceptuelle
  • Expérience et conception utilisateur
  • PC portable
  • Tablette
  • URL

  • What's new? Beta Update 1 - Intel® VTune™ Amplifier XE 2015 Beta


    Intel® VTune™ Amplifier XE 2015 Beta

    A profiler for serial and parallel performance analysis. Overview, training, support.

    New for Beta Update 1! 

    • Ability to resolve symbols for modules with build-id and separate files with debug information
    • NMI Watchdog timer automatically disabled during data collection
    • Support for importing *.perf files with the event-based sampling data collected by the Linux Perf tool
    • Option to limit the call stack size (in system pages) and minimize collection overhead for custom hardware event-based sampling analysis results
    • Option to display verbose collection and finalization messages in the Collection Log window
    • Support for importing csv files with instant counters collected out of the VTune Amplifier with the external collector
    • Ability to specify x64 code from a 32-bit process in the JIT API
    • Remote system configuration options provided in the Project Properties: Target tab to specify a path to the VTune Amplifier installed on a remote machine and a path to a remote temporary directory used for storing performance results
    • Optimized workflow for the remote data collection in the Attach to Process mode providing an option in the Project Properties: Target tab to easily get a list of processes running on the remote Linux* system and select the required process for analysis
    • Updated Event Reference for Intel microarchitectures code name Ivy Bridge, Ivy Town, and Haswell
    • Updated product toolbar providing quick access to the product documentation with the new Help button and to the Import dialog box (standalone only) with the Import Result button
    • Ubuntu 14.04 support

    Resources

    Contents

     

    File: vtune_amplifier_xe_2015_beta_update1.tar.gz

    Installer for Intel® VTune™ Amplifier XE for Linux* 2015 Beta Update 1

    File: VTune_Amplifier_XE_2015_beta_update1_setup.exe

    Installer for Intel® VTune™ Amplifier XE for Windows* 2015 Beta Update 1

    File: vtune_amplifier_xe_2015_beta_update1.dmg

    Installer for Intel® VTune™ Amplifier XE for OS X* 2015 Beta Update 1

    * Other names and brands may be claimed as the property of others.

    Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

  • performance profiling
  • Beta tools
  • Développeurs
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • .NET*
  • C#
  • C/C++
  • Fortran
  • Java*
  • Avancé
  • Débutant
  • Intermédiaire
  • Amplificateur Intel® VTune™ XE
  • OpenCL*
  • OpenMP*
  • URL
  • How to use 2MB huge pages when allocating memory for offload input/output variables


    The offload compilation model that the Intel compiler provides for the Intel® Xeon Phi™ coprocessor lets programmers add pragmas or new keywords to host code so that designated code sections run on the coprocessor. In explicit copy mode, when using an offload pragma/directive to run a code section on the coprocessor, the programmer must also list the pointer and array variables to be copied between the host and the card. During compilation, the Intel compiler automatically inserts the code that performs the data transfers between host and coprocessor.

     

    By default, the offload runtime uses 4 KB pages when allocating coprocessor memory for offload input/output variables. When the offloaded code needs a large amount of input/output memory, the allocation can trigger many page faults, and users will observe long allocation latency. To address this, the Intel compiler provides an environment variable, MIC_USE_2MB_BUFFERS, which lets the runtime use 2 MB pages when allocating memory for offload input/output variables in certain cases. The variable is described below:

     

    MIC_USE_2MB_BUFFERS

     

    When allocating space for a pointer variable whose runtime memory footprint exceeds the value of this environment variable, 2 MB pages are used.

     

    The environment variable is set as:

     

    an integer value with a suffix of B|K|M|G|T, where

     

          B = bytes

          K = kilobytes

          M = megabytes

          G = gigabytes

          T = terabytes

    For example:

    MIC_USE_2MB_BUFFERS=64K

    With this setting, the offload runtime uses 2 MB huge pages when allocating coprocessor memory for all input/output variables larger than 64 KB.

    For more information about developing Xeon Phi coprocessor applications with the Intel compiler, see the Intel compiler user and reference guide.

     

     

  • Intel Parallel Composer XE
  • Développeurs
  • Étudiants
  • Linux*
  • Serveur
  • C/C++
  • Fortran
  • Intermédiaire
  • Intel® Composer XE
  • Outils de développement
  • URL
  • Zone des thèmes: 

    IDZone

    Implementing Gesture Sequences in Unity* 3D with TouchScript


    Download PDF

    By Lynn Thompson

    When configuring touch targets to control other elements of a scene, it’s important to minimize the screen space that the controlling elements occupy. In this way, you can devote more of the Ultrabook™ device’s viewable screen area to displaying visual action and less to user interaction. One means of accomplishing this is to configure the touch targets to handle multiple gesture combinations, eliminating the need for more touch targets on the screen. An example is the continual tapping of a graphical user interface (GUI) widget, causing a turret to rotate while firing, instead of a dedicated GUI widget for firing and another for rotating the turret (or another asset in the Unity* 3D scene).

    This article shows you how to configure a scene using touch targets to control the first person controller (FPC). Initially, you’ll configure the touch targets for basic FPC position and rotation; then, augment them for additional functionality. This additional functionality is achieved through existing GUI widgets and does not require adding geometry. The resulting scene will demonstrate Unity 3D running on Windows* 8 as a viable platform for handling multiple gestures used in various sequences.

    Configure the Unity* 3D Scene

    I begin setting up the scene by importing an FBX terrain asset with raised elevation and trees, which I had exported from Autodesk 3ds Max*. I then place an FPC at the center of the terrain.

    I set the depth of the scene’s main camera, a child of the FPC, to −1. I create a dedicated GUI widget camera with an orthographic projection, a width of 1, and a height of 0.5 as well as Don’t Clear flags. I then create a GUIWidget layer and set it as the GUI widget camera’s culling mask.

    Next, I place basic GUI widgets for FPC manipulation in the scene in view of the dedicated orthogonal camera. For the left hand, I configure a sphere for each finger. The left little sphere moves the FPC left, the left ring sphere moves it forward, the left middle sphere moves it right, and the left index sphere moves it backward. The left thumb sphere makes the FPC jump and launches spherical projectiles at an angle of 30 degrees clockwise.

    For the right-hand GUI widget, I create a cube (made square through the orthogonal projection). I configure this cube with a Pan Gesture and tie it to the MouseLook.cs script. This widget delivers functionality similar to that of an Ultrabook touch pad.

    I place these GUI widgets out of view of the main camera and set their layer to GUIWidget. Figure 1 shows the scene at runtime, with these GUI widgets in use to launch projectiles and manipulate the position of the FPC.


    Figure 1. FPC scene with terrain and launched spherical projectiles

    The projectiles launched from the FPC pass through the trees in the scene. To remedy this, I would need to configure each tree with a mesh or box collider. Another issue with this scene is that the forward velocity is slow if I use the touch pad to have the FPC look down while pressing the ring finger to move the FPC forward. To resolve this issue, I limit the “look-down” angle when the “move forward” button is pressed.

    Multiple Taps

    The base scene contains an FPC that fires projectiles at a specified angle off center (see Figure 1). The default for this off-center angle is 30 degrees clockwise when looking down on the FPC.

    I configure the scene so that multiple taps, initiated less than a specified time apart, alter the angle at which the projectiles are launched and then launch a projectile. I can configure this behavior to increase the angle change with the number of taps in the sequence by manipulating float variables in the left-thumb jump script. These float variables control the firing angle and keep track of the time since the last projectile was launched:

    	private float timeSinceFire = 0.0f;
    	private float firingAngle = 30.0f;

    I then configure the Update loop in the left-thumb jump script to decrement the firing angle if the jump sphere tap gestures are less than one-half second apart. The firing angle is reset to 30 degrees if the taps are greater than one-half second apart or the firing angle has decremented to 0 degrees. The code is as follows:

    		timeSinceFire += Time.deltaTime;
    
    			if(timeSinceFire <= 0.5f)
    			{
    				firingAngle += -1.0f;
    
    			}
    			else
    			{
    				firingAngle = 30.0f;
    			}
    
    			timeSinceFire = 0.0f;
    
    			if(firingAngle <= 0)
    			{
    				firingAngle = 30;
    			}
    
    
    			projectileSpawnRotation = Quaternion.AngleAxis(firingAngle,CH.transform.up);

    This code produces a strafing effect, where continuous tapping launches projectiles while decrementing the angle at which they're launched (see Figure 2). This effect is something you can let a user customize or make available under specific conditions in a simulation or game.


    Figure 2. Continuous taps rotate the heading of the launched projectile.

    Scale Followed by Pan

    I configured the square in the lower right of Figure 1 to function similarly to a touch pad on a keyboard. Panning over the square doesn’t move the square but instead rotates the scene’s main camera up, down, left, and right by feeding the FPC’s MouseLook script. Similarly, a scaling gesture (similar to a pinch on other platforms) that the square receives doesn’t scale the square but instead alters the main camera’s field of view (FOV), allowing a user to zoom in and out on what the main camera is currently looking at (see Figure 3). I will configure a Pan Gesture initiated shortly after a Scale Gesture to return the FOV to the default of 60 degrees.

    I configure this function by setting a Boolean variable—panned—and a float variable to hold the time since the last Scale Gesture:

    	private float timeSinceScale;
    	private float timeSincePan;
    	private bool panned;

    I set the timeSinceScale variable to 0.0f when a Scale Gesture is initiated and set the panned variable to True when a Pan Gesture is initiated. The FOV of the scene’s main camera is adjusted in the Update loop as follows in the script attached to the touch pad cube:

    		timeSinceScale += Time.deltaTime;
    		timeSincePan += Time.deltaTime;
    
    		if(panned && timeSinceScale >= 0.5f && timeSincePan >= 0.5f)
    		{
    			fieldOfView += 5.0f;
    			panned = false;
    		}
    
    		if(panned && timeSinceScale <= 0.5f)
    		{
    			fieldOfView = 60.0f;
    			panned = false;
    		}
    
    		Camera.main.fieldOfView = fieldOfView;

    Following are the onScale and onPan functions. Note the timeSincePan float variable, which prevents the FOV from being constantly increased when the touch pad is in use for the camera:

    	private void onPanStateChanged(object sender, GestureStateChangeEventArgs e)
        {
            switch (e.State)
            {
                case Gesture.GestureState.Began:
                case Gesture.GestureState.Changed:
                    var target = sender as PanGesture;
                    Debug.DrawRay(transform.position, target.WorldTransformPlane.normal);
                    Debug.DrawRay(transform.position, target.WorldDeltaPosition.normalized);
    
                    var local = new Vector3(transform.InverseTransformDirection(target.WorldDeltaPosition).x, transform.InverseTransformDirection(target.WorldDeltaPosition).y, 0);
                    targetPan += transform.InverseTransformDirection(transform.TransformDirection(local));
    
                    //if (transform.InverseTransformDirection(transform.parent.TransformDirection(targetPan - startPos)).y < 0) targetPan = startPos;
                    timeSincePan = 0.0f;
    				panned = true;
    				break;
    
            }
    
        }
    
    	private void onScaleStateChanged(object sender, GestureStateChangeEventArgs e)
        {
            switch (e.State)
            {
                case Gesture.GestureState.Began:
                case Gesture.GestureState.Changed:
                    var gesture = (ScaleGesture)sender;
    
                    if (Math.Abs(gesture.LocalDeltaScale) > 0.01 )
                    {
    					fieldOfView *= gesture.LocalDeltaScale;
    
    					if(fieldOfView >= 170){fieldOfView = 170;}
    					if(fieldOfView <= 1){fieldOfView = 1;}
    
    					timeSinceScale = 0.0f;
    
    
                    }
                    break;
            }
        }


    Figure 3. The scene’s main camera “zoomed in” on distance features via the right GUI touch pad simulator

    Press and Release Followed by Flick

    The following gesture sequence increases the horizontal speed of the FPC when the left little sphere receives press and release gestures followed by a Flick Gesture within one-half second.

    To add this functionality, I begin by adding a float variable to keep track of the time since the sphere received the Release Gesture and a Boolean variable to keep track of the sphere receiving a Flicked Gesture:

    	private float timeSinceRelease;
    	private bool flicked;

    As part of the scene’s initial setup, I configured the script attached to the left little sphere with access to the FPC’s InputController script, which allows the left little sphere to instigate moving the FPC to the left. The variable controlling the FPC’s horizontal speed is not in the InputController but in the FPC’s CharacterMotor. Granting the left little sphere’s script access to the CharacterMotor is configured similarly:

    		CH = GameObject.Find("First Person Controller");
    		CHFPSInputController = (FPSInputController)CH.GetComponent("FPSInputController");
    		CHCharacterMotor = (CharacterMotor)CH.GetComponent ("CharacterMotor");

    The script’s onFlick function merely sets the Boolean variable flicked equal to True.

    The script’s Update function (called once per frame) alters the FPC’s horizontal movement speed as follows:

    		if(flicked && timeSinceRelease <= 0.5f)
    		{
    			CHCharacterMotor.movement.maxSidewaysSpeed += 2.0f;
    			flicked = false;
    		}
    
    		timeSinceRelease += Time.deltaTime;
    	}

    This code gives the user the ability to increase the horizontal movement speed of the FPC by pressing and releasing the left little sphere, and then flicking the left little sphere within one-half second. You could configure the ability to decrease the horizontal movement speed in any number of ways, including a Flick Gesture following a press and release of the left index sphere. Note that the CHCharacterMotor.movement method contains not only maxSidewaysSpeed but gravity, maxForwardsSpeed, maxBackwardsSpeed, and other parameters. The many TouchScript gestures and geometries receiving them used in combination with these parameters provide many options and strategies for developing touch interfaces to Unity 3D scenes. When developing touch interfaces for these types of applications, experiment with these many options to narrow them to those that provide the most efficient and ergonomic user experience.

    Issues with Gesture Sequences

    The gesture sequences that I configured in the examples in this article rely heavily on the Time.deltaTime function. I use this differential in combination with the gestures before and after the differential to determine an action. The two main issues I encountered when configuring these examples are the magnitude of the time differential and the gestures used.

    Time Differential

    The time differential I used in this article is one-half second. When I used a smaller magnitude of one-tenth second, the gesture sequences weren’t recognized. Although I felt I was tapping fast enough for the gesture sequence to be recognized, the expected scene action did not occur. This is possibly the result of the hardware and software latency. As such, when developing gesture sequences, it’s a good idea to keep in mind the performance characteristics of the target hardware platforms.

    Gestures

    When configuring this example, I originally planned to have Scale and Pan Gestures followed by Tap and Flick Gestures. Having the Scale and Pan Gestures functioning as desired, I introduced a Tap Gesture, which caused the Scale and Pan Gestures to cease functioning. Although I was able to configure a sequence of Scale followed by Pan, this is not the most user-friendly gesture sequence. A more useful sequence may consist of another geometry target in the widget to accept the Tap and Flick Gestures after the Scale and Pan Gestures.

    I used the time differential of one-half second in this example as the break point for actions taken (or not taken). Although it adds a level of complexity to the user interface (UI), you could configure this example to use multiple time differentials. Where Press and Release Gestures followed by a Flick Gesture within one-half second may cause horizontal speed to increase, the Press and Release Gestures followed by a Flick Gesture between one-half and 1 second may decrease the horizontal speed. Using the time differentials in this manner not only offers flexibility for the UI but could be used to plant “Easter eggs” within the scene itself.

    Conclusion

    The gesture sequence scene I configured for this article uses Unity 3D with TouchScript on Ultrabook devices running Windows 8. The sequences implemented are intended to reduce the amount of touch screen area required for the user to interact with the application. The less touch screen area dedicated to user interaction, the more area you can dedicate to more visually appealing content.

    When I wasn’t able to get a gesture sequence to perform as desired, I was able to formulate an acceptable alternative. Part of this performance tuning was adjusting the Time.deltaTime differential to get a gesture sequence to perform as desired on the hardware available. As such, the Unity 3D scene I constructed in this article shows that Windows 8 running on Ultrabook devices is a viable platform for developing apps that use gesture sequences.

    Related Content

    About the Author

    Lynn Thompson is an IT professional with more than 20 years of experience in business and industrial computing environments. His earliest experience is using CAD to modify and create control system drawings during a control system upgrade at a power utility. During this time, Lynn received his B.S. degree in Electrical Engineering from the University of Nebraska, Lincoln. He went on to work as a systems administrator at an IT integrator during the dot com boom. This work focused primarily on operating system, database, and application administration on a wide variety of platforms. After the dot com bust, he worked on a range of projects as an IT consultant for companies in the garment, oil and gas, and defense industries. Now, Lynn has come full circle and works as an engineer at a power utility. Lynn has since earned a Masters of Engineering degree with a concentration in Engineering Management, also from the University of Nebraska, Lincoln.

     

    Intel, the Intel logo, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and other countries.
    *Other names and brands may be claimed as the property of others
    Copyright© 2014 Intel Corporation. All rights reserved.

  • touch interfaces
  • unity
  • touch targets
  • Gesture Sequencing
  • Développeurs
  • Microsoft Windows* 8
  • Windows*
  • Unity
  • Intermédiaire
  • Interfaces tactiles
  • PC portable
  • Tablette
  • URL
  • Installing Intel(R) Cluster Studio XE on the systems with unsupported CPUs


    Using a VPS (Virtual Private Server) in the cloud as a build machine has benefits. For example, I don’t have to pay the electricity bills, and I have access to a fresh build from anywhere in the world.

    I used the following steps to set up my build system on a new VPS.

    1. Download Intel® Cluster Studio XE

    I downloaded my copy of Intel® Cluster Studio XE 2013 SP1 Update 1 from the Intel® Software Development Products Registration Center (IRC) .

    [user01@test-2 ~]$ wget http://registrationcenter.intel.com/irc_nas/3918/l_ics_2013.1.046_intel64.tgz
    --2014-06-23 03:58:01--  http://registrationcenter.intel.com/irc_nas/3918/l_ics_2013.1.046_intel64.tgz
    Resolving registrationcenter.intel.com... 198.175.96.34
    Connecting to registrationcenter.intel.com|198.175.96.34|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2672369932 (2.5G) [application/x-compressed]
    Saving to: “l_ics_2013.1.046_intel64.tgz”
    
    100%[====================================>] 2,672,369,932  266K/s   in 1h 51m
    
    2014-06-23 05:49:03 (392 KB/s) - “l_ics_2013.1.046_intel64.tgz” saved [2672369932/2672369932]
    

    2. Try to install it

    I unpacked it

    [user01@test-2 ~]$ tar -xzf  ./l_ics_2013.1.046_intel64.tgz

    and tried to install by running the install.sh script.

    [user01@test-2 ~]$ cd l_ics_2013.1.046_intel64
    [user01@test-2 l_ics_2013.1.046_intel64]$ ./install.sh
    CPU is not supported.

    Unfortunately, the Intel® Cluster Studio XE installer doesn’t recognize the CPU.
     

    [user01@test-2 l_ics_2013.1.046_intel64]$ cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 2
    model name      : QEMU Virtual CPU version 1.0
    stepping        : 3
    cpu MHz         : 2399.998
    cache size      : 4096 KB
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 4
    wp              : yes
    flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm up rep_good unfair_spinlock pni vmx cx16 popcnt hypervisor lahf_lm
    bogomips        : 4799.99
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 40 bits physical, 48 bits virtual
    power management:
    

    3. Use --ignore-cpu flag

    I used the --ignore-cpu flag to tell the installer not to check the system CPU.

    [user01@test-2 l_ics_2013.1.046_intel64]$ ./install.sh --ignore-cpu
    Please make your selection by entering an option.
    Root access is recommended for evaluation.
    
    1. Run as a root for system wide access for all users [default]
    2. Run using sudo privileges and password for system wide access for all users
    3. Run as current user to limit access to user level
    
    h. Help
    q. Quit
    
    …
    
    Step 6 of 7 | Installation
    --------------------------------------------------------------------------------
    Each component will be installed individually. If you cancel the installation,
    some components might remain on your system. This installation may take several
    minutes, depending on your system and the options you selected.
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library, Runtime Environment for applications running on
    Intel(R) 64 Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library, Runtime Environment for applications running on
    Intel(R) Many Integrated Core Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library for applications running on Intel(R) 64
    Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library for applications running on Intel(R) Many
    Integrated Core Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Trace Analyzer for Intel(R) 64 Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Trace Collector for Intel(R) 64 Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Trace Collector for Intel(R) Many Integrated Core
    Architecture component... done
    --------------------------------------------------------------------------------
    Installing Command line interface component... done
    --------------------------------------------------------------------------------
    Installing Sampling Driver kit component... done
    --------------------------------------------------------------------------------
    Installing Power Driver kit component... done
    --------------------------------------------------------------------------------
    Installing Graphical user interface component... done
    --------------------------------------------------------------------------------
    Installing Command line interface component... done
    --------------------------------------------------------------------------------
    Installing Graphical user interface component... done
    --------------------------------------------------------------------------------
    Installing Command line interface component... done
    --------------------------------------------------------------------------------
    Installing Graphical user interface component... done
    --------------------------------------------------------------------------------
    Installing Intel Fortran Compiler XE for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel C++ Compiler XE for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel Debugger for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel MKL core libraries for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Xeon Phi(TM) coprocessor support component... done
    --------------------------------------------------------------------------------
    Installing Fortran 95 interfaces for BLAS and LAPACK for Intel(R) 64
    component... done
    --------------------------------------------------------------------------------
    Installing GNU* Compiler Collection support for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Cluster support for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel IPP single-threaded libraries for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel TBB component... done
    --------------------------------------------------------------------------------
    Installing GNU* GDB 7.5 on Intel(R) 64 (Provided under GNU General Public
    License v3) component... done
    --------------------------------------------------------------------------------
    Installing GDB Eclipse* Integration on Intel(R) 64 (Provided under Eclipse
    Public License v.1.0) component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Benchmarks component... done
    --------------------------------------------------------------------------------
    Finalizing product configuration...
    --------------------------------------------------------------------------------
    Preparing driver configuration scripts... done
    --------------------------------------------------------------------------------
    Press "Enter" key to continue

    4. Test the installation

    The installer has completed, so now I simply need to test an MPI application and ensure basic functionality.

    [user01@test-2 ~]$ . ~/intel/composerxe/bin/compilervars.sh intel64
    [user01@test-2 ~]$ . ~/intel/impi/4.1.3.048/intel64/bin/mpivars.sh
    [user01@test-2 ~]$ mpiicc ~/intel/impi/4.1.3.048/test/test.c -o test
    [user01@test-2 ~]$ mpirun -n 2 -host `hostname -I` ./test
    Hello world: rank 0 of 2 running on test-2
    Hello world: rank 1 of 2 running on test-2

     

  • Développeurs
  • Professeurs
  • Étudiants
  • Linux*
  • Services Cloud
  • Serveur
  • Débutant
  • Intermédiaire
  • Outils de cluster
  • Compilateurs
  • Bibliothèque Intel® MPI Library
  • Intel® Cluster Ready
  • Interface de transmission de messages
  • Informatique cloud
  • Informatique en cluster
  • Serveur
  • URL
  • Pour commencer
  • Interpreting the Intel compiler's offload report


    During compilation and optimization, the Intel compiler can emit information about specific optimization phases when the user passes the "-opt-report-phase=phase" option. For the offload compilation model targeting the Intel® Xeon Phi™ coprocessor, the compiler provides the "offload" phase keyword, which reports the data transfers between the host and the target coprocessor.

     

    With the "-opt-report-phase=offload" option, the compiler generates two report sections for each offload region in the source: the first, beginning with "Offload to target MIC", comes from compiling the host code; the second, beginning with "Outlined offload region", comes from compiling for the target coprocessor.

     

    For example, consider the following code, "reduction.c":

     

      1 float reduction(float *data, int numberOf)

      2 {

      3   float ret = 0.f;

      4   int i;

      5   #pragma offload target(mic) in(data:length(numberOf))

      6   {

      7      #pragma omp parallel for reduction(+:ret)

      8      for (i=0; i < numberOf; ++i)

      9         ret += data[i];

     10   }

     11   return ret;

     12 }

     13

     

    $ icc -c -openmp -opt-report-phase=offload reduction.c
    reduction.c(5-5):OFFLOAD:reduction:  Offload to target MIC 1
     Data sent from host to target
           data_2_V$0, pointer to (<expr>) elements
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes
     Data received by host from target
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes

    reduction.c(5-5):OFFLOAD:reduction:  Outlined offload region
     Data received by target from host
           data_2_V$0, pointer to (<expr>) elements
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes
     Data sent from target to host
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes

    From the compiler's report we can see that when the offload region at line 5 of the source executes in offload mode, the host first sends the following data to the coprocessor:

    1. The elements pointed to by "data", whose count is determined at run time from the length expression
    2. The scalar "i", 4 bytes long
    3. The scalar "numberOf", 4 bytes long
    4. The scalar "ret", 4 bytes long

    After the offload region finishes executing, the coprocessor sends the following data back to the host:

    1. The scalar "i", 4 bytes long
    2. The scalar "numberOf", 4 bytes long
    3. The scalar "ret", 4 bytes long

    The report also shows that because the pointer "data" is explicitly given the in transfer type, the data it points to is transferred only to the coprocessor and does not need to be copied back to the host. The other three scalar variables referenced in the offload region have no explicit transfer type, so by the implicit rules they are transferred in both directions.

    For more information on using the Intel compiler to develop programs for the Intel Xeon Phi coprocessor, see the relevant sections of the Intel compiler user and reference guide.

  • Intel Parallel Composer XE
  • Developers
  • Students
  • Linux*
  • C/C++
  • Fortran
  • Intermediate
  • Intel® Composer XE
  • Development Tools
  • Server
  • URL
  • Compiler Topics
  • Dual Video in Android* Applications Using Intel® WiDi Technology


    Download

    Download the dual video WiDi code samples [ZIP 112KB]

    This sample shows how to use the Presentation class to display video content on an external screen by means of Intel® WiDi technology. It also shows how to use a service to play the content on the external screen, which allows video playback to continue when another application is started on the device's main screen. Finally, it shows how to configure audio on Android devices based on Intel® processors to enable dual audio streams for video playback, or dual video combined with any other application that plays audio content.

    The Presentation Class with Video

    The Presentation class is used to create a dialog that displays content on an external screen. In this example, we will see how to show video content with it. When using the Presentation API to display content on an external screen via Intel WiDi technology, you need to select the appropriate display on which to present the content. The getSystemService function can be used to obtain a pointer to the DisplayManager object. With this object, you can obtain an array of all the external displays usable with the Presentation class by calling the getDisplays function with the DISPLAY_CATEGORY_PRESENTATION constant. Once you have the pointer to the display you want to send the presentation to, you can create an instance of the RemoteVideoPresentation class and call its show function to begin rendering content on the external display.

    private DisplayManager mDisplayManager;
    mDisplayManager = (DisplayManager)getSystemService(Context.DISPLAY_SERVICE);
    
    //Select the display
    Display[] displays = mDisplayManager.getDisplays(DisplayManager.DISPLAY_CATEGORY_PRESENTATION);
    for (Display display : displays)
    {
    	//Set up the Presentation class and show it
    	presentation = new RemoteVideoPresentation(this, display, video);
    	presentation.show();
    }

    Our RemoteVideoPresentation class extends Google's Presentation class and overrides three functions: onCreate, onStart, and onStop. OnCreate is called much like an Activity's OnCreate function. This is where we set up the layout that contains our VideoView and obtain a handle to the AudioManager.

    @Override
    protected void onCreate(Bundle savedInstanceState)
    {
    	super.onCreate(savedInstanceState);
    
    	mAudManager = (AudioManager)getContext().getSystemService(Context.AUDIO_SERVICE);
    	getWindow().setType(WindowManager.LayoutParams.TYPE_SYSTEM_ALERT);
    	setContentView(R.layout.activity_remote_video);
    	mVideoView = (VideoView) findViewById(R.id.remoteVideoView);
    }

    The onStart method is invoked after the object's creator calls its show function. This is where we start playing the video whose URI was passed to the constructor, on the external display that was also passed to the constructor. We configure the audio, set the VideoView's video URI, and then call its start function.

    @Override
    protected void onStart()
    {
    	super.onStart();
    	playVideo();
    }
    
    public void playVideo()
    {
    	if (mVideoView != null)
    	{
    		mVideoView.setVideoURI(mVideoUri);
    		int result = mAudManager.requestAudioFocus(afChangeListener,
    				// Use the music stream.
    				AudioManager.STREAM_MUSIC,
    				// Request permanent focus.
    				AudioManager.AUDIOFOCUS_GAIN);
    		if (result == AudioManager.AUDIOFOCUS_REQUEST_FAILED)
    		{
    			//Error
    		}
    		mAudManager.setParameters("bgm_state=true");
    
    		mVideoView.start();
    	}
    }

    Managing the Presentation Class with a Service

    Presentation dialogs do not have to be managed by a service, but if they are not, the dialog stops when the activity that created it stops. To allow video playback to continue on the external screen while switching applications on the local screen, a service has to create and manage the Presentation class used to play the video. This is also useful for playing two different video streams on the external and local screens. By creating the service as a class that extends the Service class, we can start the service as an Intent and stop it the same way. RemoteVideoService is the service we have extended from the base Service class.

    public void OnClickPlayRemoteVideo(View view)
    {
    	Intent serviceIntent = new Intent(this, RemoteVideoService.class);
    	serviceIntent.putExtra(RemoteVideoService.URI, mRemoteVideoUri);
    	startService(serviceIntent);
    	mRemoteStopButton.setVisibility(View.VISIBLE);
    	mRemoteStopButton.setClickable(true);
    }
    
    public void OnClickStopRemoteVideo(View view)
    {
    	Intent serviceIntent = new Intent(this, RemoteVideoService.class);
    	stopService(serviceIntent);
    
    	mRemoteStopButton.setVisibility(View.INVISIBLE);
    	mRemoteStopButton.setClickable(false);
    }

    In the service, we need to override four functions: onBind, onCreate, onDestroy, and onStartCommand. In the onBind function, we only need to return a new Binder object.

    @Override
    public IBinder onBind(Intent intent)
    {
    	return new Binder();
    }

    onCreate is similar to the other onCreate functions we have worked with in Activities. Here, however, no layout setup is needed because everything is handled in the Presentation class. We only set up the DisplayManager, since we will need it to select the external display on which to establish the Presentation.

    @Override
    public void onCreate()
    {
    	super.onCreate();
    	mDisplayManager = (DisplayManager)getSystemService(Context.DISPLAY_SERVICE);
    }

    onDestroy is called when the main Activity has stopped the service with the stopService function. We use it to call the Presentation's cancel function. As a consequence, the Presentation's onStop function will be called, allowing it to clean up after itself.

    @Override
    public void onDestroy()
    {
    	if (presentation != null)
    	{
    		presentation.cancel();
    	}
    	super.onDestroy();
    }

    The onStartCommand function is where we do most of the work. We set up a notification object and start the service in the foreground so the user has an item in the Android pull-down menu to navigate easily back to the application and control the service from the main Activity. This is also where we create an instance of our Presentation class to play video, obtain the URI that was passed to the service as a Parcel, and select the external display.

    @Override
    public int onStartCommand(Intent intent, int flags, int startId)
    {
    	CharSequence text = getText(R.string.app_name);
    	Intent startApp = new Intent(this, MainActivity.class);
    	PendingIntent pendingIntent = PendingIntent.getActivity(this, 0, startApp, 0);
    	Notification.Builder bld = new Notification.Builder(this);
    	Notification not = bld
    			.setSmallIcon(R.drawable.ic_launcher)
    			.setContentIntent(pendingIntent)
    			.setContentTitle(text)
    			.build();
    
    	startForeground(1, not);
    
    	Uri video = (Uri)intent.getParcelableExtra(URI);
    
    	//Select the display
    	Display[] displays = mDisplayManager.getDisplays(DisplayManager.DISPLAY_CATEGORY_PRESENTATION);
    	for (Display display : displays)
    	{
    		//Set up the Presentation class and show it
    		presentation = new RemoteVideoPresentation(this, display, video);
    		presentation.show();
    	}
    
    	return START_NOT_STICKY;
    }

    Dual Audio Streams

    On Android devices based on Intel processors, audio can be configured so that two separate audio streams play on the device's local speakers or headphones and on an external display capable of playing audio, such as an Intel WiDi or HDMI display. In essence, this allows an application developer to play video content with its audio on the external display while simultaneously playing a separate video (with audio) on the local display. Another example would be playing video externally while taking a phone call locally on the device. The code to do this in this sample is relatively simple, and it is all done in the Presentation class. When playing video, we need to configure the audio manager to use the music stream and set the bgm_state parameter to true.

    OnAudioFocusChangeListener afChangeListener = new OnAudioFocusChangeListener() {
    	public void onAudioFocusChange(int focusChange) {
    		if (focusChange == AudioManager.AUDIOFOCUS_LOSS_TRANSIENT) {
    
    		} else if (focusChange == AudioManager.AUDIOFOCUS_GAIN) {
    
    		} else if (focusChange == AudioManager.AUDIOFOCUS_LOSS) {
    			mAudManager.abandonAudioFocus(afChangeListener);
    		}
    	}
    };
    int result = mAudManager.requestAudioFocus(afChangeListener,
    		// Use the music stream.
    		AudioManager.STREAM_MUSIC,
    		// Request permanent focus.
    		AudioManager.AUDIOFOCUS_GAIN);
    if (result == AudioManager.AUDIOFOCUS_REQUEST_FAILED)
    {
    	//Error
    }
    mAudManager.setParameters("bgm_state=true");

    We also need to adjust the manifest xml file to indicate that our application will modify the audio settings.

    <uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS"/>

    This basic example shows how to add video or audio playback with Intel WiDi technology to an application, which lets users multitask and do many different activities locally without interrupting the playback of external content on Android devices with Intel processors. Happy coding!

    About the Author

    Gideon is part of Intel's Software and Services Group. He works with independent software vendors, helping them optimize their products for Intel® Atom™ processors. In the past, he worked on a team that wrote Linux* graphics drivers for platforms running the Android OS.

    Related Links

    Code samples for enabling Intel® WiDi dual screen: http://software.intel.com/es-es/intel-widi#pid-19198-1607
    Dual Screen Intel® WiDi application: http://software.intel.com/es-es/articles/dual-screen-intel-widi-application
    How to enable Intel® Wireless Display differentiation for Miracast* on an Intel® architecture phone: http://software.intel.com/es-es/articles/how-to-enable-intel-wireless-display-differentiation-for-miracast-on-intel-architecture

    To learn more about Intel tools for Android developers, visit the Intel® Developer Zone for Android.

  • applications
  • Intel® WiDi
  • Dual Video
  • Developers
  • Android*
  • Android*
  • Intermediate
  • User Experience and Design
  • Phone
  • Tablet
  • URL
  • Monte Carlo European Option Pricing with RNG Interface for Intel® Xeon Phi™ Coprocessor


    Download Available under the Intel Sample Source Code License Agreement license.

    Background

    Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. The name comes from the famous casino in the principality of Monaco, where a roulette table produces uncertain outcomes just like a series of random numbers. The contemporary version of the Monte Carlo algorithm was first used by Stanislaw Ulam while he was working on the Manhattan Project in the mid-1940s. Nicholas Metropolis was the first to make the connection between the casino and the algorithm, and he coined the term Monte Carlo to refer to any numerical simulation algorithm that involves a random number generator. John von Neumann was the first to implement Monte Carlo on the ENIAC computer in the late 1940s. Since then, Monte Carlo has been widely used in engineering, physics, and molecular dynamics, and in calculating integrals with complicated boundary conditions.

    In 1973, Fischer Black and Myron Scholes published their historic paper and introduced what later became known as the Black-Scholes option pricing model for financial derivatives. While the rest of the world was still trying to digest the Black-Scholes model, an actuarial professor from the University of British Columbia, Phelim Boyle, introduced the Monte Carlo method to finance and successfully used it as an alternative way to get the same result as the Black-Scholes model. In his article, he takes the example of a European call option and calculates its price using the Monte Carlo method.

    In this paper, we use the same numerical problem as an example to highlight various techniques and practices to achieve high performance computing on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.

    Code Access

    The Monte Carlo European Option with RNG interface is maintained by Shuo Li and is available under the BSD 3-Clause Licensing Agreement. The code supports the asynchronous offload of the Intel Xeon processor (referred to as “host” in this document) with the Intel Xeon Phi coprocessor (referred to as “coprocessor” in this document) in a single node environment.

    To access the code and test workloads:

    Go to the source location to download the MonteCarloRNGsrc.tar file.

    Build Directions

    Here are the steps you need to follow in order to rebuild the program:

    1. Install Intel® Composer XE 2013 SP2 on your system
    2. Source the environment variable script file compilervars.csh under /pkg_bin
    3. Untar the montecarlorng.tar file
    4. Issue the make command, unconditionally and silently, using the -Bs option:
    [prompt]$ make -Bs
    

    Run Directions

    Copy the following files to the Intel Xeon Phi coprocessor card.

    [prompt]$ scp MonteCarloRNGSP.knc yourhost-mic0:
    [prompt]$ scp MonteCarloRNGDP.knc yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/lib/mic/libiomp5.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so.2 yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_intel_lp64.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_sequential.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_core.so yourhost-mic0:

    Turn on the turbo mode on your Intel Xeon Phi coprocessor card.

    [prompt]$ sudo /opt/intel/mic/bin/micsmc --turbo enable
    Information: mic0: Turbo Mode Enable succeeded.

    Invoke the binary and set the environmental variable for the execution from the host.

    [prompt]$ ssh yourhost-mic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloRNGSP.knc"
    Monte Carlo European Option Pricing Single Precision
    
    Compiler Version  = 14
    Release Update    = 2
    Build Time        = Jun  2 2014 12:22:43
    Path Length       = 262144
    Number of Options = 999912
    Block Size        = 16384
    Worker Threads    = 244
    
    Starting options pricing...
    Parallel simulation completed in 21.439754 seconds.
    Validating the result...
    L1_Norm          = 4.812E-04
    Average RESERVE  = 12.872
    Max Error        = 8.035E-02
    ==========================================
    Total Cycles = 28586338291
    Cyc/opt      = 28588.854
    Time Elapsed =   21.440
    Options/sec  = 46638.222
    ==========================================
    [prompt]$ ssh yourhost-mic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloRNGDP.knc"
    Monte Carlo European Option Pricing Double Precision
    
    Compiler Version  = 14
    Release Update    = 2
    Build Time        = Jun  2 2014 12:22:44
    Path Length       = 262144
    Number of Options = 999912
    Block Size        = 8192
    Worker Threads    = 244
    
    Starting options pricing...
    Parallel simulation completed in 47.075885 seconds.
    Validating the result...
    L1_Norm          = 4.812E-04
    Average RESERVE  = 12.920
    Max Error        = 8.034E-02
    ==========================================
    Total Cycles = 62767847297
    Cyc/opt      = 62773.371
    Time Elapsed =   47.076
    Options/sec  = 21240.429
    ==========================================
    

    The program priced about a million sets of option input data. Dividing 1 million among 244 threads gives 4,098.36; without losing generality, let's round the number of options each thread runs to 4,098, so the total number of options the program prices is 244 × 4,098 = 999,912. For each option, the program first generates a random number sequence that is independently and identically distributed, or i.i.d. for short. We cover the details of generating random number sequences in a later section. We then use these random numbers as samples of stock movement with the European payoff formula and calculate the stock value and confidence interval using the formulas shown in the next section.

    The program was built on the host and executes on the Intel Xeon Phi coprocessor. For each option data set, it calculates the option values and the confidence intervals. Result validation is part of the benchmark. It measures the average error between the calculated result and the result from Black-Scholes-Merton [2] formula.

    This benchmark runs on a single node of an Intel Xeon Phi coprocessor.  It can also be modified to run in a cluster environment. The program reports a latency measure in seconds spent in pricing activities and also a throughput measure using total number of options priced over the elapsed time, which was printed out as the last performance number in Options/sec.

    Generating and Using Random Numbers in Monte Carlo Methods

    Since Monte Carlo is a numerical method based on the simulation of random variables, the implementation of this algorithm starts with identifying a random number generator. VSL, the vector statistical library component of the Intel® Math Kernel Library (Intel® MKL), provides a variety of random number generators for different distributions. Our implementation uses the Mersenne Twister random number generator with the normal distribution. VSL is part of the Intel® C++ Composer XE 2013 that we use to build applications for Intel Xeon processors and Intel Xeon Phi coprocessors.

    Using Random Number Generators

    Inside VSL a random number sequence is identified as a stream. Each stream delivers random numbers from a given distribution through a vector interface. To manage the complexity, VSL uses two implementation layers to support different RNGs and different distributions. At the lower level, all core random number generation routines are implemented to deliver random numbers in a uniform distribution. At the higher level, transformation functions are applied to turn the uniform distribution into the distribution the user desires.

    To use VSL, follow this typical 5-step process:

    1. Specify RNG streams
    2. Initialize and create the random number streams
    3. Request a vector of random numbers in a specific distribution
    4. Consume the random number sequences in the simulation
    5. Destroy the RNG streams

    Here is how the process works for calculating our European Call options:

    1. Specify a random number stream. In our benchmark we are going to use all 61 cores and create 4 threads per core. In total, we can have 244 threads. To allow each thread to price an option independently, we should give each thread an independent random number stream. We need to declare an RNG state descriptor for each thread. We can declare these data structures before we create any threads.
       
      #include <mkl_vsl.h>
          // Declare random number buffer and random number sequence descriptors
          float *samples[MAX_THREADS];
          VSLStreamStatePtr Streams[MAX_THREADS];

      VSLStreamStatePtr is a C/C++ opaque data structure and Streams is an array of still uninitialized opaque data.

    2. Initialize and create the random stream and set up the stream with a basic random number generator and an integer seed.

      Once we have created worker threads, each thread will allocate its own buffer to receive the RNG sequences and initialize the stream descriptor so that it knows which basic random number generator to use and whether the threads need to work together to ensure mutual independence.

      samples[threadID] = (float *)scalable_aligned_malloc(RAND_BLOCK_LENGTH * sizeof(float), SIMDALIGN);
      vslNewStream(&(Streams[threadID]), VSL_BRNG_MT2203 + threadID, RANDSEED);

      Intel MKL provides the following routine to create and initialize the stream you declared:

      vslNewStream (VSLStreamStatePtr &Randomstream, int brng, int seed )

      Randomstream is a reference to the uninitialized random stream you just declared. It takes a reference because the routine passes back an initialized stream state descriptor. brng is an enumeration parameter specifying which basic random number generator to use. VSL_BRNG_MT2203 specifies a family of modified Mersenne Twister [10] pseudorandom generators that are mutually independent. In our problem, each thread uses its own generator, identified by VSL_BRNG_MT2203 plus its thread ID. You can find more information on the basic random number generators in the BRNG parameter definitions. seed is an integer seed for the random stream, which ensures reproducibility for debugging purposes.

    3. Request a vector of random numbers of a specific distribution. Using the random stream descriptor, we can call one of the distribution generators to produce a sequence of random numbers with a certain probability distribution and a specific data type. The result will be placed in a user-provided buffer in the form of a C array.
      float *rand = samples[threadID];
      vsRngGaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF, Streams[threadID], RAND_BLOCK_LENGTH, rand, MuByT, VBySqrtT);

      Intel MKL uses the following routine to generate normally distributed random numbers:

      vsRngGaussian(method, stream, n, r, a, sigma)

      where:
      method - the generation method, such as VSL_RNG_METHOD_GAUSSIAN_ICDF; other values are listed here
      stream - an initialized random stream
      n - the number of random numbers requested
      r - address of the receiving buffer, usually a C array declared to hold the random numbers
      a - first parameter of the distribution; for the normal distribution, the mean
      sigma - second parameter of the distribution; for the normal distribution, the standard deviation

    4. Consume the sequence of random numbers.

      for(int i=0; i < RAND_BLOCK_LENGTH; i++)
          {
              float callValue  = Y * exp2f(rand[i]) - Z;
              callValue = (callValue > 0) ? callValue : 0;
              v0 += callValue;
              v1 += callValue * callValue;
          }
    5. Delete the random stream.

      Use vslDeleteStream (VSLStreamStatePtr &stream) to delete the stream declared in step 1. Since we created the stream in the worker threads, it’s customary to destroy the stream in the worker thread.

    Other Implementation Notes

    In our implementation, worker threads are created using OpenMP* parallel directives. Each thread creates its own random number streams, generates the unique option input data, prices the option, and then validates the result. Each thread’s input data is generated by calling C runtime library rand_r in a unique sequence identified by its thread ID, which guarantees each thread will produce a unique and reproducible sequence.

    The aligned memory allocation interface from Intel® Threading Building Blocks (Intel® TBB) is used to allocate aligned memory blocks that are also cache-friendly for the worker threads. This means these memory blocks have to be disposed of using the corresponding API. Intel TBB is part of our minimum build requirement.

    OpenMP reduction operations are used to calculate the statistical properties of all the options. It’s also used to find the maximum error from the threads.

    Source Code for MonteCarloRNG Core

    The following is a core part of MonteCarloRNG using single precision data types. Double precision is almost identical.

    // Declare random number buffer and random number sequence descriptors
    float *samples[MAX_THREADS];
    VSLStreamStatePtr Streams[MAX_THREADS];
    
    // calculate the block number based on block size
    const int nblocks = RAND_N/RAND_BLOCK_LENGTH;
    
    #pragma omp parallel reduction(+ : sum_delta) reduction(+ : sum_ref) reduction(+ : sumReserve) reduction(max : max_delta)
    {
    
    #ifdef _OPENMP
        int threadID = omp_get_thread_num();
    #else
        int threadID = 0;
    #endif
        unsigned int randseed = RANDSEED + threadID;
        srand(randseed);
    float *CallResultList     = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *CallConfidenceList = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *StockPriceList     = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *OptionStrikeList   = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *OptionYearsList    = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    for(int i = 0; i < OPT_PER_THREAD; i++)
    {
        CallResultList[i]     = 0.0f;
        CallConfidenceList[i] = 0.0f;
        StockPriceList[i]     = RandFloat_T(5.0f, 50.0f, &randseed);
        OptionStrikeList[i]   = RandFloat_T(10.0f, 25.0f, &randseed);
        OptionYearsList[i]    = RandFloat_T(1.0f, 5.0f, &randseed);
    }
    
    samples[threadID] = (float *)scalable_aligned_malloc(RAND_BLOCK_LENGTH * sizeof(float), SIMDALIGN);
    vslNewStream(&(Streams[threadID]), VSL_BRNG_MT2203 + threadID, RANDSEED);
    
    #pragma omp barrier
    if (threadID == 0)
    {
    printf("Starting options pricing...\n");
        sTime = second();
        start_cyc = _rdtsc();
    }
    
    for(int opt = 0; opt < OPT_PER_THREAD; opt++)
    {
        const float VBySqrtT = VLog2E * sqrtf(OptionYearsList[opt]);
        const float MuByT    = MuLog2E * OptionYearsList[opt];
        const float Y        = StockPriceList[opt];
        const float Z        = OptionStrikeList[opt];
    
        float v0 = 0.0f;
        float v1 = 0.0f;
        for(int block = 0; block < nblocks; ++block)
        {
            float *rand = samples[threadID];
            vsRngGaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF, Streams[threadID], RAND_BLOCK_LENGTH, rand, MuByT, VBySqrtT);
    
    #pragma vector aligned
    #pragma simd reduction(+:v0) reduction(+:v1)
    #pragma unroll(4)
            for(int i=0; i < RAND_BLOCK_LENGTH; i++)
            {
                float callValue  = Y * exp2f(rand[i]) - Z;
                callValue = (callValue > 0) ? callValue : 0;
                v0 += callValue;
                v1 += callValue * callValue;
            }
        }
        const float  exprt      = exp2f(RLog2E*OptionYearsList[opt]);
        CallResultList[opt]     = exprt * v0 * INV_RAND_N;
        const float  stdDev     = sqrtf((F_RAND_N * v1 - v0 * v0) * STDDEV_DENOM);
        CallConfidenceList[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    } //end of opt
    
    #pragma omp barrier
    if (threadID == 0) {
        end_cyc = _rdtsc();
        eTime = second();
    printf("Parallel simulation completed in %f seconds.\n", eTime-sTime);
    printf("Validating the result...\n");
    }
    
    double delta = 0.0, ref = 0.0, L1norm = 0.0;
    int max_index = 0;
    double max_local  = 0.0;
    for(int i = 0; i < OPT_PER_THREAD; i++)
    {
        double callReference, putReference;
        BlackScholesBodyCPU(
            callReference,
            putReference,
            StockPriceList[i],
            OptionStrikeList[i], OptionYearsList[i],  RISKFREE, VOLATILITY );
            ref   = callReference;
            delta = fabs(callReference - CallResultList[i]);
            sum_delta += delta;
            sum_ref   += fabs(ref);
            if(delta > 1e-6)
                 sumReserve += CallConfidenceList[i] / delta;
            max_local = delta>max_local? delta: max_local;
    }
    max_delta = max_local>max_delta? max_local: max_delta;
    vslDeleteStream(&(Streams[threadID]));
    scalable_aligned_free(CallResultList);
    scalable_aligned_free(CallConfidenceList);
    scalable_aligned_free(StockPriceList);
    scalable_aligned_free(OptionStrikeList);
    scalable_aligned_free(OptionYearsList);
    
    }//end of parallel block
    
    

    Appendix

    About the Author

    Shuo Li works for the Intel Software and Service Group. His main interests are parallel programming and application software performance. In his recent role as a staff software performance engineer covering the financial service industry, Shuo works closely with software developers and modelers and helps them achieve high performance with their software solutions.

    Shuo holds a Master's degree in Computer Science from the University of Oregon and an MBA degree from Duke University.

    References and Resources

    [1]Option Pricing: A Simplified Approach (1979) by John C. Cox, Stephen A. Ross, and Mark Rubinstein

    [2]Theorie de la Speculation, Annales Scientifiques de l´ Ecole Normale Sup´erieure, 21–86. Bachelier, L. (1900). reprinted 1995 Editions Jacques Gabay

    [3]Hull, John C., Options, Futures, and Other Derivatives, 7th Edition, Prentice-Hall, 2009

    [4]Wilmott, P., Derivatives: The Theory and Practice of Financial Engineering. Chichester: Wiley, 1998

    [5]Cox, J. C. Ross, S. A. and Rubinstein, M. Option Pricing: A simplified Approach Journal of Financial Economics 7 (October 1979): 229-64

    [6]Black, F., and M. Scholes, The Pricing of Options and Corporate Liabilities Journal of Political Economy, 81(May/June 1973): 637-59

    [7]Merton, R. C. Theory of Rational Option Pricing, Bell Journal of Economics and Management Science, 4(Spring 1973): 141-83

    [8]Boyle, P. P., Options: A Monte Carlo Approach Journal of Financial Economics, 4 (1977) 323-38

    [9]Black, Fischer and Scholes, Myron The Pricing of Options and Corporate Liabilities (May-Jun 1973)

    [10]Matsumoto, M., and Nishumira T. Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, Pages 3-30, January 1998

    [11]Intel Xeon processor: http://www.intel.com/content/www/us/en/processors/xeon/xeon-processor-e7-family.html

    [12]Intel Xeon Phi coprocessor:  https://software.intel.com/en-us/articles/quick-start-guide-for-the-intel-xeon-phi-coprocessor-developer

    License

    Intel sample source is provided under the Intel Sample Source License Agreement.

     

    Notices

    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

    UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

    Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

    The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

    Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

    Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

    Intel, the Intel logo, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

    Copyright © 2014 Intel Corporation. All rights reserved.

    *Other names and brands may be claimed as the property of others.

    Attachment: MonteCarloRNG.tar (30 KB)

    Determining the Idle Power of an Intel® Xeon Phi™ Coprocessor


    Abstract

    This document gives platform designers, thermal engineers, hardware engineers, and computer architects instructions on how to acquire idle power readings from the Intel® Xeon Phi™ coprocessor.

    There are two access methods by which the server management and control panel component may obtain status information from the Intel® Xeon Phi™ coprocessor. The “in-band” method utilizes the Symmetric Communications Interface (SCIF), the capabilities designed into the coprocessor OS, and the host driver to deliver the Intel® Xeon Phi™ coprocessor status. It also provides a limited ability to set specific parameters that control hardware behavior. An example application using this interface would be ‘micsmc’, which is provided with the Intel® Manycore Platform Software Stack (Intel® MPSS), which reads the power in-band.

    The same information can be obtained using an “out-of-band” method. This method starts with the same capabilities in the coprocessor OS, but sends the information to the System Management Controller (SMC) using a proprietary protocol. With this method, the coprocessor idle power measurements can be made without waking up the card.

    The Intel® Xeon Phi™ coprocessor communicates with the baseboard management controller (BMC) or peripheral control hub (PCH) over the System Management Bus (SMBus) using the standard Intelligent Platform Management Bus (IPMB) protocol.  The SMC responds to queries from the platform's BMC using the Intelligent Platform Management Interface (IPMI).  Through the Inter-Integrated Circuit (I2C) interface, the SMC can communicate with the Intel® Xeon Phi™ coprocessor and with sensors located on the PCIe card.

    Figure 1: Intel® Xeon Phi™ Coprocessor Board Schematic

    For Intel® Xeon® processor E5-2600 v3 product family platforms, the Intel® ME can provide access to the SMC on the Intel® Xeon Phi™ coprocessor with very little effort, and can read the idle power from the SMC.

    For Intel® Xeon® processor E5 V2 family platforms, the Intel® ME does not have a mechanism for reading power from the Intel® Xeon Phi™ coprocessor and so relies on the BMC to provide sensor information.  This, however, requires that the BMC implement some mechanism for communicating with the Intel® Xeon Phi™ coprocessor, either via special OEM commands or through a bridging mechanism.

    Figure 2: Example of a topology where the BMC is connected to Intel® Xeon Phi™ coprocessors

    Bridging, Channels, and OEM Commands


    Unlike the sensors that can be accessed via the BMC's SDR, the sensors of the Intel® Xeon Phi™ coprocessor are abstracted behind a different I2C bus.  In order to access these sensors, the user needs to be familiar with the I2C network diagram and the mechanism for accessing the bus.  The sensors might also need to be exposed via special BMC OEM commands or with a third-party vendor's help.  You can find more details about how to do this in the Intel® Xeon Phi™ Coprocessor Datasheet.

    The example scripts were tested on both an Intel® Xeon® processor E7 V2 server product with four 7120A Intel® Xeon Phi™ Coprocessors, and an Intel® Xeon® processor E5-2600 v3 server product with two 7120A Intel® Xeon Phi™ Coprocessors.

    On these platforms, the BMC has implemented several OEM commands that provide reverse PCIe SMBus proxy.  On the Intel® Xeon® processor E5 V2 family* platforms, the format is below:

    Table 1: Get MIC Card Info Command (30h E3h)

    Net Function = Software Development Kit (SDK) General Application (0x30)

    Code

    Command

    Request, Response Data

    Description

    E3h

    Get MIC Card Info

    Request

    *Byte 1:3 - Intel Manufacturer ID – 000157h, LS byte first
    Byte 4 - Card instance (1-based) for which information is requested. If this byte is zero only the total number of cards detected will be returned.

    This command returns information about management-capable PCIe* cards that are discovered by Intel® ME, including protocol support and addressing information that can be used in MIC IPMB Request command.

    E3h

    Get MIC Card Info

    Response

    Byte 1 – Completion Code

    = 00h – Success

    = CBh “Requested sensor, data, or record not present” the requested card instance is greater than the number of cards detected.

    Byte 2:4 – Intel Manufacturer ID – 000157h, LS byte first. The following bytes are only returned if there are any management-capable cards detected by the Intel® ME.

    Byte 5 – Total number of MIC devices detected. The following bytes are only returned if the specified management-capable card is detected by the BMC.

    Byte 6 – Command Protocol Detection Support

    [7:4] – Reserved

    [3]  - MCTP over SMBus

    [2] – IPMI on PCIe* SMBus (refer to IPMI 2.0 spec)

    [1] - IPMB

    [0] – Unknown

    A value of 1b indicates detection of a protocol is supported. Support for detection of specific protocols is OEM specific.

    NOTE: Intel® ME firmware for Grantley only supports detection of IPMB

    Byte 7 – Command Protocols Supported by Card

    [7:4] – Reserved

    [3] – MCTP over SMBus

    [2] – IPMI on PCIe* SMBus

    [1] - IPMB

    [0] – Unknown

    Byte 8 – Address/Protocol/Bus#

    [7:6] Address Type

    00b – Bus/Slot/Address

    Other values reserved

    [3:0] Bus Number – Identifies SMBus interface on which the MIC device was detected

    Byte 9 - Slot Number – identifies PCIe* slot in which the MIC device is inserted.

    Byte 10 - Slave Address - the I2C slave address (8 bit “write” address) of the MIC device

    This command returns information about management-capable PCIe* cards that are discovered by Intel® ME, including protocol support and addressing information that can be used in MIC IPMB Request command.

    * Note that this changed to match the E8h command in later versions of the document. Also, the Intel Manufacturer ID – 000157h is removed from the request and response parts of the command below.

    On the Intel® Xeon® processor E5-2600 v3 server products family and Intel® Xeon® processor E7 V2 family, the format is slightly different:

    Table 2: Get MIC Card Info Command (30h E8h)

    Net Function = SDK General Application (0x30)

    Code

    Command

    Request, Response Data

    Description

    E8h

    Get MIC card Info

    Request

    Byte 1 - Card instance (1-based) for which information is requested. If this byte is zero only the total number of cards detected will be returned.

    This command returns information about management-capable PCIe* cards that are discovered by the BMC, including protocol support and addressing information that can be used in the MIC card IPMB Request command.
    Note: E8h is the default value; it may be configured in spsFITC.

    Response

    Byte 1 – Completion Code
    =00h – Success
    =CBh “Requested sensor, data, or record not present” the requested card instance is greater than the number of cards detected.

    The following bytes are only returned if there are any management-capable cards detected by the Intel® ME.
    Byte 2 – Total number of MIC devices detected.

    The following bytes are only returned if the specified management-capable card is detected by the BMC.

     

     

    Response

    Byte 3 – Command Protocol Detection Support
    [7:4] – Reserved
    [3]  - MCTP over SMBus
    [2] – IPMI on PCIe* SMBus (refer to IPMI 2.0 spec)
    [1] - IPMB
    [0] – Unknown

    A value of 1b indicates that detection of a protocol is supported.  Support for detection of specific protocols is OEM specific.

    NOTE: Intel® ME firmware for the Intel® Xeon® processor E7 V2 family only supports detection of IPMB

    Byte 4 – Command Protocols Supported by Card
    [7:4] – Reserved
    [3] – MCTP over SMBus
    [2] – IPMI on PCIe SMBus
    [1] - IPMB
    [0] – Unknown

    Byte 5 – Address/Protocol/Bus#
    [7:6] Address Type
    00b – Bus/Slot/Address
    Other values reserved
    [3:0] Bus Number – Identifies SMBus interface on which the MIC card was detected.
    Byte 6 - Slot Number – Identifies PCIe* slot in which the MIC device is inserted.
    Byte 7 - Slave Address - The I2C slave address (8-bit “write” address) of the MIC device.

     

    The first step is to determine how many Intel® Xeon Phi™ coprocessors are in the system.  Once that is known, the bus number, slot number, and slave address of each Intel® Xeon Phi™ coprocessor need to be determined.  The bus number identifies the SMBus interface on which the Intel® Xeon Phi™ coprocessor was detected.  The slot number identifies the PCIe slot into which the Intel® Xeon Phi™ coprocessor is inserted.  Finally, the slave address is the I2C slave address of the SMC on the Intel® Xeon Phi™ coprocessor. With this information, commands can be sent directly to the Intel® Xeon Phi™ coprocessor according to the commands in the Intel® Xeon Phi™ Coprocessor Datasheet, section 6.6.3.

    Perl is a scripting language well suited to automating complicated IPMI commands with IPMItool.  In the example Perl subroutines below, IPMItool is used to send PCIe slot commands and determine how many cards are on the system:

    sub Read_PCIe_smbus_slot_card_info {
        # Uses the global $eX_cmd ("e3" or "e8") selected for this platform
        my $str0 = "ipmitool raw 0x30 0x".$eX_cmd." 0x00";
        #printf ("PCIe SMbus slot card info request: $str0\n");
        my $str1 = `$str0`;
        #If the response is empty, the e3 command isn't implemented; try e8,
        #used by the Intel(R) Xeon(R) processor E7 V2 and E5-2600 v3 products
        if (substr($str1,1,1) eq "") {
            $eX_cmd = "e8";
            $str0 = "ipmitool raw 0x30 0x".$eX_cmd." 0x00";
            #printf ("PCIe SMbus slot card info request: $str0\n");
            $str1 = `$str0`;
        }
        if (substr($str1,1,1) eq "") {
            die("\nThe BMC on your platform does not support the PCIe Slot SMBus Slot Command. Please consult with your BMC vendor. This program will now quit.\n");
        }
        my $count_KNC = substr($str1,1,3);
        #printf ("$count_KNC\n");
        return $count_KNC;
    }

     

    Next the Intel® Xeon Phi™ coprocessor’s addressing parameters can be determined with the following command:

    sub Read_PCIe_smbus_slot_card {
        my ($key) = @_;
        # Note the space separating the command byte from the card instance
        my $str0 = "ipmitool raw 0x30 0x".$eX_cmd." ".$key;
        #printf "$str0\n";
        my $str1 = `$str0`;
        #printf "$str1\n";
        my $bus_num = substr($str1, 4, 2);
        my $slot_num = substr($str1, 13, 2);
        my $slave_address = substr($str1, 16, 2);
        return ($bus_num, $slot_num, $slave_address);
    }

    Now with the communication parameters of each Intel® Xeon Phi™ coprocessor, the BMC needs to provide a way to send the command to the SMC itself.  On the Intel® platforms that were just mentioned, this can be done using the Slot IPMB command below:

    Table 3: Slot IPMB Command (3Eh 51h)

    Net Function = SDK General Application (3Eh)

    Code

    Command

    Request, Response Data

    Description

    51h

    Slot IPMB

    Request
    Byte 1
    [7:6] – Address Type
    =00b – Bus/Slot/Address
    =01b – Reserved for Unique identifier
    [5:4] Reserved
    [3:0] Bus Number. Set to 0 for “Address Type” not
    “Bus/Slot/Address”
    Byte 2 - Slot Number – identifies PCIe slot in which the MIC device is inserted. Set to 0 if “Address Type” is not
    “Bus/Slot/Address”

    Byte 3 – Identifier/Slave-address. This byte holds either the unique ID or the slave address (8 bit “write” address), dependent
    on the “Address Type” field.
    Byte 4 – Net Function
    Byte 5 – IPMI Command
    Byte 6:n – Command Data (optional)

    This command is used for sending IPMB commands to a MIC device and can be used by the BMC to communicate with Intel® Xeon Phi™ devices. It may be sent at any time. If the MIC is accessed via a MUX, the command handler will block the MUX until a response is received or an IPMB timeout occurs. To reduce the impact of a nonresponsive card on access to other slots, a specific implementation might shorten the IPMB timeout and/or limit the retry mechanism for all slot accesses (both proxy and nonproxy) when a MUX is used. If a card behind the MUX consistently fails to respond in a reasonable time, it should be treated as a defect and needs to be root-caused and fixed. An additional recommended action is to remove the non-responding card's slot from any polling routines until the next system reset, power cycle, or PCIe hot-plug event for that slot.

     

     

    Response
    Byte 1 – Completion Code
    =00h – Normal
    =c1h – Command not supported on this platform.
    =c7h – Command data invalid length.
    =c9h – Parameter not implemented or supported.
    =82h – Bus error.
    =85h – Invalid PCIe slot number.
    Byte 2 – Reading Type
    Byte 2:n – Response Data

     

    Here a command is sent to the card in this way:

    Request:

    [intel]$ ipmitool raw 0x3e 0x51 0x02 0x96 0x30 0x06 0x1

    Response (see below for the explanation):

                   (00) 00 00 00 01 16 02 0f 57 01 00 60 00 d6 13 00 00

    The response is broken down below.  For more details, see Table 4 after the explanation.

    The byte in parentheses will not be shown in the response.  It is a successful completion code from the command 0x3E 0x51 which is not displayed by IPMItool.

    The first byte represents the completion code 00h from the execution of the bridged Get Device ID command which normally would not be shown unless there is an error. It is displayed here because the Slot IPMB command simply returns the full content of the response without parsing the completion code. Byte 2 is the device ID (00h for unspecified). Byte 3 is the device revision (00h in this case and also indicating that the device does not provide device SDRs).

    Byte 4 refers to the Firmware Revision 1 (01h indicates a Major Firmware Revision of 1 and normal operation). Byte 5 refers to the Firmware Revision 2, which is BCD-encoded (16h indicates .16). These two bytes combined correspond to the SMC firmware's revision, in this case 1.16.

    Byte 6 refers to the IPMI version (02h indicates 2.0).  Byte 7 represents Additional Device Support (0Fh means that the device supports a FRU, SEL, SDR, and sensor devices). Bytes 8 – 10 are the manufacturer ID, LS byte first (000157h means Intel’s manufacturer ID).

    Bytes 11-12 are the Product ID, LS byte first (0060h). Bytes 13-16 stand for the auxiliary firmware revision information (D6130000h).
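To make the byte-by-byte breakdown above concrete, here is an illustrative Python parser for that response string. The function and field names are additions for this paper, not part of the article's Perl scripts; the offsets follow the Get Device ID layout just described.

```python
# Illustrative parser for the bridged Get Device ID response shown above.
# Multi-byte fields (manufacturer ID, product ID) are LS byte first.
def parse_get_device_id(resp_hex):
    b = [int(tok, 16) for tok in resp_hex.split()]
    return {
        "completion_code": b[0],
        "device_id": b[1],
        "device_revision": b[2],
        "fw_major": b[3] & 0x7F,   # low 7 bits; bit 7 would mean device busy
        "fw_minor": b[4],          # BCD-encoded: 0x16 -> minor revision .16
        "ipmi_version": b[5],      # 0x02 -> IPMI 2.0
        "additional_device_support": b[6],
        "manufacturer_id": b[7] | (b[8] << 8) | (b[9] << 16),  # LS byte first
        "product_id": b[10] | (b[11] << 8),
    }

info = parse_get_device_id("00 00 00 01 16 02 0f 57 01 00 60 00 d6 13 00 00")
# manufacturer_id is 0x000157 (Intel), product_id is 0x0060, firmware 1.16
```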

    Table 4: Get Device ID Command

    The command above is a simple “Get Device ID” command which is common among most BMC and other IPMI devices.  Other commands can be sent to the SMC in a similar way.

    Searching the SMC’s SDR for Sensor Information and Calculating the Idle Power from the “avg_power1” Sensor


    The sensor names in the SMC's SDR are static and do not change from release to release; the sensor numbers, however, are not guaranteed to be.  They may change in future releases, so it is a good idea to query the BMC each time the BMC or SMC firmware changes.  Doing this requires a few steps and the construction of some simple subroutines.  The “for” loop below shows how to send multiple IPMI commands in order to get some basic information out of the card:

    for (1)
    {
        #e3 is the PCIe SMBus Slot card command for the Intel(R) Xeon(R) processor E5-4600 v2 product family platforms;
        #e8 is for Intel(R) Xeon(R) processor E5-2600 v3 product family and Intel(R) Xeon(R) processor E7 V2 family
        #my $eX_cmd = "e3";
        my ($count_KNC1) = Read_PCIe_smbus_slot_card_info($eX_cmd);
        #printf "$count_KNC1\n";
        my $count_KNC = substr($count_KNC1,1,3);
        printf "The number of Intel(R) Xeon Phi(TM) coprocessors is $count_KNC\n";
    
        printf("Which Intel(R) Xeon Phi(TM) coprocessor PCIe card would you like to query? (0..n) ");
        my $key = getc(STDIN);
        $key +=1;
        if ($key > $count_KNC) {
            $key = 1;
        }
        my $key2 = $key - 1;
    
        my ($bus_num, $slot_num, $slave_address) = Read_PCIe_smbus_slot_card($key);
        printf "\nFor Intel(R) Xeon Phi(TM) coprocessor PCIe card#$key2:\nThe Bus Number is 0x$bus_num, Slot Number is 0x$slot_num, Slave Address is 0x$slave_address\n";
        (my $count_SDR) = Read_SMC_SDR_Repository($bus_num, $slot_num, $slave_address);
        printf "\nThe SMC's SDR Repository has 0x".$count_SDR." records\n";
        my $count_SDR_dec = hex($count_SDR);
    
        for (my $i=0; $i < $count_SDR_dec; $i++ ){
            Scan_SMC_SDR_Repository_for_Idle_Power ($key, $i, $bus_num, $slot_num, $slave_address);
        }
    
    }

     

    The first command calls the “Read_PCIe_smbus_slot_card_info()” subroutine to get the number of Intel® Xeon Phi™ coprocessors, and then asks the user which card they want to read.  The “Read_PCIe_smbus_slot_card()” subroutine is then called to get the Intel® Xeon Phi™ coprocessor’s bus number, slot number, and slave address.

    Next the “Read_SMC_SDR_Repository()” subroutine is called to find the number of SDR records on the SMC:

    sub Read_SMC_SDR_Repository {
        my ($bus_num, $slot_num, $slave_address) = @_;
        # 0x0a 0x20 = Storage NetFn, Get SDR Repository Info command
        my $str0 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x20";
        #printf "$str0\n";
        my $str1 = `$str0`;
        #printf "$str1\n";
        my $count_SDR = substr($str1, 7, 2);
        return $count_SDR;
    }

    Here is the structure of that command and the output (based on one particular Intel® Xeon Phi™ coprocessor):

    Request:

    ipmitool raw 0x3e 0x51 0x02 0x96 0x30 0x0a 0x20

    Response:

     00 51 1c 00 00 00 00 00 00 00 00 00 00 00 01

    The third byte returns the number of records present in the SDR repository.  In this example, the card has 1Ch, or 28, records.

    Next the “Scan_SMC_SDR_Repository_for_Idle_Power()” subroutine makes more IPMItool calls to read the sensor value.  A “for” loop calls this subroutine up to 28 times until the desired sensor is found, in this case, “avg_power1”.  This sensor is the sum of the three power sensors on the card and is averaged over time window 1, so it is a good indicator of the card’s power.

    The subroutine below is broken down into parts:

    sub Scan_SMC_SDR_Repository_for_Idle_Power {
        #my $SDR_no=@_[0];
        my ($key, $SDR_no, $bus_num, $slot_num, $slave_address)=@_;
        my $str_sdr = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x23 0x00 0x00 ".$SDR_no." 0x00 0x07 0x0f";
        my $sensor = `$str_sdr`;
        #printf ("First part of the SDR $SDR_no is $sensor\n");
        my $sensor_no = substr($sensor,10, 2);

    These first few commands involve reading the SDR record and finding out the contents from byte 07h until byte 0Fh.  There are different types of sensor data records, but the most common one is Type 01h, for a Full Sensor Record.  The first 8 bytes of the SDR description are shown below:

    Table 5: Full Sensor Record - SDR Type 01h (First 8 Bytes)

    Byte 8 gives the sensor number, which can be used to match it up with the sensor name.

        my $str0 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x23 0x00 0x00 ".$SDR_no." 0x00 0x2e 0xff";
        #printf "$str0\n";
        my $str1 = `$str0`;
        #printf ("Second part of the SDR $SDR_no is $str1\n");
        my $str2 = substr($str1, 15, length($str1));
        #printf "The name of the SDR is in ASCII here: $str2\n";
        $str2 =~ s/\s+//g;
        #printf "Remove spaces: $str2\n";
        my $str3= hex_to_ascii($str2);
        #printf "The sensor name of SDR#$SDR_no is '$str3' (Sensor# 0x$sensor_no)\n";
    #Other sensors can be substituted for "avg_power1" if it is desired to poll a different sensor
        if ($str3 eq "avg_power1") {
            #printf("Entered if comparison\n");
            my $str4 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x04 0x2d 0x".$sensor_no;

    The above code reads the SDR record from byte 2Eh until the end of the record. At byte 31h, or 49 in decimal, the name of the sensor is coded in ASCII character codes.  Here the bytes are saved into a variable, converted from ASCII codes into characters, put into a string, and then compared to “avg_power1”.  If there is a match, the right sensor has been found.

     

    Table 6: Full Sensor Record - SDR Type 01h (ID String Bytes)

     

    The next step is to convert the raw data into something easily understood:

        # Get the M, B, Accuracy, Accuracy Exp, R exp, and B Exp for the SDR formula
        my $str6 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x23 0x00 0x00 ".$SDR_no." 0x00 0x18 0x06";
        my $str7 = `$str6`;
        # Only the M value seems to be needed for the formula; the other values can be ignored
        my $M = substr($str7, 11, 2);
        #printf("M is $M\n");

    The values for the ‘y=mx+b’ reading conversion are determined by reading bytes 25 - 30. M can be read from byte 25 and parts of byte 26.  Typically, in power sensors, only the M value is significant (reading the SDR at these bytes reveals 02h, which means to multiply the decimal value from the sensor by 2).

    Table 7: Full Sensor Record - SDR Type 01h (M, B, Accuracy, R exp & B exp)

    In the last several lines of the code below, these parameters are then used to send an IPMItool command to read the sensor:

        my $key2 = $key - 1;
            while (1){
                my $str5 = `$str4`;
                #printf ("The Sensor value of SDR#$SDR_no is $str5");
                printf ("Sensor '$str3' (0x$sensor_no) is $str5\n");
                my $str6 = substr($str5, 4, 2);
                #printf ("String6 is $str5\n");
                my $dec_num = hex($str6);
                #printf "dec_num is $dec_num\n";
                my $idle_pw = $dec_num * $M;
                printf("Intel(R) Xeon Phi(TM) coprocessor PCIe card#$key2:\nThe Bus number is 0x$bus_num, Slot Number is 0x$slot_num, Slave Address is 0x$slave_address: Power is $idle_pw W\n");
                sleep 1;
            }
        }
    }

    Request:

    [intel]$ ipmitool raw 0x3e 0x51 0x02 0x96 0x30 0x04 0x2d 0x19

             

    Response:                                               

     00 08 00 00

    The first byte is the completion code; 00 means that the command executed successfully.  The second byte is the raw sensor value.  The subroutine converts the value from hex to decimal, multiplies it by a factor of 2, and then prints the calculated value to the screen along with the Intel® Xeon Phi™ coprocessor’s number, bus number, slot number, and slave address. To avoid overloading the bus, the subroutine waits approximately 1 second and then reads the sensor again.
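The same conversion can be sketched in a few lines of Python (the helper name is hypothetical; the M factor of 2 comes from the SDR read earlier):

```python
# Minimal sketch of the reading conversion described above: take the
# second byte of the Get Sensor Reading response (the raw value) and
# multiply by the M factor extracted from the SDR (2 for avg_power1).
def sensor_power_watts(resp_hex, m_factor=2):
    raw = int(resp_hex.split()[1], 16)  # byte 2 = raw sensor reading
    return raw * m_factor

watts = sensor_power_watts("00 08 00 00")  # 0x08 * 2 = 16 W idle power
```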

    Conclusion


    Several factors can keep the Intel® Xeon Phi™ coprocessor idle power higher than expected. Here are some tips to reduce energy consumption:

    • Intel® MPSS must be running in order to put the card in PC3 or PC6
    • Power management is handled by Intel® MPSS and the coprocessor OS running on the card
    • ‘micsmc’ will wake the card out of PC6, so micsmc must be shut down to allow the card to enter the PC6 idle state
    • Shutting down the virtual interface on the host platform will prevent the card from being woken up by pings to the card.  Use the command “ifdown micN” where N represents the Intel® Xeon Phi™ Coprocessor number
    • Always run the latest SMC firmware to make sure that your card supports power management (Note: Not all SKUs support all PC states)

    A few steps need to be followed in order to run the sample Perl script described in this white paper.  Here are instructions for doing this on Red Hat*:

    [intel]$ yum install perl

    For SuSE*, use YaST in GUI mode.  From the command line, use “rug” if using SuSE* 10.1, or “zypper” if using 10.3. Please check SuSE* documentation for more details.

    Once Perl is installed, enter the following command:

    [intel]$ perl -MCPAN -e shell

    Then at the new prompt:

    cpan> install String::HexConvert

    The Perl script and subroutines can be modified to read other sensors on the SMC if so desired.  Please check the Intel® Xeon Phi™ Coprocessor Datasheet for sensor names.  Also check the M, B, Tolerance, Accuracy, Accuracy exp, R exp, and B exp parameters from the SDR record when looking at other sensors.
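For those other sensors, the full linear conversion defined by the IPMI 2.0 specification may be needed: reading = (M × raw + B × 10^Bexp) × 10^Rexp. A small sketch (the helper name is illustrative, not from the article's scripts):

```python
# General linear sensor conversion per IPMI 2.0.  For the avg_power1
# sensor in this paper, M=2 and B, B exp, and R exp are all zero, which
# reduces the formula to raw * 2.
def ipmi_linear_convert(raw, m, b=0, b_exp=0, r_exp=0):
    return (m * raw + b * (10 ** b_exp)) * (10 ** r_exp)

ipmi_linear_convert(8, m=2)              # 16 W, matching the reading above
ipmi_linear_convert(100, m=1, r_exp=-2)  # 1.0, e.g. a voltage in hundredths
```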

    Additional Resources


    488073: Intel® Xeon Phi™ Coprocessor Datasheet, available from IBL/CDI

    513973: Intel® Intelligent Power Node Manager 3.0 External Interface Specification using IPMI, Rev. 1.0.3, available from IBL/CDI

    434090: Intel® Intelligent Power Node Manager 2.0 External Interface Specification Using IPMI, Revision 1.8, available from IBL/CDI

    Intelligent Platform Management Interface Specification, Second Generation, v2.0, available publicly at http://www.intel.com/content/www/us/en/servers/ipmi/ipmi-specifications.html

    IPMItool Man page: http://linux.die.net/man/1/ipmitool

     

    Acknowledgements


    This paper could not have been written without the SMC/BMC expertise of Patrick Voelker, and BMC expertise of Keith Kroeker and Gerald Wheeler.  A big thanks to Andrey Semin for being the voice of the customer.

     

    About the Author


    Todd Enger is a platform application engineer working in Taipei, Taiwan for Intel Microelectronics Asia Ltd.  He specializes in software and firmware support of the Intel® Xeon Phi™ Coprocessor and is also working on enabling customers who will build platforms based upon the next-generation Knights Landing processor. Todd has spent the last 10 years working in Taiwan, the past 4 of them at Intel.  Prior to that, he worked for various OEMs and ODMs in the server, notebook, and smartphone areas.  Back in the US, Todd worked in the Chicago area until a business trip brought him to Taiwan.  After 2 weeks of astonishment, Todd used his ingenuity to find an opportunity in Taipei developing software on smartphones.  Todd received his BSE in Electrical and Computer Engineering from the University of Michigan-Dearborn.  In his spare time, he enjoys scuba diving in the waters around Taiwan, running, swimming, and hanging out at the beach.

    Notices


    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

    A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

    Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

    The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

    Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
    Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:  http://www.intel.com/design/literature.htm

    Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

    *Other names and brands may be claimed as the property of others

    Copyright© 2014 Intel Corporation. All rights reserved.

    This sample source code is released under the Intel Sample Source Code License Agreement


  • performance
  • KNC
  • Knights Corner
  • MIC
  • Many Integrated Core
  • Many Core
  • Parallel Programing
  • Todd Enger
  • Developers
  • Linux*
  • Microsoft Windows* 8
  • Server
  • C#
  • Beginner
  • Intermediate
  • Cluster computing
  • Debugging
  • Development tools
  • Intel® Many Integrated Core Architecture
  • Open source
  • Power efficiency
  • Server
  • URL
  • Code sample
  • Server
  • How to Control the Coprocessor Execution Environment in Offload Programs


    In offload compilation mode, the offload runtime system of the Intel compiler provides two mechanisms that let the host CPU program control the execution environment on the coprocessor:

    1. Set environment variables on the host system and have them forwarded to the coprocessor
    2. Call the corresponding runtime control functions from the host program

     

    Environment variables:

    By default, when an offload occurs, the runtime system copies all environment variables of the host execution environment to the coprocessor's execution environment. You can change this default behavior by giving the environment variable "MIC_ENV_PREFIX" a value. Once it is set, the offload runtime no longer copies all host environment variables; instead it copies only those whose names start with the value of "MIC_ENV_PREFIX" followed by an underscore, and the corresponding variables in the coprocessor environment do not keep that prefix. This way you can give an environment variable of the same name different values on the host and on the coprocessor. For example, suppose the host sets:

     

    MIC_ENV_PREFIX=ABC

    OMP_NUM_THREADS=8

    ABC_OMP_NUM_THREADS=124

     

    With these settings, OMP_NUM_THREADS on the host is 8, while in the execution environment of offload code running on the coprocessor OMP_NUM_THREADS is set to 124.

     

    The offload runtime also supports giving a specific device a different value: insert the coprocessor number between the "MIC_ENV_PREFIX" prefix and the variable name. For example, if the OMP_NUM_THREADS settings in the example above are changed to:

     

    MIC_ENV_PREFIX=ABC

    OMP_NUM_THREADS=8

    ABC_4_OMP_NUM_THREADS=124

     

    then OMP_NUM_THREADS on the host is set to 8, OMP_NUM_THREADS on the fifth coprocessor (card number 4) is set to 124, and OMP_NUM_THREADS on the other coprocessors is left unset.

     

    If you need to specify several environment variables for a coprocessor at once, the following shorthand forms are also available:

    mic_prefix_VAR=variable1=value1|variable2=value2|variable3=value3|...

    mic_prefix_card_number_VAR=variable1=value1|variable2=value2|variable3=value3|...

    where card_number is the coprocessor number.

     

    Runtime control functions:

    Some CPU runtime control API functions have offload counterparts; the difference is two additional parameters:

     

    target_type: the device type. Currently the predefined value "DEFAULT_TARGET_TYPE" is recommended.

    target_number: the device number.

     

    Before using these API functions, include the corresponding header file "offload.h". For example, the API for setting the number of OpenMP threads comes in the following two forms:

     

    CPU API:void omp_set_num_threads (int num_threads);

     

    Offload API: void omp_set_num_threads_target (TARGET_TYPE target_type, int target_number, int num_threads);

     

    For more information on using the Intel compilers to develop programs for the Intel® Xeon Phi™ coprocessor, see the relevant sections of the Intel compiler user and reference guides.

  • Intel Parallel Composer XE
  • Developers
  • Students
  • Linux*
  • C/C++
  • Fortran
  • Intermediate
  • Intel® Composer XE
  • Development tools
  • Parallel computing
  • Server
  • URL
  • Compiler topics
  • Performance improvement
  • Multithreaded development
  • Theme zone: IDZone

    Meshcentral - Introduction & Overview


     

    Site Links

    Main site: meshcentral.com
    Information site: info.meshcentral.com
    Developer blog: intel.com/software/ylian

    Overview
    Meshcentral is an open source project under the Apache 2.0 license that allows administrators to remotely manage computers over the Internet using a single web portal. You have to download and install a mesh agent on all your devices, but once installed the agent is self-upgrading and makes the device available for management on the web portal. A few things set Meshcentral apart from other solutions. It is open source, so anyone can freely set up their own instance of Meshcentral on their own server. Meshcentral manages a very wide array of devices: Windows, OS X, Android, Linux, XEN and more. You can use the same solution to manage big servers and Intel® Galileo devices.

    Features

    Meshcentral features can be separated into in-band and out-of-band features. In-band features are available on all devices; out-of-band features are only available on computers with Intel® AMT.

    • Remote desktop (in-band and Intel® AMT hardware KVM)
    • Remote terminal access (in-band and Intel® AMT serial-over-lan)
    • Remote file access
    • Remote web access
    • Remote power control (in-band and Intel® AMT power control)
    • General monitoring
    • Video chat with Android

    Tutorial Videos

    To help, we have a YouTube playlist with a set of tutorial videos covering many aspects of using Meshcentral. The first two videos, "Getting Started" and "Basic Features", are probably the best way to get a quick introduction to Meshcentral.

    Compatible Tools

    Most people using Meshcentral will only use the web portal, which is feature rich and works on any device with a browser. In addition to the web portal, we have applications and tools that are compatible with Meshcentral. So, if you are already using these tools, you can easily take advantage of remote management over the Internet.

  • Mesh
  • MeshCentral
  • MeshCentral.com
  • windows
  • linux
  • android
  • osx
  • Ylian Saint-Hilaire
  • Ylian
  • Developers
  • Partners
  • Professors
  • Students
  • Android*
  • Apple OS X*
  • Arduino
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • Unix*
  • Yocto Project
  • Business client
  • Cloud services
  • Internet of Things
  • Advanced
  • Beginner
  • Intermediate
  • Enterprise
  • Intel® Atom™ processors
  • Intel® Core™ processors
  • Intel® vPro™ technology
  • Mobility
  • Open source
  • Power efficiency
  • Security
  • Small business
  • Embedded
  • Laptop
  • Phone
  • Server
  • Tablet
  • Desktop
  • URL
  • Submit Your Android Apps to Lenovo!


    Objective: Lenovo, a world leader in the PC market and owner of the CCE brand in Brazil, is looking for Android apps for its Intel technology-based tablets.

    Details: Developers may submit Android apps in any category (e.g., games, kids, education, productivity, etc.).
    Lenovo and Intel will evaluate the apps, and the selected ones may ship on the company's products through the app hub present on the devices. Every app must have at least one monetization model. Lenovo will define, together with the developer, the business model for sharing the revenue generated by the app.

    Benefits: developers of the selected apps may sign a distribution agreement with Lenovo, one of the largest technology companies in the world, and expand their revenue stream from the apps.

    Requirements:
    Developers must meet the following requirements:

    - Offer a tablet app compatible with the Intel Bay Trail CR processor, 1 GB memory, 8 GB flash, Android 4.4, 0.3 MP front camera, 2 MP rear camera, 2400 mAh battery, 5-point multitouch, 1024 x 600 screen resolution, no Bluetooth, Wi-Fi only.

    - The app developed for Lenovo must be differentiated from the version available in stores such as Google Play.

    - Submitted apps must have at least one monetization model.

    - Developers must provide a registration form containing: 

    •       App name:
    •       Category:
    •       Main features and functionality:
    •       Contact name:
    •       Contact details:

    - The revenue-sharing proposal with Lenovo will be made case by case after the app is approved.  

    - Interested parties should send an email titled APP ANDROID LENOVO with the registration form mentioned above and a download link for the app. If the app is already part of the Intel Showroom, the company only needs to send the registration form and the Showroom link in the body of the email.

    Dates: The selection period starts on July 4 and runs until August 15 at 6 p.m.

    Process: The apps will be evaluated by the Lenovo and Intel teams. The selected developers will be notified by email with guidance on the next steps of the partnership.

    Contact emails: Vitor Araujo (varaujo@lenovo.com), Daniel Almeida (dalmeida@lenovo.com) and Juliano Alves (juliano.alves@intel.com).

    More: Only submissions of apps made in Brazil and developed by companies already registered in the Intel® Software Partner Program will be accepted. To register for free, visit: https://software.intel.com/pt-br/grow-business-reports

    RELATED LINKS:
    - Learn more about the Lenovo Developer Program at: http://lenovodev.com
    - For more information on Android app development, visit: https://software.intel.com/pt-br/android
    - For other business opportunities, visit our Brazil page: https://software.intel.com/pt-br/brazil-partners
    - Learn more about why partnering with Lenovo is attractive: Download LSP11102_LenovoDev_OneSheet_030414.pdf.

     

  • business marketing
  • Business opportunity
  • marketing and business
  • mobile marketing
  • sales and marketing
  • games
  • Android app development
  • business
  • Developers
  • Intel AppUp® developers
  • Partners
  • Professors
  • Students
  • Android*
  • Linux*
  • Android*
  • Business client
  • HTML5
  • C/C++
  • HTML5
  • Java*
  • JavaScript*
  • Unity
  • Advanced
  • Intermediate
  • Tablet
  • URL
  • Debugging Intel® Xeon Phi™ Applications on Linux* Host


    Introduction

    The Intel® Xeon Phi™ coprocessor is a product based on the Intel® Many Integrated Core Architecture (Intel® MIC). Intel offers a debug solution for this architecture that can debug applications running on an Intel® Xeon Phi™ coprocessor.

    There are many reasons why a debug solution for Intel® MIC is needed. Some of the most important are the following:

    • Developing native Intel® MIC applications is as easy as for IA-32 or Intel® 64 hosts. In most cases they just need to be cross-compiled (-mmic).
      Yet, the Intel® MIC Architecture differs from the host architecture. Those differences can unveil existing issues, and incorrect tuning for Intel® MIC can introduce new ones (e.g., data alignment, whether an application can handle hundreds of threads, efficient memory consumption, etc.).
    • Developing offload enabled applications induces more complexity, as host and coprocessor share the workload.
    • General lower level analysis, tracing execution paths, learning the instruction set of Intel® MIC Architecture, …

    Debug Solution for Intel® MIC

    For Linux* host, Intel offers a debug solution for Intel® MIC which is based on GNU* GDB. It can be used on the command line for both host and coprocessor. There is also an Eclipse* IDE integration that eases debugging of applications with hundreds of threads thanks to its user interface. It also supports debugging offload enabled applications.

    How to get it?

    There are currently two ways to obtain Intel’s debug solution for Intel® MIC Architecture on Linux* host: as part of Intel® MPSS, or as part of Intel® Composer XE.

    Both packages contain the same debug solutions for Intel® MIC Architecture!

    Why use the provided GNU* GDB from Intel?

    • Capabilities are released back to GNU* community
    • Latest GNU* GDB versions in future releases
    • Improved C/C++ & Fortran support thanks to Project Archer and contribution through Intel
    • Increased support for Intel® architecture (esp. Intel® MIC)
    • Eclipse* IDE integration for C/C++ and Fortran
    • Additional debugging capabilities – more later

    Why is Intel providing a Command Line and Eclipse* IDE Integration?

    The command line with GNU* GDB has the following advantages:

    • Well known syntax
    • Lightweight: no dependencies
    • Easy setup: no project needs to be created
    • Fast for debugging hundreds of threads
    • Can be automated/scripted

    Using the Eclipse* IDE provides more features:

    • Comfortable user interface
    • Most known IDE in the Linux* space
    • Use existing Eclipse* projects
    • Simple integration of the Intel enhanced GNU* GDB
    • Works also with Photran* plug-in to support Fortran
    • Supports debugging of offload enabled applications
      (not supported by command line)

    Deprecation Notice

    Intel® Debugger is deprecated (incl. Intel® MIC Architecture support):

    • Intel® Debugger for Intel® MIC Architecture was only available in Composer XE 2013 & 2013 SP1
    • Intel® Debugger is not part of Intel® Composer XE 2015 anymore

    Users are advised to use the GNU* GDB that comes with Intel® Composer XE 2013 SP1 and later!

    You can provide feedback via either your Intel® Premier account (http://premier.intel.com) or via the Debug Solutions User Forum (http://software.intel.com/en-us/forums/debug-solutions/).

    Features

    Intel’s GNU* GDB, starting with version 7.5, provides additional extensions that are available on the command line:

    • Support for Intel® Many Integrated Core Architecture (Intel® MIC Architecture):
      Displays registers (zmmX & kX) and disassembles the instruction set
    • Support for Intel® Transactional Synchronization Extensions (Intel® TSX):
      Helpers for Restricted Transactional Memory (RTM) model
      (only for host)
    • Data Race Detection (pdbx):
      Detect and locate data races for applications threaded using POSIX* thread (pthread) or OpenMP* models
    • Branch Trace Store (btrace):
      Record branches taken in the execution flow to backtrack easily after events like crashes, signals, exceptions, etc.
      (only for host)
    • Pointer Checker:
      Assist in finding pointer issues if compiled with Intel® C++ Compiler and having Pointer Checker feature enabled
      (only for host)
    • Register support for Intel® Memory Protection Extensions (Intel® MPX) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512):
      Debugger is already prepared for future generations

    The features for Intel® MIC highlighted above are described in the following.

    Register and Instruction Set Support

    Compared to Intel® architecture on host systems, Intel® MIC Architecture comes with a different instruction and register set. Intel’s GNU* GDB comes with transparently integrated support for those.  Use is no different than with host systems, e.g.:

    • Disassembling of instructions:
      
      		(gdb) disassemble $pc, +10
      
      		Dump of assembler code from 0x11 to 0x24:
      
      		0x0000000000000011 <foobar+17>: vpackstorelps %zmm0,-0x10(%rbp){%k1}
      
      		0x0000000000000018 <foobar+24>: vbroadcastss -0x10(%rbp),%zmm0
      
      		⁞
      
      		


      In the above example the first ten instructions are disassembled beginning at the instruction pointer ($pc). Only the first two lines are shown for brevity. The first two instructions are Intel® MIC specific and their mnemonics are shown correctly.
       
    • Listing of mask (kX) and vector (zmmX) registers:
      
      		(gdb) info registers zmm
      
      		k0   0x0  0
      
      		     ⁞
      
      		zmm31 {v16_float = {0x0 <repeats 16 times>},
      
      		      v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
      
      		      v64_int8 = {0x0 <repeats 64 times>},
      
      		      v32_int16 = {0x0 <repeats 32 times>},
      
      		      v16_int32 = {0x0 <repeats 16 times>},
      
      		      v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
      
      		      v4_uint128 = {0x0, 0x0, 0x0, 0x0}}
      
      		


      The register set has also been extended with the kX (mask) and zmmX (vector) registers that come with Intel® MIC.

    If you use the Eclipse* IDE integration you’ll get the same information in dedicated windows:

    • Disassembling of instructions:
      Eclipse* IDE Disassembly Window
    • Listing of mask (kX) and vector (zmmX) registers:
      Eclipse* IDE Register Window

    Data Race Detection

    A quick excursion about what data races are:

    • A data race happens…
      if at least two threads/tasks access the same memory location without synchronization and at least one thread/task is writing.
    • Example:
      Imagine the two functions thread1() and thread2() are executed concurrently by different threads.

      
      		int a = 1;
      
      		int b = 2;
      
      		                                         | t
      
      		int thread1() {      int thread2() {     | i
      
      		  return a + b;        b = 42;           | m
      
      		}                    }                   | e
      
      		                                         v
      
      		


      The return value of thread1() depends on timing: 3 vs. 43!
      This is one (trivial) example of a data race.

    What are typical symptoms of data races?

    • Data race symptoms:
      • Corrupted results
      • Run-to-run variations
      • Corrupted data ending in a crash
      • Non-deterministic behavior
    • Solution is to synchronize concurrent accesses, e.g.:
      • Thread-level ordering (global synchronization)
      • Instruction level ordering/visibility (atomics)
        Note:
        Race free but still not necessarily run-to-run reproducible results!
      • No synchronization: data races might be acceptable

    Intel’s GNU* GDB data race detection can help to analyze correctness.

    How to detect data races?

    • Prepare to detect data races:
      • Only supported with Intel® C++/Fortran Compiler (part of Intel® Composer XE):
        Compile with -debug parallel (icc, icpc or ifort)
        Only objects compiled with -debug parallel are analyzed!
      • Optionally, add debug information via -g
    • Enable data race detection (PDBX) in debugger:
      
      		(gdb) pdbx enable
      
      		(gdb) c
      
      		data race detected
      
      		1: write shared, 4 bytes from foo.c:36
      
      		3: read shared, 4 bytes from foo.c:40
      
      		Breakpoint -11, 0x401515 in L_test_..._21 () at foo.c:36
      
      		*var = 42; /* bp.write */
      
      		

    Data race detection requires an additional library libpdbx.so.5:

    • Keeps track of the synchronizations
    • Part of Intel® C++ & Fortran Compiler
    • Copy to coprocessor if missing
      (found at <composer_xe_root>/compiler/lib/mic/libpdbx.so)

    Supported parallel programming models:

    • OpenMP*
    • POSIX* threads

    Data race detection can be enabled/disabled at any time

    • Only memory accesses are analyzed within a certain period
    • Keeps memory footprint and run-time overhead minimal

    There is finer grained control for minimizing overhead and selecting code sections to analyze by using filter sets.

    More control about what to analyze with filters:

    • Add filter to selected filter set, e.g.:
      
      		(gdb) pdbx filter line foo.c:36
      
      		(gdb) pdbx filter code 0x40518..0x40524
      
      		(gdb) pdbx filter var shared
      
      		(gdb) pdbx filter data 0x60f48..0x60f50
      
      		(gdb) pdbx filter reads # read accesses
      
      		

      Those define various filters on either instructions, by specifying a source file and line or an address (range), or on variables, using symbol names or addresses (ranges) respectively. There is also a filter to only report accesses that use (read) the data in case of a data race.
       
    • There are two basic configurations, which are mutually exclusive:
       
      • Ignore events specified by filters (default behavior)
        
        				(gdb) pdbx fset suppress
        
        				
      • Ignore events not specified by filters
        
        				(gdb) pdbx fset focus
        
        				

        The first one blacklists the code or data sections specified by the filters so they are not analyzed, whilst the latter one defines a white list so that only the specified sections are analyzed.
         
    • Get debug command help
      
      		(gdb) help pdbx
      
      		

      This command will provide additional help on the commands.

    Use cases for filters:

    • Focused debugging, e.g. debug a single source file or only focus on one specific memory location.
    • Limit overhead and control false positives. Detection involves some runtime and memory overhead. The more the filters narrow down the scope of analysis, the more the overhead is reduced. Filters can also be used to exclude false positives. Those can occur when real data races are detected that by design have no impact on the application's correctness (e.g., results of multiple threads don't need to be stored globally in strict order).
    • Exclude third-party code from analysis

    Some additional hints using PDBX:

    • Optimized code (symptom):
      
      		(gdb) run
      
      		data race detected
      
      		1: write question, 4 bytes from foo.c:36
      
      		3: read question, 4 bytes from foo.c:40
      
      		Breakpoint -11, 0x401515 in foo () at foo.c:36
      
      		*answer = 42;
      
      		(gdb)
      
      		

       
    • Incident has to be analyzed further:
      • Remember: data races are reported on memory objects
      • If symbol name cannot be resolved: only address is printed
         
    • Recommendation:
      Unoptimized code (-O0) is easier to understand, because temporaries are not removed or optimized away, etc.
       
    • Reported data races appear to be false positives:
      • Not all data races are bad… user intended?
      • OpenMP*: Distinct parallel sections using the same variable (same stack frame) can result in false positives

    Note:
    PDBX is not available for Eclipse* IDE and will only work for remote debugging of native coprocessor applications. See section Debugging Remotely with PDBX for more information on how to use it.

    Debugging on Command Line

    There are multiple versions available:

    • Debug natively on Intel® Xeon Phi™ coprocessor
    • Execute GNU* GDB on host and debug remotely

    Debug natively on Intel® Xeon Phi™ coprocessor
    This version of Intel’s GNU* GDB runs natively on the coprocessor. It is included in Intel® MPSS only and needs to be made available on the coprocessor first in order to run it. Depending on the MPSS version it can be found at the provided location:

    • MPSS 2.1: /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb
    • MPSS 3.[1|2]: included in gdb-7.5+mpss3.*.k1om.rpm as part of package mpss-3.*-k1om.tar
      (for MPSS 3.1.2, please see Errata, for MPSS 3.1.4 use mpss-3.1.4-k1om-gdb.tar)

      For MPSS 3.[1|2] the coprocessor native GNU* GDB requires debug information from some system libraries for proper operation. Please see Errata for more information.

    Execute GNU* GDB on host and debug remotely
    There are two ways to start GNU* GDB on the host and debug remotely using GDBServer on the coprocessor:

    • Intel® MPSS:
      • MPSS 2.1: /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
      • MPSS 3.[1|2]: <mpss_root>/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux/k1om-mpss-linux-gdb
      • GDBServer:
        /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver
        (same path for MPSS 2.1 & 3.[1|2])
    • Intel® Composer XE:
      • Source environment to start GNU* GDB:
        
        				$ source debuggervars.[sh|csh]
        
        				$ gdb-mic
        
        				
      • GDBServer:
        <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver

    The sourcing of the debugger environment is only needed once. If you already sourced the according compilervars.[sh|csh] script you can omit this step and gdb-mic should already be in your default search paths.

    Attention: Do not mix GNU* GDB & GDBServer from different packages! Always use both from either Intel® MPSS or Intel® Composer XE!

    Debugging Natively

    1. Make sure GNU* GDB is already on the target, either:
    • Copy manually, e.g.:
      
      		$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb mic0:/tmp
      
      		
    • Add to the coprocessor image (see Intel® MPSS documentation)
       
    2. Run GNU* GDB on the Intel® Xeon Phi™ coprocessor, e.g.:
      
      		$ ssh -t mic0 /tmp/gdb
      
      		

       
    3. Initiate debug session, e.g.:
    • Attach:
      
      		(gdb) attach <pid>

      <pid> is PID on the coprocessor
    • Load & execute:
      
      		(gdb) file <path_to_application>

      <path_to_application> is path on coprocessor

    Some additional hints:

    • If native application needs additional libraries:
      Set $LD_LIBRARY_PATH, e.g. via:
      
      		(gdb) set env LD_LIBRARY_PATH=/tmp/
      
      		

      …or set the variable before starting GDB
       
    • If source code is relocated, help the debugger to find it:
      
      		(gdb) set substitute-path <from> <to>

      Change paths from <from> to<to>. You can relocate a whole source (sub-)tree with that.

    Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!

    Debugging Remotely

    1. Copy GDBServer to coprocessor, e.g.:
      
      		$ scp <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver mic0:/tmp

      During development you can also add GDBServer to your coprocessor image!
       
    2. Start GDB on host, e.g.:
      
      		$ source debuggervars.[sh|csh]
      
      		$ gdb-mic
      
      		


      Note:
      There is also a version named gdb-ia which is for IA-32/Intel® 64 only!
       
    3. Connect:
      
      		(gdb) target extended-remote | ssh -T mic0 /tmp/gdbserver --multi -
      
      		

       
    4. Set sysroot from MPSS installation, e.g.:
      
      		(gdb) set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
      
      		

      If you do not specify this you won't get debugger support for system libraries.
       
    5. Debug:
    • Attach:
      
      		(gdb) file <path_to_application>
      
      		(gdb) attach <pid>

      <path_to_application> is path on host, <pid> is PID on the coprocessor
    • Load & execute:
      
      		(gdb) file <path_to_application>
      
      		(gdb) set remote exec-file <remote_path_to_application>

      <path_to_application> is path on host, <remote_path_to_application> is path on the coprocessor

    Some additional hints:

    • If remote application needs additional libraries:
      Set $LD_LIBRARY_PATH, e.g. via:
      
      		(gdb) target extended-remote | ssh mic0 LD_LIBRARY_PATH=/tmp/ /tmp/gdbserver --multi -
      
      		
    • If source code is relocated, help the debugger to find it:
      
      		(gdb) set substitute-path <from> <to>

      Change paths from <from> to <to>. You can relocate a whole source (sub-)tree with that.
       
    • If libraries have different paths on host & target, help the debugger to find them:
      
      		(gdb) set solib-search-path <lib_paths>

      <lib_paths> is a colon separated list of paths to look for libraries on the host

    Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!

    Debugging Remotely with PDBX

    PDBX has some pre-requisites that must be fulfilled for proper operation. Use the pdbx check command to see whether PDBX is working:

    1. First step:
      
      		(gdb) pdbx check
      
      		checking inferior...failed.
      
      		


      Solution:
      Start a remote application (inferior) and hit some breakpoint (e.g., b main and run)
       
    2. Second step:
      
      		(gdb) pdbx check
      
      		checking inferior...passed.
      
      		checking libpdbx...failed.
      
      		


      Solution:
      Use set solib-search-path <lib_paths> to provide the path of libpdbx.so.5 on the host.
       
    3. Third step:
      
      		(gdb) pdbx check
      
      		checking inferior...passed.
      
      		checking libpdbx...passed.
      
      		checking environment...failed.
      
      		


      Solution:
      Set additional environment variables on the target for OpenMP*. Those need to be set when starting GDBServer (similar to setting $LD_LIBRARY_PATH).
    • $INTEL_LIBITTNOTIFY32=""
    • $INTEL_LIBITTNOTIFY64=""
    • $INTEL_ITTNOTIFY_GROUPS=sync

    Debugging with Eclipse* IDE

    Intel offers an Eclipse* IDE debugger plug-in for Intel® MIC that has the following features:

    • Seamless debugging of host and coprocessor
    • Simultaneous view of host and coprocessor threads
    • Supports multiple coprocessor cards
    • Supports both C/C++ and Fortran
    • Support of offload extensions (auto-attach to offloaded code)
    • Support for Intel® Many Integrated Core Architecture (Intel® MIC Architecture): Registers & Disassembly

    Eclipse* IDE with Offload Debug Session

    The plug-in is part of both Intel® MPSS and Intel® Composer XE.

    Pre-requisites

    In order to use the provided plug-in the following pre-requisites have to be met:

    • Supported Eclipse* IDE version:
      • 4.2 with Eclipse C/C++ Development Tools (CDT) 8.1 or later
      • 3.8 with Eclipse C/C++ Development Tools (CDT) 8.1 or later
      • 3.7 with Eclipse C/C++ Development Tools (CDT) 8.0 or later

    We recommend: Eclipse* IDE for C/C++ Developers (4.2)

    • Java* Runtime Environment (JRE) 6.0 or later
    • For Fortran optionally Photran* plug-in
    • Remote System Explorer (aka. Target Management) to debug native coprocessor applications
    • Only for plug-in from Intel® Composer XE, source debuggervars.[sh|csh] for Eclipse* IDE environment!

    Install Intel® C++ Compiler plug-in (optional):
    Add plug-in via “Install New Software…”:
    This Plug-in is part of Intel® Composer XE (<composer_xe_root>/eclipse_support/cdt8.0/). It adds Intel® C++ Compiler support which is not mandatory for debugging. For Fortran the counterpart is the Photran* plug-in. These plug-ins are recommended for the best experience.

    Note:
    Uncheck “Group items by category”, as the list will be empty otherwise!

    Install Plug-in for Offload Debugging

    Add plug-in via “Install New Software…”:
    Install Plug-in for Offload Debugging

    Plug-in is part of:

    • Intel® MPSS:
      • MPSS 2.1: <mpss_root>/eclipse_support/
      • MPSS 3.[1|2]: /usr/share/eclipse/mic_plugin/
    • Intel® Composer XE:<composer_xe_root>/debugger/cdt/

    Configure Offload Debugging

    • Create a new debug configuration for “C/C++ Application”
    • Click on “Select other…” and select MPM (DSF) Create Process Launcher:Configure Offload Debugging
      The “MPM (DSF) Create Process Launcher” must be used for our plug-in. Please note that this instruction applies to both C/C++ and Fortran applications! Even if Photran* is installed and a “Fortran Local Application” entry is visible (not in the screenshot above!), don’t use it: it is not capable of using MPM.
       
    • In “Debugger” tab specify MPM script of Intel’s GNU* GDB:
      • Intel® MPSS:
        • MPSS 2.1: <mpss_root>/mpm/bin/start_mpm.sh
        • MPSS 3.[1|2]: /usr/bin/start_mpm.sh
          (for MPSS 3.1.1, 3.1.2 or 3.1.4, please see Errata)
      • Intel® Composer XE:
        <composer_xe_root>/debugger/mpm/bin/start_mpm.sh
        Configure Offload Debugging (Debugger)
        Here, you finally add Intel’s GNU* GDB for offload debugging (using MPM (DSF)). It is a script that takes care of setting up the full environment needed. No further configuration is required (e.g. which coprocessor cards, GDBServer & ports, IP addresses, etc.); it works fully automatically and transparently.

    Start Offload Debugging

    Debugging offload-enabled applications is not much different from debugging applications native to the host:

    • Create & build an executable with offload extensions (C/C++ or Fortran)
    • Don’t forget to add debug information (-g) and reduce optimization level if possible (-O0)
    • Start debug session:
      • Host & target debugger will work together seamlessly
      • All threads from host & target are shown and described
      • Debugging works the same as usual from the Eclipse* IDE
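    The first two bullets can be sketched as follows; the source file, its contents, and the availability of icpc on PATH are illustrative assumptions:

```shell
# Create a tiny offload-enabled C++ program (illustrative; the braced block runs on the coprocessor).
cat > offload_sample.cpp <<'EOF'
#include <cstdio>
int main() {
    int x = 0;
    #pragma offload target(mic) inout(x)
    { x = 42; }
    std::printf("x = %d\n", x);
    return 0;
}
EOF
# Build with debug info (-g) and without optimization (-O0); skipped if icpc is unavailable.
command -v icpc >/dev/null 2>&1 && icpc -g -O0 -o offload_sample offload_sample.cpp \
  || echo "icpc not found; build skipped"
```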

    Eclipse* IDE with Offload Debug Session (Example)

    This is an example (Fortran) of what offload debugging looks like. On the left side we see host & mic0 threads running. One thread (11) from the coprocessor has hit the breakpoint we set inside the loop of the offloaded code. Run control (stepping, continuing, etc.), setting breakpoints, evaluating variables/memory, … work as they used to.

    Additional Requirements for Offload Debugging

    For debugging offload-enabled applications, additional environment variables need to be set:

    • Intel® MPSS 3.[1|2]:
      AMPLXE_COI_DEBUG_SUPPORT=TRUE
      MYO_WATCHDOG_MONITOR=-1

       
    • Intel® MPSS 2.1:
      COI_SEP_DISABLE=FALSE
      MYO_WATCHDOG_MONITOR=-1

    Set those variables before starting Eclipse* IDE!
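    For the MPSS 3.[1|2] case, a minimal sketch; run these exports in the shell from which you will start Eclipse* IDE:

```shell
# Must be set before Eclipse* IDE is started from this shell (MPSS 3.[1|2] case).
# (For MPSS 2.1, the first variable would be COI_SEP_DISABLE=FALSE instead.)
export AMPLXE_COI_DEBUG_SUPPORT=TRUE
export MYO_WATCHDOG_MONITOR=-1
echo "$AMPLXE_COI_DEBUG_SUPPORT $MYO_WATCHDOG_MONITOR"
```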

    Those are currently needed but might become obsolete in the future. Please be aware that the debugger cannot and should not be used in combination with Intel® VTune™ Amplifier XE; hence disabling SEP (a part of Intel® VTune™ Amplifier XE) is valid. The watchdog monitor must be disabled because a debugger can stop execution for an unspecified amount of time; the system watchdog might otherwise assume that a debugged application that no longer reacts is dead and terminate it. For debugging we do not want that.

    Note:
    Do not set those variables for a production system!

    For Intel® MPSS 3.2 and later:
    MYO debug libraries are no longer installed with Intel MPSS 3.2 by default. This is a change from earlier Intel MPSS versions. Users must install the MYO debug libraries manually in order to debug MYO enabled applications using the Eclipse plug-in for offload debugging. For Intel MPSS 3.2 (and later) the MYO debug libraries can be found in the package mpss-myo-dbg-* which is included in the mpss-*.tar file.

    MPSS 3.2 and later does not support offload debugging with Intel® Composer XE 2013 SP1, please see Errata for more information!

    Configure Native Debugging

    Configure Remote System Explorer
    To debug native coprocessor applications we need to configure the Remote System Explorer (RSE).

    Note:
    Before you continue, make sure SSH works (e.g. via command line). You can also specify different credentials (user account) via RSE and save the password.

    The basic steps are quite simple:

    1. Show the Remote System window:
      Menu Window->Show View->Other…
      Select: Remote Systems->Remote Systems
       
    2. Add a new system node for each coprocessor:
      RSE Remote Systems Window
      Context menu in window Remote Systems: New Connection…
    • Select Linux, press Next>
    • Specify hostname of the coprocessor (e.g. mic0), press Next>
    • In the following dialogs select:
      • ssh.files
      • processes.shell.linux
      • ssh.shells
      • ssh.terminals

    Repeat this step for each coprocessor!

    Transfer GDBServer
    Transfer of the GDBServer to the coprocessor is required for remote debugging. We choose /tmp/gdbserver as the target on the coprocessor here (important for the following sections).

    Transfer the GDBServer to the coprocessor target, e.g.:

    
    	$ scp <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver mic0:/tmp

    During development you can also add GDBServer to your coprocessor image!

    Note:
    See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel® MPSS or Intel® Composer XE)!

    Debug Configuration

    Eclipse* IDE Debug Configuration Window

    To debug a native coprocessor application (here: native_c++), create a new debug configuration for C/C++ Remote Application.

    Set Connection to the coprocessor target configured with RSE before (here: mic0).

    Specify the remote path of the application, wherever it was copied to (here: /tmp/native_c++). We’ll address how to manually transfer files later.

    Set the flag “Skip download to target path.” if you don’t want the debugger to upload the executable to the specified path. This can be meaningful for complex projects with external dependencies (e.g. libraries) where you transfer the binaries yourself anyway.
    (for MPSS 3.1.2 or 3.1.4, please see Errata)

    Note that we use C/C++ Remote Application here. This also applies to Fortran applications, because the Photran* plug-in does not provide a remote debug configuration section!

    Eclipse* IDE Debug Configuration Window (Debugger)

    In Debugger tab, specify the provided Intel GNU* GDB for Intel® MIC (here: gdb-mic).

    Eclipse* IDE Debug Configuration Window (Debugger) -- Specify .gdbinit

    In the above example, set sysroot from MPSS installation in .gdbinit, e.g.:

    
    	set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
    
    	

    You can use .gdbinit or any other command file that should be loaded before starting the debugging session. If you do not specify this, you won't get debugger support for system libraries.
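    Writing such a command file can be scripted; a minimal sketch (the sysroot path assumes the MPSS 3.1.4 install used above):

```shell
# Write a minimal command file for gdb-mic; load it via .gdbinit or gdb's -x option.
cat > mic-gdbinit <<'EOF'
set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
EOF
```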

    Note:
    See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel® MPSS or Intel® Composer XE)!

    Eclipse* IDE Debug Configuration Window (Debugger/GDBServer)

    In Debugger/Gdbserver Settings tab, specify the uploaded GDBServer (here: /tmp/gdbserver).

    Build Native Application for the Coprocessor

    Configuration depends on the installed plug-ins. For C/C++ applications we recommend installing the Intel® C++ Compiler XE plug-in that comes with Composer XE. For Fortran, install Photran* (3rd party) and select the Intel® Fortran Compiler manually.

    Make sure to use the debug configuration and provide options as if debugging on the host (-g). Optionally, disabling optimizations with -O0 can make the instruction flow comprehensible when debugging.

    The only difference compared to host builds is that you need to cross-compile for the coprocessor: use the -mmic option, e.g.:
    Eclipse* IDE Project Properties
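    On the command line, the equivalent of this dialog is the usual host debug build plus -mmic; a sketch (compiler availability and file names are assumptions):

```shell
# Cross-compile for the coprocessor: same flags as a host debug build, plus -mmic.
MIC_FLAGS="-g -O0 -mmic"
command -v icpc  >/dev/null 2>&1 && icpc  $MIC_FLAGS -o native_c++ native_c++.cpp
command -v ifort >/dev/null 2>&1 && ifort $MIC_FLAGS -o native_f90 native_f90.f90
echo "flags used: $MIC_FLAGS"   # -mmic applies to both compile and link steps
```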

    After configuration, clean your build. This is needed because Eclipse* IDE might not notice all dependencies. And finally, build.

    Note:
    The configuration dialog shown only exists for the Intel® C++ Compiler plug-in. For Fortran, users need to install the Photran* plug-in and switch the compiler/linker to ifort by hand, adding -mmic manually. This has to be done for both the compiler & linker!

    Start Native Debugging

    Transfer the executable to the coprocessor, e.g.:

    • Copy manually  (e.g. via script on the terminal)
    • Use the Remote Systems window (RSE) to copy files from host and paste to coprocessor target (e.g. mic0):
      RSE Remote Systems Window (Copy)
      Select the files from the tree (Local Files) and paste them to where you want them on the target to be (e.g. mic0)
       
    • Use NFS to mirror builds to coprocessor (no need for update)
    • Use debugger to transfer (see earlier)

    Note:
    It is crucial that the executable can be executed on the coprocessor. In some cases the execution bits might not be set after copying.
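    A quick local sketch of checking and fixing the execute bit; on the real target you would run the same chmod on the coprocessor (e.g. over ssh to mic0), and the file name here is only an example:

```shell
# Stand-in for the copied binary; on the coprocessor this would be e.g. /tmp/native_c++.
touch native_c++
chmod +x native_c++
test -x native_c++ && echo "execute bit set"
```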

    Start debugging using the C/C++ Remote Application created in the earlier steps. It should connect to the coprocessor target and launch the specified application via the GDBServer. Debugging is the same as for local/host applications.
    Native Debugging Session (Remote)

    Note:
    This works for coprocessor native Fortran applications the exact same way!

    Documentation

    More information can be found in the official documentation:

    • Intel® MPSS:
      • MPSS 2.1:
        <mpss_root>/docs/gdb/gdb.pdf
        <mpss_root>/eclipse_support/README-INTEL
      • MPSS 3.[1|2]:
        not available yet (please see Errata)
    • Intel® Composer XE:
      <composer_xe_root>/Documentation/[en_US|ja_JP]/debugger/gdb/gdb.pdf
      <composer_xe_root>/Documentation/[en_US|ja_JP]/debugger/gdb/eclmigdb_config_guide.pdf

    The PDF gdb.pdf is the original GNU* GDB manual for the base version Intel ships, extended with all added features. So, this is the place to get help for new commands, behavior, etc.
    README-INTEL from Intel® MPSS contains a short guide on how to install and configure the Eclipse* IDE plug-in.
    The PDF eclmigdb_config_guide.pdf provides an overall step-by-step guide on how to debug with the command line and with the Eclipse* IDE.

    Using Intel® C++ Compiler with the Eclipse* IDE on Linux*:
    http://software.intel.com/en-us/articles/intel-c-compiler-for-linux-using-intel-compilers-with-the-eclipse-ide-pdf/
    The knowledgebase article (Using Intel® C++ Compiler with the Eclipse* IDE on Linux*) is a step-by-step guide on how to install, configure, and use the Intel® C++ Compiler with the Eclipse* IDE.

    Errata

    • With the recent switch from MPSS 2.1 to 3.1 some packages might be incomplete or missing. Future updates will add improvements. Currently, documentation for GNU* GDB is missing.
       
    • For MPSS 3.1.2 and 3.1.4 the respective package mpss-3.1.[2|4]-k1om.tar is missing. It contains binaries for the coprocessor, like the native GNU* GDB for the coprocessor. It also contains /usr/libexec/sftp-server which is needed if you want to debug native applications on the coprocessor and require Eclipse* IDE to transfer the binary automatically. As this is missing you need to transfer the files manually (select “Skip download to target path.” in this case).
      As a workaround, you can use mpss-3.1.1-k1om.tar from MPSS 3.1.1 and install the binaries from there. If you use MPSS 3.1.4, the native GNU* GDB is available separately via mpss-3.1.4-k1om-gdb.tar.
       
    • With MPSS 3.1.1, 3.1.2 or 3.1.4 the script <mpss_root>/mpm/bin/start_mpm.sh uses an incorrect path to the MPSS root directory. Hence offload debugging is not working. You can fix this by creating a symlink for your MPSS root, e.g. for MPSS 3.1.2:

      $ ln -s /opt/mpss/3.1.2 /opt/mpss/3.1

      Future versions of MPSS will correct this. This workaround is not required if you use the start_mpm.sh script from the Intel® Composer XE package.
       
    • For MPSS 3.[1|2] the coprocessor native GNU* GDB requires debug information from some system libraries for proper operation.
      Beginning with MPSS 3.1, debug information for system libraries is no longer installed on the coprocessor. If the coprocessor native GNU* GDB is executed, it will fail when loading/continuing with a signal (SIGTRAP).
      Current workaround is to copy the .debug folders for the system libraries to the coprocessor, e.g.:

      $ scp -r /opt/mpss/3.1.2/sysroots/k1om-mpss-linux/lib64/.debug root@mic0:/lib64/
       
    • MPSS 3.2 and 3.2.1 do not support offload debugging with Intel® Composer XE 2013 SP1.
      Offload debugging with the Eclipse plug-in from Intel® Composer XE 2013 SP1 does not work with Intel MPSS 3.2 and 3.2.1. A configuration file which is required for operation by the Intel Composer XE 2013 SP1 package has been removed with Intel MPSS 3.2 and 3.2.1. Previous Intel MPSS versions are not affected. Intel MPSS 3.2.3 fixes this problem (there is no version Intel MPSS 3.2.2!).