
Miracast* on Windows* 8.1 Desktop


Download PDF
 
Code Sample

Executive Summary

Many features of the Intel WiDi extensions library have migrated into Microsoft's Miracast implementation, which is part of Windows* 8.1. This white paper discusses several techniques for supporting Miracast in Windows 8.1 desktop applications using the Intel® Media SDK and OpenGL*. Supporting Miracast in Windows Store apps is not covered, as those require a completely different framework.

System Requirements

The sample code was written in Visual Studio* 2013 to demonstrate two things: (1) Miracast and (2) Intel® Media SDK / OpenGL* texture sharing, in which decoded surfaces are shared with OpenGL textures without any copying, significantly improving efficiency. The MJPEG decoder is hardware accelerated on 4th generation Intel® Core™ processors (codenamed Haswell) and later; on earlier processors the Intel Media SDK automatically falls back to its software decoder. Either way, an MJPEG-capable camera (onboard or USB) is required.

Aside from identifying the Miracast connection type, most of the techniques used in the sample code and this white paper also work with Visual Studio 2012. The sample is based on the Intel Media SDK 2014 for Clients, which can be downloaded from http://software.intel.com/sites/default/files/MediaSDK2014Clients.zip. Installing the Intel Media SDK automatically sets up a set of environment variables so Visual Studio can find the correct paths for headers and libraries.

Application Overview

The application takes camera input as MJPEG, decodes the MJPEG video, encodes the stream to H264, and then decodes the H264. Both the decoded MJPEG camera stream and the final decoded H264 stream are displayed in an MFC-based GUI. On a Haswell system, the two decoders and one encoder (at 1080p resolution) run sequentially for readability, but hardware acceleration makes them fast enough that the camera speed becomes the limiting factor on fps. In a real application the encoder and decoders would run in separate threads, so performance would not be an obstacle.

On a single-monitor configuration, the camera feed is displayed as a PIP above the decoded H264 video in the OpenGL-based GUI (Figure 1). When Miracast is connected, the software automatically identifies the Miracast-connected display and plays the decoded H264 video there in a full-screen window, while the GUI shows the raw camera video, making the difference between the original and the encoded video clearly visible. Finally, the View->Monitor Topology menu not only shows the current monitor topology as radio buttons but can also change it. Unfortunately, it cannot start a Miracast connection; that can only be done from the OS charm menu (swipe from the right -> Devices -> Project), and there is currently no API that creates a Miracast connection. Interestingly, setting the monitor topology to internal-only disconnects Miracast. If multiple monitors are connected by wire, the menu can change the topology at any time.


Figure 1. Single-monitor topology. The MJPEG camera stream is shown at the bottom right. The H264-encoded video plays in the GUI. When multiple displays are enabled (e.g., Miracast), the software detects the change and automatically sends the MJPEG camera video and the H264-encoded video to separate displays.

Detecting Display Topology Changes

When the OS detects a change in the display configuration, such as an external display being added or removed (Miracast connecting or disconnecting), it sends a WM_DISPLAYCHANGE message to the top-level window. In the sample code, the top-level window is the CMainFrame class, and its OnDisplayChange member function handles the message. Because multi-monitor transitions involve a short delay, the OnDisplayChange handler first disables all activity that updates internal data structures, such as the camera feed and everything downstream, then starts a timer to allow enough time for the display configuration to switch. The QueryDisplayConfig API is used to learn the topology; it provides an array of display information (including the position and size of each display, which is essential if you want a full-screen window on a particular display) as well as the topology type (internal, clone, extend, external, etc.). These functions are wrapped in the CDisplayHelper class, which is used by the OnTimer function started by the OnDisplayChange handler. Once the topology has been reconfigured, the handler restarts internal activity and resumes the camera feed.
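The pause-then-resume flow described above can be sketched as a small state machine. This is an illustrative stand-in, not the sample's actual CMainFrame/CDisplayHelper code: the class and member names below are hypothetical, and the injected query function stands in for QueryDisplayConfig.

```cpp
#include <functional>
#include <utility>

// On a display change we pause the pipeline immediately; a timer fires
// later (after the OS has settled), re-queries the topology, and resumes.
class TopologyWatcher {
public:
    explicit TopologyWatcher(std::function<int()> queryTopology)
        : query_(std::move(queryTopology)) {}

    // WM_DISPLAYCHANGE analogue: stop the camera feed and downstream stages.
    void OnDisplayChange() { pipelineRunning_ = false; timerArmed_ = true; }

    // Timer analogue: reconfigure from the new topology and resume.
    void OnTimer() {
        if (!timerArmed_) return;
        topology_ = query_();    // QueryDisplayConfig stand-in
        timerArmed_ = false;
        pipelineRunning_ = true; // restart the camera feed
    }

    bool running() const { return pipelineRunning_; }
    int topology() const { return topology_; }

private:
    std::function<int()> query_;
    bool pipelineRunning_ = true;
    bool timerArmed_ = false;
    int topology_ = 0;
};
```

In the real sample the "pause" also stops the camera thread and the Media SDK sessions; here a flag simply marks the pipeline state.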

Changing the Display Topology

To change the display topology, call SetDisplayConfig (not QueryDisplayConfig). This generates a series of events, including WM_DISPLAYCHANGE, which is handled by OnDisplayChange just as if a display had been physically connected or disconnected. The function is wrapped in CDisplayHelper::SetCurrentTopology and is used, for example, in the CMainFrame::OnMonitortopologyRange handler when the user clicks a radio menu item.

Caveats for Multi-Monitor Topology Changes

In theory, showing another window on the external display and controlling it from topology-change detection seems straightforward. In practice, it takes time for the OS to start the transition, complete the display configuration, and show the content. When encoders/decoders/D3D/OpenGL and their quirks are involved, debugging the timing between internal processing and the GUI gets very complicated. For example, if the camera keeps feeding the playback, decode, and encode pipeline while no actual display is connected, the system can crash in ways that are hard to recover from. This sample tries to reuse most of the pipeline across the switch, but it would be simpler to shut the whole pipeline down and restart it, because anything can go wrong during the 10+ seconds it takes to add a display, even over an HDMI or VGA connection.

Future Work

This white paper handles video well on multiple displays, including Miracast. However, it does not handle audio when the external display has its own speakers, and a Miracast display is typically a big-screen TV with built-in speakers. We plan to add audio switching in the future.

 

Intel, the Intel logo, and Core are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.

Copyright © 2014 Intel Corporation. All rights reserved.


  • Sharing Textures from the Intel® Media SDK to OpenGL*


    Code Sample

    Executive Summary

    On Windows* OS, video processing is usually done with Direct3D. Many applications, however, use OpenGL* for its excellent cross-platform capability, providing the same GUI and look and feel on different platforms. Recent Intel graphics drivers support sharing surfaces from D3D to OpenGL through the NV_DX_interop extension, and this can be used together with the Intel® Media SDK. With the Intel Media SDK configured to use Direct3D and NV_DX_interop added, OpenGL can consume the Media SDK's frame buffers without the expensive round trip of copying textures from the GPU to the CPU and back to the GPU. This sample code and white paper demonstrate how to set up the Intel Media SDK to encode and decode with D3D, color-convert from the NV12 color space (the Media SDK's native format) to the RGBA color space (OpenGL's native format), and then map the D3D surface to an OpenGL texture. The pipeline completely bypasses copying textures from the GPU to the CPU, formerly the biggest bottleneck when using OpenGL with the Intel Media SDK.

    System Requirements

    The sample code was written in Visual Studio* 2013 to (1) demonstrate Miracast and (2) implement Intel® Media SDK / OpenGL texture sharing, in which decoded surfaces are shared with OpenGL textures without any copying, significantly improving efficiency. The MJPEG decoder is hardware accelerated on Haswell and later processors; on earlier processors the Media SDK automatically falls back to its software decoder. Either way, an MJPEG-capable camera (onboard or USB) is required.
    Aside from identifying the Miracast connection type, most of the techniques in the sample code and this white paper also work with Visual Studio 2012. The sample is based on the Intel Media SDK 2014 for Clients, which can be downloaded from https://software.intel.com/sites/default/files/MediaSDK2014Clients.zip. Installing the Media SDK creates a set of environment variables so Visual Studio can find the correct paths for headers and libraries.

    Application Overview

    The application takes camera input as MJPEG, decodes the MJPEG video, encodes the stream to H264, and then decodes the H264. Both the decoded MJPEG camera stream and the final decoded H264 stream are displayed in an MFC-based GUI. On a Haswell system, the two decoders and one encoder (at 1080p resolution) run sequentially for readability, but hardware acceleration makes them fast enough that the camera speed becomes the only limiting factor on fps. In a real application the encoder and decoders would run in separate threads, so performance would not be an obstacle.

    On a single-monitor configuration, the camera feed is displayed as a PIP above the decoded H264 video in the OpenGL-based GUI (Figure 1). When Miracast is connected, the software automatically identifies the Miracast-connected display and plays the decoded H264 video there in a full-screen window, while the GUI shows the raw camera video, making the difference between the original and the encoded video clearly visible. Finally, the View->Monitor Topology menu not only shows the current monitor topology but can also change it. Unfortunately, it cannot start a Miracast connection; that can only be done from the OS charm menu (swipe from the right -> Devices -> Project), and there is currently no API that creates a Miracast connection. Interestingly, setting the monitor topology to internal-only disconnects Miracast. If multiple monitors are connected by wire, the menu can change the topology at any time.

    Figure 1. Single-monitor topology. The MJPEG camera is shown at the bottom right. The H264-encoded video plays in the GUI. When multiple displays are enabled (e.g., Miracast), the software detects the change and automatically sends the MJPEG camera video and the H264-encoded video to separate displays.

    Main Entry Point for Pipeline Setup

    The sample code is MFC based, and the main entry point for setting up the pipeline is CChildView::OnCreate(), which starts the camera, the MJPEG-to-H264 transcoder, and the H264 decoder, and binds the textures from the transcoder and decoder to the OpenGL renderer. The transcoder is simply a subclass of the decoder that adds an encoder on top of the base decoder. Finally, OnCreate starts the thread that continuously grabs serialized camera feeds. After a camera feed is read in the worker thread, a message is posted to the OnCamRead function, which decodes the MJPEG, encodes to H264, decodes the H264, and updates the textures for the OpenGL renderer. At the top level, the whole pipeline is clean, simple, and easy to follow.

    Initializing the Decoder/Transcoder

    Both the decoder and the transcoder are initialized with D3D9Ex. The Intel® Media SDK can be configured to use software, D3D9, or D3D11; in this sample, D3D9 is used to simplify color conversion. The Media SDK's native color format is NV12, and either IDirect3DDevice9::StretchRect or IDirectXVideoProcessor::VideoProcessBlt can convert the color space to RGBA. For simplicity this white paper uses StretchRect, but in general VideoProcessBlt is recommended because of its additional post-processing capabilities. Unfortunately, D3D11 does not support StretchRect, so color conversion there can be quite involved. Also, the decoder and transcoder in this article use separate D3D devices for experimentation, such as mixing software and hardware, but a single D3D device could be shared between them to save memory. With the pipeline set up this way, the decoder output is of type (mfxFrameSurface1 *). This is just a wrapper for D3D9: mfxFrameSurface1->Data.MemId can be cast to (IDirect3DSurface9 *) and used after decoding with StretchRect or VideoProcessBlt in the CDecodeD3d9::ColorConvert function. The Media SDK's output surfaces cannot be shared directly, but color conversion is required before OpenGL can use them anyway, so a shared surface is created to hold the color-converted result.
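    For intuition about what that conversion computes, here is the per-pixel BT.601 limited-range NV12-to-RGBA arithmetic in plain C++. This is only an illustration of the math; in the sample the conversion runs on the GPU via StretchRect/VideoProcessBlt, and the helper names below are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// CPU illustration of BT.601 limited-range YUV (NV12) -> RGBA.
struct Rgba { uint8_t r, g, b, a; };

static uint8_t Clamp8(float v) {
    // Clamp to [0, 255] and round to the nearest integer.
    return static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, v)) + 0.5f);
}

Rgba Nv12ToRgba(uint8_t y, uint8_t u, uint8_t v) {
    const float c = 1.164f * (static_cast<int>(y) - 16); // scale luma from [16,235]
    const float d = static_cast<float>(u) - 128.0f;      // center chroma
    const float e = static_cast<float>(v) - 128.0f;
    return Rgba{ Clamp8(c + 1.596f * e),                 // R
                 Clamp8(c - 0.392f * d - 0.813f * e),    // G
                 Clamp8(c + 2.017f * d),                 // B
                 255 };                                  // opaque alpha
}
```

Nominal black (Y=16, U=V=128) maps to (0, 0, 0) and nominal white (Y=235, U=V=128) to (255, 255, 255), which is the behavior the hardware blit reproduces per pixel.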

    Initializing the Transcoder

    The transcoder's decode output is fed directly into the encoder; make sure the surfaces are allocated with MFX_MEMTYPE_FROM_DECODE.

    Binding Textures Between D3D and OpenGL

    The texture-binding code is in the CRenderOpenGL::BindTexture function. Make sure WGLEW_NV_DX_interop is defined, then use wglDXOpenDeviceNV, wglDXSetResourceShareHandleNV, and wglDXRegisterObjectNV. This binds the D3D surface to an OpenGL texture. It does not, however, update the texture automatically; calling wglDXLockObjectsNV / wglDXUnlockObjectsNV performs the update (see CRenderOpenGL::UpdateCamTexture and CRenderOpenGL::UpdateDecoderTexture). Once updated, the texture can be used like any other OpenGL texture.

    Caveats for Multi-Monitor Topology Changes

    In theory, putting another window on the external display and controlling it from topology-change detection seems straightforward. In practice, it takes time for the OS to start the transition, complete the display configuration, and show the content. When encoders/decoders/D3D/OpenGL and their quirks are involved, debugging gets very complicated. This sample tries to reuse most of the pipeline across the switch, but in practice it is simpler to shut the whole pipeline down and restart it, because anything can go wrong during the 10+ seconds it takes to add a display, even over an HDMI or VGA connection.

    Future Work

    The sample code in this white paper targets D3D9 and does not support a D3D11 implementation. It is not yet clear what the most efficient way is to convert the color space from NV12 to RGBA without StretchRect or VideoProcessBlt. The white paper and code will be updated once a D3D11 implementation is released.

    Acknowledgements

    Thanks to Petter Larsson, Michel Jeronimo, Thomas Eaton, and Piotr Bialecki for their contributions to this article.

     

    Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and other countries.
    * Other names and brands may be claimed as the property of others.

Copyright © 2013 Intel Corporation. All rights reserved.

  • Android* Tutorial: How to Write a Multithreaded Application with Intel® Threading Building Blocks


    We recently published the “Windows* 8 Tutorial: How to Write a Multithreaded Application for the Windows Store* with Intel® Threading Building Blocks.” In that guide we noted that the parallel compute engine could easily be ported to other mobile or desktop platforms. Android is a good example of such a mobile platform.

    In a recently published stable release of Intel Threading Building Blocks (Intel® TBB), we added experimental support for Android applications, that is, Intel TBB libraries for use in Android applications through the JNI interface. The release can be downloaded from threadingbuildingblocks.org.

    To get started on a Linux* host, unpack the Intel TBB source distribution, source the <unpacked_dir>/build/android_setup.csh script, and build the libraries. Building them is necessary because development releases are distributed in source form only. The file <unpacked_dir>/build/index.android.html contains instructions for configuring the environment and building the library on Linux.

    Assuming gnu make 3.81 is on %PATH% (on a Microsoft Windows* host) or $PATH (on a Linux host), issue the following command in the NDK environment to build the Intel TBB libraries for Android:

    gmake tbb tbbmalloc target=android

    That is all it takes to build the library; now we can move on to building the example with Eclipse*. For the example below I use Android SDK Tools Rev. 21 and Android NDK Rev. 8C on Windows* to illustrate the cross-platform development process.

    We create a project from the default “New Android Application” template. For simplicity we call it “app1”, the same name as in the previous guide:

    Select FullscreenActivity as the Activity. That is all for the template. Note that com.example* is not an acceptable package name for Google Play*, but it will do for our example.

    Next, add a couple of buttons to the main frame. After adding them, the main layout XML (app1/res/layout/activity_fullscreen.xml) looks like this:

    <FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
        xmlns:tools="http://schemas.android.com/tools"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:background="#0099cc"
        tools:context=".FullscreenActivity">

        <TextView
            android:id="@+id/fullscreen_content"
            android:layout_width="match_parent"
            android:layout_height="match_parent"
            android:gravity="center"
            android:keepScreenOn="true"
            android:text="@string/dummy_content"
            android:textColor="#33b5e5"
            android:textSize="50sp"
            android:textStyle="bold" />

        <FrameLayout
            android:layout_width="match_parent"
            android:layout_height="match_parent"
            android:fitsSystemWindows="true">

            <LinearLayout
                android:id="@+id/fullscreen_content_controls"
                style="?buttonBarStyle"
                android:layout_width="match_parent"
                android:layout_height="74dp"
                android:layout_gravity="bottom|center_horizontal"
                android:background="@color/black_overlay"
                android:orientation="horizontal"
                tools:ignore="UselessParent">

                <Button
                    android:id="@+id/dummy_button1"
                    style="?buttonBarButtonStyle"
                    android:layout_width="0dp"
                    android:layout_height="wrap_content"
                    android:layout_weight="1"
                    android:text="@string/dummy_button1"
                    android:onClick="onClickSR" />

                <Button
                    android:id="@+id/dummy_button2"
                    style="?buttonBarButtonStyle"
                    android:layout_width="0dp"
                    android:layout_height="wrap_content"
                    android:layout_weight="1"
                    android:text="@string/dummy_button2"
                    android:onClick="onClickDR" />

            </LinearLayout>
        </FrameLayout>
    </FrameLayout>

    And the strings file (app1/res/values/strings.xml) looks like this:

    <?xml version="1.0" encoding="utf-8"?>
    <resources>
        <string name="app_name">Sample</string>
        <string name="dummy_content">Reduce sample</string>
        <string name="dummy_button1">Simple Reduce</string>
        <string name="dummy_button2">Deterministic Reduce</string>
    </resources>

    Next, add the button handlers:

    // JNI functions
    private native float onClickDRCall();
    private native float onClickSRCall();

    public void onClickDR(View myView) {
        TextView tv = (TextView)(this.findViewById(R.id.fullscreen_content));
        float res = onClickDRCall();
        tv.setText("Result DR is \n" + res);
    }

    public void onClickSR(View myView) {
        TextView tv = (TextView)(this.findViewById(R.id.fullscreen_content));
        float res = onClickSRCall();
        tv.setText("Result SR is \n" + res);
    }

    and the libraries are loaded in FullscreenActivity.java:

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        …
        System.loadLibrary("gnustl_shared");
        System.loadLibrary("tbb");
        System.loadLibrary("jni-engine");
    }

    The "tbb" library should be self-explanatory; the "gnustl_shared" library is needed to support TBB's use of C++ language features. The "jni-engine" library, however, needs more explanation.

    "jni-engine" es una biblioteca de ?++ que implementa un motor de cálculo y exporta las interfaces C para llamadas a JNI de nombre onClickSRCall() y onClickSRCall().

    Following the NDK development rules, create a “jni” folder inside the workspace and three files in it specific to our "jni-engine" library.

    These files are:

    Android.mk (text between angle brackets <> must be replaced with actual values)

    LOCAL_PATH := $(call my-dir)
    TBB_PATH :=

    include $(CLEAR_VARS)
    LOCAL_MODULE    := jni-engine
    LOCAL_SRC_FILES := jni-engine.cpp
    LOCAL_CFLAGS += -DTBB_USE_GCC_BUILTINS -std=c++11 -I$(TBB_PATH)/include
    LOCAL_LDLIBS := -ltbb -L./ -L$(TBB_PATH)/
    include $(BUILD_SHARED_LIBRARY)

    include $(CLEAR_VARS)
    LOCAL_MODULE    := libtbb
    LOCAL_SRC_FILES := libtbb.so
    include $(PREBUILT_SHARED_LIBRARY)

    Application.mk

    APP_ABI := x86
    APP_GNUSTL_FORCE_CPP_FEATURES := exceptions rtti
    APP_STL := gnustl_shared

    jni-engine.cpp:

    #include <jni.h>

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    float SR_Click()
    {
        int N = 10000000;
        float fr = 1.0f/(float)N;
        float sum = tbb::parallel_reduce(
            tbb::blocked_range<int>(0, N), 0.0f,
            [=](const tbb::blocked_range<int>& r, float sum)->float
            {
                for( int i=r.begin(); i!=r.end(); ++i )
                    sum += fr;
                return sum;
            },
            []( float x, float y )->float
            {
                return x+y;
            }
        );
        return sum;
    }

    float DR_Click()
    {
        int N = 10000000;
        float fr = 1.0f/(float)N;
        float sum = tbb::parallel_deterministic_reduce(
            tbb::blocked_range<int>(0, N), 0.0f,
            [=](const tbb::blocked_range<int>& r, float sum)->float
            {
                for( int i=r.begin(); i!=r.end(); ++i )
                    sum += fr;
                return sum;
            },
            []( float x, float y )->float
            {
                return x+y;
            }
        );
        return sum;
    }

    extern "C" JNIEXPORT jfloat JNICALL Java_com_example_app1_FullscreenActivity_onClickDRCall(JNIEnv *env, jobject obj)
    {
        return DR_Click();
    }

    extern "C" JNIEXPORT jfloat JNICALL Java_com_example_app1_FullscreenActivity_onClickSRCall(JNIEnv *env, jobject obj)
    {
        return SR_Click();
    }

    We use the same algorithms as in the previous guide.
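    A side note on why the example exercises both parallel_reduce and parallel_deterministic_reduce: floating-point addition is not associative, so the runtime-chosen chunking of parallel_reduce can change the rounded result from run to run, while the deterministic variant always groups partial sums the same way. A TBB-free C++ sketch of the underlying effect:

```cpp
// Floating-point addition is not associative: the grouping of the same
// three addends changes the rounding, which is why a nondeterministic
// parallel reduction can differ slightly between runs.
double sum_left()  { return (0.1 + 0.2) + 0.3; }  // 0.6000000000000001
double sum_right() { return 0.1 + (0.2 + 0.3); }  // 0.6
```

The difference is only in the last bits, which is exactly the kind of run-to-run variation parallel_deterministic_reduce eliminates at some cost in scheduling flexibility.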

    When we build with the NDK, it compiles the libraries into the corresponding folders, including our libjni-engine.so, libgnustl_shared.so, and libtbb.so.

    Next, return to Eclipse and build app1.apk. The application is now ready to install on the AVD or on real hardware. On the AVD it looks like this:

     

    And we are done! This simple application is complete and should be a good first step toward writing a more complex parallel application for Android. And for those who used the code from the previous guide, the application ported to Android successfully.

    * Other names and brands may be claimed as the property of others.

    WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode


    Overview

    This document demonstrates the best methods to obtain, build and run the WRF model on multiple nodes in symmetric mode on Intel® Xeon Phi™ Coprocessors and Intel® Xeon processors. This document also describes the WRF software configuration and affinity settings to extract the best performance from multiple node symmetric mode operation when using Intel Xeon Phi Coprocessor and an Intel Xeon processor.

    Introduction

    The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, application scientists, etc. Please see http://www.wrf-model.org/index.php for more details on this system. The source code and input files can be downloaded from the NCAR website. The latest version as of this writing is WRFV3.6. In this article, we use the conus2.5km benchmark.

    WRF is used by many private and public organizations across the world for weather and climate prediction.

    WRF has a relatively flat profile on Intel Architecture over many functions for atmospheric dynamics and physics: advection, microphysics, etc.

    Technology (Hardware/Software)

    System: Intel® Xeon® E5-2697 v2 @ 2.7 GHz
    Coprocessor: Intel® Xeon Phi™ coprocessor 7120A @ 1.23 GHz
    Intel® MPI: 4.1.1.036
    Intel® Compiler: composer_xe_2013_sp1.1.106
    Intel® MPSS: 6720-21

    We used the above hardware and software configuration for all of our testing.

    Note: This document assumes that you are running the workload on the aforementioned hardware configuration. If you are using Intel Xeon Phi coprocessor model 7110 cards, please run the following instructions on 8 nodes instead of 4. To run the workload on 4 nodes, you need Intel Xeon Phi coprocessors with 16 GB of memory; since the 7110 model coprocessors have 8 GB, you will need more than 4 coprocessor cards.

    Note: Please use netcdf-3.6.3 and pnetcdf-1.3.0 for I/O.

    Multi Node Symmetric Intel Xeon + Intel Xeon Phi coprocessor (4 Nodes)

    Compile WRF for the Coprocessor

    1. Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
    2. Source the Intel MPI for intel64 and Intel Compiler
      1. source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
      2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
    3. On bash, export the paths for netcdf and pnetcdf built for the coprocessor. Having netcdf and pnetcdf built for the Intel Xeon Phi coprocessor is a prerequisite.
      1. export NETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/netcdf/mic/
      2. export PNETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/pnetcdf/mic/
    4. Turn on Large file IO support
      1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
    5. Cd into the ../WRFV3/ directory and run ./configure and select the option to build with Xeon Phi (MIC architecture) (option 17). On the next prompt for nesting options, hit return for the default, which is 1.
    6. In the configure.wrf that is created, delete -DUSE_NETCDF4_FEATURES and replace -O3 with -O2
    7. Replace "!DEC$ vector always" with "!DEC$ SIMD" on line 7578 of the dyn_em/module_advect_em.F source file.
    8. Run ./compile wrf >& build.mic
    9. This will build a wrf.exe in the ../WRFV3/main folder.
    10. For a new, clean build, run ./clean -a and repeat the process.

    Compile WRF for Intel Xeon processor-based host

    1. Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
    2. Source the latest Intel MPI for intel64 and latest Intel Compiler (as an example below)
      1. source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh
      2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
    3. Export the path for the host netcdf and pnetcdf. Having netcdf and pnetcdf built for the host is a prerequisite.
      1. export NETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/netcdf/xeon/
      2. export PNETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/pnetcdf/xeon/
    4. Turn on Large file IO support
      1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
    5. Cd into the WRFV3 directory created in step #1 and run ./configure and select option 21: "Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc (dm+sm)". On the next prompt for nesting options, hit return for the default, which is 1.
    6. In the configure.wrf that is created, delete -DUSE_NETCDF4_FEATURES and replace -O3 with -O2
    7. Replace "!DEC$ vector always" with "!DEC$ SIMD" on line 7578 of the dyn_em/module_advect_em.F source file.
    8. Run ./compile wrf >& build.snb.avx. This will build a wrf.exe in the ../WRFV3/main folder. (Note: to speed up compiles, set the environment variable J to "-j 4" or however many parallel make tasks you wish to use.)
    9. For a new, clean build, run ./clean -a and repeat the process.

    Run WRF Conus2.5km in Symmetric Mode

    1. Download the CONUS2.5_rundir from http://www2.mmm.ucar.edu/WG2bench/conus_2.5_v3/
    2. Follow the READ-ME.txt to build the wrf input files.
    3. The namelist.input has to be altered. The changes are as follows:
      1. In the &time_control section, edit the values as below:
        1. restart_interval           =360,
        2. io_form_history          =2,
        3. io_form_restart           =2,
        4. io_form_input             =2,
        5. io_form_boundary       =2,
      2. Remove "perturb_input =.true." from the &domains section and replace with "nproc_x =8,"
      3. Add "tile_strategy =2," under the &domains section.
      4. Add "use_baseparam_fr_nml =.true." under the &dynamics section.
    4. Create a new directory called CONUS2.5_rundir. In CONUS2.5_rundir, create two directories, "mic" and "x86", and copy the contents of ../WRFV3/run/ into each of them.
    5. Copy the Intel Xeon Phi coprocessor binary into the CONUS2.5_rundir/mic directory and the Intel Xeon binary into the CONUS2.5_rundir/x86 directory.
    6. Cd into CONUS2.5_rundir and execute WRF as follows on 4 nodes (i.e., 4 coprocessors + 4 Intel Xeon processors) in symmetric mode. To run conus2.5km, you need access to 4 nodes (example shown below).
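    Applied together, the edits from step 3 yield a namelist fragment along these lines (only the entries touched above are shown; the rest of the stock namelist.input is unchanged, and in stock WRF namelists the first section is named &time_control):

```fortran
&time_control
 restart_interval     = 360,
 io_form_history      = 2,
 io_form_restart      = 2,
 io_form_input        = 2,
 io_form_boundary     = 2,
/

&domains
 nproc_x              = 8,
 tile_strategy        = 2,
/

&dynamics
 use_baseparam_fr_nml = .true.,
/
```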

    Script to run on Xeon-Phi + Xeon (symmetric mode)

    The nodes I am using are: node01 node02 node03 node04

    When you request nodes, make sure you set a large stack size: MIC_ULIMIT_STACKSIZE=365536

    
    source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
    source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
    
    export I_MPI_DEVICE=rdssm
    export I_MPI_MIC=1
    export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
    export I_MPI_PIN_MODE=pm
    export I_MPI_PIN_DOMAIN=auto
    
    ./run.symmetric
    
    
    

    Below is the run.symmetric to run the code in symmetric mode:

    run.symmetric script

    
    #!/bin/sh
    mpiexec.hydra
     -host node01 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node02 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node03 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node04 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
    : -host node01-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    : -host node02-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    : -host node03-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    : -host node04-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
    
    
    

    In ../CONUS2.5_rundir/mic, create a wrf.sh file as below.

    Below is the wrf.sh that is needed for the Xeon Phi part of the runscript.

    wrf.sh script

    
    export LD_LIBRARY_PATH=/opt/intel/compiler/2013_sp1.1.106/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH
    /path/to/CONUS2.5_rundir/mic/wrf.exe
    
    
    
    • You will have 80 rsl.error.* and 80 rsl.out.* files in your CONUS2.5_rundir directory.
    • Do a 'tail -f rsl.error.0000' and when you see 'wrf: SUCCESS COMPLETE WRF' your run is successful.
    • After the run, compute the total time taken to simulate with the scripts below. The mean value, which is the Average Time Step (ATS), is the figure of interest for WRF (lower is better).

    Parsing scripts

    gettiming.sh – is the parsing script

    
    grep 'Timing for main' rsl.out.0000 | sed '1d' | head -719 | awk '{print $9}' | awk -f stats.awk
    bash-4.1$ cat stats.awk 
    BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
    {
    i ++
    a += $1
    if ( $1 > max ) max = $1
    if ( $1 < min ) min = $1
    }
    END{ printf("---\n%10s %8d\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }

    Validation

    To validate that a completed WRF run is correct, check the following:

    • It should generate a wrf_output file.
    • diffwrf your_output wrfout_reference > diffout_tag
    • The 'DIGITS' column should have a high value (>3); if so, the WRF run is considered valid.

    Compiler Options

    • -mmic : build an application that runs natively on the Intel® Xeon Phi™ coprocessor
    • -openmp : enable the compiler to generate multi-threaded code based on OpenMP* directives (same as -fopenmp)
    • -O3 : enable aggressive compiler optimizations
    • -opt-streaming-stores always : generate streaming stores
    • -fimf-precision=low : use low precision for higher performance
    • -fimf-domain-exclusion=15 : generate the lowest-precision sequences for single and double precision
    • -opt-streaming-cache-evict=0 : turn off all cache-line evicts

    Conclusion

    This document shows how to compile and run the WRF Conus2.5km workload on an Intel-based cluster containing both Intel Xeon processor-based systems and Intel Xeon Phi coprocessors, and showcases the benefit of a 4-node symmetric-mode run over a homogeneous Intel Xeon processor-based installation.

    About the Author

    Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel SSG).

  • Optimizing Cyberlink PowerDVD 10* Improves Battery Life


    Download PDF

    Authors:
    Manuj Sabharwal and Gael Hofemeier, Software Engineers, Software Solutions Group, Intel Corporation

    Introduction

    Low battery life is one of the most serious issues currently plaguing mobile devices in general and Ultrabook™ devices and tablets specifically. Users have become accustomed to streaming multimedia content to their mobile devices “on-demand” from content servers in the cloud. Because these devices have limited battery capacity, energy efficiency is important. Cyberlink PowerDVD 10* (PowerDVD*) is one of the industry's top players for HD and 3D movie playback, and the app is often pre-bundled by OEMs. In this case study, we show how Intel and Cyberlink collaborated to optimize PowerDVD to give a best-in-class experience on Intel devices.

    First, we’ll talk about the challenges that Cyberlink encountered when adding content streaming features to PowerDVD and the tools and techniques Intel used to improve the power consumption of PowerDVD.

    Then, we’ll discuss the power consumption profile of a Cyberlink PowerDVD streaming media application and its impact on battery life for mobile devices. We also provide an analysis of PowerDVD behavior to identify issues such as decoding on CPU, large numbers of context switches, high interrupt rates, etc., causing increased power consumption. Finally, we’ll provide the data that shows the reduced power consumption following optimization.

    The optimization was a huge success. The Intel team was able to make the following improvements to PowerDVD:

    • Package C0 reduced to 20% from 100% during media playback
    • Reduced SoC power from ~6 W to ~1.8 W, measured using Intel® Power Gadget
    • Intel® VTune™ analyzer reported CPU utilization of 25% down from 70%
    • Windows* Performance Analyzer showed wakeups dropping from every 5 msec to every 10 msec during local or streaming media playback

    Definitions

    Acronym: Definition
    BLA: Battery Life Analyzer
    GPU: Graphics processing unit
    WPA: Windows Performance Analyzer
    DLNA Server: Digital Living Network Alliance Server
    HD: High definition
    SoC: System on Chip
    FPS: Frames per second
    SDK: Software development kit
    SKU: Stock Keeping Unit

     

    The Challenges of Optimizing Battery Life

    PowerDVD offers new features for media organization, streaming, mobile devices, and social media. In addition to functioning as a client, the latest software can turn a device into a DLNA server and stream multimedia content from a PC across a network to other devices. It can also stream content from external content servers. Adding content streaming came with a price, however. New capabilities, such as HD streaming, required running more processes, consuming much more memory and many more CPU cycles. This took a toll on battery life. We needed to answer the following questions:

    1. What is the power consumption from PowerDVD during a 1080p streaming media playback?
    2. Why was PowerDVD able to play back only an hour of media on a fully charged battery?

    After two months and three iterations of analysis and validation, the engineering teams improved battery life by making the following changes:

    • Offloaded graphics to the GPU (using the Intel® Media SDK)
    • Removed the sleep loop calls from two threads
    • Used an overlay to reduce extra memory copies

    The following describes the process and tools that resulted in the optimized version of PowerDVD.

    Optimization of Cyberlink PowerDVD for Power Consumption

    Test System Configuration:

    • 4th generation Intel® Core™ i7 processor
    • Lenovo Yoga* 2 Pro
    • CPU speed: 1.4 GHz (non-turbo frequency)
    • Memory: 4 GB
    • Display: 1920x1080 HD panel
    • Cyberlink PowerDVD 10 and Cyberlink PowerDVD 12

    Validation and analysis showed:

    • Package C0 was pegged at 100% during media playback, while we expected it to be at ~20%.
    • Intel Power Gadget showed SoC power to be ~6 W. It should be ~1.7 W on a 4th generation Intel processor.
    • Intel VTune results revealed no offloading of graphics to the GPU and high CPU utilization of 70% (we expected about 10%)
    • The Windows Performance Analyzer tests revealed frequent wakeups (5 msec). The normal frequency is 10 msec with audio playback.

    First Step - Validation

    To understand and address PowerDVD’s impact on battery life, we used Intel Power Gadget and Battery Life Analyzer (BLA) to validate the application’s SoC power usage. Figure 1 shows the Intel Power Gadget’s UI on a Windows platform.

     


    Figure 1. Intel® Power Gadget UI on Windows* Platform

    As part of our validation of PowerDVD, we used Intel Power Gadget to determine power impacts during playback. Figure 2 shows the power output Intel Power Gadget recorded.

    PowerDVD’s power usage was ~6 W of SoC power during playback. Intel recommends a maximum of ~2.0 W on 4th generation Intel processors (low power processors typically used in Ultrabook devices).


    Figure 2. Processor Power Usage during PowerDVD* Playback

    To gain deeper insight into what other activities were affecting power, we used the Battery Life Analyzer (BLA) tool to understand the impact of media playback on residencies. Understanding residency is important as changing the SoC SKU can impact power.

    BLA is a power management analysis tool developed by Intel to identify issues that impact battery life. BLA helps to identify a wide range of issues during software analysis such as:

    • Software CPU utilization
    • OS timer resolution changes
    • Frequent C state transitions
    • Excessive ISR/DPC activity


    Figure 3 shows package residency during 1080p HD video playback using Cyberlink PowerDVD.


    Figure 3. Package Residency during 1080p HD Video Playback using PowerDVD*

    The package residency includes CPU, Graphics, and UnCore events. More time in package C0 results in higher SoC power. Expected package C0 for Cyberlink PowerDVD 1080p playback is ~20% on a 4th generation U-series processor. As we can see from Figure 3, package residency is far higher than it should be.

    Both Intel Power Gadget and BLA confirmed the higher power usage: ~4 hrs of battery life on a 42 Whr (Watt-hour) battery, with ~6 W SoC + ~3 W display + ~2 W for other components.
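    The battery-life figure follows directly from dividing battery capacity by total platform power. A minimal sketch of that arithmetic, using the capacity and per-component wattages quoted above:

```cpp
// Estimated battery life (hours) = battery capacity (Wh) / total platform power (W).
double battery_hours(double capacity_wh, double platform_watts) {
    return capacity_wh / platform_watts;
}

// Before optimization: 42 Wh / (6 W SoC + 3 W display + 2 W other) ~= 3.8 hours.
// After optimization:  42 Wh / (1.8 W SoC + 3 W display + 2 W other) ~= 6.2 hours.
```

    With the measured numbers, the ~3.8-hour result matches the ~4 hrs observed before optimization, and the same formula predicts roughly 6 hours once SoC power drops to ~1.8 W.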

    Our next step was to analyze the application for power optimization.

    Second Step - Analysis

    For the analysis phase, we used two tools: the Intel® VTune™ analyzer and the Windows* Performance Analyzer (WPA).

    The following tables summarize the results of the analysis, which showed definite room for improvement.

    Table 1. Intel® Power Gadget and BLA Results

    • Actual: Package C0 is pegged at 100% during media playback. Expected: Package C0 should be at 20%.
    • Actual: SoC power measured with Intel® Power Gadget is ~6 W. Expected: ~1.7 W on a 4th generation Intel processor.

     

    Table 2. Intel® VTune™ and WPA Results

    • Intel VTune results: (1) since no codecs ran on the GPU, there was no offloading to graphics; (2) high CPU utilization (70% vs. the expected 10%).
    • Windows Performance Analyzer: frequent wakeups (5 msec) occurred; the expected frequency is 10 msec with audio playback.

     

    The next figures provide a walkthrough of some of the important screenshots from our analysis.

    Intel VTune analyzer was used to validate the PowerDVD application for the presence of spin waits, the presence of hardware acceleration, and hotspots (a micro-architecture issue). Figure 4 shows the steps for collecting the graphics call stacks.


    Figure 4. VTune™ UI for Analyzing DirectX* Pipeline Events

    Figure 5 shows the VTune summary with significant time spent in a spin loop. GPU usage shows no codec activity; most of the time spent on the GPU is for display and other pre-processing algorithms during playback.


    Figure 5. VTune™ Summary showing Spin Loop time

    Digging deeper into the analysis, Intel VTune shows high CPU utilization during media playback, and instances where VSync (the red highlights in Figure 5) and GPU software queue are not occurring every ~33 msec (30 FPS playback). This analysis shows software glitches during media playback.


    Figure 6. VTune™ Summary Report

    Looking at Figure 7, the summary report confirms an inconsistent frame rate over time: for 30 FPS movie playback, the FPS varies between 0 and 60. The chart shows the total number of frames executed in an application at each frame rate. A high number of slow or fast frames signals a performance bottleneck. The goal is to optimize the code to keep the frame rate constant at the target rate, for example 30 or 60 FPS.


    Figure 7. VTune™ analysis of Frame Rates
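    The frame-pacing consistency that Figure 7 visualizes can also be checked numerically: compute the delta between successive frame timestamps and count the ones that miss the ~33 msec budget of 30 FPS playback. A small sketch of that check (the tolerance value is illustrative, not from the paper):

```cpp
#include <cstddef>
#include <vector>

// Count frame intervals that deviate from the target period by more than the
// given tolerance (all times in milliseconds). For 30 FPS playback the target
// period is ~33 ms; a large count signals glitchy, inconsistent pacing.
std::size_t count_missed_frames(const std::vector<double>& timestamps_ms,
                                double target_period_ms, double tolerance_ms) {
    std::size_t missed = 0;
    for (std::size_t i = 1; i < timestamps_ms.size(); ++i) {
        double delta = timestamps_ms[i] - timestamps_ms[i - 1];
        if (delta > target_period_ms + tolerance_ms ||
            delta < target_period_ms - tolerance_ms)
            ++missed;
    }
    return missed;
}
```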

    Next, we used the Windows Performance Analyzer (WPA) tool to analyze the application for wakeup activities, interrupts, and context switches. Figure 8 shows H264 decode running on the CPU using Intel® SSE instructions. It is more efficient to offload this work to the GPU than to run it on the CPU.


    Figure 8. WPA Analysis of Wakeup Activities, Interrupts, and Context Switches

    WPA also shows wakeup activities from PowerDVD during playback. Figure 9 displays the two PowerDVD threads, both running at 10 msec. The two threads are not coalesced, which causes the overall system to wake up at a 5 msec timer interval. Figure 10 shows the call stack with sleep loop Win32* API being called every 10 msec interval.


    Figure 9. WPA thread analysis


    Figure 10. WPA call stack with sleep loop analysis

    Table 3 reveals significant reduction in package residency after optimization.

    Table 3. Validating Package Residency after Optimization

    C-state Counter      Average (%) Before Optimization      Average (%) After Optimization
    Package C0-C1        100%                                 20.18%
    Package C2           0%                                   8.29%
    Package C3           0%                                   0.19%
    Package C6           0%                                   1.91%
    Package C7           0%                                   69.43%

     

    Optimization Results/Validation

    The following tables show the “before” and “after” results:

    Table 4. Intel® Power Gadget and BLA: Before and After1

    • Before: Package C0 is pegged at 100% during media playback. After: Package C0 is reduced to 20%.
    • Before: SoC power is ~6 W. After: SoC power reduced to ~1.8 W on the test system.

     

    Table 5. Intel® VTune™ Amplifier and WPA Results: Before and After1

    • Intel® VTune™ Amplifier. Before: no offloading to graphics (the app ran no codecs on the GPU); high CPU utilization (70% vs. the expected 10%). After: video codecs now reported; CPU utilization decreased by ~25%.
    • Windows Performance Analyzer. Before: frequent wakeups (5 msec); expected frequency is 10 msec with audio playback. After: sleep thread removed, reducing wakeups by 2x (from every 5 msec to every 10 msec).
    • Battery Life Analyzer. Before: package residency 100%. After: package residency ~20%.

     

    1 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

    We optimized by:

    1. Offloading to Intel® HD Graphics using Intel Media SDK
    2. Optimizing Win32 API calls that cause periodic wakeup on CPU
    3. Using an overlay to save one memory copy per frame

    The first task was to use the Intel Media SDK to offload decode to graphics, which provides better performance per watt by using Intel HD Graphics. The pseudo code in Figure 11 provides an example of a simple use of the Intel Media SDK to offload a stream of frames to graphics.


    Figure 11. Intel® Media SDK code snippet – offloading a frame to graphics.

    Once we offloaded to graphics using the Intel Media SDK, we ran PowerDVD and measured the results using Intel VTune Amplifier. Compared to Figure 5 where we didn’t see any codec usage, we now see Video Enhancement in the summary (Figure 12).


    Figure 12. Intel® VTune™ Amplifier Summary result

    Examining other Intel VTune graphics views, we verified that, with the Intel Media SDK in use, frames were decoded on the GPU rather than on the CPU. Figure 13 shows a batch of frames being decoded after ~20 msec on the GPU. Offloading the decode work to the GPU helped to reduce CPU utilization by ~25% on the test system.


    Figure 13. Frame decoding after ~20 msec on the GPU

    To verify our optimization of offloading graphics, we ran Intel Power Gadget. Compared to the baseline result shown in Figure 2, we saw ~2 W of power saving just by performing graphics offloading (Figure 14).


    Figure 14. Power Savings resulting from Graphics Offload

    We made some good progress, but ~4 W was not low enough. As stated earlier, the goal for streaming media 1080p playback is ~1.7 W of SoC/package power.

    The next step was to find other CPU-based optimizations. Initial analysis showed sleep loop calls from two non-coalesced threads waking the CPU every 5 msec. CyberLink engineers needed to remove the sleep loops from their application. However, this was one of the most difficult changes since it required modifying the structure of the application. Figure 15 shows the wakeup interval increasing to 10 msec after the periodic activities were removed.


    Figure 15. Optimized Cyberlink PowerDVD* after removing periodic activities
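    The structural change can be illustrated in portable terms: instead of two threads each sleeping on their own 10 msec schedule (offset by 5 msec, so the package wakes every 5 msec), both workers block on one shared tick and wake together. This is only a sketch of the coalescing idea in standard C++; the actual PowerDVD fix was made with Win32 primitives inside CyberLink's code.

```cpp
#include <condition_variable>
#include <mutex>

// One shared tick replaces the per-thread sleep loops: a single timer thread
// calls tick() once per period, and every worker blocks in wait_next() until
// that tick fires, so all workers wake at the same instant.
class CoalescedTicker {
public:
    void tick() {
        { std::lock_guard<std::mutex> lk(m_); ++generation_; }
        cv_.notify_all();
    }
    // Blocks until the generation advances past the last one this worker saw,
    // then returns the new generation number.
    unsigned wait_next(unsigned last_seen) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return generation_ != last_seen; });
        return generation_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned generation_ = 0;
};
```

    Because the predicate re-checks the generation counter, a worker that arrives after a tick has already fired returns immediately instead of oversleeping.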

    Removing the periodic activities yielded a ~800 mW saving. With the optimizations so far, SoC power for 1080p HD streaming playback went from ~6 W to ~2.8 W, but additional optimizations were still needed to reach the 1.7 W goal seen in best-in-class applications.


    Figure 16. Power Optimizations down to ~2.8 W

    The next step was to reduce extra memory copies using an overlay. With the overlay, the overall package power was reduced by ~400 mW. Figure 17 shows power was reduced to ~1.8 W from ~6 W.


    Figure 17. Cyberlink PowerDVD* at final Power Consumption (1.8 W)
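    The ~400 mW saved by the overlay is consistent with the memory bandwidth an extra per-frame copy consumes: each 1080p ARGB frame is 1920 × 1080 × 4 bytes ≈ 8.3 MB, so one additional copy per frame at 30 FPS moves roughly 250 MB/s through memory. A quick sketch of that arithmetic:

```cpp
#include <cstdint>

// Bytes moved per second by one extra frame copy:
// width * height * bytes-per-pixel * frames-per-second.
std::uint64_t copy_bandwidth_bytes(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t bytes_per_pixel,
                                   std::uint64_t fps) {
    return width * height * bytes_per_pixel * fps;
}

// 1080p ARGB at 30 FPS: 1920 * 1080 * 4 * 30 = 248,832,000 bytes/s (~250 MB/s).
```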

    With that, the most important optimization goals had been achieved, and Intel and Cyberlink engineers deemed the project a success.

    Close collaboration between Cyberlink and Intel helped to complete the optimization in two months with full validation. The final product with all optimizations was released to OEMs six months from when we started.

    Conclusion

    The Intel and PowerDVD engineers used several tools including Intel VTune and Microsoft Windows Performance Analyzer to reach the optimum low-power playback. The collaboration included knowledge sharing on tools with weekly analysis/meetings to meet the battery life goal before the release deadline.

    Several iterations were completed before the team was satisfied with the results (PowerDVD consumes ~1.8 W, down from ~6 W). Intel and Cyberlink engineers faced the challenge of keeping the quality of playback the same before and after optimization. Each optimization required a validation and analysis process before it could pass the Cyberlink team’s internal quality tests. Thus, every change was tracked, and user experience metrics (power and performance) were evaluated.

    The following optimizations were found to work the best for achieving the optimization goals, but as noted above, these were accomplished over several iterations:

    • Offloading graphics to the GPU (using the Intel Media SDK)
    • Removing sleep loop calls from two threads
    • Using an overlay to reduce extra memory copies

    The combined efforts between the Intel and CyberLink PowerDVD team resulted in optimizing their streaming media playback application to reach the best-in-class goal.

    About the Authors

    Manuj Sabharwal is a Software Engineer in the Software Solutions Group at Intel. Manuj has been involved in exploring power enhancement opportunities for idle and active software workloads. He has significant research experience in power efficiency and has delivered tutorials and technical sessions in the industry. He also works on enabling client platforms through software optimization techniques.

     

     

    Gael Hofemeier has worked for Intel since 2000 as an Application Engineer in the Software Solutions Group at Intel. Gael’s current focus is in Technology Evangelism for Business Client Apps and Technologies.

     

     

     

    References

    1. Windows Performance Analyzer: http://www.microsoft.com/en-us/download/details.aspx?id=30652
    2. Battery Life Analyzer: http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=19351
    3. Intel® Power Gadget: https://software.intel.com/en-us/articles/intel-power-gadget-20
    4. Cyberlink PowerDVD: http://www.cyberlink.com/products/powerdvd-ultra/features_en_US.html?&r=1
    5. Intel® Media SDK: https://software.intel.com/en-us/vcsource/tools/media-sdk-clients

    Relevant Intel Links

    Energy Efficient Software Development: https://software.intel.com/en-us/energy-efficient-software
    Power Analysis Guide for Windows*: https://software.intel.com/en-us/articles/power-analysis-guide-for-windows
    Windows 8* Software Power Optimization: https://software.intel.com/en-us/articles/windows-8-software-power-optimization
    Intel processor numbers: http://www.intel.com/products/processor_number/

     

    Notices and Disclaimers

    http://legal.intel.com/Marketing/notices+and+disclaimers.htm

     

    Intel, the Intel logo, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and other countries.
    *Other names and brands may be claimed as the property of others
    Copyright© 2014 Intel Corporation. All rights reserved.

    Optimizing an Augmented Reality Pipeline using Intel® IPP Asynchronous

    Using Intel® GPUs to Optimize the Performance and Power Consumption of Total Immersion's D'Fusion* Augmented Reality Pipeline

    Michael Jeronimo, Intel (michael.jeronimo@intel.com)
    Pascal Mobuchon, Total Immersion (pascal.mobuchon@t-immersion.com)

    Executive Summary

    This case study details the optimization of Total Immersion's D'Fusion* Augmented Reality pipeline, using the Intel® Integrated Performance Primitives (Intel® IPP) Asynchronous to execute key parts of the pipeline on the GPU. The paper explains the Total Immersion pipeline, the goals and strategy for the optimization, the results achieved, and the lessons learned.

    Intel IPP Asynchronous

    The Intel IPP Asynchronous (Intel IPP-A) library—available for Windows* 7, Windows 8, Linux*, and Android*—is a companion to the traditional CPU-based Intel IPP library. This library extends the successful Intel IPP acceleration library model to the GPU, providing a set of GPU-accelerated primitive functions that can be used to build visual computing algorithms. Intel IPP-A is a simple host-callable C API consisting of a set of functions that operate on matrix data, the basic data type used to represent image and video data. The functions provided by Intel IPP-A are low-, medium-, and high-level building blocks for video analysis algorithms. The library includes low-level functions such as basic math and Boolean logic operations; mid-level functions like filtering operations, morphological operations, edge detection algorithms; and high level functions including HAAR classification, optical flow, and Harris and Fast9 feature detection.

    When a client application calls a function in the Intel IPP-A API, the library loads and executes the corresponding GPU kernel. The application does not explicitly manage GPU kernels; at application run time the library loads the correct highly optimized kernels for the specific processor. The Intel IPP-A library supports third generation Intel® Core™ processors (code named Ivy Bridge) and higher, and Intel® Atom™ processors, like the Bay Trail SoC, that include Intel® Processor Graphics. Allowing the library implementation to manage kernel selection, loading, dispatch, and synchronization simplifies the task of using the GPU for visual computing functionality. The Intel IPP-A library also includes a CPU-optimized implementation for fallback on legacy systems or application-level CPU/GPU balancing.

    Like the traditional CPU-based Intel IPP library, when code is implemented using the Intel IPP-A API, the code does not need to be updated to take advantage of the additional resources provided by future Intel processors. For example, when a processor providing additional GPU execution units (EUs) is released, the existing Intel IPP-A kernels can automatically scale performance, taking advantage of the additional EUs. Or, if a future Intel processor provides new hardware acceleration blocks for video analysis operations, a new Intel IPP-A library implementation will use the accelerators while keeping the Intel IPP-A interface constant. Developers can simply recompile and relink with the new library implementation. Intel IPP-A provides a convenient abstraction layer for GPU-based visual computing that provides automatic performance scaling across processor generations.

    It is easy to integrate Intel IPP-A code with the existing CPU-based code, so developers can take an incremental approach to optimization. They can identify key pixel processing hotspots and target those for offload to the GPU. But they must take care when offloading to the GPU so as not to introduce data transfer overhead. Instead, developers should create an algorithm pipeline that allows significant work to be performed on the GPU before the results are required by the CPU code, minimizing inter-processor data transfer.

    Benefits of GPU Offload

    Offloading time consuming pixel processing operations to the GPU can result in significant power and performance benefits. In particular, the GPU:

    • Has a lower operating frequency– the GPU runs at a lower clock frequency than the CPU, consuming less power for the same computation.
    • Has more hardware threads– the GPU has significantly more hardware threads, providing better performance for operations where performance scales with an increasing number of threads, such as the visual processing operations in Intel IPP-A.
    • Has the potential to run more complex algorithms– due to the better power and performance provided by the GPU, developers can use more computationally intensive algorithms to achieve improved results and/or process more pixels than they could otherwise using the CPU only.
    • Can free the CPU for other tasks – by moving processing to the GPU, developers can reduce CPU utilization, freeing up the CPU processing resources for other tasks.

    The benefits offered by Intel IPP-A programming on the GPU can be applied in a variety of market segments to help ISVs reach specific goals. For example, in Digital Security and Surveillance (DSS), the primary metric is the number of channels of input video that a platform can process (the "channel density"), while in Augmented Reality, decreasing the time to acquire targets to track and increasing the number of objects that can be simultaneously tracked are key.

    Augmented Reality

    Augmented Reality (AR) enhances a user's perception with computer-generated input such as sound, video, or graphics data. AR merges the real world with computer-generated elements, either meta information or virtual objects, resulting in a composite that presents more information and capabilities than an un-augmented experience. AR applications usually overlay information about the environment and objects on a real-time video stream, making the virtual objects interactive. AR technology can be applied to many market segments including retail, medicine, entertainment, and education. For example:

    • Mobile augmented reality systems combine a mobile platform's camera, GPS, and compass sensors with its Internet connectivity to pinpoint the user's location, detect device orientation, and provide information about the scene, overlaying content on the screen.
    • Virtual dressing rooms allow customers to virtually try on clothes, shoes, jewelry, or watches, either in-store or at home, automatically sizing the item to the user in a 3D view on the device.
    • Construction managers can view and monitor work in progress, in real time, through Augmented Reality markers placed throughout a site.

    Total Immersion

    Total Immersion is an augmented reality company, founded in 1998, based in Suresnes, France. Through its patented D'Fusion software solution, Total Immersion combines the virtual world and the real world by integrating real-time interactive 3D graphics into a live video stream. The company maintains offices in Europe, North America, and Asia and supports the world's largest augmented reality partner network, with over 130 solution providers.

    Today, mobile technology is everywhere. Total Immersion (TI) is developing compelling AR experiences for tablets and phones. Intel, recognizing Total Immersion as a leader in Augmented Reality, initiated a collaboration with TI to optimize the D'Fusion software for Intel processors, including GPU offloading. They aimed to improve the AR experience when running on Intel products that power mobile platforms, such as the Intel Atom SoC Z3680.

    Optimization Goals and Strategy

    Augmented Reality applications rely on computer vision algorithms to detect, recognize, and track objects in input video streams. While a large part of the AR processing doesn't deal directly with pixels, the pixel processing required is a computationally intensive, data parallel task appropriate for GPU offload. Intel and Total Immersion planned to offload the pixel processing to the GPU, using Intel IPP-A, so that the pipeline handled the pixel processing—from capture to rendering—and only the metadata about the pixel information would be returned to the CPU as input for higher-level AR operations. By offloading all of the pixel processing to the GPU, the application achieved better performance with less power consumption, making D'Fusion-based applications run efficiently on mobile platforms while conserving battery life.

    The D'Fusion AR Pipeline

    The core of the D'Fusion software is a processing pipeline that consists of the following stages:

    The D'Fusion AR Pipeline
    Figure 1 – The D'Fusion AR Pipeline

    • Capture – The first step in the pipeline is capturing input video from the camera. The video can be captured in a variety of formats, such as RGB24, NV12, or YUY2, depending on the specific camera. Frames are captured at the full frame rate, typically 30 FPS, and passed to the next stage in the pipeline. Each captured frame has an associated time stamp that specifies the precise time of capture.
    • Preparation – Computer vision algorithms usually operate on grayscale images, and the TI AR pipeline is no exception. The first step after Capture is to convert the color format of the captured image to grayscale. Next, because computer vision algorithms often do not require the full frame size to operate effectively, input frames can be downscaled to a lower resolution. The reduced number of pixels to process saves computational resources. Then, depending on the orientation of the image, mirroring may also be required. Finally, in addition to the grayscale image required by the computer vision processing, a color image must also be sent down the pipeline so that the scene can eventually be rendered along with the AR-generated information. This requires a second color format conversion from the camera input format, like NV12, to a format appropriate for display, such as ARGB. All of the operations in the Preparation stage are pixel-intensive operations appropriate to target for offload to the GPU.
    • Detection – Once a frame is prepared, the pipeline applies a feature detection algorithm, either Harris or Fast9, to the reduced-size grayscale input image. The algorithm returns a list of feature points detected in the image. The feature detection algorithm can be controlled by various parameters, including the threshold level. These parameters continuously adjust the feature point detection to return an optimal number of feature points and to adapt to changing ambient conditions, such as the brightness of the input scene. Non-maximal suppression is applied to the feature point calculation to get a better distribution of feature points, avoiding local "clustering." Both feature detection and non-maximal suppression are targeted for offload to the GPU.
    • Recognition – Once the features are generated by the Detection stage of the pipeline, the FERNS algorithm is used to match the features against a database of known objects. Instead of operating on the feature points directly, the FERNS algorithm uses a patch, a square region of pixels centered on the feature point. The patches are taken from a filtered version of the frame that has been convolved with a smoothing filter. Each of the patches is associated with a timestamp of the frame from which they were derived. Since the processing of each patch by the FERNS algorithm is an independent operation, it is easily parallelizable and a candidate for GPU offload. The frame smoothing can also happen on the GPU.
    • Tracking - Many image processing algorithms operate on multi-resolution images called image pyramids, where each level of the pyramid is a further downscaled version of the original input frame. The Tracking stage of the pipeline provides the image pyramid to the Lucas-Kanade optical flow algorithm to track the objects in the scene. Both the image pyramid generation and the optical flow are good candidates to run on the GPU.
    • Rendering – Rendering is the final stage of the pipeline. In this stage, the AR results are combined with the color video and rendered on the output, in this case using OpenGL*. The application renders the color video as an OpenGL texture and uses OpenGL functions to draw the graphics output, based on the video analysis, on top of the video frame.
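    The Preparation stage's core pixel operations are simple and data-parallel, which is exactly what makes them good GPU candidates. As a CPU-side illustration of what the offloaded kernels compute, here is a sketch of BT.601 grayscale conversion and a 2× box downscale; the exact coefficients and scaling filter used by D'Fusion are not specified in this paper, so these are common defaults:

```cpp
#include <cstdint>
#include <vector>

// BT.601 integer luma approximation: Y = (299*R + 587*G + 114*B) / 1000.
inline std::uint8_t rgb_to_gray(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint8_t>((299u * r + 587u * g + 114u * b) / 1000u);
}

// 2x downscale of a grayscale image by averaging each 2x2 block.
// Width and height are assumed even for brevity.
std::vector<std::uint8_t> downscale2x(const std::vector<std::uint8_t>& src,
                                      int width, int height) {
    std::vector<std::uint8_t> dst((width / 2) * (height / 2));
    for (int y = 0; y < height / 2; ++y)
        for (int x = 0; x < width / 2; ++x) {
            unsigned sum = src[(2 * y) * width + 2 * x]
                         + src[(2 * y) * width + 2 * x + 1]
                         + src[(2 * y + 1) * width + 2 * x]
                         + src[(2 * y + 1) * width + 2 * x + 1];
            dst[y * (width / 2) + x] = static_cast<std::uint8_t>(sum / 4);
        }
    return dst;
}
```

    Every output pixel depends only on a small, fixed neighborhood of input pixels, so the work parallelizes across the GPU's many hardware threads with no cross-pixel synchronization.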

    Optimization Strategy

    Initial profiling of the TI application confirmed that the pixel processing operations mentioned in the prior section were the primary bottlenecks in the AR pipeline. However, other bottlenecks existed, including a CPU-based copy of the color image data to an OpenGL texture.

    To simplify collaboration, Intel delivered the optimizations to Total Immersion as a library to be incorporated into the TI software. The library, dubbed PixelFlow, encapsulates the pixel processing required by the TI AR pipeline and is implemented using Intel IPP-A library. Intel and Total Immersion decided that PixelFlow would target the Preparation, Detection, and Rendering bottlenecks first, while also providing information required for the Recognition and Tracking stages. Moving the first stages of the pipeline to the GPU would be a milestone towards the eventual goal of handling all pixel processing operations on the GPU.

    To implement the Preparation and Detection stages, the operations performed by PixelFlow on the GPU included color format conversion, resizing, mirroring, Fast9 and Harris feature point detection, and non-maximal suppression. To support the Recognition and Tracking stages, the library provides a smoothed frame to be used by the FERNS algorithm and an image pyramid of the input to be used by the optical flow algorithm. Finally, PixelFlow also provides a GPU texture of the color input frame suitable for use in OpenGL.

    Implementation

    The PixelFlow framework was conceived as a flexible framework for analysis of multiple video input streams derived from a single video capture source. The PixelFlow pipeline runs on the GPU, operating asynchronously with the CPU. Each video capture source serves frames to one or more logical video streams, where the color format and resolution of each stream is independently configurable. Each stream runs on a separate thread and can use Intel IPP-A to analyze the video frames, producing meta information. The following diagram shows the general design of the framework.

    Design of the PixelFlow Framework
    Figure 2 – The Design of the PixelFlow Framework

    The TI Augmented Reality pipeline is comprised of two video streams: the Analytics Stream and the Graphics Stream. The Analytics Stream processes a grayscale input frame, performing feature detection with non-maximal suppression, image pyramid generation, and smoothing of the input frame. The Graphics Stream converts the color camera input to ARGB for display. In both cases, the resulting data is placed in a queue for access by the CPU-based code. The following diagram shows the basic organization of the pipeline and the functions targeted for offload to the GPU.

    PixelFlow implementation for the TI AR pipeline
    Figure 3 – The PixelFlow implementation for the TI AR pipeline

    The information on each queue has a timestamp of the original frame capture, allowing the CPU software to correlate each frame with the corresponding data produced by the analytics stream.
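    That correlation can be sketched as a simple keyed lookup: each queue entry carries its capture timestamp, and the CPU code pairs an analytics result with the color frame that shares that timestamp. A minimal illustration; the real PixelFlow types are not shown in this paper, so the structs here are hypothetical:

```cpp
#include <cstdint>
#include <map>

// Hypothetical per-frame analytics output, keyed by capture timestamp.
struct AnalyticsResult {
    int feature_count = 0;   // e.g., number of detected feature points
};

// Pairs a color frame (identified by its capture timestamp) with the
// analytics result produced from the same captured frame, if present.
class FrameCorrelator {
public:
    void add_result(std::uint64_t timestamp, AnalyticsResult r) {
        results_[timestamp] = r;
    }
    // Returns true and fills 'out' when analytics for the frame are available.
    bool match(std::uint64_t frame_timestamp, AnalyticsResult& out) {
        auto it = results_.find(frame_timestamp);
        if (it == results_.end()) return false;
        out = it->second;
        results_.erase(it);   // each result is consumed once
        return true;
    }
private:
    std::map<std::uint64_t, AnalyticsResult> results_;
};
```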

    Implementation Challenges

    Several challenges were encountered during the implementation of the PixelFlow framework:

    • Separate kernels for frame preparation – The initial PixelFlow implementation used separate Intel IPP-A functions for resizing, color format conversion, and mirroring. Because these functions did not support multi-channel images, preparing the ARGB output for the Analytics Stream required one Intel IPP-A function to split the input image into separate channels, followed by calls to other functions to resize and mirror each channel individually before combining them back into an interleaved format. To minimize kernel overhead and simplify programming, the Intel IPP-A team developed a single hppiAdvancedResize function that combines the resize, color format conversion, and mirroring into a single GPU kernel, allowing the frame to be prepared for the Analytics Stream or the Graphics Stream with a single function call.
    • Direct-to-GPU-memory video input – The intention of the PixelFlow pipeline was to have the entire pipeline, from video capture to graphics rendering, on the GPU. However, the graphics drivers for the targeted platforms did not yet support direct-to-GPU-memory video capture. Instead, each frame was captured to system memory and then copied to GPU memory. To minimize the impact of the copy, the PixelFlow implementation took advantage of the Fast Copy feature supported by the Intel IPP-A library. Using a 4K-aligned system memory buffer, the GPU kernel is able to use shared physical memory to access the data, thus avoiding a copy.
    • NMS, weights, and orientation for Fast9 – The results produced by the Intel IPP-A Fast9 algorithm did not initially match the CPU-based function that it replaced. An investigation revealed that the TI code was also applying non-maximal suppression to the results of the Fast9 calculation. In addition, the TI code also calculated a weight and orientation value for each detected feature point. The team updated the Intel IPP-A Fast9 function to add NMS as an option and to return the weight and orientation values.
    • OpenGL surface sharing and DX9 surface import/export – OpenGL is used for rendering in this pipeline. The video frame is rendered as an OpenGL texture, and other virtual elements are added by calling OpenGL drawing primitives. In the Frame Preparation stage of the pipeline, Intel IPP-A's AdvancedResize function converts the video frame from the input format (NV12, YUY2, etc.) to ARGB. A CPU-based copy of this image into an OpenGL texture was one of the top bottlenecks. The Intel IPP-A team added an import/export capability so that a DX9 surface handle could be extracted from an existing Intel IPP-A matrix, or an Intel IPP-A matrix could be created from an existing DX9 surface. This enabled the use of the OpenGL surface sharing capability in the Intel OpenGL driver. With this functionality, a DX9 surface could be shared with OpenGL as a texture, avoiding the CPU-based copy and keeping the data on the GPU.
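    The non-maximal suppression that the TI code applied to the Fast9 results, and that the updated Intel IPP-A function now performs as an option, can be sketched as a 3x3 strict-maximum filter over a corner-score map. This is an illustrative CPU version under assumed conventions (score > 0 marks a detection), not the IPP-A kernel:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// 3x3 non-maximal suppression over a corner-score map: a pixel survives
// only if its score strictly exceeds every neighbor's score. Border pixels
// are skipped for simplicity.
std::vector<std::pair<int, int>> nms3x3(const std::vector<int>& score,
                                        int w, int h) {
    std::vector<std::pair<int, int>> keep;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int s = score[y * w + x];
            if (s <= 0) continue;            // not a detection
            bool isMax = true;
            for (int dy = -1; dy <= 1 && isMax; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    if (score[(y + dy) * w + (x + dx)] >= s) {
                        isMax = false;       // a neighbor ties or beats it
                        break;
                    }
                }
            if (isMax) keep.emplace_back(x, y);
        }
    return keep;
}
```

    Fusing this filter into the detection kernel, as the updated Fast9 function does, avoids a second pass over the score map.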

    Additional Non-PixelFlow Optimizations

    After implementing the optimizations described in the previous section, a trace performed in the VTune™ analyzer showed that when tracking nine targets, with input video and analytics resolution at 1024x768, several hotspots remained in the computer vision module:

    Remaining Hotspots – Ivy Bridge
    Function                                                     % of CV   Description
    dcvGroupFernsRecognizer::RecognizeAll                        18.95     Using x87 floating point. Should try using SIMD floating point instructions such as Intel® SSE3 or Intel® AVX.
    dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim       16.76     General code generation issues. Expect these would be improved by using the Intel® compiler.
    dcvPolynomSolver::solve_deg3                                 10.20     General code generation issues. Expect these would be improved by using the Intel compiler.

     

    After rebuilding the computer vision module with the Intel® compiler with Intel® AVX instructions enabled, these hotspots were eliminated.

    Remaining Hotspots – Ivy Bridge
    Function                                                     % of CV   Description
    dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim       33.56     Image pyramid generation.
    dcvCorrelationsDetectorLite::ComputerIntegralImage           16.83     Integral image computation.
    dcvKtlOptim::__CalcOpticalFlowPyrLK_Optim_ResizeNN_levels    13.0      LK optical flow.

    The second trace uncovered an instance in the code that still used the old CPU-based image pyramid calculation; it was updated to use the image pyramid calculated by PixelFlow. The remaining hotspots were operations not yet included in PixelFlow: integral image computation and LK optical flow. The team will target these functions first when extending the PixelFlow functionality.
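    Integral image computation, one of those remaining hotspots, is a natural next candidate for a GPU kernel. A minimal CPU reference of the summed-area table it produces (illustrative only, not the dcv implementation):

```cpp
#include <cassert>
#include <vector>

// Integral image (summed-area table): ii(x,y) holds the sum of all pixels
// at or above-left of (x,y), so any rectangular sum can later be read with
// four lookups. Computed in a single row-major pass.
std::vector<long long> integralImage(const std::vector<int>& img,
                                     int w, int h) {
    std::vector<long long> ii(static_cast<std::size_t>(w) * h, 0);
    for (int y = 0; y < h; ++y) {
        long long rowSum = 0;                 // running sum along this row
        for (int x = 0; x < w; ++x) {
            rowSum += img[y * w + x];
            ii[y * w + x] = rowSum + (y ? ii[(y - 1) * w + x] : 0);
        }
    }
    return ii;
}
```

    The row-wise and column-wise accumulations are separable, which is what makes the operation amenable to a GPU prefix-sum formulation.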

    Results – Performance and Power

    The resulting AR pipeline offloads its initial stages to the GPU and provides data for subsequent stages of AR processing. To analyze the PixelFlow implementation of the AR pipeline, the team used a test application from Total Immersion, the "AR Player." This configurable test application allows the user to set operating parameters like the number of targets to track, the video capture resolution and format, the analytics processing resolution, and so on. In addition to the power and performance statistics, the team was interested in the feasibility and impact of increasing the analytics resolution. For the pre-optimized CPU-based flow, the TI AR software used a 320x240 analytics resolution. The additional performance provided by the GPU offload allowed us to experiment with higher resolutions and the resulting impact on responsiveness and quality. The team tested PixelFlow implementation on Ivy Bridge and Bay Trail platforms.

    Results: Ivy Bridge

    We tested the software on the following Ivy Bridge platform:

    Ivy Bridge Platform Details
    Item                    Description
    Computer                HP EliteBook* 8470p
    Processor               Intel® Core™ i7 processor 3720QM
    Clock Speed             2.6 GHz (3.6 GHz Max Turbo Frequency)
    # Cores, Threads        4, 8
    L1, L2, L3 Cache        256 KB, 1 MB, 6 MB
    RAM                     8 GB
    Graphics                Intel® HD Graphics 4000
    # of Execution Units    16
    Graphics Driver         Igdumdim64, 9.18.10.3257, Win7 64-bit
    OS                      Windows* 7 Pro (Build 7601), 64-bit, SP1

    The first test scenario tracked nine targets simultaneously, with both a video capture resolution and an analytics resolution of 640x480.

    Test Scenario #1

    Metric                  Value
    Number of targets       9
    Capture resolution      640x480
    Analytics resolution    640x480

    Performance Results – Ivy Bridge, Test Scenario #1
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              60              60
    Analytics FPS              30              30
    Tracking FPS               30              30
    Frame Preprocessing        0.399           0.088            -0.311            -77.83
    Tracking                   1.412           1.355            -0.057            -4.03
      Construct Pyramid        0.548           0.025            -0.523            -95.44
    Recognition                3.322           1.477            -1.846            -55.55
      Compute Interest Points  1.358           0.035            -1.323            -97.43
      Smooth Image             0.693           0.001            -0.692            -99.89

    The second test scenario also tracks nine targets, but increases the video capture resolution to 1024x768 with an analytics resolution of 640x480.

    Test Scenario #2

    Metric                  Value
    Number of targets       9
    Capture resolution      1024x768
    Analytics resolution    640x480

    Performance Results – Ivy Bridge, Test Scenario #2
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              60              60
    Analytics FPS              30              30
    Tracking FPS               30              30
    Frame Preprocessing        0.391           0.094            -0.297            -75.99
    Tracking                   1.355           0.900            -0.455            -33.58
      Construct Pyramid        0.532           0.024            -0.508            -95.58
    Recognition                2.844           0.917            -1.927            -67.77
      Compute Interest Points  1.225           0.027            -1.199            -97.83
      Smooth Image             0.708           0.001            -0.707            -99.93

    Results: Bay Trail

    Similar tests were run on the following Bay Trail platform:

    Bay Trail Platform Details
    Item                    Description
    Computer                Intel® Atom™ (Bay Trail) Tablet PR1.1B
    Processor               Intel® Atom™ processor Z3770
    Clock Speed             1.46 GHz
    # Cores, Threads        4, 4
    L1, L2 Cache            128 KB, 2048 KB
    RAM                     2 GB
    Graphics                Intel® HD Graphics
    # of Execution Units    4
    Graphics Driver         Igdumdim32.dll, 10.18.10.3341, Win8 32-bit
    OS                      Windows* 8 (Build 9431), 32-bit

    The test scenario is slightly different than the first test scenario run on the Ivy Bridge platform due to the different resolutions supported by the camera on the Bay Trail system.

    Test Scenario #1
    Metric                  Value
    Number of targets       9
    Capture resolution      640x360
    Analytics resolution    640x360

    Performance Results – Bay Trail, Test Scenario #1
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              55              35
    Analytics FPS              30              30
    Tracking FPS               15              15
    Frame Preprocessing        5.215           0.385            -4.830            -92.62
    Tracking                   15.484          10.411           -5.074            -32.77
      Construct Pyramid        6.081           0.122            -5.985            -97.99
    Recognition                28.389          15.590           -12.799           -45.09
      Compute Interest Points  9.235           0.365            -8.870            -96.04
      Smooth Image             7.236           0.011            -7.225            -99.85

    The second scenario for Bay Trail tests the video capture resolution at 1280x720, while the analytics resolution remains at 640x360.

    Test Scenario #2
    Metric                  Value
    Number of targets       9
    Capture resolution      1280x720
    Analytics resolution    640x360

    Performance Results – Bay Trail, Test Scenario #2
    Metric                     Software (ms)   PixelFlow (ms)   Difference (ms)   Difference (%)
    Rendering FPS              12              30
    Analytics FPS              30              25
    Tracking FPS               8               12
    Frame Preprocessing        4.865           0.408            -4.458            -91.62
    Tracking                   16.158          9.718            -6.440            -39.86
      Construct Pyramid        5.995           0.122            -5.872            -97.96
    Recognition                32.398          14.532           -17.865           -55.14
      Compute Interest Points  8.864           0.376            -8.488            -95.76
      Smooth Image             7.337           0.013            -7.324            -99.82

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    For more complete information about performance and benchmark results, visit Performance Test Disclosure

    Power Analysis

    After implementing GPU offload using the PixelFlow pipeline, investigations into the power savings achieved by the GPU offload yielded unexpected results; instead of achieving a significant power savings from offloading the processing to the GPU from the CPU, the power consumption of the PixelFlow implementation was on par with the CPU-only implementation. The following GPUView trace shows why this occurred.

    GPUView trace of the processing for a single frame
    Figure 4 – GPUView trace of the processing for a single frame

    The application dispatched the work to the GPU in separate chunks: CPU setup, GPU operation, wait for completion, CPU setup, GPU operation, wait for completion, etc. This approach impacted power consumption, causing the processor package to be continually active and not allowing the processor to enter deeper sleep states.

    Instead, the pipeline should consolidate GPU operations and maximize CPU/GPU concurrency. The following diagram illustrates the ideal situation to achieve maximum power savings: GPU operations consolidated into a single block, executing concurrently with CPU threads and leaving a period of inactivity that allows the processor package to achieve deeper sleep states.
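    A minimal sketch of the consolidated pattern, with std::async standing in for a real GPU command queue: all GPU stages are submitted as one batch, CPU work runs concurrently, and there is a single synchronization point per frame rather than one per kernel. The function and its workload are illustrative only.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// One consolidated submit, concurrent CPU work, one wait: the shape that
// leaves the package an idle window for deeper sleep states (Figure 5).
int processFrame(const std::vector<int>& pixels) {
    // Submit the whole "GPU" batch once (all per-pixel stages fused).
    auto gpuBatch = std::async(std::launch::async, [&] {
        return std::accumulate(pixels.begin(), pixels.end(), 0);
    });
    // CPU-side work proceeds concurrently instead of idling between kernels.
    int cpuSide = static_cast<int>(pixels.size());
    // Single synchronization point for the frame.
    return gpuBatch.get() + cpuSide;
}
```

    The contrast with Figure 4 is the number of submit/wait pairs: one per frame here, versus one per kernel in the original dispatch pattern.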

    Ideal pattern to maximize power savings
    Figure 5 – Ideal pattern to maximize power savings

    Conclusion

    Moving the key pixel processing bottlenecks of the Total Immersion AR pipeline to the GPU resulted in performance gains on Intel processors, allowing the application to use a larger input frame size for video analysis, find targets faster, track more targets, and track them more smoothly. We expect similar gains can be achieved for similar video analysis pipelines.

    While achieving performance benefits using Intel IPP-A is fairly straightforward, achieving power benefits requires careful design of the processing pipeline. The best design consolidates the GPU operations and maximizes CPU/GPU concurrency to allow the processor to reach deeper sleep states. GPU-capable diagnostic and profiling tools, like GPUView and the Intel VTune analyzer, are essential because they can help identify power-related problems in the pipeline. Consider using these tools during development to verify the power efficiency of a pipeline and avoid having to re-architect it to address power-related issues.

    The PixelFlow pipeline offloaded several of the pixel processing bottlenecks in the TI pipeline. Work remains to move additional operations to the GPU such as integral image, optical flow, FERNS, etc. Once these operations are included in PixelFlow, all of the pixel processing will occur on the GPU with these operations returning metadata to the CPU as input for higher-level operations. The success of the current PixelFlow implementation, which uses IPP-A-based GPU offload, indicates that further gains are possible with additional offloading of pixel processing operations.

    Finally, power and performance optimization extends beyond the vision processing algorithms to other areas such as video input, codecs, and graphics output. Intel IPP-A allows DX9-based surface sharing with related Intel technologies such as the Intel® Media SDK for codecs and the OpenGL graphics driver. Understanding the optimization opportunities with these related technologies is also important, as it allows developers to create entire GPU-based processing pipelines.

    Author Biographies

    Michael Jeronimo is a software architect and applications engineer in Intel's Software and Solutions Division (SSG), focused on helping customers to accelerate computer vision workloads using the GPU.

    Pascal Mobuchon is the VP of Engineering at Total Immersion.

    References

    Item                                     Location
    Total Immersion web site                 http://www.t-immersion.com/
    Total Immersion Wikipedia page           http://en.wikipedia.org/wiki/Total_Immersion_(augmented_reality)
    Augmented Reality – Wikipedia page       http://en.wikipedia.org/wiki/Augmented_reality
    Intel® VTune™ Amplifier XE               https://software.intel.com/en-us/intel-vtune-amplifier-xe
    Intel® Graphics Performance Analyzers    https://software.intel.com/en-us/vcsource/tools/intel-gpa
    GPUView                                  http://msdn.microsoft.com/en-us/library/windows/hardware/ff570133(v=vs.85).aspx
    Intel® IPP-A web site                    https://software.intel.com/en-us/intel-ipp-preview
    NAMD* for Intel® Xeon Phi™ Coprocessor


    Purpose

    This code recipe describes how to get, build, and use the NAMD* Scalable Molecular Dynamics code for the Intel® Xeon Phi™ Coprocessor.

    Introduction

    NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++* parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 200,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER*, CHARMM*, and X-PLOR*.

    NAMD is distributed free of charge with source code. Users can build NAMD or download binaries for a wide variety of platforms. Tutorials show how to use NAMD and VMD* for biomolecular modeling. Find out more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

    Code Support for Intel® Xeon Phi™ Coprocessor

    NAMD 2.10 with Intel® Xeon Phi™ Coprocessor support is expected to be released in early to mid 2014. With support for the Intel® Many Integrated Core (MIC) architecture, Intel expects to push NAMD performance and scalability to higher limits on Intel® architecture. The code remains in development, but it can be compiled from nightly source code builds. Pre-built binaries are not available at this time.

    NAMD code for Intel Xeon Phi Coprocessor continues to evolve. Intel developers are diligently working on known issues in order to achieve the project goals of performance and scalability on Intel Xeon Phi Coprocessor.

    Code Access

    To get access to the NAMD for Intel Xeon Phi Coprocessor code:

    1. Download the original code at http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD and select Source Code under Version Nightly Build.

    Build Directions

    To build NAMD, you also need the following libraries:

    1. TCL (http://www.tcl.tk/);
    2. FFTW (http://www.fftw.org/): use the fftw2 version (you can also try the fftw3 version):

      ./configure --enable-float --enable-type-prefix --enable-static --prefix=<fftwBaseDirHere> --disable-fortran CC=icc

      make CFLAGS=" -O2 " clean install

    3. Charm++ (http://charm.cs.uiuc.edu/software/) can be built in two ways:
      1. Infiniband (verbs-linux-x86_64-smp-iccstatic) version:

        ./build charm++ verbs-linux-x86_64 smp iccstatic --with-production

        Note: check where your ibverbs library is; if it is not in the /opt/ofed/lib64 or /usr/local/ofed/lib64 directories, you need to change the [charmDir]/src/arch/verbs-linux-x86_64/conv-mach.sh file
      2. MPI (mpi-linux-x86_64-smp-mpicxx) version: ./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE -DMPICH_IGNORE_CXX_SEEK

    NAMD build instructions for the Intel Xeon Phi Coprocessor version are essentially the same as compiling standard NAMD, with the following changes:

    Note: You can obtain Intel® Composer XE Version 13 from https://registrationcenter.intel.com/regcenter/register.aspx, or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy.

    Note: using make's "-j" option will speed up compilation significantly.

    Running NAMD Workloads on Intel Xeon Phi Coprocessor

    Running NAMD on Intel Xeon Phi Coprocessor is much like running the standard NAMD code, with the following exceptions:

    1. Source the Intel® compiler environment so its libraries can be found.
    2. Set up the following extra environment variables:
      export KMP_AFFINITY=granularity=fine,compact
      export MIC_ENV_PREFIX=MIC
      export MIC_OMP_NUM_THREADS=240
      export MIC_KMP_AFFINITY=granularity=fine,balanced
    3. To execute NAMD, on the namd2 command line, add +devices xxx, where xxx is a list of devices (e.g. "0,1" for the first two devices on a node). If the user omits the "+devices xxx" option at runtime, the application will attempt to use all available devices on a given node.
    4. The number of PEs per node must be greater than the number of MICs in the node, and there must be at least one patch per PE.

      Host threads and PEs are specified with the command-line options traditionally used.

    Some examples of running NAMD workloads:

    1. Ibverbs:

      $BIN_DIR/charmrun ++nodelist $NODEFILE +p $NUM_PROCS ++ppn $PPN $BIN_DIR/wrapper.sh $BIN_DIR/$BIN $WORKLOAD_DIR/$CONFIG_FILE +pemap 1-$PPN +commap 0 "+devices 0,1"

      PPN – for best results use one less than the number of available cores, for example PPN=23 if you have 24 cores per node (or PPN=47 if you use hyper-threading5)

      NUM_PROCS = $PPN * $NODECOUNT

    2. MPI:

      mpiexec.hydra -perhost 1 -n $NODECOUNT $BIN_DIR/$BIN +ppn $PPN $WORKLOAD_DIR/$CONFIG_FILE +pemap 1-$PPN +commap 0 +devices 0,1

      Notes: "+pemap 1-$PPN +commap 0" more effective than "+setcpuaffinity"

    Performance Testing2,3

    The following results show performance on a single node and cluster.

    Single-node Performance Testing

    Note: Single-node performance uses the multi-core build of NAMD (no network layers are used).

    Single-node Platform Configurations4

    The following hardware and software were used for the above recipe and performance testing.

    Server Configuration (Intel® Xeon® processor E5 V2 family):

    • 2-socket/24 cores:
    • Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading5
    • Operating System: Red Hat Enterprise Linux* 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
    • Memory: 64GB
    • Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading5, Memory: 15872 MB
    • Intel® Many-core Platform Software Stack Version 2.1.6720-15
    • Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

    Server Configuration (Intel® Xeon® processor E5 family):

    • 2-socket/16 cores:
    • Processor: Intel® Xeon® processor E5 @ 2.60GHz (8 cores) with Intel® Hyper-Threading5
    • Operating System: Red Hat Enterprise Linux* 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
    • Memory: 64GB
    • Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading5, Memory: 15872 MB
    • Intel® Many-core Platform Software Stack Version 2.1.6720-13
    • Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

    NAMD

    • NAMD: Linux-x64_64-icc
    • Charm++: multicore-linux64-icc
    • Configuration parameters were modified to achieve optimal performance4

    Cluster Performance Testing2,3

    Note: Cluster results use Infiniband*.

    Cluster Platform Configuration4

    The following hardware and software were used for the above recipe and performance testing.

    Endeavor Cluster Configuration:

    • 2-socket/24 cores:
    • Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading5
    • Operating System: Red Hat Enterprise Linux* 2.6.32-358.6.2.el6.x86_64.crt1 #4 SMP Fri May 17 15:33:33 MDT 2013 x86_64 x86_64 x86_64 GNU/Linux
    • Memory: 64GB
    • Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading5, Memory: 15872 MB
    • Intel® Many-core Platform Software Stack Version 2.1.6720-16
    • Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

    NAMD

    • NAMD: Linux-x64_64-icc
    • Charm++: verbs-linux-x86_64-smp-iccstatic
    • Configuration parameters were modified to achieve optimal performance4

    DISCLAIMERS:

    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

    A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

    Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

    The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

    Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

    Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

    2. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    3. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

    Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

    Notice revision #20110804

    4. For more information go to http://www.intel.com/performance

    5. Available on select Intel® processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

    Intel, the Intel logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the US and/or other countries.

    *Other names and brands may be claimed as the property of others.

    Copyright © 2014 Intel Corporation. All rights reserved.

    Contest Winner Integrates Augmented Reality with an Encyclopedia into ARPedia*


    By Garret Romaine

    The interfaces of the future are already being tried out in a lab or on a test screen somewhere, waiting to become fully developed examples and demos. The winners of the Creative User Experience category in Phase 2 of the Intel® Perceptual Computing Challenge, announced at CES 2014, are proof of that. Zhongqian Su and a group of graduate students used the Intel® Perceptual Computing SDK and the Creative Interactive Gesture Camera Kit to integrate augmented reality (AR) and an ordinary encyclopedia into ARPedia*, a blend of augmented reality and Wikipedia*. ARPedia is a new kind of knowledge base that users navigate with hand gestures rather than keystrokes.

    The six-person team from Beijing University of Technology developed the application in two months using a variety of tools. They used Maya* 3D to create the 3D models, Unity* 3D to render the 3D scenes and develop the application logic, and the Intel Perceptual Computing SDK Unity 3D plug-in (included in the SDK) to tie all the components together. The demo combines 3D models and animated video to create a new way of interacting with a virtual world. The application encourages users to explore an unknown world digitally by moving their bodies and using gesture, voice, and touch, and the team's future work looks very promising.

    All About Dinosaurs


    With AR visual effects, ARPedia is effectively a game for authoring and experiencing stories. As users grow accustomed to seamless interactive experiences, many technologies are being used to create interactivity, even when that interactivity is quite simple. In a PC game, a mouse and keyboard or a touch screen are the usual ways to interact with an application. ARPedia uses none of these. In an AR application, a natural user interface is essential. ARPedia users control the action with bare-hand gestures and facial movement, thanks to the Creative Senz3D* camera. Many engaging gestures enhance the game experience, such as grabbing, waving, pointing, lifting, and pressing. These gestures make players the true controllers of the game and of the virtual dinosaur world.


    Figure 1: ARPedia* combines augmented reality and a wiki-based encyclopedia, letting users navigate the interface with hand gestures.

    Team leader Zhongqian Su had built an educational application around a small Tyrannosaurus rex character for a previous assignment, so he cast the well-known dinosaur as the star of the ARPedia application. Players reach out with hand movements to pick up a small dinosaur image and place it at various points on the screen. Depending on where the dinosaur is placed, users learn about the creature's diet, habits, and other characteristics.

    Figure 2: Users interact with the small T. rex to learn about fossils, paleontology, and geology.

    According to team member Liang Zhang, the team had already written an AR application for the education market using the dinosaur 3D model before the contest. Although they had an application as a starting point, the contest requirements demanded substantial rework. For example, their earlier camera code used a different 3D technology, so they had to rewrite it (see Figure 3) to work with the newer Creative Interactive Gesture Camera Kit. That also meant getting up to speed quickly on the Intel Perceptual Computing SDK.

    
    // Revised "open hand" test: the hand counts as open only when at least
    // two distinct fingertips are detected, so a wrist or fist misread as a
    // single point no longer registers as an open hand.
    bool isHandOpen(PXCMGesture.GeoNode[] data)
    	{
    		int n = 1;
    		for(int i=1;i<6;i++)	// nodes 1..5 are the finger geo-nodes
    		{
    			if(data[i].body==PXCMGesture.GeoNode.Label.LABEL_ANY)
    				continue;	// this finger was not detected
    			bool got = false;
    			for(int j=0;j<i;j++)	// ignore nodes duplicating an earlier position
    			{
    				if(data[j].body==PXCMGesture.GeoNode.Label.LABEL_ANY)
    					continue;
    				Vector3 dif = new Vector3();
    				dif.x = data[j].positionWorld.x-data[i].positionWorld.x;
    				dif.y = data[j].positionWorld.y-data[i].positionWorld.y;
    				dif.z = data[j].positionWorld.z-data[i].positionWorld.z;
    				if(dif.magnitude<1e-5)
    					got = true;	// same world position as an earlier node
    			}
    			if(got)
    				continue;
    			n++;	// one more distinct finger
    		}
    		return (n>2);	// open: at least two distinct fingers detected
    	}

    Figure 3: ARPedia* rewrote its camera code to work with the Creative Interactive Gesture Camera.

    Fortunately, Zhang said, his company is eager to invest time and effort in learning new technologies. "We have developed many applications," he said. "We keep an eye on new hardware and software improvements our company can use. Before this contest, we used natural body interaction with the Microsoft Kinect*. When we discovered this camera, we were excited and wanted to try it. We also saw the contest as a chance to improve our technical skills, so why not give it a try?"

    Smart Decisions Up Front


    With the contest's limited time frame, the team had to ramp up on the new technology quickly. Zhang spent two weeks learning the Intel Perceptual Computing SDK, and the team then designed in as many of the interaction techniques he had identified as possible.

    Meanwhile, the writers began drafting stories and feasible scenarios the team could code. They met to discuss the options, with Zhang pointing out strengths and weaknesses based on his knowledge of the SDK. He understood the technical details well enough to make informed decisions, so the team confidently chose what he described as "...the best story and the most interesting, best-fitting interactions."

    One of the most important early decisions, Zhang said, was to keep players fully engaged in the game. For example, in the early hatching stage, the player takes on a god-like role, performing actions such as creating the earth, making it rain, and raising the sun. Players have to set up and learn a number of gestures.

    In another stage, the player has to catch the dinosaur. Zhang set up the system so the user holds a piece of meat in hand, and the dinosaur comes forward to snatch it (Figure 4). The action lets players interact with the dinosaur and builds engagement. "We want to keep players immersed in the virtual world," he said.

    Figure 4: Feeding the baby dinosaur immerses users and creates interaction.

    Carrying those plans forward, however, took more work. The demo includes many new gestures for users to learn. "When I talked to people playing the game at the Intel booth at CES, I found they weren't quite sure how to play, because each stage has various levels of gestures," Zhang said. "We found they weren't as intuitive as we had imagined, which convinced us that when we add new interaction methods, the design has to be more intuitive. We will definitely keep that in mind for our next project."

    The ARPedia team introduced two primary gestures. One is "both hands open"; the other is "one hand open, fingers spread." The both-hands-open gesture, used to launch the application, was simple and straightforward to code. Coding the second gesture took more work.

    Figure 5: The team worked to ensure the camera would not detect the wrist as a point on the palm.

    "The initial open-hand pose was not very accurate," Zhang explained. "Sometimes the wrist was detected as a point on the palm, or a fist was detected as a finger, and the system would then recognize the hand as open, which was wrong. So we designed a new open-hand pose that requires at least two extended fingers before the hand is recognized as open." The team then added text prompts on screen to guide users (Figure 5).

    Intel® Perceptual Computing SDK


    The ARPedia team used the 2013 Intel Perceptual Computing SDK, citing in particular its ease of use for camera calibration, application debugging, speech recognition support, facial analysis, close-range depth tracking, and AR. The SDK lets multiple perceptual computing applications share the input device and displays a privacy notification when the RGB and depth cameras are on. It makes it easy to add more usage modes, add new input hardware, support new game engines and custom algorithms, and support new programming languages.

    The utilities include C/C++ components such as PXCUPipeline (C) and UtilPipeline (C++), used mainly to set up and manage pipeline sessions. The framework and session ports include ports for Unity 3D, Processing, other frameworks and game engines, and programming languages such as C# and Java*. The SDK interfaces include the core framework APIs, I/O classes, and algorithms. A perceptual computing application interacts with the SDK through these three main functional blocks.

    "The Intel [Perceptual Computing] SDK helped a lot," Zhang said. "We had no problems developing this application. We were able to get a great deal of work done in a very short time."

    Intel® RealSense™ Technology

    Developers around the world are learning about Intel® RealSense™ technology. At CES 2014, Intel announced Intel RealSense technology as the new name and brand for what was previously Intel® Perceptual Computing technology. The intuitive new user interface builds on capabilities such as gesture and voice that Intel brought to market in 2013. With Intel RealSense technology, users gain additional new capabilities, including scanning, modifying, printing, and sharing in 3D, plus major advances in AR interfaces. With these capabilities, users can naturally manipulate and play with scanned 3D objects in games and applications using advanced hand and finger sensing.

    Zhang can now see firsthand how other developers are working with AR technology. At CES 2014, he studied demos from around the world. While each demo was unique and aimed at different goals, he saw the advantages that rapidly advancing 3D camera technology brings. "Including gesture detection in the SDK is very helpful. People can still use the camera in different ways, but the SDK already gives them a broad foundation. I suggest developers use this technology for their own projects and find the capabilities to fully develop their ideas."

    With advanced hand and finger tracking, developers can let users control devices through complex 3D manipulation, with greater precision and simpler commands. With natural-language speech technology and accurate facial recognition, devices can better understand what their users want.

    Depth sensing enables more immersive gaming, and accurate hand and finger tracking brings better control to any virtual adventure. Games become more lifelike and more fun. With AR and finger-sensing technology, developers can blend the real and virtual worlds.

    Zhang believes the upcoming Intel RealSense 3D camera will be a great fit for the scenarios he knows well. "From what I know, it will be even better — more accurate, more capable, more intuitive. We're really looking forward to it. It will also add 3D face tracking and other great features. It's the first 3D camera for laptops used as a motion-sensing device, though it's different from Kinect, and it delivers the same capabilities as an integrated 3D camera. I think the new Intel camera is a better device for manufacturers to integrate into laptops and tablets, and as a tiny user-interface device it has great portability advantages. With this camera, we'll surely build many great projects in the future."

    Maya 3D


    The ARPedia team used Autodesk Maya 3D modeling software to continue developing its signature small, lifelike model — the baby Tyrannosaurus rex. Once the right model was built, complete with lifelike motion and fine-grained color, the rest of the application fell into place.

    Maya is the gold standard for creating 3D computer animation, modeling, simulation, and rendering. It is a highly extensible production platform that supports next-generation display technology, accelerates modeling workflows, and handles complex data. The team was already familiar with Maya and could easily update and integrate it with their existing graphics. Zhang says his team spent extra time on graphics development: "We spent nearly a month designing and revising the graphics to polish everything and improve the interactions."

    Unity 3D


    The team chose the Unity engine as the foundation of the application. Unity is a powerful rendering engine for creating interactive 3D and 2D content. The Unity tool set is both an application builder and an application development tool, known for being intuitive, easy to use, and supporting multi-platform development. For first-time and experienced users alike, it is an ideal solution for building simulations, casual and large-scale games, and applications for the web, mobile, or consoles.

    Zhang says the choice of Unity was never in doubt. "We build all of our AR applications with Unity, including this one. We know the tool, and we trust it to do everything we need." He could quickly and easily import meshes from Maya as native 3D application files, saving both time and effort.

    Today's Information, Tomorrow's Games


    ARPedia opens many promising directions for future work. For starters, the team sees big opportunities in games and other applications that build on its Intel Perceptual Computing Challenge results. "We've talked with many interested organizations," Zhang says. "They want us to polish this version further. Hopefully we can find a place in the market. We'll add more dinosaurs to the game and bring in everything that is known about them to attract more users. It's a fun environment, and we'll design more interesting interactions around it."

    "We're also planning a pet game in which users raise their own virtual dinosaurs. They can build a personal collection and show it off to each other. We'll make it a networked game, too, and we're adding more scenes in the new version."

    The team's win came as a surprise, because they weren't familiar with the work of other development teams around the world. "We didn't know what anyone else was doing," Zhang says. "We focused on our own work and had few chances to see what others were building." Now they know where they stand and are ready for the next challenge. "The contest gave us the motivation to prove ourselves, and the chance to compare notes and communicate with other developers. We're grateful to Intel for the opportunity. We now know much more about the leading technology worldwide, and we'll be more confident building augmented reality applications in the future."

    Resources


    Intel® Developer Zone
    Intel® Perceptual Computing Challenge
    Intel® RealSense™ Technology
    Intel® Perceptual Computing SDK
    Check the compatibility guide in the Perceptual Computing documentation to make sure your existing applications will work with the Intel® RealSense™ 3D camera.
    Intel® Perceptual Computing SDK 2013 R7 Release Notes
    Maya* software overview
    Unity*
    Unity*

  • ARPedia
  • Creative Senz3D
  • Autodesk Maya
  • Unity 3D
  • Gesture Recognition
  • RealSense
  • Développeurs
  • Microsoft Windows* 8
  • Windows*
  • Intermédiaire
  • Intel® Perceptual Computing SDK
  • Informatique perceptuelle
  • Expérience et conception utilisateur
  • PC portable
  • Tablette
  • URL

  • What's new? Beta Update 1 - Intel® VTune™ Amplifier XE 2015 Beta


    Intel® VTune™ Amplifier XE 2015 Beta

    A profiler for serial and parallel performance analysis. Overview, training, support.

    New for Beta Update 1! 

    • Ability to resolve symbols for modules with build-id and separate files with debug information
    • NMI Watchdog timer automatically disabled during data collection
    • Support for importing *.perf files with the event-based sampling data collected by the Linux Perf tool
    • Option to limit the call stack size (in system pages) and minimize collection overhead for custom hardware event-based sampling analysis results
    • Option to display verbose collection and finalization messages in the Collection Log window
    • Support for importing csv files with instant counters collected out of the VTune Amplifier with the external collector
    • Ability to specify x64 code from a 32-bit process in the JIT API
    • Remote system configuration options provided in the Project Properties: Target tab to specify a path to the VTune Amplifier installed on a remote machine and a path to a remote temporary directory used for storing performance results
    • Optimized workflow for the remote data collection in the Attach to Process mode providing an option in the Project Properties: Target tab to easily get a list of processes running on the remote Linux* system and select the required process for analysis
    • Updated Event Reference for Intel microarchitectures code name Ivy Bridge, Ivy Town, and Haswell
    • Updated product toolbar providing quick access to the product documentation with the new Help button and to the Import dialog box (standalone only) with the Import Result button
    • Ubuntu 14.04 support

    Resources

    Contents

     

    File: vtune_amplifier_xe_2015_beta_update1.tar.gz

    Installer for Intel® VTune™ Amplifier XE for Linux* 2015 Beta Update 1

    File: VTune_Amplifier_XE_2015_beta_update1_setup.exe

    Installer for Intel® VTune™ Amplifier XE for Windows* 2015 Beta Update 1

    File: vtune_amplifier_xe_2015_beta_update1.dmg

    Installer for Intel® VTune™ Amplifier XE for OS X* 2015 Beta Update 1

    * Other names and brands may be claimed as the property of others.

    Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

  • performance profiling
  • Beta tools
  • Développeurs
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • .NET*
  • C#
  • C/C++
  • Fortran
  • Java*
  • Avancé
  • Débutant
  • Intermédiaire
  • Amplificateur Intel® VTune™ XE
  • OpenCL*
  • OpenMP*
  • URL
  • How to use 2MB huge pages when allocating memory for offload input/output variables


    The offload compilation model that the Intel compiler provides for the Intel® Xeon Phi™ coprocessor lets programmers add pragmas or new keywords to host code so that designated code sections run on the coprocessor. In explicit copy mode, when using an offload pragma/directive to run a code section on the coprocessor, the programmer must also list the pointer and array variables to be copied between the host and the card. During compilation, the Intel compiler automatically inserts the code that performs the data transfers between host and coprocessor.

     

    By default, the offload runtime uses 4 KB pages when allocating coprocessor memory for offload input/output variables. When the offloaded code needs a large amount of input/output memory, the allocation can trigger many page faults, and users will observe long allocation latency. To address this, the Intel compiler provides an environment variable, MIC_USE_2MB_BUFFERS, which lets the runtime use 2 MB pages when allocating memory for offload input/output variables in certain cases. The variable is described below:

     

    MIC_USE_2MB_BUFFERS

     

    When allocating space for a pointer variable whose runtime memory footprint exceeds the value of this environment variable, 2 MB pages are used.

     

    The environment variable is set as:

     

    an integer value with a suffix of B|K|M|G|T, where

     

          B = bytes

          K = kilobytes

          M = megabytes

          G = gigabytes

          T = terabytes

    For example:

    MIC_USE_2MB_BUFFERS=64K

    With this setting, the offload runtime uses 2 MB huge pages when allocating coprocessor memory for all input/output variables larger than 64 KB.

    For more information about developing Xeon Phi coprocessor applications with the Intel compiler, see the Intel compiler user and reference guide.

     

     

  • Intel Parallel Composer XE
  • Développeurs
  • Étudiants
  • Linux*
  • Serveur
  • C/C++
  • Fortran
  • Intermédiaire
  • Intel® Composer XE
  • Outils de développement
  • URL
  • Zone des thèmes: 

    IDZone

    Implementing Gesture Sequences in Unity* 3D with TouchScript


    Download PDF

    By Lynn Thompson

    When configuring touch targets to control other elements of a scene, it’s important to minimize the screen space that the controlling elements occupy. In this way, you can devote more of the Ultrabook™ device’s viewable screen area to displaying visual action and less to user interaction. One means of accomplishing this is to configure the touch targets to handle multiple gesture combinations, eliminating the need for more touch targets on the screen. An example is the continual tapping of a graphical user interface (GUI) widget, causing a turret to rotate while firing, instead of a dedicated GUI widget for firing and another for rotating the turret (or another asset in the Unity* 3D scene).

    This article shows you how to configure a scene using touch targets to control the first person controller (FPC). Initially, you’ll configure the touch targets for basic FPC position and rotation; then, augment them for additional functionality. This additional functionality is achieved through existing GUI widgets and does not require adding geometry. The resulting scene will demonstrate Unity 3D running on Windows* 8 as a viable platform for handling multiple gestures used in various sequences.

    Configure the Unity* 3D Scene

    I begin setting up the scene by importing an FBX terrain asset with raised elevation and trees, which I had exported from Autodesk 3ds Max*. I then place an FPC at the center of the terrain.

    I set the depth of the scene’s main camera, a child of the FPC, to −1. I create a dedicated GUI widget camera with an orthographic projection, a width of 1, and a height of 0.5 as well as Don’t Clear flags. I then create a GUIWidget layer and set it as the GUI widget camera’s culling mask.

    Next, I place basic GUI widgets for FPC manipulation in the scene in view of the dedicated orthogonal camera. For the left hand, I configure a sphere for each finger. The left little sphere moves the FPC left, the left ring sphere moves it forward, the left middle sphere moves it right, and the left index sphere moves it backward. The left thumb sphere makes the FPC jump and launches spherical projectiles at an angle of 30 degrees clockwise.

    For the right-hand GUI widget, I create a cube (made square through the orthogonal projection). I configure this cube with a Pan Gesture and tie it to the MouseLook.cs script. This widget delivers functionality similar to that of an Ultrabook touch pad.

    I place these GUI widgets out of view of the main camera and set their layer to GUIWidget. Figure 1 shows the scene at runtime, with these GUI widgets in use to launch projectiles and manipulate the position of the FPC.


    Figure 1. FPC scene with terrain and launched spherical projectiles

    The projectiles launched from the FPC pass through the trees in the scene. To remedy this, I would need to configure each tree with a mesh or box collider. Another issue with this scene is that the forward velocity is slow if I use the touch pad to have the FPC look down while pressing the ring finger to move the FPC forward. To resolve this issue, I limit the “look-down” angle when the “move forward” button is pressed.

    Multiple Taps

    The base scene contains an FPC that fires projectiles at a specified angle off center (see Figure 1). The default for this off-center angle is 30 degrees clockwise when looking down on the FPC.

    I configure the scene so that multiple taps, initiated less than a specified time apart, alter the angle at which the projectiles are launched and then launch a projectile. I can configure this behavior to increase the angle change with the number of taps in the sequence by manipulating float variables in the left-thumb jump script. These float variables control the firing angle and keep track of the time since the last projectile was launched:

    	private float timeSinceFire = 0.0f;
    	private float firingAngle = 30.0f;

    I then configure the Update loop in the left-thumb jump script to decrement the firing angle if the jump sphere tap gestures are less than one-half second apart. The firing angle is reset to 30 degrees if the taps are greater than one-half second apart or the firing angle has decremented to 0 degrees. The code is as follows:

    		timeSinceFire += Time.deltaTime;
    
    			if(timeSinceFire <= 0.5f)
    			{
    				firingAngle += -1.0f;
    
    			}
    			else
    			{
    				firingAngle = 30.0f;
    			}
    
    			timeSinceFire = 0.0f;
    
    			if(firingAngle <= 0)
    			{
    				firingAngle = 30;
    			}
    
    
    			projectileSpawnRotation = Quaternion.AngleAxis(firingAngle,CH.transform.up);

    This code produces a strafing effect, where continuous tapping launches projectiles while decrementing the angle at which they're launched (see Figure 2). This effect is something you can let a user customize or make available under specific conditions in a simulation or game.


    Figure 2. Continuous taps rotate the heading of the launched projectile.

    Scale Followed by Pan

    I configured the square in the lower right of Figure 1 to function similarly to a touch pad on a keyboard. Panning over the square doesn’t move the square but instead rotates the scene’s main camera up, down, left, and right by feeding the FPC’s MouseLook script. Similarly, a scaling gesture (similar to a pinch on other platforms) that the square receives doesn’t scale the square but instead alters the main camera’s field of view (FOV), allowing a user to zoom in and out on what the main camera is currently looking at (see Figure 3). I will configure a Pan Gesture initiated shortly after a Scale Gesture to return the FOV to the default of 60 degrees.

    I configure this function by setting a Boolean variable—panned—and a float variable to hold the time since the last Scale Gesture:

    	private float timeSinceScale;
    	private float timeSincePan;
    	private bool panned;

    I set the timeSinceScale variable to 0.0f when a Scale Gesture is initiated and set the panned variable to True when a Pan Gesture is initiated. The FOV of the scene’s main camera is adjusted in the Update loop as follows in the script attached to the touch pad cube:

    		timeSinceScale += Time.deltaTime;
    		timeSincePan += Time.deltaTime;
    
    		if(panned && timeSinceScale >= 0.5f && timeSincePan >= 0.5f)
    		{
    			fieldOfView += 5.0f;
    			panned = false;
    		}
    
    		if(panned && timeSinceScale <= 0.5f)
    		{
    			fieldOfView = 60.0f;
    			panned = false;
    		}
    
    		Camera.main.fieldOfView = fieldOfView;

    Following are the onScale and onPan functions. Note the timeSincePan float variable, which prevents the FOV from being constantly increased when the touch pad is in use for the camera:

    	private void onPanStateChanged(object sender, GestureStateChangeEventArgs e)
        {
            switch (e.State)
            {
                case Gesture.GestureState.Began:
                case Gesture.GestureState.Changed:
                    var target = sender as PanGesture;
                    Debug.DrawRay(transform.position, target.WorldTransformPlane.normal);
                    Debug.DrawRay(transform.position, target.WorldDeltaPosition.normalized);
    
                    var local = new Vector3(transform.InverseTransformDirection(target.WorldDeltaPosition).x, transform.InverseTransformDirection(target.WorldDeltaPosition).y, 0);
                    targetPan += transform.InverseTransformDirection(transform.TransformDirection(local));
    
                    //if (transform.InverseTransformDirection(transform.parent.TransformDirection(targetPan - startPos)).y < 0) targetPan = startPos;
                    timeSincePan = 0.0f;
    				panned = true;
    				break;
    
            }
    
        }
    
    	private void onScaleStateChanged(object sender, GestureStateChangeEventArgs e)
        {
            switch (e.State)
            {
                case Gesture.GestureState.Began:
                case Gesture.GestureState.Changed:
                    var gesture = (ScaleGesture)sender;
    
                    if (Math.Abs(gesture.LocalDeltaScale) > 0.01 )
                    {
    					fieldOfView *= gesture.LocalDeltaScale;
    
    					if(fieldOfView >= 170){fieldOfView = 170;}
    					if(fieldOfView <= 1){fieldOfView = 1;}
    
    					timeSinceScale = 0.0f;
    
    
                    }
                    break;
            }
        }


    Figure 3. The scene’s main camera “zoomed in” on distance features via the right GUI touch pad simulator

    Press and Release Followed by Flick

    The following gesture sequence increases the horizontal speed of the FPC when the left little sphere receives press and release gestures followed by a Flick Gesture within one-half second.

    To add this functionality, I begin by adding a float variable to keep track of the time since the sphere received the Release Gesture and a Boolean variable to keep track of the sphere receiving a Flicked Gesture:

    	private float timeSinceRelease;
    	private bool flicked;

    As part of the scene’s initial setup, I configured the script attached to the left little sphere with access to the FPC’s InputController script, which allows the left little sphere to instigate moving the FPC to the left. The variable controlling the FPC’s horizontal speed is not in the InputController but in the FPC’s CharacterMotor. Granting the left little sphere’s script access to the CharacterMotor is configured similarly:

    		CH = GameObject.Find("First Person Controller");
    		CHFPSInputController = (FPSInputController)CH.GetComponent("FPSInputController");
    		CHCharacterMotor = (CharacterMotor)CH.GetComponent ("CharacterMotor");

    The script’s onFlick function merely sets the Boolean variable flicked equal to True.

    The script’s Update function (called once per frame) alters the FPC’s horizontal movement speed as follows:

    		if(flicked && timeSinceRelease <= 0.5f)
    		{
    			CHCharacterMotor.movement.maxSidewaysSpeed += 2.0f;
    			flicked = false;
    		}
    
    		timeSinceRelease += Time.deltaTime;
    	}

    This code gives the user the ability to increase the horizontal movement speed of the FPC by pressing and releasing the left little sphere, and then flicking the left little sphere within one-half second. You could configure the ability to decrease the horizontal movement speed in any number of ways, including a Flick Gesture following a press and release of the left index sphere. Note that the CHCharacterMotor.movement method contains not only maxSidewaysSpeed but gravity, maxForwardsSpeed, maxBackwardsSpeed, and other parameters. The many TouchScript gestures and geometries receiving them used in combination with these parameters provide many options and strategies for developing touch interfaces to Unity 3D scenes. When developing touch interfaces for these types of applications, experiment with these many options to narrow them to those that provide the most efficient and ergonomic user experience.

    Issues with Gesture Sequences

    The gesture sequences that I configured in the examples in this article rely heavily on the Time.deltaTime function. I use this differential in combination with the gestures before and after the differential to determine an action. The two main issues I encountered when configuring these examples are the magnitude of the time differential and the gestures used.

    Time Differential

    The time differential I used in this article is one-half second. When I used a smaller magnitude of one-tenth second, the gesture sequences weren’t recognized. Although I felt I was tapping fast enough for the gesture sequence to be recognized, the expected scene action did not occur. This is possibly the result of the hardware and software latency. As such, when developing gesture sequences, it’s a good idea to keep in mind the performance characteristics of the target hardware platforms.

    Gestures

    When configuring this example, I originally planned to have Scale and Pan Gestures followed by Tap and Flick Gestures. Having the Scale and Pan Gestures functioning as desired, I introduced a Tap Gesture, which caused the Scale and Pan Gestures to cease functioning. Although I was able to configure a sequence of Scale followed by Pan, this is not the most user-friendly gesture sequence. A more useful sequence may consist of another geometry target in the widget to accept the Tap and Flick Gestures after the Scale and Pan Gestures.

    I used the time differential of one-half second in this example as the break point for actions taken (or not taken). Although it adds a level of complexity to the user interface (UI), you could configure this example to use multiple time differentials. Where Press and Release Gestures followed by a Flick Gesture within one-half second may cause horizontal speed to increase, the Press and Release Gestures followed by a Flick Gesture between one-half and 1 second may decrease the horizontal speed. Using the time differentials in this manner not only offers flexibility for the UI but could be used to plant “Easter eggs” within the scene itself.

    Conclusion

    The gesture sequence scene I configured for this article uses Unity 3D with TouchScript on Ultrabook devices running Windows 8. The sequences implemented are intended to reduce the amount of touch screen area required for the user to interact with the application. The less touch screen area dedicated to user interaction, the more area you can dedicate to more visually appealing content.

    When I wasn’t able to get a gesture sequence to perform as desired, I was able to formulate an acceptable alternative. Part of this performance tuning was adjusting the Time.deltaTime differential to get a gesture sequence to perform as desired on the hardware available. As such, the Unity 3D scene I constructed in this article shows that Windows 8 running on Ultrabook devices is a viable platform for developing apps that use gesture sequences.

    Related Content

    About the Author

    Lynn Thompson is an IT professional with more than 20 years of experience in business and industrial computing environments. His earliest experience is using CAD to modify and create control system drawings during a control system upgrade at a power utility. During this time, Lynn received his B.S. degree in Electrical Engineering from the University of Nebraska, Lincoln. He went on to work as a systems administrator at an IT integrator during the dot com boom. This work focused primarily on operating system, database, and application administration on a wide variety of platforms. After the dot com bust, he worked on a range of projects as an IT consultant for companies in the garment, oil and gas, and defense industries. Now, Lynn has come full circle and works as an engineer at a power utility. Lynn has since earned a Masters of Engineering degree with a concentration in Engineering Management, also from the University of Nebraska, Lincoln.

     

    Intel, the Intel logo, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and other countries.
    *Other names and brands may be claimed as the property of others
    Copyright© 2014 Intel Corporation. All rights reserved.

  • touch interfaces
  • unity
  • touch targets
  • Gesture Sequencing
  • Développeurs
  • Microsoft Windows* 8
  • Windows*
  • Unity
  • Intermédiaire
  • Interfaces tactiles
  • PC portable
  • Tablette
  • URL
  • Installing Intel(R) Cluster Studio XE on the systems with unsupported CPUs


    Using a VPS (Virtual Private Server) in the cloud as a build machine has benefits. For example, I don’t have to pay the electricity bills, and I have access to a fresh build from anywhere in the world.

    I used the following steps to set up my build system on a new VPS.

    1. Download Intel® Cluster Studio XE

    I downloaded my copy of Intel® Cluster Studio XE 2013 SP1 Update 1 from the Intel® Software Development Products Registration Center (IRC) .

    [user01@test-2 ~]$ wget http://registrationcenter.intel.com/irc_nas/3918/l_ics_2013.1.046_intel64.tgz
    --2014-06-23 03:58:01--  http://registrationcenter.intel.com/irc_nas/3918/l_ics_2013.1.046_intel64.tgz
    Resolving registrationcenter.intel.com... 198.175.96.34
    Connecting to registrationcenter.intel.com|198.175.96.34|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2672369932 (2.5G) [application/x-compressed]
    Saving to: “l_ics_2013.1.046_intel64.tgz”
    
    100%[====================================>] 2,672,369,932  266K/s   in 1h 51m
    
    2014-06-23 05:49:03 (392 KB/s) - “l_ics_2013.1.046_intel64.tgz” saved [2672369932/2672369932]
    

    2. Try to install it

    I unpacked it

    [user01@test-2 ~]$ tar -xzf  ./l_ics_2013.1.046_intel64.tgz

    and tried to install by running the install.sh script.

    [user01@test-2 ~]$ cd l_ics_2013.1.046_intel64
    [user01@test-2 l_ics_2013.1.046_intel64]$ ./install.sh
    CPU is not supported.

    Unfortunately, the Intel® Cluster Studio XE installer doesn’t recognize the CPU.
     

    [user01@test-2 l_ics_2013.1.046_intel64]$ cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 2
    model name      : QEMU Virtual CPU version 1.0
    stepping        : 3
    cpu MHz         : 2399.998
    cache size      : 4096 KB
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 4
    wp              : yes
    flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm up rep_good unfair_spinlock pni vmx cx16 popcnt hypervisor lahf_lm
    bogomips        : 4799.99
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 40 bits physical, 48 bits virtual
    power management:
    

    3. Use --ignore-cpu flag

    I used the --ignore-cpu flag to tell the installer not to check the system CPU.

    [user01@test-2 l_ics_2013.1.046_intel64]$ ./install.sh --ignore-cpu
    Please make your selection by entering an option.
    Root access is recommended for evaluation.
    
    1. Run as a root for system wide access for all users [default]
    2. Run using sudo privileges and password for system wide access for all users
    3. Run as current user to limit access to user level
    
    h. Help
    q. Quit
    
    …
    
    Step 6 of 7 | Installation
    --------------------------------------------------------------------------------
    Each component will be installed individually. If you cancel the installation,
    some components might remain on your system. This installation may take several
    minutes, depending on your system and the options you selected.
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library, Runtime Environment for applications running on
    Intel(R) 64 Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library, Runtime Environment for applications running on
    Intel(R) Many Integrated Core Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library for applications running on Intel(R) 64
    Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Library for applications running on Intel(R) Many
    Integrated Core Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Trace Analyzer for Intel(R) 64 Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Trace Collector for Intel(R) 64 Architecture component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Trace Collector for Intel(R) Many Integrated Core
    Architecture component... done
    --------------------------------------------------------------------------------
    Installing Command line interface component... done
    --------------------------------------------------------------------------------
    Installing Sampling Driver kit component... done
    --------------------------------------------------------------------------------
    Installing Power Driver kit component... done
    --------------------------------------------------------------------------------
    Installing Graphical user interface component... done
    --------------------------------------------------------------------------------
    Installing Command line interface component... done
    --------------------------------------------------------------------------------
    Installing Graphical user interface component... done
    --------------------------------------------------------------------------------
    Installing Command line interface component... done
    --------------------------------------------------------------------------------
    Installing Graphical user interface component... done
    --------------------------------------------------------------------------------
    Installing Intel Fortran Compiler XE for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel C++ Compiler XE for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel Debugger for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel MKL core libraries for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) Xeon Phi(TM) coprocessor support component... done
    --------------------------------------------------------------------------------
    Installing Fortran 95 interfaces for BLAS and LAPACK for Intel(R) 64
    component... done
    --------------------------------------------------------------------------------
    Installing GNU* Compiler Collection support for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Cluster support for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel IPP single-threaded libraries for Intel(R) 64 component... done
    --------------------------------------------------------------------------------
    Installing Intel TBB component... done
    --------------------------------------------------------------------------------
    Installing GNU* GDB 7.5 on Intel(R) 64 (Provided under GNU General Public
    License v3) component... done
    --------------------------------------------------------------------------------
    Installing GDB Eclipse* Integration on Intel(R) 64 (Provided under Eclipse
    Public License v.1.0) component... done
    --------------------------------------------------------------------------------
    Installing Intel(R) MPI Benchmarks component... done
    --------------------------------------------------------------------------------
    Finalizing product configuration...
    --------------------------------------------------------------------------------
    Preparing driver configuration scripts... done
    --------------------------------------------------------------------------------
    Press "Enter" key to continue

    4. Test the installation

    The installer has completed, so now I simply need to test an MPI application and ensure basic functionality.

    [user01@test-2 ~]$ . ~/intel/composerxe/bin/compilervars.sh intel64
    [user01@test-2 ~]$ . ~/intel/impi/4.1.3.048/intel64/bin/mpivars.sh
    [user01@test-2 ~]$ mpiicc ~/intel/impi/4.1.3.048/test/test.c -o test
    [user01@test-2 ~]$ mpirun -n 2 -host `hostname -I` ./test
    Hello world: rank 0 of 2 running on test-2
    Hello world: rank 1 of 2 running on test-2

     

  • Développeurs
  • Professeurs
  • Étudiants
  • Linux*
  • Services Cloud
  • Serveur
  • Débutant
  • Intermédiaire
  • Outils de cluster
  • Compilateurs
  • Bibliothèque Intel® MPI Library
  • Intel® Cluster Ready
  • Interface de transmission de messages
  • Informatique cloud
  • Informatique en cluster
  • Serveur
  • URL
  • Pour commencer
  • Interpreting the Intel compiler's offload report


    During compilation and optimization, the Intel compiler can emit information about specific optimization phases when the user passes the "-opt-report-phase=phase" option. For the offload compilation model targeting the Intel® Xeon Phi™ coprocessor, the compiler provides the "offload" phase keyword, which reports the data transfers between the host and the target coprocessor.

     

    With the "-opt-report-phase=offload" option, the compiler generates two report sections for each offload region in the source: the first, beginning with "Offload to target MIC", comes from compiling the host code; the second, beginning with "Outlined offload region", comes from compiling for the target coprocessor.

     

    For example, consider the following code, "reduction.c":

     

      1 float reduction(float *data, int numberOf)

      2 {

      3   float ret = 0.f;

      4   int i;

      5   #pragma offload target(mic) in(data:length(numberOf))

      6   {

      7      #pragma omp parallel for reduction(+:ret)

      8      for (i=0; i < numberOf; ++i)

      9         ret += data[i];

     10   }

     11   return ret;

     12 }

     13

     

    $ icc -c -openmp -opt-report-phase=offload reduction.c
    reduction.c(5-5):OFFLOAD:reduction:  Offload to target MIC 1
     Data sent from host to target
           data_2_V$0, pointer to (<expr>) elements
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes
     Data received by host from target
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes

    reduction.c(5-5):OFFLOAD:reduction:  Outlined offload region
     Data received by target from host
           data_2_V$0, pointer to (<expr>) elements
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes
     Data sent from target to host
           i, scalar size 4 bytes
           numberOf_2_V$1, scalar size 4 bytes
           ret, scalar size 4 bytes

    From the compiler's report we can see that when the offload region at line 5 of the source executes in offload mode, the host first sends the following data to the coprocessor:

    1. The elements pointed to by "data", whose count is determined at run time from the length expression
    2. The scalar "i", 4 bytes long
    3. The scalar "numberOf", 4 bytes long
    4. The scalar "ret", 4 bytes long

    After the offload region finishes executing, the coprocessor sends the following data back to the host:

    1. The scalar "i", 4 bytes long
    2. The scalar "numberOf", 4 bytes long
    3. The scalar "ret", 4 bytes long

    The report also shows that because the pointer "data" is explicitly given the in transfer type, the data it points to is transferred only to the coprocessor and does not need to be copied back to the host. The other three scalar variables referenced in the offload region have no explicit transfer type, so by the implicit rules they are transferred in both directions.

    For more information on using the Intel compiler to develop programs for the Intel Xeon Phi coprocessor, see the relevant sections of the Intel compiler user and reference guide.

  • Intel Parallel Composer XE
  • Developers
  • Students
  • Linux*
  • C/C++
  • Fortran
  • Intermediate
  • Intel® Composer XE
  • Development Tools
  • Server
  • URL
  • Compiler Topics
  • Dual Video in Android* Applications Using Intel® WiDi Technology


    Download

    Download the dual video WiDi code samples [ZIP 112KB]

    This sample shows how to use the Presentation class to display video content on an external screen by means of Intel® WiDi technology. It also shows how to use a service to play the content on the external screen, which allows video playback to continue when another application is started on the device's main screen. Finally, it shows how to configure audio on Android devices based on Intel® processors to enable dual audio streams for video playback, or dual video combined with any other application that plays audio content.

    The Presentation Class with Video

    The Presentation class is used to create a dialog that displays content on an external screen. In this example, we will see how to show video content with it. When using the Presentation API to display content on an external screen via Intel WiDi technology, you need to select the appropriate display on which to present the content. The getSystemService function can be used to obtain a pointer to the DisplayManager object. With this object, you can obtain an array of all the external displays usable with the Presentation class by calling the getDisplays function with the DISPLAY_CATEGORY_PRESENTATION constant. Once you have the pointer to the display you want to send the presentation to, you can create an instance of the RemoteVideoPresentation class and call its show function to begin rendering content on the external display.

    private DisplayManager mDisplayManager;
    mDisplayManager = (DisplayManager)getSystemService(Context.DISPLAY_SERVICE);
    
    //Select the display
    Display[] displays = mDisplayManager.getDisplays(DisplayManager.DISPLAY_CATEGORY_PRESENTATION);
    for (Display display : displays)
    {
    	//Set up the Presentation class and show it
    	presentation = new RemoteVideoPresentation(this, display, video);
    	presentation.show();
    }

    Our RemoteVideoPresentation class extends Google's Presentation class and overrides three functions: onCreate, onStart, and onStop. OnCreate is called much like an Activity's OnCreate function. This is where we set up the layout that contains our VideoView and obtain a handle to the AudioManager.

    @Override
    protected void onCreate(Bundle savedInstanceState)
    {
    	super.onCreate(savedInstanceState);
    
    	mAudManager = (AudioManager)getContext().getSystemService(Context.AUDIO_SERVICE);
    	getWindow().setType(WindowManager.LayoutParams.TYPE_SYSTEM_ALERT);
    	setContentView(R.layout.activity_remote_video);
    	mVideoView = (VideoView) findViewById(R.id.remoteVideoView);
    }

    The onStart method is invoked after the object's creator calls its show function. This is where we start playing the video whose URI was passed to the constructor, on the external display that was also passed to the constructor. We configure the audio, set the VideoView's video URI, and then call its start function.

    @Override
    protected void onStart()
    {
    	super.onStart();
    	playVideo();
    }
    
    public void playVideo()
    {
    	if (mVideoView != null)
    	{
    		mVideoView.setVideoURI(mVideoUri);
    		int result = mAudManager.requestAudioFocus(afChangeListener,
    				// Use the music stream.
    				AudioManager.STREAM_MUSIC,
    				// Request permanent focus.
    				AudioManager.AUDIOFOCUS_GAIN);
    		if (result == AudioManager.AUDIOFOCUS_REQUEST_FAILED)
    		{
    			//Error
    		}
    		mAudManager.setParameters("bgm_state=true");
    
    		mVideoView.start();
    	}
    }

    Managing the Presentation Class with a Service

    Presentation dialogs do not have to be managed by a service, but if they are not, the dialog stops when the activity that created it stops. To allow video playback to continue on the external screen while switching applications on the local screen, a service has to create and manage the Presentation class used to play the video. This is also useful for playing two different video streams on the external and local screens. By creating the service as a class that extends the Service class, we can start the service as an Intent and stop it the same way. RemoteVideoService is the service we have extended from the base Service class.

    public void OnClickPlayRemoteVideo(View view)
    {
    	Intent serviceIntent = new Intent(this, RemoteVideoService.class);
    	serviceIntent.putExtra(RemoteVideoService.URI, mRemoteVideoUri);
    	startService(serviceIntent);
    	mRemoteStopButton.setVisibility(View.VISIBLE);
    	mRemoteStopButton.setClickable(true);
    }
    
    public void OnClickStopRemoteVideo(View view)
    {
    	Intent serviceIntent = new Intent(this, RemoteVideoService.class);
    	stopService(serviceIntent);
    
    	mRemoteStopButton.setVisibility(View.INVISIBLE);
    	mRemoteStopButton.setClickable(false);
    }

    In the service, we need to override four functions: onBind, onCreate, onDestroy, and onStartCommand. In the onBind function, we only need to return a new Binder object.

    @Override
    public IBinder onBind(Intent intent)
    {
    	return new Binder();
    }

    onCreate is similar to the other onCreate functions we have worked with in Activities. Here, however, no layout setup is needed because everything is handled in the Presentation class. We only set up the DisplayManager, since we will need it to select the external display on which to establish the Presentation.

    @Override
    public void onCreate()
    {
    	super.onCreate();
    	mDisplayManager = (DisplayManager)getSystemService(Context.DISPLAY_SERVICE);
    }

    onDestroy is called when the main Activity has stopped the service with the stopService function. We use it to call the Presentation's cancel function. As a consequence, the Presentation's onStop function will be called, allowing it to clean up after itself.

    @Override
    public void onDestroy()
    {
    	if (presentation != null)
    	{
    		presentation.cancel();
    	}
    	super.onDestroy();
    }

    The onStartCommand function is where we do most of the work. We set up a notification object and start the service in the foreground so the user has an item in the Android pull-down menu to navigate easily back to the application and control the service from the main Activity. This is also where we create an instance of our Presentation class to play video, obtain the URI that was passed to the service as a Parcel, and select the external display.

    @Override
    public int onStartCommand(Intent intent, int flags, int startId)
    {
    	CharSequence text = getText(R.string.app_name);
    	Intent startApp = new Intent(this, MainActivity.class);
    	PendingIntent pendingIntent = PendingIntent.getActivity(this, 0, startApp, 0);
    	Notification.Builder bld = new Notification.Builder(this);
    	Notification not = bld
    			.setSmallIcon(R.drawable.ic_launcher)
    			.setContentIntent(pendingIntent)
    			.setContentTitle(text)
    			.build();
    
    	startForeground(1, not);
    
    	Uri video = (Uri)intent.getParcelableExtra(URI);
    
    	//Select the display
    	Display[] displays = mDisplayManager.getDisplays(DisplayManager.DISPLAY_CATEGORY_PRESENTATION);
    	for (Display display : displays)
    	{
    		//Set up the Presentation class and show it
    		presentation = new RemoteVideoPresentation(this, display, video);
    		presentation.show();
    	}
    
    	return START_NOT_STICKY;
    }

    Dual Audio Streams

    On Android devices based on Intel processors, audio can be configured so that two separate audio streams play on the device's local speakers or headphones and on an external display capable of playing audio, such as an Intel WiDi or HDMI display. In essence, this allows an application developer to play video content with its audio on the external display while simultaneously playing a separate video (with audio) on the local display. Another example would be playing video externally while taking a phone call locally on the device. The code to do this in this sample is relatively simple, and it is all done in the Presentation class. When playing video, we need to configure the audio manager to use the music stream and set the bgm_state parameter to true.

    OnAudioFocusChangeListener afChangeListener = new OnAudioFocusChangeListener() {
    	public void onAudioFocusChange(int focusChange) {
    		if (focusChange == AudioManager.AUDIOFOCUS_LOSS_TRANSIENT) {
    
    		} else if (focusChange == AudioManager.AUDIOFOCUS_GAIN) {
    
    		} else if (focusChange == AudioManager.AUDIOFOCUS_LOSS) {
    			mAudManager.abandonAudioFocus(afChangeListener);
    		}
    	}
    };
    int result = mAudManager.requestAudioFocus(afChangeListener,
    		// Use the music stream.
    		AudioManager.STREAM_MUSIC,
    		// Request permanent focus.
    		AudioManager.AUDIOFOCUS_GAIN);
    if (result == AudioManager.AUDIOFOCUS_REQUEST_FAILED)
    {
    	//Error
    }
    mAudManager.setParameters("bgm_state=true");

    We also need to adjust the manifest xml file to indicate that our application will modify the audio settings.

    <uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS"/>

    This basic example shows how to add video or audio playback with Intel WiDi technology to an application, which lets users multitask and do many different activities locally without interrupting the playback of external content on Android devices with Intel processors. Happy coding!

    About the Author

    Gideon is part of Intel's Software and Services Group. He works with independent software vendors, helping them optimize their products for Intel® Atom™ processors. In the past, he worked on a team that wrote Linux* graphics drivers for platforms running the Android OS.

    Related Links

    Code samples for enabling Intel® WiDi dual screen: http://software.intel.com/es-es/intel-widi#pid-19198-1607
    Dual Screen Intel® WiDi application: http://software.intel.com/es-es/articles/dual-screen-intel-widi-application
    How to enable Intel® Wireless Display differentiation for Miracast* on an Intel® architecture phone: http://software.intel.com/es-es/articles/how-to-enable-intel-wireless-display-differentiation-for-miracast-on-intel-architecture

    To learn more about Intel tools for Android developers, visit the Intel® Developer Zone for Android.

  • applications
  • Intel® WiDi
  • Dual Video
  • Developers
  • Android*
  • Android*
  • Intermediate
  • User Experience and Design
  • Phone
  • Tablet
  • URL
  • Monte Carlo European Option Pricing with RNG Interface for Intel® Xeon Phi™ Coprocessor


    Download Available under the Intel Sample Source Code License Agreement license.

    Background

    Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. The name comes from the famous casino in the principality of Monaco, where a roulette table produces uncertain outcomes just like a series of random numbers. The contemporary version of the Monte Carlo algorithm was first used by Stanislaw Ulam while he was working on the Manhattan Project in the mid-1940s. Nicholas Metropolis was the first to make the connection between the casino and the algorithm, and he coined the term Monte Carlo to refer to any numerical simulation algorithm that involves a random number generator. John von Neumann was the first to implement Monte Carlo on the ENIAC computer in the late 1940s. Since then, Monte Carlo has been widely used in engineering, physics, and molecular dynamics, and in calculating integrals with complicated boundary conditions.

    In 1973, Fischer Black and Myron Scholes published their historic paper and introduced what later became known as the Black-Scholes option pricing model for financial derivatives. While the rest of the world was still trying to digest the Black-Scholes model, an actuarial professor from the University of British Columbia, Phelim Boyle, introduced the Monte Carlo method to finance and successfully used it as an alternative way to get the same result as the Black-Scholes model. In his article, he takes the example of a European call option and calculates its price using the Monte Carlo method.

    In this paper, we use the same numerical problem as an example to highlight various techniques and practices to achieve high performance computing on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.

    Code Access

    The Monte Carlo European Option with RNG interface is maintained by Shuo Li and is available under the BSD 3-Clause Licensing Agreement. The code supports the asynchronous offload of the Intel Xeon processor (referred to as “host” in this document) with the Intel Xeon Phi coprocessor (referred to as “coprocessor” in this document) in a single node environment.

    To access the code and test workloads:

    Go to the source location to download the MonteCarloRNGsrc.tar file.

    Build Directions

    Here are the steps you need to follow in order to rebuild the program:

    1. Install Intel® Composer XE 2013 SP2 on your system
    2. Source the environment variable script file compilervars.csh under /pkg_bin
    3. Untar the montecarlorng.tar file
    4. Issue the make command, unconditionally and silently, using the -Bs option:
    [prompt]$ make -Bs
    

    Run Directions

    Copy the following files to the Intel Xeon Phi coprocessor card.

    [prompt]$ scp MonteCarloRNGSP.knc yourhost-mic0:
    [prompt]$ scp MonteCarloRNGDP.knc yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/lib/mic/libiomp5.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so.2 yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_intel_lp64.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_sequential.so yourhost-mic0:
    [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_core.so yourhost-mic0:

    Turn on the turbo mode on your Intel Xeon Phi coprocessor card.

    [prompt]$ sudo /opt/intel/mic/bin/micsmc --turbo enable
    Information: mic0: Turbo Mode Enable succeeded.

    Invoke the binary and set the environmental variable for the execution from the host.

    [prompt]$ ssh yourhost-mic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloRNGSP.knc"
    Monte Carlo European Option Pricing Single Precision
    
    Compiler Version  = 14
    Release Update    = 2
    Build Time        = Jun  2 2014 12:22:43
    Path Length       = 262144
    Number of Options = 999912
    Block Size        = 16384
    Worker Threads    = 244
    
    Starting options pricing...
    Parallel simulation completed in 21.439754 seconds.
    Validating the result...
    L1_Norm          = 4.812E-04
    Average RESERVE  = 12.872
    Max Error        = 8.035E-02
    ==========================================
    Total Cycles = 28586338291
    Cyc/opt      = 28588.854
    Time Elapsed =   21.440
    Options/sec  = 46638.222
    ==========================================
    [prompt]$ ssh yourhost-mic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloRNGDP.knc"
    Monte Carlo European Option Pricing Double Precision
    
    Compiler Version  = 14
    Release Update    = 2
    Build Time        = Jun  2 2014 12:22:44
    Path Length       = 262144
    Number of Options = 999912
    Block Size        = 8192
    Worker Threads    = 244
    
    Starting options pricing...
    Parallel simulation completed in 47.075885 seconds.
    Validating the result...
    L1_Norm          = 4.812E-04
    Average RESERVE  = 12.920
    Max Error        = 8.034E-02
    ==========================================
    Total Cycles = 62767847297
    Cyc/opt      = 62773.371
    Time Elapsed =   47.076
    Options/sec  = 21240.429
    ==========================================
    

    The program priced about a million sets of option input data. Dividing 1 million among 244 threads gives 4,098.36; without losing generality, let's round the number of options each thread runs to 4,098, so the total number of options the program prices is 244 × 4,098 = 999,912. For each option, the program first generates a random number sequence that is independently and identically distributed, or i.i.d. for short. We cover the details of generating random number sequences in a later section. We then use these random numbers as samples of stock movement with the European payoff formula and calculate the stock value and confidence interval using the formulas shown in the next section.

    The program was built on the host and executes on the Intel Xeon Phi coprocessor. For each option data set, it calculates the option values and the confidence intervals. Result validation is part of the benchmark. It measures the average error between the calculated result and the result from Black-Scholes-Merton [2] formula.

    This benchmark runs on a single node of an Intel Xeon Phi coprocessor.  It can also be modified to run in a cluster environment. The program reports a latency measure in seconds spent in pricing activities and also a throughput measure using total number of options priced over the elapsed time, which was printed out as the last performance number in Options/sec.

    Generating and Using Random Numbers in Monte Carlo Methods

    Since Monte Carlo is a numerical method based on the simulation of random variables, the implementation of this algorithm starts with identifying a random number generator. VSL, the vector statistical library component of the Intel® Math Kernel Library (Intel® MKL), provides a variety of random number generators for different distributions. Our implementation uses the Mersenne Twister random number generator with the normal distribution. VSL is part of the Intel® C++ Composer XE 2013 that we use to build applications for Intel Xeon processors and Intel Xeon Phi coprocessors.

    Using Random Number Generators

    Inside VSL a random number sequence is identified as a stream. Each stream delivers random numbers from a given distribution through a vector interface. To manage the complexity, VSL uses two implementation layers to support different RNGs and different distributions. At the lower level, all core random number generation routines are implemented to deliver random numbers in a uniform distribution. At the higher level, transformation functions are applied to turn the uniform distribution into the distribution the user desires.

    To use VSL, follow this typical 5-step process:

    1. Specify RNG streams
    2. Initialize and create the random number streams
    3. Request a vector of random numbers in a specific distribution
    4. Consume the random number sequences in the simulation
    5. Destroy the RNG streams

    Here is how the process works for calculating our European Call options:

    1. Specify a random number stream. In our benchmark we are going to use all 61 cores and create 4 threads per core. In total, we can have 244 threads. To allow each thread to price an option independently, we should give each thread an independent random number stream. We need to declare an RNG state descriptor for each thread. We can declare these data structures before we create any threads.
       
      #include <mkl_vsl.h>
          // Declare random number buffer and random number sequence descriptors
          float *samples[MAX_THREADS];
          VSLStreamStatePtr Streams[MAX_THREADS];

      VSLStreamStatePtr is a C/C++ opaque data structure and Streams is an array of still uninitialized opaque data.

    2. Initialize and create the random stream and set up the stream with a basic random number generator and an integer seed.

      Once we have created worker threads, each thread will allocate its own buffer to receive the RNG sequences and initialize the stream descriptor so that it knows which basic random number generator to use and whether the threads need to work together to ensure mutual independence.

      samples[threadID] = (float *)scalable_aligned_malloc(RAND_BLOCK_LENGTH * sizeof(float), SIMDALIGN);
      vslNewStream(&(Streams[threadID]), VSL_BRNG_MT2203 + threadID, RANDSEED);

      Intel MKL provides the following routine to create and initialize the stream you declared:

      vslNewStream (VSLStreamStatePtr &Randomstream, int brng, int seed )

      Randomstream is a reference to the uninitialized random stream you just declared. It takes a reference because the routine passes back an initialized stream state descriptor. brng is an enumeration parameter specifying which basic random number generator to use. VSL_BRNG_MT2203 specifies a family of modified Mersenne Twister [10] pseudorandom generators that are mutually independent. In our problem, each thread uses its own generator, identified by VSL_BRNG_MT2203 plus its thread ID. You can find more information on the basic random number generators in the BRNG parameter definitions. seed is an integer seed for the random stream, which ensures reproducibility for debugging purposes.

    3. Request a vector of random numbers of a specific distribution. Using the random stream descriptor, we can call one of the distribution generators to produce a sequence of random numbers with a certain probability distribution and a specific data type. The result will be placed in a user-provided buffer in the form of a C array.
      float *rand = samples[threadID];
      vsRngGaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF, Streams[threadID], RAND_BLOCK_LENGTH, rand, MuByT, VBySqrtT);

      Intel MKL uses the following routine to generate normally distributed random numbers:

      vsRngGaussian(method, stream, n, r, a, sigma)

      where:
      method - the generation method, such as VSL_RNG_METHOD_GAUSSIAN_ICDF; other values are listed here
      stream - an initialized random stream
      n - the number of random numbers requested
      r - address of the receiving buffer, usually a C array declared to hold the random numbers
      a - first parameter of the distribution; for the normal distribution, the mean
      sigma - second parameter of the distribution; for the normal distribution, the standard deviation

    4. Consume the sequence of random numbers.

      for(int i=0; i < RAND_BLOCK_LENGTH; i++)
          {
              float callValue  = Y * exp2f(rand[i]) - Z;
              callValue = (callValue > 0) ? callValue : 0;
              v0 += callValue;
              v1 += callValue * callValue;
          }
    5. Delete the random stream.

      Use vslDeleteStream (VSLStreamStatePtr &stream) to delete the stream declared in step 1. Since we created the stream in the worker threads, it’s customary to destroy the stream in the worker thread.

    Other Implementation Notes

    In our implementation, worker threads are created using OpenMP* parallel directives. Each thread creates its own random number streams, generates the unique option input data, prices the option, and then validates the result. Each thread’s input data is generated by calling C runtime library rand_r in a unique sequence identified by its thread ID, which guarantees each thread will produce a unique and reproducible sequence.

    The aligned memory allocation interface from Intel® Threading Building Blocks (Intel® TBB) is used to allocate aligned memory blocks that are also cache-friendly for the worker threads. This means these memory blocks have to be disposed of using the corresponding API. Intel TBB is part of our minimum build requirement.

    OpenMP reduction operations are used to calculate the statistical properties of all the options. It’s also used to find the maximum error from the threads.

    Source Code for MonteCarloRNG Core

    The following is a core part of MonteCarloRNG using single precision data types. Double precision is almost identical.

    // Declare random number buffer and random number sequence descriptors
    float *samples[MAX_THREADS];
    VSLStreamStatePtr Streams[MAX_THREADS];
    
    // calculate the block number based on block size
    const int nblocks = RAND_N/RAND_BLOCK_LENGTH;
    
    #pragma omp parallel reduction(+ : sum_delta) reduction(+ : sum_ref) reduction(+ : sumReserve) reduction(max : max_delta)
    {
    
    #ifdef _OPENMP
        int threadID = omp_get_thread_num();
    #else
        int threadID = 0;
    #endif
        unsigned int randseed = RANDSEED + threadID;
        srand(randseed);
    float *CallResultList     = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *CallConfidenceList = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *StockPriceList     = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *OptionStrikeList   = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *OptionYearsList    = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN);
    for(int i = 0; i < OPT_PER_THREAD; i++)
    {
        CallResultList[i]     = 0.0f;
        CallConfidenceList[i] = 0.0f;
        StockPriceList[i]     = RandFloat_T(5.0f, 50.0f, &randseed);
        OptionStrikeList[i]   = RandFloat_T(10.0f, 25.0f, &randseed);
        OptionYearsList[i]    = RandFloat_T(1.0f, 5.0f, &randseed);
    }
    
    samples[threadID] = (float *)scalable_aligned_malloc(RAND_BLOCK_LENGTH * sizeof(float), SIMDALIGN);
    vslNewStream(&(Streams[threadID]), VSL_BRNG_MT2203 + threadID, RANDSEED);
    
    #pragma omp barrier
    if (threadID == 0)
    {
    printf("Starting options pricing...\n");
        sTime = second();
        start_cyc = _rdtsc();
    }
    
    for(int opt = 0; opt < OPT_PER_THREAD; opt++)
    {
        const float VBySqrtT = VLog2E * sqrtf(OptionYearsList[opt]);
        const float MuByT    = MuLog2E * OptionYearsList[opt];
        const float Y        = StockPriceList[opt];
        const float Z        = OptionStrikeList[opt];
    
        float v0 = 0.0f;
        float v1 = 0.0f;
        for(int block = 0; block < nblocks; ++block)
        {
            float *rand = samples[threadID];
            vsRngGaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF, Streams[threadID], RAND_BLOCK_LENGTH, rand, MuByT, VBySqrtT);
    
    #pragma vector aligned
    #pragma simd reduction(+:v0) reduction(+:v1)
    #pragma unroll(4)
            for(int i=0; i < RAND_BLOCK_LENGTH; i++)
            {
                float callValue  = Y * exp2f(rand[i]) - Z;
                callValue = (callValue > 0) ? callValue : 0;
                v0 += callValue;
                v1 += callValue * callValue;
            }
        }
        const float  exprt      = exp2f(RLog2E*OptionYearsList[opt]);
        CallResultList[opt]     = exprt * v0 * INV_RAND_N;
        const float  stdDev     = sqrtf((F_RAND_N * v1 - v0 * v0) * STDDEV_DENOM);
        CallConfidenceList[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    } //end of opt
    
    #pragma omp barrier
    if (threadID == 0) {
        end_cyc = _rdtsc();
        eTime = second();
    printf("Parallel simulation completed in %f seconds.\n", eTime-sTime);
    printf("Validating the result...\n");
    }
    
    double delta = 0.0, ref = 0.0, L1norm = 0.0;
    int max_index = 0;
    double max_local  = 0.0;
    for(int i = 0; i < OPT_PER_THREAD; i++)
    {
        double callReference, putReference;
        BlackScholesBodyCPU(
            callReference,
            putReference,
            StockPriceList[i],
            OptionStrikeList[i], OptionYearsList[i],  RISKFREE, VOLATILITY );
            ref   = callReference;
            delta = fabs(callReference - CallResultList[i]);
            sum_delta += delta;
            sum_ref   += fabs(ref);
            if(delta > 1e-6)
                 sumReserve += CallConfidenceList[i] / delta;
            max_local = delta>max_local? delta: max_local;
    }
    max_delta = max_local>max_delta? max_local: max_delta;
    vslDeleteStream(&(Streams[threadID]));
    scalable_aligned_free(CallResultList);
    scalable_aligned_free(CallConfidenceList);
    scalable_aligned_free(StockPriceList);
    scalable_aligned_free(OptionStrikeList);
    scalable_aligned_free(OptionYearsList);
    
    }//end of parallel block
    
    

    Appendix

    About the Author

    Shuo Li works for the Intel Software and Service Group. His main interests are parallel programming and application software performance. In his recent role as a staff software performance engineer covering the financial service industry, Shuo works closely with software developers and modelers and helps them achieve high performance with their software solutions.

    Shuo holds a Master's degree in Computer Science from the University of Oregon and an MBA degree from Duke University.

    References and Resources

    [1]Option Pricing: A Simplified Approach (1979) by John C. Cox, Stephen A. Ross, and Mark Rubinstein

    [2]Theorie de la Speculation, Annales Scientifiques de l´ Ecole Normale Sup´erieure, 21–86. Bachelier, L. (1900). reprinted 1995 Editions Jacques Gabay

    [3]Hull, John C., Options, Futures, and Other Derivatives, 7th Edition, Prentice-Hall, 2009

    [4]Wilmott, P., Derivatives: The Theory and Practice of Financial Engineering. Chichester: Wiley, 1998

    [5]Cox, J. C. Ross, S. A. and Rubinstein, M. Option Pricing: A simplified Approach Journal of Financial Economics 7 (October 1979): 229-64

    [6]Black, F., and M. Scholes, The Pricing of Options and Corporate Liabilities Journal of Political Economy, 81(May/June 1973): 637-59

    [7]Merton, R. C. Theory of Rational Option Pricing, Bell Journal of Economics and Management Science, 4(Spring 1973): 141-83

    [8]Boyle, P. P., Options: A Monte Carlo Approach Journal of Financial Economics, 4 (1977) 323-38

    [9]Black, Fischer and Scholes, Myron The Pricing of Options and Corporate Liabilities (May-Jun 1973)

    [10]Matsumoto, M., and Nishumira T. Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, Pages 3-30, January 1998

    [11]Intel Xeon processor: http://www.intel.com/content/www/us/en/processors/xeon/xeon-processor-e7-family.html

    [12]Intel Xeon Phi coprocessor:  https://software.intel.com/en-us/articles/quick-start-guide-for-the-intel-xeon-phi-coprocessor-developer

    License

    Intel sample source is provided under the Intel Sample Source License Agreement.

     

    Notices

    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

    UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

    Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

    The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

    Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

    Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

    Intel, the Intel logo, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

    Copyright © 2014 Intel Corporation. All rights reserved.

    *Other names and brands may be claimed as the property of others.

    Attachment: MonteCarloRNG.tar (30 KB)

    Determining the Idle Power of an Intel® Xeon Phi™ Coprocessor


    Abstract

    This document gives platform designers, thermal engineers, hardware engineers, and computer architects instructions on how to acquire idle power readings from the Intel® Xeon Phi™ coprocessor.

    There are two access methods by which the server management and control panel component may obtain status information from the Intel® Xeon Phi™ coprocessor. The “in-band” method utilizes the Symmetric Communications Interface (SCIF), the capabilities designed into the coprocessor OS, and the host driver to deliver the Intel® Xeon Phi™ coprocessor status. It also provides a limited ability to set specific parameters that control hardware behavior. An example application using this interface would be ‘micsmc’, which is provided with the Intel® Manycore Platform Software Stack (Intel® MPSS), which reads the power in-band.

    The same information can be obtained using an “out-of-band” method. This method starts with the same capabilities in the coprocessor OS, but sends the information to the System Management Controller (SMC) using a proprietary protocol. With this method, the coprocessor idle power measurements can be made without waking up the card.

    The Intel® Xeon Phi™ coprocessor communicates with the baseboard management controller (BMC) or peripheral control hub (PCH) over the System Management Bus (SMBus) using the standard Intelligent Platform Management Bus (IPMB) protocol.  The SMC responds to queries from the platform's BMC using the Intelligent Platform Management Interface (IPMI).  Through the Inter-Integrated Circuit (I2C) interface, the SMC can communicate with the Intel® Xeon Phi™ coprocessor and with sensors located on the PCIe card.

    Figure 1: Intel® Xeon Phi™ Coprocessor Board Schematic

    For Intel® Xeon® processor E5-2600 v3 product family platforms, the Intel® ME can provide access to the SMC on the Intel® Xeon Phi™ coprocessor with very little effort, and can read the idle power from the SMC.

    For Intel® Xeon® processor E5 V2 family platforms, the Intel® ME does not have a mechanism for reading power from the Intel® Xeon Phi™ coprocessor and so relies on the BMC to provide sensor information.  This, however, requires that the BMC implement some mechanism for communicating with the Intel® Xeon Phi™ coprocessor, either via special OEM commands or through a bridging mechanism.

    Figure 2: Example of a topology where the BMC is connected to Intel® Xeon Phi™ coprocessors

    Bridging, Channels, and OEM Commands


    Unlike the sensors that can be accessed via the BMC's SDR, the sensors of the Intel® Xeon Phi™ coprocessor are abstracted behind a different I2C bus.  In order to access these sensors, the user needs to be familiar with the I2C network diagram and the mechanism for accessing the bus.  The sensors might also need to be exposed via special BMC OEM commands or with a third-party vendor's help.  You can find more details about how to do this in the Intel® Xeon Phi™ Coprocessor Datasheet.

    The example scripts were tested on both an Intel® Xeon® processor E7 V2 server product with four 7120A Intel® Xeon Phi™ Coprocessors, and an Intel® Xeon® processor E5-2600 v3 server product with two 7120A Intel® Xeon Phi™ Coprocessors.

    On these platforms, the BMC has implemented several OEM commands that provide reverse PCIe SMBus proxy.  On the Intel® Xeon® processor E5 V2 family* platforms, the format is below:

    Table 1: Get MIC Card Info Command (30h E3h)

    Net Function = Software Development Kit (SDK) General Application (0x30)

    Code

    Command

    Request, Response Data

    Description

    E3h

    Get MIC Card Info

    Request

    *Byte 1:3 - Intel Manufacturer ID – 000157h, LS byte first
    Byte 4 - Card instance (1-based) for which information is requested. If this byte is zero only the total number of cards detected will be returned.

    This command returns information about management-capable PCIe* cards that are discovered by Intel® ME, including protocol support and addressing information that can be used in MIC IPMB Request command.

    E3h

    Get MIC Card Info

    Response

    Byte 1 – Completion Code

    = 00h – Success

    = CBh “Requested sensor, data, or record not present” the requested card instance is greater than the number of cards detected.

    Byte 2:4 – Intel Manufacturer ID – 000157h, LS byte first. The following bytes are only returned if there are any management-capable cards detected by the Intel® ME.

    Byte 5 – Total number of MIC devices detected. The following bytes are only returned if the specified management-capable card is detected by the BMC.

    Byte 6 – Command Protocol Detection Support

    [7:4] – Reserved

    [3]  - MCTP over SMBus

    [2] – IPMI on PCIe* SMBus (refer to IPMI 2.0 spec)

    [1] - IPMB

    [0] – Unknown

    A value of 1b indicates detection of a protocol is supported. Support for detection of specific protocols is OEM specific.

    NOTE: Intel® ME firmware for Grantley only supports detection of IPMB

    Byte 7 – Command Protocols Supported by Card

    [7:4] – Reserved

    [3] – MCTP over SMBus

    [2] – IPMI on PCIe* SMBus

    [1] - IPMB

    [0] – Unknown

    Byte 8 – Address/Protocol/Bus#

    [7:6] Address Type

    00b – Bus/Slot/Address

    Other values reserved

    [3:0] Bus Number – Identifies SMBus interface on which the MIC device was detected

    Byte 9 - Slot Number – identifies PCIe* slot in which the MIC device is inserted.

    Byte 10 - Slave Address - the I2C slave address (8 bit “write” address) of the MIC device

    This command returns information about management-capable PCIe* cards that are discovered by Intel® ME, including protocol support and addressing information that can be used in MIC IPMB Request command.

    * Note that this changed to match the E8h command in later versions of the document. Also, the Intel Manufacturer ID – 000157h is removed from the request and response parts of the command below.

    On the Intel® Xeon® processor E5-2600 v3 server products family and Intel® Xeon® processor E7 V2 family, the format is slightly different:

    Table 2: Get MIC Card Info Command (30h E8h)

    Net Function = SDK General Application (0x30)

    Code

    Command

    Request, Response Data

    Description

    E8h

    Get MIC card Info

    Request

    Byte 1 - Card instance (1-based) for which information is requested. If this byte is zero only the total number of cards detected will be returned.

    This command returns information about management-capable PCIe* cards that are discovered by the BMC, including protocol support and addressing information that can be used in the MIC card IPMB Request command.
    Note: E8h is the default value; it may be configured in spsFITC.

    Response

    Byte 1 – Completion Code
    =00h – Success
    =CBh “Requested sensor, data, or record not present” the requested card instance is greater than the number of cards detected.

    The following bytes are only returned if there are any management-capable cards detected by the Intel® ME.
    Byte 2 – Total number of MIC devices detected.

    The following bytes are only returned if the specified management-capable card is detected by the BMC.

     

     

    Response

    Byte 3 – Command Protocol Detection Support
    [7:4] – Reserved
    [3]  - MCTP over SMBus
    [2] – IPMI on PCIe* SMBus (refer to IPMI 2.0 spec)
    [1] - IPMB
    [0] – Unknown

    A value of 1b indicates that detection of a protocol is supported.  Support for detection of specific protocols is OEM specific.

    NOTE: Intel® ME firmware for the Intel® Xeon® processor E7 V2 family only supports detection of IPMB

    Byte 4 – Command Protocols Supported by Card
    [7:4] – Reserved
    [3] – MCTP over SMBus
    [2] – IPMI on PCIe SMBus
    [1] - IPMB
    [0] – Unknown

    Byte 5 – Address/Protocol/Bus#
    [7:6] Address Type
    00b – Bus/Slot/Address
    Other values reserved
    [3:0] Bus Number – Identifies SMBus interface on which the MIC card was detected.
    Byte 6 - Slot Number – Identifies PCIe* slot in which the MIC device is inserted.
    Byte 7 - Slave Address - The I2C slave address (8-bit “write” address) of the MIC device.

     

    The first step is to determine how many Intel® Xeon Phi™ coprocessors are in the system.  Once that is known, the bus number, slot number, and slave address of each Intel® Xeon Phi™ coprocessor need to be determined.  The bus number identifies the SMBus interface on which the Intel® Xeon Phi™ coprocessor was detected.  The slot number identifies the PCIe slot into which the Intel® Xeon Phi™ coprocessor is inserted.  Finally, the slave address is the I2C slave address of the SMC on the Intel® Xeon Phi™ coprocessor. With this information, commands can be sent directly to the Intel® Xeon Phi™ coprocessor according to the commands in the Intel® Xeon Phi™ Coprocessor Datasheet, section 6.6.3.

    Perl is a scripting language well suited to automating complicated IPMI commands with IPMItool.  In the example Perl subroutines below, IPMItool is used to send PCIe slot commands and determine how many cards are on the system:

    sub Read_PCIe_smbus_slot_card_info {
        # Uses the global $eX_cmd ("e3" or "e8") selected for this platform
        my $str0 = "ipmitool raw 0x30 0x".$eX_cmd." 0x00";
        #printf ("PCIe SMbus slot card info request: $str0\n");
        my $str1 = `$str0`;
        #If the response is empty, the e3 command isn't implemented; try e8,
        #used by the Intel(R) Xeon(R) processor E7 V2 and E5-2600 v3 products
        if (substr($str1,1,1) eq "") {
            $eX_cmd = "e8";
            $str0 = "ipmitool raw 0x30 0x".$eX_cmd." 0x00";
            #printf ("PCIe SMbus slot card info request: $str0\n");
            $str1 = `$str0`;
        }
        if (substr($str1,1,1) eq "") {
            die("\nThe BMC on your platform does not support the PCIe Slot SMBus Slot Command. Please consult with your BMC vendor. This program will now quit.\n");
        }
        my $count_KNC = substr($str1,1,3);
        #printf ("$count_KNC\n");
        return $count_KNC;
    }

     

    Next the Intel® Xeon Phi™ coprocessor’s addressing parameters can be determined with the following command:

    sub Read_PCIe_smbus_slot_card {
        my ($key) = @_;
        # Note the space separating the command byte from the card instance
        my $str0 = "ipmitool raw 0x30 0x".$eX_cmd." ".$key;
        #printf "$str0\n";
        my $str1 = `$str0`;
        #printf "$str1\n";
        my $bus_num = substr($str1, 4, 2);
        my $slot_num = substr($str1, 13, 2);
        my $slave_address = substr($str1, 16, 2);
        return ($bus_num, $slot_num, $slave_address);
    }

    Now with the communication parameters of each Intel® Xeon Phi™ coprocessor, the BMC needs to provide a way to send the command to the SMC itself.  On the Intel® platforms that were just mentioned, this can be done using the Slot IPMB command below:

    Table 3: Slot IPMB Command (3Eh 51h)

    Net Function = SDK General Application (3Eh)

    Code

    Command

    Request, Response Data

    Description

    51h

    Slot IPMB

    Request
    Byte 1
    [7:6] – Address Type
    =00b – Bus/Slot/Address
    =01b – Reserved for Unique identifier
    [5:4] Reserved
    [3:0] Bus Number. Set to 0 for “Address Type” not
    “Bus/Slot/Address”
    Byte 2 - Slot Number – identifies PCIe slot in which the MIC device is inserted. Set to 0 if “Address Type” is not
    “Bus/Slot/Address”

    Byte 3 – Identifier/Slave-address. This byte holds either the unique ID or the slave address (8 bit “write” address), dependent
    on the “Address Type” field.
    Byte 4 – Net Function
    Byte 5 – IPMI Command
    Byte 6:n – Command Data (optional)

    This command is used for sending IPMB commands to a MIC device and can be used by the BMC to communicate with Intel® Xeon Phi™ devices. It may be sent at any time. If the MIC is accessed via a MUX, the command handler will block the MUX until a response is received or an IPMB timeout occurs. To reduce the impact of a nonresponsive card on access to other slots, a specific implementation might shorten the IPMB timeout and/or limit the retry mechanism for all slot accesses (both proxy and nonproxy) when a MUX is used. If a card behind the MUX consistently fails to respond in a reasonable time, it should be treated as a defect and needs to be root-caused and fixed. An additional recommended action is to remove the non-responding card's slot from any polling routines until the next system reset, power cycle, or PCIe hot-plug event for that slot.

     

     

    Response
    Byte 1 – Completion Code
    =00h – Normal
    =c1h – Command not supported on this platform.
    =c7h – Command data invalid length.
    =c9h – Parameter not implemented or supported.
    =82h – Bus error.
    =85h – Invalid PCIe slot number.
    Byte 2 – Reading Type
    Byte 2:n – Response Data

     

    Here a command is sent to the card in this way:

    Request:

    [intel]$ ipmitool raw 0x3e 0x51 0x02 0x96 0x30 0x06 0x1

    Response (see below for the explanation):

                   (00) 00 00 00 01 16 02 0f 57 01 00 60 00 d6 13 00 00

    The response is broken down below.  For more details, see Table 4 after the explanation.

    The byte in parentheses will not be shown in the response.  It is a successful completion code from the command 0x3E 0x51 which is not displayed by IPMItool.

    The first byte represents the completion code 00h from the execution of the bridged Get Device ID command which normally would not be shown unless there is an error. It is displayed here because the Slot IPMB command simply returns the full content of the response without parsing the completion code. Byte 2 is the device ID (00h for unspecified). Byte 3 is the device revision (00h in this case and also indicating that the device does not provide device SDRs).

    Byte 4 refers to the Firmware Revision 1 (01h indicates a Major Firmware Revision of 1 and normal operation). Byte 5 refers to the Firmware Revision 2, which is BCD-encoded (16h indicates .16). These two bytes combined correspond to the SMC firmware's revision, in this case 1.16.

    Byte 6 refers to the IPMI version (02h indicates 2.0).  Byte 7 represents Additional Device Support (0Fh means that the device supports a FRU, SEL, SDR, and sensor devices). Bytes 8 – 10 are the manufacturer ID, LS byte first (000157h means Intel’s manufacturer ID).

    Bytes 11-12 are the Product ID, LS byte first (0060h). Bytes 13-16 stand for the auxiliary firmware revision information (D6130000h).
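To make the byte-by-byte breakdown above concrete, here is an illustrative Python parser for that response string. The function and field names are additions for this paper, not part of the article's Perl scripts; the offsets follow the Get Device ID layout just described.

```python
# Illustrative parser for the bridged Get Device ID response shown above.
# Multi-byte fields (manufacturer ID, product ID) are LS byte first.
def parse_get_device_id(resp_hex):
    b = [int(tok, 16) for tok in resp_hex.split()]
    return {
        "completion_code": b[0],
        "device_id": b[1],
        "device_revision": b[2],
        "fw_major": b[3] & 0x7F,   # low 7 bits; bit 7 would mean device busy
        "fw_minor": b[4],          # BCD-encoded: 0x16 -> minor revision .16
        "ipmi_version": b[5],      # 0x02 -> IPMI 2.0
        "additional_device_support": b[6],
        "manufacturer_id": b[7] | (b[8] << 8) | (b[9] << 16),  # LS byte first
        "product_id": b[10] | (b[11] << 8),
    }

info = parse_get_device_id("00 00 00 01 16 02 0f 57 01 00 60 00 d6 13 00 00")
# manufacturer_id is 0x000157 (Intel), product_id is 0x0060, firmware 1.16
```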

    Table 4: Get Device ID Command

    The command above is a simple “Get Device ID” command which is common among most BMC and other IPMI devices.  Other commands can be sent to the SMC in a similar way.

    Searching the SMC’s SDR for Sensor Information and Calculating the Idle Power from the “avg_power1” Sensor


    The sensor names in the SMC's SDR are static and do not change from release to release; the sensor numbers, however, are not guaranteed to be.  They may change in future releases, so it is a good idea to query the BMC each time the BMC or SMC firmware changes.  Doing this requires a few steps and the construction of some simple subroutines.  The “for” loop below shows how to send multiple IPMI commands in order to get some basic information out of the card:

    for (1)
    {
        #e3 is the PCIe SMBus Slot card command for the Intel(R) Xeon(R) processor E5-4600 v2 product family platforms;
        #e8 is for Intel(R) Xeon(R) processor E5-2600 v3 product family and Intel(R) Xeon(R) processor E7 V2 family
        #my $eX_cmd = "e3";
        my ($count_KNC1) = Read_PCIe_smbus_slot_card_info($eX_cmd);
        #printf "$count_KNC1\n";
        my $count_KNC = substr($count_KNC1,1,3);
        printf "The number of Intel(R) Xeon Phi(TM) coprocessors is $count_KNC\n";
    
        printf("Which Intel(R) Xeon Phi(TM) coprocessor PCIe card would you like to query? (0..n) ");
        my $key = getc(STDIN);
        $key +=1;
        if ($key > $count_KNC) {
            $key = 1;
        }
        my $key2 = $key - 1;
    
        my ($bus_num, $slot_num, $slave_address) = Read_PCIe_smbus_slot_card($key);
        printf "\nFor Intel(R) Xeon Phi(TM) coprocessor PCIe card#$key2:\nThe Bus Number is 0x$bus_num, Slot Number is 0x$slot_num, Slave Address is 0x$slave_address\n";
        (my $count_SDR) = Read_SMC_SDR_Repository($bus_num, $slot_num, $slave_address);
        printf "\nThe SMC's SDR Repository has 0x".$count_SDR." records\n";
        my $count_SDR_dec = hex($count_SDR);
    
        for (my $i=0; $i < $count_SDR_dec; $i++ ){
            Scan_SMC_SDR_Repository_for_Idle_Power ($key, $i, $bus_num, $slot_num, $slave_address);
        }
    
    }

     

    The first command calls the “Read_PCIe_smbus_slot_card_info()” subroutine to get the number of Intel® Xeon Phi™ coprocessors, and then asks the user which card they want to read.  The “Read_PCIe_smbus_slot_card()” subroutine is then called to get the Intel® Xeon Phi™ coprocessor’s bus number, slot number, and slave address.

    Next the “Read_SMC_SDR_Repository()” subroutine is called to find the number of SDR records on the SMC:

    sub Read_SMC_SDR_Repository {
        my ($bus_num, $slot_num, $slave_address) = @_;
        # 0x0a 0x20 = Storage NetFn, Get SDR Repository Info command
        my $str0 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x20";
        #printf "$str0\n";
        my $str1 = `$str0`;
        #printf "$str1\n";
        my $count_SDR = substr($str1, 7, 2);
        return $count_SDR;
    }

    Here is the structure of that command and the output (based on one particular Intel® Xeon Phi™ coprocessor):

    Request:

    ipmitool raw 0x3e 0x51 0x02 0x96 0x30 0x0a 0x20

    Response:

     00 51 1c 00 00 00 00 00 00 00 00 00 00 00 01

    The third byte returns the number of records present in the SDR repository.  In this example, the card has 1Ch, or 28, records.

    Next the “Scan_SMC_SDR_Repository_for_Idle_Power()” subroutine makes more IPMItool calls to read the sensor value.  A “for” loop calls this subroutine up to 28 times until the desired sensor is found, in this case, “avg_power1”.  This sensor is the sum of the three power sensors on the card and is averaged over time window 1, so it is a good indicator of the card’s power.

    The subroutine below is broken down into parts:

    sub Scan_SMC_SDR_Repository_for_Idle_Power {
        #my $SDR_no=@_[0];
        my ($key, $SDR_no, $bus_num, $slot_num, $slave_address)=@_;
        my $str_sdr = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x23 0x00 0x00 ".$SDR_no." 0x00 0x07 0x0f";
        my $sensor = `$str_sdr`;
        #printf ("First part of the SDR $SDR_no is $sensor\n");
        my $sensor_no = substr($sensor,10, 2);

    These first few commands involve reading the SDR record and finding out the contents from byte 07h until byte 0Fh.  There are different types of sensor data records, but the most common one is Type 01h, for a Full Sensor Record.  The first 8 bytes of the SDR description are shown below:

    Table 5: Full Sensor Record - SDR Type 01h (First 8 Bytes)

    Byte 8 gives the sensor number, which can be used to match it up with the sensor name.

        my $str0 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x23 0x00 0x00 ".$SDR_no." 0x00 0x2e 0xff";
        #printf "$str0\n";
        my $str1 = `$str0`;
        #printf ("Second part of the SDR $SDR_no is $str1\n");
        my $str2 = substr($str1, 15, length($str1));
        #printf "The name of the SDR is in ASCII here: $str2\n";
        $str2 =~ s/\s+//g;
        #printf "Remove spaces: $str2\n";
        my $str3= hex_to_ascii($str2);
        #printf "The sensor name of SDR#$SDR_no is '$str3' (Sensor# 0x$sensor_no)\n";
    #Other sensors can be substituted for "avg_power1" if it is desired to poll a different sensor
        if ($str3 eq "avg_power1") {
            #printf("Entered if comparison\n");
            my $str4 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x04 0x2d 0x".$sensor_no;

    The above code reads the SDR record from byte 2Eh until the end of the record. At byte 31h, or 49 in decimal, the name of the sensor is coded in ASCII character codes.  Here the bytes are saved into a variable, converted from ASCII codes into characters, put into a string, and then compared to “avg_power1”.  If there is a match, the right sensor has been found.

     

    Table 6: Full Sensor Record - SDR Type 01h (ID String Bytes)

     

    The next step is to convert the raw data into something easily understood:

        # Get the M, B, Accuracy, Accuracy Exp, R exp, and B Exp for the SDR formula
        my $str6 = "ipmitool raw 0x3e 0x51 0x".$bus_num." 0x".$slot_num." 0x".$slave_address." 0x0a 0x23 0x00 0x00 ".$SDR_no." 0x00 0x18 0x06";
        my $str7 = `$str6`;
        # Only the M value seems to be needed for the formula; the other values can be ignored
        my $M = substr($str7, 11, 2);
        #printf("M is $M\n");

    The values for the ‘y=mx+b’ reading conversion are determined by reading bytes 25 - 30. M can be read from byte 25 and parts of byte 26.  Typically, in power sensors, only the M value is significant (reading the SDR at these bytes reveals 02h, which means to multiply the decimal value from the sensor by 2).

    Table 7: Full Sensor Record - SDR Type 01h (M, B, Accuracy, R exp & B exp)

    In the last several lines of the code below, these parameters are then used to send an IPMItool command to read the sensor:

        my $key2 = $key - 1;
            while (1){
                my $str5 = `$str4`;
                #printf ("The Sensor value of SDR#$SDR_no is $str5");
                printf ("Sensor '$str3' (0x$sensor_no) is $str5\n");
                my $str6 = substr($str5, 4, 2);
                #printf ("String6 is $str5\n");
                my $dec_num = hex($str6);
                #printf "dec_num is $dec_num\n";
                my $idle_pw = $dec_num * $M;
                printf("Intel(R) Xeon Phi(TM) coprocessor PCIe card#$key2:\nThe Bus number is 0x$bus_num, Slot Number is 0x$slot_num, Slave Address is 0x$slave_address: Power is $idle_pw W\n");
                sleep 1;
            }
        }
    }

    Request:

    [intel]$ ipmitool raw 0x3e 0x51 0x02 0x96 0x30 0x04 0x2d 0x19

             

    Response:                                               

     00 08 00 00

    The first byte is the completion code; 00 means that the command executed successfully.  The second byte is the raw sensor value.  The subroutine converts the value from hex to decimal, multiplies it by a factor of 2, and then prints the calculated value to the screen along with the Intel® Xeon Phi™ coprocessor’s number, bus number, slot number, and slave address. To avoid overloading the bus, the subroutine waits approximately 1 second and then reads the sensor again.
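The same conversion can be sketched in a few lines of Python (the helper name is hypothetical; the M factor of 2 comes from the SDR read earlier):

```python
# Minimal sketch of the reading conversion described above: take the
# second byte of the Get Sensor Reading response (the raw value) and
# multiply by the M factor extracted from the SDR (2 for avg_power1).
def sensor_power_watts(resp_hex, m_factor=2):
    raw = int(resp_hex.split()[1], 16)  # byte 2 = raw sensor reading
    return raw * m_factor

watts = sensor_power_watts("00 08 00 00")  # 0x08 * 2 = 16 W idle power
```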

    Conclusion


    Several factors can keep the Intel® Xeon Phi™ coprocessor idle power higher than expected. Here are some tips to reduce energy consumption:

    • Intel® MPSS must be running in order to put the card in PC3 or PC6
    • Power management is handled by Intel® MPSS and the coprocessor OS running on the card
    • ‘micsmc’ will wake the card out of PC6, so micsmc must be shut down to allow the card to enter the PC6 idle state
    • Shutting down the virtual interface on the host platform will prevent the card from being woken up by pings to the card.  Use the command “ifdown micN” where N represents the Intel® Xeon Phi™ Coprocessor number
    • Always run the latest SMC firmware to make sure that your card supports power management (Note: Not all SKUs support all PC states)

    A few steps need to be followed in order to run the sample Perl script described in this white paper.  Here are instructions for doing this on Red Hat*:

    [intel]$ yum install perl

    For SuSE*, use YaST in GUI mode.  From the command line, use “rug” if using SuSE* 10.1, or “zypper” if using 10.3. Please check SuSE* documentation for more details.

    Once Perl is installed, enter the following command:

    [intel]$ perl -MCPAN -e shell

    Then at the new prompt:

    cpan> install String::HexConvert

    The Perl script and subroutines can be modified to read other sensors on the SMC if so desired.  Please check the Intel® Xeon Phi™ Coprocessor Datasheet for sensor names.  Also check the M, B, Tolerance, Accuracy, Accuracy exp, R exp, and B exp parameters from the SDR record when looking at other sensors.
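For those other sensors, the full linear conversion defined by the IPMI 2.0 specification may be needed: reading = (M × raw + B × 10^Bexp) × 10^Rexp. A small sketch (the helper name is illustrative, not from the article's scripts):

```python
# General linear sensor conversion per IPMI 2.0.  For the avg_power1
# sensor in this paper, M=2 and B, B exp, and R exp are all zero, which
# reduces the formula to raw * 2.
def ipmi_linear_convert(raw, m, b=0, b_exp=0, r_exp=0):
    return (m * raw + b * (10 ** b_exp)) * (10 ** r_exp)

ipmi_linear_convert(8, m=2)              # 16 W, matching the reading above
ipmi_linear_convert(100, m=1, r_exp=-2)  # 1.0, e.g. a voltage in hundredths
```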

    Additional Resources


    488073: Intel® Xeon Phi™ Coprocessor Datasheet, available from IBL/CDI

    513973: Intel® Intelligent Power Node Manager 3.0 External Interface Specification using IPMI, Rev. 1.0.3, available from IBL/CDI

    434090: Intel® Intelligent Power Node Manager 2.0 External Interface Specification Using IPMI, Revision 1.8, available from IBL/CDI

    Intelligent Platform Management Interface Specification, Second Generation, v2.0, available publicly at http://www.intel.com/content/www/us/en/servers/ipmi/ipmi-specifications.html

    IPMItool Man page: http://linux.die.net/man/1/ipmitool

     

    Acknowledgements


    This paper could not have been written without the SMC/BMC expertise of Patrick Voelker, and BMC expertise of Keith Kroeker and Gerald Wheeler.  A big thanks to Andrey Semin for being the voice of the customer.

     

    About the Author


    Todd Enger is a platform application engineer working in Taipei, Taiwan for Intel Microelectronics Asia Ltd.  He specializes in software and firmware support of the Intel® Xeon Phi™ Coprocessor and is also working on enabling customers who will build platforms based upon the next-generation Knights Landing processor. Todd has spent the last 10 years working in Taiwan, the past 4 of them at Intel.  Prior to that, he worked for various OEMs and ODMs in the server, notebook, and smartphone areas.  Back in the US, Todd worked in the Chicago area until a business trip brought him to Taiwan.  After 2 weeks of astonishment, Todd used his ingenuity to find an opportunity in Taipei developing software on smartphones.  Todd received his BSE in Electrical and Computer Engineering from the University of Michigan-Dearborn.  In his spare time, he enjoys scuba diving in the waters around Taiwan, running, swimming, and hanging out at the beach.

    Notices


    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

    A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

    Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

    The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

    Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
    Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:  http://www.intel.com/design/literature.htm

    Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

    *Other names and brands may be claimed as the property of others

    Copyright© 2014 Intel Corporation. All rights reserved.

    This sample source code is released under the Intel Sample Source Code License Agreement


  • performance
  • KNC
  • Knights Corner
  • MIC
  • Many Integrated Core
  • Many Core
  • Parallel Programing
  • Todd Enger
  • Developers
  • Linux*
  • Microsoft Windows* 8
  • Server
  • C#
  • Beginner
  • Intermediate
  • Cluster computing
  • Debugging
  • Development tools
  • Intel® Many Integrated Core Architecture
  • Open source
  • Power efficiency
  • Server
  • URL
  • Code sample
  • Server
  • How to Control the Coprocessor Execution Environment in Offload Programs


    In offload compilation mode, the offload runtime system of the Intel compiler provides two mechanisms that let the host CPU program control the execution environment on the coprocessor:

    1. Set environment variables on the host system and have them forwarded to the coprocessor
    2. Call the corresponding runtime control functions from the host program

     

    Environment variables:

    By default, when an offload occurs, the runtime system copies all environment variables of the host execution environment to the coprocessor's execution environment. You can change this default behavior by giving the environment variable "MIC_ENV_PREFIX" a value. Once it is set, the offload runtime no longer copies all host environment variables; instead it copies only those whose names start with the value of "MIC_ENV_PREFIX" followed by an underscore, and the corresponding variables in the coprocessor environment do not keep that prefix. This way you can give an environment variable of the same name different values on the host and on the coprocessor. For example, suppose the host sets:

     

    MIC_ENV_PREFIX=ABC

    OMP_NUM_THREADS=8

    ABC_OMP_NUM_THREADS=124

     

    With these settings, OMP_NUM_THREADS on the host is 8, while in the execution environment of offload code running on the coprocessor OMP_NUM_THREADS is set to 124.

     

    The offload runtime also supports giving a specific device a different value: insert the coprocessor number between the "MIC_ENV_PREFIX" prefix and the variable name. For example, if the OMP_NUM_THREADS settings in the example above are changed to:

     

    MIC_ENV_PREFIX=ABC

    OMP_NUM_THREADS=8

    ABC_4_OMP_NUM_THREADS=124

     

    then OMP_NUM_THREADS on the host is set to 8, OMP_NUM_THREADS on the fifth coprocessor (card number 4) is set to 124, and OMP_NUM_THREADS on the other coprocessors is left unset.

     

    If you need to specify several environment variables for a coprocessor at once, the following shorthand forms are also available:

    mic_prefix_VAR=variable1=value1|variable2=value2|variable3=value3|...

    mic_prefix_card_number_VAR=variable1=value1|variable2=value2|variable3=value3|...

    where card_number is the coprocessor number.

     

    Runtime control functions:

    Some CPU runtime control API functions have offload counterparts; the difference is two additional parameters:

     

    target_type: the device type. Currently the predefined value "DEFAULT_TARGET_TYPE" is recommended.

    target_number: the device number.

     

    Before using these API functions, include the corresponding header file "offload.h". For example, the API for setting the number of OpenMP threads comes in the following two forms:

     

    CPU API:void omp_set_num_threads (int num_threads);

     

    Offload API: void omp_set_num_threads_target (TARGET_TYPE target_type, int target_number, int num_threads);

     

    For more information on using the Intel compilers to develop programs for the Intel® Xeon Phi™ coprocessor, see the relevant sections of the Intel compiler user and reference guides.

  • Intel Parallel Composer XE
  • Developers
  • Students
  • Linux*
  • C/C++
  • Fortran
  • Intermediate
  • Intel® Composer XE
  • Development tools
  • Parallel computing
  • Server
  • URL
  • Compiler topics
  • Performance improvement
  • Multithreaded development
  • Theme zone: IDZone

    Meshcentral - Introduction & Overview


     

    Site Links

    Main site: meshcentral.com
    Information site: info.meshcentral.com
    Developer blog: intel.com/software/ylian

    Overview
    Meshcentral is an open source project under the Apache 2.0 license that allows administrators to remotely manage computers over the Internet using a single web portal. You have to download and install a mesh agent on all your devices, but once installed the agent is self-upgrading and makes the device available for management on the web portal. A few things set Meshcentral apart from other solutions. It is open source, so anyone can freely set up their own instance of Meshcentral on their own server. Meshcentral manages a very wide array of devices: Windows, OS X, Android, Linux, XEN and more. You can use the same solution to manage big servers and Intel® Galileo devices.

    Features

    Meshcentral features can be separated into in-band and out-of-band features. In-band features are available on all devices; out-of-band features are only available on computers with Intel® AMT.

    • Remote desktop (in-band and Intel® AMT hardware KVM)
    • Remote terminal access (in-band and Intel® AMT serial-over-lan)
    • Remote file access
    • Remote web access
    • Remote power control (in-band and Intel® AMT power control)
    • General monitoring
    • Video chat with Android

    Tutorial Videos

    To help, we have a YouTube playlist with a set of tutorial videos covering many aspects of using Meshcentral. The first two videos, "Getting Started" and "Basic Features", are probably the best way to get a quick introduction to Meshcentral.

    Compatible Tools

    Most people using Meshcentral will only use the web portal, which is feature rich and works on any device with a browser. In addition to the web portal, we have applications and tools that are compatible with Meshcentral. So, if you are already using these tools, you can easily take advantage of remote management over the Internet.

  • Mesh
  • MeshCentral
  • MeshCentral.com
  • windows
  • linux
  • android
  • osx
  • Ylian Saint-Hilaire
  • Ylian
  • Developers
  • Partners
  • Professors
  • Students
  • Android*
  • Apple OS X*
  • Arduino
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • Unix*
  • Yocto Project
  • Business client
  • Cloud services
  • Internet of Things
  • Advanced
  • Beginner
  • Intermediate
  • Enterprise
  • Intel® Atom™ processors
  • Intel® Core™ processors
  • Intel® vPro™ technology
  • Mobility
  • Open source
  • Power efficiency
  • Security
  • Small business
  • Embedded
  • Laptop
  • Phone
  • Server
  • Tablet
  • Desktop
  • URL
  • Submit Your Android Apps to Lenovo!


    Objective: Lenovo, a world leader in the PC market and owner of the CCE brand in Brazil, is looking for Android apps for its Intel technology-based tablets.

    Details: Developers may submit Android apps in any category (e.g., games, kids, education, productivity, etc.).
    Lenovo and Intel will evaluate the apps, and the selected ones may ship on the company's products through the app hub present on the devices. Every app must have at least one monetization model. Lenovo will define, together with the developer, the business model for sharing the revenue generated by the app.

    Benefits: developers of the selected apps may sign a distribution agreement with Lenovo, one of the largest technology companies in the world, and expand their revenue stream from the apps.

    Requirements:
    Developers must meet the following requirements:

    - Offer a tablet app compatible with the Intel Bay Trail CR processor, 1 GB memory, 8 GB flash, Android 4.4, 0.3 MP front camera, 2 MP rear camera, 2400 mAh battery, 5-point multitouch, 1024 x 600 screen resolution, no Bluetooth, Wi-Fi only.

    - The app developed for Lenovo must be differentiated from the version available in stores such as Google Play.

    - Submitted apps must have at least one monetization model.

    - Developers must provide a registration form containing: 

    •       App name:
    •       Category:
    •       Main features and functionality:
    •       Contact name:
    •       Contact details:

    - The revenue-sharing proposal with Lenovo will be made case by case after the app is approved.  

    - Interested parties should send an email titled APP ANDROID LENOVO with the registration form mentioned above and a download link for the app. If the app is already part of the Intel Showroom, the company only needs to send the registration form and the Showroom link in the body of the email.

    Dates: The selection period starts on July 4 and runs until August 15 at 6 p.m.

    Process: The apps will be evaluated by the Lenovo and Intel teams. The selected developers will be notified by email with guidance on the next steps of the partnership.

    Contact emails: Vitor Araujo (varaujo@lenovo.com), Daniel Almeida (dalmeida@lenovo.com) and Juliano Alves (juliano.alves@intel.com).

    More: Only submissions of apps made in Brazil and developed by companies already registered in the Intel® Software Partner Program will be accepted. To register for free, visit: https://software.intel.com/pt-br/grow-business-reports

    RELATED LINKS:
    - Learn more about the Lenovo Developer Program at: http://lenovodev.com
    - For more information on Android app development, visit: https://software.intel.com/pt-br/android
    - For other business opportunities, visit our Brazil page: https://software.intel.com/pt-br/brazil-partners
    - Learn more about why partnering with Lenovo is attractive: Download LSP11102_LenovoDev_OneSheet_030414.pdf.

     

  • business marketing
  • Business opportunity
  • marketing and business
  • mobile marketing
  • sales and marketing
  • games
  • Android app development
  • business
  • Developers
  • Intel AppUp® developers
  • Partners
  • Professors
  • Students
  • Android*
  • Linux*
  • Android*
  • Business client
  • HTML5
  • C/C++
  • HTML5
  • Java*
  • JavaScript*
  • Unity
  • Advanced
  • Intermediate
  • Tablet
  • URL
  • Debugging Intel® Xeon Phi™ Applications on Linux* Host


    Introduction

    The Intel® Xeon Phi™ coprocessor is a product based on the Intel® Many Integrated Core Architecture (Intel® MIC). Intel offers a debug solution for this architecture that can debug applications running on an Intel® Xeon Phi™ coprocessor.

    There are many reasons why a debug solution for Intel® MIC is needed. Some of the most important are the following:

    • Developing native Intel® MIC applications is as easy as for IA-32 or Intel® 64 hosts. In most cases they just need to be cross-compiled (-mmic).
      Yet, the Intel® MIC Architecture differs from the host architecture. Those differences can unveil existing issues, and incorrect tuning for Intel® MIC can introduce new ones (e.g., data alignment, whether an application can handle hundreds of threads, efficient memory consumption, etc.).
    • Developing offload enabled applications induces more complexity, as host and coprocessor share the workload.
    • General lower level analysis, tracing execution paths, learning the instruction set of Intel® MIC Architecture, …

    Debug Solution for Intel® MIC

    For Linux* host, Intel offers a debug solution for Intel® MIC which is based on GNU* GDB. It can be used on the command line for both host and coprocessor. There is also an Eclipse* IDE integration that eases debugging of applications with hundreds of threads thanks to its user interface. It also supports debugging offload enabled applications.

    How to get it?

    There are currently two ways to obtain Intel’s debug solution for Intel® MIC Architecture on Linux* host: as part of Intel® MPSS, or as part of Intel® Composer XE.

    Both packages contain the same debug solutions for Intel® MIC Architecture!

    Why use the provided GNU* GDB from Intel?

    • Capabilities are released back to GNU* community
    • Latest GNU* GDB versions in future releases
    • Improved C/C++ & Fortran support thanks to Project Archer and contribution through Intel
    • Increased support for Intel® architecture (esp. Intel® MIC)
    • Eclipse* IDE integration for C/C++ and Fortran
    • Additional debugging capabilities – more later

    Why is Intel providing a Command Line and Eclipse* IDE Integration?

    The command line with GNU* GDB has the following advantages:

    • Well known syntax
    • Lightweight: no dependencies
    • Easy setup: no project needs to be created
    • Fast for debugging hundreds of threads
    • Can be automated/scripted

    Using the Eclipse* IDE provides more features:

    • Comfortable user interface
    • Most known IDE in the Linux* space
    • Use existing Eclipse* projects
    • Simple integration of the Intel enhanced GNU* GDB
    • Works also with Photran* plug-in to support Fortran
    • Supports debugging of offload enabled applications
      (not supported by command line)

    Deprecation Notice

    Intel® Debugger is deprecated (incl. Intel® MIC Architecture support):

    • Intel® Debugger for Intel® MIC Architecture was only available in Composer XE 2013 & 2013 SP1
    • Intel® Debugger is not part of Intel® Composer XE 2015 anymore

    Users are advised to use the GNU* GDB that comes with Intel® Composer XE 2013 SP1 and later!

    You can provide feedback via either your Intel® Premier account (http://premier.intel.com) or via the Debug Solutions User Forum (http://software.intel.com/en-us/forums/debug-solutions/).

    Features

    Intel’s GNU* GDB, starting with version 7.5, provides additional extensions that are available on the command line:

    • Support for Intel® Many Integrated Core Architecture (Intel® MIC Architecture):
      Displays registers (zmmX & kX) and disassembles the instruction set
    • Support for Intel® Transactional Synchronization Extensions (Intel® TSX):
      Helpers for Restricted Transactional Memory (RTM) model
      (only for host)
    • Data Race Detection (pdbx):
      Detect and locate data races for applications threaded using POSIX* thread (pthread) or OpenMP* models
    • Branch Trace Store (btrace):
      Record branches taken in the execution flow to backtrack easily after events like crashes, signals, exceptions, etc.
      (only for host)
    • Pointer Checker:
      Assist in finding pointer issues if compiled with Intel® C++ Compiler and having Pointer Checker feature enabled
      (only for host)
    • Register support for Intel® Memory Protection Extensions (Intel® MPX) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512):
      Debugger is already prepared for future generations

    The features for Intel® MIC highlighted above are described in the following.

    Register and Instruction Set Support

    Compared to Intel® architecture on host systems, Intel® MIC Architecture comes with a different instruction and register set. Intel’s GNU* GDB comes with transparently integrated support for those.  Use is no different than with host systems, e.g.:

    • Disassembling of instructions:
      
      		(gdb) disassemble $pc, +10
      
      		Dump of assembler code from 0x11 to 0x24:
      
      		0x0000000000000011 <foobar+17>: vpackstorelps %zmm0,-0x10(%rbp){%k1}
      
      		0x0000000000000018 <foobar+24>: vbroadcastss -0x10(%rbp),%zmm0
      
      		⁞
      
      		


      In the above example the first ten instructions are disassembled beginning at the instruction pointer ($pc). Only the first two lines are shown for brevity. The first two instructions are Intel® MIC specific and their mnemonics are shown correctly.
       
    • Listing of mask (kX) and vector (zmmX) registers:
      
      		(gdb) info registers zmm
      
      		k0   0x0  0
      
      		     ⁞
      
      		zmm31 {v16_float = {0x0 <repeats 16 times>},
      
      		      v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
      
      		      v64_int8 = {0x0 <repeats 64 times>},
      
      		      v32_int16 = {0x0 <repeats 32 times>},
      
      		      v16_int32 = {0x0 <repeats 16 times>},
      
      		      v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
      
      		      v4_uint128 = {0x0, 0x0, 0x0, 0x0}}
      
      		


      The register set has also been extended with the kX (mask) and zmmX (vector) registers that come with Intel® MIC.

    If you use the Eclipse* IDE integration you’ll get the same information in dedicated windows:

    • Disassembling of instructions:
      Eclipse* IDE Disassembly Window
    • Listing of mask (kX) and vector (zmmX) registers:
      Eclipse* IDE Register Window

    Data Race Detection

    A quick excursion about what data races are:

    • A data race happens…
      if at least two threads/tasks access the same memory location without synchronization and at least one thread/task is writing.
    • Example:
      Imagine the two functions thread1() and thread2() are executed concurrently by different threads.

      
      		int a = 1;
      
      		int b = 2;
      
      		                                         | t
      
      		int thread1() {      int thread2() {     | i
      
      		  return a + b;        b = 42;           | m
      
      		}                    }                   | e
      
      		                                         v
      
      		


      The return value of thread1() depends on timing: 3 vs. 43!
      This is one (trivial) example of a data race.

    What are typical symptoms of data races?

    • Data race symptoms:
      • Corrupted results
      • Run-to-run variations
      • Corrupted data ending in a crash
      • Non-deterministic behavior
    • Solution is to synchronize concurrent accesses, e.g.:
      • Thread-level ordering (global synchronization)
      • Instruction level ordering/visibility (atomics)
        Note:
        Race free but still not necessarily run-to-run reproducible results!
      • No synchronization: data races might be acceptable

    Intel’s GNU* GDB data race detection can help to analyze correctness.

    How to detect data races?

    • Prepare to detect data races:
      • Only supported with Intel® C++/Fortran Compiler (part of Intel® Composer XE):
        Compile with -debug parallel (icc, icpc or ifort)
        Only objects compiled with -debug parallel are analyzed!
      • Optionally, add debug information via -g
    • Enable data race detection (PDBX) in debugger:
      
      		(gdb) pdbx enable
      
      		(gdb) c
      
      		data race detected
      
      		1: write shared, 4 bytes from foo.c:36
      
      		3: read shared, 4 bytes from foo.c:40
      
      		Breakpoint -11, 0x401515 in L_test_..._21 () at foo.c:36
      
      		*var = 42; /* bp.write */
      
      		

    Data race detection requires an additional library libpdbx.so.5:

    • Keeps track of the synchronizations
    • Part of Intel® C++ & Fortran Compiler
    • Copy to coprocessor if missing
      (found at <composer_xe_root>/compiler/lib/mic/libpdbx.so)

    Supported parallel programming models:

    • OpenMP*
    • POSIX* threads

    Data race detection can be enabled/disabled at any time

    • Only memory accesses are analyzed within a certain period
    • Keeps memory footprint and run-time overhead minimal

    There is finer grained control for minimizing overhead and selecting code sections to analyze by using filter sets.

    More control about what to analyze with filters:

    • Add filter to selected filter set, e.g.:
      
      		(gdb) pdbx filter line foo.c:36
      
      		(gdb) pdbx filter code 0x40518..0x40524
      
      		(gdb) pdbx filter var shared
      
      		(gdb) pdbx filter data 0x60f48..0x60f50
      
      		(gdb) pdbx filter reads # read accesses
      
      		

      Those define various filters on either instructions, by specifying a source file and line or an address (range), or on variables, using symbol names or addresses (ranges) respectively. There is also a filter to only report accesses that use (read) the data in case of a data race.
       
    • There are two basic configurations, which are mutually exclusive:
       
      • Ignore events specified by filters (default behavior)
        
        				(gdb) pdbx fset suppress
        
        				
      • Ignore events not specified by filters
        
        				(gdb) pdbx fset focus
        
        				

        The first one blacklists the code or data sections specified by the filters so they are not analyzed, whilst the latter one defines a white list so that only the specified sections are analyzed.
         
    • Get debug command help
      
      		(gdb) help pdbx
      
      		

      This command will provide additional help on the commands.

    Use cases for filters:

    • Focused debugging, e.g. debug a single source file or only focus on one specific memory location.
    • Limit overhead and control false positives. Detection involves some runtime and memory overhead. The more the filters narrow down the scope of analysis, the more the overhead is reduced. Filters can also be used to exclude false positives. Those can occur when real data races are detected that by design have no impact on the application's correctness (e.g., results of multiple threads don't need to be stored globally in strict order).
    • Exclude third-party code from analysis

    Some additional hints using PDBX:

    • Optimized code (symptom):
      
      		(gdb) run
      
      		data race detected
      
      		1: write question, 4 bytes from foo.c:36
      
      		3: read question, 4 bytes from foo.c:40
      
      		Breakpoint -11, 0x401515 in foo () at foo.c:36
      
      		*answer = 42;
      
      		(gdb)
      
      		

       
    • Incident has to be analyzed further:
      • Remember: data races are reported on memory objects
      • If symbol name cannot be resolved: only address is printed
         
    • Recommendation:
      Unoptimized code (-O0) is easier to understand, because temporaries are not removed or optimized away, etc.
       
    • Reported data races appear to be false positives:
      • Not all data races are bad… user intended?
      • OpenMP*: Distinct parallel sections using the same variable (same stack frame) can result in false positives

    Note:
    PDBX is not available for Eclipse* IDE and will only work for remote debugging of native coprocessor applications. See section Debugging Remotely with PDBX for more information on how to use it.

    Debugging on Command Line

    There are multiple versions available:

    • Debug natively on Intel® Xeon Phi™ coprocessor
    • Execute GNU* GDB on host and debug remotely

    Debug natively on Intel® Xeon Phi™ coprocessor
    This version of Intel’s GNU* GDB runs natively on the coprocessor. It is included in Intel® MPSS only and needs to be made available on the coprocessor first in order to run it. Depending on the MPSS version it can be found at the provided location:

    • MPSS 2.1: /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb
    • MPSS 3.[1|2]: included in gdb-7.5+mpss3.*.k1om.rpm as part of package mpss-3.*-k1om.tar
      (for MPSS 3.1.2, please see Errata, for MPSS 3.1.4 use mpss-3.1.4-k1om-gdb.tar)

      For MPSS 3.[1|2] the coprocessor native GNU* GDB requires debug information from some system libraries for proper operation. Please see Errata for more information.

    Execute GNU* GDB on host and debug remotely
    There are two ways to start GNU* GDB on the host and debug remotely using GDBServer on the coprocessor:

    • Intel® MPSS:
      • MPSS 2.1: /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
      • MPSS 3.[1|2]: <mpss_root>/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux/k1om-mpss-linux-gdb
      • GDBServer:
        /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver
        (same path for MPSS 2.1 & 3.[1|2])
    • Intel® Composer XE:
      • Source environment to start GNU* GDB:
        
        				$ source debuggervars.[sh|csh]
        
        				$ gdb-mic
        
        				
      • GDBServer:
        <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver

    The sourcing of the debugger environment is only needed once. If you already sourced the according compilervars.[sh|csh] script you can omit this step and gdb-mic should already be in your default search paths.

    Attention: Do not mix GNU* GDB & GDBServer from different packages! Always use both from either Intel® MPSS or Intel® Composer XE!

    Debugging Natively

    1. Make sure GNU* GDB is already on the target, either:
    • Copy manually, e.g.:
      
      		$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb mic0:/tmp
      
      		
    • Add to the coprocessor image (see Intel® MPSS documentation)
       
    2. Run GNU* GDB on the Intel® Xeon Phi™ coprocessor, e.g.:
      
      		$ ssh -t mic0 /tmp/gdb
      
      		

       
    3. Initiate debug session, e.g.:
    • Attach:
      
      		(gdb) attach <pid>

      <pid> is PID on the coprocessor
    • Load & execute:
      
      		(gdb) file <path_to_application>

      <path_to_application> is path on coprocessor

    Some additional hints:

    • If native application needs additional libraries:
      Set $LD_LIBRARY_PATH, e.g. via:
      
      		(gdb) set env LD_LIBRARY_PATH=/tmp/
      
      		

      …or set the variable before starting GDB
       
    • If source code is relocated, help the debugger to find it:
      
      		(gdb) set substitute-path <from> <to>

      Change paths from <from> to<to>. You can relocate a whole source (sub-)tree with that.

    Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!

    Debugging Remotely

    1. Copy GDBServer to coprocessor, e.g.:
      
      		$ scp <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver mic0:/tmp

      During development you can also add GDBServer to your coprocessor image!
       
    2. Start GDB on host, e.g.:
      
      		$ source debuggervars.[sh|csh]
      
      		$ gdb-mic
      
      		


      Note:
      There is also a version named gdb-ia which is for IA-32/Intel® 64 only!
       
    3. Connect:
      
      		(gdb) target extended-remote | ssh -T mic0 /tmp/gdbserver --multi -
      
      		

       
    4. Set sysroot from MPSS installation, e.g.:
      
      		(gdb) set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
      
      		

      If you do not specify this you won't get debugger support for system libraries.
       
    5. Debug:
    • Attach:
      
      		(gdb) file <path_to_application>
      
      		(gdb) attach <pid>

      <path_to_application> is path on host, <pid> is PID on the coprocessor
    • Load & execute:
      
      		(gdb) file <path_to_application>
      
      		(gdb) set remote exec-file <remote_path_to_application>

      <path_to_application> is path on host, <remote_path_to_application> is path on the coprocessor

    Some additional hints:

    • If remote application needs additional libraries:
      Set $LD_LIBRARY_PATH, e.g. via:
      
      		(gdb) target extended-remote | ssh mic0 LD_LIBRARY_PATH=/tmp/ /tmp/gdbserver --multi -
      
      		
    • If source code is relocated, help the debugger to find it:
      
      		(gdb) set substitute-path <from> <to>

      Change paths from <from> to <to>. You can relocate a whole source (sub-)tree with that.
       
    • If libraries have different paths on host & target, help the debugger to find them:
      
      		(gdb) set solib-search-path <lib_paths>

      <lib_paths> is a colon separated list of paths to look for libraries on the host

    Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!

    Debugging Remotely with PDBX

    PDBX has some pre-requisites that must be fulfilled for proper operation. Use the pdbx check command to see whether PDBX is working:

    1. First step:
      
      		(gdb) pdbx check
      
      		checking inferior...failed.
      
      		


      Solution:
      Start a remote application (inferior) and hit some breakpoint (e.g., b main and run)
       
    2. Second step:
      
      		(gdb) pdbx check
      
      		checking inferior...passed.
      
      		checking libpdbx...failed.
      
      		


      Solution:
      Use set solib-search-path <lib_paths> to provide the path of libpdbx.so.5 on the host.
       
    3. Third step:
      
      		(gdb) pdbx check
      
      		checking inferior...passed.
      
      		checking libpdbx...passed.
      
      		checking environment...failed.
      
      		


      Solution:
      Set additional environment variables on the target for OpenMP*. Those need to be set when starting GDBServer (similar to setting $LD_LIBRARY_PATH).
    • $INTEL_LIBITTNOTIFY32=""
    • $INTEL_LIBITTNOTIFY64=""
    • $INTEL_ITTNOTIFY_GROUPS=sync

    Debugging with Eclipse* IDE

    Intel offers an Eclipse* IDE debugger plug-in for Intel® MIC that has the following features:

    • Seamless debugging of host and coprocessor
    • Simultaneous view of host and coprocessor threads
    • Supports multiple coprocessor cards
    • Supports both C/C++ and Fortran
    • Support of offload extensions (auto-attach to offloaded code)
    • Support for Intel® Many Integrated Core Architecture (Intel® MIC Architecture): Registers & Disassembly

    Eclipse* IDE with Offload Debug Session

    The plug-in is part of both Intel® MPSS and Intel® Composer XE.

    Pre-requisites

    In order to use the provided plug-in the following pre-requisites have to be met:

    • Supported Eclipse* IDE version:
      • 4.2 with Eclipse C/C++ Development Tools (CDT) 8.1 or later
      • 3.8 with Eclipse C/C++ Development Tools (CDT) 8.1 or later
      • 3.7 with Eclipse C/C++ Development Tools (CDT) 8.0 or later

    We recommend: Eclipse* IDE for C/C++ Developers (4.2)

    • Java* Runtime Environment (JRE) 6.0 or later
    • For Fortran optionally Photran* plug-in
    • Remote System Explorer (aka. Target Management) to debug native coprocessor applications
    • Only for plug-in from Intel® Composer XE, source debuggervars.[sh|csh] for Eclipse* IDE environment!

    Install Intel® C++ Compiler plug-in (optional):
    Add plug-in via “Install New Software…”:
    This Plug-in is part of Intel® Composer XE (<composer_xe_root>/eclipse_support/cdt8.0/). It adds Intel® C++ Compiler support which is not mandatory for debugging. For Fortran the counterpart is the Photran* plug-in. These plug-ins are recommended for the best experience.

    Note:
    Uncheck “Group items by category”, as the list will be empty otherwise!

    Install Plug-in for Offload Debugging

    Add plug-in via “Install New Software…”:
    Install Plug-in for Offload Debugging

    Plug-in is part of:

    • Intel® MPSS:
      • MPSS 2.1: <mpss_root>/eclipse_support/
      • MPSS 3.[1|2]: /usr/share/eclipse/mic_plugin/
    • Intel® Composer XE:<composer_xe_root>/debugger/cdt/

    Configure Offload Debugging

    • Create a new debug configuration for “C/C++ Application”
    • Click on “Select other…” and select MPM (DSF) Create Process Launcher:Configure Offload Debugging
      The “MPM (DSF) Create Process Launcher” must be used for our plug-in. Please note that this instruction applies to both C/C++ and Fortran applications! Even if Photran* is installed and a “Fortran Local Application” entry is visible (not in the screenshot above!), don’t use it: it is not capable of using MPM.
       
    • In “Debugger” tab specify MPM script of Intel’s GNU* GDB:
      • Intel® MPSS:
        • MPSS 2.1: <mpss_root>/mpm/bin/start_mpm.sh
        • MPSS 3.[1|2]: /usr/bin/start_mpm.sh
          (for MPSS 3.1.1, 3.1.2 or 3.1.4, please see Errata)
      • Intel® Composer XE:
        <composer_xe_root>/debugger/mpm/bin/start_mpm.sh
        Configure Offload Debugging (Debugger)
        Here, you finally add Intel’s GNU* GDB for offload debugging (using MPM (DSF)). It is a script that takes care of setting up the full environment needed. No further configuration is required (e.g. which coprocessor cards, GDBServer & ports, IP addresses, etc.); it works fully automatically and transparently.

    Start Offload Debugging

    Debugging offload-enabled applications is not much different from debugging applications native to the host:

    • Create & build an executable with offload extensions (C/C++ or Fortran)
    • Don’t forget to add debug information (-g) and reduce optimization level if possible (-O0)
    • Start debug session:
      • Host & target debugger will work together seamlessly
      • All threads from host & target are shown and described
      • Debugging works the same as usual from the Eclipse* IDE
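    The first two bullets can be sketched as follows; the source file, its contents, and the availability of icpc on PATH are illustrative assumptions:

```shell
# Create a tiny offload-enabled C++ program (illustrative; the braced block runs on the coprocessor).
cat > offload_sample.cpp <<'EOF'
#include <cstdio>
int main() {
    int x = 0;
    #pragma offload target(mic) inout(x)
    { x = 42; }
    std::printf("x = %d\n", x);
    return 0;
}
EOF
# Build with debug info (-g) and without optimization (-O0); skipped if icpc is unavailable.
command -v icpc >/dev/null 2>&1 && icpc -g -O0 -o offload_sample offload_sample.cpp \
  || echo "icpc not found; build skipped"
```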

    Eclipse* IDE with Offload Debug Session (Example)

    This is an example (Fortran) of what offload debugging looks like. On the left side we see host & mic0 threads running. One thread (11) from the coprocessor has hit the breakpoint we set inside the loop of the offloaded code. Run control (stepping, continuing, etc.), setting breakpoints, evaluating variables/memory, … work as they used to.

    Additional Requirements for Offload Debugging

    For debugging offload-enabled applications, additional environment variables need to be set:

    • Intel® MPSS 3.[1|2]:
      AMPLXE_COI_DEBUG_SUPPORT=TRUE
      MYO_WATCHDOG_MONITOR=-1

       
    • Intel® MPSS 2.1:
      COI_SEP_DISABLE=FALSE
      MYO_WATCHDOG_MONITOR=-1

    Set those variables before starting Eclipse* IDE!
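    For the MPSS 3.[1|2] case, a minimal sketch; run these exports in the shell from which you will start Eclipse* IDE:

```shell
# Must be set before Eclipse* IDE is started from this shell (MPSS 3.[1|2] case).
# (For MPSS 2.1, the first variable would be COI_SEP_DISABLE=FALSE instead.)
export AMPLXE_COI_DEBUG_SUPPORT=TRUE
export MYO_WATCHDOG_MONITOR=-1
echo "$AMPLXE_COI_DEBUG_SUPPORT $MYO_WATCHDOG_MONITOR"
```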

    Those are currently needed but might become obsolete in the future. Please be aware that the debugger cannot and should not be used in combination with Intel® VTune™ Amplifier XE; hence disabling SEP (a part of Intel® VTune™ Amplifier XE) is valid. The watchdog monitor must be disabled because a debugger can stop execution for an unspecified amount of time; the system watchdog might otherwise assume that a debugged application that no longer reacts is dead and terminate it. For debugging we do not want that.

    Note:
    Do not set those variables for a production system!

    For Intel® MPSS 3.2 and later:
    MYO debug libraries are no longer installed with Intel MPSS 3.2 by default. This is a change from earlier Intel MPSS versions. Users must install the MYO debug libraries manually in order to debug MYO enabled applications using the Eclipse plug-in for offload debugging. For Intel MPSS 3.2 (and later) the MYO debug libraries can be found in the package mpss-myo-dbg-* which is included in the mpss-*.tar file.

    MPSS 3.2 and later does not support offload debugging with Intel® Composer XE 2013 SP1, please see Errata for more information!

    Configure Native Debugging

    Configure Remote System Explorer
    To debug native coprocessor applications we need to configure the Remote System Explorer (RSE).

    Note:
    Before you continue, make sure SSH works (e.g. via command line). You can also specify different credentials (user account) via RSE and save the password.

    The basic steps are quite simple:

    1. Show the Remote System window:
      Menu Window->Show View->Other…
      Select: Remote Systems->Remote Systems
       
    2. Add a new system node for each coprocessor:
      RSE Remote Systems Window
      Context menu in window Remote Systems: New Connection…
    • Select Linux, press Next>
    • Specify hostname of the coprocessor (e.g. mic0), press Next>
    • In the following dialogs select:
      • ssh.files
      • processes.shell.linux
      • ssh.shells
      • ssh.terminals

    Repeat this step for each coprocessor!

    Transfer GDBServer
    Transfer of the GDBServer to the coprocessor is required for remote debugging. We choose /tmp/gdbserver as the target on the coprocessor here (important for the following sections).

    Transfer the GDBServer to the coprocessor target, e.g.:

    
    	$ scp <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver mic0:/tmp

    During development you can also add GDBServer to your coprocessor image!

    Note:
    See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel® MPSS or Intel® Composer XE)!

    Debug Configuration

    Eclipse* IDE Debug Configuration Window

    To debug a native coprocessor application (here: native_c++), create a new debug configuration for C/C++ Remote Application.

    Set Connection to the coprocessor target configured with RSE before (here: mic0).

    Specify the remote path of the application, wherever it was copied to (here: /tmp/native_c++). We’ll address how to manually transfer files later.

    Set the flag “Skip download to target path.” if you don’t want the debugger to upload the executable to the specified path. This can be meaningful for complex projects with external dependencies (e.g. libraries) where you transfer the binaries yourself anyway.
    (for MPSS 3.1.2 or 3.1.4, please see Errata)

    Note that we use C/C++ Remote Application here. This also applies to Fortran applications, because the Photran* plug-in does not provide a remote debug configuration section!

    Eclipse* IDE Debug Configuration Window (Debugger)

    In Debugger tab, specify the provided Intel GNU* GDB for Intel® MIC (here: gdb-mic).

    Eclipse* IDE Debug Configuration Window (Debugger) -- Specify .gdbinit

    In the above example, set sysroot from MPSS installation in .gdbinit, e.g.:

    
    	set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
    
    	

    You can use .gdbinit or any other command file that should be loaded before starting the debugging session. If you do not specify this, you won't get debugger support for system libraries.
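    Writing such a command file can be scripted; a minimal sketch (the sysroot path assumes the MPSS 3.1.4 install used above):

```shell
# Write a minimal command file for gdb-mic; load it via .gdbinit or gdb's -x option.
cat > mic-gdbinit <<'EOF'
set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
EOF
```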

    Note:
    See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel® MPSS or Intel® Composer XE)!

    Eclipse* IDE Debug Configuration Window (Debugger/GDBServer)

    In Debugger/Gdbserver Settings tab, specify the uploaded GDBServer (here: /tmp/gdbserver).

    Build Native Application for the Coprocessor

    Configuration depends on the installed plug-ins. For C/C++ applications we recommend installing the Intel® C++ Compiler XE plug-in that comes with Composer XE. For Fortran, install Photran* (3rd party) and select the Intel® Fortran Compiler manually.

    Make sure to use the debug configuration and provide options as if debugging on the host (-g). Optionally, disabling optimizations with -O0 can make the instruction flow comprehensible when debugging.

    The only difference compared to host builds is that you need to cross-compile for the coprocessor: use the -mmic option, e.g.:
    Eclipse* IDE Project Properties
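    On the command line, the equivalent of this dialog is the usual host debug build plus -mmic; a sketch (compiler availability and file names are assumptions):

```shell
# Cross-compile for the coprocessor: same flags as a host debug build, plus -mmic.
MIC_FLAGS="-g -O0 -mmic"
command -v icpc  >/dev/null 2>&1 && icpc  $MIC_FLAGS -o native_c++ native_c++.cpp
command -v ifort >/dev/null 2>&1 && ifort $MIC_FLAGS -o native_f90 native_f90.f90
echo "flags used: $MIC_FLAGS"   # -mmic applies to both compile and link steps
```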

    After configuration, clean your build. This is needed because Eclipse* IDE might not notice all dependencies. And finally, build.

    Note:
    The configuration dialog shown only exists for the Intel® C++ Compiler plug-in. For Fortran, users need to install the Photran* plug-in and switch the compiler/linker to ifort by hand, adding -mmic manually. This has to be done for both the compiler & linker!

    Start Native Debugging

    Transfer the executable to the coprocessor, e.g.:

    • Copy manually  (e.g. via script on the terminal)
    • Use the Remote Systems window (RSE) to copy files from host and paste to coprocessor target (e.g. mic0):
      RSE Remote Systems Window (Copy)
      Select the files from the tree (Local Files) and paste them to where you want them on the target to be (e.g. mic0)
       
    • Use NFS to mirror builds to coprocessor (no need for update)
    • Use debugger to transfer (see earlier)

    Note:
    It is crucial that the executable can be executed on the coprocessor. In some cases the execution bits might not be set after copying.
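    A quick local sketch of checking and fixing the execute bit; on the real target you would run the same chmod on the coprocessor (e.g. over ssh to mic0), and the file name here is only an example:

```shell
# Stand-in for the copied binary; on the coprocessor this would be e.g. /tmp/native_c++.
touch native_c++
chmod +x native_c++
test -x native_c++ && echo "execute bit set"
```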

    Start debugging using the C/C++ Remote Application created in the earlier steps. It should connect to the coprocessor target and launch the specified application via the GDBServer. Debugging is the same as for local/host applications.
    Native Debugging Session (Remote)

    Note:
    This works for coprocessor native Fortran applications the exact same way!

    Documentation

    More information can be found in the official documentation:

    • Intel® MPSS:
      • MPSS 2.1:
        <mpss_root>/docs/gdb/gdb.pdf
        <mpss_root>/eclipse_support/README-INTEL
      • MPSS 3.[1|2]:
        not available yet (please see Errata)
    • Intel® Composer XE:
      <composer_xe_root>/Documentation/[en_US|ja_JP]/debugger/gdb/gdb.pdf
      <composer_xe_root>/Documentation/[en_US|ja_JP]/debugger/gdb/eclmigdb_config_guide.pdf

    The PDF gdb.pdf is the original GNU* GDB manual for the base version Intel ships, extended with all added features. So, this is the place to get help for new commands, behavior, etc.
    README-INTEL from Intel® MPSS contains a short guide on how to install and configure the Eclipse* IDE plug-in.
    The PDF eclmigdb_config_guide.pdf provides an overall step-by-step guide on how to debug with the command line and with the Eclipse* IDE.

    Using Intel® C++ Compiler with the Eclipse* IDE on Linux*:
    http://software.intel.com/en-us/articles/intel-c-compiler-for-linux-using-intel-compilers-with-the-eclipse-ide-pdf/
    The knowledgebase article (Using Intel® C++ Compiler with the Eclipse* IDE on Linux*) is a step-by-step guide on how to install, configure, and use the Intel® C++ Compiler with the Eclipse* IDE.

    Errata

    • With the recent switch from MPSS 2.1 to 3.1 some packages might be incomplete or missing. Future updates will add improvements. Currently, documentation for GNU* GDB is missing.
       
    • For MPSS 3.1.2 and 3.1.4 the respective package mpss-3.1.[2|4]-k1om.tar is missing. It contains binaries for the coprocessor, like the native GNU* GDB for the coprocessor. It also contains /usr/libexec/sftp-server which is needed if you want to debug native applications on the coprocessor and require Eclipse* IDE to transfer the binary automatically. As this is missing you need to transfer the files manually (select “Skip download to target path.” in this case).
      As a workaround, you can use mpss-3.1.1-k1om.tar from MPSS 3.1.1 and install the binaries from there. If you use MPSS 3.1.4, the native GNU* GDB is available separately via mpss-3.1.4-k1om-gdb.tar.
       
    • With MPSS 3.1.1, 3.1.2 or 3.1.4 the script <mpss_root>/mpm/bin/start_mpm.sh uses an incorrect path to the MPSS root directory. Hence offload debugging is not working. You can fix this by creating a symlink for your MPSS root, e.g. for MPSS 3.1.2:

      $ ln -s /opt/mpss/3.1.2 /opt/mpss/3.1

      Future versions of MPSS will correct this. This workaround is not required if you use the start_mpm.sh script from the Intel® Composer XE package.
       
    • For MPSS 3.[1|2] the coprocessor native GNU* GDB requires debug information from some system libraries for proper operation.
      Beginning with MPSS 3.1, debug information for system libraries is no longer installed on the coprocessor. If the coprocessor native GNU* GDB is executed, it will fail when loading/continuing with a signal (SIGTRAP).
      Current workaround is to copy the .debug folders for the system libraries to the coprocessor, e.g.:

      $ scp -r /opt/mpss/3.1.2/sysroots/k1om-mpss-linux/lib64/.debug root@mic0:/lib64/
       
    • MPSS 3.2 and 3.2.1 do not support offload debugging with Intel® Composer XE 2013 SP1.
      Offload debugging with the Eclipse plug-in from Intel® Composer XE 2013 SP1 does not work with Intel MPSS 3.2 and 3.2.1. A configuration file which is required for operation by the Intel Composer XE 2013 SP1 package has been removed with Intel MPSS 3.2 and 3.2.1. Previous Intel MPSS versions are not affected. Intel MPSS 3.2.3 fixes this problem (there is no version Intel MPSS 3.2.2!).