Introduction to Resource Binding in Microsoft DirectX* 12

April 6, 2015, 9:20 am

Latest and popular articles on Intel Technologies

By Wolfgang Engel, CEO of Confetti

On March 20th, 2014, Microsoft announced DirectX* 12 at the Game Developers Conference. By reducing resource overhead, DirectX 12 will help applications run more efficiently, decrease energy consumption, and allow gamers to play longer on mobile devices.

At SIGGRAPH 2014, Intel measured the CPU power consumption when running a simple asteroids demo on a Microsoft Surface* Pro 3 tablet. The demo app can be switched from the DirectX 11 API to the DirectX 12 API by tapping a button. This demo draws a large number of asteroids in space at a locked frame rate (https://software.intel.com/en-us/blogs/2014/08/11/siggraph-2014-directx-12-on-intel). It consumes less than half of the CPU power when driven by the DirectX 12 API compared to DirectX 11**, resulting in a cooler device with longer battery life. In a typical game scenario, any gains in CPU power can be invested in better physics, AI, pathfinding, or other CPU-intensive tasks, making the game more feature-rich or energy-efficient.

Tools of the Trade

To develop games with DirectX 12, you need the following tools:

Windows* 10 Technical Preview
DirectX 12 SDK
Visual Studio* 2013
DirectX 12-capable GPU drivers

If you are a game developer, check out Microsoft’s DirectX Early Access Program at https://onedrive.live.com/survey?resid=A4B88088C01D9E9A!107&authkey=!AFgbVA2sYbeoepQ.

Setup instructions for installing the SDK and the GPU drivers are provided after you are accepted into the DirectX Early Access Program.

Overview

From a high-level point of view and compared to DirectX 10 and 11, the architecture of DirectX 12 differs in the areas of state management and the way resources are tracked and managed in memory.

DirectX 10 introduced state objects to set a group of states during run time. DirectX 12 introduces pipeline state objects (PSOs) used to set an even larger group of states along with shaders. This article focuses on the changes in dealing with resources and leaves the description of how states are grouped in PSOs to future articles.

In DirectX 11, the system was responsible for predicting or tracking resource usage patterns, which limited application design when using DirectX 11 on a broad scale. Basically, in DirectX 12, the programmer, not the system or driver, is responsible for handling the following three usage patterns:

Binding of resources
DirectX 10 and 11 tracked the binding of resources to the graphics pipeline to keep resources alive that were already released by the application because they were still referenced by outstanding GPU work. DirectX 12 does not keep track of resource binding. The application, or in other words the programmer, must handle object lifetime management.
Inspection of resource bindings
DirectX 12 does not inspect resource bindings to know if or when a resource transition might have occurred. For example, an application might write into a render target via a render target view (RTV) and then read this render target as a texture via a shader resource view (SRV). With the DirectX 11 API, the GPU driver was expected to know when such a resource transition was happening to avoid memory read-modify-write hazards. In DirectX 12 you have to identify and track any resource transitions via dedicated API calls.
Synchronization of mapped memory
In DirectX 11, the driver handled synchronization of mapped memory between the CPU and GPU. The system inspected the resource bindings to understand if rendering needed to be delayed because a resource that was mapped for CPU access had not been unmapped yet. In DirectX 12, the application needs to handle synchronization of CPU and GPU access to resources. One mechanism for synchronizing memory access is to request an event that wakes up a thread when the GPU has finished processing.
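
To make the lifetime and synchronization responsibilities above more concrete, here is a minimal CPU-only sketch of one common pattern: a released resource is parked in a queue and destroyed only after the GPU has passed the point in the command stream where it was last used. The types `FenceModel` and `DeferredReleaseQueue` are hypothetical stand-ins invented for this illustration and are not DirectX 12 interfaces; the sketch also assumes resources are released in non-decreasing fence order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical CPU-side model; none of these types are D3D12 interfaces.
// The "GPU" reports progress through a monotonically increasing value.
struct FenceModel {
    std::uint64_t completed = 0;  // last value the GPU has finished
};

// Keeps released resources alive until the GPU work that used them is done.
// Assumption: release() is called with non-decreasing fence values.
class DeferredReleaseQueue {
public:
    // Called when the application drops a resource that was last used by
    // GPU work tagged with fenceValue.
    void release(std::uint64_t fenceValue, std::function<void()> destroy) {
        pending_.push_back({fenceValue, std::move(destroy)});
    }

    // Destroys every resource whose fence value the GPU has already passed.
    // Returns the number of resources destroyed.
    int collect(const FenceModel& fence) {
        int destroyed = 0;
        while (!pending_.empty() &&
               pending_.front().fenceValue <= fence.completed) {
            pending_.front().destroy();
            pending_.pop_front();
            ++destroyed;
        }
        return destroyed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    struct Entry {
        std::uint64_t fenceValue;
        std::function<void()> destroy;
    };
    std::deque<Entry> pending_;
};
```

An application built this way would call `collect` once per frame, after reading how far the GPU has progressed, so that nothing still referenced by outstanding GPU work is destroyed early.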

Moving these resource usage patterns into the realm of the application required a new set of programming interfaces that can deal with a wide range of different GPU architectures.

The rest of this paper describes the new resource binding mechanisms, the first building block being descriptors.

Descriptors

Descriptors describe resources stored in memory. A descriptor is a block of data that describes an object to the GPU, in a GPU-specific opaque format. A simple way of thinking about descriptors is as a replacement of the old “view” system in DirectX 11. In addition to the different types of descriptors like Shader Resource View (SRV) and Unordered Access View (UAV) in DirectX 11, DirectX 12 has other types of descriptors like Samplers and Constant Buffer Views (CBVs).

For example, an SRV selects which underlying resource to use, what set of mipmaps / array slices to use, and the format to interpret the memory. An SRV descriptor must contain the GPU virtual address of the Direct3D* resource, which might be a texture. The application must ensure that the underlying resource is not already destroyed or inaccessible because it is nonresident.

Figure 1 shows a descriptor that represents a “view” into a texture:

To create a shader resource view in DirectX 12, use the following structure and Direct3D device method:

typedef struct D3D12_SHADER_RESOURCE_VIEW_DESC
{
    DXGI_FORMAT Format;
    D3D12_SRV_DIMENSION ViewDimension;

    union
    {
        D3D12_BUFFER_SRV Buffer;
        D3D12_TEX1D_SRV Texture1D;
        D3D12_TEX1D_ARRAY_SRV Texture1DArray;
        D3D12_TEX2D_SRV Texture2D;
        D3D12_TEX2D_ARRAY_SRV Texture2DArray;
        D3D12_TEX2DMS_SRV Texture2DMS;
        D3D12_TEX2DMS_ARRAY_SRV Texture2DMSArray;
        D3D12_TEX3D_SRV Texture3D;
        D3D12_TEXCUBE_SRV TextureCube;
        D3D12_TEXCUBE_ARRAY_SRV TextureCubeArray;
        D3D12_BUFFEREX_SRV BufferEx;
    };
} D3D12_SHADER_RESOURCE_VIEW_DESC;

interface ID3D12Device
{
...
    void CreateShaderResourceView (
        _In_opt_ ID3D12Resource* pResource,
        _In_opt_ const D3D12_SHADER_RESOURCE_VIEW_DESC* pDesc,
        _In_ D3D12_CPU_DESCRIPTOR_HANDLE DestDescriptor);
};

Example code for an SRV might look like this:

// create SRV
D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc;
ZeroMemory(&srvDesc, sizeof(D3D12_SHADER_RESOURCE_VIEW_DESC));
srvDesc.Format = mTexture->Format;
srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
srvDesc.Texture2D.MipLevels = 1;

mDevice->CreateShaderResourceView(mTexture.Get(), &srvDesc, mCbvSrvDescriptorHeap->GetCPUDescriptorHandleForHeapStart());

This code creates an SRV for a 2D texture and specifies its format and the GPU virtual address. The last argument to CreateShaderResourceView is a handle into a descriptor heap that was allocated before calling this method. Descriptors are generally stored in descriptor heaps, detailed in the next section.

Note: It is also possible to pass some types of descriptors to the GPU through driver-versioned memory called root parameters. More on this later.

Descriptor Heaps

A descriptor heap can be thought of as one memory allocation for a number of descriptors. Different types of descriptor heaps can contain one or several types of descriptors. Here are the types currently supported:

typedef enum D3D12_DESCRIPTOR_HEAP_TYPE
{
    D3D12_CBV_SRV_UAV_DESCRIPTOR_HEAP = 0,
    D3D12_SAMPLER_DESCRIPTOR_HEAP = (D3D12_CBV_SRV_UAV_DESCRIPTOR_HEAP + 1),
    D3D12_RTV_DESCRIPTOR_HEAP = (D3D12_SAMPLER_DESCRIPTOR_HEAP + 1),
    D3D12_DSV_DESCRIPTOR_HEAP = (D3D12_RTV_DESCRIPTOR_HEAP + 1),
    D3D12_NUM_DESCRIPTOR_HEAP_TYPES = (D3D12_DSV_DESCRIPTOR_HEAP + 1)
} D3D12_DESCRIPTOR_HEAP_TYPE;

There is a descriptor heap type for CBVs, SRVs, and UAVs. There are also types that deal with render target view (RTV) and depth stencil view (DSV).

The following code creates a descriptor heap for nine descriptors—each one can be a CBV, SRV, or UAV:

// create shader resource view and constant buffer view descriptor heap
D3D12_DESCRIPTOR_HEAP_DESC descHeapCbvSrv = {};
descHeapCbvSrv.NumDescriptors = 9;
descHeapCbvSrv.Type = D3D12_CBV_SRV_UAV_DESCRIPTOR_HEAP;
descHeapCbvSrv.Flags = D3D12_DESCRIPTOR_HEAP_SHADER_VISIBLE;
ThrowIfFailed(mDevice->CreateDescriptorHeap(&descHeapCbvSrv, __uuidof(ID3D12DescriptorHeap), (void**)&mCbvSrvDescriptorHeap));

The first two entries in the descriptor heap description are the number of descriptors and the type of descriptors that are allowed in this descriptor heap. The third entry, D3D12_DESCRIPTOR_HEAP_SHADER_VISIBLE, marks this descriptor heap as visible to shaders. Descriptor heaps that are not visible to shaders can be used, for example, for staging descriptors on the CPU or for RTVs, which are not selectable from within shaders.
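
The examples in this article only ever write to the first descriptor of a heap. To address any of the other eight descriptors in a nine-entry heap, the start handle has to be advanced by a multiple of the per-descriptor size, which depends on the hardware. The sketch below models that arithmetic on the CPU only; `CpuHandleModel` and `descriptorAt` are hypothetical names for this illustration, and the stride of 32 bytes in the usage note is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a CPU descriptor handle: just an address.
struct CpuHandleModel {
    std::size_t ptr;
};

// Returns the handle of descriptor `index`, given the heap start handle and
// the per-descriptor stride in bytes (hardware-specific in a real driver).
inline CpuHandleModel descriptorAt(CpuHandleModel heapStart,
                                   std::uint32_t index,
                                   std::uint32_t incrementSize) {
    return CpuHandleModel{heapStart.ptr +
                          static_cast<std::size_t>(index) * incrementSize};
}
```

With a heap starting at address 0x1000 and an assumed stride of 32 bytes, `descriptorAt(CpuHandleModel{0x1000}, 8, 32)` yields the handle of the ninth and last descriptor.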

Although this code sets the flag that makes the descriptor heap visible to a shader, there is one more level of indirection. A shader can “see” a descriptor heap through a descriptor table (there are also root descriptors that do not use tables; more on this later).

Descriptor Tables

The primary goal with a descriptor heap is to allocate as much memory as necessary to store all the descriptors for as much rendering as possible, perhaps a frame or more.

Note: Switching descriptor heaps might—depending on the underlying hardware—result in flushing the GPU pipeline. Therefore switching descriptor heaps should be minimized or paired with other operations that would flush the graphics pipeline anyway.

A descriptor table offsets into the descriptor heap. Instead of forcing the graphics pipeline to always view the entire heap, switching descriptor tables is an inexpensive way to change a set of resources a given shader uses. This way the shader does not have to understand where to find resources in heap space.

In other words, an application can utilize several descriptor tables that index the same descriptor heap for different shaders as shown in Figure 2:

Figure 2. Different shaders index into the descriptor heap with different descriptor tables
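
The relationship between one heap and several tables can be modeled on the CPU as an array plus (offset, count) windows into it. This is only an illustrative sketch: `DescriptorTableModel` and `resolve` are hypothetical names, and strings stand in for the opaque descriptors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model: the heap is an array of descriptors (strings here),
// and a table is an (offset, count) window into that array.
struct DescriptorTableModel {
    std::uint32_t offset;
    std::uint32_t count;
};

// Resolves a table-relative slot to the descriptor in the heap.
// Returns nullptr when the slot lies outside the table or the heap.
inline const std::string* resolve(const std::vector<std::string>& heap,
                                  DescriptorTableModel table,
                                  std::uint32_t slot) {
    if (slot >= table.count) return nullptr;
    const std::size_t index = static_cast<std::size_t>(table.offset) + slot;
    if (index >= heap.size()) return nullptr;
    return &heap[index];
}
```

Two tables, for example `{0, 2}` and `{2, 2}`, can then expose different halves of the same four-entry heap to different shaders, and switching between them changes only a small offset rather than the heap itself.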

Descriptor tables for an SRV and a sampler are created in the following code snippet with visibility for a pixel shader.

// define descriptor tables for a SRV and a sampler for pixel shaders
D3D12_DESCRIPTOR_RANGE descRange[2];
descRange[0].Init(D3D12_DESCRIPTOR_RANGE_SRV, 1, 0);
descRange[1].Init(D3D12_DESCRIPTOR_RANGE_SAMPLER, 1, 0);

D3D12_ROOT_PARAMETER rootParameters[2];
rootParameters[0].InitAsDescriptorTable(1, &descRange[0], D3D12_SHADER_VISIBILITY_PIXEL);
rootParameters[1].InitAsDescriptorTable(1, &descRange[1], D3D12_SHADER_VISIBILITY_PIXEL);

The visibility of the descriptor table is restricted to the pixel shader by providing the D3D12_SHADER_VISIBILITY_PIXEL flag. The following enum defines different levels of visibility of a descriptor table:

typedef enum D3D12_SHADER_VISIBILITY
{
 D3D12_SHADER_VISIBILITY_ALL	= 0,
 D3D12_SHADER_VISIBILITY_VERTEX	= 1,
 D3D12_SHADER_VISIBILITY_HULL	= 2,
 D3D12_SHADER_VISIBILITY_DOMAIN	= 3,
 D3D12_SHADER_VISIBILITY_GEOMETRY	= 4,
 D3D12_SHADER_VISIBILITY_PIXEL	= 5
} D3D12_SHADER_VISIBILITY;

Providing a flag that sets the visibility to all will broadcast the arguments to all shader stages, although it is only set once.

A shader can locate a resource through descriptor tables, but the descriptor tables need to be made known to this shader first as a root parameter in a root signature.

Root Signature and Parameters

A root signature stores root parameters that are used by shaders to locate the resources they need access to. These parameters exist as a binding space on a command list for the collection of resources the application needs to make available to shaders.

The root arguments can be:

Descriptor tables: As described above, they hold an offset into the descriptor heap plus the number of descriptors.
Root descriptors: Only a small number of descriptors can be stored directly in a root parameter. This saves the application the effort of putting those descriptors into a descriptor heap and removes an indirection.
Root constants: These are constants provided directly to the shaders without having to go through root descriptors or descriptor tables.

To achieve optimal performance, applications should generally sort the layout of the root parameters in decreasing order of change frequency.
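
As a small illustration of that ordering rule, the sketch below sorts a planned parameter list so that the most frequently changing parameter comes first. `RootParamPlan` and its `changesPerFrame` estimate are hypothetical bookkeeping for this example, not part of any API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical planning record for one root parameter.
struct RootParamPlan {
    std::string name;
    int changesPerFrame;  // application's own estimate
};

// Orders parameters by decreasing change frequency. stable_sort keeps the
// original relative order of parameters with equal estimates.
inline void sortByChangeFrequency(std::vector<RootParamPlan>& params) {
    std::stable_sort(params.begin(), params.end(),
                     [](const RootParamPlan& a, const RootParamPlan& b) {
                         return a.changesPerFrame > b.changesPerFrame;
                     });
}
```

The resulting order would then determine the slot indices used when the root parameter array is filled in.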

All the root parameters like descriptor tables, root descriptors, and root constants are baked into a command list, and the driver versions them on behalf of the application. In other words, whenever any of the root parameters change between draw or dispatch calls, the hardware will update the version number of the root signature. Every draw / dispatch call gets a unique full set of root parameter states when any argument changes.

Root descriptors and root constants decrease the level of GPU indirection when accessed, while descriptor tables allow accessing a larger amount of data but incur the cost of the increased level of indirection. Because of the higher level of indirection, with descriptor tables the application can initialize content up until it submits the command list for execution. Additionally, shader model 5.1, which is supported by all DirectX 12 hardware, lets shaders dynamically index into any given descriptor table. So a shader can select which descriptor it wants out of a descriptor table at shader execution time. An application could just create one large descriptor table and always use indexing (via something like a material ID) to get the desired descriptor.
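
One possible way to lay out such a large table is to give every material a fixed-size run of descriptors, so that a material ID maps to a table index with a single multiply-add. The layout and the names below (`kDescriptorsPerMaterial`, `descriptorIndex`) are assumptions made for this sketch; the same arithmetic would be mirrored in the shader.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: every material owns a fixed-size run of descriptors
// inside one large table (for example albedo, normal, roughness).
constexpr std::uint32_t kDescriptorsPerMaterial = 3;

// Index, relative to the start of the table, of one texture of one material.
constexpr std::uint32_t descriptorIndex(std::uint32_t materialId,
                                        std::uint32_t textureSlot) {
    return materialId * kDescriptorsPerMaterial + textureSlot;
}

static_assert(descriptorIndex(0, 0) == 0, "first material starts the table");
static_assert(descriptorIndex(2, 1) == 7, "material 2, slot 1");
```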

Different hardware architectures will show different performance tradeoffs between using large sets of root constants and root descriptors versus using descriptor tables. Therefore it will be necessary to tune the ratio between root parameters and descriptor tables depending on the target hardware platforms.

A perfectly reasonable outcome for an application might be a combination of all types of bindings: root constants, root descriptors, descriptor tables for descriptors gathered on-the-fly as draw calls are issued, and dynamic indexing of large descriptor tables.

The following code stores the two descriptor tables mentioned above as root parameters in a root signature.

// define descriptor tables for a SRV and a sampler for pixel shaders
D3D12_DESCRIPTOR_RANGE descRange[2];
descRange[0].Init(D3D12_DESCRIPTOR_RANGE_SRV, 1, 0);
descRange[1].Init(D3D12_DESCRIPTOR_RANGE_SAMPLER, 1, 0);

D3D12_ROOT_PARAMETER rootParameters[2];
rootParameters[0].InitAsDescriptorTable(1, &descRange[0], D3D12_SHADER_VISIBILITY_PIXEL);
rootParameters[1].InitAsDescriptorTable(1, &descRange[1], D3D12_SHADER_VISIBILITY_PIXEL);

// store the descriptor tables in the root signature
D3D12_ROOT_SIGNATURE descRootSignature;
descRootSignature.Init(2, rootParameters, 0);

ComPtr<ID3DBlob> pOutBlob;
ComPtr<ID3DBlob> pErrorBlob;
ThrowIfFailed(D3D12SerializeRootSignature(&descRootSignature,
              D3D_ROOT_SIGNATURE_V1, pOutBlob.GetAddressOf(),
              pErrorBlob.GetAddressOf()));

ThrowIfFailed(mDevice->CreateRootSignature(pOutBlob->GetBufferPointer(),
              pOutBlob->GetBufferSize(), __uuidof(ID3D12RootSignature),
             (void**)&mRootSignature));

All shaders in a PSO need to be compatible with the root signature specified with this PSO; otherwise, the PSO won’t be created.

A root signature needs to be set on a command list or bundle. This is done by calling:

commandList->SetGraphicsRootSignature(mRootSignature);

After setting the root signature, the set of bindings needs to be defined. In the example above this would be done with the following code:

// set the two descriptor tables to index into the descriptor heap
// for the SRV and the sampler
commandList->SetGraphicsRootDescriptorTable(0,
               mCbvSrvDescriptorHeap->GetGPUDescriptorHandleForHeapStart());
commandList->SetGraphicsRootDescriptorTable(1,
               mSamplerDescriptorHeap->GetGPUDescriptorHandleForHeapStart());

The application must set the appropriate parameters in each of the two slots in the root signature before issuing a draw call or a dispatch call. In this example, the first slot now holds a descriptor table that indexes into the descriptor heap to an SRV descriptor, and the second slot now holds a descriptor table that indexes into the descriptor heap to a sampler descriptor.

An application can change, for example, the binding on the second slot between draw calls. That means it only has to bind the second slot for the second draw call.
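
The persistence of root arguments across draw calls can be pictured with a small CPU-only model in which each draw captures the current contents of all slots. `CommandListModel` is a hypothetical class written for this illustration and does not correspond to the real command list interface; the handle values are arbitrary numbers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a command list with two root slots.
class CommandListModel {
public:
    // Binds a descriptor table (represented by a plain number) to a slot.
    void setRootTable(std::size_t slot, std::uint64_t gpuHandle) {
        slots_.at(slot) = gpuHandle;
        ++bindCalls_;
    }

    // Each draw sees the root arguments as they are when it is recorded.
    void draw() { draws_.push_back(slots_); }

    const std::vector<std::array<std::uint64_t, 2>>& draws() const {
        return draws_;
    }
    int bindCalls() const { return bindCalls_; }

private:
    std::array<std::uint64_t, 2> slots_{};
    std::vector<std::array<std::uint64_t, 2>> draws_;
    int bindCalls_ = 0;
};
```

Binding both slots, drawing, rebinding only slot 1, and drawing again takes three bind calls in total; the second draw still sees the original binding in slot 0.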

Putting It All Together

The large source code snippet below shows all the mechanisms used to bind resources. This application only uses one texture, and this code provides a sampler and an SRV for this texture:

// define descriptor tables for a SRV and a sampler for pixel shaders
D3D12_DESCRIPTOR_RANGE descRange[2];
descRange[0].Init(D3D12_DESCRIPTOR_RANGE_SRV, 1, 0);
descRange[1].Init(D3D12_DESCRIPTOR_RANGE_SAMPLER, 1, 0);

D3D12_ROOT_PARAMETER rootParameters[2];
rootParameters[0].InitAsDescriptorTable(1, &descRange[0], D3D12_SHADER_VISIBILITY_PIXEL);
rootParameters[1].InitAsDescriptorTable(1, &descRange[1], D3D12_SHADER_VISIBILITY_PIXEL);

// store the descriptor tables in the root signature
D3D12_ROOT_SIGNATURE descRootSignature;
descRootSignature.Init(2, rootParameters, 0);

ComPtr<ID3DBlob> pOutBlob;
ComPtr<ID3DBlob> pErrorBlob;
ThrowIfFailed(D3D12SerializeRootSignature(&descRootSignature,
              D3D_ROOT_SIGNATURE_V1, pOutBlob.GetAddressOf(),
              pErrorBlob.GetAddressOf()));

ThrowIfFailed(mDevice->CreateRootSignature(pOutBlob->GetBufferPointer(),
              pOutBlob->GetBufferSize(), __uuidof(ID3D12RootSignature),
             (void**)&mRootSignature));



// create descriptor heap for shader resource view
D3D12_DESCRIPTOR_HEAP_DESC descHeapCbvSrv = {};
descHeapCbvSrv.NumDescriptors = 1; // for SRV
descHeapCbvSrv.Type = D3D12_CBV_SRV_UAV_DESCRIPTOR_HEAP;
descHeapCbvSrv.Flags = D3D12_DESCRIPTOR_HEAP_SHADER_VISIBLE;
ThrowIfFailed(mDevice->CreateDescriptorHeap(&descHeapCbvSrv, __uuidof(ID3D12DescriptorHeap), (void**)&mCbvSrvDescriptorHeap));

// create sampler descriptor heap
D3D12_DESCRIPTOR_HEAP_DESC descHeapSampler = {};
descHeapSampler.NumDescriptors = 1;
descHeapSampler.Type = D3D12_SAMPLER_DESCRIPTOR_HEAP;
descHeapSampler.Flags = D3D12_DESCRIPTOR_HEAP_SHADER_VISIBLE;
ThrowIfFailed(mDevice->CreateDescriptorHeap(&descHeapSampler, __uuidof(ID3D12DescriptorHeap), (void**)&mSamplerDescriptorHeap));

// skip the code that uploads the texture data into heap

// create sampler descriptor in the sampler descriptor heap
D3D12_SAMPLER_DESC samplerDesc;
ZeroMemory(&samplerDesc, sizeof(D3D12_SAMPLER_DESC));
samplerDesc.Filter = D3D12_FILTER_MIN_MAG_MIP_LINEAR;
samplerDesc.AddressU = D3D12_TEXTURE_ADDRESS_WRAP;
samplerDesc.AddressV = D3D12_TEXTURE_ADDRESS_WRAP;
samplerDesc.AddressW = D3D12_TEXTURE_ADDRESS_WRAP;
samplerDesc.MinLOD = 0;
samplerDesc.MaxLOD = D3D12_FLOAT32_MAX;
samplerDesc.MipLODBias = 0.0f;
samplerDesc.MaxAnisotropy = 1;
samplerDesc.ComparisonFunc = D3D12_COMPARISON_ALWAYS;
mDevice->CreateSampler(&samplerDesc,
           mSamplerDescriptorHeap->GetCPUDescriptorHandleForHeapStart());

// create SRV descriptor in the SRV descriptor heap
D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc;
ZeroMemory(&srvDesc, sizeof(D3D12_SHADER_RESOURCE_VIEW_DESC));
srvDesc.Format = SampleAssets::Textures->Format;
srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
srvDesc.Texture2D.MipLevels = 1;
mDevice->CreateShaderResourceView(mTexture.Get(), &srvDesc,
            mCbvSrvDescriptorHeap->GetCPUDescriptorHandleForHeapStart());


// writing into the command list
// set the root signature
commandList->SetGraphicsRootSignature(mRootSignature);

// other commands here ...

// set the two descriptor tables to index into the descriptor heap
// for the SRV and the sampler
commandList->SetGraphicsRootDescriptorTable(0,
               mCbvSrvDescriptorHeap->GetGPUDescriptorHandleForHeapStart());
commandList->SetGraphicsRootDescriptorTable(1,
               mSamplerDescriptorHeap->GetGPUDescriptorHandleForHeapStart());

Static Samplers

You have now seen how to create a sampler using a descriptor heap and a descriptor table. There is another way to use samplers in applications: because many applications only need a fixed set of samplers, it is possible to use static samplers as a root argument.

Currently, the root signature looks like this:

typedef struct D3D12_ROOT_SIGNATURE
{
    UINT NumParameters;
    const D3D12_ROOT_PARAMETER* pParameters;
    UINT NumStaticSamplers;
    const D3D12_STATIC_SAMPLER* pStaticSamplers;
    D3D12_ROOT_SIGNATURE_FLAGS Flags;

    // Initialize struct
    void Init(
        UINT numParameters,
        const D3D12_ROOT_PARAMETER* _pParameters,
        UINT numStaticSamplers = 0,
        const D3D12_STATIC_SAMPLER* _pStaticSamplers = NULL,
        D3D12_ROOT_SIGNATURE_FLAGS flags = D3D12_ROOT_SIGNATURE_NONE)
    {
        NumParameters = numParameters;
        pParameters = _pParameters;
        NumStaticSamplers = numStaticSamplers;
        pStaticSamplers = _pStaticSamplers;
        Flags = flags;
    }

    D3D12_ROOT_SIGNATURE() { Init(0,NULL,0,NULL,D3D12_ROOT_SIGNATURE_NONE);}

    D3D12_ROOT_SIGNATURE(
        UINT numParameters,
        const D3D12_ROOT_PARAMETER* _pParameters,
        UINT numStaticSamplers = 0,
        const D3D12_STATIC_SAMPLER* _pStaticSamplers = NULL,
        D3D12_ROOT_SIGNATURE_FLAGS flags = D3D12_ROOT_SIGNATURE_NONE)
    {
        Init(numParameters, _pParameters, numStaticSamplers, _pStaticSamplers, flags);
    }
} D3D12_ROOT_SIGNATURE;

A set of static samplers can be defined independently of the root parameters in a root signature. As mentioned above, root parameters define a binding space where arguments can be provided at run time, whereas static samplers are by definition unchanging.

Since root signatures can be authored in HLSL, static samplers can be authored in HLSL as well. For now, an application can have a maximum of 2032 unique static samplers. This is 16 fewer than the next power of two (2048), which allows drivers to reserve some of the slots for internal use.

The static samplers defined in a root signature are independent of samplers an application chooses to put in a descriptor heap, so both mechanisms can be used at the same time.

If the selection of samplers is truly dynamic and unknown at shader compile time, an application should manage samplers in a descriptor heap.

Conclusion

DirectX 12 offers full control over resource usage patterns. The application developer is responsible for allocating memory in descriptor heaps, describing the resources in descriptors, and letting the shader “index” into descriptor heaps via descriptor tables that are made “known” to the shader via root signatures.

Furthermore, root signatures can be used to define a custom parameter space for shaders using any combination of four options:

root constants
static samplers
root descriptors
descriptor tables

In the end, the challenge is to pick the most desirable form of binding for the types of resources and their frequency of update.

About the Author

Wolfgang is the CEO of Confetti. Confetti is a think-tank for advanced real-time graphics research and a service provider for the video game and movie industry. Before co-founding Confetti, Wolfgang worked as the lead graphics programmer in Rockstar's core technology group RAGE for more than four years. He is the founder and editor of the ShaderX and GPU Pro book series, a Microsoft MVP, the author of several books and articles on real-time rendering, and a regular contributor to websites and the GDC. One of the books he edited, ShaderX4, won the Game Developer Front Line Award in 2006. Wolfgang is on many advisory boards throughout the industry; one of them is Microsoft’s Graphics Advisory Board for DirectX 12. He is an active contributor to several future standards that drive the game industry. You can find him on Twitter at: wolfgangengel. Confetti's website is www.conffx.com.

Acknowledgement

I would like to thank Chas Boyd and Amar Patel for their proofreading and feedback.

References and Related Links

Microsoft DirectX blog: http://blogs.msdn.com/b/directx/
DirectX 12 on Twitter: @DirectX12 https://twitter.com/DirectX12
Direct3D* 12 - Console API Efficiency & Performance on PCs (https://software.intel.com/en-us/articles/console-api-efficiency-performance-on-pcs)

** Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Further Vectorization Features of the Intel Compiler - Webinar Code Samples

April 6, 2015, 3:36 pm

The code samples for the webinar "Further Vectorization Features of the Intel Compiler" given on 4/7/2015 are attached below.

Here are some examples of command lines that may be used to build them. This list is not intended to be complete or to be a tutorial as such, just a guide to things to try. It uses Linux* switch syntax; Windows* equivalents are closely similar.

See the presentation for more detail; slides and video will be posted separately and later.

icpc -c -qopt-report-phase=vec -qopt-report=3 no_stl.cpp

icpc -c -qopt-report-file=stderr -qopt-report-phase=vec -qopt-report=2 stl_vec.cpp

icpc -c -std=c++11 -qopt-report-phase=vec -qopt-report=3 stl_vec_11.cpp

icpc -O2 -std=c++11 -I$BOOST/boost_1_56 sumsq.cpp timer.cpp; ./a.out (you may need to install boost)

icpc -O2 -std=c++11 -I$BOOST/boost_1_56 -DALIGN_CLASS sumsq.cpp timer.cpp; ./a.out

icpc -O2 -std=c++11 -I$BOOST/boost_1_56 -DBOOST_ALIGN sumsq.cpp timer.cpp; ./a.out

icpc -O2 -xcore-avx2 -std=c++11 -I$BOOST/boost_1_56 sumsq.cpp timer.cpp; ./a.out

icpc -c -std=c++11 -xcore-avx2 -qopt-report-file=stderr -qopt-report-phase=loop,vec -qopt-report3 -qopt-report-routine=main sumsq.cpp

ifort -fpp -c -qopt-report-phase=loop,vec -qopt-report-file=stderr dist.F90

ifort -fpp -c -qopenmp-simd -qopt-report-phase=loop,vec -qopt-report-file=stderr -qopt-report-routine=dist dist.F90

ifort -fpp -qopenmp-simd -DKNOWN_TRIP_COUNT -qopt-report-phase=loop,vec -qopt-report-file=stderr -qopt-report-routine=dist drive_dist.F90 dist.F90; ./a.out

icc -c -std=c99 -xavx -qopt-report-file=stderr -qopt-report-phase=vec mulmv.c

icc -c -std=c99 -xavx -fargument-noalias -qopt-report=4 -qopt-report-file=stderr -qopt-report-phase=vec mulmv.c

(-fargument-noalias becomes /Qalias-args- on Windows)
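
The webinar sources are not reproduced here, so the following is only a guess at the kind of loop where an aliasing switch matters; it is not the actual mulmv.c. When a function accumulates into an output array through a pointer, the compiler usually has to assume that the output may overlap the inputs, which can prevent vectorization. A switch such as -fargument-noalias makes that promise for all arguments; the `__restrict` qualifier used below (a compiler extension in C++, not standard) makes it per pointer in the source.

```cpp
#include <cassert>
#include <cstddef>

// Computes c = A * b for a row-major rows x cols matrix.
// __restrict is a widely supported compiler extension that promises the
// three arrays do not overlap; the caller is responsible for keeping it.
inline void mulmvSketch(std::size_t rows, std::size_t cols,
                        const double* __restrict a,
                        const double* __restrict b,
                        double* __restrict c) {
    for (std::size_t i = 0; i < rows; ++i) {
        c[i] = 0.0;
        for (std::size_t j = 0; j < cols; ++j) {
            // Accumulating directly into c[i] is what makes aliasing
            // relevant: without the promise, each store to c[i] might
            // change a[] or b[].
            c[i] += a[i * cols + j] * b[j];
        }
    }
}
```

Comparing the vectorization reports with and without the qualifier (or the switch) is one way to see whether assumed aliasing was the limiting factor for a given compiler.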

ifort -O2 -xavx -qopt-report=3 -qopt-report-file=stderr -qopt-report-phase=loop,vec -qopt-report-routine=indirect drive_indirect.F90 indirect.F90; ./a.out

ifort -xcore-avx2 -S indirect.F90

grep gather indirect.s

Efficient Order-Independent Transparency on Android* Using a Fragment Shader

April 7, 2015, 8:26 am

Introduction

This sample demonstrates the use of the GL_INTEL_fragment_shader_ordering extension, which is written against the OpenGL 4.4 core profile and the GLES 3.1 specifications. The minimum required OpenGL version is 4.2, or ARB_shader_image_load_store. The extension introduces a new built-in GLSL function, beginFragmentShaderOrderingINTEL(), which blocks execution of a fragment shader invocation until invocations from previous primitives that map to the same xy window coordinates have completed. The sample uses this behavior to provide solutions for order-independent transparency (OIT) in a typical real-time 3D scene.

Order-Independent Transparency

Transparency is a fundamental challenge in real-time rendering because of the difficulty of compositing an arbitrary number of transparent layers in the correct order. This sample builds on work originally described in the articles adaptive-transparency and multi-layer-alpha-blending (Marco Salvi, Jefferson Montgomery, Karthik Vaidyanathan, and Aaron Lefohn). These articles show how transparency can closely match the ground-truth results obtained from A-buffer compositing while being 5 to 40 times faster, thanks to various lossy compression techniques applied to the transparency data. This sample presents an algorithm based on these compression techniques that is suitable for inclusion in applications such as games.

The Transparency Challenge

An example of rendering the test scene with standard alpha blending is shown in Figure 1:

Figure 1: Example of order-independent transparency (OIT)

The geometry is rendered in a fixed order: the ground is followed by the objects inside the dome, then the dome, and finally the plants outside. Opaque objects are drawn first and update the depth buffer; the transparent objects are then drawn in the same order without updating the depth buffer. The enlarged image shows one of the resulting visual artifacts: the foliage is inside the dome but in front of several planes of glass. Unfortunately, the rendering order dictates that all glass planes, even those behind the foliage, are drawn on top of it. Updating the depth buffer from a transparent object creates a different set of problems. Traditionally, these could be addressed by splitting the object into several smaller pieces and sorting them front to back based on the camera position. Even then the result is not perfect, since objects can intersect, and the rendering cost rises as the number of sorted objects grows.

Figures 2 and 3 show the enlarged visual artifact: in Figure 2 all the glass planes are drawn in front of the foliage, and in Figure 3 they are correctly sorted.

Figure 2: Unsorted

Figure 3: Sorted

Real-Time Order-Independent Transparency

There have been many attempts to composite arbitrarily ordered geometric primitives without sorting on the CPU or splitting the geometry into non-intersecting pieces. Among them are depth peeling, which requires submitting the geometry multiple times, and A-buffer techniques, where all fragments associated with a given pixel are stored in a linked list, sorted, and then blended in the correct order. Despite the A-buffer's success in offline rendering, it is little used in real-time rendering because of its unbounded memory requirements and generally low performance.

A New Approach

Rather than taking the A-buffer approach of storing all color and depth data in per-pixel lists and then sorting and compositing them, the sample uses the research of Marco Salvi and restructures the alpha-blending equation to avoid recursion and sorting, producing a "visibility function" (Figure 4):

Figure 4: Visibility function

The number of steps in the visibility function corresponds to the number of nodes used to store visibility information per pixel while the scene is rendered. As pixels are added, they are stored in the node structure until it is full. When more pixels then arrive, the algorithm determines which of the existing nodes can be merged to produce the smallest change in the visibility function while keeping the size of the data set constant. The final stage is to evaluate the visibility function vis() and composite the fragments using the formula final_color = Σ(i = 0..n) c_i · α_i · vis(z_i), where c_i, α_i, and z_i are the color, alpha, and depth of fragment i.
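As a rough illustration of the compositing stage, the following C++ sketch evaluates an uncompressed visibility function for one pixel and accumulates the fragment contributions. It assumes that vis(z) is the product of the transmittances (1 - alpha) of all fragments nearer than z and that the composite is the sum of c_i · α_i · vis(z_i); the Fragment struct and the function names are hypothetical and are not taken from the sample, which additionally compresses the function into a fixed number of nodes.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical per-pixel fragment record (not the sample's ATSPNode).
struct Fragment {
    float depth;                 // view-space depth, smaller means nearer
    float alpha;                 // opacity in [0, 1]
    std::array<float, 3> color;  // RGB
};

// vis(z): product of (1 - alpha) over all fragments strictly nearer than z.
float visibility(const std::vector<Fragment>& frags, float z) {
    float vis = 1.0f;
    for (const Fragment& f : frags) {
        if (f.depth < z) vis *= 1.0f - f.alpha;
    }
    return vis;
}

// Sum of color * alpha * vis(depth); the submission order does not matter.
std::array<float, 3> compositeTransparent(const std::vector<Fragment>& frags) {
    std::array<float, 3> out{0.0f, 0.0f, 0.0f};
    for (const Fragment& f : frags) {
        assert(f.alpha >= 0.0f && f.alpha <= 1.0f);
        const float weight = f.alpha * visibility(frags, f.depth);
        for (int c = 0; c < 3; ++c) out[c] += weight * f.color[c];
    }
    return out;
}
```

Because each fragment is weighted by the visibility at its own depth, submitting the same fragments in a different order gives the same sum, which is what removes the need for sorting.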

The sample renders the scene in the following stages:

Clear the Shader Storage Buffer Object to its default values.
Render all opaque geometry into the main framebuffer, updating the depth buffer.
Render all transparent geometry without updating the depth buffer; the final fragment data is discarded from the framebuffer. The fragment data is stored in a set of nodes inside the Shader Storage Buffer Object.
Resolve the data in the Shader Storage Buffer Object and blend the final result into the main framebuffer.

Figure 5:

A priori, the cost of reading the Shader Storage Buffer Object in the resolve stage can be very high because of the bandwidth required. As an optimization, the sample uses the stencil buffer to mask the areas where transparent pixels would be blended into the framebuffer. This changes the rendering as shown in Figure 6.

Clear the stencil buffer.
Clear the Shader Storage Buffer Object to default values on the first pass.
Set the following stencil operation:
1. glDisable(GL_STENCIL_TEST);
Render all opaque geometry into the main framebuffer, updating depth.
Set the following stencil operations:
1. glEnable(GL_STENCIL_TEST);
2. glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
3. glStencilFunc(GL_ALWAYS, 1, 0xFF);
Render all transparent geometry without updating the depth buffer; the final fragment data is written to the main framebuffer with an alpha value of 0. The stencil buffer is marked for every fragment written to the framebuffer. The fragment data is stored in a set of nodes inside the Shader Storage Buffer Object. The fragment cannot be discarded, because that would prevent the stencil update.
Set the following stencil operations:
1. glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
2. glStencilFunc(GL_EQUAL, 1, 0xFF);
Resolve the data in the Shader Storage Buffer Object only for fragments that pass the stencil test, and blend the final result into the main framebuffer.
Set the following stencil operations:
1. glStencilFunc(GL_ALWAYS, 1, 0xFF);
2. glDisable(GL_STENCIL_TEST);

Figure 6: Stencil Render Path

The benefit of the stencil buffer shows in the cost of the resolve stage, which drops by 80%, although this depends heavily on the percentage of the screen covered by transparent geometry. The larger the area covered by transparent objects, the smaller the performance gain.

void PSOIT_InsertFragment_NoSync( float surfaceDepth, vec4 surfaceColor )
{
	ATSPNode nodeArray[AOIT_NODE_COUNT];

	// Load AOIT data
	PSOIT_LoadDataUAV(nodeArray);

	// Update AOIT data
	PSOIT_InsertFragment(surfaceDepth,
		1.0f - surfaceColor.w,  // transmittance = 1 - alpha
		surfaceColor.xyz,
		nodeArray);
	// Store AOIT data
	PSOIT_StoreDataUAV(nodeArray);
}

Figure 7: GLSL Shader Storage Buffer Code

The algorithm presented above can be implemented on any device that supports Shader Storage Buffer Objects. However, it has one very significant drawback: multiple fragments that map to the same window xy coordinates can be in flight at the same time.

If multiple fragments execute at the same window xy coordinates at the same time, they all read the same initial data in PSOIT_LoadDataUAV but produce different values, which they then try to store in PSOIT_StoreDataUAV, and the last one to finish overwrites all the others that were processed. The effect is that the result of the compression routine can vary from frame to frame. You can observe it in the sample by disabling Pixel Sync: a slight flicker should be visible where transparent surfaces overlap. A zoom function is provided to make this easier to see. The more fragments the GPU can execute in parallel, the more likely the flicker is to appear.
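The lost update described above can be reproduced deterministically on the CPU. The sketch below is plain C++, not shader code: it models two fragments of the same pixel that both load the per-pixel state before either stores its result, and compares that with a serialized load/modify/store sequence. PixelState, insertFragment, and the other names are hypothetical, and the single transmittance value stands in for the sample's node array.

```cpp
#include <cassert>

// Hypothetical per-pixel state: total transmittance of the stored fragments.
struct PixelState {
    float transmittance = 1.0f;
};

// Read-modify-write step applied by one fragment to the copy it loaded.
PixelState insertFragment(PixelState loaded, float alpha) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    loaded.transmittance *= 1.0f - alpha;
    return loaded;
}

// Unordered: both fragments load the same initial data; the last store wins.
PixelState overlappedUpdate(PixelState pixel, float alphaA, float alphaB) {
    const PixelState loadedByA = pixel;         // fragment A loads
    const PixelState loadedByB = pixel;         // fragment B loads before A stores
    pixel = insertFragment(loadedByA, alphaA);  // A stores
    pixel = insertFragment(loadedByB, alphaB);  // B stores; A's update is lost
    return pixel;
}

// Ordered: B starts only after A's store is visible.
PixelState orderedUpdate(PixelState pixel, float alphaA, float alphaB) {
    pixel = insertFragment(pixel, alphaA);
    pixel = insertFragment(pixel, alphaB);
    return pixel;
}
```

With two fragments of alpha 0.5, the overlapped path leaves a transmittance of 0.5 (only the last fragment is recorded), while the ordered path leaves 0.25. On a GPU, which fragment stores last is not fixed, which is why the visible result can change between frames.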

By default the sample avoids this problem by using a new GLSL built-in function, beginFragmentShaderOrderingINTEL(), which can be used when the hardware reports the GL_INTEL_fragment_shader_ordering extension string. beginFragmentShaderOrderingINTEL() blocks execution of the fragment shader until all shader invocations from previous primitives that map to the same window xy coordinates have completed. All memory operations from previous fragment shader invocations mapped to the same xy coordinates are visible to the current invocation when the function returns. This makes it possible to merge previous fragments into the visibility function in a deterministic manner. beginFragmentShaderOrderingINTEL() has no effect on shader execution for fragments with non-overlapping xy coordinates.

An example of how to call beginFragmentShaderOrderingINTEL() is shown in Figure 8.

GLSL code example
-----------------

layout(binding = 0, rgba8) uniform image2D image;

vec4 main()
{
    ... compute output color
    if (color.w > 0)        // potential non-uniform control flow
    {
        beginFragmentShaderOrderingINTEL();
        ... read/modify/write image         // ordered access guaranteed
    }
    ... no ordering guarantees (as varying branch might not be taken)

    beginFragmentShaderOrderingINTEL();

    ... update image again                  // ordered access guaranteed
}

Figure 8: beginFragmentShaderOrderingINTEL

Note that there is no built-in function to signal the end of the region that should be ordered. Instead, the region that is logically ordered extends to the end of the fragment shader's execution.

In the OIT sample, the call is simply added as shown in Figure 9:

void PSOIT_InsertFragment( float surfaceDepth, vec4 surfaceColor )
{
    // from now on serialize all UAV accesses (with respect to other fragments shaded in flight which map to the same pixel)
#ifdef do_fso
    beginFragmentShaderOrderingINTEL();
#endif
    PSOIT_InsertFragment_NoSync( surfaceDepth, surfaceColor );
}

Figure 9: Adding fragment ordering to the Shader Storage Buffer access

This function is called from any fragment shader that can potentially write transparent fragments, as shown in Figure 10.

out vec4 fragColor;

void main( )
{
    vec4 result = vec4(0,0,0,1);

    // Alpha-related computation
    float alpha = ALPHA().x;
    result.a =  alpha;
    vec3 normal = normalize(outNormal);

    // Specular-related computation
    vec3 eyeDirection  = normalize(outWorldPosition - EyePosition.xyz);
    vec3 Reflection    = reflect( eyeDirection, normal );
    float  shadowAmount = 1.0;

    // Ambient-related computation
    vec3 ambient = AmbientColor.rgb * AMBIENT().rgb;
    result.xyz +=  ambient;
    vec3 lightDirection = -LightDirection.xyz;

    // Diffuse-related computation
    float  nDotL = max( 0.0 ,dot( normal.xyz, lightDirection.xyz ) );
    vec3 diffuse = LightColor.rgb * nDotL * shadowAmount  * DIFFUSE().rgb;
    result.xyz += diffuse;
    float  rDotL = max(0.0,dot( Reflection.xyz, lightDirection.xyz ));
    vec3 specular = pow(rDotL,  8.0 ) * SPECULAR().rgb * LightColor.rgb;
    result.xyz += specular;
    fragColor =  result;

#ifdef dopoit
    if(fragColor.a > 0.01)
    {
        PSOIT_InsertFragment( outPositionView.z, fragColor );
        fragColor = vec4(1.0,1.0,0.0,0.0);
    }
#endif
}

Figure 10: A typical fragment shader

Only fragments with an alpha value above a threshold are added to the Shader Storage Buffer Object, which culls any fragments that would contribute no meaningful data to the scene.

Building the Sample

Build Requirements

Install the latest Android* SDK and NDK.

Add the NDK and SDK to your path:

export PATH=$ANDROID_NDK/:$ANDROID_SDK/tools/:$PATH

To build:

Go to the OIT_2014\OIT_Android* folder
You may need to initialize the project, once only: android update project --path . --target android-19
Build the NDK component: ndk-build
Build the APK:
ant debug
Install the APK:
adb install -r bin\NativeActivity-debug.apk or ant installd
Run it

Conclusion

The sample demonstrates how the adaptive order-independent transparency research of Marco Salvi, Jefferson Montgomery, Karthik Vaidyanathan, and Aaron Lefohn, originally carried out on high-end discrete graphics cards using DirectX 11, can be implemented in real time on an Android tablet using GLES 3.1 and fragment shader ordering. The algorithm runs within a fixed memory footprint that can be varied according to the required visual fidelity. Optimizations such as the stencil buffer allow the technique to run on a wide range of devices at an acceptable level of performance, providing a practical solution to one of the most pressing problems in real-time rendering. The principles shown in the OIT sample can be applied to a range of other algorithms that would normally create per-pixel linked lists, including volumetric shadowing techniques and post-process anti-aliasing.

Related Articles

https://www.opengl.org/registry/specs/INTEL/fragment_shader_ordering.txt
https://software.intel.com/ru-ru/articles/adaptive-transparency
https://software.intel.com/ru-ru/articles/multi-layer-alpha-blending


Threading Intel® Integrated Performance Primitives Image Resize with Intel® Threading Building Blocks

April 8, 2015, 3:23 am

Introduction

The Intel® Integrated Performance Primitives (Intel® IPP) library provides a wide variety of vectorized signal and image processing functions. Intel® Threading Building Blocks (Intel® TBB) adds simple but powerful abstractions for expressing parallelism in C++ programs. This article presents a starting point for using these tools together to combine the benefits of vectorization and threading to resize images.

From Intel® IPP 8.2 onwards, the multi-threaded (internally threaded) libraries are deprecated because of issues with performance and interoperability with other threading models, but they remain available for legacy applications. Multithreaded programming is now mainstream, and there is a rich ecosystem of threading tools such as Intel® TBB. In most cases, handling threading at the application level (that is, external to and above the primitives) offers many advantages. Many applications already have their own threading model, and application-level threading gives developers the greatest flexibility and control. With a little extra effort to add threading to an application, it is possible to meet or exceed internal threading performance, and this opens the door to more advanced optimization techniques such as reusing local cache data for multiple operations. This is the main reason for deprecating internal threading in the latest releases.

Getting started with parallel_for

Intel® TBB’s parallel_for offers an easy way to get started with parallelism, and it is one of the most commonly used parts of Intel® TBB. It suits any for() loop in an application where each iteration can be done independently and the order of execution doesn’t matter. In these scenarios, parallel_for takes care of most details, such as setting up a thread pool and a scheduler. You supply the partitioning scheme and the code to run on separate threads or cores. More sophisticated approaches are possible. However, the goal of this article and sample code is to provide a simple starting point, not the best possible threading configuration for every situation.

Intel® TBB’s parallel_for takes 2 or 3 arguments.

parallel_for ( range, body, optional partitioner )

The range, for this simplified line-based partitioning, is specified by:

blocked_range<int>(begin, end, grainsize)

This provides information to each thread about which lines of the image it is processing. It will automatically partition a range from begin to end in grainsize chunks. For Intel® TBB the grainsize is automatically adjusted when ranges don't partition evenly, so it is easy to accommodate arbitrary sizes.
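To make the role of grainsize concrete, the sketch below mimics in plain C++ how a range of image rows could be split recursively in half until no piece is larger than grainsize. The halving rule is an assumption made for illustration; the code does not call Intel® TBB, and the chunks actually handed to threads depend on the partitioner in use. RowRange, splitRowsInto, and splitRows are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical half-open range of image rows [begin, end).
struct RowRange {
    int begin;
    int end;
    int size() const { return end - begin; }
};

// Recursively halve the range while it is larger than grainsize.
void splitRowsInto(RowRange r, int grainsize, std::vector<RowRange>& out) {
    if (r.size() <= grainsize) {
        out.push_back(r);  // small enough: becomes one work item
        return;
    }
    const int middle = r.begin + r.size() / 2;
    splitRowsInto(RowRange{r.begin, middle}, grainsize, out);
    splitRowsInto(RowRange{middle, r.end}, grainsize, out);
}

// Convenience wrapper returning the chunks in row order.
std::vector<RowRange> splitRows(int begin, int end, int grainsize) {
    assert(grainsize >= 1 && begin <= end);
    std::vector<RowRange> out;
    splitRowsInto(RowRange{begin, end}, grainsize, out);
    return out;
}
```

Under this rule, 100 rows with a grainsize of 32 end up as four chunks of 25 rows each rather than three chunks of 32 and one of 4, which illustrates why the chunk size seen by the body is best read from range.size() instead of being assumed equal to grainsize.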

The body is the section of code to be parallelized. This can be implemented separately (including as part of a class); though for simple cases it is often convenient to use a lambda expression. With the lambda approach the entire function body is part of the parallel_for call. Variables to pass to this anonymous function are listed in brackets [alg, pSrc, pDst, stridesrc_8u, …] and range information is passed via blocked_range<int>& range.

This is a general threading abstraction which can be applied to a wide variety of problems. There are many examples elsewhere showing parallel_for with simple loops such as array operations. Tailoring for resize follows the same pattern.

External Parallelization for Intel® IPP Resize

A threaded resize can be split into tiles of any shape. However, it is convenient to use groups of rows where the tiles are the width of the image.

Each thread can query range.begin(), range.size(), etc. to determine offsets into the image buffer. Note: this starting point implementation assumes that the entire image is available within a single buffer in memory.

The image resize functions introduced in Intel® IPP 7.1 and used in later versions take a new approach that has several advantages:

IppiResizeSpec holds precalculated coefficients based on the input/output resolution combination, so multiple resizes can be completed without recomputing them.
Separate functions for each interpolation method.
Significantly smaller executable size footprint with static linking.
Improved support for threading and tiled image processing.
For more information, refer to the article Resize Changes in Intel® IPP 7.1.

Before starting resize, the offsets (number of bytes to add to the source and destination pointers to calculate where each thread’s region starts) must be calculated. Intel® IPP provides a convenient function for this purpose:

ippiResizeGetSrcOffset

This function calculates the corresponding offset/location in the source image for a location in the destination image. In this case, the destination offset is the beginning of the thread’s blocked range.

After this function it is easy to calculate the source and destination addresses for each thread’s current work unit:

pSrcT=pSrc+(srcOffset.y*stridesrc_8u);
pDstT=pDst+(dstOffset.y*stridedst_8u);

These are plugged into the resize function, like this:

ippiResizeLanczos_8u_C1R(pSrcT, stridesrc_8u, pDstT, stridedst_8u, dstOffset, dstSizeT, ippBorderRepl, 0, pSpec, localBuffer);

This specifies how each thread works on a subset of lines of the image. Instead of using the beginning of the source and destination buffers, pSrcT and pDstT provide the starting points of the regions each thread is working with. The height of each thread's region is passed to resize via dstSizeT. Of course, in the special case of 1 thread these values are the same as for a nonthreaded implementation.
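The pointer arithmetic can be checked in isolation. The sketch below is plain C++ and does not call Intel® IPP: it takes a row offset as an input (in the real code the source offset comes from ippiResizeGetSrcOffset) and returns the start of a thread's region for 8-bit data, assuming the stride is the distance in bytes between consecutive rows. The helper name rowStart is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Start of row `rowOffset` in an 8-bit image whose rows are `strideBytes` apart.
// strideBytes may be larger than width * channels when rows are padded.
const std::uint8_t* rowStart(const std::uint8_t* base, int rowOffset, int strideBytes) {
    assert(base != nullptr && rowOffset >= 0 && strideBytes > 0);
    return base + static_cast<std::ptrdiff_t>(rowOffset) * strideBytes;
}
```

Widening the row offset to std::ptrdiff_t before multiplying is a defensive choice for large images; the expression in the article multiplies two int values directly.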

Another difference to call out is that since each thread is doing its own resize simultaneously the same working buffer cannot be used for all threads. For simplicity the working buffer is allocated within the lambda function with scalable_aligned_malloc, though further efficiency could be gained by pre-allocating a buffer for each thread.

The following code snippet demonstrates how to set up resize within a parallel_for lambda function, and how the concepts described above could be implemented together.

parallel_for( blocked_range<int>( 0, pnminfo_dst.imgsize.height, grainsize ),
            [pSrc, pDst, stridesrc_8u, stridedst_8u, pnminfo_src,
            pnminfo_dst, bufSize, pSpec]( const blocked_range<int>& range )
        {
            Ipp8u *pSrcT,*pDstT;
            IppiPoint srcOffset = {0, 0};
            IppiPoint dstOffset = {0, 0};

            // resized region is the full width of the image,
            // The height is set by TBB via range.size()
            IppiSize  dstSizeT = {pnminfo_dst.imgsize.width,(int)range.size()};

            // set up working buffer for this thread's resize
            Ipp32s localBufSize=0;
            ippiResizeGetBufferSize_8u( pSpec, dstSizeT,
                pnminfo_dst.nChannels, &localBufSize );

            Ipp8u *localBuffer =
                (Ipp8u*)scalable_aligned_malloc( localBufSize*sizeof(Ipp8u), 32);

            // given the destination offset, calculate the offset in the source image
            dstOffset.y=range.begin();
            ippiResizeGetSrcOffset_8u(pSpec,dstOffset,&srcOffset);

            // pointers to the starting points within the buffers that this thread
            // will read from/write to
            pSrcT=pSrc+(srcOffset.y*stridesrc_8u);
            pDstT=pDst+(dstOffset.y*stridedst_8u);


            // do the resize for greyscale or color
            switch (pnminfo_dst.nChannels)
            {
            case 1: ippiResizeLanczos_8u_C1R(pSrcT,stridesrc_8u,pDstT,stridedst_8u,
                        dstOffset,dstSizeT,ippBorderRepl, 0, pSpec,localBuffer); break;
            case 3: ippiResizeLanczos_8u_C3R(pSrcT,stridesrc_8u,pDstT,stridedst_8u,
                        dstOffset,dstSizeT,ippBorderRepl, 0, pSpec,localBuffer); break;
            default:break; //only 1 and 3 channel images
            }

            scalable_aligned_free((void*) localBuffer);
        });

As you can see, a threaded implementation can be quite similar to single threaded. The main difference is simply that the image is partitioned by Intel® TBB to work across several threads, and each thread is responsible for groups of image lines. This is a relatively straightforward way to divide the task of resizing an image across multiple cores or threads.

Conclusion

Intel® IPP provides a suite of SIMD-optimized functions. Intel® TBB provides a simple but powerful way to handle threading in Intel® IPP applications. Using them together allows access to great vectorized performance on each core as well as efficient partitioning to multiple cores. The deeper level of control available with external threading enables more efficient processing and better performance.

Example code: As with other Intel® IPP sample code, by downloading you accept the End User License Agreement.

Attached file: tbb-resize-simple.cpp (14.8 KB)

Last updated: Tuesday, April 7, 2015

Last modified by: Naveen Gv (Intel)

Co-authors: Naveen Gv (Intel)

Intel® IPP - Threading / OpenMP* FAQ

April 8, 2015, 4:03 am

In Intel® IPP 8.2 and later versions, the multi-threaded (internally threaded) libraries are deprecated because of issues with performance and interoperability with other threading models, but they remain available for legacy applications. Multi-threaded static and dynamic libraries are available as a separate download to support legacy applications. For new application development, it is highly recommended to use the single-threaded versions with application-level threading (as shown in the picture below).

An installation of Intel® IPP 8.2 or later has the single-threaded libraries in the following directory structure:

<ipp directory>lib/ia32 - Single-threaded static and dynamic libraries for the IA-32 architecture

<ipp directory>lib/intel64 - Single-threaded static and dynamic libraries for the Intel® 64 architecture

Static linking (both single-threaded and multi-threaded libraries)

Windows* OS: mt suffix in the library name (ipp<domain>mt.lib)
Linux* OS and OS X*: no suffix in the library name (libipp<domain>.a)

Dynamic linking: default (no suffix)

Windows* OS: ipp<domain>.lib
Linux* OS: libipp<domain>.so
OS X*: libipp<domain>.dylib

Q: Does Intel® IPP support external multi-threading? Is it thread safe?

Answer: Yes, Intel® IPP supports external threading, as shown in the picture below. You can use different threading models such as Intel® TBB, Intel® Cilk™ Plus, Windows* threads, OpenMP*, or POSIX* threads. All Intel® Integrated Performance Primitives functions are thread-safe.

Q: How do I get the Intel® IPP threaded libraries?

Answer: While installing Intel® IPP, choose the ‘custom’ installation option. You then get the option to select threaded libraries for the different architectures.

To select the right package of threaded libraries, right-click it and enable the ‘Install’ option.

After you select the threaded libraries, the selection is highlighted with a mark, and the disk space required for the threaded libraries is shown.

Threading in Intel® IPP 8.1 and earlier versions

Threading, within the deprecated multi-threaded add-on packages of the Intel® IPP library, is accomplished by use of the Intel® OpenMP* library. Intel® IPP 8.0 continues the process of deprecating threading inside Intel IPP functions that was started in version 7.1. Though not installed by default, the threaded libraries can be installed so code written with these libraries will still work as before. However, moving to external threading is recommended.

Q: How can I determine the number of threads that Intel IPP creates?
Answer: Use the function ippGetNumThreads to find the number of threads created by Intel IPP.

Q: How do I control the number of threads that Intel IPP creates?
Answer: Call the function ippSetNumThreads to set the number of threads created.

Q: Is it possible to prevent Intel IPP from creating threads?
Answer: Yes. If you are calling Intel IPP functions from multiple threads, it is recommended to turn Intel IPP threading off. There are three ways to disable multi-threading:

Link to the non-threaded static libraries
Build and link to a custom DLL using the non-threaded static libraries
Call ippSetNumThreads(1)

Q: When my application calls Intel IPP functions from a separate thread, the application hangs; how do I resolve this?

Answer: This issue occurs because the threading technology used in your application and the one used in Intel IPP (OpenMP threading) are incompatible. The ippSetNumThreads function was developed so that threading can be disabled in the dynamic libraries. See also the sections above for other ways to prevent Intel IPP functions from creating threads.

Q: Which Intel IPP functions contain OpenMP* code?

Answer: The "ThreadedFunctionsList.txt" file in the ‘doc’ folder of the product installation directory provides a detailed list of the threaded functions in the Intel IPP library. The list is updated in each release.

Please let us know if you have any feedback on deprecations via the feedback URL

Related links:

OpenMP support changes in Intel IPP 6.0 and Intel MKL 10.0

OMP Abort: Initializing libguide40.dll but found libiomp5md.dll already initialized

XCode link error: "file not found: libiomp5.dylib"

OpenMP static library has been deprecated since Intel® IPP 7.0

Last updated: Tuesday, April 7, 2015

What is "Standard Manageability?"

April 8, 2015, 2:50 pm

If you have an Intel AMT system and cannot figure out why you are getting error messages when trying to make API calls related to certain Intel AMT features, you may have a system that supports Intel Standard Manageability.

You can use WS-Management commands to detect your platform's capabilities. See the Discovery Sample in the Intel AMT SDK located at <SDKRoot>\Windows\Intel_AMT\Samples\Discovery for an example. Also, messages displayed by these platforms in the MEBx and the WebUI indicate Intel Standard Manageability instead of Intel AMT.

So what is Standard Manageability? Standard Manageability systems come with a subset of the available Intel AMT features. Standard Manageability systems are not branded as Intel vPro. These systems are upgradeable to the full Intel AMT capabilities.

The Standard Manageability SKU was introduced with Intel AMT Release 5.0. The following Intel AMT features are NOT included or supported on platforms with this SKU:

Remote Access, including Fast Call for Help and Remote Scheduled Maintenance
Local user-initiated call for help
Microsoft Network Access Protection (NAP)
Wireless connections to the Manageability Engine
Access Monitor
Alarm Clock (out-of-band & in-band from release 9.0)
KVM

Note: From Intel AMT Release 6.0, the Access Monitor and Microsoft Network Access Protection (NAP) are supported in this SKU.

You can detect the SKU of a platform by looking for the following CIM objects:

On an Intel AMT SKU, there will be:

An instance of CIM_RegisteredProfile with a RegisteredName equal to “Intel(r) AMT” and an InstanceID equal to “Intel(r) ME:Intel(r) AMT”
An instance of CIM_ComputerSystem with an ElementName of “Intel(r) AMT Subsystem”

On a Standard Manageability SKU, there will be:

An instance of CIM_RegisteredProfile with a RegisteredName equal to “Intel(r) Std. Mgt.” and an InstanceID equal to “Intel(r) ME:Intel(r) Std. Mgt.”
An instance of CIM_ComputerSystem with an ElementName of “Intel(r) Std. Mgt. Subsystem”
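The check above can be expressed as a small classification function. The sketch assumes that the RegisteredName of the CIM_RegisteredProfile instance has already been retrieved over WS-Management; the retrieval itself is not shown, and the enum and function names are hypothetical rather than part of the Intel AMT SDK.

```cpp
#include <string>

enum class ManageabilitySku { IntelAmt, StandardManageability, Unknown };

// Classify the platform from the RegisteredName of its CIM_RegisteredProfile.
ManageabilitySku classifySku(const std::string& registeredName) {
    if (registeredName == "Intel(r) AMT") {
        return ManageabilitySku::IntelAmt;
    }
    if (registeredName == "Intel(r) Std. Mgt.") {
        return ManageabilitySku::StandardManageability;
    }
    return ManageabilitySku::Unknown;  // neither profile string matched
}
```

Returning an explicit Unknown value, rather than defaulting to one of the two SKUs, keeps the caller from enabling features on a platform that reported neither profile.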

For more information, please see the Support for Other Intel Platforms section of the Intel AMT Implementation and Reference Guide.


Intel® RealSense™: Blockhead Code Sample

April 9, 2015, 5:14 am

Abstract

This code sample demonstrates the use of the Intel® RealSense™ SDK for Windows* in a C#/WPF desktop application. The sample application, called BlockHead, uses three interesting features of the Intel RealSense SDK:

It captures and displays the color image from the RGB camera.
It retrieves face location and head pose estimation data.
It retrieves and evaluates facial expression data.

(Note: A user-facing 3D camera is required for the full functionality of this application.)

Watch a short video about BlockHead here.

Introducing the BlockHead Application

As shown in Figure 1, the application displays the color data stream in a WPF Image control and superimposes a cartoon image on the user's face in real time.

Figure 1. Cartoon image superimposed on the user's face

The cartoon image is generated programmatically in real time from data received from the SDK.

The image is scaled to match the user's face (it shrinks as the user moves away from the camera and grows as the user moves closer), based on the face rectangle information.
The image tilts left and right according to the user's head pose (roll).
The content of the image changes based on the retrieved and evaluated facial expression data (see Figure 2).

Figure 2. Smile, tongue out, kiss, and open mouth detected in real time

Details

For this simple demonstration app, the graphics were created in a graphics editor and saved as PNG files. They could be replaced with high-quality images with various levels of transparency, photos of friends, caricatures, and so on, for more interesting visual effects.

Various transforms (for example, ScaleTransform and RotateTransform) are applied to the image object to position it according to the head data from the Intel RealSense SDK. This data includes the face location, the head pose, and the expression recognition data.
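The arithmetic behind such transforms can be sketched independently of WPF. The sample itself is written in C#; the C++ below is only an illustration under stated assumptions: the overlay is scaled by the ratio of the face rectangle width to the overlay image width, centered on the face rectangle, and rotated by the roll angle unchanged. FaceRect, OverlayTransform, and overlayFor are hypothetical names, not SDK or WPF types.

```cpp
#include <cassert>

// Hypothetical face rectangle in color-image pixels.
struct FaceRect {
    float x, y, w, h;
};

// Hypothetical result: what a scale and a rotate transform would be fed with.
struct OverlayTransform {
    float scale;         // uniform scale factor for the overlay image
    float centerX;       // point the overlay is centered on and rotated around
    float centerY;
    float angleDegrees;  // rotation taken from the head roll
};

OverlayTransform overlayFor(const FaceRect& face, float overlayWidth, float rollDegrees) {
    assert(overlayWidth > 0.0f);
    OverlayTransform t;
    t.scale = face.w / overlayWidth;     // smaller face, smaller overlay
    t.centerX = face.x + face.w * 0.5f;  // center of the face rectangle
    t.centerY = face.y + face.h * 0.5f;
    t.angleDegrees = rollDegrees;        // sign convention is an assumption
    return t;
}
```

For a 200-pixel-wide face rectangle and a 400-pixel-wide overlay, the scale is 0.5, so the overlay shrinks as the face rectangle shrinks when the user moves away from the camera.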

The SDK can capture about 20 different facial expressions, which can then be evaluated in an application. This application focuses on expressions with distinct mouth shapes: EXPRESSION_KISS, EXPRESSION_MOUTH_OPEN, EXPRESSION_SMILE, and EXPRESSION_TONGUE_OUT. It can easily be extended to use head, eye, and eyebrow information to determine the facial expression as well.

Check It Out

To learn more about this app, study the code, and extend it with more interesting features based on the Intel RealSense SDK, download the package here.

About Intel® RealSense™ Technology

To get started and learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/ru-ru/realsense/intel-realsense-sdk-for-windows.

About the Author

Bryan Brown is a software applications engineer in the Developer Relations Division at Intel. His professional experience includes a mix of software, electronic, and systems design engineering. His technical interests focus on natural interaction and brain-computer interface technologies, and he actively participates in several development programs for emerging technologies in these areas.

Intel® RealSense™ Technology

Windows*

SDK Intel® RealSense™

1.0. Хорошо забытое старое. 3

↧

Эффективность и Производительность Консольных API на ПК

April 9, 2015, 6:21 am

Latest and popular articles on Intel Technologies

≫ Next: Introducing Intel® Atom™ x3 (Code-Named “SoFIA”) SoC Processor Series

≪ Previous: Intel® RealSense™ — образец кода Blockhead

Общие сведения о Direct3D* 12
By: Michael Coppock

Загрузить PDF

Аннотация
Microsoft Direct3D* 12 — важный шаг в развитии технологий игр на ПК. В новой версии разработчики получают более мощные средства контроля над своими играми, могут эффективнее использовать ресурсы ЦП.

Contents

Introduction
1.0 Blast from the Past
1.1 Closer to the Metal
2.0 Pipeline State Object
3.0 Resource Binding
3.1 Resource Hazards
3.2 Resource Residency Management
3.3 State Mirroring
4.0 Heaps and Tables
4.1 Redundant Resource Binding
4.2 Descriptors
4.3 Heaps
4.4 Tables
4.5 Bindless and Efficient
4.6 Render Context Review
5.0 Bundles
5.1 Redundant Render Commands
5.2 What Are Bundles?
5.3 Code Efficiency
6.0 Command Lists
6.1 Command Creation Parallelism
6.2 Lists and the Queue
6.3 Command Queue Flow
7.0 Dynamic Heaps
8.0 CPU Parallelism
Links and References
Notices and Disclaimers

Introduction

At GDC 2014, Microsoft announced big news for the entire PC gaming market in 2015: a new version of Direct3D, version 12. With D3D 12, its creators return to low-level programming, which gives game developers more control and many interesting new capabilities. The D3D 12 development team is working to reduce CPU overhead and improve scalability across CPU cores. The goal is to bring console-API efficiency and performance to the PC so that games can make better use of CPU and GPU resources. In PC gaming, a single CPU thread often does most, if not all, of the work. The other threads handle only the operating system and other system tasks. Very few truly multithreaded PC games exist. Microsoft wants to change that with D3D 12, which is a superset of the D3D 11 rendering functionality. This means that modern GPUs can run D3D 12, and it will make better use of today's multicore CPUs and GPUs. You do not need to buy a new GPU to benefit from D3D 12. The future of gaming on Intel® processor-based PCs is bright indeed.

1.0 Blast from the Past

Low-level programming is common in the console industry because the specifications and design of each console model are fixed. Game developers can take time to fine-tune their games and squeeze all the performance they can out of the Xbox One* or PlayStation* 4. The PC, on the other hand, is by nature a flexible platform with countless variations and options. Planning a new PC game involves a great many factors. High-level APIs such as OpenGL* and Direct3D* help simplify development. These APIs do the heavy lifting so that developers can focus on the game itself. The problem is that the APIs, and to a lesser extent the drivers, have become so complex that they can add to the cost of rendering a frame and reduce performance. This is where low-level programming comes in.

The first era of low-level programming on the PC ended with the demise of MS-DOS*. Vendor-specific APIs took its place. After 3dfx's Glide* came APIs such as Direct3D. The PC lost performance and gained much-needed flexibility and convenience in return. The hardware market became extremely complex, with a huge variety of available hardware. Development time increased because developers naturally wanted everyone to be able to play their games. Things changed not only in software but also in hardware: CPU power efficiency became more important than raw performance. Rather than chasing higher clock frequencies, the focus is now on multiple CPU cores and threads and on parallel rendering on modern GPUs. It is time to bring some of the techniques used in console games to PC gaming. It is time to use all available cores and threads more efficiently and more fully. In short, it is time for PC gaming to enter the 21st century.

1.1 Closer to the Metal

To bring games closer to the metal, the size and complexity of the API and driver must be reduced. There should be fewer layers between the hardware and the game itself. Today the API and driver spend too much time translating commands and calls. Some, or even many, of these routines will be handed back to the game developers. The reduced overhead in D3D 12 will improve performance, and with fewer layers of processing between the game and the GPU hardware, games will run faster and look better. Of course, there is another side to the coin: some developers may not want to take on areas that the API previously controlled, such as programming GPU memory management. This is perhaps where game engine developers come in, but only time will tell. Since the release of D3D 12 is still some way off, there is plenty of time to think about it. So how will all these attractive promises be fulfilled? Mainly through new features: pipeline state objects, command lists, bundles, and heaps.

2.0 Pipeline State Object

Before discussing the pipeline state object (PSO), let us first review the D3D 11 render context and then move on to the changes in D3D 12. Figure 1 shows the D3D 11 render context as presented by Max McMullen, D3D Development Lead, at BUILD 2014 in April 2014.

Render Context: Direct3D 11

Figure 1. D3D 11 render context [Reprinted with permission of Microsoft Corporation.]

The large bold arrows indicate the individual pipeline states. Each state can be retrieved or set according to the needs of the game. The other state at the bottom is fixed-function state such as the viewport or scissor rect. The other relevant components of this diagram are explained in later sections of this article. For the PSO discussion, only the left side of the diagram matters. D3D 11 reduced CPU overhead compared with D3D 9 through its small state objects, but the driver still had to do extra work to take these small state objects and combine them into GPU code at render time. Let us call this problem hardware mismatch overhead. Now look at another diagram from BUILD 2014, shown in Figure 2.

Direct3D 11: Pipeline State Overhead

Small state objects → hardware mismatch overhead

Figure 2. In the D3D 11 pipeline, small state objects often cause hardware mismatch overhead

The left side shows a D3D 9 style pipeline: this is what the application uses to do its work. The hardware, shown on the right side of the diagram in Figure 2, needs to be programmed. State 1 represents the shader code. State 2 is a combination of the rasterizer and the control flow linking the rasterizer to the shaders. State 3 is the linkage between blend and the pixel shader. The D3D vertex shader affects hardware states 1 and 2, the rasterizer affects state 2, the pixel shader affects states 1 through 3, and so on. Most drivers do not submit calls at the same time as the application. They prefer to record the calls and wait until the work is done so that they can determine what the application actually needs. This means additional CPU overhead, because old and out-of-date data is marked as "dirty." The driver's control flow checks the state of each object at draw time and programs the hardware to match the state the game has set. This extra work can drain resources and cause problems. Ideally, as soon as the game sets the pipeline state, the driver knows exactly what the game wants and programs the hardware right away. Figure 3 shows the D3D 12 pipeline, which does exactly that through what is called the pipeline state object (PSO).

Direct3D 12: Pipeline State Optimization

Group the pipeline into a single object

Copy from the PSO to hardware state

Figure 3. Pipeline state optimization in D3D 12 streamlines the process.

Figure 3 shows a streamlined process with less overhead. A single PSO with the state information for each shader can set all the hardware states in one copy. Recall that some states were marked as "other" in the D3D 11 render context. The D3D 12 team recognized the importance of keeping the PSO small and of letting the game change the render target without affecting a compiled PSO. Things like the viewport and scissor rect remain separate and are programmed outside the rest of the pipeline (Figure 4).

Render Context:
Pipeline State Object (PSO)

Figure 4. The left side shows the new D3D 12 PSO, with larger state for greater efficiency

Instead of setting and reading each state individually, we now have a single point, which removes the hardware mismatch overhead entirely. The application sets the PSO as needed, and the driver takes the API commands and translates them into GPU code without the extra overhead of control flow. This closer-to-the-metal approach means that processing render commands takes fewer cycles, so performance increases.
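The difference between the two models can be illustrated with a small, self-contained C++ sketch. It is not D3D code: `PipelineDesc`, `HardwareState`, `resolve`, and both context types are hypothetical names invented for this illustration, and the three hardware states only loosely follow Figure 2. It also assumes, as a simplification, that a D3D 11-style driver re-resolves its small state objects on every draw, whereas a PSO is resolved once when it is created.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of the pipeline pieces chosen by the application.
struct PipelineDesc {
    std::string vertexShader;
    std::string pixelShader;
    std::string rasterizer;
    std::string blend;
};

// Hypothetical hardware-facing state, loosely following the three states in Figure 2.
struct HardwareState {
    std::string state1;  // shader code
    std::string state2;  // rasterizer plus its linkage to the shaders
    std::string state3;  // linkage between blend and the pixel shader
};

// Counts how often the API-level state has to be translated into hardware state.
int g_resolveCount = 0;

HardwareState resolve(const PipelineDesc& d) {
    ++g_resolveCount;
    return { d.vertexShader + "+" + d.pixelShader,
             d.vertexShader + "+" + d.rasterizer + "+" + d.pixelShader,
             d.blend + "+" + d.pixelShader };
}

// D3D 11-style model (simplified): small states are set independently,
// so the driver resolves them at draw time.
struct FineGrainedContext {
    PipelineDesc current;
    HardwareState hw;
    void draw() { hw = resolve(current); }
};

// D3D 12-style model: the PSO is resolved once, at creation.
struct PipelineStateObject {
    HardwareState compiled;
    explicit PipelineStateObject(const PipelineDesc& d) : compiled(resolve(d)) {}
};

struct PsoContext {
    HardwareState hw;
    int draws = 0;
    // Binding a PSO is a plain copy of precomputed state.
    void setPipelineState(const PipelineStateObject& pso) { hw = pso.compiled; }
    void draw() { ++draws; }
};
```

With 100 draws, the fine-grained context calls `resolve` 100 times, while the PSO context calls it once, and both end up with the same hardware state.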

3.0 Resource Binding

Before looking at the changes to resource binding, let us recall the resource binding model used in D3D 11. Figure 5 shows the render context diagram again, with the D3D 12 pipeline state object on the left and the D3D 11 resource binding model on the right.

Render Context:
Pipeline State Object (PSO)

Figure 5. The D3D 12 pipeline state object on the left, the D3D 11 resource binding model on the right.

In Figure 5, the explicit bind points are to the right of each shader. An explicit binding model means that each stage of the pipeline has specific resources it can refer to. These bind points reference resources in GPU memory. They can be textures, render targets, unordered access views (UAVs), and so on. Resource bindings have been around for a long time; they even predate D3D. The goal is to handle many properties behind the scenes and help the game submit render commands efficiently. To do so, however, the system has to perform a great deal of binding analysis in three key areas. The following sections cover these areas and how D3D 12 optimizes them.

3.1 Resource Hazards

Hazards usually involve a transition, such as moving from a render target to a texture. A game may render a frame that is meant to become an environment map for the scene. The game finishes rendering the environment map and now wants to use it as a texture. During this process, both the runtime and the driver track every resource bound as a render target or as a texture. If the runtime or driver sees a resource bound as both a render target and a texture at the same time, it unbinds the older binding and keeps the newer one. This lets the game make the switch as needed while the software stack manages the switch behind the scenes. The driver must also flush the GPU pipeline before the render target can be used as a texture. Otherwise the pixels could be read before the GPU has retired them, and the result would be inconsistent. In essence, a hazard is anything that requires extra work on the GPU to ensure coherent data.

As with other improvements in D3D 12, the solution is to give the game more control. Why should the API and driver do all the work and tracking when the transition is just one point in the frame? The switch from one resource use to another occurs at a single point within a frame that lasts about 1/60 of a second. Overhead can be reduced by handing control to the game, so that the extra cost is paid only once, at the moment the game switches the resource (Figure 6).


D3D12_RESOURCE_BARRIER_DESC Desc;
Desc.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
Desc.Transition.pResource   = pRTTexture;
Desc.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
Desc.Transition.StateBefore = D3D12_RESOURCE_USAGE_RENDER_TARGET;
Desc.Transition.StateAfter  = D3D12_RESOURCE_USAGE_PIXEL_SHADER_RESOURCE;
pContext->ResourceBarrier( 1, &Desc );

Figure 6. Resource barrier API added in D3D 12

The resource barrier API shown in Figure 6 declares the resource, its source usage, and its target usage, followed by the call that tells the runtime and driver about the transition. The transition becomes explicit instead of being tracked throughout the frame render with a great deal of conditional logic, and it happens once per frame or as often as the game needs it.
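The idea of an explicit transition can be modeled without any graphics API. The following C++ sketch is an illustration only: `Usage`, `Texture`, `Barrier`, and `Context` are hypothetical stand-ins, not D3D 12 types, and the `flushes` counter merely represents the one-time cost the text describes. The resource carries its current usage, the game declares the before and after states, and using the resource in the wrong state is reported as an error instead of being silently fixed up by a driver.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical usage states, mirroring the two states used in Figure 6.
enum class Usage { RenderTarget, PixelShaderResource };

struct Texture {
    Usage state = Usage::RenderTarget;
};

// The game states explicitly what the resource was and what it becomes.
struct Barrier {
    Texture* resource;
    Usage before;
    Usage after;
};

struct Context {
    int flushes = 0;  // stand-in for the pipeline flush paid at the transition

    void resourceBarrier(const Barrier& b) {
        if (b.resource->state != b.before) {
            throw std::logic_error("declared 'before' state does not match the resource");
        }
        ++flushes;  // the cost is paid once, here, not on every draw
        b.resource->state = b.after;
    }

    // No hidden tracking: these calls only validate the declared state.
    void renderTo(const Texture& t) const {
        if (t.state != Usage::RenderTarget) throw std::logic_error("not a render target");
    }
    void sample(const Texture& t) const {
        if (t.state != Usage::PixelShaderResource) throw std::logic_error("not a shader resource");
    }
};
```

In this model a game would call `renderTo` on the environment map, issue one `resourceBarrier`, and then call `sample`. Sampling before the barrier throws, which is the kind of mistake the explicit model leaves to the game to avoid.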

3.2 Resource Residency Management

D3D 11 and earlier versions behave as though calls are not queued. The game assumes that the API executes each call immediately. That is not the case, however. The GPU has a command queue in which all commands are deferred and executed later. This allows greater parallelism and more efficient use of the GPU and CPU, but it requires reference counting and tracking. Counting and tracking consume a considerable amount of CPU effort.

To solve this, the game gets explicit control over resource lifetime. D3D 12 no longer hides the GPU queue. A Fence API has been added to track GPU progress. The game can check at a given point (for example, once per frame) which resources are no longer needed and then free the memory they occupy. It is no longer necessary to track resources throughout the frame render with extra logic in order to release resources and memory.
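A minimal sketch of the once-per-frame check described above, written in plain C++ with no graphics API. `Fence` and `DeferredReleaseQueue` are hypothetical names for this illustration; the real Fence API is not shown in this article, so the model assumes only that the GPU reports a monotonically increasing completed value and that resources are retired in non-decreasing fence order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a GPU fence: the last value the GPU has finished.
struct Fence {
    std::uint64_t completed = 0;
};

// Resources the game no longer needs, tagged with the fence value
// of the last submitted work that still uses them.
class DeferredReleaseQueue {
public:
    // Assumption: called with non-decreasing fence values.
    void retire(std::string resourceName, std::uint64_t lastUseFence) {
        pending_.emplace_back(lastUseFence, std::move(resourceName));
    }

    // Called at a point the game chooses, for example once per frame.
    // Returns the resources whose GPU work has completed.
    std::vector<std::string> collect(const Fence& fence) {
        std::vector<std::string> freed;
        while (!pending_.empty() && pending_.front().first <= fence.completed) {
            freed.push_back(std::move(pending_.front().second));
            pending_.pop_front();
        }
        return freed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<std::pair<std::uint64_t, std::string>> pending_;
};
```

The design choice here is that no per-call reference counting takes place: the only bookkeeping is one queue entry per retired resource and one comparison per entry when the game polls the fence.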

3.3 State Mirroring

After the areas described above were optimized, one more element was found that could provide a performance gain, albeit a smaller one. When a bind point is set, the runtime tracks that point so that the game can later call Get to find out what is bound to the pipeline. A mirror copy of the bind point is kept. This feature exists so that componentized software can discover the current state of the render context. Once resource binding was optimized, mirrored copies of the state were no longer needed. In addition to the control flow removed from the areas described above, the Get operations for state mirroring have also been removed.

4.0 Heaps and Tables

One more important change to resource binding remains to be covered. The complete D3D 12 render context is shown at the end of section 4. The new D3D 12 render context is the first step toward greater CPU efficiency in the API.

4.1 Redundant Resource Binding

After analyzing several games, the D3D team noticed that games typically use the same sequence of commands across many frames. Not only the commands but also the bindings stay exactly the same from frame to frame. The CPU generates a series of bindings (say, 12) to draw an object in a frame. Often the CPU has to generate those same 12 bindings again for the next frame. So why not cache those bindings? Game developers could be given a command that points to the cache so that the same bindings can be reused.

Section 3 discussed queues. When a call is made, the game assumes that the API executes it immediately. That is not the case, however. Commands are placed in a queue whose entire contents are deferred and executed later on the GPU. So if one of the 12 bindings mentioned earlier changes, the driver copies all 12 bindings to a new location, edits the copy, and then tells the GPU to start using the copied bindings. Usually most of the 12 bindings hold static values, and only a few dynamic values need updating. When the game needs to make a partial change to these bindings, all 12 are copied, which costs excessive CPU time for a small change.

4.2 Descriptors

What is a descriptor? Simply put, it is a piece of data that defines resource parameters. Essentially, it is what is behind a D3D 11 view object. There is no operating system lifetime management. It is just data in GPU memory. It contains type and format information, the mip count for textures, and a pointer to the pixel data. Descriptors are at the center of the new resource binding model.

Descriptor

Figure 7. A D3D 12 descriptor is a small piece of data that defines a resource parameter

4.3 Heaps

When a view is set in D3D 11, it copies the descriptor to the current location in GPU memory from which descriptors are read. If a new view is set at the same location, D3D 11 copies the descriptors to a new memory location and tells the GPU to read from that new location in the next draw call. In D3D 12, the game or application gets explicit control over creating descriptors, copying them, and so on.

Descriptor Heaps

Figure 8. A descriptor heap is an array of descriptors

Heaps (Figure 8) are simply one very large array of descriptors. Descriptors from previous draw calls and frames can be reused, and new ones can be created as needed. The layout is entirely under the game's control, so managing the heap incurs almost no overhead. The heap size depends on the GPU architecture. On older, low-power GPUs the size may be limited to 65K descriptors, while on more powerful GPUs the limit is set by the amount of memory. On less powerful GPUs, it is possible to exceed the heap size. D3D 12 therefore supports multiple heaps and switching from one descriptor heap to another. However, switching heaps causes a flush on some GPUs, so this feature is best used sparingly.

So how is shader code matched to specific descriptors or sets of descriptors? The answer is tables.

4.4 Tables

Tables hold a start index and a size within the heap. They are context points but not API objects. Each shader stage can have one or more tables as needed. For example, the vertex shader for a draw call might have a table pointing to the descriptors at offsets 20 through 32 in the heap. When work starts on the next draw call, the offsets may change to 32 through 40.

Descriptor Tables

Figure 9. Descriptor tables hold a start index and a size within the descriptor heap

On existing hardware, D3D 12 can handle multiple tables per shader stage in the PSO. One table can hold only the data that changes frequently between calls, while a second table holds static data that stays the same across several calls and several frames. This avoids copying descriptors from one call to the next. Older GPUs, however, are limited to one table per shader stage. Multiple tables are supported on current and future hardware.
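The relationship between a heap and its tables can be sketched as a plain array plus (start, count) windows. This is a conceptual C++ model, not the D3D 12 API: `Descriptor`, `DescriptorHeap`, and `DescriptorTable` are hypothetical types, and the descriptor payload is reduced to a resource name for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a descriptor; a real one would hold type, format,
// mip count, and a pointer to the pixel data.
struct Descriptor {
    std::string resource;
};

// A table is only a window into the heap: a start index and a size.
struct DescriptorTable {
    std::size_t start;
    std::size_t count;
};

// The heap is one large array whose layout the game controls.
class DescriptorHeap {
public:
    explicit DescriptorHeap(std::size_t capacity) : slots_(capacity) {}

    void write(std::size_t index, Descriptor d) {
        slots_.at(index) = std::move(d);
    }

    // Resolves entry i of a table, the way a shader stage would look it up.
    const Descriptor& at(const DescriptorTable& table, std::size_t i) const {
        assert(i < table.count);
        return slots_.at(table.start + i);
    }

private:
    std::vector<Descriptor> slots_;
};
```

A game following the two-table pattern would write its static descriptors once, keep one table pointing at them across draws and frames, and move only the small dynamic table from, say, `{20, 12}` for one draw to `{32, 8}` for the next. Those index ranges are one reading of the offsets mentioned in the text, not values taken from the API.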

4.5 Bindless and Efficient

Descriptor heaps and tables are the D3D team's take on bindless rendering, with the ability to scale across all PC hardware. D3D 12 supports everything from low-power system-on-chip devices to high-end discrete graphics cards. This unified approach gives game developers a variety of ways to manage bindings. The new model also supports multiple frequencies of update: cached tables of static bindings that can be reused, and dynamic tables for data that changes with every draw call. As a result, there is no longer any need to copy all the bindings for each new draw call.

4.6 Render Context Review

Figure 10 shows the render context with the D3D 12 changes discussed so far. It shows the new pipeline state object and the removal of the Get calls, but it still has the explicit D3D 11 bind points.

Render Context

Figure 10. The D3D 12 render context with the changes discussed so far.

Let us remove the last remnants of the D3D 11 render context and add descriptor tables and heaps. Now there is a table for each shader stage (or several tables, as shown for the pixel shader).

Render Context: Direct3D 12

Figure 11. The complete D3D 12 render context

The fine-grained state objects are gone, replaced by the pipeline state object. Hazard tracking and state mirroring have been removed. The explicit bind points have been replaced with memory objects managed by the application or game. The CPU is used more efficiently, overhead is lower, and control flow and logic have been removed from both the API and the driver.

5.0 Bundles

We have finished reviewing the new render context in D3D 12 and have seen how D3D 12 hands control back to the game, bringing it closer to the metal. But D3D 12 does more than that to make the API efficient. The API still has overhead that affects performance, and there are further ways to use the CPU more efficiently. What about command sequences? How many repeated sequences are there, and how can they be made more efficient?

5.1 Redundant Render Commands

Examining the render commands in each frame, the D3D team at Microsoft found that only 5 to 10 percent of the command sequences are added or removed from one frame to the next. The remaining command sequences are used across many frames. In other words, 90 to 95 percent of the command sequences the CPU generates are repeats!

How can this be made more efficient? And why has D3D not done it until now? At BUILD 2014, Max McMullen said: "It's very hard to build a way to record commands that is both universal and reliable. One that works the same way across different GPUs and different drivers, and that is also fast." The game needs every recorded command sequence to execute as fast as the individual commands would. What has changed? D3D. With the new pipeline state objects, descriptor heaps, and tables, the state required to record and play back commands is greatly simplified.

5.2 What Are Bundles?

Bundles are small command lists that are recorded once and can then be reused without restriction within a frame or across multiple frames. Bundles can be created on any thread and used any number of times. Bundles are not tied to PSO state. This means the descriptor tables can be updated, and when the bundle is run again with different bindings, the game gets a different result. As with formulas in an Excel* spreadsheet, the math is always the same and the result depends on the source data. Certain restrictions exist to ensure that the driver can implement bundles efficiently. One such restriction is that no command may change the render target. That still leaves many commands that can be recorded and played back.

Bundles

Figure 12. Bundles are frequently repeated commands that are recorded and played back as needed

The left side of Figure 12 is a sample render context, a series of commands generated on the CPU and passed to the GPU for execution. On the right are two bundles that contain recorded command sequences for reuse across threads. As the GPU executes the commands, it reaches a command to execute a bundle. It then plays back the recorded bundle. When finished, the GPU returns to the command sequence, continues, and finds the next execute-bundle command. The second bundle is then read and played back, after which execution continues.
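The record-once, replay-with-different-bindings behavior can be shown with a small C++ model. It is a conceptual sketch rather than D3D 12 code: `Bindings`, `Command`, `Bundle`, and `drawCommand` are hypothetical names, and the "GPU" is just a log of strings. The key point it illustrates is the spreadsheet analogy: the recorded commands stay fixed, and the result depends on the bindings in effect when the bundle is executed.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical binding state that lives outside the bundle.
struct Bindings {
    std::string constants;
};

// A recorded command reads the bindings that are active when it is replayed.
using Command = std::function<void(const Bindings&, std::vector<std::string>&)>;

class Bundle {
public:
    void record(Command c) {
        assert(!closed_);  // recording is only allowed before close()
        commands_.push_back(std::move(c));
    }
    void close() { closed_ = true; }

    // Replays the same recorded sequence against whatever bindings are passed in.
    void execute(const Bindings& bindings, std::vector<std::string>& gpuLog) const {
        assert(closed_);
        for (const Command& c : commands_) c(bindings, gpuLog);
    }

private:
    std::vector<Command> commands_;
    bool closed_ = false;
};

// Builds a draw command for a named mesh (the mesh name is illustrative).
inline Command drawCommand(std::string mesh) {
    return [mesh](const Bindings& b, std::vector<std::string>& gpuLog) {
        gpuLog.push_back("draw " + mesh + " with " + b.constants);
    };
}
```

Recording two draws, closing the bundle, and executing it first with one constant binding and then with another produces four log entries from only two recorded commands.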

5.3 Code Efficiency

We have looked at the control flow on the GPU. Now let us see how bundles simplify the code.

Example code without bundles

Here is a setup stage that sets the pipeline state and descriptor tables. It is followed by two object draws. Both use the same command sequence; only the constants differ. This pattern is typical of D3D 11 and earlier code, although the calls shown here use D3D 12-style names such as SetPipelineState.


// Setup
pContext->SetPipelineState(pPSO);
pContext->SetRenderTargetViewTable(0, 1, FALSE, 0);
pContext->SetVertexBufferTable(0, 1);
pContext->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);

Figure 14. Setup stage in typical D3D 11-style code


// Draw 1
pContext->SetConstantBufferViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pContext->DrawInstanced(6, 1, 0, 0);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pContext->DrawInstanced(6, 1, 6, 0);

Figure 15. Drawing in typical D3D 11-style code


// Draw 2
pContext->SetConstantBufferViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pContext->DrawInstanced(6, 1, 0, 0);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pContext->DrawInstanced(6, 1, 6, 0);

Figure 16. Drawing in typical D3D 11-style code

Example code with bundles

Now consider the same command sequence with bundles in D3D 12. The first call shown below creates a bundle. This can be done on any thread. The next stage records the command sequence. These are the same commands as in the previous example.


// Create the bundle
pDevice->CreateCommandList(D3D12_COMMAND_LIST_TYPE_BUNDLE, pBundleAllocator, pPSO, pDescriptorHeap, &pBundle);

Figure 17. Sample code that creates a bundle


// Record commands
pBundle->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
pBundle->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pBundle->DrawInstanced(6, 1, 0, 0);
pBundle->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pBundle->DrawInstanced(6, 1, 6, 0);
pBundle->Close();

Figure 18. Sample code that records a bundle

The code samples in Figures 17 and 18 achieve the same result as the code without bundles in Figures 14 through 16. It is easy to see that bundles substantially reduce the number of calls needed to perform the same task. The GPU still executes exactly the same commands and produces the same result, but far more efficiently.

6.0 Command Lists

You have seen how D3D 12 improves CPU efficiency and gives developers more capability through bundles, pipeline state objects, descriptor heaps, and tables. The PSO and descriptor model enables bundles, which in turn are used for common, frequently repeated commands. This simpler, closer-to-the-metal approach reduces overhead and makes better use of the CPU. Earlier we mentioned that in PC games a single thread does most, if not all, of the work, while the remaining threads handle other system tasks and OS processes. Getting a PC game to use multiple cores or threads efficiently is not easy. Making a game multithreaded often takes considerable resources and effort. The D3D team intends to change this with D3D 12.

6.1 Command Creation Parallelism

As mentioned several times above, deferred command execution is a model in which each command appears to execute immediately, but the commands are actually placed in a queue and executed later. This behavior remains in D3D 12, but it is now visible to the game. There is no immediate context, because everything is deferred. Threads can generate commands in parallel to build command lists, which are then submitted to an API object called the command queue. The GPU does not execute commands until they are submitted through the command queue. The queue determines the order of the commands, which are specified in the command lists. How do command lists differ from bundles? Command lists are structured and optimized so that multiple threads can generate commands at the same time. A command list is used once and then discarded from memory, and a new list is recorded in its place. Bundles are designed for repeated execution of frequently used rendering commands within one frame or across multiple frames.

D3D 11 attempted to parallelize command processing with a feature called deferred contexts. Because of the overhead, however, it did not meet its performance goals. Detailed analysis revealed many places with excessive serial processing, which led to poor distribution of work across CPU cores. Part of this serial overhead was removed in D3D 12 by the CPU efficiency features described in sections 2 through 5.

6.2 Lists and the Queue

Imagine two threads, each generating a list of render commands. One sequence must execute before the other. If hazards are present, one thread uses a resource as a texture while the other thread uses the same resource as a render target. The driver has to analyze resource usage at render time and resolve the hazards to keep the data coherent. Hazard tracking is one area of serialized overhead in D3D 11. In D3D 12, the game is responsible for hazard tracking, not the driver.

D3D 11 supports multiple deferred contexts, but using them comes at a cost. The driver tracks state per resource. So as soon as recording starts for a deferred context, the driver must allocate memory to track the state of every resource in use. That memory stays allocated while the deferred context is being generated. When it is done, the driver has to delete all the tracking objects from memory. This causes unnecessary overhead. In D3D 12, the game declares at the API level the maximum number of command lists that can be generated in parallel. The driver then arranges and pre-allocates all the tracking objects in a single coherent piece of memory.

Dynamic buffers (constant, vertex, and so on) are common in D3D 11, but behind the scenes there are multiple instances of memory tracking the discarded buffers. For example, two command lists may be generated in parallel, and Map with discard may be called. Once a list is submitted, the driver has to patch the second command list to correct the discarded buffer information. As with the hazard tracking example above, this causes significant overhead. In D3D 12, control over renaming is handed to the game, and the dynamic buffer is gone. The game has full control. It builds its own allocators and can subdivide a buffer as needed. Commands can then point to explicitly specified locations in memory.

As discussed in section 3.2, the runtime and driver track resource lifetime in D3D 11. This requires resource counting and tracking, and everything must be resolved at submit time. In D3D 12, the game manages resource lifetime and hazard handling, which removes the serial overhead and improves CPU efficiency. Parallel command generation is more efficient in D3D 12 thanks to the optimizations in the four areas described, which allows more of the CPU workload to run in parallel. In addition, the D3D team is building a new driver model, WDDM 2.0, and plans further optimizations to reduce the cost of submitting command lists.

6.3 Command Queue Flow

Figure 19. Command queue with two command lists generated in parallel and two bundles of recurring commands

Figure 19 shows the bundle diagram from section 5.2, but with multithreading. The command queue on the left is the sequence of events submitted to the GPU. The command lists are in the middle, and on the right are two bundles recorded before the scenario starts. Start with the command lists, which are generated in parallel for different parts of the scene: command list 1 finishes recording, is submitted to the command queue, and the GPU begins executing it. In parallel, the command queue control flow procedure starts and command list 2 is recorded on thread 2. While the GPU executes command list 1, thread 2 finishes generating command list 2 and submits it to the command queue. When the command queue finishes executing command list 1, it moves on serially to command list 2. The command queue is the serial order in which the GPU must execute commands. Although command list 2 was generated and submitted to the GPU before the GPU finished command list 1, it is executed only after command list 1 completes. D3D 12 offers more efficient parallelism across the whole process.
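The ordering rule described here can be sketched in plain C++. This is only a model of the flow, not the D3D 12 API: `CommandList`, `CommandQueue`, `recordList`, and `executeAll` are hypothetical names chosen for illustration. Each list is recorded into its own object with no shared state, which is what would let the two `recordList` calls run on separate threads; the sketch calls them one after the other to stay dependency-free.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a command list: just a sequence of command names.
using CommandList = std::vector<std::string>;

// Records draw commands for one part of the scene. It touches no shared
// state, so independent calls could be issued from different threads.
CommandList recordList(const std::string& name, int drawCount) {
    assert(drawCount >= 0);
    CommandList list;
    for (int i = 0; i < drawCount; ++i) {
        list.push_back(name + ":draw" + std::to_string(i));
    }
    return list;
}

// Hypothetical stand-in for the command queue: lists run strictly in the
// order in which they were submitted, one after the other.
class CommandQueue {
public:
    void submit(CommandList list) { pending_.push_back(std::move(list)); }

    // Models the GPU draining the queue; returns commands in execution order.
    std::vector<std::string> executeAll() {
        std::vector<std::string> executed;
        for (const CommandList& list : pending_) {
            executed.insert(executed.end(), list.begin(), list.end());
        }
        pending_.clear();
        return executed;
    }

private:
    std::vector<CommandList> pending_;
};
```

Submitting list 1 and then list 2 always yields every command of list 1 before any command of list 2, regardless of when list 2 finished recording.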

7.0 Dynamic Heaps

As mentioned above, the game controls resource renaming to make command generation parallel. D3D 12 also simplifies resource renaming. In D3D 11, buffers were typed: there were vertex, constant, and index buffers. Game developers asked for the ability to use reserved memory however they see fit, and the D3D team obliged. In D3D 12 buffers are no longer typed. A buffer is simply a chunk of memory that the game allocates as needed, sized for the frame (or for several frames). The game can even use a heap allocator that subdivides as needed, which makes the process more efficient. D3D 12 also uses standard alignment: the GPU can read the data as long as the game uses standard alignment. The more that is standardized, the easier it is to create content that works across different CPUs, GPUs, and other hardware. The allocation is also persistent, so the CPU always knows the address. This helps CPU parallelism too: a thread can point the CPU at that address in memory, and the CPU then determines what data the frame needs.

Figure 20. Allocation and suballocation of buffers

The top of Figure 20 shows the D3D 11 model with typed buffers. The bottom shows the new D3D 12 model, where the game controls the heap. Instead of allocating a separate piece of memory for each buffer type, a single persistent chunk of memory is used. The game also sizes the buffer according to the rendering needs of the current frame, or even of the next several frames.
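A game-side suballocator of the kind described might look like the following minimal sketch. It models one persistent block on the CPU only and does not call D3D 12. The class name `FrameAllocator`, its methods, and the 256-byte default alignment are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one persistent chunk of memory that the game carves
// up itself. Vertex, constant, and index data all come from the same block.
class FrameAllocator {
public:
    static constexpr std::size_t kOutOfMemory = static_cast<std::size_t>(-1);

    // The 256-byte default alignment is an illustrative assumption.
    explicit FrameAllocator(std::size_t capacity, std::size_t alignment = 256)
        : storage_(capacity), alignment_(alignment) {
        // The rounding trick in allocate() requires a power-of-two alignment.
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    }

    // Returns the byte offset of a new suballocation, or kOutOfMemory.
    std::size_t allocate(std::size_t size) {
        const std::size_t start = (head_ + alignment_ - 1) & ~(alignment_ - 1);
        if (start > storage_.size() || size > storage_.size() - start) {
            return kOutOfMemory;
        }
        head_ = start + size;
        return start;
    }

    // Called once the data is no longer in use: the whole block is reusable.
    void reset() { head_ = 0; }

    // Address of a suballocation; stable because the block is persistent.
    std::uint8_t* data(std::size_t offset) { return storage_.data() + offset; }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t alignment_;
    std::size_t head_ = 0;
};
```

Because the block never moves, the offsets handed out stay valid until `reset()`, which is the property that lets several threads write their own regions independently.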

8.0 CPU Parallelism

It is time to pull everything together and show how the new D3D 12 features make a truly multithreaded PC game possible. D3D 12 supports several kinds of parallel work. Command lists and bundles provide parallel command generation and execution. Bundles let commands be recorded once and replayed many times within a frame or across frames. Command lists can be generated on multiple threads and handed to the command queue for the GPU to execute. Finally, persistently allocated buffers let dynamic data be generated in parallel. Both D3D 12 and WDDM 2.0 are designed for parallelism. D3D 12 removes the constraints of earlier D3D versions, so developers can parallelize their games or engines in whatever way makes sense.
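The record-once, replay-many idea behind bundles can be illustrated with a small model. This is a sketch under stated assumptions rather than the D3D 12 API: `Bundle`, `FrameStats`, `recordDraw`, and `replay` are hypothetical names, and a "command" is just a callable that updates per-frame counters.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical per-frame state that replayed commands act on.
struct FrameStats {
    int drawCalls = 0;
    int indicesDrawn = 0;
};

using Command = std::function<void(FrameStats&)>;

// Hypothetical bundle: commands are recorded once and can then be replayed
// any number of times, within one frame or across frames.
class Bundle {
public:
    void recordDraw(int indexCount) {
        assert(indexCount > 0);
        commands_.push_back([indexCount](FrameStats& stats) {
            ++stats.drawCalls;
            stats.indicesDrawn += indexCount;
        });
    }

    // Replaying does not modify the bundle, so it is safe to repeat.
    void replay(FrameStats& stats) const {
        for (const Command& command : commands_) {
            command(stats);
        }
    }

private:
    std::vector<Command> commands_;
};
```

The recording cost is paid once; each replay only walks the stored commands.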

Figure 21. Typical parallel processing in D3D 11. Thread 0 does almost all the work; the other threads are barely used

The diagram in Figure 21 shows a typical game workload in D3D 11. Application logic, the D3D runtime, DXGKernel, the KMD, and Present work run on a CPU with four threads. Thread 0 does most of the work. Threads 1 through 3 are barely used: they carry only application logic and the D3D 11 runtime generating rendering commands. Because of how D3D 11 is designed, the user-mode driver does not generate commands on those threads at all.

Figure 22. The same workload as in Figure 21, but in D3D 12. The work is spread evenly across all four threads and, together with the other D3D 12 efficiency improvements, the workload runs considerably faster

Now consider the same workload in D3D 12 (Figure 22). Application logic, the D3D runtime, DXGKernel, the KMD, and Present work again run on a CPU with four threads, but here the D3D 12 optimizations spread the work evenly across all of them. Because commands are generated in parallel, the D3D runtime runs in parallel as well. Kernel overhead is greatly reduced by the kernel optimizations in WDDM 2.0. The UMD runs on every thread, not just thread 0, which means command generation is truly parallel. Finally, bundles replace the redundant state-change logic of D3D 11 and speed up the application logic.

Figure 23. Comparison of parallel processing in D3D 11 and D3D 12

Figure 23 compares the two versions side by side. Because the degree of actual parallelism is fairly high, CPU usage by thread 0 and by threads 1 through 3 is relatively even. Threads 1 through 3 do more work, so the "Graphics only" column shows an increase for them. In addition, thanks to the lighter load on thread 0 and the new runtime and driver efficiencies, overall CPU usage drops by about 50%. In the "Application plus graphics" column the load is likewise spread more evenly across threads, and CPU usage falls by about 32%.

9.0 Conclusion

D3D 12 improves CPU efficiency through coarser-grained pipeline state objects. Instead of setting and reading each state individually, developers have a single point of control, which removes the hardware mismatch overhead altogether. The application sets the PSO, and the driver takes the API commands and translates them into GPU code. The new resource binding model carries none of the overhead of the control flow logic that used to be required.

With heaps, tables, and bundles, D3D 12 delivers better CPU efficiency and scalability. Explicit bind points give way to memory objects managed by the application or game. Frequently used commands can be recorded into bundles and replayed many times within a frame or across frames. Command lists and the command queue allow command lists to be generated in parallel on multiple CPU threads. Nearly all the work is spread evenly across the CPU threads, unlocking the full potential of 4th and 5th generation Intel® Core™ processors.

Direct3D 12 is a significant step forward for PC gaming technology. A leaner API and a driver with fewer layers let game developers work closer to the metal, which improves efficiency and performance. Working in collaboration, the D3D team has created a new API and driver model that give developers more control, so they can build games that match their vision, with great graphics and great performance.

References and Related Links

Intel links

Corporate brand: http://intelbrandcenter.tagworldwide.com/frames.cfm

Intel® product names: http://www.intel.com/products/processor_number/

Notices and Disclaimers

See http://legal.intel.com/Marketing/notices+and+disclaimers.htm

About the Author

Michael Coppock has worked at Intel since 1994 and specializes in PC graphics and gaming performance. He helps game companies get the most out of Intel GPUs and CPUs. Working on both software and hardware, Michael has been involved with many Intel products, going back to the 486DX4 Overdrive processor.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications before placing your product order.

Copies of documents that have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725 or by visiting http://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel Atom, and Intel Core are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include the SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

http://newsroom.intel.com/community/intel_newsroom/blog/2015/03/02/intel-launches-new-mobile-socs-lte-solution

↧

Introducing Intel® Atom™ x3 (Code-Named “SoFIA”) SoC Processor Series

April 9, 2015, 12:03 pm

Introduction

On March 2, 2015, at Mobile World Congress in Barcelona, Spain, Intel introduced the Intel® Atom™ x3 Processor Series, its first integrated communication platform. Formerly code-named “SoFIA”, the Intel® Atom™ x3 Processor Series is a low-cost SoC with 64-bit Intel Atom processor cores and an integrated cellular baseband modem for smartphones, feature phones, phablets, and tablets. The SoC will be available in 4G LTE and 3G versions.

Intel® Atom™ x3 Processor Series provides a foundation for full-featured and cost-effective platforms with fast and seamless mobile experiences which meet today’s consumers’ expectations.

This post walks through the high-level features of the Intel® Atom™ x3 platform, focusing on those of most interest to mobile app developers.

Please note that the processor series has not been officially released yet; the information in this post is subject to change without notice.

The Intel® Atom™ x3 Processor Series includes 3 versions (SKUs): the Intel® Atom™ x3-C3440 processor, the Intel® Atom™ x3-C3130 processor, and the Intel® Atom™ x3-C3230RK processor.

Architectures and Specifications

The Intel® Atom™ x3-C3440 processor includes a 64-bit quad-core Intel® Atom™ CPU and an integrated 4G LTE 5-mode modem. The upgraded video engine provides 1080p HD playback. The Mali T720 MP2 GPU supports OpenGL ES 3.0 and DirectX 9.3. Figure 1 shows the high-level block diagram of the Intel® Atom™ x3-C3440 processor.

Figure 1 The Intel® Atom™ x3-C3440 processor high-level block diagram

Besides the C3440 version, which supports 4G LTE, the Intel® Atom™ x3 Processor Series includes two SKUs that support 3G mobile technologies: the Intel® Atom™ x3-C3130 processor and the Intel® Atom™ x3-C3230RK processor. Both are low-cost options that still deliver solid performance.

The Intel® Atom™ x3-C3130 processor features a 64-bit dual-core Intel® Atom™ CPU and an integrated 3G modem. Figure 2 shows its high-level block diagram.

Figure 2 The Intel® Atom™ x3-C3130 processor features a dual-core Intel Atom CPU and an integrated 3G modem.

The Intel® Atom™ x3-C3230RK processor, a collaboration between Intel and Rockchip*, includes a quad-core 64-bit Intel® Atom™ CPU and an integrated 3G modem. Figure 3 describes this SKU’s high level architecture.

Figure 3 The Intel® Atom™ x3-C3230RK processor includes a quad-core Intel Atom CPU and an integrated 3G modem.

During the Intel Developer Forum in Shenzhen, China, on April 8, 2015, Intel and Rockchip* announced that devices based on the Intel® Atom™ x3-C3230RK processor are expected on the market later in Q2 2015.

On the graphics side, the Intel® Atom™ x3-C3130 processor includes a Mali* 400MP2 GPU, and the Intel® Atom™ x3-C3230RK processor includes a Mali* 450 MP4 GPU. Both processors support up to OpenGL ES 2.0.

As a summary, Figure 4 shows a comparison table for the 3 Intel® Atom™ x3 processor SKUs.

Figure 4 A comparison of the Intel® Atom™ x3 processor SKUs

Feature Highlights

High Performance SoC integration for great mobile user experiences

The Intel Atom x3 processor series features 64-bit Intel Atom quad-core or dual-core IA processors running at up to 1.4 GHz in Burst Mode, a fast integrated 4G LTE or 3G cellular baseband modem, HD video and quality audio, a GPU, and an ISP module. The ISP supports dual cameras: a front-facing camera of up to 5 MP and a rear-facing camera of up to 13 MP.

Low-power RF and low-power video encoding / decoding result in a long battery life.

Fast 4G LTE and 3G communications

4G LTE and 3G Intel modems are interoperable with mobile network operators around the globe and support world-wide roaming. In the 4G LTE SKU, up to 14 LTE bands can be supported with the 5-mode LTE modem (2G, 3G, 4G LTE, FDD/TDD, and TD-SCDMA).

Connectivity

The Intel Atom x3 processor series supports a full range of connectivity options that keep the user connected: Wi-Fi, Bluetooth, GPS, and GNSS. These capabilities enable a variety of mobile use cases, from networking to location-based services.

Impressive graphics and audio

The GPU provides clear and responsive graphics for games, with support for OpenGL ES 2.0 (the C3130 and C3230RK SKUs) and OpenGL ES 3.0 (the C3440 SKU). The platform supports high-quality audio, HD video playback, and Miracast-based wireless display.

Value-added capabilities

The dual SIM capability enables mobile subscriptions from 2 different service providers / carriers. As an option on the C3440 SKU, Near Field Communication (NFC) supports streamlined and secure tap-and-pay transactions.

Summary

As discussed above, the Intel® Atom™ x3 Processor Series integrates high-performance Intel® Atom™ CPU cores and a fast communication modem on a single SoC. It provides a quality foundation for entry-level and value tablets, phablets, and smartphones that are affordable for consumers around the world.

References

http://www.intel.com/content/www/us/en/processors/atom/atom-x3-c3000-brief.html

*Other names and brands may be claimed as the property of others.

http://www.intel.com/content/www/us/en/do-it-yourself/edison.html

↧

Exploring Air Quality Monitoring Using Intel® Edison

April 10, 2015, 11:02 am

Air quality monitoring is an interesting topic to explore given the rise in pollution, allergy sensitivity, health and fitness awareness, and technology innovation. The consumer marketplace has seen innovative products that bring more awareness to air quality monitoring in the home. One such product is the smart scale, which monitors a variety of health-related parameters as well as air quality. The air quality data is sent to the cloud, and an app can alert you to changes so you know when an area needs ventilating with fresh air. Awareness of air quality could allow for an improved quality of life. This article shows a method of exploring air quality monitoring by measuring carbon dioxide, volatile organic compounds (VOC), and dust levels using the Arduino* ecosystem and sending the data to a cloud service provider.

The Intel® Edison platform is a natural fit for starting a new prototype or migrating an existing one given its fast processor, large memory size, and integrated connectivity for WiFi and Bluetooth. The Arduino ecosystem provides a capable set of hardware and firmware libraries to experiment with using the Intel® Edison Compute Module and Intel® Edison Arduino Breakout Board.

To learn more about the Intel Edison platform, please see the link below:

http://www.intel.com/content/www/us/en/do-it-yourself/edison.html

Hardware Components:

This project uses the following hardware components for the air quality monitoring system:

Intel® Edison Compute Module
Intel® Edison Arduino Breakout Board
Common Cathode RGB LED + 3 x 1kΩ resistors
GP2Y1010AU0F Optical Dust Sensor + 150Ω resistor + 220 µF electrolytic capacitor
MQ-135 Gas Sensor
K-30 CO2 Sensor
PIR Motion Sensor

Figure 1 - Hardware Diagram

Theory of Operation:

Figure 1 shows the hardware component connections to the Intel® Edison Arduino Breakout Board. The system uses an RGB LED as a simple visual indication system for displaying the air quality.

To determine the total air quality of an area, three sensors are used:

1. An optical dust sensor is used to measure the dust in the area.

2. A gas sensor is used to measure the Volatile Organic Compounds (VOC) such as smoke.

3. A CO2 sensor, read over an I2C interface, is used to measure the carbon dioxide level.

In addition, a motion sensor is used for helping the system get the best representation of the total air quality in an area, by filtering out temporary increases in dust concentration caused by movement, and temporary increases in CO2 concentration caused by a person breathing close to the sensors.

When there is no motion detected, the firmware reads the air quality sensors, analyzes the sensor data, updates the visual indication system, and sends the air quality data to the cloud. The details of the system are further discussed in the Firmware section.

To learn more about the sensors, please see the data sheets at the links below:

http://www.kosmodrom.com.ua/pdf/MQ135.pdf

https://www.sparkfun.com/datasheets/Sensors/gp2y1010au_e.pdf

http://www.co2meter.com/collections/co2-sensors/products/k-30-co2-sensor-module

http://www.ladyada.net/media/sensors/PIRSensor-V1.2.pdf

Configuring the I2C Clock Frequency:

At the time of this writing, the default I2C clock frequency on Intel® Edison is above 100 kHz, which is outside the specification of the K-30 CO2 sensor. The K-30 supports a maximum I2C clock frequency (SCL) of 100 kHz. The Intel® Edison I2C clock frequency can be changed to 100 kHz in a few steps:

1. Ensure that the latest Intel® Edison Yocto firmware image is installed: http://www.intel.com/support/edison/sb/CS-035180.htm

2. Open an Edison Linux terminal and log in as root: https://software.intel.com/en-us/articles/getting-started-with-the-intel-edison-board-on-windows

3. Change to the I2C controller's sysfs directory: cd /sys/devices/pci0000:00/0000:00:09.1/i2c_dw_sysnode

4. Select standard mode: echo std > mode

5. Confirm the setting: cat mode

To learn more about the Intel® Edison compute module and the I2C peripheral, please see the link below:

http://www.intel.com/support/edison/sb/CS-035274.htm?wapkw=intel+edison+compute+module+hardware+guide

Firmware:

The following code shows the includes, macros, and functions for the air quality system. Functions for Initialization, Main Loop, Reading Motion Sensor, Reading Air Quality Sensors, Analyzing Total Air Quality, Updating Visual Indication LED, and Sending Data to a Cloud Service Provider are discussed.

Includes:

#include <Wire.h>

Macros:

//Pin Defines
#define gasSensorPin A1
#define dustSensorPin A0
#define dustSensorLEDPin 2
#define redRGBLEDPin 3
#define greenRGBLEDPin 4
#define blueRGBLEDPin 5
#define motionSensorPin 6

//Air Quality Defines
#define AIR_QUALITY_OPTIMAL 2
#define AIR_QUALITY_GOOD    1
#define AIR_QUALITY_BAD     0
#define AIR_QUALITY_UNKNOWN -1
#define MAX_SENSOR_READINGS        10
#define SENSOR_READING_DELAY 1000

//Motion Sensor Defines 
#define MOTION_NOT_DETECTED 0
#define MOTION_DETECTED     1
#define MOTION_DELAY_TIME   1000

//Dust Sensor Timing Parameters (from p.5 of datasheet)
#define SAMPLE_DELAY        280  //Sampling
#define PULSEWIDTH_DELAY    40   //Pw
#define PERIOD_DELAY        9680 //T

//Gas Sensor Thresholds
#define GAS_SENSOR_OPTIMAL 140
#define GAS_SENSOR_GOOD    200

//Dust Sensor Thresholds
#define DUST_SENSOR_OPTIMAL 125
#define DUST_SENSOR_GOOD    250

//CO2 Sensor Thresholds
#define CO2_SENSOR_OPTIMAL 800
#define CO2_SENSOR_GOOD    2000

Functions:

Initialization: This function initializes the serial debug interface, the I/O pins, and the I2C interface.

void setup() {
  Serial.begin(9600);
  pinMode(gasSensorPin, INPUT);
  pinMode(dustSensorPin, INPUT);
  pinMode(dustSensorLEDPin, OUTPUT);
  pinMode(redRGBLEDPin, OUTPUT);
  pinMode(greenRGBLEDPin, OUTPUT);
  pinMode(blueRGBLEDPin, OUTPUT);
  pinMode(motionSensorPin, INPUT);
  Wire.begin();
}

Main Loop: The main loop initializes the system, checks for motion, reads the air quality sensors, analyzes the total air quality, updates the indication LED, and sends the data to a cloud service.

void loop() {
  // -- Init
  int airQuality = 0;
  int motion = 0;
  int sensorAirQuality[3] = {0,0,0}; //0-Gas Sensor, 1-CO2 Sensor, 2-DustSensor
  Serial.println("");
 
  // -- Check for motion
  motion = readMotionSensor();
 
  if (motion == MOTION_NOT_DETECTED) {
    // -- Read Air Quality Sensors
    readAirQualitySensors(sensorAirQuality);
   
    // -- Analyze Total Air Quality
    airQuality = analyzeTotalAirQuality(sensorAirQuality[0],sensorAirQuality[1],sensorAirQuality[2]);
   
    // -- Update Indication LED
    updateIndicationLED(airQuality);
   
    // -- Update Air Quality Value for Cloud Datastream
    updateCloudDatastreamValue(CHANNEL_AIR_QUALITY_ID, airQuality);
 
    // -- Send Data To Cloud Service
    sendToCloudService();
  }
}

Reading Motion Sensor: The motion sensor is read by sampling the sensor’s digital output pin. If motion is detected, the sensor output pin will go HIGH. The function attempts to filter glitches and returns whether motion was detected or not.

int readMotionSensor() {
  // -- Init
  int motionSensorValue = MOTION_NOT_DETECTED;
  int motion = MOTION_NOT_DETECTED;
 
  Serial.println("-Read Motion Sensor");
 
  // -- Read Sensor
  motionSensorValue = digitalRead(motionSensorPin);
 
  // -- Analyze Value
  if (motionSensorValue == MOTION_DETECTED) {
    delay(MOTION_DELAY_TIME); 
    motionSensorValue = digitalRead(motionSensorPin);
   
    if (motionSensorValue == MOTION_DETECTED) {
      motion = MOTION_DETECTED;
      Serial.println("--Motion Detected");
      updateIndicationLED(AIR_QUALITY_UNKNOWN);
    }
  }
  return motion;
}
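The glitch filter in readMotionSensor() is a two-sample confirmation: motion counts only if the pin still reads HIGH after a delay. The sketch below isolates that logic so it can be exercised off the board. The function name `confirmMotion` and the injected `readPin` and `wait` callables are hypothetical, introduced only for illustration; on the device they would correspond to digitalRead(motionSensorPin) and delay(MOTION_DELAY_TIME).

```cpp
#include <cassert>
#include <functional>

constexpr int kMotionNotDetected = 0;
constexpr int kMotionDetected = 1;

// Returns kMotionDetected only if two samples, separated by wait(), both
// report motion. A single HIGH sample followed by LOW is treated as a glitch.
int confirmMotion(const std::function<int()>& readPin,
                  const std::function<void()>& wait) {
    assert(readPin && wait);
    if (readPin() != kMotionDetected) {
        return kMotionNotDetected;  // no need to wait when the pin is LOW
    }
    wait();
    return (readPin() == kMotionDetected) ? kMotionDetected
                                          : kMotionNotDetected;
}
```

Injecting the two callables keeps the decision logic separate from the hardware access, so a scripted sequence of pin values can stand in for the sensor.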

Reading Air Quality Sensors: This function calls the individual gas, CO2, and dust sensor functions. It takes a pointer to an integer array in which the air quality result for each sensor is stored.

void readAirQualitySensors(int* sensorAirQuality)
{
  Serial.println("-Read Air Quality Sensors");
 
  sensorAirQuality[0] = readGasSensor();
  sensorAirQuality[1] = readCO2Sensor();
  sensorAirQuality[2] = readDustSensor();
}

Reading Gas Sensor: The gas sensor can detect gases such as NH3, NOx, alcohol, benzene, and smoke. It provides an analog voltage output that is proportional to the gas levels in the air, so an A/D conversion is performed to read it. The function reads the sensor, averages the readings, analyzes the sensor data, and returns the air quality for this sensor.

int readGasSensor() {
  // -- Init
  int airQuality = 0;
  int gasSensorValue = 0;
 
  // -- Read Sensor
  for (int i=0; i < MAX_SENSOR_READINGS; i++) {
    gasSensorValue += analogRead(gasSensorPin);
    delay(SENSOR_READING_DELAY);
  }
  gasSensorValue /= MAX_SENSOR_READINGS; //Average the sensor readings
 
  // -- Update Cloud Datastream
  Serial.print("--gasSensorValue = ");
  Serial.println(gasSensorValue);
  updateCloudDatastreamValue(CHANNEL_GAS_SENSOR_ID, gasSensorValue);
 
  // -- Analyze Value
  if (gasSensorValue < GAS_SENSOR_OPTIMAL) {
    airQuality = AIR_QUALITY_OPTIMAL;
  }
  else if (gasSensorValue < GAS_SENSOR_GOOD) {
    airQuality = AIR_QUALITY_GOOD;
  }
  else {
    airQuality = AIR_QUALITY_BAD;
  }

  return airQuality;
}
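The gas, dust, and CO2 functions all end with the same two-threshold ladder. As a minimal sketch, that ladder could be factored into one helper; `classifyReading` and the `AirQuality` enum are hypothetical names, with enum values matching the AIR_QUALITY_* macros above.

```cpp
#include <cassert>

// Values match AIR_QUALITY_BAD, AIR_QUALITY_GOOD, and AIR_QUALITY_OPTIMAL.
enum AirQuality { kBad = 0, kGood = 1, kOptimal = 2 };

// Readings strictly below optimalBelow are optimal, readings strictly below
// goodBelow are good, and everything else is bad.
AirQuality classifyReading(int value, int optimalBelow, int goodBelow) {
    assert(optimalBelow <= goodBelow);
    if (value < optimalBelow) {
        return kOptimal;
    }
    if (value < goodBelow) {
        return kGood;
    }
    return kBad;
}
```

With the gas thresholds from the macros, `classifyReading(gasSensorValue, 140, 200)` reproduces the if/else chain in readGasSensor().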

Reading Dust Sensor: The dust sensor contains an optical sensing system that is energized using a digital output pin. An A/D conversion is then performed to sample the sensor’s analog voltage output that is proportional to the dust in the air. This function reads the sensor, averages the readings, analyzes the sensor data, and returns the air quality for this sensor.

int readDustSensor() {
  // -- Init
  int airQuality = 0;
  int dustSensorValue = 0;
 
 
  // -- Read Sensor
  for (int i=0; i < MAX_SENSOR_READINGS; i++) {
    digitalWrite(dustSensorLEDPin,LOW);  //Enable LED
    delayMicroseconds(SAMPLE_DELAY);
    dustSensorValue += analogRead(dustSensorPin);
    delayMicroseconds(PULSEWIDTH_DELAY);
    digitalWrite(dustSensorLEDPin,HIGH); //Disable LED
    delayMicroseconds(PERIOD_DELAY);
    delay(SENSOR_READING_DELAY);
  }
  dustSensorValue /= MAX_SENSOR_READINGS; //Average the sensor readings
 
  // -- Update Cloud Datastream
  Serial.print("--dustSensorValue = ");
  Serial.println(dustSensorValue);
  updateCloudDatastreamValue(CHANNEL_DUST_SENSOR_ID, dustSensorValue);
 
  // -- Analyze Value
  if (dustSensorValue < DUST_SENSOR_OPTIMAL) {
    airQuality = AIR_QUALITY_OPTIMAL;
  }
  else if (dustSensorValue < DUST_SENSOR_GOOD) {
    airQuality = AIR_QUALITY_GOOD;
  }
  else {
    airQuality = AIR_QUALITY_BAD;
  } 
 
  return airQuality;
}
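The three dust sensor delays add up to one LED pulse cycle. The following worked arithmetic, written as compile-time checks, assumes the macro comments' labels (Sampling, Pw, T) mean that the sample point plus the remaining on-time gives the pulse width, and that the pulse width plus the off-time gives the period. The constant and function names are hypothetical mirrors of the macros, and ADC conversion time is ignored.

```cpp
#include <cassert>

constexpr int kSampleDelayUs = 280;      // mirrors SAMPLE_DELAY
constexpr int kPulseWidthDelayUs = 40;   // mirrors PULSEWIDTH_DELAY
constexpr int kPeriodDelayUs = 9680;     // mirrors PERIOD_DELAY

// Time the sensor LED stays on in each cycle.
constexpr int ledOnTimeUs() { return kSampleDelayUs + kPulseWidthDelayUs; }

// Length of one full LED on/off cycle.
constexpr int cycleTimeUs() { return ledOnTimeUs() + kPeriodDelayUs; }

static_assert(ledOnTimeUs() == 320, "LED on-time is 0.32 ms");
static_assert(cycleTimeUs() == 10000, "one pulse cycle is 10 ms");

// Approximate duration of the averaging loop in readDustSensor(), where each
// iteration runs one pulse cycle and then waits readingDelayMs.
int samplingLoopTimeMs(int readings, int readingDelayMs) {
    assert(readings > 0 && readingDelayMs >= 0);
    return readings * (cycleTimeUs() / 1000 + readingDelayMs);
}
```

With MAX_SENSOR_READINGS at 10 and SENSOR_READING_DELAY at 1000 ms, the dust measurement alone takes roughly 10.1 seconds per pass through the main loop.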

Reading CO2 Sensor: The CO2 sensor returns a CO2 concentration level in parts per million (ppm). The CO2 sensor is read through the I2C interface. This function reads the sensor, averages the readings, analyzes the sensor data, and returns the air quality for this sensor.

int readCO2Sensor() {
  // -- Init
  int airQuality = 0;
  int co2SensorValue = 0;
  int tempValue=0;
  int invalidCount=0;
 
  // -- Read Sensor
  for (int i=0; i < MAX_SENSOR_READINGS; i++) {
    tempValue = readCO2();  // see http://cdn.shopify.com/s/files/1/0019/5952/files/Senseair-Arduino.pdf?1264294173 for this function
    if (tempValue == 0) {
      invalidCount++;               //A reading of 0 is treated as a failed read
    }
    else {
      co2SensorValue += tempValue;
    }
    delay(SENSOR_READING_DELAY);
  }
 
  //Average only the valid readings. If every read failed, co2SensorValue
  //stays 0 and is classified as optimal below.
  if (invalidCount != MAX_SENSOR_READINGS) {
    co2SensorValue /= (MAX_SENSOR_READINGS - invalidCount);
  }
 
  // -- Update Cloud Datastream
  Serial.print("--co2SensorValue = ");
  Serial.println(co2SensorValue);
  updateCloudDatastreamValue(CHANNEL_CO2_SENSOR_ID, co2SensorValue);
 
  // -- Analyze Value
  if (co2SensorValue < CO2_SENSOR_OPTIMAL) {
    airQuality = AIR_QUALITY_OPTIMAL;
  }
  else if (co2SensorValue < CO2_SENSOR_GOOD) {
    airQuality = AIR_QUALITY_GOOD;
  }
  else {
    airQuality = AIR_QUALITY_BAD;
  } 
 
  return airQuality;
}
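One consequence of the averaging in readCO2Sensor() is that a run in which every read fails leaves the value at 0, which then falls below the optimal threshold. A minimal sketch of an averaging helper that makes that case explicit is shown here; `averageValidReadings` and the `kNoValidReading` sentinel are hypothetical and not part of the original firmware.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sentinel returned when no reading could be used.
constexpr int kNoValidReading = -1;

// Averages the non-zero readings. A value of 0 is treated as a failed read,
// following the convention used in readCO2Sensor().
int averageValidReadings(const std::vector<int>& readings) {
    long sum = 0;
    int validCount = 0;
    for (int reading : readings) {
        assert(reading >= 0);
        if (reading != 0) {
            sum += reading;
            ++validCount;
        }
    }
    if (validCount == 0) {
        return kNoValidReading;  // caller decides how to report this
    }
    return static_cast<int>(sum / validCount);
}
```

A caller could map `kNoValidReading` to AIR_QUALITY_UNKNOWN instead of letting a failed sensor read as clean air; whether that suits the system is a design choice left open here.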

Analyzing Total Air Quality: This function determines the total air quality for the area from the gas, CO2, and dust air quality values passed to it: bad if any sensor reports bad, optimal only if all three report optimal, and good otherwise. The function returns the total air quality level for the area.

int analyzeTotalAirQuality(int gasAirQuality, int co2AirQuality, int dustAirQuality) {
    int airQuality = 0;
    Serial.println("-Analyze Total Air Quality");
    if (gasAirQuality==AIR_QUALITY_BAD    \
        || dustAirQuality==AIR_QUALITY_BAD \
        || co2AirQuality==AIR_QUALITY_BAD) {
      Serial.println("--Air Quality Is BAD");
      airQuality = AIR_QUALITY_BAD;
    }
    else if (gasAirQuality == AIR_QUALITY_OPTIMAL
             && dustAirQuality == AIR_QUALITY_OPTIMAL
             && co2AirQuality == AIR_QUALITY_OPTIMAL) {
      Serial.println("--Air Quality Is OPTIMAL");
      airQuality = AIR_QUALITY_OPTIMAL;
    }
    else  {
      Serial.println("--Air Quality Is Good");
      airQuality = AIR_QUALITY_GOOD;
    }
    return airQuality;
}
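
The rule in analyzeTotalAirQuality() amounts to "the total is the worst of the three sensor levels". The sketch below shows that equivalent form, assuming the levels are encoded so that a larger value means worse air. The article does not show the values of its AIR_QUALITY_* macros, so the enum and the combineAirQuality name here are hypothetical.

```cpp
#include <algorithm>

// Hypothetical encoding: a larger value means worse air.
enum AirQuality { OPTIMAL = 0, GOOD = 1, BAD = 2 };

// Any BAD input yields BAD, all OPTIMAL yields OPTIMAL, everything else GOOD,
// which is the same outcome as the if/else chain above.
AirQuality combineAirQuality(AirQuality gas, AirQuality co2, AirQuality dust) {
    return std::max({gas, co2, dust});
}
```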

Updating Visual Indication LED: This function updates the indication LED to the appropriate color for the air quality value that is passed to this function. The LED turns blue for optimal air quality levels, green for good air quality levels, and red for bad air quality levels. The LED turns magenta if motion is detected.

void updateIndicationLED(int airQuality) {
  Serial.println("-Update Indication LED");
  // --Turn off all colors
  digitalWrite(redRGBLEDPin,LOW);
  digitalWrite(greenRGBLEDPin,LOW);
  digitalWrite(blueRGBLEDPin,LOW);
     
  // --Update Indication LED
  if (airQuality == AIR_QUALITY_UNKNOWN) {
    digitalWrite(redRGBLEDPin,HIGH);
    digitalWrite(greenRGBLEDPin,HIGH);
    digitalWrite(blueRGBLEDPin,HIGH);
  }
  else if (airQuality == AIR_QUALITY_OPTIMAL) {
    digitalWrite(blueRGBLEDPin, HIGH);
  }
  else if (airQuality == AIR_QUALITY_GOOD) {
    digitalWrite(greenRGBLEDPin, HIGH);
  }
  else {
    digitalWrite(redRGBLEDPin, HIGH);
  }
}

Sending Data to a Cloud Service Provider:

To connect Intel® Edison to a WiFi network, please see the link below:

http://www.intel.com/support/edison/sb/CS-035342.htm

Figure 2 - xively.com feed

In this example, xively.com is the cloud service provider that receives the air quality data. Figure 2 shows an example feed with four channels. The channels are discussed further in the Functions section. Integration with xively.com requires adding the HttpClient and Xively libraries to the Arduino IDE. See the link below to learn more about xively.com, creating an account, Arduino tutorials, and library integration with the Arduino IDE.

https://xively.com/dev/tutorials/arduino_wi-fi/

The following code shows an example of the includes, macros, and functions that can be added to the air quality system to add xively.com support.

Includes:

#include <WiFi.h>
#include <HttpClient.h>
#include <Xively.h>

Macros:

//Xively.com Defines
#define XIVELY_FEED <enter your feed number here>
#define XIVELY_KEY <enter your key string here>
#define XIVELY_HTTP_SUCCESS 200
#define CHANNEL_AIR_QUALITY "AIR_QUALITY"
#define CHANNEL_AIR_QUALITY_ID    0
#define CHANNEL_GAS_SENSOR "GAS_SENSOR"
#define CHANNEL_GAS_SENSOR_ID     1
#define CHANNEL_CO2_SENSOR "CO2_SENSOR"
#define CHANNEL_CO2_SENSOR_ID     2
#define CHANNEL_DUST_SENSOR "DUST_SENSOR"
#define CHANNEL_DUST_SENSOR_ID    3
#define MAX_CHANNELS              4

Global Variables:

//Xively Datastream
XivelyDatastream datastreams[] = {
    XivelyDatastream(CHANNEL_AIR_QUALITY, strlen(CHANNEL_AIR_QUALITY), DATASTREAM_FLOAT),
    XivelyDatastream(CHANNEL_GAS_SENSOR, strlen(CHANNEL_GAS_SENSOR), DATASTREAM_FLOAT),
    XivelyDatastream(CHANNEL_CO2_SENSOR, strlen(CHANNEL_CO2_SENSOR), DATASTREAM_FLOAT),
    XivelyDatastream(CHANNEL_DUST_SENSOR, strlen(CHANNEL_DUST_SENSOR), DATASTREAM_FLOAT)
  };

//Xively Feed
XivelyFeed feed(XIVELY_FEED, datastreams, MAX_CHANNELS);
 
//Xively Client
WiFiClient client;
XivelyClient xivelyclient(client);

Functions:

Updating the data stream: This function updates the value of a xively.com channel datastream. It is passed the channel ID and the new datastream value. As shown in Figure 2, this system uses four datastreams. Three are updated with raw sensor data by the gas, CO2, and dust sensor functions, and the fourth is updated in the main loop with the total air quality value.

void updateCloudDatastreamValue(int channelID, int value) {
  // -- Update the Datastream Value
  datastreams[channelID].setFloat(value);
}
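
updateCloudDatastreamValue() indexes datastreams[] with whatever channelID it receives. A hedged sketch of a bounds-checked variant follows. It uses a plain float array as a stand-in for the Xively datastreams, and the names kMaxChannels, channelValues, and updateChannelValue are hypothetical.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kMaxChannels = 4;           // mirrors MAX_CHANNELS above
std::array<float, kMaxChannels> channelValues{};  // stand-in for datastreams[]

// Stores the value only when channelID is a valid index; returns whether it did.
bool updateChannelValue(int channelID, int value) {
    if (channelID < 0 || static_cast<std::size_t>(channelID) >= kMaxChannels) {
        return false;  // reject out-of-range channel IDs instead of writing past the array
    }
    channelValues[static_cast<std::size_t>(channelID)] = static_cast<float>(value);
    return true;
}
```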

Sending the Datastreams to Xively: This function performs a PUT operation to a xively.com feed and prints either a success message or the returned error code to the serial console; it does not return a value. The main loop calls this function.

void sendToCloudService() {
  int status=0;
  Serial.println("-Send To Cloud Service");

  // -- Upload the Datastream to Xively
  status = xivelyclient.put(feed, XIVELY_KEY);
 
  // -- Verify Transaction
  if (status == XIVELY_HTTP_SUCCESS) {
    Serial.println("--HTTP OK");
  }
  else {
    Serial.print("--ERROR: ");
    Serial.println(status);
  }
}

Summary:

I hope you enjoyed exploring air quality monitoring with the Intel Edison platform. Challenge yourself to add indicators that show the status of each sensor, to enhance the cloud service experience with alert triggers that fire when the air quality changes, and to look for opportunities to integrate air quality monitoring with other systems.

About the Author:

Mike Rylee is a Software Engineer at Intel Corporation with a background in developing embedded systems and apps on Android*, Windows*, iOS*, and Mac*. He currently works on enabling for Android and the Internet of Things.

++This sample source code is released under the Intel Sample Source License

Notices: No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

License agreement: Intel Sample Source Code License Agreement

↧

Using the Unity* Toolkit’s SendMessage Action with Intel® RealSense™ Technology

April 17, 2015, 2:24 pm

The Unity Toolkit for Intel® RealSense™ technology is a standard Unity package. It contains all the DLLs and scripts that are needed to use the Intel RealSense SDK in a Unity game or application. The toolkit allows for ease of use by making scripting properties available in the Inspector without the need to do much of the boilerplate coding required in the previous version of the SDK, the Intel® Perceptual Computing SDK.

Purpose

The purpose of this tutorial is to show you how to use the Unity Toolkit’s SendMessage action. The SendMessage action allows you to call a function from another script whenever an Intel RealSense application event is triggered. This can be very useful when you want to use functionality from a different script than what is included in any of the action scripts.

What this Tutorial Covers

This tutorial will cover the steps needed to use the SendMessageAction action.

Assumptions

You should have read the sdktoolkit.pdf file, which is installed in the \RSSDK\doc folder. This document gives you a general idea of how the Unity add-on works. This tutorial also assumes that you have some basic Unity skills such as importing an external package, adding a cube to a scene, and applying a script to a game object.

Requirements

Unity Professional 4.1 or higher.

An Intel® RealSense™ 3D camera either embedded in a device or an external camera.

Let’s Begin

Import the Visual Studio* Plugin (Optional)

While this step is not required, I prefer to use Visual Studio as my IDE. If you would like to use this free plugin, you can download it from http://unityvs.com/.

Import the Unity* package for Intel® RealSense™ Technology

This package contains everything you need to set up and run Intel RealSense applications. The Unity package is located in the RSSDK\Framework\Unity folder. If you installed the Intel RealSense SDK in the default location, the RSSDK folder is in C:\Program Files (x86).

You can import the Unity Toolkit as you would any package. When doing so, you have the options to pick and choose what components you want to use. For the purpose of this tutorial, we will use the defaults and import everything.

After the import, several new items appear under the Assets folder:

Plugins and Plugins.Managed contain DLLs required for using the Intel RealSense SDK.
RSUnityToolkit folder contains all the scripts and assets for running the toolkit.

We won’t go into what all the folders are and what they contain here as that is out of the scope of this tutorial.

Game Objects

Add a 3D cube to the scene.

Add a directional light so you can see your game object better.

Scripting

To make sure the SendMessageAction action works properly on a game object, you first need to drag and drop the SendMessageAction script onto the game object. Second, you need to create a new script that will contain a function that you want to be called whenever an Intel RealSense SDK action has been triggered.

Create a new folder, then create a new script

To keep things organized, I created a new folder to hold my custom script. Inside that folder, I created a new C# script called MyRotateScript. Inside this script I set up a function that will rotate the cube along the X axis at a set interval.

Open the script for editing

Clear out any existing functions as they won’t be needed. Your script should look like this:

Next add the following code to your script:

Any time this function is called, the cube is rotated 30 degrees on each axis.

Once you have saved your script, return to Unity and drag and drop this new script onto your cube game object. You won’t need to edit this script in the Inspector. However, it is important that you add this script to your cube game object, because the SendMessageAction script relies on it being attached to the same game object.

Add the SendMessageAction script

Grab the SendMessageAction script and drop it onto the cube.

At this point in the Inspector you should see the two scripts, the MyRotateScript we just created and the SendMessageAction script.

Configure the Send Message Action script to call the rotation script’s RotateMyCube()

On the Add Trigger button, select Event Source.

On the Event Source Add button, I have chosen Hand Detected. This is probably the easiest way to get this up and working.

Next, choose the hand options. I have chosen Right hand, so any time the camera sees my right hand, the Send Message Action is triggered.

Under Function Name enter the name of the public function RotateMyCube, which resides in the MyRotateScript class that was created earlier.

One thing to note: the event fires only when the SDK detects your hand, and only one message is sent per detection. So, to see the cube rotate more than once, you must move your hand out of the camera’s view and bring it back.

Congratulations, at this point, everything is ready to go.

Save your scene, save your project, and run the application you just created!

↧

Intel AMT and Java

April 17, 2015, 3:49 pm

For all of you Java developers out there who have been searching for tools to help with your Intel® Active Management Technology solutions, your search is over. I’d like to introduce you to the Intel® WS-Management Java Client Library, available for download here.

This download contains source code for a library that can be used to make WS-Management calls to Intel® AMT devices. Also included are code samples that demonstrate how to use the library to interact with many AMT features and documentation on the library & samples.

Key Features

Allows Intel® AMT development in Java environments

100% pure Java (no JNI or native code wrapping)

Supports Digest and Kerberos authentication

Library is packaged in a single .jar file with JavaDocs

Supports JDK5 and newer (JDK6 required for native Kerberos support)

Getting Started

Just include IntelWsmanLib.jar in your class path and you can start calling the API. I wrote a few of the samples in the download using the NetBeans Integrated Development Environment, but you can use any Java-capable environment as long as you include the jar file. Relevant objects are found in the intel.management.wsman Java package.

The library is WS-Management compliant and not AMT specific or even AMT version specific. There are no WSDL files, XSD files, or MOF files needed.

Writing Code

Here are some snippets to give you an idea of how the code looks.

First create a connection to a WS-Management Service as follows:

        import intel.management.wsman.*;

        connection = WsmanConnection.createConnection(<url>);

Then use the resulting connection object to perform WS-Transfer, WS-Enumeration, or WS-Eventing operations using either digest or Kerberos credentials.

For example, to get the AMT host name:

        ManagedReference ref = connection.newReference("AMT_GeneralSettings");

        ManagedInstance inst = ref.get();

        inst.GetProperty("HostName");

Take a look at the sample code included in the download for more details.

Please post any questions, comments, or feedback to our forum.

↧

Intel® XDK Update for April 2015, and Apache* Cordova*

April 14, 2015, 4:50 pm

As you are probably aware, we did two updates in the past couple of weeks. The March 30 update, build 1878, was a regular update to fix a number of open issues. This one today, build 1912, is to address some regressions with that update. Not what we had planned. From all of us in the Intel® XDK team, our apologies for having to give you another update so quickly. Fortunately, we do know exactly why and how the regressions happened and are correcting our development and testing processes appropriately.

So, what’s in this update?

Bug fixes, and that’s it. We fixed the Cordova* plugin regressions (selecting and unselecting the standard Cordova plugins and the intel.xdk plugins), a problem with selecting third-party plugins, issues with ajax calls, and an issue with the Emulator giving a 404 error when it couldn’t find the index.html file. We also fixed a firewall proxy problem that a few users reported on OS X*. There were a few other fixes; check the Release Notes to see whether we fixed the problems you may have. If you don’t see yours there, please visit our user forums.

Also, we continued to improve the secure, single sign-on support we released earlier in March, which allows your Intel XDK login to work with the user forums on the Intel Developer Zone. We resolved a few issues with non-English characters in user names.

Cordova* changes coming!

Now, an important note about Apache* Cordova*. We are big fans and supporters of the Apache Cordova project, and contributors to it: it does a great service to mobile app developers by helping them access native functionality in a standard way. It also moves very fast, which is important given the rapidly changing nature of web development and web technologies. As a tools vendor for HTML5, and one that views Cordova support as central to the core value of the product (the Intel XDK’s cross-platform support), we have to move very quickly as well to ensure we can offer new Cordova functionality as soon as it is ready. It is challenging to stay up to date, but it is something we must do.

Cordova 5.0 is being released this summer. It is a big change in that plugins will become node packages, adopting the NPM naming conventions, and the plugins registry will be going away as of this fall. App developers will need to be aware of the changes in how plugins are integrated, and of the new names. We intend to support 5.0 fully and will have releases in the summer so that adding plugins to apps stays as easy as possible. We’ll also be looking at ways to preview apps with plugins (coming soon!). But, primarily, we want to make sure you are aware of the upcoming changes so you can prepare. Please let us know if you have any questions. We’ll keep info on Cordova 5.0 on our user forums and website over the coming months.

Please provide us feedback on our User Forum as to what or how you would like to see us help the Cordova project and your app development.

We greatly appreciate your feedback and comments in the user forums and in private. Please keep them coming!

Joe

↧

Use which hardware PMU events to calculate FLOPS on Intel(R) Xeon Phi(TM) coprocessor?

April 20, 2015, 1:19 am

FLOPS stands for floating-point operations per second, a common performance measure in high-performance computing. Out of the box, Intel(R) VTune(TM) Amplifier XE provides only a general-purpose metric, Cycles Per Instruction (average CPI), which measures the performance of general programs.

In this article, I use matrix1.c as an example and show which events to use to calculate FLOPS on different platforms. First, I build binaries that use legacy x87, SSE, and AVX instructions for an Intel(R) Xeon processor, and vector instructions for the Intel(R) Xeon Phi(TM) coprocessor, using different compiler switches. Then I use hardware events to calculate FLOPS.

(I work on a 2nd Generation Intel(R) Core(TM) architecture (Sandy Bridge) processor with a CPU frequency of 3.4 GHz and a 64-bit operating system.)
(I also work on an Intel Xeon Phi coprocessor with a CPU frequency of 1.09 GHz.)

1. Generate legacy x87 instructions for traditional FP operations, calculate FLOPS

Build:
gcc -g -mno-sse matrix1.c -o matrix1.x87

Run VTune:
amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.X87:sa=10000 -- ./matrix1.x87

amplxe-cl -report hw-events

Function  Module       Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:FP_COMP_OPS_EXE.X87 (K)
--------  -----------  -------------------------------------------------  --------------------------------------------
multiply  matrix1.x87  36,782,055                                         2,160,570

There were 2,160,570,000 counts of FP_COMP_OPS_EXE.X87
Elapsed time of multiply() = 36,782,055,000 / 3,400,000,000 = 10.818 seconds
FLOPS = 2,160,570,000 / 1,000,000 / 10.818 = 199.719 Mflops

2. Use SSE registers by using Intel C++ compiler with SSE enabled options, calculate FLOPS

Build:
icc -g -fno-inline -xSSE4.1 matrix1.c -o matrix1.SSE41

Run VTune:
amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.X87:sa=10000 -- ./matrix1.SSE41

amplxe-cl -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE:sa=10000,FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE:sa=10000 -- ./matrix1.SSE41

amplxe-cl -report hw-events

Function  Module         Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE (K)  Hardware Event Count:FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE (K)
--------  -------------  -------------------------------------------------  ----------------------------------------------------------  ----------------------------------------------------------
multiply  matrix1.SSE41  1,100,002                                          0                                                           1,185,800

There were 1,185,800,000 counts of FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE
Elapsed time of multiply() = 1,100,002,000 / 3,400,000,000 = 0.3235s
FLOPS = 1,185,800,000 / 1,000,000 / 0.3235 = 3665.53 Mflops

3. Use AVX registers by using Intel C++ compiler with the option to enable AVX, calculate FLOPS

Build:
icc -g -fno-inline -xAVX matrix1.c -o matrix1.AVX

Run VTune:
amplxe-cl -collect-with runsa -knob event-config=SIMD_FP_256.PACKED_DOUBLE:sa=10000 -- ./matrix1.AVX

amplxe-cl -report hw-events

Function  Module       Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC (K)  Hardware Event Count:SIMD_FP_256.PACKED_DOUBLE (K)
--------  -----------  -------------------------------------------------  --------------------------------------------------
multiply  matrix1.AVX  1,486,002                                          777,070

There were 777,070,000 counts of SIMD_FP_256.PACKED_DOUBLE
Elapsed time of multiply() = 1,486,002,000 / 3,400,000,000 = 0.437s
FLOPS = 777,070,000 / 1,000,000 / 0.437 = 1778.19 Mflops

4. Use vector instructions by using Intel C++ compiler to build native program for Intel Xeon Phi coprocessor, calculate FLOPS
Build:
icc -g -fno-inline -mmic -O3 matrix1.c -o matrix1.MIC

The application's FP operations are processed by the vector processing unit (VPU), which provides data parallelism. VTune supports these VPU events:
VPU_DATA_READ
VPU_DATA_WRITE

Run VTune:
amplxe-cl -target-system=mic-native:0 -collect-with runsa -knob event-config=VPU_DATA_READ,VPU_DATA_WRITE -search-dir=. -- /root/matrix1.MIC

amplxe-cl -R hw-events
Function  Module       Hardware Event Count:VPU_DATA_READ (M)  Hardware Event Count:VPU_DATA_WRITE (M)  Hardware Event Count:CPU_CLK_UNHALTED (M)
--------  -----------  --------------------------------------  ---------------------------------------  -----------------------------------------
multiply  matrix1.MIC  176                                     134                                      2,152

There were (176+134)=310M counts of VPU_DATA_READ & VPU_DATA_WRITE
Elapsed time of multiply() = 2,152,000,000 / 1,090,000,000 = 1.974s
FLOPS = 310,000,000 / 1,000,000 / 1.974 = 157.04 Mflops

Please note that my example is a single-threaded app running on one core; you can develop a multithreaded app that runs on multiple cores of the Intel Xeon Phi coprocessor.
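
Each section above uses the same arithmetic: elapsed time is the cycle count divided by the clock frequency, and Mflops is the event count divided by one million and by the elapsed time. The sketch below captures that calculation. The function names and the opsPerEvent parameter are hypothetical. The default of 1.0 reproduces the arithmetic used in this article; a packed event may correspond to more than one floating-point operation (for example, two doubles per 128-bit operation), so treating the multiplier as adjustable is an assumption you should check against the event documentation for your processor.

```cpp
#include <cassert>

// Elapsed seconds = unhalted clock cycles / clock frequency in Hz.
double elapsedSeconds(double cycles, double clockHz) {
    assert(clockHz > 0.0);
    return cycles / clockHz;
}

// Mflops = (event count * operations per event) / 1e6 / elapsed seconds.
// opsPerEvent defaults to 1.0, which matches the calculations in the text.
double mflops(double eventCount, double cycles, double clockHz,
              double opsPerEvent = 1.0) {
    return eventCount * opsPerEvent / 1e6 / elapsedSeconds(cycles, clockHz);
}
```

For the x87 run, `mflops(2160570000.0, 36782055000.0, 3.4e9)` gives roughly 199.7 Mflops, in line with the hand calculation in section 1.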

Attached files:

https://software.intel.com/sites/default/files/managed/90/de/matrix1.c

↧

Null Pointer Dereferencing Causes Undefined Behavior

April 20, 2015, 6:17 am

I recently and unintentionally started a large debate about whether it is legal in C/C++ to use the &P->m_foo expression when P is a null pointer. The programming community split into two camps. The first confidently claimed that it isn't legal, while the other was just as sure that it is. Both sides gave various arguments and links, and at some point it occurred to me that I had to settle the matter. To do that, I contacted Microsoft MVP experts and the Microsoft Visual C++ development team through a closed mailing list. They helped me prepare this article, and now everyone interested is welcome to read it. For those who can't wait to learn the answer: that code is NOT correct.

Debate history

It all started with an article about checking the Linux kernel with the PVS-Studio analyzer. But the issue has nothing to do with the check itself. The point is that in that article I cited the following fragment from the Linux code:

static int podhd_try_init(struct usb_interface *interface,
        struct usb_line6_podhd *podhd)
{
  int err;
  struct usb_line6 *line6 = &podhd->line6;

  if ((interface == NULL) || (podhd == NULL))
    return -ENODEV;
  ....
}

I called this code dangerous because I believed it causes undefined behavior.

After that, I received a pile of emails and comments from readers objecting to that idea, and I almost gave in to their convincing arguments. For instance, as proof that the code is correct, they pointed to the implementation of the offsetof macro, which typically looks like this:

#define offsetof(st, m) ((size_t)(&((st *)0)->m))

We deal with null pointer dereferencing here, but the code still works well. There were also some other emails reasoning that since there had been no access by null pointer, there was no problem.

Although I tend to be gullible, I still try to double-check any information I may doubt. I started investigating the subject and eventually wrote a small article: "Reflections on the Null Pointer Dereferencing Issue".

Everything suggested that I had been right: One cannot write code like that. But I didn't manage to provide convincing proof for my conclusions and cite the relevant excerpts from the standard.

After publishing that article, I again was bombarded by protesting emails, so I thought I should figure it all out once and for all. I addressed language experts with a question to find out their opinions. This article is a summary of their answers.

About C

The '&podhd->line6' expression is undefined behavior in the C language when 'podhd' is a null pointer.

The C99 standard says the following about the '&' address-of operator (6.5.3.2 "Address and indirection operators"):

The operand of the unary & operator shall be either a function designator, the result of a [] or unary * operator, or an lvalue that designates an object that is not a bit-field and is not declared with the register storage-class specifier.

The expression 'podhd->line6' is clearly neither a function designator nor the result of a [] or * operator. It is an lvalue expression. However, when the 'podhd' pointer is NULL, the expression does not designate an object, since 6.3.2.3 "Pointers" says:

If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.

When "an lvalue does not designate an object when it is evaluated, the behavior is undefined" (C99 6.3.2.1 "Lvalues, arrays, and function designators"):

An lvalue is an expression with an object type or an incomplete type other than void; if an lvalue does not designate an object when it is evaluated, the behavior is undefined.

So, the same idea in brief:

When -> was executed on the pointer, it evaluated to an lvalue where no object exists, and as a result the behavior is undefined.

About C++

In the C++ language, things are absolutely the same. The '&podhd->line6' expression is undefined behavior here when 'podhd' is a null pointer.

The discussion at WG21 (232. Is indirection through a null pointer undefined behavior?), which I referred to in the previous article, introduces some confusion. The programmers participating in it insist that this expression is not undefined behavior. However, no one has found any clause in the C++ standard permitting the use of "podhd->line6" when "podhd" is a null pointer.

The "podhd" pointer fails the basic constraint (5.2.5/4, second bullet) that it must designate an object. No C++ object has nullptr as its address.

Summing it all up

struct usb_line6 *line6 = &podhd->line6;

This code is incorrect in both C and C++ when the podhd pointer equals 0. If the pointer equals 0, undefined behavior occurs.

The program running well is pure luck. Undefined behavior may take different forms, including program execution in just the way the programmer expected. It's just one of the special cases of undefined behavior, and that's all.

You cannot write code like that. The pointer must be checked before being dereferenced.
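
The fix is a matter of ordering: test the pointer first, and form the member address only afterwards. The sketch below illustrates this with small stand-in types. Line6, PodHd, and line6Of are hypothetical names for illustration, not the kernel's definitions.

```cpp
// Hypothetical stand-ins for the structures in the fragment above.
struct Line6 { int id; };
struct PodHd { Line6 line6; };

// Returns the address of the embedded member, or nullptr when there is no object.
// The null check comes first, so &podhd->line6 is evaluated only for a valid pointer.
Line6* line6Of(PodHd* podhd) {
    if (podhd == nullptr) {
        return nullptr;
    }
    return &podhd->line6;
}
```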

Additional ideas and links

When considering the idiomatic implementation of the 'offsetof()' operator, one must take into account that a compiler implementation is permitted to use what would be non-portable techniques to implement its functionality. The fact that a compiler's library implementation uses the null pointer constant in its implementation of 'offsetof()' doesn't make it OK for user code to use '&podhd->line6' when 'podhd' is a null pointer.
GCC can and does optimize on the assumption that undefined behavior never occurs, and would remove the null checks here; the kernel is compiled with a number of switches that tell the compiler not to do this. As an example, the experts refer to the article "What Every C Programmer Should Know About Undefined Behavior #2/3".
You may also find it interesting that a similar use of a null pointer was involved in a kernel exploit with the TUN/TAP driver. See "Fun with NULL pointers". The major difference that might cause some people to think the similarity doesn't apply is that in the TUN/TAP driver bug the structure field that the null pointer accessed was explicitly taken as a value to initialize a variable instead of simply having the address of the field taken. However, as far as standard C goes, taking the address of the field through a null pointer is still undefined behavior.
Is there any case when writing &P->m_foo where P == nullptr is OK? Yes, for example when it is an argument of the sizeof operator: sizeof(&P->m_foo).
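
Two of the points above can be shown in a few lines: user code that needs a member offset can use the standard offsetof macro from <cstddef> instead of a hand-written null-pointer expression, and sizeof does not evaluate its operand, so nothing is dereferenced inside it. The Sample type and the function name below are hypothetical and exist only for illustration.

```cpp
#include <cstddef>  // offsetof, std::size_t

struct Sample {     // hypothetical standard-layout type
    char tag;
    double value;
};

// The library-provided macro is the supported way to obtain a member offset.
constexpr std::size_t kValueOffset = offsetof(Sample, value);

// sizeof leaves its operand unevaluated, so no indirection through P happens
// here even though P is null; the result is simply the size of a double*.
std::size_t sizeOfMemberAddress() {
    Sample* P = nullptr;
    return sizeof(&P->value);
}
```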

Acknowledgements

This article has become possible thanks to the experts whose competence I can see no reason to doubt. I want to thank the following people for helping me in writing it:

Michael Burr is a C/C++ enthusiast who specializes in systems level and embedded software including Windows services, networking, and device drivers. He can often be found on the StackOverflow community answering questions about C and C++ (and occasionally fielding the easier C# questions). He has 6 Microsoft MVP awards for Visual C++.
Billy O'Neal is a (mostly) C++ developer and contributor to StackOverflow. He is a Microsoft Software Development Engineer on the Trustworthy Computing Team. He has worked at several security related places previously, including Malware Bytes and PreEmptive Solutions.
Giovanni Dicanio is a computer programmer, specialized in Windows operating system development. Giovanni wrote computer programming articles on C++, OpenGL and other programming subjects on Italian computer magazines. He contributed code to some open-source projects as well. Giovanni likes helping people solving C and C++ programming problems on Microsoft MSDN forums and recently on StackOverflow. He has 8 Microsoft MVP awards for Visual C++.
Gabriel Dos Reis is a Principal Software Development Engineer at Microsoft. He is also a researcher and a longtime member of the C++ community. His research interests include programming tools for dependable software. Prior to joining Microsoft, he was Assistant Professor at Texas A&M University. Dr. Dos Reis was a recipient of the 2012 National Science Foundation CAREER award for his research in compilers for dependable computational mathematics and educational activities. He is a member of the C++ standardization committee.

References

Wikipedia. Undefined Behavior.
A Guide to Undefined Behavior in C and C++. Part 1, 2, 3.
Wikipedia. offsetof.
LLVM Blog. What Every C Programmer Should Know About Undefined Behavior #2/3.
LWN. Fun with NULL pointers. Part 1, 2.

↧

The Last Line Effect

April 22, 2015, 7:21 am

I have studied a large number of errors caused by the Copy-Paste method, and I can assure you that programmers most often make mistakes in the last fragment of a homogeneous code block. I have never seen this phenomenon described in books on programming, so I decided to write about it myself. I called it the "last line effect".

Introduction

My name is Andrey Karpov and I do an unusual job - I analyze program code of various applications with the help of static analyzers and write descriptions of errors and defects I find. I do this for pragmatic and mercenary reasons because what I do is the way our company advertises its tools PVS-Studio and CppCat. The scheme is very simple. I find bugs. Then I describe them in an article. The article attracts our potential customers' attention. Profit. But today's article is not about the analyzers.

When carrying out analysis of various projects, I save bugs I find and the corresponding code fragments in a special database. By the way, anyone interested can take a look at this database. We convert it into a collection of html-pages and upload them to our website in the "Detected errors" section.

This database is unique indeed! It currently contains 1800 code fragments with errors and is waiting for programmers to study it and find recurring patterns among these errors. It may serve as a useful basis for much future research and for many manuals and articles.

I have never carried out any special investigation of the material gathered so far. One pattern, however, shows up so clearly that I decided to investigate it a bit deeper. You see, in my articles I have to write the phrase "note the last line" pretty often. It occurred to me that there had to be some reason behind it.

Last line effect

When writing program code, programmers often have to write a series of similar constructs. Typing the same code several times is boring and inefficient. That's why they use the Copy-Paste method: a code fragment is copied and pasted several times with further editing. Everyone knows what is bad about this method: you risk easily forgetting to change something in the pasted lines and thus giving birth to errors. Unfortunately, there is often no better alternative to be found.

Now let's speak of the pattern I discovered. I figured out that mistakes are most often made in the last pasted block of code.

Here is a simple and short example:

inline Vector3int32& operator+=(const Vector3int32& other) {
  x += other.x;
  y += other.y;
  z += other.y;
  return *this;
}

Note the line "z += other.y;". The programmer forgot to replace 'y' with 'z' in it.

You may think this is an artificial sample, but it is actually taken from a real application. Further in this article, I am going to convince you that this is a very frequent and common issue. This is what the "last line effect" looks like. Programmers most often make mistakes at the very end of a sequence of similar edits.
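
For comparison, here is a minimal sketch of the corrected operator together with a small check. The struct is a hypothetical stand-in for the real class, which the fragment above does not show in full, and `check_vector_add` is an illustrative helper name. The check deliberately uses a different value in every component, because the original bug stays invisible whenever `other.y` happens to equal `other.z`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the class in the article: only the three
// components used by operator+= are modelled.
struct Vector3int32 {
    std::int32_t x, y, z;

    // Corrected operator: every component reads the matching field of `other`.
    Vector3int32& operator+=(const Vector3int32& other) {
        x += other.x;
        y += other.y;
        z += other.z;  // the original fragment read other.y here
        return *this;
    }
};

// Small check with a different value in every component, so that a
// mixed-up field changes the result.
inline void check_vector_add() {
    Vector3int32 a{1, 2, 3};
    a += Vector3int32{10, 20, 30};
    assert(a.x == 11 && a.y == 22 && a.z == 33);
}
```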

I heard somewhere that mountain climbers often fall during the last few dozen meters of the ascent. Not because they are tired; they are simply too joyful about almost reaching the top - they anticipate the sweet taste of victory, get less attentive, and make some fatal mistake. I guess something similar happens to programmers.

Now a few figures.

Having studied the bug database, I singled out 84 code fragments that I found to have been written through the Copy-Paste method. Out of them, 41 fragments contain mistakes somewhere in the middle of copied-and-pasted blocks. For example:

strncmp(argv[argidx], "CAT=", 4) &&
strncmp(argv[argidx], "DECOY=", 6) &&
strncmp(argv[argidx], "THREADS=", 6) &&
strncmp(argv[argidx], "MINPROB=", 8)) {

The length of the "THREADS=" string is 8 characters, not 6.
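
One way to rule out a mismatched length is to let the compiler derive it from the literal. The following is only a sketch under that assumption: `has_prefix` and `check_prefix` are hypothetical helpers, not functions from the project quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns true when `s` begins with the string literal `prefix`.
// N is deduced from the literal's array type and includes the
// terminating '\0', so the compared length is N - 1.
template <std::size_t N>
bool has_prefix(const char* s, const char (&prefix)[N]) {
    return std::strncmp(s, prefix, N - 1) == 0;
}

inline void check_prefix() {
    assert(has_prefix("THREADS=8", "THREADS="));
    // With the hand-written length 6, only "THREAD" is compared, so an
    // unrelated option is accepted by mistake (strncmp reports a match).
    assert(std::strncmp("THREAD_COUNT=8", "THREADS=", 6) == 0);
    // The deduced length compares all 8 characters and rejects it.
    assert(!has_prefix("THREAD_COUNT=8", "THREADS="));
}
```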

In the other 43 cases, the mistake was found in the last copied code block.

Well, the number 43 looks just slightly bigger than 41. But keep in mind that there may be quite a lot of homogeneous blocks, so mistakes can be found in the first, second, fifth, or even tenth block. So we get a relatively smooth distribution of mistakes throughout blocks and a sharp peak at the end.

I assumed the number of homogeneous blocks to be 5 on average.

So it appears that the first 4 blocks contain 41 mistakes distributed throughout them; that makes about 10 mistakes per block.

And 43 mistakes are left for the fifth block!

To make it clearer, here is a rough diagram:

Figure 1. A rough diagram of mistake distribution in five homogeneous code blocks.

So what we get is the following pattern:

The probability of making a mistake in the last pasted block of code is 4 times higher than in any other block.
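
The arithmetic behind this estimate can be written out as a short sketch. The counts come from the text above; the constant and function names are made up for the illustration, and the result depends entirely on the assumed average of 5 blocks.

```cpp
#include <cassert>

// Figures taken from the article.
constexpr int kFragmentsTotal = 84;
constexpr int kMistakesInLastBlock = 43;
constexpr int kMistakesElsewhere = kFragmentsTotal - kMistakesInLastBlock;
constexpr int kAssumedBlocks = 5;  // the assumed average number of blocks

// Mistakes per block for the blocks before the last one.
constexpr double mistakes_per_other_block() {
    return static_cast<double>(kMistakesElsewhere) / (kAssumedBlocks - 1);
}

// How much more often the last block is hit than any other single block.
constexpr double last_block_ratio() {
    return kMistakesInLastBlock / mistakes_per_other_block();
}

static_assert(kMistakesElsewhere == 41, "84 fragments minus 43 leaves 41");
```

With these numbers, the other blocks receive 41 / 4 = 10.25 mistakes each, and 43 / 10.25 is roughly 4.2, which the text rounds to 4.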

I don't draw any grand conclusions from that. It's just an interesting observation that may be useful to know about for practical reasons - you will stay alert when writing the last fragments of code.

Examples

Now I only have to convince the readers that this is not my imagination but a real tendency. To prove my words, I will show you some examples.

I won't cite all the examples, of course - only the simplest or most representative ones.

Source Engine SDK

inline void Init( float ix=0, float iy=0,
                  float iz=0, float iw = 0 )
{
  SetX( ix );
  SetY( iy );
  SetZ( iz );
  SetZ( iw );
}

The SetW() function should be called at the end.

Chromium

if (access & FILE_WRITE_ATTRIBUTES)
  output.append(ASCIIToUTF16("\tFILE_WRITE_ATTRIBUTES\n"));
if (access & FILE_WRITE_DATA)
  output.append(ASCIIToUTF16("\tFILE_WRITE_DATA\n"));
if (access & FILE_WRITE_EA)
  output.append(ASCIIToUTF16("\tFILE_WRITE_EA\n"));
if (access & FILE_WRITE_EA)
  output.append(ASCIIToUTF16("\tFILE_WRITE_EA\n"));
break;

The last block and the one before it are identical.

ReactOS

if (*ScanString == L'\"' ||
    *ScanString == L'^' ||
    *ScanString == L'\"')

The last comparison repeats the first one: the same quotation-mark character is checked twice.

Multi Theft Auto

class CWaterPolySAInterface
{
public:
    WORD m_wVertexIDs[3];
};
CWaterPoly* CWaterManagerSA::CreateQuad (....)
{
  ....
  pInterface->m_wVertexIDs [ 0 ] = pV1->GetID ();
  pInterface->m_wVertexIDs [ 1 ] = pV2->GetID ();
  pInterface->m_wVertexIDs [ 2 ] = pV3->GetID ();
  pInterface->m_wVertexIDs [ 3 ] = pV4->GetID ();
  ....
}

The last line was pasted mechanically and is redundant. There are only 3 items in the array, so index 3 is out of bounds.

Source Engine SDK

intens.x=OrSIMD(AndSIMD(BackgroundColor.x,no_hit_mask),
                AndNotSIMD(no_hit_mask,intens.x));
intens.y=OrSIMD(AndSIMD(BackgroundColor.y,no_hit_mask),
                AndNotSIMD(no_hit_mask,intens.y));
intens.z=OrSIMD(AndSIMD(BackgroundColor.y,no_hit_mask),
                AndNotSIMD(no_hit_mask,intens.z));

The programmer forgot to replace "BackgroundColor.y" with "BackgroundColor.z" in the last block.

Trans-Proteomic Pipeline

void setPepMaxProb(....)
{
  ....
  double max4 = 0.0;
  double max5 = 0.0;
  double max6 = 0.0;
  double max7 = 0.0;
  ....
  if ( pep3 ) { ... if ( use_joint_probs && prob > max3 ) ... }
  ....
  if ( pep4 ) { ... if ( use_joint_probs && prob > max4 ) ... }
  ....
  if ( pep5 ) { ... if ( use_joint_probs && prob > max5 ) ... }
  ....
  if ( pep6 ) { ... if ( use_joint_probs && prob > max6 ) ... }
  ....
  if ( pep7 ) { ... if ( use_joint_probs && prob > max6 ) ... }
  ....
}

The programmer forgot to replace "prob > max6" with "prob > max7" in the last condition.

SeqAn

inline typename Value<Pipe>::Type const & operator*() {
  tmp.i1 = *in.in1;
  tmp.i2 = *in.in2;
  tmp.i3 = *in.in2;
  return tmp;
}

The last assignment reads in.in2 a second time; in.in3 was most likely intended.

SlimDX

for( int i = 0; i < 2; i++ )
{
  sliders[i] = joystate.rglSlider[i];
  asliders[i] = joystate.rglASlider[i];
  vsliders[i] = joystate.rglVSlider[i];
  fsliders[i] = joystate.rglVSlider[i];
}

The rglFSlider array should have been used in the last line.

Qt

if (repetition == QStringLiteral("repeat") ||
    repetition.isEmpty()) {
  pattern->patternRepeatX = true;
  pattern->patternRepeatY = true;
} else if (repetition == QStringLiteral("repeat-x")) {
  pattern->patternRepeatX = true;
} else if (repetition == QStringLiteral("repeat-y")) {
  pattern->patternRepeatY = true;
} else if (repetition == QStringLiteral("no-repeat")) {
  pattern->patternRepeatY = false;
  pattern->patternRepeatY = false;
} else {
  //TODO: exception: SYNTAX_ERR
}

'patternRepeatX' is missing in the very last block. The correct code looks as follows:

pattern->patternRepeatX = false;
pattern->patternRepeatY = false;

ReactOS

const int istride = sizeof(tmp[0]) / sizeof(tmp[0][0][0]);
const int jstride = sizeof(tmp[0][0]) / sizeof(tmp[0][0][0]);
const int mistride = sizeof(mag[0]) / sizeof(mag[0][0]);
const int mjstride = sizeof(mag[0][0]) / sizeof(mag[0][0]);

The 'mjstride' variable will always be equal to one. The last line should have been written like this:

const int mjstride = sizeof(mag[0][0]) / sizeof(mag[0][0][0]);
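
The effect is easy to reproduce in isolation. In this sketch the array dimensions are hypothetical, since the article does not show the declaration of 'mag'; only the two versions of the last line are compared.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical dimensions: the declaration is not shown in the article.
static float mag[4][5][6];

// Buggy form: numerator and denominator are the same expression,
// so the result is 1 whatever the dimensions are.
constexpr std::size_t buggy_mjstride = sizeof(mag[0][0]) / sizeof(mag[0][0]);

// Corrected form: size of one innermost row divided by the size of one element.
constexpr std::size_t fixed_mjstride = sizeof(mag[0][0]) / sizeof(mag[0][0][0]);

static_assert(buggy_mjstride == 1, "identical operands always divide to one");
static_assert(fixed_mjstride == 6, "innermost dimension of the assumed array");
```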

Mozilla Firefox

if (protocol.EqualsIgnoreCase("http") ||
    protocol.EqualsIgnoreCase("https") ||
    protocol.EqualsIgnoreCase("news") ||
    protocol.EqualsIgnoreCase("ftp") ||          <<<---
    protocol.EqualsIgnoreCase("file") ||
    protocol.EqualsIgnoreCase("javascript") ||
    protocol.EqualsIgnoreCase("ftp")) {          <<<---

The string "ftp" at the end is suspicious: the protocol has already been compared with it.

Quake-III-Arena

if (fabs(dir[0]) > test->radius ||
    fabs(dir[1]) > test->radius ||
    fabs(dir[1]) > test->radius)

The value from the dir[2] cell is left unchecked.

Clang

return (ContainerBegLine <= ContaineeBegLine &&
        ContainerEndLine >= ContaineeEndLine &&
        (ContainerBegLine != ContaineeBegLine ||
         SM.getExpansionColumnNumber(ContainerRBeg) <=
         SM.getExpansionColumnNumber(ContaineeRBeg)) &&
        (ContainerEndLine != ContaineeEndLine ||
         SM.getExpansionColumnNumber(ContainerREnd) >=
         SM.getExpansionColumnNumber(ContainerREnd)));

At the very end of the block, the "SM.getExpansionColumnNumber(ContainerREnd)" expression is compared to itself.

MongoDB

bool operator==(const MemberCfg& r) const {
  ....
  return _id==r._id && votes == r.votes &&
         h == r.h && priority == r.priority &&
         arbiterOnly == r.arbiterOnly &&
         slaveDelay == r.slaveDelay &&
         hidden == r.hidden &&
         buildIndexes == buildIndexes;
}

The programmer forgot about "r." in the last line.
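
One possible way to make this kind of slip harder is to list the compared members only once and compare them as a tuple. This is a sketch of that idea, not the project's code: the struct below is a reduced, hypothetical version of the class, and `key()` is an invented helper.

```cpp
#include <cassert>
#include <tuple>

// Reduced, hypothetical version of the class from the fragment above.
struct MemberCfg {
    int _id;
    int votes;
    bool hidden;
    bool buildIndexes;

    // Every compared member is named exactly once, so there is no
    // "r." prefix to forget on one side of a comparison.
    std::tuple<const int&, const int&, const bool&, const bool&> key() const {
        return std::tie(_id, votes, hidden, buildIndexes);
    }

    bool operator==(const MemberCfg& r) const { return key() == r.key(); }
};

inline void check_member_cfg() {
    MemberCfg a{1, 1, false, true};
    MemberCfg b{1, 1, false, false};  // differs only in buildIndexes
    assert(a == a);
    assert(!(a == b));  // the original comparison would report equality here
}
```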

Unreal Engine 4

static bool PositionIsInside(....)
{
  return
    Position.X >= Control.Center.X - BoxSize.X * 0.5f &&
    Position.X <= Control.Center.X + BoxSize.X * 0.5f &&
    Position.Y >= Control.Center.Y - BoxSize.Y * 0.5f &&
    Position.Y >= Control.Center.Y - BoxSize.Y * 0.5f;
}

The programmer forgot to make 2 edits in the last line. Firstly, ">=" should be replaced with "<="; secondly, minus should be replaced with plus.
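
A sketch of how the check could be written so that each axis goes through the same helper, which leaves only the axis name to edit when a line is copied. `Vec2` and `InRange` are hypothetical names, and the function signature is simplified because the original parameters are elided.

```cpp
#include <cassert>

struct Vec2 { float X, Y; };  // hypothetical minimal vector type

// True when `value` lies within `size` centred on `center`.
inline bool InRange(float value, float center, float size) {
    return value >= center - size * 0.5f &&
           value <= center + size * 0.5f;
}

inline bool PositionIsInside(Vec2 Position, Vec2 Center, Vec2 BoxSize) {
    return InRange(Position.X, Center.X, BoxSize.X) &&
           InRange(Position.Y, Center.Y, BoxSize.Y);
}

inline void check_position() {
    const Vec2 center{0.0f, 0.0f};
    const Vec2 box{2.0f, 2.0f};
    assert(PositionIsInside(Vec2{0.5f, 0.5f}, center, box));
    // Above the box: the original code has no upper bound on Y and accepts it.
    assert(!PositionIsInside(Vec2{0.0f, 5.0f}, center, box));
}
```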

Qt

qreal x = ctx->callData->args[0].toNumber();
qreal y = ctx->callData->args[1].toNumber();
qreal w = ctx->callData->args[2].toNumber();
qreal h = ctx->callData->args[3].toNumber();
if (!qIsFinite(x) || !qIsFinite(y) ||
    !qIsFinite(w) || !qIsFinite(w))

In the very last call of the function qIsFinite, the 'h' variable should have been used as an argument.
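
When the same test is applied to several values, passing them as a list removes the repeated call altogether. The sketch below uses the standard `std::isfinite` instead of Qt's qIsFinite so that it compiles without Qt; `all_finite` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <initializer_list>
#include <limits>

// True when every value in the list is finite (neither infinite nor NaN).
inline bool all_finite(std::initializer_list<double> values) {
    return std::all_of(values.begin(), values.end(),
                       [](double v) { return std::isfinite(v); });
}

inline void check_all_finite() {
    const double inf = std::numeric_limits<double>::infinity();
    double x = 1.0, y = 2.0, w = 3.0, h = inf;
    // Each variable appears once, so 'h' cannot be silently replaced by 'w'.
    assert(!all_finite({x, y, w, h}));
    h = 4.0;
    assert(all_finite({x, y, w, h}));
}
```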

OpenSSL

if (!strncmp(vstart, "ASCII", 5))
  arg->format = ASN1_GEN_FORMAT_ASCII;
else if (!strncmp(vstart, "UTF8", 4))
  arg->format = ASN1_GEN_FORMAT_UTF8;
else if (!strncmp(vstart, "HEX", 3))
  arg->format = ASN1_GEN_FORMAT_HEX;
else if (!strncmp(vstart, "BITLIST", 3))
  arg->format = ASN1_GEN_FORMAT_BITLIST;

The length of the "BITLIST" string is 7, not 3 characters.

Let's stop here. I hope the examples I have demonstrated are more than enough.

Conclusion

From this article you have learned that with the Copy-Paste method making a mistake in the last pasted block of code is 4 times more probable than in any other fragment.

It has to do with the specifics of human psychology, not professional skills. I have shown you in this article that even highly-skilled developers of such projects as Clang or Qt tend to make mistakes of this kind.

I hope my observation will be useful for programmers and perhaps urge them to investigate our bug database. I believe it will help reveal many regularity patterns among errors and work out new recommendations for programmers.

↧

Explicit offload for Quantum ESPRESSO

April 23, 2015, 9:01 am

Purpose

This code recipe describes how to get, build, and use the Quantum ESPRESSO code that includes support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture. This recipe focuses on how to run this code using explicit offload.

Code Access

Quantum ESPRESSO is an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudo potentials. The Quantum ESPRESSO code is maintained by Quantum ESPRESSO Foundation and is available under the GPLv2 licensing agreement. The code supports the offload mode of operation of the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document) in a single node and in a cluster environment.

To get access to the code and test workloads:

Download the latest Quantum ESPRESSO version from http://www.quantum-espresso.org/download/
Clone the linear algebra package libxphi from GitHub:
```
$ git clone https://github.com/cdahnken/libxphi
```

Build Directions

Untar the Quantum ESPRESSO tarball
```
$ tar xzf espresso-5.1.tar.gz
```

Source the Intel® compiler and Intel® MPI Library

$ source /opt/intel/composer_xe_2013_sp1.4.211/bin/compilervars.sh intel64
$ source /opt/intel/impi/latest/bin64/mpivars.sh

Change to the espresso directory and run the configure script

$ cd espresso-5.1
$ export SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64"
$ export LAPACK_LIBS="-mkl=parallel"
$ export BLAS_LIBS="-mkl=parallel"
$ export FFT_LIBS="-mkl=parallel"
$ export MPIF90=mpiifort
$ export AR=xiar
$ ./configure --enable-openmp

Edit make.sys so that it has the following configuration:

MANUAL_DFLAGS = -D__KNC_OFFLOAD
FLAGS =  -D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK  -D__OPENMP $(MANUAL_DFLAGS)
MOD_FLAG = -I<PATH_TO_PW> -I
MPIF90 = mpif90
CC = icc
F77 = ifort
BLAS_LIBS =  "-mkl=parallel"
BLAS_LIBS_SWITCH = external
LAPACK_LIBS = "-mkl=parallel"
LAPACK_LIBS_SWITCH = external
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
FFT_LIBS =  "-mkl=parallel"

You can add “-xHost -ansi-alias” to CFLAGS as well as FFLAGS.

Build the Quantum ESPRESSO PW binary
```
$ make pw -j16
```
You should now have bin/pw.x
Change to the directory into which you cloned libxphi and execute the build script. Make sure you do this in the same shell in which you sourced the Intel compilers and the Intel MPI Library.
```
$ cd libxphi
$ ./build-library.sh
```
You should now find two libraries: libxphi.so and libmkl_proxy.so

The build process is now complete.

Run Directions

A single Quantum ESPRESSO instance on a single node

The Quantum ESPRESSO binary compiled above initially has support for accelerated 3D FFT. Additionally, the library libxphi.so contains a number of linear algebra numerical routines invoked by Quantum ESPRESSO, particularly the numerically intensive ZGEMM BLAS3 routine for complex matrix-matrix multiplication. Instead of executing this routine via Intel® Math Kernel Library (Intel MKL), libxphi blocks the matrices and buffers them asynchronously to the card, where Intel MKL then executes the multiplication of the blocks and transfers the result back. When the Quantum ESPRESSO binary is created with the build instructions above, it will contain dynamic calls to the ZGEMM routine, which are usually satisfied by Intel MKL. To get offloaded ZGEMM in place, libxphi.so needs to be preloaded:

$ export LD_LIBRARY_PATH=$PATH_TO_LIBXPHI:$LD_LIBRARY_PATH
$ LD_PRELOAD="$PATH_TO_LIBXPHI/libxphi.so" ./pw.x <pw arguments>

The last line executes the Quantum ESPRESSO binary pw.x with offloaded ZGEMM support. To make this easier, we provide a shell script that facilitates this preloading and just takes the binary and its arguments as input, so that the execution of an offloaded run would look like this:

$ <PATH_TO_LIBXPHI>/xphilibwrapper.sh <PATH_TO_PW>/pw.x <pw arguments>

In this case Quantum ESPRESSO will execute a single instance with OpenMP* threads (by default as many as you have cores) and offload FFT and ZGEMM to all the cores of the Intel Xeon Phi coprocessor.

Tuning the linear algebra offloading

To tune the offloading process, we need to understand the ZGEMM routine, which executes matrix-matrix multiplication

C=αA∙B+βC

where α and β are complex numbers and C, A, and B are matrices of dimension M×N, M×K, and K×N, respectively. The library libxphi.so blocks this matrix-matrix multiplication, so that the resulting block-matrix multiplication consists of smaller blocks that are continuously streamed to the coprocessor and back. The size of those blocks is defined by three parameters, m, n, and k, where m×n, m×k, and k×n are the dimensions of the C-, A-, and B-blocks, respectively. By default, libxphi blocks the matrices in sizes of m=n=k=1024. You can experiment with these values to achieve better performance, depending on your workload size. We have found that making m and n somewhat larger (m=n=2048) and varying the size of k (between 512 and 1024) can yield very good results.
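
To make the blocking idea concrete, here is a minimal host-only sketch in C++. It is not libxphi's implementation: the `Matrix` type and `blocked_zgemm` are hypothetical, the block products run as plain loops instead of being streamed to a coprocessor, and nothing is asynchronous. It only shows how C = αA∙B + βC decomposes into independent block products whose sizes are set by m, n, and k (called bm, bn, and bk below).

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Row-major dense matrix; a hypothetical helper type for this sketch.
struct Matrix {
    std::size_t rows, cols;
    std::vector<cplx> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    cplx& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const cplx& at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Computes C = alpha*A*B + beta*C one block at a time.
// bm, bn, bk correspond to the block sizes m, n, k described in the text.
inline void blocked_zgemm(cplx alpha, const Matrix& A, const Matrix& B,
                          cplx beta, Matrix& C,
                          std::size_t bm, std::size_t bn, std::size_t bk) {
    const std::size_t M = A.rows, K = A.cols, N = B.cols;
    assert(B.rows == K && C.rows == M && C.cols == N);
    assert(bm > 0 && bn > 0 && bk > 0);

    // Apply beta once; the block products are then accumulated into C.
    for (cplx& c : C.data) c *= beta;

    for (std::size_t i0 = 0; i0 < M; i0 += bm) {
        const std::size_t i1 = std::min(i0 + bm, M);
        for (std::size_t j0 = 0; j0 < N; j0 += bn) {
            const std::size_t j1 = std::min(j0 + bn, N);
            for (std::size_t k0 = 0; k0 < K; k0 += bk) {
                const std::size_t k1 = std::min(k0 + bk, K);
                // One block product: the unit of work that an offloading
                // library would send to the device.
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t j = j0; j < j1; ++j) {
                        cplx acc(0.0, 0.0);
                        for (std::size_t k = k0; k < k1; ++k)
                            acc += A.at(i, k) * B.at(k, j);
                        C.at(i, j) += alpha * acc;
                    }
            }
        }
    }
}
```

Larger blocks mean fewer, bigger transfers per block product, which is one reason the block sizes are worth tuning per workload.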

Block size can be set via the environment variables QE_MIC_BLOCKSIZE_M, QE_MIC_BLOCKSIZE_N, and QE_MIC_BLOCKSIZE_K. For example:

$ export QE_MIC_BLOCKSIZE_M=2048
$ export QE_MIC_BLOCKSIZE_N=2048
$ export QE_MIC_BLOCKSIZE_K=512

An additional setting is required to avoid the offloading of small matrices, which might be more efficiently computed on the host instead of the coprocessor. With QE_MIC_OFFLOAD_THRESHOLD you can define the minimal number of floating point operations a matrix must have in order to get offloaded. The setting

$  export QE_MIC_OFFLOAD_THRESHOLD=20

achieves good results.

Partitioning the coprocessor

Partitioning the coprocessor leverages the advantages of multi-processing vs. multi-threading. It is somewhat similar to running Message Passing Interface (MPI) ranks on the coprocessor (a.k.a. symmetric usage model), although the MPI ranks are only on the host. Varying the number of ranks on the host can be used to partition each coprocessor into independent sets of threads. Independent thread partitions are set up with the KMP_PLACE_THREADS environment variable. In addition, the environment variable OFFLOAD_DEVICES makes it possible to use multiple coprocessors within the same system. Of course, there is nothing wrong with using OpenMP instead of this proposed method; however, we found that partitioning the coprocessor unlocks more performance, because it trades the implicit locks at the end of parallel regions for completely independent executions. To ease the tuning process, a script is provided that generates the appropriate “mpirun” command line.

$ ~/mpirun/mpirun.sh -h
-n: list of comma separated node names
-p: number of processes per socket (host)
-q: number of processes per mic (native)
-s: number of sockets per node
-d: number of devices per node
-e: number of CPU cores per socket
-t: number of CPU threads per core
-m: number of MIC cores per device
-r: number of MIC cores reserved
-u: number of MIC threads per core
-a: affinity (CPU) e.g., compact
-b: affinity (MIC) e.g., balanced
-c: schedule, e.g., dynamic
-0: executable (rank-0)
-x: executable (host)
-y: executable (mic)
-z: prefixed mic name
-i: inputfile (<)
-w: wrapper
-v: dryrun

The script “mpirun.sh” inspects the system hardware in order to provide defaults for all of the above arguments. It then launches “mpirun.py,” which builds and launches the command line for “mpirun.” This initial inspection, for example, avoids using multiple host sockets when only one coprocessor is attached to the system (which avoids data transfers to a “remote” coprocessor). Any default provided by the launcher script “mpirun.sh” can be overridden at the command line, while all other default settings still apply. Please note that the script also supports symmetric execution (“-y”, etc.), which is discussed here.

Here is an example of running QE with four partitions on each of the coprocessor(s):

$ ./mpirun.sh -p4 \
    -w <PATH_TO_LIBXPHI>/xphilibwrapper.sh \
    -x <PATH_TO_PW>/pw.x \
    -i <input-file.in>

Any argument passed at the end of the command line is simply forwarded to the next underlying mechanism if not consumed by option processing. If you need to pass arguments to the executable using “<”, you can use the script’s “-i” option; otherwise, options for the executable can be simply appended to the above command line.

The number of ranks per host socket (“-p”) divides not only the number of cores per host processor but also the number of cores of each coprocessor. Therefore, some ratios leave a few cores unused. On the other hand, the coprocessor usually comes with more cores than a single host socket/processor, so a few unused cores are likely acceptable; in any case, the number of partitions is a subject of tuning.
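
The division described above can be illustrated with a few lines of C++. This is only a sketch of the arithmetic, not of what “mpirun.sh” actually computes: `split` and `Partition` are hypothetical, the 12-core host socket is an assumption, and the 61 coprocessor cores are taken from the platform configuration listed further down.

```cpp
#include <cassert>

// Result of dividing a set of cores evenly among partitions.
struct Partition {
    int cores_per_rank;  // cores available to each rank
    int unused_cores;    // cores left over by the division
};

// Splits `cores` evenly among `ranks` partitions.
inline Partition split(int cores, int ranks) {
    assert(cores >= 0 && ranks > 0);
    return Partition{cores / ranks, cores % ranks};
}

inline void check_split() {
    // Assumed 12-core host socket with "-p4": 3 cores per rank, none unused.
    assert(split(12, 4).cores_per_rank == 3 && split(12, 4).unused_cores == 0);
    // 61 coprocessor cores among 4 ranks: 15 cores each, 1 core unused.
    assert(split(61, 4).cores_per_rank == 15 && split(61, 4).unused_cores == 1);
    // A ratio that does not divide the host socket evenly: 2 cores unused.
    assert(split(12, 5).cores_per_rank == 2 && split(12, 5).unused_cores == 2);
}
```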

Performance

Figure 1: Performance of Quantum ESPRESSO executing the GRIR443 benchmark on 16 Xeon E5-2697v2 and 16 Xeon Phi 7120A.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Application parameterization

-npool=2,
2 MPI ranks/socket
6 threads/MPI rank

Platform configuration

Host configuration

Intel® Xeon® processor E5-2697 v2
64GB DDR3-1600
RHEL 6.4
Intel® Turbo Boost Technology /EIST/SMT/NUMA enabled

MIC configuration

7120A, 61cores, 1.238GHz
MPSS 2.1.6720-16
ECC enabled, Turbo disabled

Software configuration

Icc 14.0.0 update 1, Intel MPI Library 14.1.1.036

↧

Let's Play a Game - find bugs in popular open-source projects

April 24, 2015, 2:14 am

The authors of the PVS-Studio static code analyzer invite programmers to test their eyesight by trying to find errors in C/C++ code fragments.

Code analyzers work tirelessly and are able to find many bugs that can be difficult to notice. We chose some code fragments in which we had found errors using PVS-Studio.

The quiz is not intended to check your knowledge of the C++ language. There are many good and interesting tests for that; for instance, we would recommend this C++ Quiz. We made our test just for fun.

We quite frequently hear the opinion that code analyzers are pointless tools: a misplaced parenthesis or comma can be found in five seconds, while an analyzer cannot find difficult logical errors, so such a tool is useful only for students.

We decided to troll these people. The test has a time limit. We ask them to find an error in five seconds. Well, OK, not in five seconds, but in a minute. Fifteen randomly selected problems are shown. Every solved problem is worth one point, but only if the answer is given within one minute.

We want to stress that we are not talking about syntax errors. We found all these code fragments in open-source projects that compile flawlessly. Let us explain with a pair of examples how to point out the correct answer.

First example. Suppose you get this code:

The bug here is highlighted in red. Of course, there is no such highlighting in a quiz problem.

The programmer accidentally made a typo and wrote index 3 instead of index 2. As you move the mouse cursor, fragments of code such as words and numbers are highlighted. Point the cursor at the number 3 and press the left mouse button.

This would be the correct answer.

Second example. It is not always possible to point out the error exactly.

The buffer size should be compared with the number 48. An extra sizeof operator was put there by accident. As a result, the buffer size is compared with the size of the int type.

In my opinion, the error is in the sizeof operator, and it has to be pointed out to score a correct answer. However, without seeing the whole source, one could also reason differently: the sizeof operator should have evaluated the size of some buffer, but accidentally evaluates the size of the macro's value. In that reading, the error is in the use of SSL3_MASTER_SECRET_LENGTH.

In this case, the answer is scored no matter which you choose: sizeof or SSL3_MASTER_SECRET_LENGTH.
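
Since the quiz code itself is not reproduced here, the following sketch only illustrates the mechanism. `MASTER_SECRET_LENGTH` is a hypothetical stand-in for the macro, defined as 48 to match the number mentioned above, and `big_enough` is an invented helper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for SSL3_MASTER_SECRET_LENGTH.
#define MASTER_SECRET_LENGTH 48

// sizeof applied to the macro measures the type of the literal 48,
// which is int, and does not yield 48.
constexpr std::size_t wrong_limit = sizeof(MASTER_SECRET_LENGTH);

// The intended limit is the value of the macro itself.
constexpr std::size_t right_limit = MASTER_SECRET_LENGTH;

// True when a buffer of `size` bytes can hold the secret.
constexpr bool big_enough(std::size_t size) {
    return size >= right_limit;
}

static_assert(wrong_limit == sizeof(int), "sizeof sees an int literal");
static_assert(right_limit == 48, "the macro expands to 48");
```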

Good luck! You can start the game.

Footnote.

The test does not support mobile devices: it is very easy to miss with a finger. We are working on a new version of the tests with better support for mobile devices, new problems to solve, and so on. However, it is not implemented yet.

↧

Space Apps Challenge SF: Where the Sky Was Not the Limit

April 24, 2015, 7:13 pm

On April 10th, 2015, I was fortunate to travel from Hillsboro, Oregon to San Francisco, California to take part in the International NASA Space Apps Challenge hosted at Constant Contact. The challenge was held over two days in 133 cities around the world and focused on 4 themes: Earth, outer space, humans, and robotics.

On the first day, Intel donated Intel Edison platforms and sensor kits to all the participants interested in using them during the challenge and beyond, because who said creativity is limited to two days! I had the pleasure of distributing the kits to the participants while getting to know them in the process and being inspired by their backgrounds, motivation, creativity, and excitement. The inspiration and excitement of the event were not limited to our location. The participants spread it by tweeting about their projects and the event on both days. We even kept in touch via Twitter with challenge participants in various locations worldwide, especially with my colleague Wai Lan, who supported the event in NYC and wrote a great blog about his experience there.

On day two of the challenge, the teams presented their projects. The projects were nothing short of amazing. The teams' imagination was limitless and truly, the sky was not the limit. I was particularly happy to learn that the top 2 projects that qualified for the global competition used Intel technology. Team ScanSat and Team AirOS, you rocked!

Team ScanSat:

Team ScanSat members are Anand Biradar (aerospace engineer), Krishna Sadasivam (computer engineer), Sheen Kao (mechanical engineer), and Robert Chen (computer scientist). Their project solved the Deep Space CamSat Challenge. The team came to the event prepared with an idea. They wanted to develop the docking and magnetic propulsion mechanisms for a CubeSat that is docked with a Dragon-sized spacecraft. Utilizing magnetic and ion propulsion, ScanSat should be capable of undocking and re-docking with the main craft autonomously. Since the ScanSat is supposed to be equipped with a camera, it can capture images of the spacecraft as it physically approaches interesting phenomena, as well as perform image processing to analyze the exterior.

When the members arrived at the location, they were pleasantly surprised to find out that Intel was providing hardware. As a result, they were able to expand their original idea of just developing the docking and magnetic propulsion mechanisms of ScanSat to creating an actual demo as a proof of concept. The end result was stellar. Using the hardware below, they were able to develop a CubeSat that can steer in all directions, following a red light that represents the mothership.

The list of hardware components used for the demo:

1 x Intel® Edison with Arduino Breakout Kit
1 x Base Shield v2 from the Grove starter kit
4 x Grove - Smart Relay
1 x Camera
1 x DC Motor
1 x Lunchbox
1 x Tub
Water

On the software side, they used:

Python
OpenCV image processing libraries
Libmraa for i/o control - https://github.com/intel-iot-devkit/mraa

Source Code: https://github.com/ksivam/scansat

Video Demo: https://www.youtube.com/watch?v=CDbrzUlAxt4

Team AirOS:

Team AirOS members are Patrick Chamelo, Mario Roosiaas, Maria Rossiaas, Karl Aleksander Kongas, David Bradley, Marc Seitz, and Scott Mobley. Their project solved the Space Wearables: Designing for Today's Launch & Research Stars Challenge. Inspired by Star Wars, they developed an augmented reality platform designed for gesture, voice, and maximum awareness of an astronaut's surroundings. It feeds live sensor data into the user's HUD, allows instant video communications with other astronauts, remote teleconferencing, voice control, and AI assistance.

The team was very creative and resourceful in combining multiple technologies and tools for the next generation of space wearables. In particular, the project pipes video straight into the Oculus Rift, which creates a pseudo augmented reality environment. They also overlay a GUI on the video, which gets live temperature updates and detects flames using Intel's Edison. Moreover, it detects gesture controls using the Leap Motion API and takes advantage of IBM's Watson Instant Answers to support question answering and speech-to-text communication.

I was honored to be one of the first to test AirOS and let me tell you.... It was fascinating, I loved it!!

The list of hardware components used for the demo:

1 x Oculus Rift
1 x Intel Edison
1 x Temperature Sensor
1 x Flame Sensor
1 x Leap Motion
1 x Camera

Source Code: https://github.com/badave/airos.git

Video Demo: https://www.youtube.com/embed/xfqIYuQEkqM

A big THANK YOU to the entire SpaceRocks Apps team who did a FANTASTIC job organizing the event and allowing me to be part of it. I am already counting down the days to the next one.... Is it 2016 yet??!!
