Apr 8, 2021
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
Windows 10, ARM64 and Unreal Engine? On the same laptop? Is that a thing?
It soon will be. In this blog, I’ll show how you can get started porting your UE4 games and graphics to Windows 10 on ARM in order to prepare for a targeted version of Unreal Engine.
It shouldn’t be that much of a surprise. Visual C++ compilers and libraries for ARM64 have been available since Visual Studio 2017 version 15.9. And OEMs like Acer, HP, Lenovo, Microsoft and Samsung have been selling Windows 10 on ARM devices powered by the Qualcomm Snapdragon compute platform.
What may surprise you, though, is that Windows 10 on ARM is suitable for a lot more than just word processing, web browsing and checking email. Miguel Nunes demonstrated Zoom on a Windows on Snapdragon 8cx device with up to 3x-longer battery life than a comparable x86-powered laptop. And Windows on ARM support for x64 apps on the Surface Pro X is here as well.
Even more surprising is the performance you’ll see when you start porting your UE games to ARM64 and running them on a laptop powered by Snapdragon compute platforms.
But maybe the biggest surprise is that you won’t have to run in emulation.
The icing on the cake: Native ARM64 support for Unreal Engine is coming
Windows 10 on ARM supports x86 (32-bit) and x64 (64-bit) targets with emulation, which gives you a quick way to deploy. Recompiling your code as native ARM64 apps opens the doors to faster processing, more-efficient power consumption and all the advantages of the Snapdragon compute platform and the Qualcomm Adreno GPU.
That’s why we engaged ForwardXP to create an ARM64 version of the venerable Infiltrator Demo as a proof of concept. While this demo is several years old now, it was created for the purpose of showcasing the capabilities of Unreal Engine 4 with high-end content. To this day, the demo still demands a lot from even a high-end, discrete PC GPU for credible, real-time performance.
We put ourselves in your shoes and decided to optimize the content of the demo itself rather than the core engine. Why? Because it better represented the journey you would take in porting your games to ARM64 and in scaling the performance for a wide range of hardware.
Besides optimizations to the demo content, the project required development work to add Windows 10 ARM64 support to Unreal Engine 4.23. For more information on targeting Unreal Engine for Windows on ARM64, visit the Unreal Engine for Windows on Arm64 tutorial in our Game Developer Guide.
Changes to UE to support Snapdragon
Unreal Engine 4 uses a custom build system written in C# called Unreal Build Tool (UBT). To add an ARM64 architecture variant to the Windows platform, most of the changes in the source were updates to the UBT.
UE4 already supported Microsoft’s ARM-based HoloLens. While UBT manages it as its own platform, it is very similar to a Windows platform build. As a result, the build system logic could be adapted to support ARM64 for Windows.
Existing HoloLens and Android support allowed most of the C++ code to compile for ARM64 without significant changes. The codebase already had support for NEON SIMD intrinsics.
Thread affinity configuration
ARM big.LITTLE architecture requires that thread affinity be set up carefully, to maximize use of the high-performance Gold cores (the Silver cores are numbered 0 to 3, the Gold cores 4 to 7):
- The main game thread and the rendering threads are configured to run exclusively on the Gold cores.
- The audio thread and the worker thread pool can be scheduled on any of the cores (including Silver cores).
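The policy above can be sketched with Win32 affinity masks. This is a minimal sketch, not the engine's implementation: it assumes the core numbering given above (Silver 0 to 3, Gold 4 to 7), and ThreadRole, AffinityForRole and PinCurrentThread are hypothetical names. SetThreadAffinityMask and GetCurrentThread are the standard Win32 calls; on other platforms the pinning helper does nothing.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Build an affinity bitmask covering logical cores [first, last], inclusive.
inline std::uint64_t CoreRangeMask(unsigned first, unsigned last) {
    std::uint64_t mask = 0;
    for (unsigned core = first; core <= last; ++core) {
        mask |= (std::uint64_t{1} << core);
    }
    return mask;
}

// Assumed layout from the text: Silver cores are 0-3, Gold cores are 4-7.
const std::uint64_t kSilverMask  = CoreRangeMask(0, 3);
const std::uint64_t kGoldMask    = CoreRangeMask(4, 7);
const std::uint64_t kAnyCoreMask = kSilverMask | kGoldMask;

// Hypothetical roles used only in this sketch.
enum class ThreadRole { Game, Rendering, Audio, Worker };

// Game and rendering threads stay on Gold cores; audio and workers may run anywhere.
inline std::uint64_t AffinityForRole(ThreadRole role) {
    switch (role) {
        case ThreadRole::Game:
        case ThreadRole::Rendering:
            return kGoldMask;
        default:
            return kAnyCoreMask;
    }
}

// Applies the mask to the calling thread. Returns false on failure or on non-Windows builds.
inline bool PinCurrentThread(ThreadRole role) {
#ifdef _WIN32
    const DWORD_PTR mask = static_cast<DWORD_PTR>(AffinityForRole(role));
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)role;
    return false;
#endif
}
```

A thread would call PinCurrentThread(ThreadRole::Rendering) once at startup. Letting audio and worker threads use every core keeps the Silver cores busy without taking Gold-core time from the frame-critical threads.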
The Demo uses the UE4 DirectX 12 render hardware interface (RHI). This is controlled by passing -d3d12 in the command line options, and it can also be configured in the project’s .ini settings.
-prime_render command line option
The ARM64 configuration also supports a new command line option, -prime_render, which affects two of the engine’s rendering-related threads:
- Rendering thread: Responsible for the scene graph setup and traversal, and most of the engine work.
- RHI thread: Manages the submission of the render command streams with the DirectX 12 API.
-prime_render locks the “rendering thread” exclusively to core 6 and the “RHI thread” exclusively to core 7. That leaves the game thread floating between cores 4 and 5. If that command line option is not used, or if the DirectX 11 RHI is active, the threads mentioned here all float among the four Gold cores, based on the Windows scheduler heuristics.
Using the -prime_render option in the Infiltrator Demo gave slightly better performance and smoother FPS. Note, though, that this command line parameter may not be applicable outside of GPU-bound demonstration situations like Infiltrator. A game project with less rendering work and more game thread load might not benefit from this setting.
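Here is a minimal sketch of the core assignment that -prime_render describes. It is a sketch under stated assumptions, not the engine code: RenderAffinityPlan, HasFlag and PlanAffinity are hypothetical names, and the sketch treats the DirectX 12 RHI as active only when -d3d12 is on the command line (it ignores the .ini route mentioned earlier).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Affinity masks for the three threads discussed in the text (hypothetical struct).
struct RenderAffinityPlan {
    std::uint64_t gameThread;
    std::uint64_t renderingThread;
    std::uint64_t rhiThread;
};

// Bitmask with only the given logical core set.
inline std::uint64_t CoreBit(unsigned core) {
    return std::uint64_t{1} << core;
}

// True if the exact flag appears among the command line arguments.
inline bool HasFlag(const std::vector<std::string>& args, const std::string& flag) {
    for (const std::string& arg : args) {
        if (arg == flag) {
            return true;
        }
    }
    return false;
}

inline RenderAffinityPlan PlanAffinity(const std::vector<std::string>& args) {
    // Gold cores are 4-7, as described earlier in the post.
    const std::uint64_t gold = CoreBit(4) | CoreBit(5) | CoreBit(6) | CoreBit(7);

    // Simplifying assumption: the D3D12 RHI is selected only via -d3d12.
    const bool prime = HasFlag(args, "-prime_render") && HasFlag(args, "-d3d12");
    if (!prime) {
        // Default: all three threads float among the Gold cores.
        return {gold, gold, gold};
    }
    // -prime_render: rendering thread on core 6, RHI thread on core 7,
    // game thread floats between cores 4 and 5.
    return {CoreBit(4) | CoreBit(5), CoreBit(6), CoreBit(7)};
}
```

For example, PlanAffinity({"-d3d12", "-prime_render"}) gives the rendering thread mask 0x40 and the RHI thread mask 0x80, while any other combination leaves all three threads on 0xF0.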
In Visual Studio, the ARM64 NEON intrinsics are available through arm64_neon.h. See Compiler intrinsics for a comparison of available intrinsics across the different architectures.
UE4 uses NEON SIMD intrinsics to implement the Vector* functions used across the engine. This is implemented in UnrealMathNeon.h (see UnrealMathSSE.h for the SSE/SSE4 equivalent on the x86_64 architecture).
However, some functions in UE4 have SSE implementations but no NEON equivalent at this time. They are provided by UnrealMathSSE.h, UnrealPlatformMathSSE.h and UnrealPlatformMathSSE4.h, and include the following:
- FMath::InvSqrt, FMath::InvSqrtEst
FMath::InvSqrt is used in vector and matrix code. The quaternion implementation uses VectorReciprocalSqrt; there is a code path for quaternion computations that uses FMath::InvSqrt but it is not used on ARM64.
The functions that lack a NEON implementation have not been flagged in performance profiles during work on the Infiltrator Demo. It is possible that CPU-bound cases would get some minor benefit from further work to expand NEON intrinsics usage in the engine.
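For a sense of what expanding NEON usage could look like, here is a minimal sketch of a reciprocal square root built from the NEON estimate intrinsic and two Newton-Raphson refinement steps. InvSqrtSketch is a hypothetical name and this is not UE4 code. The intrinsics vdup_n_f32, vrsqrte_f32, vrsqrts_f32, vmul_f32 and vget_lane_f32 are standard NEON intrinsics; on builds without NEON the sketch falls back to the scalar formula.

```cpp
#include <cmath>

#if defined(_M_ARM64)
  #include <arm64_neon.h>   // MSVC header for ARM64 NEON intrinsics
  #define SKETCH_HAS_NEON 1
#elif defined(__ARM_NEON)
  #include <arm_neon.h>     // GCC/Clang header
  #define SKETCH_HAS_NEON 1
#else
  #define SKETCH_HAS_NEON 0
#endif

// Hypothetical helper: approximates 1 / sqrt(x) for x > 0.
inline float InvSqrtSketch(float x) {
#if SKETCH_HAS_NEON
    const float32x2_t v = vdup_n_f32(x);
    // Low-precision hardware estimate of 1/sqrt(v).
    float32x2_t e = vrsqrte_f32(v);
    // vrsqrts_f32(a, b) computes (3 - a*b) / 2, so each line below is one
    // Newton-Raphson step: e = e * (3 - v*e*e) / 2.
    e = vmul_f32(e, vrsqrts_f32(vmul_f32(v, e), e));
    e = vmul_f32(e, vrsqrts_f32(vmul_f32(v, e), e));
    return vget_lane_f32(e, 0);
#else
    // Scalar fallback for builds without NEON.
    return 1.0f / std::sqrt(x);
#endif
}
```

The number of refinement steps trades accuracy for speed. A lower-precision estimate variant could use one step or none, but the accuracy each variant needs is an engine decision this sketch does not try to settle.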
XAudio API on Windows 10 ARM64
UE4 on Windows x86_64 uses XAudio 2.7. This older XAudio API is used for backwards compatibility (UE4 compiles against an older Windows SDK and supports back to Windows 7).
Windows 10 on ARM64 offers only the newer XAudio 2.9. That does not change or restrict functionality, but the build system and audio module code were modified to initialize against the newer version.
Tuning and optimizing UE4 post-processing
To create as near to a 1:1 experience as possible on ARM64, all post-processing effects were left in place. However, their render quality was lowered across the board to reach the desired 30 FPS target.
Optimizing content for Windows 10 on ARM64
We selected a scene from Infiltrator Demo (check out the video from Unreal Engine) that takes place in a corridor. Our benchmarks showed that the number of draw calls (approximately 1960) was the single greatest performance limiter on the Snapdragon device. Our goal, then, was to reduce draw calls wherever possible for acceptable 720p playback at 30 FPS average on ARM64.
TL;DR: From 1960 draw calls down to 1210
Here are the optimizations we made to the content of the Infiltrator Demo:
Particles were reduced across the board. Particle effects closest to the cinematic camera were given priority and reduced the least, and the reductions became increasingly aggressive as the distance to the camera increased. This was especially necessary in the panning shot of the long construction belt into the distance.
The majority of particle emitters were completely deactivated, for one of the following reasons:
- Their visual impact was either very subtle or, in some cases, not noticeable at all because of their distance from the camera. These included water drips, light coronas, and steam or vapor.
- Their cost was simply too high and a similar effect could be obtained with fewer active emitters. This primarily applied to the overlapping steam emitters, which dramatically affected shader complexity and draw calls.
The emitters that were kept active had an obvious, easily visible impact on the scene, like emissive sparks. However, even these emitters had their spawn counts reduced and some of their secondary effects, like faint smoke, were deactivated.
Scrolling transparent materials
Because transparency and overlapping planes dramatically increase overdraw and draw calls, many of the planes that used scrolling transparent materials were removed, especially where those planes created overlapping layers. That affected fog and steam effects the most.
Almost every mesh in the scene was optimized to reduce polygon counts. In many cases, we merged clusters of meshes together into single meshes to reduce draw calls.
To improve performance and reduce memory footprint, we used Simplygon to merge and optimize meshes and to combine multiple materials.
The meshes shown below have a total of 31,569 triangles and 19,910 vertices. They are all part of a single piece of equipment instanced multiple times and are seen briefly to the side of the main corridor as the camera pans at the beginning of Infiltrator Demo.
The image below shows the resulting single mesh after merging and reducing the meshes above.
With minimal impact to in-scene quality, we reduced the number of triangles to 13,876 and vertices to 13,387 – a reduction of 17,693 triangles (56%) and 6,523 vertices (32.76%).
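The percentages above are easy to reproduce, and the same arithmetic is handy for tracking your own mesh budgets. In this minimal sketch, Reduction and ComputeReduction are hypothetical names, and the inputs are the triangle and vertex counts quoted above.

```cpp
// Absolute and relative savings between a "before" and "after" count.
struct Reduction {
    long removed;     // before - after
    double percent;   // removed as a percentage of before
};

// Assumes before > 0.
inline Reduction ComputeReduction(long before, long after) {
    const long removed = before - after;
    const double percent =
        100.0 * static_cast<double>(removed) / static_cast<double>(before);
    return {removed, percent};
}
```

ComputeReduction(31569, 13876) yields 17,693 triangles removed, or about 56%, and ComputeReduction(19910, 13387) yields 6,523 vertices removed, or about 32.76%, which matches the figures in the text.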
Those savings snowballed across dozens of individual meshes in the level, yielding the largest performance gains.
All lights in the scene were changed from Dynamic to Static.
All texture sizes were reduced to balance performance versus image quality.
Settings for render quality were as follows:
Of those settings, reducing the quality of Shadows and Anti-Aliasing resulted in the greatest performance gains. Note also that the scene is dark overall, so the changes did not dramatically affect visual quality.
To optimize objects that appeared at a greater distance from the camera, normal maps were removed and meshes were more aggressively simplified and combined (in some cases, partially culled). These optimizations were essential for the long, establishing shots down the construction corridor towards the core tower.
After implementing all the optimizations described above, the number of draw calls in the selected sequence fell from 1960 to 1210, a reduction of 750 draw calls (about 38%). A heatmap of the scene reveals one remaining hotspot at the center, which is not shown in-camera.
Those are the optimizations we applied to the demo content. Now you can start optimizing your own content for Windows on Snapdragon. I think you’ll find the tips above helpful, especially if you’re accustomed to developing for the PC or console world of discrete graphics cards and mountains of RAM.
Now it’s your turn. You can now — yes, right now — use Visual Studio on ARM to build and run high-end, GPU-intensive graphics on Snapdragon-powered devices. Have a look at our Game Developer Guides for the tutorial on Windows on ARM64. You can build the UBT to target ARM64 and then compile your own UE games with Visual Studio on Windows.
Let us know what else you need to get your games into Windows on Snapdragon.