Aug 6, 2020
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
Like most of us, your experience with human-machine interaction (HMI) probably began with 2D desktop graphical user interfaces (GUIs). More recently, HMI has grown to encompass movements and gestures made possible by touchscreen sensors on mobile devices. HMI is now entering a whole new era as extended reality (XR) gains prevalence in the enterprise.
XR immerses users in a 3D environment with scale and depth, which brings both new possibilities and new design considerations for user interfaces. It also brings several advantages over traditional 2D desktop GUIs, namely more screen space, fewer distractions from occluded elements, and greater flexibility to spatially organize tasks. The added third dimension also more closely resembles how we perceive the real world.
In this blog, we will briefly look at how basic interactions with GUI menus, keyboard input, and pointing can be achieved in XR. We will also share some key design considerations and essential human movements for these interactions. Finally, we will show you how perception technologies powered by our recently announced Qualcomm® Snapdragon™ XR2 5G Platform can help make it a reality.
Given the inherent 3D nature of XR and the user’s ability to control both their position and orientation, GUI elements like menus can be presented as either 2D overlays or interactive 3D objects that respect the user’s field of view.
For example, menus can appear as 3D objects that the user can move, orient, and manipulate, or that are locked to a location in the environment. This can be done in both virtual reality (VR) and augmented reality (AR). In VR, such menus can be tracked and rendered like any other object, while in AR, technologies like GPS and sensor fusion, along with techniques like Kalman filters, are often employed to lock their positions and orientations. Popups, such as those providing information, must be handled the same way.
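To make the Kalman filter idea a little more concrete, here is a minimal sketch of a scalar filter that smooths one coordinate of a menu's anchor position from noisy readings. It assumes the anchor is static, and the class name `ScalarKalman` and the noise values are illustrative assumptions rather than anything from a particular SDK. A production tracker would typically filter full 3D position and orientation together.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """Constant-position Kalman filter for one coordinate of an anchor (metres)."""
    estimate: float                   # current best guess of the coordinate
    variance: float = 1.0             # uncertainty of that guess
    process_noise: float = 1e-4       # assumed drift of the true anchor per step
    measurement_noise: float = 1e-2   # assumed variance of each sensor reading

    def update(self, measurement: float) -> float:
        # Predict: the anchor is assumed static, so only the uncertainty grows.
        self.variance += self.process_noise
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = self.variance / (self.variance + self.measurement_noise)
        self.estimate += gain * (measurement - self.estimate)
        self.variance *= 1.0 - gain
        return self.estimate


# Hypothetical noisy readings of an anchor that really sits at about 1.0 m.
anchor_x = ScalarKalman(estimate=0.0)
for reading in [1.02, 0.97, 1.01, 0.99, 1.03]:
    smoothed = anchor_x.update(reading)
```

The smoothed value settles near the true position while ignoring most of the per-frame jitter, which is what keeps a locked menu from visibly shaking.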
Unlike traditional 2D menus, 3D menu design for XR leads to some interesting questions. For example, how do you provide feedback for a button press when there may not be any tactile feedback? One solution is to render such contact by animating a feature, like an expanding ring, and further augmenting that with sound.
Another question to consider is what should happen if the user presses a virtual button beyond the button’s plane, as if they pressed the button too hard? Without a formal solution, the user’s virtual finger might appear to penetrate the button. This can be handled in many interesting ways, such as making the button itself translate inwards if pushed too hard.
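One way to sketch the "translate inwards" idea is to split the fingertip depth into normal button travel and overshoot. The function name `button_offsets`, the 10 mm travel, and the 6 mm press threshold below are all assumptions chosen for illustration.

```python
def button_offsets(finger_depth_m, max_travel_m=0.01, press_at_m=0.006):
    """Split fingertip depth past the button's rest plane into travel and overshoot.

    Returns (cap_travel, overshoot, pressed), with distances in metres.
    """
    depth = max(finger_depth_m, 0.0)        # a finger in front of the button does nothing
    cap_travel = min(depth, max_travel_m)   # the cap follows the finger up to its travel limit
    overshoot = depth - cap_travel          # anything beyond that pushes the whole button back
    pressed = cap_travel >= press_at_m      # register the press partway through the travel
    return cap_travel, overshoot, pressed
```

Rendering the button body shifted back by `overshoot` keeps the virtual fingertip resting on the surface instead of passing through it.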
When it comes to mimicking the mouse, the rough 3D equivalent is a laser pointer. This is effectively a 3D pointer that is ray cast from a source position, such as the user’s hand or a controller, to some target in the viewport. The orientation of the ray’s vector may be based on finger joints or on the orientation of a controller. The ray itself may be rendered in full from source to target, or only the final target pointer (e.g., crosshairs) may be shown.
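As a rough sketch of the pointer math, the snippet below intersects a ray with a flat menu panel treated as an infinite plane. The helper names and the coordinate convention (the user looks down the negative z axis) are assumptions. A real engine would also check the hit against the panel's bounds and usually provides its own ray cast call.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def ray_plane_hit(origin: Vec3, direction: Vec3,
                  plane_point: Vec3, plane_normal: Vec3) -> Optional[Vec3]:
    """Return the point where the pointer ray meets the panel's plane, or None."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-9:
        return None  # ray runs parallel to the panel
    to_plane = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(to_plane, plane_normal) / denom
    if t < 0:
        return None  # panel is behind the pointer
    return tuple(o + t * d for o, d in zip(origin, direction))


# Controller at the origin pointing straight ahead at a panel 2 m away.
hit = ray_plane_hit((0.0, 0.0, 0.0), (0.0, 0.0, -1.0),
                    (0.0, 0.0, -2.0), (0.0, 0.0, 1.0))
```

The returned point is where the crosshairs would be drawn. Rendering the full beam is then just a line from `origin` to `hit`.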
Finally, while immersive XR applications are often less oriented towards text capture, there are still use cases where some sort of keyboard input is required. For this, a virtual 3D keyboard can be rendered as an object suspended in space, located at a certain distance from the user. This can be useful for both AR and hands-free (i.e., non-controller) VR applications.
Another option is to use voice commands, as was done by Delta Cygni Labs in their assisted reality product: POINTR. Using voice commands, the user can say what they want to do (e.g., display notepad), and then say additional words, such as the content that is to be recorded.
While implementing menus, pointers, and virtual keyboards may seem trivial, there are numerous design considerations, which can vary for AR and VR.
Design should start with extensive mockups and wireframes. Wireframes for different resolutions and distances from the user (e.g., near, medium, and far) can be useful for setting the Z-order of objects in both AR and VR. For AR, 2D mockups of what the user will see in 3D at various locations and angles can be effective for visualizing the end result, while gray boxing provides a similar tool for 3D.
In addition, diagrams should be constructed identifying zones for both types of XR and how they map in 3D. If eye tracking will be employed, such diagrams may need to be duplicated for each eye.
For head tracking, developers need to consider comfort versus limits, as well as content versus peripheral zones. Generally, people can comfortably turn their heads about 30° to the left or right, up to a maximum of 55° each way. Looking upwards, the comfortable range is about 20°, up to a maximum of 60°; looking downwards, it is about 12°, up to a maximum of 40°. Zones should be identified based on these ranges while factoring in the peripheral zones that will be captured by eye tracking. A zone diagram built from these ranges identifies where content such as menus should be placed in relation to the user.
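The ranges above can be turned into a simple zone check. This is only a sketch: the function name `head_zone`, the zone labels, and the sign convention (positive yaw to the right, positive pitch upwards) are assumptions, while the angle limits are the ones quoted above.

```python
# Head rotation limits in degrees, taken from the ranges described above.
COMFORT = {"left": 30, "right": 30, "up": 20, "down": 12}
MAXIMUM = {"left": 55, "right": 55, "up": 60, "down": 40}


def head_zone(yaw_deg: float, pitch_deg: float) -> str:
    """Classify a direction by how far the head must turn to face it."""
    horizontal = "right" if yaw_deg >= 0 else "left"
    vertical = "up" if pitch_deg >= 0 else "down"
    yaw, pitch = abs(yaw_deg), abs(pitch_deg)
    if yaw <= COMFORT[horizontal] and pitch <= COMFORT[vertical]:
        return "comfort"        # good home for primary content such as menus
    if yaw <= MAXIMUM[horizontal] and pitch <= MAXIMUM[vertical]:
        return "reachable"      # possible, but tiring for frequent use
    return "out_of_range"       # requires turning the body, not just the head
```

For instance, a panel 15° below eye level falls outside the 12° comfortable range and lands in the "reachable" zone, which hints that it suits secondary content.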
With GUIs that need to portray a lot of information, it can be beneficial to place user information based on its importance. One good approach is to prioritize, with the most important information to appear front and center, secondary information at the sides, third down at the bottom, and fourth up, similar to how one might look for information on the cover of a newspaper.
In AR, the real world serves as a limitless 3D environment, and therefore the developer does not have control over factors such as lighting. Because digital elements are effectively overlaid in AR as an additive rendering process, they are often easiest to view when presented using light colours. Also keep in mind that AR can be viewed in an immersive 3D headset, or via a 2D mobile device like a tablet that captures the world through a camera. Developers may need to handle input for both, since some users prefer one over the other, or set a hard requirement for one type of device. When targeting a headset, developers should employ perception technologies for hand, head, and eye tracking. On mobile devices, movements and gestures will be captured through motion sensors and the touchscreen. For VR, developers should consider rendering a fully articulated skeletal model of the user’s hand and/or controller that accurately reflects the movements of its real-world counterpart.
Since XR is inherently 3D, developers will want to consider how to render detailed items. For example, the angular resolution of a headset will dictate the font size to use for best readability as the user moves and rotates, while the choice of GUI elements will depend on the perception technologies being used.
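To illustrate how angular resolution feeds into font size, the sketch below converts a minimum legible glyph height in pixels into a world-space text height at a given viewing distance. The function name and the sample numbers (20 pixels per degree, 20 pixels minimum glyph height) are assumptions for illustration, not recommendations for any specific headset.

```python
import math


def min_text_height_m(distance_m: float, pixels_per_degree: float,
                      min_glyph_pixels: float) -> float:
    """World-space height text needs so it spans at least min_glyph_pixels."""
    # Visual angle the glyph must cover, based on the display's angular resolution.
    angle_deg = min_glyph_pixels / pixels_per_degree
    # Height that subtends that angle at the given viewing distance.
    return 2.0 * distance_m * math.tan(math.radians(angle_deg) / 2.0)


# Hypothetical headset: 20 pixels per degree, text viewed from 1 m and 2 m.
near = min_text_height_m(1.0, 20.0, 20.0)
far = min_text_height_m(2.0, 20.0, 20.0)
```

Because the required height grows linearly with distance, text on a far panel needs to be proportionally larger in world units to stay readable.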
The perception technology that makes it all a reality
The three key perceptions to track in XR are head, hand, and eye movements. Each perception technology that tracks them must fulfill three key KPIs to a high degree in order to be effective:
- Accuracy: how well the perceived result compares to the perfect result (ground truth) (e.g., low motion-to-photon latency in XR headsets to handle head movements).
- Robustness: how corner/edge cases are handled (e.g., when sensors are blocked or a connection is lost).
- Power consumption: power and thermal efficiency of a technology or algorithm.
Tracking the orientation and translation of the user’s head is another important consideration, so the application can render the UI (and 3D environment in the case of VR). Head tracking algorithms track three orientation DoFs: yaw, pitch, and roll, as well as positional DoFs for the user’s position (x, y, and z). Such algorithms are often referred to as 6-DoF algorithms.
Head tracking algorithms are often augmented with the ability to predict where the user’s head/orientation will be in the future (typically 10 to 20ms out), which can help smooth out rendering and ensure virtual objects remain fixed in space.
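As a very simplified sketch of prediction, the snippet below extrapolates yaw, pitch, and roll forward by a short lookahead using the current angular velocity. Constant-velocity extrapolation of Euler angles is an assumption made here for brevity, and the function name is hypothetical. Production trackers generally work with quaternions and more sophisticated motion models.

```python
def predict_orientation(angles_deg, rates_deg_per_s, lookahead_s=0.015):
    """Extrapolate (yaw, pitch, roll) assuming angular velocity stays constant.

    The default lookahead of 15 ms sits in the 10 to 20 ms range mentioned above.
    """
    return tuple(angle + rate * lookahead_s
                 for angle, rate in zip(angles_deg, rates_deg_per_s))


# Head at 10 degrees yaw, turning right at 100 degrees per second.
predicted = predict_orientation((10.0, 0.0, 0.0), (100.0, 0.0, 0.0), lookahead_s=0.02)
```

Rendering against the predicted pose rather than the last measured one helps compensate for the time it takes a frame to reach the display.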
Head tracking algorithms must run constantly, provide low latency, and maintain low power consumption. Accuracy of such a demanding algorithm is achieved when few or no translation and rotation errors occur. To help with this, the Snapdragon XR2 creates a map of the environment with a technology called SLAM (Simultaneous Localization and Mapping). This map is used for robust, fast, and accurate tracking. For the algorithm to be robust, it must handle cases like an obscured camera, poor lighting, etc. For this, the Snapdragon XR2 uses special monochrome cameras with a wide FOV and has a hardware-accelerated head tracking algorithm that is designed to perform with high accuracy at very low power.
Capturing hand movements is important because most people use their hands as their primary way to interact with the natural world. This can be achieved through a number of technologies, including camera-based hand tracking, sensors and touchscreens on mobile devices, etc. A wide range of movements and gestures can be captured, including:
- Pointing and tapping motions (e.g., to select, move, and place objects)
- Swipes (e.g., to navigate screens, interact with fields, etc.)
- Side-to-side hand movements (e.g., to scroll)
- Wrist rolls, similar to looking at a watch (e.g., to display user interface elements on a virtual arm)
- Touch measurements through capacitors (e.g., to detect hand gestures like thumbs up, pointing, squeezing, etc.)
The fidelity of hand tracking can be measured in up to 28 DoFs (including all joints of the hand and each finger). An important KPI for hand movements is to allow for a large tracking volume, as users expect to be able to stretch arms wide, high, and low. Robustness is achieved by handling edge cases such as the hand covering the camera, intertwined fingers, one hand covering another, etc.
It’s also important that hand tracking maintain low latency to prevent lag and to allow for near real-time UI interactions and rendering of virtual hands/controllers. For hand tracking algorithms that involve the use of neural networks, the Snapdragon XR2 is capable of processing neural network-based algorithms at up to 15 trillion computations per second.
Finally, eye tracking tracks where a user is looking (also known as the gaze vector).
Effective eye tracking generally requires a sub-10ms pipeline from eye movement to the on-screen change, and this can be a challenge given how fast the eyes can move. For this, we worked with Tobii to bring eye tracking to the Qualcomm Hexagon DSP SDK on the Snapdragon XR2, which provides low-latency access to the cameras. For eye tracking to be robust, it should handle eye features that vary in people around the world, so the underlying neural network used in the Snapdragon XR2 algorithm was trained on a large, diverse dataset.
Eye tracking can also help with power efficiency by contributing to foveated rendering. This involves rendering different regions of the screen to varying degrees of detail, based on the knowledge that our peripheral vision sees less detail.
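A minimal sketch of the idea is to pick a detail level for each screen region from its angular distance to the gaze vector. The function names, the 10° and 25° thresholds, and the three detail levels below are illustrative assumptions. Real foveated rendering is configured through the graphics API or engine being used.

```python
import math


def angle_between_deg(a, b):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return math.degrees(math.acos(cosine))


def detail_level(gaze, region_dir, inner_deg=10.0, outer_deg=25.0):
    """Fraction of full resolution to render a region at, based on gaze."""
    eccentricity = angle_between_deg(gaze, region_dir)
    if eccentricity <= inner_deg:
        return 1.0    # where the user is looking: full detail
    if eccentricity <= outer_deg:
        return 0.5    # near periphery: reduced detail
    return 0.25       # far periphery: lowest detail
```

Regions far from the gaze vector are rendered with less detail, which reduces the work per frame and therefore the power consumed.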
UI design continues to evolve as new mediums like immersive XR experiences become more prevalent. Unlike conventional 2D desktop GUIs, 3D environments and their new ways of interacting are opening up new possibilities while exposing new design considerations.
Powering this are cutting-edge perception technologies including head, eye, and hand tracking devices. But in order to deliver such functionality effectively, these technologies must be robust, accurate, and power efficient.
If you have an interesting XR project that you’d like to share, be sure to submit your project. Also be sure to check out these additional learning resources related to XR development: