Weeks 7 and 8

Tracking / Navigation / Interaction

Week 7 Homework - due 9pm Chicago time on Friday 10/13/17

In project 1 you used VR to create your dream office - a new trend in AR apps is evaluating virtual furniture placed within the room where you are thinking about putting it. For this homework you should try out one of these apps - IKEA has a couple, IStaging is another. Take a room you know, place a piece of appropriate furniture in it and take a screen capture. Then do the same thing for a place you normally wouldn't see a piece of furniture like that - like a bed in the middle of Halsted St. or a stove in the middle of campus. Add these two snapshots to another homework page and write down your thoughts on how AR applications like this could be used in the future.

Week 8 Homework - due 9pm Chicago time on Friday 10/20/17

One nice set of AR applications for smartphones has been those that let you point your phone up at a part of the sky (usually at night, but of course they work anytime) and have the app tell you what stars, planets, etc. are visible from where you are. If you want to know if that bright light on the horizon is a UFO or the planet Venus, then this might help. For this homework you should try out one of these apps - Pocket Universe is a nice one for iOS, or Star Tracker or SkyView on Android. Knowing where you are on Earth and the orientation of your phone, it's pretty easy to tell what celestial objects you are looking at since they don't move very fast. Take one screenshot when you are looking at the moon and another looking at Saturn. Please do not point your camera towards the sun while doing this. As with the Google Translate homework, add these two images to your homework page and then imagine a future where this app is built into typical AR glasses as a way to get more information about the world around you and quickly answer the question 'what is that?' In this case let's just focus on the sky. Write about what other data sources you might want to integrate into an app like this, and what different layers of information you might want to combine in this view.

The primary purpose of tracking is to update the visual display based on the viewer's head position and orientation.

Ideally we would track the user's eyes directly, and while that is possible, it is cumbersome and not really necessary. Instead, we track the position and orientation of the user's head, and from this we determine the position and orientation of the user's two eyes.
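As a rough illustration of getting from the tracked head to the two eyes, here is a minimal Python sketch. It makes several assumptions that are not part of these notes: the tracker reports the point midway between the eyes, only yaw (rotation about the vertical axis) is considered, the coordinate system is right-handed with y up and the user looking down -z at yaw 0, and the 64 mm eye separation is just a typical default. The function name eye_positions is made up for this example.

```python
import math

def eye_positions(head_pos, yaw_deg, ipd=0.064):
    """Return (left_eye, right_eye) positions for a tracked head.

    head_pos: (x, y, z) of the point midway between the eyes, in meters.
    yaw_deg:  head rotation about the vertical (y) axis, in degrees.
    ipd:      eye separation in meters (assumed default, varies per person).
    """
    yaw = math.radians(yaw_deg)
    # At yaw 0 the user looks down -z and their right-hand side is +x.
    # Rotating +x about the y axis gives the head's current 'right' vector.
    right = (math.cos(yaw), 0.0, -math.sin(yaw))
    half = ipd / 2.0
    x, y, z = head_pos
    left_eye = (x - half * right[0], y, z - half * right[2])
    right_eye = (x + half * right[0], y, z + half * right[2])
    return left_eye, right_eye

# Head 1.7 m above the floor, turned 90 degrees to the left.
left, right = eye_positions((0.0, 1.7, 0.0), 90.0)
print(left, right)
```

A real system would apply the full head orientation (roll, pitch, and yaw) rather than yaw alone, but the idea is the same: each eye sits at a fixed offset in the head's own coordinate frame.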

We may also be tracking the user's hand(s), fingers, legs, or other interface devices.

Want tracking to be as 'invisible' as possible to the user.

Want the user to be able to move freely with few encumbrances

Want to be able to have multiple 'guests' nearby

Want to track as many objects as necessary

Want to have minimal delay between movement of an object (including the head and hand) and the detection of the object's new position / orientation (< 50 msec total)

Want tracking to be accurate (1mm and 1 degree)

In order to interact with the virtual world beyond moving through it, we probably need to track at least one of the user's hands as well, and preferably both of them. Tracking the position and orientation of the hand allows the user to interact with the virtual world or other users as though the user is wearing mittens with no fine control of the fingers. When thinking about how the user interacts with the worlds that you are building, think about the kinds of actions a person can do while wearing mittens.

There are a variety of ways of tracking people

Electromagnetic - what we used in the early to mid 1990s for VR

large transmitter and one or more small sensors.

transmitter emits an electromagnetic field.

sensors report the strength of that field at their location to a computer

sensors can be polled specifically by the computer or transmit continuously.

advantages are:

disadvantages are:


Mechanical - still around for haptics and applications that require very high accuracy and very low latency

rigid structures with multiple joints

one end is fixed, the other is the object being tracked

could be tracking the user's head, or their hand

physically measure the rotation about joints in the armature to compute position and orientation

structure is counter-weighted - movements are slow and smooth

Knowing the length of each segment between joints and the rotation at each joint, the location and orientation of the end point is easy to compute.
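To show why that computation is easy, here is a small Python sketch of the idea for an arm that moves in a single plane. The planar simplification, the function name end_point, and the convention that each joint angle is measured relative to the previous segment are assumptions for illustration; a real armature works in 3D with more joints.

```python
import math

def end_point(segment_lengths, joint_angles_deg):
    """Planar forward kinematics for an arm fixed at the origin.

    segment_lengths:  length of each rigid segment, in meters.
    joint_angles_deg: rotation at each joint, relative to the previous segment.
    Returns (x, y, heading_deg) of the tracked end of the arm.
    """
    x = y = 0.0
    heading = 0.0
    for length, angle in zip(segment_lengths, joint_angles_deg):
        heading += math.radians(angle)      # accumulate the joint rotations
        x += length * math.cos(heading)     # step along the current segment
        y += length * math.sin(heading)
    return x, y, math.degrees(heading)

# Two 1 m segments: the first points straight up, the second bends back to horizontal.
print(end_point([1.0, 1.0], [90.0, -90.0]))
```

Because the joint rotations are measured physically, there is very little computation involved, which is part of why mechanical trackers can have such low latency.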


advantages are:

disadvantages are:

Acoustic (ultrasonic) - what we used in the late 1990s for VR

small transmitter and one medium sized sensor

each transmitter emits ultrasonic pulses which are received by microphones on the sensor (usually arranged in a triangle)

as the pulses will reach the different microphones at slightly different times, the position and orientation of the transmitter can be determined
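A minimal Python sketch of the position part of this, under some stated assumptions: the system knows when each pulse was emitted (so arrival times become times of flight), the three microphones sit at (0,0,0), (d,0,0) and (i,j,0) in the sensor's own frame, the speed of sound is a fixed 343 m/s, and the transmitter is in front of the microphone plane. The function name locate is made up for this example. Recovering orientation as well would need more than one transmitter on the tracked object.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C (assumed constant)

def locate(times, d, i, j):
    """Trilaterate a transmitter from three times of flight.

    times: (t1, t2, t3) in seconds, pulse travel time to each microphone.
    Microphones are assumed at (0,0,0), (d,0,0) and (i,j,0), in meters.
    Returns (x, y, z) with z >= 0, i.e. in front of the microphone plane.
    """
    r1, r2, r3 = (SPEED_OF_SOUND * t for t in times)
    # Subtracting the sphere equations pairwise leaves linear equations in x and y.
    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y = (r1**2 - r3**2 + i**2 + j**2) / (2 * j) - (i / j) * x
    # Clamp at zero so small timing noise cannot produce a negative square root.
    z = math.sqrt(max(r1**2 - x**2 - y**2, 0.0))
    return x, y, z

# Simulate a transmitter at a known spot and check we can recover it.
mics = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.15, 0.26, 0.0)]
true_pos = (0.2, 0.3, 1.0)
times = [math.dist(true_pos, m) / SPEED_OF_SOUND for m in mics]
print(locate(times, d=0.3, i=0.15, j=0.26))
```

The speed of sound changes with temperature and humidity, so a fixed constant like the one above is itself a source of error in a real room.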


advantages are:

disadvantages are:


Optical - What pretty much everyone uses today for VR

LEDs or reflective materials are placed on the object to be tracked

video cameras at fixed locations capture the scene (usually in IR)

image processing techniques are used to locate the object

With fast enough processing you can also use computer vision techniques to isolate a head in the image and then use the head to find the position of the eyes




In CAVE2 we use a Vicon optical camera tracking system; in the new classroom space it will be an OptiTrack system.

One can also use much less expensive camera based systems like the Xbox Kinect to track multiple people in a small area. Staying within the field of view and focal area is very important here since you only have a single camera, and users are usually limited to facing a single direction

How the consumer headsets do it

VIVE - two powered lighthouses at the high corners of the tracked volume, each sweeping one laser horizontally and another vertically through the space. The sweeps are read by the wired headset (1000 Hz) and the wireless controllers (360 Hz), giving a general accuracy of 3 mm.


Oculus - wired Headset and the wireless Oculus touch controllers have LEDs mounted all around them which are then detected by (initially) a small sensor sitting on the desk/table in front of you, and later two or three sensors to track a larger space

PlayStation VR - similarly has a wired headset and wired controllers that make use of the existing PlayStation Move controller infrastructure using the PlayStation Camera and LEDs on the headset

Inertial - used for a lot of current smartphone based experiences


gyroscopes / accelerometers used

knowing where the object was and its change in position / orientation the device can 'know' where it now is

tend to work for limited periods of time then drift as errors accumulate.
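Here is a small Python sketch of why the errors accumulate. It is a one-dimensional simplification with made-up numbers: a device sitting perfectly still whose accelerometer reads a small constant bias (0.01 m/s^2, an assumed value). The function name dead_reckon is hypothetical.

```python
def dead_reckon(accel_samples, dt, bias=0.0):
    """Integrate acceleration twice to estimate the change in position (1D).

    accel_samples: true accelerations in m/s^2, one per time step.
    dt:            time step in seconds.
    bias:          constant sensor error added to every reading (assumed).
    """
    velocity = 0.0
    position = 0.0
    for a in accel_samples:
        velocity += (a + bias) * dt   # first integration: acceleration -> velocity
        position += velocity * dt     # second integration: velocity -> position
    return position

# The device never moves, so any reported motion is pure drift.
for seconds in (10, 20):
    samples = [0.0] * (seconds * 100)          # sampled at 100 Hz
    drift = dead_reckon(samples, dt=0.01, bias=0.01)
    print(f"after {seconds} s the estimate has drifted {drift:.2f} m")
```

With these numbers the estimate is off by about half a meter after 10 seconds and about two meters after 20 seconds: doubling the time roughly quadruples the error, since a constant bias is integrated twice.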


For outdoor work GPS can give the general location of the user (3 meter accuracy horizontally in an open field, much worse as you get near buildings). Vertical accuracy is more like 10 meters, so that is not very useful right now. Newer constellations of satellites, as well as ground based reference stations, can substantially improve on that accuracy.

For better vertical accuracy devices are now including barometers. These work pretty well when calibrated to the local air pressure, which may be constantly changing as the weather changes.
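To see why the calibration matters, here is a short Python sketch using the widely quoted international barometric formula, which approximates altitude from measured pressure and a reference sea-level pressure. The function name pressure_altitude and the sample pressures are assumptions for illustration, not values from any particular device.

```python
def pressure_altitude(p_hpa, p0_hpa=1013.25):
    """Approximate altitude in meters from air pressure.

    p_hpa:  pressure measured by the device, in hPa.
    p0_hpa: reference pressure at sea level, in hPa (changes with the weather).
    """
    return 44330.0 * (1.0 - (p_hpa / p0_hpa) ** (1.0 / 5.255))

measured = 1000.0
calibrated = pressure_altitude(measured, p0_hpa=1013.25)
stale = pressure_altitude(measured, p0_hpa=1012.25)   # weather moved 1 hPa
print(f"calibrated: {calibrated:.1f} m, stale reference: {stale:.1f} m")
```

Near sea level a 1 hPa change in the reference pressure shifts the computed altitude by roughly 8 meters, so an uncalibrated barometer can easily be off by a couple of floors of a building.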

Fiducial Markers

A common way for camera based AR systems to orient themselves is by using fiducial markers. These could be pieces of paper held in front of a camera where a 3D object suddenly appears on the paper (when looking at the camera feed). They can also be placed on walls, floors, ceilings so moving users with cameras can locate where they are.


Combining multiple forms of tracking is a very good way to improve tracking in complex situations, just as our phone's GPS-based location is improved if we also have the WiFi antennas working.

InterSense uses a combination of acoustic and inertial tracking. Inertial can deal with fast movements and acoustic keeps the inertial from drifting.
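One simple way to combine a fast but drifting sensor with a slower absolute one is a complementary filter. The Python sketch below is only an illustration of that general idea, not a description of how any commercial tracker works internally. It tracks a single angle; the gyro rates stand in for the inertial sensor, the reference angles stand in for the absolute (for example acoustic) readings, and the function name and the blend factor alpha = 0.98 are assumptions.

```python
def complementary_filter(gyro_rates, reference_angles, dt, alpha=0.98):
    """Fuse a drifting rate sensor with an absolute angle reference.

    gyro_rates:       angular rates in deg/s (fast, but biased).
    reference_angles: absolute angle readings in degrees (drift free).
    alpha:            how much to trust the integrated gyro each step (assumed).
    Returns the fused angle after every step.
    """
    angle = reference_angles[0]
    history = []
    for rate, ref in zip(gyro_rates, reference_angles):
        predicted = angle + rate * dt                    # inertial: follows fast motion
        angle = alpha * predicted + (1 - alpha) * ref    # reference: pulls out the drift
        history.append(angle)
    return history

# A stationary object, but the gyro wrongly reports 1 deg/s of rotation.
n, dt = 3000, 0.01                                       # 30 seconds at 100 Hz
fused = complementary_filter([1.0] * n, [0.0] * n, dt)
gyro_only = sum(rate * dt for rate in [1.0] * n)
print(f"gyro only: {gyro_only:.1f} deg, fused: {fused[-1]:.2f} deg")
```

Integrating the gyro alone drifts by 30 degrees over the 30 seconds, while the fused estimate stays within about half a degree of the true angle.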

Outdoor AR devices can use GPS and orientation / accelerometer information to get a general idea where the user is, and then use the on board camera to refine that information given what should be in sight from that location at that orientation.

a current popular version of this is Inside Out Tracking

The HoloLens and future derived headsets don't want to rely on external markers or emitters or cameras; they want to be able to track using just what the user is wearing, with cameras and sensors looking outward. This requires a combination of sensors including inertial (for orientation tracking), and visible light camera(s) and depth camera(s) for position tracking, and all of them together are used for space mapping.

With the HoloLens you first have to help the headset map the space by looking all around the room you are in, and then the HoloLens remembers that room, and as you move between multiple rooms it adds to its internal map and combines independent spaces together into larger contiguous spaces.

Google's Project Tango and others use similar combinations of sensors on headsets and smartphones

In class demos of setting up tracking for the VIVE and OptiTrack.

VIVE - https://support.steampowered.com/steamvr/HTC_Vive/

OptiTrack - http://v110.wiki.optitrack.com/index.php?title=Quick_Start_Guide:_Getting_Started


Rather than looking for a generic solution, specialized VR applications are usually better served using specialized tracking hardware. These pieces of specialized hardware generally replace tracking of the user with an input device that handles navigation

For Caterpillar's testing of their cab designs they place the actual cab hardware into the CAVE so the driver controls the virtual loader in the same way the actual loader would be controlled. The position of the gear shift, the pedals, and the steering wheel determine the location of the user in the virtual space.

A treadmill can be used to allow walking and running within a confined space. More sophisticated multilayer treadmills or spheres allow motion in a plane.

A bicycle with handlebars allows the user to pedal and turn, driving through a virtual environment

a bit more about latency

Accuracy needs to come from the tracker manufacturers. Latency is partly our fault.

Latency is the sum of:

Another important point about latency is the importance of consistent latency. If the latency isn't too bad, people will adapt to it, but it's very annoying if the latency isn't consistent - people can't adapt to jitter.
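A quick way to look at both properties of a set of latency measurements is to compute the average and the spread. The Python sketch below uses made-up sample values in milliseconds; the function name latency_summary is hypothetical.

```python
import statistics

def latency_summary(samples_ms):
    """Return (mean, jitter) where jitter is the standard deviation in ms."""
    return statistics.mean(samples_ms), statistics.pstdev(samples_ms)

steady = [40, 40, 40, 40, 40]    # consistent: users can adapt to this
jittery = [20, 60, 20, 60, 40]   # same average, but it keeps changing

print(latency_summary(steady))
print(latency_summary(jittery))
```

Both sets have the same 40 ms average, so the average alone does not tell you which system will feel worse to use; the spread does.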

How many sensors is enough?

Tracking the head and hand is often enough for working with remote people as avatars.

A user putting on sensors and another user dancing with 'the thing growing' at SIGGRAPH 98 in Orlando. This application tracked the head, both hands and the lower back.

Interaction and Navigation

In the simplest Virtual Reality world there is an object floating in front of you in 3D which you can look at. Moving your head and / or body allows you to see the object from different points of view. This is also the default in an Augmented Reality world.

In Fish Tank VR setups, or HMDs like the original Rift, the user is typically sitting with a limited space to move in. With more modern HMDs like the newer Rift and VIVE, or room scale systems like the CAVE and CAVE2, the user has a larger space to walk in, jump up, kneel down, lie on the floor etc., and arcade level systems give you larger rooms to move around in using just your body. Unseen Diplomacy pushes this navigation in a limited space to the extreme - https://www.youtube.com/watch?v=KirQtdsG5yE

But often you want to move further, or in effect move a different part of the virtual world into the area that you can easily move through.

Common ways of doing this involve using a joystick or directional pad on the wand to 'drive' through the space as though you were in a first person video game. This gives you a better sense of continuity in the virtual world, but it risks simulator sickness, which is why most current consumer HMD games don't do it.

Another simple option is using a wand to point to where you want to go and teleporting from one place to another within the virtual world, which is what most current consumer HMD games use.

Another option is to use large gestures such as swinging both your arms (holding two wands) up and down as though you were jogging to tell the system you want to walk, or pointing in the direction you want to go if you have hand and finger tracking.  VR Dungeon Knight uses the jogging metaphor - https://www.youtube.com/watch?v=TTolJoKUcks

Other specialized options as discussed above include real devices like bicycles, treadmills, car interiors, fighter plane interiors, train interiors.

In Augmented Reality you are limited to your actual physical movements (or the movements of a real car or a real bike) as the Augmented Reality world is anchored to the real world.

Here are several controllers that we used in the first 10 years of the CAVE. The common elements on these included a joystick and three buttons (same as the 3 buttons on unix / IRIX computer mice.)

original 'hand made' CAVE / ImmersaDesk wand based on Flock of Birds tracker from 1992-1998. It would have been really handy to have had rapid prototyping machines back then to make these.

New CAVE/ ImmersaDesk wanda with a similar joystick plus three buttons, based on Flock of Birds tracker from 1998-2001

InterSense controller, joystick plus 4 buttons, used in the early 2000s

When we designed and built CAVE2 from 2009-2012 we wanted to go with controllers that were easier to replace if they were broken, so we shifted to PlayStation controllers with marker balls mounted on the front, again giving us multiple buttons, a d-pad and joystick up top and a trigger below. This was the first CAVE controller to be cable free, so it needs to be charged up like any wireless game controller.

The d-pad added a lot of advantages for interacting with menus. Using the lower trigger was a big advantage over the joystick for navigation.

The initial VIVE controllers followed a similar pattern.

The Oculus Touch has similar features in a more compact arrangement

More buttons give you more options that can be directly controlled by the user, but may also make it harder to remember what all those buttons do. More buttons also makes it harder to instruct a novice user what he/she can do. Asking a new user to press the 'left button' or the 'right button' is pretty easy but when you get to 'left on the d-pad' or 'press the right shoulder button' then you have a more limited audience that can understand you.

Most VR software today automatically brings up context sensitive overlays about what the various controller buttons do to help users get familiar with the controls.

Game controllers (and things that look like game controllers) have a big advantage in terms of familiarity for people who play games, and have often gone through pretty substantial user testing, and are often relatively inexpensive to replace.

More interesting is the ability to manipulate objects in the virtual world, or manipulate the virtual world itself.

The user is given a very 'human' interface to VR ... the person can move their head or hand, and move their body, but this also limits the user's interaction with the space to what you carry around with you. There is also the obvious problem that you are in a virtual space made out of light, so it's not easy to touch, smell, or taste the virtual world, though all of the senses have been used in various projects.

Even if you want to just 'grab' an object there are several issues involved.

A 'natural' way to grab a virtual object is to move your hand holding a controller so that it touches the virtual object you want to manipulate. At this point the virtual environment could vibrate the controller, or add a halo to the object, or make the object glow, or play a sound to help you know that you have 'touched' the object. You could then press a button to 'grab' the object, or have the object 'jump' into your hand.

While this kind of motion is very natural, navigating to the object may not be as easy, or the type of display may not encourage you to 'touch' the virtual objects. It can also be impractical to pick up very large objects because they can obscure your field of view. In that case the user's hand may cast a ray (raycasting), which allows the user to interact at a distance. One fun thing to try in VR is to act like a superhero and pick up a large building or train and throw it around - it turns out that when you pick one up you can't see anything else - 'church chuck'.
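A minimal Python sketch of picking by raycasting, assuming each selectable object is approximated by a bounding sphere (a common simplification, not something these notes prescribe). The ray starts at the hand position and points where the wand points; the nearest sphere it hits is the selected object. The function names and the scene contents are made up for this example.

```python
import math

def ray_sphere_distance(origin, unit_dir, center, radius):
    """Distance along a unit-length ray to a sphere, or None if the ray misses."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(unit_dir, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c                 # discriminant of t^2 + 2bt + c = 0
    if disc < 0:
        return None                  # the ray passes beside the sphere
    t = -b - math.sqrt(disc)         # nearer intersection
    if t < 0:
        t = -b + math.sqrt(disc)     # the hand is inside the sphere
    return t if t >= 0 else None     # ignore spheres behind the hand

def pick(origin, direction, objects):
    """objects maps a name to (center, radius); return the nearest hit or None."""
    length = math.sqrt(sum(d * d for d in direction))
    unit_dir = [d / length for d in direction]
    best_name, best_t = None, math.inf
    for name, (center, radius) in objects.items():
        t = ray_sphere_distance(origin, unit_dir, center, radius)
        if t is not None and t < best_t:
            best_name, best_t = name, t
    return best_name

scene = {"church": ((0.0, 0.0, -10.0), 2.0), "train": ((0.0, 0.0, -30.0), 5.0)}
print(pick((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), scene))   # pointing straight ahead
```

Returning only the nearest hit matters because a ray will often pass through several objects, and the user almost always means the one in front.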


One common way of interacting with the virtual world is to take the concept of 2D menus from the desktop into the 3D space of VR. These menus exist as mostly 2D objects in the 3D space.

This can be extended from simple buttons to various forms of 2D sliders.

These menus may be fixed to the user, appearing near the user's head, hand, or waist, so as the user moves through the space, the menus stay in a fixed position relative to the user. Alternatively the menus may stay at a fixed location in the real space, or a fixed location in the virtual space.

e.g. from the current crop of HMD experiences:

Google Earth VR maps the controls to buttons on the controller with tooltips floating nearby

Vanishing Realms has a nice UI at the user's waist where you store keys and food and weapons. To interact, the user intersects that menu with one of the controllers as though reaching down to grab something off their belt.

Tilt Brush has a nice 2-handed 3D UI where the multi-faceted menu appears in one hand and you select from it with the other hand

Bridge Crew has a nice UI with (lots of) buttons that you have to 'press' with your virtual hands - including virtual overlay text to remind you which is which

Job Simulator has a nice UI built into the 3D environment itself based on object manipulation

In either case these menus may collide with other objects in the scene. One way to avoid this is to turn z-buffering off for these menus so they are always visible even when they are 'behind' another object.

When you get into more complicated virtual worlds for design or visualization the number of menus multiplies dramatically, as does the need for textually naming them, so more traditional menus are more common in these domains. There are several ways to activate these kinds of menus - using the wand as a pointer to select menu items, using a d-pad to move through a menu, or intersecting the wand itself with the menu items. Using a d-pad tends to work better than a pointer if you have a single menu to move through, as it can be hard to hold your hand steady when pointing at complex menus.

Another option is to use a head-up display for the menu system where you look at the menu item you want to select and choose it with a controller. Here is a version of that we did back in 1995 with the additional option for selecting the menu options by voice. The HoloLens uses a similar menu system.

In Augmented Reality this can be trickier since you also have the real world involved, both in terms of the graphics, and in terms of the people you share the world with. Google Glass's physical control on the side of Glass worked OK for small menu systems, augmented by voice. Microsoft's HoloLens pinch gesture for selecting within the field of view of the camera didn't work quite so well, but the physical button did, as long as you keep the physical button with you.


One way to get around the complexity of the menus is to talk to the computer via a voice recognition system. This is a very natural way for people to communicate. These systems are quite robust, even for multiple speakers given a small fixed vocabulary, or a single speaker and a large vocabulary, and they are not very expensive.

However, voice commands can be hard to learn and remember.

In the case of VR applications like the Virtual Director from the 90s, voice control was the only convenient way to get around a very complicated menuing system

The HoloLens makes effective use of voice to rapidly move through the menus without needing to look and pinch.

Ambient microphones do not add any extra encumbrance to the user in dedicated rooms, and small wireless microphones are a small encumbrance. HMDs or the controllers can include microphones which work pretty well.

Problems can occur in projection-based systems since there are multiple users in the same place and they are frequently talking to each other. This can make it difficult for the computer to know when you are talking to your friends and when you are talking to the computer. There is a need for a way to turn the system on and off, and often the need for a personal microphone.

Voice is becoming much more common now for our smartphones and our homes, and soon our cars, as the processing and the learning can be offloaded into the cloud.

gesture recognition

This also seems like a very natural interface. Gloves can be used to accurately track the position of the user's hand and fingers. Some simple gloves track contacts (e.g. thumb touching middle finger), others track the extension of the fingers. The former are fairly robust, the latter are still somewhat fragile. Camera tracking as in the Kinect and AR systems can do a fairly good job with simple gestures, and are rapidly improving.

The possibilities with tracking hands improve if you have two of them. Multigen's SmartScene is a good example using two Fakespace Pinchgloves for manipulation. The two handed interaction of Tilt Brush seems like a modern version of the SmartScene interface.

Full body tracking involving a body suit gives you more opportunities for gesture recognition, and simple camera tracking does a pretty good job with gross positions and gestures.

One issue here, as with voice, is how does the computer decide that you are gesturing to it and expect something to happen, as opposed to gesturing to yourself or another person.


Several different models of PHANToMs

a PHANToM in use as part of a cranial implant modelling application with a video here - https://www.youtube.com/watch?v=cr4u69r4kn8

The PHANToM gives 6 degrees of freedom as input (translation in XYZ and roll, pitch, yaw) and 3 degrees of freedom in output (translation in XYZ)

You can use the PHANToM by holding a stylus at the end of its arm as a pen, or by putting your finger into a thimble at the end of its arm.

The 3D workspace ranges from 5x7x10 inches to 16x23x33 inches


and a nice introductory video here - https://www.youtube.com/watch?v=0_NB38m86aw

There is also work today using air pressure and sound to create a kind of sense of touch. Though the sensation is not as strong as with a PHANToM, these systems operate over a wider area.

Other issues

We often compensate for the lack of one sense in VR by using another. For example we can use a sound to replace the sense of touch, or a visual effect to replace the lack of audio.

In projection based VR systems you can carry things with you, for example in the 90s we could carry PDAs giving an additional display, handwriting recognition, or a hand-held physical menu system. Today smartphones or tablets provide the same functionality with far more capabilities.

In smaller fish tank VR systems, or in hybrid systems like CAVE2 you have access to everything on your desk which can be very important when VR is only part of the material you need to work with.

In Augmented Reality you have access to everything in the real world so interacting with the real world is pretty much the same as before, especially with a head mounted AR system.

Coming Next Time

Project 2 Presentations
last revision 10/15/17