I’ve always been an enthusiast of 3D, VR and AR, and I’m lucky enough that a significant portion of my engineering career so far has been dedicated to projects within that realm.
Even though the harsh VR winter of the 2010s left me a bit skeptical of new headsets, every time a new one hits the market I still get that old feeling of excitement back.
I work at a small startup called Things, Inc. That “Inc.” suffix makes it sound super serious and corporate but we’re just 3 people and none of us wears a tie or a suit. We have pretty fancy hoodies, though.
Needless to say, when Apple announced the Vision Pro, we were very excited to get our hands on it, so we decided to pre-order it as soon as it came out. While we waited for the headsets to ship, we started doing some preparatory work using the simulator.
We sat down and went over our plans for world domination. They started and ended with “let’s just make something fun” (we’re not very good at world domination). We binge-watched the visionOS developer videos as if they were the highly anticipated post-cliff-hanger Season 2 of a not-yet-ruined TV series, then got to work.
Naturally, we had to divide our attention between this shiny new toy and our work in Rooms, but we decided it was worth dedicating a few weeks to this.
While there is no substitute for holding the actual device in our own hands, credit goes to Apple for preparing extensive developer documentation that gave us a pretty clear idea of what the headset was capable of, so we could start making progress right away.
What We Built
Without any further ado (and sorry for the previous ados), here is what we ended up developing:
I say “ended up developing” because the decision process was far from straightforward. But more about this soon.
This app shows you a catalog of Things, which are user-made objects that people have created and uploaded using Rooms. When you click a category or tag, you are presented with a collection of Things:
Need a voxel corgi? Who doesn’t! Coming right up. This is the details view that shows the object and some information about it.
One of the (many) cool things in the visionOS SDK is that it allows you to embed a RealityView directly into the view hierarchy, so we can have a truly 3D widget where a model comes out of the window:
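For the curious, here’s roughly what that embedding looks like in code. This is a minimal sketch, not our actual app: the view name and the “corgi” asset name are placeholders.

```swift
import SwiftUI
import RealityKit

struct ThingDetailView: View {
    var body: some View {
        VStack {
            Text("Voxel Corgi")
                .font(.title)

            // The RealityView lives right inside the SwiftUI hierarchy, so the
            // loaded model can visually pop out of the window toward the user.
            RealityView { content in
                if let model = try? await Entity(named: "corgi") {
                    model.position = [0, 0, 0.1] // nudge it out of the window plane
                    content.add(model)
                }
            }
        }
    }
}
```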
So naturally we thought “why don’t we try this idea on visionOS?” After all, what could go wrong?
So I ported all our VOX parsing and mesh-building code from our C# codebase to Swift and RealityKit while Nick and Jason worked on the spatial UI for the app.
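The RealityKit half of that port mostly boils down to turning the parser’s output into vertex positions and triangle indices and handing them to a MeshDescriptor. Here’s a simplified sketch of the idea; the VoxelFace type and the function name are illustrative, not our actual code.

```swift
import RealityKit
import simd

// A hypothetical visible face produced by the VOX parser: four corners in model space.
struct VoxelFace {
    let corners: [SIMD3<Float>] // 4 corners, counter-clockwise
}

// Turn a list of visible voxel faces into a single RealityKit mesh.
func makeMesh(from faces: [VoxelFace]) throws -> MeshResource {
    var positions: [SIMD3<Float>] = []
    var indices: [UInt32] = []

    for face in faces {
        let base = UInt32(positions.count)
        positions.append(contentsOf: face.corners)
        // Two triangles per quad.
        indices.append(contentsOf: [base, base + 1, base + 2,
                                    base, base + 2, base + 3])
    }

    var descriptor = MeshDescriptor(name: "voxelChunk")
    descriptor.positions = MeshBuffer(positions)
    descriptor.primitives = .triangles(indices)
    return try MeshResource.generate(from: [descriptor])
}

// Usage: wrap the mesh in a ModelEntity and add it to a RealityView's content.
// let entity = ModelEntity(mesh: try makeMesh(from: faces),
//                          materials: [SimpleMaterial(color: .orange, isMetallic: false)])
```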
It was here that we started to understand that this new device had a very different interaction paradigm that we had to take into account.
Hover is in the eye of the beholder
One of the unique features of the Apple Vision Pro is its focus on user privacy. Because of this, and unlike previous headsets I’ve worked with, it doesn’t automatically expose some key pieces of information like the user’s head orientation, gaze direction, window positions, or even where the user is in the physical space.
All we can do is tell the system what elements can be interacted with, and it will take care of detecting if the user has made any gestures that affect those elements. In particular, even hover effects are completely opaque to the application: we just say we want hover effects, but we don’t get notified when the user is hovering over our components.
This also applies to 3D graphics: we can mark certain entities as being hoverable and interactable, but we don’t have any information about when that actually happens.
As a result, we can’t “smartly” react to a hover state or a click on demand: whatever is hoverable or clickable needs to exist in the scene as a RealityKit entity.
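Concretely, all we do is opt in, both in SwiftUI and in RealityKit, and the system keeps the hover state to itself. Here’s a minimal sketch of what that opt-in looks like (the names are ours):

```swift
import SwiftUI
import RealityKit

// SwiftUI side: we only *declare* that the button can show a hover effect;
// the app is never told when the user is actually looking at it.
struct CatalogButton: View {
    var body: some View {
        Button("Voxel Corgi") { /* the tap handler fires, but hover never does */ }
            .hoverEffect()
    }
}

// RealityKit side: the same idea for 3D content. The entity is marked as an
// input target with a hover effect, but the hover state is never reported back.
func makeInteractableEntity() -> ModelEntity {
    let entity = ModelEntity(mesh: .generateBox(size: 0.1),
                             materials: [SimpleMaterial(color: .cyan, isMetallic: false)])
    entity.components.set(InputTargetComponent())
    entity.components.set(HoverEffectComponent())
    entity.components.set(CollisionComponent(shapes: [.generateBox(size: [0.1, 0.1, 0.1])]))
    return entity
}
```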
We had initially intended to make a voxel editor, and actually started building it:
But because of this limitation (which, granted, exists for good reasons), every face of every voxel in the model would need its own click target, making the approach prohibitively expensive even for small models.
But at this point we hadn’t tried the actual headset yet! Maybe there was a surprise in store for us. It wasn’t out of the realm of believability that the headset would be so performant that having thousands of entities wouldn’t be a problem after all.
Engineers like me are quite good at wishful thinking when faced with the prospect of a painful rewrite of a major piece of software. I mean, surprisingly good.
Getting the headsets
With the headset ship date fast approaching, we decided to meet up in New York so we could spend a week together working on the app.
To be absolutely sure we’d get the headsets in time, we each bought one and shipped them all to Jason’s (our CEO) house. He monitored the delivery and heroically stood guard as the UPS truck came in with the goods.
We tried on the headsets for the first time.
Magic.
We did the onboarding. It was incredible. Crisp high-res display, perfect eye tracking, natural hand gestures, immersive spatial sound, delightful UI.
We tested the latency using the “coaster test,” where we tossed a coaster back and forth to see if the latency was low enough that we’d be able to catch it with the headset on.
We realized that we can’t catch coasters even without the headset, due to their weird shape, so that wasn’t a good test. If you were reading this article for the scientific rigor, I’m sorry to disappoint you.
Issues
Even though the headset is state-of-the-art and the experience is excellent at first, we soon realized a few incompatibilities between our aspirations and reality (and when that’s the case, reality tends to win):
Gaze-based interactions are intuitive, but not extremely precise (not to the level that they’d need to be for voxel editing).
It’s not super comfortable to use your eyes for precision work. Our eyes quickly got tired from repeatedly darting from surface to surface to place voxels.
However, there were some unexpectedly good things we realized too:
The headset’s image is sharp and clear, no blur whatsoever, and it’s so high-res that we can use our phones even with the headset on.
Scrolling is fuuuuuuuun. Why is scrolling so fun? It doesn’t have a right to be fun, it’s scrolling. Even the settings UI is fun because there are lots of settings and you can scroll through them.
Other hand gestures are fun too (dragging, rotating, scaling).
Lots of finger pointing
Having thousands of entities in the scene sitting there waiting to be gaze-activated was a non-starter from the perspective of the hardware. Using our eyes to select voxels was a non-starter from the perspective of ergonomics.
Okay, so if gaze-based voxel editing wasn’t going to work, we needed to figure out how the user could select where they wanted to place a voxel.
We searched far and wide for the answer and eventually found it at the end of our arms. What if we could simply point using our index finger towards the voxel we want to modify and do a gesture to edit it?
So I implemented hand tracking using the ARKit API, which required me to learn a bit of hand anatomy so I could understand the docs, for example when something returns a vector that goes “from the metacarpal-phalangeal joint to the proximal interphalangeal knuckle”.
So the user’s finger would be a “laser pointer” and they’d join their fingers or “click” with their thumb to place a voxel.
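Roughly, the hand-tracking loop looks something like the sketch below. It assumes the app has hand-tracking permission and an open immersive space; the pinch threshold and helper names are made up, not Apple’s.

```swift
import ARKit
import simd

let session = ARKitSession()
let handTracking = HandTrackingProvider()

func trackPointingHand() async throws {
    try await session.run([handTracking])

    for await update in handTracking.anchorUpdates {
        let hand = update.anchor
        guard hand.isTracked, let skeleton = hand.handSkeleton else { continue }

        // Joint transforms are relative to the hand anchor, so convert to world space.
        func worldPosition(of joint: HandSkeleton.JointName) -> SIMD3<Float> {
            let transform = hand.originFromAnchorTransform *
                skeleton.joint(joint).anchorFromJointTransform
            return SIMD3<Float>(transform.columns.3.x,
                                transform.columns.3.y,
                                transform.columns.3.z)
        }

        // The "laser pointer": a ray from the index knuckle (the
        // metacarpal-phalangeal joint) through the fingertip.
        let knuckle = worldPosition(of: .indexFingerKnuckle)
        let tip = worldPosition(of: .indexFingerTip)
        let pointingDirection = simd_normalize(tip - knuckle)

        // The "click": thumb tip pinching against the index fingertip.
        let thumbTip = worldPosition(of: .thumbTip)
        let isPinching = simd_distance(thumbTip, tip) < 0.02 // ~2 cm, tuned by eye

        // Cast pointingDirection from `tip` into the voxel grid and place a voxel
        // when isPinching transitions from false to true.
        _ = (pointingDirection, isPinching)
    }
}
```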