As the end of 2011 approaches, it is only natural to begin reminiscing about the year past and prognosticating about the future. With the introduction of Siri on the iPhone 4S, the continued popularity of gesture-tracking devices like the Microsoft Kinect, and a cryptic quote in Steve Jobs’ biography about “cracking” the TV problem, many otherwise sensible writers have begun dreaming of a voice- and gesture-activated AppleTV to come in 2012. While pervasive voice-activated computing may be part of our future, there will still be room for good old buttons here and there, including remote controls and keyboards. So, while asking your TV to show you the latest episode of Arrested Development is undoubtedly cool, it is simply not likely to happen anytime in the near future.
Can you hear me now?
To begin, there is the classic example of why voice-controlled televisions are a bad idea. Alec Baldwin as Jack Donaghy on 30 Rock demonstrates VoAct television to his new boss, with hilariously disastrous results in this clip:
Commenters have proposed myriad solutions to this issue, like adding a special command word and noise subtraction, so the TV knows when to listen and also ignores sounds that are part of the currently playing program. All fine and good, except for one problem: adding a special word to the interaction, like “TV – volume up”, breaks the magic of Siri as a natural language assistant. You never have to say “Siri – find me a good Ethiopian restaurant near here,” just as you would likely not address a real-life assistant by name if you are looking straight at him or her. Even if you do use your assistant’s name, it will only be once at the beginning of the conversation; from there, the context lets the assistant know to whom you are speaking.
What Steve really said
If you look closely at what Steve Jobs had to say at the D8 conference about revolutionizing the TV experience, his focus was on the fundamental problems of actually getting a product to market and creating a unified, intelligent interface. From his response:
The only way that’s ever going to change is if you can really go back to square one, tear up the set top box, redesign it from scratch with a consistent UI across all these different functions, and get it to consumers in a way that they’re willing to pay for it.
Getting customers to pay a large sum of money up front is a huge issue, because the current cable box/DVR/satellite receiver model involves paying smaller amounts over time. Most consumers would rather pay their $10 a month in perpetuity than drop $300 up front on a set top box. The cable/satellite industry is quite happy with its profits on set top boxes and would likely say thanks-but-no-thanks to providing an Apple-branded box which required sharing revenue. Providing a unified, consistent UI for accessing live television, on demand content, stored content, and web content would make the TV experience much better, but getting people to pay for it and getting content providers to disrupt their existing business models are not trivial tasks.
Questions of getting this product into consumers’ hands aside, how would this magical AppleTV work? According to prognosticators, we will politely ask our television to do all sorts of things, not the least of which are changing channels and raising the volume. A voice command interface like Siri is infinitely expandable, as the reasoning goes, so we could eventually use our TV to pick a restaurant or look up movie showtimes (or make us a cup of tea, Earl Grey, hot). This line of reasoning contains one major flaw: do we really need an assistant in the TV? How hard is it to find Real Housewives of Kalamazoo that we need our assistant to find it for us? Siri is designed for a fundamentally different set of commands; as an assistant, it (she?) is capable of artificially intelligent interpretation. Switching between Dirty Jobs and South Park does not require intelligence, which is why a remote control works quite well for the task. The proliferation of remote layouts, button-happy engineers, and disparate interfaces for the variety of entertainment gadgets we use is a problem Apple can solve with a single, Front Row-esque interface. The method of interaction…does not really need fixing.
You do the Hokey Pokey
Interacting with a Siri-powered TV would be awkward; how would you scroll through a list of Yelp recommendations without saying “down…down…down…wait, go back up?” At this point, the predictions add gesture-based enhancements: a motion-tracking FaceTime camera would let you gesture to scroll through a list, bump up the channel, or decide to display only results from the theaters nearby that are known to sell Swedish Fish. Microsoft has proven that people are willing to get up and dance in front of their televisions, but can it be definitively said they are ready to hold their arms out like zombies to idly scroll through channels? That answer is decidedly less clear. More to the point, what if you raise your arm to stretch and the TV misinterprets the motion and raises the volume to an earsplitting level?
The more complex the combination of voice and gesture control required to operate the TV, the more like an all-singing, all-dancing buffoon the operator is going to look. Watching TV is a vegetative activity, not an aerobic workout. Current remote controls work well enough for their purpose, but they lack the signature Apple flair for simplicity and usability. For Apple to revolutionize TV, there is no need to turn the experience into a chatty Jazzercise session. An elegant multitouch remote, a powerful yet simple UI to access all forms of content, and Apple’s signature ease of use across the whole package would be a refreshing departure from today’s maze of cables, clunky Java-based set top OSes, and interfaces designed by engineers. The ability to stream content from nearly any source to a TV over AirPlay positions iOS as a standardized interface while unlocking a multitude of content sources for streaming. The simple wireless connection and dearth of configuration options complete the package, so it could be (at least partially) argued that Apple’s cracking of TV has already happened. With Siri controlling an iDevice to access content, the voice control interface makes more sense, and the method of interaction is already proven: hold the home button or raise your iPhone to initiate a chat with Siri.
Voice control is a useful technology, but predictions of its ubiquitous use are unlikely to ever come true. Would you feel comfortable shouting your PIN or password at each website you visit? Can you imagine how loud the average office would be with workers dictating every word? Touch interfaces and gesture controls are similarly oversold as magic wonders, when the reality is that holding your arms out in front of you all day leads to a syndrome called gorilla arm.
There are obvious use cases for each technology, though it is premature to declare the death of the remote control, keyboard, or multitouch mouse in favor of an arm-waving, command-barking future. If you prefer to sit back on the couch and disengage your brain for this week’s episode of Jersey Shore, by all means, do not raise your hand and ask your TV to DVR it. Sit back and use your delightfully functional remote control.