Clips or It Didn’t Happen

Mountain View

12 October 2017

Babies and pets, Google is watching you. Not only does it know when you’re sleeping and when you’re awake, it knows when you’re doing something cute, funny, or otherwise remarkable. It sees and knows these things through a deceptively tiny box, roughly the size of a tin of breath mints, which wirelessly beams this social-media-feed fodder to a doting and/or watchful parent, guardian, or custodian.

It’s not the latest Pixel smartphone – it’s a new gadget called Clips. Unveiled by Google last week alongside a suite of new hardware products, the diminutive AI-enabled camera pre-selects the “best” moments of a long-take home video. Expressly intended for parents of young children, the “lightweight hands-free camera [...] helps you capture more genuine and spontaneous moments of the people – and pets! – who matter to you.” (It also doubles as a baby monitor.)

Well-intentioned or wilfully creepy, Clips infuses the simplicity of a dedicated hands-free recording device – like a GoPro, it does not have a screen or viewfinder – with a steroidal dose of computational processing power: the device algorithmically curates a highlight reel of candid seven-second memories that can be synced, viewed, and saved via an app. Mindful of post-Glass privacy concerns, Google stresses that “nothing leaves your device until you decide to save it and share it”. Clips does not have a microphone.

From an industrial design perspective, Clips is unremarkable, just another manifestation of Rams-via-Ive minimalism: a two-inch square, white at the front and teal at the back. The exaggerated affordances of the lens, the single button below it, and the detachable, clippable case are meant to express “don’t be evil” at every turn. As Google puts it, “it looks like a camera, and lights up when it’s on so everyone knows what Clips does and when it’s capturing.” This is precisely why it is noteworthy: Google Clips is a computer disguised as a camera.

Truth be told, this transformation of the camera into a computer is happening more quickly than anyone could have foreseen – not so much a matter of faster-than-the-eye-can-see but rather more profoundly than the mind can fathom. It was just over a year ago that Snapchat, the disappearing-message app, unveiled its first hardware product, a pair of youthfully styled camera-equipped sunglasses. Released last November, Spectacles are meant for capturing and sharing point-of-view video. Lacking an interface in itself, the $129 tech-enabled fashion accessory wirelessly beams this visual data to the user’s smartphone in a proprietary circular video format, as ten-second memories that can be synced, viewed, and saved via Snapchat.

Along with the announcement of Spectacles, Snapchat also declared that “Snap Inc.” – as it would henceforth be known – “is a camera company.” Implicit is the fact that “camera” now refers not to a precision-engineered hunk of metal and glass that captures light with chemicals but rather to a screen-based app that is one of many functions of a precision-engineered hunk of metal and glass that is wirelessly connected to a network of unimaginable scale.

Google Glass being the obvious precedent to Spectacles, it is interesting to see both how Snapchat learned from Goliath’s failure and how the proverbial “search giant” has set out to probe the real-world applications of its robust machine-learning R&D. Clips was just one of eight new AI-enabled hardware products that Google unveiled last week, from earbuds that can translate speech in real-time to a new model of its VR headset. But the camera, with its ultra-specific target audience and use case, is the one that demands a closer look. Whereas Glass was designed to be a wearable computer – a heads-up display, or HUD – embedded with a camera, Clips represents a camera that is a full-fledged computer.

What, then, does the camera actually see? In the case of Pinterest and Amazon, it sees sales opportunities: the former has created a “visual search” app that free-associates keywords based on image or video input, and can identify clothing and shoes to drive sales. Amazon, in turn, launched the “Echo Look” in April, imbuing its voice-controlled “intelligent personal assistant” Alexa with sight, purportedly for sartorial tips linked to the e-commerce titan’s robust recommendation engine.

And then of course there’s Apple’s forthcoming iPhone X, which will make its way to the hands of consumers next month. Its revolutionary Face ID system, powered by the cutting-edge “TrueDepth” front-facing camera, marks both the apotheosis and the subversion of the selfie. Unveiled to a mix of measured adulation and justifiable consternation, it remains to be seen whether facial recognition will catch on as the de facto authentication protocol; in any case, Apple’s grand experiment will play out on a scale that is several orders of magnitude greater than that of Google’s Clips.

Both Apple and Google insist that users control their data, which is stored on the device itself, but the mere existence of these kinds of devices – to say nothing of the demand for them – shows how far we have come since 2013. Just weeks after the initial “public beta” of Glass, in May of that year, The Guardian and The Washington Post published whistleblower Edward Snowden’s damning leaks from the US government that irrevocably undermined users’ trust in tech companies. Now, four years on, we seem to trust them more than ever before, and continue to trade personal data for a few likes or followers. The ironic side effect of the surveillance disclosures of 2013 is the general sense of powerlessness that has taken hold in the face of the sheer scale of big data.

And why shouldn’t we stand to benefit from these technologies? Both Spectacles and Clips speak to a powerful impulse to record by default – a corollary to Snapchat’s mantra “delete by default” – while remaining “in the moment,” whether the camera sits on one’s face or abides in the background, waiting for the perfect shot. In an age where it’s easier to take a photo of, say, a wall text at a museum than to actually read it, it’s only logical that an “always-on” camera should know which clips are keepers and what to leave on the proverbial cutting room floor.

In any case, it’s still too soon to tell. Google says that Clips will be available soon, but Spectacles may serve as a cautionary tale, just as Glass did for Snap before. Insofar as the marketing for the camcorder shades depicted them as a high-tech toy, Spectacles have proven to be just that – a novelty, apropos of last holiday season’s hot-ticket gadget – and, perhaps more tellingly, Snap has stumbled since its much-hyped initial public offering in March this year, when it went public to the tune of a $34 billion valuation. Since then, the company has seemingly resigned itself to the fact that the app’s most successful features have been copied, down to the name, by the likes of Facebook and Instagram (i.e. Stories). Nevertheless, a prescient passage from the company’s S-1 Registration Form, filed with the SEC on the occasion of the stock offering, uncannily foreshadows cameras to come:

In the way that the flashing cursor became the starting point for most products on desktop computers, we believe that the camera screen will be the starting point for most products on smartphones. This is because images created by smartphone cameras contain more context and richer information than other forms of input like text entered on a keyboard. This means that we are willing to take risks in an attempt to create innovative and different camera products that are better able to reflect and improve our life experiences.

If the last bit sounds like it could just as well describe Clips, it is all the more interesting that another Google product, launched in May of this year, exemplifies the first part of the prescient passage, repurposing the camera as an interface for navigating the real world. Likened to a visual form of the popular song-identification app Shazam, Google “Lens” is capable of detecting and recognising objects such as flowers and restaurants in images or via smartphone camera. (The Google Translate app has offered real-time on-screen translations of on-camera text since January 2015 – which, incidentally, is when Google discontinued Glass.) Thus, if Snapchat strategically excluded the interface from Spectacles, Google has taken the opposite tack and revived the visual search function as an app instead of a wearable camera, putting (or keeping) the HUD at arm’s length.

Whereas Lens conjures an informational layer atop the visible world, the new Clips camera is the first consumer-facing AI that attempts to understand the emotional content of visual data. Together, they illustrate two sides of Google’s overarching AI effort – the left hemisphere of the brain versus the right, if you will – though the jury’s still out as to whether their machine-learning algorithms, “trained” as they are by software engineers in Mountain View, can outsmart the sceptics.

For now, they’ve set their sights on babies and pets.