Smart Content West: The Power of Azure AI

LOS ANGELES — For Martin Wahl, principal program manager for Microsoft’s Azure Media Services, a big part of the job is speaking to conference audiences and customers about how to use artificial intelligence technology in day-to-day production workflows.

And, during his meetings and presentations, he’s constantly surprised at how much manual effort content companies are still putting into tagging, categorizing and sharing their video.

“You possibly have staff — very large staff — that is producing manual tagging, looking at faces, or words, objects, scenery, making manual notes, and then trying to figure out how to work that into the production workflow and editing,” Wahl said, speaking Feb. 27 during a presentation at the Smart Content Summit West event. “Perhaps you’re doing production [tagging] of the day’s work, manually. Perhaps you are trying to share that content with your colleagues around the world, manually.”

It doesn’t have to be that way: The ability to use the cloud (and the artificial intelligence capabilities within it) to offset and offload those manual tasks is part of what Microsoft has focused on recently, via its Microsoft Cognitive Services work. Cognitive Services is a collection of cloud-based AI capabilities that gives clients access to high-powered processing and neural networks, enabling in-video identification of what only humans could pinpoint in the past, whether it’s vision, speech or language.

“We took this set of cognitive capabilities, and applied it to [the media and entertainment] space, for video in particular, creating a category called Video AI,” Wahl said, listing a set of nearly a dozen features — available as individual media APIs — covering speech-to-text, face and facial emotion detection, video stabilization and summarization, object detection, content moderation and more.

“These are available today — not pie-in-the-sky pipe dreams — real services you can be using today [that] automate tasks that were previously done manually,” he said.

When Microsoft introduced its cognitive video tech roughly a year ago, it initially found clients using the service for one content need at a time: “’I need a transcript of the show that I’ve shot, or just produced, or a lecture I just made, run that through a speech-to-text engine, get a transcript, and walk away,’” Wahl said. “And the next day they realize, I need to do the same thing for the faces or the objects or the scenery.

“That’s not optimal for this industry, which really needs to build a library of data,” he added. “If we’ve learned anything, while content is king, data is possibly queen.”

To address that data need, Azure debuted the industry-specific Video Indexer product last year, an offering that combines all of Microsoft’s AI capabilities into one set of tools, letting users compile a database that covers everything in a piece of content, from spoken words to words on the screen, from every face to every brand that appears.
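In practice, Video Indexer is exposed as a REST API: a video is uploaded for indexing, and the resulting library can then be searched across all of the combined AI tracks at once. The sketch below is illustrative only — the account ID and token are placeholders, and the endpoint shape reflects the publicly documented Video Indexer search route, not anything stated in the presentation.

```python
from urllib.parse import urlencode

# Root of the Video Indexer REST API. The location ("trial"), account ID
# and access token used below are placeholders, not real credentials.
API_ROOT = "https://api.videoindexer.ai"

def build_search_url(location: str, account_id: str,
                     query: str, access_token: str) -> str:
    """Build the URL for a Video Indexer library-wide search.

    The search endpoint matches the query against every indexed track —
    transcripts, on-screen text (OCR), faces, labels and brands.
    """
    params = urlencode({"query": query, "accessToken": access_token})
    return f"{API_ROOT}/{location}/Accounts/{account_id}/Videos/Search?{params}"

url = build_search_url("trial", "00000000-0000-0000-0000-000000000000",
                       "coffee", "PLACEHOLDER_TOKEN")
print(url)
```

Issuing a GET against a URL like this (with a real account and token) returns JSON describing which videos, and which moments within them, match the query.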

Later this year, Microsoft will introduce new capabilities for Indexer users, including emotion sensing (which detects emotions expressed via both speech and facial expressions), logo detection, and a live analytics feature that can analyze content from live broadcast sources.

In-video search is among the clearer benefits of Azure’s offering, with the ability to pinpoint every person, brand, service, image, keyword, you name it, Wahl said. “The ability to jump right to parts of your video, pull that out, tag it and use it, is exactly what this capability is all about,” he said. “If I want to do a search, find a cup of coffee that showed up in one of my videos in my library, I can find exactly where that cup of coffee might be hiding.” And that could include the word “coffee,” the image of a cup of coffee, or just an associated reference to anything having to do with coffee.
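The indexer’s output for each video is a structured “insights” document, which is what makes the coffee example possible: a single term can be checked against spoken words, on-screen text and detected objects alike. The sketch below uses a deliberately simplified insights shape (the real Video Indexer JSON is far richer) to show the idea.

```python
def find_term(insights: dict, term: str) -> list[tuple[str, str]]:
    """Scan a simplified insights document for a term.

    Returns (track, timestamp) pairs for every hit, so a match can come
    from the transcript, from on-screen text (OCR), or from an object
    label — any reference to "coffee" surfaces, wherever it hides.
    """
    term = term.lower()
    hits = []
    for track in ("transcript", "ocr", "labels"):
        for item in insights.get(track, []):
            if term in item["text"].lower():
                hits.append((track, item["start"]))
    return hits

# Toy insights document: one spoken mention, one on-screen sign,
# one detected object, each with the timestamp where it appears.
insights = {
    "transcript": [{"text": "Grab a coffee and sit down", "start": "0:00:12"}],
    "ocr":        [{"text": "COFFEE HOUSE", "start": "0:01:40"}],
    "labels":     [{"text": "coffee cup", "start": "0:02:05"}],
}

print(find_term(insights, "coffee"))
```

Each hit carries a timestamp, which is what enables jumping straight to the matching moment in the video rather than scrubbing through it manually.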

Wahl offered up a current example of how Azure’s AI tech is being employed, with Endemol Shine Group using it to help capture and automatically tag everything, from every camera angle, for the long-running reality show “Big Brother.”

“We’re not necessarily making the show that much more exciting — because only the creative side can do that — but we are giving them the ability to focus on that creative element more, and the manual process less,” Wahl said. “And that’s precisely what this technology is intended to do.”

Audio of Martin Wahl’s presentation from the Smart Content Summit West, as well as the presentation itself, is available for download.