The future of video search
Welcome back to What’s NEXT, a podcast from Samsung NEXT exploring the future of technology. In this episode, I talk with Vidrovr CEO Joe Ellis about the implications of video search. Also joining us is my colleague Jacob Loewenstein, who is our resident expert on media technology.
Ryan Lawler: I guess to start, Joe, tell us a little bit about Vidrovr. How did you get started, what does the company do?
Joe Ellis: Cool. Vidrovr is a video search and understanding platform. We help media companies, networks, broadcasters, and digital publishers index either their large video archives or their real-time video streams. For example, some of our customers are sending us television channels, and we’re then providing metadata across that video asset. We’ll say, this is when this person appears on screen, this is when some text is on screen, this is what’s visually appearing, these are the topics that they’re talking about, and information like that. Then, that gets fed back into a structured knowledge graph and indexed into our search solution, and then we provide user interfaces that can either be launched on the front of owned and operated websites, or used internally within these companies to help their editors and producers find the right small granular clip of video for whatever granular search query they’re doing.
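To make the idea of time-coded metadata feeding a search index concrete, here is a minimal sketch in Python. It is not Vidrovr’s implementation: the `Segment` record, its field names, and the `build_index` and `search` helpers are all hypothetical, and a simple in-memory inverted index stands in for the knowledge graph and search solution described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One piece of time-coded metadata about a video (hypothetical schema)."""
    video_id: str
    start: float  # seconds from the start of the video
    end: float
    kind: str     # e.g. "person", "onscreen_text", "visual", "topic"
    label: str


def build_index(segments):
    """Map each lowercase word of a label to the segments that contain it."""
    index = defaultdict(list)
    for seg in segments:
        # set() avoids indexing the same segment twice for a repeated word.
        for token in set(seg.label.lower().split()):
            index[token].append(seg)
    return index


def search(index, query):
    """Return segments whose label contains every query word, in time order."""
    tokens = query.lower().split()
    if not tokens:
        return []
    candidates = index.get(tokens[0], [])
    hits = [
        seg for seg in candidates
        if all(t in seg.label.lower().split() for t in tokens)
    ]
    return sorted(hits, key=lambda seg: (seg.video_id, seg.start))
```

A query such as `search(index, "ballistic missile")` would return the matching segments with their start and end times, which is the information an editor needs to jump straight to a granular clip.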
Ryan Lawler: How did you get started doing that? What got you interested in the space?
Joe Ellis: I’ve been working on this problem for a really long time, actually. Much longer than Vidrovr. I started in grad school at Columbia University, working under the supervision of Professor Shih-Fu Chang. Newsrovr was a platform we actually built in the Columbia research facility. It processed a hundred hours of television news content a day. We stored it. We actually had antennas in CRF, just downloading content, digitizing it, and then pumping it into a storage drive. We would then run these algorithms over it. After we had all this different information, we built a search interface, launched a website, and allowed basically anyone to search through anything that had happened over those hundred hours of television news on any day.
Ryan Lawler: Where did you get access to the actual video itself?
Joe Ellis: Oh, we stole it. We actually plugged cables into the wall at Columbia, or just got it over the air. “Stole it” being kind of a funny word. I’ll get into why that’s funny. We were just doing research, so we didn’t show it to anyone. We didn’t sell it anywhere, obviously. We were just grad students at Columbia trying to build a search solution that could help you find anything that appeared in video that was on television.
Ryan Lawler: Were there trends that were going on at the time that really made you focus on video? Why did video grab your attention?
Joe Ellis: I think back in 2010, 2012 I was watching a ton of YouTube, and so I was excited. I was like, wait, this is awesome. Oftentimes I would go in and do a super deep dive on YouTube. I remember the first one I did, right before I went to grad school, was the JFK assassination. I had never really understood, or heard anything about it. I went through this deep dive documentary and spent four hours on YouTube looking up all this stuff. It was awesome because there’s all this content, but it was also a pain because it was really hard to find what I was looking for. I decided, hey, I think we can make this solution a little bit better. We could probably help people do their JFK deep dives in one hour instead of four hours.
Ryan Lawler: How many people think you’re a VR company, because your name ends with VR?
Joe Ellis: So many. I’ll tell the VR name story. Newsrovr, this is how we started the company. We named it Newsrovr because we were processing news videos, and the Mars rover had popped up just when we started. This stupid rover was appearing on every video that was searched. We’re like, oh, we can call it Newsrovr, we’re roving over the news. We had this little logo, it was like a little newspaper that was a rover. It was ridiculous. We decided to start the company. We’re grad students, so we’re like, oh, Newsrovr, we’ll just keep the name. Everyone said, “No, that’s stupid because you do a lot of other stuff.” Like, we process other types of video besides the news. You can’t pigeonhole yourself. We said, “Cool, we’ll do Videorovr.” Went online, tried to buy Videorovr, and that was like 3 million dollars for a domain name. I said, “Okay, well, Videorovr is not.” Tried Vidrover, V-I-D-R-O-V-E-R. 500,000 dollars. So, try again. Just cut off the E, and went to V-I-D-R-O-V-R, and that was eight dollars, so we ended up incorporating Vidrovr. About a week later, probably, the first person asked me, “Hey, are you guys a VR company?”
Ryan Lawler: How about transitioning from being a grad school researcher in an academic environment to actually becoming an entrepreneur and productizing this?
Joe Ellis: A couple of things I did were really useful and, looking back, valuable. I did internships at IBM. I was at TJ Watson, and then I also worked at Google while I was at Columbia. Both those things made me a much better engineer than I was just being a researcher. Actually working in a production environment, being able to ship code that worked and was valuable, was really, really important. After I learned that skill, being in industry, I started to think, hey, all of the stuff that we built is really valuable, but ultimately if I just finish my PhD it’s going to get turned off. We’ll just go into the server room, probably turn it off. No one else knows how to run it. A bunch of my colleagues that I had worked on it with were graduating.
Joe Ellis: I started thinking, this is valuable. We think that there are media companies both in New York City and around the world that could really take advantage of this type of technology. Why don’t we try to spin it out and actually build a useful solution that people will use? Columbia patented the technology as well. That was also a bit of a bonus. Having that patent in our back pocket, knowing that this was somewhat defensible, was great for us, and it helped us move the company forward.
Ryan Lawler: Who are your customers today, and what are they using Vidrovr for?
Joe Ellis: Most of them are broadcasters or networks. They are folks that have large amounts of television content that they need to quickly and seamlessly transition to digital. What they do is they send that television content directly through our platform. We provide them detailed metadata about each particular section of the video. We provide them start and end points where they can clip those videos out, titles, the people that are appearing, and all the different metadata to make it searchable. Either we send that data to them, and their engineers and technical teams can leverage it to build new products, or that video actually comes into our search platform. Then, they can leverage our search interfaces, our search APIs, and our search knowledge graph as well.
Ryan Lawler: Give me some examples of the types of metadata that you’re collecting, and able to present to them.
Joe Ellis: If you were to, say, send us a television channel, we can present you things like, hey, this is the text that’s onscreen. We think it might be a great title for this video. These are the different people that are appearing within the video content. Visually, this is what’s appearing. An example is North Korea: we have all these videos coming through our platform about this North Korea crisis. This is a section of video where the ballistic missile is appearing. This is a section of video about the demilitarized zone. Then, they can use those to publish that out to their digital content.
Ryan Lawler: It sounds like at the core of this, or part of the core of this, is a pretty broad type of image recognition. Basically understanding what’s in the video. There could be so many things in a news clip, right? But, you’re not necessarily specializing for one type of object. How do you handle covering such a broad range of content that you have to identify?
Joe Ellis: That’s an awesome question. What we actually specialize in, and what I did a lot of at Columbia, is, broadly speaking, multimodal machine learning. Video is really great for that. With, say, image recognition, a lot of times all you have is the image content, so that is truly an image recognition problem. There’s no other way to solve it. What you can do with video is vast. There’s a transcript that’s available. We leverage the transcript. There’s onscreen text that’s available. Although that’s a visual recognition problem, it’s a totally different thing than detecting a ballistic missile on screen. Right? We leverage the traditional image recognition, facial recognition, and then also other multimodal machine learning methods to do person discovery on these video assets.
Joe Ellis: For example, if Susan Collins is appearing on screen, whether or not we know exactly who Susan Collins is, her name might appear on screen below that. The producer or the person, the anchor might say now we’re going to hear from Susan Collins on blank. We automatically leverage all these cues to better determine that this person’s probably Susan Collins, now we know what she looks like. Every other time she appears in a video, in this particular library, we’ll be able to find her.
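The cue-combining idea in this answer can be sketched as a simple weighted vote. This is only an illustration under stated assumptions, not the method used at Vidrovr or Columbia: the `name_face_track` function, the cue dictionaries, the source weights, and the `min_score` threshold are all hypothetical, and the face track, onscreen text, and transcript mentions are assumed to have been extracted already.

```python
from collections import Counter


def name_face_track(track_start, track_end, cues, weights=None, min_score=1.0):
    """Guess a name for an unlabeled face track from co-occurring cues.

    Each cue is a dict with "name", "source" ("onscreen_text" or "transcript"),
    "start", and "end" (seconds). Cues that overlap the face track in time vote
    for their name, weighted by how much we trust the source.
    """
    # Hypothetical weights: a name in the lower third is stronger evidence
    # than a spoken mention, which may refer to someone off screen.
    weights = weights or {"onscreen_text": 1.0, "transcript": 0.5}
    scores = Counter()
    for cue in cues:
        overlaps = cue["start"] < track_end and cue["end"] > track_start
        if overlaps:
            scores[cue["name"]] += weights.get(cue["source"], 0.0)
    if not scores:
        return None
    name, score = scores.most_common(1)[0]
    return name if score >= min_score else None
```

Once a track has been named this way, its face can serve as a labeled example for recognizing the same person elsewhere in the library, which is the step described at the end of the answer.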
Joe Ellis: What’s interesting about that is there is not a ton of supervised data available. Leveraging all these modalities in unison and combining them together is a bit of a semi-supervised learning approach, which I think is really powerful, and that’s a lot of the work that I did at Columbia. I think that’s the way that we should be developing video search solutions. The other really multimodal method that we leverage is automatically assigning hashtags that are appearing across Twitter, from the Twitter firehose, onto video content. We do that by taking a hashtag, taking all of the tweets that are correlated with that hashtag, extracting images, extracting videos, and extracting text content, and then building multimodal representations of each of those things. Say, hey, this is the text representation that we have of this particular hashtag. These are the patterns, visually, that we think are appearing. Then, we apply those over videos within your video archive or real-time stream so that automatically you’ll be able to say, okay, I think this video would actually work really well in this particular space on Twitter.
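One way to picture the hashtag matching described here is as a nearest-neighbor comparison between embedding vectors. The sketch below assumes, purely for illustration, that each tweet and each video has already been turned into a fixed-length numeric vector combining its text and visual features; how those vectors are produced is not covered in the interview. The function names and the `threshold` value are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_hashtags(video_vec, hashtag_tweets, threshold=0.8):
    """Rank hashtags whose averaged tweet representation is close to the video.

    hashtag_tweets maps a hashtag to the vectors of tweets that used it.
    """
    scored = []
    for tag, tweet_vecs in hashtag_tweets.items():
        similarity = cosine(video_vec, mean_vector(tweet_vecs))
        if similarity >= threshold:
            scored.append((tag, similarity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Averaging the tweet vectors is the simplest possible representation of a hashtag; a production system would more plausibly learn separate text and visual models per hashtag, as the answer suggests.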
Ryan Lawler: Given your approach, what are the alternatives that you’re competing against, and how do you convince your customers to go with you versus the alternatives? That could be alternatives using machine learning, but also, I don’t know, just human beings tagging information.
Joe Ellis: That’s a great question. Manual is always the baseline alternative that we are competing against. That’s what people are doing today. Effectively, everyone wants to not do it manually. Things have to work well enough that it takes them less time than if they have to go back and correct things, so that it’s an efficiency. When you talk to customers, that’s always the question. They’ll be like, oh, we tried this last year, it only got us 60 percent of the way there. The extra 20 percent that it took us to correct it actually made it a longer, worse process for us. We’re always working to push the accuracy of either the video clipping or the metadata generation so that we can get to a point where we’re delivering an efficiency to the company.
Joe Ellis: When you actually do stuff in specific domains, and build algorithms for specific domains, like for us broadcast and pro production content, we’re able to leverage the structure of that domain and develop an ontology that might work a little bit better in that specific instance. The example I like to give is I’m a big San Diego Padres fan, so anyone listening can pity me because we’ve been terrible the last ten years. Google Photos is an awesome solution. I use it to organize all the photos in my phone. If I went to Padres games, I can find those photos and send them to my friends. Amazing, right?
Joe Ellis: That’s interesting for me as a consumer, but if you are an enterprise user, say the Padres, that type of data is actually not that interesting. Baseball, they know that’s all of the photos that they have within their whole repository. Right? They need a much more granular ontology. This is a curveball, this is Tony Gwynn hitting a line drive to left field, that type of information. Based on the structure of the domain, you should be building ontologies like that. That’s what we do. We build domain-specific ontologies that work really, really well in these different areas and verticals.
Ryan Lawler: They wear the camo uniforms. Can Vidrovr determine that they’re not military, or have we reached that level of sophistication?
Joe Ellis: We actually don’t even know. The camo is so good on the Padres uniforms we don’t even know they’re there.
Ryan Lawler: One of the things that I wonder about, hearing you talk about this, is this idea of censorship, and this idea of recognizing content that companies don’t want on their platforms. I feel like those systems fall down consistently and don’t work as well as we would like. Maybe break that down for us. What’s going wrong, and why aren’t these systems able to recognize content that’s obviously controversial, or illegal, or…
Joe Ellis: It’s a great question. I’ll talk first about how we deal with this as a company, and then I’ll talk about my personal views on it. As a company, in many ways we kind of punted on this, because we go direct to the media companies and the content creators themselves. We’re almost always working with pre-vetted, really strong content. We don’t worry that much about it because people that are giving us stuff know that they like their stuff. They want to make it searchable and want the world to find it, and that’s what we help them do.
Joe Ellis: It’s really, really hard, when you’re going beyond violence to some contextual level, to figure out what is not okay to post online or what should be censored. The reason is a lot of it is outside of the video itself. Violence is actually pretty easy to detect, nudity is easy to detect, there’s really good algorithms in these big companies that do that. Contextually, maybe during this time, this type of post is very, very not good. During the Me Too movement, maybe there are particular posts that are more sensitive now than they would have been two years ago and should be removed from some of these sites. Their sensitivity is changing because of the contextual environment that we live in, and that’s really hard for algorithms to understand. As the goals shift in each of these things, the algorithms have to move too, and it’s hard to stay ahead of culture.
Ryan Lawler: Earlier in the conversation we were talking about who your customers are, but what have you learned about challenges to adoption from customers who for whatever reason don’t want to buy your product?
Joe Ellis: I think one thing we’ve found, and this is probably true for most tech startups, is that it’s really important to have an internal champion, and good internal champions, we’ve found, tend to be highly technical and have really strong product visions, where what we’re doing helps them build that out. Another challenge that we have is, like I talked a little bit about before, a lot of people have punted on making their videos available on platform just because they haven’t seen a ton of views over the past five to six years, and all of those views are being generated from YouTube or Facebook. That’s kind of a catch-22, in many ways, for us. One, it’s good because there’s not a lot of work going in there, so we can say to people, hey, turn it all over to us, we’ll make it work for you. Don’t worry about it. It’s not really taking over anyone’s jobs or anything like that. At the same time, it’s hard to make people think that this is something that will work. This is something that can really change the way people interact with our content.
Ryan Lawler: I want to key in on that platform discussion a little bit, because in addition to all of the different types of video that are being created, whether it’s professional, or user generated, or whatever, there’s also an explosion of platforms. It sounds like you work primarily with broadcast, but then you’ve got owned and operated channels, you’ve got third party channels like YouTube. You’ve got social channels, like Facebook, and Snap, and all of those. How do you work across all of those different…
Joe Ellis: We’ll just take YouTube for example. What we do is we actually have a direct integration into their platform. If you’re the content owner and you own a YouTube channel, you own all the rights to that content within that channel. Then, if you give us your account credentials, we’ll actually log in and use their APIs to pull all of the video assets from that channel down onto our platform. We process them, index them, and can either provide metadata back into YouTube to make them more searchable across the YouTube platform, or help you launch a YouTube-like search interface on your O&O so that people can actually go to your O&O, as well as YouTube, to find the videos that you’re creating that they love.
Ryan Lawler: In terms of competition, especially among the big platforms, besides literally locking you out from accessing those videos or that data, is there anything anyone can do to stop you from training on video data? Can they put anything in the video files, can they put anything in the frames that disrupts how you are able to train?
Joe Ellis: A couple of things there. One is, the answer is kind of no. If it’s available, it’s available. Broadly speaking, we only work on, say, YouTube with customers that have agreed to integrate and give us access to the YouTube content.
Ryan Lawler: Sure.
Joe Ellis: Because, they own the rights to those video files, so then we can pull them in.
Ryan Lawler: You’re not scraping the trillions of hours of YouTube content to…
Joe Ellis: No, that would be tough.
Ryan Lawler: When I think about that too, I think about each of the different types of platforms. There are different types of content that should live on them. There’s YouTube-specific content, there’s Facebook-specific content. Are there things that you’re doing to help customers develop the right type of content for the right platform?
Joe Ellis: This is something that publishers are talking about a ton. There are oftentimes specific teams within each of these companies that do only that platform, and they know all the intricacies of the platform, both from a technical perspective and from a what-content-works-best perspective. Our goal would be, for companies like this, to make your library as malleable as possible so that all of the people on those teams can plug into one source and do whatever they want with it super quickly, so that you can get the right piece of content into those places at the right time.
Ryan Lawler: One of the other subjects that I think about, or one of the scariest trends in video right now, is this idea of deepfakes. Right? People using technology to create fake videos, where someone is represented with someone else’s face, or they’re saying something that they didn’t actually say. Are you able to spot, or defend against, those videos? What are the technical challenges there?
Joe Ellis: It’s a super hard technical problem. We actually have not started tackling it at Vidrovr yet. I know there’s some interesting work going on at Columbia looking at some of this stuff. How can you detect these, and is there some level of censorship, what’s going on there? It’s not something that we’ve looked at, although it’s something that I am interested in. I’m looking forward to seeing how we are going to be able to combat that type of stuff over the next year to two years. I do know there’s a lot of research going on around universities about what’s the best way to do that.
Ryan Lawler: I was just wondering, can your tool, can Vidrovr be used to enable modification of those videos? For example, if I know what’s in the video, can I more easily swap things out?
Joe Ellis: That’s an interesting question. Yes, and no. The first, most obvious way that that could actually be used would be on lower thirds and stuff like that. If there’s any text on screen, descrambling it, and posting other pieces of text. We have localization boxes on basically everything that comes out of the platform. It would be a good first layer to put into anything like that. Now, my head is spinning, and I have all these ideas about stuff that I could potentially do. It would be a good first layer that we could feed into some type of algorithm like that.
Ryan Lawler: Got it. Just because we’re kind of hitting on this general topic, is there anything else that your customers have been really screaming for, pounding on the door for, from Vidrovr given your core competency? We talked about specific insights for platforms. Maybe there’s fake detection. Anything else on the wish list that you’re hearing a lot?
Joe Ellis: Social is one. We’re really happy with this hashtag push, because a lot of people are interested in publishing to social and in how they can drive as many video views from social as possible. The other one is, I’ll be honest, we’re seeing a lot of traction in search. When we started pitching this idea a year ago, I will say that platform search was much, much less of an idea. Now, it’s becoming more important. The other thing that we’ve heard from folks is that GDPR coming in is actually very interesting. So many recommendation algorithms are based on user behavior, and are not necessarily content-based algorithms. There’s going to be, potentially, over the next year to two years, much more of a need for content-based recommendation instead of just user-based click recommendation, because it’s unclear how data can be tracked and things like that. That’s another area where I think we’ll be providing real value, and it will be useful for us moving forward.
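The distinction drawn here is that content-based recommendation ranks videos by what they contain rather than by tracked user clicks. A minimal sketch of that idea, assuming each video already carries a set of metadata tags, is below. The `jaccard` and `recommend` helpers and the tag catalog are hypothetical and do not describe any particular product.

```python
def jaccard(a, b):
    """Overlap between two tag collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(watched_tags, catalog, k=3):
    """Return up to k video ids whose tags best overlap the watched video's tags.

    catalog maps a video id to its metadata tags. No user history is consulted;
    only the content metadata of the video being watched is used.
    """
    scored = [(vid, jaccard(watched_tags, tags)) for vid, tags in catalog.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    # Highest overlap first; ties are broken by video id for a stable order.
    ranked = sorted(scored, key=lambda pair: (-pair[1], pair[0]))
    return [vid for vid, _ in ranked[:k]]
```

Because the ranking depends only on the metadata of the current video, it needs no behavioral tracking, which is why this style of recommendation is attractive when data collection is restricted.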
Ryan Lawler: There’s a big problem on the measurement side when you’re trying to understand whether or not people are talking about your product, or whether an ad is impactful in getting people to talk about your product. It seems like Vidrovr might be able to help solve that problem, particularly for social videos that the average person posts. Have people been asking you for your help with that?
Joe Ellis: Definitely. There are two things there that I think are interesting. One is, if it’s micro targeting, I think we would be really helpful for that. That’s actually analyzing a single piece of video content, understanding the different sentiment and emotion that people are feeling about a particular product, and then taking an action off of that. If it’s trying to find macro trends around products, I tend to think that just doing text processing is the better way to go. It’s much less expensive and you can get a really nice sampling of the distribution just using text or tweets to get the overall sentiment around a product, or the overall sentiment of a product in a region. That’s probably sufficient.
Joe Ellis: But, if you’re looking at, say, a person level or a tweet level, taking into account all the modalities that exist within that tweet leads to a much more accurate result. At that point, it becomes a real cost-benefit analysis. If we know we have a much higher level of certainty as to how this person feels about a particular product, what action can we take that will offset the cost of running that, which is much more expensive than any text processing?
Ryan Lawler: What are your thoughts on personalization just in terms of where it is today, and how important it is?
Joe Ellis: This is a complicated question. I think personalization today does deliver what people want. Right? They can always deliver videos that will keep people watching, and keep them on site. In an ad-driven world, that’s exactly what media publishers want, that’s exactly what all these folks that we work with want. It makes a ton of sense. The question is, after you’ve watched these ten or twenty minutes of videos, do you think that you got real value from those videos, and did you care about, or will you remember, that watching experience? I think that’s an open question.
Joe Ellis: I was at South by Southwest this year, and heard Evan Williams talk about Medium, and that’s one of the things that he spoke about a lot when they were moving to a subscription model. He said, we started by basically marketing the articles that had the most engagement, to show people things that we thought would spur subscriptions. A lot of people are watching this, or a lot of people are actually logging on to this piece of text, or reading this article. This is probably going to be something that’s going to spur subscriptions. They found that wasn’t the case. What they did was hire a bunch of editors in specific verticals. The editors trawled through a bunch of different articles, found the ones they thought were the best, and then those were the ones that they marketed out, whether or not they had a bunch of views, and that actually did spur subscriptions.
Joe Ellis: It’s like there’s a bit of a dichotomy there. We’re spending time online because these personalization algorithms are tailored to feed us stuff that will keep us in front of them. After that, we don’t really associate real value with that. We wouldn’t just decide, okay, I would pay for this piece of content. I think that’s the interesting dichotomy, and out at South by Southwest there was a lot of discussion around that. We’re thinking about that. I think search is a way that we can kind of shift that paradigm and help move us forward in that space.
Ryan Lawler: We’ve talked a lot about what Vidrovr does, but taking a step back, what are some of the bigger trends that are converging at this point in the video world?
Joe Ellis: One of the reasons I started the company, and what I thought was really interesting, especially in the 2016 election, is that we are getting more information from video than we ever have before. Right? Before, a lot of this was text, or based on newspapers, blah, blah, blah. Now, as we see video as one of the most informative sources, where people are actually understanding the world today through video, it’s becoming really important that people can find what they’re looking for in that video content. I think today, most of the video that people watch online is recommended to them via some type of news feed or personalization algorithm. What we really care about at Vidrovr, and what I think is really important, is shifting that paradigm to the point where people can actually do informative search queries to find that granular clip of video that’s going to answer what they’re looking for.
Ryan Lawler: Do you think that people will increasingly search for video? Or, do you think that this fits into the back end of how video is curated for them because you now can sort of segment and collect more data about what’s in that video?
Joe Ellis: It’s a great question. I think when you talk to media companies today, lots of people have punted on the video search bar on their websites. It’s really hard to do, so people don’t use it because it doesn’t work. I think our goal is to shift that paradigm. I will say that we’ve had a lot of buy-in from an editor or producer perspective, so that they can find the video content that’s useful, because that’s their job. Right? They’ll spend all of that time actually finding the right video clip. Whereas, the user wants that presented to them really quickly and seamlessly. That’s not been possible before, but that’s where we’re going.
Ryan Lawler: Got it. What about the democratization of creating and cutting together video? Are there lots of people out there that are trying to take video and splice it together? Is this going to make it easier for your amateur and semiprofessional creator to do their work better?
Joe Ellis: It’s a huge workflow problem. Even with the broadcasters we work with, when television channels are sent to our platform, typically they have tons of people watching those channels and using a platform like maybe SnappyTV, or something like that, to do the actual clipping itself. With our algorithms, because we’re able to identify frame start and end points and provide all this detailed metadata, we can actually set those up for all of those editors, just boom, boom, boom. Clip them out really quickly and seamlessly. It becomes much less of a manual process, and ultimately what’s going on there is somewhat low level, and we want people doing higher level tasks. More intelligent curation, and things like that. We want to open up the workforce within these journalistic institutions to be able to do that with our platform.
Ryan Lawler: What’s one controversial opinion that you have that’s really strongly held?
Joe Ellis: Come back to me, I’ll give it to you at the end. I was thinking…
Jacob Loewenstein: Who killed Kennedy?
Ryan Lawler: This is the end.
Joe Ellis: I honestly don’t know. I don’t like chocolate, that’s the most controversial opinion I could’ve had.
Ryan Lawler: Awesome. Well thanks for joining us, and good luck with what you’re building.
Joe Ellis: Awesome. Thank you very much for the time, guys. I really appreciate you having me.