Investing in Twelve Labs: Building the most powerful video-understanding infrastructure

By Avi Bharadwaj

Videos have seamlessly woven themselves into the fabric of our daily existence, captivating our focus and sparking creativity in unprecedented ways. The number of videos being created is staggering: on YouTube alone, 500 hours of video are uploaded every minute. That's 30,000 hours of video uploaded every hour, and 720,000 hours every day. Overall, 328 terabytes of data are created each day, with roughly half of that estimated to be video. This explosion of video data makes it unwieldy for users to work with or extract insights from. For instance, it can be impossible for YouTube creators today to reduce hundreds of hours of filmed content to minute-sized clips that perform well on their channels.

Historically, these problems were solved using video perception, which involves understanding low-level features of videos such as color, texture, or motion. Transcribing audio within videos and applying text-based approaches were also commonly used for perception tasks. However, video perception approaches are hard to adapt when new classes or labels are introduced, and because perception models are often trained on a specific dataset, their performance may suffer on out-of-distribution video data. Newer video understanding techniques, by contrast, involve higher-level processing of video data, such as recognizing objects, actions, or events in the video. These tasks demand advanced models capable of grasping contextual details and the temporal connections between frames. A critical underlying technical component of video understanding is the video embedding: the representation of a video in a lower-dimensional vector space, where each video is encoded as a numerical vector. These embeddings capture the semantic meaning and visual features of videos, allowing foundation models to understand their content. Multimodal embeddings combine different modalities within videos to create a comprehensive representation.
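To make the embedding idea concrete, here is a minimal, self-contained sketch of semantic video search over embeddings. The toy vectors, clip names, and the `cosine_similarity` helper are illustrative assumptions, not Twelve Labs' actual model or API; real embeddings would come from a trained foundation model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings (illustrative only).
query_embedding = [0.9, 0.1, 0.0, 0.2]  # e.g. an embedded text query
video_embeddings = {
    "clip_a": [0.88, 0.15, 0.05, 0.25],  # semantically close to the query
    "clip_b": [0.05, 0.90, 0.80, 0.10],  # unrelated content
}

# Retrieve the video whose embedding lies closest to the query's.
best = max(
    video_embeddings,
    key=lambda k: cosine_similarity(query_embedding, video_embeddings[k]),
)
print(best)  # clip_a scores highest
```

Because both text and video are mapped into the same vector space, "search" reduces to a nearest-neighbor lookup; at production scale this is typically backed by an approximate nearest-neighbor index rather than a linear scan.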

Twelve Labs has pioneered new foundation model-based approaches for video understanding. Unlike previous generations of companies that treated video understanding as an image or speech problem, Twelve Labs is building its platform with a video-first approach. Having collected hundreds of millions of unique video-text pairs (one of the largest such datasets for video-language foundation model training), Twelve Labs enables its users to build products that offer video-based search, classification, clustering, summarization, and Q&A. Twelve Labs consistently outperforms alternative approaches across dimensions such as correctness of information, contextual understanding, and detail orientation, as well as on quantitative benchmarks such as precision, recall, and F1 score. For instance, Twelve Labs' recently released Pegasus-1 model significantly outperforms state-of-the-art models and alternative approaches. Customers of Twelve Labs' APIs build use cases such as video archival search, enterprise video search, automated surveillance and security, sports analysis, content moderation, automated video editing, contextual advertising, interactive media, and customer support, among many others. Take the YouTube creator problem I mentioned earlier. With Twelve Labs, creators can use text-based search to pinpoint specific segments, upload historical segments that performed well, and identify similar segments in the unedited footage. These tasks would have been impractical or highly restrictive without Twelve Labs. The company currently works with top YouTube creators on this problem, vastly improving their ideation and post-production workflows.
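The creator workflow described above, surfacing raw footage that resembles historically well-performing clips, can be sketched as a ranking over embeddings. Everything here is a hypothetical illustration: the `rank_segments` function, the segment names, and the toy vectors are assumptions, not Twelve Labs' actual API, and real embeddings would be produced by the video model itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    )

def rank_segments(candidates, references):
    """Score each unedited segment by its best match against any
    historically well-performing reference clip; highest score first."""
    scored = [
        (seg_id, max(cosine(emb, ref) for ref in references))
        for seg_id, emb in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings standing in for model-generated video embeddings.
references = [[0.9, 0.1, 0.1], [0.7, 0.3, 0.0]]  # past high-performers
candidates = {
    "raw_014": [0.85, 0.20, 0.10],  # resembles a past hit
    "raw_231": [0.00, 0.10, 0.95],  # does not
}

ranked = rank_segments(candidates, references)
print(ranked[0][0])  # raw_014 ranks first
```

Scoring each candidate by its maximum similarity to any reference (rather than the average) rewards footage that strongly matches even one proven clip, which mirrors how an editor hunts for "another moment like that one."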

When we met Jae and the team at Twelve Labs, we were incredibly impressed by the magnitude of the problem they were tackling, their technical prowess, and the breadth of customer use cases they were able to support in a short span of time. We're excited to announce our investment in Twelve Labs and look forward to partnering with the team as they build a generational video understanding company.