Twelve Labs Breaks New Ground With First-of-its-kind Video-to-text Generative APIs

Company opens beta to the public after highly successful private beta period

Twelve Labs adds $10 million strategic investment with participation from Intel Capital, NVentures, Samsung Next, and more for future of video understanding

San Francisco, Calif. – Oct. 24, 2023 – Twelve Labs, the video understanding company, today announced the debut of its multimodal technology along with the release of its public beta. Twelve Labs is the first in its industry to commercially release video-to-text generative APIs powered by its latest video-language foundation model, Pegasus-1. The model enables novel capabilities such as Summaries, Chapters, Video Titles, and Captioning from videos, even those without audio or text.

Across use cases, language models are generally trained to guess the most probable next word. This task alone has enabled new possibilities to emerge, including solving complex problems, effectively summarizing a 1,000-page document, passing the bar exam, and more. Taking language models’ capabilities to the next level, Twelve Labs enables models to map visual and audio content to language, increasing the range of advanced problems AI can be used to solve.
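To make the "most probable next word" idea concrete, here is a minimal sketch of a toy bigram predictor. It is an illustration only, not Twelve Labs' method or a description of Pegasus-1. The function names (`train_bigram_model`, `most_probable_next`) and the sample corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def most_probable_next(model, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; real language models train on vastly more text
# and use neural networks rather than raw counts.
corpus = "the cat sat on the mat . the cat chased the mouse ."
model = train_bigram_model(corpus)
print(most_probable_next(model, "the"))
```

In this toy corpus, "cat" follows "the" more often than any other word, so it is the prediction. Production language models apply the same principle at far larger scale, with learned probabilities instead of simple counts.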

The company has uniquely trained its multimodal AI model to solve complex video-language alignment problems. Twelve Labs’ proprietary model, evolved, tested, and refined for its public beta, leverages all the components present in videos, such as actions, objects, and background sounds, and learns to map human language to what’s happening inside a video.

Twelve Labs’ technology not only enables video to tell a holistic story; it also endows models with powerful capabilities so that users can find the best video to meet their needs, whether it’s pulling a highlight reel or generating a custom report. Twelve Labs users can now extract topics, as well as create summaries and chapters of video, leveraging multimodal data. Such features not only save users substantial amounts of time, but also help uncover new insights, suggest marketing content such as catchy headlines or SEO-friendly tags, and unlock new possibilities for video through simple-to-use APIs.

Strategic Investments Signal Future of Video Understanding

In addition to its latest advancements, Twelve Labs disclosed a $10 million strategic investment with participation from NVentures (NVIDIA’s venture capital arm), Intel Capital, Samsung Next, and others. Their investment in and alignment with the company will create novel opportunities and exciting product integrations.

Twelve Labs has built the go-to video understanding infrastructure for developers and enterprises that are innovating the video experience in their respective areas. It makes video just as easy to reference and as useful as text. In essence, Twelve Labs provides the video intelligence layer on top of which customers build their dream features. For the first time, organizations and developers can leverage AI to retrieve an exact moment within hundreds of thousands of hours of footage by describing the scene in text, and to generate relevant text from videos, be it titles, chapters, summaries, reports, or even tags, that incorporates both visual and audio content, just by prompting the model for it. With these capabilities, Twelve Labs pushes boundaries to provide a text-based interface that solves all video-related downstream tasks, ranging from low-level perception tasks to high-level video understanding.

Understanding Powers Growth

Over the course of its successful closed beta, in which more than 17,000 developers tested the platform, Twelve Labs worked to ensure a scalable, fast, and reliable experience and saw an exponential increase in use cases.

“It’s essential for our business to access exact moments, angles, or events within a game in order to package the best content to our fans, so we prioritize video search tools for our content creators,” said Brad Boim, Senior Director of Asset Management and Post-Production, NFL Media. “It’s exciting to see the shift from traditional video labeling and tagging towards contextual video search using natural language. The emergence of multi-modal AI and natural language search can be a game-changer in opening up access to a media library and surfacing the best content you have available.”

“The Twelve Labs team has consistently pushed the envelope and broken new ground in video understanding since our founding in 2021. Our latest features represent this tireless work,” said Jae Lee, co-founder and CEO of Twelve Labs. “Based on the remarkable feedback we have received, and the breadth of test cases we’ve seen, we are incredibly excited to welcome a broader audience to our platform so that anyone can use best-in-class AI to understand video content without manually watching thousands of hours to find what they are looking for. We believe this is the best, most efficient way to make use of video.”

To learn more about all that you can do with Twelve Labs, please visit, and to sign up for the Twelve Labs beta, go to

About Twelve Labs

Twelve Labs makes video instantly, intelligently searchable and understandable. Twelve Labs’ state-of-the-art video understanding technology enables the accurate and timely discovery of valuable moments within an organization’s vast sea of videos so that users can do and learn more. The company is backed by leading venture capitalists, technology companies, AI luminaries, and successful founders. It is headquartered in San Francisco, with an APAC office in Seoul. Learn more at