Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, where real-world reasoning requires moving beyond object recognition and short-clip analysis. Real-world reasoning increasingly involves analyzing long-form video content, where context spans minutes or hours, far beyond the










