Imagine a future where AI lives in the physical world, continuously processing visual information and holding exactly the context it needs to solve problems. Humans do this constantly: we take in visual input over time, recognize patterns, recall key moments, and build understanding from hours of experience. AI should be able to understand long-context visual information the same way.
One way to test “long-context” visual understanding is through hour-long videos, and numerous benchmarks now evaluate AI models on hour-long and multi-hour footage. Current AI models, however, have a fixed context length: a single, highly compressed one-hour video occupies roughly 1 million tokens, which makes reasoning over long visual contexts difficult. Even with fully linearized attention, the visual data we generate daily would exceed what any model can process in a single pass.
We need a fundamentally different approach to understand and process multi-hour videos.
Prior work such as Deep Video Discovery has explored agentic systems for long-video understanding; we build on these ideas and push them further with a third-party critic module. While reasoning agents are powerful, they remain prone to alignment errors, lossy image-to-text translation, and hallucinations, and in our experiments they proved especially unreliable on subjective visual data. To address these issues, we introduce a third-party critic agent that evaluates agent outputs, identifies discrepancies, and prompts re-evaluation. Following Deep Video Discovery, we propose an agentic method for understanding long videos that addresses each of these components using fully open-source models. Our approach builds on the ReAct framework, which interleaves reasoning with tool use: the LLM can both perform step-by-step reasoning and invoke external resources, such as searching a captions database and communicating with other multimodal LLMs.
Solving long-context video understanding requires three key parts: a compact, searchable representation of the video; an agentic reasoning loop that retrieves relevant moments and queries a VLM for visual detail; and a critic that checks answers against the visual evidence.
We first take in a video input, extract frames at 1 fps, and store them in a frame database.
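A minimal sketch of this step, assuming OpenCV for decoding (the file layout and function name are ours, not necessarily the pipeline's exact implementation):

```python
# Sketch of frame extraction at 1 fps into a frame database on disk.
import cv2
import os

def extract_frames(video_path: str, out_dir: str, fps_target: float = 1.0) -> list[str]:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps_target)), 1)  # keep roughly 1 frame per second

    saved, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            timestamp_s = idx / native_fps
            path = os.path.join(out_dir, f"frame_{timestamp_s:08.1f}s.jpg")
            cv2.imwrite(path, frame)
            saved.append(path)
        idx += 1
    cap.release()
    return saved
```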
For each frame, we use a VLM to generate a detailed description and build a captions database of frame-level captions, each stored with its timestamp.
Most queries are framed around some “signal” (subject, scene, location), so we try to capture any referenced “signal” in our caption representation, drawing inspiration from named entity recognition. Once these captions are parsed, we chunk them and pass them through an LLM one more time to generate chunk-level summaries and a global summary of the entire video.
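A hypothetical sketch of this multi-granularity structure (the field names are illustrative, not the exact schema used):

```python
from dataclasses import dataclass

@dataclass
class FrameCaption:
    timestamp_s: float      # frame time at 1 fps sampling
    caption: str            # detailed VLM description of the frame
    subjects: list[str]     # "signal" entities: people, objects
    scene: str              # scene / location description

@dataclass
class CaptionDatabase:
    frames: list[FrameCaption]      # finest granularity
    chunk_summaries: list[str]      # LLM summaries over chunks of captions
    global_summary: str             # one summary of the whole video
```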
Once we have our caption database, we embed it using an open-source token embedder, so we can semantically match captions with queries.
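A sketch of the embedding step, assuming the sentence-transformers library with an open model as a stand-in for whichever embedder is actually used (it builds on the CaptionDatabase sketch above):

```python
# Sketch of embedding the captions for semantic search.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in model

captions = [f.caption for f in caption_db.frames]        # caption_db from the sketch above
timestamps = [f.timestamp_s for f in caption_db.frames]

# Normalize so a dot product at query time equals cosine similarity.
caption_embeddings = np.asarray(
    embedder.encode(captions, normalize_embeddings=True)
)
```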
This representation is compact, searchable, and reusable across every question asked about the video.
Next, we construct a multi-turn pathway between a reasoning LLM and a VLM. An LLM receives a question query from a user and can choose between three actions:
Action 1: Semantic Caption Search
The LLM generates a short “search query” from the user query and runs a retrieval algorithm over the image captions in the database. For example:
Here, our retrieval algorithm performs a cosine-similarity search of embeddings and returns the top k = 40 caption-similarity scores with timestamps, and the LLM reads and clusters relevant timestamps.
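A sketch of this retrieval step, continuing from the embedding sketch above (the gap-based clustering of timestamps is our simplification of the agent's grouping):

```python
# Sketch of the semantic caption search tool (k = 40 as described above).
def search_captions(query: str, k: int = 40, gap_s: float = 30.0):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = caption_embeddings @ q                     # cosine similarity
    top = np.argsort(-scores)[:k]
    hits = sorted((timestamps[i], float(scores[i]), captions[i]) for i in top)

    # Group hits whose timestamps are close together so the LLM sees clusters.
    clusters, current = [], [hits[0]]
    for h in hits[1:]:
        if h[0] - current[-1][0] <= gap_s:
            current.append(h)
        else:
            clusters.append(current)
            current = [h]
    clusters.append(current)
    return clusters
```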
Here are the top 7 captions returned. The agent is asked to consider both similarity score and frequency of clustered frames when deciding which frames to attach to the VLM query:
Action 2: Query the VLM with Chosen Frames
Following the previous example, the LLM may have pinpointed relevant frames from captions regarding when a council appears deep in thought, but needs more detail. It can query a VLM with a question of choice:
The VLM returns a chain of thought with a response to the LLM’s prompt. This chain of thought gives the LLM a more complete picture of which components of frames are relevant and allows for further detailed prompting. It also provides a simple summary that the user can follow to understand the components that give rise to the answer. For each VLM call, we also attach the global summary to provide required context.
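A sketch of such a VLM call, assuming the VLM is served behind an OpenAI-compatible endpoint (the URL, model id, and the encode_image helper are assumptions):

```python
# Sketch of Action 2: querying the VLM with chosen frames plus the global summary.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask_vlm(question: str, frame_paths: list[str], global_summary: str) -> str:
    content = [{"type": "text",
                "text": f"Video summary for context:\n{global_summary}\n\n"
                        f"Question: {question}\n"
                        "Think step by step about what is visible, then answer."}]
    for p in frame_paths:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"}})
    resp = client.chat.completions.create(
        model="meta-llama/Llama-4-Maverick",   # placeholder model id
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```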
Action 3: Decide on a Final Answer
Depending on how much information it has received, the LLM can then choose to perform more caption search queries, query the VLM with a different question prompt or different frames, or decide on a final answer.
We limit the total number of LLM actions to 10 before the LLM is forced to decide on a final answer.
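Putting the three actions together, the control loop can be sketched as follows (the action names and the llm_* helpers are simplified assumptions about the agent's prompt format):

```python
# Simplified sketch of the agent loop: up to 10 actions, then a forced answer.
MAX_ACTIONS = 10

def answer_question(question: str) -> str:
    scratchpad = []                                        # accumulated actions and observations
    for _ in range(MAX_ACTIONS):
        action = llm_choose_action(question, scratchpad)   # assumed helper wrapping the LLM
        if action.kind == "search_captions":
            obs = search_captions(action.query)
        elif action.kind == "ask_vlm":
            obs = ask_vlm(action.query, action.frames, global_summary)
        elif action.kind == "final_answer":
            return action.answer
        scratchpad.append((action, obs))
    # Out of budget: force the LLM to commit to an answer from what it has seen.
    return llm_final_answer(question, scratchpad)          # assumed helper
```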
We notice that when the LLM queries the VLM for specific visual information, the VLM has a tendency to “hallucinate” or exaggerate details of subjective visual data in order to produce an answer that fits the LLM’s query.
Similarly, the LLM may make certain assumptions that aren’t explicitly seen from the frames. An example following the question we’ve been exploring:
The question asked “what kind of person comes into the meeting room,” yet the LLM receives information describing the people seated at the table and assumes one of them was the person who entered the room. It is easy for the LLM to lose small details of the question or to read unstated information into structured VLM responses.
To remedy this, we run a second pass through a critic model, which takes in the original question, the global summary, the LLM’s reasoning, and the relevant frames chosen by the LLM.
The critic agent also has access to a critic VLM and analyzes the LLM’s reasoning + evidence in relevant frames as a sanity check. An example pass through the critic VLM:
The critic analyzes the visual data alongside the accompanying reasoning and produces a confidence score and suggestions for the acting LLM.
If the confidence score is below a threshold T = 70%, the critic’s reasoning is passed back to the acting LLM, and the LLM follows its suggestions for re-evaluating, utilizing the same three tools as earlier.
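As a sketch, the gating logic looks like this (the 0.7 threshold mirrors T = 70% above; the critic_vlm_review and answer_question_with_feedback helpers are assumptions):

```python
# Sketch of the critic pass: score the draft answer against the evidence frames,
# and trigger re-evaluation when confidence falls below the threshold.
CONFIDENCE_THRESHOLD = 0.70   # T = 70%

def critic_gate(question, global_summary, reasoning, evidence_frames, draft_answer):
    review = critic_vlm_review(                 # assumed helper: critic LLM + critic VLM
        question=question,
        summary=global_summary,
        reasoning=reasoning,
        frames=evidence_frames,
        answer=draft_answer,
    )
    if review.confidence >= CONFIDENCE_THRESHOLD:
        return draft_answer
    # Low confidence: hand the critic's suggestions back to the acting LLM,
    # which re-enters the loop with the same three tools.
    return answer_question_with_feedback(question, review.suggestions)  # assumed helper
```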
An example of the entire pipeline can be seen here:
To see the full walkthroughs of questions, please visit our interactive demo.
Our agentic video pipeline, built on the open-source models DeepSeek V3.1 and Meta’s Llama-4-Maverick VLM, achieves state-of-the-art performance among open-source models on the LVBench dataset:
On a random video from the HourVideo dataset, the pipeline correctly identifies frames annotated as relevant by human reviewers (a check that the retrieval algorithm surfaces frames relevant to each question) 68.19% of the time, showing the strength of our representation. Performance also beats prior Socratic models and is nearly on par with closed-source models when evaluated on HourVideo’s 50-video development set.
LVBench
| Model | Overall |
|---|---|
| Deep Video Discovery | 74.2% |
| Our Pipeline (Critic Pass) | 65.2% |
| Our Pipeline (No Critic) | 60.2% |
| Seed 1.5-VL-Thinking | 64.6% |
| AdaReTaKe | 53.3% |
| GPT-4o-2024-11-20 | 48.9% |
| InternVL2.5-78B | 43.6% |
| mPLUG-Owl3 | 43.5% |
Open-source models are highlighted in blue in the original table. Results taken from the LVBench leaderboard.
HourVideo
| Model | Overall |
|---|---|
| Gemini 1.5 Pro | 37.3% |
| Our Pipeline (Critic Pass) | 31.4% |
| LLaVA-34B-DPO (Socratic) | 22.3% |
| GPT-4 (Socratic) | 25.7% |
Evaluation on HourVideo Dev set. Results taken from HourVideo paper.
| Num Tokens (AVG) | VLM Input | VLM Output | LLM Input | LLM Output | Total Input | Total Output |
|---|---|---|---|---|---|---|
| One Question (With Critic Pass) | 7249 | 812 | 3186 | 1852 | 10435 | 3998 |
| Critic Pass | 751 | 342 | 500 | 233 | 1251 | 575 |
Context: A normal hour-long video compresses to about 1,000,000 tokens. If you naively pass the entire video for every question, you spend ~1M input tokens per question, which is expensive and slow, especially in streaming settings.
With an offline representation that can be re-queried, we can save token costs through amortization. We precompute multi-granularity captions once, then retrieve only relevant information and context at query-time.
With the critic enabled, a typical Q&A cycle costs roughly 10.4k input tokens and 4.0k output tokens (see the token table above).
Across 16 questions, we spend roughly 167k input tokens and 64k output tokens, about 231k tokens in total.
This is approximately 23% of a single 1M-token hour in total tokens, and roughly 96× fewer input tokens than naively passing the full video for each of the 16 questions (1M input tokens × 16).
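The arithmetic, using the per-question averages from the token table:

```python
# Back-of-the-envelope check using the per-question averages from the token table.
per_q_in, per_q_out = 10_435, 3_998
questions = 16

total_in = per_q_in * questions        # 166,960 input tokens
total_out = per_q_out * questions      # 63,968 output tokens
total = total_in + total_out           # 230,928 tokens

naive_in = 1_000_000 * questions       # 16M input tokens for a naive per-question pass
print(total / 1_000_000)               # ~0.23 -> ~23% of a 1M-token hour
print(naive_in / total_in)             # ~96x fewer input tokens
```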
The amortization effect: the heavy lifting (captioning and embedding) is done once and reused, so the incremental cost of each new question stays near ~10k input + ~4k output tokens, without exceeding context windows for long videos. We still account for the initial cost of building the video representation.
Critic trade-off: the critic adds ~12.7% more tokens per question to flag low-confidence answers for re-evaluation and yields a ~4.98% absolute accuracy gain. Because it can be run conditionally on confidence, the accuracy-cost trade-off is tunable.
In a streaming setting, instead of passing the entire video and its history through a model whenever new context arrives, we can cache the captions database and append to it as new frames come in, saving token costs.
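A sketch of that incremental update, reusing the embedder from earlier (caption_frame is an assumed stand-in for the VLM captioning call):

```python
# Sketch of incremental caption caching in a streaming setting.
import numpy as np

def ingest_new_frames(new_frames, caption_db, caption_embeddings):
    """Append captions and embeddings for newly arrived frames; old entries are untouched."""
    new_captions = [caption_frame(f) for f in new_frames]          # assumed VLM helper
    new_vecs = embedder.encode([c.caption for c in new_captions],
                               normalize_embeddings=True)
    caption_db.frames.extend(new_captions)
    caption_embeddings = np.vstack([caption_embeddings, new_vecs])
    return caption_db, caption_embeddings
```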
The approach inherits limitations from captioning-based retrieval: captions are a lossy image-to-text translation, sampling at 1 fps can miss fast or fine-grained visual details, and the VLM can still hallucinate on subjective visual content.
Despite these constraints, hierarchical retrieval from the frame level to the global summary level, combined with a Reason → Act → Critique → React loop, provides a useful balance. The method restricts the LLM’s working set to relevant evidence, defers to vision models when necessary, and uses a critic to flag inconsistencies and prompt re-evaluation.
Future directions include:
Long-context video understanding is fundamentally a systems integration problem requiring compact multimodal representations, targeted retrieval, and a strong reasoning cycle. The proposed agent combines hierarchical captions, semantic search, VLM-based image understanding, and a critic to improve robustness.
Built entirely on open-source models, this pipeline achieves state-of-the-art accuracy among open-source approaches on LVBench at 65.19%, with an approximately 5% absolute accuracy improvement for approximately 13% additional tokens per question (not including re-evaluation) from the critic model.
The future of AI isn’t necessarily about bigger context windows. It’s also about smarter systems that know what information to search for and how to reason. Just like humans don’t remember every second of a movie but can recall key moments when asked, AI should work the same way.