
Long-Context Video Understanding: An Agentic Approach


Imagine a future where AI lives in the physical world, continuously processing visual information and carrying exactly the context it needs to solve problems. Humans do this naturally: we take in visual information over time, recognize patterns, recall key moments, and build understanding from hours of experience. AI should be able to understand long-context visual information the same way.

The Challenge

One way to test “long-context” visual understanding is with hour-long videos, and numerous benchmarks now evaluate AI models on hour-long and multi-hour footage. Current AI models, however, have a fixed context length: even a highly compressed one-hour video occupies around 1 million tokens, making it difficult to reason over long visual contexts. Even with fully linearized attention, the visual data we generate daily would exceed what any model can process in a single pass.

We need a fundamentally different approach to understand and process multi-hour videos.

Prior work such as Deep Video Discovery has explored agentic systems for long-video understanding. Following that line of work, we propose an agentic method for understanding long videos, built entirely on open-source models, that addresses each of the components outlined below. Our approach uses the ReAct framework, which integrates reasoning with tool use: an LLM performs step-by-step reasoning while invoking external resources, such as searching a captions database and communicating with other multimodal LLMs. Reasoning agents are powerful, but they remain prone to alignment errors, lossy image-to-text translation, and hallucinations; in our experiments we found them especially unreliable on subjective visual data. To address this, we push the agentic approach further with a third-party critic agent that evaluates the acting agent’s outputs, identifies discrepancies, and prompts re-evaluation.

Our Solution

Solving long-context video understanding requires three key parts:

  1. An efficient offline/streaming video representation
  2. Smart retrieval of relevant moments
  3. Strong reasoning over visual and temporal information

Step 1: Building an Efficient Offline Video Representation

We first take in a video input, extract frames at 1 fps, and store them in a frame database.

For each frame, we use an LLM to generate a detailed description, building a captions database of per-frame descriptions paired with their timestamps.

Most queries are framed around some “signal” (subject, scene, location), and we try to capture any referenced “signal” in our caption representation, drawing inspiration from named entity recognition. Once these captions are parsed, we chunk them and pass them through an LLM one more time to generate higher-level chunk summaries and a global summary of the video.

Once we have our caption database, we embed it using an open-source token embedder, so we can semantically match captions with queries.

Fig 1. Frame and Caption Database Creation

This representation is compact, hierarchical (frame-level captions up to chunk and global summaries), and reusable: it is computed once offline and can be re-queried for every subsequent question.
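To make the indexing pass concrete, here is a minimal sketch under stated assumptions: `caption_frame`, `summarize_chunk`, and `embed_text` are hypothetical wrappers around the captioning model and the open-source embedder, and the chunk size of 60 captions is illustrative rather than the exact value we use.

```python
# Illustrative sketch of the offline indexing pass (model wrappers are assumed, not our exact code).
import cv2
import numpy as np

def build_video_index(video_path, caption_frame, summarize_chunk, embed_text, chunk_size=60):
    """Extract frames at 1 fps, caption each frame, build chunk/global summaries, embed captions."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames, captions = [], []

    t = 0
    while True:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(t * fps))   # seek to second t (1 fps sampling)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append((t, frame))                         # in practice, written to a frame database
        captions.append({"t": t, "text": caption_frame(frame)})  # detailed per-frame description
        t += 1
    cap.release()

    # Higher-granularity passes: chunk the captions, summarize each chunk, then the whole video.
    chunks = [captions[i:i + chunk_size] for i in range(0, len(captions), chunk_size)]
    chunk_summaries = [summarize_chunk([c["text"] for c in ch]) for ch in chunks]
    global_summary = summarize_chunk(chunk_summaries)

    # Embed every caption so queries can be matched against them semantically at question time.
    embeddings = np.stack([embed_text(c["text"]) for c in captions])
    return {"frames": frames, "captions": captions, "embeddings": embeddings,
            "chunk_summaries": chunk_summaries, "global_summary": global_summary}
```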

Step 2: Smart Retrieval of Relevant Moments

Next, we construct a multi-turn pathway between a reasoning LLM and a VLM. The LLM receives a question from the user and can choose among three actions:

Action 1: Semantic Caption Search

The LLM generates a short “search query” from the user’s question and runs a retrieval algorithm that matches it against the image captions in the database. For example:

Question + Caption Example

Here, our retrieval algorithm performs a cosine-similarity search over the embeddings and returns the top k = 40 captions with their similarity scores and timestamps; the LLM then reads the results and clusters the relevant timestamps.

Here are the top 7 captions returned. The agent is asked to consider both similarity score and frequency of clustered frames when deciding which frames to attach to the VLM query:

Caption Search Results (top 7 captions)
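A minimal sketch of that retrieval step, reusing the index layout and the hypothetical `embed_text` helper from the indexing sketch above:

```python
import numpy as np

def caption_search(index, search_query, embed_text, k=40):
    """Cosine-similarity search over caption embeddings; returns top-k captions with timestamps."""
    q = embed_text(search_query)
    E = index["embeddings"]                                            # shape: (num_frames, dim)
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [{"t": index["captions"][i]["t"],
             "caption": index["captions"][i]["text"],
             "score": float(sims[i])} for i in top]
```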

Action 2: Query the VLM with Chosen Frames

Following the previous example, the LLM may have pinpointed relevant frames from captions describing when a council appears deep in thought, but it needs more detail. It can query the VLM with a question of its choice:

VLM Query Example

The VLM returns a chain of thought along with a response to the LLM’s prompt. This chain of thought gives the LLM a more complete picture of which components of the frames are relevant and allows for further, more detailed prompting. It also provides a simple summary that the user can follow to understand how the answer was reached. For each VLM call, we also attach the global summary to provide the required context.

Action 3: Decide on a Final Answer

Depending on how much information it has received, the LLM can then choose to perform more caption search queries, query the VLM with a different question prompt or different frames, or decide on a final answer.

We limit the total number of LLM actions to 10 before the LLM is forced to decide on a final answer.
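Putting the three actions together, a condensed sketch of the agent loop looks roughly like this; `reasoning_llm` and `query_vlm` are hypothetical wrappers, the action format is illustrative, and `caption_search` is the retrieval sketch from earlier.

```python
MAX_ACTIONS = 10  # the LLM must commit to an answer within this budget

def answer_question(question, index, reasoning_llm, query_vlm, embed_text):
    """Let the reasoning LLM iterate over its three actions until it decides on a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(MAX_ACTIONS):
        action = reasoning_llm(history)  # assumed to return a dict describing the chosen action
        if action["type"] == "caption_search":
            results = caption_search(index, action["query"], embed_text)
            history.append({"role": "tool", "content": results})
        elif action["type"] == "query_vlm":
            # Every VLM call also carries the global summary for context.
            reply = query_vlm(action["frames"], action["prompt"], context=index["global_summary"])
            history.append({"role": "tool", "content": reply})
        else:  # "final_answer"
            return action["answer"], history
    # Budget exhausted: force a final answer from the evidence gathered so far.
    forced = reasoning_llm(history + [{"role": "user", "content": "Decide on a final answer now."}])
    return forced.get("answer", forced), history
```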

Step 3: The Critic Agent

We notice that when the LLM queries the VLM for specific visual information, the VLM has a tendency to “hallucinate” or exaggerate details of subjective visual data to force an answer that fits the LLM’s query.

Similarly, the LLM may make certain assumptions that aren’t explicitly seen from the frames. An example following the question we’ve been exploring:

Incorrect Reasoning Path

The question asked “what kind of person comes into the meeting room,” yet the LLM receives information describing the people seated at the table and assumes one of them was the person who entered the room. It’s easy for the LLM to lose small details of the question or to infer information that the structured VLM responses never stated.

To remedy this, we run a second pass through a critic model, which takes in the original question, the global summary, the LLM’s reasoning, and the relevant frames chosen by the LLM.

The critic agent also has access to its own critic VLM and analyzes the LLM’s reasoning together with the evidence in the relevant frames as a sanity check. An example pass through the critic VLM:

Critic Response

The critic analyzes the visual data alongside the accompanying reasoning and produces a confidence score and suggestions for re-evaluation.

If the confidence score falls below a threshold of T = 70%, the critic’s reasoning is passed back to the acting LLM, which follows the suggestions and re-evaluates using the same three tools as before.
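A sketch of the critic pass under the same assumptions, with hypothetical `critic_llm` and `critic_vlm` wrappers and the T = 70% threshold above:

```python
CONFIDENCE_THRESHOLD = 0.70  # T = 70%

def critic_pass(question, index, agent_reasoning, chosen_frames, critic_llm, critic_vlm):
    """Sanity-check the acting LLM's reasoning against the visual evidence it relied on."""
    # The critic VLM looks at the evidence frames independently of the acting LLM.
    visual_check = critic_vlm(
        chosen_frames,
        prompt="Does the visual evidence in these frames support this reasoning?\n" + agent_reasoning,
    )
    # The critic LLM weighs the question, global summary, agent reasoning, and the visual check.
    return critic_llm({
        "question": question,
        "global_summary": index["global_summary"],
        "agent_reasoning": agent_reasoning,
        "visual_check": visual_check,
    })  # assumed to return {"confidence": float, "suggestions": str}

def needs_reevaluation(verdict, history):
    """If confidence is low, hand the critic's suggestions back to the acting LLM."""
    if verdict["confidence"] < CONFIDENCE_THRESHOLD:
        history.append({"role": "user",
                        "content": "A critic flagged low confidence: " + verdict["suggestions"]
                                   + " Re-evaluate using the same three tools."})
        return True  # the caller re-enters the action loop above
    return False
```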

An example of the entire pipeline can be seen here:

Full Pipeline

To see the full walkthroughs of questions, please visit our interactive demo.

Results

Our agentic video pipeline, built on the open-source models DeepSeek V3.1 (reasoning LLM) and Meta’s Llama-4-Maverick (VLM), achieves state-of-the-art accuracy among open-source models on the LVBench dataset:

On a random video from the HourVideo dataset, the pipeline retrieves frames matching the human-annotated relevant frames 68.19% of the time (the annotations were reviewed to confirm that the retrieval algorithm surfaces frames relevant to the question), showing the strength of our representation. Performance also beats prior Socratic models and is nearly on par with closed-source models when tested on HourVideo’s 50-video Development Set.

LVBench

| Model | Overall |
| --- | --- |
| Deep Video Discovery | 74.2% |
| Our Pipeline (Critic Pass) | 65.2% |
| Our Pipeline (No Critic) | 60.2% |
| Seed 1.5-VL-Thinking | 64.6% |
| AdaReTaKe | 53.3% |
| GPT-4o-2024-11-20 | 48.9% |
| InternVL2.5-78B | 43.6% |
| mPLUG-Owl3 | 43.5% |

Blue text denotes open-source models. Results taken from the LVBench leaderboard.

HourVideo

| Model | Overall |
| --- | --- |
| Gemini 1.5 Pro | 37.3% |
| Our Pipeline (Critic Pass) | 31.4% |
| LLaVa-34B-DPO (Socratic) | 22.3% |
| GPT-4 (Socratic) | 25.7% |

Evaluation on the HourVideo Dev set. Results taken from the HourVideo paper.

Token and Cost Analysis

| Num Tokens (Avg) | VLM Input | VLM Output | LLM Input | LLM Output | Total Input | Total Output |
| --- | --- | --- | --- | --- | --- | --- |
| One Question (With Critic Pass) | 7249 | 812 | 3186 | 1852 | 10435 | 3998 |
| Critic Pass | 751 | 342 | 500 | 233 | 1251 | 575 |


Context: A normal hour-long video compresses to about 1,000,000 tokens. If you naively pass the entire video for every question, you spend ~1M input tokens per question, which is expensive and slow, especially in streaming settings.

With an offline representation that can be re-queried, we can save token costs through amortization. We precompute multi-granularity captions once, then retrieve only relevant information and context at query-time.

With the critic enabled, a typical Q&A cycle costs roughly 10.4k input tokens and 4k output tokens (see the token table above).

Across 16 questions, we spend roughly 167k input tokens and 64k output tokens, about 231k tokens in total.

This is approximately 23% of a single 1M-token hour, and the input cost is roughly 96× smaller than naively passing the full video for each of the 16 questions (1M tokens × 16 questions = 16M input tokens).
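The arithmetic behind those figures, using the per-question averages from the token table:

```python
# Per-question averages with the critic pass, taken from the token table above.
input_per_q, output_per_q = 10_435, 3_998
questions = 16

total_tokens = questions * (input_per_q + output_per_q)  # 230,928 tokens
fraction_of_hour = total_tokens / 1_000_000              # ~0.23 of a 1M-token hour

naive_input = questions * 1_000_000                      # 16M input tokens if the video is re-sent per question
savings = naive_input / (questions * input_per_q)        # ~96x fewer input tokens

print(f"{total_tokens:,} tokens total, {fraction_of_hour:.0%} of one hour, ~{savings:.0f}x input savings")
```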

The amortization effect: the heavy lifting (captioning and embedding) is done once and reused, so the incremental cost of each new question stays near ~10k input + ~4k output tokens without exceeding context windows for long videos. We still account for the initial cost of building the video representation.

Critic trade-off: the critic adds ~12.7% more tokens per question to identify low-confidence answers for re-evaluation and yields an ~4.98% absolute accuracy gain. When quality matters, the critic can be run conditionally on confidence for a customizable accuracy-cost trade-off.

In a streaming setting, instead of passing the entire video and its history through a model whenever new context arrives, we can cache the captions database and append to it as new frames come in, saving token costs.
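A sketch of that incremental update, reusing the index layout from the offline pass; refreshing the chunk and global summaries periodically rather than per frame is an assumption on our part.

```python
import numpy as np

def append_streaming_frame(index, t, frame, caption_frame, embed_text):
    """Extend the cached captions database as a new frame arrives, without reprocessing history."""
    caption = caption_frame(frame)
    index["frames"].append((t, frame))
    index["captions"].append({"t": t, "text": caption})
    index["embeddings"] = np.vstack([index["embeddings"], embed_text(caption)])
    # Chunk and global summaries can be refreshed on a schedule instead of on every frame.
    return index
```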

Limitations and Discussion

The approach inherits the limitations of captioning-based retrieval: the image-to-text step is lossy, so visual details that the per-frame captions fail to mention cannot be surfaced at query time, and caption errors or hallucinations propagate to the reasoning agent downstream.

Despite these constraints, hierarchical retrieval from the frame level to the global summary level, combined with a Reason → Act → Critique → React loop, provides a useful balance. The method restricts the LLM’s working set to relevant evidence, defers to vision models when necessary, and uses a critic to flag inconsistencies and prompt re-evaluation.

Future Work

Future directions include:

Conclusion

Long-context video understanding is fundamentally a systems integration problem requiring compact multimodal representations, targeted retrieval, and a strong reasoning cycle. The proposed agent combines hierarchical captions, semantic search, VLM-based image understanding, and a critic to improve robustness.

Built on open-source models, this pipeline yields state-of-the-art performance among open-source entries on LVBench at 65.19% accuracy; the critic model contributes an approximately 5% absolute accuracy improvement for approximately 13% additional tokens per question (not counting re-evaluation passes).

The future of AI isn’t necessarily about bigger context windows. It’s also about smarter systems that know what information to search for and how to reason over it. Just as humans don’t remember every second of a movie but can recall key moments when asked, AI should be able to do the same.