The steps of the pipeline are detailed below.
"can't close my mouth"
The caption index returned many captions with high similarity scores. The strongest match was the caption for frame 1672, which described a setting with chairs and framed pictures reminiscent of a restaurant. Other captions (not listed here) clustered around frame 1672, leading the LLM to search a window of frames centered on it.
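One way to turn clustered caption hits into a frame window is sketched below. This is a minimal illustration, not the pipeline's actual logic: in the trace the LLM itself chooses the window, and the `CaptionHit` type, the `window_around_best` helper, and the 25-frame radius are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaptionHit:
    frame: int    # frame index the caption describes
    score: float  # similarity between the caption and the query


def window_around_best(hits, radius=25, min_frame=0):
    """Return a (start, end) frame window centered on the best-supported hit.

    Hypothetical heuristic: take the highest-scoring hit, keep the hits
    within `radius` frames of it, and center the window on their
    score-weighted mean frame.
    """
    if not hits:
        raise ValueError("no caption hits to center on")
    best = max(hits, key=lambda h: h.score)
    nearby = [h for h in hits if abs(h.frame - best.frame) <= radius]
    total = sum(h.score for h in nearby)
    if total > 0:
        center = round(sum(h.frame * h.score for h in nearby) / total)
    else:
        # Degenerate scores: fall back to the single best frame.
        center = best.frame
    return max(min_frame, center - radius), center + radius
```

Hits far from the strongest caption are ignored, so an isolated match elsewhere in the video does not pull the window away from the main cluster.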
Frames 1650-1700: Look at these consecutive frames from around the 28:00 timestamp. The vlogger is eating something and her mouth stays open. Describe what is happening: Is the food extremely hard, extremely tasty, or does she appear to have a jaw injury or any other reason why she cannot close her mouth? Provide visual clues that support the explanation:
VLM judgment: “Female vlogger, eyes squinting, mouth half-open in a disgusted expression while holding chopsticks near lips.” The negative sentiment aligns with the ‘weird taste’ hypothesis.
Frames 1665-1710: Examine these consecutive frames (around the 28:00 mark, i.e., ~1680 seconds) showing the vlogger eating. Describe the food she is eating, her facial expression, and any visible signs of discomfort, strain, or injury. Specifically note whether her mouth is open because the food is extremely tasty, extremely hard, or because she appears to have a jaw problem. Provide clear visual evidence that supports one of the answer choices (A: food is so delicious, B: food tastes weird, C: food is so hard, D: jaw dislocated).
VLM judgment: “Same subject grimacing, sticking tongue slightly out as if tasting something unpleasant.” Reinforces choice B.
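The prompts above treat the 28:00 mark as roughly 1680 seconds and inspect frames near index 1680, which is consistent with one sampled frame per second. The conversion can be sketched as follows; the `timestamp_to_frame` helper and the default rate of 1 frame per second are assumptions inferred from these numbers, not something the trace states.

```python
def timestamp_to_frame(timestamp: str, fps: float = 1.0) -> int:
    """Convert "MM:SS" or "HH:MM:SS" into a sampled-frame index.

    `fps` is the assumed sampling rate of the extracted frames
    (1 frame per second here), not the native video frame rate.
    """
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return round(seconds * fps)


print(timestamp_to_frame("28:00"))  # 28 * 60 = 1680
```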
Final answer: B — The food tastes weird.
Reasoning: Frame 1665 shows the vlogger biting a stick of fermented shark (hákarl). Her eyebrows are raised, eyes widened and her mouth is open in a clear reaction of surprise/disgust – a typical response to the strong, unusual taste of the food. There are no signs of strain, pain, or abnormal jaw positioning. Frame 1670, taken a few seconds later, shows her with a normal closed‑mouth smile, confirming that her jaw is fine and the open‑mouth pose was only a reaction to the taste. Therefore, the reason she cannot close her mouth at 28:00 is that the food tastes weird.
"council deep thought"
The query returned a few high-similarity captions around frames 390 and 415. Below is a spread of the top 7 captions for the query, along with their scores and frames. The model received the top k = 40 captions.
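Top-k retrieval over a caption index can be sketched as below, assuming each caption is stored with a precomputed embedding and ranked by cosine similarity to the query embedding. The index layout and the `top_k_captions` helper are hypothetical; the trace does not say which embedding model or similarity measure the pipeline uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_captions(query_vec, caption_index, k=40):
    """Return the k best (score, frame, caption) triples, highest score first.

    `caption_index` is assumed to be a list of (frame, caption, embedding)
    tuples; k = 40 mirrors the number of captions passed to the model.
    """
    scored = [
        (cosine(query_vec, emb), frame, text)
        for frame, text, emb in caption_index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```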
Prompt to VLM: “Describe the figures in frames 390-391. Pay attention to facial features (especially noses), head-wear colour, and any comedic attire that could signal a jester, and any visual cues of an entrance.”
Officials in medieval garb sit around a large table, poring over scrolls (frame 0390). Characters are animated, wearing large hats and colorful decorative clothing. Characters are seated around a table, indicative of a council meeting. A man in a large colorful hat sits at the head of the table.
The man in the large colorful hat is the jester who enters the doorway. His nose is long and he is wearing a large colorful hat.
The critic VLM flagged lack of evidence for a multicoloured hat and the frames show no evidence of an entrance, only a council meeting. Confidence 25%.
"Re-inspect frames 0390 and 0391. Decide whether the man who enters the council meeting is a jester, and if he wears a colorful hat. List concrete visual cues for your conclusion."
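The critic step above behaves like a confidence gate: a low-confidence judgment triggers a follow-up prompt. A minimal sketch of that control flow follows. The `describe` and `critique` callables stand in for the VLM and the critic, and the 0.8 threshold is a hypothetical value chosen only because the traces re-inspect at 25% and 65% confidence and accept at 88%.

```python
def answer_with_critic(describe, critique, prompts, threshold=0.8):
    """Try prompts in order until the critic is confident enough.

    describe(prompt) -> judgment text       (stand-in for the VLM)
    critique(judgment) -> float in [0, 1]   (stand-in for the critic)
    Returns the list of (prompt, judgment, confidence) attempts.
    """
    history = []
    for prompt in prompts:
        judgment = describe(prompt)
        confidence = critique(judgment)
        history.append((prompt, judgment, confidence))
        if confidence >= threshold:
            break  # evidence is strong enough, stop re-inspecting
    return history
```

If every attempt stays below the threshold, the caller can fall back to a new index search, as happens next in this trace.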
"Jester enters doorway"
Numerous similar captions were found around frame 0525.
Prompt to VLM: “Focus on the doorway character in frames 519-526. Describe nose length/shape, hat colour/pattern, and any props. Decide which single visual trait best differentiates him from other characters.”
VLM judgment: A man is coming out of a doorway. The scene is too bright to tell the colors of his hat, but he has a long, grotesque hooked nose.
Prompt to VLM: “Focus on the doorway character in frames 521-524. Describe nose length/shape, hat colour/pattern, and any props. Decide which single visual trait best differentiates him from other characters.”
VLM judgment: It is confirmed that the man has a long hooked nose. He is not wearing a hat and has white hair, and no other distinguishing features.
Final answer: B — A person with a particularly long nose.
Reasoning: Frame 0519 shows the jester entering the doorway. Frame 0521 shows that he has a long nose. The long nose is clearly visible, and there is no evidence of a multicoloured hat or a tall hat. Therefore the correct answer is B.
"blue clothes" AND "yellow clothes" — the model searched the subtitle index for co-mentions of these colour terms. It found matches around the 2057-2073 frame window. The top matches are shown below.
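An AND query of this kind can be sketched as a co-mention search: find frame spans in which every term appears in some subtitle within a short window. The `(frame, text)` subtitle layout, the `co_mention_windows` helper, and the 16-frame window are assumptions for illustration; the trace does not describe how the subtitle index is implemented.

```python
def co_mention_windows(subtitles, terms, window=16):
    """Return (start, end) frame spans where all terms are mentioned.

    `subtitles` is assumed to be a list of (frame, text) pairs. A span
    starts at a subtitle entry and covers every entry at most `window`
    frames later; it matches when each term occurs in at least one of
    those entries (case-insensitive substring match).
    """
    terms = [t.lower() for t in terms]
    entries = sorted(subtitles)
    spans = []
    for i, (start, _) in enumerate(entries):
        in_window = [
            (frame, text.lower())
            for frame, text in entries[i:]
            if frame - start <= window
        ]
        if all(any(t in text for _, text in in_window) for t in terms):
            spans.append((start, in_window[-1][0]))
    return spans
```

Matching each term against individual subtitle entries, rather than against the concatenated window text, avoids false hits formed across entry boundaries.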
Prompt to VLM: “For frames 2057 & 2073 describe the setting and interpersonal roles. Note uniforms, teaching aids, and any mentor-like posture.”
Both frames show a classroom: children in uniform on the floor, educational mats and craft supplies visible. The woman in blue guides them, leaning over and pointing – classic teacher body language.
Prompt to VLM: “Analyse frame 2086. Focus on body language, eye contact, and relative height. Does the adult display parental affection or instructional guidance?”
A second adult in blue joins, both crouching to child eye-level while giving instructions. No parental hugging or casual interaction – the dynamic is authoritative yet supportive, confirming a teacher-student relationship.
The critic model assigned 65 % confidence to teacher-student based on classroom setting, uniformed children, and directive gestures.
"Re-inspect frame 2086. Decide whether the adult’s posture and spatial distance suggest parental affection or classroom instruction. List concrete visual cues for your conclusion."
Combining these cues, the critic model raised its confidence in the teacher-student relationship from 65 % to 88 % and left the answer unchanged.
Final answer: D — Teacher-student relationship.
Reasoning: Frame 2057 shows the adult distributing materials while kneeling among multiple children, a classic teacher posture in early-learning environments. In frame 2086 she leans forward with an open palm pointing to the worksheet, while the child looks up attentively. No intimate touch or protective stance suggests a parent; instead, the power dynamic and setting confirm a teacher-student relationship.