ViRe-Agent: Multi-Modal Composed Retrieval with Dynamic Scene Segmentation for Cross-Duration Video

ViRe-Agent demo 1

Image + Text or Video + Text Query

ViRe-Agent demo 2

Image or Video Query

ViRe-Agent demo 3

Text Query

ViRe-Agent supports interactive search, formulates multimodal search as a policy-driven decision process, and covers end-to-end database construction.

Abstract

Existing multimodal video retrieval systems are limited by weak intent understanding and rigid similarity-based matching, especially when handling composed queries that integrate heterogeneous inputs.

To fill this gap, we propose ViRe-Agent, a retrieval agent that formulates multimodal search as a policy-driven decision process. The core contribution lies in a Light MDP routing mechanism, which selects an optimal retrieval policy (semantic, similarity, composed, or fingerprint-based) conditioned on both query intent and segment-level video semantics.

We further integrate scene segmentation and multi-modal fingerprinting to support cross-duration queries and provenance tracing. Extensive experiments demonstrate consistent improvements over recent baselines including CoVR, TransVCL, and CLIP4Clip, with gains of up to 4.6% in R@1 on real-world data.

Methodology

1. System Architecture

ViRe-Agent operates under two distinct modes: Feature Extraction Mode and Retrieval Mode. The system integrates multimodal input parsing, LLM-based intent understanding, and dynamic retrieval strategy selection.

ViRe-Agent Framework

Figure 1: The overall architecture of the multi-modal video retrieval agent system.


2. Agent Workflow & Routing

The core of our agent is the Light MDP routing mechanism. It selects the optimal retrieval policy (Semantic, Similarity, Composed, or Fingerprint) conditioned on query intent and current states.

Agent Working Flow

Figure 2: The decision-making workflow of the ViRe-Agent.
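
To make the routing idea concrete, the following is a minimal sketch of choosing one of the four retrieval policies from a small query state. It is not the Light MDP mechanism itself: the state fields (`has_text`, `has_visual`, `wants_provenance`), the hand-set scores in `action_values`, and the greedy `route` function are all hypothetical stand-ins for whatever state representation and value estimates the real agent uses.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    SEMANTIC = "semantic"
    SIMILARITY = "similarity"
    COMPOSED = "composed"
    FINGERPRINT = "fingerprint"


@dataclass(frozen=True)
class QueryState:
    # Hypothetical state features, assumed for illustration only.
    has_text: bool
    has_visual: bool        # an image or a video is part of the query
    wants_provenance: bool  # assumed intent flag, e.g. "find the source of this clip"


def action_values(state: QueryState) -> dict:
    """Return illustrative, hand-set scores per policy (not learned values)."""
    q = {policy: 0.0 for policy in Policy}
    if state.has_text and state.has_visual:
        q[Policy.COMPOSED] += 1.0
    elif state.has_text:
        q[Policy.SEMANTIC] += 1.0
    elif state.has_visual:
        q[Policy.SIMILARITY] += 1.0
    # Provenance tracing needs visual content to fingerprint.
    if state.has_visual and state.wants_provenance:
        q[Policy.FINGERPRINT] += 2.0
    return q


def route(state: QueryState) -> Policy:
    """Greedy action selection: pick the policy with the highest score."""
    q = action_values(state)
    return max(q, key=q.get)
```

Under these assumed scores, `route(QueryState(True, True, False))` selects the composed policy, a text-only query selects the semantic policy, and a visual query flagged for provenance selects the fingerprint policy. A learned or LLM-informed value function could replace `action_values` without changing `route`.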


3. Dynamic Scene Segmentation

We perform semantic scene segmentation on the original long video based on TransNetV2. Shot transitions are detected first, and semantically coherent shots are then clustered into segments, which allows for fine-grained retrieval.

Scene Segmentation

Figure 3: Semantic scene segmentation pipeline.
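
The second step, grouping coherent shots into segments, can be sketched as follows. This assumes shot boundaries have already been produced by a detector such as TransNetV2 and that each shot has an embedding vector; the function names, the adjacent-shot cosine comparison, and the `threshold` value are illustrative assumptions, not the clustering procedure used in ViRe-Agent.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_shots(shots, embeddings, threshold=0.8):
    """Merge consecutive shots whose embeddings are similar.

    shots: list of (start_frame, end_frame) tuples in temporal order.
    embeddings: one vector per shot, in the same order.
    threshold: assumed similarity cutoff; a real system would tune it.
    """
    if not shots:
        return []
    segments = [[shots[0][0], shots[0][1]]]
    for i in range(1, len(shots)):
        # Compare each shot only with its immediate predecessor.
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            segments[-1][1] = shots[i][1]  # extend the current segment
        else:
            segments.append([shots[i][0], shots[i][1]])  # start a new one
    return [tuple(segment) for segment in segments]


shots = [(0, 49), (50, 119), (120, 199)]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
segments = merge_shots(shots, embeddings)
```

Here the first two shots have nearly parallel embeddings and merge into `(0, 119)`, while the third starts a new segment `(120, 199)`. Comparing against a running segment centroid instead of the previous shot is a common alternative when shots within a scene drift gradually.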

R3V Dataset

We constructed a real-world video retrieval dataset (R3V) comprising 1,000 clips sourced from diverse content including films, TV dramas, documentaries, and variety shows.

Dataset Examples

A sample of the real-world private dataset with multilingual annotations.