🌐Overview
This paper introduces a new approach called ‘actor-agnostic multi-modal multi-label action recognition’ to address the limitations of existing action recognition methods, which are typically actor-specific and focus on single-label classification using only visual data. The proposed method offers a unified solution for various actors, including humans and animals, by leveraging both visual and textual modalities. The authors formulate a novel Multi-modal Semantic Query Network (MSQNet) within a transformer-based object detection framework (DETR), using both modalities to represent action classes more effectively. This eliminates the need for actor-specific pose estimation, reducing model complexity and maintenance costs. Experiments on five public benchmarks demonstrate that MSQNet outperforms existing actor-specific alternatives on both human and animal single- and multi-label action recognition tasks.
🌟Contribution
🛠️Approach
The core of the paper is the Multi-modal Semantic Query Network (MSQNet). Here’s a breakdown of its implementation:
- Spatio-temporal Video Encoder:
- Input: A video V consisting of T frames, each with spatial dimensions H x W.
- Patching: Each frame is divided into N non-overlapping square patches of size P x P.
- Embedding: These patches are flattened into vectors and mapped into an embedding space using a projection layer. Learnable positional embeddings are added to encode the spatio-temporal position of each patch.
- Transformer Encoder: The sequence of embedded patch vectors is fed into a Transformer encoder with Lv layers. This encoder captures spatial and temporal relationships between patches.
- Global Encoding: Patch tokens from each frame are averaged and then projected to dimension D using a linear projection layer (also called the global encoder).
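The patching, embedding, and global-encoding steps above can be sketched with plain array operations. This is a minimal illustration of the tensor shapes only; the sizes (T, H, W, P, D) and random projection weights are placeholders, not the paper's configuration, and the actual L_v-layer Transformer encoder is omitted.

```python
import numpy as np

# Hypothetical sizes for illustration (not the paper's configuration).
T, H, W, C = 8, 224, 224, 3   # frames, height, width, channels
P, D = 16, 512                # patch size, embedding dimension

rng = np.random.default_rng(0)
video = rng.standard_normal((T, H, W, C))

# Patching: split each frame into N = (H/P) * (W/P) non-overlapping P x P patches,
# then flatten each patch into a vector of length P*P*C.
N = (H // P) * (W // P)
patches = video.reshape(T, H // P, P, W // P, P, C)
patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(T, N, P * P * C)

# Embedding: linear projection into dimension D, plus learnable
# spatio-temporal positional embeddings (random placeholders here).
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
pos_emb = rng.standard_normal((T, N, D)) * 0.02
tokens = patches @ W_proj + pos_emb          # (T, N, D); fed to the encoder

# Global encoding: average patch tokens per frame, then project to dimension D.
frame_avg = tokens.mean(axis=1)              # (T, D)
W_global = rng.standard_normal((D, D)) * 0.02
global_tokens = frame_avg @ W_global         # (T, D)
print(tokens.shape, global_tokens.shape)     # (8, 196, 512) (8, 512)
```

With a 224 x 224 frame and 16 x 16 patches, each frame yields N = 196 tokens, matching the standard ViT tokenization the encoder builds on.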
- Multi-modal Query Encoder:
- Objective: Generate a query that combines visual and textual information to represent action classes.
- Learnable Label Embeddings: Each action class is associated with a learnable embedding vector Ql of dimension D. These embeddings are initialized with the text embeddings of the corresponding classes generated by a pre-trained text encoder (e.g., CLIP).
- Video Embedding: The CLIP image encoder is applied independently to each frame of the input video and then average pooled to obtain video embedding Qv of dimension D′′.
- Multi-modal Fusion: The learnable label embeddings and the video-specific embedding are fused to form the multi-modal query. The exact fusion mechanism is not detailed in the paper.
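Since the paper does not spell out the fusion step, the sketch below shows one plausible option purely as an assumption: broadcast-adding the pooled video embedding to every label embedding. The class count K and all tensors are illustrative placeholders, and random vectors stand in for the CLIP text and image features.

```python
import numpy as np

# Hypothetical sizes; K action classes, embedding dimension D.
K, D = 20, 512
rng = np.random.default_rng(1)

# Learnable label embeddings Q_l, which the paper initializes from a
# pre-trained text encoder (e.g., CLIP); random placeholders here.
Q_l = rng.standard_normal((K, D))

# Per-frame CLIP image features, average-pooled into a video embedding Q_v.
frame_feats = rng.standard_normal((8, D))
Q_v = frame_feats.mean(axis=0)               # (D,)

# Fusion (an assumption, not the paper's exact recipe):
# add the video embedding to each label query.
Q_m = Q_l + Q_v[None, :]                     # multi-modal queries, (K, D)
print(Q_m.shape)                             # (20, 512)
```

Broadcast addition is only one candidate; concatenation followed by a linear projection would be an equally simple alternative producing queries of the same shape.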
- Multi-modal Decoder:
- Transformer Decoder: The fused multi-modal query and the output of the video encoder are fed into a Transformer decoder.
- Action Classification: The decoder output is passed to a feed-forward network (FFN) that performs multi-label classification, predicting the probability that each action class is present in the video.
- The entire process casts multi-label action classification as a multi-modal target detection task within an elegant Transformer encoder-decoder framework.
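The decoder stage above can be sketched as a single cross-attention step followed by a sigmoid classification head. This is a toy single-head, single-layer stand-in for the real Transformer decoder; all sizes and weights are illustrative, and the key point is that a sigmoid (rather than softmax) over the per-class logits yields independent multi-label probabilities.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: K class queries, T video tokens, dimension D.
K, T, D = 20, 8, 512
rng = np.random.default_rng(2)

queries = rng.standard_normal((K, D))        # fused multi-modal queries
video_tokens = rng.standard_normal((T, D))   # video encoder output

# One cross-attention step: each class query attends over the video tokens.
attn = softmax(queries @ video_tokens.T / np.sqrt(D), axis=-1)  # (K, T)
decoded = attn @ video_tokens                                   # (K, D)

# FFN head (reduced to one linear layer here): per-class logit, then a
# sigmoid so each class probability is predicted independently (multi-label).
w_cls = rng.standard_normal((D,)) * 0.02
probs = sigmoid(decoded @ w_cls)             # (K,) probability per action class
print(probs.shape)
```

At inference, each class whose probability exceeds a chosen threshold (e.g., 0.5) is reported as present, which is what distinguishes this multi-label setup from single-label softmax classification.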
