Category-Specific Prompts for Animal Action Recognition with Pretrained Vision-Language Models

🌐Overview

This paper adapts a pretrained vision-language model (CLIP) for animal action recognition.

🌟Contribution

The authors introduce a novel category-specific prompting technique to refine both textual and visual features for different animals, leveraging the animal category information.

🛠️Approach

The model comprises three branches: a video branch, an animal category branch, and a text branch. To extract a semantic representation of the animals in a video clip, the method:

1. obtains probabilities for all animal categories in the video using a finetuned CLIP model,
2. extracts text features for each category name using a pretrained text encoder, and
3. fuses these text features, weighted by the category probabilities, into a category feature specific to the video clip.
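The fusion step above can be sketched as a probability-weighted sum of category text embeddings. This is a minimal illustration, not the authors' code: the random arrays stand in for the outputs of the finetuned CLIP classifier and the pretrained text encoder, and all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_categories, dim = 5, 512  # assumed: 5 animal categories, CLIP-like 512-d features

# Placeholder text features standing in for CLIP text-encoder embeddings
# of each animal category name, unit-normalized as CLIP features usually are.
text_feats = rng.standard_normal((num_categories, dim))
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)

# Placeholder category probabilities standing in for the finetuned CLIP
# model's prediction over animal categories for this video clip.
logits = rng.standard_normal(num_categories)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax

# Fuse: probability-weighted sum yields a clip-specific category feature.
category_feat = probs @ text_feats  # shape: (dim,)
```

A soft weighting like this lets the category feature degrade gracefully when the classifier is uncertain, instead of committing to a single (possibly wrong) hard-predicted class.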

📊 Evaluation

The proposed method is evaluated on the Animal Kingdom dataset, with performance comparisons against other action recognition models.

💭 Personal Perspectives

While the category-specific prompting technique augments the CLIP model with predicted animal classes, specializing on the animal category may offer advantages, but it can also introduce biases and hinder the model's ability to accurately recognize behaviors of unseen or underrepresented classes. Additionally, errors in the initial animal class prediction can propagate through subsequent stages of the model, compounding inaccuracies in behavior recognition.

🔗Links and references

https://dl.acm.org/doi/10.1145/3581783.3612551: Category-Specific Prompts for Animal Action Recognition with Pretrained Vision-Language Models