From June 11 to 15, 2025, the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)—the world’s leading event in computer vision and artificial intelligence—was held in Nashville, Tennessee.

During the conference, the work titled “ViMoCLIP: Augmenting Static CLIP Representations with Video Motion Cues for Animal Action Recognition” was presented. This research introduces an innovative extension of the CLIP model to tackle the challenge of recognizing animal actions in video—a key task in areas such as computational biology and smart aquaculture.
The proposed system is based on a student–teacher learning framework that integrates:
- Static visual representations extracted by CLIP,
- Temporal features learned through optical flow,
- A temporal Transformer that fuses both modalities to enable dynamic scene understanding.
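The fusion of the two modalities can be sketched as follows. This is a minimal, hypothetical illustration only—the function names, dimensions, and the use of a single self-attention head are assumptions for clarity, not the authors' actual implementation (which uses a full temporal Transformer):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(clip_feats, flow_feats, Wq, Wk, Wv):
    """Toy fusion of per-frame static (CLIP) and motion (optical-flow)
    features with one self-attention step over time, then mean-pooling
    to a single clip-level embedding.

    clip_feats: (T, Dc) array of per-frame CLIP embeddings
    flow_feats: (T, Df) array of per-frame motion embeddings
    Wq, Wk, Wv: (Dc+Df, D) projection matrices (hypothetical)
    """
    # Concatenate the two modalities per frame: (T, Dc+Df)
    x = np.concatenate([clip_feats, flow_feats], axis=-1)
    # Project to queries, keys, values
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Temporal self-attention: each frame attends to all frames
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (T, T)
    fused = attn @ v                                          # (T, D)
    # Pool over time to obtain one clip-level representation
    return fused.mean(axis=0)                                 # (D,)
```

In a real temporal Transformer this single attention step would be replaced by stacked multi-head layers with positional encodings, but the core idea—letting frames exchange information along the time axis after the static and motion cues are combined—is the same.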
When evaluated on the Animal Kingdom dataset, the model outperformed previous CLIP-based approaches, particularly on complex behaviors such as stalking, jumping, and flying.
By combining computer vision with motion analysis, this technology enables continuous monitoring of animal behavior in aquatic environments, making it possible to automatically detect feeding patterns, signs of stress, or illness. These capabilities have a direct impact on animal welfare and production efficiency.
The full implementation is available at: https://github.com/MarcosRodrigoT/VIMO-CLIP
