What if we could predict whether we will remember a video after a second viewing? What aspects of a social post, advertisement or educational content make it more or less memorable? In a world with an abundance of multimedia content, these questions are crucial for improving how we communicate through videos. With this aim, we present a system that predicts the memorability score of a short video, i.e. the likelihood that it will be remembered over time, from its frames or textual descriptions.
Recent findings in psychology and neuroscience challenge the idea that memory is purely a matter of personal judgment. Instead, they suggest that certain visual aspects tend to stick in our memory more often (Isola et al., 2011; Lin et al., 2021; Xie et al., 2020). Additionally, the memorability of visual content is not just a matter of personal opinion but is also influenced by the semantics or main topic of the scene presented (Bylinskii et al., 2022). This has motivated approaches that do not rely on images alone but also consider textual descriptions of those scenes (Kleinlein et al., 2021).
Our approach combines text and images to enhance predictions of what people will remember. We merge images and text descriptions to jointly train an image and text encoder, creating a representation that extracts semantic information from both sources. We achieve this using Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021), which creates similar vector representations for image-text pairs depicting the same concepts.
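The contrastive objective behind this kind of joint training can be sketched as follows. This is a minimal NumPy illustration of a CLIP-style symmetric loss, not the training code used in this work: the function names, the random stand-in embeddings and the fixed temperature of 0.07 are all assumptions made for the example.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: row i of each matrix is assumed to describe the same concept.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Cosine similarity of every image with every text, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])
    # Cross-entropy in both directions; the matching pair sits on the diagonal.
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))                       # stand-in text embeddings
images = text + 0.01 * rng.normal(size=(4, 8))       # nearly aligned image embeddings
aligned_loss = clip_contrastive_loss(images, text)
shuffled_loss = clip_contrastive_loss(images[::-1].copy(), text)  # pairs mismatched
```

The loss is low when each image embedding is closest to its own description and high when the pairs are mismatched, which is what pushes matching image-text pairs towards similar vectors.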
Expanding on this approach, we adapt the representations generated by these jointly-learned encoders by fine-tuning them using image-text pairs extracted from videos in a dataset specifically curated for memorability analysis, namely Memento10K (Newman et al., 2020). Hence, we refine the CLIP models into F-CLIP models enriched with semantic knowledge from both visual and textual elements.
In essence, we further pre-train these encoders to indirectly generate useful vector representations for predicting video memorability, capturing semantics from both frames and text descriptions.
We can then use these models as feature extractors by decoupling them and freezing their weights. The extracted features are fed into a Bayesian Ridge Regressor (BRR), which produces the memorability scores in our predictive pipeline. This model is a simple variation of the Ordinary Least Squares Linear Regressor that provides comparable results while being more robust to ill-posed problems. By employing this straightforward yet resilient model, we can focus on comparing the various features we extract and their impact on memorability prediction.
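The regression stage can be sketched with scikit-learn's `BayesianRidge`. This is an illustrative sketch only: the features below are random stand-ins for the outputs of a frozen encoder, the scores are synthetic, and the dimensions, split and default hyperparameters are assumptions rather than the configuration used in this work.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n_videos, n_dims = 200, 16

# Stand-in for features produced by a frozen encoder (synthetic, not real data).
features = rng.normal(size=(n_videos, n_dims))
true_weights = rng.normal(size=n_dims)
# Synthetic memorability-like scores: a linear signal plus a little noise.
scores = 0.8 + 0.05 * (features @ true_weights) + 0.01 * rng.normal(size=n_videos)

train, test = slice(0, 150), slice(150, None)

# Only the regressor is trained; the encoder weights stay untouched.
regressor = BayesianRidge()
regressor.fit(features[train], scores[train])

# return_std=True also yields a per-sample predictive standard deviation.
predicted, predicted_std = regressor.predict(features[test], return_std=True)
```

Because the encoder is frozen, swapping one feature set for another only changes the `features` matrix, which keeps comparisons between feature types straightforward.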
Our results indicate that, as a general pattern, our fine-tuning approach improves memorability prediction, as reflected in the Spearman Rank Correlation Coefficient (SRCC). Specifically, our method achieved an SRCC of 0.575 when using text as input, while the default pre-trained text encoder scored 0.538. However, we also noticed that this enhancement is less pronounced when dealing with visual data. One explanation for this difference is that text tends to offer richer and more accessible information than visual data, especially in the context of Memento10K’s often lower-quality videos.
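For reference, the SRCC measures how well the predicted ranking of videos agrees with the ground-truth ranking, regardless of the absolute score values. A minimal pure-Python sketch is given below; the helper names and the five score pairs are hypothetical and are not results from this work.

```python
def average_ranks(values):
    # Rank values from 1..n, giving tied values the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    # Plain Pearson correlation between two equal-length sequences.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical memorability scores for five videos.
ground_truth = [0.91, 0.75, 0.62, 0.88, 0.70]
predictions = [0.85, 0.80, 0.65, 0.90, 0.60]
srcc = spearman(ground_truth, predictions)
print(round(srcc, 3))  # 0.8
```

A value of 1 means the predicted ordering matches the ground truth exactly, 0 means no monotonic relationship, and -1 means the ordering is fully reversed.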
In conclusion, our study demonstrates that blending text and images, as in the F-CLIP model, improves video memorability prediction, offering promise for various applications. However, there is still more to explore in this field. Future work may involve leveraging temporal information within videos, incorporating pixel-level descriptors, and exploring the capabilities of Large Language Models (LLMs). Additionally, integrating other signals that correlate with how we form memories, such as electroencephalography (EEG), could open new avenues for understanding and predicting video memorability more comprehensively.
Overall, our research highlights the potential of cross-modal learning, paving the way for better video content creation and communication strategies in our information-rich world.
References:
Isola, P., Parikh, D., Torralba, A., & Oliva, A. (2011). Understanding the intrinsic memorability of images. *Advances in Neural Information Processing Systems*, *24*.
Lin, Q., Yousif, S. R., Chun, M. M., & Scholl, B. J. (2021). Visual memorability in the absence of semantic content. *Cognition*, *212*, 104714.
Xie, W., Bainbridge, W. A., Inati, S. K., Baker, C. I., & Zaghloul, K. A. (2020). Memorability of words in arbitrary verbal associations modulates memory retrieval in the anterior temporal lobe. *Nature Human Behaviour*, *4*(9), 937-948.
Bylinskii, Z., Goetschalckx, L., Newman, A., & Oliva, A. (2022). Memorability: An image-computable measure of information utility. *Human Perception of Visual Information: Psychological and Computational Perspectives*, 207-239.
Kleinlein, R., Luna-Jiménez, C., & Fernández-Martínez, F. (2021). THAU-UPM at MediaEval 2021: From Video Semantics To Memorability Using Pretrained Transformers. In *MediaEval 2021 workshop*.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In *International Conference on Machine Learning* (pp. 8748-8763). PMLR.
Newman, A., Fosco, C., Casser, V., Lee, A., McNamara, B., & Oliva, A. (2020). Multimodal memorability: Modeling effects of semantics and decay on video memorability. In *Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI* (pp. 223-240). Springer International Publishing.