POSTS |

3 October 2023
by fjmirandav
0 comments

Video Memorability Prediction From Jointly-learnt Semantic and Visual Features

What if we could predict whether we will remember a video after a second viewing? What aspects of a social post, advertisement or educational content make it more or less memorable? In a world with an abundance of multimedia content, this question is crucial for improving how we communicate through videos. With this aim, we present in this paper a system that can predict the memorability score, i.e. its likelihood of being remembered over time, of a short video based on its frames or textual descriptions.

Recent findings in psychology and neuroscience challenge the idea that memory is purely a matter of personal judgment. Instead, they suggest that certain visual aspects tend to stick in our memory more often (Isola et al., 2011; Lin et al., 2021; Xie et al., 2020). Additionally, the memorability of visual content isn’t just about personal opinions but is also influenced by the semantics or main topic of the scene that is presented (Bylinskii et al., 2022). This leads to approaches that not just look at pictures but start considering descriptions of those scenes in text (Kleinlein et al., 2021).

Our approach combines text and images to enhance predictions of what people will remember. We merge images and text descriptions to jointly train an image and text encoder, creating a representation that extracts semantic information from both sources. We achieve this using Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021), which creates similar vector representations for image-text pairs depicting the same concepts.

Expanding on this approach, we adapt the representations generated by these jointly-learned encoders by fine-tuning them using image-text pairs extracted from videos in a dataset specifically curated for memorability analysis, namely Memento10K (Newman et al., 2020). Hence, we refine the CLIP models into F-CLIP models enriched with semantic knowledge from both visual and textual elements.

In essence, we further pre-train these encoders to indirectly generate useful vector representations for predicting video memorability, capturing semantics from both frames and text descriptions.

We can then use these models as feature extractors by decoupling them and freezing their weights. The generated features are subsequently channeled into a Bayesian Ridge Regressor (BRR), which plays the pivotal role in our predictive pipeline by generating memorability scores. This model is a simple variation of the Ordinary Least Squares Linear Regressor that provides akin results whilst being more robust to ill posed problems. By employing this straightforward yet resilient model, our primary focus centers on comparing the various features we extract and their impact on memorability prediction.

Our results indicate that, as a general pattern, our fine-tuning approach improves memorability prediction, as reflected in the Spearman Rank Correlation Coefficient (SRCC). Specifically, our method achieved an SRCC of 0.575 when using text as input, while the default pre-trained text encoder scored 0.538. However, we also noticed that this enhancement is less pronounced when dealing with visual data. One explanation for this difference is that text tends to offer richer and more accessible information than visual data, especially in the context of Memento10K’s often lower-quality videos.

In conclusion, our study demonstrates that blending text and images, as seen in the F-CLIP model, improves video memorability prediction, offering promise for various applications. However, there’s still more to explore in this field. Future work may involve leveraging temporal information within videos, incorporating pixel-level descriptors, and exploring the advanced capabilities of Large Language Models (LLMs). Additionally, integrating other types of signals that correlate with our way of creating memories, such as Electro-Encephalogram (EEG), could open up new avenues for understanding and predicting video memorability in more comprehensive ways.

In a nutshell, our research highlights the potential of cross-modal learning, paving the way for better video content creation and communication strategies in our information-rich world.

References:

Isola, P., Parikh, D., Torralba, A., & Oliva, A. (2011). Understanding the intrinsic memorability of images. *Advances in neural information processing systems*, *24*.

Lin, Q., Yousif, S. R., Chun, M. M., & Scholl, B. J. (2021). Visual memorability in the absence of semantic content. *Cognition*, *212*, 104714.

Xie, W., Bainbridge, W. A., Inati, S. K., Baker, C. I., & Zaghloul, K. A. (2020). Memorability of words in arbitrary verbal associations modulates memory retrieval in the anterior temporal lobe. *Nature human behaviour*, *4*(9), 937-948.

Bylinskii, Z., Goetschalckx, L., Newman, A., & Oliva, A. (2022). Memorability: An image-computable measure of information utility. *Human Perception of Visual Information: Psychological and Computational Perspectives*, 207-239.

Kleinlein, R., Luna-Jiménez, C., & Fernández-Martínez, F. (2021). THAU-UPM at MediaEval 2021: From Video Semantics To Memorability Using Pretrained Transformers. In *MediaEval 2021 workshop*.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In *International conference on machine learning* (pp. 8748-8763). PMLR.

Newman, A., Fosco, C., Casser, V., Lee, A., McNamara, B., & Oliva, A. (2020). Multimodal memorability: Modeling effects of semantics and decay on video memorability. In *Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16* (pp. 223-240). Springer International Publishing.

28 September 2023
by luisfernandodharo
0 comments

Automatic Detection of Inconsistencies and Hierarchical Topic Classification for Open-Domain Chatbots

Current State-of-the-Art (SotA) chatbots are able to produce high-quality sentences, handling different conversation topics and larger interaction times. Unfortunately, the generated responses depend greatly on the data on which they have been trained, the specific dialogue history and current turn used for guiding the response, the internal decoding mechanisms, and ranking strategies, among others. Therefore, it may happen that for semantically similar questions asked by users, the chatbot may provide a different answer, which can be considered as a form of hallucination or producing confusion in long-term interactions.

In this research paper, we propose a novel methodology consisting of two main phases: (a) hierarchical automatic detection of topics and subtopics in dialogue interactions using a zero-shot learning approach, and (b) detecting inconsistent answers using k-means and the Silhouette coefficient.

To evaluate the efficacy of topic and subtopic detection, we use a subset of the DailyDialog dataset and real dialogue interactions gathered during the Alexa Socialbot Grand Challenge 5 (SGC5). The proposed approach enables the detection of up to 18 different topics and 102 subtopics. The experimental results demonstrate the efficacy of the topic detection algorithm, achieving an F1 weighted score of 0.67 when detecting 13 distinct topics and an F1 weighted score of 0.45 when detecting 18 distinct topics. In terms of the subtopic level, a weighted F1 score of 0.67 was achieved. Besides, we show how our proposed approach outperforms a larger model trained on specific dialogue data. An advantage of our approach is that it is scalable allowing the incorporation of new categories and subcategories (fine-grained) that the larger model is not able to recognize.

For the purpose of detecting inconsistencies, we manually generate multiple paraphrased questions and employ several pre-trained SotA chatbot models to generate responses. Moreover, the algorithm exhibits precise estimation capabilities in determining the number of diverse responses, as evidenced by an MSE of 3.4 calculated over a set of 109 handcrafted responses, 15 sets of original questions plus their paraphrases, passed to 4 small model chatbots. In the case of the 120 questions created with GPT-4, 15 question sets each consisting of 1 original and its respective 7 paraphrases, and fed into 4 State-of-the-Art chatbots, the overall resulting MSE was 3.2. These results show that even LLMs produce inconsistent answers and our approach is a good proxy to detect such cases.

As future work, we will primarily focus on two main aspects: expanding the range of high-level topics and subsequently evaluating the algorithm’s performance in identifying subtopics. Additionally, we included this topic and subtopic classifier in the dialogue management for the chatbot that we used during our participation in the Alexa Socialbot Grand Challenge (SGC5) [1]. Regarding the detection of inconsistent responses, our efforts will be directed towards the development of controllable algorithms and architectures, such as TransferTransfo [2] or CTRL [3], leveraging persona profiles within these frameworks with the idea of generating more consistent responses. Furthermore, we seek to explore mechanisms to incorporate these identified inconsistencies into an automated evaluation of dialogue systems [4, 5], according to the recommendations made in [6].

Bibliography

[1]. Estecha-Garitagoitia, Marcos, Mario Rodríguez-Cantelar, Alfredo Garrachón Ruiz, Claudia Garoé Fernández García, Sergio Esteban Romero, Cristina Conforto, Alberto Saiz Fernández, Luis Fernando Fernández Salvador, and Luis Fernando D’Haro. “THAURUS: An Innovative Multimodal Chatbot Based on the Next Generation of Conversational AI.” Alexa Prize SocialBot Grand Challenge 5.

[2]. Wolf, T.; Sanh, V.; Chaumond, J.; Delangue, C. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. arXiv 2019, arXiv:cs.CL/1901.08149.

[3]. Keskar, N.S.; McCann, B.; Varshney, L.R.; Xiong, C.; Socher, R. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv 2019, arXiv:cs.CL/1909.05858.

[4]. Zhang, C.; Sedoc, J.; D’Haro, L.F.; Banchs, R.; Rudnicky, A. Automatic Evaluation and Moderation of Open-domain Dialogue Systems. arXiv 2021, arXiv:cs.CL/2111.02110.

[5]. Zhang, C.; D’Haro, L.F.; Friedrichs, T.; Li, H. MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11657–11666.

11 September 2023
by fjmirandav
0 comments

ETHICS AND HUMAN RIGHTS MATTER IN ASTOUND PROJECT

Improving social competences of virtual agents through artificial consciousness based on the Attention Schema Theory

On May 15 and 16 2023, a congress about Public Law and Algorithms was held in the University of Valencia. Professor Celia Fernández Aller and prof. Jesús Salgado Criado were invited to contribute with a presentation: An interdisciplinary conversation: how to integrate ethics and people’s rights into ASTOUND project.

https://esdeveniments.uv.es/97884/detail/public-law-and-algorithms-towards-a-legal-framework-for-the-fair-and-non-discriminatory-use-of-ai-w.html

The starting point was that a human rights law approach to algorithmic accountability is something crucial. Ethics is relevant, and the rights approach is a complementary and a essential framework.

Ethics is currently at the heart of ASTOUND project. At this stage, the project has checked available multimodal datasets for training and evaluation and has carried out an analysis of current approaches for dataset curation for bias and toxicity. While designing our chatbot architecture, many discussions during monthly general meetings have taken place around ethical aspects. Apart from that, a small ethics group has been organized, and the External Ethical Board has been selected.

ASTOUND project is based on the assumption that Ethics will not create, but avoid, future risk for the project’s success, as it will bring trust. There is an opportunity to offer guidelines to contribute to other future projects with similar challenges.

The most pressing issue is related to selecting risks to avoid. During the first phase of ethical analysis, a thorough list of potential issues or topics has been identified, such as: a) fairness (no discrimination against any group of persons); b) dignity (not impersonating, making all the time clear that the person is chatting with a machine); c) autonomy (potential influence/manipulation from chatbot side, exercising influence to persuade on a specific position); d) explainability/interpretability of chatbot responses; e) observability, auditability and monitoring: how is the system planned to be audited and monitored in terms of performance; what performance variables can be collected; f) safety: controllability – how can the system be controlled in case of degradation of performance?; g) security/privacy in order not to use personal data without knowdlege of the data subject; h) transparency; accountability/responsibility assigned in case of malfunction (legal and moral), errors, inaccuracies or dangerous suggestions; i) long term impacts of the technology.

Some tools which are being used are the Assessment List for Trustworthy Artificial Intelligence (ALTAI) to develop procedures to detect, assess the level and address potential risks (https://futurium.ec.europa.eu/en/european-aialliance/pages/altai-assessment-list-trustworthy-artificial-intelligence), as well the Ethics By Design and Ethics of Use Approaches for Artificial Intelligence (https://ec.europa.eu/info/funding-tenders/opportunities/docs/2021-2027/horizon/guidance/ethics-by-design-and-ethics-of-use-approaches-for-artificial-intelligence_he_en.pdf).

As Nissenbaum says, “we cannot simply align the world with the values and principles we adhered to prior to the advent of technological challenges. Rather, we must grapple with the new demands that changes wrought by the presence and use of information technology have placed on values and moral principles“¹. We must bring attention to the values that are unconsciously built into technology.

Although the ethical framework is something crucial, it has limitations, as ethics is not compulsory, its principles are not universal and do not have consensus (hundreds of ethical codes are available). The human rights approach offers a complementary vision, based on common principles as universality of human rights, participation, transparency, accountabiliy and non discrimination. AI governance is needed and legal instruments as the UE Artificial Intelligence Act can be very useful. The regulator will make clear which are the responsibilities of each of the actors involved in the life of an AI system. Supervisory bodies will be able to remove from the loop those who seek to make irresponsible use of this technology. The AI Act is based on a system of risk analysis and provides for a series of requirements applicable to high-risk AI systems, in particular for system providers, such as the obligation to draw up an EU declaration of conformity and to affix the CE conformity marking². The human rights approach will help adapting and overcoming limitations of the text³.

General-purpose artificial intelligence (AI) technologies are now included in the AI Act, so ASTOUND project will have to analyse carefully the legal implications and impacts on human rights of future conscious chatbots .

References:

¹C. Allen, W. Wallach and I. Smit, “Why Machine Ethics?,” in IEEE Intelligent Systems, vol. 21, no. 4, pp. 12-17, July-Aug. 2006, doi: 10.1109/MIS.2006.83.

² Leonardo Cervera Navas. Por qué hay que abordar la regulación de la inteligencia artificial como la de la aviación comercial. EL PAIS, 25-05-2023.

³ J. Salgado-Criado and C. Fernández-Aller, “A Wide Human-Rights Approach to Artificial Intelligence Regulation in Europe,” in IEEE Technology and Society Magazine, vol. 40, no. 2, pp. 55-65, June 2021, doi: 10.1109/MTS.2021.3056284; Vinodkumar Prabhakaran, Margaret Mitchell, Timnit Gebru, Iason Gabriel. “A Human Rights-Based Approach to Responsible AI”. arXiv:2210.02667 [cs.AI]

6 June 2023
by hhardy
2 Comments

The Science of Consciousness

From the 22rd of May to the 27th of May 2023, our consortion members Indeep AI and École Normale Supérieure, attended The Science of Consciousness conference in Taormina, Italy as part of our ground breaking project, ASTOUND.

Our mission at Astound is to push the boundaries of artificial intelligence and lay the groundwork for the development of consciousness within AI systems. Through knowledge-sharing, idea exchange, and collaboration with some of the brightest minds in the scientific community, our team strove to foster innovation and gain deeper insights into the enigmatic nature of consciousness.

Furthermore, our very own Aïda Elamrani of École Normale Supérieure gave a talk during the conference on “To What Extent can Machines be Conscious”. The talk discussed how experts disagree on the possibility of mechanically implementing consciousness, with challenges involving phenomenalism, physicalism, computation, and information. A compromise is possible by viewing consciousness as a virtual reality implemented through computational mechanisms, but this raises further research questions regarding different interpretations of information. Achieving machine consciousness depends on the chosen interpretation.

Stay tuned for more updates as we take the lead in making significant progress towards developing consciousness in artificial intelligence.

23 March 2023
by luisfernandodharo
0 comments

Paper on automatic evaluation at IEEE/ACM Trans. on Audio, Speech, and Language Processing

On March 1st, 2023 was published our paper entitled PoE: A Panel of Experts for Generalized Automatic Dialogue Assessment which describe a state-of-the-art model for automatic evaluation of dialogue systems at turn-level. The proposed model was assessed on 16 dialogue evaluation datasets spanning a wide range of dialogue domains. The model achieves high Spearman correlations (+0.47) with respect to the human annotations over all the evaluation datasets. This result is particularly good as the model exhibits better zero-shot generalization (i.e., good correlations on completely unseen datasets) than existing state-of-the-art models. Besides, the proposed model has the ability to easily adapt to new domains thanks to the usage of few-shot transfer learning techniques.

System architecture of a Panel of Experts (PoE). A transformer encoder T consists of L layers (T₁, T₂, …T_L). Different colors (blue, red, and green) denote domain-specific adapter modules. Each domain-specific adapter has L − 1 layers that are injected in between every two consecutive transformer layers. There are domain-specific classifiers after the final transformer layer, T_L. T is shared by all the domain-specific modules. Each adapter is trained using a different dataset. Adapters can be added as required or removed after testing the performance of the model on an new dataset.

In more detail, the proposed Panel of Experts (PoE) model is a multitask network that consists of a shared transformer encoder and a collection of lightweight adapters. The shared encoder captures the general knowledge of dialogues across domains, while each adapter specializes in one specific domain and serves as a domain expert. The following figure shows the architecture of the network.

In addition, to improve the performance of the system, we also applied four different data augmentation techniques: 1) Syntactic & Semantic Negative Sampling, 2) Back-Translation, 3) Generation From State-of-the-art Dialogue Systems, and 4) Automatic Generation of Adversarial Responses.

Finally, the model generates the final score as the average of the different adapters or using one of the adapters whose trained data is closer to the evaluation data. Different tables including comparisons between the proposed model against other state-of-the-art metrics are provided on different settings. Besides, the zero or few-shot capabilities of the model are also evaluated depending on the percentage of in domain data used for adapting the model.

This paper is a collaboration between Universidad Politécnica de Madrid (UPM) and the National University of Singapore (NUS). The work leading to these results is also supported by the
European Commission through Project ASTOUND (101071191 – HORIZON-EIC-2021-PATHFINDERCHALLENGES-01)