An EC-funded project aimed at improving the social competences of virtual agents through artificial consciousness based on the Attention Schema Theory

Automatic Detection of Inconsistencies and Hierarchical Topic Classification for Open-Domain Chatbots


Current state-of-the-art (SotA) chatbots can produce high-quality sentences, handle a wide range of conversation topics, and sustain longer interactions. Unfortunately, the generated responses depend greatly on the data on which the models were trained, the specific dialogue history and current turn used to guide the response, the internal decoding mechanisms, and the ranking strategies, among other factors. As a result, a chatbot may give different answers to semantically similar questions asked by users, which can be considered a form of hallucination and a source of confusion in long-term interactions.

In this research paper, we propose a novel methodology consisting of two main phases: (a) hierarchical automatic detection of topics and subtopics in dialogue interactions using a zero-shot learning approach, and (b) detection of inconsistent answers using k-means clustering and the Silhouette coefficient.

To evaluate the efficacy of topic and subtopic detection, we use a subset of the DailyDialog dataset and real dialogue interactions gathered during the Alexa Socialbot Grand Challenge 5 (SGC5). The proposed approach can detect up to 18 different topics and 102 subtopics. The experimental results demonstrate the efficacy of the topic detection algorithm, achieving a weighted F1 score of 0.67 when detecting 13 distinct topics and 0.45 when detecting 18 distinct topics. At the subtopic level, a weighted F1 score of 0.67 was achieved. In addition, we show that our proposed approach outperforms a larger model trained on specific dialogue data. An advantage of our approach is its scalability: new fine-grained categories and subcategories can be incorporated that the larger model is not able to recognize.
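The hierarchical two-stage idea (first pick a high-level topic, then search only within that topic's subtopics) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label hierarchy below is invented (the actual taxonomy has 18 topics and 102 subtopics), and a real zero-shot system would use sentence embeddings or an NLI-based zero-shot classifier rather than the TF-IDF stand-in used here to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical two-level label hierarchy (illustrative names only; the
# paper's full taxonomy covers 18 topics and 102 subtopics).
HIERARCHY = {
    "sports": ["football", "tennis", "olympics"],
    "food": ["recipes", "restaurants", "nutrition"],
    "travel": ["flights", "hotels", "sightseeing"],
}

def best_match(text: str, labels: list) -> str:
    """Return the label most similar to the text. A real system would
    embed text and labels with a sentence encoder; TF-IDF over the
    candidate labels is a self-contained stand-in."""
    vec = TfidfVectorizer().fit(labels + [text])
    sims = cosine_similarity(vec.transform([text]), vec.transform(labels))[0]
    return labels[sims.argmax()]

def classify_utterance(utterance: str):
    """Two-stage classification: stage 1 picks the closest high-level
    topic; stage 2 searches only that topic's subtopics. Adding a new
    (sub)category is just an edit to HIERARCHY, which is what makes the
    hierarchical scheme easy to extend."""
    topic = best_match(utterance, list(HIERARCHY))
    subtopic = best_match(utterance, HIERARCHY[topic])
    return topic, subtopic

print(classify_utterance("any good restaurants near here that serve food"))
# → ('food', 'restaurants')
```

Restricting stage 2 to the winning topic's subtopics keeps each classification step over a small label set, which is what allows the method to scale to many fine-grained subtopics.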

To detect inconsistencies, we manually generate multiple paraphrased questions and employ several pre-trained SotA chatbot models to generate responses. The algorithm accurately estimates the number of distinct responses: it obtains an MSE of 3.4 on a set of 109 handcrafted responses (15 sets, each containing an original question plus its paraphrases, passed to 4 small chatbot models). On the 120 questions created with GPT-4 (15 question sets, each consisting of 1 original question and its 7 paraphrases, fed into 4 SotA chatbots), the overall MSE was 3.2. These results show that even LLMs produce inconsistent answers, and that our approach is a good proxy for detecting such cases.
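Phase (b) can be sketched as follows: embed the responses a chatbot gives to a question and its paraphrases, run k-means for a range of cluster counts, and keep the k with the highest Silhouette coefficient as the estimated number of semantically different answers. This is a sketch under stated assumptions, not the paper's exact pipeline: the toy 2-D vectors below stand in for real sentence embeddings, and the search range `k_max` is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_distinct(embeddings: np.ndarray, k_max: int = 6) -> int:
    """Estimate how many semantically different answers a response set
    contains: fit k-means for each candidate k and keep the k whose
    clustering maximizes the Silhouette coefficient. Returns 1 (all
    responses consistent) if there are too few points to cluster."""
    best_k, best_score = 1, -1.0
    # silhouette_score requires 2 <= k <= n_samples - 1
    for k in range(2, min(k_max, len(embeddings) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy embeddings standing in for sentence embeddings of 6 chatbot answers
# to one question and its paraphrases: two tight groups, i.e. the bot
# effectively gave two different answers.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
print(estimate_num_distinct(toy))  # → 2
```

An estimate above 1 flags the question set as inconsistent, and comparing the estimate against a human count of distinct answers per set yields the MSE figures reported above.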

As future work, we will focus on two main aspects: expanding the range of high-level topics and subsequently evaluating the algorithm's performance in identifying subtopics. We have already integrated this topic and subtopic classifier into the dialogue manager of the chatbot we used during our participation in the Alexa Socialbot Grand Challenge 5 (SGC5) [1]. Regarding the detection of inconsistent responses, our efforts will be directed towards the development of controllable algorithms and architectures, such as TransferTransfo [2] or CTRL [3], leveraging persona profiles within these frameworks to generate more consistent responses. Furthermore, we seek to explore mechanisms to incorporate these identified inconsistencies into an automated evaluation of dialogue systems [4, 5], following the recommendations made in [6].


[1]. Estecha-Garitagoitia, Marcos, Mario Rodríguez-Cantelar, Alfredo Garrachón Ruiz, Claudia Garoé Fernández García, Sergio Esteban Romero, Cristina Conforto, Alberto Saiz Fernández, Luis Fernando Fernández Salvador, and Luis Fernando D’Haro. “THAURUS: An Innovative Multimodal Chatbot Based on the Next Generation of Conversational AI.” Alexa Prize SocialBot Grand Challenge 5.

[2]. Wolf, T.; Sanh, V.; Chaumond, J.; Delangue, C. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. arXiv 2019, arXiv:cs.CL/1901.08149. 

[3]. Keskar, N.S.; McCann, B.; Varshney, L.R.; Xiong, C.; Socher, R. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv 2019, arXiv:cs.CL/1909.05858.

[4]. Zhang, C.; Sedoc, J.; D’Haro, L.F.; Banchs, R.; Rudnicky, A. Automatic Evaluation and Moderation of Open-domain Dialogue Systems. arXiv 2021, arXiv:cs.CL/2111.02110.

[5]. Zhang, C.; D’Haro, L.F.; Friedrichs, T.; Li, H. MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11657–11666.
