DeepPavlov FAQ Model Returning Wrong Index

Hello, I am currently building an FAQ model based on the example and guidelines in the DeepPavlov documentation. However, the “y_pred_id” values behave strangely and do not match what the documentation says.

According to the pipeline configuration, when the FAQ model receives a query, it produces a list of similarity scores, one for each question in the data file, then selects the highest score and returns the corresponding answer. However, the model shows some odd behavior from this perspective.

First of all, the “y_pred_id”, which holds the index of the highest score, does not align with the data file. To clarify, if y_pred_id is 6, it corresponds not to the 6th row of the data but to the 2nd. There is a strange offset of 4 in the y_pred_ids and y_pred_probas. The offset is also not always 4; sometimes it is different, so I cannot correct for it systematically. I tried reading the documentation and the source code but still have no idea why this is happening.

Moreover, since the offset is 4, what does it mean when y_pred_id holds an index from 0 to 3? To me it seems that the FAQ model has a built-in default function (not mentioned in the documentation) that returns ids from 0 to 3 when the user’s utterance is close to gibberish (or something like this).

Another interesting thing to note is that the y_pred_id data type is numpy.int64, and when it is converted to a native Python type using the .item() method, an index from 0 to 3 comes back as -1. Why?

These features or characteristics are not mentioned in the documentation at all, so I am very confused…

Thank you so much for your help!

Dear @brandonra97, thank you very much for your interest!

You are right: y_pred_id does not follow the row indexes of the original dataset file. y_pred_id stores the index of the best answer within the list of answers sorted according to their number of occurrences (each answer has a different number of questions, which is why some answers are more popular than others). You can check the {MODELS_PATH}/faq/mipt/en_mipt_faq_v4/en_mipt_answers.dict file of the simple_vocab component to see this ordering.
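To illustrate why a frequency-sorted answer list stops lining up with dataset rows, here is a minimal sketch in plain Python. It does not call DeepPavlov and is not the simple_vocab implementation; the toy rows and the helper name `frequency_ordered_answers` are hypothetical, and the only thing it assumes is the ordering rule described above (most frequent answer first).

```python
from collections import Counter

# Hypothetical toy dataset: (question, answer) pairs in file order.
rows = [
    ("How do I apply?", "Use the online form."),
    ("Where is the office?", "In building A."),
    ("Where are you located?", "In building A."),
    ("What is the address?", "In building A."),
    ("Is there a dorm?", "Yes, for all students."),
    ("Do you offer housing?", "Yes, for all students."),
]


def frequency_ordered_answers(rows):
    """Return the unique answers, most frequent first (illustrative only)."""
    counts = Counter(answer for _, answer in rows)
    # most_common() sorts by count, descending; ties keep first-seen order.
    return [answer for answer, _ in counts.most_common()]


answer_vocab = frequency_ordered_answers(rows)
answer_to_id = {answer: i for i, answer in enumerate(answer_vocab)}

# The answer from row 0 of the file ends up last in the frequency-sorted list,
# so an id taken from this list cannot be used as a row number.
print(answer_to_id["Use the online form."])
```

In this toy case the list of unique answers is also shorter than the dataset (3 answers for 6 rows), so the gap between a row number and an answer id need not be constant.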

I think it is a good idea to preserve the indexes of the answers in the output, and we should definitely implement this in upcoming releases.

Meanwhile, you can use the answer text itself (y_pred_answers) to locate the relevant answer index in your dataset.
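A rough sketch of that lookup, using only the Python standard library, might look like the following. The column name "Answer", the sample CSV text, and the helper `rows_for_answer` are assumptions made for illustration; adjust them to your own data file. The predicted answer is passed in as a plain string, for example one element of y_pred_answers.

```python
import csv
import io

# Hypothetical dataset contents; in practice, read your own CSV file instead.
SAMPLE_CSV = """Question,Answer
How do I apply?,Use the online form.
Where is the office?,In building A.
Where are you located?,In building A.
Is there a dorm?,"Yes, for all students."
"""


def rows_for_answer(csv_text, predicted_answer, answer_col="Answer"):
    """Return 0-based data-row indices (header excluded) whose answer matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    target = predicted_answer.strip()
    return [i for i, row in enumerate(reader) if row[answer_col].strip() == target]


# One answer can belong to several questions, so a list of rows is returned.
matches = rows_for_answer(SAMPLE_CSV, "In building A.")
print(matches)
```

Because several questions can share one answer, the lookup returns every matching row rather than a single index; an empty list means the predicted text was not found verbatim in the file.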

Please let me know if I can help you further.