How to set a threshold so that no answer is returned if no question really matched?

Hi

I am using

python3 -m deeppavlov interact tfidf_logreg_en_faq

which works great, but when I enter a question that is completely out of context, for example “What did I dream last night?”, I would prefer that no answer from the FAQ is returned.

Is there a way to set some kind of threshold so that no answer is returned in such a case, or to detect that the question did not really match any question in the FAQ?

Thanks

Michael

Hi @michaelwechner,
The simple way would be to replace "max_proba": true in the config’s proba2labels block with "confident_threshold": 0.5 or your desired threshold. This will change the output format and might even lead to returning multiple possible answers for a question if the threshold is smaller than 0.5.
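To illustrate why a threshold below 0.5 can yield several answers, here is a minimal sketch of threshold-based label selection. It is not DeepPavlov's actual implementation; the function name, labels, and probabilities are made up for illustration, and it assumes the class probabilities sum to 1, so at most one class can exceed 0.5.

```python
from typing import List


def labels_above_threshold(probas: List[float], labels: List[str],
                           threshold: float) -> List[str]:
    """Return every label whose probability exceeds the threshold."""
    return [label for label, p in zip(labels, probas) if p > threshold]


# Hypothetical FAQ answers and class probabilities for one question.
labels = ["answer_a", "answer_b", "answer_c"]
probas = [0.45, 0.40, 0.15]

high = labels_above_threshold(probas, labels, threshold=0.5)
low = labels_above_threshold(probas, labels, threshold=0.3)
print(high)  # no label clears 0.5, so nothing is returned
print(low)   # two labels clear 0.3, so both are returned
```

With a threshold of 0.5 or higher the result holds zero or one label, which is the behaviour asked for in the question.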

The other way would be to add a post-processor to your config that would filter answers by their probability. It could look something like this:

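from typing import List


class ProbaFilter:
    """Replace an answer with default_value when its best probability is too low."""

    def __init__(self, threshold: float = 0.5, default_value: str = '', **kwargs):
        self.threshold = threshold
        self.default_value = default_value

    def __call__(self, answers: List[str], probas: List[List[float]]) -> List[str]:
        # Keep an answer only if its highest class probability exceeds the threshold.
        return [answer if max(ans_probas) > self.threshold else self.default_value
                for answer, ans_probas in zip(answers, probas)]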
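As a rough usage sketch, the filter can be tried outside of any DeepPavlov pipeline. The class is repeated so the snippet runs on its own; the answers, probabilities, threshold, and fallback text are hypothetical values chosen for illustration.

```python
from typing import List


class ProbaFilter:
    """Replace an answer with default_value when its best probability is too low."""

    def __init__(self, threshold: float = 0.5, default_value: str = '', **kwargs):
        self.threshold = threshold
        self.default_value = default_value

    def __call__(self, answers: List[str], probas: List[List[float]]) -> List[str]:
        return [answer if max(ans_probas) > self.threshold else self.default_value
                for answer, ans_probas in zip(answers, probas)]


# Hypothetical batch: two questions, three FAQ classes each.
answers = ["Answer A", "Answer B"]
probas = [[0.05, 0.90, 0.05],   # one class clearly dominates
          [0.40, 0.35, 0.25]]   # no class clears the threshold

proba_filter = ProbaFilter(threshold=0.8, default_value="Sorry, I don't know.")
filtered = proba_filter(answers, probas)
print(filtered)
```

The first answer is kept and the second is replaced by the fallback text, so the caller can detect an unmatched question by comparing against default_value.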

Hi @yoptar, thank you very much for your explanation and hint!

Hi @yoptar

In my training data I have the following question/answer pair:

What is the name of the president of the USA?,“Donald Trump”

When I send the following request:

"q": [
  "What is the name of the president of the USA?"
]

Then I receive the following response:

[
  [
    [
      "Donald Trump"
    ],
    [
      0.0009037585956436307,

      0.00020985532607495578,
      0.9807839562113666
    ]
  ]
]

When I send the following query:

"q": [
  "What is the name of the president of the Russia?"
]

then I receive the following response:

[
  [
    [
      "Donald Trump"
    ],
    [
      0.0010917099011234978,

      0.0002233205864783391,
      0.948826280203339
    ]
  ]
]

I guess the result is basically the same because I don’t have the question/answer pair

What is the name of the president of the Russia?,“Vladimir Putin”

in my training data.

I understand that the two queries are nearly the same; only the words “USA” and “Russia” differ. I had hoped to be able to recognize from the response that there is no answer to this question, because the answer regarding Russia is not in the training data yet. Nevertheless, the response values are very similar:

Russia: 0.948826280203339
USA: 0.9807839562113666

Do you have a hint on how to differentiate the two cases, so that the program can decide not to return an answer for “Russia”?

Btw, I have set "confident_threshold": 0.5, but I guess this does not help in such a case.

Thanks very much

Michael
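One possible explanation for the near-identical scores, sketched under assumptions: if the featurizer is vocabulary-based (as a TF-IDF vectorizer typically is), a word never seen in training, such as “Russia”, is simply dropped, so the classifier sees almost the same input as for the USA question. The helper names below are hypothetical, and the vocabulary is built from a single training question only to keep the example small.

```python
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase, strip the question mark, and split on whitespace."""
    return text.lower().replace("?", "").split()


def known_tokens(query: str, vocabulary: Set[str]) -> List[str]:
    """Tokens a vocabulary-based featurizer would actually keep."""
    return [tok for tok in tokenize(query) if tok in vocabulary]


def unseen_tokens(query: str, vocabulary: Set[str]) -> List[str]:
    """Tokens that are silently ignored because they were never seen in training."""
    return [tok for tok in tokenize(query) if tok not in vocabulary]


# Assumption: the vocabulary comes from the training questions only.
train_question = "What is the name of the president of the USA?"
vocabulary = set(tokenize(train_question))

russia_query = "What is the name of the president of the Russia?"
kept = known_tokens(russia_query, vocabulary)
unseen = unseen_tokens(russia_query, vocabulary)

print(kept)    # everything except "russia" survives
print(unseen)  # the only token the model never saw
```

Under that assumption, the probability alone cannot separate the two queries, whereas the list of unseen tokens can serve as an extra signal that the question may not be covered by the FAQ.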