Trying to understand tfidf

Hi

I trained the FAQ model with the following command and questions/answers

python -m deeppavlov train deeppavlov/configs/faq/tfidf_logreg_en_faq.json

Question,Answer
"aaa aaa?","Answer 1"
"bbb bbb?","Answer 2"
"ccc ccc?","Answer 3"

and test it with

python -m deeppavlov interact deeppavlov/configs/faq/tfidf_logreg_en_faq.json

and receive the following results

q::aaa

('Answer 1', [0.9919848791014834, 0.00400756044925833, 0.00400756044925833])

q::bbb

('Answer 2', [0.00400756044925833, 0.9919848791014834, 0.00400756044925833])

q::ccc

('Answer 3', [0.00400756044925833, 0.00400756044925833, 0.9919848791014834])

which somehow makes sense, but

q::zzz

('Answer 3', [0.3333333333333333, 0.3333333333333333, 0.33333333333333337])

I would expect all values to be zero, but I assume this is just how the algorithm works.
Is there any non-code documentation on how the algorithm works?

Also please see my related question some time ago

Thanks for your help

Michael

Hey @michaelwechner, thank you very much for your interest.

The output of the model is a probability distribution over the answers. Since the probabilities always sum to 1, they can never all be zero. That is why you get roughly [0.33, 0.33, 0.33] for the last example: the model is equally unsure about all three labels. You can decide whether to accept the top answer by setting a threshold on the maximal probability score.
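To make this a bit more concrete, here is a rough sketch in plain Python. It is not DeepPavlov's code; it only assumes that the classifier ends in a softmax and that a query made of unseen tokens (like zzz) contributes no features, so every class gets the same score. The helper names `softmax` and `answer_or_none` are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def answer_or_none(answers, probs, threshold):
    """Return the most probable answer only if its probability exceeds the threshold."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return answers[best] if probs[best] > threshold else None

answers = ["Answer 1", "Answer 2", "Answer 3"]

# Assumption: an unseen query yields identical scores for every class,
# so the softmax output is uniform (1/3 each) rather than all zeros.
unknown_probs = softmax([0.0, 0.0, 0.0])

# Probabilities reported in the thread for the query "aaa".
known_probs = [0.9919848791014834, 0.00400756044925833, 0.00400756044925833]

print(unknown_probs)
print(answer_or_none(answers, unknown_probs, 0.5))  # rejected: 1/3 is below 0.5
print(answer_or_none(answers, known_probs, 0.5))    # accepted: 0.99 is above 0.5
```

With a threshold of 0.5, the zzz-style query is rejected while the aaa-style query is accepted, which is the behaviour you would want from a "no good match" check.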

Let me know if it’s helpful.

Hi @Vasily

Thanks very much for your feedback!

Yes, that’s what I thought, but I wonder whether there exists a better alternative :slight_smile:

I replaced

    "max_proba": true

with

    "confident_threshold": 0.5

and I would have expected that one still receives

q::zzz

('Answer 3', [0.3333333333333333, 0.3333333333333333, 0.33333333333333337])

because 0.5 > 0.3333333

but instead I received

q::zzz

(, [0.3333333333333333, 0.3333333333333333, 0.33333333333333337])

How does confident_threshold work?

Thanks again

Michael

Only one of the three available options [confident_threshold, max_proba, top_n] is applied, chosen in that order of priority. When you set confident_threshold=0.5, you filter out all candidates with a probability less than or equal to 0.5, which in your case is all of them.
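As a hedged illustration of that description (my own reading of it, not the actual DeepPavlov implementation), the selection logic could look roughly like this. The function name `select_answers` is hypothetical; only the three option names come from the config.

```python
def select_answers(answers, probs, confident_threshold=None, max_proba=False, top_n=None):
    """Pick answers using exactly one option, checked in priority order."""
    # Rank candidates from most to least probable.
    ranked = sorted(zip(answers, probs), key=lambda pair: pair[1], reverse=True)

    if confident_threshold is not None:
        # Keep only candidates strictly above the threshold (may be empty).
        return [a for a, p in ranked if p > confident_threshold]
    if max_proba:
        # Keep the single most probable candidate.
        return [ranked[0][0]]
    if top_n is not None:
        # Keep the n most probable candidates.
        return [a for a, _ in ranked[:top_n]]
    return [a for a, _ in ranked]

answers = ["Answer 1", "Answer 2", "Answer 3"]
zzz_probs = [0.3333333333333333, 0.3333333333333333, 0.33333333333333337]

print(select_answers(answers, zzz_probs, max_proba=True))
print(select_answers(answers, zzz_probs, confident_threshold=0.5))
# The threshold wins even when max_proba is also set.
print(select_answers(answers, zzz_probs, confident_threshold=0.5, max_proba=True))
```

Under these assumptions, max_proba returns 'Answer 3' for zzz (its probability is larger by a floating-point hair), while confident_threshold=0.5 returns nothing, matching the empty answer seen above.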

ah ok, got it :slight_smile: Thanks again!