Hi!
What data are you using for training?
python -m deeppavlov evaluate ner_rus -d
outputs:
{"valid": {"eval_examples_count": 2153, "metrics": {"ner_f1": 95.2828, "ner_token_f1": 97.063}, "time_spent": "0:00:06"}}
{"test": {"eval_examples_count": 1922, "metrics": {"ner_f1": 95.1432, "ner_token_f1": 97.13}, "time_spent": "0:00:05"}}
ner_f1
is measured at the level of entities (the full text span of an entity must match exactly), while
ner_token_f1
is measured at the level of individual tokens.
Here is an example:
# assuming both metrics are importable from deeppavlov.metrics.fmeasure
# (the import path may differ between DeepPavlov versions)
from deeppavlov.metrics.fmeasure import ner_f1, ner_token_f1

y_true = [['B-PER', 'I-PER', 'O']]
y_pred = [['B-PER', 'O', 'O']]
ner_f1(y_true, y_pred)        # 0: the predicted span does not match the true two-token span exactly
ner_token_f1(y_true, y_pred)  # 66.66: one of the two entity tokens is tagged correctly
So, ner_f1 == 0 together with ner_token_f1 != 0
means that no entity was extracted exactly, but some tokens belonging to entities were still tagged correctly.
It is better to check the output of your model on real examples to make sure this is the desired behaviour. You should also change the order of the metrics in the config file to track progress with ner_token_f1
and save the best model with the highest `ner_token_f1`.
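For example, in the model's JSON config the first entry of the metrics list in the train section is the one used to decide when the best model is saved, so listing ner_token_f1 first makes it the tracked metric. A minimal sketch, assuming the standard ner_rus config layout (the other keys of the train section stay unchanged):
"train": {
  "metrics": ["ner_token_f1", "ner_f1"],
  ...
}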