DeepPavlov NER tokenization

We recently found that our own tokenizer works much worse than the built-in NER service’s tokenizer: named entities are detected far more reliably with the latter, and stray commas and other punctuation marks can heavily affect the result.

The only problem I have with the NER tokenizer is that it splits multipart words like ‘экс-президент’ (“ex-president”) or ‘amazon.com’ into three tokens: [‘экс’, ‘-’, ‘президент’] and [‘amazon’, ‘.’, ‘com’] respectively. Likewise, “Barack Obama” is split into [‘Barack’, ‘Obama’]. So further down the pipeline it is unclear whether these tokens should be joined with a space or without one. Is there a way to determine this using built-in methods?

Hi @Graygood,

The older NER models just use nltk.word_tokenize, and the BERT-based ones use a simple regular expression: r"[\w']+|[^\w ]". If you’re happy with how either of these works, you can tokenize your text yourself beforehand and record the span of every token in the original text.
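
If that works for your use case, here is a minimal sketch of the span-based approach, using the regex quoted above with Python’s re.finditer (the helper names tokenize_with_spans and join_tokens are just illustrative, not part of DeepPavlov). Whether two tokens should be joined with or without a space then follows directly from whether their spans are adjacent in the original text:

```python
import re

# The regex quoted above, compiled once. In Python 3, \w is Unicode-aware,
# so Cyrillic tokens like 'экс' are matched as well.
TOKEN_RE = re.compile(r"[\w']+|[^\w ]")

def tokenize_with_spans(text):
    """Return (token, start, end) triples for every regex match."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]

def join_tokens(tokens_with_spans):
    """Rebuild a string from tokens: insert a space only where there is a
    gap between consecutive token spans in the original text."""
    pieces = []
    prev_end = None
    for token, start, end in tokens_with_spans:
        if prev_end is not None and start > prev_end:
            pieces.append(" ")
        pieces.append(token)
        prev_end = end
    return "".join(pieces)

text = "Экс-президент wrote about amazon.com"
spans = tokenize_with_spans(text)
print(spans[:3])          # [('Экс', 0, 3), ('-', 3, 4), ('президент', 4, 13)]
print(join_tokens(spans))  # 'Экс-президент wrote about amazon.com'
```

Since ‘экс’ ends at offset 3 and ‘-’ starts at offset 3, they are glued back together without a space, while tokens separated by whitespace in the source keep their space.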