DeepPavlov NER tokenization

We recently found that our own tokenizer works much worse than the built-in NER service’s tokenizer: named entities are detected far more reliably with the latter, and stray commas and other punctuation marks can heavily affect the result.

The only problem I have with the NER tokenizer is that it splits multipart words like ‘экс-президент’ (“ex-president”) or ‘amazon.com’ into three tokens: [‘экс’, ‘-’, ‘президент’] and [‘amazon’, ‘.’, ‘com’] respectively. Likewise, “Barack Obama” is split into [‘Barack’, ‘Obama’]. So further down the pipeline it is unclear whether these tokens should be joined with a space or without one. Is there a way to determine this using built-in methods?

Hi @Graygood,

The older NER models just use nltk.word_tokenize, and the BERT-based ones use a simple regular expression: r"[\w']+|[^\w ]". If you’re happy with how either of these works, you can tokenize your text yourself beforehand and record the span of every token in the original text.
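
If that works for your use case, here is a minimal sketch of the span-based approach, using the regex quoted above with Python’s re.finditer (the helper names tokenize_with_spans and join_tokens are just illustrative, not part of DeepPavlov). Whether two tokens should be joined with or without a space then follows directly from whether their spans are adjacent in the original text:

```python
import re

# The regex quoted above, compiled once. In Python 3, \w is Unicode-aware,
# so Cyrillic tokens like 'экс' are matched as well.
TOKEN_RE = re.compile(r"[\w']+|[^\w ]")

def tokenize_with_spans(text):
    """Return (token, start, end) triples for every regex match."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]

def join_tokens(tokens_with_spans):
    """Rebuild a string from tokens: insert a space only where there is a
    gap between consecutive token spans in the original text."""
    pieces = []
    prev_end = None
    for token, start, end in tokens_with_spans:
        if prev_end is not None and start > prev_end:
            pieces.append(" ")
        pieces.append(token)
        prev_end = end
    return "".join(pieces)

text = "Экс-президент wrote about amazon.com"
spans = tokenize_with_spans(text)
print(spans[:3])          # [('Экс', 0, 3), ('-', 3, 4), ('президент', 4, 13)]
print(join_tokens(spans))  # 'Экс-президент wrote about amazon.com'
```

Since ‘экс’ ends at offset 3 and ‘-’ starts at offset 3, they are glued back together without a space, while tokens separated by whitespace in the source keep their space.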