We recently found that our own tokenizer performs much worse than the one built into the NER service. With the built-in tokenizer, named entities are detected much more reliably, and the presence of commas and other punctuation marks in the text can heavily affect the result.
The only problem I have with the NER tokenizer is that it splits multipart words such as ‘экс-президент’ (Russian for ‘ex-president’) or ‘amazon.com’ into three tokens: [‘экс’, ‘-’, ‘президент’] and [‘amazon’, ‘.’, ‘com’] respectively. Likewise, “Barack Obama” is split into [‘Barack’, ‘Obama’]. So later in the pipeline it is unclear whether these tokens should be joined with a space or without one. Is there a way to determine this using built-in methods?
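To make the problem concrete, here is a minimal sketch of a library-independent fallback: locate each token in the original string and record whether whitespace follows it. It assumes the tokenizer returns tokens as unmodified substrings of the input, in order; if it normalizes characters (quotes, case), the lookup will fail. The names `align_tokens` and `rejoin` are hypothetical helpers, not part of any NER library.

```python
from typing import List, Tuple


def align_tokens(text: str, tokens: List[str]) -> List[Tuple[str, bool]]:
    """Pair each token with a flag: was it followed by whitespace in `text`?

    Assumes every token is a verbatim substring of `text`, in order.
    """
    pairs = []
    cursor = 0  # search position; only moves forward
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)
        space_after = end < len(text) and text[end].isspace()
        pairs.append((tok, space_after))
        cursor = end
    return pairs


def rejoin(pairs: List[Tuple[str, bool]]) -> str:
    """Concatenate tokens, inserting a single space only where one was seen."""
    return "".join(tok + (" " if space_after else "") for tok, space_after in pairs)


text = "экс-президент Barack Obama visited amazon.com"
tokens = ["экс", "-", "президент", "Barack", "Obama", "visited", "amazon", ".", "com"]
pairs = align_tokens(text, tokens)
print(rejoin(pairs))
```

If the NER service already reports character offsets for each token, comparing the end offset of one token with the start offset of the next gives the same information without the string search.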