Fine-Tuning Multi-lang NER for a specific language


I want to use the multi-lang NER model to extract Turkish time expressions. The algorithm doesn’t recognize every kind of expression I want, though. For example, it recognizes ‘Today’ and ‘Heute’, but not the Turkish equivalent ‘bugün’, as TIME. What could be the reason for this?

Would fine-tuning also work in this case, using a dozen annotated Turkish sentences for my missing cases, even though it’s a zero-shot transfer model? How should I proceed with this?

Thank you!


I would say that the main reason is the absence of supervision on the target language.

Fine-tuning on data from the target distribution might help. However, it is really hard to say what the possible gains from fine-tuning are. A further problem in this case is the sparsity of time expressions: they are an order of magnitude less frequent than, for instance, person entities, and their absolute frequency in the original corpus is only around 0.1%. Assuming your data of interest follows the same distribution, a balanced markup would require annotating around 1,000 non-time tokens for each time-expression token. Marking up only a few sentences containing time expressions will likely bias the network toward many false-positive TIME predictions.
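A quick back-of-the-envelope check of the imbalance described above (the 0.1% figure is taken from the post; the calculation itself is just arithmetic):

```python
# Assumption from the post: TIME tokens make up ~0.1% of all tokens.
time_token_freq = 0.001

# Non-TIME tokens needed per TIME token to match the corpus's natural
# class balance in your annotated fine-tuning data:
non_time_per_time = (1 - time_token_freq) / time_token_freq
print(round(non_time_per_time))  # ~999, i.e. on the order of 1000
```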

Thank you for your fast answer! Do you think it would make sense to take a split of the original dataset and annotate only time expressions (i.e., build a temporal tagger)? I could balance this dataset with a 2/3 split of non-time expressions. After training on this English split, I could fine-tune with far less data from the target language, right? Or am I missing something?

I think the best way is to run experiments and see what works better. It is hard to say whether the proposed approach will or won’t yield any gain.

Okay thank you! One more basic question: How can I fine-tune the multilingual NER model?

I guess I first have to load it with:

from deeppavlov import build_model, configs
ner_model = build_model(configs.ner.ner_ontonotes_bert_mult, download=False)  # after having it downloaded

and then run a train command with my accordingly annotated data defined in the config file, but how exactly? How can I set this up as a fine-tuning step rather than training a new model on only my data?

Thank you!

This topic was discussed a few weeks ago in the thread Strange fine-tuned NER model behavior. Basically, just remove the fit_on line in the vocabulary part of the config and replace the data path with your own. Training and the data format are described in the docs.
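As a concrete sketch of those two config edits, assuming the DeepPavlov API: the keys below (dataset_reader, chainer, tag_vocab) follow the structure of the ner_ontonotes_bert_mult config, but may differ between library versions, and the data path is a hypothetical placeholder.

```python
def prepare_finetune_config(config, data_path):
    """Adapt a pretrained NER config for fine-tuning on new data."""
    # Point the dataset reader at your own annotated data
    # (CoNLL-2003-style train.txt / valid.txt / test.txt files).
    config["dataset_reader"]["data_path"] = data_path
    # Drop `fit_on` from the tag vocabulary so it is loaded from the
    # pretrained model instead of being rebuilt from the small new dataset.
    for component in config["chainer"]["pipe"]:
        if component.get("id") == "tag_vocab":
            component.pop("fit_on", None)
    return config

# Usage with DeepPavlov installed (hypothetical data path):
# from deeppavlov import configs, train_model
# from deeppavlov.core.common.file import read_json
# config = prepare_finetune_config(
#     read_json(configs.ner.ner_ontonotes_bert_mult), "~/my_turkish_ner_data/")
# ner_model = train_model(config, download=True)  # continues from pretrained weights
```

Because the pretrained weights and tag vocabulary are loaded rather than re-initialized, train_model continues training the existing model on your data instead of fitting a new one from scratch.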
