Best size of context text?

Hi!
Thank you, this is a very interesting project. I have played around with model_qa_ml = build_model(configs.squad.squad_bert_multilingual_freezed_emb, download=True) for questions (in Swedish) based on my attached (Swedish) context text :). Can you say something about the best size/length for the context text? If I have a long document of, say, 220 pages, should I split it into parts and loop the questions over each part separately, or run them on the whole original text? I have tried both, but I don't know whether I am hitting some size limit. What split size is "best"? Something like 512 words, or 5000 characters?
All the Best, Kalle

Hi!

squad_bert_multilingual_freezed_emb uses 384 as the maximum sequence length in subtokens. You can change this parameter, but BERT-based models support at most 512 subtokens.
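
For reference, a minimal sketch of querying the model, reusing the build_model call from the question above. The three-part return value (answers, start positions, logit scores) matches the logit mentioned later in this thread, but verify the exact output structure against your installed DeepPavlov version:

```python
# Minimal sketch: querying the multilingual SQuAD model from the question.
# The (answers, start positions, logits) unpacking is an assumption based
# on this thread; check it against your DeepPavlov version.
from deeppavlov import build_model, configs

model_qa_ml = build_model(configs.squad.squad_bert_multilingual_freezed_emb,
                          download=True)

context = "Stockholm är Sveriges huvudstad."  # "Stockholm is the capital of Sweden."
question = "Vad är Sveriges huvudstad?"       # "What is the capital of Sweden?"

# The model takes batches (lists) of contexts and questions. Anything in a
# context beyond the 384-subtoken limit is truncated rather than answered over.
answers, start_positions, logits = model_qa_ml([context], [question])
print(answers[0], logits[0])
```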
For long texts we have configuration files with the _infer suffix in the name, e.g. squad_bert_infer.json; you can modify it to use multilingual BERT instead. In the _infer setup we split long texts into chunks and choose the best answer across all chunks.
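
A rough sketch of that chunk-and-pick-best idea (not the actual _infer implementation): the words_per_chunk and overlap values below are illustrative assumptions, chosen so that a chunk plus the question stays under the 384-subtoken limit at a typical 1.5-2 subtokens per word for Swedish text:

```python
# Sketch of manual chunking with overlap, then keeping the answer with the
# highest logit across chunks. Chunk sizes are assumptions, not the values
# used in squad_bert_infer.json.

def chunk_text(text, words_per_chunk=150, overlap=30):
    """Split text into overlapping word-based chunks so an answer that
    spans a chunk boundary still appears whole in some chunk."""
    words = text.split()
    step = words_per_chunk - overlap
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, max(len(words) - overlap, 1), step)]

def answer_long_text(model, text, question):
    """Run the QA model on every chunk and keep the highest-logit answer."""
    chunks = chunk_text(text)
    answers, _, logits = model(chunks, [question] * len(chunks))
    best = max(range(len(chunks)), key=lambda i: logits[i])
    return answers[best], logits[best]
```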

Thank you for your excellent reply. I did the splitting myself (good to know the 384 number; that works out to roughly 384/2 words, I guess), and when selecting the best answer I picked the one with the highest returned numeric value (logit), although that is not always the best :wink: . I will see if I can try the _infer setup as well, to compare. Thank you again for the answer and the useful library.
