Best size of context text?

Hi!
Thank you, this is a very interesting project. I have played around with model_qa_ml = build_model(configs.squad.squad_bert_multilingual_freezed_emb, download=True) for questions (in Swedish) based on my attached (Swedish) context text :). Can you say something about the best size/length for the context text? If I have a long document of, say, 220 pages, should I split it into parts and loop the questions over each part separately, or run them on the whole original text? I have tried both, but I don't know whether I am hitting some size limit. What split size is "best"? Something like 512 words, or 5000 characters?
All the Best, Kalle

Hi!

squad_bert_multilingual_freezed_emb uses 384 as the maximum sequence length in subtokens. You can change this parameter, but BERT-based models support at most 512 subtokens.
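
For reference, a minimal sketch of querying the model, reusing the build_model call from the question above. The three-part return value (answers, start positions, logit scores) matches the logit mentioned later in this thread, but verify the exact output structure against your installed DeepPavlov version:

```python
# Minimal sketch: querying the multilingual SQuAD model from the question.
# The (answers, start positions, logits) unpacking is an assumption based
# on this thread; check it against your DeepPavlov version.
from deeppavlov import build_model, configs

model_qa_ml = build_model(configs.squad.squad_bert_multilingual_freezed_emb,
                          download=True)

context = "Stockholm är Sveriges huvudstad."  # "Stockholm is the capital of Sweden."
question = "Vad är Sveriges huvudstad?"       # "What is the capital of Sweden?"

# The model takes batches (lists) of contexts and questions. Anything in a
# context beyond the 384-subtoken limit is truncated rather than answered over.
answers, start_positions, logits = model_qa_ml([context], [question])
print(answers[0], logits[0])
```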
For long texts we have configuration files with the _infer suffix in the name, e.g. squad_bert_infer.json; you can modify it to use multilingual BERT instead. In the _infer setup we split long texts into chunks and choose the best answer across all chunks.
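
A rough sketch of that chunk-and-pick-best idea (not the actual _infer implementation): the words_per_chunk and overlap values below are illustrative assumptions, chosen so that a chunk plus the question stays under the 384-subtoken limit at a typical 1.5-2 subtokens per word for Swedish text:

```python
# Sketch of manual chunking with overlap, then keeping the answer with the
# highest logit across chunks. Chunk sizes are assumptions, not the values
# used in squad_bert_infer.json.

def chunk_text(text, words_per_chunk=150, overlap=30):
    """Split text into overlapping word-based chunks so an answer that
    spans a chunk boundary still appears whole in some chunk."""
    words = text.split()
    step = words_per_chunk - overlap
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, max(len(words) - overlap, 1), step)]

def answer_long_text(model, text, question):
    """Run the QA model on every chunk and keep the highest-logit answer."""
    chunks = chunk_text(text)
    answers, _, logits = model(chunks, [question] * len(chunks))
    best = max(range(len(chunks)), key=lambda i: logits[i])
    return answers[best], logits[best]
```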

Thank you for your excellent reply. I did the splitting myself (good to know the 384 number; that works out to roughly 384/2 words, I guess), and when selecting the best answer I picked the one with the highest returned numeric value (logit), although that is not always the best :wink: . I will see if I can try the _infer setup as well, to compare. Thank you again for the answer and the useful library.
