Reproducibility of model training

kuraga · November 27, 2020, 8:21pm

Good day!

Is model training going to be reproducible?

Take BERT on RuSentiment, for example. OK, we have the deepmipt/bert repository. But is there code (say, a script for a CI/CD system) that produces the exact files listed in the download section, assuming we have the input data?

The same goes for the other parts. Specifically, how do I get the vocabulary file?

Thanks!

yurakuratov · December 4, 2020, 4:30pm

Hi!

All the files needed for a specific config file can be downloaded with the command:

python -m deeppavlov download config_name

This command downloads everything listed in the configuration’s download section.

The requirements (deepmipt/bert) can be installed with:

python -m deeppavlov install config_name

In the case of “BERT on RuSentiment”, download will get the pre-trained MultilingualBERT model (the first link) and the parameters of the model fine-tuned on RuSentiment data (the second link).

The vocabulary file is part of MultilingualBERT.
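
To see what the download command would fetch for a given config, the download section can be read directly from the config JSON. The snippet below is only a rough sketch: it assumes the download entries sit under metadata.download and are either plain URL strings or objects with a "url" and an optional "subdir". The config content and the example.com URLs are made up for illustration and are not the real RuSentiment config.

```python
import json

# Hypothetical, trimmed-down config for illustration only.
# Real DeepPavlov configs contain many more keys, and the URLs differ.
EXAMPLE_CONFIG = json.loads("""
{
  "metadata": {
    "download": [
      {"url": "http://example.com/multi_cased_bert.tar.gz",
       "subdir": "{DOWNLOADS_PATH}/bert_models"},
      {"url": "http://example.com/rusentiment_bert.tar.gz",
       "subdir": "{MODELS_PATH}"}
    ]
  }
}
""")


def list_downloads(config):
    """Return (url, subdir) pairs from the config's metadata.download section."""
    entries = config.get("metadata", {}).get("download", [])
    result = []
    for entry in entries:
        # Assumption: an entry is either a plain URL string
        # or an object with "url" and an optional "subdir".
        if isinstance(entry, str):
            result.append((entry, None))
        else:
            result.append((entry["url"], entry.get("subdir")))
    return result


if __name__ == "__main__":
    for url, subdir in list_downloads(EXAMPLE_CONFIG):
        print(url, "->", subdir)
```

For a real config, load the JSON file from disk with json.load and pass the resulting dict to list_downloads.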

kuraga · December 4, 2020, 4:48pm

@yurakuratov Yuriy, thanks! But that’s not quite what I was asking.

download will get pre-trained MultilingualBERT model

OK, but can I build this pre-trained model myself (i.e. pre-train it from scratch)? Can I do it with DeepPavlov?

If not: how did you get it? To be exact: with which code and which steps?

yurakuratov · December 4, 2020, 6:28pm

The MultilingualBERT model was pre-trained by Google, and we re-use it.

If you want to train your own BERT model from scratch on the MLM and NSP tasks, I would recommend using the original BERT repo or our fork with multi-GPU support and following the instructions in the readme file.

kuraga · December 4, 2020, 7:21pm

our fork with multi-gpu support and follow instructions in readme file.

Are there plans to make such steps for such models programmatic/reproducible?

Would it be a good feature request or is it out of DeepPavlov’s scope?

yurakuratov · December 5, 2020, 3:28pm

Are there plans to make such steps for such models programmatic/reproducible?

Most of the available BERT models (English, Multilingual, …) were not pre-trained by DeepPavlov; we just use them as-is. We cannot provide a way to reproduce them.

Also, we don’t have a one-command solution for BERT pre-training, and I’m not sure it is on our roadmap. But all the necessary steps for pre-training are known and described:

1. Collect data. This step is user-dependent.
2. Build the vocabulary. Here is the code that we use for vocabulary building: https://github.com/deepmipt/bert/tree/feat/build_vocab_scripts/scripts
3 & 4. Preprocessing and pre-training. They are well described in the original BERT instructions.

We followed these steps when we trained RuBERT, Conversational BERT, and Conversational RuBERT.
Some details on how we trained RuBERT were discussed in GitHub issue #1074 of deeppavlov/DeepPavlov, “What parameters of RuBert training?”.
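
Steps 3 and 4 could be scripted roughly as follows. This is only a sketch, not a DeepPavlov feature: it just builds the command lines for create_pretraining_data.py and run_pretraining.py from the original BERT repo so that a CI job could run them with fixed settings. The flag names follow the original BERT readme; every path and numeric value here is a placeholder, not the settings used for RuBERT.

```python
import shlex

# Placeholder values; the two scripts must agree on both of these.
MAX_SEQ_LENGTH = 128
MAX_PREDICTIONS_PER_SEQ = 20


def preprocessing_cmd(input_file, output_file, vocab_file, seed=12345):
    """Command for step 3: turn raw text into TFRecord training examples."""
    return [
        "python", "create_pretraining_data.py",
        f"--input_file={input_file}",
        f"--output_file={output_file}",
        f"--vocab_file={vocab_file}",
        "--do_lower_case=False",
        f"--max_seq_length={MAX_SEQ_LENGTH}",
        f"--max_predictions_per_seq={MAX_PREDICTIONS_PER_SEQ}",
        "--masked_lm_prob=0.15",
        f"--random_seed={seed}",  # fixed seed so masking is repeatable
        "--dupe_factor=5",
    ]


def pretraining_cmd(input_file, output_dir, bert_config_file, num_train_steps):
    """Command for step 4: pre-train on the MLM and NSP tasks."""
    return [
        "python", "run_pretraining.py",
        f"--input_file={input_file}",
        f"--output_dir={output_dir}",
        "--do_train=True",
        f"--bert_config_file={bert_config_file}",
        "--train_batch_size=32",
        f"--max_seq_length={MAX_SEQ_LENGTH}",
        f"--max_predictions_per_seq={MAX_PREDICTIONS_PER_SEQ}",
        f"--num_train_steps={num_train_steps}",
        "--num_warmup_steps=10",
        "--learning_rate=1e-4",
    ]


def as_shell(cmd):
    """Render an argument list as a single shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    # Hypothetical paths; run these from a checkout of the BERT repo.
    print(as_shell(preprocessing_cmd("corpus.txt", "examples.tfrecord", "vocab.txt")))
    print(as_shell(pretraining_cmd("examples.tfrecord", "pretraining_output",
                                   "bert_config.json", num_train_steps=20)))
```

Fixing the seed only makes the preprocessing step repeatable; the resulting checkpoint may still differ from run to run depending on hardware and library versions.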
