Reproducibility of model training

Good day!

Is model training going to be reproducible?

Say, BERT on RuSentiment. OK, we have the deepmipt/bert repository. But is there code (say, a script for a CI/CD system) to get the exact files from the download section (assuming we already have the input data)?

And the other parts. Specifically, how do I get the vocabulary file?

Thanks!

Hi!

All necessary files for a specific config can be downloaded with the command:

python -m deeppavlov download config_name

This command downloads everything listed in the configuration’s download section.

The requirements (deepmipt/bert) can be installed with:

python -m deeppavlov install config_name
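
Putting the two commands together for the RuSentiment case, here is a minimal sketch. The config name rusentiment_bert and the default target directory are assumptions on my side; check deeppavlov/configs/classifiers for the exact config name.

# sketch only: config name and paths below are assumptions, not verified output
python -m deeppavlov install rusentiment_bert    # installs the config's pip requirements
python -m deeppavlov download rusentiment_bert   # fetches every file from the config's download section
# downloaded files should land under ~/.deeppavlov/ by default (downloads/ and models/ subdirectories)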

In the case of “BERT on RuSentiment”, download will fetch the pre-trained MultilingualBERT model (the first link) and the parameters of the model fine-tuned on RuSentiment data (the second link).

The vocabulary file is part of the MultilingualBERT archive.
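
For reference, a rough sketch of what that archive contains, based on Google’s multi_cased_L-12_H-768_A-12 release (the file names are an assumption; verify against your actual download):

unzip -l multi_cased_L-12_H-768_A-12.zip
# bert_config.json                     <- model hyperparameters
# bert_model.ckpt.data-00000-of-00001  <- TensorFlow checkpoint (plus .index and .meta files)
# vocab.txt                            <- the WordPiece vocabulary file used by the tokenizer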

@yurakuratov Yuriy, thanks! But that’s not the topic of the question.

download will fetch the pre-trained MultilingualBERT model

OK, but can I build this pre-trained model myself (i.e. pre-train it from scratch)? Can I do it with DeepPavlov?

If not: how did you get it? Specifically: with which code and steps?

The MultilingualBERT model was pre-trained by Google, and we re-use it.

If you want to train your own BERT model from scratch on the MLM and NSP tasks, I would recommend using the original BERT repo or our fork with multi-GPU support and following the instructions in the README file.
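
For orientation, a minimal sketch of the two starting points mentioned above; the actual pre-training steps live in each repository’s README:

git clone https://github.com/google-research/bert   # original BERT pre-training code
git clone https://github.com/deepmipt/bert          # DeepPavlov fork with multi-GPU support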

our fork with multi-GPU support and following the instructions in the README file.

Are there plans to make such steps programmatic/reproducible for such models?

Would that be a good feature request, or is it out of DeepPavlov’s scope?

Are there plans to make such steps programmatic/reproducible for such models?

Most of the BERT models that are available (English, Multilingual, …) were not pre-trained by DeepPavlov; we just use them as-is. We cannot provide a way to reproduce them.

Also, we don’t have a one-command solution for BERT pre-training, and I’m not sure it is on our roadmap. But all the necessary steps for pre-training are known and described:

  1. Collect data. This step is user dependent.
  2. Build the vocabulary. Here is the code that we use for vocabulary building: https://github.com/deepmipt/bert/tree/feat/build_vocab_scripts/scripts
  3. & 4. Preprocessing and pre-training. They are well described in the original BERT instructions (see the sketch after this list).
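
For steps 3 and 4, here is a minimal sketch following the flags documented in the google-research/bert README. All paths, file names, and hyperparameter values below are placeholders, not a verified recipe:

# Step 3: turn raw text into masked-LM / next-sentence-prediction TFRecords
python create_pretraining_data.py \
  --input_file=./corpus.txt \
  --output_file=./tf_examples.tfrecord \
  --vocab_file=./vocab.txt \
  --do_lower_case=False \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5

# Step 4: pre-train from scratch (omit --init_checkpoint so weights start from random initialization)
python run_pretraining.py \
  --input_file=./tf_examples.tfrecord \
  --output_dir=./pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4

Sequence length, batch size, and number of steps depend on your corpus and hardware; the README of either repository is the authoritative reference.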

We followed these steps when we trained RuBERT, Conversational BERT, and Conversational RuBERT.
Some details on how we trained RuBERT were discussed here: What parameters of RuBert training? · Issue #1074 · deeppavlov/DeepPavlov · GitHub