Integrating custom BERT model and training model with csv dataset

Hi

I am trying to use a different pretrained model with DeepPavlov, and I have made the required changes in the config file.

Now I want to train it on my own .csv dataset, but I am unable to do so.

I have changed the dataset reader in the config file.

I can't find any train.csv or valid.csv after running the training code from a notebook.

Any help is welcome. My error is below:

WARNING in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 214: Got empty data iterable for scoring
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
1 from deeppavlov import train_model, configs
2
----> 3 model = train_model('E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\configs\squad\squad_biobert.json', download=False)

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\__init__.py in train_model(config, download, recursive)
     27 # TODO: make better
     28 def train_model(config: [str, Path, dict], download: bool = False, recursive: bool = False) -> Chainer:
---> 29     train_evaluate_model_from_config(config, download=download, recursive=recursive)
     30     return build_model(config, load_trained=True)
     31 

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\commands\train.py in train_evaluate_model_from_config(config, iterator, to_train, evaluation_targets, to_validate, download, start_epoch_num, recursive)
    119 
    120     if to_train:
--> 121         trainer.train(iterator)
    122 
    123     res = {}

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\trainers\nn_trainer.py in train(self, iterator)
    335         if callable(getattr(self._chainer, 'train_on_batch', None)):
    336             try:
--> 337                 self.train_on_batches(iterator)
    338             except KeyboardInterrupt:
    339                 log.info('Stopped training')

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\trainers\nn_trainer.py in train_on_batches(self, iterator)
    275         self.start_time = time.time()
    276         if self.validate_first:
--> 277             self._validate(iterator)
    278 
    279         while True:

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\trainers\nn_trainer.py in _validate(self, iterator, tensorboard_tag, tensorboard_index)
    173         report['train_examples_seen'] = self.examples
    174 
--> 175         metrics = list(report['metrics'].items())
    176 
    177         if tensorboard_tag is not None and self.tensorboard_log_dir is not None:

AttributeError: 'NoneType' object has no attribute 'items'


Thank you!

Hi!

Could you also provide the configuration file that you are using?

Hi thank you for your reply!

Sure I can copy my config file here!

{
  "dataset_reader": {
    "class_name": "BasicClassificationDatasetReader",
    "format": "csv",
    "sep": ",",
    "header": 0,
    "names": ["text", "labels"],
    "class_sep": ",",
    "train": "covid19_articles.csv",
    "data_path": "{DOWNLOADS_PATH}/biosquad/covid19_articles",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": ["train", "valid"],
    "split_proportions": [0.9, 0.1]
  },
  "dataset_iterator": {
    "class_name": "BasicClassificationDatasetIterator",
    "seed": 1337,
    "shuffle": true
  },
  "chainer": {
    "in": ["context_raw", "question_raw"],
    "in_y": ["ans_raw", "ans_raw_start"],
    "pipe": [
      {
        "class_name": "bert_preprocessor",
        "vocab_file": "{DOWNLOADS_PATH}/biobert_models/cased_L-12_H-768_A-12/vocab.txt",
        "do_lower_case": false,
        "max_seq_length": 384,
        "in": ["question_raw", "context_raw"],
        "out": ["bert_features"]
      },
      {
        "class_name": "squad_bert_mapping",
        "do_lower_case": false,
        "in": ["context_raw", "bert_features"],
        "out": ["subtok2chars", "char2subtoks"]
      },
      {
        "class_name": "squad_bert_ans_preprocessor",
        "do_lower_case": false,
        "in": ["ans_raw", "ans_raw_start", "char2subtoks"],
        "out": ["ans", "ans_start", "ans_end"]
      },
      {
        "class_name": "squad_bert_model",
        "bert_config_file": "{DOWNLOADS_PATH}/biobert_models/cased_L-12_H-768_A-12/config.json",
        "pretrained_bert": "{DOWNLOADS_PATH}/biobert_models/cased_L-12_H-768_A-12/model.ckpt",
        "save_path": "{MODELS_PATH}/squad_biobert/model",
        "load_path": "{MODELS_PATH}/squad_biobert/model",
        "keep_prob": 0.5,
        "learning_rate": 2e-05,
        "learning_rate_drop_patience": 2,
        "learning_rate_drop_div": 2.0,
        "in": ["bert_features"],
        "in_y": ["ans_start", "ans_end"],
        "out": ["ans_start_predicted", "ans_end_predicted", "logits"]
      },
      {
        "class_name": "squad_bert_ans_postprocessor",
        "in": ["ans_start_predicted", "ans_end_predicted", "context_raw", "bert_features", "subtok2chars"],
        "out": ["ans_predicted", "ans_start_predicted", "ans_end_predicted"]
      }
    ],
    "out": ["ans_predicted", "ans_start_predicted", "logits"]
  },
  "train": {
    "show_examples": false,
    "test_best": false,
    "validate_best": true,
    "log_every_n_batches": 250,
    "val_every_n_batches": 500,
    "batch_size": 10,
    "pytest_max_batches": 2,
    "pytest_batch_size": 5,
    "validation_patience": 10,
    "evaluation_targets": ["train", "valid"],
    "metrics": ["accuracy"],
    "tensorboard_log_dir": "{MODELS_PATH}/squad_biobert/logs"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "requirements": [
      "{DEEPPAVLOV_PATH}/requirements/tf.txt",
      "{DEEPPAVLOV_PATH}/requirements/bert_dp.txt"
    ]
  }
}

I would like to add here that the error I mentioned earlier in my post is kind of resolved now. This particular config file is running fine now.

But I want to ask 2 short questions:

  1. Have I configured the config file properly?
  2. Also, this config file answers questions from a given context, but I want it to answer my questions from the dataset I have loaded into it. Can I have your thoughts on this, and which references should I look at to make such changes?
  1. Have I configured the config file properly?

I am not sure that the config is set up correctly, for these reasons:

  1. Only the text field from the csv file is used. Is it a paragraph and a question together?
  2. The labels field should be consistent with in_y: ["ans_raw", "ans_raw_start"].

I would suggest converting your .csv dataset into the same format that is used for SQuAD and using the SQuAD dataset readers and iterators. Here is the link to download the dataset.

  2. Also this config file answers questions from a given context. but I want it to answer my question from the dataset I have loaded in it. Can I have your thoughts on it, what references should I try to make such changes.

This setting is called open domain question answering (ODQA). Models for ODQA usually consist of two parts: document retrieval (e.g., by TF-IDF) and answer extraction (e.g., with a model trained on the SQuAD dataset). Take a look at "Open Domain Question Answering Skill on Wikipedia" in the DeepPavlov 0.14.0 documentation.
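To make the two-part idea concrete, here is a minimal standard-library sketch of the retrieval half only. It is not DeepPavlov's ranker: the function names (tokenize, build_index, retrieve) and the smoothed IDF formula are assumptions chosen for illustration, and the answer-extraction half is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return (idf, vectors): smoothed IDF weights and one TF-IDF dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = [
        {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for toks in tokenized
    ]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, docs, idf, vectors, top_k=1):
    """Return the top_k documents most similar to the question."""
    counts = Counter(tokenize(question))
    # Words that never occur in the collection carry no signal and are dropped.
    q_vec = {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}
    order = sorted(range(len(docs)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return [docs[i] for i in order[:top_k]]
```

The retrieved text would then be passed, together with the question, to a SQuAD-style reader that extracts the answer span.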

Thank you for your response.

Regarding your question about the 'text' field: it is just a column that contains headlines from news channels about COVID-19.

I am trying to plug in an external model (BioBERT). I chose the squad_bert config file as the starting point for my changes.

Can you suggest some best practices for doing this? Also, as you know, I am using a .csv dataset along with the external model.

The squad_bert model should be trained on (text, question) pairs as X and answers as Y.
I can't see where the questions in your dataset come from. It would be better to convert your .csv dataset into the SQuAD json format.

All the changes regarding the BERT model are done correctly.
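As a rough illustration of that conversion, the sketch below builds a SQuAD v1.1-style structure from csv rows. It assumes the csv already has columns named context, question and answer, which is not the case for the dataset described in this thread; those column names, the function names and the "csv_import" title are all hypothetical.

```python
import csv
import json


def rows_to_squad(rows, version="1.1"):
    """Build a SQuAD v1.1-style dict from rows holding context, question and answer."""
    paragraphs = []
    for i, row in enumerate(rows):
        context, question, answer = row["context"], row["question"], row["answer"]
        # SQuAD stores the character offset of the answer inside the context.
        start = context.find(answer)
        if not answer or start == -1:
            continue  # skip rows whose answer is not a literal span of the context
        paragraphs.append({
            "context": context,
            "qas": [{
                "id": str(i),
                "question": question,
                "answers": [{"text": answer, "answer_start": start}],
            }],
        })
    return {"version": version, "data": [{"title": "csv_import", "paragraphs": paragraphs}]}


def convert_csv(csv_path, json_path):
    """Read a csv file with a header row and write the converted json file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        squad = rows_to_squad(csv.DictReader(f))
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(squad, f, ensure_ascii=False)
    return squad
```

Rows whose answer cannot be found verbatim in the context are dropped, because an extractive model needs an exact answer span.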

Thank you for guiding me patiently!

Yes, I now understand that I need to have questions in my dataset.

I am trying to give a question as input, and in return I want the model to search for the answer in the given dataset.

But I guess I need to either change my dataset format or switch to another config file.

Oh, I see. You have a collection of texts and you want to build a system that will look for an answer in this collection.

In this case (there are no questions and answers to train the model on) you should take a look at ODQA.
@Vasily might have more references for building your own ODQA system.

The next step to improve the question-answering component could be training BioBERT on the regular SQuAD dataset and integrating it into the ODQA system.


Hey @Shafaq ,

please take a look at the article "Open-domain question answering with DeepPavlov" by Vasily Konovalov (DeepPavlov, Medium) and let me know if you have any questions.

I am overwhelmed by your response and I am glad that you understand my problem now. Yes, I had a look at ODQA. I have a few questions about it. I will ask Vasily!
Thank you for your time!

Hi Vasily,

Thank you for your time.

I actually read this article, it is amazingly put together.

Kudos to you!

To start with, I am asking about the first few points that caught my attention and that I see as limitations for my situation:

1. The article states:
The dataset_reader section of the ranker's configuration defines the source of the articles. The source can be of the following dataset_format:
wiki, txt and .json files.
My dataset is in .csv format. What limitations am I likely to face?

2. It says:
Both models require about 24 GB of RAM.
(Google Colab is also of no use here.) If my machine does not meet this requirement, am I unable to use this model? If so, is there an alternative (online) platform that can be used?

Best Regards

@Shafaq Thank you very much for your feedback!

  1. In the case of csv, I would consider converting it into txt or json; currently we don't have an implementation for csv. You can find the details here: DeepPavlov/odqa_reader.py at b66179e584d3eb6da73c5731ba7b732dab7e94bd · deepmipt/DeepPavlov · GitHub

  2. The specific requirement depends on how large your database (the csv file) is. The Wikipedia-based model requires 24 GB of RAM; I believe your model will require less.

Please let me know if you need further assistance.

Okay, great! I need to ask:

1. Can you point me to a reference that may help me convert my .csv data to .txt? What format should it have?

2. I want to check whether the root path in the config file needs to be changed when used in a Windows environment.
"metadata": {
"variables": {
"ROOT_PATH": "~/.deeppavlov", (does this need to be changed in a Windows environment, or is it fine?)
"DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
"MODELS_PATH": "{ROOT_PATH}/models"

  1. Separate your articles into txt files and put them into a directory, then train the reader model just like in the example 3-2.py · GitHub

  2. I would recommend using the full path.
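For point 1, splitting the csv into one txt file per article might look like the sketch below. It assumes the articles sit in a column named text (the name used in the dataset_reader section posted earlier in this thread); the function name and the article_NNNNN.txt naming scheme are made up for illustration.

```python
import csv
from pathlib import Path


def csv_to_txt_files(csv_path, out_dir, text_column="text"):
    """Write each non-empty value of text_column to its own .txt file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows that have no article text
            target = out / f"article_{i:05d}.txt"
            target.write_text(text, encoding="utf-8")
            written.append(target)
    return written
```

The resulting directory can then be used as the data path of a txt-based reader.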

Okay, thank you!
I am running into this weird error: the command is unable to pick up the file. Can you point to possible reasons why it isn't able to find the file, even though the file has been copied and is present at the exact location?

Hi @yurakuratov
I hope you are doing well. As you suggested, I am training the BioBERT model on the SQuAD dataset. I made the required changes in the config file, but the training is taking too long.

On a low-power machine it has been running for 15 hours and is still not responding, while on another machine it ran but gave the following error (a resource issue).

I’ll be waiting to hear from you!

While waiting, I am trying to:

1. use another dataset for training;
2. make a subset of the SQuAD dataset;
3. pass optional training parameters like this:
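from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Load the config as a dict so that its training options can be overridden
# (assumed here: swap configs.squad.squad_bert for the path to your own config).
squadbert_config = read_json(configs.squad.squad_bert)
squadbert_config['train']['batch_size'] = 4  # set batch size
squadbert_config['train']['max_batches'] = 30  # maximum number of training batches
squadbert_config['train']['val_every_n_batches'] = 30
squadbert_config['train']['log_every_n_batches'] = 5
model = train_model(squadbert_config)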
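For option 2 in the list above, cutting a subset out of a SQuAD-format file could be sketched as follows. The function name squad_subset is hypothetical, and the file names that the DeepPavlov SQuAD reader expects are not shown in this thread, so the sketch works on an already loaded dict.

```python
def squad_subset(squad, max_questions):
    """Return a copy of a SQuAD-style dict that keeps at most max_questions questions."""
    subset = {"version": squad.get("version", "1.1"), "data": []}
    remaining = max_questions
    for article in squad["data"]:
        if remaining <= 0:
            break
        paragraphs = []
        for para in article["paragraphs"]:
            if remaining <= 0:
                break
            qas = para["qas"][:remaining]  # take only as many questions as still allowed
            remaining -= len(qas)
            if qas:
                paragraphs.append({"context": para["context"], "qas": qas})
        if paragraphs:
            subset["data"].append({"title": article.get("title", ""), "paragraphs": paragraphs})
    return subset
```

The input can be read with json.load and the result written back with json.dump.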

We trained such models (built on top of BERT-base) for SQuAD on GPUs with >11 GB of memory (NVIDIA 1080Ti, P100). With the default configuration file from the library, it takes <=6h to train the model.

Without a GPU, training will take an order of magnitude more time.

The error in the last screenshot says that training with the current batch_size does not fit into GPU memory.
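One way to act on this is to retry with progressively smaller batch sizes. The helpers below are only a sketch operating on a plain config dict; their names are hypothetical, and the actual training call is left out.

```python
import copy


def with_batch_size(config, batch_size):
    """Return a deep copy of the config dict with train.batch_size replaced."""
    cfg = copy.deepcopy(config)
    cfg.setdefault("train", {})["batch_size"] = batch_size
    return cfg


def candidate_batch_sizes(start, minimum=1):
    """Yield start, start // 2, start // 4, ... down to minimum (never below 1)."""
    minimum = max(1, minimum)
    size = start
    while size >= minimum:
        yield size
        size //= 2
```

Each candidate config can be passed to train_model in turn until one fits. Lowering max_seq_length in the bert_preprocessor section (384 in the config above) also reduces memory use, at the cost of truncating longer contexts.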