Integrating custom BERT model and training model with csv dataset

Shafaq · January 5, 2021, 4:54pm

Hi

I am trying to use another pretrained model with DeepPavlov. I have made the required changes in the config file.

Now I want to integrate it with my .csv dataset, but I am unable to do so.

I have changed the dataset reader in the config file.

I can't find any train.csv or valid.csv after running the training code from the notebook.

Any help is welcome. My error is below:

WARNING in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 214: Got empty data iterable for scoring
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
1 from deeppavlov import train_model, configs
2
----> 3 model = train_model('E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\configs\squad\squad_biobert.json', download=False)

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\__init__.py in train_model(config, download, recursive)
     27 # TODO: make better
     28 def train_model(config: [str, Path, dict], download: bool = False, recursive: bool = False) -> Chainer:
---> 29     train_evaluate_model_from_config(config, download=download, recursive=recursive)
     30     return build_model(config, load_trained=True)
     31 

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\commands\train.py in train_evaluate_model_from_config(config, iterator, to_train, evaluation_targets, to_validate, download, start_epoch_num, recursive)
    119 
    120     if to_train:
--> 121         trainer.train(iterator)
    122 
    123     res = {}

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\trainers\nn_trainer.py in train(self, iterator)
    335         if callable(getattr(self._chainer, 'train_on_batch', None)):
    336             try:
--> 337                 self.train_on_batches(iterator)
    338             except KeyboardInterrupt:
    339                 log.info('Stopped training')

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\trainers\nn_trainer.py in train_on_batches(self, iterator)
    275         self.start_time = time.time()
    276         if self.validate_first:
--> 277             self._validate(iterator)
    278 
    279         while True:

E:\anaconda\envs\deeppavlov\lib\site-packages\deeppavlov\core\trainers\nn_trainer.py in _validate(self, iterator, tensorboard_tag, tensorboard_index)
    173         report['train_examples_seen'] = self.examples
    174 
--> 175         metrics = list(report['metrics'].items())
    176 
    177         if tensorboard_tag is not None and self.tensorboard_log_dir is not None:

AttributeError: 'NoneType' object has no attribute 'items'



Thank you!

yurakuratov · January 13, 2021, 12:25pm

Hi!

Could you also provide configuration file that you are using?

Shafaq · January 13, 2021, 12:41pm

Hi, thank you for your reply!

Sure, I can copy my config file here:

{
  "dataset_reader": {
    "class_name": "BasicClassificationDatasetReader",
    "format": "csv",
    "sep": ",",
    "header": 0,
    "names": [
      "text",
      "labels"
    ],
    "class_sep": ",",
    "train": "covid19_articles.csv",
    "data_path": "{DOWNLOADS_PATH}/biosquad/covid19_articles",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "dataset_iterator": {
    "class_name": "BasicClassificationDatasetIterator",
    "seed": 1337,
    "shuffle": true
  },
  "chainer": {
    "in": ["context_raw", "question_raw"],
    "in_y": ["ans_raw", "ans_raw_start"],
    "pipe": [
      {
        "class_name": "bert_preprocessor",
        "vocab_file": "{DOWNLOADS_PATH}/biobert_models/cased_L-12_H-768_A-12/vocab.txt",
        "do_lower_case": false,
        "max_seq_length": 384,
        "in": ["question_raw", "context_raw"],
        "out": ["bert_features"]
      },
      {
        "class_name": "squad_bert_mapping",
        "do_lower_case": false,
        "in": ["context_raw", "bert_features"],
        "out": ["subtok2chars", "char2subtoks"]
      },
      {
        "class_name": "squad_bert_ans_preprocessor",
        "do_lower_case": false,
        "in": ["ans_raw", "ans_raw_start", "char2subtoks"],
        "out": ["ans", "ans_start", "ans_end"]
      },
      {
        "class_name": "squad_bert_model",
        "bert_config_file": "{DOWNLOADS_PATH}/biobert_models/cased_L-12_H-768_A-12/config.json",
        "pretrained_bert": "{DOWNLOADS_PATH}/biobert_models/cased_L-12_H-768_A-12/model.ckpt",
        "save_path": "{MODELS_PATH}/squad_biobert/model",
        "load_path": "{MODELS_PATH}/squad_biobert/model",
        "keep_prob": 0.5,
        "learning_rate": 2e-05,
        "learning_rate_drop_patience": 2,
        "learning_rate_drop_div": 2.0,
        "in": ["bert_features"],
        "in_y": ["ans_start", "ans_end"],
        "out": ["ans_start_predicted", "ans_end_predicted", "logits"]
      },
      {
        "class_name": "squad_bert_ans_postprocessor",
        "in": ["ans_start_predicted", "ans_end_predicted", "context_raw", "bert_features", "subtok2chars"],
        "out": ["ans_predicted", "ans_start_predicted", "ans_end_predicted"]
      }
    ],
    "out": ["ans_predicted", "ans_start_predicted", "logits"]
  },
  "train": {
    "show_examples": false,
    "test_best": false,
    "validate_best": true,
    "log_every_n_batches": 250,
    "val_every_n_batches": 500,
    "batch_size": 10,
    "pytest_max_batches": 2,
    "pytest_batch_size": 5,
    "validation_patience": 10,
    "evaluation_targets": [
      "train",
      "valid"
    ],
    "metrics": ["accuracy"],
    "tensorboard_log_dir": "{MODELS_PATH}/squad_biobert/logs"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "requirements": [
      "{DEEPPAVLOV_PATH}/requirements/tf.txt",
      "{DEEPPAVLOV_PATH}/requirements/bert_dp.txt"
    ]
  }
}

Shafaq · January 13, 2021, 12:50pm

I would like to add that the error I mentioned earlier in my post is more or less resolved now. This particular config file runs fine.

But I want to ask two short questions:

Have I configured the config file properly?
This config file answers questions from a given context, but I want it to answer my questions from the dataset I have loaded. What are your thoughts, and what references should I look at to make such changes?

yurakuratov · January 14, 2021, 10:04am

Have I configured the config file properly?

I am not sure that the config is set up correctly, for two reasons:

Only the text field from the csv file is used. Is it a paragraph and a question together?
The labels field should be consistent with in_y: ["ans_raw", "ans_raw_start"].

I would suggest converting your .csv dataset into the same format that is used for SQuAD and using the SQuAD dataset readers and iterators. Here is the link to download the dataset.

Also this config file answers questions from a given context. but I want it to answer my question from the dataset I have loaded in it. Can I have your thoughts on it, what references should I try to make such changes.

This setting is called open-domain question answering (ODQA). Models for ODQA usually consist of two parts: document retrieval (e.g., by TF-IDF) and answer extraction (e.g., with a model trained on the SQuAD dataset). Take a look at http://docs.deeppavlov.ai/en/master/features/skills/odqa.html
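
To illustrate only the document retrieval half, here is a minimal sketch of TF-IDF ranking in plain Python. It is not DeepPavlov's ranker: the function names (tokenize, build_index, retrieve) and the smoothed IDF formula are assumptions made for this example. The answer extraction half would then run a SQuAD-style model on the retrieved text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Return one TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    # Smoothed IDF (an assumption of this sketch), always positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    vectors, idf = build_index(documents)
    q_counts = Counter(tokenize(question))
    # Terms that never occur in the collection are ignored.
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scores = [cosine(q_vec, v) for v in vectors]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:top_k]]
```

For a collection of headlines, retrieve("When do vaccine trials begin?", headlines) would return the headline sharing the most informative words with the question. A real system would build the index once instead of on every call.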

Shafaq · January 14, 2021, 10:31am

Thank you for your response.

Regarding your question about the ‘text’ field, it is just a column containing headlines from news channels about COVID-19.

I am trying to plug in an external model (BioBERT). I chose the squad_bert config file as the starting point for my changes.

Can you suggest some best practices for this? Also, as you know, I am using a .csv dataset along with the external model.

yurakuratov · January 18, 2021, 9:07am

The squad_bert model should be trained on pairs of (text, question) as X and answers as Y.
I can’t see where the questions in your dataset come from. It would be better to convert your .csv dataset into the SQuAD json format.

All changes regarding the BERT model are done right.
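
A minimal sketch of the csv to SQuAD json conversion, under a strong assumption: the csv has hypothetical columns named context, question and answer, which the dataset described in this thread does not have yet (they would need to be written first). The function name csv_to_squad and the default title are also made up for this example. Rows whose answer does not occur verbatim in the context are skipped, because SQuAD needs the character offset of the answer.

```python
import csv
import json


def csv_to_squad(csv_path, json_path, title="covid19_articles"):
    """Convert a csv with context/question/answer columns to SQuAD v1.1 json.

    The column names are hypothetical; adjust them to the actual file.
    """
    paragraphs = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            context = row["context"]
            question = row["question"]
            answer = row["answer"]
            # SQuAD stores the character offset of the answer in the context.
            start = context.find(answer)
            if not answer or start == -1:
                continue  # the answer is not a span of the context
            paragraphs.append({
                "context": context,
                "qas": [{
                    "id": f"{title}-{i}",
                    "question": question,
                    "answers": [{"text": answer, "answer_start": start}],
                }],
            })
    squad = {"version": "1.1", "data": [{"title": title, "paragraphs": paragraphs}]}
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(squad, f, ensure_ascii=False, indent=2)
    return squad
```

The resulting file has the same layout as the SQuAD training file (data, paragraphs, qas, answers), so it should be readable by a SQuAD-style dataset reader, though that still needs to be checked against the reader actually used.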

Shafaq · January 18, 2021, 9:54am

Thank you for guiding me patiently!

Yes, I now understand that I need to have questions in my dataset.

I am trying to give a question as input and have the model search for the answer in the given dataset.

But I guess I need to either change my dataset format or switch to another config file.

yurakuratov · January 18, 2021, 12:37pm

Oh, I see. You have a collection of texts and you want to build a system that will look for an answer in this collection.

In this case (there are no questions and answers to train the model on) you should take a look at ODQA.
@Vasily might have more references for building your own ODQA system.

The next step to improve the question-answering component could be training BioBERT on the regular SQuAD dataset and integrating it into the ODQA system.

Vasily · January 18, 2021, 12:42pm

Hey @Shafaq ,

please take a look at this article, "Open-domain question answering with DeepPavlov" by Vasily Konovalov (DeepPavlov, Medium), and let me know if you have any questions.

Shafaq · January 18, 2021, 12:52pm

I am overwhelmed by your response and I am glad that you understand my problem now. Yes, I had a look at ODQA. I have a few questions about it. I will ask Vasily!
Thank you for your time!

Shafaq · January 18, 2021, 1:09pm

Hi Vasily,

Thank you for your time.

I actually read this article; it is amazingly well put together.

Kudos to you!

To start, I am asking about the first few points that caught my attention and that I see as limitations in my situation:

1-The article states:
The dataset_reader section of the ranker’s configuration defines the source of the articles. The source can be of the following dataset_format:
wiki, txt and .json files.
My dataset is in .csv format. What limitations am I likely to face?

2-It says:
Both models require about 24 GB of RAM. (Google Colab is also of no use.)
If my machine does not meet this requirement, am I unable to use this model? If so, is there an alternative (online) platform that can be used?

Best Regards

Vasily · January 19, 2021, 10:34am

@Shafaq Thank you very much for your feedback!

In the case of csv I would consider converting it into txt or json; currently we don’t have an implementation for csv. You can find the details here: DeepPavlov/odqa_reader.py at b66179e584d3eb6da73c5731ba7b732dab7e94bd (deepmipt/DeepPavlov, GitHub)
The specific requirement depends on how large your database (the csv file) is. The Wikipedia-based model requires 24 GB of RAM; I believe your model will require less.

Please let me know if you need any further assistance.

Shafaq · January 19, 2021, 5:58pm

Okay, great! I need to ask:

1-Can you point to a reference that may help me convert my .csv file to .txt? What format should it have?

2-I want to check whether the root path in the config file needs to be changed when it is used in a Windows environment:
"metadata": {
"variables": {
"ROOT_PATH": "~/.deeppavlov",
"DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
"MODELS_PATH": "{ROOT_PATH}/models"
Does ROOT_PATH need to be changed in a Windows environment, or is it fine as it is?

Vasily · January 20, 2021, 8:34am

Separate your articles into txt files, put them into a directory, then train the reader model just like in the example 3-2.py (GitHub).
I would recommend using the full path.
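
Both suggestions can be sketched in a few lines of Python. This is only an illustration under assumptions: the csv has a header row with a text column (as in the config posted earlier), and the helper names csv_to_txt_files and full_root_path as well as the article_00000.txt naming scheme are made up for this example.

```python
import csv
from pathlib import Path


def csv_to_txt_files(csv_path, out_dir, text_column="text"):
    """Write each non-empty value of text_column to its own .txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows without any text
            path = out_dir / f"article_{i:05d}.txt"
            path.write_text(text, encoding="utf-8")
            written.append(path)
    return written


def full_root_path(root="~/.deeppavlov"):
    """Expand '~' to the user's home directory, giving a full path."""
    return str(Path(root).expanduser())
```

The directory produced by csv_to_txt_files is what the ranker's dataset_reader would be pointed at, with the txt dataset_format mentioned in the article. The string returned by full_root_path() could replace "~/.deeppavlov" as the value of ROOT_PATH in the config.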

Shafaq · January 21, 2021, 5:59pm

Okay, thank you!
I am running into this weird error: the command is unable to pick up the file. Can you point to possible reasons why it can’t find the file, even though the file has been copied and is present at the exact location?

Shafaq · January 26, 2021, 1:53pm

Hi @yurakuratov
I hope you are doing well. As you suggested, I am training the BioBERT model on the SQuAD dataset. I made the required changes in the config file, but the training is taking too long.

On a low-power machine it has been running for 15 hours and is still not responding, while on another machine it ran but gave the following error (a resource issue).

I’ll be waiting to hear from you!

While waiting, I am trying to:

1-use another dataset for training.
2-make a subset of the SQuAD dataset.
3-pass optional training parameters like this:
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Load the config as a dict (the stock squad_bert config here; use the path to your own config instead)
squadbert_config = read_json(configs.squad.squad_bert)
squadbert_config['train']['batch_size'] = 4  # set batch size
squadbert_config['train']['max_batches'] = 30  # maximum number of training batches
squadbert_config['train']['val_every_n_batches'] = 30
squadbert_config['train']['log_every_n_batches'] = 5
model = train_model(squadbert_config)

yurakuratov · February 3, 2021, 10:49am

We trained such models (built on top of BERT-base) for SQuAD on GPUs with >11 GB of memory (NVIDIA 1080Ti, P100). With the default configuration file from the library it takes <=6h to train the model.

Without a GPU, training will take an order of magnitude more time.

The error in the last screenshot says that training with the current batch_size does not fit into GPU memory.
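
On limited hardware, a subset of SQuAD (as considered earlier in the thread) shortens each training pass, although it does not reduce the memory needed per batch. A minimal sketch of sampling a fixed number of questions from a SQuAD-format dict; the function name subsample_squad and its parameters are made up for this example, and it relies only on the standard SQuAD layout (data, paragraphs, qas).

```python
import random


def subsample_squad(squad, max_questions, seed=42):
    """Return a new SQuAD-format dict with at most max_questions questions."""
    rng = random.Random(seed)
    # Flatten the nested structure into (title, context, qa) items.
    items = [
        (article.get("title", ""), paragraph["context"], qa)
        for article in squad["data"]
        for paragraph in article["paragraphs"]
        for qa in paragraph["qas"]
    ]
    rng.shuffle(items)
    kept = items[:max_questions]
    # Regroup the kept questions by article title and context.
    grouped = {}
    for title, context, qa in kept:
        grouped.setdefault(title, {}).setdefault(context, []).append(qa)
    data = [
        {
            "title": title,
            "paragraphs": [{"context": c, "qas": qas} for c, qas in contexts.items()],
        }
        for title, contexts in grouped.items()
    ]
    return {"version": squad.get("version", "1.1"), "data": data}
```

The input dict can be loaded from the SQuAD json file with json.load, and the result written back with json.dump to a separate file for training. A model trained on a small subset will usually be weaker than one trained on the full dataset, so this is mainly useful for checking that the pipeline runs end to end.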

Anthony · May 11, 2023, 12:14pm

I did not think that I would find the answer to my problem here, thanks guys)
