OOM error while training BioBERT on the SQuAD dataset

Runtime: 32 GB RAM, 8-core CPU, 8 GB GPU

Environment: DeepPavlov = 0.14.0, tensorflow-gpu = 1.15.2, CUDA = 11.2, Ubuntu = 20.04.1 LTS, Python = 3.7

Question: Is this a known error? How can I avoid or resolve it?

Error log (truncated):

2021-01-25 07:20:55.515007: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 30720 totalling 30.0KiB
2021-01-25 07:20:55.515015: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7 Chunks of size 1572864 totalling 10.50MiB
2021-01-25 07:20:55.515025: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 332 Chunks of size 2359296 totalling 747.00MiB
2021-01-25 07:20:55.515033: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3 Chunks of size 4194304 totalling 12.00MiB
2021-01-25 07:20:55.515043: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 2 Chunks of size 4456448 totalling 8.50MiB
2021-01-25 07:20:55.515052: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 5898240 totalling 5.62MiB
2021-01-25 07:20:55.515061: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 167 Chunks of size 9437184 totalling 1.47GiB
2021-01-25 07:20:55.515070: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 11116544 totalling 10.60MiB
2021-01-25 07:20:55.515079: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 54 Chunks of size 11796480 totalling 607.50MiB
2021-01-25 07:20:55.515089: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 18296832 totalling 17.45MiB
2021-01-25 07:20:55.515098: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 16 Chunks of size 47185920 totalling 720.00MiB
2021-01-25 07:20:55.515108: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 11 Chunks of size 70778880 totalling 742.50MiB
2021-01-25 07:20:55.515117: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 89075712 totalling 339.80MiB
2021-01-25 07:20:55.515127: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 91004928 totalling 86.79MiB
2021-01-25 07:20:55.515136: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 117374976 totalling 111.94MiB
2021-01-25 07:20:55.515145: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 121289728 totalling 115.67MiB
2021-01-25 07:20:55.515154: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 134217728 totalling 128.00MiB
2021-01-25 07:20:55.515163: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 5.05GiB
2021-01-25 07:20:55.515173: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 5444665344 memory_limit_: 5570428928 available bytes: 125763584 curr_region_allocation_bytes_: 8589934592
2021-01-25 07:20:55.515184: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 5570428928
InUse: 5421348864
MaxInUse: 5421358080
NumAllocs: 553759
MaxAllocSize: 134217728

2021-01-25 07:20:55.515243: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************
2021-01-25 07:20:55.515807: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at softmax_op_gpu.cu.cc:162 : Resource exhausted: OOM when allocating tensor with shape[10,12,384,384] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[10,12,384,384] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node bert/encoder/layer_4/attention/self/Softmax}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[loss/Mean/_891]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[10,12,384,384] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node bert/encoder/layer_4/attention/self/Softmax}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "a.py", line 4, in <module>
    model=train_model('/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/configs/squad/squad_bert.json', download=False)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/__init__.py", line 29, in train_model
    train_evaluate_model_from_config(config, download=download, recursive=recursive)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/commands/train.py", line 121, in train_evaluate_model_from_config
    trainer.train(iterator)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 337, in train
    self.train_on_batches(iterator)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 283, in train_on_batches
    self.last_result = self._chainer.train_on_batch(x, y_true)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/common/chainer.py", line 169, in train_on_batch
    return component.train_on_batch(*preprocessed)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/models/tf_backend.py", line 28, in _wrapped
    return func(*args, **kwargs)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/models/bert/bert_squad.py", line 243, in train_on_batch
    _, loss = self.sess.run([self.train_op, self.loss], feed_dict=feed_dict)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[10,12,384,384] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node bert/encoder/layer_4/attention/self/Softmax (defined at /home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[loss/Mean/_891]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[10,12,384,384] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node bert/encoder/layer_4/attention/self/Softmax (defined at /home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'bert/encoder/layer_4/attention/self/Softmax':
  File "a.py", line 4, in <module>
    model=train_model('/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/configs/squad/squad_bert.json', download=False)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/__init__.py", line 29, in train_model
    train_evaluate_model_from_config(config, download=download, recursive=recursive)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/commands/train.py", line 121, in train_evaluate_model_from_config
    trainer.train(iterator)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 334, in train
    self.fit_chainer(iterator)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/trainers/fit_trainer.py", line 104, in fit_chainer
    component = from_params(component_config, mode='train')
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/common/params.py", line 106, in from_params
    component = obj(**dict(config_params, **kwargs))
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/models/tf_backend.py", line 76, in __call__
    obj.__init__(*args, **kwargs)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/models/tf_backend.py", line 28, in _wrapped
    return func(*args, **kwargs)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/models/bert/bert_squad.py", line 85, in __init__
    self._init_graph()
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/models/tf_backend.py", line 28, in _wrapped
    return func(*args, **kwargs)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/models/bert/bert_squad.py", line 117, in _init_graph
    use_one_hot_embeddings=False,
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/bert_dp/modeling.py", line 223, in __init__
    do_return_all_layers=True)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/bert_dp/modeling.py", line 853, in transformer_model
    to_seq_length=seq_length)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/bert_dp/modeling.py", line 729, in attention_layer
    attention_probs = tf.nn.softmax(attention_scores)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2958, in softmax
    return _softmax(logits, gen_nn_ops.softmax, axis, name)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2891, in _softmax
    return compute_op(logits, name=name)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 11376, in softmax
    "Softmax", logits=logits, name=name)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Hi!

The problem could be with batch_size; you can decrease it so the model fits into your GPU memory.
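For context, the shape in the OOM message, [10, 12, 384, 384], is the attention-probability tensor of a single BERT layer: batch size 10 x 12 heads x 384 x 384 tokens. In float32 that is 10*12*384*384*4 bytes, roughly 67.5 MiB, which matches the 70778880-byte chunks in the allocator dump. Since this memory grows linearly with batch size, a minimal sketch of overriding it before training looks like the following (the commented max_seq_length override is an assumption about where the preprocessor sits in this config's pipeline, not confirmed against squad_bert.json):

from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

config = read_json(configs.squad.squad_bert)

# Attention memory grows linearly with batch size, so halve it until training fits.
config['train']['batch_size'] = 4

# It also grows quadratically with sequence length; if bert_preprocessor is the
# first pipeline component (an assumption), lowering max_seq_length helps even more:
# config['chainer']['pipe'][0]['max_seq_length'] = 256

train_model(config, download=False)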

I reduced the number of batches and the batch size, and got this output:

python train.py
2021-01-26 16:01:08.488 WARNING in 'deeppavlov.core.commands.train'['train'] at line 108: "validate_best" and "test_best" parameters are deprecated. Please, use "evaluation_targets" list instead
2021-01-26 16:01:08.490 INFO in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 68: NNTrainer got additional init parameters ['pytest_max_batches', 'pytest_batch_size'] that will be ignored:
WARNING:tensorflow:From /home/openaimp/anaconda3/envs/bot/lib/python3.7/site-packages/deeppavlov/core/trainers/nn_trainer.py:150: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

[nltk_data] Error loading punkt: <urlopen error [Errno 101] Network is
[nltk_data] unreachable>
[nltk_data] Error loading stopwords: <urlopen error [Errno 101]
[nltk_data] Network is unreachable>
[nltk_data] Error loading perluniprops: <urlopen error [Errno 101]
[nltk_data] Network is unreachable>
[nltk_data] Error loading nonbreaking_prefixes: <urlopen error [Errno
[nltk_data] 101] Network is unreachable>

Here is the code:

from deeppavlov import configs, train_model, build_model
from deeppavlov.core.common.file import read_json

squadbert_config = read_json(configs.squad.squad_bert)

squadbert_config['train']['batch_size'] = 2            # reduced batch size
squadbert_config['train']['max_batches'] = 10          # maximum number of training batches
squadbert_config['train']['val_every_n_batches'] = 10  # validate on the 'valid' split every 10 batches
squadbert_config['train']['log_every_n_batches'] = 2   # log 'train' metrics every 2 batches

train_model(squadbert_config)

bot = build_model(squadbert_config)

bot(['Coronavirus disease 2019 (COVID-19) is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first case was identified in Wuhan, China, in December 2019. It has since spread worldwide, leading to an ongoing pandemic.'],
    ['What is Coronavirus?'])
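
Note that the nltk_data errors in the output above are download failures (this machine has no network access) and are unrelated to the OOM. A minimal workaround sketch, assuming you can fetch the resources on a machine with internet access and copy them over (NLTK looks in ~/nltk_data by default, or wherever the NLTK_DATA environment variable points):

import nltk

# Run once on a machine with network access, then copy ~/nltk_data to the
# offline machine (or set NLTK_DATA to the directory you copied it to).
for resource in ('punkt', 'stopwords', 'perluniprops', 'nonbreaking_prefixes'):
    nltk.download(resource)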