Error in training multilingual NER with own data

Hi, I am opening this topic about the issue linked here. I set up my data with my own tags and modified the NER config file by following the referenced issues. It looks like this:

import json
import os

from deeppavlov import configs, train_model

with configs.ner.ner_ontonotes_bert_mult.open(encoding='utf8') as f:
    ner_config = json.load(f)

ner_config['dataset_reader']['data_path'] = ner_path
# directory with train.txt, valid.txt and test.txt files

ner_config['metadata']['variables']['NER_PATH'] = os.path.join(os.path.dirname(ner_path), 'my_saved_model')
ner_config['metadata']['download'] = [
    ner_config['metadata']['download'][-1]]  # do not download the pretrained ontonotes model

# make bert_sequence_tagger return probabilities and rename its
# output from y_pred_ind to y_pred, then remove the last tag_vocab
# component from the pipeline
ner_config['chainer']['pipe'][2]['return_probas'] = True
ner_config['chainer']['pipe'][2]['out'] = ['y_pred']
ner_config['chainer']['pipe'].pop()

However, when I try to train the model by doing ner_model = train_model(ner_config, download=True), I get the following error:

In [1]: ner_model = train_model(ner_config, download=True)
…:
2020-05-11 12:01:53.574 INFO in 'deeppavlov.download'['download'] at line 117: Skipped http://files.deeppavlov.ai/deeppavlov_data/bert/multi_cased_L-12_H-768_A-12.zip download because of matching hashes
2020-05-11 12:01:53.669 INFO in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 68: NNTrainer got additional init parameters ['pytest_max_batches', 'pytest_batch_size'] that will be ignored:
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py:149: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/bert_dp/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

2020-05-11 12:01:56.960 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 101: [saving vocabulary to /Users/paulagomezduran/Desktop/DeepPavlov/datasets/cameras/my_saved_model/tag.dict]
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:37: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:222: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:222: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:193: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/models/bert/bert_sequence_tagger.py:236: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2020-05-11 12:01:56.990754: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-11 12:01:57.005979: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb4d62ecd70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-11 12:01:57.006047: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/models/bert/bert_sequence_tagger.py:314: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/bert_dp/modeling.py:178: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/bert_dp/modeling.py:418: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/bert_dp/modeling.py:499: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/bert_dp/modeling.py:366: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/bert_dp/modeling.py:680: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.__call__ method instead.
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/bert_dp/modeling.py:283: The name tf.erf is deprecated. Please use tf.math.erf instead.

WARNING:tensorflow:Variable *= will be deprecated. Use var.assign(var * other) if you want assignment to the variable value or x = x * y if you want a new python Tensor object.
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/models/bert/bert_sequence_tagger.py:75: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/tensorflow_core/contrib/crf/python/ops/crf.py:213: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use keras.layers.RNN(cell), which is equivalent to this API
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:234: The name tf.train.AdadeltaOptimizer is deprecated. Please use tf.compat.v1.train.AdadeltaOptimizer instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:131: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:131: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/models/tf_model.py:94: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/tensorflow_core/python/training/moving_averages.py:433: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/models/bert/bert_sequence_tagger.py:671: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/models/bert/bert_sequence_tagger.py:244: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/models/bert/bert_sequence_tagger.py:249: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-05-11 12:02:47.971 INFO in 'deeppavlov.models.bert.bert_sequence_tagger'['bert_sequence_tagger'] at line 251: [initializing model with Bert from /Users/paulagomezduran/.deeppavlov/downloads/bert_models/multi_cased_L-12_H-768_A-12/bert_model.ckpt]
WARNING:tensorflow:From /Users/paulagomezduran/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/models/bert/bert_sequence_tagger.py:255: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

INFO:tensorflow:Restoring parameters from /Users/paulagomezduran/.deeppavlov/downloads/bert_models/multi_cased_L-12_H-768_A-12/bert_model.ckpt
2020-05-11 12:03:00.445 WARNING in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 214: Got empty data iterable for scoring

AttributeError                            Traceback (most recent call last)
~/Desktop/DeepPavlov/NER_cameras.py in <module>
----> 1 ner_model = train_model(ner_config, download=True)

~/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/__init__.py in train_model(config, download, recursive)
     30 # TODO: make better
     31 def train_model(config: [str, Path, dict], download: bool = False, recursive: bool = False) -> Chainer:
---> 32     train_evaluate_model_from_config(config, download=download, recursive=recursive)
     33     return build_model(config, load_trained=True)
     34

~/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/commands/train.py in train_evaluate_model_from_config(config, iterator, to_train, evaluation_targets, to_validate, download, start_epoch_num, recursive)
    119
    120     if to_train:
--> 121         trainer.train(iterator)
    122
    123     res = {}

~/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py in train(self, iterator)
    334         if callable(getattr(self._chainer, 'train_on_batch', None)):
    335             try:
--> 336                 self.train_on_batches(iterator)
    337             except KeyboardInterrupt:
    338                 log.info('Stopped training')

~/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py in train_on_batches(self, iterator)
    274         self.start_time = time.time()
    275         if self.validate_first:
--> 276             self._validate(iterator)
    277
    278         while True:

~/Workspace/virtualenvs/research/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py in _validate(self, iterator, tensorboard_tag, tensorboard_index)
    172         report['train_examples_seen'] = self.examples
    173
--> 174         metrics = list(report['metrics'].items())
    175
    176         if tensorboard_tag is not None and self.tensorboard_log_dir is not None:

AttributeError: 'NoneType' object has no attribute 'items'

Could you help me please?
Thank you so much!

Hi @paulagd,

Sorry for the late reply.

I think that this block of code is to blame:

ner_config['chainer']['pipe'][2]['return_probas'] = True
ner_config['chainer']['pipe'][2]['out'] = ['y_pred']
ner_config['chainer']['pipe'].pop()

Since the metrics are supposed to be calculated on predicted tags and not their probabilities, they break.
I would comment this block out for the training process and then turn it back on for inference.
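
To avoid commenting lines in and out, one option is to keep the training config untouched and derive a separate inference config from it. This is only a minimal sketch: the helper name make_inference_config is made up for illustration, the small train_config dict is a stand-in for the real config, and it assumes bert_sequence_tagger sits at index 2 of the pipe, as in the snippet in this thread.

```python
import copy


def make_inference_config(train_config):
    """Return a copy of the training config that outputs tag probabilities."""
    config = copy.deepcopy(train_config)  # leave the training config untouched
    pipe = config['chainer']['pipe']
    tagger = pipe[2]                      # assumed position of bert_sequence_tagger
    tagger['return_probas'] = True
    tagger['out'] = ['y_pred']
    pipe.pop()                            # drop the final tag_vocab lookup
    return config


# Tiny stand-in for the real config, just to show the effect.
train_config = {'chainer': {'pipe': [
    {'class_name': 'bert_ner_preprocessor'},
    {'id': 'tag_vocab', 'class_name': 'simple_vocab'},
    {'class_name': 'bert_sequence_tagger', 'return_probas': False,
     'out': ['y_pred_ind']},
    {'ref': 'tag_vocab', 'in': ['y_pred_ind'], 'out': ['y_pred']},
]}}

inference_config = make_inference_config(train_config)
print(len(train_config['chainer']['pipe']), len(inference_config['chainer']['pipe']))
```

The idea would be to pass the unmodified config to train_model and the derived one to build_model afterwards, so the metrics still see predicted tags during training.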

Hi, I have done that and I am still getting the same error. I am training with ner_model = train_model(ner_config, download=True). Is download=True the right flag?

2020-05-19 10:52:40.996 WARNING in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 214: Got empty data iterable for scoring
Traceback (most recent call last):
  File "NER_cameras.py", line 43, in <module>
    ner_model = train_model(ner_config, download=True)
  File "/research/lib/python3.6/site-packages/deeppavlov/__init__.py", line 32, in train_model
    train_evaluate_model_from_config(config, download=download, recursive=recursive)
  File "research/lib/python3.6/site-packages/deeppavlov/core/commands/train.py", line 121, in train_evaluate_model_from_config
    trainer.train(iterator)
  File "research/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 336, in train
    self.train_on_batches(iterator)
  File "research/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 276, in train_on_batches
    self._validate(iterator)
  File "research/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 174, in _validate
    metrics = list(report['metrics'].items())
AttributeError: 'NoneType' object has no attribute 'items'

I think the problem is related to fit_trainer, but I don't know what it is. After my modifications, the NER configuration looks like this:

{'dataset_reader': 
{
'class_name': 'conll2003_reader',
'data_path': 'datasets/cameras/my_data',
'dataset_name': 'ontonotes',
'provide_pos': False
},
'dataset_iterator': {'class_name': 'data_learning_iterator'},
'chainer': {
'in': ['x'],
'in_y': ['y'],
'pipe': 
[{'class_name': 'bert_ner_preprocessor',
'vocab_file': '{BERT_PATH}/vocab.txt',
'do_lower_case': False,
'max_seq_length': 512,
'max_subword_length': 15,
'token_masking_prob': 0.0,
'in': ['x'],
'out': ['x_tokens',
 'x_subword_tokens',
 'x_subword_tok_ids',
 'startofword_markers',
 'attention_mask'
]},
{'id': 'tag_vocab',
'class_name': 'simple_vocab',
'unk_token': ['O'],
'pad_with_zeros': True,
'save_path': '{NER_PATH}/tag.dict',
'load_path': '{NER_PATH}/tag.dict',
'fit_on': ['y'],
'in': ['y'],
'out': ['y_ind']
},
{'class_name': 'bert_sequence_tagger',
'n_tags': '#tag_vocab.len',
'keep_prob': 0.1,
'bert_config_file': '{BERT_PATH}/bert_config.json',
'pretrained_bert': '{BERT_PATH}/bert_model.ckpt',
'attention_probs_keep_prob': 0.5,
'use_crf': True,
'return_probas': False,
'ema_decay': 0.9,
'encoder_layer_ids': [-1],
'weight_decay_rate': 1e-06,
'learning_rate': 0.01,
'bert_learning_rate': 2e-05,
'min_learning_rate': 1e-07,
'learning_rate_drop_patience': 30,
'learning_rate_drop_div': 1.5,
'load_before_drop': False,
'clip_norm': 1.0,
'save_path': '{NER_PATH}/model',
'load_path': '{NER_PATH}/model',
'in': ['x_subword_tok_ids', 'attention_mask', 'startofword_markers'],
'in_y': ['y_ind'],
'out': ['y_pred_ind']},
{'ref': 'tag_vocab', 'in': ['y_pred_ind'], 'out': ['y_pred']}],
'out': ['x_tokens', 'y_pred']},
'train': {'epochs': 30,
'batch_size': 16,
'metrics': [{'name': 'ner_f1', 'inputs': ['y', 'y_pred']},
{'name': 'ner_token_f1', 'inputs': ['y', 'y_pred']}],
'validation_patience': 100,
'val_every_n_batches': 20,
'log_every_n_batches': 20,
'tensorboard_log_dir': '{NER_PATH}/logs',
'pytest_max_batches': 2,
'pytest_batch_size': 8,
'show_examples': False,
'evaluation_targets': ['valid', 'test'],
'class_name': 'nn_trainer'},
'metadata': {'variables': {'ROOT_PATH': '~/.deeppavlov',
'DOWNLOADS_PATH': '{ROOT_PATH}/downloads',
'MODELS_PATH': '{ROOT_PATH}/models',
'BERT_PATH': '{DOWNLOADS_PATH}/bert_models/multi_cased_L-12_H-768_A-12',
'NER_PATH': 'datasets/cameras/my_saved_model'},
'requirements': ['{DEEPPAVLOV_PATH}/requirements/tf.txt',
'{DEEPPAVLOV_PATH}/requirements/bert_dp.txt'],
'download': [{'url': 'http://files.deeppavlov.ai/deeppavlov_data/bert/multi_cased_L-12_H-768_A-12.zip',
 'subdir': '{DOWNLOADS_PATH}/bert_models'}]}}

Do you know what I am doing wrong? The vocab file is generated correctly, but something fails in the evaluation step.

Thank you so much!

There is a "Got empty data iterable for scoring" message in the log right before it breaks.
Maybe the valid.txt file is empty or missing?

No, they are not empty, I have double-checked!

The file has to be called valid.txt, not val.txt.
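
A quick check before training can catch this kind of naming problem. The sketch below only uses the standard library; the function name check_ner_data_dir is made up for illustration, and the three file names are the ones mentioned earlier in this thread.

```python
from pathlib import Path

# Split files expected in the data directory (names taken from this thread).
EXPECTED_FILES = ('train.txt', 'valid.txt', 'test.txt')


def check_ner_data_dir(data_path):
    """Return a list of problems found with the expected split files."""
    problems = []
    for name in EXPECTED_FILES:
        path = Path(data_path) / name
        if not path.is_file():
            problems.append(f'{name} is missing')
        elif path.stat().st_size == 0:
            problems.append(f'{name} is empty')
    return problems
```

Calling check_ner_data_dir(ner_path) on a directory that contains val.txt instead of valid.txt would report valid.txt as missing, which matches the empty validation data seen in the log.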

Thank you very much!!

Now I can train it. Then I uncommented my lines to get the confidence scores, as described in this issue. However, the output is just the tokenized original sentence plus the probabilities, and I still want the NER predictions. I saw here that they propose using dict(model['tag_vocab']), but I can't get it to work. Could you tell me where I need to put this piece of code?

What I am doing at the moment is:

 model(["I want a camera of 24 MP"]) 

and what I get is:

Out[14]:
[[['I', 'want', 'a', 'camera', 'of', '24', 'MP']],
 array([[[0.21772474, 0.42379874, 0.21222411, 0.14625245],
     [0.24185373, 0.29909214, 0.33246854, 0.12658568],
     [0.25843135, 0.46049514, 0.1479896 , 0.13308382],
     [0.22358803, 0.41628262, 0.1902188 , 0.16991048],
     [0.20893978, 0.41218695, 0.1645924 , 0.21428095],
     [0.23138203, 0.36987516, 0.19612728, 0.2026155 ],
     [0.32253727, 0.46139917, 0.11783274, 0.09823079]]], dtype=float32)]

What I was getting before was:

Out[3]:
[[['I', 'want', 'a', 'camera', 'of', '24', 'MP']],
[['O', 'O', 'O', 'O', 'O', 'megapixels', 'megapixels']]]

I would like to join both results. Is it possible?

Thank you again, your information is really helpful!!

The code could look something like this:

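# mapping between tag names and indices, taken from the trained tag vocabulary
tags_dict = dict(model['tag_vocab'])

# proba_batch holds one probability per tag for every token of every sentence
tokens_batch, proba_batch = model(["I want a camera of 24 MP"])

# pick the most probable tag index for each token
tags_idx_batch = proba_batch.argmax(axis=2)
# look up the tag for each predicted index
predicted_tags = [[tags_dict[tag_id]
                   for tag_id in sent_tags_idx]
                  for sent_tags_idx in tags_idx_batch]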
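
To get tokens, tags and confidences in one structure, something along these lines might work. It is a sketch with NumPy only: join_predictions is a made-up helper name, idx2tag is a hypothetical index-to-tag mapping with placeholder tag names, and the probabilities are the first three rows of the output posted above. If dict(model['tag_vocab']) turns out to map tag names to indices rather than indices to names, it would need to be inverted first, for example with {i: t for t, i in tags_dict.items()}.

```python
import numpy as np


def join_predictions(tokens_batch, proba_batch, idx2tag):
    """Pair every token with its most likely tag and that tag's probability."""
    proba_batch = np.asarray(proba_batch)
    tag_ids = proba_batch.argmax(axis=2)   # best tag index per token
    confidences = proba_batch.max(axis=2)  # probability of that tag
    return [
        [(token, idx2tag[int(i)], float(p))
         for token, i, p in zip(tokens, ids, probs)]
        for tokens, ids, probs in zip(tokens_batch, tag_ids, confidences)
    ]


# Hypothetical index-to-tag mapping; the real one comes from the tag vocabulary.
idx2tag = {0: 'TAG_0', 1: 'TAG_1', 2: 'TAG_2', 3: 'TAG_3'}

tokens_batch = [['I', 'want', 'a']]
proba_batch = np.array([[[0.21772474, 0.42379874, 0.21222411, 0.14625245],
                         [0.24185373, 0.29909214, 0.33246854, 0.12658568],
                         [0.25843135, 0.46049514, 0.1479896, 0.13308382]]],
                       dtype=np.float32)

for token, tag, confidence in join_predictions(tokens_batch, proba_batch, idx2tag)[0]:
    print(f'{token}\t{tag}\t{confidence:.3f}')
```

With the real model, tokens_batch and proba_batch would be the two values returned by model([...]), and idx2tag would be built from the tag vocabulary.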

Thank you a lot!

It has been really helpful!