The documentation says that ner_few_shot_ru can be run even if you only have 10 labeled sentences.
I put 10 sentences with BIO markup into each of train.txt, test.txt and valid.txt.
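For reference, the files use the plain two-column layout that, as far as I understand, conll2003_reader expects: one token and its BIO tag per line separated by whitespace, with an empty line between sentences. The tokens below are made up purely for illustration:

Сегодня O
Сбербанк B-ORG
и O
Альфа-Банк B-ORG
объявили O
о O
сделке O
. O

I start training through the usual DeepPavlov Python API (a sketch; I assume the config is exposed as configs.ner.ner_few_shot_ru, and download=True is only needed if the pretrained ELMo has not been fetched yet):

from deeppavlov import configs, train_model

# Train the few-shot NER config on the files above; download=True pulls
# the pretrained ELMo embedder the config relies on.
ner_model = train_model(configs.ner.ner_few_shot_ru, download=True)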
Training ends right after the evaluation step; the full log of that run is below. It is clear that there is very little data at the moment, but it is completely unclear where, and in what format, to put unlabeled data so that the model also uses it during training. Am I right that this model is meant precisely for solving NER when labeled data is scarce? Thanks in advance for your help.
2020-03-13 08:00:29.667 INFO in 'deeppavlov.download'['download'] at line 117: Skipped http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz download because of matching hashes
2020-03-13 08:00:29.726 INFO in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 68: FitTrainer got additional init parameters ['epochs', 'validation_patience', 'val_every_n_epochs', 'log_every_n_epochs'] that will be ignored:
2020-03-13 08:00:29.751 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 101: [saving vocabulary to /data/home/reshetnikova/.deeppavlov/models/ner_fs/tag.dict]
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:186: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:188: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:190: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:198: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
2020-03-13 08:01:00.308 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 115: [loading vocabulary from /data/home/reshetnikova/.deeppavlov/models/ner_fs/tag.dict]
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2020-03-13 08:01:14.66 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 394: processed 423 tokens with 15 phrases; found: 0 phrases; correct: 0.
precision: 0.00%; recall: 0.00%; FB1: 0.00
B-ORG: precision: 0.00%; recall: 0.00%; F1: 0.00 0
I-ORG: precision: 0.00%; recall: 0.00%; F1: 0.00 0
{"valid": {"eval_examples_count": 9, "metrics": {"ner_f1": 0}, "time_spent": "0:00:04"}}
2020-03-13 08:01:15.604 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 394: processed 465 tokens with 18 phrases; found: 0 phrases; correct: 0.
precision: 0.00%; recall: 0.00%; FB1: 0.00
B-ORG: precision: 0.00%; recall: 0.00%; F1: 0.00 0
I-ORG: precision: 0.00%; recall: 0.00%; F1: 0.00 0
2020-03-13 08:01:15.682 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 115: [loading vocabulary from /data/home/reshetnikova/.deeppavlov/models/ner_fs/tag.dict]
{"test": {"eval_examples_count": 9, "metrics": {"ner_f1": 0}, "time_spent": "0:00:02"}}
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I labeled more data: 100 sentences in the train set and 20 each in the test and validation sets. The log now looks like this:
2020-03-18 09:06:45.671 INFO in 'deeppavlov.download'['download'] at line 117: Skipped http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz download because of matching hashes
2020-03-18 09:06:45.685 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.687 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.687 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.688 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.694 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.698 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.702 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.704 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.705 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.706 WARNING in 'deeppavlov.dataset_readers.conll2003_reader'['conll2003_reader'] at line 96: Skip '\xa0 O\n', splitted as ['O']
2020-03-18 09:06:45.737 INFO in 'deeppavlov.core.trainers.fit_trainer'['fit_trainer'] at line 68: FitTrainer got additional init parameters ['epochs', 'validation_patience', 'val_every_n_epochs', 'log_every_n_epochs'] that will be ignored:
2020-03-18 09:06:45.749 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 115: [loading vocabulary from /data/home/reshetnikova/.deeppavlov/models/ner_fs/tag.dict]
2020-03-18 09:06:45.767 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 101: [saving vocabulary to /data/home/reshetnikova/.deeppavlov/models/ner_fs/tag.dict]
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:186: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:188: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:190: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/deeppavlov/models/embedders/elmo_embedder.py:198: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
2020-03-18 09:07:26.709 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 115: [loading vocabulary from /data/home/reshetnikova/.deeppavlov/models/ner_fs/tag.dict]
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2020-03-18 09:07:40.41 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 394: processed 858 tokens with 35 phrases; found: 11 phrases; correct: 0.
precision: 81.82%; recall: 25.71%; FB1: 39.13
B-ORG: precision: 90.00%; recall: 29.03%; F1: 43.90 10
I-ORG: precision: 0.00%; recall: 0.00%; F1: 0.00 1
{"valid": {"eval_examples_count": 20, "metrics": {"ner_f1": 39.1304}, "time_spent": "0:00:05"}}
2020-03-18 09:07:44.592 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 394: processed 997 tokens with 42 phrases; found: 8 phrases; correct: 0.
precision: 100.00%; recall: 19.05%; FB1: 32.00
B-ORG: precision: 100.00%; recall: 20.00%; F1: 33.33 8
I-ORG: precision: 0.00%; recall: 0.00%; F1: 0.00 0
2020-03-18 09:07:44.658 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 115: [loading vocabulary from /data/home/reshetnikova/.deeppavlov/models/ner_fs/tag.dict]
{"test": {"eval_examples_count": 20, "metrics": {"ner_f1": 32.0}, "time_spent": "0:00:05"}}
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I would like to join the question that has already been asked. The page "Named Entity Recognition (NER) — DeepPavlov 0.17.4 documentation" says that unlabeled data can also be used during training, but it does not say what format the training data file should have in that case. Please clarify!
@Vasily My entities are highly specialized and I have little training data (at most 50 examples per entity). I have tried BERT, though within the Natasha project (BERT + CRF), and there it turned out that around 100-150 examples are needed for more or less decent quality. Does it make sense to try fine-tuning your model on 50 examples?
And are there more effective few-shot approaches in DeepPavlov?
@crout What domain are the entities from, exactly? It all depends on how complex the entities are, but 50 examples will not be enough. Is everything in Russian? In upcoming releases we want to add full-fledged zero-shot NER based on SQuAD. For now you can try a plain SQuAD model for extracting entities, where you pass the definition of the entity instead of a question.
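Roughly, that idea could look like the following minimal sketch. The config name squad_ru_bert, the context and definition strings, and the output unpacking are my own assumptions for illustration and may differ between releases:

from deeppavlov import build_model, configs

# Load a Russian SQuAD (reading-comprehension) model; the exact config name
# is an assumption, any Russian SQuAD config should do for the illustration.
model = build_model(configs.squad.squad_ru_bert, download=True)

context = "Вчера компания «Ромашка» подписала договор с банком «Салют»."
# Instead of a question, pass the definition of the entity you want to extract.
definition = "название организации или компании"

# The SQuAD component returns the extracted span, its start position and a score
# (the exact output structure may vary between versions).
answers, positions, scores = model([context], [definition])
print(answers[0], scores[0])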