Training ODQA on your own data

Hello, I am studying your library to implement my own project.
I found a couple of articles that explain how to train ODQA models on your own data:
1 - DeepPavlov: "Keras" for natural language processing helps answer questions about COVID-2019 / Microsoft corporate blog / Habr
2 - https://medium.com/deeppavlov/open-domain-question-answering-with-deeppavlov-c665d2ee4d65

Based on the articles, I wrote the following code and dataset:
code - test_deeppavlov/obuchenie.py at main · NikitaAkimov/test_deeppavlov · GitHub
dataset - test_deeppavlov/model.csv at main · NikitaAkimov/test_deeppavlov · GitHub

After several hours of training, the model keeps printing the following to the console, and as far as I can tell it goes on forever:

2021-04-04 18:47:22.215 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 155: Counting hash...
2021-04-04 18:47:30.446 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 153: Tokenizing batch...
2021-04-04 18:51:25.300 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 155: Counting hash...
2021-04-04 18:51:35.784 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 153: Tokenizing batch...
2021-04-04 18:55:07.505 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 155: Counting hash...
2021-04-04 18:55:17.715 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 153: Tokenizing batch...
2021-04-04 18:58:45.701 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 155: Counting hash...
2021-04-04 18:58:53.939 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 153: Tokenizing batch...

Could you please tell me where I went wrong and how to train this model on my own data?

Thank you!

@NikitaAkimov

You need to follow this data format:

data_path: a directory/file with texts to create a database from
dataset_format: initial data format; should be selected from ['txt', 'wiki', 'json']

So data_path is the path to the folder that contains the article files to search over.
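
Roughly, the setup from the second article looks like this (a minimal sketch, assuming a recent DeepPavlov version; /path/to/my_data is a placeholder for your folder with .txt files, and for Russian data you would start from the ru_ranker_tfidf_wiki config instead):

from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Take the Wikipedia TF-IDF ranker config and point it at your own texts
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/path/to/my_data"  # folder with .txt article files
model_config["dataset_reader"]["dataset_format"] = "txt"

# Build the TF-IDF index over your documents
doc_retrieval = train_model(model_config)

The trained ranker is then plugged into the ODQA pipeline exactly as described in the article.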

Generally speaking, ODQA models are the wrong tool for your case.

Take a look at the FAQ models instead, they are a better fit.
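
For instance, training on your own question/answer pairs looks roughly like this (a sketch, not the only way to do it; it assumes your data is a CSV with "Question" and "Answer" columns, and /path/to/my_faq.csv as well as the sample question are placeholders):

from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

model_config = read_json(configs.faq.tfidf_logreg_en_faq)

# Point the reader at your own CSV instead of the default remote file
model_config["dataset_reader"].pop("data_url", None)
model_config["dataset_reader"]["data_path"] = "/path/to/my_faq.csv"

faq = train_model(model_config, download=True)
print(faq(["How do I train the model on my own data?"]))

The model matches an incoming question against the questions in your CSV and returns the answer of the closest one.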

@Vasily
Hello, I read your article (Open-domain question answering with DeepPavlov | by Vasily Konovalov | DeepPavlov | Medium) and watched the video (Alice School: How to use the DeepPavlov libraries to answer customers' frequently asked questions? - YouTube), where you explain how to implement training with this method. Thank you for the very detailed explanation!

Based on the articles, I rewrote my code:
code - test_deeppavlov/obuchenie.py at main · NikitaAkimov/test_deeppavlov · GitHub

Unfortunately, an error comes up that I just cannot resolve:

2021-04-05 21:40:03.976 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/vectorizer/tfidf_vectorizer_ruwiki.pkl download because of matching hashes
2021-04-05 21:40:04.99 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/faq/school/tfidf_cos_sim_classifier.pkl download because of matching hashes
Traceback (most recent call last):
File "obuchenie.py", line 8, in <module>
train = True)
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/deeppavlov/deprecated/skills/similarity_matching_skill/similarity_matching_skill.py", line 80, in __init__
self.model = train_model(model_config, download=True)
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/deeppavlov/__init__.py", line 29, in train_model
train_evaluate_model_from_config(config, download=download, recursive=recursive)
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/deeppavlov/core/commands/train.py", line 92, in train_evaluate_model_from_config
data = read_data_by_config(config)
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/deeppavlov/core/commands/train.py", line 51, in read_data_by_config
reader = get_model(reader_config.pop('class_name'))()
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/deeppavlov/core/common/registry.py", line 72, in get_model
return cls_from_str(_REGISTRY[name])
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/deeppavlov/core/common/registry.py", line 40, in cls_from_str
return getattr(importlib.import_module(module_name), cls_name)
File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/deeppavlov/dataset_readers/faq_reader.py", line 17, in <module>
from pandas import read_csv
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/__init__.py", line 55, in <module>
from pandas.core.api import (
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/core/api.py", line 24, in <module>
from pandas.core.groupby import Grouper, NamedAgg
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
from pandas.core.groupby.generic import ( # noqa: F401
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 44, in <module>
from pandas.core.frame import DataFrame
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/core/frame.py", line 88, in <module>
from pandas.core.generic import NDFrame, _shared_docs
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/core/generic.py", line 70, in <module>
from pandas.io.formats.format import DataFrameFormatter, format_percentiles
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/io/formats/format.py", line 48, in <module>
from pandas.io.common import _expand_user, _stringify_path
File "/root/deepSearch_DoctorAi/env/lib/python3.7/site-packages/pandas/io/common.py", line 3, in <module>
import bz2
File "/usr/local/lib/python3.7/bz2.py", line 19, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'

Could you tell me how to fix this error and train the network on my own dataset?

Thank you!

@NikitaAkimov

Please use the code from the article https://medium.com/deeppavlov/simple-intent-recognition-and-question-answering-with-deeppavlov-c54ccf5339a9

@Vasily, hello. I read your article and followed it to install tfidf_autofaq (the same thing happens with the tfidf_logreg_autofaq config; tfidf_logreg_en_faq installed and launched successfully). Unfortunately, the same error still comes up. Could you tell me how to fix it?

(env) root@178-21-11-97:~/DocAi_deep/test_deeppavlov# python -m deeppavlov install tfidf_autofaq
2021-04-06 19:39:48.150 INFO in 'deeppavlov.core.common.file'['file'] at line 32: Interpreting 'tfidf_autofaq' as '/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/configs/faq/tfidf_autofaq.json'
2021-04-06 19:39:48.153 WARNING in 'deeppavlov.utils.pip_wrapper.pip_wrapper'['pip_wrapper'] at line 59: No requirements found in config
(env) root@178-21-11-97:~/DocAi_deep/test_deeppavlov#
(env) root@178-21-11-97:~/DocAi_deep/test_deeppavlov#
(env) root@178-21-11-97:~/DocAi_deep/test_deeppavlov#
(env) root@178-21-11-97:~/DocAi_deep/test_deeppavlov# python -m deeppavlov interact tfidf_autofaq -d
2021-04-06 19:40:03.996 INFO in 'deeppavlov.core.common.file'['file'] at line 32: Interpreting 'tfidf_autofaq' as '/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/configs/faq/tfidf_autofaq.json'
2021-04-06 19:40:04.425 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/vectorizer/tfidf_vectorizer_ruwiki.pkl?config=tfidf_autofaq download because of matching hashes
2021-04-06 19:40:04.501 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/faq/school/tfidf_cos_sim_classifier.pkl?config=tfidf_autofaq download because of matching hashes
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data] Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data] /root/nltk_data...
[nltk_data] Package nonbreaking_prefixes is already up-to-date!
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/__main__.py", line 4, in <module>
main()
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/deep.py", line 89, in main
interact_model(pipeline_config_path)
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/core/commands/infer.py", line 79, in interact_model
model = build_model(config)
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/core/commands/infer.py", line 62, in build_model
component = from_params(component_config, mode=mode, serialized=component_serialized)
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/core/common/params.py", line 95, in from_params
obj = get_model(cls_name)
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/core/common/registry.py", line 72, in get_model
return cls_from_str(_REGISTRY[name])
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/core/common/registry.py", line 40, in cls_from_str
return getattr(importlib.import_module(module_name), cls_name)
File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/root/DocAi_deep/env/lib/python3.7/site-packages/deeppavlov/models/tokenizers/ru_tokenizer.py", line 20, in <module>
import pymorphy2
File "/root/DocAi_deep/env/lib/python3.7/site-packages/pymorphy2/__init__.py", line 3, in <module>
from .analyzer import MorphAnalyzer
File "/root/DocAi_deep/env/lib/python3.7/site-packages/pymorphy2/analyzer.py", line 10, in <module>
from pymorphy2 import opencorpora_dict
File "/root/DocAi_deep/env/lib/python3.7/site-packages/pymorphy2/opencorpora_dict/__init__.py", line 4, in <module>
from .storage import load_dict as load
File "/root/DocAi_deep/env/lib/python3.7/site-packages/pymorphy2/opencorpora_dict/storage.py", line 24, in <module>
from pymorphy2.utils import json_write, json_read
File "/root/DocAi_deep/env/lib/python3.7/site-packages/pymorphy2/utils.py", line 5, in <module>
import bz2
File "/usr/local/lib/python3.7/bz2.py", line 19, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'