Need some advice regarding using own data

I would like to use DeepPavlov to give me answers based on two things (could you give me some advice on which models to use?):

  • Firstly, a documentation page.
    Should I be using Text QA for that, or some kind of combination?

  • Secondly, I want to use my own database with Q&A values.
    (I’m trying to figure out how to do that, but I can’t find it in your documentation…)

Also, I'm trying to get the ODQA model to work with the examples in this link. However, I keep getting the same two lines over and over in an endless loop:

2020-03-19 16:26:05.628 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 153: Tokenizing batch…

and

2020-03-19 16:29:49.882 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 155: Counting hash…

Is this supposed to happen or not?

Can somebody please give me an answer? In the Medium article that I found
(https://medium.com/deeppavlov/open-domain-question-answering-with-deeppavlov-c665d2ee4d65) you don't describe how to use your own data…
Why is it stuck on tokenizing the batch?

Dear @kostis95,

Regarding your first question, I would recommend taking a look at the ODQA model (because you can train it on your own data); alternatively, you can split your documentation page into logical chunks and iteratively search for an answer in each chunk by applying the SQuAD model.

Your second task is similar to what our Frequently Asked Questions (FAQ) skill does. Basically, you need to either define your own data_reader or convert your question-answer pairs into the CSV format and pick the best-performing config for your task.
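
For illustration, a rough sketch of the CSV route (assuming one of the FAQ configs, e.g. tfidf_logreg_en_faq, and a CSV with "Question" and "Answer" columns; please check the FAQ docs for the exact reader parameters):

from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Hypothetical CSV with one question-answer pair per row.
faq_config = read_json(configs.faq.tfidf_logreg_en_faq)
faq_config["dataset_reader"]["data_path"] = "/path/to/my_faq.csv"
faq_config["dataset_reader"]["data_url"] = None  # don't pull the default demo CSV

faq = train_model(faq_config)
print(faq(["How do I install the product?"]))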

Unfortunately, I was not able to replicate this ODQA-related failure; please post the exact code snippet and the full error message.

The training process is described in the section Training the model of the ODQA tutorial. These are the relevant code snippets: code 1, code 2. Basically, you need to change the data source of the ranker and then retrain it. Please make sure you use the right data format, then build_model ODQA with download=False.
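
In outline, that flow looks roughly like this (a sketch; the data_path is a placeholder for your own folder of .txt files, and it assumes the ODQA components have been downloaded at least once before):

from deeppavlov import configs, train_model
from deeppavlov.core.commands.infer import build_model
from deeppavlov.core.common.file import read_json

# Point the ranker at your own documents and retrain it.
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/path/to/your/txt_files"
model_config["dataset_reader"]["dataset_format"] = "txt"
ranker = train_model(model_config)

# Build the ODQA pipeline on top of the retrained ranker; download=False
# so the freshly built index is not overwritten by the wiki data.
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=False)
print(odqa(["What does the documentation say about installation?"]))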

Please let me know if I can assist you further.

Ah okay, thanks for the clarification; I'm trying to work with the ODQA model.
As for the example, I was using the biology corpus (again from the Medium article). But maybe it's better to leave that for now, just try with my own data and see how that goes…

For the Documentation part
I'm currently trying to define my own ranker.json file, looking up the terms in the documentation to figure out what I need for my own file. Maybe you could give me some advice, since I cannot find the meaning of all the terms in the documentation.

(Also, I want to use either a folder of .txt files (for now, just to get it working) or a database (soon).)
First of all, my file (which is based on the "en_ranker_tfidf_wiki" JSON file) is the following:

{
  "dataset_reader": {
    "class_name": "odqa_reader",
    "data_path": "{DOWNLOADS_PATH}/odqa/servoy_articles",
    "save_path": "{DOWNLOADS_PATH}/odqa/servoy_articles",
    "dataset_format": "txt"
  },
  "dataset_iterator": {
    "class_name": "sqlite_iterator",
    "shuffle": false,
    "load_path": "{DOWNLOADS_PATH}/odqa/enwiki.db"
  },
  "chainer": {
    "in": ["docs"],
    "in_y": ["doc_ids", "doc_nums"],
    "out": ["tfidf_doc_ids"],
    "pipe": [
      {
        "class_name": "hashing_tfidf_vectorizer",
        "id": "vectorizer",
        "fit_on": ["docs", "doc_ids", "doc_nums"],
        "save_path": "{MODELS_PATH}/odqa/servoy_articles_tfidf_matrix.npz",
        "load_path": "{MODELS_PATH}/odqa/servoy_articles_tfidf_matrix.npz",
        "tokenizer": {
          "class_name": "stream_spacy_tokenizer",
          "lemmas": true,
          "ngram_range": [1, 2]
        }
      },
      {
        "class_name": "tfidf_ranker",
        "top_n": 20,
        "in": ["docs"],
        "out": ["tfidf_doc_ids", "tfidf_doc_scores"],
        "vectorizer": "#vectorizer"
      }
    ]
  },
  "train": {
    "batch_size": 10000,
    "evaluation_targets": [],
    "class_name": "fit_trainer"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "requirements": [
      "{DEEPPAVLOV_PATH}/requirements/spacy.txt",
      "{DEEPPAVLOV_PATH}/requirements/en_core_web_sm.txt"
    ],
    "download": [
      {
        "url": "http://files.deeppavlov.ai/deeppavlov_data/en_odqa.tar.gz",
        "subdir": "{MODELS_PATH}"
      }
    ]
  }
}

*servoy_articles being a folder that I have placed inside .deeppavlov/downloads, containing a couple of .txt files
*Also, I was wondering whether the npz tf-idf matrix is supposed to be generated, since the documentation states that it:

Creates a tfidf matrix from collection of documents

I'm unsure what the save_path property means (I'm unable to find it in the dataset_reader documentation). Is it basically the same as
data_path but with an extension? And in the case of a folder of .txt files, would there then be no difference (since a folder has no extension)?


Python Code
The following piece of code is in my Python project
(which I'm just running to see if I can learn something from any errors that show up):

from deeppavlov import configs, train_model
ranker = train_model(configs.doc_retrieval.kostis_ranker, download=True)

result = ranker(['What is Servoy?'])
print(result[:5])

For now, just like with the other example, I keep getting an infinite loop of
"Tokenizing batch…" and "Counting hash…".


For the FAQ part
I would like to use my Elasticsearch database with the data_reader if possible (the documentation states the format should be either wiki, txt, or json). So would it also be possible to connect the reader and DeepPavlov to Elasticsearch?

First of all, I would recommend strictly following the tutorial. To do so you need to run code1 and code2.

Once you get it working with your data you can alter the configuration files.

servoy_articles is indeed a folder where you place all your data.
npz-tfidf-matrix is supposed to be generated by the ranker.

The full description of the odqa_reader parameters is as follows:

Args:
    data_path: a directory/file with texts to create a database from
    db_url: path to a database url
    kwargs:
        save_path: a path where a database should be saved to, or path to a ready database
        dataset_format: initial data format; should be selected from ['txt', 'wiki', 'json']
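
In other words, data_path is the folder you read from, save_path is the database file the reader builds from it, and the dataset_iterator then loads that same database. A sketch with placeholder paths (not from the tutorial):

from deeppavlov import configs
from deeppavlov.core.common.file import read_json

model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)

# Folder with the source .txt files (input only).
model_config["dataset_reader"]["data_path"] = "~/.deeppavlov/downloads/servoy_articles"
# SQLite file the reader creates from them (not the same path as data_path).
model_config["dataset_reader"]["save_path"] = "~/.deeppavlov/downloads/servoy_articles.db"
model_config["dataset_reader"]["dataset_format"] = "txt"
# The iterator loads the database that the reader just built.
model_config["dataset_iterator"]["load_path"] = "~/.deeppavlov/downloads/servoy_articles.db"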

ranker = train_model(configs.doc_retrieval.kostis_ranker, download=True)
Please omit the download parameter, as in the tutorial, because in this case you overwrite your data with the wiki data.

For the FAQ part
There is no predefined data_reader for Elasticsearch; I would recommend converting your data into CSV format and then training the FAQ model.
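
If your source of truth is Elasticsearch, one option is a small export script (a sketch; the cluster URL, index name, and field names are placeholders, and it assumes the elasticsearch Python client is installed):

import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])  # placeholder cluster URL

with open("faq.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Question", "Answer"])  # column names used by the FAQ configs
    # Placeholder index and field names; adjust them to your mapping.
    for hit in scan(es, index="faq", query={"query": {"match_all": {}}}):
        src = hit["_source"]
        writer.writerow([src["question"], src["answer"]])

The resulting faq.csv can then be used as the data_path in the FAQ config.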

Hi there Vasiliy,
Thanks for the quick response. I am running the corpus with exactly that code; unfortunately, I get the following output looped:

2020-04-14 08:41:38.103 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 57: Reading files…
2020-04-14 08:41:38.111 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 134: Building the database…
0%|          | 0/300 [00:00<?, ?it/s]
0it [00:00, ?it/s]2020-04-14 08:41:39.17 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 57: Reading files…
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Kostis\Desktop\Deeppavlov-Python\RunningPavlovWithOwnData.py", line 16, in <module>
    ranker = train_model(model_config)
  File "C:\Users\Kostis\env\Lib\site-packages\deeppavlov\__init__.py", line 32, in train_model
    train_evaluate_model_from_config(config, download=download, recursive=recursive)
  File "C:\Users\Kostis\env\Lib\site-packages\deeppavlov\core\commands\train.py", line 92, in train_evaluate_model_from_config
    data = read_data_by_config(config)
  File "C:\Users\Kostis\env\Lib\site-packages\deeppavlov\core\commands\train.py", line 58, in read_data_by_config
    return reader.read(data_path, **reader_config)
  File "C:\Users\Kostis\env\Lib\site-packages\deeppavlov\dataset_readers\odqa_reader.py", line 81, in read
    self._build_db(save_path, dataset_format, expand_path(data_path))
  File "C:\Users\Kostis\env\Lib\site-packages\deeppavlov\dataset_readers\odqa_reader.py", line 130, in _build_db
    Path(save_path).unlink()
  File "C:\Users\Kostis\AppData\Local\Programs\Python\Python37\lib\pathlib.py", line 1304, in unlink
    self._accessor.unlink(self)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Kostis\.deeppavlov\downloads\odqa\enwiki.db'

Also
While trying with my own ranker file, I have changed the pipe section's save_path and load_path to:
"save_path": "{MODELS_PATH}/servoy_articles/servoy_documentation.npz",
"load_path": "{MODELS_PATH}/servoy_articles/servoy_documentation.npz"
But I get: FileNotFoundError: HashingTfIdfVectorizer path doesn't exist!
Which is understandable, since it doesn't exist. But I thought it was supposed to be created when working with my own data.

I've also tried creating my own empty .npz file, but then npyio.py cannot perform:

return pickle.load(fid, **pickle_kwargs)

and thus I get:
Failed to interpret file WindowsPath('C:/Users/Kostis/.deeppavlov/models/servoy/servoy_documentation_tfidf_matrix.npz') as a pickle

How should the .npz be generated then and how should the vectorizer parameters be set up when using your own data?

Or maybe something is wrong with my reader file?
I have set data_path and save_path as follows:

"data_path": "{DOWNLOADS_PATH}/servoy_articles",
"save_path": "{DOWNLOADS_PATH}/servoy.db",

The database is just a small file (12 KB) with a table "documents" containing the fields "id" (text) and "text" (text).
I copied it from the enwiki.db file from the example.
So does the db need to be filled before starting with the vectorizer? Or am I maybe doing something wrong with the permissions?

I’m lost…

It seems like you have another process that is accessing enwiki.db; please make sure that you stopped the previous process.

Then paste the code from RunningPavlovWithOwnData.py along with the error message, and make sure it follows the tutorial.

Okay so this is my code:

Corpus

#Part 2 Step 1: Contains top 30 relevant articles for query cerebellum.
from deeppavlov import configs
from deeppavlov.core.common.file import read_json
from deeppavlov import configs, train_model
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/Users/Kostis/Desktop/SentenceCorpus/unlabeled_articles/plos_unlabeled"
model_config["dataset_reader"]["dataset_format"] = "txt"
ranker = train_model(model_config)
docs = ranker(['cerebellum'])

#Part 2 Step 2: Build the ODQA models and run the query. Seems to work but needs a database first
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)
a = odqa(["what is tuberculosis ?"])
print(a)


My Own Data model

from deeppavlov import configs, train_model
ranker = train_model(configs.doc_retrieval.kostis_ranker)

result = ranker(['What is Servoy?'])
print(result[:5])

I'm sure this is the code from the tutorial.

Lastly, my ranker file, located in "C:\Users\Kostis\env\Lib\site-packages\deeppavlov\configs\doc_retrieval\kostis_ranker.json", looks like this:

{
  "dataset_reader": {
    "class_name": "odqa_reader",
    "data_path": "{DOWNLOADS_PATH}/servoy_odqa/articles",
    "save_path": "{DOWNLOADS_PATH}/servoy_odqa/servoy_documentation_docs.db",
    "dataset_format": "txt"
  },
  "dataset_iterator": {
    "class_name": "sqlite_iterator",
    "shuffle": false,
    "load_path": "{DOWNLOADS_PATH}/servoy_odqa/servoy_documentation_docs.db"
  },
  "chainer": {
    "in": ["docs"],
    "in_y": ["doc_ids", "doc_nums"],
    "out": ["pop_doc_ids"],
    "pipe": [
      {
        "class_name": "hashing_tfidf_vectorizer",
        "id": "vectorizer",
        "fit_on": ["docs", "doc_ids", "doc_nums"],
        "save_path": "{MODELS_PATH}/servoy_odqa/servoy_documentation_tfidf_matrix.npz",
        "load_path": "{MODELS_PATH}/servoy_odqa/servoy_documentation_tfidf_matrix.npz",
        "tokenizer": {
          "class_name": "stream_spacy_tokenizer",
          "lemmas": true,
          "ngram_range": [1, 2]
        }
      },
      {
        "class_name": "tfidf_ranker",
        "top_n": 20,
        "in": ["docs"],
        "out": ["tfidf_doc_ids", "tfidf_doc_scores"],
        "vectorizer": "#vectorizer"
      }
    ]
  },
  "train": {
    "batch_size": 10000,
    "evaluation_targets": [],
    "class_name": "fit_trainer"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "requirements": [
      "{DEEPPAVLOV_PATH}/requirements/spacy.txt",
      "{DEEPPAVLOV_PATH}/requirements/en_core_web_sm.txt"
    ],
    "download": [
      {
        "url": "http://files.deeppavlov.ai/deeppavlov_data/en_odqa.tar.gz",
        "subdir": "{MODELS_PATH}"
      }
    ]
  }
}

The .txt articles are located in: C:\Users\Kostis\.deeppavlov\downloads\servoy_odqa\articles
I also have a db file called "servoy_documentation_docs.db" in the servoy_odqa folder to satisfy the data reader's save_path (because the info should be saved to a database).

With this ranker file and these settings it does not generate an .npz file, and I get the error:
FileNotFoundError: HashingTfIdfVectorizer path doesn't exist!
So my question is: how is the .npz matrix file supposed to be generated?

So did you run the Corpus part? Make sure you can successfully run that minimal example; it should be sufficient because it's trained on your data. For now I would hold off on further modifications until you get the starter code working.

Regarding the My Own Data model part: it seems like the folder {MODELS_PATH}/servoy_odqa/ doesn't exist.
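
If that is the case, creating the folder by hand before training is one way around it (a sketch assuming the default ROOT_PATH of ~/.deeppavlov):

from pathlib import Path

# Make sure the directory the vectorizer wants to save its .npz into exists.
Path("~/.deeppavlov/models/servoy_odqa").expanduser().mkdir(parents=True, exist_ok=True)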

Ah okay get that working first, understood.

Even one of the first examples goes wrong:

from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)
answers = odqa(["Where did guinea pigs originate?", "When did the Lynmouth floods happen?", "When is the Bastille Day?"])

This is supposed to return:

["Argentina", "15-16 August 1952", "14 July 1789"]

But I'm getting this:

['Andes of South America', '1804', '']

So what is the cause of the wrong answers? I followed all the instructions.


With the next example of the corpus I'm getting:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Kostis\.deeppavlov\downloads\odqa\enwiki.db'

but I can't find what process that would be…
If I use Windows' Resource Monitor and look for the associated handle "deeppavlov",
I cannot see enwiki.db being used by any other PID. It seems to be a PyCharm problem; what IDE do you use to work with DeepPavlov?

Noticed database reset!
It also seems that when changing to a different example, enwiki.tar.gz is downloaded all over again and enwiki.db is reset from 14 GB to 12 KB (this is on Windows). Why?

I think this reset of the database might be the reason I'm getting the permission error. On my MacBook I don't have this reset issue and the code progresses (without a permission error).
Only then I get an infinite loop of "Tokenizing batch…" and "Counting hash…". Is that supposed to happen? (Maybe you could show me what the output is supposed to look like.)

That's true, the current SQuAD model provides different answers. We continuously improve the model, which is why you might experience different outputs. However, the overall performance has become much better.

Regarding the PermissionError issue, I would recommend using either Python in a terminal or a Jupyter notebook. As you've noticed, on Mac you don't have this error.

Please paste the code and the complete error message that results in the infinite loop here, and please make sure that the code follows the tutorial. Before using your data, please confirm that it works for the SentenceCorpus dataset provided in the tutorial.

Okay, I will try that on Windows today! On Mac I have the code of the corpus example, everything the same:

#Step 1
from deeppavlov import configs
from deeppavlov.core.common.file import read_json
from deeppavlov import configs, train_model
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/Users/antoniosthanos/Desktop/SentenceCorpus/unlabeled_articles/plos_unlabeled"
model_config["dataset_reader"]["dataset_format"] = "txt"
ranker = train_model(model_config)
docs = ranker(['cerebellum'])

#4 Step 2
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=False)
a = odqa(["what is tuberculosis ?"])

My output is the following:

2020-04-14 23:37:24.749 INFO in ‘deeppavlov.dataset_readers.odqa_reader’[‘odqa_reader’] at line 57: Reading files…
2020-04-14 23:37:24.753 INFO in ‘deeppavlov.dataset_iterators.sqlite_iterator’[‘sqlite_iterator’] at line 57: Connecting to database, path: /Users/antoniosthanos/.deeppavlov/downloads/odqa/enwiki.db
2020-04-14 23:37:34.273 INFO in ‘deeppavlov.dataset_iterators.sqlite_iterator’[‘sqlite_iterator’] at line 112: SQLite iterator: The size of the database is 5180368 documents
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1076)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1076)>
[nltk_data] Error loading perluniprops: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1076)>
[nltk_data] Error loading nonbreaking_prefixes: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1076)>
2020-04-14 23:37:38.206 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:41:05.294 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:41:12.419 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:42:35.492 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:42:39.109 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:43:57.459 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:44:00.926 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:45:20.812 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:45:24.285 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:46:51.13 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:46:55.431 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:48:30.650 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:48:34.911 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:50:12.86 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:50:16.124 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:51:43.783 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:51:47.670 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:53:27.896 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:53:32.14 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:55:08.271 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:55:12.403 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:56:37.485 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:56:42.300 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-14 23:59:16.574 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-14 23:59:21.874 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…
2020-04-15 00:01:05.519 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 155: Counting hash…
2020-04-15 00:01:09.857 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…

And it doesn't stop, it just goes on and on; I've let it run for two hours before killing the process. So it's not an error, but is it really supposed to take that long? I'm running on a 2015 MacBook Pro (16 GB 1600 MHz DDR3).

The Python terminal returns the same permission denied error, so it must be my computer then…
Have any other users reported this?

Apparently something is wrong with the nltk_data certificate. Please try to solve it according to this topic.

As far as I know there was no issue with PermissionError previously.

The nltk_data issue can be fixed; I added this code:

import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()

But I still get an infinite loop:

/Library/Frameworks/Python.framework/Versions/3.7/bin/python3 “/Users/antoniosthanos/Desktop/Pavlov configuration files/testDeeppavlovPython.py”
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
2020-04-15 11:39:51.336 INFO in ‘deeppavlov.dataset_readers.odqa_reader’[‘odqa_reader’] at line 57: Reading files…
2020-04-15 11:39:51.340 INFO in ‘deeppavlov.dataset_iterators.sqlite_iterator’[‘sqlite_iterator’] at line 57: Connecting to database, path: /Users/antoniosthanos/.deeppavlov/downloads/odqa/enwiki.db
2020-04-15 11:40:01.231 INFO in ‘deeppavlov.dataset_iterators.sqlite_iterator’[‘sqlite_iterator’] at line 112: SQLite iterator: The size of the database is 5180368 documents
[nltk_data] Downloading package punkt to
[nltk_data] /Users/antoniosthanos/nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/antoniosthanos/nltk_data…
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data] /Users/antoniosthanos/nltk_data…
[nltk_data] Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data] /Users/antoniosthanos/nltk_data…
[nltk_data] Package nonbreaking_prefixes is already up-to-date!
2020-04-15 11:40:04.544 INFO in ‘deeppavlov.models.vectorizers.hashing_tfidf_vectorizer’[‘hashing_tfidf_vectorizer’] at line 153: Tokenizing batch…

And it continues: tokenizing batch, counting hash etc…

It doesn't seem like the log from the tutorial.

Here you are trying to index 5,180,368 documents (see the sqlite_iterator line in your log). The indexation runs in batches of 10,000; you can find this in the model_config["train"]["batch_size"] parameter.
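
As a rough sanity check (a sketch; the document count is taken from the sqlite_iterator line in your log):

import math

from deeppavlov import configs
from deeppavlov.core.common.file import read_json

model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
n_docs = 5_180_368                                # reported by the sqlite_iterator log line
batch_size = model_config["train"]["batch_size"]  # 10000 in en_ranker_tfidf_wiki
# Each batch produces one "Tokenizing batch…" / "Counting hash…" pair in the log.
print(math.ceil(n_docs / batch_size))             # 519 batches for the full enwiki.db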

First, please make sure you are able to run it successfully with the SentenceCorpus from the tutorial, then just replace it with your data.

So I understand that I'm supposed to see "Tokenizing batch…" about 520 times in the output (~5,200,000 / 10,000). The point is that on my MacBook (with 16 GB RAM) getting through 52 of them took me 3 hours, so getting through 520 would take 10 times as much, 30 hours.
That is too much time to get testing done. Don't you have any other, smaller example to work with, with the exact same settings and configuration (using train_model)?

Also, regarding the Windows PC I'm using: would you have any idea why I get the "already in use" permission error? I've done everything I could think of: using the Python terminal I get the same permission error; scanning with the Resource Monitor shows there aren't any process IDs other than PyCharm itself using enwiki.db; I also added Python to "Path" in the system variables. None of it works. I will try to reinstall and see if that might work as well.

I now run my own ranker file; it seems the save_path variable caused me the error
The process cannot access the file because it is being used by another process

It was because the save_path, the data_path and the load_path were exactly the same: the save_path was pointing at my folder of .txt files. So I created a database with the same structure as enwiki.db, called servoywiki.db. It saves all the files correctly and it contains fewer documents (only about 70,000 instead of the previous 5 million). So now the beginning of my ranker file looks like this:

{
  "dataset_reader": {
    "class_name": "odqa_reader",
    "data_path": "{DOWNLOADS_PATH}/servoywiki_textfiles_all",
    "save_path": "{DOWNLOADS_PATH}/servoywiki.db",
    "dataset_format": "txt"
  },
  "dataset_iterator": {
    "class_name": "sqlite_iterator",
    "shuffle": false,
    "load_path": "{DOWNLOADS_PATH}/servoywiki.db"

My Mac works with the files and returns the list of the most likely text files.
However, on Windows I'm still getting the dreaded permission error with exactly the same ranker file and servoywiki.db file, which is strange…

And on Mac it also doesn't always work; depending on which text files I put in the folder, I sometimes get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

I just converted the files from my database to text with functional SQL. Maybe that wasn't enough and I have to add an encoding somewhere?
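
One thing I could try is re-encoding the exported files as UTF-8 before building the database (a rough sketch, assuming the exports are in a Windows code page such as cp1252 and live in the data_path folder above):

from pathlib import Path

# Same directory that data_path points at.
src_dir = Path("~/.deeppavlov/downloads/servoywiki_textfiles_all").expanduser()

for txt in src_dir.glob("*.txt"):
    raw = txt.read_bytes()
    # Decode with the assumed source encoding, tolerating stray bytes like 0x80.
    text = raw.decode("cp1252", errors="replace")
    txt.write_text(text, encoding="utf-8")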

Okay, it works on Mac now, both the plos_unlabeled cerebellum example and my own data. My only question is this: if the chatbot receives a query, what am I supposed to take as the query? In the example, 'cerebellum' is the query.

Let's say the question the user asks is: "What are the solution Settings?"
Would the query then be just the subject of the question, so "Solution Settings"?