Stuck on getting ODQA to run with custom data (and a few questions to clear things up)

Hello,

I’m attempting to get ODQA running in DeepPavlov from PyCharm. The goal is to have it answer a scripted set of questions. I’m using the following guide as the basis for my efforts, and almost all of my code is derived from it: Open-domain question answering with DeepPavlov | by Vasily Konovalov | DeepPavlov | Medium. I’m training the ranker model and building the reader model.
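For context, my ranker-training code is essentially the snippet from the guide, with only the data path changed (shown here as a placeholder):

```python
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Load the stock TF-IDF ranker config and point it at a folder of .txt files
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/path/to/my/txt/files"  # placeholder
model_config["dataset_reader"]["dataset_format"] = "txt"

# "Training" here builds the TF-IDF index over the documents
doc_retrieval = train_model(model_config)
```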

Attempting to run the reader on the entirety of my dataset (approximately 1.8 GB of txt files) results in a memory error. Using a small fraction of my dataset, I was able to get the following logs:

The machine I’m using has 32 GB of RAM, of which approximately 24 GB is available for operations. The TF-IDF ranker keeps looping at tokenization and hash counting. Here are my model config settings:

I’m hoping for answers to the following questions:

1. How large a dataset is the ranker supposed to handle? If it can only handle small datasets, am I doing something wrong?
2. Is the loop shown in the first image something to be expected (and waited out), or an error indicating I made a mistake?

Given that DeepPavlov itself and the guide have already taken me a long way, I suspect that if there’s a problem, it’s in my configuration file.

Please let me know if there’s anything I should elaborate upon.

Hey @sieradd, thank you very much for your interest in DeepPavlov!

The ranker is able to handle pretty large datasets; for example, we have a config that retrieves over the entire English Wikipedia. The counting/tokenizing loop is OK: it is building the hash index for the ranker, so I recommend waiting until the entire index is built.

Try starting off with a few MB of text, check that it runs smoothly, then incrementally increase the number of files. I believe that 32 GB of RAM should be enough to build the index for 1.8 GB of text; make sure that your OS doesn’t limit RAM usage. Decreasing the batch size might also help.
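For example, assuming your config follows the stock en_ranker_tfidf_wiki layout, the batch size can be lowered before training; the exact key may differ in your config:

```python
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)

# A smaller batch size lowers peak memory during the tokenize/count loop.
# NOTE: the "dataset_iterator"/"batch_size" keys assume the stock config
# layout; check your own config file for the actual location.
model_config["dataset_iterator"]["batch_size"] = 50

doc_retrieval = train_model(model_config)
```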

Let me know if you need further assistance.

Thank you,

I’m currently running on a 2.3 MB folder. How long should I expect the counting/tokenizing loop to take until the index is complete? Right now it has been busy for over an hour. The current batch size is 100.

It should be pretty fast. Maybe you have a lot of small files; if so, try, say, ten files, just to make sure that everything goes fine.
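A quick way to carve out such a subset (the paths here are placeholders):

```python
import shutil
from pathlib import Path

src = Path("/path/to/full/dataset")   # placeholder: folder with all txt files
dst = Path("/path/to/test/subset")    # placeholder: small test folder
dst.mkdir(parents=True, exist_ok=True)

# Copy only the first ten .txt files into the test folder
for txt_file in sorted(src.glob("*.txt"))[:10]:
    shutil.copy(txt_file, dst / txt_file.name)
```

Then point the config’s data_path at the test folder and rerun.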