Stuck on getting ODQA to run with custom data (and a few questions to clear things up)

Hello,

I’m attempting to get ODQA running in DeepPavlov from PyCharm. The goal is to have it answer a scripted set of questions. I’m using the following guide as the basis for my efforts, and almost all of my code is derived from it: Open-domain question answering with DeepPavlov | by Vasily Konovalov | DeepPavlov | Medium. I’m training the ranker model and building the reader model.
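For context, my ranker-training code is essentially the snippet from the guide, with only the data path changed (shown here as a placeholder):

```python
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Load the stock TF-IDF ranker config and point it at a folder of .txt files
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/path/to/my/txt/files"  # placeholder
model_config["dataset_reader"]["dataset_format"] = "txt"

# "Training" here builds the TF-IDF index over the documents
doc_retrieval = train_model(model_config)
```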

Attempting to run the reader on the entirety of my dataset (approximately 1.8 GB of txt files) results in a memory error. Using a small fraction of my dataset, I was able to get the following logs:

The machine I’m using has 32 GB of RAM, of which approximately 24 GB is available for operations. The TF-IDF ranker keeps looping at tokenization and hash counting. Here are my model config settings:

I’m hoping for answers to the following questions:

1. How large a dataset is the ranker supposed to handle? If it can only handle small datasets, am I doing something wrong?
2. Is the loop shown in the first image something to be expected (and waited out), or an error indicating I made a mistake?

Given that DeepPavlov itself and the guide have already taken me a long way, I suspect that if there’s a problem, it’s in my configuration file.

Please let me know if there’s anything I should elaborate upon.

Hey @sieradd, thank you very much for your interest in DeepPavlov!

The ranker is able to handle pretty large datasets; for example, we have a config that retrieves over the entire English Wikipedia. The counting/tokenizing loop is OK: it is building the hash index for the ranker, so I recommend waiting until the entire index is built.

Try starting off with a few MB of text, check that it runs smoothly, then incrementally increase the number of files. I believe that 32 GB of RAM should be enough to build the index for 1.8 GB of text; make sure that your OS doesn’t limit RAM usage. Decreasing the batch size might also help.
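For example, assuming your config follows the stock en_ranker_tfidf_wiki layout, the batch size can be lowered before training; the exact key may differ in your config:

```python
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)

# A smaller batch size lowers peak memory during the tokenize/count loop.
# NOTE: the "dataset_iterator"/"batch_size" keys assume the stock config
# layout; check your own config file for the actual location.
model_config["dataset_iterator"]["batch_size"] = 50

doc_retrieval = train_model(model_config)
```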

Let me know if you need further assistance.

Thank you,

I’m currently running on a 2.3 MB folder. How long should I expect the counting/tokenizing loop to take until the index is complete? Right now it has been busy for over an hour. The current batch size is 100.

It should be pretty fast. Maybe you have a lot of small files; if so, try, say, ten files, just to make sure that everything goes fine.
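A quick way to carve out such a subset (the paths here are placeholders):

```python
import shutil
from pathlib import Path

src = Path("/path/to/full/dataset")   # placeholder: folder with all txt files
dst = Path("/path/to/test/subset")    # placeholder: small test folder
dst.mkdir(parents=True, exist_ok=True)

# Copy only the first ten .txt files into the test folder
for txt_file in sorted(src.glob("*.txt"))[:10]:
    shutil.copy(txt_file, dst / txt_file.name)
```

Then point the config’s data_path at the test folder and rerun.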