I am implementing open-domain question answering (ODQA) on a dataset built from a web wiki. The reader uses a SQuAD-trained model.
The answers I am receiving, however, are discouraging: they make no sense in any context, let alone the context of technical support. The questions are of the format "what is ". The answers are unusable strings like '2 nonvisible returns jsfile', '', 'nbsp'.
My question is: why are the answers not what's expected? I suspect my dataset either has improper content or the wrong shape. See the pictures for a comparison of my dataset with a corpus from a working ODQA example:
My dataset excerpt:
Compared to the corpus:
The size of my dataset is 1.2 GB, stored in .txt files. The SQuAD reader uses the multi_squad_noans_infer config file.
Based on these disparities, I hypothesise that the current form of my dataset is not well suited to a SQuAD-trained reader, and that its content ought to resemble the corpus example more closely. There are also still bits of data that are not human-readable text: is their presence substantially relevant for a dataset of approximately 1 GB? I am looking for answers that explain the causes of the improper output, and whether my hypotheses make sense.
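For reference, here is the kind of preprocessing I am considering to strip the non-human-readable bits (leftover HTML tags and entities such as "&nbsp;") before feeding the .txt files to the ranker/reader. This is a minimal sketch with assumed artifact patterns, not the pipeline's actual cleaning step:

```python
import html
import re

# Assumed artifacts from a wiki scrape: inline scripts, leftover HTML
# tags, HTML entities like &nbsp;, and runs of whitespace. The patterns
# are illustrative guesses, not taken from my actual dataset.
SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

def clean_passage(raw: str) -> str:
    """Reduce scraped wiki text to plain, human-readable prose."""
    text = SCRIPT_RE.sub(" ", raw)      # drop inline scripts first
    text = TAG_RE.sub(" ", text)        # then any remaining tags
    text = html.unescape(text)          # &nbsp; -> non-breaking space, etc.
    text = text.replace("\xa0", " ")    # normalise non-breaking spaces
    return WS_RE.sub(" ", text).strip() # collapse whitespace runs

if __name__ == "__main__":
    raw = "<p>The&nbsp;reader<script>var x;</script> expects <b>plain</b> text.</p>"
    print(clean_passage(raw))  # -> The reader expects plain text.
```

If cleaning like this noticeably changes the reader's answers, that would support the hypothesis that the leftover markup is what is confusing the SQuAD model.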