Dataset yields no useful answers in ODQA

Hello,
I am implementing open-domain question answering (ODQA) on a dataset built from a web wiki. The reader is a SQuAD model.
The answers I am receiving, however, are discouraging, as they make no sense in any context, let alone the context of technical support. The questions have the form "what is ". The answers are unusable strings like “‘2 nonvisible returns jsfile’, ‘’, 'nbsp

'”. The expected answer is one or two sentences about what the term does and what it’s for. The term I test with is plainly defined on the website my dataset originates from.

My question is: why are the answers not what’s expected? I suspect my dataset has either improper content or the wrong shape. See the pictures for a comparison of my dataset and a corpus from a working ODQA example:

My dataset excerpt:


Compared to the corpus:

My dataset is 1.2 GB of txt files. The SQuAD reader uses the multi_squad_noans_infer config file.

Based on the disparities, I hypothesise that the current form of my dataset is not optimal for a SQuAD-trained reader, and that its content ought to resemble the corpus example more closely. There are also still bits of data that aren’t human-readable text: does their presence matter substantially for a dataset of approximately 1 GB? I am looking for responses that indicate what causes the improper output and whether my hypotheses make sense.

Hey @sieradd,

It seems like you are right. The first screenshot doesn’t contain proper sentences (starting with a capital letter) and contains a lot of links.

I suggest you check out our online demo of the SQuAD model and see how it processes your data.
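
As a rough illustration of the kind of cleanup this points to, here is a minimal Python sketch that strips markup, links, and leftover entity tokens, then keeps only text that looks like proper sentences. It uses only the standard library and is not part of any ODQA pipeline; the function names (`clean_text`, `keep_sentences`), the regular expressions, and the `min_words` threshold are all assumptions made for illustration and would need tuning against the real wiki dump.

```python
import html
import re
from typing import List

# All patterns below are illustrative assumptions, not a tested recipe for this dataset.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TAG_RE = re.compile(r"<[^>]+>")
# A bare "nbsp" token is what remains when "&nbsp;" loses its "&" and ";" during scraping.
BARE_ENTITY_RE = re.compile(r"\bnbsp\b")
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def clean_text(raw: str) -> str:
    """Remove HTML entities, tags, and links, then collapse whitespace."""
    text = html.unescape(raw)             # "&nbsp;" becomes a non-breaking space
    text = TAG_RE.sub(" ", text)          # drop anything that looks like a tag
    text = URL_RE.sub(" ", text)          # drop links
    text = BARE_ENTITY_RE.sub(" ", text)  # drop stray entity names
    return re.sub(r"\s+", " ", text).strip()


def keep_sentences(text: str, min_words: int = 4) -> List[str]:
    """Keep chunks that start with a capital letter, end with ., ! or ?, and are long enough."""
    sentences = SENTENCE_SPLIT_RE.split(text)
    return [
        s for s in sentences
        if s[:1].isupper() and s[-1:] in ".!?" and len(s.split()) >= min_words
    ]


if __name__ == "__main__":
    raw = (
        "<p>A widget&nbsp;is a small reusable UI component. "
        "See https://example.com/widget for details. nbsp jsfile returns</p>"
    )
    print(keep_sentences(clean_text(raw)))
```

The sentence filter is deliberately crude: it will also discard legitimate fragments such as headings or list items, so it is worth inspecting what gets dropped on a small sample before running it over the full 1.2 GB.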