I am implementing open-domain question answering (ODQA) on a dataset built from a web wiki. The reader uses a SQuAD-trained model.
The answers I am receiving, however, are discouraging: they make no sense in any context, let alone the context of technical support. The questions are of the format "what is ". The answers are unusable strings like '2 nonvisible returns jsfile', '', 'nbsp'.
My question is: why are the answers not what's expected? I suspect my dataset either has improper content or the wrong shape. See the pictures for a comparison of my dataset with a corpus from a working ODQA example:
My dataset excerpt:
Compared to the corpus:
The size of my dataset is 1.2 GB, stored in .txt files. The SQuAD reader uses the multi_squad_noans_infer config file.
Based on these disparities, I hypothesise that the current form of my dataset is not well suited to a SQuAD-trained reader, and that its content ought to resemble the corpus example more closely. There are also still bits of data that are not human-readable text: is their presence substantially relevant for a dataset of approximately 1 GB? I am looking for answers that explain the causes of the improper output, and whether my hypotheses make sense.
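For reference, here is the kind of preprocessing I am considering to strip the non-human-readable bits (leftover HTML tags and entities such as "&nbsp;") before feeding the .txt files to the ranker/reader. This is a minimal sketch with assumed artifact patterns, not the pipeline's actual cleaning step:

```python
import html
import re

# Assumed artifacts from a wiki scrape: inline scripts, leftover HTML
# tags, HTML entities like &nbsp;, and runs of whitespace. The patterns
# are illustrative guesses, not taken from my actual dataset.
SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

def clean_passage(raw: str) -> str:
    """Reduce scraped wiki text to plain, human-readable prose."""
    text = SCRIPT_RE.sub(" ", raw)      # drop inline scripts first
    text = TAG_RE.sub(" ", text)        # then any remaining tags
    text = html.unescape(text)          # &nbsp; -> non-breaking space, etc.
    text = text.replace("\xa0", " ")    # normalise non-breaking spaces
    return WS_RE.sub(" ", text).strip() # collapse whitespace runs

if __name__ == "__main__":
    raw = "<p>The&nbsp;reader<script>var x;</script> expects <b>plain</b> text.</p>"
    print(clean_passage(raw))  # -> The reader expects plain text.
```

If cleaning like this noticeably changes the reader's answers, that would support the hypothesis that the leftover markup is what is confusing the SQuAD model.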