How to validate a BioBERT model trained on the SQuAD dataset?


I am trying to train a BioBERT model on the SQuAD dataset. I am facing two problems:
1- The answers returned for a given context are not accurate.
2- I am not sure whether the model is properly trained.

Below are my contexts and questions, followed by the answers from the newly trained model.

x = bot(['Coronavirus disease 2019 (COVID-19) is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first case was identified in Wuhan, China, in December 2019. It has since spread worldwide, leading to an ongoing pandemic.'], ['What is coronavirus?'])

y = bot(['DeepPavlov is an open-source conversational AI library built on TensorFlow and Keras. DeepPavlov is designed for development of production ready chatbots and complex conversational systems, research in the area of NLP and, particularly, of dialog systems.'], ['What is deeppavlov?'])

z = bot(['Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.'], ['What is machine learning?'])

a = bot(['Diabetes is a disease in which your blood glucose, or blood sugar, levels are too high. Glucose comes from the foods you eat. Insulin is a hormone that helps the glucose get into your cells to give them energy.'], ['What is diabetes?'])

[['ongoing pandemic'], [242], [1.2818810939788818]]

[['built on TensorFlow and Keras'], [55], [1.2005016803741455]]

[['It is a branch of artificial intelligence based on'], [88], [1.1758620738983154]]

[['levels are too high. Glucose'], [67], [1.076728105545044]]
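
For reading these outputs: each result seems to contain the answer text, a number, and a score. A minimal sketch, assuming the number is the 0-based character offset at which the answer starts in the context (an assumption, checked here against the first example only; `span_matches` is a made-up helper name):

```python
# Assumption: the second element of each result is the 0-based character
# offset at which the predicted answer starts inside the context.
context = (
    "Coronavirus disease 2019 (COVID-19) is a contagious disease caused by "
    "severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first "
    "case was identified in Wuhan, China, in December 2019. It has since "
    "spread worldwide, leading to an ongoing pandemic."
)
answer, start = "ongoing pandemic", 242  # values copied from the first output


def span_matches(context: str, answer: str, start: int) -> bool:
    """Return True if `answer` occurs in `context` exactly at offset `start`."""
    return context[start:start + len(answer)] == answer


print(span_matches(context, answer, start))
```

If the offsets line up like this, the pre- and post-processing steps are probably mapping spans correctly, and the poor answers would point at the model weights rather than at the pipeline.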

Any feedback will be of great help!
Thank you!

This is the config file:

{
  "dataset_reader": {
    "class_name": "squad_dataset_reader",
    "data_path": "{DOWNLOADS_PATH}/squad/"
  },
  "dataset_iterator": {
    "class_name": "squad_iterator",
    "seed": 1337,
    "shuffle": true
  },
  "chainer": {
    "in": ["context_raw", "question_raw"],
    "in_y": ["ans_raw", "ans_raw_start"],
    "pipe": [
      {
        "class_name": "bert_preprocessor",
        "vocab_file": "{DOWNLOADS_PATH}/biobert_models/biobert_v1.1_pubmed/vocab.txt",
        "do_lower_case": false,
        "max_seq_length": 384,
        "in": ["question_raw", "context_raw"],
        "out": ["bert_features"]
      },
      {
        "class_name": "squad_bert_mapping",
        "do_lower_case": false,
        "in": ["context_raw", "bert_features"],
        "out": ["subtok2chars", "char2subtoks"]
      },
      {
        "class_name": "squad_bert_ans_preprocessor",
        "do_lower_case": false,
        "in": ["ans_raw", "ans_raw_start", "char2subtoks"],
        "out": ["ans", "ans_start", "ans_end"]
      },
      {
        "class_name": "squad_bert_model",
        "bert_config_file": "{DOWNLOADS_PATH}/biobert_models/biobert_v1.1_pubmed/bert_config.json",
        "pretrained_bert": "{DOWNLOADS_PATH}/biobert_models/biobert_v1.1_pubmed/model.ckpt",
        "save_path": "{MODELS_PATH}/squad_biobert/model",
        "load_path": "{MODELS_PATH}/squad_biobert/model",
        "keep_prob": 0.5,
        "learning_rate": 2e-05,
        "learning_rate_drop_patience": 2,
        "learning_rate_drop_div": 2.0,
        "in": ["bert_features"],
        "in_y": ["ans_start", "ans_end"],
        "out": ["ans_start_predicted", "ans_end_predicted", "logits"]
      },
      {
        "class_name": "squad_bert_ans_postprocessor",
        "in": ["ans_start_predicted", "ans_end_predicted", "context_raw", "bert_features", "subtok2chars"],
        "out": ["ans_predicted", "ans_start_predicted", "ans_end_predicted"]
      }
    ],
    "out": ["ans_predicted", "ans_start_predicted", "logits"]
  },
  "train": {
    "show_examples": false,
    "test_best": false,
    "validate_best": true,
    "log_every_n_batches": 250,
    "val_every_n_batches": 500,
    "batch_size": 10,
    "pytest_max_batches": 2,
    "pytest_batch_size": 5,
    "validation_patience": 10,
    "metrics": [
      {
        "name": "squad_v1_f1",
        "inputs": ["ans", "ans_predicted"]
      },
      {
        "name": "squad_v1_em",
        "inputs": ["ans", "ans_predicted"]
      },
      {
        "name": "squad_v2_f1",
        "inputs": ["ans", "ans_predicted"]
      },
      {
        "name": "squad_v2_em",
        "inputs": ["ans", "ans_predicted"]
      }
    ],
    "tensorboard_log_dir": "{MODELS_PATH}/squad_biobert/logs"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "C:/Users/Amjad Enterprises/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads"
    },
    "requirements": [],
    "download": [
      {
        "url": "https:/",
        "subdir": "{DOWNLOADS_PATH}/biobert_models"
      },
      {
        "url": "https:/",
        "subdir": "{MODELS_PATH}"
      }
    ]
  }
}
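
One detail worth checking in the config as posted: it uses a `{MODELS_PATH}` placeholder, but the `variables` section shown only lists `ROOT_PATH` and `DOWNLOADS_PATH`. The definition may simply have been lost when pasting. A minimal sketch of how to spot such a mismatch, using only strings copied from the post (`undefined_placeholders` is a hypothetical helper, not part of DeepPavlov):

```python
import re

# Variables and path strings copied from the config as posted above.
variables = {
    "ROOT_PATH": "C:/Users/Amjad Enterprises/.deeppavlov",
    "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
}
paths_used = [
    "{DOWNLOADS_PATH}/squad/",
    "{DOWNLOADS_PATH}/biobert_models/biobert_v1.1_pubmed/vocab.txt",
    "{MODELS_PATH}/squad_biobert/model",
    "{MODELS_PATH}/squad_biobert/logs",
]


def undefined_placeholders(strings, variables):
    """Return the sorted {NAME} placeholders that have no entry in `variables`."""
    used = set()
    for s in list(strings) + list(variables.values()):
        used.update(re.findall(r"\{(\w+)\}", s))
    return sorted(used - set(variables))


print(undefined_placeholders(paths_used, variables))  # ['MODELS_PATH']
```

If the placeholder really is undefined in the actual file, `save_path` and `load_path` would not point where expected, so it is worth confirming before looking at the metrics.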

During training, the model reports EM and F1 metrics on the validation set. Did you check them?
On the SQuAD dataset, the model should reach about 80 EM and 88 F1 (docs).
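
For reference, here is a rough sketch of how EM and F1 are computed for a single prediction, following the usual SQuAD v1.1 convention (lowercase, strip punctuation and articles, then compare tokens). This is not DeepPavlov's own implementation, and the reference answer below is hypothetical, picked only for illustration:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer; the prediction is one of the outputs above.
reference = "a method of data analysis"
prediction = "It is a branch of artificial intelligence based on"
print(exact_match(prediction, reference), f1(prediction, reference))
```

The reported metrics are, as far as I understand, these per-example scores averaged over the validation set and scaled to 0-100, so a prediction like the one above contributes 0 to EM and only a small amount to F1.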

Here are the numbers.
{"valid": {"eval_examples_count": 10570, "metrics": {"squad_v1_f1": 88.4918, "squad_v1_em": 80.8828, "squad_v2_f1": 88.2996, "squad_v2_em": 80.7001}}}

They are the same as you said!

These numbers are suspiciously close to the ones we get with the default BERT-base. Are they for BioBERT after training, or for BERT-base?

I apologise for the inconvenience. I mistakenly copied the numbers from the BERT-base run.

Give me some time and I will post the scores for the BioBERT model.

These are the scores for the BioBERT model.

{"valid": {"eval_examples_count": 10570, "metrics": {"squad_v1_f1": 6.6215, "squad_v1_em": 0.4841, "squad_v2_f1": 6.6023, "squad_v2_em": 0.4825}, "time_spent": "0:06:37", "epochs_done": 0, "batches_seen": 0, "train_examples_seen": 0, "impatience": 0, "patience_limit": 10}}
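
One thing to note in this report: `epochs_done`, `batches_seen` and `train_examples_seen` are all 0, so these scores seem to describe the model before any fine-tuning step was run, not a trained model. A small sketch that makes this check explicit (the helper name `evaluated_before_training` is made up for illustration):

```python
import json

# The validation report quoted above, written with straight quotes.
log_line = (
    '{"valid": {"eval_examples_count": 10570, "metrics": {"squad_v1_f1": 6.6215, '
    '"squad_v1_em": 0.4841, "squad_v2_f1": 6.6023, "squad_v2_em": 0.4825}, '
    '"time_spent": "0:06:37", "epochs_done": 0, "batches_seen": 0, '
    '"train_examples_seen": 0, "impatience": 0, "patience_limit": 10}}'
)


def evaluated_before_training(report: dict) -> bool:
    """True if the report was produced before any training batch was seen."""
    valid = report["valid"]
    return valid["batches_seen"] == 0 and valid["train_examples_seen"] == 0


report = json.loads(log_line)
print(evaluated_before_training(report))  # True
```

If that reading is right, the next thing to compare is a report taken after training has actually run for some batches; an F1 that stays near 6 at that point would suggest the pretrained checkpoint or vocabulary is not being loaded as intended.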