In DeepPavlov, the BERT model for Question Answering is trained on the SQuAD v1.1 dataset. The model scores about 88 F1, so it is not perfect and can make mistakes.
The SQuAD v1.1 dataset has been criticised for the high lexical overlap between questions and contexts, which may be what is causing the problem in your example. Other weaknesses of this dataset are covered in this paper: https://arxiv.org/abs/1707.07328/
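To see why that overlap is a problem, here is a toy sketch (with made-up sentences, not real SQuAD data): simply counting words shared with the question often already points at the answer sentence, so a model can look strong on SQuAD v1.1 without doing much real reading comprehension.

```python
# Toy illustration of the lexical-overlap shortcut in SQuAD-v1.1-style data.
# The question and context sentences below are invented for illustration.

def overlap(question: str, sentence: str) -> int:
    """Number of distinct lowercase tokens shared by question and sentence."""
    q_tokens = set(question.lower().split())
    s_tokens = set(sentence.lower().split())
    return len(q_tokens & s_tokens)

question = "Who discovered penicillin in 1928 ?"
context_sentences = [
    "The hospital was founded in 1900 .",
    "Alexander Fleming discovered penicillin in 1928 .",
    "Penicillin is widely used today .",
]

# Picking the sentence with the highest word overlap already selects the
# answer-bearing sentence, with no comprehension involved.
best = max(context_sentences, key=lambda s: overlap(question, s))
print(best)  # "Alexander Fleming discovered penicillin in 1928 ."
```

The adversarial examples in the paper linked above exploit exactly this: they append a distractor sentence with high question overlap, which fools models that rely on the shortcut.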
You can try to train the model on the SQuAD 2.0 dataset (which includes more sophisticated examples) and/or on the adversarial examples from the paper mentioned above, and/or train BERT-large.
You might also look into multi-hop reasoning datasets for question answering, such as HotPotQA: https://hotpotqa.github.io/