Face recognition technology is now mature enough to reach commercial products, such as smart phones or tablets. However, it still needs to increase robustness against imposter attacks. In this regard, face Presentation Attack Detection (face-PAD) is a key component in providing trustable facial access to digital devices. Despite the success of several face-PAD works in publicly available datasets, most of them fail to reach the market, revealing the lack of evaluation frameworks that represent realistic settings. Here, an extensive analysis of the generalisation problem in face-PAD is provided, jointly with an evaluation strategy based on the aggregation of most publicly available datasets and a set of novel protocols to cover the most realistic settings, including a novel demographic bias analysis. Besides, a new finegrained categorisation of presentation attacks and instruments is provided, enabling higher flexibility in assessing the generalisation of different algorithms under a common framework. As a result, GRAD-GPAD v2, a comprehensive and modular framework is presented to evaluate the performance of face-PAD approaches in realistic settings, enabling accountability and fair comparison of most face-PAD approaches in the literature.This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.