Several works have applied machine learning methods to predict COVID-19 from chest X-ray images. To be useful, however, a machine learning model must be robust enough to give reliable predictions for any target population, not only for the population from which the training data were drawn. Despite the importance of this issue, current works frequently do not test the generalizability of their models. To test the generalizability of three convolutional neural network (CNN) models, this paper investigates four databases obtained from different data sources in an internal-and-external validation procedure. All models are trained both with and without lung segmentation as a pre-processing step. The results show how important external evaluation is for avoiding excessively optimistic and inaccurate performance estimates.
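
As an illustration of what an internal-and-external validation procedure can look like, the following minimal Python sketch trains one model per data source, scores it on a held-out split of that same source (internal), and then scores it on every other source in full (external). It is a sketch under stated assumptions, not the protocol used in this paper: the nearest-centroid classifier is a deliberately simple stand-in for a CNN, the synthetic feature vectors stand in for chest X-ray images, and all names (`make_synthetic_source`, `internal_external_matrix`, the `shift` parameter, the source labels) are hypothetical.

```python
import numpy as np


def make_synthetic_source(shift, n_per_class=100, seed=0):
    """Hypothetical data source: two classes in 2-D, offset by `shift`.

    Different `shift` values mimic the distribution differences between
    databases collected at different sites.
    """
    rng = np.random.default_rng(seed)
    x0 = rng.normal(loc=0.0 + shift, scale=0.5, size=(n_per_class, 2))
    x1 = rng.normal(loc=4.0 + shift, scale=0.5, size=(n_per_class, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


def fit_centroids(X, y):
    """Stand-in for model training: one mean feature vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(model, X):
    classes, centroids = model
    # Distance from every sample to every class centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def accuracy(model, X, y):
    return float((predict(model, X) == y).mean())


def internal_external_matrix(datasets, test_fraction=0.2, seed=0):
    """Return {train_source: {eval_source: accuracy}}.

    Diagonal entries are internal (held-out split of the training source);
    off-diagonal entries are external (the whole of another source).
    """
    rng = np.random.default_rng(seed)
    results = {}
    for train_name, (X_tr, y_tr) in datasets.items():
        # Random split of the training source into train and held-out parts.
        idx = rng.permutation(len(y_tr))
        n_test = max(1, int(len(y_tr) * test_fraction))
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = fit_centroids(X_tr[train_idx], y_tr[train_idx])

        row = {}
        for eval_name, (X_ev, y_ev) in datasets.items():
            if eval_name == train_name:
                row[eval_name] = accuracy(model, X_tr[test_idx], y_tr[test_idx])
            else:
                row[eval_name] = accuracy(model, X_ev, y_ev)
        results[train_name] = row
    return results


if __name__ == "__main__":
    # Four hypothetical sources with increasing distribution shift.
    sources = {
        f"source_{i}": make_synthetic_source(shift=s, seed=i)
        for i, s in enumerate([0.0, 0.5, 1.5, 3.0])
    }
    for train_name, row in internal_external_matrix(sources).items():
        print(train_name, {k: round(v, 3) for k, v in row.items()})
```

Reading the resulting table row by row shows the point made above: the diagonal (internal) score alone says nothing about how the same model behaves on data from a different source, which is why the off-diagonal (external) scores need to be reported alongside it.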