The spread of online hate has become a major problem for newspapers that host comment sections. As a result, there is growing interest in using machine learning (ML) and natural language processing (NLP) for (semi-)automated abusive language detection, either to reduce manual comment moderation costs or to avoid shutting down comment sections altogether. However, much of the past work on abusive language detection with ML uses random train-test splitting procedures that assume an unrealistically static language environment. In this paper, we show, using a new German newspaper comments dataset, that a time-stratified evaluation procedure provides a more realistic measure of a classifier's performance on future data. We also show that classifier performance can degrade quickly as the training data grows more outdated and language and news coverage evolve. Further, we demonstrate that the performance of classifiers trained on data from before the COVID-19 pandemic drops sharply when they are evaluated on COVID-era comments. Our findings suggest that when standard ML techniques are applied naively to abusive language detection, a classifier will fail to meet its advertised evaluation benchmarks in a real-world deployment.
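The core distinction between the two evaluation procedures can be illustrated with a minimal sketch. The dataset, field layout, and cutoff date below are hypothetical, chosen only to contrast a random split with a time-stratified one:

```python
import random
from datetime import datetime

# Hypothetical timestamped comments: (posted_at, text, is_abusive).
# These records are illustrative, not drawn from the paper's corpus.
comments = [
    (datetime(2019, 1, 5), "comment a", 0),
    (datetime(2019, 6, 1), "comment b", 1),
    (datetime(2019, 11, 20), "comment c", 0),
    (datetime(2020, 3, 15), "comment d", 1),
]

def random_split(data, test_frac=0.25, seed=0):
    """Standard random split: ignores time, mixes eras in train and test."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def time_stratified_split(data, cutoff):
    """Train only on comments posted before `cutoff`; test on later ones,
    mimicking deployment on genuinely future data."""
    train = [d for d in data if d[0] < cutoff]
    test = [d for d in data if d[0] >= cutoff]
    return train, test

train, test = time_stratified_split(comments, datetime(2020, 1, 1))
```

Under the time-stratified split, every test comment postdates every training comment, so the measured score reflects how the classifier copes with topic and language drift rather than interpolating within a single shuffled era.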