Forecasting air pollution is crucial for understanding the phenomenological and contextual variety of mechanisms underlying pollution in a particular area or region. Analyzing high-dimensional data with spatial and temporal dependencies is a major challenge for traditional machine learning approaches to air pollution forecasting. The unprecedented advances in deep learning, applied to massive quantities of IoT sensor data, raise high hopes for the future of air pollution forecasting. Drawing on past and current experiences and solutions advanced in a growing body of research on the topic, we proposed and evaluated four encoder-decoder architectures with attention for forecasting particulate matter (PM) levels, with a general applicability that is both location- and season-independent. The attention mechanism was employed to better learn representations of the time series data. By selectively focusing attention on the elements that contribute the most, the proposed models obtained performance gains. Sensor measurements obtained in the past 7 or 14 days, as well as a set of external factors known to affect pollutant levels, such as weather conditions, location, and time frame (e.g., season, time of day), were taken into account. In this research, we carried out a case study on PM2.5 forecasting for the city of Skopje, allowing us to discuss the relevance of the results obtained by the proposed solutions. The ability to draw valid inferences from data is critically important when forecasting air pollution. In this respect, any air pollution forecasting model based on deep learning would require a component for generating realistic data samples to augment the time series dataset. We address the challenges of missing data by proposing and evaluating two adversarial networks for data augmentation.
We conducted a number of experiments to investigate the performance of the predictive models with and without augmentation of the training datasets, and obtained performance gains when the proposed adversarial models were used for data augmentation. The deep neural architectures advanced in this research are general enough to be used for predictive and generative tasks for other pollutants; moreover, they could be adopted for related tasks handling time series data in other domains.