Abstract. In classification tasks, data labeling is an expensive and time-consuming process; hence active learning, which queries labels for only a small representative portion of the data, is becoming increasingly important. However, few works address the challenges of the data stream setting, because most active learning methods are designed for non-streaming settings. Building on this status quo, we combine an evidence-based uncertainty sampling strategy with a split sampling strategy and propose a new sampling strategy for active learning over evolving data streams that takes full advantage of the strengths of each. First, the original data stream is randomly divided into two substreams. Instances from one substream are labeled according to a high evidence-focused uncertainty strategy, while instances from the other substream are labeled by a random strategy for detecting true concept drifts. Second, we introduce a sliding window into the high evidence-focused uncertainty strategy to determine whether an instance is a conflict-uncertainty instance. Our strategy thus addresses the effective use of evidence in the data stream setting and selects more representative instances from evolving data streams for training a model. Finally, in experiments on four benchmark datasets, our approach shows good predictive performance compared with state-of-the-art active learning strategies.
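The split-sampling idea in the abstract — routing each arriving instance to either an uncertainty-driven substream or a randomly labeled substream — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `split_prob`, `uncertainty_threshold`, and `random_label_prob` parameters, and the placeholder confidence function are all assumptions for demonstration.

```python
import random

def split_sampling_stream(stream, model_confidence, uncertainty_threshold=0.6,
                          split_prob=0.5, random_label_prob=0.1):
    """Decide, for each instance in the stream, whether to query its label.

    Each instance is randomly routed to one of two substreams:
    - substream A queries labels for instances the model is uncertain about;
    - substream B queries labels uniformly at random, which gives unbiased
      probes that can reveal concept drift even where the model is confident.
    """
    queried = []  # (index, reason) pairs for instances whose labels we request
    for i, x in enumerate(stream):
        if random.random() < split_prob:
            # Substream A: uncertainty-focused labeling.
            if model_confidence(x) < uncertainty_threshold:
                queried.append((i, "uncertainty"))
        else:
            # Substream B: random labeling for drift detection.
            if random.random() < random_label_prob:
                queried.append((i, "random"))
    return queried

# Toy usage with a dummy confidence function (odd items look "uncertain"):
random.seed(0)
decisions = split_sampling_stream(range(100),
                                  lambda x: 0.3 if x % 2 else 0.9)
```

The key design point is that the two substreams serve different purposes: uncertainty sampling concentrates the labeling budget near the decision boundary, while the random substream retains a small unbiased sample of the stream so that drifts outside the current boundary region are not missed.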
Introduction

Nowadays, more and more data are generated continuously by networks, such as sensor networks, social networks, web applications, and financial activities. Unlike traditional datasets, data items within a data stream are temporally ordered, fast-changing, generally large-scale, and potentially infinite [1].

For learning predictive models on a data stream, it is in principle possible to access the true labels of instances continuously. Unfortunately, labeled instances in data streams are very scarce in practice: only a very limited number of labeled instances can be collected, and they can hardly provide enough information to train models with good generalization capabilities [2]. Manual labeling is expensive, especially in terms of time. Moreover, as time passes, the relationship between attributes and labels might change, as in spam identification and vaccine production. To obtain the true label, one must scan the mail or perform a laboratory test, which is time-consuming. Hence, querying labels for only a small representative subset of the stream has become an effective solution. Such a learning setting is known as active learning. In pool-based and online environments [3,4], active learning has received wide attention and research.

In the data stream setting, active learning is further divided into online active learning and active learning in data streams. The main difference between the two branches is whether concept drifts exist. Online active learning has a generally accepted assumption that th...