Process mining is a relatively new discipline that bridges traditional process modeling and data mining. Process discovery, one of its most critical parts, aims at discovering process models automatically from event logs. As with other data mining techniques, the performance of existing process discovery algorithms can degrade when activity labels are missing from event logs. In this paper, we assume that the control-flow information in event logs can be useful in repairing missing activity labels. We propose an LSTM-based prediction model that takes both the prefix and suffix sequences of the events with missing activity labels as input to predict those labels. Additional attributes of event logs are also used to improve performance. Our evaluation on several publicly available datasets shows that the proposed method performs consistently better than existing methods at repairing missing activity labels in event logs.
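To make the prefix and suffix input concrete, the following is a minimal sketch of how such context windows could be extracted from a single trace before being passed to a sequence model. It is not the model described above and contains no LSTM. The function name `prefix_suffix_samples`, the window size `k`, the use of `None` as the missing-label marker, and the `<PAD>` and `<UNK>` tokens are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

PAD = "<PAD>"  # hypothetical padding token for short contexts
UNK = "<UNK>"  # hypothetical token for other missing labels inside a context


def prefix_suffix_samples(
    trace: List[Optional[str]], k: int = 3
) -> List[Tuple[int, List[str], List[str]]]:
    """Return (position, prefix, suffix) for every missing label in a trace.

    A label is treated as missing when it is None (an assumption).
    Prefixes are left-padded and suffixes right-padded to length k.
    """
    samples = []
    for i, label in enumerate(trace):
        if label is not None:
            continue
        # Up to k events before and after the missing position.
        before = trace[max(0, i - k):i]
        after = trace[i + 1:i + 1 + k]
        # Replace any other missing labels in the context with UNK.
        before = [x if x is not None else UNK for x in before]
        after = [x if x is not None else UNK for x in after]
        prefix = [PAD] * (k - len(before)) + before
        suffix = after + [PAD] * (k - len(after))
        samples.append((i, prefix, suffix))
    return samples
```

Under these assumptions, each sample pairs the position of a missing label with two fixed-length token lists, which could then be encoded and fed to separate forward and backward sequence encoders.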
Process mining aims to gain knowledge of business processes by discovering process models from event logs generated by information systems. The insights it reveals rely heavily on the quality of those event logs. Activities extracted from different data sources, or entered as free text within the same system, may carry inconsistent labels. Such inconsistency leads to redundant activity labels, that is, labels that differ in syntax but share the same behaviour. Redundant activity labels introduce unnecessary complexity into event logs. Identifying these labels through data-driven process discovery is difficult and relies heavily on human intervention. Neither existing process discovery algorithms nor event data preprocessing techniques can resolve such redundancy efficiently. In this paper, we propose a multi-view approach that automatically detects redundant activity labels by using not only context-aware features, such as control-flow relations and attribute values, but also semantic features from the event logs. Our evaluation on several publicly available datasets and a real-life case study demonstrates that our approach can efficiently detect redundant activity labels, even those that occur with low frequency. The proposed approach can add value to the preprocessing step by generating more representative event logs.
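As a rough illustration of one control-flow feature, the sketch below scores two activity labels by how much their direct predecessors and successors overlap across a log. This covers only a single, simplified view and is not the multi-view approach itself. The function names, the `<START>` and `<END>` markers, and the choice of Jaccard similarity are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def context_sets(
    log: List[List[str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Collect the direct predecessors and successors of every label."""
    preds: Dict[str, Set[str]] = defaultdict(set)
    succs: Dict[str, Set[str]] = defaultdict(set)
    for trace in log:
        # Artificial boundary markers (an assumption) so that the first
        # and last events also have a context.
        padded = ["<START>"] + list(trace) + ["<END>"]
        for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
            preds[cur].add(prev)
            succs[cur].add(nxt)
    return preds, succs


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def context_similarity(log: List[List[str]], x: str, y: str) -> float:
    """Average predecessor and successor overlap of labels x and y."""
    preds, succs = context_sets(log)
    return (jaccard(preds[x], preds[y]) + jaccard(succs[x], succs[y])) / 2
```

A high score suggests that two labels occupy similar positions in the control flow and may be candidates for redundancy. In practice such a score would need to be combined with attribute-value and semantic features, as the abstract describes.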