Rapid Parallel Detection of Distance-based Outliers in Time Series using MapReduce

Sorin N. Ciolofan, Florin Pop, Mariana Mocanu, Valentin Cristea

Abstract


Time series analysis is crucial in many knowledge domains, ranging from micro- and macroeconomics, industry, tourism, and health to hydrology, meteorology, agriculture, and demography. Interest in processing time series data efficiently and meaningfully has increased over the last decade with the spread of sensor networks and Cyber-Physical Systems, which produce huge amounts of measured data. Outlier detection is a key issue for Quality Assurance of time series data; its goal is to detect objects whose behavior differs markedly from the expected one. Once identified, these objects are either removed or corrected. In this paper we propose a highly scalable parallel data processing algorithm for outlier ranking based on the distance between data objects. In contrast to existing sequential implementations, the proposed algorithm relies on the parallel processing offered by the MapReduce paradigm. Using real monitored solar data for experimental validation, we show a dramatic improvement in running time for large time series archives (on the order of millions of records).

Keywords


Time Series; Outliers; Distributed Processing; MapReduce; Data Mining
