Bachelor’s thesis

Detection of anomalies in water main flow time series

Dmytro Molokoiedov

Department of Knowledge Engineering
Supervisor: Mgr. Petr Šimánek

June 3, 2020


Acknowledgements

I would like to express my special thanks and gratitude to my supervisor Mgr. Petr Šimánek, who helped me with a lot of the research and gave me many pieces of advice. I also want to thank my parents and friends, who supported me a lot during this project.


Declaration

I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis.

I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular that the Czech Technical University in Prague has the right to conclude a license agreement on the utilization of this thesis as a school work under the provisions of Article 60 (1) of the Act.

In Prague on June 3, 2020


Czech Technical University in Prague
Faculty of Information Technology

© 2020 Dmytro Molokoiedov. All rights reserved.

This thesis is school work as defined by Copyright Act of the Czech Republic. It has been submitted at Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act and its usage without author’s permission is prohibited (with exceptions defined by the Copyright Act).

Citation of this thesis

Molokoiedov, Dmytro. Detection of anomalies in water main flow time series.

Bachelor’s thesis. Czech Technical University in Prague, Faculty of Information Technology, 2020.


Abstrakt

Vodní rozvodná síť je komplexní systém potrubí a uzlů, které jsou vzájemně propojeny a kterými prochází voda. Jedním z nejdůležitějších problémů v takových sítích jsou úniky. Tento článek představuje několik metod řešení tohoto problému.

Byl zaveden benchmark pro přesnější vyhodnocení algoritmů a byly implementovány metody, jako jsou one-class SVM, izolační les a LSTM. Data použitá v této práci byla uměle vytvořena na základě skutečných historických dat pomocí frameworku LeakDB.

Klíčová slova časové řady, detekce anomálií, strojové učení, rekurentní neuronové sítě, detekce úniku, LeakDB


Abstract

The water distribution network is a complex system of pipes and nodes connected to each other, through which water flows. One of the most important problems in such networks is leakages. This paper presents several methods to solve this problem. A benchmark was set for a more accurate evaluation of the algorithms, and methods such as one-class SVM, Isolation forest, and LSTM were implemented.

The data used in this work was artificially created based on real historical data using the LeakDB framework.

Keywords time series, anomaly detection, machine learning, recurrent neural networks, leakage detection, LeakDB


Contents

1 Introduction
    1.1 Goals
2 Theory
    2.1 Anomaly detection
    2.2 State space models
        2.2.1 ARIMA
        2.2.2 Exponential Smoothing
    2.3 Machine learning
        2.3.1 Elliptic Envelope
        2.3.2 One-Class SVM
        2.3.3 Isolation Forest
    2.4 Deep Learning
        2.4.1 Recurrent Neural Networks
        2.4.2 Convolutional Neural Networks
        2.4.3 Autoencoders
3 Realisation
    3.1 Frameworks
        3.1.1 TensorFlow
        3.1.2 Scikit-learn
        3.1.3 LeakDB
    3.2 Data set preparation
    3.3 Benchmark
    3.4 Anomaly detection system
        3.4.1 Isolation forest
        3.4.2 One-class SVM
        3.4.3 LSTM
    3.5 Evaluation
        3.5.1 Metrics
    3.6 Comparing methods
4 Conclusion
Bibliography
A Acronyms
B Contents of enclosed CD


List of Figures

2.1 One-class SVM, from https://www.researchgate.net/publication/242572058_One-Class_Support_Vector_Machines_for_Protein_Protein_Interactions_Prediction/figures?lo=1
2.2 Isolation forest, from https://donghwa-kim.github.io/iforest.html
2.3 Recurrent neural network, from https://necst.it/exploring-boundary-accuracy-performances-recurrent-neural-networks/
2.4 LSTM architecture, from https://www.researchgate.net/publication/324600237_Improving_Long-Horizon_Forecasts_with_Expectation-Biased_LSTM_Networks/figures?lo=1
2.5 CNN architecture, from https://commons.wikimedia.org/wiki/File:Typical_cnn.png
2.6 Autoencoder architecture, from https://www.wikiwand.com/en/Autoencoder
3.1 Water distribution network architecture, from https://www.researchgate.net/publication/257672905_Design_of_Water_Distribution_Networks_using_a_Pseudo-Genetic_Algorithm_and_Sensitivity_of_Genetic_Operators
3.2 Dataset
3.3 Dataset for Isolation forest and one-class SVM
3.4 LSTM with Mahalanobis distance, from https://www.renom.jp/notebooks/tutorial/time_series/lstm-anomalydetection/notebook.html
3.5 Mahalanobis distance based on pressure
3.6 Actual test leaks
3.7 Predicted test leaks


List of Tables

3.1 Detectors and their score


Chapter 1

Introduction

The water distribution network is a complex system of pipes and nodes connected to each other, through which water flows. Thanks to it, people can use such amenities as a shower, a toilet, and access to drinking water, without which it is difficult to imagine the existence and development of civilization. Many problems constantly arise in this system, such as pump breakdowns, clogging of pipes, and water pollution; one example of such a network is the sewage system. Among all these problems, one deserves special attention: leakages, which can have a variety of causes. As a result, people have to endure low water pressure or supply interruptions, and companies lose a lot of money every day until the leak is fixed. To fix this problem, many sewer companies still use manual searches, which can take weeks until a leak is found. To facilitate this work, people have come up with many different algorithms that help in solving the problem. However, only with the appearance of large amounts of data and sensors have algorithms and methods been created that can completely replace people in solving such problems.

This work presents the machine learning and deep learning methods that were created to solve precisely this problem of leaks, and a comparison of these methods is given. The data used in this work was artificially created using the LeakDB framework and based on real historical data. The next chapter describes the algorithms that are often used to detect anomalies. The third chapter describes the frameworks that were used to create the algorithms, as well as the implementation of the methods and a comparison of the results. The final chapter presents the conclusions and ideas for future work on this issue.

1.1 Goals

This bachelor's thesis has four goals:

• Get familiar with machine learning techniques for anomaly detection in time series in the environmental domain.


• Design and evaluate techniques for the detection of leaks in water mains based on flow and pressure in the system of water channels.

• Demonstrate the capability of the chosen algorithm to accurately and early detect leaks on the provided dataset.

• Compare different methods.


Chapter 2

Theory

2.1 Anomaly detection

Anomaly detection is the identification of rare items, events, or observations that raise suspicions by differing significantly from the majority of the data [1]. Anoma- lies in the data can be attributed to one of three main types:

• Point anomalies arise in a situation where a single data point can be considered abnormal with respect to the rest of the data. This type of anomaly is the most easily recognizable; most of the existing methods are designed to recognize point anomalies.

• Contextual anomalies are observed if the data point is abnormal only in a specific context.

• Collective anomalies occur when a sequence of related data points is abnormal with respect to the whole data set. A single data point in this sequence may not be a deviation; however, the joint occurrence of such points is a collective anomaly.

Depending on the type of data available for implementing the algorithm, anomaly detection can be performed in one of the three modes listed below:

1. Supervised anomaly detection. This technique requires a training sample that fully represents the system and includes both normal and abnormal classes. The algorithm works in two stages: training and recognition. In the first stage, a model is built; in the second, unlabeled instances are compared against it. In most cases it is assumed that the data do not change their statistical characteristics, otherwise it becomes necessary to update the classifier. The main complexity of supervised algorithms is the formation of training data. Often the anomalous class is represented by a significantly smaller number of instances than the normal one, which can lead to inaccuracies in the resulting model. In such cases, artificial anomaly generation is used.


2. Semi-supervised anomaly detection. The initial data in this approach represent only the normal class. Having learned one class of data, the system can determine whether new data belong to it and thereby identify the opposite class. Algorithms that work in semi-supervised mode do not require information about the anomalous class; as a result, they are more widely applicable and can recognize deviations without prior information about them.

3. Unsupervised anomaly detection. It is used in the absence of a priori information about the data. Unsupervised algorithms are based on the assumption that abnormal points are much less common than normal ones. The data are processed, and the most distant points are labeled as anomalies. To apply this technique, the entire data set must be available.

Anomaly detection is a relevant problem in many areas, such as intrusion detection, fraud detection, industrial damage detection, medicine and public health, image processing, anomaly detection in text data, sensor networks, and other domains. Most of the techniques look for individual objects that differ from normal ones but do not take the sequential aspect of the data into consideration. Such anomalies are also referred to as point anomalies. Anomaly detection in time series data differs somewhat from finding anomalies in non-time-series data, because properties such as seasonality and stationarity need to be taken into account.

2.2 State space models

The state-space model (SSM) refers to a class of probabilistic graphical models that describe the probabilistic dependence between a latent state variable and the observed measurement. The state or the measurement can be either continuous or discrete [2].

2.2.1 ARIMA

Parameter estimation and forecasting procedures assume that a mathematical model of the process is known. In real data, there are often no distinct regular components. Some observations contain a significant error, which prevents not only the selection of regular components but also forecasting. The autoregressive integrated moving average (ARIMA) methodology, developed by Box and Jenkins [3], addresses this. The method is extremely popular in many applications, and practice has confirmed its power and flexibility. The model considers two main processes:

1. The autoregressive process (AR): if the time series contains elements that are sequentially dependent on each other, then this dependence can be represented as follows:

   y_t = \upsilon + a_1 y_{t-1} + a_2 y_{t-2} + a_3 y_{t-3} + \cdots + \varepsilon_t

   where \upsilon is a constant (free term) and a_1, a_2, a_3, \ldots are the autoregression parameters. Each element of the series is the sum of a random component and a linear combination of previous observations.

2. The moving average process (MA): unlike the previous process, in the moving average process each observation is affected by the cumulative effect of previous errors. Such a process has the following form:

   y_t = \mu + \varepsilon_t - b_1 \varepsilon_{t-1} - b_2 \varepsilon_{t-2} - b_3 \varepsilon_{t-3} - \cdots

   where \mu is a constant and b_1, b_2, b_3, \ldots are the moving average parameters. This means that the current observation of the series is the sum of the random component at this moment and a linear combination of random influences at previous points in time.

The standard notation is ARIMA(p, d, q), where the parameters are replaced by numbers to quickly indicate the particular model. The model parameters mean:

p is the number of lagged observations included in the model, also known as the number of time lags;

d is the degree of differencing;

q is the order of the moving-average model.

This algorithm, in conjunction with others, can be used in anomaly detection.
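As an illustration of how an ARIMA model can be used in anomaly detection, the following minimal sketch fits a model to a synthetic flow series and flags points with unusually large residuals. The statsmodels library, the (2, 1, 1) order, and the 3-sigma threshold are assumptions made for this example; the thesis itself does not implement ARIMA.

```python
# Hypothetical sketch: residual-based anomaly flagging with an ARIMA model.
# The library, model order, and threshold are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
flow = pd.Series(rng.normal(100, 5, 500))      # stand-in for a water flow series
flow.iloc[400:410] += 40                       # injected "leak-like" jump

model = ARIMA(flow, order=(2, 1, 1)).fit()     # ARIMA(p=2, d=1, q=1)
residuals = model.resid

# Points whose residual deviates strongly from the residual mean are flagged.
threshold = 3 * residuals.std()
anomalies = residuals.index[np.abs(residuals - residuals.mean()) > threshold]
print(anomalies)
```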

2.2.2 Exponential Smoothing

In addition to autoregressive models, there are various smoothing models. One of the first smoothing models is exponential smoothing. The exponential smoothing model is a recurrence relation in which each subsequent predicted value is expressed in terms of the value predicted in the previous step and the true value observed in the previous step:

\hat{y}_{t+1} = \hat{y}_t + \alpha (y_t - \hat{y}_t)

However, this model does not take into account the various phenomena that can be observed in the data. Therefore, other models were developed that take trends and seasonality into account. One such model is called the Holt-Winters method. The Holt-Winters method [4] estimates the level, slope, and seasonal components at the current time. The equations of this model are as follows:

S_t = \alpha \frac{y_t}{I_{t-L}} + (1 - \alpha)(S_{t-1} + b_{t-1})   (overall smoothing)

b_t = \gamma (S_t - S_{t-1}) + (1 - \gamma) b_{t-1}   (trend smoothing)

I_t = \beta \frac{y_t}{S_t} + (1 - \beta) I_{t-L}   (seasonal smoothing)

where y is the observation, S is the smoothed observation, b is the trend factor, I is the seasonal index, and t is an index denoting time. Smoothing is controlled by three parameters, \alpha, \beta, and \gamma, which estimate the level, the trend component, and the seasonal component. The parameters \alpha, \beta, and \gamma all take values between 0 and 1; values close to 0 mean that recent observations are given relatively little weight.

2.3 Machine learning

Machine learning is the study of computer algorithms that improve automatically through experience [5]. Thanks to machine learning, computers learn to recognize in photographs and drawings not only faces, but also landscapes, objects, text, and numbers. As for text, the function of checking grammar is now present in any text editor and even in phones. Moreover, not only the spelling of words is taken into account, but also the context, shades of meaning, and other linguistic aspects. There already exists software capable of writing news articles without human intervention. All tasks that can be solved using machine learning fall into one of the following categories.

1. Regression task is a forecast based on a sample of objects with various attributes. The output should be a real number, for example, the price of an apartment, the expected income for the next month, the quality of beer during blind testing.

2. The classification task is to obtain a categorical answer based on a set of features. It has a finite number of answers: is there a dog in the photo, does the image show a meal.

3. Clustering task is the dividing data into groups: assigning objects to one or another category.

4. The dimensionality reduction task is to reduce a large number of features to a smaller number, for example for visualization.

5. The task of detecting anomalies is to separate anomalies from standard cases. At first glance, it coincides with the classification task, but there is one significant difference: anomalies are a rare phenomenon, and the training examples on which a machine learning model could be trained to identify such objects are either vanishingly few or absent altogether, so classification methods do not work here. In practice, such a task is, for example, identifying leaks in the water distribution network.


2.3.1 Elliptic Envelope

One of the common methods for detecting anomalies assumes that the data is distributed in some known way, for example according to a Gaussian distribution. In this case, the task is to determine the type of this distribution and select those objects that do not satisfy the found distribution. In the elliptic envelope method, the set of points is modeled as the interior of an ellipsoid [6]. The method works successfully only on normally distributed data. The degree of anomaly of an object is determined by the Mahalanobis distance, which in mathematical statistics is a measure of the distance between vectors of random variables and generalizes the concept of Euclidean distance. To determine whether a point belongs to one of the classes, it is necessary to find the covariance matrices of all classes.

When calculating the Mahalanobis distance to each class, the class for which this distance turns out to be minimal is selected, which is equivalent to the maximum likelihood estimation method. The point with the greatest Mahalanobis distance to the rest of the points is considered an anomaly. Such a point has the greatest influence on the curvature and the coefficients of the regression equation.

If there are several clusters in the data, Gaussian mixture models [7] can be used. The Gaussian mixture model is a statistical model for describing normally distributed subgroups within a general group and is parameterized by mixture weights and component means and covariances (in the multidimensional case). If the number of components (data clusters) is known, the Expectation-maximization algorithm [8] is used to estimate the parameters of the mixture distribution. Having computed the centroids, covariance matrices, weights, and log-likelihood functions for each Gaussian component, we obtain probabilistic clusters describing the distribution of the data. Based on the likelihood function, it is possible to detect those anomalous values that were not clustered.
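A short illustration of the elliptic envelope approach with scikit-learn is sketched below; the synthetic data and the contamination value are placeholders, not values used in the thesis.

```python
# Illustrative sketch of EllipticEnvelope: fit a robust Gaussian estimate and
# flag points with a large Mahalanobis distance. Data and contamination are
# placeholders, not values from the thesis.
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 2))
X[:10] += 6                                   # a few injected outliers

detector = EllipticEnvelope(contamination=0.02, random_state=0).fit(X)
labels = detector.predict(X)                  # +1 for inliers, -1 for outliers
scores = detector.mahalanobis(X)              # squared Mahalanobis distances
print(np.where(labels == -1)[0][:10], scores[:3])
```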

2.3.2 One-Class SVM

The support vector machine is a supervised algorithm: a discriminative classifier formally defined by a separating hyperplane. The main idea of the method is to translate the original vectors into a space of higher dimension and to build a separating hyperplane in that multidimensional space. If the training data are linearly separable, two parallel separating hyperplanes are additionally constructed so that the class points closest to the separating hyperplane lie on them. These points are the support vectors. The constructed hyperplanes bound a region free of training objects, and the further task is to maximize the width of this region; the separating hyperplane is then called optimal. The dimension of the space corresponds to the number of classifying features, and their values determine the position of points in the space. The method thus separates the objects with a hyperplane chosen to be as far as possible from the nearest element of each group. If the set is not linearly separable, the dimension of the space is increased and a linear separator is found in the changed space. The construction of a separating hyperplane in this case requires the use of a nonlinear kernel function.

Figure 2.1: One-class SVM, from https://www.researchgate.net/publication/242572058_One-Class_Support_Vector_Machines_for_Protein_Protein_Interactions_Prediction/figures?lo=1

Among the advantages of the approach, the following points are highlighted:

• Solving the quadratic programming problem in a convex domain that always has a unique solution;

• Finding the dividing space of maximum width, which contributes to a more accurate classification.

• The ability to process high-dimensional data.

A support vector machine for one class (one-class SVM) is a modification of the classical support vector machine algorithm [9] that separates the data from the origin. Despite the simplicity of the idea, it turned out to be quite workable and is used in various fields. It should be noted that, in contrast to the standard support vector method, in this case only radial basis functions are suitable as the kernel, since other nonlinear kernels show worse results. This method is more suitable for novelty detection than for outlier detection.

2.3.3 Isolation Forest

The idea of an Isolation Forest [10] is based on the Monte Carlo principle: a random partition of the feature space is carried out such that, on average, isolated points are cut off from the normal, clustered data. The final result is averaged over several runs of the stochastic algorithm. The Isolation Tree algorithm is a random binary decision tree. The root of the tree is the entire feature space. In each subsequent node, a random feature is chosen together with a threshold sampled from a uniform distribution over the segment from the minimum to the maximum value of the selected attribute. The stopping criterion is the identical coincidence of all objects in a node, i.e. the decision tree is built completely. The answer in a leaf, which also serves as the algorithm's anomaly score, is the depth of the leaf in the constructed tree. It is argued that abnormal points tend to appear in leaves with low depth, i.e. in leaves close to the root, whereas to split a cluster of normal data with hyperplanes the tree needs to build several more levels.

Figure 2.2: Isolation forest, from https://donghwa-kim.github.io/iforest.html

Moreover, the number of such levels is proportional to the cluster size; therefore, the anomaly score of the points lying in a cluster is proportional to its size. This means that objects from small clusters, which are potentially anomalies, will have a lower anomaly score than objects from normal data clusters. The algorithm has several significant advantages:

• The algorithm recognizes anomalies of various types: both isolated points with a low local density and small clusters of anomalies.

• The complexity of the isolation tree is O(n log n), which is more efficient than many other algorithms.


• Does not require significant memory costs, in contrast to, for example, metric methods, which often require the construction of a matrix of pairwise distances.

• There are no parameters requiring selection.

• Invariant to feature scaling; no metric or other a priori information about the data is required.

• Resistant to the curse of dimensionality.

2.4 Deep Learning

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [11].

Deep neural networks are currently becoming one of the most popular deep learning methods. They show better results compared to alternative methods in areas such as speech recognition, natural language processing, computer vision, medical informatics, etc. One of the reasons for the successful use of deep neural networks is that the network automatically extracts from the data the important features that are necessary to solve the problem. In alternative machine learning algorithms, features have to be selected by people; there is a specialized area of research for this, feature engineering. However, when processing large amounts of data, a neural network manages feature extraction much better than a human.

2.4.1 Recurrent Neural Networks

In anomaly detection on sequential data, where each value depends on the previous ones, the existence of long-term relationships between points should be taken into account. There are a large number of neural network topologies, and many of them can easily cope with the task of predicting an element from a training set, but the assumption of long-term relationships makes it necessary to choose recurrent neural networks. Traditional neural networks do not possess the ability to store information and do not have feedback connections; they are not able to analyze long-term relationships in the data. Recurrent neural networks, in turn, solve the problem of information storage thanks to the presence of feedback connections. A recurrent network can be considered as several copies of the same network, each of which transfers information to the subsequent copy (Figure 2.3).

Neurons in such networks send values not only to the neuron in the next layer but also to themselves at the next step. That is, when data arrives at a neuron for the first time, the neuron processes it in accordance with the activation function, sends it to the next layer, and remembers part of this information. When receiving data for the second time, the neuron, along with this data, also receives the value that it saved at the previous iteration. The fact that RNNs resemble a chain means that they are closely related to sequences and lists; the RNN is the most natural neural network architecture for working with this type of data. The standard version of the RNN, however, cannot always solve the problem of long-term dependencies: no matter how important the information is, its weight can greatly decrease over a large number of iterations, or it can disappear altogether.

Figure 2.3: Recurrent neural network, from https://necst.it/exploring-boundary-accuracy-performances-recurrent-neural-networks/

Figure 2.4: LSTM architecture, from https://www.researchgate.net/publication/324600237_Improving_Long-Horizon_Forecasts_with_Expectation-Biased_LSTM_Networks/figures?lo=1

There is a modification of the recurrent neural network that allows solving this problem: LSTM (Long Short-Term Memory). It was introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [12], and then improved and popularized in the works of many other researchers. The equations of the LSTM are as follows:

i_t = \sigma(w_i [h_{t-1}, x_t] + b_i)
f_t = \sigma(w_f [h_{t-1}, x_t] + b_f)
o_t = \sigma(w_o [h_{t-1}, x_t] + b_o)

where i_t represents the input gate, f_t the forget gate, o_t the output gate, \sigma the sigmoid function, w_x the weights for the respective gate neurons, h_{t-1} the output of the previous LSTM block at timestep t-1, x_t the input at the current timestep, and b_x the biases for the respective gates. The equations for the cell input activation vector, the cell state, and the final output are:

\tilde{c}_t = \tanh(w_c [h_{t-1}, x_t] + b_c)
c_t = f_t * c_{t-1} + i_t * \tilde{c}_t
h_t = o_t * \tanh(c_t)

where c_t is the cell state at timestep t and \tilde{c}_t is the cell input activation vector at timestep t (Figure 2.4).

In LSTM networks, neurons are equipped with a system of gates and cells. Internal neurons contain cells that act as long-term memory. Gates allow you to control whether a value is needed at this stage or not. Using the forget gate, you can erase the data from the neuron cell. Thanks to such cells, long short-term memory networks can determine the importance of events that occurred thousands of discrete time steps back and remember these events. It is worth noting that this particular RNN model is currently widely used.

2.4.2 Convolutional Neural Networks

Convolutional neural networks (CNN) [13] (Figure 2.5) are an extension of fully-connected networks with two types of layers:

1. The convolutional layer convolves the image with a filter, therefore it follows either the input layer or another convolutional layer. Each such layer consists of several filters of the same size, that is, at the output, there are several feature maps, one map per filter. Filter weights are adjusted during network training. Due to the locality, convolutional layers help both reduce the set of configurable parameters compared to fully connected layers and increase the generalizing ability of the network.

2. The pooling layer decreases the dimension of the feature map. There are several approaches to pooling: taking every n-th element of a row or column of the map, or choosing the mean or maximum element in a certain area of the feature map. The presence of pooling layers makes the network invariant to scale changes and also speeds up calculations by reducing the computational complexity of the convolution operation. A filter is a matrix of weights with which the image is convolved. Usually convolutional and pooling layers alternate, so at the output several thousand feature maps are obtained, which are connected to fully connected layers for further transformations.

Figure 2.5: CNN architecture, from https://commons.wikimedia.org/wiki/File:Typical_cnn.png

Convolutional neural networks are commonly used in image classification and object detection. However, they can also be used in anomaly detection [14].


2.4.3 Autoencoders

An autoencoder is a type of neural network consisting of two, as a rule, symmetrical parts (Figure 2.6), whose task is to reproduce at the output, as closely as possible, the value received at the input.

Figure 2.6: Autoencoder architecture, from https://www.wikiwand.com/en/Autoencoder

The first half, called the encoder, compresses the input into a vector of smaller dimension (the hidden representation). The second part, the decoder, tries to reconstruct the original data relying solely on this hidden representation. During learning, the autoencoder minimizes the difference between the input and output data. In this way, it is assumed that the hidden representation will contain the essential features of the original data.
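Purely as an illustration of the idea, a tiny autoencoder for one-day windows of 48 half-hour readings might look like the sketch below. The layer sizes, the reconstruction-error threshold, and the use of Keras here are assumptions; the thesis does not implement an autoencoder.

```python
# Illustrative sketch: a small fully connected autoencoder for 48-step windows
# (one day of half-hour readings). All choices here are assumptions for the example.
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(48,))
encoded = tf.keras.layers.Dense(8, activation="relu")(inputs)      # hidden representation
decoded = tf.keras.layers.Dense(48, activation="linear")(encoded)
autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

rng = np.random.default_rng(0)
normal_days = rng.normal(0.0, 1.0, size=(1000, 48)).astype("float32")
autoencoder.fit(normal_days, normal_days, epochs=5, batch_size=64, verbose=0)

# Days that reconstruct poorly (high error) are treated as candidate anomalies.
train_errors = np.mean((autoencoder.predict(normal_days, verbose=0) - normal_days) ** 2, axis=1)
threshold = train_errors.mean() + 3 * train_errors.std()

new_days = rng.normal(0.0, 1.0, size=(10, 48)).astype("float32")
errors = np.mean((autoencoder.predict(new_days, verbose=0) - new_days) ** 2, axis=1)
print(errors > threshold)
```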


Chapter 3

Realisation

This chapter describes the tools and methods used to solve this problem. First comes the frameworks section, covering the libraries that helped in writing the algorithms, as well as a description and explanation of the database from which the dataset was made. It should be noted that this database was created artificially and represents a water distribution network. Next is the preparation of the dataset on which the anomaly detection methods were tested, and the separation of the data into training, validation, and test sets. In the next part, the benchmark is described: a simple algorithm against which the results of the more complex methods can be compared. Then comes the main section of the chapter, which describes in detail the steps of constructing each algorithm. The last two sections present the evaluation methods and compare the algorithms among themselves. All code was written in Python.

3.1 Frameworks

The choice of a framework is extremely important in any project because it determines the complexity and functionality that is available when solving a problem.

As a rule, the more functionality the framework provides, the more complicated it is, however, there are exceptions.

3.1.1 TensorFlow

TensorFlow is an end-to-end open-source framework for machine and deep learning [15]. It has convenient tools, libraries, and other resources that give researchers a chance to push the state of the art in machine learning and let developers easily create and deploy large applications. The platform allows working at different levels of difficulty: different algorithms can be written in several lines of code using the Keras API [16], or custom architectures and the main components of neural networks can be built using lower-level code. Using this framework, a recurrent neural network for detecting anomalies was implemented in this work.


3.1.2 Scikit-learn

Scikit-learn [17] is a framework that provides a variety of machine learning algorithms as well as some datasets that are often used in this area. The main advantage of this library is its simplicity and a huge number of ready-made algorithms that can be called with just one line of code. Also, this platform has a large community and a huge number of examples with detailed explanations. In this work, it was used to test various simple algorithms, as well as to implement the basic anomaly detection methods.

3.1.3 LeakDB

LeakDB (Leakage Diagnosis Benchmark) is a realistic leakage dataset for water distribution networks [18]. The dataset consists of a huge number of artificially created but realistic leakage scenarios, on different water distribution networks, under different conditions. The dataset was based on historical data and simulates the water flow in the water distribution network. The database consists of the so-called scenarios - annual simulations of the water flow in a network consisting of 34 pipes and 32 nodes (Figure 3.1).

Figure 3.1: Water distribution network architecture, from https://www.researchgate.net/publication/257672905_Design_of_Water_Distribution_Networks_using_a_Pseudo-Genetic_Algorithm_and_Sensitivity_of_Genetic_Operators


Each scenario has the following structure:

Scenario-n
    Demands
        Node 1.csv
        Node 2.csv
        ...
    Flows
        Link 1.csv
        Link 2.csv
        ...
    Hanoi CMH Scenario-n.inp
    Labels.csv
    Leaks
        Leak n demand.csv
        Leak n info.csv
        ...
    Pressures
        Node 1.csv
        Node 2.csv
        ...
    Timestamps.csv
    Scenario-n info.csv

The Demands folder, like the Pressures folder, has 32 files, one for each node, which give the demand and pressure in each node. In the Flows folder there are 34 files, one for each pipe, which indicate the flow rate in the pipe. In the file Hanoi CMH Scenario-n.inp, various network parameters can be found, such as pipe diameters and lengths, the network topology, pump settings, tank sizes, etc. The Labels.csv file provides information on whether there was a leak at each timestamp in this scenario. The Leaks folder provides information about leak sizes, types, and start and end timestamps.

3.2 Data set preparation

Creating and preparing a dataset is one of the most important steps in a machine learning project. The first attempt to create a dataset was unsuccessful, since the database from which the dataset was created was misinterpreted: namely, the fact that a leak affected not only the node in which it was located but also all the nodes connected to the problematic one was missed. As a result, all the algorithms tested on this dataset showed extremely low scores and accuracy.

After several weeks of searching the Internet for the necessary information, and with the help of a colleague, the database was correctly interpreted. Since the water flow in scenarios without leaks was very similar in all nodes, it was decided to take one node from each scenario into the final dataset. This decision was made due to the limited power of the computer and the large amount of data. The leak scenarios were chosen in the same way. It is worth noting that scenarios with leaks in two or more nodes were not considered, since this greatly complicated the creation of the final dataset and there was no need for these scenarios. The main problem with such scenarios was that leaks located in several nodes influenced each other and created a rather complex dependence in the entire water distribution network: some pipes had a distorted water flow in different places, while others had almost no change in water flow. Although in the original database the leak was labeled only in two nodes, in fact these two nodes influenced the flow in almost all pipes. After all the unsuccessful attempts, a fairly simple dataset was created, which nevertheless contained a subset of the data describing the entire system.

The final dataset has eight columns with all the data that the original database provided for each scenario (Figure 3.2).

Figure 3.2: Dataset

The first column indicates the time step, every half hour starting from the beginning of the year; for each node, the count starts over. The Node column shows the node number, and the Demands and Pressures columns show the demand and pressure in this node, respectively. The Flows column indicates the water flow in the pipe that is connected to the node. The last three columns indicate the presence of a leak in the scenario, the scenario number, and the size of the leak.

The next important step in the preparation of the dataset is its division into training, validation, and test sets. Different algorithms require different data separation; therefore, ten leakage scenarios were manually selected on which all methods were evaluated. This was done to evaluate the results more accurately and independently. The training and validation sets were separated according to the algorithm that was used. Some algorithms were trained on scenarios without leaks; therefore, leak-free scenarios went into the training set, and scenarios with and without leaks went into the validation set. Other algorithms, on the contrary, require data both with and without leaks, so in this case the corresponding scenarios were distributed between the training and validation sets. The last step in preparing the dataset was scaling the data with standard scaling for better training of the algorithms. The creation and preparation of the dataset were done using the popular library pandas [19].
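A minimal sketch of the scenario-based split and standard scaling described above is given below. It assumes the combined data is a pandas DataFrame with a Scenario column and the feature columns from Figure 3.2; the exact column names and scenario lists are assumptions, not the thesis code.

```python
# Minimal sketch of the scenario-based split and standard scaling described above.
# Column names and scenario id lists are placeholders/assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def split_and_scale(df: pd.DataFrame, train_ids, val_ids, test_ids,
                    features=("Demands", "Pressures", "Flows")):
    cols = list(features)
    train = df[df["Scenario"].isin(train_ids)].copy()
    val = df[df["Scenario"].isin(val_ids)].copy()
    test = df[df["Scenario"].isin(test_ids)].copy()

    scaler = StandardScaler().fit(train[cols])      # fit the scaler on training data only
    for part in (train, val, test):
        part[cols] = scaler.transform(part[cols])
    return train, val, test
```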

3.3 Benchmark

At first, it is worth trying to solve any problem with simple methods, gradually increasing the complexity of the algorithms. If this is not done, a lot of time can be wasted trying to build a very complex algorithm, which in the end can turn out to be worse than simpler methods. To avoid this, a very simple algorithm is created and taken as a benchmark. Sometimes one comes across projects in which the results of the methods are not very clear without a comparison; for this, a benchmark is created. An algorithm called MNF (minimum night flow) was implemented for this problem. This algorithm is quite simple and at the same time quite effective for this task. Its essence is that the presence of a leak on a given day is determined from the previous n days. First, the minimum night flow over the previous n days is calculated. Then the minimum night flow for the given day is computed. These two values are then compared: if the minimum night flow for the given day exceeds the minimum night flow over the previous n days by a certain threshold, then there was a leak on that day. It is also important to take n previous days without leaks.
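The following is a hedged sketch of the MNF benchmark described above. The night-time window, the window length n, and the multiplicative threshold are illustrative assumptions; only the overall idea of comparing a day's minimum night flow with that of the previous n leak-free days comes from the text.

```python
# Hedged sketch of the MNF (minimum night flow) benchmark.
import numpy as np
import pandas as pd

def detect_leak_mnf(flow: pd.Series, day: str, n: int = 7, threshold: float = 1.2) -> bool:
    """Flag a leak on `day` if its minimum night flow exceeds the minimum night
    flow over the previous n (assumed leak-free) days by a multiplicative threshold."""
    night = flow.between_time("02:00", "04:00")      # night window is an assumption
    nightly_min = night.resample("D").min()           # minimum night flow per day
    current = nightly_min.loc[day]
    reference = nightly_min.loc[:day].iloc[-(n + 1):-1].min()
    return bool(current > threshold * reference)

# Stand-in half-hourly flow series with a leak-like offset on the last day.
idx = pd.date_range("2020-01-01", periods=48 * 10, freq="30min")
flow = pd.Series(100.0, index=idx)
flow.loc["2020-01-10"] += 30
print(detect_leak_mnf(flow, "2020-01-10"))
```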

3.4 Anomaly detection system

Before describing the implementation of algorithms to solve this problem, it is worth noting that the dataset for Isolation forest and one-class SVM has been slightly modified. Since these two methods do not take into account the time component and they do not have memory, it was decided to make up for this shortcoming and create a new dataset (Figure 3.3).

Figure 3.3: Dataset for Isolation forest and one-class SVM

It has 100 new columns and 48 times fewer rows. Whereas in the previous dataset each time step was half an hour, in the new one it is a day. The flow and pressure values within a day have now been turned into features. A few more characteristics were also added in the form of the minimum, maximum, mean, and standard deviation of flow and pressure. If the old dataset contained a leak in at least one time step of a day, then that day is marked as leaky in the new one. Thus, these additional features allowed methods such as Isolation forest and one-class SVM to take the time dependence in the data into account to some extent.
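A sketch of how the half-hourly records could be reshaped into such daily feature rows is shown below. The column names Flows, Pressures, and Leak and the reshaping code are assumptions made for the example; only the idea of 48 flow and 48 pressure columns plus summary statistics and a daily leak label comes from the text.

```python
# Hedged sketch: reshape half-hourly records of one node into daily feature rows.
import numpy as np
import pandas as pd

def build_daily_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: one node's half-hourly records with Flows, Pressures and a 0/1 Leak column."""
    df = df.reset_index(drop=True)
    df["day"] = df.index // 48                        # 48 half-hour steps per day
    df["step"] = df.index % 48

    # 48 flow columns + 48 pressure columns per day
    wide = df.pivot(index="day", columns="step", values=["Flows", "Pressures"])
    wide.columns = [f"{name}_{step}" for name, step in wide.columns]

    # summary statistics and the daily leak label (leak in any half-hour -> leaky day)
    stats = df.groupby("day").agg(
        flow_min=("Flows", "min"), flow_max=("Flows", "max"),
        flow_mean=("Flows", "mean"), flow_std=("Flows", "std"),
        pres_min=("Pressures", "min"), pres_max=("Pressures", "max"),
        pres_mean=("Pressures", "mean"), pres_std=("Pressures", "std"),
        Leak=("Leak", "max"),
    )
    return wide.join(stats)

rng = np.random.default_rng(0)
demo = pd.DataFrame({"Flows": rng.normal(100, 5, 48 * 3),
                     "Pressures": rng.normal(50, 2, 48 * 3),
                     "Leak": 0})
print(build_daily_features(demo).shape)               # 3 days x (96 + 9) columns
```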

3.4.1 Isolation forest

To implement this algorithm, a function from the scikit-learn library was used:

IsolationForest(*, n_estimators=100, max_samples="auto", contamination="auto", max_features=1.0)

where n_estimators is the number of base estimators in the ensemble, max_samples is the number of samples to draw from X to train each base estimator, contamination is the amount of contamination of the data set, i.e. the proportion of outliers in the data set, and max_features is the number of features to draw from X to train each base estimator. Brute force was used to select the optimal parameters, with the F1-score on the validation dataset as the determining factor. As a result, the best parameters for this task turned out to be: n_estimators=10, max_samples=512, contamination=0.2, max_features=2. The F1-score was 0.29 on the training dataset, 0.18 on validation, and 0.37 on test. As expected, the results are not very good, but still better than the benchmark.
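A minimal sketch of this detector with the reported hyperparameters might look as follows; the synthetic data, variable names, and the mapping of predictions to leak labels are assumptions, only the hyperparameter values come from the text.

```python
# Sketch of the Isolation forest detector with the hyperparameters reported above.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(600, 104))            # stand-in for the daily feature rows
X_val = rng.normal(size=(100, 104))
y_val = rng.integers(0, 2, size=100)             # stand-in leak labels

iforest = IsolationForest(n_estimators=10, max_samples=512,
                          contamination=0.2, max_features=2, random_state=0)
iforest.fit(X_train)

# predict() returns +1 for inliers and -1 for outliers; map -1 to 1 ("leak").
pred = (iforest.predict(X_val) == -1).astype(int)
print("validation F1:", f1_score(y_val, pred))
```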

3.4.2 One-class SVM

To implement this algorithm, a function from the scikit-learn library was used:

OneClassSVM(*, kernel="rbf", degree=3, gamma="scale", nu=0.5, cache_size=200)

where kernel specifies the kernel type to be used in the algorithm. It must be one of "linear", "poly", "rbf", "sigmoid", "precomputed" or a callable. If none is given, "rbf" will be used; if a callable is given, it is used to precompute the kernel matrix. The degree is used only for the polynomial kernel function ("poly") and is ignored by all other kernels. Gamma is the kernel coefficient for "rbf", "poly" and "sigmoid"; it can be "scale" or "auto". If gamma="scale" (default) is passed, then 1 / (n_features * X.var()) is used as the value of gamma; if "auto", 1 / n_features is used. The parameter nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors; it should be in the interval (0, 1], and by default 0.5 is taken. The way to determine the best parameters for this task was the same as for the Isolation forest. As a result, the best parameters for this task turned out to be: kernel="poly", nu=0.4, gamma="scale", degree=3. The F1-score was 0.24 on the training dataset, 0.25 on validation, and 0.28 on test.
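Analogously, a sketch of the one-class SVM detector with the reported parameters is given below. The synthetic stand-in data and the assumption that the model is fitted on leak-free days only are illustrative; the kernel, nu, gamma, and degree values come from the text.

```python
# Sketch of the one-class SVM detector with the parameters reported above.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(600, 104))          # leak-free daily feature rows (stand-in)
X_val = rng.normal(size=(100, 104))
y_val = rng.integers(0, 2, size=100)           # stand-in leak labels

ocsvm = OneClassSVM(kernel="poly", nu=0.4, gamma="scale", degree=3)
ocsvm.fit(X_train)                             # trained on "normal" data (assumption)

# predict() returns +1 for normal days and -1 for anomalous ones.
pred = (ocsvm.predict(X_val) == -1).astype(int)
print("validation F1:", f1_score(y_val, pred))
```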


3.4.3 LSTM

For this problem, two different algorithms were created using LSTM. The first algorithm was quite simple to implement and train. It consisted only of a recurrent neural network, which received a time series of a certain length at the input and at the output gave a value of 0 or 1, indicating the presence of a leak at the given time step. The TensorFlow library was used to create this method. The first step in the development of the algorithm was the separation of the data into training, validation, and test parts. Since this was supervised learning, it was necessary to divide the data with anomalies proportionally between the training and validation parts. After that, several attempts were made to create an optimal neural network architecture to get the best result. Hyperparameters such as the number of layers, the number of units in each layer, the batch size, the length of the input time series, the activation function, and the number of epochs were tested.

After training the recurrent network, the optimal parameters were selected: the number of LSTM layers - 2, the number of units in each layer - 35, batch size - 5000, the length of the input time series - 24, the activation function - "relu", the number of epochs - 70. The accuracy of the algorithm was 87.28%. However, although the accuracy seems rather high, this algorithm does not work well for this task. The high accuracy is due to the proportion of anomalies in the training dataset, which was only 20 percent; that is, if the algorithm classified all the data as normal, its accuracy would be 80%. To evaluate this method appropriately, the F1-score was calculated, which showed the inadequacy of such a neural network architecture: on the training dataset the F1-score was 0.46, on validation 0.25, and on test 0.15. Even if this algorithm showed a good result, it would still be difficult to use on real data, as it would require accurate information about the anomalies, which is not always available in the real world.
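A hedged sketch of this first LSTM detector, using the reported hyperparameters (two LSTM layers with 35 units each, "relu" activation, input windows of length 24, batch size 5000), is shown below. The number of input features and the stand-in training data are assumptions.

```python
# Hedged sketch of the sequence-to-label LSTM classifier described above.
import numpy as np
import tensorflow as tf

n_features = 2                                   # e.g. flow and pressure (assumption)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(35, activation="relu", return_sequences=True,
                         input_shape=(24, n_features)),
    tf.keras.layers.LSTM(35, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # leak probability for the window
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stand-in data: windows of 24 half-hour steps with a 0/1 leak label each.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 24, n_features)).astype("float32")
y = rng.integers(0, 2, size=(5000, 1)).astype("float32")
model.fit(X, y, batch_size=5000, epochs=2, verbose=0)   # the thesis used 70 epochs
```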

The next algorithm with LSTM was a bit more complicated than the previous one, and its distinguishing feature was that information about anomalies in the data was not necessary. The LSTM architecture in this method differs from the one described a little earlier. At the input, the neural network received a time series of a certain length, and at the output it predicted a time series of a certain length. Whereas in the previous algorithm the recurrent network detected anomalies directly, this one only predicted the flow or pressure of water. After that, error vectors were calculated as e = x_{true} - x_{pred}, where x_{true} and x_{pred} are the observed value and the predicted value, respectively. Then, a multivariate Gaussian distribution was fitted to the error vectors by maximum likelihood estimation. The error vector is computed at the point where an anomaly is likely to have happened; if it lies in the tail of the Gaussian distribution estimated in the previous step, an anomaly is concluded to have happened. The rarity of the event can be measured by its location in the distribution, and the Mahalanobis distance is the statistic representing the anomaly score. Assuming the parameters of an M-dimensional Gaussian distribution are estimated as

p(x | Data) = N(x | \hat{\mu}, \hat{\Sigma}),

the Mahalanobis distance is defined as

a(x) = (x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu}).

The rarity of the event can be measured with a(x).

Figure 3.4: LSTM with Mahalanobis distance, from https://www.renom.jp/notebooks/tutorial/time_series/lstm-anomalydetection/notebook.html

One of the most important parts of the algorithm was the architecture and training of the recurrent neural network. The data partitioning was also slightly different from the partitioning in the previous algorithm: to train this network, only data without anomalies should be in the training dataset. For this network, the mean absolute error was used to evaluate its quality. The optimal hyperparameters were: the number of LSTM layers - 2, the number of units in each layer - 35, batch size - 5000, the length of the input and output time series - 46, the activation function - "relu", the number of epochs - 35. The validation mean absolute error was 0.0467. The next step in the algorithm was the calculation of the error vectors and the fitting of a multivariate Gaussian distribution. After that, the Mahalanobis distance was calculated on the validation data, and an optimal threshold for detecting anomalies was selected by maximizing the F1-score. Then everything was calculated again on the test data, and anomalies were detected using the threshold set on the validation data.
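The Mahalanobis scoring step described above can be sketched as follows; the variable names, the synthetic error vectors, and the threshold value are placeholders, while the formula is the one given earlier.

```python
# Sketch of the Mahalanobis scoring step: fit a multivariate Gaussian to
# prediction-error vectors from normal data and score new errors by distance.
import numpy as np

def fit_gaussian(errors: np.ndarray):
    """errors: (n_samples, n_dims) prediction-error vectors from normal data."""
    mu = errors.mean(axis=0)
    cov = np.cov(errors, rowvar=False)
    return mu, np.linalg.inv(cov)

def mahalanobis_score(e: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    d = e - mu
    return float(d @ cov_inv @ d)                 # a(x) = (x - mu)^T Sigma^-1 (x - mu)

rng = np.random.default_rng(0)
normal_errors = rng.normal(0, 0.05, size=(1000, 2))       # stand-in validation errors
mu, cov_inv = fit_gaussian(normal_errors)

test_error = np.array([0.4, -0.3])                         # a suspiciously large error
score = mahalanobis_score(test_error, mu, cov_inv)
is_leak = score > 9.0                                      # threshold chosen on validation
print(score, is_leak)
```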

3.5 Evaluation

The correct interpretation of the results depends on the choice of evaluation of the algorithms; choosing the wrong method can lead to seriously misleading conclusions. In the anomaly detection problem, for example, if you take a method that is usually used for balanced classes in classification, you can greatly err in the results and mislead the scientific community. The evaluation usually used in classification shows the accuracy on a particular class. However, such an evaluation only makes sense if the number of elements in each class is balanced. Otherwise, for example in anomaly detection, where the number of elements of one class is about one percent of the number of elements of the other class, the accuracy of the algorithm will always be about one hundred percent.

3.5.1 Metrics

The metrics used to evaluate the implemented algorithms were chosen taking into account the particularities of the task and the data. The confusion matrix is a table of four elements: true positive (TP), false negative (FN), false positive (FP), and true negative (TN). It is usually used when there are only two classes. True positive and true negative show the number of correctly classified elements of the positive and negative classes; false negative and false positive indicate the number of algorithm errors. This matrix is well suited for this task. Using it, the following metrics were calculated:

• True Positive Rate (R_{TP}), also known as recall, shows the ability of the algorithm to detect leakage when it exists.

  R_{TP} = \frac{TP}{TP + FN}

• True Negative Rate (R_{TN}), also known as specificity, shows the ability of the algorithm to avoid false alarms when there is no leak.

  R_{TN} = \frac{TN}{TN + FP}

• F1-score is a frequently used metric when there are imbalanced classes in the data. It balances the precision and recall of the classifier.

  F_{tp,fp} = \frac{2\,TP}{2\,TP + FP + FN}


3.6 Comparing methods

For a more accurate result, all algorithms were evaluated on a test dataset, which consisted of eight scenarios without anomalies and two with anomalies. As expected, the benchmark showed the worst result, and the classification recurrent neural network showed a slightly better one. One-class SVM and Isolation forest showed very good results while not taking the time component into account. In first place, as expected, was LSTM with the Mahalanobis distance. It is also worth noting that proper data separation and sampling play a huge role in the result.

All results are shown in the table below.

Detectors             F_{tp,fp} (%)   R_{TN} (%)   R_{TP} (%)
MNF                   17.66           48.74        71.58
One-class SVM         28.92           68.98        82.28
Isolation forest      37.44           84.25        68.26
LSTM                  18.86           50.44        69.24
LSTM + Mahalanobis    57.70           97.56        64.58

Table 3.1: Detectors and their score

A leak detection example using LSTM with the Mahalanobis distance based on water pressure is shown in Figures 3.5 to 3.7.

Figure 3.5: Mahalanobis distance based on pressure


Figure 3.6: Actual test leaks

Figure 3.7: Predicted test leaks


Chapter 4

Conclusion

In this work, several methods for detecting anomalies were presented and compared. As practice has shown, one of the most difficult stages was the correct interpretation of the data as well as the creation of a dataset using LeakDB.

Although the results did not turn out to be as good as expected, all the algorithms work better than the benchmark. All the goals set at the beginning of the work were fulfilled. In the future, to improve the results on this problem, it is worth trying to use all of the data and not just part of it. Longer training of the neural networks and different architectures can also help, and convolutional neural networks as well as autoencoders could be tried. Expanding the dataset with real data can also be helpful. It is worth warning that on real data these algorithms in this configuration may not work, because in the real world many factors cannot be simulated. In the real world there is also the human factor, against which almost any algorithm is powerless.


Bibliography

[1] Chandola, V.; Banerjee, A.; et al. Anomaly Detection: A Survey. ACM Comput. Surv., volume 41, no. 3, July 2009, ISSN 0360-0300, doi:10.1145/1541880.1541882. Available from: https://doi.org/10.1145/1541880.1541882

[2] Chen, Z.; Brown, E. N. State space model. Scholarpedia, volume 8, no. 3, 2013: p. 30868, doi:10.4249/scholarpedia.30868, revision #189565.

[3] Box, G. E. P.; Jenkins, G. Time Series Analysis, Forecasting and Control. USA: Holden-Day, Inc., 1990, ISBN 0816211043.

[4] Winters, P. R. Forecasting Sales by Exponentially Weighted Moving Averages. Manage. Sci., volume 6, no. 3, Apr. 1960: pp. 324–342, ISSN 0025-1909, doi:10.1287/mnsc.6.3.324. Available from: https://doi.org/10.1287/mnsc.6.3.324

[5] Alpaydin, E. Introduction to Machine Learning. The MIT Press, second edition, 2010, ISBN 026201243X.

[6] Rousseeuw, P. J.; Driessen, K. V. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics, volume 41, no. 3, Aug. 1999: pp. 212–223, ISSN 0040-1706, doi:10.2307/1270566. Available from: https://doi.org/10.2307/1270566

[7] Dadi, H.; Venkatesh, P.; et al. Tracking Multiple Moving Objects Using Gaussian Mixture Model. International Journal of Soft Computing and Engineering (IJSCE), volume 3, May 2013: pp. 114–119, ISSN 2231-2307.

[8] Moon, T. The expectation-maximization algorithm. Signal Processing Magazine, IEEE, volume 13, Dec. 1996: pp. 47–60, doi:10.1109/79.543975.

[9] Schölkopf, B.; Williamson, R.; et al. Support Vector Method for Novelty Detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, Cambridge, MA, USA: MIT Press, 1999, pp. 582–588.

[10] Liu, F. T.; Ting, K. M.; et al. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, USA: IEEE Computer Society, 2008, ISBN 9780769535029, pp. 413–422, doi:10.1109/ICDM.2008.17. Available from: https://doi.org/10.1109/ICDM.2008.17

[11] LeCun, Y.; Bengio, Y.; et al. Deep Learning. Nature, volume 521, May 2015: pp. 436–444, doi:10.1038/nature14539.

[12] Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput., volume 9, no. 8, Nov. 1997: pp. 1735–1780, ISSN 0899-7667, doi:10.1162/neco.1997.9.8.1735. Available from: https://doi.org/10.1162/neco.1997.9.8.1735

[13] Fukushima, K. Neocognitron. Scholarpedia, volume 2, Jan. 2007: p. 1717, doi:10.4249/scholarpedia.1717.

[14] Wen, T.; Keyes, R. Time Series Anomaly Detection Using Convolutional Neural Networks and Transfer Learning. 2019, arXiv:1905.13628.

[15] Abadi, M.; Barham, P.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, USA: USENIX Association, 2016, ISBN 9781931971331, pp. 265–283.

[16] Chollet, F.; et al. Keras. https://keras.io, 2015.

[17] Pedregosa, F.; Varoquaux, G.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res., volume 12, Nov. 2011: pp. 2825–2830, ISSN 1532-4435.

[18] LeakDB: A benchmark dataset for leakage diagnosis in water distribution networks, Zenodo, July 2018, doi:10.5281/zenodo.1313116. Available from: https://doi.org/10.5281/zenodo.1313116

[19] McKinney, W.; et al. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, volume 445, Austin, TX, 2010, pp. 56–61, doi:10.25080/Majora-92bf1922-00a.


Appendix A

Acronyms

ARIMA Autoregressive integrated moving average
CNN Convolutional neural network
LeakDB Leakage diagnosis benchmark
LSTM Long short-term memory
RNN Recurrent neural network
SSM State space model
SVM Support vector machine


Appendix B

Contents of enclosed CD

readme.txt ......... the file with CD contents description
src ................ the directory of source codes
    impl ........... implementation sources
    thesis ......... the directory of LaTeX source codes of the thesis
text ............... the thesis text directory
    thesis.pdf ..... the thesis text in PDF format
