• Nebyly nalezeny žádné výsledky

6.3.1 Representation

Picture is visual method for representation of the network.

Fig. 1. Picture representation of network (directed x undirected graph)

Adjacency matrix is a square matrix Aij with elements representing if given pair of vertices are adjacent in graph or not. There are following data representation of adjacent matrix for graph from Fig. 1.

An arrow (v1, v4) is directed from v1 to v4; then v4 is called the head and v1 is called the tail of the arrow;

v4 is said to be a direct successor of v1 and v1 is said to be a direct predecessor of v4.

Directed graph Undirected graph

𝐴𝑖𝑗= {1 … 𝑤ℎ𝑒𝑛 𝑣𝑖 𝑖𝑠 𝑡𝑎𝑖𝑙 𝑎𝑛𝑑 𝑣𝑗𝑖𝑠 ℎ𝑒𝑎𝑑

Incidence matrix is a n × m matrix B, where n is number of vertices and m is number of edges defined as follows (for graph from Fig. 1).

Directed graph Undirected graph

𝐵𝑖𝑗 = {1 … 𝑤ℎ𝑒𝑛 𝑣𝑖 𝑖𝑠 𝑡𝑎𝑖𝑙 𝑜𝑓 𝑒𝑗

6.3.2 Bipartite network projection

Bipartite network is such network whose vertices can be divided into two disjointed sets V1 and V2 and every edge connects one vertex from V1 and one vertex from V2.

Bipartite network projection is a method used for compressing bipartite network to one-mode network.

The result one-mode network contains nodes of only one set (V1 or V2), where nodes are connected only if when they have at least one common neighboring node in original bipartite network.

Several weighting methods have been proposed in literature:

- Simple weighting: weights are calculated by the number of common associations - Hyperbolic weighting: weights are calculated by scaling factor 1/(n - 1)

- Resource allocation weighting: produces initial distribution of some resource among nodes and eliminates miss of nodes with degree 1

We used weighting given by a minimum of common associations in our work (9.3) as we are also finding measure of similarity of connected vertices.

6.3.3 Degree

Degree of vertex v is defined d(v) as number of edges incident to given vertex.

We calculate local vertex degree 𝑑𝑖 = ∑ 𝐴(𝑖, 𝑗)𝑗 and global mean degree 𝜇𝑑=∑ 𝑑𝑖 𝑖

𝑛 . Degree sum of a graph G=(V,E) is calculated as ∑𝑣∈𝑉𝑑𝑒𝑔(𝑣) = 2|𝐸|.

6.3.4 Local and global distances

Distance 𝑑(𝑣𝑖, 𝑣𝑗) between vertex vi and vj is defined as the shortest path between vertex vi and vj in network.

Eccentricity of vertex e(vi) as local property of vertex vi is defined as the longest distance from vertex vi to any other vertex of the network

𝑒(𝑣𝑖) = max

𝑗 {𝑑(𝑣𝑖, 𝑣𝑗)} (6)

Diameter d(G) of a graph G is the maximum eccentricity of any vertex in the graph. That means d(G) is the greatest distance between any pair of vertices

𝑑(𝐺) = max

𝑖 {𝑒(𝑣𝑖)} = max

𝑖,𝑗 {𝑑(𝑣𝑖, 𝑣𝑗)} (7) Global mean distance is defined as

𝜇𝐿 = 2

𝑛(𝑛 − 1)∑ ∑ 𝑑(𝑣𝑖, 𝑣𝑗) (8)

𝑗>𝑖 𝑖

6.3.5 Structural network properties

We are interesting on importance of vertices in a graph (network). Several structural properties, representing “importance” of given vertex was defined (vertex v can be important for other vertices if change in vertex v produces changes to many other vertices in network). There are many definitions and approaches published for importance property.

One often used concept of importance is centrality. This concept was developed in social network analysis and was later developed i.e by [BO87] and others. Every concept of centrality is based on some assumptions and should be analyzed if it meets our interest.

Most of centrality measures are calculated from neighbor matrix. There are often used centralities:

- Based on degree (Degree, Eigenvector, PageRank) - Based on paths (Closeness, Betweenness)

Degree centrality is defined as degree of the vertex (see 6.3.3).

Closeness centrality of a vertex is a measure of local centrality in a network. Closeness is defined using mean distance. Let dij represents length of the shortest path between vertex i and j. Mean distance of vertex i is defined as

𝑙𝑖 =1 𝑛∑ 𝑑𝑖𝑗

𝑗

(8)

Mean distance of vertex, that is little distant from others, is small. Such vertex has fast access to other vertices and can influence other vertices fast. Closeness is defined as

𝐶𝑖 = 1 𝑙𝑖 = 𝑛

∑ 𝑑𝑗 𝑖𝑗 (9) Closeness have known problems with

- networks with more components (closeness is calculated for each vertex in context of their component separately, but it handicaps vertices in large networks)

- range of values has log n size, it is small range, distinction of vertices is small

Betweenness centrality is a measure based on shortest paths in a network – betweenness of the vertex means the number of shortest paths between any two vertices in the network that pass through given vertex. By contrast to closeness, this betweenness centrality represents the “degree” how the vertex fits among other vertices. Let 𝑛𝑠𝑡𝑖 is defined as 1 (if vertex i is on the shortest path between s and t) or 0 otherwise. Let gst is number of shortest paths between s and t. The betweenness centrality xi of vertex i is defined as

The Mean of dataset (represented by a data matrix X) is the vector which coordinates are constructed as the average of coordinates of all records.

6.3.7 Silhouette

Silhouette describes method for validation of cluster data consistency. Silhouette value is a measure of how similar an object is to its own cluster compared to how similar the object is to other clusters. The silhouette value ranges from −1 to +1. Higher value indicates that the object is more similar to its own cluster and weakly similar to neighboring clusters.

Stable clusters have most objects with high value. If many objects have a low or negative value, then the clustering configuration is not stable.

The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.

Let’s assume the data have been clustered into k clusters. For each object i, let a(i) be the average distance between i and all other objects within the same cluster. We can interpret a(i) as a measure of how well object i is assigned to its cluster (the smaller the value, the better the assignment). Let’s define the average dissimilarity of object i to a cluster c as the average of the distance from i to all objects in cluster c.

Let b(i) be the smallest average distance of objects i to any object in any other cluster (other than i is a member of). The cluster with this smallest average dissimilarity is said to be the “neighboring cluster"

of i because it is the next best fit cluster for object i.

We now define a silhouette value:

𝑠(𝑖) = 𝑏(𝑖) − 𝑎(𝑖)

𝑚𝑎𝑥{𝑎(𝑖), 𝑏(𝑖)} (11) Other writing of definition can be :

𝑠(𝑖) = {

1 −𝑎(𝑖)

𝑏(𝑖), 𝑖𝑓 𝑎(𝑖) < 𝑏(𝑖) 0, 𝑖𝑓 𝑎(𝑖) = 𝑏(𝑖) 𝑏(𝑖)

𝑎(𝑖)− 1, 𝑖𝑓 𝑎(𝑖) > 𝑏(𝑖)

(12)

Function s range of value: s(i): -1  s(i)  1 Silhouette is usually visualized as follows in Fig. 2

Fig. 2. Visualization of silhouette of clusters

6.3.8 Clustering coefficients

Clustering coefficients were defined to provide a measure of clustering potential of the network (graph).

Clustering coefficients measure density of triangles (local smallest clusters) in network. Principally we work with local ang global measure of clustering. The method works with “triple” as triple connected vertices (can be connected by two or three edges) and “triangle” as a triple connected by three edges.

Network (global) clustering coefficient C1 is defined as follows [WS98]:

𝐶1 =3 ∗ 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑖𝑎𝑛𝑔𝑙𝑒𝑠 𝑖𝑛 𝑛𝑒𝑡𝑤𝑜𝑟𝑘

𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑖𝑝𝑙𝑒𝑠 (13) Vertex (local) clustering coefficient of vertex i is defined as follows [KP15]:

𝐶𝑖=𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑖𝑎𝑛𝑔𝑙𝑒𝑠 𝑐𝑒𝑛𝑡𝑒𝑟𝑒𝑑 𝑎𝑡 𝑛𝑜𝑑𝑒 𝑖

𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑖𝑝𝑙𝑒𝑠 𝑐𝑒𝑛𝑡𝑒𝑟𝑠 𝑎𝑡 𝑛𝑜𝑑𝑒 𝑖 (14) Clustering coefficient C2 is defined as follows:

𝐶2 =1 𝑛∑ 𝐶𝑖

𝑖

(15) Clustering coefficient Ci is defined as follows [KP15]:

𝐶𝑖 =𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑝𝑎𝑖𝑟𝑠 𝑜𝑓 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑢𝑟𝑠 𝑜𝑓 𝑖 𝑡ℎ𝑎𝑡 𝑎𝑟𝑒 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑

𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑝𝑎𝑖𝑟𝑠 𝑜𝑓 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑢𝑟𝑠 𝑜𝑓 𝑖 (16) 6.3.9 Detection of outliers

Interquartile range (H-spread)

Interquartile range (IQR) is a measure of the spread of a distribution. The IQR is the difference between the 75th and 25th percentiles [DO14] or between upper and lower quartiles [GI96]. In statistics quantiles are limits splitting the range of a probability distribution into unbroken intervals with equal probabilities or dividing the observations in a sample in the same way. It means we have n-1 quantiles dividing distribution into n intervals. A quartile is a type of quantile – quartiles are the three limits that divide our dataset into four equal-sized groups.

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the dataset. The second quartile (Q2) is the median of the dataset. The third quartile (Q3) is the middle value between the median and the highest value of the data set.

IQR = Q3 – Q1

The interquartile range is often used to find outliers in data. Outliers here are defined as observations that fall below Q1 – 1,5 IQR or above Q3 + 1,5 IQR.