
Prague University of Economics and Business
Faculty of Informatics and Statistics

COVID-19 Data Analysis

MASTER THESIS

Study programme: Applied Informatics
Field of study: Information Technologies

Author: Nasiha Maleškić

Supervisor: prof. Ing. Petr Berka, CSc.

Prague, June 2021


Acknowledgement

I want to thank my supervisor prof. Ing. Petr Berka, CSc. for his guidance and valuable advice. I express the deepest gratitude to my family for the constant encouragement and support I never cease to receive from them. I also thank my partner for his constant love and support.


Abstrakt

Tato práce analyzuje data COVID-19. Použitá datová sada zahrnuje data z celého světa.

Hlavním cílem této práce je analýza dat pomocí klastrového algoritmu. Tato práce se skládá z pěti částí. V první části vysvětlíme, co je vyhledávání znalostí v databázích a úkoly, které řeší, různé oblasti použití a nakonec CRISP-DM, což je metodika použitá v této práci. Ve druhé části vysvětlujeme různé metody zjišťování znalostí, včetně shlukování, které se používá v analytické části, a také použité prostředí a knihovny. Ve třetí části porovnáváme pandemii COVID-19 s předchozími pandemiemi. Ve čtvrté části jsme stanovili cíle, vyčistili a připravili data a vizualizovali data, abychom jim lépe porozuměli. V závěrečné páté části vytváříme shlukové modely s různou úrovní podrobnosti a na základě údajů pro celý svět a poté pouze pro Evropu. Také vizualizujeme výsledky a vysvětlíme viditelné vzory.

Klíčová slova

COVID-19, analýza dat, shlukování, vizualizace dat

Abstract

This thesis analyses COVID-19 data. The dataset used covers data from all over the world.

The main aim of this thesis is to analyse the data with the use of a clustering algorithm. The thesis consists of five parts. In the first part we explain what knowledge discovery in databases is, the tasks it solves, different areas of application and finally CRISP-DM, which is the methodology used in this thesis. In the second part we explain different knowledge discovery methods, including clustering, which is used in the analysis part, as well as the environment and libraries used. In the third part we compare the COVID-19 pandemic with previous pandemics. In the fourth part we set the goals, clean and prepare the data and visualize it to get a better understanding of it. In the final, fifth part, we create clustering models with different levels of granularity, using data for the entire world and then only for Europe. We also visualize the results and explain the visible patterns.

Keywords

COVID-19, data analysis, clustering, data visualization


Contents

Introduction 15

1 Knowledge Discovery in Databases 17

1.1 Introduction . . . 17

1.2 Tasks . . . 17

1.3 Applications . . . 22

1.3.1 Medicine and drug development . . . 22

1.3.2 Banking . . . 22

1.3.3 Crime . . . 23

1.3.4 Retail . . . 23

1.3.5 Politics . . . 23

1.4 CRISP-DM . . . 24

1.4.1 Methodology . . . 24

2 Knowledge discovery methods 27
2.1 Cluster analysis . . . 27

2.1.1 Introduction . . . 27

2.1.2 K-means clustering . . . 28

2.2 Regression . . . 30

2.2.1 Linear Regression . . . 30

2.2.2 Logistic Regression . . . 32

2.3 Principal component analysis . . . 32

2.4 Environment and libraries . . . 32

3 COVID-19 Pandemic 35
3.1 Introduction . . . 35

3.1.1 Importance of data . . . 36

3.2 Comparison with other pandemics . . . 36

3.2.1 Overview of previous/ongoing pandemics . . . 36

3.2.2 Comparison . . . 37

4 Case study 41
4.1 Business understanding . . . 41

4.1.1 Understanding the problem . . . 41

4.1.2 Establishing goals . . . 41

4.2 Data understanding . . . 42

4.2.1 Data Collection . . . 42

4.2.2 Data Description . . . 43

4.3 Data Preparation . . . 44

4.3.1 Missing data . . . 44


4.4 Data visualizations . . . 48

4.4.1 Cases and deaths . . . 48

4.4.2 Testing . . . 53

5 Modelling 57
5.1 Cluster analysis - World . . . 57

5.1.1 Cluster analysis - World on 10-07-2020 in 4 clusters . . . 57

5.1.2 Cluster analysis - World on 10-07-2020 in 8 clusters . . . 67

5.2 Cluster analysis - Europe . . . 72

5.3 Cluster analysis - using different sets of attributes . . . 76

5.3.1 Clustering world countries using stringency index . . . 77

5.3.2 Clustering world countries using COVID-19 indicators . . . 78

5.3.3 Clustering world countries using socio-demographic and economic indicators . . . 79

5.3.4 Clustering European countries using stringency index . . . 80

5.3.5 Clustering European countries using COVID-19 indicators . . . 81

5.3.6 Clustering European countries using socio-demographic and economic indicators . . . 82

5.4 Cluster analysis - summary . . . 82

Conclusion 83

A Form in full wording 91

B Source code of the computational procedures 93


List of Figures

1.1 Example segmentation. . . 19

1.2 Example classification vs prediction. Image source: [7] . . . 20

1.3 Example of an outlier. Image source:[8] . . . 21

1.4 Process diagram showing the relationship between the phases of CRISP-DM. . . 25

2.1 Hierarchical vs non-hierarchical clustering. . . 27

2.2 Clustering example with centroids. . . 29

2.3 Clustering example with elbow method. . . 29

2.4 Linear regression. . . 30

2.5 No linearity. . . 31

2.6 Logistic regression. . . 32

3.1 Graph showing CFR values for different pandemics. . . 38

3.2 Graph showing R0 values for different pandemics. . . 39

4.1 Table showing the dataset information - part 1. . . 42

4.2 Table showing the dataset information - part 2. . . 43

4.3 Table showing the dataset information - part 3. . . 43

4.4 Table showing the missing data. . . 45

4.5 Table showing the missing data percentages. . . 45

4.6 World: countries with most cases (total cases per country). Date: 07.10.2020. . . 49

4.7 World: countries with most cases (total cases per million). Date: 07.10.2020. . . 49

4.8 World: countries with most deaths (total deaths per million). Date: 07.10.2020. . 50

4.9 World - Countries with most deaths - Total deaths per million per country. Date: 07.10.2020. . . 51

4.10 World - Countries with most cases - Total cases per country. Date: 07.10.2020. . 51

4.11 World - Countries with most cases - Total cases per million per country. Date: 07.10.2020. . . 52

4.12 Europe: countries with most cases (total cases per million). Date: 07.10.2020. . . 52

4.13 Europe: countries with most deaths (total deaths per million). Date: 07.10.2020. . . 53
4.14 Europe: countries with most tests (total tests per country). Date: 07.10.2020. . . 54

4.15 Europe: countries with most tests (total tests per million). Date: 07.10.2020. . . 54

4.16 Europe: countries with most positive rate of tests. Date: 07.10.2020. . . 55

5.1 Original dataframe for clustering. . . 58

5.2 Cleaned dataframe. . . 59

5.3 Dataframe with normalized values. . . 59

5.4 World clusters with KElbowVisualizer. . . 60

5.5 World clusters with Elbow method. . . 61

5.6 World - four clusters PCA visualization. . . 62


5.7 World map with 4 clusters. Date: 07.10.2020. . . 65

5.8 World four clusters with total cases per million (number) with life expectancy (color) graphs. . . 66

5.9 World four clusters with total cases per million (number) with GDP per capita (color) graphs. . . 66

5.10 World - eight clusters PCA visualization. . . 68

5.11 World map of eight clusters. Date: 07.10.2020. . . 70

5.12 Graphs for first four clusters out of eight with total cases per million (number) with GDP per capita (color). . . 71

5.13 Graphs for second four clusters out of eight with total cases per million (number) with GDP per capita (color). . . 71

5.14 Europe clusters with KElbowVisualizer. . . 72

5.15 Europe clusters with Elbow method. . . 72

5.16 European clusters PCA visualization. . . 73

5.17 European clusters. Date: 07.10.2020. . . 74

5.18 European cluster zero with total cases per million (number) with GDP per capita (color). . . 75

5.19 European cluster one with total cases per million (number) with GDP per capita (color). . . 75

5.20 European cluster two with total cases per million (number) with GDP per capita (color). . . 76

5.21 Clustering world countries using stringency index. . . 77

5.22 Stringency index. . . 78

5.23 Clustering world countries using COVID-19 indicators. . . 79

5.24 Clustering world countries using socio-demographic and economic indicators. . . 80

5.25 Clustering European countries using stringency index. . . 80

5.26 Stringency index. . . 81

5.27 Clustering European countries using COVID-19 indicators. . . 81
5.28 Clustering European countries using socio-demographic and economic indicators. . . 82


List of Tables

3.1 Overview of previous/ongoing pandemics . . . 37


Abbreviations

AIDS Acquired Immunodeficiency Syndrome

CFR Case Fatality Rate

CIA Central Intelligence Agency

COVID-19 Coronavirus Disease 2019

COiN Contract Intelligence

CRISP-DM Cross-industry Standard Process for Data Mining

CSV Comma-Separated Values

ECDC European Centre for Disease Prevention and Control

EDA European Defence Agency

GDP Gross Domestic Product

GPA Grade Point Average

HDI Human Development Index

HIV Human Immunodeficiency Virus

HTML HyperText Markup Language

ICU Intensive Care Unit

IDE Integrated Development Environment

IFR Infection Fatality Ratio

KDD Knowledge Discovery in Databases

MERS Middle East Respiratory Syndrome

OECD Organisation for Economic Co-operation and Development

PAHO Pan American Health Organization

PCA Principal Component Analysis

PDF Portable Document Format

PPP Purchasing Power Parity

SARS Severe Acute Respiratory Syndrome

SARS-CoV-2 Severe Acute Respiratory Syndrome Coronavirus 2

SAT Scholastic Assessment Test

UNDP United Nations Development Programme

WHO World Health Organization


Introduction

COVID-19 is a contagious disease that was first identified in December 2019 in Wuhan, China.[1] It has since led to a pandemic that has impacted the entire world. In order to slow down this pandemic, which has taken so many lives and caused health difficulties for many more, numerous measures were taken. Since the virus is transmitted from one person to another when in close proximity, the main measures revolved around closing down places of gathering, such as stores, schools and offices. In some countries there were, for some periods of time, total lockdowns during which people were allowed to go out only for essential needs, such as food and medicine. Whether people lost their loved ones, got sick, had to take classes online or work from home - this pandemic has caused an upheaval in everyone's lives. The world is still struggling to understand the best way to save people's lives while minimizing the impact on the economy.

This is the main motivation for this work. The goal of this thesis is to analyse COVID-19 data and create segmentation models. Large amounts of data have been gathered from all over the world, and with this data we can better understand the pandemic and figure out which approaches have had more success. The goal is to perform a cluster analysis of countries to identify similarities between them.


1. Knowledge Discovery in Databases

1.1 Introduction

Today we have a colossal amount of raw data, and the statistics regarding its growth are astounding. Humans produce 2.5 quintillion bytes of data every day, and data growth is accelerating: 90% of the world's data has been created in the last two years alone.[2] While raw data can come from anywhere, including sensors, transactions and security cameras, a lot of it is generated on social networks. Some daily statistics include 500 million tweets and 294 billion emails sent, 4 petabytes of data created on Facebook, 65 billion messages sent on WhatsApp and 5 billion searches made[3], half of them from smartphones[4]; in addition, 4,500,000 videos are streamed on YouTube every 60 seconds.[4]

There are two main types of generated data: captured and exhaust data. Captured data comes from purposeful experiments and investigation, such as collecting information and creating statistics about the world. We can find such information from health organizations or government censuses. Exhaust data is gathered by machines as a secondary function, be it smartphones, cash registers or radars.[5]

All of this data begs the question whether it can be of some use to us, whether we can find some knowledge in it. This is exactly what knowledge discovery in databases does: its goal is to gain knowledge from data.[6] Data mining is a term that is often, although incorrectly, used interchangeably with it. Data mining is only one step of the entire knowledge discovery process - the step in which we apply intelligent methods in order to extract patterns from the data.

1.2 Tasks

There is a variety of tasks in knowledge discovery in databases, each with different goals. When performing these tasks we look for different patterns, and the goals differ based on the output we want to obtain. In some cases the output is a textual or visual summary of the data; in others it is a model capable of labeling new objects. Some tasks can be seen as obligatory when working with data, such as giving a proper description of the dataset, while other tasks are performed based on the goals set by the researchers. Each task is described in more detail below.

They can be split into three different groups according to Klosgen and Zytkow:

1. Classification / prediction
2. Description
3. Searching for nuggets, that is searching for surprising knowledge.


According to Chapman there are seven basic knowledge discovery in databases tasks:

1. Data description and summarization
2. Segmentation
3. Concept description
4. Classification
5. Prediction
6. Dependency analysis
7. Deviation detection

Data description and summarization

We want to describe and summarize the data so that it is more easily understood by ourselves and others. This is especially useful in large companies where different teams work on the same data, since it saves the time needed to understand the dataset's specifics.

Describing the main characteristics of the data, as well as all of the relevant details, helps the user of the data know how to analyze it in the best way possible. This means they will be able to clean and prepare the data properly and apply the proper modeling techniques. This may include, but is not limited to: finding the average and the dispersion of the data, describing the shape of the data using plots such as histograms, and reporting summary statistics.
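As an illustration, the following is a minimal sketch of such a summary using pandas; the file name and column name are placeholders rather than anything used later in the thesis.

import pandas as pd
import matplotlib.pyplot as plt

# Load a dataset (placeholder file name) and print basic descriptive statistics.
df = pd.read_csv("data.csv")

print(df.shape)       # number of rows and columns
print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, dispersion (std), min, quartiles and max per numeric column

# The shape of a single numeric column can be inspected with a histogram.
df["some_numeric_column"].plot(kind="hist", bins=30)  # placeholder column name
plt.show()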

Concept description

Concept description is the simplest kind of descriptive data mining. It provides us with a concise summary of a concept, where the concept is usually a collection of data. It characterizes a collection of data and compares it with others. The output can take many different forms, including charts, graphs, generalized relations and logical rules.

Dependency analysis

The goal of dependency analysis is to describe significant dependencies or associations between data items or events. One of the techniques used is called Extended Dependency Analysis, a heuristic search for significant relationships between nominal variables in large datasets.


Segmentation

Data segmentation means taking the data and splitting it into groups based on common characteristics. We organize the data objects based on the similarity of one or multiple attributes. Segmentation is often used in marketing, where people are split into different groups according to their interests and preferences. This makes it easier to create targeted marketing, understand the buyer personas, appeal to them in a more personalized manner and increase sales.

Figure 1.1: Example segmentation.

Classification

Classification is the task of assigning new, never before seen observations to a category. It consists of two phases. In the first phase we train the classification model using training data where the category membership of each observation is already known. The model is trained and evaluated until a satisfactory level of accuracy has been reached. The model is then used to classify new data, that is to assign class labels (category membership) to it. Classification is one of the most commonly performed tasks due to its versatility. Some examples: a model that classifies which loan applicants belong to a group that is more likely not to pay their loan on time (a business can decide how to choose its customers this way), classifying which patients belong to a riskier group based on their health status (doctors can provide specific recommendations to such patients) and deciding whether the email you have just received is spam or not (to save your time).
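As a minimal sketch of the two phases described above, the snippet below trains a classifier on labelled data and then assigns a class to a new observation; the dataset and the decision tree classifier are illustrative choices, not something prescribed by this thesis.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Phase 1: train and evaluate a model on data whose category membership is already known.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on held-out data:", accuracy_score(y_test, model.predict(X_test)))

# Phase 2: use the trained model to assign a class label to a new, unseen observation.
new_observation = [[5.1, 3.5, 1.4, 0.2]]
print("predicted class:", model.predict(new_observation))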


Prediction

Prediction is similar to classification, only in this case we want to predict a missing or unknown element (a continuous value) of a dataset. Another name for prediction is regression.

The goal of the prediction task is to build a model that will predict an outcome which is a continuous value. Some examples: a model that predicts how well a patient will respond to treatment (what the probability is that they will require additional treatment), how many customers will come into a restaurant at a given time or day (which helps determine how much staff and what quantities of food will be needed) and how much money a certain customer is willing to spend on their data plan (which helps the business decide whether to suggest upgrades).

Classification vs Prediction

In the image below we can see an example of the difference between classification and prediction.

In the case of prediction we would create a model that predicts the precise temperature for tomorrow. In the case of classification we would create a model that predicts whether tomorrow will be hot or cold, that is, it would classify the day as belonging to one of those two categories.

Figure 1.2: Example classification vs prediction. Image source: [7]


Deviation detection

Deviation detection describes the most significant changes in the data from previously measured or normative values. It can reveal surprising facts hidden in the data, possible problems and malpractices, inconsistent information that could indicate fraud or incorrectly entered data, and unrealized trends that could lead to fruitful business decisions and scientific discoveries. An outlier is a legitimate data point that is far away from the mean or the median of a distribution. It is always necessary to understand the outliers before we decide to remove them. There are three main causes of outliers:

1. Data entry or measurement errors

When entering data manually into a database, typos can happen. For example, in a column for gender where we expect F and M we find 1, or in a column for height in centimeters we find the value 9000. Those two are clearly outliers. Today most software tries to prevent such mistakes by restricting input to a specific range of values or characters.

2. Sampling problems and unusual conditions

One such example is a study that was conducted to model bone density growth in pre-adolescent girls with no health conditions affecting bone growth. One of the subjects had an unusual growth value. The researchers found out that she had a medical condition which affected it. Since this was not in the scope of the study, the measurements from that subject would be removed as outliers.

3. Natural variations

Natural variations are simply produced by nature. They should not be removed as outliers, since they could lead to important scientific discoveries.

Outlier example

We have data from a running competition in which 100 students competed. All students took less than 30 seconds to finish the race, except for one student who took 82 seconds. This is an outlier. The example also shows that in this case the median and mode are not affected by the outlier, whereas the mean is, and the mean is therefore not a suitable representation when reporting this event.

Figure 1.3: Example of an outlier. Image source:[8]
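The effect can be reproduced numerically; the times below are made up to mimic the described race (99 finishers under 30 seconds plus one at 82 seconds) and serve only as an illustration.

import numpy as np

rng = np.random.default_rng(0)
times = rng.uniform(14, 30, size=99)          # 99 students finishing in under 30 seconds
times = np.append(times, 82.0)                # one outlier who took 82 seconds

print("mean:  ", round(times.mean(), 2))              # pulled upwards by the outlier
print("median:", round(float(np.median(times)), 2))   # barely affected by the outlier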


1.3 Applications

1.3.1 Medicine and drug development

Drug discovery and development

Bringing a new drug to market has been shown to take more than a decade on average and to cost between $350 million and $2.7 billion. A lot of this money goes down the drain because only a small portion of the candidate chemicals is used in the end. There are two main ways artificial intelligence can be used when seeking treatment for a new disease.

The first is to find a new drug, which then has to go through months (or years) of clinical trials. The second is to repurpose existing drugs, which can lead to much faster delivery to patients since these drugs have already been through clinical trials.[9]

Personalized treatments

Researchers at the Hospital Clinic in Barcelona (Spain) have developed a tool that was 'trained' on more than a trillion anonymized data points retrieved from the clinic's electronic health records system. Initial studies have shown that the tool was able to correctly predict the trajectory of the disease in individual patients. This is one example of a tool that can help medical staff plan and prepare in advance, whether by increasing the number of staff members or ordering additional supplies of medication. It enables personalized treatments: the study showed improvement at day five of treatment in 93.3% of the personalized-therapy patients, compared to 59.9% on standard of care. Additionally, at day five, 2% of personalized-therapy patients had died vs. 17.7% on standard of care, and 28-day mortality was 20% vs. 44.2%. The total number of patients included in the study is less than 300. These are very small numbers and additional research is needed. Even so, it sheds light on personalized treatments and opens new paths for artificial intelligence in healthcare.[10]

1.3.2 Banking

Banks are also looking for ways to profit from the data they have. JP Morgan Chase & Co. is the biggest multinational investment bank in the United States, with more than 240,000 employees serving millions of customers.[11] They implemented a system called COiN, short for Contract Intelligence, created to process and analyze different documents. In just a few seconds COiN was able to analyze 12,000 annual commercial credit agreements; done manually by employees, the same amount of work took 360,000 hours. It was also noted that the system was successful in reducing human errors.[12]


1.3.3 Crime

Data mining systems are created to prevent crime, predict where and when it may occur and counter terrorism. Predpol is a company that has created a machine learning system that predicts crime type, location and time.[13] They analyze existing data to recommend where police patrols should be increased, calling this real-time epidemic-type aftershock sequence crime forecasting.[14] Predpol is already used in several American cities. In Washington there was a 22 percent drop in residential burglaries after implementing Predpol.

Another study concluded that it resulted in a 7.4% reduction in crime volume.[14]

1.3.4 Retail

It is more than eight years now since Target, an American retail corporation, made big news for creating a data mining system so good that it figured out a teenage girl was pregnant before her father did. They created a system that profiled customers based on their purchases and sent them coupons for relevant products. The teen received coupons for pregnancy related products and that is how her father found out about it as well.[15] Searching for patterns and identifying relationships between the items that people buy is known as market basket analysis.[16] Once such patterns are identified, they can affect the promotions that retailers give out, the recommendations they make and even the placement of items in the store (be it physical or online).[17]

1.3.5 Politics

The Facebook–Cambridge Analytica data scandal is the most famous recent scandal involving unethical usage of data. It involved using the personal data of millions of Facebook users without their consent in order to create profiles that would be used for political advertising. The main reasons for the huge public outrage were the lack of consent for using personal data and the fact that the data was used to sway elections, thereby creating influence that goes beyond regular targeted ads designed to make you buy one product or another.[18]

They used Amazon Mechanical Turk to give people a task for which they would be paid. But in order to get paid it was necessary to download a Facebook app called This Is Your Digital Life. This app would take their responses to the survey along with all of the user's Facebook data, as well as the data of all of their friends.[19] While only about 260,000 users downloaded the app, Cambridge Analytica managed to harvest the data of up to 87 million Facebook profiles (mostly from America).[20] Other data was gathered as well and different models were created to figure out the best way to find target users and influence their behaviour. This data was used to inform targeted political advertising.[18]

They chose users who were more prone to impulsive anger or conspiratorial thinking than average citizens.[18] Cambridge Analytica would create fake Facebook groups and post videos and images to create maximum engagement and in turn influence the users' opinions and behaviours. The company started operating in 2014, influencing the 2016 elections in the USA. In 2018 Christopher Wylie, a former Cambridge Analytica employee, disclosed inside information about how the company operated and the way it managed to sway the elections.[21]

1.4 CRISP-DM

1.4.1 Methodology

The methodology used in this project is called Cross-industry standard process for data mining (CRISP-DM). It consists of six phases in total:

1. Business understanding: The goal is to determine the business objectives and data mining goals and to create a plan for the project. It is necessary to discuss these goals thoroughly with the stakeholders of the project, otherwise we may end up wasting time creating solutions that were not required of us and do not contribute to the business.

2. Data understanding: The goal is to collect the data, describe it, explore it and verify its quality. This can include creating a textual summary of the data, statistics or graphs to visualize the data.

3. Data preparation: The goal of this stage is to select, clean and format the data that will be used for modelling. Data preparation consists of all activities that are done in order to create the final dataset that we will use to create our models. It is considered to be one of the most time consuming parts of the project.

4. Modelling: The goal is to create a model, so during this stage we explore multiple options and select an appropriate modelling technique. The technique used will depend on the task we are trying to solve, the size of the dataset and the type of data we are working with.

5. Evaluation: In this phase we evaluate the results obtained in the modelling phase. If the models are not sufficiently precise we may decide to create new models, examine the data again or establish new business goals, which may lead to more accurate and precise models.

6. Deployment: The knowledge we have obtained from the models is deployed so that the end user can benefit from it. In most cases it will actually be the user who carries out the deployment, not the person who created the models. Deployment will look different across domains and organizations, whether it is a set of new guidelines for the company, recommendations for new treatment of patients or new software deployed to increase sales efficiency.


This process is not linear. During the course of the project it is possible to go back from one stage to another. It is especially common to go back and forth between business and data understanding because one helps us understand the other one better. It is also commonly seen between data preparation and modelling, because depending on which algorithms we decide to use we may have to alter our data slightly. And finally, when we are evaluating our model and we decide that we are not happy with it we can go back to business understanding to get a better picture of what it is we wanted in the first place. This process can be seen in the image below.

Figure 1.4: Process diagram showing the relationship between the phases of CRISP-DM.


2. Knowledge discovery methods

2.1 Cluster analysis

2.1.1 Introduction

Cluster analysis, or clustering, is the grouping of a set of objects into groups (clusters) so that objects within one group are more similar to each other than to objects in other groups. This can be useful for market segmentation - understanding our target audience better so that we are able to offer specific products to specific people depending on the segment they fall into. It has many further uses, for example in medical imaging, crime analysis and biology.

There are different types of clustering, the two main categories being hierarchical and non-hierarchical. Hierarchical clustering creates a tree of clusters and is well suited for tasks such as the taxonomy of animals or the hierarchical structure of an organization. The main non-hierarchical clustering approaches are distribution-based, density-based and centroid-based.

Distribution-based clustering assumes the data is composed of distributions; based on the distance from a distribution's center it is calculated whether and how strongly a point belongs to that distribution. It can only be used when we know the type of distribution in our data. Density-based clustering connects areas of high density into clusters. Centroid-based clustering is the best known and we will be using it in our example. It organizes the data into non-hierarchical clusters. A common and simple centroid-based algorithm, which we will use, is K-means.[22]

Figure 2.1: Hierarchical vs non-hierarchical clustering.


2.1.2 K-means clustering

K-means is an iterative, unsupervised algorithm. It assigns each data point to exactly one group and iterates until the assignments stop changing. It is unsupervised because we do not know the correct answer in advance: we do not know how many clusters the data is supposed to have or where each point is supposed to belong. Due to its simplicity it is one of the most popular algorithms for clustering.[23]

K-means clustering algorithm steps:

1. Choose the number of clusters (possible techniques will be described below).

2. Randomly choose a centroid for each cluster (centroid should be the middle of a cluster).

3. Assign each point to the closest centroid.

4. Recompute the centroid by taking the average of all points in the cluster, then re-assign the points to the now nearest centroid.

5. Repeat the calculation of the centroids until points stop changing clusters or, in the case of larger datasets, until convergence is reached.[22]
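The steps above can be sketched directly in a few lines of NumPy. This is an illustrative, unoptimized implementation, not the code used in the analytical part of the thesis (which relies on scikit-learn).

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it.
        # (A production implementation would also handle clusters that become empty.)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids, and therefore the assignments, no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on random two-dimensional data:
# labels, centroids = kmeans(np.random.rand(200, 2), k=3)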

Techniques for selecting the number of clusters:

• Elbow method

• Gap statistic

• Silhouette Coefficient

Elbow method

The elbow method is one of the heuristics used to determine the number of clusters in a dataset. Explained variation measures the proportion to which a mathematical model accounts for the dispersion in the data, that is, how stretched or squeezed the data is. The elbow method looks at the percentage of variance explained as a function of the number of clusters. The idea is to select a number of clusters such that adding a further cluster would not noticeably increase the quality of the model. The first clusters added explain a lot of variance and add a lot of information to the model, but at some point the marginal gain drops, resulting in an angle (the "elbow") in the plot; the number of clusters is chosen at this point. Despite its popularity, the technique is considered subjective and unreliable, since the elbow cannot always be unambiguously identified.

In the example below we can see the entire process. First, the four images show the centroids in their final phase, when they are at the center of their data points. In the first image we have only one centroid, meaning we would have only one cluster, with a distortion score of 608.47. Distortion can be understood as the misrepresentation of a dataset; it is high here because a single centroid is trying to represent all of the data. With two clusters the distortion drops to 130.40, with three clusters it drops significantly to 27.24, and with four clusters it drops only to 20.62. The best way to visualize this is a line graph. For this example we see that the optimal number of clusters is three, since the line flattens out after that point.

Figure 2.2: Clustering example with centroids.

Figure 2.3: Clustering example with elbow method.
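A sketch of how such an elbow plot can be produced with scikit-learn and Matplotlib is shown below; the synthetic data generated with make_blobs is an assumption made purely for the illustration, so that the elbow appears at three clusters as in the figure above.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups, so the elbow should appear at k = 3.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 10)
distortions = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(model.inertia_)  # sum of squared distances to the closest centroid

plt.plot(ks, distortions, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("distortion score (inertia)")
plt.show()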


2.2 Regression

2.2.1 Linear Regression

There are two types of linear regression. Simple linear regression allows us to study the relationship between two continuous variables. The model function is:

Y = a + bX + u

The variable denoted X is regarded as the explanatory variable, the variable that we use to predict the variable Y. The variable denoted Y is regarded as the response, the outcome we get from the model once we input the explanatory variable. Since Y depends on X it is also called the dependent variable, whereas X is called the independent variable.[24] The residual value u, which is the difference between the actual outcome and the predicted outcome, is included in the model to account for slight variations. The variable a is the y-intercept (constant term) and b is the slope coefficient for the explanatory variable.[25]

In the image below we can see an example application of linear regression. Blue dots represent data points and the red line represents the linear function onto which we are mapping our data.

Figure 2.4: Linear regression.

Illustrative examples include: predicting GPA based on SAT scores, predicting work performance based on IQ test scores and predicting soil erosion based on rainfall.
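A minimal sketch of fitting such a model with scikit-learn follows; the data is generated artificially from Y = a + bX + u with known coefficients, so the example only illustrates the mechanics.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following Y = a + bX + u with a = 2, b = 0.5 and random noise u.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 + 0.5 * X[:, 0] + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
print("estimated intercept a:", model.intercept_)
print("estimated slope b:    ", model.coef_[0])
print("prediction for X = 4: ", model.predict([[4]])[0])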

Linear regression makes five assumptions that have to be fulfilled in order to apply it to our data[26]:


• Linearity: There has to be a linear relationship between the explanatory and the outcome variable. In the image below we can see an example of the data where no linearity is present.

Figure 2.5: No linearity.

• Homoscedasticity: The noise (error variance) is the same across all values of the explanatory variables.

• Independence: Observations are independent of each other.[26]

• Normality: Linear combination of the random variables should have a normal distribution.[27]

• No or little multicollinearity: Multicollinearity occurs when the independent variables are too highly correlated with each other. Some examples are: height and weight (taller people likely weigh more), two variables that seem different but are in fact the same (weight in kilos and in pounds), one variable that can be derived from another, and two variables that are the same but have different names.[28]

Multiple linear regression uses multiple explanatory variables to predict the outcome variable.

The model function is:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

As we can see it is very similar; the only difference is the multiple independent variables. The variable denoted Y is the dependent variable and the Xi are the explanatory (independent) variables. The variable a is the y-intercept (constant term) and the bi are the slope coefficients for each explanatory variable. The variable u is the model's error term.

Multiple linear regression is used more commonly than simple linear regression. This is because in the real world it is difficult to find a dependent variable that can be explained by only one variable; there are typically many variables that have an impact on something.


2.2.2 Logistic Regression

Logistic regression is used to explain the relationship between one or more explanatory variables of any type and one binary dependent variable.[29] We are trying to predict whether something is True or False. Illustrative examples include: checking whether an email is spam, whether a patient is healthy, whether a student will pass or fail, and other yes/no types of predictions.[30]

Figure 2.6: Logistic regression.
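A minimal sketch of such a yes/no prediction with scikit-learn; the breast cancer dataset and the scaling step are illustrative assumptions, not part of the analysis in this thesis.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary outcome: each observation is labelled 0 (malignant) or 1 (benign).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
print("estimated probability of class 1 for the first test case:",
      model.predict_proba(X_test[:1])[0, 1])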

2.3 Principal component analysis

Principal component analysis is a linear, unsupervised dimensionality reduction method. It is most often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the original set. It extracts information from a high dimensional space and projects it into a lower dimensional subspace, trying to preserve the directions containing most of the variation in the data and to discard those that do not. Besides reducing the number of dimensions in a dataset, principal component analysis can also be used to find patterns in multidimensional data, visualize high dimensional data, ignore noise, improve the quality of models, and obtain a compact description of the data while capturing as much of the original variance as possible.
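A minimal sketch of reducing a dataset to two principal components with scikit-learn, in the spirit of the PCA visualizations used later; the iris dataset is only an illustrative stand-in.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)   # four-dimensional data

pca = PCA(n_components=2)           # keep the two components carrying the most variance
X_2d = pca.fit_transform(X)

print("share of variance explained:", pca.explained_variance_ratio_)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.show()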

2.4 Environment and libraries

We will use Jupyter Notebook, an open-source web application that allows us to create shared documents containing live code, visualizations, equations, images and text. These documents are called Notebooks.[31] Jupyter Notebook has become very popular in data science because, thanks to the Notebooks' versatility and interactivity, it can be used as a presentational and educational tool. It is much more than just an IDE: it allows saving Notebooks as PDF, HTML and LaTeX documents and it supports over 40 different programming languages.

We will use the programming language Python, together with many of its libraries. Two main libraries we will use are pandas and numpy. Pandas is a powerful library used for reading and writing to different data formats and structures and easily manipulating large data sets.[32]

Numpy, among other uses, is used for working with arrays.[33]

For plotting graphs and creating visualizations we will use Matplotlib[34], Plotly[35] and Seaborn[36]. Both Matplotlib and Plotly are used for visualizations; Plotly has some advantages when it comes to creating interactive plots, making them easier to adjust and well-suited for conveying important insights from the data. Seaborn is based on Matplotlib and helps create more attractive graphics.[37]

The scikit-learn library will provide us with useful algorithms: K-means for cluster analysis, the preprocessing module for normalization of the data and PCA for graph creation.[38]
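As a sketch of how the normalization step fits in, the snippet below rescales a toy dataframe with scikit-learn's preprocessing module; the use of MinMaxScaler and the column values are assumptions made for illustration only.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy dataframe; the real one is built from the COVID-19 dataset in the case study.
df = pd.DataFrame({"total_cases_per_million": [100.0, 2500.0, 40000.0],
                   "gdp_per_capita": [1500.0, 22000.0, 65000.0]})

scaler = MinMaxScaler()  # rescales every column to the range [0, 1]
normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(normalized)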


3. COVID-19 Pandemic

3.1 Introduction

COVID-19 is a novel disease that is caused by a coronavirus, named 2019-nCoV at first and later renamed to SARS-CoV-2, belonging to the Orthocoronavirinae subfamily.

Coronaviruses can infect the respiratory, gastrointestinal, hepatic and central nervous systems of humans, livestock, birds, bats, mice and many other wild animals.[39] The initial outbreak was reported at a wholesale seafood wet market, the Huanan Seafood Wholesale Market, in Wuhan, Hubei, China in December 2019.

As of February 6, 2020 the WHO had documented 28,276 confirmed cases and 565 deaths globally, including 25 new countries to which COVID-19 had spread.[40] The disease has spread rapidly and globally. As of 26 November 2020, COVID-19 had affected 218 countries and territories around the world, resulting in a total of 60,819,346 cases and 1,428,873 deaths. According to some approximations there are a total of 17,266,426 active cases, of which 104,519 patients are in a serious or critical condition.[41]

Transmission happens when an infected person comes into close contact with another. The virus spreads mainly through the respiratory route: when the infected person is breathing heavily, speaking, coughing, sneezing or doing any other similar activity, they emit small liquid particles. Larger liquid particles are called respiratory droplets and smaller ones are called aerosols. Studies have shown that the main way the virus spreads is via respiratory droplets, although aerosol transmission can occur in specific settings, particularly in indoor, crowded and inadequately ventilated spaces, where infected persons spend long periods of time with others.[42] The research is still inconclusive, but some studies point out that asymptomatic transmission (an infected person with no symptoms infecting another person) is very frequent and a reason why the pandemic is spreading so quickly.

Early research suggested that the rate of asymptomatic infections could be as high as 81%.[43] A meta-analysis, which included 13 studies involving 21,708 people, calculated the rate of asymptomatic presentation to be 17%.[44] Aside from asymptomatic transmission, another reason for the rapid growth in infected persons is super-spreader events, which include business conferences, night clubs and religious gatherings.

In order to prevent contracting COVID-19 it is recommended to practice social distancing, wear a mask when unable to keep social distance, regularly wash hands with soap (which destroys the protective membrane of the virus) or clean them with a hand sanitizer containing at least 70% alcohol.[40] In order to slow down the spread of the virus many countries have imposed some form of restrictions. By April 2020, over 3.9 billion people in over 90 countries and territories had been asked or ordered to stay at home by their governments.[42]


3.1.1 Importance of data

Collecting, sharing and analyzing data regarding the COVID-19 pandemic is of the utmost importance. It helps doctors gain better insights into the disease, since they are able to access information about thousands, even millions, of other patients and therefore provide better treatment. Sharing data on what works and what does not helps create better insights.

3.2 Comparison with other pandemics

In this chapter we will compare the COVID-19 pandemic with some previous pandemics.

Scientists have been researching previous pandemics (even before this one started) in order to understand what has worked before. Looking at previous pandemics can help us understand the current one from different points of view: comparing the viruses that were the root cause, the transmission and incubation specifics, as well as the social factors, such as human behaviour and reactions to different measures like social distancing and masks.

The code for creating the dataset that contains the data for all pandemics, as well as the visualizations can be found in the Jupyter Notebook titled Comparison with other pandemics.

3.2.1 Overview of previous/ongoing pandemics

MERS: Middle East respiratory syndrome is a viral respiratory infection and, just like COVID-19, it is caused by a coronavirus, albeit a different one. The first case occurred in June 2012 in Jeddah, Saudi Arabia.[45][46]

SARS: Severe acute respiratory syndrome broke out in 2002 and no case has been reported since 2004. It was caused by the first identified strain of the SARS coronavirus SARSr-CoV.

In 2019 another strain of SARSr-CoV was identified and it is the cause of COVID-19.[46]

ZIKA: Zika virus, which is spread by mosquitoes, caused an epidemic that lasted from 2015 to 2016. Zika can spread from a pregnant woman to the baby and cause different birth defects.[47][48][49]

HIV/AIDS: Human immunodeficiency virus infection and acquired immunodeficiency syndrome is a spectrum of conditions caused by the human immunodeficiency virus. The infected person at first may not even know they have been infected, but their immune system will continue to weaken unless they start taking medication. At the moment there is no vaccine, but antiretroviral treatment can slow the course of the disease. It was first recognized in 1981 and to this day it is still considered to be a pandemic.[50]


The Spanish flu: The Spanish flu, also known as the 1918 flu pandemic, was caused by the H1N1 influenza A virus. It caused a pandemic that lasted from February 1918 to April 1920.[51]

Ebola: Ebola is a viral hemorrhagic fever that was first identified in 1976, but there have been multiple outbreaks since then, the most recent starting in July 2019 in Congo. The largest outbreak to this day was in West Africa from 2013 to 2016.[52][53]

H1N1: The 2009 swine flu pandemic was an influenza pandemic that lasted about 19 months, from January 2009 to August 2010. Just like the 1918 flu pandemic it was caused by the H1N1 influenza virus.

Nipah: Nipah virus has caused numerous disease outbreaks in South and Southeast Asia.[54]

H3N2: Influenza A virus subtype H3N2 (A/H3N2) is a subtype of viruses that causes influenza (also known as the seasonal flu).[55]

Cholera: Cholera is an infection of the small intestine caused by some strains of the bacterium Vibrio cholerae. In the last 200 years there have been seven cholera pandemics along with numerous outbreaks. The first pandemic originated in India in 1817. The sixth cholera pandemic, which lasted from 1899 to 1923, is considered to be the biggest one.[56]

3.2.2 Comparison

Using data from the various cited sources we have created a simple dataset that consists of the total number of cases, total number of deaths, case fatality rate, basic reproduction number (R0), number of countries to which the pandemic had spread and the year the first case was reported. This dataset can be seen below; '-1' indicates missing data.

Table 3.1: Overview of previous/ongoing pandemics

Pandemic      Cases        Deaths      CFR    R0    Countries  Reported
COVID-19      61,800,000   1,450,000   2.5    2.1   214        2019
MERS          2,519        86          35.0   2.7   27         2012
SARS          8,098        774         11.0   2.7   29         2002
Zika          711,381      18          8.3    3.0   87         2015
HIV/AIDS      65,000,000   25,000,000  -1     -1    214        1981
Spanish flu   500,000,000  60,000,000  5.0    -1    214        1918
Ebola         28,646       11,323      50.0   2.0   10         1976
H1N1          491,382      18,449      0.03   1.75  214        2009
Nipah         19           17          -1     -1    1          2018
H3N2          -1           2,000,000   -1     1.8   214        1968
6th Cholera   -1           1,500,000   -1     -1    214        1899
H2N2          1,100,000    1,100,000   0.67   -1    214        1957
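A sketch of how such a table can be assembled in pandas, using the values above with -1 marking missing data; the actual code in the Comparison with other pandemics notebook may differ.

import pandas as pd

columns = ["pandemic", "cases", "deaths", "cfr", "r0", "countries", "reported"]
rows = [
    ["COVID-19",     61800000,  1450000,  2.5,  2.1, 214, 2019],
    ["MERS",             2519,       86, 35.0,  2.7,  27, 2012],
    ["SARS",             8098,      774, 11.0,  2.7,  29, 2002],
    ["Zika",           711381,       18,  8.3,  3.0,  87, 2015],
    ["HIV/AIDS",     65000000, 25000000,   -1,   -1, 214, 1981],
    ["Spanish flu", 500000000, 60000000,  5.0,   -1, 214, 1918],
    ["Ebola",           28646,    11323, 50.0,  2.0,  10, 1976],
    ["H1N1",           491382,    18449, 0.03, 1.75, 214, 2009],
    ["Nipah",              19,       17,   -1,   -1,   1, 2018],
    ["H3N2",               -1,  2000000,   -1,  1.8, 214, 1968],
    ["6th Cholera",        -1,  1500000,   -1,   -1, 214, 1899],
    ["H2N2",          1100000,  1100000, 0.67,   -1, 214, 1957],
]
pandemics = pd.DataFrame(rows, columns=columns)

# -1 is only a placeholder for missing data; replace it with a proper missing value.
pandemics = pandemics.replace(-1, pd.NA)
print(pandemics)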


Case Fatality Rate

Case fatality rate is the proportion of people who die from a specified disease among all individuals diagnosed with the disease over a certain period of time.[57]

While this is the statistic most commonly used, especially when someone wants to estimate how "dangerous" a pandemic may be, it is not ideal. A better and more precise statistic would be the infection fatality rate: the number we get when we divide the number of people who died from the disease by the total number of infections. But this is almost impossible to obtain, because we cannot know the total number of infections - it would require knowing, for every single person, whether they have had the disease or not. It is especially difficult for COVID-19, even in modern times, because many people have the disease with no symptoms. There are two main reasons why the case fatality rate needs to be interpreted carefully. This is how the CFR is calculated:

CFR = Number of deaths from disease / Number of diagnosed cases of disease

As explained above, this statistic greatly depends on the denominator. If there is not enough testing to catch all the mild cases or cases with no symptoms then it can make it appear as if the CFR is higher. Since countries have different approaches to testing it is difficult to compare countries using CFR.

The second reason is that while the CFR shows the percentage of diagnosed people who die from the infection, it does not reflect any other severe outcomes. Unfortunately, there is currently no statistic that would include all of the people who, while they have not died, have had severe long-lasting symptoms. Such a statistic could further help the general public understand the seriousness of this pandemic.

Figure 3.1: Graph showing CFR values for different pandemics.


R0

R0 tells you the average number of people who will contract a contagious disease from one person with that disease.

In order to calculate the reproduction rate, scientists have to work backwards: they look at all of the infections up until a point in time and derive the current reproduction rate from them. This number is not fixed for the entire duration of the pandemic. It can change as our behaviour changes, as our immunity develops or as time progresses. Government restrictions, social distancing and isolation can help bring this number down. While the reproduction rate matters and helps us understand and predict how many people will potentially get infected in the upcoming period, it is not the only statistic that matters.

Figure 3.2: Graph showing R0 values for different pandemics.


4. Case study

4.1 Business understanding

4.1.1 Understanding the problem

We have a large amount of data that could help us better understand this pandemic: understand its progress, whether stricter measures lead to better containment, how different countries are affected, whether a country's economic status makes a significant difference, and much more. We will try to tackle the following problem.

The problem is clustering the countries. This can help us understand the similarities between countries that are performing better (fewer cases, fewer deaths) and see whether we can isolate some of the leading factors; these factors could then be used as guidelines for everyone else. We want to take a look at all of the attributes that we have (life expectancy, number of hospital beds, GDP per capita) and see whether there is any relation to the total number of cases and deaths. We want to examine whether countries that are close to each other, or perhaps countries on the same continent, have similar experiences, and whether there are certain factors that we can use to group these countries together.

There are a few more insights that the work on COVID-19 data can bring, namely assessing whether the amount and the quality of the data we are collecting is sufficient. We can also advise on standardization of the collected data, on collecting as much data as possible, and suggest implementing guidelines that would oblige country officials to collect and share such data responsibly, all for the purpose of enabling scientists to get a better understanding of the pandemic and increasing the chances of slowing it down and eventually stopping it.

4.1.2 Establishing goals

We will create a model whose goal is to group similar countries together based on the different attributes we have. Since the data changes every day, we will create numerous models for different dates to see whether there is any pattern we can confirm. Our goal will be to display the models created on the most recent date, since these should hold the most information and provide us with the most insight. This could help us potentially understand why some countries do better. It may be challenging due to the fact that the data is constantly changing and there are many factors that are difficult to account for.

The goal is to create clustering models for both the entire world and Europe. Another goal is to visualize the results and make comparisons between different clusters and some of their attributes. After creating clusters at different levels of granularity we will try to give a possible explanation of them. In each case we will look at whether there is a lesson to be learnt and what it could mean for the future development of the pandemic.


4.2 Data understanding

4.2.1 Data Collection

The COVID-19 dataset we will be working with is a collection of COVID-19 data maintained by Our World in Data. Our World in Data is a collaborative effort between researchers at the University of Oxford and the non-profit organization Global Change Data Lab. Their goal is to "share the research and data to make progress against the world's largest problems".[58]

This data has been collected, aggregated, and documented by Cameron Appel, Diana Beltekian, Daniel Gavrilov, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Edouard Mathieu, Esteban Ortiz-Ospina, Hannah Ritchie, Max Roser.

The data is gathered from multiple data sources. Confirmed cases and deaths, as well as hospitalizations and intensive care unit (ICU) admissions, come from the European Centre for Disease Prevention and Control (ECDC). Data regarding the testing for COVID-19 comes from official national government reports and it is collected by the Our World in Data team.

The rest of the variables come from a variety of sources, such as the United Nations, World Bank, Global Burden of Disease, Blavatnik School of Government, OECD, Eurostat, etc.[59]

There are a total of 48 columns and in the figures below we can see the column name, description of the variable as well as the source from which it was obtained.

Figure 4.1: Table showing the dataset information - part 1.


Figure 4.2: Table showing the dataset information - part 2.

Figure 4.3: Table showing the dataset information - part 3.
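A sketch of loading and inspecting this dataset with pandas is given below; the CSV address is the public URL under which Our World in Data published the dataset at the time of writing and is included here as an assumption.

import pandas as pd

URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"  # public OWID CSV (assumed)

covid = pd.read_csv(URL, parse_dates=["date"])
print(covid.shape)             # rows x columns (48 columns at the time of writing)
print(covid.columns.tolist())  # names of all variables
print(covid[["location", "date", "total_cases", "total_deaths"]].tail())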

4.2.2 Data Description

The variables we have can be split into four segments for better understanding. The first segment consists of variables regarding the number of new and total cases and deaths. The second segment consists of variables regarding the number of hospitalized and intensive care unit patients. The third segment consists of variables regarding testing. The fourth segment consists of socio-demographic variables giving a better overview of each location.


Stringency index

One interesting variable that we have is the COVID-19 Government Response Stringency Index, in our data named stringency_index. It is used as a baseline measure of variation in governments' responses. The measure is an additive score of the following indicators: school closing, workplace closing, cancellation of public events, restrictions on gathering size, closing of public transport, stay at home requirements, restrictions on internal movement and restrictions on international travel. The indicators are measured on ordinal scales and rescaled to vary from 0 to 100, with 100 being the strictest response.

It is necessary to point out that this index does not reflect the suitability of the government’s response, only the strictness of measures. The data is collected by academics and students from all over the world using various publicly available sources such as news articles and government reports. This study is led by the Blavatnik School of Government from the University of Oxford. At the moment of writing it contains data from more than 180 countries.

4.3 Data Preparation

Data preparation consists of all activities that are done in order to create the final dataset that we will use to create our models. This may include removing outliers, adding missing data, transforming the data and formatting raw data. Data preparation is considered to be one of the most time consuming parts of the project. It is necessary to spend time preparing the data in order to alleviate any biases that could influence our models.

4.3.1 Missing data

One of the first things we will look at is any missing data we may have. Where possible we will fill in data that we can find ourselves; this will help us get a more complete look at the world. Using the Missingno library we will create a bar chart that shows us where we have the most missing data.

As we can see, this data is far from complete. We do have a lot of data, but to truly be able to look at the entire world we are missing some critical data. The figure below lists the variables that have more than 50% of their data missing, along with the respective percentages. As we can see, some of them have almost all of their data missing.


Figure 4.4: Table showing the missing data.

Figure 4.5: Table showing the missing data percentages.


Most of the missing data concerns hospitalizations and testing, and some of it concerns the socio-economic variables. During the research we tried to find the hospitalization and testing data on government websites. This is also where we came to understand why the researchers creating the original dataset did not include some of the data. Even when the data exists, it is not easily accessible: it is often published only as text on a website (as opposed to a downloadable CSV or other data-friendly file format), and typically only in the native language. While this is true for testing, most countries provide no hospitalization data publicly at all.

Finding some of the missing socio-economic variables was possible; below is a description of those we have found and filled in.

Human development index

The human development index is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living.[59] The index is calculated by the United Nations Development Programme (UNDP). For a number of reasons, the main one being inaccessibility of the data, the UNDP does not publish an HDI value for a number of countries. Because of this we have resorted to the paper Filling gaps in the human development index: Findings for Asia and the Pacific, which extends the number of countries with a calculated HDI from 177 to over 230 and includes smaller countries that are typically omitted by the UNDP. While these figures are not the official UNDP values, they are widely accepted because they are calculated in a similar manner. This enables us to better assess the human development situation for a number of countries in our dataset that would otherwise have missing values. Thanks to this report we have been able to fill in the values for the following countries: San Marino, Anguilla, Aruba, Bermuda, British Virgin Islands, Cayman Islands, Falkland Islands, French Polynesia, Gibraltar, Jersey, Kosovo, Monaco, Guam, Greenland, Isle of Man, Montserrat, New Caledonia, Somalia, Guernsey, Puerto Rico, Taiwan, Northern Mariana Islands, Turks and Caicos Islands, United States Virgin Islands, Wallis and Futuna, Curacao and Sint Maarten (Dutch part).
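A minimal sketch of how such values can be filled in with pandas is shown below; the column names follow the dataset, while the file name and the HDI numbers are placeholders for illustration, not the figures reported in the cited paper.

```python
import pandas as pd

df = pd.read_csv("owid-covid-data.csv")  # file name assumed

# Placeholder HDI values for a few of the locations that were missing them;
# the real numbers were taken from the report "Filling gaps in the human
# development index: Findings for Asia and the Pacific".
hdi_fill = {
    "San Marino": 0.85,  # illustrative value
    "Kosovo": 0.80,      # illustrative value
    "Taiwan": 0.90,      # illustrative value
}

for location, hdi in hdi_fill.items():
    df.loc[df["location"] == location, "human_development_index"] = hdi
```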

Life expectancy

This variable refers to life expectancy at birth in 2019. Using data from the research paper Filling gaps in the human development index: Findings for Asia and the Pacific, we were also able to fill in the missing values of life_expectancy for the following countries: Kosovo, Guernsey and Jersey.


Gross Domestic Product per capita

The variable gdp_per_capita is the Gross Domestic Product per capita adjusted by purchasing power parity (PPP). The missing values were found in The World Factbook, which is prepared by the Central Intelligence Agency for the use of US Government officials and contains various information on history, economics, demographics etc. It covers 267 world entities, with data collected by various organizations and institutions all over the world. Data was filled in for the following countries: Andorra, Anguilla, Guam, Greenland, French Polynesia, Gibraltar, Curacao, Falkland Islands, Faeroe Islands, British Virgin Islands, Guernsey, Isle of Man, New Caledonia, Northern Mariana Islands, Taiwan, Montserrat, Jersey, Turks and Caicos Islands, United States Virgin Islands, Wallis and Futuna and Western Sahara.

Hospital beds per thousand

Missing data was added from different sources, the main one being The World Factbook.

Other data sources include: the Health Policy Institute[60] (a private think tank focused on health policy and health economics in Central and Eastern European countries), the Pan American Health Organization[61] (the specialized international health agency for the Americas) and the Falkland Islands Government Health Service.[62] Data was added for the following countries: Taiwan, Andorra, Maldives, Greenland, Namibia, Faeroe Islands, Senegal, Angola, Palestine, Chad, Congo, Mauritania, Rwanda, Sierra Leone, Nigeria, Puerto Rico, Cote d’Ivoire, South Sudan, Kosovo, Lesotho, Papua New Guinea, Aruba, British Virgin Islands, Falkland Islands, Wallis and Futuna, Guinea-Bissau.

Extreme poverty

This is the share of the population living in extreme poverty, which refers to an income below the international poverty line of $1.90 per day, set by the World Bank.[62]

Missing data was filled in for the following countries, with respective sources: Afghanistan, Azerbaijan (Asian Development Bank[63]), Belarus, Swaziland, Saudi Arabia (World Bank[64]), Angola (The Centre for Scientific Studies and Research of the Catholic University of Angola[63]), Singapore (statistics provided by the Ministry of Social and Family Affairs in Singapore), Slovenia, Barbados, Brunei, Andorra (The Borgen Project[65]), Somalia (Save the Children[66]), Antigua and Barbuda (PAHO[61]), Suriname, Trinidad and Tobago (Human Development Report 2019), Venezuela (The 2019–2020 National Survey of Living Conditions, Andrés Bello Catholic University in Caracas), Aruba (Central Bureau of Statistics Aruba).

The data for the rest of the countries was filled in from The World Factbook.

Population density

Population density measures the number of people divided by the land area. Values were filled in for the following countries: Anguilla, Bonaire Sint Eustatius and Saba, Falkland Islands, Wallis and Futuna, Western Sahara, Taiwan, Syria, South Sudan and Montserrat. The data source is Worldometer, which processes data collected by the United Nations Population Division. The data source for Guernsey, the Vatican and Jersey is Wikipedia, where the density is calculated from population and land-area data.

The code we used to check for missing data and to add the missing values can be found in the Jupyter Notebook titled Data Cleaning and Preparation. In total, over 150 missing values were added to our dataset. This is the new dataset that we will continue to use from now on.
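As a simple sanity check, the difference in missing counts before and after the preparation step can be computed as in the sketch below; both file names are assumptions for illustration.

```python
import pandas as pd

# Original and cleaned datasets (file names assumed for illustration).
before = pd.read_csv("owid-covid-data.csv")
after = pd.read_csv("owid-covid-data-cleaned.csv")

# Number of missing values per column before and after filling.
filled = before.isna().sum() - after.isna().sum()

print(filled[filled > 0])                      # columns where values were added
print("Total values filled:", int(filled.sum()))
```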

4.4 Data visualizations

Data visualization is typically one of the first things done during the data understanding phase, but since we were aware that a lot of data was missing, we decided to recreate the visualizations after the data preparation and present them here. This gives us an interesting overview of the data that we have. We will not include all of the visualizations we have made, since not all of them are of crucial interest. We have created many visualizations looking at the total numbers of cases, deaths and tests, using different attributes to understand the social, demographic and economic factors in each country.

4.4.1 Cases and deaths

The code for the following visualizations can be found in the Jupyter Notebook titled Visualizations of the data (cases and deaths). They are split into those relating to the world and those relating to Europe. Below are the ones we have found to be the most interesting; the rest can be seen in the same notebook.
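A minimal sketch of how one of these bar charts could be produced with pandas and matplotlib is given below: it selects the countries with the most total cases on a given date and colours the bars by life expectancy. The column names are assumed to follow the OWID dataset (location, date, total_cases, life_expectancy, continent), and the file name and colour map are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("owid-covid-data-cleaned.csv")  # file name assumed

# Take a single snapshot date and drop aggregate rows such as "World",
# which (if present) have no continent value.
snapshot = df[df["date"] == "2020-10-07"].dropna(subset=["continent"])

# Ten locations with the most total cases on that date.
top10 = snapshot.nlargest(10, "total_cases")

# Colour the bars by the life expectancy of each country.
colours = plt.cm.viridis(top10["life_expectancy"] / top10["life_expectancy"].max())

plt.figure(figsize=(10, 5))
plt.bar(top10["location"], top10["total_cases"], color=colours)
plt.ylabel("Total cases")
plt.title("Countries with most cases (07.10.2020), coloured by life expectancy")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```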

In the figure below we are looking at the countries with the absolute highest numbers of cases in relation to the life expectancy in the given country. No clear pattern emerges from that, but the United States, India and Brazil are among the ten most populated countries in the world, which makes it understandable that they have the highest absolute numbers of cases.


Figure 4.6: World: countries with most cases (total cases per country). Date: 07.10.2020.

Figure 4.7: World: countries with most cases (total cases per million). Date: 07.10.2020.

The figure above shows that all of the countries with the highest number of cases per million at the time have a low number of hospital beds per thousand, with the exception of the Czech Republic.

This has been one of the biggest issues during the pandemic. Scientists have been warning about the dangers of a potential pandemic, and one of the recommendations had always been to improve the capacity of medical institutions by increasing the number of hospital beds, the stock of supplies with a long shelf life and the number of medical personnel. Had such a threat been taken more seriously in advance, there could have been less stress about where to place all of the patients. Tracking the number of beds and their capacity can also help countries sign agreements to help each other by transferring patients and providing medical care in their hospitals.

The figure below shows that all of the countries with the most total deaths per million have a low GDP per capita, with the exception of the United States and San Marino. A low GDP per capita is typical of poorer, less developed countries that also tend to have lacking infrastructure. This can indicate that these countries possibly did not have a good healthcare system due to a lack of finances, which led to an inability to hospitalize everyone who needed it. One possible explanation for San Marino is that it is practically a small enclave of about 30,000 citizens in a severely hit part of Italy; additionally, its older population could have contributed slightly as well.

Figure 4.8: World: countries with most deaths (total deaths per million). Date: 07.10.2020.

One important thing to note for the figure below is that the Vatican has a value of 0 for the human_development_index variable due to the specificity of the country, but it is widely accepted that if the human development index were calculated for the Vatican, it would be close to 1. This would mean that all of the countries with the most total cases per million have a very high human development index. It could also indicate that there is more movement in these countries and therefore an easier spread of the virus.

The following three figures were added for a better understanding and an easier comparison with the figures created for Europe.


Figure 4.9: World - Countries with most deaths - Total deaths per million per country. Date: 07.10.2020.

Figure 4.10: World - Countries with most cases - Total cases per country. Date: 07.10.2020.


Figure 4.11: World - Countries with most cases - Total cases per million per country. Date: 07.10.2020.

Figure 4.12: Europe: countries with most cases (total cases per million). Date: 07.10.2020.


Figure 4.13: Europe: countries with most deaths (total deaths per million). Date: 07.10.2020.

4.4.2 Testing

Testing is very important when fighting a pandemic. First of all, we want to make sure our tests do not produce many false negatives, since this would mean that infectious people could go on with their lives and infect others. Second of all, there has to be a proper testing strategy that catches as many positive cases as possible without slowing down the system due to a lack of manpower or materials to handle larger capacities.

In the chapter Comparison with other pandemics we have already explained what the Case Fatality Rate (CFR) is. That is another reason why it is important not only to know the total testing numbers, but also to understand the different testing strategies in different countries. An example of how the CFR can be misleading without sufficient testing data follows. At the time of writing, San Marino has the most total deaths per million and a CFR of 1.8%, whereas Bosnia and Herzegovina is currently 8th for the most total deaths per million but has a CFR of 3.8%.

This is directly impacted by the number of tests performed. San Marino has done 1.5 million tests per 1 million residents, with 132k cases per million and 29 tests per case. Bosnia and Herzegovina has done 234k tests per 1 million residents, with 49k cases per million and fewer than 5 tests per case.
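Both quantities can be computed directly from cumulative counts, as in the small sketch below; the numbers used are illustrative placeholders, not the exact figures for any particular country.

```python
# Case fatality rate and tests per case from cumulative counts.
def case_fatality_rate(total_deaths: int, total_cases: int) -> float:
    """Share of confirmed cases that ended in death, in percent."""
    return 100 * total_deaths / total_cases

def tests_per_case(total_tests: int, total_cases: int) -> float:
    """How many tests were performed for each confirmed case."""
    return total_tests / total_cases

# Illustrative numbers only.
print(round(case_fatality_rate(total_deaths=90, total_cases=5000), 1))  # 1.8
print(round(tests_per_case(total_tests=145000, total_cases=5000), 1))   # 29.0
```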

In the figures below we see that the countries with the most total tests per million are predominantly countries with a higher human development index. This is understandable, as those are the countries that typically try to provide the healthiest and longest lives for their citizens.


Figure 4.14: Europe: countries with most tests (total tests per country). Date: 07.10.2020.

Figure 4.15: Europe: countries with most tests (total tests per million). Date: 07.10.2020.
