4. Case study

4.3 Data Preparation

Data preparation consists of all the activities needed to create the final dataset that we will use to build our models. This may include removing outliers, filling in missing data, transforming the data and, in the case of raw data, formatting it. Data preparation is considered one of the most time-consuming parts of a project. The time is well spent, because careful preparation reduces biases that could otherwise influence our models.

4.3.1 Missing data

One of the first things we look at is missing data. Where possible, we fill in the values that we can find ourselves, which gives us a more inclusive view of the world. Using the Missingno library we create a bar chart that shows where most of the data is missing.

As we can see, this data is far from complete. We have a lot of data, but some critical data is missing if we want to look at the entire world. The figures below list the variables with more than 50% of their data missing, together with the respective percentages. Some of them have almost all of their data missing.

Figure 4.4: Table showing the missing data.

Figure 4.5: Table showing the missing data percentages.
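The percentages of missing data can be computed directly with pandas. The following is a minimal sketch, not the code from our notebook: the toy data frame, its values and the column names location, total_cases, hosp_patients and total_tests are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset; the column
# names and values are placeholders, not figures from the study.
df = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "total_cases": [10.0, 20.0, np.nan, 40.0],
    "hosp_patients": [np.nan, np.nan, np.nan, 5.0],
    "total_tests": [np.nan, 100.0, np.nan, np.nan],
})

# Share of missing values per column, in percent, largest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns where more than half of the values are missing.
mostly_missing = missing_pct[missing_pct > 50]
print(mostly_missing)

# With Missingno installed, a bar chart of completeness per column
# could be drawn with:
#   import missingno as msno
#   msno.bar(df)
```

Note that the Missingno bar chart shows the number of values present in each column, so the shortest bars mark the variables with the most missing data.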

Most of the missing data concerns hospitalizations and testing, and some of it concerns the socioeconomic variables. During the research we searched government websites for hospitalization and testing data. In doing so we came to understand why the researchers who created the original dataset did not include some of it. Even when the data exists it is not easily accessible: it is typically published as plain text on a website (as opposed to a downloadable CSV or another data-friendly file type), usually only in the native language. While this holds for testing, most countries publish no hospitalization data at all.

We were able to find some of the missing socioeconomic variables. The variables we found and filled in are described below.

Human development index

The human development index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living.[59] It is calculated by the United Nations Development Programme (UNDP). For a number of reasons, the main one being inaccessibility of the data, a UNDP HDI does not exist for some countries. Because of this I have resorted to the paper Filling gaps in the human development index: Findings for Asia and the Pacific. This paper extends the number of countries with a calculated HDI from 177 to over 230. It includes smaller countries, which are typically omitted by the UNDP. While these are not official UNDP figures, the calculations are widely accepted because they are measured in a similar manner. This enables us to better assess the human development situation for a number of countries in our dataset that would otherwise have missing values. Thanks to this paper we have been able to fill in the values for the following countries: San Marino, Anguilla, Aruba, Bermuda, British Virgin Islands, Cayman Islands, Falkland Islands, French Polynesia, Gibraltar, Jersey, Kosovo, Monaco, Guam, Greenland, Isle of Man, Montserrat, New Caledonia, Somalia, Guernsey, Puerto Rico, Taiwan, Northern Mariana Islands, Turks and Caicos Islands, United States Virgin Islands, Wallis and Futuna, Curacao and Sint Maarten (Dutch part).

Life expectancy

This variable refers to life expectancy at birth in 2019. Using data from the research paper Filling gaps in the human development index: Findings for Asia and the Pacific, I was also able to fill in the missing values in life_expectancy for the following countries: Kosovo, Guernsey and Jersey.

Gross Domestic Product per capita

The variable gdp_per_capita is the Gross Domestic Product per capita adjusted for purchasing power parity (PPP). The missing values were found in The World Factbook.

The World Factbook is prepared by the Central Intelligence Agency for the use of US Government officials and contains various information on history, economics, demographics etc. It covers 267 world entities, with information collected by various organizations and institutions all over the world. Data was filled in for the following countries: Andorra, Anguilla, Guam, Greenland, French Polynesia, Gibraltar, Curacao, Falkland Islands, Faeroe Islands, British Virgin Islands, Guernsey, Isle of Man, New Caledonia, Northern Mariana Islands, Taiwan, Montserrat, Jersey, Turks and Caicos Islands, United States Virgin Islands, Wallis and Futuna and Western Sahara.

Hospital beds per thousand

Missing data was added from different sources, the main one being The World Factbook.

Other data sources include: Health Policy Institute[60] (private think tank focused on health policy and health economics in Central and Eastern European countries), The Pan American Health Organization[61] (specialized international health agency for the Americas) and The Falkland Islands Government Health Service.[62] Data was added for the following countries:

Taiwan, Andorra, Maldives, Greenland, Namibia, Faeroe Islands, Senegal, Angola, Palestine, Chad, Congo, Mauritania, Rwanda, Sierra Leone, Nigeria, Puerto Rico, Cote d’Ivoire, South Sudan, Kosovo, Lesotho, Papua New Guinea, Aruba, British Virgin Islands, Falkland Islands, Wallis and Futuna and Guinea-Bissau.

Extreme poverty

This is the share of the population living in extreme poverty, which refers to an income below the international poverty line of $1.90 per day, set by the World Bank.[62]

Missing data was filled in for the following countries, with respective sources: Afghanistan, Azerbaijan (Asian Development Bank[63]), Belarus, Swaziland, Saudi Arabia (World Bank[64]), Angola (The Centre for Scientific Studies and Research of the Catholic University of Angola[63]), Singapore (Statistics provided by the Ministry of Social and Family Affairs in Singapore), Slovenia, Barbados, Brunei, Andorra (The Borgen Project[65]), Somalia (Save the Children[66]), Antigua and Barbuda (PAHO[61]), Suriname, Trinidad and Tobago (Human Development Report 2019), Venezuela (The 2019–2020 National Survey of Living Conditions, Andrés Bello Catholic University in Caracas), Aruba (Central Bureau of Statistics Aruba).

The data for the rest of the countries was filled in from The World Factbook.

Population density

Population density is the number of people divided by the land area. Values were filled in for the following countries: Anguilla, Bonaire Sint Eustatius and Saba, Falkland Islands, Wallis and Futuna, Western Sahara, Taiwan, Syria, South Sudan and Montserrat. The data source is Worldometer, which processes data collected by the United Nations Population Division. The data source for Guernsey, Vatican and Jersey is Wikipedia, where the density is calculated from population and land area data.
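The density calculation itself is a simple ratio. The sketch below illustrates it under the assumption that land area is given in square kilometres; the function name and the input figures are hypothetical and do not come from any of the sources above.

```python
def population_density(population: float, land_area_km2: float) -> float:
    """Return the number of people per square kilometre of land area."""
    if land_area_km2 <= 0:
        raise ValueError("land area must be positive")
    return population / land_area_km2


# Hypothetical territory: 50,000 inhabitants on 125 square kilometres.
density = population_density(50_000, 125.0)
print(density)  # 400.0
```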

The code we used to check for missing data and to add the missing values can be found in the Jupyter Notebook titled Data Cleaning and Preparation. In total, over 150 missing values were added to our dataset. This is the new dataset that we will use from now on.
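As an illustration of the filling step, the following minimal sketch shows one way manually collected values could be written into the gaps of a pandas data frame. It is not the code from the notebook: the location column, the country labels and all numeric values are hypothetical placeholders, and only the variable names gdp_per_capita and life_expectancy are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the values are placeholders, not real figures.
df = pd.DataFrame({
    "location": ["Country A", "Country B", "Country C"],
    "gdp_per_capita": [12000.0, np.nan, np.nan],
    "life_expectancy": [np.nan, 71.5, 80.2],
})

# Manually collected values, keyed by variable and then by country.
# These numbers are illustrative only.
manual_values = {
    "gdp_per_capita": {"Country B": 30000.0, "Country C": 45000.0},
    "life_expectancy": {"Country A": 78.0},
}

missing_before = int(df.isna().sum().sum())

for column, by_country in manual_values.items():
    # Map each row's country to its collected value (NaN if none exists),
    # then fill only the cells that are currently missing.
    lookup = df["location"].map(by_country)
    df[column] = df[column].fillna(lookup)

missing_after = int(df.isna().sum().sum())
filled = missing_before - missing_after
print(f"Filled {filled} missing values")
```

Using fillna rather than direct assignment means that values already present in the original dataset are never overwritten, and counting the missing cells before and after gives the number of values added.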