5.4 Cluster analysis - summary

Regardless of their socio-demographic and economic characteristics, all of the countries have at some point shown a similar progression of the disease. The main difference between the countries arises because the COVID-19 pandemic spreads in waves, and the timing of the first, second and third waves differs from country to country.

The stringency index is not an ideal indicator: measures take considerable time to show an effect, and governments do not introduce them at the same stage of the pandemic. Some countries add measures gradually before case numbers even start to rise, while others impose a total lockdown only once case numbers rise sharply. Measures also differ between regions within a country, as does their acceptance by the public.
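The delay between a measure and its effect can be illustrated with a lagged correlation. The following is a minimal sketch on synthetic data, not the analysis performed in this thesis: the two series, the 14-day delay built into them and the helper names `pearson` and `best_lag` are assumptions made purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(stringency, growth, max_lag):
    """Return (lag, correlation) for the lag with the most negative correlation.

    For each lag, stringency on day t is compared with growth on day t + lag.
    """
    scores = {}
    for lag in range(max_lag + 1):
        xs = stringency[: len(stringency) - lag]
        ys = growth[lag:]
        scores[lag] = pearson(xs, ys)
    lag = min(scores, key=scores.get)
    return lag, scores[lag]


# Synthetic example (assumed values): measures start on day 30,
# the daily growth rate of cases only drops 14 days later.
days = 90
stringency = [0.0 if t < 30 else 80.0 for t in range(days)]
growth = [0.20 if t < 44 else 0.05 for t in range(days)]

lag, score = best_lag(stringency, growth, max_lag=30)
same_day = pearson(stringency, growth)
print(f"strongest negative correlation at lag {lag} days ({score:.2f})")
print(f"same-day correlation: {same_day:.2f}")
```

On this toy series the strongest negative correlation appears at the delay built into the data, while the same-day comparison gives a noticeably weaker correlation. Real series are noisier, and the lag may differ between countries and waves.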


The main goal of this thesis was to cluster COVID-19 data. We created clustering models for the entire world (with four and with eight clusters) and for Europe. We documented the entire process of building a clustering model, including a description of each part of the code and an explanation of the individual steps taken, the libraries used and the techniques applied. The results were visualized and the resulting patterns were explained.
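As a reminder of what such a model computes, the sketch below groups toy data into a chosen number of clusters. It is a minimal illustration under stated assumptions: this section does not name the algorithm, so k-means (Lloyd's algorithm) is assumed here, the two-dimensional points are invented stand-ins for per-country features, and the function names are hypothetical. It is written in plain Python so that it runs without any library and is not the code used in the thesis.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialisation: start at the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center update
    until the labels stop changing."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Invented toy data: four well-separated groups of three points each.
blobs = [(0, 0), (10, 0), (0, 10), (10, 10)]
offsets = [(0, 0), (1, 0), (0, 1)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

labels, centers = kmeans(points, k=4)
for point, label in zip(points, labels):
    print(point, "-> cluster", label)
```

Changing `k` (for example to 8 on a larger data set) yields a finer partition of the same data. In practice, features measured on different scales would normally be standardized before clustering.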

We also compared COVID-19 with previous pandemics, putting the current pandemic in perspective by describing earlier ones and comparing case fatality rates and reproduction rates. The CRISP-DM methodology was used to structure the workflow. The business understanding phase defined the value and the goals of analyzing COVID-19 data. The data was then cleaned and prepared for modelling, and visualized to give a better understanding of the data and its attributes.

In the theoretical part of the thesis we explained what knowledge discovery in databases is and which tasks it can solve. We examined its applications and the methods it uses. All of the techniques used in the practical part of the thesis were explained in the theoretical part.

When working with any kind of data, it is necessary to examine and understand it carefully. Finding patterns is not always straightforward because of the complexity of the data and of the field under study. Much is still unknown about COVID-19, and as new mutations keep emerging and vaccination is already under way in some parts of the world, the course of the pandemic changes every day.

Even so, the code we have created can be reused to build new models from new data.


