
Big Data Ecosystem

Roman Hanzlík

Bachelor's Thesis

2019


 I understand that by submitting my Diploma thesis, I agree to the publication of my work according to Law No. 111/1998, Coll., On Universities and on changes and amendments to other acts (e.g. the Universities Act), as amended by subsequent legislation, without regard to the results of the defense of the thesis.

 I understand that my Diploma Thesis will be stored electronically in the university information system and be made available for on-site inspection, and that a copy of the Diploma/Thesis will be stored in the Reference Library of the Faculty of Applied Informatics, Tomas Bata University in Zlín, and that a copy shall be deposited with my Supervisor.

 I am aware of the fact that my Diploma Thesis is fully covered by Act No. 121/2000 Coll. On Copyright, and Rights Related to Copyright, as amended by some other laws (e.g. the Copyright Act), as amended by subsequent legislation; and especially, by §35, Para. 3.

 I understand that, according to §60, Para. 1 of the Copyright Act, TBU in Zlín has the right to conclude licensing agreements relating to the use of scholastic work within the full extent of §12, Para. 4, of the Copyright Act.

 I understand that, according to §60, Para. 2, and Para. 3, of the Copyright Act, I may use my work - Diploma Thesis, or grant a license for its use, only if permitted by the licensing agreement concluded between myself and Tomas Bata University in Zlín with a view to the fact that Tomas Bata University in Zlín must be compensated for any reasonable contribution to covering such expenses/costs as invested by them in the creation of the thesis (up until the full actual amount) shall also be a subject of this licensing agreement.

 I understand that, should the elaboration of the Diploma Thesis include the use of software provided by Tomas Bata University in Zlín or other such entities strictly for study and research purposes (i.e. only for non-commercial use), the results of my Diploma Thesis cannot be used for commercial purposes.

 I understand that, if the output of my Diploma Thesis is any software product(s), this/these shall equally be considered as part of the thesis, as well as any source codes, or files from which the project is composed. Not submitting any part of this/these component(s) may be a reason for the non-defense of my thesis.

I herewith declare that:

 I have worked on my thesis alone and duly cited any literature I have used. In the case of the publication of the results of my thesis, I shall be listed as co-author.

 That the submitted version of the thesis and its electronic version uploaded to IS/STAG are both identical.

In Zlín; dated: May 15, 2019 Roman Hanzlík v. r.

...

Student's Signature

The aim of this bachelor's thesis is to introduce Big Data technologies and their possible use in practice. The main goal of the theoretical part is to provide an overview of the area of Big Data and NoSQL databases for beginners. The practical part presents Apache Hadoop and its related projects in detail. The thesis includes procedures for installing the given technologies, together with examples and usage scenarios.

Keywords: Big Data, NoSQL, CAP Theorem, Apache Hadoop, HDFS, YARN, MapReduce, Apache Hadoop Ecosystem, Spark, Pig, Hive, Sqoop

ABSTRACT

This bachelor's thesis presents Big Data technologies and their possible real-world use cases and applications. The main goal of this thesis is to provide a first introduction to Big Data and NoSQL databases for newcomers. In the practical part, Apache Hadoop and its surrounding projects are presented in detail. An integral part of this thesis is a light cookbook describing how to install the particular technologies, with functional demo examples of possible use cases and scenarios.

Keywords: Big Data, NoSQL, CAP Theorem, Apache Hadoop, HDFS, YARN, MapReduce, Apache Hadoop Ecosystem, Spark, Pig, Hive, Sqoop


Study the past if you would define the future.

Confucius

Scientia potentia est, sapientiae privilegium.

[Knowledge is power, wisdom privilege.]

Sir Francis Bacon/Roman Hanzlík

Where is the wisdom we have lost in knowledge?

Where is the knowledge we have lost in information?

Thomas Stearns Eliot

INTRODUCTION
STRUCTURE OF THESIS
AIM OF THESIS

I THEORY
1 BIG DATA
1.1 HISTORY OF DATA PROCESSING
1.2 KNOWLEDGE PYRAMID
1.3 FROM RELATIONAL DATABASES TO NOSQL
1.4 MOTIVATIONS TO USE NOSQL DATABASES
1.5 DEFINITION OF BIG DATA
1.6 HOW BIG IS BIG DATA
1.7 THE V'S OF BIG DATA
1.8 DATA SCIENCE VS. BIG DATA VS. DATA ANALYTICS
1.9 APPLICATION OF BIG DATA
2 NOSQL
2.1 INTRODUCTION
2.2 DISTRIBUTION
2.3 CONSISTENCY
2.4 CAP THEOREM
2.5 CLASSIFICATION OF DATABASE MANAGEMENT SYSTEMS
2.6 KEY VALUE DATABASES
2.7 DOCUMENT DATABASES
2.8 COLUMN BASE DATABASES
2.9 GRAPH DATABASES
2.10 MULTI-MODEL DATABASES

II ANALYSIS
3 APACHE HADOOP ECOSYSTEM
3.1 APACHE HADOOP
3.1.1 Overview of History and Release Versioning
3.1.2 Distribution
3.1.3 Core Components
3.1.4 Deploying Hadoop
3.1.4.1 Java Installation
3.1.4.2 SSH Configuration
3.1.4.3 Hadoop Installation
3.1.4.4 Setup Hadoop Configuration Files
3.1.4.5 Format Namenode
3.1.4.6 Start Hadoop Cluster
3.1.4.7 Test Hadoop Cluster
3.1.5 HDFS
3.1.5.1 HDFS Architecture
3.1.5.2 Replication
3.1.6 YARN
3.1.7 MapReduce
3.1.7.1 Examples of MapReduce usage
3.1.8 Summary
3.2 GENERAL DATA PROCESSING
3.2.1 Apache Spark
3.2.1.1 Spark Introduction
3.2.1.2 Deploying Spark
3.2.1.3 Examples of Spark usage
3.2.1.4 Summary
3.3 DATA ANALYTICS
3.3.1 Apache Pig
3.3.1.1 Pig Introduction
3.3.1.2 Deploying Pig
3.3.1.3 Examples of Pig usage
3.3.1.4 Summary
3.3.2 Apache Hive
3.3.2.1 Hive Introduction
3.3.2.2 Deploying Hive
3.3.2.3 Examples of Hive usage
3.3.2.4 Summary
3.4 DATA INGESTION
3.4.1 Apache Sqoop
3.4.1.1 Sqoop Introduction
3.4.1.2 Deploying Sqoop
3.4.1.3 Examples of Sqoop usage
3.4.1.4 Summary

CONCLUSION
BIBLIOGRAPHY
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
LIST OF SOFTWARE

INTRODUCTION

Look around and tell me what you see. One person might say "Cities, fields, mountains ...", another "The twittering of birds, the noise of rivers, the hastiness of people ..." And I? I see data all around. Data in the form of temperature, wind and humidity when I wake up and open the window. I see stock market prices in the morning news when I switch the TV on. I see traffic data during my commute to the office. I see the data behind the attendance system once I check in. I see data in surveillance cameras. I see data in Twitter and Facebook every time I connect. I see ... I see data everywhere. Having worked with data for almost 25 years, I just DO see them.

When I was a kid, I remember making tables of ski race results and collecting them; at that time, just collecting. When I got older, I made tables of soccer match scores for most of the national leagues and tried to predict the results of upcoming matches. And when I left high school and went into business, I surrounded myself with data, or better said, data surrounded me. It started as a hobby and turned into a passion, a passion for data. Even now, when I look at you, I see data I can utilize. Data are like time; they have been with us from the very beginning. Data flow with time like inseparable twins, both getting older, one faster than the other. One piece of data is old and useless the very next second; another can stay valuable for a long time, even forever.

There were four milestones in human history that dramatically changed the amount of data being recorded: first, the beginning of language and writing around 6000 BC; second, Gutenberg's printing press in the middle of the 15th century; third, the invention of computers in the 1940s; and finally, fourth, the Internet and its surrounding technologies. At each of these evolutionary steps, the amount of data increased dramatically. The first record of data being reliably stored in a mechanized medium was on paper tape in 1846 by Alexander Bain, the inventor of the electric printing telegraph [1]. Since then, technological innovation has accelerated data growth year after year. Not only the volume, but also the velocity and variety of data have increased exponentially in the last couple of years; traditional approaches are no longer able to fulfil the expectations of businesses or governments, everything is getting bigger and faster, and new technologies must come on board. The volume of data is discussed in depth from a historical perspective in a later chapter.

Structure of Thesis

This thesis consists of two parts. In the first part, the theoretical aspects of Big Data technologies and NoSQL databases are presented together with their possible real-world use cases and applications. In the second, practical part, the Apache Hadoop Ecosystem is presented and explained in detail, with a practical installation and configuration cookbook and demo examples of how to use the individual technologies.

Figure 1 – The structure of the Bachelor’s thesis

The theoretical part first gives an overview of data history and information processing and presents the factors that led to the need for this new approach; it continues with the definition of Big Data and the V's of Big Data. Next, in the second section of the theoretical part, NoSQL databases are presented together with the CAP theorem, and the most popular NoSQL databases of each type, such as key-value stores, wide-column stores, document databases and graph databases, are listed.

The practical part starts with a description of the big picture of the Apache Hadoop Ecosystem and continues with a detailed presentation of Apache Hadoop itself. The main principles of Hadoop are described: distributed storage (HDFS), resource management (YARN) and data processing (MapReduce). Step-by-step installation and configuration instructions are provided, as well as for the other chosen technologies from the surrounding ecosystem. For each technology there is a set of demo examples with functional lines of code to demonstrate its usage.

Finally, the conclusion summarizes this work. A bibliography, lists of figures and tables, and a list of the software used and its versions are an integral part of this work.

Aim of Thesis

The primary aim of this thesis is to extend my current knowledge of the area around data warehousing, Business Intelligence and data mining to the new emerging technologies of Big Data, so that I can provide broader knowledge and deeper expertise to my customers in my consulting practice.

The next, equally important goal of this thesis is to provide a first introduction to Big Data and NoSQL databases to the students of Tomas Bata University in the new course "Advanced Database Systems", starting in 2019/2020, together with a light cookbook describing how to install the particular technologies, with examples of possible use cases and scenarios.

Last but definitely not least, the goal is to build a technology background that I can continue to build on in Master's and/or Doctoral studies in the area of Data Science.


I. THEORY


1 BIG DATA

1.1 History of data processing

From the very beginning (2.8 million years ago), the genus Homo {"human being"} has differed from all other species around it. Homo evolved through Homo habilis (2.3 to 1.4 million years ago), Homo erectus (1.8 million to 30,000 years ago) and Homo neanderthalensis (400,000 to 40,000 years ago) into the current version of us: Homo sapiens, specifically Homo sapiens sapiens {"wise man"} (280,000 years ago until now). What distinguishes this genus from the others? Most species transfer their knowledge to their descendants through learning from their parents. Humans transfer their knowledge via teaching, and among the few other species that also teach their descendants, humans differ in the precision of this transfer. The next development, and definitely the one that gave humans the biggest advantage, was the gradual ability to speak. It is estimated that around 100,000 years ago Homo started using language: first basic interjections of pain or happiness, later nouns for things in the surrounding nature, and finally verbs for daily activities.

Research [2] indicates that some Homo species had the ability to produce speech sounds that overlap with the range of speech sounds of modern humans, and that species such as Neanderthals possessed genes that play a role in human language. The next big milestone came when humans started to record information in the form of pictures. The oldest cave paintings are dated to about 30,000 years ago. The next logical step (for us these days, not at that time) was the evolution of pictures into symbols, which can be called the first writing systems, used mostly for recording and later also for communication. Writing was preceded by pictograms and ideograms. The best-known examples are the Jiahu symbols carved on tortoise shells in Jiahu (6600 BC) and the Vinča symbols, sometimes called the Danube script (Tărtăria tablets, c. 5300 BC).

The invention of the first writing systems is roughly dated to the late 4th millennium BC. The Sumerian cuneiform script and the Egyptian hieroglyphs are generally considered the earliest writing systems (c. 3400 to 3200 BC). It is generally agreed that Sumerian writing was an independent invention; however, it is debated whether Egyptian writing was developed completely independently of Sumerian or was a case of cultural diffusion. A similar debate exists for the Chinese script, developed around 1200 BC. So-called modern language appeared in the archaeological record only recently, with the advent of lexicographic writing around 5,400 years ago [3].

The oldest record of counting dates from around 5000 BC, when Sumerian merchants used small clay beads to denote goods for trade. Counting on a larger scale was the privilege of the state: for millennia, governments have tried to keep track of their people by collecting information. The ancient Egyptians conducted censuses, as did the Chinese. Censuses are mentioned in the Old Testament, and the New Testament tells us that a census imposed by Caesar Augustus, "that all the world should be taxed" (Luke 2:1), took Joseph and Mary to Bethlehem, where Jesus was born [4].

With both of these abilities (speaking and writing the same language), the Age of Data began. The ability to express ideas led cultures to collect information and to pass knowledge and wisdom on to the next generations.

1.2 Knowledge Pyramid

To understand Big Data, it is necessary to start by defining terms such as data, information and knowledge, and to describe their context and how they build on each other.

DIKW Pyramid, Knowledge Pyramid, Wisdom Hierarchy and Information Hierarchy are some of the names referring to the popular representation of the relationships between data, information, knowledge and wisdom. Each step up the pyramid, graphically represented in Figure 2, answers questions about the initial data and adds value to it. The more questions we answer, the higher we move up the pyramid. In other words, the more we enrich our data with meaning and context, the more knowledge and insights we get out of it. At the top of the pyramid, we have turned the knowledge and insights into a learning experience that guides our actions.

Figure 2 – Knowledge Pyramid [5]

Data – a collection of facts in a raw or unorganized form. In informatics, text, numbers, pictures, sound, and anything else we can process automatically with machines are understood as data. However, without context, data usually mean little.

Data can be divided into different groups by several criteria, such as their structure, their origin, or what they are intended for. Table 1 shows how data are classified by their structure.

Table 1 – Data by their structure

Group            Description                                            Example
Structured       data in a predefined structure                         relational data
Semi-structured  no predefined structure, but can be processed easily   XML
Unstructured     data with an undefined internal structure              PDF

Structured data – data whose elements are addressable for effective analysis or further processing. The data are organized into a formatted repository (typically a database). This concerns all data that can be stored in a relational database in the form of tables with rows and columns. They have relational keys and can easily be mapped into pre-designed fields. Today, this kind of data is the most processed and the most easily managed. Structured data represent only 5 to 10% of all data in informatics. Examples include metadata (time and date of creation, file size, author, etc.), library catalogues (date, author, place, subject, etc.), census records (birth, income, employment, place, etc.) and economic data (rates, GDP, VAT, etc.).

Semi-structured data – data which do not reside in a relational database but have some organizational properties that make them easier to analyze. With some processing you can store them in a relational database, although for some semi-structured data this can be very hard. Semi-structured data also represent only 5 to 10% of all data in informatics. Examples are personal data stored in XML or JSON files.
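As a small illustration (with hypothetical field names), the following Python snippet parses a semi-structured JSON document; note that the two records do not share the same set of fields, which is exactly what a fixed relational schema would not allow without NULL columns or schema changes.

```python
import json

# Two "person" records with different sets of fields - valid JSON,
# but awkward to force into a single fixed relational schema.
raw = '''
[
  {"name": "Alice", "email": "alice@example.com", "phones": ["+420123456789"]},
  {"name": "Bob", "address": {"city": "Zlin", "country": "CZ"}}
]
'''

people = json.loads(raw)          # parse the semi-structured document
for person in people:
    # .get() tolerates missing fields instead of failing like a rigid schema would
    print(person["name"], person.get("email", "n/a"), person.get("address", {}))
```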

Unstructured data – data that are not organized in a pre-defined manner and do not have a pre-defined data model, and thus are not a good fit for a mainstream relational database. This is one of the reasons why alternative platforms for unstructured data exist. These data increasingly prevail in IT systems and are used by organizations in a variety of business intelligence and analytics applications. Today, more than 80-95% of all data that exist globally are estimated to be unstructured. Examples include user-generated content from social media (e.g. Facebook, Twitter, Instagram and Tumblr), images, videos, surveillance data, sensor data, call center information, geolocation data, weather data, economic data, government data and reports, research, Internet search trends and web log files [6] [7].

Another way to distinguish data is by their origin; they are divided into external/internal and primary/secondary, as listed in Table 2.

Table 2 – Data by their origin

Group      Description                                           Example
External   data from the public sector or external companies     government data
Internal   data originating internally                            orders, transactions
Primary    data originating in the system                         customers in CRM
Secondary  data taken from other systems                          ERP sales reports

Information – the next building block of the DIKW Pyramid. Information is data that have been "cleaned" of errors and further processed in a way that makes them easier to measure, visualize and analyze for a specific purpose. Depending on this purpose, data processing can involve different operations, such as combining different sets of data (aggregation) or ensuring that the collected data are relevant and accurate (validation). By asking relevant questions about 'who', 'what', 'when', 'where', etc., valuable information can be derived from the data [8].
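To make the step from data to information more concrete, here is a minimal sketch with made-up sensor readings; the thresholds and field names are assumptions for illustration only. It validates raw measurements and aggregates them into a small summary that answers a specific question.

```python
from statistics import mean

# Raw, possibly noisy temperature readings (made-up data)
readings = [21.4, 21.9, None, 22.3, -999.0, 22.8]   # None and -999.0 are sensor errors

# Validation: keep only physically plausible values
valid = [r for r in readings if r is not None and -50.0 <= r <= 60.0]

# Aggregation: turn the cleaned data into information for the question
# "what was the temperature like today?"
summary = {"count": len(valid), "min": min(valid), "max": max(valid), "avg": round(mean(valid), 1)}
print(summary)   # {'count': 4, 'min': 21.4, 'max': 22.8, 'avg': 22.1}
```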

Knowledge – the leap from information to knowledge happens when the question of 'how' is asked. 'How' the pieces of information are connected to other pieces to add more meaning and value to us is knowledge.

Wisdom – knowledge applied in action. We can also say that if data and information are like a look back at the past, knowledge and wisdom are associated with what we do now and what we want to achieve in the future [8].

1.3 From relational databases to NoSQL

To explain the motivation for the emergence of NoSQL databases, it is necessary to go back and look at the traditional approach of building applications on top of the mainstream persistent storage of the time. For the past 30 years, applications have been backed by an RDBMS (Relational Database Management System) with a predefined schema that forces data to conform to a schema on write. Many people still think that they must use an RDBMS for applications, even though the records in their data sets have no relation to one another. Additionally, those databases are optimized for transactional use, and data must be exported for analytics purposes. NoSQL technologies have turned that model on its side to deliver groundbreaking performance improvements [9].

The key impulse for the emergence and wider use of alternative DBMSs (Database Management Systems) was the dynamic growth of global business opportunities and of business models based on providing services and selling goods to millions of customers all around the world. In the last few years, devices such as IoT (Internet of Things) sensors have also become a major source of captured data. With increasing data volumes and numbers of users, it is becoming more difficult and costly to respond by scaling vertically, i.e. by investing in ever larger database servers. Therefore, there is a need to address this problem by scaling horizontally, by running hundreds to thousands of commodity servers, which at the same time must ensure the continued availability of services with adequate data consistency, consistent with the specific strategy, objectives, content and business processes of the enterprise. Google and Amazon were among the first companies to hit these limits due to their activities. Both companies created their own scalable database systems based on retrieving data by key. Bigtable (Google) [10] and Dynamo (Amazon) [11], the database systems described in these papers, are considered the key impulses that inspired and influenced the emergence and further development of NoSQL DBMSs.

Google's Bigtable is a scalable distributed structured data repository run on several thousand commodity servers. Google uses Bigtable for a variety of applications (such as web page indexing and Google Earth) with varying data storage requirements. The stored data range from individual URLs of web content to satellite imagery of the Earth.

The second of the above-mentioned companies, which came up with its own persistent data storage solution based on principles different from relational DBMSs and allowing horizontal scalability, was Amazon. In the second half of the first decade of this century, Amazon was running one of the largest e-commerce platforms, with tens of thousands of servers in several data centers around the world. Such an infrastructure was necessarily associated with server downtime or, in more severe cases, downtime of whole data centers. Even in the event of disruptions to individual servers or parts of the infrastructure, it was necessary to ensure the availability of services, as even the slightest outage could mean serious financial implications and permanently shake customers' trust. For example, it was necessary to ensure that customers still saw the items placed in their shopping cart in the event of any downtime. Amazon addressed these requirements by creating the Dynamo database system, which ensures continued availability of core services. The system works on the key-value principle and extensively uses versioning of objects. Unlike Bigtable, which allows applications to query data by multiple attributes, Dynamo was designed for applications that need simple data access based on a single key, with a primary focus on a high degree of availability, where interruption of the connection to a part of the network infrastructure or a server failure does not reject UPDATE operations [11].

Soon other non-relational database systems began to emerge, generally referred to as "NoSQL databases". Many of them were inspired by the above-mentioned Bigtable and Dynamo projects. One such system is Cassandra, a database developed by Facebook.

NoSQL is not associated with any fixed definition, but the history of its origin can be traced, and its current understanding is defined quite precisely. The term NoSQL was first used in the late 1990s, paradoxically as the name of a relational database for the UNIX operating system dating back to 1998 [12]. However, that has nothing to do with NoSQL databases in the current sense. The new meaning of NoSQL is attributed to the software developer Johan Oskarsson, who held a meeting in 2009 to explore the latest trends in new non-relational database technologies inspired by the already mentioned Bigtable and Dynamo. Oskarsson tried to find a concise and easy-to-remember name for the upcoming event. From the suggestions sent through the #cassandra IRC channel, he chose the name "NoSQL", which quickly took hold, not just as the name of the meeting but, unintentionally, as an umbrella term referring to a whole set of database systems. The vast majority of the professional public agree that the first two letters of the "NoSQL" acronym do not mean "no SQL" but rather "not only SQL". While there is no official definition of this term, it can be said that it denotes a group of databases with certain common attributes: a non-relational model, suitability for clustering, a mostly open-source character, and the absence of a fixed data schema [13].

1.4 Motivations to use NoSQL databases

NoSQL's shared-nothing architecture (a distributed computing architecture in which each node is independent and self-sufficient, with no shared memory or disk storage and no single point of contention across the system) makes it easy to support scale-out deployments, whether on-premises or in the cloud, and to deliver high, nearly linear scalability.

NoSQL is suitable for a variety of scale-out applications, including social networks, customer analytics workloads, real-time reporting, embedded database applications, mobile applications, the Internet of Things (IoT) and gaming applications.

NoSQL supports flexible data models to accommodate any type of data. Unlike relational databases, NoSQL can accommodate structured, semi-structured and unstructured data to support new types of business applications. NoSQL offers a more flexible approach in which the application, rather than the datastore, defines the schema and access paths. NoSQL supports a wide range of new data types, including textual formats such as JSON as well as many other unstructured and semi-structured data types.

NoSQL delivers extreme read and write capabilities for demanding customer applications. Applications with extensive data reads and writes often experience excessive latency and input/output (I/O) bottlenecks, especially when the application has high volumes of both. NoSQL database management systems are very efficient at reading very large amounts of data.

NoSQL simplifies data management for any type of application. Deploying traditional databases takes time and effort, managing them requires specialized database administrators (DBAs), and designing the data structures can take months. NoSQL accelerates deployment through automation, simplified processes, provisioning capabilities and flexible data structures. When using NoSQL, application developers have complete control over data storage and access and do not need a DBA to support the NoSQL data store.

NoSQL lowers data management costs. Many NoSQL solutions are open source, and others sell for much less than a full version of a commercial relational DBMS. Compared with conventional DBMSs, enterprises report that NoSQL products have saved them more than 50% of the cost [14].

1.5 Definition of Big Data

Back in 2013, Dan Ariely likened Big Data to teenage sex: "everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." Since then, many people have tried to define or box in this new phrase, but the market is moving and changing so fast that we still cannot give an exact definition.

I see Big Data as the area within the Information Technology industry that has emerged to analyze, systematically extract information from, or otherwise deal with data sets that are too large or too complex to be handled by traditional data processing application software, in terms of volume and/or time. In the early 2010s there was also a big discussion about whether Big Data was the next big trend and whether companies should take care of it. Now, almost 10 years later, I would modify Dan Ariely's quote from 2013 to something like: "Big Data is like university student sex: everybody talks about it, most of them claim they are doing it regularly, only some of them really know how to do it. No one says he has never heard about it. Almost everybody is looking for someone who has the know-how and real experience with it."

1.6 How big is Big Data

To get a rough estimate of how big Big Data are, first look at Table 3 for some examples of data volumes related to metric prefixes.

Table 3 – Example of data volume

Unit   Value                                        Example
kilo   10^3  = 1 000                                a paragraph of a text document
mega   10^6  = 1 000 000                            a small novel
giga   10^9  = 1 000 000 000                        Beethoven's 5th Symphony
tera   10^12 = 1 000 000 000 000                    all the X-rays in a large hospital
peta   10^15 = 1 000 000 000 000 000                half of all US academic research libraries
exa    10^18 = 1 000 000 000 000 000 000            20% of the words ever spoken by humans
zetta  10^21 = 1 000 000 000 000 000 000 000        grains of sand on all the world's beaches

Most of us can probably imagine the size of a megabyte or gigabyte, and some of us may even have worked with terabytes of data, but a petabyte is, for most people, like the 4th dimension in Euclidean geometry. To get a better idea, it is necessary to look at the volume of data from a different perspective, or to cut one of the dimensions, time, down to a smaller scale. As you can see in Figure 3 (How much data is generated every minute) and Figure 4 (What happens online in 60 seconds), the amount of data generated every minute is enormous.

Figure 3 – How much data is generated every minute [15]

Figure 4 – What happens online in 60 seconds [15]


Figure 5 depicts the relative increase of stored data over time, together with the diversity and number of users creating and using data and the variety of use cases for data. Complexity and cost are two other important factors that continue to increase. This is why the disciplines of Big Data are taking on a key role in the 21st century.

Figure 5 – Data Volume (1800s – 2010) [17]

To get an idea of the volume increase over a long-term horizon, see Figure 6, which shows the data volume over the last 10 years with a prediction for the next 6. Any further prediction, especially in the IT sector, would be mere divination from a crystal ball.

Figure 6 – Data Volume (2010 – 2025) [18]

We can definitely agree that imagining the volume of data in petabytes or zettabytes is like imagining how far away the nearest black hole is. The following example, based on the amount of data generated worldwide this year, should help.

Source data:

 40 zettabytes of data generated this year = 40,000,000,000,000,000,000,000 bytes

 a stack of 100 3.5" floppy disks measures 13 inches (33 cm), i.e. 0.33 cm per floppy disk

 the capacity of one 3.5" floppy disk is 1.44 MB = 1,474,560 bytes

Calculation:

 number of floppy disks needed to store all data generated this year:

40,000,000,000,000,000,000,000 bytes / 1,474,560 bytes per 3.5" floppy disk ≈ 27,126,736,111,111,111 disks

 stack height of all floppy disks holding the data:

27,126,736,111,111,111 floppy disks * 0.33 cm ≈ 8,951,822,916,666,666 cm ≈ 89,518,229,166 km

 number of stacks between the Earth and the Sun:

89,518,229,166 km / 149,600,000 km (distance between the Earth and the Sun) ≈ 600

To sum it up, all the data generated worldwide this year could be stored in roughly 600 stacks (towers) of 3.5" floppy disks reaching from the Earth to the Sun.
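As a quick sanity check of this arithmetic, the following short Python snippet reproduces the calculation (the 40 ZB figure and the Earth-Sun distance are taken from the text above):

```python
# Rough back-of-the-envelope check of the floppy-disk comparison above
data_bytes = 40 * 10**21                 # 40 zettabytes generated this year
floppy_bytes = 1_474_560                 # capacity of a 3.5" floppy disk (1.44 MB)
floppy_thickness_cm = 33 / 100           # a stack of 100 disks is 33 cm high
earth_sun_km = 149_600_000               # average Earth-Sun distance (1 AU)

disks = data_bytes / floppy_bytes                    # ~2.7e16 disks
stack_km = disks * floppy_thickness_cm / 100 / 1000  # cm -> m -> km
print(f"{disks:.3e} disks, stack of {stack_km:,.0f} km")
print(f"about {stack_km / earth_sun_km:.0f} stacks between the Earth and the Sun")
```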

To complete the question "How big are Big Data?", it is necessary to mention the key factor for storing data: the cost of storage. Figure 7 shows the cost of 1 gigabyte of storage from 1980 to 2014. Over the past 35+ years, hard drive prices have dropped from around $500,000 per gigabyte in 1981 to less than $0.03 per gigabyte today. In the last 5 years (2014-2019) the price decline has slowed and stabilized between $0.04 and $0.03 [19].


Figure 7 – History of storage cost per gigabyte (1980-2014) [20]

1.7 The V’s of Big Data

In the context of large amounts of data, the so-called 3 V's are often mentioned, which are generally recognized as the characteristics of Big Data (for a graphical representation of these characteristics see Figure 8). These distinctive 3 V's were defined in 2001 by Doug Laney, an analyst at Gartner [21].

Figure 8 – The 3 V’s of Big Data [22]


Volume - Data Size

The first "V" is volume, i.e. the size of the data being processed. With the ever-increasing amount of data generated by modern technologies, the question arose of how to process these data so that their analysis produces the most accurate results in the shortest possible time. Such data volumes cannot be processed by traditional database tools. What data size already qualifies as "Big" is not stated anywhere, and this boundary is constantly shifting as technology evolves.

Velocity - Data Rate

The second "V" is the velocity at which data are produced. As data size increases, so does the speed at which an organization produces, processes, or receives data. The ability to quickly receive and respond to data can be a significant competitive advantage or even a necessity. However, for an organization to respond quickly to data, fast data processing is necessary. For traditional database tools, it is very difficult to process unstructured data or structured data of large volumes, and the cost of such processing and storage would be unbearable. In contrast, Big Data technologies allow and are designed for such storage and processing.

Variety - Data Diversity

The third "V" represents the variety of data. Data rarely come from a single source and are often not in a unified format. This diversity is a major difference from traditional data storage methods, which work with data of a fixed structure, i.e. structured data. Structured data are described by their metadata, and analysis can easily be performed using traditional database systems. The opposite example is unstructured data, which are neither described nor organized and therefore are not easy to process. This type of data occurs, for example, in emails, documents, or social networks. In the case of Big Data, however, it is also possible to analyze such data from different sources and with different structures, which can be a great benefit to organizations.

To these basic 3 V's, some authors later added other V's, creating the so-called 4 V's and 5 V's, as visualized in Figure 9 and Figure 10.


Figure 9 – The 4 V’s of Big Data [23]

Veracity – Data Credibility

Veracity highlights the possibly poor quality of data at the source. This lack of quality and reliability is then transferred into the analysis, which can result in distorted or incomplete results.

Figure 10 – The 5 V’s of Big Data [24]


Value – Data Usefulness

The term value is often understood as the most important one, because the result of the analysis should bring added value and be useful for making the business more effective.

Very soon there was a race to discover the next V's, so the competition continued with other V's, such as the 8 V's (see Figure 11).

Figure 11 – The 8 V’s of Big Data [25]

Visualization – Big Data visualization involves the presentation of data of almost any type in a graphical format that makes it easy to understand and interpret. But it goes far beyond typical corporate graphs, histograms and pie charts to more complex representations like heat maps and fever charts, enabling decision makers to explore data sets to identify correlations or unexpected patterns [26].

Viscosity – Viscosity measures the resistance to flow in the volume of data. This resistance can come from different data sources, friction from integration flow rates, and the processing required to turn the data into insight. Technologies for dealing with viscosity include improved streaming, agile integration buses, and complex event processing [27].

Virality – Virality describes how quickly information is dispersed across person-to-person (P2P) networks. It measures how quickly data are spread and shared to each unique node. Time is a determining factor, along with the rate of spread [27].

Table 4 – The V's of Big Data [28]

#   Characteristics  Elucidation                   Description
1   Volume           Size of Data                  Quantity of collected and stored data
2   Velocity         Speed of Data                 The transfer rate of data between source and destination
3   Variety          Type of Data                  Different types of data, such as pictures, videos and audio, arrive at the receiving end
4   Veracity         Data Quality                  Accurate analysis of captured data is virtually worthless if the data are not accurate
5   Value            Importance of Data            Represents the business value to be derived from Big Data
6   Visualization    Data Act / Data Process       The process of representing abstract data in a graphical form
7   Viscosity        Lag of Event                  The time difference between the event occurring and the event being described
8   Virality         Spreading Speed               The rate at which data are broadcast/spread by a user and received by different users for their use
9   Variability      Data Differentiation          Data arrive constantly from different sources; how efficiently the system differentiates between noisy and important data
10  Validity         Data Authenticity             Correctness or accuracy of the data used to extract results in the form of information
11  Vulnerability    Data Security                 Big Data brings new security concerns
12  Volatility       Duration of Usefulness        How long the stored data remain useful to the user
13  Venue            Different Platform            Various types of data arrive from different sources via different platforms, such as personnel systems and private & public clouds
14  Vocabulary       Data Terminology              Data terminology, such as data models and data structures
15  Vagueness        Indistinctness of Existence   Concerns information whose meaning is unclear or that conveys little about what it is meant to represent
16  Verbosity        Data Redundancy               The redundancy of information available from different sources
17  Voluntariness    Data Voluntariness            The willful availability of Big Data to be used according to the context
18  Versatility      Data Flexibility              The ability of Big Data to be flexible enough to be used differently in different contexts
19  Complexity       Correlation of Data           Data come from different sources, and it is necessary to figure out the changes, whether small or large, with respect to previously arrived data, so that information can be obtained quickly

Table 4 shows many other examples of the creativity of some authors in the V's race. To close this race of defining more and more V's, look at Figure 12, which shows the year of first appearance of each of these V's.

Figure 12 – First occurrence of V's by year [29]


1.8 Data Science vs. Big Data vs. Data Analytics

For companies, data are becoming more and more available, primarily due to the falling cost of acquisition, processing and storage. The problem is not how to obtain data, but how to benefit from them. Every company now has a huge amount of data, which brings new challenges: how to utilize them and how to find the real value inside. That is what will separate the successful from the losers.

It has been interesting to watch Gartner's Hype Cycle for Emerging Technologies over the last several years. Big Data appeared in Gartner's viewfinder around 2010-2011, reached the top of the hype cycle in 2012-2013, and in 2015 surprisingly disappeared, with the explanation that Big Data had become so pervasive that it could no longer really be considered an emerging technology.

But what appeared in Gartner's Hype Cycle in 2014? The answer can be found in Figure 13. It is Data Science, the new area worth following.

Figure 13 – Hype Cycle for Emerging Technologies, 2014 [30]

As has become common, there is no single proper definition, as everybody sees it from a different perspective and everybody adds something of their own. There are hundreds of definitions of what Data Science is or could be. Arguably the best description can be found on the Wikipedia page for the term Data Science.

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data [31].

But the key word in data science is not "data", it is "science". Data science is only useful when the data are used to answer a question [32]. Data science rests on the same idea as data mining and Big Data: "use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems" [33].

In 2012, when Harvard Business Review called it "The Sexiest Job of the 21st Century" [34], the term "data science" became a buzzword. It is now often used interchangeably with earlier concepts like business analytics, business intelligence, predictive modeling, and statistics. In many cases, earlier approaches and solutions are now simply rebranded as "data science" to be more attractive [35].

Figure 14 – Relations of Data Science with surrounding area [36]

It is very hard to explain the differences between Data Mining, Data Analytics, Data Analysis, Data Science and Big Data itself and how they all fit together. Data Science helps in discovering useful information from Big Data for efficient data analysis with the help of Artificial Intelligence and Machine Learning/Data Mining algorithms. Jonathan Nolis breaks data science down into three components:

(1) business intelligence, which is essentially about "taking data that the company has and getting it in front of the right people" in the form of dashboards and reports;

(2) decision science, which is about "taking data and using it to help a company make a decision";

(3) machine learning, which is about "how can we take data science models and put them continuously into production." Although many working data scientists are currently generalists and do all three, we are seeing distinct career paths emerging, as in the case of machine learning engineers [37].

Figure 15 – The Fields of Data Science [38]

One or more pictures (Figure 14, Figure 15 and Figure 16) can say more than a hundred words about how these areas fit and relate together.

Data mining is analyzing data for the purpose of discovering unforeseen patterns or properties; it turns messy, unstructured data into useful information. It is the computational process of discovering patterns in large data sets (involving Big Data abstraction), using methods at the intersection of artificial intelligence, machine learning, and database systems. Data mining is closely related to data analysis. One can say that data mining is data analytics operating on big data sets, because small data sets would not yield meaningful analytical insights. Data mining, shortly speaking, is the process of transforming data into useful information. Data mining is more rooted in the database point of view (static, already stored data), whereas machine learning originated from the desire to create Artificial Intelligence (AI). Algorithms used in data mining include Apriori (finding associations), DBSCAN (finding clusters) and decision trees.
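As a small, hedged illustration of one of the algorithms named above, the following snippet clusters a few synthetic points with DBSCAN from scikit-learn; the points and parameters are made up for demonstration, and scikit-learn is assumed to be installed.

```python
from sklearn.cluster import DBSCAN

# Two obvious groups of 2-D points plus one outlier (made-up data)
points = [
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],      # cluster A
    [8.0, 8.2], [8.1, 7.9], [7.9, 8.0],      # cluster B
    [4.5, 0.0],                              # noise point
]

# eps = neighborhood radius, min_samples = points needed to form a dense region
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1] -> two clusters, -1 marks noise
```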

Figure 16 – Data Science [39]

Data Analysis is a heuristic activity in which the analyst scans through all the data and gains insight (i.e. produces useful information). Data analysis leverages statistical methods to analyze aggregated or non-aggregated data.

Data Analytics is about applying a mechanical or algorithmic process to find insights, for example running through various data sets with the purpose of finding meaningful correlations between them. This involves the use of statistics and data science tools. Analytics are the result of analysis and the form in which the analysis results are presented; they may simply serve a prediction interest.

Machine Learning finds patterns in (big) data that are useful for researchers and not visible from a human point of view. Machine learning is not a static, hard-coded model but a self-learning, self-adjusting model (the machine learns and changes itself). Compared to data mining, machine learning (or Artificial Intelligence) is more about incorporating the acquired knowledge into the framework for further, i.e. future, use in analysis.

Figure 17 – Data Science Is Multidisciplinary [40]

Data Science is a scientific field that includes methods and processes for operating on data. It is a cluster of mathematics, statistics, programming, and ingenious ways of capturing data that may not be being captured right now. Data Science includes machine learning and other methods such as problem formulation, exploratory data analysis, data model compilation, data visualization, data extraction, etc. [38].

As you can see, there are many views from different perspectives on Data Science, Data Analytics, Data Analysis, Data Mining, Machine Learning, Big Data and the other popular terms used in the industry these days. It is fitting to finish this subchapter with a last picture (see Figure 17) of all the areas that Data Science touches. There are so many disciplines involved that we will have to wait a few years for things to settle down.

1.9 Application of Big Data

Listing all the areas where Big Data are used today would be a superhuman task. In the last 10 years, this phenomenon has reached all areas of human activity. The following paragraphs present some applications of Big Data technology by industry [41].

Retail/Consumer

 Merchandizing and market basket analysis
 Campaign management and customer loyalty programs
 Supply-chain management and analytics
 Event- and behavior-based targeting
 Market and consumer segmentations

Finances & Frauds Services

 Compliance and regulatory reporting
 Risk analysis and management
 Fraud detection and security analytics
 Credit risk, scoring and analysis
 High speed arbitrage trading
 Trade surveillance
 Abnormal trading pattern analysis

Web and Digital media

 Large-scale clickstream analytics
 Ad targeting, analysis, forecasting and optimization
 Abuse and click-fraud prevention
 Social graph analysis and profile segmentation
 Campaign management and loyalty programs

Health & Life Sciences

 Clinical trials data analysis
 Disease pattern analysis
 Campaign and sales program optimization
 Patient care quality and program analysis
 Medical device and pharmacy supply-chain management
 Drug discovery and development analysis

Telecommunications

 Revenue assurance and price optimization
 Customer churn prevention
 Campaign management and customer loyalty
 Call detail record (CDR) analysis
 Network performance and optimization
 Mobile user location analysis

Ecommerce & customer service

 Cross-channel analytics
 Event analytics
 Recommendation engines using predictive analytics
 Right offer at the right time
 Next best offer or next best action

To present Big Data in at least some detail, a use case from the automotive industry was chosen and is presented in the following paragraphs.

Automotive Industry Use Case of Big Data

The average vehicle on today's roads generates huge amounts of data. Vehicle sensors monitor everything from tire pressure and engine speed to oil temperature and vehicle speed or position. Vehicles can thus produce anywhere from 5 to 250 GB of data per hour. A 2014 study by McKinsey states that the average connected car generates 25 GB of data per hour [42]. Patrick Nelson from IDG says a self-driving car can generate up to 4 TB of data per day (details are depicted in Figure 18) [43]. Autonomous vehicles, such as Google's, generate about 1 GB of data every second. The vast majority of these data are used in real time to check and report on vehicle functions and have no real long-term value. Receiving thousands of messages from a sensor saying "normal tire pressure" brings no further benefit, so car makers do not bother to store such data in the car or on servers.
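The figures quoted above come from different sources and assume different sensor sets and amounts of driving time, which is why they span several orders of magnitude; the short conversion below (plain arithmetic, no external data) puts them on a common per-day scale for comparison:

```python
# Convert the reported generation rates to a common unit (GB per day)
rates_gb_per_hour = {"average connected car (McKinsey)": 25, "sensor-heavy vehicle (upper bound)": 250}
for name, gb_h in rates_gb_per_hour.items():
    print(f"{name}: {gb_h * 24:,} GB/day")

# 1 GB every second, assuming a full day of continuous driving:
print(f"autonomous vehicle at 1 GB/s: {3600 * 24 / 1000:,.1f} TB/day")
```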

However, some data are valuable. And if such information can be gathered from a substantial portion of the billions of cars in operation today, we need no more than basic arithmetic to understand why Big Data is attracting so much attention in the automotive sector.

There is great interest in collecting large amounts of data not only among OEMs (Original Equipment Manufacturers), but also among companies that have nothing to do with OEM production or delivery; these are companies that offer solutions to the surrounding ecosystem. Big Data opens up many other possibilities. Another example of how to use large amounts of data is to understand how customers actually use their products. In most cars, some features go unnoticed while others are highly valued. With large amounts of data, the OEM has the opportunity to learn much more about how customers use the vehicle and what their preferences are. The collected data can then be analyzed and used to plan and develop other features.

Figure 18 – The coming flood of data in autonomous vehicles [43]

It is useful for the automaker to monitor the habits of a particular customer group: the number of their short trips on a single day or long rides at weekends, or the frequency of rides in difficult winter conditions. In addition, manufacturers can begin to focus on additional value-added services that can target specific demographic groups within their customer base.

Interested customers can be insurance companies that want to collect a particular subset of car data, such as speed, driving time, braking and cornering, to find out how drivers actually drive, rather than base their risk calculations on much less reliable indicators such as their age or credit status. Insurance companies are already on the market with solutions that enable this data collection. An example is the AXA Drive application, which is able to record the driver's driving style using GPS and gyro sensors. It patiently collects data and records abrupt starts and passages through dangerous places, taking into account day, night, meteorological conditions and other details. After the drive it offers tips for improvement and greater safety, but there is no evidence of how the data are utilized internally. We can be sure, however, that they can score individual drivers and offer them better insurance rates. More and more vehicle manufacturers are also offering insurance services through their partners, so there are more data seekers who can be provided with sufficient details to calculate the risk.

Another example of how to profit from Big Data is that manufacturers can sell their data to third parties. For example, USPS has put a price on its large data sets and allows organizations to access the National Address Change Database for $175,000 a year. Cisco has calculated that the financial benefit of Big Data for vehicle manufacturers will be close to $300 per vehicle per year. Most of this amount comes from lower warranty service costs and improved concept design. The same analysis concluded that Big Data will help drivers save $500 a year thanks to better navigation and smarter driving.

To collect data, car manufacturers currently have to wait for drivers to bring their cars to a service center whose technicians can access the data through the on-board diagnostic system. But not all drivers have their cars serviced regularly, for example every 15,000 km, which means that data analysis across all the vehicles in operation does not happen in a continuous and proactive way, and achieving this is still a challenge. With the arrival of cars connected to the Internet, or rather using 5G mobile networks, car makers can have full access to data from the car anywhere and anytime, making the analysis of these data easier and more viable. But as long as such connected cars do not make up the majority of the cars in operation, the big data will remain rather small data [44] [45].


2 NOSQL

2.1 Introduction

The main characteristics of NoSQL DBMSs are flexible scalability, lower cost, a flexible data structure and availability. The capacity of a cluster is much easier to expand with horizontal scaling, which involves running multiple servers (nodes) within a cluster. In contrast, with vertical scaling, data migration or at least a shutdown of the system is necessary when capacity is increased. NoSQL databases are designed to allow new nodes to be easily added or removed as needed, without a shutdown. High availability can be assured by other nodes taking over the role of individual nodes in the event of failure or maintenance. A significant cost factor is the fact that commodity hardware can be used for the cluster servers, which is significantly less expensive than large database servers. The vast majority of NoSQL databases are available as open-source products, with the option to purchase support as a commercial service provided by third parties. Almost all NoSQL databases offer the ability to represent data without a fixed data schema. As mentioned above, the main reason for interest in NoSQL databases is the ability to run databases in clusters. Whether the database system will perform optimally, for example with a large number of write operations, or whether it will guarantee availability in the event of disruption between servers, depends on the chosen distribution model.

2.2 Distribution

NoSQL databases, also known as distributed databases, are based on the principle of data distribution, as the name implies. However, this does not mean that data distribution is necessary whenever a NoSQL database is used. If the size of the data allows it, only minimal or even no data distribution may occur, which limits or eliminates the risks associated with distribution. The basic model is running the database system on a single server. Even in this case it may be worthwhile to use a NoSQL DBMS, if it is appropriate from the perspective of the data structure the application is working with. The following data distribution techniques are used for ideal data processing.

The first distribution model is called sharding, i.e. placing different parts of the data on different nodes. This approach is useful when different users access different parts of the data. Many NoSQL DBMSs support auto-sharding, where the database system itself allocates data to individual nodes. Sharding is a convenient way to increase both read and write performance at the same time.
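A minimal sketch of the idea behind hash-based sharding is shown below; the routing function and the node count are assumptions for illustration only, and real NoSQL systems add rebalancing, replication and failure handling on top of this.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # a hypothetical three-node cluster

def shard_for(key: str) -> str:
    """Route a record to a node based on a hash of its key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each key is deterministically owned by one node, so reads and writes
# for different keys are spread across the cluster.
for customer_id in ["cust-1001", "cust-1002", "cust-1003", "cust-1004"]:
    print(customer_id, "->", shard_for(customer_id))
```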

Another distribution model is master-slave replication. The data are replicated to several nodes, one of which is designated as the primary node (the master) and is generally responsible for executing data update operations. The replication mechanism then synchronizes the secondary nodes (the slaves) with the current data on the primary node. Master-slave replication is useful when a large number of read operations are performed on the data. Conversely, the peer-to-peer replication model can help when data are written frequently. There is no primary node; all nodes have the same weight and perform both read and write operations, and replication takes place between the nodes. The downside of this model is the risk of conflicting data, which occurs when several write operations attempt to modify the same entry on different nodes. This problem can be solved by mutual coordination between the nodes, but this is associated with higher network traffic. In this case it is not necessary to wait for the response of all nodes; to ensure a high degree of consistency, an absolute majority is sufficient. The number of nodes that must acknowledge an operation is called a quorum. The second possible solution is to allow such write conflicts to arise and to subsequently resolve them within the application in accordance with its business logic.

2.3 Consistency

Efficient and correct data processing requires consistency. Therefore, traditional relational databases require that the data they contain meet certain restrictive conditions, so-called integrity constraints. These database systems work with transactions, i.e. logically related operations that transform data from one consistent state to another. At the end of a transaction, all integrity constraints must be satisfied. For this reason, transactions are required to have the following properties, referred to as ACID. An important feature of NoSQL databases is the absence of the classic transactional approach and a different view of data consistency. While a traditional RDBMS is based on the principle of ACID transactions (Atomicity, Consistency, Isolation, Durability), NoSQL databases follow the BASE principle (Basic Availability, Soft state, Eventual consistency).

Atomicity - a database transaction is an indivisible (atomic) operation. It is performed either as a whole or not at all (and in the latter case the database system notifies the user, e.g. by an error message).

Consistency - no integrity constraints are violated after the transaction. The database state after the transaction completes is always consistent, i.e. valid according to all defined rules and constraints. The database is never left in an inconsistent state.

Isolation - operations inside a transaction are hidden from operations outside it. Rolling back a transaction (the ROLLBACK operation) must not affect other transactions; if it does, those transactions have to be rolled back as well, which may result in a cascading rollback.

Durability - changes made by successfully committed transactions are permanently stored in the database and can no longer be lost.
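
As an illustration of these properties, the following JDBC sketch shows a classic money transfer executed as a single transaction. The connection URL, table and column names are only assumed for the example; either both UPDATE statements take effect, or the whole transaction is rolled back.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustrative example; the connection URL, table and column names are assumed.
public class TransferExample {

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/bank", "user", "password")) {

            con.setAutoCommit(false);                     // start of the transaction
            try (PreparedStatement debit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {

                debit.setInt(1, 100);  debit.setInt(2, 1);  debit.executeUpdate();
                credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();

                con.commit();                             // durability: both changes are stored permanently
            } catch (SQLException e) {
                con.rollback();                           // atomicity: the partial change is undone
                throw e;
            }
        }
    }
}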

The term "Basic Availability" means if single node fails, part of the data won't be available, while the remaining parts of the system remain functional. "Soft state" refers to the fact that data may be overwritten by a newer version, which is closely related to the third attribute of the BASE principle - "Eventual consistency". This term means that the database may, under certain circumstances, be in an inconsistent state. Typically, this is where copies of the same data are located on several nodes within a cluster. If a user or application updates data on one of the servers, other copies of the data are inconsistent for a certain (usually short) period until the NoSQL database replication mechanism updates all copies of the data.

Unlike ACID, BASE is not as strict and permits temporary data inconsistency in order to increase availability and performance. With the BASE approach, the database (or at least part of it) remains available even in the case of partial failures between nodes, e.g. when a node fails. This results in temporary data inconsistency, which also occurs when the synchronization of changes between individual parts is delayed. However, once the connection is restored and no further changes arrive for some time, a consistent state is re-established. Since writes are not blocked (as opposed to ACID transactions), the system is more responsive, even at the cost of temporary inaccuracies in the stored data.
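
One common way an eventually consistent store converges its replicas is the "last write wins" strategy: every value carries a timestamp, and during synchronization the newer version replaces the older one. The following minimal sketch, with purely hypothetical class names, captures the idea.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch with hypothetical names, not the API of any real product.
public class LwwReplica {

    // A value together with the timestamp of the write that produced it.
    record Versioned(String value, long timestamp) {}

    private final Map<String, Versioned> store = new ConcurrentHashMap<>();

    // Keeps the version with the newer timestamp ("last write wins").
    public void put(String key, String value, long timestamp) {
        store.merge(key, new Versioned(value, timestamp),
                (oldV, newV) -> newV.timestamp() > oldV.timestamp() ? newV : oldV);
    }

    // Called by the replication mechanism with the content of another replica;
    // after a mutual sync both replicas converge to the same state.
    public void sync(LwwReplica other) {
        other.store.forEach((k, v) -> put(k, v.value(), v.timestamp()));
    }
}

Production systems usually employ more robust mechanisms such as vector clocks, read repair or hinted handoff, but the principle of converging to a single version after synchronization is the same.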

2.4 CAP Theorem

The CAP theorem, also known as Brewer's theorem, was introduced by Eric Brewer in 2000 at the ACM Symposium on Principles of Distributed Computing and expresses a triple constraint related to distributed database systems. It states that a distributed database system, running on a cluster, can only provide two of the following three properties:

Consistency – a read from any node results in the same data across multiple nodes; a guarantee that every node in a distributed cluster returns the same, most recent data. Consistency refers to every client having the same view of the data. There are various types of consistency models; consistency in CAP (as used to prove the theorem) refers to linearizability or sequential consistency, a very strong form of consistency.

Availability – a read/write request will always be acknowledged in the form of a success or a failure. Every non-failing node returns a response to all read and write requests in a reasonable amount of time. The key word here is "every": to be available, every node (on either side of a network partition) must be able to respond in a reasonable amount of time.

Partition tolerance – the database system can tolerate communication outages that split the cluster into multiple silos and can still service read/write requests. The system continues to function and upholds its consistency guarantees in spite of network partitions, and distributed systems guaranteeing partition tolerance can gracefully recover once a partition heals.

Figure 19 – A Venn diagram summarizing the CAP theorem

In 2002, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer's conjecture [46]. The theorem states that networked shared-data systems can only guarantee (or strongly support) two of the three properties. The CAP theorem categorizes systems into three categories:

CP (Consistent and Partition Tolerant) - At first glance the CP category is confusing, as it seems to describe a system that is consistent and partition tolerant but never available. In fact, CP refers to a category of systems in which availability is sacrificed only in the case of a network partition.

CA (Consistent and Available) - CA systems are consistent and available in the absence of any network partition. Single-node database servers are often categorized as CA systems, since they do not need to deal with partition tolerance. The only hole in this reasoning is that a single-node database is not a networked shared-data system and thus does not fall under the purview of CAP.

AP (Available and Partition Tolerant) - These are systems that are available and partition tolerant but cannot guarantee consistency.
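
The practical consequence of the theorem can be illustrated by the decision a single node has to make while it cannot reach the other replicas. The sketch below is a toy example with purely illustrative names, not the behaviour of any specific database.

// Toy illustration of the CAP trade-off during a network partition.
public class PartitionedNode {

    private String localValue = "v1";    // possibly stale local copy of the data
    private boolean partitioned = true;  // the node cannot reach the other replicas

    // AP behaviour: always answer, even though the value may be stale.
    public String readAvailable() {
        return localValue;
    }

    // CP behaviour: refuse to answer while the partition lasts, because
    // consistency with the other replicas cannot be verified.
    public String readConsistent() {
        if (partitioned) {
            throw new IllegalStateException("unavailable during network partition");
        }
        return localValue;
    }
}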

Figure 19 shows, as a Venn diagram, the overlapping areas between consistency, availability and partition tolerance.

However, the CAP theorem should not be taken literally, and its critics raise several points for discussion. Above all, the originally used definitions are not entirely precise, and the version of the theorem that was later proven has additional restrictive conditions. Furthermore, some of the properties are not symmetrical: the lack of consistency is assumed throughout the system's operation, whereas the lack of availability applies only in the case of a network partition, i.e. only some of the time. Moreover, if the theorem is understood literally, it says that providing the combination of consistency and partition tolerance means the system is not available at all after a disconnection, which is, of course, a very undesirable property [47] [48] [49].

2.5 Classification of Database Management Systems

Before the basic types and classification of NoSQL databases are presented, it is necessary to step back, look at database classification from a higher perspective, and present the top-level DBMS classification from several different angles. There are several criteria according to which DBMS are classified.

Based on the data model

Relational database - by far the most popular data model used worldwide. It is based on SQL and the ACID transaction paradigm (Atomicity, Consistency, Isolation, Durability). A relational database is a set of formally described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables. The tables (or files) with the data are called relations; rows are referred to as records or tuples, and columns as attributes or fields. Examples: Oracle, MySQL, Microsoft SQL Server.

Object oriented database - object oriented database management systems (often referred to as object databases) were developed in the 1980s, motivated by the common use of object-oriented programming languages. The goal was to be able to store objects in a database in a way that corresponds to their representation in a programming language, without the need for conversion or decomposition. Examples: InterSystems Caché, Versant Object Database, Db4o.

Hierarchical database - the data are organized into a tree-like structure and stored as records which are connected to one another through links. A record is a collection of fields, with each field containing only one value, and the type of a record defines which fields the record contains. Examples: IMS (IBM), Windows registry (Microsoft).

Network database – a database model similar to the hierarchical model, which was used almost exclusively for a long time. In addition to what the hierarchical model offers, it provides many-to-many relationships, so one entity can have more than one parent; it also allows recursion, i.e. an entity can be the parent of its own parent. However, this data concept was superseded in 1970 by the relational database concept. The disadvantage of using a network database is its inflexibility and the resulting difficulty of changing its structure. Examples: RDM Server, Integrated Data Store (IDS).
