
2.3 Analytical tools

2.3.4 Python

Python is a high-level general-purpose programming language [31]. According to the well-established and reputable websites Tiobe and KDnuggets, in 2020 Python dominated the list of the most preferred platforms for data analysis, as well as for other areas [27, 32]. Python owes its success to many different factors, among them legibility, code brevity and intelligibility, support of many contemporary libraries, compatibility with other programming languages, platform-independent script execution, and a large worldwide community. The software releases can be divided into two main branches – version 2.7 and Python 3. While they share many similarities, there are also significant differences that are often critical for client products. The latest stable version, 3.9, introduced changes in dictionary updates, time-zone handling, decorator syntax, and a new PEG parser [33]. An inseparable element of the Python language is the PEP 8 standard, which describes best practices for writing scripts. Such guidelines help improve code quality and reduce the time spent on repeatedly deciphering unfamiliar code.

The package-management system pip is also worth mentioning: it allows users to easily download third-party packages, which makes it an essential part of the workflow.

The combination of different libraries allows users to achieve great results in data analysis.

The NumPy and SciPy libraries have proven effective for tasks related to data manipulation, making it easier to work with multidimensional arrays and to perform mathematical and numerical analysis. The Pandas library can be characterized in a similar manner, helping with structured (tabular, multidimensional, potentially heterogeneous) and time-series data. Pandas’ key elements are its data structures: Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled data structure).

A DataFrame typically accepts Series, lists, dicts, and other Python data types as input values. The Pandas library was designed to help with the following types of problems: handling of missing data, working with columns and rows, intelligent slicing and data indexing, hierarchical labeling of axes, reading/writing various data formats (including CSV and Excel), time-series data processing, etc. [34]. Figure 9 shows an example of the Pandas library being used with the Jupyter Notebook web app.
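As a brief illustration of these capabilities, the following sketch creates the two core data structures and touches on missing-data handling, slicing and CSV export; the column names and values are hypothetical and serve only as a demonstration.

import numpy as np
import pandas as pd

# Series: a one-dimensional labeled array
s = pd.Series([1.0, 3.5, np.nan, 7.2], name="measurement")
s_filled = s.fillna(s.mean())            # handling of missing data

# DataFrame: a two-dimensional labeled data structure built from a dict
df = pd.DataFrame({
    "city": ["Prague", "Brno", "Ostrava"],
    "population": [1_300_000, 380_000, 290_000],
})

large = df[df["population"] > 300_000]   # boolean slicing of rows
first_city = df.loc[0, "city"]           # label-based indexing

df.to_csv("cities.csv", index=False)     # writing one of the supported formats
df_back = pd.read_csv("cities.csv")      # and reading it back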

Many convenient data visualization libraries are also available for Python. The abovementioned Pandas package, built upon another library, Matplotlib, allows users to quickly build plots by simply calling the plot() method and selecting the graph type. Below is a list of several common plot types [16]; a short sketch showing how they are produced follows the list:

• Bar plot is a chart (or graph) created to compare multiple independent categories (there is a space between the columns) in the form of rectangular bars, with the input values determining their size. The results can be represented vertically or horizontally using the bar() and barh() methods, respectively.

• Histograms provide a graphic representation of the frequency of numerical data in the columns, without space between the bars, using the hist() method.

• Pie chart is used to separate individual parts or components from the whole. Most often, the elements are supplemented with a percentage ratio. It is also recommended to present the target audience with a maximum of three attributes. The rendering is performed with the pie() function.

• Scatter plot shows interconnections between attributes on the X and Y axes. The data are displayed as dots, usually covering 2–4 variables. A scatter plot could be used, for example, to graph the linear regression algorithm described in chapter 2.2.1. The scatter() method uses five basic arguments: X coordinates, Y coordinates, dot size, dot color, and keyword arguments.

• Box plots are constructed on the basis of five parameters: the minimum value, the first quartile, the second quartile (median), the third quartile, and the maximum value. A box plot is a great way to compare categorical attributes and the distribution of quantitative values on a certain topic, and to identify outliers. The result is a rectangle with lines extending from opposite sides. It can be constructed using several different methods, with box() and boxplot() being the most commonly used.
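A minimal sketch of the plotting calls discussed above is given below; it uses hypothetical data and the Pandas plot interface built on Matplotlib, together with Matplotlib’s own scatter() function.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data for demonstration purposes only
df = pd.DataFrame({"category": ["A", "B", "C", "D"],
                   "value": [10, 24, 17, 31]})
samples = pd.Series(np.random.default_rng(0).normal(size=500))

df.plot.bar(x="category", y="value")            # vertical bar plot
df.plot.barh(x="category", y="value")           # horizontal bar plot
samples.plot.hist(bins=20)                      # histogram (no space between bars)
df.set_index("category")["value"].plot.pie()    # pie chart
samples.plot.box()                              # box plot

plt.figure()
plt.scatter(df["value"], df["value"] * 2,       # X and Y coordinates
            s=60, c="steelblue", alpha=0.7)     # dot size, color, keyword argument
plt.show()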

Machine learning algorithm sets are also provided by various libraries. One of these is Scikit-learn, a free and open-source library for the Python programming language that provides access to a wide array of supervised and unsupervised learning methods, model evaluation tools, utilities aimed at solving real-world tasks with large amounts of data, and other useful data science tools. First, the relevant Scikit-learn packages are imported as normal Python modules. Then, the desired object is instantiated and its parameters are set. Parameters that cannot be learned directly within estimators are called hyper-parameters; the Scikit-learn library offers options to tune them using the Random Search and Grid Search techniques or, more precisely, the RandomizedSearchCV and GridSearchCV classes. Both methods perform a cross-validated search over parameter settings, hence the “CV” in their names. Apart from Random Search and Grid Search, the latest version of Scikit-learn also includes two new tools, HalvingGridSearchCV and HalvingRandomSearchCV, which can sometimes achieve better results. Estimator objects, inspection utilities, results visualization (e.g. the ROC curve) and dataset transformations can also be found on the list of basic work areas. When compared to other libraries, Scikit-learn boasts many strengths and advantages; however, it falls slightly short in the area of neural networks, where users would benefit more from tools designed specifically for that purpose [26].
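To illustrate the hyper-parameter tuning described above, the following sketch runs GridSearchCV and RandomizedSearchCV over a small, arbitrarily chosen parameter grid for a random forest on the bundled iris dataset; the grid values are assumptions made only for the example.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Hyper-parameters cannot be learned by the estimator itself,
# so candidate values are searched over with cross-validation ("CV").
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random Search samples a fixed number of candidate combinations instead
rnd = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                         param_grid, n_iter=4, cv=5, random_state=0)
rnd.fit(X, y)
print(rnd.best_params_, rnd.best_score_)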

Another open-source library, Keras, helps users to solve machine learning problems with a focus on modern deep learning. The platform was created for meticulous work with algorithms, allowing users to customize the features of virtually any method. Neural networks are the main focus of Keras, which is able to build complex models with a minimal number of code lines thanks to building on the low-level features of the TensorFlow framework [35].
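The following sketch shows how compactly a small feed-forward network can be defined with the Keras Sequential API; the layer sizes, input dimension and training settings are arbitrary assumptions chosen only for illustration.

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected network for binary classification
model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()
# model.fit(X_train, y_train, epochs=5)  # training data omitted here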

In 2020, the combination of Keras and TensorFlow was the most popular deep learning stack both in the scientific and in the business community, with companies like Netflix or Uber using it in their products [36]. It is also worth mentioning that using Keras together with Scikit-learn is very convenient, since some tools are not available in Keras out of the box and would otherwise need to be programmed manually. A good example is the abovementioned search for hyper-parameters, where the user can take a shortcut by simply reusing the Scikit-learn methods.
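A sketch of this combination is given below. It assumes the KerasClassifier wrapper shipped with TensorFlow 2 in tf.keras.wrappers.scikit_learn (later superseded by the separate SciKeras package); the model architecture and the searched values are illustrative assumptions.

from sklearn.model_selection import GridSearchCV
from tensorflow import keras
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def build_model():
    # Minimal model; its architecture is an assumption for the example
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# The wrapper makes the Keras model look like a Scikit-learn estimator,
# so GridSearchCV can tune its training hyper-parameters directly.
clf = KerasClassifier(build_fn=build_model, verbose=0)
param_grid = {"batch_size": [16, 32], "epochs": [5, 10]}
search = GridSearchCV(clf, param_grid, cv=3)
# search.fit(X_train, y_train)  # training data omitted here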

TensorFlow was developed by Google and is the default backend of the Keras library. Key elements of TensorFlow include computational graphs (networks of nodes and edges) and tensors (n-dimensional arrays). The new TensorFlow 2 received improvements such as eager execution enabled by default, allowing dynamic model definition and immediate evaluation of operations without the need to build graphs first. Another important upgrade is support for imperative Python code with an option to convert it into graphs. Among the best practices of working with TensorFlow are the use of the high-level API, following Python code-writing standards, and distributing training across multiple GPUs and TPUs with Keras. The TensorFlow developers have created an entire ecosystem designed for interworking: the TensorFlow Lite framework is required to run the library on mobile devices; TensorFlow Datasets is a collection of ready-made Google datasets; and TensorFlow.js is a machine learning library for JavaScript and browser deployment. At the same time, the TensorFlow platform is not tied inseparably to Keras, which allows clients to choose their preferred software [37, 38].
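The following sketch demonstrates both behaviours: eager execution evaluating an operation immediately, and the tf.function decorator converting imperative Python code into a graph; the values are arbitrary.

import tensorflow as tf

# Eager execution is the default in TensorFlow 2:
# the result of an operation is available immediately, no graph needed.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0], [2.0]])
print(tf.matmul(a, b))

# Imperative Python code can be traced into a graph with tf.function.
@tf.function
def affine(x, w, bias):
    return tf.matmul(x, w) + bias

print(affine(a, b, tf.constant(0.5)))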

Figure 9. Jupyter Notebook web application. Source: author.