ALIGNMENT-FREE VISUALIZATION OF METAGENOMIC DATA BY GENOMIC SIGNAL PROCESSING


Kristýna Kupková

Master Degree Programme (2), FEEC BUT E-mail: xkupko00@stud.feec.vutbr.cz

Supervised by: Karel Sedlář

E-mail: sedlar@feec.vutbr.cz

Abstract: Alignment-free visualization of metagenomic data without prior knowledge of organism composition is one of the major challenges in metagenomic bioinformatics. Such a visualization method must place fragments from one organism in close proximity to each other, be reproducible and, given the rapidly increasing amount of data, ideally run in linear time. A novel technique fulfilling all of these requirements is introduced in this paper. The method is based on transforming a genomic sequence into a phase signal representation and extracting three Hjorth’s descriptors, which form a three-dimensional space for visualization.

Keywords: metagenome; numerical representation; visualization; Hjorth’s descriptors

1. INTRODUCTION

Next generation sequencing (NGS) and third generation sequencing (TGS) have allowed rapid progress in metagenomics. However, the amount of data produced by both NGS and TGS is enormous, which creates a need for fast and efficient techniques for interpreting this data. Alignment-free visualization of metagenomic data without prior knowledge of the taxonomic composition is one of the major challenges. The state of the art is based on Barnes-Hut Stochastic Neighbor Embedding (BH-SNE) of oligonucleotide signatures with a runtime of O(n log n), where n is the number of fragments [1]. A new visualization method based on transformation of genomic sequences into a signal representation followed by feature extraction is introduced in this article. Its major advantages over BH-SNE are the non-stochastic character of the method, which makes it reproducible, and a runtime of O(n), which considerably accelerates the whole visualization process, as shown further.

2. MATERIALS AND METHODS

2.1. DATASET

In order to test the effectiveness of the method and compare it with the technique using the BH-SNE algorithm, the same simulated dataset as in [1] was used. This dataset comprises genomic fragments (three fragment lengths were tested: 500 nt, 1,000 nt, and 5,000 nt) of ten different taxa, shown in Figure 1. The whole genome sequences were obtained from the National Center for Biotechnology Information (NCBI) GenBank database.

2.2. SIGNAL REPRESENTATION AND HJORTH’S DESCRIPTORS

The phase of a complex number was chosen as the transformation technique for obtaining numerical sequences from genomic fragments. In this paper the method assigns the values {-3π/4, -π/4, π/4, 3π/4} rad to the four nucleotides {C, T, A, G}, respectively. Unlike the character representation, this representation allows the use of signal processing tools while keeping the biological information stored within the sequence [2].
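The nucleotide-to-phase mapping described above can be illustrated with a minimal Python sketch. The function name `phase_signal` and the handling of symbols other than A, C, G and T (they are skipped here) are assumptions made for illustration; the paper does not state how ambiguous bases are treated.

```python
import math

# Phase (in radians) assigned to each nucleotide, as given in the text.
PHASE = {
    "C": -3 * math.pi / 4,
    "T": -math.pi / 4,
    "A": math.pi / 4,
    "G": 3 * math.pi / 4,
}


def phase_signal(fragment):
    """Map a DNA fragment to a list of phase values in radians.

    Assumption: symbols outside A, C, G, T (for example N) are skipped.
    """
    return [PHASE[base] for base in fragment.upper() if base in PHASE]


if __name__ == "__main__":
    # A short hypothetical fragment, for illustration only.
    print(phase_signal("ACGTTGCA"))
```

Each nucleotide is visited once, so the transformation of a fragment takes time proportional to its length.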

After the sequence fragments are transformed into numerical signals, a set of features consisting of the three Hjorth’s descriptors (activity, mobility, complexity) is extracted from each signal.

The Hjorth’s descriptors, often used in e.g. EEG analysis, were selected for their ability to capture the features of non-stationary signals not only in signal time (position) domain but also in frequency domain of spectrum. The descriptors are computed according to equations (1)-(3), where $\sigma_0^2$, $\sigma_1^2$, and $\sigma_2^2$ are the variances of signal and its first and second derivatives respectively [3].

$$\text{Activity:}\quad A = \sigma_0^2 \tag{1}$$

$$\text{Mobility:}\quad M = \frac{\sigma_1}{\sigma_0} \tag{2}$$

$$\text{Complexity:}\quad C = \frac{\sigma_2 / \sigma_1}{\sigma_1 / \sigma_0} \tag{3}$$
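As an illustration of equations (1)-(3), the following Python sketch computes the three descriptors for a single numerical signal. It is a sketch under stated assumptions, not the authors' MATLAB implementation: the derivatives are approximated by first-order differences, the population variance is used, and the helper names `variance`, `diff` and `hjorth` are hypothetical.

```python
import math


def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def diff(values):
    """First-order difference, used here as a discrete derivative (assumption)."""
    return [b - a for a, b in zip(values, values[1:])]


def hjorth(signal):
    """Return (activity, mobility, complexity) of a 1-D signal.

    Assumes at least three samples and that neither the signal nor its
    first difference is constant (otherwise a variance would be zero).
    """
    d1 = diff(signal)
    d2 = diff(d1)
    var0, var1, var2 = variance(signal), variance(d1), variance(d2)
    activity = var0                                  # equation (1)
    mobility = math.sqrt(var1 / var0)                # equation (2)
    complexity = math.sqrt(var2 / var1) / mobility   # equation (3)
    return activity, mobility, complexity


if __name__ == "__main__":
    # A toy signal, for illustration only.
    print(hjorth([0, 1, 0, -1, 0, 1, 0, -1]))
```

Every step is a single pass over the signal, so the cost per fragment is linear in the fragment length, and each fragment yields one point (A, M, C) in the three-dimensional space.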

3. RESULTS AND DISCUSSION

The Hjorth’s descriptors of each fragment are plotted in a three-dimensional space, as shown in Figure 1. Here every point represents one 5,000 nt long fragment of the organism indicated by color in the legend. Except for the two E. coli organisms, which form one cluster, each of the taxa forms its own cluster, allowing a rather reliable visualization of the data.

Figure 1: Visualization of Hjorth’s descriptors from phase representation of genomic sequences

In order to quantify the effectiveness of the introduced visualization technique, k-means clustering was performed. The algorithm was applied to all three datasets differing in fragment length, specifically 500 nt, 1,000 nt and 5,000 nt. The final statistics for each organism, in the form of precision (precision = TP/(TP+FP)) and accuracy (accuracy = (TP+TN)/(TP+TN+FP+FN)) values, are given in Table 1. In the majority of cases, both precision and accuracy improve with increasing fragment length. In [1], manual human-augmented binning was used for classification, so the reported results are highly subjective; the k-means algorithm used in this paper is an automated method providing objective results. Despite this, the average precision and accuracy reach high values of 76.27% and 92.78%, respectively.
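The per-organism statistics can be sketched in Python as follows, using the standard definitions precision = TP/(TP+FP) and accuracy = (TP+TN)/(TP+TN+FP+FN) in a one-vs-rest setting. The function names and the toy labels are hypothetical, and the sketch assumes that each k-means cluster has already been assigned an organism name; the paper does not describe how that assignment was made.

```python
def confusion_counts(true_labels, assigned_labels, target):
    """One-vs-rest counts (TP, TN, FP, FN) for the organism `target`."""
    tp = tn = fp = fn = 0
    for truth, assigned in zip(true_labels, assigned_labels):
        if assigned == target:
            if truth == target:
                tp += 1
            else:
                fp += 1
        elif truth == target:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def precision(tp, fp):
    """Fraction of fragments assigned to the organism that belong to it."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all fragments that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical labels for six fragments, for illustration only.
    truth = ["A", "A", "A", "B", "B", "C"]
    assigned = ["A", "A", "B", "B", "A", "C"]
    tp, tn, fp, fn = confusion_counts(truth, assigned, "A")
    print(precision(tp, fp), accuracy(tp, tn, fp, fn))
```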

Another performance parameter, the runtime of the algorithm for different numbers of sequences, is presented in Table 2. The runtime of the method based on Hjorth’s descriptors is significantly lower than that of the BH-SNE method and allows almost immediate interpretation of metagenomic data. Furthermore, the linear fit t = 0.00039n + 13, where t is the runtime of Hjorth’s descriptors in seconds and n is the number of fragments, indicates that the runtime of the Hjorth’s descriptors method depends linearly on the number of fragments, whereas the BH-SNE based method depends on both the number and the length of fragments, as can be seen from Table 2.
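The reported linear fit can be checked against the three Hjorth’s descriptor runtimes listed in Table 2. The sketch below is an ordinary least-squares fit written in plain Python; the function name `fit_line` is hypothetical, and the fitting procedure actually used by the authors is not stated, so small differences in the coefficients are to be expected.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Number of fragments and Hjorth's descriptor runtimes (s) from Table 2.
fragments = [70113, 35058, 7017]
runtimes = [40.5, 26.5, 15.7]

slope, intercept = fit_line(fragments, runtimes)

if __name__ == "__main__":
    print(f"t = {slope:.5f} n + {intercept:.1f}")
```

With only three data points the fit is indicative rather than conclusive, but the resulting slope and intercept come out close to the reported 0.00039 and 13.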

Table 1: Precision and accuracy values after applying k-means algorithm on Hjorth’s descriptors obtained from phase representation of genomic fragments of variable lengths.

| Organism | Precision (%), 500 nt | Precision (%), 1,000 nt | Precision (%), 5,000 nt | Accuracy (%), 500 nt | Accuracy (%), 1,000 nt | Accuracy (%), 5,000 nt |
|---|---|---|---|---|---|---|
| L. xyli | 71.41 | 74.49 | 91.63 | 92.34 | 93.22 | 96.42 |
| E. coli | 44.24 | 47.63 | 42.17 | 77.74 | 80.26 | 75.80 |
| Candidatus C. ruddii | 92.97 | 95.31 | 92.31 | 95.53 | 94.29 | 99.90 |
| H. influenzae | 46.47 | 66.98 | 94.26 | 90.00 | 92.44 | 94.93 |
| B. amyloliquefaciens | 44.56 | 46.18 | 58.66 | 84.35 | 82.62 | 89.53 |
| B. hyodysenteriae | 50.15 | 49.24 | 70.46 | 90.87 | 90.35 | 96.19 |
| G. obscurus | 70.82 | 74.06 | 84.51 | 93.48 | 94.40 | 96.76 |
| R. prowazekii | 57.89 | 68.75 | 79.03 | 91.44 | 91.93 | 94.93 |
| Termite group 1 bacterium | 57.82 | 66.20 | 73.43 | 91.06 | 91.50 | 91.02 |

Table 2: Runtimes for the BH-SNE approach and for the Hjorth’s descriptors approach implemented in MATLAB® using 2.4 GHz Intel® Core™ i5-2430M CPU.

| Fragment length (nt) | No. of fragments | BH-SNE (s) | Hjorth’s descriptors (s) |
|---|---|---|---|
| 500 | 70,113 | 1,903.7 | 40.5 |
| 1,000 | 35,058 | 944.8 | 26.5 |
| 5,000 | 7,017 | 1,110.2 | 15.7 |

4. CONCLUSION

A new technique for visualization of metagenomic data without prior alignment to a reference database has been introduced in this paper. The method first transforms the short genomic fragments into a numerical representation in the form of the phase of a complex number, and then three Hjorth’s descriptors are extracted from each of the signals. The result is a visualization of the whole metagenomic dataset in three-dimensional space, where genomic fragments from the same organism cluster together. Unlike the previous technique, the introduced method is non-stochastic and therefore reproducible. Furthermore, the Hjorth’s descriptors are computed from the variances of the signal and its first and second derivatives, which results in linear computational complexity and allows almost immediate analysis on a standard desktop or laptop computer.

REFERENCES

[1] Laczny, C. C., Pinel, N., Vlassis, N. and Wilmes, P., 2014, Alignment-free Visualization of Metagenomic Data by Nonlinear Dimension Reduction. Sci. Rep. 2014. Vol. 4. DOI 10.1038/srep04516. Nature Publishing Group

[2] Cristea, P. D., 2003, Large scale features in DNA genomic signals. Signal Processing. 2003. Vol. 83, no. 4, p. 871-888. DOI 10.1016/s0165-1684(02)00477-2. Elsevier BV

[3] Oh, S.-H., Lee, Y.-R. and Kim, H.-N., 2014, A Novel EEG Feature Extraction Method Using Hjorth Parameter. IJEEE. 2014. P. 106-110. DOI 10.12720/ijeee.2.2.106-110. EJournal Publishing
