• Nebyly nalezeny žádné výsledky

Reconstructing Phylogenetic Networks

N/A
N/A
Protected

Academic year: 2022

Podíl "Reconstructing Phylogenetic Networks"

Copied!
44
0
0

Načítání.... (zobrazit plný text nyní)

Fulltext

(1)

Reconstructing Phylogenetic Networks

Mareike Fischer, Leo van Iersel, Steven Kelk, Nela Leki´c, Simone Linz, Celine Scornavacca, Leen Stougie

Centrum Wiskunde & Informatica (CWI) Amsterdam

MCW Prague, 30 July 2013

(2)

Definition

Let X be a finite set. A (rooted) phylogenetic treeonX is a rooted tree with no indegree-1 outdegree-1 vertices whose leaves are bijectively labelled by the elements of X.

Kiwi

(New Zealand) Cassowary

(New Guinea + Australia)

Emu

(Australia)

Ostrich

(Africa)

Moa

(New Zealand) 82 Mya

76 Mya 68 Mya

35 Mya

(3)
(4)
(5)
(6)

W.F. Doolittle et al. (2000)

(7)

Definition

Let X be a finite set. A(rooted) phylogenetic network onX is a rooted directed acyclic graph with no indegree-1 outdegree-1 vertices whose leaves are bijectively labelled by the elements of X.

(8)

The first phylogenetic network (Buffon, 1755)

(9)

Phylogenetic network for humans (Reich et al., 2011)

Definition

A reticulationis a vertex with indegree at least 2.

(10)

Tree-based methods

1 Compute trees from DNA sequences.

I Different parts of DNA might give different trees.

2 Try to induce a phylogenetic network from the trees.

Definition

A phylogenetic treeT is displayedby a phylogenetic network N if T can be obtained from a subgraph ofN by contracting edges.

(11)

Example: tree T is displayed by network N

a

b c d e f

g

h a

b c d e f

g h

N T

(12)

The other binary tree T

0

displayed by network N

a

b c d e f

g

h a

b c d e f

g h

N T¶

(13)

Challenge: try to reconstruct the network from the trees

a

b c d e f

g h a

b c d e f g h

a

b c d e f

g h

(14)

Definition

The reticulation numberof a phylogenetic network N is X

v∈V\{root}

d(v)−1.

Problem

Minimum Reticulation

Instance: phylogenetic trees T1,T2

Solution: phylogenetic network that displays T1 and T2 Minimize: reticulation number of the network.

Theorem

There exists a constant factor approximation algorithm for Minimum Reticulation if and only if there exists a constant factor approximation algorithm for Directed Feedback Vertex Set.

Open question: how to handle more than two trees (efficiently)?

(15)

Reconstructing phylogenetic networks

Tree-based methods

1 Construct trees from DNA sequences.

2 Find a network that displays the trees and has minimum reticulation number.

Sequence-based methods

I Find a network directly from the DNA sequences.

I OptimizeParsimonyor Likelihood score of network.

(16)

Maximum Parsimony for trees

Small parsimony problem: given a tree and a sequence for each leaf, assign sequences to the internal vertices in order to minimizethe total number of mutations.

ACCTG ATCTG ATCTC GTAAA TTACT Example input

(17)

Maximum Parsimony for trees

Small parsimony problem: given a tree and a sequence for each leaf, assign sequences to the internal vertices in order to minimizethe total number of mutations.

ACCTG ATCTG ATCTG

ATCTA

ATCTC TTCTA

TTAAA

GTAAA TTACT

Example labelling of internal vertices

(18)

Maximum Parsimony for trees

Small parsimony problem: given a tree and a sequence for each leaf, assign sequences to the internal vertices in order to minimizethe total number of mutations.

ACCTG ATCTG ATCTG

ATCTA

ATCTC

TTCTA

TTAAA

GTAAA TTACT

1

Example of one mutation

(19)

Maximum Parsimony for trees

Small parsimony problem: given a tree and a sequence for each leaf, assign sequences to the internal vertices in order to minimizethe total number of mutations.

ACCTG ATCTG ATCTG

ATCTA

ATCTC

TTCTA

TTAAA

GTAAA TTACT 1

1 1

2

1 2

1

All 9 mutations.

(20)

Maximum Parsimony for trees

Small parsimony problem: given a tree and a sequence for each leaf, assign sequences to the interior vertices in order to minimizethe total number of mutations.

Polynomial-time solvable:

I Consider each character separately.

I Use dynamic programming (Fitch, 1971).

Two possible extensions to networks:

I hardwired

I softwired

(21)

Hardwired Maximum Parsimony on Networks

Ap-statecharacteron X is a functionα:X → {1, . . . ,p}.

Thechange cτ(e) on edge e= (u,v) w.r.t. a p-state characterτ onV(N) is defined as:

cτ(e) =

0 ifτ(u) =τ(v) 1 ifτ(u)6=τ(v).

Thehardwired parsimony score of a phylogenetic network N and p-state characterα is given by

PShw(N, α) = min

τ

X

e∈E(N)

cτ(e),

where the minimum is taken over all p-state characters τ onV(N) that extend α.

(22)

Example input: (N , α)

2

1 2 2 1 3

1

1

(23)

A 3-state character τ on V (N ) that extends α.

2

1 2 2 1 3

1 1

1 1

1 1

2

2 2 2

1

(24)

PS

hw

(N, α) = 4

2

1 2 2 1 3

1 1

1 1

1 1

2

2 2 2

1

(25)

Softwired Maximum Parsimony on Networks

The softwired parsimony scoreof a phylogenetic network N andp-state characterα is given by

PSsw(N, α) = min

T∈T(N)PS(T, α), where T(N) is the set of trees on X displayed by N.

(26)

One of the two trees on X displayed by the network

2

1 2 2 1 3

1

1

(27)

A 3-state character τ on V (T ) that extends α.

2

1 2 2 1 3

1 1

1 1

1 1

2

2 2 2

1

(28)

There are 3 changes

2

1 2 2 1 3

1 1

1 1

1 1

2

2 2 2

1

(29)

The other tree needs 4 changes

2

1 2 2 1 3

1 1

1 1

1 1

2

2 2 2

1

The minimum over the two trees is 3, so PSsw(N, α) = 3.

(30)

Proposition

PShw(N, α) is not an o(n)-approximation of PSsw(N, α).

1 0 0

0 0 0 0 0

1 1 1

(31)

Softwired Parsimony Score

1 0 0

0 0 0 0 0

1 1 1

PSsw(N, α) = 2

(32)

Hardwired Parsimony Score

1 0 0

0 0 0 0 0

1 1 1

PShw(N, α) = 4 =r+ 1 with r the number of reticulations.

(33)

Proposition

Let G be the graph obtained from network N by merging all leaves x with α(x) =i into a single node γi, for i = 1, . . . ,p.

Then, PShw(N, α) equals the size of a minimum multiterminal cutin G with terminals γ1, . . . , γp.

Corollary

Computing the hardwired parsimony score of a phylogenetic network and a binary character is polynomial-time solvable.

Corollary

Computing the hardwired parsimony score of a phylogenetic network and a p-state character, for p ≥3, is NP-hard and APX-hard but fixed-parameter tractable (FPT) in the parsimony score, and there exists a polynomial-time 1.3438-approximation for all p and a 1211-approximation for p= 3.

(34)

Example

1 0 0

0 0 0 0 0

1 1 1

(35)

Merge 0-leaves and 1-leaves

0 1

(36)

Hardwired Parsimony Score is 4

0 1

(37)

Observation

There exists a (trivial)|X|-approximation for computing the softwired parsimony score of a phylogenetic network.

Theorem

For every constant >0 there is no polynomial-time approximation algorithmthat approximates PSsw(N, α) to a factor |X|1−, for a phylogenetic network N and a binary character α, unless P = NP.

Definition

A phylogenetic network isbinary if the root has outdegree 2 and all other vertices have total degree 1 or 3.

Theorem

For every constant >0 there is no polynomial-time approximation algorithm that approximates PSsw(N, α) to a factor |X|13, for abinary phylogenetic network N and a binary character α, unless P = NP.

(38)

Proof: reduction from 3SAT

Ú

s1 s2

1 1 0

0

... ...

x :x

1 1 0

0

... ...

y :y

1 1 0

... ...

:z z

0

f(n,Ï) copies

f(n,Ï) copies variable

gadget

... f(n,Ï) ...

copies

x _ :y

1 1 1 1

:x _ y _z f(n,Ï) copies clause

gadget

(39)

Binary case

x :x y :y z :z

x _ y

1 1 1 1

x _ :y _z x _ :z :x _ :z

dx1

dx2

dx3

dy1

dz1

Îz1 Îz2 Îy1

Îx1

0

0 0

0 0

0

(40)

Theorem

There is no FPT algorithm for computing the softwired parsimony score, with the score as parameter, unless P = NP.

Definition

A phylogenetic network islevel-k if each biconnected component has reticulation number at most k.

Theorem

There is an FPT algorithm for computing the softwired parsimony score, with the level of the network as parameter.

(41)

ILP for softwired parsimony score

minX

e∈E

ce s.t. X

s∈P

xv,s = 1 for allvV

cexu,sxv,s(1ye) for alle= (u,v)E,s∈ P cexv,sxu,s(1ye) for alle= (u,v)E,s∈ P

X

v:(v,r)∈E

y(v,r)= 1 for each reticulationr

ye= 1 for each non-reticulate edgee

xv,α(v)= 1 for each leafv

ce,ye∈ {0,1} for alleE xv,s ∈ {0,1} for allvV,s∈ P

withP ={1, . . . ,p} andα(v) the given character state of a leaf v.

(42)

Both parsimony scores can be computed quickly using ILP

|X| Avg. Average computation time (s)

num. of Hardwired PS Softwired PS

retic. 2-state 3-state 4-state 2-state 3-state 4-state

50 17.0 0.0 0.0 0.1 0.1 0.1 0.3

100 37.0 0.0 0.0 0.2 0.0 0.1 0.6

150 54.1 0.0 0.1 0.6 0.1 0.2 0.8

200 72.8 0.0 0.1 1.1 0.1 0.4 1.4

250 91.3 0.0 0.1 3.5 0.1 0.4 2.2

300 112.6 0.0 0.2 5.2 0.1 0.6 3.7

(43)

Future Work

Are there approximation or FPT algorithms for computing the softwired parsimony score of restricted classesof networks?

How to searchfor an optimalnetwork?

What if the different characters are not independent?

(44)

Thanks

Mareike Fischer (Greifswald) Steven Kelk (Maastricht) Celine Scornavacca (Montpellier)

Simone Linz (Christchurch) Leen Stougie (Amsterdam) Nela Leki´c (Maastricht)

Odkazy

Související dokumenty

The change in the formulation of policies of Mexico and the US responds to the protection of their national interests concerning their security, above the

Master Thesis Topic: Analysis of the Evolution of Migration Policies in Mexico and the United States, from Development to Containment: A Review of Migrant Caravans from the

The submitted thesis titled „Analysis of the Evolution of Migration Policies in Mexico and the United States, from Development to Containment: A Review of Migrant Caravans from

c) In order to maintain the operation of the faculty, the employees of the study department will be allowed to enter the premises every Monday and Thursday and to stay only for

The order is given by a score (floating number value) associated with each element (from the smallest to the greatest score).

A phylogenetic tree of parabasalids, based on the determined SSU rDNA sequences and available sequences from databases (fig. 1), was constructed by maximum likelihood methods and

The work was first initiated for the Thue-Morse sequence, and more recently the case of other sequences (variations of the Baum-Sweet sequence, variations of the Rudin-Shapiro

The data for lymph nodes and small intestinal PPs were pooled and averaged as each mouse could have a different number of MLNs and PPs, but the overal number of cells, especially for