Bioinformatics Algorithms

Academic year: 2022


(1) Bioinformatics Algorithms: Patterns, Profiles and Motifs. RNDr. David Hoksza, Ph.D. http://siret.cz/hoksza

(2) Outline
• Motivation
• Consensus sequences
• Position specific scoring matrices
• Hidden Markov Models
• Protein family databases
Credits: based on the EMBnet course “An introduction to Patterns, Profiles, HMMs and PSI-BLAST”.

(3) Motivation
• An MSA contains conserved regions corresponding to
  o signals (promoters, …)
  o common structural motifs
  o chemical reactivity (active sites, …)
• When encountering a new sequence, one is interested in relating it to known sequences
  o describing a set of sequences
  o assigning the new sequence to a set of sequences
  o scoring the assignment
• Models of conserved regions
  o consensus sequence
  o patterns
  o position specific scoring matrix (PSSM)
  o Hidden Markov Models (HMM)

(4) Consensus Sequence
• The simplest method to build a model from a multiple sequence alignment
• Principle
  o majority wins
  o columns with too much variation are skipped
• Algorithm
  1. Count the symbol distribution in each column independently.
  2. For each column with a clear majority of one symbol, place that symbol at the respective position of the consensus sequence.
  3. Fill the remaining positions with the * symbol.

(5) Example MSA (5 sequences × 15 positions):
  GHEGVGKVVKLGAGA
  GHEKKGYFEDRGPSA
  GHEGYGGRSRGGGYS
  GHEFEGPKGCGALYI
  GHELRGTTFMPALEC
Consensus sequence: GHE**G*****G*** — to be used to scan a sequence database.
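The three algorithm steps can be sketched in Python on this example alignment. The “clear majority” threshold is not quantified on the slide; a strict majority (> 50% of the column) is assumed here, which reproduces the consensus shown above:

```python
from collections import Counter

MSA = ["GHEGVGKVVKLGAGA",
       "GHEKKGYFEDRGPSA",
       "GHEGYGGRSRGGGYS",
       "GHEFEGPKGCGALYI",
       "GHELRGTTFMPALEC"]

def consensus(msa, majority=0.5):
    """Majority-wins consensus; columns without a clear majority get '*'."""
    out = []
    for col in zip(*msa):                 # iterate column by column
        symbol, count = Counter(col).most_common(1)[0]
        out.append(symbol if count / len(msa) > majority else '*')
    return ''.join(out)

print(consensus(MSA))  # GHE**G*****G***
```

Position 12 illustrates the threshold: G occurs in 3 of 5 sequences (60%), so it wins, while position 4 (G in only 2 of 5) falls back to `*`.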

(6) Consensus Sequence – Pros & Cons
Pros
• Simple
• Easy to implement
Cons
• Symbol distribution is not captured in the resulting sequence
• Highly dependent on the training set
• Binary
  o only information whether a query sequence matches the consensus, not how well

(7) Pattern
• Regular expressions for biological sequences
  o describe a set of sequences within one expression
• Prosite syntax
  o IUPAC one-letter codes
  o neighboring residues delimited by a ‘-’
  o ‘x’ is treated as a wildcard character
  o any of the symbols between [] can be used at that position
    • [AG] … alanine or glycine
  o any of the symbols between {} can not be used at that position
    • {AG} … anything except alanine or glycine
  o () … repetitions
    • [AG](2) … 2 repetitions of alanine or glycine
    • x(3,5) … 3 to 5 repetitions of any letter
  o a repetition range is allowed only with ‘x’, i.e., A(2,4) is not a valid pattern element
  o a pattern restricted to the N- or C-terminus of a sequence starts with a ‘<’ symbol or ends with a ‘>’ symbol, respectively

(8) Pattern – Example
<A-x-[ST](2)-x(0,1)-{V}
• an alanine at the N-terminus
• followed by any amino acid
• followed by a serine or threonine, twice
• optionally followed by any residue
• followed by any amino acid except valine
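The Prosite syntax maps almost one-to-one onto ordinary regular expressions, which is a convenient way to test a pattern against sequences. A minimal converter sketch covering only the constructs listed on the previous slide (the function name is an invention for illustration; real Prosite patterns have more syntax than this):

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a (subset of the) Prosite pattern syntax into a Python regex."""
    start = pattern.startswith('<')       # '<' anchors to the N-terminus
    end = pattern.endswith('>')           # '>' anchors to the C-terminus
    body = pattern.strip('<>')
    parts = []
    for element in body.split('-'):       # residues are delimited by '-'
        m = re.fullmatch(r'(.+?)(?:\((\d+)(?:,(\d+))?\))?', element)
        core, lo, hi = m.groups()
        if core in ('x', 'X'):
            parts.append('.')                         # wildcard
        elif core.startswith('{'):
            parts.append('[^' + core[1:-1] + ']')     # forbidden residues
        else:
            parts.append(core)                        # literal or [..] class
        if hi:
            parts.append('{%s,%s}' % (lo, hi))        # range repetition
        elif lo:
            parts.append('{%s}' % lo)                 # fixed repetition
    return ('^' if start else '') + ''.join(parts) + ('$' if end else '')

rx = prosite_to_regex('<A-x-[ST](2)-x(0,1)-{V}')
print(rx)                           # ^A.[ST]{2}.{0,1}[^V]
print(bool(re.match(rx, 'AGSTL')))  # True
print(bool(re.match(rx, 'AGSTV')))  # False (ends with valine)
```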

(9) Pattern – Example (cont.)
• http://www.ibiblio.org/pub/academic/biology/molbio/data/prosite/prosite.lis
• Post-translational signatures
  o cAMP- and cGMP-dependent protein kinase phosphorylation site
    • [RK](2)-x-[ST]
  o Tyrosine kinase phosphorylation site
    • [RK]-x(2)-[DE]-x(3)-Y or [RK]-x(3)-[DE]-x(2)-Y
• Enzymes
  o Peroxidases proximal heme-ligand signature
    • [DET]-[LIVMTA]-{NSYL}-{RPFC}-[LIVM]-[LIVMSTAG]-[SAG]-[LIVMSTAG]-H-[STA]-[LIVMFY]
• Receptors
  o G-protein coupled receptors family 1 signature
    • [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x-{PQ}-[LIVMNQGA]-{RK}-{RK}-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-{PE}-x-[LIVM]

(10) The same example MSA:
  GHEGVGKVVKLGAGA
  GHEKKGYFEDRGPSA
  GHEGYGGRSRGGGYS
  GHEFEGPKGCGALYI
  GHELRGTTFMPALEC
Pattern: G-H-E-x(2)-G-x(5)-[GA]-x(3) — to be used to scan a sequence database.

(11) Patterns – Pros & Cons
Pros
• Easy to implement
• Easy to understand for anyone
• Able to express the motif better than a consensus sequence
Cons
• Symbol distribution is not captured in the resulting pattern
• Highly dependent on the training set
• Small patterns generate a lot of hits
  o possible false positives
• Binary
  o only information whether a query sequence matches the pattern, not how well

(12) Patterns – Exercise
• Build a pattern for:
  WFFKGIADKDAERHLLA
  WFFKNLEQKDAEARLLA
  WFFKR---KDAERQLLA
  WFFGTI---DAERQLLA
  WFFKDIPTKDAERQLLA
  WYFG----RESERLLLA
  WYFGKIPLKDAERQLLA
  WYFGKLRAKDTERLLLL

(13) Position Specific Scoring Matrix
• A Position Specific Scoring Matrix (PSSM) expresses the likelihood of a letter appearing at a given position
  o a symbols × positions matrix
• Based on counts of letters at the positions, e.g. for the example alignment from slide 5:
  o col 1: f_{A,1} = 0/5, …, f_{G,1} = 5/5, …
  o col 2: f_{A,2} = 0/5, …, f_{H,2} = 5/5, …
  o col 4: f_{A,4} = 0/5, …, f_{F,4} = 1/5, f_{G,4} = 2/5, …
(The slide shows the full 20 × 15 count matrix for this alignment.)

(14) PSSM – Pseudo-Counts
• A small training set implies zero values in the counts matrix
• But the probability of occurrence of any symbol is never null
• Pseudo-counts
  o add small values for non-observed frequencies to all frequencies (both observed and non-observed)
  o with pseudo-count 1 (and 20 symbols):
    col 1: f_{A,1} = (0+1)/(5+20) = 0.04, …, f_{G,1} = (5+1)/(5+20) = 0.24, …
    col 2: f_{A,2} = (0+1)/(5+20) = 0.04, …, f_{H,2} = (5+1)/(5+20) = 0.24, …
    col 4: f_{A,4} = (0+1)/(5+20) = 0.04, …, f_{F,4} = (1+1)/(5+20) = 0.08, f_{G,4} = (2+1)/(5+20) = 0.12, …

(15) PSSM – Computation
• The resulting score for residue i at position j is computed as a log-likelihood ratio against the null model (each amino acid is observed with an identical frequency in a random sequence):
  s_{i,j} = log(f'_{i,j} / q_i)
• f'_{i,j} … pseudo-count modified observed frequency
• q_i … expected frequency of residue i in a random sequence
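Putting slides 13–15 together, a sketch that computes the pseudo-count frequencies and log-odds scores for the example alignment. The base-2 logarithm is an assumption inferred from the score values on the next slide (a fully conserved column scores ≈ log2(4.8) ≈ 2.3):

```python
import math

MSA = ["GHEGVGKVVKLGAGA",
       "GHEKKGYFEDRGPSA",
       "GHEGYGGRSRGGGYS",
       "GHEFEGPKGCGALYI",
       "GHELRGTTFMPALEC"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pssm(msa, alphabet=ALPHABET, pseudo=1):
    n = len(msa)
    q = 1 / len(alphabet)                  # null model: uniform residue frequency
    matrix = []
    for col in zip(*msa):
        scores = {}
        for aa in alphabet:
            # pseudo-count corrected frequency f'_{i,j}
            f = (col.count(aa) + pseudo) / (n + pseudo * len(alphabet))
            scores[aa] = math.log2(f / q)  # s_{i,j} = log(f'_{i,j} / q_i)
        matrix.append(scores)
    return matrix

m = pssm(MSA)
print(round(m[0]['G'], 2))   # fully conserved G at position 1: log2(0.24/0.05)
print(round(m[3]['G'], 2))   # count 2 at position 4: log2(0.12/0.05)
```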

(16) PSSM – Result
The resulting 20 × 15 log-odds matrix for the example alignment: unobserved residues score −0.2, a single observation scores 0.7, two observations 1.3, three 1.7, and fully conserved columns (G at position 1, H at position 2, E at position 3) score 2.3.

(17) PSSM – Querying
• The matrix is used as a sliding window moved across the query sequence
• The PSSM score of a window sums up the scores of its columns
• The position with the highest PSSM score is reported
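A sketch of the sliding-window scan. The toy 3-column DNA matrix and its scores are invented for illustration; in practice the matrix would come from a PSSM computation like the one on the previous slides:

```python
# Hypothetical 3-column PSSM over a DNA alphabet; each dict is one column
# of log-odds scores (strongly preferring the window 'GAT').
PSSM = [{'A': -0.2, 'C': -0.2, 'G': 2.3, 'T': -0.2},
        {'A': 2.3, 'C': -0.2, 'G': -0.2, 'T': -0.2},
        {'A': -0.2, 'C': -0.2, 'G': -0.2, 'T': 2.3}]

def best_hit(seq, pssm):
    """Slide the matrix along seq; report the highest-scoring window start."""
    width = len(pssm)
    best_pos, best_score = None, float('-inf')
    for i in range(len(seq) - width + 1):
        score = sum(col[seq[i + j]] for j, col in enumerate(pssm))
        if score > best_score:
            best_pos, best_score = i, score
    return best_pos, best_score

pos, score = best_hit("CCGATCC", PSSM)
print(pos)   # 2 -> the window 'GAT'
```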

(18) PSSM – Weighting
• Highly populated families can contain big subfamilies which can negatively influence the results
• Sequence weighting compensates for the sampling bias

(19) PSSM – Pros & Cons
Pros
• Relatively fast
• Querying is simple to implement
• Match scores are statistically interpretable
Cons
• No insertions or deletions
  o constant-length regions only

(20) PSI-BLAST
• Position Specific Iterated BLAST
  o establishes profiles
  o uses the profiles to search a sequence database
• Algorithm
  1. Search the database using BLASTP.
  2. Collect high scoring results and build an MSA.
  3. Get a PSSM from the MSA.
  4. Use the profile from the PSSM to search against the database using BLASTP.
  5. If new hits are identified, add them to the MSA and update the profile.
  6. Repeat steps 4 and 5 until stabilization.
(Slide diagram: query sequence → BLAST → homologs → MSA → profile → BLAST → additional homologs → extended profile → new profile.)

(21) PSI-BLAST – Pros & Cons
Pros
• Capable of identifying up to three times more homologues below 30% sequence identity than BLAST
• Fast, because it uses the BLAST heuristics
• Allows using PSSMs on large databases
Cons
• Profile drift
  o high sensitivity → false positives → biased profile → incorporation in subsequent cycles

(22) Operating Instructions
• Consensus sequences
  o to find highly conserved signatures, for example enzyme restriction sites in DNA
• Patterns
  o to search for small signatures or active sites
  o to communicate with other biologists
• PSSM
  o to model small regions with high variability but constant length

(23) Markov Chains
• A Markov chain is a succession of states S_i (i = 0, 1, …) connected by transitions; a transition from S_i to S_j has a probability P_ij
• Markov property
  o the next state of a Markov chain depends only on the current state and not on the sequence of states leading to it
  o P(S_{i_n} | S_{i_1}, S_{i_2}, …, S_{i_{n-1}}) = P(S_{i_n} | S_{i_{n-1}})
• A Markov model contains
  o transition probabilities: a_ij = P(S_i | S_j)
  o initial probabilities: π_i = P(S_i)
• Example: traffic lights
  o states: red, orange, green
  o transition probabilities: P(green→orange) = 1, P(orange→red) = 1, P(red→green) = 1
• Example: weather
  o states: sun, cloud, rain
  o transition probabilities (rows: weather yesterday, columns: weather today):
           sun    cloud  rain
    sun    0.5    0.25   0.25
    cloud  0.375  0.125  0.375
    rain   0.135  0.625  0.375
• Exercise – draw the diagram

(24) Markov Chain Sequence Probability
P(S_{i_1}, S_{i_2}, …, S_{i_n})
  = P(S_{i_n} | S_{i_1}, S_{i_2}, …, S_{i_{n-1}}) · P(S_{i_1}, S_{i_2}, …, S_{i_{n-1}})
  = P(S_{i_n} | S_{i_{n-1}}) · P(S_{i_1}, S_{i_2}, …, S_{i_{n-1}})
  = …
  = P(S_{i_n} | S_{i_{n-1}}) · P(S_{i_{n-1}} | S_{i_{n-2}}) · … · P(S_{i_2} | S_{i_1}) · P(S_{i_1})
• Probability of the sequence {‘sun’, ‘sun’, ‘rain’, ‘cloud’}
  o initial probabilities: P(‘sun’) = 0.5, P(‘cloud’) = 0.4, P(‘rain’) = 0.1
  o P(‘sun’, ‘sun’, ‘rain’, ‘cloud’) = P(‘cloud’ | ‘rain’) · P(‘rain’ | ‘sun’) · P(‘sun’ | ‘sun’) · P(‘sun’) = 0.625 · 0.25 · 0.5 · 0.5
(transition probabilities taken from the weather table on the previous slide)
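The chain rule above, applied in Python to the weather example (initial and transition probabilities copied from the slides as given):

```python
# Rows: weather yesterday; columns: weather today (values from the slide).
TRANS = {'sun':   {'sun': 0.5,   'cloud': 0.25,  'rain': 0.25},
         'cloud': {'sun': 0.375, 'cloud': 0.125, 'rain': 0.375},
         'rain':  {'sun': 0.135, 'cloud': 0.625, 'rain': 0.375}}
INIT = {'sun': 0.5, 'cloud': 0.4, 'rain': 0.1}

def sequence_probability(states):
    """P(S_1) * P(S_2|S_1) * ... * P(S_n|S_{n-1})"""
    p = INIT[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= TRANS[prev][cur]   # Markov property: only the previous state matters
    return p

print(sequence_probability(['sun', 'sun', 'rain', 'cloud']))  # 0.0390625
```

The result 0.0390625 is exactly 0.5 · 0.5 · 0.25 · 0.625, the product from the slide.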

(25) Hidden Markov Models
• A Hidden Markov Model (HMM) is a generalization of Markov models where the system is a Markov process passing through hidden states
• States are not visible, but each state generates (emits) one of M observations (O_1, …, O_M) with a given probability
• An HMM is defined as M(S, O, π) where
  o S = matrix of transition probabilities a_ij = P(S_i | S_j)
  o O = matrix of observation probabilities b_im = P(O_m | S_i)
  o π = vector of initial probabilities π_i = P(S_i)

(26) HMM – Example
(Slide diagram: two hidden states ‘Low’ and ‘High’ with transitions between and within them, each emitting ‘Rain’ or ‘Dry’.)
• States: ‘Low’ pressure, ‘High’ pressure
• Observations: ‘Rain’, ‘Dry’
• Transition probabilities: P(‘Low’|‘Low’) = 0.3, P(‘High’|‘Low’) = 0.7, P(‘Low’|‘High’) = 0.2, P(‘High’|‘High’) = 0.8
• Observation/emission probabilities: P(‘Rain’|‘Low’) = 0.6, P(‘Dry’|‘Low’) = 0.4, P(‘Rain’|‘High’) = 0.4, P(‘Dry’|‘High’) = 0.6
• Initial probabilities: P(‘Low’) = 0.4, P(‘High’) = 0.6
  o often two special states are added to represent start and end, where start is connected to the rest of the graph using the initial probabilities

(27) Observation Sequence Probability
• A sequence of observations can be obtained (explained) in multiple ways, with different probabilities
• To calculate a probability for the observation sequence {‘Dry’, ‘Rain’}, we can explain it, e.g., by the state path {‘Low’, ‘Low’}:
  P({‘Dry’, ‘Rain’}, {‘Low’, ‘Low’})
    = P({‘Dry’, ‘Rain’} | {‘Low’, ‘Low’}) · P({‘Low’, ‘Low’})
    = P(‘Dry’|‘Low’) · P(‘Rain’|‘Low’) · P(‘Low’|‘Low’) · P(‘Low’)
    = 0.4 · 0.6 · 0.3 · 0.4
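The single-path explanation above, computed with the pressure/weather HMM parameters from the example slide:

```python
TRANS = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
EMIT  = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}
INIT  = {'Low': 0.4, 'High': 0.6}

def joint_probability(path, observations):
    """P(observations, path) for one particular hidden-state path."""
    p = INIT[path[0]] * EMIT[path[0]][observations[0]]
    for i in range(1, len(path)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][observations[i]]
    return p

p = joint_probability(['Low', 'Low'], ['Dry', 'Rain'])
print(round(p, 4))   # 0.0288 = 0.4 * 0.4 * 0.3 * 0.6
```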

(28) Profile HMMs
• An HMM can capture the information contained in an MSA → an alternative to the PSSM → profile HMM
• If we have a profile P and align a sequence s to it, at each step i we can either
  o match the i-th letter of s to P – M_i
  o add a gap to s (the corresponding letter in s will be matched with some later position in P) – D_i
  o add a gap to the profile, i.e., align the given position of s with a gap in P – I_i
• M_i and I_i are emitting states of the HMM: they emit letters of the query sequence with given probabilities (learned from an MSA)
• A path through the HMM shows how a sequence could be aligned to the profile and, moreover, gives a score reflecting the probability with which such an alignment could happen

(29) Training an HMM from an MSA.

(30) Matching.

(31) Matching (cont.).

(32) Viterbi Algorithm
• Any path through a model emits a sequence with an associated probability (the product of all the transition and emission probabilities)
• Many paths through the HMM can lead to the same emitted sequence → different alignments to the profile → we search for the most probable path (analogous to the best scoring alignment) → Viterbi algorithm
  o t(M_u, M_{u+1}) … transition probability from M_u to M_{u+1}
  o x = x_1, x_2, …, x_L … emitted sequence
  o e_{M_u}(x_i) … the emission probability for residue x_i from match state M_u, based on how often the residue matches the profile column in the training alignment
  o insert states use a random background model → p_{x_i} is the probability of x_i from the overall amino acid composition
• Recurrences:
  v_{M_u}(x_i) = e_{M_u}(x_i) · max { v_{M_{u-1}}(x_{i-1}) · t(M_{u-1}, M_u),
                                      v_{I_{u-1}}(x_{i-1}) · t(I_{u-1}, M_u),
                                      v_{D_{u-1}}(x_{i-1}) · t(D_{u-1}, M_u) }
  v_{I_u}(x_i) = p_{x_i} · max { v_{M_u}(x_{i-1}) · t(M_u, I_u),
                                 v_{I_u}(x_{i-1}) · t(I_u, I_u) }
  v_{D_u}(x_i) = max { v_{M_{u-1}}(x_i) · t(M_{u-1}, D_u),
                       v_{D_{u-1}}(x_i) · t(D_{u-1}, D_u) }
  initialization: v_start(0) = 1, v_u(0) = 0
  o transitions between I and D states are usually not considered
  o usually log-odds scores are used, since raw probabilities lead to very small values
  o v_end(x_L) … log-odds score of the best path
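The recurrences above are specific to profile HMMs; the dynamic-programming idea is the same for any HMM. A generic Viterbi sketch, run on the small pressure/weather HMM from slide 26 rather than a profile, and using raw probabilities instead of log-odds for readability:

```python
def viterbi(observations, states, init, trans, emit):
    """Most probable hidden-state path explaining a sequence of observations."""
    v = {s: init[s] * emit[s][observations[0]] for s in states}
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        v_new, path_new = {}, {}
        for s in states:
            # best predecessor state for s (the max in the recurrence)
            prev = max(states, key=lambda r: v[r] * trans[r][s])
            v_new[s] = v[prev] * trans[prev][s] * emit[s][obs]
            path_new[s] = path[prev] + [s]
        v, path = v_new, path_new
    best = max(states, key=lambda s: v[s])
    return path[best], v[best]

TRANS = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
EMIT  = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}
INIT  = {'Low': 0.4, 'High': 0.6}

path, prob = viterbi(['Dry', 'Rain'], ['Low', 'High'], INIT, TRANS, EMIT)
print(path)   # ['High', 'High']
```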

(33) Forward Algorithm
• One emitted sequence can be obtained by many paths; summing the probabilities of all these paths gives the probability of the sequence being emitted by the HMM
  o f_{M_u}(x_i) … the total probability at state M_u when the sequence up to and including residue x_i has been emitted
• Recurrences:
  f_{M_u}(x_i) = e_{M_u}(x_i) · [ f_{M_{u-1}}(x_{i-1}) · t(M_{u-1}, M_u)
                                 + f_{I_{u-1}}(x_{i-1}) · t(I_{u-1}, M_u)
                                 + f_{D_{u-1}}(x_{i-1}) · t(D_{u-1}, M_u) ]
  f_{I_u}(x_i) = p_{x_i} · [ f_{M_u}(x_{i-1}) · t(M_u, I_u) + f_{I_u}(x_{i-1}) · t(I_u, I_u) ]
  f_{D_u}(x_i) = f_{M_{u-1}}(x_i) · t(M_{u-1}, D_u) + f_{D_{u-1}}(x_i) · t(D_{u-1}, D_u)
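Again sketched for a generic HMM rather than a profile, the forward algorithm replaces the maximum in Viterbi with a sum over predecessors (same pressure/weather model as on slide 26):

```python
def forward(observations, states, init, trans, emit):
    """Total probability that the model emits the observation sequence."""
    f = {s: init[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # sum over all predecessor states instead of taking the maximum
        f = {s: emit[s][obs] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())

TRANS = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
EMIT  = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}
INIT  = {'Low': 0.4, 'High': 0.6}

print(round(forward(['Dry', 'Rain'], ['Low', 'High'], INIT, TRANS, EMIT), 3))
# 0.232 -> the sum over all four state paths for {'Dry', 'Rain'}
```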

(34) Protein Family Databases
• There exist many databases of MSAs and related models
  o consensus sequences
  o patterns
  o HMMs
  o …
• Some databases contain multiple representations of families

(35) Prosite
• http://www.expasy.ch/prosite
• Collection of motifs, protein domains, families and functional sites
• Uses generalized profiles (Pftools) and patterns
  o patterns usually have 10–20 AA
• Patterns contain
  o a quality estimation obtained by counting true positives, false negatives and false positives in SWISS-PROT
  o taxonomic range (archaea, eukaryota, …)
  o a SWISS-PROT match list
• Contains the ScanProsite tool
  o allows searching by profile and filtering by taxonomy, length, ID, …

(36) CDD
• http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• Conserved Domains Database
• Contains MSAs available as PSSMs
  o NCBI-curated domains based on 3D structure
  o imported domain models (Pfam, TIGRFAM, SMART, COG, KOG, …)
• CD-Search
  o search interface for scanning CDD against a submitted protein or nucleotide query
  o uses RPS-BLAST (a variant of PSI-BLAST)
• CDART
  o Conserved Domain Architecture Retrieval Tool
  o used to analyze domain architecture and retrieve proteins with similar architecture

(37) PRINTS
• http://bioinf.man.ac.uk/dbbrowser/PRINTS
• Collection of conserved motifs grouped into fingerprints (sets of conserved motifs used to characterize a protein family)
• Fingerprints should encode protein folds and functionalities more flexibly than single motifs can
• Similar to BLOCKS

(38) Pfam
• http://www.sanger.ac.uk/Pfam
• Collection of protein domains and families and the respective MSAs
• Uses HMMs (HMMER3 package)
• Versions
  o Pfam-A
    • manually curated
    • over 12,000,000 sequences in over 13,500 families
  o Pfam-B
    • automatically clustered and aligned sequences not covered by Pfam-A

(39) InterPro
• http://www.ebi.ac.uk/interpro/
• Combines protein signatures from a number of member databases into a single searchable resource
  o CATH/Gene3D, PANTHER, Pfam, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, TIGRFAM, …
• InterProScan
  o allows scanning of sequences against InterPro’s signatures
  o accessible also via web services

