Studying bioinformatics for the exam - Prof: Axel Mosig PDF

Title	Studying bioinformatics for the exam - Prof: Axel Mosig
Course	Bioinformatics
Institution	Ruhr-Universität Bochum

Why use a tool such as BLAST on the command line instead of running it from the web?

- More efficient
- Automatic processing of results
- Easily reproducible (by using scripts)
- Some genomes are not available for web-based BLAST. In the shell you can give any genome as input, and so on.
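To make the reproducibility point concrete, here is a minimal sketch of scripting a search from Python. It assumes the NCBI BLAST+ command-line programs are installed; the file names and database name are hypothetical placeholders, and the search itself is only run if you call `run_blast`.

```python
import subprocess

def blast_command(program, query, db, out, evalue=1e-5):
    """Build the argument list for a BLAST+ search with tabular output."""
    return [
        program,
        "-query", query,          # FASTA file with the query sequence(s)
        "-db", db,                # name of a formatted BLAST database
        "-evalue", str(evalue),   # E-value cutoff for reported hits
        "-outfmt", "6",           # tab-separated table, easy to parse
        "-out", out,              # write the results to this file
    ]

def run_blast(program, query, db, out, evalue=1e-5):
    """Run the search; raises CalledProcessError if BLAST fails."""
    subprocess.run(blast_command(program, query, db, out, evalue), check=True)

# Hypothetical file and database names, for illustration only.
print(" ".join(blast_command("blastn", "query.fa", "my_genome", "hits.tsv")))
```

Because the whole search is one function call, the same parameters can be rerun on any genome and kept under version control.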

Study each type of BLAST (blastp, blastx, blastn…).

GenBank: GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

Ensembl: The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online. You can take a GenBank or Ensembl file and open it in a genome browser such as Artemis. Ensembl has its own genome browser, and it is very popular.

Pfam: Provides peptide sequences of protein families. In particular, it provides aligned sequences. Useful for gene homology search. The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. Pfam also generates higher-level groupings of related entries, known as clans. A clan is a collection of Pfam entries which are related by similarity of sequence, structure or profile-HMM. Wikipedia: Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.

For each family in Pfam one can:

- Look at multiple alignments
- View protein domain architectures
- Examine species distribution
- Follow links to other databases
- View known protein structures

Rfam: Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database. Rfam is designed to be similar to the Pfam database for annotating protein families. Unlike proteins, ncRNAs often have similar secondary structure without sharing much similarity in the primary sequence. Rfam divides ncRNAs into families based on evolution from a common ancestor. Producing multiple sequence alignments (MSA) of these families can provide insight into their structure and function, similar to the case of protein families. These MSAs become more useful with the addition of secondary structure information. The Rfam database can be used for a variety of functions. For each ncRNA family, the interface allows users to: view and download multiple sequence alignments; read annotation; and examine species distribution of family members. There are also links provided to literature references and other RNA databases.

FASTA format: FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences (header). The format originates from the FASTA software package, but has now become a standard in the field of bioinformatics. The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Python, Ruby, and Perl.

FASTA software: DNA and protein alignments.
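As an illustration of how easy the format is to parse, here is a minimal sketch of a FASTA reader. It assumes each record starts with a `>` header line followed by one or more sequence lines; the example records are made up.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

# Made-up example records.
example = """>seq1 example nucleotide record
ACGTACGT
ACGT
>seq2 example peptide record
MKVLA
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```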

HOMOLOGY AND ALIGNMENTS

ClustalX and ClustalW are used for alignments. Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method also based on dynamic programming.
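The difference between the two dynamic programming schemes can be sketched in a few lines. This version only computes the optimal score (no traceback) and assumes a simple scoring scheme of +1 for a match, -1 for a mismatch and -1 per gap position; real tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)   # local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences; local: best cell anywhere.
    return best if local else score[-1][-1]

print(alignment_score("ACGT", "AGT"))                        # global
print(alignment_score("TTACGTT", "GGACGGG", local=True))     # local
```

The only changes needed for the local variant are the zero initialisation, the floor at 0 and taking the maximum over the whole matrix.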

Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Many variations of the Clustal progressive implementation are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. ClustalW is used extensively for phylogenetic tree construction. Multiple sequence alignments can be used to create a phylogenetic tree. This is made possible by two reasons. The first is because functional domains that are known in annotated sequences can be used for alignment in non-annotated sequences. The other is that conserved regions known to be functionally important can be found. This makes it possible for multiple sequence alignments to be used to analyze and find evolutionary relationships through homology between sequences. Point mutations and insertion or deletion events (called indels) can be detected.

BLAST (Basic Local Alignment Search Tool)  BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. BLAST uses a word method for the alignment. Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family. Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated. Different types of BLASTs are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. 
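The offset idea described above can be sketched as follows. This follows the description in these notes (non-overlapping query words, exact matches only) and is not how the BLAST program itself is implemented; the word length `k=3` and the sequences are assumptions for illustration.

```python
from collections import Counter

def word_offsets(query, subject, k=3):
    """Count, for each offset, how many query words match the subject there."""
    # Index every k-letter word of the subject by its start position.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    offsets = Counter()
    # Step through the query in non-overlapping words of length k.
    for i in range(0, len(query) - k + 1, k):
        for j in index.get(query[i:i + k], []):
            offsets[j - i] += 1   # same offset = same diagonal of the alignment
    return offsets

hits = word_offsets("GATTACAGG", "CCGATTACAGGTT")
print(hits.most_common(1))
```

An offset supported by several distinct words marks a candidate region; only such regions would then be passed on to a more sensitive alignment step.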

Nucleotide-nucleotide BLAST (BLASTN) -This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies. Use: DNA/RNA homology search.

Protein-protein BLAST (BLASTP) -This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies. Use: Finding homologs of a protein sequence.

Position-Specific Iterative BLAST (PSI-BLAST) -This program is used to find distant relatives of a protein. First, a list of all closely related proteins is created. These proteins are combined into a general "profile" sequence, which summarises significant features present in these sequences. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated. By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.

Nucleotide 6-frame translation-protein (BLASTX) -This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. Use: Identifying a gene belonging to a transcript.

Nucleotide 6-frame translation-nucleotide 6-frame translation (TBLASTX) -This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences. Use: Finding similar coding regions.

Protein-nucleotide 6-frame translation (TBLASTN) -This program compares a protein query against all six reading frames of a nucleotide sequence database. Use: Finding a known gene in a new genome.
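The "six-frame conceptual translation" used by BLASTX, TBLASTX and TBLASTN can be sketched like this. The sketch assumes the standard genetic code and an uppercase or lowercase DNA string over A, C, G, T; codons containing any other letter are written as `X`, and the frame labels (`+1` … `-3`) are a naming choice made here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for shift in range(3):
            frames[f"{strand}{shift + 1}"] = translate(seq[shift:])
    return frames

for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```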

Significance of BLAST hits: Consider a BLAST match with score S. BLAST provides scores and significance measures:

- Raw score = sum of matching scores of all matched (non-gap) positions.
- Bit-score = normalizes raw scores by the scoring system.
- E-value: indicates the number of alignments one expects to find with a score greater than or equal to the observed alignment's score in a search against a random database. The E-value is a function of the score S: the higher the score, the lower the E-value.

The higher the bit score, the better the alignment (values below 50 are generally unreliable). E < 10^-4 -> the sequences are very likely homologous. E > 1 -> the alignment is probably based on random sequence similarity.
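How the three measures relate can be sketched with the standard Karlin-Altschul relations: bit score S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m and n are the query and database lengths. The values of λ and K depend on the scoring system; the numbers used below are assumptions chosen only for illustration.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalise a raw score using the scoring-system parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance alignments scoring at least raw_score."""
    return query_len * db_len * 2 ** (-bit_score(raw_score, lam, k))

# Illustrative parameters (assumed, not taken from a particular search).
LAM, K = 0.267, 0.041
for s in (50, 100, 200):
    print(s, round(bit_score(s, LAM, K), 1), e_value(s, LAM, K, 300, 1_000_000))
```

Doubling the database size doubles the E-value, while each additional bit of score halves it.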

Clustal is a series of widely used computer programs for multiple sequence alignment. There are three main steps: 1. Do a pairwise alignment 2. Create a guide tree (or use a user-defined tree) 3. Use the guide tree to carry out a multiple alignment

Hidden Markov Model (HMM): Some parts of a gene may be better conserved than others. Conserved parts should match well; less conserved parts may not match well. This is not taken into account by BLAST.

HMM approach: Align sequences from the same gene family and build a search profile from the alignment. Then search the genome using this search profile, which "knows" what is conserved and what is not. Hidden Markov models are probabilistic models that can assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely multiple sequence alignment (MSA) or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions.

Software package: HMMER

Homology between protein or DNA sequences is defined in terms of shared ancestry. Two segments of DNA can have shared ancestry because of either a speciation event (orthologs) or a duplication event (paralogs). Homology among proteins or DNA is typically inferred from their sequence similarity. Significant similarity is strong evidence that two sequences are related by divergent evolution of a common ancestor. Alignments of multiple sequences are used to indicate which regions of each sequence are homologous. Sequence regions that are homologous are also called conserved.

Orthology: Homologous sequences are orthologous if they are inferred to be descended from the same ancestral sequence separated by a speciation event: when a species diverges into two separate species, the copies of a single gene in the two resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that originated by vertical descent from a single gene of the last common ancestor. Orthologous sequences provide useful information in taxonomic classification and phylogenetic studies of organisms. The pattern of genetic divergence can be used to trace the relatedness of organisms. Two organisms that are very closely related are likely to display very similar DNA sequences between two orthologs. Conversely, an organism that is further removed evolutionarily from another organism is likely to display a greater divergence in the sequence of the orthologs being studied.

Paralogy: Homologous sequences are paralogous if they were created by a duplication event within the genome. For gene duplication events, if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous.

Taxon: A population, or group of populations of organisms which are usually inferred to be phylogenetically related. Members of a taxon have characters in common which differentiate the unit from other units. Each phylogenetic character should be identical within all individuals represented by one taxon. Characters can be either phenotypical or genotypical (better for reconstructing phylogenies).

Phylogenetic inference

p-distance: Proportion of aligned sites at which two sequences differ. A high p-distance (close to 1) means that the sequences are very different; a p-distance close to 0 means that they are highly similar.

Given two aligned sequences of 10 nucleotides that differ at 4 positions, they are 40% different and 60% identical: Dp(Seq1,Seq2) = 0.4. Problem: p-distance is not proportional to biological time. Instead, work in terms of "genetic steps", where each step follows a statistical model. How many mutations are to be expected in a sequence after t genetic steps? A statistical model leads to a more meaningful distance: evolution rate models, such as the Jukes-Cantor model.
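The worked example can be checked in code, together with the Jukes-Cantor correction d = −(3/4) · ln(1 − 4p/3), which converts an observed p-distance into an estimated number of substitutions per site. The two sequences below are made up so that they differ at exactly 4 of 10 positions.

```python
import math

def p_distance(seq1, seq2):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)

def jukes_cantor(p):
    """Jukes-Cantor distance; undefined once p reaches 3/4 (saturation)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor model")
    return -0.75 * math.log(1 - 4 * p / 3)

# Made-up aligned sequences differing at 4 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGAACCTTG")
print(p, jukes_cantor(p))
```

The corrected distance is larger than the p-distance because the model accounts for sites that mutated more than once.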

Distance-based methods for phylogeny reconstruction 

- Use a distance matrix (e.g. p-distance) to construct a phylogenetic tree.
- Similar sequences (dp small) should be joined closely together in the tree.
- Dissimilar sequences (dp large) should be joined far apart in the tree.

Given the distance matrix:

A-F are sequences. Compute all pairwise distances. This matrix would lead to the following tree:

This is not really informative. Better -> Neighbor Joining: neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees. The algorithm requires knowledge of the distance between each pair of taxa (e.g., species or sequences) to form the tree. The genetic distances are taken into account when drawing the tree! Distances in the tree are equal to those in the distance matrix if such a tree exists.
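A compact sketch of neighbor joining may help to see how the distance matrix drives the tree. It assumes a symmetric distance matrix given as a dict of dicts; the internal node names (`U0`, `U1`, …) and the five-taxon example matrix are made up for illustration.

```python
def neighbor_joining(names, dist):
    """Return the unrooted tree as a list of (node, neighbour, branch_length)."""
    d = {a: dict(dist[a]) for a in names}   # working copy of the distances
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other current nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"U{next_id}"                    # new internal node
        next_id += 1
        len_a = d[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        edges.append((a, u, len_a))
        edges.append((b, u, d[a][b] - len_a))
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (a, b):
                d[u][k] = d[k][u] = (d[a][k] + d[b][k] - d[a][b]) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))            # join the last two nodes
    return edges

taxa = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
matrix = {x: dict(zip(taxa, row)) for x, row in zip(taxa, rows)}
for edge in neighbor_joining(taxa, matrix):
    print(edge)
```

For this example matrix an exactly fitting tree exists, so the path lengths between leaves in the result reproduce the input distances.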

Parsimony-Based Methods

Distance-based methods do not take into account characters of ancestral taxa. Parsimony methods try to reconstruct characters of ancestral taxa.

Maximum parsimony: The 'most-parsimonious' tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.

Bootstrapping: How robust is a reconstructed phylogenetic tree? To judge how reliable a branch is, assign it a value according to the share of bootstrap trees that support it (75%, 25%...). Bootstrapping is a resampling analysis: columns of characters are drawn at random from the alignment with replacement (so some columns are left out and others are repeated), the tree is rebuilt, and you test whether the same nodes are recovered. This is done through many iterations (quite often 100 or 1000). If, for example, you recover the same node in 95 of 100 iterations, then you have a good idea that the node is well supported (your bootstrap value in that case would be 0.95 or 95%). Low support suggests that only a few characters support that node, as resampling characters at random leads to a different reconstruction of that node. The bootstrap value doesn't say anything direct about evolutionary conservation: it says how well this node is supported in the model you're using to generate a phylogenetic tree. The reported value is the percentage of bootstrap replicates in which the node showed up. Thus, 100 means that the node is well supported: it showed up in all bootstrap replicates.

Maximum Likelihood methods
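The resampling step of bootstrapping can be sketched as follows. The sketch only produces the pseudo-replicate alignments (columns drawn with replacement); rebuilding a tree from each replicate and counting how often a node reappears is left to whatever tree method is in use. The toy alignment and the seed are assumptions for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # Use the same column choice for every sequence so columns stay intact.
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

# Toy alignment, made up for illustration.
alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AGGTTC"}
rng = random.Random(42)          # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A node's bootstrap value would then be the number of replicate trees containing that node divided by the number of replicates.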

Phylogeny software: ClustalW and T-Coffee for multiple sequence alignment. PHYLIP for generation of the tree and FigTree for a graphical representation.

Substitution rate models  Jukes Cantor model -> See script and protocol.

Cladogram  is a phylogenetic tree formed using cladistic methods. This type of tree only represents a branching pattern; i.e., its branch spans do not represent time or relative amount of character change. A cladogram is not, however, an evolutionary tree because it does not show how ancestors are related to descendants or how much they have changed. Phylogram  is a phylogenetic tree that has branch spans proportional to the amount of character change.

A rooted phylogenetic tree is a directed tree with a unique node corresponding to the (usually imputed) most recent common ancestor of all the entities at the leaves of the tree. The most common method for rooting trees is the use of an uncontroversial outgroup: close enough to allow inference from trait data or molecular sequencing, but far enough to be a clear outgroup.

Homology vs Homoplasy: The term homology refers to structures on two or more different species that are similar or the same that came from a common ancestor of the species.

Homoplasy, on the other hand, describes a characteristic that two or more different species have in common that was not inherited from their recent ancestor. Instead, a homoplasy would have evolved independently, usually due to natural selection in similar environments or filling the same type of niche as the other species with that trait. Homology is a product of divergent evolution. The two species were once the same species at the point where they have a most recent common ancestor. Over time, individuals in the population evolved through either some type of selection or isolation from the rest of the population. The species, even though they diverged at that point, still retain some of the characteristics of the common ancestor. These are the homologies. Convergent evolution is the origin of a homoplasy. These similar traits evolved independently of each other and are not found in the common ancestor of the two species being examined. Instead, each species evolved the trait after diverging and becoming separate species.

Monophyletic, paraphyletic and polyphyletic groups: In phylogeny, a group is monophyletic (from the Greek: of one branch) if all the organisms included in it have evolved from a common ancestral population, and all the descendants of that ancestor are included in the group. By contrast, a group that contains some but not all of the descendants of the most recent common ancestor is called paraphyletic, and a taxonomic group that contains organisms but lacks a common ancestor is called polyphyletic.

Metagenomic barcoding

DNA barcoding is a taxonomic (note: not taxonomic but phylogenetic, i.e. about relationships between species) method that uses a short genetic marker...