Genomics Lecture Notes for Final Exam PDF

Title Genomics Lecture Notes for Final Exam
Author Sarah Elkamhawy
Course Molecular Biology
Institution Rutgers University
Pages 6

Summary

Dr. Minden's lecture notes for the final exam. Notes are color coded; items highlighted in pink are important and were on the exam....


Description

Next Generation Sequencing (NGS):
• Sequencing technologies have changed since the time of the Human Genome Project.
• Sequencing is now much more affordable and can be done quickly.
• New sequencing methods are called Next Generation Sequencing (NGS)
  o NGS allows sequencing of thousands to millions of DNA molecules simultaneously
• Types of NGS:
  o Illumina MiSeq
  o Illumina HiSeq
  o 454 Sequencer
  o SOLiD system
  o Ion Proton system
• All systems require a library, which may involve ligation with custom adapters
• The sequences are amplified on a solid surface with covalently attached linkers that hybridize to the library adapters, producing clusters of DNA
• They all involve sequencing machines
  o The sequencing machines produce raw data at the end of the sequencing run.
• Bioinformatics takes the raw data, analyzes it, and makes sense of it.
• Today's new sequencing technologies have greatly increased the speed of sequencing and greatly reduced the cost.
  o New technologies allow us to sequence whole genomes in days, at costs that are affordable for many labs.
• Can sequence DNA + RNA.

1. NGS Sequencing: Illumina
• Sequences the whole genome of a whole cell, not just one sequence

NGS Sequencing Steps:
1. Break DNA into small pieces (50-150 nts)
2. Short sequences (adapters) are attached to the ends
3. Glass flow cell
   - Small sequences on the flow cell bind to the adapters (they are complementary to the adapters at the ends of the small pieces of DNA)
4. Primers on the flow cell are complementary to the adapters added to the DNA fragments
5. DNA binds at the matching (complementary) primer
6. DNA amplification occurs in the presence of DNA polymerase
   - Start with ssDNA, then synthesize the other strand to make dsDNA
7. One strand is washed off
   - Left with ssDNA
8. Hybridization to the other primer
   - The sequence is complementary, so the strand bends over to make a U shape
9. Another round of DNA synthesis
   - The whole system is not accurate when sequencing one piece; need to amplify for accuracy
10. Whole thing repeats
   - The 2 single strands fold over, each finds a complementary adapter, and this happens over and over
   - Lots of copies is key to making the whole system accurate
11. Bridge amplification
   - Sequencing whole genome
12. Massively parallel sequencing
13. Add another primer: the sequencing primer is annealed
14. Fluorescently labeled nucleotides are incorporated
15. Everything on the flow cell gets sequenced
16. Align the reads to a reference genome (mapping)
   - Analyzed on a computer, which reads what all the 50-150 nucleotide sequences are
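The first and last steps above (random fragmentation into 50-150 nt pieces, then mapping the unordered reads back to a reference) can be sketched in a few lines of Python. This is only a toy illustration: the function names, the random 2,000 nt "reference", and the exact-match lookup are all assumptions made for the example. Real aligners tolerate mismatches and sequencing errors, which this sketch does not.

```python
import random


def fragment(genome, n_reads, min_len=50, max_len=150, seed=0):
    """Cut random fragments (reads) of min_len..max_len nt out of a genome string."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        length = rng.randint(min_len, max_len)
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
    rng.shuffle(reads)  # reads come off the machine in no particular order
    return reads


def map_read(reference, read):
    """Return the 0-based position of the first exact match, or -1 if none."""
    return reference.find(read)


# Hypothetical reference genome: 2,000 random nucleotides.
rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(2000))

reads = fragment(reference, n_reads=200)
positions = sorted(map_read(reference, r) for r in reads)
```

Sorting the mapped positions is what puts the reads back in order along the reference. Because the cuts are random, neighboring reads overlap.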

Reads obtained from the sequencing machine:
• These DNA sequences are not arranged in any particular order
• At the end you get a bunch of sequences that come out in a file of sequences
• A mess in the beginning because the reads are not in order, not put together, and not aligned: just letters
• Bioinformatics (genome mapping) puts them together in chromosomal order, then analyzes them to see if there is a mutation or change

Align the reads to a reference genome (mapping):
• Want to put the sequences in order → mapping
• Align reads to the reference genome
  o Some reads overlap because the DNA is cut randomly at different places
• Reference genome: a genome we already know the sequence of
  o Ex. Sequencing all genes in a human cancer cell: use the already known human genome to know the basic sequence
  o Don't know if there are specific abnormalities in the cancer cell
• Still possible if you don't have a reference genome, by looking at the overlap between reads
  o With mouse and human, we already know the basic sequences
  o Mice are used often in the lab
• End result → nucleotide sequences must be aligned to the reference genome and put in order, then checked for changes compared to normal

Subtypes of NGS Sequencing:
1. WGS: Whole Genome Sequencing
   - Sequences the entire genome (introns + exons)
2. WES: Whole Exome Sequencing
   - Sequences ONLY the coding regions of the genome
3. ChIP-Seq (ChIP: chromatin immunoprecipitation)
   - Sequencing that identifies transcription factor binding
   - Understanding how genes are controlled
4. RNA-Seq:
   - Sequencing of RNA
   - Important because every cell has the same basic DNA aside from mutations
     o Liver cells express liver-specific RNA
   - RNA tells you which genes are expressed, so we look not just at the sequence but at the quantity, to see how much of each RNA is present in the type of cell being sequenced
5. Methyl-Seq:
   - Sequencing of methylated DNA
   - See which nucleotides are methylated
     o Methylation regulates gene expression; can get an idea of whether a drug affects gene expression at the RNA level, or how much DNA methylation is altered and how function is affected

What to look for in DNA sequencing:
• Looking at cancer cell vs. normal, drug treatment vs. no treatment, or resistance to treatment vs. none
  o Align to the reference genome, then look for these changes to see abnormalities
• Variants:
  o SNVs (single nucleotide variants: a single nucleotide changed)
  o Indels (small insertions or deletions, slightly more change)
  o SVs (structural variants: large insertions and deletions)
  o Chromosomal rearrangements

1. RNA-sequencing (RNA seq):
• Shows RNA sequence in a quantitative way, to see how much RNA is present for one gene vs. another, which you can't tell from the DNA sequence
• Normally comparing one condition to another
  o Tells the differences and how much RNA is present
• Isolate mRNA → reverse transcribe to cDNA → fragment cDNA (50-150 nucleotides) → add adapters → carry out sequencing (Illumina machine) → get sequencing output (random) → align sequences to reference genome → assign gene names
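The last two arrows of the RNA-seq pipeline above (align to the reference genome, then assign gene names) are what make the result quantitative. A minimal Python sketch, assuming the alignment has already produced a start position and length for each read. The annotation dictionary, gene names, coordinates, and read positions are all hypothetical values invented for the example, not a real annotation format.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (start, end) on the reference, end exclusive.
ANNOTATION = {"geneA": (0, 1000), "geneB": (1500, 2500)}


def assign_gene(read_start, read_len, annotation):
    """Return the first annotated gene that the aligned read overlaps, else None."""
    read_end = read_start + read_len
    for gene, (start, end) in annotation.items():
        if read_start < end and read_end > start:  # the two intervals overlap
            return gene
    return None


def count_reads(alignments, annotation):
    """Count aligned reads per gene; alignments is a list of (start, length)."""
    counts = Counter()
    for start, length in alignments:
        gene = assign_gene(start, length, annotation)
        if gene is not None:
            counts[gene] += 1
    return counts


# Made-up alignment results for two conditions.
condition_1 = [(10, 100), (200, 100), (900, 100), (1600, 100)]
condition_2 = [(50, 100), (1600, 100), (1700, 100), (2400, 100)]

counts_1 = count_reads(condition_1, ANNOTATION)
counts_2 = count_reads(condition_2, ANNOTATION)
```

In this toy data geneA collects more reads in condition 1 and geneB more in condition 2, which is the kind of difference an RNA-seq comparison is looking for.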

• Need to look at the RNA sequences you get back with software and computer programs and see the overlap; this combines biology and computer science to analyze genes

Results:
o Not only the sequences of the RNA molecules, but quantitation of each expressed gene (RNA)
o More of exon 2 in condition 1 than in condition 2 → downregulation in condition 2, because there is less RNA

• How is RNA sequencing useful:
  o Can analyze the differences in gene expression when comparing 2 (or more) different conditions

An RNA seq workflow for identification of differentially expressed genes:
1. Assess the quality of the reads
2. Map reads to a reference genome
3. Transcript reconstruction and quantitation of the abundance of each gene
   - Which RNA sequence is associated with which gene
4. Test for differential expression when looking at 2 different conditions
5. Statistical analysis
   - Make sure results are statistically significant

• Example of an RNA sequencing experiment: a comparison between two breast cancer cell lines:

  o MCF is positive for estrogen receptor (ER) and progesterone receptor (PR), while MDA is negative for both; one is hormone responsive and one is not, so different outcomes are possible
  o An RNA seq study was done to see which genes differ aside from ER and PR
  o Helps to see what the best way to treat the disease is

Software used in RNA sequence analysis:
A. Tuxedo Software Suite:
1. FastQC (analysis of sequence quality)
   o Quality control: sequences shouldn't be too short or unreadable
   o What you need for RNA seq
2. Bowtie (fast short read alignment) OR TopHat (spliced short read alignment)
   o Align sequences to the reference genome ***
     - TopHat is a "splice aware" aligner; this one is best for RNA-seq
     - Bowtie is good for DNA seq when not worried about splicing, but NOT for RNA seq
3. Cufflinks (transcript reconstruction and transcript abundance)
   o Transcript reconstruction: after alignment, work out that this RNA represents this gene, and that there is more of this gene compared to another (a certain amount of gene A and a certain amount of gene B)
   o Aligning cDNA and seeing what gene it belongs to
   o Gene name and quality
4. Cuffdiff (differential expression analysis)
   o Comparing 2 conditions to each other and seeing which genes are more expressed in which type of cell
   o How genes are expressed in condition 1 vs. 2
5. R: The R Project for Statistical Computing
   - R packages such as CummeRbund: visualization & analysis of results
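The two quantities behind steps 3 and 4 of the suite (abundance of a gene, then the comparison between conditions) can be written out directly. The sketch below uses FPKM, the unit these notes give for Cufflinks abundance, and a log2 fold change. The function names, the pseudocount parameter, the direction of the ratio (condition 2 over condition 1), and all example numbers are assumptions for illustration. This is not how Cufflinks or Cuffdiff compute their values internally.

```python
import math


def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of gene per Million fragments mapped."""
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def log2_fold_change(value_1, value_2, pseudocount=0.0):
    """log2(condition 2 / condition 1); a pseudocount > 0 avoids division by zero."""
    return math.log2((value_2 + pseudocount) / (value_1 + pseudocount))


# Made-up example: a 2,000 bp gene, 10 million mapped fragments per sample.
fpkm_1 = fpkm(fragments=400, gene_length_bp=2_000, total_mapped_fragments=10_000_000)
fpkm_2 = fpkm(fragments=100, gene_length_bp=2_000, total_mapped_fragments=10_000_000)
change = log2_fold_change(fpkm_1, fpkm_2)
```

Dividing by gene length and by sequencing depth is what lets a long gene be compared with a short one, and a deeply sequenced sample with a shallow one. A negative log2 fold change here means less RNA in condition 2, i.e. downregulation.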



Assess the quality of the reads: FastQC:
o The sequence you get from the machine is in FASTQ format
  - Millions of sequence reads in FASTQ format
o FASTQ format: sequence + quality
o FastQC is software that assesses sequence quality
  - Determines the quality of each nucleotide
o Done multiple times because of amplification
o Get back a chart for your sequencing results
  - Puts all results together
o Using a longer sequence (250ish)
o Program averages how good the quality is at each nucleotide position
o Look at where the sequence falls in each part of the chart
o Should be in the green area
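The "sequence + quality" layout of FASTQ, and the per-position averaging described above, can be sketched as follows. This is a simplified illustration, not FastQC itself: it assumes plain 4-line records and the Phred+33 quality encoding (each quality character's ASCII code minus 33 is the score). The two reads in EXAMPLE are invented.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from simple 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_quality_per_position(records, offset=33):
    """Average the Phred score at each read position across all reads."""
    totals, counts = [], []
    for _, _, quality in records:
        for i, ch in enumerate(quality):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two made-up reads; the second one loses quality towards its end.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nII5#\n"
means = mean_quality_per_position(read_fastq(io.StringIO(EXAMPLE)))
```

Here the average drops from 40 at the first two positions to 30 and then 21, the same "fine at the beginning, worse at the end" shape the quality chart shows.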

• Quality isn't great
• In the beginning the reads are fine, then quality falls off because they are so long, so they become poor quality at the end
• If the quality is bad, you have to decide whether to do it again or analyze anyway







• Good quality
• 75 nucleotide sequences
• Poor towards the end, which is normal, but overall great quality

Bowtie or TopHat software: Map reads to a reference genome
o The issue of introns and exons when sequencing RNA rather than DNA
  - The problem has to do with splicing
o The genome has exons and introns, but we sequence cDNA, so all the introns are gone; how do you align it?
o Need to use a splice-aware aligner because the cDNA sequence is spliced together (exons 1 + 2 joined)
• TopHat is a "splice aware" aligner:
  o Uses the computer to splice out the regions of the reference genome we are not interested in (introns), then aligns the sequence reads to the digitally spliced reference genome

Cufflinks: Transcript reconstruction and quantitation of the abundance of each gene
o Want to know how much of each gene we have when we align
  - Which reads are associated with which genes in the reference genome, and how much of each there is
o Once the alignment is done, you still need to assign gene names
  - Uses an annotation file to assign gene names
o Used to assign gene names and quantitate transcripts per gene
o Expression analysis:
  - Count the number of fragments overlapping with all annotated exons of a gene
  - Often the abundance is expressed as FPKM (Fragments Per Kilobase per Million Fragments Mapped)

Cuffdiff: test for differential expression:
o Results:



  - Name of gene, locus, samples for the conditions, and values showing how much of the gene is in each condition
  - Log2 fold change: how much more you have of a gene in one condition vs. another
  - Ex. a lab treats cells with drug and placebo, looks at the genes coming back from the system, and looks at the amount of RNA from each gene

• Statistical analysis:
  o To analyze differences in genes between conditions, use R
    - Commonly used to analyze RNA and DNA sequence
  o Contains different packages to analyze different types of things
    - CummeRbund analyzes results and gives you visuals that make sense: what's significant and what isn't
    - Visually appealing presentation
  o Differential expression analysis through log fold change (LogFC)
    - Shows how much change we have
  o Interpretation of the results from an RNA seq experiment is complicated
    - How do you decide if differences in expression are significant?
    - How do you decide which genes are relevant to your system?
  o Usually, every RNA sequencing study is done at least in triplicate
    - Doing just 1 sample makes the results not that relevant, because the effect could be specific to that one sample of the cell line; need 3 independent experiments for each cell line to analyze and do statistics
  o Statistical differences:
    - Strict p value for cutoff, fold change cutoff?
    - P value gives the level of significance and an assigned cutoff, so anything below the cutoff line is not significantly different in expression level
    - Ones above the line are analyzed to see which are important
  o Good for looking at important genes, but some may have a low difference and still be important
    - A starting point, but may need to look at ones that didn't make the cutoff

Further analysis of RNA sequencing results:
• Gene by gene analysis to determine which differentially expressed genes are most relevant
  o This requires extensive literature research
  o This can be very time consuming
1. Gene set enrichment analysis:
• Do genes fall into specific categories, or sets?
• Gene sets can be:
  o Pathways (based on databases such as Reactome, KEGG, WikiPathways, Ingenuity Pathway Analysis (IPA))
  o Genomic location
  o Transcription factor targets
• Software is available for many of these tests
• Pathways:
  o Gene set enrichment analysis of signaling pathways in cells, compared to the genes you got from the system
  o Possible problems:
    - Crosstalk between pathways
    - Genes in pathways are not always regulated by mRNA expression
    - Pathways aren't always linear
2. Clustering:
• Gene ontology → genes are categorized by function, or by association with a specific term
• Group genes or samples that contain similar sequences or have similar expression profiles
  o Groups are called clusters
• Many different clustering algorithms exist
• There are pros and cons to this method.
  o There can be problems with assignment of the clusters, which is sometimes quite arbitrary...
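The gene set enrichment question above (do the differentially expressed genes fall into one set more often than chance would predict?) can be illustrated with a simple over-representation test. The notes do not say which statistical test the available software uses, so the hypergeometric test below is an assumption chosen for the sketch. It is not the rank-based GSEA algorithm. All gene names and set sizes are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p value: P(overlap >= n_overlap) by chance.

    n_background: all genes measured
    n_in_set:     genes belonging to the pathway / gene set
    n_selected:   differentially expressed genes
    n_overlap:    differentially expressed genes that are in the set
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical data: 20 measured genes, a 5-gene pathway, 4 DE genes.
background = {f"g{i}" for i in range(20)}
pathway = {"g0", "g1", "g2", "g3", "g4"}
de_genes = {"g0", "g1", "g2", "g10"}

overlap = de_genes & pathway
p_value = enrichment_p_value(len(background), len(pathway), len(de_genes), len(overlap))
```

A small p value suggests the pathway is over-represented among the differentially expressed genes. The caveats listed above (crosstalk, regulation that is not at the mRNA level, non-linear pathways) still apply to any such result.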

