Explain the FASTQ format.
(2023)
A common text format for exchanging sequence data between tools; an extension of the FASTA format.
In addition to the sequence and its identifier, it stores a numeric quality score for every nucleotide.
A FASTQ file normally has 4 lines per sequence:
Line 1: “@” + sequence identifier + optional description (like a FASTA title line)
Line 2: the raw sequence letters
Line 3: begins with a “+” and is optionally followed by the same sequence identifier (and any description)
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence
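The four-line layout can be illustrated with a minimal parsing sketch. It assumes strict 4-line records and the Phred+33 quality encoding (an assumption; the notes do not name an encoding). The names `FastqRecord`, `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # text after "@" on line 1
    sequence: str          # raw sequence letters (line 2)
    qualities: list[int]   # one score per base, decoded from line 4


def parse_fastq(lines: list[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text; offset=33 is an assumed encoding."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-empty lines")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Line 4 must contain exactly one quality symbol per sequence letter.
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Hypothetical example read (not real data).
example = ["@read1 sample description", "GATTACA", "+", "III#II!"]
records = list(parse_fastq(example))
print(records[0].identifier, records[0].qualities)
```

Real files may wrap sequences over several lines or use a different quality offset, which this sketch does not handle.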
Below follows the output of the GeneMark.hmm program. Please explain what the column “strand” means.
GeneMark is a gene prediction tool for prokaryotes. Genes can be located on either strand of the DNA double helix, so for every predicted gene GeneMark reports in the “strand” column whether it lies on the + or - strand relative to the input sequence.
“+”: the gene is on the forward (direct) strand, i.e. the input sequence as given, read 5’ to 3’
“-”: the gene is on the reverse (complementary) strand, i.e. the reverse complement of the input sequence
What is the normalization for the sequence length in NGS and what is it used for?
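Sequence length normalization in next-generation sequencing (NGS) means dividing read counts by the length of the gene or transcript they map to. It is usually combined with normalization for sequencing depth, so that expression values are comparable between genes and between samples.
Purpose:
correct for length bias: at the same expression level, a longer transcript yields more reads
ensure longer sequences do not disproportionately affect the analysis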
Common Metrics:
RPKM (Reads Per Kilobase of transcript per Million mapped reads)
FPKM (Fragments Per Kilobase of transcript per Million mapped reads)
TPM (Transcripts Per Million)
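A small sketch of how RPKM and TPM are computed from raw counts, assuming the usual definitions (RPKM = reads * 10^9 / (total mapped reads * transcript length in bp); TPM = length-normalized rate rescaled to sum to one million). The gene names, counts and lengths are invented for illustration.

```python
def rpkm(counts: dict[str, int], lengths_bp: dict[str, int]) -> dict[str, float]:
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {g: counts[g] * 1e9 / (total_reads * lengths_bp[g]) for g in counts}


def tpm(counts: dict[str, int], lengths_bp: dict[str, int]) -> dict[str, float]:
    """Transcripts per million: normalize by length first, then by the sum of rates."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] * 1e6 / scale for g in counts}


# Hypothetical counts and transcript lengths.
counts = {"geneA": 100, "geneB": 100, "geneC": 800}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 4000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(rpkm_values)
print(tpm_values)
```

geneA and geneB have the same read count, but geneA is half as long, so its normalized value is twice as high. TPM values always sum to one million within a sample, which RPKM values do not.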
Name and explain two types of Alternative Splicing.
cassette exon (exon skipping): an exon may be spliced out of the primary transcript or retained
mutually exclusive exons: one of two exons is retained in mRNAs after splicing, but not both (are “competing”)
competing 5’ splice site (alternative donor site): An alternative 5’ splice junction is used, changing the 3’ boundary of the upstream exon
competing 3’ splice site (alternative acceptor site): An alternative 3’ splice junction is used, changing the 5’ boundary of the downstream exon
retained intron: A sequence may be spliced out as an intron or simply retained. This is distinguished from exon skipping because the retained sequence is not flanked by introns.
multiple promoters
multiple poly(A) sites
Explain one approach to RNA secondary structure prediction.
Two possible approaches:
Energy minimization methods: choose the sets of complementary sequences that give the most energetically stable molecule
Comparative (covariation) methods: take into account patterns of base-pairing that are conserved during evolution
Energy minimization method:
every base is compared for complementarity to every other base
The energy of each predicted structure is estimated by the nearest-neighbor rule:
sum the negative base-stacking energies for each pair of bases in the predicted double-stranded regions
add positive energies of destabilizing (unpaired) regions
The complementary regions are evaluated by a dynamic programming algorithm to predict the most energetically stable molecule
Programs: e.g. MFOLD, ViennaRNA (RNAfold)
What are the N50 and L50 measures?
N50: Sort the contigs from longest to shortest and add up their lengths; N50 is the length of the shortest contig that has to be included to cover 50% of the genome (or of the total assembly length).
higher value = more contiguous assembly with longer contigs and fewer gaps
L50: The smallest number of contigs whose summed length makes up half of the genome size.
lower value = more contiguous assembly, as fewer contigs are needed to reach the 50%
-> both are used to measure the quality (contiguity) of a genome assembly
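Both measures come out of the same cumulative sum over the sorted contig lengths. The sketch below assumes that the total assembly length is used when no genome size is given; the function name and contig lengths are hypothetical.

```python
from typing import Optional


def n50_l50(contig_lengths: list[int], genome_size: Optional[int] = None) -> tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    # Assumption: fall back to the total assembly length if no genome size is given.
    reference = genome_size if genome_size is not None else sum(lengths)
    target = reference / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            # 'length' is the shortest contig needed; 'count' is how many were needed.
            return length, count
    raise ValueError("contigs do not cover half of the given genome size")


# Hypothetical contig lengths (total 300).
n50, l50 = n50_l50([80, 70, 50, 40, 30, 20, 10])
print(n50, l50)
```

Here the two longest contigs (80 + 70 = 150) already reach half of the total length, so N50 is 70 and L50 is 2.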
Explain what the Ka/Ks ratio is and what is needed to calculate it.
The Ka/Ks ratio is the ratio between the rate of nonsynonymous substitutions (Ka) and the rate of synonymous substitutions (Ks). It is used to test for natural selection acting on protein-coding genes.
It indicates whether amino acid changes tend to become fixed (evolutionary advantage) or to be lost (negative selection).
Needed for the calculation: an alignment of homologous protein-coding sequences (codon by codon), from which the synonymous and nonsynonymous sites and substitutions are counted.
The Ka/Ks ratio can indicate the following tendencies:
Ka/Ks > 1 : positive selection
Ka/Ks = 1 : neutral evolution
Ka/Ks < 1 : negative selection -> purifying selection
For pseudogenes Ka/Ks = 1 is expected.
Observed values for pseudogenes are often < 1: Ka/Ks is underestimated because the pseudogene is compared with the present-day gene and not with the ancestral functional gene that gave rise to the processed pseudogene
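To make the calculation concrete, here is a deliberately simplified sketch in the spirit of the Nei-Gojobori counting method. Assumptions not taken from the notes: the two sequences are gap-free, in frame and of equal length; codon pairs that differ at more than one position are skipped; the Jukes-Cantor correction is applied. The sequences are invented, and real analyses should use an established tool.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon: str) -> float:
    """Expected number of synonymous sites (0..3) in one codon."""
    sites = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                sites += 1 / 3
    return sites


def jukes_cantor(p: float) -> float:
    """Correct an observed proportion of differences for multiple hits."""
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(seq1: str, seq2: str) -> tuple[float, float]:
    """Return (Ka, Ks) for two aligned, gap-free coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    syn_sites = nonsyn_sites = syn_diffs = nonsyn_diffs = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        mismatches = sum(a != b for a, b in zip(c1, c2))
        if mismatches > 1:
            continue  # simplification: ignore codons with several differences
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        if mismatches == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    return (jukes_cantor(nonsyn_diffs / nonsyn_sites),
            jukes_cantor(syn_diffs / syn_sites))


# Hypothetical aligned coding sequences (two synonymous, one nonsynonymous change).
ka, ks = ka_ks("ATGGCTAAAGGTCTGTTTGAACCC", "ATGGCCAAAGGTCTGTTCGATCCC")
print(ka, ks, ka / ks)
```

In this toy example Ka is smaller than Ks, so the ratio is below 1, the pattern expected under purifying selection.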
Explain the similarity-based approach to gene prediction.
The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome, e.g. by comparing a genomic sequence with homologous genomic sequence from closely related organisms.
Given a known gene (or protein) and a genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene.
e.g. a known frog gene is aligned to different locations in the human genome. The local alignments are chained, and the maximum-scoring chain of substrings (the “best” path) reveals the exon structure of the human gene.
—> possible due to evolutionary conservation of gene sequences across different species
Example programs: TWINSCAN, SGP2
Often alignment tools like BLAST or BLAT can be used.
Explain the k-mer approach to finding the repeats in genomes.
The sequence is scanned for overrepresented strings of a certain length k (k-mers)
Challenge: determining the optimal size of the oligo (k) and the number of mismatches allowed
Generate K-mers: Slide a window of length k across the genome sequence to generate all possible k-mers.
Count Occurrences: Count the frequency of each k-mer in the genome.
Identify Repeats: K-mers that appear more frequently than expected are identified as potential repeats.
Analyse the repeats: determine their positions in the genome and map them
This gives a better understanding of the distribution and context of the repeats and allows further analyses to differentiate the types of repeats (tandem, dispersed, etc.)
Applications:
Detecting Repeats: Helps in identifying repetitive sequences in the genome, such as tandem repeats, interspersed repeats, and low-complexity regions.
Genome Assembly: Assists in resolving ambiguous regions during genome assembly by highlighting repetitive sequences.
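The generate / count / identify / locate steps can be sketched in a few lines. This version counts exact matches only (no mismatches allowed) and uses a fixed count cutoff in place of a statistical expectation; both are simplifying assumptions, and the toy sequence is invented.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Slide a window of length k over the sequence and count every k-mer."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overrepresented_kmers(sequence: str, k: int, min_count: int) -> dict[str, int]:
    """Keep k-mers seen at least min_count times (hypothetical fixed cutoff)."""
    return {kmer: n for kmer, n in count_kmers(sequence, k).items() if n >= min_count}


def kmer_positions(sequence: str, kmer: str) -> list[int]:
    """0-based start positions of a k-mer, including overlapping occurrences."""
    seq, positions = sequence.upper(), []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        start = seq.find(kmer, start + 1)
    return positions


# Hypothetical toy "genome".
genome = "ACGTACGTTTGACGTACGAA"
repeats = overrepresented_kmers(genome, k=4, min_count=3)
print(repeats, kmer_positions(genome, "ACGT"))
```

A dictionary of counts is enough for small k. For large genomes, dedicated k-mer counters are normally used because the table no longer fits comfortably in memory.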
What are the challenges in identifying motifs in biological sequences?
we don’t know the motif sequence
we don’t know the location relative to gene start
we don’t know the length of the motif
Motifs can differ slightly from one gene to the next / can have mutations (how many mutations are allowed?)
How can true motifs be distinguished from “random” ones?
(Consensus sequences help in finding motifs)
-> Because of these and other problems, motif-finding algorithms have high runtime and memory requirements
What is repeat masking and why is it needed?
Definition: The process of identifying (e.g. using RepeatMasker) and masking repetitive sequences in a genome to prevent them from interfering with genomic analyses.
Improve Assembly: Helps in accurate genome assembly by reducing ambiguity caused by repeats.
Enhance Annotation: Improves gene prediction and annotation by distinguishing between coding and non-coding regions.
Data Quality: Ensures high-quality sequence alignment and variant calling by excluding repetitive regions that can cause errors.
What are two assumptions that can be made when predicting the effect of an amino acid substitution?
Conservation: Substitutions in highly conserved regions are more likely to affect protein function, indicating these regions are crucial for the protein's structure or function.
Physicochemical Properties: Substitutions that drastically change the physicochemical properties (such as charge, size, or hydrophobicity) of the amino acid are more likely to impact the protein's function or stability.
Explain the main steps of the genome assembly process.
Read Generation: Sequencing the genome to produce short DNA sequences (reads) using technologies like Illumina or PacBio.
Read Quality Control: Filtering and cleaning the reads to remove low-quality sequences and contaminants.
Read Overlapping: Finding overlaps between reads to identify how they connect, e.g. with a de Bruijn graph or Overlap-Layout-Consensus (de novo assembly; in reference-based assembly the reads are mapped to a reference instead)
Contig Assembly: Merging overlapping reads into longer contiguous sequences (contigs).
Scaffolding: Ordering and orienting contigs into larger structures (scaffolds) using paired-end reads or long-read technologies.
Gap Filling: Closing gaps within scaffolds to produce a more complete genome sequence.
Quality Assessment: Evaluating the assembly for accuracy, completeness, and contiguity using metrics like N50 and L50.
Optional annotation by identification of genes (MAKER, AUGUSTUS) and functional annotation (BLAST)
Explain the TargetScan algorithm for predicting mammalian microRNA targets (you can skip the formulas).
TargetScan:
thermodynamics-based modeling of RNA:RNA duplex interactions
comparative sequence analysis
Input:
miRNA that is conserved in multiple organisms
a set of orthologous 3’ UTR sequences from these organisms
Steps:
search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2-8 of the miRNA: “miRNA seed” and “seed matches”
extend each seed match with additional base pairs to the miRNA as far as possible in each direction, allowing G:U pairs, but stopping at mismatches
optimize basepairing of the remaining 3’ portion of the miRNA to the 35 bases of the UTR immediately 5’ of each seed match using the RNAfold program
assign a folding free energy to each such miRNA:target site interaction
assign a Z score to each UTR
sort the UTRs in this organism by Z score and assign a rank to each
predict as targets those genes for which both Zi >= Zc and Ri <= Rc for an orthologous UTR sequence in each organism, where Zc and Rc are pre-chosen Z score and rank cutoffs
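Only the first step, the seed-match search, is sketched here; extension, folding with RNAfold, Z scores and ranking are left out. The UTR string is invented, the function names are hypothetical, and the let-7a sequence serves only as example input.

```python
# Complement table for RNA (Watson-Crick pairs only, as required for the seed).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna: str) -> str:
    """UTR motif (5'->3') that pairs perfectly with bases 2-8 of the miRNA."""
    seed = mirna.upper()[1:8]                  # bases 2-8 in 1-based numbering
    return seed.translate(COMPLEMENT)[::-1]    # reverse complement


def find_seed_matches(mirna: str, utr: str) -> list[int]:
    """0-based start positions of all seed matches in a 3' UTR."""
    site = seed_match_site(mirna)
    rna = utr.upper().replace("T", "U")
    hits = []
    start = rna.find(site)
    while start != -1:
        hits.append(start)
        start = rna.find(site, start + 1)
    return hits


let7a = "UGAGGUAGUAGGUUGUAUAGUU"            # example miRNA, 5'->3'
utr = "AAGCUACCUCAUUUCUACCUCGG"             # hypothetical 3' UTR fragment
print(seed_match_site(let7a), find_seed_matches(let7a, utr))
```

Because the miRNA binds antiparallel to the mRNA, the motif searched for in the UTR is the reverse complement of the seed, not the seed itself.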
What is a Pan-genome?
A pan-genome is the entire set of genes from all strains within a species or clade. It includes:
Core Genome: Genes shared by all strains.
Accessory Genome: Genes not present in all strains, including:
Dispensable Genes: Present in some but not all strains.
Unique Genes: Specific to individual strains.
Purpose: To study genetic diversity, evolution, and functional adaptations.
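The partition into core, dispensable and unique genes amounts to simple set operations once each strain is represented by its set of gene families. The sketch assumes at least two strains; the strain and gene names are made up.

```python
def pan_genome(strains: dict[str, set[str]]) -> dict[str, set[str]]:
    """Split the genes of several strains into pan-genome compartments."""
    gene_sets = list(strains.values())
    pan = set().union(*gene_sets)               # every gene seen in any strain
    core = set.intersection(*gene_sets)         # genes shared by all strains
    unique = {g for g in pan if sum(g in s for s in gene_sets) == 1}
    return {
        "pan": pan,
        "core": core,
        "accessory": pan - core,
        "dispensable": pan - core - unique,     # in some, but more than one, strain
        "unique": unique,                       # in exactly one strain
    }


# Hypothetical gene content of three strains.
strains = {
    "strain1": {"gyrA", "recA", "toxB", "capK"},
    "strain2": {"gyrA", "recA", "toxB"},
    "strain3": {"gyrA", "recA", "phaZ"},
}
result = pan_genome(strains)
print(result["core"], result["dispensable"], result["unique"])
```

In practice the hard part is deciding which genes from different strains count as "the same" (ortholog clustering); this sketch takes that step as given.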
How does the gene density of bacteria differ from that of higher eukaryotes?
(2013)
Bacteria have a higher gene density than higher eukaryotes
Gene density (coding fraction of the genome) in prokaryotes (bacteria & archaea): ~90% >>> in eukaryotes: ~1-2%
In bacteria the gene density is about 1 gene per 1,000-1,400 bases
In higher eukaryotes the gene density is about 1 gene per 100,000 bases
How is the sequence logo obtained with the EM algorithm?
EM algorithms such as MEME (Multiple EM for Motif Elicitation) start from one site out of several candidate sites in the target sequences and then alternate between assigning sites to a motif and updating the motif model. Only the best hit per sequence is reported, although lower-scoring hits in the same sequence can also have an effect.
start with initial guesses for the location and size of the site (e.g. the region of a binding site is already known from prior experiments)
expectation step:
position-wise composition of the site is used to estimate the probability of finding the site at any position of the seqs
these probabilities are used in turn to provide new information as to the expected base distribution for each column
maximization step: new counts of bases for each position in the site found in E-step are substituted for the previous set
E- and M-steps repeated until convergence (no more changes)
Result:
best location of the site in each seq
best estimate of the base composition of each column in the site
What does the height of the stacks in a sequence logo tell us?
The total height of the stack at each position is the information content (a measure of conservation) of that position; within the stack, the height of each letter is its frequency scaled by this information content
can be corrected for the background frequencies of the bases
data might include pseudocounts to overcome effects of missing data
the maximum value for DNA bases is 2 bits (log2(4)) -> perfectly conserved
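The stack heights can be sketched for a single alignment column. Assumptions: uniform background base frequencies, so the information content is log2(4) minus the Shannon entropy of the column, and an optional pseudocount. The function name is hypothetical.

```python
import math
from collections import Counter


def column_information(column: str, pseudocount: float = 0.0) -> tuple[float, dict[str, float]]:
    """Return (information content in bits, letter heights) for one logo column."""
    counts = Counter(column.upper())
    total = len(column) + 4 * pseudocount
    freqs = {b: (counts.get(b, 0) + pseudocount) / total for b in "ACGT"}
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    info = math.log2(4) - entropy               # 2 bits minus the uncertainty
    heights = {b: p * info for b, p in freqs.items()}
    return info, heights


conserved_info, conserved_heights = column_information("AAAA")
random_info, _ = column_information("ACGT")
print(conserved_info, random_info)
```

A perfectly conserved column reaches the 2-bit maximum, while a column with all four bases at equal frequency carries no information and has height zero.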
Why is it important to know the pseudogenes as well?
pseudogenes: Nonfunctional sequences of genomic DNA that are originally derived from functional genes (by gene duplication), but exhibit such degenerative features as premature stop codons and frameshift mutations that prevent their expression
Pseudogenes are nevertheless important: some play a role in regulating gene activity and are therefore not without function.
might interfere with experiments
PCR and hybridization experiments
transcribed pseudogenes
interference with disease diagnostics and treatment
molecular record of dynamics and evolution of genomes
rate of nucleotide substitutions
rate of DNA loss
improvement of gene prediction and annotation efforts
What do "multiplicity" and "co-operativity" mean in the context of miRNA-target interactions?
(2013 / 2017)
multiplicity: one miRNA can target more than one gene
some miRNAs appear to be very promiscuous, with hundreds of predicted targets, but most miRNAs control only a few genes
co-operativity: one gene can be controlled by more than one miRNA
Some target genes appear to be subject to highly cooperative control, but most genes do not have more than four target sites
How does the positive predictive value change when the target closely matches the informant?
Positive predictive value (PPV) = a measure of how likely it is that a positive prediction is actually correct.
PPV = TP / (TP + FP)
If the target closely matches the informant, the PPV increases, because the number of correctly predicted positives (TP) rises and the number of false positive predictions (FP) falls.
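As a small numerical illustration of PPV = TP / (TP + FP), the sketch below compares two invented prediction runs. The counts are hypothetical and only show the direction of the effect.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the fraction of positive predictions that are correct."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("PPV is undefined when nothing is predicted positive")
    return true_positives / predicted_positive


# Hypothetical counts: a distant informant versus a closely matching informant.
ppv_distant = positive_predictive_value(true_positives=80, false_positives=20)
ppv_close = positive_predictive_value(true_positives=90, false_positives=10)
print(ppv_distant, ppv_close)
```

With more true positives and fewer false positives, the PPV rises from 0.8 to 0.9.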
How can adding an alternative exon to a transcript lead to a shortened protein product?
By using a different site for translation initiation (alternative initiation)
The alternative exon contains a stop codon (alternative termination)
A different translation termination site because of a frameshift (truncation or extension)
Changing the internal region through an in-frame insertion or deletion
Name a possible origin of operons.
-> not covered in the lecture
Operons may have arisen in thermophilic organisms: organizing genes in operons allows functionally related protein products to associate, so that they protect each other from thermal degradation.
Role of horizontal gene transfer: the advantage of transferring complete sets of genes and thereby conferring a defined phenotype on the recipient
possibly originating from thermophilic bacteria
How does an increase in the frameshift of the ORF length affect the accuracy of the prediction?
The accuracy of the prediction increases with increasing frameshift of the ORF length, since longer ORFs are more likely to correspond to actual genes.
presumably refers to GeneMark? -> higher sensitivity, lower specificity
Analysis window: a larger window increases sensitivity and false positives
Frameshift: more frameshift errors reduce prediction accuracy
Name the protein complex responsible for converting animal pre-miRNA into miRNA.
Protein complex: the RNA-induced silencing complex (RISC) contains the enzyme Dicer, which cleaves the pre-miRNA into miRNA
What are the properties of a strong promoter?
A strong promoter is very similar to the consensus sequence.
(a weak promoter differs more from the consensus sequence)
a DNA sequence that allows a high transcription rate
binds RNA polymerase efficiently and promotes robust transcription initiation
a strong promoter has a high affinity for RNA polymerase, which enables efficient binding and initiation of transcription
presence of specific sequence motifs within the promoter region
What is a sigma factor? What is it needed for in transcription? Which sigma factor occurs most frequently?
Sigma factors are proteins that are part of the RNA polymerase protein complex that binds to the promoter.
They are needed for the initiation of transcription.
Most common sigma factor: σ^70 (housekeeping sigma factor of E. coli; controls transcription)
There are several interchangeable sigma factors, each of which recognizes a particular group of promoters (e.g. promoters of housekeeping or heat-shock genes)
Name three differences between the Whole Genome Shotgun and the Clone-by-Clone approach.
Clone-by-Clone:
physical mapping: requires construction of clone-based physical map
assembly: easier to resolve complex genomic regions as position of contigs is already known (due to the physical mapping)
labor intensity: physical mapping is labor intensive, but after mapping the clones can be divided between different labs for sequencing
Whole Genome Shotgun:
physical mapping: mapping phase is skipped and subclone library is constructed from entire genome
assembly: order / position of contigs needs to be inferred from overlapping reads and read pairs which can be problematic for tandemly repeated DNA (incorrect overlaps)
labor intensity: less labor intensive, but requires more computational resources
Clone-by-clone shotgun: BAC clones are required; the clones are mapped beforehand
Whole genome shotgun: the mapping phase is skipped; assembly takes longer
Clone-by-Clone Shotgun Sequencing
Selection of a clone (e.g. a BAC).
Purification and physical fragmentation of the BAC DNA.
Subcloning of the DNA fragments (2-5 kb).
Generation of sequence reads from the subclones (several thousand reads per BAC).
Assembly of the reads into a draft sequence based on sequence overlaps.
Identification and repair of gaps and regions of poor sequence quality using additional sequence data.
Whole-Genome Shotgun Sequencing
The mapping phase is skipped
Shotgun sequencing is carried out using subclone libraries prepared from the entire genome
Typically, tens of millions of sequence reads are generated
Computational tools are used to build contigs from the reads.
Remaining gaps are closed by subsequent procedures to obtain a complete genomic sequence.
Which approach is more commonly used for prokaryotic genomes and which for eukaryotic genomes? Explain exactly why. (Clone-by-Clone or Whole Genome Shotgun)
Whole Genome Shotgun (WGS) for prokaryotic genomes:
Prokaryotic genomes are smaller and less complex.
WGS is faster and cheaper because the mapping phase is omitted.
Fewer repetitive DNA sequences make assembly easier.
Clone-by-Clone for eukaryotic genomes:
Eukaryotic genomes are larger and more complex.
Physical mapping helps to manage the complexity and simplifies assembly.
Repetitive sequences are handled through the physical map.
Efficient division of labor by distributing the clones among different labs.
Easier assembly of complex genomic regions.
(Approaches can be combined in a hybrid shotgun-sequencing approach)
Name four variants of alternative splicing.
How can one find out whether alternative splicing has taken place?
Genome-wide analysis:
Genome sequence assemblies and EST sequences
EST clustering from UniGene
BLAST search:
Candidate gene regions (BLAST threshold E < 10^-5)
Putative (short) exons (BLAST threshold E < 10^-10)
Alignment of the genomic regions with ESTs by dynamic programming
Splice detection by a computational procedure
Verification:
RNA isoform analysis by RT-PCR (different PCR product lengths)
Microarrays with exon-exon junction probes
What are the consequences when alternative splicing makes the protein product larger?
Consequences of new protein parts:
alter protein binding properties, e.g. receptor / ligand
alter intracellular localization, e.g. membrane insertion
alter extracellular localization, e.g. secretion
alter enzymatic or signaling activities
alter protein stability, e.g. inclusion of cleavage sites
Insertion of post-translational modification domains
Change ion channel properties
What are known roles of alternative splicing?
(-)
Influence RNA function
AS does occur to alter 5’ and 3’ UTR regions - Proposed roles in subcellular localization and RNA stability
Coordinated Regulation of Biological Events
Neuron development (Dscam)
Channel activity associated with hearing (slo)
Muscle contraction
Neurite growth
Cell differentiation
Apoptosis
Which two classes of information are used in gene prediction? Also name two subclasses of each.
Intrinsic
a) Open reading frames (ORFs)
b) Codon usage
c) Presence of RBS (ribosomal binding sites)
d) Periodicity of repeats
Extrinsic
a) Expressed Sequence Tags (ESTs)
b) cDNA-Alignments
c) homology in known Exons
Outline the workflow of GenScan.
GenScan: designed to predict complete gene structures but also partial genes or multiple genes separated by intergenic DNA within a sequence
based on generalized Hidden Markov Models
Model both strands at once
GenScan States:
N – intergenic region
P – Promoter
F – 5‘ untranslated region
T – 3’ untranslated region
A – poly-A
E – Exon (sngl = single, init = initial, term = terminal, k = Phase k internal)
Ik – Phase k intron: 0 – between codons, 1 – after the first base of a codon, 2 – after the second base of a codon
Each state may output a string of symbols (according to some probability distribution).
Explicit intron/exon length modeling
Special sensors for Cap-site and TATA-box, Advanced splice sites
uses dynamic programming to determine the most likely gene structure compatible with the given sequence.
Parallel Unsupervised Training and Prediction:
GenScan initializes all model parameters and uses GeneMark to parse the sequence into "coding" and "non-coding" regions.
The newly labeled sequences are used to re-estimate the model parameters until the model converges.
Given is the following formula:
Explain the individual steps and sketch them.
Recursive definition of the best score for a subsequence i, j -> 4 possibilities:
i, j are a base pair, added to a structure for i+1 ... j-1, add +1
i is unpaired, added to a structure for i+1...j
j is unpaired, added to a structure for i...j-1
i, j are paired, but not with each other: the structure for i...j joins the substructures of two subsequences, i...k and k+1...j (bifurcation)
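The formula itself is not reproduced in these notes. The four cases listed match the Nussinov base-pair maximization recursion, so the sketch below assumes that is the formula meant. It scores +1 per base pair, allows A-U, G-C and G-U pairs (the G-U wobble pair is an added assumption), and by default enforces no minimum hairpin loop size.

```python
def can_pair(a: str, b: str) -> bool:
    """Watson-Crick pairs plus the G-U wobble pair (assumed pairing rules)."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def nussinov(seq: str, min_loop: int = 0) -> int:
    """Maximum number of nested base pairs in an RNA sequence."""
    n = len(seq)
    if n == 0:
        return 0
    score = [[0] * n for _ in range(n)]
    for span in range(1, n):                    # fill by increasing subsequence length
        for i in range(n - span):
            j = i + span
            # Case 2 and 3: i unpaired, or j unpaired.
            best = max(score[i + 1][j], score[i][j - 1])
            # Case 1: i and j pair with each other, +1 on top of the inner structure.
            if can_pair(seq[i], seq[j]) and span > min_loop:
                inner = score[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            # Case 4: bifurcation into i..k and k+1..j.
            for k in range(i + 1, j):
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score[0][n - 1]


print(nussinov("GGGAAAUCC"))
```

The table is filled from short subsequences to long ones, so every value the recursion refers to has already been computed. The three nested loops make the running time cubic in the sequence length.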
How could the formula above be improved?
The formula could be improved by also taking pseudoknots into account and incorporating them.
this base pair maximization will not necessarily lead to the most stable structure
additionally use thermodynamic information:
negative stacking energy for matches
positive destabilizing energies for loops (size-dependent)
(minimum free energy method)
Name all classes of interspersed repeats.
Retrotransposons
LTRs (Long Terminal Repeat Retrotransposons)
LINEs (Long Interspersed Nuclear Elements) [autonomous]
SINEs (Short Interspersed Nuclear Elements) [nonautonomous]
DNA Transposons
TIR (Terminal inverted repeat)
MITE (Miniature Inverted-repeat Transposable Elements)
Name all classes of tandemly repeated DNA.
Tandemly repeated DNA:
Microsatellites
Minisatellites
Cryptically simple repeats
Low complexity repeats
Satellite repeats
Telomeric repeats
Name two properties of interspersed repeats.
Derived from biologically active “transposable elements” (TEs)
Involve RNA intermediates (Retroelements) or DNA intermediates (DNA transposons)
Retroelements: reproduce via reverse transcription followed by integration into the host DNA (LTRs, LINEs, SINEs)
DNA transposons: capable of integrating themselves into, and excising themselves from, the host genome, thus taking advantage of host replication through this “cut-and-paste” mechanism
3 different mechanisms for transposition
conservative transposition
replicative transposition
retrotransposition
Which three other classes of repetitive sequences are there (besides interspersed repeats)?
How do interspersed repeats differ from the forms named in 1.?
Tandemly repeated DNA (simple sequence repeats without interruption)
Satellite and telomeric repeats
Interspersed repeats are mobile and dispersed throughout the genome
The other repetitive sequences are stationary and occur in specific clusters
Interspersed repeats
Retrotransposons (LINEs, SINEs, LTR)
DNA-Transposons
What are SNPs?
SNP stands for Single Nucleotide Polymorphism, it occurs when a single nucleotide replaces one of the other three nucleotide letters in a genome (or in a DNA sequence).
SNPs may occur anywhere: Most SNPs are found outside of coding seqs => SNPs found in a coding seq are of great interest as they are more likely to alter function of a protein
most common type of genetic variation in humans. They account for 90% of the variation between individuals
Most are neutral polymorphisms, some cause disease
density = ~1 every 100-300 bases
Which two classes of SNPs are distinguished, and what is the difference between them?
Coding SNPs: occur within coding region of a gene
synonymous: not causing a change in the amino acid
nonsynonymous: alters the amino acid sequence of the protein, potentially affecting protein function (missense or nonsense mutations)
Non-coding SNPs: occur outside the coding regions of a gene
Regulatory SNPs: positions that fall in regulatory regions of genes
Intronic SNPs: positions that fall within introns
Why can SNPs in coding and non-coding regions lead to disease?
SNPs may be informative with respect to disease:
Functional variation. A SNP associated with a nonsynonymous substitution in a coding region will change the amino acid sequence of a protein
Regulatory variation. A SNP in a noncoding region can influence gene expression
Association. SNPs can be used in whole-genome association studies. SNP frequency is compared between affected and control populations.
Name three differences between plant and animal miRNAs.
Number of miRNA genes present
Plants: 100-200 genes
Animals: 100-500 genes
Location within genome:
Plants: predominantly intergenic regions
Animals: intergenic regions, introns
Presence of miRNA clusters:
Plants: uncommon
Animals: common
miRNA biosynthesis:
Plants: Dicer-like
Animals: Drosha, Dicer
Mechanism of repression:
Plants: mRNA-cleavage (methylation?)
Animals: Translational repression
Location of miRNA-binding motifs
Plants: predominantly in the ORF
Animals: predominantly in the 3’-UTR
Number of miRNA-binding sites within target sites:
Plants: Generally one
Animals: Generally multiple
Function of known target genes:
Plants: Regulatory genes - crucial for development, enzymes
Animals: Regulatory genes - crucial for development, structural proteins, enzymes
Explain the workflow of the TargetScan algorithm.
search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2-8 of the miRNA: “miRNA seed” and “seed matches”
assign a folding free energy G to each such miRNA:target site interaction
sort the UTRs in this organism by Z score and assign a rank R to each
What are covariance models? What is their purpose?
A statistical model that captures the patterns of covariation observed in a multiple sequence alignment (MSA). Covarying bases tend to coevolve, as this ensures that the base pair is maintained and the RNA structure is conserved. RNA structure prediction can be improved by giving positions with greater covariation more weight.
describes both the secondary structure and the primary sequence consensus of an RNA
Can be applied to several RNA analysis problems:
consensus secondary structure prediction
multiple sequence alignment
database similarity searching
Iterative training procedure
Optimal algorithm for RNA secondary structure prediction based on pairwise covariations in multiple alignments
Covariation ensures ability to base pair is maintained and RNA structure is conserved
What data are needed to build them? (covariance models)
Covariance models are constructed automatically
from existing RNA sequence alignments
even from initially unaligned example sequences
What are the disadvantages of covariance models?
Needs to be well trained
Not suitable for searches of large RNA and for database searches
Structural complexity of large RNA cannot be modeled
Runtime
Memory requirements
Can be used for scanning candidate RNAs identified by other methods
very computationally intensive because of the three-dimensional dynamic programming
no tRNA-specific information is supplied, owing to the general approach
Name three differences between prokaryotic and eukaryotic genomes.
Differences:
Prokaryotic genomes are generally much smaller than eukaryotic genomes
Eukaryotic genomes contain a high proportion of non-coding DNA (about 95% in humans), whereas prokaryotic genomes contain only relatively small proportions of non-coding DNA (approx. 5-20%)
Eukaryotic genes have an intron-exon structure, whereas prokaryotic genomes have few or no introns
Prokaryotic genomes are polycistronic (several genes per mRNA), eukaryotic genomes monocistronic
Gene density is lower in eukaryotic genomes because of the many non-coding regions; in prokaryotic genomes it is much higher
Size
prokaryotes between 1s and 10s of Mb
eukaryotes between 1s and 1,000s of Mb
Topology:
prokaryotes: mostly circular
eukaryotes: mostly linear
Gene number:
prokaryotes: most <10,000
eukaryotes: often >10,000
Pseudogenes:
prokaryotes: few
eukaryotes: many
Complexity:
prokaryotes: low
eukaryotes: high
Horizontal gene transfer:
prokaryotes: frequent
eukaryotes: rare
Intergenic regions:
prokaryotes: short (<100kb)
eukaryotes: long (often >100kb)
Genome duplication:
prokaryotes: none
eukaryotes: frequent (especially in plants)
Gene duplication:
prokaryotes: rare
eukaryotes: frequent
Repeated sequences:
prokaryotes: minor components
eukaryotes: major components
How does enlarging the window affect the positive predictive value of an ORF?
(2017)
Eine Vergrößerung des Windows kann dazu führen, dass mehr potenzielle ORFs erkannt werden
= erhöhten Sensitivität (True Positives)
= verringerte Spezifität: erhöhte Anzahl der False Positives , da mehr zufällige Sequenzen als ORFs erkannt werden könnten, die tatsächlich keine Gene sind
=> beeinflusst den positiven Vorhersagewert
Eine optimale Window-Größe muss daher gefunden werden, die das Gleichgewicht zwischen Sensitivität und Spezifität hält, um den höchsten PPV zu erzielen.
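The trade-off above can be made concrete with the definition PPV = TP / (TP + FP). The following is a minimal sketch; the counts for the two window sizes are invented for illustration and do not come from any real ORF finder.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: share of predicted ORFs that are real genes."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("no predictions were made")
    return true_positives / predicted


# Hypothetical counts: the larger window finds more real genes (higher
# sensitivity) but also many more spurious ORFs, so the PPV drops.
small_window = ppv(true_positives=80, false_positives=20)
large_window = ppv(true_positives=90, false_positives=60)
print(small_window, large_window)
```

In this made-up scenario the gain of ten true positives does not offset the forty additional false positives.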
Alternative splicing: adding an exon and still getting a shorter product?
Alternative translation initiation (at a site other than the original one)
Alternative termination by introducing a stop codon in the alternative exon
Shortening (or lengthening) due to a frameshift
Change of the internal region due to an in-frame insertion or deletion
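As an illustration of the stop-codon case, the sketch below translates two toy transcripts with the standard genetic code. The exon sequences are invented; the point is only that the isoform containing the extra exon has a longer mRNA but a shorter protein.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate from the first base until the first stop codon."""
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented toy exons (cDNA alphabet).
exon_1 = "ATGGCTGAA"         # M A E
cassette_exon = "TCTTAAGGA"  # S, then a premature stop codon (TAA)
exon_2 = "GGTCTGAAATAA"      # G L K, then the regular stop codon

skipped = exon_1 + exon_2
included = exon_1 + cassette_exon + exon_2
print(translate(skipped), translate(included))
```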
Match:
TFFM, Kraken, Prodigal, Augustus+, Annovar
to
transcription factor prediction, quality control, prokaryotic gene prediction, eukaryotic gene prediction, functional annotation of genetic variants
TFFM <-> transcription factor prediction (transcription factor binding motif prediction)
Kraken <-> quality control (taxonomic classification of metagenomic sequences)
Prodigal <-> prokaryotic gene prediction
Augustus+ <-> eukaryotic gene prediction
Annovar <-> functional annotation of genetic variants (genomic variation)
Name two effects of alternative splicing when the protein is lengthened.
Effects:
Formation of isoforms with different functions
Changes in protein interactions and signal transduction
Describe a method for analyzing alternative splicing with bioinformatics tools. Pay particular attention to the data required.
TODO
Alignment of ESTs (expressed sequence tags) against DNA (/pre-mRNA?) sequence
Insertions and deletions in the ESTs relative to the [?pre-] mRNA are identified as potential alternative splices
Alternative splices are detected when two splices are mutually exclusive
Requires ESTs which are cDNA sequences derived from mRNA with reverse transcriptase
Which two types of information are used for gene prediction? Give two examples of each.
What is the Kozak sequence?
DNA motif for protein translation initiation site in most eukaryotic mRNA transcripts
a region around start codon
5’-(gcc)gccRccAUGG-3’
(eukaryotic equivalent to Shine-Dalgarno)
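A minimal sketch of how the consensus can be used to flag start codons in a favorable context. It checks only the two positions usually treated as most important, the purine (R) at -3 and the G at +4; that reduction, the function name and the example sequence are assumptions made for illustration.

```python
import re

# Core of gccRccAUGG: purine (A or G) at -3, then AUG, then G at +4.
# The lookahead lets overlapping candidates be reported as well.
STRONG_KOZAK = re.compile(r"(?=[AG]..AUGG)")


def strong_kozak_starts(mrna: str) -> list[int]:
    """Return 0-based positions of AUG codons in a strong Kozak context."""
    mrna = mrna.upper().replace("T", "U")
    return [match.start() + 3 for match in STRONG_KOZAK.finditer(mrna)]


# Invented example: the first AUG (position 4) has C at -3 and C at +4,
# the second AUG (position 14) sits in the context gccAccAUGG.
example = "CCUUAUGCGCCACCAUGGCG"
print(strong_kozak_starts(example))
```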
Describe GeneMarkS-T in full. In particular, address whether and how the number of transcripts has an effect.
GeneMarkS: a self-trained method for prediction of gene starts in microbial genomes
GeneMarkS-T: gene prediction in eukaryotic transcripts, additionally uses homology based inference / integrates transcriptome data
-> The number of available transcripts plays a crucial role in the model's accuracy: more transcripts provide more and better training data and thus more precise predictions.
Self-training method and iterative procedure: the algorithm improves its parameters through iterative analysis of the input data
employs fifth-order Markov Models
uses Markov chain model to represent the statistics of coding and noncoding reading frames
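To illustrate what a fifth-order Markov model of sequence statistics looks like, here is a generic sketch: each base is predicted from the five preceding bases, and a query is scored under a "coding" and a "non-coding" model. This is not GeneMarkS-T's implementation (which, among other things, uses frame-dependent models); the training sequences, the pseudocount and all function names are assumptions.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=5, pseudocount=1.0):
    """Count (context -> next base) transitions; return a log-probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def log_prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount  # 4 possible bases
        return math.log((seen.get(base, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(seq, log_prob, order=5):
    """Sum of log P(base | previous `order` bases) over the sequence."""
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


# Toy training data (invented): more training sequences give better estimates.
coding_model = train_markov(["ATGGCC" * 20])
noncoding_model = train_markov(["ATATAT" * 20])

query = "ATGGCC" * 3
coding_score = log_likelihood(query, coding_model)
noncoding_score = log_likelihood(query, noncoding_model)
print(coding_score > noncoding_score)
```

In an iterative self-training setting, regions scored as coding would be fed back as training data for the next round.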
Explain the base-pairing algorithm with the given recursion formula, using a sketch and your own words. Draw a structure that cannot be predicted. How can the formula be improved further?
Structure that cannot be predicted: pseudoknots
pseudoknots violate the recursive definition of the optimal score S(i,j)
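The recursion formula itself is not reproduced in these notes, so the sketch below assumes the usual Nussinov-style base-pair maximization for S(i,j): position i unpaired, position j unpaired, i paired with j, or a split into two independent subproblems. Because every case combines nested or adjacent intervals, crossing pairs (pseudoknots) can never be produced. The `min_loop` parameter shows one common refinement (a minimum hairpin loop length) and is an addition for illustration.

```python
def nussinov(seq: str, min_loop: int = 0) -> int:
    """Maximum number of nested base pairs (Watson-Crick pairs only)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])  # i or j stays unpaired
            if (seq[i], seq[j]) in pairs and j - i > min_loop:
                best = max(best, S[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):  # bifurcation into two substructures
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


print(nussinov("GGGAAAUCC"), nussinov("GGGAAAUCC", min_loop=3))
```

Further refinements replace the pair count with an energy model, which is the idea behind minimum-free-energy folding.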
Name 2 differences between animals and plants regarding target binding.
Location of miRNA-binding motifs within target genes
Plants: predominantly the ORF
Animals: Predominantly the 3’ UTR
Number of miRNA-binding sites within target genes:
Plants: Generally 1
miRNA Processing
Plants: Dicer-like 1 enzyme, (Drosha gene that is in animal genomes is absent)
Animals: Drosha gene + Dicer processes primary miRNA
Site Location
Plants: Target sites outside the 3’ UTR region of mRNA
Animals: Target sites typically located within 3’ UTR region of mRNA
Describe an algorithm for predicting targets.
Name two advantages and two disadvantages each of hybridization-based and sequencing-based methods, e.g. microarray vs. RNA-seq
Hybridization-based methods (microarrays):
Advantages:
Relatively low cost
Well established in clinical use
relatively fast
Disadvantages:
Analysis only of pre-defined sequences
Dynamic range limited by scanner
high background noise
cross-hybridization possible
Sequencing-based methods (RNA-seq):
Advantages:
identification of alternative splice variants / new transcripts
high sensitivity
broad spectrum of applications
Disadvantages:
relatively high cost
high computational effort
prone to contamination
complexity of data analysis
Briefly describe the RNA-seq procedure.
Identifies the full set of transcripts, including large and small RNAs, novel transcripts from unannotated genes, rare transcripts, splicing isoforms and gene-fusion transcripts
Reveals the complex landscape and dynamics of the transcriptome from yeast to human at an unprecedented level of sensitivity and accuracy
Base-pair-level resolution and a much higher dynamic range of expression levels
Overview of the experimental steps in an RNA sequencing (RNA-seq) protocol
RNA extraction → target enrichment → cDNA → library prep → sequencing → Transcriptome/genome mapping → data analysis
Experimental design: number of replicates, depth of sequencing
Parameters: alignment rate, desired power, significance level, log-fold change
RNA-seq workflow:
Quality control
Alignment of reads to reference genome
Transcriptome assembly
Differential expression
Match the following programs to the correct terms:
Programs: ORPHEUS, SIFT, MiRScan, Genescan, REPuter
Terms: gene prediction, repeats, SNPs, gene prediction in prokaryotes, miRNA
ORPHEUS: gene prediction in bacterial genomes -> gene prediction in prokaryotes
SIFT: Sorting Intolerant From Tolerant substitutions -> SNPs
MiRScan: miRNA
Genescan: gene prediction (eukaryotic genes)
REPuter: repeats
What are pseudogenes? Which two main classes are distinguished?
(Demo questions)
Non-functional genes that are derived from functional genes
Show degenerative features (e.g. missense / nonsense mutations) that prevent expression
2 main classes:
conventional pseudogenes
processed pseudogenes
Explain the Ka/Ks ratio. What does the value say about conservation, and what conclusions can be drawn about the prevailing selection pressure?
(Demo question)
Ka: number of non-synonymous substitutions per non-synonymous site
Ks: number of synonymous substitutions per synonymous site
The higher Ka/Ks, the lower the conservation
Ka/Ks = 1 => no selection pressure (neutral evolution)
Ka/Ks > 1 => positive selection
Ka/Ks < 1 => negative selection (purifying selection)
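The interpretation rules can be written down as a small helper. This is a sketch only: Ka and Ks are taken as given (estimating them from an alignment is a separate step), and the tolerance band around 1 is an arbitrary assumption, not an established cutoff.

```python
def selection_regime(ka: float, ks: float, tolerance: float = 0.05) -> str:
    """Classify the selection pressure implied by a Ka/Ks ratio."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:  # hypothetical band treated as "about 1"
        return "neutral"
    return "positive" if ratio > 1 else "purifying"


# Invented example values.
print(selection_regime(0.02, 0.20))  # strongly conserved gene
print(selection_regime(0.30, 0.10))  # fast-evolving gene
```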
What are the three strategies for gene prediction? Give an example for each.
Content based
Examples: ORFs, codon usage, repeat periodicity, compositional complexity
Site based
Examples: splice sites, TF binding sites, consensus sequences, polyadenylation signals, start / stop codons
Comparative
Examples: inference based on homology, protein sequence similarity (limitation: the modular structure of proteins usually precludes finding the complete gene)
Describe the procedure / name two effects of alternative splicing.
Splicing procedure:
5 critical bases: 5’ splice site / donor splice site (GU), branch point (A), 3’ splice site / acceptor splice site (AG)
cleavage on 5’ splice site of pre-mRNA
reaction between 5’ splice site and branch site leads to formation of lariat-like intermediate
cleavage at 3’ splice site
ligation of exons
Types of AS:
constitutive AS: more than one product is always made from transcribed gene
regulated AS: different forms are generated under different conditions
What are single nucleotide polymorphisms (SNPs)? What other types of polymorphisms do you know?
(2023-2)
What are SNPs?
Other types of polymorphisms
Insertion / Deletions (Indels)
Copy Number Variation
Tandem Repeats
Micro- / Mini-satellites
Structural Variations
Transposable Elements
What is a ribosomal binding site and how can it be used for gene prediction?
A sequence of nucleotides in mRNA that is recognized and bound by the ribosome during the initiation of translation
ca. 3-10 nucleotides before initiation codon
e.g. Shine-Dalgarno (SD) sequence, which is located 5-10 bp upstream of the start codon (AUG)
The identification of RBS is crucial for predicting the location of genes in prokaryotic genomes. Since the RBS is closely associated with the start codon of a gene, finding the RBS can help in pinpointing the start site of protein-coding sequences.
Tools: Glimmer, GeneMark, Prodigal
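A minimal sketch of the idea, assuming the well-known Shine-Dalgarno core AGGAGG and the 5-10 nt spacing mentioned above: a start codon is kept as a candidate gene start only if the motif is found at that distance upstream. Real gene finders such as those listed use statistical motif models rather than an exact string match; the function name and the example sequence are invented.

```python
def starts_with_rbs(genome: str, motif: str = "AGGAGG",
                    min_gap: int = 5, max_gap: int = 10) -> list[int]:
    """Return positions of ATG codons preceded by `motif` within the gap range.

    The gap is the number of nucleotides between the end of the motif
    and the first base of the start codon.
    """
    hits = []
    pos = genome.find("ATG")
    while pos != -1:
        for gap in range(min_gap, max_gap + 1):
            begin = pos - gap - len(motif)
            if begin >= 0 and genome[begin:begin + len(motif)] == motif:
                hits.append(pos)
                break
        pos = genome.find("ATG", pos + 1)
    return hits


# Invented DNA fragment: the first ATG has an RBS 6 nt upstream,
# the second ATG has none and is therefore not reported.
fragment = "TT" + "AGGAGG" + "ACAGCT" + "ATG" + "AAATAA" + "GG" + "ATG" + "CC"
print(starts_with_rbs(fragment))
```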
Describe the two most important normalization techniques used for quantitative transcriptome analyses.
RPKM: reads per kilobase of transcript per million mapped reads
FPKM: fragments per kilobase of transcript per million mapped reads
TPM: transcripts per million
Why length normalization is needed: RNA is fragmented during library construction, so longer transcripts yield more reads at the same expression level
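The two kinds of measure differ only in the order of normalization, which the following sketch makes explicit. RPKM divides by sequencing depth and transcript length; TPM first computes a per-length read rate and then rescales so that the values of a sample sum to one million. The counts and lengths are invented.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [c * 1e9 / (total_reads * length)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate * 1e6 / scale for rate in rates]


# Invented example: gene B is twice as long as gene A and has twice the
# reads, so both end up with the same normalized expression.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 1000]
rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_values, tpm_values)
```

FPKM follows the same formula as RPKM but counts fragments (read pairs) instead of single reads.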
Explain what the reference genome of a species is.
The reference genome of a species is the genome against which new sequencing data are compared. It is regarded as a kind of ideal, so newly sequenced data are compared with it, e.g. to detect errors or gaps.
The reference genome therefore serves as the “perfect” genome for a species, but it can be modified when new information about it is discovered. It may thus change over time to stay up to date.
RNAfold implements one of the methods of RNA secondary structure prediction. What kind of RNA secondary structure is considered optimal in this method?
RNAfold predicts the secondary structure of RNA by finding the structure with the minimum free energy. This structure is considered optimal because it is the most thermodynamically stable configuration.
Explain the similarity-based approach to gene prediction.
Genes in different organisms are similar
The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome
Given a known gene (or a protein) and a genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene
e.g. the known frog gene is aligned to different locations in the human genome -> find the “best” path to reveal the exon structure of the human gene
Explain what pseudogenes are and describe one approach to identify them.
Nonfunctional sequences of genomic DNA that are originally derived from functional genes, but exhibit such degenerative features as premature stop codons and frameshift mutations that prevent their expression
A fundamental feature of pseudogenes is that their nucleotide sequences differ from those of the paralogous functional genes at crucial points
Two types of pseudogenes: conventional, processed
Identification Approach:
Sequence Comparison: Use BLAST to align sequences with known functional genes and look for mutations.
Phylogenetic Analysis: Create phylogenetic trees to compare the evolutionary relationships between functional genes and potential pseudogenes
What is a splice site? Explain two types of splice sites.
A splice site is a boundary between an exon and an intron at which the pre-mRNA is cut during splicing, so that the introns can be removed from the sequence.
There are two types: the 5’ splice site (donor site, GU) marks the start of the intron, and the 3’ splice site (acceptor site, AG) marks its end. Cleavage occurs first at the 5’ splice site and then at the 3’ splice site.
Explain how PolyPhen predicts damaging mutations.
PolyPhen (Polymorphism Phenotyping) predicts the impact of amino acid substitutions on the structure and function of proteins, which helps identify potentially damaging mutations.
Goal: to obtain a lower limit estimate for the quantity of non-synonymous SNPs that might have phenotypic effects
Map known disease mutations onto known 3D structures of proteins
Compare results with a control set of substitutions observed between these proteins and their closely related homologs from other species that are unlikely to cause severe effects on the phenotype
Map a large number of non-synonymous SNPs onto protein structures: thought to be neutral or to be the cause of only minor phenotypic effects
What is Burrows-Wheeler transform and what is it used for?
Produces a permutation of a string that is easier to compress
it is reversible: the original string can be recovered
Used for:
string compression
pattern matching (searching for patterns in strings)
Approach:
form successive circular permutations of the string
sort these lines into alphabetical order
Report the last column
The BWT brings repeats together, facilitating compression
-> Example tool that uses BWT is Bowtie
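The three steps of the approach can be sketched directly. This naive version builds all rotations explicitly, which is fine for short strings; practical tools use suffix arrays or similar index structures instead. The end marker "$" and the function names are assumptions for illustration.

```python
def bwt(text: str, end: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    if end in text:
        raise ValueError("end marker must not occur in the text")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)  # last column


def inverse_bwt(last_column: str, end: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


transformed = bwt("banana")
print(transformed, inverse_bwt(transformed))
```

In the output the repeated letters of "banana" end up next to each other, which is what makes the transformed string easier to compress.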
Name and briefly explain one database or bioinformatics tool related to miRNA. What is it used for?
MirScan:
a bioinformatics tool for predicting miRNA genes: it scans genomic sequence for hairpin (stem-loop) candidates that are conserved across species and scores them against features of known miRNAs
Used to find new miRNA gene candidates (it predicts the miRNAs themselves, not their target sites in mRNAs).
Name and explain two -omics disciplines
Genomics: the study of the complete set of DNA sequences (the genome) in an organism, including all of its genes and non-coding regions
identify DNA sequences of an organism’s genome
Understanding genetic variations
identify genes associated with diseases
example tools: BLAST, Genome Browser
Proteomics: the large-scale study of proteins, particularly their functions, structures, and interactions within a cell or organism.
measuring protein abundance and expression levels
exploring protein functions, interactions, modifications
example tools: STRING, Mascot
Transcriptomics: the study of the complete set of RNA transcripts produced by the genome under specific circumstances or in a specific cell type.
analyzing gene expression and understanding how genes are regulated.
example tools: STAR, DESeq2