Buffl

0. Methods of Genome Analysis

by Carina S.

Explain the FASTQ format.

(2023)

DONE

A common format for exchanging sequence data between different tools; an extension of the FASTA format
In addition to the sequence and other information about it, it can store a numeric quality score for each nucleotide

A FASTQ file normally has 4 lines per sequence:

Line 1: “@” + sequence identifier + optional description (like a FASTA title line)
Line 2: the raw sequence letters
Line 3: begins with a “+” and is optionally followed by the same sequence identifier (and any description)
Line 4: encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence
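
The four-line layout can be read with a few lines of Python. This is a minimal sketch rather than a full parser: it assumes every record occupies exactly four lines (no wrapped sequences) and that the quality line uses the Phred+33 ASCII offset, which is an assumption not stated on this card. The names read_fastq and phred_scores are illustrative.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one symbol per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("record does not follow the 4-line FASTQ layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality symbols to integers (offset 33 is an assumption)."""
    return [ord(symbol) - offset for symbol in quality]


example = io.StringIO("@read1 sample\nACGT\n+\nII5!\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, phred_scores(record.quality))
```

The length check mirrors the rule that Line 4 must contain as many symbols as Line 2 has letters.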

Below follows the output of the GeneMark.hmm program. Please explain what the column “strand” means.

(2023)

DONE

GeneMark is a gene prediction tool for prokaryotes. A gene can be located on either of the two DNA strands, and the strand that serves as the template for transcription differs accordingly. The “strand” column therefore reports on which strand (+ or -) each gene is predicted.

“+” for the forward strand (5’ to 3’)

“-” for the reverse strand (3’ to 5’)

What is the normalization for the sequence length in NGS and what is it used for?

(2023)

DONE

Sequence length normalization in next-generation sequencing (NGS) refers to adjusting read counts for the length of the sequence (gene or transcript) they map to, so that expression values can be compared accurately between features and across samples.

Purpose:

correct for length bias (longer sequences collect more reads at the same expression level)
ensure longer sequences do not disproportionately affect the analysis

Common Metrics:

RPKM (Reads Per Kilobase of transcript per Million mapped reads)
FPKM (Fragments Per Kilobase of transcript per Million mapped reads)
TPM (Transcripts Per Million)
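
The arithmetic behind these metrics can be sketched as follows. This is a minimal sketch under stated assumptions: the total number of mapped reads is taken to be the sum of the counts passed in, and the function names and example numbers are illustrative.

```python
from typing import List, Sequence


def rpkm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)  # assumption: all mapped reads are in `counts`
    return [
        count * 1e9 / (total_reads * length)  # 1e9 = 1e3 (kb) * 1e6 (million)
        for count, length in zip(counts, lengths_bp)
    ]


def tpm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


# Two hypothetical transcripts: the second is twice as long and has twice
# the reads, so both end up with the same normalized expression.
counts = [100, 200]
lengths = [1000, 2000]
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

TPM values always sum to one million within a sample, which is one reason they are easier to compare across samples than RPKM or FPKM.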

Name and explain two types of Alternative Splicing.

(2023)

DONE

cassette exon (exon skipping): an exon may be spliced out of the primary transcript or retained
mutually exclusive exons: one of two exons is retained in mRNAs after splicing, but not both (are “competing”)
competing 5’ splice site (alternative donor site): An alternative 5’ splice junction is used, changing the 3’ boundary of the upstream exon
competing 3’ splice site (alternative acceptor site): An alternative 3’ splice junction is used, changing the 5’ boundary of the downstream exon
retained intron: A sequence may be spliced out as an intron or simply retained. This is distinguished from exon skipping because the retained sequence is not flanked by introns.
multiple promoters (alternative transcription start sites)
multiple poly(A) sites (alternative transcript ends)

Explain one approach to RNA secondary structure prediction.

(2023)

DONE

Two possible approaches:

Energy minimization methods: choose the sets of complementary sequences that yield the most energetically stable molecule
Comparative methods: take into account patterns of base-pairing that are conserved during evolution

Energy minimization method:

every base is compared for complementarity to every other base
The energy of each predicted structure is estimated by the nearest-neighbor rule:
- sum the negative base-stacking energies for each pair of bases in the predicted double-stranded regions
- add positive energies of destabilizing (unpaired) regions
The complementary regions are evaluated by a dynamic programming algorithm to predict the most energetically stable molecule
Programs: e.g. MFOLD, ViennaRNA (RNAfold)

What are N50 and L50 measures?

(2023)

DONE

N50: The length of the shortest contig in the smallest set of longest contigs that together cover 50% of the genome (in practice, 50% of the total assembly length).

higher value = more contiguous assembly with longer contigs, fewer gaps, overall more complete assembly

L50: The smallest number of contigs whose summed length makes up half of the genome size.

lower value = more contiguous assembly, since fewer contigs are needed to reach the 50%; indicates a better-quality assembly

-> both are used to measure the quality of a genome assembly
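
Both measures come out of one pass over the contigs sorted from longest to shortest. The following is a minimal sketch; it assumes the total assembly length is used in place of the true genome size, and the function name and contig lengths are illustrative.

```python
from typing import Iterable, Tuple


def n50_l50(contig_lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    half_total = sum(lengths) / 2
    running_sum = 0
    for contig_count, length in enumerate(lengths, start=1):
        running_sum += length
        if running_sum >= half_total:
            # `length` is the shortest contig needed to reach 50% (N50);
            # `contig_count` is how many contigs were needed (L50).
            return length, contig_count
    raise ValueError("at least one contig is required")


# Hypothetical assembly with a total length of 300:
# 80 + 70 = 150 reaches half of it, so N50 = 70 and L50 = 2.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))
```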

Explain what the Ka/Ks ratio is and what is needed to calculate it.

(2023)

DONE

The Ka/Ks ratio is the ratio of the nonsynonymous substitution rate (Ka) to the synonymous substitution rate (Ks). It is used to test for natural selection / evolutionary tendencies acting on genes or proteins.

To calculate it, an alignment of homologous protein-coding sequences in the correct reading frame is needed, from which synonymous and nonsynonymous substitutions are counted.

The Ka/Ks ratio helps to predict whether a change is likely to be fixed (evolutionary advantage) or lost (negative selection).

The Ka/Ks ratio can indicate the following tendencies:

Ka/Ks > 1 : positive selection
Ka/Ks = 1 : neutral evolution
Ka/Ks < 1 : negative selection -> purifying selection

For pseudogenes, Ka/Ks = 1 is expected
Experimentally observed values are < 1: Ka/Ks is underestimated because the pseudogenes were compared with present-day genes and not with the ancestral functional gene that gave rise to the processed pseudogene

Explain the similarity-based approach to gene prediction.

(2023)

DONE

The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome by e.g. comparison of genomic sequence with homologous genomic sequence from close organisms.
Given a known gene (or a protein) and a genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene.
e.g. a known frog gene is aligned to different locations in the human genome. These local alignments are chained, and the “best” path (a maximum-scoring chain of substrings) reveals the exon structure of the human gene.

-> possible due to evolutionary conservation of gene sequences across different species

Example programs: TWINSCAN, SGP2
Often alignment tools like BLAST or BLAT can be used.

Explain the k-mer approach to finding the repeats in genomes.

(2023)

DONE

Sequences are scanned for overrepresented strings of a certain length
Challenge: determining the optimal size of an oligo (k-mer) and the number of mismatches allowed

Generate k-mers: Slide a window of length k across the genome sequence to generate all possible k-mers.
Count occurrences: Count the frequency of each k-mer in the genome.
Identify repeats: K-mers that appear more frequently than expected are identified as potential repeats.
Analyse the repeats: Locate and map their positions in the genome for a better understanding of their distribution and context; further analyses can differentiate the types of repeats (tandem/dispersed repeats etc.)
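
The generate, count, identify and locate steps can be sketched in a few lines. This is a simplified sketch: it uses exact matching only (no mismatches) and a fixed count threshold instead of a statistical expectation. Both are assumptions made for illustration, as are the function names and the toy sequence.

```python
from collections import Counter
from typing import Dict, List


def count_kmers(sequence: str, k: int) -> Counter:
    """Slide a window of length k over the sequence and count each k-mer."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def overrepresented_kmers(sequence: str, k: int, min_count: int) -> Dict[str, int]:
    """Keep k-mers seen at least `min_count` times (threshold is arbitrary)."""
    counts = count_kmers(sequence, k)
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def kmer_positions(sequence: str, kmer: str) -> List[int]:
    """Return the 0-based start positions of a k-mer in the sequence."""
    sequence = sequence.upper()
    k = len(kmer)
    return [i for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer]


toy_genome = "ACGACGACGTT"
repeats = overrepresented_kmers(toy_genome, k=3, min_count=3)
print(repeats)                            # candidate repeat units
print(kmer_positions(toy_genome, "ACG"))  # adjacent hits suggest a tandem repeat
```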

Applications:

Detecting Repeats: Helps in identifying repetitive sequences in the genome, such as tandem repeats, interspersed repeats, and low-complexity regions.
Genome Assembly: Assists in resolving ambiguous regions during genome assembly by highlighting repetitive sequences.

What are the challenges in identifying motifs in biological sequences?

(2023)

DONE

we don’t know the motif sequence
we don’t know the location relative to gene start
we don’t know the length of the motif
Motifs can differ slightly from one gene to the next / can carry mutations (how many mutations are allowed?)
How can true motifs be distinguished from “random” motifs?

(Consensus sequences help in finding motifs)

-> These and other problems mean that the algorithms have high runtime and memory requirements

What is repeat masking and why is it needed?

(2023)

DONE

Definition: The process of identifying (e.g. using RepeatMasker) and masking repetitive sequences in a genome to prevent them from interfering with genomic analyses.

Purpose:

Improve Assembly: Helps in accurate genome assembly by reducing ambiguity caused by repeats.
Enhance Annotation: Improves gene prediction and annotation by distinguishing between coding and non-coding regions.
Data Quality: Ensures high-quality sequence alignment and variant calling by excluding repetitive regions that can cause errors.

What are two assumptions that can be made when predicting the effect of amino acid substitutions?

(2023)

DONE

Conservation: Substitutions in highly conserved regions are more likely to affect protein function, indicating these regions are crucial for the protein's structure or function.
Physicochemical Properties: Substitutions that drastically change the physicochemical properties (such as charge, size, or hydrophobicity) of the amino acid are more likely to impact the protein's function or stability.

Explain the main steps of the genome assembly process.

(2023)

DONE

Read Generation: Sequencing the genome to produce short DNA sequences (reads) using technologies like Illumina or PacBio.
Read Quality Control: Filtering and cleaning the reads to remove low-quality sequences and contaminants.
Read Overlapping: Finding overlaps between reads to identify how they connect, e.g. using a de Bruijn graph or Overlap-Layout-Consensus (de novo or reference-based)
Contig Assembly: Merging overlapping reads into longer contiguous sequences (contigs).
Scaffolding: Ordering and orienting contigs into larger structures (scaffolds) using paired-end reads or long-read technologies.
Gap Filling: Closing gaps within scaffolds to produce a more complete genome sequence.
Quality Assessment: Evaluating the assembly for accuracy, completeness, and contiguity using metrics like N50 and L50.
Optional annotation by identification of genes (MAKER, AUGUSTUS) and functional annotation (BLAST)

Explain the TargetScan algorithm for predicting mammalian microRNA targets (you can skip the formulas).

(2023)

DONE

TargetScan:

thermodynamics-based modeling of RNA:RNA duplex interactions
comparative sequence analysis

Input:

miRNA that is conserved in multiple organisms
a set of orthologous 3’ UTR sequences from these organisms

Steps:

search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2-8 of the miRNA: “miRNA seed” and “seed matches”
extend each seed match with additional base pairs to the miRNA as far as possible in each direction, allowing G:U pairs, but stopping at mismatches
optimize basepairing of the remaining 3’ portion of the miRNA to the 35 bases of the UTR immediately 5’ of each seed match using the RNAfold program
assign a folding free energy G to each such miRNA:target site interaction
assign a Z score to each UTR
sort the UTRs in this organism by Z score and assign a rank R to each
predict as targets those genes for which both Zi >= Zc and Ri <= Rc for an orthologous UTR sequence in each organism, where Zc and Rc are pre-chosen Z score and rank cutoffs
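
Only the first step, the seed-match search, is simple enough to illustrate in a few lines. The sketch below is not TargetScan itself: the extension, RNAfold and Z-score steps are left out, and the function names and the short UTR fragment are made up for illustration. The miRNA used is the let-7a sequence.

```python
from typing import List

# Watson-Crick complement for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(sequence: str) -> str:
    """Upper-case the sequence and write T as U."""
    return sequence.upper().replace("T", "U")


def mirna_seed(mirna: str) -> str:
    """Bases 2-8 of the miRNA (1-based), i.e. a 7-nt seed."""
    return to_rna(mirna)[1:8]


def seed_match_site(mirna: str) -> str:
    """UTR sequence (5'->3') that pairs perfectly with the seed."""
    return mirna_seed(mirna).translate(COMPLEMENT)[::-1]


def find_seed_matches(mirna: str, utr: str) -> List[int]:
    """0-based start positions of perfect seed matches in the UTR."""
    site = seed_match_site(mirna)
    utr = to_rna(utr)
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)  # allow overlapping hits
    return positions


let7a = "UGAGGUAGUAGGUUGUAUAGUU"
hypothetical_utr = "AAACUACCUCAGGG"
print(mirna_seed(let7a), seed_match_site(let7a))
print(find_seed_matches(let7a, hypothetical_utr))
```

The site is the reverse complement of the seed because the miRNA and the UTR pair in antiparallel orientation.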

What is a Pan-genome?

(2023)

DONE

A pan-genome is the entire set of genes from all strains within a species or clade. It includes:

Core Genome: Genes shared by all strains.
Accessory Genome: Genes not present in all strains, including:
- Dispensable Genes: Present in some but not all strains.
- Unique Genes: Specific to individual strains.

Purpose: To study genetic diversity, evolution, and functional adaptations.

How does the gene density of bacteria differ from that of higher eukaryotes?

(2013)

DONE

Bacteria have a higher gene density than higher eukaryotes

Gene density in prokaryotes (bacteria & archaea): ~90% >>> gene density in eukaryotes: ~1-2%
In bacteria, the gene density is about 1 gene per 1,000-1,400 bases
In higher eukaryotes, the gene density is about 1 gene per 100,000 bases

How is the sequence logo derived with the EM algorithm?

(2013)

DONE

EM algorithms such as MEME (Multiple EM for Motif Elicitation) start from one site among several candidate sites in the target sequences and then alternate between assigning sites to a motif and updating the motif model. Only the best hit per sequence is reported, although lower-scoring hits in the same sequence may also have an effect.
start with initial guesses for the region and size of the site (e.g. the region of a binding site is already known from prior experiments)

expectation step:
- position-wise composition of the site is used to estimate the probability of finding the site at any position of the seqs
- these probabilities are used in turn to provide new information as to the expected base distribution for each column

maximization step: new counts of bases for each position in the site found in E-step are substituted for the previous set

E- and M-steps repeated until convergence (no more changes)
Result:
- best location of the site in each seq
- best estimate of the base composition of each column in the site
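
The alternation between the two steps can be sketched as follows. This is a toy illustration, not MEME: it assumes exactly one site per sequence, a uniform background of 0.25 per base, a fixed number of iterations instead of a convergence test, and random initial guesses. All names and parameters are illustrative.

```python
import random
from typing import Dict, List, Tuple

ALPHABET = "ACGT"


def build_profile(seqs, weights, width, pseudocount) -> List[Dict[str, float]]:
    """M-step: weighted base counts per motif column, normalized to 1."""
    profile = [{base: pseudocount for base in ALPHABET} for _ in range(width)]
    for seq, seq_weights in zip(seqs, weights):
        for start, weight in enumerate(seq_weights):
            for col in range(width):
                profile[col][seq[start + col]] += weight
    for column in profile:
        total = sum(column.values())
        for base in column:
            column[base] /= total
    return profile


def site_probabilities(seqs, profile, width, background=0.25):
    """E-step: probability of the site starting at each position."""
    weights = []
    for seq in seqs:
        scores = []
        for start in range(len(seq) - width + 1):
            score = 1.0
            for col in range(width):
                score *= profile[col][seq[start + col]] / background
            scores.append(score)
        total = sum(scores)
        weights.append([score / total for score in scores])
    return weights


def em_motif(seqs, width, iterations=50, pseudocount=0.5, seed=0) -> Tuple[list, list]:
    rng = random.Random(seed)
    # Initial guess: one random start position per sequence.
    weights = []
    for seq in seqs:
        n_starts = len(seq) - width + 1
        guess = rng.randrange(n_starts)
        weights.append([1.0 if s == guess else 0.0 for s in range(n_starts)])
    profile = build_profile(seqs, weights, width, pseudocount)
    for _ in range(iterations):
        weights = site_probabilities(seqs, profile, width)           # E-step
        profile = build_profile(seqs, weights, width, pseudocount)   # M-step
    best_starts = [max(range(len(w)), key=w.__getitem__) for w in weights]
    return best_starts, profile


starts, profile = em_motif(["ACGTTGACAT", "TTGACAGGCA", "CCATTGACAA"], width=6)
print(starts)
```

The returned profile holds the estimated base composition of each column, which is the input a sequence logo is drawn from. EM can converge to a local optimum, so the result depends on the initial guess.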

What does the height of the stacks in a sequence logo indicate?

(2013)

DONE

The height of the letters in a sequence logo shows the base frequencies scaled by the information content (a measure of conservation) at that position

can be corrected for the background frequencies of the bases
data might include pseudocounts to overcome effects of missing data
the maximum value for DNA bases is 2 bits (log2(4)) -> perfectly conserved
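
The height calculation for one alignment column can be sketched as follows. This is a minimal sketch that assumes a uniform background and applies neither pseudocounts nor a small-sample correction; the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict


def column_information(column: str, alphabet: str = "ACGT") -> float:
    """Information content in bits: log2(alphabet size) minus the entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = 0.0
    for base in alphabet:
        p = counts[base] / n
        if p > 0:
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy


def letter_heights(column: str, alphabet: str = "ACGT") -> Dict[str, float]:
    """Height of each letter = its frequency times the column information."""
    counts = Counter(column)
    info = column_information(column, alphabet)
    return {base: counts[base] / len(column) * info for base in alphabet}


print(column_information("AAAA"))  # perfectly conserved column
print(column_information("ACGT"))  # all four bases equally frequent
print(letter_heights("AACC"))      # two bases at 50% each
```

A perfectly conserved column reaches the 2-bit maximum, while a column with all four bases at equal frequency carries no information and has zero height.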

Why is it important to know about pseudogenes as well?

(2013)

DONE

pseudogenes: Nonfunctional sequences of genomic DNA that are originally derived from functional genes (by gene duplication), but exhibit such degenerative features as premature stop codons and frameshift mutations that prevent their expression
Pseudogenes are nevertheless important, because some play a role in regulating gene activity and are therefore not without function.

might interfere with experiments
- PCR and hybridization experiments
- transcribed pseudogenes
- interference with disease diagnostics and treatment
molecular record of dynamics and evolution of genomes
- rate of nucleotide substitutions
- rate of DNA loss
improvement of gene prediction and annotation efforts

What do "multiplicity" and "co-operativity" mean in the context of miRNA-target interactions?

(2013 / 2017)

DONE

multiplicity: one miRNA can target more than one gene
- some miRNAs appear to be very promiscuous, with hundreds of predicted targets, but most miRNAs control only a few genes

co-operativity: one gene can be controlled by more than one miRNA
- Some target genes appear to be subject to highly cooperative control, but most genes do not have more than four target sites

How does the positive predictive value change when the target closely matches the informant?

(2013)

DONE

Positive predictive value (PPV) = a measure of how likely it is that a positive prediction is actually correct.

PPV = TP / (TP + FP)

If the target closely matches the informant, the PPV increases. This is because the number of correctly predicted positives (TP) rises and the number of false positive predictions (FP) falls.
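
The formula translates directly into code. This is a small sketch with invented counts, only to show how more true positives and fewer false positives raise the PPV.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP)."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("PPV is undefined without positive predictions")
    return true_positives / predicted_positive


# Hypothetical counts: a weakly matching and a closely matching informant.
print(positive_predictive_value(80, 20))
print(positive_predictive_value(90, 10))
```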

How can the addition of an alternative exon to a transcript lead to a truncated protein product?

(2013)

DONE

By using a different site for translation initiation (alternative initiation)
The alternative exon contains a stop codon (alternative termination)
A different translation termination site due to a frameshift (truncation or extension)
A change to the internal region due to an in-frame insertion or deletion

Name one possible origin of operons.

(2013)

-> not covered in the lecture

Operons may have originated in thermophilic organisms, because organizing genes into operons enables the association of functionally related protein products, which can thereby protect each other from thermal degradation.
Role of horizontal gene transfer: the advantage of transferring complete sets of genes and thus conferring a defined phenotype on the recipient
possibly originating from thermophilic bacteria

How does an increase in the frameshift of the ORF length affect the accuracy of the prediction?

(2013)

DONE

The accuracy of the prediction increases as the frameshift of the ORF length is increased, because longer ORFs are more likely to correspond to actual genes.
presumably refers to GeneMark? -> higher sensitivity, lower specificity
Analysis window: a larger window increases sensitivity and false positives
Frameshift: more frameshift errors reduce prediction accuracy

Name the protein complex responsible for converting animal pre-miRNA into miRNA.

(2013)

DONE

Protein complex: the RNA-induced silencing complex (RISC) contains the enzyme Dicer, which cleaves the pre-miRNA into miRNA

What are the properties of a strong promoter?

(2013)

DONE

A strong promoter is very similar to the consensus sequence.
(a weak promoter differs more from the consensus sequence)
DNA sequence that enables a high transcription rate
- binds RNA polymerase efficiently and promotes robust transcription initiation
- has a high affinity for RNA polymerase, which enables efficient binding and initiation of transcription
- presence of specific sequence motifs within the promoter region

What is a sigma factor? What is it needed for in transcription? Which sigma factor occurs most frequently?

(2013)

DONE

Sigma factors are proteins that form part of the RNA polymerase complex that binds to the promoter.
They are required for the initiation of transcription.
Most common sigma factor: σ^70 (housekeeping sigma factor of E. coli; controls transcription)
There are several interchangeable sigma factors, each of which recognizes a particular group of promoters (e.g. promoters of housekeeping or heat-shock genes)

Name three differences between the whole-genome shotgun and the clone-by-clone approach.

(2013)

DONE

Clone-by-Clone:
- physical mapping: requires construction of clone-based physical map
- assembly: easier to resolve complex genomic regions as position of contigs is already known (due to the physical mapping)
- labor intensity: physical mapping is labor intensive, but after mapping, the clones can be divided between different labs for sequencing
Whole Genome Shotgun:
- physical mapping: mapping phase is skipped and subclone library is constructed from entire genome
- assembly: order / position of contigs needs to be inferred from overlapping reads and read pairs which can be problematic for tandemly repeated DNA (incorrect overlaps)
- labor intensity: less labor intensive, but requires more computational resources

Clone-by-clone shotgun: BAC clones are required; the clones are mapped beforehand

Whole-genome shotgun: the mapping phase is skipped; assembly takes longer

Clone-by-clone shotgun sequencing

(2013)

DONE

Selection of a clone (e.g. a BAC).
Purification and physical fragmentation of the BAC DNA.
Subcloning of the DNA fragments (2-5 kb).
Generation of sequence reads from the subclones (several thousand reads per BAC).
Assembly of the reads into a draft sequence based on sequence overlaps.
Identification and repair of gaps and regions of poor sequence quality using additional sequence data.

Whole-genome shotgun sequencing

(2013)

DONE

The mapping phase is skipped
Shotgun sequencing proceeds using subclone libraries prepared from the entire genome
Typically, tens of millions of sequence reads are generated
Computational tools are used to assemble contigs from the reads.
Remaining gaps are closed by subsequent procedures to obtain a complete genomic sequence.

Which approach tends to be used for prokaryotic genomes and which for eukaryotic genomes? Explain exactly why this is the case. (Clone-by-clone or whole-genome shotgun)

DONE

Whole-genome shotgun (WGS) for prokaryotic genomes:

Prokaryotic genomes are smaller and less complex.
WGS is faster and cheaper because the mapping phase is omitted.
Fewer repetitive DNA sequences make assembly easier.

Clone-by-clone for eukaryotic genomes:

Eukaryotic genomes are larger and more complex.
Physical mapping helps to manage the complexity and facilitates assembly.
Repetitive sequences are handled through physical mapping.
Efficient division of labor by distributing the clones among different labs.
Easier assembly of complex genomic regions.

(Approaches can be combined in a hybrid shotgun-sequencing approach)

Name four alternative splicing variants.

(2013)

DONE

How can one find out whether alternative splicing has taken place?

(2013)

DONE

Genome-wide analysis:

Genome sequence assemblies and EST sequences
EST clustering from UniGene
BLAST search:
- candidate gene regions (BLAST threshold: E-value < 10^-5)
- putative (short) exons (BLAST threshold: E-value < 10^-10)
Alignment of the genomic regions with ESTs by dynamic programming
Splicing detection by a computational procedure

Verification:

RNA isoform analysis by RT-PCR (different PCR product lengths)
Microarrays with exon-exon junction probes

What are the consequences if the protein product becomes larger through alternative splicing?

(2013)

DONE

Consequences of new protein parts:

alter protein binding properties, e.g. receptor / ligand
alter intracellular localization, e.g. membrane insertion
alter extracellular localization, e.g. secretion
alter enzymatic or signaling activities
alter protein stability, e.g. inclusion of cleavage sites
insert post-translational modification domains
change ion channel properties

What are known roles of alternative splicing?

(-)

Influence RNA function
- AS occurs in the 5’ and 3’ UTR regions and alters them
- Proposed roles in subcellular localization and RNA stability
Coordinated Regulation of Biological Events
- Neuron development (Dscam)
- Channel activity associated with hearing (slo)
- Muscle contraction
- Neurite growth
- Cell differentiation
- Apoptosis

Which two classes of information are used in gene prediction? Also name two subclasses of each.

(2013)

DONE

Intrinsic
a) Open reading frames (ORFs)
b) Codon usage
c) Presence of RBS (ribosomal binding sites)
d) Periodicity of repeats
Extrinsic
a) Expressed Sequence Tags (ESTs)
b) cDNA alignments
c) Homology to known exons

Outline the workflow of GenScan.

(2013)

DONE

GenScan: designed to predict complete gene structures but also partial genes or multiple genes separated by intergenic DNA within a sequence

based on generalized Hidden Markov Models
Model both strands at once

GenScan States:

N – intergenic region
P – Promoter
F – 5’ untranslated region
T – 3’ untranslated region
A – poly-A
E – Exon (sngl = single, init = initial, term = terminal, k = Phase k internal)
Ik – Phase k intron: 0 – between codons, 1 – after the first base of a codon, 2 – after the second base of a codon

Each state may output a string of symbols (according to some probability distribution).
Explicit intron/exon length modeling
Special sensors for Cap-site and TATA-box, Advanced splice sites

uses dynamic programming to determine the most likely gene structure compatible with the given sequence.

Parallel Unsupervised Training and Prediction:

GenScan initializes all model parameters and uses GeneMark to parse the sequence into "coding" and "non-coding" regions.
The newly labeled sequences are used to re-estimate the model parameters until the model converges.

The following formula is given:

Explain the individual steps and sketch them.

DONE

Recursive definition of the best score for a subsequence i, j -> 4 possibilities:

i, j form a base pair, added to a structure for i+1 ... j-1; add +1
i is unpaired, added to a structure for i+1 ... j
j is unpaired, added to a structure for i ... j-1
i, j are paired, but not to each other: the structure for i ... j joins the substructures of two subsequences, i ... k and k+1 ... j (bifurcation)
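
The formula itself is not reproduced on this card. The four cases match a Nussinov-style base-pair maximization, which can be sketched as below. The sketch assumes only Watson-Crick pairs count (no G:U) and imposes no minimum hairpin-loop length; both choices, like the function names, are illustrative assumptions.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def can_pair(a: str, b: str) -> bool:
    return (a, b) in WATSON_CRICK


def max_base_pairs(seq: str) -> int:
    """Maximum number of nested base pairs in seq (score of 1 per pair)."""
    n = len(seq)
    if n == 0:
        return 0
    # score[i][j] = best score for the subsequence i..j (inclusive).
    score = [[0] * n for _ in range(n)]
    for span in range(1, n):            # fill by increasing subsequence length
        for i in range(n - span):
            j = i + span
            # Cases 2 and 3: i unpaired, or j unpaired.
            best = max(score[i + 1][j], score[i][j - 1])
            # Case 1: i pairs with j, added to the structure for i+1..j-1.
            if can_pair(seq[i], seq[j]):
                inner = score[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            # Case 4: bifurcation into i..k and k+1..j.
            for k in range(i + 1, j):
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # three nested G-C pairs
```

Filling the table by increasing subsequence length ensures that every value a cell depends on has already been computed.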

The following formula is given:

How could the above formula be improved?

(2013)

DONE

The formula could be improved by also taking pseudoknots into account and incorporating them into the formula.
Base-pair maximization will not necessarily lead to the most stable structure
- additionally use thermodynamic information:
  - negative stacking energy for matches
  - positive destabilizing energies for loops (size-dependent)
- (minimum free energy method)

Name all classes of interspersed repeats.

(2013)

DONE

Retrotransposons
- LTRs (Long Terminal Repeat Retrotransposons)
- LINEs (Long Interspersed Nuclear Elements) [autonomous]
- SINEs (Short Interspersed Nuclear Elements) [nonautonomous]
DNA Transposons
- TIR (Terminal inverted repeat)
- MITE (Miniature Inverted-repeat Transposable Elements)

Name all classes of tandemly repeated DNA.

Tandemly repeated DNA:
- Microsatellites
- Minisatellites
- Cryptically simple repeats
- Low complexity repeats
- Satellite repeats
- Telomeric repeats

Name two properties of interspersed repeats.

(2013)

DONE

Derived from biologically active “transposable elements” (TEs)
Involve RNA intermediates (Retroelements) or DNA intermediates (DNA transposons)
Retroelements: reproduce via reverse transcription followed by integration into the host DNA (LTRs, LINEs, SINEs)
DNA transposons: capable of integrating themselves into, and excising themselves from, the host genome, thus taking advantage of the host replication through this “cut-and-paste” mechanism
3 different mechanisms for transposition
- conservative transposition
- replicative transposition
- retrotransposition

Which three other classes of repetitive sequences exist (besides interspersed repeats)?
How do interspersed repeats differ from the classes named in part 1?

(2013)

DONE

Tandemly repeated DNA (simple sequence repeats without interruption)
- Microsatellites
- Minisatellites
- Satellite and telomeric repeats
Cryptically simple repeats
Low complexity repeats

Interspersed repeats are mobile and dispersed throughout the genome
The other repetitive sequences are stationary and occur in specific clusters

Interspersed repeats

Retrotransposons (LINEs, SINEs, LTR)
DNA-Transposons

What are SNPs?

(2013)

DONE

SNP stands for Single Nucleotide Polymorphism. It occurs when a single nucleotide is replaced by one of the other three nucleotides at a position in a genome (or in a DNA sequence).
SNPs may occur anywhere. Most SNPs are found outside of coding seqs => SNPs found in a coding seq are of great interest as they are more likely to alter the function of a protein
Most common type of genetic variation in humans; they account for about 90% of the variation between individuals
Most are neutral polymorphisms, some cause disease
density = ~1 every 100-300 bases

Which two classes of SNPs are distinguished, and what is the difference between them?

(2013)

DONE

Coding SNPs: occur within coding region of a gene

synonymous: not causing a change in the amino acid
nonsynonymous: alters the amino acid sequence of the protein, potentially affecting protein function (missense or nonsense mutations)

Non-coding SNPs: occur outside the coding regions of a gene

Regulatory SNPs: positions that fall in regulatory regions of genes

Intronic SNPs: positions that fall within introns
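
For coding SNPs, the synonymous/nonsynonymous distinction can be checked against the genetic code. The sketch below assumes the standard genetic code and a codon given on the coding strand; the function name and return labels are illustrative, and changes of a stop codon into an amino acid are not distinguished from other missense changes.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_coding_snp(codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change at `position` (0, 1 or 2) of a codon."""
    codon = codon.upper()
    alt_base = alt_base.upper()
    if len(codon) != 3 or position not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and a position of 0, 1 or 2")
    mutated = codon[:position] + alt_base + codon[position + 1:]
    reference_aa = CODON_TABLE[codon]
    mutated_aa = CODON_TABLE[mutated]
    if reference_aa == mutated_aa:
        return "synonymous"
    if mutated_aa == "*":
        return "nonsynonymous (nonsense)"
    return "nonsynonymous (missense)"


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
print(classify_coding_snp("GAG", 1, "T"))  # GAG -> GTG, glutamate to valine
print(classify_coding_snp("TGG", 2, "A"))  # TGG -> TGA, tryptophan to stop
```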

Why can SNPs in coding and non-coding regions lead to diseases?

(2013)

DONE

SNPs may be informative with respect to disease:

Functional variation. A SNP associated with a nonsynonymous substitution in a coding region will change the amino acid sequence of a protein
Regulatory variation. A SNP in a noncoding region can influence gene expression
Association. SNPs can be used in whole-genome association studies. SNP frequency is compared between affected and control populations.

Name three differences between plant and animal miRNAs.

(2013)

DONE

Number of miRNA genes present
- Plants: 100-200 genes
- Animals: 100-500 genes
Location within genome:
- Plants: predominantly intergenic regions
- Animals: intergenic regions, introns
Presence of miRNA clusters:
- Plants: uncommon
- Animals: common
miRNA biosynthesis:
- Plants: Dicer-like
- Animals: Drosha, Dicer
Mechanism of repression:
- Plants: mRNA-cleavage (methylation?)
- Animals: Translational repression
Location of miRNA-binding motifs
- Plants: predominantly in the ORF
- Animals: predominantly in the 3’-UTR
Number of miRNA-binding sites within target sites:
- Plants: Generally one
- Animals: Generally multiple
Function of known target genes:
- Plants: Regulatory genes - crucial for development, enzymes
- Animals: Regulatory genes - crucial for development, structural proteins, enzymes

Explain the workflow of the TargetScan algorithm.

(2013)

DONE

TargetScan:

thermodynamics-based modeling of RNA:RNA duplex interactions
comparative sequence analysis

Input:

miRNA that is conserved in multiple organisms
a set of orthologous 3’ UTR sequences from these organisms

Steps:

search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2-8 of the miRNA: “miRNA seed” and “seed matches”
extend each seed match with additional base pairs to the miRNA as far as possible in each direction, allowing G:U pairs, but stopping at mismatches
optimize basepairing of the remaining 3’ portion of the miRNA to the 35 bases of the UTR immediately 5’ of each seed match using the RNAfold program
assign a folding free energy G to each such miRNA:target site interaction
assign a Z score to each UTR
sort the UTRs in this organism by Z score and assign a rank R to each
predict as targets those genes for which both Zi >= Zc and Ri <= Rc for an orthologous UTR sequence in each organism, where Zc and Rc are pre-chosen Z score and rank

What are covariance models? What is their purpose?

(2013)

DONE

Statistical model that captures the patterns of covariation that can be obtained from an MSA. Covarying bases tend to coevolve, as this ensures that the base pair is maintained and the RNA structure is conserved. RNA structure prediction can be improved by giving positions with greater covariation more weight.

describes both the secondary structure and the primary sequence consensus of an RNA
Can be applied to several RNA analysis problems:
- consensus secondary structure prediction
- multiple sequence alignment
- database similarity searching
Iterative training procedure
Optimal algorithm for RNA secondary structure prediction based on pairwise covariations in multiple alignments
Covariation ensures ability to base pair is maintained and RNA structure is conserved

Which data are needed to build them? (covariance models)

(2013)

DONE

Covariance models are constructed automatically
- from existing RNA sequence alignments
- even from initially unaligned example sequences

What are the disadvantages of covariance models?

(2013)

DONE

Needs to be well trained

Not suitable for searches of large RNA and for database searches
- Structural complexity of large RNA cannot be modeled
- Runtime
- Memory requirements
Can be used for scanning candidate RNAs identified by other methods

very computationally intensive because of the three-dimensional dynamic programming
no tRNA-specific information is specified, owing to the general approach

Name three differences between prokaryotic and eukaryotic genomes.

(2013)

DONE

Differences:

Prokaryotic genomes are generally much smaller than eukaryotic genomes
Eukaryotic genomes contain a high proportion of noncoding DNA (about 95% in humans), whereas prokaryotic genomes contain only relatively small proportions of noncoding DNA (approx. 5-20%)
Eukaryotic genes have an intron-exon structure, whereas prokaryotic genomes have few or no introns
Prokaryotic transcripts are often polycistronic, eukaryotic transcripts are monocistronic
Gene density is lower in eukaryotic genomes because of the many noncoding regions; in prokaryotic genomes it is much higher

Size
- prokaryotes between 1s and 10s of Mb
- eukaryotes between 1s and 1,000s of Mb
Topology:
- prokaryotes: mostly circular
- eukaryotes: mostly linear
Gene number:
- prokaryotes: most <10,000
- eukaryotes: often >10,000
Pseudogenes:
- prokaryotes: few
- eukaryotes: many
Complexity:
- prokaryotes: low
- eukaryotes: high
Horizontal gene transfer:
- prokaryotes: frequent
- eukaryotes: rare
Intergenic regions:
- prokaryotes: short (<100kb)
- eukaryotes: long (often >100kb)
Genome duplication:
- prokaryotes: none
- eukaryotes: frequent (especially in plants)
Gene duplication:
- prokaryotes: rare
- eukaryotes: frequent
Repeated sequences:
- prokaryotes: minor components
- eukaryotes: major components

How does enlarging the window affect the positive predictive value of an ORF?

(2017)

DONE

Enlarging the window can lead to more potential ORFs being detected

= increased sensitivity (more true positives)
= decreased specificity: more false positives, because more random sequences that are not actually genes may be detected as ORFs

=> affects the positive predictive value (PPV)

An optimal window size must therefore be found that balances sensitivity and specificity in order to achieve the highest PPV.
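
The trade-off can be made concrete with the standard definitions of sensitivity, specificity and PPV. The sketch below uses invented counts for a "small" and a "large" window purely for illustration; they are not measurements from any tool.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, PPV) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of real genes that were found
    specificity = tn / (tn + fp)   # share of non-genes correctly rejected
    ppv = tp / (tp + fp)           # share of predictions that are real genes
    return sensitivity, specificity, ppv


# Hypothetical counts: the larger window finds more true ORFs
# but also accepts many more random ORFs.
small_window = classification_metrics(tp=80, fp=20, tn=180, fn=20)
large_window = classification_metrics(tp=95, fp=95, tn=105, fn=5)
```

With these made-up numbers the sensitivity rises while the PPV drops, because the false positives grow faster than the true positives.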

Alternative splicing: adding an exon and still getting a shorter product?

(2017)

DONE

Alternative initiation of translation (at a site other than the original one)
Alternative termination through addition of a stop codon in the alternative exon
Shortening (or lengthening) due to a frameshift
Change of the internal region due to an in-frame insertion or deletion

Match:

TFFM, Kraken, Prodigal, Augustus+, Annovar

Transcription factor prediction, Quality Control, prokaryotic gene prediction, eukaryotic gene prediction, Functional Annotation of Genetic Variants

(2017)

DONE

TFFM <-> Transcription factor prediction (Transcription Factor Binding Motif Prediction)

Kraken <-> Quality Control (Taxonomic Classification of Metagenomic Sequences)

Prodigal <-> prokaryotic gene prediction

Augustus+ <-> eukaryotic gene prediction

Annovar <-> Functional Annotation of Genetic Variants (Genomic Variation)

Name two effects of alternative splicing when the protein is lengthened.

(2017)

DONE

Consequences of new protein parts:

alter protein binding properties, e.g. receptor / ligand
alter intracellular localization, e.g. membrane insertion
alter extracellular localization, e.g. secretion
alter enzymatic or signaling activities
alter protein stability, e.g. inclusion of cleavage sites
Insertion of post-translation modification domains
Change ion channel properties

Effect:

Formation of isoforms with different functions
Altered protein interactions and signal transduction

Describe a method for analyzing alternative splicing with bioinformatics tools. Address in particular the data that are required.

(2017)

TODO

Alignment of ESTs (expressed sequence tags) against DNA (/pre-mRNA?) sequence
Insertions and deletions in the ESTs relative to the [?pre-] mRNA are identified as potential alternative splices
Alternative splices are detected when two splices are mutually exclusive
Requires ESTs which are cDNA sequences derived from mRNA with reverse transcriptase

Which two types of information are used for gene prediction? Give 2 examples of each.

(2017)

DONE

Intrinsic
a) Open reading frames (ORFs)
b) Codon usage
c) Presence of RBS (ribosomal binding sites)
d) Periodicity of repeats
Extrinsic
a) Expressed Sequence Tags (ESTs)
b) cDNA-Alignments
c) homology in known Exons

What is the Kozak sequence?

(2017)

DONE

Sequence motif that marks the translation initiation site in most eukaryotic mRNA transcripts
a region around start codon
5’-(gcc)gccRccAUGG-3’
(eukaryotic equivalent to Shine-Dalgarno)
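
As a rough illustration, the consensus above can be checked at its two most informative positions: the purine (R) at -3 and the G at +4, counting the A of AUG as +1. The function name `kozak_context` and the strong/adequate/weak labels are assumptions of this sketch, following a commonly used convention rather than anything stated in these notes.

```python
def kozak_context(mrna, aug_index):
    """Classify the sequence context of the AUG starting at aug_index.

    Only positions -3 (purine) and +4 (G) of the consensus
    (gcc)gccRccAUGG are inspected.
    """
    mrna = mrna.upper().replace("T", "U")
    if mrna[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    if aug_index < 3 or aug_index + 3 >= len(mrna):
        raise ValueError("not enough flanking sequence")
    purine_at_minus3 = mrna[aug_index - 3] in "AG"
    g_at_plus4 = mrna[aug_index + 3] == "G"
    matches = int(purine_at_minus3) + int(g_at_plus4)
    return {2: "strong", 1: "adequate", 0: "weak"}[matches]


# Toy sequences built around the consensus from the notes.
full_consensus = kozak_context("GCCGCCACCAUGGCG", 9)
no_match = kozak_context("GCCGCCUCCAUGUCG", 9)
```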

Write out GeneMarkS-T completely. In particular, address whether and how the number of transcripts affects it.

(2017)

DONE

GeneMarkS: a self-trained method for prediction of gene starts in microbial genomes

GeneMarkS-T: gene prediction in eukaryotic transcripts, additionally uses homology based inference / integrates transcriptome data

—> The number of available transcripts plays a crucial role in the model's accuracy, as more transcripts lead to more and better training data and thus more precise predictions.

Self-training method and iterative procedure: the algorithm can improve its parameters through iterative analysis of the input data
employs fifth-order Markov Models
uses Markov chain model to represent the statistics of coding and noncoding reading frames
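
To show what "a Markov chain model of coding and noncoding statistics" can look like, here is a heavily simplified sketch: two homogeneous k-th order Markov chains are trained on example sequences and a candidate region is scored by its log-odds. This is not GeneMarkS code; the real method uses fifth-order, frame-dependent models and iterative self-training, and the names `train_markov` and `log_odds` as well as the toy training data are hypothetical.

```python
import math
from collections import defaultdict


def train_markov(sequences, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount  # 4-letter DNA alphabet
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order):
    """Positive score: sequence looks more like the coding training set."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score


# Toy training sets (invented); order 2 instead of 5 to keep the example small.
coding_model = train_markov(["ATGGCCGCCGCCGCC"], order=2)
noncoding_model = train_markov(["ATATATATATATATAT"], order=2)
gc_rich_score = log_odds("GCCGCCGCC", coding_model, noncoding_model, order=2)
at_rich_score = log_odds("ATATATAT", coding_model, noncoding_model, order=2)
```

The sketch also hints at why the amount of training data matters: a fifth-order model has 4^5 contexts per frame, so sparse input leaves many transition probabilities poorly estimated.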

Explain the base-pairing algorithm with the given recursion formula, using a sketch and your own words. Draw a structure that cannot be predicted. How could the formula be improved?

(2017)

DONE

Structure that cannot be predicted: pseudoknots

pseudoknots violate the recursive definition of the optimal score S(i,j)

The formula could be improved by also taking pseudoknots into account and building them into the recursion.
this base pair maximization will not necessarily lead to the most stable structure
- additionally use thermodynamic information:
  - negative stacking energy for matches
  - positive destabilizing energies for loops (size-dependent)
- (minimum free energy method)
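
The recursion formula from the exam is not reproduced in these notes, so the sketch below assumes the standard Nussinov-style base pair maximization: S(i,j) is the maximum of S(i+1,j), S(i,j-1), S(i+1,j-1)+1 if i and j can pair, and S(i,k)+S(k+1,j) over all split points k. Allowing G:U pairs and requiring a minimum hairpin loop of 3 bases are additional assumptions.

```python
def nussinov_max_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (no pseudoknots)."""
    can_pair = {("A", "U"), ("U", "A"), ("G", "C"),
                ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]
    # Fill the table by increasing distance between i and j.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])       # i or j unpaired
            if (rna[i], rna[j]) in can_pair:
                best = max(best, S[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):                   # bifurcation
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


hairpin_pairs = nussinov_max_pairs("GGGAAAUCC")
```

Because every case splits the interval [i, j] into independent, non-crossing subintervals, crossing pairs (pseudoknots) can never be scored, which is the limitation named above.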

Name 2 differences between animals and plants regarding target binding.

(2017)

DONE

Location of miRNA-binding motifs within target genes

Plants: predominantly the ORF
Animals: Predominantly the 3’ UTR

Number of miRNA-binding sites within target genes:

Plants: Generally 1
Animals: Generally multiple

miRNA Processing

Plants: Dicer-like 1 enzyme, (Drosha gene that is in animal genomes is absent)
Animals: Drosha gene + Dicer processes primary miRNA

Site Location

Plants: Target sites outside the 3’ UTR region of mRNA
Animals: Target sites typically located within 3’ UTR region of mRNA

Describe an algorithm for predicting targets.

(2017)

DONE

TargetScan:

thermodynamics-based modeling of RNA:RNA duplex interactions
comparative sequence analysis

Input:

miRNA that is conserved in multiple organisms
a set of orthologous 3’ UTR sequences from these organisms

Steps:

search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2-8 of the miRNA: “miRNA seed” and “seed matches”
extend each seed match with additional base pairs to the miRNA as far as possible in each direction, allowing G:U pairs, but stopping at mismatches
optimize basepairing of the remaining 3’ portion of the miRNA to the 35 bases of the UTR immediately 5’ of each seed match using the RNAfold program
assign a folding free energy G to each such miRNA:target site interaction
assign a Z score to each UTR
sort the UTRs in this organism by Z score and assign a rank R to each
predict as targets those genes for which both Zi >= Zc and Ri <= Rc for an orthologous UTR sequence in each organism, where Zc and Rc are pre-chosen Z score and rank
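
The first step above, the seed-match search, can be sketched as follows. This only covers finding perfect Watson-Crick matches to bases 2-8 of the miRNA; the extension, folding and scoring steps are not implemented, and the function names and the toy UTR are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def find_seed_matches(mirna, utr):
    """Return the seed (bases 2-8, 1-based) and all 0-based start
    positions in the UTR that are perfectly complementary to it."""
    seed = mirna[1:8]
    site = reverse_complement(seed)
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)
    return seed, positions


# Toy input: a 22-nt miRNA and an invented UTR containing two seed matches.
seed, hits = find_seed_matches("UGAGGUAGUAGGUUGUAUAGUU",
                               "AAACUACCUCAGGGCUACCUCUU")
```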

Name two advantages and two disadvantages each of hybridization-based and sequencing-based methods, e.g. microarray vs. RNA-seq

(2017)

DONE

Hybridization-based methods (microarrays):

Advantages:
- Relatively low cost
- Well established in clinical use
- relatively fast
Disadvantages:
- Analysis only of pre-defined sequences
- Dynamic range limited by scanner
- high background noise
- cross-hybridization possible

Sequencing-based methods (RNA-seq)

Advantages:
- identification of alternative splice variants / new transcript
- high sensitivity
- broad spectrum of applications
Disadvantages:
- relatively high cost
- high computational effort
- prone to contamination
- complexity of data analysis

Briefly describe the RNA-seq procedure

(2017)

TODO

Identifies the full set of transcripts, including large and small RNAs, novel transcripts from unannotated genes, rare transcripts, splicing isoforms and gene-fusion transcripts

Reveals the complex landscape and dynamics of the transcriptome from yeast to human at an unprecedented level of sensitivity and accuracy

Base-pair-level resolution and a much higher dynamic range of expression levels

Overview of the experimental steps in an RNA sequencing (RNA-seq) protocol

RNA extraction → target enrichment → cDNA → library prep → sequencing → Transcriptome/genome mapping → data analysis

Experimental design: number of replicates, depth of sequencing

Parameters: alignment rate, desired power, significance level, log-fold change

RNA-seq workflow:

Quality control
Alignment of reads to reference genome
Transcriptome assembly
Differential expression

Match the following programs to the correct terms:

Programs: ORPHEUS, SIFT, MiRScan, Genescan, REPuter

Terms: gene prediction, repeats, SNPs, prokaryotic gene prediction, miRNA

DONE

ORPHEUS: gene prediction in bacterial genomes, i.e. prokaryotic gene prediction

SIFT: Sort intolerant from tolerant substitutions, SNPs

MiRScan: miRNA

Genescan: gene prediction (eukaryotic)

REPuter: Repeats

Which method is used more for prokaryotic genomes and which for eukaryotic genomes? Explain exactly why this is the case. (Clone-by-clone or whole-genome shotgun)

DONE

What are pseudogenes? What are the two main classes distinguished?

(Demo questions)

DONE

nonfunctional genes that originated from functional genes
have degenerative features (missense / nonsense mutations) that prevent expression
2 main classes:
- conventional pseudogenes
- processed pseudogenes

Explain the Ka/Ks ratio. What does the value say about conservation, what conclusions can be made about the selection pressure?

(Demo question)

DONE

Ka: number of nonsynonymous substitutions per nonsynonymous site
Ks: number of synonymous substitutions per synonymous site
The higher Ka/Ks, the lower the conservation
Ka/Ks = 1 => no selection pressure (neutral evolution)
Ka/Ks > 1 => positive selection pressure (positive selection)
Ka/Ks < 1 => negative selection pressure (purifying selection)

For pseudogenes, Ka/Ks = 1 is expected
Experimentally, values < 1 are observed: Ka/Ks is underestimated because the pseudogenes were compared with present-day genes and not with the ancestral functional gene that gave rise to the processed pseudogene
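
The interpretation rules can be written down as a small helper. This is only a sketch: the function name `interpret_ka_ks` and the tolerance band around 1 are assumptions, and estimating Ka and Ks themselves from an alignment is not shown.

```python
def interpret_ka_ks(ka, ks, tolerance=0.05):
    """Return (ratio, label) for given Ka and Ks estimates.

    `tolerance` is an arbitrary band around 1 that is still called
    neutral; real analyses would use a statistical test instead.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        label = "neutral evolution"
    elif ratio > 1.0:
        label = "positive selection"
    else:
        label = "purifying selection"
    return ratio, label


# Hypothetical estimates for a conserved gene.
ratio, label = interpret_ka_ks(ka=0.02, ks=0.10)
```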

What are the three strategies for gene prediction? Give an example for each.

(Demo question)

DONE

Content based
- Examples: ORFs, codon usage, repeat periodicity, compositional complexity
Site based
- Examples: splice sites, TF binding sites, consensus sequences, polyadenylation signals, start / stop codons
Comparative
- Examples: inference based on homology, protein sequence similarity (limitation: the modular structure of proteins usually precludes finding the complete gene)

Describe the procedure / name two effects of alternative splicing.

TODO

Splicing procedure:
- 5 critical bases: 5’ splice site / donor splice site (GU), branch point (A), 3’ splice site / acceptor splice site (AG)
- cleavage on 5’ splice site of pre-mRNA
- reaction between 5’ splice site and branch site leads to formation of lariat-like intermediate
- cleavage at 3’ splice site
- ligation of exons
Types of AS:
- constitutive AS: more than one product is always made from transcribed gene
- regulated AS: different forms are generated under different conditions
Effect:
- Formation of isoforms with different functions
- Altered protein interactions and signal transduction
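
The splicing steps listed above can be mimicked in a few lines: given intron coordinates, check the GU donor and AG acceptor dinucleotides and join the exons. The function name `splice`, the coordinate convention (0-based, end-exclusive) and the toy pre-mRNA are assumptions of this sketch; the branch point is not modelled.

```python
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) pairs and ligate the exons.

    Each intron must start with GU (5' splice site) and end with
    AG (3' splice site), otherwise a ValueError is raised.
    """
    mature = []
    pos = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"non-canonical intron at {start}-{end}")
        mature.append(pre_mrna[pos:start])  # exon upstream of this intron
        pos = end
    mature.append(pre_mrna[pos:])           # last exon
    return "".join(mature)


# Toy transcript: exon 1 + intron (GU...AG) + exon 2.
toy_pre_mrna = "AUGGCC" + "GUAAGUCUAACAG" + "GGCUAA"
mature_mrna = splice(toy_pre_mrna, [(6, 19)])
```

Passing different intron sets for the same pre-mRNA is a simple way to picture alternative splicing: each set yields a different mature isoform.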

What are single nucleotide polymorphisms (SNPs)? What other types of polymorphisms do you know?

(2023-2)

DONE

What are SNPs?

SNP stands for Single Nucleotide Polymorphism, it occurs when a single nucleotide replaces one of the other three nucleotide letters in a genome (or in a DNA sequence).
SNPs may occur anywhere: Most SNPs are found outside of coding seqs => SNPs found in a coding seq are of great interest as they are more likely to alter function of a protein
most common type of genetic variation in humans. They account for 90% of the variation between individuals
Most are neutral polymorphisms, some cause disease
density = ~1 every 100-300 bases

Other types of polymorphisms

Insertion / Deletions (Indels)
Copy Number Variation
Tandem Repeats
Micro- / Mini-satellites
Structural Variations
Transposable Elements

What is a ribosomal binding site and how can it be used for gene prediction?

(2023-2)

DONE

A sequence of nucleotides in mRNA that is recognized and bound by the ribosome during the initiation of translation
ca. 3-10 nucleotides before initiation codon
e.g. Shine-Dalgarno (SD) sequence, which is located 5-10 bp upstream of the start codon (AUG)
The identification of RBS is crucial for predicting the location of genes in prokaryotic genomes. Since the RBS is closely associated with the start codon of a gene, finding the RBS can help in pinpointing the start site of protein-coding sequences.
Tools: Glimmer, GeneMark, Prodigal
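
A very reduced version of this idea is to keep only those start codons that have a Shine-Dalgarno-like motif at the right distance upstream. The sketch below assumes an exact match to the core motif AGGAGG and a spacer of 5-10 nucleotides; the tools named above use probabilistic models instead, and the function name is hypothetical.

```python
def find_rbs_supported_starts(dna, motif="AGGAGG", min_gap=5, max_gap=10):
    """Return (motif_start, atg_start) pairs where `motif` ends
    min_gap..max_gap nucleotides upstream of an ATG."""
    hits = []
    atg = dna.find("ATG")
    while atg != -1:
        for gap in range(min_gap, max_gap + 1):
            motif_start = atg - gap - len(motif)
            if motif_start >= 0 and dna[motif_start:motif_start + len(motif)] == motif:
                hits.append((motif_start, atg))
                break
        atg = dna.find("ATG", atg + 1)
    return hits


# Invented sequence: the first ATG has an upstream motif, the second does not.
toy_dna = "TTAGGAGGACAGCTATGAAACCCATGTTT"
supported = find_rbs_supported_starts(toy_dna)
```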

Describe the two most important normalization techniques used for quantitative transcriptome analyses.

(2023-2)

DONE

RPKM: reads per kilobase of transcript per million mapped reads
FPKM: fragments per kilobase of transcript per million mapped reads
TPM: transcripts per million
Example usage: correcting for transcript length, since RNA fragmentation during library construction yields more reads from longer transcripts
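
A minimal sketch of how RPKM and TPM are computed from raw read counts and transcript lengths. The gene names and numbers are invented, and the total number of mapped reads is taken to be the sum of the counts shown, which is a simplification.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_reads)
            for gene in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale
    so that the values of one sample sum to one million."""
    rate = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(rate.values())
    return {gene: rate[gene] / scale * 1e6 for gene in counts}


# Hypothetical sample: geneB is three times longer and gets three times the reads.
toy_counts = {"geneA": 100, "geneB": 300}
toy_lengths = {"geneA": 1000, "geneB": 3000}
toy_rpkm = rpkm(toy_counts, toy_lengths)
toy_tpm = tpm(toy_counts, toy_lengths)
```

After normalization both toy genes get the same value, reflecting equal expression despite the different raw counts. FPKM follows the RPKM formula with fragments (read pairs) counted instead of single reads.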

Explain what the reference genome of a species is.

(2023-2)

DONE

The reference genome of a species is the genome against which new sequencing data are compared. It is regarded as a kind of ideal, so newly sequenced data are compared with it, e.g. to detect errors or gaps.

The reference genome is therefore the “perfect” genome for a species and can also be modified when new information about it is discovered. It may thus change over time to stay up to date.

RNAfold implements one of the methods of RNA secondary structure prediction. What kind of RNA secondary structure is considered optimal in this method?

(2023-2)

DONE

RNAfold predicts the secondary structure of RNA by finding the structure with the minimum free energy. This structure is considered optimal because it is the most thermodynamically stable configuration.

Explain the similarity-based approach to gene prediction?

(2023-2)

DONE

Genes in different organisms are similar
The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome
Given a known gene (or a protein) and a genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene
e.g. the known frog gene is aligned to different locations in the human genome —> find “best” path to reveal the exon structure of human gene

Explain what pseudogenes are and describe one approach to identify them.

(2023-2)

DONE

Nonfunctional sequences of genomic DNA that are originally derived from functional genes, but exhibit such degenerative features as premature stop codons and frameshift mutations that prevent their expression
A fundamental feature of pseudogenes is that their nucleotide sequences differ from those of the paralogous functional genes at crucial points
Two types of pseudogenes: conventional, processed

Identification Approach:

Sequence Comparison: Use BLAST to align sequences with known functional genes and look for mutations.
Phylogenetic Analysis: Create phylogenetic trees to compare the evolutionary relationships between functional genes and potential pseudogenes

What is a splice site? Explain two types of splice sites.

(2023-2)

DONE

During splicing, the introns are removed (spliced out) from the sequence.

Two splice sites are involved: the 5’ splice site and the 3’ splice site. The 5’ splice site (donor site, GU) marks the start of the intron, whereas the 3’ splice site (acceptor site, AG) marks its end.

Explain the k-mer approach to finding the repeats in genomes.

(2023-2)

DONE

Sequences are scanned for overrepresented string of certain length
Challenge: to determine optimal size of an oligo (k-mer) and the number of mismatches allowed

Generate K-mers: Slide a window of length k across the genome sequence to generate all possible k-mers.
Count Occurrences: Count the frequency of each k-mer in the genome.
Identify Repeats: K-mers that appear more frequently than expected are identified as potential repeats.
Analyse the repeats by finding and mapping their positions in the genome, for a better understanding of the distribution and context of the repeats, possibly followed by further analyses to differentiate the types of repeats (tandem / dispersed repeats etc.)
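
The steps above can be sketched as follows. This version counts exact k-mers only (no mismatches) and uses a fixed count threshold in place of a statistical expectation; both are simplifying assumptions, and the function names are made up.

```python
from collections import Counter


def count_kmers(genome, k):
    """Slide a window of length k over the sequence and count each k-mer."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def overrepresented_kmers(genome, k, min_count):
    """k-mers that occur at least min_count times."""
    return {kmer: c for kmer, c in count_kmers(genome, k).items()
            if c >= min_count}


def kmer_positions(genome, kmer):
    """All 0-based start positions of a k-mer, for mapping the repeat."""
    k = len(kmer)
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == kmer]


toy_genome = "ACGTACGTTTACGT"
repeats = overrepresented_kmers(toy_genome, k=4, min_count=3)
positions = kmer_positions(toy_genome, "ACGT")
```

The positions show whether copies are adjacent (tandem) or scattered (dispersed), which is the follow-up analysis mentioned in the last step.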

Applications:

Detecting Repeats: Helps in identifying repetitive sequences in the genome, such as tandem repeats, interspersed repeats, and low-complexity regions.
Genome Assembly: Assists in resolving ambiguous regions during genome assembly by highlighting repetitive sequences.

Explain how PolyPhen predicts damaging mutations.

(2023-2)

DONE

PolyPhen (Polymorphism Phenotyping) predicts the impact of amino acid substitutions on the structure and function of proteins, which helps identify potentially damaging mutations.

Goal: to obtain a lower limit estimate for the quantity of non-synonymous SNPs that might have phenotypic effects

Map known disease mutations onto known 3D structures of proteins
Compare results with a control set of substitutions observed between these proteins and their closely related homologs from other species that are unlikely to cause severe effects on the phenotype
Map a large number of non-synonymous SNPs onto protein structures: thought to be neutral or to be the cause of only minor phenotypic effects

What are the challenges in identifying motifs in biological sequences?

(2023-2)

DONE

we don’t know the motif sequence
we don’t know the location relative to gene start
we don’t know the length of the motif
Motifs can differ slightly from one gene to the next / can have mutations (how many mutations are allowed?)
How to distinguish them from "random" motifs?

(Consensus sequences help in finding motifs)

=> These and other problems mean that the algorithms require long runtimes and a large amount of memory

Explain the main steps of the genome assembly process.

(2023-2)

DONE

What is Burrows-Wheeler transform and what is it used for?

(2023-2)

DONE

Produces a permutation of a string that is easier to compress
it is reversible: the original string can be recovered
Applications:
- string compression
- pattern matching
- searching for patterns in strings
Approach:
- form all cyclic rotations of the string
- sort these rotations into alphabetical order
- Report the last column
The BWT brings repeats together, facilitating compression

—> Example tool that uses BWT is Bowtie
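
The approach above translates almost directly into code. This is a naive sketch for short strings; aligners build the transform from suffix arrays instead of sorting full rotations. The `$` end marker, assumed to sort before all other characters and not to occur in the text, is a common convention that the notes do not mention.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted cyclic rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)  # last column


def inverse_bwt(last_column, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

In the transformed string identical characters end up next to each other, which is what makes it easier to compress.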

Name and briefly explain one database or bioinformatics tool related to miRNA. What is it used for?

(2023-2)

DONE

MirScan:

a bioinformatics tool for predicting miRNA genes: it scores candidate hairpin (stem-loop) sequences that are conserved across species for their similarity to known miRNA precursors
Used to find new miRNA gene candidates in genomic sequence; target sites are predicted with separate tools such as TargetScan.

Name and explain two -omics disciplines

(2023-2)

DONE

Genomics: the study of the complete set of DNA sequences (the genome) in an organism, including all of its genes and non-coding regions

identify DNA sequences of an organism’s genome
Understanding genetic variations
identify genes associated with diseases
example tools: BLAST, Genome Browser

Proteomics: the large-scale study of proteins, particularly their functions, structures, and interactions within a cell or organism.

measuring protein abundance and expression levels
exploring protein functions, interactions, modifications

example tools: STRING, Mascot

Transcriptomics: the study of the complete set of RNA transcripts produced by the genome under specific circumstances or in a specific cell type.

analyzing gene expression and understanding how genes are regulated.
example tools: STAR, DESeq2

Author

Carina S.

Information

Last changed
2 years ago
