Which of the vanilla Python data types are sequences?
- String, List, Tuple, Range
Stack & queue + when to use which one
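- LIFO (last in first out, stack) and FIFO (first in first out, queue) - use a stack when the most recently added item has to be handled first (e.g. undo, depth-first search) and a queue when items have to be handled in order of arrival (e.g. task scheduling, breadth-first search)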
What is a deque?
- Double-ended queue - allows efficient append and pop operations at both ends of the container (O(1), whereas a list needs O(n) at its left end)
Which operations does a deque perform less efficiently than a regular list?
- Accessing elements by index (especially close to the center) is more efficient in a list than in a deque
- Reason: a list is a contiguous array, so any index is reached directly in O(1); a deque is built from linked blocks, so reaching an element in the middle means walking the links from one end up to the desired position (O(n))
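A minimal sketch of the deque answers above, using only the standard library; the variable names are made up for illustration:

```python
from collections import deque

queue = deque([1, 2, 3])
queue.appendleft(0)      # O(1) insert at the left end
queue.append(4)          # O(1) insert at the right end
first = queue.popleft()  # O(1) removal from the left end (FIFO behaviour)
last = queue.pop()       # O(1) removal from the right end (LIFO behaviour)

items = [1, 2, 3]
items.insert(0, 0)       # O(n): every element shifts one slot to the right
head = items.pop(0)      # O(n): every remaining element shifts back
```
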
What are iterables and iterators in python?
How do we create them?
Is an iterable an iterator and vice versa?
- Iterable = object that can return its elements one by one; it implements an __iter__ method that returns an iterator
- Iterator = object that represents a stream of data; it implements both an __iter__ method (returning itself) and a __next__ method (returning the next element or raising StopIteration)
- We create them by writing a class that defines the dunder methods mentioned above
- Every iterator is an iterable, but not every iterable is an iterator (a list is iterable, but has no __next__ method)
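One possible way to write both by hand; Countdown and CountdownIterator are hypothetical names chosen for this sketch:

```python
class Countdown:
    """Iterable: every call to __iter__ hands out a fresh iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator:
    """Iterator: keeps the current position and implements __next__."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self  # an iterator is its own iterable

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that the stream is exhausted
        value = self.current
        self.current -= 1
        return value


countdown = Countdown(3)
first_pass = list(countdown)
second_pass = list(countdown)  # the iterable can be traversed again

iterator = iter(countdown)
first_value = next(iterator)
rest = list(iterator)
leftover = list(iterator)      # the iterator itself is used up now
```
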
What can be used as a context manager and why do we use them?
- A class with __enter__ and __exit__ methods can be used as a context manager
- Use them when we establish a connection to something (like a database or a file) and want to be sure that this connection gets properly closed in all cases, even if an exception is raised
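A small sketch of such a class; ManagedResource is a made-up stand-in for a real connection or file handle:

```python
class ManagedResource:
    def __init__(self):
        self.is_open = False

    def __enter__(self):
        self.is_open = True   # acquire the resource
        return self           # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.is_open = False  # always runs, with or without an exception
        return False          # False: do not swallow the exception


resource = ManagedResource()
try:
    with resource as r:
        opened_inside = r.is_open
        raise ValueError("something went wrong")
except ValueError:
    pass  # the error still propagates, but the resource is closed by now
```
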
Magic methods and when to use them?
- Magic methods = methods starting and ending with a double underscore (also called dunder)
- Implementing such methods allows us to use standard python functions and operators with instances of our custom classes
- Use when you want to define the behavior of objects in response to certain actions, such as defining the string representation of an object (__repr__ / __str__)
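As an illustration, a hypothetical Vector class that plugs into repr(), +, == and len() through dunder methods:

```python
class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # used by repr() and the interactive prompt
        return f"Vector({self.x}, {self.y})"

    def __add__(self, other):
        # used by the + operator
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        # used by the == operator
        return isinstance(other, Vector) and (self.x, self.y) == (other.x, other.y)

    def __len__(self):
        # used by len(); here simply the number of components
        return 2


total = Vector(1, 2) + Vector(3, 4)
```
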
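How can you mark a function as private?
- You can't, there is no such thing as fully private functions or variables
- A single leading underscore (_name) marks it as internal by convention; a double leading underscore (__name) inside a class helps a bit more, because the name is mangled to _ClassName__name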
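A short sketch of what the underscores actually do; the Account class and its attributes are invented for the example:

```python
class Account:
    def __init__(self):
        self._balance = 0    # single underscore: "internal" by convention only
        self.__pin = "1234"  # double underscore: stored as _Account__pin


account = Account()
has_plain_name = hasattr(account, "__pin")  # the unmangled name does not exist
mangled_value = account._Account__pin       # ...but the mangled one is reachable
```

So the double underscore prevents accidental access and name clashes in subclasses, not deliberate access.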
What does the zip function do?
- zip wraps two or more iterables in a lazy iterator that yields tuples containing the next value from each iterable; it stops as soon as the shortest iterable is exhausted
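For example, with two made-up lists of different lengths:

```python
names = ["Annie", "Jake", "Paul"]
ages = [31, 25, 40, 99]    # one value more than names

pairs = zip(names, ages)   # nothing is computed yet
first_pair = next(pairs)   # values are produced on demand
remaining = list(pairs)    # stops at the shorter input, so 99 is dropped
```
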
What does functools.wraps do?
- Helps us keep the original name, docstring and other metadata of wrapped functions; without it, the decorated function shows the metadata of the wrapper instead
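A side-by-side sketch with two hypothetical decorators, one without and one with functools.wraps:

```python
import functools


def plain_decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def preserving_decorator(func):
    @functools.wraps(func)  # copies __name__, __doc__, etc. onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain_decorator
def greet():
    """Say hello."""
    return "hello"


@preserving_decorator
def greet_wrapped():
    """Say hello."""
    return "hello"
```
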
What are generator expressions in python? How are they different from list comprehensions?
- Expression that returns a generator object
- written like a list comprehension, but with parentheses instead of square brackets; it does not contain a yield keyword (yield is what turns a function into a generator function)
Difference:
- Uses lazy evaluation -> each value is produced only when it is requested, whereas a list comprehension builds the whole list in memory, even if it's not needed
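A minimal comparison of the two forms:

```python
squares_list = [n * n for n in range(5)]  # built completely, right away
squares_gen = (n * n for n in range(5))   # nothing computed yet

first = next(squares_gen)      # computes only the first value
rest = list(squares_gen)       # computes the remaining values
exhausted = list(squares_gen)  # a generator can be consumed only once
```
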
What are closures and why do they exist?
- a closure is a function that keeps access to the variables of its enclosing scope, even after that scope has finished executing
- allows us to hide/protect data, to generate functions at runtime, to modify the behavior of other functions
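As a sketch, a hypothetical counter factory whose state lives only inside the closure:

```python
def make_counter():
    count = 0  # lives on after make_counter returns

    def increment():
        nonlocal count  # rebind the variable of the enclosing scope
        count += 1
        return count

    return increment


counter_a = make_counter()
counter_b = make_counter()  # gets its own, independent count

counter_a()
second_a = counter_a()
first_b = counter_b()
```
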
What is a dataclass and what are its advantages? Compare the 3 concepts and their pros / cons (dataclass, class, named tuple)
- dataclass: decorator that automatically generates special methods such as __init__ and __repr__ for user-defined classes
- class: allows us to create new data types, combining state and behavior
- named tuple: immutable, lightweight data type that allows us to group values together under named fields (use if you care about grouping attributes, but don't really care about modeling)
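A small sketch contrasting a dataclass with a named tuple; PointData and PointTuple are made-up names:

```python
from collections import namedtuple
from dataclasses import dataclass


@dataclass
class PointData:
    x: float
    y: float  # __init__, __repr__ and __eq__ are generated automatically


PointTuple = namedtuple("PointTuple", ["x", "y"])

point = PointData(1.0, 2.0)
point.x = 5.0  # dataclass instances are mutable by default

pair = PointTuple(1.0, 2.0)
try:
    pair.x = 5.0
    tuple_is_mutable = True
except AttributeError:
    tuple_is_mutable = False  # named tuples cannot be changed in place

x, y = pair  # ...but they unpack like any other tuple
```
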
What is a strongly typed and dynamic language? What does it mean?
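- strongly typed: values have a fixed type and are not silently converted to unrelated types (e.g. "1" + 1 raises a TypeError)
- dynamically typed: types belong to values, not to variable names, and are checked at runtime -> a name can later be rebound to a value of another type. In a statically typed language such as Java, Integer a stays Integer a and can't be redefined as String a later
- Python is both strongly and dynamically typed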
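A short sketch of how both properties show up in Python:

```python
value = 3
value = "three"  # dynamic typing: the name may be rebound to another type

try:
    result = "1" + 1
except TypeError:
    result = None  # strong typing: str and int are not mixed implicitly

explicit = int("1") + 1  # the conversion has to be requested explicitly
```
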
Briefly explain the naming scheme of variables, functions and classes in python.
- variables and functions: lowercase_and_underscore
- classes: CamelCase
Name 3 built-in datatypes in python
int, str, float
What is a module?
file containing python code and having .py as extension
Where does python look for a module / package?
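- in the directories listed in sys.path: the directory of the running script (or the working directory), the directories from PYTHONPATH, and the default directories of the python installation (standard library, site-packages)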
What does the finally clause mean in the context of exceptions?
The finally block will be executed no matter if the try block raises an error or not. This can be useful to close objects and clean up resources.
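A minimal sketch; the divide function and the events list are invented to make the execution order visible:

```python
events = []


def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        events.append("error")
        return None
    finally:
        events.append("cleanup")  # runs on success and on failure


ok = divide(4, 2)
failed = divide(1, 0)
```
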
Can you define a function like this: def my_pow(x=10, power): pass
No, parameters with default values have to appear after the parameters that don't have default values (in the parameter list); this definition raises a SyntaxError
What is the difference between keyword and positional arguments?
- keyword arguments allow us to pass arguments in any order when calling a function
- positional arguments must be specified in the right order (& before keyword arguments)
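For illustration, a valid variant of my_pow (default value moved to the end) called in different ways:

```python
def my_pow(x, power=2):
    return x ** power


by_position = my_pow(3, 3)          # order decides: x=3, power=3
with_default = my_pow(3)            # power falls back to its default
by_keyword = my_pow(power=3, x=2)   # keywords may come in any order
mixed = my_pow(2, power=5)          # positional first, keywords after
```
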
How do you create functions that accept arbitrary numbers of arguments?
def average(*args):
- value of args is a tuple with the positional arguments (**kwargs does the same for keyword arguments, as a dict)
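A runnable sketch of average, plus a hypothetical describe function that shows the types of args and kwargs:

```python
def average(*args):
    if not args:
        raise ValueError("average() needs at least one value")
    return sum(args) / len(args)


def describe(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments
    return type(args).__name__, type(kwargs).__name__


mean = average(1, 2, 3)
kinds = describe(1, 2, unit="cm")
```
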
How can you unpack collections and pass their values as arguments to a function?
def my_date(year, month, day):
    pass
my_values = [2021, 12, 31]
my_date(*my_values)  # same as my_date(2021, 12, 31)
How would you sort the values of a list by their third character?
strings = ["Annie","Jake","Paul"]
sorted(strings, key = lambda x: x[2])
What is the difference between append and extend for lists?
- append adds its argument as one single element (a list passed to append ends up nested)
- extend adds every element of an iterable (like another list) individually
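The difference is easiest to see with the same argument passed to both:

```python
appended = [1, 2]
appended.append([3, 4])  # the list is added as one nested element

extended = [1, 2]
extended.extend([3, 4])  # each element is added on its own
```
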
If we have a list of 20 elements, what does my_list[1:10:2] do?
- Returns a list with every second element from index 1 up to, but not including, index 10: the elements at indices 1, 3, 5, 7, 9 (i.e. the 2nd, 4th, ..., 10th element)
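A quick check, assuming my_list simply holds the numbers 0 to 19 so that value and index coincide:

```python
my_list = list(range(20))
sliced = my_list[1:10:2]  # start=1, stop=10 (exclusive), step=2
```
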
Briefly explain the function of namedtuple, Counter, and OrderedDict
- namedtuple: to group variables together as a tuple with named fields
- Counter: to count the items of a collection
- OrderedDict: dict that preserves the insertion order and also takes the order into account when two OrderedDicts are compared
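A compact sketch of all three; Gene and the sample values are made up:

```python
from collections import Counter, OrderedDict, namedtuple

Gene = namedtuple("Gene", ["name", "length"])  # hypothetical record type
gene = Gene(name="example", length=1200)

counts = Counter("banana")              # counts each character
most_common = counts.most_common(1)

ordered = OrderedDict(a=1, b=2)
ordered.move_to_end("a")                # reordering is part of the API
same_order = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)
plain_dicts = dict(a=1, b=2) == dict(b=2, a=1)
```
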
How can you specify the type of a variable at the same time as its value?
var_name: int = 3
How can you specify the return data type of a function?
def my_func() -> int:
What are decorators?
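functions that take a function and return a (usually wrapped) replacement for it; applied with @decorator_name on the line above the function definition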
Compare list, tuple, set and dict. Name 2 characteristics for each one
- list: mutable, ordered, heterogeneous, can have duplicates
- tuple: immutable, ordered, heterogeneous
- set: mutable, heterogeneous, no duplicates, unordered
- dict: mutable, heterogeneous collection of key-value pairs with unique, hashable keys; not a sequence, but it preserves insertion order since Python 3.7
- list, strings, and tuples are sequences
Why is numpy faster than normal python lists and under which circumstances?
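python lists: hold references to the objects they contain -> heterogeneous, but the values are scattered in memory and every element has to be handled individually
numpy: homogeneous and stored in one contiguous block of memory -> operations run in compiled loops over many values at once, and we don't need to explicitly loop over elements. The gain shows up for vectorized operations on large numeric arrays; it is lost when looping over the elements in python or when using the object dtype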
Provided several pairs of np arrays, determine if a specific operation (like addition) can be broadcasted for each pair. Explain why
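- Compare the shapes from the last (trailing) dimension backwards: two dimensions are compatible if they are equal or if one of them is 1. If the number of dimensions is not the same, the missing leading dimensions of the smaller array are treated as 1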
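The broadcasting rule can be sketched in pure Python, operating on shape tuples only. can_broadcast is a hypothetical helper written for these notes, not a NumPy function (NumPy's own check is np.broadcast_shapes):

```python
from itertools import zip_longest


def can_broadcast(shape_a, shape_b):
    """Return True if two array shapes are compatible for broadcasting."""
    # walk both shapes from the trailing dimension; missing dimensions count as 1
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


row_added_to_matrix = can_broadcast((3, 4), (4,))    # 4 matches 4
wrong_length = can_broadcast((3, 4), (3,))           # 4 vs 3, neither is 1
column_and_row = can_broadcast((3, 1), (1, 4))       # 1s stretch on both sides
```
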
What are the pros and cons of np arrays compared to vanilla python lists?
Pros:
- Implemented in C (faster)
- stored contiguously in memory (chunks)
- broadcasting
- already implemented operations (e.g. dot product)
Cons:
- less flexible than lists
- loses efficiency with Object datatype
- fixed size (values can be changed in place, but when one element is added the whole array has to be re-created)
- type restrictions: all elements in a np array must be of the same type
When do we want to use copies, when do we want to use views?
- view = new array object that refers to the same data -> change of value leads to change of original value
- copy = new array object containing a copy of the data (deep copy) -> changes don't affect the original values
- We use copies if we want to modify the data without affecting the original, and views if we want to modify the original data or avoid the memory cost of copying
What does a double index for a column mean?
- Fancy indexing
- example:
numbers[1:, [1, 0]] -> (1,1), (1,0), (2,1),(2,0) in a 3x3 matrix
What are the performance gains using apply / np.vectorize, and why?
- Vectorization allows performing an operation on several values at once
- But the vectorize function is primarily for convenience, not for performance (essentially a for loop)
What are different approaches to create numpy arrays?
- Using python lists or tuples (np.array([x,y,z]), np.array((x,y,z)))
- using pre-built functions like np.zeros(), np.ones(), np.eye(), np.full(), np.random.random()
How can you check which type of data a np array contains?
numpy_array.dtype
How can you access elements of a 2d np array?
- by a tuple of indices: arr_2d[1, 1]
- via slicing: arr_2d[:2, 1:]
- via boolean indexing
- via fancy indexing
How can you find elements fulfilling a specific condition in a np array?
- Using np.where to get the indices, and then using those indices to get the elements: indices = np.where(arr > 60); arr[indices] (a boolean mask, arr[arr > 60], returns the same elements directly)
Name 3 datatypes of numpy?
int32
int64
object
How can you create a pandas dataframe?
- From an array
- from a dictionary
- list of tuples
What are the advantages of pandas?
- flexible handling of missing data
- many built-in methods
- higher performance than vanilla python while still flexible (data types don't need to be set)
- many routines known from other languages, like split, aggregate, reshape, melt, select, merge, ...
Define the pd.Index
- an immutable array or an immutable ordered multi-set
- can be of any type
- built on top of np arrays
Can a pd.Series object contain heterogenous data?
- Yes, because the type is automatically inferred, but it will then be stored as object type
Give examples for how to create a pd.Series
- From a list (Series will be a copy)
- from a np.array (behaves like a view, if copy=True is not set)
- from a dictionary
What is the difference between head / tail and nlargest / nsmallest?
- head and tail return the first / last n rows (default n=5), in their existing order
- nlargest and nsmallest return the n rows with the largest / smallest values (for a DataFrame: in a given column)
Are the data types of pd.DataFrame limited to the np.array data types?
- No
- e.g. pd.StringDtype, pd.CategoricalDtype
How can you apply a function to a dataframe?
- df.apply(<function>, axis)
- function either pre-defined or lambda
- axis=0: the function receives each column (all rows of one column at once)
- axis=1: the function receives each row (all columns of one row at once)
Name a few methods that you can apply on strings?
- upper, lower, split, slice (in pandas these are available through the .str accessor, e.g. series.str.upper())
What is the difference between None, np.nan and pd.NA?
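- None: python's general null object; pandas converts it to NaN in numeric Series (an integer Series becomes float as a result)
- NaN: same as np.nan, a float value
- pd.NA: missing-value marker for the nullable data types: nullable integers (Int64), booleans and strings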
Why would you use pandas visualization (df.plot)?
- to explore data fast -> quick and easy plot
- based on matplotlib but without the fancy stuff
What is the difference between primary and secondary databases? What are examples for each of them?
Primary databases:
- Unprocessed "raw" data from experimental and observational studies
- Examples: GenBank, UniProt, PDB
Secondary databases:
- Collections that provide processed and structured data from primary databases
- Examples: SCOP, PFAM, PROSITE
What is the CAP theorem?
Why do we use it?
CAP: concept that states that it is impossible for a distributed system to guarantee all three of the following at the same time:
- Consistency: All nodes in the system see the same data at the same time
- Availability: Every request to the system receives a timely response
- Partition tolerance: The system continues to function despite arbitrary partitioning due to network failures
Concept: Systems can provide at most 2 of these 3 guarantees
Used as a tool for understanding the trade-offs involved in designing distributed systems and for choosing the right architecture for a given problem
BASE principle
- Basically available
- Soft state
- Eventually consistent
Types of plots and when to use them:
- Scatter: used to display the relationship between two continuous variables (identifying patterns / correlation)
- Box: Compare distributions (in terms of percentiles, median, outliers)
- Line: Relationship between a continuous independent variable and a continuous dependent variable
- One continuous variable and one discrete variable -> mainly use bar plots, box plots by category and violin plots by category
- Histogram: Distributions
- Violin plots: like box plot + data density
CRUD principle
- Minimum set of access functions
- create, read, update, delete
What is a document based model?
- stores semi-structured data as self-contained documents (e.g. JSON), which don't all need to have the same fields
Why shouldn't we store biological databases in SQL databases,
and why is it better to use NoSQL?
- NoSQL are used because they provide a flexible and scalable way to store and retrieve data that does not fit neatly into traditional, tabular data structures (for example many null values)
- Biological data is semi-structured -> SQL is better for structured data -> storing data with NoSQL (using document-oriented or graph-oriented storage)
- Biological databases store very diverse data entries, sequences have different annotations, different lengths, etc -> flexible way to store this data is needed, no hard structure, many NAs -> NoSQL
Question about order of magnitude of primary databases
- GenBank: 10^12 residues, 10^9 sequences
- WGS etc: 10^13 residues, 10^9 sequences
- UniProt TrEMBL/SwissProt: 10^(10-11) / 2*10^8 residues, 10^8 / 5*10^5 sequences
- PDB: 200k in total: 172k by X-ray, 14k by NMR, 14k by E
Structure of flat file format
- fixed number of columns
- structure is modelled via indentation
- a line is called a record and is typed
- typing happens via keyword in the first column
- subkeywords are indented
- no keyword: continuation of the previous line
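A rough sketch of how such a format could be read. Everything here is an assumption made for illustration: the keyword field width of 12 characters (GenBank-style), the parse_flat_file and make_line helpers, and the made-up entry:

```python
KEYWORD_WIDTH = 12  # assumption: width of the keyword field in characters


def parse_flat_file(text, keyword_width=KEYWORD_WIDTH):
    """Return a list of (keyword, value) records from one flat-file entry."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        keyword = line[:keyword_width].strip()  # indented subkeywords land here too
        value = line[keyword_width:].strip()
        if keyword:
            records.append([keyword, value])    # the keyword types the record
        elif records:
            records[-1][1] += " " + value       # no keyword: continuation line
    return [tuple(record) for record in records]


def make_line(keyword, value):
    # pad the keyword field to the fixed width
    return f"{keyword:<{KEYWORD_WIDTH}}{value}"


entry = "\n".join([
    make_line("LOCUS", "EXAMPLE01"),
    make_line("DEFINITION", "A made-up entry used only"),
    make_line("", "for illustration."),
    make_line("SOURCE", "Homo sapiens"),
    make_line("  ORGANISM", "Homo sapiens"),
])
records = parse_flat_file(entry)
```

This sketch flattens subkeywords into ordinary records; a fuller parser would use the indentation to attach them to their parent keyword.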
How is an HTTP request structured?
- application of a verb (the HTTP method: GET/PUT/POST/DELETE) to a noun (the resource, identified by its URL), with an optional response
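A sketch of the verb/noun idea with the standard library; the URL is a placeholder, and the requests are only constructed here, not sent:

```python
from urllib.request import Request

# verb = HTTP method, noun = resource URL (example.org is a placeholder host)
read_request = Request(
    "https://example.org/sequences/42",
    method="GET",
    headers={"Accept": "application/json"},
)
delete_request = Request("https://example.org/sequences/42", method="DELETE")

verb = read_request.get_method()
noun = read_request.full_url
```
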
Briefly explain MAP/REDUCE
- map: applied to all elements of a list, returns modified list
- reduce: aggregates the return values from map into one result
- parallel processing, initial data structure is conserved, new copy with every entry level, no deadlocks or side effects
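A single-machine sketch of the idea with python's built-in map and functools.reduce; the sequences are made up, and a real MAP/REDUCE framework would distribute the map step over several workers:

```python
from functools import reduce

sequences = ["ATGC", "GGCC", "AT"]

# map: apply the same function to every element, producing a new list
gc_counts = list(map(lambda seq: seq.count("G") + seq.count("C"), sequences))

# reduce: fold the mapped values into one result
total_gc = reduce(lambda acc, n: acc + n, gc_counts, 0)
```
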
How can you query a graph database?
- pattern based: Cypher
- navigation based: Gremlin
4 NoSQL types:
- key/value/tuple stores
- wide column stores
- document stores
- graph databases
SwissProt entry compared to UniProt entry
- SwissProt curated entries -> much smaller, but (presumably) higher quality
What is the primary database for Protein structures?
What is a secondary database for protein structures?
- primary: PDB
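- secondary: SCOP, CATH (classifications derived from PDB structures); PFAM is also a secondary database, but for protein families and domains rather than structures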