TypeError
raised when an operation or function is applied to an object of an inappropriate type (e.g. 1 + 'one')
Variable Naming Scheme
valid characters include letters, digits, and underscore sign
a name can’t start with a digit -> SyntaxError
use lowercase for the variable and function names, with words separated by underscores (do_it_like_this_123)
can’t use Python keywords (e.g. True, while, import, in)
sequence
is an ordered collection of values
e.g. strings, lists, tuples, ranges
lists
a mutable, heterogeneous, ordered sequence of elements
mutable: you can change elements in the list
you can add elements to a list via extend() and append()
extend(): adds each element of an iterable to the list individually
append(): adds its argument as a single element; appending a list creates a nested list
you can concatenate lists using +
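A minimal sketch of the difference between extend(), append(), and + (variable names are illustrative):

```python
nums = [1, 2]
nums.extend([3, 4])        # adds each element individually -> [1, 2, 3, 4]
nums.append([5, 6])        # adds the whole list as one nested element
combined = nums + [7]      # + concatenates into a new list
```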
tuples
A python tuple is an immutable, heterogeneous, ordered sequence of elements
immutable: you cannot change / add an element in the tuple
can concatenate tuples using +
Set
A Python set is a mutable, heterogeneous, unordered sequence of distinct elements. A set can contain only hashable elements (for now, read this as immutable)
we can add elements to a set using the add() method
We can compute union, intersection, difference, and symmetric difference of sets in Python ( | , - , & , ^ )
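A short sketch of add() and the four set operators (example values are illustrative):

```python
a = {1, 2, 3}
b = {3, 4}
a.add(5)                   # a is now {1, 2, 3, 5}
union = a | b              # {1, 2, 3, 4, 5}
intersection = a & b       # {3}
difference = a - b         # {1, 2, 5}
symmetric = a ^ b          # {1, 2, 4, 5}
```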
dictionary
A dictionary is a mapping from hashable keys to arbitrary values = a mutable, heterogeneous collection of key-value pairs (insertion order is preserved since Python 3.7)
concatenate using union operator | instead of +
Order matters! If we have the same key in two dictionaries, the former value will be overwritten with the latter one
update one dictionary with the values from another dictionary using the update() method. Order matters!
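A small sketch of both merge styles; note how the latter value wins for the duplicate key (example dictionaries are illustrative):

```python
d1 = {"a": 1, "b": 2}
d2 = {"b": 20, "c": 3}
merged = d1 | d2           # new dict; for duplicate keys the latter value wins
d1.update(d2)              # in-place: d1 now equals merged
```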
Data Type: None
the null object, represents the absence of a value
Control Flow - Branching
if
if-else
if-elif-else
match-case
Control Flow - Loops
while
for
function
a named block of code that can accept arguments and can have a return value
e.g. def my_sum(parameter1, parameter2): …
pass statement allows us to create a function with an empty body
use lowercase for the function names, with words separated by underscores (just like the variables)
Naming scheme - Class
upper camel case
e.g. WriteYourClassNameLikeThis
Python module
a file containing Python code and having ’.py’ extension.
The name of the module is the name of the file without extension
can load modules by name using import
Namespace package
a directory that contains modules
Using directories allows us to create hierarchies of modules and to group similar modules together
Loading a namespace package doesn’t give us access to the modules contained in the package. To access the modules we either have to explicitly import them or use a ”regular” package
Regular package
contains an __init__.py file
This file is supposed to contain the initialization logic.
finally
code that will always be executed as the last task before the try statement completes
finally is useful when using external information e.g. closing a file / database connection
How to assign values we don’t need?
using underscore variable, e.g.
my_list = [1, 2, 3, 4, 5]
a, b, c, _, _ = my_list
Augmented Assignments
e.g. +=, **=, //=, *=
augmented assignments give us a compact way to change a value bound to a variable
&= is intersection
can use augmented assignments for numbers, sets, dictionaries
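A quick sketch of augmented assignments on a number, a set, and a dictionary (|= for dicts needs Python 3.9+):

```python
x = 10
x //= 3                    # floor division in place -> 3
s = {1, 2, 3}
s &= {2, 3, 4}             # in-place intersection -> {2, 3}
d = {"a": 1}
d |= {"b": 2}              # in-place dictionary merge (Python 3.9+)
```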
Lambdas
allow us to create a function inline (in place) where it’s needed: sorted(strings, key=lambda x: x[1])
a lambda expression creates an anonymous function. The keyword lambda is followed by a comma-separated list of parameters, a colon, and a single expression (the expression can’t contain branches, loops, return or yield statements).
A lambda implicitly returns the value of the expression.
Complex numbers
we can create complex numbers using either a literal notation or a complex constructor
e.g. x = 1+2j or y = complex(3, -5); the first part is the real component, the second the imaginary one
Mathematical functions from math module
math.ceil(5.6) —> 6, ceiling = rounds up
math.floor(5.6) —> 5, floor = rounds down
math.sqrt(81) —> 9.0, square root
math.factorial(6) —> 720
math.comb(10, 3) —> 120, combinations
math.perm(10, 3) —> 720, permutations
math.cos(math.pi) —> -1.0
math.degrees(math.pi) —> 180.0
random.seed()
seed() method is used to initialize the random number generator
passing the same seed to random, and then calling it will give you the same set of numbers
if you want the results to be different every time you will have to seed it with something different every time you start -> default is the current time as seed value
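A sketch showing that re-seeding with the same value reproduces the same sequence:

```python
import random

random.seed(42)
first = [random.randint(0, 100) for _ in range(3)]
random.seed(42)            # re-seeding with the same value...
second = [random.randint(0, 100) for _ in range(3)]
# ...reproduces exactly the same sequence of numbers
```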
list
a mutable, ordered, heterogeneous collection of items
zero-indexed
list.clear()
list.copy() —> shallow copy
list.count()
list.index()
list.insert()
list.pop()
list.remove()
…
Shallow copy
A shallow copy means constructing a new collection object and then populating it with references to the child objects found in the original. In essence, a shallow copy is only one level deep. The copying process does not recurse and therefore won’t create copies of the child objects themselves.
Deep copy
A deep copy makes the copying process recursive. It means first constructing a new collection object and then recursively populating it with copies of the child objects found in the original. Copying an object this way walks the whole object tree to create a fully independent clone of the original object and all of its children.
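A minimal sketch of the difference: after mutating the original, the shallow copy sees the change (it shares the inner lists), the deep copy does not:

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)      # shares references to the inner lists
deep = copy.deepcopy(original)     # independent clone of the whole tree
original[0].append(99)
```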
list comprehension
map one list onto another list
e.g. create a list of the squared numbers of another list
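The squared-numbers example as a one-liner:

```python
nums = [1, 2, 3, 4]
squares = [n ** 2 for n in nums]   # maps nums onto [1, 4, 9, 16]
```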
Tuple
immutable, ordered, heterogeneous (since tuples are immutable, we can’t do much with them!)
functions on Sets
Adding items: add, update
Removing items: remove, discard, pop, clear
Manipulating sets: clear, copy
namedtuple
a complex data type that allows to group variables (properties, attributes) together under one name
can unpack namedtuple like a regular tuple
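A short sketch (the Point type is a made-up example):

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])   # hypothetical type
p = Point(x=3, y=4)
x, y = p                                  # unpacks like a regular tuple
```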
OrderedDict
when we compare two objects of the OrderedDict type, not only the items but also the insertion order is tested for equality
Callable
something we can call, invoke, execute using parenthesis (), e.g. functions are callables, classes are callables
we can make instances of our classes callable like functions by implementing the __call__ method
why implement __call__: e.g. for timing benchmarks, for context managers, or for use in decorators
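A sketch of a callable instance; the Timer class is a made-up example for timing benchmarks:

```python
import time

class Timer:
    # hypothetical class whose instances are callable like functions
    def __call__(self, func, *args):
        start = time.perf_counter()
        result = func(*args)
        self.elapsed = time.perf_counter() - start
        return result

timer = Timer()
result = timer(sum, [1, 2, 3])   # the instance is invoked with ()
```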
Attribute names starting with a double underscore
make the interpreter rewrite them (name mangling)
We need to be careful with double leading underscores and inheritance, though
implement getters and setters
with the @property and @{property_name}.setter decorators
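A minimal sketch of both decorators; the Temperature class and its validation rule are made up for illustration:

```python
class Temperature:
    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def celsius(self):               # getter: read like a plain attribute
        return self._celsius

    @celsius.setter
    def celsius(self, value):        # setter: validates on assignment
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._celsius = value

t = Temperature(20)
t.celsius = 25                       # goes through the setter
```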
Context Manager (Class-based)
class with following methods:
__enter__: first executed
__exit__: second executed
(__call__: processes arguments)
examples to use context manager:
to work with files, time benchmarking, working with the file system, when you don’t want to update gradients in PyTorch (torch.no_grad()), interactions with external resources, …
when we have to open and then close it (like a sandwich)
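A sketch of the class-based pattern; the Stopwatch class is a made-up time-benchmarking example:

```python
import time

class Stopwatch:
    # hypothetical class-based context manager for time benchmarking
    def __enter__(self):                 # executed first, on entering the with-block
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):   # executed on leaving
        self.elapsed = time.perf_counter() - self.start
        return False                     # don't suppress exceptions

with Stopwatch() as sw:
    sum(range(10_000))
```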
Generator
a generator is a function that returns a generator iterator (use the yield keyword)
a generator expression combines lazy evaluation of generators with the beauty and simplicity of list comprehensions
lazy meaning values are only computed if / when we want it
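A sketch of both forms; note that creating the generator computes nothing until values are requested:

```python
def squares(n):
    # generator function: yield makes it return a generator iterator
    for i in range(n):
        yield i ** 2

gen = squares(4)                       # nothing is computed yet (lazy)
gen_expr = (i ** 2 for i in range(4))  # generator expression, also lazy
first = next(gen)                      # values are computed on demand
rest = list(gen)
```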
built-in classes in Python that are iterators
enumerate
zip
map & filter
Context Manager (Function-based)
we can create a context manager using a function that yields and a @contextmanager decorator
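A sketch of the function-based pattern (the stopwatch function is a made-up benchmarking example): code before the yield plays the role of __enter__, code after it the role of __exit__.

```python
from contextlib import contextmanager
import time

@contextmanager
def stopwatch():
    stats = {}
    start = time.perf_counter()        # runs on entry (like __enter__)
    try:
        yield stats
    finally:
        stats["elapsed"] = time.perf_counter() - start   # runs on exit (like __exit__)

with stopwatch() as stats:
    sum(range(10_000))
```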
Nested Functions
we can have an ”inner” function defined inside an ”outer” function
the inner function is only accessible within the outer function
the inner function ”remembers” the value of the power argument even after the outer function has completed its execution —> when an inner function is defined within an outer function, the inner function keeps access to the variables of the outer function
Closure
a closure is a function that keeps access to its environment as it was when the function was defined
closures help us to hide / protect the data
closures help us to generate functions at runtime
closures help us to create decorators -> we can use closures to create function-based decorators
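A minimal closure sketch generating functions at runtime (names are illustrative):

```python
def power_of(exponent):
    # inner "remembers" exponent even after power_of has returned (a closure)
    def inner(base):
        return base ** exponent
    return inner

square = power_of(2)   # a function generated at runtime
cube = power_of(3)
```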
Type hints
allow us to specify types for variables, function parameters, and function return values
IDEs / editors like PyCharm or 3rd-party tools can parse type hints and notify the user when they are violated
DataClass
@dataclass decorator simplifies class creation & can add the following methods: __init__, __repr__, __eq__: equal, … (these three are added by default)
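A sketch of the generated methods (the Point class is a made-up example):

```python
from dataclasses import dataclass

@dataclass
class Point:
    # __init__, __repr__, and __eq__ are generated automatically
    x: int
    y: int

p = Point(1, 2)
```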
Vectorization
refers to performing an operation on several values at once (SIMD: single instruction, multiple data)
alternatively, vectorization means converting an algorithm from operating on a single value at a time to operating on several values at a time
NumPy Data Types
int8, int16, int32, int64
uint8, uint16, uint32, uint64
float16, float32, float64
bool_, unicode string, object
create NumPy Array
using np.array()
from sequences
from tuples
from ranges
using fromiter to create arrays from iterables
helpers for array creation:
np.arange(1, 10)
np.zeros()
np.ones()
np.eye()
np.full()
np.random.random()
np.random.randint()
accessing elements from NumPy arrays
by indices
use slicing
use Boolean indexing
by array of indices
combining simple indices with fancy indexing
combining slicing with fancy indexing
functions on NumPy arrays
np.sort()
np.where()
np.shape()
Math on NumPy Arrays
np.add(arr1, arr2)
np.divide(arr1, arr2)
np.matmul(arr1, arr2)
np.sum()
np.median()
np.min()
np.argmax()
Broadcasting
we can perform pairwise operations on arrays of different shapes, as long as arrays are compatible in every dimension
Broadcasting allows us to use a smaller array several times together with a larger array according to the following rules:
arrays are compatible in a dimension if they have the same size in a given dimension OR if the smaller array has size 1
if the arrays do not have the same number of dimensions, prepend (add to beginning / as prefix) 1 to the shape of the smaller one until they do
a smaller array acts as if it was copied along those dimensions where its size is 1
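A sketch of these rules: a (3,) array is treated as (1, 3) and acts as if copied along axis 0 (example values are illustrative):

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])          # shape (3,) is treated as (1, 3)
result = matrix + row                 # row acts as if copied along axis 0
```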
Transpose NumPy Array
use the T attribute or the numpy.transpose function: np.transpose(arr) or arr.T
Store dates and times in NumPy
using the datetime64 data type: e.g. np.datetime64(’2021-01-01’)
we can create datetime64 instances by passing a string, an integer and a unit, or a string and a unit
not a time value = np.datetime64(’NaT’)
timedelta64
the timedelta64 data type is used to store the result of subtracting two datetime64 values
Record Arrays
record arrays allow field access through the dot notation, yet are less performant
the recarray subclass gives us the ability to access named fields through attribute lookup (i.o.w., using dot notation).
Regular structured arrays are more performant, so I recommend using them instead of the specialized subclass (unless you really need the attribute lookup)
several helper functions for constructing record arrays:
np.rec.fromarrays
np.rec.fromfile
np.rec.fromrecords
np.rec.fromstring
Adding / removing dimensions of matrixes / arrays in NumPy
use reshape method and -1 to add dimension
alternatively, use the np.expand_dims function to add dimension
np.squeeze() to remove axes of length 1
flatten an array using flatten() method
Stacking Arrays
np.column_stack(): to stack as columns
np.row_stack(): to stack as rows (alias for vstack)
np.hstack(): stacks arrays horizontally (column-wise) —> side by side, e.g. np.hstack((vector1, vector2))
np.vstack(): stacks arrays vertically (row-wise) —> on top of each other
ufunc
A universal function (ufunc) is a function that accepts NumPy arrays as input and performs computations on them element-wise
Binary ufuncs have three methods:
reduce: repeatedly applies the ufunc it is invoked on to the array elements until it computes a single result
accumulate: repeatedly applies the ufunc it is invoked on to the array elements AND returns the intermediate results
outer: applies the ufunc to all possible pairs from given inputs
create own ufuncs using np.vectorize
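A sketch of the three binary-ufunc methods on small example arrays:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
total = np.add.reduce(arr)                   # single result: 1+2+3+4 = 10
running = np.multiply.accumulate(arr)        # intermediate results: [1, 2, 6, 24]
table = np.multiply.outer([1, 2], [10, 20])  # all pairs: [[10, 20], [20, 40]]
```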
Working with files in NumPy
save a NumPy array to a file using np.save(’file’, array)
np.load loads a NumPy array from file: np.load(’file’)
save multiple NumPy arrays to a single file using np.savez(’file’, arr1, arr2)
np.savez_compressed acts like np.savez, except it creates a compressed zip-file
np.savetxt saves arrays to a text file
we can load / read data from text files using np.loadtxt()
creating Series in Pandas
can be created from a scalar, sequence, iterable, NumPy array (might want to use copy=True), dictionary (keys become index values)
indexed by consecutive integers starting at 0 by default
careful when creating a Series from a NumPy array: set copy=True, otherwise changing values in the Series also changes values in the NumPy array
create DataFrame in Pandas
We can create a DataFrame using
a dictionary of series
using structured NumPy arrays
from a 2D NumPy array
using iterables instead of Series or NumPy arrays / an iterable of iterables
using a Dictionary (and column names) of iterables
by passing an iterable of dictionaries
explicit column names
Index in Pandas
an Index is an immutable array or an immutable ordered multi-set built on top of a NumPy array
operations on index in Pandas
pd_index.size
pd_index.shape
pd_index.ndim
pd_index.dtype
index1.intersection(index2)
index1.union(index2)
index1.difference(index2)
index1.symmetric_difference(index2)
index1.is_unique
index1.has_duplicates
index1.is_monotonic_increasing
index1.is_monotonic_decreasing
index1.insert(3, 100) —> insert value 100 at position 3 (0 indexing, so 100 is fourth value)
index1.delete(1) —> delete value at position 1 (0 indexing so second value)
index1.copy()
index1.max()
index1.argmax()
index1.min()
index1.argmin()
index1.sort_values()
index1.unique()
index1.value_counts()
index1.drop_duplicates()
index1.drop_duplicates(keep=’last’)
index1.drop_duplicates(keep=False)
loc vs iloc
loc allows to select a subset (of rows) based on explicit labels
e.g. df.loc[’MF’]
e.g. df.loc[[’MF’]] (to return a dataframe)
e.g. df.loc[’MF’:’UP’] (to return multiple rows; here ’UP’ is included)
df.loc[1] doesn’t work with integers (when the labels aren’t integers) —> KeyError: 1
we can use slicing with column access using loc: e.g.: df.loc[:, ’capital’:’population’]
iloc allows to select a subset (of rows) based on implicit integer indices
df.iloc[2] —> one row as a series (not so efficient)
df.iloc[[2]] —> one row as a df
df.iloc[2:4] —> multiple rows as dataframe (4 is not included, returns row 2 and 3)
it doesn’t work with labels (explicit index values): e.g. df.iloc[’MF’] —> TypeError: Cannot index by location index with a non-integer key
we can use slicing with column access using iloc: e.g. df.iloc[:, 1:3]
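A sketch contrasting both accessors on a tiny made-up DataFrame; note that loc slicing includes the stop label while iloc slicing excludes the stop position:

```python
import pandas as pd

# hypothetical data
df = pd.DataFrame({"capital": ["Berlin", "Wien"], "population": [83, 9]},
                  index=["DE", "AT"])
by_label = df.loc["DE", "capital"]     # explicit label
by_position = df.iloc[0, 0]            # implicit integer position
inclusive = df.loc["DE":"AT"]          # loc slicing includes the stop label
exclusive = df.iloc[0:1]               # iloc slicing excludes the stop index
```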
methods to compute statistical values over a Series or (over columns or rows of) DataFrame
Series:
pandaseries.mode()
pandaseries.argmax()
pandaseries.max()
pandaseries.argmin()
pandaseries.min()
pandaseries.mean()
pandaseries.median()
DataFrame
df1.max()
df1.min()
df.median(axis=’columns’)
etc.
df1.describe()
describe method computes several statistical values for columns of a DataFrame
—> returns dataframe with count, mean, std, min, 25%, 50%, 75%, max as rows
all method
returns True if all elements are evaluated as True
can be used with a Series or with a DataFrame
any method (Definition and Usage)
returns True if at least one element is evaluated as True
unique method
returns unique values for a given Series
nunique method
get numbers of unique values in a Series
on DataFrame: get number of unique values in each column
value_counts method
returns values and their counts
on DataFrame: returns the number of unique rows
sorted by descending order by default
convert_dtypes method
simplify working with missing values
convert_dtypes method automatically tries to infer a fitting data type for our values, which comes in handy for None and np.nan
dropna()
drop missing values, rows and columns
df.dropna() —> drops entire rows that have a NA value
To drop columns with missing values instead of rows, we need to pass axis=’columns’: e.g. df.dropna(axis=’columns’)
The dropna method by default removes those rows / columns where at least one value is missing
we can pass an argument for the thresh parameter to remove rows / columns only when the number of present values is below a given threshold
we can specify a parameter: subset of columns to check when removing rows (or other way around)
filling missing values in Pandas
with .fillna() method
filling with a single value
forward propagation —> method=’ffill’
backward propagation —> method=’bfill’
filling with a dictionary / Series
with .interpolate() method
interpolation
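A sketch of the three filling strategies on a small Series (newer pandas prefers the ffill() method over fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(0)          # fill with a single value
forward = s.ffill()           # forward propagation
linear = s.interpolate()      # linear interpolation by default
```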
nominal variables vs ordinal variables
nominal variables have no inherent order —> e.g. names
ordinal variables have an order —> e.g. clothing sizes
get_dummies()
to create a one-hot encoding of categorical values
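A sketch of one-hot encoding a categorical Series (example values are illustrative; get_dummies sorts the resulting columns alphabetically):

```python
import pandas as pd

sizes = pd.Series(["S", "M", "M"])
dummies = pd.get_dummies(sizes)   # one indicator column per category
```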
working with JSON
json.load()
pd.read_json()
pd.json_normalize()
json.loads(df.to_json())
Hierarchical indexing
allows us to set multiple index levels on an axis. It enables us to manipulate data with an arbitrary number of dimensions in 1D-Series and 2D-DataFrames or to represent hierarchical data
Hierarchical indexing creates a MultiIndex —> a MultiIndex allows to store n-dimensional data in a 2D DataFrame
difference between: mi_df.loc[(’DS’, ’Mstr’)] and mi_df.loc[[(’DS’, ’Mstr’)]]
mi_df.loc[(’DS’, ’Mstr’)] (returns a Series)
mi_df.loc[[(’DS’, ’Mstr’)]] (returns dataframe)
stack() and unstack()
The stack method moves an index level from the columns to the rows
use unstack to move an index from the rows to the columns (opposite operation to stack)
pivot()
to ”unmelt” a DataFrame (convert DF from long format to wide format), use the pivot method
e.g. melted.pivot(columns=’variable’, values=’value’)
append()
to concatenate two Series or DataFrames: e.g. series1.append(series2) (deprecated and removed in pandas 2.0; use concat instead)
concat()
combine / concatenate several Series / DataFrames at once using concat
using concat once is more efficient than using append several times
e.g. pd.concat([series1, series2, series3], ignore_index=True)
by default, concat keeps all indices (like an outer SQL join) —> we might get missing values.
join()
The join method allows us to combine columns of several dataframes based on an index or a specific column
By default the index values are used
use the how parameter: ’left’ (default) / ’right’ / ’outer’ / ’inner’
GFF format
Column: seqid
Column: source
Column: type
Column: start coordinate
Column: end coordinate
Column: score
Column: strand
Column: phase
Column: attributes
E-Utilities
EInfo, ESearch, EPost, ESummary, EFetch, ELink, EGQuery, ESpell, ECitMatch
ESearch
Text search
Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query
ESummary
Downloads document summaries
Responds to a list of UIDs from a given database with the corresponding document summaries
EGQuery
Global query
Responds to a text query with the number of records matching the query in each Entrez database
EInfo
Database statistics
Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases
Without &db: lists all available databases
EFetch
Downloads data records
Responds to a list of UIDs in a given database with the corresponding data records in a specified format
ELink
Entrez links
Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database
Checks for the existence of a specified link from a list of one or more UIDs
Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs
EPost
For UID uploads
Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset
ESpell
Gives Spelling Suggestions
Retrieves spelling suggestions for a text query in a given database
ECitMatch
For batch citation searching in PubMed
retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings
Typical combinations of E-Utility programs: Get DocSummaries or entries for keywords or IDs:
ESearch —> ESummary / EFetch
EPost —> ESummary / EFetch
Typical combinations of E-Utility programs: Filter / limit a record set
EPost / ELink —> ESearch
Typical combinations of E-Utility programs: More advanced queries
ESearch —> ELink —> ESummary / EFetch
EPost —> ELink —> ESearch —> EFetch
Annotation Process: Annotation / Curation Phases (6)
Sequence curation
Sequence analysis
Literature curation
Family-based curation
Evidence attribution
Quality assurance, integration and update
Swiss Prot Flat File Field Types
(some, not all)
Flat file format
fixed number of columns
structure is modelled via indentation and column number
a line is called a record and is typed
typing happens via a keyword in the first columns
keywords start in column 1 (1-based counting)
subkeywords are indented
no keyword: continuation of the previous line
Resolution
quality measure of collected data
R-value
fit between measured and calculated diffraction pattern
R-free
prediction power for diffractions not used in refinement
SCOP Hierarchy
Classes
Fold
Superfamily
Family
Protein domain
Species
Domain
CATH
semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
4 main levels:
C: protein class
A: architecture
T: topology
H: homologous superfamily
Pfam Terms
Repeat
Motif
Clans
Lamport Timestamps
Weak consistency criterion: If event e1 causes e2 then the timestamp of e1 has to be smaller than the timestamp of e2
Strong consistency criterion (the converse): if the timestamp of e1 is smaller than that of e2, then e1 must have caused e2 (Lamport timestamps satisfy only the weak criterion, not this one)
Version Clock
Each process / database has a counter which is incremented
Every process remembers the sender and the timestamp
Every message / version has a vector of id-timestamps attached
Authentication
confirm your identity with login and password
Authorization
your permissions on the database are checked