TypeError
raised when an operation or function is applied to an object of an inappropriate type (e.g. 1 + 'one')
Variable Naming Scheme
valid characters include letters, digits, and underscore sign
a name can’t start with a digit -> SyntaxError
use lowercase for the variable and function names, with words separated by underscores (do_it_like_this_123)
can’t use Python keywords (e.g. True, while, import, in)
sequence
is an ordered collection of values
e.g. strings, lists, tuples, ranges
lists
a mutable, heterogeneous, ordered sequence of elements
mutable: you can change elements in the list
you can add elements to a list via extend() and append()
extend(): adds each element of an iterable to the list individually
append(): adds its argument as a single element; appending a list creates a nested list
you can concatenate lists using +
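A minimal sketch of the difference between extend(), append(), and + (variable names are illustrative):

```python
nums = [1, 2]
nums.extend([3, 4])        # adds each element individually -> [1, 2, 3, 4]
nums.append([5, 6])        # adds the whole list as one nested element
combined = nums + [7]      # + concatenates into a new list
```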
tuples
A python tuple is an immutable, heterogeneous, ordered sequence of elements
immutable: you cannot change / add an element in the tuple
can concatenate tuples using +
Set
A Python set is a mutable, heterogeneous, unordered sequence of distinct elements. A set can contain only hashable elements (for now, read this as immutable)
we can add elements to a set using the add() method
We can compute union, intersection, difference, and symmetric difference of sets in Python ( | , - , & , ^ )
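A short sketch of add() and the four set operators (example values are illustrative):

```python
a = {1, 2, 3}
b = {3, 4}
a.add(5)                   # a is now {1, 2, 3, 5}
union = a | b              # {1, 2, 3, 4, 5}
intersection = a & b       # {3}
difference = a - b         # {1, 2, 5}
symmetric = a ^ b          # {1, 2, 4, 5}
```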
dictionary
A dictionary is a mapping from hashable keys to arbitrary values = a mutable, heterogeneous collection of key-value pairs (insertion order is preserved since Python 3.7)
concatenate using union operator | instead of +
Order matters! If we have the same key in two dictionaries, the former value will be overwritten with the latter one
update one dictionary with the values from another dictionary using the update() method. Order matters!
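A small sketch of both merge styles; note how the latter value wins for the duplicate key (example dictionaries are illustrative):

```python
d1 = {"a": 1, "b": 2}
d2 = {"b": 20, "c": 3}
merged = d1 | d2           # new dict; for duplicate keys the latter value wins
d1.update(d2)              # in-place: d1 now equals merged
```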
Data Type: None
the null object, represents the absence of a value
Control Flow - Branching
if
if-else
if-elif-else
match-case
Control Flow - Loops
while
for
function
a named block of code that can accept arguments and can have a return value
e.g. def my_sum(parameter1, parameter2): …
pass statement allows us to create a function with an empty body
use lowercase for the function names, with words separated by underscores (just like the variables)
Naming scheme - Class
upper camel case
e.g. WriteYourClassNameLikeThis
Python module
a file containing Python code and having ’.py’ extension.
The name of the module is the name of the file without extension
can load modules by name using import
Namespace package
a directory that contains modules
Using directories allows us to create hierarchies of modules and to group similar modules together
Loading a namespace package doesn’t give us access to the modules contained in the package. To access the modules we either have to explicitly import them or use a ”regular” package
Regular package
contains an __init__.py file
This file is supposed to contain the initialization logic.
finally
code that will always be executed as the last task before the try statement completes
finally is useful when using external information e.g. closing a file / database connection
How to assign values we don’t need?
using underscore variable, e.g.
my_list = [1, 2, 3, 4, 5]
a, b, c, _, _ = my_list
Augmented Assignments
e.g. +=, **=, //=, *=
augmented assignments give us a compact way to change a value bound to a variable
&= is intersection
can use augmented assignments for numbers, sets, dictionaries
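A quick sketch of augmented assignments on a number, a set, and a dictionary (|= for dicts needs Python 3.9+):

```python
x = 10
x //= 3                    # floor division in place -> 3
s = {1, 2, 3}
s &= {2, 3, 4}             # in-place intersection -> {2, 3}
d = {"a": 1}
d |= {"b": 2}              # in-place dictionary merge (Python 3.9+)
```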
Lambdas
allow us to create a function inline (in place) where it’s needed: sorted(strings, key=lambda x: x[1])
a lambda expression creates an anonymous function. The keyword lambda is followed by a comma-separated list of parameters, a colon, and a single expression (the expression can’t contain branches, loops, return or yield statements).
A lambda implicitly returns the value of the expression.
Complex numbers
we can create complex numbers using either a literal notation or a complex constructor
e.g. x = 1+2j or y = complex(3, -5); the first part is the real component, the second the imaginary one
Mathematical functions from math module
math.ceil(5.6) —> 6, ceiling = rounds up
math.floor(5.6) —> 5, floor = rounds down
math.sqrt(81) —> 9.0, square root
math.factorial(6) —> 720
math.comb(10, 3) —> 120, combinations
math.perm(10, 3) —> 720, permutations
math.cos(math.pi) —> -1.0
math.degrees(math.pi) —> 180.0
random.seed()
seed() method is used to initialize the random number generator
passing the same seed to random, and then calling it will give you the same set of numbers
if you want the results to be different every time you will have to seed it with something different every time you start -> default is the current time as seed value
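A sketch showing that re-seeding with the same value reproduces the same sequence:

```python
import random

random.seed(42)
first = [random.randint(0, 100) for _ in range(3)]
random.seed(42)            # re-seeding with the same value...
second = [random.randint(0, 100) for _ in range(3)]
# ...reproduces exactly the same sequence of numbers
```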
list
a mutable, ordered, heterogeneous collection of items
zero-indexed
list.clear()
list.copy() —> shallow copy
list.count()
list.index()
list.insert()
list.pop()
list.remove()
…
Shallow copy
A shallow copy means constructing a new collection object and then populating it with references to the child objects found in the original. In essence, a shallow copy is only one level deep. The copying process does not recurse and therefore won’t create copies of the child objects themselves.
Deep copy
A deep copy makes the copying process recursive. It means first constructing a new collection object and then recursively populating it with copies of the child objects found in the original. Copying an object this way walks the whole object tree to create a fully independent clone of the original object and all of its children.
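A minimal sketch of the difference: after mutating the original, the shallow copy sees the change (it shares the inner lists), the deep copy does not:

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)      # shares references to the inner lists
deep = copy.deepcopy(original)     # independent clone of the whole tree
original[0].append(99)
```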
list comprehension
map one list onto another list
e.g. create a list of the squared numbers of another list
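The squared-numbers example as a one-liner:

```python
nums = [1, 2, 3, 4]
squares = [n ** 2 for n in nums]   # maps nums onto [1, 4, 9, 16]
```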
Tuple
immutable, ordered, heterogeneous (since tuples are immutable, we can’t do much with them!)
functions on Sets
Adding items: add, update
Removing items: remove, discard, pop, clear
Manipulating sets: clear, copy
namedtuple
a complex data type that allows to group variables (properties, attributes) together under one name
can unpack namedtuple like a regular tuple
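A short sketch (the Point type is a made-up example):

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])   # hypothetical type
p = Point(x=3, y=4)
x, y = p                                  # unpacks like a regular tuple
```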
OrderedDict
when we compare two objects of the OrderedDict type, not only the items but also the insertion order is tested for equality
Callable
something we can call, invoke, execute using parenthesis (), e.g. functions are callables, classes are callables
we can make instances of our classes callable like functions by implementing the __call__ method
why implement __call__: e.g. for timing benchmarks, for context managers, or for use in decorators
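A sketch of a callable instance; the Timer class is a made-up example for timing benchmarks:

```python
import time

class Timer:
    # hypothetical class whose instances are callable like functions
    def __call__(self, func, *args):
        start = time.perf_counter()
        result = func(*args)
        self.elapsed = time.perf_counter() - start
        return result

timer = Timer()
result = timer(sum, [1, 2, 3])   # the instance is invoked with ()
```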
Attribute names starting with a double underscore
make the interpreter rewrite them (name mangling)
We need to be careful with double leading underscores and inheritance, though
implement getters and setters
with the @property and @{property_name}.setter decorators
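A minimal sketch of both decorators; the Temperature class and its validation rule are made up for illustration:

```python
class Temperature:
    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def celsius(self):               # getter: read like a plain attribute
        return self._celsius

    @celsius.setter
    def celsius(self, value):        # setter: validates on assignment
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._celsius = value

t = Temperature(20)
t.celsius = 25                       # goes through the setter
```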
Context Manager (Class-based)
class with following methods:
__enter__: first executed
__exit__: second executed
(__call__: processes arguments)
examples to use context manager:
to work with files, time benchmarking, working with the file system, when you don’t want to update gradients in PyTorch (torch.no_grad()), interactions with external resources, …
when we have to open and then close it (like a sandwich)
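A sketch of the class-based pattern; the Stopwatch class is a made-up time-benchmarking example:

```python
import time

class Stopwatch:
    # hypothetical class-based context manager for time benchmarking
    def __enter__(self):                 # executed first, on entering the with-block
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):   # executed on leaving
        self.elapsed = time.perf_counter() - self.start
        return False                     # don't suppress exceptions

with Stopwatch() as sw:
    sum(range(10_000))
```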
Generator
a generator is a function that returns a generator iterator (use the yield keyword)
a generator expression combines lazy evaluation of generators with the beauty and simplicity of list comprehensions
lazy meaning values are only computed if / when we want it
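A sketch of both forms; note that creating the generator computes nothing until values are requested:

```python
def squares(n):
    # generator function: yield makes it return a generator iterator
    for i in range(n):
        yield i ** 2

gen = squares(4)                       # nothing is computed yet (lazy)
gen_expr = (i ** 2 for i in range(4))  # generator expression, also lazy
first = next(gen)                      # values are computed on demand
rest = list(gen)
```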
built-in classes in Python that are iterators
enumerate
zip
map & filter
Context Manager (Function-based)
we can create a context manager using a function that yields and a @contextmanager decorator
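A sketch of the function-based pattern (the stopwatch function is a made-up benchmarking example): code before the yield plays the role of __enter__, code after it the role of __exit__.

```python
from contextlib import contextmanager
import time

@contextmanager
def stopwatch():
    stats = {}
    start = time.perf_counter()        # runs on entry (like __enter__)
    try:
        yield stats
    finally:
        stats["elapsed"] = time.perf_counter() - start   # runs on exit (like __exit__)

with stopwatch() as stats:
    sum(range(10_000))
```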
Nested Functions
we can have an ”inner” function defined inside an ”outer” function
the inner function is only accessible within the outer function
the inner function ”remembers” the value of the power argument even after the outer function has completed its execution —> when an inner function is defined within an outer function, the inner function keeps access to the variables of the outer function
Closure
a closure is a function that keeps access to its environment as it was when the function was defined
closures help us to hide / protect the data
closures help us to generate functions at runtime
closures help us to create decorators -> we can use closures to create function-based decorators
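A minimal closure sketch generating functions at runtime (names are illustrative):

```python
def power_of(exponent):
    # inner "remembers" exponent even after power_of has returned (a closure)
    def inner(base):
        return base ** exponent
    return inner

square = power_of(2)   # a function generated at runtime
cube = power_of(3)
```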
Type hints
allow us to specify types for variables, function parameters, and function return values
IDEs / editors like PyCharm or 3rd-party tools can parse type hints and notify the user when they are violated
DataClass
@dataclass decorator simplifies class creation & can add the following methods: __init__, __repr__, __eq__: equal, … (these three are added by default)
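A sketch of the generated methods (the Point class is a made-up example):

```python
from dataclasses import dataclass

@dataclass
class Point:
    # __init__, __repr__, and __eq__ are generated automatically
    x: int
    y: int

p = Point(1, 2)
```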
Vectorization
refers to performing an operation on several values at once (SIMD: single instruction, multiple data)
alternatively, vectorization means converting an algorithm from operating on a single value at a time to operating on several values at a time
NumPy Data Types
int8, int16, int32, int64
uint8, uint16, uint32, uint64
float16, float32, float64
bool_, unicode string, object
create NumPy Array
using np.array()
from sequences
from tuples
from ranges
using fromiter to create arrays from iterables
helpers for array creation:
np.arange(1, 10)
np.zeros()
np.ones()
np.eye()
np.full()
np.random.random()
np.random.randint()
accessing elements from NumPy arrays
by indices
use slicing
use Boolean indexing
by array of indices
combining simple indices with fancy indexing
combining slicing with fancy indexing
functions on NumPy arrays
np.sort()
np.where()
np.shape()
Math on NumPy Arrays
np.add(arr1, arr2)
np.divide(arr1, arr2)
np.matmul(arr1, arr2)
np.sum()
np.median()
np.min()
np.argmax()
Broadcasting
we can perform pairwise operations on arrays of different shapes, as long as arrays are compatible in every dimension
Broadcasting allows us to use a smaller array several times together with a larger array according to the following rules:
arrays are compatible in a dimension if they have the same size in a given dimension OR if the smaller array has size 1
if the arrays do not have the same number of dimensions, prepend (add to beginning / as prefix) 1 to the shape of the smaller one until they do
a smaller array acts as if it was copied along those dimensions where its size is 1
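A sketch of these rules: a (3,) array is treated as (1, 3) and acts as if copied along axis 0 (example values are illustrative):

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])          # shape (3,) is treated as (1, 3)
result = matrix + row                 # row acts as if copied along axis 0
```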
Transpose NumPy Array
use the T attribute or the numpy.transpose function: np.transpose(arr) or arr.T
Store dates and times in NumPy
using the datetime64 data type: e.g. np.datetime64(’2021-01-01’)
we can create datetime64 instances by passing a string, an integer and a unit, or a string and a unit
not a time value = np.datetime64(’NaT’)
timedelta64
the timedelta64 data type is used to store the result of subtracting two datetime64 values
Record Arrays
record arrays allow field access through the dot notation, yet are less performant
the recarray subclass gives us the ability to access named fields through attribute lookup (i.o.w., using dot notation).
Regular structured arrays are more performant, so I recommend using them instead of the specialized subclass (unless you really need the attribute lookup)
several helper functions for constructing record arrays:
np.rec.fromarrays
np.rec.fromfile
np.rec.fromrecords
np.rec.fromstring
Adding / removing dimensions of matrixes / arrays in NumPy
use reshape method and -1 to add dimension
alternatively, use the np.expand_dims function to add dimension
np.squeeze() to remove axes of length 1
flatten an array using flatten() method
Stacking Arrays
np.column_stack(): to stack as columns
np.row_stack(): to stack as rows (alias for vstack)
np.hstack(): stacks arrays horizontally (column-wise) —> side by side, e.g. np.hstack((vector1, vector2))
np.vstack(): stacks arrays vertically (row-wise) —> on top of each other
ufunc
A universal function (ufunc) is a function that accepts NumPy arrays as input and performs computations on them element-wise
Binary ufuncs have three methods:
reduce: repeatedly applies the ufunc it is invoked on to the array elements until it computes a single result
accumulate: repeatedly applies the ufunc it is invoked on to the array elements AND returns the intermediate results
outer: applies the ufunc to all possible pairs from given inputs
create own ufuncs using np.vectorize
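A sketch of the three binary-ufunc methods on small example arrays:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
total = np.add.reduce(arr)                   # single result: 1+2+3+4 = 10
running = np.multiply.accumulate(arr)        # intermediate results: [1, 2, 6, 24]
table = np.multiply.outer([1, 2], [10, 20])  # all pairs: [[10, 20], [20, 40]]
```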
Working with files in NumPy
save a NumPy array to a file using np.save(’file’, array)
np.load loads a NumPy array from file: np.load(’file’)
save multiple NumPy arrays to a single file using np.savez(’file’, arr1, arr2)
np.savez_compressed acts like np.savez, except it creates a compressed zip-file
np.savetxt saves arrays to a text file
we can load / read data from text files using np.loadtxt()
creating Series in Pandas
can be created from a scalar, sequence, iterable, NumPy array (might want to use copy=True), dictionary (keys become index values)
indexed by consecutive integers starting at 0 by default
careful when creating a Series from a NumPy array: set copy=True, otherwise changing values in the Series also changes values in the NumPy array
create DataFrame in Pandas
We can create a DataFrame using
a dictionary of series
using structured NumPy arrays
from a 2D NumPy array
using iterables instead of Series or NumPy arrays / an iterable of iterables
using a Dictionary (and column names) of iterables
by passing an iterable of dictionaries
explicit column names
Index in Pandas
an Index is an immutable array or an immutable ordered multi-set built on top of a NumPy array
operations on index in Pandas
pd_index.size
pd_index.shape
pd_index.ndim
pd_index.dtype
index1.intersection(index2)
index1.union(index2)
index1.difference(index2)
index1.symmetric_difference(index2)
index1.is_unique
index1.has_duplicates
index1.is_monotonic_increasing
index1.is_monotonic_decreasing
index1.insert(3, 100) —> insert value 100 at position 3 (0 indexing, so 100 is fourth value)
index1.delete(1) —> delete value at position 1 (0 indexing so second value)
index1.copy()
index1.max()
index1.argmax()
index1.min()
index1.argmin()
index1.sort_values()
index1.unique()
index1.value_counts()
index1.drop_duplicates()
index1.drop_duplicates(keep=’last’)
index1.drop_duplicates(keep=False)
loc vs iloc
loc allows to select a subset (of rows) based on explicit labels
e.g. df.loc[’MF’]
e.g. df.loc[[’MF’]] (to return a dataframe)
e.g. df.loc[’MF’:’UP’] (to return multiple rows; here ’UP’ is included)
df.loc[1] doesn’t work with integers (when the labels aren’t integers) —> KeyError: 1
we can use slicing with column access using loc: e.g.: df.loc[:, ’capital’:’population’]
iloc allows to select a subset (of rows) based on implicit integer indices
df.iloc[2] —> one row as a series (not so efficient)
df.iloc[[2]] —> one row as a df
df.iloc[2:4] —> multiple rows as dataframe (4 is not included, returns row 2 and 3)
it doesn’t work with labels (explicit index values): e.g. df.iloc[’MF’] —> TypeError: Cannot index by location index with a non-integer key
we can use slicing with column access using iloc: e.g. df.iloc[:, 1:3]
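A sketch contrasting both accessors on a tiny made-up DataFrame; note that loc slicing includes the stop label while iloc slicing excludes the stop position:

```python
import pandas as pd

# hypothetical data
df = pd.DataFrame({"capital": ["Berlin", "Wien"], "population": [83, 9]},
                  index=["DE", "AT"])
by_label = df.loc["DE", "capital"]     # explicit label
by_position = df.iloc[0, 0]            # implicit integer position
inclusive = df.loc["DE":"AT"]          # loc slicing includes the stop label
exclusive = df.iloc[0:1]               # iloc slicing excludes the stop index
```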
methods to compute statistical values over a Series or (over columns or rows of) DataFrame
Series:
pandaseries.mode()
pandaseries.argmax()
pandaseries.max()
pandaseries.argmin()
pandaseries.min()
pandaseries.mean()
pandaseries.median()
DataFrame
df1.max()
df1.min()
df.median(axis=’columns’)
etc.
df1.describe()
describe method computes several statistical values for columns of a DataFrame
—> returns dataframe with count, mean, std, min, 25%, 50%, 75%, max as rows
all method
returns True if all elements are evaluated as True
can be used with a Series or with a DataFrame
any method (Definition and Usage)
returns True if at least one element is evaluated as True
unique method
returns unique values for a given Series
nunique method
get numbers of unique values in a Series
on DataFrame: get number of unique values in each column
value_counts method
returns values and their counts
on DataFrame: returns the number of unique rows
sorted by descending order by default
convert_dtypes method
simplify working with missing values
convert_dtypes method automatically tries to infer a fitting data type for our values, which comes in handy for None and np.nan
dropna()
drop missing values, rows and columns
df.dropna() —> drops entire rows that have a NA value
To drop columns with missing values instead of rows, we need to pass axis=’columns’: e.g. df.dropna(axis=’columns’)
The dropna method by default removes those rows / columns where at least one value is missing
we can pass an argument for the thresh parameter to remove rows / columns only when the number of present values is below a given threshold
we can specify a parameter: subset of columns to check when removing rows (or other way around)
filling missing values in Pandas
with .fillna() method
filling with a single value
forward propagation —> method=’ffill’
backward propagation —> method=’bfill’
filling with a dictionary / Series
with .interpolate() method
interpolation
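A sketch of the three filling strategies on a small Series (newer pandas prefers the ffill() method over fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(0)          # fill with a single value
forward = s.ffill()           # forward propagation
linear = s.interpolate()      # linear interpolation by default
```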
nominal variables vs ordinal variables
nominal variables have no inherent order —> e.g. names
ordinal variables have an order —> e.g. clothing sizes
get_dummies()
to create a one-hot encoding of categorical values
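A sketch of one-hot encoding a categorical Series (example values are illustrative; get_dummies sorts the resulting columns alphabetically):

```python
import pandas as pd

sizes = pd.Series(["S", "M", "M"])
dummies = pd.get_dummies(sizes)   # one indicator column per category
```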
working with JSON
json.load()
pd.read_json()
pd.json_normalize()
json.loads(df.to_json())
Hierarchical indexing
allows us to set multiple index levels on an axis. It enables us to manipulate data with an arbitrary number of dimensions in 1D-Series and 2D-DataFrames or to represent hierarchical data
Hierarchical indexing creates a MultiIndex —> a MultiIndex allows to store n-dimensional data in a 2D DataFrame
difference between: mi_df.loc[(’DS’, ’Mstr’)] and mi_df.loc[[(’DS’, ’Mstr’)]]
mi_df.loc[(’DS’, ’Mstr’)] (returns a Series)
mi_df.loc[[(’DS’, ’Mstr’)]] (returns dataframe)
stack() and unstack()
The stack method moves an index level from the columns to the rows
use unstack to move an index from the rows to the columns (opposite operation to stack)
pivot()
to ”unmelt” a DataFrame (convert DF from long format to wide format), use the pivot method
e.g. melted.pivot(columns=’variable’, values=’value’)
append()
to concatenate two Series or DataFrames: e.g. series1.append(series2) (deprecated and removed in pandas 2.0; use concat instead)
concat()
combine / concatenate several Series / DataFrames at once using concat
using concat once is more efficient than using append several times
e.g. pd.concat([series1, series2, series3], ignore_index=True)
by default, concat keeps all indices (like an outer SQL join) —> we might get missing values.
join()
The join method allows us to combine columns of several dataframes based on an index or a specific column
By default the index values are used
use the how parameter: ’left’ (default) / ’right’ / ’outer’ / ’inner’
GFF format
Column: seqid
Column: source
Column: type
Column: start coordinate
Column: end coordinate
Column: score
Column: strand
Column: phase
Column: attributes
E-Utilities
EInfo, ESearch, EPost, ESummary, EFetch, ELink, EGQuery, ESpell, ECitMatch
ESearch
Text search
Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query
ESummary
Downloads document summaries
Responds to a list of UIDs from a given database with the corresponding document summaries
EGQuery
Global query
Responds to a text query with the number of records matching the query in each Entrez database
EInfo
Database statistics
Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases
Without &db: lists all available databases
EFetch
Downloads data records
Responds to a list of UIDs in a given database with the corresponding data records in a specified format
ELink
Entrez links
Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database
Checks for the existence of a specified link from a list of one or more UIDs
Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs
EPost
For UID uploads
Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset
ESpell
Gives Spelling Suggestions
Retrieves spelling suggestions for a text query in a given database
ECitMatch
For batch citation searching in PubMed
retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings
Typical combinations of E-Utility programs: Get DocSummaries or entries for keywords or IDs:
ESearch —> ESummary / EFetch
EPost —> ESummary / EFetch
Typical combinations of E-Utility programs: Filter / limit a record set
EPost / ELink —> ESearch
Typical combinations of E-Utility programs: More advanced queries
ESearch —> ELink —> ESummary / EFetch
EPost —> ELink —> ESearch —> EFetch
Annotation Process: Annotation / Curation Phases (6)
Sequence curation
Sequence analysis
Literature curation
Family-based curation
Evidence attribution
Quality assurance, integration and update
Swiss Prot Flat File Field Types
(some, not all)
Flat file format
fixed number of columns
structure is modelled via indentation and column number
a line is called a record and is typed
typing happens via a keyword in the first columns
keywords start in column 1 (1-based counting)
subkeywords are indented
no keyword: continuation of the previous line
Resolution
quality measure of collected data
R-value
fit between measured and calculated diffraction pattern
R-free
prediction power for diffractions not used in refinement
SCOP Hierarchy
Classes
Fold
Superfamily
Family
Protein domain
Species
Domain
CATH
semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
4 main levels:
C: protein class
A: architecture
T: topology
H: homologous superfamily
Pfam Terms
Repeat
Motif
Clans
Lamport Timestamps
Weak consistency criterion: If event e1 causes e2 then the timestamp of e1 has to be smaller than the timestamp of e2
Strong consistency criterion (the converse): if the timestamp of e1 is smaller than that of e2, then e1 must have caused e2 (Lamport timestamps satisfy only the weak criterion, not this one)
Version Clock
Each process / database has a counter which is incremented
Every process remembers the sender and the timestamp
Every message / version has a vector of id-timestamps attached
Authentication
confirm your identity with login and password
Authorization
your permissions on the database are checked