Which built-in Python data types are sequences and what characterizes a sequence?
Built-in sequence types are:
- str
- list
- tuple
- range
A sequence is an ordered collection that:
- Preserves element order
- Supports indexing (s[0])
- Supports slicing (s[1:4])
- Can be iterated over
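A quick sketch of the shared sequence operations (values are illustrative):

```python
s = "hello"
lst = [10, 20, 30]

# Indexing
print(s[0])    # → 'h'
print(lst[0])  # → 10

# Slicing
print(s[1:4])   # → 'ell'
print(lst[1:])  # → [20, 30]

# Iteration preserves order
print([ch for ch in s])  # → ['h', 'e', 'l', 'l', 'o']
```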
What is the difference between a stack and a queue, and when should each be used?
Stack → LIFO (Last In, First Out)
- push, pop
- Used for recursion, undo operations
Queue → FIFO (First In, First Out)
- enqueue, dequeue
- Used for scheduling, task processing, BFS
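A minimal sketch: a plain list works as a stack, and collections.deque works as a queue:

```python
from collections import deque

# Stack (LIFO): append/pop at the end of a list are O(1)
stack = []
stack.append(1)     # push
stack.append(2)
print(stack.pop())  # → 2 (last in, first out)

# Queue (FIFO): deque's popleft() at the front is O(1)
queue = deque()
queue.append('a')        # enqueue
queue.append('b')
print(queue.popleft())   # → 'a' (first in, first out)
```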
What is a deque in Python?
A deque (double-ended queue) from collections allows:
- Fast append() and pop()
- Fast appendleft() and popleft()
Efficient for insertions/removals at both ends.
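A short sketch of the four end operations:

```python
from collections import deque

d = deque([2, 3])
d.appendleft(1)     # O(1) at the front
d.append(4)         # O(1) at the back
print(d)            # → deque([1, 2, 3, 4])
print(d.popleft())  # → 1
print(d.pop())      # → 4
```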
Which operations are less efficient in a deque compared to a list?
Random access (d[i]) is slower in a deque.
Lists allow O(1) indexing due to contiguous memory.
Deques are optimized for fast end operations, not middle access.
What is the difference between an iterable and an iterator?
Iterable:
- An object you can loop over
- Implements __iter__()
Iterator:
- Created using iter()
- Produces items with next()
- Keeps state
- Raises StopIteration when finished
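A minimal sketch of the relationship:

```python
nums = [1, 2, 3]   # iterable (implements __iter__)
it = iter(nums)    # iterator (implements __next__, keeps state)
print(next(it))    # → 1
print(next(it))    # → 2
print(next(it))    # → 3
# A further next(it) would raise StopIteration
```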
What can be used as a context manager and why?
Any object implementing:
- __enter__()
- __exit__()
Used with "with" to:
- Automatically manage resources
- Ensure cleanup (files, DB connections)
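A minimal file example (written to a temp path so it is self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo.txt")

# "with" calls __enter__ on entry and __exit__ on exit,
# so the file is closed even if an exception occurs inside the block.
with open(path, "w") as f:
    f.write("hello")

print(f.closed)  # → True (closed automatically)
```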
What are magic methods and when are they used?
Magic methods start and end with __ (e.g., __init__, __str__).
They define behavior for:
- Object creation
- Arithmetic
- Comparison
- Printing
They enable operator overloading.
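A minimal operator-overloading sketch (the Vector class is illustrative):

```python
class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):          # enables v1 + v2
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):           # enables v1 == v2
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):                # printable representation
        return f"Vector({self.x}, {self.y})"

print(Vector(1, 2) + Vector(3, 4))  # → Vector(4, 6)
```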
How can you mark a function as private in Python?
You cannot make it truly private.
Convention:
- _function() → protected
- __function() → name mangling (class only)
Privacy is by convention, not enforced.
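A short sketch showing that name mangling only renames, it does not hide:

```python
class Account:
    def __init__(self):
        self._hint = "protected by convention"
        self.__secret = "name-mangled"   # stored as _Account__secret

a = Account()
print(a._hint)             # works — just a convention
print(a._Account__secret)  # → "name-mangled" (mangling, not privacy)
# a.__secret would raise AttributeError
```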
What does the zip() function do?
zip() combines multiple iterables element-wise.
Returns an iterator of tuples.
Example:
zip([1,2], ['a','b']) → (1,'a'), (2,'b')
What does functools.wraps do?
Used in decorators.
It preserves:
- Original function name
- Docstring
- Metadata
Without it, the wrapper replaces function metadata.
What are generator expressions?
A compact way to create generators:
(x*x for x in range(5))
They:
- Produce values lazily
- Use less memory
- Return generator objects
What is a closure in Python?
A closure is a function that:
- Remembers variables from its enclosing scope
- Even after the outer function has finished
Used for function factories and data encapsulation.
What is a dataclass and its advantages?
A @dataclass automatically generates:
- __init__
- __repr__
- __eq__
Advantages:
- Less boilerplate
- Cleaner code
- Designed for storing data
What does it mean that Python is strongly typed and dynamically typed?
Strongly typed:
- No implicit type coercion (e.g., "3" + 3 fails)
Dynamically typed:
- Type is determined at runtime
- No need to declare variable types
What is Python’s naming convention (PEP 8)?
- variables/functions: lowercase_with_underscores
- Classes: CamelCase
- Constants: UPPERCASE
Name three built-in data types in Python.
Examples:
- int
- float
(others: list, dict, set, tuple, bool)
What is a module in Python?
A module is a .py file containing:
- Functions
- Classes
- Variables
Used to organize and reuse code.
Where does Python look for modules?
Python searches:
- Current directory
- PYTHONPATH
- Standard library directories
- Installed packages (site-packages)
Stored in sys.path.
What does the finally clause do in try/except?
The finally block:
- Always executes
- Runs whether an exception occurs or not
Used for cleanup operations.
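A small sketch showing finally running on both the success and error paths:

```python
def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None
    finally:
        print("cleanup runs either way")

print(divide(10, 2))  # → 5.0  (finally still ran)
print(divide(10, 0))  # → None (finally ran despite the exception)
```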
Can default parameters appear before non-default ones?
No.
Default parameters must come after required parameters.
Correct:
def f(a, b=3):
What is the difference between positional and keyword arguments?
Positional:
- Order matters
Keyword:
- Passed using parameter names
- Order does not matter
How do you accept an arbitrary number of arguments?
*args → variable positional arguments (tuple)
**kwargs → variable keyword arguments (dict)
How can you unpack collections into function arguments?
Use:
* for sequences
** for dictionaries
func(*my_list)
func(**my_dict)
How do you sort a list by the last letter of each string?
Use sorted() with key:
sorted(strings, key=lambda s: s[-1])
What is the difference between append() and extend()?
append(x):
- Adds one element
- List grows by one
extend(iterable):
- Adds multiple elements
- Merges another iterable
What does list slicing [x:y:z] mean?
x → start index
y → stop index (exclusive)
z → step size
[1:10:2] → every second element
What does my_list[1:] return for a list of 20 elements?
It returns:
- All elements starting from index 1
- Excludes the first element
- Total length = 19
What are namedtuple, Counter, and OrderedDict?
namedtuple:
- Tuple with named fields
Counter:
- Counts element frequency
OrderedDict:
- Dictionary preserving insertion order (older Python versions)
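A quick sketch of all three in action:

```python
from collections import namedtuple, Counter, OrderedDict

Point = namedtuple('Point', ['x', 'y'])
p = Point(3, 4)
print(p.x)  # → 3 (access by name instead of index)

c = Counter('abracadabra')
print(c['a'])            # → 5
print(c.most_common(1))  # → [('a', 5)]

od = OrderedDict([('first', 1), ('second', 2)])
print(list(od))  # → ['first', 'second'] (insertion order kept)
```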
How do you specify a variable type?
Using type hints:
x: int = 3
Type hints improve readability and static analysis.
How do you specify a function return type?
Using ->
def func() -> int:
    return 3
What are decorators?
Decorators:
- Wrap functions
- Add functionality
- Use @syntax
They modify behavior without changing original code.
Compare list, tuple, set, and dict (2 characteristics each).
list:
- Ordered
- Mutable
tuple:
- Immutable
set:
- Unordered
- Unique elements
dict:
- Key-value pairs
Name two ways to format strings in Python.
1. f-strings:
f"Hello {name}"
2. str.format():
"Hello {}".format(name)
What do the positions in string[x:y:z] stand for?
x -> start position (inclusive)
y -> end position (exclusive)
z -> step size
Name two different ways to format strings in Python.
1. f-strings: f"text {variable}"
2. .format() method: "text {}".format(variable)
Write an f-string that prints "Alice is 25 years old" using variables name = "Alice" and age = 25.
print(f"{name} is {age} years old.")
Write code using .format() to print "Hello, Bob!" using the variable name = "Bob".
print("Hello, {}!".format(name))
What do %s, %d, and %f mean in % formatting?
%s = string
%d = integer (decimal)
%f = float
Example: "Name: %s, Age: %d" % (name, age)
How do you format a float to 2 decimal places using f-strings?
price = 19.99
print(f"{price:.2f}") # Output: 19.99
What's the difference between append() and extend()?
append(x) — adds ONE element (even if it's a list)
extend(iterable) — adds MULTIPLE elements from an iterable
my_list = [1, 2]
my_list.append([3, 4]) # → [1, 2, [3, 4]] ← nested list!
my_list.extend([3, 4]) # → [1, 2, 3, 4] ← flattened
What's the difference between pop() and remove()?
pop(i) — removes by INDEX (default: last), RETURNS the value
remove(x) — removes by VALUE (first occurrence), returns None
my_list = ['a', 'b', 'c']
my_list.pop(1) # Removes index 1 → returns 'b'
my_list.remove('a') # Removes value 'a' → returns None
Which list methods modify the list in-place (don't return a new list)?
reverse() — reverses in-place
sort() — sorts in-place
append(), extend(), insert(), remove(), pop(), clear()
⚠️ Common mistake: my_list.sort() returns None, NOT the sorted list!
What does "shallow copy" mean for list.copy()?
Copies the list structure, but NOT nested objects (e.g., inner lists)
Changes to nested objects affect BOTH lists
list1 = [[1, 2], [3, 4]]
list2 = list1.copy()
list1[0][0] = 99 # ← Changes list2 too!
# list1 = [[99, 2], [3, 4]]
# list2 = [[99, 2], [3, 4]] ← affected!
What do index() and count() return?
index(x) — index of FIRST occurrence (error if not found)
count(x) — number of occurrences (0 if not found)
my_list = ['a', 'b', 'a', 'c']
my_list.index('a') # → 0 (first occurrence)
my_list.count('a') # → 2 (appears twice)
What does insert(i, x) do? What if index is out of bounds?
Inserts element x BEFORE index i. If i is out of bounds, it is clamped to the start/end (no error).
my_list.insert(0, 'z') # → ['z', 'a', 'b', 'c']
my_list.insert(100, 'end') # → [..., 'c', 'end'] ← at end
Write the syntax for list, dict, and set comprehensions.
# List comprehension
[expression for item in iterable if condition]
# Dict comprehension
{key: value for item in iterable if condition}
# Set comprehension
{expression for item in iterable if condition}
Write a list comprehension to get squares of even numbers from [1, 2, 3, 4, 5].
numbers = [1, 2, 3, 4, 5]
result = [x**2 for x in numbers if x % 2 == 0]
# Result: [4, 16]
What's the difference between {x for x in nums} and [x for x in nums]?
{x for x in nums} — set comprehension → removes duplicates, unordered
[x for x in nums] — list comprehension → keeps duplicates, ordered
nums = [1, 2, 2, 3]
list_comp = [x for x in nums]  # → [1, 2, 2, 3]
set_comp = {x for x in nums}   # → {1, 2, 3}
What's the difference between a module and a package in Python?
Module: a single .py file containing Python code (e.g., math_utils.py)
Package: a directory containing modules plus an __init__.py file, e.g.:
mypackage/
    __init__.py
    math_utils.py
What is the purpose of __init__.py in a package?
Marks a directory as a package (traditional packages)
Can be empty OR contain initialization code
Namespace packages (Python 3.3+) don't require __init__.py
Note: Without __init__.py, the directory is a namespace package (more advanced)
Briefly describe namedtuple, Counter, OrderedDict, and deque from the collections module.
namedtuple — tuple with named fields (access by name: point.x)
Counter — counts occurrences of elements in an iterable
OrderedDict — dict that preserves insertion order (+ ordering methods)
deque — double-ended queue, fast append/pop from both ends
Easy memory trick:
namedtuple = group variables together
Counter = count collection items
OrderedDict = preserve insertion order
deque = efficient stack/queue
Explain when try, except, else, and finally blocks execute.
try — code that might raise an exception
except — runs if an exception occurs in try
else — runs ONLY if NO exception occurred
finally — ALWAYS runs (cleanup code)
Order: try → except (if error) → else (if no error) → finally (always)
Show two ways to catch multiple exception types.
# Method 1: Separate blocks
try:
    ...
except ValueError:
    ...
except TypeError:
    ...

# Method 2: Single block with a tuple
try:
    ...
except (ValueError, TypeError):
    ...
How do you manually raise an exception in Python?
Use the raise keyword:
raise ValueError("Custom error message")

# Or re-raise the current exception:
try:
    ...
except SomeError:
    # do something
    raise  # re-raises the same exception
Where must default parameters appear in a function definition?
Default parameters MUST come AFTER non-default parameters.
# ✅ Correct
def func(a, b, c=5, d=10):
    pass

# ❌ Wrong
def func(a, c=5, b):  # SyntaxError!
    pass
What are *args and **kwargs? What data types do they create?
*args — collects extra positional arguments into a tuple
**kwargs — collects extra keyword arguments into a dict

def example(*args, **kwargs):
    print(type(args))    # → <class 'tuple'>
    print(type(kwargs))  # → <class 'dict'>

example(1, 2, 3, a=4, b=5)
# args = (1, 2, 3)
# kwargs = {'a': 4, 'b': 5}
How do you unpack a list and a dict as function arguments?
*my_list — unpacks a list/tuple as positional arguments
**my_dict — unpacks a dict as keyword arguments

def func(a, b, c):
    return a + b + c

values = [1, 2, 3]
func(*values)   # Same as: func(1, 2, 3)

params = {"a": 1, "b": 2, "c": 3}
func(**params)  # Same as: func(a=1, b=2, c=3)
When calling a function, what's the rule about positional and keyword arguments?
Positional arguments MUST come BEFORE keyword arguments.
func(1, 2, c=3)    # ✅ Correct
func(1, b=2, c=3)  # ✅ Correct
func(a=1, 2, 3)    # ❌ WRONG! keyword before positional (SyntaxError)
What is the syntax of a lambda function? What are its limitations?
lambda parameters: expression
Limitations:
Only one expression (no multiple statements, no loops)
Implicitly returns the expression result
No return keyword needed/allowed
lambda x: x ** 2 is the same as def f(x): return x ** 2
Sort the list [("Alice", 25), ("Bob", 20)] by age using a lambda.
students = [("Alice", 25), ("Bob", 20)]
sorted_students = sorted(students, key=lambda x: x[1])
# Result: [('Bob', 20), ('Alice', 25)]
Pattern:
sorted(iterable, key=lambda x: x[index/attribute])
What's the difference between map() and filter() with lambdas?
map(func, iterable) — transforms each element, returns all
filter(func, iterable) — keeps only elements where func returns True

nums = [1, 2, 3, 4, 5]
# map: transform all elements
list(map(lambda x: x ** 2, nums))
# → [1, 4, 9, 16, 25] (all 5 transformed)
# filter: keep only some elements
list(filter(lambda x: x % 2 == 0, nums))
# → [2, 4] (only even numbers kept)
What does it mean that functions are "first-class objects" in Python?
Functions can be:
Assigned to variables
Passed as arguments to other functions
Returned from functions
Stored in data structures (lists, dicts)

def greet(name):
    return f"Hi, {name}"

# Assign to variable
my_func = greet

# Pass as argument
def call_func(func, arg):
    return func(arg)

call_func(greet, "Alice")  # → "Hi, Alice"
Write a function apply_twice(func, value) that applies a function to a value twice.
def apply_twice(func, value):
    return func(func(value))

def square(x):
    return x ** 2

apply_twice(square, 2)  # → 16
# First: square(2) = 4
# Second: square(4) = 16
What's the difference between an iterable and an iterator? What methods must each implement?
Iterable:
Can be looped over (lists, strings, etc.)
Implements __iter__() → returns an iterator
Can be iterated multiple times
Iterator:
Represents a stream of data
Implements __iter__() (returns self) AND __next__()
Raises StopIteration when exhausted
One-time use — exhausted after iteration
Explain what happens behind the scenes when you write for item in my_list.
Python calls iter(my_list) → gets an iterator by calling __iter__()
In each iteration, Python calls next(iterator) → gets the next value via __next__()
When __next__() raises StopIteration, the loop terminates

# for item in my_list:
#     print(item)
# Equivalent to:
iterator = iter(my_list)
while True:
    try:
        item = next(iterator)
        print(item)
    except StopIteration:
        break
What are magic/dunder methods in Python? Give 4 examples.
Methods that start and end with double underscores (__). They allow you to customize how Python's built-in operations work with your custom classes.
4 Examples:
__str__ — called by str() and print()
__init__ — constructor, called when creating an object
__eq__ — called by the == operator
__len__ — called by len()
(Other valid answers: __repr__, __add__, __lt__, __iter__, __next__, __call__, __enter__, __exit__)
Which dunder method is called for each operation?
len(obj)        # → obj.__len__()
str(obj)        # → obj.__str__()
obj1 == obj2    # → obj1.__eq__(obj2)
obj1 + obj2     # → obj1.__add__(obj2)
obj1 < obj2     # → obj1.__lt__(obj2)
for x in obj:   # → obj.__iter__()
next(iterator)  # → iterator.__next__()
obj()           # → obj.__call__()
What's the difference between __str__ and __repr__?
__str__ — human-readable, friendly string (for print(), str())
__repr__ — unambiguous, developer-focused representation (for repr(), REPL)

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __str__(self):
        return f"Point at ({self.x}, {self.y})"
    def __repr__(self):
        return f"Point(x={self.x}, y={self.y})"

p = Point(3, 4)
print(p)  # → "Point at (3, 4)" (uses __str__)
repr(p)   # → "Point(x=3, y=4)" (uses __repr__)
Given this basic decorator, modify wrapper to accept arbitrary arguments and keyword arguments:
def decorator(func):
    def wrapper(*args, **kwargs):       # ← Add *args, **kwargs
        print('Execution started')
        result = func(*args, **kwargs)  # ← Pass them to func
        print('Execution completed')
        return result
    return wrapper
What is a decorator in Python? How is @decorator syntax used?
A decorator is a function that takes a function and returns a modified version of it.

@decorator
def my_func():
    ...
# The @ syntax is equivalent to:
my_func = decorator(my_func)

Pattern: Outer function returns inner wrapper function (closure).
What are the two ways to implement decorators?
Function-based (most common):
def decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

Class-based:
class Decorator:
    def __init__(self, func):
        self.func = func
    def __call__(self, *args, **kwargs):
        return self.func(*args, **kwargs)
What does functools.wraps do and why is it needed?
Preserves the original function's metadata (__name__, __doc__, etc.) when decorating.

from functools import wraps

def decorator(func):
    @wraps(func)  # ← Preserves func.__name__, func.__doc__
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

Without it: the decorated function has __name__ == "wrapper" instead of the original name.
Describe closures in Python. (bullet points)
A closure is when an inner function has access to variables from its outer function
The inner function "remembers" the outer variables even after the outer function returns
Inner functions can be nested inside outer functions
This is the mechanism behind function-based decorators
def outer(x):
    def inner(y):
        return x + y  # inner accesses x from outer
    return inner

add_5 = outer(5)
print(add_5(3))  # → 8 (remembers x=5)
How are closures related to decorators?
Decorators work because of closures:
The decorator's wrapper function is an inner function
It retains access to func (the decorated function) from the outer scope
When wrapper is called later, it can still call func

def decorator(func):                  # Outer function
    def wrapper(*args, **kwargs):     # Inner function
        return func(*args, **kwargs)  # Uses func (closure!)
    return wrapper
What methods must a class implement to be used as a context manager?
__enter__(self) — called when entering the with block, returns the value for the as variable
__exit__(self, exc_type, exc_val, exc_tb) — called when exiting, handles cleanup

class MyContext:
    def __enter__(self):
        # Setup code
        return self  # or any value
    def __exit__(self, exc_type, exc_val, exc_tb):
        # Cleanup code (always runs!)
        return False  # propagate exceptions
How do you create a function-based context manager using @contextmanager?
Use the @contextmanager decorator with yield:

from contextlib import contextmanager

@contextmanager
def my_context():
    # Setup code (before yield)
    print("Entering")
    try:
        yield  # or yield value for 'as'
        # Code block runs here
    finally:
        # Cleanup code (after yield)
        print("Exiting")

# Usage:
with my_context():
    print("Inside")
What's the difference between a list comprehension and a generator expression?
List comprehension ([]) — creates the entire list in memory immediately:
[x**2 for x in range(10)]  # → [0, 1, 4, 9, ...]
Generator expression (()) — creates values on demand (lazy):
(x**2 for x in range(10))  # → generator object
Benefit: Generators are memory-efficient for large datasets.
What does "lazy evaluation" mean in the context of generators?
Values are produced on demand, not all at once.
The generator doesn't compute values until you ask for them (next() or a loop)
Memory efficient — doesn't store all values
Can represent infinite sequences
# Only computes values as needed:
gen = (x**2 for x in range(1_000_000))
next(gen) # Only computes first value
How would you re-implement enumerate() and zip() using generators?
# enumerate
def my_enumerate(iterable, start=0):
    index = start
    for item in iterable:
        yield (index, item)
        index += 1

# zip (stops at the shortest iterable)
def my_zip(*iterables):
    iterators = [iter(it) for it in iterables]
    while True:
        try:
            values = [next(it) for it in iterators]
        except StopIteration:
            return
        yield tuple(values)
Show the syntax for type hints on variables and functions.
# Variable type hints
name: str = "Alice"
age: int = 25

# Function parameter and return type hints
def greet(name: str, age: int) -> str:
    return f"Hello, {name}! Age: {age}"

# Function with no return value
def print_data(data: list) -> None:
    print(data)
Does Python enforce type hints at runtime?
No! Type hints are optional annotations for documentation and static analysis only.
def add(a: int, b: int) -> int:
    return a + b

# This runs without error, even with wrong types:
add("hello", "world")  # → "helloworld"

To check types: use a static type checker like mypy, not Python itself.
What does the @dataclass decorator do? What methods does it auto-generate?
The @dataclass decorator automatically generates boilerplate methods for classes that mainly store data.
Auto-generated methods:
__init__() — constructor from field annotations
__repr__() — string representation
__eq__() — equality comparison

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

# No need to write __init__, __repr__, __eq__ manually!
p = Point(3, 4)
print(p)  # Point(x=3, y=4)
Convert this regular class to a dataclass:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

# Becomes:
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

# That's it! __init__, __repr__, __eq__ are auto-generated
Explain why NumPy is faster than Python lists. (Give 3 reasons)
Contiguous memory layout — values stored next to each other → CPU cache friendly
SIMD vectorized operations — CPU can process multiple values in one instruction
Written in C — compiled C code under the hood, not interpreted Python
# Example: NumPy is ~100x faster
import numpy as np
arr = np.arange(1_000_000)
result = arr * 2 # Uses SIMD, contiguous memory, C code
What are the downsides/limitations of NumPy arrays?
Homogeneous types — all elements must be same type (mixing types forces coercion)
Fixed size — adding/removing elements is costly (creates new array)
Integer overflow — fixed-size integers can overflow (unlike Python's arbitrary precision)
# Overflow example:
arr = np.array([127], dtype=np.int8)
arr[0] += 1
print(arr) # → [-128] ← Overflow!
Name the main NumPy data types and how to check an array's type.
Integers: int8, int16, int32, int64 (signed), uint8, uint16, uint32, uint64 (unsigned)
Floats: float16, float32, float64
Others: bool, object (avoid — loses performance!)
Check type:
arr.dtype  # → int64, float64, etc.
arr = np.array([1, 2, 3], dtype=np.int8)
print(arr.dtype)  # → int8
What's the difference between np.zeros(), np.ones(), np.full(), and np.eye()?
np.zeros(shape) — array filled with 0
np.ones(shape) — array filled with 1
np.full(shape, value) — array filled with any value
np.eye(n) — n×n identity matrix (diagonal 1s, rest 0s)
np.zeros(3)    # → [0. 0. 0.]
np.ones(3)     # → [1. 1. 1.]
np.full(3, 7)  # → [7 7 7]
np.eye(3)      # → [[1. 0. 0.]
               #    [0. 1. 0.]
               #    [0. 0. 1.]]
What are the 4 ways to index NumPy arrays?
Basic indexing: arr[1], arr[1, 2]
Slicing: arr[1:3], arr[:, 1]
Boolean indexing: arr[arr > 5] (filter by condition)
Fancy indexing: arr[[0, 2, 4]] (select specific indices)
What are the two uses of np.where()?
1. Find indices where condition is True:
indices = np.where(arr > 10)
arr[indices] # Get those values
2. Conditional replacement (like ternary operator):
# Replace values > 10 with 100, else keep original
result = np.where(arr > 10, 100, arr)
Explain NumPy's broadcasting rules. Are shapes (3, 1) and (1, 4) compatible? What about (3, 2) and (3, 3)?
3 Rules:
If different ndim, prepend 1s to smaller shape
Compatible if: same size OR one is 1 in each dimension
Result shape = max of each dimension
(3, 1) + (1, 4):
Dim 0: 3 vs 1 → compatible ✓
Dim 1: 1 vs 4 → compatible ✓
Result: (3, 4) ✅
(3, 2) + (3, 3):
Dim 0: 3 vs 3 → compatible ✓
Dim 1: 2 vs 3 → incompatible ❌ (neither is 1)
Fails! ❌
What's the difference between a view and a copy? Which operations create each?
View — references same data, changes affect original
Copy — independent data, changes don't affect original
Create VIEWS:
Slicing: arr[1:4]
Reshape: arr.reshape(2, 3)
Transpose: arr.T
Create COPIES:
.copy(): arr.copy()
Fancy indexing: arr[[0, 2, 4]]
Boolean indexing: arr[arr > 5]
arr = np.array([1, 2, 3, 4])
view = arr[1:3]
view[0] = 999
print(arr) # → [1 999 3 4] ← Original changed!
copy = arr.copy()
copy[0] = 111
print(arr) # → [1 999 3 4] ← Original unchanged
What are np.nan, np.inf, and how do you check for them?
np.nan — Not a Number (missing/undefined values)
np.inf — positive infinity
-np.inf — negative infinity (the np.NINF alias was removed in NumPy 2.0)
np.pi — π (3.14159...)
np.e — Euler's number (2.71828...)
Checking:
np.isnan(arr)     # Check for nan
np.isinf(arr)     # Check for infinity
np.isfinite(arr)  # Check for finite (not nan, not inf)
⚠️ Important:
np.nan == np.nan  # → False! Use np.isnan() instead
What do reshape(-1), expand_dims(), squeeze(), and flatten() do?
reshape(-1) — flatten to 1D OR auto-calculate one dimension
arr.reshape(-1)     # → 1D
arr.reshape(-1, 3)  # → auto-calc rows for 3 columns
expand_dims(arr, axis) — add a dimension of size 1
arr.shape: (3,) → np.expand_dims(arr, 0) → (1, 3)
squeeze() — remove dimensions of size 1
arr.shape: (1, 3, 1) → squeeze() → (3,)
flatten() — convert to 1D (always a copy)
arr_2d.flatten() → 1D array (copy, not view)
Explain vstack(), hstack(), and column_stack() with examples.
vstack() — vertical stack (stack as rows)
np.vstack([[1,2,3], [4,5,6]])
# → [[1 2 3]
#    [4 5 6]]
hstack() — horizontal stack (side by side)
np.hstack([[1,2,3], [4,5,6]])
# → [1 2 3 4 5 6]
column_stack() — stack 1D arrays as columns
np.column_stack([[1,2,3], [4,5,6]])
# → [[1 4]
#    [2 5]
#    [3 6]]
What do .reduce(), .accumulate(), and .outer() do for ufuncs? Is np.vectorize() fast?
.reduce() — apply the operation across the array → single value
np.add.reduce([1,2,3,4])  # → 10 (sum all)
.accumulate() — cumulative operation → intermediate results
np.add.accumulate([1,2,3,4])  # → [1,3,6,10]
.outer() — apply to all pairs from two arrays
np.multiply.outer([1,2], [10,20])
# → [[10,20], [20,40]]
np.vectorize() — ⚠️ NOT fast! Just a convenience wrapper (essentially a for loop), no performance benefit
What are the 4 ways to create a pd.Series? What happens to dict keys?
import pandas as pd
# 1. From list (default integer index)
pd.Series([10, 20, 30])
# 2. From dict (keys become index!)
pd.Series({'a': 10, 'b': 20})
# 3. From scalar (broadcast to all index positions)
pd.Series(5, index=['a', 'b', 'c'])
# 4. From NumPy array (view by default!)
arr = np.array([1, 2, 3])
pd.Series(arr) # view — changes to arr affect Series
pd.Series(arr, copy=True) # copy — independent
What are the 4 main ways to create a DataFrame?
# 1. Dict of lists (most common)
pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
# 2. Dict of Series (Series index → row labels)
pd.DataFrame({'col1': pd.Series([1,2], index=['a','b'])})
# 3. List of dicts (each dict = one row, missing keys → NaN)
pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3}])
# 4. 2D NumPy array (specify column names explicitly)
pd.DataFrame(np.array([[1,2],[3,4]]), columns=['A','B'])
What is pd.Index? Name 3 key properties.
pd.Index is the label system for rows and columns in Pandas.
3 key properties:
Immutable — cannot change individual elements (ensures safe sharing)
Ordered multi-set — maintains order, allows duplicate labels
Built on NumPy — backed by np.ndarray, supports NumPy-like operations
idx = pd.Index(['a', 'b', 'c'])
idx[0] = 'z'  # ❌ TypeError! Immutable
idx.values    # → numpy array underneath
What's the difference between .loc[] and .iloc[] in pandas?
.loc[] — label-based (use index labels and column names)
.iloc[] — integer-based (use 0-based positions)
Memory trick: loc = label, iloc = integer
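A small sketch (the DataFrame and its labels are illustrative). Note one subtlety: .loc slices include the endpoint, .iloc slices don't:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]}, index=['a', 'b', 'c'])

print(df.loc['b', 'age'])  # → 30 (by label)
print(df.iloc[1, 0])       # → 30 (by position)
print(df.loc['a':'b'])     # label slice — INCLUSIVE of 'b' (2 rows)
print(df.iloc[0:2])        # position slice — exclusive of index 2 (2 rows)
```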
How do you filter a DataFrame? How do you combine conditions?
Use a boolean mask: df[df['age'] > 30] (column names illustrative)
Combine conditions with & (and), | (or), ~ (not) — wrap each condition in parentheses:
df[(df['age'] > 30) & (df['city'] == 'NY')]
⚠️ Use & and |, not Python's and/or (those raise an error on a Series).
What do all() and any() do in Pandas? What does axis control?
all() — True if ALL values meet the condition
any() — True if ANY value meets the condition
axis=0 (default) — check column-wise → result per column
axis=1 — check row-wise → result per row
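A quick sketch of the axis behavior (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True], 'B': [True, False]})

print(df.all())         # axis=0: per column → A: True, B: False
print(df.all(axis=1))   # axis=1: per row → row 0: True, row 1: False
print(df.any().all())   # → True (every column has at least one True)
```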
What's the difference between unique(), nunique(), and value_counts() in pandas?
unique() — returns array of unique values (in order of appearance)
nunique() — returns the count of unique values (integer)
value_counts() — returns the frequency of each value (sorted by count)
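A short sketch on an illustrative Series:

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
print(s.unique())        # → ['b' 'a' 'c'] (order of appearance)
print(s.nunique())       # → 3
print(s.value_counts())  # b → 2, a → 1, c → 1 (sorted by count)
```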
Show how to use replace() in pandas with scalar, list, and dict. How do you drop rows vs columns?
# replace() with scalar, list, and dict (df and column names illustrative)
df.replace(0, 100)                # scalar → scalar
df.replace([1, 2], 0)             # list → scalar
df.replace({1: 'one', 2: 'two'})  # dict: per-value mapping
# Drop rows vs columns
df.drop([0, 1], axis=0)    # drop rows by index label
df.drop('col', axis=1)     # drop a column
df.drop(columns=['col'])   # equivalent, more explicit
What does apply() do? What's the difference between axis=0 and axis=1?
Applies a function to each column or row of a DataFrame.
axis=0 (default) — function receives each column → result per column
axis=1 — function receives each row → result per row
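A minimal sketch (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

print(df.apply(sum))          # axis=0: per column → A: 3, B: 30
print(df.apply(sum, axis=1))  # axis=1: per row → 11, 22
```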
Name 4 string methods available via .str accessor in Pandas.
.str.upper() / .str.lower() — change case
.str.split(sep) — split into list (use expand=True for columns)
.str.slice(start, stop) or .str[start:stop] — slice characters
.str.strip() — remove whitespace
.str.contains(pattern) — boolean mask if contains pattern
.str.replace(old, new) — replace substring
Name 5 ways to handle missing values in Pandas.
dropna() — drop rows/columns with missing values
df.dropna()          # drop rows with any NaN
df.dropna(thresh=2)  # keep rows with ≥2 non-NaN
fillna(value) — fill with a single value
df.fillna(0)
fillna(dict) — fill each column differently
df.fillna({'A': 0, 'B': 'unknown'})
Forward fill — use previous value
df.ffill()
Backward fill — use next value
df.bfill()
interpolate() — estimate from surrounding values (linear interpolation)
df.interpolate()
None vs np.nan vs pd.NA:
None → Python null, object dtype
np.nan → float, numeric columns
pd.NA → nullable integers/strings (Int64, string)
What's the difference between dtype='category' and pd.CategoricalDtype(ordered=True) in pandas? What does get_dummies() do?
dtype='category' — nominal categories (no order)
s.astype('category')  # colors, cities, etc.
pd.CategoricalDtype(ordered=True) — ordinal categories (has order)
dtype = pd.CategoricalDtype(['S','M','L','XL'], ordered=True)
s.astype(dtype)  # now S < M < L < XL, comparisons work!
pd.get_dummies() — one-hot encoding (each category → binary column)
pd.get_dummies(df['color'])
# red → [1,0,0], blue → [0,1,0], green → [0,0,1]
.cat — accessor to manage categories (add, remove, rename, reorder)
Name the 5 key parameters of pd.read_csv().
header — which row holds the column names (header=None → no header)
names — custom column names (names=['a','b','c'])
index_col — column to use as the row index (index_col='id')
usecols — load only these columns (usecols=['name','age'])
delimiter — separator character (delimiter=';')

df = pd.read_csv('data.csv',
                 header=0,
                 names=['ID', 'Name', 'Age'],
                 index_col='ID',
                 delimiter=',')

# Write back (no index!)
df.to_csv('output.csv', index=False)
Explain the split-apply-combine pattern in pandas. How do you apply multiple aggregations?
Split — divide DataFrame into groups
Apply — run function on each group
Combine — merge results back together
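A minimal sketch of the pattern with an invented toy DataFrame; passing a list to .agg() applies multiple aggregations per group:

```python
import pandas as pd

# Toy data (invented for illustration)
df = pd.DataFrame({
    "team":  ["A", "A", "B", "B"],
    "score": [10, 20, 30, 50],
})

# Split by team, apply several aggregations, combine into one table
out = df.groupby("team")["score"].agg(["mean", "max"])
print(out.loc["A", "mean"], out.loc["B", "max"])  # 15.0 50
```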
What do set_index() and reset_index() do in pandas? When does a MultiIndex appear?
set_index(col) — move column(s) to become the row index
reset_index() — move index back to regular columns
MultiIndex — multiple levels of row labels (e.g., from groupby with multiple columns)
What does pivot() do in pandas? Given this table, write the result:
df.pivot(index='name', columns='subject', values='score')
pivot() reshapes long → wide format.
index → row labels
columns → unique values become column names
values → fills the table
Result:
⚠️ Fails with duplicates → use pivot_table() instead (aggregates duplicates)
Reverse: melt() goes wide → long
long_df = wide_df.melt(id_vars='student',   # was row labels before
                       var_name='subject',  # was column names before
                       value_name='score')  # was cell values before
What does melt() do in pandas? What columns does the result always have?
melt() reshapes wide → long format (opposite of pivot()).
Result always has:
id_vars columns (unchanged)
variable column (former column names)
value column (former cell values)
df.melt(id_vars=['name'], var_name='subject', value_name='score')
Given this wide pandas DataFrame, write the result of df.melt(id_vars='name'):
id_vars='name' → name column stays
variable → old column names (math, english)
value → old cell values
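The answer above can be checked with a toy DataFrame (values invented for illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    "name":    ["Ann", "Bob"],
    "math":    [90, 70],
    "english": [80, 85],
})

long = wide.melt(id_vars="name")   # name stays; math/english get stacked
print(long.shape)                  # (4, 3): one row per (name, subject) pair
print(list(long["variable"]))      # ['math', 'math', 'english', 'english']
```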
What's the difference between concat() and merge() in pandas? Explain the 4 join types.
concat() — stacks DataFrames (no key matching needed)
merge() — joins on matching key column (like SQL JOIN)
how='inner' — only matching rows (default)
how='left' — all left rows + matches from right
how='right' — all right rows + matches from left
how='outer' — ALL rows from both, NaN where no match
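A toy sketch of the join types (invented data): only id 2 appears in both frames, so inner keeps one row while outer keeps all three ids:

```python
import pandas as pd

left  = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"id": [2, 3], "y": ["c", "d"]})

inner = pd.merge(left, right, on="id", how="inner")  # only id 2
outer = pd.merge(left, right, on="id", how="outer")  # ids 1, 2, 3 with NaN gaps
print(len(inner), len(outer))  # 1 3
```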
What are the 4 differences between primary and secondary biological databases? Give examples of each.
Data: Primary = raw/unprocessed; Secondary = curated/organized
Release: Primary = frequent; Secondary = infrequent
Funding: Primary = institutional; Secondary = project-based
Examples: Primary = GenBank, UniProt, PDB; Secondary = PFAM, CATH/SCOP, PROSITE
Memory trick:
Primary = Plain raw data, Pouring in constantly
Secondary = Sorted, Selectively updated
Name the 3 primary biological databases, what they contain, and their approximate sizes.
GenBank — DNA/RNA sequences — ~10⁹ sequences, ~10¹² residues
UniProtKB — protein sequences — SwissProt: ~5×10⁵ (curated) + TrEMBL: ~10⁸ (uncurated)
PDB — protein 3D structures — ~200,000 structures
UniProt key distinction:
SwissProt = manually curated, small, high quality
TrEMBL = auto-annotated, huge, lower quality
Describe the GenBank flat file format. What are the mandatory fields? How does a record start/end?
Format rules:
Fixed 80 columns wide
Keywords in cols 1–10, sub-keywords in cols 3–4, values in cols 13–80
Mandatory fields:
LOCUS — name, length, type, date
ACCESSION — unique stable ID
VERSION — accession + version number
ORIGIN — the actual sequence
Optional with sub-records:
REFERENCE → AUTHORS, TITLE, JOURNAL
SOURCE → ORGANISM
FEATURES → gene, CDS, etc.
Record boundaries:
Starts with: LOCUS
Ends with: //
What does the VERSION field track? Can it uniquely identify a version of an entry?
VERSION increments only when the sequence changes
Annotation changes (references, features, organism) do NOT change the version
Therefore: NO — VERSION cannot uniquely identify a specific state of an entry, because the same version number can have different annotations at different points in time.
NM_000518.5 → sequence unchanged
→ but annotations may differ over time!
Name the 4 main E-utilities and describe the standard pipeline.
esearch — search database → returns list of IDs
efetch — download records by ID
einfo — info about databases and searchable fields
elink — find related records across databases
Standard pipeline:
esearch → get IDs → efetch → download records
Name the 6 steps of SwissProt manual curation.
Sequence curation — verify and clean the sequence
Sequence analysis — run tools, identify domains/features
Literature curation — extract data from publications
Family-based curation — propagate annotations from related proteins
Evidence attribution — tag each fact with its evidence type
Quality assurance — second review + automated checks
Key point: Every annotation has an evidence tag (experimental / by similarity / predicted) — this is what makes SwissProt trustworthy! 🎯
Compare X-ray, NMR, and Cryo-EM for protein structure determination.
X-ray Crystallography (most common in PDB):
Protein must form crystals → X-rays diffract → 3D electron density map
Gives one single structure
Quality: resolution (Å, lower = better; <2Å excellent) + R-value (lower = better fit, ~0.20 good)
NMR Spectroscopy:
Protein in solution (no crystals needed)
Measures distances between atoms via magnetic field
Gives an ensemble of 10–30 similar structures (not one!) — spread shows flexibility
Limited to small proteins (<50 kDa)
Cryo-EM:
Protein flash-frozen in ice (no crystals needed)
Electron beam + 2D images from many angles → 3D reconstruction
Works for large complexes (ribosomes, viruses, membrane proteins)
Nobel Prize 2017
What does CATH stand for? Describe each level.
C — Class — secondary structure content (α, β, αβ)
A — Architecture — overall 3D shape of secondary structures
T — Topology — shape + connectivity between elements
H — Homologous superfamily — common evolutionary ancestor
Direction: Broad (Class) → Specific (Homologous)
Memory: Cats Are Totally Homologous 🐱
What is PFAM and what is a seed-MSA?
PFAM — secondary database of protein families/domains derived from UniProt.
Seed-MSA (Seed Multiple Sequence Alignment):
A small, manually curated alignment of representative sequences for a family
Used to build an HMM profile (Hidden Markov Model)
HMM then searches all of UniProt to find all family members
Process:
Seed sequences → Seed alignment → HMM profile → Search UniProt → Full family
Each PFAM entry has: seed alignment, full alignment, HMM profile, PDB links.
Name the 4 main BioPython modules and their purpose.
Bio.Entrez — access NCBI databases (esearch, efetch)
Bio.SeqIO — read/write sequence files (GenBank, FASTA)
Bio.Seq — work with sequences (complement, transcribe, translate)
Bio.PDB — parse and analyze PDB structure files
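A pure-Python sketch of what Bio.Seq does under the hood. The helper functions and the tiny codon table here are illustrative stand-ins, not the real API (which is Seq('...').complement(), .transcribe(), .translate()):

```python
# Illustrative stand-ins for Bio.Seq operations (not the real API)
COMPLEMENT = str.maketrans("ACGT", "TGCA")
CODON_TABLE = {"ATG": "M", "TTT": "F", "AAA": "K", "TAA": "*"}  # tiny subset

def complement(dna: str) -> str:
    # Base-pair each nucleotide: A<->T, G<->C
    return dna.translate(COMPLEMENT)

def transcribe(dna: str) -> str:
    # Coding-strand DNA -> mRNA: just T -> U
    return dna.replace("T", "U")

def translate(dna: str) -> str:
    # Read the sequence in codon triplets; unknown codons become 'X' here
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

print(complement("ATGC"))      # TACG
print(transcribe("ATGC"))      # AUGC
print(translate("ATGTTTAAA"))  # MFK
```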
Give 2 reasons why NoSQL databases were developed.
Horizontal scaling — SQL databases scale vertically (bigger server), which is expensive. NoSQL scales horizontally (more servers), which is cheaper and more flexible.
Semi-structured data — SQL requires a fixed schema (same columns every row). NoSQL handles flexible, nested, or varying data structures naturally (e.g. JSON documents with different fields per record).
Name the 4 NoSQL categories with an example each and their use case.
Key-Value (Redis) — key → value — caching, sessions
Wide Column (Cassandra) — variable columns per row — IoT, time-series, logs
Document (MongoDB) — JSON/YAML documents — profiles, content
Graph (Neo4j) — nodes + relationships — social networks, recommendations
Memory trick: Koalas Watch Dark Grey movies
What does CAP stand for? What does the theorem state?
C — Consistency: All nodes see same data after any write
A — Availability: System always responds within acceptable time
P — Partition Tolerance: System works even if network between nodes fails
Theorem: In a distributed system, you can only fully satisfy 2 out of 3 simultaneously.
CP — sacrifices Availability (MongoDB, Redis)
AP — sacrifices Consistency (Cassandra, DynamoDB)
CA — sacrifices Partition tolerance (traditional SQL on a single machine)
In practice, Partition Tolerance is non-negotiable for distributed systems — you always have network failures. So the real choice is always CP vs AP! 🎯
What does BASE stand for? How does it differ from ACID?
B — Basically Available: System always responds, even with stale data
S — Soft State: System state may change over time as nodes sync (no instant consistency required)
E — Eventually Consistent: All nodes will converge to the same value given enough time without new writes
ACID vs BASE:
Always consistent vs eventually consistent
Strict transactions vs flexible updates
Hard to scale vs easy to scale
PostgreSQL, MySQL vs Cassandra, MongoDB
What's the difference between pessimistic (ACID) and optimistic (BASE) concurrency?
Pessimistic (ACID/SQL):
Assumes conflicts will happen → locks data before accessing
Other users must wait until lock is released
Safe but creates bottlenecks at scale
Optimistic (BASE/NoSQL):
Assumes conflicts are rare → no locks, everyone reads/writes freely
If conflict detected → resolve after the fact (e.g. last write wins)
Fast and scalable, trades strict safety for performance
Pessimistic: 🔒 Lock → Read/Write → Unlock → next person
Optimistic: Read/Write freely → detect conflict → resolve
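The optimistic flow can be sketched as a version check before each write (all class and method names here are invented for illustration):

```python
class Conflict(Exception):
    pass

class Store:
    """Single record with a version counter (optimistic concurrency)."""
    def __init__(self):
        self.value, self.version = None, 0

    def read(self):
        return self.value, self.version

    def write(self, value, expected_version):
        # Reject the write if someone else wrote since our read
        if self.version != expected_version:
            raise Conflict("stale read, retry")
        self.value, self.version = value, self.version + 1

store = Store()
_, v = store.read()
store.write("A", v)          # ok: version 0 -> 1
try:
    store.write("B", v)      # still holds stale version 0 -> conflict
except Conflict:
    _, v = store.read()      # re-read the current state...
    store.write("B", v)      # ...and retry
print(store.value, store.version)  # B 2
```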
What's the difference between horizontal and vertical scaling? Which does NoSQL use?
Vertical (scale up): More RAM/CPU on one machine — limited by hardware ceiling, expensive, single point of failure
Horizontal (scale out): More machines added — theoretically unlimited, cheap, no single point of failure
NoSQL is designed for horizontal scaling because:
No JOINs across tables → data can live on different servers
Flexible schema → easy to partition/shard data
Eventual consistency → nodes work independently
Vertical: [💻 BIG]
Horizontal: [💻][💻][💻][💻][💻] ← NoSQL ✅
What is MVCC and how does conflict resolution work?
MVCC (Multiversion Concurrency Control) — instead of locking, each write creates a new version of the data. Old versions remain readable.
Benefit: Readers never block writers, writers never block readers.
Conflict resolution:
Both users read version v2
User A writes → creates v3 ✅
User B tries to write v2 → system detects v3 already exists → conflict!
User B must re-read v3 and retry
What are vector clocks and what are they used for?
Vector clock = list of (node_id, counter) pairs, one counter per node in the system.
Purpose: Track causality between events across distributed nodes (wall-clock time is unreliable).
Rules:
Each write increments your own counter
When receiving a message, take the max of each counter + increment your own
Conflict detection:
If clock A ≤ clock B on all positions → A happened before B
If neither is ≤ the other → concurrent writes → conflict!
A:[2,1,0] vs B:[1,2,1] → conflict!
A:[1,0,0] vs B:[2,1,0] → A happened before B
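The two comparison rules above, sketched for fixed-size clocks (plain lists of counters, one slot per node):

```python
def happened_before(a, b):
    # a happened before b if a <= b on every position (and they differ)
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    # Neither clock dominates the other -> concurrent writes -> conflict
    return a != b and not happened_before(a, b) and not happened_before(b, a)

print(happened_before([1, 0, 0], [2, 1, 0]))  # True: A happened before B
print(concurrent([2, 1, 0], [1, 2, 1]))       # True: conflict!
```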
What are the components of the property graph model?
Nodes — unique ID, labels (type), properties (key-value)
Edges — unique ID, label, direction (→), properties (key-value)
Node: (id:1, label:Person, name:"Alice", age:25)
Edge: (id:101, from:1, to:2, label:FRIENDS_WITH, since:2020)
Name 4 graph representations and their main trade-off.
Adjacency Matrix — grid, O(1) lookup, wastes O(n²) memory for sparse graphs
Incidence Matrix — nodes×edges grid, good for edge analysis, rarely practical
Edge List — just a list of pairs, minimal memory, slow neighbor lookup
Adjacency List — node → list of neighbors, best balance for sparse graphs
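The trade-off is easy to see by building the same tiny directed graph (edges invented) in three of the representations:

```python
edges = [(0, 1), (0, 2), (1, 2)]   # edge list: minimal memory, slow lookups

n = 3
# Adjacency matrix: O(1) edge test, but O(n^2) memory even when sparse
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = 1

# Adjacency list: node -> neighbors, the usual choice for sparse graphs
adj = {u: [] for u in range(n)}
for u, v in edges:
    adj[u].append(v)

print(matrix[0][1])  # 1 (edge 0 -> 1 exists)
print(adj[0])        # [1, 2]
```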
Write the Gremlin syntax for: get all vertices, filter by property, follow edges, access property.
g.V() — get all vertices
g.V().has('name', 'Alice') — filter by property
g.V().out('FRIENDS_WITH') — follow outgoing edges
g.V().values('age') — access a property's values
Compare Cypher, Gremlin, and SPARQL.
Cypher — declarative pattern matching for property graphs (Neo4j)
Gremlin — imperative, step-by-step traversal language (Apache TinkerPop)
SPARQL — declarative query language over RDF triples (W3C standard)
What does CRUD stand for? Name the corresponding SQL commands.
C — Create → INSERT
R — Read → SELECT
U — Update → UPDATE
D — Delete → DELETE
CRUD = the minimum set of access functions any data system must provide. 🎯
What is REST? Describe the 4 HTTP verbs and their CRUD mapping.
REST = verb (action) applied to noun (URL resource)
GET — Retrieve — Read
POST — Create new — Create
PUT — Replace/update — Update
DELETE — Remove — Delete
What do map() and reduce() do?
map(func, list) — apply function to every element (returns a lazy iterator in Python 3)
list(map(lambda x: x**2, [1,2,3]))  # → [1, 4, 9]
reduce(func, list) — aggregate all elements into one value
from functools import reduce
reduce(lambda acc, x: acc + x, [1,2,3,4,5])  # → 15
Together: Map transforms, reduce aggregates:
total = reduce(lambda a,b: a+b, map(lambda x: x*0.9, prices))