What are the three main hazard dimensions for consistency in distributed systems?
Semantic Integrity
Violated when atomicity fails (e.g., partial transactions)
Operational Integrity
Problems from concurrent users (e.g., lost updates, dirty reads)
Replication Integrity
Arises with multiple replicas needing synchronization
Risks: read/write discrepancies, stale (outdated) data
Ideal case: 1 user, 1 operation, 1 data copy → no inconsistency
But this is rarely realistic in distributed environments.
How is atomicity achieved in distributed databases and what are the challenges?
Atomicity ensures all parts of a transaction commit or none do.
In distributed systems, this requires a distributed commit protocol:
2 Phase Commit (2PC):
Central coordinator manages the commit
All nodes must agree to commit
Disadvantages: performance bottleneck, single point of failure
Paxos:
Fully distributed consensus protocol
More fault-tolerant, but complex and costly
Note: 2PC is not the same as 2PL (Two-Phase Locking).
Also, locks must be held for the full duration of the protocol.
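The 2PC flow above can be sketched in a few lines (a toy illustration, not any specific database's implementation; the `Participant` class and its methods are hypothetical names): the coordinator commits only if every node votes yes in the prepare phase, otherwise everyone aborts.

```python
# Minimal sketch of Two-Phase Commit: prepare (voting) phase, then
# commit/abort phase. All names here are illustrative.

def two_phase_commit(participants):
    # Phase 1 (prepare): ask every node whether it can commit.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit/abort): commit only on unanimous agreement.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"
    def prepare(self):
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

print(two_phase_commit([Participant(True), Participant(True)]))   # committed
print(two_phase_commit([Participant(True), Participant(False)]))  # aborted
```

Note how a single "no" vote (or an unreachable node) forces a global abort, and how everything funnels through the coordinator, which is exactly the bottleneck and single point of failure named above.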
How is atomicity handled in NoSQL systems?
Transactions in NoSQL ≈ single operation on a single key-value pair or aggregate.
No atomicity support for:
Multiple operations
Multiple aggregates
Semantic consistency can only be ensured within a single aggregate.
For operations across aggregates, atomicity must be handled at the application level.
Exception: Some Graph DBs support multi-operation atomicity.
What is the 2-Phase Locking (2PL) protocol and how does it help with concurrency in RDBMS?
Serializability = the final result is as if the operations had run one after another, even when multiple users access and modify data concurrently
2PL ensures serializability by dividing a transaction into:
Growing phase: locks are acquired, but not released.
Shrinking phase: locks are released, but no new locks can be acquired.
It prevents concurrency issues by enforcing a serial order of operations across transactions. Two types of protocols:
Pessimistic control (prevention): via locking.
Optimistic control (resolution): via timestamp ordering.
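The two-phase rule can be captured in a short sketch (illustrative class, not a real lock manager): once any lock is released, the transaction enters its shrinking phase and may not acquire new locks.

```python
# Toy sketch of the 2PL rule: growing phase (acquire only),
# then shrinking phase (release only).

class TwoPhaseLockingTx:
    def __init__(self):
        self.held = set()
        self.shrinking = False  # flips to True on the first release

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after a release")
        self.held.add(item)

    def release(self, item):
        self.shrinking = True
        self.held.discard(item)

tx = TwoPhaseLockingTx()
tx.acquire("x")
tx.acquire("y")   # growing phase: allowed
tx.release("x")   # shrinking phase begins
# tx.acquire("z") would now raise RuntimeError
```

This ordering of lock points is what guarantees a serializable schedule across concurrent transactions.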
How does NoSQL handle concurrency and isolation compared to RDBMS?
NoSQL DBs:
Often do not guarantee isolation across multiple aggregates.
Avoid 2PL because it’s expensive in distributed systems.
Rely on optimistic concurrency control.
Expect the application to handle inconsistencies.
What are typical hazards caused by asynchronous update propagation in replicated systems?
Three major hazards due to asynchronous update propagation:
Read/Write Discrepancies
R and W of the same process may hit different replicas
The process might not see its own write in a later read
Non-monotonic Read (R)
Sequential reads from the same process may access different replicas
A later read might return an older value than a previous one
Non-monotonic Write (W)
Sequential writes may be sent to different replicas
A later write might be overwritten by an earlier one
Root Cause: All are caused by asynchronous update propagation in distributed/replicated storage
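The first hazard can be shown with two plain dictionaries standing in for replicas (a toy simulation, no real database involved): the write lands on one replica, and a later read routed to the other replica misses it.

```python
# Toy illustration of a read/write discrepancy under asynchronous
# update propagation: two dicts play the role of two replicas.

replica_a = {"x": 1}
replica_b = {"x": 1}

replica_a["x"] = 2          # the process writes x=2, routed to replica A
value = replica_b["x"]      # its later read is routed to replica B
print(value)                # 1 -- the process does not see its own write
```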
What is strong consistency (aka linearizability) in distributed systems?
What does the strong consistency requirement imply?
Why is strong consistency difficult to achieve in practice?
Strong consistency ensures that all processes always see the same version of a value. Once a value is written and committed, any subsequent read must return that latest value, regardless of which replica is accessed.
It implies that the system behaves as if there were only a single copy of the data, making every update instantly visible to all read operations.
Because it requires instant propagation of updates to all replicas, which is nearly impossible in real-world distributed systems due to network latency, partitions, and asynchronous replication.
What makes a system linearizable?
Ensures all operations appear to take effect in a single, global order.
A read concurrent with a write must return the new value.
All processes observe the same committed value, regardless of which replica they access.
Emulates a single copy of data across the system.
Important for preventing read/write inconsistencies in distributed databases.
What is ROWA and how does it ensure strong consistency?
ROWA = Read One, Write All strategy.
All replicas must be synchronously updated before confirming a write.
Any read from one replica returns the latest value.
Guarantees strong consistency across replicas.
Often implemented via 2PC or Paxos protocols.
Main drawback: high write latency due to full synchronization.
What is the ROWA alternative and how does it work?
Only a subset of replicas (e.g. 2 out of 4) are updated synchronously.
Reads must query a larger subset (e.g. at least 3 out of 4 replicas).
Ensures at least one up-to-date replica is included in the read.
Improves performance by reducing write coordination.
Requires use of timestamps or version numbers to identify the latest value during reads.
Balances consistency and availability in distributed systems.
What is the Quorum protocol and how does it ensure strong consistency?
A quorum is the minimum number of replicas that must acknowledge a read or write.
Condition for strong consistency:
W + R > N, where
N = total number of replicas
W = replicas required to acknowledge a write
R = replicas contacted for a read
This condition ensures overlap between write and read sets.
Guarantees that at least one replica read is up-to-date (has the latest version).
Example:
If N = 4 and W = 2, then R must be at least 3 (2 + 3 > 4).
Quorum protocol helps relax ROWA, allowing better response times while maintaining consistency.
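The overlap argument behind W + R > N can be verified by brute force over all possible write and read sets (a small sketch, not a production quorum implementation):

```python
# Check that every write quorum of size w intersects every read quorum
# of size r over n replicas -- i.e., that w + r > n guarantees overlap.

from itertools import combinations

def quorums_always_overlap(n, w, r):
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# The example above: N = 4, W = 2 requires R >= 3.
print(quorums_always_overlap(4, 2, 3))  # True  (2 + 3 > 4)
print(quorums_always_overlap(4, 2, 2))  # False (2 + 2 = 4: sets can be disjoint)
```

With W + R > N every read quorum necessarily contains at least one replica from the last write quorum, which is why one read replica is always up to date.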
Compare Serializability vs Linearizability
Definition
Serializability: a schedule is serializable if its outcome is equivalent to some serial execution (but the order may differ from real time).
Linearizability (strong consistency): operations appear to occur in a total order that respects real-time (wall-clock) ordering.
Focus
Serializability: correctness of concurrent transactions.
Linearizability: real-time behavior of read/write operations.
Time awareness
Serializability: no real-time guarantees, only logical correctness.
Linearizability: real-time aware; newer operations must be reflected immediately.
Use case
Serializability: common in traditional RDBMS (e.g., 2PL ensures it).
Linearizability: required in strongly consistent distributed systems.
Example
Serializability: it is okay if A's write becomes visible after B's write, even if A happened before B, as long as the result matches some serial schedule.
Linearizability: if A's write happened before B's read in real time, B must see A's write.
Cost
Serializability: lower (especially in centralized systems).
Linearizability: higher (requires coordination such as quorums, 2PC, or Paxos).
Concurrency control
Serializability: often enforced with Two-Phase Locking (2PL).
Linearizability: enforced with quorum-based protocols, ROWA, Paxos, etc.
Serializability: cares about logical correctness – the order of execution doesn’t have to match real time.
Linearizability: adds a real-time constraint – the system behaves as if there’s a single global clock and a single up-to-date copy of data.
What does the CAP theorem state in the context of distributed systems?
C: Consistency – All clients see the same data at the same time.
A: Availability – System continues to operate even if some nodes fail.
P: Partition Tolerance – System continues operating despite network or message failures.
Only two out of the three can be guaranteed at any one time.
How are NoSQL systems categorized using the CAP Theorem?
CA systems: Prioritize Consistency + Availability (e.g., RDBMS like Greenplum).
CP systems: Prioritize Consistency + Partition Tolerance (e.g., MongoDB, Redis).
AP systems: Prioritize Availability + Partition Tolerance (e.g., Dynamo, Cassandra).
No system provides all three (C, A, P) at once.
Trade-offs are based on application needs and failure assumptions.
What are BASE properties, and how do they relate to CAP?
BASE = “Basically Available, Soft State, Eventually Consistent”
Basically Available: System responds to every request (partial consistency allowed).
Soft State: System state may change over time, even without input.
Eventually Consistent: System becomes consistent once updates are fully propagated.
BASE is often seen in AP systems.
Contrasts with ACID (strong consistency, transactional guarantees).
BASE emphasizes liveness over safety.
How do eventual and strong consistency differ?
Strong consistency requires read and write quorums to overlap (W + R > N); eventual consistency does not.
Eventual consistency does not guarantee that every process sees the same version of a value; replicas converge only after updates have fully propagated.
What is a vector clock (VC)?
A data structure tracking causality in distributed systems
Each node has a counter → together form a vector
A versioned value is stored as (value, VC)
Used to determine which update happened before, after, or concurrently
How does a node update its vector clock when sending or receiving an operation?
Sending:
Increment its own clock by 1
Send current VC with the operation
Receiving:
For each VC entry: take the entry-wise max of the local and received counters
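The two update rules above can be sketched with plain dicts mapping node names to counters (illustrative helper names, not a specific library's API):

```python
# Minimal vector-clock sketch: increment your own entry when sending,
# take the entry-wise max when receiving.

def vc_send(vc, node):
    vc = dict(vc)                       # copy, then bump our own counter
    vc[node] = vc.get(node, 0) + 1
    return vc

def vc_receive(local, received):
    merged = dict(local)                # entry-wise max over both clocks
    for node, count in received.items():
        merged[node] = max(merged.get(node, 0), count)
    return merged

a = vc_send({}, "A")            # {'A': 1}
b = vc_receive({"B": 2}, a)     # {'B': 2, 'A': 1}
print(a, b)
```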
How are two vector clocks compared?
Equal: VCs are identical → in sync
One VC ≥ the other in every entry (and > in at least one): → one version is more recent (syntactic conflict → auto-resolution possible)
Neither dominates the other: → concurrent/conflicting versions (semantic conflict → manual/app-level resolution needed, i.e., semantic reconciliation)
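The comparison rule reduces to two entry-wise checks (a small sketch with illustrative return labels): if one clock is ≥ the other everywhere it dominates; if neither dominates, the versions are concurrent.

```python
# Compare two vector clocks (dicts of node -> counter). Missing entries
# count as 0.

def vc_compare(vc1, vc2):
    nodes = set(vc1) | set(vc2)
    ge = all(vc1.get(n, 0) >= vc2.get(n, 0) for n in nodes)
    le = all(vc1.get(n, 0) <= vc2.get(n, 0) for n in nodes)
    if ge and le:
        return "equal"        # in sync
    if ge:
        return "vc1 newer"    # syntactic conflict: auto-resolvable
    if le:
        return "vc2 newer"
    return "concurrent"       # semantic conflict: app-level reconciliation

print(vc_compare({"A": 1, "B": 1}, {"A": 1, "B": 0}))  # vc1 newer
print(vc_compare({"A": 1, "B": 0}, {"A": 0, "B": 1}))  # concurrent
```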
What’s the difference between syntactic and semantic conflict in vector clocks?
Syntactic Conflict:
One VC dominates the other (e.g., [A:1, B:1] > [A:1, B:0])
Can be auto-resolved
Semantic Conflict:
VCs are incomparable (e.g., [A:1, B:0] vs. [A:0, B:1])
Requires manual or app-level reconciliation
What are Hash Histories and how do they differ from vector clocks?
Track versions using a hash value and a revision number
Conflict detected if same revision number but different hash
Independent of number of nodes
Grow with number of updates
Lighter than vector clocks, which grow with number of writing processes
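The detection rule can be sketched as follows (illustrative helpers; the revision numbers and payloads are made up, and `hashlib` is just used to stand in for the version hash): two versions with the same revision number but different hashes are conflicting.

```python
# Hash-history sketch: a version is a (revision, hash) pair.
# Same revision + different hash => concurrent, divergent updates.

import hashlib

def version(rev, value):
    return (rev, hashlib.sha256(value.encode()).hexdigest())

def conflicting(v1, v2):
    return v1[0] == v2[0] and v1[1] != v2[1]

a = version(3, "balance=100")
b = version(3, "balance=90")   # same revision, different content
c = version(3, "balance=100")

print(conflicting(a, b))  # True  -- divergent updates at the same revision
print(conflicting(a, c))  # False -- identical version
```

Note that the structure holds only (revision, hash) pairs regardless of how many nodes exist, which is the independence-from-node-count property named above.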
What is serializability?
A schedule is serializable if its outcome is equivalent to some serial execution of the transactions, i.e., as if they had run one after another.