Why NoSQL
Lack of flexibility/rigid schemas
web data is mostly unstructured and needs only simple operations, e.g., editing profiles: reads and writes on a single record
Web apps evolve frequently -> relational schemas may be too restrictive
NoSQL allows freedom & flexibility (heterogeneity) and avoids schema evolution
Lack of scalability
distribution of data & load needed
Scale up -> vertically (add resources, e.g., RAM)
Scale out -> horizontally (add nodes in a distributed system); this is what NoSQL uses
Cost
many NoSQL systems are open source -> no licensing costs compared to RDBMS
scaling out with commodity servers is cheaper than scaling up
Name the 3 Vs
Volume - Restricted Scalability
Velocity - High Latency / Low Performance
Variety - Rigid Schemas
Name and briefly explain the 4 different kinds of data models
Key-value
keys & corresponding values (hash tables); interface via CRUD
High performance; scalability by simple partitioning
No schema, restricted query potential; relations cannot be modelled directly
Column family/wide column stores
data is organized into column families: rows hold a variable set of columns, and columns that are accessed together are grouped and stored together (e.g., Cassandra, HBase)
Document
semi-structured documents (e.g., JSON) are stored and retrieved by key
schema-free; the application is responsible for the schema
redundancy due to missing normal forms (no joins)
Graph-based
nodes and edges represent entities and their relationships; efficient traversal of relationships (e.g., social networks)
How does NoSQL data modelling differ from relational modelling?
Relational model: Uses normalized tables (rows & columns) → complex records are split
NoSQL model: Aggregate-oriented → denormalization
Aggregates:
Collections of related objects identified by a key
Treated as a unit for retrieval, manipulation, and consistency
Convenient for manipulation in JSON-like structures (e.g., embedded user + address data)
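A minimal sketch of such an aggregate as a JSON-like Python dict (field names and values are illustrative assumptions, not from a specific system):

```python
# One aggregate: user and address travel together as a unit under one key.
user_aggregate = {
    "_id": "user:42",      # the aggregate's key
    "name": "Ada Lovelace",
    "address": {           # embedded instead of a separate, normalized table
        "street": "Main St 1",
        "city": "London",
    },
}
```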
How does modelling one-to-one relationships differ between relational and NoSQL when fetching the data in combination?
For relational model:
only 1 query (a join) is needed to fetch the data from both tables
For NoSQL (document)
When normalized via reference: 2 queries needed -> join must be done in application
When denormalized via embedding: 1 query needed
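A minimal sketch of the two document-store options, with in-memory dicts standing in for collections (all names are assumptions):

```python
# Normalized via reference: user and address live in separate "collections".
users = {"u1": {"name": "Ada", "address_id": "a1"}}
addresses = {"a1": {"city": "London"}}

user = users["u1"]                       # query 1
address = addresses[user["address_id"]]  # query 2: the "join" happens in the app

# Denormalized via embedding: one lookup returns the full aggregate.
users_embedded = {"u1": {"name": "Ada", "address": {"city": "London"}}}
user_with_address = users_embedded["u1"]  # single query
```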
How does modelling one-to-many relationships differ between relational and NoSQL when fetching the data in combination?
For relational data model:
To fetch data in combination, 1 query needed (join via foreign key)
referential integrity assured
For NoSQL (document):
no referential integrity assured
queries needed depend on the structure of the document
normalized via reference -> 1+n queries needed to fetch data in combination; join to be done in the app
normalized via array of references -> 1+n queries needed to fetch data in combination; join to be done in the app
Denormalized via embedding -> 1 query needed to fetch data in combination
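A sketch of the 1:n case (one user, n orders), again with dicts standing in for collections; all names are assumptions:

```python
# Normalized via an array of references: 1 query for the user + n for the orders.
users = {"u1": {"name": "Ada", "order_ids": ["o1", "o2"]}}
orders = {"o1": {"total": 10}, "o2": {"total": 20}}

user = users["u1"]                                        # 1 query
user_orders = [orders[oid] for oid in user["order_ids"]]  # n queries, join in app

# Denormalized via embedding: 1 query returns the user with all orders.
users_embedded = {"u1": {"name": "Ada",
                         "orders": [{"total": 10}, {"total": 20}]}}
```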
How does modelling many-to-many relationships differ between relational and NoSQL when fetching the data in combination?
For relational data model:
1 query needed, joining both tables via a junction table; referential integrity assured
For NoSQL model:
normalized via arrays of references -> (1+n) resp. (1+m) queries needed to fetch data in combination (join in app)
denormalized via embedding -> (1+1) queries needed to fetch data in combination
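A sketch of the m:n case (students and courses referencing each other via arrays); fetching one student's courses costs 1+n queries, the reverse direction 1+m (all names are assumptions):

```python
# m:n normalized via arrays of references on both sides.
students = {"s1": {"name": "Ada", "course_ids": ["c1", "c2"]}}
courses = {"c1": {"title": "DB", "student_ids": ["s1"]},
           "c2": {"title": "ML", "student_ids": ["s1"]}}

student = students["s1"]                                    # 1 query
enrolled = [courses[cid] for cid in student["course_ids"]]  # + n queries (join in app)
```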
When should you use referencing vs. embedding in NoSQL data modeling?
Use Referencing when:
Objects are first-class and stored in separate collections
You model many-to-many relationships
You need complex queries (e.g., pagination)
You deal with large objects (e.g., >16MB in MongoDB)
Use Embedding when:
The object is not referenced by others
Simpler structure and faster access is preferred
What do NoSQL systems offer and not offer in relationship modeling?
Offer:
Multiple ways to model relationships
Trade-offs in query performance and redundancy
Do not offer:
Referential integrity
Joins (avoided using aggregates / denormalization)
How does relationship modeling differ between Relational and NoSQL models?
Relational | NoSQL
Query-agnostic ("aggregate-ignorant") | Query-specific
Relationships modeled via keys | Focus on jointly accessed aggregates
All queries supported | Other queries harder
Focus: What data do we have? | Focus: What questions do we ask?
Consistency-driven (normalized) | Workload-driven (denormalized)
Uses functional dependencies | Minimize aggregates to access
Optimized for updates | Optimized for reads
General-purpose | Domain-/application-specific
What are best practices for NoSQL data modeling?
Do:
Know how the data is intended to be used (domain-specific)
Determine access patterns before designing schemas
Optimize for common use cases (e.g., 20% of queries cause 80% of load)
Always cap the amount of data you fetch (e.g., use pagination like "next 20 results")
Don't:
Build the DB model first and only later decide on querying needs
Fall into the "n+1" trap: multiple calls to load related entities inefficiently (see the sketch below)
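A sketch of the n+1 trap against an in-memory stand-in for a database (all names and data are illustrative):

```python
users = {1: {"name": "Ada", "post_ids": [10, 11, 12]}}
posts = {10: {"title": "a"}, 11: {"title": "b"}, 12: {"title": "c"}}

def load_profile_n_plus_one(user_id):
    user = dict(users[user_id])                               # 1 call for the user
    user["posts"] = [posts[pid] for pid in user["post_ids"]]  # + n calls, one per post
    return user

def load_profile_batched(user_id, limit=20):
    user = dict(users[user_id])                               # 1 call for the user
    # one batched call instead of n single lookups, capped like "next 20 results"
    wanted = set(user["post_ids"][:limit])
    user["posts"] = [p for pid, p in posts.items() if pid in wanted]
    return user
```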
What is the overall goal of scalability in database systems?
Maintain a consistent performance level (throughput, latency, response time)
As system load increases (e.g., DB size, read/write ratio, active users, cache hit rate)
By adding resources appropriately:
Scaling Up: stronger CPU, more storage
Scaling Out: distribute data/load across nodes
Focus: Use Partitioning (e.g., sharding) and/or Replication to scale effectively
Explain the basic principle of hash-based partitioning and how it specifically works
Objects pass through a hash function, and the result determines which node they are allocated to
Aggregates with similar fragment keys may end up on different nodes
Specific:
Keys of nodes and objects (e.g., IPs, primary keys) are hashed to a number range using:
a hash function
a modulo operation: H(key) mod L → range [0, L-1]
The key space is visualized as a ring
Each value is assigned to the next node that follows its position clockwise on the ring
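A minimal sketch of such a hash ring in Python; the node names, the choice of hash function, and L are assumptions:

```python
import hashlib

L = 2**16  # size of the key space [0, L-1]

def h(key: str) -> int:
    """H(key) mod L: map any key into the ring's number range."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % L

# Each node is hashed to one position on the ring.
nodes = {name: h(name) for name in ["node-a", "node-b", "node-c"]}

def lookup(key: str) -> str:
    """Walk clockwise from the key's position to the next node."""
    pos = h(key)
    for name, node_pos in sorted(nodes.items(), key=lambda kv: kv[1]):
        if node_pos >= pos:
            return name
    return min(nodes, key=nodes.get)  # wrap around to the first node on the ring
```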
What problem do virtual nodes solve in hash-based partitioning, and how?
Hashing distributes keys well only with many servers.
With few servers, key distribution can become uneven (some servers overloaded).
Solution: Use multiple virtual nodes (VNodes) per physical node.
Each virtual node is assigned a different position on the hash ring.
This improves load balancing and gives more flexibility in key distribution.
Common in systems like Riak and Voldemort; the number of VNodes can be scaled to each server's capacity.
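Continuing the ring sketch above (reusing h), each physical node is hashed to several positions; the VNode count is an assumption:

```python
VNODES = 8  # vnodes per physical node; could be scaled to server capacity

ring = {}
for name in ["node-a", "node-b", "node-c"]:
    for i in range(VNODES):
        ring[h(f"{name}#vnode{i}")] = name  # vnode position -> physical node

def lookup_vnode(key: str) -> str:
    pos = h(key)
    for p in sorted(ring):  # positions in clockwise order
        if p >= pos:
            return ring[p]
    return ring[min(ring)]  # wrap around the ring
```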
What happens during reorganization in hash-based partitioning when nodes join or leave?
When a node leaves, its data range is taken over by the next node clockwise.
When a node joins, existing nodes give up part of their range to it.
Reorganization requires that data is replicated to ensure availability.
Systems like Cassandra use tools like Cassandra-Shuffle to:
Reassign virtual nodes
Reschedule and execute data transfers in a 2-phase process
Evenly redistribute ranges across the cluster at runtime
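In the VNode sketch above, joining and leaving reduce to inserting or deleting ring positions; the actual data transfer (the job of tools like Cassandra-Shuffle) is a separate step and is omitted here:

```python
def node_leaves(name: str) -> None:
    # Dropping a node's positions implicitly hands its ranges to the next
    # nodes clockwise: lookups now fall through to them.
    for pos in [p for p, n in ring.items() if n == name]:
        del ring[pos]

def node_joins(name: str) -> None:
    # A joining node takes over parts of existing ranges: keys hashing just
    # before its new positions now map to it (data must then be transferred).
    for i in range(VNODES):
        ring[h(f"{name}#vnode{i}")] = name
```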
Describe range based partitioning
Map data to partitions based on ranges of values, e.g., dates: one node takes care of Jan & Feb, the next of Mar & Apr, and so on
Even distribution becomes difficult (some ranges may be accessed far more than others)
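A minimal sketch of the month example; the node names and the two-months-per-node split are assumptions:

```python
def node_for_month(month: int) -> str:
    # Jan/Feb -> node-a, Mar/Apr -> node-b, ..., Nov/Dec -> node-f
    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e", "node-f"]
    return nodes[(month - 1) // 2]

print(node_for_month(2))   # node-a
print(node_for_month(12))  # node-f
```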
Describe list based partitioning
Explicit control over how data is partitioned by creating lists of discrete values
If a key is in a partition's list of values, it is assigned to that partition/node
E.g. AT, DE -> DACH; NL, BE -> BeNeLux
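A minimal sketch using the country-code example above; the fallback partition is an assumption:

```python
partition_lists = {
    "DACH": {"AT", "DE"},    # lists of discrete values per partition
    "BeNeLux": {"NL", "BE"},
}

def partition_for(country_code: str) -> str:
    for partition, members in partition_lists.items():
        if country_code in members:
            return partition
    return "default"  # hypothetical fallback for codes not in any list

print(partition_for("AT"))  # DACH
```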
How does master-slave replication work and what are its pros and cons?
Master: Handles all writes and propagates changes to slaves.
Slaves: Handle read requests; can be promoted to master on failure.
Read resilient
Pros:
Scales well for read-heavy workloads (add more slaves)
Slaves can still serve reads if master fails
Cons:
Master is a single point of failure and write bottleneck
Delays in update propagation cause inconsistencies
Not suited for write-heavy workloads
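A toy sketch of the write path (not a real protocol; propagation is synchronous here, while real systems propagate asynchronously, which is what causes the stale reads mentioned above):

```python
class Node:
    def __init__(self):
        self.data = {}

class Master(Node):
    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value       # all writes go through the master
        for slave in self.slaves:    # propagation to slaves; in reality async,
            slave.data[key] = value  # so slaves can lag and serve stale reads

slaves = [Node(), Node()]
master = Master(slaves)
master.write("user:1", {"name": "Ada"})
print(slaves[0].data["user:1"])      # reads are served by the slaves
```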
What is peer-to-peer replication and what are its advantages and disadvantages?
All nodes can read and write
No single point of failure; nodes synchronize writes with each other
Read & write resilient
Survives node failures without data loss
Easily scalable by adding nodes
Changes propagate slowly, risking temporary inconsistencies
Simultaneous writes on different nodes can lead to permanent write-write conflicts
What is the replication factor r?
If a node leaves, its stored data would become unavailable; the replication factor r therefore specifies that the next r nodes are also responsible for an object and store a replica of it in case a node leaves
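Reusing the ring and h from the partitioning sketches above, a sketch of how the next r distinct nodes clockwise can be determined for an object (following the notes' definition of r):

```python
def responsible_nodes(key: str, r: int = 3) -> list:
    """Collect the first r distinct physical nodes clockwise from the key."""
    pos = h(key)
    positions = sorted(ring)
    start = next((i for i, p in enumerate(positions) if p >= pos), 0)
    owners = []
    for i in range(len(positions)):
        node = ring[positions[(start + i) % len(positions)]]
        if node not in owners:
            owners.append(node)
        if len(owners) == r:
            break
    return owners  # primary owner first, then the replica holders
```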