Why NoSQL
Lack of flexibility/rigid schemas
web data is mostly unstructured and needs only simple operations, e.g., editing profiles: reads and writes on a single record
Web apps evolve frequently -> relational schemas may be too restrictive
NoSQL allows freedom & flexibility (heterogeneity) and avoids schema evolution
Lack of scalability
distribution of data & load needed
Scale up -> vertically (add resources, e.g., RAM)
Scale out -> horizontally (add nodes in a distributed system); this is what NoSQL uses
Cost
many NoSQL systems are open source -> no licensing costs compared to RDBMS
scaling out with commodity servers is cheaper than scaling up
Name the 3 Vs
Volume - Restricted Scalability
Velocity - High Latency / Low Performance
Variety - Rigid Schemas
Name and briefly explain the 4 different kinds of data models
Key-value
keys & corresponding values (hash tables); interface via CRUD
High performance; scalability by simple partitioning
No schema, restricted query potential; relations cannot be modelled directly
Column family/wide column stores
data is organized into column families: rows hold a variable set of columns, and columns that are accessed together are grouped and stored together (e.g., Cassandra, HBase)
Document
semi-structured documents (e.g., JSON) are stored and retrieved by key
schema-free; the application is responsible for the schema
redundancy due to missing normal forms (no joins)
Graph-based
nodes and edges represent entities and their relationships; efficient traversal of relationships (e.g., social networks)
How does NoSQL data modelling differ from relational modelling?
Relational model: Uses normalized tables (rows & columns) → complex records are split
NoSQL model: Aggregate-oriented → denormalization
Aggregates:
Collections of related objects identified by a key
Treated as a unit for retrieval, manipulation, and consistency
Convenient for manipulation in JSON-like structures (e.g., embedded user + address data)
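A minimal sketch of such an aggregate as a JSON-like Python dict (field names and values are illustrative assumptions, not from a specific system):

```python
# One aggregate: user and address travel together as a unit under one key.
user_aggregate = {
    "_id": "user:42",      # the aggregate's key
    "name": "Ada Lovelace",
    "address": {           # embedded instead of a separate, normalized table
        "street": "Main St 1",
        "city": "London",
    },
}
```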
How does modelling one-to-one relationships differ between relational and NoSQL when fetching the data in combination?
For relational model:
only 1 query (a join) is needed to fetch the data from both tables
For NoSQL (document)
When normalized via reference: 2 queries needed -> join must be done in application
When denormalized via embedding: 1 query needed
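A minimal sketch of the two document-store options, with in-memory dicts standing in for collections (all names are assumptions):

```python
# Normalized via reference: user and address live in separate "collections".
users = {"u1": {"name": "Ada", "address_id": "a1"}}
addresses = {"a1": {"city": "London"}}

user = users["u1"]                       # query 1
address = addresses[user["address_id"]]  # query 2: the "join" happens in the app

# Denormalized via embedding: one lookup returns the full aggregate.
users_embedded = {"u1": {"name": "Ada", "address": {"city": "London"}}}
user_with_address = users_embedded["u1"]  # single query
```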
How does modelling one-to-many relationships differ between relational and NoSQL when fetching the data in combination?
For relational data model:
To fetch data in combination, 1 query needed (join via foreign key)
referential integrity assured
For NoSQL (document):
no referential integrity assured
queries needed depend on the structure of the document
normalized via reference -> 1+n queries needed to fetch data in combination; join to be done in the app
normalized via array of references -> 1+n queries needed to fetch data in combination; join to be done in the app
Denormalized via embedding -> 1 query needed to fetch data in combination
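A sketch of the 1:n case (one user, n orders), again with dicts standing in for collections; all names are assumptions:

```python
# Normalized via an array of references: 1 query for the user + n for the orders.
users = {"u1": {"name": "Ada", "order_ids": ["o1", "o2"]}}
orders = {"o1": {"total": 10}, "o2": {"total": 20}}

user = users["u1"]                                        # 1 query
user_orders = [orders[oid] for oid in user["order_ids"]]  # n queries, join in app

# Denormalized via embedding: 1 query returns the user with all orders.
users_embedded = {"u1": {"name": "Ada",
                         "orders": [{"total": 10}, {"total": 20}]}}
```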
How does modelling many-to-many relationships differ between relational and NoSQL when fetching the data in combination?
For relational data model:
1 query needed, joining both tables via a junction table; referential integrity assured
For NoSQL model:
normalized via arrays of references -> (1+n) resp. (1+m) queries needed to fetch data in combination (join in app)
denormalized via embedding -> (1+1) queries needed to fetch data in combination
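A sketch of the m:n case (students and courses referencing each other via arrays); fetching one student's courses costs 1+n queries, the reverse direction 1+m (all names are assumptions):

```python
# m:n normalized via arrays of references on both sides.
students = {"s1": {"name": "Ada", "course_ids": ["c1", "c2"]}}
courses = {"c1": {"title": "DB", "student_ids": ["s1"]},
           "c2": {"title": "ML", "student_ids": ["s1"]}}

student = students["s1"]                                    # 1 query
enrolled = [courses[cid] for cid in student["course_ids"]]  # + n queries (join in app)
```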
When should you use referencing vs. embedding in NoSQL data modeling?
Use Referencing when:
Objects are first-class and stored in separate collections
You model many-to-many relationships
You need complex queries (e.g., pagination)
You deal with large objects (e.g., >16MB in MongoDB)
Use Embedding when:
The object is not referenced by others
Simpler structure and faster access is preferred
What do NoSQL systems offer and not offer in relationship modeling?
Offer:
Multiple ways to model relationships
Trade-offs in query performance and redundancy
Do not offer:
Referential integrity
Joins (avoided using aggregates / denormalization)
How does relationship modeling differ between Relational and NoSQL models?
Relational | NoSQL
Query-agnostic ("aggregate-ignorant") | Query-specific
Relationships modeled via keys | Focus on jointly accessed aggregates
All queries supported | Other queries harder
Focus: What data do we have? | Focus: What questions do we ask?
Consistency-driven (normalized) | Workload-driven (denormalized)
Uses functional dependencies | Minimize aggregates to access
Optimized for updates | Optimized for reads
General-purpose | Domain-/application-specific
What are best practices for NoSQL data modeling?
Do:
Know how the data is intended to be used (domain-specific)
Determine access patterns before designing schemas
Optimize for common use cases (e.g., 20% of queries cause 80% of load)
Always cap the amount of data you fetch (e.g., use pagination like "next 20 results")
Don't:
Build the DB model first and only later decide on querying needs
Fall into the "n+1" trap: multiple calls to load related entities inefficiently (see the sketch below)
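A sketch of the n+1 trap against an in-memory stand-in for a database (all names and data are illustrative):

```python
users = {1: {"name": "Ada", "post_ids": [10, 11, 12]}}
posts = {10: {"title": "a"}, 11: {"title": "b"}, 12: {"title": "c"}}

def load_profile_n_plus_one(user_id):
    user = dict(users[user_id])                               # 1 call for the user
    user["posts"] = [posts[pid] for pid in user["post_ids"]]  # + n calls, one per post
    return user

def load_profile_batched(user_id, limit=20):
    user = dict(users[user_id])                               # 1 call for the user
    # one batched call instead of n single lookups, capped like "next 20 results"
    wanted = set(user["post_ids"][:limit])
    user["posts"] = [p for pid, p in posts.items() if pid in wanted]
    return user
```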
What is the overall goal of scalability in database systems?
Maintain a consistent performance level (throughput, latency, response time)
As system load increases (e.g., DB size, read/write ratio, active users, cache hit rate)
By adding resources appropriately:
Scaling Up: stronger CPU, more storage
Scaling Out: distribute data/load across nodes
Focus: Use Partitioning (e.g., sharding) and/or Replication to scale effectively
Explain the basic principle of hash-based partitioning and how it specifically works
Objects pass through a hash function, and the result determines which node they are allocated to
Aggregates with similar fragment keys may end up on different nodes
Specific:
Keys of nodes and objects (e.g., IPs, primary keys) are hashed to a number range using:
a hash function
a modulo operation: H(key) mod L → range [0, L-1]
The key space is visualized as a ring
Each value is assigned to the next node that follows its position clockwise on the ring
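A minimal sketch of such a hash ring in Python; the node names, the choice of hash function, and L are assumptions:

```python
import hashlib

L = 2**16  # size of the key space [0, L-1]

def h(key: str) -> int:
    """H(key) mod L: map any key into the ring's number range."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % L

# Each node is hashed to one position on the ring.
nodes = {name: h(name) for name in ["node-a", "node-b", "node-c"]}

def lookup(key: str) -> str:
    """Walk clockwise from the key's position to the next node."""
    pos = h(key)
    for name, node_pos in sorted(nodes.items(), key=lambda kv: kv[1]):
        if node_pos >= pos:
            return name
    return min(nodes, key=nodes.get)  # wrap around to the first node on the ring
```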
What problem do virtual nodes solve in hash-based partitioning, and how?
Hashing distributes keys well only with many servers.
With few servers, key distribution can become uneven (some servers overloaded).
Solution: Use multiple virtual nodes (VNodes) per physical node.
Each virtual node is assigned a different position on the hash ring.
This improves load balancing and gives more flexibility in key distribution.
Common in systems like Riak and Voldemort; the number of VNodes can be scaled to each server's capacity.
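Continuing the ring sketch above (reusing h), each physical node is hashed to several positions; the VNode count is an assumption:

```python
VNODES = 8  # vnodes per physical node; could be scaled to server capacity

ring = {}
for name in ["node-a", "node-b", "node-c"]:
    for i in range(VNODES):
        ring[h(f"{name}#vnode{i}")] = name  # vnode position -> physical node

def lookup_vnode(key: str) -> str:
    pos = h(key)
    for p in sorted(ring):  # positions in clockwise order
        if p >= pos:
            return ring[p]
    return ring[min(ring)]  # wrap around the ring
```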
What happens during reorganization in hash-based partitioning when nodes join or leave?
When a node leaves, its data range is taken over by the next node clockwise.
When a node joins, existing nodes give up part of their range to it.
Reorganization requires that data is replicated to ensure availability.
Systems like Cassandra use tools like Cassandra-Shuffle to:
Reassign virtual nodes
Reschedule and execute data transfers in a 2-phase process
Evenly redistribute ranges across the cluster at runtime
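In the VNode sketch above, joining and leaving reduce to inserting or deleting ring positions; the actual data transfer (the job of tools like Cassandra-Shuffle) is a separate step and is omitted here:

```python
def node_leaves(name: str) -> None:
    # Dropping a node's positions implicitly hands its ranges to the next
    # nodes clockwise: lookups now fall through to them.
    for pos in [p for p, n in ring.items() if n == name]:
        del ring[pos]

def node_joins(name: str) -> None:
    # A joining node takes over parts of existing ranges: keys hashing just
    # before its new positions now map to it (data must then be transferred).
    for i in range(VNODES):
        ring[h(f"{name}#vnode{i}")] = name
```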
Describe range based partitioning
Map data to partitions based on ranges of values, e.g., dates: one node takes care of Jan & Feb, the next of Mar & Apr, and so on
Even distribution becomes difficult (some ranges may be accessed far more than others)
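A minimal sketch of the month example; the node names and the two-months-per-node split are assumptions:

```python
def node_for_month(month: int) -> str:
    # Jan/Feb -> node-a, Mar/Apr -> node-b, ..., Nov/Dec -> node-f
    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e", "node-f"]
    return nodes[(month - 1) // 2]

print(node_for_month(2))   # node-a
print(node_for_month(12))  # node-f
```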
Describe list based partitioning
Explicit control over how data is partitioned by creating lists of discrete values
If a key is in a partition's list of values, it is assigned to that partition/node
E.g. AT, DE -> DACH; NL, BE -> BeNeLux
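A minimal sketch using the country-code example above; the fallback partition is an assumption:

```python
partition_lists = {
    "DACH": {"AT", "DE"},    # lists of discrete values per partition
    "BeNeLux": {"NL", "BE"},
}

def partition_for(country_code: str) -> str:
    for partition, members in partition_lists.items():
        if country_code in members:
            return partition
    return "default"  # hypothetical fallback for codes not in any list

print(partition_for("AT"))  # DACH
```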
How does master-slave replication work and what are its pros and cons?
Master: Handles all writes and propagates changes to slaves.
Slaves: Handle read requests; can be promoted to master on failure.
Read resilient
Pros:
Scales well for read-heavy workloads (add more slaves)
Slaves can still serve reads if master fails
Cons:
Master is a single point of failure and write bottleneck
Delays in update propagation cause inconsistencies
Not suited for write-heavy workloads
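A toy sketch of the write path (not a real protocol; propagation is synchronous here, while real systems propagate asynchronously, which is what causes the stale reads mentioned above):

```python
class Node:
    def __init__(self):
        self.data = {}

class Master(Node):
    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value       # all writes go through the master
        for slave in self.slaves:    # propagation to slaves; in reality async,
            slave.data[key] = value  # so slaves can lag and serve stale reads

slaves = [Node(), Node()]
master = Master(slaves)
master.write("user:1", {"name": "Ada"})
print(slaves[0].data["user:1"])      # reads are served by the slaves
```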
What is peer-to-peer replication and what are its advantages and disadvantages?
All nodes can read and write
No single point of failure; nodes synchronize writes with each other
Read & write resilient
Survives node failures without data loss
Easily scalable by adding nodes
Changes propagate slowly, risking temporary inconsistencies
Simultaneous writes on different nodes can lead to permanent write-write conflicts
What is the replication factor r?
If a node leaves, its stored data would become unavailable; the replication factor r therefore specifies that the next r nodes are also responsible for an object and store a replica of it in case a node leaves
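Reusing the ring and h from the partitioning sketches above, a sketch of how the next r distinct nodes clockwise can be determined for an object (following the notes' definition of r):

```python
def responsible_nodes(key: str, r: int = 3) -> list:
    """Collect the first r distinct physical nodes clockwise from the key."""
    pos = h(key)
    positions = sorted(ring)
    start = next((i for i, p in enumerate(positions) if p >= pos), 0)
    owners = []
    for i in range(len(positions)):
        node = ring[positions[(start + i) % len(positions)]]
        if node not in owners:
            owners.append(node)
        if len(owners) == r:
            break
    return owners  # primary owner first, then the replica holders
```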