Buffl

Big Data

by Jacob H.

Possible solutions for the Big Data challenge of scalability?

Scaling Storage and Updates of Data:

  • Sharding: Put different data on separate nodes, each of which does its own reads and writes

    • Cons: Load balancing and resilience are difficult

  • Master-Slave Replication: Data is replicated across multiple servers, from master to slaves

    -> Master receives all write operations and replicates them to the slaves

    -> Slaves can only read and are used to distribute read queries

    • Pros: Resilience, Read Performance, Useful for OLAP

    • Cons: Updates are costly and consistency is difficult to maintain

  • MPP (Massively Parallel Processing) DBs: Combination of the Master-Slave and Sharding approaches

    => Designed to handle large amounts of data and concurrent queries

    -> Large-scale parallel system

    -> A master node and many segment nodes (shards)

    -> Shared-nothing architecture: each node in the system has its own memory, storage, and processing power

    -> Each segment node is responsible for a part of the data (no overlaps)

    -> Master Node builds query execution plan and assigns parts to segment nodes

    => Horizontal Scaling for DWHs

    • Cons: Fundamental scalability limits; ETL is challenging and will not work for unstructured data

  • Peer-to-Peer Replication: All nodes accept both reads and writes for all of the data

    -> Each node in the system acts as both a master and a slave

    -> Nodes communicate with each other and share data

    • Cons: Consistency is difficult to maintain

  • Mixing Sharding and Peer-to-Peer Replication: Use replication, but do not replicate all objects on all nodes -> use a replication factor n so that each object is replicated on at least n nodes
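
The sharding idea above (each key's reads and writes go to exactly one node) can be illustrated with a simple hash-based router. This is a minimal sketch, not from the source; the function name `shard_for` is illustrative:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to one of num_shards nodes via a stable hash,
    so all reads and writes for that key hit the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard:
primary = shard_for("user:42", num_shards=4)
```

Note that this simple modulo scheme also shows the load-balancing con: changing `num_shards` remaps almost every key, which is why real systems often prefer consistent hashing.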
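The master-slave split (writes to the master, reads distributed over the slaves) could be sketched as follows; the class and node names are hypothetical, not from the source:

```python
import itertools

class MasterSlaveRouter:
    """Send every write to the master; distribute reads over the
    slaves round-robin to spread the read load."""

    def __init__(self, master, slaves):
        self.master = master
        # Fall back to the master if no read replicas exist
        self._reads = itertools.cycle(slaves or [master])

    def node_for_write(self):
        return self.master  # the master receives all write operations

    def node_for_read(self):
        return next(self._reads)  # slaves serve the read queries

router = MasterSlaveRouter("master", ["slave-1", "slave-2"])
```

This also makes the trade-offs visible: reads scale by adding slaves, while every write still funnels through the single master and must be replicated out, which is where the update cost and consistency lag come from.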
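The mixed scheme in the last bullet, sharding combined with a replication factor n, could be sketched like this (a toy placement function under the assumption of a fixed node list; the name `replica_nodes` is illustrative):

```python
import hashlib

def replica_nodes(key: str, nodes: list, n: int) -> list:
    """Shard by hash to pick a primary node, then replicate the object
    on the next n - 1 nodes in ring order (replication factor n)."""
    start = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(min(n, len(nodes)))]

nodes = ["node-A", "node-B", "node-C", "node-D"]
replicas = replica_nodes("user:42", nodes, n=2)  # two distinct nodes hold this object
```

Each object lives on n nodes rather than all of them, so storage still scales out like sharding while any single node failure leaves n - 1 copies available.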

