“V’s” of Big Data?
Volume = Data at Scale
Velocity = Data in Motion -> increasing data acquisition rate
Variety = Data in many forms (-> e.g. data types, data sources)
Veracity = Data Uncertainty
Value = Data as Value (-> the business value/insight extracted from the data)
Definition of Big Data?
Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including (but not limited to): NoSQL, MapReduce and machine learning.
3 Use Cases of Big Data?
User analysis: e.g. recommendations
Pattern recognition: e.g. face detection in images
Collaborative filtering: e.g. movie recommendations
Challenges of Big Data?
Scalability: Capability of a system/ network/ process
to handle a growing amount of work (w.r.t. data amount and problem complexity)
or its potential to be enlarged in order to accommodate that growth
Variability: Inflexible Schemas
Schemas are not suited for semi-/ unstructured data
Horizontal vs. Vertical Scaling
Horizontal Scaling: Distributing the workload across many independent servers (“scale out”)
Vertical Scaling: Installing more processors, more memory and faster hardware (“scale up”) -> optimizing the existing servers
Possible solutions for Big Data Challenge Scalability?
Scaling Storage and Updates of Data:
Sharding: Put different data on separate nodes, each of which does its own reads and writes (see the sketch below)
Cons: Load balancing and resilience are difficult
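A minimal sketch of hash-based sharding, assuming three hypothetical nodes and a simple hash-modulo routing rule (the node names and the rule are illustrative, not any specific product's API):

    import hashlib

    NODES = ["node-0", "node-1", "node-2"]  # hypothetical shard servers

    def shard_for(key: str) -> str:
        # Hash the key and map it onto one node; that node alone
        # handles reads and writes for this key.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[digest % len(NODES)]

    print(shard_for("customer:42"))  # every access to this key goes to the same node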
Master-Slave Replication: Data is replicated across multiple servers, from master to slaves (see the routing sketch below)
-> Master receives all write operations and replicates them to the slaves
-> Slaves can only be read from and are used to distribute read queries
Pros: Resilience, read performance, useful for OLAP
Cons: Updates are costly and consistency is difficult to maintain
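A rough sketch of client-side query routing under master-slave replication, assuming hypothetical node names and a round-robin read policy:

    import itertools

    MASTER = "db-master"                   # receives all writes
    SLAVES = ["db-slave-1", "db-slave-2"]  # read-only replicas
    _read_cycle = itertools.cycle(SLAVES)

    def route(query: str) -> str:
        # Writes must go to the master, which replicates them to the slaves;
        # reads are spread over the slaves to scale read throughput.
        if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return MASTER
        return next(_read_cycle)

    print(route("SELECT * FROM orders"))        # db-slave-1
    print(route("UPDATE orders SET paid = 1"))  # db-master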
MPP (Massively Parallel Processing) DBs: Combination of the Master-Slave and Sharding approaches
=> Designed to handle large amounts of data and concurrent queries
-> Large-scale parallel system
-> Master node(s) and many segment nodes (shards)
-> Shared-nothing architecture: each node in the system has its own memory, storage, and processing power
-> Each segment node is responsible for a part of the data (no overlaps)
-> Master Node builds query execution plan and assigns parts to segment nodes
=> Horizontal Scaling for DWHs
Cons: fundamental scalability limits, ETL is challenging and will not work for unstructured data
Peer-to-Peer Replication: All nodes accept both reads and writes for all the data
-> Each node in the system acts as both a master and a slave
-> Nodes communicate and share data together
Cons: Consistency
Mixing Sharding and Peer-to-Peer Replication: Use replication, but do not replicate all objects on all nodes -> use a replication factor n so that each object is replicated on at least n nodes (see the sketch below)
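A sketch of combining sharding with a replication factor n (here n = 2): each object gets a home node by hashing and is additionally copied to the next node(s) in a logical ring. The node names and the ring layout are assumptions for illustration:

    import hashlib

    NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster
    N = 2                                             # replication factor n

    def replica_nodes(key: str) -> list:
        # Home node via hashing; the remaining replicas are the following
        # nodes in the ring, so each object is stored on n different nodes.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(N)]

    print(replica_nodes("order:17"))  # e.g. ['node-2', 'node-3']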
Limits of networked shared data systems?
ACID (Atomicity, Consistency, Isolation and Durability)
CAP-Theorem
Consistency: Having a single up-to-date copy of the data
Availability: every request receives an answer
Partition Tolerance: System continues to work despite loss / failure of part of the system
Is it possible to achieve all three properties of the CAP theorem?
No -> in the presence of a network partition, a system can provide either consistency or availability, so at most two of the three properties can be guaranteed at the same time
Possible solutions for Big Data Challenge Variability?
Schema-on-Read:
Schema-on-Write (RDBMS):
Schema must be created before any data can be loaded
Explicit load operation transforms the data into the DB-internal structure
New columns must be added explicitly before new data for these columns can be loaded
Pros: Fast read, Standards/ Governance

Schema-on-Read (NoSQL):
Data is simply copied to the file store (no transformation)
Late binding: required columns are extracted at read time
New data can start flowing in at any time
Pros: Fast load, Flexibility
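A small sketch contrasting the two approaches on made-up JSON records: schema-on-write checks records against a fixed schema at load time, while schema-on-read simply stores the raw lines and extracts the required column only when it is read (late binding); file name, schema and fields are illustrative:

    import json

    records = ['{"user": "a", "age": 30}', '{"user": "b", "city": "Berlin"}']

    # Schema-on-write: only records matching the predefined schema are loaded.
    SCHEMA = {"user", "age"}
    loaded = [json.loads(r) for r in records if set(json.loads(r)) == SCHEMA]

    # Schema-on-read: copy the raw data unchanged, pick columns at read time.
    with open("raw_store.jsonl", "w") as f:
        f.write("\n".join(records))

    def read_column(path, column):
        with open(path) as f:
            return [json.loads(line).get(column) for line in f if line.strip()]

    print(loaded)                                  # [{'user': 'a', 'age': 30}]
    print(read_column("raw_store.jsonl", "user"))  # ['a', 'b']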
Categorization for Schema-on-read?
Aggregate-oriented:
  Key-value store
  Document data model
  Column-family DB
Graph-oriented
DWH vs. Data Lake?
DWH: schema-on-write -> data is transformed (ETL) and loaded into a fixed schema before it can be queried; suited for structured data
Data Lake: schema-on-read -> raw data is stored in its native format and only structured/transformed when it is read or analyzed
What if your data is too big to store or actually infinite and fast (-> velocity)?
Streaming of Data
Stream Processing?
Process only a window of the data and compute over this window
Store only the aggregated/ processed information over these windows
Discard the raw data after it has been processed
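A minimal sketch of window-based stream processing with tumbling windows (the window size of 3 events and the average as aggregate are assumptions): only the per-window aggregate is kept, the raw events are discarded as soon as the window closes:

    WINDOW_SIZE = 3  # events per tumbling window (illustrative choice)

    def stream_aggregates(events):
        window = []
        for e in events:
            window.append(e)
            if len(window) == WINDOW_SIZE:
                yield sum(window) / len(window)  # keep only the processed value
                window.clear()                   # raw data is dropped here

    print(list(stream_aggregates([4, 8, 6, 10, 2, 3])))  # [6.0, 5.0]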
Data Processing Architectures?
Lambda: Combination of batch and stream processing
-> designed to handle both real-time and historical data and to provide low-latency data processing
Kappa: Everything is processed as a stream
-> Both real-time and historical data processing / reprocessing are handled by a single stream processing engine
3 main layers of the Lambda data processing architecture?
Batch layer: Responsible for processing historical data and generating a master dataset. The batch layer uses batch processing to process large amounts of data in a batch.
Speed layer: Responsible for handling real-time data and updating the master dataset. The speed layer uses stream processing to process data in real-time.
Serving layer: Responsible for serving the master dataset to the user. It is typically implemented as a data warehouse or a NoSQL database and provides low-latency access to the data.
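A toy sketch of how the serving layer can answer a query by merging the batch view with the speed layer's incremental view (the dict-based views and the page-count example are illustrative assumptions):

    # Batch view: precomputed from the complete historical master dataset.
    batch_view = {"page_a": 1000, "page_b": 250}

    # Real-time view: counts for events that arrived after the last batch run.
    realtime_view = {"page_a": 12, "page_c": 3}

    def query(page):
        # Merging both views gives a complete answer with low latency.
        return batch_view.get(page, 0) + realtime_view.get(page, 0)

    print(query("page_a"))  # 1012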
Apache Hadoop?
Open-source software framework for storing and processing large amounts of data
Provides a distributed file system (HDFS)
Based on the MapReduce programming model
Key characteristics of Apache Hadoop?
Scalable -> horizontal rather than vertical scaling
Fault-tolerant
Batch-based -> Batch processing
Apache Hadoop High-Level Architecture?
HDFS as the distributed storage layer, with MapReduce as the distributed processing framework running on top of it
HDFS? Characteristics?
Hadoop Distributed File System: Distributed Parallel Data Store
Master (NameNode) - Slave (DataNode) architecture
“Write once” -> files can be replaced, but not changed
Good for modest number of large files
Built-in replication on ingest
MapReduce? Process?
Distributed programming model which allows for the parallel processing of large data sets:
Fault-tolerant batch job processing framework
Map: Input data is split into smaller chunks and a map function is applied to each chunk
Map function takes an input key-value pair and produces a set of intermediate key-value pairs
Shuffle&Sort: Intermediate key-value pairs from the map step are shuffled and sorted by key
-> Preparation for reduce step
Reduce: Reduce function is applied to the grouped and sorted key-value pairs produced by the shuffle and sort step
Reduce function takes an input key and a set of values and produces a set of output key-value pairs
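The classic word-count example, sketched in plain Python to illustrate the three steps (this mimics the programming model only; it is not Hadoop's actual API):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(_, line):          # Map: emit (word, 1) for every word in the chunk
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):  # Reduce: aggregate all values for one key
        yield word, sum(counts)

    lines = ["big data big value", "data in motion"]
    intermediate = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]

    intermediate.sort(key=itemgetter(0))  # Shuffle & sort: group pairs by key
    result = [out for key, group in groupby(intermediate, key=itemgetter(0))
              for out in reduce_fn(key, (v for _, v in group))]

    print(result)  # [('big', 2), ('data', 2), ('in', 1), ('motion', 1), ('value', 1)]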
What does MapReduce do for you?
Break jobs down into tasks
Distribute processing and coordinate execution -> schedule tasks
Data synchronization (Shuffle and sort, I/O)
Error handling
What is NoSQL on Hadoop?
HBase
SQL on Hadoop?
Hive and Impala
Hive?
Data Warehousing infrastructure built on top of Hadoop
-> Allows you to query and manage large-scale data using a familiar SQL-like interface
-> Hive tables consist of data and a schema
-> Not a relational database!
-> Designed for OLAP/ batch analytics, not for OLTP
Technologies of Impala?
Massively parallel processing (MPP) SQL query execution engine
Alternatives to Hadoop?
Spark