“V’s” of Big Data?
Volume = Data at Scale
Velocity = Data in Motion -> increasing data acquisition rate
Variety = Data in many forms (-> e.g. data types, data sources)
Veracity = Data Uncertainty
Value = Data as Value (-> the business value/insight extracted from the data)
Definition of Big Data?
Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including (but not limited to): NoSQL, MapReduce and machine learning.
3 Use Cases of Big Data?
User analysis: e.g. recommendations
Pattern recognition: e.g. face detection in images
Collaborative filtering: e.g. movie recommendations
Challenges of Big Data?
Scalability: Capability of a system/ network/ process
to handle a growing amount of work (w.r.t. data amount and problem complexity)
or its potential to be enlarged in order to accommodate that growth
Variability: Inflexible Schemas
Schemas are not suited for semi-/ unstructured data
Horizontal vs. Vertical Scaling
Horizontal Scaling: Distributing the workload across many independent servers (“scale out”)
Vertical Scaling: Installing more processors, more memory and faster hardware (“scale up”) -> optimizing the existing servers
Possible solutions for Big Data Challenge Scalability?
Scaling Storage and Updates of Data:
Sharding: Put different data on separate nodes, each of which does its own reads and writes (see the sketch below)
Cons: Load balancing and resilience are difficult
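A minimal sketch of hash-based sharding, assuming three hypothetical nodes and a simple hash-modulo routing rule (the node names and the rule are illustrative, not any specific product's API):

    import hashlib

    NODES = ["node-0", "node-1", "node-2"]  # hypothetical shard servers

    def shard_for(key: str) -> str:
        # Hash the key and map it onto one node; that node alone
        # handles reads and writes for this key.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[digest % len(NODES)]

    print(shard_for("customer:42"))  # every access to this key goes to the same node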
Master-Slave Replication: Data is replicated across multiple servers, from master to slaves (see the routing sketch below)
-> Master receives all write operations and replicates them to the slaves
-> Slaves can only be read from and are used to distribute read queries
Pros: Resilience, read performance, useful for OLAP
Cons: Updates are costly and consistency is difficult to maintain
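A rough sketch of client-side query routing under master-slave replication, assuming hypothetical node names and a round-robin read policy:

    import itertools

    MASTER = "db-master"                   # receives all writes
    SLAVES = ["db-slave-1", "db-slave-2"]  # read-only replicas
    _read_cycle = itertools.cycle(SLAVES)

    def route(query: str) -> str:
        # Writes must go to the master, which replicates them to the slaves;
        # reads are spread over the slaves to scale read throughput.
        if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return MASTER
        return next(_read_cycle)

    print(route("SELECT * FROM orders"))        # db-slave-1
    print(route("UPDATE orders SET paid = 1"))  # db-master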
MPP (Massively Parallel Processing) DBs: Combination of the Master-Slave and Sharding approaches
=> Designed to handle large amounts of data and concurrent queries
-> Large-scale parallel system
-> Master node(s) and many segment nodes (shards)
-> Shared-nothing architecture: each node in the system has its own memory, storage, and processing power
-> Each segment node is responsible for a part of the data (no overlaps)
-> Master Node builds query execution plan and assigns parts to segment nodes
=> Horizontal Scaling for DWHs
Cons: fundamental scalability limits, ETL is challenging and will not work for unstructured data
Peer-to-Peer Replication: All nodes accept both reads and writes for all the data
-> Each node in the system acts as both a master and a slave
-> Nodes communicate and share data together
Cons: Consistency
Mixing Sharding and Peer-to-Peer Replication: Use replication, but do not replicate all objects on all nodes -> use a replication factor n so that each object is replicated on at least n nodes (see the sketch below)
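A sketch of combining sharding with a replication factor n (here n = 2): each object gets a home node by hashing and is additionally copied to the next node(s) in a logical ring. The node names and the ring layout are assumptions for illustration:

    import hashlib

    NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster
    N = 2                                             # replication factor n

    def replica_nodes(key: str) -> list:
        # Home node via hashing; the remaining replicas are the following
        # nodes in the ring, so each object is stored on n different nodes.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(N)]

    print(replica_nodes("order:17"))  # e.g. ['node-2', 'node-3']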
Limits of networked shared data systems?
ACID (Atomicity, Consistency, Isolation and Durability)
CAP-Theorem
Consistency: Having a single up-to-date copy of the data
Availability: every request receives an answer
Partition Tolerance: System continues to work despite loss / failure of part of the system
Is it possible to achieve all three properties of the CAP theorem?
No -> in the presence of a network partition, a system can provide either consistency or availability, so at most two of the three properties can be guaranteed at the same time
Possible solutions for Big Data Challenge Variability?
Schema-on-Read:
Schema-on-Write (RDBMS):
Schema must be created before any data can be loaded
Explicit load operation transforms the data into the DB-internal structure
New columns must be added explicitly before new data for these columns can be loaded
Pros: Fast read, Standards/ Governance

Schema-on-Read (NoSQL):
Data is simply copied to the file store (no transformation)
Late binding: required columns are extracted at read time
New data can start flowing in at any time
Pros: Fast load, Flexibility
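A small sketch contrasting the two approaches on made-up JSON records: schema-on-write checks records against a fixed schema at load time, while schema-on-read simply stores the raw lines and extracts the required column only when it is read (late binding); file name, schema and fields are illustrative:

    import json

    records = ['{"user": "a", "age": 30}', '{"user": "b", "city": "Berlin"}']

    # Schema-on-write: only records matching the predefined schema are loaded.
    SCHEMA = {"user", "age"}
    loaded = [json.loads(r) for r in records if set(json.loads(r)) == SCHEMA]

    # Schema-on-read: copy the raw data unchanged, pick columns at read time.
    with open("raw_store.jsonl", "w") as f:
        f.write("\n".join(records))

    def read_column(path, column):
        with open(path) as f:
            return [json.loads(line).get(column) for line in f if line.strip()]

    print(loaded)                                  # [{'user': 'a', 'age': 30}]
    print(read_column("raw_store.jsonl", "user"))  # ['a', 'b']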
Categorization for Schema-on-read?
Aggregate-oriented:
  Key-value store
  Document data model
  Column-family DB
Graph-oriented
DWH vs. Data Lake?
DWH: schema-on-write -> data is transformed (ETL) and loaded into a fixed schema before it can be queried; suited for structured data
Data Lake: schema-on-read -> raw data is stored in its native format and only structured/transformed when it is read or analyzed
What if your data is too big to store or actually infinite and fast (-> velocity)?
Streaming of Data
Stream Processing?
Process only a window of the data and compute over this window
Store only the aggregated/ processed information over these windows
Discard the raw data after it has been processed
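A minimal sketch of window-based stream processing with tumbling windows (the window size of 3 events and the average as aggregate are assumptions): only the per-window aggregate is kept, the raw events are discarded as soon as the window closes:

    WINDOW_SIZE = 3  # events per tumbling window (illustrative choice)

    def stream_aggregates(events):
        window = []
        for e in events:
            window.append(e)
            if len(window) == WINDOW_SIZE:
                yield sum(window) / len(window)  # keep only the processed value
                window.clear()                   # raw data is dropped here

    print(list(stream_aggregates([4, 8, 6, 10, 2, 3])))  # [6.0, 5.0]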
Data Processing Architectures?
Lambda: Combination of batch and stream processing
-> designed to handle both real-time and historical data and to provide low-latency data processing
Kappa: Everything is processed as a stream
-> Both real-time and historical data processing / reprocessing are handled by a single stream processing engine
3 main layers of the Lambda data processing architecture?
Batch layer: Responsible for processing historical data and generating a master dataset. The batch layer uses batch processing to process large amounts of data in a batch.
Speed layer: Responsible for handling real-time data and updating the master dataset. The speed layer uses stream processing to process data in real-time.
Serving layer: Responsible for serving the master dataset to the user. It is typically implemented as a data warehouse or a NoSQL database and provides low-latency access to the data.
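A toy sketch of how the serving layer can answer a query by merging the batch view with the speed layer's incremental view (the dict-based views and the page-count example are illustrative assumptions):

    # Batch view: precomputed from the complete historical master dataset.
    batch_view = {"page_a": 1000, "page_b": 250}

    # Real-time view: counts for events that arrived after the last batch run.
    realtime_view = {"page_a": 12, "page_c": 3}

    def query(page):
        # Merging both views gives a complete answer with low latency.
        return batch_view.get(page, 0) + realtime_view.get(page, 0)

    print(query("page_a"))  # 1012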
Apache Hadoop?
Open-source software framework for storing and processing large amounts of data
Provides a distributed file system (HDFS)
Based on the MapReduce programming model
Key characteristics of Apache Hadoop?
Scalable -> horizontal rather than vertical scaling
Fault-tolerant
Batch-based -> Batch processing
Apache Hadoop High-Level Architecture?
HDFS as the distributed storage layer, with MapReduce as the distributed processing framework running on top of it
HDFS? Characteristics?
Hadoop Distributed File System: Distributed Parallel Data Store
Master (NameNode) - Slave (DataNode) architecture
“Write once” -> files can be replaced, but not changed
Good for modest number of large files
Built-in replication on ingest
MapReduce? Process?
Distributed programming model which allows for the parallel processing of large data sets:
Fault-tolerant batch job processing framework
Map: Input data is split into smaller chunks and a map function is applied to each chunk
Map function takes an input key-value pair and produces a set of intermediate key-value pairs
Shuffle&Sort: Intermediate key-value pairs from the map step are shuffled and sorted by key
-> Preparation for reduce step
Reduce: Reduce function is applied to the grouped and sorted key-value pairs produced by the shuffle and sort step
Reduce function takes an input key and a set of values and produces a set of output key-value pairs
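The classic word-count example, sketched in plain Python to illustrate the three steps (this mimics the programming model only; it is not Hadoop's actual API):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(_, line):          # Map: emit (word, 1) for every word in the chunk
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):  # Reduce: aggregate all values for one key
        yield word, sum(counts)

    lines = ["big data big value", "data in motion"]
    intermediate = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]

    intermediate.sort(key=itemgetter(0))  # Shuffle & sort: group pairs by key
    result = [out for key, group in groupby(intermediate, key=itemgetter(0))
              for out in reduce_fn(key, (v for _, v in group))]

    print(result)  # [('big', 2), ('data', 2), ('in', 1), ('motion', 1), ('value', 1)]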
What does MapReduce do for you?
Break jobs down into tasks
Distribute processing and coordinate execution -> schedule tasks
Data synchronization (Shuffle and sort, I/O)
Error handling
What is NoSQL on Hadoop?
HBase
SQL on Hadoop?
Hive and Impala
Hive?
Data Warehousing infrastructure built on top of Hadoop
-> Allows you to query and manage large-scale data using a familiar SQL-like interface
-> Hive tables consist of data and a schema
-> Not a relational database!
-> Designed for OLAP/ batch analytics, not for OLTP
Technologies of Impala?
Massively parallel processing (MPP) SQL query execution engine
Alternatives to Hadoop?
Spark