7 Vs of Big Data
Value - data is only of use if it delivers value (actionable insights)
Volume - the size of the data
Variability - data whose meaning is constantly changing
Visualisation - data presented in a manner that is readable and accessible
Veracity - the trustworthiness of the data in terms of accuracy
Variety - the different types of data
Velocity - the speed at which the data is generated
Poly-structured data? What is it?
Data organized in multiple different formats or structures
Can include structured, semi-structured and unstructured data
Becoming increasingly common
Difficult to manage and make sense of
Data integration techniques can be used to combine and store data from multiple sources in a single, unified view.
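A minimal sketch of such data integration in Python, assuming toy field names and sources: a structured CSV source and a semi-structured JSON source are merged into a single, unified view keyed by id.

```python
import csv, json, io

# Structured source: CSV with a fixed schema (hypothetical fields)
csv_data = io.StringIO("id,name\n1,Alice\n2,Bob\n")
structured = {row["id"]: row for row in csv.DictReader(csv_data)}

# Semi-structured source: JSON where fields may vary per record
json_data = '[{"id": "1", "age": 30}, {"id": "2", "city": "Berlin"}]'
semi_structured = {str(rec["id"]): rec for rec in json.loads(json_data)}

# Integrate both sources into a single, unified view keyed by id
unified = {}
for key in structured.keys() | semi_structured.keys():
    record = {}
    record.update(structured.get(key, {}))
    record.update(semi_structured.get(key, {}))
    unified[key] = record

print(unified["1"])  # {'id': '1', 'name': 'Alice', 'age': 30}
```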
Definition Big Data
Big Data refers to extremely large and complex data sets that are difficult to process using traditional data processing techniques. These data sets are often characterized by their volume, velocity, variety, and value.
Big data is often generated by sources such as social media, internet of things (IoT) devices, and sensor networks, and it can be used to gain insights and make better decisions in fields such as finance, healthcare, and retail.
Typical Use Cases for Big Data Technologies
Index building
User analysis
Pattern recognition (CV, spam detection at Yahoo!, Google, Facebook; face detection in images)
NLP, text mining
Graph analysis (e.g. social network graphs at Facebook)
Collaborative filtering
Sentiment analysis
Risk assessment
Prediction models
What is Sentiment Analysis?
Sentiment Analysis (also known as Opinion Mining) is the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials, such as text, speech, images or videos. The goal of sentiment analysis is to determine the attitude, opinions and emotions of a speaker or writer with respect to some topic or the overall contextual polarity of a document.
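A minimal lexicon-based sketch in Python: the word lists below are toy assumptions, not a real sentiment lexicon, but they show how the polarity of a text can be scored.

```python
# Toy sentiment lexicon; real systems use large lexicons or trained models
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def sentiment(text: str) -> str:
    """Classify the overall polarity of a text by counting opinion words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this phone, the camera is excellent"))  # positive
print(sentiment("Terrible battery, I hate it"))                 # negative
```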
What is scalability?
Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged in order to accommodate that growth.
Scalable w.r.t. data amount or problem complexity. What is the difference?
with respect to
Scalable with respect to data amount means that the solution can handle growing data volumes, with resource needs and runtime growing at most proportionally (ideally linearly) with the data size.
Scalable with respect to problem complexity means that the solution can also cope when the problem itself becomes more complex (e.g. more expensive computations per item), not just when there is more data.
The main difference: scalability w.r.t. data amount is about handling an increase in data size, while scalability w.r.t. problem complexity is about handling an increase in the complexity of the problem to be solved.
Horizontal Scaling vs Vertical Scaling
Horizontal Scaling:
Involves adding more machines to a system
Increases capacity by distributing workload among multiple machines
Can handle an increase in data volume and traffic
Vertical Scaling:
Involves adding more resources to a single machine
Increases capacity by upgrading CPU, RAM, or storage on that single machine
Can handle an increase in processing power and memory demand
Advantages and Drawbacks of Horizontal Scaling
+Runs on cheap commodity hardware; capacity grows by adding nodes
+Failure of a single node is easier to tolerate
–Adds distributed-system complexity (data distribution, consistency, network overhead)
Advantages and Drawbacks of Vertical Scaling
+Simple: no data distribution or cluster coordination needed
–Limited by what a single machine can offer; high-end hardware is expensive
–The single machine remains a single point of failure
Scaling Storage & Updates of Data: Sharding
Put different data on separate nodes, each of which does its own reads and writes
Can be configured in various ways, e.g., keeping the data closest to its users
Ideally, each user reading different data talks to a different server
Cons:
Load-balancing may be difficult: if done naively, one server can end up handling the majority of the load
Resilience: if a server goes down, its shard of the data becomes unavailable (lost unless it is also replicated)
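A minimal sketch of hash-based shard routing, assuming three hypothetical nodes; a naive modulo scheme like this illustrates the idea but can also cause the load-balancing problem mentioned above when keys are skewed.

```python
import hashlib

# Hypothetical shard nodes; each one stores and serves its own subset of keys
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    """Route a key to a shard using a stable hash (naive modulo scheme)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Each user's reads and writes go to the node that owns their key
for user in ["alice", "bob", "carol"]:
    print(user, "->", shard_for(user))
```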
Scaling Storage & Updates of Data: Master-Slave Replication
Data is replicated from master to slaves. The master services all writes; reads may come from either master or slaves.
Pros:
+Resilience
+Read performance
+Very useful for OLAP
–Updates are costly
–Consistency difficult to maintain
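A minimal routing sketch, with hypothetical host names, showing that all writes go to the master while reads can be spread across the replicas.

```python
import random

MASTER = "db-master"
SLAVES = ["db-slave-1", "db-slave-2"]

def route(operation: str) -> str:
    """Send writes to the master; spread reads across master and slaves."""
    if operation == "write":
        return MASTER
    return random.choice([MASTER] + SLAVES)

print(route("write"))  # always db-master
print(route("read"))   # any of the replicas
```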
Master-Slave + Sharding approach in MPP DBs
MPP = Massively Parallel Processing
–Large-scale parallel systems
–Present themselves as relational databases
–Usually columnar
–Usually run on expensive, specialized hardware
Master node and many segment nodes
No overlaps: each node holds its own distinct portion of the data
“Shared nothing” –each node has independent CPUs, RAM, storage
Master builds query execution plan and assigns parts to segment nodes
Provide horizontal scaling for DWHs, but
ETL is challenging and requires heavy lifting at ingress
Will not work for unstructured data
Scaling Storage & Updates of Data: Peer-to-Peer Replication
Peer-to-peer replication has all nodes applying reads and writes to all the data.
Consistency even more difficult
Master is no longer the bottleneck
Mixing Sharding and Peer-to-Peer Replication
Replication factor n: each object is replicated on at least n nodes
+Not as costly as replicating to all nodes
+Still keeps resilience
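A sketch of combining both ideas: hash the key to a primary shard, then place n-1 additional copies on the following nodes. The node list and placement scheme are illustrative assumptions.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICATION_FACTOR = 3  # n: each object is stored on at least n nodes

def replica_nodes(key: str, n: int = REPLICATION_FACTOR) -> list[str]:
    """Pick a primary shard by hash, then place n-1 extra copies on the
    following nodes, so the object survives up to n-1 node failures."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(n)]

print(replica_nodes("user:42"))  # e.g. ['node-c', 'node-d', 'node-e']
```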
Brewer's CAP Theorem
Consistency: having a single up-to-date copy of the data
Availability of the data (for updates): always an answer
Partition tolerance: system continues to operate despite arbitrary message loss or failure of part of the system
The theorem states that a distributed system can fully guarantee only two of these three properties at the same time.
RDBMS
RDBMS stands for Relational Database Management System
An RDBMS organizes data into one or more tables with rows and columns. Data is stored in the form of tables, where each table is made up of rows (also called records) and columns (also called fields). RDBMSs use a declarative query language, such as SQL, to access and manipulate the data stored in the tables. They are widely used and are the most common type of database in web and mobile applications.
Schema-on-Read vs Schema on Write
Schema-on-Write (RDBMS)
+Read is fast
Schema-on-Write is an approach where the structure of the data is defined and enforced at the time the data is written to the database. In this approach, the data must conform to the predefined schema before it is written to the database. This approach helps to ensure data consistency and integrity, but it can be inflexible and difficult to change the schema once it is defined.
Schema-on-Read (NoSQL)
+Load is fast
Schema-on-Read is an approach where the structure of the data is defined and enforced at the time the data is read from the database. In this approach, the data is stored in a format that is flexible and easy to change, but the structure of the data is only defined and enforced when it is read. This approach allows for more flexibility in changing the schema and dealing with different data formats, but it can be less efficient and can increase the complexity of the data processing.
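A small contrast of the two approaches, sketched with sqlite3 for schema-on-write and raw JSON blobs for schema-on-read; the table and field names are illustrative.

```python
import sqlite3, json

# Schema-on-write: the structure is fixed before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Alice", 30))
# A record that does not fit the schema is rejected at write time.

# Schema-on-read: arbitrary records are stored as-is (fast load) ...
raw_store = ['{"name": "Alice", "age": 30}', '{"name": "Bob", "city": "Berlin"}']
# ... and the structure is only interpreted when the data is read.
for blob in raw_store:
    record = json.loads(blob)
    print(record.get("name"), record.get("age", "unknown"))
```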
Schema-on-read: NoSQL data models have two different approaches, aggregate-oriented and graph-oriented. Name the difference:
Graph-oriented NoSQL databases
use a graph data model to store and retrieve data
well-suited for applications that need to represent complex relationships and analyze data in a graph structure
social networks, recommendation systems and fraud detection
Aggregate-oriented NoSQL databases
use an aggregate data model to store and retrieve data
well-suited for applications that need to handle large amounts of data and perform complex queries
real-time analytics, e-commerce and IoT.
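Two toy data sets sketched in plain Python to show the difference: an aggregate stores a self-contained unit of data that is read and written as a whole, while a graph model stores nodes and explicit edges so relationship queries become traversals. The field names are illustrative.

```python
# Aggregate-oriented: the whole unit of data is stored and fetched together,
# as in a document or key-value store.
order_aggregate = {
    "user": "alice",
    "orders": [
        {"id": 1, "items": ["book", "pen"], "total": 12.5},
        {"id": 2, "items": ["lamp"], "total": 30.0},
    ],
}

# Graph-oriented: entities are nodes, relationships are explicit edges,
# which makes traversals like "friends of friends" natural.
nodes = {"alice", "bob", "carol"}
edges = {("alice", "bob"), ("bob", "carol")}

def friends_of_friends(person: str) -> set[str]:
    friends = {b for a, b in edges if a == person} | {a for a, b in edges if b == person}
    result = set()
    for f in friends:
        result |= {b for a, b in edges if a == f} | {a for a, b in edges if b == f}
    return result - friends - {person}

print(friends_of_friends("alice"))  # {'carol'}
```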
Common characteristics of Schema-on-read: NoSQL data models
Non relational
Cluster friendly
Schema-less
(mostly) Open Source
What is the typical strategy for Stream Processing?
Process only a “window” of the data (sliding window) and process data over this window
Persist (i.e. store) only aggregated/processed information over these windows
Discard raw data after it’s been processed
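A minimal sliding-window sketch in Python; the window size and the moving-average aggregate are assumptions, not part of any specific streaming framework.

```python
from collections import deque

WINDOW_SIZE = 5  # keep only the last 5 events in memory

window = deque(maxlen=WINDOW_SIZE)  # old events fall out automatically
aggregates = []                     # only processed results are persisted

def on_event(value: float) -> None:
    """Process one incoming event: update the window, store the aggregate,
    and let raw data drop out once it has left the window."""
    window.append(value)
    aggregates.append(sum(window) / len(window))  # e.g. a moving average

for event in [3, 7, 4, 9, 2, 8, 5]:  # simulated stream
    on_event(event)

print(aggregates[-1])  # average over the most recent window only
```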
Is it possible to distribute the data in a streaming process?
Yes, horizontally as well as vertically
Lambda: Data Processing Architectures
Goal?
What does it combine?
Name the 3 layers it consists of:
Goal: balance latency, throughput, and fault-tolerance
Combines batch and stream processing
The architecture is composed of three layers:
Batch Layer: stores and processes large amounts of historical data using batch processing techniques
Speed Layer: stores and processes recent data in real-time using stream processing techniques
Serving Layer: stores and serves the results of both the batch and speed layers to provide low-latency and accurate results.
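A toy sketch of the three layers; the event counting and layer interfaces are illustrative assumptions, not the API of any specific Lambda-architecture framework.

```python
from collections import Counter

# Batch layer: recomputes views over all historical data (high latency, accurate)
historical_events = ["click", "buy", "click", "click"]
batch_view = Counter(historical_events)

# Speed layer: maintains an incremental view over recent events (low latency)
realtime_view = Counter()
def on_stream_event(event: str) -> None:
    realtime_view[event] += 1

for event in ["click", "buy"]:  # events that arrived after the last batch run
    on_stream_event(event)

# Serving layer: merges both views to answer queries with fresh, complete results
def query(event: str) -> int:
    return batch_view[event] + realtime_view[event]

print(query("click"))  # 4 = 3 from the batch view + 1 from the speed layer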
Hadoop Key Characteristics
1. Scalable:
–Ability to scale out horizontally rather than scale up vertically
–Near-linear speedup (i.e., graceful decline on load increase)
2. Fault-tolerant:
– In large systems, failure is common
– Replication, retry, recovery
3. Batch-based:
– Batch processing, no real-time or truly interactive use (!)
– But: higher-level technologies built on top (e.g., YARN)
What is a batch?
A collection of data
A set of operations processed together as a single unit
Large amounts of data processed in a single run, not in real-time
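As a sketch of the batch model Hadoop applies at cluster scale, here is a map/reduce-style word count in plain Python; it runs locally on a small list and is only an analogy for what MapReduce does across many nodes.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data needs processing"]  # the whole batch

# Map: emit (word, 1) pairs for every document in the batch
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate pairs by key
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: sum the counts per word
counts = {word: sum(c for _, c in pairs) for word, pairs in grouped}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}
```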
Limitations of HBase
§ Not an SQL database!
§ Not relational
§ No JOINs
§ No indexes (except rowkey ordering)
§ No column typing
§ No sophisticated query engine
§ No transactions
§ Rowkey-design determines query efficiency!
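A plain-Python sketch (not the HBase API) of why rowkey design determines query efficiency: rows are kept sorted by rowkey, so queries that match a rowkey prefix become cheap range scans, while anything else needs a full scan. The rowkey layout "<userid>#<timestamp>" is an assumed design.

```python
from bisect import bisect_left

# Rows kept sorted by rowkey, as in HBase; rowkey = "<userid>#<timestamp>" (assumed design)
rows = sorted([
    ("user1#2023-01-01", "login"),
    ("user1#2023-01-02", "purchase"),
    ("user2#2023-01-01", "login"),
])
keys = [k for k, _ in rows]

def prefix_scan(prefix: str):
    """Efficient: jump to the first matching rowkey and read until the prefix ends."""
    i = bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        yield rows[i]
        i += 1

# All events of user1 in one cheap range scan, because the rowkey starts with the user id
print(list(prefix_scan("user1#")))

# A query on the value ("all purchases") cannot use the rowkey order: full scan
print([r for r in rows if r[1] == "purchase"])
```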
What is HBase good for and bad for?
+ Large amounts of data (100s of millions or billions of rows)!
+ Sparse data
+ Large amount of clients/requests
– Relational analytics (group by, join, where column like,..)
– Text-based search access