7 Vs of Big Data
Value - data is only of use if it delivers value (actionable insights)
Volume - the size of the data
Variability - data whose meaning is constantly changing
Visualisation - data presented in a manner that is readable and accessible
Veracity - the trustworthiness of the data in terms of accuracy
Variety - the different types of data
Velocity - the speed at which the data is generated
Poly-structured data? What is it?
Data organized in multiple different formats or structures
Can include structured, semi-structured and unstructured data
Becoming increasingly common
Difficult to manage and make sense of
Data integration techniques can be used to combine and store data from multiple sources in a single, unified view.
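A minimal sketch of such data integration in Python, assuming toy field names and sources: a structured CSV source and a semi-structured JSON source are merged into a single, unified view keyed by id.

```python
import csv, json, io

# Structured source: CSV with a fixed schema (hypothetical fields)
csv_data = io.StringIO("id,name\n1,Alice\n2,Bob\n")
structured = {row["id"]: row for row in csv.DictReader(csv_data)}

# Semi-structured source: JSON where fields may vary per record
json_data = '[{"id": "1", "age": 30}, {"id": "2", "city": "Berlin"}]'
semi_structured = {str(rec["id"]): rec for rec in json.loads(json_data)}

# Integrate both sources into a single, unified view keyed by id
unified = {}
for key in structured.keys() | semi_structured.keys():
    record = {}
    record.update(structured.get(key, {}))
    record.update(semi_structured.get(key, {}))
    unified[key] = record

print(unified["1"])  # {'id': '1', 'name': 'Alice', 'age': 30}
```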
Definition Big Data
Big Data refers to extremely large and complex data sets that are difficult to process using traditional data processing techniques. These data sets are often characterized by their volume, velocity, variety, and value.
Big data is often generated by sources such as social media, internet of things (IoT) devices, and sensor networks, and it can be used to gain insights and make better decisions in fields such as finance, healthcare, and retail.
Typical Use Cases for Big Data Technologies
Index building
User analysis
Pattern recognition (CV, spam detection at Yahoo!, Google, Facebook; face detection in images)
NLP, text mining
Graph analysis (e.g. social network graphs at Facebook)
Collaborative filtering
Sentiment analysis
Risk assessment
Prediction models
What is Sentiment Analysis?
Sentiment Analysis (also known as Opinion Mining) is the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials, such as text, speech, images or videos. The goal of sentiment analysis is to determine the attitude, opinions and emotions of a speaker or writer with respect to some topic or the overall contextual polarity of a document.
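A minimal lexicon-based sketch in Python: the word lists below are toy assumptions, not a real sentiment lexicon, but they show how the polarity of a text can be scored.

```python
# Toy sentiment lexicon; real systems use large lexicons or trained models
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def sentiment(text: str) -> str:
    """Classify the overall polarity of a text by counting opinion words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this phone, the camera is excellent"))  # positive
print(sentiment("Terrible battery, I hate it"))                 # negative
```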
What is scalability?
Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged in order to accommodate that growth.
Scalable w.r.t. data amount or problem complexity. What is the difference?
with respect to
Scalable with respect to data amount means that the solution can handle growing data volumes, with resource needs and runtime growing at most proportionally (ideally linearly) with the data size.
Scalable with respect to problem complexity means that the solution can also cope when the problem itself becomes more complex (e.g. more expensive computations per item), not just when there is more data.
The main difference: scalability w.r.t. data amount is about handling an increase in data size, while scalability w.r.t. problem complexity is about handling an increase in the complexity of the problem to be solved.
Horizontal Scaling vs Vertical Scaling
Horizontal Scaling:
Involves adding more machines to a system
Increases capacity by distributing workload among multiple machines
Can handle an increase in data volume and traffic
Vertical Scaling:
Involves adding more resources to a single machine
Increases capacity by upgrading CPU, RAM, or storage on that single machine
Can handle an increase in processing power and memory demand
Advantages and Drawbacks of Horizontal Scaling
+Runs on cheap commodity hardware; capacity grows by adding nodes
+Failure of a single node is easier to tolerate
–Adds distributed-system complexity (data distribution, consistency, network overhead)
Advantages and Drawbacks of Vertical Scaling
+Simple: no data distribution or cluster coordination needed
–Limited by what a single machine can offer; high-end hardware is expensive
–The single machine remains a single point of failure
Scaling Storage & Updates of Data: Sharding
Put different data on separate nodes, each of which does its own reads and writes
Can be configured in various ways, e.g., keeping the data closest to its users
Ideally, each user reading different data talks to a different server
Cons:
Load-balancing may be difficult: if done naively, one server can end up handling the majority of the load
Resilience: if a server goes down, its shard of the data becomes unavailable (lost unless it is also replicated)
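A minimal sketch of hash-based shard routing, assuming three hypothetical nodes; a naive modulo scheme like this illustrates the idea but can also cause the load-balancing problem mentioned above when keys are skewed.

```python
import hashlib

# Hypothetical shard nodes; each one stores and serves its own subset of keys
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    """Route a key to a shard using a stable hash (naive modulo scheme)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Each user's reads and writes go to the node that owns their key
for user in ["alice", "bob", "carol"]:
    print(user, "->", shard_for(user))
```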
Scaling Storage & Updates of Data: Master-Slave Replication
Data is replicated from master to slaves. The master services all writes; reads may come from either master or slaves.
Pros:
+Resilience
+Read performance
+Very useful for OLAP
–Updates are costly
–Consistency difficult to maintain
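A minimal routing sketch, with hypothetical host names, showing that all writes go to the master while reads can be spread across the replicas.

```python
import random

MASTER = "db-master"
SLAVES = ["db-slave-1", "db-slave-2"]

def route(operation: str) -> str:
    """Send writes to the master; spread reads across master and slaves."""
    if operation == "write":
        return MASTER
    return random.choice([MASTER] + SLAVES)

print(route("write"))  # always db-master
print(route("read"))   # any of the replicas
```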
Master-Slave + Sharding approach in MPP DBs
MPP = Massively Parallel Processing
–Large-scale parallel systems
–Present themselves as relational databases
–Usually columnar
–Usually run on expensive, specialized hardware
Master node and many segment nodes
No overlaps: each node holds its own distinct portion of the data
“Shared nothing” –each node has independent CPUs, RAM, storage
Master builds query execution plan and assigns parts to segment nodes
Provide horizontal scaling for DWHs, but
ETL is challenging and requires heavy lifting at ingress
Will not work for unstructured data
Scaling Storage & Updates of Data: Peer-to-Peer Replication
Peer-to-peer replication has all nodes applying reads and writes to all the data.
Consistency even more difficult
Master is no longer the bottleneck
Mixing Sharding and Peer-to-Peer Replication
Replication factor n: each object is replicated on at least n nodes
+Not as costly as replicating to all nodes
+Still keeps resilience
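A sketch of combining both ideas: hash the key to a primary shard, then place n-1 additional copies on the following nodes. The node list and placement scheme are illustrative assumptions.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICATION_FACTOR = 3  # n: each object is stored on at least n nodes

def replica_nodes(key: str, n: int = REPLICATION_FACTOR) -> list[str]:
    """Pick a primary shard by hash, then place n-1 extra copies on the
    following nodes, so the object survives up to n-1 node failures."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(n)]

print(replica_nodes("user:42"))  # e.g. ['node-c', 'node-d', 'node-e']
```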
Brewer's CAP Theorem
Consistency: having a single up-to-date copy of the data
Availability of the data (for updates): always an answer
Partition tolerance: system continues to operate despite arbitrary message loss or failure of part of the system
The theorem states that a distributed system can fully guarantee only two of these three properties at the same time.
RDBMS
RDBMS stands for Relational Database Management System
An RDBMS organizes data into one or more tables with rows and columns. Data is stored in the form of tables, where each table is made up of rows (also called records) and columns (also called fields). RDBMSs use a declarative query language, such as SQL, to access and manipulate the data stored in the tables. They are widely used and are the most common type of database in web and mobile applications.
Schema-on-Read vs Schema on Write
Schema-on-Write (RDBMS)
+Read is fast
Schema-on-Write is an approach where the structure of the data is defined and enforced at the time the data is written to the database. In this approach, the data must conform to the predefined schema before it is written to the database. This approach helps to ensure data consistency and integrity, but it can be inflexible and difficult to change the schema once it is defined.
Schema-on-Read (NoSQL)
+Load is fast
Schema-on-Read is an approach where the structure of the data is defined and enforced at the time the data is read from the database. In this approach, the data is stored in a format that is flexible and easy to change, but the structure of the data is only defined and enforced when it is read. This approach allows for more flexibility in changing the schema and dealing with different data formats, but it can be less efficient and can increase the complexity of the data processing.
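A small contrast of the two approaches, sketched with sqlite3 for schema-on-write and raw JSON blobs for schema-on-read; the table and field names are illustrative.

```python
import sqlite3, json

# Schema-on-write: the structure is fixed before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Alice", 30))
# A record that does not fit the schema is rejected at write time.

# Schema-on-read: arbitrary records are stored as-is (fast load) ...
raw_store = ['{"name": "Alice", "age": 30}', '{"name": "Bob", "city": "Berlin"}']
# ... and the structure is only interpreted when the data is read.
for blob in raw_store:
    record = json.loads(blob)
    print(record.get("name"), record.get("age", "unknown"))
```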
Schema-on-read: NoSQL data models have two different approaches, aggregate-oriented and graph-oriented. Name the difference:
Graph-oriented NoSQL databases
use a graph data model to store and retrieve data
well-suited for applications that need to represent complex relationships and analyze data in a graph structure
social networks, recommendation systems and fraud detection
Aggregate-oriented NoSQL databases
use an aggregate data model to store and retrieve data
well-suited for applications that need to handle large amounts of data and perform complex queries
real-time analytics, e-commerce and IoT.
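Two toy data sets sketched in plain Python to show the difference: an aggregate stores a self-contained unit of data that is read and written as a whole, while a graph model stores nodes and explicit edges so relationship queries become traversals. The field names are illustrative.

```python
# Aggregate-oriented: the whole unit of data is stored and fetched together,
# as in a document or key-value store.
order_aggregate = {
    "user": "alice",
    "orders": [
        {"id": 1, "items": ["book", "pen"], "total": 12.5},
        {"id": 2, "items": ["lamp"], "total": 30.0},
    ],
}

# Graph-oriented: entities are nodes, relationships are explicit edges,
# which makes traversals like "friends of friends" natural.
nodes = {"alice", "bob", "carol"}
edges = {("alice", "bob"), ("bob", "carol")}

def friends_of_friends(person: str) -> set[str]:
    friends = {b for a, b in edges if a == person} | {a for a, b in edges if b == person}
    result = set()
    for f in friends:
        result |= {b for a, b in edges if a == f} | {a for a, b in edges if b == f}
    return result - friends - {person}

print(friends_of_friends("alice"))  # {'carol'}
```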
Common characteristics of Schema-on-read: NoSQL data models
Non relational
Cluster friendly
Schema-less
(mostly) Open Source
What is the typical strategy for Stream Processing?
Process only a “window” of the data (sliding window) and process data over this window
Persist (i.e. store) only aggregated/processed information over these windows
Discard raw data after it’s been processed
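A minimal sliding-window sketch in Python; the window size and the moving-average aggregate are assumptions, not part of any specific streaming framework.

```python
from collections import deque

WINDOW_SIZE = 5  # keep only the last 5 events in memory

window = deque(maxlen=WINDOW_SIZE)  # old events fall out automatically
aggregates = []                     # only processed results are persisted

def on_event(value: float) -> None:
    """Process one incoming event: update the window, store the aggregate,
    and let raw data drop out once it has left the window."""
    window.append(value)
    aggregates.append(sum(window) / len(window))  # e.g. a moving average

for event in [3, 7, 4, 9, 2, 8, 5]:  # simulated stream
    on_event(event)

print(aggregates[-1])  # average over the most recent window only
```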
Is it possible to distribute the data in a streaming process?
Yes, horizontally as well as vertically
Lambda: Data Processing Architectures
Goal?
What does it combine?
Name the 3 layers it consists of:
Goal: balance latency, throughput, and fault-tolerance
Combines batch and stream processing
The architecture is composed of three layers:
Batch Layer: stores and processes large amounts of historical data using batch processing techniques
Speed Layer: stores and processes recent data in real-time using stream processing techniques
Serving Layer: stores and serves the results of both the batch and speed layers to provide low-latency and accurate results.
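A toy sketch of the three layers; the event counting and layer interfaces are illustrative assumptions, not the API of any specific Lambda-architecture framework.

```python
from collections import Counter

# Batch layer: recomputes views over all historical data (high latency, accurate)
historical_events = ["click", "buy", "click", "click"]
batch_view = Counter(historical_events)

# Speed layer: maintains an incremental view over recent events (low latency)
realtime_view = Counter()
def on_stream_event(event: str) -> None:
    realtime_view[event] += 1

for event in ["click", "buy"]:  # events that arrived after the last batch run
    on_stream_event(event)

# Serving layer: merges both views to answer queries with fresh, complete results
def query(event: str) -> int:
    return batch_view[event] + realtime_view[event]

print(query("click"))  # 4 = 3 from the batch view + 1 from the speed layer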
Hadoop Key Characteristics
1. Scalable:
–Ability to scale out horizontally rather than scale up vertically
–Near-linear speedup (i.e., graceful decline on load increase)
2. Fault-tolerant:
– In large systems, failure is common
– Replication, retry, recovery
3. Batch-based:
– Batch processing, no real-time or truly interactive use (!)
– But: higher-level technologies built on top (e.g., YARN)
What is a batch?
A collection of data
A set of operations processed together as a single unit
Large amounts of data processed in a single run, not in real-time
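As a sketch of the batch model Hadoop applies at cluster scale, here is a map/reduce-style word count in plain Python; it runs locally on a small list and is only an analogy for what MapReduce does across many nodes.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data needs processing"]  # the whole batch

# Map: emit (word, 1) pairs for every document in the batch
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate pairs by key
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: sum the counts per word
counts = {word: sum(c for _, c in pairs) for word, pairs in grouped}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}
```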
Limitations of HBase
§ Not an SQL database!
§ Not relational
§ No JOINs
§ No indexes (except rowkey ordering)
§ No column typing
§ No sophisticated query engine
§ No transactions
§ Rowkey-design determines query efficiency!
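A plain-Python sketch (not the HBase API) of why rowkey design determines query efficiency: rows are kept sorted by rowkey, so queries that match a rowkey prefix become cheap range scans, while anything else needs a full scan. The rowkey layout "<userid>#<timestamp>" is an assumed design.

```python
from bisect import bisect_left

# Rows kept sorted by rowkey, as in HBase; rowkey = "<userid>#<timestamp>" (assumed design)
rows = sorted([
    ("user1#2023-01-01", "login"),
    ("user1#2023-01-02", "purchase"),
    ("user2#2023-01-01", "login"),
])
keys = [k for k, _ in rows]

def prefix_scan(prefix: str):
    """Efficient: jump to the first matching rowkey and read until the prefix ends."""
    i = bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        yield rows[i]
        i += 1

# All events of user1 in one cheap range scan, because the rowkey starts with the user id
print(list(prefix_scan("user1#")))

# A query on the value ("all purchases") cannot use the rowkey order: full scan
print([r for r in rows if r[1] == "purchase"])
```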
What is HBase good for and bad for?
+ Large amounts of data (100s of millions or billions of rows)!
+ Sparse data
+ Large amount of clients/requests
– Relational analytics (group by, join, where column like,..)
– Text-based search access