What is cloud computing?
“Simply put, cloud computing is the delivery of computing services – including servers, storage, databases, networking, software, analytics, and intelligence – over the Internet (‘the cloud’) to offer faster innovation, flexible resources, and economies of scale. You typically pay only for cloud services you use, helping lower your operating costs, run your infrastructure more efficiently and scale as your business needs change.”
Another way to think about cloud computing is to see it as a next step in a series of abstractions. Using cloud services can then be compared to using a programming library. Do you build your own IT hardware?
When you interact with your hardware, do you write your own operating system?
When you work with data, do you write your own analytics mechanisms/frameworks from scratch?
Following this path of abstraction, cloud computing just becomes one more way of “we are using a solution provided by somebody else”.
Common properties of cloud computing services
“On-demand”: you turn services on and off as you wish/need, usually without any long-term commitment
“Elasticity” / “Scalability”: you can dynamically add additional resources (or remove them), as your application requires
“Self-service”: you don’t ask somebody else to configure something for you, nor negotiate about a service: you use ready-made, standardized offerings
“Multi-tenancy”: you share the underlying software and hardware stack with other customers of the same cloud service provider
“Abstraction”: you get a specific outcome “as a service”, but often don’t see how the underlying solution is designed or implemented. This usually reduces your maintenance effort.
“Pay-as-you-go” / “Pay-per-use”: your costs depend on what you use and may go up or down significantly each month
What do you trade?
You trade CAPEX (Capital expenditure) e.g., one-time cost for buying server hardware for OPEX (Operational expenditure) e.g., monthly cost for using a service
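This trade can be made concrete with a back-of-the-envelope comparison. All numbers below (hardware price, hourly rate, usage pattern) are made up for illustration:

```python
# Hypothetical numbers: compare buying a server (CAPEX) with renting
# an equivalent cloud instance (OPEX) over the same lifetime.
def total_cost_on_prem(hardware_price, monthly_ops, months):
    """One-time purchase plus ongoing operating cost (power, cooling, admin)."""
    return hardware_price + monthly_ops * months

def total_cost_cloud(hourly_rate, hours_used_per_month, months):
    """Pure pay-per-use: you only pay for the hours the instance runs."""
    return hourly_rate * hours_used_per_month * months

# A server used only 8 hours per workday (~176 h/month), over 3 years:
on_prem = total_cost_on_prem(hardware_price=6000, monthly_ops=150, months=36)
cloud = total_cost_cloud(hourly_rate=0.40, hours_used_per_month=176, months=36)
print(f"on-prem: {on_prem} EUR, cloud: {cloud:.2f} EUR")
```

For a machine that runs 24/7 at full load, the comparison can easily tip the other way, which is why usage patterns matter.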
What are economies of scale?
Example 1: A cloud service provider that buys hard disks in batches of thousands gets a lower price per disk than you
Example 2: Running a cooling system for a large data center is usually cheaper than running 10 cooling systems for 10 smaller data centers with the same size combined
Major public cloud service providers today
What about private cloud approaches?
They follow the same principles of “on-demand”, “elasticity”, “multitenancy”, etc. But a certain management effort will stay with your company.
What are traditional virtualization and container orchestration solutions?
Running virtual machines with VMware or Hyper-V
Kubernetes/OpenShift clusters
How did public cloud computing offerings emerge?
Between 2000 and 2010, several tech companies faced significant growth in user demand: Amazon, Yahoo, Google, Facebook, eBay, …
It’s hard to find reliable numbers, but some reports suggest that during peak periods Facebook added 1200 servers per month to keep up with growing demand.
You need repeatable, reusable and automated processes to
launch new servers
configure them in a uniform way
manage them (installing updates, evacuating racks for hardware maintenance, etc.)
assign data storage capacity to servers
…
Scenario: ordering a server in a traditional large corporation
1. You realize you need a server for a project. You guess your CPU and RAM requirements.
2. You fill out a form to order the server.
3. Your supervisor needs to approve the order.
4. If costs are above a certain threshold, a finance responsible needs to approve the order as well.
5. Your order is routed to the central IT department or IT service provider.
6. You need to wait until your order is processed.
7. The person who handles your order may still have a question and try to contact you.
8. You get access to your server.
9. After a few days you realize that you would have needed more RAM. It would also be better if you had a separate database server, instead of running the database on the same system.
10. Your journey goes back to the start: you fill out the order form again.
Cloud computing bypasses this process
1. You realize you need a server for a project. You guess your CPU and RAM requirements.
2. If you don’t already have one, you create an account at a cloud service provider. This requires a credit card and a few minutes of time.
3. You click through the web UI to configure a server. Or you run on the command line: aws ec2 run-instances [your CPU and RAM requirements]
4. You get access to your server within 1-2 minutes.
5. After a few days you realize that you would have needed more RAM. It would also be better if you had a separate database server, instead of running the database on the same system.
6. You run:
aws ec2 modify-instance-attribute [details for more RAM]
aws rds create-db-instance [...]
7. You’re done.
What are core differences between these two scenarios?
There are no dependencies on other people or other departments. This reduces the turnaround time significantly. It ties back to the cloud computing properties “on-demand”, “elasticity” and “self-service”.
You may end up with high costs quickly
Things you configure may not comply with your company’s expectations or internal policies:
guidelines on which operating systems to use and how to manage them
security policies
Reduced central control and central coordination
Cloud computing does not prevent you from messing up. You can just mess up way faster.
Dedicated services for many different IT needs
running virtual machines
running databases
log monitoring and analytics
video streaming
text translation
IoT device management
machine learning
…
How do you interact with these services?
APIs
Multiple ways to send the very same API calls
Via web interface (“AWS Console”)
Via command line interface (“AWS CLI”)
Via software development kits (SDKs) for various programming languages: Python, Java, .NET, Go, PHP, Rust, …
What are the AWS service properties?
Services have different pricing models:
per time used
per GB used
per size configured
per number of API calls sent, …
Storage in the cloud
In “classic IT” you often see servers that have disks attached. You store data on the disks by talking to those storage servers. (Example: network shares)
What do you need to consider about storage in the cloud?
You would again need to take care of:
Scaling with demand
High availability
Capacity planning
OS configuration and maintenance
What does BLOB stand for?
Binary Large Object
Universal and most extensively used form of cloud storage
Used to store arbitrary data like images, videos, text, CSVs, …
Size of a BLOB can range from a few bytes to multiple TBs
I want to …
backup my data as a ZIP file.
store the pictures of my web site.
store data for data analytics or machine learning.
store intermediate results of my lengthy computations.
keep application logs in a central place.
What are the characteristics of BLOB storage?
Objects can be written (and overwritten and deleted), but not partially updated
You interact with the storage via API calls
Objects are automatically replicated across multiple data center buildings in the same region
You get high durability and availability without own management effort.
Virtually unlimited storage
You just store additional objects when you need to (and can delete them again any time). You may store one object or millions of objects.
Performance does not change with the amount of storage you consume
The service answers with the same throughput and latency for an object: There is no single disk or single server that is assigned to you. You are interacting with a distributed system.
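The put/get/delete object model, including the “no partial updates” constraint, can be sketched with a toy in-memory store. The class and method names below are invented for illustration and do not correspond to any real SDK:

```python
# A toy in-memory sketch of the BLOB storage model: whole objects are
# written, overwritten, and deleted, but never partially updated.
class ToyBlobStore:
    def __init__(self):
        self._objects = {}          # key -> bytes

    def put_object(self, key, data: bytes):
        self._objects[key] = data   # a second put with the same key overwrites

    def get_object(self, key) -> bytes:
        return self._objects[key]

    def delete_object(self, key):
        del self._objects[key]

store = ToyBlobStore()
store.put_object("logs/2024-01-01.txt", b"line 1\n")
# There is no append: to extend an object, you rewrite it as a whole.
old = store.get_object("logs/2024-01-01.txt")
store.put_object("logs/2024-01-01.txt", old + b"line 2\n")
print(store.get_object("logs/2024-01-01.txt"))  # b'line 1\nline 2\n'
```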
Often found properties of BLOB storage services
Logging all data access
Automating backups
Sharing data with other users of the same cloud service provider
Auto-deleting old objects
Replicating data across regions
What is AWS S3?
Simple Storage Service
S3 stores objects in buckets
Buckets = logical containers/grouping mechanisms for objects. Some settings you make on bucket level apply to all objects in a bucket.
Objects = your actual data (think: “files”)
Can bucket names occur more than once?
No! Bucket names are globally unique across all AWS accounts.
S3 and costs
Three main components:
1. Amount of data stored (GBs per month)
compress data, if you can
delete old data that you don’t need any more (use lifecycle rules to automate this)
make use of S3 storage classes (trade lower storage cost for higher per-request cost)
2. Number of requests sent to the service (per 1000 requests)
cache data at the clients that use the objects
batch data before you upload it (e.g., do you need to upload new log messages every second or is batching for 10 seconds fine as well?)
3. Data transfer when data is sent out of the AWS region (per GB out)
don’t regularly transfer data if you don’t need to: process data close to where it is stored
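A rough cost model combining the three components might look like this. The unit prices are placeholders; real prices depend on region and storage class, so check the current AWS pricing pages:

```python
# Back-of-the-envelope S3 cost estimate with placeholder unit prices.
def estimate_monthly_cost(gb_stored, requests, gb_transferred_out,
                          price_per_gb_month=0.023,
                          price_per_1000_requests=0.0004,
                          price_per_gb_out=0.09):
    storage = gb_stored * gb_month_price(price_per_gb_month)       # component 1
    request = requests / 1000 * price_per_1000_requests            # component 2
    transfer = gb_transferred_out * price_per_gb_out               # component 3
    return round(storage + request + transfer, 2)

def gb_month_price(p):
    """Hook for storage-class discounts (lower storage cost, higher request cost)."""
    return p

# 500 GB stored, 2 million requests, 50 GB sent out of the region:
print(estimate_monthly_cost(gb_stored=500, requests=2_000_000,
                            gb_transferred_out=50))
```

Playing with the parameters shows why the optimizations above (compression, lifecycle rules, caching, batching) target exactly these three knobs.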
Data locality
If data is stored at a cloud service provider, it’s usually also efficient to use data analytics services there (and vice versa).
It’s an anti-pattern to repeatedly move the same GBs of data over wide-area networks, due to latency & data transfer cost.
1. The query engine reads all the required raw data over the Internet/VPN
2. The query result is computed and given to the user
3. All the previously transferred raw data is discarded again
How do object storage and data analytics now fit together?
Distributed data processing solutions emerged because a single system (often in form of a database) could not efficiently handle the size of certain data sets any more.
What does the Hadoop & HDFS architecture look like?
Each node (=server of a cluster) stores a part of a data set and runs requested computations on it.
How many times does a certain word appear in a text-based dataset (of many GBs in size)?
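The word-count question can be sketched as a map/reduce computation: each “node” counts the words in its local chunk (map), then the partial counts are merged (reduce). This is only a single-process illustration of the idea, not a Hadoop job:

```python
# Minimal word-count sketch of the MapReduce idea.
from collections import Counter

def map_phase(local_chunk: str) -> Counter:
    """Each node counts words only in the data it stores locally."""
    return Counter(local_chunk.lower().split())

def reduce_phase(partial_counts) -> Counter:
    """Partial counts from all nodes are merged into the final result."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

chunks = ["the quick brown fox", "the lazy dog and the fox"]  # one chunk per node
total = reduce_phase(map_phase(c) for c in chunks)
print(total["the"], total["fox"])  # 3 2
```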
Hadoop & HDFS architecture has advantages:
Queries can be run in a distributed way, working in parallel
Data is stored close to where the compute happens: nodes run jobs directly on the data they have stored locally
Hadoop & HDFS architecture has disadvantages:
Compute and storage are closely coupled
What if you only need to run computations for 1 hour per day?
Will you keep the servers running idle for the rest of the time? If you want to add new data, the servers need to be running.
What if you need much more storage space? Do you add more servers?
Do you put more disks into existing servers?
Cloud service providers offer managed Hadoop-style clusters (on AWS: EMR), but they want you to use their BLOB storage solution instead of HDFS:
Decoupling of compute and storage
Clusters only need to exist when you are running queries and can then be shut down or deleted again → cost savings
You can easily experiment with different cluster configurations (number of nodes, CPU, RAM, …) → fail fast to achieve what you need
You can add/delete data any time without being dependent on the current existence of servers
What is Hive?
Data analytics framework for distributed processing of large data sets
Provides a SQL-like query interface for data that does not need to be stored as a database: CSV, JSON, Parquet, …
Usage: you write SQL queries and Hive translates them into Hadoop MapReduce jobs and runs them
SerDe = Serializer Deserializer
Is an implementation of how to read, interpret and write a specific data format that you want to analyze.
There are built-in SerDes for CSV, JSON, Parquet, …
You can also write your own SerDes for custom file formats.
Hive table definitions are “schema-on-read”.
The data structure is only applied when reading the data, e.g., when you run SELECT statements.
A consequence is that Hive will be perfectly happy with faulty table definitions and only fail later when you run queries.
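Schema-on-read, and the late failure it implies, can be imitated with plain CSV data in Python. The column names and schemas below are made up for illustration:

```python
# Schema-on-read in miniature: the stored data is just text; a (possibly
# wrong) schema is only applied at read time.
import csv, io

raw = "id,amount\n1,9.99\n2,19.50\n"  # what is stored (no schema attached)

def read_with_schema(raw_csv, schema):
    """schema: list of (column_name, type); applied only now, when reading."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({name: cast(rec[name]) for name, cast in schema})
    return rows

good = read_with_schema(raw, [("id", int), ("amount", float)])
print(good[0])  # {'id': 1, 'amount': 9.99}

# A faulty schema is accepted silently -- it only fails once you read:
try:
    read_with_schema(raw, [("id", int), ("amount", int)])  # '9.99' is not an int
except ValueError as e:
    print("fails only at read time:", e)
```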
Run queries against your virtual table:
Why is storing data on S3 better than on HDFS?
Storing data on S3 instead of a local HDFS file system is nice, because it decouples compute and storage. But data now first needs to be sent over the network before it can be analyzed:
1. Transfer all records of the data set over the network
2. Filter for those records where payment_type='credit_card'
3. Discard all the other records we just transferred over the network → inefficient
What is the idea behind pushdowns?
The core idea is that parts of a SQL query are already evaluated close to where the data lives (if possible). The storage layer should only send those parts of raw data to the compute layer that will be needed there.
Evaluation is “pushed down” to a lower layer in the stack.
The pushdown concept is not specific to a product: you will find it in many big data frameworks, data warehouse solutions, etc.
It requires the storage layer to understand the data format you use and provide filtering functionality. → Need to check individually for the setup you use.
What does an example of predicate pushdown look like?
There are more types of pushdowns, for example:
Aggregate function pushdown
MIN(), MAX(), SUM(), COUNT(), …
LIMIT operator pushdown
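A toy simulation of predicate pushdown, counting how many rows cross the “network” with and without filtering at the storage layer. The record layout borrows the payment_type example; all data is invented:

```python
# Toy illustration of predicate pushdown: the "storage layer" filters rows
# before sending them, so the compute layer only receives what it needs.
records = [
    {"id": 1, "payment_type": "credit_card", "amount": 30},
    {"id": 2, "payment_type": "cash",        "amount": 12},
    {"id": 3, "payment_type": "credit_card", "amount": 55},
]

def storage_scan(predicate=None):
    """Without a pushed-down predicate everything is sent; with one,
    filtering already happens at the storage layer."""
    sent = [r for r in records if predicate is None or predicate(r)]
    return sent, len(sent)  # rows plus how many crossed the network

# No pushdown: all 3 rows travel, 1 is discarded at the compute layer.
rows, transferred = storage_scan()
result = [r for r in rows if r["payment_type"] == "credit_card"]

# With pushdown: only the 2 matching rows travel.
rows_pd, transferred_pd = storage_scan(lambda r: r["payment_type"] == "credit_card")
print(transferred, transferred_pd)  # 3 2
```

Aggregate and LIMIT pushdowns follow the same pattern: the storage layer computes or truncates before sending, instead of shipping raw rows.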
Does Hive use pushdowns?
Hive itself does not here, but other big data frameworks available on EMR can make use of them. Example: Trino
What is Trino?
The “look & feel” is like Hive: you write SQL queries to analyze data that is not necessarily stored in a database format.
Trino query execution times are usually way lower than with Hive.
Trino uses pushdown mechanisms.
Trino keeps intermediate results in memory, Hive writes them to disk.
Hive compiles your SQL query into a MapReduce program, which can take some time.
Keep in mind: Numbers will heavily depend on your SQL queries and data sets.
What is the difference between Hive and Trino?
Trino is much, much faster!
When working with EMR, you still see lots of “servers” rather than “services”…
That can be nice when you need maximum control and configurability.
But many customers are rather only interested in analysis results and not in cluster configuration.
Further abstraction into a service:
Similar concept of “SQL query execution as a service”:
What are Athena's properties?
Serverless: the cloud service provider runs (multi-tenant) clusters in the background: you can’t access them.
Can be used via Web UI or API.
It’s primarily used to query data that is stored on S3.
It can also query other data sources like databases, via connectors (MySQL, Postgres, etc.).
When would this make sense? (Athena Properties)
Because you don’t have access to the underlying clusters, there needs to be a place where metadata is stored (= the schema information that exists after you ran a CREATE TABLE statement):
Where is the source data located? (e.g., S3 paths)
What schema does the data have?
tables
columns
data types of the columns
Glue Data Catalog as a central place for data set definitions.
There is a dedicated AWS service to manage data set definitions, because this is used in multiple other occasions too:
Working with other data analytics services than Athena
Running ETL jobs
Sharing your data with other AWS accounts
Can query results be stored directly on the clusters?
Query results also cannot be stored directly on the clusters. They instead go to an S3 bucket that you need to specify. You need to create a dedicated bucket for this.
The Athena web UI will also show query results if they are small enough.
What does an Athena workflow look like?
How to create data sources in Athena/Glue?
You can run CREATE TABLE statements directly in Athena (Trino's statements are similar to Hive's). This creates the information in the Glue Data Catalog.
Alternatively, you can also define your tables manually in Glue. This is literally clicking through web forms:
You need to manually:
look at your source data
determine column names
determine data types
What are Glue crawlers?
Crawlers walk through your data on S3, infer the schema and automatically create tables for it.
Crawlers can also be run again when your data schema evolves/changes over time.
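A much-simplified sketch of schema inference as a crawler might do it, guessing int/double/string from sampled values. Real Glue crawlers handle many file formats, partitions, and more data types; the sample data here is invented:

```python
# Rough sketch of crawler-style schema inference from sampled string values.
def infer_type(values):
    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"

# Sampled column values as they appear in a CSV (everything is a string):
sample = {
    "id":     ["1", "2", "3"],
    "amount": ["9.99", "12.5", "3"],
    "city":   ["Berlin", "Munich", "Hamburg"],
}
schema = {col: infer_type(vals) for col, vals in sample.items()}
print(schema)  # {'id': 'int', 'amount': 'double', 'city': 'string'}
```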
Is it now cheaper to run an EMR cluster or to use Athena?
It depends on your usage patterns:
If you have a small number of queries or ad-hoc queries, Athena is likely cheaper.
If you run queries regularly 24/7, it can be cheaper to “pay for time” for a running EMR cluster.
It also depends on how much data is really scanned by Athena.
EMR clusters can also be created on-demand and terminated again.
There is no generic true answer.
The good thing is that you can always iterate and change your approach. A decision does not lock you in. You did not buy any hardware, nor sign a long-term contract.
What is a data lake?
“A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.”
As a producing company, BMW can collect lots of data:
Data created during vehicle manufacturing e.g., measurements of assembly machines, inventory of car components, …
Regular sensor data that vehicles send
Sales and marketing data e.g., which campaigns lead to which sales outcomes?
Results of vehicle engineering simulations e.g., do our simulations line up with what we later see in practice?
How to deal with the massive growth in data volume?
An existing central team could not keep up any more with who provides what data, how much of it, and who should be able to access what.
How to make data accessible and usable to many people in the company?
“Data democratization”
What does the flow between data providers, data storage, and data consumers look like?
You want to provide a data set within the company?
You’ll get an AWS account.
Store your data in an S3 bucket that we assign to you.
You want to consume a data set from the data lake?
If the data producer allows your request, you’ll get access to the relevant S3 bucket(s).
What does the flow between provider, resource account, and use case look like?
No more capacity and performance planning for the central team
no cluster/hardware adaptations are required over time
data producers just add new S3 objects that provide data for newer time frames
There is an internal web application (“Data portal”) for users of the data lake:
registering as a data provider or data consumer
browsing the available data sets
requesting and granting permissions to data sets
seeing contact data of the data providers
One thing that will come up quickly in such an S3-based data lake setup:
Data producers and data consumers have different optimization requirements that work against each other.
Example: a sensor placed in the factory that emits data every few seconds
It will write small files (and would anyhow not have the capacity to buffer data for longer periods)
It does not support columnar file formats and likely will not compress data before writing
Easy for writers ≠ Well-designed for readers
The data lake thus needs different layers to fulfill different needs
ETL processes in between make sure that data is well-suited for consumption
columnar file formats
using compression
larger file sizes
laid out like many data consumers asked for, e.g., only containing specific columns, uniform date/time formats, etc.
Update cycles from source to prepared layer can vary: hourly, every x hours, daily, weekly, …
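A miniature version of such an ETL step, merging many small raw records (as the sensor would write them) into one larger, compressed object, using only the Python standard library. The record format is made up:

```python
# Compaction sketch: many small writer-friendly files become one larger,
# compressed, reader-friendly object.
import gzip

# 1000 small "files", one record each, as a sensor might write them:
small_files = [f"sensor={i % 3} temp=21.{i % 10}\n".encode() for i in range(1000)]

joined = b"".join(small_files)      # compact into a single object ...
compacted = gzip.compress(joined)   # ... and compress it once

print(f"{len(joined)} bytes raw -> {len(compacted)} bytes compacted")
assert gzip.decompress(compacted) == joined  # lossless round trip
```

A real prepared layer would additionally convert to a columnar format (e.g., Parquet) and reshape columns, but the size and file-count effect is the same.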
“Creating data sets out of data sets”
Similar concept: materialized views in databases
Data consumers don’t need to regularly scan TBs of data if they work on a prepared data set that already aggregates information or combines multiple data sets.
Data consumers can turn into data providers by sharing their “derived data sets” within the company.
Data lineage = Answering the question
“Where does this data come from initially?”
Tracking the source and evolution of data over multiple steps and data sets.
How to actually do the ETL?
Technically, there are many possible ways to create new data sets:
Run jobs on EMR clusters (using any of the frameworks available there)
Athena CTAS and INSERT INTO
Ray jobs
Spark jobs
What does CTAS mean?
CREATE TABLE AS SELECT
Creates a new Athena/Glue table based on the results of a SQL query
The result files that represent the new table are stored on S3
Useful for one-time or first-time creation of data sets
What is INSERT INTO?
Takes the results of a SQL query and adds them to an existing Athena/Glue table
Storage-wise this works by putting additional files into an existing S3 bucket
Useful for regular additions to existing data sets
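The semantics of CTAS and INSERT INTO … SELECT can be tried out locally with the stdlib sqlite3 module. Table names and data below are invented, and unlike Athena the results land in a local database rather than as files on S3:

```python
# CTAS and INSERT INTO ... SELECT, demonstrated with stdlib sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_orders (id INTEGER, payment_type TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        (1, 'credit_card', 30), (2, 'cash', 12), (3, 'credit_card', 55);

    -- CTAS: one-time creation of a derived data set
    CREATE TABLE card_orders AS
        SELECT id, amount FROM raw_orders WHERE payment_type = 'credit_card';
""")

# INSERT INTO ... SELECT: regular additions to the existing data set
con.execute("INSERT INTO raw_orders VALUES (4, 'credit_card', 7)")
con.execute("""INSERT INTO card_orders
               SELECT id, amount FROM raw_orders WHERE id = 4""")

print(con.execute("SELECT COUNT(*) FROM card_orders").fetchone()[0])  # 3
```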
The decentralized approach has several advantages
What are potential disadvantages?
Advantages
The central data lake team “only” takes care of the platform (web application for data set management) and not of individual data sets, cluster hardware, storage, etc.
Self-service: Providers and consumers can act independently and don’t have a dependency on a central process (that potentially takes long time)
S3 acts as a common storage layer, but consumers can choose from many ways to process data sets
Disadvantages
Less tech-savvy users may be lost more easily because there is no “one standard way how we do things in this company”.
In this approach, permissions are denied or granted on data set level. (=You can either read nothing or everything stored in an S3 bucket.) There is no column-level access control.
Potential solutions: Create multiple data sets via ETL or import a data set from S3 into a “traditional database”.
There is no central gate for data quality. Just like people may leave or change departments over time, data sets may become less curated over time.
Recommendations for data quality
Show metrics to make the data set status transparent
Display how often a data set is updated (and how often it was updated in the past). Think of GitHub commit activity graphs:
Display how much a data set is read by data consumers. S3 has ready-made metrics for this:
Ensure data providers and data consumers can communicate with each other. Examples: show contact details of the data provider, have a “comment section” below a data set, etc.
In the long term, a data lake should ideally provide high quality data and not turn into an unmaintained mess.
So far you have mainly used “Distributed SQL environments”
Hive (but other frameworks are available on EMR too)
Athena (which uses Trino under the hood)
Why/when to use general-purpose frameworks (such as Spark or Ray) instead of SQL?
Filtering or data aggregation that requires complex logic
Working with external sources or destinations
Making API calls to external services to request additional data
Writing results to custom destinations (e.g., databases, dashboards, etc.)
Working with complex data formats (e.g., images)
Running Ray in a cloud environment
Cloud service providers offer dedicated services for Ray clusters: you only need to bring your Python code and that’s it.