What is cloud computing?
“Simply put, cloud computing is the delivery of computing services – including servers, storage, databases, networking, software, analytics, and intelligence – over the Internet (‘the cloud’) to offer faster innovation, flexible resources, and economies of scale. You typically pay only for cloud services you use, helping lower your operating costs, run your infrastructure more efficiently and scale as your business needs change.”
Another way to think about cloud computing is to see it as a next step in a series of abstractions. Using cloud services can then be compared to using a programming library. Do you build your own IT hardware?
When you interact with your hardware, do you write your own operating system?
When you work with data, do you write your own analytics mechanisms/frameworks from scratch?
Following this path of abstraction, cloud computing just becomes one more way of “we are using a solution provided by somebody else“.
Example scenario: run a small web application that stores information entered by users in a relational database.
Ranking of data storage flavors when using Athena
1. Use columnar file formats that have compression built-in (.parquet, .orc)
2. Use row-based file formats, with compression on top (.json.gz, .csv.bzip2)
3. Use row-based file formats, without compression (.json, .csv)
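To see the difference, a minimal sketch (using pandas; file names and data are placeholders) that writes the same data in the three flavors; the columnar, compressed variant is typically the smallest and cheapest to scan:

```python
import pandas as pd

# hypothetical example data set
df = pd.DataFrame({"payment_type": ["credit_card", "cash"] * 50_000,
                   "amount": [12.5, 7.0] * 50_000})

df.to_parquet("trips.parquet")                              # 1. columnar, compression built in
df.to_csv("trips.csv.gz", index=False, compression="gzip")  # 2. row-based, compression on top
df.to_csv("trips.csv", index=False)                         # 3. row-based, uncompressed
```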
Common properties of cloud computing services
“On-demand”: you turn services on and off as you wish/need, usually without any long-term commitment
“Elasticity” / “Scalability”: you can dynamically add additional resources (or remove them), as your application requires
“Self-service”: you don’t ask somebody else to configure something for you, nor negotiate about a service: you use ready-made, standardized offerings
“Multi-tenancy”: you share the underlying software and hardware stack with other customers of the same cloud service provider
“Abstraction”: you get a specific outcome “as a service”, but often don’t see how the underlying solution is designed or implemented. This usually reduces your maintenance effort.
“Pay-as-you-go” / “Pay-per-use”: your costs depend on what you use and may go up or down significantly each month
What do you trade?
You trade CAPEX (capital expenditure, e.g., the one-time cost of buying server hardware) for OPEX (operational expenditure, e.g., the monthly cost of using a service).
What are economies of scale?
Example 1: A cloud service provider that buys hard disks in batches of thousands gets a lower price per disk than you
Example 2: Running a cooling system for a large data center is usually cheaper than running 10 cooling systems for 10 smaller data centers with the same size combined
Major public cloud service providers today
What about private cloud approaches?
They follow the same principles of “on-demand”, “elasticity”, “multitenancy”, etc. But a certain management effort will stay with your company.
What are traditional virtualization and container orchestration solutions?
Running virtual machines with VMware or Hyper-V
Kubernetes/OpenShift clusters
How did public cloud computing offerings emerge?
Between 2000 and 2010, several tech companies faced significant growth in user demand: Amazon, Yahoo, Google, Facebook, eBay, …
It’s hard to find reliable numbers, but some reports suggest that during peak periods Facebook added 1200 servers per month to keep up with growing demand.
You need repeatable, reusable and automated processes to
launch new servers
configure them in a uniform way
manage them (installing updates, evacuating racks for hardware maintenance, etc.)
assign data storage capacity to servers
…
scenario: Ordering a server in a traditional large corporation
1. You realize you need a server for a project. You guess your CPU and RAM requirements.
2. You fill out a form to order the server.
3. Your supervisor needs to approve the order.
4. If costs are above a certain threshold, a finance responsible needs to approve the order as well.
5. Your order is routed to the central IT department or IT service provider.
6. You need to wait until your order is processed.
7. The person who handles your order may still have questions and try to contact you.
8. You get access to your server.
9. After a few days you realize that you would have needed more RAM. It would also be better if you had a separate database server, instead of running the database on the same system.
10. Your journey goes back to the start: you fill out the order form again.
Cloud computing bypasses this process
1. You realize you need a server for a project. You guess your CPU and RAM requirements.
2. If you don't already have one, you create an account at a cloud service provider. This requires a credit card and a few minutes of time.
3. You click through the web UI to configure a server. Or you run on the command line: aws ec2 run-instances [your CPU and RAM requirements]
4. You get access to your server within 1-2 minutes.
5. After a few days you realize that you would have needed more RAM. It would also be better if you had a separate database server, instead of running the database on the same system.
6. You run: aws ec2 modify-instance-attribute [details for more RAM] and aws rds create-db-instance [...]
7. You’re done.
What are core differences between these two scenarios?
There are no dependencies on other people or other departments. This reduces the turnaround time significantly. It ties back to the cloud computing properties “on-demand”, “elasticity” and “self-service”.
You may end up with high costs quickly
Things you configure may not comply with your company’s expectations or internal policies:
guidelines on which operating systems to use and how to manage them
security policies
Reduced central control and central coordination
Cloud computing does not prevent you from messing up. You can just mess up way faster.
Dedicated services for many different IT needs
running virtual machines
running databases
log monitoring and analytics
video streaming
text translation
IoT device management
machine learning
• …
How do you interact with these services?
APIs
Multiple ways to send the very same API calls
Via web interface (“AWS Console”)
Via command line interface (“AWS CLI”)
Via software development kits (SDKs) for various programming languages: Python, Java, .NET, Go, PHP, Rust, …
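For illustration, a minimal sketch of sending the same kind of API call via the Python SDK (boto3); the region, AMI ID, and instance type are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# equivalent of "aws ec2 run-instances ..." on the CLI
response = ec2.run_instances(
    ImageId="ami-12345678",    # placeholder AMI ID
    InstanceType="t3.micro",   # your CPU and RAM requirements
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```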
What are the AWS service properties?
Services have different pricing models:
per time used
per GB used
per size configured
per number of API calls sent, …
Storage in the cloud
In “classic IT” you often see servers that have disks attached. You store data on the disks by talking to those storage servers. (Example: network shares)
What do you need to consider with storage in the cloud?
You would again need to take care of:
Scaling with demand
High availability
Capacity planning
OS configuration and maintenance
What does BLOB storage mean?
Binary Large Object
Universal and most extensively used form of cloud storage
Used to store arbitrary data like images, videos, text, CSVs, …
Size of a BLOB can range from a few bytes to multiple TBs
I want to …
backup my data as a ZIP file.
store the pictures of my web site.
store data for data analytics or machine learning.
store intermediate results of my lengthy computations.
keep application logs in a central place.
What are the properties (and limitations) of BLOB storage?
Objects can be written (and overwritten and deleted), but not partially updated
You interact with the storage via API calls
Objects are automatically replicated across multiple data center buildings in the same region
You get high durability and availability without own management effort.
Virtually unlimited storage
You just store additional objects when you need to (and can delete them again any time). You may store one object or millions of objects.
Performance does not change with the amount of storage you consume
The service answers with the same throughput and latency for an object: There is no single disk or single server that is assigned to you. You are interacting with a distributed system.
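A minimal sketch of this API-based interaction using boto3 (bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# upload an object (e.g., a backup ZIP file)
with open("archive.zip", "rb") as f:
    s3.put_object(Bucket="my-example-bucket", Key="backups/archive.zip", Body=f)

# download it again; there is no file system, only API calls against the distributed service
obj = s3.get_object(Bucket="my-example-bucket", Key="backups/archive.zip")
data = obj["Body"].read()
```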
Often found properties of BLOB storage services
Logging all data access
Automating backups
Sharing data with other users of the same cloud service provider
Auto-deleting old objects
Replicating data across regions
What is AWS S3?
Simple Storage Service
S3 stores objects in buckets
Buckets = logical containers/grouping mechanisms for objects. Some settings you make on bucket level apply to all objects in a bucket.
Objects = your actual data (think: “files”)
Can bucket names occur more than once?
No! Bucket names are globally unique across all AWS customers.
S3 and costs
Three main components:
1. Amount of data stored (GBs per month)
compress data, if you can
delete old data that you don't need any more (use lifecycle rules to automate this; see the sketch after this list)
make use of S3 storage classes (trade lower storage cost for higher per-request cost)
2. Number of requests sent to the service (per 1000 requests)
cache data at the clients that use the objects
batch data before you upload it (e.g., do you need to upload new log messages every second or is batching for 10 seconds fine as well?)
3. Data transfer when data is sent out of the AWS region (per GB out)
don’t regularly transfer data if you don’t need to: process data close to where it is stored
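As referenced above for lifecycle rules, a hedged sketch of automating the deletion of old data via boto3 (bucket name, prefix, and retention period are assumptions):

```python
import boto3

s3 = boto3.client("s3")

# expire objects under "logs/" 90 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)
```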
Data locality
If data is stored at a cloud service provider, it’s usually also efficient to use data analytics services there (and vice versa).
It’s an anti-pattern to repeatedly move the same GBs of data over wide-area networks, due to latency & data transfer cost.
1. The query engine reads all the required raw data over the Internet/VPN
2. The query result is computed and given to the user
3. All the previously transferred raw data is discarded again
How do object storage and data analytics now fit together?
Distributed data processing solutions emerged because a single system (often in form of a database) could not efficiently handle the size of certain data sets any more.
What does the Hadoop & HDFS architecture look like?
Each node (=server of a cluster) stores a part of a data set and runs requested computations on it.
How many times does a certain word appear in a text-based dataset (of many GBs in size)?
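A minimal sketch of this classic word-count job as Hadoop Streaming mapper and reducer scripts in Python (how the job is submitted depends on your cluster setup):

```python
# mapper.py -- reads raw text lines from stdin, emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop sorts the mapper output by key, so counts per word arrive grouped together
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")
        count = 0
    current_word = word
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```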
Hadoop & HDFS architecture has advantages:
Queries can be run in a distributed way, working in parallel
Data is stored close to where the compute happens: nodes run jobs directly on the data they have stored locally
Hadoop & HDFS architecture has disadvantages:
Compute and storage are closely coupled
What if you only need to run computations for 1 hour per day?
Will you keep the servers running idle for the rest of the time? If you want to add new data, the servers need to be running.
What if you need much more storage space? Do you add more servers?
Do you put more disks into existing servers?
Cloud service providers offer such Hadoop-style clusters as managed services (e.g., AWS EMR), but they want you to use their BLOB storage solution instead of HDFS:
Decoupling of compute and storage
Clusters only need to exist when you are running queries and can then be shut down or deleted again → cost savings
You can easily experiment with different cluster configurations (number of nodes, CPU, RAM, …) → fail fast to achieve what you need
You can add/delete data any time without being dependent on the current existence of servers
What is Hive?
Data analytics framework for distributed processing of large data sets
Provides a SQL-like query interface for data that does not need to be stored as a database: CSV, JSON, Parquet, …
Usage: you write SQL queries and Hive translates them into Hadoop MapReduce jobs and runs them
SerDe = Serializer Deserializer
Is an implementation of how to read, interpret and write a specific data format that you want to analyze.
There are built-in SerDes for CSV, JSON, Parquet, …
You can also write your own SerDes for custom file formats.
Hive table definitions are “schema-on-read”.
The data structure is only applied when reading the data, e.g., when you run SELECTs.
A consequence is that Hive will be perfectly happy with faulty table definitions and only fail later when you run queries.
Run queries against your virtual table:
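For illustration, a minimal sketch of such a table definition and query (table name, columns, and S3 path are hypothetical), here passed as a HiveQL string to the Hive CLI from Python:

```python
import subprocess

hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS trips (
  trip_id STRING,
  payment_type STRING,
  amount DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-example-bucket/trips/';

SELECT payment_type, count(*) FROM trips GROUP BY payment_type;
"""

# schema-on-read: the CSV files on S3 stay untouched; the structure is only applied when the SELECT runs
subprocess.run(["hive", "-e", hiveql], check=True)
```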
Why is it better to store data on S3 instead of HDFS?
Storing data on S3 instead of a local HDFS file system is nice, because it decouples compute and storage. But data now first needs to be sent over the network before it can be analyzed:
1. Transfer all records of the data set from S3 to the compute nodes
2. Filter for those records where payment_type=‘credit_card’
3. Discard all the other records we just transferred over the network → inefficient
What is the idea behind pushdowns?
The core idea is that parts of a SQL query are already evaluated close to where the data lives (if possible). The storage layer should only send those parts of raw data to the compute layer that will be needed there.
Evaluation is “pushed down” to a lower layer in the stack.
The pushdown concept is not specific to a product: you will find it in many big data frameworks, data warehouse solutions, etc.
It requires the storage layer to understand the data format you use and provide filtering functionality. → Need to check individually for the setup you use.
What does an example of predicate pushdown look like?
There are more types of pushdowns, for example:
Aggregate function pushdown
MIN(), MAX(), SUM(), COUNT(), …
LIMIT operator pushdown
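As a small illustration of predicate and projection pushdown at the file-format level, a sketch using pyarrow on a Parquet data set (file name and column names are assumptions); only the requested columns and the row groups that can match the filter are read:

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "trips.parquet",                                 # works the same for Parquet data sets on S3
    columns=["payment_type", "amount"],              # projection pushdown: read only these columns
    filters=[("payment_type", "=", "credit_card")],  # predicate pushdown: skip non-matching row groups
)
print(table.num_rows)
```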
Does Hive use pushdowns?
Other big data frameworks available on EMR can make use of them. Example: Trino
What is Trino?
The “look & feel” is like Hive: you write SQL queries to analyze data that is not necessarily stored in a database format.
Trino query execution times are usually way lower than with Hive.
Trino uses pushdown mechanisms.
Trino keeps intermediate results in memory, Hive writes them to disk.
Hive compiles your SQL query into a MapReduce program, which can take some time.
Keep in mind: Numbers will heavily depend on your SQL queries and data sets.
What is the difference between Hive and Trino?
Trino is much, much faster!
When working with EMR, you still see lots of “servers” rather than “services”…
That can be nice when you need maximum control and configurability.
But many customers are rather only interested in analysis results and not in cluster configuration.
Further abstraction into a service:
Similar concept of “SQL query execution as a service”:
What are Athena's properties?
Serverless: the cloud service provider runs (multi-tenant) clusters in the background: you can’t access them.
Can be used via Web UI or API.
It’s primarily used to query data that is stored on S3.
It can also query other data sources like databases, via connectors (MySQL, Postgres, etc.).
When would this make sense? (Athena Properties)
Because you don’t have access to the underlying clusters, we need a place where metadata is stored (= the schema information that exists after you ran a CREATE TABLE statement):
Where is the source data located? (e.g., S3 paths)
What schema does the data have?
tables
columns
data types of the columns
Glue Data Catalog as a central place for data set definitions.
There is a dedicated AWS service to manage data set definitions, because this is used in multiple other occasions too:
Working with other data analytics services than Athena
Running ETL jobs
Sharing your data with other AWS accounts
Can query results be stored directly on the clusters?
Query results also cannot be stored directly on the clusters. They instead go to an S3 bucket that you need to specify. You need to create a dedicated bucket for this.
The Athena web UI will also show query results if they are small enough.
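A minimal sketch of submitting a query via the Athena API with boto3 (database name, query, and result bucket are assumptions); note the mandatory S3 output location:

```python
import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT payment_type, count(*) AS cnt FROM trips GROUP BY payment_type",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # results land here
)["QueryExecutionId"]

# Athena runs asynchronously: poll until the query has finished, then fetch the result rows
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```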
What does an Athena workflow look like?
How to create data sources in Athena/Glue?
You can run CREATE TABLE statements directly in Athena (the Trino statements are similar to Hive). This creates the information in the Glue data catalog.
Alternatively, you can also define your tables manually in Glue. This is literally clicking through web forms:
You need to manually:
look at your source data
determine column names
determine data types
What are Glue Crawlers?
Crawlers walk through your data on S3, infer the schema and automatically create tables for it.
Crawlers can also be run again when your data schema evolves/changes over time.
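A hedged sketch of creating and starting a crawler via boto3 (crawler name, IAM role, database, and S3 path are assumptions):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="trips-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",  # hypothetical IAM role
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-example-bucket/trips/"}]},
)
glue.start_crawler(Name="trips-crawler")  # infers the schema and creates/updates the Glue table
```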
Is it now cheaper to run an EMR cluster or to use Athena?
It depends on your usage patterns:
If you have a small number of queries or ad-hoc queries, Athena is likely cheaper.
If you run queries regularly 24/7, it can be cheaper to “pay for time” for a running EMR cluster.
It also depends on how much data is really scanned by Athena.
EMR clusters can also be created on-demand and terminated again.
There is no generic true answer.
The good thing is that you can always iterate and change your approach. A decision does not lock you in. You did not buy any hardware, nor signed a long-term contract.
What is a data lake?
“A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.”
As a producing company, BMW can collect lots of data:
Data created during vehicle manufacturing e.g., measurements of assembly machines, inventory of car components, …
Regular sensor data that vehicles send
Sales and marketing data e.g., which campaigns lead to which sales outcomes?
Results of vehicle engineering simulations e.g., do our simulations line up with what we later see in practice?
How to deal with the massive growth in data volume?
An existing central team could not keep up any more with who provides what data, how much of it, and who should be able to access what.
How to make data accessible and usable to many people in the company?
“Data democratization”
What does the flow between data providers, data storage, and data consumers look like?
You want to provide a data set within the company?
You’ll get an AWS account.
Store your data in an S3 bucket that we assign to you.
You want to consume a data set from the data lake?
If the data producer allows your request, you'll get access to the relevant S3 bucket(s).
What does the flow between provider, resource account, and use case look like?
No more capacity and performance planning for the central team
no cluster/hardware adaptations are required over time
data producers just add new S3 objects that provide data for newer time frames
There is an internal web application (“Data portal”) for users of the data lake:
registering as a data provider or data consumer
browsing the available data sets
requesting and granting permissions to data sets
seeing contact data of the data providers
One thing that will come up quickly in such an S3-based data lake setup:
Data producers and data consumers have different optimization requirements that are at odds with each other.
Consider the example of a sensor placed in the factory that emits data every few seconds:
It will write small files (and would anyhow not have the capacity to buffer data for longer periods)
It does not support columnar file formats and likely will not compress data before writing
Easy for writers ≠ Well-designed for readers
The data lake thus needs different layers to fulfill different needs
ETL processes in between make sure that data is well-suited for consumption
columnar file formats
using compression
larger file sizes
laid out like many data consumers asked for, e.g., only containing specific columns, uniform date/time formats, etc.
Update cycles from source to prepared layer can vary: hourly, every x hours, daily, weekly, …
“Creating data sets out of data sets”
Similar concept: materialized views in databases
Data consumers don’t need to regularly scan TBs of data if they work on a prepared data set that already aggregates information or combines multiple data sets.
Data consumers can turn into data providers by sharing their “derived data sets” within the company.
Data lineage = Answering the question
“Where does this data come from initially?”
Tracking the source and evolution of data over multiple steps and data sets.
How to actually do the ETL?
Technically, there are many possible ways to create new data sets:
Run jobs on EMR clusters (using any of the frameworks available there)
Athena CTAS and INSERT INTO
Ray jobs
Spark jobs
What does CTAS mean?
CREATE TABLE AS SELECT
Creates a new Athena/Glue table based on the results of a SQL query
The result files that represent the new table are stored on S3
Useful for one-time or first-time creation of data sets
What is INSERT INTO?
Takes the results of a SQL query and adds them to an existing Athena/Glue table
Storage-wise this works by putting additional files into an existing S3 bucket
Useful for regular additions to existing data sets
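For illustration, hedged sketches of both statement types as they could be submitted to Athena (table names, S3 locations, and the result bucket are assumptions):

```python
import boto3

athena = boto3.client("athena")

# CTAS: one-time creation of a new, Parquet-formatted data set
ctas = """
CREATE TABLE trips_parquet
WITH (format = 'PARQUET', external_location = 's3://my-example-bucket/trips-parquet/')
AS SELECT * FROM trips
"""
# INSERT INTO: regular additions to the existing data set
insert = "INSERT INTO trips_parquet SELECT * FROM trips_staging"

for statement in (ctas, insert):
    athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
```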
The decentralized approach has several advantages
What are potential disadvantages?
Advantages
The central data lake team “only” takes care of the platform (web application for data set management) and not of individual data sets, cluster hardware, storage, etc.
Self-service: Providers and consumers can act independently and don’t have a dependency on a central process (that potentially takes a long time)
S3 acts as a common storage layer, but consumers can choose from many ways to process data sets
Disadvantages
Less tech-savvy users may be lost more easily because there is no “one standard way how we do things in this company”.
In this approach, permissions are denied or granted on data set level. (=You can either read nothing or everything stored in an S3 bucket.) There is no column-level access control.
Potential solutions: Create multiple data sets via ETL or import a data set from S3 into a “traditional database”.
There is no central gate for data quality. Just like people may leave or change departments over time, data sets may become less curated over time.
Recommendations for data quality
Show metrics to make the data set status transparent
Display how often a data set is updated (and how often it was updated in the past). Think of GitHub commit activity graphs:
Display how much a data set is read by data consumers. S3 has ready-made metrics for this:
Ensure data providers and data consumers can communicate with each other. Examples: show contact details of the data provider, have a “comment section” below a data set, etc.
In the long term, a data lake should ideally provide high quality data and not turn into an unmaintained mess.
So far you have mainly used “Distributed SQL environments”
Hive (but other frameworks are available on EMR too)
Athena (which uses Trino under the hood)
Why/when use general-purpose frameworks (like Ray) instead of SQL?
Filtering or data aggregation that requires complex logic
Working with external sources or destinations
Making API calls to external services to request additional data
Writing results to custom destinations (e.g., databases, dashboards, etc.)
Working with complex data formats (e.g., images)
Running Ray in a cloud environment
Cloud service providers offer dedicated services for Ray clusters: you only need to bring your Python code and that’s it.
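A minimal sketch of the kind of Python code you would bring (the transformation logic is a placeholder); on a managed Ray cluster, ray.init() attaches to the configured cluster instead of starting a local one:

```python
import ray

ray.init()

@ray.remote
def transform(record: str) -> str:
    # arbitrary Python logic, e.g., calling an external API or parsing a complex format
    return record.lower()

records = ["Sensor-A", "Sensor-B", "Sensor-C"]
print(ray.get([transform.remote(r) for r in records]))  # tasks run distributed across the cluster
```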
What are you usually responsible for in cloud computing?
Terminology of cloud service offerings
IaaS: Infrastructure as a Service
PaaS: Platform as a Service
SaaS: Software as a Service
Was ist IaaS?
Provides basic low-level IT building blocks. You could also run their equivalents in your own data center, for example:
Virtual machines where you can log into the operating system
Virtual hard drives
Virtual networks where you assign IP addresses yourself
Was ist PaaS?
Higher abstraction level than IaaS. The typical model is: “You just bring your code. The cloud service provider runs it.” The cloud service provider takes care of:
Configuring machines that run your code, including OS management, updates, etc.
Scaling and load balancing
Features like collecting logs and metrics, etc.
PaaS was pioneered by platforms like:
Was ist SaaS?
Most abstract layer.
You don’t (or rarely) code yourself but use ready-made applications of a service provider.
Personal point of view (and yours may vary!): The value added by cloud service providers lies in automating away the operations around the open-source software.
Automating updates (of the software and the OS underneath)
Running clusters of multiple machines for high availability
Scalability: automatically adding resources to clusters – or removing them
Hardware and data center operations
Having people on-call 24/7 to react when there are issues
What is the crux with cloud services?
The crux is that cloud service providers do not “distribute” software. They only provide a service over a network.
So, (depending on the license) it's perfectly fine to take open-source software, enhance it, provide it as a cloud service, and not contribute your changes back to the public.
→ “Application service provider loophole”
What is the application service provider loophole?
Some open-source projects consider this “works as designed” and don’t further care: it’s a freedom that their license grants. Customers of cloud service providers also grow their user base.
Others are changing their licenses because they felt their work was being exploited:
They are now using AGPL or SSPL, which define “providing a service over a network” to be “distribution”.
Cloud Service Providers now need to
either stop using the underlying open source project
or enter individual agreements with the license holders
or take the last version that uses the old license and create a new open-source project from it (“fork“)
Name some example developments:
Challenges with public cloud usage
Pricing can be hard to predict:
It’s also easy to pay too much for a specific outcome.
Example: If you ask a cloud service provider to launch a virtual machine with way more CPU and RAM than you need in practice, they will happily give it to you (and charge for it).
This is a direct consequence of the “On-demand” and “Elasticity” properties of cloud computing (and thus won’t change soon).
Cloud service providers may also choose to raise their prices in the future.
Loss of exclusive control:
You cannot see how cloud service providers operate internally.
Do their employees adhere to legal and ethical standards?
Which internal security controls are in place?
How are customers separated?
You will be affected by decisions and mistakes of the cloud service provider
Potential vendor lock-in:
A solution you have built may be tightly coupled to the managed services of a specific provider.
Cloud service providers may discontinue services
If services don‘t turn profitable within a couple of years after launching, they may get sunset.
Configuration mistakes due to self-service
Legal concerns
Network latency between on-premises and cloud
Cloud computing requires changes in the organization and changes of the mindset
Cloud is a fast-evolving field
Compute possibilities that emerged over time
The idea behind serverless functions
You as the customer only provide the code (= your function implementation). You are not concerned with the execution environment.
The cloud service provider:
distributes your code across multiple physical machines (for load balancing and high availability)
adds it to more machines if load requires it (→ scaling)
invokes it when certain defined events happen: “event-driven computing”
“Function in computer programming“ vs. “Function in serverless computing“
A serverless function is not only a single function, as used in programming languages. Instead, a serverless function often consists of multiple functions on a programming-language-level.
Event-driven computing
Serverless functions don’t run 24/7. They are frozen again after they processed an event (and then don’t consume resources any longer).
Event-driven computing, example of AWS Lambda
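As a sketch of what such an event-driven function could look like, a minimal Lambda handler in Python that reacts to S3 “object created” events (the processing logic is a placeholder):

```python
import json

def lambda_handler(event, context):
    # Lambda invokes this handler with the triggering event, e.g., an S3 object-created notification
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")   # placeholder for the actual processing
    return {"statusCode": 200, "body": json.dumps("processed")}
```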
Common serverless function properties
Support for multiple programming languages
Allow for very low resource assignments (e.g., 128MB RAM)
Pay-per-execution-time (e.g., milliseconds executed)
Solutions are often maintenance free. You need to keep your code up-to-date and update any libraries you bring along. But that‘s it.
Are tailored for the “quick and small execution use cases“, not for heavy computing that takes hours
Example limits of AWS Lambda:
Maximum of 10 GB RAM
Maximum execution time: 15 minutes
Maximum 6 MB of request data and 20 MB of response data
What is batch processing / discrete processing?
There is a known beginning and end of data to be processed. The data volume is known and finite.
Processing a file
Running an hourly ETL job to send specific data into an analytics system
Creating a daily report of all credit card transactions
What is stream processing?
There is no dedicated beginning or end of the data to be processed. The data volume processed over time is unknown and potentially infinite. Data processing functions hook into a continuous stream of input values.
Use cases for stream processing
Near real-time analysis of incoming data → ability to react faster
Analysis of sensor data
Analysis of financial transactions for fraud detection
Log analysis (e.g., detecting denial-of-service attempts or password brute forcing)
Sentiment analysis (e.g., extracting topics that start to trend on social media)
Typical components of a streaming architecture
Example features of stream processing frameworks
Providing a SQL interface for stream contents. Can be used to analyze stream data and to publish SQL-based views of stream contents. (Which can be used for dashboards.) Example: Summarize user activity of the last 5 minutes.
Ready-made outlier detection and anomaly detection functions
Functions that are useful for ETL: converting data, filtering data, calculating averages or other aggregates, exporting data to downstream systems such as databases, etc.
What is a tumbling window?
The window position always moves forward by a whole window size. Windows don't overlap.
It can miss situations: for example, was there a “green trade” and a “blue trade” within a 5 second period?
Updates less frequently: you’ll get a new average trading price every 5 seconds.
Usually easier to implement: each item is only “looked at” once.
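A minimal Python sketch of the tumbling-window average (item format and window size are assumptions; a real stream processor would emit each window as soon as it closes instead of buffering everything):

```python
from collections import defaultdict

WINDOW_SECONDS = 5

def tumbling_averages(trades):
    """trades: iterable of (epoch_seconds, price) tuples -> {window_start: average price}."""
    windows = defaultdict(list)
    for ts, price in trades:
        window_start = ts - (ts % WINDOW_SECONDS)   # each item falls into exactly one window
        windows[window_start].append(price)
    return {start: sum(p) / len(p) for start, p in sorted(windows.items())}

print(tumbling_averages([(0, 10.0), (3, 12.0), (6, 11.0)]))  # {0: 11.0, 5: 11.0}
```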
What is a sliding window?
The window position always moves forward by a defined step smaller than the window size, e.g., a second. Individual windows overlap.
More accurate: it finds a “green trade” and a “blue trade” within a 5 second period.
Can update more frequently: you’ll get a new average trading price every second.
Can be harder to implement: you will “look at” the same items multiple times. Example: You want to alarm if there was a “yellow trade” and a “red trade” within a 5 second period. You'll need to keep state about whether you have already sent an alarm.
Tumbling windows or sliding windows?
Analyze your use case to determine whether tumbling windows are “good enough”.
In many cases, simple tumbling windows just work fine. Example: you just extract certain items from a stream and write them to a database.
Is the batch size the same as the tumbling/sliding window size?
Internal organization of a stream processing solution
Using too few servers is bad
Using too many servers is bad as well
What are shards / partitions?
Are a way of grouping stream items.
Are a way of splitting responsibility.
Are a unit of capacity and scaling: we can add or remove entire shards (and their servers) as load requires.
In practice, producers often don’t need to know how many partitions currently exist. They just use some partition key, and the stream processing solution determines the target partition internally.
Make sure the partition keys your producers use are well-distributed. Otherwise, you will hammer single servers while others are idle.
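A minimal producer sketch with boto3 (stream name and payload are assumptions); the partition key decides which shard receives the record:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="sensor-stream",    # hypothetical stream name
    Data=json.dumps({"sensor_id": "s-17", "temperature": 21.4}).encode(),
    PartitionKey="s-17",           # well-distributed keys spread records evenly across shards
)
```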
Which two architectures can be used for predictive maintenance?
What is Kinesis?
When you configure Kinesis to be a trigger for your Lambda function, you can define your micro-batch properties
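On the consuming side, a minimal sketch of a Lambda handler that receives such a micro-batch (the payload format is an assumption; Kinesis record data arrives base64-encoded):

```python
import base64
import json

def lambda_handler(event, context):
    # Lambda hands over one micro-batch per invocation, sized by the trigger's batch settings
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(payload)   # e.g., forward to a database, run anomaly detection, ...
```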
What is the goal behind AI services such as SageMaker?
The general ambition is to make machine learning tasks easier for people who have little to no machine learning background, because experts are hard to find and hire.
Cloud service AI offerings thus often go in the direction of:
“We already have pre-trained models for you“ or “We make it easier for you to train and use a model“
Which ready-made AI services are there?
Text extraction (OCR)
Image recognition
Fraud detection (user behavior analysis)
Text to speech & Speech to text
Language translations
Generative AI
Anomaly detection in sensor data
Which methods are there for AI services?
What is Rekognition?
Image and video analysis service, to detect objects, persons, scenes and sentiments.
The service offers pre-trained models that detect common categories (“labels”) in images or videos that you provide:
Alternatively, you can provide your own training data to detect custom items / persons / scenes (“Rekognition Custom Labels”).
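A minimal sketch of calling the pre-trained label detection via boto3 (bucket and image key are assumptions):

```python
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-example-bucket", "Name": "photos/street.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)
for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))
```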
What were the drawbacks / criticisms of Rekognition?
“In late 2017, the Washington County, Oregon Sheriff's Office began using Rekognition to identify suspects' faces.“
“In early 2018, the FBI also began using it as a pilot program for analyzing video surveillance.“
“In May 2018, it was reported by the ACLU that Orlando, Florida was running a pilot using Rekognition for facial analysis in law enforcement […]“
The criticism was not only about the ”Custom Labels” feature, but also about the pre-trained models the service brings along:
“In January 2019, MIT researchers published a peer-reviewed study asserting that Rekognition had more difficulty in identifying dark-skinned females than competitors such as IBM and Microsoft. In the study, Rekognition misidentified darker-skinned women as men 31% of the time, but made no mistakes for light-skinned men.“
Why is the broad accessibility of Rekognition a concern?
Broad accessibility of point-and-click image and video analysis thus not only raises concerns regarding mass surveillance, but also regarding equal treatment of persons.
It‘s not so much that the technology exists per-se, but that the technology is available to anybody who invests 5 minutes to register an AWS account and may not be aware of potential biases.
Which features are there?
Use the “Custom Labels“ feature of Rekognition
What do you need to consider with Rekognition Custom Labels?
annotate your training data with bounding boxes
Make sure you really labeled all training images – it‘s easy to unintentionally miss some:
What algorithms are behind Rekognition?
The information provided is very limited: “convolutional deep neural networks“ and “recurrent neural networks“
What is SageMaker Canvas and what does it support?
Point-and-click machine learning web interface
Supported types of machine learning problems:
Linear regression
Binary classification
Multiclass classification
What does a workflow in SageMaker Canvas look like?
Upload your data to S3 and create a “Canvas Dataset” from it
What happens in the background in SageMaker Canvas?
Canvas uses a SageMaker feature called “Autopilot“ / “AutoML”:
Splits your data into a test and a validation set
Selects applicable machine learning algorithms
Trains multiple models with different hyperparameters
Tests the trained models and determines the best-performing one
Algorithms/frameworks that Autopilot tries (SageMaker)
What can you do after training in SageMaker Canvas?
After training, you can make predictions directly in the web interface (either batch predictions or just single items):
Which underlying artifacts can you view in SageMaker Canvas?
the Python scripts used for data preprocessing
the Jupyter notebook used for training and validation
the trained model files
the model evaluation results
What is a SageMaker domain?
It is essentially just a collection of default settings.
What is SageMaker Studio?
“Browser IDE tailored for machine learning tasks”
Provides a mix of custom web UI and a hosted JupyterLab.
As a user, you need to write code, know which algorithms to use, which hyperparameters to use, etc.
Why run JupyterLab in the cloud when you can also run it locally on your machine?
Use SageMaker Studio!
What is the advantage of SageMaker Studio?
Access to specialized hardware (e.g., GPUs) https://aws.amazon.com/sagemaker/pricing/ → On-Demand Pricing → Training
Data locality: Your training data may already be stored in the cloud and/or require significant storage space
Allows you to deploy your model after it was trained, for live classification/prediction: SageMaker Inference endpoint
What does a SageMaker Inference endpoint look like?
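As an illustration of how clients talk to such an endpoint, a minimal sketch using the SageMaker runtime API (endpoint name and payload format are assumptions and depend on how the model was deployed):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-model-endpoint",   # hypothetical endpoint name
    ContentType="text/csv",             # payload format depends on the deployed model
    Body="5.1,3.5,1.4,0.2",
)
print(response["Body"].read())          # prediction returned by the deployed model
```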
What are further advantages of SageMaker Studio?
You have access to your trained models: they are stored in an S3 bucket in your account.
You may also just download them and use them somewhere else, outside the cloud.
You can use algorithms where SageMaker offers built-in support, or bring your custom Docker container images:
What do cloud service providers offer for generative AI models?
Similar to machine learning, cloud service providers focus on making generative AI accessible to developers who are not strictly experts in the field.
1. Providing pre-trained models that can be used via API calls
2. Fine-tuning / Continued pre-training of existing models with custom data that you provide
With which service can you use LLMs?
Bedrock
Which models are available in Bedrock?
Cloud service providers try to cover a broad range of use cases (Bedrock):
Text generation (for, e.g., search, interpretation, summarization)
Media generation (for, e.g., images, audio, video, etc.)
Drawing conclusions / Reasoning
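A hedged sketch of calling a text model through the Bedrock runtime API with boto3 (the model ID and request body follow the Anthropic message format on Bedrock; available models and formats in your account may differ):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID; use one enabled in your account
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": "Summarize what a data lake is."}],
    }),
)
print(json.loads(response["body"].read()))
```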
Service advantages of cloud-hosted models:
Scalability: it‘s the task of the cloud service provider to appropriately size their infrastructure, whether you send only a few requests to a model or many
It‘s easy to get started: there is only little setup work required
You can compare and switch between different models (Multiple models are available via the same API)
Service disadvantages
If used at scale, it can get (too?) expensive for certain use cases. Example: Generating thousands of images.
Strong (online) dependency on a specific model provider.
Use AWS Bedrock to create an image imitation service.
Often found split of responsibilities in traditional IT environments:
Traditional IT environments treat the areas “Dev“ and “Ops“ as different things: different teams, different responsibilities, different mechanisms.
In cloud computing, infrastructure is created and managed by sending API calls. → IT infrastructure can be treated as a software engineering task
The idea behind infrastructure as code (IaC) (which is also driven by DevOps…)
From “ClickOps“ to “GitOps“
IaC is often found in teams that work along DevOps principles, where a single team is responsible for an entire product/service. (“You build it, you run it.“)
This includes:
Definition of the customer
Definition of the product/service offering
Development
Operations
Customer support
Product/service evolution
You could write an infrastructure definition like this:
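A minimal imperative sketch of such a script using boto3 (bucket name, AMI ID, and region are placeholders); note that it only covers the initial creation:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")
s3 = boto3.client("s3", region_name="eu-central-1")

# "create once": these calls set up the resources, but say nothing about later updates or deletion
s3.create_bucket(
    Bucket="my-app-data-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
)
ec2.run_instances(ImageId="ami-12345678", InstanceType="t3.micro", MinCount=1, MaxCount=1)
```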
Issues with this approach:
The code only handles the “create once” case. If you later want to make changes to your infrastructure, you’ll need dedicated code snippets for updates. (Or delete everything and start a new rollout.)
The code only creates infrastructure. You will need to write separate scripts to change or delete resources again.
IaC tools instead use a declarative model (“desired state model“).
You define what the result should look like, but not the specific API calls to be sent to get to this result.
What does an IaC flow look like?
You define the “what“ but not the “how“ in IaC:
Cloud Service Providers often have their own IaC services that use YAML or JSON for infrastructure definitions.
Alternatively, there are IaC tools that support resources of multiple Cloud Service Providers.
What do Terraform and Pulumi do, and what is the bottleneck?
These tools work with multiple cloud service providers.
Same syntax for infrastructure definition
Same “look and feel“, e.g., when working in an IDE
Same CLI commands, e.g., roll out your infrastructure
But… The actual cloud resource definitions are specific to each cloud service provider! You can't specify “Give me a virtual machine“ and have the same code work for multiple clouds. Their underlying APIs are (too) different.
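Still, as an illustration of the declarative model itself, a minimal sketch using Pulumi's Python SDK (resource names and the AMI ID are placeholders); you describe the desired state and the tool derives the create/update/delete API calls:

```python
import pulumi
import pulumi_aws as aws

# Desired state: one S3 bucket and one EC2 instance should exist
bucket = aws.s3.Bucket("app-data-bucket")
server = aws.ec2.Instance(
    "app-server",
    ami="ami-12345678",        # placeholder AMI ID
    instance_type="t3.micro",
)

pulumi.export("bucket_name", bucket.id)   # `pulumi up` reconciles this definition with reality
```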
Benefits of IaC
Infrastructure is kept in a defined and consistent state. It avoids “This was set up manually and the person who did it left the company.”
You can replicate your infrastructure on demand, e.g., for
having a realistic test environment to try something out
creating fully functional environments on demand, e.g., for customer demos
Development, test and production environments are much more likely to resemble each other
CI/CD pipelines are often used to ensure code quality and to build software. When using IaC, they can also be used to test and roll out your infrastructure. For example:
Example tools for CI/CD pipelines
What does the Terraform architecture look like?
A Terraform provider translates the contents of your infrastructure definition files (*.tf) into API calls that need to be sent to a cloud service provider.
Within Terraform infrastructure definition files (*.tf) you can mix resources of multiple cloud service providers. Terraform makes sure to create the resources at the right provider, in the right order.
Example:
register a domain name at GoDaddy and
let this domain name point to an EC2 instance
Terraform infrastructure definition files contain:
Resource blocks
Represent infrastructure that should be deployed by Terraform
EC2 instances
S3 buckets
RDS databases
Data blocks
Retrieve information from a cloud environment that already exists there (and is not managed by Terraform)
What is the AWS account number?
Which OS images are available?
What is the ID of a virtual network that already exists there?
The state file keeps a mapping between the contents of your infrastructure definition files (*.tf) and the infrastructure that is currently rolled out in your cloud environment(s).
Example 1 and Example 2
Example 1: If you roll out an EC2 instance, Terraform needs to remember, for example: “I have already created this resource in a previous run and it is instance ID ‘i-06e0d316‘ on AWS.“
Example 2: If you delete a previously rolled out resource from a *.tf file, Terraform needs to detect that the resource now needs to be deleted on the cloud side as well. (But the information is already gone from the *.tf file.)
Infrastructure as code vs. ClickOps
IaC is not the right tool for every situation. It is meant to define and maintain environments where consistency and repeatability matter (e.g., production workloads).
If you just want to experiment, try out and validate something, ClickOps will usually be faster than writing IaC definitions. → Fail fast