05. Cloud Monitoring

by Jensen J.

Why should one monitor?

make best use of rented resources
-> to reduce cost
-> and increase satisfaction of users of your services

What characteristics does an observable system have?

exposes enough data about itself
so that generating information (finding answers to questions yet to be formulated)
and easily accessing this information
becomes simple

What is the definition of monitoring?

monitoring in the cloud is:
the process of collecting
- status information of
- applications and resources
data can be used to observe applications and infrastructure

What is the definition of a monitoring system?

consists of all components for gathering monitoring data at runtime

What is monitoring data?

all (raw) data captured by the monitoring system

What is the definition of information w.r.t. monitoring? Provide an example.

information is gained by
- processing
- interpreting
- organizing
- visualizing
raw data
-> it increases knowledge about the observed system

example:

raw data are CPU and memory utilization
information is that there is a trend toward an overload or a memory leak
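
A minimal sketch of this step from raw data to information: the Python below fits a least-squares line to memory utilization samples and flags a steady upward trend. The sample values, the function names and the threshold of 0.5 percentage points per sample are illustrative assumptions, not part of the course material.

```python
from statistics import mean

def slope(samples):
    """Least-squares slope of samples taken at equally spaced points in time."""
    xs = range(len(samples))
    x_mean, y_mean = mean(xs), mean(samples)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

def looks_like_leak(samples, threshold=0.5):
    """Information: memory use grows by more than `threshold` per sample."""
    return slope(samples) > threshold

# Raw data: memory utilization in percent, one sample per minute (made-up values).
memory_percent = [40, 42, 44, 46, 48, 50]
steady_percent = [40, 41, 40, 41, 40, 41]

print(looks_like_leak(memory_percent))  # grows by 2 points per sample
print(looks_like_leak(steady_percent))  # fluctuates without a clear trend
```

A real system would likely use a longer window and a more robust test; the point is only that the trend is derived from the raw samples and is not measured directly.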

Is the required information always clear in the cloud?

No!
-> collect any data available

What two ways of creating information do we differentiate?

proactively creating
- -> continuous analysis to trigger alarms or give an overview of the status of the system
reactively creating
- triggered through events such as incidents

What levels do we differentiate w.r.t. the purpose of monitoring?

infrastructure level
application level

What is the purpose of monitoring at the infrastructure level?

resource management
incident detection
root cause analysis
accounting or metering for payment
intrusion detection
auditing

What is the purpose of monitoring at the application level?

performance analysis
resource management (e.g. scaling decisions)
failure detection and resolution
SLA verification
auditing

What target systems for monitoring do we differentiate?

parallel systems
cloud

What are the monitoring paradigms in parallel systems?

batch system
data is collected during an application run
analysis happens post mortem
execution is reproducible

What are the monitoring paradigms in the cloud?

interactive system
data is continuously produced - realtime data
realtime analysis
data used for
- immediate action
- study past system behavior

What are the three pillars of monitoring?

metrics
logs
traces

What different metrics do we differentiate (conceptual types)?

monitoring metrics
metrics we monitor

What monitoring metrics did we discuss?

metric itself (e.g. execution time)
- semantics
- unit
context
- server, application service,…
representation
aggregation
- sum, min, max, mean, percentiles, histogram
measurement frequency
- every second, minute, 5 minutes,…
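
To make the aggregation options concrete, here is a small sketch that reduces raw execution-time samples to the aggregations named above. The sample values, the bucket width of 50 ms and the function name `aggregate` are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, quantiles

# Raw measurements: execution time in milliseconds (made-up values).
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 12, 17]

def aggregate(values, bucket_width=50):
    """Reduce raw samples to sum, min, max, mean, percentiles and a histogram."""
    percentiles = quantiles(values, n=100, method="inclusive")  # 99 cut points
    histogram = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "p50": percentiles[49],
        "p95": percentiles[94],
        "histogram": dict(sorted(histogram.items())),  # bucket start -> count
    }

summary = aggregate(latencies_ms)
print(summary)
```

In this data the single slow request pulls the mean well above the median, which is one reason percentiles and histograms are often preferred for latency-like metrics.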

What are some important metrics to monitor?

latency
throughput or traffic
error rate
utilization or saturation

What is latency? How to measure?

time it takes to service a request
-> measure successful and failed requests separately
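
One possible way to keep the two latency series apart is sketched below. The wrapper `timed`, the example `handler` and the outcome labels are hypothetical names introduced for illustration.

```python
import time
from collections import defaultdict

latencies = defaultdict(list)  # outcome ("success" / "error") -> seconds

def timed(handler, request):
    """Run a handler and record its latency under the matching outcome."""
    start = time.perf_counter()
    outcome = "error"  # assume failure until the handler returns normally
    try:
        result = handler(request)
        outcome = "success"
        return result
    finally:
        latencies[outcome].append(time.perf_counter() - start)

def handler(request):
    """Stand-in for a real request handler (assumption)."""
    if request == "bad":
        raise ValueError("cannot serve request")
    return "ok"

for request in ["a", "b", "bad"]:
    try:
        timed(handler, request)
    except ValueError:
        pass  # the failed request has already been recorded

print({outcome: len(values) for outcome, values in latencies.items()})
```

Keeping the series separate avoids very fast failures making the overall latency look better than it is.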

What is throughput or traffic?

web service: requests/second
streaming system: network I/O rate or concurrent sessions
database: transactions/second or retrievals per second

What is error rate? What types of errors did we differentiate?

rate of requests that fail
e.g.
- explicitly (HTTP 500)
- implicitly (wrong reply contents)
- or by violating an SLA
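
The three kinds of failure can be folded into one error rate, as in the following sketch. The request tuples and the SLA limit of 0.5 seconds are made-up assumptions.

```python
def is_error(status, body_ok, latency_s, sla_latency_s=0.5):
    """Classify one request using the three kinds of failure listed above."""
    if status >= 500:
        return True  # explicit: the server reports a failure
    if not body_ok:
        return True  # implicit: success status but wrong reply contents
    return latency_s > sla_latency_s  # correct reply, but it violates the SLA

# (HTTP status, reply contents correct?, latency in seconds)
requests = [
    (200, True, 0.12),   # fine
    (500, False, 0.05),  # explicit error
    (200, False, 0.10),  # implicit error
    (200, True, 0.90),   # SLA violation
]

error_rate = sum(is_error(*r) for r in requests) / len(requests)
print(error_rate)
```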

What is utilization or saturation?

percentage of capacity
CPU, memory, I/O

What cloud layers are there to monitor?

client
application
platform
infrastructure
hardware

What to monitor at the client layer? What is the context, metric and purpose of monitoring?

Monitor:

requests

Context:

request type

Metric:

# requests
latency
availability

purpose:

SLA checking
alerting

What to monitor at the application layer? What is the context, metric and purpose of monitoring?

Monitor:

microservices

Context:

service name
service id

Metric:

# requests
request rate
latency
# replicas
CPU time
memory usage

purpose:

autoscaling
performance tuning

What to monitor at the platform layer? What is the context, metric and purpose of monitoring?

Monitor:

kubernetes
docker

Context:

container id

Metric:

CPU & memory quota
utilization
incoming & outgoing bytes

purpose:

container distribution
autoscaling VM cluster

What to monitor at the infrastructure layer? What is the context, metric and purpose of monitoring?

Monitor:

VM
volumes
queuing services

Context:

VM id
volume id
service name

Metric:

CPU & memory
# reads/writes
I/O latency
# requests
size of requests of infrastructure service
disk utilization
traffic

purpose:

root cause analysis

What to monitor at the hardware layer? What is the context, metric and purpose of monitoring?

Monitor:

servers
network
SAN
disks

Context:

server id
switch id

Metric:

disk utilization
traffic

purpose:

management of VMs

What are requirements for a monitoring system?

comprehensive
low intrusion
extensibility
scalability
elasticity
accuracy
resilience

What type of monitoring is there (box)?

white box
black box

What is white box monitoring?

data comes from inside and outside of the system
-> gives more context and more detailed insights
e.g. the internal organization of a service is visible, such as asynchronous internal handling of requests

What is blackbox monitoring?

monitored system handled as black box
no data gained from the inside of a system
e.g. only the request interface of service is visible
- => nothing about the internal structure

What is a problem of overheads?

they lead to intrusion

What are reasons for overheads?

instrumentation
computation for aggregations
memory overhead for buffering
time to push to disk or transfer to collector
storage overhead for long-term storage

How to reduce overheads?

number of metrics
measurement frequency
representation
batching
sampling
long-term coarsening
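
Two of these levers, sampling and long-term coarsening, can be sketched in a few lines. The function names and the CPU values are assumptions for illustration.

```python
from statistics import mean

def coarsen(samples, factor):
    """Long-term coarsening: replace each group of `factor` samples by its mean."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]

def sample_every(samples, n):
    """Sampling: keep only every n-th measurement."""
    return samples[::n]

# One CPU utilization value per second (made-up values).
cpu_percent = [10, 20, 30, 40, 50, 60]

print(coarsen(cpu_percent, 3))       # two values stored instead of six
print(sample_every(cpu_percent, 2))  # three values stored instead of six
```

Both reduce storage and transfer volume at the price of detail, so older data is a natural candidate for coarsening.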

What is amazon cloud watch?

monitoring and management service
collects:
- metrics
- logs: cloud watch logs insights

How does amazon cloud watch provide data?

online
-> different frequency depending on data and account
-> different storage time for different granularity (finer granularity -> shorter storage)

How can one access amazon cloud watch?

management console web interface
CLI
libraries (e.g. Java, scripting languages, Windows .NET)
web service API

What actions are possible in amazon cloud watch?

view graphs and statistics
set alarms

What is prometheus used for?

open source monitoring system
features:
- metric collection in the form of time series
- storage in a time-series database
- query language for accessing the time series
- alerting
- visualization
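
The idea of metrics stored as time series, each identified by a metric name plus labels, can be sketched with a tiny in-memory store. This is not Prometheus code or its API; `record`, `select`, the metric name and the labels are hypothetical and only illustrate the data model.

```python
from collections import defaultdict

# (metric name, sorted label pairs) -> list of (timestamp, value)
series = defaultdict(list)

def record(name, labels, timestamp, value):
    """Append one sample to the time series identified by name and labels."""
    series[(name, tuple(sorted(labels.items())))].append((timestamp, value))

def select(name, **matchers):
    """Return all series with this name whose labels include the matchers."""
    return {
        key: samples
        for key, samples in series.items()
        if key[0] == name and set(matchers.items()) <= set(key[1])
    }

record("http_requests_total", {"service": "cart", "code": "200"}, 0, 5)
record("http_requests_total", {"service": "cart", "code": "200"}, 15, 9)
record("http_requests_total", {"service": "cart", "code": "500"}, 15, 1)

print(select("http_requests_total", code="500"))
```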

What is the definition of a log?

sequence of immutable records of discrete events
generated by applications, system events, infrastructure, any devices…

What forms can an event log have?

plaintext -> most common format of logs
structured -> much evangelized, typically JSON
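
The difference between the two forms is easy to see with Python's standard `logging` module. The class `JsonFormatter`, the logger name and the message are assumptions made for this sketch.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

plain = logging.Formatter("%(levelname)s %(name)s %(message)s")

# Build one record by hand so both formatters see exactly the same event.
record = logging.LogRecord(
    name="checkout", level=logging.ERROR, pathname="example.py", lineno=0,
    msg="payment failed for order %s", args=("1234",), exc_info=None,
)

print(plain.format(record))            # plaintext line
print(JsonFormatter().format(record))  # structured line
```

The structured line can be parsed back into fields without pattern matching, which is what makes it convenient for later analysis.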

What two formats are logs usually kept in?

ASCII
- easily readable
- inefficient w.r.t. space and time
binary
- more efficient
- e.g. protobuf from google

How much log data is there usually?

huge!
-> can be configured to levels
allows to drill down
-> difficult to analyze!

What is protobuf?

binary format for logs
-> language neutral
-> platform neutral
-> extensible mechanism for serializing structured data
backward compatible
forward compatible

What is the ELK stack for log processing?

stack of FOSS tools
elasticsearch, logstash, kibana
suited for logs and metrics

What is the elastic stack?

ELK + beats and X-pack

What does logstash support?

parsing from predefined patterns (grok patterns), transforms and filters
derive structure
anonymize personal data
geo-location lookups
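
What such a pipeline step does can be imitated in plain Python. This is not Logstash configuration: the regular expression stands in for a predefined pattern, and the log line layout, the field names and the hashing of the client address are assumptions for illustration.

```python
import hashlib
import re

# Named groups play the role of a predefined pattern (assumed line layout).
LINE = re.compile(
    r"(?P<client_ip>\d+\.\d+\.\d+\.\d+) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<duration_ms>\d+)ms"
)

def parse(line):
    """Derive structure from a plaintext line and hide the client address."""
    match = LINE.fullmatch(line)
    if match is None:
        return None  # the line does not fit the pattern
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["duration_ms"] = int(event["duration_ms"])
    # Replace the address by a short hash so requests can still be grouped.
    ip = event.pop("client_ip")
    event["client"] = hashlib.sha256(ip.encode()).hexdigest()[:12]
    return event

print(parse("203.0.113.7 GET /cart 200 35ms"))
```

Strictly speaking, hashing an address is pseudonymization and not full anonymization; a production setup would need to consider that.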

What is x-pack?

extension to ELK
- authentication and authorization
- monitor ELK
- alerting
- report generation of kibana contents
- machine learning, e.g. anomaly detection and forecasting
- SQL interface for elasticsearch

What is kibana?

visualization dashboard

What are uses for logs?

besides debugging and performance monitoring,
properly structured logs help with:
- incident root cause analysis
- anomaly detection
- fault prediction and predictive maintenance
- detecting and responding to data breaches and other security incidents
- ensuring compliance with security policies, regulations & audits

What are best practices for log analysis?

pattern detection and recognition
log normalization
classification and tagging
correlation analysis
artificial ignorance
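
Artificial ignorance means filtering out entries that are known to be routine so that only the unexpected remainder needs attention. A minimal sketch, with made-up patterns and log lines:

```python
import re

# Patterns for entries known to be routine (illustrative assumptions).
KNOWN_NORMAL = [
    re.compile(r"^INFO health check ok$"),
    re.compile(r"^INFO user \d+ logged in$"),
]

def unusual(lines):
    """Drop every line that matches a known-normal pattern."""
    return [
        line for line in lines
        if not any(pattern.search(line) for pattern in KNOWN_NORMAL)
    ]

log = [
    "INFO health check ok",
    "INFO user 42 logged in",
    "ERROR disk /dev/sda1 read failure",
    "INFO health check ok",
]

print(unusual(log))
```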

What are beats?

agents to collect data
- -> filebeat for logs
- -> metricbeat for metrics

What is AWS cloudwatch logs?

allows to store and analyze logs from amazon services
and your application services

What are functions in AWS cloudwatch logs?

grouping logs
metrics
- -> extract from log statements automatically and insert into cloud watch metrics
control retention period
- by default never deleted
analysis
real-time processing

How can one group AWS cloud watch logs?

allow to cluster several log streams from multiple sources

What is tracing?

capture interactions of different services, the life of a request
capture individual events -> e.g. submit request, receive request, start processing, …, submit answer, receive answer
associate events with given request to be able to analyze the execution of this request

What is google dapper? What were its design goals?

tool for distributed tracing
goals:
- continuous and ubiquitous tracing
- low overhead
- application transparency
- scalability

How can we trace requests?

with tree structures

-> can be transformed in a diagram over time

span has its own id (i.e. id for the sub-request)
each sub-request handling also has a parent id (which refers to the span id of the procedure that called…)

What is the span of a request?

the lifetime of a request
-> i.e. the time it takes from receiving the request until the frontend answers…

What are trace-ids used for?

to annotate events
-> unique trace id created at frontend service
-> passed to sub-requests (so that one can identify to which initial request the execution of each sub-request belongs…)
-> requires manual or automatic instrumentation (adjusting the code so that this passing happens…)
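
Manual instrumentation of this kind could look as follows. The service functions and event tuples are hypothetical; the only point is that the trace id is created once and handed to every sub-request.

```python
import uuid

def frontend(request):
    """Create the trace id once, at the service that receives the user request."""
    trace_id = uuid.uuid4().hex
    events = [(trace_id, "frontend", "receive request")]
    events += cart_service(request, trace_id)  # the id travels with the sub-request
    events.append((trace_id, "frontend", "submit answer"))
    return events

def cart_service(request, trace_id):
    """Annotate own events with the received id and pass it further down."""
    events = [(trace_id, "cart", "start processing")]
    events += database(request, trace_id)
    return events

def database(request, trace_id):
    return [(trace_id, "database", "start processing")]

events = frontend("GET /cart")
for event in events:
    print(event)
```

Because every event carries the same id, events collected on different services can later be grouped by the initial request.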

How can requests be represented?

dapper trace trees
-> nodes are called spans: lifetime of a request
-> edges indicate the temporal relationship

What are spans?

represent an RPC (remote procedure call)

What attributes do spans have?

span id
- identifies a span
parent id:
- span id of triggering span
trace id:
- identifies triggering request

Does the root span have a parent id?

no
root span: frontend request (span that receives the user request)
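
With span id, parent id and trace id available, the trace tree can be rebuilt from a flat list of spans. The `Span` class, the `render` function and the service names below are assumptions made for this sketch, not Dapper's actual data structures.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str

def render(spans):
    """Rebuild the trace tree from parent ids and return it as indented lines."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start at the root span, which has no parent id
    return lines

spans = [
    Span("t1", "1", None, "frontend"),
    Span("t1", "2", "1", "auth"),
    Span("t1", "3", "1", "cart"),
    Span("t1", "4", "3", "database"),
]

print("\n".join(render(spans)))
```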

What can spans additionally include?

annotations
application level events

How can one represent annotations of a span?

above: regular span attributes (i.e. name, trace id, parent id, span id)
annotations include points in time of the span indicating what happens at the client and server side (w.r.t. the RPC)

What to keep in mind w.r.t. annotations and points in time of spans?

be aware of clock skews
-> as events are created on different systems!

Where is the trace context stored?

in the thread-local storage of the thread executing the span

What happens with asynchronous execution w.r.t. storage of trace context?

issue RPC but no direct execution -> asynchronous…
=> have callback function that stores the trace context
=> when callback is invoked: trace context copied to executing thread
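
The same mechanism can be sketched with Python threads: the trace context lives in thread-local storage, is captured when the callback is created, and is copied into whichever thread later runs it. The names `with_trace_context`, `on_reply` and the trace id value are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

context = threading.local()  # per-thread trace context

def with_trace_context(callback):
    """Store the caller's trace context in the callback; restore it on invoke."""
    captured = getattr(context, "trace_id", None)

    def wrapper(*args):
        context.trace_id = captured  # copy into the executing thread
        return callback(*args)

    return wrapper

def on_reply(reply):
    """Callback that needs the trace context of the original request."""
    return (context.trace_id, reply)

context.trace_id = "trace-42"  # set in the thread that issues the RPC
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(with_trace_context(on_reply), "answer")
    result = future.result()

print(result)
```

Without the wrapper, the worker thread would have no trace id in its own thread-local storage, and the callback's events could not be associated with the request.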

How is trace context handled w.r.t. inter-process communication?

span and trace id are automatically transmitted

What is the advantage of google w.r.t. handling trace context?

all applications use same control flow and RPC library
instrumentation thus automatic…

How are annotations created?

added by the application owner

How can one use dapper to enforce security policies?

e.g. proper use of authentication or encryption
or policy based isolation
=> verifiable by looking at what is executed…

Last modified
2 years ago