Why should one monitor?
make best use of rented resources
-> to reduce cost
-> and increase satisfaction of users of your services
What characteristics does an observable system have?
exposes enough data about itself
so that generating information (finding answers to questions yet to be formulated)
and easily accessing this information
becomes simple
What is the definition of monitoring?
monitoring in the cloud is:
the process of collecting
status information of
applications and resources
data can be used to observe application and infrastructure
What is the definition of a monitoring system?
consists of all components for gathering monitoring data at runtime
What is monitoring data?
all (raw) data captured by the monitoring system
What is the definition of information w.r.t. monitoring? Provide an example.
information is gained by
processing
interpreting
organizing
visualizing
raw data
-> it increases knowledge about the observed system
example:
raw data are CPU and memory utilization
information is that there is a trend towards an overload or a memory leak
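The step from raw data to information can be sketched in a few lines. This is only an illustration, assuming equally spaced samples; the sample values, the threshold, and the names slope and LEAK_SLOPE_THRESHOLD are made up for the example.

```python
from statistics import mean

def slope(samples):
    """Least-squares slope of equally spaced samples (units per sample)."""
    xs = range(len(samples))
    x_mean = mean(xs)
    y_mean = mean(samples)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Hypothetical raw data: memory utilization in percent, one sample per minute.
memory_percent = [41.0, 42.5, 43.9, 45.2, 47.0, 48.4, 50.1]

# Hypothetical threshold: steady growth above 1 point per minute looks suspicious.
LEAK_SLOPE_THRESHOLD = 1.0

# Information: processing the raw samples yields a trend and an interpretation.
trend = slope(memory_percent)
possible_leak = trend > LEAK_SLOPE_THRESHOLD
print(f"memory grows by {trend:.2f} points/min, possible leak: {possible_leak}")
```

The raw samples alone say nothing about a leak; the slope plus the threshold is the interpretation that turns them into information.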
Is the required information always clear in the cloud?
No!
-> collect any data available
What two ways of creating information do we differentiate?
proactively creating
-> continuous analysis to trigger alarms or give an overview of the system status
reactively creating
triggered through events such as incidents
What levels do we differentiate w.r.t. the purpose of monitoring?
infrastructure level
application level
What is the purpose of monitoring at the infrastructure level?
resource management
incident detection
root cause analysis
accounting or metering for payment
intrusion detection
auditing
What is the purpose of monitoring at the application level?
performance analysis
resource management (e.g. scaling decisions)
failure detection and resolution
SLA verification
What target systems for monitoring do we differentiate?
parallel systems
cloud
What are the monitoring paradigms in parallel systems?
batch system
data is collected during an application run
analysis happens post mortem
execution is reproducible
What are the monitoring paradigms in the cloud?
interactive system
data is continuously produced - realtime data
realtime analysis
data used for
immediate action
study past system behavior
What are the three pillars of monitoring?
metrics
logs
traces
What different metrics do we differentiate (conceptual types)?
monitoring metrics
metrics we monitor
What monitoring metrics did we discuss?
metric itself (e.g. execution time)
semantics
unit
context
server, application service,…
representation
aggregation
sum, min, max, mean, percentiles, histogram
measurement frequency
every second, minute, 5 minutes,…
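The attributes of a monitoring metric listed above can be written down as a small record, together with the aggregations that summarize one measurement interval. A minimal sketch: the class MetricSpec, the context values, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, quantiles

@dataclass
class MetricSpec:
    name: str               # the metric itself
    semantics: str          # what exactly is measured
    unit: str
    context: dict           # server, application service, ...
    aggregations: tuple     # representation of the values
    frequency_seconds: int  # measurement frequency

spec = MetricSpec(
    name="execution_time",
    semantics="wall-clock time to handle one request",
    unit="ms",
    context={"server": "web-1", "service": "checkout"},  # hypothetical names
    aggregations=("sum", "min", "max", "mean", "p95"),
    frequency_seconds=60,
)

def aggregate(values):
    """Summarize the raw values of one measurement interval."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    return {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "p95": cuts[94],  # 95th percentile
    }

# Hypothetical execution times in ms; one slow outlier dominates the tail.
samples_ms = [12, 15, 11, 14, 250, 13, 16, 12, 15, 14]
summary = aggregate(samples_ms)
print(spec.name, spec.unit, summary)
```

The outlier shows why percentiles are kept next to the mean: a single slow request moves the mean far less visibly than it moves the upper percentiles.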
What are some important metrics to monitor?
latency
throughput or traffic
error rate
utilization or saturation
What is latency? How to measure?
time it takes to service a request
-> measure successful and failed requests separately
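Keeping the latency of successful and failed requests apart could look like the sketch below. The wrapper timed and the toy handler are assumptions for illustration, not part of any particular framework.

```python
import time
from collections import defaultdict

# outcome ("success" or "error") -> list of latencies in seconds
latencies = defaultdict(list)

def timed(handler, request):
    """Run handler(request) and record its latency under the right outcome."""
    start = time.perf_counter()
    outcome = "success"
    try:
        return handler(request)
    except Exception:
        outcome = "error"
        raise
    finally:
        latencies[outcome].append(time.perf_counter() - start)

def handler(request):
    # Hypothetical service logic: one request type always fails.
    if request == "bad":
        raise ValueError("cannot serve request")
    return "ok"

for req in ["a", "bad", "b"]:
    try:
        timed(handler, req)
    except ValueError:
        pass  # the failure is already recorded in latencies["error"]

print({outcome: len(values) for outcome, values in latencies.items()})
```

Errors often return much faster or much slower than normal replies, so mixing both in one series would distort the picture of either.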
What is throughput or traffic?
web service: requests/second
streaming system: network I/O rate or concurrent sessions
database: transactions/second or retrievals per second
What is error rate? What types of errors did we differentiate?
rate of requests that fail
e.g.
explicitly (HTTP 500)
implicitly (wrong reply contents)
or by violating an SLA
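The three kinds of failure can be counted with a simple classifier. A sketch under assumptions: the Response fields, the SLA limit of 200 ms, and the sample responses are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Response:
    status: int        # HTTP status code
    body_valid: bool   # did the reply contents pass a correctness check?
    latency_ms: float

SLA_MAX_LATENCY_MS = 200.0  # hypothetical SLA limit

def classify(response):
    """Return the kind of failure, or None if the request succeeded."""
    if response.status >= 500:
        return "explicit"       # e.g. HTTP 500
    if not response.body_valid:
        return "implicit"       # wrong reply contents
    if response.latency_ms > SLA_MAX_LATENCY_MS:
        return "sla_violation"  # correct, but too slow
    return None

responses = [
    Response(200, True, 35.0),
    Response(500, False, 20.0),
    Response(200, False, 40.0),
    Response(200, True, 450.0),
]

kinds = [classify(r) for r in responses]
error_rate = sum(kind is not None for kind in kinds) / len(responses)
print(kinds, error_rate)
```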
What is utilization or saturation?
percentage of capacity
CPU, memory, I/O
What cloud layers are there to monitor?
client
application
platform
infrastructure
hardware
What to monitor at the client layer? What is the context, metric and purpose of monitoring?
Monitor:
requests
Context:
request type
Metric:
# requests
availability
purpose:
SLA checking
alerting
What to monitor at the application layer? What is the context, metric and purpose of monitoring?
microservices
service name
service id
request rate
#replicas
CPU time
memory usage
autoscaling
performance tuning
What to monitor at the platform layer? What is the context, metric and purpose of monitoring?
kubernetes
docker
container id
CPU & memory quota
utilization
incoming & outgoing bytes
container distribution
autoscaling VM cluster
What to monitor at the infrastructure layer? What is the context, metric and purpose of monitoring?
VM
volumes
queuing services
VM id
volume id
CPU & memory
#read/write
I/O latency
size of requests of infrastructure service
disk utilization
traffic
What to monitor at the hardware layer? What is the context, metric and purpose of monitoring?
servers
network
SAN
disks
server id
switch id
management of VMs
What are requirements for a monitoring system?
comprehensive
low intrusion
extensibility
scalability
elasticity
accuracy
resilience
What type of monitoring is there (box)?
white box
black box
What is white box monitoring?
data comes from inside and outside of the system
-> gives more context and more detailed insights
e.g. internal organization of a service is visible e.g. asynchronous internal handling of requests
What is blackbox monitoring?
monitored system handled as black box
no data gained from the inside of a system
e.g. only the request interface of service is visible
=> nothing about the internal structure
What is a problem of overheads?
lead to intrusion
What are reasons for overheads?
instrumentation
computation for aggregations
memory overhead for buffering
time to push to disk or transfer to collector
storage overhead for long-term storage
How to reduce overheads?
number of metrics
batching
sampling
long-term coarsening
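Three of these techniques fit into a short sketch: sampling drops part of the events, batching forwards the rest in groups, and coarsening replaces old fine-grained values with averages. The class BatchingSampler and the function coarsen are hypothetical names, and the send callback stands in for the transfer to a collector.

```python
import random

class BatchingSampler:
    """Keep a random sample of events and forward them in batches."""

    def __init__(self, send, sample_rate, batch_size, rng=None):
        self.send = send                # callback that transfers one batch
        self.sample_rate = sample_rate  # fraction of events to keep (0..1)
        self.batch_size = batch_size
        self.rng = rng or random.Random()
        self.buffer = []

    def record(self, event):
        if self.rng.random() >= self.sample_rate:
            return  # sampling: drop this event
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()  # batching: one transfer for many events

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()

def coarsen(values, factor):
    """Long-term coarsening: replace each group of `factor` values by its mean."""
    chunks = (values[i:i + factor] for i in range(0, len(values), factor))
    return [sum(chunk) / len(chunk) for chunk in chunks]

sent_batches = []
sampler = BatchingSampler(sent_batches.append, sample_rate=1.0, batch_size=3)
for i in range(7):
    sampler.record(i)
sampler.flush()  # push the remaining partial batch

print(sent_batches)
print(coarsen([1, 2, 3, 4, 5, 6], 3))
```

With sample_rate=1.0 every event is kept, which makes the batching visible; a lower rate trades accuracy for less intrusion.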
What is amazon cloud watch?
monitoring and management service
collects:
logs: cloud watch logs insights
How does amazon cloud watch provide data?
online
-> different frequency depending on data and account
-> different storage time for different granularity (finer granularity -> shorter storage)
How can one access amazon cloud watch?
management console web interface
CLI
libraries (e.g. java, script languages, windows .net)
web service API
What actions are possible in amazon cloud watch?
view graphs and statistics
set alarms
What is prometheus used for?
open source monitoring system
features:
metric collection in form of time series
storage by a time-series database
query language for accessing the time series
visualization
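What "metrics as time series" means can be illustrated without prometheus itself. The sketch below is plain Python and does not use or imitate the prometheus API: a series is identified by a metric name plus labels, holds timestamped values, and a query derives a per-second rate from a counter. All names and values are assumptions.

```python
from collections import defaultdict

# (metric name, sorted label pairs) -> list of (timestamp in s, value)
series = defaultdict(list)

def series_key(metric, **labels):
    return (metric, tuple(sorted(labels.items())))

def append(metric, timestamp, value, **labels):
    """Store one sample of the series identified by metric name and labels."""
    series[series_key(metric, **labels)].append((timestamp, value))

def per_second_rate(metric, **labels):
    """Average increase per second between the first and last stored sample."""
    points = series[series_key(metric, **labels)]
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)

# Hypothetical counter of handled requests, sampled every 15 seconds.
append("http_requests_total", 0, 100, job="api")
append("http_requests_total", 15, 130, job="api")
append("http_requests_total", 30, 160, job="api")

rate = per_second_rate("http_requests_total", job="api")
print(rate)
```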
What is the definition of a log?
sequence of immutable records of discrete events
generated by applications, system events, infrastructure, any devices…
What forms can an event log have?
plaintext -> most common format of logs
structured -> much evangelized, typically json
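The difference between the two forms shows up when the same event is written once as plaintext and once as JSON. A minimal sketch with Python's standard logging module; the logger name "shop", the field user_id, and the class JsonFormatter are made up for the example.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "user_id": getattr(record, "user_id", None),  # hypothetical field
        })

# In-memory streams stand in for log files.
plain_stream = io.StringIO()
json_stream = io.StringIO()

plain_handler = logging.StreamHandler(plain_stream)
plain_handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
json_handler = logging.StreamHandler(json_stream)
json_handler.setFormatter(JsonFormatter())

logger = logging.getLogger("shop")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(plain_handler)
logger.addHandler(json_handler)

logger.info("login failed", extra={"user_id": 42})

print(plain_stream.getvalue(), end="")  # plaintext: easy to read
structured = json.loads(json_stream.getvalue())
print(structured["user_id"])            # structured: fields are directly queryable
```

The plaintext line has to be parsed again before any field can be queried, while the structured record keeps the fields separate from the start.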
What two formats are logs usually kept in?
ASCII
easily readable
inefficient w.r.t. space and time
binary
more efficient
e.g. protobuf from google
How much log data is there usually?
huge!
-> can be configured to levels
allows to drill down
-> difficult to analyze!
What is protobuf?
binary format for logs
-> language neutral
-> platform neutral
-> extensible mechanism for serializing structured data
backward compatible
forward compatible
What is the ELK stack for log processing?
stack of FOSS tools
elasticsearch, logstash, kibana
suited for logs and metrics
What is the elastic stack?
ELK + beats and X-pack
What does logstash support?
parsing from predefined patterns (grok patterns), transforms and filters
derive structure
anonymize personal data
geo-location lookups
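Deriving structure from a plaintext line and hiding personal data can be imitated with a regular expression. This is not logstash configuration and not grok syntax, only a Python analogy; the line format, the pattern, and the function parse are assumptions.

```python
import hashlib
import re

# Hypothetical plaintext format: "<client ip> <method> <path> <status>"
LINE_PATTERN = re.compile(
    r"(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)

def parse(line):
    """Derive a structured event from one log line, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])  # transform: text -> number
    # Replace the address by a truncated hash. Strictly speaking this is
    # pseudonymization; real anonymization needs more care.
    digest = hashlib.sha256(event["client_ip"].encode()).hexdigest()
    event["client_ip"] = digest[:12]
    return event

event = parse("203.0.113.7 GET /cart 200")
print(event)
```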
What is x-pack?
extension to ELK
authentication and authorization
monitor ELK
report generation of kibana contents
machine learning, e.g. anomaly detection and forecasting
SQL interface for elasticsearch
What is kibana?
visualization dashboard
What are uses for logs?
besides debugging and performance monitoring:
properly structured logs help with:
incident root cause analysis
anomaly detection
fault prediction and predictive maintenance
detect and respond to data breaches and other security incidents
ensure compliance with security policies, regulations & audits
What are best practices for log analysis?
pattern detection and recognition
log normalization
classification and tagging
correlation analysis
artificial ignorance
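Artificial ignorance, the last item above, means filtering out everything that is known to be normal and looking only at what remains. A minimal sketch; the patterns and log lines are invented.

```python
import re

# Hypothetical patterns for log lines that are known to be routine.
KNOWN_NORMAL = [
    re.compile(r"^INFO health check ok$"),
    re.compile(r"^INFO user \d+ logged in$"),
]

def unusual(lines):
    """Return only the lines that match none of the known-normal patterns."""
    return [
        line for line in lines
        if not any(pattern.search(line) for pattern in KNOWN_NORMAL)
    ]

log_lines = [
    "INFO health check ok",
    "INFO user 17 logged in",
    "ERROR disk /dev/sda1 remounted read-only",
    "INFO health check ok",
]

remaining = unusual(log_lines)
print(remaining)
```

The list of known-normal patterns grows over time, so the remainder shrinks to the entries that actually deserve attention.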
What are beats?
agents to collect data
-> filebeat for logs
-> metricbeat for metrics
What is AWS cloudwatch logs?
allows to store and analyze logs from amazon services
and your application services
What are functions in AWS cloudwatch logs?
grouping logs
-> extract from log statements automatically and insert into cloud watch metrics
control retention period
by default never deleted
analysis
real-time processing
How can one group AWS cloud watch logs?
allow to cluster several log streams from multiple sources
What is tracing?
capture interactions of different services, the life of a request
capture individual events -> e.g. submit request, receive request, start processing, …, submit answer, receive answer
associate events with given request to be able to analyze the execution of this request
What is google dapper? What were its design goals?
tool for distributed tracing
goals:
continuous and ubiquitous tracing
low overhead
application transparency
How can we trace requests?
with tree structures
-> can be transformed into a diagram over time
span has its own id (i.e. id for the sub-request)
each sub-request handling also has a parent id (which refers to the span id of the procedure that called…)
What is the span of a request?
the lifetime of a request
-> i.e. the time it takes from receiving the request until the frontend answers…
What are trace-ids used for?
to annotate events
-> unique trace id created at frontend service
-> passed to sub-requests (so that one can identify to which initial request the execution of each sub-request belongs…)
-> requires manual or automatic instrumentation (adjusting the code so that this passing happens…)
How can requests be represented?
dapper trace trees
-> nodes are called spans: lifetime of a request
-> edges indicate the temporal relationship
What are spans?
represent a RPC (remote procedure call)
What attributes do spans have?
span id
identifies a span
parent id:
span id of triggering span
trace id:
identifies triggering request
Does the root span have a parent id?
no
root span: frontend request (span that receives the user request)
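The three ids are enough to rebuild the trace tree. The sketch below is not dapper's implementation; the class Span, the helper functions, and the service names are hypothetical and only show how span id, parent id, and trace id relate.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str             # identifies the triggering request
    span_id: str              # identifies this span
    parent_id: Optional[str]  # span id of the triggering span, None for the root

def new_id():
    return uuid.uuid4().hex[:16]

def start_root(name):
    """The frontend creates the trace id; the root span has no parent."""
    return Span(name, trace_id=new_id(), span_id=new_id(), parent_id=None)

def start_child(parent, name):
    """A sub-request inherits the trace id and points back to its caller."""
    return Span(name, parent.trace_id, new_id(), parent.span_id)

def render(spans, parent_id=None, depth=0):
    """Rebuild the tree from the ids and return it as indented lines."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append("  " * depth + span.name)
            lines.extend(render(spans, span.span_id, depth + 1))
    return lines

root = start_root("frontend")
auth = start_child(root, "auth")
cart = start_child(root, "cart")
db = start_child(cart, "db")

tree = render([root, auth, cart, db])
print("\n".join(tree))
```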
What can spans additionally include?
annotations
application level events
How can one represent annotations of a span?
above: regular span attributes (i.e. name, trace id, parent id, span id)
annotations include points in time of the span indicating what happens at the client and server side (w.r.t. the RPC)
What to keep in mind w.r.t. annotations and points in time of spans?
be aware of clock skews
-> as events are created on different systems!
Where is the trace context stored?
in the thread-local storage of the thread executing the span
What happens with asynchronous execution w.r.t. storage of trace context?
issue RPC but no direct execution -> asynchronous…
=> have callback function that stores the trace context
=> when callback is invoked: trace context copied to executing thread
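The callback mechanism can be imitated in Python with contextvars as a stand-in for thread-local trace context. This is an analogy, not dapper's code: the context is captured when the asynchronous call is issued and restored in whichever thread later runs the callback. The names trace_context, on_reply, and issue_rpc_async are hypothetical.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the trace context kept in thread-local storage.
trace_context = contextvars.ContextVar("trace_context", default=None)

def on_reply(reply):
    """Callback: runs later in another thread and reads the trace context."""
    return (reply, trace_context.get())

def issue_rpc_async(executor, reply):
    # Capture the caller's context at the moment the RPC is issued ...
    captured = contextvars.copy_context()
    # ... and run the callback inside that captured context, so the
    # executing thread sees the same trace context as the caller.
    return executor.submit(captured.run, on_reply, reply)

trace_context.set({"trace_id": "t-1", "span_id": "s-7"})

with ThreadPoolExecutor(max_workers=1) as executor:
    result = issue_rpc_async(executor, "pong").result()

print(result)
```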
How is trace context handled w.r.t. inter-process communication?
span and trace id are automatically transmitted
What is the advantage of google w.r.t. handling trace context?
all applications use same control flow and RPC library
instrumentation thus automatic…
How are annotations created?
added by the application owner
How can one use dapper to enforce security policies?
e.g. proper use of authentication or encryption
or policy based isolation
=> verifiable by looking at what is executed…