What is a frequency distribution and what are two possible ways to represent it?
Representations of frequency distributions are an important tool to summarize collected raw data
A frequency distribution answers the question of how often each attribute value occurs
One can represent frequency distributions either with tables or with charts
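As a minimal sketch of the table representation, the absolute and relative frequencies of each attribute value can be counted as follows. The function name `frequency_table` and the sample grades are hypothetical and only serve as an illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return {value: (absolute frequency, relative frequency)}, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, c / n) for v, c in sorted(counts.items())}

# Hypothetical raw data: final grades of ten students.
grades = [1, 2, 2, 3, 3, 3, 4, 2, 1, 3]
grade_table = frequency_table(grades)

for value, (absolute, relative) in grade_table.items():
    print(f"grade {value}: n = {absolute}, share = {relative:.0%}")
```

The same dictionary could be passed to a plotting library to obtain the chart representation.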
What is the difference between percentage and valid percentage?
Example: students are asked to state their gender and the grade with which they finished the course, and not all students provide this information. From such data we can determine both the percentage and the valid percentage
Percentage: the frequency divided by the total number of cases, including those with missing values; the missing values are reported as a category of their own
Valid percentage: the frequency divided by the number of cases with a valid (non-missing) value, so that the valid percentages add up to 100%
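The difference can be illustrated with a short sketch. It assumes that a missing answer is encoded as `None`; the function name `percentages` and the sample answers are hypothetical. The percentage uses all cases as the denominator, the valid percentage only the cases with an answer.

```python
def percentages(values):
    """Return ({value: (percentage, valid percentage)}, percentage of missing values).

    Assumption: None marks a missing answer.
    """
    n_total = len(values)
    valid = [v for v in values if v is not None]
    n_valid = len(valid)
    result = {}
    for v in sorted(set(valid)):
        count = valid.count(v)
        # Same count, two different denominators.
        result[v] = (100 * count / n_total, 100 * count / n_valid)
    missing_percentage = 100 * (n_total - n_valid) / n_total
    return result, missing_percentage

# Hypothetical answers of ten students; two did not state their gender.
answers = ["f", "m", None, "f", "m", "f", None, "m", "f", "f"]
by_gender, missing = percentages(answers)
print(by_gender, missing)
```

In this sketch the percentages plus the missing share add up to 100%, while the valid percentages add up to 100% on their own.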
What is a cross table?
A cross table (contingency table) can be used to describe the relationship between two variables, e.g. a student's final grade in a course and the student's gender
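A cross table can be sketched as a nested dictionary that counts how often each combination of two attribute values occurs. The function name `cross_table` and the sample records are hypothetical.

```python
from collections import Counter

def cross_table(pairs):
    """Count joint frequencies of (row value, column value) pairs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Hypothetical records: (gender, final grade).
records = [("f", 1), ("m", 2), ("f", 2), ("f", 1), ("m", 3), ("m", 2)]
gender_by_grade = cross_table(records)

for gender, row in gender_by_grade.items():
    print(gender, row)
```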
What is data classification and why is it necessary?
Data classification is the process of sorting and categorizing data into various classes. In the case of qualitative, ordinal and quantitative-discrete characteristics, the classification is obvious: each attribute value forms its own class
To present data in tables, a classification of the data is necessary
What are useful guidelines for creating a frequency chart?
Label the axes
The y-axis should include the zero point of the frequency scale
The axes should not be distorted
3D charts should be avoided
Consider existing conventions
The simplest chart is often the best chart
What is a histogram and what problem does it solve?
For quantitative-continuous attributes or quantitative-discrete attributes with a lot of different attribute values, an artificial classification of the information is necessary before creating a chart
A histogram draws one rectangle over each class; the information is represented by the area of these rectangles
Each class of a histogram is represented by a semi-open interval [e_{i-1}, e_i)
The area of the rectangle over each class is proportional to the class-specific absolute or relative frequency
The base of each rectangle is defined by the semi-open interval [e_{i-1}, e_i)
The height h_i of each rectangle is the class frequency divided by the class length d_i
The choice of the class limits has a great impact on the form of the histogram
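The normalization of the rectangle heights can be sketched as follows, assuming relative frequencies and semi-open classes whose lower limit is included and whose upper limit is excluded. The function name `histogram_heights`, the ages and the class limits are hypothetical.

```python
def histogram_heights(values, limits):
    """Heights of the histogram rectangles.

    limits = [e_0, e_1, ..., e_k] defines the classes [e_0, e_1), [e_1, e_2), ...
    Each height is the relative class frequency divided by the class length,
    so the area of each rectangle equals the relative frequency.
    """
    n = len(values)
    heights = []
    for lower, upper in zip(limits, limits[1:]):
        count = sum(1 for v in values if lower <= v < upper)
        heights.append((count / n) / (upper - lower))
    return heights

# Hypothetical ages, grouped into classes of unequal length.
ages = [3, 7, 12, 15, 18, 25, 33, 41, 58, 77]
age_limits = [0, 10, 20, 40, 80]
heights = histogram_heights(ages, age_limits)

# The rectangle areas (height times class length) sum to 1.
areas = [h * (u - l) for h, l, u in zip(heights, age_limits, age_limits[1:])]
print(heights, sum(areas))
```

Changing `age_limits` changes the heights, which illustrates how strongly the class limits shape the histogram.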
What is the empirical cumulative distribution function?
The (empirical) cumulative distribution function gives, for each attribute value, the cumulated relative frequency of the objects of a population whose attribute values are less than or equal to that value
The empirical cumulative distribution function is based on the relative cumulated frequencies of empirically collected data. Instead of this approach, the distribution function can also be described based on a theoretical function ("theoretical cumulative distribution function").
The empirical cumulative distribution function is an estimate of the theoretical distribution function based on the given data.
The abbreviation "CDF" is often used. This should not be confused with the abbreviation "PDF" ("probability density function")
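A minimal sketch of the empirical cumulative distribution function for raw data: for a value x it returns the share of observations that are less than or equal to x. The function name `ecdf` and the sample observations are hypothetical.

```python
from bisect import bisect_right

def ecdf(values):
    """Return a function F with F(x) = share of observations <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right gives the number of observations <= x.
        return bisect_right(ordered, x) / n

    return F

# Hypothetical observations.
F = ecdf([2, 3, 3, 5, 8])
for x in (1, 3, 4, 5, 8):
    print(x, F(x))
```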
What are two types of distribution parameters?
Measures of location
Measures of variability ("spread", "dispersion")
What problem arises when classifying data with a huge number of attribute values, and how can we solve it?
The idea that "each attribute value gets its own class" leads to no significant gain of information compared to the raw data
An "artificial data classification" using artificial class limits is needed
There are different rules of thumb for how many classes the data should be divided into
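The notes do not name specific rules of thumb. As an illustration only, the sketch below uses two widely known ones, Sturges' rule (k = 1 + log2(n)) and the square-root rule (k = sqrt(n)), both rounded up to a whole number of classes. The function names are hypothetical.

```python
import math

def sturges_classes(n):
    """Sturges' rule: about 1 + log2(n) classes for n observations."""
    return math.ceil(1 + math.log2(n))

def sqrt_classes(n):
    """Square-root rule: about sqrt(n) classes for n observations."""
    return math.ceil(math.sqrt(n))

for n in (100, 1000):
    print(n, sturges_classes(n), sqrt_classes(n))
```

The two rules can disagree considerably for large n, so the result is a starting point rather than a fixed requirement.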
How can we statistically define a class?
A class has…
… a lower class limit (included), e_{i-1}
… an upper class limit (excluded), e_i
… a class length or class interval, d_i = e_i - e_{i-1}
… a class mark or class mean or class center, x_i = (e_{i-1} + e_i) / 2
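The definition of a class can be sketched as a small data structure. The name `ClassInterval` and its attribute names are hypothetical; the lower limit is included and the upper limit is excluded, as defined above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassInterval:
    lower: float  # lower class limit (included)
    upper: float  # upper class limit (excluded)

    @property
    def length(self):
        """Class length d_i: upper limit minus lower limit."""
        return self.upper - self.lower

    @property
    def mark(self):
        """Class mark x_i: midpoint of the two class limits."""
        return (self.lower + self.upper) / 2

    def contains(self, x):
        """True if x lies in the semi-open interval [lower, upper)."""
        return self.lower <= x < self.upper

# Hypothetical age class from 10 (included) to 20 (excluded).
teens = ClassInterval(10, 20)
print(teens.length, teens.mark, teens.contains(10), teens.contains(20))
```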
What are important guidelines for data classification?
If possible, the number of objects in each class should be similar (e.g. no class containing 90% of the data and 9 classes containing the remaining 10% of the data).
The length of the classes should be similar, but this is not always useful (e.g. in the case of income classes).
The class limits should be "common" limits (e.g. age in steps of 10 years).
The objects in a class should be centered around the class mark. This means the frequency of objects should be high around the class mark.
In the case of large classes, information about the frequency distribution can be blurred or biased.
A class consists of objects with the same attribute values or at least with similar attribute values.
What is a bar chart and where is it useful or not useful?
A bar chart can be used to present a frequency distribution in a graphical way
Bar charts are only useful for qualitative, ordinal or quantitative-discrete attributes with a "manageable" number of attribute values
The distribution of the absolute or relative frequencies is presented on the y-axis.
The information is represented by the length of the bars along the y-axis (ordinate)
A bar chart is not always the best way to present information, e.g. when representing the body height, measured in cm, of a population
What are possible chart types?
Bar chart
Histogram
Pie chart
Line chart
What is the difference between the graphs of an empirical distribution function for metric discrete and metric continuous data?
In the case of continuous data, the empirical distribution function is approximated by a monotonically increasing polygonal (piecewise linear) function
In the case of discrete data, the empirical distribution function is a monotonically increasing step function
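For classified continuous data, the polygonal approximation can be sketched by interpolating linearly between the cumulated relative frequencies at the class limits. This assumes that the observations are spread evenly within each class and that the relative frequencies sum to 1. The function name `polygon_cdf` and the sample classes are hypothetical.

```python
def polygon_cdf(limits, rel_freqs):
    """Piecewise linear distribution function for classified data.

    limits    = [e_0, e_1, ..., e_k], the class limits
    rel_freqs = relative frequency of each of the k classes (assumed to sum to 1)
    """
    cumulated = [0.0]
    for f in rel_freqs:
        cumulated.append(cumulated[-1] + f)

    def F(x):
        if x <= limits[0]:
            return 0.0
        if x >= limits[-1]:
            return 1.0
        for i in range(1, len(limits)):
            if x < limits[i]:
                lower, upper = limits[i - 1], limits[i]
                # Linear interpolation inside the class [lower, upper).
                share = (x - lower) / (upper - lower)
                return cumulated[i - 1] + share * rel_freqs[i - 1]

    return F

# Hypothetical classes [0, 10), [10, 20), [20, 40) with their relative frequencies.
G = polygon_cdf([0, 10, 20, 40], [0.2, 0.5, 0.3])
for x in (5, 10, 15, 30, 40):
    print(x, G(x))
```

For discrete raw data, the step function would instead jump at each observed value and stay constant in between.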
What is the arithmetic mean and to what type of distribution parameter does it belong?
The arithmetic mean („average“) is a measure of location or central tendency
The arithmetic mean is the sum of the population's attribute values divided by the total number of elements in the population
The unit of measurement of the mean is the same as the unit of measurement of the elements
The arithmetic mean follows linear transformations applied to the raw data: if every value x is transformed to a + b * x, the mean is transformed in the same way. It is therefore neither location-invariant nor scale-invariant
The arithmetic mean is a measure of location
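A short sketch of the arithmetic mean and of its behaviour under a linear transformation. The temperatures are hypothetical sample data; the conversion from Celsius to Fahrenheit serves as an example of a linear transformation a + b * x with a = 32 and b = 1.8.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

# Hypothetical temperatures in degrees Celsius.
celsius = [10.0, 20.0, 30.0]
fahrenheit = [32 + 1.8 * c for c in celsius]

mean_c = arithmetic_mean(celsius)
mean_f = arithmetic_mean(fahrenheit)

# The mean of the transformed data equals the transformed mean.
print(mean_c, mean_f, 32 + 1.8 * mean_c)
```

The mean shifts with the added constant and scales with the factor, which is why it carries the same unit of measurement as the data.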
Last modified 2 years ago