What is a frequency distribution and what are two possible ways to represent it?

Representations of frequency distributions are an important tool to summarize collected raw data

A frequency distribution answers the question of how often each attribute value occurs

One can represent frequency distributions either with tables or with charts
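As a minimal sketch of the table representation, the following Python snippet counts absolute and relative frequencies. The grade data and the function name `frequency_table` are made up for illustration.

```python
from collections import Counter

# Hypothetical raw data: final grades of ten students.
grades = ["B", "A", "C", "B", "B", "A", "D", "C", "B", "A"]

def frequency_table(values):
    """Map each attribute value to (absolute frequency, relative frequency)."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in sorted(counts.items())}

table = frequency_table(grades)
for value, (absolute, relative) in table.items():
    # One table row per attribute value.
    print(f"{value}: {absolute:2d}  {relative:.0%}")
```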

What is the difference between percentage and valid percentage?

Consider an example in which students were asked to state their gender and the grade with which they finished the course. If not all students provided this information, we can distinguish between the percentage and the valid percentage

Percentage: calculated relative to all cases, including those with missing values; the missing values are reported as a separate category

Valid percentage: calculated relative to the cases with known values only, so the valid percentages add up to 100%
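The distinction can be sketched in Python under the usual convention that the percentage uses all cases as the base, while the valid percentage uses only the cases with known values. The survey answers and the function name `percentages` are hypothetical.

```python
from collections import Counter

# Hypothetical survey answers; None marks a missing value.
answers = ["female", "male", None, "female", "male", "female", None, "female"]

def percentages(values):
    """Return percent (base: all cases) and valid percent (base: known cases)."""
    n_total = len(values)
    valid = [v for v in values if v is not None]
    n_valid = len(valid)
    rows = {}
    for value, count in Counter(valid).items():
        rows[value] = {
            "percent": 100 * count / n_total,
            "valid_percent": 100 * count / n_valid,
        }
    # Missing values appear only in the "percent" column.
    rows["missing"] = {
        "percent": 100 * (n_total - n_valid) / n_total,
        "valid_percent": None,
    }
    return rows

result = percentages(answers)
for value, row in result.items():
    print(value, row)
```

In this made-up sample, two of eight answers are missing, so the percentages of the known values add up to 75% while the valid percentages add up to 100%.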

What is a cross table?

A cross table can be used to describe the relationship between two variables, e.g. a student's final grade in a course and the student's gender
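A cross table of two variables can be sketched as a grid of joint counts. The records and the function name `cross_table` below are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Hypothetical (gender, grade) pairs, one per student.
records = [
    ("female", "A"), ("male", "B"), ("female", "B"),
    ("male", "A"), ("female", "A"), ("male", "C"),
]

def cross_table(pairs):
    """Count how often each combination of the two variables occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    cells = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, cells

rows, cols, cells = cross_table(records)
print("        " + "  ".join(cols))
for row_label, row in zip(rows, cells):
    print(f"{row_label:<8}" + "  ".join(str(cell) for cell in row))
```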

What is data classification and why is it necessary?

Data classification is the process of sorting and categorizing data into various classes. In the case of qualitative, ordinal and quantitative discrete characteristics, the classification is obvious. Each attribute value forms its own class

For presenting data in tables a classification of the data is necessary

What are useful guidelines for creating a frequency chart?

Label the axes

The y-axis should include the zero point of the frequency scale

The axes should not be distorted (e.g. by truncated or uneven scales)

3D charts should be avoided

Consider existing conventions

The simplest chart is often the best chart

What is a histogram and what problem does it solve?

For quantitative-continuous attributes or quantitative-discrete attributes with a lot of different attribute values, an artificial classification of the information is necessary before creating a chart

A histogram presents the information as the area of rectangles drawn over these classes

Each class of a histogram is represented by a semi-open interval [e_{i-1}, e_i)

The area of the rectangle over each class is proportional to the class-specific absolute or relative frequency

The base of each rectangle is defined by the semi-open interval [e_{i-1}, e_i)

The height h_i of each rectangle is the class frequency f_i normalized by the length d_i of this class, i.e. h_i = f_i / d_i

The choice of the class limits has a great impact on the form of the histogram
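The height normalization can be sketched as follows: each value is assigned to its semi-open class, and the relative class frequency is divided by the class length so that the rectangle areas add up to 1. The body heights, the class limits and the function name `histogram_heights` are assumptions for illustration, and all values are assumed to lie inside the outermost limits.

```python
from bisect import bisect_right

def histogram_heights(values, limits):
    """limits are sorted class limits e_0 < e_1 < ... < e_k.

    Class i covers the semi-open interval [limits[i], limits[i + 1]).
    Returns the absolute class counts and the rectangle heights
    (relative frequency divided by class length).
    """
    n = len(values)
    counts = [0] * (len(limits) - 1)
    for x in values:
        i = bisect_right(limits, x) - 1  # lower limit included, upper excluded
        if 0 <= i < len(counts):
            counts[i] += 1
    heights = []
    for i, count in enumerate(counts):
        d = limits[i + 1] - limits[i]  # class length d_i
        heights.append((count / n) / d)
    return counts, heights

# Hypothetical body heights in cm and hand-picked class limits.
body_heights = [158, 162, 165, 167, 170, 171, 174, 176, 181, 193]
class_limits = [150, 160, 170, 180, 200]
counts, heights = histogram_heights(body_heights, class_limits)
print(counts)
print(heights)
```

Note that the last class is twice as long as the others, so its rectangle is only half as high as its frequency alone would suggest. Changing `class_limits` changes the shape of the result, which is the sensitivity to class limits mentioned above.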

What is the empirical cumulative distribution function?

The (empirical) cumulative distribution function represents the cumulated relative frequencies of objects of a population with attribute values less than or equal to a given attribute value

The empirical cumulative distribution function is based on the relative cumulated frequencies of empirically collected data. Instead of this approach, the distribution function can also be described based on a theoretical function („theoretical cumulative distribution function“).

The empirical cumulative distribution function is an estimate of the theoretical distribution function based on the given data.

The abbreviation „CDF“ is often used. It should not be confused with the abbreviation „PDF“ („probability density function“)
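A minimal Python sketch of the empirical cumulative distribution function: for a given x it returns the share of observations that are less than or equal to x. The sample values and the function name `ecdf` are made up for illustration.

```python
def ecdf(sample):
    """Return a function F with F(x) = share of observations <= x."""
    n = len(sample)

    def F(x):
        return sum(1 for v in sample if v <= x) / n

    return F

# Hypothetical sample of five observations.
F = ecdf([2, 3, 3, 5, 8])
for x in (1, 3, 4, 5, 8):
    print(x, F(x))
```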

What are two types of distribution parameters?

Measures of location

Measures of variability („spread“, „dispersion“)

What problem arises when classifying data with a huge number of attribute values, and how can we solve it?

The idea of „each attribute value gets its own class“ leads to no significant gain of information compared to the raw data

An „artificial data classification“ using artificial class limits is needed

There are different rules of thumb for how many classes the data should be divided into
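The notes do not name specific rules. Two widely cited examples are the square-root rule and Sturges' rule, sketched below as one possible reading; the function names are hypothetical.

```python
import math

def classes_square_root(n):
    """Square-root rule: about sqrt(n) classes for n observations."""
    return math.ceil(math.sqrt(n))

def classes_sturges(n):
    """Sturges' rule: about log2(n) + 1 classes for n observations."""
    return math.ceil(math.log2(n)) + 1

for n in (30, 100, 1000):
    print(n, classes_square_root(n), classes_sturges(n))
```

Both rules only suggest a starting point; the guidelines on class limits and class lengths still apply.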

How can we statistically define a class?

A class has…

… a lower class limit (included), e_{i-1}

… an upper class limit (excluded), e_i

… a class length or class interval, d_i = e_i - e_{i-1}

… a class mark or class mean or class center, x_i = (e_i + e_{i-1}) / 2
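These four properties can be written down as a small Python data class. The name `ClassInterval` and the age example are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassInterval:
    lower: float  # lower class limit, included
    upper: float  # upper class limit, excluded

    @property
    def length(self):
        """Class length d_i: upper limit minus lower limit."""
        return self.upper - self.lower

    @property
    def mark(self):
        """Class mark x_i: midpoint of the two class limits."""
        return (self.upper + self.lower) / 2

    def contains(self, x):
        """Semi-open interval: lower limit included, upper limit excluded."""
        return self.lower <= x < self.upper

# Hypothetical age class from 20 (included) to 30 (excluded).
age_class = ClassInterval(20, 30)
print(age_class.length, age_class.mark)
print(age_class.contains(20), age_class.contains(30))
```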

What are important guidelines for data classification?

If possible, the objects should be distributed similarly across the classes (e.g. avoid one class containing 90% of the data and 9 classes containing the remaining 10%).

The length of the classes should be similar but this is not always useful (e.g. in the case of income classes).

The class limits should be „common“ limits (e.g. age in steps of 10 years).

The objects in a class should be centered about the class mark. This means the frequency of objects should be high around the class mark.

In the case of large classes, information about the frequency distribution can be blurred or biased.

A class consists of objects with the same attribute values or at least with similar attribute values.

What is a bar chart and where is it useful or not useful?

A bar chart can be used to present a frequency distribution in a graphical way

Bar charts are only useful for qualitative, ordinal or quantitative-discrete attributes with a „manageable“ number of attribute values

The distribution of the absolute or relative frequencies is presented on the y-axis.

The information is represented by the length of the bars along the y-axis (ordinate)

A bar chart is not always the best way to present information, e.g. when representing the height, measured in cm, of a population

What are possible chart types?

Bar chart

Histogram

Pie chart

Line chart

What is the difference between the graphs of an empirical distribution function for metric discrete and metric continuous data?

In the case of continuous data, the empirical distribution function is approximated by a monotonically increasing polygon (piecewise linear) function

In the case of discrete data, the empirical distribution function is a monotonically increasing step function
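For classified continuous data, the polygon approximation can be sketched by interpolating linearly between the cumulated relative frequencies at the class limits. The class limits, the counts and the function name `polygon_cdf` are assumptions for illustration.

```python
def polygon_cdf(limits, counts):
    """Piecewise linear distribution function for classified data.

    limits: sorted class limits e_0 < ... < e_k
    counts: absolute frequency of each of the k classes
    """
    n = sum(counts)
    cumulative = [0]
    for c in counts:
        cumulative.append(cumulative[-1] + c)
    # Cumulated relative frequency reached at each class limit.
    cumulative = [c / n for c in cumulative]

    def F(x):
        if x <= limits[0]:
            return 0.0
        if x >= limits[-1]:
            return 1.0
        for i in range(1, len(limits)):
            if x < limits[i]:
                # Linear interpolation inside the class that contains x.
                share = (x - limits[i - 1]) / (limits[i] - limits[i - 1])
                return cumulative[i - 1] + share * (cumulative[i] - cumulative[i - 1])

    return F

# Hypothetical classes of body heights in cm.
F_poly = polygon_cdf([150, 160, 170, 180, 200], [1, 3, 4, 2])
for x in (150, 160, 165, 190, 200):
    print(x, F_poly(x))
```

In contrast to the step function for discrete data, this function rises continuously within each class, which implicitly assumes that the objects are spread evenly inside the class.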

What is the arithmetic mean and to what type of distribution parameter does it belong?

The arithmetic mean („average“) is a measure of location or central tendency

The arithmetic mean is the sum of the population's attribute values divided by the total number of elements in the population

The unit of measurement of the mean is the same as the unit of measurement of the elements

The arithmetic mean follows linear transformations of the raw data: if y_i = a + b * x_i, then the mean of y equals a + b * (mean of x). It is therefore neither location-invariant nor scale-invariant

The arithmetic mean is a measure of location
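The behaviour under linear transformations can be checked with a small Python sketch. The temperature readings are made up; the conversion from Celsius to Fahrenheit serves as the linear transformation y = 32 + 1.8 * x.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

# Hypothetical temperature readings in degrees Celsius.
celsius = [10.0, 20.0, 30.0]
# Linear transformation of every raw value: y = 32 + 1.8 * x.
fahrenheit = [32 + 1.8 * c for c in celsius]

mean_c = arithmetic_mean(celsius)
mean_f = arithmetic_mean(fahrenheit)
print(mean_c, mean_f)
# The mean of the transformed data equals the transformed mean.
print(abs(mean_f - (32 + 1.8 * mean_c)) < 1e-9)
```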
