Basic concepts of statistics

Statistics for AI

by Noel K.

What is statistics?

Statistics is the science of collecting, organizing, presenting and interpreting data
In the field of statistics one learns from data

What is statistics useful for, and what is the common procedure of a statistical task?

Statistics enables exploration and visualization of large and complicated datasets
Statistics compresses data to extract useful information and summarize data
Statistics models real world applications (e.g. radioactive decay)
Statistics estimates and predicts unknown parameters or quantities
Statistics tests research questions and hypotheses
Common procedure
- Explore
- Summarize
- Model
- Estimate
- Test

Why is it important to learn statistics?

Solving your own statistical problems
Understanding statistical methods in scientific papers
Being comfortable and competent around data and uncertainty
Statistics is the foundation of scientific research and part of our daily life

How does the media use statistics to twist or hide facts?

Presenting polls whose percentages sum up to more than 100%
Asking imprecise or unrelated questions in polls
Presenting graphs with inconsistent time intervals
Comparing statistical maps that use different scales for their categories
Presenting data on manipulated scales
Confusing correlation and causation
Turning graphs upside-down

What is data in statistics?

Data refers to numerical facts

What is a model in statistics?

A model is a system of assumptions and equations that describes the data you are interested in

What is statistical hypothesis testing?

Statistical hypothesis testing is the use of data to decide between different possibilities

What are the two main categories in statistics and how do they differ?

Descriptive statistics (empirical statistics)
- Given data is described and summarized to gain more information
- Typical descriptive methods are tables, graphs, charts and summary statistics
Inductive statistics (mathematical/inferential statistics)
- Given data is used to predict or answer research questions
- Draw conclusions from a sample and generalize them to a population
- Probability theory is often used with inductive statistics

What are the two key elements of combinatorics?

Permutation: How many possibilities exist to arrange n elements in different sequences?
Combination: How many possibilities exist to select k elements from a set of n elements?
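
A minimal sketch of both counts using Python's standard library. It assumes arrangements without repetition and selections in which order does not matter; the values n = 5 and k = 2 are illustrative only.

```python
import math

n = 5  # illustrative number of elements (assumption)
k = 2  # illustrative number of elements to select (assumption)

# Permutation: number of ways to arrange n distinct elements in a sequence: n!
permutations_of_n = math.factorial(n)

# Combination: number of ways to select k of n elements, order ignored:
# n! / (k! * (n - k)!)
combinations_n_k = math.comb(n, k)

print(permutations_of_n)  # 120
print(combinations_n_k)   # 10
```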

What is important regarding data collection and what two different ways of data collection exist?

Important for data collection
- Data collection should be objective (independent of the person who is collecting the data)
- Data collection should be valid (precise measurement of what is needed)
- Data collection should be reliable (it should be replicable under constant conditions)
Two different ways of collecting data
- Primary data ("field research"): firsthand collection of data by a researcher through observations, experiments or surveys
- Secondary data ("desk research"): data has already been collected by someone else (e.g. government organizations) and is available (e.g. through publications, journals, newspapers, …)

What is the basic terminology of statistics?

Empirical population: a finite set of objects, which are clearly (spatially, temporally, objectively) defined, e.g. the students who are sitting in HS7 at 13:00
Sample: a selection of objects from a population, e.g. the students who sit in the first row
Observational unit: entity whose characteristics are measured, e.g. the students whose grades are to be statistically analyzed
Attribute: a characteristic or feature that is measured for each observational unit, e.g. the grade of each student
Attribute value: the specific measured or observed value or the specific characteristic of an object, e.g. each student has one grade in the range of 1-5
Parameters: the "true values" of a population, which can be estimated by a sample statistic, e.g. based on a sample of students, it is estimated that the average grade is 2.5
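
The relation between population, sample, parameter and sample statistic can be sketched as follows. The grades are randomly generated, made-up data, and the population size of 200 and sample size of 30 are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: grades (1-5) of 200 students (made-up data)
population = [random.randint(1, 5) for _ in range(200)]

# Parameter: the "true value", computed from the whole population
population_mean = statistics.mean(population)

# Sample: a random selection of 30 observational units from the population
sample = random.sample(population, 30)

# Sample statistic: an estimate of the parameter based on the sample only
sample_mean = statistics.mean(sample)

print("parameter (population mean):", population_mean)
print("estimate (sample mean):", sample_mean)
```

In practice the population mean is usually unknown, which is why it is estimated from the sample.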

What are the different levels of measurements (scales of measurements) and how do they differ?

Nominal data: only categories with no meaningful order, e.g. color, gender, origin
Ordinal data: meaningful order, ranking according to this order is possible and can be used to analyze the data, e.g. job classification, bond rating, school grade
Quantitative: data is observed/counted or measured ("numbers with a scale unit")
- quantitative-discrete data: only values from a fixed list of numbers can be assumed, points on the number line
- quantitative-continuous data: all values from a "continuum" are possible, an interval on the number line (e.g. measuring the weight of an apple in grams ➔ theoretically the weight can be measured with infinite precision)

What is the key difference between probability and statistics?

Statistics: Presentation of the data and generalization of the data to the "real world"
Probability: What if we know how the world works? What kind of data and results can we expect?

What is important to consider regarding the quality of data collection?

Does the source of the data make money on it?
Is the raw data available?
Are the respondents selected at random?
Does the interviewer use suggestive questions?
Does an independent confirmation exist?

What is the difference between disjoint and complete attributes?

Based on an attribute, the population can be divided into classes so that this classification is:
- disjoint, i.e. no object may fall into several classes
- complete, i.e. each object must fall into at least one class (together with disjointness: into exactly one class)
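
Both properties can be checked with set operations. The sketch below is a minimal illustration; the names, the classes and the helper functions `is_disjoint` and `is_complete` are hypothetical.

```python
# Hypothetical population and a classification by the attribute "exam result"
population = {"Anna", "Ben", "Cleo", "Dan"}
classes = {
    "passed": {"Anna", "Ben", "Cleo"},
    "failed": {"Dan"},
}


def is_disjoint(classes):
    """True if no object falls into more than one class."""
    seen = set()
    for members in classes.values():
        if seen & members:  # an object was already placed in another class
            return False
        seen |= members
    return True


def is_complete(classes, population):
    """True if every object of the population falls into some class."""
    covered = set().union(*classes.values())
    return population <= covered


print(is_disjoint(classes))              # True
print(is_complete(classes, population))  # True
```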

What is the difference between interval and ratio data?

Interval scale: the zero point is defined subjectively (e.g. calendar date, …); only addition and subtraction are meaningful
Ratio scale: the zero point is defined objectively (e.g. scale units in physics, …); addition, subtraction, multiplication and division are meaningful
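
A small sketch of why ratios are not meaningful on an interval scale. Temperature is used here as an assumed example (it does not appear in the cards): degrees Celsius have a conventionally chosen zero point, whereas Kelvin has a physically defined one.

```python
def celsius_to_kelvin(celsius):
    """Shift the zero point from the Celsius convention to absolute zero."""
    return celsius + 273.15


a_c, b_c = 10.0, 20.0  # two illustrative temperatures in degrees Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences do not depend on where the zero point is placed
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios do depend on the zero point, so "twice as warm" only makes
# sense on the ratio scale (Kelvin)
ratio_c = b_c / a_c
ratio_k = b_k / a_k

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```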

How are the following terms in statistics defined?

Multivariate data
Raw data list
Stock data
Flow data

Multivariate data: In contrast to univariate data, several pieces of information about an object are recorded and analyzed simultaneously (e.g. height and weight)
Raw data list: Original uncompressed recording of all information regarding a population
Stock data: Data is measured at one specific time point and represents a quantity existing at that point in time
Flow data: Data is measured over an interval of time

What is the difference between intensive and extensive data?

Extensive data: The sum of all the data leads to useful information, e.g. all tech companies in the US combined have an income of 6 billion

Intensive data: The sum of all the data leads to useless information, but the average of the data contains useful information, e.g. the average height of a Google employee is 1.82 m
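
As a rough illustration with made-up numbers (not real figures), the sum is the natural summary for extensive data and the mean for intensive data:

```python
import statistics

# Extensive data: made-up incomes of three companies, in billions
incomes = [2.0, 3.0, 1.0]
total_income = sum(incomes)  # the sum is meaningful

# Intensive data: made-up heights of three employees, in metres
heights = [1.70, 1.82, 1.94]
mean_height = statistics.mean(heights)  # the mean is meaningful
total_height = sum(heights)             # computable, but not informative

print(total_income)
print(round(mean_height, 2))
```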

How is the absolute frequency defined?

Absolute frequency: Number of times that a specific attribute value occurs in a population, which is divided into classes by this attribute
All absolute frequencies sum up to the size of the population
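
A minimal sketch of counting absolute frequencies with `collections.Counter`; the list of grades is made-up example data.

```python
from collections import Counter

# Made-up population: the grades (1-5) of ten students
grades = [1, 2, 2, 3, 3, 3, 4, 5, 2, 3]

# Absolute frequency: how often each attribute value occurs
absolute_frequency = Counter(grades)

for value in sorted(absolute_frequency):
    print(value, absolute_frequency[value])

# The absolute frequencies sum up to the size of the population
total = sum(absolute_frequency.values())
print(total)  # 10
```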

How is the relative frequency defined?

Relative frequency: Result of dividing the absolute frequency of a specific attribute value by the size of the total population. The relative frequency is the absolute frequency normalized by the total number of events
All relative frequencies sum up to 100%
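
Relative frequencies follow from the absolute ones by dividing by the population size. The sketch below again uses made-up grades.

```python
from collections import Counter

# Made-up population: the grades (1-5) of ten students
grades = [1, 2, 2, 3, 3, 3, 4, 5, 2, 3]
n = len(grades)  # size of the population

# Relative frequency = absolute frequency / size of the population
relative_frequency = {
    value: count / n for value, count in sorted(Counter(grades).items())
}

for value, share in relative_frequency.items():
    print(value, f"{share:.0%}")

# Up to floating-point rounding, the relative frequencies sum up to 1 (100%)
total_share = sum(relative_frequency.values())
```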

How is the cumulative frequency defined?

Cumulative frequency: Sum of the absolute frequencies of all attribute values less than or equal to a specific attribute value. If the relative frequencies are used instead of the absolute frequencies, the result is called relative cumulative frequency
Cumulative frequencies provide useful information only if the data has at least ordinal scale
The cumulative frequency of the largest attribute value equals the size of the population
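
Cumulative frequencies can be sketched as a running sum over the sorted attribute values. The grades are made-up data and are treated as at least ordinal.

```python
from collections import Counter
from itertools import accumulate

# Made-up population: the grades (1-5) of ten students
grades = [1, 2, 2, 3, 3, 3, 4, 5, 2, 3]
counts = Counter(grades)
values = sorted(counts)  # sorting requires at least an ordinal scale

# Cumulative frequency: running sum of the absolute frequencies
cumulative = dict(zip(values, accumulate(counts[v] for v in values)))

# Relative cumulative frequency: the same running sum as a share of the total
relative_cumulative = {v: c / len(grades) for v, c in cumulative.items()}

print(cumulative)           # {1: 1, 2: 4, 3: 8, 4: 9, 5: 10}
print(relative_cumulative)
```

The value for the largest grade is 10, the size of the population, and its relative counterpart is 1.0.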

Information

Last changed
3 years ago
