Data Mining

Data mining is the process of discovering relevant information that has not been identified before. In data mining, raw data is converted into valuable information. However, data mining cannot identify inaccurate or incorrect data values.

Data Profiling

Data profiling is done to evaluate a dataset for its uniqueness, logic, and consistency.

Data wrangling is the process in which raw data is cleaned, structured, and enriched into a usable format for better decision making. It involves discovering, structuring, cleaning, enriching, validating, and analyzing data. This process can transform and map large amounts of data extracted from various sources into a more useful format. Techniques such as merging, grouping, concatenating, joining, and sorting are used to analyze the data; afterwards, it is ready to be used with another dataset.
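The techniques named above (cleaning, joining, grouping, sorting) can be sketched in pandas; the sales and stores tables below are hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data: sales records and a store lookup table
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3],
    "amount": [100.0, None, 250.0, 80.0, 120.0],
})
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Cleaning: drop rows with missing values
sales = sales.dropna(subset=["amount"])

# Joining: enrich sales with store metadata
merged = sales.merge(stores, on="store_id", how="left")

# Grouping and sorting: total sales per region, highest first
by_region = (merged.groupby("region")["amount"]
                   .sum()
                   .sort_values(ascending=False))
print(by_region)
```

The wrangled result (totals per region) is now in a shape that can be combined with other datasets or fed into analysis.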

Understanding the Problem

Understand the business problem, define the organizational goals, and plan for a lucrative solution.

Collecting Data

Gather the right data from various sources and other information based on your priorities.

Cleaning Data

Clean the data to remove unwanted, redundant, and missing values, and make it ready for analysis.

Exploring and Analyzing Data

Use data visualization and business intelligence tools, data mining techniques, and predictive modeling to analyze data.

Interpreting the Results

Interpret the results to uncover hidden patterns, identify trends, and gain insights that support decision making.

Best Practices for Cleaning Data

Create a data cleaning plan by understanding where the common errors take place, and keep all communication channels open.

Before working with the data, identify and remove the duplicates. This will lead to an easy and effective data analysis process.

Focus on the accuracy of the data. Set cross-field validation, maintain the value types of data, and provide mandatory constraints.

Normalize the data at the entry point so that it is less chaotic. You will be able to ensure that all information is standardized, leading to fewer errors on entry.
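The cross-field validation and mandatory constraints mentioned above can be sketched as a small validator; the field names (name, start_date, end_date, age) and the rules are hypothetical:

```python
# Hypothetical record validation: mandatory fields, value types, cross-field checks
def validate(record):
    errors = []
    # Mandatory constraint: required fields must be present and non-empty
    for field in ("name", "start_date", "end_date"):
        if record.get(field) in (None, ""):
            errors.append(f"missing {field}")
    # Value type: age must be an integer if present
    if "age" in record and not isinstance(record["age"], int):
        errors.append("age must be an integer")
    # Cross-field validation: start must not come after end
    if record.get("start_date") and record.get("end_date"):
        if record["start_date"] > record["end_date"]:
            errors.append("start_date after end_date")
    return errors

bad = validate({"name": "A", "start_date": "2024-02-01",
                "end_date": "2024-01-01", "age": "30"})
print(bad)
```

Running such checks at the entry point catches errors before they propagate into analysis.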

Descriptive

Provides insights into the past to answer “What has happened?”

Uses data aggregation and data mining techniques.

Example: An ice cream company can analyze how much ice cream was sold, which flavors were sold, and whether more or less ice cream was sold than the day before.

Predictive

Looks into the future to answer “What could happen?”

Uses statistical models and forecasting techniques.

Example: The ice cream company can forecast how much ice cream is likely to be sold on a given day based on past sales.

Prescriptive

Suggests various courses of action to answer “What should you do?”

Uses simulation algorithms and optimization techniques to advise possible outcomes.

Example: Lower prices to increase the sale of ice cream, or produce more or fewer quantities of a specific flavor.
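The descriptive case can be sketched with plain aggregation; the per-flavor counts below are hypothetical:

```python
# Hypothetical daily ice cream sales (units per flavor) for two days
yesterday = {"vanilla": 120, "chocolate": 90, "mango": 40}
today = {"vanilla": 110, "chocolate": 130, "mango": 35}

# Descriptive analytics: aggregate what happened and compare to the day before
total_today = sum(today.values())
total_yesterday = sum(yesterday.values())
change = total_today - total_yesterday
best_seller = max(today, key=today.get)
print(total_today, change, best_seller)
```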

Simple random sampling

Systematic sampling

Cluster sampling

Stratified sampling

Judgmental or purposive sampling

Univariate analysis is the simplest form of data analysis, where the data being analyzed contains only one variable.

Example - Studying the heights of players in the NBA.

Univariate analysis can be described using Central Tendency, Dispersion, Quartiles, Bar charts, Histograms, Pie charts, and Frequency distribution tables.
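Central tendency and dispersion for a single variable can be computed with the standard library; the heights below are hypothetical:

```python
import statistics

# Hypothetical heights (cm) of a small squad of players
heights = [198, 201, 206, 206, 211, 213, 216]

# Central tendency
mean = statistics.mean(heights)
median = statistics.median(heights)
mode = statistics.mode(heights)

# Dispersion
stdev = statistics.stdev(heights)
spread = max(heights) - min(heights)  # range
print(round(mean, 2), median, mode, round(stdev, 1), spread)
```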

Bivariate analysis involves the analysis of two variables to find relationships and correlations between them.

Example – Analyzing the sale of ice creams based on the temperature outside.

Bivariate analysis can be performed using Correlation coefficients, Linear regression, Logistic regression, Scatter plots, and Box plots.

Multivariate analysis involves the analysis of three or more variables to understand the relationship of each variable with the other variables.

Example – Analyzing revenue based on expenditure.

Multivariate analysis can be performed using Multiple regression, Factor analysis, Classification & regression trees, Cluster analysis, Principal component analysis, Dual-axis charts, etc.
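A multiple regression can be sketched with NumPy's least-squares solver; the predictors (hypothetical ad spend and staff count) and the noise-free revenue formula are invented for illustration:

```python
import numpy as np

# Hypothetical data: revenue explained by two predictors (ad spend, staff count)
ad_spend = np.array([10, 20, 30, 40, 50], dtype=float)
staff = np.array([3, 4, 4, 5, 6], dtype=float)
revenue = 2.0 * ad_spend + 5.0 * staff + 7.0  # noise-free, so exactly recoverable

# Multiple regression via least squares: revenue ≈ b0 + b1*ad_spend + b2*staff
X = np.column_stack([np.ones_like(ad_spend), ad_spend, staff])
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
print(np.round(coef, 2))
```

Because the revenue here is generated without noise, least squares recovers the coefficients (7, 2, 5) exactly.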

Normal Distribution refers to a continuous probability distribution that is symmetric about the mean. In a graph, normal distribution will appear as a bell curve.

The mean, median, and mode are equal

All of them are located in the center of the distribution

68% of the data falls within one standard deviation of the mean

95% of the data falls within two standard deviations of the mean

99.7% of the data falls within three standard deviations of the mean
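The 68–95–99.7 rule can be checked empirically by simulating normal data with the standard library (the mean of 100 and standard deviation of 15 are arbitrary):

```python
import random
import statistics

# Simulate normally distributed data
random.seed(0)
data = [random.gauss(mu=100, sigma=15) for _ in range(100_000)]
mean = statistics.fmean(data)
sd = statistics.stdev(data)

def share_within(k):
    # Fraction of observations within k standard deviations of the mean
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

print(round(share_within(1), 3), round(share_within(2), 3), round(share_within(3), 3))
```

The three printed fractions come out close to 0.68, 0.95, and 0.997, matching the rule.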

Time series analysis is a statistical procedure that deals with an ordered sequence of values of a variable at equally spaced time intervals. Because time-series data are collected at adjacent periods, the observations tend to be correlated with each other. This feature distinguishes time-series data from cross-sectional data.

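A common first step with time-series data is smoothing it with a moving average; the daily counts below are made up for illustration:

```python
# Hypothetical daily case counts at equally spaced (daily) intervals
cases = [10, 12, 15, 20, 26, 34, 44, 55, 70, 88]

# A 3-day moving average smooths short-term fluctuations to show the trend
window = 3
moving_avg = [
    sum(cases[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(cases))
]
print([round(v, 1) for v in moving_avg])
```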

Overfitting

The model fits the training data well, but its performance drops considerably on the test set. This happens when the model learns the random fluctuations and noise in the training dataset in detail.

Underfitting

The model neither fits the training data well nor generalizes to new data, so it performs poorly on both the training and the test set. This happens when there is too little data to build an accurate model, or when we try to fit a linear model to non-linear data.

To deal with outliers, you can use the following four methods:

Drop the outlier records

Cap your outliers data

Assign a new value

Try a new transformation
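Capping outliers at IQR fences (winsorizing) can be sketched as follows; the sample values and the conventional 1.5 × IQR multiplier are illustrative:

```python
import statistics

# Hypothetical sample with one extreme value
values = [12, 13, 13, 14, 15, 15, 16, 95]

# IQR fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping (winsorizing): clamp outliers to the fences instead of dropping them
capped = [min(max(v, low), high) for v in values]
print(capped)
```

Capping keeps the sample size intact while limiting the influence of the extreme value; dropping or transforming are the alternatives listed above.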

Hypothesis testing is the procedure used by statisticians and scientists to reject or fail to reject statistical hypotheses. There are two main types of hypotheses:

Null hypothesis: It states that there is no relation between the predictor and outcome variables in the population. It is denoted by H0.

Example: There is no association between a patient’s BMI and diabetes.

Alternative hypothesis: It states that there is some relation between the predictor and outcome variables in the population. It is denoted by H1.

Example: There could be an association between a patient’s BMI and diabetes.

In Hypothesis testing, a Type I error occurs when the null hypothesis is rejected even if it is true. It is also known as a false positive.

A Type II error occurs when the null hypothesis is not rejected, even if it is false. It is also known as a false negative.

Clustering

In a non-machine learning context, clustering refers to the act of grouping together similar objects or entities based on their shared characteristics or properties. This concept is widely used in various fields, such as biology, sociology, and geography.

Outlier

An outlier is a data point or observation that significantly deviates from other data points in a dataset. In other words, an outlier is an observation that is markedly different from other observations in the same dataset and can be an extreme value that is either too large or too small. Outliers can occur due to a variety of reasons such as measurement errors, data entry errors, or natural variations in the data.

In statistical analysis, outliers can have a significant impact on the overall analysis and interpretation of the data. Outliers can affect the mean and standard deviation of a dataset and can also distort the overall pattern or trend in the data. Therefore, it is important to identify and handle outliers appropriately to avoid incorrect or biased conclusions.

There are several methods to detect outliers in a dataset, including graphical methods such as box plots, scatter plots, and histograms, and statistical methods such as z-scores, Mahalanobis distance, and interquartile range (IQR). Once outliers have been identified, they can be handled in a variety of ways, such as removing them from the dataset, replacing them with a more appropriate value, or treating them as a separate category in the analysis.
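Of the statistical detection methods named above, the z-score approach is the simplest to sketch (the threshold of 2 is a common but arbitrary choice; the data are made up):

```python
import statistics

# Hypothetical dataset with one suspicious value
data = [21, 22, 20, 23, 21, 22, 24, 20, 22, 58]

mean = statistics.fmean(data)
sd = statistics.stdev(data)

# Flag points whose z-score magnitude exceeds the threshold
outliers = [x for x in data if abs((x - mean) / sd) > 2]
print(outliers)
```

Note that extreme values inflate the mean and standard deviation they are judged against, which is one reason robust alternatives such as IQR fences are also used.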


Confidence Interval (95%)

A 95 percent confidence interval is a range of values that is calculated from a sample of data and is expected to contain the true value of a population parameter with a probability of 95 percent. In other words, it is an estimate of the range of values within which the population parameter is likely to lie, based on a sample of data.
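A 95% confidence interval for a mean can be sketched with the normal approximation, mean ± 1.96 × standard error (for samples this small a t-based interval would be more appropriate); the measurements are hypothetical:

```python
import statistics
import math

# Hypothetical sample measurements
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))

# 95% CI using the normal approximation: mean ± 1.96 * standard error
low, high = mean - 1.96 * se, mean + 1.96 * se
print(round(low, 3), round(high, 3))
```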

What kinds of data exist?

Nominal

Ordinal

Continuous

Residual

- The terms error, deviance, and residual are often used interchangeably, but they are not strictly identical: an error is the deviation of an observation from the true (population) value, while a residual is the deviation from the fitted value.

- In the simplest case (a mean-only model), the residual is the difference between an observation and the sample mean.

P-Value

How likely the observed data is, given that the null hypothesis is true.

P-value, short for "probability value," is a measure of the evidence against a null hypothesis in a statistical hypothesis test. In statistical analysis, a null hypothesis is a statement that assumes there is no significant difference or relationship between two groups or variables being compared.
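A worked p-value sketch: a two-sided z-test of whether a coin is fair, using only the standard library (the counts are hypothetical):

```python
import math

# Hypothetical experiment: is a coin fair? Null hypothesis: p(heads) = 0.5
n, heads = 1000, 545

# Normal approximation to the binomial under H0
p0 = 0.5
se = math.sqrt(p0 * (1 - p0) / n)
z = (heads / n - p0) / se

# Two-sided p-value from the standard normal CDF (computed via erf)
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 4))
```

Here the p-value is the probability of seeing a deviation at least this large if the coin really were fair; a small value is evidence against the null hypothesis.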

Systematic Sampling

Systematic sampling is a type of probability sampling method used in statistics to select a sample of individuals or items from a larger population. A random starting point is chosen within the first interval, and then every kth individual or item in the population is selected, where the sampling interval k is determined by dividing the population size by the desired sample size.

For example, suppose we want to select a sample of 100 students from a school population of 1000 students. The sampling interval is k = 1000 / 100 = 10. We would choose a random starting point among the first 10 students and then select every 10th student from that point on until we have a sample of 100 students.
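The example above can be sketched directly with Python slicing; the seed is arbitrary:

```python
import random

# Hypothetical population of 1000 student IDs; target sample of 100
population = list(range(1, 1001))
sample_size = 100
k = len(population) // sample_size  # sampling interval: every 10th student

random.seed(42)
start = random.randrange(k)   # random start within the first interval
sample = population[start::k]  # then every kth element
print(len(sample), sample[:5])
```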

Cluster Sampling

Cluster sampling is a type of probability sampling method used in statistics to select a sample of individuals or items from a larger population. In cluster sampling, the population is first divided into non-overlapping groups or clusters, such as households, schools, or neighborhoods. Then, a random sample of clusters is selected from the population, and all individuals or items within the selected clusters are included in the sample.

Stratified Sampling

Stratified sampling is a type of probability sampling method used in statistics to select a sample of individuals or items from a larger population. In stratified sampling, the population is first divided into non-overlapping subgroups or strata based on some relevant characteristic, such as age, gender, income, or geographic location. Then, a random sample is selected from each stratum, and the samples from each stratum are combined to form the final sample.
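A proportional stratified sample can be sketched as follows; the population, the "gender" stratum, and the 10% sampling fraction are all hypothetical:

```python
import random

# Hypothetical population with a 'gender' stratum
random.seed(7)
population = [{"id": i, "gender": "F" if i % 5 < 3 else "M"} for i in range(1000)]

# Divide the population into non-overlapping strata
strata = {}
for person in population:
    strata.setdefault(person["gender"], []).append(person)

# Draw 10% from each stratum, then combine into the final sample
sample = []
for group in strata.values():
    sample.extend(random.sample(group, k=len(group) // 10))
print(len(sample))
```

Because each stratum is sampled at the same rate, the sample preserves the population's proportions of the stratifying characteristic.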

Judgmental Sampling

Judgmental sampling is a non-probability sampling method used in statistics to select a sample of individuals or items from a larger population based on the judgment or expertise of the researcher or another knowledgeable person. In judgmental sampling, the sample is selected based on some criterion or purposeful selection rather than being selected randomly.
