Data Mining vs. Data Profiling
Data Mining: the process of discovering relevant information that has not been identified before. In data mining, raw data is converted into valuable information.
Data Profiling: done to evaluate a dataset for its uniqueness, logic, and consistency. It cannot identify inaccurate or incorrect data values.
Data Wrangling is the process wherein raw data is cleaned, structured, and enriched into a desired, usable format for better decision making. It involves discovering, structuring, cleaning, enriching, validating, and analyzing data. The process maps large amounts of data extracted from various sources into a more useful format. Techniques such as merging, grouping, concatenating, joining, and sorting are used to prepare the data, after which it is ready to be combined with other datasets.
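A minimal pandas sketch of these wrangling steps; the tables, column names, and values below are made-up for illustration:

```python
import pandas as pd

# Illustrative raw data from two hypothetical sources
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3],
    "revenue": [200, 150, 300, None, 120],
})
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Cleaning: drop the record with a missing revenue value
sales = sales.dropna(subset=["revenue"])

# Joining/merging: enrich the sales records with store metadata
enriched = sales.merge(stores, on="store_id", how="left")

# Grouping and sorting: aggregate revenue per region
summary = (enriched.groupby("region")["revenue"]
           .sum()
           .sort_values(ascending=False))
print(summary)
```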
Steps in a data analysis project
Understanding the Problem
Understand the business problem, define the organizational goals, and plan for a lucrative solution.
Collecting Data
Gather the right data from various sources and other information based on your priorities.
Cleaning Data
Clean the data to remove unwanted, redundant, and missing values, and make it ready for analysis.
Exploring and Analyzing Data
Use data visualization and business intelligence tools, data mining techniques, and predictive modeling to analyze data.
Interpreting the Results
Interpret the results to uncover hidden patterns and trends and turn them into actionable insights.
Best practices for cleaning data
Create a data cleaning plan by understanding where the common errors occur, and keep all communication channels open.
Before working with the data, identify and remove the duplicates. This will lead to an easy and effective data analysis process.
Focus on the accuracy of the data. Set cross-field validation, maintain the value types of data, and provide mandatory constraints.
Normalize the data at the entry point so that it is less chaotic. You will be able to ensure that all information is standardized, leading to fewer errors on entry.
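A short pandas sketch of some of these practices (duplicate removal, type handling, cross-field validation, and missing values); the column names and validation rules are illustrative assumptions:

```python
import pandas as pd

# Hypothetical raw customer records
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [25, 34, 34, None, 230],   # 230 is clearly invalid
    "signup_date": ["2023-01-05", "2023-02-10", "2023-02-10",
                    "2023-03-01", "2023-03-15"],
})

# Identify and remove duplicates before analysis
clean = raw.drop_duplicates()

# Maintain value types: parse dates explicitly
clean["signup_date"] = pd.to_datetime(clean["signup_date"])

# Validation / mandatory constraints: keep plausible ages only
clean = clean[clean["age"].between(0, 120) | clean["age"].isna()].copy()

# Handle missing values, e.g. fill age with the median
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```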
Descriptive vs. Predictive vs. Prescriptive Analytics
Descriptive: provides insights into the past to answer “what has happened”; uses data aggregation and data mining techniques. Example: an ice cream company can analyze how much ice cream was sold, which flavors were sold, and whether more or less ice cream was sold than the day before.
Predictive: understands the future to answer “what could happen”; uses statistical models and forecasting techniques. Example: forecasting how much ice cream is likely to be sold tomorrow based on past sales and the weather forecast.
Prescriptive: suggests various courses of action to answer “what should you do”; uses simulation algorithms and optimization techniques to advise on possible outcomes. Example: lowering prices to increase ice cream sales, or producing more/fewer quantities of a specific flavor.
Types of sampling techniques
Simple random sampling
Systematic sampling
Cluster sampling
Stratified sampling
Judgmental or purposive sampling
Univariate analysis is the simplest and easiest form of data analysis where the data being analyzed contains only one variable.
Example - Studying the heights of players in the NBA.
Univariate analysis can be described using Central Tendency, Dispersion, Quartiles, Bar charts, Histograms, Pie charts, and Frequency distribution tables.
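A small sketch of univariate summaries for a single variable (made-up player heights in centimetres):

```python
import numpy as np

# Hypothetical heights of basketball players (cm), a single variable
heights = np.array([198, 203, 210, 195, 201, 206, 199, 213, 208, 200])

# Central tendency
print("mean:", heights.mean())
print("median:", np.median(heights))

# Dispersion
print("std:", heights.std(ddof=1))
print("range:", heights.max() - heights.min())

# Quartiles
q1, q2, q3 = np.percentile(heights, [25, 50, 75])
print("quartiles:", q1, q2, q3)

# Frequency distribution (histogram counts per bin)
counts, bin_edges = np.histogram(heights, bins=4)
print("histogram counts:", counts, "bin edges:", bin_edges)
```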
Bivariate analysis involves the analysis of two variables to find causes, relationships, and correlations between the variables.
Example – Analyzing the sale of ice creams based on the temperature outside.
Bivariate analysis can be carried out using correlation coefficients, linear regression, logistic regression, scatter plots, and box plots.
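A minimal bivariate sketch for the ice cream example above (temperature vs. sales, made-up numbers):

```python
import numpy as np

# Hypothetical daily temperature (°C) and ice cream sales (units)
temperature = np.array([18, 21, 24, 27, 30, 33, 36])
sales = np.array([110, 135, 160, 190, 230, 260, 300])

# Correlation coefficient between the two variables
r = np.corrcoef(temperature, sales)[0, 1]
print(f"Pearson correlation: {r:.3f}")

# Simple linear regression: sales ≈ slope * temperature + intercept
slope, intercept = np.polyfit(temperature, sales, deg=1)
print(f"sales ≈ {slope:.1f} * temperature + {intercept:.1f}")
```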
Multivariate analysis involves the analysis of three or more variables to understand the relationship of each variable with the others.
Example – Analysing Revenue based on expenditure.
Multivariate analysis can be performed using Multiple regression, Factor analysis, Classification & regression trees, Cluster analysis, Principal component analysis, Dual-axis charts, etc.
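A minimal multiple-regression sketch with more than two variables (made-up expenditure and revenue figures, ordinary least squares via numpy):

```python
import numpy as np

# Hypothetical quarterly data: marketing spend, R&D spend, headcount -> revenue
marketing = np.array([10, 12, 15, 11, 18, 20])
rnd       = np.array([ 5,  6,  6,  7,  8,  9])
headcount = np.array([40, 42, 45, 44, 50, 53])
revenue   = np.array([120, 130, 150, 135, 170, 185])

# Design matrix with an intercept column
X = np.column_stack([np.ones(len(marketing)), marketing, rnd, headcount])

# Ordinary least squares: solve X @ beta ≈ revenue
beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
intercept, b_mkt, b_rnd, b_head = beta
print(f"revenue ≈ {intercept:.1f} + {b_mkt:.1f}*marketing "
      f"+ {b_rnd:.1f}*rnd + {b_head:.1f}*headcount")
```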
Normal Distribution refers to a continuous probability distribution that is symmetric about the mean. In a graph, normal distribution will appear as a bell curve.
The mean, median, and mode are equal
All of them are located in the center of the distribution
68% of the data falls within one standard deviation of the mean
95% of the data lies within two standard deviations of the mean
99.7% of the data lies within three standard deviations of the mean
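A quick numpy simulation of this 68-95-99.7 rule on a synthetic standard normal sample:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

mean, std = x.mean(), x.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(x - mean) <= k * std)
    print(f"within {k} standard deviation(s): {within:.3%}")
# Prints roughly 68.3%, 95.4%, 99.7%
```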
Time Series analysis is a statistical procedure that deals with the ordered sequence of values of a variable at equally spaced time intervals. Time series data are collected at adjacent periods. So, there is a correlation between the observations. This feature distinguishes time-series data from cross-sectional data.
Example: daily counts of new coronavirus cases recorded on consecutive days form a time series; plotting the counts against the date shows how cases develop over time.
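A small time-series sketch with synthetic daily case counts (pandas date index, equally spaced daily observations, and a rolling average to expose the trend):

```python
import numpy as np
import pandas as pd

# Synthetic daily case counts at equally spaced (daily) intervals
rng = np.random.default_rng(seed=1)
dates = pd.date_range("2020-03-01", periods=60, freq="D")
cases = pd.Series(
    (50 * np.exp(0.05 * np.arange(60)) + rng.normal(0, 20, 60)).round(),
    index=dates, name="new_cases",
)

# A 7-day rolling mean smooths the ordered observations
weekly_avg = cases.rolling(window=7).mean()
print(weekly_avg.tail())

# Lag-1 autocorrelation shows the dependence between adjacent days
print("lag-1 autocorrelation:", round(cases.autocorr(lag=1), 3))
```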
Overfitting vs. Underfitting
Overfitting: the model fits the training data very well, but its performance drops considerably on the test set. It happens when the model learns the random fluctuations and noise in the training dataset in detail.
Underfitting: the model neither fits the training data well nor generalizes to new data, so it performs poorly on both the training and the test set. It happens when there is too little data to build an accurate model, or when a linear model is fitted to non-linear data.
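A compact sketch contrasting the two, fitting polynomials of different degrees to noisy non-linear data and comparing train/test error (all data synthetic):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Synthetic non-linear data: y = sin(x) + noise
x = np.linspace(0, 3, 40)
y = np.sin(x) + rng.normal(0, 0.15, x.size)

# Simple train/test split (every other point)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def errors(degree):
    """Fit a polynomial of the given degree, return train and test MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 9):
    train_mse, test_mse = errors(degree)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# Degree 1 tends to underfit (poor on both sets); degree 9 tends to overfit
# (low train error, higher test error); degree 3 sits in between.
```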
To deal with outliers, you can use the following four methods:
Drop the outlier records
Cap your outlier data
Assign a new value
Try a new transformation
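A brief sketch of these four options using IQR-based bounds (illustrative data; which option is appropriate depends on why the outlier occurred):

```python
import numpy as np
import pandas as pd

values = pd.Series([12, 14, 13, 15, 14, 13, 95])   # 95 looks like an outlier

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 1. Drop the outlier records
dropped = values[values.between(lower, upper)]

# 2. Cap the outliers at the IQR bounds (winsorizing)
capped = values.clip(lower, upper)

# 3. Assign a new value, e.g. replace outliers with the median
replaced = values.where(values.between(lower, upper), values.median())

# 4. Try a transformation that compresses extreme values
transformed = np.log1p(values)

print(dropped.tolist(), capped.tolist(), replaced.tolist(), sep="\n")
```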
Hypothesis testing is the procedure used by statisticians and scientists to accept or reject statistical hypotheses. There are mainly two types of hypothesis testing:
Null hypothesis: It states that there is no relation between the predictor and outcome variables in the population. It is denoted by H0.
Example: There is no association between a patient’s BMI and diabetes.
Alternative hypothesis: It states that there is some relation between the predictor and outcome variables in the population. It is denoted by H1.
Example: There could be an association between a patient’s BMI and diabetes.
In Hypothesis testing, a Type I error occurs when the null hypothesis is rejected even if it is true. It is also known as a false positive.
A Type II error occurs when the null hypothesis is not rejected, even if it is false. It is also known as a false negative.
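A minimal two-sample t-test sketch with synthetic BMI-style data (scipy.stats assumed available). Rejecting H0 here when it is actually true would be a Type I error; failing to reject it when it is false would be a Type II error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Hypothetical BMI values for patients without and with diabetes
bmi_no_diabetes = rng.normal(loc=26, scale=4, size=50)
bmi_diabetes = rng.normal(loc=29, scale=4, size=50)

# H0: the two group means are equal; H1: they differ
t_stat, p_value = stats.ttest_ind(bmi_no_diabetes, bmi_diabetes)

alpha = 0.05  # the Type I error rate we are willing to accept
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```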
Clustering
In a non-machine learning context, clustering refers to the act of grouping together similar objects or entities based on their shared characteristics or properties. This concept is widely used in various fields, such as biology, sociology, and geography.
Outlier
An outlier is a data point or observation that significantly deviates from other data points in a dataset. In other words, an outlier is an observation that is markedly different from other observations in the same dataset and can be an extreme value that is either too large or too small. Outliers can occur due to a variety of reasons such as measurement errors, data entry errors, or natural variations in the data.
In statistical analysis, outliers can have a significant impact on the overall analysis and interpretation of the data. Outliers can affect the mean and standard deviation of a dataset and can also distort the overall pattern or trend in the data. Therefore, it is important to identify and handle outliers appropriately to avoid incorrect or biased conclusions.
There are several methods to detect outliers in a dataset, including graphical methods such as box plots, scatter plots, and histograms, and statistical methods such as z-scores, Mahalanobis distance, and interquartile range (IQR). Once outliers have been identified, they can be handled in a variety of ways, such as removing them from the dataset, replacing them with a more appropriate value, or treating them as a separate category in the analysis.
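A short sketch of two of the detection methods mentioned above, z-scores and the IQR rule, on illustrative data:

```python
import pandas as pd

data = pd.Series([10, 12, 11, 13, 12, 11, 10, 48])   # 48 is suspicious

# Z-score method: flag points far from the mean (2 or 3 std are common cut-offs)
z_scores = (data - data.mean()) / data.std()
z_outliers = data[z_scores.abs() > 2]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = data.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("z-score outliers:", z_outliers.tolist())
print("IQR outliers:", iqr_outliers.tolist())
```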
Confidence interval (95%)
A 95 percent confidence interval is a range of values that is calculated from a sample of data and is expected to contain the true value of a population parameter with a probability of 95 percent. In other words, it is an estimate of the range of values within which the population parameter is likely to lie, based on a sample of data.
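A minimal sketch of computing a 95% confidence interval for a sample mean (synthetic data; uses the t-distribution from scipy.stats):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
sample = rng.normal(loc=170, scale=10, size=40)   # e.g. heights in cm

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=sample.size - 1, loc=mean, scale=sem)

print(f"mean = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```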
What kinds of data exist?
nominal
ordinal
continuous
Residual
- The terms error, deviation, and residual are closely related and are often used interchangeably.
- A residual is the difference between an observed value and the value predicted by the model (in the simplest case, the difference from the mean).
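A tiny sketch of residuals from a simple linear fit (made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a line and compute residuals = observed - predicted
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted
print("residuals:", np.round(residuals, 3))

# For a mean-only model, the residuals are just the deviations from the mean
print("deviations from mean:", np.round(y - y.mean(), 3))
```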
P-Value
How likely is the observed data, given that the null hypothesis is true?
P-value, short for "probability value," is a measure of the evidence against a null hypothesis in a statistical hypothesis test. In statistical analysis, a null hypothesis is a statement that assumes there is no significant difference or relationship between two groups or variables being compared.
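A small simulation of what the p-value measures: when H0 is true (both groups come from the same distribution), p-values fall below 0.05 about 5% of the time, which is exactly the Type I error rate (scipy.stats assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)

n_experiments = 2000
false_positives = 0
for _ in range(n_experiments):
    # H0 is true: both samples come from the same distribution
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print("share of p < 0.05 under H0:", false_positives / n_experiments)  # ~0.05
```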
Systematic Sampling
Systematic sampling is a type of probability sampling method used in statistics to select a sample of individuals or items from a larger population. The population is arranged in some order, a sampling interval k is determined by dividing the population size by the desired sample size, a random starting point between 1 and k is chosen, and then every kth individual or item is selected for the sample.
For example, suppose we want to select a sample of 100 students from a school population of 1000 students. The sampling interval is k = 1000 / 100 = 10, so we choose a random starting point between 1 and 10 and then select every 10th student from that point until we have a sample of 100 students.
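A minimal systematic-sampling sketch over a hypothetical list of student IDs:

```python
import numpy as np

rng = np.random.default_rng(seed=6)

population = np.arange(1, 1001)      # student IDs 1..1000
sample_size = 100
k = len(population) // sample_size   # sampling interval: 10

start = rng.integers(0, k)           # random start within the first interval
sample = population[start::k]        # every k-th element from that point
print(len(sample), sample[:5])
```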
Cluster Sampling
Cluster sampling is a type of probability sampling method used in statistics to select a sample of individuals or items from a larger population. In cluster sampling, the population is first divided into non-overlapping groups or clusters, such as households, schools, or neighborhoods. Then, a random sample of clusters is selected from the population, and all individuals or items within the selected clusters are included in the sample.
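A short cluster-sampling sketch: hypothetical households grouped into neighbourhood clusters, a few whole clusters are drawn, and every member of a drawn cluster is included:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Hypothetical population: 200 households spread over 20 neighbourhoods
households = pd.DataFrame({
    "household_id": np.arange(200),
    "neighbourhood": rng.integers(0, 20, size=200),
})

# Randomly pick 4 whole neighbourhoods and keep all of their households
chosen = rng.choice(households["neighbourhood"].unique(), size=4, replace=False)
cluster_sample = households[households["neighbourhood"].isin(chosen)]
print(cluster_sample["neighbourhood"].value_counts())
```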
Stratified Sampling
Stratified sampling is a type of probability sampling method used in statistics to select a sample of individuals or items from a larger population. In stratified sampling, the population is first divided into non-overlapping subgroups or strata based on some relevant characteristic, such as age, gender, income, or geographic location. Then, a random sample is selected from each stratum, and the samples from each stratum are combined to form the final sample.
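A brief stratified-sampling sketch using pandas (hypothetical people stratified by gender; the same fraction is drawn from each stratum so the sample mirrors the population's proportions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=8)

people = pd.DataFrame({
    "person_id": np.arange(1000),
    "gender": rng.choice(["female", "male"], size=1000, p=[0.6, 0.4]),
})

# Draw 10% from each stratum and combine into the final sample
stratified_sample = people.groupby("gender").sample(frac=0.10, random_state=42)
print(stratified_sample["gender"].value_counts())
```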
Judgmental Sampling
Judgmental sampling is a non-probability sampling method used in statistics to select a sample of individuals or items from a larger population based on the judgment or expertise of the researcher or another knowledgeable person. In judgmental sampling, the sample is selected based on some criterion or purposeful selection rather than being selected randomly.