In statistics, outliers are values that differ significantly from the other values in a sample or population. They should be examined very carefully before starting any marketing or modeling analysis.
Outliers can result from measurement error, data collection bias or other factors that are not representative of the population. They can influence the results of a statistical analysis and may introduce bias.
In some sample-based studies, outliers can also be weak signals and should then be treated as such.
Outliers can therefore affect the validity and reliability of the results of a survey-based opinion study by distorting the descriptive statistics. These distortions bias inferential analyses and affect the extrapolation of findings from the sample to the entire original population.
In an artificial intelligence modeling context, outliers can complicate the learning phase, for instance by increasing both the volume of data required and the training time well beyond the initial estimates.
Therefore, it is important to detect and process outliers before including them in a statistical analysis or modeling phase.
✓ How to detect outliers?
There are several methods for detecting outliers in data sets, including:
✓ Visual examination: plot the data to spot abnormal values, using graphs such as a normal Q-Q plot, box plot or scatter plot...
✓ Use of thresholds: define thresholds for outliers using dispersion statistics such as the standard deviation or the interquartile range...
✓ Statistical analysis: use statistical tests to detect outliers, such as normality tests or extreme value tests (for example Grubbs' test, which relies on Student's t distribution)...
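The threshold approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `iqr_outliers` and `zscore_outliers`, the multiplier `k=1.5`, the cutoff `threshold=3.0` and the sample values are common conventions chosen for the example, not rules taken from this article.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

# Illustrative data only: nine ordinary values and one extreme value.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]

print(iqr_outliers(sample))                    # [95]
print(zscore_outliers(sample))                 # []
print(zscore_outliers(sample, threshold=2.5))  # [95]
```

On this small sample the interquartile rule flags 95, while the standard deviation rule with a cutoff of 3 flags nothing: the extreme value inflates the standard deviation used to judge it. The two thresholds are therefore not interchangeable, and the cutoff should be chosen with the sample size in mind.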
✓ Once detected, how should we deal with these outliers?
The choice of method should depend on the context and purpose of the analysis.
✓ Data collection error: correct the error at its source if you have the means to do so.
If correction is not possible, you can first replace the erroneous value with a missing value.
✓ Plausible data: the identified values are probably outliers, but they remain plausible despite their very low frequency in the parent population.
The decision then depends on the objective and purpose of the analysis.
☛ Study and reporting: these values should be kept in the data and can be labeled as weak signals.
☛ Modeling or regression: these data should probably be excluded from the initial models, so that the analysis focuses on the majority and core target.
As a second step, we can consider creating a dedicated model to deal with these outliers or weak signals. This second model will require an additional data collection step to have enough data to carry out this new analysis.
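The two decision rules above can be sketched as follows. This is a minimal illustration, not a prescribed method: the helpers `iqr_bounds`, `label_weak_signals` and `split_for_modeling` are hypothetical names, and using the interquartile fences with `k=1.5` to decide what counts as an outlier is an assumption made for the example.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def label_weak_signals(values):
    """Study and reporting: keep every value and attach a label to it."""
    low, high = iqr_bounds(values)
    return [(v, "weak signal" if v < low or v > high else "core") for v in values]

def split_for_modeling(values):
    """Modeling: separate the core values from those set aside for a later, dedicated model."""
    labeled = label_weak_signals(values)
    core = [v for v, tag in labeled if tag == "core"]
    set_aside = [v for v, tag in labeled if tag == "weak signal"]
    return core, set_aside

# Illustrative data only.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]

report_view = label_weak_signals(sample)
core, set_aside = split_for_modeling(sample)

print(report_view[-1])  # (95, 'weak signal')
print(core)             # the nine core values used for the initial model
print(set_aside)        # [95]
```

Keeping the values that were set aside in their own list, rather than deleting them, leaves them available for the dedicated second model once enough additional data has been collected.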
At this point, some of your variables probably contain missing values in place of the outliers identified earlier.
You can either analyze these variables with these missing data, or you can use imputation or missing value estimation techniques. These imputation tools will be the subject of a new article.
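Pending that fuller treatment, the two options can be sketched briefly. The helper names `mask_values`, `mean_ignoring_missing` and `impute_median` are hypothetical, `None` stands in for a missing value, and median imputation is only one simple technique picked for illustration, not a recommendation from this article.

```python
import statistics

def mask_values(values, flagged):
    """Replace every flagged value with None, used here as the missing-value marker."""
    flagged = set(flagged)
    return [None if v in flagged else v for v in values]

def mean_ignoring_missing(values):
    """Option 1: analyze the variable using only the observed values."""
    observed = [v for v in values if v is not None]
    return statistics.fmean(observed)

def impute_median(values):
    """Option 2: fill each missing value with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Illustrative data only: 95 is assumed to have been flagged earlier.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]
masked = mask_values(sample, flagged=[95])

print(masked[-1])                     # None
print(mean_ignoring_missing(masked))  # mean of the nine observed values
print(impute_median(masked)[-1])      # 16
```

Note that `mask_values` matches by value, so every occurrence of a flagged value is masked; a real pipeline would more likely flag observations by position or identifier.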
In this article, we briefly covered several tools for identifying outliers, followed by some decision rules for handling them in your datasets.
If you face any issues and need support to understand and manage outliers in your datasets, please contact us.