How can we detect outliers?

Single Versus Multiple Outliers. Some outlier tests are designed to detect the presence of a single outlier, while other tests are designed to detect the presence of multiple outliers. It is not appropriate to apply a test for a single outlier sequentially in order to detect multiple outliers. In addition, some tests that detect multiple outliers may require that you specify the number of suspected outliers exactly.

Masking and Swamping. Masking can occur when we specify too few outliers in the test.

For example, if we are testing for a single outlier when there are in fact two or more outliers, these additional outliers may influence the value of the test statistic enough so that no points are declared as outliers.

On the other hand, swamping can occur when we specify too many outliers in the test. For example, if we are testing for two or more outliers when there is in fact only a single outlier, both points may be declared outliers, since many tests declare either all or none of the tested points as outliers.

Due to the possibility of masking and swamping, it is useful to complement formal outlier tests with graphical methods.

Graphics can often help identify cases where masking or swamping may be an issue. Swamping and masking are also the reason that many tests require the exact number of outliers being tested to be specified. Masking is likewise one reason that applying a single-outlier test sequentially can fail.

For example, if there are multiple outliers, masking may cause the outlier test for the first outlier to return a conclusion of no outliers, and so the testing for any additional outliers is not performed.

Z-Scores and Modified Z-Scores. A Z-score expresses a data point in units of how many standard deviations it lies from the mean. Because the sample mean and standard deviation are themselves strongly affected by outliers, a modified Z-score based on the median and the median absolute deviation (MAD) is more robust; Iglewicz and Hoaglin recommend labeling points whose modified Z-score has an absolute value greater than 3.5 as potential outliers.

Formal Outlier Tests. A number of formal outlier tests have been proposed in the literature.
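As a concrete illustration (a minimal sketch; the sample data and the 3.5 cutoff applied here are for demonstration only), modified Z-scores can be computed with NumPy:

```python
import numpy as np

def modified_z_scores(x):
    """Modified Z-scores based on the median and the MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # median absolute deviation
    # 0.6745 makes the MAD consistent with the standard deviation
    # for normally distributed data
    return 0.6745 * (x - med) / mad

data = [9.8, 10.1, 10.0, 9.9, 10.2, 15.0]
scores = modified_z_scores(data)
# Flag points whose modified Z-score exceeds 3.5 in absolute value
outliers = [v for v, s in zip(data, scores) if abs(s) > 3.5]
```

Here only the value 15.0 is flagged; the regular points all have modified Z-scores close to zero.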

These can be grouped by the following characteristics: What is the distributional model for the data? We restrict our discussion to tests that assume the data follow an approximately normal distribution. Is the test designed for a single outlier or is it designed for multiple outliers? If the test is designed for multiple outliers, does the number of outliers need to be specified exactly or can we specify an upper bound for the number of outliers? The following are a few of the more commonly used outlier tests for normally distributed data.

This list is not exhaustive; a large number of outlier tests have been proposed in the literature. The tests given here are essentially based on the criterion of "distance from the mean". This is not the only criterion that could be used. For example, the Dixon test, which is not discussed here, is based on a value being too large or small compared to its nearest neighbor.

Grubbs' Test - this is the recommended test when testing for a single outlier.
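A rough sketch of the two-sided Grubbs' test (the dataset and 5% significance level below are made up for illustration; the critical value uses the standard t-distribution formula):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier in approximately
    normal data. Returns (suspect value, True if declared an outlier)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    idx = np.argmax(np.abs(x - mean))          # most extreme point
    g = abs(x[idx] - mean) / sd                # test statistic
    # Critical value based on the t distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return x[idx], g > g_crit

data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 14.0]
suspect, is_outlier = grubbs_test(data)
```

Note that Grubbs' test should be applied only once per data set; as discussed above, running it sequentially to hunt for further outliers is not appropriate.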

Tietjen-Moore Test - this is a generalization of Grubbs' test to the case of more than one outlier. It has the limitation that the number of outliers must be specified exactly.

In scikit-learn, the ensemble.IsolationForest and neighbors.LocalOutlierFactor estimators perform reasonably well on the data sets considered here. The svm.OneClassSVM is known to be sensitive to outliers and thus does not perform very well for outlier detection.

That being said, outlier detection in high dimensions, or without any assumptions on the distribution of the inlying data, is very challenging. The svm.OneClassSVM may still be used for outlier detection, but it requires fine-tuning of its hyperparameter nu to handle outliers and prevent overfitting.

The linear_model.SGDOneClassSVM implementation is here used with a kernel approximation technique to obtain results similar to svm.OneClassSVM. Finally, covariance.EllipticEnvelope assumes the data is Gaussian and learns an ellipse. For more details on the different estimators, refer to the example Comparing anomaly detection algorithms for outlier detection on toy datasets and the sections hereunder.

See Comparing anomaly detection algorithms for outlier detection on toy datasets for a comparison of the svm.OneClassSVM, the ensemble.IsolationForest, the neighbors.LocalOutlierFactor and the covariance.EllipticEnvelope.

Consider a data set of observations from the same distribution, and suppose we now add one more observation to that data set. Is the new observation so different from the others that we can doubt it is regular? Or, on the contrary, is it so similar to the others that we cannot distinguish it from the original observations? This is the question addressed by novelty detection tools and methods.

Then, if further observations lie within the frontier-delimited subspace, they are considered as coming from the same population as the initial observations. Otherwise, if they lie outside the frontier, we can say that they are abnormal, with a given confidence in our assessment.

Novelty detection is implemented in the svm.OneClassSVM object. It requires the choice of a kernel and a scalar parameter to define a frontier. The RBF kernel is usually chosen, although there exists no exact formula or algorithm to set its bandwidth parameter; this is the default in the scikit-learn implementation. The nu parameter, also known as the margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the frontier.
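A minimal novelty-detection sketch with svm.OneClassSVM (the synthetic data and the gamma/nu values are illustrative choices, not recommendations):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)            # regular observations
X_test_regular = 0.3 * rng.randn(20, 2)      # new, similar observations
X_test_abnormal = rng.uniform(low=-4, high=4, size=(20, 2))  # novel points

# nu ~ expected fraction of regular observations outside the frontier
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(X_train)

# predict returns +1 for points inside the frontier, -1 for points outside
pred_regular = clf.predict(X_test_regular)
pred_abnormal = clf.predict(X_test_abnormal)
```

Most of the new observations drawn from the training distribution fall inside the learned frontier, while most of the scattered points fall outside it.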

An online linear version of the One-Class SVM is implemented in linear_model.SGDOneClassSVM. This implementation scales linearly with the number of samples and can be used with a kernel approximation to approximate the solution of a kernelized svm.OneClassSVM, whose complexity is at best quadratic in the number of samples.

Outlier detection is similar to novelty detection in the sense that the goal is to separate a core of regular observations from some polluting ones, called outliers.

One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g., that they are Gaussian distributed). Scikit-learn provides an object, covariance.EllipticEnvelope, that fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode. For instance, assuming that the inlier data are Gaussian distributed, it will estimate the inlier location and covariance in a robust way (i.e., without being influenced by outliers).

The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness. This strategy is illustrated below.

See Robust covariance estimation and Mahalanobis distances relevance for an illustration of the difference between using a standard covariance.EmpiricalCovariance estimate or a robust covariance.MinCovDet estimate of location and covariance to assess the degree of outlyingness of an observation.
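A short sketch of this strategy with covariance.EllipticEnvelope (the synthetic data and the contamination value are illustrative assumptions):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X_inliers = rng.randn(100, 2)                          # Gaussian bulk
X_outliers = rng.uniform(low=6, high=8, size=(10, 2))  # distant cluster
X = np.vstack([X_inliers, X_outliers])

# contamination is the expected proportion of outliers in the data;
# the envelope is fitted robustly, so the distant points do not
# distort the estimated location and covariance.
est = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)
labels = est.predict(X)   # +1 inlier, -1 outlier
```

The ten distant points, whose robust Mahalanobis distances are far larger than those of the Gaussian bulk, are all flagged as outliers.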

One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. The ensemble.IsolationForest algorithm "isolates" observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, these samples are highly likely to be anomalies.

The implementation of ensemble.IsolationForest is based on an ensemble of tree.ExtraTreeRegressor. See IsolationForest example for an illustration of the use of IsolationForest. See Comparing anomaly detection algorithms for outlier detection on toy datasets for a comparison of ensemble.IsolationForest with the other estimators mentioned above.
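A minimal IsolationForest sketch (the data, contamination rate, and number of trees are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_inliers = 0.5 * rng.randn(200, 2)                     # dense cluster
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))  # scattered points
X = np.vstack([X_inliers, X_outliers])

clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = clf.fit_predict(X)         # +1 inlier, -1 outlier
scores = clf.decision_function(X)   # lower scores = shorter average paths
```

The scattered points take noticeably fewer random splits to isolate, so their decision scores are lower and most of them are labeled as outliers.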


