Working with outliers Outlier




1 working with outliers

1.1 retention
1.2 exclusion
1.3 non-normal distributions
1.4 set-membership uncertainties
1.5 alternative models





working with outliers

The choice of how to deal with an outlier should depend on its cause. Some estimators are highly sensitive to outliers, notably the estimation of covariance matrices.


retention

Even when a normal distribution model is appropriate for the data being analyzed, outliers are expected for large sample sizes and should not automatically be discarded when that is the case. The application should use a classification algorithm that is robust to outliers in order to model data with naturally occurring outlier points.
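
As a rough illustration of why large normal samples are expected to contain outliers, the sketch below computes the expected number of observations lying more than k standard deviations from the mean. The function name `expected_beyond` and the cutoff k = 3 are illustrative assumptions, not part of the original text.

```python
import math

def expected_beyond(n, k):
    """Expected count of observations more than k standard deviations
    from the mean among n independent draws from a normal distribution."""
    # Two-sided tail probability P(|Z| > k) for a standard normal Z.
    tail = math.erfc(k / math.sqrt(2))
    return n * tail

# Hypothetical sample sizes, chosen only to show the growth with n.
for n in (100, 1_000, 100_000):
    print(n, round(expected_beyond(n, 3.0), 2))
```

With a 3-sigma cutoff, roughly 2.7 such points are expected per 1,000 observations, so their presence alone is not evidence of bad data.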


exclusion

Deletion of outlier data is a controversial practice frowned upon by many scientists and science instructors. While mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. An outlier resulting from an instrument reading error may be excluded, but it is desirable that the reading is at least verified.


The two common approaches to excluding outliers are truncation (or trimming) and Winsorising. Trimming discards the outliers, whereas Winsorising replaces the outliers with the nearest "nonsuspect" data. Exclusion can also be a consequence of the measurement process, such as when an experiment is not entirely capable of measuring such extreme values, resulting in censored data.
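
The difference between the two approaches can be sketched as follows. This is a minimal example that assumes a symmetric, count-based rule (the k smallest and k largest values are treated as suspect). The function names and the sample data are hypothetical.

```python
def trim(values, k):
    """Discard the k smallest and k largest values (result is sorted)."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    return ordered[k:len(ordered) - k]

def winsorise(values, k):
    """Replace the k smallest and k largest values with the nearest
    retained value, keeping the original order and sample size."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

# Made-up measurements with one suspect value at each extreme.
data = [2.1, 2.4, 2.2, 9.8, 2.3, 2.5, -4.0, 2.6]
print(trim(data, 1))       # [2.1, 2.2, 2.3, 2.4, 2.5, 2.6]
print(winsorise(data, 1))  # [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.1, 2.6]
```

Trimming shrinks the sample, while Winsorising keeps its size and pulls the extremes in to the nearest retained values.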


In regression problems, an alternative approach may be to exclude only those points that exhibit a large degree of influence on the estimated coefficients, using a measure such as Cook's distance.
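
A minimal sketch of Cook's distance for a straight-line least-squares fit is given below, assuming a single predictor so that leverages have a closed form. The data are invented, and the 4/n cutoff is a common rule of thumb adopted here as an assumption, not something stated in the text.

```python
def cooks_distances(xs, ys):
    """Cook's distance for each point of a least-squares line y = a + b*x."""
    n = len(xs)
    p = 2  # number of fitted coefficients: intercept and slope
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for x, e in zip(xs, residuals):
        h = 1.0 / n + (x - x_mean) ** 2 / sxx  # leverage of this point
        distances.append((e * e / (p * mse)) * h / (1.0 - h) ** 2)
    return distances

# Made-up data: the last point departs sharply from the linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 2.0]
d = cooks_distances(xs, ys)
# Assumed rule of thumb: flag points whose distance exceeds 4/n.
flagged = [i for i, di in enumerate(d) if di > 4 / len(xs)]
print(flagged)
```

Points flagged this way are candidates for closer inspection; the measure itself does not say whether exclusion is justified.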


If a data point (or points) is excluded from the data analysis, this should be clearly stated in any subsequent report.


non-normal distributions

The possibility should be considered that the underlying distribution of the data is not approximately normal, having "fat tails". For instance, when sampling from a Cauchy distribution, the sample variance increases with the sample size, the sample mean fails to converge as the sample size increases, and outliers are expected at far larger rates than for a normal distribution. Even a slight difference in the fatness of the tails can make a large difference in the expected number of extreme values.
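
To give a sense of scale, the sketch below compares two-sided tail probabilities for a standard normal and a standard Cauchy distribution at the same threshold. Comparing the two at equal k is itself an assumption made for illustration, since the Cauchy distribution has no standard deviation and k is measured in units of its scale parameter.

```python
import math

def normal_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy variable X,
    using the CDF F(x) = 1/2 + atan(x) / pi."""
    return 1.0 - 2.0 * math.atan(k) / math.pi

# Thresholds chosen arbitrarily for illustration.
for k in (2, 3, 5):
    print(k, normal_tail(k), cauchy_tail(k))
```

At k = 5 the normal tail probability is below one in a million, while the Cauchy tail probability is still above 10 percent.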


set-membership uncertainties

A set membership approach considers that the uncertainty corresponding to the ith measurement of an unknown random vector x is represented by a set Xi (instead of a probability density function). If no outliers occur, x should belong to the intersection of all the Xi. When outliers occur, this intersection could be empty, and we should relax a small number of the sets Xi (as small as possible) in order to avoid any inconsistency. This can be done using the notion of q-relaxed intersection. As illustrated by the figure, the q-relaxed intersection corresponds to the set of all x that belong to all sets except q of them. Sets Xi that do not intersect the q-relaxed intersection could be suspected to be outliers.



Figure 5. q-relaxed intersection of 6 sets for q=2 (red), q=3 (green), q=4 (blue), q=5 (yellow).
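
The idea can be sketched in one dimension, where each set Xi is a closed interval. This is a simplifying assumption: the text concerns sets for a vector x, and "all sets except q of them" is read here as "at least n - q of the n sets". The function names and the sample intervals are hypothetical.

```python
def q_relaxed_intersection(intervals, q):
    """Points covered by at least len(intervals) - q of the closed
    intervals, returned as a list of (lo, hi) pairs."""
    need = len(intervals) - q
    if need <= 0:
        raise ValueError("q must be smaller than the number of intervals")
    # Sweep over endpoints. Openings (0) sort before closings (1) at the
    # same coordinate, so touching closed intervals count as overlapping.
    events = []
    for lo, hi in intervals:
        events.append((lo, 0))
        events.append((hi, 1))
    events.sort()
    result, depth, start = [], 0, None
    for x, kind in events:
        if kind == 0:
            depth += 1
            if depth == need:
                start = x  # coverage has just reached the required depth
        else:
            if depth == need:
                result.append((start, x))  # coverage is about to drop
            depth -= 1
    return result

def suspected_outliers(intervals, q):
    """Intervals that do not meet the q-relaxed intersection."""
    relaxed = q_relaxed_intersection(intervals, q)
    return [iv for iv in intervals
            if not any(iv[0] <= hi and lo <= iv[1] for lo, hi in relaxed)]

# Made-up measurement intervals; the last one is inconsistent with the rest.
sets = [(1, 4), (2, 5), (3, 6), (3.5, 7), (10, 12)]
print(q_relaxed_intersection(sets, 0))  # []
print(q_relaxed_intersection(sets, 1))  # [(3.5, 4)]
print(suspected_outliers(sets, 1))      # [(10, 12)]
```

With q = 0 the plain intersection is empty; relaxing a single set restores consistency and singles out the interval that does not meet the result.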


alternative models

In cases where the cause of the outliers is known, it may be possible to incorporate this effect into the model structure, for example by using a hierarchical Bayes model or a mixture model.







