The four datasets composing Anscombe's quartet. All four sets have identical statistical parameters, but the graphs show them to be considerably different.

Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,ย y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough".[1]

Data

edit

For all four datasets:

Property Value Accuracy
Mean of x 9 exact
Sample variance of x: s2
x
11 exact
Mean of y 7.50 to 2 decimal places
Sample variance of y: s2
y
4.125 ยฑ0.003
Correlation between x and y 0.816 to 3 decimal places
Linear regression line yย =ย 3.00ย +ย 0.500x to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression: 0.67 to 2 decimal places
  • The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two correlated variables, where y could be modelled as Gaussian with mean linearly dependent onย x.
  • For the second graph (top right), while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.
  • In the third graph (bottom left), the modelled relationship is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
  • Finally, the fourth graph (bottom right) shows an example when one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets.[2][3][4][5][6]

The datasets are as follows. The x values are the same for the first three datasets.[1]

Anscombe's quartet
Dataset I Dataset II Dataset III Dataset IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89

It is not known how Anscombe created his datasets.[7] Since its publication, several methods to generate similar datasets with identical statistics and dissimilar graphics have been developed.[7][8] One of these, the Datasaurus dozen, consists of points tracing out the outline of a dinosaur, plus twelve other datasets that have the same summary statistics.[9][10][11]

See also

edit

References

edit
  1. ^ a b Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician. 27 (1): 17โ€“21. doi:10.1080/00031305.1973.10478966. JSTORย 2682899.
  2. ^ Elert, Glenn (2021). "Linear Regression". The Physics Hypertextbook. Archived from the original on 2020-10-01. Retrieved 2017-02-23.
  3. ^ Janert, Philipp K. (2010). Data Analysis with Open Source Tools. O'Reilly Media. pp.ย 65โ€“66. ISBNย 978-0-596-80235-6.
  4. ^ Chatterjee, Samprit; Hadi, Ali S. (2006). Regression Analysis by Example. John Wiley and Sons. p.ย 91. ISBNย 0-471-74696-7.
  5. ^ Saville, David J.; Wood, Graham R. (1991). Statistical Methods: The geometric approach. Springer. p.ย 418. ISBNย 0-387-97517-9.
  6. ^ Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2ndย ed.). Cheshire, CT: Graphics Press. ISBNย 0-9613921-4-2.
  7. ^ a b Chatterjee, Sangit; Firat, Aykut (2007). "Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset". The American Statistician. 61 (3): 248โ€“254. doi:10.1198/000313007X220057. JSTORย 27643902. S2CIDย 121163371.
  8. ^ Matejka, Justin; Fitzmaurice, George (2017). "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing". Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp.ย 1290โ€“1294. doi:10.1145/3025453.3025912. ISBNย 9781450346559. S2CIDย 9247543.
  9. ^ Matejka, Justin; Fitzmaurice, George (2017). "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing". Autodesk Research. Archived from the original on 2020-10-04. Retrieved 2021-04-20.
  10. ^ Murray, Lori L.; Wilson, John G. (April 2021). "Generating data sets for teaching the importance of regression analysis". Decision Sciences Journal of Innovative Education. 19 (2): 157โ€“166. doi:10.1111/dsji.12233. ISSNย 1540-4595. S2CIDย 233609149. Archived from the original on 2021-04-23. Retrieved 2021-04-20.
  11. ^ Andrienko, Natalia; Andrienko, Gennady; Fuchs, Georg; Slingsby, Aidan; Turkay, Cagatay; Wrobel, Stefan (2020), "Visual Analytics for Investigating and Processing Data", Visual Analytics for Data Scientists, Cham: Springer International Publishing, pp.ย 151โ€“180, doi:10.1007/978-3-030-56146-8_5, ISBNย 978-3-030-56145-1, S2CIDย 226648414, archived from the original on 2024-10-03, retrieved 2021-04-20.{{citation}}: CS1 maint: work parameter with ISBN (link)
edit

๐Ÿ“š Artikel Terkait di Wikipedia

Datasaurus dozen

appear very different when graphed. It was inspired by the smaller Anscombe's quartet that was created in 1973. The following table contains summary statistics

Frank Anscombe

now known as Anscombe's quartet. He later published a textbook on statistical computing in APL. In economics and decision theory, Anscombe is best known

Exploratory data analysis

EDA before modeling in order to avoid misleading conclusions as in Anscombe's Quartet. John W. Tukey wrote the book Exploratory Data Analysis in 1977. Tukey

Data set

several papers in the machine learning (data mining) literature. Anscombe's quartet โ€“ Small data set illustrating the importance of graphing the data

All models are wrong

about the real world, a point emphasized by Box's famous remark. Anscombe's quartetย โ€“ Four data sets with the same descriptive statistics, yet very different

Influential observation

are Nยทk DFBETAs). Table shows DFBETAs for the third dataset from Anscombe's quartet (bottom left chart in the figure): DFFITS - difference in fits Cook's

Simpson's paradox

simplistic interpretations of data. Aliasingย โ€“ Signal processing effect Anscombe's quartetย โ€“ Four data sets with the same descriptive statistics, yet very different

Coefficient of determination

to the geneticist Sewall Wright and was first published in 1921. Anscombe's quartet Fraction of variance unexplained Goodness of fit Nashโ€“Sutcliffe model