Descriptive statistics. (Use a statistics package, or even a spreadsheet such as Excel, to help you.)
Consider the following numbers:

 A  B  C  D  E      F
 1  0 16  7  1      1
 2  7 17 62 12      2
 3  0  9  0  5      4
 4  7 18 35 18      8
 5  7 13  5 28     16
 6  8 11 10 78     32
 7  9 13 14  0     64
 8  2 10 48 46    128
 9  7 16  0 23    256
10  3 10 13 23    512
11  4 14  8 11   1024
12  4 12  9 34   2048
13  3 22  5 10   4096
14  0 10 59  5   8192
15  5 13 96 24  16384
16  7 22 97 43  32768
For each column, find the (arithmetic) mean, median, and standard deviation. How well do these conventional statistics describe the basic characteristics of the data?
Examining the data suggests transformations that might better capture the underlying characteristics of each column. Which transformations would you recommend to make the data easier to understand? Find the same descriptive statistics for the transformed data.
source("http://personality-project.org/r/useful.r") #get a small package of psychometrically useful functions
problem1 <- read.clipboard() #after first copying the table with the header row from above
summary(problem1) #get the basic summary statistics
boxplot(problem1) #show this graphically
pairs.panels(problem1) #show a graphic with scatterplots and histograms
#produces this output
> problem1 <- read.clipboard() #after first copying the table with the header row from above
> summary(problem1)
       A               B               C               D               E
 Min.   : 1.00   Min.   :0.000   Min.   : 9.00   Min.   : 0.00   Min.   : 0.00
 1st Qu.: 4.75   1st Qu.:2.750   1st Qu.:10.75   1st Qu.: 6.50   1st Qu.: 8.75
 Median : 8.50   Median :4.500   Median :13.00   Median :11.50   Median :20.50
 Mean   : 8.50   Mean   :4.562   Mean   :14.12   Mean   :29.25   Mean   :22.56
 3rd Qu.:12.25   3rd Qu.:7.000   3rd Qu.:16.25   3rd Qu.:50.75   3rd Qu.:29.50
 Max.   :16.00   Max.   :9.000   Max.   :22.00   Max.   :97.00   Max.   :78.00
       F
 Min.   :    1
 1st Qu.:   14
 Median :  192
 Mean   : 4096
 3rd Qu.: 2560
 Max.   :32768
> boxplot(problem1) #show this graphically
> pairs.panels(problem1) #show a graphic with scatterplots and histograms
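Note that summary() does not report the standard deviation, which the problem also asks for. A minimal sketch of one way to get all three statistics at once, assuming the table is typed in by hand so that nothing depends on read.clipboard() (the name desc is just an illustrative choice):

```r
# Data from the table above, entered by hand instead of via read.clipboard()
problem1 <- data.frame(
  A = 1:16,
  B = c(0, 7, 0, 7, 7, 8, 9, 2, 7, 3, 4, 4, 3, 0, 5, 7),
  C = c(16, 17, 9, 18, 13, 11, 13, 10, 16, 10, 14, 12, 22, 10, 13, 22),
  D = c(7, 62, 0, 35, 5, 10, 14, 48, 0, 13, 8, 9, 5, 59, 96, 97),
  E = c(1, 12, 5, 18, 28, 78, 0, 46, 23, 23, 11, 34, 10, 5, 24, 43),
  F = 2^(0:15)   # column F is the powers of two 1, 2, 4, ..., 32768
)

# One column per variable, one row per statistic (mean, median, sd)
desc <- sapply(problem1, function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(round(desc, 2))
```

Comparing the mean and median rows of desc shows quickly where the conventional statistics are misleading: for F the mean is far above the median.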
Note that the boxplot isn't very helpful, because the range of variable F is so great. What happens if we do a log transform of the data?
logprob <- log(problem1)
summary(logprob)
boxplot(logprob)
> logprob <- log(problem1)
> summary(logprob)
       A               B                C               D               E
 Min.   :0.000   Min.   :  -Inf   Min.   :2.197   Min.   : -Inf   Min.   : -Inf
 1st Qu.:1.554   1st Qu.:0.9972   1st Qu.:2.374   1st Qu.:1.862   1st Qu.:2.129
 Median :2.138   Median :1.4979   Median :2.565   Median :2.434   Median :3.013
 Mean   :1.917   Mean   :  -Inf   Mean   :2.611   Mean   : -Inf   Mean   : -Inf
 3rd Qu.:2.505   3rd Qu.:1.9459   3rd Qu.:2.788   3rd Qu.:3.923   3rd Qu.:3.381
 Max.   :2.773   Max.   :2.1972   Max.   :3.091   Max.   :4.575   Max.   :4.357
       F
 Min.   : 0.000
 1st Qu.: 2.599
 Median : 5.199
 Mean   : 5.199
 3rd Qu.: 7.798
 Max.   :10.397
> boxplot(logprob)
Warning messages:
1: Outlier (-Inf) in 2nd boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==
2: Outlier (-Inf) in 4th boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==
3: Outlier (-Inf) in 5th boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==
The warnings from boxplot arise because columns B, D, and E contain zeros, and log(0) is -Inf. The same zeros are why the Min. and Mean entries for those columns are -Inf. Try adding one to each number before taking the logs.
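To see where the -Inf values come from, here is a small check. It assumes the three affected columns are typed in directly rather than read from the clipboard, and the name zeros is just illustrative:

```r
# Columns B, D, and E from the table above, entered by hand
B <- c(0, 7, 0, 7, 7, 8, 9, 2, 7, 3, 4, 4, 3, 0, 5, 7)
D <- c(7, 62, 0, 35, 5, 10, 14, 48, 0, 13, 8, 9, 5, 59, 96, 97)
E <- c(1, 12, 5, 18, 28, 78, 0, 46, 23, 23, 11, 34, 10, 5, 24, 43)

# Count the zeros in each column
zeros <- sapply(list(B = B, D = D, E = E), function(x) sum(x == 0))
print(zeros)

print(log(0))            # -Inf
print(mean(log(B)))      # a single -Inf drags the whole mean to -Inf
print(mean(log(B + 1)))  # finite once every value is shifted up by one
```

A single zero is enough to make the mean of the logged column -Inf, which is why shifting by one before logging helps here.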
log1prob <- log(problem1+1)
summary(log1prob)
boxplot(log1prob)
> log1prob <- log(problem1+1)
> summary(log1prob)
       A                B               C               D               E
 Min.   :0.6931   Min.   :0.000   Min.   :2.303   Min.   :0.000   Min.   :0.000
 1st Qu.:1.7462   1st Qu.:1.314   1st Qu.:2.463   1st Qu.:2.008   1st Qu.:2.246
 Median :2.2499   Median :1.701   Median :2.639   Median :2.518   Median :3.061
 Mean   :2.0941   Mean   :1.486   Mean   :2.684   Mean   :2.674   Mean   :2.698
 3rd Qu.:2.5835   3rd Qu.:2.079   3rd Qu.:2.848   3rd Qu.:3.942   3rd Qu.:3.414
 Max.   :2.8332   Max.   :2.303   Max.   :3.135   Max.   :4.585   Max.   :4.369
       F
 Min.   : 0.6931
 1st Qu.: 2.6742
 Median : 5.2044
 Mean   : 5.2962
 3rd Qu.: 7.7983
 Max.   :10.3972
> boxplot(log1prob)
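The problem also asks for the standard deviation of the transformed data, which summary() does not show. A sketch along the same lines as the log(problem1+1) call above, with the data entered by hand and desc1 as an illustrative name:

```r
# Data from the table above, entered by hand instead of via read.clipboard()
problem1 <- data.frame(
  A = 1:16,
  B = c(0, 7, 0, 7, 7, 8, 9, 2, 7, 3, 4, 4, 3, 0, 5, 7),
  C = c(16, 17, 9, 18, 13, 11, 13, 10, 16, 10, 14, 12, 22, 10, 13, 22),
  D = c(7, 62, 0, 35, 5, 10, 14, 48, 0, 13, 8, 9, 5, 59, 96, 97),
  E = c(1, 12, 5, 18, 28, 78, 0, 46, 23, 23, 11, 34, 10, 5, 24, 43),
  F = 2^(0:15)   # column F is the powers of two 1, 2, 4, ..., 32768
)

# Same transform as in the text: add one, then take natural logs
log1prob <- log(problem1 + 1)

# Mean, median, and standard deviation of each transformed column
desc1 <- sapply(log1prob, function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(round(desc1, 3))

# Base R's log1p(x) computes log(1 + x) directly; the two should agree here
print(all.equal(log1p(problem1$F), log(problem1$F + 1)))
```

After the transform the mean and median of F are close to each other, unlike in the raw data.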
The final boxplot is shown below. (I have not shown the ones that are less useful.)