Descriptive statistics: (Use a stats package or even a spreadsheet (e.g., Excel) to help you).

Consider the following numbers:

     A   B   C   D   E      F
     1   0  16   7   1      1
     2   7  17  62  12      2
     3   0   9   0   5      4
     4   7  18  35  18      8
     5   7  13   5  28     16
     6   8  11  10  78     32
     7   9  13  14   0     64
     8   2  10  48  46    128
     9   7  16   0  23    256
    10   3  10  13  23    512
    11   4  14   8  11   1024
    12   4  12   9  34   2048
    13   3  22   5  10   4096
    14   0  10  59   5   8192
    15   5  13  96  24  16384
    16   7  22  97  43  32768

For each column, find the (arithmetic) mean, median, and standard deviation. How well do these conventional statistics describe the basic characteristics of the data?
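As a minimal sketch of the calculation, the three statistics can be gathered with base R's mean(), median(), and sd(). The helper name describe_column is hypothetical (it is not part of the handout or of useful.r), and only columns A and F are used here because both can be generated compactly.

```r
# Hypothetical helper (not from the handout): the three requested
# statistics for one numeric vector.
describe_column <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
}

A <- 1:16           # column A from the table above
F_col <- 2^(0:15)   # column F: 1, 2, 4, ..., 32768

# One row per column, one named statistic per field.
stats <- rbind(A = describe_column(A), F = describe_column(F_col))
print(round(stats, 2))
```

For column F the mean is far above the median and the standard deviation is larger than the mean, which is one sign that these conventional statistics summarize that column poorly.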

Examining the data suggests transformations that might better capture the underlying characteristics of each column. What transformations would you recommend to make the data easier to understand? Find the same descriptive statistics for the transformed data.

The following code in the R system will do this. (Note that I am shortcutting the input step by copying the data to the clipboard and using a procedure to read the clipboard. My "read.clipboard()" function supposedly combines the code for PCs and Macs into one function. You can get it by downloading my "useful.r" routines.)
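If the clipboard route is inconvenient, one alternative (not the procedure used in this handout) is to embed the table as text and read it with base R's read.table(). This sketch assumes the six columns A to F exactly as listed in the table above.

```r
# Alternative to read.clipboard(): read the table from an embedded string,
# so nothing depends on the clipboard or on sourcing useful.r.
problem1 <- read.table(header = TRUE, text = "
A  B  C  D  E  F
1  0 16  7  1  1
2  7 17 62 12  2
3  0  9  0  5  4
4  7 18 35 18  8
5  7 13  5 28 16
6  8 11 10 78 32
7  9 13 14  0 64
8  2 10 48 46 128
9  7 16  0 23 256
10 3 10 13 23 512
11 4 14  8 11 1024
12 4 12  9 34 2048
13 3 22  5 10 4096
14 0 10 59  5 8192
15 5 13 96 24 16384
16 7 22 97 43 32768
")

str(problem1)       # 16 observations of 6 integer variables
summary(problem1)   # same basic summary statistics as in the handout
```

The resulting data frame can be passed to summary() and boxplot() in the same way as the clipboard version.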

    source("http://personality-project.org/r/useful.r")   #get a small package of psychometrically useful functions
    problem1 <- read.clipboard()   #after first copying the table with the header row from above
    summary(problem1)              #get the basic summary statistics
    boxplot(problem1)              #show this graphically
    pairs.panels(problem1)         #show a graphic with scatterplots and histograms

    #produces this output

    > problem1 <- read.clipboard()   #after first copying the table with the header row from above
    > summary(problem1)
           A               B               C               D               E
     Min.   : 1.00   Min.   :0.000   Min.   : 9.00   Min.   : 0.00   Min.   : 0.00
     1st Qu.: 4.75   1st Qu.:2.750   1st Qu.:10.75   1st Qu.: 6.50   1st Qu.: 8.75
     Median : 8.50   Median :4.500   Median :13.00   Median :11.50   Median :20.50
     Mean   : 8.50   Mean   :4.562   Mean   :14.12   Mean   :29.25   Mean   :22.56
     3rd Qu.:12.25   3rd Qu.:7.000   3rd Qu.:16.25   3rd Qu.:50.75   3rd Qu.:29.50
     Max.   :16.00   Max.   :9.000   Max.   :22.00   Max.   :97.00   Max.   :78.00
           F
     Min.   :    1
     1st Qu.:   14
     Median :  192
     Mean   : 4096
     3rd Qu.: 2560
     Max.   :32768
    > boxplot(problem1)        #show this graphically
    > pairs.panels(problem1)   #show a graphic with scatterplots and histograms

Note that the boxplot isn't very helpful, because the range of variable F is so great. What happens if we do a log transform of the data?

    logprob <- log(problem1)
    summary(logprob)
    boxplot(logprob)

    > logprob <- log(problem1)
    > summary(logprob)
           A               B                C               D               E
     Min.   :0.000   Min.   :  -Inf   Min.   :2.197   Min.   : -Inf   Min.   : -Inf
     1st Qu.:1.554   1st Qu.:0.9972   1st Qu.:2.374   1st Qu.:1.862   1st Qu.:2.129
     Median :2.138   Median :1.4979   Median :2.565   Median :2.434   Median :3.013
     Mean   :1.917   Mean   :  -Inf   Mean   :2.611   Mean   : -Inf   Mean   : -Inf
     3rd Qu.:2.505   3rd Qu.:1.9459   3rd Qu.:2.788   3rd Qu.:3.923   3rd Qu.:3.381
     Max.   :2.773   Max.   :2.1972   Max.   :3.091   Max.   :4.575   Max.   :4.357
           F
     Min.   : 0.000
     1st Qu.: 2.599
     Median : 5.199
     Mean   : 5.199
     3rd Qu.: 7.798
     Max.   :10.397
    > boxplot(logprob)
    Warning messages:
    1: Outlier (-Inf) in 2nd boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==
    2: Outlier (-Inf) in 4th boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==
    3: Outlier (-Inf) in 5th boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==

The complaint about the box plot arises because columns B, D and E contain zeros, and the log of zero is -Inf. Try adding one to the numbers before taking the logs.
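The effect of the zeros can be checked on a single column. This is a minimal sketch using column B from the table; log1p() is base R's built-in for log(1 + x) and is shown only as an equivalent spelling, not as something the handout uses.

```r
# Column B from the table; it contains three zeros.
B <- c(0, 7, 0, 7, 7, 8, 9, 2, 7, 3, 4, 4, 3, 0, 5, 7)

log(0)             # -Inf: the log of zero is not finite
mean(log(B))       # -Inf, because a single -Inf swamps the mean
mean(log(B + 1))   # finite once one is added before taking logs

# log1p(x) computes log(1 + x), so both spellings agree here.
all.equal(log(B + 1), log1p(B))
```

Adding one is a convenience for handling zeros rather than a neutral step: it shifts every value, so the transformed statistics describe B + 1, not B.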

    log1prob <- log(problem1+1)
    summary(log1prob)
    boxplot(log1prob)

    > log1prob <- log(problem1+1)
    > summary(log1prob)
           A                B               C               D               E
     Min.   :0.6931   Min.   :0.000   Min.   :2.303   Min.   :0.000   Min.   :0.000
     1st Qu.:1.7462   1st Qu.:1.314   1st Qu.:2.463   1st Qu.:2.008   1st Qu.:2.246
     Median :2.2499   Median :1.701   Median :2.639   Median :2.518   Median :3.061
     Mean   :2.0941   Mean   :1.486   Mean   :2.684   Mean   :2.674   Mean   :2.698
     3rd Qu.:2.5835   3rd Qu.:2.079   3rd Qu.:2.848   3rd Qu.:3.942   3rd Qu.:3.414
     Max.   :2.8332   Max.   :2.303   Max.   :3.135   Max.   :4.585   Max.   :4.369
           F
     Min.   : 0.6931
     1st Qu.: 2.6742
     Median : 5.2044
     Mean   : 5.2962
     3rd Qu.: 7.7983
     Max.   :10.3972
    > boxplot(log1prob)

The final boxplot is shown below. (I have not shown the ones that are not as useful.)