2 Descriptive Statistics
Student Learning Outcomes
By the end of this chapter, the student should be able to:
- Display data graphically and interpret graphs: stemplots, histograms and boxplots.
- Recognize, describe, and calculate the measures of location of data: quartiles and percentiles.
- Recognize, describe, and calculate the measures of the center of data: mean, median, and mode.
- Recognize, describe, and calculate the measures of the spread of data: variance, standard deviation, and range.
Introduction
Once you have collected data, what will you do with it? Data can be described and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. You may have no clue about the house prices, so you might ask your real estate agent to give you a sample data set of prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look at the median price and the variation of prices. The median and variation are just two ways that you will learn to describe data. Your agent might also provide you with a graph of the data.
In this chapter, you will study numerical and graphical ways to describe and display your data. This area of statistics is called “Descriptive Statistics”. You will learn to calculate, and even more importantly, to interpret these measurements and graphs.
Displaying Data
A statistical graph is a tool that helps you learn about the shape or distribution of a sample. The graph can be a more effective way of presenting data than a mass of numbers because we can see where data clusters and where there are only a few data values. Newspapers and the Internet use graphs to show trends and to enable readers to compare facts and figures quickly.
Statisticians often graph data first to get a picture of the data. Then, more formal tools may be applied.
Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar chart, the histogram, the stem-and-leaf plot, the frequency polygon (a type of broken line graph), pie charts, and the boxplot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs and bar graphs. Our emphasis will be on histograms and boxplots.
Stem and Leaf Graphs (Stemplots), Line Graphs and Bar Graphs
One simple graph, the stem-and-leaf graph or stem plot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit. For example, 23 has stem 2 and leaf 3. Four hundred thirty-two (432) has stem 43 and leaf 2. Five thousand four hundred thirty-two (5,432) has stem 543 and leaf 2. The decimal 9.3 has stem 9 and leaf 3. Write the stems in a vertical line from the smallest to the largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.
Example 1
For Susan Dean’s spring pre-calculus class, scores for the first exam were as follows (smallest to largest):
33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100
Stem-and-Leaf Diagram
Stem | Leaf |
3 | 3 |
4 | 299 |
5 | 355 |
6 | 1378899 |
7 | 2348 |
8 | 03888 |
9 | 0244446 |
10 | 0 |
Table 1
The stem plot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores or approximately 26% of the scores were in the 90’s or 100, a fairly high number of As.
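The stem-and-leaf construction described above can be sketched in a few lines of Python. This is only an illustration for whole-number data. The function name `stem_and_leaf` is hypothetical, not part of any library, and it assumes the leaf is always the final digit.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers into stems (all but the last digit) and leaves (last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    # Include stems with no leaves so that gaps in the data stay visible.
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}

# Exam scores from Example 1.
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")

# Share of scores in the 90s or at 100, as discussed in the text.
high_scores = len(plot[9]) + len(plot[10])
share_high = high_scores / len(scores)
```

Printing the rows reproduces Table 1, and `share_high` is the 8-out-of-31 proportion quoted in the text.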
The stem plot is a quick way to graph and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something unusual is happening. It takes some background information to explain outliers. In the example above, there were no outliers.
Line Graph
Another type of graph that is useful for specific data values is a line graph. In the particular line graph shown in the example, the x-axis consists of data values and the y-axis consists of frequency points. The frequency points are connected.
Example 2
In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his/her chores. The results are shown in the table (Table 2) and the line graph (Figure 1).
Number of times teenager is reminded | Frequency |
0 | 2 |
1 | 5 |
2 | 8 |
3 | 14 |
4 | 7 |
5 | 4 |
Table 2
Figure 1
Bar Graph
Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be rectangular boxes and they can be vertical or horizontal.
The bar graph shown in Figure 2 has age groups represented on the x-axis and proportions on the y-axis.
Example 3
By the end of 2011, in the United States, Facebook had over 146 million users. The table shows three age groups, the number of users in each age group and the proportion (%) of users in each age group. Source: http://www.kenburbary.com/2011/03/facebook-demographics-revisited-2011-statistics-2/
Age groups | Number of Facebook users | Proportion (%) of Facebook users |
13 – 25 | 65,082,280 | 45% |
26 – 44 | 53,300,200 | 36% |
45 – 64 | 27,885,100 | 19% |
Table 3
Figure 2
Histograms
For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set consists of 100 values or more.
A histogram consists of contiguous boxes. It has both a horizontal axis and a vertical axis. The horizontal axis is labeled with what the data represents (for instance, distance from your home to school). The vertical axis is labeled either Frequency or relative frequency. The graph will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data. (The next section tells you how to calculate the center and the spread.)
The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample. (In the chapter on Sampling and Data, we defined frequency as the number of times an answer occurs.) If:
• f = frequency
• n = total number of data values (or the sum of the individual frequencies), and
• RF = relative frequency,
then:
$RF = \frac{f}{n}$
Example 4
If 3 students in Mr. Ahab’s English class of 40 students received from 90% to 100%, then,
f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$
Seven and a half percent of the students received 90% to 100%. Ninety percent to 100% are quantitative measures.
To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of from 5 to 15 bars or classes for clarity. Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is 2, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary.
Example 5
The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data since height is measured.
60; 60.5; 61; 61; 61.5
63.5; 63.5; 63.5
64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5
66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67;
67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5
68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5
70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71
72; 72; 72; 72.5; 72.5; 73; 73.5
74
The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point.
60 – 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95.
The largest value is 74. 74 + 0.05 = 74.05 is the ending value.
Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire). Suppose you choose 8 bars.
$\frac{74.05 - 59.95}{8} = 1.7625 \approx 1.76$
NOTE: We will round up to 2 and make each bar or class interval 2 units wide. Rounding up to 2 is one way to prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work.
The boundaries are:
• 59.95
• 59.95 + 2 = 61.95
• 61.95 + 2 = 63.95
• 63.95 + 2 = 65.95
• 65.95 + 2 = 67.95
• 67.95 + 2 = 69.95
• 69.95 + 2 = 71.95
• 71.95 + 2 = 73.95
• 73.95 + 2 = 75.95
The heights 60 through 61.5 inches are in the interval 59.95 – 61.95. The heights that are 63.5 are in the interval 61.95 – 63.95. The heights that are 64 through 64.5 are in the interval 63.95 – 65.95. The heights 66 through 67.5 are in the interval 65.95 – 67.95. The heights 68 through 69.5 are in the interval 67.95 – 69.95. The heights 70 through 71 are in the interval 69.95 – 71.95. The heights 72 through 73.5 are in the interval 71.95 – 73.95. The height 74 is in the interval 73.95 – 75.95.
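The boundary and counting steps for the soccer-player heights can be checked with a short Python sketch. The helper names `class_boundaries` and `class_counts` are hypothetical, and the data are stored as (height, count) pairs tallied from the list in this example rather than typed out one by one.

```python
import math

def class_boundaries(start, width, bars):
    """Return the bars + 1 boundaries of equal-width class intervals."""
    return [round(start + i * width, 2) for i in range(bars + 1)]

def class_counts(values, boundaries):
    """Count the values that fall strictly between each pair of adjacent boundaries."""
    return [sum(lo < v < hi for v in values)
            for lo, hi in zip(boundaries, boundaries[1:])]

# (height in inches, number of players) pairs tallied from the data above.
height_counts = [(60, 1), (60.5, 1), (61, 2), (61.5, 1), (63.5, 3), (64, 7),
                 (64.5, 8), (66, 10), (66.5, 11), (67, 12), (67.5, 7), (68, 2),
                 (69, 10), (69.5, 5), (70, 6), (70.5, 3), (71, 3), (72, 3),
                 (72.5, 2), (73, 1), (73.5, 1), (74, 1)]
heights = [h for h, count in height_counts for _ in range(count)]

bars = 8
start = 60 - 0.05                            # smallest value minus 0.05
end = 74 + 0.05                              # largest value plus 0.05
width = math.ceil((end - start) / bars)      # about 1.76, rounded up to 2

boundaries = class_boundaries(start, width, bars)
counts = class_counts(heights, boundaries)
relative = [c / len(heights) for c in counts]  # heights of the bars in Figure 3

for lo, hi, c, rf in zip(boundaries, boundaries[1:], counts, relative):
    print(f"{lo:.2f} to {hi:.2f}: frequency {c}, relative frequency {rf:.2f}")
```

Because every boundary carries one more decimal place than the data, the strict inequalities never have to decide which class a boundary value belongs to.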
The following histogram (Figure 3) displays the heights on the x-axis and relative frequency on the y-axis.
Figure 3
Example 6
The following data are the number of books bought by 50 part-time college students at ABC College. The number of books is discrete data since books are counted.
1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1
2; 2; 2; 2; 2; 2; 2; 2; 2; 2
3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3
4; 4; 4; 4; 4; 4
5; 5; 5; 5; 5
6; 6
Eleven students buy 1 book. Ten students buy 2 books. Sixteen students buy 3 books. Six students buy 4 books. Five students buy 5 books. Two students buy 6 books.
Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and the ending value is 6.5.
Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many different values, a width that places the data values in the middle of the bar or class interval is the most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6 and the starting point is 0.5, a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the interval from 3.5 to 4.5, the 5 in the middle of the interval from 4.5 to 5.5, and the 6 in the middle of the interval from 5.5 to 6.5.
Calculate the number of bars as follows:
$\frac{6.5 - 0.5}{\text{number of bars}} = 1$
where 1 is the width of a bar. Therefore, bars = 6.
The following histogram (Figure 4) displays the number of books on the x-axis and the frequency on the y-axis.
Figure 4
Measures of the Location of the Data
The common measures of location are quartiles and percentiles.
Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.
To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.
Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.
Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.
The median is a number that measures the “center” of the data. You can think of the median as the “middle value,” but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data.
1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1
Ordered from smallest to largest:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.
$\frac{6.8 + 7.2}{2} = 7$
The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile, Q1, is the middle value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two.
1; 1; 2; 2; 4; 6; 6.8
The number two, which is part of the data, is the first quartile. One-fourth of the entire set of values are the same as or less than two and three-fourths of the values are more than two.
The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.
The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this example.
The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).
IQR = Q3 – Q1
The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it lies more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile, that is, if it is less than Q1 – (1.5)(IQR) or greater than Q3 + (1.5)(IQR). Potential outliers always require further investigation.
For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000
Order the data from smallest to largest.
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000
M = 488,800
$Q1 = \frac{230,500 + 387,000}{2} = 308,750$
$Q3 = \frac{639,000 + 659,000}{2} = 649,000$
IQR = 649,000 – 308,750 = 340,250
(1.5)(IQR) = (1.5)(340,250) = 510,375
Q1 – (1.5)(IQR) = 308,750 – 510,375 = –201,625
Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375
No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.
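The same calculation can be sketched in Python. The helper `quartiles` is a hypothetical name. It assumes the median-of-halves rule used in this chapter and needs at least two data values. Statistical software often uses other quartile conventions and may give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule from the text."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]     # values before the median position
    upper = data[-half:]    # values after the median position
    return median(lower), median(data), median(upper)

# The 13 real estate prices, in dollars.
prices = [389_950, 230_500, 158_000, 479_000, 639_000, 114_950, 5_500_000,
          387_000, 659_000, 529_000, 575_000, 488_800, 1_095_000]

q1, m, q3 = quartiles(prices)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are potential outliers that deserve a closer look.
potential_outliers = [p for p in prices if p < lower_fence or p > upper_fence]
print(q1, m, q3, iqr, lower_fence, upper_fence, potential_outliers)
```

When the number of values is odd, as it is here, the median itself is left out of both halves, which matches the hand calculation above.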
Example 7
For the two data sets in the test scores example, find the following:
a. The interquartile range. Compare the two interquartile ranges.
b. Any outliers in either set.
Solution – Example 7
The five number summary for the day and night classes is
Class | Minimum | Q1 | Median | Q3 | Maximum |
Day | 32 | 56 | 74.5 | 82.5 | 99 |
Night | 25.5 | 78 | 81 | 89 | 98 |
Table 4
a. The IQR for the day group is Q3 – Q1 = 82.5 – 56 = 26.5. The IQR for the night group is Q3 – Q1 = 89 – 78 = 11.
The interquartile range (the spread or variability) for the day class is larger than the night class IQR. This suggests more variation will be found in the day class’s test scores.
b. Day class outliers are found using the IQR times 1.5 rule. So,
Q1 – IQR(1.5) = 56 – 26.5(1.5) = 16.25
Q3 + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25
Since the minimum and maximum values for the day class are greater than 16.25 and less than 122.25, there are no outliers.
Night class outliers are calculated as:
Q1 – IQR (1.5) = 78 – 11(1.5) = 61.5
Q3 + IQR(1.5) = 89 + 11(1.5) = 105.5
For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5 are outliers. Since no test score is greater than 105.5, there is no upper end outlier.
Example 8
Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were:
AMOUNT OF SLEEP PER SCHOOL NIGHT (HOURS) | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY |
4 | 2 | 0.04 | 0.04 |
5 | 5 | 0.10 | 0.14 |
6 | 7 | 0.14 | 0.28 |
7 | 12 | 0.24 | 0.52 |
8 | 14 | 0.28 | 0.80 |
9 | 7 | 0.14 | 0.94 |
10 | 3 | 0.06 | 1.00 |
Table 5
Find the 28th percentile. Notice the 0.28 in the “cumulative relative frequency” column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.
Find the median. Look again at the “cumulative relative frequency” column and find 0.52. The median is the 50th percentile or the second quartile. 50% of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50th percentile is between the 25th, or seven, and 26th, or seven, values. The median is seven.
Find the third quartile. The third quartile is the same as the 75th percentile. You can “eyeball” this answer. If you look at the “cumulative relative frequency” column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. (There are 37 values below the third quartile and 12 values above.)
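The three lookups in this example follow one rule: find p% of n; if that is a whole number, take the value midway between that position and the next one; otherwise round the position up. A minimal Python sketch of that rule follows. The function name `percentile_from_table` is hypothetical, and software packages often use other percentile conventions.

```python
def percentile_from_table(table, p):
    """Locate the p-th percentile (p a whole number from 1 to 99) as in Example 8.

    table maps each data value to its frequency.
    """
    data = [value for value in sorted(table) for _ in range(table[value])]
    n = len(data)
    # Integer arithmetic avoids floating-point surprises such as 0.28 * 50.
    position, remainder = divmod(p * n, 100)
    if remainder == 0:
        # Exactly `position` values lie below: average that value and the next.
        return (data[position - 1] + data[position]) / 2
    # Otherwise round the position up and take that value.
    return data[position]

# Hours of sleep and their frequencies from Table 5.
sleep = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}

p28 = percentile_from_table(sleep, 28)
med = percentile_from_table(sleep, 50)
q3 = percentile_from_table(sleep, 75)
print(p28, med, q3)
```

The results agree with the values found by reading the cumulative relative frequency column: 6.5, seven, and eight.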
Guideline:
When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information:
- information about the context of the situation being considered,
- the data value (value of the variable) that represents the percentile,
- the percent of individuals or items with data values below the percentile.
- Additionally, you may also choose to state the percent of individuals or items with data values above the percentile.
Example 9
On a timed math test, the first quartile for times for finishing the exam was 35 minutes. Interpret the first quartile in the context of this situation.
25% of students finished the exam in 35 minutes or less. 75% of students finished the exam in 35 minutes or more.
A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)
Example 10
On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.
70% of students answered 16 or fewer questions correctly. 30% of students answered 16 or more questions correctly.
Note: A high percentile could be considered good, as answering more questions correctly is desirable.
Example 11
At a certain community college, it was found that the 30th percentile of credit units that students are enrolled for is 7 units. Interpret the 30th percentile in the context of this situation.
30% of students are enrolled in 7 or fewer credit units. 70% of students are enrolled in 7 or more credit units.
In this example, there is no “good” or “bad” value judgment associated with a higher or lower percentile. Students attend community college for varied reasons and needs, and their course load varies according to their needs.
Measures of the Center of the Data
The “center” of a data set is also a way of describing location. The two most widely used measures of the “center” of the data are the mean (average) and the median. To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts (previously discussed under box plots in this chapter). The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.
NOTE: The words “mean” and “average” are often used interchangeably. The substitution of one word for the other is common practice. The technical term is “arithmetic mean” and “average” is technically a center location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.”
The mean can also be calculated by multiplying each distinct value by its frequency, adding the products, and then dividing the sum by the total number of data values. The letter used to represent the sample mean is an x with a bar over it (pronounced “x bar”): $\bar{x}$.
The Greek letter µ (pronounced “mew”) represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.
To see that both ways of calculating the mean are the same, consider the sample: 1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
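$\bar{x} = \frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} = \frac{30}{11} \approx 2.7$
$\bar{x} = \frac{3(1) + 2(2) + 1(3) + 5(4)}{11} = \frac{30}{11} \approx 2.7$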
In the second calculation for the sample mean, the frequencies are 3, 2, 1, and 5. You can quickly find the location of the median by using the expression $\frac{n + 1}{2}$.
The letter n is the total number of data values in the sample. If n is an odd number, the median is the middle value of the ordered data (ordered smallest to largest). If n is an even number, the median is equal to the two middle values added together and divided by 2 after the data has been ordered. For example, if the total number of data values is 97, then $\frac{n + 1}{2} = \frac{97 + 1}{2} = 49$. The median is the 49th value in the ordered data.
If the total number of data values is 100, then $\frac{n + 1}{2} = \frac{100 + 1}{2} = 50.5$. The median occurs midway between the 50th and 51st values. The location of the median and the value of the median are not the same. The upper case letter M is often used to represent the median. The next example illustrates the location of the median and the value of the median.
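As a quick check, the two ways of computing the sample mean and the median-location expression can be written in Python. The variable names and the helper `median_location` are chosen here for illustration only.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]
n = len(sample)

# First way: add every data value and divide by n.
mean_direct = sum(sample) / n

# Second way: multiply each distinct value by its frequency, add, divide by n.
frequencies = Counter(sample)   # 1 occurs 3 times, 2 twice, 3 once, 4 five times
mean_weighted = sum(value * count for value, count in frequencies.items()) / n

def median_location(n):
    """Position of the median in the ordered data, counting from 1."""
    return (n + 1) / 2

print(round(mean_direct, 1), round(mean_weighted, 1))
print(median_location(97), median_location(100))
```

A location of 50.5 means the median lies midway between the 50th and 51st ordered values. It is a position, not the median itself.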
Example 12
Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the “center”: the mean or the median?
$\bar{x} = \frac{5,000,000 + 49(30,000)}{50} = 129,400$
M = 30,000
(There are 49 people who earn $30,000 and one person who earns $5,000,000.)
The median is a better measure of the “center” than the mean because 49 of the values are $30,000 and one is $5,000,000. The $5,000,000 is an outlier. The $30,000 gives us a better sense of the middle of the data.
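The effect of the single large income can be reproduced with Python's standard `statistics` module. This is a small sketch using the incomes from the example.

```python
from statistics import mean, median

# 49 people earn $30,000 and one person earns $5,000,000.
incomes = [30_000] * 49 + [5_000_000]

town_mean = mean(incomes)
town_median = median(incomes)

# Dropping the outlier shows how strongly it pulls the mean but not the median.
without_outlier = [x for x in incomes if x != 5_000_000]
mean_without = mean(without_outlier)
median_without = median(without_outlier)

print(town_mean, town_median, mean_without, median_without)
```

Removing the one extreme value moves the mean from $129,400 to $30,000, while the median stays at $30,000 either way.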
Box Plots
Box plots or box-whisker plots give a good graphical image of the concentration of the data. They also show how far from most of the data the extreme values are. The box plot is constructed from five values: the smallest value, the first quartile, the median, the third quartile, and the largest value. The median, the first quartile, and the third quartile will be discussed here, and then again in the section on measuring data in this chapter. We use these values to compare how close other data values are to them.
The median, a number, is a way of measuring the “center” of the data. You can think of the median as the “middle value,” although it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median and half the values are the same number or larger. For example, consider the following data:
1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1
Ordered from smallest to largest:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The median is between the 7th value, 6.8, and the 8th value 7.2. To find the median, add the two values together and divide by 2.
$\frac{6.8 + 7.2}{2} = 7$
The median is 7. Half of the values are smaller than 7 and half of the values are larger than 7.
Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile is the middle value of the lower half of the data and the third quartile is the middle value of the upper half of the data. To get the idea, consider the same data set shown above:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is 2.
1; 1; 2; 2; 4; 6; 6.8
The number 2, which is part of the data, is the first quartile. One-fourth of the values are the same or less than 2 and three-fourths of the values are more than 2.
The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is 9.
7.2; 8; 8.3; 9; 10; 10; 11.5
The number 9, which is part of the data, is the third quartile. Three-fourths of the values are less than 9 and one-fourth of the values are more than 9.
To construct a box plot, use a horizontal number line and a rectangular box. The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. The middle fifty percent of the data fall inside the box. The “whiskers” extend from the ends of the box to the smallest and largest data values. The box plot gives a good quick picture of the data.
NOTE: You may encounter box and whisker plots that have dots marking outlier values. In those cases, the whiskers are not extending to the minimum and maximum values.
Consider the following data:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The first quartile is 2, the median is 7, and the third quartile is 9. The smallest value is 1 and the largest value is 11.5. The box plot is constructed as follows:
Figure 5
The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value. The median is shown with a dashed line.
Example 13
The following data are the heights of 40 students in a statistics class.
59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65; 65; 65; 65; 65; 65; 65; 66; 66; 67; 67; 68; 68; 69; 70; 70; 70; 70; 70; 71; 71; 72; 72; 73; 74; 74; 75; 77
a. Each quarter has 25% of the data.
b. The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), 66 – 64.5 = 1.5 (second quarter), 70 – 66 = 4 (3rd quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread.
c. Interquartile Range: IQR = Q3 – Q1 = 70 – 64.5 = 5.5.
d. The interval 59 through 65 has more than 25% of the data so it has more data in it than the interval 66 through 70 which has 25% of the data.
e. The middle 50% (middle half) of the data has a range of 5.5 inches.
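The five values behind these statements can be computed with a short Python sketch. The function name `five_number_summary` is hypothetical, and it assumes the median-of-halves rule for quartiles described earlier in this section.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    return (data[0], median(data[:half]), median(data),
            median(data[-half:]), data[-1])

# Heights of the 40 students in Example 13.
heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65,
           66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74,
           75, 77]

low, q1, med, q3, high = five_number_summary(heights)
quarter_spreads = [q1 - low, med - q1, q3 - med, high - q3]
iqr = q3 - q1
print(low, q1, med, q3, high, quarter_spreads, iqr)
```

The four quarter spreads match part b, and the interquartile range matches part c.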
For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile may be the same. For instance, you might have a data set in which the median and the third quartile are the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The right side of the box would display both the third quartile and the median. For example, if the smallest value and the first quartile were both 1, the median and the third quartile were both 5, and the largest value was 7, the box plot would look as follows:
Figure 6
The Law of Large Numbers and the Mean
The law of large numbers says that if you take samples of larger and larger size from any population, then the mean of the sampling distribution, $\mu_{\bar{x}}$, tends to get closer and closer to the true population mean, $\mu$. From the Central Limit Theorem, we know that as n gets larger and larger, the sample means follow a normal distribution. The larger n gets, the smaller the standard deviation of the sampling distribution gets. (Remember that the standard deviation for the sampling distribution is $\frac{\sigma}{\sqrt{n}}$.) This means that the sample mean $\bar{x}$ must be closer to the population mean $\mu$ as n increases. We can say that $\mu$ is the value that the sample means approach as n gets larger. The Central Limit Theorem illustrates the law of large numbers.
This concept is so important and plays such a critical role in what follows that it deserves to be developed further. Indeed, there are two critical issues that flow from the Central Limit Theorem and the application of the Law of Large Numbers to it. These are:
1. The sampling distribution of means approaches a normal distribution as the sample size grows, regardless of the underlying distribution of the population observations, and
2. The standard deviation of the sampling distribution decreases as the size of the samples that were used to calculate the means for the sampling distribution increases.
Taking these in order, it would seem counterintuitive that the population may have any distribution and yet the distribution of means coming from it would be normally distributed. With the use of computers, experiments can be simulated that show the process by which the sampling distribution changes as the sample size is increased. These simulations show visually the results of the mathematical proof of the Central Limit Theorem.
Here are three examples of very different population distributions and the evolution of the sampling distribution to a normal distribution as the sample size increases. The top panel in these cases represents the histogram for the original data. The three panels show the histograms for 1,000 randomly drawn samples for different sample sizes: n = 10, n = 25, and n = 50. As the sample size increases, and the number of samples taken remains constant, the distribution of the 1,000 sample means becomes closer to the smooth line that represents the normal distribution.
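A simulation of this kind can be sketched with Python's standard library. The population here is an assumption made for illustration: an exponential distribution with mean 1 and standard deviation 1, which is strongly skewed to the right. The function names and the fixed seed are likewise illustrative choices, not taken from the text.

```python
import random
from statistics import fmean, stdev

def draw_exponential(rng):
    """One observation from the assumed right-skewed population (mean 1, sd 1)."""
    return rng.expovariate(1.0)

def sample_means(draw, n, samples=1000, seed=0):
    """Draw `samples` random samples of size n and return the mean of each one."""
    rng = random.Random(seed)
    return [fmean(draw(rng) for _ in range(n)) for _ in range(samples)]

results = {n: sample_means(draw_exponential, n) for n in (10, 25, 50)}

for n, means in results.items():
    center = fmean(means)      # should be near the population mean, 1
    spread = stdev(means)      # should be near sigma / sqrt(n) = 1 / sqrt(n)
    print(f"n = {n:2d}: mean of sample means = {center:.3f}, "
          f"sd = {spread:.3f}, sigma/sqrt(n) = {1 / n ** 0.5:.3f}")
```

Under these assumptions, the mean of the sample means should stay near 1 for every n, while their standard deviation should shrink roughly in step with $\frac{\sigma}{\sqrt{n}}$. A histogram of each list in `results` would show the shape moving toward the normal curve as n grows.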
Sampling Distributions and Statistic of a Sampling Distribution
You can think of a sampling distribution as a relative frequency distribution with a great many samples. (See Sampling and Data for a review of relative frequency.) Suppose thirty randomly selected students were asked the number of movies they watched the previous week. The results are in the relative frequency table shown below.
# of movies | Relative Frequency |
0 | 5/30 |
1 | 15/30 |
2 | 6/30 |
3 | 4/30 |
4 | 1/30 |
Table 6
If you let the number of samples get very large (say, 300 million or more), the relative frequency table becomes a relative frequency distribution.
A statistic is a number calculated from a sample. Statistic examples include the mean, the median and the mode as well as others. The sample mean x̄ is an example of a statistic which estimates the population mean μ.
Skewness and the Mean, Median, and Mode
Consider the following data set:
4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 7 ; 7 ; 8 ; 8 ; 8 ; 9 ; 10
This data set produces the histogram shown below. Each interval has width one and each value is located in the middle of an interval.
Figure 7
The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. The mean, the median, and the mode are each 7 for these data. In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal) and the mode is the same as the mean and median. In a symmetrical distribution that has two modes (bimodal), the two modes would be different from the mean and median.
The histogram for the data:
4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 8 is not symmetrical. The right-hand side seems “chopped off” compared to the left side. The shape of the distribution is called skewed to the left because it is pulled out to the left.
Figure 8
The mean is 6.3, the median is 6.5, and the mode is 7. Notice that the mean is less than the median and they are both less than the mode. The mean and the median both reflect the skewing but the mean more so.
The histogram for the data:
6 ; 7 ; 7 ; 7 ; 7 ; 8 ; 8 ; 8 ; 9 ; 10 is also not symmetrical. It is skewed to the right.
Figure 9
The mean is 7.7, the median is 7.5, and the mode is 7. Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.
To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode. If the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.
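As a quick check of the three data sets above, the mean, median, and mode can be computed with Python's standard statistics module. The dictionary keys are just labels chosen for this sketch.

```python
import statistics

datasets = {
    "symmetric":    [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left":  [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

summary = {}
for name, data in datasets.items():
    # Store (mean, median, mode) for each data set
    summary[name] = (statistics.mean(data),
                     statistics.median(data),
                     statistics.mode(data))
    mean, median, mode = summary[name]
    print(f"{name:12s} mean={mean} median={median} mode={mode}")
```

For the left-skewed set the order is mean < median < mode, and for the right-skewed set it is mode < median < mean, matching the summary above.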
Skewness and symmetry become important when we discuss probability distributions in later chapters.
Measures of the Spread of the Data
An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation.
The standard deviation is a number that measures how far data values are from their mean.
The standard deviation
• provides a numerical measure of the overall amount of variation in a data set
• can be used to determine whether a particular data value is close to or far from the mean
The standard deviation provides a measure of the overall variation in a data set
The standard deviation is always positive or 0. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.
Suppose that we are studying waiting times at the checkout line for customers at supermarket A and supermarket B; the average wait time at both markets is 5 minutes. At market A, the standard deviation for the waiting time is 2 minutes; at market B the standard deviation for the waiting time is 4 minutes.
Because market B has a higher standard deviation, we know that there is more variation in the waiting times at market B. Overall, wait times at market B are more spread out from the average; wait times at market A are more concentrated near the average.
The standard deviation can be used to determine whether a data value is close to or far from the mean. Suppose that Rosa and Binh both shop at market A. Rosa waits for 7 minutes and Binh waits for 1 minute at the checkout counter. At market A, the mean wait time is 5 minutes and the standard deviation is 2 minutes.
Rosa waits for 7 minutes:
- 7 is 2 minutes longer than the average of 5; 2 minutes is equal to one standard deviation.
- Rosa’s wait time of 7 minutes is 2 minutes longer than the average of 5 minutes.
- Rosa’s wait time of 7 minutes is one standard deviation above the average of 5 minutes.
Binh waits for 1 minute:
- 1 is 4 minutes less than the average of 5; 4 minutes is equal to two standard deviations.
- Binh’s wait time of 1 minute is 4 minutes less than the average of 5 minutes.
- Binh’s wait time of 1 minute is two standard deviations below the average of 5 minutes.
A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the average. Considering data to be far from the mean if it is more than 2 standard deviations away is more of an approximate “rule of thumb” than a rigid rule. In general, the shape of the distribution of the data affects how much of the data is further away than 2 standard deviations. (We will learn more about this in later chapters.)
If 1 were also part of the data set, then 1 is two standard deviations to the left of 5 because 5 + (−2)(2) = 1.
- In general, a value = mean + (#ofSTDEVs)(standard deviation)
- where #ofSTDEVs = the number of standard deviations
- 7 is one standard deviation more than the mean of 5 because: 7 = 5 + (1)(2)
- 1 is two standard deviations less than the mean of 5 because: 1 = 5 + (−2)(2)
The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for a sample and for a population:
Sample: x = x̄ + (#ofSTDEVs)(s)
Population: x = μ + (#ofSTDEVs)(σ)
The lower case letter s represents the sample standard deviation and the Greek letter σ (sigma, lower case) represents the population standard deviation.
The symbol x̄ is the sample mean and the Greek symbol μ is the population mean.
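Solving value = mean + (#ofSTDEVs)(standard deviation) for #ofSTDEVs gives (value − mean) / (standard deviation). A minimal Python sketch of that rearrangement, using the market A wait times (the function name num_stdevs is made up for this illustration):

```python
def num_stdevs(value, mean, stdev):
    """How many standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / stdev


mean_wait, sd_wait = 5, 2  # market A, in minutes

rosa = num_stdevs(7, mean_wait, sd_wait)  # 1.0: one standard deviation above
binh = num_stdevs(1, mean_wait, sd_wait)  # -2.0: two standard deviations below
print(rosa, binh)
```

A positive result means the value is above the mean and a negative result means it is below.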
Calculating the Standard Deviation
If x is a number, then the difference “x – mean” is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong to a population, in symbols a deviation is x − μ. For sample data, in symbols a deviation is x − x̄.
The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample. The calculations are similar, but not identical. Therefore the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lower case letter s represents the sample standard deviation and the Greek letter σ (sigma, lower case) represents the population standard deviation. If the sample has the same characteristics as the population, then s should be a good estimate of σ.
To calculate the standard deviation, we need to calculate the variance first. The variance is an average of the squares of the deviations (the x − x̄ values for a sample, or the x − μ values for a population). The symbol σ² represents the population variance; the population standard deviation σ is the square root of the population variance. The symbol s² represents the sample variance; the sample standard deviation s is the square root of the sample variance. You can think of the standard deviation as a special average of the deviations.
If the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by n − 1, one less than the number of items in the sample. You can see that in the formulas below.
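Formulas for the Sample Standard Deviation
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f \cdot (x - \bar{x})^2}{n - 1}}$
For the sample standard deviation, the denominator is n − 1, that is, the sample size MINUS 1.
Formulas for the Population Standard Deviation
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f \cdot (x - \mu)^2}{N}}$
For the population standard deviation, the denominator is N, the number of items in the population.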
In these formulas, f represents the frequency with which a value appears. For example, if a value appears once, f is 1. If a value appears three times in the data set or population, f is 3.
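The frequency form of the calculation can be sketched in Python as follows. The helper freq_stdev and its sample flag are hypothetical names for this illustration. The frequency table is built from the left-skewed data set 4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 8 used earlier, treated once as a sample and once as a whole population purely for comparison.

```python
import math


def freq_stdev(freq_table, sample=True):
    """Standard deviation from a {value: frequency} table.

    Divides the sum of f * (deviation squared) by n - 1 when sample=True,
    and by N (the total count) otherwise.
    """
    n = sum(freq_table.values())
    mean = sum(x * f for x, f in freq_table.items()) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in freq_table.items())
    return math.sqrt(sum_sq / (n - 1 if sample else n))


# value -> frequency for 4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 8
table = {4: 1, 5: 1, 6: 3, 7: 4, 8: 1}

s = freq_stdev(table, sample=True)       # divide by n - 1
sigma = freq_stdev(table, sample=False)  # divide by N
print(f"s = {s:.4f}, sigma = {sigma:.4f}")
```

Because the sample version divides by a smaller number, s comes out slightly larger than σ for the same values.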
Sampling Variability of a Statistic
The statistic of a sampling distribution was discussed in Descriptive Statistics: Measuring the Center of the Data. How much the statistic varies from one sample to another is known as the sampling variability of a statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of the mean is an example of a standard error. It is a special standard deviation and is known as the standard deviation of the sampling distribution of the mean. You will cover the standard error of the mean later, in The Central Limit Theorem. The notation for the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$, where σ is the standard deviation of the population and n is the size of the sample.
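The standard error of the mean is the population standard deviation divided by the square root of the sample size. As a small sketch, assume a population standard deviation of 2 minutes (borrowing the market A figure and treating it as a population value) and two hypothetical sample sizes, 25 and 100:

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)


sigma = 2  # assumed population standard deviation, in minutes

se_25 = standard_error(sigma, 25)    # 2 / 5  = 0.4
se_100 = standard_error(sigma, 100)  # 2 / 10 = 0.2
print(se_25, se_100)
```

Quadrupling the sample size halves the standard error, so sample means from larger samples vary less from one sample to the next.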
Example 14
In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are rounded to the nearest half year:
9 ; 9.5 ; 9.5 ; 10 ; 10 ; 10 ; 10 ; 10.5 ; 10.5 ; 10.5 ; 10.5 ; 11 ; 11 ; 11 ; 11 ; 11 ; 11 ; 11.5 ; 11.5 ; 11.5
The average age is 10.53 years, rounded to 2 places.
The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square root of the variance. We will explain the parts of the table after calculating s.
Data | Freq. | Deviations | Deviations² | (Freq.)(Deviations²) |
x | f | (x − x̄) | (x − x̄)² | (f)(x − x̄)² |
9 | 1 | 9 − 10.525 = −1.525 | (−1.525)² = 2.325625 | 1 × 2.325625 = 2.325625 |
9.5 | 2 | 9.5 − 10.525 = −1.025 | (−1.025)² = 1.050625 | 2 × 1.050625 = 2.101250 |
10 | 4 | 10 − 10.525 = −0.525 | (−0.525)² = 0.275625 | 4 × 0.275625 = 1.1025 |
10.5 | 4 | 10.5 − 10.525 = −0.025 | (−0.025)² = 0.000625 | 4 × 0.000625 = 0.0025 |
11 | 6 | 11 − 10.525 = 0.475 | (0.475)² = 0.225625 | 6 × 0.225625 = 1.35375 |
11.5 | 3 | 11.5 − 10.525 = 0.975 | (0.975)² = 0.950625 | 3 × 0.950625 = 2.851875 |
Table 7
The sample variance, s², is equal to the sum of the last column (9.7375) divided by the total number of data values minus one (20 − 1):
s² = 9.7375 / (20 − 1) = 0.5125
The sample standard deviation s is equal to the square root of the sample variance:
s = √0.5125 ≈ 0.715891
Rounded to two decimal places, s = 0.72.
Typically, you do the calculation for the standard deviation on your calculator or computer. The intermediate results are not rounded. This is done for accuracy.
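For instance, the Example 14 calculation can be reproduced with Python's standard statistics module, whose variance and stdev functions use the sample (n − 1) formulas. This is one possible way to do it on a computer, not the only one.

```python
import statistics

# Ages of the n = 20 fifth grade students from Example 14
ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
        11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

mean_age = statistics.mean(ages)       # sample mean
s_squared = statistics.variance(ages)  # sample variance (divides by n - 1)
s = statistics.stdev(ages)             # sample standard deviation

print(f"mean = {mean_age:.3f}")
print(f"variance = {s_squared:.4f}")
print(f"standard deviation = {s:.4f}")
```

The results agree with the table: a mean of 10.525, a variance of 0.5125, and a standard deviation that rounds to 0.72.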
Example 15
Find the value that is 1 standard deviation above the mean. Find (x̄ + 1s).
Solution – Example 15
(x̄ + 1s) = 10.53 + (1)(0.72) = 11.25
Example 16
Find the value that is two standard deviations below the mean. Find (x̄ − 2s).
Solution – Example 16
(x̄ − 2s) = 10.53 − (2)(0.72) = 9.09
Example 17
Find the values that are 1.5 standard deviations from (below and above) the mean.
Solution – Example 17
(x̄ − 1.5s) = 10.53 − (1.5)(0.72) = 9.45
(x̄ + 1.5s) = 10.53 + (1.5)(0.72) = 11.61
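Examples 15 through 17 all apply the same rule, value = mean + (#ofSTDEVs)(standard deviation). A short Python sketch, using the rounded values 10.53 and 0.72 from Example 14 (the function name value_at is made up for this illustration):

```python
def value_at(mean, stdev, k):
    """Value lying k standard deviations from the mean (k < 0 means below)."""
    return mean + k * stdev


x_bar, s = 10.53, 0.72  # rounded sample mean and standard deviation

for k in (1, -2, -1.5, 1.5):
    print(f"k = {k:+}: {value_at(x_bar, s, k):.2f}")
```

Starting from the unrounded mean and standard deviation instead would give slightly different answers in the second decimal place for some of these values.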
Explanation of the standard deviation calculation shown in the table
The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than is the data value 11. The deviations 0.975 and 0.475 indicate that. A positive deviation occurs when the data value is greater than the mean. A negative deviation occurs when the data value is less than the mean; the deviation is −1.525 for the data value 9. If you add the deviations, the sum is always zero. (For this example, there are n = 20 deviations.) So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you make them positive numbers, and the sum will also be positive. The variance, then, is the average squared deviation.
The variance is a squared measure and does not have the same units as the data. Taking the square root solves the problem. The standard deviation measures the spread in the same units as the data.
Notice that instead of dividing by n = 20, the calculation divided by n − 1 = 20 − 1 = 19 because the data is a sample. For the sample variance, we divide by the sample size minus one (n − 1). Why not divide by n? The answer has to do with the population variance. The sample variance is an estimate of the population variance. Based on the theoretical mathematics that lies behind these calculations, dividing by (n − 1) gives a better estimate of the population variance.
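A simulation can make the n − 1 choice more concrete. The sketch below assumes a hypothetical population, the six equally likely faces of a fair die, whose population variance is known. It draws many small samples and averages the two competing estimates. The sample size of 5 and the count of 20,000 samples are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

faces = [1, 2, 3, 4, 5, 6]                   # hypothetical population
true_variance = statistics.pvariance(faces)  # divides by N = 6

n = 5                # size of each sample
num_samples = 20000  # number of samples drawn
total_div_n = 0.0
total_div_n_minus_1 = 0.0

for _ in range(num_samples):
    sample = [random.choice(faces) for _ in range(n)]
    x_bar = sum(sample) / n
    sum_sq = sum((x - x_bar) ** 2 for x in sample)
    total_div_n += sum_sq / n                # divide by n
    total_div_n_minus_1 += sum_sq / (n - 1)  # divide by n - 1

avg_div_n = total_div_n / num_samples
avg_div_n_minus_1 = total_div_n_minus_1 / num_samples

print(f"population variance:       {true_variance:.4f}")
print(f"average, dividing by n:     {avg_div_n:.4f}")
print(f"average, dividing by n - 1: {avg_div_n_minus_1:.4f}")
```

On average, dividing by n underestimates the population variance, while dividing by n − 1 lands much closer to it.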
NOTE: Your concentration should be on what the standard deviation tells us about the data. The standard deviation is a number which measures how far the data are spread from the mean. Let a calculator or computer do the arithmetic.
The standard deviation, s or σ, is either zero or larger than zero. When the standard deviation is 0, there is no spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all concentrated close to the mean, and is larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s or σ very large.
The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better “feel” for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation can be very helpful but in skewed distributions, the standard deviation may not be much help. The reason is that the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be confusing, always graph your data.
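The five values recommended for skewed data can be computed with Python's statistics module, shown here for the right-skewed data set 6 ; 7 ; 7 ; 7 ; 7 ; 8 ; 8 ; 8 ; 9 ; 10 from earlier in the chapter. One caution: there are several conventions for locating quartiles, and statistics.quantiles uses its default "exclusive" method here, so Q1 and Q3 may differ slightly from values found by hand or on a calculator.

```python
import statistics

data = [6, 7, 7, 7, 7, 8, 8, 8, 9, 10]  # right-skewed data set

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)

five_number = (min(data), q1, statistics.median(data), q3, max(data))
print("min, Q1, median, Q3, max =", five_number)
```

Comparing the distance from the median to each end of the data gives a sense of the different spreads on the two sides, which a single standard deviation cannot show.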