Statistics

This page demonstrates the calculations required for A-level Statistics. Students can generate data and see the results of the calculations, or try the calculations for themselves and check their answers against the computer's results.

Data Definition

First Data Sample

Data type:
Auto:

Data:
Distribution:
Number of points:
Distribution mean: Number of trials: Maximum:
Standard deviation: Probability of success:
Extra data points:

Second Data Sample

Enable:
Auto:

Data:
Distribution:
Number of points:
Distribution mean: Number of trials: Maximum:
Standard deviation: Probability of success:
Extra data points:
Generate quantiles:
Force correlation:
Desired correlation:
Apply coding:	$Y =$ $X +$

Raw Data

First Sample:
Second Sample:
Paired Data:

Sorted Raw Data

First Sample:
Second Sample:

Data Representation

Even class intervals:
Class offset:
Class interval width:
Class endpoints:

Frequency Table

Class Interval		Frequency
		First	Second

Stem and Leaf Diagram

(Data rounded to nearest whole number.)

First Leaf	Stem	Second Leaf

Histogram

Mark:
Points from first sample below mark:
Approximate points from first sample below mark:
Points from second sample below mark:
Approximate points from second sample below mark:

Boxplot

Sample Statistics

	First Sample	Second Sample
Size: The size of a data set is the number of values. This is usually denoted by $n$.
Mean: The mean of a data set is the sum of the values divided by the number of values. $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$
Median: The median of a data set is the middle value when the values are listed in order. If there is an even number of values, it is the average of the two middle values.
Modal class: To define the modal class of a data set, the data needs to be divided into classes as in a frequency table. Then the modal class is the class or classes containing the most elements of the data set. Mode: The mode of a data set is the value or values that occur most frequently in the data.
Variance: The variance of a data set is a measure of its spread. It is defined as: $\sigma^{2} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$ (where $\bar{x}$ is the mean) but is more conveniently calculated using the formula: $\sigma^{2} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{2} - {\bar{x}}^{2}$
Standard Deviation: The standard deviation of a data set is a measure of its spread. It is defined as the square root of the variance.
Lower Quartile: The lower quartile of a data set is the value such that a quarter of the points lie below that value and three-quarters lie above. Its definition is slightly different depending on whether or not there is an actual data point satisfying that property.
Upper Quartile: The upper quartile of a data set is the value such that three-quarters of the points lie below that value and a quarter lie above. Its definition is slightly different depending on whether or not there is an actual data point satisfying that property.
Inter-Quartile Range: The inter-quartile range of a data set is a measure of its spread. It is defined as the upper quartile minus the lower quartile.
Skewness $\frac{3 (\bar{x} - \text{median})}{\sigma}$ :
Skewness $\frac{\bar{x} - \text{mode}}{\sigma}$ :
Quartile skewness coefficient:
Lilliefors Normality Test: The Lilliefors Normality Test of a data set is a hypothesis test with null hypothesis that the data is drawn from a normally distributed population. If this number is above the significance level then the null hypothesis is to be rejected. This test was chosen as it is straightforward to implement using the approximations due to Abdi and Molin.
Estimate of Mean:
Estimate of Variance:
Estimate of Standard Deviation:
Estimate of Median:
Estimate of Lower Quartile:
Estimate of Upper Quartile:
Estimate of Inter-Quartile Range:
Additional quantile calculations:
th quantile of
Estimate:
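
The definitions of the mean, median, variance and standard deviation given above can be checked with a short script. The following Python sketch is illustrative only: the function name `summary` and the sample data are assumptions, not part of the page. It computes the variance both from the definition and from the more convenient formula, so that the two can be compared, and includes the skewness measure based on the median.

```python
from math import sqrt
from statistics import median


def summary(values):
    """Summary statistics using the divide-by-n definitions given above."""
    n = len(values)
    mean = sum(values) / n
    # Definition: mean squared deviation from the mean.
    var_definition = sum((x - mean) ** 2 for x in values) / n
    # Convenient form: mean of the squares minus the square of the mean.
    var_shortcut = sum(x * x for x in values) / n - mean ** 2
    sd = sqrt(var_definition)
    med = median(values)
    return {
        "n": n,
        "mean": mean,
        "median": med,
        "variance": var_definition,
        "variance_shortcut": var_shortcut,
        "sd": sd,
        # Skewness 3(mean - median)/sd; left undefined when sd is zero.
        "skew_median": 3 * (mean - med) / sd if sd > 0 else None,
    }


stats = summary([2, 4, 4, 4, 5, 5, 7, 9])
```

For this sample the mean is 5, the variance is 4 by either formula, and the standard deviation is 2. The quartiles are deliberately left out, since their exact definition on this page is not stated.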

Correlation

Regression Calculation

\begin{aligned} Y & = 7 X + 8 \\ Error & = 0 \end{aligned}

Correlation and Summary Statistics

\begin{aligned} S_{x x} & = 3 \\ S_{y y} & = 4 \\ S_{x y} & = 5 \\ r = \frac{S_{x y}}{\sqrt{S_{x x} S_{y y}}} & = 6 \end{aligned}
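
The formula for $r$ above can be evaluated directly from paired data. The Python sketch below assumes the usual convention $S_{x x} = \sum (x_{i} - \bar{x})^{2}$, $S_{y y} = \sum (y_{i} - \bar{y})^{2}$ and $S_{x y} = \sum (x_{i} - \bar{x})(y_{i} - \bar{y})$; the page does not spell these definitions out, and the function name and sample data are hypothetical.

```python
from math import sqrt


def correlation_summary(xs, ys):
    """Return (S_xx, S_yy, S_xy, r) for two samples of equal length."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same number of points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # r is undefined if either sample has no spread (S_xx or S_yy is zero).
    r = s_xy / sqrt(s_xx * s_yy)
    return s_xx, s_yy, s_xy, r


s_xx, s_yy, s_xy, r = correlation_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

With these five pairs, $S_{x x} = 10$, $S_{y y} = 6$ and $S_{x y} = 6$, so $r = 6 / \sqrt{60}$, roughly 0.775.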

Help

The purpose of this webpage is to provide a playground for A-level statistics.

A brief overview of the tabs is as follows.

Data Definition. For defining or inputting the data to be studied.
Raw Data. Once the data has been defined, it can be viewed here. It is listed both sorted and unsorted, and if two data sets of equal length are used then the two data sets are combined into a list of pairs.
Grouping. Many data analyses start with data in groups. This tab allows one to specify the groupings.
Frequency Table. Following the grouping, this tab displays the corresponding frequency tables.
Stem and Leaf. This tab displays the data sets as stem and leaf diagrams.
Histogram. This tab displays the grouped data in histograms. It also allows for the definition of a mark and counts the number of data points below the mark, both actual and estimated from the frequency table.
Boxplot. This tab displays the data as a boxplot, with outliers defined using the interquartile range definition.
Sample Statistics. This tab displays lots of statistics calculated from the data sets.
Correlation. If the two data sets are of the same length, this tab shows the correlation between them. It displays a scatter plot, calculates the correlation coefficient, and the regression line. It is possible to modify the regression line to see how the error varies.
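
The Boxplot entry above refers to outliers defined using the interquartile range. The page does not state the exact rule, so the following Python sketch assumes the common convention that a point is an outlier if it lies more than 1.5 times the inter-quartile range beyond either quartile. The quartile method used here (`statistics.quantiles` with `method="inclusive"`) is likewise an assumption and may differ from the page's own definition.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return the points lying outside the fences built from the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 30])
```

Here the quartiles are 3 and 7, so the fences sit at -3 and 13 and only the value 30 is flagged.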

Data Definition

There are many options for defining the data set.

Data type. There are three types of data which modify how the program treats the data.
- Continuous. The data is considered as being drawn from a continuous distribution.
- Continuous but rounded. The data is considered as being drawn from a continuous distribution but a rounding function is applied (currently only "to nearest integer" is available). This means that the data appears discrete but is treated as having been drawn from a continuous distribution.
- Discrete. The data is considered as being drawn from a discrete distribution.
Enable. It is possible to work with either one or two data sets. Checking this box enables the second data set.
Auto. The data can be automatically generated or entered directly. If this is checked, the data is automatically generated. If not, it can be entered directly in the Data textarea. The data points can be separated by commas, semi-colons, or spaces (including tabs and newlines).
Distribution. Depending on the data type, various standard distributions are available. Each requires some parameters. The available distributions are:
- Continuous distributions. All are specified by giving the mean and standard deviation. The Exponential and Log-Normal distributions are good sources of skewed data.
  - Gaussian (Normal)
  - (Translated) Exponential
  - Uniform
  - Log-Normal
- Discrete distributions.
  - Binomial
  - Poisson
  - Uniform
Number of points. This is the number of points to generate. If both data sets have the same number of points, regression analysis is enabled.
Extra data points. It is possible to add additional data points to the generated data.
Generate data. Most of the time, the data should update itself when the options are changed. Sometimes it might be necessary or desirable to manually generate it.
Generate quantiles. Rather than generate the second data set randomly (from the given distribution), it is possible to generate the quantiles corresponding to the number of data points in the first data sample. This can be used as a simple test for whether or not the first data set fits the given distribution: plotting a scatter plot of the data sets and looking at the correlation coefficient provides evidence for this (in these circumstances, the scatter plot is known as a QQ-plot).
Force correlation. If this checkbox is ticked, the program attempts to coerce the second data set to be correlated to the first with the given Desired correlation. The final correlation may not be precisely equal to the desired correlation.
Apply coding. If this checkbox is ticked, the second data set is generated by applying the given coding to the first data set.
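
Two of the options above, Generate quantiles and Apply coding, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the page's own method: the probabilities $(i - 0.5)/n$ used to place the quantiles are one common choice among several, the target distribution is taken to be Normal, and the names `normal_quantiles` and `apply_coding` are hypothetical. The coding is taken to have the form $Y = a X + b$.

```python
from statistics import NormalDist, mean, pstdev


def normal_quantiles(n, mu=0.0, sigma=1.0):
    """Theoretical quantiles of a Normal distribution at n evenly spread probabilities."""
    dist = NormalDist(mu, sigma)
    # Plotting positions (i - 0.5)/n are an assumption; other conventions exist.
    return [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def apply_coding(xs, a, b):
    """Code each value as y = a*x + b."""
    return [a * x + b for x in xs]


first = sorted([4.1, 5.3, 4.8, 6.0, 5.1])

# Pair each sorted data point with the matching theoretical quantile (a QQ-plot).
theoretical = normal_quantiles(len(first), mean(first), pstdev(first))
qq_pairs = list(zip(theoretical, first))

# Coding scales the mean by a and shifts it by b; the spread is scaled by |a|.
coded = apply_coding(first, 2, 3)
```

If the first data set fits the chosen distribution well, the points in `qq_pairs` lie close to a straight line, which is why the correlation coefficient is informative here.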

Grouping

The Frequency Table, Histogram, and some of the Sample Statistics depend on putting the data into groups. This tab allows one to define those groups.

There are two ways to define the groups. If Even class intervals is checked then the groups are of equal widths and so are specified by the width and the initial offset from zero. If it is not checked, one must specify the boundaries individually.

The groups are assumed to be open at the top end. That is, if $a$ and $b$ are two boundaries, the group is $a \leq x < b$.
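
For even class intervals, the rule $a \leq x < b$ amounts to a floor division by the class width. The following Python sketch shows one way to build a frequency table on that basis; the function names and sample values are assumptions made for illustration.

```python
from collections import Counter
from math import floor


def class_index(x, offset, width):
    """Index k of the class offset + k*width <= x < offset + (k + 1)*width."""
    return floor((x - offset) / width)


def frequency_table(values, offset, width):
    """Map each non-empty class (lower, upper) to its frequency."""
    counts = Counter(class_index(x, offset, width) for x in values)
    return {
        (offset + k * width, offset + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }


table = frequency_table([0, 4.9, 5, 9.99, 10, 12], offset=0, width=5)
```

A value equal to a boundary, such as 5 or 10 here, is counted in the class that starts at that boundary. Empty classes are omitted from this table, and with non-integer widths floating-point rounding can move a value that sits exactly on a boundary.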

Histogram

The interactive element of the Histogram tab is the movable red marker, whose purpose is to show the effect of interpolation. Once the mark is set, the program counts how many data points lie below it and also estimates that count by interpolating on the groups. The grouped estimate can thus be compared with the actual figure.
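
A minimal Python sketch of this comparison follows. It assumes linear interpolation, that is, the points in each class are treated as evenly spread across it; the page does not say which interpolation it uses, and the function names and data are hypothetical.

```python
def estimate_below(mark, boundaries, frequencies):
    """Estimate how many points lie below mark using only the grouped data.

    boundaries has one more entry than frequencies; class i covers
    boundaries[i] <= x < boundaries[i + 1].
    """
    total = 0.0
    for lower, upper, freq in zip(boundaries, boundaries[1:], frequencies):
        if mark >= upper:
            total += freq  # the whole class lies below the mark
        elif mark > lower:
            # Assume the points are spread evenly across the class.
            total += freq * (mark - lower) / (upper - lower)
    return total


def actual_below(mark, values):
    """Count the raw data points strictly below mark."""
    return sum(1 for x in values if x < mark)


values = [1, 2, 6, 7, 8, 9, 12]
boundaries = [0, 5, 10, 15]
frequencies = [2, 4, 1]

estimate = estimate_below(6.5, boundaries, frequencies)
actual = actual_below(6.5, values)
```

With the mark at 6.5, the grouped estimate is 2 + 4 × 1.5/5 = 3.2, while the actual count is 3.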

Sample Statistics

In this tab, an array of statistics is computed from the data sample(s). Hovering over a title reveals a brief description of the statistic. Where a statistic contains the word "Estimate", it is based on the grouped data. Extra quantiles can be calculated using the input boxes at the bottom. For example, to work out the 70th percentile, enter "70" and "100" in the boxes so that it reads "70th quantile of 100".
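
The estimates based on grouped data can be sketched as follows in Python. This uses the standard textbook approach, which is assumed here rather than confirmed as the page's method: every point is placed at its class midpoint for the mean and variance, and quantiles are found by linear interpolation within the class that contains them. The function names and the example table are hypothetical.

```python
def grouped_mean_variance(boundaries, frequencies):
    """Estimate mean and variance by placing each point at its class midpoint."""
    n = sum(frequencies)
    mids = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
    mean = sum(f * m for f, m in zip(frequencies, mids)) / n
    variance = sum(f * m * m for f, m in zip(frequencies, mids)) / n - mean ** 2
    return mean, variance


def grouped_quantile(k, q, boundaries, frequencies):
    """Estimate the k-th quantile of q by interpolating within a class."""
    target = k * sum(frequencies) / q  # cumulative frequency to reach
    cumulative = 0
    for lo, hi, f in zip(boundaries, boundaries[1:], frequencies):
        if f > 0 and cumulative + f >= target:
            return lo + (target - cumulative) / f * (hi - lo)
        cumulative += f
    return boundaries[-1]


boundaries = [0, 10, 20, 30]
frequencies = [2, 6, 2]

mean_estimate, variance_estimate = grouped_mean_variance(boundaries, frequencies)
median_estimate = grouped_quantile(1, 2, boundaries, frequencies)
p70_estimate = grouped_quantile(70, 100, boundaries, frequencies)
```

Here `grouped_quantile(70, 100, ...)` corresponds to entering "70" and "100" in the input boxes. For this table the estimated mean is 15, the estimated variance is 40, the estimated median is 15, and the 70th percentile is estimated at 10 + (5/6) × 10, roughly 18.3.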

Correlation

This tab is only displayed if there are two data sets of the same length. It displays a scatter plot together with the relevant statistics for correlation and regression analysis.

It is possible to adjust the line to demonstrate that the regression line is the best fit. Clicking and dragging on the scatter plot will adjust the line to pass through the mouse or touch point. If the mouse is near the centre of the scatter plot then the y-intercept is adjusted; if it is near the edges then the gradient is adjusted. Clicking Reset regression line resets the line back to the regression line.
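
The same demonstration can be reproduced numerically. The Python sketch below assumes that the error shown on the page is the sum of squared vertical distances from the points to the line, which the page does not state explicitly; the function names and data are hypothetical. It fits the least-squares line of y on x and then shows that changing either the intercept or the gradient increases the error.

```python
def regression_line(xs, ys):
    """Least-squares gradient and y-intercept for the regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    gradient = s_xy / s_xx
    return gradient, y_bar - gradient * x_bar


def squared_error(xs, ys, gradient, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = regression_line(xs, ys)
best = squared_error(xs, ys, m, c)

# Moving the line away from the fitted values increases the error.
shifted = squared_error(xs, ys, m, c + 0.5)  # intercept adjusted
tilted = squared_error(xs, ys, m + 0.2, c)   # gradient adjusted
```

For these points the fitted line is $Y = 0.6 X + 2.2$ with a squared error of 2.4, and both adjusted lines give a larger error.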