# Statistics

This page demonstrates the various calculations required for A-level Statistics. Students can generate data and see the results of the calculations, or they can try the calculations for themselves and check their answers against the computer's results.

### Data Definition

#### First Data Sample

• Data type: Continuous, Continuous but rounded, Discrete
• Auto:
• Data:
• Distribution: Gaussian (Normal), (Translated) Exponential, Uniform, Log-Normal, Binomial, Poisson, Uniform
• Number of points:
• Distribution mean:
• Number of trials:
• Maximum:
• Standard deviation:
• Probability of success:
• Extra data points:

#### Second Data Sample

• Enable:
• Auto:
• Data:
• Distribution: Gaussian (Normal), (Translated) Exponential, Uniform, Log-Normal, Binomial, Poisson, Uniform
• Number of points:
• Distribution mean:
• Number of trials:
• Maximum:
• Standard deviation:
• Probability of success:
• Extra data points:
• Generate quantiles:
• Force correlation:
• Desired correlation:
• Apply coding: $Y=$ $X+$

### Raw Data

 First Sample: Second Sample: Paired Data:

### Sorted Raw Data

 First Sample: Second Sample:

### Data Representation

 Even class intervals: Class offset: Class interval width: Class endpoints:

### Frequency Table

Class Interval Frequency
First Second

### Stem and Leaf Diagram

(Data rounded to nearest whole number.)

First Leaf Stem Second Leaf

### Histogram

 Mark: Points from first sample below mark: Approximate points from first sample below mark: Points from second sample below mark: Approximate points from second sample below mark:

### Sample Statistics

First Sample Second Sample
Size:
Mean:
Median:
Modal class: Mode:
Variance:
Standard Deviation:
Lower Quartile:
Upper Quartile:
Inter-Quartile Range:
Skewness $\frac{3\left(\overline{x}-\text{median}\right)}{\sigma}$:
Skewness $\frac{\overline{x}-\text{mode}}{\sigma}$:
Quartile skewness coefficient:
Lilliefors Normality Test:
Estimate of Mean:
Estimate of Variance:
Estimate of Standard Deviation:
Estimate of Median:
Estimate of Lower Quartile:
Estimate of Upper Quartile:
Estimate of Inter-Quartile Range:
th quantile of
Estimate:

### Correlation

#### Regression Calculation

$\begin{array}{rl}Y& =7X+8\\ \text{Error}& =0\end{array}$

#### Correlation and Summary Statistics

$\begin{array}{rl}{S}_{xx}& =3\\ {S}_{yy}& =4\\ {S}_{xy}& =5\\ r=\frac{{S}_{xy}}{\sqrt{{S}_{xx}{S}_{yy}}}& =6\end{array}$

### Help

The purpose of this webpage is to provide a playground for A-level statistics.

A brief overview of the tabs is as follows.

• Data Definition. For defining or inputting the data to be studied.
• Raw Data. Once the data has been defined, it can be viewed here. It is listed both sorted and unsorted, and if two data sets of equal length are used then the two data sets are combined into a list of pairs.
• Grouping. Many data analyses start with data in groups. This tab allows one to specify the groupings.
• Frequency Table. Following the grouping, this tab displays the corresponding frequency tables.
• Stem and Leaf. This tab displays the data sets as stem and leaf diagrams.
• Histogram. This tab displays the grouped data in histograms. It also allows for the definition of a mark and counts the number of data points below the mark, both actual and estimated from the frequency table.
• Boxplot. This tab displays the data as a boxplot, with outliers defined using the interquartile range definition.
• Sample Statistics. This tab displays lots of statistics calculated from the data sets.
• Correlation. If the two data sets are of the same length, this tab shows the correlation between them. It displays a scatter plot and calculates the correlation coefficient and the regression line. It is possible to modify the regression line to see how the error varies.
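The "interquartile range definition" of an outlier mentioned for the Boxplot tab is not spelled out on this page. A minimal sketch, assuming the common convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond a quartile, might look like this. The function name `iqr_outliers`, the factor `k=1.5`, and the quartile method are illustrative assumptions, not the page's actual code.

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Return the points lying more than k * IQR outside the quartiles."""
    # Assumption: quartiles come from Python's default "exclusive" method,
    # which may differ from the method used by the page.
    lower_q, _, upper_q = quantiles(data, n=4)
    iqr = upper_q - lower_q
    low_fence = lower_q - k * iqr
    high_fence = upper_q + k * iqr
    return [x for x in data if x < low_fence or x > high_fence]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
outliers = iqr_outliers(sample)
print(outliers)  # prints [30]
```

Here the quartiles are 2.75 and 8.25, so the fences sit at -5.5 and 16.5 and only the value 30 falls outside them.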

#### Data Definition

There are many options for defining the data set.

• Data type. There are three types of data, which modify how the program treats the data.
  • Continuous. The data is considered as being drawn from a continuous distribution.
  • Continuous but rounded. The data is considered as being drawn from a continuous distribution but a rounding function is applied (currently only "to nearest integer" is available). This means that the data appears discrete but is treated as having been drawn from a continuous distribution.
  • Discrete. The data is considered as being drawn from a discrete distribution.
• Enable. It is possible to work with either one or two data sets. Checking this box enables the second data set.
• Auto. The data can be automatically generated or entered directly. If this is checked, the data is automatically generated. If not, it can be entered directly in the Data textarea. The data points can be separated by commas, semi-colons, or spaces (including tabs and newlines).
• Distribution. Depending on the data type, various standard distributions are available. Each requires some parameters. The available distributions are:
  • Continuous distributions. All are specified by giving the mean and standard deviation. The Exponential and Log-Normal distributions are good sources of skewed data.
    • Gaussian (Normal)
    • (Translated) Exponential
    • Uniform
    • Log-Normal
  • Discrete distributions.
    • Binomial
    • Poisson
    • Uniform
• Number of points. This is the number of points to generate. If both data sets have the same number of points, regression analysis is enabled.
• Extra data points. It is possible to add additional data points to the generated data.
• Generate data. Most of the time, the data should update itself when the options are changed. Sometimes it might be necessary or desirable to manually generate it.
• Generate quantiles. Rather than generate the second data set randomly (from the given distribution), it is possible to generate the quantiles corresponding to the number of data points in the first data sample. This can be used as a simple test for whether or not the first data set fits the given distribution: plotting a scatter plot of the data sets and looking at the correlation coefficient provides evidence for this (in these circumstances, the scatter plot is known as a QQ-plot).
• Force correlation. If this checkbox is ticked, the program attempts to coerce the second data set to be correlated to the first with the given Desired correlation. The final correlation may not be precisely equal to the desired correlation.
• Apply coding. If this checkbox is ticked, the second data set is generated by applying the given coding to the first data set.
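The Generate quantiles option can be illustrated with a short sketch. It assumes a normal distribution and the plotting positions $(i - 0.5)/n$; the page may use a different convention, and the names `normal_quantiles` and `pearson_r` are hypothetical helpers introduced only for this example.

```python
from math import sqrt
from statistics import NormalDist

def normal_quantiles(n, mean=0.0, sd=1.0):
    """Return n quantiles of a normal distribution.

    Assumption: the i-th quantile is taken at probability (i - 0.5) / n.
    """
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

# Made-up first sample; the second "sample" is the list of quantiles.
sample = [4.1, 5.3, 5.9, 6.2, 6.8, 7.4, 8.0, 9.6]
theoretical = normal_quantiles(len(sample), mean=6.7, sd=1.7)

# Pair the sorted data with the quantiles, as in a QQ-plot.
r = pearson_r(sorted(sample), theoretical)
print(round(r, 3))
```

A correlation coefficient close to 1 suggests that the sorted data lies near a straight line when plotted against the quantiles, which is evidence that the sample fits the chosen distribution.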

#### Grouping

The Frequency Table, Histogram, and some of the Sample Statistics depend on putting the data into groups. This tab allows one to define those groups.

There are two ways to define the groups. If Even class intervals is checked then the groups are of equal widths and so are specified by the width and the initial offset from zero. If it is not checked, one must specify the boundaries individually.

The groups are assumed to be open at the top end. That is, if $a$ and $b$ are two boundaries, the group is $a\le x<b$.
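As a rough illustration of even class intervals, the sketch below counts data into classes that include their lower boundary and exclude their upper boundary. The function name `frequency_table` and its parameters `width` and `offset` are illustrative stand-ins for the Class interval width and Class offset settings, not the page's own code.

```python
import math
from collections import Counter

def frequency_table(data, width, offset=0):
    """Count data into classes [offset + k*width, offset + (k+1)*width)."""
    # floor() places a value equal to a boundary in the class above it.
    counts = Counter(math.floor((x - offset) / width) for x in data)
    # Classes that contain no data are simply omitted from the result.
    return {
        (offset + k * width, offset + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }

data = [1.0, 2.5, 5.0, 5.0, 7.2, 9.9, 10.0]
table = frequency_table(data, width=5, offset=0)
print(table)  # prints {(0, 5): 2, (5, 10): 4, (10, 15): 1}
```

Note that both values equal to 5.0 land in the class from 5 to 10, and the value 10.0 lands in the class from 10 to 15, because each class is open at the top end.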

#### Histogram

The interactive element of the Histogram tab is the movable red marker. Its purpose is to show the effect of interpolation. After the mark is set, the program calculates how many data points are below the mark, together with an estimate of that number obtained by interpolating within the groups. Thus it is possible to compare the grouped estimate with the actual figure.
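The comparison between the actual count and the grouped estimate can be sketched as follows. This assumes linear interpolation within the class that contains the mark, which is the usual A-level approach; the function names and the sample data are made up for the example.

```python
def actual_below(data, mark):
    """Count the data points strictly below the mark."""
    return sum(1 for x in data if x < mark)

def estimated_below(boundaries, frequencies, mark):
    """Estimate the count below the mark from a frequency table.

    Assumption: data is spread evenly within each class, so the class
    containing the mark contributes a proportional share of its frequency.
    """
    total = 0.0
    classes = zip(boundaries, boundaries[1:])
    for (lower, upper), freq in zip(classes, frequencies):
        if mark >= upper:
            total += freq  # whole class lies below the mark
        elif mark > lower:
            total += freq * (mark - lower) / (upper - lower)
    return total

data = [1, 2, 6, 7, 8, 9, 12]
boundaries = [0, 5, 10, 15]   # classes 0-5, 5-10, 10-15
frequencies = [2, 4, 1]       # frequency table for the data above
mark = 6.5

actual = actual_below(data, mark)
estimate = estimated_below(boundaries, frequencies, mark)
print(actual, estimate)
```

With these numbers, three data points are actually below the mark, while the grouped estimate is $2 + 4 \times \frac{1.5}{5} = 3.2$.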

#### Sample Statistics

In this tab, an array of statistics is computed from the data sample(s). Hovering over a title reveals a brief description of the statistic. Where a statistic contains the word "Estimate", it is based on the grouped data. Extra quantiles can be calculated using the input boxes at the bottom. For example, to work out the 70th percentile, enter "70" and "100" in the boxes so that it reads "70th quantile of 100".
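The "Estimate" statistics can be sketched from a frequency table alone. The example below assumes class midpoints for the mean and variance, a divisor of $n$ for the variance, and linear interpolation for quantiles; the page may use different conventions, and the function names are hypothetical.

```python
def grouped_mean_variance(boundaries, frequencies):
    """Estimate mean and variance using class midpoints (divisor n)."""
    n = sum(frequencies)
    mids = [(a + b) / 2 for a, b in zip(boundaries, boundaries[1:])]
    mean = sum(f * m for f, m in zip(frequencies, mids)) / n
    variance = sum(f * m * m for f, m in zip(frequencies, mids)) / n - mean ** 2
    return mean, variance

def grouped_quantile(boundaries, frequencies, k, of):
    """Estimate the "k-th quantile of `of`" by linear interpolation."""
    target = sum(frequencies) * k / of  # cumulative frequency to reach
    cumulative = 0
    classes = zip(boundaries, boundaries[1:])
    for (lower, upper), freq in zip(classes, frequencies):
        if freq > 0 and cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    return boundaries[-1]

boundaries = [0, 10, 20, 30]   # classes 0-10, 10-20, 20-30
frequencies = [2, 5, 3]

mean, variance = grouped_mean_variance(boundaries, frequencies)
median = grouped_quantile(boundaries, frequencies, 50, 100)
percentile_70 = grouped_quantile(boundaries, frequencies, 70, 100)
print(mean, variance, median, percentile_70)
```

For this table the estimated mean is 16 and the estimated variance is 49, so the estimated standard deviation is 7. The estimated median is 16 and the estimated 70th percentile is 20.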

#### Correlation

This tab is only displayed if there are two data sets of the same length. It displays a scatter plot together with the relevant statistics for correlation and regression analysis.

It is possible to adjust the line to demonstrate that the regression line is the best fit. Clicking and dragging on the scatter plot will adjust the line to pass through the mouse or touch point. If the mouse is near the centre of the scatter plot then the y-intercept is adjusted; if it is near the edges then the gradient is adjusted. Clicking Reset regression line resets the line back to the regression line.
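The idea that the regression line is the best fit can be checked numerically. The sketch below computes $S_{xx}$, $S_{yy}$, $S_{xy}$, the correlation coefficient $r$, and the regression line of $Y$ on $X$, then compares its error with that of nearby lines. It assumes the error is the sum of squared vertical residuals, which this page does not state explicitly; the function names and data are made up for the example.

```python
from math import sqrt

def summary_statistics(xs, ys):
    """Return (mean_x, mean_y, S_xx, S_yy, S_xy) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, s_xx, s_yy, s_xy

def squared_error(xs, ys, gradient, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x, mean_y, s_xx, s_yy, s_xy = summary_statistics(xs, ys)
r = s_xy / sqrt(s_xx * s_yy)

# Regression line of Y on X: gradient S_xy / S_xx, through (mean_x, mean_y).
gradient = s_xy / s_xx
intercept = mean_y - gradient * mean_x

best = squared_error(xs, ys, gradient, intercept)
steeper = squared_error(xs, ys, gradient + 0.1, intercept)
higher = squared_error(xs, ys, gradient, intercept + 0.1)
print(round(r, 4), gradient, round(intercept, 4), round(best, 4))
```

For this data $S_{xx}=10$, $S_{yy}=6$ and $S_{xy}=6$, giving the line $Y=0.6X+2.2$. Changing either the gradient or the intercept increases the error, which mirrors what dragging the line on the scatter plot demonstrates.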