# Research methods

**Data Treatment**

**Part A: Workshop**

The aim of this workshop is to give you experience applying standard statistical treatments to data using statistical software. The package we will use is **Minitab**. It is the standard statistical package used by the Mathematics and Statistics courses at RMIT, and Chemistry also has a site licence for it. Most of the analyses in this workshop can also be done in Excel, but Excel is not very user-friendly for statistical analysis. In Minitab the analyses are run from simple pull-down menus, and there is a good online help facility. You can copy and paste data from Excel into Minitab. For the assessment, enter your results into the attached pro forma.

Before you attempt the case studies in this assignment, you should try the worked examples in modules 1-4.

## Case Study 1

| Analyst A (% Ni) | Analyst B (% Ni) |
|---|---|
| 6.42 | 6.40 |
| 6.41 | 6.54 |
| 6.43 | 6.52 |
| 6.38 | 6.58 |

The above results were obtained by two analysts using a new method for the determination of nickel in a standard reference alloy with a certified value of 6.49% Ni.

For this data we want to determine the standard statistics: (i) mean, (ii) variance, (iii) standard deviation and (iv) confidence intervals for the mean. These enable us to answer the following questions:

- (a) Which analysis is the most **accurate** (i.e. closest to the certified value)?
- (b) Which analysis is the most **precise** (i.e. which has the smallest spread, or variability, of values)?

We can also use the t-test to answer the following:

- (c) Is there any evidence, with either analyst, of a systematic error? I.e. does either average differ significantly from the certified value?
- (d) Do the results of the two analysts differ significantly from each other?

**Analysis: Basic Statistics**

- Open Minitab by clicking on the Minitab icon on your desktop.
- When you open the program you will notice it is divided into two areas: the data area (lower screen) and the output area. Enter the data from the above table in columns C1 and C2.

**Warning:** make sure you start entering data in row 1, **NOT** in the cell immediately below the column heading (C1 etc.). That cell is reserved for column labels (you may put a label there, such as ‘Analyst A’). Also make sure you don’t enter a column label in row 1; if you do, the whole column will be formatted as text (C1-T) and cannot be used for analysis. If this happens, delete the whole column and start again (clicking on ‘C1’ will highlight the whole column).

- To get descriptive statistics, click on **Stat => Basic Statistics => Display Descriptive Statistics** to open the basic statistics dialog box. Highlight C1 and C2 on the left and then click ‘Select’. Alternatively, click in the Variable box and type C1 C2. Then click OK and the output will appear in the output window. Enter the values from the output in the pro forma. Note that the output does not give the variance, but you can calculate it as the square of the standard deviation.
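If you want to cross-check the Minitab output by hand, the same descriptive statistics can be sketched in a few lines of Python. This is only an illustrative check using the standard library; it is not part of the workshop software, and the function name `describe` is made up for this example.

```python
import statistics

# Case Study 1 results (% Ni), copied from the table of analysts' data
analyst_a = [6.42, 6.41, 6.43, 6.38]
analyst_b = [6.40, 6.54, 6.52, 6.58]


def describe(values):
    """Return (mean, sample standard deviation, sample variance)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return mean, sd, sd ** 2       # variance = SD squared


for name, values in (("Analyst A", analyst_a), ("Analyst B", analyst_b)):
    mean, sd, var = describe(values)
    print(f"{name}: mean = {mean:.3f}, SD = {sd:.4f}, variance = {var:.6f}")
```

The last value returned is the variance obtained by squaring the standard deviation, which is the calculation the pro forma asks for.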

## Confidence Intervals

- The confidence intervals for the mean can be obtained as follows: **Stat => Basic Statistics => 1-sample t.** Click on ‘confidence interval’ and leave it at the default 95%.
- The confidence interval is of the form (low value, high value). To express the interval in the form ‘mean +/- deviation’, calculate the deviation as 0.5*(high - low).
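As a rough check on the interval Minitab reports, the 95% confidence interval for a mean can be written as mean +/- t*s/sqrt(n). The sketch below assumes the two-tailed critical t value for 3 degrees of freedom (3.182) is read from t tables; the function and variable names are illustrative only.

```python
import math
import statistics

# Two-tailed 95% critical t value for n - 1 = 3 degrees of freedom (t tables)
T_CRIT_3_DF = 3.182

analyst_a = [6.42, 6.41, 6.43, 6.38]  # % Ni, Case Study 1


def confidence_interval(values, t_crit):
    """Return (low, high) for the mean: mean +/- t * s / sqrt(n)."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


low, high = confidence_interval(analyst_a, T_CRIT_3_DF)
deviation = 0.5 * (high - low)  # the 'mean +/- deviation' form
print(f"95% CI: ({low:.3f}, {high:.3f}), i.e. +/- {deviation:.3f}")
```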

## Hypothesis Testing

We now want to test whether either sample deviates significantly from the expected (certified) value of 6.49. We need to formulate the **null hypothesis (H_{0})**. In all statistical testing we then calculate the probability (p) of obtaining results at least as extreme as those observed if the null hypothesis were true. If this probability is low (usually < 5%, i.e. p < 0.05) we reject H_{0} and accept the alternative (H_{1}). The null hypothesis generally considers any deviations as being due just to chance/experimental error. In question (c) the null hypothesis is that the analytical result is not significantly different from the certified value, i.e. the true mean is actually 6.49. In question (d) our null hypothesis is that the two means are equal.

For question (c) we apply a t test: **Stat => Basic Statistics => 1-sample t** as above, but this time check the **Test mean** box and enter ‘6.49’ in this box. Remember in your conclusions that a result is **significant** (i.e. **reject** H_{0}) if **p < 0.05** (less than a 5% chance of seeing a difference this large if H_{0} is true, i.e. if the mean does not really differ from the certified value).
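The statistic behind the 1-sample t test is t = (mean - certified value)/(s/sqrt(n)). The sketch below computes it for both analysts and compares |t| with the tabulated two-tailed 5% critical value for 3 degrees of freedom (3.182), which is equivalent to checking whether p < 0.05. It uses only the Python standard library, so it does not report the p value itself; the names are illustrative.

```python
import math
import statistics

CERTIFIED = 6.49      # certified % Ni
T_CRIT_3_DF = 3.182   # two-tailed 5% critical t for 3 degrees of freedom

results = {
    "Analyst A": [6.42, 6.41, 6.43, 6.38],
    "Analyst B": [6.40, 6.54, 6.52, 6.58],
}


def one_sample_t(values, mu0):
    """t = (mean - mu0) / (s / sqrt(n))."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - mu0) / standard_error


t_values = {name: one_sample_t(vals, CERTIFIED) for name, vals in results.items()}
for name, t in t_values.items():
    verdict = "reject H0" if abs(t) > T_CRIT_3_DF else "do not reject H0"
    print(f"{name}: t = {t:.3f} -> {verdict}")
```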

For question (d) we also apply a t test, this time to compare two means: **Stat => Basic Statistics => 2-sample t.** Click on **‘Samples in different columns’**. Click in the ‘First’ box and then double-click on C1 in the variables column; do the same for C2 as ‘Second’. Accept ‘Not equal’ and ‘95’ as default values. The 95% confidence interval given in the output is for the **difference** between the two means. The p value for the null hypothesis that this difference is actually zero is given at the end of the output.
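For comparison with the 2-sample t output, the sketch below computes the unequal-variance (Welch) form of the two-sample t statistic and its approximate degrees of freedom. It assumes the ‘assume equal variances’ option is left unchecked in Minitab; if that option is ticked, a pooled-variance formula applies instead. The function name is illustrative.

```python
import math
import statistics

analyst_a = [6.42, 6.41, 6.43, 6.38]  # % Ni
analyst_b = [6.40, 6.54, 6.52, 6.58]  # % Ni


def welch_t(x, y):
    """Return (t, degrees of freedom) for the unequal-variance two-sample t test."""
    vx = statistics.variance(x) / len(x)  # squared standard error of each mean
    vy = statistics.variance(y) / len(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


t_stat, df = welch_t(analyst_a, analyst_b)
print(f"t = {t_stat:.3f} with approximately {df:.2f} degrees of freedom")
```

The degrees of freedom are generally not a whole number; compare |t| with the critical value from t tables for the nearest lower whole number.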

## Case Study 2

| A (mL) | B (mL) | C (mL) | D (mL) |
|---|---|---|---|
| 14.03 | 13.98 | 14.13 | 14.16 |
| 14.09 | 13.90 | 14.23 | 14.23 |
| 14.07 | 13.79 | 14.08 | 14.10 |

Four students (A-D) were each asked to perform a titration in triplicate using the same titrimetric procedure. Test to see if the students’ results differ significantly.

## One-Way Analysis of Variance

Enter the above data in columns C3-C6 (with appropriate headings for the columns). We can then test for differences in the means of each column using 1-way ANOVA as follows: **Stat => ANOVA => 1-way (unstacked)**. Highlight all the columns C3-C6 in the left box and click ‘Select’. They should now appear in the response box. Click OK.

The output should be a typical ANOVA table (see the notes **measurement and assessment of variability.pdf** for a full explanation of the ANOVA table). The key value is again the p value. We are testing here whether all four students are the same, i.e. their mean results do not differ significantly. We have to be careful about the alternative hypothesis (H_{1}) if we reject H_{0}: H_{1} is **not** that the students are all different (why?). Minitab gives a diagram which can help in interpreting the results, showing each mean and its confidence interval. Two results differ significantly if their CIs don’t overlap. Note, however, that Minitab uses a pooled standard deviation for the CIs, so they are all the same size. The diagram is thus just an indication but is still quite useful.
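To see where the numbers in the ANOVA table come from, the between-student and within-student sums of squares can be computed directly. The sketch below is a hand calculation in plain Python, not Minitab's implementation; the critical F value for (3, 8) degrees of freedom at the 5% level (4.07) is taken from F tables.

```python
# Titration volumes (mL) for the four students, Case Study 2
groups = {
    "A": [14.03, 14.09, 14.07],
    "B": [13.98, 13.90, 13.79],
    "C": [14.13, 14.23, 14.08],
    "D": [14.16, 14.23, 14.10],
}

F_CRIT_3_8 = 4.07  # 5% critical F for (3, 8) degrees of freedom, from F tables


def mean(values):
    return sum(values) / len(values)


all_values = [v for vals in groups.values() for v in vals]
grand_mean = mean(all_values)

# Between-group SS: spread of the student means about the grand mean
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# Within-group SS: spread of each student's results about their own mean
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}, critical value {F_CRIT_3_8}")
```

An F ratio above the critical value corresponds to p < 0.05; it says that at least one mean differs, not which one.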

## Case Study 3

Phosphorus (mg/kg):

| Temperature (°C) | Soil 1 | Soil 1 | Soil 2 | Soil 2 |
|---|---|---|---|---|
| 230 | 18.2 | 18.4 | 18.2 | 18.5 |
| 260 | 18.6 | 18.9 | 18.4 | 18.1 |
| 290 | 17.7 | 18.0 | 18.1 | 17.8 |
| 320 | 17.1 | 17.4 | 17.8 | 17.5 |

An experiment was carried out on the determination of phosphorus in soils to examine the effect of temperature on the analysis. As a result of time and cost considerations it was only possible to carry out 16 experiments. There was insufficient soil for all 16, so two batches of soil were used. A randomised block design was used, giving these results. Test to see if the temperature affects the analysis, and if there is any difference between the soils. Is there any evidence of interaction between soils and temperature?

## Two Way Analysis of Variance

This differs from the previous study in that there are two variables – temperature and soil. The data needs to be set out differently, as follows:-

- In **one** column enter all 16 phosphorus analytical values (18.2, 18.4 … 17.5).
- You also need two **coding** columns. Make one column the code for temperature and give a code (1-4) for each temperature.
- Enter in a third column the code (1-2) for the soil type. Thus the first value (18.2) will have 18.2, 1, 1 in the three columns, while the last value (17.5) will have 17.5, 4, 2 (i.e. 320 °C and soil 2).
- Carry out the two-way ANOVA: Stat => ANOVA => 2-Way. In the response field enter the column for phosphorus and enter the other two variables in the row and column boxes. Check the ‘display means’ boxes.
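The stacked layout described in the list above (one response column plus two coding columns) can be illustrated with a short Python sketch. It simply reads the Case Study 3 table row by row; the variable names are made up for this example.

```python
# Case Study 3 table: one row per temperature (230, 260, 290, 320 °C);
# the four columns are Soil 1, Soil 1, Soil 2, Soil 2
table = [
    [18.2, 18.4, 18.2, 18.5],
    [18.6, 18.9, 18.4, 18.1],
    [17.7, 18.0, 18.1, 17.8],
    [17.1, 17.4, 17.8, 17.5],
]
soil_code_of_column = [1, 1, 2, 2]

# Build (phosphorus, temperature code, soil code) rows, reading the table
# left to right and top to bottom
stacked = []
for temp_code, row in enumerate(table, start=1):
    for value, soil_code in zip(row, soil_code_of_column):
        stacked.append((value, temp_code, soil_code))

for row in stacked:
    print(*row, sep="\t")
```

Each printed line corresponds to one row of the three Minitab columns.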

Because there are two variables there are now null hypotheses for each variable (e.g. no significant difference between soils, i.e. mean [P] for soil 1 = mean [P] for soil 2). As with all our previous testing, the p value is the probability of seeing differences at least this large if H_{0} is **true**, and we **reject** H_{0} if p is low (< 0.05) and hence conclude there is a significant difference.

In 2 way ANOVA the possibility of variable interaction is also tested. An interaction means, for example, that temperature differences depend on soil type. If we see temperature differences with soil 1 but not soil 2 this would be an interaction effect. Again the diagram of means and CIs can be an indication of where differences occur.
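For a balanced design like this one (two replicates in every temperature/soil cell), the two-way ANOVA sums of squares can be worked out by hand. The sketch below is one way to do that in plain Python, treating both temperature and soil as fixed factors tested against the within-cell error; that is an assumption about how the analysis is set up, and Minitab's output layout may differ. Critical values 4.07 for (3, 8) and 5.32 for (1, 8) degrees of freedom are from 5% F tables.

```python
from collections import defaultdict

# (phosphorus mg/kg, temperature code 1-4, soil code 1-2), Case Study 3
table = [
    [18.2, 18.4, 18.2, 18.5],
    [18.6, 18.9, 18.4, 18.1],
    [17.7, 18.0, 18.1, 17.8],
    [17.1, 17.4, 17.8, 17.5],
]
rows = [(v, t, s) for t, row in enumerate(table, 1)
        for v, s in zip(row, [1, 1, 2, 2])]


def mean(values):
    return sum(values) / len(values)


by_temp, by_soil, by_cell = defaultdict(list), defaultdict(list), defaultdict(list)
for value, temp, soil in rows:
    by_temp[temp].append(value)
    by_soil[soil].append(value)
    by_cell[(temp, soil)].append(value)

grand = mean([v for v, _, _ in rows])


def ss_of_means(groups):
    """Sum of squares of group means about the grand mean, weighted by size."""
    return sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())


ss_total = sum((v - grand) ** 2 for v, _, _ in rows)
ss_temp, ss_soil, ss_cells = map(ss_of_means, (by_temp, by_soil, by_cell))
ss_inter = ss_cells - ss_temp - ss_soil   # interaction
ss_error = ss_total - ss_cells            # within-cell (replicate) error

df_temp, df_soil = len(by_temp) - 1, len(by_soil) - 1
df_inter, df_error = df_temp * df_soil, len(rows) - len(by_cell)
ms_error = ss_error / df_error

f_temp = (ss_temp / df_temp) / ms_error
f_soil = (ss_soil / df_soil) / ms_error
f_inter = (ss_inter / df_inter) / ms_error
print(f"Temperature F = {f_temp:.2f}, Soil F = {f_soil:.3f}, "
      f"Interaction F = {f_inter:.2f}")
```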

## Case Study 4

| X (°C) | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Y (%) | 10 | 15 | 14 | 17 | 18 | 19 | 20 | 23 | 24 | 23 | 28 |

A study was made on the effect of temperature on the yield of a chemical process. The results are shown in the above table. Carry out a regression analysis of the data. Predict the % yield when the temperature is 22.5 °C.

## Linear Regression

The analysis can be carried out as follows:-

Stat => Regression => Regression. Enter the Y column in the response box and the X column in the predictors box. Click on options and in the ‘prediction intervals for new responses’ enter ‘22.5’ (note if you have more than one X for prediction you can enter them in a new column and put the column in this box).

The output gives you the model (the regression equation) and the values of the intercept (constant) and gradient (predictor), with statistical information on these parameters. A full ANOVA table is also shown. For full interpretation of this output you should consult the ‘Chemometrics2: Regression’ notes.

The t tests determine whether the gradient or the intercept are significantly non-zero. (Again, check the p values.)

The confidence intervals for the gradient and intercept can be determined as +/- t_{n-2, 0.05}*s_{b} and +/- t_{n-2, 0.05}*s_{a} respectively, where s_{b} and s_{a} are the standard deviations of the gradient and intercept. t is the critical t value for n-2 degrees of freedom (n = number of pairs of data) at the 0.05 significance level. This value can be obtained from t tables.
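The quantities in the regression output can be reproduced with the usual least-squares formulas. The following sketch fits y = a + b*x to the Case Study 4 data and computes s_{b}, s_{a} and the 95% confidence limits, assuming the critical t value for n - 2 = 9 degrees of freedom (2.262) is read from t tables. Symbols follow the convention a = intercept, b = gradient; the Python names are illustrative.

```python
import math

# Case Study 4: temperature (°C) and yield (%)
x = list(range(20, 31))
y = [10, 15, 14, 17, 18, 19, 20, 23, 24, 23, 28]

T_CRIT_9_DF = 2.262  # two-tailed 5% critical t for n - 2 = 9 degrees of freedom

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx            # gradient
a = y_bar - b * x_bar    # intercept

# Standard deviation of the residuals about the fitted line
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = math.sqrt(residual_ss / (n - 2))

s_b = s_yx / math.sqrt(sxx)                                 # SD of gradient
s_a = s_yx * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))  # SD of intercept

print(f"y = {a:.3f} + {b:.4f} x")
print(f"gradient:  {b:.4f} +/- {T_CRIT_9_DF * s_b:.4f}")
print(f"intercept: {a:.3f} +/- {T_CRIT_9_DF * s_a:.3f}")
```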

At the end of the output is the predicted Y when X = 22.5, along with the confidence (CI) and prediction (PI) intervals. The full meaning of these terms is explained in the regression notes.
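The difference between the two intervals can be seen from their formulas: the CI describes the uncertainty in the mean response at the new X, while the PI adds the scatter of a single new observation. A minimal sketch for X = 22.5, assuming the standard simple-regression formulas and a tabulated t value of 2.262 for 9 degrees of freedom (names are illustrative):

```python
import math

# Case Study 4: temperature (°C) and yield (%)
x = list(range(20, 31))
y = [10, 15, 14, 17, 18, 19, 20, 23, 24, 23, 28]

X_NEW = 22.5
T_CRIT_9_DF = 2.262  # two-tailed 5% critical t for n - 2 = 9 degrees of freedom

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
s_yx = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

y_hat = a + b * X_NEW
# Term shared by both intervals: grows as X_NEW moves away from the mean of x
leverage = 1 / n + (X_NEW - x_bar) ** 2 / sxx

ci_half = T_CRIT_9_DF * s_yx * math.sqrt(leverage)      # mean response
pi_half = T_CRIT_9_DF * s_yx * math.sqrt(1 + leverage)  # single new observation

print(f"Predicted yield at {X_NEW} °C: {y_hat:.2f} %")
print(f"95% CI: +/- {ci_half:.2f}   95% PI: +/- {pi_half:.2f}")
```

The prediction interval is always the wider of the two because of the extra 1 under the square root.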

## Case Study 5

| X concentration (mg/L) | 40 | 50 | 60 | 70 | 80 | 90 | 40 | 60 | 50 | 80 |
|---|---|---|---|---|---|---|---|---|---|---|
| Y colorimeter reading | 69 | 175 | 272 | 335 | 409 | 415 | 72 | 265 | 180 | 412 |

*The data above show the relationship between the amount of β-erythroidine in an aqueous solution and the colorimeter reading of turbidity. Carry out a regression analysis as above. Assess whether a linear model is appropriate for this data.*

- Carry out linear regression: Stat => Regression => Regression. Enter the columns for the X and Y data in the predictor and response boxes.
- To investigate whether a linear model is appropriate (i.e. a straight line fits the data better than a polynomial, exponential or logarithmic curve, or some other model) we carry out a ‘lack-of-fit’ test. Minitab has two of these. The first is the ‘pure error’ test, which is standard for testing ‘lack-of-fit’ but requires that at least some of the X values be replicated. The second test is non-standard but does give an indication even when there are no replicates. To use these tests click on ‘Options’ in the regression screen and click ‘pure error’ and ‘data subsetting’. The null hypothesis is that there is no curvature (i.e. a linear model is adequate), so a low p value is evidence of lack of fit.
- We can also examine the graph of the data and graphs of residuals. To get a regression plot of the fitted line select: Stat => Regression => Fitted Line Plot. Enter the columns for X and Y in the predictors and response boxes as above. Click on ‘Options’ and check ‘display confidence bands’ and ‘display prediction bands’.
- To display plots of the residuals, in the regression screen click on ‘Graphs’ and click on the ‘normal plot of residuals’ and ‘residuals vs fits’ boxes.
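The ‘pure error’ lack-of-fit test mentioned in the list above splits the residual sum of squares from the straight-line fit into a part due to scatter among replicates at the same X (pure error) and a remainder (lack of fit). The sketch below is a hand calculation in plain Python under the usual textbook definition of the test; the critical F for (4, 4) degrees of freedom at the 5% level (6.39) is from F tables, and the names are illustrative.

```python
from collections import defaultdict

# Case Study 5: concentration (mg/L) and colorimeter reading
x = [40, 50, 60, 70, 80, 90, 40, 60, 50, 80]
y = [69, 175, 272, 335, 409, 415, 72, 265, 180, 412]

F_CRIT_4_4 = 6.39  # 5% critical F for (4, 4) degrees of freedom, from F tables

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Group the readings by concentration to find the replicates
replicates = defaultdict(list)
for xi, yi in zip(x, y):
    replicates[xi].append(yi)

# Pure error: scatter of replicate readings about their own mean
ss_pure = sum((yi - sum(ys) / len(ys)) ** 2
              for ys in replicates.values() for yi in ys)
df_pure = n - len(replicates)

# Lack of fit: what is left of the residual SS after removing pure error
ss_lof = residual_ss - ss_pure
df_lof = len(replicates) - 2  # number of distinct X levels minus 2 parameters

f_lof = (ss_lof / df_lof) / (ss_pure / df_pure)
print(f"Lack-of-fit F({df_lof}, {df_pure}) = {f_lof:.1f}, "
      f"critical value {F_CRIT_4_4}")
```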

Examining graphs and plots of residuals can help determine whether there are any outliers. While a curve may be the best fit to all the data, one outlier can greatly affect this. Residuals should be randomly scattered about zero on the residuals plot, and the normal plot of the residuals should be linear (see Regression notes for further discussion).
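If you want to look at the residuals numerically as well as graphically, they can be listed directly from the fitted line. This is a minimal sketch in plain Python; it only tabulates the residuals and flags the largest one, and deciding whether a point is really an outlier is left to the plots and your judgement.

```python
# Case Study 5: concentration (mg/L) and colorimeter reading
x = [40, 50, 60, 70, 80, 90, 40, 60, 50, 80]
y = [69, 175, 272, 335, 409, 415, 72, 265, 180, 412]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Residual = observed reading minus the reading predicted by the line
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

for xi, yi, ri in sorted(zip(x, y, residuals)):
    print(f"x = {xi:3d}  y = {yi:3d}  residual = {ri:7.2f}")

# Index of the point lying furthest from the fitted line
worst = max(range(n), key=lambda i: abs(residuals[i]))
print(f"Largest |residual| is at x = {x[worst]}")
```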

### Exercise: ‘The Inverse Calibration Problem’

An unknown β-erythroidine solution gives a colorimeter reading of 402. What is the predicted concentration? What are the confidence limits for this prediction if (i) this was a single measurement, (ii) it was an average of several measurements?

This question is typical of the sort of problem frequently encountered in analytical calibrations. We cannot proceed as in case study 4 because we now wish to determine X from a known Y (the ‘inverse’ problem). Least squares analysis assumes the error is in the Y determinations. However the error in X determined from Y can be estimated from the standard deviation of the interpolated X_{0}:

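$$
s_{x_0} = \frac{s_{y/x}}{b}\left\{\frac{1}{m} + \frac{1}{n} + \frac{(y_0 - \bar{y})^2}{b^2\sum_i (x_i - \bar{x})^2}\right\}^{0.5}
$$

where s_{y/x} is the standard deviation of the residuals about the regression line, b is the gradient, m is the number of replicate readings averaged to give y_{0}, n is the number of calibration points, and x̄ and ȳ are the means of the calibration X and Y values.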

A spreadsheet has been set up to carry out this calculation. It can be found in s:\hons\data treatment\invcalib.xls, or on the data treatment site on the DLS.

Carry out the determination of the prediction and confidence intervals as follows:-

(i) Copy the (X,Y) calibration values for Case Study 5 from Minitab. Paste them into the invcalib spreadsheet (inverse calibration sheet) in the X and Y columns at the left.

(ii) Enter ‘402’ in the y_{0} cell (highlighted in green) and ‘1’ in the highlighted cell for ‘m’. This gives the predicted value for x and the CI if 402 is a single measurement.

(iii) Change the ‘m’ value to 5 and see what effect it has on the CI.
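The same calculation can be sketched outside the spreadsheet. The code below assumes the commonly quoted expression for the standard deviation of an interpolated concentration, s_x0 = (s_y/x / b) * sqrt(1/m + 1/n + (y0 - ybar)^2 / (b^2 * Sxx)), with m the number of replicate readings averaged to give y0; whether invcalib.xls uses exactly this form has not been checked here. The critical t value for n - 2 = 8 degrees of freedom (2.306) is from t tables, and the function name is illustrative.

```python
import math

# Case Study 5 calibration data: concentration (mg/L), colorimeter reading
x = [40, 50, 60, 70, 80, 90, 40, 60, 50, 80]
y = [69, 175, 272, 335, 409, 415, 72, 265, 180, 412]

Y0 = 402            # reading for the unknown solution
T_CRIT_8_DF = 2.306  # two-tailed 5% critical t for n - 2 = 8 degrees of freedom

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
s_yx = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = (Y0 - a) / b  # concentration read back from the calibration line


def s_x0(m):
    """Standard deviation of x0 when Y0 is the mean of m readings (assumed formula)."""
    term = 1 / m + 1 / n + (Y0 - y_bar) ** 2 / (b ** 2 * sxx)
    return (s_yx / b) * math.sqrt(term)


for m in (1, 5):
    print(f"m = {m}: x0 = {x0:.2f} +/- {T_CRIT_8_DF * s_x0(m):.2f} mg/L")
```

Increasing m shrinks only the 1/m term, so the interval narrows but does not vanish: the uncertainty of the calibration line itself remains.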

**Part B**

You are to carry out an evaluation of your project in terms of its data collection and treatment aspects. If you do not have a project yet, think of a ‘hypothetical’ project in your discipline area which might be carried out. This is to be presented as a brief summary, set out as follows.

**Project Overview**

Give the project title (including supervisor). State the aims of the project – what do you want to achieve? Why is the study being carried out?

**Define the response(s)**

What is being measured? List your types of responses. Are these responses qualitative or quantitative? If qualitative, can they be turned into quantitative responses (e.g. by giving a score or rating)? Are they discrete or continuous?

**Define the Factors**

- What factors (variables) affect your results (responses)?
- Rank the factors – known to influence, suspect to influence, unknown effect

- Divide the factors into controllable and uncontrollable

**Identify sources of error**

What are the sources of error in your study? How can they be minimised? You need to consider the effect of sampling – usually you cannot test the whole population so you want to take a sample of the population. How do you select the sample? How big should the sample be?

**Name………………………… Student Number ………………………**

### Case Study 1

| Basic Statistics | Analyst A | Analyst B |
|---|---|---|
| Mean | | |
| Standard Deviation | | |
| Variance | | |
| Confidence Interval (95%) | | |

Which analyst is the most precise? ……………….. Reason?…………………..

Which Analyst is the most accurate?………………. Reason? …………………

| Test | Null Hypothesis | p | Significant? |
|---|---|---|---|
| A differs from standard? | | | |
| B differs from standard? | | | |
| A and B differ from each other? | | | |

### Case Study 2

H_{0} …………………………………… H_{1} ………………………………………..

p ……………… significant? ……………………………………

Does the diagram indicate anything further about the students? …………………………….

### Case Study 3

| | F | p | Significant? |
|---|---|---|---|
| Temperatures | | | |
| Soils | | | |
| Interactions | | | |

### Case Study 4

| Quantity | Value |
|---|---|
| Predicted equation (model) | |
| p for hypothesis (b = 0) | |
| Significant? i.e. is the gradient non-zero? | |
| Standard deviation of slope (s_{b}) | |
| t (from tables) | |
| Confidence interval for b (+/- t*s_{b}) | |
| Predicted yield for 22.5 °C | |
| Confidence interval | |
| Prediction interval | |

### Case Study 5

Model (equation) ……………………………………………….

Lack-of-fit significant? …………………………………………

Check the plot of the data and the residuals. Is there any evidence of an outlier? What further treatment of the data would you suggest?

……………………………………………………………………………………………

……………………………………………………………………………………………

**‘Inverse Calibration problem’**

| Predicted x | Predicted CI (m=1) | Predicted CI (m=5) |
|---|---|---|
| | | |