Introduction
Estimates prepared from the Samples of Anonymised Records are based on a sample of the 1991 Census data. They are estimates of the actual figures that would have been obtained from a complete enumeration of all residents. These estimates are expected to be different from complete figures because they are subject both to sampling errors and non-sampling errors. They will not necessarily be the same as those published in census reports. This section of the user guide discusses sampling and non-sampling errors in some detail and suggests how the user should assess these errors in practice. The advice can be summarised as follows:
Sampling error
Because the SARs are based on a random sample, estimates based on them may differ somewhat from the figures that would be obtained from processing all the census records; they may also differ from the estimate that would have been obtained from processing a different sample of the same size drawn in the same way from the census records.
Comparisons of SARs and population data
For a limited number of variables it has been possible to compare the estimate from the SARs and the value that is given by all census records that the SARs are drawn from. For Great Britain these comparisons are given in the table below. Throughout the table, the population base includes present and absent residents but excludes visitors, imputed absent households and residents in imputed absent households. As one might expect with such large samples, the SARs closely represent the population from which they were drawn. Conversely, the smaller the size of the sample the greater the tendency of estimates to differ from corresponding values for the entire population. Consequently, estimates derived from the SARs for sub-groups of the population or single SAR areas will tend to deviate more from the 100 per cent statistics.
The deviation of a sample estimate from the census value is called the sampling error. The standard error of a sample estimate is a measure of the variation (the standard deviation) of the sampling error across all the possible samples and thus is a measure of the precision with which an estimate from a particular sample approximates the census value.
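The link between sample size and precision can be illustrated with a minimal sketch. It assumes simple random sampling and uses the percentage formula given later in this guide under "Calculation of standard errors". The function name `se_percentage` and the three sample sizes are illustrative assumptions; only the 13.1 per cent figure comes from Table 1.

```python
import math


def se_percentage(pe: float, n: int) -> float:
    """Unadjusted standard error of a sample percentage (simple random sampling)."""
    return math.sqrt(pe * (100.0 - pe) / n)


# 13.1 per cent is the Table 1 figure for residents with a limiting
# long-term illness; the sample sizes below are hypothetical.
pe = 13.1
for n in (1_000_000, 10_000, 100):
    print(f"n = {n:>9,}: SE = {se_percentage(pe, n):.2f} percentage points")
```

Each hundredfold reduction in the sample size multiplies the standard error by ten, which is why estimates for small sub-groups or single SAR areas are less precise.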
Table 1: Characteristics of the SARs and the population from which they were drawn. Great Britain, per cent.
Individual Characteristics
| Characteristic | Individual SAR (% of all residents) | Census population (% of all residents) | Individual SAR (communal establishments) | Census population (communal establishments) |
| Male | 48.4 | 48.4 | 41.7 | 41.2 |
| Female | 51.6 | 51.6 | 58.3 | 58.8 |
| Age 0-15 | 20.2 | 20.2 | 3.4 | 3.7 |
| Age 16-17 | 2.5 | 2.5 | 1.2 | 1.3 |
| Age 18-29 | 18.1 | 18.1 | 21.6 | 21.3 |
| Age 30-44 | 21.3 | 21.2 | 9.9 | 10.1 |
| Age 45 up to pensionable age | 19.3 | 19.3 | 8.6 | 8.9 |
| Pensionable age | 18.7 | 18.7 | 55.2 | 54.7 |
| Single | 41.0 | 41.1 | 47.3 | 48.0 |
| Married | 46.9 | 46.8 | 12.5 | 12.7 |
| Widowed/Divorced | 12.1 | 12.1 | 40.1 | 39.3 |
| With llt illness | 13.1 | 13.1 | 63.5 | 63.3 |
| In employment | 44.1 | 44.3 | 21.6 | 22.1 |
| Unemployed | 4.6 | 4.5 | 3.7 | 3.6 |
| Economically inactive | 31.2 | 31.1 | 71.3 | 70.6 |
| White | 94.6 | 94.6 | 94.3 | 94.5 |
| Other ethnic groups | 5.4 | 5.4 | 5.7 | 5.5 |
Household Characteristics
| Characteristic | Individual SAR (% of residents in households) | Census population (% of residents in households) | Household SAR (% of households) | Census population (% of households) |
| One person in household | 10.6 | 10.6 | 26.3 | 26.3 |
| Owner occupied | 69.9 | 70.0 | 66.4 | 66.7 |
| Rented privately (exc with job) | 5.5 | 5.5 | 7.2 | 6.9 |
| Rented from a housing association | 2.4 | 2.4 | 3.2 | 3.1 |
| Rented from a local authority, new town, or Scottish Homes | 20.0 | 20.0 | 21.3 | 21.4 |
| Lacking or sharing use of a bath/shower and/or inside WC | 0.74 | 0.75 | 1.3 | 1.2 |
| No central heating | 16.8 | 16.8 | 18.8 | 18.8 |
| No car | 24.9 | 24.9 | 33.3 | 33.1 |
| Lone parent | n/a | 4.15 | 3.7 | 3.7 |
Sources. Individual SAR, Household SAR, LBS (Tables 18 and 19 for imputed households, deducted from equivalent cells for 100 per cent data in other LBS tables). The base in each case excludes imputed households and residents in them. Crown Copyright.
The sample estimate and its estimated standard error permit the construction of interval estimates with prescribed confidence that the interval includes the true population value.
Non-Sampling error
In addition to the variability which arises from the sampling procedures, both sample data and the full census data are subject to non-sampling error. Non-sampling error may be introduced during any of the complex operations used to collect and process census data.
Non-sampling error may affect the data in two ways. Errors that are introduced randomly will increase the variability of the data, and should, therefore, be reflected in the standard error discussed below. Errors that tend to be consistent in one direction will make both sample and 100 per cent data biased in that direction. For example, if respondents consistently tend to under-report the number of cars available to their household, then the resulting counts of households by number of cars will tend to be understated for the multi-car households and overstated for the no-car households. Such biases are not reflected in the standard error.
Sources of non-sampling error include:
a) Quality of response
Respondents to the census may misinterpret census questions or for other reasons complete the census form incorrectly. The census form requests that the head or joint heads of the household, or other adult over 16, completes the form on behalf of all members of the household. The Census Validation Survey (CVS) carried out by OPCS shortly after the 1991 Census assessed the quality of responses to the census.
b) Incomplete coverage of the census
Every census misses some people who are particularly difficult to enumerate,
in spite of the thorough census field procedures designed to enumerate
the entire population (evaluated in Clark 1992). The age-structure of
those missed by the census has also been estimated and is significantly
different from the age-structure of the population as a whole.
c) Transcription and coding errors, missing data items
During the processing of census forms, transcription and coding errors
can occur. Missing items for persons on a completed census form are imputed
(estimated) by the Census Offices. Corrections are made to some inconsistent
data, such as persons reported married but aged under 16. Mills and Teague
(1991) provide a description of the processing of census forms and the
imputation of these types of missing or inconsistent data.
d) Data Modification to ensure confidentiality
In the 100 per cent tabular output of Local Base Statistics and Small
Area Statistics for areas within local authorities, an additional source
of error was purposefully introduced by Census Offices to provide additional
protection against the identification of individuals (Census User Guide
48; Cole 1993). Counts in some cells of the tabulations are slightly adjusted;
the cells that are adjusted are not known to the user. However, no such
adjustment is made to the sample data in the 10 per cent tabular output
or in the SARs. Other methods, described in Section 1.4, reduce the already
negligible risk that individuals can be identified from records in the
SARs.
Standard Errors and Confidence Intervals
The complex sampling design described above has implications for the estimation of sampling errors on both the individual and household files. The 1 per cent household SAR approximates to a simple stratified random sample of households, although counts of individuals in the household file are subject to the effects of clustering. In the 2 per cent individual file there are two potential sources of clustering which arise in the sampling process. First, individuals are clustered into households in the selection of the 10 per cent sample; second, the removal of the household SAR from the 10 per cent sample implies a further clustering into households (Dale and Marsh, 1993). Nonetheless, preliminary work suggests that the 2 per cent SAR approximates to a simple random sample.
Calculating standard errors and confidence intervals
The method described here for estimating standard errors of estimates
from the SARs involves two simple stages. The first stage calculates the
unadjusted standard error, using formulae that apply to simple random
samples. The second stage multiplies the unadjusted standard error by
an appropriate design factor. This is the factor by which sampling errors
must be multiplied in order to compensate for the effect of clustering
or stratification in the sampling process. The design factor approximates
the ratio of the standard error from the actual sample design to the standard
error from a simple random sample. In practice the steps are:
1. Calculate the unadjusted standard error from the appropriate formula
at (ii) below.
2. Multiply the unadjusted standard error from step 1 by the design factor
appropriate to the characteristic (e.g. unemployment status, or age).
The design factor that should be applied may be more or less than 1.0. If there is stratification in the sampling process the sample should be more representative than a simple random sample and the design factor will be less than one. Clustering will cause sampling errors to be larger than those found with simple random sampling and the design factor will be greater than one.
Preliminary estimations of design factors have been made using two different methods, the first using sampling point information (for the household file); the second comparing differences between expected and observed errors (for the individual file).
Design factors for household characteristics from the 1 per cent household SAR
Assumption of a design factor of 1.0 (i.e. using the unadjusted standard errors as if the sample was a simple random one) is unlikely to be far wrong when using household characteristics from the 1 per cent SAR. At worst, a slight over-estimate of the sampling error may result, as household level variables are subject to stratification effects and estimated design factors (including those relating to particular members of the household, for example the social class of the head of household) are slightly less than unity.
Design factors for individual characteristics from the 1 per cent household SAR
For analyses of individual characteristics from the 1 per cent household SAR, assuming simple random sampling may be misleading because clustering effects mean that sampling errors may be seriously under-estimated. This is because this SAR includes all individuals in each sampled household and for variables such as ethnic group, country of birth, migrants, qualifications and social class, there is a tendency for individuals in the same household to have similar characteristics. The effect of household clustering could probably be ignored however for estimates of subgroups of which there is usually no more than one person per household, such as women aged over 80.
The largest effects are for ethnic group. Preliminary estimates are as follows:
| Ethnic Group | Design factor |
| White | 1.84 |
| Black Caribbean | 1.60 |
| Black African | 1.83 |
| Black Other | 1.51 |
| Indian | 1.99 |
| Pakistani | 2.27 |
| Bangladeshi | 2.37 |
| Chinese | 1.87 |
| Other Asian | 1.83 |
| Other Other | 1.60 |
Design Factors in the 2 per cent Individual SAR
Design factors estimated for the individual SAR are based on a comparison
of the difference between the SARs and 100 per cent Census data across
the 278 SAR areas (having subtracted residents in wholly imputed households)
and the sampling errors which would be expected from simple random sampling.
The method is described in more detail in CMU Occasional Paper 2.
Individual and household level variables on the individual file are less likely to be subject to clustering and may benefit from stratification. Most design factors deviated very little from unity, many being less than one. Again the largest design factors are for ethnic group, though the effects are much smaller than on the household file.
| Ethnic Origin | Design factor |
| White | 1.15 |
| Black Caribbean | 1.00 |
| Black African | 1.06 |
| Black Other | 1.04 |
| Indian | 1.26 |
| Pakistani | 1.20 |
| Bangladeshi | 1.04 |
| Chinese | 1.19 |
| Other Asian | 1.02 |
| Other Other | 1.30 |
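The second stage of the procedure, multiplying an unadjusted standard error by a design factor, can be sketched as a simple lookup. The values are the preliminary ethnic group design factors from the two tables above; the dictionary layout, the keys `"household"` and `"individual"`, and the function name `adjusted_se` are assumptions made for illustration only.

```python
# Preliminary design factors for ethnic group, copied from the two tables above.
DESIGN_FACTORS = {
    # Individual characteristics from the 1 per cent household SAR
    "household": {
        "White": 1.84, "Black Caribbean": 1.60, "Black African": 1.83,
        "Black Other": 1.51, "Indian": 1.99, "Pakistani": 2.27,
        "Bangladeshi": 2.37, "Chinese": 1.87, "Other Asian": 1.83,
        "Other Other": 1.60,
    },
    # 2 per cent individual SAR
    "individual": {
        "White": 1.15, "Black Caribbean": 1.00, "Black African": 1.06,
        "Black Other": 1.04, "Indian": 1.26, "Pakistani": 1.20,
        "Bangladeshi": 1.04, "Chinese": 1.19, "Other Asian": 1.02,
        "Other Other": 1.30,
    },
}


def adjusted_se(unadjusted_se: float, sar_file: str, ethnic_group: str) -> float:
    """Multiply a simple-random-sampling standard error by its design factor."""
    return unadjusted_se * DESIGN_FACTORS[sar_file][ethnic_group]


# A hypothetical unadjusted standard error of 1.0 shows the relative effect.
for sar_file in ("household", "individual"):
    print(sar_file, adjusted_se(1.0, sar_file, "Pakistani"))
```

The same unadjusted standard error is inflated far more on the household file than on the individual file, which reflects the stronger household clustering described above.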
Having calculated the standard error for a SAR estimate, it will often be appropriate to go on to calculate a confidence interval for the estimate. These are discussed in section (iii).
Users should also read the notes in section (iv) which give further advice on the use of standard errors. Worked examples are given throughout this discussion of standard errors and their use. More details on estimated design factors are available on request from the CMU.
Generally, use of the 2 per cent individual SAR will minimise sampling errors for individual level analyses whilst the household file is the most appropriate for the analysis of household characteristics.
Calculation of standard errors
The means of calculating the unadjusted standard errors for four common
statistics are given here. The derivations can be found in many statistics
textbooks and most statistical software will calculate them as part of
their standard output.
| Statistic | Value | Approximate standard error |
| Sample cell count | c | SE(c)=sqrt(c(N-c)/N) |
| Scaled cell count | C=f*c | SE(C)=f*SE(c) |
| Sample cell proportion | Pr=c/n | SE(Pr)=sqrt(Pr(1-Pr)/n) |
| Sample cell percentage | Pe=100*c/n | SE(Pe)=sqrt(Pe(100-Pe)/n) |
Examples of the statistics, and definitions:
c the number of non-white textile workers in Yorkshire and Humberside
region.
C=f*c that number scaled to the total census enumerated population. In
this case f is 50 or 100 for the individual or household SAR respectively.
N the total number of records in the SAR in the Yorkshire and Humberside
region, irrespective of industry or ethnic group. In general, N is the
total number of records in the SAR for the area concerned; where a characteristic
of the population in communal establishments is being counted in the individual
SAR, N is the total number of records from communal establishments in
the area concerned. Where N is very large compared to c (N more than 30
times c), the formula can be replaced by the approximation SE(c)=sqrt(c)
and SE(C)=f*sqrt(c).
Pr The number of non-white textile workers in the region (c) as a proportion
of all non-whites in employment in the region (n).
Pe The number of non-white textile workers in the region (c) as a percentage
of all non-whites in employment in the region (n).
The standard error of the SAR statistic is then derived by multiplying the unadjusted standard error from these formulae by the appropriate design factor.
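The four formulae in the table can be written out as short functions. This is a minimal sketch under the simple random sampling assumption; the function names and the counts used in the demonstration are hypothetical and are not taken from the SARs.

```python
import math


def se_count(c: float, N: float) -> float:
    """SE of a sample cell count c out of N records in the area."""
    return math.sqrt(c * (N - c) / N)


def se_scaled_count(c: float, N: float, f: float) -> float:
    """SE of a count scaled by f (50 for the individual SAR, 100 for the household SAR)."""
    return f * se_count(c, N)


def se_proportion(c: float, n: float) -> float:
    """SE of the proportion Pr = c / n."""
    pr = c / n
    return math.sqrt(pr * (1.0 - pr) / n)


def se_percentage(c: float, n: float) -> float:
    """SE of the percentage Pe = 100 * c / n."""
    pe = 100.0 * c / n
    return math.sqrt(pe * (100.0 - pe) / n)


# Hypothetical counts: c cases, n records in the sub-group, N records in the area.
c, n, N, f = 40, 500, 100_000, 50
design_factor = 1.26  # e.g. Indian ethnic group on the individual file

print("count:     ", se_count(c, N) * design_factor)
print("scaled:    ", se_scaled_count(c, N, f) * design_factor)
print("proportion:", se_proportion(c, n) * design_factor)
print("percentage:", se_percentage(c, n) * design_factor)
```

The percentage result is exactly 100 times the proportion result, as the two formulae imply.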
Examples of calculation of standard errors
(a) The percentage of the population of Newham who are of Indian ethnic origin. The percentage of the sample who are Indian in Newham is 13.8 per cent. With a sample base of n = 1,000, the unadjusted standard error is:
Unadjusted SE(Pe) = sqrt(13.8*(100-13.8)/1000) = 1.09
The estimated design factor for Indians on the individual file is 1.26. The standard error for this SAR percentage is therefore
Standard error (Pe) = 1.09*1.26 = 1.37
(b) The number of renting households in Britain with a person under pensionable age having a limiting long-term illness, from the household SAR, scaled to a total for all households enumerated in the census.
If the total number of such households in the household SAR is 289, it is scaled by 100 (the household SAR sampling fraction) to estimate a total in Britain of 28,900 such households. There are 215,789 household records in the household SAR in all, so the unadjusted standard error of the estimate of 28,900 is:
Unadjusted SE(C) = 100*sqrt(289*(215,789-289)/215,789) = 1,699
Note that the number c=289 is very small compared to the overall number of records N=215,789, so a very similar result is obtained using the approximation SE(C)=f*sqrt(c) given above:
Unadjusted SE(C) = 100*sqrt(289) = 1,700
From the discussion in the previous section, the design factor for household characteristics from the household SAR may be taken to be 1.0, so in this case the standard error requires no further adjustment.
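Example (b) can be checked with a few lines of code. This sketch uses only the figures quoted in the example; the variable names are illustrative, and the final check simply restates the rule that the approximation applies when N is more than 30 times c.

```python
import math

c = 289              # households of this type in the household SAR
N = 215_789          # all household records in the household SAR
f = 100              # scaling factor for the 1 per cent household SAR
design_factor = 1.0  # household characteristics on the household SAR

exact = f * math.sqrt(c * (N - c) / N) * design_factor
approx = f * math.sqrt(c) * design_factor
approximation_applies = N > 30 * c

print(f"scaled estimate:   {f * c:,}")
print(f"exact formula:     {exact:,.0f}")
print(f"sqrt(c) shortcut:  {approx:,.0f}")
print(f"N > 30c:           {approximation_applies}")
```

The two results differ by about one household in 1,700, so the shortcut loses very little precision here.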
5.4.6 Confidence intervals and inferences based on the SARs
A sample estimate and its estimated standard error may be used to construct
confidence intervals around the estimate. These intervals are ranges that
will contain the true population value of the estimated characteristic,
with a known probability.
For example:
1. With approximately 68 per cent probability, the interval from one standard error below the estimate to one standard error above the estimate contains the true value.
2. With approximately 90 per cent probability, the interval from 1.6 standard errors below the estimate to 1.6 standard errors above the estimate contains the true value.
3. With approximately 95 per cent probability, the interval from two standard errors below the estimate to two standard errors above the estimate contains the true value.
The intervals are referred to as 68 per cent, 90 per cent, and 95 per cent confidence intervals, respectively.
Example
Using the earlier example, the standard error of the 28,900 households in Britain with someone below pensionable age with a limiting long-term illness was estimated to be 1,699. Thus a 95 per cent confidence interval for this estimated total is:
(28,900 - 2*1,699) to (28,900 + 2*1,699), or 25,502 to 32,298.
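The construction of these intervals can be sketched as a small function. The multipliers are the approximate values quoted in this section (1, 1.6 and 2); the function name `confidence_interval` is an assumption, and the demonstration uses the approximate standard error of 1,700 from the sqrt(c) shortcut rather than the exact figure.

```python
# Approximate multipliers quoted in the text for each confidence level.
MULTIPLIERS = {68: 1.0, 90: 1.6, 95: 2.0}


def confidence_interval(estimate: float, standard_error: float, level: int = 95):
    """Return (lower, upper) bounds of the interval at the given confidence level."""
    k = MULTIPLIERS[level]
    return estimate - k * standard_error, estimate + k * standard_error


# Household example: estimate 28,900 with an approximate standard error of 1,700.
for level in (68, 90, 95):
    low, high = confidence_interval(28_900, 1_700, level)
    print(f"{level} per cent: {low:,.0f} to {high:,.0f}")
```

With the shortcut standard error, the 95 per cent interval is 25,500 to 32,300, which is close to the interval in the example above.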
5.4.7 Other notes on standard errors
A standard sampling theory text or the explanatory guide to the user's
statistical software should be helpful if the user needs more information
about confidence intervals and non-sampling errors. These should be consulted
for details of standard errors for sums, differences and ratios of estimates
from the SARs.
Zero estimates
When the proportion, percentage, or cell count is zero, the formulae in section (ii) above give estimated standard errors of zero. While the magnitude of the error is difficult to quantify, estimated percentages and totals of zero are still subject to error.
The effect of non-sampling error on the standard errors and inference using confidence intervals
The estimated standard errors given above do not include the variation due to non-sampling error that may be present in the data. The standard errors reflect the effect of simple response variability, but not the effect of systematic errors introduced by enumerators, coders, or other field processes. As a result, confidence intervals formed using these estimated standard errors may not meet the stated levels of confidence in estimating the true population value of a characteristic. One of the most important sources of error that might additionally affect the accuracy of confidence intervals is bias arising from missing records, discussed below.
Quality Of Census Responses
One characteristic of the SARs is that the accuracy of responses contained
in them is determined almost wholly by the accuracy of the responses given
by residents themselves. There has been no data modification or perturbation
and imputed records for wholly absent households have not been included
in the SARs; only data that was missing from a returned form has been
imputed, by the 'hot deck' procedures described in Mills and Teague (1991),
for the 100 per cent coded variables.
A check on the quality of responses in the Census was one of the aims of the Census Validation Survey (CVS) carried out very soon after the 1991 Census (Heady et al, 1996). The CVS, which seeks to establish the quality of both responses to, and coverage of, the Census, is based on a sample of around 6000 households in over 1200 enumeration districts, and was administered by means of individual interviews held between six weeks and three months after Census day (Wiggins, 1993).
Ethnic group
The ethnic group question was newly introduced in the 1991 Census. After extensive testing in the field, it was decided to use a question which gave form fillers nine possible categories from which to choose, two of which asked for more detailed information to be supplied. The number of ethnic minority households in the CVS sample was not sufficient to justify individual analysis of all nine categories and so four aggregate codes were created: white, black (combining black Caribbean, black African and black 'other'), Indian sub-continent (Indian, Pakistani and Bangladeshi) and other (Chinese and 'any other ethnic group'). The gross error rate was only 0.8 per cent. However, this figure should be treated with caution given that the vast majority of answers were in just one category (white). If those who answered 'white' in both the Census and CVS are excluded, the gross error rate was 13.2 per cent. It was found that 21 per cent of those coded as 'other' in the Census either described themselves as white or in one of the black categories in the CVS. Conversely, 9 per cent of those coded 'black' in the Census described themselves as 'other' in the CVS. Overall, 6.1 per cent of people in households who replied in both the Census and CVS were in the non-white ethnic group in the CVS, compared with 5.8 per cent of the same people according to their replies in the Census (Heady et al, 1996).
Last updated 25 October 2004