The Cathie Marsh Centre for Census and Survey Research

SOCIOECONOMIC DIFFERENTIALS IN NON-RESPONSE: A PROCEDURE FOR ADJUSTING CENSUS DATA

1 Introduction

Once in each decade, the national census provides statistics from an enumeration of residents in every local area of Britain. The tabulated output of local statistics is used very widely. In every census, some residents are missed. The impact for users depends on the size of the non-response but also on whether those missed have different characteristics from those counted.

This paper describes how all census tabular output for 1991 can be adjusted using estimates of the characteristics of non-respondents. The method may also be of use in 2001. Within the current project, the method will be used to quantify the impact of census non-response on selected applications in social policy and social research.

1.1 The problem

Given

  1. Census counts,
  2. Adjustment factors for non-response specific to age, sex and locality, which relate the census counts to the full population, and
  3. Adjustment factor differentials for socio-economic categories which describe non-response over and above that accounted for by age, sex and locality

Then

A procedure is required to derive adjustments to any census output. The results are to be consistent with the adjustments made by age, sex and locality.

Non-response here refers to residents not captured in the Census enumeration and imputation procedures, not to item non-response.

 

1.2 The solution summarised

A procedure and the data sources for it are specified below for the 1991 Census of Great Britain, for use in the ESRC project Quantifying the impact of census non-response on social policy and social research. The procedure will allow any Local Base Statistics (LBS) or Small Area Statistics (SAS) to be adjusted.

The procedure assumes that the data at items 2 and 3 of section 1.1 describe the characteristics of non-response completely. The task is then to disaggregate the given census counts by age, sex, locality and the socio-economic categories of item 3, and to apply appropriate adjustment factors to each disaggregated Census count.

Since LBS/SAS data are not usually available disaggregated in such detail, the procedure requires estimation of missing data as follows:

  1. Estimation of census counts with age-sex-locality-socio-economic detail by applying relationships from the SARs wherever they are missing in the LBS/SAS.
  2. Estimation of non-response factors with age-sex-locality-socio-economic detail that give results consistent with the age-sex-locality factors already estimated.

The procedure is generally applicable, but is deterministic - no estimates of reliability are made. Reliability will be assessed as a separate part of the project.

It will be used initially to adjust 1991 Census calculations of government grant to local authorities (SSA). At relevant places an example is specified which will be used to aid the practical programming of the procedure during February 1997.

The procedure can be adapted for use in other settings, including the 2001 Census, to which reference is made at appropriate points.

 

1.3 The structure of the paper

Definitions of the terms used are followed by a specification of the procedure for adjustment of census data counts. The data sources required to run the procedure with 1991 GB Census counts are then described, followed by notes on computation of the required data. Some issues are raised in a final section.

Readers of this working paper are asked to question and comment on the procedure, to identify perceived strengths and weaknesses within it, and to give such feedback to the author.

 

2 Definitions and notation

a, s, and l index the age (five year groups to 80-84 and 85+), sex (male, female), and locality (flexible down to Enumeration District and Output Area, but for our example Wards in England and postal sectors in Scotland). Together they define the population groups {asl} for which non-response has already been estimated.

i=1,…,n indexes the socio-economic categories across which census non-response differs.

Example: rented/owner-occupied tenure.

Note: a, s, l, i, together define the categories of people which show different rates of non-response. Within each census category asli, the likelihood of non-response is the same for all people.

A, S, L, I will refer to broader categorisations of age, sex, locality and socio-economic group which are found in some Census output, eg under 16 rather than 5-year age groups.

f is the non-response factor, the proportion of the enumerated count which must be added to reach the full population count. For example, over all Britain the non-response factor was 0.02 in the 1991 Census, and the fasl have been estimated for 1991 by ONS (l = Districts) and by Estimating with Confidence (l = EDs/OAs/wards/postal sectors).

ui is the non-response factor differential in category i. These may or may not be known separately for each a, s, l and in what follows are referred to simply as {ui}. There is no loss of generality: the procedure simply substitutes the relevant uasli in place of ui in what follows, if these are known or estimated.

Only the differentials are needed, ie ui/uj represents the ratio of non-response factors in categories i and j. The {ui} may thus be scaled for convenience, and the precise non-response factors need not be known.

Example: {u1, u2} = {3, 1} states that people in rented accommodation are three times as under-enumerated as people in owner-occupied accommodation. The population/census ratios for the two categories could be 1.03 and 1.01, or 1.003 and 1.001, or 1.6 and 1.2, and so on, depending on age, sex, and locality.

R represents residents

X represents any Census characteristic

C represents the enumerated count. Thus RCasli is the Census count of residents of age a, sex s, location l and socio-economic category i. XCasli is the Census count for characteristic X within residents of age a, sex s, location l and socio-economic category i.

P represents the estimated complete population. Thus RPasli is the complete population of age a, sex s, location l and socio-economic category i. XP is the complete population with characteristic X, summed over a, s, l and i; it is this which we wish to estimate.

Note that f=(RP / RC) - 1.

Example: XP is the number of children of lone parents in each District of England, a variable in the SSA formulae which is currently estimated by XC.

 

3 Procedure

The non-response factor differentials {ui} may sometimes be assumed to be independent of all or some of the population groups {asl}, particularly locality l, for lack of detailed information. The level of adjustment fasli will nonetheless be specific to each population group, in order to be consistent with RPasl - RCasl already estimated.

fasli is calculated by scaling the ui:

(1)
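Written out from the definitions in section 2, the scaling can take the following form. This is a reconstruction under the constraint stated above (that the residents added across categories sum to the undercount already estimated for the population group), not the original typesetting. R^{C}_{asli} and R^{P}_{asl} stand for RCasli and RPasl, and k_{asl} is a symbol introduced only for this sketch.

```latex
f_{asli} = k_{asl}\, u_i ,
\qquad
k_{asl} = \frac{R^{P}_{asl} - R^{C}_{asl}}{\sum_{j=1}^{n} u_j\, R^{C}_{aslj}} ,
\qquad\text{so that}\qquad
\sum_{i=1}^{n} f_{asli}\, R^{C}_{asli} = R^{P}_{asl} - R^{C}_{asl} .
```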

The complete-population estimate XP corresponding to any census enumeration XC may now be calculated:

(2)
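A form of the adjustment consistent with the definition f = (RP / RC) - 1 and with the assumption that non-response is the same for all people within a category asli is sketched below. It is a reconstruction from those definitions rather than the original typesetting; X^{C}_{asli} stands for XCasli.

```latex
X^{P} = \sum_{a,s,l,i} X^{C}_{asli}\,\bigl(1 + f_{asli}\bigr)
```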

Note that the characteristics of the local area determine the amount of non-response added in each socio-economic category. Thus if an area contains few residents of category i then the procedure will add few extra such residents even if its non-response differential is relatively high and even if it dominates the non-response in other areas or nationally.
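The effect of local composition can be illustrated with a minimal sketch for a single population group asl. It assumes the factors are scaled as f = k * u, with k chosen so that the added residents equal the undercount already estimated. The function names and all figures are hypothetical and are not taken from the Census.

```python
def scale_factors(u, rc, rp_total):
    """Return the factors f_i = k * u_i for one population group asl.

    u        -- differentials u_i (only their ratios matter)
    rc       -- enumerated residents RC_asli, one value per category i
    rp_total -- estimated complete population RP_asl for the whole group
    """
    undercount = rp_total - sum(rc)
    weighted = sum(ui * ci for ui, ci in zip(u, rc))
    k = undercount / weighted          # assumes at least one non-empty category
    return [k * ui for ui in u]


def adjust(xc, f):
    """Adjusted count of characteristic X: sum of XC_asli * (1 + f_asli)."""
    return sum(x * (1 + fi) for x, fi in zip(xc, f))


# Made-up figures: categories are (rented, owner-occupied), undercount is 28 in both areas.
u = [3, 1]
mixed = scale_factors(u, [200, 800], 1028)        # k = 28 / 1400 = 0.02
few_renters = scale_factors(u, [20, 980], 1028)   # k = 28 / 1040

added_mixed = [c * f for c, f in zip([200, 800], mixed)]       # about 12 and 16
added_few = [c * f for c, f in zip([20, 980], few_renters)]    # about 1.6 and 26.4
```

In both areas the added residents total 28, but the area with few renters gains fewer than two extra renters despite the differential of 3.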

 

4 Data sources

4.1 {ui} Non-response differentials.

Ideally, and after the 2001 Census, these may be estimated directly from a national Census Coverage Survey and other validation procedures. In 1991 the validation of the Census did not give rise to reliable direct estimates of socio-economic differentials in non-response.

For 1991 in this ESRC project’s work, the categories will be hypothesised after a literature search including the 1991 GB Census Validation Survey and other UK Census studies, reports from the Australian, USA and Canadian Census offices, and national UK surveys.

Household characteristics (of missed individuals and missed households) as well as individual characteristics will be considered in the review (but see 6.4).

A range of plausible hypotheses will give rise to several sets of {ui}. These sets of {ui} may differ in the number of categories n and/or in the values of ui for each i. The choice of categories will be only slightly constrained by the necessity to find the same categories in the LBS and SAR datasets in order to implement the procedure described above.

 

4.2 {RPasl} Estimated complete population in a population group

Current plans of the One Number Census project group for the 2001 Census include estimating regional non-response by age and sex by combining demographic and survey validation methods, and then using regression modelling to estimate non-response by age and sex in smaller localities, providing the RPasl.

For 1991, the work of the Estimating with Confidence project will be used. Its EWCPOP files are located on the MIDAS system at Manchester University; the ‘a’ file holds both the Census enumeration and the adjustments for non-response for each ward (England and Wales) and postal sector (Scotland), for five-year age groups, for males and females separately. EWCPOP also allows Enumeration District estimates, but these are considerably less reliable. EWCPOP figures are fully consistent with (ie they add to) those made for local authority Districts by staff of the Office for National Statistics.

Example: use the EWCPOP files for LBS wards (England, Wales) and postal sectors (Scotland).

 

4.3 {RCasli}, {XCasli} The census enumeration of residents and of residents with a specific characteristic X, for each socio-economic category i within each population group asl.

The 1991 Census LBS/SAS contain for every locality l a detailed age-sex tabulation (tables L02, S02) which is additionally disaggregated in other tables by marital status, economic position, ethnic group and limiting long-term illness.

However, whenever either X or {i} involves variables other than these four, as will be usual, then the Census counts required will be estimated by disaggregating available counts in the LBS/SAS according to the full relationship observable in the Samples of Anonymised Records (SARs) at national level.

In general,

(3)

where RCASli is the LBS/SAS count for locality l and socio-economic group i with age-sex group AS which is larger than and contains as.

(4)

where XCASlI is the LBS/SAS count for locality l with age-sex-socio-economic group ASI which is larger than and contains asi.
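Following the descriptions given for (3) and (4), in which a local LBS/SAS count for a broad group is multiplied by a national SAR ratio, the two estimates can be written as below. This is a reconstruction, and the superscript SAR for national SAR counts is notation assumed here, not defined elsewhere in this paper.

```latex
R^{C}_{asli} \approx R^{C}_{ASli} \times \frac{R^{\mathrm{SAR}}_{asi}}{R^{\mathrm{SAR}}_{ASi}}
\qquad (3)

X^{C}_{asli} \approx X^{C}_{ASlI} \times \frac{X^{\mathrm{SAR}}_{asi}}{X^{\mathrm{SAR}}_{ASI}}
\qquad (4)
```

Here R^{SAR}_{ASi} is the sum of R^{SAR}_{asi} over the fine age-sex groups as contained in AS, and X^{SAR}_{ASI} is the corresponding sum over the fine groups asi contained in ASI.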

 

In each case, the proportion of the broader LBS/SAS group that falls within the finer age-sex-socio-economic group is taken from the SARs and imposed on the LBS/SAS count.

Example: with {i} defined by tenure and X being children of lone parents, RCasli exists in the LBS only for A=all ages, S=persons, and XCasli exists only for A=all ages, S=persons and I=all tenures.
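A minimal sketch of this disaggregation for the tenure example follows. It assumes the simplest case, in which one broad local count is split in proportion to national SAR counts. The function name and all figures are hypothetical.

```python
def disaggregate(local_count, sar_counts):
    """Split one broad LBS/SAS count across the finer cells it contains.

    local_count -- LBS/SAS count for the broad group (eg renters, all ages, persons)
    sar_counts  -- national SAR counts for each finer cell within that broad group
    """
    sar_total = sum(sar_counts.values())
    return {cell: local_count * n / sar_total for cell, n in sar_counts.items()}


# Made-up figures: 500 enumerated renters in a ward, split by national SAR age structure.
sar_renters_by_age = {"0-15": 300, "16-64": 600, "65+": 100}
ward_renters_by_age = disaggregate(500, sar_renters_by_age)
```

The ward's renters are allocated to age groups in the national proportions 30%, 60% and 10%, so the estimates always add back to the local count of 500.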

  

5 Computation

In each of our applications we will be adjusting many different X - for example the 20-30 Census indicators used in the SSAs. In every case the adjustments will be repeated with each of at least 3 sets of hypothesised {ui}, perhaps rather more.

Thus the extraction of data and the procedure for adjusting census data will need to be programmed in such a way that it can be repeated with changed {ui} and changed X with minimum additional work.

There are two steps, which correspond to equations (1) and (2) of section 3 above. When step A has been completed for a specific set of {ui}, step B is repeated for many different X.

A Calculate fasli for all asli

A1 Extract from LBS/SAS RCasli for all asli.

If the SARs are needed, calculate the SARs ratios required (the second term in (3) above).

A2 Calculate fasli for all asli as at (1)

B Calculate adjusted census count for variable X

B1 Extract from LBS/SAS XCasli for all asli

If SARs are needed, first calculate the SARs ratios required (the 2nd term in (4) above).

B2 Calculate XP as at (2).

Example: do so aggregating to Districts.
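The two steps might be organised as in the sketch below, so that step A runs once per hypothesised set of {ui} and step B is repeated for each X. It assumes the counts are already disaggregated to asli and that the factors are scaled as f = k * u within each population group. All names, keys and figures are hypothetical.

```python
from collections import defaultdict


def step_a(u, rc, rp):
    """f_asli for every cell; rc is keyed by (a, s, l, i), rp by (a, s, l), u by i."""
    weighted = defaultdict(float)     # sum over i of u_i * RC_asli
    enumerated = defaultdict(float)   # RC_asl
    for (a, s, l, i), count in rc.items():
        weighted[a, s, l] += u[i] * count
        enumerated[a, s, l] += count
    f = {}
    for (a, s, l, i) in rc:
        group = (a, s, l)
        k = (rp[group] - enumerated[group]) / weighted[group] if weighted[group] else 0.0
        f[a, s, l, i] = k * u[i]
    return f


def step_b(xc, f, district_of):
    """Adjusted count of X, aggregated from localities to Districts."""
    totals = defaultdict(float)
    for cell, count in xc.items():
        totals[district_of[cell[2]]] += count * (1 + f[cell])
    return dict(totals)


# Made-up data: one ward, one age-sex group, two tenure categories.
rc = {("0-4", "m", "w1", "rent"): 200, ("0-4", "m", "w1", "own"): 800}
rp = {("0-4", "m", "w1"): 1028}
x_tables = {"lone_parent_children": {("0-4", "m", "w1", "rent"): 30,
                                     ("0-4", "m", "w1", "own"): 50}}
u_sets = {"h1": {"rent": 3, "own": 1}, "h2": {"rent": 2, "own": 1}}

results = {}
for name, u in u_sets.items():
    f = step_a(u, rc, rp)                       # once per hypothesis
    for x_name, xc in x_tables.items():         # repeated for each X
        results[name, x_name] = step_b(xc, f, {"w1": "D1"})
```

Keeping the factors separate from the X tables means that changing {ui} or adding a new X needs no change to the extraction code.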

  

6 Issues

6.1 Reliability of local Census data XCasli

In our work with 1991 data we have to accept approximations which can and should be avoided with 2001:

  1. Data modification
  2. Imposition of SARs data on local statistics, involving approximations due to sampling and the lack of imputed households in the SARs, and the ecological fallacy of assuming a national relationship holds for local areas.

It would be possible to use regional or SAR-area relationships from the SARs to reduce the ecological errors. Such an approach will be considered when the categories {i} have been chosen, but at present it seems unlikely that use of sub-national SAR relationships will be productive because of increased sampling error.

 

6.2 Is ward disaggregation necessary?

Some applications - including most of our main investigation into SSAs - use District census data but not ward data. The procedure outlined above could be applied directly to Districts, using the District adjustments provided by ONS. This would reduce the computation required very significantly.

However, proceeding with District data assumes that the variables of interest are spread evenly throughout any District: for example, that lone parents are not concentrated in local areas of high non-response. Work with census ethnicity tabulations (Simpson 1996) showed an appreciable effect of local (within-District) geographical concentration on non-response estimates for ethnicity, over and above that of age and sex.

Although the use of {ui} in addition to age, sex, and locality will probably reduce the independent effect of local geography, we intend to work at ward level and re-aggregate even for applications that require only data for larger areas. We will replicate some early work with District data to measure the different effect of adjusting at ward or at District levels.

This choice for a census user between extensive but sensitive local calculations and fewer but more approximate large-area calculations would not be necessary were the One Number Census approach in 2001 to include adjustments at the most local level feasible.

 

6.3 Socio-economic differential non-response, specific to population groups

In the most general case we have {uasli}, non-response differentials for each age-sex-locality-socio-economic combination. The procedure continues exactly as in Section 3 above, replacing ui with uasli in (1).

The computation at A2 of Section 5 is rather heavier.

This refinement will be necessary if, as a result of the review of non-response (see Section 4.1), socio-economic differentials are found to vary from one age group to another, or to differ between males and females.

 

6.4 X referring to households

Throughout, reference has only been made to resident individuals. Household characteristics of individuals can be accommodated within the framework proposed in this paper. However, very little research and estimation on numbers of households missed has been published (DoE 1995, Simpson and Dorling 1994). There are no non-response estimates for types of household.

If household non-response is similar to that of individuals, we could side-step this issue by imposing a sex-age distribution on households equal to that for residents with the household characteristics. For example, for the number of households with no car, an age-sex distribution would be taken from those people in households with no car; non-response factors can then be applied as in this paper, and the results re-aggregated to give an adjusted number of households with no car.
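One reading of this side-step is sketched below: households are allocated to age-sex groups in proportion to the residents living in households of that type, each share is inflated by that group's non-response factor, and the shares are summed again. This is an interpretation, not a specification from the project, and the function name and figures are hypothetical.

```python
def adjust_households(households, residents_by_group, f_by_group):
    """Adjust a household count using residents' non-response factors.

    households         -- enumerated households of the type (eg with no car)
    residents_by_group -- enumerated residents of such households, by age-sex group
    f_by_group         -- non-response factor for each age-sex group
    """
    total_residents = sum(residents_by_group.values())
    adjusted = 0.0
    for group, residents in residents_by_group.items():
        share = households * residents / total_residents   # households imputed to group
        adjusted += share * (1 + f_by_group[group])
    return adjusted


# Made-up figures: 1000 no-car households, residents concentrated in a low-non-response group.
no_car = adjust_households(
    1000,
    {"m 20-24": 500, "f 75-79": 1500},
    {"m 20-24": 0.10, "f 75-79": 0.01},
)
```

With these figures the adjusted count is about 1032.5 households, a weighted average of the two factors applied to the enumerated 1000.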

Alternatively, a separate procedure could be prepared. This would probably be simple for 1991, and include the DoE estimates of total non-response of households in each District based on the Census Validation Survey.

 

 

References:

Department of the Environment (1995) Projections of households in England to 2016. HMSO: London.

Simpson, S (1996) Non-response to the 1991 Census: the effect on ethnic group enumeration. Chapter 3, pp63-79, in D Coleman and J Salt (eds) Demographic characteristics of the ethnic minority populations, Volume 1 of a four-volume series. HMSO: London.

Simpson, S and Dorling, D (1994) Those missing millions: implications for social statistics of non-response to the 1991 Census. Journal of Social Policy 23(4): 543-67.

 

  

January 1997

Census data coverage adjustment, inconsistency in results from working paper 1, and a solution with Iterative Proportional Fitting - June 1997

1 Introduction

Working Paper 1, Socio-economic Differentials In Non-Response: A Procedure For Adjusting Census Data, contains an ambiguity in how the SARs data should be used to estimate 100% census data at local level. With this ambiguity, and in any case with the method proposed, the adjusted census data produced are inconsistent with the adjustments of ONS and EwC.

2 The specification in Working Paper 1

Working Paper 1 specifies in equation (1) that the non-response already estimated by ONS/EwC for an age-sex-area population group is divided between socio-economic groups (eg tenure categories) such that their relative non-response factors are as in a set of presumed differentials {ui}. It then applies these to any Census output X (eg lone parents), as in equation (2).

For the first stage, Census counts for residents divided between each age-sex-area-socio-economic group do not usually exist in the LBS and must be estimated: Rcasli.

For the second stage, Census counts for Census output X divided between each age-sex-area-socio-economic group do not usually exist in the LBS and must be estimated: Xcasli.

These Census counts are estimated by using the national relationship between X, a, s, and i as gained from the SARs, with a relevant LBS table for the local area l as the constraint. This was specified in equations (3) and (4) of Working Paper 1.

3 The problem

Working Paper 1 was ambiguous in not specifying which LBS table for area l should be used to form the constraint, when more than one relevant table exists as is usual. For example there is usually at least one table showing age and sex, another showing socio-economic groups, and another showing the variable X.

However even when that ambiguity is removed, there remains inconsistency, as follows:

Equation (1) forces the adjustment factors f to be different in each local area so that for all a, s and l the undercount Rpasl - Rcasl is consistent with that already estimated by ONS/EwC. This is good.

But the Census count itself Rcasl is estimated wrongly if the LBS table used for (3) and (4) is not the one that contains the enumerated age-sex counts (L02). If any other table is used - for example the one with tenure ({i}) in it - the age structure will then be taken from the SARs within each tenure category and when added across tenure categories will not add to the enumerated age-sex counts for that area. So the adjusted population total Rpasl will be estimated differently from the ONS/EwC, even though the undercount itself was replicated correctly.

On the other hand, if we always use local table L02, and impose the i and X dimensions from the national SAR, the result will be inconsistent with the known local totals for each i and for X.
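Both inconsistencies can be seen in a small sketch with two age groups and two tenure categories. All figures are made up for illustration and none come from the Census or the SARs.

```python
# One local area: enumerated age counts (as in table L02) and enumerated tenure counts.
l02_age = {"young": 300, "old": 700}
tenure = {"rent": 400, "own": 600}
# National SAR counts by (age, tenure).
sar = {("young", "rent"): 50, ("old", "rent"): 50,
       ("young", "own"): 20, ("old", "own"): 80}

# Option 1: constrain to the tenure table, taking age structure from the SARs.
est1 = {}
for (age, ten), n in sar.items():
    sar_tenure_total = sum(v for (a, t), v in sar.items() if t == ten)
    est1[age, ten] = tenure[ten] * n / sar_tenure_total
age_from_est1 = {age: sum(v for (a, t), v in est1.items() if a == age)
                 for age in l02_age}

# Option 2: constrain to table L02, taking tenure structure from the SARs.
est2 = {}
for (age, ten), n in sar.items():
    sar_age_total = sum(v for (a, t), v in sar.items() if a == age)
    est2[age, ten] = l02_age[age] * n / sar_age_total
tenure_from_est2 = {ten: sum(v for (a, t), v in est2.items() if t == ten)
                    for ten in tenure}
```

Option 1 implies 320 young residents where L02 enumerates 300, and option 2 implies about 484 renters where the tenure table enumerates 400, so each choice contradicts the table that was not used.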

We have now noticed this twice. Our adjusted total number of residents is not the same when we use different models, even though we thought we were constraining it to be so; and when we adjust X and then adjust ‘all-residents-except-X’, the two results do not add to the total number of residents as we expected.

In each case, the problem is that our estimation of the local area census enumeration is too crude and is contradicted by whichever relevant LBS table we have not used. The inconsistencies are too great to ignore.

It follows that we should improve our estimation of the local area census enumeration, in some way that makes it consistent with the LBS tables on a-s, i and X for the local area. The way I know of doing this is Iterative Proportional Fitting.

 

4 Iterative Proportional Fitting (IPF)

IPF estimates an n-dimensional array when a set of initial values is given for each cell of the array and the array must be constrained to agree with (fit) marginal totals that are consistent among themselves. I think we only need to consider the two-dimensional case. There is an example of this two-dimensional case in Phil Rees (1994) Estimating and projecting the populations of urban communities, Environment and Planning A, 26:1671-1697.

5 How IPF might apply to our case in practice

The full stop . stands for the total over a subscript.

Instead of equation (3): estimate Rcasli for each a, s, l, and i as follows

Initial values come from the SARs, and are the same for every local area, ie Rc(0)as.i .

Constraints are Rcasl., from table L02, and Rc..li, from eg table L20.*1

In terms of the IPF, for each local area l there is a two-dimensional table with age-sex categories as one dimension, say the rows, and socio-economic categories as the other dimension, say the columns. The internal values are initially those from the SARs, which must be made to add up to the marginal row and column totals given in the LBS tables L02 and L20 for each local area. This is as follows.

Prepare marginals. Make the two constraints (Rcasl. , Rc..li) have consistent totals, in order to ensure that IPF converges, by rescaling the socio-economic marginal. Writing Rc'..li for the rescaled value:

Rc'..li = Rc..li * (sum over a,s of Rcasl.) / (sum over i of Rc..li)

Step 1. Constrain to a,s marginal total.

Rc(1)asli = Rc(0)as.i * Rcasl. / Rc(0)as..

Step 2. Constrain to i marginal total.

Rc(2)asli = Rc(1)asli * Rc'..li / Rc(1)..li

Then place Rc(2)asli in the place of Rc(0)as.i and repeat steps 1 and 2.

Retain the results of the previous step, and stop the iteration when it has converged, ie when the maximum absolute difference between the internal values of two successive steps is less than a pre-specified small value, eg 0.01:

Stop when max over a,s,l,i of abs(Rc(2)asli - Rc(1)asli) < 0.01.

If it makes programming simpler you could compare values of Rc(1)asli on successive iterations.

Our estimates of Rcasli for use in equation (1) are the final values of the array when the iteration has stopped.
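The two-dimensional fit for one local area could be programmed along the following lines. This is a minimal sketch, assuming strictly positive seed values and plain Python lists. The function name, the default tolerance and the figures are hypothetical.

```python
def ipf_2d(seed, row_targets, col_targets, tol=0.01, max_iter=1000):
    """Fit a two-dimensional table to row and column marginals.

    seed        -- initial values, eg national SAR counts (rows: age-sex, columns: i)
    row_targets -- local age-sex counts (Rcasl., from table L02)
    col_targets -- local category counts (Rc..li), rescaled here to the row total
    """
    # Prepare marginals: make the column targets sum to the same total as the rows.
    scale = sum(row_targets) / sum(col_targets)
    col_targets = [c * scale for c in col_targets]

    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Step 1: constrain to the row (age-sex) marginal totals.
        step1 = []
        for row, target in zip(table, row_targets):
            total = sum(row)
            step1.append([v * target / total if total else 0.0 for v in row])
        # Step 2: constrain to the column (socio-economic) marginal totals.
        col_sums = [sum(col) for col in zip(*step1)]
        step2 = [[v * t / s if s else 0.0
                  for v, t, s in zip(row, col_targets, col_sums)]
                 for row in step1]
        # Converged when step 2 barely changes the step 1 values.
        change = max(abs(b - a)
                     for r1, r2 in zip(step1, step2)
                     for a, b in zip(r1, r2))
        table = step2
        if change < tol:
            break
    return table


# Made-up figures: rows are (young, old), columns are (rent, own).
fitted = ipf_2d([[50, 20], [50, 80]], row_targets=[300, 700], col_targets=[400, 600])
```

Because the last operation is a column fit, the column totals match their targets almost exactly, while the row totals match to within the stopping tolerance.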

 

Instead of equation (4): estimate Xcasli for each a, s, l, and i as follows

Here I need to change the notation of Working Paper 1 slightly, to make X a subscript that can take two values, to indicate whether it is X or not, eg lone parent or not lone parent. So Xcasli is now written Rcxasli.

Initial values come from the SARs, and are the same for every local area, ie Rc(0)xas.i. Constraints are Rc.asli from above and Rcx..l., from the LBS table with X in it.*1

In terms of the IPF, for each local area l there is a two-dimensional table with age-sex-socio-economic categories as one dimension and variable-of-interest (X) dichotomy as the other dimension. The internal values are initially those from the SARs, which must be made to add up to the marginal totals. This is done with the steps and iteration exactly as above.

Our estimates of Rcxasli for use in equation (2) are the final values of the array when the iteration has stopped.

 

Now that our census enumeration estimates are consistent with each other and with the counts in the LBS, there will be no inconsistency between our estimates and those of ONS/EwC, nor with the census LBS. We will be able to successfully complete the tests that we have proposed.*2

 

University of Manchester CCSR