SARs Newsletter No. 14

SARS NEWSLETTER

NO. 14 - MARCH, 2000

Samples of Anonymised Records from the 1991 Census

Census Microdata Unit, Faculty of Economic and Social Studies, University of Manchester, Manchester M13 9PL

 

SARS FROM THE 2001 CENSUS:

OPTIONS AND RECOMMENDATIONS FOR THE HOUSEHOLD AND INDIVIDUAL SAR

After an extensive period of user consultation and work on confidentiality, we are in a position to set out some options and recommendations for the Individual and Household SARs for 2001. The consultation exercise has included a number of workshops, a conference targeted at local authorities and a postal survey of user and non-user views. These have been reported in previous SAR newsletters (Nos 7, 9, 10, 11).

Confidentiality work has been carried out by Mark Elliot and Angela Dale (see article further down).

In this report we consider proposals only for the existing Household and Individual SARs. Work on the feasibility and possible form of a Third SAR (an individual SAR with more detailed geography and a limited set of variables) is reported on page 12.

The starting point for developing recommendations for the 2001 Household and Individual SAR is that everything available in the 1991 SARs should be available for 2001 in at least as much detail. There should be no sacrifice of variable detail to achieve a finer geographical resolution in either the Individual or Household SAR. Proposals for a Third SAR should be considered independently of the two existing SARs.

1% HOUSEHOLD SAR

The recommendation is that we should retain the size and structure of the 1991 file. Within ONS confidentiality constraints there is no prospect of raising the sample size above 1%. To do so would require losing all geography from the SAR, for which there has been little expressed support. Neither is there any prospect for a move to SAR areas at district rather than regional level.

However, while there is little scope for changes to the basic components of sample size, variable detail and the minimum size of output areas, there remain some important issues surrounding the geography of SAR areas. First, while wanting to retain full comparability with the 1991 geography, there is a need to respond to a shift in the working geography of regional government, from the Standard Statistical Regions (SSRs) to the Standard Government Office Regions (GORs). This includes the effective subsumption of the old North and Merseyside SSRs into the new North East and North West GORs.

A second issue concerns the large difference in the population base of regions. A number of users have argued that if East Anglia (with a population of just over 2 million) can be a separate SAR area, there is scope for some disaggregation of larger regions (e.g. Yorkshire and Humberside had a population of 4.7 million) without increasing disclosure risk.

Within the constraints of a pre-defined population threshold (based on East Anglia in 1991), and the requirement that areas should reconstitute into the regional geography, there are various options for disaggregation. Users have also indicated the value of distinguishing metropolitan from non-metropolitan areas. For Scotland this might involve a simple division between lowlands and highlands, and for Wales between rural Wales and South Wales. For England, the choices are more complex, and an important issue is whether areas should be geographically contiguous.

2% INDIVIDUAL SAR

Geography

The consultation process over the Individual SAR established strong demand for a larger sample to improve the precision of area level estimates and the scope for analysing population sub-groups. There is also strong demand for a SAR geography that delivers output at the sub-district level. A further factor influencing the options on geography is the need to address some major changes to government boundaries with the creation of 49 new Unitary Authorities.

Confidentiality work by Mark Elliot and Angela Dale (reported on page 4) has demonstrated that the 1991 Individual SAR was probably even safer than originally thought. It shows that increasing the sample size and providing finer geographical resolution in the Individual SAR would be possible without a serious increase in disclosure risk and without loss of variable detail. At this stage the work suggests that we could increase the sample size from 2 to 3 per cent and lower the population threshold from 120,000 to 90,000, or possibly 60,000.

Lowering the population threshold in this way would dramatically increase the share of local authority districts which could stand as separate SAR areas. Under the 1991 specifications, almost two thirds of Local/Unitary Authorities in 2001 would (using 1991 population counts) have populations below the 120,000 threshold and would consequently have to be grouped with other authorities. At 90,000 this falls to less than 25% and at 60,000 to just over 10%.

Such a change would strengthen the case for providing output areas at sub-district level, especially among the larger metropolitan districts. For example, using a 90,000 threshold, Birmingham, the UK district with the largest population (almost 1 million) could be divided into 10 SAR areas, Leeds into 7 and Sheffield into 5.

In order to develop an appropriate basis for defining such areas, we have sought the views of those local authorities with populations large enough to allow sub-district SAR areas. To ensure comparability with the published area statistics, we recommend that these areas should be based on aggregations of complete wards, be spatially contiguous and avoid crossing district boundaries. Individual local authorities should be consulted over the precise groupings of wards to ensure that they represent meaningful areas of maximum utility to practitioners.

Comparability with 1991

Ensuring SAR areas do not cross over district boundaries is important, not only to retain comparability with published area statistics, but also to allow comparison with 1991 SARs. However, that task will be made more difficult following the creation of the new Unitary Authorities.

A number of different suggestions have been made to ensure comparability between SARs for 1991 and 2001 (see Newsletter no. 9). One is to draw two separate non-overlapping SARs based on old and new geographies respectively. However, there is no evidence that demand for 1991 SAR geography is sufficient to justify the substantial cost of this. A more practical option may be to recode the 1991 SAR to the 2001 geography. This would be cheaper, since the sample has already been drawn, but it would have confidentiality implications. The practicality of this option will be explored but it cannot be a top priority.

Variables

The consultation exercise established a demand for increased detail about the household in which individuals lived. Suggestions for additional variables include:

Generally, adding information about the head of household leads to a very significant increase in risk of disclosure. However, there may be scope to include more derived variables that summarise household level information, since these are less likely to allow direct identification and so breach confidentiality. Examples include a summary of household composition using the MRS Lifestage variable, the DETR (DOE) household classification or the CCSR household classification.

There is also some demand for attaching more area level information to individual records, such as an indicator of urban/rural status. This information would help tackle the problem of small (often rural) LAs being unidentifiable in larger SAR areas. However, this needs to be linked to the retention of an area-level classification and the request that this should be the same on both SARs.

Finally, one aspect of variable detail that proved particularly problematic for users of the 1991 SAR concerned migration. In particular, the detail on place of origin was considered too crude to be of much analytical use. While identifying individual SAR area of origin raises considerable confidentiality constraints, there may be scope to work with a summary classification of area type (e.g. metropolitan/non-metropolitan areas or even the more detailed area classification already available for migrant ‘destination’). This would considerably improve the value of the data by allowing identification of separate flow types such as urban-urban or urban-rural and so a measure of processes such as counter-urbanisation. In addition there has been considerable demand for more bands on distance moved, especially at the near-distance end, accounting for the majority of movers. Readers wishing to contribute to the debate on migration should contact Paul Boyle at the University of St Andrews (P.Boyle@st-andrews.ac.uk), who is helping to develop recommendations on this topic.

 

CONFIDENTIALITY WORK TO SUPPORT THE SPECIFICATION FOR AN INDIVIDUAL SAR FROM THE 2001 CENSUS

Angela Dale and Mark Elliot

The request for an Individual SAR from the 2001 Census will make a case for a larger sample size and lower population threshold than the 1991 file. The research supporting this specification is set out below. The article on p.1 provides more information on the background to this request.

1. THE REQUIREMENT FROM USERS

Strong arguments were made by academic and local government users for a larger sample size for the Individual SAR, together with a lower population threshold that would substantially reduce the number of LA districts that were grouped together. Whilst both these modifications would enhance the research value of the SAR, the sample size and population threshold for the 1991 SAR had been set in order to meet confidentiality requirements. The risk of identifying an individual or household in the SARs should be negligible.

In recent years a substantial programme of research on confidentiality has been conducted at CCSR, funded by the European Union and the ESRC. This has provided an opportunity to make a detailed assessment of the confidentiality risk of increasing the sample size and reducing the population threshold of the Individual SAR. A summary of the results is reported below; the full paper is available to download.

2. REASSESSING THE RISK OF DISCLOSURE FROM THE 1991 SARS

An initial step was to re-assess the research conducted by Marsh et al (1991), which underpinned the acceptance of the 1991 SARs, in the light of the better availability of data since 1991. Marsh et al predicted that the probability of correct identification of an individual in a microdata sample, on the basis of variables already recorded for that individual in other files, was of the order of 1 in 4 million, or 2.4 x 10^-7.

Marsh et al required three conditions for identification to be feasible and a fourth to be able to infer an exact record match. A probability was attached to each condition. These were:

  1. An individual must appear in the sample. The probability of being in the sample was set at 0.02 – reflecting the 2 percent sample size of the Individual SAR.
  2. An individual must be unique in the population on the key variables being used. The probability of being unique in the population on eight key variables was estimated at 0.02 based on 1980 census data for the entire region of Tuscany.

We used population data from the 1991 Census, at a geographical level of 120K population, and the following keys: age (94), sex (2), marital status (5), economic activity (5), ethnic group (10), migration (4), tenure (6), giving a theoretical total of 1.1 million cells. We found that 4.8 per cent of records were unique in the population.
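The theoretical cell count quoted above is simply the product of the number of categories in each key. A minimal sketch in Python of that arithmetic (the dictionary name is illustrative and not part of the original analysis):

```python
from math import prod

# Number of categories for each key variable, as listed in the text.
key_categories = {
    "age": 94,
    "sex": 2,
    "marital status": 5,
    "economic activity": 5,
    "ethnic group": 10,
    "migration": 4,
    "tenure": 6,
}

# Theoretical number of cells in the full cross-classification.
total_cells = prod(key_categories.values())
print(f"{total_cells:,} cells")  # 1,128,000 cells, i.e. about 1.1 million
```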

It is likely that the higher percentage of uniques in the British data is explained by:

  1. the correlation between occupation and position at work in Marsh et al’s data for Tuscany, which will reduce the number of uniques, and
  2. the effect of two variables, ethnic group and tenure, in the British data, both of which are highly skewed and likely to increase the number of uniques.

The key variables we have used may represent a higher confidentiality threshold than that used by Marsh et al, in that accurate knowledge of all the characteristics would be very hard to come by. Nonetheless we use the higher figure of 0.048 rather than 0.02 in the re-calculation of Marsh et al’s equation.

  3. If a match is to be established between an individual in both the population and the sample, key variables must be recorded identically on both datasets. The probability of this occurring was estimated by Marsh et al at 0.6 using evidence from a wide range of datasets.

Our research on the probability of recording key variables identically on both datasets suggested a much lower value of about 0.18. This was based on three pieces of work:

  1. An assessment of the extent to which the process of editing and imputation used to construct the 1991 Census database added a measure of protection to the data.
     Using census population data from seven Local Authority Districts, we found that, overall, 16 per cent of records had one or more changes made during the edit and imputation phase. However, there was considerable variation between the geographical areas and between the variables edited. The variable with the greatest degree of editing was economic activity, with 5.7 per cent of records edited. As the data did not include the 10 per cent sample, no figures were available for occupation, although the Census Validation Survey found gross errors of 25 per cent at occupational unit level and about 15 per cent at the level of the Standard Occupational Classification Major Groups (9 categories).

  2. An analysis of change in key variables over time. This used the British Household Panel Survey, which began in 1991 with a sample of 5,000 households across Great Britain who are interviewed each year. It therefore provided an excellent way of establishing the extent of change in key variables year on year.
     Using marital status, labour force status and tenure we found that only 68 per cent of those with a response in both 1991 and 1993 stay in the same categories on all three variables. Almost 20 per cent of respondents changed both marital status and tenure, reflecting the extent to which forming or dissolving a partnership affects change between the major tenure categories – owner-occupation, social rented housing and privately rented.

  3. The General Household Survey for 1991, with a sample of 12,000 households, contains most of the questions asked on the 1991 Census and was used to conduct a matching experiment with the 2 per cent Individual SAR.

Using 18 variables present in both datasets and recoded as nearly as possible to the same categories, about 45 per cent of records in the GHS (over 11,000 records) gave an exact match against one or more individuals in the SARs. In many cases there were very large numbers of ‘statistical twins’ in the SARs for one GHS record (for example, 690 records in the GHS each had over 100 matches in the SARs). Thus, for an intruder attempting to make a one-to-one match between the two data sets the situation would be very confusing. This very high level of false matches was entirely unpredicted and would seem to be quite an effective deterrent to the would-be intruder.

In an attempt to reduce the matching task to manageable proportions, we focussed on one month only - April 1991 - designed to maximise the ability to match correctly. For this month there were 219 records in the GHS which matched one, and only one, individual in the SARs and a further 112 records which matched only two individuals in the SARs. From the sampling schemes we would predict that, by chance, there would be about 43 people who were in both datasets. When the 219 apparent matches were checked within ONS, only 6 were found to be correct and when the extra 112 records were checked a further 2 were correct. Thus the probability of a correct match given a one-to-one match is 6/219=0.027. In practice, an intruder would be faced with total confusion with 219 apparent matches and no way of knowing which was correct.

However, this probability does not directly address the requirement set by Marsh et al in their third proposition. A measure of the probability of the two files being coded identically on the 18 key variables for those individuals expected to be in both files has been derived by Elliot (1998). This shows that 68 per cent of records which could be assumed to be present for the same individual in both datasets differed on at least one variable, even after coding to maximise their compatibility. However, the probability of recording key variables identically on two datasets is also influenced by data ageing. For example, analysis of the British Household Panel Survey showed that, using only three variables, 45 per cent of respondents present in both sweeps had changed their characteristics on one or more variables between 1991 and 1993. These two figures – a 32 per cent probability of being coded the same on both datasets and a 55 per cent probability of values remaining unchanged – are both conservative. Nonetheless, if taken together they suggest that a probability of 0.18 (.32 x .55 = 0.18) can be used as a more realistic estimate than that of 0.6 used by Marsh et al.
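The 0.18 figure can be reproduced from the two percentages quoted above. The sketch below is only a check of the arithmetic; the variable names are illustrative, and multiplying the two figures implicitly treats coding differences and data ageing as independent, which is an assumption of the sketch rather than a result stated in the article.

```python
# Figures quoted in the text (both described there as conservative).
p_coded_identically = 1 - 0.68    # 68 per cent of records differed on at least one variable
p_unchanged_over_time = 1 - 0.45  # 45 per cent changed on one or more variables, 1991 to 1993

# Assumption: the two effects are independent, so the probabilities multiply.
p_identical_record = p_coded_identically * p_unchanged_over_time
print(round(p_identical_record, 3))  # 0.176, quoted as 0.18 in the text
```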

  4. It is necessary to verify that a match is, in fact, a correct match and does not belong to someone else in the population with the same set of characteristics. The probability of verifying population uniqueness was estimated by Marsh et al to be zero but set at 0.001 for the purposes of calculating a risk.

Despite the growth of lifestyle databases, they do not provide verification that an apparent match is a correct one. For example, such databases do not give population coverage; they do not record historical data allowing matches to a given year, let alone month; they do not record information using the same classifications as the census; they are renowned for their inaccuracy; and they are designed to hold the most up-to-date information possible on each individual rather than keeping information relating to a specific period some years earlier. They could not, therefore, provide verification that an apparent match was a correct match.

We do not yet have a complete population register. Data linkage within government is still proceeding cautiously; there is no available register with the range of variables needed to verify identification. In addition, were such a register to be compiled it would be kept under heavy security and accessed only by a limited number of staff screened for security. It may well be argued that verification would not be necessary in order to damage the census. However, to rework Marsh et al’s figures we retain the verification requirement and have concluded that scope for verification still remains very limited. We therefore use the same probability as Marsh et al, 0.001.

Calculating the per record risk:

From the four elements discussed above, Marsh et al calculated a per record probability of identification of 2.4 x 10^-7.

If we re-calculate this probability using our more recent figures we reach a probability of: .02 x .048 x .18 x .001 = 1.73 x 10^-7

(.02 giving the probability of a record being in the sample; .048 the probability of a record being unique in the population; .18 the probability of achieving a correct match; and .001 from the probability of verifying such a match).

From this we can conclude that the per record risk of identification calculated by Marsh et al (1991) under-estimated the difficulty of making a correct match between the SARs and an external data file.
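As a check on the arithmetic, the four-factor product can be written as a small function. This is a minimal sketch: the function and argument names are illustrative, and the input values are the probabilities quoted in this article for Marsh et al (1991) and for the revised assessment.

```python
def per_record_risk(p_in_sample, p_population_unique, p_identical_coding, p_verified):
    """Multiply the four probabilities in Marsh et al's model of identification."""
    return p_in_sample * p_population_unique * p_identical_coding * p_verified

# Original figures attributed to Marsh et al (1991), then the revised figures.
marsh = per_record_risk(0.02, 0.02, 0.6, 0.001)
revised = per_record_risk(0.02, 0.048, 0.18, 0.001)

print(f"Marsh et al: {marsh:.2e}")    # 2.40e-07
print(f"Revised:     {revised:.2e}")  # 1.73e-07
```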

3. THE DISCLOSURE RISK FROM CHANGES IN SAMPLE SIZE AND POPULATION THRESHOLD

Having calculated revised probabilities for a 2 percent sample with a 120K population threshold we can now proceed to calculate the increased risk of a larger sample size.

Increasing the sample size to 3 per cent gives an identification risk of:

.03 x .048 x .18 x .001 = 2.59 x 10^-7

Thus the effect of increasing the sample size from 2 per cent to 3 per cent gives a per record risk very similar to that accepted for the 1991 SARs.

We can also calculate the effect on the proportion of population uniques of reducing the population threshold. For the same key variables, and with a 3 per cent sample, the proportion of population uniques rises to .054 at a population size of 90K, giving a per record risk of:

.03 x .054 x .18 x .001 = 2.92 x 10^-7, or 1 in 3.4 million, compared with 1 in 4 million for the original SAR specification.

Reducing the population threshold to 60K gives:

.03 x .067 x .18 x .001 = 3.62 x 10^-7, or 1 in 2.76 million.

From this we can argue that reducing the population threshold to 90K and increasing the sample size to 3 per cent leaves the per record risk very similar to that accepted for the 1991 SARs. Reducing the population to 60K increases the risk slightly by comparison with 1991, although the per record risk does not increase linearly with geography. It is, however, necessary to point out that increasing the sample size increases the total number of records that are at risk, even if the risk per record does not increase substantially.
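The comparison across specifications can be laid out side by side. The sketch below simply recomputes the products given above for each combination of sampling fraction and population threshold; the dictionary and constant names are illustrative.

```python
# (sampling fraction, proportion of population uniques) for each specification,
# using the figures quoted in the text.
scenarios = {
    "2% sample, 120K": (0.02, 0.048),
    "3% sample, 120K": (0.03, 0.048),
    "3% sample, 90K": (0.03, 0.054),
    "3% sample, 60K": (0.03, 0.067),
}
P_IDENTICAL = 0.18  # probability of key variables being recorded identically
P_VERIFY = 0.001    # probability of verifying an apparent match

risks = {}
for label, (fraction, p_unique) in scenarios.items():
    risks[label] = fraction * p_unique * P_IDENTICAL * P_VERIFY
    one_in_millions = 1 / risks[label] / 1e6
    print(f"{label:<16} {risks[label]:.2e}  (1 in {one_in_millions:.2f} million)")
```

Note that the total number of records exposed to this per record risk also grows with the sampling fraction, a point the product alone does not capture.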

The equations used above are based on the assumption that an attempt at identification is made. If we included in the probability equation the likelihood of an attempt, the probability would reduce even further.

4. ALTERNATIVE WAYS OF ASSESSING RISK

Access to population data from the 1991 Census has allowed us to develop some alternative methods of risk assessment. These methods all draw samples from the population data and, therefore, there is no disparity between the population and the sample in terms of classification, editing, or data ageing. All results are based on individuals in households for one large Local Authority District.

We begin from the same premise as Marsh et al: that records at risk of identification are those which are unique in both the sample and the population. These records are termed union unique (UU) in the rest of this discussion. Similarly sample uniques are termed SU and population uniques PU. We draw samples of different size and with different population thresholds in order to compare the absolute and relative numbers of records unique in both sample and population (UU). We can also make these comparisons using different key sets of variables.

There are various ways in which we can express the number of records that are unique in both the sample and the population.

1. The percentage of Sample Uniques which are Union Uniques (UUSU):

UUSU = (number of UU records / number of SU records) x 100

2. An alternative way of expressing Union Uniques is as a percentage of the entire sample. This gives a more direct measure of the percentage of a given sample that is at risk.

3. We can also calculate the absolute numbers of UU records and establish how this changes with the sampling fraction and population threshold.
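The three measures above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used for the results in this article: the function name is hypothetical, the population is entirely synthetic (uniformly random categories with the same number of codes as the 5-variable key), and the sample is a simple random sample.

```python
import random
from collections import Counter

def uniqueness_measures(population, sample_indices):
    """Count sample uniques (SU) and union uniques (UU) for a set of key records.

    population: list of key tuples, one per person.
    sample_indices: positions of the sampled records within `population`.
    """
    pop_counts = Counter(population)
    sample = [population[i] for i in sample_indices]
    sample_counts = Counter(sample)

    # Sample uniques: key combinations held by exactly one sampled record.
    su = [key for key in sample if sample_counts[key] == 1]
    # Union uniques: sample uniques that are also unique in the population.
    uu = [key for key in su if pop_counts[key] == 1]

    return {
        "SU": len(su),
        "UU": len(uu),                                       # measure 3
        "UUSU": 100 * len(uu) / len(su) if su else 0.0,      # measure 1
        "UU_pct_of_sample": 100 * len(uu) / len(sample),     # measure 2
    }

# Synthetic population only: made-up records, not census data.
rng = random.Random(0)
category_counts = (94, 2, 5, 5, 10)  # age, sex, marital status, economic activity, ethnic group
population = [
    tuple(rng.randrange(n) for n in category_counts) for _ in range(20_000)
]
sample_indices = rng.sample(range(len(population)), k=400)  # a 2 per cent sample
measures = uniqueness_measures(population, sample_indices)
print(measures)
```

Because real census variables are highly skewed and correlated, figures from a uniform synthetic population will not resemble those in Tables 1 to 3; the sketch shows only how the measures are defined.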

For each measure we can take the baseline value as that which would be obtained for a sample with the parameters of the 1991 SAR - that is, a 2 per cent sample at 120K population threshold. We can then compare this value with those obtained from a larger sample size and smaller population thresholds. We have used data at three geographical levels (60K, 90K and 120K) with a 2 per cent and 3 per cent sample and 5-variable key.

Firstly, using the percentage of Sample Uniques which are Union Unique (UUSU), we can see from table 1 that decreasing the geographical threshold to 60K has very little impact. This counter-intuitive effect is found using other key sets (Elliot et al, 1998) and may be explained by the clustering of individuals on a range of descriptive variables. This is explored further, and a stochastic model fitted to explain the observation, in Elliot, Skinner and Dale (1998).

Table 2 expresses the number of union uniques as a percentage of the total sample and thus gives an alternative way of assessing the size of this risk set. Union uniques form a very small percentage of the total sample size and the relationship with the population size is not proportional: a reduction in population threshold from 120K to 60K leads to only a 50 per cent increase in the percentage of union uniques.

The third measure takes the total number of records (per geographical unit) which are union unique (Table 3). This gives an absolute measure of the number of records at risk within each geographical area defined. As the population size decreases, the number of UUs decreases, a function of a reduction in the number of both sample and population uniques. However, the increased sampling fraction acts to increase the number of sample uniques and thus increases the number of UUs. Thus with a threshold of 120K and a 2 per cent sampling fraction there are 21 risky records per geographical unit, which increases to 26 per geographical unit with a 3 per cent sampling fraction and a 60K population threshold.

The comparisons between geographical level shown here have been calculated taking each area as a single entity and have not considered the effect of using different thresholds on the disclosure risk for the entire sample for GB. The aggregation effect can be simply demonstrated with the LA data used for these examples.

The total population was divided into seven areas of about 60K, five of about 90K and four of about 120K. Using the figures from Table 3, which are averages across the constituent areas, we can see that the total number of records which are union unique in the Local Authority District with a 2 per cent sample is:

60K 17.8 x 7 = 125

90K 21.0 x 5 = 105

120K 20.7 x 4 = 83
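The same aggregation can be reproduced in a few lines, which also gives the relative increase between the 120K and 60K divisions. This is a minimal sketch using the Table 3 averages for the 2 per cent sample; the variable names are illustrative.

```python
# Average union uniques per area (Table 3, 2 per cent sample) and number of areas.
areas = {"60K": (17.8, 7), "90K": (21.0, 5), "120K": (20.7, 4)}

totals = {level: round(mean_uu * n_areas) for level, (mean_uu, n_areas) in areas.items()}
print(totals)  # {'60K': 125, '90K': 105, '120K': 83}

# Relative increase in union uniques when 120K areas are replaced by 60K areas.
increase = totals["60K"] / totals["120K"] - 1
print(f"{increase:.0%}")  # 51%
```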

Once again, the number of individuals who are union unique increases by 50 per cent for a halving of the population size. In assessing the significance of this, a crucial question is the role of geography in a disclosure attempt. If it is assumed that attempts at disclosure would be made within a geographical area, then one might argue that it is more important to assess risk within a defined geographical area, either at the LA level or below, rather than to aggregate risky records across the entire country.

CONCLUSIONS

The 1991 SARs represented a major breakthrough for social science. They have been widely used and there is every indication that the 2001 SARs will be used even more widely. However, there are powerful arguments for increasing the sample size and reducing the population threshold of the Individual SAR. User consultation has demonstrated that these changes would make a significant impact on the research value of the SARs and widen the user base considerably.

We have shown that, using a number of different measures, an increase in sample size to 3 per cent and a reduction in population threshold to 60-90K makes only a small additional increase in the risk to confidentiality. There have been no known attempts to breach confidentiality in the 1991 SARs and a rigorous registration system requires users to give a legally-binding undertaking not to attempt to identify any individual or household. Scenario-based analysis of threats to confidentiality (Elliot and Dale, 1999) suggests that these are more likely to occur during the process of census-taking, when public awareness is higher and opportunities are easier, than from attempts to identify individuals in microdata files.

The research and policy value of the 1991 SARs has been very considerable. The proposed increase in sampling fraction and reduction in population threshold would increase it even further and broaden the value of the SARs in the policy arena.

REFERENCES

Elliot, M.J. (1998) DIS: Data intrusion simulation - a method of estimating the worst case disclosure risk for a microdata file. Proceedings of international symposium on linked employee-employer records, Washington; May 1998.

Elliot, M. J. and Dale, A. (1998) Disclosure Risk for Microdata: Workpackage DM1.1 What is a Key Variable? Report to the European Union ESP/ 204 62/DG III

Elliot, M. J. and Dale, A. (1999) Scenarios of Attack: The data intruder's perspective on statistical disclosure risk. Netherlands Official Statistics. Spring 1999.

Elliot, M. J., Skinner, C. J, and Dale, A. (1998) Special Uniques, Random Uniques and Sticky Populations: Some Counterintuitive Effects of Geographical Detail on Disclosure Risk. Research in Official Statistics; 1(2)

Marsh, C.; Skinner, C.; Arber, S.; Penhale, P.; Openshaw, S.; Hobcraft, J.; Lievesley, D.; Walford, N. (1991). The Case for a Sample of Anonymised Records from the 1991 Census. Journal of the Royal Statistical Society Series A,154, 305-340.

Table 1 UUSU by sampling fraction and geographical level

Key set: age (94), sex (2), marital status (5), economic activity (5), ethnic group (10)

Sampling fraction    60K     90K     120K
2%                   6.20    6.21    6.04
3%                   8.04    8.22    7.84

 

Table 2 UU as a percentage of sample size by sampling fraction and geographical level

Key set: age (94), sex (2), marital status (5), economic activity (5), ethnic group (10)

Sampling fraction    60K     90K     120K
2%                   1.40    1.18    0.93
3%                   1.38    1.17    0.91

 

Table 3 Number of Union Uniques by sampling fraction and geographical level

Key set: age (94), sex (2), marital status (5), economic activity (5), ethnic group (10)

Sampling fraction    60K     90K     120K
2%                   17.8    21.0    20.7
3%                   26.4    31.2    30.6

STAFF CHANGES

Tracey Payne, our Computing Officer, left CCSR at the end of December 1999. Tracey had been with us since 1993, providing technical support and advice to SARs users and helping to run courses and seminars on the exploitation of the SARs. We wish her every success in her new post as Database Manager at a secondary school near her home in Sheffield.

We have been fortunate enough to be able to appoint Dr Yaojun Li to fill this post. Dr Li will take up his position in August 2000. In the meantime any technical queries will be dealt with by Dr Mark Brown, Tel 0161 275 4780, Email mark.brown@man.ac.uk. Queries regarding SARs registration, access etc should continue to be sent to Ruth Durrell, Tel 0161 275 4721, Email ruth.durrell@man.ac.uk.

Muriel Egerton is now working at the Centre for Longitudinal Studies, Institute of Education.

Clare Holdsworth took up a post as lecturer in the Department of Geography, University of Liverpool in January 2000. She continues a close association with CCSR through her ESRC grant on alternative household classifications which is jointly held with Angela Dale and Rachel Leeser (GLA) and based in Manchester (see article on p12). Jo Wathan joined this project on March 1 as researcher.

 

A SAR FOR SMALL AREAS

Mark Brown

A key finding from our consultation over requirements for SARs for the 2001 Census has been frustration over the relatively crude 1991 SAR geography on both individual and household SAR. Practitioners in local government and health authorities expressed a clear need to work with data for their own areas. From a slightly different perspective, many academics argued the need for a more detailed and flexible SAR geography in the investigation and measurement of area effects.

To a limited extent these concerns are being addressed in the proposals for lowering the population threshold used in the Individual SAR (see article on page 4). This would substantially increase the number of LAs that would stand as separate SAR areas, and, for the more populous metropolitan districts, offers the prospect for some SAR areas at sub-district level. However, this remains some way short of the level of geographical detail requested by many, and further reduction in the population threshold would have to be at the expense of individual variable detail.

With little support for loss of detail on the main Individual SAR, a number of users have argued the case for an additional microdata file, a Third SAR, in which the trade-off between geography and variable detail could be re-negotiated in favour of a substantially more detailed geography (see newsletter articles in vol 9). As part of the Census Development Programme we have received ESRC funding to evaluate the case for small area microdata (SAMs).

The project has a number of different but related components which together should allow us to establish the analytic potential of the SAM and determine the optimal point at which information in the SAM can be maximised whilst ensuring that the level of risk remains negligible. The basis for this work is a series of prototype SAMs which we have extracted from 1991 census data with sample sizes from five to ten per cent and population thresholds from five to thirty thousand.

Statistical measures to increase the precision of estimates

One potential application of the SAMs, especially for practitioners, is to generate user-defined tables for small areas that are not available in the published area statistics. However, simply delivering a more detailed geography will not guarantee useful statistics at the sub-district level. One of the problems encountered with the 1991 Individual SAR is that attempts to analyse population sub-groups quickly encounter cell counts that are too low for reliable estimates. Even with a 5% sample, a SAM area at the population threshold of 10,000 yields a sample of just 500 people. The large confidence intervals likely to surround sub-district estimates derived from SAMs could therefore deter use of the data for policy purposes. In recognition of this, a key component of the project is to investigate using the univariate or bivariate distributions of key variables from the 100 per cent Small Area Statistics (SAS) or Local Base Statistics (LBS) tables to increase the precision of estimates based on microdata. This work builds on earlier research by Mark Tranmer and David Steel, and its potential contribution to methodological development in the use of microdata clearly extends well beyond the SAMs.
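The sample-size problem can be illustrated with a short calculation. The following is a minimal sketch, assuming simple random sampling and a normal approximation for a proportion; the function names and the 10% prevalence figure are illustrative assumptions, not project results.

```python
import math


def expected_sample_size(population: int, sampling_fraction: float) -> float:
    """Expected number of sampled persons in an area of the given population."""
    return population * sampling_fraction


def proportion_ci_half_width(p: float, n: float, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a proportion.

    Assumes simple random sampling and ignores the finite population
    correction, so it slightly overstates the width for large fractions.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# A 5% sample of an area of 10,000 people, as in the text.
n = expected_sample_size(10_000, 0.05)

# Hypothetical characteristic held by 10% of the area's residents.
half_width = proportion_ci_half_width(0.10, n)

# The same characteristic estimated within a sub-group of only 50 sampled people.
subgroup_half_width = proportion_ci_half_width(0.10, 50)

print(f"n = {n:.0f}: 10% +/- {100 * half_width:.1f} points")
print(f"n = 50: 10% +/- {100 * subgroup_half_width:.1f} points")
```

Under these assumptions the whole-area estimate carries a margin of roughly 2.6 percentage points, and the margin for a sub-group of 50 sampled people rises to roughly 8.3 points, which is why auxiliary information from the 100 per cent tables is attractive.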

Analytical value

While the ability to derive user-defined tables is a key benefit of microdata generally, the analytic potential of SAMs is likely to be greatest in the field of multivariate modelling.

For many potential users, the appeal of small output areas is not as units of analysis in their own right but as ‘building blocks’ which can be aggregated to form meaningful user-defined geographies. While this has obvious appeal to practitioners who may be able to build SAMs to policy-defined areas, it has particular potential for those seeking to model area effects, where the appropriate geography will depend on the particular application. Thus in employment analysis one wants local labour markets or travel-to-work areas; for educational analysis one needs LEAs; for health one needs much smaller areas which can be approximated by grouped wards. Whilst theoretical advances have identified the importance of including area-level effects, and developments in statistical modelling and software (e.g. MLwiN) have greatly facilitated the conduct of such analyses, there is little data available with the flexibility to allow users to define the appropriate areas for specific analyses. It is in this respect that the SAMs offer particularly exciting potential.
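The ‘building block’ idea amounts to attaching a user-supplied lookup to each person record and aggregating on the new area code. The sketch below is illustrative only: the area codes, the zone names and the ill-health flag are hypothetical, and a real analysis would carry the zone code forward as the higher-level identifier in a multilevel model rather than stop at a simple rate.

```python
from collections import defaultdict

# Hypothetical microdata records: (SAM area code, has limiting long-term illness).
records = [
    ("A01", True), ("A01", False), ("A02", False),
    ("A03", True), ("A03", True), ("A04", False),
]

# Hypothetical user-defined lookup from SAM areas to a larger analysis geography.
area_to_zone = {"A01": "Zone 1", "A02": "Zone 1", "A03": "Zone 2", "A04": "Zone 2"}


def rate_by_zone(records, lookup):
    """Aggregate person records to user-defined zones and return the rate per zone."""
    counts = defaultdict(lambda: [0, 0])  # zone -> [ill, total]
    for area, ill in records:
        zone = lookup[area]
        counts[zone][0] += int(ill)
        counts[zone][1] += 1
    return {zone: ill / total for zone, (ill, total) in counts.items()}


rates = rate_by_zone(records, area_to_zone)
```

Swapping in a different lookup (for example, one built around travel-to-work areas rather than health areas) changes the geography without touching the person records, which is the flexibility the text describes.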

We are currently conducting a series of multivariate and multilevel analyses, in which we are using SAMs at different levels of geography to model differentials in ill health, unemployment and housing need - all applications where the ability to measure area effects at sub-district level is of obvious theoretical and policy interest. By repeating the analyses using different specifications of sample size, geography and variable detail, we will be in a position to establish the extent to which analysis is affected by loss of individual-level detail and so judge the optimal trade-off between sample size, variable detail and geography for a specified level of disclosure risk.

We are also looking at the potential to use SAMs to derive indicator variables for key policy areas, such as housing and health. The ability to derive these represents a clear advantage of microdata over small area tables where problems of the ecological fallacy often arise from the need to relate information provided in separate tables.

Geography

Apart from seeking views on the optimum size of SAM areas, an important task of the consultation process is to develop and prioritise criteria for defining their shape. A starting assumption in developing a SAM geography is that it should be compatible with the geography used in the published small area statistics. To this end we are using wards or aggregations of wards as the building blocks for constructing SAM areas. Important questions remain over the criteria used to group wards. Initially we are employing an algorithm developed by Professor Dave Martin of Southampton University to group areas according to principles of homogeneity. However, consultation will be sought on the extent to which this should be constrained e.g. to ensure contiguity of wards and the avoidance of shapes that appear contrived.
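To make the trade-off between homogeneity and contiguity concrete, the following is a minimal greedy sketch. It is not the algorithm developed by Professor Martin, whose details are not given here; the function name, the single homogeneity score per ward and the example data are all assumptions made for illustration. Each step takes the smallest group still below the population threshold and merges it with the adjacent group whose population-weighted score is closest.

```python
def group_wards(population, score, adjacency, threshold):
    """Greedily merge contiguous wards until every group reaches the threshold.

    population: ward -> resident count
    score:      ward -> value of one classification variable (homogeneity proxy)
    adjacency:  ward -> set of neighbouring wards
    Returns a list of frozensets, one per group.
    """
    groups = {w: {w} for w in population}   # group id -> member wards
    owner = {w: w for w in population}      # ward -> id of the group it belongs to

    def pop(g):
        return sum(population[w] for w in groups[g])

    def mean_score(g):
        return sum(score[w] * population[w] for w in groups[g]) / pop(g)

    def neighbours(g):
        return {owner[n] for w in groups[g] for n in adjacency[w]} - {g}

    while True:
        # Groups that are too small and still have somewhere to merge.
        small = [g for g in groups if pop(g) < threshold and neighbours(g)]
        if not small:
            break
        g = min(small, key=lambda x: (pop(x), x))  # smallest group first
        # Contiguity constraint: only adjacent groups are candidates.
        target = min(
            neighbours(g),
            key=lambda n: (abs(mean_score(n) - mean_score(g)), n),
        )
        merged = groups.pop(g)
        groups[target] |= merged
        for w in merged:
            owner[w] = target
    return [frozenset(members) for members in groups.values()]


# Hypothetical wards arranged in a line: W1 - W2 - W3 - W4 - W5.
population = {"W1": 3000, "W2": 2500, "W3": 6000, "W4": 2000, "W5": 4000}
score = {"W1": 0.10, "W2": 0.12, "W3": 0.30, "W4": 0.28, "W5": 0.50}
adjacency = {
    "W1": {"W2"}, "W2": {"W1", "W3"}, "W3": {"W2", "W4"},
    "W4": {"W3", "W5"}, "W5": {"W4"},
}

sam_areas = group_wards(population, score, adjacency, threshold=5000)
```

Because only neighbouring groups can be merged, every resulting area is contiguous by construction. A rule against contrived shapes, such as a compactness penalty, would need to be added to the choice of `target` and is left out of this sketch.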

Confidentiality

With geographical area widely recognised as one of the most powerful keys to individual identification, the careful assessment of disclosure risk is a fundamental requirement at all stages in the development of SAM proposals. Using the 1991 SARs as a base-line for acceptable risk, we are therefore testing the disclosure risk associated with each separate combination of sample size, variable detail and geography.
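One simple and commonly used proxy for disclosure risk is the share of records that are unique on a set of key variables. The sketch below is not the risk assessment method used in the project, which is not described here; the function name, the variables and the records are hypothetical. It shows only how a finer geography tends to increase the number of unique records.

```python
from collections import Counter


def unique_share(records, key_vars):
    """Share of records whose combination of key-variable values occurs only once."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    counts = Counter(keys)
    uniques = sum(1 for k in keys if counts[k] == 1)
    return uniques / len(records)


# Hypothetical sample records with a coarse and a fine area code.
records = [
    {"area": "A", "sub_area": "A1", "age": "30-34", "sex": "F"},
    {"area": "A", "sub_area": "A2", "age": "30-34", "sex": "F"},
    {"area": "A", "sub_area": "A1", "age": "60-64", "sex": "M"},
    {"area": "A", "sub_area": "A1", "age": "60-64", "sex": "M"},
]

coarse = unique_share(records, ["area", "age", "sex"])      # no record is unique
fine = unique_share(records, ["sub_area", "age", "sex"])    # two of four are unique
```

Uniqueness in the sample does not imply uniqueness in the population, so a measure of this kind is at best a starting point for comparing combinations of sample size, variable detail and geography.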

Consultation workshops

Finally, we are keen to ensure extensive consultation with those who have put forward the arguments for this SAR and to ensure that the detail provided in SAMs meets policy requirements. To this end, three workshops are being held, in Leeds (May 4), Edinburgh (June 9) and London (June 16), involving both academics and practitioners. The purpose of the workshops is to demonstrate, and seek views on, the characteristics and analytical potential of the SAM for different trade-offs of geography (areas of 5,000 to 30,000) and variable detail.

Workshop dates

Thursday May 4, University of Leeds

Friday June 9, COSLA, Edinburgh

Friday June 16, Greater London Authority

Conclusion

The SAMs project has the potential to provide an important bridge between the benefits of microdata and the traditional uses of aggregate statistics. Crucially, SAMs could provide a way of overcoming the restrictions imposed by the single geography in the 1991 SARs, and thereby extend the value of microdata to a wide range of potential users. They would also enhance the ability to apply sophisticated modelling methods which recognise the influence of locality on the lives of individuals.

Further details of the project and the workshops are available on our small area microdata project page.

 

ALTERNATIVE HOUSEHOLD CLASSIFICATIONS

FOR THE 2001 CENSUS

ESRC 2001 Census Development Programme

Clare Holdsworth, University of Liverpool, Rachel Leeser, London Research Centre/Greater London Authority and Angela Dale, University of Manchester

The Census provides a rich and unique data source for the UK population, yet it is an underused dataset. One of the reasons for this is that the information that users want is often not available in the format required, particularly for analysis based on households. Consider the following two examples:

Given the complex nature of the information required in such cases, it is often not available in standard outputs, and the only solution, at some cost, may be to ask ONS to derive a new complex variable. This requires a considerable amount of work in creating appropriate algorithms.

It would therefore be far more satisfactory if ONS were able to anticipate users' needs by providing a wider range of household classifications, made available either in the standard output or in the data dictionary, from which bespoke tables could easily be produced.

In recognition of this, the Economic and Social Research Council are funding a research programme as part of their 2001 Census Development Programme, to consult with users regarding their requirements for alternative household classifications for the 2001 Census.

Consultation Workshops

If you are working in the public or voluntary sector and are interested in this issue, we would like to invite you to participate in this consultation exercise and to have your say as to what household classifications you would like to see made available from the 2001 Census.

Monday May 8, Greater London Authority (full day)

Friday June 9, COSLA, Edinburgh 12-2pm

Tuesday June 13, University of Manchester (afternoon)

Please contact Jo.Wathan@man.ac.uk for more details. Tel: 0161 275 4975

 

The Census of Population: 2000 and beyond

Dalton-Ellis Hall, University of Manchester

22 – 23 June 2000

This conference will bring together census takers and census users from around the world. There will be a maximum of 70 participants to ensure plenty of opportunity for discussion and debate. Sessions will combine methodological innovations with substantive issues.

The conference will take place at Dalton-Ellis Hall, which is set in its own landscaped grounds conveniently situated for both the University campus and the city centre.

An informal programme of events will be arranged for visitors who will be staying on in Manchester on Friday night and Saturday.

Registration Details

There is a registration fee of £100.00, which will be waived for speakers.

Low-cost accommodation is available at Dalton-Ellis Hall, bookable through CCSR, or delegates can arrange their own accommodation from a list of approved hotels.

 

INNOVATIONS WITH AREA CLASSIFICATIONS

Exploring individual and spatial variation using a multilevel modelling approach

A one day workshop to be held in the Small Training Room, Manchester Computing

Wednesday 21 June 2000

Mark Tranmer and Ed Fieldhouse

The fee for attending this workshop is £50.00, or £35.00 if you are also attending the two-day conference on ‘The Census of Population: 2000 and beyond’, to be held at Dalton-Ellis Hall on 22/23 June 2000.

Please note that places are extremely limited and early booking is advisable.

 

 

AN AGE SCHEDULE OF MIGRATION RATES IN 1991

Ludi Simpson, Bradford Council and CCSR University of Manchester

There is no public source for a detailed age schedule of migration rates for Britain as a whole, as government demographers tend to deal with flows of migrants rather than propensities or probabilities of migrating, and to report the flows in five-year age bands. A single-year-of-age schedule of migration for the year preceding the 1991 Census is, however, available from the Samples of Anonymised Records, as shown in the graphs and table below, which are drawn from the resident records on the 2% SAR. Migration within Britain could be separately distinguished for different categories of distance moved, or for moves within and between standard regions, but moves within Districts cannot be separately counted from the 1991 SARs.

These data have been useful at Bradford Council in three practical ways as part of its forecasting with the software POPGROUP¹:

The shape of the graph for migration within Britain shows peaks for pre-school ages, the young labour force and oldest ages, and has already been published separately for males and females in Population Trends². In work at Bradford the rate for persons has been used because the difference between males and females, particularly at young adult ages, has been shown to be due to undercount of migrating males³. The graphs here do not include any adjustment for undercount.

The graph for immigration shows the same shape but without the peak at older ages (presumably British institutional care does not attract people from outside Britain). The data are more sparse: a three-year average is given here to smooth the trend. A rate of emigration cannot be derived, unless all the censuses from the rest of the world could be examined for migrants from Britain.

Because emigrants are lost to the census, the denominator for the within-Britain rates cannot include all those ‘at risk of migrating’ at the beginning of the year before the census. For immigrants, by convention, the denominator is also the population within Britain rather than those in the rest of the world who were really ‘at risk’ of emigrating.

The definition of migration and of rates derived from it has been very usefully schematised and developed recently in work from Leeds University⁴. One problem with the 1991 Census counts of migrants is that they exclude those who were born in the year before the census (the ‘new-born’). The ten new-born migrants recorded in the SAR are unexplained and ignored here (see table). The unmeasured rate of migration for the new-born can be expected to be at least equal to the rate for those who were age 1 at the Census. As the new-born are alive on average for only half the year, this is usually expressed as a probability of moving during the year that is one half of that for those who were age 1 at the end of the year.
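The rate conventions described in this article can be sketched in a few lines. The counts below are invented for illustration and are not taken from the SAR; the function names are likewise hypothetical. The three-year average is read here as a moving average over three adjacent single years of age, which is an assumption about how the smoothing was done.

```python
def migration_rates(migrants, population):
    """Migrants in the year before the census per person enumerated, by age.

    By the convention described in the text, the denominator is the
    population within Britain at the census, not those originally at risk.
    """
    return {age: migrants[age] / population[age] for age in population}


def three_year_average(rates):
    """Centred moving average over three adjacent ages; end ages use what exists."""
    smoothed = {}
    for age in sorted(rates):
        window = [rates[a] for a in (age - 1, age, age + 1) if a in rates]
        smoothed[age] = sum(window) / len(window)
    return smoothed


def newborn_probability(rates):
    """Convention for the new-born: half the rate observed at age 1."""
    return 0.5 * rates[1]


# Hypothetical sample counts by age at the census.
population = {1: 1000, 2: 1000, 3: 1000}
migrants = {1: 150, 2: 120, 3: 90}

rates = migration_rates(migrants, population)
smoothed = three_year_average(rates)
newborn = newborn_probability(rates)
```

With these invented counts the age-1 rate is 0.15, so the assumed probability for the new-born is 0.075.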

1 John Andelin and Ludi Simpson (1999) POPGROUP: population forecasting system reference manual. Bradford Council Policy and Research Unit, City Hall, Bradford BD1 1HY.

2 Tony Champion (1996) Population review: (3) Migration to, from and within the United Kingdom, Population Trends 83: 5-16.

3 Stephen Simpson and Elizabeth Middleton (1999) Undercount of migration in the UK 1991 Census and its impact on counterurbanisation and population projections, International Journal of Population Geography 5: 387-405.

4 Phil Rees, Martin Bell, Oliver Duke-Williams and Marcus Blake (1999) The measurement of migration intensities. Paper to British Society for Population Studies 6-8th September, Dublin. Available from the authors at the School of Geography, University of Leeds.

PUBTRAWL

As part of the annual reporting requirements attached to the use of the SARs, all registered users were contacted earlier this year and asked to provide us with details of publications. If you have not already replied, we should be grateful if you could do so as soon as possible. The information will be added to the SARs publications list maintained by the Census Microdata Unit, a copy of which is available on our website at jointpub.html. In addition, the updated list will be circulated with the next edition of this newsletter.

COURSES

For the past few years, CCSR has run a series of one-day courses on the analysis of survey and census microdata, and research design. These courses have been very successful and have been attended by people from local authorities, health authorities and commercial organisations as well as by academics.

Since September 1999, we have been offering an MA(Econ) and postgraduate diploma in Social Research Methods and Statistics, with most courses also available to external students. We now have SARs training material available on CD for use with SPSS. This includes exploratory analysis of the Individual SAR and hierarchical analysis of the Household SAR. A detailed workbook and two datasets are available for £30.00. It is necessary to become a registered SAR user before obtaining this CD.

With these developments in mind, we have decided not to run any further short courses this year but to concentrate on restructuring and redesigning our current provision with the aim of offering a new programme of short courses from January 2001. However, the course materials can be purchased from CCSR at £5.00 per set. If you would like to be notified of our new programme as soon as it is available, please contact Nasira.Asghar@man.ac.uk or telephone 0161 275 4736.

MA IN SOCIAL RESEARCH METHODS AND STATISTICS

Details of our new MA(Econ) and postgraduate diploma in Social Research Methods and Statistics are included with this Newsletter. The MA provides a firm grounding in advanced quantitative methods, taught within an applied social science framework. It is designed to be accessible to non-statisticians, yet more focussed than most of the existing master's courses in social research methods. In particular it offers courses in methods of longitudinal data analysis and advanced survey methods that are not available on most other comparable courses. A number of individual courses which may be of interest to practitioners can be taken separately as day-release courses over 12 weeks.

Further information is available from:

Dr Mark Brown, Programme Director, Tel 0161 275 4780, Email mark.brown@man.ac.uk

Ms Margaret Murray, SRMS Course Secretary, Tel 0161 275 4589, Email margaret.martin@man.ac.uk.


SEMINARS

Special seminar - health statistics

9 May 2000, 4.00pm, Room 3.51a, Third Floor, Williamson Building

National health statistics: what do we have and what do we want?

Alison Macfarlane, Medical Statistician, National Perinatal Epidemiology Unit, Oxford

Please contact Ruth Durrell, Tel: 0161 275 4721, Email: R.Durrell@man.ac.uk for further information


 

These pages are maintained by the SARs support team.