Recent years have seen a sharp increase in concern over confidentiality – not just in the statistical offices of the UK but also in NSIs around the world. In part this is influenced by a recognition of the increasing amount of information available through stored databases and on the web and also the increasing power of search engines and data mining techniques. It also reflects increased concerns about privacy and the legal requirements of the census offices to ensure that the guarantee of confidentiality given on the census form is upheld. This increased concern over disclosure risk has resulted in a decision that small cells from the tabular output from the 2001 Census for England, Wales and Northern Ireland should be adjusted to either zero or 3. It has also resulted in less detail being released in the 2001 SARs.
The ONS protocol for Data Access and Confidentiality sets out the overarching principles that relate to confidentiality. The general framework by which ONS protects census data is given at www.statistics.gov.uk/census2001/discloseprotect.asp.
Confidentiality and disclosure control in the 2001 SARs
Protecting confidentiality: statement on ONS web site
The 2001 Samples of Anonymised Records have been subjected to a more extended analysis of disclosure control than that used with the 1991 SARs. A number of scenarios were set up as the likely routes through which an attack on the microdata might be made. For example, one scenario used the information that a journalist may be able to obtain on an individual, whilst another related to the information held on databases used by commercial companies. Sets of key variables – which could be used to match to the same variables in the SARs – were identified, based on these scenarios. Then, using the full 2001 Census data, the records in the SAR that were population unique on the key variables were determined. The percentage of records that were population unique was used as the primary measure of risk.
Considering population uniques meant that only matches which were certain were counted as a risk. Matches which were correct by chance were not counted as a risk. After the proportion of population uniques had been reduced by collapsing variables, the most risky records were perturbed by changing the values of one or two variables. To control information loss the proportion of records perturbed was small. At least two-thirds of the population uniques in the file were perturbed. A more detailed discussion is available in PDF format.
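The risk measure described above can be illustrated with a minimal sketch. It assumes the full population is available as a dictionary of records, as it was to ONS through the full Census data. The function name, the toy records and the choice of key variables are hypothetical and are not taken from the ONS analysis.

```python
from collections import Counter

def population_unique_rate(population, sample_ids, key_vars):
    """Share of sampled records whose key-variable combination
    occurs exactly once in the whole population."""
    def key(rec):
        return tuple(rec[v] for v in key_vars)

    # Count every key combination across the full population
    counts = Counter(key(rec) for rec in population.values())
    unique = [i for i in sample_ids if counts[key(population[i])] == 1]
    return len(unique) / len(sample_ids), unique

# Hypothetical toy population: record id -> attributes
population = {
    1: {"age": 34, "sex": "F", "tenure": "own"},
    2: {"age": 34, "sex": "F", "tenure": "own"},
    3: {"age": 71, "sex": "M", "tenure": "rent"},
    4: {"age": 19, "sex": "M", "tenure": "rent"},
}
rate, unique_ids = population_unique_rate(population, [1, 3], ["age", "sex", "tenure"])
```

Here record 1 shares its key with record 2, so only record 3 counts as a certain match. Collapsing a variable (for example, banding age) can only merge key combinations, which is why recoding reduces the rate.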
Perturbation in the SARs
This section is based on a paper by Gross, Guiblin and Merrett presented at the Open Meeting on the 2001 SARs, September 2004.
Recoding variables determined as risky is the main disclosure control method used for the SARs. However, there comes a point where further recoding will cause a large decrease in the information released for little decrease in disclosure risk. At this point ONS decided to use PRAM (the Post-Randomisation Method), a perturbative microdata disclosure control technique that can be applied to categorical variables. The values of some categorical variables for certain records in the microdata file are changed to a different value according to a prescribed probability mechanism. Each new value may or may not be different from the original value. The key aspect of the PRAM method used is that it conserves the original frequency distributions while minimising the loss of information.
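As a rough illustration of how a post-randomisation step can leave the frequency distribution unchanged in expectation, the sketch below uses one simple textbook-style construction: each value is kept with a fixed probability and is otherwise redrawn from the observed distribution of the variable. This is an assumption made for illustration only. It is not the ONS mechanism, which used an optimisation process and stratification by control variables, and the function names and the retention probability are hypothetical.

```python
import random

def invariant_pram(values, retain_prob, rng):
    """Keep each value with probability retain_prob; otherwise redraw it
    from the observed distribution (the redraw may return the same value)."""
    pool = list(values)  # empirical distribution used for redraws
    return [v if rng.random() < retain_prob else rng.choice(pool) for v in values]

def expected_counts(values, retain_prob):
    """Expected post-PRAM frequency of each category, i.e. t P with
    P[src][dst] = retain_prob * [src == dst] + (1 - retain_prob) * t[dst] / n."""
    n = len(values)
    t = {c: values.count(c) for c in set(values)}
    return {
        dst: sum(
            t[src] * (retain_prob * (src == dst) + (1 - retain_prob) * t[dst] / n)
            for src in t
        )
        for dst in t
    }

# Hypothetical tenure variable with 10 records
tenure = ["own"] * 6 + ["rent"] * 3 + ["other"] * 1
perturbed = invariant_pram(tenure, retain_prob=0.8, rng=random.Random(0))
expected = expected_counts(tenure, retain_prob=0.8)
```

With this transition matrix the expected counts equal the original counts for every category, although any single realisation can differ slightly. Applying the same step separately within strata is one way to protect selected multivariate distributions as well.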
The number of records PRAMmed for each variable is given in Table 1. Values which were PRAMmed were flagged as imputed, but not so as to distinguish them from already imputed values.
Table 1: Number of records PRAMmed for each variable by Country
| Variable | England and Wales (1,626,324) | Northern Ireland (50,889) | Scotland (164,307) |
| --- | --- | --- | --- |
| Age | 7,510 | 302 | 961 |
| Distance to work | 2,414 | 134 | Not PRAMmed |
| Marital Status | 7,939 | 486 | 95 |
| Number of cars in household | 2,589 | 128 | 57 |
| Number of earners in household | 726 | 53 | 36 |
| Number dependent children in household | 2,157 | 111 | Not PRAMmed |
| Number of residents in household | 2,554 | 114 | 40 |
| Primary economic position | 12,977 | 562 | 364 |
| Tenure | 4,348 | 245 | 96 |
| Workplace | 240 | 12 | Not PRAMmed |
| Address last year | 5,251 | 258 | Not PRAMmed |
| Ethnicity | 5,421 | Not PRAMmed | 271 |
| Industry | 7,143 | 237 | 147 |
| Long term illness | 313 | Not PRAMmed | Not PRAMmed |
| Occupation | 5,324 | 217 | 429 |
Effect of PRAM on data quality
Three aspects of data quality were examined.
1 The invariance property: preservation of the univariate distributions. This was checked by looking at univariate frequency tables pre- and post-PRAM and by looking at the transitions (cross frequency between original variable and PRAMmed variable).
2 Preserving the multivariate frequencies within subgroups of variables. This was checked by looking at multi-way tables. To achieve this criterion we used a series of control variables defining strata within which PRAM was performed (e.g. Age within strata defined by workplace, Econprim and marital status).
3 Preserving the relationship between PRAMmed variables and non-PRAMmed variables. This was not controlled, but the damage has been assessed by comparing frequency tables before and after PRAM.
The conclusions are:
1 The univariate distributions for PRAMmed variables were not damaged. The frequencies were well preserved, which means that the optimisation process worked well.
2 The multivariate distributions between variables involved in the PRAM process (PRAMmed and control variables) were well preserved too. This means that the stratification and the control on transitions were efficient.
3 The assessment of the damage on distribution between variables involved and not involved in the PRAM process was measured by comparing tables before and after PRAM and by comparing the impact of PRAMming relative to the sampling error. The ratio between the relative error due to PRAM and relative sampling error was calculated for each cell. When the ratio is lower than 1 the additional error due to PRAM can be considered as acceptable. PRAM is more damaging (relative error due to PRAM > Relative Sampling Error) for cells which have low frequencies.
Below is an example of the loss of information affecting a three-way table: industry/sex/ethnic group. It showed a much larger loss of information than other tables examined. For each cell size of this 3-way table of 181 cells we measure the ratio between the error due to PRAM and the Sampling Error.
| Cell frequency before PRAM | 0-5 | 6-10 | 11-20 | 21-40 | 41-90 | 91-150 | 150-500 | 500+ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of cells where ratio > 1 | 6 | 3 | 9 | 4 | 2 | 0 | 4 | 3 |
PRAM is more damaging for cells with low frequencies. Six cells with a frequency of 5 or less have been damaged more by PRAM than by sampling error. (Note: an increase of 1 person after PRAM in a cell of 5 before PRAM represents a change of 20 per cent, but an increase of 1 person after PRAM in a cell of 50 before PRAM represents a change of 2 per cent.)
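The per-cell ratio can be sketched as follows. The source does not give the formula used for the relative sampling error, so the version here is an assumption: it treats the sample as a Bernoulli sample with fraction f, which gives a relative standard error of sqrt((1 - f) / n) for a cell count n. The sampling fraction and all function names are hypothetical.

```python
import math

def relative_pram_error(before, after):
    """Relative change in a cell count caused by PRAM (undefined for empty cells)."""
    return abs(after - before) / before

def relative_sampling_error(count, sampling_fraction):
    """Assumed relative standard error of a sample cell count under
    Bernoulli sampling: Var(count) ~ count * (1 - f)."""
    return math.sqrt((1 - sampling_fraction) / count)

def pram_to_sampling_ratio(before, after, sampling_fraction):
    """Ratio of PRAM error to sampling error; below 1 is treated as acceptable."""
    return relative_pram_error(before, after) / relative_sampling_error(
        before, sampling_fraction
    )

F = 0.03  # hypothetical sampling fraction, for illustration only
small_cell = pram_to_sampling_ratio(5, 6, F)    # 20 per cent change in a cell of 5
large_cell = pram_to_sampling_ratio(50, 51, F)  # 2 per cent change in a cell of 50
```

Under this assumption the relative PRAM error of a fixed one-person change shrinks in proportion to 1/n while the sampling error shrinks only in proportion to 1/sqrt(n), so the ratio is larger for small cells, in line with the pattern reported in the table.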
Disclosure control report: Special Licence Household SAR
(Source: ONS)
Introduction
The original SARs project plan for the Household SAR was to produce a licensed file that would be accessible under the End User Licence at the UK Data Archive, in the same way as the Individual Licensed SAR and the Small Area Microdata (SAM) file. The original disclosure risk assessment of the Household SAR indicated that the file presented a very high risk of disclosure, with a high percentage of the households, and of the individuals within those households, being population unique. Possible recodes were suggested to reduce the disclosure risk to a level acceptable for an End User Licence file. However, implementing the recodes would have rendered the file of little value to users, largely because age was the main variable to be recoded. Providing the user community with a file that met their requirements (single years of age for all household sizes) meant producing a more disclosive file, which could not be released under the existing End User Licence arrangement.
At the time the Household SAR was being assessed, ONS was also reviewing its policy on microdata access generally, and approved access to more detailed social survey datasets under arrangements that combine (i) statistical disclosure control methods with (ii) legally binding agreements to protect data confidentiality. Approval to use this ONS Special Licence for the Household SAR was sought and granted. The use of the ONS Special Licence enables a more useful Household SAR to be produced. The overall protection of the data is thus partly through design, and partly through licensing, with control over the user and their intended use.
In the proposal to use the ONS Special Licence for the Household SAR, scenarios were outlined which the released data should be protected against. The consensus was that the risk of deliberate attempts at matching by researchers was very low. In other words, the process of approval for the Special Licence, the accountability of the institution and the severe impact of penalties for improper use allow us to trust the researcher, and provide protection against intentional confidentiality breaches. The physical security arrangements required by the licence are also considered adequate. The concern was the likelihood of careless or negligent actions by the researcher, particularly where the requirements of physical security are contrary to their normal work habits. Failure through negligence to adhere strictly to the licence conditions could lead to unauthorised access, and is the main difference in risk between the Special Licence and access under a safe setting. It was also agreed that the data should be protected against spontaneous recognition of publicly well known individuals, both by researchers and through any unauthorised access by a third party.
Assessment of disclosure risk
The assessment of the disclosure risk of the Special Licence Household SAR has involved examining two strands: firstly, assessing the risk from a private database match, and secondly, assessing the risk of spontaneous recognition of publicly well known individuals.
(i) Private Database
A risk analysis of the Household SARs was carried out to provide a quantitative measure of the disclosure risk. The risk measure used is the percentage of population uniques in the file at household and individual level. The characteristics used to define unique records are determined by key variables, which in turn depend on the intruder scenario being considered. Key variables are a set of variables that can be used to identify a particular unit (e.g. household or person) for a given intruder scenario. A unit record in a microdata file that is unique in the population on a set of key variables is highly likely to allow the unit corresponding to that record to be successfully identified, and all the information on the file about them made available to the intruder.
Households are defined as unique in the population if the particular combination of characteristics of the individuals in the household is not found in any other household. If a household unit is a population unique, all individuals within the household are also counted as population uniques, as they can be identified through the household. The chances of a household being population unique increase with the size of the household. Variables coded with more detailed categories will generally give rise to more population uniques than more aggregated coding.
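The household-level definition can be sketched by building an order-independent key from the characteristics of all members. The data layout, the function names and the toy households below are hypothetical, and the variables shown are only a subset of the key used in the actual analysis.

```python
from collections import Counter

def household_key(household, household_vars, person_vars):
    """Household-level values plus the sorted multiset of member
    characteristics, so the order of members does not matter."""
    hh_part = tuple(household[v] for v in household_vars)
    people = sorted(tuple(p[v] for v in person_vars) for p in household["members"])
    return hh_part + (tuple(people),)

def population_unique_households(households, household_vars, person_vars):
    keys = {
        hid: household_key(h, household_vars, person_vars)
        for hid, h in households.items()
    }
    counts = Counter(keys.values())
    unique_ids = {hid for hid, k in keys.items() if counts[k] == 1}
    # Every member of a unique household is also counted as unique
    unique_people = sum(len(households[hid]["members"]) for hid in unique_ids)
    return unique_ids, unique_people

# Hypothetical toy population of three households
households = {
    "A": {"tenure": "own", "members": [{"age": 40, "sex": "F"}, {"age": 42, "sex": "M"}]},
    "B": {"tenure": "own", "members": [{"age": 42, "sex": "M"}, {"age": 40, "sex": "F"}]},
    "C": {"tenure": "rent", "members": [{"age": 40, "sex": "F"}, {"age": 42, "sex": "M"},
                                        {"age": 9, "sex": "F"}]},
}
unique_ids, unique_people = population_unique_households(households, ["tenure"], ["age", "sex"])
```

Households A and B share a key, so only C is unique, and its three members are all counted as individual uniques. Each extra member adds another tuple to the key, which is why larger households are more likely to be unique.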
The risk analysis uses the private database scenario to maintain consistency with the Individual SAR. The private database was slightly modified to adjust for the household structure. The modified key then consists of:
Household variables: household size, country, tenure, number of cars
Individual variables: age, sex, marital status, primary economic position
Various coding schemes were considered for the Household SAR.
The coding scheme which has been adopted for the Household SAR is:
Age is in two year bands and top-coded at 80 years
Marital status has 5 categories
England and Wales are combined
Other key variables are coded as for the SAM
Remaining variables are coded as for the Individual SAR
No data is provided for Scotland and Northern Ireland as the disclosure risk for these countries was deemed to be high even under special licence conditions.
The level of population uniques in the file was 12% of households and 21% of individuals. This compares to 1% of population uniques under the End User Licence. Table 1 shows the risk measures for all the coding options considered. These are converted to the risk-utility framework in Figure 1 (as used in Duncan et al, 2001). Coding schemes have been ordered by the amount of detail available; however, as we have not used a quantitative measure, the data utility axis is not to scale. The map shows disclosure risk decreasing as data utility is reduced. The lowest risk file (at 1%), with 10 year age bands and households over size 9 removed, did not meet the needs of data users.
Most households of size 6 and over were found to be population unique. Figure 2 shows the percentage of households which are population unique for 3 coding schemes by household size. For individual years of age and sex it can be seen that the percentage of households which are population unique substantially increases from size 4. For the modified private scenario the percentage of households which are population unique substantially increases at size 6. However, for the coarsest coding scheme the percentage of households which are population unique does not substantially increase until size 9.
The level of risk as shown by coding scheme 3 in Table 1 was felt to be acceptable given the greater reliance on contractual arrangements provided by the Special Licence and provided a good balance between disclosure risk and data utility.
Figure 1. Risk-Utility Map for the Household SAR

Note: Data Utility is expressed in the coding schemes as explained in Table 1
Table 1: Risk for the Household SAR with different coding options (private database scenario)

(1) these are population figures, but will provide a good estimate for the sample.
(4) considers only two variables, and is not a full scenario analysis.
Figure 2: Percentage of households that are population unique for various keys by household size

(ii) Spontaneous recognition of individuals who are well known publicly
This risk assessment looks at the risk of being able to identify well known individuals in the file. It was decided that it would be difficult to identify unusual households, as few are well known by virtue of household characteristics. The analysis concentrated on looking for individuals who might be well known publicly at the national level.
Note that for the Special Licence we do not aim to protect the file against recognition of a "community of acquaintances", i.e. individuals or households that might be identified by particular "intruders". The researchers agree not to attempt to identify individuals, under the terms of the licence.
We were concerned only with those variables on the file that are visible and traceable. The 7 variables for the modified private database cross-match scenario are Household size, Tenure, Number of cars, Age, Sex, Marital Status and Primary Economic Position. Of these, we considered that only Household size, Age, Sex and Marital Status would be used to try to identify a publicly well known individual. Other key variables that we felt might lead to recognition of a publicly well known individual on the Household SAR, and which are not contained in the above private database scenario, were Ethnicity, Religion, Country of birth, Occupation and the International Standard Classification of Occupations (ISCO). Thus, the variables that we considered for spontaneous recognition of publicly well known individuals were:
Household size - top coded at size 12
Age - 2 year bands
Sex - 2 categories
Marital Status - 5 categories
Ethnicity - 16 categories
Religion - 9 categories
Country of birth - 16 categories
Occupation - SOC minor 81 categories
ISCO - 3 digit level
To examine the risk of being able to identify well known individuals, two analyses were undertaken:
• The univariate distributions of the visible and traceable variables not on the modified private database scenario were examined at the national level. We were looking for categories of variables with low numbers in particular, as this gave us an indication of where the disclosure risks were.
• All population uniques in the sample were identified for all three way combinations that include occupation. We considered that occupation was the main variable that, in combination with other variables, would contribute to recognising a well known individual at a national level. In our judgement it was sufficient to restrict the analysis to occupation and 2 other variables. We were not considering all possible combinations of the 9 visible and traceable variables.
The results of these analyses showed that it would be very difficult to identify a publicly well known individual based on the level of detail provided in the Household SAR. Therefore for variables not on the private database scenario it was decided they can be coded to the same level of detail as for the Individual SAR. There were however 6 records in the Household SAR that were population unique for a combination of an occupation typical of a well known individual and two other variables. The most identifying variable for an individual was changed manually to protect the 6 records.
Disclosure control methods applied
Recoding has been the main disclosure control method applied to the Household SAR. Only those variables which are on the private database have been subject to recoding; in addition, no individual level information is provided for households of size 12 or more. In addition to the recoding, a small amount of perturbation has been applied to the data to protect confidentiality, using the same methods as for the Individual SAR (see Bycroft et al, 2005).
(i) Recoding
For the variables on the modified private database scenario some recoding was applied:
Tenure - this was recoded from 10 categories to 3.
Number of cars - this was recoded from 5 categories to 3.
Age - this has been banded into two year age groups and top-coded at 80.
Marital status - this was recoded from 6 categories to 5.
Primary Economic Position - this was recoded from 16 categories to 4.
Household Size - this was top coded at 12. For households of up to and including size 11, there will be one record for each member of the household. For households of size 12 or more only 1 record providing summary information will be provided for the household.
For full details of the coding of the variables please see appendix A.
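The age and household size rules above can be sketched as two small functions. The alignment of the two year bands (0-1, 2-3 and so on) and the label format are assumptions made for illustration; the actual category definitions are those given in appendix A. The function names are hypothetical.

```python
def recode_age(age, band_width=2, top_code=80):
    """Band age into fixed-width groups and collapse all ages at or
    above the top code into one category.
    Assumption: bands start at even ages (0-1, 2-3, ...)."""
    if age >= top_code:
        return f"{top_code}+"
    lower = age - age % band_width
    return f"{lower}-{lower + band_width - 1}"

def released_record_count(household_size, summary_threshold=12):
    """One record per member up to size 11; a single summary record
    for households of size 12 or more."""
    return household_size if household_size < summary_threshold else 1

examples = [recode_age(a) for a in (0, 37, 79, 80, 93)]
```

Both rules reduce the number of distinct key combinations, which is what lowers the share of population uniques.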
Individual and household level edits
After the perturbation had been applied to the 5% of individuals in large households and the modifications to the 6 risky records from the visible and traceable analysis had been made, individual and household level edits were checked to ensure that no invalid combinations had occurred.
The individual level edits that were checked were the same as those agreed for the Licensed Individual SAR. These edits were derived based on the edits used by Census in creating the 2001 Census database and some additional edits that we felt should be checked. These edits checked to ensure that no invalid 'individuals' were created such as a 2 year old married person.
The household level edits that were checked were: (i) the age difference between spouses should not be greater than 30 years, (ii) the age difference between parent and child should be 16 years or greater, and (iii) the age difference between grandparent and grandchild should be 32 years or greater. These edits were used to ensure that the perturbation applied to age did not create any new extreme households.
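A minimal sketch of how such household level edits could be checked is given below. It reads rule (ii) as a minimum gap of 16 years between parent and child, which is an assumption, and the relationship labels, tuple layout and function name are hypothetical rather than the ones used in the Census edit system.

```python
def household_edit_failures(relations):
    """relations: list of (relationship, first_age, second_age) tuples, where
    the first person is the parent or grandparent for the directional rules."""
    rules = {
        "spouse": lambda gap: abs(gap) <= 30,              # rule (i)
        "parent_child": lambda gap: gap >= 16,             # rule (ii), assumed minimum
        "grandparent_grandchild": lambda gap: gap >= 32,   # rule (iii)
    }
    failures = []
    for relationship, first_age, second_age in relations:
        gap = first_age - second_age
        if not rules[relationship](gap):
            failures.append((relationship, first_age, second_age))
    return failures

# Hypothetical household relations after perturbation of age
checked = household_edit_failures([
    ("spouse", 70, 35),
    ("parent_child", 40, 20),
    ("grandparent_grandchild", 50, 30),
])
```

Running the same check on the file before and after perturbation, and comparing the two failure lists, separates failures created by the perturbation from those already present in the original data.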
The results of running both the individual and household level edits were that only one new additional failed edit occurred through the perturbation. All the other failed edits already existed in the original data so it was decided that these would not be corrected.
The Edit list contained some edits that were failing on the original census records. We recommend that the edits be revised for the 2011 SARs. Better coordination at an earlier stage between the edits used for the Census and those for the SARs would assist in the development of the SARs.
Conclusions and Recommendations for 2011
The Household SAR was the first ONS dataset to undergo a quantitative assessment of disclosure risk for access under the Special Licence. Decisions were made without the benefit of any previous experience of the new licence.
For 2011 there is a need to reassess the balance between the protection provided by Special Licence and the recoding of the data. If ONS is comfortable with the Special Licence arrangement and there is a demand from researchers for more detailed data then, for example it may be possible to provide single years of age.
Disclosure control report: Small Area Microdata (SAM)
(Source: ONS)
Introduction
The Small Area Microdata (SAM) file is a 5% sample of individuals from the 2001 Census. Geography area identification is at Local Authority level for England, Wales and Scotland and Parliamentary Constituencies for Northern Ireland. The case for the Small Area Microdata file was put to ONS in January 2001 by Tranmer et al (2005). Users of the 1991 Individual SAR had noted that the geographical units identifiable were often larger than desirable for the type of analysis that they wished to conduct.
The SAM is accessible under conditions similar to those employed for the Individual Licensed SAR. The SAM is similar to the Individual Licensed SAR, except that it is larger and at a finer level of geography, and similar methodological practices were used in the production of this file. This has enabled a smoother and quicker production process than for the other SAR products. There was a trade off between the constraints of confidentiality and the amount of individual level detail that could be provided in the SAM.
Assessment of disclosure risk
The assessment of the disclosure risk of the licensed SAM has involved examining two strands: firstly, assessing the risk from a private database match, and secondly, assessing the risk of spontaneous recognition of individuals.
A risk analysis of the SAM was carried out to provide a quantitative measure of the disclosure risk. The risk measure used was the percentage of individuals in the file which were population unique. The characteristics used to define unique records are determined by key variables, which in turn depend on the intruder scenario being considered. Key variables are a set of variables that can be used to identify a particular unit (e.g. household or person) for a given intruder scenario. A unit record in a microdata file that is unique in the population on a set of key variables is highly likely to allow the unit corresponding to that record to be successfully identified, and all the information on the file about them made available to the intruder.
(i) Private Database
The risk analysis uses the private database scenario to maintain consistency with the individual SAR. The private database contains the following variables: Local Authority District (Parliamentary constituencies for NI), Age, Sex, Distance to Work, Marital Status, Number of Cars, Number of Earners, Number of dependent Children, Number of Residents, Primary Economic position, Tenure, Workplace
Three coding schemes were initially considered for the SAM: a fine, a coarse and a coarse plus (coarse specification plus two additional recodes) specification, which resulted in 26.5%, 3.5% and 2.5% of population uniques respectively for England and Wales. The coarse plus specification was decided on as the percentage of population uniques was similar to that for the Individual SAR before perturbation was applied. The coarse plus specification was put out for consultation with users and the resulting specification for the SAM can be seen in appendix A. This specification resulted in 2.5%, 3.1% and 4.2% of records being population unique for England, Wales and Northern Ireland respectively.
As part of the disclosure risk assessment of the SAM a special uniques assessment was run. The special uniques assessment was used to decide which records and which variables would be selected for perturbation. For further details on the special uniques methodology see Elliot (2004).
The Data Intrusion Simulation metric (DIS) is a file level measure of risk that gives the probability that a match made by an intruder between an individual and a sample unique in the microdata file is correct. The latest special uniques algorithm combines the SUDA score and the DIS metric to generate estimated per record matching probabilities.
The first step of the special uniques assessment is to compute a SUDA score for each record, based on the number and size of its minimal sample uniques (MSUs). An MSU is a set of variable values which is unique in the sample and for which no subset is unique. Each record may have a number of MSUs within a given key variable set. Using the information about the number and size of the MSUs, a SUDA score for each record is derived. This score is then heuristically combined with the output from the DIS metric to give a per record matching probability (dis_suda score).
For the SAM the special uniques assessment identified all the minimal sample uniques for that record within the private database cross-match scenario. As we had the population data it was possible for us to identify which records in the SAM were population unique for the private database scenario. We then used the results of this special uniques assessment to grade these population uniques, with the highest score being given to the "most" risky population uniques. A population unique resulting from a small number of variables will have a higher score than one that results from a larger number of variables. We use this ranking of records to select the highest risk records for perturbation.
The special uniques assessment also told us which variables were contributing the most to the risk for each record; this is based on the number of times a variable occurs in the set of minimal uniques for that record. We used this information in selecting which variable will be prammed for a record. We experienced some difficulty with running the special uniques analysis on such a large file with many variables including LA. We solved this problem by splitting the files by GOR and looking for an MSU of size 6 or less out of 12 variables in the scenario.
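The MSU search and the per-variable contribution count described above can be sketched with a brute-force enumeration. This illustrates only the definition of a minimal sample unique; it is not the SUDA algorithm itself, and it does not compute the SUDA or DIS-SUDA scores, whose formulas are not given here. The function name and toy sample are hypothetical.

```python
from collections import Counter
from itertools import combinations

def minimal_sample_uniques(records, index, variables, max_size):
    """Return the MSUs (as tuples of variable names) of records[index],
    searching variable subsets of up to max_size variables."""
    target = records[index]
    msus = []
    for size in range(1, max_size + 1):
        for subset in combinations(variables, size):
            # A superset of an MSU already found is unique but not minimal
            if any(set(m) <= set(subset) for m in msus):
                continue
            matches = sum(all(r[v] == target[v] for v in subset) for r in records)
            if matches == 1:
                msus.append(subset)
    return msus

# Hypothetical toy sample
sample = [
    {"age": "30-31", "sex": "F", "tenure": "own"},
    {"age": "30-31", "sex": "M", "tenure": "rent"},
    {"age": "40-41", "sex": "F", "tenure": "rent"},
    {"age": "40-41", "sex": "M", "tenure": "own"},
]
msus = minimal_sample_uniques(sample, 0, ["age", "sex", "tenure"], max_size=2)
# How often each variable appears in the record's MSUs
contributions = Counter(v for m in msus for v in m)
```

For the first record no single variable is unique, but every pair is, so the record has three MSUs of size 2. Capping `max_size`, as was done with size 6 for the SAM, bounds the number of subsets that have to be examined.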
(ii) Spontaneous Recognition
The SAM, as well as being protected against a private database attack, should also be protected against spontaneous recognition, by which we mean that an intruder is able to recognise people they know personally or people who are in the public eye. Spontaneous recognition is seen as unintentional disclosure and we have viewed it as requiring only a small amount of information to be known about an individual. For the purposes of this analysis we have defined the risk of spontaneous recognition as being population unique on 4 variables. The geographies GOR and LA (or Parliamentary Constituency) count as one of the visible variables, so that high risk records will be unique at LA level plus three other variables, or at GOR level with four variables. We note that any record unique at GOR level on three variables will also be unique in its LA.
We were concerned only with those variables on the file that are visible and traceable. The private database scenario contains some of the visible and traceable variables but there are others in the SAM which are not contained within this scenario. The variables that we considered to be visible and traceable on the file are:
Household level
Local Authority (Parliamentary constituencies for NI)
Accommodation type
Family type
Lowest floor level of household living
Number of rooms
Number of Cars
Number of Earners
Number of Dependent Children
Number of Residents
Density
Accommodation self-contained.
Individual level
Age
Sex
Marital Status
Distance to work
Primary economic position
Workplace
Country of birth
Ethnic group
Transport to work
Migration indicator
Religion
NS-SEC
Community background (Northern Ireland only)
Professional qualification
Supervisor/Foreman
Limiting Long term illness
Size of Work force
To investigate the risk of spontaneous recognition, two analyses were undertaken:
• The univariate distributions of the visible and traceable variables not on the private database scenario key were examined by LA. We were looking for categories of variables with low numbers in particular, as this gave an indication of where the disclosure risks were.
• The special uniques analysis was run using a Minimal Sample Unique (MSU) of size 4 for the visible and traceable variables listed above. The MSU of 4 means that all possible combinations of size 4 out of the total 28 variables will contribute to the assessment of risk. The special uniques analysis produced a DIS-SUDA score for each sample unique record. The DIS-SUDA score estimates the probability that a match against the record is a correct match. From the special uniques analysis we were able to tell which of the variables is contributing the most to the number of MSU's and which variable values were the most risky.
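The size of the search space implied by these settings can be checked with a short calculation. The first two figures use the numbers from this analysis (subsets of the 28 visible and traceable variables); the third uses the private database settings reported earlier (up to 6 out of 12 variables). The variable names in the code are illustrative.

```python
from math import comb

# Four-variable combinations out of the 28 visible and traceable variables
four_way = comb(28, 4)

# All subsets of size 1 to 4 out of 28 variables
up_to_four_of_28 = sum(comb(28, k) for k in range(1, 5))

# All subsets of size 1 to 6 out of the 12 private database variables
up_to_six_of_12 = sum(comb(12, k) for k in range(1, 7))
```

There are 20,475 four-variable combinations out of 28 variables, and each must in principle be evaluated per record, which helps explain why large files were split before the analysis was run.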
The results of examining the distributions of the visible and traceable variables by LA showed that there were very few variables with small counts. The DIS-SUDA scores for the sample uniques were all low, indicating a low chance of being population unique. These results were verified by taking the top 10 variables contributing most to the risk (LA, ethnicity, religion, age, NS-SEC, transport to work, migration indicator, country of birth, workplace and distance to work) and examining all four way combinations of these variables in the population to see if the combination was population unique. We found that the probability that a sample unique is a population unique was small, confirming the low DIS-SUDA scores. On this basis we concluded that the level of detail of the visible and traceable variables in the SAM was appropriate and that no further disclosure control for these variables was required.
(iii) Risk Assessment for Scotland
Scotland has a different disclosure risk problem as it is possible to use tables which are in the public domain to confirm whether a record in the microdata sample is population unique. This is due to the fact that Scotland chose not to apply small cell adjustment to their tabular outputs which means that some values of 1 are present in published Scottish tables. Scotland’s disclosure risk assessment has consisted of producing three way population tables for all variables on the SAM specification and then checking whether any population uniques in these tables occurred in the sample. This risk assessment was conducted by Sam Smith at the University of Manchester.
Disclosure control methods applied
Recoding has been the main disclosure control method to be applied to the SAM. However there came a point when further recoding would have resulted in little reduction in disclosure risk but large information loss. For the remaining high risk records in the SAM some perturbation was applied.
(i) Recoding
Due to the smaller geography on the SAM compared to the Individual SAR, it was necessary to substantially reduce the individual and household level information. Some information was completely removed from the file such as the Standard Occupation Classification of the individual and the industry that they worked in. For full details of the final coding used in the SAM please see appendix A.
(ii) Perturbation
The results of the special uniques analysis were used to efficiently target the perturbation to the highest risk records and highest risk variables. The special uniques analysis ranks sample uniques in the file by what is called a DIS/SUDA score. The methodology used to perturb the records in the SAM is based on the Post Randomisation Methodology. For further information on the PRAM methodology used please see Bycroft and Merrett 2005. PRAM was applied only to records that exceeded a threshold for the DIS/SUDA score and were population unique for the private database scenario.
Records in the SAM which were imputed were flagged. We used the same flag to indicate whether a record had been subject to PRAM. This informs the user that the value is not obtained directly from a true response, but does not allow them to distinguish between the two processes. Therefore if an intruder comes across a flagged record they do not know whether it is a true value, perturbed or imputed. Northern Ireland and Scotland chose to remove the imputation and PRAM flags from their files before release to provide some additional protection.
Edits
After the perturbation had been applied to the file, edits were checked to ensure that no invalid combinations had occurred. The edits that were checked were the same as those agreed for the Licensed Individual SAR. These edits were derived from the edits used by Census in creating the 2001 Census database, together with some additional edits that we felt should be checked. These edits checked that no invalid 'individuals' were created, such as a 2 year old married person.
The results of running the edits were that no new additional failed edits occurred due to the perturbation. All the other failed edits already existed in the original data so it was decided that these would not be corrected.
Conclusion
In conclusion, the SAM is a valuable file providing detailed geography. The trade off for this detailed geography has been in the reduction in the detail of information provided for individuals. However, the level of detail of the variables on the file is such that the file will still be useful for research purposes.
Last updated 25 April 2007