Merging files of different data levels

January 7, 2019

Survey Solutions produces a separate export data file for each data level, for example households, persons, plots, or crops. The process is completely automatic and not configurable by the user.

If you need to combine information from different data levels (for example, to bring household characteristics to the person level), you can use a statistical package to post-process your data. This is a basic operation supported by most packages, such as SAS, SPSS, Stata, and R, though the exact terms differ by package. Doing this in spreadsheet applications like Excel introduces unnecessary complications.

Here is an example. Suppose we have a survey with interviews corresponding to households and within each household we collect information about the household members. Each household has a categorical attribute region (among other characteristics) and a list of household members (text list type question hhmembers). Suppose further that the main survey level has the ID HOUSEHOLDS and the information about the household members is collected in the roster called MEMBERS.

Now, to combine the household and person information, we would run Stata code like the following:

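   version 14.0
   clear all
   * Adjust the path to where the exported data were unpacked
   cd "C:\path\to\data\"
   * Start from the person-level file (the master dataset)
   use "MEMBERS.dta"
   * Many-to-one merge: attach the household variable region to each person
   merge m:1 interview__id using "HOUSEHOLDS.dta" , generate(merge_quality) keepusing(region)
   * Verify that every record was matched, then drop the indicator
   tabulate merge_quality
   assert merge_quality==3
   drop merge_quality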

The single merge command does the actual work; the other commands set the stage for the merge or check its quality.

Note that:

  • you will need to adjust the path to data as appropriate for the location of the unpacked exported data folder on your computer;
  • this is a many-to-one merge, since there are (potentially) multiple persons in a household;
  • we are merging by matching the cases based on the variable interview__id and in the resulting dataset the level is still that of persons, but now with the household variables (region in our case) attached;
  • in the resulting dataset the cases are identified by the household id variable interview__id and the person id within the household, variable hhmembers__id (both identifying variables are generated by Survey Solutions automatically);
  • all members of a household will have the same value of region (and of any other household-level variables);
  • we named the variable region in the keepusing() option to bring only that variable from the household level; if you omit this option, Stata brings all the variables from that level.
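
The points about identification and shared household values can be checked directly after the merge. The following is a minimal sketch, assuming the same file and variable names as in the example above and that region is stored as a numeric (categorical) variable. Here keepusing() is omitted, so all household-level variables are attached.

   use "MEMBERS.dta", clear
   * No keepusing(): all variables from the household level are attached
   merge m:1 interview__id using "HOUSEHOLDS.dta", generate(merge_quality)
   assert merge_quality==3
   drop merge_quality
   * Each record must be uniquely identified by household id and person id
   isid interview__id hhmembers__id
   * Within a household, every member must carry the same region value
   bysort interview__id: assert region == region[1]

Both isid and assert stop with an error if the condition does not hold, so the script does not continue silently with inconsistent data.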

Depending on the analysis that you intend to perform, you may need to merge several files sequentially (for example, to bring the household and plot characteristics to the crop level), make several separate merges (for example, to combine households with persons and households with livestock), or create other structures.
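
As an illustration of a sequential merge, the sketch below brings plot and household characteristics to the crop level. The file names CROPS.dta and PLOTS.dta, the roster id variable plots__id and the plot variable soiltype are hypothetical and used for illustration only. The sketch also assumes that the crop-level file carries the id of its parent plot, so adjust the names to match your own questionnaire.

   * Hypothetical names: CROPS.dta, PLOTS.dta, plots__id, soiltype
   use "CROPS.dta", clear
   * Step 1: attach plot characteristics; a plot is identified by
   * the interview id together with the plot id
   merge m:1 interview__id plots__id using "PLOTS.dta", generate(merge_plots) keepusing(soiltype) keep(master match)
   assert merge_plots==3
   drop merge_plots
   * Step 2: attach household characteristics to the same crop records
   merge m:1 interview__id using "HOUSEHOLDS.dta", generate(merge_hh) keepusing(region) keep(master match)
   assert merge_hh==3
   drop merge_hh

The keep(master match) option discards plots and households that have no crop records, since the target level here is crops. The assert commands then confirm that every crop record found its plot and its household.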

More formally:

The exact syntax may vary between versions of a particular package. Refer to the documentation for the software and version you are using.