Fieldguide: From Idea to data
From EpiDataWiki
Example - from idea to questionnaire
The starting point for working with any data is a thorough description of the information that is to be analyzed, whether it is collected for a special purpose, obtained from a register or from an existing database.
The basis for the management and analysis of data is really a project plan with a description of purpose, problem definition, tasks, etc.
If your questionnaire is written without a clear goal and purpose you are going to overlook important issues and waste participants' time by asking useless questions.
If your questionnaire lacks logical flow, it can cause the participant to lose interest, leading to a low response rate.
The problems of an unclear objective in your study do not end here, but continue into the analysis stage. How would it be possible to reach insightful conclusions without knowing what you had been looking for or planning to observe?
The example below is one where the data come from different sources.
The idea
We want to elucidate the traffic habits of the population in a survey. The results will be used for planning the capacity and distribution of roads, bicycle trails, and recreational paths in an urban area. Later a collaborative project on accident prevention between the county administration and the local hospital will also be carried out. The county administration will carry out traffic counts in selected places as well as a questionnaire survey of a sample of the population of the area. The health sector and the schools will collect key information in a special registration form when persons in the area in question have been injured.
The idea here includes the purpose and a description of the types of data that will be available. When the basic structure of the data required is known, decisions can be made about registration forms and questionnaires that will be collected. In this case, three sources of information were collected:
- Traffic count form: Means of transportation, counting position, estimated age of road user, hour, day of the week.
- Questionnaire: Age, household composition, weekly use of bicycle, use of bus, car, sequence number, etc.
- Registration form - accidents: Hour, day of the week, short description of size of injury, age, sequence number, causes of accidents.
First draft of forms to be used for data collection
For each of the desired forms, a first draft is prepared to include the different topics covered by the investigation. It is usually a good idea to start with questions which specifically relate to the purpose of the project. Demographic data such as date of birth, sex, etc. can be put into the form later.
The specific wording of each question should be taken, as much as possible, from standard questions (e.g. the standard questionnaire from the Danish Institute for Health Services Research), a validated scale (see reference list) or a standardized list (e.g. diagnostic codes from ICD-10).
For every piece of information (data) that is collected, the draft must state how the information will be coded for the data analysis. Very often, a single question may result in more than one variable for analysis.
Eventually, each "Q:" above will have a question number, but at this point only the questions are of interest.
Considerations when deciding on variables
Data may be the response to a question in a questionnaire, information automatically or already known such as an identifying number or code, the value from a physical or observed measure such as a blood sample, or information derived from a related subject or specimen such as the histological diagnosis for a tissue sample. In order to use computers for analysis, data are coded as one or more variables.
Variables that are categorical, ordinal, or continuous can be used to group responses.
- Categorical variables will have a limited number of answer categories (e.g. yes/no/don’t know, true/false, a/b/c/d)
- Ordinal variables can be ranked on the basis of a fixed order (e.g. large/larger/largest, disagree/neutral/agree)
- Continuous or numerical variables can be manipulated using arithmetic
Although numbers can be used to code categorical or ordinal variables, they should not be treated as continuous. The average of an ordinal variable has no meaning in the usual sense. Variables may also be strings of free text/alphabetical (e.g. “He put his hat on and left”). Most computer software allows for special treatment of logical (True/False, Y/N, yes/no, +/-) variables, but coding as 0 and 1 or 1 and 2 is preferable.
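The point about not treating coded ordinal answers as continuous can be illustrated with a minimal Python sketch. The codes and labels (1 = disagree, 2 = neutral, 3 = agree) are assumptions chosen for illustration; the sketch reports frequencies and a median category instead of an average.

```python
from collections import Counter
from statistics import median_low

# Hypothetical coding for an ordinal question (illustrative only).
AGREEMENT_LABELS = {1: "disagree", 2: "neutral", 3: "agree"}


def summarize_ordinal(codes):
    """Return the frequency of each label and the median code.

    median_low always returns one of the observed codes, so the result
    is an actual answer category. No mean is computed.
    """
    counts = Counter(AGREEMENT_LABELS[code] for code in codes)
    return dict(counts), median_low(codes)


answers = [1, 3, 3, 2, 3]
frequencies, middle = summarize_ordinal(answers)
print(frequencies, AGREEMENT_LABELS[middle])
```

The numeric codes are only used for sorting and counting; the summary is expressed in the answer categories themselves.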
Data quality - control variables
Often the quality of data varies with other characteristics. It must be expected that an X-ray diagnosis made by a medical specialist is more precise than a diagnosis made by the most junior registrar. Another example is an analysis known to have a variation which is dependent on the season. Having data on the date and temperature during transportation will provide measures of quality in relation to transportation.
In order to use the precision of a piece of information in the analysis, one or more variables may be included as measures of the uncertainty of the main variable (in the examples above, the experience of the diagnostician or the date of the investigation). During the data analysis, systematic variations for sub-groups characterized by these “control variables” can be examined and published.
With questionnaires or chart abstractions, a number of control variables are usually necessary: date of the interview, when the questionnaire was returned, who coded the text information, who conducted the interview or chart abstraction, etc.
Grouping of data
Maintaining the original data is very important. While coding may be done during data collection or before data entry, this should be done in a way that does not eliminate the original data.
It is very important that no information is grouped/summarized at the time of data collection or data entry. This can be done much more easily and better during analysis. The principle is that there should be no data reduction during data collection or data entry!
For example, collect and enter
- date of birth – not age
- weight in kilos – not grouped weight
- actual number of sick days – not 'less than 7 days/more than 7 days'
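As a sketch of why the ungrouped value is sufficient, the following Python example derives age and an age group from a stored date of birth at analysis time. The function names and the cut points (18, 40, 65) are hypothetical; other groupings can be chosen later without re-collecting data.

```python
from datetime import date


def age_in_years(date_of_birth, reference_date):
    """Completed years between the date of birth and a reference date."""
    years = reference_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def age_group(age, cut_points=(18, 40, 65)):
    """Group an age using cut points chosen at analysis time (hypothetical values)."""
    lower = 0
    for upper in cut_points:
        if age < upper:
            return f"{lower}-{upper - 1}"
        lower = upper
    return f"{lower}+"


age = age_in_years(date(1980, 6, 15), date(2020, 6, 14))
print(age, age_group(age))
```

Had only the grouped age been entered, neither the exact age nor a different grouping could be recovered.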
Naming of variables
The variables must be given names. This is to ensure that the right tables are being analysed. It is a matter of taste whether to give variables numbers (v1,v2....v129) or names that indicate the content (age, sex, ACTH, .....). The name must be unambiguous and may advantageously refer to the questions in the registration, interview, or questionnaire forms (e.g. q1,q2 ... for questionnaire and l1,l2 ... for laboratory data). Use a maximum of 8 characters/digits in variable names. The first character in a name must not be a number: 1name is illegal, while name1 is ok.
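The naming rules can be checked automatically before data entry begins. The sketch below, in Python, applies the two rules stated above (at most 8 characters, first character not a number). Restricting names to letters, digits and underscore, and treating names that differ only in case as duplicates, are additional assumptions made for this example.

```python
import re

# Rules from the text: at most 8 characters, first character not a digit.
# Assumption: only letters, digits and underscore are allowed.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,7}")


def is_valid_name(name):
    """True if the whole name matches the pattern above."""
    return NAME_PATTERN.fullmatch(name) is not None


def find_name_problems(names):
    """Return invalid names and names that collide when case is ignored."""
    invalid = [name for name in names if not is_valid_name(name)]
    seen, duplicates = set(), []
    for name in names:
        key = name.lower()
        if key in seen:
            duplicates.append(name)
        seen.add(key)
    return invalid, duplicates


print(find_name_problems(["q1", "Q1", "1name", "name1"]))
```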
Is the code book necessary?
It is a matter of preference whether you want to prepare a specific code book as shown in the figure below, or whether you decide that sufficient information can be found in the final questionnaire/registration form with an appendix giving the background of each question.
As a minimum you have to decide what variables each question will result in before the final collection of data. It is often in the preparation of the code book that inconsistent answer categories in the questionnaire are detected.
Specification of coding
A code book (see example below) with the selected variables is prepared. In the code book each piece of information will be converted to one or more precisely defined variables. For each variable the code book states what type of data it holds (continuous, grouped, open text ...), what answer categories it has, what numeric code will be used if the variable is missing, etc. Number of digits is the maximum space used by the largest category number. For continuous variables minimum and maximum must be stated. Key variables must also be decided on, i.e. variables that must be present for each respondent.
Normally you will need two types of “missing data”. One type is for when a wanted piece of information has not been ascertained (the person did not answer, the result of the blood test has not yet arrived, etc.), while the other type marks an “irrelevant question” occurring in connection with the so-called “filter questions”. “If no, go to question 11” should result in the skipped questions before No. 11 being given a special value for irrelevant. All this will be apparent from the code book, which may look like this in connection with the questionnaire on page 2:
Please note that question 11 has resulted in 3 variables. When preparing the code book it may be an advantage to draft the graphs and tables which will be part of the planned publications. If you don’t know what graphs etc. you will eventually need, you will not know what data you will have to collect.
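A code book can also be kept in machine-readable form so that entered records can be checked against it. The following Python sketch is one possible layout, not the format used in the original figure. The variable names, ranges and the missing-data codes (99 for not ascertained, 98 for irrelevant) are all assumptions made for illustration.

```python
# Hypothetical code book; names, ranges and missing codes are illustrative only.
NOT_ASCERTAINED = 99  # assumed code: information wanted but not obtained
IRRELEVANT = 98       # assumed code: question skipped by a filter question

CODE_BOOK = {
    "id":   {"type": "continuous", "min": 1, "max": 9999, "key": True},
    "q1":   {"type": "categorical", "codes": {1: "yes", 2: "no"}},
    "q11a": {"type": "continuous", "min": 0, "max": 7},
}


def check_record(record):
    """Return a list of problems found when comparing a record with the code book."""
    problems = []
    for name, spec in CODE_BOOK.items():
        value = record.get(name)
        if spec.get("key"):
            # Key variables must always be present and within range.
            if value is None or not spec["min"] <= value <= spec["max"]:
                problems.append(f"{name}: key variable missing or out of range")
            continue
        if value in (NOT_ASCERTAINED, IRRELEVANT):
            continue
        if value is None:
            problems.append(f"{name}: no value entered")
        elif spec["type"] == "categorical" and value not in spec["codes"]:
            problems.append(f"{name}: illegal code {value}")
        elif spec["type"] == "continuous" and not spec["min"] <= value <= spec["max"]:
            problems.append(f"{name}: {value} outside {spec['min']}-{spec['max']}")
    return problems


print(check_record({"id": 12, "q1": 3, "q11a": 99}))
```

The missing-data codes must lie outside the legal range of every non-key variable, which is why the assumed values 98 and 99 are larger than any answer category in this example.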
Final preparation of questionnaires/registration forms
When variables for analysis have been defined you can apply the finishing touches to the questionnaires, and the data collection can begin. "Pilot testing" can be performed on a group corresponding to the final response group. Following the pilot phase each question is reviewed: Does it ask about only one dimension (one question)? Are all possible answers covered in a grouped question?
Demands for documentation
The demands for documentation have increased in recent years. This is explicitly apparent from the guidelines issued by the “Danish Committees on Scientific Dishonesty”. Unfortunately, the rules are written as general recommendations, e.g. ‘quality control’ without further specification. According to these rules two things in particular must be documented.
A) The possibility of returning to the original material
For every piece of information (e.g. a point in a figure) it must be possible to return to the original material. This means that each observation must have an ID Number attached. The ID Number must follow all versions of the data and is unambiguously connected with the original observations. The original material must be kept for 10 years. The original material comprises notes (also handwritten amendment sheets etc.), questionnaires, analysis forms, etc. The transition from observed material to original material is undefined. The concrete choice of boundary must be evident from the documentation. It is a consequence of this demand that it is not allowed to correct faults in the original data once these have been entered, at least not without saving a copy of the data file first entered. A working solution could be as follows:
For each of the "stars" a file will exist afterwards whose name is apparent in the data documentation. The rationale for not correcting the original data is that this might introduce new errors which might not be noticed.
B) Quality control
The process above will also make it possible to document the number of errors found and the consequences of these errors.
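The principle of never overwriting the data first entered can be sketched in Python as follows. The function name and record layout are hypothetical. Corrections are applied to a copy, and every change is logged with the ID number so that the number of errors found can be documented.

```python
import copy


def apply_corrections(original_records, corrections):
    """Return a corrected copy plus a log; the original records are left untouched.

    original_records: dict mapping ID number -> dict of variable values
    corrections: list of (id_number, variable, new_value) tuples
    """
    corrected = copy.deepcopy(original_records)
    log = []
    for id_number, variable, new_value in corrections:
        old_value = corrected[id_number][variable]
        corrected[id_number][variable] = new_value
        log.append({"id": id_number, "variable": variable,
                    "old": old_value, "new": new_value})
    return corrected, log


original = {101: {"age": 340, "sex": 1}}
corrected, log = apply_corrections(original, [(101, "age", 34)])
print(len(log), "correction(s) applied; original age is still", original[101]["age"])
```

In practice the original file, the corrected file and the log would each be saved under a name recorded in the data documentation.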
Designing the questionnaire/registration form and preparing data entry
The preparations for designing a questionnaire/registration form were summed up above. Now the process will be repeated in connection with exercises. Read the entire chapter and perform exercises 1-4.
What data structure – deciding on unit of analysis
The data structure must always reflect the purpose of the project. It is a good idea initially to draw a figure showing what data will be collected and how the different data sources will be linked, i.e. to decide on the desired unit of analysis.
Example: A blood bank wishes to analyse different issues pertaining to complications in connection with blood transfusions. The first step is to draw a figure showing the data structure:
Some donors give several portions of blood, some patients receive a single blood portion, while others receive several portions from different donors. The unit of analysis could simply be: “a donor” or “a patient”. “A blood transfusion” is more complicated: Are we talking about 6 transfusion episodes to 2 patients? 6 mutually independent transfusion episodes? 5 combinations of donor and patient? 7 donors and 2 patients? ...
How many questionnaires/forms?
Example: An investigation must contain: interview data, a clinical examination and para-clinical data collected from medical records.
All forms are linked to Analysis by the jointly used ID-number (code number). A special file with code number as well as social security number is kept separately and locked in accordance with the rules given by the data protection agency.
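Linking several forms through a shared code number can be sketched in Python as below. The form names and variables are hypothetical. Only the code number is used as the link, so the separately kept file with social security numbers is never needed for analysis. The sketch assumes that variable names are unique across forms.

```python
def link_forms(**forms):
    """Combine records from several forms into one record per ID number.

    Each keyword argument is a form name mapped to {id_number: {variable: value}}.
    Returns the linked records and, per ID number, the forms that are missing.
    """
    all_ids = set()
    for records in forms.values():
        all_ids.update(records)
    linked, missing = {}, {}
    for id_number in sorted(all_ids):
        linked[id_number] = {}
        for form_name, records in forms.items():
            if id_number in records:
                linked[id_number].update(records[id_number])
            else:
                missing.setdefault(id_number, []).append(form_name)
    return linked, missing


interview = {1: {"q1": 1}, 2: {"q1": 2}}
clinical = {1: {"bp": 120}}
linked, missing = link_forms(interview=interview, clinical=clinical)
print(linked)
print(missing)
```

The list of missing forms per ID number shows at a glance which respondents lack a clinical examination or record data.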
Exercise 1: Your own project - what data structure?
Draw a figure describing the data source(s) of your project. Describe the subjects of each data source, how the data are collected, and how you plan to link the different types of data. How can anonymity be kept? Note: many projects have only one data source (a questionnaire/registration form).
Designing the questionnaire/registration form
The questionnaire (or the registration form) is to be used directly for the data entry. Do not transfer data to a coding sheet.
When printing the questionnaire, you have to make sure that it is easy to leaf through during data entry. It may not be a good idea to print on both sides of the paper. The questionnaire must have a format which makes it easy to handle during data entry. A4 may be a better format than A5.
For every question it must be stated directly in the questionnaire what code is to be entered for that particular question, e.g. by means of a small digit before every answer category. Always use the same codes, e.g. 1 for yes and 2 for no throughout the entire questionnaire (0 cannot be entered quickly). Consider carefully whether a given question contains exhaustive as well as mutually exclusive answers (i.e. whether all respondents may find one and only one valid answer). If it is possible to tick off several sub-questions, then a variable must be coded for each sub-question. “Don’t know” should not be given as an answer category. We know from experience that respondents who are in severe doubt do not answer a question anyway.
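The rule of one variable per sub-question can be sketched in Python. The question number, the options and the letter suffixes are hypothetical. Coding an unticked box as 2 (no) follows the 1 = yes, 2 = no convention above, but it is an assumption that an unticked box means "no" and not "not answered".

```python
def code_multiple_choice(prefix, options, ticked):
    """One variable per sub-question: 1 = ticked (yes), 2 = not ticked (no)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return {
        f"{prefix}{letter}": (1 if option in ticked else 2)
        for letter, option in zip(letters, options)
    }


coded = code_multiple_choice("q5", ["bicycle", "bus", "car"], {"bus", "car"})
print(coded)
```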
Too many answers to a question cannot be recommended as this will increase the possibility of error: It is easy to tick off a wrong answer and also to enter a wrong answer during data entry.
It would be a good idea to look up already constructed questions in questionnaires from related surveys. You may contact a data library.
The respondent must not be in doubt where to put his answer. You may e.g. place all questions to the left and all answers to the right, like this:
Exercise 2 Your own project – drafting the questionnaire
You are going to prepare the actual registration forms or questionnaires and the code book(s) based on these for a project of your own.
Deciding on the method of data entry
You may choose any method of data entry, but here we shall discuss the method using epi-info. The requirements for a suitable method of data entry are as follows:
- Key variables such as ID-number (code number) must be entered and checked first and lastly for each questionnaire.
- Variables must be given an unambiguous name that can be transferred to other programs.
- The legal values for a given variable must be stated. Only legal values must be entered.
- Filter questions must cause the relevant questions to be skipped.
- There must be a distinction between irrelevant and not ascertained (NA).
- You must save to disk every time you have entered a questionnaire (beware of database programs etc. using a cache for intermediate storing).
- It must be possible either to enter data twice with later control of errors made during data entry, or to make a “blinded” double data entry on top of/overwriting the same data with the purpose of exposing errors or documenting error rates.
- You must arrange to have a back-up on diskette, USB memory stick or tape so that at the end of each workday, data are placed in separate buildings. If working where there is internet one can always send a copy to someone in another place - but if doing so remember to encrypt the file with a secure encryption principle first. Note that many encryption procedures are not sufficient.
Two vital considerations must be taken:
- Accuracy of the data entered
- Documentation of procedures for error detection, etc.
There is no such thing as the “right” solution which can be of use to all. The solution chosen must depend on local conditions (what experience and what programs are at hand, collaboration with a data entry bureau, etc.). The researcher must be closely involved in the planning of data entry procedures, the definition of how errors are documented, the decision on an acceptable error percentage (do we want all data to be doubly entered or proofread?), etc. Controlling the entire process gives the researcher good insight into possible errors and their consequences for the individual question. The data entry itself, on the other hand, may be handled by assistants. As a minimum, enough must be doubly entered that it is possible to document the error rate (hopefully measured only in errors per thousand!).
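Documenting the error rate from double data entry can be sketched in Python as follows. The record layout and function name are hypothetical, and this is not how any particular data entry program performs the comparison. Both entries are compared field by field, and the discrepancy rate is reported per thousand fields compared.

```python
def compare_double_entry(first, second):
    """Compare two independent entries of the same questionnaires.

    first, second: dict mapping ID number -> dict of variable values.
    Returns the discrepancies and the discrepancy rate per 1000 compared fields.
    """
    discrepancies = []
    compared = 0
    for id_number in sorted(set(first) & set(second)):
        for variable in first[id_number]:
            compared += 1
            value_a = first[id_number][variable]
            value_b = second[id_number].get(variable)
            if value_a != value_b:
                discrepancies.append((id_number, variable, value_a, value_b))
    rate_per_thousand = 1000 * len(discrepancies) / compared if compared else 0.0
    return discrepancies, rate_per_thousand


first = {1: {"q1": 1, "q2": 2}, 2: {"q1": 2, "q2": 1}}
second = {1: {"q1": 1, "q2": 2}, 2: {"q1": 2, "q2": 2}}
discrepancies, rate = compare_double_entry(first, second)
print(discrepancies, rate)
```

Each discrepancy is then resolved by returning to the paper questionnaire with the listed ID number. The comparison only shows that the two entries differ, not which one is correct.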
The working procedure is as follows:
Exercise 3 Your own project – decide on the method of data entry
You must decide what method of data entry you wish to employ and how you can ensure that you have back-ups of the data at any time.
Exercise 4 Your own project – the first document on data processing
Describe the overall structure of the data in the project, the different questionnaires, the different questions and also the variables resulting from the questions. Describe the basic principles of the data entry.
The description must be drafted as a summary of exercises 1-3 in a format which makes it suitable to form part of a diploma study or PhD thesis as an appendix. The size must be a few (one) pages of text attached to the specific questionnaires.
Difference between not ascertained and irrelevant
It is difficult to declare a fixed rule for handling this difference during data entry.
Not ascertained (NA)
Not ascertained means that the piece of information in a question has not been provided. Whether this is due to a lack of inclination on the part of the respondent to part with an answer or whether this is not possible to obtain does not matter. In either case no answer is available for analysis.
Irrelevant
Irrelevant refers to a question which cannot and shall not be answered. E.g. the number of completed pregnancies is of no relevance for men (in a biological and somatic sense).
During data entry it may be an advantage to just skip all missing values by giving them a blank (no) value. In the case of filter questions the irrelevant questions are skipped in this way, and in the data file there is no distinction between irrelevant and not ascertained. This distinction can be established later during the processing of data. See below.
Filter question
Example:
A filter question which has not been answered should result in all the questions inside the filter being coded as not ascertained. In the example, questions 10-27 should be coded not ascertained for all who have not answered question 10. This would be part of the data entry definition.
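Establishing the distinction after data entry, when blanks were entered for all missing values, can be sketched in Python. The variable names and the codes (99 for not ascertained, 98 for irrelevant) are assumptions. The rule that an unanswered filter question makes every question inside the filter not ascertained is taken from the text above. Treating the dependent questions as irrelevant whenever the filter answer equals the skip value is an assumption about how the filter is defined.

```python
NOT_ASCERTAINED = 99  # assumed code
IRRELEVANT = 98       # assumed code


def recode_blanks(record, filter_var, skip_value, dependent_vars):
    """Replace blanks (None) from data entry with explicit missing codes."""
    recoded = dict(record)
    filter_answer = record.get(filter_var)
    if filter_answer is None:
        recoded[filter_var] = NOT_ASCERTAINED
    for var in dependent_vars:
        if filter_answer is None:
            # Filter unanswered: everything inside the filter is not ascertained.
            recoded[var] = NOT_ASCERTAINED
        elif filter_answer == skip_value:
            # Filter answered so that the dependent questions do not apply.
            recoded[var] = IRRELEVANT
        elif record.get(var) is None:
            # Question applied to the respondent but was left blank.
            recoded[var] = NOT_ASCERTAINED
    return recoded


print(recode_blanks({"q10": 2, "q11": None, "q12": None}, "q10", 2, ["q11", "q12"]))
```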











