Case:oswego
From EpiDataWiki
'Outbreak of Gastro Intestinal Disease in Oswego: Using EpiData in an Outbreak Investigation'
| Instructional Designer and Adapter: | Pedro Arias, MD, MPH, FETP Spain |
| EpiData Technical Advisors: | Jens Lauritsen, MD, EpiData Association, Denmark |
Learning Objectives and Notes
- Understanding the basic concepts of DataBase Management and Documentation during an Outbreak investigation
- Understanding how to organize Data Analysis in an Outbreak Investigation
- Understanding how to describe an Outbreak in terms of: Time, Place, Person
- Understanding basic concepts about analytical studies: Retrospective Cohort Study
- Understanding the utility of Software during Outbreak investigations
- NOTE: This case study is based on the original one from CDC. This revision focuses mainly on database management and documentation, and on the use of EpiData as a tool for data analysis. To reach these objectives, both the questionnaire and the data differ slightly from the original. We have added questions about symptoms in order to show how to implement JUMPS in the check code.
- Warning: You will need to download and unzip this sample file.
Introduction
On April 19, 1940, the local health officer in the village of Lycoming, Oswego County, New York, reported the occurrence of an outbreak of acute gastrointestinal illness to the District Health Officer in Syracuse. Dr. A. M. Rubin, epidemiologist-in-training, was assigned to conduct an investigation.
When Dr. Rubin arrived in the field, he learned from the health officer that all persons known to be ill had attended a church supper the previous evening, April 18. Family members who had not attended the church supper had not become ill. Accordingly, the investigation was focused on the circumstances related to the supper.
Interviews regarding the presence of symptoms, including the day and hour of onset, and the food consumed at the church supper, were completed on 75 of the 80 persons known to have been present.
A case was defined as:
Persons who developed acute gastrointestinal symptoms within 72 hours of eating supper on April 18, 1940, and who were among attendees of the Lycoming, Oswego church supper.
A total of 46 persons matching the above case definition were identified.
The event involved a group of people of varying ages, who developed acute gastroenteritis, chiefly characterized by several episodes of nausea and vomiting. Diffuse, crampy, nonradiating abdominal pain was also present, associated with nonbloody diarrhea.
Other symptoms included malaise and chills. There were no fatalities, and no ill person reported fever. All recovered within 24 to 30 hours. Approximately 20% of the ill individuals visited physicians. No stool specimens or other biological samples were obtained for laboratory analysis.
Starting a new project
Standard Structure of a Questionnaire
Creating a Questionnaire for gathering Information.
When you start an outbreak investigation, it is good practice to develop the questionnaire (both on paper and in electronic form) at the same time as you start thinking about the analysis you want to perform.
A standard questionnaire should be divided into four main parts:
- Identification and socio-demographic questions: ID number of the questionnaire, ID numbers of persons (Social Security Number, Passport Number, ID Card, Car License, etc.), full name (note: be aware of confidentiality issues), age (or, even better, date of birth), sex (gender), address, telephone number (you may need to contact them later)
- Questions related to the disease under investigation: classification of persons as Ill/Not Ill, date of onset (the hour can be necessary in certain situations), symptoms, outcome (recovery, death, etc.), laboratory results, etc.
- Questions related to the exposure: consumption of different foods and beverages, or any other significant exposure according to the disease under investigation and its possible routes of transmission, levels of exposure (doses), time of exposure, etc.
- Questions related to other variables (confounders or modifiers of the expected relationship between the exposure and the outcome): these questions are very case-specific
The following is an example of the paper questionnaire administered to the attendees of the Church supper: Questionnaire Example
Always create a new folder for each project
Create a new folder by right-clicking anywhere, choosing NEW, and then FOLDER. A new folder will appear. The blinking cursor inside the title of the folder indicates that you can type in a new name. (If you do not see a blinking cursor, right-click on the new folder, select Rename, and then type the new name.) You can give the folder any name you want, but it is a good idea to use meaningful names; “Oswego_Investigation” is a good name. You can create the folder for your investigation anywhere you like on your computer; it does not have to be on the same drive as the EpiData program files. However, it will be easier to remember where the files are located if you always use a similar system, for example creating all the folders under a “Your Name” folder.
Creating a New QES file
- Open EpiData Entry.
Click on the EpiData Entry icon on your Desktop. The EpiData Entry screen should look like this.
- Create a New QES
- a) Click on the Define Data button and select New QES file.
Or go to File and select New.
- b) In the blank screen you can add a title for your Questionnaire. For this exercise you can use “Questionnaire for Church Supper – Oswego”
You can use either the Tab or Space key to put the Title in the position you want.
As you may have noticed, in the lower left corner of the window there is a tab named “untitled 1”. That means that your work has not been saved yet. Before you start adding new fields, it is a good idea to save your work.
a) Click on File>Save. A Save as… dialog box will appear. Choose the folder where you want to save your work. Use a meaningful name for the QES file (OSWEGOINV.QES for this example), and click OK. You can see that the name on the tab in the lower left corner has changed.
Adding New Fields
You will start adding new fields in your questionnaire. To do that, you can put the cursor in any place in the blank screen. You can move the cursor with the Space, Tab and/or Enter Key.
- Start including the Questionnaire ID number.
Write “Q1 Questionnaire ID: ” and then four # symbols
In EpiData you can indicate the kind of data you want to store in a field by using symbols:
- a) For numeric fields use #
- b) For text fields use ________
- c) For Yes/No fields use <Y> (many people prefer to use 1/0 or 1/2 instead)
- d) For date fields use <DD/MM/YYYY>
- Let's continue adding the Date of Interview, Name of Interviewed, Age and Sex(Gender).
It is a good idea to plan your questions by writing down a table like the following one:
| Prompt/Question | Type and Pattern | Field Name | Observations |
|---|---|---|---|
| Q2 Date of Interview | Date DD/MM/YYYY | Default | Not earlier than the start date of the investigation |
| Q3 Name | Text ____________ (40 characters) | Default | Open question |
| Q4 Age | Number ### (3 characters) | Default | Range: 0-80 (specific for this study) |
| Q5 Sex (Gender) | Text _ (1 character) | Default | 1=Male / 2=Female / 9=Unknown |
After you finish adding these fields, your questionnaire should look like this:
Now we will add a new field. In this case we want to store the time of supper. EpiData doesn’t include a specific field type for time, so we are going to create a numeric field to enter the time in 24-hour format (for example, 1930 for 7:30 p.m.).
Write “Q6 Time of Supper:” and then four # symbols.
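Because the time is stored as a plain four-digit number, a simple numeric range would still accept impossible values such as 1975. The following is a minimal Python sketch (not EpiData code; the function names are hypothetical) of the kind of check you may want to apply to such a field, either mentally during data entry or later during data cleaning:

```python
def is_valid_hhmm(value: int) -> bool:
    """Return True if value is a plausible 24-hour time written as HHMM."""
    if not 0 <= value <= 2359:
        return False
    hours, minutes = divmod(value, 100)
    return hours <= 23 and minutes <= 59


def hhmm_to_minutes(value: int) -> int:
    """Convert an HHMM number to minutes after midnight."""
    if not is_valid_hhmm(value):
        raise ValueError(f"{value} is not a valid HHMM time")
    hours, minutes = divmod(value, 100)
    return hours * 60 + minutes


print(is_valid_hhmm(1930))    # a valid evening time
print(is_valid_hhmm(1975))    # minutes above 59, so not a real time
print(hhmm_to_minutes(1930))  # minutes elapsed since midnight
```

Converting to minutes after midnight is one convenient way to compute intervals between two times recorded on the same day.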
Remember, it’s a good idea to save often. Just click on the Save button on the toolbar.
Using the Field Pick List
When you are adding questions to your Questionnaire, it is sometimes useful to have a tool that sets the format and size of your fields for you. You can use the Field Pick List button to do that.
The Field Pick List has several tabs for different kind of fields (Numeric, Text, Date and Others). In each tab you can choose different options.
Using the Preview Tab
The Preview button on the toolbar lets you see how the Data Entry Screen will look. Each time you click on this button, EpiData opens the Preview Tab and displays the Data Entry Screen.
Now you can finish creating the questionnaire for this study.
Try It!
Add the rest of the questions from the Questionnaire Example.
Add the questions related to the Outcome (ill). You must include a question for the Date of Onset (DD/MM/YYYY) and a separate one for the Time of Onset (####).
Add all the variables related to food consumption (Exposure). What kind of variable is best suited to this kind of question?
NOTE: When you finish, you can compare your questionnaire with the one in the OSWEGO.ZIP file
Opening and Editing a QES file
If for any reason you have to close your QES file and continue working on it later, you can always open an existing QES file by clicking on the Define Data Button and then choosing Open QES file.
You can also go to File>Open.
Enhancing your Questionnaire
Although it is possible to create your questionnaire as a simple list of questions, one below the previous, it’s better if you spend some time enhancing the layout of the questions, for example including Labels or Headings for each section of your questionnaire, displaying the questions in several rows, writing some “instructions” or reminders in the screen, etc.
You can also make some changes in the colours of the background, fields, and active field (explore the Options Tab: Show data form).
A well laid out questionnaire helps you, or whoever does the data entry, to avoid (or at least minimize) errors and makes the process easier.
Options
There are several settings that you can modify. To see the Options dialog box, click on File>Options.
- a) In The Show Data form tab check the Highlight active field box and choose 3D look (see below)
- b) Click OK to save changes
Creating a New Data file: REC file
Once you have finished setting up the Questionnaire, you must create a REC file; a file where the data will be stored.
- Click on Make Data File:
In the Dialog Window, select the name of the QES file (if not already selected).
Select the name for the data file and folder where it will be created.
Your dialog window should look like the following one:
It is a good idea to give the same name to both the QES file and the REC file, and to keep them together in the same folder.
So, choose OSWEGOINV.REC as the name for your Data file.
- Click OK, to confirm
In the dialog box that is displayed, write a short description for your data file. For example, you can write: “Draft data file for Oswego Investigation: Today” (Today = today’s date). You can change this description later.
Including today’s date in the description records when you created the data file, so that you can tell which version is the latest one. Do not rely on the file date alone, because computers or servers sometimes have the wrong system date.
Writing a short description of your data file helps you and others to understand the purpose and contents of your files.
Controlling data quality: The Check Code
Errors can occur any time during the data collection or after data have been collected.
Examples of data errors include:
- Transpositions (e.g., 19 becomes 91 during data entry)
- Copying errors (e.g., 0 (zero) becomes O during data entry)
- Coding errors (e.g., a racial group gets improperly coded because of changes in the coding scheme)
- Consistency errors (contradictory responses, such as the reporting of a hysterectomy after the respondent has identified himself as a male)
- Range errors (responses outside of the range of plausible answers, such as a reported age of 290)
To prevent such errors, you must identify the stage at which they occur and correct the problem. Methods to prevent data entry errors include:
- Manual checks during data collection (e.g., checks for completeness, handwriting legibility)
- Range and consistency checking during data entry (e.g., preventing impossible results, such as ages greater than 110)
- Double entry and validation following data entry
- Screening for outliers during data analysis
EpiData provides a range and consistency checking program and allows for double entry and validation and some other checking tools.
- Opening the check program
Click on the “3 Checks” button on the EpiData Entry toolbar. Choose the data file for which you want to define checks, in this case “OSWEGOINV.REC”. A screen like this should be displayed:
There is a small dialog box where you can define several properties for the check code. Its caption should be “OSWEGOINV.chk”.
- Defining a range of legal values
We want the Questionnaire ID Number field to accept only values between 1 and 75 (75 is the maximum number of individuals we are going to interview, including cases and non-cases).
In the OSWEGOINV.chk dialog box, check that the name of the field is Q1IDQUESTI. If not, either click on the field on the screen or choose the name of the field from the drop-down list.
In the Range, Legal box, type “1-75” and click on the Save button.
- Defining a “Must enter” status
Sometimes you want a specific field to always have a value, that is, to be defined as required (mandatory). In our case, we want all our questionnaires to have an ID number, so you must define this field as “Must enter”.
In the Must enter space, choose Yes; click on the Save button.
Your screen should look like this:
There are other fields that can be defined as “Must enter”, for example Date of Interview. Define this field as required.
You must be very careful when deciding whether a field is required, because EpiData will not allow you to leave a required field blank. When defining a field as “Must enter”, be sure either that you have data for all questionnaires or that there is a code to classify the value as “missing” (Unknown).
- Adding a list of Codes and Value Labels
We want the Sex (gender) question to accept only three possible values: 1, 2 and 9. We also want each of these values to carry a descriptive label: “Male”, “Female” and “Unknown”. Finally, we want to make this field “Must enter”.
In the Check dialog box, select the field Q5Sex, or click on the field on the screen. Once the field is selected, change the Must enter setting to Yes.
Then, from the Value Label drop-down list, select sex. This adds check code that uses a predefined list of possible values for sex: 1=Male, 2=Female and 9=Unknown.
- Understanding the check code
We are going to have a look at the check code for this field (Q5Sex).
At this point the check code for Q5Sex is divided into two different parts:
The first is the one directly under the Q5Sex section, which looks like:
Q5SEX
  COMMENT LEGAL USE sex
  MUSTENTER
END
You can read this code as: the Q5Sex field will USE, as its list of possible values and labels, those included in the list named “sex”. The Q5Sex field is a “Must enter” field, meaning it cannot be left blank.
What is that “sex” list? It is a predefined list whose definition is included at the beginning of the check file OSWEGOINV.CHK, under the LABELBLOCK section. This code looks like:
LABEL sex
  1 Male
  2 Female
  9 Unknown
END
If you want, you can open the OSWEGOINV.CHK with any general text editor (Notepad, WordPad) or the EpiData Editor: FILE>OPEN>Change Type to CHK and Choose OSWEGOINV.CHK. At this point, the whole CHK file should look like:
LABELBLOCK
  LABEL sex
    1 Male
    2 Female
    9 Unknown
  END
END
Q1IDQUESTI
  RANGE 1 75
  MUSTENTER
END
Q5SEX
  COMMENT LEGAL USE sex
  MUSTENTER
END
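To make the effect of these rules concrete, here is a small Python sketch that mirrors them outside EpiData. It is only an illustration of the logic, not how EpiData implements checks; the function name `check_record` is hypothetical, and it assumes that field values have already been read as integers (with `None` for a blank field).

```python
# Mirrors "LABEL sex" in the LABELBLOCK of the check file.
SEX_LABELS = {1: "Male", 2: "Female", 9: "Unknown"}


def check_record(record: dict) -> list:
    """Return a list of error messages for one record (empty if it passes)."""
    errors = []

    # Q1IDQUESTI: MUSTENTER and RANGE 1 75
    questionnaire_id = record.get("Q1IDQUESTI")
    if questionnaire_id is None:
        errors.append("Q1IDQUESTI: must enter")
    elif not 1 <= questionnaire_id <= 75:
        errors.append("Q1IDQUESTI: outside range 1-75")

    # Q5SEX: MUSTENTER and legal values taken from the "sex" label list
    sex = record.get("Q5SEX")
    if sex is None:
        errors.append("Q5SEX: must enter")
    elif sex not in SEX_LABELS:
        errors.append("Q5SEX: not a legal value")

    return errors


print(check_record({"Q1IDQUESTI": 12, "Q5SEX": 2}))  # passes every check
print(check_record({"Q1IDQUESTI": 91, "Q5SEX": 3}))  # fails both checks
```

The second call reproduces the transposition error mentioned earlier (19 typed as 91): the range rule catches it because 91 is above 75.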
- Jumps
When entering data, it sometimes happens that the answer to one question determines the answers to others. In our case, answering NO to Q7Ill (ill?) implies that the questions that follow, up to Q16 Baked Ham (those related to symptoms), are either not applicable or have the value "No".
We can make EpiData jump over all these questions.
Click on the 3 Checks button and choose OSWEGOINV.REC if the check file is not already open (the caption of the main window should say “Add/Revise Checks- OSWEGOINV.REC”).
Choose Q7Ill in the OSWEGOINV.CHK dialog box, or click on this field on the screen.
First, we will define this field as “Must enter”, because we want to know for every questionnaire whether it is a case (Ill=YES) or not (Ill=NO).
Second, in the Jumps box of the OSWEGOINV.CHK dialog box, write: “N>Q16BAKEDHA”. This means that if the answer in Q7Ill is NO, the cursor should go to Q16 Baked Ham.
It is also a good idea to automatically assign the value "No" to all the questions related to symptoms (but we are not going to do that now).
Click on the Save button, and then on the Edit button; the screen with the check code should display something like:
Q7ILL
  JUMPS
    N Q16BAKEDHA
  END
  MUSTENTER
END
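The jump rule can be read as a small lookup: for a given field and entered value, either a jump target is defined or the cursor simply moves to the next field. The Python sketch below illustrates that logic only; it is not EpiData code. Only Q7ILL and Q16BAKEDHA are real field names from this exercise, while SYMPTOM_A and SYMPTOM_B are hypothetical placeholders for the symptom questions in between.

```python
from typing import Optional

# Entry order of the fields (placeholders stand in for the symptom questions).
FIELD_ORDER = ["Q7ILL", "SYMPTOM_A", "SYMPTOM_B", "Q16BAKEDHA"]

# Mirrors the JUMPS block: in Q7ILL, the value N jumps to Q16BAKEDHA.
JUMPS = {"Q7ILL": {"N": "Q16BAKEDHA"}}


def next_field(current: str, value: str) -> Optional[str]:
    """Return the field the cursor moves to after entering value in current."""
    target = JUMPS.get(current, {}).get(value)
    if target is not None:
        return target
    position = FIELD_ORDER.index(current)
    if position + 1 < len(FIELD_ORDER):
        return FIELD_ORDER[position + 1]
    return None  # last field: the record is complete


print(next_field("Q7ILL", "N"))  # jumps straight to the food questions
print(next_field("Q7ILL", "Y"))  # continues with the first symptom question
```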
Documentation and Backup
Documentation
Writing summary descriptions of what you do is just as important as the final written document, and it is an easy and natural task if you do it as you proceed. As you work on a project, budget enough time to plan, execute, and document your database management and analysis tasks as you go. If you don't, you will very likely find yourself rushing the job, making unnecessary mistakes, and having to redo your work, which takes much more time in the long run. What seems obvious in the data management or analysis you are doing now may be only vaguely familiar several weeks from now. For example: exactly where did you save the last version of the QES file? How did you define outliers, and what did you do with them? One rule of thumb is to write at least one short paragraph of documentation for each data management and analysis task. While this increases the amount of written material you need to manage, it ensures that you will have a clearer picture of what you did, and why, when you revisit your files later on.
EpiData helps you to document your database. Now we are going to create a Data File Structure, a document listing the name, type, size, check code, labels, etc. of every field in your database.
Open EpiData Entry (if not already open) and click on the “5 Document” button; choose the OSWEGOINV.REC file. As a result, a new document will be displayed in the EpiData Editor.
You can see that the document displays very useful information about the database: path, short description, date of last revision, number of fields and records, and for each variable its name, label, field type, width, check code and value labels.
Backups
When any data process is being conducted, attention should be paid to regularly backing up all work. Regular hard drive back-ups should be made as work is progressing. Back-ups should also be made onto appropriate media periodically, in case of computer failure. Use of different, clearly labelled back-up media for each large data file should be considered to avoid accidentally overwriting data files.
Data should NEVER be deleted because of lack of computer disk space. It should always be copied to appropriate media and kept securely. A LABELLED copy should also be kept in a filing cabinet drawer.
There are several key moments when backups are crucial. One of them is after finishing the questionnaire design, data file creation and check code definition. So, assuming we have finished all these tasks, we are going to create the first backup set.
EpiData includes some tools for this task:
Click on the “6 Export Data” button and choose Backup. In the dialog box that appears, select the data file you want to back up (in our case OSWEGOINV.REC).
In the Destination Directory box, choose an appropriate drive and path.
Always remember that the backup must be stored on a different computer, not on the one that holds your original files and data.
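If you also keep backups outside EpiData, the same idea can be scripted. The sketch below is an independent Python illustration, not a description of what EpiData's Backup tool does: it copies every file that shares the project's base name (REC, CHK, QES, NOT, etc.) into a dated folder. The function name, the folder naming scheme and the example paths are all assumptions made for this example.

```python
import shutil
from datetime import date
from pathlib import Path


def backup_project(project_dir, backup_root, stem="OSWEGOINV", today=None):
    """Copy all files named <stem>.* into a dated backup folder.

    Returns the sorted list of file names that were copied.
    """
    today = today or date.today()
    target = Path(backup_root) / f"{stem}_backup_{today:%Y%m%d}"
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in sorted(Path(project_dir).iterdir()):
        # Match OSWEGOINV.REC, OSWEGOINV.CHK, ... regardless of letter case.
        if source.is_file() and source.stem.upper() == stem.upper():
            shutil.copy2(source, target / source.name)  # keeps timestamps
            copied.append(source.name)
    return copied


# Example call (hypothetical paths; point backup_root at another machine or drive):
# backup_project("C:/YourName/Oswego_Investigation", "E:/Backups")
```

Putting the date in the folder name avoids accidentally overwriting an earlier backup set.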
Enter data with EpiData
Entering data is easy once you have adequately prepared your database and check code. In this case study there were a total of 75 questionnaires. You are going to enter four (4) questionnaires (records) for practice. The four questionnaires you will enter into the database are in Appendix A.
Enter data
Click on the “4 Enter Data” button, or click on Data in/out (in the menu) > Enter data
Choose OSWEGOINV.REC
- Create the first record.
You will notice that a new window opens that looks just like the questionnaire you created. You can immediately enter the data from your first questionnaire. Find the first questionnaire in Appendix A and enter the data required. You will notice that, for items like dates, you only need to type the numbers, not the separators.
Note: We have intentionally left the Name field blank on the questionnaires. It is important to include this identifier, but we are using real data, so we have chosen not to include it here. You can just skip that field as you enter data.
When you finish entering the data in the last field and press Enter, EpiData will ask whether you want to save your data. Click OK.
A new record will appear.
Enter the next three records in Appendix A.
- Adding Notes
During data entry you may have to make decisions about the data in a specific record, keep a note of something that is unclear in a participant's answers, or remind yourself to complete some extra information later. EpiData offers a simple way to add a note to your record.
While entering the data for the record, press F5. A Notes window will be displayed showing the date and time of the note, the record number, the name of the field where the cursor was when you pressed F5, and its value. In this editor you can write any text you want.
EpiData will create an OSWEGOINV.NOT file (NOT standing for Notes).
Working with Records
As you enter data, you may want to look at a record you have previously entered. For example, you find that you have made a mistake on entering data and want to return to a record to correct it. There are several ways to look for a record.
- Surfing through records
At the bottom left of the screen, in the Record section, is a set of buttons with arrows. Here is how each one works:
You can always create a new record by clicking on the blue star. That will open a new empty record ready for data entry.
- Go to a Specific Record Number
Click on GOTO> GOTO RECORD.
Type in the record number.
Click OK.
- Search for a Record by Field Information
You may have entered a large number of records and then realize you need to find the record for a particular person: a Female 52 years old, but you don’t remember the record number.
We are going to search for anyone who had those data in the dataset. At the moment, we only have a few records but, in a large database, this information would be difficult to search just by scrolling through the records.
Click on GOTO>FIND RECORD.
In the left column you can write the name of the variable you want to use (for example Q5SEX). You can press F4 to get a list of all variables.
In the right column you can write the criteria to be matched, for example 2 (for Female).
You can use more than one row, meaning you want to look for combined criteria.
In our case in row 2, you must include Q4Age and 52.
Click OK.
EpiData will show the first record matching “Female 52”. If you press F3, EpiData will look for the next record that meets the same criteria.
Deleting and undeleting records
If you accidentally create duplicate records, you can delete one of them by clicking on the red cross at the bottom left of the screen. The word DEL will be displayed. However, this only marks the record as deleted for analysis purposes; it does not actually remove it from the data file.
To undelete a record previously “deleted”, click on the red cross again.
Controlling data quality: Double Data Entry
Once you have finished entering all your data (we have prepared a copy with the 75 records already entered: it is called OSWEGOINV.REC and you can find it in the Folder OSWEGO_BACKUP in the ZIP file accompanying this exercise), it is time to be sure that your dataset has the best possible data quality.
To ensure a high quality of data, often it is a good strategy to have two different persons entering the same data. In EpiData this can be done in two different ways: either by entering the same data in two separate data files, which later can be compared or by entering in double entry mode where the new data immediately are compared with the original data.
Validate duplicate data files
Two different persons enter the same data in two separate data files. When all data has been entered the two data files may be compared using the function Validate Duplicate Files found in the Document menu in the main screen (when all files are closed or when only editor files are shown).
To prepare double entry the function Copy Structure found in the Tools menu may be used to copy the structure (not the data) of a data file to a new data file. Copy Structure has the option to leave out text fields since these are seldom entered twice.
- a) Click on TOOLS>COPY STRUCTURE
- b) Select the OSWEGOINV.REC file
- c) In the dialog box, select the name of the new file to be created (OSWEGOSCND.REC). Don’t change any options.
- d) Click OK
As a result you have OSWEGOINV.REC (with 75 records) and OSWEGOSCND.REC (an empty copy).
Once you have created the new file, you have to enter your data in it. (Of course we don’t want you to do that in this exercise, so we have prepared this new file for you. Copy the OSWEGOSCND.REC file from the OSWEGO_BACKUP folder now.)
Now that the data have been entered twice, in OSWEGOINV.REC and OSWEGOSCND.REC, we are going to check whether the contents of both files agree.
- e) Select DOCUMENT>VALIDATE DUPLICATE FILES.
- f) Select the names of the two files. After that a dialog is shown with the options of the validation process.
- g) Select key fields
In order to compare two data files one or more KEY fields should be selected. The KEY fields selected are used to match records in the two files. The list of selectable key fields shows only the fields that are common to both data files. Fields that are marked with KEY in the check file have a key-symbol. The key symbol is shown only as information. It is not necessary that the KEY fields selected for validation are KEY fields in the check file.
If no KEY fields are selected then the two data files are compared on a record-by-record basis (i.e. record 1 in data file 1 is compared to record 1 in data file 2, and so on). Data in the two files must therefore be entered in the same order if no key fields are selected.
- h) Options:
- Ignore deleted records:
Records marked as deleted are skipped during the validation process
- Ignore text fields:
Fields of the type Text and Uppercase text are ignored during the validation process
- Ignore letter case in text fields:
If set then "Smith" is considered equal to "sMiTh"
- Report differences in field types:
If set then the validation report will include information about fields in the two data files that have the same field name but different field types.
- Ignore missing records in data file 2:
Set this option to avoid messages that records found in data file 1 are not found in data file 2. This is useful if double entry is only made on a sample of the original data file. Select the original (full) data file as data file 1 and the extract as data file 2, and set the option Ignore missing records in data file 2.
Click OK to run the validation. The two data files are compared and a validation report is shown.
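The matching logic described above (pair records by a key field, then compare the fields common to both files) can be sketched in a few lines of Python. This is an illustration of the idea only, not EpiData's implementation; the function name is hypothetical, records are represented as plain dictionaries, and the two example records are made-up values, not data from the Oswego file.

```python
def compare_files(file1, file2, key, ignore_missing_in_file2=False,
                  ignore_case=False):
    """Compare two lists of records matched on a key field.

    Returns a list of (key value, field name, message) tuples.
    """
    index2 = {record[key]: record for record in file2}
    report = []
    for record1 in file1:
        record2 = index2.get(record1[key])
        if record2 is None:
            if not ignore_missing_in_file2:
                report.append((record1[key], None, "missing in file 2"))
            continue
        for field, value1 in record1.items():
            if field not in record2:
                continue  # only fields common to both files are compared
            value2 = record2[field]
            a, b = value1, value2
            if ignore_case and isinstance(a, str) and isinstance(b, str):
                a, b = a.lower(), b.lower()
            if a != b:
                report.append((record1[key], field, f"{value1!r} != {value2!r}"))
    return report


# Made-up records: the second entry contains a transposition (19 typed as 91).
first = [{"Q1IDQUESTI": 1, "Q4AGE": 52}, {"Q1IDQUESTI": 2, "Q4AGE": 19}]
second = [{"Q1IDQUESTI": 1, "Q4AGE": 52}, {"Q1IDQUESTI": 2, "Q4AGE": 91}]
print(compare_files(first, second, key="Q1IDQUESTI"))
```

Matching on a key field rather than on record position means the two files do not have to be entered in the same order, which is the reason for selecting key fields in step g).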
Double entry verification
Double entry verification is a procedure where one person has entered data in a data file and another person enters the same data in EpiData's double entry mode where the new data immediately are compared with the original data. During the second round of data entry the user will receive messages if the new data differ from the original data.
Double entry is done in two steps. First double entry is prepared, second the data are re-entered.
- Prepare for double entry verification
- i) In the TOOLS menu select PREPARE DOUBLE ENTRY.
- ii) Select the data file with the original data (OSWEGOINV.REC)
- iii) Select a name for the data file, where the second round of data are to be saved (OSWEGODBL.REC)
Options can be changed now: choose if text fields are to be ignored during double entry (only data in numeric fields will be compared with original data).
Choose if records are to be compared by record number or by a key field. If the option "Match records by field" is unchecked, then records are compared by record number. If the option is checked, the user will be asked to point out which data field should act as the key field. The key field must contain unique data, i.e. an ID-number.
Click OK and read the message stating that double entry verification is now prepared.
- Re-enter data
- i) Select Enter Data and choose the double entry data file that was created when double entry verification was prepared (this file will be default if Enter Data is selected directly following prepare for double entry). The user will see a warning stating that EpiData is in double entry verification mode.
- ii) Begin entering the data. If the data entered differ from the original data file a warning is shown, giving you the choice of accepting the new value, the original value or editing the input.
As during normal data entry, the double entry verification can be interrupted by closing the data file and then resuming verification at a later time.
The double entry file (the second file) and the original file being compared must reside in the same folder. If you need a different arrangement, change the complete path in the ...dbc file, which is the file that stores the settings for the double entry.
Documentation and Backup: Always
- Documentation
Again, this is a key point in your study, so it is a good idea to prepare a Backup and document your data.
Open EpiData Entry (if not already open) and click on the “5 Document” button; choose CodeBook. The Codebook gives key information plus basic descriptive statistics on the data found in the data file, including the number of records, number of deleted records, variable labels, field types, selected check commands and number of missing values (= blank fields). Summary statistics are also displayed depending on the field type.
Choose the OSWEGOINV.REC file. As a result a new document will be displayed on the EpiData Editor.
- Backups
Label this copy very clearly, because it contains your original raw data, after the whole error detection process and before any manipulation of the data.
Always remember that the back-up must be stored in a different computer, not in the one where you have your original files and data.
Describing the cases in terms of Person, Place and Time
A good epidemiological description can help you to develop hypotheses about the mode of transmission and the source of infection.
To describe your data, you are going to use EpiData Analysis.
Open EpiData Analysis by clicking on the Desktop icon. If this is the first time you are using EpiData Analysis, you will need to setup and save some parameters. The defaults are usually the best option, so just locate the Save and Restart link in the main screen and click on it. The layout of the EpiData Analysis screen has a Menu on the top, a work process bar, a toolbar, an output screen and a command line. If you press F2 and F3 you will get a variables window and command window. See image below:
- Reading (opening) the data file
- a) Click on the READ DATA button and search your folder and file.
- b) Select OSWEGOINV
- c) Click Open
EpiData will show some messages informing you about the data file: name, number of records, number of fields, etc.
- Now we will create a Line List
To see our data, we can either Browse the data or create a Line List. Browse is faster, but a Line List gives you a hard copy (in case you have not created one already).
- a) Click on the BROWSE DATA button
- b) In the dialog window click on the All button: >>
- c) Click RUN
- d) If the Browse window is not displayed, press ALT-TAB, and look for another instance of EpiData. The Browse window should be visible now.
Or
- a) In the command prompt write “list”
- b) Press Enter
- Working with a subset of data
Sometimes we will need to work with only a subset of the data: the records matching a specific criterion. You can do this using the select command.
For example, if you want to work only with the records corresponding to ill people, you can use SELECT Q7ILL="Y"
- a) Press F2 and F3 in order to get the Commands Window and the Variables Window
- b) On the Commands Window look for Read &Start and expand it (by clicking on the + icon or double clicking on the text)
- c) Double click on Select (the select command will be automatically written in the command prompt)
- d) On the Variables Window, click on Q7ill (it will be written in the command prompt)
- e) On the command prompt add: ="Y"
- f) Press Enter
Describing your cases in terms of person
- a) First we want to describe cases by Sex (gender). A Frequency distribution is the right way:
- i) Click on the Analysis button and choose Frequency
- ii) On the dialog window, select Q5Sex and then click on the Pass this button (>)
- iii) Q5Sex will be displayed in the selected variable Area.
- iv) Click on RUN
The result should be something similar to:
| Q5 Sex: | No. | % | Cum % |
|---|---|---|---|
| Male | 16 | 34.78 | 34.78 |
| Female | 30 | 65.22 | 100.00 |
| Total | 46 | 100.00 | |
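The percentages in this table can be checked by hand or with a few lines of code. The following is a minimal Python sketch, outside EpiData, that rebuilds the frequency distribution from the two counts shown above; the function name `frequency_table` is only an illustrative choice.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, count, percent, cumulative percent)."""
    counts = Counter(values)          # keeps categories in order of first appearance
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, n in counts.items():
        cumulative += n
        rows.append((category, n,
                     round(100 * n / total, 2),
                     round(100 * cumulative / total, 2)))
    return rows

# Counts taken from the table above: 16 male and 30 female cases.
sexes = ["Male"] * 16 + ["Female"] * 30
table = frequency_table(sexes)
for row in table:
    print(row)
```

Each percentage is simply the count divided by the total number of cases (46), and the cumulative column adds the counts row by row.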
Try It!
Now you can finish the description of cases using other variables, for instance you can describe the cases by symptoms.
- b) To describe or summarize a continuous numeric variable, like age, you need specific measures (measures of central tendency and variation). With EpiData Analysis, you can do this using the describe command.
- i) Click on Analysis
- ii) Select Describe
- iii) Select Q4Age and click on the Pass this button
- iv) Click on RUN
You will get something like:
| Variable | N=46 | Sum | Mean | (95% cfi) | Min | p5 | p10 | p25 | Median | p75 | p90 | p95 | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q4AGE | 46 | 1806.00 | 39.26 | 32.77 45.75 | 3.00 | 7.35 | 9.40 | 16.75 | 38.50 | 59.00 | 68.60 | 73.30 | 77.0 |
You can also describe a continuous numeric variable using a Boxplot Graph (Box-Whisker):
- i) Click on the Graph Button
- ii) Choose Box plot
- iii) From the Drop-down list choose Q4Age
- iv) Click RUN
Boxplot of Median and Inter Quartile Range (IQR=25-75%). Whisker: 1.5*IQR. Outliers exceed IQR*1.5 . N=46
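The quantities named in the caption (median, interquartile range, 1.5*IQR whiskers and outliers) can be sketched in Python as follows. This is only an illustration under stated assumptions: the ages are hypothetical, not the values in OSWEGOINV.REC, and the quartile interpolation used by `statistics.quantiles` may differ slightly from the one EpiData applies.

```python
import statistics

def box_summary(values):
    """Median, quartiles, IQR, 1.5*IQR fences and outliers of a numeric list."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr      # whiskers do not extend beyond the fences
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}

# Hypothetical ages, for illustration only.
ages = [3, 7, 11, 15, 17, 25, 36, 40, 52, 59, 62, 77]
print(box_summary(ages))
```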
- c) But sometimes, when we summarize our data using percentages and measures of central tendency and dispersion, we lose some crucial information. In our case we want to find out whether the occurrence of disease is similar across all age groups in both sexes.
To do this, we will aggregate the people into age groups and then tabulate (or graph) the distribution of cases by age group and sex.
- i) First, we will create a new variable:
DEFINE AGEGROUP _________________
- ii) Then we will recode this new variable based on the values in Q4Age. We want to aggregate the ages in 5-year age groups.
RECODE Q4AGE to AGEGROUP by 5
- iii) Now we can tabulate the distribution of cases by age group and sex
TABLES AGEGROUP Q5SEX if Q7ILL="Y"
Or
BAR AGEGROUP IF (Q7ILL="Y" and Q5SEX="1") (Bar graph of male cases by age group)
And
BAR AGEGROUP IF (Q7ILL="Y" and Q5SEX="2") (Bar graph of female cases by age group)
Remember to include titles, footnotes, etc. in your graphs.
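The grouping and cross-tabulation steps above can be sketched in Python to show the underlying logic. The records below are hypothetical, and labelling each band by its lower bound is an assumption; EpiData may label the recoded groups differently.

```python
from collections import Counter

def age_group(age, width=5):
    """Lower bound of the width-year band that contains age (e.g. 38 -> 35)."""
    return (age // width) * width

# Hypothetical (age, sex) pairs for ill people only, for illustration.
cases = [(3, "Female"), (7, "Male"), (36, "Female"), (38, "Female"), (62, "Male")]

# Cross-tabulate cases by age band and sex.
table = Counter((age_group(age), sex) for age, sex in cases)
for (group, sex), n in sorted(table.items()):
    print(f"{group}-{group + 4} years, {sex}: {n}")
```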
- Working again with all the records in the database
The selection we made with the Select command, to work only with ill people, is temporary; we can cancel it by using the Select command alone (without a specific criterion). Now we want to work with all records in the dataset, so in the command prompt write select and press Enter. Notice that in the Tables and Bar commands above we used an IF statement at the end of the command. This creates a subset of the data just for that command.
Try It!
Now you can make a complete description of your database (including a summary of how many cases and non-cases you have).
Epidemic Curve
Describing your cases in terms of Time will give you clues to understand the mode of transmission of the disease.
An epidemic curve, or "epi curve" for short, is a two-dimensional graph that provides a simple visual display of an epidemic's magnitude and time course.
The epidemic curve plots time along the X-axis and the number of cases along the Y-axis. Because time is continuous, the epidemic curve is drawn as a histogram (no gaps between adjacent columns), not as a bar chart.
The units of time must be consistent along the length of the X-axis; for example, the same distance must equal 1 day anywhere along the X-axis. For a given graph, the most appropriate units of time for the X-axis depend on the incubation period of the disease, the length of time over which cases are distributed, and the points you wish to communicate with the graph.
One rule of thumb states that the units should be between one-eighth and one-third (roughly one-quarter) as long as the incubation period of the disease in question. So, for a common-source outbreak of Clostridium perfringens gastroenteritis (usual incubation period 10-12 hours), X-axis units of 2-3 hours would be suitable.
Creating an Epidemic Curve with EpiData is easy if the units of time are days or longer. However, it is a little more complicated if you want to display hours (as in our case).
In order to create an epicurve in hours we have to deal with two variables: Q8ONSETDAT and Q9ONSETTIM. We have to transform these two variables into a number of hours counted from an arbitrary point in time (a point always earlier than the first case). We can choose, for example, 00:00 hours on 18/04/1940.
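As a quick check of this rule of thumb, the arithmetic for the Clostridium perfringens example can be sketched in Python; the function name `bin_width_range` is only illustrative.

```python
def bin_width_range(incubation_hours):
    """Suggested X-axis unit: one-eighth to one-third of the incubation period."""
    return incubation_hours / 8, incubation_hours / 3

# Usual incubation period of 10-12 hours, as quoted in the text.
narrowest, _ = bin_width_range(10)
_, widest = bin_width_range(12)
print(narrowest, widest)
```

Units of 2 to 3 hours fall inside the resulting range (1.25 to 4 hours), which is why they are suitable for that outbreak.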
From this arbitrary point we can count the numbers of hours until the onset of symptoms for each case.
- a) Define a new variable: You need a new variable to store your calculations.
In the command prompt write the Define command, the name of the new variable (for example ONSETHR) and the kind of data you want to store in it. In our case your command prompt should look like:
Define ONSETHR ##
Meaning: Create a new variable called ONSETHR; this variable should be numeric, integer with a width of 2 digits.
- b) Assign values to the new variable: The values of the new variable depend on the values in Q8ONSETDAT and in Q9ONSETTIM. Q9ONSETTIM is easy, since we only need the first two digits of the 24-hour-format time (e.g., if Q9ONSETTIM is 1230, we want only 12). For Q8ONSETDAT, if the value is 18/04/1940, the hour of the day is already the number of hours we need; but if the value is 19/04/1940, we need to add 24 (24 hours).
In the command prompt you should write the following:
IF Q8ONSETDAT=dmy(19,04,1940) then let ONSETHR=24+(Integer(Q9ONSETTIM/100)) else let ONSETHR=(Integer(Q9ONSETTIM/100))
All of this must be written in one line.
This is a little bit complicated, so we are going to look at this command carefully.
- i) First the IF [condition] THEN [actions] ELSE [alternative actions] part
This is a very common structure in programming languages. It is called a conditional statement: if the “condition” is true the “actions” are done; if the condition is NOT true then the “alternative actions” are done.
In our case, if the value of the variable Q8ONSETDAT in a given record is equal to 19/04/1940, the following action is done:
- ii) let: assign to the variable ONSETHR the result of adding 24 to
- iii) the integer part of the value in Q9ONSETTIM divided by 100.
If the condition is not true (for example, the value is 18/04/1940), the alternative action is done: assign to the variable ONSETHR the integer part of the value in Q9ONSETTIM divided by 100.
Now we have in one single variable the number of hours that passed from 00:00 on 18/04/1940 until the onset date and time of each ill person.
There are often multiple ways to do calculations in EpiData. For example, the function IIF can be used as a short form of IF ... THEN ... ELSE. In this example, you could also calculate values for ONSETHR using the IIF function:
ONSETHR = (Q9ONSETTIM div 100) + iif(Q8ONSETDAT=dmy(19,4,1940),24,0)
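For readers who want to check the arithmetic outside EpiData, the same calculation can be sketched in Python. This is an illustration, not part of the EpiData workflow: it assumes the onset time is stored as an integer in HHMM format, and it counts whole days from the reference date instead of testing for 19/04/1940 only, so it would also handle later dates.

```python
from datetime import date

# Arbitrary starting point chosen in the text: 00:00 on 18/04/1940.
REFERENCE_DATE = date(1940, 4, 18)

def onset_hour(onset_date, onset_time_hhmm):
    """Whole hours from 00:00 on the reference date to the onset of symptoms."""
    days_after = (onset_date - REFERENCE_DATE).days
    hour_of_day = onset_time_hhmm // 100   # 1230 -> 12, minutes are dropped
    return 24 * days_after + hour_of_day

print(onset_hour(date(1940, 4, 18), 2130))  # same day as the reference point
print(onset_hour(date(1940, 4, 19), 1230))  # next day, so 24 hours are added
```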
To create an Epidemic Curve, the only thing we have to do is
- i) Select Q7Ill="Y"
- ii) Histogram ONSETHR. To do that you can either write this command in the command prompt or click on the GRAPHS button, choose Histogram and choose ONSETHR as X variable.
The results should be similar to:
Now you can add a Title, footnote and different features to your graph.
You can see that one case occurred at 15:00 on 18/04/1940 and that 6 cases occurred at 21 and 22 hours of the same day.
You can check these data easily by sorting the dataset by Q8ONSETDAT and Q9ONSETTIM and then browsing these two variables:
SORT Q8ONSETDAT Q9ONSETTIM
BROWSE Q8ONSETDAT Q9ONSETTIM
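The tally behind a histogram of ONSETHR can be sketched in Python as follows. The onset hours used here are hypothetical and serve only to show the counting; empty intervals are kept so that the time axis stays continuous, as an epidemic curve requires.

```python
from collections import Counter

def epi_curve(onset_hours, width=1):
    """Count cases per time interval, keeping empty intervals in the result."""
    counts = Counter((h // width) * width for h in onset_hours)
    first, last = min(counts), max(counts)
    return [(start, counts.get(start, 0))
            for start in range(first, last + width, width)]

# Hypothetical ONSETHR values, for illustration only.
hours = [15, 21, 21, 21, 22, 22, 22, 24, 25]
for start, n in epi_curve(hours):
    print(f"{start:3d} h  " + "#" * n)
```

Passing a larger `width` (for example 2 or 3) groups the cases into wider intervals, which is how the unit of time on the X-axis would be changed.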
- Interpreting an Epidemic Curve.
The first step in interpreting an epidemic curve is to consider its overall shape. The shape of the epidemic curve is determined by the epidemic pattern (common source versus propagated), the period of time over which susceptible persons are exposed, and the minimum, average, and maximum incubation periods for the disease.
An epidemic curve which has a steep upslope and a more gradual downslope (a log-normal curve) indicates a point source epidemic in which persons are exposed to the same source over a relatively brief period. In fact, any sudden rise in the number of cases suggests sudden exposure to a common source. In a point source epidemic, all the cases occur within one incubation period.
- Incubation Period: In order to generate hypotheses about the probable agent, you need to calculate the Incubation Period, assuming the source of exposure was the supper.
- a) First we need to define a new variable to store the calculations:
DEFINE IP ###
- b) Now we will assign to the variable IP the result of subtracting from ONSETHR (Number of hours between 00:00 18/04/1940 and the date and time of onset) the number of hours between 00:00 18/04/1940 and the date and time of exposure. We know that date of exposure was always 18/04/1940, but time of exposure differed by case.
In the command prompt you must write:
IP=ONSETHR-(Integer(Q6TIMESUPP/100))
The following time line shows the relationship between IP and other variables for case number 42.
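The subtraction can also be sketched in Python. The values below are hypothetical, not taken from OSWEGOINV.REC, and the sketch assumes the time of the supper is stored as an integer in HHMM format, like the onset time. Because minutes are dropped on both sides, the result is an approximation in whole hours.

```python
import statistics

def incubation_hours(onset_hr, supper_time_hhmm):
    """Hours from the meal to onset, both counted from 00:00 on 18/04/1940."""
    return onset_hr - supper_time_hhmm // 100

# Hypothetical (ONSETHR, Q6TIMESUPP) pairs, for illustration only.
records = [(22, 1900), (24, 2000), (25, 1830)]
periods = [incubation_hours(onset, supper) for onset, supper in records]

print(periods)
print(statistics.mean(periods), min(periods), max(periods))
```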
- c) And now you can describe the Incubation Period using the Describe command:
Describe IP
You will get something similar to:
| Variable | N=75 | Sum | Mean | (95% cfi) | Min | p5 | p10 | p25 | Median | p75 | p90 | p95 | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IP | 22 | 96.00 | 4.36 | 3.67 5.06 | 3.00 | 3.00 | 3.00 | 3.00 | 4.00 | 6.00 | 7.00 | 7.00 | 7.00 |
You can see that the average IP is 4.36 hours, with a range of 3 to 7 hours, an incubation period compatible with Staphylococcus aureus enterotoxin.
Testing Hypothesis: Analyzing the Risk Factors
At this point in our retrospective cohort study, we are hoping to identify risk factors that might indicate the cause and mode of transmission of the disease. We created a questionnaire that asked people, both those who were exposed and those who were not exposed to different foods and beverages, whether they became ill after that exposure. What we are trying to do here is determine whether a risk factor (for example, eating Cake - Q26CAKE) is linked to some outcome (illness – Q7Ill).
We will do this by creating a 2x2 table and looking at the p-value generated. Remember, the p-value is the probability of observing an association at least as strong as the one in your data if there were in fact no association between the two variables (that is, by chance alone). For example, if the p-value equals .75, an association this strong or stronger would arise by chance about 75% of the time, so the data give little evidence of a real association. On the other hand, a low p-value means the observed association would rarely arise by chance alone. So a low p-value (generally < .05) suggests that a risk factor (e.g., eating cake) is associated with a certain outcome (illness).
- Calculating Attack Rates
Because this is a Retrospective Cohort Study, we can calculate the Attack Rate for each food among those who ate the food and those who did not. The true vehicle is likely to have three features:
- a) The attack rate is high among persons who ate the food (high food-specific attack rate).
- b) The attack rate is low among persons who did not eat the food (so the difference or ratio is high).
- c) Most of the cases were exposed, so the exposure could “explain” most, if not all, of the cases.
EpiData doesn’t have a specific command to calculate attack rates; however, it is possible to program an attack-rate-like command.
We have created a program that calculates them for you.
To calculate the Attack Rate you have to run a program called AR.PGM (you can find it in the OSWEGO_BAKCUP folder; you must copy this file and the AR.RPT file into the folder where OSWEGOINV.REC is located).
- i) In the command prompt write:
- SELECT (To eliminate any previous selection)
- RUN AR.PGM
- ii) The program will ask you for the name of the variable where the outcome status is recorded (in our case Q7ILL, but it may have a different name in your own files in the future).
- iii) Then the program will ask you for each food or beverage:
- How to label each food (or beverage); you must write a short but meaningful label
- The name of the variable where the information about each exposure is stored (for example Q16BAKEDHA).
You must be very careful when writing the names of variables. If you make a mistake you will have to close the database (using the close command) and then run the program again.
After each food, EpiData will show the Attack Rate, Attributable Proportion in Exposed (eAF %), Attributable Fraction in population (pAF %) and RR with its 95% confidence interval.
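To make these measures concrete, here is a minimal Python sketch of the standard formulas. It is not the code of AR.PGM, whose internals are not shown here; the counts are hypothetical, and the confidence interval uses the common log-based approximation, which is an assumption about the method. The sketch also assumes at least one case among the unexposed, otherwise the risk ratio is undefined.

```python
import math

def attack_rates(ill_exposed, total_exposed, ill_unexposed, total_unexposed):
    """Attack rates, risk ratio with 95% CI, and attributable fractions."""
    ar_exp = ill_exposed / total_exposed
    ar_unexp = ill_unexposed / total_unexposed
    ar_all = (ill_exposed + ill_unexposed) / (total_exposed + total_unexposed)
    rr = ar_exp / ar_unexp
    # Standard error of ln(RR) for cohort data (log-based approximation).
    se = math.sqrt(1 / ill_exposed - 1 / total_exposed
                   + 1 / ill_unexposed - 1 / total_unexposed)
    return {
        "ar_exposed": ar_exp,
        "ar_unexposed": ar_unexp,
        "rr": rr,
        "rr_ci": (math.exp(math.log(rr) - 1.96 * se),
                  math.exp(math.log(rr) + 1.96 * se)),
        "eaf": (ar_exp - ar_unexp) / ar_exp,   # attributable fraction in exposed
        "paf": (ar_all - ar_unexp) / ar_all,   # attributable fraction in population
    }

# Hypothetical food: 40 of 50 who ate it were ill, 5 of 25 who did not.
result = attack_rates(40, 50, 5, 25)
for name, value in result.items():
    print(name, value)
```

A food with a high attack rate among the exposed, a low attack rate among the unexposed and a high pAF matches the three features of the true vehicle listed above.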
- Calculating Risk Ratio and p values
For those foods which are more likely to be associated with the illness, we can use EpiData to calculate the Risk Ratio (RR) and the corresponding test of association with its probability (p value).
- a) Click on the Analysis button
- b) Select EpiTable
- c) Select the OUTCOME variable (Q7ILL) and click the Add this button
- d) Select the EXPOSURE variable (for example, Q20JELLO) and click the Add this button
- e) Click RUN
You will get a 2 by 2 table with the calculation of OR, RR, X² and p values.
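The odds ratio, the chi-square statistic and its p value can be reproduced by hand from the four cells of a 2 by 2 table. The Python sketch below uses hypothetical counts and the uncorrected Pearson chi-square with one degree of freedom; whether EpiData applies a continuity correction or reports additional tests is not covered here.

```python
import math

def two_by_two(a, b, c, d):
    """a: exposed ill, b: exposed well, c: unexposed ill, d: unexposed well."""
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# Hypothetical table: 40 ill and 10 well among eaters, 5 ill and 20 well among non-eaters.
odds_ratio, chi2, p_value = two_by_two(40, 10, 5, 20)
print(odds_ratio, chi2, p_value)
```

With these hypothetical counts the p value is far below .05, so an association this strong would be very unlikely if eating the food and illness were unrelated.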
Try It!
Based on the Attack Rate results, test which exposure is most likely to be associated with the disease.

