
EpiData XML File Format Specification

This document contains the specification for the EpiData XML file format (experimental version).

The number of alphabets, character sets, etc. on the supported platforms (Linux, Mac, Windows) in numerous countries is so large that data and documentation need to be saved in a uniform way. One option would be to use a generally available and documented format such as the ODF standard, Stata binary files or the Data Documentation Initiative (DDI) format. Another would be to maintain the well-known REC + CHK file formats used in EpiData (see specification) and add some new specifications. For more discussion of formatted data file structures see here or consult the DDI format specifications.

The main requirements for the data format are:

  • speed of reading and writing.
  • uniformity of specification across operating systems and countries.
  • possibility of fixing data file errors in a standard text file editor.
  • minimal overhead due to general data format specification requirements.
  • no need for the end user to consider where this or that file came from.

Following some experimentation in mid 2009 and a review of the DDI and ODF standards, it was decided, because of speed and overhead concerns, to create a simplified, EpiData-specific XML data file structure. Other data formats will be supported through export and/or import functionality.

XML Standard

In accordance with the XML standard, the first line in an XML formatted file must be: <?xml version="1.0"?> Optionally this line may specify encoding="utf-8", but EpiData uses UTF-8 by default and therefore does not write it.

The actual data file contents are contained between the root tags: <EPIDATA> … </EPIDATA>

The overall file structure is therefore:

<?xml version="1.0"?>
<EPIDATA> ... 
  ... sections ....
</EPIDATA>

Sections

The datafile structure contains four sections, of which METADATA is optional. The four sections are:

<SETTINGS>
  ...
</SETTINGS>

Provides formatting style information and information about optional password protection.

<METADATA>
  ...
</METADATA>  

Provides metadata for the dataset, covering either the dataset as a whole or the individual fields.

<FIELDS>
  ...
</FIELDS>

Provides a list of fields/labels in a dataset.

<RECORDS>
  ...
</RECORDS>

Provides the data for the dataset, formatted according to the specification in the settings section.

The order is not important, but for better readability we suggest the ordering as displayed above.
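
As an illustration of the section rules above, the following is a minimal Python sketch (not part of EpiData) that checks which sections a parsed file contains and whether they follow the suggested order. The function name check_sections and the sample document are hypothetical, and treating SETTINGS, FIELDS and RECORDS as required is an interpretation of "METADATA is optional".

```python
import xml.etree.ElementTree as ET

# METADATA is the only optional section according to the text above.
REQUIRED_SECTIONS = ("SETTINGS", "FIELDS", "RECORDS")
SUGGESTED_ORDER = ("SETTINGS", "METADATA", "FIELDS", "RECORDS")


def check_sections(root: ET.Element):
    """Return (missing required sections, whether the suggested order is used)."""
    present = [child.tag for child in root if child.tag in SUGGESTED_ORDER]
    missing = [name for name in REQUIRED_SECTIONS if name not in present]
    expected = [name for name in SUGGESTED_ORDER if name in present]
    return missing, present == expected


# Hypothetical document with empty sections, written in a non-suggested order.
sample = ET.fromstring("<EPIDATA><FIELDS/><SETTINGS/><RECORDS/></EPIDATA>")
missing, ordered = check_sections(sample)
print(missing, ordered)  # [] False
```

Since the order is not significant, a reader would normally accept such a file and only use the second result to decide how to write it back.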

SETTINGS

This section provides information about the format of the datafile. The intent is that programs adhering to these settings can read and write data without first having to guess the formatting style, and that other programs can choose a different default formatting style than the one EpiData uses.

This section MUST be included in the file.

Tag | Type | Default | Required | Description
VERSION 1) | Integer | 0 | x | The version of the datafile format. Currently only the experimental version (0) is used, but this is subject to change in the future.
DATESEP | Char | / | x | Specifies the character used as separator in date fields.
DECIMALSEP | Char | . | x | Specifies the character used as the decimal separator in floating point values.
MISSINGMARK | String | . | x | Specifies the string used to represent native missing values.
PASSWORD | String | N/A | | A Base-64 encoded (AES encrypted) string specifying the password used to encode the crypt fields. To decrypt it, the string must first be Base-64 decoded and then AES decrypted, using the password itself as initialization vector.

Example

<SETTINGS>
  <VERSION>0</VERSION>
  <DATESEP>/</DATESEP>
  <DECIMALSEP>.</DECIMALSEP>
  <MISSINGMARK>.</MISSINGMARK>
  <PASSWORD>TziKIl3okMz=</PASSWORD>
</SETTINGS>
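
The SETTINGS example above could be read along the following lines. This is a sketch only: the Settings class and read_settings function are hypothetical names, and falling back to the table defaults when a tag is absent is an assumption (the table marks these tags as required). The PASSWORD value is left untouched because the AES details are not fully described here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

SETTINGS_XML = """<SETTINGS>
  <VERSION>0</VERSION>
  <DATESEP>/</DATESEP>
  <DECIMALSEP>.</DECIMALSEP>
  <MISSINGMARK>.</MISSINGMARK>
  <PASSWORD>TziKIl3okMz=</PASSWORD>
</SETTINGS>"""


@dataclass
class Settings:
    version: int = 0
    datesep: str = "/"
    decimalsep: str = "."
    missingmark: str = "."
    password: Optional[str] = None  # left Base-64 encoded and AES encrypted


def read_settings(section: ET.Element) -> Settings:
    """Read a <SETTINGS> element; absent tags fall back to the table defaults."""
    defaults = Settings()

    def text(tag: str, fallback: str) -> str:
        value = section.findtext(tag)
        return fallback if value is None else value

    return Settings(
        version=int(text("VERSION", str(defaults.version))),
        datesep=text("DATESEP", defaults.datesep),
        decimalsep=text("DECIMALSEP", defaults.decimalsep),
        missingmark=text("MISSINGMARK", defaults.missingmark),
        password=section.findtext("PASSWORD"),
    )


settings = read_settings(ET.fromstring(SETTINGS_XML))
print(settings.version, settings.datesep, settings.missingmark)  # 0 / .
```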

METADATA

This section provides all metadata for the dataset. This includes the dataset label, study information, value labels, missing value information, etc. Additionally, a subsection of user-defined information can be added. The METADATA section is OPTIONAL and may be empty.

Tag | Type | Description
FILELABEL | String | Describes the content of the dataset.
STUDY | String | Describes the study information for the dataset.
MODIFIED | Date (ymd) | Last modification date. This should be updated each time the file is modified. The date is written in year, month, day order (YYYYMMDD) using the date separator specified in the SETTINGS section.
VERSION | String | User-provided version of the dataset.
LABELS | n/a | A subsection describing value labels for the fields. See below for further information.
USERDATA | n/a | A subsection for user-defined information. See below for further information.

LABELS

The value label subsection is further divided into sections, each containing a single value label set.

REMARK: Unlike the old .rec/.chk file format, user-defined missing values are now stored within the value label set and can therefore also be given a value label. In addition, there is no limit on the number of user-defined missing values.

Each value label set is defined by a <LABEL> tag, which takes the following attributes (order is not significant):

Attribute | Type | Required | Description
NAME | String | x | Identification name for the value label. It is used as a reference by the fields (see below).
TYPE | Integer | x | Specifies the type of the values in the set. See list of supported types here.
EXTERNAL | String | | Specifies that the value label set is to be obtained from the file named by the attribute. If this file has two key fields, key field 1 is used for the values and key field 2 for the labels; otherwise the first field is the value and the second field is the label. This allows external references to non-EpiData file formats. When this attribute is specified, the value/label pairs described below are not required (they will be ignored by EpiData).

Example

<LABEL NAME="mylabel" TYPE="0" EXTERNAL="/home/user/epidata/icd10.rec"/>

Within the LABEL tags a list of SET tags must be specified. Each SET tag has the following attributes:

Attribute | Type | Required | Description
VALUE | As specified in the LABEL tag | x | The value of the value/label pair.
LABEL | String | x | The label associated with the value above.
MISSING | Integer | | Specifies whether the value is a user-defined missing value. The value is only treated as missing if this attribute is “1”.

Example

<LABEL NAME="kmlbl" TYPE="0">
  <SET VALUE="1" LABEL="0- 25 km"/>
  <SET VALUE="2" LABEL="26-62 km"/>
  <SET VALUE="3" LABEL="63 -120"/>
  <SET VALUE="4" LABEL="120+ km"/>
  <SET VALUE="5" LABEL="Copenhagen"/>
  <SET VALUE="6" LABEL="East Denmark Other"/>
  <SET VALUE="9" LABEL="Missing" MISSING="1"/>
</LABEL>
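
A value label set such as the one above could be read into a lookup table as follows. This is a minimal Python sketch; read_label_set is a hypothetical name, and values are kept as strings because the list of supported TYPE codes is not reproduced in this document.

```python
import xml.etree.ElementTree as ET

LABEL_XML = """<LABEL NAME="kmlbl" TYPE="0">
  <SET VALUE="1" LABEL="0- 25 km"/>
  <SET VALUE="2" LABEL="26-62 km"/>
  <SET VALUE="3" LABEL="63 -120"/>
  <SET VALUE="4" LABEL="120+ km"/>
  <SET VALUE="5" LABEL="Copenhagen"/>
  <SET VALUE="6" LABEL="East Denmark Other"/>
  <SET VALUE="9" LABEL="Missing" MISSING="1"/>
</LABEL>"""


def read_label_set(label: ET.Element):
    """Return (name, {value: label}, set of user-defined missing values)."""
    labels = {}
    missing = set()
    for item in label.findall("SET"):
        value = item.get("VALUE")
        labels[value] = item.get("LABEL")
        # Only the exact string "1" marks a user-defined missing value.
        if item.get("MISSING") == "1":
            missing.add(value)
    return label.get("NAME"), labels, missing


name, labels, missing = read_label_set(ET.fromstring(LABEL_XML))
print(name, labels["5"], sorted(missing))  # kmlbl Copenhagen ['9']
```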

USERDATA

If the user wishes to provide additional information for the dataset, it is possible to create a subsection within the metadata called USERDATA. Any tags or text written here must be preserved as presented, but it is left to the program whether this section is editable or not. There is no requirement as to the structure or the content. However, if the USERDATA tag is present but empty, a program may legally omit the section when writing the file again.

Example

<METADATA>
  <FILELABEL>Marathon data - 1995 across bridges from Funen and</FILELABEL>
  <STUDY>This is an example text for the study xml tag</STUDY>
  <MODIFIED>2009/09/01</MODIFIED>
  <VERSION>1.0 beta</VERSION>
  <LABELS>
    <LABEL NAME="kmlbl" TYPE="0">
      <SET VALUE="1" LABEL="0- 25 km"/>
      <SET VALUE="2" LABEL="26-62 km"/>
      <SET VALUE="3" LABEL="63 -120"/>
      <SET VALUE="4" LABEL="120+ km"/>
      <SET VALUE="5" LABEL="Copenhagen"/>
      <SET VALUE="6" LABEL="East Denmark Other"/>
    </LABEL>
  </LABELS>
  <USERDATA>
    This is an example how user can include additional descriptions into the dataset.
    This allowed for both XML tags and normal text - however the style should adhere to the XML standard.
  </USERDATA>
</METADATA>
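
The MODIFIED value depends on the date separator from the SETTINGS section. The sketch below shows one way to read and write it in Python; parse_modified and format_modified are hypothetical helper names, and the separator is assumed to be an ordinary character such as "/" or "-".

```python
from datetime import date, datetime


def _ymd_format(datesep: str) -> str:
    """Build a year, month, day format string around the given separator."""
    return datesep.join(("%Y", "%m", "%d"))


def parse_modified(text: str, datesep: str = "/") -> date:
    """Parse a MODIFIED value such as 2009/09/01."""
    return datetime.strptime(text.strip(), _ymd_format(datesep)).date()


def format_modified(day: date, datesep: str = "/") -> str:
    """Format a date for writing back into the MODIFIED tag."""
    return day.strftime(_ymd_format(datesep))


print(parse_modified("2009/09/01"))  # 2009-09-01
```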

FIELDS

This section defines the data fields and labels/headlines in the dataset. The order of the fields in this section determines the field order within the programs: although positional information is provided, this order must be respected and the fields must be written back in the same order. Each <FIELD> tag MUST include the attribute TYPE, which specifies the type of field. The list of supported types is found here.

For each <FIELD> tag the following sub-tags are required:

Tag | Type | Description
NAME | String | Identification name for the field. There is no limit on the size of the content. It is recommended to use short, informative names without spaces or special characters (such as #, $, etc.).
TAG | Integer | An identification tag used to associate the field with data in the <RECORDS> section.
LENGTH | Integer | Specifies the length of the field. There is no restriction on the length of a field. For floating point fields this length includes the decimal separator AND the decimals (see below).
DECIMALS | Integer | Specifies the number of decimals in floating point fields; otherwise it should be 0 (zero).
POS | Integer | The attributes X and Y specify the pixel position 2) of the visual entry field.

The following sub-tags are optional:

Tag | Type | Description
LABEL | String | An optional label describing the content of the field. It is recommended to put the field description here rather than in NAME, since the field is accessed through its name (which in turn has to be written out in full). If this tag is used, the two attributes X and Y (integer) MUST be specified. These attributes specify the pixel position 3) of the label.
COLOUR | Hex | The colour of the entry field on the data form, given as an HTML colour code (see the W3C-School specification here).
VALUELABEL | String | References a value label specified in the METADATA section above. Only a single value label can be referenced.
DEFAULTVALUE | As field type | The default value given to the variable on entering the entry field.
CONFIRM | n/a | If the tag is present, the field is not left when LENGTH is reached; leaving it requires confirmation with the ENTER/RETURN key.
REPEAT | n/a | If the tag is present, the field is filled, upon entering the entry field, with the value entered in the previous record.
ENTER | Boolean | If the tag is NOT present, entering data is optional. If the tag is present it MUST have the value TRUE or FALSE. FALSE = data entry is not possible. TRUE = data must be entered.
JUMPS | n/a | A subsection describing the jumps possible for this field. See below for further information.
RANGE | n/a | A subsection describing the range of legal values for this field. See below for further information.
TOPOFSCREEN | Integer | If the tag is present it tells the entry program to “clear” the page, i.e. this field is placed at the top of the screen (the number specifies the amount of space above the field).
TYPECOMMENT | n/a | If present and the field has value labels, the label for the value entered is printed to another location based on the chosen attribute. COLOUR = the label is printed beside the entry field in the colour specified (given as a hex code, see COLOUR above). FIELD = the label is printed to another field (the receiving field must be of type string).

REMARK: Since this is still an experimental format, the above set of tags/attributes is subject to change. We are currently considering whether “standalone” labels should be placed in the <METADATA> section in order to simplify the format for export/import in third-party programs.

JUMPS

FIXME

RANGE

FIXME

Example
<FIELD TYPE="6">
  <NAME>dectime</NAME>
  <TAG>F8</TAG>
  <LENGTH>6</LENGTH>
  <DECIMALS>4</DECIMALS>
  <POS X="265" Y="236"/>
  <LABEL X="59" Y="241">Completion time - Hours</LABEL>
  <COLOUR>FFFFFF</COLOUR>
</FIELD>
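
To illustrate how LENGTH and DECIMALS relate, the following Python sketch reads the field definition above and computes how many digits remain before the decimal separator. The Field class and both function names are hypothetical. TAG is kept as text because the example uses the value F8.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional, Tuple

FIELD_XML = """<FIELD TYPE="6">
  <NAME>dectime</NAME>
  <TAG>F8</TAG>
  <LENGTH>6</LENGTH>
  <DECIMALS>4</DECIMALS>
  <POS X="265" Y="236"/>
  <LABEL X="59" Y="241">Completion time - Hours</LABEL>
  <COLOUR>FFFFFF</COLOUR>
</FIELD>"""


@dataclass
class Field:
    type_code: int
    name: str
    tag: str
    length: int
    decimals: int
    pos: Tuple[int, int]
    label: Optional[str]


def read_field(elem: ET.Element) -> Field:
    """Read the required sub-tags of a <FIELD> element plus the optional LABEL."""
    pos = elem.find("POS")
    return Field(
        type_code=int(elem.get("TYPE")),
        name=elem.findtext("NAME"),
        tag=elem.findtext("TAG"),
        length=int(elem.findtext("LENGTH")),
        decimals=int(elem.findtext("DECIMALS")),
        pos=(int(pos.get("X")), int(pos.get("Y"))),
        label=elem.findtext("LABEL"),
    )


def integer_digits(field: Field) -> int:
    """Digits left of the separator: LENGTH covers separator and decimals."""
    if field.decimals == 0:
        return field.length
    return field.length - field.decimals - 1


dectime = read_field(ET.fromstring(FIELD_XML))
print(dectime.name, dectime.length, integer_digits(dectime))  # dectime 6 1
```

With LENGTH 6 and DECIMALS 4, a value such as 3.7333 fills the field exactly: one digit, the separator and four decimals.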

RECORDS

This section contains the actual data of the dataset. The order of the individual records defines the ordering of the data.

For each record the data are contained in a single-line tag, where each field is referenced as an attribute of the tag by its sequential number, prefixed by the letter 'F'. E.g. F2 is the second field defined in the FIELDS section. An optional attribute ST may be specified, indicating the status of the record:

  • 0 or no attribute: This is a normal record.
  • 1: This record is marked as deleted.
  • 2: This record is marked as verified.

Example

Please note that in record no. 3, field 8 and field 9 contain a "." which, as specified in the SETTINGS section, represents a native missing value.

<RECORDS>
  <REC F1="7" F2="2" F3="0" F4="231" F5="4" F6="38" F7="30" F8="3.7333" F9="1" F10="1"/>
  <REC F1="8" F2="1" F3="0" F4="231" F5="4" F6="38" F7="30" F8="4.1167" F9="2" F10="1"/>
  <REC F1="9" F2="2" F3="0" F4="71" F5="3" F6="40" F7="40" F8="." F9="." F10="0"/>
  <REC F1="10" F2="2" F3="0" F4="256" F5="4" F6="49" F7="40" F8="4.2" F9="2" F10="1" ST="1"/>
</RECORDS>
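
The record layout above could be read as follows. This is a minimal Python sketch with the hypothetical function name read_records. Values are left as strings (conversion would depend on the field types), and any value equal to the missing mark is mapped to None, which is a simplification for fields that may legitimately contain that string.

```python
import xml.etree.ElementTree as ET

RECORDS_XML = """<RECORDS>
  <REC F1="7" F2="2" F3="0" F4="231" F5="4" F6="38" F7="30" F8="3.7333" F9="1" F10="1"/>
  <REC F1="8" F2="1" F3="0" F4="231" F5="4" F6="38" F7="30" F8="4.1167" F9="2" F10="1"/>
  <REC F1="9" F2="2" F3="0" F4="71" F5="3" F6="40" F7="40" F8="." F9="." F10="0"/>
  <REC F1="10" F2="2" F3="0" F4="256" F5="4" F6="49" F7="40" F8="4.2" F9="2" F10="1" ST="1"/>
</RECORDS>"""

# Status codes as listed above; a missing ST attribute means a normal record.
STATUS = {"0": "normal", "1": "deleted", "2": "verified"}


def read_records(section: ET.Element, missingmark: str = "."):
    """Return a list of (status, {field attribute: value or None}) tuples."""
    rows = []
    for rec in section.findall("REC"):
        status = STATUS[rec.get("ST", "0")]
        values = {
            name: (None if value == missingmark else value)
            for name, value in rec.attrib.items()
            if name != "ST"
        }
        rows.append((status, values))
    return rows


rows = read_records(ET.fromstring(RECORDS_XML))
print(rows[2][1]["F8"], rows[3][0])  # None deleted
```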

Examples

Please see this site for additional examples of the new EpiData XML file format.

1) current experimental version
2), 3) top-left corner
 
documentation/datafileformat/xml_v0.txt · Last modified: 2011/01/07 11:33 (external edit)
 
Except where otherwise noted, content on the EpiData wiki is licensed under a Creative Commons License.