ARFF Format Details
File Structure
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a machine learning dataset (or relation). It was developed at the University of Waikato (NZ) for use with the Weka machine learning software. We will use a simplified version for CS 478.
ARFF files have two distinct sections:
- Metadata information (relation's schema)
  - Name of relation
  - List of attributes and domains
- Data information
  - Actual instances or rows of the relation
Optional comments may also be included (lines prefixed with %), and blank lines are ignored.
Here is an example of a small ARFF file:
% 1. Title: Hypothetical Database
%
% 2. Sources:
%    (a) Creator: C. Giraud-Carrier
%    (b) Institution: BYU
%    (c) Date: August, 2004
@RELATION hypo
@ATTRIBUTE length CONTINUOUS
@ATTRIBUTE color {Red, Green, Blue}
@ATTRIBUTE age CONTINUOUS
@ATTRIBUTE neighbors {1, 2-to-4, 5-to-9, more-than-10}
@ATTRIBUTE class {True, False}
@DATA
5.1,Red,4,2-to-4,True
4.9,Blue,4,1,True
4.7,Red,3,more-than-10,False
4.6,Green,5,2-to-4,True
5.0,Blue,4,5-to-9,False
5.4,Red,7,5-to-9,True
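The two-section layout described above can be illustrated with a short Python sketch. This is only one possible way to separate the sections, not part of the format definition; the function name `split_sections` and the tiny sample text are made up for illustration.

```python
def split_sections(text):
    """Split ARFF text into (metadata_lines, data_lines).

    Comment lines (prefixed with %) and blank lines are dropped.
    """
    metadata, data = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # comments and blank lines are ignored
        if not in_data and line.upper() == "@DATA":
            in_data = True  # every remaining line is an instance
            continue
        (data if in_data else metadata).append(line)
    return metadata, data


# Hypothetical sample, shortened from the kind of file shown above.
sample = """% A tiny hypothetical relation
@RELATION hypo

@ATTRIBUTE length CONTINUOUS
@ATTRIBUTE color {Red, Green, Blue}
@DATA
5.1,Red
4.9,Blue
"""

metadata, data = split_sections(sample)
```

Here `metadata` holds the three declaration lines and `data` holds the two instance lines, with the comment and the blank line discarded.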
Order of Metadata Information
- The relation's name is defined in the first non-comment line of the file:
  - Of the form: @RELATION <relation-name>
  - Declaration is case-insensitive
  - <relation-name> is a whitespace-free string
- Attribute declarations take the form of an ordered sequence of statements, one per line:
  - Of the form: @ATTRIBUTE <attribute-name> <domain>
  - Declaration is case-insensitive
  - <attribute-name> is a whitespace-free string
  - <domain> is either:
    - CONTINUOUS (case-insensitive) for real/integer values
    - A set of possible nominal values, enclosed in curly braces and delimited with commas, of the form: {<nominal-value-1>, <nominal-value-2>, <nominal-value-3>, ...}
  - The nominal domain type is used for attributes whose values range over a (small) finite set
  - Nominal values are whitespace-free, case-sensitive strings
  - Important: The order in which the attributes are declared indicates the column position in the data section of the file.
- The data declaration is a single line denoting the start of the data segment in the file:
  - Of the form: @DATA
  - Each instance is then coded on a single line, with a carriage return denoting the end of the instance.
  - Attribute values for each instance are delimited by commas, with or without whitespace in between.
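The declaration rules above can be turned into a small Python sketch. It covers only the simplified format described here, and it is not the Weka reader or any official parser. The function names and the choice to represent a CONTINUOUS domain as `None` are assumptions made for illustration.

```python
def parse_relation(line):
    """Return the relation name from an '@RELATION <relation-name>' line."""
    keyword, name = line.split(None, 1)
    if keyword.upper() != "@RELATION":  # the declaration is case-insensitive
        raise ValueError(f"expected @RELATION, got {keyword!r}")
    return name.strip()


def parse_attribute(line):
    """Return (name, domain); domain is None for CONTINUOUS, else a list of nominal values."""
    keyword, name, domain = line.split(None, 2)
    if keyword.upper() != "@ATTRIBUTE":
        raise ValueError(f"expected @ATTRIBUTE, got {keyword!r}")
    domain = domain.strip()
    if domain.upper() == "CONTINUOUS":
        return name, None
    if domain.startswith("{") and domain.endswith("}"):
        # Nominal values are comma-delimited inside the curly braces.
        return name, [value.strip() for value in domain[1:-1].split(",")]
    raise ValueError(f"unrecognized domain: {domain!r}")


def parse_instance(line, attributes):
    """Convert one data line into a list of values, in attribute declaration order."""
    fields = [field.strip() for field in line.split(",")]  # whitespace around commas is allowed
    if len(fields) != len(attributes):
        raise ValueError(f"expected {len(attributes)} values, got {len(fields)}")
    row = []
    for field, (name, domain) in zip(fields, attributes):
        if domain is None:
            row.append(float(field))  # CONTINUOUS: real/integer value
        elif field in domain:  # nominal values are case-sensitive
            row.append(field)
        else:
            raise ValueError(f"{field!r} is not a valid value for {name}")
    return row


relation = parse_relation("@relation hypo")
attributes = [
    parse_attribute("@attribute length continuous"),
    parse_attribute("@ATTRIBUTE color {Red, Green, Blue}"),
]
row = parse_instance("5.1, Red", attributes)
```

Because column position follows declaration order, `parse_instance` pairs each field with the attribute declared at the same index. A value such as `red` would be rejected for `color`, since nominal values are matched case-sensitively.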