ARFF Format Details

File Structure

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a machine learning dataset (or relation). It was developed at the University of Waikato (NZ) for use with the Weka machine learning software. We will use a simplified version for CS 478.

ARFF files have two distinct sections:

  1. Metadata information (relation's schema)
    • Name of relation
    • List of attributes and domains
  2. Data information
    • Actual instances or rows of the relation

Optional comments may also be included (lines prefixed with %), and blank lines are ignored.
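
As a small illustration of that rule, the following Python sketch filters out comments and blank lines before any further processing. The function name significant_lines is hypothetical, and treating a line as a comment even when whitespace precedes the % is an assumption rather than something the format states.

```python
def significant_lines(text):
    """Yield the lines of an ARFF document that carry content.

    Assumption: a comment is any line whose first non-blank character is '%'.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue  # skip blank lines and comments
        yield stripped


sample = """% a comment

@RELATION hypo
@DATA
"""
print(list(significant_lines(sample)))  # ['@RELATION hypo', '@DATA']
```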

Here is an example of a small ARFF file:

% 1. Title: Hypothetical Database
% 
% 2. Sources:
%      (a) Creator: C. Giraud-Carrier
%      (b) Institution: BYU
%      (c) Date: August, 2004

@RELATION hypo

@ATTRIBUTE length      CONTINUOUS
@ATTRIBUTE color       {Red, Green, Blue}
@ATTRIBUTE age         CONTINUOUS
@ATTRIBUTE neighbors   {1, 2-to-4, 5-to-9, more-than-10}
@ATTRIBUTE class       {True, False}
  
@DATA
   5.1,Red,4,2-to-4,True
   4.9,Blue,4,1,True
   4.7,Red,3,more-than-10,False
   4.6,Green,5,2-to-4,True
   5.0,Blue,4,5-to-9,False
   5.4,Red,7,5-to-9,True

Order of Metadata Information

  1. The relation's name is defined in the first line of the file that is neither a comment nor blank:
    • Of the form: @RELATION <relation-name>
    • Declaration is case-insensitive
    • <relation-name> is a whitespace-free string
  2. Attribute declarations take the form of an ordered sequence of statements, one per line:
    • Of the form: @ATTRIBUTE <attribute-name> <domain>
    • Declaration is case-insensitive
    • <attribute-name> is a whitespace-free string
    • <domain> is either:
      • CONTINUOUS (case-insensitive) for real/integer values
      • A set of possible nominal values, enclosed in curly braces and delimited with commas, of the form:
        {<nominal-value-1>, <nominal-value-2>, <nominal-value-3>, ...}
      • The nominal domain type is used for attributes whose values range over a (small) finite set
      • Nominal values are whitespace-free, case-sensitive strings
    • Important: The order in which the attributes are declared indicates the column position in the data section of the file.
  3. The data declaration is a single line denoting the start of the data segment in the file:
    • Of the form: @DATA
    • Each instance is then coded on a single line, with a line break denoting the end of the instance.
    • Attribute values for each instance are delimited by commas, with or without whitespace in between.
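
The rules above can be sketched as a small Python reader. This is a minimal illustration under stated assumptions, not the Weka parser or an official course implementation: the name parse_arff and the shape of its return value are hypothetical, "case-insensitive" is read as applying to the @RELATION, @ATTRIBUTE and @DATA keywords and to the word CONTINUOUS, continuous values are stored as floats, and a continuous domain is represented by None.

```python
def parse_arff(text):
    """Parse the simplified ARFF dialect described above.

    Returns (relation, attributes, rows). attributes is a list of
    (name, domain) pairs; domain is None for CONTINUOUS (an assumption of
    this sketch) or a list of nominal values.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # comments and blank lines are ignored

        if in_data:
            # One instance per line; commas may be surrounded by whitespace.
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):
                raise ValueError(f"expected {len(attributes)} values: {line!r}")
            row = []
            for value, (name, domain) in zip(values, attributes):
                if domain is None:
                    row.append(float(value))   # real/integer value
                elif value in domain:          # nominal values are case-sensitive
                    row.append(value)
                else:
                    raise ValueError(f"{value!r} is not in the domain of {name}")
            rows.append(row)
            continue

        parts = line.split(None, 2)
        keyword = parts[0].upper()             # keywords are case-insensitive
        if relation is None and keyword != "@RELATION":
            raise ValueError("@RELATION must come first")
        if keyword == "@RELATION" and len(parts) == 2:
            relation = parts[1]
        elif keyword == "@ATTRIBUTE" and len(parts) == 3:
            name, domain = parts[1], parts[2]
            if domain.upper() == "CONTINUOUS":
                attributes.append((name, None))
            elif domain.startswith("{") and domain.endswith("}"):
                nominal = [v.strip() for v in domain[1:-1].split(",")]
                attributes.append((name, nominal))
            else:
                raise ValueError(f"unrecognized domain: {domain!r}")
        elif keyword == "@DATA" and len(parts) == 1:
            in_data = True
        else:
            raise ValueError(f"unrecognized declaration: {line!r}")

    return relation, attributes, rows


example = """\
% tiny example
@RELATION hypo
@ATTRIBUTE length CONTINUOUS
@ATTRIBUTE color  {Red, Green, Blue}
@DATA
5.1, Red
4.9,Blue
"""
relation, attributes, rows = parse_arff(example)
print(relation)    # hypo
print(attributes)  # [('length', None), ('color', ['Red', 'Green', 'Blue'])]
print(rows)        # [[5.1, 'Red'], [4.9, 'Blue']]
```

Because attribute order fixes column position, the sketch pairs each value with its declaration by position and rejects rows with the wrong number of values.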