ARFF

http://weka.wikispaces.com/ARFF+%28book+version%29


An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.

Overview

ARFF files have two distinct sections. The first section is the Header information, which is followed the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. An example header on the standard IRIS dataset looks like this:
   % 1. Title: Iris Plants Database
   % 
   % 2. Sources:
   %      (a) Creator: R.A. Fisher
   %      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
   %      (c) Date: July, 1988
   % 
   @RELATION iris
 
   @ATTRIBUTE sepallength  NUMERIC
   @ATTRIBUTE sepalwidth   NUMERIC
   @ATTRIBUTE petallength  NUMERIC
   @ATTRIBUTE petalwidth   NUMERIC
   @ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:

   @DATA
   5.1,3.5,1.4,0.2,Iris-setosa
   4.9,3.0,1.4,0.2,Iris-setosa
   4.7,3.2,1.3,0.2,Iris-setosa
   4.6,3.1,1.5,0.2,Iris-setosa
   5.0,3.6,1.4,0.2,Iris-setosa
   5.4,3.9,1.7,0.4,Iris-setosa
   4.6,3.4,1.4,0.3,Iris-setosa
   5.0,3.4,1.5,0.2,Iris-setosa
   4.4,2.9,1.4,0.2,Iris-setosa
   4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

Examples

Several well-known machine learning datasets are distributed with Weka in the $WEKAHOME/data directory as ARFF files.

The ARFF Header Section

The ARFF Header section of the file contains the relation declaration and attribute declarations.

The @relation Declaration

The relation name is defined as the first line in the ARFF file. The format is:

    @relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.

The @attribute Declarations

Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement which uniquely defines the name of that attribute and its data type. The order the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared then Weka expects that all that attributes values will be found in the third comma delimited column.

The format for the @attribute statement is:
    @attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name then the entire name must be quoted.

The <datatype> can be any of the four types supported by Weka:
  • numeric
  • integer is treated as numeric
  • real is treated as numeric
  • <nominal-specification>
  • string
  • date [<date-format>]
where <nominal-specification> and <date-format> are defined below. The keywords numeric, real, integer, string and date are case insensitive.

Numeric attributes

Numeric attributes can be real or integer numbers.

Nominal attributes

Nominal values are defined by providing an <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

For example, the class value of the Iris dataset can be defined as follows:
    @ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.

String attributes

String attributes allow us to create attributes containing arbitrary textual values. This is very useful in text-mining applications, as we can create datasets with string attributes, then write Weka Filters to manipulate strings (like StringToWordVectorFilter). String attributes are declared as follows:
    @ATTRIBUTE LCC    string

Date attributes

Date attribute declarations take the form:

    @attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string specifying how date values should be parsed and printed (this is the same format used by SimpleDateFormat). The default format string accepts the ISO-8601 combined date and time format: yyyy-MM-dd'T'HH:mm:ss. Check out the Javadoc of the java.text.SimpleDateFormat class for supported character patterns.

Dates must be specified in the data section as the corresponding string representations of the date/time (see example below).

The ARFF Data Section

The ARFF Data section of the file contains the data declaration line and the actual instance lines.

The @data Declaration

The @data declaration is a single line denoting the start of the data segment in the file. The format is:
    @data

The instance data

Each instance is represented on a single line, with carriage returns denoting the end of the instance.

Attribute values for each instance are delimited by commas. A comma may be followed by zero or more spaces. Attribute values must appear in the order in which they were declared in the header section (i.e., the data corresponding to the nth @attribute declaration is always the nth field of the attribute).

A missing value is represented by a single question mark, as in:

    @data
    4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain space must be quoted, as follows:
    @relation LCCvsLCSH
 
    @attribute LCC string
    @attribute LCSH string
 
    @data
    AG5,   'Encyclopedias and dictionaries.;Twentieth century.'
    AS262, 'Science -- Soviet Union -- History.'
    AE5,   'Encyclopedias and dictionaries.'
    AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
    AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string representation specified in the attribute declaration. For example:
    @RELATION Timestamps
 
    @ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
 
    @DATA 
    "2001-04-03 12:12:12"
    "2001-05-03 12:59:55"

Sparse ARFF files

Sparse ARFF files are very similar to ARFF files, but data with value 0 are not be explicitly represented.

Sparse ARFF files have the same header (i.e @relation and @attribute tags) but the data section is different. Instead of representing each value in order, like this:
    @data
    0, X, 0, Y, "class A"
    0, 0, W, 0, "class B"
the non-zero attributes are explicitly identified by attribute number and their value stated, like this:
    @data
    {1 X, 3 Y, 4 "class A"}
    {2 W, 4 "class B"}
Each instance is surrounded by curly braces, and the format for each entry is: <index> <space> <value> where index is the attribute index (starting from 0).

Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a value is unknown, you must explicitly represent it with a question mark (?).

Warning: There is a known problem saving SparseInstance objects from datasets that have string attributes. In Weka, string and nominal data values are stored as numbers; these numbers act as indexes into an array of possible attribute values (this is very efficient). However, the first string value is assigned index 0: this means that, internally, this value is stored as a 0. When a SparseInstance is written, string instances with internal value 0 are not output, so their string value is lost (and when the arff file is read again, the default value 0 is the index of a different string value, so the attribute value appears to change). To get around this problem, add a dummy string value at index 0 that is never used whenever you declare string attributes that are likely to be used in SparseInstance objects and saved as Sparse ARFF files.

See also


Links

文件说明 下面我们来对这个文件的内容进行说明。 识别ARFF文件的重要依据是分行,因此不能在这种文件里随意的断行。空行(或全是空格的行)将被忽略。 以“%”开始的行是注释,WEKA将忽略这些行。如果你看到的“weather.arff”文件多了或少了些“%”开始的行,是没有影响的。 除去注释后,整个ARFF文件可以分为两个部分。第一部分给出了头信息(Head information),包括了对关系的声明和对属性的声明。第二部分给出了数据信息(Data information),即数据集中给出的数据。从“@data”标记开始,后面的就是数据信息了。 关系声明 关系名称在ARFF文件的第一个有效行来定义,格式为 @relation 是一个字符串。如果这个字符串包含空格,它必须加上引号(指英文标点的单引号或双引号)。 属性声明 属性声明用一列以“@attribute”开头的语句表示。数据集中的每一个属性都有它对应的“@attribute”语句,来定义它的属性名称和数据类型。 这些声明语句的顺序很重要。首先它表明了该项属性在数据部分的位置。例如,“humidity”是第三个被声明的属性,这说明数据部分那些被逗号分开的列中,第三列数据 85 90 86 96 ... 是相应的“humidity”值。其次,最后一个声明的属性被称作class属性,在分类或回归任务中,它是默认的目标变量。 属性声明的格式为 @attribute 其中是必须以字母开头的字符串。和关系名称一样,如果这个字符串包含空格,它必须加上引号。 WEKA支持的有四种,分别是 numeric-------------------------数值型 -----分类(nominal)型 string----------------------------字符串型 date []--------日期和时间型 其中 和 将在下面说明。还可以使用两个类型“integer”和“real”,但是WEKA把它们都当作“numeric”看待。注意“integer”,“real”,“numeric”,“date”,“string”这些关键字是区分大小写的,而“relation”“attribute ”和“date”则不区分。 数值属性 数值型属性可以是整数或者实数,但WEKA把它们都当作实数看待。 分类属性 分类属性由列出一系列可能的类别名称并放在花括号中:{, , , ...} 。数据集中该属性的值只能是其中一种类别。 例如如下的属性声明说明“outlook”属性有三种类别:“sunny”,“ overcast”和“rainy”。而数据集中每个实例对应的“outlook”值必是这三者之一。 @attribute outlook {sunny, overcast, rainy} 如果类别名称带有空格,仍需要将之放入引号中。 字符串属性 字符串属性中可以包含任意的文本。这种类型的属性在文本挖掘中非常有用。 示例: @ATTRIBUTE LCC string 日期和时间属性 日期和时间属性统一用“date”类型表示,它的格式是 @attribute date [] 其中是这个属性的名称,是一个字符串,来规定该怎样解析和显示日期或时间的格式,默认的字符串是ISO-8601所给的日期时间组合格式“yyyy-MM-ddTHH:mm:ss”。 数据信息部分表达日期的字符串必须符合声明中规定的格式要求(下文有例子)。 数据信息 数据信息中“@data”标记独占一行,剩下的是各个实例的数据。 每个实例占一行。实例的各属性值用逗号“,”隔开。如果某个属性的值是缺失值(missing value),用问号“?”表示,且这个问号不能省略。例如: @data sunny,85,85,FALSE,no ?,78,90,?,yes 字符串属性和分类属性的值是区分大小写的。若值中含有空格,必须被引号括起来。例如: @relation LCCvsLCSH @attribute LCC string @att
目录列表: 2dplanes.arff abalone.arff ailerons.arff Amazon_initial_50_30_10000.arff anneal.arff anneal.ORIG.arff arrhythmia.arff audiology.arff australian.arff auto93.arff autoHorse.arff autoMpg.arff autoPrice.arff autos.arff auto_price.arff balance-scale.arff bank.arff bank32nh.arff bank8FM.arff baskball.arff bodyfat.arff bolts.arff breast-cancer.arff breast-w.arff breastTumor.arff bridges_version1.arff bridges_version2.arff cal_housing.arff car.arff cholesterol.arff cleveland.arff cloud.arff cmc.arff colic.arff colic.ORIG.arff contact-lenses.arff cpu.arff cpu.with.vendor.arff cpu_act.arff cpu_small.arff credit-a.arff credit-g.arff cylinder-bands.arff delta_ailerons.arff delta_elevators.arff dermatology.arff detroit.arff diabetes.arff diabetes_numeric.arff echoMonths.arff ecoli.arff elevators.arff elusage.arff eucalyptus.arff eye_movements.arff fishcatch.arff flags.arff fried.arff fruitfly.arff gascons.arff glass.arff grub-damage.arff heart-c.arff heart-h.arff heart-statlog.arff hepatitis.arff house_16H.arff house_8L.arff housing.arff hungarian.arff hypothyroid.arff ionosphere.arff iris.2D.arff iris.arff kdd_coil_test-1.arff kdd_coil_test-2.arff kdd_coil_test-3.arff kdd_coil_test-4.arff kdd_coil_test-5.arff kdd_coil_test-6.arff kdd_coil_test-7.arff kdd_coil_train-1.arff kdd_coil_train-3.arff kdd_coil_train-4.arff kdd_coil_train-5.arff kdd_coil_train-6.arff kdd_coil_train-7.arff kdd_el_nino-small.arff kdd_internet_usage.arff kdd_ipums_la_97-small.arff kdd_ipums_la_98-small.arff kdd_ipums_la_99-small.arff kdd_JapaneseVowels_test.arff kdd_JapaneseVowels_train.arff kdd_synthetic_control.arff kdd_SyskillWebert-Bands.arff kdd_SyskillWebert-BioMedical.arff kdd_SyskillWebert-Goats.arff kdd_SyskillWebert-Sheep.arff kdd_UNIX_user_data.arff kin8nm.arff kr-vs-kp.arff labor.arff landsat_test.arff landsat_train.arff letter.arff liver-disorders.arff longley.arff lowbwt.arff lung-cancer.arff lymph.arff machine_cpu.arff mbagrade.arff meta.arff mfeat-factors.arff mfeat-fourier.arff mfeat-karhunen.arff mfeat-morphological.arff mfeat-pixel.arff mfeat-zernike.arff molecular-biology_promoters.arff monks-problems-1_test.arff monks-problems-1_train.arff monks-problems-2_test.arff monks-problems-2_train.arff monks-problems-3_test.arff monks-problems-3_train.arff mushroom.arff mv.arff nursery.arff optdigits.arff page-blocks.arff pasture.arff pbc.arff pendigits.arff pharynx.arff pol.arff pollution.arff postoperative-patient-data.arff primary-tumor.arff puma32H.arff puma8NH.arff pwLinear.arff pyrim.arff quake.arff ReutersCorn-test.arff ReutersCorn-train.arff ReutersGrain-test.arff ReutersGrain-train.arff schlvote.arff segment-challenge.arff segment-test.arff segment.arff sensory.arff servo.arff sick.arff sleep.arff solar-flare_1.arff solar-flare_2.arff sonar.arff soybean.arff spambase.arff spectf_test.arff spectf_train.arff spectrometer.arff spect_test.arff spect_train.arff splice.arff sponge.arff squash-stored.arff squash-unstored.arff stock.arff strike.arff supermarket.arff triazines.arff unbalanced.arff vehicle.arff veteran.arff vineyard.arff vote.arff vowel.arff water-treatment.arff waveform-5000.arff weather.nominal.arff weather.numeric.arff white-clover.arff wine.arff wisconsin.arff zoo.arff
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值