Java Python Week 1 Practical
Introduction to WEKA
What are we doing?
• Download an open source machine learning tool “WEKA” and explore the main
features of this tool.
• Understand and practice the basic data pre-processing operations that can be
performed using WEKA.

Submission:
You are required to submit one .arff file (after completing the practical task as
instructed in this prac document) via the weekly-practical submission box.

What is WEKA?
WEKA (the Waikato Environment for Knowledge Analysis) is a machine learning
toolkit developed at the University of Waikato in Hamilton, New Zealand. The
software provides many machine learning, statistical and other data mining solutions
for various types of data mining tasks, such as classification, cluster detection,
association rule discovery and attribute selection. The software is also equipped with
data pre-processing, post-processing and visualisation tools, so that
complete data mining projects can be conducted through several different styles of
user interface. The toolkit is written in Java and can therefore run on various
platforms, such as Linux, Windows and Macintosh. It is open-source software
distributed under the terms and conditions of the GNU General Public License.

Launching and Starting WEKA
You can find instructions for installing Weka at

https://waikato.github.io/weka-wiki/downloading_weka/

When you open Weka you should see a screen like the one below (Figure 1).
[Figure 1]

Select the Explorer option below Applications.

Data Pre-Processing using WEKA
This example illustrates some of the basic data preprocessing operations that can be
performed using WEKA. The sample data set used for this example, unless otherwise
indicated, is the "bank data" set, stored in the file Bank Data.csv.
The data contains the following fields:
id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city / rural / suburban / town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a saving account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in
".csv" format files. This is fortunate since many databases or spreadsheet applications
can save or export data into flat files in this format. A typical Microsoft Excel worksheet
can be saved as a CSV file and opened by WEKA. The first row of the spreadsheet is
used to name the attributes and the data types for the attributes are derived
automatically but not always accurately. Once opened, you can save the data set into
an ARFF file in WEKA (by clicking “Save” in the Preprocess tab).
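
To make the relationship between the two file formats concrete, here is a minimal Python sketch of a CSV-to-ARFF conversion. It is not WEKA's own converter: the function names (infer_type, csv_to_arff) and the two sample rows are made up for illustration, and the type rule (numeric if every value parses as a number, nominal otherwise) is only an assumption that mimics the kind of automatic type derivation described above.

```python
import csv
import io


def infer_type(values):
    """Return 'numeric' if every value parses as a float, else a nominal spec."""
    try:
        for value in values:
            float(value)
        return "numeric"
    except ValueError:
        labels = sorted(set(values))
        return "{" + ",".join(labels) + "}"


def csv_to_arff(csv_text, relation):
    """Convert CSV text (attribute names in the first row) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        lines.append("@attribute " + name + " " + infer_type(column))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Two made-up rows in the shape of the bank data (not taken from the real file).
sample = "id,age,sex,income\nID1,48,FEMALE,17546.0\nID2,40,MALE,30085.1\n"
print(csv_to_arff(sample, "bank-data"))
```

The sketch ignores missing values and attribute names that need quoting. It also shows why automatic typing is "not always accurate": a column such as "children" contains only digits and is therefore typed as numeric, even if you would prefer to treat it as nominal.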
In this example, we load the data set into WEKA and perform a series of operations using
WEKA's attribute and discretization filters. While all of these operations can be
performed from the command line, we use the graphical interface of the WEKA Explorer.
Initially (in the Preprocess tab) click "Open file..." and navigate to the directory containing
the data file (which is named something like bank-data.csv). This is shown in [Figure 2].
Once the data is loaded, WEKA will recognize the attributes and during the scan of the
data will compute some basic statistics on each attribute. The left panel in [Figure 3]
shows the list of recognized attributes, while the top panels indicate the names of the
base relation (or table) and the current working relation (which are the same initially).
Note: Recent versions of WEKA have an “Edit” button in the Preprocess tab that opens a
viewer showing the current contents of the working dataset. Whenever you apply a
filter in WEKA, you can check the updated contents in this viewer. (Alternatively,
you can use the “Arff Viewer” tool included in WEKA. Refer to the WEKA manual for
further details.)
[Figure 2]



[Figure 3]
Clicking on any attribute in the left panel will show the basic statistics on that
attribute. For categorical attributes, the frequency for each attribute value is shown,
while for continuous attributes we can obtain min, max, mean, standard deviation,
etc. As an example, see [Figure 4] below, which shows the result of selecting the
“age” attribute.

[Figure 4]
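
The per-attribute summary can be sketched in a few lines of Python. This is only an illustration of the idea, not WEKA's implementation: the function name summarize is hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics
from collections import Counter


def summarize(values):
    """Summarize one attribute given as a list of strings.

    Numeric columns give min, max, mean and standard deviation;
    anything else is treated as categorical and gives label frequencies.
    """
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return dict(Counter(values))
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "stddev": statistics.stdev(numbers),  # sample standard deviation (assumption)
    }


# Made-up values for illustration only.
print(summarize(["48", "40", "51", "23"]))
print(summarize(["YES", "NO", "YES"]))
```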

Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id"
attribute). We need to remove this attribute before the data mining step (as this
attribute is not necessary). We can do this by using the Attribute filters in WEKA.

In the "Filter" panel, click on the "Choose" button.

This will show a popup window with a list of available filters. Scroll down the list and
select the "weka.filters.unsupervised.attribute.Remove" filter as shown in [Figure 5].

Next, click on the text box immediately to the right of the "Choose" button.

In the resulting dialog box enter the index of the attribute to be filtered out (this can
be a range or a list separated by commas). In this case, we enter 1, which is the index
of the "id" attribute (see the left panel). Make sure that the "invertSelection" option
is set to false (otherwise everything except attribute 1 will be filtered). Then click "OK"
(see [Figure 6]). Now, in the filter box you will see "Remove -R 1" (see [Figure 7]).

[Figure 5]
[Figure 6]


[Figure 7]

Click the "Apply" button to apply this filter to the data. This will remove the "id"
attribute and create a new working relation (whose name now includes the details of
the filter that was applied). The result is depicted in [Figure 8].
[Figure 8]
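
The effect of the index list and of the "invertSelection" option can be sketched as follows. This is a hypothetical Python re-implementation for illustration, not WEKA code: the names parse_indices and remove_attributes are made up, and only plain numbers and ranges such as "1,3-5" are handled.

```python
def parse_indices(spec):
    """Expand a 1-based spec such as '1,3-5' into a set of 0-based positions."""
    positions = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            positions.update(range(int(first) - 1, int(last)))
        else:
            positions.add(int(part) - 1)
    return positions


def remove_attributes(header, rows, spec, invert=False):
    """Drop the selected columns, or keep only them when invert is True."""
    selected = parse_indices(spec)
    keep = [i for i in range(len(header)) if (i in selected) == invert]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


header = ["id", "age", "sex"]
rows = [["ID1", "48", "FEMALE"]]  # made-up record
print(remove_attributes(header, rows, "1"))               # "id" is removed
print(remove_attributes(header, rows, "1", invert=True))  # only "id" is kept
```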

Discretization
Some techniques, such as association rule mining, can only be performed on
categorical data. This requires performing discretization on numeric or continuous
attributes. (There are 3 such attributes in this data set: "age", "income", and
"children"). Click on the “age” attribute. Again we activate the Filter dialog box, but
this time we select the "Discretize" filter from the list (see [Figure 9]).
[Figure 9]

Next, to change the defaults for this filter, click on the box to the right of the "Choose"
button. This will open the Discretize Filter dialog box.
We enter the index of the attribute to be discretized. In this case we enter 1,
corresponding to the attribute "age" (now the first attribute, since "id" has been
removed). We also enter 3 as the number of bins. (Note that it is possible to
discretize more than one attribute at the same time by using a list of attribute
indexes.) Since we are doing simple binning, all of the other available options
are set to "false". The dialog box is shown in [Figure 10].

Click "Apply" in the Filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins (shown in [Figure 11]).
Finally, save the file as something like "bank-data-final.arff".
Submit this final filtered arff file to prove your work for this weekly
practical.

[Figure 10]
[Figure 11]
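
The simple binning used above can be illustrated with a short Python sketch of equal-width discretization. It is a sketch under stated assumptions rather than WEKA's implementation: the function names are hypothetical, the ages are made up, and a value that falls exactly on a cut point is assumed to belong to the lower bin.

```python
import bisect


def equal_width_cut_points(values, bins):
    """Split the range [min, max] into `bins` intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]


def bin_index(value, cut_points):
    """Return the 0-based bin of a value; boundary values go to the lower bin."""
    return bisect.bisect_left(cut_points, value)


ages = [18, 30, 42, 55, 67]  # made-up ages, not taken from the bank data
cuts = equal_width_cut_points(ages, 3)
print(cuts)
print([bin_index(age, cuts) for age in ages])
```

With 3 bins there are 2 cut points, and every age is replaced by the label of the interval it falls into, which is what turns the numeric attribute into a categorical one.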

Other Useful Filters in WEKA
WEKA provides many more useful preprocessing filters in addition to the ones we
tried in this exercise. Some of them are briefly described below. You are
encouraged to refer to the WEKA manual for further details and to try applying
some of them to the bank data as your own exercise.

In WEKA, data pre-processing is done using attribute or instance filters that can
operate supervised or unsupervised. Attribute filters are applied to attributes
(columns) and instance filters are applied to data objects (rows). Supervised filters
perform with consideration of a class attribute whereas unsupervised filters do not.
(Many unsupervised filters have a supervised counterpart. Supervised filters must be
used with care for classification tasks; test examples must be pre-processed in the
same way as the training examples.)
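
The requirement that test examples be pre-processed in the same way as the training examples can be sketched as "learn the filter parameters on the training data, then reuse them unchanged on the test data". The Python below uses min-max scaling as a stand-in for any filter with learned parameters; the function names and the numbers are hypothetical.

```python
def fit_min_max(train_values):
    """Learn the scaling parameters from the training data only."""
    return min(train_values), max(train_values)


def apply_min_max(values, params):
    """Scale values using parameters that were learned earlier."""
    low, high = params
    return [(v - low) / (high - low) for v in values]


train = [20.0, 40.0, 60.0]  # made-up training values
test = [30.0, 70.0]         # made-up test values
params = fit_min_max(train)
print(apply_min_max(train, params))
print(apply_min_max(test, params))
```

Because the parameters come from the training data, a test value outside the training range (70.0 here) maps outside [0, 1]. Refitting the filter on the test data would hide this and would make the two sets inconsistent.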

The many other filters for data pre-processing have not been described here due to
limitations of space. Filters in WEKA are continuously developed and new filters are
constantly added in new versions.

Add attribute filter
Using the “Add” filter, we can create a new attribute (with an empty value as default)
and specify the location, name and labels of the new attribute. Once created,
the value of the new attribute can be entered manually for each data object in
the viewer window.
New numeric features can be added with the “AddExpression” filter, which
applies a mathematical expression based on the values of other attributes.

Numeric transformation attribute filters
The “MathExpression” filter allows transformation with a valid mathematical
expression that uses arithmetic operators and built-in functions, such as
absolute (abs), logarithm (log), square root (sqrt), etc.
The “NumericTransform” filter only allows transformations by methods
supported by the Java math library. Unlike AddExpression, these filters do not
create new attributes but replace the current values with the transformed
values.
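
The difference between adding a derived attribute and transforming an attribute in place can be sketched in Python as follows. This does not use WEKA's expression syntax: the helper names, the derived attribute "income_per_person" and the sample records are all hypothetical.

```python
import math


def add_expression(rows, new_name, expression):
    """Return copies of the rows with an extra attribute computed from the others."""
    return [{**row, new_name: expression(row)} for row in rows]


def transform_in_place(rows, name, function):
    """Replace the values of an existing attribute with transformed values."""
    for row in rows:
        row[name] = function(row[name])


# Made-up records for illustration only.
rows = [{"income": 100.0, "children": 1}, {"income": 400.0, "children": 3}]

# In the style of AddExpression: a new attribute appears, the others are unchanged.
rows = add_expression(rows, "income_per_person",
                      lambda r: r["income"] / (r["children"] + 1))

# In the style of MathExpression / NumericTransform: "income" itself is overwritten.
transform_in_place(rows, "income", math.sqrt)
print(rows)
```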

Transformation attribute filters
The “Normalize” filter converts the values of all numeric attributes in the
loaded data set to those within a common range. The default range is [0, 1].
The user can change the target range if needed.
The “Standardize” filter standardizes all numeric attributes to have zero mean
and unit variance.
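
The two transformations can be written out for a single numeric column as a minimal Python sketch. The function names are hypothetical; the handling of a constant column and the use of the sample standard deviation are assumptions, not statements about WEKA's behaviour.

```python
import statistics


def normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that they span [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # Assumption: a constant column is mapped to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - low) / (high - low) * (new_max - new_min)
            for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (sample) variance."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


incomes = [10.0, 20.0, 30.0]  # made-up values
print(normalize(incomes))
print(normalize(incomes, -1.0, 1.0))
print(standardize(incomes))
```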

ReplaceMissingValues filter
This rudimentary filter fills in missing values; numeric values are replaced with
the sample mean and nominal values are replaced with the sample mode. The
user can also fill in missing values manually in the viewer window (opened via
“Edit”). For numeric attributes, the user may enter any value. For nominal
attributes, the user can only select one of the nominal labels that already exist
in the attribute domain. If the label does not exist (for instance, it is a special
code indicating unknown), the label can be added to the attribute domain by
using the “AddValues” filter.
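
The mean/mode rule can be sketched for a single column in Python. This is an illustration only: the function name is hypothetical, missing values are represented by None, and a column with no known values at all is not handled.

```python
import statistics
from collections import Counter


def replace_missing(column):
    """Fill None entries with the mean (numeric column) or the mode (nominal column)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = statistics.mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


# Made-up columns for illustration only.
print(replace_missing([1.0, None, 3.0]))
print(replace_missing(["YES", None, "NO", "YES"]))
```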

Resample instance filter
This filter selects a random sample of a certain percentage (SampleSizePercent
parameter) of the loaded data set, with or without replacement (to sample
without replacement, set the noReplacement parameter to True). The
unsupervised Resample filter draws the sample from the entire data set
reflecting the real distribution of attribute values including class values; the
supervised Resample filter draws samples according to either the real
distribution of classes (set the biasToUniformClass parameter to 0) or a
uniform distribution of classes (set the biasToUniformClass parameter to 1).
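
The unsupervised variant can be sketched with Python's random module. This is a hypothetical stand-in, not WEKA code: the function and parameter names only echo the parameters described above, and the class-biased (supervised) sampling is not covered.

```python
import random


def resample(rows, sample_size_percent, no_replacement=False, seed=1):
    """Draw a random sample whose size is a percentage of the data set."""
    rng = random.Random(seed)  # fixed seed so that the sample is repeatable
    size = round(len(rows) * sample_size_percent / 100)
    if no_replacement:
        # Each row can be picked at most once (size must not exceed len(rows)).
        return rng.sample(rows, size)
    # With replacement, the same row may appear several times.
    return rng.choices(rows, k=size)


rows = list(range(10))  # stand-in for 10 data objects
print(resample(rows, 50))
print(resample(rows, 50, no_replacement=True))
```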
