CS3920 Computer Learning:
Coursework
Yuri Kalnishkan
November 9, 2010
The coursework consists of implementing two learning methods, Nearest
Neighbours and Support Vector Machines with Vapnik’s polynomial kernel, in
MatLab (or a clone; if you use a clone, please contact the lecturer).
The coursework assignment is strictly individual. The submission is entirely
electronic.
In order to submit, copy all your submission files into a directory, e.g.,
Learning/, and run the script submitCoursework Learning from the parent
directory. Choose CS3920 and cw when prompted by the script. You will receive
an automatically generated e-mail with a confirmation of your submission.
Please keep the e-mail as a submission receipt in case of a dispute; it is your
proof of submission. No complaints will be accepted later without a submission
receipt. If you have not received a confirmation e-mail, please contact Support.
You should submit files with the following names:
• NN.m should contain the source code for the Nearest Neighbours method;
• SVM.m should contain the source code for the support vector machine;
• all other necessary files with scripts and functions (please avoid submitting
unnecessary files such as old versions, back-up copies made by the editor
etc.);
• report.pdf or report.rtf should contain the numerical results; please
avoid OpenOffice formats such as .odt.
The deadline for submission is Wednesday, December 1st, 14:00. An
extension can only be given by the academic advisor.
1 Datasets
Download the archive with two datasets, iris.txt and ionosphere.txt, from
the course web page. Each line of these files represents one labelled example
with comma-separated attributes. The last number in the line describes the
classification, +1 or −1.
• iris.txt is perhaps one of the best known data sets to be found in the
pattern recognition literature. Each example has 4 attributes describing
sepal length, sepal width, petal length, and petal width of an iris plant.
The labels are −1 for Iris Setosa and +1 for Iris Versicolour.
• ionosphere.txt contains data collected by a radar system in Goose Bay,
Labrador. This system consists of a phased array of 16 high-frequency
antennas with a total transmitted power on the order of 6.4 kilowatts.
‘Good’ (+1) radar returns are those showing evidence of some type of
structure in the ionosphere. ‘Bad’ (−1) returns are those that do not.
The datasets are based on Frank, A. and Asuncion, A. (2010). UCI Machine
Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of
California, School of Information and Computer Science.
For each dataset do the following:
1. Load it into MatLab using
DATA = dlmread('filename.txt',',');
2. Split the dataset into the training and test sets. For iris.txt use the first
60 lines as the training set and the rest as the test set. For ionosphere.txt
use the first 150 lines as the training set and the rest as the test set. Create
the matrices XTrain with the training examples and YTrain the training
labels. Create the matrices XTest with the test examples and YTest the
test labels.
3. Run the functions NN and SVM with different parameters as described below.
4. Write the results into the report (see below for a description of what to
report for each method).
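Steps 1 and 2 above can be sketched as follows. This is only an illustration: the helper name splitData is hypothetical (it is not one of the required files), and it assumes that the label is the last column of each row, as described above.

```matlab
function [XTrain, YTrain, XTest, YTest] = splitData(DATA, numTrain)
% splitData  Hypothetical helper: split a labelled data matrix into
% training and test parts. Assumes the label is the last column.
    X = DATA(:, 1:end-1);            % attributes
    Y = DATA(:, end);                % labels, +1 or -1
    XTrain = X(1:numTrain, :);       % first numTrain lines for training
    YTrain = Y(1:numTrain);
    XTest  = X(numTrain+1:end, :);   % remaining lines for testing
    YTest  = Y(numTrain+1:end);
end
```

For iris.txt it could be called as DATA = dlmread('iris.txt',','); followed by [XTrain, YTrain, XTest, YTest] = splitData(DATA, 60); for ionosphere.txt the second argument would be 150.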
2 Nearest Neighbours
The Nearest Neighbours method is described in Class 3. The predicted
classification for a test example x is the same as the classification of the nearest
training example (in the Euclidean distance $\|x_1 - x_2\|$). Calculate predicted
labels for all test examples and compare them with the true labels for the test
examples. Calculate the percentage of correct classifications.
Here is the format for the function header:
function pc = NN(XTrain,YTrain,XTest,YTest)
The output pc is the percentage of correct classifications on the test set.
Give the percentages of correct classifications in your report.
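A minimal sketch of a function with this header is given below. It is one possible reading of the description above, not the reference solution. It assumes that ties between equally near training examples may be broken by taking the first one, which the text does not specify.

```matlab
function pc = NN(XTrain, YTrain, XTest, YTest)
% NN  1-nearest-neighbour classification (illustrative sketch).
%   pc is the percentage of test examples classified correctly.
    numTrain = size(XTrain, 1);
    numTest = size(XTest, 1);
    predicted = zeros(numTest, 1);
    for j = 1:numTest
        % Squared Euclidean distances from test example j to every
        % training example; the square root is omitted because it does
        % not change which training example is nearest.
        diffs = XTrain - repmat(XTest(j, :), numTrain, 1);
        sqDist = sum(diffs .^ 2, 2);
        [~, nearest] = min(sqDist);  % assumption: first minimum wins ties
        predicted(j) = YTrain(nearest);
    end
    pc = 100 * mean(predicted == YTest(:));
end
```

The loop over test examples keeps the sketch easy to follow; it could be replaced by a fully vectorised distance computation.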
3 SVM
The SVM algorithm with kernels is described in Class 7. You need to use Vapnik’s
polynomial kernel $K(u, v) = (1 + u \cdot v)^{\text{degree}}$, where degree is a parameter.
The algorithm may be roughly described as follows:
1. Create the kernel matrix K of size
numOfTrainExamples × numOfTrainExamples. You may do it with two
nested loops or by using advanced MatLab features to speed up the
calculation.
2. Create other necessary matrices as described in Class 7; you will need a
parameter C.
3. Run quadprog to obtain vector alpha (the size of the vector is
numOfTrainExamples).
4. Find indexes i such that $0 < \alpha_i < C$. Because quadprog is a numerical
function, it will output approximate answers, so instead of $0 < \alpha_i < C$ you
need to check whether threshold $< \alpha_i <$ C − threshold; the value of
threshold is another parameter. The number of indexes you have found
is the number of suitable support vectors.
5. For each index i you have found, calculate d. Take the average of the
values of d for better precision.
6. Use the values of alpha and d to calculate the predictions for the test set.
Compare them with the true labels for the test examples. Calculate the
percentage of correct classifications.
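For orientation, the steps above correspond to the standard soft-margin dual problem written in the form that quadprog minimises. The formulation below is an assumption based on the usual textbook presentation; the notation and the sign convention for d used in Class 7 may differ, and the class notes take precedence.

```latex
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\,\alpha^{\top} H \alpha - \mathbf{1}^{\top}\alpha,
  \qquad H_{ij} = y_i\, y_j\, K(x_i, x_j), \\
\text{subject to}\quad & \sum_{i} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \\[4pt]
d_i &= y_i - \sum_{j} \alpha_j y_j K(x_j, x_i)
  \quad \text{for each } i \text{ with } 0 < \alpha_i < C, \\
\hat{y}(x) &= \operatorname{sign}\Bigl(\sum_{j} \alpha_j y_j K(x_j, x) + d\Bigr),
  \qquad d = \text{average of the } d_i .
\end{aligned}
```

Here the $x_i$ are the training examples, the $y_i$ are their labels, and the sums run over all training examples.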
For each experiment you need to give two values in your report, the
percentage of correct classifications and the number of suitable support vectors.
Here is the header for the function:
function [pc, numSupport] = SVM(XTrain,YTrain,XTest,YTest,degree,C,threshold)
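A minimal sketch of a function with this header is shown below. It assumes the standard soft-margin dual formulation and the convention that the prediction is the sign of the kernel expansion plus d; the matrices prescribed in Class 7 may be set up differently. It requires quadprog to be available.

```matlab
function [pc, numSupport] = SVM(XTrain, YTrain, XTest, YTest, degree, C, threshold)
% SVM  Soft-margin SVM with the kernel (1 + u'*v)^degree (illustrative
% sketch, assuming the standard dual formulation).
    n = size(XTrain, 1);
    YTrain = YTrain(:);
    YTest = YTest(:);

    % Step 1: kernel matrix without loops; XTrain*XTrain' holds all
    % pairwise dot products of training examples.
    K = (1 + XTrain * XTrain') .^ degree;

    % Steps 2-3: quadprog minimises 0.5*a'*H*a + f'*a subject to
    % Aeq*a = beq and lb <= a <= ub.
    H = (YTrain * YTrain') .* K;
    H = (H + H') / 2;                  % remove round-off asymmetry
    f = -ones(n, 1);
    alpha = quadprog(H, f, [], [], YTrain', 0, zeros(n, 1), C * ones(n, 1));

    % Step 4: indexes with threshold < alpha_i < C - threshold.
    idx = find(alpha > threshold & alpha < C - threshold);
    numSupport = numel(idx);

    % Step 5: average d over the suitable support vectors. If none are
    % found, d is NaN and pc comes out as 0.
    w = alpha .* YTrain;
    d = mean(YTrain(idx) - K(idx, :) * w);

    % Step 6: predictions on the test set.
    KTest = (1 + XTest * XTrain') .^ degree;
    predicted = sign(KTest * w + d);
    pc = 100 * mean(predicted == YTest);
end
```

If numSupport comes out as 0, no d can be computed, which usually means that threshold is too large relative to C for the problem at hand.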
For iris.txt try the following combinations of parameters: degree = 2 and
degree = 3 with C = 1 and threshold = $10^{-4}$. For degree = 4 and degree = 5 find
values of C and threshold to maximise the percentage of correct predictions.
For ionosphere.txt try the following combination of parameters: degree = 2
with C = 1 and threshold = $10^{-1}$. For degree = 3 find values of C and
threshold to maximise the percentage of correct predictions.
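One simple way to look for good values of C and threshold is an exhaustive search over a small grid. The sketch below is only an illustration: the function name searchParams and the argument evalFn are hypothetical, and the candidate values are for you to choose. It skips pairs with 2 * threshold >= C, because the interval between threshold and C − threshold would then be empty.

```matlab
function [bestPc, bestC, bestThreshold] = searchParams(evalFn, Cs, thresholds)
% searchParams  Hypothetical helper: try every pair (C, threshold) from
% the given candidate lists and keep the pair with the highest score.
%   evalFn(C, threshold) must return the percentage of correct
%   classifications for that pair.
    bestPc = -Inf;
    bestC = NaN;
    bestThreshold = NaN;
    for C = Cs(:)'
        for t = thresholds(:)'
            if 2 * t >= C
                continue;            % no alpha can satisfy t < alpha < C - t
            end
            pc = evalFn(C, t);
            if pc > bestPc           % strict: the first best pair is kept
                bestPc = pc;
                bestC = C;
                bestThreshold = t;
            end
        end
    end
end
```

With the SVM function in place it might be called as evalFn = @(C, t) SVM(XTrain, YTrain, XTest, YTest, 4, C, t); followed by [bestPc, bestC, bestThreshold] = searchParams(evalFn, 10 .^ (-2:3), 10 .^ (-6:-1)); where both grids are illustrative only.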
4 Hints
Although it is required to submit NN and SVM as functions, it is easier to
develop them as scripts. A script operates on the variables of the base
workspace, so you can easily inspect their values.
These datasets are very regular. On iris.txt it is possible to achieve ideal
accuracy. On ionosphere.txt one can achieve very high accuracy.
5 Extra Marks
Extra marks will be given for attempts to optimise MatLab code and any
interesting observations about the datasets and methods (discuss these in your
report).
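As one example of the kind of optimisation meant here, the kernel matrix from step 1 of the SVM algorithm can be computed either with two nested loops or with a single matrix expression. The sketch below compares the two on a small made-up matrix X; the data and variable names are purely illustrative.

```matlab
% Small made-up data, one example per row (for illustration only).
X = [1 2; 3 4; 5 6];
degree = 2;
n = size(X, 1);

% Version 1: two nested loops, one kernel value at a time.
KLoop = zeros(n, n);
for i = 1:n
    for j = 1:n
        KLoop(i, j) = (1 + X(i, :) * X(j, :)') ^ degree;
    end
end

% Version 2: vectorised. X*X' contains all pairwise dot products, and
% the element-wise power applies the kernel to every entry at once.
KVec = (1 + X * X') .^ degree;

% Largest absolute difference between the two versions.
maxDiff = max(abs(KLoop(:) - KVec(:)));
```

On realistic dataset sizes the two versions can be timed with tic and toc, and the comparison reported.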