
CS3920 Computer Learning: Coursework

Yuri Kalnishkan

November 9, 2010

The coursework consists of implementing two learning methods, Nearest Neighbours and Support Vector Machines with Vapnik's polynomial kernel, in MatLab (or a MatLab clone; if using a clone program, please contact the lecturer). The coursework assignment is strictly individual. The submission is entirely electronic.

In order to submit, copy all your submission files into a directory, e.g., Learning/, and run the script submitCoursework Learning from the parent directory. Choose CS3920 and cw when prompted by the script. You will receive an automatically generated e-mail confirming your submission. Please keep this e-mail as a submission receipt in case of a dispute; it is your proof of submission. No complaints will be accepted later without a submission receipt. If you have not received a confirmation e-mail, please contact Support.

You should submit ﬁles with the following names:

• NN.m should contain the source code for the Nearest Neighbours method;

• SVM.m should contain the source code for the support vector machine;

• all other necessary files with scripts and functions (please avoid submitting unnecessary files such as old versions, back-up copies made by the editor, etc.);

• report.pdf or report.rtf should contain the numerical results; please avoid OpenOffice formats such as .odt.

The deadline for submission is Wednesday, December 1st, 14:00. An extension can only be given by the academic advisor.

1 Datasets

Download the archive with the two datasets, iris.txt and ionosphere.txt, from the course web page. Each line of these files represents one labelled example with comma-separated attributes. The last number in the line gives the classification, +1 or −1.


• iris.txt is perhaps one of the best known data sets in the pattern recognition literature. Each example has 4 attributes describing the sepal length, sepal width, petal length, and petal width of an iris plant. The labels are −1 for Iris Setosa and +1 for Iris Versicolour.

• ionosphere.txt contains data collected by a radar system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. 'Good' (+1) radar returns are those showing evidence of some type of structure in the ionosphere. 'Bad' (−1) returns are those that do not.

The datasets are based on Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository, http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

For each dataset do the following:

1. Load it into MatLab using

DATA = dlmread('filename.txt',',');

2. Split the dataset into the training and test sets. For iris.txt use the first 60 lines as the training set and the rest as the test set. For ionosphere.txt use the first 150 lines as the training set and the rest as the test set. Create the matrices XTrain with the training examples and YTrain with the training labels. Create the matrices XTest with the test examples and YTest with the test labels.

3. Run the functions NN and SVM with different parameters as described below.

4. Write the results into the report (see below for a description of what to report for each method).
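As an illustration of steps 1 and 2, here is a minimal sketch for iris.txt (the filename and split sizes are as given above; it assumes one example per row, with the label in the last column, as described in the Datasets section):

```matlab
% Load iris.txt and split it: first 60 lines train, the rest test.
DATA = dlmread('iris.txt', ',');
XTrain = DATA(1:60, 1:end-1);    % training examples (attributes)
YTrain = DATA(1:60, end);        % training labels (+1/-1)
XTest  = DATA(61:end, 1:end-1);  % test examples
YTest  = DATA(61:end, end);      % test labels
```

For ionosphere.txt the same code applies with 60 replaced by 150.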

2 Nearest Neighbours

The Nearest Neighbours method is described in Class 3. The predicted classification for a test example x is the same as the classification of the nearest training example (nearest in the Euclidean distance ‖x1 − x2‖). Calculate predicted labels for all test examples and compare them with the true labels for the test examples. Calculate the percentage of correct classifications.

Here is the format for the function header:

function pc = NN(XTrain,YTrain,XTest,YTest)

The output pc is the percentage of correct classifications on the test set. Give the percentages of correct classifications in your report.
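The method above can be sketched as follows (a rough sketch, not a definitive implementation; it assumes one example per row):

```matlab
function pc = NN(XTrain,YTrain,XTest,YTest)
% Nearest Neighbour classification: each test example receives the label
% of its nearest training example in Euclidean distance.
numTest = size(XTest,1);
YPred = zeros(numTest,1);
for i = 1:numTest
    % squared Euclidean distances from test example i to all training examples
    diffs = XTrain - repmat(XTest(i,:), size(XTrain,1), 1);
    [~, nearest] = min(sum(diffs.^2, 2));
    YPred(i) = YTrain(nearest);
end
pc = 100 * mean(YPred == YTest);
end
```

Note that minimising the squared distance gives the same nearest neighbour as minimising the distance itself, so the square root can be omitted.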


3 SVM

The SVM algorithm with kernels is described in Class 7. You need to use Vapnik's polynomial kernel K(u, v) = (1 + u′v)^degree, where u′v is the inner product and degree is a parameter. The algorithm may be roughly described as follows:

1. Create the kernel matrix K of size numOfTrainExamples × numOfTrainExamples. You may do it with two nested loops or by using advanced MatLab features to speed up the calculation.

2. Create other necessary matrices as described in Class 7; you will need a parameter C.

3. Run quadprog to obtain the vector alpha (the size of the vector is numOfTrainExamples).

4. Find indexes i such that 0 < αi < C. Because quadprog is a numerical function, it outputs approximate answers, so instead of 0 < αi < C you need to check whether threshold < αi < C − threshold; the value of threshold is another parameter. The number of indexes you have found is the number of suitable support vectors.

5. For each index i you have found, calculate d. Take the average of the ds for better precision.

6. Use the values of alpha and d to calculate the predictions for the test set. Compare them with the true labels for the test examples. Calculate the percentage of correct classifications.
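For step 1, the two approaches mentioned there can be sketched as follows (a sketch only; XTrain and degree are assumed to be set up as described above):

```matlab
% Kernel matrix K(i,j) = (1 + x_i'x_j)^degree, built two equivalent ways.
n = size(XTrain,1);
K = zeros(n);
for i = 1:n
    for j = 1:n
        K(i,j) = (1 + XTrain(i,:)*XTrain(j,:)')^degree;
    end
end
% Vectorized version -- identical result, much faster on large datasets:
K2 = (1 + XTrain*XTrain').^degree;
```

The vectorized form computes all inner products with a single matrix product and then raises every entry to the power degree elementwise.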

For each experiment you need to give two values in your report, the percentage of correct classifications and the number of suitable support vectors.

Here is the header for the function:

function [pc, numSupport] = SVM(XTrain,YTrain,XTest,YTest,degree,C,threshold)
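The steps above can be put together roughly as follows. This is a sketch under the assumption that Class 7 uses the standard soft-margin dual (minimise ½α′Hα − Σαi with H = (yy′) .* K, subject to 0 ≤ αi ≤ C and y′α = 0); adapt the matrices to whatever formulation was actually given in class.

```matlab
function [pc, numSupport] = SVM(XTrain,YTrain,XTest,YTest,degree,C,threshold)
n = size(XTrain,1);
% Step 1: Vapnik's polynomial kernel matrix on the training set.
K = (1 + XTrain*XTrain').^degree;
% Step 2: matrices for the dual problem in quadprog form,
% min 1/2 a'Ha + f'a  subject to  y'a = 0,  0 <= a <= C.
H = (YTrain*YTrain').*K;
f = -ones(n,1);
% Step 3: solve the quadratic programme for alpha.
alpha = quadprog(H, f, [], [], YTrain', 0, zeros(n,1), C*ones(n,1));
% Step 4: suitable support vectors, strictly inside (0, C) up to tolerance.
sv = find(alpha > threshold & alpha < C - threshold);
numSupport = length(sv);
% Step 5: bias d for each suitable support vector, averaged for precision.
d = mean(YTrain(sv) - K(:,sv)'*(alpha.*YTrain));
% Step 6: predictions on the test set and the percentage of correct ones.
KTest = (1 + XTest*XTrain').^degree;       % numTest x n kernel values
YPred = sign(KTest*(alpha.*YTrain) + d);
pc = 100 * mean(YPred == YTest);
end
```
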

For iris.txt try the following combinations of parameters: degree = 2 and degree = 3 with C = 1 and threshold = 10^−4. For degree = 4 and degree = 5 find values of C and threshold to maximise the percentage of correct predictions.

For ionosphere.txt try the following combinations of parameters: degree = 2 with C = 1 and threshold = 10^−1. For degree = 3 find values of C and threshold to maximise the percentage of correct predictions.

4 Hints

Although it is required to submit NN and SVM as functions, it is easier to develop them as scripts. A script operates on global variables, and you can easily inspect their values.

These datasets are very regular. On iris.txt it is possible to achieve ideal accuracy. On ionosphere.txt one can achieve very high accuracy.


5 Extra Marks

Extra marks will be given for attempts to optimise MatLab code and for any interesting observations about the datasets and methods (discuss these in your report).



# Document Outline

- Datasets
- Nearest Neighbours
- SVM
- Hints
- Extra Marks