CS6140 Machine Learning

HW5 - Features

Make sure you check the syllabus for the due date. Please use the notation adopted in class, even if the book states the problem using a different notation.

SpamBase-Poluted dataset: the same datapoints as in the original Spambase dataset, but with many more columns (features): random values, somewhat loose features, or duplicates of the original features.

SpamBase-Poluted with missing values dataset: train, test. Same dataset, except some values (picked at random) have been deleted.

Digits Dataset (Training data, labels. Testing data, labels): about 60,000 images, each 28x28 pixels representing digit scans. Each image is labeled with the digit represented, one of 10 classes: 0,1,2,...,9.

PROBLEM 5 : Implement Kernel PCA for linear regression (Optional, no credit)

Dataset: 1000 2-dim datapoints TwoSpirals
Dataset: 1000 2-dim datapoints ThreeCircles

A) First, train a Linear Regression model (library) and confirm that it doesn't work, i.e. it has a high classification error or a high Root Mean Squared Error.
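A minimal sketch of this baseline check, not a required implementation. The course data files are not loaded here; the two noisy concentric rings generated below are only a stand-in (an assumption), and the names X, y, A, w are illustrative. The fit uses Matlab's backslash least-squares solve; fitlm from the Statistics and Machine Learning Toolbox would be a library alternative. Errors are measured on the training points for simplicity.

```matlab
% Stand-in data (assumption, not the course files): two noisy concentric rings.
rng(0);
N = 1000;
y = 2*(rand(N,1) > 0.5) - 1;          % labels in {-1,+1}
theta = 2*pi*rand(N,1);               % random angle for each point
r = 2 + y + 0.1*randn(N,1);           % radius ~1 for class -1, ~3 for class +1
X = [r.*cos(theta), r.*sin(theta)];   % N x 2 data matrix

% Linear regression with an intercept, solved by least squares.
A = [ones(N,1), X];
w = A \ y;
yhat = A*w;

% Root Mean Squared Error and classification error (threshold at 0).
rmse = sqrt(mean((yhat - y).^2));
err = mean(sign(yhat) ~= y);
fprintf('RMSE = %.3f, classification error = %.3f\n', rmse, err);
```

Because the two rings share the same center, no line in the original 2-dim space separates them, so both numbers should come out poor on data of this kind.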
B) Run KernelPCA with a Gaussian kernel to obtain a representation with T features. For reference, these are the steps we demoed in class (Matlab):
% X is the N x 2 data matrix, one datapoint per row
N = size(X,1);
% get pairwise squared Euclidean distances
X2 = dot(X,X,2);
DIST_euclid = bsxfun(@plus, X2, X2') - 2 * (X * X');
% get the N x N Gaussian kernel matrix
sigma = 3;
K = exp(-DIST_euclid/sigma);
% center the kernel so that it corresponds to zero-mean features
U = ones(N)/N;
Kn = K - U*K - K*U + U*K*U;
% obtain the kernel eigenvalues and eigenvectors, then sort them with the largest eigenvalue first
[V,D] = eig(Kn,'vector');
[D,sorteig] = sort(D,'descend');
V = V(:, sorteig);
% project the data onto the eigenvectors (the columns of V, so no transpose)
XG = Kn*V;
% get the first 3 dimensions
X3G = XG(:,1:3);
% get the first 20 dimensions
X20G = XG(:,1:20);
% get the first 100 dimensions
X100G = XG(:,1:100);

C) Retrain Linear Regression on the transformed D-dim data (the first D KernelPCA features from part B). How large does D need to be to get good performance?
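One possible way to explore this question is sketched below, under stated assumptions. The data is again a stand-in (two noisy concentric rings, not the course files), the values in Ds are just the three sizes mentioned in part B, and the names lambda, Ds, errs, rmses are illustrative. The explicit symmetrization of Kn is an added safeguard against round-off and is not part of the class demo. Errors are measured on the training points.

```matlab
% Stand-in data (assumption, not the course files): two noisy concentric rings.
rng(0);
N = 1000;
y = 2*(rand(N,1) > 0.5) - 1;          % labels in {-1,+1}
theta = 2*pi*rand(N,1);
r = 2 + y + 0.1*randn(N,1);           % radius ~1 for class -1, ~3 for class +1
X = [r.*cos(theta), r.*sin(theta)];

% Kernel PCA with a Gaussian kernel, following the steps in part B.
X2 = dot(X,X,2);
DIST_euclid = bsxfun(@plus, X2, X2') - 2*(X*X');
sigma = 3;
K = exp(-DIST_euclid/sigma);
U = ones(N)/N;
Kn = K - U*K - K*U + U*K*U;           % centered kernel
Kn = (Kn + Kn')/2;                    % added safeguard: enforce exact symmetry
[V,lambda] = eig(Kn,'vector');
[lambda,order] = sort(lambda,'descend');
V = V(:,order);
XG = Kn*V;                            % projected data, N x N

% Retrain linear regression on the first D features for several D.
Ds = [3 20 100];
errs = zeros(size(Ds));
rmses = zeros(size(Ds));
for i = 1:numel(Ds)
    A = [ones(N,1), XG(:,1:Ds(i))];   % intercept plus first D features
    w = A \ y;                        % least-squares fit
    yhat = A*w;
    rmses(i) = sqrt(mean((yhat - y).^2));
    errs(i) = mean(sign(yhat) ~= y);
    fprintf('D = %3d: RMSE = %.3f, classification error = %.3f\n', ...
            Ds(i), rmses(i), errs(i));
end
```

The D that is "large enough" depends on the dataset and on sigma, so the same sweep should be rerun on TwoSpirals and ThreeCircles rather than read off this stand-in. For large D the trailing columns of XG can be tiny, and Matlab may warn about rank deficiency in the least-squares solve.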