Coursera: Machine Learning (Week 7) [Assignment Solution] - Andrew NG

by Akshay Daga (APDaga) - June 05, 2021


▸ Support vector machines (SVMs) to build a spam classifier.

I recently completed the Machine Learning course by Andrew Ng on Coursera.

The course includes a number of quizzes and programming assignments.

Here, I am sharing my solutions for the weekly assignments throughout the course.

These solutions are for reference only.

> I recommend solving the assignments honestly by yourself; only then does completing the course make sense.

> But if you get stuck along the way, feel free to refer to my solutions.

NOTE:

Don't just copy-paste the code for the sake of completion.

Even if you copy the code, make sure you understand the code first.

Click here to check out the week-6 assignment solutions. Scroll down for the week-7 assignment solutions.

In this exercise, you will be using support vector machines (SVMs) to build a spam classifier. Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics.

Recommended Machine Learning Courses:
Coursera: Machine Learning
Coursera: Deep Learning Specialization
Coursera: Machine Learning with Python
Coursera: Advanced Machine Learning Specialization
Udemy: Machine Learning
LinkedIn: Machine Learning
Eduonix: Machine Learning
edX: Machine Learning
Fast.ai: Introduction to Machine Learning for Coders

The exercise consists of the following files:

ex6.m - Octave/MATLAB script for the first half of the exercise
ex6data1.mat - Example Dataset 1
ex6data2.mat - Example Dataset 2
ex6data3.mat - Example Dataset 3
svmTrain.m - SVM training function
svmPredict.m - SVM prediction function
plotData.m - Plot 2D data
visualizeBoundaryLinear.m - Plot linear boundary
visualizeBoundary.m - Plot non-linear boundary
linearKernel.m - Linear kernel for SVM
[*] gaussianKernel.m - Gaussian kernel for SVM
[*] dataset3Params.m - Parameters to use for Dataset 3
ex6_spam.m - Octave/MATLAB script for the second half of the exercise
spamTrain.mat - Spam training set
spamTest.mat - Spam test set
emailSample1.txt - Sample email 1
emailSample2.txt - Sample email 2
spamSample1.txt - Sample spam 1
spamSample2.txt - Sample spam 2
vocab.txt - Vocabulary list
getVocabList.m - Load vocabulary list
porterStemmer.m - Stemming function
readFile.m - Reads a file into a character string
submit.m - Submission script that sends your solutions to our servers
[*] processEmail.m - Email preprocessing
[*] emailFeatures.m - Feature extraction from emails

* indicates files you will need to complete

gaussianKernel.m :

function sim = gaussianKernel(x1, x2, sigma)
  %RBFKERNEL returns a radial basis function kernel between x1 and x2
  %   sim = gaussianKernel(x1, x2) returns a gaussian kernel between x1 and x2
  %   and returns the value in sim
  
  % Ensure that x1 and x2 are column vectors
  x1 = x1(:); x2 = x2(:);
  
  % You need to return the following variables correctly.
  sim = 0;
  
  % ====================== YOUR CODE HERE ======================
  % Instructions: Fill in this function to return the similarity between x1
  %               and x2 computed using a Gaussian kernel with bandwidth
  %               sigma
  %
  %
  
  sim = exp(-sum((x1 - x2).^2) / (2 * sigma^2));
  
  % =============================================================  
end
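
As a quick sanity check, the kernel can be evaluated on a small pair of vectors. The sketch below restates the formula sim = exp(-||x1 - x2||^2 / (2*sigma^2)) as an anonymous function so that it runs on its own in Octave/MATLAB. The name gaussian_sim and the test vectors are chosen here for illustration and are not part of the exercise files.

```octave
% Minimal sketch: Gaussian (RBF) similarity as an anonymous function.
% gaussian_sim is an illustrative name, not part of the exercise files.
gaussian_sim = @(x1, x2, sigma) exp(-sum((x1(:) - x2(:)).^2) / (2 * sigma^2));

x1 = [1 2 1];
x2 = [0 4 -1];
sigma = 2;

% Squared distance is 1 + 4 + 4 = 9, so sim = exp(-9/8), about 0.324652
sim = gaussian_sim(x1, x2, sigma);
fprintf('Gaussian kernel similarity: %f\n', sim);
```

Identical inputs give a similarity of exactly 1, and the value decays towards 0 as the two points move apart or as sigma shrinks.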

dataset3Params.m :

function [C, sigma] = dataset3Params(X, y, Xval, yval)
  %DATASET3PARAMS returns your choice of C and sigma for Part 3 of the exercise
  %where you select the optimal (C, sigma) learning parameters to use for SVM
  %with RBF kernel
  %   [C, sigma] = DATASET3PARAMS(X, y, Xval, yval) returns your choice of C and
  %   sigma. You should complete this function to return the optimal C and
  %   sigma based on a cross-validation set.
  %
  
  % You need to return the following variables correctly.
  C = 1;
  sigma = 0.3;
  
  % ====================== YOUR CODE HERE ======================
  % Instructions: Fill in this function to return the optimal C and sigma
  %               learning parameters found using the cross validation set.
  %               You can use svmPredict to predict the labels on the cross
  %               validation set. For example,
  %                   predictions = svmPredict(model, Xval);
  %               will return the predictions on the cross validation set.
  %
  %  Note: You can compute the prediction error using
  %        mean(double(predictions ~= yval))
  %
  
  %% %%%%%%%%%% WORKING: SOLUTION1 %%%%%%%%%%
  % C_list     = [0.01 0.03 0.1 0.3 1 3 10 30]';
  % sigma_list = [0.01 0.03 0.1 0.3 1 3 10 30]';
  % 
  % prediction_error = zeros(length(C_list), length(sigma_list));
  % for i = 1:length(C_list)
  %     for j = 1: length(sigma_list)
  %         C_test = C_list(i);
  %         sigma_test = sigma_list(j);
  %         model = svmTrain(X, y, C_test, @(x1, x2) gaussianKernel(x1, x2, sigma_test));
  %         predictions = svmPredict(model, Xval);
  %         prediction_error(i,j) = mean(double(predictions ~= yval));
  %     end
  % end
  % 
  % % Finding row and col corresponding to min(prediction_error)
  % [values, row_index]=min(prediction_error);
  % [~ ,col] = min(values);
  % row = row_index(col);
  % 
  % % C and sigma corresponding to min(prediction_error)
  % C = C_list(row);
  % sigma = sigma_list(col);
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  
  
  %% %%%%%%%%%% WORKING: SOLUTION 2 %%%%%%%%%%%%%%
  C_list     = [0.01 0.03 0.1 0.3 1 3 10 30]';
  sigma_list = [0.01 0.03 0.1 0.3 1 3 10 30]';
  
  prediction_error = zeros(length(C_list), length(sigma_list));
  % One row per (C, sigma) combination: length(C_list) * length(sigma_list) rows
  result = zeros(length(C_list)*length(sigma_list),3);
  row = 1;
  
  for i = 1:length(C_list)
      for j = 1: length(sigma_list)
          C_test = C_list(i);
          sigma_test = sigma_list(j);
          
          model = svmTrain(X, y, C_test, @(x1, x2) gaussianKernel(x1, x2, sigma_test));
          predictions = svmPredict(model, Xval);
          prediction_error(i,j) = mean(double(predictions ~= yval));
          
          result(row,:) = [prediction_error(i,j), C_test, sigma_test];
          row = row + 1;
      end
  end
  
  % Sort the rows of result by prediction error (column 1), ascending
  sorted_result = sortrows(result, 1);
  
  % C and sigma corresponding to min(prediction_error)
  C = sorted_result(1,2);
  sigma = sorted_result(1,3);
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  % =========================================================================
end
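
Both solutions above come down to finding the position of the smallest entry in the grid of validation errors. One compact way to do that is to take min over the flattened matrix and map the linear index back with ind2sub. The sketch below uses a made-up 3x3 error matrix; the values are illustrative only and do not come from the exercise data.

```octave
% Minimal sketch: locate the smallest entry of an error grid.
% The error values below are made up purely for illustration.
C_list     = [0.1 1 10]';
sigma_list = [0.03 0.1 0.3]';
prediction_error = [0.20 0.08 0.15;
                    0.12 0.03 0.09;
                    0.18 0.05 0.11];

% min over the flattened matrix returns a linear index ...
[min_error, linear_idx] = min(prediction_error(:));
% ... which ind2sub converts back to (row, column) subscripts
[row, col] = ind2sub(size(prediction_error), linear_idx);

C     = C_list(row);      % rows index C
sigma = sigma_list(col);  % columns index sigma
fprintf('Best C = %g, sigma = %g (error = %g)\n', C, sigma, min_error);
```

This avoids both the second min call of Solution 1 and the sorted result table of Solution 2, at the cost of not keeping a ranked list of all combinations.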

processEmail.m :

function word_indices = processEmail(email_contents)
  %PROCESSEMAIL preprocesses the body of an email and
  %returns a list of word_indices
  %   word_indices = PROCESSEMAIL(email_contents) preprocesses
  %   the body of an email and returns a list of indices of the
  %   words contained in the email.
  %
  
  % Load Vocabulary
  vocabList = getVocabList();
  
  % Init return value
  word_indices = [];
  
  % ========================== Preprocess Email ===========================
  
  % Find the Headers ( \n\n and remove )
  % Uncomment the following lines if you are working with raw emails with the
  % full headers
  
  % hdrstart = strfind(email_contents, ([char(10) char(10)]));
  % email_contents = email_contents(hdrstart(1):end);
  
  % Lower case
  email_contents = lower(email_contents);
  
  % Strip all HTML
  % Looks for any expression that starts with < and ends with >, does not
  % have any < or > inside the tag, and replaces it with a space
  email_contents = regexprep(email_contents, '<[^<>]+>', ' ');
  
  % Handle Numbers
  % Look for one or more characters between 0-9
  email_contents = regexprep(email_contents, '[0-9]+', 'number');
  
  % Handle URLS
  % Look for strings starting with http:// or https://
  email_contents = regexprep(email_contents, ...
      '(http|https)://[^\s]*', 'httpaddr');
  
  % Handle Email Addresses
  % Look for strings with @ in the middle
  email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
  
  % Handle $ sign
  email_contents = regexprep(email_contents, '[$]+', 'dollar');
  
  
  % ========================== Tokenize Email ===========================
  
  % Output the email to screen as well
  fprintf('\n==== Processed Email ====\n\n');
  
  % Process file
  l = 0;
  
  while ~isempty(email_contents)
      
    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
        strtok(email_contents, ...
        [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
    
    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');
    
    % Stem the word
    % (the porterStemmer sometimes has issues, so we use a try catch block)
    try
        str = porterStemmer(strtrim(str));
    catch
        str = '';
        continue;
    end
    
    % Skip the word if it is too short
    if length(str) < 1
        continue;
    end
    
    % Look up the word in the dictionary and add to word_indices if
    % found
    % ====================== YOUR CODE HERE ======================
    % Instructions: Fill in this function to add the index of str to
    %               word_indices if it is in the vocabulary. At this point
    %               of the code, you have a stemmed word from the email in
    %               the variable str. You should look up str in the
    %               vocabulary list (vocabList). If a match exists, you
    %               should add the index of the word to the word_indices
    %               vector. Concretely, if str = 'action', then you should
    %               look up the vocabulary list to find where in vocabList
    %               'action' appears. For example, if vocabList{18} =
    %               'action', then, you should add 18 to the word_indices
    %               vector (e.g., word_indices = [word_indices ; 18]; ).
    %
    % Note: vocabList{idx} returns the word with index idx in the
    %       vocabulary list.
    %
    % Note: You can use strcmp(str1, str2) to compare two strings (str1 and
    %       str2). It will return 1 only if the two strings are equivalent.
    %
 
    %% %%%%% WORKING: SOLUTION %%%%%%%%%%
    % Find the index of the word in vocabList (empty if there is no match)
    index = find(strcmp(str,vocabList),1);
    
    % Append the index to word_indices (an empty index appends nothing)
    word_indices = [word_indices; index];
    
    %% =============================================================
    
    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;
      
  end
  
  % Print footer
  fprintf('\n\n=========================\n');
  
end
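
The lookup step can be tried in isolation. The sketch below uses a toy four-word vocabulary (the words and their order are illustrative, not the real vocab.txt) to show that strcmp against a cell array combined with find returns the position of a match, and that an unmatched word yields an empty result which concatenation silently ignores.

```octave
% Minimal sketch of the vocabulary lookup with a toy vocabulary.
% The words and their order are illustrative, not the real vocab.txt.
vocabList = {'about'; 'action'; 'anyon'; 'email'};
word_indices = [];

words = {'action', 'zebra', 'email'};   % 'zebra' is not in the toy vocabulary
for k = 1:numel(words)
    % strcmp against a cell array returns a logical array; find(..., 1)
    % gives the first matching position, or an empty result if none
    index = find(strcmp(words{k}, vocabList), 1);
    % Concatenating an empty result leaves word_indices unchanged
    word_indices = [word_indices; index];
end

disp(word_indices');   % positions of 'action' and 'email' in vocabList
```

Because the empty case needs no special handling, the solution inside processEmail does not need an explicit if around the append.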


emailFeatures.m :

function x = emailFeatures(word_indices)
  %EMAILFEATURES takes in a word_indices vector and produces a feature vector
  %from the word indices
  %   x = EMAILFEATURES(word_indices) takes in a word_indices vector and 
  %   produces a feature vector from the word indices. 
  
  % Total number of words in the dictionary
  n = 1899;
  
  % You need to return the following variables correctly.
  x = zeros(n, 1);
  
  % ====================== YOUR CODE HERE ======================
  % Instructions: Fill in this function to return a feature vector for the
  %               given email (word_indices). To help make it easier to 
  %               process the emails, we have already pre-processed each
  %               email and converted each word in the email into an index in
  %               a fixed dictionary (of 1899 words). The variable
  %               word_indices contains the list of indices of the words
  %               which occur in one email.
  % 
  %               Concretely, if an email has the text:
  %
  %                  The quick brown fox jumped over the lazy dog.
  %
  %               Then, the word_indices vector for this text might look 
  %               like:
  %               
  %                   60  100   33   44   10     53  60  58   5
  %
  %               where, we have mapped each word onto a number, for example:
  %
  %                   the   -- 60
  %                   quick -- 100
  %                   ...
  %
  %              (note: the above numbers are just an example and are not the
  %               actual mappings).
  %
  %              Your task is to take one such word_indices vector and construct
  %              a binary feature vector that indicates whether a particular
  %              word occurs in the email. That is, x(i) = 1 when word i
  %              is present in the email. Concretely, if the word 'the' (say,
  %              index 60) appears in the email, then x(60) = 1. The feature
  %              vector should look like:
  %
  %              x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];
  %
  %
  
  %% WORKING: SOLUTION 1 %%%%%%
  % for i = 1:length(word_indices)
  %     x1 = ([1:n] == word_indices(i));
  %     x = x | x1';
  % end
  
  %% WORKING: SOLUTION 2 %%%%%%
  for i = 1:length(word_indices)
      x(word_indices(i)) = 1;
  end
  % =========================================================================
end
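
The loop in Solution 2 can also be written as a single indexed assignment. The sketch below uses a small made-up word_indices vector and n = 10 (both illustrative, not the real data) and also shows why the number of non-zero features can be smaller than length(word_indices): a word that occurs several times in the email sets the same entry more than once.

```octave
% Minimal sketch: vectorized binary feature vector.
% n and word_indices are small illustrative values, not the real data.
n = 10;
word_indices = [3; 7; 3; 9; 7];   % indices 3 and 7 occur twice

x = zeros(n, 1);
x(word_indices) = 1;   % repeated indices just set the same entry again

fprintf('Word indices: %d, unique: %d, non-zero features: %d\n', ...
        length(word_indices), numel(unique(word_indices)), nnz(x));
```

Here five word indices produce only three non-zero features, because the feature vector records presence rather than counts.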

I have tried to provide optimized solutions, such as vectorized implementations, for each assignment. If you think further optimization is possible, please suggest corrections or improvements.

--------------------------------------------------------------------------------

Click here to see solutions for all Machine Learning Coursera Assignments.

Feel free to ask questions in the comment section. I will try my best to answer them.

If you find this helpful in any way, please like, comment, and share the post.

This is the simplest way to encourage me to keep doing such work.

Thanks and Regards,

-Akshay P. Daga

25 Comments

Alankar Mishra, 7 April 2020 at 7:51 PM
processEmail code is not running in matlab , it is showing the following error in the command prompt : !! Submission failed: unexpected error: Error using fprintf
Function is not defined for 'cell' inputs.
Error from file:/MATLAB Drive/machine-learning-ex/ex6/processEmail.m

This is line 114 :
fprintf('%s ', str);
How to resolve it .

And , error 2 is
catch str = ''; continue;
in the above line it is telling variable assigned to variable "str" might be unused .
Function:processEmail
On line:114

And third error is :
word_indices = {word_indices; index};
In the above line it is telling variable "word_indices" tend to change size on every loop iteration . Consider preallocating for speed .

Unknown, 13 April 2020 at 9:22 PM
In this line of code:
result = zeros(length(C_list)+length(sigma_list),3)
you would get a 16x3 matrix since both arrays are 8 units long.
However, wouldn't you need a 64x3 matrix since we need to try out each possibility in C_list and sigma_list, which would mean trying out 64 different permutations?

Unknown, 21 April 2020 at 12:51 PM
Your code for dataset3Params gives C = 0.1 and sigma = 0.1, which is not correct. The correct values for C and sigma are 0.3 and 0.1 respectively.

Unknown, 12 July 2020 at 3:42 PM
Hey Akshay, I have a suggestion for a small optimization..
In the emailFeatures.m we can instead write
for i = word_indices
x(i) = 1;
end
Hope it's better

Asad Zubair, 20 July 2020 at 9:12 PM
In emailfeatures.m
rather than using loop

x(word_indices,1)=1;

Unknown, 4 August 2020 at 4:41 PM
Why is it showing a training "out of time" error?

Unknown, 13 August 2020 at 4:28 PM
What is the use of @?

mhari, 18 August 2020 at 8:19 AM
What is the value for the features in the Gaussian kernel? Can you help me understand the criteria for selecting x1 and x2 in svmTrain.m?

Hajar_Z, 14 September 2020 at 7:36 PM
Could you please explain the difference between svmTrain and svmPredict? What results do they return? I get a little confused. Thank you in advance.

unknown, 17 September 2020 at 7:10 PM
https://www.mathworks.com/matlabcentral/answers/320129-what-does-do
This may help

unknown, 18 September 2020 at 10:55 AM
please could you explain to me in the dataset3Params.m why the
result = zeros(length(C_list)+length(sigma_list),3); is not
result = zeros(length(C_list)*length(sigma_list),3);?

118 December 2020 at 2:13 PM
Hi Akshay,a question:
in emailFeatures.m
length(word_indices) = 53
why are there 45 but not 53 non-zero entries???

Unknown, 28 December 2020 at 12:49 AM
Hi! Thank you for your code! It is useful to see different ways to solve the exercises.
In my case I followed the tutorial indications and I didn't use any for loop in emailFeature.m, so I just wrote:

x(word_indices) = 1;

And that's all! It worked and submitted perfectly so it seems to be fine and it's just one line :D

Unknown, 23 March 2021 at 6:02 AM
Hi!!
I use MATLAB Online to run the code. For both the dataset 3 parameter search and the processEmail code, training or execution takes so long that the MATLAB session times out and the process starts all over again. Could you please help me with proper parameter values or another way to solve this problem? Thank you!!

Unknown, 17 May 2021 at 6:16 PM
i am using octave and while submitting it shows "training...... done training..... done...." but My assignment is not submitting.
i dont know why.............someone plz help me.............................