Through Lagrangian duality, SVM training ultimately reduces to a mathematical optimization problem whose optimal alpha values can be solved by SMO (Sequential Minimal Optimization). Of course, one can simply use a classic solver such as libsvm (solvers of that kind also rely on SMO internally to find the optimal SVM parameters), but for a beginner it is genuinely worth understanding the numerical logic inside SMO.
There are plenty of references online; to summarize:
1. Li Hang, Statistical Learning Methods. Its closing summary of SMO only vaguely mentions two nested loops. The outer loop searches for a multiplier that violates the KKT conditions; call it a1. A second multiplier a2, different from a1, is then chosen. Because a1*y1 + a2*y2 must stay constant, the objective becomes a quadratic function of a2 alone, which is easy to minimize; this inner-loop optimization yields the updated a2 and then a1. Both must also satisfy 0 <= a <= C together with a1*y1 + a2*y2 = constant, so the minimizer of the quadratic is checked against the feasible interval [L, H]; if it falls outside, it is set to the nearer boundary, a step called clipping. The outer loop repeats until every a satisfies the KKT conditions, and the points with a != 0 are the support vectors. The book, however, gives no implementation code.
2. Andrew Ng's svmTrain.m. Clear logic, little code, better than every Python example I have seen; a detailed, annotated walkthrough follows below.
3. The SMO lecture videos from National Chiao Tung University in Taiwan. The instructor's treatment of how a1 and a2 are chosen, and of the clip step, is clearer than anything on Zhihu.
4. The various Python implementations floating around are of uneven quality and are not reliable references.
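The inner-loop update described in item 1 can be made concrete with a tiny numeric sketch. The numbers, the value of C, and the variable names below are my own toy choices, not taken from any of the references: minimize over a2, clip to [L, H], then recover a1 from the linear constraint.

```python
def clip(a, lo, hi):
    """Pin the unconstrained minimizer of the 1-D quadratic back into [lo, hi]."""
    return min(hi, max(lo, a))

C = 1.0                      # box constraint 0 <= a <= C
y1, y2 = 1.0, -1.0           # labels of the two chosen samples
a1_old, a2_old = 0.3, 0.5    # current multipliers

# y1 != y2: the segment a1*y1 + a2*y2 = const crosses the box [0, C]^2 at
L = max(0.0, a2_old - a1_old)        # here 0.2
H = min(C, C + a2_old - a1_old)      # here 1.0

# Suppose the unconstrained quadratic minimizer came out as 1.7 (a toy value,
# standing in for a2_old - y2*(E1 - E2)/eta). It lies above H, so clipping
# pins it to the upper boundary.
a2_new = clip(1.7, L, H)

# Recover a1 so that a1*y1 + a2*y2 stays constant.
a1_new = a1_old + y1 * y2 * (a2_old - a2_new)
print(a2_new, a1_new)
```

Note that the constraint is preserved: a1_old*y1 + a2_old*y2 and a1_new*y1 + a2_new*y2 are both -0.2.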
What follows is the SMO implementation in Matlab from Andrew Ng's machine learning course. The code is minimal, yet the logic is remarkably clear, so it is recorded and annotated here.
function [model] = svmTrain(X, Y, C, kernelFunction, ...
tol, max_passes)
%SVMTRAIN Trains an SVM classifier using a simplified version of the SMO
%algorithm.
% [model] = SVMTRAIN(X, Y, C, kernelFunction, tol, max_passes) trains an
% SVM classifier and returns trained model. X is the matrix of training
% examples. Each row is a training example, and the jth column holds the
% jth feature. Y is a column matrix containing 1 for positive examples
% and 0 for negative examples. C is the standard SVM regularization
% parameter. tol is a tolerance value used for determining equality of
% floating point numbers. max_passes controls the number of iterations
% over the dataset (without changes to alpha) before the algorithm quits.
%
% Note: This is a simplified version of the SMO algorithm for training
% SVMs. In practice, if you want to train an SVM classifier, we
% recommend using an optimized package such as:
%
% LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
% SVMLight (http://svmlight.joachims.org/)
%
%
if ~exist('tol', 'var') || isempty(tol)
tol = 1e-3;
end
if ~exist('max_passes', 'var') || isempty(max_passes)
max_passes = 5;
end
% Data parameters
m = size(X, 1);
n = size(X, 2);
% Map 0 to -1
Y(Y==0) = -1;
% Variables
alphas = zeros(m, 1);
b = 0;
E = zeros(m, 1);
passes = 0;
eta = 0;
L = 0;
H = 0;
% Pre-compute the Kernel Matrix since our dataset is small
% (in practice, optimized SVM packages that handle large datasets
% gracefully will _not_ do this)
%
% We have implemented an optimized vectorized version of the kernels here so
% that the SVM training will run faster.
if strcmp(func2str(kernelFunction), 'linearKernel')
% Vectorized computation for the Linear Kernel
% This is equivalent to computing the kernel on every pair of examples
K = X*X';
elseif strfind(func2str(kernelFunction), 'gaussianKernel')
% Vectorized RBF Kernel
% This is equivalent to computing the kernel on every pair of examples
X2 = sum(X.^2, 2);
K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X')));
K = kernelFunction(1, 0) .^ K;
else
% Pre-compute the Kernel Matrix
% The following can be slow due to the lack of vectorization
K = zeros(m);
for i = 1:m
for j = i:m
K(i,j) = kernelFunction(X(i,:)', X(j,:)');
K(j,i) = K(i,j); %the matrix is symmetric
end
end
end
% Train
fprintf('\nTraining ...');
dots = 12;
while passes < max_passes
% The outer loop starts here. Many Python implementations simply hard-code a
% large iteration count, e.g. while passes < 600, which is not a sound stopping rule.
num_changed_alphas = 0;
for i = 1:m
% Calculate Ei = f(x(i)) - y(i) using (2).
% E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i);
E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i);
if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0))
% This if-condition is clever: Y(i)*E(i) = Y(i)*[W'*X(i)+b-Y(i)] = Y(i)*(W'*X(i)+b)-1,
% which is exactly the margin condition y(i)*f(x(i)) = 1 in disguise. One could test
% y(i)*f(x(i)) directly, but folding it into E(i) avoids an extra round of computation.
% When Y(i)*f(x(i))-1 > 0, i.e. Y(i)*f(x(i)) > 1, the point X(i) lies outside the margin;
% by the KKT conditions alpha(i) should then be 0, so alpha(i) ~= 0 means it needs optimizing.
% Likewise, Y(i)*f(x(i)) < 1 puts the point inside the margin, where alpha(i) should
% equal C; alpha(i) ~= C again means it needs optimizing.
% In which direction does the optimization move? Fix a3 through a(m); then
% a1*y1 + a2*y2 = constant, and the objective can be written as a quadratic
% function of a2 alone.
% In practice, there are many heuristics one can use to select
% the i and j. In this simplified code, we select them randomly.
j = ceil(m * rand());
% Randomly pick a sample index j different from i.
while j == i % Make sure i \neq j
j = ceil(m * rand());
end
% Calculate Ej = f(x(j)) - y(j) using (2).
E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j);
% Save old alphas
alpha_i_old = alphas(i);
alpha_j_old = alphas(j);
% Compute L and H by (10) or (11).
if (Y(i) == Y(j)),
L = max(0, alphas(j) + alphas(i) - C);
H = min(C, alphas(j) + alphas(i));
else
L = max(0, alphas(j) - alphas(i));
H = min(C, C + alphas(j) - alphas(i));
end
if (L == H),
% continue to next i.
continue;
end
% Compute eta by (14).
eta = 2 * K(i,j) - K(i,i) - K(j,j);
if (eta >= 0),
% continue to next i.
continue;
end
% Compute and clip new value for alpha j using (12) and (15).
alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta;
% An interesting sanity check: assume every alpha except alpha(i), alpha(j) is fixed.
% If X(i) is a support vector then E(i) = 0; and if X(j) is a (correctly classified)
% point beyond the margin, alpha(j) ought to end up at 0. The update above delivers
% exactly that: it computes alpha(j) - Y(j)*(-E(j))/eta, and since eta < 0 this
% subtracts a positive quantity, so alpha(j) shrinks, possibly below zero; the clip
% step then pins it to 0.
% Clip
alphas(j) = min (H, alphas(j));
alphas(j) = max (L, alphas(j));
% Check if change in alpha is significant
if (abs(alphas(j) - alpha_j_old) < tol),
% continue to next i.
% replace anyway
alphas(j) = alpha_j_old;
continue;
end
% Determine value for alpha i using (16).
alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j));
% Compute b1 and b2 using (17) and (18) respectively.
b1 = b - E(i) ...
- Y(i) * (alphas(i) - alpha_i_old) * K(i,i)' ...
- Y(j) * (alphas(j) - alpha_j_old) * K(i,j)';
b2 = b - E(j) ...
- Y(i) * (alphas(i) - alpha_i_old) * K(i,j)' ...
- Y(j) * (alphas(j) - alpha_j_old) * K(j,j)';
% Compute b by (19).
if (0 < alphas(i) && alphas(i) < C),
b = b1;
elseif (0 < alphas(j) && alphas(j) < C),
b = b2;
else
b = (b1+b2)/2;
end
num_changed_alphas = num_changed_alphas + 1;
end
% End of the if-block. When a sample i already satisfies the KKT conditions, the
% entire if-body is skipped and the loop moves on to the next KKT-violating i.
% In the first sweeps almost every alpha changes, since all alphas start at 0 and
% essentially everything needs optimizing. As the iterations proceed, fewer and
% fewer alphas change per sweep, eventually none; each change is counted (+1).
end
% End of the inner 1:m loop; each pass visits every sample exactly once.
% After a full sweep, the stopping logic is neat: if any alpha changed during the
% sweep, the optimization is clearly not finished, so keep going.
% If a whole 1:m sweep changes no alpha, increment passes.
if (num_changed_alphas == 0),
passes = passes + 1;
else
passes = 0;
% If an alpha changes again while passes is counting up, passes resets to zero.
% Only after max_passes consecutive sweeps with no alpha change is the model
% considered settled; the outer loop then ends.
end
% Progress display: print one dot per outer iteration; after 78 dots, reset the
% counter and start a new line.
fprintf('.');
dots = dots + 1;
if dots > 78
dots = 0;
fprintf('\n');
end
if exist('OCTAVE_VERSION')
fflush(stdout);
end
% End of the outer loop: it exits once max_passes consecutive full 1:m sweeps
% have produced no change in any alpha, i.e. the multipliers have converged.
end
fprintf(' Done! \n\n');
% Save the model
idx = alphas > 0;
% The samples with alphas > 0 are the support vectors; save their indices.
% Among them, alphas equal to C belong to points that the slack variables allow
% to sit inside the margin (or be misclassified); 0 < alpha < C puts a point
% exactly on the margin; alpha = 0 means the point lies outside the margin.
% model.alphasinit = alphas;
model.X= X(idx,:);
model.y= Y(idx);
model.kernelFunction = kernelFunction;
model.b= b;
model.alphas= alphas(idx);
model.w = ((alphas.*Y)'*X)';
% The model stores only the X and y of the support vectors.
% W = sum(alpha(i)*y(i)*X(i,:)) is computed in vectorized form, which is far more
% efficient; note that the full alphas, Y, and X are used here, not just the
% support vectors (the terms with alpha = 0 simply contribute nothing).
end
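The vectorized RBF pre-computation above (the bsxfun line plus kernelFunction(1, 0) .^ K) is compact to the point of being cryptic. Here is a sketch of the same trick in Python/NumPy, checked against the naive double loop of the generic else-branch; the toy X and the sigma value are my own choices, not part of the original code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # toy data: 5 samples, 3 features
sigma = 0.7                   # RBF bandwidth (arbitrary choice)

# Pairwise squared distances via ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2*xi.xj,
# formed from a single matrix product plus broadcasting.
X2 = np.sum(X**2, axis=1, keepdims=True)      # column of ||xi||^2
sq_dists = X2 + X2.T - 2.0 * (X @ X.T)
K_fast = np.exp(-sq_dists / (2.0 * sigma**2))

# Naive double loop for comparison (what the generic branch computes pair by pair).
m = X.shape[0]
K_slow = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        d = X[i] - X[j]
        K_slow[i, j] = np.exp(-(d @ d) / (2.0 * sigma**2))

print(np.allclose(K_fast, K_slow))
```

The key identity is that the full distance matrix comes from one matrix product; the Matlab version then obtains exp(-d/(2*sigma^2)) by raising kernelFunction(1, 0) = exp(-1/(2*sigma^2)) to the power of each squared distance, which is the same quantity.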