Through Lagrangian duality, SVM training ultimately reduces to a mathematical optimization problem whose optimal alpha values can be solved by SMO (Sequential Minimal Optimization). Of course, one can simply use a classic solver such as libsvm (solvers of that kind also rely on SMO internally to find the optimal SVM parameters), but for a beginner it is genuinely worth understanding the numerical logic inside SMO.
There are plenty of references online; to summarize:
1. Li Hang, Statistical Learning Methods. Its closing summary of SMO only vaguely mentions two nested loops. The outer loop searches for a multiplier that violates the KKT conditions; call it a1. A second multiplier a2, different from a1, is then chosen. Because a1*y1 + a2*y2 must stay constant, the objective becomes a quadratic function of a2 alone, which is easy to minimize; this inner-loop optimization yields the updated a2 and then a1. Both must also satisfy 0 <= a <= C together with a1*y1 + a2*y2 = constant, so the minimizer of the quadratic is checked against the feasible interval [L, H]; if it falls outside, it is set to the nearer boundary, a step called clipping. The outer loop repeats until every a satisfies the KKT conditions, and the points with a != 0 are the support vectors. The book, however, gives no implementation code.
2. Andrew Ng's svmTrain.m. Clear logic, little code, better than every Python example I have seen; a detailed, annotated walkthrough follows below.
3. The SMO lecture videos from National Chiao Tung University in Taiwan. The instructor's treatment of how a1 and a2 are chosen, and of the clip step, is clearer than anything on Zhihu.
4. The various Python implementations floating around are of uneven quality and are not reliable references.
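The inner-loop update described in item 1 can be made concrete with a tiny numeric sketch. The numbers, the value of C, and the variable names below are my own toy choices, not taken from any of the references: minimize over a2, clip to [L, H], then recover a1 from the linear constraint.

```python
def clip(a, lo, hi):
    """Pin the unconstrained minimizer of the 1-D quadratic back into [lo, hi]."""
    return min(hi, max(lo, a))

C = 1.0                      # box constraint 0 <= a <= C
y1, y2 = 1.0, -1.0           # labels of the two chosen samples
a1_old, a2_old = 0.3, 0.5    # current multipliers

# y1 != y2: the segment a1*y1 + a2*y2 = const crosses the box [0, C]^2 at
L = max(0.0, a2_old - a1_old)        # here 0.2
H = min(C, C + a2_old - a1_old)      # here 1.0

# Suppose the unconstrained quadratic minimizer came out as 1.7 (a toy value,
# standing in for a2_old - y2*(E1 - E2)/eta). It lies above H, so clipping
# pins it to the upper boundary.
a2_new = clip(1.7, L, H)

# Recover a1 so that a1*y1 + a2*y2 stays constant.
a1_new = a1_old + y1 * y2 * (a2_old - a2_new)
print(a2_new, a1_new)
```

Note that the constraint is preserved: a1_old*y1 + a2_old*y2 and a1_new*y1 + a2_new*y2 are both -0.2.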
What follows is the SMO implementation in Matlab from Andrew Ng's machine learning course. The code is minimal, yet the logic is remarkably clear, so it is recorded and annotated here.
function [model] = svmTrain(X, Y, C, kernelFunction, ...
tol, max_passes)
%SVMTRAIN Trains an SVM classifier using a simplified version of the SMO
%algorithm.
% [model] = SVMTRAIN(X, Y, C, kernelFunction, tol, max_passes) trains an
% SVM classifier and returns trained model. X is the matrix of training
% examples. Each row is a training example, and the jth column holds the
% jth feature. Y is a column matrix containing 1 for positive examples
% and 0 for negative examples. C is the standard SVM regularization
% parameter. tol is a tolerance value used for determining equality of
% floating point numbers. max_passes controls the number of iterations
% over the dataset (without changes to alpha) before the algorithm quits.
%
% Note: This is a simplified version of the SMO algorithm for training
% SVMs. In practice, if you want to train an SVM classifier, we
% recommend using an optimized package such as:
%
% LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
% SVMLight (http://svmlight.joachims.org/)
%
%
if ~exist('tol', 'var') || isempty(tol)
tol = 1e-3;
end
if ~exist('max_passes', 'var') || isempty(max_passes)
max_passes = 5;
end
% Data parameters
m = size(X, 1);
n = size(X, 2);
% Map 0 to -1
Y(Y==0) = -1;
% Variables
alphas = zeros(m, 1);
b = 0;
E = zeros(m, 1);
passes = 0;
eta = 0;
L = 0;
H = 0;
% Pre-compute the Kernel Matrix since our dataset is small
% (in practice, optimized SVM packages that handle large datasets
% gracefully will _not_ do this)
%
% We have implemented an optimized vectorized version of the kernels here so
% that the SVM training will run faster.
if strcmp(func2str(kernelFunction), 'linearKernel')
% Vectorized computation for the Linear Kernel
% This is equivalent to computing the kernel on every pair of examples
K = X*X';
elseif strfind(func2str(kernelFunction), 'gaussianKernel')
% Vectorized RBF Kernel
% This is equivalent to computing the kernel on every pair of examples
X2 = sum(X.^2, 2);
K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X')));
K = kernelFunction(1, 0) .^ K;
else
% Pre-compute the Kernel Matrix
% The following can be slow due to the lack of vectorization
K = zeros(m);
for i = 1:m
for j = i:m
K(i,j) = kernelFunction(X(i,:)', X(j,:)');
K(j,i) = K(i,j); %the matrix is symmetric
end
end
end
% Train
fprintf('\nTraining ...');
dots = 12;
while passes < max_passes
% The outer loop starts here. Many Python implementations simply hard-code a
% large iteration count, e.g. while passes < 600, which is not a sound stopping rule.
num_changed_alphas = 0;
for i = 1:m
% Calculate Ei = f(x(i)) - y(i) using (2).
% E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i);
E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i);
if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0))
% This if-condition is clever: Y(i)*E(i) = Y(i)*[W'*X(i)+b-Y(i)] = Y(i)*(W'*X(i)+b)-1,
% which is exactly the margin condition y(i)*f(x(i)) = 1 in disguise. One could test
% y(i)*f(x(i)) directly, but folding it into E(i) avoids an extra round of computation.
% When Y(i)*f(x(i))-1 > 0, i.e. Y(i)*f(x(i)) > 1, the point X(i) lies outside the margin;
% by the KKT conditions alpha(i) should then be 0, so alpha(i) ~= 0 means it needs optimizing.
% Likewise, Y(i)*f(x(i)) < 1 puts the point inside the margin, where alpha(i) should
% equal C; alpha(i) ~= C again means it needs optimizing.
% In which direction does the optimization move? Fix a3 through a(m); then
% a1*y1 + a2*y2 = constant, and the objective can be written as a quadratic
% function of a2 alone.
% In practice, there are many heuristics one can use to select
% the i and j. In this simplified code, we select them randomly.
j = ceil(m * rand());
% Randomly pick a sample index j different from i.
while j == i % Make sure i \neq j
j = ceil(m * rand());
end
% Calculate Ej = f(x(j)) - y(j) using (2).
E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j);
% Save old alphas
alpha_i_old = alphas(i);
alpha_j_old = alphas(j);
% Compute L and H by (10) or (11).
if (Y(i) == Y(j)),
L = max(0, alphas(j) + alphas(i) - C);
H = min(C, alphas(j) + alphas(i));
else
L = max(0, alphas(j) - alphas(i));
H = min(C, C + alphas(j) - alphas(i));
end
if (L == H),
% continue to next i.
continue;
end
% Compute eta by (14).
eta = 2 * K(i,j) - K(i,i) - K(j,j);
if (eta >= 0),
% continue to next i.
continue;
end
% Compute and clip new value for alpha j using (12) and (15).
alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta;
% An interesting sanity check: assume every alpha except alpha(i), alpha(j) is fixed.
% If X(i) is a support vector then E(i) = 0; and if X(j) is a (correctly classified)
% point beyond the margin, alpha(j) ought to end up at 0. The update above delivers
% exactly that: it computes alpha(j) - Y(j)*(-E(j))/eta, and since eta < 0 this
% subtracts a positive quantity, so alpha(j) shrinks, possibly below zero; the clip
% step then pins it to 0.
% Clip
alphas(j) = min (H, alphas(j));
alphas(j) = max (L, alphas(j));
% Check if change in alpha is significant
if (abs(alphas(j) - alpha_j_old) < tol),
% continue to next i.
% replace anyway
alphas(j) = alpha_j_old;
continue;
end
% Determine value for alpha i using (16).
alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j));
% Compute b1 and b2 using (17) and (18) respectively.
b1 = b - E(i) ...
- Y(i) * (alphas(i) - alpha_i_old) * K(i,i)' ...
- Y(j) * (alphas(j) - alpha_j_old) * K(i,j)';
b2 = b - E(j) ...
- Y(i) * (alphas(i) - alpha_i_old) * K(i,j)' ...
- Y(j) * (alphas(j) - alpha_j_old) * K(j,j)';
% Compute b by (19).
if (0 < alphas(i) && alphas(i) < C),
b = b1;
elseif (0 < alphas(j) && alphas(j) < C),
b = b2;
else
b = (b1+b2)/2;
end
num_changed_alphas = num_changed_alphas + 1;
end
% End of the if-block. When a sample i already satisfies the KKT conditions, the
% entire if-body is skipped and the loop moves on to the next KKT-violating i.
% In the first sweeps almost every alpha changes, since all alphas start at 0 and
% essentially everything needs optimizing. As the iterations proceed, fewer and
% fewer alphas change per sweep, eventually none; each change is counted (+1).
end
% End of the inner 1:m loop; each pass visits every sample exactly once.
% After a full sweep, the stopping logic is neat: if any alpha changed during the
% sweep, the optimization is clearly not finished, so keep going.
% If a whole 1:m sweep changes no alpha, increment passes.
if (num_changed_alphas == 0),
passes = passes + 1;
else
passes = 0;
% If an alpha changes again while passes is counting up, passes resets to zero.
% Only after max_passes consecutive sweeps with no alpha change is the model
% considered settled; the outer loop then ends.
end
% Progress display: print one dot per outer iteration; after 78 dots, reset the
% counter and start a new line.
fprintf('.');
dots = dots + 1;
if dots > 78
dots = 0;
fprintf('\n');
end
if exist('OCTAVE_VERSION')
fflush(stdout);
end
% End of the outer loop: it exits once max_passes consecutive full 1:m sweeps
% have produced no change in any alpha, i.e. the multipliers have converged.
end
fprintf(' Done! \n\n');
% Save the model
idx = alphas > 0;
% The samples with alphas > 0 are the support vectors; save their indices.
% Among them, alphas equal to C belong to points that the slack variables allow
% to sit inside the margin (or be misclassified); 0 < alpha < C puts a point
% exactly on the margin; alpha = 0 means the point lies outside the margin.
% model.alphasinit = alphas;
model.X= X(idx,:);
model.y= Y(idx);
model.kernelFunction = kernelFunction;
model.b= b;
model.alphas= alphas(idx);
model.w = ((alphas.*Y)'*X)';
% The model stores only the X and y of the support vectors.
% W = sum(alpha(i)*y(i)*X(i,:)) is computed in vectorized form, which is far more
% efficient; note that the full alphas, Y, and X are used here, not just the
% support vectors (the terms with alpha = 0 simply contribute nothing).
end
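The vectorized RBF pre-computation above (the bsxfun line plus kernelFunction(1, 0) .^ K) is compact to the point of being cryptic. Here is a sketch of the same trick in Python/NumPy, checked against the naive double loop of the generic else-branch; the toy X and the sigma value are my own choices, not part of the original code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # toy data: 5 samples, 3 features
sigma = 0.7                   # RBF bandwidth (arbitrary choice)

# Pairwise squared distances via ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2*xi.xj,
# formed from a single matrix product plus broadcasting.
X2 = np.sum(X**2, axis=1, keepdims=True)      # column of ||xi||^2
sq_dists = X2 + X2.T - 2.0 * (X @ X.T)
K_fast = np.exp(-sq_dists / (2.0 * sigma**2))

# Naive double loop for comparison (what the generic branch computes pair by pair).
m = X.shape[0]
K_slow = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        d = X[i] - X[j]
        K_slow[i, j] = np.exp(-(d @ d) / (2.0 * sigma**2))

print(np.allclose(K_fast, K_slow))
```

The key identity is that the full distance matrix comes from one matrix product; the Matlab version then obtains exp(-d/(2*sigma^2)) by raising kernelFunction(1, 0) = exp(-1/(2*sigma^2)) to the power of each squared distance, which is the same quantity.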