
Incremental Learning for Text Classification: PassiveAggressiveClassifier, an Online Passive-Aggressive Classification Algorithm

2017-12-27 16:47, 721 views

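When solving real machine-learning problems we sometimes run into "big data" cases: millions of samples and thousands to tens of thousands of feature dimensions, with the stored data reaching something like 10 GB.
For a text classification problem you also have to extract text features, and loading all of the data into memory at that point takes far too much RAM. How can this be handled? 1. Reduce the dimensionality of the data? 2. Use streaming or stream-like processing? 3. Move to a bigger, high-memory machine, or use a Spark cluster.
This article introduces an incremental learning algorithm, PassiveAggressiveClassifier.
Processing flow: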

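1. Streaming data
The first requirement is to feed the algorithm streaming data or small batches, for example 1,000 samples at a time. You have to write this part yourself; one way is a generator that yields one mini-batch per call.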

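2. Feature extraction
The second requirement: any feature extraction method supported by sklearn can be used. Some cases need special handling, for example when features must be standardized or when the set of feature values is not known in advance.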
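When the vocabulary is not known in advance, one option is a stateless hashing vectorizer, which needs no fitting pass over the full data set. The following is only a minimal sketch: the character n-gram analyzer, the feature count and the sample strings are illustrative assumptions, not taken from the original.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary is learned, so every mini-batch is mapped
# into the same fixed-size feature space.
# n_features, analyzer and ngram_range are illustrative choices.
vectorizer = HashingVectorizer(n_features=2**18, analyzer="char", ngram_range=(1, 2))

batch_1 = ["red cotton shirt", "steel bolt m8"]   # made-up sample texts
batch_2 = ["blue cotton shirt"]

X1 = vectorizer.transform(batch_1)   # no fit() needed
X2 = vectorizer.transform(batch_2)
print(X1.shape, X2.shape)
```

Because nothing is fitted, the same vectorizer object can be applied to training batches and to the held-out test batch alike.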

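3. Incremental learning algorithm
As for the third requirement, sklearn provides many incremental learning algorithms. Not every algorithm can learn incrementally, but any estimator that exposes a partial_fit method can. In fact, learning from small batches (sometimes called online learning) is the core of this approach, because it guarantees that only a small amount of data is in memory at any one time.
One example is sklearn.linear_model.PassiveAggressiveClassifier.
For classification, the first call to partial_fit must specify all class labels through the classes parameter.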
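The role of the classes argument can be illustrated with a small sketch on random data (the data, shapes and labels are made up for illustration): the first batch contains only two of the three classes, yet the model is sized for all three.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
all_classes = np.array([0, 1, 2])   # every label that can ever appear

clf = PassiveAggressiveClassifier(random_state=0)

# The first batch only contains labels 0 and 1;
# classes= tells the estimator that label 2 exists as well.
X_first = rng.rand(20, 5)
y_first = rng.randint(0, 2, size=20)
clf.partial_fit(X_first, y_first, classes=all_classes)

# Later calls may omit classes=.
X_next = rng.rand(20, 5)
y_next = rng.randint(0, 3, size=20)
clf.partial_fit(X_next, y_next)

print(clf.classes_)
print(clf.coef_.shape)
```

Passing classes on every call, as the training loop in this article does, is also accepted as long as the list stays the same.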

Generating the data stream with an iterator

def iter_minibatches(filename, minibatch_size=1000):
    '''
    Generator.
    Given a file (for example one large CSV), yield minibatch_size rows at a time (1,000 by default).
    Each batch is converted to numpy arrays and returned as X, y.
    '''
    import pandas as pd
    import numpy as np
    from sklearn.utils import shuffle
    x = []
    y = []
    cur_line_num = 0
    # Note: read_csv without chunksize loads the whole file into memory here
    with open(filename, 'rb') as csvfile:
        reader = pd.read_csv(csvfile
                             # , encoding='gb18030'
                             )
    # Segment the product names (sjcl is the author's own helper, not shown here)
    reader['HWMC'] = sjcl(list(reader['HWMC'].astype(str)))
    # Replace empty strings with NaN
    reader['HWMC'] = reader['HWMC'].apply(lambda v: np.nan if str(v) == '' else v)
    # df_null = df[df['HWMC'].isnull()]
    reader = reader[reader['HWMC'].notnull()]
    reader.index = np.arange(len(reader))
    reader = shuffle(reader)
    for line in reader.index:
        x.append(reader.HWMC[line])
        y.append(reader.U_CODE[line])  # class label
        cur_line_num += 1
        if cur_line_num >= minibatch_size:
            x, y = np.array(x), np.array(y)  # convert to numpy arrays before yielding
            yield x, y
            x, y = [], []
            cur_line_num = 0
    if x:
        # Yield the final, smaller batch so that no rows are silently dropped
        yield np.array(x), np.array(y)
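The generator above still reads the whole CSV into memory before cutting it into batches. If the file itself is too large for that, a possible variant is to let pandas read it in chunks. This is only a sketch: the function name iter_minibatches_chunked and the tiny demo file are made up for illustration, the column names HWMC and U_CODE are taken from the code above, and word segmentation is left out.

```python
import os
import tempfile

import pandas as pd


def iter_minibatches_chunked(filename, minibatch_size=1000,
                             text_col="HWMC", label_col="U_CODE"):
    """Yield (X, y) numpy arrays while reading the CSV chunk by chunk."""
    for chunk in pd.read_csv(filename, chunksize=minibatch_size):
        chunk = chunk.dropna(subset=[text_col])   # skip rows without text
        if len(chunk) == 0:
            continue
        yield chunk[text_col].astype(str).to_numpy(), chunk[label_col].to_numpy()


# Tiny demo file with made-up rows so that the sketch runs on its own.
demo = pd.DataFrame({"HWMC": ["a b", "c d", None, "e f", "g h"],
                     "U_CODE": [1, 2, 1, 2, 1]})
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo.to_csv(path, index=False)

batch_sizes = [len(X) for X, y in iter_minibatches_chunked(path, minibatch_size=2)]
print(batch_sizes)  # [2, 1, 1]: the row with missing text is dropped
```

Chunked reading keeps the file order, so a global shuffle is no longer possible inside the generator; shuffling the file once on disk beforehand is one way around that.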

Training code. Do not copy it verbatim; adapt the feature extraction to your own business needs.

import pandas as pd
import numpy as np
import datetime
import gc
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn import metrics

# get_classes and get_hv are the author's own helpers and are not shown here.
# The function signature is inferred from the call sp.fitby_linear_model(...) further down.
def fitby_linear_model(filename, models, size):
    df_sc = pd.DataFrame([[0, 0, 0]], columns=['model', 'time', 'score'])
    num = 1
    for model in models:
        MD = models[model]
        print("Collecting classes", datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'))
        all_classes = get_classes(filename)
        minibatch_train_iterators = iter_minibatches(filename, size)
        # The first mini-batch is held out as the test set
        x_test, y_test = next(minibatch_train_iterators)
        print("Start training", datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'))
        for i, (X_train, y_train) in enumerate(minibatch_train_iterators):
            print("{} time".format(i))  # current batch number
            # Use partial_fit; classes must be given on the first call to partial_fit
            MD.partial_fit(get_hv(X_train), y_train, classes=all_classes)
        result = MD.predict(get_hv(x_test))
        print(model, "score: %.4g" % metrics.accuracy_score(y_test, result))  # accuracy on the test batch
        df_sc.loc[num] = {'model': model, 'time': datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'), 'score': MD.score(get_hv(x_test), y_test)}

        if df_sc.score[num] > df_sc.score[num - 1]:
            print("Training finished, saving model", datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'))
            # Save the model
            joblib.dump(MD, "/root/lizheng/model/model_learn1%s.pkl.gz" % model, compress=('gzip', 3))
        num += 1  # next row of the score log
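The helpers get_classes and get_hv are not shown in the article. Purely as a hypothetical sketch of what they might look like, assuming get_hv wraps a HashingVectorizer and get_classes only needs the label column (the U_CODE column name comes from the generator code; everything else is an assumption):

```python
import io

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical stand-ins; the author's real helpers may differ.
_hv = HashingVectorizer(n_features=2**18)   # feature count is an arbitrary choice


def get_hv(texts):
    """Map an iterable of (already segmented) strings to a sparse hashed matrix."""
    return _hv.transform(texts)


def get_classes(filename, label_col="U_CODE"):
    """Return the sorted unique labels, reading only the label column."""
    labels = pd.read_csv(filename, usecols=[label_col])[label_col]
    return np.unique(labels)


# Demo on an in-memory CSV with made-up rows.
demo_csv = io.StringIO("HWMC,U_CODE\nsteel bolt,3\ncotton shirt,1\nsteel nut,3\n")
print(get_classes(demo_csv))
print(get_hv(["steel bolt", "cotton shirt"]).shape)
```

Reading only the label column keeps the class scan cheap even when the text column is large.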

from sklearn.linear_model import PassiveAggressiveClassifier

import sys
# sys.path.append("D:/PDM/SPBM")
sys.path.append("/root/lizheng")
models_learn = {
    # 'pa1-0.6': PassiveAggressiveClassifier(C=0.6, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.4
    # 'pa1-0.7': PassiveAggressiveClassifier(C=0.7, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.8': PassiveAggressiveClassifier(C=0.8, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.9': PassiveAggressiveClassifier(C=0.9, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-1': PassiveAggressiveClassifier(C=1, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.6
    'pa4-1': PassiveAggressiveClassifier(C=2, max_iter=10000, loss='hinge', average=True, n_jobs=-1, random_state=1),
}

# sp is the author's own module that contains fitby_linear_model; its import is not shown in the original
sp.fitby_linear_model('/root/lizheng/fcqspbm_1214a.csv', models_learn, 1000000)
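A model saved with joblib.dump can later be loaded again and, since it supports partial_fit, trained further on new batches. A minimal sketch on random data, with a temporary file path standing in for the real model path:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
X = rng.rand(30, 4)                      # made-up features
y = (X[:, 0] > 0.5).astype(int)          # made-up binary labels

clf = PassiveAggressiveClassifier(random_state=0)
clf.partial_fit(X, y, classes=np.array([0, 1]))

# Save with the same compression setting as in the training code above.
path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl.gz")
joblib.dump(clf, path, compress=("gzip", 3))

restored = joblib.load(path)
# The restored model predicts identically ...
same = np.array_equal(clf.predict(X), restored.predict(X))
# ... and can keep learning from further batches.
restored.partial_fit(X, y)
print(same)
```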

sklearn.linear_model.PassiveAggressiveClassifier

class sklearn.linear_model.PassiveAggressiveClassifier(C=1.0, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, loss='hinge', n_jobs=1, random_state=None, warm_start=False, class_weight=None, average=False, n_iter=None)

Passive Aggressive Classifier.

Parameters:

C : float
    Maximum step size (regularization). Defaults to 1.0.
fit_intercept : bool, default=True
    Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
max_iter : int, optional
    The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit. Defaults to 5. Defaults to 1000 from 0.21, or if tol is not None. New in version 0.19.
tol : float or None, optional
    The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to None. Defaults to 1e-3 from 0.21. New in version 0.19.
shuffle : bool, default=True
    Whether or not the training data should be shuffled after each epoch.
verbose : integer, optional
    The verbosity level.
loss : string, optional
    The loss function to be used: hinge: equivalent to PA-I in the reference paper. squared_hinge: equivalent to PA-II in the reference paper.
n_jobs : integer, optional
    The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means 'all CPUs'. Defaults to 1.
random_state : int, RandomState instance or None, optional, default=None
    The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
warm_start : bool, optional
    When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
class_weight : dict, {class_label: weight} or "balanced" or None, optional
    Preset for the class_weight fit parameter. Weights associated with classes. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). New in version 0.17: parameter class_weight to automatically weight samples.
average : bool or int, optional
    When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples. New in version 0.19: parameter average to use weights averaging in SGD.
n_iter : int, optional
    The number of passes over the training data (aka epochs). Defaults to None. Deprecated, will be removed in 0.21. Changed in version 0.19: Deprecated.

Attributes:

coef_ : array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]
    Weights assigned to the features.
intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
    Constants in decision function.
n_iter_ : int
    The actual number of iterations to reach the stopping criterion. For multiclass fits, it is the maximum over every binary fit.
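To give a feel for what C ("maximum step size") and the two loss options mean, the update for a single binary example can be written out by hand. This is a didactic sketch of the PA-I and PA-II update rules for labels in {-1, +1}, not scikit-learn's actual implementation; the function name pa_update is made up for illustration.

```python
import numpy as np


def pa_update(w, x, y, C=1.0, loss="hinge"):
    """One passive-aggressive step for a binary example with label y in {-1, +1}."""
    hinge = max(0.0, 1.0 - y * np.dot(w, x))
    if hinge == 0.0:
        return w                       # passive: the margin is already satisfied
    sq_norm = np.dot(x, x)
    if loss == "hinge":                # PA-I: the step size is capped at C
        tau = min(C, hinge / sq_norm)
    else:                              # PA-II ("squared_hinge"): softened step size
        tau = hinge / (sq_norm + 1.0 / (2.0 * C))
    return w + tau * y * x             # aggressive: move the weights toward label y


w = np.zeros(2)
x = np.array([1.0, 2.0])
w1 = pa_update(w, x, y=+1, C=1.0, loss="hinge")   # uncapped step: tau = 0.2
w2 = pa_update(w, x, y=+1, C=0.1, loss="hinge")   # capped by C:   tau = 0.1
print(w1, w2)
```

With C=1.0 the step is large enough to bring the example exactly onto the margin; with C=0.1 the cap limits the step, which makes the model less sensitive to noisy labels.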

