Incremental learning for text classification: PassiveAggressiveClassifier, an online passive-aggressive classification algorithm (big data)
2017-12-27 16:47
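When solving real machine learning problems we sometimes run into "big data" situations: for example, millions of samples with thousands or tens of thousands of feature dimensions, where the stored data reaches the order of 10 GB.
For a text classification problem you also have to extract text features, and loading all of the data into memory at that point takes far too much memory. How can this be handled? 1. Reduce the dimensionality of the data? 2. Use streaming or stream-like processing? 3. Move to a bigger machine with more memory, or use a Spark cluster.
This article introduces an incremental learning algorithm, PassiveAggressiveClassifier.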
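Processing flow:
1. Streaming data
The first requirement is to feed the algorithm streaming data or small batches, for example 1000 samples at a time. You have to write this part yourself; one option is a generator that yields one mini-batch per call.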
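2. Feature extraction
The second requirement: any feature extraction method supported by sklearn can be used. Some cases need special handling, for example when features must be standardized or when the set of feature values is not known in advance.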
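One common way to handle the "feature values not known in advance" case for text is a stateless hashing vectorizer, which needs no fit pass over the data. The following is only a minimal sketch: the helper name `get_features`, the choice of `n_features=2**18`, and the sample strings are assumptions made for illustration, not the author's actual feature extraction.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Assumption: 2**18 hashed columns is an illustrative size, not a tuned value.
# alternate_sign=False keeps all feature values non-negative.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)


def get_features(texts):
    """Hypothetical helper: map raw strings to a sparse matrix.

    HashingVectorizer is stateless, so transform() works on any batch
    without having seen the vocabulary beforehand.
    """
    return vectorizer.transform(texts)


batch = ["red cotton shirt", "stainless steel kettle"]
X = get_features(batch)
print(X.shape)  # (2, 262144)
```

Because nothing is learned from the data, the same vectorizer can be applied to every mini-batch and to the test set, at the cost of possible hash collisions and no way to map columns back to words.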
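3. Incremental learning algorithm
For the third requirement, sklearn provides many incremental learning algorithms. Not every algorithm can learn incrementally, but any estimator that provides a partial_fit method can. Learning incrementally from mini-batches (sometimes called online learning) is the core of this approach, because it guarantees that only a small amount of data is in memory at any given time.
One such algorithm is sklearn.linear_model.PassiveAggressiveClassifier.
For classification, the first call to partial_fit must specify all possible class labels through the classes parameter.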
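The reason for the classes parameter is that a single mini-batch may not contain every label. A small sketch of the pattern follows; the toy texts, the label names and the tiny `n_features=2**10` are made up for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Illustrative assumptions: toy labels and a small hashed feature space.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
clf = PassiveAggressiveClassifier(random_state=0)

# Every label the model will ever see must be declared up front.
all_classes = np.array(["clothing", "kitchen"])

batches = [
    (["red cotton shirt", "steel kettle"], ["clothing", "kitchen"]),
    # This batch contains only one of the two classes.
    (["wool sweater", "blue denim jeans"], ["clothing", "clothing"]),
]

for texts, labels in batches:
    # classes is required on the first call; passing the same array on
    # later calls is allowed and keeps the loop uniform.
    clf.partial_fit(vectorizer.transform(texts), labels, classes=all_classes)

pred = clf.predict(vectorizer.transform(["cotton shirt"]))
print(clf.classes_, pred)
```

The second batch would be a problem for a fresh model without the classes argument, since the model could not know that a second label exists.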
Generating the data stream with an iterator
def iter_minibatches(filename, minibatch_size=1000):
    '''
    Generator.
    Given a file (for example one large CSV), yield minibatch_size rows
    at a time (1k rows by default) as numpy arrays X, y.
    '''
    import pandas as pd
    import numpy as np
    from sklearn.utils import shuffle

    reader = pd.read_csv(filename
                         # , encoding='gb18030'
                         )
    # Split the product names into tokens (sjcl is the author's own helper, not shown in this post)
    reader['HWMC'] = sjcl(list(reader['HWMC'].astype(str)))
    # Replace empty strings with NaN, then drop those rows
    reader['HWMC'] = reader['HWMC'].apply(lambda s: np.nan if str(s) == '' else s)
    reader = reader[reader['HWMC'].notnull()]
    reader.index = np.arange(len(reader))
    reader = shuffle(reader)

    x, y = [], []
    for line in reader.index:
        x.append(reader.HWMC[line])
        y.append(reader.U_CODE[line])
        if len(x) >= minibatch_size:
            # Convert the batch to numpy arrays and hand it out
            yield np.array(x), np.array(y)
            x, y = [], []
    if x:
        # Hand out the remaining rows as a final, smaller batch
        yield np.array(x), np.array(y)
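Note that iter_minibatches calls pd.read_csv on the whole file, so the full table is still held in memory and only the batches handed to the model are small. If the file itself does not fit in memory, one possible alternative is to let pandas read it in chunks. The sketch below assumes the same column names (HWMC, U_CODE) and skips the author's sjcl preprocessing; the function name `iter_csv_minibatches` and the demo data are hypothetical.

```python
import io

import pandas as pd


def iter_csv_minibatches(source, minibatch_size=1000,
                         text_col="HWMC", label_col="U_CODE"):
    """Hypothetical chunked reader: yields (texts, labels) numpy arrays.

    Only one chunk of at most minibatch_size rows is in memory at a time.
    """
    for chunk in pd.read_csv(source, chunksize=minibatch_size):
        # Drop rows with a missing text or label
        chunk = chunk.dropna(subset=[text_col, label_col])
        # Drop rows whose text is blank after stripping whitespace
        chunk = chunk[chunk[text_col].astype(str).str.strip() != ""]
        if len(chunk):
            yield chunk[text_col].to_numpy(), chunk[label_col].to_numpy()


# Tiny in-memory stand-in for a large CSV file (made-up rows)
demo = io.StringIO(
    "HWMC,U_CODE\n"
    "red shirt,101\n"
    "steel kettle,202\n"
    ",303\n"
    "wool sweater,101\n"
)
sizes = [len(x) for x, _ in iter_csv_minibatches(demo, minibatch_size=2)]
print(sizes)  # [2, 1]
```

The trade-off is that rows arrive in file order: there is no global shuffle as in the original generator, so a file sorted by label would need to be shuffled on disk beforehand.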
Training code. Do not copy it as is; adapt the feature extraction to your own business needs.
import datetime

import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
import pandas as pd
from sklearn import metrics


def timestamp():
    return datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S')


# The signature is inferred from the call sp.fitby_linear_model(path, models, size) used later in this post.
# get_classes and get_hv are the author's own helpers and are not shown in this post.
def fitby_linear_model(filename, models, size):
    df_sc = pd.DataFrame([[0, 0, 0]], columns=['model', 'time', 'score'])
    num = 1
    for model in models:
        MD = models[model]
        print("Collecting classes", timestamp())
        all_classes = get_classes(filename)
        minibatch_train_iterators = iter_minibatches(filename, size)
        # Hold out the first mini-batch as the test set
        x_test, y_test = next(minibatch_train_iterators)
        print("Start training", timestamp())
        for i, (X_train, y_train) in enumerate(minibatch_train_iterators):
            print("{} time".format(i))  # index of the current batch
            # Use partial_fit; classes must be given on the first call
            MD.partial_fit(get_hv(X_train), y_train, classes=all_classes)
            result = MD.predict(get_hv(x_test))
            print(model, "score: %.4g" % metrics.accuracy_score(y_test, result))  # accuracy on the held-out batch
        df_sc.loc[num] = {'model': model, 'time': timestamp(), 'score': MD.score(get_hv(x_test), y_test)}
        if df_sc.score[num] > df_sc.score[num - 1]:
            print("Training finished, saving model", timestamp())
            # Save the model
            joblib.dump(MD, "/root/lizheng/model/model_learn1%s.pkl.gz" % model, compress=('gzip', 3))
        num += 1  # move to the next row so each model is compared with the previous one
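The training code relies on get_classes to know every label before the first partial_fit call, but that helper is not shown in the post. One way such a helper could work, without loading the whole file, is to stream only the label column and collect the unique values. This is a guess at the idea rather than the author's implementation; the name `collect_classes`, the `U_CODE` default and the demo data are assumptions.

```python
import io

import numpy as np
import pandas as pd


def collect_classes(source, label_col="U_CODE", chunksize=100_000):
    """Hypothetical stand-in for get_classes: unique labels, read in chunks."""
    seen = set()
    # usecols restricts parsing to the label column, which keeps chunks small
    for chunk in pd.read_csv(source, usecols=[label_col], chunksize=chunksize):
        seen.update(chunk[label_col].dropna().unique())
    return np.array(sorted(seen))


# Tiny in-memory stand-in for the real CSV (made-up rows)
demo = io.StringIO("HWMC,U_CODE\na,202\nb,101\nc,202\n")
classes = collect_classes(demo, chunksize=2)
print(classes)  # [101 202]
```

The returned array can be passed as classes= to partial_fit. It costs one extra pass over the file, which is usually cheap compared with training.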
from sklearn.linear_model import PassiveAggressiveClassifier
import sys
#sys.path.append("D:/PDM/SPBM")
sys.path.append("/root/lizheng")
import sp  # assumption: the author's own module on the path above; its import is missing in the original

models_learn = {
    #'pa1-0.6':PassiveAggressiveClassifier(C=0.6,max_iter=100000,loss = 'squared_hinge',average=True,n_jobs=-1,random_state=1),#71.4
    #'pa1-0.7':PassiveAggressiveClassifier(C=0.7,max_iter=10000,loss = 'squared_hinge',average=True,n_jobs=-1,random_state=1),
    #'pa1-0.8':PassiveAggressiveClassifier(C=0.8,max_iter=10000,loss = 'squared_hinge',average=True,n_jobs=-1,random_state=1),
    #'pa1-0.9':PassiveAggressiveClassifier(C=0.9,max_iter=10000,loss = 'squared_hinge',average=True,n_jobs=-1,random_state=1),
    #'pa1-1':PassiveAggressiveClassifier(C=1,max_iter=100000,loss = 'squared_hinge',average=True,n_jobs=-1,random_state=1),#71.6
    'pa4-1': PassiveAggressiveClassifier(C=2, max_iter=10000, loss='hinge', average=True, n_jobs=-1, random_state=1)
}
sp.fitby_linear_model('/root/lizheng/fcqspbm_1214a.csv', models_learn, 1000000)
sklearn.linear_model.PassiveAggressiveClassifier (as documented for scikit-learn 0.19)

class sklearn.linear_model.PassiveAggressiveClassifier(C=1.0, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, loss='hinge', n_jobs=1, random_state=None, warm_start=False, class_weight=None, average=False, n_iter=None)

Passive Aggressive Classifier.

Parameters:
    C : float
        Maximum step size (regularization). Defaults to 1.0.
    fit_intercept : bool, default=True
        Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
    max_iter : int, optional
        The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit. Defaults to 5. Defaults to 1000 from 0.21, or if tol is not None. New in version 0.19.
    tol : float or None, optional
        The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to None. Defaults to 1e-3 from 0.21. New in version 0.19.
    shuffle : bool, default=True
        Whether or not the training data should be shuffled after each epoch.
    verbose : integer, optional
        The verbosity level.
    loss : string, optional
        The loss function to be used:
        hinge: equivalent to PA-I in the reference paper.
        squared_hinge: equivalent to PA-II in the reference paper.
    n_jobs : integer, optional
        The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means 'all CPUs'. Defaults to 1.
    random_state : int, RandomState instance or None, optional, default=None
        The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
    warm_start : bool, optional
        When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
    class_weight : dict, {class_label: weight} or "balanced" or None, optional
        Preset for the class_weight fit parameter. Weights associated with classes. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). New in version 0.17: parameter class_weight to automatically weight samples.
    average : bool or int, optional
        When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples. New in version 0.19: parameter average to use weights averaging in SGD.
    n_iter : int, optional
        The number of passes over the training data (aka epochs). Defaults to None. Deprecated, will be removed in 0.21. Changed in version 0.19: Deprecated.

Attributes:
    coef_ : array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]
        Weights assigned to the features.
    intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
        Constants in decision function.
    n_iter_ : int
        The actual number of iterations to reach the stopping criterion. For multiclass fits, it is the maximum over every binary fit.
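To make the role of C ("maximum step size") and the hinge loss more concrete, here is a toy version of a single binary PA-I update as it is usually stated: if the example is classified with margin at least 1 the weights stay unchanged (passive), otherwise they move just far enough to fix the margin, capped by C (aggressive). This is a simplified sketch for labels in {-1, +1} with no intercept, no averaging and no one-versus-all handling; it is not scikit-learn's implementation, and the function name `pa1_update` is hypothetical.

```python
import numpy as np


def pa1_update(w, x, y, C=1.0):
    """One PA-I step for a single example (x, y), with y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss
    sq_norm = np.dot(x, x)
    if loss == 0.0 or sq_norm == 0.0:
        return w  # passive: margin already satisfied (or nothing to learn from)
    tau = min(C, loss / sq_norm)  # aggressive step, capped by C
    return w + tau * y * x


w = np.zeros(2)
x = np.array([1.0, 0.0])

w = pa1_update(w, x, +1)       # loss is 1, so the weights move
w_again = pa1_update(w, x, +1) # margin is now 1, so nothing changes
print(w, w_again)
```

A smaller C limits how far a single, possibly noisy, example can move the weights, which is why it acts as regularization.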