
Incremental learning for text classification: PassiveAggressiveClassifier, an online passive-aggressive classification algorithm

2017-12-27 16:47
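In a text classification problem you also have to extract text features, and loading the whole dataset into memory at that point takes far too much RAM. How can this be solved? 1. Reduce the dimensionality of the data? 2. Use streaming or stream-like processing? 3. Move to a bigger, high-memory machine, or use a Spark cluster.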

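Learning from data that does not fit in memory requires three things:

1. Streaming data

2. Feature extraction

3. An incremental learning algorithm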
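For the third condition, sklearn provides many incremental learning algorithms. Not every algorithm can learn incrementally, but any estimator that implements partial_fit can. In fact, learning from small batches of data (sometimes called online learning) is the core of this approach, because it guarantees that only a small amount of data is held in memory at any given time.
One example is sklearn.linear_model.PassiveAggressiveClassifier.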
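
To make the partial_fit idea concrete, here is a minimal sketch on a made-up toy corpus. The sentences, labels, and the choice of HashingVectorizer as the feature extractor are assumptions for illustration and do not come from the original post.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Toy mini-batches standing in for a real document stream (made up for illustration).
batches = [
    (["cheap pills online", "meeting at noon"], [1, 0]),
    (["win money now", "project status report"], [1, 0]),
    (["cheap money pills", "lunch meeting today"], [1, 0]),
]

# HashingVectorizer is stateless, so it never needs to see the full corpus.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
clf = PassiveAggressiveClassifier(random_state=0)
all_classes = np.array([0, 1])  # every label must be declared on the first call

for texts, labels in batches:
    X = vectorizer.transform(texts)          # only this batch is in memory
    clf.partial_fit(X, labels, classes=all_classes)

pred = clf.predict(vectorizer.transform(["cheap pills", "status meeting"]))
print(pred)
```

Only one mini-batch is vectorized and held at a time, which is the property the paragraph above describes.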

import numpy as np
import pandas as pd
from sklearn.utils import shuffle


def iter_minibatches(filename, minibatch_size):
    """Yield (X, y) mini-batches converted to numpy arrays."""
    x = []
    y = []
    cur_line_num = 0
    # read_csv accepts a path directly; pass encoding='gb18030' if the file needs it
    reader = pd.read_csv(filename)
    # sjcl is the author's own text-cleaning helper; it is not shown in this post
    reader['HWMC'] = sjcl(list(reader['HWMC'].astype(str)))
    # replace blank strings with NaN, then drop those rows
    reader['HWMC'] = reader['HWMC'].apply(lambda s: np.nan if str(s) == '' else s)
    reader = reader[reader['HWMC'].notnull()]
    reader.index = np.arange(len(reader))
    reader = shuffle(reader)
    for line in reader.index:
        # the original listing never appends to x; the HWMC text column is assumed here
        x.append(reader.HWMC[line])
        y.append(reader.U_CODE[line])  # the data should be converted to float here
        cur_line_num += 1
        if cur_line_num >= minibatch_size:
            x, y = np.array(x), np.array(y)  # convert to numpy arrays and return them
            yield x, y
            x, y = [], []
            cur_line_num = 0
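
The generator above loads the whole CSV with a single read_csv call before slicing it into batches. If the file itself is too large for memory, one possible alternative is to let pandas stream it with the chunksize argument. The sketch below is an assumption, not the author's code: the function name iter_csv_minibatches and the demo rows are made up, only the column names HWMC and U_CODE come from the post. Unlike the original, it does not shuffle globally.

```python
import io
import pandas as pd


def iter_csv_minibatches(source, minibatch_size, text_col="HWMC", label_col="U_CODE"):
    """Yield (texts, labels) numpy arrays, reading the CSV one chunk at a time."""
    for chunk in pd.read_csv(source, chunksize=minibatch_size):
        chunk = chunk.dropna(subset=[text_col])  # skip rows with no text
        if chunk.empty:
            continue
        yield chunk[text_col].astype(str).to_numpy(), chunk[label_col].to_numpy()


# Made-up demo data; the third row has an empty HWMC field and is dropped.
demo_csv = io.StringIO(
    "HWMC,U_CODE\n"
    "steel pipe,101\n"
    "copper wire,102\n"
    ",103\n"
    "steel plate,101\n"
    "rubber hose,104\n"
)
batches = list(iter_csv_minibatches(demo_csv, minibatch_size=2))
for texts, labels in batches:
    print(list(texts), list(labels))
```

Because rows arrive in file order, a file sorted by label would feed the model one class at a time; shuffling the file on disk beforehand is one way around that.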

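The training code follows. Do not copy it verbatim: adapt the feature extraction to your own business requirements.

import datetime

import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
import pandas as pd
from sklearn import metrics

# models maps a name to an estimator that supports partial_fit
# (for example the models_learn dict defined later in this post).
# get_classes, get_hv, filename and size are the author's own and are not shown here.
df_sc = pd.DataFrame([[0, 0, 0]], columns=['model', 'time', 'score'])
num = 1
for model in models:
    MD = models[model]
    all_classes = get_classes(filename)
    minibatch_train_iterators = iter_minibatches(filename, size)
    # hold out the first mini-batch as the test set
    x_test, y_test = next(minibatch_train_iterators)
    for i, (X_train, y_train) in enumerate(minibatch_train_iterators):
        print("{} time".format(i))  # current iteration
        # use partial_fit; classes must be specified on the first call
        MD.partial_fit(get_hv(X_train), y_train, classes=all_classes)
        result = MD.predict(get_hv(x_test))  # result was undefined in the original listing
        print(model, "score: %.4g" % metrics.accuracy_score(y_test, result))  # check on the test set
    df_sc.loc[num] = {'model': model, 'time': datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'), 'score': MD.score(get_hv(x_test), y_test)}

    # save the model only if it beats the previous entry
    if df_sc.score[num] > df_sc.score[num - 1]:
        joblib.dump(MD, "/root/lizheng/model/model_learn1%s.pkl.gz" % model, compress=('gzip', 3))
    num += 1  # not in the original listing; added so each model gets its own row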
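
The training loop calls get_classes and get_hv, but neither is defined in the post. The sketch below is only a guess at what they might look like: it assumes get_hv wraps a HashingVectorizer and get_classes collects the distinct values of the U_CODE column. The character n-gram settings and the demo rows are likewise assumptions.

```python
import io
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Assumption: character n-grams are a reasonable default for short item names.
_hv = HashingVectorizer(n_features=2**18, analyzer="char", ngram_range=(1, 2))


def get_hv(texts):
    """Hypothetical helper: hash raw strings into a sparse feature matrix."""
    return _hv.transform(texts)


def get_classes(source, label_col="U_CODE"):
    """Hypothetical helper: return every distinct label, reading only the label column."""
    labels = pd.read_csv(source, usecols=[label_col])[label_col]
    return np.unique(labels)


# Made-up demo data using the column names from the post.
demo = io.StringIO("HWMC,U_CODE\nsteel pipe,101\ncopper wire,102\nsteel plate,101\n")
classes = get_classes(demo)
features = get_hv(["steel pipe", "copper wire"])
print(classes, features.shape)
```

A hashing vectorizer fits this workflow because it has no vocabulary to learn, so every mini-batch maps into the same feature space without a first pass over the data.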
from sklearn.linear_model import PassiveAggressiveClassifier

models_learn = {
    # 'pa1-0.6': PassiveAggressiveClassifier(C=0.6, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.4
    # 'pa1-0.7': PassiveAggressiveClassifier(C=0.7, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.8': PassiveAggressiveClassifier(C=0.8, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.9': PassiveAggressiveClassifier(C=0.9, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-1': PassiveAggressiveClassifier(C=1, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.6
    'pa4-1': PassiveAggressiveClassifier(C=2, max_iter=10000, loss='hinge', average=True, n_jobs=-1, random_state=1),
}



sklearn.linear_model.PassiveAggressiveClassifier(C=1.0, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, loss='hinge', n_jobs=1, random_state=None, warm_start=False, class_weight=None, average=False, n_iter=None)

Passive Aggressive Classifier

Parameters:

C : float
    Maximum step size (regularization). Defaults to 1.0.
fit_intercept : bool, default=True
    Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
max_iter : int, optional
    The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit. Defaults to 5. Defaults to 1000 from 0.21, or if tol is not None. New in version 0.19.
tol : float or None, optional
    The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to None. Defaults to 1e-3 from 0.21. New in version 0.19.
shuffle : bool, default=True
    Whether or not the training data should be shuffled after each epoch.
verbose : integer, optional
    The verbosity level.
loss : string, optional
    The loss function to be used. hinge: equivalent to PA-I in the reference paper. squared_hinge: equivalent to PA-II in the reference paper.
n_jobs : integer, optional
    The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means 'all CPUs'. Defaults to 1.
random_state : int, RandomState instance or None, optional, default=None
    The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
warm_start : bool, optional
    When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
class_weight : dict, {class_label: weight} or "balanced" or None, optional
    Preset for the class_weight fit parameter. Weights associated with classes. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). New in version 0.17: parameter class_weight to automatically weight samples.
average : bool or int, optional
    When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples. New in version 0.19: parameter average to use weights averaging in SGD.
n_iter : int, optional
    The number of passes over the training data (aka epochs). Defaults to None. Deprecated, will be removed in 0.21. Changed in version 0.19: Deprecated.

Attributes:

coef_ : array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]
    Weights assigned to the features.
intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
    Constants in decision function.
n_iter_ : int
    The actual number of iterations to reach the stopping criterion. For multiclass fits, it is the maximum over every binary fit.
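
The loss parameter selects between the PA-I and PA-II variants, and C bounds or softens the step size. As a rough illustration of what that means, here is the textbook binary update from the passive-aggressive reference paper written in NumPy. It is a sketch of the rule itself, not scikit-learn's internal implementation; the function name pa_update is made up, and the intercept and weight averaging are left out.

```python
import numpy as np


def pa_update(w, x, y, C=1.0, loss="hinge"):
    """One binary passive-aggressive step; y must be +1 or -1."""
    hinge = max(0.0, 1.0 - y * np.dot(w, x))
    if hinge == 0.0:
        return w  # passive: the margin is already satisfied, keep w
    sq_norm = np.dot(x, x)
    if loss == "hinge":
        tau = min(C, hinge / sq_norm)              # PA-I: step size capped at C
    else:
        tau = hinge / (sq_norm + 1.0 / (2.0 * C))  # PA-II: step size softened by C
    return w + tau * y * x  # aggressive: move toward satisfying the margin


w0 = np.zeros(2)
x = np.array([1.0, 2.0])
w1 = pa_update(w0, x, y=1, C=1.0, loss="hinge")
w2 = pa_update(w0, x, y=1, C=1.0, loss="squared_hinge")
print(w1, w2)
```

A smaller C makes each update more conservative, which is why the commented-out configurations in the model dictionary sweep C between 0.6 and 2.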

