Semi-supervised Text Categorization by Considering Sufficiency and Diversity
2015-12-10 00:00
The following excerpts from the paper Semi-supervised Text Categorization by Considering Sufficiency and Diversity by Shoushan Li et al. summarize its main ideas.
Paper name: Semi-supervised Text Categorization by Considering Sufficiency and Diversity
Paper authors: Shoushan Li et al.
Key words: Semi-supervised, Text Categorization, Bootstrapping, Sufficiency, Diversity
Overview
The paper Semi-supervised Text Categorization by Considering Sufficiency and Diversity by Shoushan Li et al. proposed a novel bootstrapping approach to semi-supervised text categorization (TC) that considers two basic preferences, i.e., sufficiency and diversity. Experimental evaluation shows the effectiveness of the modified bootstrapping approach on both topic-based and sentiment-based TC tasks.
Bootstrapping
In bootstrapping, a classifier is first trained on a small amount of labeled data and then iteratively retrained by adding the most confidently predicted unlabeled samples as new labeled data.
Sufficiency
For bootstrapping to succeed, the labels of the newly added data should be predicted as accurately as possible; too many wrongly predicted samples would make bootstrapping fail completely. For clarity, we refer to this preference as sufficiency.
Diversity
When the newly added data is too close to the initial labeled data, the trained hyperplane might be far from the optimal one. One possible way to overcome this concentration drawback is to make the added data more different from the initial data, so that it better reflects the natural data distribution. For clarity, we refer to this preference of making newly labeled data more different from the existing labeled data as diversity.
Bootstrapping by Considering Sufficiency and Diversity
To take sufficiency and diversity into consideration, the paper proposed three methods:
Bootstrapping with Random Subspace
In bootstrapping, the classifier used to choose high-confidence samples is usually trained over the whole feature space. Such a classifier tends to choose samples that are very similar to the initial labeled data in terms of the whole feature space.
Generally, how different two classifiers are largely depends on the differences between the features they use. One straightforward way to obtain different classifiers is to randomly select r features from the whole feature set in each bootstrapping iteration. A classifier trained on such subspace training data is called a subspace classifier.
The size r of the feature subset is an important parameter in this algorithm. The smaller r is, the more the subspace classifiers differ from each other. However, r should not be too small, because a classifier trained with too few features cannot predict samples accurately.
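The random-subspace procedure above can be sketched as follows. Here `train` and `predict_confidence` are placeholders for any base learner, and the per-iteration growth size `k`, the iteration count, and the stopping condition are assumptions for illustration, not details from the paper:

```python
import random

def random_subspace_bootstrap(labeled, unlabeled, all_features, r,
                              train, predict_confidence, n_iter=10, k=5):
    """Sketch of bootstrapping with a random feature subspace per iteration.

    train(data, features) -> classifier restricted to `features`;
    predict_confidence(clf, sample) -> (label, confidence).
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(n_iter):
        if not unlabeled:
            break
        # Draw a fresh random subspace of r features for this iteration.
        subspace = random.sample(all_features, r)
        # Train a subspace classifier on the current labeled pool.
        clf = train(labeled, subspace)
        # Score all unlabeled samples and keep the k most confident ones.
        scored = sorted(((predict_confidence(clf, x), x) for x in unlabeled),
                        key=lambda t: t[0][1], reverse=True)
        for (label, _conf), x in scored[:k]:
            labeled.append((x, label))
            unlabeled.remove(x)
    # The final classifier is trained over the whole feature space.
    return train(labeled, all_features)
```

Because each iteration draws a different subspace, consecutive classifiers tend to select different unlabeled samples, which is what pushes the added data toward diversity.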
Bootstrapping with Excluded Subspace
To better satisfy the diversity preference, the paper improved the random subspace generation strategy with a constraint: any two adjacent subspace classifiers (i.e., those from consecutive iterations) may not share any features.
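A minimal sketch of this exclusion constraint: each new subspace is drawn only from features the previous subspace did not use. The generator interface is an assumption for illustration; it requires r to be at most half the feature count so a disjoint draw always exists:

```python
import random

def excluded_subspaces(all_features, r):
    """Yield random r-feature subspaces where any two consecutive
    subspaces share no features (requires 2 * r <= len(all_features))."""
    previous = set()
    while True:
        # Restrict the candidate pool to features absent from the
        # previous subspace, so adjacent subspaces are disjoint.
        pool = [f for f in all_features if f not in previous]
        subspace = random.sample(pool, r)
        previous = set(subspace)
        yield subspace
```

Note that only *adjacent* subspaces are forced to be disjoint; a feature excluded in one iteration becomes available again in the next.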
Diversity Consideration among Different Types of Features
The paper introduced another constraint: any two adjacent subspace classifiers may not share any similar features, where two features are considered similar when they contain the same informative unigram.
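Under the assumption that features are represented as tuples of tokens (e.g. a unigram ("excellent",) and a bigram ("excellent", "movie")), the similarity test might look like the sketch below; how the set of informative unigrams is chosen is not specified here and is taken as given:

```python
def shares_informative_unigram(feature_a, feature_b, informative):
    """Return True if the two features are 'similar' in the paper's
    sense: they contain the same informative unigram.

    feature_a, feature_b: tuples of tokens (unigrams, bigrams, ...).
    informative: a set of unigrams assumed to carry class information.
    """
    # Tokens the two features have in common.
    common = set(feature_a) & set(feature_b)
    # Similar only if some shared token is an informative unigram.
    return bool(common & informative)
```

During subspace generation, a candidate feature would be rejected whenever this test returns True against any feature of the previous subspace.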