Forward-Backward Gradient Computation (a TensorFlow word2vec example)
2015-12-15 14:50
Consider an example that is not linearly separable:
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144934287-1814537071.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144935052-853440526.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144935334-820397703.png)
Using basis functions turns a linear model that cannot separate the data into a nonlinear model that can (a small sketch follows the figures below).
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144936177-1878127883.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144936740-752865018.png)
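A minimal sketch of the idea (an illustration of mine, not from the slides): in one dimension, put class 1 on the outside and class 0 in the middle. No threshold on x separates them, but the basis feature x² makes the classes linearly separable.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])  # inputs
y = np.array([1, 0, 0, 1])            # class 1 sits on both outer sides

# No single threshold on x separates y.  In the feature space
# phi(x) = [x, x**2], however, the linear rule x**2 > 2.25 does.
phi = np.stack([x, x**2], axis=1)
print(phi[y == 1])  # [[-2. 4.] [ 2. 4.]]  -> second feature is 4
print(phi[y == 0])  # [[-1. 1.] [ 1. 1.]]  -> second feature is 1
```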
The most common approach is to write down an objective function and minimize it with gradient descent.
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144937849-10842448.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144938349-460601668.png)
Computing the gradients used by gradient descent (a worked sketch follows the figures below):
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144939599-947468081.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144940349-653124183.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144941177-413492508.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144941740-1221848055.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144942521-1344091712.png)
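As a concrete sketch of this recipe (a minimal example of mine, assuming a logistic-regression objective): write the loss, derive its gradient by hand, and iterate the update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: two linearly separable classes.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.zeros(2)
lr = 0.1

for step in range(1000):
    p = sigmoid(X @ w)              # forward: predicted probabilities
    grad = X.T @ (p - y) / len(y)   # hand-derived dL/dw for cross-entropy
    w -= lr * grad                  # gradient-descent step
print(w)  # weights that separate the two classes
```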
On linear and nonlinear hidden layers
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144942927-239423610.png)
Nonlinear hidden layers let the network compute more complex functions.
Linear hidden layers do not add expressive power; they are used for dimensionality reduction, cutting the number of parameters that must be trained. This is common in NLP models (embedding vectors).
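A quick sketch (mine) of why an embedding table is just a linear hidden layer: multiplying a one-hot input by the weight matrix selects a single row, so the matrix product and the table lookup are the same operation.

```python
import numpy as np

vocab_size, embed_dim = 5, 3
W = np.random.randn(vocab_size, embed_dim)  # the "linear hidden layer"

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# The matrix product with a one-hot vector is exactly row `word_id` of W.
assert np.allclose(one_hot @ W, W[word_id])
```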
A backprop example
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144943662-1670760257.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144944021-1018293978.png)
Forward pass
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144944459-1851826778.png)
Backward pass
Activation gradients
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144945381-627561961.png)
Weight gradients
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144945771-1693338465.png)
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144946896-404729464.png)
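A minimal numpy sketch (a toy example of mine) of one forward and one backward pass: the backward pass first produces the activation gradients, then pairs each with the incoming activation to get the weight gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input
W1 = rng.standard_normal((3, 4))  # layer-1 weights
W2 = rng.standard_normal((1, 3))  # layer-2 weights
target = np.array([1.0])

# Forward pass.
z1 = W1 @ x
h = np.maximum(z1, 0.0)           # ReLU hidden activations
yhat = W2 @ h
loss = 0.5 * np.sum((yhat - target) ** 2)

# Backward pass: activation gradients first...
d_yhat = yhat - target            # dL/d(yhat)
d_h = W2.T @ d_yhat               # dL/dh
d_z1 = d_h * (z1 > 0)             # ReLU gates the gradient (0 where off)

# ...then weight gradients = outer(outgoing gradient, incoming activation).
d_W2 = np.outer(d_yhat, h)        # same shape as W2
d_W1 = np.outer(d_z1, x)          # same shape as W1
```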
Now consider which other quantities must be available to compute the gradient of a given variable.
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144948256-2014808777.png)
What information do we need to compute the gradient of the weight from unit A to unit B? Referring to the figures above:
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144948881-1657301505.png)
We need A's activation from the forward pass and the gradient backpropagated to B.
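In symbols (standard backprop; notation mine): with $a_A$ the activation of A and $\delta_B = \partial L / \partial z_B$ the gradient backpropagated to B's pre-activation,

```latex
\frac{\partial L}{\partial w_{AB}}
  = \frac{\partial L}{\partial z_B} \cdot \frac{\partial z_B}{\partial w_{AB}}
  = \delta_B \, a_A,
\qquad \text{where } z_B = \sum_{A'} w_{A'B} \, a_{A'}.
```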
A few more things worth knowing:
Many of the computed gradients are exactly 0. This is because rectified linear units are used as the nonlinearity, and a ReLU that is off passes no gradient.
Some gradients come out much larger than others; multiplied together layer after layer, they grow without bound. This is the so-called "exploding gradient" problem (a small sketch follows).
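A small sketch (toy numbers of mine) of the effect: pushing a gradient through a stack of Jacobians whose typical gain is a bit above 1 inflates its norm exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.ones(8)
for layer in range(50):
    # A random "Jacobian" scaled so its typical gain is about 1.2.
    J = 1.2 * rng.standard_normal((8, 8)) / np.sqrt(8)
    grad = J.T @ grad
    if (layer + 1) % 10 == 0:
        print(layer + 1, np.linalg.norm(grad))
# The norm grows roughly like 1.2**k after k layers -- the exploding gradient.
```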
A forward-backward example (word2vec)
Consider the word2vec implementation in TensorFlow. TensorFlow can differentiate automatically, but you can also write this part yourself. word2vec_optimized.py implements the forward-backward steps by hand, using true SGD (one update per training example).
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144949490-409963452.png)
Let's look at the code:
```python
# Training nodes.
inc = global_step.assign_add(1)
with tf.control_dependencies([inc]):
  train = word2vec.neg_train(
      w_in,      # the left-hand W in the figure above; updated in place by neg_train
      w_out,     # the right-hand W in the figure above; updated in place by neg_train
      examples,  # array of center-word ids, length batch_size
      labels,    # array of surrounding (context) word ids
      lr,        # learning rate
      vocab_count=opts.vocab_counts.tolist(),  # per-word frequency counts
      num_negative_samples=opts.num_samples)   # number of negative samples
```
```c++
REGISTER_OP("NegTrain")
    .Input("w_in: Ref(float)")   // Ref: passed by reference, updated in place
    .Input("w_out: Ref(float)")
    .Input("examples: int32")
    .Input("labels: int32")
    .Input("lr: float")
    .Attr("vocab_count: list(int)")
    .Attr("num_negative_samples: int")
    .Doc(R"doc(
Training via negative sampling.
w_in: input word embedding.
w_out: output word embedding.
examples: A vector of word ids.
labels: A vector of word ids.
vocab_count: Count of words in the vocabulary.
num_negative_samples: Number of negative samples per example.
)doc");
```
```c++
// Gradient accumulator for v_in.
Tensor buf(DT_FLOAT, TensorShape({dims}));
auto Tbuf = buf.flat<float>();

// Scalar buffer to hold sigmoid(+/- dot).
Tensor g_buf(DT_FLOAT, TensorShape({}));
auto g = g_buf.scalar<float>();

// The following loop needs 2 random 32-bit values per negative
// sample. We reserve 8 values per sample just in case the
// underlying implementation changes.
auto rnd = base_.ReserveSamples32(batch_size * num_samples_ * 8);
random::SimplePhilox srnd(&rnd);

for (int64 i = 0; i < batch_size; ++i) {
  const int32 example = Texamples(i);
  DCHECK(0 <= example && example < vocab_size) << example;
  const int32 label = Tlabels(i);
  DCHECK(0 <= label && label < vocab_size) << label;
  auto v_in = Tw_in.chip<0>(example);
```
The per-example loop continues below. The positive pair gets label 1 and each negative sample label -1, and the gradient contributions for v_in are accumulated. This follows maximum likelihood: we maximize the log probability, so the per-term gradients are summed; compare Ng's lecture notes:
![](http://images2015.cnblogs.com/blog/61573/201512/61573-20151215144950037-1607975975.png)
This is the NCE-style trick: the problem is turned into binary classification (real pair vs. sampled noise pair).
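For reference, the per-pair objective the kernel maximizes is the standard skip-gram negative-sampling objective (notation mine, matching the comments in the code below):

```latex
\ell(v_{\text{in}}, v_{\text{out}})
  = \log \sigma\!\left(v_{\text{in}}^{\top} v_{\text{out}}\right)
  + \sum_{j=1}^{k} \log \sigma\!\left(-\,v_{\text{in}}^{\top} v_{s_j}\right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}},
```

where the $s_j$ are the negative samples drawn from the count-based noise distribution.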
```c++
  // Positive: example predicts label.
  //   forward: x = v_in' * v_out
  //            l = log(sigmoid(x))
  //   backward: dl/dx = g = sigmoid(-x)
  //             dl/d(v_in)  = (dl/dx) * (dx/d(v_in))  = g * v_out'
  //             dl/d(v_out) = (dl/dx) * (dx/d(v_out)) = v_in' * g
  {
    auto v_out = Tw_out.chip<0>(label);
    auto dot = (v_in * v_out).sum();
    g = (dot.exp() + 1.f).inverse();      // = sigmoid(-dot)
    Tbuf = v_out * (g() * lr);
    v_out += v_in * (g() * lr);
  }

  // Negative samples:
  //   forward: x = v_in' * v_sample
  //            l = log(sigmoid(-x))
  //   backward: dl/dx = g = -sigmoid(x)
  //             dl/d(v_in)     = g * v_sample'
  //             dl/d(v_sample) = v_in' * g
  for (int j = 0; j < num_samples_; ++j) {
    const int sample = sampler_->Sample(&srnd);
    if (sample == label) continue;  // Skip.
    auto v_sample = Tw_out.chip<0>(sample);
    auto dot = (v_in * v_sample).sum();
    g = -((-dot).exp() + 1.f).inverse();  // = -sigmoid(dot)
    Tbuf += v_sample * (g() * lr);
    v_sample += v_in * (g() * lr);
  }

  // Applies the gradient on v_in.
  v_in += Tbuf;
}
```
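To make the flow concrete, here is a hedged numpy transcription of one true-SGD step of the kernel above (a sketch of mine, not the TensorFlow source; the function name is hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_train_step(w_in, w_out, example, label, neg_samples, lr):
    v_in = w_in[example]
    buf = np.zeros_like(v_in)             # gradient accumulator for v_in

    # Positive pair: push sigma(v_in . v_out) toward 1.
    v_out = w_out[label]
    g = sigmoid(-np.dot(v_in, v_out))     # = 1 - sigma(dot)
    buf += g * lr * v_out
    w_out[label] += g * lr * v_in

    # Negative samples: push sigma(v_in . v_sample) toward 0.
    for sample in neg_samples:
        if sample == label:
            continue                      # skip, as in the kernel
        v_s = w_out[sample]
        g = -sigmoid(np.dot(v_in, v_s))
        buf += g * lr * v_s
        w_out[sample] += g * lr * v_in

    w_in[example] += buf                  # apply accumulated gradient to v_in
```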