A Simple Implementation of Common-Word Filtering in Chinese Word Segmentation
2007-01-17 10:34
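First, thanks to my friend 兽族的荣耀 for his article 简单编写的中文分词程序 ("A Simply Written Chinese Word Segmentation Program"). Both my first steps in the search engine field and this post owe a great deal to his excellent article :)
Now, to the main topic.
Terms: Analyzer, Tokens, Highlight.
Background:
When a source phrase is typed into a search engine's text box, the Analyzer splits it into several groups of Tokens. The search engine then looks the tokens up in its dictionary, matches them, records their weights, and carries out other operations.
When a source phrase contains common words, those words often cause trouble for the steps that follow, as in the case below: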
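public class FilterDemo
// Open the file at `path` as UTF-8 text:
using (StreamReader sr = File.OpenText(path))
// Or open it with the system default encoding (Encoding.Default):
using (StreamReader sr = new StreamReader(path, Encoding.Default))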
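The fragments above only show how the file is opened. As a minimal sketch of what loading a common-word file might look like, assuming the file holds one entry per line, the following reads it into a set. The names `StopWordLoader` and `Load` are hypothetical and are not taken from the original listing; the encoding is passed in by the caller because it depends on how the file was saved.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original article's code.
public static class StopWordLoader
{
    // Reads one common word per line and returns the words as a
    // case-insensitive set. Blank lines are skipped.
    public static HashSet<string> Load(string path, Encoding encoding)
    {
        var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        using (StreamReader sr = new StreamReader(path, encoding))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                line = line.Trim();
                if (line.Length > 0)
                {
                    // Adding a word that is already present has no effect,
                    // so repeated entries in the file are harmless.
                    words.Add(line);
                }
            }
        }
        return words;
    }
}
```

A set is used here so that the later "is this token a common word?" check does not have to scan the whole list. If the file was saved in a legacy Chinese code page rather than UTF-8, passing the matching `Encoding` is what keeps the Chinese entries from being read as garbage.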
的
吗
么
啊
说
对
在
和
是
被
最
所
那
这
有
将
会
与
於
于
他
她
它
您
为
the
for
in
to
on
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
1
2
3
4
5
6
7
8
9
0
about
above
after
again
all
also
am
an
and
any
are
as
at
back
be
been
before
behind
being
below
but
by
can
click
do
does
done
each
else
etc
ever
every
few
for
from
generally
get
go
gone
has
have
hello
here
how
if
in
into
is
just
keep
later
let
like
lot
lots
made
make
makes
many
may
me
more
most
much
must
my
need
no
not
now
of
often
on
only
or
other
others
our
out
over
please
put
so
some
such
than
that
the
their
them
then
there
these
they
this
try
to
up
us
very
want
was
we
well
what
when
where
which
why
will
with
within
you
your
yourself
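With a list like the one above loaded into memory, the filtering step itself can be sketched as follows. This is an illustration under assumptions, not the author's implementation: `StopWordFilter` and `Filter` are hypothetical names, and the tokens are assumed to have already been produced by an Analyzer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original article's code.
public static class StopWordFilter
{
    // Returns the tokens that do not appear in the common-word collection,
    // keeping them in their original order.
    public static List<string> Filter(IEnumerable<string> tokens,
                                      ICollection<string> commonWords)
    {
        var kept = new List<string>();
        foreach (string token in tokens)
        {
            if (!commonWords.Contains(token))
            {
                kept.Add(token);
            }
        }
        return kept;
    }
}
```

For example, filtering the tokens `中文`, `的`, `分词`, `The`, `search` against a case-insensitive `HashSet<string>` containing `的` and `the` leaves `中文`, `分词`, `search`. Whether English entries should match case-insensitively is a design choice; it is decided by the comparer of the collection that is passed in, not by `Filter`.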