A Simple Implementation of Common-Word Filtering in Chinese Word Segmentation

2007-01-17 10:34
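First, thanks to 兽族的荣耀 for his article "简单编写的中文分词程序" (A Simply Written Chinese Word Segmentation Program). His excellent article is what got me started in the search engine field and made this post possible :)
Now, on to the topic.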

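Terms: analyzer (Analyzer), token (Tokens), highlighting (Highlight).
Background:
When a source phrase is typed into the search engine's text box, the analyzer (Analyzer) splits it into several groups of tokens (Tokens). The search engine then looks the tokens up in its lexicon, matches them, records weights, and performs other operations.
When a source phrase contains common words, they often cause trouble for the subsequent steps, as in the following case: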
public class FilterDemo
using (StreamReader sr = File.OpenText(path))
using (StreamReader sr = new StreamReader(path, Encoding.Default))
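The two `using` lines above differ only in how the file is decoded: `File.OpenText` reads the file as UTF-8, while `Encoding.Default` on the .NET Framework is the system's ANSI code page. Below is a minimal sketch of how a common-word file might be loaded with either one, assuming the file holds one word per line. The names `StopWordLoader`, `Load` and `LoadFile` are hypothetical and are not taken from the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: loads a common-word (stop-word) list, one word per line.
public static class StopWordLoader
{
    // Reads words from any TextReader; surrounding whitespace is trimmed
    // and blank lines are skipped. Lookups ignore case.
    public static HashSet<string> Load(TextReader reader)
    {
        var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string word = line.Trim();
            if (word.Length > 0)
            {
                words.Add(word);
            }
        }
        return words;
    }

    // Opens the file with an explicit encoding. The encoding has to match
    // the one the file was saved with, otherwise Chinese entries are garbled.
    public static HashSet<string> LoadFile(string path, Encoding encoding)
    {
        using (var sr = new StreamReader(path, encoding))
        {
            return Load(sr);
        }
    }
}
```

A call such as `StopWordLoader.LoadFile(path, Encoding.Default)` would then correspond to the second `using` line, and `StopWordLoader.LoadFile(path, Encoding.UTF8)` to the first.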

the
for
in
to
on

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
1
2
3
4
5
6
7
8
9
0
about
above
after
again
all
also
am
an
and
any
are
as
at
back
be
been
before
behind
being
below
but
by
can
click
do
does
done
each
else
etc
ever
every
few
for
from
generally
get
go
gone
has
have
hello
here
how
if
in
into
is
just
keep
later
let
like
lot
lots
made
make
makes
many
may
me
more
most
much
must
my
need
no
not
now
of
often
on
only
or
other
others
our
out
over
please
put
so
some
such
than
that
the
their
them
then
there
these
they
this
try
to
up
us
very
want
was
we
well
what
when
where
which
why
will
with
within
you
your
yourself
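With a list like the one above in memory, filtering comes down to dropping every token that appears in it before the tokens are handed to the search step. The following is a rough sketch of that idea, not the original implementation. `StopWordFilter` and `Filter` are hypothetical names, and the tokens are assumed to have been produced by the analyzer already.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: removes common words from a token sequence.
public static class StopWordFilter
{
    // Returns the tokens that are not in stopWords, in their original order.
    // Case sensitivity is decided by the comparer of the stopWords collection.
    public static List<string> Filter(IEnumerable<string> tokens, ICollection<string> stopWords)
    {
        var kept = new List<string>();
        foreach (string token in tokens)
        {
            if (!stopWords.Contains(token))
            {
                kept.Add(token);
            }
        }
        return kept;
    }
}
```

For example, with a case-insensitive set built from a few of the words listed above:

```csharp
var stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "for", "in", "to", "on" };
var tokens = new[] { "search", "for", "The", "engine", "in", "practice" };
List<string> kept = StopWordFilter.Filter(tokens, stopWords);
// kept contains: search, engine, practice
```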