
NLP01: A Small Chinese Word Cloud Example with Python's wordcloud

2017-10-25 14:30


The image above was generated from the poem below.

《When You Are Old》
William Butler Yeats
When you are old and grey and full of sleep,
And nodding by the fire, take down this book,
And slowly read, and dream of the soft look
Your eyes had once, and of their shadows deep;
How many loved your moments of glad grace,
And loved your beauty with love false or true,
But one man loved the pilgrim soul in you,
And loved the sorrows of your changing face;
And bending down beside the glowing bars,
Murmur, a little sadly, how love fled
And paced upon the mountains overhead
And hid his face amid a crowd of stars.


Summary: This post only covers installing wordcloud and a quick demo; it may help beginners get started.

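1. Installation

There are many ways to build a word cloud, but in my opinion Python's wordcloud package is the most powerful, since it lets you shape the cloud with a custom image.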

Official site: https://amueller.github.io/word_cloud/

GitHub: https://github.com/amueller/word_cloud

Install: pip install wordcloud

Or download the package from http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud and install it manually.
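After installing, a quick check can confirm that the packages used later in this post are importable. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for illustration, and jieba and matplotlib are listed only because the code in section 5 imports them.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Packages imported by the example code later in this post
    for name in ("wordcloud", "jieba", "matplotlib"):
        if is_installed(name):
            print(name, "ok")
        else:
            print(name, "missing (try: pip install %s)" % name)
```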

2. Reviewing the API

The WordCloud class is the central class of the API.

class wordcloud.WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode='RGB', relative_scaling=0.5, regexp=None, collocations=True, colormap=None, normalize_plurals=True)
font_path : string
Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don't have this font, you need to adjust this path.
[On Windows 7 this must be changed to a font that contains Chinese glyphs; otherwise Chinese text is rendered as garbled characters.]
width : int (default=400)
Width of the canvas.
[Canvas width]
height : int (default=200)
Height of the canvas.
[Canvas height]
prefer_horizontal : float (default=0.90)
The ratio of times to try horizontal fitting as opposed to vertical. If prefer_horizontal < 1, the algorithm will try rotating the word if it doesn’t fit. (There is currently no built-in way to get only vertical words.)

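mask : nd-array or None (default=None)
If not None, gives a binary mask on where to draw words. If mask is not None, width and height will be ignored and the shape of mask will be used instead. All white (#FF or #FFFFFF) entries will be considered "masked out" while other entries will be free to draw on.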
scale : float (default=1)
Scaling between computation and drawing. For large word-cloud images, using scale instead of larger canvas size is significantly faster, but might lead to a coarser fit for the words.
min_font_size : int (default=4)
Smallest font size to use. Will stop when there is no more room in this size.
[Minimum font size]
font_step : int (default=1)
Step size for the font. font_step > 1 might speed up computation but give a worse fit.
max_words : number (default=200)
The maximum number of words.
[Upper limit on the number of words displayed]
stopwords : set of strings or None
The words that will be eliminated. If None, the built-in STOPWORDS list will be used.
[Stop words]
background_color : color value (default="black")
Background color for the word cloud image.
[Background color]
max_font_size : int or None (default=None)
Maximum font size for the largest word. If None, height of the image is used.
[Maximum size of a word]
mode : string (default="RGB")
Transparent background will be generated when mode is "RGBA" and background_color is None.
relative_scaling : float (default=.5)
Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With relative_scaling=1, a word that is twice as frequent will have twice the size. If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good.
color_func : callable, default=None
Callable with parameters word, font_size, position, orientation, font_path, random_state that returns a PIL color for each word. Overwrites "colormap". See colormap for specifying a matplotlib colormap instead.
regexp : string or None (optional)
Regular expression to split the input text into tokens in process_text. If None is specified, r"\w[\w']+" is used.
collocations : bool, default=True
Whether to include collocations (bigrams) of two words.
colormap : string or matplotlib colormap, default="viridis"
Matplotlib colormap to randomly draw colors from for each word. Ignored if "color_func" is specified.
normalize_plurals : bool, default=True
Whether to remove trailing ‘s’ from words. If True and a word appears with and without a trailing ‘s’, the one with trailing ‘s’ is removed and its counts are added to the version without trailing ‘s’ – unless the word ends with ‘ss’.
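
As an illustration of the color_func parameter described above, the sketch below defines a callable that colors every word in a random shade of grey. It is only an example under stated assumptions: the function name grey_color_func is made up here, the color is returned as an "hsl(...)" string (a form PIL's color parsing accepts), and random_state, when supplied, is assumed to behave like a random.Random instance.

```python
import random


def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Return a random shade of grey as an 'hsl(...)' color string."""
    # Prefer the generator passed in by the caller, so that results are
    # reproducible when a fixed random_state is used
    rng = random_state if random_state is not None else random.Random()
    lightness = rng.randint(60, 100)
    # Hue 0 and saturation 0% give grey; only the lightness varies
    return "hsl(0, 0%%, %d%%)" % lightness
```

It would be passed as WordCloud(color_func=grey_color_func, ...); extra keyword arguments such as font_path are absorbed by **kwargs.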


3. Image

The image file is named mask_png.png.
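
According to the wordcloud documentation, pure white entries of the mask are masked out and words are drawn on the remaining pixels. A mask therefore does not have to come from an image file. The sketch below builds a circular mask in code as a stand-in for mask_png.png; it assumes NumPy is installed, and the function name make_circle_mask and the 0.45 radius factor are arbitrary choices for illustration.

```python
import numpy as np


def make_circle_mask(size=400):
    """Build a square uint8 mask: 0 inside a centered circle, 255 outside."""
    ys, xs = np.ogrid[:size, :size]
    center = (size - 1) / 2.0
    radius = size * 0.45
    outside = (xs - center) ** 2 + (ys - center) ** 2 > radius ** 2
    # White (255) pixels are masked out; words are placed on the 0 pixels
    return (255 * outside).astype(np.uint8)


mask = make_circle_mask(400)
```

The resulting array could be passed as WordCloud(mask=mask, ...) in place of the loaded image.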



4. Chinese test document

Title: 脚抽筋怎么办 (What to Do About Foot Cramps)

URL: http://health.china.com/html/jiankang/jijiuzhinan/richangjijiu/201603/26-328450.html
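
The code in section 5 reads the article from test.txt with encoding='UTF-8', so the text has to be saved in that encoding (on Chinese-language Windows the default is often not UTF-8). The following is a small sketch of writing and re-reading such a file with an explicit encoding; the helper names and the temporary directory are illustrative, and the sample string is just the article title used as placeholder content.

```python
import os
import tempfile


def save_utf8(text, file_path):
    """Write text to file_path as UTF-8, regardless of the OS default encoding."""
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(text)


def load_utf8(file_path):
    """Read the whole file back as a str, decoding it as UTF-8."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


# Placeholder content; in practice this would be the full article body
sample_text = "脚抽筋怎么办"
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
save_utf8(sample_text, demo_path)
round_trip = load_utf8(demo_path)
```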

5. Code

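# -*- coding: utf-8 -*-
import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud


def doWordcloud():
    # Read the article text; the file must be UTF-8 encoded
    with open('test.txt', 'r', encoding='UTF-8') as f:
        comment_text = f.read()
    # Segment the Chinese text with jieba and join the tokens with spaces
    # so that WordCloud can tokenize it
    cut_text = " ".join(jieba.cut(comment_text))
    # Load the mask image as an array
    # (scipy.misc.imread, used originally, was removed in SciPy 1.2)
    color_mask = np.array(Image.open("mask_png.png"))
    cloud = WordCloud(
        # Set a font with Chinese glyphs; otherwise the text is garbled.
        # On Windows 7, see C:\Windows\Fonts for the available fonts
        font_path="simsun.ttc",
        mask=color_mask,
        max_words=200,
        max_font_size=80,
        width=1000,   # width and height are ignored when a mask is given
        height=1000
    )
    word_cloud = cloud.generate(cut_text)  # generate the word cloud
    # word_cloud.to_file("pic.jpg")  # save the image
    plt.imshow(word_cloud)
    plt.axis('off')
    plt.show()


if __name__ == '__main__':
    doWordcloud()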


Note: test.txt contains the text of the article 脚抽筋怎么办.

mask_png.png is the picture of the little girl shown above.
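
Before rendering, it can help to check which tokens the cloud will be built from. The sketch below is a rough approximation using only the standard library: it applies the default token pattern quoted in the API section to text that has already been segmented, then counts frequencies. It is not the library's actual code path; the function name count_tokens and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Default token pattern as quoted in the API section
DEFAULT_PATTERN = r"\w[\w']+"


def count_tokens(segmented_text, stopwords=frozenset()):
    """Count tokens in space-separated text, skipping any given stopwords."""
    tokens = re.findall(DEFAULT_PATTERN, segmented_text)
    return Counter(t for t in tokens if t not in stopwords)


# Hypothetical result of " ".join(jieba.cut(...)) for a short sentence
segmented = "抽筋 的 时候 要 拉伸 抽筋 的 肌肉 抽筋"
counts = count_tokens(segmented)
top_word, top_count = counts.most_common(1)[0]
```

Single-character tokens such as 的 never match this pattern, so they are absent from the counts. wordcloud also offers generate_from_frequencies, which takes a word-to-frequency mapping of this kind instead of raw text.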

6. Result



Author: happyprince; http://blog.csdn.net/ld326/article/details/78341147
Tags: word cloud