
Reading Word documents with PHP on Linux

2017-07-01 10:48
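A few days ago I helped a friend with a technical problem: on Linux, read the contents of a Word document, match the text with regular expressions, and assemble the results into SQL for insertion into a database.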

After consulting English-language material and searching Google, I arrived at the following steps:

#wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz

#tar zxvf antiword-0.37.tar.gz

#cd antiword-0.37

#make

#make install

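Run as root, make install puts the antiword binaries in /root/bin and the mapping files in /root/.antiword. Copy them to system-wide locations so that other users can run antiword:
cp /root/bin/*antiword /usr/local/bin/
mkdir -p /usr/share/antiword
cp -R /root/.antiword/* /usr/share/antiword/
chmod 755 /usr/local/bin/*antiword
chmod 755 /usr/share/antiword/*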

Once installation is complete, if the documents are to be viewed through the web, run make global_install as root.
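
Before wiring antiword into a web page, it can help to confirm that the install is visible system-wide. The following is a minimal sketch in Python, assuming the binary is expected on the PATH and the mapping files in /usr/share/antiword as in the commands above; the helper name find_antiword is hypothetical and not part of antiword.

```python
import os
import shutil


def find_antiword(resource_dir="/usr/share/antiword"):
    """Return (binary_path, has_resources) for a system-wide antiword install.

    binary_path is None when antiword is not on the PATH.
    resource_dir is an assumption taken from the copy commands in this article.
    """
    binary_path = shutil.which("antiword")
    has_resources = os.path.isdir(resource_dir)
    return binary_path, has_resources


if __name__ == "__main__":
    binary_path, has_resources = find_antiword()
    print("antiword binary:", binary_path or "not found on PATH")
    print("mapping files present:", has_resources)
```

Run it as the same user as the web server, since that is the account that has to find the binary.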

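<?php
header("Content-type: text/html; charset=utf-8");

$filename = 'test.doc';
// Without -m, antiword uses its default character mapping:
// $content = shell_exec('antiword ' . escapeshellarg($filename));
// -mUTF-8 selects UTF-8 output; escapeshellarg() quotes the path for the shell
$content = shell_exec('antiword -mUTF-8 ' . escapeshellarg($filename));

echo '<pre>';
// Escape the text so that characters such as < and & are displayed literally
echo htmlspecialchars((string) $content, ENT_QUOTES, 'UTF-8');
echo '</pre>';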
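
The same antiword call can also be made from Python. This is only a sketch, assuming antiword is on the PATH and reusing the -mUTF-8 option from the PHP example; the function names build_antiword_command and doc_to_text are hypothetical. Passing the arguments as a list avoids the shell, so the file name needs no quoting.

```python
import subprocess


def build_antiword_command(doc_path):
    """Build the argument list for antiword, mirroring the PHP example's flags."""
    return ["antiword", "-mUTF-8", doc_path]


def doc_to_text(doc_path):
    """Run antiword on doc_path and return its output decoded as UTF-8.

    Raises subprocess.CalledProcessError if antiword exits with an error,
    and FileNotFoundError if antiword is not installed.
    """
    result = subprocess.run(
        build_antiword_command(doc_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

For example, doc_to_text("test.doc") would return the document text as a string, ready for regular-expression matching.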


# coding=utf-8
# Usage: python <script_name> <docx_file_path>
# Requires the python-docx package: pip install python-docx
import os
import sys

from docx import Document

# Name of the current script, used in the usage message
script_name = os.path.basename(sys.argv[0])
usage = "\n Usage: python " + script_name + " <docx_file_path>"

if len(sys.argv) != 2:
    print("Warning:\n no docx file was given" + usage)
    sys.exit(1)
docx_path = sys.argv[1]
if not os.path.exists(docx_path):
    print("Warning:\n docx file does not exist" + usage)
    sys.exit(1)

# Open the document
document = Document(docx_path)
# Read the text of every paragraph
paragraphs = [paragraph.text for paragraph in document.paragraphs]
# Print the result for inspection; the text can also be processed further here
for text in paragraphs:
    print(text)
# Read the table contents and print them, one tab-separated row per line
for table in document.tables:
    for row in table.rows:
        print("\t".join(cell.text for cell in row.cells))
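
The remaining steps mentioned at the start, matching the extracted text with regular expressions and turning it into SQL, depend entirely on the document layout. The sketch below assumes a hypothetical line format of "Name: ... Phone: ..." and a hypothetical contacts table, and uses Python's built-in sqlite3 module as a stand-in for the real database. It also swaps in parameterized INSERT statements rather than concatenating SQL strings, so that quotes in the document text cannot break the statement.

```python
import re
import sqlite3

# Hypothetical line format: "Name: Alice  Phone: 123456"
RECORD_PATTERN = re.compile(r"Name:\s*(?P<name>.+?)\s+Phone:\s*(?P<phone>\d+)")


def extract_records(text):
    """Return a list of (name, phone) tuples found in the extracted text."""
    return [(m.group("name"), m.group("phone")) for m in RECORD_PATTERN.finditer(text)]


def store_records(connection, records):
    """Insert the records into a (hypothetical) contacts table."""
    connection.execute("CREATE TABLE IF NOT EXISTS contacts (name TEXT, phone TEXT)")
    # Placeholders let the driver handle quoting of the values
    connection.executemany("INSERT INTO contacts (name, phone) VALUES (?, ?)", records)
    connection.commit()


if __name__ == "__main__":
    sample = "Name: Alice  Phone: 123456\nName: Bob Lee  Phone: 987654\n"
    conn = sqlite3.connect(":memory:")
    store_records(conn, extract_records(sample))
    print(conn.execute("SELECT name, phone FROM contacts").fetchall())
```

With another database the placeholder style and connection call will differ, but the split between extraction and insertion stays the same.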