
Using regular expressions to extract content from a docx file converted to txt

2017-06-26 17:00

First, convert the docx file with pandoc. The command below writes pandoc's LaTeX output to a .txt file.

pandoc -f docx -t latex -o t33.txt testAp.docx

The converted txt file looks like this:

\section{Question1}\label{question1}

\subsection{问题}\label{ux95eeux9898}

\begin{quote}
The random variable \(X\) has the probability distribution shown in the
table.
\end{quote}

\begin{longtable}[]{@{}llll@{}}
\toprule
\(x\) & 2 & 4 & 6\tabularnewline
\midrule
\endhead
\(P\left( X = x \right)\) & 0.5 & 0.4 & 0.1\tabularnewline
\bottomrule
\end{longtable}

Two \emph{independent} values of \(X\) are chosen at random. The random
variable \(Y\) takes the value 0 if the two values of \(X\) are the same. Otherwise the
value of \(Y\) is the larger value of \(X\) minus the smaller value of
\(X\).

(i) Draw up the probability distribution table for \(Y\).

(ii) Find the expected value of \(Y\).

\subsection{关键字}\label{ux5173ux952eux5b57}

\begin{quote}
独立 可能性 期待
\end{quote}

\subsection{翻译}\label{ux7ffbux8bd1}

\begin{quote}
随机变量X 可能的分布情况如下所示。
\end{quote}

\subsection{选项}\label{ux9009ux9879}

\begin{enumerate}
\def\labelenumi{\Alph{enumi}.}
\item
21+ \(f(x)\)
\item
22
\item
23
\end{enumerate}

\subsection{答案}\label{ux7b54ux6848}

A

\subsection{提示}\label{ux63d0ux793a}

参考统计学的平均数公式。

期望值公式。

\subsection{解析}\label{ux89e3ux6790}

(i)

\begin{longtable}[]{@{}llll@{}}
\toprule
\(r\) & 0 & 2 & 4\tabularnewline
\midrule
\endhead
\(P(Y = r)\) & 0.42 & 0.48 & 0.1\tabularnewline
\bottomrule
\end{longtable}

\(P\left( Y = 0 \right) = 0.5 \times 0.5 + 0.4 \times 0.4 + 0.1 \times 0.1 = 0.42\)
\textbackslash{}\textbackslash{}

\(P\left( Y = 2 \right) = 0.5 \times 0.4 \times 2 + 0.4 \times 0.1 \times 2 = 0.48\)

\(P\left( Y = 4 \right) = 0.5 \times 0.1 \times 2 = 0.1\)

(ii) \(E\left( Y \right) = 0 \times 0.42 + 2 \times 0.48 + 4 \times 0.1\)
\textbackslash{}\textbackslash{}

\(= 1.36\)

\subsection{结束}\label{ux7ed3ux675f}

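I adopted a few formatting conventions in the document so that each part of a question can be identified by its heading. The regular expressions I ended up using for extraction are: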

var regx = /\\subsection{问题}\\label.*\n?((\s|\S)*?)\\subsection{关键字}/;
var regxKey = /\\subsection{关键字}\\label.*\n?((\s|\S)*?)\\subsection{翻译}/;
var regxTrans = /\\subsection{翻译}\\label.*\n?((\s|\S)*?)\\subsection{选项}/;
var regxChoice =/\\subsection{选项}\\label.*\n?((\s|\S)*?)\\subsection{答案}/;
var regxAnswer = /\\subsection{答案}\\label.*\n?((\s|\S)*?)\\subsection{提示}/;
var regxHint = /\\subsection{提示}\\label.*\n?((\s|\S)*?)\\subsection{解析}/;
var regxStep = /\\subsection{解析}\\label.*\n?((\s|\S)*?)\\subsection{结束}/;
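The seven patterns above differ only in the pair of headings that bound each part, so they could also be generated from a list of heading names. The following is a minimal sketch, assuming the headings always appear in the order shown in the sample. The helper name `extractSections` and the `sample` string are made up for illustration, and `[\s\S]` is used as an equivalent of `(\s|\S)`.

```javascript
// Heading names in the order they appear in the pandoc output.
const headings = ["问题", "关键字", "翻译", "选项", "答案", "提示", "解析", "结束"];

// Hypothetical helper: maps each heading to the text between it and the
// next heading, or to null when that pair of headings is not found.
function extractSections(tex) {
  const result = {};
  for (let i = 0; i < headings.length - 1; i++) {
    const re = new RegExp(
      "\\\\subsection\\{" + headings[i] + "\\}\\\\label.*\\r?\\n?([\\s\\S]*?)" +
      "\\\\subsection\\{" + headings[i + 1] + "\\}"
    );
    const m = re.exec(tex);
    result[headings[i]] = m ? m[1].trim() : null;
  }
  return result;
}

// Made-up miniature input with Windows line endings.
const sample =
  "\\subsection{答案}\\label{ux7b54ux6848}\r\n\r\nA\r\n\r\n" +
  "\\subsection{提示}\\label{ux63d0ux793a}\r\n\r\nHint text.\r\n\r\n" +
  "\\subsection{解析}\\label{ux89e3ux6790}\r\n";

const sections = extractSections(sample);
console.log(sections["答案"]); // the text between the 答案 and 提示 headings
```

Trimming the captured text removes the blank lines that pandoc leaves around each heading; whether that is desirable depends on how the extracted parts are used afterwards.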

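I used to rely on (.|\n)*? to match any character, but for some reason it failed here, so I ended up using \s|\S instead. This puzzled me, and I do not know whether XRegExp's match-any-character feature behaves the same way.
One last point: the txt file was produced on Windows, so its line breaks are \r\n. I did not take that into account at first, which cost me a lot of time.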
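A likely explanation for the failure of (.|\n)*? is the \r\n line ending itself: in JavaScript, `.` matches neither \n nor \r, so the alternation (.|\n) has no branch that can consume a \r. The sketch below checks this on a made-up two-line string. It is an illustration of the behavior, not a record of the original debugging.

```javascript
// Made-up input: two lines separated by a Windows line break.
const text = "start\r\nend";

// `.` excludes line terminators (\n, \r, \u2028, \u2029), and \n is not \r,
// so this pattern cannot get past the \r.
const dotOrNewline = /start((.|\n)*?)end/;

// \s matches both \r and \n, so (\s|\S) covers every character.
const spaceOrNonSpace = /start((\s|\S)*?)end/;

// Adding \r to the alternation also works.
const dotOrCrLf = /start((.|\r|\n)*?)end/;

console.log(dotOrNewline.test(text));    // false
console.log(spaceOrNonSpace.test(text)); // true
console.log(dotOrCrLf.test(text));       // true
```

In engines that support the ES2018 `s` (dotAll) flag, `.` matches line terminators as well, so /start(.*?)end/s would be another option. XRegExp also documents an `s` flag with this meaning.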
In the txt file a blank line marks the end of a paragraph, but I need each paragraph wrapped in <p></p>, so the text has to be reformatted.

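// The lookahead leaves the next character unconsumed, so consecutive one-character lines are joined too.
str = str.replace(/(\S)\r\n(?=\S)/g, "$1 ");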

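This line replaces a single line break with a space and leaves two consecutive line breaks (a blank line) unchanged.
I originally planned to split the string into pieces and join them again, but replace turned out to be more reliable.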
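The reformatting described above could be completed roughly as follows: normalize the line endings, then wrap each blank-line-separated block in <p></p>, joining the single line breaks inside it. This is only a sketch, assuming paragraphs are separated by one or more blank lines. The function name `toParagraphs` is hypothetical, and unlike the replace-only approach above it uses split for the paragraph boundaries.

```javascript
// Hypothetical helper: turns blank-line-separated text into <p></p> blocks.
function toParagraphs(str) {
  return str
    .replace(/\r\n?/g, "\n") // normalize \r\n (and lone \r) to \n
    .trim()
    .split(/\n\s*\n/)        // a blank line ends a paragraph
    .map(function (block) {
      // Inside a paragraph, a line break is treated as a single space.
      return "<p>" + block.trim().replace(/\s*\n\s*/g, " ") + "</p>";
    })
    .join("\n");
}

// Made-up input with Windows line endings.
const input = "First line\r\nsecond line.\r\n\r\nNext paragraph.\r\n";
const html = toParagraphs(input);
console.log(html);
```

Normalizing the line endings first means the later patterns only have to deal with \n, which avoids the \r\n pitfall described earlier.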

Replacing tables
