Java应用中表达式解析器(Java Cup/JFlex)生成器的介绍及示例
2010-07-01 15:31
585 查看
在基于Java的软件系统的构建过程中,开发人员经常会遇到词法解析、语法解析等问题,比如:在报表系统的中一般需要支持单元格计算公式(类似于Excel的公式),在某些系统的数据转换过程要实现自定义的转换规则脚本。面对这一类问题,我们最终需要的是一个针对问题域的词法及语法解析器,通常的实现方式不外乎有以下几种:
优点:最直接,自然而然的选择;如果开发人员理论基础扎实的话不需要额外的学习负担。
缺点:要求有一定的理论基础;不利于扩充、或设计良好利于扩充但随着未来需求的改变及新语法的不断增加使扩充的开发成本不可控;测试成本增加;因未经过应用检验而存在风险;
优点:开发工作量少;部分引擎已经得到广泛应用的验证;具有复杂的脚本支持能力,可以提供类似于程序设计语言的处理能力。
缺点:脚本的语法由使用的引擎决定,往往比较复杂,对于处理简单问题的场合显得过于累赘;系统中需要引入第三方的类库/控件;要求最终用户学习引擎所规定的脚本语法;某些引擎在访问应用系统的运行时对象时能力有限。
关于JFlex与Java_Cup的文档你需要自己到下面的链接中仔细看看。其中关键的部分是需要定义你的语法及词法分析文件,类似于BNF表达式。
主要步骤如下(为了方便,这里仅仅以四则运算为例,这个例子也是编译器构造工具的通用例子,满世界都采用它。):
当然首先你必须下在JFlex与Java_cup,并解压到特定目录,假定二者的目录分别是${JFlex_home}、${Java_Cup_home}。
3.1定义JFlex词法分析文件calc.flex
3.2生成词法解析程序
执行${JFlex_home}binjflex.bat,并输入calc.flex,成功后输出lex.java文件即四则运算的词法解析文件。职责主要是从本例中的四则运算字符串输入中进行词法分析,输出每一个Token,比如数字、运算符等。
3.3定义Java_cup语法分析文件
3.4 生成语法解析器calc.cup
当前目录切换到${java_cup_home}下,执行
java java_cup.Main options < calc.cup
其中,Options可以自己读文档,视情况而设。
执行的结果是生成了语法解析文件parser.java,它负责从lex.java中接受token流,构造语法规则,比如生成语法解析树。
至此,你已经拥有了一个解析器,只要在你的代码中引入Parser.java,调用其相应parse()方法即可完成表达式解析过程。测试调用过程如下:
不过在处理复杂表达式(涉及到上下文、系统其他对象引用等)时,你需要仔细定义语法规则文件中的action。
相关资源:
1、JFlex http://jflex.de/
2、Java CUP http://www.cs.princeton.edu/~appel/modern/java/CUP/
3、BeanShell http://www.beanshell.org
4、Rhino http://www.mozilla.org/rhino
5、Innerfuse Pascal Script http://www.carlo-kok.com/ifps3.php
原文:http://www.bi-professional.com/231.html#comments
1. 自己实现全部过程
当所涉及语法非常简单(一般只有一两句原语)时,可以选择该方式。优点:最直接,自然而然的选择;如果开发人员理论基础扎实的话不需要额外的学习负担。
缺点:要求有一定的理论基础;不利于扩充、或设计良好利于扩充但随着未来需求的改变及新语法的不断增加使扩充的开发成本不可控;测试成本增加;因未经过应用检验而存在风险;
2. 使用第三方的脚本引擎
当所涉及的语法及其复杂(需要支持比较复杂的脚本,甚至需要实现一种新的程序设计语言时,类似于Brio Query中提供的自定义脚本功能)时,可以选择该方式。目前,在互联网上有很多第三方脚本引擎(如:针对应用有:Netscape Script Engine、JavaRhino、BeanShell、IBM BSF、Tcl-Java等,针对Delphi应用有:Innerfuse Pascal Script 3,对微软体系有Microsoft Script Engine。这些引擎的具体情况不在这里讨论)可以选择,只需要少量的编程工作这些脚本引擎大多都可以内嵌到宿主应用程序中,为宿主应用程序提供脚本支持功能。优点:开发工作量少;部分引擎已经得到广泛应用的验证;具有复杂的脚本支持能力,可以提供类似于程序设计语言的处理能力。
缺点:脚本的语法由使用的引擎决定,往往比较复杂,对于处理简单问题的场合显得过于累赘;系统中需要引入第三方的类库/控件;要求最终用户学习引擎所规定的脚本语法;某些引擎在访问应用系统的运行时对象时能力有限。
3. 使用语法解析器生成器
使用过Unix的同志一般会知道在Unix中有Yacc/Lex,C语言的编译器生成器。在Java中,也有相应的工具。此方法使用JFlex/Java_Cup生成语法解析源码。本人使用这种方法成功的实现了类似Excel那样的公式,支持多种运算及函数,可以维护上下文,可以在表达式中引用系统的对象体系,可以执行动作,并且可以任意扩展而不需要修改解析源码。关于JFlex与Java_Cup的文档你需要自己到下面的链接中仔细看看。其中关键的部分是需要定义你的语法及词法分析文件,类似于BNF表达式。
主要步骤如下(为了方便,这里仅仅以四则运算为例,这个例子也是编译器构造工具的通用例子,满世界都采用它。):
当然首先你必须下在JFlex与Java_cup,并解压到特定目录,假定二者的目录分别是${JFlex_home}、${Java_Cup_home}。
3.1定义JFlex词法分析文件calc.flex
/* This example comes from a short article series in the Linux Gazette by Richard A. Sevenich and Christopher Lopes, titled "Compiler Construction Tools". The article series starts at http://www.linuxgazette.com/issue39/sevenich.html Small changes and updates to newest JFlex+Cup versions by Gerwin Klein */ /* Commented By: Christopher Lopes File Name: lcalc.flex To Create: > jflex lcalc.flex and then after the parser is created > javac Lexer.java */ /* ————————–Usercode Section———————— */ import java_cup.runtime.*; %% /* —————–Options and Declarations Section—————– */ /* The name of the class JFlex will create will be Lexer. Will write the code to the file Lexer.java. */ %class Lexer /* The current line number can be accessed with the variable yyline and the current column number with the variable yycolumn. */ %line %column /* Will switch to a CUP compatibility mode to interface with a CUP generated parser. */ %cup /* Declarations Code between %{ and %}, both of which must be at the beginning of a line, will be copied letter to letter into the lexer class source. Here you declare member variables and functions that are used inside scanner actions. */ %{ /* To create a new java_cup.runtime.Symbol with information about the current token, the token will have no value in this case. */ private Symbol symbol(int type) { return new Symbol(type, yyline, yycolumn); } /* Also creates a new java_cup.runtime.Symbol with information about the current token, but this object has a value. */ private Symbol symbol(int type, Object value) { return new Symbol(type, yyline, yycolumn, value); } %} /* Macro Declarations These declarations are regular expressions that will be used latter in the Lexical Rules Section. */ /* A line terminator is a r (carriage return), n (line feed), or rn. */ LineTerminator = r|n|rn /* White space is a line terminator, space, tab, or line feed. */ WhiteSpace = {LineTerminator} | [ tf] /* A literal integer is is a number beginning with a number between one and nine followed by zero or more numbers between zero and nine or just a zero. */ dec_int_lit = 0 | [1-9][0-9]* /* A identifier integer is a word beginning a letter between A and Z, a and z, or an underscore followed by zero or more letters between A and Z, a and z, zero and nine, or an underscore. */ dec_int_id = [A-Za-z_][A-Za-z_0-9]* %% /* ————————Lexical Rules Section———————- */ /* This section contains regular expressions and actions, i.e. Java code, that will be executed when the scanner matches the associated regular expression. */ /* YYINITIAL is the state at which the lexer begins scanning. So these regular expressions will only be matched if the scanner is in the start state YYINITIAL. */ { /* Return the token SEMI declared in the class sym that was found. */ ";" { return symbol(sym.SEMI); } /* Print the token found that was declared in the class sym and then return it. */ "+" { System.out.print(" + "); return symbol(sym.PLUS); } "-" { System.out.print(" – "); return symbol(sym.MINUS); } "*" { System.out.print(" * "); return symbol(sym.TIMES); } "/" { System.out.print(" / "); return symbol(sym.DIVIDE); } "(" { System.out.print(" ( "); return symbol(sym.LPAREN); } ")" { System.out.print(" ) "); return symbol(sym.RPAREN); } /* If an integer is found print it out, return the token NUMBER that represents an integer and the value of the integer that is held in the string yytext which will get turned into an integer before returning */ {dec_int_lit} { System.out.print(yytext()); return symbol(sym.NUMBER, new Integer(yytext())); } /* If an identifier is found print it out, return the token ID that represents an identifier and the default value one that is given to all identifiers. */ {dec_int_id} { System.out.print(yytext()); return symbol(sym.ID, new Integer(1));} /* Don’t do anything if whitespace is found */ {WhiteSpace} { /* just skip what was found, do nothing */ } } /* No token was found for the input so through an error. Print out an Illegal character message with the illegal character that was found. */ [^] { throw new Error("Illegal character <"+yytext()+">"); }
3.2生成词法解析程序
执行${JFlex_home}binjflex.bat,并输入calc.flex,成功后输出lex.java文件即四则运算的词法解析文件。职责主要是从本例中的四则运算字符串输入中进行词法分析,输出每一个Token,比如数字、运算符等。
3.3定义Java_cup语法分析文件
/* This example comes from a short article series in the Linux Gazette by Richard A. Sevenich and Christopher Lopes, titled "Compiler Construction Tools". The article series starts at http://www.linuxgazette.com/issue39/sevenich.html Small changes and updates to newest JFlex+Cup versions by Gerwin Klein */ /* Commented By: Christopher Lopes File Name: ycalc.cup To Create: > java java_cup.Main < ycalc.cup */ /* ———————-Preliminary Declarations Section——————–*/ /* Import the class java_cup.runtime.* */ import java_cup.runtime.*; /* Parser code to change the way the parser reports errors (include line and column number of the error). */ parser code {: /* Change the method report_error so it will display the line and column of where the error occurred in the input as well as the reason for the error which is passed into the method in the String ‘message’. */ public void report_error(String message, Object info) { /* Create a StringBuffer called ‘m’ with the string ‘Error’ in it. */ StringBuffer m = new StringBuffer("Error"); /* Check if the information passed to the method is the same type as the type java_cup.runtime.Symbol. */ if (info instanceof java_cup.runtime.Symbol) { /* Declare a java_cup.runtime.Symbol object ‘s’ with the information in the object info that is being typecasted as a java_cup.runtime.Symbol object. */ java_cup.runtime.Symbol s = ((java_cup.runtime.Symbol) info); /* Check if the line number in the input is greater or equal to zero. */ if (s.left >= 0) { /* Add to the end of the StringBuffer error message the line number of the error in the input. */ m.append(" in line "+(s.left+1)); /* Check if the column number in the input is greater or equal to zero. */ if (s.right >= 0) /* Add to the end of the StringBuffer error message the column number of the error in the input. */ m.append(", column "+(s.right+1)); } } /* Add to the end of the StringBuffer error message created in this method the message that was passed into this method. */ m.append(" : "+message); /* Print the contents of the StringBuffer ‘m’, which contains an error message, out on a line. */ System.err.println(m); } /* Change the method report_fatal_error so when it reports a fatal error it will display the line and column number of where the fatal error occurred in the input as well as the reason for the fatal error which is passed into the method in the object ‘message’ and then exit.*/ public void report_fatal_error(String message, Object info) { report_error(message, info); System.exit(1); } :}; /* ————Declaration of Terminals and Non Terminals Section———– */ /* Terminals (tokens returned by the scanner). Terminals that have no value are listed first and then terminals that do have an value, in this case an integer value, are listed on the next line down. */ terminal SEMI, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN; terminal Integer NUMBER, ID; /* Non terminals used in the grammar section. Non terminals that have an object value are listed first and then non terminals that have an integer value are listed. An object value means that it can be any type, it isn’t set to a specific type. So it could be an Integer or a String or whatever. */ non terminal Object expr_list, expr_part; non terminal Integer expr, factor, term; /* ————-Precedence and Associatively of Terminals Section———– */ /* Precedence of non terminals could be defined here. If you do define precedence here you won’t need to worry about precedence in the Grammar Section, i.e. that TIMES should have a higher precedence than PLUS. The precedence defined here would look something like this where the lower line always will have higher precedence than the line before it. precedence left PLUS, MINUS; precedence left TIMES, DIVIDE; */ /* —————————-Grammar Section——————– */ /* The grammar for our parser. expr_list ::= expr_list expr_part | expr_part expr_part ::= expr SEMI expr ::= expr PLUS factor | expr MINUS factor | factor factor ::= factor TIMES term | factor DIVIDE term | term term ::= LPAREN expr RPAREN | NUMBER | ID */ /* ‘expr_list’ is the start of our grammar. It can lead to another ‘expr_list’ followed by an ‘expr_part’ or it can just lead to an ‘expr_part’. The lhs of the non terminals ‘expr_list’ and ‘expr_part’ that are in the rhs side of the production below need to be found. Then the rhs sides of those non terminals need to be followed in a similar manner, i.e. if there are any non terminals in the rhs of those productions then the productions with those non terminals need to be found and those rhs’s followed. This process keeps continuing until only terminals are found in the rhs of a production. Then we can work our way back up the grammar bringing any values that might have been assigned from a terminal. */ expr_list ::= expr_list expr_part | expr_part; /* ‘expr_part’ is an ‘expr’ followed by the terminal ‘SEMI’. The ‘:e’ after the non terminal ‘expr’ is a label an is used to access the value of ‘expr’ which will be an integer. The action for the production lies between {: and :}. This action will print out the line " = + e" where e is the value of ‘expr’. Before the action takes places we need to go deeper into the grammar since ‘expr’ is a non terminal. Whenever a non terminal is encountered on the rhs of a production we need to find the rhs of that non terminal until there are no more non terminals in the rhs. So when we finish going through the grammar and don’t encounter any more non terminals in the rhs productions will return until we get back to ‘expr’ and at that point ‘expr’ will contain an integer value. */ expr_part ::= expr:e {: System.out.println(" = " + e); :} SEMI ; /* ‘expr’ can lead to ‘expr PLUS factor’, ‘expr MINUS factor’, or ‘factor’. The ‘TIMES’ and ‘DIVIDE’ productions are not at this level. They are at a lower level in the grammar which in affect makes them have higher precedence. Actions for the rhs of the non terminal ‘expr’ return a value to ‘expr’. This value that is created is an integer and gets stored in ‘RESULT’ in the action. RESULT is the label that is assigned automatically to the rhs, in this case ‘expr’. If the rhs is just ‘factor’ then ‘f’ refers to the non terminal ‘factor’. The value of ‘f’ is retrieved with the function ‘intValue()’ and will be stored in ‘RESULT’. In the other two cases ‘f’ and ‘e’ refers to the non terminals ‘factor’ and ‘expr’ respectively with a terminal between them, either ‘PLUS’ or ‘MINUS’. The value of each is retrieved with the same function ‘intValue’. The values will be added or subtracted and then the new integer will be stored in ‘RESULT’.*/ expr ::= expr:e PLUS factor:f {: RESULT = new Integer(e.intValue() + f.intValue()); :} | expr:e MINUS factor:f {: RESULT = new Integer(e.intValue() – f.intValue()); :} | factor:f {: RESULT = new Integer(f.intValue()); :} ; /* ‘factor’ can lead to ‘factor TIMES term’, ‘factor DIVIDE term’, or ‘term’. Since the productions for TIMES and DIVIDE are lower in the grammar than ‘PLUS’ and ‘MINUS’ they will have higher precedence. The same sort of actions take place in the rhs of ‘factor’ as in ‘expr’. The only difference is the operations that takes place on the values retrieved with ‘intValue()’, ‘TIMES’ and ‘DIVIDE’ here instead of ‘PLUS’ and ‘MINUS’. */ factor ::= factor:f TIMES term:t {: RESULT = new Integer(f.intValue() * t.intValue()); :} | factor:f DIVIDE term:t {: RESULT = new Integer(f.intValue() / t.intValue()); :} | term:t {: RESULT = new Integer(t.intValue()); :} ; /* ‘term’ can lead to ‘LPAREN expr RPAREN’, ‘NUMBER’, or ‘ID’. The first production has the non terminal ‘expr’ in it so the production with its lhs side needs to be found and followed. The next rhs has no non terminals. So the grammar ends here and can go back up. When it goes back up it will bring the value that was retrieved when the scanner encounter the token ‘NUMBER’. ‘RESULT’ is assigned ‘n’, which refers to ‘NUMBER’, as the action for this production. The same action occurs for ‘ID’, except the ‘i’ is used to refer to ‘ID’. ‘ID’ is also the only thing on the rhs of the production. And since ‘ID’ is a terminal the grammar will end here and go back up. */ term ::= LPAREN expr:e RPAREN {: RESULT = e; :} | NUMBER:n {: RESULT = n; :} | ID:i {: RESULT = i; :} ;
3.4 生成语法解析器calc.cup
当前目录切换到${java_cup_home}下,执行
java java_cup.Main options < calc.cup
其中,Options可以自己读文档,视情况而设。
执行的结果是生成了语法解析文件parser.java,它负责从lex.java中接受token流,构造语法规则,比如生成语法解析树。
至此,你已经拥有了一个解析器,只要在你的代码中引入Parser.java,调用其相应parse()方法即可完成表达式解析过程。测试调用过程如下:
/* Commented By: Christopher Lopes File Name: Main.java To Create: After the scanner, lcalc.flex, and the parser, ycalc.cup, have been created. > javac Main.java To Run: > java Main test.txt where test.txt is an test input file for the calculator. */ import java.io.*; public class Main { static public void main(String argv[]) { /* Start the parser */ try { parser p = new parser(new Lexer(new FileReader(argv[0]))); Object result = p.parse().value; } catch (Exception e) { /* do cleanup here — possibly rethrow e */ e.printStackTrace(); } } }
不过在处理复杂表达式(涉及到上下文、系统其他对象引用等)时,你需要仔细定义语法规则文件中的action。
相关资源:
1、JFlex http://jflex.de/
2、Java CUP http://www.cs.princeton.edu/~appel/modern/java/CUP/
3、BeanShell http://www.beanshell.org
4、Rhino http://www.mozilla.org/rhino
5、Innerfuse Pascal Script http://www.carlo-kok.com/ifps3.php
原文:http://www.bi-professional.com/231.html#comments
相关文章推荐
- Java应用中表达式解析器(Java Cup/JFlex)生成器的介绍及示例
- java字符串应用之表达式解析器
- java字符串应用之表达式解析器
- java字符串应用之表达式解析器
- java字符串应用之表达式解析器
- Java 集合系列16之 HashSet详细介绍(源码解析)和使用示例
- QuartZ Cron表达式在java定时框架中的应用
- java正则表达式匹配网页所有网址和链接文字的示例
- Java自带的线程池ThreadPoolExecutor详细介绍说明和实例应用
- php中常用的正则表达式的介绍及应用实例代码
- Java分布式应用技术架构介绍
- Java正则表达式-语法与示例
- java LRU算法介绍与用法示例
- Portal-Basic Java Web 应用开发框架 v2.6.1(源码、示例及文档)
- 正则表达式在Java中的应用
- Java正则表达式应用总结
- Java 正则表达式的详细介绍以及实例演示
- Java8 lambda 表达式的10示例
- Java 集合系列03之 ArrayList详细介绍(源码解析)和使用示例
- java多线程应用中常用的sleep、yield、join、wait、notify、notifyAll方法介绍