jflex1.5扫描中文字符-白红宇

jflex1.5扫描中文字符

阅读量：7092 次

发布时间：2019-06-28

本文共 13825 字，大约阅读时间需要 46 分钟。

动机：

测试jflex对中文文本的支持。

下载jflex，直接使用其中的例子，稍作修改便可支持中文了，这里把计算器的例子中的"+"号改为“加”。

所需的文件：

// lcalc.flex

This example comes from a short article series in the Linux

Gazette by Richard A. Sevenich and Christopher Lopes, titled

"Compiler Construction Tools". The article series starts at

Small changes and updates to newest JFlex+Cup versions

by Gerwin Klein

Commented By: Christopher Lopes

File Name: lcalc.flex

To Create: > jflex lcalc.flex

and then after the parser is created

> javac Lexer.java

/* --------------------------Usercode Section------------------------ */

import java_cup.runtime.*;

/* -----------------Options and Declarations Section----------------- */

The name of the class JFlex will create will be Lexer.

Will write the code to the file Lexer.java.

%class Lexer

%unicode

The current line number can be accessed with the variable yyline

and the current column number with the variable yycolumn.

%line

%column

Will switch to a CUP compatibility mode to interface with a CUP

generated parser.

%cup

Declarations

Code between %{ and %}, both of which must be at the beginning of a

line, will be copied letter to letter into the lexer class source.

Here you declare member variables and functions that are used inside

scanner actions.

/* To create a new java_cup.runtime.Symbol with information about

the current token, the token will have no value in this

case. */

private Symbol symbol(int type) {

return new Symbol(type, yyline, yycolumn);

}

/* Also creates a new java_cup.runtime.Symbol with information

about the current token, but this object has a value. */

private Symbol symbol(int type, Object value) {

return new Symbol(type, yyline, yycolumn, value);

}

Macro Declarations

These declarations are regular expressions that will be used latter

in the Lexical Rules Section.

/* A line terminator is a \r (carriage return), \n (line feed), or

\r\n. */

LineTerminator = \r|\n|\r\n

/* White space is a line terminator, space, tab, or line feed. */

WhiteSpace = {LineTerminator} | [ \t\f]

/* A literal integer is is a number beginning with a number between

one and nine followed by zero or more numbers between zero and nine

or just a zero. */

dec_int_lit = 0 | [1-9][0-9]*

/* A identifier integer is a word beginning a letter between A and

Z, a and z, or an underscore followed by zero or more letters

between A and Z, a and z, zero and nine, or an underscore. */

dec_int_id = [A-Za-z_][A-Za-z_0-9]*

/* ------------------------Lexical Rules Section---------------------- */

This section contains regular expressions and actions, i.e. Java

code, that will be executed when the scanner matches the associated

regular expression. */

/* YYINITIAL is the state at which the lexer begins scanning. So

these regular expressions will only be matched if the scanner is in

the start state YYINITIAL. */

<YYINITIAL> {

/* Return the token SEMI declared in the class sym that was found. */

";" { return symbol(sym.SEMI); }

/* Print the token found that was declared in the class sym and then

return it. */

"加" { System.out.print(" 加 "); return symbol(sym.PLUS); }

"-" { System.out.print(" - "); return symbol(sym.MINUS); }

"*" { System.out.print(" * "); return symbol(sym.TIMES); }

"/" { System.out.print(" / "); return symbol(sym.DIVIDE); }

"(" { System.out.print(" ( "); return symbol(sym.LPAREN); }

")" { System.out.print(" ) "); return symbol(sym.RPAREN); }

/* If an integer is found print it out, return the token NUMBER

that represents an integer and the value of the integer that is

held in the string yytext which will get turned into an integer

before returning */

{dec_int_lit} { System.out.print(yytext());

return symbol(sym.NUMBER, new Integer(yytext())); }

/* If an identifier is found print it out, return the token ID

that represents an identifier and the default value one that is

given to all identifiers. */

{dec_int_id} { System.out.print(yytext());

return symbol(sym.ID, new Integer(1));}

/* Don't do anything if whitespace is found */

{WhiteSpace} { /* just skip what was found, do nothing */ }

}

/* No token was found for the input so through an error. Print out an

Illegal character message with the illegal character that was found. */

[^] { throw new Error("Illegal character <"+yytext()+">"); }

// ycalc.cup

This example comes from a short article series in the Linux

Gazette by Richard A. Sevenich and Christopher Lopes, titled

"Compiler Construction Tools". The article series starts at

Small changes and updates to newest JFlex+Cup versions

by Gerwin Klein

Commented By: Christopher Lopes

File Name: ycalc.cup

To Create: > java java_cup.Main < ycalc.cup

/* ----------------------Preliminary Declarations Section--------------------*/

/* Import the class java_cup.runtime.* */

import java_cup.runtime.*;

/* Parser code to change the way the parser reports errors (include

line and column number of the error). */

parser code {:

/* Change the method report_error so it will display the line and

column of where the error occurred in the input as well as the

reason for the error which is passed into the method in the

String 'message'. */

public void report_error(String message, Object info) {

/* Create a StringBuilder called 'm' with the string 'Error' in it. */

StringBuilder m = new StringBuilder("Error");

/* Check if the information passed to the method is the same

type as the type java_cup.runtime.Symbol. */

if (info instanceof java_cup.runtime.Symbol) {

/* Declare a java_cup.runtime.Symbol object 's' with the

information in the object info that is being typecasted

as a java_cup.runtime.Symbol object. */

java_cup.runtime.Symbol s = ((java_cup.runtime.Symbol) info);

/* Check if the line number in the input is greater or

equal to zero. */

if (s.left >= 0) {

/* Add to the end of the StringBuilder error message

the line number of the error in the input. */

m.append(" in line "+(s.left+1));

/* Check if the column number in the input is greater

or equal to zero. */

if (s.right >= 0)

/* Add to the end of the StringBuilder error message

the column number of the error in the input. */

m.append(", column "+(s.right+1));

}

/* Add to the end of the StringBuilder error message created in

this method the message that was passed into this method. */

m.append(" : "+message);

/* Print the contents of the StringBuilder 'm', which contains

an error message, out on a line. */

System.err.println(m);

}

/* Change the method report_fatal_error so when it reports a fatal

error it will display the line and column number of where the

fatal error occurred in the input as well as the reason for the

fatal error which is passed into the method in the object

'message' and then exit.*/

public void report_fatal_error(String message, Object info) {

report_error(message, info);

System.exit(1);

}

:};

/* ------------Declaration of Terminals and Non Terminals Section----------- */

/* Terminals (tokens returned by the scanner).

Terminals that have no value are listed first and then terminals

that do have an value, in this case an integer value, are listed on

the next line down. */

terminal SEMI, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN;

terminal Integer NUMBER, ID;

/* Non terminals used in the grammar section.

Non terminals that have an object value are listed first and then

non terminals that have an integer value are listed. An object

value means that it can be any type, it isn't set to a specific

type. So it could be an Integer or a String or whatever. */

non terminal Object expr_list, expr_part;

non terminal Integer expr, factor, term;

/* -------------Precedence and Associatively of Terminals Section----------- */

Precedence of non terminals could be defined here. If you do define

precedence here you won't need to worry about precedence in the

Grammar Section, i.e. that TIMES should have a higher precedence

than PLUS.

The precedence defined here would look something like this where the

lower line always will have higher precedence than the line before it.

precedence left PLUS, MINUS;

precedence left TIMES, DIVIDE;

/* ----------------------------Grammar Section-------------------- */

/* The grammar for our parser.

expr_list ::= expr_list expr_part

| expr_part

expr_part ::= expr SEMI

expr ::= expr PLUS factor

| expr MINUS factor

| factor

factor ::= factor TIMES term

| factor DIVIDE term

| term

term ::= LPAREN expr RPAREN

| NUMBER

| ID

/* 'expr_list' is the start of our grammar. It can lead to another

'expr_list' followed by an 'expr_part' or it can just lead to an

'expr_part'. The lhs of the non terminals 'expr_list' and

'expr_part' that are in the rhs side of the production below need

to be found. Then the rhs sides of those non terminals need to be

followed in a similar manner, i.e. if there are any non terminals

in the rhs of those productions then the productions with those non

terminals need to be found and those rhs's followed. This process

keeps continuing until only terminals are found in the rhs of a

production. Then we can work our way back up the grammar bringing

any values that might have been assigned from a terminal. */

expr_list ::= expr_list expr_part

expr_part;

/* 'expr_part' is an 'expr' followed by the terminal 'SEMI'. The ':e'

after the non terminal 'expr' is a label an is used to access the

value of 'expr' which will be an integer. The action for the

production lies between {: and :}. This action will print out the

line " = + e" where e is the value of 'expr'. Before the action

takes places we need to go deeper into the grammar since 'expr' is

a non terminal. Whenever a non terminal is encountered on the rhs

of a production we need to find the rhs of that non terminal until

there are no more non terminals in the rhs. So when we finish

going through the grammar and don't encounter any more non

terminals in the rhs productions will return until we get back to

'expr' and at that point 'expr' will contain an integer value. */

expr_part ::= expr:e

{: System.out.println(" = " + e); :}

SEMI

;

/* 'expr' can lead to 'expr PLUS factor', 'expr MINUS factor', or

'factor'. The 'TIMES' and 'DIVIDE' productions are not at this

level. They are at a lower level in the grammar which in affect

makes them have higher precedence. Actions for the rhs of the non

terminal 'expr' return a value to 'expr'. This value that is

created is an integer and gets stored in 'RESULT' in the action.

RESULT is the label that is assigned automatically to the rhs, in

this case 'expr'. If the rhs is just 'factor' then 'f' refers to

the non terminal 'factor'. The value of 'f' is retrieved with the

function 'intValue()' and will be stored in 'RESULT'. In the other

two cases 'f' and 'e' refers to the non terminals 'factor' and

'expr' respectively with a terminal between them, either 'PLUS' or

'MINUS'. The value of each is retrieved with the same function

'intValue'. The values will be added or subtracted and then the

new integer will be stored in 'RESULT'.*/

expr ::= expr:e PLUS factor:f

{: RESULT = new Integer(e.intValue() + f.intValue()); :}

expr:e MINUS factor:f

{: RESULT = new Integer(e.intValue() - f.intValue()); :}

factor:f

{: RESULT = new Integer(f.intValue()); :}

;

/* 'factor' can lead to 'factor TIMES term', 'factor DIVIDE term', or

'term'. Since the productions for TIMES and DIVIDE are lower in

the grammar than 'PLUS' and 'MINUS' they will have higher

precedence. The same sort of actions take place in the rhs of

'factor' as in 'expr'. The only difference is the operations that

takes place on the values retrieved with 'intValue()', 'TIMES' and

'DIVIDE' here instead of 'PLUS' and 'MINUS'. */

factor ::= factor:f TIMES term:t

{: RESULT = new Integer(f.intValue() * t.intValue()); :}

factor:f DIVIDE term:t

{: RESULT = new Integer(f.intValue() / t.intValue()); :}

term:t

{: RESULT = new Integer(t.intValue()); :}

;

/* 'term' can lead to 'LPAREN expr RPAREN', 'NUMBER', or 'ID'. The

first production has the non terminal 'expr' in it so the

production with its lhs side needs to be found and followed. The

next rhs has no non terminals. So the grammar ends here and can go

back up. When it goes back up it will bring the value that was

retrieved when the scanner encounter the token 'NUMBER'. 'RESULT'

is assigned 'n', which refers to 'NUMBER', as the action for this

production. The same action occurs for 'ID', except the 'i' is

used to refer to 'ID'. 'ID' is also the only thing on the rhs of

the production. And since 'ID' is a terminal the grammar will end

here and go back up. */

term ::= LPAREN expr:e RPAREN

{: RESULT = e; :}

NUMBER:n

{: RESULT = n; :}

ID:i

{: RESULT = i; :}

;

// test.txt ,测试输入文件

2加4;

5*(6-3)加1;

6/3*5加20;

4*76/31;

1-1-1;

// Main.java

This example comes from a short article series in the Linux

Gazette by Richard A. Sevenich and Christopher Lopes, titled

"Compiler Construction Tools". The article series starts at

Small changes and updates to newest JFlex+Cup versions

by Gerwin Klein

Commented By: Christopher Lopes

File Name: Main.java

To Create:

After the scanner, lcalc.flex, and the parser, ycalc.cup, have been created.

> javac Main.java

To Run:

> java Main test.txt

where test.txt is an test input file for the calculator.

import java.io.*;

public class Main {

static public void main(String argv[]) {

/* Start the parser */

try {

parser p = new parser(new Lexer(new FileReader(argv[0])));

Object result = p.parse().value;

} catch (Exception e) {

/* do cleanup here -- possibly rethrow e */

e.printStackTrace();

}

测试：

首先生成扫描和分析两个java类文件:

cmd>jflex lcalc.flex

cmd>java -jar java-cup-11a.jar < ycalc.cup

java-cup-11a.jar在下载的源码lib目录里，是cup的运行时jar包。把生成的java文件与Main.java放到一起。

然后运行main：

cmd>java Main test.txt

测试结果:

2 加 4 = 6

5 * ( 6 - 3 ) 加 1 = 16

6 / 3 * 5 加 20 = 30

4 * 76 / 31 = 9

1 - 1 - 1 = -1

问题：

上述.flex和.cup文件必须使用平台默认编码，否则会报错，无法生成java文件.

转载于:https://blog.51cto.com/2924037/1363258

你可能感兴趣的文章

Linux(CentOS)最小化(mini)安装VMware Tools

passive-interface / silent-interface

磁盘df看还有剩余空间，但是创建文件时报错，提示磁盘已经满问题解决

查看>>

用艾宾浩斯曲线记忆周期来背单词是否有理论依据？

查看>>

[转]云计算之hadoop、hive、hue、oozie、sqoop、hbase、zookeeper环境搭建及配置文件