The context sensitivity of C's grammar
2015-12-30 17:13
316 查看
我们都知道YACC是一个parser生成器,它只能接受Contex Free Grammar(CFG);而且YACC当然可以产生C语言的parser。
然而C语言的语法其实并不完全是CFG的(它是接近CFG的一种语法)。且看以下这篇文章。
原文: http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar
http://trevorjim.com/c-and-cplusplus-are-not-context-free/
November 24, 2007 at 08:16 Tags Articles , C
& C++ , Compilation
Context free
grammars (CFGs) are a valuable theoretical tool on which the modern compilation theory relies for parsing the code of programming languages. For example, the most popular tool used for parsing – YACC,
generates parsers for CFGs. What most people don’t know1 is
that the vast majority of programming languages have grammars that are not context free.
C is a very good example, because it is one of the most popular languages in use and because its grammar is soalmost context free that it serves as a good model to demonstrate what I’m talking
about.
Now, a CFG has several definitions in relation to formal languages and programming languages. I don’t want to delve too deep into the nomenclature here, but here
is a discussion by a bunch of clever guys picking nits off this matter. When I say that the grammar of C is not a CFG, what I mean is that a grammar given to YACC[2] is not enough to
parse C correctly, without referring to some context information that comes from elsewhere. It’s time for some examples.
Consider this code:
Believe it or not, but given that
is actually a valid declaration of
C. However, if
the argument
previously defined by a
I can hear you say “but this is contrived, who ever writes code like that ?”. OK, something more standard:
What is this, a declaration of
or a void multiplication of the variables
There is no way to know without having the table of types defined by
parsers aren’t built to do that – this iscontext sensitive information.
Here’s another example:
If
cast to
If
passed to
In all these examples, the parser would be lost without having some information gathered on the code before the problematic statement is reached. Therefore, C cannot be parsed with a YACC grammar
without mixing in some context sensitive information. This actually has a name in the compilation / C community – the “typedef-name: identifier” problem. Even K&R23 has
something to say about it, when presenting the grammar for C in the appendix:
With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.
So, as you see, C is very close to having a CFG, but isn’t quite there. Fortunately, this problem is very simple to fix. All that’s needed is keeping a symbol table of types defined
by
if this identifier is a defined type, and returns the correct token to the parser. As far as the parser is concerned, it has two distinct terminals – an identifier and a defined type. All that’s left is updating the symbol table whenever a successful parse
of a typedef statement completes. To show better how this works, I’ll show the relevant portions of the C parser and lexer from c2c’s code. Here is a portion of the Lex file:
Without getting too much into the syntax of Lex here, what this basically says is that whenever an identifier is found, it is tested for being a type. If it is, the
is returned. Otherwise,
terminals.
1 To be more precise, “most people” don’t even care for things like this. By people I
here refer to those who are interested in programming and computer science.
2 YACC only accepts CFGs, since in every production rule
3 “The ANSI C programming language, 2nd edition” by Kernighan
and Ritchie
C and C++ are not context free
January 31, 2013 ∞
(Part 5 of a series: 1, 2, 3, 4)
It seems that the Internet is still confused about this, so let’s consider the question:
Are C and C++ context-free languages?
And the answer is
No, C and C++ are context-sensitive languages.
There are several reasons.
To parse C and C++, you start by using a very powerfulpreprocessor. These preprocessors are inevitably written by hand (they are not based on a theoretic foundation like regular expressions or context-free grammars).
C and C++ lexers require lexical feedback to differentiate between typedef names and identifiers. That is, the context-sensitive
lexer needs help from the “context-free” parser to distinguish between an identifier “foo” and a typedef name “foo”. In this snippet,
the first “foo” is an identifier while the second and third are typedef names. You need to parse typedef declarations to figure this out (and since types have nested parentheses, this is definitely at least as hard as parsing a context-free language).
This means that the parser and lexer are mutually recursive, so it doesn’t make sense to say that the parser is context free while the lexer is context sensitive.
The grammar is ambiguous: it has LR conflicts, such as the if-then-else conflict. Parsers typically resolve this using context (an “else” matches the closest “if”).
If you believe that these reasons are not sufficient to show that the languages are context sensitive, please consult my
previous post on the subject.
C and C++ are important languages so it’s worth saying a bit more.
The C and C++ preprocessors are a real problem for the languages. Among other things, they make it hard to write tools that operate on programs. For example, suppose you would like to refactor your program by renaming a function. Preprocessors can introduce
new tokens, so ideally your tool would work by first running the preprocessor, then the lexer/parser to find all occurrences of the function and replacing it. But then, you have a refactored program that has already been preprocessed. At this point
you would like to run the preprocessor in reverse, so that that result is both readable and similar to the original. Good luck with that.
C++ is much harder to parse than C. McPeak’s work on Elkhound and Elsa has a good discussion of the issues.
If you ever look closely at language specifications, you’ll see that they have a lot of problems. They have many ambiguities and even contradictions. Moreover, language implementations rarely meet the specification, and no two implementations work the same
on all programs (cf. my posts on Postel’s Law). A
Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World is a delightful article by Coverity on the difficulties of writing static analyzers for C programs “in the wild.” The whole article is worth reading several times over,
but here are a couple of relevant quotes:
Law: You can’t check code you can’t parse.Checking code deeply requires understanding the code’s semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the
widely believed myth that programming languages exist.
The C language does not exist; neither does Java, C++, and C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character
strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as “C code” no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C
code, a tool’s C front-end will reject it. This problem is the tool’s problem.
and
If divergence-induced parse errors are isolated events scattered here and there, then they don’t matter. An unsound tool can skip them. Unfortunately, failure often isn’t modular. In a sad, too-common story line, some crucial, purportedly “C” header file
contains a blatantly illegal non-C construct. It gets included by all files. The no-longer-potential customer is treated to a constant stream of parse errors as your compiler rips through the customer’s source files, rejecting each in turn. The customer’s
derisive stance is, “Deep source code analysis? Your tool can’t even compile code. How can it find bugs?” It may find this event so amusing that it tells many friends.
(Coverity, by the way, is now looking at HTML5
parsing to find security vulnerabilities, cf. this post and BEEP.)
I don’t blame C for these problems. C is old, and it was created when there were few widely available parser generators. In fact, it looks to me like yacc and C were essentially born together!Here’s an
inspirational interview with Stephen Johnson that discusses those early days. It wasn’t easy; Johnson says
Over the next several years, I rewrote the program over a dozen times, speeding it up by a factor of 10,000 or so.
That’s the level of effort and craftsmanship you need to create one of the most influential programs ever written.
然而C语言的语法其实并不完全是CFG的(它是接近CFG的一种语法)。且看以下这篇文章。
原文: http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar
http://trevorjim.com/c-and-cplusplus-are-not-context-free/
The context
sensitivity of C's grammar
November 24, 2007 at 08:16 Tags Articles , C& C++ , Compilation
Context free
grammars (CFGs) are a valuable theoretical tool on which the modern compilation theory relies for parsing the code of programming languages. For example, the most popular tool used for parsing – YACC,
generates parsers for CFGs. What most people don’t know1 is
that the vast majority of programming languages have grammars that are not context free.
C is a very good example, because it is one of the most popular languages in use and because its grammar is soalmost context free that it serves as a good model to demonstrate what I’m talking
about.
Now, a CFG has several definitions in relation to formal languages and programming languages. I don’t want to delve too deep into the nomenclature here, but here
is a discussion by a bunch of clever guys picking nits off this matter. When I say that the grammar of C is not a CFG, what I mean is that a grammar given to YACC[2] is not enough to
parse C correctly, without referring to some context information that comes from elsewhere. It’s time for some examples.
Consider this code:
{ T (x); ... }
Believe it or not, but given that
Tis a type, this
is actually a valid declaration of
xof the type
Tin
C. However, if
Tis not a known type, this is a call to the function
Twith
the argument
x. How can the C parser know which way to parse without knowing whether
Twas
previously defined by a
typedef?
I can hear you say “but this is contrived, who ever writes code like that ?”. OK, something more standard:
{ T * x; ... }
What is this, a declaration of
xas a pointer to
T,
or a void multiplication of the variables
Tand
x?
There is no way to know without having the table of types defined by
typedefin the memory, and
parsers aren’t built to do that – this iscontext sensitive information.
Here’s another example:
func((T) * x);
If
Tis a type, the result of dereferencing
xis
cast to
Tand passed to
func.
If
Tisn’t a type, the multiplication of
Tand
xis
passed to
func.
In all these examples, the parser would be lost without having some information gathered on the code before the problematic statement is reached. Therefore, C cannot be parsed with a YACC grammar
without mixing in some context sensitive information. This actually has a name in the compilation / C community – the “typedef-name: identifier” problem. Even K&R23 has
something to say about it, when presenting the grammar for C in the appendix:
With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.
So, as you see, C is very close to having a CFG, but isn’t quite there. Fortunately, this problem is very simple to fix. All that’s needed is keeping a symbol table of types defined
by
typedefas the parsing goes. Whenever a new identifier is recognized in the lexer, it checks
if this identifier is a defined type, and returns the correct token to the parser. As far as the parser is concerned, it has two distinct terminals – an identifier and a defined type. All that’s left is updating the symbol table whenever a successful parse
of a typedef statement completes. To show better how this works, I’ll show the relevant portions of the C parser and lexer from c2c’s code. Here is a portion of the Lex file:
identifier ([a-zA-Z_][0-9a-zA-Z_]*) <INITIAL,C>{identifier} { GetCoord(&yylval.tok); yylval.n = MakeIdCoord(UniqueString(yytext), yylval.tok); if (IsAType(yylval.n->u.id.text)) RETURN_TOKEN(TYPEDEFname); else RETURN_TOKEN(IDENTIFIER); }
Without getting too much into the syntax of Lex here, what this basically says is that whenever an identifier is found, it is tested for being a type. If it is, the
TYPEDEFnametoken
is returned. Otherwise,
IDENTIFIERis returned. For the Yacc grammar, these two are separate
terminals.
1 To be more precise, “most people” don’t even care for things like this. By people I
here refer to those who are interested in programming and computer science.
2 YACC only accepts CFGs, since in every production rule
V -> w,
Vis a single nonterminal symbol.
3 “The ANSI C programming language, 2nd edition” by Kernighan
and Ritchie
C and C++ are not context free
C and C++ are not context freeJanuary 31, 2013 ∞
(Part 5 of a series: 1, 2, 3, 4)
It seems that the Internet is still confused about this, so let’s consider the question:
Are C and C++ context-free languages?
And the answer is
No, C and C++ are context-sensitive languages.
There are several reasons.
To parse C and C++, you start by using a very powerfulpreprocessor. These preprocessors are inevitably written by hand (they are not based on a theoretic foundation like regular expressions or context-free grammars).
C and C++ lexers require lexical feedback to differentiate between typedef names and identifiers. That is, the context-sensitive
lexer needs help from the “context-free” parser to distinguish between an identifier “foo” and a typedef name “foo”. In this snippet,
int foo; typedef int foo; foo x;
the first “foo” is an identifier while the second and third are typedef names. You need to parse typedef declarations to figure this out (and since types have nested parentheses, this is definitely at least as hard as parsing a context-free language).
This means that the parser and lexer are mutually recursive, so it doesn’t make sense to say that the parser is context free while the lexer is context sensitive.
The grammar is ambiguous: it has LR conflicts, such as the if-then-else conflict. Parsers typically resolve this using context (an “else” matches the closest “if”).
If you believe that these reasons are not sufficient to show that the languages are context sensitive, please consult my
previous post on the subject.
C and C++ are important languages so it’s worth saying a bit more.
The C and C++ preprocessors are a real problem for the languages. Among other things, they make it hard to write tools that operate on programs. For example, suppose you would like to refactor your program by renaming a function. Preprocessors can introduce
new tokens, so ideally your tool would work by first running the preprocessor, then the lexer/parser to find all occurrences of the function and replacing it. But then, you have a refactored program that has already been preprocessed. At this point
you would like to run the preprocessor in reverse, so that that result is both readable and similar to the original. Good luck with that.
C++ is much harder to parse than C. McPeak’s work on Elkhound and Elsa has a good discussion of the issues.
If you ever look closely at language specifications, you’ll see that they have a lot of problems. They have many ambiguities and even contradictions. Moreover, language implementations rarely meet the specification, and no two implementations work the same
on all programs (cf. my posts on Postel’s Law). A
Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World is a delightful article by Coverity on the difficulties of writing static analyzers for C programs “in the wild.” The whole article is worth reading several times over,
but here are a couple of relevant quotes:
Law: You can’t check code you can’t parse.Checking code deeply requires understanding the code’s semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the
widely believed myth that programming languages exist.
The C language does not exist; neither does Java, C++, and C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character
strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as “C code” no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C
code, a tool’s C front-end will reject it. This problem is the tool’s problem.
and
If divergence-induced parse errors are isolated events scattered here and there, then they don’t matter. An unsound tool can skip them. Unfortunately, failure often isn’t modular. In a sad, too-common story line, some crucial, purportedly “C” header file
contains a blatantly illegal non-C construct. It gets included by all files. The no-longer-potential customer is treated to a constant stream of parse errors as your compiler rips through the customer’s source files, rejecting each in turn. The customer’s
derisive stance is, “Deep source code analysis? Your tool can’t even compile code. How can it find bugs?” It may find this event so amusing that it tells many friends.
(Coverity, by the way, is now looking at HTML5
parsing to find security vulnerabilities, cf. this post and BEEP.)
I don’t blame C for these problems. C is old, and it was created when there were few widely available parser generators. In fact, it looks to me like yacc and C were essentially born together!Here’s an
inspirational interview with Stephen Johnson that discusses those early days. It wasn’t easy; Johnson says
Over the next several years, I rewrote the program over a dozen times, speeding it up by a factor of 10,000 or so.
That’s the level of effort and craftsmanship you need to create one of the most influential programs ever written.
相关文章推荐
- 如何快速关闭电脑屏幕的工具
- 查看指定端口端口号所对应用的PID及应用程序
- 机器学习入门系列一(关键词:单变量线性回归,梯度下降法)
- dwz+jquery+fileupload+springmvc实现文件上传 及图片预览
- 金币(0)<P2015_1>
- 【cocos2d-x 3.4】全屏显示窗口,隐藏标题栏
- 【蒟蒻の进阶PLAN】 置顶+持续连载
- 1-FreeMarker的介绍和helloworld
- PMCAFF微课堂 (已结束)| 京东平台产品负责人:如何打造一个支撑3200万日订单量的平台型产品
- 2015年中国公有云市场的七个重要特征
- 【操作系统】 置换策略模拟实现
- 2015年中国公有云市场的七个重要特征
- Mongodb管理的几个常识
- 微信分享提示未安装
- Darwin中OSRef和OSHashTable类的使用
- 为什么Enable BitCode(Xcode7)真机测试要修改为 NO?
- 【蒟蒻の进阶PLAN】 置顶+持续连载
- 新发现:云盘转移multcloud
- 写在年尾
- 1. Two Sum