您的位置：首页 > 其它

The context sensitivity of C's grammar

2015-12-30 17:13 316 查看

我们都知道YACC是一个parser生成器，它只能接受Contex Free Grammar（CFG）；而且YACC当然可以产生C语言的parser。

然而C语言的语法其实并不完全是CFG的（它是接近CFG的一种语法）。且看以下这篇文章。

原文： http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar

http://trevorjim.com/c-and-cplusplus-are-not-context-free/

The context
sensitivity of C's grammar

November 24, 2007 at 08:16 Tags Articles , C
& C++ , Compilation

Context free
grammars (CFGs) are a valuable theoretical tool on which the modern compilation theory relies for parsing the code of programming languages. For example, the most popular tool used for parsing – YACC,
generates parsers for CFGs. What most people don’t know1 is
that the vast majority of programming languages have grammars that are not context free.
C is a very good example, because it is one of the most popular languages in use and because its grammar is soalmost context free that it serves as a good model to demonstrate what I’m talking
about.
Now, a CFG has several definitions in relation to formal languages and programming languages. I don’t want to delve too deep into the nomenclature here, but here
is a discussion by a bunch of clever guys picking nits off this matter. When I say that the grammar of C is not a CFG, what I mean is that a grammar given to YACC[2] is not enough to
parse C correctly, without referring to some context information that comes from elsewhere. It’s time for some examples.
Consider this code:

{
T (x);
...
}

Believe it or not, but given that

is a type, this
is actually a valid declaration of

of the type

in
C. However, if

is not a known type, this is a call to the function

with
the argument

. How can the C parser know which way to parse without knowing whether

was
previously defined by a

typedef

?
I can hear you say “but this is contrived, who ever writes code like that ?”. OK, something more standard:

{
T * x;
...
}

What is this, a declaration of

as a pointer to

,
or a void multiplication of the variables

and

?
There is no way to know without having the table of types defined by

typedef

in the memory, and
parsers aren’t built to do that – this iscontext sensitive information.
Here’s another example:

func((T) * x);

is a type, the result of dereferencing

is
cast to

and passed to

func

.
If

isn’t a type, the multiplication of

and

is
passed to

func

.
In all these examples, the parser would be lost without having some information gathered on the code before the problematic statement is reached. Therefore, C cannot be parsed with a YACC grammar
without mixing in some context sensitive information. This actually has a name in the compilation / C community – the “typedef-name: identifier” problem. Even K&R23 has
something to say about it, when presenting the grammar for C in the appendix:

With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.

So, as you see, C is very close to having a CFG, but isn’t quite there. Fortunately, this problem is very simple to fix. All that’s needed is keeping a symbol table of types defined
by

typedef

as the parsing goes. Whenever a new identifier is recognized in the lexer, it checks
if this identifier is a defined type, and returns the correct token to the parser. As far as the parser is concerned, it has two distinct terminals – an identifier and a defined type. All that’s left is updating the symbol table whenever a successful parse
of a typedef statement completes. To show better how this works, I’ll show the relevant portions of the C parser and lexer from c2c’s code. Here is a portion of the Lex file:

identifier ([a-zA-Z_][0-9a-zA-Z_]*)

<INITIAL,C>{identifier}
{
GetCoord(&yylval.tok);
yylval.n = MakeIdCoord(UniqueString(yytext),
yylval.tok);
if (IsAType(yylval.n->u.id.text))
RETURN_TOKEN(TYPEDEFname);
else
RETURN_TOKEN(IDENTIFIER);
}

Without getting too much into the syntax of Lex here, what this basically says is that whenever an identifier is found, it is tested for being a type. If it is, the

TYPEDEFname

token
is returned. Otherwise,

IDENTIFIER

is returned. For the Yacc grammar, these two are separate
terminals.
1 To be more precise, “most people” don’t even care for things like this. By people I
here refer to those who are interested in programming and computer science.
2 YACC only accepts CFGs, since in every production rule

V
-> w

is a single nonterminal symbol.
3 “The ANSI C programming language, 2nd edition” by Kernighan
and Ritchie

C and C++ are not context free

C and C++ are not context free

January 31, 2013 ∞

(Part 5 of a series: 1, 2, 3, 4)

It seems that the Internet is still confused about this, so let’s consider the question:

Are C and C++ context-free languages?

And the answer is

No, C and C++ are context-sensitive languages.

There are several reasons.

To parse C and C++, you start by using a very powerfulpreprocessor. These preprocessors are inevitably written by hand (they are not based on a theoretic foundation like regular expressions or context-free grammars).

C and C++ lexers require lexical feedback to differentiate between typedef names and identifiers. That is, the context-sensitive
lexer needs help from the “context-free” parser to distinguish between an identifier “foo” and a typedef name “foo”. In this snippet,

int foo;
typedef int foo;
foo x;

the first “foo” is an identifier while the second and third are typedef names. You need to parse typedef declarations to figure this out (and since types have nested parentheses, this is definitely at least as hard as parsing a context-free language).

This means that the parser and lexer are mutually recursive, so it doesn’t make sense to say that the parser is context free while the lexer is context sensitive.

The grammar is ambiguous: it has LR conflicts, such as the if-then-else conflict. Parsers typically resolve this using context (an “else” matches the closest “if”).

If you believe that these reasons are not sufficient to show that the languages are context sensitive, please consult my
previous post on the subject.

C and C++ are important languages so it’s worth saying a bit more.

The C and C++ preprocessors are a real problem for the languages. Among other things, they make it hard to write tools that operate on programs. For example, suppose you would like to refactor your program by renaming a function. Preprocessors can introduce
new tokens, so ideally your tool would work by first running the preprocessor, then the lexer/parser to find all occurrences of the function and replacing it. But then, you have a refactored program that has already been preprocessed. At this point
you would like to run the preprocessor in reverse, so that that result is both readable and similar to the original. Good luck with that.

C++ is much harder to parse than C. McPeak’s work on Elkhound and Elsa has a good discussion of the issues.

If you ever look closely at language specifications, you’ll see that they have a lot of problems. They have many ambiguities and even contradictions. Moreover, language implementations rarely meet the specification, and no two implementations work the same
on all programs (cf. my posts on Postel’s Law). A
Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World is a delightful article by Coverity on the difficulties of writing static analyzers for C programs “in the wild.” The whole article is worth reading several times over,
but here are a couple of relevant quotes:

Law: You can’t check code you can’t parse.Checking code deeply requires understanding the code’s semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the
widely believed myth that programming languages exist.

The C language does not exist; neither does Java, C++, and C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character
strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as “C code” no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C
code, a tool’s C front-end will reject it. This problem is the tool’s problem.

and

If divergence-induced parse errors are isolated events scattered here and there, then they don’t matter. An unsound tool can skip them. Unfortunately, failure often isn’t modular. In a sad, too-common story line, some crucial, purportedly “C” header file
contains a blatantly illegal non-C construct. It gets included by all files. The no-longer-potential customer is treated to a constant stream of parse errors as your compiler rips through the customer’s source files, rejecting each in turn. The customer’s
derisive stance is, “Deep source code analysis? Your tool can’t even compile code. How can it find bugs?” It may find this event so amusing that it tells many friends.

(Coverity, by the way, is now looking at HTML5
parsing to find security vulnerabilities, cf. this post and BEEP.)

I don’t blame C for these problems. C is old, and it was created when there were few widely available parser generators. In fact, it looks to me like yacc and C were essentially born together!Here’s an
inspirational interview with Stephen Johnson that discusses those early days. It wasn’t easy; Johnson says

Over the next several years, I rewrote the program over a dozen times, speeding it up by a factor of 10,000 or so.

That’s the level of effort and craftsmanship you need to create one of the most influential programs ever written.

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航

The context sensitivity of C&#39;s grammar