The code is horrible.
Not all my code is that bad. At least not the code I usually publish.
As always:
Copyright (c) 2002 Steffen Mueller. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. Please see the Perl Artistic License.
The script is available at:
Old version: steffen-mueller.net/chomsky/chomsky.txt
New version: steffen-mueller.net/chomsky/chomsky2.txt
The new version behaves differently for CH-0/CH-1 grammars. See Description.
On Windows, you need to install Perl (try www.activestate.com)
and then rename the file to "chomsky.pl" before running it as "perl chomsky.pl".
On *nix, you have Perl. You know how to run Perl scripts, don't you? Just
modify the shebang appropriately.
Please read the description.
Example grammars:
grammar 1
grammar 2
grammar 3
grammar 4
With the nasty stuff out of the way, I can define the expected behaviour of the program. If its behaviour does not comply to these specs, it's a bug and provided that it is not already listed under Limitations and bugs, you may feel free to complain to the author.
All grammars comply to type CH-0. (read: Chomsky type 0)
All grammars that comply to CH-n also comply to CH-m with m < n.
To comply with CH-1, all productions of the grammar must be of the form:
uAv -> urvwhere u and v are single terminals or epsilon (epsilon meaning "nothing"), A is a single non terminal, and r is a string of terminals and non terminals but not epsilon.
To comply with CH-2, all productions of the grammar must be of the form:
A -> rwhere A is a single terminal and r is a string of terminals and non-terminals or epsilon.
To comply with CH-3, all productions of the grammar must be of any of the
following forms:
A -> Bx (left-linear) A -> xB (right-linear) A -> x (terminating) A -> epsilon (epsilon production)where A and B are single non terminals and x a single terminal.
Grammars should look similar to the following example, but whitespace should be mostly insignificant. (Except that in between tokens, there must be some whitespace to delimit the tokens.)
This is some random description where no
line starts with "N\s*=\s*".
Nonterminals are the tokens inside the curly parens
that are assigned to "N".
Terminals are in "E" (resembling large sigma)
N =
{
A
S
D
}
# This is a comment.
# Comments are removed.
But you needn't comment here unless the comment starts with
# E\s*=\s*
E =
{
f
g
}
Now for the productions.
Productions start with "P:" or any substring of "Production"
followed by a colon.
P: A ->
f D # This is the first production
| # OR
S # This is the second production...
|
f
;
# A semicolon ends a production.
P: D -> f D | B ; # Much shorter!
P: S ->
g S
|
g
;
Lots.
One is described above. (see CH-1)
Another one is that if you have "A -> B" and "B -> b" (capitals being
non terminals, lower case letters being terminals), the program
will identify "A -> B" as not CH-3 compliant and make the whole grammar
not CH-3 compliant. In fact, B can be substituted by "b" and the whole
grammar is CH-3 compliant. So the algorithm used to determine
whether a production is CH-3 compliant should check for compliance of
all subrules (non terminals) in it. If "A -> B C" and "B -> epsilon",
"C -> c X", the algorithm should detect CH-3 compliance.
As far as I can tell anyway.
This leads me to the most important gotcha:
The implementation cannot be any better than my understanding of the matter.
It may be significantly worse, though.
(c) 2002 Steffen Mueller, mail / at / steffen-mueller / dot / net. All rights reserved.