The Background
The company I work (Yarde) for has a large system written in an '80s vintage 4GL
called DecisionPlus. This is one of those languages with integrated database,
screen handling, and report generation facilities. The current code base is in
excess of 460K lines of (often very nasty) code across nearly 1,300 files.
Yarde has a full set of source for the system, with new code and bug fixes for
the system rolling out into production on a daily basis.
The problem Yarde faces, like many other companies with custom or
semi-custom software, is that limits in the system are starting to affect day-to-day
operations. In our case, we're running into problems with the
integrated database library, which limits some of our critical data files to
around 2 million records. That might sound like a lot, but we're putting
nearly 5,000 records a day into those files, and that daily rate is
constantly increasing. On the one hand, that's good for the company as it
means our business is growing, but on the other hand, we can only keep about 400
business days of data online at once. This hasn't affected the business yet, but sooner or
later it will.
This is the equivalent of having the full C source to an application, but
not having the source to the C compiler or runtime library, and finding that
there's a large and unpleasant bug in one of the essential support routines.
There was no flaw in the application—for which we had the source—that
was causing the problem; rather, the problem was in the runtime, for which
we didn't have source.
The common solution to a problem this big, especially when the original
system is written in a language that's fallen out of favor, is to rewrite the
system in a new language, or to migrate to another software package. These
choices are both risky and expensive, and the experiences of other companies
show that projects like this fail more often than succeed. We have more than a
decade of customized business logic tied up in the current system, which we'd
have to tease out and recreate; hundreds of employees we'd need to retrain; and
lots of custom reports to recreate. Given the amount of customization needed,
it was doubtful that even if we used another off-the-shelf product, we could finish
the project for a reasonable cost (or even an unreasonable cost) before the
database issues stopped being annoyances and started being problems.
Neither rewriting the system nor ditching it for someone else's software was
an appealing option, especially given that the reason to do so was essentially
an internal issue and mostly invisible to the users. We decided instead to do
the sensible thing—we rewrote DecisionPlus. Or, rather, we wrote a
compiler that understood DecisionPlus and, rather than targeting the
DecisionPlus virtual machine and runtime that was causing us problems, we targeted
a newer and less restrictive system.
That may sound mad, but it really was the simplest thing to do, by far.
DecisionPlus the language is, well, let's call it charmingly simplistic. The
screen handling code was all terminal based and very straightforward to map to
ncurses with its form handling library. The database system required a bit of
thought, as the original was a fielded ISAM database; we originally looked at
the Berkeley DB, but realized we could use an SQL database with some database
trickery and runtime library shims, and settled on PostgreSQL. (We needed
writable views and triggers, so MySQL wasn't an option.)
The final project design ultimately has four pieces: a compiler, a database
runtime library, a screen runtime library, and some general utility code.
We started writing the compiler in Perl 5, to emit Perl 5 code. Writing the
compiler in Perl 5 meant we could use a Parse::RecDescent
grammar, which takes care of the tokenizing and some of the parsing. This was a
huge win. With Parse::RecDescent we got a preliminary compiler
written in less than two months. Unfortunately, because of this it became
obvious that Perl 5 wasn't a good fit as a target language. DecisionPlus,
because of its age, didn't have the concept of subroutines or blocks. Instead, it
made do with line labels, a goto, and a gosub
statement. It also has typed variables and strings with predeclared (and
enforced) lengths.
Perl could do this. The compiler turned labels into subroutines,
gotos into goto &subroutines, and
gosubs into sub calls. Typed data was implemented with scalars
tied into packages that enforced the typed behavior. The result was messy; even
before we wrote any runtime library or support code it was pretty clear that
the resulting code would be unwieldy and potentially slow.
Writing a compiler still was the right answer. The speed with which I built
it made that clear. Perl 5 was just the wrong back end to target. We just
needed a different back end. Conveniently, we had Parrot handy, and the results have been
entirely satisfactory.
Our compiler started out as a set of prototype single-function programs.
There's a simple C-style preprocessor, a parser, and the actual compiler.
Splitting out the functions into separate programs made development and testing
easier, and helped keep the different functional parts of the compiler from
bleeding into each other. This wasn't really necessary. At some point, we may
merge the different pieces into a single program.
The first step in the compilation sequence is preprocessing the source.
DecisionPlus has no macro or include functionality, but at some point in the
past someone figured out that you could use C's preprocessor if you didn't mind
ignoring all the warnings and complaints. Since only a small subset of the
preprocessor functionality is needed — constant substitution and
includes— we wrote our own C-style preprocessor in Perl so we could cut
down on the complaints.
Defining the Grammar
After preprocessing, the resulting source is fed into the parser. The parser
is again written in Perl, using Damian Conway's Parse::RecDescent
module to do the parsing. We'd considered using lex and yacc or some of the
other parser generators that are available, but using a perl-based parser made
our prototyping much easier and faster. Recursive descent parsers are
notoriously slow, as unoptimized grammars trigger a lot of backtracking. There
are some tricks you can use to speed up the parsing, and we'll get to those in
a bit.
Parse::RecDescent grammars are pretty standard as grammars go,
and the examples you'll find in most any compiler or parsing book will work just
fine. (Though you may have to do a bit of translation for LALR grammars, like
the ones fed to yacc) Each rule in a grammar has a set of one or more strings,
regular expressions, or sub-rules that must be matched in the input stream for
the rule to successfully match.
Parse::RecDescent has many handy features you can, for example,
have pieces of code that fire off whenever a rule matches, allowing your parser
to alter Parse::RecDescent's default tree. By default the tree
that's built is based on hashes. I found that this wasn't as easy to process as a
more traditional tree, but P::RDs documentation notes that if the
$::RD_AUTOACTION variable is set to q{[@items]} you get back an array of arrays
instead. I found this easier to process in the compiler, though that is a
matter of taste.
Since having an example is useful, we'll use the grammar in Example 1
throughout the article. This is a grammar specifically for a recursive descent
parser (and will work fine with Parse::RecDescent). If you've never
written a grammar before, or are only familiar with LALR grammars such as the
ones that yacc uses, it may seem a little odd.
Example 1. the example mathematical grammar
field: /\b\w+\b/
stringconstant: /'[^']*'/ | /"[^"]*"/
float: /[+-]?(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?/
constant: float | stringconstant
addop: '+' | '-'
mulop: '*' | '/'
modop: '%'
cmpop: '<>' | '>='| '<=' | '<' | '>' | '='
logop: 'and' | 'or'
parenexpr: '(' expr ')'
simplevalue: parenexpr | constant | field
modval: <leftop: simplevalue modop simplevalue>
mulval: <leftop: modval mulop modval>
addval: <leftop: mulval addop mulval>
cmpval: <leftop: addval cmpop addval>
logval: <leftop: cmpval logop cmpval>
expr: logval
declare: declare field
assign: field = expr
print: print expr
statement: assign | print | declare
In this case the grammar is for a mathematical or logical expression, with
standard precedence. (That is, modulus comes first, then multiplication and
division, then addition and subtraction, then comparisons, then logical
operations.) We also have a variable declaration rule, a value assignment
rule, and a print rule. There's a separate rule for each level of
precedenceadding in more levels is just a matter of adding in more rules. We've
also added a variable declaration rule, an assignment rule, and a printing rule
so we can hold temporary values and see what we've produced.
If you've used yacc before, you might wonder why we go to all this trouble.
This is because an LL, or recursive descent, grammar can't be left-recursive.
The leftmost term of a rule can't be the rule itself, as this would lead to
infinite recursion when evaluating the grammar. LR and LALR grammars, such as
those that yacc uses, don't have this limitation, making precedence rules much
simpler.
When this grammar is run over an expression, it will build up a tree that
represents the expression. We can feed that tree to our compiler to produce
code. In our case we're going to produce Parrot code, but you could easily use
this to generate C, Perl, or Python code, or even native machine code if you
were so inclined.
Pages: 1, 2, 3
|
Next Page |