Tuesday, June 13, 2006

Fast string token identification

Ever needed to parse a string and identify some parts of it, or identify some tokens found within it?
Well when I had compiler construction, and had to make a compiler, this was an essential part of the compiler as you had to identify tokens and send it to the parser to see if it was in the right order to (be in a valid syntax) make a sentence of your own programming language.


If you had read a line of source code (for instance) : if(i == 0)
The lexical analyser had to return: TOK_IF ; TOK_LBRACK ; TOK_ID(i) ; TOK_EQUALS ; TOK_INT(0) ; TOK_RBRACK

But when you've implemented the lexical analyser (using a non-traversal matrix, too big and too long to implement in such a short time) you sat with a series of "else if" statements to check if you came accross a keyword, to bind it to the right token.
For instance, lets say the lexical analyser came accross : "if"
Now you had to identify that string with a list of tokens in order for the compiler to know that it is a TOK_IF token.

if(!strcmp(stream, "for"))
return TOK_FOR;
else if(!strcmp(stream, "while"))
return TOK_WHILE;
else if(!strcmp(stream, "if"))
return TOK_IF;
.
.
.

In C/C++, using c-strings, that would have been the default route to pick, wouldn't it?
But thankfully, I sat and thoaght about hashtables for a while and also played around with C++'s MACROs. Then I finally tried something that worked. NOTE : I believe that this will only work in C/C++ as other languages (that I know of) doesn't support PREPROCESSOR Directives.
Another + point for C/C++ ;)
What did I do? I managed to find a way to compare a string with a bunch of strings in C++ in O(1) time, using a switch statement. How on earth did I do that?
Heres how I did it:

Firstly, you'll need a bunch of MACROs for the switch statement. You cannot use functions in a switch statement, the compiler will complain because it won't calculate the end result of each case-part in compile time except if you use MACROs.

#define H2(X1, X2) \
((X1 * 3 + X2) * 3)
#define H3(X1, X2, X3) \
(((X1 * 3 + X2) * 3 + X3) * 3)
#define H4(X1, X2, X3, X4) \
((((X1 * 3 + X2) * 3 + X3) * 3 + X4) * 3)
#define H5(X1, X2, X3, X4, X5) \
(((((X1 * 3 + X2) * 3 + X3) * 3 + X4) * 3 + X5) * 3)
#define H6(X1, X2, X3, X4, X5, X6) \
((((((X1 * 3 + X2) * 3 + X3) * 3 + X4) * 3 + X5) * 3 + X6) * 3)
#define H7(X1, X2, X3, X4, X5, X6, X7) \
(((((((X1 * 3 + X2) * 3 + X3) * 3 + X4) * 3 + X5) * 3 + X6) * 3 + X7) * 3)

Whats that you may wonder? Well, you didn't think that you can use a string directly in a switch statement now did you? Those MACROs are to convert your "string" to an integer representation of it to be used in a switch statement. So if you have string of 3 characters you have to use those MACROs like this: H3('f', 'o', 'r')
Thats the closest you'll get to putting a string in a switch statement.
Next you need a function to convert the string (you'd like to compare) into an integer representation of it:

int getHash(const char *str, int len)
{
int
i;
int res = 0;

for(
i = 0; i < len; ++i)
{
res += str[i];
res *= 3;
}

return
res;
}

The 3's that you see being multiplied by is a prime number. Those who've studied hash tables will know why.

Now we have all the tools we need to implement our fast string token identifier.

switch(getHash(str, len))
{
case
H3('f', 'o', 'r'):
return
TOK_FOR;
case
H5('w', 'h', 'i', 'l', 'e'):
return
TOK_WHILE;
case
H2('i', 'f'):
return
TOK_IF;
}

Well I believe that there are many applications for this kind of thing. I hope you find it usefull somewhere. Well, if you're doing C/C++ that is!

No comments: