Regular Expressions in C - http://the-edge.taht.net/

Regular Expressions in C
C is a PITA, but I knew that already

Regular expressions are a devilishly useful mini-language. Every few months I’ll identify a place where a regular expression would be useful. If I’m working in a language that supports them cleanly, like Perl, I’ll burn the hours or DAYS required to write the one line of code required to use them in the application.

Every high level language has a regex library; all have subtle differences.

This morning, I wanted to transform a string like this: “Test, Bold …”

into a string like this: “Test, Bold.”

Unfortunately, this week, I’m programming in C. C lacks a conventional string type, so memory management is a problem, but I’m used to that. I don’t have the PCRE library available (this is an embedded system), but the POSIX standard regex library is on the system. The root of all regex libraries is embedded in the regex implementation in C, so figuring out how to use that directly should be easy, right?

Ha.

The manual page for the regex library is unhelpful, lacking even an example. All the C examples I could find on the web were embedded in other languages and other libraries, too general purpose to extract for my very specific case.

My attempt below fails to compile the regexp entirely, and when I substitute a couple of simpler regexes, I fail to get anything good, either. Obviously there’s something about POSIX regexes that’s different that I don’t understand. Yet. Or, I’m doing something stupid with a pointer. It’s hard to tell.

{{ % highlight c %}}
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
#include <unistd.h>

// What I want to do is find a set of html tags so I can strip them out
const char *regexstr = "</?(?i:script|a|b|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>";
const char *teststring = "Test, Bold …";

// And also find any place with more than one dot and eliminate them
// (but I'm not there yet)

// another example regex to match an email address
// const char *regexstr = "<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>";
// const char *teststring = "example @example.org";

#define OUTBUF (64*1024)
#define MAXMATCH 60

int main(int argc, char **argv) {
    char outputstr[OUTBUF];
    regex_t *pattern_buffer = malloc(sizeof(regex_t));
    regmatch_t pmatch[MAXMATCH];
    int res;

    if (pattern_buffer == NULL) {
        perror("malloc");
        exit(-1);
    }
    if ((res = regcomp(pattern_buffer, regexstr, REG_ICASE|REG_EXTENDED)) != 0) {
        regerror(res, pattern_buffer, outputstr, OUTBUF);
        printf("regex compilation error %d: %s\n", res, outputstr);
        exit(-1);
    }
    // note: a plain "no match" (REG_NOMATCH) also ends up in this branch
    if ((res = regexec(pattern_buffer, teststring, MAXMATCH, pmatch, 0)) != 0) {
        regerror(res, pattern_buffer, outputstr, OUTBUF);
        printf("regex execution error %d: %s\n", res, outputstr);
        exit(-1);
    }
    for (int i = 0; (i < MAXMATCH) && (pmatch[i].rm_so != -1); i++) {
        // rm_eo is an end offset, not a length, so subtract rm_so
        write(1, &teststring[pmatch[i].rm_so], pmatch[i].rm_eo - pmatch[i].rm_so);
    }
    regfree(pattern_buffer);
    free(pattern_buffer);
    return 0;
}
{{ % /highlight %}}

I guess I’m going to sit here and slowly write this one line of code, ever simplifying, or tracing a few other languages, until enlightenment hits. It would be faster to solve this one programmatically, walking the string for each pattern using something like sscanf, actually. But a regexp is “the right thing”. It’s definitely a Monday. Lots of searching and thinking to do today… for one line of code.

For all I know, C library regexps don’t do Unicode, either.

February 10, 2011
493 words

Tags
code
