Regular expressions are a devilishly useful mini-language. Every few months I’ll identify a place where a regular expression would be useful. If I’m working in a language that supports them cleanly, like Perl, I’ll burn the hours or DAYS required to write the one line of code required to use them in the application.
Every high-level language has a regex library; all have subtle differences.
This morning, I wanted to transform a string like this: “Test, Bold …“ “”“)
into a string like this: “Test, Bold.”
Unfortunately, this week, I’m programming in C. C lacks a conventional string type, so memory management is a problem, but I’m used to that. I don’t have the PCRE library available (this is an embedded system), but the POSIX standard regex library is on the system. The root of all regex libraries is embedded in the regex implementation in C, so figuring out how to use that directly should be easy, right?
Ha.
The manual page for the regex library is unhelpful, lacking even an example. All the C examples I could find on the web were embedded in other languages and other libraries, too general purpose to extract for my very specific case.
My attempt below fails to compile the regexp entirely, and when I substitute a couple of simpler regexes, I fail to get anything good, either. Obviously there’s something about POSIX regexes that’s different that I don’t understand. Yet. Or, I’m doing something stupid with a pointer. It’s hard to tell.
{{ % highlight c %}}
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// What I want to do is find a set of html tags so I can strip them out
const char *regexstr = "</?(?i:script|a|b|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>";
const char *teststring = "Test, Bold ...";

// And also find any place with more than one dot and eliminate them
// (but I'm not there yet)

// another example regex to match an email address
// const char *regexstr = "<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>";
// const char *teststring = "example @example.org";

#define OUTBUF (64*1024)
#define MAXMATCH 60

int main(int argc, char **argv)
{
    char outputstr[OUTBUF];
    regex_t *pattern_buffer = malloc(sizeof(regex_t));
    regmatch_t pmatch[MAXMATCH];
    int res;

    if ((res = regcomp(pattern_buffer, regexstr, REG_ICASE | REG_EXTENDED)) != 0) {
        regerror(res, pattern_buffer, outputstr, OUTBUF);
        printf("regex compilation error %d: %s\n", res, outputstr);
        exit(-1);
    }
    if ((res = regexec(pattern_buffer, teststring, MAXMATCH, pmatch, 0)) != 0) {
        regerror(res, pattern_buffer, outputstr, OUTBUF);
        printf("regex execution error %d: %s\n", res, outputstr);
        exit(-1);
    }
    for (int i = 0; (i < MAXMATCH) && (pmatch[i].rm_so != -1); i++) {
        // rm_so and rm_eo are offsets, so the length is their difference
        write(1, &teststring[pmatch[i].rm_so], pmatch[i].rm_eo - pmatch[i].rm_so);
    }
    regfree(pattern_buffer);
    free(pattern_buffer);
    return 0;
}
{{ % /highlight %}}
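One likely culprit, for what it’s worth: `(?i:...)` and the lazy `*?` are Perl-style extensions, not part of POSIX extended regular expressions, which would explain why `regcomp` rejects the pattern. Case-insensitivity comes from the `REG_ICASE` flag instead, and `[^>]*` can stand in for the lazy match. Below is a minimal sketch under those assumptions, not a definitive implementation; `strip_tags` and `TAG_PATTERN` are hypothetical names, and the pattern is a guess at the intent.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Assumed tag list, taken from the attempt above. POSIX ERE has no \b, so the
 * optional group ([^a-zA-Z0-9>][^>]*)? stands in for a word boundary: after
 * the tag name comes either '>' or a non-alphanumeric character. */
static const char TAG_PATTERN[] =
    "</?(script|a|b|embed|object|frameset|frame|iframe|meta|link|style)"
    "([^a-zA-Z0-9>][^>]*)?>";

/* Hypothetical helper: copy src into dst, leaving out every match of the
 * tag pattern. Returns 0 on success, -1 on a compile error or if dst is
 * too small. */
int strip_tags(const char *src, char *dst, size_t dstlen)
{
    regex_t re;
    regmatch_t m;
    size_t out = 0;
    int res;

    assert(src != NULL && dst != NULL);

    res = regcomp(&re, TAG_PATTERN, REG_EXTENDED | REG_ICASE);
    if (res != 0) {
        char msg[256];
        regerror(res, &re, msg, sizeof msg);
        fprintf(stderr, "regex compilation error %d: %s\n", res, msg);
        return -1;
    }

    /* Each regexec() call finds the leftmost match in the remaining text;
     * rm_so and rm_eo are offsets relative to the pointer passed in.
     * The loop also stops on an empty match so it can never spin. */
    while (regexec(&re, src, 1, &m, 0) == 0 && m.rm_eo > m.rm_so) {
        size_t keep = (size_t)m.rm_so;      /* text before the tag */
        if (out + keep >= dstlen) {
            regfree(&re);
            return -1;
        }
        memcpy(dst + out, src, keep);
        out += keep;
        src += m.rm_eo;                     /* resume after the tag */
    }
    regfree(&re);

    size_t rest = strlen(src);
    if (out + rest >= dstlen)
        return -1;
    memcpy(dst + out, src, rest + 1);       /* includes the terminating NUL */
    return 0;
}
```

Called with a made-up input such as `Test, <b>Bold</b> <a href="#">link</a> <body>` and a large enough buffer, this should leave `Test, Bold link <body>`: the listed tags go, their contents stay, and `<body>` survives because `b` has to be the whole tag name.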
I guess I’m going to sit here and slowly write this one line of code, ever simplifying, or tracing a few other languages, until enlightenment hits. It would be faster to solve this one programmatically, walking the string for each pattern using something like sscanf, actually. But a regexp is “the right thing”. It’s definitely a Monday. Lots of searching and thinking to do today… for one line of code.
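The programmatic route is at least easy to sketch for the “more than one dot” half of the problem. The following is a rough illustration, not a finished solution: `collapse_dots` is a hypothetical helper, it walks the string by hand rather than with sscanf, and dropping the spaces in front of the dots is an assumption made so that “Bold ...” ends up as “Bold.”.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrite s in place so that every run of '.' becomes a
 * single '.', and spaces directly before a run are dropped (an assumption,
 * made so that "Bold ..." becomes "Bold."). */
void collapse_dots(char *s)
{
    assert(s != NULL);
    char *out = s;                  /* write position, never ahead of in */

    for (const char *in = s; *in != '\0'; in++) {
        if (*in != '.') {
            *out++ = *in;           /* ordinary character: copy it */
            continue;
        }
        while (out > s && out[-1] == ' ')
            out--;                  /* back up over spaces before the dot */
        if (out == s || out[-1] != '.')
            *out++ = '.';           /* emit at most one dot per run */
    }
    *out = '\0';
}
```

Because the output is never longer than the input, the rewrite can happen in place, which sidesteps the memory management that makes the regex version awkward.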
For all I know, C library regexps don’t do Unicode, either.