Regex to match HTML tags
25 Feb 2005
in the wee hours
Matt Winckler
I am aware that regular expressions are singularly cumbersome when used on nested data such as HTML tags. However, when I started working on this problem I was under the impression it would be even more cumbersome to either write my own parser or install one (HTML Tidy looked particularly nice) on a server I do not own. Add to this the fact that I refuse to incorporate GPL’d code into this project (filthy communists), and I am left needing a solution that won’t take me weeks to write. I thought regular expressions were the answer. I still think so, though they have taken more time than I initially expected.
All I wanted to do was ensure that any open HTML tags were closed properly when summarizing entries for the front page. Over the past several months I have discovered time and time again that my regexes for finding open tags were flawed in some way. I’m still no PCRE master, but I think maybe I’ve finally got some working ones.
To find open tags, I use /(<\w+)(\s*[^>]*)(>)/. Parsing this out: first it looks for (<\w+), an opening bracket followed by one or more “word” characters (anything legal in a Perl variable name, i.e. letters, digits, and underscores). Following that, there can be any amount of whitespace (\s*), followed by zero or more characters that are not a closing bracket ([^>]*). Lastly there must be a closing bracket: (>).
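To see what each group captures, here is a minimal sketch in Python rather than whatever language the site actually runs on. The pattern is the one above; the function name find_open_tags and the sample string are made up for illustration.

```python
import re

# The open-tag pattern from the post, written as a Python raw string.
OPEN_TAG = re.compile(r"(<\w+)(\s*[^>]*)(>)")

def find_open_tags(html):
    """Return (tag name, attribute text) pairs for each opening tag found.

    Hypothetical helper, not the code the site uses.
    """
    results = []
    for match in OPEN_TAG.finditer(html):
        name = match.group(1)[1:]        # drop the leading "<"
        attrs = match.group(2).strip()   # text between the name and ">"
        results.append((name, attrs))
    return results

sample = '<p class="intro">Hello <em>world</em></p>'
print(find_open_tags(sample))
```

Closing tags are skipped because the slash after the bracket is not a word character. Note that the pattern as written would also match a self-closing tag such as <br />, with the slash landing in the second group.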
To find closed tags, I use /(<\/\w+>)/. This is simpler: all we want to find is an opening bracket followed by a forward slash and one or more word characters, with a closing bracket on the end.
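The same kind of sketch works for the closing-tag pattern, again in Python and again with a made-up helper name (find_close_tags). The slash needs no escaping here because Python does not use / as a pattern delimiter.

```python
import re

# The closing-tag pattern from the post; "/" needs no backslash in Python.
CLOSE_TAG = re.compile(r"(</\w+>)")

def find_close_tags(html):
    """Return the names of all closing tags, in the order they appear.

    Hypothetical helper, not the code the site uses.
    """
    # Each match looks like "</name>"; slice off "</" and ">".
    return [match.group(1)[2:-1] for match in CLOSE_TAG.finditer(html)]

sample = '<p class="intro">Hello <em>world</em></p>'
print(find_close_tags(sample))
```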
Once I have all open tags and closed tags, I compare the arrays, knocking off an open tag for each matching closed tag. The open tags remaining after that are naturally the ones that need to be closed, so I tack on a closing tag for them in the reverse order that they were found (to maintain HTML’s nesting), and I’m done.
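The comparison step could look roughly like the following. This is a sketch under assumptions, not the actual implementation: it is written in Python, the function name close_open_tags is invented, tag names are compared case-sensitively, and each closed tag knocks off the most recent open tag of the same name (the description above does not say which one is removed when a name repeats).

```python
import re

OPEN_TAG = re.compile(r"(<\w+)(\s*[^>]*)(>)")
CLOSE_TAG = re.compile(r"(</\w+>)")

def close_open_tags(html):
    """Append closing tags for any tags left open in an HTML fragment.

    Hypothetical helper illustrating the approach described above.
    """
    open_names = [m.group(1)[1:] for m in OPEN_TAG.finditer(html)]
    close_names = [m.group(1)[2:-1] for m in CLOSE_TAG.finditer(html)]

    # Knock off one open tag for each matching closed tag.
    # Assumption: remove the most recently opened tag with that name.
    for name in close_names:
        for i in range(len(open_names) - 1, -1, -1):
            if open_names[i] == name:
                del open_names[i]
                break

    # Whatever remains is still open; close in reverse order of discovery
    # so the nesting stays intact.
    closers = "".join("</" + name + ">" for name in reversed(open_names))
    return html + closers

summary = '<p>Some <em>summary <a href="x">text'
print(close_open_tags(summary))
```

Because this only counts names and never tracks where tags appear relative to each other, it shares the blind spots of the approach itself: self-closing tags like <br /> would get a stray closing tag appended.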
In theory, this works pretty well. For most cases, it works great. However, over the past few months I’ve discovered strange exceptions here and there that force a revision of the open tag regex. I’m posting its current version in the hopes it will be useful to somebody, but it may yet change as I discover new cases I have not previously thought of.
