Unbackslash
2024-9-23 03:0:0 Author: www.tbray.org(查看原文) 阅读量:2 收藏

Old software joke: “After the apocalypse, all that’ll be left will be cockroaches, Keith Richards, and markup characters that have been escaped (or unescaped) one too many (or few) times.” I’m working on a programming problem where escaping is a major pain in the ass, specifically “\”. So, for reasons that seem good to me, I want to replace it. What with?

The problem · My Quamina project is all about matching patterns (not going into any further details here, I’ve written this one to death). Recently, I implemented a “wildcard” pattern, that works just like a shell glob, so you can match things like *.xlsx or invoice-*.pdf. The only metacharacter is *, so it has basic escaping, just \* and \\.

It wasn’t hard to write the code, but the unit tests were a freaking nightmare, because \. Specifically, because Quamina’s patterns are wrapped in JSON, which also uses \ for escaping, and I’m coding in Go, which does too, differently for strings delimited by " and `. In the worst case, to test whether \\ was handled properly, I’d have \\\\\\\\ in my test code.

It got to the point that when a test was failing, I had to go into the debugger to figure out what eventually got passed to the library code I was working on. One of the cats jumped up on my keyboard while I was beset with \\\\ and found itself trying to tread air. (It was a short drop onto a soft carpet. But did I ever get glared at.)

Regular expressions ouch · That’s the Quamina feature I’ve just started working on. And as everyone knows, they use \ promiscuously. Dear Reader, I’m going to spare you the “Sickening Regexps I Have Known” war stories. I’m sure you have your own. And I bet they include lots of \’s.

(The particular dialect of regexps I’m writing is I-Regexp.)

I’ve never implemented a regular-expression processor myself, so I expect to find it a bit challenging. And I expect to have really a lot of unit tests. And the prospect of wrangling the \’s in those tests is making me nauseous.

I was telling myself to suck it up when a little voice in the back of my head piped up “But the people who use this library will be writing Go code to generate and test patterns that are JSON-wrapped, so they’re going to suffer just like you are now.”

Crazy idea · So I tried to adopt the worldview of a weary developer trying to unit-test their patterns and simultaneously fighting JSON and Go about what \\ might mean. And I thought “What if I used some other character for escaping in the regexp? One that didn’t have special meanings to multiple layers of software?”

“But that’s crazy” said the other half of my brain. Everyone has been writing things like \S+\.txt and [^{}[\]]+ for years and just thinks that way. Also, the Spanish Inquisition.”

Whatever; like Prince said, let’s go crazy.

The new backslash · We need something that’s visually distinctive, relatively unlikely to appear in common regular expressions, and not too hard for a programmer to enter. Here are some candidates, in no particular order.

For each, we’ll take a simple harmless regexp that matches a pair of parentheses containing no line breaks, like so:

Original: \([^\n\r)]*\)

And replace its \‘s with the candidate to see what it looks like:

Left guillemet: « · This is commonly used as open-quotation in non-English languages, in particular French. “Open quotation” has a good semantic feel; after all, \ sort of ”quotes” the following character. It’s visually pretty distinctive. But it’s hard to type on keyboards not located in Europe. Speaking of developers sitting behind those keyboards, they’re more likely to want to use « in a regexp. Hmm.

Sample: «([^«n«r)]*«)

Em dash: — · Speaking of characters used to begin quotes, Em dash seems visually identical to U+2015 QUOTATION DASH, which I’ve often seen as a quotation start in English-language fiction. Em dash is reasonably easy to type, unlikely to appear much in real life. Visually compelling.

Sample: —([^—n—r)]*—)

Left double quotation mark: “ · (AKA left smart quote.) So if we like something that suggests an opening quote, why not just use an opening quote? There’s a key combo to generate it on most people’s keyboards. It’s not that likely to appear in developers’ regular expressions. Visually strong enough?

Sample: “([^“n“r)]*“)

Pilcrow: ¶ · Usually used to mark a paragraph, so no semantic linkage. But, it’s visually strong (maybe too strong?) and has combos on many keyboards. Unlikely to appear in a regular expression.

Sample: ¶([^¶n¶r)]*¶)

Section sign: § · Once again, visually (maybe too) strong, accessible from many keyboards, not commonly found in regexps.

Sample: §([^§n§r)]*§)

Tilde: ~ · Why not? I’ve never seen one in a regexp.

Sample: ~([^~n~r)]*~)

Escaping · Suppose we used tilde to replace backslash. We’d need a way to escape tilde when we wanted it to mean itself. I think just doubling the magic character works fine. So suppose you wanted to match anything beginning with . in my home directory: ~~timbray/~.*

“But wait,” you cry, “why are any of these better than \?” Because there aren’t other layers of software fighting to interpret them as an escape, it’s all yours.

You can vote! · I’m going to run a series of polls on Mastodon. Get yourself an account anywhere in the Fediverse and follow the #unbackslash hashtag. Polls will occur on Friday September 27, in reasonable Pacific times. Of course, one of the options will be “Don’t do this crazy thing, stick with good ol’ \!”




文章来源: https://www.tbray.org/ongoing/When/202x/2024/09/22/Unbackslashing
如有侵权请联系:admin#unsafe.sh