Q Numbers
2024-7-10 03:0:0 Author: www.tbray.org(查看原文) 阅读量:0 收藏

This ongoing fragment describes how to match and compare numbers using a finite automaton, which involves transforming them into strings with the right lexical properties. My hope is that there are at least twelve people in the world who are interested in the intersection of numeric representation and finite automata.

Background · (Feel free to skip this part if you already know about Quamina.)

This is yet another entry in the Quamina Diary series of blog posts. Quamina is a Go-language library that allows you to compile a bunch of “Patterns” together and, when presented with “events”, i.e. JSON data blobs, inform you which (if any) of the Patterns match each event, at a speed which is high (often millions/second) and only weakly related to the number of Patterns in any Quamina instance.

Quamina was inspired by AWS Event Ruler (“Ruler” for short), a package I helped develop while at AWS that has since been open-sourced. (Thanks, AWS!). By “based on” I mean “does a subset of the same things compatibly, with a design that is quite different, in interesting ways”. Quamina is also a fruitful source of software geekery for me to write about here, which I enjoy.

The problem · Suppose you want to match records for boxes whose height is 20cm. A sample of such a record, with most fields removed:

{
  "box": {
    "dimensions": {
      "width": 100,
      "height": 20,

(Much omitted.)

A Quamina Pattern designed to match those records would look like this:

{
  "box": {
    "dimensions": { 
      "height": [ 20 ]
    }
  } 
}

All good so far. But what if, due to some upstream computer program or inventory admin, a message showed up like so?

{
  "box": {
    "dimensions": {
      "width": 100.0,
      "height": 20.0,

Up until my last PR landed, Quamina didn’t know that “20” and “20.0” and “2.0e1” were the same quantity; it knew how to compare strings to other strings and that was all. Which was unsatisfactory. And a problem which had been solved years ago (partly by me) in Ruler.

Question · Pause a moment and ask yourself: How would you write a finite automaton which would Do The Right Thing with numbers? I’m not going to claim that the way Ruler and Quamina do it is optimal, but it’s worked pretty well for years and processed trillions of events with no breakages I know of.

Our answer: normalize the numbers into fixed-sized strings whose lexical ordering is that of the numbers they represent. Code first:

func qNumFromFloat(f float64) (qNumber, error) {
	if f < -FiveBillion || f > FiveBillion {
		return nil, errors.New("value must be between -5e9 and +5e9 inclusive")
	}
	value := uint64(TenE6 * (FiveBillion + f))
	return toHexStringSkippingFirstByte(value), nil
}
    

Constraints · Quamina requires, for numeric matching to work properly, that:

  1. The numbers be between -/+5×109, inclusive.

  2. The numbers have no more than five digits to the right of the decimal point.

You’ll notice that the code above enforces the first condition but not the second. We’ll get to that.

Effects · So, what that code is doing is:

  1. Adding 5.0e9 to the number so it’s in the range 0 … 10.0e9.

  2. Multiplying by 106 to push the five-digit fractional part to the left of the decimal point, preserving its sixth digit (if any) so the rounding in the next step works.

  3. Converting it from a float64 into a uint64.

  4. Turning that into a big-endian 14-byte hex string.

So any number that meets the constraints above is represented as 14 hex digits whose lexical order is consistent with the underlying numbers. “20”, “20.0” and “2.0e1” are all “11C3793911AD00”. Which means that this Pattern will do what reasonable people expect:

{"box": { "dimensions": { "height": [ 20 ] } } }

More formally · There are 1015 numbers that meet the constraints described above. This process maps them into hex strings. The first three and their mappings are:

-5,000,000,000, -4,999,999,999.99999, -4,999,999,999.99998
00000000000000,       00000000000009,       00000000000014

And the last three:

4,999,999,999.99998, 4,999,999,999.99999,  5,000,000,000
     2386F26FC0FFEC,      2386F26FC0FFF6, 2386F26FC10000

Less formally · This includes “most” numbers that are used in practice, including prices, occurrence counts, size measurements, and so on.

Examples of numbers that do not meet these criteria include AWS account numbers, some telephone numbers, and cryptographic keys/signatures. For those, Quamina just preserves the digits, whatever they may be, and in fact, this also usually ends up doing what people expect.

Could we do better? · I think so. To start with, hex digits are an inefficient way to represent bits; there are many other options.

The current hex approach hasn’t changed since a very early version of Ruler because it’s never been a pain point.

Speaking of Ruler, they recently landed a PR that lets them have 6 fractional digits as opposed to Quamina’s 5, simply by using decimal rather than binary arithmetic. It’s fast, too! This was made easier by the fact that Java has BigDecimal built in, while Go doesn’t. There are good open-source options out there, but I am extremely reluctant to accept dependencies in a package as low-level as Quamina. I don’t think matching more one more fractional digit justifies the cost of a dependency.

“Q numbers?” · In Ruler, the data type is ComparableNumber. In Quamina I toyed with comparableNumber and canonicalNumber but neither is quite right and both produced excessively long and ugly variable and function names. So I decided to call them “Q numbers”, where Q is for Quamina. The code became noticeably more readable.

While the set of Rationals is called “Q”, the only other significant use of the phrase “Q number” is some weird old thing out of Texas Instruments.

Practicalities · The code to enforce the constraints and do the conversion isn’t that cheap. When I first naively dropped it in, I saw a nasty perforance regression. Code optimization helped, but I realized then that it’s really important not to convert an incoming field that happens to be a JSON number into a Q number unless you know, first, that it meets the constraints and second, that the finite automaton we’re trying to match has a Pattern with a numerical match that also met the Q number constraints.

The one moderately clever stroke here relies on the fact that Quamina has its own JSON parser, because Reasons. The parser obviously has to step its way through numbers, and it’s easy enough there to notice where the syntax characters like “.” and “e” are and cheaply figure out if the decimal is the right size.

Conclusion · Quamina now knows numbers. It’s a little slower to match a 14-digit Q than a string like “20”, but finite automata are fast and anyhow, being right matters.



文章来源: https://www.tbray.org/ongoing/When/202x/2024/07/09/Q-Numbers
如有侵权请联系:admin#unsafe.sh