Monday, December 21, 2015

BK-Trees, Toad Poison, and Tweaked Breakfast

Today I Learned:
1) ...the BK-tree algorithm for fuzzy text searching.

Consider the following problem. You are in charge of a network of a whole lot of emergency response drones, sitting in randomly-positioned stations across the state (whichever state you're in will do fine; if you're not in a state, then whichever country you're in will do fine; if you're not in a country, then you can fill in the 100 km radius centered on you will do fine). When an emergency of any kind is detected, the nearest drone requests the aid of all other drones within some radius to help. This needs to happen as quickly as possible.

The trouble is, although the drones are good at helping with emergencies, they're not very good at communicating, so the dispatching has to be done from a central control station. You, at the control station, can ask any two drones to check how far apart they are from each other, but this takes a non-trivial amount of time. That's all the information you get. A first-response drone requests assistance, and you can ask for the distances between drones, and you have to find all of the drones within some radius of the first-response drone and tell them to go help.

How do you tackle this problem? As always, I encourage you to take ten minutes to think about this. Right now. Set a timer for ten minutes, and just scribble out ideas on a piece of paper, if you have one.

...

...

...

...

So, the simplest thing you can do would be to go through the list of all of the emergency response drones, ask them how far they are from the first-response drone, and send any within some radius to go help. That's fine, unless you really, really have a lot of drones. Say it takes a couple of seconds to query one pairwise distance, and you have a million drones. It would take months to run that search, which is probably not fast enough.

If the drones are stationary, you could always all of your spare time between emergencies querying drone distances and writing them down on a table -- you'd end up with a big grid of pairwise distances between drones, where the distance in row X column Y is the distance between drone Y and column X. You could sort each row of the table so that you could just read off the answers from one side of the table. That gives you great emergency response time... but it's not very space-efficient. I don't know how many books it takes to hold a million-by-a-million entry table, but I'm guessing it isn't small. For scale, even if you digitized the table and put it in a computer, a million-by-a-million table is still order of terabytes in size. That's... maybe doable, if this is important enough. Not very efficient, though, and it won't scale well if you go to, say, a billion drones, because the size of the table gets scales with the square of the number of drones.

All of this is essentially based on the algorithm "check all of the distances from each drone to the first-responder drone" to find close drones. The BK-tree algorithm is another approach you can take to find drones in range. It's not as intuitive, but it turns out to be much more time and space efficient. The basic idea is to use the triangle inequality (of geometry) to narrow down the set of drones you have to check by excluding everything outside of arbitrarily-centered rings containing the region you're looking for. That doesn't make sense? Let's walk through it.

First, consider a geometric way of thinking about the "check all the distances from each drone to the first-responder drone". You could, if you have paper, a pencil, a ruler, and a compass, solve this problem by *mapping out* all of the drones' locations on a map (beforehand), then drawing a circle around the first-responder drone with the required radius and seeing which drones fall in circle. That's the simple way to solve the problem, but you can't really do that programmatically. It's fine if a human operator can just *do* it, but if this system needs to be automated (or you couldn't draw the circles -- that's coming up later), then the draw-a-circle method isn't going to work fantastically well.

The BK-tree method, interpreted geometrically, looks works like the following. Pick a random drone to be a reference drone. Your first responder drone requests the assistance of everything within radius R_assist. You measure the distance from the reference drone to the first-responder drone -- call it R_reference. Draw a circle around the reference drone (not the first responder drone!) of radius (R_reference + R_assist) and another of radius (R_reference - R_assist). The ring between these two circles has to contain all of the drones of interest. Why? That's where the triangle inequality comes in -- the length of one side of a triangle can't be greater than the sum of the lengths of the other two sides, or put another way, the distance between two drones can't be greater than the sum of the distances from those drones to an arbitrary third drone. In other words, if a drone is farther from the reference than (R_reference + R_assist) or closer than (R_reference - R_assist), then it can't be within R_assist of the first responder. If this isn't obvious, try drawing out a case.

Now, you've cut down the space of drones to check by potentially quite a bit, but there could still be drones in that ring that aren't drones of interest. You'll have to do some kind of search on the remaining possibly-useful drones to find all the close ones. If there are only a few left, you could use the "check all the distances" algorithm. If there are lots of drones still in that ring, then you can just pick another reference drone (within the ring) and apply the BK-tree method again (using only elements of the ring)!

"Wait a second", you might be thinking, "don't you still have to search through all of the drone distances to figure out which ones are in the ring?" Sort of. The beautiful thing about the reference drone, though, is that it doesn't matter which one you use -- any drone can be a reference drone, and the BK-algorithm will still cut down on the search time dramatically. That means that you can make a table of distances, just as you could for the simple algorithm, but *only for one drone*. That means the size of the table scales linearly with the number of drones, so a table for a billion drones will fit snugly on a decently-sized USB stick. There's actually an even *better* way to pre-cache the distances so that the whole thing is recursive, which lets you do multiple rounds of pruning-by-ring until you're down to single elements -- for details on the actual BK-tree structure that the algorithm is named after, see http://ift.tt/1O2ZzPw.

One more cool thing about the BK-tree algorithm, which is actually central to its usefulness. The BK-tree algorithm is a geometric algorithm, but the nifty thing is that it only uses one axiom of geometry -- the triangle inequality. (Ok, technically it's probably also using symmetry and a couple other trivial properties, but the triangle inequality is the big one). That means that (very much UNLIKE the "draw a circle and see what falls inside" algorithm) you can apply the BK-tree algorithm to objects in other spaces that satisfy the triangle inequality, and the algorithm doesn't change one bit. For instance, the BK-tree algorithm works just as well for drones scattered across a nebula. Or a galaxy. Or a pocket universe whose dimension is unknown. Or, more practically, the space of *words*. In fact, this is the algorithm behind autocorrect and other fuzzy-search features. Say you want to find all of the words that are less than N single-letter mistakes away from a search term. To the BK-tree algorithm, it's the same problem. The space is different, but there's still a distance metric (number of single-letter changes to get from one to another) on objects (words) in that space, and the triangle inequality still holds there, so the algorithm still works.

For the record, I learned all this because instead of getting out of bed in the morning, I decided to pull out my laptop and google "cool algorithms". After wading through some mucky probabilistic algorithms, I ran across this one, learned it, and decided I would try to find a more geometric, visualizable explanation for it. Let me know if it worked!

2) Toads have poison sacks behind their eyes -- not *behind* their eyeballs, but near the surface of the skin, next to the eyeballs but farther from the nose. Thanks to Chigozie on this one!

3) A variation on my breakfast that will save some trouble. I favor a bowl of cereal with blueberries most mornings. The old protocol for making this breakfast was to cover the bottom of the bowl with blueberries, microwave it for 1 minute, add ceral on top, and stir. Today I realized that if I put the cereal in *before* microwaving the blueberries, it keeps the blueberries from splattering while they heat, which saves either a paper towel for cleaning or water and time sponging out the microwave. (The ceral comes out a *little* bit more rubbery, but it also absorbs the blueberries a little better, especially at the bottom of the bowl).

No comments:

Post a Comment