Hipmunk's Hotel Data Pipeline ✈︎

The Problem

When you search for hotels on Hipmunk, we search dozens of partner APIs and merge the results so you can easily compare rates and find the best one. For example, if you're visiting San Francisco and want to stay at the Hotel Kabuki, we'll show you what the rates are from Expedia, Hotels.com, Priceline, and everyone else.


It's a straightforward idea but behind the scenes there's a lot going on to make it possible. The main problem is that there isn't one centralized hotel database that everyone can share. Every partner has their own distinct hotel database which means we have to match up each partner's version of the Hotel Kabuki against our own internal version. We do this through a process called hotel matching.


Before we dive into the details of hotel matching, let's start with how we get the data. Each partner makes their hotel inventory available to us via FTP or an API. We have jobs called loaders, usually one per partner, that will download a partner's inventory and store it in a Postgres table (in a common format for Hipmunk).

A key responsibility of the loaders is data validation. Since hotel data is often updated by humans, it's important to validate it to check for typos and inconsistencies in the data (e.g. a hotel that says it's in Paris, France but has a lat/lng in Paris, Texas). Validation also helps normalize the data which makes it easier to match on later.

Hotel loaders are constantly running and the net result is that we have a continuously updated set of database tables, one per partner. The next step is to take those tables and merge them into one.

Hotel Matching

Hotel matching would be easy if there was a single identifier that we could just merge hotels on. But there isn't, so instead we must examine things like the hotel's name, address, geolocation, and other attributes. The solution we found to work best uses clustering. Basically we compare hotels using a distance function which looks for similarities between the various attributes of two hotels and outputs a score which represents how similar the hotels are. Hotels within a given distance threshold are merged together.

There are many different factors that go into the distance algorithm but the core of it centers around comparing names and addresses using a combination of ngrams, levenshtein distance, and keyword analysis. It's a game of tradeoffs because making the distance function more accurate requires bigger computational costs which results in longer cycles before a hotel's data is updated.



The naive approach to clustering would be to compare each hotel against every other hotel. This would result in an O(N^2) computation time and would end up taking weeks to complete. We speed things up by only comparing hotels within the same country or state and using quad-trees to only compare against the closest hotels.

Once we've ran our clustering algorithm we use that to build our canonical hotel database. Each cluster becomes a hotel and each hotel has references to the source hotels it came from. This way we are able to translate our version of Hotel Kabuki into any partner's version when it comes time to book.