2M Here. 6M There.
It’s easy to get swept up in the frenzy of numbers being released from vacation rental listing giants like Airbnb and Booking. Just last week, they shared new listing count figures that show Airbnb coming out on top, giving them additional momentum as they head toward their planned IPO.
Between HomeAway and Airbnb alone, there are an estimated 8,000,000 vacation rental listings worldwide. But because many rental properties are listed on multiple platforms, it would be a huge mistake to equate the number of listings with the total number of properties.
To truly understand the size of the vacation rental industry, dual-listed rentals must be accounted for. Until recently, there was no way to accurately cross-reference listings across platforms.
Enter: The AirDNA Data Science Team
At AirDNA, we’ve focused most of our efforts in the past on Airbnb data. For 2019, we wanted to take a big step forward and integrate HomeAway data into MarketMinder, our app that shows short-term rental supply, demand, and pricing predictions across 80,000 markets worldwide.
In March, we announced just that: the merging of global HomeAway data and Airbnb data in our reporting tools.
But combining these two data sources presented a massive challenge. If we simply merged the Airbnb data and HomeAway data, we ran the risk of overstating market-wide supply numbers, in some cases by as much as 35% on an annualized basis.
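The arithmetic behind that overstatement can be sketched in a few lines. This is a minimal illustration, not AirDNA’s reporting code: the function names and the market counts are hypothetical, and it assumes the number of dual-listed properties is already known.

```python
def unique_properties(airbnb_listings: int, homeaway_listings: int, dual_listed: int) -> int:
    """Count distinct properties when `dual_listed` of them appear on both platforms."""
    return airbnb_listings + homeaway_listings - dual_listed


def overstatement(airbnb_listings: int, homeaway_listings: int, dual_listed: int) -> float:
    """Fraction by which a naive sum of listings exceeds the distinct property count."""
    naive = airbnb_listings + homeaway_listings
    unique = unique_properties(airbnb_listings, homeaway_listings, dual_listed)
    return (naive - unique) / unique


# Hypothetical market: 700 Airbnb listings, 500 HomeAway listings, 300 on both.
print(unique_properties(700, 500, 300))        # 900
print(round(overstatement(700, 500, 300), 3))  # 0.333
```

In this made-up market, simply adding the two listing counts would overstate supply by a third.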
This would have negative implications for real estate investors who use the information to decide where to buy their next rental property, hosts who base pricing decisions — at least in part — on market-wide occupancy figures, and local governments who use supply numbers to estimate tax revenue from short-term rentals.
The Secret Sauce
Accounting for dual-listed rentals is not as easy as it sounds.
They can’t simply be identified based on physical address, because that information isn’t publicly available, and their geo-locations are intentionally obfuscated. Rental titles and descriptions can vary widely between listing platforms. To make matters worse, they often change throughout the year as savvy hosts and property managers update rental information to capitalize on seasonal attractions and high-demand events.
To accurately identify cross-platform listings that belong to the same rental property, AirDNA’s Lead Machine Learning Engineer, Erich Wellinger, developed an algorithm for identifying properties listed on multiple platforms. The algorithm is an XGBoost classifier, implemented in Python, that uses more than fourteen different listing features (for a gentle introduction to XGBoost see this blog post).
Features such as location, listing titles, listing descriptions, number of bedrooms, and calendar availability are fed into a gradient-boosted tree model that outputs the probability that a given pair of listings is the same underlying property.
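To make the idea of pairwise features more concrete, here is a minimal sketch of how a few of them might be computed for one candidate pair. Everything in it is an assumption for illustration: the function names, the dictionary layout, the normalization used for the Levenshtein ratio, and the choice to represent a calendar as a list of booleans (True for an unavailable night). The classifier step itself is not shown.

```python
import math


def levenshtein_ratio(a: str, b: str) -> float:
    """Edit-distance similarity in [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def calendar_percent_match(cal_a: list, cal_b: list) -> float:
    """Share of days on which both calendars show the same status."""
    if len(cal_a) != len(cal_b) or not cal_a:
        raise ValueError("calendars must be non-empty and cover the same days")
    return sum(x == y for x, y in zip(cal_a, cal_b)) / len(cal_a)


def calendar_cosine_similarity(cal_a: list, cal_b: list) -> float:
    """Cosine similarity of two binary availability vectors."""
    dot = sum(int(x) * int(y) for x, y in zip(cal_a, cal_b))
    norm_a = math.sqrt(sum(int(x) for x in cal_a))
    norm_b = math.sqrt(sum(int(y) for y in cal_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_features(listing_a: dict, listing_b: dict) -> dict:
    """Build one feature row for a candidate pair of listings (hypothetical layout)."""
    return {
        "title_levenshtein_ratio": levenshtein_ratio(listing_a["title"], listing_b["title"]),
        "bedrooms_match": int(listing_a["bedrooms"] == listing_b["bedrooms"]),
        "calendar_percent_match": calendar_percent_match(
            listing_a["calendar"], listing_b["calendar"]
        ),
        "calendar_cosine_similarity": calendar_cosine_similarity(
            listing_a["calendar"], listing_b["calendar"]
        ),
    }


airbnb = {"title": "2-bedroom Condo w/pool", "bedrooms": 2,
          "calendar": [True, True, False, False]}
homeaway = {"title": "2BR Condo with pool", "bedrooms": 2,
            "calendar": [True, False, False, False]}
print(pair_features(airbnb, homeaway))
```

A row like this, computed for every candidate pair, is the kind of input a gradient-boosted classifier could be trained on.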
Two’s a Pair. Three’s a Crowd.
Sometimes, several potential matches are found for a single rental.
This happens a lot in ski and beach towns, where many units within the same apartment complex are used as vacation rentals. They share the same general location, usually have the same number of bedrooms, feature shared amenities such as a pool, and have similar titles, e.g., “2-bedroom Condo w/pool, steps from Daytona Beach.”
They are virtually identical units within the same building, yet they are physically different rentals. These are especially difficult to parse using data.
Let’s walk through how it works.
| Feature Type                                       | Highest-Scoring Match Score | Lowest-Scoring Match Score |
|----------------------------------------------------|-----------------------------|----------------------------|
| Description Levenshtein Ratio                      | +2.359                      | +2.337                     |
| Calendar Percent Match                             | +1.663                      | -1.773                     |
| Title Term Frequency – Inverse Document Frequency  | +1.271                      | +0.195                     |
| Difference in # Images                             | +0.633                      | -0.377                     |
| Rental Type Match                                  | +0.348                      | +0.351                     |
The ELI5 package lets you break down a model’s prediction and see why the model produced it. For the subset of features listed above, the scores show how strongly, and in which direction, each feature pushed the final prediction. In the highest-scoring match, the calendars lined up with one another, which moved the prediction toward a match (+1.663). In the lowest-scoring match, the calendars diverged, which reduced the likelihood of a match (-1.773).
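When several candidates compete for the same rental, something has to decide which pairing wins. The sketch below shows one simple way to do that: a greedy one-to-one assignment that accepts the highest-probability pairs first. This is an assumption for illustration only; the post does not describe how AirDNA resolves competing candidates, and the function name, listing IDs, probabilities, and 0.5 threshold are all hypothetical.

```python
def resolve_matches(scored_pairs, threshold=0.5):
    """Greedy one-to-one assignment: accept the highest-probability pairs first.

    `scored_pairs` is an iterable of (airbnb_id, homeaway_id, probability).
    Each listing on either platform is matched at most once.
    """
    matches = {}
    used_homeaway = set()
    for airbnb_id, homeaway_id, prob in sorted(
        scored_pairs, key=lambda pair: pair[2], reverse=True
    ):
        if prob < threshold:
            break  # every remaining pair scores lower still
        if airbnb_id in matches or homeaway_id in used_homeaway:
            continue  # one side is already claimed by a stronger match
        matches[airbnb_id] = homeaway_id
        used_homeaway.add(homeaway_id)
    return matches


# Hypothetical candidates from one condo complex.
candidates = [
    ("A1", "H1", 0.97),
    ("A1", "H2", 0.62),
    ("A2", "H1", 0.58),
    ("A2", "H2", 0.91),
    ("A3", "H2", 0.40),
]
print(resolve_matches(candidates))  # {'A1': 'H1', 'A2': 'H2'}
```

A greedy pass is easy to reason about, but it is not guaranteed to find the best overall assignment; a production system might use a different strategy.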
Man vs. Machine
Taking an algorithmic approach makes identifying overlap across listing platforms scalable and maximizes accuracy. AirDNA analyzes over 10,000,000 listings globally, and recalculates matches on a monthly basis to resolve shifts in supply.
This approach also eliminates human error and bias. For example, the AirDNA team almost classified these Airbnb and HomeAway listings as separate properties. After all, their titles are different, they accommodate different numbers of guests, their advertised rates are different, and even the listing owners’ names are different.
Here’s the full breakdown of how the Listing Match Algorithm scored the match:
| Feature Type                                             | Score  |
|----------------------------------------------------------|--------|
| Calendar Percent Match                                   | +2.095 |
| Rental Type Match                                        | +0.694 |
| # Bedrooms Match                                         | +0.423 |
| # Reservations to # Reviews                              | +0.395 |
| Difference in Historical Reservations                    | +0.343 |
| Calendar Cosine Similarity                               | +0.262 |
| Title Term Frequency – Inverse Document Frequency        | +0.226 |
| Percentage of Calendar Overlap                           | +0.150 |
| Difference in Bathrooms                                  | +0.127 |
| Average Nightly Rate Difference                          | +0.011 |
| Difference in # Images                                   | -0.045 |
| Description Term Frequency – Inverse Document Frequency  | -0.440 |
| Difference in # Accommodates                             | -1.093 |
| Title Levenshtein Ratio                                  | -1.436 |
| Description Levenshtein Ratio                            | -1.517 |
Despite the differences in listing titles and guest capacity, the listings’ calendars and distances were a near match and carried the most weight, as seen in the two top-scoring features in the list.
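To see how positive and negative contributions like these trade off, the sketch below adds up the scores from the table and converts the total to a probability. It assumes the scores are additive log-odds contributions, which is how ELI5 typically reports explanations for XGBoost classifiers, and it covers only the rows printed above. It leaves out the model’s bias term and any feature not shown in the table, so the result is an illustration of the mechanics, not the probability the algorithm actually produced for this pair.

```python
import math

# Contributions copied from the table above (the published rows only).
contributions = {
    "Calendar Percent Match": 2.095,
    "Rental Type Match": 0.694,
    "# Bedrooms Match": 0.423,
    "# Reservations to # Reviews": 0.395,
    "Difference in Historical Reservations": 0.343,
    "Calendar Cosine Similarity": 0.262,
    "Title TF-IDF": 0.226,
    "Percentage of Calendar Overlap": 0.150,
    "Difference in Bathrooms": 0.127,
    "Average Nightly Rate Difference": 0.011,
    "Difference in # Images": -0.045,
    "Description TF-IDF": -0.440,
    "Difference in # Accommodates": -1.093,
    "Title Levenshtein Ratio": -1.436,
    "Description Levenshtein Ratio": -1.517,
}

# Assumption: contributions add up in log-odds space (bias term not included).
score = sum(contributions.values())
probability = 1 / (1 + math.exp(-score))  # logistic (sigmoid) function

print(round(score, 3))        # 0.195
print(round(probability, 3))  # 0.549
```

Even with three strongly negative text features, the calendar agreement alone is large enough to keep the listed total on the positive side.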
When testing against a set of ground truth data, we’ve found that the algorithm routinely achieves a recall score of 95%.
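Recall here means the share of true dual-listed pairs that the algorithm finds. A minimal sketch of that calculation follows; the function name and the listing IDs are hypothetical, and the numbers are invented purely to land on 95%.

```python
def recall(true_pairs: set, predicted_pairs: set) -> float:
    """Share of ground-truth matches that the model also predicted."""
    if not true_pairs:
        raise ValueError("ground truth must contain at least one pair")
    return len(true_pairs & predicted_pairs) / len(true_pairs)


# Hypothetical ground truth: 20 known dual-listed pairs.
truth = {(f"A{i}", f"H{i}") for i in range(20)}
# Hypothetical predictions: 19 of those pairs, plus 2 incorrect ones.
predicted = {(f"A{i}", f"H{i}") for i in range(19)} | {("A50", "H7"), ("A51", "H3")}

print(recall(truth, predicted))  # 0.95
```

Note that recall ignores the two incorrect predictions in this example; those would be captured by precision, which is a separate measure.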
Out with the Old, In with the New
The old way of measuring the vacation rental industry, by total number of listings, meant that Destination Marketing Organizations, Local Governments, Real Estate Investors, and Hoteliers weren’t getting an accurate picture of true supply.
AirDNA’s Listing Match Algorithm and the infusion of multi-platform data enable professionals to cut through the noise and see a truer representation of what is happening to vacation rental supply at the neighborhood, city, state, and country level.
The amount of overlap between Airbnb and HomeAway varies widely between markets: some have up to 45% overlap on a monthly basis, while others have almost none.
Click here to see the breakdown of Airbnb and HomeAway in your market, or visit AirDNA’s blog to learn more about how the integration of Airbnb data and HomeAway data impacts hosts, owners, vacation rental managers and more.