Is the Vacation Rental Industry Overstated?


It’s easy to get swept up in the frenzy of numbers being released from vacation rental listing giants like Airbnb and Booking. Just last week, they shared new listing count figures that show Airbnb coming out on top, giving them additional momentum as they head toward their planned IPO.

Between HomeAway and Airbnb alone, there are an estimated 8,000,000 vacation rental listings worldwide. But because many rental properties are listed on multiple platforms, it would be a huge mistake to equate the number of listings with the total number of properties.

To truly understand the size of the vacation rental industry, dual-listed rentals must be accounted for. Until recently, there was no way to accurately cross-reference listings across platforms.


At AirDNA, we’ve focused most of our efforts in the past on Airbnb data. For 2019, we wanted to take a big step forward and integrate HomeAway data into MarketMinder, our app that shows short-term rental supply, demand, and pricing predictions across 80,000 markets worldwide.

In March, we announced just that: the merging of global HomeAway data and Airbnb data in our reporting tools.

But combining these two data sources presented a massive challenge. Had we simply merged the Airbnb data and HomeAway data, we would have risked overstating market-wide supply numbers, in some cases by as much as 35% on an annualized basis.

This would have negative implications for real estate investors who use the information to decide where to buy their next rental property, hosts who base pricing decisions — at least in part — on market-wide occupancy figures, and local governments who use supply numbers to estimate tax revenue from short-term rentals. 


Accounting for dual-listed rentals is not as easy as it sounds.

They can’t simply be identified based on physical address, because that information isn’t publicly available, and their geo-locations are intentionally obfuscated. Rental titles and descriptions can vary widely between listing platforms. To make matters worse, they often change throughout the year as savvy hosts and property managers update rental information to capitalize on seasonal attractions and high-demand events. 

To accurately identify cross-platform listings that belong to the same rental property, AirDNA’s Lead Machine Learning Engineer, Erich Wellinger, developed a matching algorithm. It is an XGBoost classifier, implemented in Python, that uses more than fourteen different listing features (for a gentle introduction to XGBoost, see this blog post).

Features such as location, listing titles, listing descriptions, number of bedrooms, and calendar availability are fed into a gradient-boosted tree model that outputs the probability that a given pair of listings is the same underlying property.
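The exact feature definitions are not published here, so the following is only a minimal sketch of how a few pairwise features of this kind could be computed. The function names, the dictionary keys, the normalization used for the Levenshtein ratio, and the sample listings are all assumptions made for illustration, not AirDNA's implementation.

```python
from math import asin, cos, radians, sin, sqrt


def levenshtein_ratio(a, b):
    """Return 1 - edit_distance / max_length, a similarity in [0, 1].

    This normalization is an assumption; other definitions exist.
    """
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def calendar_percent_match(cal_a, cal_b):
    """Share of days on which both calendars show the same availability."""
    if len(cal_a) != len(cal_b) or not cal_a:
        raise ValueError("calendars must cover the same non-empty date range")
    return sum(x == y for x, y in zip(cal_a, cal_b)) / len(cal_a)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    d_lat = radians(lat2 - lat1)
    d_lon = radians(lon2 - lon1)
    h = (sin(d_lat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def pair_features(a, b):
    """Build a feature dictionary for one candidate pair of listings."""
    return {
        "title_levenshtein_ratio": levenshtein_ratio(a["title"].lower(),
                                                     b["title"].lower()),
        "calendar_percent_match": calendar_percent_match(a["calendar"],
                                                         b["calendar"]),
        "distance_km": haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
        "bedrooms_match": int(a["bedrooms"] == b["bedrooms"]),
    }


# Hypothetical listings; True means the night is available.
airbnb = {"title": "2-bedroom Condo w/pool", "bedrooms": 2,
          "calendar": [True, False, False, True],
          "lat": 29.2108, "lon": -81.0228}
homeaway = {"title": "Beach condo with pool, 2BR", "bedrooms": 2,
            "calendar": [True, False, True, True],
            "lat": 29.2110, "lon": -81.0230}
features = pair_features(airbnb, homeaway)
```

Each candidate pair becomes one row of such features, which is the kind of input a gradient-boosted classifier can score.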


Sometimes, several potential matches are found for one single rental.
This happens a lot in ski and beach towns, where many units within the same apartment complex are used as vacation rentals. They share the same general location, usually have the same number of bedrooms, feature shared amenities such as a pool, and have similar titles, e.g., “2-bedroom Condo w/pool, steps from Daytona Beach.”
They are virtually identical units within the same building, yet they are physically different rentals. These are especially difficult to parse using data.

When multiple matches are found, the pair with the highest overall score is kept, and the discards default to their next highest-scored match, if there is one.
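One way to read the rule above is as a greedy one-to-one assignment: process candidate pairs from highest to lowest score and keep a pair only if neither listing has already been claimed. The sketch below illustrates that reading; the function name, the tuple layout, and the sample IDs and scores are hypothetical, and the actual tie-breaking and thresholding rules may differ.

```python
def assign_matches(candidates):
    """Greedily keep the best-scoring pair for each listing.

    candidates: iterable of (airbnb_id, homeaway_id, score) tuples.
    Returns the kept pairs, highest score first.
    """
    used_airbnb, used_homeaway = set(), set()
    kept = []
    for airbnb_id, homeaway_id, score in sorted(
            candidates, key=lambda c: c[2], reverse=True):
        # Skip pairs where either side already has a better match.
        if airbnb_id in used_airbnb or homeaway_id in used_homeaway:
            continue
        used_airbnb.add(airbnb_id)
        used_homeaway.add(homeaway_id)
        kept.append((airbnb_id, homeaway_id, score))
    return kept


# Hypothetical candidates: three Airbnb listings compete for H1.
candidates = [
    ("A1", "H1", 3.9),
    ("A2", "H1", 3.5),
    ("A2", "H2", 2.0),
    ("A3", "H1", 1.0),
]
matches = assign_matches(candidates)
```

Here A1 keeps H1, A2 falls back to its next-best candidate H2, and A3 stays unmatched because it has no other candidate.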

Nine potential matches were found for this Airbnb listing in a Corpus Christi condominium complex. Here’s a comparison of five feature types for the highest-scoring and lowest-scoring HomeAway matches as produced using the informative ELI5 package: 


<table class="tftable" border="1">
<tbody>
<tr>
<th style="text-align: center;">Feature Type</th>
<th style="text-align: center;">Highest-Scoring Match Score</th>
<th style="text-align: center;">Lowest-Scoring Match Score</th>
</tr>
<tr>
<td>Description Levenshtein Ratio</td>
<td style="text-align: center;">+2.359</td>
<td style="text-align: center;">+2.337</td>
</tr>
<tr>
<td>Calendar Percent Match</td>
<td style="text-align: center;">+1.663</td>
<td style="text-align: center;">-1.773</td>
</tr>
<tr>
<td>Title Term Frequency - Inverse Document Frequency</td>
<td style="text-align: center;">+1.271</td>
<td style="text-align: center;">+0.195</td>
</tr>
<tr>
<td>Difference in # Images</td>
<td style="text-align: center;">+0.633</td>
<td style="text-align: center;">-0.377</td>
</tr>
<tr>
<td>Rental Type Match</td>
<td style="text-align: center;">+0.348</td>
<td style="text-align: center;">+0.351</td>
</tr>
<tr>
<td style="text-align: right;"><em>Overall Score:</em></td>
<td style="text-align: center;"><em>3.944</em></td>
<td style="text-align: center;"><em>0.036</em></td>
</tr>
<tr>
<td style="text-align: right;"><em>Overall Probability:</em></td>
<td style="text-align: center;"><em>98%</em></td>
<td style="text-align: center;"><em>50%</em></td>
</tr>
</tbody>
</table>

The ELI5 package breaks a prediction down so you can see why the model produced it. For the subset of features listed above, it shows how strongly, and in which direction, each feature influenced the final prediction. The highest-scoring match had calendars that lined up with one another, which moved the prediction toward a match (+1.663), while the lowest-scoring match had calendars that diverged from one another, which reduced the likelihood of a match (-1.773).


Taking an algorithmic approach makes identifying overlap across listing platforms scalable and maximizes accuracy. AirDNA analyzes over 10,000,000 listings globally, and recalculates matches on a monthly basis to resolve shifts in supply.

This approach also eliminates human error and bias. For example, the AirDNA team almost classified these Airbnb and HomeAway listings as separate properties. After all, their titles are different, they accommodate different numbers of guests, the advertised rates are different, and even the listing owners’ names are different.

The Listing Match Algorithm, however, correctly classified them as a match! Upon further inspection, the team found the smoking gun that cemented their confidence that the two listings were indeed a property match:

There is no way that hanging quilt, bed set, and matching curtains are a coincidence. It’s 100% the same rental property.

Here’s the full breakdown of how the Listing Match Algorithm scored the match: 


<table class="tftable" border="1">
<tbody>
<tr>
<th style="text-align: center;">Feature Type</th>
<th style="text-align: center;">Score</th>
</tr>
<tr>
<td>Calendar Percent Match</td>
<td style="text-align: center;">+2.095</td>
</tr>
<tr>
<td>Distance</td>
<td style="text-align: center;">+1.293</td>
</tr>
<tr>
<td>Rental Type Match</td>
<td style="text-align: center;">+0.694</td>
</tr>
<tr>
<td># Bedrooms Match</td>
<td style="text-align: center;">+0.423</td>
</tr>
<tr>
<td># Reservations to # Reviews</td>
<td style="text-align: center;">+0.395</td>
</tr>
<tr>
<td>Difference in Historical Reservations</td>
<td style="text-align: center;">+0.343</td>
</tr>
<tr>
<td>Calendar Cosine Similarity</td>
<td style="text-align: center;">+0.262</td>
</tr>
<tr>
<td>Title Term Frequency - Inverse Document Frequency</td>
<td style="text-align: center;">+0.226</td>
</tr>
<tr>
<td>Percentage of Calendar Overlap</td>
<td style="text-align: center;">+0.150</td>
</tr>
<tr>
<td>Difference in Bathrooms</td>
<td style="text-align: center;">+0.127</td>
</tr>
<tr>
<td>Average Nightly Rate Difference</td>
<td style="text-align: center;">+0.011</td>
</tr>
<tr>
<td>Both Instant-Bookable</td>
<td style="text-align: center;">-0.030</td>
</tr>
<tr>
<td>Difference in # Images</td>
<td style="text-align: center;">-0.045</td>
</tr>
<tr>
<td>Model Bias</td>
<td style="text-align: center;">-0.069</td>
</tr>
<tr>
<td>Description Term Frequency - Inverse Document Frequency</td>
<td style="text-align: center;">-0.440</td>
</tr>
<tr>
<td>Difference in # Accommodates</td>
<td style="text-align: center;">-1.093</td>
</tr>
<tr>
<td>Title Levenshtein Ratio</td>
<td style="text-align: center;">-1.436</td>
</tr>
<tr>
<td>Description Levenshtein Ratio</td>
<td style="text-align: center;">-1.517</td>
</tr>
<tr>
<td style="text-align: right;"><em>Overall Score:</em></td>
<td style="text-align: center;"><em>1.391</em></td>
</tr>
<tr>
<td style="text-align: right;"><em>Overall Probability:</em></td>
<td style="text-align: center;"><em>80%</em></td>
</tr>
</tbody>
</table>

Despite the differences in listing titles and number of accommodates, the two listings’ calendars nearly matched and their locations were close together. Those two features carried the most weight, as the two top-scoring rows of the table show.
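The numbers in the table are consistent with the per-feature scores being additive contributions in log-odds space, with the overall probability obtained by passing their sum through the logistic function. That interpretation is an assumption here rather than something the table states, but it can be checked with a few lines of Python using the published values (the sum differs from the reported 1.391 only by rounding):

```python
import math

# Per-feature contributions copied from the table above.
contributions = {
    "Calendar Percent Match": 2.095,
    "Distance": 1.293,
    "Rental Type Match": 0.694,
    "# Bedrooms Match": 0.423,
    "# Reservations to # Reviews": 0.395,
    "Difference in Historical Reservations": 0.343,
    "Calendar Cosine Similarity": 0.262,
    "Title Term Frequency - Inverse Document Frequency": 0.226,
    "Percentage of Calendar Overlap": 0.150,
    "Difference in Bathrooms": 0.127,
    "Average Nightly Rate Difference": 0.011,
    "Both Instant-Bookable": -0.030,
    "Difference in # Images": -0.045,
    "Model Bias": -0.069,
    "Description Term Frequency - Inverse Document Frequency": -0.440,
    "Difference in # Accommodates": -1.093,
    "Title Levenshtein Ratio": -1.436,
    "Description Levenshtein Ratio": -1.517,
}


def sigmoid(x):
    """Logistic function: maps a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


overall_score = sum(contributions.values())
overall_probability = sigmoid(overall_score)
print(f"score={overall_score:.3f}, probability={overall_probability:.0%}")
```

The same function applied to the overall scores in the earlier comparison table (3.944 and 0.036) gives roughly 98% and just over 50%, in line with the probabilities reported there.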


When testing against a set of ground truth data, we’ve found that the algorithm routinely achieves a recall score of 95%.
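Recall here means the share of true cross-platform matches that the algorithm actually finds. A minimal sketch of that calculation, with a hypothetical function name and made-up listing pairs standing in for the ground truth data:

```python
def recall(predicted_pairs, true_pairs):
    """Fraction of known true matches that appear among the predictions."""
    true_set = set(true_pairs)
    if not true_set:
        raise ValueError("ground truth must contain at least one pair")
    found = true_set & set(predicted_pairs)
    return len(found) / len(true_set)


# Hypothetical (airbnb_id, homeaway_id) pairs.
ground_truth = [("A1", "H1"), ("A2", "H2"), ("A3", "H3"), ("A4", "H4")]
predicted = [("A1", "H1"), ("A2", "H2"), ("A3", "H3"), ("A5", "H9")]
score = recall(predicted, ground_truth)
```

Note that recall says nothing about false positives such as the hypothetical ("A5", "H9") pair; those would be captured by precision, which is not reported here.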

The integration of Airbnb data and HomeAway data, plus the ability to accurately account for dual-listed rentals, makes it possible to get a truer view of performance at the whole-property level, not just at the level of a single listing.

Globally, ninety-three major markets — those with at least 1,000 total vacation rental listings — had at least 15% cross-platform overlap in supply for the year 2018. Cumulatively, 20% of their total listings were essentially advertising the same property as another listing, and did not represent unique supply options in the market.
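To make the arithmetic behind such overlap figures concrete, here is a small sketch of how matched pairs change a market's supply count. It assumes one-to-one matches between the two platforms and defines the redundant share as matched pairs divided by total listings; those definitions, the function name, and the sample numbers are illustrative assumptions, not AirDNA's published methodology.

```python
def supply_summary(airbnb_listings, homeaway_listings, matched_pairs):
    """Compare a naive listing count with a deduplicated property count."""
    total = airbnb_listings + homeaway_listings
    if total <= 0:
        raise ValueError("market must contain at least one listing")
    if not 0 <= matched_pairs <= min(airbnb_listings, homeaway_listings):
        raise ValueError("matched_pairs cannot exceed either platform's count")
    unique = total - matched_pairs  # each matched pair is one property
    return {
        "total_listings": total,
        "unique_properties": unique,
        # Listings that duplicate a property already counted once.
        "redundant_share": matched_pairs / total,
        # How far the naive count overshoots the true property count.
        "overstatement": total / unique - 1,
    }


# Hypothetical market: 700 Airbnb listings, 300 HomeAway, 200 matched pairs.
summary = supply_summary(700, 300, 200)
```

In this made-up market, 20% of listings are redundant, and counting listings instead of properties would overstate supply by 25%.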

The old way of measuring the vacation rental industry, by total number of listings, meant that destination marketing organizations, local governments, real estate investors, and hoteliers weren’t getting an accurate picture of true supply.

AirDNA’s Listing Match Algorithm and the infusion of multi-platform data enable professionals to cut through the noise and see a truer representation of what is happening to vacation rental supply at the neighborhood, city, state, and country level.

The amount of overlap between Airbnb and HomeAway varies widely between markets, some having up to 45% overlap on a monthly basis, and others having almost none.

Click here to see the breakdown of Airbnb and HomeAway in your market, or visit AirDNA’s blog to learn more about how the integration of Airbnb data and HomeAway data impacts hosts, owners, vacation rental managers and more.

