Airbnb Data Analysis — Toronto

Leonardo Mosimann Conti
6 min read · Apr 25, 2021

Photo by Daniel Salgado on Unsplash

Airbnb is often described as the biggest hotel franchise in the world today, even though it doesn't own a single one of its properties.

By connecting people who need lodging with hosts who want to rent out their places, Airbnb provides an innovative platform that makes this practical.

The portal Inside Airbnb, an independent project rather than an Airbnb initiative, publishes the site's data for some of the main cities in the world. Anyone can download this data to analyze and understand key patterns in the platform's rentals around the world.

Toronto is Canada’s largest city and a leader in technology and business innovation. Since I visited Toronto in 2019 on the way to my exchange year in Ottawa, I thought it would be a great idea to explore its data and learn more about people's choices when it comes to neighbourhoods, prices, and room types.

Loading Libraries and Data

  • pandas: used for dataset manipulation.
  • matplotlib: used to plot histograms.
  • seaborn: used for heatmaps.
  • plotly: used for interactive maps.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
sns.set()
  • The pandas read_csv function reads a .csv file and returns a DataFrame.
  • The dataset can be downloaded from the Inside Airbnb URL used in the code below.
# import the file listings.csv to a DataFrame
DATA_PATH = 'http://data.insideairbnb.com/canada/on/toronto/2021-01-02/visualisations/listings.csv'
df = pd.read_csv(DATA_PATH)
  • df.head() shows the first 5 entries of the dataset, which confirms that the data was loaded successfully.
df.head()

Data Pre-processing

First look at the dataset

  • First, we’ll identify the size of the dataset we’re working with using df.shape (rows at index 0, columns at index 1), and check the data type of each variable with df.dtypes.
display(df.shape[0])
display(df.shape[1])
display(df.dtypes)
  • Entries: 18 265
  • Variables: 16

Missing data

By looking at the first 5 entries shown earlier, we can see that there are missing values in the columns neighbourhood_group and last_review.

  • As the variable neighbourhood_group has 100% of its values missing, we’ll remove it from the dataset using the df.drop() method.
df.drop('neighbourhood_group', axis = 1, inplace = True)
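The share of missing values per column can be checked before dropping anything. The snippet below is a minimal sketch on a small made-up DataFrame (the `toy` data is an assumption for illustration, not the real listings file); the same two lines would apply to df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "neighbourhood_group": [np.nan, np.nan, np.nan, np.nan],
    "last_review": ["2020-12-01", None, "2019-07-15", None],
    "price": [80, 120, 95, 300],
})

# Fraction of missing values in each column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns that are entirely empty are candidates for df.drop()
empty_columns = missing_share[missing_share == 1.0].index.tolist()
print(empty_columns)
```

A column that shows up in `empty_columns` carries no information, which is the situation described for neighbourhood_group.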

How are the variables distributed?

  • To identify the variable distributions, we’ll plot histograms of the meaningful numeric variables.
# Only meaningful variables for a histogram were selected
df.hist(bins = 15, figsize = (15, 10))
plt.tight_layout()

Are there any outliers?

From the histograms, it’s possible to spot hints of probable outliers, for example in the variables price, minimum_nights and calculated_host_listings_count.

Outliers distort the plots and sit far outside the rest of the distribution. To confirm them, we’ll apply two quick methods that assist in detecting outliers:

  • A statistical summary, using the describe() method.
  • Boxplots for each variable.

By looking at the statistical summary that the `describe()` method provided, we can confirm hypotheses such as:

  • The variable `price` has 75% of its values below CAD 149, though its maximum value is CAD 13,000.
  • Some entries in the minimum_nights variable push its maximum above the 365 days of a year, to a minimum stay of more than three years in the property.
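A comparison like this can be read straight off the describe() output. The sketch below uses a small made-up DataFrame (the `toy` values are assumptions chosen to mimic the pattern, not the real data) to show how the 75th percentile and the maximum are pulled out, and how entries beyond a year of minimum stay can be counted.

```python
import pandas as pd

# Toy stand-in for the listings data (made-up values with two obvious outliers)
toy = pd.DataFrame({
    "price": [60, 85, 99, 120, 149, 13000],
    "minimum_nights": [1, 2, 2, 3, 30, 1125],
})

# Statistical summary of the two variables under suspicion
summary = toy[["price", "minimum_nights"]].describe()
print(summary)

# Compare the 75th percentile with the maximum
price_q3 = summary.loc["75%", "price"]
price_max = summary.loc["max", "price"]

# Count listings that demand a minimum stay longer than one year
over_a_year = (toy["minimum_nights"] > 365).sum()
print(price_q3, price_max, over_a_year)
```

A large gap between the 75% row and the max row is the signal the text describes.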

minimum_nights boxplot:

price boxplot:

Histogram without outliers:

Now that we’ve confirmed the existence of outliers in the variables price and minimum_nights, let’s clean the data, and plot the histogram once more.
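The article does not state which cutoffs were used for the cleaning. As one possible approach, the sketch below assumes the common IQR rule (drop values above Q3 + 1.5 × IQR); the helper `upper_fence` and the `toy` data are hypothetical names introduced for illustration.

```python
import pandas as pd


def upper_fence(series: pd.Series) -> float:
    """Upper limit of the IQR rule: Q3 + 1.5 * (Q3 - Q1)."""
    q1, q3 = series.quantile([0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)


# Toy stand-in for the listings data (made-up values, last row is an outlier)
toy = pd.DataFrame({
    "price": [60, 85, 99, 120, 149, 13000],
    "minimum_nights": [1, 2, 2, 3, 4, 1125],
})

# Keep only rows that fall under the fence in both variables
keep = (toy["price"] <= upper_fence(toy["price"])) & (
    toy["minimum_nights"] <= upper_fence(toy["minimum_nights"])
)
toy_clean = toy[keep].copy()
print(toy_clean)
```

Working on a copy keeps the original DataFrame intact, so the cleaned and raw histograms can still be compared side by side.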

  • Now that the outliers have been removed and we’ve gotten to know the data, let’s analyze it.

Exploratory Analysis

What’s the correlation between the variables?

Correlation means that there’s a relationship between two things. In this context, we’re looking for pairs of variables that tend to move together.

This relationship can be measured: the correlation coefficient, which ranges from -1 to 1, expresses how strong it is. To identify these connections between the variables of interest we’ll:

  • Create a correlation matrix.
  • Generate a heatmap from this matrix, using the seaborn library.
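These two steps can be sketched as follows. The helper names, the `toy` data and the heatmap styling (including the "RdBu" colormap) are assumptions for illustration; the plotting libraries are imported inside the plotting helper so the correlation step can be run on its own.

```python
import pandas as pd


def correlation_matrix(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Pearson correlation between the selected numeric columns."""
    return df[columns].corr()


def plot_heatmap(corr: pd.DataFrame):
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr, cmap="RdBu", fmt=".2f", annot=True, square=True)
    plt.tight_layout()
    return ax


# Toy stand-in for the cleaned data (made-up values)
toy = pd.DataFrame({
    "price": [60, 85, 99, 120, 149],
    "number_of_reviews": [10, 40, 5, 80, 20],
    "reviews_per_month": [0.5, 2.0, 0.25, 4.0, 1.0],
})
corr = correlation_matrix(toy, list(toy.columns))
print(corr.round(2))
# plot_heatmap(corr)  # uncomment in a notebook to draw the heatmap
```

In the toy data reviews_per_month is an exact multiple of number_of_reviews, so that pair correlates perfectly; real data would show a weaker but still visible link.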

In the heatmap, the colder colors represent a greater correlation coefficient. As the heatmap shows, there aren’t many correlations between the variables, apart from each variable’s correlation with itself. The one exception is between reviews_per_month and number_of_reviews, because both relate to the number of reviews a property gets.

What’s the average rental price?

# look at the mean of the price column
df_clean.price.mean()
  • 103.15963963963964

What’s the most rented room type in Airbnb?

# show the amount of each room type available
df_clean.room_type.value_counts()
  • Entire home/apt - 10158
  • Private room - 6133
  • Shared room - 301
  • Hotel room - 58
  • We can see that more than 60% of the listings are entire homes/apartments, followed by private rooms at close to 37%.
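The percentages can be checked from the counts listed above. This sketch rebuilds the counts as a Series (copied from the output reported in the article) and divides by the total; on the real data, `df_clean.room_type.value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output reported above
room_counts = pd.Series({
    "Entire home/apt": 10158,
    "Private room": 6133,
    "Shared room": 301,
    "Hotel room": 58,
})

# Share of each room type among all listings
room_share = room_counts / room_counts.sum()
print(room_share.round(4))
```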

What’s the most expensive neighbourhood to rent an Airbnb in Toronto?

  • Sort neighbourhoods by the average rental price, using groupby(), mean() and sort_values().
  • Waterfront Communities-The Island - 137.870
  • Niagara - 128.228
  • Rosedale-Moore Park - 124.959
  • Bay Street Corridor - 124.008
  • Casa Loma - 121.032
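A ranking like the one above can be produced with a short chain of those three methods. The sketch below runs on a made-up DataFrame (the `toy_clean` prices are assumptions, not the real listings); the column names neighbourhood and price follow the dataset.

```python
import pandas as pd

# Toy stand-in for df_clean (made-up prices)
toy_clean = pd.DataFrame({
    "neighbourhood": ["Niagara", "Niagara", "Casa Loma", "Annex", "Annex"],
    "price": [130, 126, 121, 90, 100],
})

# Average price per neighbourhood, most expensive first
top = (
    toy_clean.groupby("neighbourhood")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(top.head(5))
```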

With this information, we need to check whether there are enough listings to support ranking this neighbourhood as the most expensive in Toronto.

  • The output was 0.15264, representing more than 15% of the data. That is a large enough sample to support the conclusion that Waterfront Communities-The Island is the most expensive neighbourhood to rent an Airbnb in Toronto, by average price.
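The article does not show how this figure was computed. One plausible reading, assumed here, is the share of cleaned listings located in the neighbourhood. The sketch uses made-up rows (`toy_clean` is a hypothetical stand-in) to show that calculation.

```python
import pandas as pd

# Toy stand-in for df_clean: 3 of 10 listings are in the target neighbourhood
toy_clean = pd.DataFrame({
    "neighbourhood": ["Waterfront Communities-The Island"] * 3
    + ["Niagara"] * 5
    + ["Casa Loma"] * 2,
})

target = "Waterfront Communities-The Island"

# Listings in the target neighbourhood divided by all listings
share = (toy_clean["neighbourhood"] == target).sum() / toy_clean.shape[0]
print(share)
```

A neighbourhood with only a handful of listings could top the price ranking by chance, which is why this check matters before drawing conclusions from the averages.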

Map of the Airbnb properties, sorted by price

  • Using the plotly library, we’ll make an interactive plot that shows all Airbnb properties in Toronto from the clean data. We’ll be able to navigate the map and look into each property’s name, room type, location, and price.
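The original plotting code is not shown, so the following is only a sketch of how such a map could be built with plotly express. The helper names, the `toy_clean` rows and the styling choices (zoom level, open-street-map tiles) are assumptions; `px.scatter_mapbox` is used here, and newer plotly releases may prefer `px.scatter_map` for the same purpose.

```python
import pandas as pd

MAP_COLUMNS = ["name", "room_type", "neighbourhood", "latitude", "longitude", "price"]


def prepare_map_data(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the columns the map needs and drop rows without coordinates."""
    return df[MAP_COLUMNS].dropna(subset=["latitude", "longitude"])


def build_price_map(data: pd.DataFrame):
    """Interactive scatter map colored by price (needs plotly installed)."""
    import plotly.express as px

    return px.scatter_mapbox(
        data,
        lat="latitude",
        lon="longitude",
        color="price",
        hover_name="name",
        hover_data=["room_type", "neighbourhood"],
        mapbox_style="open-street-map",
        zoom=10,
    )


# Toy stand-in for df_clean (made-up listings, one without a latitude)
toy_clean = pd.DataFrame({
    "name": ["Lake view loft", "Downtown condo", "Cozy room"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room"],
    "neighbourhood": ["Niagara", "Bay Street Corridor", "Casa Loma"],
    "latitude": [43.64, None, 43.67],
    "longitude": [-79.38, -79.40, -79.39],
    "price": [150, 110, 60],
})
map_data = prepare_map_data(toy_clean)
# fig = build_price_map(map_data); fig.show()  # run in a notebook to see the map
```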

Insights:

This initial exploratory analysis of the Airbnb data highlighted some interesting points and gave us a closer look at the Airbnb listings in Toronto. Going through the data, we reached insights such as:

  • With the map plotted, it’s evident that the area near Old Toronto has the highest density of properties, and that rentals get more expensive the closer they are to downtown.
  • Waterfront Communities-The Island has the highest average rental price, likely because of its water views along the Lake Ontario shoreline.
  • The data had outliers, but once they were removed, the histograms and maps became much easier to interpret.
  • There aren’t many correlations between the variables, the only significant one being between number_of_reviews and reviews_per_month.

For further analysis and greater insights, it’s recommended to use the complete data, with all 106 attributes available. It can be found at the portal Inside Airbnb.

Conclusion

By using pandas I was able to analyze, visualize, and better understand the Toronto Airbnb properties data. It has proven to be a great tool for data science, and a great way to start my first project.

Thanks for reading! Feel free to check my Github for the full Jupyter Notebook.


Written by Leonardo Mosimann Conti

Computer Science student | Enthusiastic to learn new things
