Best Neighbourhood in Your Budget
Using machine learning techniques to cluster Toronto's neighbourhoods on the basis of cost and nearby venues
Buying the perfect home is a dream for almost every individual, and for years property consultants and brokers have been the people who helped us in that endeavour.
Now, with new technologies, stakeholders have changed their methods to improve their services, and in this IT age they are using technology and data to achieve better results.
So in this project I decided to build something useful for buyers as well as the stakeholders in this business. I am using City of Toronto data, but the project could be extended to other metropolitan cities as well.
Click here to open the notebook in Binder
Make sure to mark the notebook as trusted to display the maps
I am using the Foursquare API as well as scraped webpages to get the average property cost in each area of Toronto. As we know, cost is the major driver of where a person can live.
The Foursquare API is used to get the locality of an area. The locality and venues help an individual find the kind of place they need. For example, a bachelor might prefer to live near pubs, entertainment centres, and workplaces, while a person with a family might prefer nearby schools, shops, and parks.
We are using the Foursquare API, geocoders, and web-scraping techniques to solve our problem.
The Foursquare API is used to get nearby venues around a location. This venue data is used to classify neighbourhoods by locality.
Geocoders are used to get the latitude and longitude of each neighbourhood, which are required for the maps and the Foursquare API.
I searched but couldn't find any structured dataset of average housing cost by neighbourhood, so I scraped a webpage that lists it. Click here to view webpage
import requests
import pandas as pd
from geopy.geocoders import Nominatim
from IPython.display import HTML, IFrame
url = 'https://www.moneysense.ca/spend/real-estate/where-to-buy-real-estate-in-2020-city-of-toronto/'
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}
r = requests.get(url, headers=header)
dfs = pd.read_html(r.text)  # parse every HTML table on the page
df = dfs[0]                 # the first table holds the neighbourhood data
df
df = df.filter(items=['Area', 'Neighbourhood', 'Area average price 2019'])
df
The average price column is of object (string) type, hence it needs to be converted to float:
df[df.columns[2]] = df[df.columns[2]].replace(r'[\$,]', '', regex=True).astype(float)
df
Using geopy's Nominatim geocoder to get the latitude and longitude of each neighbourhood; neighbourhoods the geocoder cannot resolve are dropped:
index = 0
latitude = []
longitude = []
geolocator = Nominatim(user_agent="toronto")
for i in df['Neighbourhood'].items():  # iteritems() is deprecated in newer pandas
    address = i[1] + ' Toronto'
    location = geolocator.geocode(address)
    if location is None:
        # drop neighbourhoods the geocoder cannot resolve
        df.drop([index], inplace=True)
        index = index + 1
        continue
    latitude.append(location.latitude)
    longitude.append(location.longitude)
    index = index + 1
df['Latitude']=latitude
df['Longitude']=longitude
df
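One practical note: Nominatim's usage policy allows roughly one request per second, so a geocoding loop like the one above can get throttled or blocked. Below is a minimal hand-rolled rate limiter as a sketch (the `rate_limited` helper is my own; geopy also ships `geopy.extra.rate_limiter.RateLimiter` for exactly this):

```python
import time

def rate_limited(func, min_delay=1.0):
    """Wrap func so successive calls are spaced at least min_delay seconds apart."""
    last_call = [0.0]  # mutable cell so the wrapper can update it
    def wrapper(*args, **kwargs):
        wait = min_delay - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)
    return wrapper

# hypothetical usage with the geocoder above:
# geocode = rate_limited(geolocator.geocode, min_delay=1.0)
```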
df_price=df[['Neighbourhood','Area average price 2019']]
df_price.set_index('Neighbourhood',inplace=True)
%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
mpl.style.use('ggplot') # optional: for ggplot-like style
# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0
df_price.plot(kind='hist',figsize=(16,8),color='grey')
df_price.plot(kind='bar',figsize=(16,8))
plt.title("Price Distribution by Neighbourhood")
plt.ylabel('Price in $')
df_price.describe()
df_price.plot(kind='box',figsize=(10,7),color='blue')
plt.title("Price Distribution")
plt.ylabel('Price in $')
Now we use folium to plot the data points on a map, with popups showing each neighbourhood and its average price.
import folium
location = geolocator.geocode('Toronto')
latitude = location.latitude
longitude = location.longitude
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, neighbourhood,price in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'],df['Area average price 2019']):
    label = '{}, {}'.format(neighbourhood,price )
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto.save('map_toronto.html')
df_new=df_price.reset_index()
df_new
geo = 'toronto_crs84.geojson' # geojson file
# create a choropleth map of Toronto
map1 = folium.Map(location=[43.718432, -79.333204], zoom_start=11)
# Map.choropleth is deprecated in recent folium; use the Choropleth class
folium.Choropleth(
    geo_data=geo,
    data=df_new,
    columns=['Neighbourhood', 'Area average price 2019'],
    key_on='feature.properties.AREA_NAME',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Average Price in Toronto'
).add_to(map1)
map1.save('map1.html')
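One gotcha with `key_on`: the values in the `Neighbourhood` column must match the GeoJSON's `AREA_NAME` property exactly, or those areas render unshaded. A small hypothetical helper to list mismatches (the function name and check are my own, not part of folium):

```python
import json

def unmatched_names(df_names, geojson_path, key='AREA_NAME'):
    """Return the neighbourhood names with no exact match in the
    GeoJSON's `key` property; unmatched areas get no fill colour."""
    with open(geojson_path) as f:
        gj = json.load(f)
    geo_names = {feat['properties'][key] for feat in gj['features']}
    return sorted(set(df_names) - geo_names)

# unmatched_names(df_new['Neighbourhood'], geo)  # e.g. spelling variants
```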
CLIENT_ID = 'xxx' # your Foursquare ID
CLIENT_SECRET = 'xxx' # your Foursquare Secret
ACCESS_TOKEN = 'xxx' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 100 # A default Foursquare API limit value
Function to extract the venue category from a Foursquare API response row:
def get_category_type(row):
    # the column name differs depending on whether the response
    # has been flattened with json_normalize
    try:
        categories_list = row['categories']
    except (KeyError, TypeError):
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
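As a quick illustration, here are the two row shapes this helper handles, using toy dicts rather than real API output (the function is repeated so the snippet runs standalone):

```python
def get_category_type(row):
    # same logic as above
    try:
        categories_list = row['categories']
    except (KeyError, TypeError):
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# a raw response row
print(get_category_type({'categories': [{'name': 'Coffee Shop'}]}))  # Coffee Shop
# a json_normalize-flattened row with no category
print(get_category_type({'venue.categories': []}))                   # None
```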
Getting nearby venues for each neighbourhood via the Foursquare explore endpoint:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
df.rename(columns={"Neighbourhood":"Neighborhood"},inplace=True)
toronto_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )
toronto_venues.head()
toronto_venues.groupby('Neighborhood').count()
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head()
toronto_onehot.shape
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
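A quick check of this helper on a toy row (the function is repeated so the snippet runs standalone; the neighbourhood and categories are made up):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # position 0 holds the neighbourhood name; the rest are category frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

toy = pd.Series({'Neighborhood': 'Rosedale', 'Café': 0.4, 'Park': 0.5, 'Bank': 0.1})
print(return_most_common_venues(toy, 2))  # ['Park' 'Café']
```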
import numpy as np
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']
for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
toronto_grouped=pd.merge(toronto_grouped,df,on='Neighborhood')
toronto_grouped_clustering = toronto_grouped.drop(['Neighborhood','Area'],axis= 1)
toronto_grouped_clustering
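One caveat worth noting: after the merge, the feature matrix mixes venue frequencies in [0, 1] with prices in the millions and raw coordinates, and k-means is distance-based, so the unscaled price column will dominate the clustering. A sketch of standardizing first, with a toy matrix standing in for `toronto_grouped_clustering`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy matrix: one price-scale column next to two venue-frequency columns
X = np.array([[800_000.0, 0.2, 0.1],
              [2_300_000.0, 0.0, 0.4],
              [1_100_000.0, 0.3, 0.2]])
X_scaled = StandardScaler().fit_transform(X)
# each column now has mean ~0 and unit variance, so no single
# feature dominates the Euclidean distance
print(X_scaled.round(3))
```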
from sklearn.cluster import KMeans
kclusters = 4
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:]
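Choosing kclusters = 4 is a judgment call; a common sanity check is the elbow method: run k-means for a range of k and look for where the inertia stops dropping sharply. A sketch on synthetic data (in the notebook, X would be `toronto_grouped_clustering`, ideally scaled):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 9)]
# inertia always decreases as k grows; pick the "elbow" where the
# curve flattens out
print([round(i) for i in inertias])
```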
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df
# join neighborhoods_venues_sorted onto df to add cluster labels and top venues
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head() # check the last columns!
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)
# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # cluster labels are 0-based, so index directly (cluster-1 would
        # give clusters 0 and 3 the same colour)
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters.save('map_cluster.html')
toronto_merged.loc[toronto_merged['Cluster Labels']==0]
toronto_merged.loc[toronto_merged['Cluster Labels']==1]
toronto_merged.loc[toronto_merged['Cluster Labels']==2]
toronto_merged.loc[toronto_merged['Cluster Labels']==3]
toronto_info = toronto_merged.groupby(by='Cluster Labels').mean(numeric_only=True)
toronto_merged.groupby(by='Cluster Labels').size()
round(toronto_info['Area average price 2019'])
Conclusion
We have grouped Toronto's neighbourhoods into 4 clusters.
 
Here are some results we found after clustering:
On average cost:
- The first cluster's average cost is $2,300,120
 - The second cluster's average cost is $808,301
 - The third cluster's average cost is $1,631,649
 - The fourth cluster's average cost is $1,171,028
 
So the first cluster is the most expensive, the second is the least expensive, and the third and fourth are moderately expensive.
On nearby venues:
- The first cluster's venues consist of restaurants and grocery shops
 - The second cluster's venues consist of all types of restaurants and basic amenities; it looks like a great neighbourhood to live in, since the cost is also low
 - The third cluster's venues consist of pubs, restaurants, parks, and entertainment and leisure places
 - The fourth cluster's venues consist of parks, restaurants, coffee shops, and banks
 
The first cluster is the most expensive one. It consists of restaurants, different types of shops, and museums. It has only six neighbourhoods, and the map shows they sit at the epicentre of the city.
The second cluster is the least expensive one, with restaurants and basic amenities nearby, which could make it a great place for families. The map shows these neighbourhoods surround the epicentre of the city.
The third cluster is spread all around the city and is the second most expensive. It contains the leisure centres and entertainment places, so it would suit bachelors; it may also be suitable for opening new stores.
The fourth cluster is away from the epicentre of the city and is the second least expensive. It has parks, restaurants, and coffee shops all around it.