1. Introduction
In this capstone project I apply the knowledge gained in the previous courses to a practical problem. The IBM Data Science Professional Certificate on Coursera concludes with a Capstone Project, which is about using the data science toolset on a real-life problem and demonstrating the creation of value by applying the learned skills. This report presents that capstone project. The analysis was performed in Python.
- Problem Definition
(1). Problem
For this project, I chose a theoretical business problem. The question that we are trying to answer is the following.
A friend of mine is studying at Shanghai University and plans to start a business after graduation. He has decided to open a Tibetan restaurant in Shanghai, China.
Taking into account the price level at which the restaurant will operate, the intent is to find an optimal location in an area where gastronomy is booming and which is easily accessible both for tourists and for wealthier local citizens.
(2). Business logic
We can use unsupervised machine learning to create clusters of districts, which gives us a list of areas to consider for the restaurant. The intent is for the restaurant to be situated close to one of the gastronomical centers and in a high-income area.
3. Data collection
To perform this analysis, we will need the following data:
- List of the districts of Shanghai
The list of districts will be obtained from Wikipedia
(https://en.wikipedia.org/wiki/List_of_administrative_divisions_of_Shanghai)
- Geo-coordinates of the districts in Shanghai
The geo-coordinates of the districts will be obtained with the help of the geocoder tool in the notebook.
- Top venues of the districts
Top venues data will be obtained from Foursquare through its API.
- Income data (GDP per capita in 2019) for the districts of Shanghai
The income data for the districts of Shanghai will be obtained from the Shanghai Bureau of Statistics (http://tjj.sh.gov.cn)
- Methodology
(1). Brief process
After tidying up and exploring the data, we apply an unsupervised machine learning technique to create clusters of districts. We use the silhouette score to choose the optimal number of clusters.
(2). Data preparation and exploration
As part of preparing the data, we start by creating a list of the districts in Shanghai and adding the geo-coordinates of each district to a table. First, I scraped the information on each district from Wikipedia (https://en.wikipedia.org/wiki/List_of_administrative_divisions_of_Shanghai). After performing this task, we get the following table, which we use in pandas DataFrame format.
The table contains several indicators for each district. These indicators are very useful for my analysis.
Then we add the per capita GDP data for each district in Shanghai.
The resulting table looks like this:
A total of 16 districts are in the table. We select the districts with a population of more than 500,000 and a per capita GDP of more than 60,000 RMB for analysis, and keep only the properties we need to analyze.
Because Jinshan District is a purely industrial area, we exclude it, so we get the following table:
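The selection described above can be sketched in pandas as follows. This is only a minimal illustration: the column names, the helper `select_districts`, and the sample rows are assumptions made for the example, and the figures are placeholders rather than real statistics.

```python
import pandas as pd

# Hypothetical sample rows; the numbers are placeholders, not real statistics.
districts = pd.DataFrame(
    {
        "District": ["District A", "District B", "District C", "Jinshan District"],
        "Population": [800_000, 300_000, 1_200_000, 700_000],
        "GDP_per_capita": [90_000, 120_000, 50_000, 80_000],
    }
)


def select_districts(df, min_population=500_000, min_gdp=60_000,
                     exclude=("Jinshan District",)):
    """Keep districts above both thresholds, minus the explicitly excluded ones."""
    mask = (
        (df["Population"] > min_population)
        & (df["GDP_per_capita"] > min_gdp)
        & ~df["District"].isin(list(exclude))
    )
    return df.loc[mask].reset_index(drop=True)


selected = select_districts(districts)
print(selected)
```

Keeping the thresholds as function parameters makes it easy to check how sensitive the shortlist is to the chosen cut-offs.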
For some reason I could not use the geocoder Python library, so I took the latitude and longitude coordinates of each district of Shanghai from this website: https://jingwei.supfree.net/mengzi.asp?id=820
After performing this task, we get the following table, which we use in pandas DataFrame format.
In the next step of the analysis, the districts were explored in greater detail: venues were collected for each district via the Foursquare API. The data from Foursquare is received in JSON format. After arranging the data, we have up to 100 venues for each district. Venues are collected within a radius of 1000 meters of each district's coordinates. The collected and arranged data looks like this. The following table shows some venues from the first district.
We can check how many venues have been collected for each district. The following table gives that summary.
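As a rough sketch of the arranging step, the snippet below flattens a venue response into rows and counts the venues per district. No request is sent. The nested layout (`response` > `groups` > `items` > `venue`) is an assumption modelled on the Foursquare `venues/explore` response format and may differ from what the notebook actually received. The helper `flatten_venues` and the sample payloads are hypothetical.

```python
import pandas as pd


def flatten_venues(district, payload):
    """Turn one explore-style JSON payload (assumed layout) into flat rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            # Some venues may carry no category; fall back to an empty dict.
            categories = venue.get("categories") or [{}]
            rows.append(
                {
                    "District": district,
                    "Venue": venue.get("name"),
                    "Venue Latitude": venue.get("location", {}).get("lat"),
                    "Venue Longitude": venue.get("location", {}).get("lng"),
                    "Venue Category": categories[0].get("name"),
                }
            )
    return rows


def make_item(name, lat, lng, category):
    """Build one hypothetical item in the assumed response layout."""
    return {"venue": {"name": name, "location": {"lat": lat, "lng": lng},
                      "categories": [{"name": category}]}}


# Hypothetical payloads standing in for the JSON returned per district.
sample_payloads = {
    "District A": {"response": {"groups": [{"items": [
        make_item("Cafe One", 31.20, 121.40, "Coffee Shop"),
        make_item("Hotel Two", 31.21, 121.41, "Hotel"),
    ]}]}},
    "District B": {"response": {"groups": [{"items": [
        make_item("Burger Three", 31.30, 121.50, "Fast Food Restaurant"),
    ]}]}},
}

venues = pd.DataFrame(
    [row for name, payload in sample_payloads.items()
     for row in flatten_venues(name, payload)]
)

# Number of venues collected per district.
venue_counts = venues.groupby("District")["Venue"].count()
print(venue_counts)
```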
(3). Analysis
One-hot encoding
To analyse the districts, we focus on venue categories. For that purpose, we use one-hot encoding. This creates dummy variables for the categories so that the data set can be used for machine learning. After these manipulations, we get the following table, which shows the ten most common venue categories for each district (first four shown in the table).
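The encoding and ranking steps might look roughly like the sketch below. The sample venues, the column names, and the helper `top_categories` are assumptions for illustration. The example ranks the top three categories instead of ten to keep the output short.

```python
import pandas as pd

# Hypothetical venue rows; real data would come from the venue table.
venues = pd.DataFrame(
    {
        "District": ["District A"] * 4 + ["District B"] * 3,
        "Venue Category": [
            "Coffee Shop", "Coffee Shop", "Hotel", "Pizza Place",
            "Fast Food Restaurant", "Fast Food Restaurant", "Coffee Shop",
        ],
    }
)

# One dummy (0/1) column per venue category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "District", venues["District"])

# The mean of each dummy column is the share of a district's venues
# that fall into that category.
grouped = onehot.groupby("District").mean()


def top_categories(row, n=3):
    """Return the n category names with the highest share in one district."""
    return row.sort_values(ascending=False).index[:n].tolist()


top = {district: top_categories(row) for district, row in grouped.iterrows()}
print(top)
```

Working with shares rather than raw counts keeps districts with many venues from dominating the later clustering step.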
Clustering
Now that the dataset is ready, we perform clustering. For this, an unsupervised machine learning technique based on K-means is used. For K-means clustering, we need to decide on the number of clusters. To avoid a trial-and-error approach, the silhouette score was used. The following graph shows the silhouette scores for a range of cluster counts.
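A minimal sketch of the silhouette-based selection with scikit-learn is given below. The feature matrix is synthetic (three well-separated groups of points) and the helper `silhouette_by_k` is a hypothetical name. In the actual analysis the features would be the per-district category frequencies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, random_state=0):
    """Fit K-means for each k and return a {k: silhouette score} mapping."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic stand-in data: 15 points scattered tightly around 3 centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.1 * rng.standard_normal((5, 2)) for c in centers])

scores = silhouette_by_k(features, range(2, 6))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Note that the silhouette score is only defined when the number of clusters is at least 2 and smaller than the number of samples, so the range of candidate values has to stay below the number of districts.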
From the graph, we can read that the optimal number of clusters is 4 (where the score is the highest). In the next step, we run the K-means clustering algorithm with 4 as the number of clusters. When done, we add the cluster labels to the dataset. We get the following table.
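Fitting the final model and attaching the labels could be sketched as follows. The frequency table is a small hypothetical stand-in and the column name `Cluster Label` is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical per-district category shares (each row sums to 1).
grouped = pd.DataFrame(
    {
        "Coffee Shop": [0.50, 0.45, 0.05, 0.00, 0.10, 0.00],
        "Hotel": [0.40, 0.50, 0.00, 0.05, 0.00, 0.10],
        "Fast Food Restaurant": [0.10, 0.05, 0.95, 0.90, 0.00, 0.00],
        "Park": [0.00, 0.00, 0.00, 0.05, 0.90, 0.00],
        "Factory": [0.00, 0.00, 0.00, 0.00, 0.00, 0.90],
    },
    index=["District A", "District B", "District C",
           "District D", "District E", "District F"],
)

# Fit K-means with the chosen number of clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(grouped.values)

# Attach the labels as the first column of a copy of the table.
clustered = grouped.copy()
clustered.insert(0, "Cluster Label", kmeans.labels_)
print(clustered["Cluster Label"])
```

scikit-learn numbers the labels from 0, so four clusters carry the labels 0 to 3.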
We can also visualise the clusters on the map that we created earlier.
The result may look surprising at first, but there is an explanation: Shanghai is an old port city in which the old city and the new city are interwoven. Many districts serve different functions, which affects our analysis.
- Results
Understanding the Clusters
By looking at the cluster data, we can see that cluster 4 is the one that we are the most interested in.
1. Cluster 1
The first cluster (cluster label 0) is a typical industrial area where workers are employed. It has a large population, but its per capita income is not high, so it is not where we are going to pick a location.
2. Cluster 2
This cluster is a typical urban residential area of Shanghai. It covers all aspects of residents' needs, but it does not match our goal: we need a business district centered on food.
3. Cluster 3
The third cluster is an outer district where top gastronomy is not really represented (coffee and fast food are at the top).
4. Cluster 4
Cluster 4 (cluster label 3) is the biggest cluster, and this is where we see lots of gastronomy-related venues (coffee shop, pizza place, Thai restaurant, beer bar, pub, modern European restaurant, etc.). These are the business districts I am looking for. But how do we choose among so many districts? Looking at the population and GDP per capita columns at the front of the table gives the answer. Huangpu District has the highest per capita income, but its population is small. Pudong New Area, by contrast, has the largest population and a high per capita income. Most importantly, it is a multicultural area.
- Discussion and Recommendations
Based on what we learned about the clusters, we can advise the restaurant owner to consider the districts in cluster 4 as potential locations for the Tibetan restaurant. These are the districts where gastronomy is well represented and hotels are also frequent. They satisfy the two original criteria: the location should be in a gastronomical centre and easily accessible for tourists.
- Conclusion
This report discussed the process of arriving at an answer to a hypothetical but realistic business problem. The analysis was based on the data science toolset and relied heavily on Python and Python libraries such as pandas, scikit-learn and Folium, to name a few. Data was collected from different types of sources and in different formats. For the analysis, a machine learning technique was used. The output of the analysis provided a thorough basis for the recommendation on the business problem in question.
- References
The Jupyter notebook of the analysis can be found on GitHub.