How to cluster government census data in order to group Metropolitan statistical areas

Question

I have collected census data covering 2012 - 2018. I want to apply a clustering algorithm in order to compare Metropolitan Statistical Areas (MSAs). Ideally, once I run the clustering algorithm, I would like to see which MSAs are comparable to one another.
The features I have chosen for the clustering are listed below:
'Bachelors+',
'Estimate  Total  $10,000 to $14,999',
'Estimate  Total  $100,000 to $124,999',
'Estimate  Total  $125,000 to $149,999',
'Estimate  Total  $15,000 to $19,999',
'Estimate  Total  $150,000 to $199,999',
'Estimate  Total  $20,000 to $24,999',
'Estimate  Total  $200,000 or more',
'Estimate  Total  $25,000 to $29,999',
'Estimate  Total  $30,000 to $34,999',
'Estimate  Total  $75,000 to $99,999',
'Median Age',
'Median Gross rent as % of household inc',
'Number of educational and health service workers',
'Number of finance and real estate workers',
'Number of people in management, business, science, and arts',
'Number of service workers',
'Number of tech workers',
'Pct Asian',
'Pct Black',
'Pct Other Race',
'Pct White',
'Total Population',
'Total Population over 25'

My data is at the tract level for every MSA in the United States from 2012 - 2018. Would I first need to aggregate the data so that I have the above features for each MSA, and then run the clustering algorithm on that?
From there, how do I identify the MSAs by cluster?

JahKnows · Accepted Answer

If you want to measure the distance between MSAs, then yes, I think it would be best to first aggregate your features so that each instance (row) represents an MSA. From there you will have an $n \times m$ matrix, where $n$ is the number of MSAs and $m$ is the number of features you end up with.
You can then apply your clustering algorithm. There are many to choose from; the ones I always try are:
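
As a rough sketch of that aggregation step with pandas: the `msa` column name and all values below are made up for illustration, and only three of the question's features are shown. Counts are summed across tracts, while percentage columns are averaged with population weights. Medians such as 'Median Age' cannot be recovered exactly from tract-level medians, so a population-weighted mean of them would only be an approximation. If you want one row per MSA per year, you could group by a year column as well.

```python
import pandas as pd

# Toy tract-level data; the 'msa' column and every value here are hypothetical.
tracts = pd.DataFrame({
    "msa": ["A", "A", "B", "B"],
    "Total Population": [1000, 3000, 2000, 2000],
    "Number of tech workers": [50, 150, 20, 60],
    "Pct Asian": [10.0, 20.0, 5.0, 15.0],
})

count_cols = ["Total Population", "Number of tech workers"]  # counts: sum over tracts
pct_cols = ["Pct Asian"]  # shares: population-weighted mean over tracts


def aggregate_to_msa(df):
    """Collapse tract rows into one row per MSA."""
    sums = df.groupby("msa")[count_cols].sum()
    # Weight each tract's percentage by its population before summing.
    weighted = df[pct_cols].multiply(df["Total Population"], axis=0)
    weighted["msa"] = df["msa"]
    pct = weighted.groupby("msa")[pct_cols].sum().div(sums["Total Population"], axis=0)
    return sums.join(pct)


msa_features = aggregate_to_msa(tracts)
print(msa_features)
```

The result has one row per MSA and one column per feature, which is the matrix shape the clustering step expects.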

K-means
K-nearest neighbors (not a clustering algorithm as such, but handy for finding the MSAs most similar to a given one)
Spectral clustering
DBSCAN

Others can be found here.
Once you fit the clustering algorithm, you will get a cluster label for each of the $n$ rows in your input matrix. With this you will know which MSAs are similar given the selected set of features.
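
To make the last step concrete, here is a minimal sketch using scikit-learn's `KMeans`. The four MSAs, their values, and the choice of `n_clusters=2` are all invented for illustration, and standardizing the features first is my assumption about sensible preprocessing, since otherwise a large-valued column such as population would dominate the distances. The labels come back in row order, so attaching them to the MSA index tells you which MSAs share a cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy MSA-level matrix (rows = MSAs, columns = features); all values are invented.
msa_features = pd.DataFrame(
    {
        "Median Age": [30.0, 31.0, 45.0, 46.0],
        "Total Population": [5_000_000, 4_800_000, 200_000, 250_000],
    },
    index=["MSA 1", "MSA 2", "MSA 3", "MSA 4"],
)

# Put every feature on a comparable scale before computing distances.
X = StandardScaler().fit_transform(msa_features)

# n_clusters=2 is an arbitrary choice for this toy data.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Labels are returned in row order, so they line up with the MSA index.
clusters = pd.Series(labels, index=msa_features.index, name="cluster")

# Map each cluster label to the list of MSAs assigned to it.
members = {int(c): list(idx) for c, idx in clusters.groupby(clusters).groups.items()}
print(members)
```

The cluster numbers themselves are arbitrary; only which MSAs end up grouped together is meaningful.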
