Data Science Asked by Threadid on December 17, 2020
I have population data from Census.gov:
Total US population by age by year from 1940 through 2010
Depending on the range of decades, the data is missing discrete population values for ages greater than a certain age. Instead an aggregate amount is provided that represents all ages greater than the cutoff.
Specifically it follows this pattern:
The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.
Therefore I want to impute the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.
And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.
Some Observations:
Looking at the data, we can see that each of the three subsets of years has a very similar pattern: more variation at younger ages, with the variation flattening out for ages greater than 60.
Then, if we focus on the years 1980 through 1989, we can fit a nice curve for ages 0 through 100 with a multiple R-squared of 0.979.
If we narrow the focus to ages 60 to 100, and narrower still to ages 85 to 100, the multiple R-squared increases to 0.9996.
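To make the curve-fitting step concrete, here is a minimal sketch of fitting a curve over age and computing R-squared. The data is synthetic, and both the functional form (a quadratic in age fitted to log-population) and the helper name r_squared are assumptions for illustration, not necessarily the curve used above.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for one year's population at ages 85..100 (NOT Census data).
rng = np.random.default_rng(0)
ages = np.arange(85, 101)
population = 250_000 * np.exp(-0.28 * (ages - 85)) * rng.lognormal(0.0, 0.03, ages.size)

# Assumed functional form: quadratic in (age - 85), fitted on the log scale.
coeffs = np.polyfit(ages - 85, np.log(population), deg=2)
fitted = np.exp(np.polyval(coeffs, ages - 85))

r2 = r_squared(population, fitted)
print(f"R-squared on the original scale: {r2:.4f}")
```

Fitting on the log scale keeps the fitted values positive, which matters once the curve is used for the very small counts at the oldest ages.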
Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.
Total population 1940 through 2010:
Ages 85 through 100 for years 1980 through 1999. The trend for each age is linear, and each successive age has a slightly lower rate of increase (a smaller slope).
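As a rough sketch of this observation, one straight line per age can be fitted in a single call. The table below is synthetic (rows are years, columns are ages), and the variable names are assumptions made for illustration.

```python
import numpy as np

years = np.arange(1980, 2000)
ages = np.arange(85, 101)

# Synthetic table (NOT Census data): a level per age in 1980 plus a
# per-year increase that shrinks as age rises.
base = 200_000 * np.exp(-0.25 * (ages - 85))
growth = 0.03 * base
t = years - 1980                      # centre the years to keep the fit well conditioned
pop = base[None, :] + growth[None, :] * t[:, None]

# With a 2-D y, np.polyfit fits one polynomial per column (here: per age).
slopes, intercepts = np.polyfit(t, pop, deg=1)

# Linear extrapolation of every age back to 1975 (t = -5).
pop_1975 = intercepts + slopes * (1975 - 1980)
```

Extrapolating far outside the fitted years can produce implausible or even negative values for the oldest ages, so the result should be checked against the published aggregate for the open-ended age group.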
This is where I could use some guidance to move forward:
When imputing discrete missing population values by age and year, how do I combine the fitted curve that models changes in population when age increases with the linear regression that models changes in population year over year?
Do one or more documented methods naturally apply to the problem as I have described it?
For example: KNN, PCA, BPCA, Mean, MICE, other?
If there is a recommended method, can you point me to available R or Python packages and documentation that describe the mechanics of applying a given approach?
I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:
I would suggest a different approach. Actuaries have traditionally produced "life tables" which capture mortality in population cohorts. If you can find a set of tables which is applicable to the period in question, you could use these to calculate population numbers.
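One possible reading of the life-table idea, as a minimal sketch: split the published 85+ aggregate across single ages in proportion to the person-years a life table implies for each age. The death probabilities q, the remaining life expectancy e_last at age 100, the aggregate of 3,000,000 and the function name split_open_interval are all hypothetical; real values would come from a life table for the relevant period. The sketch also assumes a stationary population, which real populations only approximate.

```python
import numpy as np

def split_open_interval(total, q, e_last):
    """Split an open-ended age group across single ages plus a final open group.

    total  : published aggregate for the open-ended group (e.g. 85+)
    q      : one-year death probabilities for each single age (e.g. 85..99)
    e_last : assumed remaining life expectancy at the final age (e.g. 100)
    """
    # l_x: survivors to each exact age, starting from a radix of 1.0.
    survivors = np.concatenate(([1.0], np.cumprod(1.0 - q)))
    # L_x: person-years lived in each single-year interval (linear approximation).
    person_years = 0.5 * (survivors[:-1] + survivors[1:])
    # Person-years lived beyond the final age.
    open_ended = survivors[-1] * e_last
    weights = np.append(person_years, open_ended)
    return total * weights / weights.sum()

ages = np.arange(85, 100)                        # single ages 85..99
q = 0.15 * np.exp(0.08 * (ages - 85))            # hypothetical q_x, NOT real data
estimates = split_open_interval(3_000_000, q, e_last=2.0)

for label, value in zip([*map(str, ages), "100+"], estimates):
    print(f"{label:>4}: {value:,.0f}")
```

Because the weights are normalised, the single-age estimates always add back up to the published aggregate, which is a useful consistency check for any imputation approach.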
Answered by bradS on December 17, 2020