Data Science Asked by Asif Khawaja on July 18, 2021
Problem Definition:
Our organization conducts various surveys and a census in our country. The basic difference between a census and a survey is that a census targets the complete population, whereas a survey targets only a sample (subset) drawn from that population. The organization now wants to integrate survey and census results by mapping the survey data onto the census data. It currently uses some statistical approaches for this integration, but these techniques have many issues. I am interested in solving this problem with state-of-the-art approaches from data science, machine learning, or deep learning.
The population (census) data set contains 220 million rows, whereas the sample (survey) data set contains approximately 40 million rows.
My question, for data scientists with a sound background in statistics: how can this be done?
I need step-by-step guidance to achieve this goal. Kindly recommend algorithms for this task, and suggest some resources for reading about and understanding this problem.
Example to Illustrate the Problem:
Let the census population be P, with attributes A.
Similarly, let the survey population (a sample taken from P) be S, with attributes B,
i.e. S ⊆ P and A ⊆ B.
For example:
Census Data (Table 1)
Block Code | HouseHoldID | PersonID | Sex | Marital Status | Age |
---|---|---|---|---|---|
1 | 1 | 1 | Male | Married | 25 |
1 | 1 | 2 | Female | Un-Married | 30 |
1 | 1 | 3 | Male | Married | 22 |
1 | 2 | 4 | Male | Married | 40 |
1 | 2 | 5 | Male | Un-Married | 30 |
1 | 3 | 6 | Male | Un-Married | 17 |
2 | 4 | 7 | Female | Married | 50 |
3 | 5 | 8 | Female | Married | 52 |
3 | 5 | 9 | Female | Married | 45 |
4 | 6 | 10 | Female | Un-Married | 45 |
4 | 7 | 11 | Female | Un-Married | 42 |
5 | 8 | 12 | Male | Married | 36 |
5 | 9 | 13 | Female | Married | 33 |
Survey Data (Table 2)
Block Code | HouseHoldID | PersonID | Sex | Marital Status | Age | Employment Status | Education Level |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | Male | Married | 25 | Employed | Graduate |
1 | 2 | 4 | Male | Married | 40 | Employed | No Schooling |
1 | 2 | 5 | Male | Un-Married | 30 | Un-Employed | Primary |
3 | 3 | 6 | Male | Un-Married | 17 | Un-Employed | Middle |
3 | 4 | 7 | Female | Married | 50 | Employed | Middle |
As per our policy, we divide the country into provinces, provinces into districts, districts into tehsils, and so on down to the lowest level, known as blocks. Each block is composed of at least 500 households (families), and each family consists of members.
Table 1 above shows census data collected from 5 different blocks, whereas the survey data in Table 2 was collected from two different blocks. Here you can see that the survey data is a subset of the census data. Similarly, many blocks covered in the census may not be visited in the survey.
Furthermore, the attributes in the census data are a subset of the attributes in the survey data (the survey is short but detailed).
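To make the subset relations concrete, here is a minimal Python sketch that checks them on the two example tables and lists the census blocks the survey never visited. The variable names are only illustrative, and the rows are reduced to (Block Code, PersonID) pairs copied from Table 1 and Table 2.

```python
# Column names copied from Table 1 (census) and Table 2 (survey).
census_columns = ["Block Code", "HouseHoldID", "PersonID",
                  "Sex", "Marital Status", "Age"]
survey_columns = census_columns + ["Employment Status", "Education Level"]

# (Block Code, PersonID) pairs copied from the two example tables.
census_rows = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 7),
               (3, 8), (3, 9), (4, 10), (4, 11), (5, 12), (5, 13)]
survey_rows = [(1, 1), (1, 4), (1, 5), (3, 6), (3, 7)]

# A is a subset of B: every census attribute also appears in the survey.
attributes_are_subset = set(census_columns) <= set(survey_columns)

# S is a subset of P: every surveyed person also appears in the census.
persons_are_subset = ({person for _, person in survey_rows}
                      <= {person for _, person in census_rows})

# Blocks present in the census but never visited by the survey.
census_blocks = {block for block, _ in census_rows}
survey_blocks = {block for block, _ in survey_rows}
unvisited_blocks = sorted(census_blocks - survey_blocks)

print(attributes_are_subset, persons_are_subset, unvisited_blocks)
```

On the example data, blocks 2, 4 and 5 come out as unvisited, so any figure for them has to be inferred rather than counted.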
Now suppose I need to map the survey data onto the census data, i.e. I want to see the percentage of unemployed members in the block with code 2 (a block that was not visited in the survey). How would I do that?
We may need this type of mapping at block level or at a higher level.
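To illustrate the kind of mapping I mean, here is a rough Python sketch of one simple option, sometimes called a synthetic estimate in the small area estimation literature: compute the unemployment rate per (Sex, Marital Status) cell from the survey and apply those rates to the census persons of a block. This is an assumption on my part, not our organization's current method. The function names `cell_rates` and `block_estimate` are hypothetical, Age is ignored for brevity, and cells with no surveyed person fall back to the overall survey rate.

```python
from collections import defaultdict

# Census persons from Table 1: (Block Code, Sex, Marital Status).
census = [
    (1, "Male", "Married"), (1, "Female", "Un-Married"), (1, "Male", "Married"),
    (1, "Male", "Married"), (1, "Male", "Un-Married"), (1, "Male", "Un-Married"),
    (2, "Female", "Married"), (3, "Female", "Married"), (3, "Female", "Married"),
    (4, "Female", "Un-Married"), (4, "Female", "Un-Married"),
    (5, "Male", "Married"), (5, "Female", "Married"),
]

# Survey persons from Table 2: (Sex, Marital Status, Employment Status).
survey = [
    ("Male", "Married", "Employed"),
    ("Male", "Married", "Employed"),
    ("Male", "Un-Married", "Un-Employed"),
    ("Male", "Un-Married", "Un-Employed"),
    ("Female", "Married", "Employed"),
]


def cell_rates(survey_rows):
    """Unemployment rate per (Sex, Marital Status) cell seen in the survey."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [unemployed, total]
    for sex, marital, status in survey_rows:
        counts[(sex, marital)][0] += status == "Un-Employed"
        counts[(sex, marital)][1] += 1
    return {cell: unemployed / total
            for cell, (unemployed, total) in counts.items()}


def block_estimate(block, census_rows, rates, fallback):
    """Average the cell rates over all census persons living in one block."""
    cells = [(sex, marital) for b, sex, marital in census_rows if b == block]
    return sum(rates.get(cell, fallback) for cell in cells) / len(cells)


rates = cell_rates(survey)
# Overall survey rate, used for cells the survey never observed.
overall = sum(status == "Un-Employed" for _, _, status in survey) / len(survey)

block_2_share = block_estimate(2, census, rates, overall)
print(f"Estimated unemployed share in Block 2: {block_2_share:.0%}")
```

With only five survey rows, the Block 2 figure rests on a single surveyed married woman and is meaningless in itself; the sketch only shows the shape of the mapping. On the real data the cells would presumably need more attributes (Age, and the upper levels such as tehsil or district), and a fitted model could replace the raw cell rates.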