
Looking for smallest set of rows that form a natural key in a data set

Data Science Asked by Ed Fine on July 23, 2021

I have several sets of text files on HDFS that are exports from relational tables. Unfortunately I do not know what the structure of the tables is, but I do know that each has a multi-part key that uniquely identifies a row. Through domain knowledge I know the key is multi-part (e.g. reporting date and item number), and I can identify some columns that are clearly not part of the key (e.g. revenue from a sale). What is an effective way to identify candidate sets of columns that could be natural keys, i.e. combinations that are not duplicated in the observed data? I can get several days of logs in a few gigabytes, so Python or SQL could work. This seems like a great application for a dictionary, but I am not sure how to approach it.

One Answer

You could do this a few ways...

  1. Write a script that does the following:

    • pick a table,

    • get the number of rows in the table; you can do this as you go (+= 1 in the loop),

    • select several fields (columns) in the table that you think may form the key,

    • create an empty set(),

    • burn through the file row by row; for each row, grab the target fields and construct a string key str(field1) + '_' + str(field2) + ... etc. Add this key to your set.

    • when you're done going through all the rows of a table, check len(set). If the key you picked is good, then len(set) will equal the number of rows.
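The steps above can be sketched in Python; the file path, delimiter, and column names below are placeholders for your own export files:

```python
import csv

def is_candidate_key(path, key_fields, delimiter=','):
    """Return True if the given columns uniquely identify every row.

    `path` and `key_fields` are placeholders: point them at one of
    your exported files and the columns you suspect form the key.
    """
    seen = set()
    n_rows = 0
    with open(path, newline='') as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        for row in reader:
            n_rows += 1  # count rows as we go (+= 1 in the loop)
            # build a composite string key from the chosen fields
            seen.add('_'.join(str(row[field]) for field in key_fields))
    # a good key yields as many distinct composites as there are rows
    return len(seen) == n_rows
```

One caveat with the string-join approach: if a field's values can themselves contain the separator character, use a tuple of the field values as the set element instead of a joined string.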

  2. Another way, depending on what you want to do with your data and how often you will access it, is to import it into a SQL database. I usually use MySQL. Once your data is in:

    • select count(*) from table_name; gives you the number of rows in the table.

    • select count(distinct field1, field2, field3) from table_name; gives the number of distinct combinations of field1 + field2 + field3.

    • if the two selects give the same number, then you have a key that works. It is not guaranteed to be a join key across tables, but it will be unique within this table and will help the linking process.
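A self-contained sketch of that two-query check, using Python's built-in sqlite3 module in place of MySQL (the table and column names here are invented for illustration; note that SQLite does not support MySQL's multi-column count(distinct), so a DISTINCT subquery is used instead):

```python
import sqlite3

# In-memory database with a made-up table standing in for one export.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE sales (report_date TEXT, item_no TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ('2021-07-01', 'A', 10.0),
    ('2021-07-01', 'B', 20.0),
    ('2021-07-02', 'A', 30.0),
])

# Total rows in the table.
total = conn.execute("SELECT count(*) FROM sales").fetchone()[0]

# Distinct combinations of the candidate key columns.
# (MySQL accepts count(distinct a, b); SQLite needs a subquery.)
distinct = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT report_date, item_no FROM sales)"
).fetchone()[0]

# Equal counts mean (report_date, item_no) is unique in this data.
print(total == distinct)  # True
```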

  3. Many people highly advocate the use of pandas. If you get the data imported into arrays you can create a DataFrame and run queries similar to SQL. I'm not super familiar with pandas, so I can't give an example.
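For completeness, here is one way the pandas route could look, assuming the data has been loaded into a DataFrame (the column names are hypothetical):

```python
import pandas as pd

# Toy DataFrame standing in for one loaded export file.
df = pd.DataFrame({
    'report_date': ['2021-07-01', '2021-07-01', '2021-07-02'],
    'item_no':     ['A', 'B', 'A'],
    'revenue':     [10.0, 20.0, 30.0],
})

candidate = ['report_date', 'item_no']
# duplicated() flags rows whose candidate-column combination has
# appeared before; if none are flagged, the columns form a unique key.
is_key = not df.duplicated(subset=candidate).any()
print(is_key)  # True
```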

If it's a one-off, I'd write a Python script. But if you think you'll be exploring the data further, spending the time to put it in a database could be worthwhile.

Answered by user1269942 on July 23, 2021
