Data Science Asked by Berbatov on May 25, 2021
I have a data set in the following form:
Product | Date
123 | 2019-01-01
456 | 2019-01-01
123 | 2019-01-02
123 | 2019-01-03
456 | 2019-01-03
123 | 2019-01-04
456 | 2019-01-04
789 | 2019-01-04
This is just a simplified version. The full set has ~300 products and four months of data. I want to understand how the product set changed over time. It’s easy to calculate the count per day and see that I lost one product on Jan 2nd and gained one on Jan 4th, but the counts alone don’t tell me which product it was.
Is there a more systematic way of going about this? Ideally, the output would show me a list of days and which products dropped out or were added on each day. I thought about taking min(date) and max(date) by product, but products can drop out and be added back repeatedly, and I wouldn’t capture this back and forth that way.
Available environments are Python, SQL, and Excel.
The right approach depends on your domain. For example, in sales a product may not be sold every day, so there would be no record of it on those days.
Assuming that in your dataset you expect every product to appear on every day, you could consider the following approach.
Mathematically:
A = the set of all possibilities (i.e. every product paired with every date; you could generate this)
B = the sample data set provided
C = A - B = the days on which a product was missing
DPART1 = min(date) and max(date) for each product in the dataset, representing the introduction of a new product and the possible cessation of an existing one
D = C filtered to remove, for each product, dates less than its min(date) or greater than its max(date)
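As a rough illustration of the set steps above, here is a minimal Python sketch using only the standard library. The function name `missing_days` is hypothetical, and the sketch assumes that every calendar day of interest appears at least once in the data (a day with no rows at all would not be part of A).

```python
from datetime import date

# Sample rows from the question: (product, date)
rows = [
    (123, date(2019, 1, 1)), (456, date(2019, 1, 1)),
    (123, date(2019, 1, 2)),
    (123, date(2019, 1, 3)), (456, date(2019, 1, 3)),
    (123, date(2019, 1, 4)), (456, date(2019, 1, 4)), (789, date(2019, 1, 4)),
]

def missing_days(rows):
    """Return sorted (product, date) pairs where a product is absent
    between its first and last appearance (hypothetical helper)."""
    b = set(rows)                                    # B: observed pairs
    products = {p for p, _ in b}
    dates = {d for _, d in b}                        # assumption: only observed dates
    a = {(p, d) for p in products for d in dates}    # A: all combinations
    c = a - b                                        # C: missing pairs

    # DPART1: first and last date seen for each product
    first, last = {}, {}
    for p, d in b:
        first[p] = min(first.get(p, d), d)
        last[p] = max(last.get(p, d), d)

    # D: keep only gaps inside each product's active window
    return sorted((p, d) for p, d in c if first[p] <= d <= last[p])

gaps = missing_days(rows)
print(gaps)
```

On the sample data this reports product 456 as missing on 2019-01-02. Product 789 produces no gaps because its first and last dates coincide.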
In terms of SQL:
A = Cartesian/cross product of all your products and dates
B = Your current sample data set
C = SELECT * FROM A MINUS SELECT * FROM B (MINUS is Oracle syntax; most other databases use EXCEPT)
DPART1 = SELECT PRODUCT, MIN(date) AS INTRODUCED_DATE, MAX(date) AS CEASED_DATE FROM YourSampleDataSET GROUP BY PRODUCT
D = SELECT C.PRODUCT, C.Date FROM C LEFT JOIN DPART1 ON C.PRODUCT = DPART1.PRODUCT
WHERE DPART1.PRODUCT IS NULL OR (
C.Date BETWEEN DPART1.INTRODUCED_DATE AND DPART1.CEASED_DATE
)
N.B. "DPART1.PRODUCT IS NULL" ensures that you do not filter out products that may not be in your sample subset.
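To try the query end to end, the following sketch runs it from Python against an in-memory SQLite database. This is one possible arrangement, not the only one: the table name `sample`, the lowercase column names, and the use of common table expressions are assumptions made for illustration, and SQLite's EXCEPT stands in for MINUS. Dates are stored as ISO strings so that MIN, MAX and BETWEEN compare them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (product INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [
        (123, "2019-01-01"), (456, "2019-01-01"),
        (123, "2019-01-02"),
        (123, "2019-01-03"), (456, "2019-01-03"),
        (123, "2019-01-04"), (456, "2019-01-04"), (789, "2019-01-04"),
    ],
)

query = """
WITH a AS (            -- A: every product paired with every date
    SELECT p.product, d.date
    FROM (SELECT DISTINCT product FROM sample) AS p
    CROSS JOIN (SELECT DISTINCT date FROM sample) AS d
),
c AS (                 -- C: combinations that never occurred
    SELECT product, date FROM a
    EXCEPT
    SELECT product, date FROM sample
),
dpart1 AS (            -- DPART1: first and last date per product
    SELECT product,
           MIN(date) AS introduced_date,
           MAX(date) AS ceased_date
    FROM sample
    GROUP BY product
)
SELECT c.product, c.date   -- D: gaps inside each product's active window
FROM c
LEFT JOIN dpart1 ON c.product = dpart1.product
WHERE dpart1.product IS NULL
   OR c.date BETWEEN dpart1.introduced_date AND dpart1.ceased_date
ORDER BY c.date, c.product
"""

missing = conn.execute(query).fetchall()
print(missing)
```

On the sample data the only row returned is product 456 on 2019-01-02.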
Operational considerations: I would recommend indexes in your database to assist with the queries, and partitioning where possible.
Answered by ggordon on May 25, 2021