Stack Overflow Asked by Caerus on November 27, 2021
I have a sales table that looks like this:
store_id cust_id txn_id txn_date amt industry
200 1 1 20180101 21.01 1000
200 2 2 20200102 20.01 1000
200 2 3 20200103 19 1000
200 3 4 20180103 19 1000
200 4 5 20200103 21.01 1000
300 2 6 20200104 1.39 2000
300 1 7 20200105 12.24 2000
300 1 8 20200105 25.02 2000
400 2 9 20180106 103.1 1000
400 2 10 20200107 21.3 1000
Here’s the code to generate this sample table:
CREATE TABLE sales(
store_id INT,
cust_id INT,
txn_id INT,
txn_date bigint,
amt float,
industry INT);
INSERT INTO sales VALUES(200,1,1,20180101,21.01,1000);
INSERT INTO sales VALUES(200,2,2,20200102,20.01,1000);
INSERT INTO sales VALUES(200,2,3,20200103,19.00,1000);
INSERT INTO sales VALUES(200,3,4,20180103,19.00,1000);
INSERT INTO sales VALUES(200,4,5,20200103,21.01,1000);
INSERT INTO sales VALUES(300,2,6,20200104,1.39,2000);
INSERT INTO sales VALUES(300,1,7,20200105,12.24,2000);
INSERT INTO sales VALUES(300,1,8,20200105,25.02,2000);
INSERT INTO sales VALUES(400,2,9,20180106,103.1,1000);
INSERT INTO sales VALUES(400,2,10,20200107,21.3,1000);
What I would like to do is create a new table, results
that answers the question: what percentage of my VIP customers have, since January 3rd 2020, shopped i) at my store only; ii) at my store and at other stores in the same industry; iii) at only other stores in the same industry? Define a VIP customer to be someone who has shopped at a given store at least once since 2019.
Here’s the target output table:
store industry pct_my_store_only pct_both pct_other_stores_only
200 1000 0.5 0.5 0.0
300 2000 0.5 0.5 0.0
400 1000 0.0 1.0 0.0
I’m trying to use window functions to accomplish this. Here’s what I have so far:
CREATE TABLE results as
SELECT s.store_id, s.industry,
COUNT(DISTINCT (CASE WHEN s.txn_date>=20200103 THEN s.cust_id END)) * 1.0 / sum(count(DISTINCT (CASE WHEN s.txn_date>=20200103 THEN s.cust_id END))) OVER (PARTITION BY s.industry) AS pct_my_store_only
...AS pct_both
...AS pct_other_stores_only
FROM sales s
WHERE sales.txn_date>=20190101
GROUP BY s.store_id, s.industry;
The above does not seem to be correct; how can I correct this?
Join the distinct store_ids and industries to the concatenated distinct store_ids and industries for each customer and then use window function avg()
with the function find_in_set()
to determine if a customer how many customer have shopped or not from each store:
with
stores as (
select distinct store_id, industry
from sales
where txn_date >= 20190103
),
customers as (
select cust_id,
group_concat(distinct store_id) stores,
group_concat(distinct industry) industries
from sales
where txn_date >= 20190103
group by cust_id
),
cte as (
select *,
avg(concat(s.store_id) = concat(c.stores)) over (partition by s.store_id, s.industry) pct_my_store_only,
avg(find_in_set(s.store_id, c.stores) = 0) over (partition by s.industry) pct_other_stores_only
from stores s inner join customers c
on find_in_set(s.industry, c.industries) and find_in_set(s.store_id, c.stores)
)
select distinct store_id, industry,
pct_my_store_only,
1 - pct_my_store_only - pct_other_stores_only pct_both,
pct_other_stores_only
from cte
order by store_id, industry
See the demo.
Results:
> store_id | industry | pct_my_store_only | pct_both | pct_other_stores_only
> -------: | -------: | ----------------: | -------: | --------------------:
> 200 | 1000 | 0.5000 | 0.5000 | 0.0000
> 300 | 2000 | 0.5000 | 0.5000 | 0.0000
> 400 | 1000 | 0.0000 | 1.0000 | 0.0000
Answered by forpas on November 27, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP