Stack Overflow Asked by Sawan S on November 23, 2020
I have data in the below format with 38 measure columns for various months as shown below.
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
| Cust_No | Measure1_month1 | Measure1_month2 | .... | Measure1_month72 | Measure2_month_1 | Measure2_month_2 | ….so on | Measure2_month72 |....Measure38_month1...
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
| 1 | 10 | 20 | …. | 500 | 40 | 50 | … | |
| 2 | 20 | 40 | …. | 800 | 70 | 150 | … | |
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
I want to achieve the below format using PYSPARK.
+---------+-------+----------+----------+
| CustNum | Month | Measure1 | Measure2.......measure38 |
+---------+-------+----------+----------+
| 1 | 1 | 10 | 30 |
| 1 | 2 | 20 | 40 |
| 1 | 3 | 30 | 80 |
| 1 | 4 | 70 | 90 |
| 1 | 5 | 40 | 100 |
| . | . | . | . |
| . | . | . | . |
| 1 | 72 | 700 | 50 |
+---------+-------+----------+----------+
and so on for every customer number
Can you please help me with this?
Thanks
IIUC, you need wide to long
kind of transformation which can be achieved by stack
in pyspark
I created a sample dataframe with 5 months data
df = spark.createDataFrame([(1,10,20,30,40,50,10,20,30,40,50),(2,10,20,30,40,50,10,20,30,40,50)],['cust','Measrue1_month1','Measrue1_month2','Measrue1_month3','Measrue1_month4','Measrue1_month5','Measrue2_month1','Measrue2_month2','Measrue2_month3','Measrue2_month4','Measrue2_month5'])
Now generating the clause for stack operation. Can be done in better ways but here is the most simplest example
Measure1 = [i for i in df.columns if i.startswith('Measrue1')]
Measure2 = [i for i in df.columns if i.startswith('Measrue2')]
final = []
for i in Measure1:
for j in Measure2:
if(i.split('_')[1]==j.split('_')[1]):
final.append((i,j))
rows = len(final)
values = ','.join([f"'{i.split('_')[1]}',{i},{j}" for i,j in final])
Now actually applying the stack operation
df.select('cust',expr(f'''stack({rows},{values})''').alias('Month','Measure1','Measure2')).show()
+----+------+--------+--------+
|cust| Month|Measure1|Measure2|
+----+------+--------+--------+
| 1|month1| 10| 10|
| 1|month2| 20| 20|
| 1|month3| 30| 30|
| 1|month4| 40| 40|
| 1|month5| 50| 50|
| 2|month1| 10| 10|
| 2|month2| 20| 20|
| 2|month3| 30| 30|
| 2|month4| 40| 40|
| 2|month5| 50| 50|
+----+------+--------+--------+
Answered by Shubham Jain on November 23, 2020
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP