Data Science Asked by new_analyst on June 7, 2021
Newbie to analytics with Python so please be gentle 🙂 I couldn’t find the answer to this question – apologies if it is already answered elsewhere in a different format.
I have a dataset of transaction data for a retail outlet. Variables along with explanation are:
Receipt has the same value for all the products purchased in a single transaction, thus it can be used to determine the average number of purchases made in a single transaction.
What is the best way to go about this? I essentially want to use groupby()
to group the receipt variable by its own identical occurrences so that I can create a histogram.
Working with the data in a pandas DataFrame.
EDIT:
Here is some sample data with header (prod_name is actually a hex number):
section,prod_name,receipt,cashier,cost,date,time
electronics,b46f23e7,102856,5,70.50,05/20/15,9:08:20
womenswear,74558d0d,102857,8,20.00,05/20/15,9:12:46
womenswear,031f36b7,102857,8,30.00,05/20/15,9:12:47
menswear,1d52cd9d,102858,3,65.00,05/20/15,9:08:20
From this sample set I would expect a histogram of receipt that shows two occurrences of receipt 102857 (since that person bought two items in one transaction) and one occurrence respectively of receipt 102856 and of receipt 102858. Note: my dataset is not huge, about 1 million rows.
From this sample set I would expect a histogram of receipt that shows two occurrences of receipt 102857 (since that person bought two items in one transaction) and one occurrence respectively of receipt 102856 and of receipt 102858.
Then you want:
df.groupby('receipt').receipt.count()
receipt
102856 1
102857 2
102858 1
Name: receipt, dtype: int64
Correct answer by Emre on June 7, 2021
I'm putting together some tutorials around data wrangling. Maybe my jupyter notebook on github will help. I think that it is the key is modifying the line:
df.groupby('male')['age'].mean()
to be:
df.groupby('reciept')['prod_name'].count()
To group by multiple variables this should work:
df.groupby(['reciept','date'])['reciept'].count()
Answered by Ryan on June 7, 2021
From what I can understand is that you would need a histogram of your receipt no. You can try something like this
import pandas as pd
data = np.read_csv("your_file_path.csv")
data.groupby(["receipt"])receipt.count().sort_values(ascending=False).head(20).plot.bar()
This will give you bar plots of most repeating billing numbers (20 most repeating) Change the number in the head function to get more or less.
Answered by Bharath KN on June 7, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP