Stack Overflow Asked by CoderMan on February 25, 2021
I have created a small program to find the mean, median and mode values for two particular columns of a df. I used np.mean and np.median to find the mean and median values, but for the mode I created a numpy array from the df and calculated the mode from that. I print them to the console and the values seem fine. However, I would like the mode value from the numpy array to appear in my df, which has four columns: 'STUDENT', 'SCORE', 'mean' and 'median'. Is there a way to get the mode value and attach it to the end of the df as a fifth column titled 'mode'? My code is below. I would also prefer not to use libraries like scipy for this, so as to avoid sparse, if there is another way around it.
import numpy as np
import pandas as pd

def mean_median():
    df = pd.read_csv('Surveys.csv')
    dfm = df.groupby("STUDENT")[["SCORE"]].agg([np.mean, np.median]).reset_index()
    print(dfm)
    arr = dfm.to_numpy()
    print('\nNumpy Array\n----------\n', arr)
    vals, counts = np.unique(arr, return_counts=True)
    index = np.argmax(counts)
    return vals[index]
Here is an example of my output, in case it helps make things clearer:
     STUDENT      SCORE
                   mean median
0     2443.0  93.210145   94.0
1     2445.0  94.652113   95.0
2     2447.0  93.919775   95.0
3     2451.0  95.203571   95.0
4     2832.0  94.544304   95.0
..       ...        ...    ...
276  27323.0  95.585106   96.0
277  27324.0  94.562105   95.0
278  27325.0  96.986348   98.0
279  27326.0  96.809524   97.0
280  27334.0  96.102564   97.0

[281 rows x 3 columns]
Numpy Array
----------
[[ 2443. 93.21014493 94. ]
[ 2445. 94.65211268 95. ]
[ 2447. 93.91977481 95. ]
[ 2451. 95.20357143 95. ]
[ 2832. 94.5443038 95. ]
[ 2838. 94.97988265 95. ]
[ 2839. 93.88054608 94. ]
[ 2841. 93.90789474 94. ]
[ 2980. 94.14044944 95. ]
[ 3220. 94.44219067 95. ]
[ 3221. 93.80825959 94. ]
[ 3222. 93.88416076 94. ]
[ 3229. 98.42857143 100. ]
[ 3231. 92.11363636 93. ]
[ 3236. 94.3677686 95. ]
[ 3238. 93.84027778 94. ]
[ 3332. 93.12958963 94. ]
[ 3333. 92.83663366 93.5 ]
Sample input data (a few rows) to reproduce this:
STUDENT SCORE
25718 97
25719 97
26990 95
23809 92
24032 90
22723 87
24688 92
25714 89
25718 78
23078 90
25713 90
24032 87
26990 77
26990 89
You can use pd.Series.mode to calculate the mode. Also, for mean and median you can simply pass the function names as strings.
import pandas as pd

# Dummy dataframe
d = {'STUDENT': [25718, 25718, 25718, 25718, 25718, 22723, 22723, 22723, 22723, 22723, 25713, 25713, 25713],
     'SCORE': [97, 97, 95, 92, 90, 87, 92, 89, 78, 92, 90, 87, 87]}
df = pd.DataFrame(d)
# One row per student with the mean, median and mode of SCORE
out = df.groupby("STUDENT")["SCORE"].agg(['mean', 'median', pd.Series.mode]).reset_index()
print(out)
STUDENT mean median mode
0 22723 87.6 89 92
1 25713 88.0 87 87
2 25718 94.2 95 97
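This works when every student has a single mode, that is, one value that occurs more often than any other. pd.Series.mode returns every value tied for the highest count, so a student with tied values (including a student with no repeated value at all) yields several modes, and the aggregation can raise an error.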
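To make the tie behaviour concrete, here is a minimal sketch of what pd.Series.mode returns for one series with a repeated value and one without. The two small series are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# Illustrative data (assumed, not from Surveys.csv)
repeated = pd.Series([87, 92, 89, 78, 92])   # 92 appears twice
unique_only = pd.Series([90, 87, 88])        # no value repeats

# mode() always returns a Series, sorted, holding every tied value
single_mode = repeated.mode().tolist()
tied_modes = unique_only.mode().tolist()

print(single_mode)   # [92]
print(tied_modes)    # [87, 88, 90]
```

Because the second result holds three values rather than one, it cannot be placed in a single cell as a scalar without first reducing it somehow.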
More details here.
If you are not sure whether every student has a single mode, you can take the average of the values returned by pd.Series.mode. If it returns one mode, the average is that same value. If it returns several modes, you get the average of those.
import pandas as pd

d = {'STUDENT': [25718, 25718, 25718, 25718, 25718, 22723, 22723, 22723, 22723, 22723, 25713, 25713, 25713],
     'SCORE': [97, 97, 95, 92, 90, 87, 92, 89, 78, 92, 90, 87, 88]}
# Average the modes so that ties collapse to a single value
mode = lambda x: pd.Series.mean(pd.Series.mode(x))
df = pd.DataFrame(d)
out = df.groupby("STUDENT")["SCORE"].agg(['mean', 'median', mode]).reset_index()
# The lambda has no usable name, so set the column names explicitly
out.columns = ['STUDENT', 'mean', 'median', 'mode']
print(out)
STUDENT mean median mode
0 22723 87.600000 89 92.000000
1 25713 88.333333 88 88.333333
2 25718 94.200000 95 97.000000
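Since the question builds its mode from np.unique, a NumPy-only variant that attaches the result as a fifth column could look roughly like the sketch below. The helper name numpy_mode, the sample data and the tie-break (smallest value wins) are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

def numpy_mode(scores):
    """Most frequent value in a group; on ties the smallest wins (assumed tie-break)."""
    # np.unique returns sorted values, and argmax picks the first maximum
    vals, counts = np.unique(scores.to_numpy(), return_counts=True)
    return vals[np.argmax(counts)]

# Illustrative data (assumed, not from Surveys.csv)
df = pd.DataFrame({
    "STUDENT": [25718, 25718, 25718, 22723, 22723, 22723, 25713, 25713],
    "SCORE":   [97, 97, 95, 87, 92, 92, 90, 87],
})

# Existing summary with mean and median, one row per student
dfm = df.groupby("STUDENT")["SCORE"].agg(["mean", "median"]).reset_index()

# Per-student mode, computed on the raw scores rather than on the summary array
modes = df.groupby("STUDENT")["SCORE"].agg(numpy_mode).rename("mode").reset_index()

# Attach the mode as an extra column
dfm = dfm.merge(modes, on="STUDENT")
print(dfm)
```

The key difference from the question's code is that the mode is computed per student from the SCORE values, instead of once over the whole array of student ids, means and medians.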
Correct answer by Akshay Sehgal on February 25, 2021