Stack Overflow Asked by Ravanelli on November 22, 2021
I have a pandas DataFrame with a column of product titles from an online shop, each classified by category:
df
category title
electronics ALLDOCUBE iPlay 7T 4G LTE Kids Tablet 6.98" HD iPS Android 9.0 Tablets 16GB ROM Support 256G Expansion Dual Ai 4 Core Type C GPS
electronics Alldocube iPlay8 pro 8 inch Tablet Android 9.0 MTK MT8321 Quad core 3G Calling Tablet PC RAM 2GB ROM 32GB 800*1280 IPS OTG
accessories Alldocube iPlay10 Pro 10.1 inch Wifi Tablet Android 9.0 MT8163 quad core 1200*1920 IPS Tablets PC RAM 3GB ROM 32GB HDMI OTG
clothing ALLDOCUBE iPlay10 Pro Tablet 10.1 3GB RAM 32GB ROM Android 9.0 MT8163 Quad Core Tablet PC 1920 x 1200 IPS 6600mAh Wifi Tablet
I tried to tokenize the title column, but the result counts letters rather than words. This is what I did:
df.loc[:,['category','title']].groupby('category').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
This is what it returns:
cat
accessories {'N': 1510, 'A': 1635, 'V': 498, 'I': 873, 'F': 2453, 'O': 837, 'R': 1577, 'C': 3087, 'E': 1831, ' ': 37476, 'M': 2497, 'e': 24599, 'n': 13621, 'W': 3112, 'a': 17129, 't': 11106, 'c': 6471, 'h': 4666, 's': 10988, 'r': 15707, 'p': 2774, 'o': 12459, 'f': 2069, 'S': 4262, 'i': 12812, 'l': 12888, 'Q': 333, 'u': 4711, 'z': 460, 'g': 4720, 'y': 3522, 'k': 1944, 'w': 2697, 'U': 385, '8': 338, '2': 1530, '9': 645, '1': 913, 'L': 1578, 'x': 645, 'B': 3366, 'd': 4593, 'D': 1209, ''': 221, 'm': 5425, 'P': 1709, 'G': 1906, '.': 116, 'b': 1290, 'j': 290, 'v': 1151, 'Y': 273, 'H': 1179, '5': 687, 'Z': 270, 'K': 431, '/': 346, 'J': 1346, 'X': 53, 'T': 963, '0': 1451, 'q': 219, '6': 215, '-': 237, '7': 209, ',': 96, '3': 377, '4': 555, '&': 102, '[': 21, ']': 21, '+': 42, 'ч': 3, 'а': 8, 'с': 7, 'ы': ...
electronics {'M': 1795, 'i': 6781, 'n': 4423, ' ': 22908, 'T': 1343, 'W': 1392, 'S': 3088, 'B': 1970, 'l': 4234, 'u': 2692, 'e': 10504, 't': 6545, 'o': 8519, 'h': 2655, '5': 836, '.': 783, '0': 2088, 'E': 1009, 'a': 7290, 'r': 7997, 'p': 2513, 's': 3768, 'H': 1266, 'd': 3039, '9': 422, 'D': 1474, 'f': 1088, 'c': 2560, 'I': 1000, 'k': 801, 'X': 471, 'm': 2653, '1': 1349, '"': 36, 'A': 1639, 'O': 688, 'L': 755, 'C': 2454, 'R': 1025, 'F': 1078, 'b': 1261, 'G': 1329, 'P': 2282, '6': 742, '7': 287, 'K': 442, 'w': 760, 'g': 1607, 'z': 161, 'н': 6, 'а': 5, 'у': 5, 'ш': 5, 'и': 7, 'к': 5, 'v': 547, 'V': 800, 'N': 626, '8': 623, 'J': 106, 'Q': 118, '-': 344, '4': 899, 'x': 498, 'U': 662, 'y': 1007, '3': 883, '2': 1264, 'Y': 147, '/': 337, '(': 12, ')': 10, '*': 25, '%': 11, 'j': 75, ',': 93, '+': 72, 'q': ...
I tried the same on a new column of tokenized text, but it doesn't work either:
df['tokenized_text'] = df['title'].apply(word_tokenize)
df.loc[:,['cat','tokenized_text']].groupby('cat').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
EDIT
When I run
print (df['tokenized_text'].iloc[:2].tolist())
it returns a list of lists of words, like below:
[['NAVIFORCE',
'Men',
'Watches',
'Waterproof',
'Stainless',
'Steel',
'Quartz',
'Watch',
'Male',
'Chronograph',
'Military',
'Clock',
'Wrist',
'watch',
'Relogio',
'Masculino'],
['CURREN',
'8291',
'Luxury',
'Brand',
'Men',
'Analog',
'Digital',
'Leather',
'Sports',
'Watches',
'Men',
"'s",
'Army',
'Military',
'Watch',
'Man',
'Quartz',
'Clock',
'Relogio',
'Masculino']]
EDIT 2
I tried this code:
f = lambda x: pd.Series(nltk.FreqDist(x))
df.groupby('category')['title'].apply(f).reset_index()
and
f = lambda x: nltk.FreqDist(x)
df.groupby('category')['title'].apply(f).reset_index()
but both return this:
cat level_1 title
0 accessories NAVIFORCE Men Watches Waterproof Stainless... 1
1 accessories CURREN 8291 Luxury Brand Men Analog Digital.. 1
2 accessories PAGANI Design Brand Luxury Men Watches... 2
3 accessories NO.ONEPAUL women belt Genuine Leather New 1
I believe you need to select only the tokenized_text column after the groupby and flatten its lists of words:
f = lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist])
df.groupby('category')['tokenized_text'].apply(f)
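To illustrate why the first attempt counted letters, here is a minimal sketch under a few assumptions: the sample titles are made up, collections.Counter stands in for nltk.FreqDist (FreqDist is a Counter subclass), and a plain whitespace split stands in for word_tokenize so the snippet runs without NLTK data. The helper name flatten_count is hypothetical. The same nested comprehension walks characters when the group holds strings and walks words when it holds token lists.

```python
from collections import Counter

import pandas as pd

# Made-up sample data, not the asker's real frame.
df = pd.DataFrame({
    "category": ["electronics", "electronics", "accessories"],
    "title": ["Kids Tablet Android", "Tablet PC Android", "Men Watch Quartz"],
})

# Assumption: whitespace split as a stand-in for nltk.word_tokenize.
df["tokenized_text"] = df["title"].str.split()


def flatten_count(values):
    """Count the items found inside each element of `values`."""
    # Counter is used here as a stand-in for nltk.FreqDist.
    return Counter(w for wordlist in values for w in wordlist)


# Each element is a string, so the inner loop yields single characters.
char_counts = flatten_count(df["title"])

# Each element is a list of tokens, so the inner loop yields words.
word_counts = {
    cat: flatten_count(tokens)
    for cat, tokens in df.groupby("category")["tokenized_text"]
}

print(word_counts["electronics"].most_common(2))
```

Building a plain dict of counters per category, as done here, is just one way to hold the result; it sidesteps how groupby.apply reshapes dict-like return values, which can differ between pandas versions.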
Answered by jezrael on November 22, 2021