Data Science Asked on April 21, 2021
I have the following dataset which is a .json file:
and I would like to get the first word for every string inside lista_asm
, so I would like to get: jmp,push,uncomisd,…etc
what I am doing to do this is the following:
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:[i.split()[0] for i in x])
but it gives me back the following error message:
3589 else:
3590 values = self.astype(object).values
-> 3591 mapped = lib.map_infer(values, f, convert=convert_dtype)
3592
3593 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-18-5506b5721bf1> in <lambda>(x)
----> 1 dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:[i.split()[0].strip() for i in x])
IndexError: list index out of range
I don’t understand what is wrong. Can somebody please help me?
[EDIT]Trying the code:
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:x[0].split(" ",2)[0])
and adding :
df = dataFrame[["opcodes", "semantic"]].copy()
df
I get:
what I would like to get is a list of the type [push,mov,..] and this for every row.
It seems like when I do x[0]
it does not return the first element of the list, but returns the pharentesis, which is weird. Am I doing something wrong which I don’t see?
My objective is to pre-process this dataset in order to feed features to my model, but I haveing hard times in doing so.
Your question is a bit confusing. So, as much as I understood from your examples, for each sample of list_asm, you want to extract the very first word from the string.
The thing you are doing wrong is treating the string as a list. That is, ['uncomisd xmm2, xmm2', 'jp 0x40', ...]
is considered as a string by python, not a list.
Thus, you need to extract the strings from your list first, then you can't take the first words from all these strings.
To achieve that, you can use a regular expression to find all the strings that are inside of quotes '...'
.
import pandas as pd
import re
# Read the file into dataframe
dataFrame = pd.read_json("dataset.json", lines=True)
# First extract the strings the take the first word of each string
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x: [i.split()[0] for i in re.findall("'([^']*)'", x)])
print(dataFrame)
or modular form of the code would be:
import pandas as pd
import re
# Function to extract the first words from each string
def extractFirstWord(str):
listOfWords = re.findall("'([^']*)'", str)
return [i.split()[0] for i in listOfWords]
# Read the file into dataframe
dataFrame = pd.read_json("dataset.json", lines=True)
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x: extractFirstWord(x))
print(dataFrame)
The result:
Correct answer by Shahriyar Mammadli on April 21, 2021
The problem is that you say "apply(lambda x:[i.split()[0] for i in x])"
As soon as you say apply, x is your list. So you can say following "apply(lambda x:x[0].split(" ", 2)[0])"
Meaning you say take first element in the list, than split on " " and in two parts. And than take the first word (part) with the last [0]
Answered by vienna_kaggling on April 21, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP