TransWikia.com

Problem extracting words from dataframe

Data Science Asked on April 21, 2021

I have the following dataset which is a .json file:

[image: screenshot of the dataset]

and I would like to get the first word of every string inside lista_asm, so I would like to get: jmp, push, uncomisd, … etc.

What I am doing to achieve this is the following:

dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:[i.split()[0] for i in x])

but it gives me the following error message:

   3589             else:
   3590                 values = self.astype(object).values
-> 3591                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3592 
   3593         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-18-5506b5721bf1> in <lambda>(x)
----> 1 dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:[i.split()[0].strip() for i in x])

IndexError: list index out of range

I don’t understand what is wrong. Can somebody please help me?

[EDIT] Trying the code:

dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:x[0].split(" ",2)[0])

and adding:

df = dataFrame[["opcodes", "semantic"]].copy()
df

I get:

[image: screenshot of the resulting dataframe]

What I would like to get is a list of the form [push,mov,..] for every row.

It seems that x[0] does not return the first element of the list but the opening parenthesis, which is weird. Am I doing something wrong that I don't see?

My objective is to pre-process this dataset in order to feed features to my model, but I am having a hard time doing so.

2 Answers

Your question is a bit confusing, but as far as I understand from your examples, for each sample of lista_asm you want to extract the first word of each string.

The mistake is treating the string as a list. That is, ['uncomisd xmm2, xmm2', 'jp 0x40', ...] is stored as a string as far as Python is concerned, not as a list.
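A minimal sketch of why the original lambda fails, assuming each cell of lista_asm holds the text of a list rather than a real list (the sample value below is made up for illustration):

```python
# Hypothetical cell value: the text of a list, not an actual list
cell = "['jmp 0x10', 'push rbp']"

# Indexing a string yields a single character, not a list element
first_char = cell[0]          # '[' rather than 'jmp 0x10'

# A whitespace-only string splits into an empty list
space_parts = " ".split()     # []

# Iterating over the string visits every character, including spaces,
# so [0] on the empty list raises IndexError
try:
    opcodes = [c.split()[0] for c in cell]
    error = None
except IndexError as exc:
    error = str(exc)
```

This would also explain why x[0] in the edited question returns the bracket instead of the first instruction.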

Thus, you need to extract the strings from that text first; then you can take the first word of each of these strings.

To achieve that, you can use a regular expression to find all the strings that are inside of quotes '...'.

import pandas as pd
import re

# Read the file into dataframe
dataFrame = pd.read_json("dataset.json", lines=True)


# First extract the strings, then take the first word of each string
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x: [i.split()[0] for i in re.findall("'([^']*)'", x)])
print(dataFrame)

Or, in a more modular form, the code would be:

import pandas as pd
import re

# Function to extract the first words from each string
def extractFirstWord(text):
    # Find every quoted string (parameter renamed so it does not shadow the built-in str)
    listOfWords = re.findall("'([^']*)'", text)
    return [i.split()[0] for i in listOfWords]

# Read the file into dataframe
dataFrame = pd.read_json("dataset.json", lines=True)


dataFrame['opcodes'] = dataFrame['lista_asm'].apply(extractFirstWord)
print(dataFrame)

The result:

[image: screenshot of the resulting dataframe]
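As an alternative sketch (a different technique from the regular expression above), if every cell is a valid Python list literal, ast.literal_eval from the standard library can turn the text into a real list before splitting. The sample cells and the helper name first_words are assumptions for illustration; empty instructions are skipped so that split()[0] cannot fail.

```python
import ast

# Hypothetical cell values standing in for the lista_asm column
cells = [
    "['jmp 0x10', 'push rbp', 'uncomisd xmm2, xmm2']",
    "['mov eax, 1', '']",
]

def first_words(cell):
    # Parse the list literal stored as text into a real Python list
    instructions = ast.literal_eval(cell)
    # Keep the first word of each non-empty instruction
    return [ins.split()[0] for ins in instructions if ins.strip()]

opcodes = [first_words(cell) for cell in cells]
```

With the dataframe from the question, the same helper could presumably be applied as dataFrame['opcodes'] = dataFrame['lista_asm'].apply(first_words).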

Correct answer by Shahriyar Mammadli on April 21, 2021

The problem is that you wrote "apply(lambda x:[i.split()[0] for i in x])"

As soon as you call apply, x is your list. So you can write the following instead: "apply(lambda x:x[0].split(" ", 2)[0])"

This takes the first element of the list, then splits it on " " at most twice, and then takes the first word (part) with the last [0].
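To illustrate what this expression returns, here is a small sketch with made-up values. It shows the result for a real list and for a list stored as text, which appears to be the situation described in the edited question.

```python
as_list = ["push rbp", "mov rbp, rsp"]       # hypothetical real list
as_text = "['push rbp', 'mov rbp, rsp']"     # hypothetical list stored as text

# split(" ", 2) performs at most two splits, giving up to three parts
parts = "mov rbp, rsp".split(" ", 2)

from_list = as_list[0].split(" ", 2)[0]       # first word of the first element
from_text = as_text[0].split(" ", 2)[0]       # first character of the string
```

In both cases the expression yields a single value per row, not a list of opcodes.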

Answered by vienna_kaggling on April 21, 2021
