TransWikia.com

¿Cómo puedo reemplazar una línea con varias otras líneas en una dataframe?

Stack Overflow en español Asked on December 9, 2021

Tengo una dataframe

    QID     URL     Questions   Answers     Section     QType   Theme   Topics  Answer0     Answer1     Answer2     Answer3     Answer4     Answer5     Answer6
1113    1096    https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing   To what extent are the following factors considerations in your choice of flight?   ['Very important consideration', 'Important consideration', 'Neutral', 'Not an important consideration', 'Do not consider']     When choosing an airline to fly with, which factors are most important to you? (Please list 3.)     Multiple Choice     Airline XYZ     ['extent', 'follow', 'factor', 'consider', 'choic', 'flight']   Very important consideration    Important consideration     Neutral     Not an important consideration  Do not consider     NaN     NaN
1116    1097    https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing   How far in advance do you typically book your tickets?  ['0-2 months in advance', '2-4 months in advance', '4-6 months in advance', '6-8 months in advance', '8-10 months in advance', '10-12 months in advance', '12+ months in advance']  When choosing an airline to fly with, which factors are most important to you? (Please list 3.)     Multiple Choice     Airline XYZ     ['advanc', 'typic', 'book', 'ticket']   0-2 months in advance   2-4 months in advance   4-6 months in advance   6-8 months in advance   8-10 months in advance  10-12 months in advance     12+ months in advance

con filas de los cuales quiero cambiar unas pocas líneas que en realidad son títulos de QuestionGrid, con nuevas líneas que también representan las respuestas. Tengo un otra, Pickle, que contiene la información para construir las líneas que actualizarán las antiguas. Cada vez una línea antigua se transformará en varias líneas nuevas (lo especifico porque no sé cómo hacerlo).

Estas lineas solo son los titulos de grid de preguntas como la siguiente:

introducir la descripción de la imagen aquí

dataframe esperado

Me gustaria insertarlos en la dataframe original, en lugar de las lineas donde coinciden en la columna de Questions, como en el siguiente dataframe:

QID    Questions     QType    Answer1    Answer2    Answer3    Answer4    Answer5                                                                        
1096      'To what extent are the following factors considerations in your choice of flight?'    Question Grid    'Very important consideration'    'Important consideration'    'Neutral'    'Not an important consideration'    'Do not consider'
1096_S01  'The airline/company you fly with'
1096_S02  'The departure airport'
1096_S03  'Duration of flight/route'
1096_S04  'Baggage policy'
1097      'To what extent are the following factors considerations in your choice of flight?'    Question Grid ...
1097_S01 ...
...

El mio intento

Intenté utilisar la respuesta de gcoronel99 sobre la representacion de una grid de preguntas en un marco de datos.

import pickle

qa = pd.read_pickle(r'Python/interns.p')
df = pd.read_csv("QuestionBank.csv")


def isGrid(dic, df):
    '''Check if a row is a row related to a Google Forms grid
    if it is a case update this row'''
    d_answers = dic['answers']
    try:
        answers = d_answers[2]
        if len(answers) > 1:
            # find the line in df and replace the line where it matches by the lines
            update_lines(dic, df)
        return df
    except TypeError:
        return df
        
def update_lines(dic, df):
    '''find the line in df and replace the line where it matches 
    with the question in dic by the new lines'''
    lines_to_replace = df.index[df['Questions'] == dic['question']].tolist() # might be several rows and maybe they aren't all to replace
    # I check there is at least a row to replace 
    if lines_to_replace:
        # I replace all rows where the question matches
        for line_to_replace in lines_to_replace:
            # replace this row and the following by the following dataframe
            questions = reduce(lambda a,b: a + b,[data['answers'][2][x][3] for x in range(len(data['answers'][2]))])
            ind_answers = dic["answers"][2][0][1]
            answers = []
            # I get all the potential answers
            for i in range(len(ind_answers)):
                answers.append(reduce(lambda a,b: a+b,[ind_answers[i] for x in range(len(questions))])) # duplicates as there are many lines with the same answers in a grid, maybe I should have used set
            answers = {f"Answer{i}": answers[i] for i in range(0, len(answers))} # dyanmically allocate to place them in the right columns
            dict_replacing = {'Questions': questions, **answers} # dictionary that will replace the forle create the new lines
            df1 = pd.DataFrame(dict_replacing)
            df1.index = df1.index / 10 + line_to_replace
            df = df1.combine_first(df)
    return df

Hicé un Colaboratory notebook si quieren utilizarlo.

Lo que obtengo

Pero la dataframe esta de la misma tamana antes y despues de hacer esto. En efecto, obtengo:

QID    Questions     QType    Answer1    Answer2    Answer3    Answer4    Answer5                                                                        
1096      'To what extent are the following factors considerations in your choice of flight?'    Question Grid    'Very important consideration'    'Important consideration'    'Neutral'    'Not an important consideration'    'Do not consider'
1097      'To what extent are the following factors considerations in your choice of flight?'    Question Grid ...

Actualizacion

Logré obtener las filas que necesito dado una fila que en realidad representa un QuestionGrid.

En efecto con el siguiente ejemplo:

import collections
df = pd.DataFrame({"QID":[1177],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?"],"QType":["Likert Scale"],"Answer0":["Very important consideration"],"Answer1":["Important consideration"],"Answer2":["Somewhat consider"],"Answer3":["Not an important consideration"],"Answer4":["Do not consider"],"Answer5":["Discounted flights"],"Answer6":["Very important consideration"],"Answer7":["Important consideration"],"Answer8":["Somewhat consider"],"Answer9":["Not an important consideration"],"Answer10":["Do not consider"],"Answer11":["Baggage policy"],"Answer12":["Very important consideration"],"Answer13":["Important consideration"],"Answer14":["Somewhat consider"],"Answer15":["Not an important consideration"],"Answer16":["Do not consider"],"Answer17":["Price of flights"],"Answer18":["Very important consideration"],"Answer19":["Important consideration"],"Answer20":["Somewhat consider"],"Answer21":["Not an important consideration"],"Answer22":["Do not consider"],"Answer23":["Insurance"],"Answer24":["Very important consideration"],"Answer25":["Important consideration"],"Answer26":["Somewhat consider"],"Answer27":["Not an important consideration"],"Answer28":["Do not consider"],"Answer29":["Airport services"],"Answer30":["Very important consideration"],"Answer31":["Important consideration"],"Answer32":["Somewhat consider"],"Answer33":["Not an important consideration"],"Answer34":["Do not consider"],"Answer35":["Environmental impact"],"Answer36":["Very important consideration"],"Answer37":["Important consideration"],"Answer38":["Somewhat consider"],"Answer39":["Not an important consideration"],"Answer40":["Do not consider"],"Answer41":["In-flight service"],"Answer42":["Very important consideration"],"Answer43":["Important consideration"],"Answer44":["Somewhat consider"],"Answer45":["Not an important consideration"],"Answer46":["Do not consider"],"Answer47":["Customer support"],"Answer48":["Very important consideration"],"Answer49":["Important consideration"],"Answer50":["Somewhat consider"],"Answer51":["Not an important consideration"],"Answer52":["Do not consider"],"Answer53":["Overcrowding on aircraft/airports"],"Answer54":["Very important consideration"],"Answer55":["Important consideration"],"Answer56":["Somewhat consider"],"Answer57":["Not an important consideration"],"Answer58":["Do not consider"],"Answer59":["Airport safety after COVID-19"],"Answer60":["Very important consideration"],"Answer61":["Important consideration"],"Answer62":["Somewhat consider"],"Answer63":["Not an important consideration"],"Answer64":["Do not consider"],"Answer65":["Refund policy"]})
def getquestions(r):
    repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1})
    questions = []
    firstfound = 0
    for i in range(3, len(r)-len(repeat)):
        if r[i:i+len(repeat)].tolist()==repeat:
            if r[i+len(repeat):i+len(repeat)+1].values[0] is not None:
                questions.append(r[i+len(repeat):i+len(repeat)+1].values[0])
            if firstfound==0: firstfound = i+len(repeat)
    if len(questions) > 0:
        # somethong odd, sometimes it's a list other times a str
        newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
        r[1] = newq
        # reset all the questions that have been used by list
        for i in range(firstfound, len(r)):
            if isinstance(r[i], str): r[i] = None
    return r
def fixqid(c):
    return [id if i==0 or c[i-1]!=id else f"{id}_{i}" for i, id in enumerate(c)]

df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1)
df["QID"] = fixqid(df["QID"].values)   
df

Obtengo:

QID     Questions   QType   Answer0     Answer1     Answer2     Answer3     Answer4     Answer5     Answer6     ...     Answer56    Answer57    Answer58    Answer59    Answer60    Answer61    Answer62    Answer63    Answer64    Answer65
0   1177    The travel restrictions of COVID-19 have been ...   Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
1   1177_1  Discounted flights  Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
2   1177_2  Baggage policy  Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
3   1177_3  Price of flights    Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
4   1177_4  Insurance   Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
5   1177_5  Airport services    Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
6   1177_6  Environmental impact    Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
7   1177_7  In-flight service   Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
8   1177_8  Customer support    Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
9   1177_9  Overcrowding on aircraft/airports   Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
10  1177_10     Airport safety after COVID-19   Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    None    None    None    None    None    None    None
11  1177_11     Refund policy   Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     None    None    ...     None    None    None    

Entonces hoy querio aplicarlo a cada lineas de la dataframe.

for i, row in df.iterrows():
    passed_items = []
    for cell in row:
        if cell in passed_items:
            print("Line representing a Question Grid Detected")
            df_to_insert = getquestions(row)
            for j in range(len(df_to_insert)):
                df.loc[i+j] = df_to_insert.loc[i]
        else:
            passed_items.append(str(cell))

Pero obtengo:

Line representing a Question Grid Detected

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-188-ea807caf35a2> in <module>
      6             df_to_insert = getquestions(row)
      7             for j in range(len(df_to_insert)):
----> 8                 df.loc[i+j] = df_to_insert.loc[i]
      9         else:
     10             passed_items.append(str(cell))

C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in __getitem__(self, key)
   1766 
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769 
   1770     def _is_scalar_access(self, key: Tuple):

C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in _getitem_axis(self, key, axis)
   1962 
   1963         # fall thru to straight lookup
-> 1964         self._validate_key(key, axis)
   1965         return self._get_label(key, axis=axis)
   1966 

C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in _validate_key(self, key, axis)
   1829 
   1830         if not is_list_like_indexer(key):
-> 1831             self._convert_scalar_indexer(key, axis)
   1832 
   1833     def _is_scalar_access(self, key: Tuple) -> bool:

C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in _convert_scalar_indexer(self, key, axis)
    739         ax = self.obj._get_axis(min(axis, self.ndim - 1))
    740         # a scalar
--> 741         return ax._convert_scalar_indexer(key, kind=self.name)
    742 
    743     def _convert_slice_indexer(self, key: slice, axis: int):

C:ProgramDataAnaconda3libsite-packagespandascoreindexesbase.py in _convert_scalar_indexer(self, key, kind)
   2886             elif kind in ["loc"] and is_integer(key):
   2887                 if not self.holds_integer():
-> 2888                     self._invalid_indexer("label", key)
   2889 
   2890         return key

C:ProgramDataAnaconda3libsite-packagespandascoreindexesbase.py in _invalid_indexer(self, form, key)
   3075         """
   3076         raise TypeError(
-> 3077             f"cannot do {form} indexing on {type(self)} with these "
   3078             f"indexers [{key}] of {type(key)}"
   3079         )

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [4] of <class 'int'>

Qaui el problema es que una fila que no representa un QuestionGrid fue detectado como esto.

2 Answers

Aqui esta una transformacion sofistica:

  1. al coro constructemos Questions como una lista de embedded preguntas, pues explode()
  2. necesitemos utilisar collections.Counter() para dar claves únicas a los valores en las columnas Answer#, pues eliminamos las unas que solo se utilisan una vez
  3. con estalist la muevamos a traves de la columnas para encontrar posiciones que coincidan. Tomamos la siguiente columna como una pregunta incrustada, añadirla a la lista
  4. una vez que tengamos las preguntas incrustadas, ponemos list de nuevo en Questions columna. Reajustando todas las respuestas que son redundantes
  5. post proceso para arreglar la columna QID.

Esto en realidad encuentra 10 preguntas incrustadas en este ejemplo. He incluido la salida en un formato demasiado amplio para que se pueda formatear de forma razonable.

import collections
df = pd.DataFrame({"QID":[1177,"1177R"],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?","How would you like to book your next holiday?"],"QType":["Likert Scale","Likert Scale"],"Answer0":["Very important consideration","Airline XYZ app"],"Answer1":["Important consideration","Airline XYZ website"],"Answer2":["Somewhat consider","Third party website"],"Answer3":["Not an important consideration","Third party app"],"Answer4":["Do not consider","Travel agent"],"Answer5":["Discounted flights","Call"],"Answer6":["Very important consideration",""],"Answer7":["Important consideration",""],"Answer8":["Somewhat consider",""],"Answer9":["Not an important consideration",""],"Answer10":["Do not consider",""],"Answer11":["Baggage policy",""],"Answer12":["Very important consideration",""],"Answer13":["Important consideration",""],"Answer14":["Somewhat consider",""],"Answer15":["Not an important consideration",""],"Answer16":["Do not consider",""],"Answer17":["Price of flights",""],"Answer18":["Very important consideration",""],"Answer19":["Important consideration",""],"Answer20":["Somewhat consider",""],"Answer21":["Not an important consideration",""],"Answer22":["Do not consider",""],"Answer23":["Insurance",""],"Answer24":["Very important consideration",""],"Answer25":["Important consideration",""],"Answer26":["Somewhat consider",""],"Answer27":["Not an important consideration",""],"Answer28":["Do not consider",""],"Answer29":["Airport services",""],"Answer30":["Very important consideration",""],"Answer31":["Important consideration",""],"Answer32":["Somewhat consider",""],"Answer33":["Not an important consideration",""],"Answer34":["Do not consider",""],"Answer35":["Environmental impact",""],"Answer36":["Very important consideration",""],"Answer37":["Important consideration",""],"Answer38":["Somewhat consider",""],"Answer39":["Not an important consideration",""],"Answer40":["Do not consider",""],"Answer41":["In-flight service",""],"Answer42":["Very important consideration",""],"Answer43":["Important consideration",""],"Answer44":["Somewhat consider",""],"Answer45":["Not an important consideration",""],"Answer46":["Do not consider",""],"Answer47":["Customer support",""],"Answer48":["Very important consideration",""],"Answer49":["Important consideration",""],"Answer50":["Somewhat consider",""],"Answer51":["Not an important consideration",""],"Answer52":["Do not consider",""],"Answer53":["Overcrowding on aircraft/airports",""],"Answer54":["Very important consideration",""],"Answer55":["Important consideration",""],"Answer56":["Somewhat consider",""],"Answer57":["Not an important consideration",""],"Answer58":["Do not consider",""],"Answer59":["Airport safety after COVID-19",""],"Answer60":["Very important consideration",""],"Answer61":["Important consideration",""],"Answer62":["Somewhat consider",""],"Answer63":["Not an important consideration",""],"Answer64":["Do not consider",""],"Answer65":["Refund policy",""]})
def getquestions(r):
    repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1 and isinstance(k, str)})
    if len(repeat)<3: return r
    questions = []
    firstfound = 0
    for i in range(3, len(r)-len(repeat)):
        if r[i:i+len(repeat)].tolist()==repeat:
            if r[i+len(repeat):i+len(repeat)+1].values[0] is not None:
                questions.append(r[i+len(repeat):i+len(repeat)+1].values[0])
            if firstfound==0: firstfound = i+len(repeat)
    if len(questions) > 0:
        # somethong odd, sometimes it's a list other times a str
        newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
        r[1] = newq
        # reset all the questions that have been used by list
        for i in range(firstfound, len(r)):
            if isinstance(r[i], str): r[i] = np.nan
    return r
def fixqid(c):
    qid = []
    sub = 0
    for i, id in enumerate(c):
        if i==0 or c[i-1]!=id:
            sub=0
            qid.append(id)
        else:
            sub +=1
            qid.append(f"{id}_{sub}")
    return qid 

df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1)
df["QID"] = fixqid(df["QID"].values)   
df.iloc[:,:10]

Answered by Revolucion for Monica on December 9, 2021

Pero la dataframe esta de la misma tamana antes y despues de hacer esto.

¿Por qué no se obtiene de esta forma?

Realmente el problema está en el contenido Pickle, si te fijas a la hora de invocar la función con

update_lines(qa,df)

Te vas a topar con el error de

TypeError: list indices must be integers or slices, not str

Esto pasa a causa de la expresión ines_to_replace = df.index[df['Questions'] == dic['question']].tolist(), al decir dic['question'] se pretende que el Pickle tenga una llave que se llame "question". Lo que pasa es que el Pickle ni si siquiera es un diccionario : print(type(qa))

<class 'list'>

Si quieres reducir esa lista, tienes que aplicar dos filtros:

  1. El primera que unifice a todas la lista para solo obtener diccionarios.
  2. El segundo, que se deshaga de los enlaces de Drive

Aplicando ambos filtros a una copia de qa, llamada qa_

qa_ = reduce(lambda a,b: a + b,qa)
qa_ = [dict_ for dict_ in qa_ if str(dict_)[0:5] != "https"]

Por otro lado, revisando el Pickle me dí cuenta que no puedes seguir patrones estáticos para filtrar elementos. Por eso mismo obtienes ese resultado.

Intenté utilisar la respuesta de gcoronel99 sobre la representacion de una grid de preguntas en un marco de datos.

Fue mi error, en esta pregunta pensé que los únicos datos que tenías eran los que presentabas. No sabía que se tenían filtrar datos ordenados de maneras distintas.

Filtrar las respuestas

Al obtener los diccionarios te das cuanta, que el valor de la llave "answers" es una lista. Es una lista complicada de reducir, esto debido a que no es una lista de listas, si no una lista con listas y otros elementos.

Lo más práctico en este caso evidentemente es unificar todos en una misma lista. Como además de listas tiene otros elementos, reduce() no puede ayudar.

Entonces ¿Cómo?

Una manera de lograr esto es convirtiendo la lista en un string, y luego sacarle todos los corchetes.

def reduce_list(list_: list):
    rlist_ = list(filter(lambda a: a != "[" and a != "]",str(list_)))
    return list(eval("".join(rlist_)))

Con filter() se eliminan los corchetes de la lista como string, después se evalúa para obtener una sola lista.

Ya con esto se pueden obtener la listas de respuestas en una sola lista.

for dict_ in qa_:
    print(reduce_list(dict_["answers"]))

Se obtendría

[None, 5, 283973010, '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 0, "I don't want vacation", 'I desperately need a vacation']
[None, 2, 1591633300, 'In next 3 months', None, None, None, 0, 'In next 6 months', None, None, None, 0, 'In next 1 year', None, None, None, 0, 'Only once COVID-19 is under control', None, None, None, 0, 'Only once COVID-19 vaccine is developed', None, None, None, 0, 0, None, None, None, None, None, 0]
...

Por otro lado, hay otro problema. Estas lista tienen valores que no se van a llegar a usar, como lo son los None y los valores numéricos. Habría que aplicar otro filtro para eliminar esto, ahora es mucho más fácil cuando se trata de una sola lista.

def del_useless(list_: list):
    return list(filter(lambda a: a != 0 and a != None and type(a) is not int and a != "",list_))

Con esta función se eliminan

  1. Ceros
  2. Valores None
  3. Cualquier otro valor de tipo entero
  4. Cadenas vacías

Ahora se obtiene las respuestas que son válidas

for dict_ in qa_:
    print(del_useless(reduce_list(dict_["answers"])))

Mostraría

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', "I don't want vacation", 'I desperately need a vacation']
['In next 3 months', 'In next 6 months', 'In next 1 year', 'Only once COVID-19 is under control', 'Only once COVID-19 vaccine is developed']
["Doctor's availability in hotel", 'Ventilator availability in hotel', 'Tie-ups with nearby hospitals', 'Availability of medical rooms with primary first aid care']
...

Finalmente, lo más conveniente es acomodar todo en una misma lista de lista, donde las listas de adentro son cada lista de respuestas en qa_

answers = [del_useless(reduce_list(dict_["answers"])) for dict_ in qa_]

Filtrar las preguntas

Esto es sin duda es lo más fácil, con qa_ (una versión simplificada de qa) solo es cuestión de generar otra lista que contenga cada respuesta.

questions = [dict_["question"] for dict_ in qa_]

Eliminar las columnas que no se van a usar

En el DataFrame original habían 3016 filas y 333 columnas. El Pickle sola da para 2959 filas y 258 columnas (exclusivas de respuestas). Por lo tanto había que aplicarle drop() al DataFrame con las columnas

df1 = df.drop(["URL","Answers","Section","Theme","Topics",'Ref', 'odd', '1a', '1b', '1c', '2a', '2b', '2c', '3a', '3b', '3c', '3d', '3e', '3f', '3g', '3h', '4a', '4b', '4c', '5a', '5b', '5c', '5d', '6a', '6b', '6c', '6d', '7a', '7b', '7c', '7d', '7e', '7f', '7g', '7h', '7i', '7j', '7k', '8a', '8b', '8c', '8d', '8e', '8f', '8g', '9a', '9b', '9c', '9d', '9e', '9f', '9g', '10a', '10b', '10c', '10d', '10e', '10f', '11a', '11b', '11c', '11d', '11e', 'Sum', 'Comment', 'Beyond repair', 'Not scrapped well'],axis = 1)

Hacer los mismo con varias filas

rows_delete = [index for index in list(df1.index)[2959:]]
df1 = df1.drop(rows_delete,axis = 0)

Sin embargo, las respuestas todavía no se pueden aplicar al DataFrame, debido a que muchas de las listas con respuestas no tiene la misma longitud. Todas deben de tener 258 elementos.

def match_length(list_: list):
    while len(list_) < 258:
        list_.append(None)

    return list_

Aplicando esta función a cada lista de respuestas, todas tendrían la misma longitud

answers = [match_length(answer) for answer in answers]

Acomodar las respuestas

No se pueden meter al DataFrame cada lista de respuestas a secas, se deben de combinar. Esto por que en las columnas respuestas tienen que ir una respuesta por columna. Entonces en cada columna tendría que haber una respuesta de la lista de respuestas de cada lista de respuestas.

Para esto hay que hacer una especia de iteración anidada, con indices inversos

t_answers = []
for index2 in range(len(answers[0])):
    t_answers.append([])
    for index1 in range(len(answers)):
        t_answers[index2].append(answers[index1][index2])

Así t_answers tendrá una respuesta de cada lista de respuestas.

Me gustaria insertarlos en la dataframe original, en lugar de las lineas donde coinciden en la columna de Questions, como en el siguiente dataframe...

Montar el DataFrame

En este ejemplo lo voy a hacer con el DataFrame copia (df1), pero la idea es aplicable al mismo DataFrame.

Para montar las preguntas, hay que asignarle a la columna con las preguntas viejas (con las filas eliminadas para que solo queden 2959 filas) la lista questions.

df1["Questions"] = questions

Para montar las respuestas, al tratarse de 255 columnas para respuestas hay que hacer una iteración. De cualquier modo se logra.

for n in range(len(list(df1.columns)) - 3):
    df1[f'Answer{n}'] = t_answers[n]

Dicho esto, hagamos un update a update_lines()

La función final

Para que sea más entendible y limpio habría que definir las funciones reduce_list(), del_useless() y match_length() (antes mostradas). También es necesario importar reduce() del modulo functools.

def update_lines(list_,dataframe):
    qa_ = reduce(lambda a,b: a + b,list_)
    qa_ = [dict_ for dict_ in qa_ if str(dict_)[0:5] != "https"]

    answers = [del_useless(reduce_list(dict_["answers"])) for dict_ in qa_]
    answers = [match_length(answer) for answer in answers]

    questions = [dict_["question"] for dict_ in qa_]

    df = dataframe.drop(["URL","Answers","Section","Theme","Topics",'Ref', 'odd', '1a', '1b', '1c', '2a', '2b', '2c', '3a', '3b', '3c', '3d', '3e', '3f', '3g', '3h', '4a', '4b', '4c', '5a', '5b', '5c', '5d', '6a', '6b', '6c', '6d', '7a', '7b', '7c', '7d', '7e', '7f', '7g', '7h', '7i', '7j', '7k', '8a', '8b', '8c', '8d', '8e', '8f', '8g', '9a', '9b', '9c', '9d', '9e', '9f', '9g', '10a', '10b', '10c', '10d', '10e', '10f', '11a', '11b', '11c', '11d', '11e', 'Sum', 'Comment', 'Beyond repair', 'Not scrapped well'],axis = 1)
    rows_delete = [index for index in list(df.index)[2959:]]
    df = df.drop(rows_delete,axis = 0)

    t_answers = []
    for index2 in range(len(answers[0])):
        t_answers.append([])
        for index1 in range(len(answers)):
            t_answers[index2].append(answers[index1][index2])

    df["Questions"] = questions

    for n in range(len(list(df.columns)) - 3):
        df[f'Answer{n}'] = t_answers[n]

    return df

Este código es muy diferente a tu intento, debido a que las técnicas que usas son de una respuesta mía a otra pregunta tuya. En esa pregunta no me había quedado claro que se estaba hablando de tantos datos ordenados de maneras diversas. En esa respuesta usaba métodos específicos para tu caso, pienso que esta manera es portable a cualquier tipo de lista con listas y otros elementos.

Al invocar la función con

from functools import reduce

#Rutas de los archivos en mi escritorio
qa = pd.read_pickle("interns.p") 
df = pd.read_csv("QuestionBank.csv")

print(update_lines(qa,df))

Se obtendría esto

       QID                                          Questions            QType  ... Answer255 Answer256 Answer257
0        0  How do you feel about your next vacation after...  Multiple Choice  ...      None      None      None
1       0R    When do you think your next vacation can start?       Checkboxes  ...      None      None      None
2        1  What are your preferences regarding medical tr...  Multiple Choice  ...      None      None      None
3        2  What is your preferences of complementary brea...     Likert Scale  ...      None      None      None
4        4  What is your preference for a in-hotel grocery...  Multiple Choice  ...      None      None      None
...    ...                                                ...              ...  ...       ...       ...       ...
2954  2514  What suggestions do you have for Global Hotel ...  Multiple Choice  ...      None      None      None
2955  2515                                  What is your age?  Multiple Choice  ...      None      None      None
2956  2516                               What is your gender?  Multiple Choice  ...      None      None      None
2957  2517                    What is your highest education?  Multiple Choice  ...      None      None      None
2958  2518   What is your disposable income monthly? (in GBP)     Likert Scale  ...      None      None      None

Bien, esta vista no ofrece mucho, ya que muestra las ultimas columnas, donde casi no hay respuestas. Puedes ver el .csv completo aquí.

Como verás el problema estaba en como filtrar los datos. Espero te haya servido de al go.

Answered by user166844 on December 9, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP