Stack Overflow en español Asked on December 9, 2021
Tengo una dataframe
QID URL Questions Answers Section QType Theme Topics Answer0 Answer1 Answer2 Answer3 Answer4 Answer5 Answer6
1113 1096 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing To what extent are the following factors considerations in your choice of flight? ['Very important consideration', 'Important consideration', 'Neutral', 'Not an important consideration', 'Do not consider'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['extent', 'follow', 'factor', 'consider', 'choic', 'flight'] Very important consideration Important consideration Neutral Not an important consideration Do not consider NaN NaN
1116 1097 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing How far in advance do you typically book your tickets? ['0-2 months in advance', '2-4 months in advance', '4-6 months in advance', '6-8 months in advance', '8-10 months in advance', '10-12 months in advance', '12+ months in advance'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['advanc', 'typic', 'book', 'ticket'] 0-2 months in advance 2-4 months in advance 4-6 months in advance 6-8 months in advance 8-10 months in advance 10-12 months in advance 12+ months in advance
con filas de los cuales quiero cambiar unas pocas líneas que en realidad son títulos de QuestionGrid, con nuevas líneas que también representan las respuestas. Tengo un otra, Pickle, que contiene la información para construir las líneas que actualizarán las antiguas. Cada vez una línea antigua se transformará en varias líneas nuevas (lo especifico porque no sé cómo hacerlo).
Estas lineas solo son los titulos de grid de preguntas como la siguiente:
Me gustaria insertarlos en la dataframe original, en lugar de las lineas donde coinciden en la columna de Questions
, como en el siguiente dataframe:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1096_S01 'The airline/company you fly with'
1096_S02 'The departure airport'
1096_S03 'Duration of flight/route'
1096_S04 'Baggage policy'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
1097_S01 ...
...
Intenté utilisar la respuesta de gcoronel99 sobre la representacion de una grid de preguntas en un marco de datos.
import pickle
qa = pd.read_pickle(r'Python/interns.p')
df = pd.read_csv("QuestionBank.csv")
def isGrid(dic, df):
'''Check if a row is a row related to a Google Forms grid
if it is a case update this row'''
d_answers = dic['answers']
try:
answers = d_answers[2]
if len(answers) > 1:
# find the line in df and replace the line where it matches by the lines
update_lines(dic, df)
return df
except TypeError:
return df
def update_lines(dic, df):
'''find the line in df and replace the line where it matches
with the question in dic by the new lines'''
lines_to_replace = df.index[df['Questions'] == dic['question']].tolist() # might be several rows and maybe they aren't all to replace
# I check there is at least a row to replace
if lines_to_replace:
# I replace all rows where the question matches
for line_to_replace in lines_to_replace:
# replace this row and the following by the following dataframe
questions = reduce(lambda a,b: a + b,[data['answers'][2][x][3] for x in range(len(data['answers'][2]))])
ind_answers = dic["answers"][2][0][1]
answers = []
# I get all the potential answers
for i in range(len(ind_answers)):
answers.append(reduce(lambda a,b: a+b,[ind_answers[i] for x in range(len(questions))])) # duplicates as there are many lines with the same answers in a grid, maybe I should have used set
answers = {f"Answer{i}": answers[i] for i in range(0, len(answers))} # dyanmically allocate to place them in the right columns
dict_replacing = {'Questions': questions, **answers} # dictionary that will replace the forle create the new lines
df1 = pd.DataFrame(dict_replacing)
df1.index = df1.index / 10 + line_to_replace
df = df1.combine_first(df)
return df
Hicé un Colaboratory notebook si quieren utilizarlo.
Pero la dataframe esta de la misma tamana antes y despues de hacer esto. En efecto, obtengo:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
Logré obtener las filas que necesito dado una fila que en realidad representa un QuestionGrid.
En efecto con el siguiente ejemplo:
import collections
df = pd.DataFrame({"QID":[1177],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?"],"QType":["Likert Scale"],"Answer0":["Very important consideration"],"Answer1":["Important consideration"],"Answer2":["Somewhat consider"],"Answer3":["Not an important consideration"],"Answer4":["Do not consider"],"Answer5":["Discounted flights"],"Answer6":["Very important consideration"],"Answer7":["Important consideration"],"Answer8":["Somewhat consider"],"Answer9":["Not an important consideration"],"Answer10":["Do not consider"],"Answer11":["Baggage policy"],"Answer12":["Very important consideration"],"Answer13":["Important consideration"],"Answer14":["Somewhat consider"],"Answer15":["Not an important consideration"],"Answer16":["Do not consider"],"Answer17":["Price of flights"],"Answer18":["Very important consideration"],"Answer19":["Important consideration"],"Answer20":["Somewhat consider"],"Answer21":["Not an important consideration"],"Answer22":["Do not consider"],"Answer23":["Insurance"],"Answer24":["Very important consideration"],"Answer25":["Important consideration"],"Answer26":["Somewhat consider"],"Answer27":["Not an important consideration"],"Answer28":["Do not consider"],"Answer29":["Airport services"],"Answer30":["Very important consideration"],"Answer31":["Important consideration"],"Answer32":["Somewhat consider"],"Answer33":["Not an important consideration"],"Answer34":["Do not consider"],"Answer35":["Environmental impact"],"Answer36":["Very important consideration"],"Answer37":["Important consideration"],"Answer38":["Somewhat consider"],"Answer39":["Not an important consideration"],"Answer40":["Do not consider"],"Answer41":["In-flight service"],"Answer42":["Very important consideration"],"Answer43":["Important consideration"],"Answer44":["Somewhat consider"],"Answer45":["Not an important consideration"],"Answer46":["Do not consider"],"Answer47":["Customer support"],"Answer48":["Very important consideration"],"Answer49":["Important consideration"],"Answer50":["Somewhat consider"],"Answer51":["Not an important consideration"],"Answer52":["Do not consider"],"Answer53":["Overcrowding on aircraft/airports"],"Answer54":["Very important consideration"],"Answer55":["Important consideration"],"Answer56":["Somewhat consider"],"Answer57":["Not an important consideration"],"Answer58":["Do not consider"],"Answer59":["Airport safety after COVID-19"],"Answer60":["Very important consideration"],"Answer61":["Important consideration"],"Answer62":["Somewhat consider"],"Answer63":["Not an important consideration"],"Answer64":["Do not consider"],"Answer65":["Refund policy"]})
def getquestions(r):
repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1})
questions = []
firstfound = 0
for i in range(3, len(r)-len(repeat)):
if r[i:i+len(repeat)].tolist()==repeat:
if r[i+len(repeat):i+len(repeat)+1].values[0] is not None:
questions.append(r[i+len(repeat):i+len(repeat)+1].values[0])
if firstfound==0: firstfound = i+len(repeat)
if len(questions) > 0:
# somethong odd, sometimes it's a list other times a str
newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
r[1] = newq
# reset all the questions that have been used by list
for i in range(firstfound, len(r)):
if isinstance(r[i], str): r[i] = None
return r
def fixqid(c):
return [id if i==0 or c[i-1]!=id else f"{id}_{i}" for i, id in enumerate(c)]
df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1)
df["QID"] = fixqid(df["QID"].values)
df
Obtengo:
QID Questions QType Answer0 Answer1 Answer2 Answer3 Answer4 Answer5 Answer6 ... Answer56 Answer57 Answer58 Answer59 Answer60 Answer61 Answer62 Answer63 Answer64 Answer65
0 1177 The travel restrictions of COVID-19 have been ... Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
1 1177_1 Discounted flights Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
2 1177_2 Baggage policy Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
3 1177_3 Price of flights Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
4 1177_4 Insurance Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
5 1177_5 Airport services Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
6 1177_6 Environmental impact Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
7 1177_7 In-flight service Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
8 1177_8 Customer support Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
9 1177_9 Overcrowding on aircraft/airports Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
10 1177_10 Airport safety after COVID-19 Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None None None None None None None None
11 1177_11 Refund policy Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider None None ... None None None
Entonces hoy querio aplicarlo a cada lineas de la dataframe.
for i, row in df.iterrows():
passed_items = []
for cell in row:
if cell in passed_items:
print("Line representing a Question Grid Detected")
df_to_insert = getquestions(row)
for j in range(len(df_to_insert)):
df.loc[i+j] = df_to_insert.loc[i]
else:
passed_items.append(str(cell))
Pero obtengo:
Line representing a Question Grid Detected
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-188-ea807caf35a2> in <module>
6 df_to_insert = getquestions(row)
7 for j in range(len(df_to_insert)):
----> 8 df.loc[i+j] = df_to_insert.loc[i]
9 else:
10 passed_items.append(str(cell))
C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in __getitem__(self, key)
1766
1767 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768 return self._getitem_axis(maybe_callable, axis=axis)
1769
1770 def _is_scalar_access(self, key: Tuple):
C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in _getitem_axis(self, key, axis)
1962
1963 # fall thru to straight lookup
-> 1964 self._validate_key(key, axis)
1965 return self._get_label(key, axis=axis)
1966
C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in _validate_key(self, key, axis)
1829
1830 if not is_list_like_indexer(key):
-> 1831 self._convert_scalar_indexer(key, axis)
1832
1833 def _is_scalar_access(self, key: Tuple) -> bool:
C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py in _convert_scalar_indexer(self, key, axis)
739 ax = self.obj._get_axis(min(axis, self.ndim - 1))
740 # a scalar
--> 741 return ax._convert_scalar_indexer(key, kind=self.name)
742
743 def _convert_slice_indexer(self, key: slice, axis: int):
C:ProgramDataAnaconda3libsite-packagespandascoreindexesbase.py in _convert_scalar_indexer(self, key, kind)
2886 elif kind in ["loc"] and is_integer(key):
2887 if not self.holds_integer():
-> 2888 self._invalid_indexer("label", key)
2889
2890 return key
C:ProgramDataAnaconda3libsite-packagespandascoreindexesbase.py in _invalid_indexer(self, form, key)
3075 """
3076 raise TypeError(
-> 3077 f"cannot do {form} indexing on {type(self)} with these "
3078 f"indexers [{key}] of {type(key)}"
3079 )
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [4] of <class 'int'>
Qaui el problema es que una fila que no representa un QuestionGrid fue detectado como esto.
Aqui esta una transformacion sofistica:
explode()
collections.Counter()
para dar claves únicas a los valores en las columnas Answer#, pues eliminamos las unas que solo se utilisan una vezEsto en realidad encuentra 10 preguntas incrustadas en este ejemplo. He incluido la salida en un formato demasiado amplio para que se pueda formatear de forma razonable.
import collections
df = pd.DataFrame({"QID":[1177,"1177R"],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?","How would you like to book your next holiday?"],"QType":["Likert Scale","Likert Scale"],"Answer0":["Very important consideration","Airline XYZ app"],"Answer1":["Important consideration","Airline XYZ website"],"Answer2":["Somewhat consider","Third party website"],"Answer3":["Not an important consideration","Third party app"],"Answer4":["Do not consider","Travel agent"],"Answer5":["Discounted flights","Call"],"Answer6":["Very important consideration",""],"Answer7":["Important consideration",""],"Answer8":["Somewhat consider",""],"Answer9":["Not an important consideration",""],"Answer10":["Do not consider",""],"Answer11":["Baggage policy",""],"Answer12":["Very important consideration",""],"Answer13":["Important consideration",""],"Answer14":["Somewhat consider",""],"Answer15":["Not an important consideration",""],"Answer16":["Do not consider",""],"Answer17":["Price of flights",""],"Answer18":["Very important consideration",""],"Answer19":["Important consideration",""],"Answer20":["Somewhat consider",""],"Answer21":["Not an important consideration",""],"Answer22":["Do not consider",""],"Answer23":["Insurance",""],"Answer24":["Very important consideration",""],"Answer25":["Important consideration",""],"Answer26":["Somewhat consider",""],"Answer27":["Not an important consideration",""],"Answer28":["Do not consider",""],"Answer29":["Airport services",""],"Answer30":["Very important consideration",""],"Answer31":["Important consideration",""],"Answer32":["Somewhat consider",""],"Answer33":["Not an important consideration",""],"Answer34":["Do not consider",""],"Answer35":["Environmental impact",""],"Answer36":["Very important consideration",""],"Answer37":["Important consideration",""],"Answer38":["Somewhat consider",""],"Answer39":["Not an important consideration",""],"Answer40":["Do not consider",""],"Answer41":["In-flight service",""],"Answer42":["Very important consideration",""],"Answer43":["Important consideration",""],"Answer44":["Somewhat consider",""],"Answer45":["Not an important consideration",""],"Answer46":["Do not consider",""],"Answer47":["Customer support",""],"Answer48":["Very important consideration",""],"Answer49":["Important consideration",""],"Answer50":["Somewhat consider",""],"Answer51":["Not an important consideration",""],"Answer52":["Do not consider",""],"Answer53":["Overcrowding on aircraft/airports",""],"Answer54":["Very important consideration",""],"Answer55":["Important consideration",""],"Answer56":["Somewhat consider",""],"Answer57":["Not an important consideration",""],"Answer58":["Do not consider",""],"Answer59":["Airport safety after COVID-19",""],"Answer60":["Very important consideration",""],"Answer61":["Important consideration",""],"Answer62":["Somewhat consider",""],"Answer63":["Not an important consideration",""],"Answer64":["Do not consider",""],"Answer65":["Refund policy",""]})
def getquestions(r):
repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1 and isinstance(k, str)})
if len(repeat)<3: return r
questions = []
firstfound = 0
for i in range(3, len(r)-len(repeat)):
if r[i:i+len(repeat)].tolist()==repeat:
if r[i+len(repeat):i+len(repeat)+1].values[0] is not None:
questions.append(r[i+len(repeat):i+len(repeat)+1].values[0])
if firstfound==0: firstfound = i+len(repeat)
if len(questions) > 0:
# somethong odd, sometimes it's a list other times a str
newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
r[1] = newq
# reset all the questions that have been used by list
for i in range(firstfound, len(r)):
if isinstance(r[i], str): r[i] = np.nan
return r
def fixqid(c):
qid = []
sub = 0
for i, id in enumerate(c):
if i==0 or c[i-1]!=id:
sub=0
qid.append(id)
else:
sub +=1
qid.append(f"{id}_{sub}")
return qid
df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1)
df["QID"] = fixqid(df["QID"].values)
df.iloc[:,:10]
Answered by Revolucion for Monica on December 9, 2021
Pero la dataframe esta de la misma tamana antes y despues de hacer esto.
Realmente el problema está en el contenido Pickle, si te fijas a la hora de invocar la función con
update_lines(qa,df)
Te vas a topar con el error de
TypeError: list indices must be integers or slices, not str
Esto pasa a causa de la expresión ines_to_replace = df.index[df['Questions'] == dic['question']].tolist()
, al decir dic['question']
se pretende que el Pickle tenga una llave que se llame "question". Lo que pasa es que el Pickle ni si siquiera es un diccionario : print(type(qa))
<class 'list'>
Si quieres reducir esa lista, tienes que aplicar dos filtros:
Aplicando ambos filtros a una copia de qa
, llamada qa_
qa_ = reduce(lambda a,b: a + b,qa)
qa_ = [dict_ for dict_ in qa_ if str(dict_)[0:5] != "https"]
Por otro lado, revisando el Pickle me dí cuenta que no puedes seguir patrones estáticos para filtrar elementos. Por eso mismo obtienes ese resultado.
Intenté utilisar la respuesta de gcoronel99 sobre la representacion de una grid de preguntas en un marco de datos.
Fue mi error, en esta pregunta pensé que los únicos datos que tenías eran los que presentabas. No sabía que se tenían filtrar datos ordenados de maneras distintas.
Al obtener los diccionarios te das cuanta, que el valor de la llave "answers" es una lista. Es una lista complicada de reducir, esto debido a que no es una lista de listas, si no una lista con listas y otros elementos.
Lo más práctico en este caso evidentemente es unificar todos en una misma lista. Como además de listas tiene otros elementos, reduce()
no puede ayudar.
Entonces ¿Cómo?
Una manera de lograr esto es convirtiendo la lista en un string, y luego sacarle todos los corchetes.
def reduce_list(list_: list):
rlist_ = list(filter(lambda a: a != "[" and a != "]",str(list_)))
return list(eval("".join(rlist_)))
Con filter()
se eliminan los corchetes de la lista como string, después se evalúa para obtener una sola lista.
Ya con esto se pueden obtener la listas de respuestas en una sola lista.
for dict_ in qa_:
print(reduce_list(dict_["answers"]))
Se obtendría
[None, 5, 283973010, '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 0, "I don't want vacation", 'I desperately need a vacation']
[None, 2, 1591633300, 'In next 3 months', None, None, None, 0, 'In next 6 months', None, None, None, 0, 'In next 1 year', None, None, None, 0, 'Only once COVID-19 is under control', None, None, None, 0, 'Only once COVID-19 vaccine is developed', None, None, None, 0, 0, None, None, None, None, None, 0]
...
Por otro lado, hay otro problema. Estas lista tienen valores que no se van a llegar a usar, como lo son los None
y los valores numéricos. Habría que aplicar otro filtro para eliminar esto, ahora es mucho más fácil cuando se trata de una sola lista.
def del_useless(list_: list):
return list(filter(lambda a: a != 0 and a != None and type(a) is not int and a != "",list_))
Con esta función se eliminan
None
Ahora se obtiene las respuestas que son válidas
for dict_ in qa_:
print(del_useless(reduce_list(dict_["answers"])))
Mostraría
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', "I don't want vacation", 'I desperately need a vacation']
['In next 3 months', 'In next 6 months', 'In next 1 year', 'Only once COVID-19 is under control', 'Only once COVID-19 vaccine is developed']
["Doctor's availability in hotel", 'Ventilator availability in hotel', 'Tie-ups with nearby hospitals', 'Availability of medical rooms with primary first aid care']
...
Finalmente, lo más conveniente es acomodar todo en una misma lista de lista, donde las listas de adentro son cada lista de respuestas en qa_
answers = [del_useless(reduce_list(dict_["answers"])) for dict_ in qa_]
Esto es sin duda es lo más fácil, con qa_
(una versión simplificada de qa
) solo es cuestión de generar otra lista que contenga cada respuesta.
questions = [dict_["question"] for dict_ in qa_]
En el DataFrame
original habían 3016 filas y 333 columnas. El Pickle sola da para 2959 filas y 258 columnas (exclusivas de respuestas). Por lo tanto había que aplicarle drop()
al DataFrame
con las columnas
df1 = df.drop(["URL","Answers","Section","Theme","Topics",'Ref', 'odd', '1a', '1b', '1c', '2a', '2b', '2c', '3a', '3b', '3c', '3d', '3e', '3f', '3g', '3h', '4a', '4b', '4c', '5a', '5b', '5c', '5d', '6a', '6b', '6c', '6d', '7a', '7b', '7c', '7d', '7e', '7f', '7g', '7h', '7i', '7j', '7k', '8a', '8b', '8c', '8d', '8e', '8f', '8g', '9a', '9b', '9c', '9d', '9e', '9f', '9g', '10a', '10b', '10c', '10d', '10e', '10f', '11a', '11b', '11c', '11d', '11e', 'Sum', 'Comment', 'Beyond repair', 'Not scrapped well'],axis = 1)
Hacer los mismo con varias filas
rows_delete = [index for index in list(df1.index)[2959:]]
df1 = df1.drop(rows_delete,axis = 0)
Sin embargo, las respuestas todavía no se pueden aplicar al DataFrame
, debido a que muchas de las listas con respuestas no tiene la misma longitud. Todas deben de tener 258 elementos.
def match_length(list_: list):
while len(list_) < 258:
list_.append(None)
return list_
Aplicando esta función a cada lista de respuestas, todas tendrían la misma longitud
answers = [match_length(answer) for answer in answers]
No se pueden meter al DataFrame cada lista de respuestas a secas, se deben de combinar. Esto por que en las columnas respuestas tienen que ir una respuesta por columna. Entonces en cada columna tendría que haber una respuesta de la lista de respuestas de cada lista de respuestas.
Para esto hay que hacer una especia de iteración anidada, con indices inversos
t_answers = []
for index2 in range(len(answers[0])):
t_answers.append([])
for index1 in range(len(answers)):
t_answers[index2].append(answers[index1][index2])
Así t_answers
tendrá una respuesta de cada lista de respuestas.
Me gustaria insertarlos en la dataframe original, en lugar de las lineas donde coinciden en la columna de Questions, como en el siguiente dataframe...
DataFrame
En este ejemplo lo voy a hacer con el DataFrame
copia (df1
), pero la idea es aplicable al mismo DataFrame.
Para montar las preguntas, hay que asignarle a la columna con las preguntas viejas (con las filas eliminadas para que solo queden 2959 filas) la lista questions
.
df1["Questions"] = questions
Para montar las respuestas, al tratarse de 255 columnas para respuestas hay que hacer una iteración. De cualquier modo se logra.
for n in range(len(list(df1.columns)) - 3):
df1[f'Answer{n}'] = t_answers[n]
Dicho esto, hagamos un update a update_lines()
Para que sea más entendible y limpio habría que definir las funciones reduce_list()
, del_useless()
y match_length()
(antes mostradas). También es necesario importar reduce()
del modulo functools
.
def update_lines(list_,dataframe):
qa_ = reduce(lambda a,b: a + b,list_)
qa_ = [dict_ for dict_ in qa_ if str(dict_)[0:5] != "https"]
answers = [del_useless(reduce_list(dict_["answers"])) for dict_ in qa_]
answers = [match_length(answer) for answer in answers]
questions = [dict_["question"] for dict_ in qa_]
df = dataframe.drop(["URL","Answers","Section","Theme","Topics",'Ref', 'odd', '1a', '1b', '1c', '2a', '2b', '2c', '3a', '3b', '3c', '3d', '3e', '3f', '3g', '3h', '4a', '4b', '4c', '5a', '5b', '5c', '5d', '6a', '6b', '6c', '6d', '7a', '7b', '7c', '7d', '7e', '7f', '7g', '7h', '7i', '7j', '7k', '8a', '8b', '8c', '8d', '8e', '8f', '8g', '9a', '9b', '9c', '9d', '9e', '9f', '9g', '10a', '10b', '10c', '10d', '10e', '10f', '11a', '11b', '11c', '11d', '11e', 'Sum', 'Comment', 'Beyond repair', 'Not scrapped well'],axis = 1)
rows_delete = [index for index in list(df.index)[2959:]]
df = df.drop(rows_delete,axis = 0)
t_answers = []
for index2 in range(len(answers[0])):
t_answers.append([])
for index1 in range(len(answers)):
t_answers[index2].append(answers[index1][index2])
df["Questions"] = questions
for n in range(len(list(df.columns)) - 3):
df[f'Answer{n}'] = t_answers[n]
return df
Este código es muy diferente a tu intento, debido a que las técnicas que usas son de una respuesta mía a otra pregunta tuya. En esa pregunta no me había quedado claro que se estaba hablando de tantos datos ordenados de maneras diversas. En esa respuesta usaba métodos específicos para tu caso, pienso que esta manera es portable a cualquier tipo de lista con listas y otros elementos.
Al invocar la función con
from functools import reduce
#Rutas de los archivos en mi escritorio
qa = pd.read_pickle("interns.p")
df = pd.read_csv("QuestionBank.csv")
print(update_lines(qa,df))
Se obtendría esto
QID Questions QType ... Answer255 Answer256 Answer257
0 0 How do you feel about your next vacation after... Multiple Choice ... None None None
1 0R When do you think your next vacation can start? Checkboxes ... None None None
2 1 What are your preferences regarding medical tr... Multiple Choice ... None None None
3 2 What is your preferences of complementary brea... Likert Scale ... None None None
4 4 What is your preference for a in-hotel grocery... Multiple Choice ... None None None
... ... ... ... ... ... ... ...
2954 2514 What suggestions do you have for Global Hotel ... Multiple Choice ... None None None
2955 2515 What is your age? Multiple Choice ... None None None
2956 2516 What is your gender? Multiple Choice ... None None None
2957 2517 What is your highest education? Multiple Choice ... None None None
2958 2518 What is your disposable income monthly? (in GBP) Likert Scale ... None None None
Bien, esta vista no ofrece mucho, ya que muestra las ultimas columnas, donde casi no hay respuestas. Puedes ver el .csv completo aquí.
Como verás el problema estaba en como filtrar los datos. Espero te haya servido de al go.
Answered by user166844 on December 9, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP