Data Science Asked by Pel de Pinda on May 1, 2021
I have a CSV file of almost 2 million rows and about 20 columns. I am interested in about 40,000 of these rows, and in 5 of the columns: ‘squat’, ‘bench’, ‘deadlift’, ‘bodyweight’, ‘total’. I’d like to get tuples of these 5 values, but only for the rows that have ‘female’ in column ‘sex’ and ‘ipf’ in column ‘parent federation’. Also, sometimes one of these 5 columns has no value specified; I’d like to ignore those rows.
I can do this by going row by row, checking the condition and then adding a tuple to a list, but this is way too slow and takes up too much memory. Is there a way to expand on "how to take CSV file input in list of tuples" so that it loads only a couple of columns and filters the rows?
You can use the pandas library to solve this problem. pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. Further details can be found in the pandas DataFrame documentation.
Answered by Tanhim Islam on May 1, 2021
As I understand the question, the goal is to obtain tuples of values from a CSV file after filtering it down to the rows where "sex" == "female" and "parent federation" == "ipf".
As @Tanhim has said in their answer, we can use the pandas Python module to read the data from the CSV file. You can obtain the tuples as follows:
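import pandas as pd
df = pd.read_csv(FILEPATH, usecols=["sex", "parent federation", "squat", "bench", "deadlift", "bodyweight", "total"])
df = df[(df["sex"] == "female") & (df["parent federation"] == "ipf")]
df = df[["squat", "bench", "deadlift", "bodyweight", "total"]].dropna()  # keep the 5 value columns, drop rows with a missing value
data = list(df.itertuples(index=False, name=None))  # list of plain tuples, one per row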
If you want to have the resulting dataset be a tuple of tuples instead of a list of tuples, then you would replace the last line with this:
data = tuple(df.itertuples(index=False, name=None))
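To make the steps above easy to try without the real file, here is a minimal, self-contained sketch. The sample rows, the SAMPLE_CSV string and the load_tuples helper are made up for illustration; only the column names come from the question.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real 2-million-row file.
SAMPLE_CSV = """name,sex,parent federation,squat,bench,deadlift,bodyweight,total
A,female,ipf,100,50,120,60,270
B,male,ipf,200,140,250,90,590
C,female,ipf,,55,130,63,
D,female,other,90,45,110,58,245
E,female,ipf,110.5,60,140,70,310.5
"""

VALUE_COLS = ["squat", "bench", "deadlift", "bodyweight", "total"]


def load_tuples(source):
    """Return one tuple of the 5 value columns per matching, complete row."""
    # Load only the columns needed for filtering and for the output.
    df = pd.read_csv(source, usecols=["sex", "parent federation"] + VALUE_COLS)
    # Boolean mask: rows that satisfy both conditions.
    mask = (df["sex"] == "female") & (df["parent federation"] == "ipf")
    # Keep the 5 value columns and drop rows where any of them is missing.
    values = df.loc[mask, VALUE_COLS].dropna()
    # name=None makes itertuples yield plain tuples instead of namedtuples.
    return list(values.itertuples(index=False, name=None))


data = load_tuples(io.StringIO(SAMPLE_CSV))
print(data)
```

With the real file you would pass its path instead of the io.StringIO object. Row C is dropped because squat and total are empty, and rows B and D fail the filter.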
Here is a good resource on querying a pandas DataFrame, just so you know what is happening: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
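As a small illustration of that kind of filtering, the sketch below compares a boolean mask with DataFrame.query on a made-up three-row frame (df_demo and its values are hypothetical). Note that a column name containing a space, such as "parent federation", has to be wrapped in backticks inside a query string.

```python
import pandas as pd

# Hypothetical toy frame; only the column names come from the question.
df_demo = pd.DataFrame(
    {
        "sex": ["female", "male", "female"],
        "parent federation": ["ipf", "ipf", "other"],
        "total": [270.0, 590.0, 245.0],
    }
)

# Option 1: combine two boolean conditions with & (parentheses are required).
by_mask = df_demo[(df_demo["sex"] == "female") & (df_demo["parent federation"] == "ipf")]

# Option 2: the same filter as a query string; backticks quote the column with a space.
by_query = df_demo.query('sex == "female" and `parent federation` == "ipf"')

# Both approaches should select the same rows.
same = by_mask.equals(by_query)
print(same)
```

Which form to use is mostly a matter of taste; the mask form is what the answer above uses.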
The usecols argument in pd.read_csv chooses which columns to load into the DataFrame, instead of loading all of the columns in the CSV file.
The FILEPATH variable is a string giving the file path of the CSV file (i.e. where the CSV file lives).
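If memory is still a concern even when loading only the needed columns, one option (not part of the answer above, so treat it as a suggestion) is to read the file in chunks with the chunksize argument of pd.read_csv and filter each chunk before moving on. The load_tuples_chunked helper, the chunk size and the sample rows below are assumptions for illustration.

```python
import io

import pandas as pd

VALUE_COLS = ["squat", "bench", "deadlift", "bodyweight", "total"]
FILTER_COLS = ["sex", "parent federation"]


def load_tuples_chunked(source, chunksize=100_000):
    """Filter the CSV chunk by chunk so only one chunk is held in memory at a time."""
    data = []
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, usecols=FILTER_COLS + VALUE_COLS, chunksize=chunksize):
        mask = (chunk["sex"] == "female") & (chunk["parent federation"] == "ipf")
        kept = chunk.loc[mask, VALUE_COLS].dropna()
        data.extend(kept.itertuples(index=False, name=None))
    return data


# Hypothetical sample standing in for the real file; "extra" is never loaded.
sample = io.StringIO(
    "sex,parent federation,squat,bench,deadlift,bodyweight,total,extra\n"
    "female,ipf,100,50,120,60,270,x\n"
    "male,ipf,200,140,250,90,590,y\n"
    "female,ipf,,55,130,63,,z\n"
    "female,ipf,105,52,125,61,282,w\n"
)
chunked = load_tuples_chunked(sample, chunksize=2)
print(chunked)
```

For the real file you would pass FILEPATH as the source. Since only about 40,000 rows are kept, the resulting list stays small even though the input is large.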
Answered by shepan6 on May 1, 2021