TransWikia.com

Getting some data from a csv file into a list of tuples using Python

Data Science Asked by Pel de Pinda on May 1, 2021

I have a csv file of almost 2 million rows and about 20 columns. I am interested in about 40000 of these rows, and in 5 of the columns: ‘squat’, ‘bench’, ‘deadlift’, ‘bodyweight’, ‘total’. I’d like to get tuples of these 5 values, but only for the rows that have ‘female’ in column ‘sex’ and ‘ipf’ in column ‘parent federation’. Also, sometimes one of these 5 columns has no value specified; I’d like to ignore those rows.

I can do this by going row by row, checking the condition and then adding a tuple to a list, but this is way too slow and takes up too much memory. Is there a way to expand on this (how to take CSV file input in a list of tuples) so that it uses only a couple of columns and filters the rows?

2 Answers

You can use the pandas library to solve this problem. pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. Further details can be found here: Pandas DataFrame Documentation

Answered by Tanhim Islam on May 1, 2021

As I understand the question, you want to obtain tuples of values from a csv file, keeping only the rows where "sex" == "female" and "parent federation" == "ipf".

As @Tanhim has said in their answer, you can use the pandas Python module to read the csv file. I would say you can obtain the tuples as follows:

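import pandas as pd

value_cols = ["squat", "bench", "deadlift", "bodyweight", "total"]
df = pd.read_csv(FILEPATH, usecols=["sex", "parent federation"] + value_cols)

# Keep the matching rows, then drop rows missing any of the five values.
df = df[(df["sex"] == "female") & (df["parent federation"] == "ipf")]
df = df[value_cols].dropna()

data = list(df.itertuples(index=False, name=None))  # plain tuples, one per row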

If you want to have the resulting dataset be a tuple of tuples instead of a list of tuples, then you would replace the last line with this:

data = tuple(df.itertuples(index=False, name=None))

Here is a good resource on querying a pandas DataFrame, just so you know what is happening: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
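To make the filtering step more concrete, here is a minimal sketch on a tiny made-up table (the rows are invented purely for illustration and are not from the asker's file). It shows that each comparison produces a boolean Series and that & combines them row by row:

```python
import pandas as pd

# Tiny made-up table, only to illustrate boolean filtering.
df = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "parent federation": ["ipf", "ipf", "wpc"],
    "total": [400.0, 600.0, 350.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are needed because & binds tighter than ==.
mask = (df["sex"] == "female") & (df["parent federation"] == "ipf")

# Indexing with the mask keeps only the rows where it is True.
subset = df[mask]
print(subset)
```

Only the first row satisfies both conditions, so subset has a single row.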

The usecols argument in pd.read_csv chooses which columns to load into the DataFrame, instead of loading all of the columns in the csv file. The FILEPATH variable refers to a string which denotes the file path of the csv file (i.e. where the csv file lives).
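If memory is still a concern with roughly 2 million rows, one option is to combine usecols with the chunksize argument of pd.read_csv, which returns the file in pieces instead of all at once. The following is only a sketch under assumptions: the inline csv_text stands in for the real file, its rows are made up, and the chunk size of 2 is chosen just so the demo produces several chunks.

```python
import io
import pandas as pd

# Stand-in for the real file; the rows are made up for illustration.
csv_text = """name,sex,parent federation,squat,bench,deadlift,bodyweight,total
a,female,ipf,100,60,120,63,280
b,male,ipf,200,140,240,93,580
c,female,ipf,110,,130,57,
d,female,wpc,90,50,110,52,250
e,female,ipf,120,70,150,72,340
"""

value_cols = ["squat", "bench", "deadlift", "bodyweight", "total"]
data = []

# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
reader = pd.read_csv(
    io.StringIO(csv_text),  # replace with the real file path
    usecols=["sex", "parent federation"] + value_cols,
    chunksize=2,  # tiny for the demo; a real file would use a much larger value
)
for chunk in reader:
    mask = (chunk["sex"] == "female") & (chunk["parent federation"] == "ipf")
    # Keep the five value columns and drop rows missing any of them.
    kept = chunk.loc[mask, value_cols].dropna()
    data.extend(kept.itertuples(index=False, name=None))

print(data)
```

Because each chunk is filtered before its rows are appended, the full table never has to sit in memory; only the rows that pass the filter accumulate in data.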

Answered by shepan6 on May 1, 2021
