Data Science Asked on July 24, 2021
I am working on an ML model for which I have been given the data in two files, test.csv and train.csv. I want to clean both files together by concatenating them and then separating them again.
I know how to concatenate 2 dataframes, but after data cleaning how will I separate the two files? Please help me complete the code.
CODE
import pandas as pd

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')
df = pd.concat([test, train])
# Data cleaning steps
# Separating them back into train and test sets to provide input to the model
There are several methods to choose from. If you insist on concatenating the two dataframes, first add a new column called source to each DataFrame. Set its value to 'test' for the rows from test.csv and to 'train' for the training set.
When you have finished cleaning the combined df, use the source column to split the data again.
An alternative method is to record all the operations you perform on the training set and simply repeat them for the test set. This won't work if you normalise values based on the population, because statistics such as the mean would then be computed separately for each set.
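The source-column approach described above could look roughly like the sketch below. The two small DataFrames are made-up stand-ins for test.csv and train.csv, and the whitespace stripping is only a placeholder for whatever cleaning you actually need.

```python
import pandas as pd

# Hypothetical stand-ins for pd.read_csv('test.csv') and pd.read_csv('train.csv')
test = pd.DataFrame({"age": [25, 37], "city": [" Paris", "Rome "]})
train = pd.DataFrame({"age": [31, 40, 52], "city": ["Oslo", " Rome", "Paris "]})

# Tag every row with the file it came from
test["source"] = "test"
train["source"] = "train"

df = pd.concat([test, train], ignore_index=True)

# Placeholder cleaning step (an assumption): strip stray whitespace
df["city"] = df["city"].str.strip()

# Use the source column to split the cleaned data again
test_clean = df[df["source"] == "test"]
train_clean = df[df["source"] == "train"]
```

The source column stays in both results here; drop it before passing the data to a model so it is not treated as a feature.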
Answered by fswings on July 24, 2021
Method 1: Write a function that performs your set of data cleaning operations. Then pass the train set, the test set, or whatever else you want to clean through that function. The result will be consistent.
Method 2: If you want to concatenate, one way to do it is to add a column, for example "type", set to "test" for the test data set and "train" for the train data set. Perform your operations, then filter on that column to divide the data into two dataframes again, for example:
data[data['type']=="test"]
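As a rough illustration of Method 1, the sketch below wraps a couple of cleaning steps in one function and applies it to both sets. The function name clean, the column names and the steps themselves are all assumptions made for the example, and it only uses row-wise operations that do not depend on statistics of the whole data set.

```python
import pandas as pd

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning steps; replace them with your own."""
    out = frame.copy()                                  # leave the input untouched
    out.columns = out.columns.str.lower()               # consistent column names
    out["city"] = out["city"].str.strip().str.title()   # tidy a text column
    return out

# Made-up stand-ins for train.csv and test.csv
train = pd.DataFrame({"City": [" oslo", "rome ", "rome "], "Age": [31, 40, 40]})
test = pd.DataFrame({"City": ["PARIS "], "Age": [25]})

train_clean = clean(train)
test_clean = clean(test)
```

Because both sets go through the same function, there is nothing to concatenate or split afterwards.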
Answered by Amar nayak on July 24, 2021
Add an indicator column while concatenating the two dataframes, so you can later separate them again:
df = pd.concat([test.assign(ind="test"), train.assign(ind="train")])
Then later you can split them again:
test, train = df[df["ind"].eq("test")], df[df["ind"].eq("train")]
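After splitting, you may also want to drop the helper column and tidy the index. The sketch below assumes made-up data and a placeholder cleaning step (dropping rows with a missing x) that leaves a gap in the index of the training part.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
test = pd.DataFrame({"x": [1.0, 2.0]})
train = pd.DataFrame({"x": [3.0, None, 5.0]})

df = pd.concat([test.assign(ind="test"), train.assign(ind="train")])

# Placeholder cleaning step (an assumption): drop rows with a missing x
df = df.dropna(subset=["x"])

test, train = df[df["ind"].eq("test")], df[df["ind"].eq("train")]

# Remove the indicator so it is not used as a feature, and renumber the rows
test = test.drop(columns="ind").reset_index(drop=True)
train = train.drop(columns="ind").reset_index(drop=True)
```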
Answered by Erfan on July 24, 2021