Data Science Asked on April 14, 2021
I have a dataset containing all the real estate ads for sale published in a city:
ID URL CRAWL_SOURCE PROPERTY_TYPE NEW_BUILD DESCRIPTION IMAGES SURFACE LAND_SURFACE BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY ZIP_CODE DEPT_CODE PUBLICATION_START_DATE PUBLICATION_END_DATE LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0 22c05930-0eb5-11e7-b53d-bbead8ba43fe http://www.avendrealouer.fr/location/levallois... A_VENDRE_A_LOUER APARTMENT False Au rez de chaussée d'un bel immeuble récent,... ["https://cf-medias.avendrealouer.fr/image/_87... 72.0 NaN NaN ... Lamirand Et Associes AGENCY 54178039 Levallois-Perret 92300.0 92 2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1 8d092fa0-bb99-11e8-a7c9-852783b5a69d https://www.bienici.com/annonce/ag440414-16547... BIEN_ICI APARTMENT False Je vous propose un appartement dans la rue Col... ["http://photos.ubiflow.net/440414/165474561/p... 48.0 NaN NaN ... Proprietes Privees MANDATARY 54178039 Levallois-Perret 92300.0 92 2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89 2018-09-25
I would like to group the real estate ads that refer to the same property. Indeed, several professionals (or even individuals) publish ads for the same property on several real estate portals.
What methodology could I use to detect duplicates when the rows aren't exact duplicates? So far I thought either of checking whether the description is exactly the same (but I would need to remove special characters first) or whether the images at the URLs are the same (but I am not an expert in image processing).
To my mind, since people who post the same house do it on different websites, these won't be duplicates within the same website. They probably post the same DESCRIPTION or the same IMAGES. The houses surely have the same SURFACE; however, we may have several dealers for the same house. So I did:
# Let's add a new boolean column to our DataFrame that identifies a duplicated ad (False = not a duplicate; True = duplicate)
df['is_duplicated'] = df.duplicated(['DESCRIPTION'])
And did the sum:
# We can sum a boolean column to get a count of duplicated ads
df['is_duplicated'].sum()
Which returned 249. I don’t know how to compare the images yet.
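For the description check, a minimal sketch of the "remove special characters first" idea mentioned above (assuming the df from the question and that DESCRIPTION can contain missing values; normalize_description is a hypothetical helper, not part of the original code):

import re

# Lowercase, strip special characters and collapse whitespace so that trivially
# reformatted copies of the same text compare as equal
def normalize_description(text):
    text = str(text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df['DESCRIPTION_NORM'] = df['DESCRIPTION'].fillna("").map(normalize_description)
df['is_duplicated'] = df.duplicated(['DESCRIPTION_NORM'])

For the images, one possible direction (an assumption on my side, not something from the post) is perceptual hashing with the requests, Pillow and imagehash packages: two ads pointing to the same photo get identical or near-identical hashes even if the file was resized or re-encoded. A sketch, assuming IMAGES holds a JSON-style list of URLs as in the sample rows:

import io, json
import requests
from PIL import Image
import imagehash

# Download the first photo of an ad and compute a perceptual hash for it
def first_image_hash(images_field):
    urls = json.loads(images_field)
    if not urls:
        return None
    response = requests.get(urls[0], timeout=10)
    return str(imagehash.average_hash(Image.open(io.BytesIO(response.content))))

df['IMAGE_HASH'] = df['IMAGES'].dropna().map(first_image_hash)
df['same_image'] = df.duplicated(['IMAGE_HASH']) & df['IMAGE_HASH'].notna()

Downloading every photo is slow, so in practice this would be run once and cached, or restricted to ads that already share the same SURFACE and ZIP_CODE.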
Is there a better strategy?
I did this when I had duplicate values and multiple other variables which are unique.
Adding a combined key variable to identify duplicates (I have used variables from the data above; please change them as per your requirement):
library(dplyr)

# Paste together the fields that should match for the same property into one key
Df <- DF %>% mutate(Path = paste(ID, PROPERTY_TYPE, NEW_BUILD, SURFACE, sep = ">"))
Retaining only the unique values in the data:
Df <- distinct(Df, Path, .keep_all = TRUE)
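For reference, the same composite-key idea written for the question's pandas DataFrame (a sketch assuming the column names shown in the question; the fields that define "the same property" should be adjusted as suggested above):

# Build one key per ad by pasting the chosen fields together, then keep the first row per key
df['Path'] = df[['ID', 'PROPERTY_TYPE', 'NEW_BUILD', 'SURFACE']].astype(str).agg('>'.join, axis=1)
df_unique = df.drop_duplicates(subset='Path', keep='first')

Note that if ID is unique per ad (as the UUIDs in the sample suggest), including it in the key keeps every row, so for cross-portal duplicates one would typically build the key without it.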
Answered by Meghana Kanuri on April 14, 2021