Data Science Asked on April 14, 2021
I have a dataset containing all the real estate ads for sale published in a city:
ID URL CRAWL_SOURCE PROPERTY_TYPE NEW_BUILD DESCRIPTION IMAGES SURFACE LAND_SURFACE BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY ZIP_CODE DEPT_CODE PUBLICATION_START_DATE PUBLICATION_END_DATE LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0 22c05930-0eb5-11e7-b53d-bbead8ba43fe http://www.avendrealouer.fr/location/levallois... A_VENDRE_A_LOUER APARTMENT False Au rez de chaussée d'un bel immeuble récent,... ["https://cf-medias.avendrealouer.fr/image/_87... 72.0 NaN NaN ... Lamirand Et Associes AGENCY 54178039 Levallois-Perret 92300.0 92 2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1 8d092fa0-bb99-11e8-a7c9-852783b5a69d https://www.bienici.com/annonce/ag440414-16547... BIEN_ICI APARTMENT False Je vous propose un appartement dans la rue Col... ["http://photos.ubiflow.net/440414/165474561/p... 48.0 NaN NaN ... Proprietes Privees MANDATARY 54178039 Levallois-Perret 92300.0 92 2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89 2018-09-25
I would like to group the real estate ads that refer to the same property. Indeed, several professionals (or even individuals) publish ads for the same property on several real estate portals.
What methodology could I use to detect duplicates when the rows aren't exact duplicates? So far I thought either of checking whether the description is exactly the same (but I would need to remove special characters first) or whether the images at the URLs are the same (but I am not an expert in image processing).
To my mind, since people who post the same house do it on different websites, these won't be duplicates within the same website. They probably post the same DESCRIPTION or the same IMAGES. The houses surely have the same SURFACE; however, we may have several dealers for the same house. So I did:
# Let's add a new boolean column to our DataFrame that identifies a duplicated ad (False = not a duplicate; True = duplicate)
df['is_duplicated'] = df.duplicated(['DESCRIPTION'])
And did the sum:
# We can sum a boolean column to get a count of duplicated ads
df['is_duplicated'].sum()
Which returned 249. I don’t know how to compare the images yet.
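For the description check, a minimal sketch of the "remove special characters first" idea mentioned above (assuming the df from the question and that DESCRIPTION can contain missing values; normalize_description is a hypothetical helper, not part of the original code):

import re

# Lowercase, strip special characters and collapse whitespace so that trivially
# reformatted copies of the same text compare as equal
def normalize_description(text):
    text = str(text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df['DESCRIPTION_NORM'] = df['DESCRIPTION'].fillna("").map(normalize_description)
df['is_duplicated'] = df.duplicated(['DESCRIPTION_NORM'])

For the images, one possible direction (an assumption on my side, not something from the post) is perceptual hashing with the requests, Pillow and imagehash packages: two ads pointing to the same photo get identical or near-identical hashes even if the file was resized or re-encoded. A sketch, assuming IMAGES holds a JSON-style list of URLs as in the sample rows:

import io, json
import requests
from PIL import Image
import imagehash

# Download the first photo of an ad and compute a perceptual hash for it
def first_image_hash(images_field):
    urls = json.loads(images_field)
    if not urls:
        return None
    response = requests.get(urls[0], timeout=10)
    return str(imagehash.average_hash(Image.open(io.BytesIO(response.content))))

df['IMAGE_HASH'] = df['IMAGES'].dropna().map(first_image_hash)
df['same_image'] = df.duplicated(['IMAGE_HASH']) & df['IMAGE_HASH'].notna()

Downloading every photo is slow, so in practice this would be run once and cached, or restricted to ads that already share the same SURFACE and ZIP_CODE.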
Is there a better strategy?
I did this when I had duplicate values and multiple other variables which are unique.
Adding a combined key variable to identify duplicates (I have used variables from the data above; please change them as per your requirement):
library(dplyr)

# Paste together the fields that should match for the same property into one key
Df <- DF %>% mutate(Path = paste(ID, PROPERTY_TYPE, NEW_BUILD, SURFACE, sep = ">"))
Retaining only the unique values in the data:
Df <- distinct(Df, Path, .keep_all = TRUE)
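For reference, the same composite-key idea written for the question's pandas DataFrame (a sketch assuming the column names shown in the question; the fields that define "the same property" should be adjusted as suggested above):

# Build one key per ad by pasting the chosen fields together, then keep the first row per key
df['Path'] = df[['ID', 'PROPERTY_TYPE', 'NEW_BUILD', 'SURFACE']].astype(str).agg('>'.join, axis=1)
df_unique = df.drop_duplicates(subset='Path', keep='first')

Note that if ID is unique per ad (as the UUIDs in the sample suggest), including it in the key keeps every row, so for cross-portal duplicates one would typically build the key without it.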
Answered by Meghana Kanuri on April 14, 2021