Bash: Keep all lines with duplicate values in column X

Stack Overflow Asked on February 2, 2021

I have a file with a few thousand lines and 20+ columns. I want to keep only the lines whose e-mail address in column 3 also appears in at least one other line.

file: (First Name; Last Name; E-Mail; …)

Mike;Tyson;tyson@example.com
Tom;Boyden;boyden@example.com
Tom;Cruise;tyson@example.com
Mike;Myers;tyson@example.com
Jennifer;Lopez;lopez@example.com
Andre;Agassi;boyden@example.com
Paul;Walker;walker@example.com

I want to keep ALL lines that have a matching e-mail address. In this case the expected output would be

Mike;Tyson;tyson@example.com
Tom;Boyden;boyden@example.com
Tom;Cruise;tyson@example.com
Mike;Myers;tyson@example.com
Andre;Agassi;boyden@example.com

If I use

awk -F';' 'seen[$3]++' file

I will lose the first instance of each e-mail address, in this case lines 1 and 2, and will keep ONLY the duplicates.
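
With the example above that leaves me with only:

Tom;Cruise;tyson@example.com
Mike;Myers;tyson@example.com
Andre;Agassi;boyden@example.com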

Is there a way to keep all lines?

5 Answers

Could you please try the following; it reads the Input_file only once, in a single awk program.

awk '
BEGIN{
  FS=";"
}
{
  mail[$3]++
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0
}
END{
  for(i in mailVal){
    if(mail[i]>1){ print mailVal[i] }
  }
}' Input_file

Explanation: a detailed explanation of the above.

awk '                                                  ##Starting awk program from here.
BEGIN{                                                 ##Starting BEGIN section of this program from here.
  FS=";"                                               ##Setting field separator as ; here.
}
{
  mail[$3]++                                           ##Creating mail with index of 3rd field and incrementing its count for each occurrence.
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0     ##Creating mailVal with 3rd field as index; its value is the current line, concatenated with a newline to any lines stored earlier.
}
END{                                                   ##Starting END block of this program from here.
  for(i in mailVal){                                   ##Traversing through mailVal here.
    if(mail[i]>1){ print mailVal[i] }                  ##Checking condition if value is greater than 1 then printing its value here.
  }
}
' Input_file                                           ##Mentioning Input_file name here.
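
For illustration, running the script above on the sample file from the question (saved here as Input_file) should print the duplicated rows grouped by address. Note that for(i in mailVal) iterates in an unspecified order, so the groups may come out in either order; one possible run:

Mike;Tyson;tyson@example.com
Tom;Cruise;tyson@example.com
Mike;Myers;tyson@example.com
Tom;Boyden;boyden@example.com
Andre;Agassi;boyden@example.com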

Answered by RavinderSingh13 on February 2, 2021

I think @ceving just needs to go a little further.

ASSUMING the chosen column is NOT the first or last -

cut -f"$col" -d';' file          |      # slice out the right column
  tr '[:upper:]' '[:lower:]'     |      # standardize case
  sort | uniq -d                 |      # sort and output only the dups
  sed 's/^/;/; s/$/;/'           > dups # save the lowercased keys, wrapped in delimiters
grep -iFf dups file > subset.csv        # pull matching records

This breaks if the chosen column is the first or last, but should otherwise preserve case and order from the original version.

If it might be the first or last, then pad the stream to that last grep and clean it afterwards -

sed 's/^/;/; s/$/;/;' file       |            # pad with leading/trailing delims
  grep -iFf dups                 |            # grab relevant records
sed 's/^;//; s/;$//;'            > subset.csv # strip the padding 
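
Putting the pieces together for the three-column sample from the question, where the e-mail is the last field and the padded variant is therefore needed (a sketch; the file name file and col=3 are assumptions):

col=3
cut -f"$col" -d';' file          |      # slice out the e-mail column
  tr '[:upper:]' '[:lower:]'     |      # standardize case
  sort | uniq -d                 |      # keep only duplicated addresses
  sed 's/^/;/; s/$/;/'           > dups # wrap the keys in delimiters
sed 's/^/;/; s/$/;/' file        |      # pad each record with leading/trailing delims
  grep -iFf dups                 |      # grab records containing a duplicated key
  sed 's/^;//; s/;$//'           > subset.csv # strip the padding again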

Answered by Paul Hodges on February 2, 2021

If the output order doesn't matter, here's a one-pass approach:

$ awk -F';' '$3 in first{print first[$3] $0; first[$3]=""; next} {first[$3]=$0 ORS}' file
Mike;Tyson;tyson@example.com
Tom;Cruise;tyson@example.com
Mike;Myers;tyson@example.com
Tom;Boyden;boyden@example.com
Andre;Agassi;boyden@example.com
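
Spelled out with comments, a sketch that should behave the same as the one-liner above:

awk -F';' '
  $3 in first {           # this e-mail address was seen before
    print first[$3] $0    # print the stored first line (if still there) plus the current one
    first[$3] = ""        # clear it so any later repeats print on their own
    next
  }
  { first[$3] = $0 ORS }  # remember the first line seen for each address
' file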

Answered by Ed Morton on February 2, 2021

Find the duplicate e-mail addresses:

sed -s 's/^.*;/;/;s/$/$/' < file.csv | sort | uniq -d > dups.txt

Report the duplicate csv rows:

grep -f dups.txt file.csv
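
For the sample data this should leave dups.txt containing the duplicated keys, each anchored as a regular expression, roughly:

;boyden@example.com$
;tyson@example.com$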

Update:

As "Ed Morton" pointed out the above commands will fail, when the e-mail addresses contain characters, which have a special meaning in a regular expression. This makes it necessary to escape the e-mail addresses.

One way to do so is to use a Perl compatible regular expression. In a PCRE the escape sequences \Q and \E mark the beginning and the end of a string which should not be treated as a regular expression. GNU grep supports PCREs with the option -P, but this cannot be combined with the option -f. This makes it necessary to use something like xargs. But xargs interprets backslashes and ruins the regular expression; in order to prevent this, it is necessary to use the option -0.

Lesson learned: it is quite difficult to get this right without programming it in AWK.

sed -s 's/^.*;/;\Q/;s/$/\E$/' < file.csv | sort | uniq -d | tr '\n' '\0' > dups.txt
xargs -0 -i < dups.txt grep -P '{}' file.csv

Answered by ceving on February 2, 2021

This awk one-liner will help you:

awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file

It passes the file twice: the first pass counts the occurrences of each e-mail address, the second pass checks the counts and prints the matching lines.

With the given input example, it prints:

Mike;Tyson;tyson@example.com
Tom;Boyden;boyden@example.com
Tom;Cruise;tyson@example.com
Mike;Myers;tyson@example.com
Andre;Agassi;boyden@example.com
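
The same two-pass logic spelled out with comments (the array is renamed to count for readability; behavior should be identical):

awk -F';' '
  NR == FNR {        # first pass: FNR equals NR only while reading the first file argument
    count[$3]++      # count the occurrences of each e-mail address
    next
  }
  count[$3] > 1      # second pass: print lines whose address occurred more than once
' file file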

Answered by Kent on February 2, 2021
