Unix & Linux Asked by zerotimer on October 31, 2021
I have a 4 column tab delimited file and the last column sometimes has duplicates. This is an excerpt from that file:
chr7 116038644 116039744 GeneA
chr7 116030947 116032047 GeneA
chr7 115846040 115847140 GeneA
chr7 115824610 115825710 GeneA
chr7 115801509 115802609 GeneA
chr7 115994986 115996086 GeneA
chrX 143933024 143934124 GeneB
chrX 143933119 143934219 GeneB
chrY 143933129 143933229 GeneC
For every set of duplicates in that column, I want to convert them into something like this (without touching the non-duplicate values in that column):
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
How can I do this with awk, sed, or Bash's for loop?
A simple two-pass awk command:
$ awk -F '\t' '
BEGIN { OFS=FS }
pass == 1 { count[$4]++; next }
count[$4] > 1 { $4 = $4 "-" ++number[$4] }; 1' pass=1 file pass=2 file
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
This assumes that the input file is tab-delimited. Modify or remove -F '\t' if that is not the case.
On the first pass through the file, an associative array, count, is populated with the number of times each gene name is seen in the fourth column.
On the second pass through the file, a dash and a number are added to the end of the gene name if the gene name was seen more than once during the first pass. The number added is another counter, number, keyed by the gene name just like the count array.
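As a quick, self-contained check of this two-pass approach, the sketch below recreates a reduced tab-delimited excerpt and runs the same command (the sample.tsv filename is only illustrative):

```shell
# Reduced tab-delimited sample: GeneB is duplicated, GeneC is unique.
printf 'chrX\t143933024\t143934124\tGeneB\nchrX\t143933119\t143934219\tGeneB\nchrY\t143933129\t143933229\tGeneC\n' > sample.tsv

# Pass 1 (pass=1) counts each gene name; pass 2 (pass=2) appends "-N"
# only to names that occurred more than once.
result=$(awk -F '\t' '
BEGIN { OFS=FS }
pass == 1 { count[$4]++; next }
count[$4] > 1 { $4 = $4 "-" ++number[$4] }; 1' pass=1 sample.tsv pass=2 sample.tsv)

printf '%s\n' "$result"
```

The duplicated GeneB rows come out as GeneB-1 and GeneB-2, while the unique GeneC row is left untouched.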
Answered by Kusalananda on October 31, 2021
step1:
awk '{a[$NF]++}END{for (x in a) {print x,a[x]}}' filename| awk '{if($NF >1){for(i=1;i<=$2;i++){print $1"-"i}}else{print $1}}' >first.txt
(Note that the order in which awk's for (x in a) loop visits array elements is unspecified, so this step relies on the counts coming out in the same order in which the genes first appear in the file.)
step2:
awk '{$NF="";print $0}' filename >second.txt
step3:
paste second.txt first.txt | sed -r "s/\s+/ /g"
output
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
python
#!/usr/bin/python
j=[]
ww=[]
k=open('rrr.txt','r')
for i in k:
    if i.split(' ')[-1].strip() not in j:
        j.append(i.split(' ')[-1].strip())
for g in j:
    h=open('rrr.txt','r')
    d=h.readlines()
    for e in d:
        if g in e.strip():
            ww.append(e.strip())
    if len(ww) > 1:
        for z in range(0,len(ww),1):
            print "{0}-{1}".format(ww[z],z+1)
    else:
        print "".join(ww)
    ww=[]
output
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
Answered by Praveen Kumar BS on October 31, 2021
Assuming your input file is grouped by the 4th column as shown in your example:
$ cat tst.awk
$NF != prev {
prt()
cnt = 0
prev = $NF
}
{ rec[++cnt] = $0 }
END { prt() }
function prt() {
for (i=1; i<=cnt; i++) {
print rec[i] (cnt > 1 ? "-"i : "")
}
}
$ awk -f tst.awk file
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
Answered by Ed Morton on October 31, 2021
This only appends "-number" to the specified (target) field (the fourth one in your sample) if its value is not unique. It also handles the case of input not sorted by the target column and works for an arbitrary number of input columns.
Since the following AWK script needs the input to be sorted by the target field, we use a pipeline to number the original lines, sort them by the (now) fifth field (the first being the prepended number), append the suffix to the non-unique values of the fifth field, bring the lines back to the initial sorting and remove the prepended numbers:
nl file | sort -b -t '<TAB>' -k5,5 -k1n,1n | awk -F '\t' -v OFS='\t' -v kf=5 '
function prn () {
for (i = 1; i <= nfl; i++) {
if (i == kf)
printf("%s", prc[i] ( sw || cnt[prc[i]] ? "-"++cnt[prc[i]] : ""))
else
printf("%s", prc[i])
printf("%s", (i == nfl ? ORS : OFS))
}
}
NR > 1 {
sw = ($kf == prc[kf])
prn()
}
{
nfl = split($0, prc)
}
END {
if (NR > 0)
prn()
} ' | sort -k1n,1n | cut -f 2-
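As an aside on the <TAB> placeholder in the sort invocation above: a literal tab can be supplied portably by storing one in a shell variable first (a small sketch; the variable name tab is arbitrary):

```shell
# Capture a literal tab character so it can be passed to sort -t without
# typing a raw TAB inside quotes (a POSIX-portable alternative to bash's $'\t').
tab=$(printf '\t')

# Sort two tab-delimited lines by their first field.
result=$(printf 'b\t2\na\t1\n' | sort -t "$tab" -k1,1)
printf '%s\n' "$result"
```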
The gist of this AWK script is to print the previous line after checking whether its kf-th field is equal to that of the current line, or whether its kf-th field has already appeared at least once. In both cases, the kf-th field is printed with the number of times it has been seen appended to it.
Make sure to adjust -v kf=5 (and the -k5,5 sort key) to reflect the actual position of the column you want to disambiguate.
Given this sample (yours, with shuffled rows and an added column) as file:
chr7 116038644 116039744 GeneA foo
chrX 143933024 143934124 GeneB foo
chr7 116030947 116032047 GeneA foo
chr7 115824610 115825710 GeneA foo
chrY 143933129 143933229 GeneC foo
chr7 115994986 115996086 GeneA foo
chrX 143933119 143934219 GeneB foo
chr7 115801509 115802609 GeneA foo
chr7 115846040 115847140 GeneA foo
the output would be:
chr7 116038644 116039744 GeneA-1 foo
chrX 143933024 143934124 GeneB-1 foo
chr7 116030947 116032047 GeneA-2 foo
chr7 115824610 115825710 GeneA-3 foo
chrY 143933129 143933229 GeneC foo
chr7 115994986 115996086 GeneA-4 foo
chrX 143933119 143934219 GeneB-2 foo
chr7 115801509 115802609 GeneA-5 foo
chr7 115846040 115847140 GeneA-6 foo
Answered by fra-san on October 31, 2021
Try this
awk -F'\t' -v OFS='\t' '{$4=$4 "-" (++count[$4])}1' file.tsv
This will store the number of occurrences of each value of the 4th field in a counter array count (where the value of the 4th field is used as "index"), and append the pre-incremented value of that counter to the 4th field, separated by a dash.
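A minimal run of this one-pass command (the sample is recreated inline; sample.tsv is an illustrative name) makes its behaviour visible, including the fact that even a unique value such as GeneC receives a suffix:

```shell
# Two duplicated GeneB rows and one unique GeneC row, tab-delimited.
printf 'chrX\t1\t2\tGeneB\nchrX\t3\t4\tGeneB\nchrY\t5\t6\tGeneC\n' > sample.tsv

# Every 4th-field value gets "-N" appended, N counting its occurrences so far.
result=$(awk -F'\t' -v OFS='\t' '{$4=$4 "-" (++count[$4])}1' sample.tsv)
printf '%s\n' "$result"
```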
The above "simple" example has a disadvantage: it will add a disambiguation number even to those values in column 4 that appear only once in the file. In order to suppress that, the following two-pass approach will work (the command is broken over two lines to improve readability):
awk -F'\t' -v OFS='\t' 'NR==FNR{f[$4]++}
NR>FNR{if (f[$4]>1) {$4=$4 "-" (++count[$4])}; print}' file.tsv file.tsv
Note that the file to be processed is stated twice as argument, and will therefore be read twice. On the first pass (recognizable by FNR, the per-file line counter, being equal to NR, the global line counter), we simply count how often every distinct value of column 4 appears in the file, and store that in an array f. On the second pass, the suffix is appended only to those values whose count is greater than 1. This approach avoids buffering the entire file, which can be an advantage if the file is very large; processing of course takes longer because the file is read twice.
As a general rule, using shell loops for text processing is rarely necessary, as awk, for example, can perform such loop operations by itself in a much more efficient way.
Answered by AdminBee on October 31, 2021