Unix & Linux Asked by zerotimer on October 31, 2021
I have a 4 column tab delimited file and the last column sometimes has duplicates. This is an excerpt from that file:
chr7 116038644 116039744 GeneA
chr7 116030947 116032047 GeneA
chr7 115846040 115847140 GeneA
chr7 115824610 115825710 GeneA
chr7 115801509 115802609 GeneA
chr7 115994986 115996086 GeneA
chrX 143933024 143934124 GeneB
chrX 143933119 143934219 GeneB
chrY 143933129 143933229 GeneC
For every set of duplicates in that column, I want to convert them into something like this (without touching the non-duplicate values in that column):
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
How can I do this with awk, sed, or Bash's for loop?
A simple two-pass awk command:
$ awk -F '\t' '
BEGIN { OFS=FS }
pass == 1 { count[$4]++; next }
count[$4] > 1 { $4 = $4 "-" ++number[$4] }; 1' pass=1 file pass=2 file
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
This assumes that the input file is tab-delimited. Modify or remove -F '\t' if that is not the case.
On the first pass through the file, an associative array, count, is populated with the number of times each gene name is seen in the fourth column.
On the second pass through the file, a dash and a number are added to the end of the gene name if the gene name was seen more than once during the first pass. The number added is another counter, number, keyed by the gene name just like the count array.
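As a quick, self-contained check of this two-pass approach, the sketch below recreates a reduced tab-delimited excerpt and runs the same command (the sample.tsv filename is only illustrative):

```shell
# Reduced tab-delimited sample: GeneB is duplicated, GeneC is unique.
printf 'chrX\t143933024\t143934124\tGeneB\nchrX\t143933119\t143934219\tGeneB\nchrY\t143933129\t143933229\tGeneC\n' > sample.tsv

# Pass 1 (pass=1) counts each gene name; pass 2 (pass=2) appends "-N"
# only to names that occurred more than once.
result=$(awk -F '\t' '
BEGIN { OFS=FS }
pass == 1 { count[$4]++; next }
count[$4] > 1 { $4 = $4 "-" ++number[$4] }; 1' pass=1 sample.tsv pass=2 sample.tsv)

printf '%s\n' "$result"
```

The duplicated GeneB rows come out as GeneB-1 and GeneB-2, while the unique GeneC row is left untouched.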
Answered by Kusalananda on October 31, 2021
step1:
awk '{a[$NF]++}END{for (x in a) {print x,a[x]}}' filename| awk '{if($NF >1){for(i=1;i<=$2;i++){print $1"-"i}}else{print $1}}' >first.txt
(Note that the order in which awk's for (x in a) loop visits array elements is unspecified, so this step relies on the counts coming out in the same order in which the genes first appear in the file.)
step2:
awk '{$NF="";print $0}' filename >second.txt
step3:
paste second.txt first.txt | sed -r "s/\s+/ /g"
output
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
python
#!/usr/bin/python
j=[]
ww=[]
k=open('rrr.txt','r')
for i in k:
    if i.split(' ')[-1].strip() not in j:
        j.append(i.split(' ')[-1].strip())
for g in j:
    h=open('rrr.txt','r')
    d=h.readlines()
    for e in d:
        if g in e.strip():
            ww.append(e.strip())
    if len(ww) > 1:
        for z in range(0,len(ww),1):
            print "{0}-{1}".format(ww[z],z+1)
    else:
        print "".join(ww)
    ww=[]
output
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
Answered by Praveen Kumar BS on October 31, 2021
Assuming your input file is grouped by the 4th column as shown in your example:
$ cat tst.awk
$NF != prev {
prt()
cnt = 0
prev = $NF
}
{ rec[++cnt] = $0 }
END { prt() }
function prt() {
for (i=1; i<=cnt; i++) {
print rec[i] (cnt > 1 ? "-"i : "")
}
}
$ awk -f tst.awk file
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
Answered by Ed Morton on October 31, 2021
This only appends "-number" to the specified (target) field (the fourth one in your sample) if its value is not unique. It also handles the case of input not sorted by the target column and works for an arbitrary number of input columns.
Since the following AWK script needs the input to be sorted by the target field, we use a pipeline to number the original lines, sort them by the (now) fifth field (the first being the prepended number), append the suffix to the non-unique values of the fifth field, bring the lines back to the initial sorting and remove the prepended numbers:
nl file | sort -b -t '<TAB>' -k5,5 -k1n,1n | awk -F '\t' -v OFS='\t' -v kf=5 '
function prn () {
for (i = 1; i <= nfl; i++) {
if (i == kf)
printf("%s", prc[i] ( sw || cnt[prc[i]] ? "-"++cnt[prc[i]] : ""))
else
printf("%s", prc[i])
printf("%s", (i == nfl ? ORS : OFS))
}
}
NR > 1 {
sw = ($kf == prc[kf])
prn()
}
{
nfl = split($0, prc)
}
END {
if (NR > 0)
prn()
} ' | sort -k1n,1n | cut -f 2-
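As an aside on the <TAB> placeholder in the sort invocation above: a literal tab can be supplied portably by storing one in a shell variable first (a small sketch; the variable name tab is arbitrary):

```shell
# Capture a literal tab character so it can be passed to sort -t without
# typing a raw TAB inside quotes (a POSIX-portable alternative to bash's $'\t').
tab=$(printf '\t')

# Sort two tab-delimited lines by their first field.
result=$(printf 'b\t2\na\t1\n' | sort -t "$tab" -k1,1)
printf '%s\n' "$result"
```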
The gist of this AWK script is to print the previous line after checking whether its kf-th field is equal to that of the current line, or whether its kf-th field has already appeared at least once. In both cases, the kf-th field is printed with the number of times it has been seen appended to it.
Make sure to adjust -v kf=5 (and the -k5,5 sort key) to reflect the actual position of the column you want to disambiguate.
Given this sample (yours, with shuffled rows and an added column) as file:
chr7 116038644 116039744 GeneA foo
chrX 143933024 143934124 GeneB foo
chr7 116030947 116032047 GeneA foo
chr7 115824610 115825710 GeneA foo
chrY 143933129 143933229 GeneC foo
chr7 115994986 115996086 GeneA foo
chrX 143933119 143934219 GeneB foo
chr7 115801509 115802609 GeneA foo
chr7 115846040 115847140 GeneA foo
the output would be:
chr7 116038644 116039744 GeneA-1 foo
chrX 143933024 143934124 GeneB-1 foo
chr7 116030947 116032047 GeneA-2 foo
chr7 115824610 115825710 GeneA-3 foo
chrY 143933129 143933229 GeneC foo
chr7 115994986 115996086 GeneA-4 foo
chrX 143933119 143934219 GeneB-2 foo
chr7 115801509 115802609 GeneA-5 foo
chr7 115846040 115847140 GeneA-6 foo
Answered by fra-san on October 31, 2021
Try this
awk -F'\t' -v OFS='\t' '{$4=$4 "-" (++count[$4])}1' file.tsv
This will store the number of occurrences of each value of the 4th field in a counter array count (where the value of the 4th field is used as "index"), and append the pre-incremented value of that counter to the 4th field, separated by a dash.
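A minimal run of this one-pass command (the sample is recreated inline; sample.tsv is an illustrative name) makes its behaviour visible, including the fact that even a unique value such as GeneC receives a suffix:

```shell
# Two duplicated GeneB rows and one unique GeneC row, tab-delimited.
printf 'chrX\t1\t2\tGeneB\nchrX\t3\t4\tGeneB\nchrY\t5\t6\tGeneC\n' > sample.tsv

# Every 4th-field value gets "-N" appended, N counting its occurrences so far.
result=$(awk -F'\t' -v OFS='\t' '{$4=$4 "-" (++count[$4])}1' sample.tsv)
printf '%s\n' "$result"
```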
The above "simple" example has a disadvantage: it will add a disambiguation number even to those values in column 4 that appear only once in the file. In order to suppress that, the following two-pass approach will work (the command is broken over two lines to improve readability):
awk -F'\t' -v OFS='\t' 'NR==FNR{f[$4]++}
NR>FNR{if (f[$4]>1) {$4=$4 "-" (++count[$4])}; print}' file.tsv file.tsv
Note that the file to be processed is stated twice as argument, and will therefore be read twice. On the first pass (recognizable by FNR, the per-file line counter, being equal to NR, the global line counter), we simply count how often every distinct value of column 4 appears in the file, and store that in an array f. On the second pass, the suffix is appended only to those values whose count is greater than 1. This approach avoids buffering the entire file, which can be an advantage if the file is very large; processing of course takes longer because the file is read twice.
As a general rule, using shell loops for text processing is rarely necessary, as awk, for example, can perform such loop operations by itself in a much more efficient way.
Answered by AdminBee on October 31, 2021