Unix & Linux Asked by tacrolimus on December 2, 2020
Table 1 (tab separated):
NC_000001.11 1243 A T 0.14
NC_000005.11 1432 G C 0.0006
NC_000012.12 1284 A T 0.93428
NC_000021.9 9824 T C 0.9
Lookup table (tab seperated) – this is in fact huge around 6G gzipped :
NC_000001.11 1243 rs73647921 A T
NC_000005.11 1432 rs75444 G C
NC_000012.12 1284 rs754723 A T
NC_000021.9 9824 rs865545 T C
Desired output – I would like to match on the first 4 columns of table 1 which correspond to columns 1/2/4/5 of the lookup table; taking table 1 and transforming it into:
MarkerName P-Value
rs73647921 0.14
rs75444 0.0006
rs754723 0.93428
rs865545 0.9
I think I need to use join along the lines of:
join -t, -a 1 -a 2 -o0,1.5,2.3 -e ' -' file1 file2
But this doesn’t seem to work. How could I also use a gzipped file?
With awk (and bash) you can write
awk '
BEGIN {FS = OFS = "t"}
NR == FNR {pvalue[$1,$2,$3,$4] = $5; next}
FNR == 1 {print "MarkerName", "P-Value"}
{ key = $1 SUBSEP $2 SUBSEP $4 SUBSEP $5
sub(/r$/, "", key)
}
key in pvalue {print $3, pvalue[key]}
' table1.tsv <(zcat lookup.tsv.gz)
awk uses the SUBSEP variable to join comma-separated indices for arrays.
That last bit of syntax around the zcat is a bash Process Substitution
For multi-field join conditions, join
can be painful to work with. It also complains if the files are not sorted.
Correct answer by glenn jackman on December 2, 2020
have you tried .. ?
join -o 2.3,1.5 f1 <(zcat f2.gz )
by default join
use first field. (and -1 x -2 x
is equivalent to -j x
, and -j 1
is equivalent to " "
)
I assume
to sort on all fields
join -o 2.2,1.2
<(awk '{printf "%s %sn",$1 $2 $3 $4,$5}' f1)
<(zcat f2.gz| awk '{printf "%s %sn",$1 $2 $4 $5,$3}' )
where
awk
to pipe to join
might not be a good idea (altough this might save memory compared to an awk
only solution),awk
command build an index from revelant fields,zcat
used to ungzip,0
's problem open.Answered by Archemar on December 2, 2020
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP