Unix & Linux Asked on December 8, 2021
I have 60 files, each containing about 10,000 lines. Each line contains a single string.
I want to find only the strings that are common to all files.
Matches must be exact, so entire lines are compared.
A parallelized version in bash. It should work for files bigger than memory:
export LC_ALL=C
comm -12 \
  <(comm -12 \
      <(comm -12 \
          <(comm -12 \
              <(comm -12 <(comm -12 <(sort 1) <(sort 2);) <(comm -12 <(sort 3) <(sort 4););) \
              <(comm -12 <(comm -12 <(sort 5) <(sort 6);) <(comm -12 <(sort 7) <(sort 8);););) \
          <(comm -12 \
              <(comm -12 <(comm -12 <(sort 9) <(sort 10);) <(comm -12 <(sort 11) <(sort 12););) \
              <(comm -12 <(comm -12 <(sort 13) <(sort 14);) <(comm -12 <(sort 15) <(sort 16););););) \
      <(comm -12 \
          <(comm -12 \
              <(comm -12 <(comm -12 <(sort 17) <(sort 18);) <(comm -12 <(sort 19) <(sort 20););) \
              <(comm -12 <(comm -12 <(sort 21) <(sort 22);) <(comm -12 <(sort 23) <(sort 24);););) \
          <(comm -12 \
              <(comm -12 <(comm -12 <(sort 25) <(sort 26);) <(comm -12 <(sort 27) <(sort 28););) \
              <(comm -12 <(comm -12 <(sort 29) <(sort 30);) <(comm -12 <(sort 31) <(sort 32);););););) \
  <(comm -12 \
      <(comm -12 \
          <(comm -12 \
              <(comm -12 <(comm -12 <(sort 33) <(sort 34);) <(comm -12 <(sort 35) <(sort 36););) \
              <(comm -12 <(comm -12 <(sort 37) <(sort 38);) <(comm -12 <(sort 39) <(sort 40);););) \
          <(comm -12 \
              <(comm -12 <(comm -12 <(sort 41) <(sort 42);) <(comm -12 <(sort 43) <(sort 44););) \
              <(comm -12 <(comm -12 <(sort 45) <(sort 46);) <(comm -12 <(sort 47) <(sort 48););););) \
      <(comm -12 \
          <(comm -12 \
              <(comm -12 <(comm -12 <(sort 49) <(sort 50);) <(comm -12 <(sort 51) <(sort 52););) \
              <(comm -12 <(comm -12 <(sort 53) <(sort 54);) <(comm -12 <(sort 55) <(sort 56);););) \
          <(cat <(comm -12 <(comm -12 <(sort 57) <(sort 58);) <(comm -12 <(sort 59) <(sort 60););) ;);););
Replace sort with cat if the files are already sorted.
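The same pairwise reduction can be written as a plain loop for any number of pre-sorted files. A minimal serial sketch (not part of the answer above; it gives up the tree's parallelism, and drops its temporary files common.ROUND.XXXXXX in the current directory):

#! /bin/bash -
# Intersect any number of pre-sorted files, given as arguments,
# by pairwise reduction with comm -12, round by round.
export LC_ALL=C
files=( "$@" )
round=0
while (( ${#files[@]} > 1 )); do
  next=()
  for (( i = 0; i + 1 < ${#files[@]}; i += 2 )); do
    out=$(mktemp "common.$round.XXXXXX") || exit
    comm -12 "${files[i]}" "${files[i+1]}" >"$out"
    next+=( "$out" )
  done
  # an odd file out is carried unchanged into the next round
  (( ${#files[@]} % 2 )) && next+=( "${files[@]: -1}" )
  files=( "${next[@]}" )
  (( round++ ))
done
cat "${files[0]}"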
Answered by Ole Tange on December 8, 2021
With join:
cp a jnd
for f in a b c; do join jnd $f >j__; cp j__ jnd; done
I have just numbers (1-6, 3-8, 5-9) in three files a, b and c. These are the two lines (numbers, i.e. strings) the three have in common:
]# cat jnd
5
6
It is not elegant or efficient like that, especially with that cp in between (join's output cannot be redirected straight back to jnd, since the redirection would truncate the file before join reads it). But it can be made to work in parallel quite easily: select a subgroup of files (for f in a*), give the result files unique names, and then you can run many subgroups at once. You still have to join these results; with 64 files you would have 8 threads joining 8 files each, and then the remaining 8 joined files could again be split up, into 4 threads. A cleaned-up serial version is sketched below.
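A sketch of that serial reduction, assuming the input files are already sorted (join requires sorted input) and replacing the cp with a rename:

#! /bin/bash -
# Seed the result with the first file, then join in the rest one
# by one; jnd and j__ are scratch names as in the answer above.
cp "$1" jnd; shift
for f in "$@"; do
  join jnd "$f" >j__ && mv j__ jnd
done
cat jnd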
Answered by user373503 on December 8, 2021
With zsh, using its ${a:*b} array intersection operator on arrays marked with the unique flag (also using the $(<file) ksh operator and the f parameter expansion flag to split on line feed characters):
#! /bin/zsh -
typeset -U all list                # -U keeps array elements unique
all=(${(f)"$(<${1?})"}); shift     # seed with the lines of the first file
for file do
  list=(${(f)"$(<$file)"})         # lines of the current file
  all=(${all:*list})               # intersect: keep elements also in list
done
print -rC1 -- $all                 # print raw, one element per line
(that script takes the list of files as arguments; empty lines are ignored).
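To see the intersection operator on its own, a tiny illustration with made-up arrays:

typeset -U a b
a=(foo bar baz)
b=(baz qux foo)
print -rC1 -- ${a:*b}    # keeps the elements of a that also occur in b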
Answered by Stéphane Chazelas on December 8, 2021
Try this,
awk '
BEGINFILE{fnum++; delete f;}
!f[$0]++{s[$0]++;}
END {for (l in s){if (s[l] == fnum) print l}}
' files*
Explanation:

BEGINFILE { ... }
    Run at the beginning of each file.
    fnum++
        Increment the file counter.
    delete f
        Clear the array used to filter duplicate lines per file (see link for a POSIX-compliant solution).
!f[$0]++ { ... }
    Run only for the first occurrence of a line in a file (when f[$0] is 0, i.e. false).
    s[$0]++
        Increment the line counter.
END { ... }
    Run once at the end.
    for (l in s){if (s[l] == fnum) print l}
        Loop over the lines and print each one whose number of occurrences equals the number of files.

600,000 lines should be fine in memory. Otherwise, you could possibly remove everything from s which is less than fnum in the BEGINFILE{...} block.
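That pruning could look like this (a sketch; it stays GNU-awk-specific either way, since BEGINFILE is a gawk extension):

awk '
BEGINFILE {
    # a line can still be common only if it appeared in every
    # file read so far, i.e. its count equals fnum; drop the rest
    for (l in s) if (s[l] < fnum) delete s[l]
    fnum++; delete f
}
!f[$0]++ { s[$0]++ }
END { for (l in s) if (s[l] == fnum) print l }
' files*

With that, s never holds more lines than the smallest file seen so far.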
Answered by pLumo on December 8, 2021