Unix & Linux Asked on December 8, 2021
I have 60 files, each containing about 10,000 lines. Each line holds a single string.
I want to find only the strings that are common to all files.
Matches must be exact, i.e. entire lines are compared.
A parallelized version in bash. It should work even for files bigger than memory:
export LC_ALL=C
comm -12 \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12 <(comm -12 <(sort 1) <(sort 2)) <(comm -12 <(sort 3) <(sort 4))) \
        <(comm -12 <(comm -12 <(sort 5) <(sort 6)) <(comm -12 <(sort 7) <(sort 8)))) \
      <(comm -12 \
        <(comm -12 <(comm -12 <(sort 9) <(sort 10)) <(comm -12 <(sort 11) <(sort 12))) \
        <(comm -12 <(comm -12 <(sort 13) <(sort 14)) <(comm -12 <(sort 15) <(sort 16))))) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12 <(comm -12 <(sort 17) <(sort 18)) <(comm -12 <(sort 19) <(sort 20))) \
        <(comm -12 <(comm -12 <(sort 21) <(sort 22)) <(comm -12 <(sort 23) <(sort 24)))) \
      <(comm -12 \
        <(comm -12 <(comm -12 <(sort 25) <(sort 26)) <(comm -12 <(sort 27) <(sort 28))) \
        <(comm -12 <(comm -12 <(sort 29) <(sort 30)) <(comm -12 <(sort 31) <(sort 32)))))) \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12 <(comm -12 <(sort 33) <(sort 34)) <(comm -12 <(sort 35) <(sort 36))) \
        <(comm -12 <(comm -12 <(sort 37) <(sort 38)) <(comm -12 <(sort 39) <(sort 40)))) \
      <(comm -12 \
        <(comm -12 <(comm -12 <(sort 41) <(sort 42)) <(comm -12 <(sort 43) <(sort 44))) \
        <(comm -12 <(comm -12 <(sort 45) <(sort 46)) <(comm -12 <(sort 47) <(sort 48))))) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12 <(comm -12 <(sort 49) <(sort 50)) <(comm -12 <(sort 51) <(sort 52))) \
        <(comm -12 <(comm -12 <(sort 53) <(sort 54)) <(comm -12 <(sort 55) <(sort 56)))) \
      <(comm -12 <(comm -12 <(sort 57) <(sort 58)) <(comm -12 <(sort 59) <(sort 60)))))
Replace sort with cat if the files are already sorted.
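Writing that tree by hand is error-prone. As a sketch (not part of the original answer), a small recursive bash function could generate the same balanced shape for any list of files; the function name tree and the use of eval are illustrative assumptions:

#!/bin/bash
# Sketch: emit a balanced tree of "comm -12" process substitutions
# for the files given as arguments, then run it with eval.
tree() {
  if [ "$#" -eq 1 ]; then
    printf 'sort %q' "$1"        # leaf; use 'cat' instead if already sorted
  else
    local mid=$(( $# / 2 ))
    printf 'comm -12 <(%s) <(%s)' \
      "$(tree "${@:1:mid}")" "$(tree "${@:mid+1}")"
  fi
}
export LC_ALL=C
eval "$(tree "$@")"              # e.g.: ./common.sh {1..60}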
Answered by Ole Tange on December 8, 2021
With join:
cp a jnd
for f in a b c; do join jnd "$f" >j__; cp j__ jnd; done
As a test, three files a, b and c contain just the numbers 1-6, 3-8 and 5-9, one per line. These are the two lines the three have in common:
]# cat jnd
5
6
It is not elegant or efficient done this way, especially with that cp in between, but it can be made to work in parallel quite easily: select a subgroup of files (for f in a*), give the intermediate files unique names, and run many subgroups at once. You still have to join those partial results: with 64 files you would have 8 threads joining 8 files each, and then the remaining 8 joined files could again be split up, into 4 threads. Note that join, like comm, requires its input files to be sorted. A sketch of that scheme follows.
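One way that scheme could look, as a rough sketch (the group size of 8, the joingroup helper and the jnd.* names are all illustrative assumptions, and every input file is assumed to be sorted):

#!/bin/bash
# Sketch: intersect many sorted files with join, 8 files per parallel group.
joingroup() {                      # joingroup OUT FILE... ; result lands in OUT
  local out=$1; shift
  cp -- "$1" "$out"; shift
  local f
  for f do
    join "$out" "$f" > "$out.tmp" && mv "$out.tmp" "$out"
  done
}
g=0
while [ "$#" -gt 0 ]; do
  joingroup "jnd.$g" "${@:1:8}" &  # one background job per group of up to 8
  shift "$(( $# < 8 ? $# : 8 ))"
  g=$(( g + 1 ))
done
wait                               # let all groups finish
joingroup jnd.final jnd.[0-9]*     # then join the per-group results
cat jnd.final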
Answered by user373503 on December 8, 2021
With zsh, using its ${a:*b} array intersection operator on arrays marked with the unique flag (also using the $(<file) ksh operator and the f parameter expansion flag to split on line feed characters):
#! /bin/zsh -
typeset -U all list              # -U: arrays keep only unique elements
all=(${(f)"$(<${1?})"}); shift   # seed with the lines of the first file
for file do
  list=(${(f)"$(<$file)"})       # lines of the next file
  all=(${all:*list})             # intersect with what we have so far
done
print -rC1 -- $all               # print the result, one element per line
(that script takes the list of files as arguments; empty lines are ignored).
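For instance, saved as intersect.zsh (a hypothetical name), and reusing the small a, b, c test files from the join answer above:

$ print -l {1..6} >a; print -l {3..8} >b; print -l {5..9} >c
$ zsh intersect.zsh a b c
5
6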
Answered by Stéphane Chazelas on December 8, 2021
Try this (it needs GNU awk, for BEGINFILE):
awk '
BEGINFILE{fnum++; delete f;}
!f[$0]++{s[$0]++;}
END {for (l in s){if (s[l] == fnum) print l}}
' files*
Explanation:

BEGINFILE { ... }
  Run at the beginning of each file (a GNU awk extension).

fnum++
  Increment the file counter.

delete f
  Delete the array used to filter duplicate lines per file (deleting a whole
  array is also an extension; split("", f) is the portable equivalent).

!f[$0]++ { ... }
  Run only for the first occurrence of a line in a file (when f[$0] is 0, i.e. false).

s[$0]++
  Increment the line counter.

END { ... }
  Run once at the end.

for (l in s){if (s[l] == fnum) print l}
  Loop over the lines and print each one whose number of occurrences equals
  the number of files.

600,000 lines should be fine in memory. Otherwise, you could remove from s
everything with a count less than fnum in the BEGINFILE{...} block, as
sketched below.
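That pruning idea could look like this, as a sketch (still GNU awk; a line whose count already falls short of the file number can never be present in all files, so it is dropped at each file boundary):

awk '
BEGINFILE {
    fnum++; delete f
    # a line not seen in every previous file can never catch up: prune it
    for (l in s) if (s[l] < fnum - 1) delete s[l]
}
!f[$0]++ { s[$0]++ }
END      { for (l in s) if (s[l] == fnum) print l }
' files*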
Answered by pLumo on December 8, 2021