TransWikia.com

Compare 60 large files and output only the lines that are common to all files

Unix & Linux Asked on December 8, 2021

I have 60 files, each containing about 10,000 lines. Each line contains a single string.

I want to find only the strings that are common to all files.

Matches must be exact, so entire lines are compared.

4 Answers

Parallelized version in bash. It should work for files bigger than memory.

export LC_ALL=C
comm -12 \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 1) <(sort 2);) <(comm -12  <(sort 3) <(sort 4););) \
        <(comm -12  <(comm -12  <(sort 5) <(sort 6);) <(comm -12  <(sort 7) <(sort 8);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 9) <(sort 10);) <(comm -12  <(sort 11) <(sort 12););) \
        <(comm -12  <(comm -12  <(sort 13) <(sort 14);) <(comm -12  <(sort 15) <(sort 16););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 17) <(sort 18);) <(comm -12  <(sort 19) <(sort 20););) \
        <(comm -12  <(comm -12  <(sort 21) <(sort 22);) <(comm -12  <(sort 23) <(sort 24);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 25) <(sort 26);) <(comm -12  <(sort 27) <(sort 28););) \
        <(comm -12  <(comm -12  <(sort 29) <(sort 30);) <(comm -12  <(sort 31) <(sort 32);););););) \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 33) <(sort 34);) <(comm -12  <(sort 35) <(sort 36););) \
        <(comm -12  <(comm -12  <(sort 37) <(sort 38);) <(comm -12  <(sort 39) <(sort 40);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 41) <(sort 42);) <(comm -12  <(sort 43) <(sort 44););) \
        <(comm -12  <(comm -12  <(sort 45) <(sort 46);) <(comm -12  <(sort 47) <(sort 48););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 49) <(sort 50);) <(comm -12  <(sort 51) <(sort 52););) \
        <(comm -12  <(comm -12  <(sort 53) <(sort 54);) <(comm -12  <(sort 55) <(sort 56);););) \
      <(cat  <(comm -12  <(comm -12  <(sort 57) <(sort 58);) <(comm -12  <(sort 59) <(sort 60););) ;);););

Replace sort with cat if the files are already sorted.
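
For an arbitrary number of files, the same comm -12 idea can also be written as a loop instead of a hand-built tree. The following is only a sketch, not part of the answer above: it runs sequentially rather than in parallel, the function name common_lines is made up for illustration, and it uses sort -u so that each common line is printed once.

```shell
# Hypothetical helper: print the lines common to all files given as arguments.
# The body runs in a subshell, so its variables and locale setting stay local.
common_lines() (
  export LC_ALL=C                 # byte-wise ordering for both sort and comm
  acc=$(mktemp) || exit 1         # running intersection
  new=$(mktemp) || exit 1         # scratch file for the next step
  sort -u -- "$1" > "$acc"
  shift
  for f in "$@"; do
    # keep only lines present in both the running result and the next file
    sort -u -- "$f" | comm -12 "$acc" - > "$new"
    mv -- "$new" "$acc"
  done
  cat -- "$acc"
  rm -f -- "$acc" "$new"
)
```

It would be called as common_lines 1 2 3 (and so on, up to 60). Only two temporary files exist at any time, so it should also cope with files bigger than memory, at the cost of using a single core.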

Answered by Ole Tange on December 8, 2021

With join:

# join needs sorted input; each line here is a single field
cp a jnd
for f in b c; do join jnd "$f" >j__; cp j__ jnd; done

The three files a, b and c contain just numbers (1-6, 3-8 and 5-9). These are the two lines the three files have in common:

# cat jnd
5
6

This is neither elegant nor efficient, especially with that cp in between, but it can be made to run in parallel quite easily. Select a subgroup of files (for f in a*), give the intermediate files unique names, and then you can run many subgroups at once. You still have to join the partial results: with 64 files you would have 8 processes joining 8 files each, and the remaining 8 joined files could again be split up among 4 processes.
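
A minimal sketch of that subgroup idea, under some assumptions: every input file is already sorted and holds one whitespace-free string per line, and the helper name intersect_group as well as the file names in the usage comment are invented for illustration.

```shell
# Hypothetical helper: intersect_group OUT FILE...
# Writes the lines common to all FILEs into OUT, using OUT.tmp as scratch.
# Runs in a subshell so that its variables do not leak into the caller.
intersect_group() (
  out=$1; shift
  cp -- "$1" "$out"; shift
  for f in "$@"; do
    join "$out" "$f" > "$out.tmp" && mv -- "$out.tmp" "$out"
  done
)

# Possible usage: two subgroups in the background, then a final join.
#   intersect_group part1 a1 a2 a3 &
#   intersect_group part2 b1 b2 b3 &
#   wait
#   intersect_group result part1 part2
```

Because each group writes to its own output name, the background jobs do not overwrite each other's intermediate files.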

Answered by user373503 on December 8, 2021

With zsh, using its ${a:*b} array intersection operator on arrays marked with the unique flag (also using the $(<file) ksh operator and f parameter expansion flag to split on line feed characters):

#! /bin/zsh -
typeset -U all list
all=(${(f)"$(<${1?})"}); shift
for file do
  list=(${(f)"$(<$file)"})
  all=(${all:*list})
done
print -rC1 -- $all

(The script takes the list of files as arguments; empty lines are ignored.)

Answered by Stéphane Chazelas on December 8, 2021

Try this:

awk '
    BEGINFILE{fnum++; delete f;}
    !f[$0]++{s[$0]++;}
    END {for (l in s){if (s[l] == fnum) print l}}
' files*

Explanation:

  • BEGINFILE { ... } Run at the beginning of each file (GNU awk only)

  • !f[$0]++ { ... } Run only for the first occurrence of a line in a file (when f[$0] is 0 (false))

    • s[$0]++ Increment the line counter.

  • END { ... } Run once at the end

    • for (l in s){if (s[l] == fnum) print l} Loop over the lines and print each one whose number of occurrences equals the number of files.

600,000 lines should fit in memory without trouble. Otherwise, in the BEGINFILE{...} block you could remove from s every entry whose count is less than the number of files read so far.
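
The pruning idea could look roughly like this. It is a sketch rather than a tested replacement for the script above: the function name common_lines_pruned is invented, and FNR == 1 stands in for BEGINFILE so that it also runs on awk implementations other than gawk.

```shell
# Hypothetical helper: common_lines_pruned FILE...
# Assumes all arguments are file names (no var=value assignments).
common_lines_pruned() {
  awk '
    FNR == 1 {                      # first line of each non-empty file
      # drop lines that were missing from at least one earlier file
      for (l in s) if (s[l] < nfiles) delete s[l]
      nfiles++
      split("", seen)               # portable way to empty an array
    }
    !seen[$0]++ {                   # first occurrence within this file
      # after the first file, only lines that are still candidates count
      if (nfiles == 1 || ($0 in s)) s[$0]++
    }
    END {
      # ARGC - 1 is the number of file arguments, so an empty file
      # (which never triggers FNR == 1) leads to no output at all
      for (l in s) if (s[l] == ARGC - 1) print l
    }
  ' "$@"
}
```

After the first file, s can only shrink, so memory use is bounded by the number of distinct lines in the first file plus those of the file currently being read. As with the original script, the output order is unspecified; pipe it through sort if order matters.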

Answered by pLumo on December 8, 2021
