Identify strings shared between multiple files from the Linux command line

Question

Given a set of arbitrary files, what's the best way to identify the text strings shared between them (either in all files or a subset of them) from the Linux command line?
This would make it quicker to write YARA rules for clusters of similar malicious files (for instance, malicious executables).

recvfrom · Answer

Here's one approach, for malicious files in a directory named malware:
find malware/ -type f -exec sh -c 'strings "$1" | sort -u' sh {} \; | sort | uniq -c | sort -n

The output will look something like the following, where the first number on each line is the number of files containing the string:
      ...
      1 Sleep
      ...
      2 JFIF
      2 SetBkColor
      ...
      5 !This program cannot be run in DOS mode.
      5 t@PW
      5 @tVH
      ...
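
To narrow that list to strings found in at least some minimum number of files, the counted output can be filtered on its first column. This is only a rough sketch: the function name min_files is made up for illustration, and it assumes its input is in the "count string" format shown above.

```shell
# min_files N: read "uniq -c" style lines on stdin and keep only those
# whose leading count is at least N. (Hypothetical helper name.)
min_files() {
    awk -v min="$1" '$1 + 0 >= min + 0'
}
```

Appending `| min_files 5` to the pipeline would keep only the strings present in five or more files. To keep only strings present in every file, pass the total file count instead, for example the output of `find malware/ -type f | wc -l`.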

One useful variation when the input files are Windows executables is to use strings -el instead of strings. This extracts UTF-16 little-endian strings (also known as wide character strings) in place of the default single-byte strings.
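
Since strings -el replaces the default scan rather than adding to it, one possible way to cover both encodings is to run strings twice per file and merge the results. The sketch below assumes GNU strings from binutils; the function name file_strings is hypothetical.

```shell
# file_strings FILE: print each distinct string in FILE once, combining
# the default single-byte scan with the 16-bit little-endian (wide) scan.
# (Hypothetical helper name; assumes GNU strings.)
file_strings() {
    { strings "$1"; strings -el "$1"; } | sort -u
}
```

For a flat directory this could be used as `for f in malware/*; do file_strings "$f"; done | sort | uniq -c | sort -n`. A string that appears in both encodings within one file is still counted once for that file.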
