Unix & Linux Asked by user394 on September 13, 2020
I did a website scrape for a conversion project. I’d like to do some statistics on the types of files in there, for instance 400 .html files, 100 .gif, etc. What’s an easy way to do this? It has to be recursive.
Edit: With the script that maxschlepzig posted, I’m having some problems due to the architecture of the site I’ve scraped. Some of the files have names like *.php?blah=blah&foo=bar with various arguments, so each one is counted as a unique type. The solution needs to treat *.php* as all the same type, so to speak.
You could use find and uniq for this, e.g.:
$ find . -type f | sed 's/.*\.//' | sort | uniq -c
16 avi
29 jpg
136 mp3
3 mp4
Command explanation
find recursively prints all filenames
sed deletes from every filename the prefix up to the file extension
uniq assumes sorted input
-c does the counting (like a histogram).
Correct answer by maxschlepzig on September 13, 2020
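The sed step from the answer above can be tried in isolation. The following is only a sketch: the sample paths are made up to stand in for find's output, and strip_to_ext is a hypothetical helper name, not part of the original answer.

```shell
# Hypothetical helper wrapping the sed step: delete everything up to
# and including the last dot on each line.
strip_to_ext() { sed 's/.*\.//'; }

# Made-up sample paths standing in for the output of `find . -type f`.
printf '%s\n' ./music/a.mp3 ./music/b.mp3 ./video/c.avi \
    | strip_to_ext | sort | uniq -c
```

Note that a name without an extension is not reduced to an empty string: for ./README the only dot is the leading one, so the line comes out as /README.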
I know this thread is old, but it is one of the top results when searching for "bash count file extensions".
I encountered the same problem as you and created a script similar to maxschlepzig's.
Here is the command I made that counts the extensions of all files in the working directory recursively. It folds upper and lower case together, removes false positives, and counts the occurrences.
find . -type f \
  | tr '[:upper:]' '[:lower:]' \
  | grep -E '.*\.[a-zA-Z0-9]*$' \
  | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' \
  | sort \
  | uniq -c \
  | sort -n
Here is the GitHub link if you'd like to see more documentation.
Answered by Andrew Hopkins on September 13, 2020
I've put a bash script called exhist into my ~/bin folder with this content:
#!/bin/bash
for d in */ ; do
    echo "$d"
    find "$d" -type f | sed -r 's/.*\/([^\/]+)/\1/' | sed 's/^[^.]*$//' | sed -r 's/.*(\.[^.]+)$/\1/' | sort | uniq -c | sort -nr
    # files only | keep filename only | no ext -> '' ext | keep part after . (i.e. ext) | count | sort by count desc
done
Whichever directory I'm in, I just type 'exh', tab auto-completes it, and I see something like this:
$ exhist
src/
7 .java
1 .txt
target/
42 .html
10 .class
4 .jar
3 .lst
2
1 .xml
1 .txt
1 .properties
1 .js
1 .css
P.S. Trimming the part after the question mark should be simple to do with another sed command probably after the last one (I haven't tried it): sed 's/?.*//'
Answered by Zsolt Katona on September 13, 2020
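The untested P.S. above, applied to the *.php?blah=blah&foo=bar case from the question, could look roughly like this. It is a sketch under assumptions: count_ext is a hypothetical function name, the query string is assumed to start at the first ? in the file name, and file names are assumed not to contain newlines.

```shell
# Hypothetical helper: count extensions below a directory, treating
# "page.php?x=1" the same as "page.php".
count_ext() {
    # $1: directory to scan (defaults to the current directory)
    find "${1:-.}" -type f \
        | sed -n \
            -e 's|.*/||' \
            -e 's/[?].*$//' \
            -e 's/.*\(\.[^.]*\)$/\1/p' \
        | sort | uniq -c | sort -nr
    # sed steps: drop the directory part, drop everything from the first "?",
    # then print only the last ".ext" (names without a dot are skipped).
}
```

Running count_ext with no argument scans the current directory. Stripping the directory part before the query string keeps a dot in a directory name from being mistaken for an extension.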
This one-liner seems to be a fairly robust method:
find . -type f -printf '%f\n' | sed -r -n 's/.+(\..*)$/\1/p' | sort | uniq -c
The find . -type f -printf '%f\n' prints the basename of every regular file in the tree, with no directories. That eliminates having to worry about directories which may have dots in them in your sed regex.
The sed -r -n 's/.+(\..*)$/\1/p' replaces the incoming filename with only its extension. E.g., .somefile.ext becomes .ext. Note the initial .+ in the regex; it means any match needs at least one character before the extension's dot. This prevents filenames like .gitignore from being treated as having no name at all and the extension '.gitignore', which is probably what you want. If not, replace the .+ with a .*.
The rest of the line is from the accepted answer.
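The difference between .+ and .* described above can be checked on a few names. This is a sketch only: the sample names are made up, ext_strict and ext_loose are hypothetical helper names, and -E is used in place of -r on the assumption that the installed sed accepts it (GNU and BSD sed both do).

```shell
# Hypothetical helpers wrapping the two variants of the regex.
ext_strict() { sed -E -n 's/.+(\..*)$/\1/p'; }  # needs a character before the dot
ext_loose()  { sed -E -n 's/.*(\..*)$/\1/p'; }  # a leading dot counts as an extension

# Made-up sample names; README has no dot and is dropped by both variants.
printf '%s\n' .gitignore .somefile.ext archive.tar.gz README | ext_strict
# prints .ext and .gz
printf '%s\n' .gitignore .somefile.ext archive.tar.gz README | ext_loose
# prints .gitignore, .ext and .gz
```

In both variants the greedy prefix runs to the last dot, so archive.tar.gz is counted as .gz rather than .tar.gz.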
Edit: If you want a nicely-sorted histogram in Pareto chart format, just add another sort to the end:
find . -type f -printf '%f\n' | sed -r -n 's/.+(\..*)$/\1/p' | sort | uniq -c | sort -bn
Sample output from a built Linux source tree:
1 .1992-1997
1 .1994-2004
1 .1995-2002
1 .1996-2002
1 .ac
1 .act2000
1 .AddingFirmware
1 .AdvancedTopics
[...]
1445 .S
2826 .o
2919 .cmd
3531 .txt
19290 .h
23480 .c
Answered by Gary R. Van Sickle on September 13, 2020
With zsh:
print -rl -- **/?*.*(D.:e) | sort | uniq -c | sort -n
The pattern **/?*.* matches all files that have an extension, in the current directory and its subdirectories recursively. The glob qualifier D lets zsh traverse even hidden directories and consider hidden files, and . selects only regular files. The history modifier :e retains only the file extension. print -rl prints one match per line. The glob result is sorted by path rather than by extension, so sort groups identical extensions together before uniq -c counts consecutive identical items. The final call to sort sorts the extensions by use count.
Answered by Gilles 'SO- stop being evil' on September 13, 2020