Unix & Linux Asked by patbarron on November 28, 2020
Long ago, in Seventh Edition UNIX (a/k/a V7), there was a program called prep. Its primary use was to take files of text and break them up into one word per line, for further processing by other tools in a pipeline. It could do a bit of other manipulation too, like telling you the location of each individual word within a file, ignoring specific words in an ignore list, or only paying attention to words specifically mentioned in an include list. It’s sort of difficult to explain exactly what it does, but here is a man page from 2.9BSD for it. It had an assortment of interesting uses – for example, building dictionaries, spell-checkers, and the like.
This program was rather short-lived. It only existed in V7 and a couple of offshoots (and 2.9BSD was basically an offshoot of V7). It didn’t previously exist in V6, it was removed from V8, and it never even made it into 4.2BSD. It doesn’t exist (at least not in this form) in any Linux distribution that I’m aware of, nor in FreeBSD and friends. There was another program that also (as far as I am aware) first appeared in V7, called deroff, that was primarily for a completely different purpose – but it had a "-w" option that told it to do the "split up files into one word per line" thing, similar to prep, though it didn’t do any of the other functions (like word numbering, include lists, and ignore lists). I assume that for purposes like dictionary building, deroff -w subsumed the function of prep. That program was comparatively much longer-lived – but these days, there doesn’t even seem to be a version of deroff packaged for any major Linux distribution: it’s not in any recent version of RHEL, it’s not in Fedora 32, and it’s not in Debian 10 (though I’m pretty sure it actually was in Debian until not that long ago).
Why did prep go away? Was it really because deroff -w duplicated most of its function? I presume that deroff has disappeared from current Linux distributions because people generally don’t deal with [nt]roff-formatted documents anymore, except maybe for man pages. But with both of these tools gone, what can one use to do the "split up a text file into one word per line" function? Is there anything packaged for any modern Linux distro that would perform this function? (If you’re going to respond with "you can probably do this yourself with a simple script", I concede that is probably correct – but that is not the answer I’m looking for right now; I’m looking for a way to do this with some existing tool that already exists in modern Linux distributions…) Ideally, I’d like to find something that implements all the features listed in the man page I linked (plus the "implied" behaviors that aren’t explicitly specified in the man page, like not considering punctuation to be part of a word, and how numbers that appear as part of a "word" are handled). 🙂 Practically, I don’t think the include and exclude lists are particularly crucial, and while I’d like to have the word numbering (it can sometimes be handy to know the location of a word in a file), it’s not that important. Handling of hyphenated words at the end of a line would also be desirable.
Using Raku (formerly known as Perl6)
~$ raku -ne '.words.join("\n").put;' < file
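Note that .words splits on whitespace only, so punctuation stays attached to each word. If something closer to prep’s behavior is wanted (punctuation not counted as part of a word), a variant along these lines should also work, with \w+ serving as a rough stand-in for prep’s notion of a word (so digits and underscores are kept):

~$ raku -ne '.comb(/\w+/).join("\n").put;' < file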
HTH.
Answered by jubilatious1 on November 28, 2020
It seems like tr -s " " "\n" < file ought to work for splitting a file to one word per line.
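That only squeezes runs of spaces, though, so tabs pass through and punctuation stays attached to the words. A broader variant of the same idea is the classic tr word-splitting idiom: treat every non-alphabetic character as a word separator and squeeze the runs. It is only an approximation of prep (ASCII-only as written, and it also breaks words at apostrophes, digits, and hyphens), but it does give you one word per line with the punctuation stripped:

tr -cs "A-Za-z" "\n" < file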
Answered by tim1724 on November 28, 2020