English Language & Usage Asked on August 7, 2021
I’m talking about percentage of letters, not words, in case it isn’t clear. Is there a way to gauge this?
Between 2% and 4%, depending on the text and the genre.
To determine this, I downloaded a variety of texts from Project Gutenberg, then wrote a simple program to count the total number of alphabetic characters and the total number of capitalized characters in each file. Here are the raw numbers:
Title (Author) | Letter Count | Caps Count | Percent Caps |
---|---|---|---|
Pride and Prejudice (Austen) | 2,641,527 | 14,177 | 2.56% |
History of the Decline and Fall of the Roman Empire (Gibbon) | 1,295,410 | 34,893 | 2.69% |
Moby Dick (Melville) | 968,516 | 28,204 | 2.91% |
Great Expectations (Dickens) | 777,248 | 23,668 | 3.05% |
Shunned House (Lovecraft) | 66,779 | 2,223 | 3.32% |
Tom Sawyer (Twain) | 312,196 | 10,746 | 3.44% |
Somebody Comes to Town, Somebody Leaves Town (Doctorow) | 495,594 | 17,366 | 3.50% |
Bible (King James Version) | 3,343,105 | 117,344 | 3.51% |
Ulysses (Joyce) | 1,203,807 | 55,244 | 4.58% |
Hamlet (Shakespeare) | 139,132 | 7,812 | 5.61% |
Hamlet comes in with the highest percentage of capitals, probably because it’s a script and the repeated character names are always capitalized. Ulysses is also unusually high, because Joyce is weird and uses lots of capitals in unexpected places. The other texts run from about 2.5% to 3.5%.
Edit: Added Melville, Lovecraft, Dickens, and Doctorow to fill out the comparison of contemporary, early-20th-century, and 19th-century authors. I’m not seeing much of a trend here; if anything, the most contemporary authors have a somewhat higher percentage of capitals than the earlier ones. I suspect that more modern writers use shorter sentences, and therefore more sentence-initial capitals, and that this effect swamps the effect of freer capitalization in earlier texts.
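The program itself isn’t shown above. A minimal Python sketch of that kind of count might look like the following, assuming "capitalized characters" means uppercase letters as reported by `str.isupper`. The names `count_caps` and `percent_caps` are illustrative, not taken from the original program.

```python
def count_caps(text):
    """Return (letter_count, caps_count) for a string."""
    # Count every alphabetic character, whatever its case.
    letters = sum(1 for ch in text if ch.isalpha())
    # Count only alphabetic characters that are uppercase.
    caps = sum(1 for ch in text if ch.isalpha() and ch.isupper())
    return letters, caps


def percent_caps(text):
    """Return capitals as a percentage of all letters (0.0 if no letters)."""
    letters, caps = count_caps(text)
    return 100.0 * caps / letters if letters else 0.0


# Small inline sample; for a real file, read its contents into a string first.
sample = "Call me Ishmael. Some years ago, never mind how long precisely."
print(count_caps(sample))
print(f"{percent_caps(sample):.2f}%")
```

To run this on a downloaded text, read the file into a string and pass it to `percent_caps`. Whether to strip the Project Gutenberg header and license text first is a separate choice that will shift the result slightly.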
Correct answer by JSBձոգչ on August 7, 2021
Assuming the Project Gutenberg etext of Herman Melville’s Moby-Dick is representative of all English literature:
24,559 / 1,231,937 = 2.00% capital letters (across all characters).
Or as a percentage of letters (ignoring non-letter characters):
24,559 / 960,737 = 2.56% capital letters (across all letters).
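The two figures differ only in the denominator. Here is a rough Python sketch of that distinction, assuming "all characters" means every character in the file, including spaces, punctuation, and line breaks (the exact counting rule isn’t stated here). `caps_shares` is a made-up name.

```python
def caps_shares(text):
    """Return (pct_of_all_characters, pct_of_letters) for uppercase letters."""
    caps = sum(1 for ch in text if ch.isalpha() and ch.isupper())
    letters = sum(1 for ch in text if ch.isalpha())
    characters = len(text)  # assumption: whitespace and punctuation count too

    pct_of_characters = 100.0 * caps / characters if characters else 0.0
    pct_of_letters = 100.0 * caps / letters if letters else 0.0
    return pct_of_characters, pct_of_letters


of_chars, of_letters = caps_shares("It was a dark night. Nobody spoke.")
print(f"{of_chars:.2f}% of characters, {of_letters:.2f}% of letters")
```

Because every letter is also a character, the share of all characters can never exceed the share of letters.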
Edit 2: Taking this a step further, I ran a script on the plain text ebooks from Project Gutenberg’s CD and DVDs:
Source | Caps | Letters | Pct Caps (of letters) | Characters | Pct Caps (of characters) |
---|---|---|---|---|---|
Moby-Dick: 1 ebook | 24,559 | 960,737 | 2.56% | 1,231,937 | 2.00% |
2003 CD: 594 ebooks | 11,407,295 | 319,286,662 | 3.57% | 417,687,793 | 2.73% |
2006 DVD: 16,536 ebooks | 179,318,621 | 4,913,640,039 | 3.65% | 6,380,437,180 | 2.81% |
2010 DVD: 14,792 ebooks | 152,637,904 | 4,102,894,980 | 3.72% | 5,433,866,318 | 2.81% |
Total: 31,923 ebooks | 343,388,379 | 9,336,782,418 | 3.68% | 12,233,223,228 | 2.81% |
The median values are 3.72% (of letters) and 2.85% (of all characters), and the mode values are 3.12% and 2.29%, respectively.
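The script isn’t included, and it isn’t stated how the median and mode were computed. One plausible reading is that the totals pool the counts across all ebooks, while the median and mode are taken over per-ebook percentages rounded to two decimals. The Python sketch below works under that assumption. The counts in `books` are invented for illustration and are not Project Gutenberg data.

```python
from statistics import median, multimode

# Hypothetical per-ebook counts: (caps, letters, characters).
books = [
    (120, 4000, 5200),
    (90, 2500, 3300),
    (300, 8000, 10500),
    (75, 2500, 3200),
]


def pooled_pct(books, denom_index):
    """Percentage from summed counts (how a 'Total' row would be built)."""
    total_caps = sum(b[0] for b in books)
    total_denom = sum(b[denom_index] for b in books)
    return 100.0 * total_caps / total_denom


def per_book_pcts(books, denom_index, ndigits=2):
    """One rounded percentage per ebook, used for the median and mode."""
    return [round(100.0 * b[0] / b[denom_index], ndigits) for b in books]


LETTERS, CHARACTERS = 1, 2  # column indices within each tuple

letter_pcts = per_book_pcts(books, LETTERS)
print("pooled, of letters:", round(pooled_pct(books, LETTERS), 2))
print("median, of letters:", median(letter_pcts))
print("mode(s), of letters:", multimode(letter_pcts))
print("pooled, of characters:", round(pooled_pct(books, CHARACTERS), 2))
```

A pooled percentage weights long books more heavily than a per-ebook median does, which is one reason the two can differ.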
Answered by Hugo on August 7, 2021