Forum: Author Hangout

Word count question

StarFleet Carl 🚫

Just wondering something, if anyone else has seen this.

I store my daily back-up copy of what I'm working on, on one computer. That may not be the same one I actually write my story on - I keep my working copy on USB and will use one of three computers. Two of them use OpenOffice, one of them uses MS Word.

I've noticed that, while the three computers and two programs agree as to the number of characters, they disagree as to the number of words I've written. I'm going to post the first chapter of my new work either tonight or tomorrow morning, and MS Word says there are 12,467 words in it. OpenOffice says 10,118. Both of them (and Notepad, with the actual text file itself) agree that there are 67,658 characters.

Why the difference in word count?

And no, I've no real interest in actually sitting down and counting each individual word, I was just curious if someone else has seen this and if they know why.

Replies:   Switch Blayde
Ernest Bywater 🚫

Not sure if it is still the case or if it has any bearing on this matter, but the default setting for MS Word used to be to save the original document then the key strokes for each change after that, it's this that allows it to undo changes. There is a setting where you can have it save only the latest version.

However, the more likely cause of your problem is how each program defines a word. Some word processors define a word as the text between two spaces, while others define it as the letters and numbers between two spaces. Hyphens are handled differently too: some programs count a hyphenated word as one word and others count it as two. Thus some software counts a space hyphen space, like - , with the hyphen as a word and others won't, and some will count mid-teen as one word while others will count it as two words.
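
To make those definitions concrete, here is a minimal Python sketch of three possible counting rules. The function names and the sample sentence are made up for illustration, and none of these is claimed to be what MS Word or OpenOffice actually does.

```python
def count_whitespace(text):
    # Rule 1: a word is any run of text between whitespace.
    return len(text.split())

def count_alnum_tokens(text):
    # Rule 2: a token only counts if it contains a letter or digit,
    # so a lone hyphen between spaces is not a word.
    return sum(1 for tok in text.split() if any(ch.isalnum() for ch in tok))

def count_split_hyphens(text):
    # Rule 3: like rule 2, but hyphens also separate words,
    # so "mid-teen" becomes two words.
    return count_alnum_tokens(text.replace("-", " "))

# Hypothetical sample sentence, not taken from any story.
SAMPLE = "The mid-teen boy - tall and thin - ran."

print(count_whitespace(SAMPLE))     # 9
print(count_alnum_tokens(SAMPLE))   # 7
print(count_split_hyphens(SAMPLE))  # 8
```

One short sentence already gives three different totals, so a gap of a couple of thousand words over a long chapter is plausible if two programs differ on rules like these.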

To further confuse the issue, when you have headers and footers some programs will count each different type of header or footer once only, some won't count them at all, and some will count them on each page they appear.

Replies:   Dominions Son
Dominions Son 🚫

@Ernest Bywater

Not sure if it is still the case or if it has any bearing on this matter, but the default setting for MS Word used to be to save the original document then the key strokes for each change after that, it's this that allows it to undo changes.

I've seen others claim and present some evidence to back it up that the .doc format is basically a core dump. The .docx format is compressed xml like .odt.

Replies:   Vincent Berg
Vincent Berg 🚫

@Dominions Son

I've seen others claim and present some evidence to back it up that the .doc format is basically a core dump. The .docx format is compressed xml like .odt.

While I agree with your 'core dump' analogy, just because .docx format is somewhat compressed doesn't mean that it doesn't retain the exact same amount of junk.

I've gotten to the point, having learned which Styles I rely on and which I don't need, that I routinely delete Style Definitions in my files, cleaning up the older files as I go. It doesn't save a ton of space, but it certainly makes finding the file types I use much easier. But Word labels and tracks everything you enter so it can easily find a variety of elements (author, nation/state, italics vs emphasis), which accumulates pretty quickly.

I also use an external reporting tool for my stories, which routinely reports word counts. I've learned, over time, that just like here, its word counts are way off, and the difference is not traceable to anything as simple as counting punctuation differently (though it does count em- and en-dashes as the start of a new sentence). Often, as I try to trim the fat from my sentences, I'm reduced to actually counting the words between differently phrased sentences, and it's clear that neither Word nor my reporting program is counting actual words!

Replies:   Dominions Son
Dominions Son 🚫

@Vincent Berg

While I agree with your 'core dump' analogy

That wasn't an analogy.

hiltonls16 🚫

A 20% difference is huge. At a guess OO is counting punctuation followed by space or return as a single word boundary where MSWord is counting each as a word boundary so every sentence gets (at least) an extra word.

A few paragraphs counted in each should enable you to work out which is more accurate, and whether my guess is anywhere close to the reason.

Edit: is MS Word counting the undo buffer?

Switch Blayde 🚫

@StarFleet Carl

I did a quick test on MS Word (on a Mac) and Pages.

word = counts as 1 word on both
word-word = 1 word on Word and 2 words on Pages (hyphen)
word…word = 1 word on Word and 2 words on Pages
word—word = 2 words on both (em-dash)

ETA:

word … word = 3 words on Word but only 2 on Pages (Word counted the standalone ellipsis as a word but Pages didn't)

* * * * = 4 words in Word and 0 in Pages (I use it for scene changes)
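
The same probe strings can be run through your own counting rules to see which rule reproduces which result. This is only a sketch: the two rules below (and their names) are assumptions for illustration, not the actual logic inside Word or Pages.

```python
import re

# The probe strings from the test above.
PROBES = ["word", "word-word", "word…word", "word—word", "word … word", "* * * *"]

def whitespace_words(text):
    # Hypothetical rule A: anything between whitespace is a word.
    return len(text.split())

def letter_run_words(text):
    # Hypothetical rule B: only runs of letters and digits are words,
    # so every punctuation mark acts as a separator.
    return len(re.findall(r"[^\W_]+", text))

for probe in PROBES:
    print(f"{probe!r:16} A={whitespace_words(probe)} B={letter_run_words(probe)}")
```

Rule A gives 1, 1, 1, 1, 3, 4 and rule B gives 1, 2, 2, 2, 2, 0. Rule B happens to line up with the Pages numbers above, while rule A lines up with the Word numbers except for the em-dash case, where Word reported 2. That suggests the real programs use something more involved than either simple rule.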

Replies:   Michael Loucks
Michael Loucks 🚫

@Switch Blayde

I decided to experiment with 'wc'. That's the Unix/Linux command to do a word count. By default it reports lines, words, and bytes (the -m option reports characters instead of bytes). It disagrees with my editor of choice, BBEdit. For example, from my AWLL series:

WC:
637L; 6384W; 35266C

BBEdit:
638L; 6400W; 35223C

The difference appears to be that 'wc' counts as 'a word' anything surrounded by white space (e.g. space, tab, newline, etc), while BBEdit counts punctuation.

The discrepancy appears to arise because, for example, BBEdit counts 'single-file' as two words while 'wc' counts it as one. The line count discrepancy comes because BBEdit strips the trailing newline when it writes the file to disk. The character count difference comes from the handling of certain UTF-8 characters.
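
All three differences can be reproduced on a tiny sample. The sketch below is an assumption-laden illustration: wc_like imitates the usual behaviour of 'wc' (count newline bytes, split on whitespace, count bytes), while editor_like is a made-up stand-in for an editor that counts displayed lines, treats punctuation as a separator, and counts characters. It is not BBEdit's actual algorithm.

```python
import re

def wc_like(data):
    # Lines are newline bytes, words are whitespace-separated runs,
    # and the last figure is the size in bytes.
    return data.count(b"\n"), len(data.split()), len(data)

def editor_like(data):
    # Hypothetical editor rules: displayed lines, runs of letters and
    # digits as words, and decoded characters rather than bytes.
    text = data.decode("utf-8")
    words = re.findall(r"[^\W_]+", text)
    return len(text.splitlines()), len(words), len(text)

# Two lines with no trailing newline; \u00e9 is an accented e.
SAMPLE = "They walked single-file.\nThe caf\u00e9 was closed.".encode("utf-8")

print(wc_like(SAMPLE))      # (1, 7, 46)
print(editor_like(SAMPLE))  # (2, 8, 45)
```

The missing trailing newline costs one line, 'single-file' splits into two words under the second rule, and the accented letter takes two bytes in UTF-8 but counts as one character.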

For completeness:

Pages: 6399W; 34586C
MSWord: 6381W; 34586C

Replies:   Vincent Berg
Vincent Berg 🚫

@Michael Loucks

The character count difference comes from the handling of certain UTF-8 characters.

I think this accounts for a lot of the discrepancy, as special characters, like hyphens and em-dashes, are not considered characters but a collection of random characters, which each program counts differently (ex: does "—" count as a single punctuation character, two, or three separate 'words'?)
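
For what it's worth, the size of these marks depends on what is being counted. In Unicode each one is a single code point, but its stored size varies with the encoding. A quick Python check (the helper name sizes is made up for this example):

```python
def sizes(ch):
    # (code points, bytes in UTF-8, bytes in UTF-16 without a byte-order mark)
    return len(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

MARKS = {
    "hyphen": "-",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "ellipsis": "\u2026",
}

for name, ch in MARKS.items():
    print(name, sizes(ch))
# hyphen (1, 1, 2)
# en dash (1, 3, 2)
# em dash (1, 3, 2)
# ellipsis (1, 3, 2)
```

So a tool that counts bytes in a UTF-8 file sees three "characters" per em-dash, while a tool that counts code points sees one. Whether a given program counts the mark as a word is a separate question that this check does not answer.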

QM 🚫

Seriously, if this is what worries you, life is good :)
