Forum: Bug Report and Feature Requests

Word count algorithms?

Quasirandom 🚫

Not a bug notice so much as a question. I've noticed the word count on a story listed in my Library matches the one when you Show Details on the story itself but not what's given in the front of the downloaded ePub. (For example, on Zalezac's Lawyers in Love, the numbers are 7,518 and ≈8400.)

What are the actual calculations being used, and how were they chosen?

Replies: madnige Ernest Bywater Lazeez Jiddan (Webmaster) Ernest Bywater

madnige 🚫

@Quasirandom

The counts listed in the story descriptions were introduced recently and, AFAIK, are actual counts generated by scanning the story (I remember Lazeez posting here about a refinement to the algorithm to be more accurate when dealing with non-Latin UTF-8 characters). The ones in (at least older) epubs are likely just estimates based on the character count. That estimate quite possibly comes from the rounded-to-1k size shown in the story descriptions: the first story I looked at had the exact same word count as you gave, and that word count was the reported size in KB times 200. I also recall another forum post where Lazeez said that he would not be updating something in the epubs (quite possibly the word count) as it would require a rebuild of all the epubs on the site.

ETA: Note that the character count algorithm has also changed over time to give more accurate counts in the face of non-Latin characters (which are encoded as multiple bytes, at least in UTF-8). The comments on the calculations are all just my supposition and deduction, except for the forum-post bits, so if it's important you should wait for the official word from Lazeez. Also, the character size in KB is rounded, not truncated, based on my own counts of a few actual stories: poems sized at 0KB are actually < ~500 chars, and a ~700 char poem gets reported as 1KB.
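
To make the supposed estimate concrete, here is a minimal Python sketch. The 200-words-per-KB factor and the round-not-truncate behaviour come from the observations above; the function names and the choice of 1024 characters per KB are assumptions for illustration only, not the site's actual code.

```python
def size_in_kb(char_count: int, chars_per_kb: int = 1024) -> int:
    """Round (not truncate) a character count to whole KB.

    chars_per_kb=1024 is an assumption; 1000 fits the observations too.
    """
    return (char_count + chars_per_kb // 2) // chars_per_kb


def estimated_words(size_kb: int, words_per_kb: int = 200) -> int:
    """Rough word estimate: reported size in KB times 200."""
    return size_kb * words_per_kb


print(size_in_kb(400))      # short poem, reported as 0KB
print(size_in_kb(700))      # ~700 char poem, reported as 1KB
print(estimated_words(42))  # a 42KB story gives 8400
```

Under these assumptions a story reported as 42KB would yield 8,400, which lines up with the ≈8400 figure quoted in the question.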

Replies: Keet

Keet 🚫

@madnige

The algorithm I use myself is very simple and surprisingly accurate, even with html if only the html-body is counted: count the number of spaces (ignoring consecutive spaces). It's also very fast and no problems with non-latin chars.

Replies: madnige Switch Blayde BlacKnight

madnige 🚫

@Keet

Very elegant! Could the HTML-body count be made more accurate by subtracting the number of < characters?

Replies: Keet

Keet 🚫

@madnige

Very elegant! Could the HTML-body count be made more accurate by subtracting the number of < characters?

No need for that, you want a word count. There should be no space between the tags and content. So a tag and a word is still one word. As I said, very simple and surprisingly accurate. For html you just have to extract the body or, if all headers and footers for all chapter files are the same, you can count the words in those and subtract them from the total. That's what I actually do when I convert downloaded zips for my own library because I know the exact size of my new headers and footers and I don't have to extract the body part for a pretty exact word count.
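
The subtract-the-known-header-and-footer idea can be sketched in Python as follows. This is only an illustration of the approach described above, not Keet's actual code; the function names and the sample header and footer strings are hypothetical. It assumes whitespace separates the header, body, and footer, so that no token straddles a boundary.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens; tags glued to a word stay part of that word."""
    return len(text.split())


def body_word_count(page: str, header: str, footer: str) -> int:
    """Word count of the body, given a header and footer that are identical on every page."""
    return count_words(page) - count_words(header) - count_words(footer)


# Hypothetical fixed header and footer shared by all chapter files.
header = "<html><body><h1>My Library</h1>\n"
footer = "\n<p>The End</p></body></html>"
page = header + "<p>Let me go.</p>" + footer

print(body_word_count(page, header, footer))  # 3
```

Because the header and footer counts are constant, they only need to be computed once per batch of files.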

Replies: madnige

madnige 🚫

@Keet

There should be no space between the tags and content. So a tag and a word is still one word.

Could well be true of SOL content (which will be consistent as Lazeez seems to use a homebrew engine to generate the webpages), but I thought that in the general case there could be whitespace on both sides of the HTML tags (and, for some tags, whitespace inside the tag, although subtracting the < count wouldn't account for these).

Replies: Keet

Keet 🚫

@madnige

but I thought that in the general case there could be whitespace on both sides of the HTML tags

Usually not, there's no requirement or need for it and most browser engines don't even display such a space. There are tags with a class name that do have a space inside them, but there are very few of those in SOL html files. I solved that by first replacing those instances with a version without the space before counting. You could go even further by ignoring hr tags and catching 'creative' lines like * * * * that are often used as a scene break. Which method you use, and how accurate you want it to be, depends on your use case.
If you use an html utility like HtmlAgilityPack or something similar you can get the plain text of the html body and count without any tags.
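
As a rough illustration of the plain-text approach, here is a Python sketch that swaps in the standard library's `html.parser` for HtmlAgilityPack (which is a .NET library). The class and function names are hypothetical, and treating a text chunk made only of asterisks and whitespace as a scene break is an assumption about what should be ignored.

```python
import re
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect only the text found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


# Assumption: a chunk such as "***" or "* * * *" is a scene break, not words.
SCENE_BREAK = re.compile(r"\*[\s*]*")


def count_body_words(html: str) -> int:
    parser = BodyText()
    parser.feed(html)
    parser.close()
    # Replace scene-break chunks with a space; tags such as <hr> carry no text at all.
    kept = [" " if SCENE_BREAK.fullmatch(c.strip()) else c for c in parser.chunks]
    return len("".join(kept).split())


sample = (
    "<html><head><title>Some Title</title></head><body>"
    "<p>Let me go.</p>\n<hr>\n<p>* * * *</p>\n<p><i>Stay</i>.</p>"
    "</body></html>"
)
print(count_body_words(sample))  # 4
```

Joining the chunks without a separator keeps a word that is split by inline tags in one piece, at the cost of merging words from adjacent block tags that have no whitespace between them.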

Switch Blayde 🚫

@Keet

count the number of spaces

That means every paragraph has one more word than counted.

The paragraph "Let me go." is 3 words with 2 spaces so a word is dropped from your count in that short paragraph of 3 words since you're only counting the 2 spaces. The same number of dropped words (1) would be dropped in a paragraph that is 50 words or 100 words or 500 words.

So if a story is written with short paragraphs, the word count would be off more than if the story has long paragraphs.

Replies: Keet

Keet 🚫

@Switch Blayde

That means every paragraph has one more word than counted.

That depends on how you 'count' spaces. I said count, but it's actually a split. Your three word sentence split on space is a three element array, each element a word. The length of the array (3) is the correct count.
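
The difference between literally counting spaces and splitting on them can be shown with a short Python sketch. The function names are hypothetical; dropping empty pieces is how this sketch ignores consecutive spaces.

```python
import re


def count_space_runs(text: str) -> int:
    """Literal reading: count runs of one or more spaces."""
    return len(re.findall(r" +", text.strip()))


def count_by_split(text: str) -> int:
    """Split on the space and count the non-empty pieces."""
    return len([piece for piece in text.split(" ") if piece])


print(count_space_runs("Let me go."))  # 2, one short per paragraph
print(count_by_split("Let me go."))    # 3
print(count_by_split("Let  me go."))   # 3, the double space is ignored
```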

Replies: Switch Blayde

Switch Blayde 🚫

@Keet

The length of the array (3) is the correct count.

That makes more sense.

BlacKnight 🚫

@Keet

I write directly in HTML in a text editor, and use the GNU wc for word counting. wc just counts runs of non-whitespace characters separated by whitespace. I have a wrapper script for wc that strips HTML tags... I'd post it, but the forum handles angle brackets badly, and I'm pretty sure it'd be impossible to get a working version of it through.
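
For readers who want something similar, here is a minimal Python sketch of the same idea. It is not BlacKnight's wrapper script: the regular expression, the function name, and the choice to replace each tag with a space are all assumptions. The counting step mimics the description of wc above by counting runs of non-whitespace characters.

```python
import re

# Naive tag pattern; assumes no ">" appears inside attribute values.
TAG = re.compile(r"<[^>]*>")


def wc_words(html: str) -> int:
    """Strip tags, then count runs of non-whitespace separated by whitespace."""
    text = TAG.sub(" ", html)
    return len(text.split())


sample = "<p>I use spaces \u2014 mainly so it wraps.</p>\n<hr>"
print(wc_words(sample))  # 8: a spaced em-dash is a run of non-whitespace, so it counts
```

Replacing a tag with a space splits a word that contains inline markup; replacing it with nothing would instead merge words across adjacent tags.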

Replies: Switch Blayde Keet

Switch Blayde 🚫

@BlacKnight

wc just counts runs of non-whitespace characters separated by whitespace.

I use Word to count the words in my docx (novel source).

I use em-dashes sometimes to offset something in a sentence. And my style is not to put a space on either side of it. So the following is a sentence from my WIP:

Boyd looked up at the sign—Brownwood Bar.

That's 8 words. But if the wc counter you're using requires spaces like you describe, it would only be 7 words. It would treat the em-dash as a hyphen and count "sign—Brownwood" as one word when it's two.

Now with hyphenated words, like "mid-calf", Word counts it as one word because it's hyphenated.

But Word counts *** (scene break) as a word even though it's not. I guess Word would have to be really smart to know that. And if my scene breaks were * * * *, Word would count it as 4 words.

I'm actually impressed that Word understands the difference between an em-dash and a hyphen. By the way, it also recognizes that an en-dash is not a hyphen and is separating two words, not hyphenating two parts of one word.

Replies: Keet BlacKnight

Keet 🚫

@Switch Blayde

I'm actually impressed that Word understands the difference between an em-dash and a hyphen.

That's not surprising as they are two different characters, just like the space is just another character. It all depends on which characters are used to determine whether a string should break into two words or not. With Word (or any other word processor) you have little or no control over which characters determine a word split, but if you create your own function your options are virtually unlimited.
In C# I use the String.Split() function to get an array of words from a line. That function accepts a char[] array as the split parameter, which means I can feed it any number of chars that will be used to split the line into words. After the split the length of the array is the number of words in the line.
Does Word count the emdash as a word if it's in the text as space-emdash-space?
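
A Python analogue of splitting on a caller-supplied set of characters might look like the sketch below. It is not Keet's C# code; `split_words` is a hypothetical helper, and unlike a plain split it drops empty entries. The test sentence is the one from Switch Blayde's post.

```python
def split_words(line: str, separators: str) -> list:
    """Split a line into words on any of the given separator characters."""
    # Map every separator to a plain space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in separators})
    return line.translate(table).split()


sentence = "Boyd looked up at the sign\u2014Brownwood Bar."

print(len(split_words(sentence, " ")))         # 7: "sign-em-dash-Brownwood" stays one token
print(len(split_words(sentence, " \u2014")))   # 8: the em-dash now separates two words
print(len(split_words("mid-calf", " \u2014")))  # 1: the hyphen is not a separator
```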

Replies: Switch Blayde

Switch Blayde 🚫

@Keet

Does Word count the emdash as a word if it's in the text as space-emdash-space?

No

Replies: Keet

Keet 🚫

@Switch Blayde

Does Word count the emdash as a word if it's in the text as space-emdash-space?

No

That surprises me, they did something right! :D

Replies: Switch Blayde

Switch Blayde 🚫

@Keet

they did something right!

The — boy (2 words)
The—boy (2 words)
however
The … boy (3 words)
The…boy (1 word)

It did something right with dashes.

Replies: Keet

Keet 🚫

@Switch Blayde

The — boy (2 words)
The—boy (2 words)
however
The … boy (3 words)
The…boy (1 word)

It did something right with dashes.

It depends on what you expect. I agree that the … is not a word but Word apparently sees it as a normal character, not even a split character, considering that it sees 'The…boy' as 1 word.
There's a big set of characters that could be considered to split a string into two words without being seen as a word themselves. There are even multiple types of spaces (space, emspace, enspace, emspace13, emspace14, numspace, thinspace, puncspace, nbspace). If you want it counted in a specific way you will have to create your own counter function.
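
One possible custom counter is sketched below in Python. The mapping from the space names above to Unicode code points is an interpretation (the names resemble HTML entity names), and treating the em-dash, en-dash, and ellipsis as separators is a choice made for this example, not how Word behaves.

```python
import re

# Assumed mapping of the space names to Unicode code points.
SPACES = {
    "space": "\u0020",
    "emspace": "\u2003",
    "enspace": "\u2002",
    "emspace13": "\u2004",  # three-per-em space
    "emspace14": "\u2005",  # four-per-em space
    "numspace": "\u2007",   # figure space
    "thinspace": "\u2009",
    "puncspace": "\u2008",
    "nbspace": "\u00a0",
}
PUNCTUATION = "\u2014\u2013\u2026"  # em-dash, en-dash, ellipsis

# One or more separator characters (or any other whitespace) in a row.
SPLIT = re.compile("[" + re.escape("".join(SPACES.values()) + PUNCTUATION) + r"\s]+")


def count_words(text: str) -> int:
    """Count the non-empty pieces left after splitting on the separators."""
    return len([piece for piece in SPLIT.split(text) if piece])


for sample in ("The \u2014 boy", "The\u2014boy", "The \u2026 boy", "The\u2026boy"):
    print(count_words(sample))  # 2 each time
```

With this separator set all four of the dash and ellipsis test strings come out as 2 words, where Word reports 2, 2, 3, and 1.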

Replies: Dominions Son

Dominions Son 🚫

@Keet

puncspace

Does it wear baggy jeans and ripped t-shirts?

Replies: Keet

Keet 🚫

@Dominions Son

Does it wear baggy jeans and ripped t-shirts?

Yep, and the t-shirt has the text:
SPACEing
is out of
this world!
;)

BlacKnight 🚫

@Switch Blayde

I normally use spaces around em-dashes — mainly so the text stream wraps better in the editor — so they do get counted as words. On the other hand, I use the HR (horizontal rule) HTML tag for section breaks, so those get stripped by the wrapper script before the text reaches wc, and they don't get counted.

Back in the day when NaNoWriMo had a functional website, their official confirmation page used wc, or something that functioned the same way, so the official word count always agreed with mine, and there were no last-minute "sorry, you're a thousand words shy of your 50k because we're not counting your em-dashes" surprises.

Keet 🚫

@BlacKnight

I write directly in HTML in a text editor, and use the GNU wc for word counting. wc just counts runs of non-whitespace characters separated by whitespace. I have a wrapper script for wc that strips HTML tags... I'd post it, but the forum handles angle brackets badly, and I'm pretty sure it'd be impossible to get a working version of it through.

If you're on Linux there are many commands like wc that can help with all those little specific tasks: wc, cat, tail, cut, and many more, like the editor sed for text manipulations. I don't use wc because getting the word count is part of a larger process that works on text loaded in memory. But for external use it's very useful.

Ernest Bywater 🚫

@Quasirandom

The count I provide in my summaries and to Bookapy are from the word processor count. It will vary from the SoL count as the WP includes the contents page and a few other things that are stripped out for SoL. Also, the counts are often from when the story was first posted and later revisions that have added words rarely result in me remembering to amend the count in the story brief I have.

Lazeez Jiddan (Webmaster)

@Quasirandom

What are the actual calculations being used, and how were they chosen?

The word count function strips all html tags from the story's text, excludes the chapter header, author header and story title, and counts what is considered words (dashes and mdashes etc. aren't counted).

So it's as accurate as possible at counting nothing but the story's text.
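
For illustration only, a counter along these lines could be sketched in Python as below. This is not the site's code: the rule that a token counts as a word only if it contains at least one letter or digit is an assumption standing in for "what is considered words", and the exclusion of headers and title is left out.

```python
import re

# Naive tag pattern; assumes no ">" appears inside attribute values.
TAG = re.compile(r"<[^>]*>")


def count_story_words(html: str) -> int:
    """Strip tags, then count tokens that contain at least one letter or digit."""
    text = TAG.sub(" ", html)
    return sum(
        1
        for token in text.split()
        if any(ch.isalnum() for ch in token)  # a lone dash has none, so it is skipped
    )


print(count_story_words("<p>Boyd looked up at the sign \u2014 Brownwood Bar.</p>"))  # 8
print(count_story_words("<p>* * * *</p>"))  # 0
```

A side effect of this particular rule is that asterisk scene breaks are not counted either.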

An EPUB reader includes the inserted description, tags and every other word in the built EPUB including title, author name, copyright info, chapter headers, footers, etc...

Replies: Quasirandom

Quasirandom 🚫

@Lazeez Jiddan (Webmaster)

Well, yes, a reader's word count will be different. But I meant the word count given as text in the downloaded ePub, which is different from what's listed in the story info on the site webpage.

Replies: Lazeez Jiddan (Webmaster)

Lazeez Jiddan (Webmaster)

@Quasirandom

Well, yes, a reader's word count will be different. But I meant the word count given as text in the downloaded ePub, which is different from what's listed in the story info on the site webpage.

Oh, I didn't realize that I had that in there. Originally I did a rough calculation from the KB size to words and put it in the EPUB. I didn't remember that when I added the word count display to the site.

I just updated the EPUB creator to use the same number now.

So, basically it was a bug and I fixed it.

Replies: Quasirandom

Quasirandom 🚫

@Lazeez Jiddan (Webmaster)

Oh. Okay. Um. Yay! Thank you.

Ernest Bywater 🚫

@Quasirandom

The hardest aspect of word counting is teaching the words the correct numbering system in the first place.
