Forum: Author Hangout

Accents in foreign language words

Switch Blayde 🚫

When I type a foreign language word, Word adds the accents. I take them off because I'm afraid the characters will get screwed up when I convert it to epub (docx into Calibre with epub outputted).

Am I correct to do so?

Ernest Bywater 🚫
Updated:

Why not test it by creating an epub with them in it?

I leave accents etc in my stories and they go through OK, but I create the epub from a html file not the word processor file.

edit to add: I just checked my epubs where I have some accents etc. and found even the upside down ? symbol used in a Spanish quote was converted by Calibre properly. So I think they will make it through the conversion process for you.

Ross at Play 🚫
Updated:

May I mischievously suggest your fears are a few decades out-of-date?

I thought one benefit of moving to two-byte characters was that each font may define over 4,000 different characters, allowing vast ranges of letters with whatever accents, in Roman or italics, upper or lowercase, etc.

A scan at this list might set your mind at rest.

Suck it and see is what I suggest.

Lazeez Jiddan (Webmaster)
Updated:

@Ross at Play

I thought one benefit of moving to two-byte characters was that each font may define over 4,000 different characters

UTF-8 is now up to 4 bytes per character.

In WLPC's engine's database I use an encoding called UTF-8mb4 which allows up to 4 bytes for each character. So it's practically unlimited now at over 2 million possible characters.

Replies:   Ross at Play
Ross at Play 🚫
Updated:

@Lazeez Jiddan (Webmaster)

In WLPC's engine's database I use an encoding called UTF-8mb4

Thanks. But that means nothing to me.

As I understand it, if I'm working with any of Word, OpenOffice, or LibreOffice on a file in either doc, docx, or odt format, the characters I see will be what comes out if I save as either txt or html format and then post using the SOL wizard.

Is that correct? That's all I want to be sure of.

Switch may have other questions related to saving epub files which may later be viewed using various software packages.

Lazeez Jiddan (Webmaster)

@Ross at Play

As I understand it, if I'm working with any of Word, OpenOffice, or LibreOffice on a file in either doc, docx, or odt format, the characters I see will be what comes out if I save as either txt or html format and then post using the SOL wizard.

Is that correct? That's all I want to be sure of.

Yes. Whatever you can see in the exported file will be in the version displayed on SOL.

Dominions Son 🚫

@Ross at Play

I thought one benefit of moving to two-byte characters was that each font may define over 4,000 different characters

A 2 byte character set like Unicode, can support 2^16 or 65,536 unique characters.

Lazeez Jiddan (Webmaster)

@Dominions Son

A 2 byte character set like Unicode, can support 2^16 or 65,536 unique characters.

Not really.

I don't know why, but they don't use all the bits from each byte.

UTF8 when using 4 bytes (yes, it's up to 4 bytes) counts as 21 bit. So 2^21 = 2,097,152 possible characters.
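
A minimal Python sketch of where the 21 bits come from, assuming the standard UTF-8 byte layout: every byte spends a few leading bits on a marker pattern, and only the remaining bits carry the character's code point. The function name `payload_bits` is made up for this illustration.

```python
# UTF-8 lead bytes start with 0, 110, 1110 or 11110; continuation bytes start with 10.
# Only the bits after those markers carry the code point.
def payload_bits(num_bytes: int) -> int:
    """Number of code-point bits a UTF-8 sequence of the given length can carry."""
    if num_bytes == 1:
        return 7                                 # 0xxxxxxx
    lead_bits = 7 - num_bytes                    # e.g. 11110xxx leaves 3 bits
    return lead_bits + 6 * (num_bytes - 1)       # each 10xxxxxx byte adds 6 bits

for ch in ["e", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch}: {len(encoded)} byte(s), "
          f"room for {payload_bits(len(encoded))} bits")
```

Unicode itself stops at code point U+10FFFF, so not every one of the 2^21 values is actually available for characters.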

Replies:   Dominions Son
Dominions Son 🚫

@Lazeez Jiddan (Webmaster)

Not really.

I don't know why, but they don't use all the bits from each byte.

The original ASCII character set was only 7 bits. The 8th bit was used for a parity check. ASCII was eventually extended to use the full 8 bits.

UTF8 when using 4 bytes (yes, it's up to 4 bytes) counts as 21 bit.

That seems very odd. 4 bytes would be 32 bits and 21 bits wouldn't even be a full 3 bytes.

One thing though is it does leave room for growth later on. Besides, I doubt they could have come up with a full set of over 4 billion characters (the full limit for a 4-byte character).

Replies:   helmut_meukel
helmut_meukel 🚫

@Dominions Son

UTF8 when using 4 bytes (yes, it's up to 4 bytes) counts as 21 bit.

That seems very odd. 4 bytes would be 32 bits and 21 bits wouldn't even be a full 3 bytes.

Not really. Modern hardware can't handle Byte and Word (=double byte) data very well. It's faster to pad the unused space and store a one byte data item in a 32bit dword.

I had written a small test program in the early PC days.
It performed the same operations in loops for byte, short integer and long integer values. The data values ranged from 0 to 255 for all three. On an old PC with an 8088 processor the byte data type was the fastest, very closely followed by short integer. With more modern hardware, OS and programming language, long integer was fastest!

Last time I used this small timing benchmark program was about 20 years ago. I just checked and it's no longer in my Tests and Benchmark folders.

HM.

Ross at Play 🚫

@Dominions Son

A 2 byte character set like Unicode, can support 2^16 or 65,536 unique characters.

Okay. So who am I to suggest Switch was several decades out-of-date? I was thinking of the first mainframe system I worked on, about 1980, with a 6-bit architecture and only 64 characters to play with. :(

Keet 🚫

The Chinese and some other countries would be left way behind if word processors, the web, ebooks, etc couldn't handle 2-byte characters. Compared to that accented letters are a breeze ;)

helmut_meukel 🚫

@Switch Blayde

Am I correct to do so?

No.
There may be a problem writing those characters in the first place.
Try to type those words or names in your favourite word processor:
garçon, cœur (Sacré-Cœur), peut-être; [French]
Århus, øre, Øresund, Storebælt; [Danish]
Droste-Hülshoff-Straße, Äussere Bayreuther Straße, Töpen; [German]
then the Spanish 'ñ', the Slavic 'č' and others, not to forget the Icelandic 'ð' and 'þ'.
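
For anyone curious how those letters are stored, here is a small Python sketch. It lists the Unicode code point and name of a sample of the characters above and checks whether each also fits in the older single-byte Windows-1252 encoding. The helper name `fits_cp1252` is invented for this example.

```python
import unicodedata

# A sample of the accented and special letters from the words above.
samples = "çœÅøæüßÄöñčðþ"

def fits_cp1252(ch: str) -> bool:
    """True if the character can be stored in the single-byte Windows-1252 encoding."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False

for ch in samples:
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)} "
          f"utf8_bytes={len(ch.encode('utf-8'))} cp1252={fits_cp1252(ch)}")
```

All of them can be written in UTF-8. The Slavic 'č' is one that has no slot in Windows-1252.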

Be careful when selecting the font for your text, there are many free fancy TrueType fonts out there where the (american) font creator didn't bother to provide all characters.

HM.

Switch Blayde 🚫

I was concerned with that UTF-thing since Word doesn't use UTF-8. I thought I remembered characters coming out as boxes and other such stuff on SOL when the character set was not UTF-8.

As you can tell, I really don't understand UTF. btw, I'm even talking about words like "fiancé"

So I can keep the accents Word puts in and it will ultimately show up correctly on an e-reader (docx to Calibre (epub) to KDP (mobi))?

Replies:   Keet
Keet 🚫

@Switch Blayde

I was concerned with that UTF-thing since Word doesn't use UTF-8. I thought I remembered characters coming out as boxes and other such stuff on SOL when the character set was not UTF-8.

As you can tell, I really don't understand UTF. btw, I'm even talking about words like "fiancé"

UTF8 is the standard for web pages, but Microsoft implemented the Windows 1252 encoding before that standard was set and has continued to keep it that way. If I remember correctly the resulting webpage size is a little smaller with 1252 because it uses fewer bits, but it is not guaranteed to work on every machine, whereas UTF8 is. Since most webservers are Linux servers UTF8 will never give you problems. You can simply convert flat text from windows 1252 to UTF8 with Notepad (save as, select UTF8, OK). I haven't used Windows in many years so I'm not sure, but I bet you can set Word to use the UTF8 encoding.
There's an article on Wikipedia about character encoding. It's a bit outdated but it's sufficient to make you understand what it is: Character Encoding.
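
A short Python sketch of the difference, and of what a "save as UTF8" step amounts to. It assumes the input really is Windows 1252 (`cp1252` in Python's codec names). The function name `cp1252_to_utf8` is hypothetical.

```python
text = "fiancé"

cp1252_bytes = text.encode("cp1252")   # one byte per character: 6 bytes
utf8_bytes = text.encode("utf-8")      # é needs two bytes here: 7 bytes in total
print(len(cp1252_bytes), len(utf8_bytes))

# Reading bytes with the wrong encoding is what garbles accents.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8: decode with the old encoding, encode with the new."""
    return data.decode("cp1252").encode("utf-8")

print(cp1252_to_utf8(cp1252_bytes) == utf8_bytes)
```

The conversion itself is harmless as long as the source encoding is known. Trouble starts when a file is decoded with an encoding other than the one it was written in.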

Replies:   Switch Blayde
Switch Blayde 🚫

@Keet

Microsoft implemented the Windows 1252 encoding

Duh. I'm on a Mac now, not Windows.
Does Word on a Mac use UTF-8?

Replies:   Keet  Vincent Berg
Keet 🚫
Updated:

@Switch Blayde

Does Word on a Mac use UTF-8?

I have no idea (never used a Mac) but like with Windows I bet you can set it to use UTF8. The same goes for both Windows and Mac: export to a html page and see which encoding is set in the header. Good chance that it's already UTF8 on Mac. If not, try to set Word to use UTF8 or convert the html to UTF8. That last one can be a bit tricky unless you are sure it's windows 1252. Check the header. For Mac and Linux you can use the iconv utility.

ETA:
Office - Choose text encoding when you open and save files
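
"Check the header" can be sketched in a few lines of Python. This is a rough illustration that only looks for a `charset=` declaration near the top of an exported HTML file. The function name `declared_charset` and the sample markup are assumptions, not anything Word is guaranteed to produce.

```python
import re
from typing import Optional

def declared_charset(html_bytes: bytes) -> Optional[str]:
    """Return the charset named in a <meta> tag near the top of an HTML file, if any."""
    # The declaration itself is plain ASCII, so a lossy ASCII decode is enough to find it.
    head = html_bytes[:4096].decode("ascii", errors="replace")
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', head, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None

sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=windows-1252"></head><body></body></html>')
print(declared_charset(sample))   # windows-1252
```

If the result is windows-1252, a command along the lines of `iconv -f WINDOWS-1252 -t UTF-8 in.html > out.html` re-encodes the file. The file names here are placeholders, and the charset declaration inside the file would still need to be updated by hand.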

Replies:   Ross at Play
Ross at Play 🚫
Updated:

@Keet

Keet, I could probably cope if you made posts here in Dutch, but not Double Dutch like that. :-)

Replies:   Keet
Keet 🚫

@Ross at Play

Keet, I could probably cope if you made posts here in Dutch, but not Double Dutch like that. :-)

I was in a hurry. I deal with character encoding almost daily so maybe what is obvious to me is not to someone else ;)

Replies:   Ross at Play
Ross at Play 🚫

@Keet

I was in a hurry. I deal with character encoding almost daily so maybe what is obvious to me is not to someone else ;)

Sorry, I couldn't resist. I'd have figured it out if it mattered to me. :-)

Vincent Berg 🚫

@Switch Blayde

Duh. I'm on a Mac now, not Windows.
Does Word on a Mac use UTF-8?

Jumping in without reading the rest of the responses: Word's Windows 1252 encoding works on ALL browsers and on SOL, but not in ebooks. Though it's a pain, you'll need to replace ALL those 'special characters' (including publishing marks) with their html equivalents, since UTF-8 doesn't support it. For me, using old Adobe Dreamweaver software, this is largely a no-brainer, as I simply copy my text from the 'source' to the 'display' window and all the characters are automatically replaced.

I've been doing this for years, so I'm comfortable with what's required.

Replies:   Keet  Switch Blayde
Keet 🚫

@Vincent Berg

you'll need to replace ALL those 'special characters' (including publishing marks) with their html equivalents, since UTF-8 doesn't support it.

Although there's nothing against using html codes for special characters UTF8 does support them. You can for example use the ampersand character from your keyboard instead of the html code "& amp ;" and with a UTF8 encoding it will show the same result. If that doesn't work you are not using the UTF8 encoding.
So the "problem" is getting from a windows 1252 encoding to UTF8 which can be as easy as using "save as" with most editors and selecting the UTF8 encoding before clicking "Ok". What would be better is to configure your writing tools to use UTF8 by default.
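
A small Python sketch of the point about html codes, using the standard `html` module. It shows that an entity and the literal character are two spellings of the same thing. The sample strings are invented for the example.

```python
import html

# Entity form: every special character spelled out as an HTML entity.
entity_form = "fianc&eacute; &amp; gar&ccedil;on&hellip; voil&agrave;"

# html.unescape turns each entity back into the character it stands for.
literal_form = html.unescape(entity_form)
print(literal_form)

# In a UTF-8 file the literal characters are simply stored as bytes.
print(literal_form.encode("utf-8"))

# Markup characters are the exception: they are escaped for syntax reasons, not encoding ones.
print(html.escape("Tom & Jerry <3"))
```

One caveat worth keeping in mind: `&`, `<` and `>` are part of the markup syntax, so XHTML (which epub uses) expects them escaped whatever the encoding. Accented letters and publishing marks have no such restriction in a UTF-8 file.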

Replies:   Vincent Berg
Vincent Berg 🚫

@Keet

So the "problem" is getting from a windows 1252 encoding to UTF8 which can be as easy as using "save as" with most editors and selecting the UTF8 encoding before clicking "Ok". What would be better is to configure your writing tools to use UTF8 by default.

Agreed. But since all current browsers (as of 2010 or so) fully support Windows 1252, it's only an issue with ebooks, since they don't.

Replies:   Keet
Keet 🚫
Updated:

@Vincent Berg

Agreed. But since all current browsers (as of 2010 or so) fully support Windows 1252, it's only an issue with ebooks, since they don't.

That all browsers support 1252 is no excuse to not use the superior UTF8 encoding ;)

That epubs don't support it might be a blessing to get rid of a too widely used non-standard. Microsoft should have replaced it a long time ago because this time Embrace, Extend, Extinguish didn't work ( ͡ᵔ ͜ʖ ͡ᵔ ) (and that's UTF8)

Replies:   Ernest Bywater
Ernest Bywater 🚫

@Keet

That all browsers support 1252 is no excuse to not use the superior UTF8 encoding ;)

True, but that requires people to convert from the 1252 to UTF8 because MS Word creates as 1252, and few know how to convert it properly. However, conversion for HTML is not always the same as conversion to epub, as that will vary depending on how you do the conversion. I found that out the hard way.

Replies:   Keet
Keet 🚫

@Ernest Bywater

True, but that requires people to convert from the 1252 to UTF8 because MS Word creates as 1252, and few know how to convert it properly.

You can configure Word to use UTF8 by default.

Ernest Bywater 🚫

@Keet

You can configure Word to use UTF8 by default.

Which is beyond the knowledge and / or capabilities of 99% of MS Word users.

Replies:   Keet
Keet 🚫

@Ernest Bywater

Which is beyond the knowledge and / or capabilities of 99% of MS Word users.

True, but even though I won't touch anything MS with a ten foot pole, their documentation is excellent and you can find it with a simple search ;)

Ernest Bywater 🚫

@Keet

you can find it with a simple search ;)

Which is also beyond the capabilities of 99% of MS Word users. In general they get taught a few things and that's all they use or care about.

awnlee jawking 🚫

@Keet

their documentation is excellent and you can find it with a simple search ;)

Good thing you added a smiley to show you were being sarcastic.

AJ

Replies:   Keet
Keet 🚫

@awnlee jawking

Good thing you added a smiley to show you were being sarcastic.

Ernest and I often exchange a little banter and that's what the smiley was meant for. Microsoft's documentation is about the only thing I wouldn't be sarcastic about and the only thing from Microsoft I occasionally use. It's more the horrible ability of users to find what they need using a search engine, or in a lot of cases simple laziness.

Replies:   awnlee jawking
awnlee jawking 🚫

@Keet

Microsoft's documentation is about the only thing I wouldn't be sarcastic about

We'll have to disagree on that.

find what they need using a search engine

The good thing about the popularity(?) of Microsoft products is that plenty of other users will also have been stymied by the awfulness of its documentation, so there are plenty of independent tech sites and forums providing solutions. However I draw the line at editing the registry to get a Microsoft product to work.

AJ

Replies:   Keet
Keet 🚫

@awnlee jawking

However I draw the line at editing the registry to get a Microsoft product to work.

I'm glad I left that behind me a long, long time ago. I rarely have any problems with software on Linux, even if it's not from the repositories. I vaguely remember all the troubles with finding license keys because as a student I couldn't afford to buy, the very long installation times with multiple reboots, the crashes, eating 90% of your machine resources, updating every damn software package individually or having every package install its own 'update checker'. Insanity. Looking back at it I don't understand how MS ever got so big. Must be all the marketing because the OS and software certainly didn't do it.
But that's getting off track. The MS online user base for help is relatively small compared to the Linux user base. Of course that's inherent to open source software, so no surprise there. The reason the MS support blogs grew in number and size is that fixes are very slow to come, so workarounds are needed. The documentation is still good, but I think I understand why we differ in opinion: we are talking about different types of documentation. I mainly refer to documentation for developers, and that is very good. You are most likely referring to software/OS documentation. I have zero knowledge about that because I don't use it.
And make no mistake, there's only one reason why Windows is 'popular': it comes with the machine you buy, most users don't even know there is an alternative although that is getting better with the use of mobile phones which are all Linux or BSD based.

Switch Blayde 🚫

@Keet

You can configure Word to use UTF8 by default.

The link you gave said to click on files and then options. When I click on files there is no options.

But I found a site ( https://www.techwalla.com/articles/how-to-change-encoding-in-word ) that says:

One of the most common Unicode-based encodings is called UTF-8, and you'll often find UTF-8-encoded words on the internet and in files like Word documents. By default, recent versions of Word will use a Unicode encoding.

If I'm reading that right, newer versions of Word use UTF-8 by default.

Replies:   Keet
Keet 🚫

@Switch Blayde

If I'm reading that right, newer versions of Word use UTF-8 by default.

I think you read that right. I think that might be the result from bringing Word to an online version.
But that would mean that you have to buy a new copy of Word. I don't think you can buy Word anymore, it's now that rip-off scheme called office365 where you keep on paying until eternity. Unless you are in a company environment that's committed to Microsoft there's no reason to use Word. Switch over to LibreOffice, then select a different theme in LO so you have the same layout as you are used to in Word, even the ribbon bar is available. It seems insane to me that you pay through your nose for something that is available for free. And LibreOffice uses UTF8 by default since as long as I can remember.

Replies:   Switch Blayde
Switch Blayde 🚫

@Keet

I don't think you can buy Word anymore, it's now that rip-off scheme called office365 where you keep on paying until eternity.

I bought Word for my new Mac about a year and a half ago. I didn't want to lease it. MS said if you buy it outright you won't get updates. I didn't care. But I do get updates. One day it asked me to install AutoUpdate and I get updates for all the Office products.

I tried LibreOffice (and OpenOffice before that). I hated it. I tried it both on my old PC and my Mac. A one-time cost of a couple of hundred dollars for Office (only use Word and Excel but got them all) was well worth it.

Replies:   Keet  awnlee jawking
Keet 🚫

@Switch Blayde

I tried LibreOffice (and OpenOffice before that). I hated it. I tried it both on my old PC and my Mac. A one-time cost of a couple of hundred dollars for Office (only use Word and Excel but got them all) was well worth it.

You could have saved those couple of hundred dollars. What did you hate about LO/OO? The user interface? You can set-up LibreOffice to create the same look and feel as Word and you would hardly notice the difference.

Replies:   Switch Blayde
Switch Blayde 🚫

@Keet

What did you hate about LO/OO?

I didn't like the way it looked, but it was so slow loading I didn't even bother working on that. If I remember, all the products are one big program. So if I only want to use the word processor, it loaded everything. I'm constantly opening and closing Word and Excel.

awnlee jawking 🚫
Updated:

@Switch Blayde

Although diminishing in favour of Office 365, I believe some new PCs in the UK, particularly desktops, still come with Office Home and Student bundled. I think updates last for a couple of years.

AJ

Replies:   Keet
Keet 🚫

@awnlee jawking

Although diminishing in favour of Office 365, I believe some new PCs in the UK, particularly desktops, still come with Office Home and Student bundled. I think updates last for a couple of years.

That will probably be the last chance to buy the software. More and more software licenses are changing to subscriptions instead of buying because there is little or nothing left to add to the software that makes it worth buying a new version. Word processors are a prime example of that. Parties like Microsoft and Adobe change to subscriptions so they can keep taking your money without adding to the product. Personally, I think it's a scam. It's one of the reasons I'm a big supporter of open source software where such scams will never work.

Switch Blayde 🚫

@Vincent Berg

Word's Windows 1252 encoding works on ALL browsers and on SOL, but not in ebooks.

I'm using Word, but not Windows (Mac).

What I got from the above discussion is that if I input a docx file (using words with accents) into Calibre, the epub output will have the right coding for ebooks. I won't have to go in and manually fix them. And when I upload that epub to Amazon, it will convert it correctly to mobi.

Replies:   Ernest Bywater
Ernest Bywater 🚫

@Switch Blayde

if I input a docx file (using words with accents) into Calibre, the epub output will have the right coding for ebooks. I won't have to go in and manually fix them. And when I upload that epub to Amazon, it will convert it correctly to mobi.

G'day Switch,

I just created a file in Libre Office with graves and accents then saved it as a docx and converted it to epub via Calibre and they came through perfectly. I then converted the same file to MOBI using Calibre and they came through perfectly. Both methods should give you a good result.

I do wonder why you save as an epub with Calibre and then convert to a MOBI at Amazon instead of converting to MOBI with Calibre and uploading that to Amazon.

Replies:   Switch Blayde
Switch Blayde 🚫
Updated:

@Ernest Bywater

I do wonder why you save as an epub with Calibre and then convert to a MOBI at Amazon

Thanks for taking the time to test it.

I can upload a docx file to Amazon who will convert it to mobi. But I feel more comfortable uploading an epub file (the ToC is already built by Calibre). It's Amazon that converts it to their format (mobi). But if I send the ebook to someone to read, like a Beta reader, epub works better. Mobi is Kindle only (I believe).

Michael Loucks 🚫

@Switch Blayde

When I type a foreign language word, Word adds the accents. I take them off because I'm afraid the characters will get screwed up when I convert it to epub (docx into Calibre with epub outputted).

Am I correct to do so?

I use UTF-8 for everything, and I've had no trouble at all with Japanese, Korean, Chinese, Greek, Cyrillic, Swedish, German, etc, with all the attendant diacritical marks. I use Scrivener to create PDF, docx, mobi, and ePub files and all of the characters come across just fine.

Replies:   Keet
Keet 🚫

@Michael Loucks

I use UTF-8 for everything, and I've had no trouble at all with Japanese, Korean, Chinese, Greek, Cyrillic, Swedish, German, etc, with all the attendant diacritical marks. I use Scrivener to create PDF, docx, mobi, and ePub files and all of the characters come across just fine.

I deal with a lot of conversions for customers. One of the common reasons they need a conversion: ruined diacritics because they mixed up encodings somewhere in their processes. And then it's up to old me to create something to fix it.
The best solution to avoid mix ups in the first place: use UTF8 everywhere.

Replies:   Switch Blayde
Switch Blayde 🚫

@Keet

to avoid mix ups in the first place: use UTF8 everywhere.

If you have control of that. That's why I was asking.

I don't choose the encoding. I guess Word does. Or maybe my Mac OS does. I don't know. Or maybe Calibre chooses UTF8 when it creates the epub. But what if there's an encoding conflict between Word and Calibre?

I just remember I used to get boxes instead of letters on SOL. What I started doing years ago was to use vanilla everything, save my story as txt, and provide that file to the SOL Wizard.

Replies:   Keet  Michael Loucks
Keet 🚫

@Switch Blayde

I don't choose the encoding. I guess Word does. Or maybe my Mac OS does. I don't know.

I know Word on Windows used the 1252 encoding but I don't know what it does on MacOS. Check the Microsoft link I posted and make it export with UTF8 encoding and the rest should go smoothly.

Replies:   Switch Blayde
Switch Blayde 🚫

@Keet

Check the Microsoft link I posted

That gave me the most understandable explanation of how encoding works. It's so straightforward. I guess I was thinking it did more than that. Thanks.

My Word on Mac does not have an "options" under File. I tried "properties" but that didn't give me anything for encoding.

Based on what Ernest said, I'm assuming I can simply input my docx file into Calibre and Calibre will be smart enough to do it right.

As to the SOL Wizard, I don't know. Instead of saving the file as txt and pointing the Wizard at it, I'll try saving it as filtered HTML and pass that file along to the Wizard.

Lazeez Jiddan (Webmaster)

@Switch Blayde

As to the SOL Wizard, I don't know. Instead of saving the file as txt and pointing the Wizard at it, I'll try saving it as filtered HTML and pass that file along to the Wizard.

Yes, filtered html should work perfectly with regards to character encoding. I think you can select which encoding you want while doing the export operation.

Replies:   Switch Blayde
Switch Blayde 🚫

@Lazeez Jiddan (Webmaster)

Yes, filtered html should work perfectly with regards to character encoding.

Ok, I'll try that next time.

Will it work for ellipses and em-dashes?

Because I used to first save as txt I didn't use them (I used 3 dots and 2 dashes, respectively). Can I use the characters that come with the font (e.g., … and —).

Lazeez Jiddan (Webmaster)
Updated:

@Switch Blayde

Will it work for ellipses and em-dashes?

Yes, but, I replace ellipses with three dots because I don't like their look, but em-dashes are preserved.

Replies:   Switch Blayde
Switch Blayde 🚫

@Lazeez Jiddan (Webmaster)

I replace ellipses with three dots

That's fine. As long as it's 3 dots and not a box.
Good to know. I've been doing a lot of extra work.

Michael Loucks 🚫

@Switch Blayde

I don't choose the encoding. I guess Word does. Or maybe my Mac OS does. I don't know. Or maybe Calibre chooses UTF8 when it creates the epub. But what if there's an encoding conflict between Word and Calibre?

I'm on a Mac and my process is different (BBEdit --> Scrivener) and I can select the exact type of encoding I want for the text file (always UTF-8). The value of having all my originals in text files is that they import perfectly into whatever other application I might need (e.g. Pages, Word, Scrivener, Calibre, etc). Of course, there is zero formatting, but for SOL I use the simple text markup (i.e. with underscores and asterisks), so that's not an issue.

Ernest Bywater 🚫

As Switch said at the start, his concern is the conversion from Word on Mac to epub via Calibre.

I use Calibre 3.34 and I just converted an ODT file to Epub with it and checked the Spanish question I had in it. The accent, grave and upside down question mark converted perfectly. So I'd say Calibre will convert the characters for him, unless the Word document is using some odd characters. The best thing is to do the conversion and check it to see if it works. That only takes a few minutes.
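
The "convert and check" step can also be done by script. The Python sketch below relies on the fact that an epub is a zip archive of (X)HTML files, and it assumes those files are UTF-8 encoded. The function name `non_ascii_chars` and the file name in the usage comment are hypothetical.

```python
import zipfile
from typing import Dict, Set

def non_ascii_chars(epub_path: str) -> Dict[str, Set[str]]:
    """Map each (X)HTML file inside an epub to the non-ASCII characters it contains."""
    found: Dict[str, Set[str]] = {}
    with zipfile.ZipFile(epub_path) as book:
        for name in book.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                # Bytes that are not valid UTF-8 show up as U+FFFD in the result.
                text = book.read(name).decode("utf-8", errors="replace")
                chars = {c for c in text if ord(c) > 127}
                if chars:
                    found[name] = chars
    return found

# Usage (hypothetical file name):
# print(non_ascii_chars("my_story.epub"))
```

If the accented letters typed in the word processor appear in the result, they survived the conversion. A U+FFFD replacement character in the result would suggest the bytes were not valid UTF-8. Characters written as html codes in the file will not be listed, since they are plain ASCII on disk.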

Replies:   Switch Blayde  Keet
Switch Blayde 🚫
Updated:

@Ernest Bywater

So I'd say Calibre will convert the characters for him

Thanks.

This is how it ended up:

Llévalo a la casa de la alberca

Keet 🚫
Updated:

@Ernest Bywater

As Switch said at the start, his concern is the conversion from Word on Mac to epub via Calibre.

I posted a Microsoft link that explains how to make Word read and write in UTF8. In Calibre you can set the encoding but I don't think you need to do that if you give Calibre html files as input since the encoding is listed in the header of the html file.

Calibre FAQ: How do I convert my file containing non-English characters, or smart quotes?.

The "square boxes" is not an encoding problem but rather a font problem. In the case of the square boxes the used font doesn't have a character assigned to the code for the character. I see them a lot because I block everything that is Google and it seems a lot of lazy programmers can't live without the Google fonts. So either the font doesn't fully support all characters or you can't get to the source of the font like me when I block it.

In short: If you export your Word files to UTF8 encoded html files you probably have nothing else to do in the rest of the process.

Epub files package the font along with the rest of the files so there shouldn't be a problem with square boxes. (That's why you can't distribute an epub with a font you don't have a distribution license for.)

typo: can > can't
