I just put some stories I am planning to merge into MS Word.
They came out with fewer KB than the individual stories totaled in WordPad.
This is a new version for me.
Is compression real?
Is it fairly new?
Did I just not notice previously?
Did you save in the docx format? If so, then it's compressed: docx is an archive (a zip file).
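If you want to see the compression for yourself, here is a minimal sketch using Python's standard zipfile module. It adds up the stored (compressed) and unpacked sizes of everything inside a zip-based file. The function name archive_sizes and the file name in the note below are just my own illustration, not anything Word provides.

```python
import zipfile

def archive_sizes(path):
    """Return (stored_bytes, unpacked_bytes) summed over all members of a zip archive."""
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        # compress_size is what each member occupies inside the archive,
        # file_size is what it occupies once unpacked.
        stored = sum(info.compress_size for info in infos)
        unpacked = sum(info.file_size for info in infos)
    return stored, unpacked
```

Calling archive_sizes("merged.docx") on a hypothetical merged document would give both totals. For a text-heavy docx I would expect the stored total to be noticeably smaller, but that depends on the content.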
I don't know if it's still the case, but the default save setting in MS Word used to be Fast Save, which saved every keystroke, so when you saved, it added the latest changes to the end of the file. When you disabled Fast Save, it saved only the latest version of the text. With a document that had a lot of changes, this meant the file size decreased a lot when you saved it. Also, when you copied the text of one file into another one, it only copied across the latest version, not all of the earlier changes, and this meant the copied text took fewer bytes in the new file than in the file it came from.
Saving a document without the change history does decrease the file size, so you're correct on that one. But Uther didn't convert Word->Word, he mentioned WordPad->Word. Other than the reason I mentioned (zip file format), he probably used the RTF format, which can be very inefficient although it doesn't keep a change history. Epub tools like Calibre can do the same: every line gets loads of extra tagging repeated again and again, making it very inefficient in terms of size. RTF can do the same if you're not very careful. Try opening an RTF file with Notepad, it's just a flat text file. It will show loads of tags, so many that you often can hardly find the real text. So just opening the RTF file in Word and saving it as docx gets rid of a lot of duplicate tags, thus decreasing the overall file size even with the extra overhead Word adds. Combine that with a zipped file format like docx and you're sure to get a smaller file.
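To put a rough number on how much of an RTF file is tags, something like the sketch below could be used. It is only an estimate under simplifying assumptions: it strips control words and group braces with two regular expressions and treats everything left over as text, so font tables, escaped characters and embedded pictures are not handled properly. The function name rtf_markup_share and the sample string are my own.

```python
import re

def rtf_markup_share(rtf):
    """Return (leftover_text, markup_share) for a simple RTF string (rough estimate)."""
    # Drop control words such as \rtf1, \b or \fs24, plus the single space that may end them.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", rtf)
    # Drop the braces that open and close groups.
    text = re.sub(r"[{}]", "", text)
    share = 1 - len(text) / len(rtf) if rtf else 0.0
    return text, share

sample = r"{\rtf1\ansi{\b Hello} world}"
text, share = rtf_markup_share(sample)
print(repr(text), f"{share:.0%} of the characters are markup")
```

On a real file you would read the contents first and pass them in as a string. Treat the result as a ballpark figure only.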
Don't forget that in addition to the text, Word saves all of the document's formatting information. In a file that is heavily formatted, that can be a significant amount of data.
As an experiment, I opened a 64K word file and saved the text content without the file formatting information. The result was a 54K file.
Don't forget that in addition to the text, Word saves all of the document's formatting information. In a file that is heavily formatted, that can be a significant amount of data.
Word's formatting sure is a significant amount of data but not very much compared to what an RTF file does. With an RTF file you don't have to do much to make the formatting data bigger than the actual text itself. And that continues for the entire document. With Word, the larger the document gets, the smaller the percentage of formatting data is. If you eliminate the revision data that Ernest mentioned, it will also decrease the file size.
On the other hand nowadays the file size is not really something you should worry about. Most current formats are of the archive type anyway (docx, odt, etc) so they can hold a lot of formatting data without really causing an increase in file size.
Since I don't work with RTF files, I don't know that much about them.
Your remark about the percentage of the file size becoming smaller due to an increase in file size is not accurate. The larger the file size, the greater the amount of formatting information that must be retained. The percentages are probably very close for different file sizes that are formatted similarly.
ETA: I just did a similar test of a 147K file, and got close to a 25% reduction in size.
Your remark about the percentage of the file size becoming smaller due to an increase in file size is not accurate. The larger the file size, the greater the amount of formatting information that must be retained.
It depends. Yes, the amount of formatting data will increase for a growing document, but the initial basic template data remains the same and does not increase with a growing document, resulting in a smaller overall percentage for the formatting data. If, for example, the font doesn't change, then more text will not add more formatting for a font. In short: the size of an empty document is the part that stays the same regardless of the amount of text. If the amount of text grows, that initial size becomes a smaller share of the total size. Of course, any additional formatting will increase the total size of the formatting.
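Both points can be put into a small arithmetic sketch. It assumes a file is made of a fixed part (the empty-document overhead), the text itself, and formatting that grows in proportion to the text. The numbers used (10 KB fixed, formatting equal to 20% of the text) are invented purely for illustration and are not measurements of any real Word file.

```python
def formatting_shares(fixed_kb, text_kb, ratio):
    """Return (fixed_share, scaled_share) of the total file size.

    fixed_kb: overhead that does not grow with the text (assumed value)
    text_kb:  size of the text itself
    ratio:    formatting that grows with the text, as a fraction of text size (assumed value)
    """
    scaled_kb = text_kb * ratio
    total_kb = fixed_kb + text_kb + scaled_kb
    return fixed_kb / total_kb, scaled_kb / total_kb

for text_kb in (10, 100, 1000):
    fixed_share, scaled_share = formatting_shares(10, text_kb, 0.2)
    print(f"{text_kb:>5} KB text: fixed part {fixed_share:.1%}, per-text formatting {scaled_share:.1%}")
```

Under these assumptions the fixed share shrinks as the text grows, while the share of formatting that scales with the text settles toward a constant, so both views can hold at the same time.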
RTF is an old flat text format. You could read and write it with a simple text editor if you knew all the formatting codes, but it would be very difficult. It worked fine in the old DOS days because you can interpret it in small blocks, but the format has the disadvantage of repeating rather long formatting tags every sentence or paragraph, which often increases the total document size to more than twice the size of the text alone.
The old doc format is a binary format. You need Word to interpret what is in the file. If you opened it with a text editor you would see "garbage". The reason was to keep the format exclusive to Microsoft and thus keep a lock-in for Word.
The docx/xlsx/pptx and odt/ods/odg formats are archives with a specific structure. If you unzip them you get a folder structure with multiple files. It's almost the same as an epub file which is also an archive with a specific structure.
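A minimal sketch of peeking inside such an archive without unzipping it by hand, using Python's standard zipfile module. The check relies on two things I'm confident of: Office Open XML files contain a member named [Content_Types].xml, and OpenDocument and epub files contain a member named mimetype that holds the media type. The function name describe_archive is my own.

```python
import zipfile

def describe_archive(path):
    """Return (kind, member_names) for a zip-based document file."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        if "[Content_Types].xml" in names:
            # Marker file used by docx/xlsx/pptx packages.
            kind = "Office Open XML"
        elif "mimetype" in names:
            # odt/ods/odg and epub store their media type in this member.
            kind = zf.read("mimetype").decode("ascii", "replace").strip()
        else:
            kind = "unknown"
    return kind, names
```

The returned member names show the folder structure, for example a word/ folder in a docx.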
Word's formatting sure is a significant amount of data but not very much compared to what an RTF file does.
RTF is nice in theory, but many word processors insist on copying all of their internal formatting into the .rtf file, where it doesn't belong.
In my opinion, there is NO reason to ever export anything in .rtf, and if I need data from an existing .rtf, I'll typically select all the text and copy it to a plain text file before re-importing it to my document.
RTF is an old format and fortunately still supported, which allows reading very old documents. You can import them into Word or LibreOffice and use Save As to store them in one of the regular current formats. Just copying the text like you do is the most efficient way to get rid of all formatting data.
I'll typically select all the text and copy it to a plain text file before re-importing it to my document.
But then you'll lose the "rich text" formatting. That's why you choose RTF over TXT. To have that formatting.
Don't forget that in addition to the text, Word saves all of the document's formatting information. In a file that is heavily formatted, that can be a significant amount of data.
As an experiment, I opened a 64K word file and saved the text content without the file formatting information. The result was a 54K file.
Just as an aside, since I've long worked with style definitions, I've started deleting all the standard Word style definitions I don't use, and it's surprising how much of a size difference it makes. I'm now routinely deleting a whole roster of 'garbage' style definitions, though it doesn't keep them from reappearing as you copy text into your document. :(
it doesn't keep them from reappearing as you copy text into your document.
Hint or tip:
Keep a simple text editor window (Notepad) open. To strip formatting, styles, etc. from something you want to paste without the excess gumph, paste it into the open Notepad window, re-select the text and cut it (cut rather than copy, so the Notepad window is left empty again), then paste it where you want it. Going through Notepad (or another simple text editor) forces it to plain text, which will (in Word) then acquire the formatting of the paste point.
ETA: There's no need for this procedure when copy/pasting text within a single document, only when pulling in something from another doc which may have different styles etc.