Forum: Author Hangout

Extracting Text From PDF or Ebook

oldegrump

I have a problem reading PDF and Ebook files. I need to reformat the files to read them. I have used several converters, but for the most part, I end up spending much more time correcting the results than I do reading the finished file.

Dominions Son

@oldegrump

For PDF, you might be out of luck. Many have the contents as an image and you'd have to run OCR software on it to get the text back.

But why would you have to extract the text from an e-book to reformat?

Most ebook readers will let the user adjust text size, justification, font, text and background colors. Other than chapter and paragraph breaks, almost none of the formatting decisions by the author are enforceable against the reader.

Grey Wolf

@Dominions Son

There are some PDF readers which will 'reflow' PDFs (which are not contents-as-image, of course) to allow for viewing on a wider range of screen sizes. Allegedly MS Office will do this (I haven't tried it). Calibre will as well, though it can produce iffy results.

If your PDF is image-based, OCR is the only solution. I'm not really sure why someone would do that on a PDF intended for publication, as it makes file sizes far larger and is worse for the reader, unless it's as a minimal sort of copy protection.

Crumbly Writer

@Dominions Son

For PDF, you might be out of luck. Many have the contents as an image and you'd have to run OCR software on it to get the text back.

PDF is an older technology which was never intended as an eBook tool. Instead, it seeks to duplicate the original print medium, retaining the exact same dimensions, borders and margins, which makes reading PDFs on a smartphone incredibly painful.

You're better off selecting the text (if you can) and then pasting it into a new document. If you can't do that, use Calibre to convert the entire document into either a formatted Word file, or into a plain .txt or even a .rtf format.

Luckily, it's uncommon for published novels to be distributed as several hundred pages of separate images. Magazines, though, are a whole different story.

@oldegrump

My problems with e-books are that my reader makes a mess of bookmarks and the text is shown in two-page (side by side) format.

That should be a user selectable setting (the page display option), but of course, that depends on your viewer.

Dominions Son

@Crumbly Writer

PDF is an older technology which was never intended as an eBook tool,

I never suggested it was.

You're better off selecting the text (if you can) and then pasting it into a new document.

If the PDF is set up as an image or set of images, the text won't be selectable.
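
For anyone who would rather check this from a script than by trying to select text, here is a minimal Python sketch. It assumes the pdftotext tool from poppler-utils is installed and on the PATH. The function names and the 20-character threshold are made up for illustration, and the check is only a heuristic: a PDF that yields almost no text is probably image-based and would need OCR.

```python
import shutil
import subprocess


def looks_image_only(extracted_text, min_chars=20):
    """Heuristic: treat a PDF as image-based if almost no text came out.

    min_chars is an arbitrary, illustrative threshold.
    """
    # split() drops all whitespace, including the form feeds that
    # pdftotext emits between pages.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def pdf_text(path):
    """Return the text layer of a PDF using poppler's pdftotext."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    # "-" as the output file tells pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

Usage would be something like looks_image_only(pdf_text("book.pdf")), where book.pdf is a placeholder file name.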

oldegrump

@oldegrump

My problems with e-books are that my reader makes a mess of bookmarks and the text is shown in two-page (side by side) format. I find it difficult to read that way. PDF files in Acrobat Reader do not allow bookmarks (at least as far as I know).

Dominions Son

@oldegrump

My problems with e-books are that my reader makes a mess of bookmarks and the text is shown in two-page (side by side) format.

That's your reader; it's not inherent to the ebooks, and again, that is probably changeable in the settings. What reader are you using?

Ernest Bywater

@Dominions Son

that is probably changeable in the settings.

I agree. I have 2 e-book readers on my PC. One defaults to 2 pages side by side, but I changed a setting so it shows one page; the other only ever shows one page at a time.

Also, as CW has said, Calibre has an option to convert PDF into various e-book formats as well as docx and txt output options.
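
As a rough illustration of that conversion done from a script rather than the Calibre window: a minimal Python sketch, assuming Calibre is installed and its ebook-convert command-line tool is on the PATH. The helper names are hypothetical, and the output format is simply taken from the file extension of the output file.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_ext="txt"):
    """Build an ebook-convert call that writes next to the source file."""
    src = Path(source)
    dest = src.with_suffix("." + target_ext.lstrip("."))
    # ebook-convert takes the input file, then the output file.
    return ["ebook-convert", str(src), str(dest)]


def convert(source, target_ext="txt"):
    """Run the conversion and return the path of the new file."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    cmd = build_convert_command(source, target_ext)
    subprocess.run(cmd, check=True)
    return Path(cmd[2])
```

For example, convert("novel.pdf", "docx") would ask Calibre for novel.docx (novel.pdf being a placeholder name). The caveats above still apply: an image-only PDF will not produce usable text this way.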

Keet

@oldegrump

For PDF extraction I've had good results using one of the poppler-utils on Linux (pdftohtml). It works great except for tables. Of course, you're out of luck if the PDF was created using images. As Dominions Son stated, that would require OCR, which generally produces very questionable results.
Ebooks should be no problem since you can adjust most visuals with most ebook readers. If you can't, then extract the ebook (it's just a zip file) and read it with a browser. That way you could change the CSS to your liking.
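
To show what "it's just a zip file" means in practice, here is a minimal Python sketch that pulls the plain text out of a DRM-free EPUB using only the standard library. The class and function names are made up for illustration. It makes one loud simplification: it reads the HTML files in alphabetical order by name, whereas the real reading order is defined in the book's OPF file, so chapters may come out in the wrong order for some books.

```python
import zipfile
from html.parser import HTMLParser

# Tags whose contents are not visible text.
SKIP_TAGS = {"script", "style", "title"}
# Tags after which a line break is inserted.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextCollector(HTMLParser):
    """Collects visible text from one (X)HTML file."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def epub_to_text(path):
    """Return the text of every HTML file in the EPUB, joined together."""
    chunks = []
    with zipfile.ZipFile(path) as book:
        # Assumption: alphabetical order approximates reading order.
        for name in sorted(book.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = TextCollector()
                parser.feed(book.read(name).decode("utf-8", errors="replace"))
                chunks.append("".join(parser.parts).strip())
    return "\n\n".join(chunk for chunk in chunks if chunk)
```

Calling epub_to_text("book.epub") (a placeholder file name) returns one string that can be saved as a .txt file and reformatted as needed.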

Dominions Son

@Keet

Ebooks should be no problem since you can adjust most visuals with most ebook readers. If you can't, then extract the ebook (it's just a zip file) and read it with a browser.

That may be true for an epub. I don't think that holds for a .mobi and I'm certain it wouldn't hold for an .azw.

How and even whether it's possible to extract the raw text from an ebook will be dependent on exactly what ebook format he's talking about.

Keet

@Dominions Son

That may be true for an epub. I don't think that holds for a .mobi and I'm certain it wouldn't hold for an .azw.

Non-DRM mobi files can be unpacked with https://github.com/kevinhendricks/KindleUnpack. It still needs some figuring out, but most types can be unpacked.
Another way is to first convert the ebook, including azw books: https://www.ebook-converter.com/69-how-to-convert-kindle-azw-other-format.htm
These two links are just examples of the many possibilities. Especially on Linux, there's almost always a way to get these kinds of things done. It's not always perfect and sometimes needs some tinkering, but it can be done.

Ernest Bywater

@oldegrump

My preferred e-book reader is FBReader, which is available for PC and mobile devices:

https://fbreader.org/

Keet

@Ernest Bywater

My preferred e-book reader is FBReader, which is available for PC and mobile devices

Also available in most repositories for Linux if not installed by default.
