Having been doing this for a living for a few years now, I have come across a number of situations where that most dreadful of file formats has smashed through my neatly designed workflows, providing endless hours of “fun”.
Now as I’m sure you are aware, there are a number of CAT tools and ancillary applications that allow you to throw PDF files into them in order for said files to be promptly crushed by their inner workings, thus rendering them vulnerable to our powers of linguistic conversion.
Still, some PDF files seem imbued with some dark magic, for all seems fine when you open them, concealing their deceit until immediately after you have accepted the project, upon which they readily blow up in your face.
Joking aside, PDF files can be either a great thing or a major pain in the butt, and it all depends on how they’ve been created or, as the Adobe-folk like to call it, “distilled”. For some reason, this term always reminds me of some hillbilly hiding in the bush making moonshine, but whatever.
My most common issues with PDF files can be divided into three categories:
- PDF files of scanned documents
- PDF files with overlaid images
- Poorly distilled PDF files, which show as garbled text when copying or extracting contents
1. PDF files of scanned documents
Well, when we are talking about scanned documents, the first thing that pops up in our hivemind-like consciousness is the magical acronym of OCR. I admit, if the document is clearly legible and the typefaces being used are large enough, that will be the first thing I’ll try.
There is a large number of programs to do just that, so there’s not much point in discussing each and every one of them. If you want to know my experience, just shoot me an e-mail.
As a side note, if you OCR any document with the intent of loading it into a CAT tool, do yourself a favour and run a spell-check in the OCR’ed file first. It will save you much grief and your resulting TM will be much neater if you do so.
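If you’d rather automate a first pass before the proper spell-check, here is a minimal sketch in Python. It assumes you have the OCR output as plain text and a word list of your own; the `known_words` set and the sample sentence below are made-up stand-ins, not anything a real OCR tool produces. All it does is flag words that are not in the list, which tends to catch the classic OCR slips (an “l” for an “i”, a zero for an “o”).

```python
import re

def suspicious_words(text, known_words):
    """Return words from `text` that are not in `known_words`,
    lower-cased, in order of first appearance, without repeats."""
    seen = set()
    flagged = []
    # A "word" here is a run of letters/digits, optionally with inner apostrophes.
    for token in re.findall(r"[^\W_]+(?:'[^\W_]+)*", text):
        word = token.lower()
        if word not in known_words and word not in seen:
            seen.add(word)
            flagged.append(word)
    return flagged

# Hypothetical word list; in practice you would load a real dictionary file.
known_words = {"the", "patient", "was", "given", "a", "dose"}
ocr_text = "The patlent was given a d0se"
print(suspicious_words(ocr_text, known_words))  # ['patlent', 'd0se']
```

This is no replacement for a real spell-checker, but it gives you a quick list of likely OCR casualties to look at before the file goes anywhere near your TM.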
However, when the scanned PDF files are bad, be it from poor scanning, photocopied documents or whatever, you have a faster option than wrestling with OCR. I often find this to be superior when I have large tables in a file as well.
What I do is recreate the document structure first, placing tables and place-holder text if required, and then dictate all over said structure. If the result of an OCR attempt is so pathetic as to warrant extensive formatting, I often find this to be significantly faster, particularly when the documents are not repetitive in their nature (as in not lending themselves well to CAT tool handling). For a blog post on dictation solutions, check out this link.
2. PDF files with overlaid images
Ah, I particularly love these. Anyone who has had a run-in with governmental clinical trial managing platforms has probably come across this. You get a nicely laid out PDF file, with text neatly arranged in it, only to find that the table is not really a table, but a sort of background image on which the text is superimposed.
Sometimes this can be solved with OCR, sometimes the neat images screw that up badly. In that case, I wholeheartedly recommend you go and get BCL Easy Convert. It will transform said PDF nastiness into an RTF file (don’t run away just yet), which you can then convert to a more civilized format such as .docx and load in the CAT tool of your choice. You worry about the text, let the application place the background images. Depending on the expansion/contraction of the text during translation, you may need to do a minor bit of formatting, but the great advantage here is that this will allow you to tap into the source text directly, building a TM for future use.
You can also use a free online version here, but beware of confidentiality issues, as the files you send will be processed on a remote server rather than on your machine.
3. Poorly distilled PDF files
My personal favorite kind of bad news. I used to get absolutely pissed at this before I found a fix. These are files which look entirely normal until you try to copy or extract their contents, upon which you obtain a lovely arrangement of random, garbled characters and symbols.
Apparently this is due to a fault when managing fonts during the “distilling” process.
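If you want to catch these before accepting the project, a rough check on the extracted text can help. The sketch below is in Python and assumes you already have the text in a string (how you extract it is up to your tool of choice). The heuristic and the 10% threshold are my own guesses, not anything official: it counts control characters, private-use code points and the Unicode replacement character, which tend to show up in bulk when the font mapping is broken. It will not catch garbling made up of perfectly ordinary letters in the wrong order.

```python
import unicodedata

def garbage_ratio(text):
    """Share of characters that are control, private-use, unassigned
    or replacement characters (ordinary line breaks and tabs excluded)."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # normal whitespace, not a sign of trouble
        # Unicode categories starting with "C" cover control, format,
        # private-use, surrogate and unassigned code points.
        if unicodedata.category(ch).startswith("C") or ch == "\ufffd":
            bad += 1
    return bad / len(text)

def looks_garbled(text, threshold=0.1):
    """True when the share of junk characters exceeds `threshold` (an assumed cut-off)."""
    return garbage_ratio(text) > threshold

print(looks_garbled("Informed consent form"))          # False
print(looks_garbled("\x02\x15\uf0b7\ufffd\x07ab"))      # True
```

Run it on a page or two of extracted text; if it comes back True, treat the file as one of the nasty ones and quote accordingly.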
You have more than a few solutions, depending on your platform and software of choice. This is not an exhaustive run-down of them, so by all means chime in with your input, either as a comment or wherever this ends up being posted.
You are a Mac user. Macs out-PDF Adobe Acrobat.
You use your Mac’s Jedi-like powers to redistill the PDF file into normality. Simply open the file as normal, which should use the Preview app. Go to File / Export and choose PDF, or go to File / Print and click on the PDF button. Voilà, le PDF is now properly distilled by Quartz (Apple’s 2D graphics engine, which renders PDFs) and ready to be thrown into the OCR or extraction tool of your choice.
You are not a Mac user, but you have Adobe Acrobat.
Even easier. If you have Adobe Acrobat, you’ll have Adobe Distiller too. Open the wretched file, and somewhere under the File menu there will be an option to save/redistill the PDF file. This should do the trick, just make sure you check out the font options. Do note, it requires the proper, full, paid version of Acrobat – NOT Acrobat Reader.
You are not a Mac user, nor do you own Adobe Acrobat.
Get InFix PDF Editor. Buy it, support the developers and claim the right to pay the first beer when you meet me. You are quite welcome to do so 🙂
There are other options too, such as using Google Docs to open the file and copy the contents from within. These are the three that I’ve found least fussy, but please share your experience.
As usual, please get in touch if there’s anything you’d like to add or throw into the mix. Any feedback is appreciated; just click the email link.