If you're looking to hire a development firm to build you a mobile or web app, this Quick Reference Guide is a must-have.
TL;DR We created a series of scripts designed to scrape and import content from unfriendly CMS’ into WordPress. It saved us 55 hours of really super ultra boring manual work, so we’re releasing it for everyone to modify & use. It’s handy for developers – check it out on github and read the background story below.
In the world of web development, we rarely get the opportunity to start with a blank slate; more often than not, existing content needs to be ported over to whatever new system we build. Sometimes this task is as easy as logging into whatever CMS the old site used, clicking the nice little green “export” button, and VOILA! Every once in a while however, we get presented with a new challenge. This was one of those times.
How to Avoid Manually Importing 850 Posts Because That’s Wack
In this particular case, not only did the old CMS not have an “export” button… it didn’t even use a relational database! So no easy export, no SQL dump, and the garbage off-the-shelf CMS importing tool we originally specced and paid for didn’t work at all. Cool. It was looking more and more like we were going to be stuck spending hours clicking copy/paste/copy/paste. We ran a quick site crawl to see just how much time we (and by “we”, I mean “the lucky least-busy developer”, aka me) could expect to spend wearing out the same three keys on his keyboard. The result of that crawl? Over 850 posts that needed to be imported into WordPress. At about 5 minutes a post, we were looking at ~70 hours.
Sweet.
Obviously this did not sit well with the developers project manager anyone, especially considering there was no room in the project budget for something like that. So we set about trying to automate the task, and wrote our own tool to do that very thing.
An Automated Solution for Scraping & Importing
The result was Commander Vee, a collection of scripts created during a two-day sprint to see if we could make that automation happen. It may not be the most elegant codebase we’ve ever written, but it certainly served the purpose! Using Commander Vee, we saved 55 hours of manual content entry time (and now that it’s written, the time savings going forward will be huge for everyone), got to push the normal use cases of web technologies, and learned all about the evils of character encodings.
At its heart, Commander Vee does two things:
- Runs a crawler script that gets & reformats HTML data into WordPress-friendly JSON, then
- Runs a follow-up importer script that gets the JSON & imports that data into WordPress.
That import includes author names, publish dates, blog content, and so on – all stuff that we required. You can modify Commander Vee to suit your needs though!
How We Wrote Commander Vee and Why Character Encodings Are Not Cool
In a typical work day, we write code in about three languages… PHP, Javascript, and SCSS. Normally we use these to create awesome user-facing experiences or killer back-end applications – but Commander Vee was a neat excuse to ignore users of any sort for a couple days (ed. hilarious developer joke, Devon) and toss some data around.
The first problem was getting the content off of the existing site. Thankfully, all the parts we cared about (author names, publish dates, blog content, etc) were formatted similarly across all the pages. Fire up a loop to load in the URLs we needed (Thanks, jQuery.load()
), and… nope. Cross-domain security won’t let you just load in pages from other websites like that. Well, if it won’t let us load in pages from that domain, let’s try loading the pages on our own domain!
wget -i urls-to-dl.txt
There. Load those pages in, and we’re in business. Grab the author name, publish date, title, and content, then fire it into a JSON array so we can pass it to PHP for the WordPress upload. Nope.
Invalid JSON
Me: Huh. Well, the content section looks like there’s a lot going on, let’s try and narrow this down. After a few minutes of ‘narrowing down’:
{"content": "in-between"} : Invalid JSON
Turns out, copy-pasting from Microsoft Word into a TinyMCE editor carries over all sorts of fun unprintable characters (or characters that look similar, but are really just special). There was a single unprintable character hiding just before that dash; removed it (and added some code to deal with common odd characters), and we were back in business!
Once we had the JSON array, reading it into WordPress and building out the new posts was a fairly straightforward task:
- Place the file in the base folder of the WordPress theme.
- Modify the ‘Load up the JSON’ section (line 11) to point at the JSON files.
- Modify the calls to
importPages()
on line 82 to match the code tweaked on step 2. Additionally, provide the category IDs to associate with the posts from each file. - Ensure the WordPress install contains categories that match the IDs provided in step 3.
- Run this file (we used Chrome, but experiment away!).
- Verify that the file ran without PHP errors, and then check out the newly imported posts!
The Result
At the end of the day, Commander Vee saved us over 55 hours of manual work. Based off our initial time estimate, that’s a time savings of more than 75%. Now that it’s written? The script takes a few minutes to modify for the unique content source, and like 10 minutes to run for 800+ posts. While Commander Vee isn’t quite as user-friendly as Waldo or Alton (a couple of our other open source projects), it was a fun example of how web technologies can be used to do something beyond just displaying pretty pixels. The code is all hosted on Github if you feel like taking a look, and we’ve done our best to lay out the README and code comments so that you can follow along (or morph Commander Vee into something useful for you!).
Wait! Why “Commander Vee”?
The whole point of this little site scraper/importer was to save us from endless hours of copy and paste. Since we use OSX here at Paper Leaf, I named it after the OSX keyboard shortcut for ‘paste’… Command + V. So I’m basically also a branding expert now I guess.