There are many reasons you might want to capture a web page as an image or convert a web page to a PDF, MS Word or other type of document. Maybe you need thumbnail images for web sites to use in a web directory or list of bookmarks. Maybe you're creating an archive that captures changes to the content or design of a web site over time. Maybe you're looking for a richer (or more tightly controlled) “print” version of a web page than the browser's default. Maybe you're looking to capture a bunch of web content for editing and repurposing in other forms.
We've been experimenting with web-to-image and web-to-document conversions for a fair while now, but we've never quite been satisfied with the results.
The conventional answer to this problem is the venerable wkhtmltopdf / wkhtmltoimage, a headless engine for rendering web pages using the Qt port of WebKit, the layout engine family behind Apple's Safari, (via its Blink fork) Google's Chrome, and many other browsers.
(The names make more sense when pronounced “WebKit HTML-to-PDF” and “WebKit HTML-to-Image”).
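For readers who haven't used it, here is a rough sketch of driving wkhtmltoimage from Python. It assumes the wkhtmltoimage binary is installed and on the PATH. The helper names (build_wkhtmltoimage_cmd, capture) and the default width, quality and timeout values are our own illustrative choices, not part of the tool.

```python
import shutil
import subprocess


def build_wkhtmltoimage_cmd(url, output_path, width=1024, quality=90, fmt="png"):
    """Return the argument list for a wkhtmltoimage invocation.

    The defaults here are illustrative assumptions, not recommended settings.
    """
    return [
        "wkhtmltoimage",
        "--quiet",                  # suppress progress output
        "--format", fmt,            # output image format, e.g. png or jpg
        "--width", str(width),      # viewport width in pixels
        "--quality", str(quality),  # image quality, 0-100
        url,
        output_path,
    ]


def capture(url, output_path, timeout=60, **options):
    """Render url to output_path; return True on success, False otherwise."""
    if shutil.which("wkhtmltoimage") is None:
        return False  # binary not installed or not on PATH
    cmd = build_wkhtmltoimage_cmd(url, output_path, **options)
    try:
        result = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # the page took too long to render
    return result.returncode == 0
```

Calling capture("https://example.com", "example.png") would then attempt a single render and report whether the tool exited cleanly.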
And in general, the conventional answer is a good one: wkhtmltox does an admirable job in most circumstances. But sometimes wkhtmltox fails to render a web page correctly, or at all. Largely because of issues in the underlying Qt/WebKit framework, it has also historically struggled with image resolution and quality, obscure or unusual HTML5 and SVG content and, in some areas, full Unicode or emoji support.
If you happen to control the content, you can often make small changes to the web page you're trying to render to work around these problems. But that's not helpful at web scale.
PhantomJS is a general-purpose headless browser that can be used to capture images of web pages. PhantomJS is also Qt/WebKit-based, but uses a more recent version of the library that avoids some of the issues wkhtmltox runs into. But not all of them. PhantomJS also has issues with some web sites. Some of those overlap with wkhtmltox's issues, but there are sites that PhantomJS renders better than wkhtmltox and other sites that wkhtmltox renders better than PhantomJS.
Add to this mix Blink, a new(ish) and improved fork of the WebKit core that is used in current versions of Chromium, Chrome, Opera and other browsers, and several ways to drive Blink: the Chromium Embedded Framework (CEF), Brightray and Electron, to name just a few.
There are just a lot of ways to capture a screenshot of a web page. And each has its own strengths and weaknesses.
So after much experimentation, we've finally settled on a hybrid solution that uses heuristics to choose “the right tool for the job”—selecting one web-to-image engine or another (with some custom pre- and post-processing here and there) based on the nature of the page being rendered.
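To make the idea concrete, here is a minimal sketch of what heuristic engine selection could look like. Everything in it is hypothetical: the PageTraits fields, the rules, the script-count threshold and the engine names are invented for illustration and are not the heuristics we actually use. The fall-through to other engines when the preferred one fails is likewise an assumption added for the sketch.

```python
from dataclasses import dataclass


@dataclass
class PageTraits:
    """Hypothetical features gathered from a cheap pre-fetch of the page."""
    uses_inline_svg: bool = False
    has_emoji: bool = False
    script_count: int = 0


def choose_engine(traits):
    """Pick an engine name using made-up rules, for illustration only."""
    if traits.uses_inline_svg or traits.has_emoji:
        return "blink"
    if traits.script_count > 20:  # arbitrary threshold
        return "phantomjs"
    return "wkhtmltoimage"


def render_with_fallback(url, traits, renderers):
    """Try the preferred engine first, then the others, until one succeeds.

    renderers maps an engine name to a callable that takes a URL and
    returns image bytes, or None on failure.
    """
    preferred = choose_engine(traits)
    order = [preferred] + [name for name in renderers if name != preferred]
    for name in order:
        render = renderers.get(name)
        if render is None:
            continue  # no renderer registered for this engine
        image = render(url)
        if image:
            return name, image
    return None, None
```

Keeping the selection rules in one small function means they can be adjusted as new problem sites turn up, without touching the code that talks to each engine.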
We'll be rolling out a new web-to-image REST endpoint based on this hybrid solution over the next few days. (Watch this space 🙂.) And while the current results are quite good, we fully expect to continue to tweak and evolve that hybrid solution as we use it with more and more sites (such is the nature of “web scale”), so we're inviting everyone in earshot to take it for a spin.