Screen scraping pages that use CSS for layout and formatting… how to scrape the CSS applicable to the HTML?
I am working on an app for doing screen scraping of small portions of external web pages (not an entire page, just a small subset of it).
I have the code working perfectly for scraping the HTML, but my problem is that I want to scrape not just the raw HTML, but also the CSS styles used to format the section of the page I am extracting, so I can display it on a new page with its original formatting intact.
If you are familiar with Firebug, it is able to display which CSS styles apply to the specific subset of the page you have highlighted, so if I could figure out a way to do that, then I could just use those styles when displaying the content on my new page. But I have no idea how to do this.
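One starting point: browsers expose the fully resolved styles of any element through the standard `window.getComputedStyle` DOM API, which is essentially what Firebug reads. A minimal sketch of turning that into an inline `style` string (the function name and the property list are mine, purely illustrative):

```javascript
// Sketch: serialize the browser-computed values of a few CSS properties
// into an inline style string you can attach to the scraped markup.
// el must be an element living in a document (or a stub with the same shape).
function computedStyleText(el, props) {
  const view = el.ownerDocument.defaultView;   // the element's window
  const cs = view.getComputedStyle(el);        // resolved CSS values
  return props
    .map(p => p + ': ' + cs.getPropertyValue(p))
    .join('; ');
}
```

Capturing every property this way produces very verbose markup, so in practice you would also filter out browser defaults and values inherited from the parent.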
Today I needed to scrape Facebook share dialogs to be used as dynamic preview samples in our app builder for Facebook apps. I've taken the Firebug 1.5 codebase and added a new context menu option, "Copy HTML with inlined styles". I've copied their getElementHTML function from lib.js and modified it to do this:
- remove class, id and style attributes
- remove all data-something attributes
- remove explicit hrefs and replace them with "#"
- replace all block-level elements with div and inline elements with span (to prevent inheriting styles on the target page)
- absolutize relative URLs
- inline all applied non-default CSS attributes into a brand new style attribute
- reduce inline style bloat by taking parent/child style inheritance into account, traversing the DOM tree upward
- indent the output
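Absolutizing the relative URLs is the easiest step to isolate. A sketch using the WHATWG URL constructor (available in Node and modern browsers, though not in the Firefox builds Firebug 1.5 targeted, where you'd resolve against the document's baseURI instead):

```javascript
// Sketch: rewrite a relative href/src value against the scraped page's
// base URL, so the extracted markup keeps working when re-hosted elsewhere.
function absolutize(url, baseUrl) {
  try {
    return new URL(url, baseUrl).href;
  } catch (e) {
    return url; // leave malformed values untouched
  }
}
```

You would run this over every href/src attribute in the extracted fragment before serializing it.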
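The two reduction steps (keep only non-default values, then drop values already inherited unchanged from the parent) boil down to a diff over style maps. A sketch where plain objects stand in for the computed-style maps you would get from getComputedStyle, and INHERITED is a deliberately tiny illustrative subset of the real list of inherited CSS properties:

```javascript
// A few properties that inherit by default in CSS (illustrative subset).
const INHERITED = new Set(['color', 'font-family', 'font-size', 'line-height']);

// Sketch: keep only properties that differ from the browser default for
// this tag AND are not simply inherited unchanged from the parent element.
function reducedStyle(computed, defaults, parentComputed) {
  const kept = {};
  for (const [prop, value] of Object.entries(computed)) {
    if (defaults[prop] === value) continue;  // browser default: no need to inline
    if (INHERITED.has(prop) && parentComputed[prop] === value) continue; // inherited: parent already carries it
    kept[prop] = value;
  }
  return kept;
}
```

Applied bottom-up over the DOM tree, this is what keeps the generated style attributes from ballooning.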
It works well for simpler pages, but the solution is not 100% robust because of bugs in Firebug (or Firefox?). But it is definitely usable when operated by a web developer who can debug and fix all quirks.
Problems I've found so far:
- sometimes the clear CSS property is not emitted (this breaks the layout pretty badly)
- :hover and other pseudo-classes cannot be captured this way
- Firefox keeps only Mozilla-specific CSS properties/values in its model, so for example you lose -webkit-border-radius, because it was skipped by the CSS parser
Anyway, this solution saved me a lot of time. Originally I was manually selecting pieces of their stylesheets and postprocessing them by hand. It was slow, boring, and polluted our class namespace. Now I'm able to scrape Facebook markup in minutes instead of hours, and the exported markup does not interfere with the rest of the page.