Screen scraping pages that use CSS for layout and formatting…how to scrape the CSS applicable to the HTML?

Tags: html,css,screen-scraping,firebug

Problem :

I am working on an app for doing screen scraping of small portions of external web pages (not an entire page, just a small subset of it).

I have the code for scraping the HTML working perfectly. My problem is that I want to scrape not just the raw HTML but also the CSS styles used to format the section of the page I am extracting, so I can display it on a new page with its original formatting intact.

If you are familiar with Firebug, it can display which CSS styles apply to the specific subset of the page you have highlighted. If I could figure out a way to do that, I could just use those styles when displaying the content on my new page. But I have no idea how to do this.

Solution :

Today I needed to scrape Facebook share dialogs to be used as dynamic preview samples in our app builder for Facebook apps. I took the Firebug 1.5 codebase and added a new context menu option, "Copy HTML with inlined styles". I copied their getElementHTML function from lib.js and modified it to do the following:

  • remove class, id and style attributes
  • remove onclick and similar javascript handlers
  • remove all data-something attributes
  • remove explicit hrefs and replace them with "#"
  • replace all block-level elements with div and inline elements with span (to prevent inheriting styles on the target page)
  • absolutize relative urls
  • inline all applied non-default CSS properties into a brand new style attribute
  • reduce inline style bloat by taking parent/child style inheritance into account while traversing up the DOM tree
  • indent output
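
To make the attribute cleanup and the style reduction steps more concrete, here is a minimal Python sketch. It is not the Firebug modification itself. It assumes you have already extracted each element's attributes and computed styles as plain dicts (for example, via a browser automation tool). The names clean_attributes, minimal_inline_style and INHERITED are hypothetical, and the list of inherited properties is intentionally incomplete.

```python
from urllib.parse import urljoin

# Hypothetical, deliberately incomplete set of CSS properties that inherit by default.
INHERITED = {"color", "font-family", "font-size", "font-weight",
             "line-height", "text-align"}


def clean_attributes(attrs, base_url):
    """Return a copy of attrs with page-specific attributes removed or neutralised."""
    cleaned = {}
    for name, value in attrs.items():
        key = name.lower()
        if key in ("class", "id", "style"):
            continue  # styling is re-added later as one fresh style attribute
        if key.startswith("on") or key.startswith("data-"):
            continue  # event handlers (onclick, ...) and data-* attributes
        if key == "href":
            cleaned[key] = "#"  # drop the explicit link target
        elif key == "src":
            cleaned[key] = urljoin(base_url, value)  # absolutize relative URLs
        else:
            cleaned[key] = value
    return cleaned


def minimal_inline_style(computed, parent_computed, defaults):
    """Build a style string containing only the properties that must be stated.

    A property is skipped when it matches its baseline: the parent's value for
    inherited properties, otherwise the assumed browser default.
    """
    kept = {}
    for prop, value in computed.items():
        baseline = defaults.get(prop)
        if prop in INHERITED and prop in parent_computed:
            baseline = parent_computed[prop]
        if value != baseline:
            kept[prop] = value
    return "; ".join(f"{prop}: {value}" for prop, value in sorted(kept.items()))
```

For example, a child whose color matches its parent's but whose display and clear differ from the defaults would come out as "clear: left; display: block". The sketch assumes the target page does not restyle div and span itself; if it does, inherited properties may still need to be emitted explicitly.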

It works well for simpler pages, but the solution is not 100% robust because of bugs in Firebug (or Firefox?). But it is definitely usable when operated by a web developer who can debug and fix all quirks.

Problems I've found so far:

  • sometimes the CSS clear property is not emitted (which breaks layout pretty badly)
  • :hover and other pseudo-classes cannot be captured this way
  • Firefox keeps only the CSS properties/values its own parser understands (including Mozilla-specific ones) in its model, so, for example, you lose -webkit-border-radius because the CSS parser skipped it

Anyway, this solution saved me a lot of time. Originally I was manually picking out pieces of their stylesheets and post-processing them by hand. It was slow, boring, and polluted our class namespace. Now I'm able to scrape Facebook markup in minutes instead of hours, and the exported markup does not interfere with the rest of the page.
