How do I hide images that have a certain class when creating a pdf from html?

Problem :

I am having trouble hiding image elements that have a certain class when converting HTML to PDF with iTextSharp (5.x).

I have no control over the original HTML, since it comes from another source. However, I can do basic things like Regex and string.Replace in C# after I receive it.

A simple example of the Html string would be something like this:

        <img src="somepath/desktop.jpg" class="img-desktop">Desktop</img>
        <img src="somepath/mobile.jpg" class="img-mobile">Mobile</img>

This string is then getting created into a PDF using the XMLWorker in iTextSharp.

I need to hide the second image and, more generically, any image element with the "img-mobile" class.

What I've tried:

  • Add img.img-mobile {display:none} to the CSS that is sent in when creating the pdf
  • Add img.img-mobile {width:0;height:0} to the CSS
  • Add @media print { img.img-mobile: display:none} to the CSS
  • Add @media print { img.img-mobile: width:0;height:0} to the CSS
  • Use Regex to find img elements with that class, then loop through the matches, replace the source with an empty source, and replace the original html of that string with the new string (my Regex isn't grabbing any matches, unfortunately)

            var pattern = "<img.*?class=\"img-mobile.*\"\\s?>.*</img>";
            var mobileImages = Regex.Matches(innerHtml, pattern);
            var srcPattern = "src=\".*\" ";
            foreach (var imageElement in mobileImages)
            {
                var replaceString = Regex.Replace(imageElement.ToString(), srcPattern, " ");
                innerHtml.Replace(imageElement.ToString(), replaceString);
            }

I am quickly running out of ideas on how to handle this... The only saving grace is that the Html that comes in is consistent since a tool is generating it, somewhere else. So, when a user "adds an image to that html" it will always be structured the same, so Regex and replace methods are acceptable, although a CSS method would be much more preferred...

Solution :

Even if you're a Regex expert and your input is predictable as mentioned, parsing HTML with Regex is hard. A better and easier way is to use a tested/proven parser, which is available in pretty much every programming language. For .NET it's HtmlAgilityPack. If you know a bit of XPath, which is quite similar to CSS selectors, it's pretty simple to set up and select the specific nodes you want to remove:

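string RemoveImage(string htmlToParse)
{
    var hDocument = new HtmlDocument()
    {
        OptionWriteEmptyNodes = true,
        OptionAutoCloseOnEnd = true
    };
    hDocument.LoadHtml(htmlToParse);
    var root = hDocument.DocumentNode;
    // use 'img-mobile' in the XPath to drop the mobile images instead
    var imagesDesktop = root.SelectNodes("//img[@class='img-desktop']");
    if (imagesDesktop == null) return htmlToParse; // SelectNodes returns null on no match
    foreach (var image in imagesDesktop)
    {
        image.NextSibling?.Remove(); // the text after the tag is a sibling node
        image.Remove();
    }
    return root.WriteTo();
}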

And then pass your parsed HTML to iTextSharp:

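var parsedHtml = RemoveImage(HTML);
// outputFile and FileMode.Create are placeholders for the stream arguments
using (var xmlSnippet = new StringReader(parsedHtml))
using (FileStream stream = new FileStream(outputFile, FileMode.Create))
using (var document = new Document())
{
    PdfWriter writer = PdfWriter.GetInstance(
        document, stream
    );
    document.Open();
    XMLWorkerHelper.GetInstance().ParseXHtml(
        writer, document, xmlSnippet
    );
}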
Works for me with the HTML snippet you provided.

UPDATE, after comment about 'approved' code:

Aah, the dreaded CCB. Know how that goes. :( If HtmlAgilityPack doesn't pass, here's an alternate solution, although it's probably not the best Regex ever written. ;)

const string HTML = @"
    <p class='img-desktop'>Paragraph</p>
        <img src='somepath/desktop.jpg' class='img-desktop'>Desktop</img>
        <img src='somepath/mobile.jpg' class='img-mobile'>Mobile</img>
        <img src='somepath/desktop.jpg' alt='img-desktop' title='img-desktop' class=""img-desktop"">Desktop
        <img src='somepath/mobile.jpg' class='img-mobile'>Mobile</img>
";

public void Go()
{
    var regex = new Regex(
        // initial update
        // @"<img[^>]*class='?""?'?img-desktop""?[^>]*>.*?</img>",

        // after seeing accepted answer, noticed a bad copy/paste in the above.
        // it works, but can be shortened to this, which works too
        @"<img[^>]*class=[^>]*img-desktop[^>]*>.*?</img>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline
    );
    Console.WriteLine(regex.Replace(HTML, ""));
}

The Regex gives you a little extra leeway in case the actual HTML you're dealing with isn't exactly as posted above.
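
To target the class from the question rather than the one used in my test, here is a minimal sketch using only System.Text.RegularExpressions. It assumes the generated markup always closes each image with </img>, as in the sample. The names MobileImageStripper and Strip are made up for illustration, and the pattern is the shortened one above with the class name swapped:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MobileImageStripper
{
    // Shortened pattern from the answer, with 'img-mobile' swapped in.
    // Assumption: every matching <img> is followed by a closing </img>.
    private static readonly Regex MobileImage = new Regex(
        @"<img[^>]*class=[^>]*img-mobile[^>]*>.*?</img>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the HTML with every img-mobile element (tag, text, close) removed.
    public static string Strip(string html)
    {
        return MobileImage.Replace(html, "");
    }

    public static void Main()
    {
        // The two-image sample from the question.
        const string html =
            "<img src=\"somepath/desktop.jpg\" class=\"img-desktop\">Desktop</img>\n" +
            "<img src=\"somepath/mobile.jpg\" class=\"img-mobile\">Mobile</img>";

        // Only the desktop image should remain in the printed result.
        Console.WriteLine(MobileImageStripper.Strip(html));
    }
}
```

Pass the result of Strip to XMLWorker in place of the original string. Because [^>]* never crosses a tag boundary, the desktop image is left alone even though it comes first. If an image is ever missing its </img>, the lazy .*? will run on to the next </img> and swallow whatever sits in between, so the parser route is still safer when you can use it.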

