Top 10 Best Usage Examples of PHP Simple HTML DOM Parser

Simple HTML DOM Parser is one of the best things that has happened to me. I remember the days when I used regular expressions and the preg_match_all function to fetch values from scraped text, and they were not so good. Ever since I found this HTML DOM parser, fetching data and extracting values from HTML pages has been far easier.

During my initial days with this script, I was confused quite often. The parser provides a lot of features and can do almost everything you would want a parser to do. The only problem is remembering the syntax and how to call its various functions, each with its own parameters.

I’ve made a list of code snippets that I use from time to time and that may come in handy for you. Read on to understand the usage of Simple HTML DOM Parser and get ready-made PHP code for it.

  1. Downloading and storing structured data

    Data can be obtained from three main sources: a URL, a static file or an HTML string. Use the following code to create a DOM object from each of them.

    
    <?php
    
    include('simple_html_dom.php');
    
    //to parse a webpage
    $html = file_get_html("http://nimishprabhu.com");
    
    //to parse a file using relative location
    $html = file_get_html("index.html");
    
    //to parse a file using absolute location
    $html = file_get_html("/home/admin/nimishprabhu.com/testfiles/index.html");
    
    //to parse a string as html code
    $html = str_get_html("<html><head><title>Cool HTML Parser</title></head><body><h2>PHP Simple HTML DOM Parser</h2><p>PHP Simple HTML DOM Parser is the best HTML DOM parser in any programming language.</p></body></html>");
    
    //to fetch a webpage in a string and then parse
    $data = file_get_contents("http://nimishprabhu.com"); //or you can use curl too, like me :)
    // Some manipulation with the $data variable, for e.g.
    $data = str_replace("Nimish", "NIMISH", $data);
    //now parsing it into html
    $html = str_get_html($data);
    
    ?>
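
    Both helpers can fail. In the 1.5 release of the library, as far as I can tell, file_get_html and str_get_html return false when the content is empty or larger than the library's MAX_FILE_SIZE limit, so it is worth checking the result before calling find on it. A minimal sketch, assuming that behaviour and assuming simple_html_dom.php sits next to the script; load_dom is a hypothetical helper name, not part of the library:

    ```php
    <?php
    // Assumption: simple_html_dom.php is in the same directory, as in the examples above.
    include_once('simple_html_dom.php');

    // load_dom() is a hypothetical wrapper, not a library function.
    // It returns the DOM object, or null when the markup could not be loaded.
    function load_dom($markup) {
        $html = str_get_html($markup);
        // A falsy result means the parser refused the input (e.g. an empty string).
        return $html ? $html : null;
    }

    $dom = load_dom('<html><body><h2>PHP Simple HTML DOM Parser</h2></body></html>');
    if ($dom === null) {
        echo "Could not parse the markup\n";
    } else {
        echo $dom->find('h2', 0)->plaintext . "\n";
    }

    // An empty string is reported as a failure here instead of breaking a later find() call.
    $empty = load_dom('');
    var_dump($empty === null);
    ```

    The same check applies to file_get_html, which additionally fails when the URL or file cannot be read.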
    
  2. Finding HTML elements based on their tag names

    Suppose you want to find every image on a webpage or, say, every hyperlink. We will use the “find” function to extract this information from the object. Here’s how to do it using Simple HTML DOM Parser:

    
    <?php
    
    include('simple_html_dom.php');
    
    $html = file_get_html('http://nimishprabhu.com/');
    
    //to fetch all hyperlinks from a webpage
    $links = array();
    foreach($html->find('a') as $a) {
     $links[] = $a->href;
    }
    print_r($links);
    
    //to fetch all images from a webpage
    $images = array();
    foreach($html->find('img') as $img) {
     $images[] = $img->src;
    }
    print_r($images);
    
    //to find h1 headers from a webpage
    $headlines = array();
    foreach($html->find('h1') as $header) {
     $headlines[] = $header->plaintext;
    }
    print_r($headlines);
    ?>
    
  3. Extracting values of attributes from elements

    Suppose you want to get the names of all input fields on a webpage, e.g. http://nimishprabhu.com/chrome-extension-hello-world-example.html. If you open the webpage you will notice a comment form with input fields. Please note that the comment box is a textarea element and not an input element, so it will not be detected. To detect the rest of the visible as well as hidden fields, you can use the following code:

    <?php
    
    include('simple_html_dom.php');
    
    $url = 'http://nimishprabhu.com/chrome-extension-hello-world-example.html';
    
    $html = file_get_html($url);
    
    foreach($html->find('input') as $input) {
     echo $input->name.'<br />';
    }
    
    // Output for above script :
    // author
    // email
    // url
    // submit
    // comment_post_ID
    // comment_parent
    
    ?>
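
    If you also want the textarea, the find function accepts a comma-separated list of selectors, so both tags can be fetched in one call. A minimal sketch on an inline form (the markup below is a made-up stand-in for the real comment form):

    ```php
    <?php
    // Assumption: simple_html_dom.php is in the same directory, as in the examples above.
    include_once('simple_html_dom.php');

    // Hypothetical, shortened comment form used only for illustration.
    $form = '<form>'
          . '<input type="text" name="author">'
          . '<input type="hidden" name="comment_post_ID" value="1">'
          . '<textarea name="comment"></textarea>'
          . '</form>';

    $html = str_get_html($form);

    // 'input, textarea' matches elements with either tag name.
    $names = array();
    foreach ($html->find('input, textarea') as $field) {
        $names[] = $field->name;
    }
    print_r($names);
    ```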
    
  4. Filtering elements based on values of its attributes

    When developers design a page, they use various attributes to uniquely identify and classify the information on it. A parser is not human and cannot see the visual difference, but it can detect these attributes and filter the output to obtain a precise set of data. Let us take a practical example. If you open https://www.phpbb.com/community/viewtopic.php?f=46&t=543171 you can see that the page is divided into header, content and footer, and the content is further subdivided into posts. This page has only 1 post, but I chose it because it contains quite a lot of hyperlinks. Now suppose you want to extract only the hyperlinks in the post and not those of the entire page. The approach is as follows:

    Check the source of the webpage and find out whether the hyperlinks follow some kind of pattern. If you look closely you will find that all of them have class="postlink". This makes extracting them a piece of cake. Read the code below to see how to filter HTML elements based on the values of their attributes.

    <?php
    
    include('simple_html_dom.php');
    
    $url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
    
    $html = file_get_html($url);
    $links = array();
    foreach($html->find('a[class="postlink"]') as $a) {
     $links[] = $a->href;
    }
    
    print_r($links);
    
    ?>
    

    Something worth noting here: you can use the “.” and “#” prefixes to filter on the class and id attributes respectively. So the above code works the same if you write the filter as:

    foreach($html->find('a.postlink') as $a)
    
  5. Pattern matching while filtering attributes of elements

    Consider the above example, where we extract all links from the post. Say you want only the links to the sub-forums in the community. If you look closely, all of them begin with http://www.phpbb.com/community/viewforum.php. So let’s use the “starts with” filter to fetch only the links starting with that URL:

    <?php
    include('simple_html_dom.php');
    
    $url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
    
    $html = file_get_html($url);
    $links = array();
    foreach($html->find('a[href^="http://www.phpbb.com/community/viewforum.php"]') as $a) {
     $links[] = $a->href;
    }
    
    print_r($links);
    
    ?>
    

    Similarly, if you want to find all links containing phpbb.com, you can use the “contains” filter as follows:

    foreach($html->find('a[href*="phpbb.com"]') as $a)
    

    Sometimes you are sure about only the ending part of an attribute’s value. Say, for example, you are scraping a webpage which contains numerous div elements whose id attributes look like:
    <div id="1_message_id">content here</div>
    <div id="2_message_id">content here</div>
    and so on.
    Then you can find such div elements using the “ends with” filter as follows:

    foreach($html->find('div[id$="_message_id"]') as $div)
    
  6. Adding / Changing attributes of the elements

    Let’s say you want to change the value of an attribute of a particular element. For example, to change all the hyperlinks having class=postlink to class=topiclink, you can do the following:

    
    <?php
    
    include('simple_html_dom.php');
    
    $url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
    
    $html = file_get_html($url);
    
    foreach($html->find('a.postlink') as $a) {
     $a->class = 'topiclink';
    }
    
    echo $html;
    ?>
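
    Adding an attribute that the element does not have yet works the same way: assign to it as a property. A minimal sketch on an inline string; the rel attribute and the example.com link are illustrative choices only:

    ```php
    <?php
    // Assumption: simple_html_dom.php is in the same directory, as in the examples above.
    include_once('simple_html_dom.php');

    // Hypothetical one-link document used only for illustration.
    $html = str_get_html('<p><a class="postlink" href="http://example.com/">Example</a></p>');

    foreach ($html->find('a.postlink') as $a) {
        $a->class = 'topiclink'; // change an existing attribute
        $a->rel   = 'nofollow';  // add an attribute the link did not have
    }

    // save() returns the modified document as a string.
    $output = $html->save();
    echo $output . "\n";
    ```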
    
    
  7. Finding nth element from parsed data

    Note that the numbering of elements starts from 0 and not 1, so the first element is found at index 0. Let’s assume that you want to extract the href of the 3rd link with class postlink on a webpage. You can use the following approach:

    <?php
    include('simple_html_dom.php');
    $url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
    $html = file_get_html($url);
    echo $html->find('a.postlink',2)->href;
    ?>
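
    When the requested index does not exist, find returns null, and reading href on null raises an error, so a check helps on pages you do not control. A negative index counts from the end. A minimal sketch on an inline string with only two links (the markup is made up for illustration):

    ```php
    <?php
    // Assumption: simple_html_dom.php is in the same directory, as in the examples above.
    include_once('simple_html_dom.php');

    // Hypothetical snippet with just two matching links.
    $html = str_get_html(
        '<p><a class="postlink" href="/first">First</a>'
        . '<a class="postlink" href="/second">Second</a></p>'
    );

    // Index 2 asks for a third link, which does not exist here.
    $third = $html->find('a.postlink', 2);
    if ($third === null) {
        echo "No third link on this page\n";
    } else {
        echo $third->href . "\n";
    }

    // -1 selects the last matching element.
    $last = $html->find('a.postlink', -1);
    echo $last->href . "\n";
    ```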
    
  8. Manipulating the inner content of tags

    If you wish to clear the inner contents of the div with id as content, you can do so as follows :

    $html->find('div#content',0)->innertext = '';
    

    If you wish to append text to existing content, you can do so as follows :

    $appendcode = '<p>This is the text to append to existing innertext</p>';
    $html->find('div#content',0)->innertext .= $appendcode;
    

    In order to prepend text to existing content, you can use the following code :

    $prependcode = '<h2>Nice article below</h2>';
    $html->find('div#content',0)->innertext = $prependcode . $html->find('div#content',0)->innertext;
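
    The snippets in this section look up the same div each time. A minimal sketch that fetches the element once, checks that it exists and then prepends and appends in one go; the inline markup is a made-up stand-in for a real page:

    ```php
    <?php
    // Assumption: simple_html_dom.php is in the same directory, as in the examples above.
    include_once('simple_html_dom.php');

    // Hypothetical page fragment used only for illustration.
    $html = str_get_html('<div id="content"><p>Body</p></div>');

    // Look the element up once and reuse the object.
    $content = $html->find('div#content', 0);
    if ($content !== null) {
        $content->innertext = '<h2>Nice article below</h2>' . $content->innertext;
        $content->innertext .= '<p>This is the text to append to existing innertext</p>';
    }

    // save() returns the modified document as a string.
    $result = $html->save();
    echo $result . "\n";
    ```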
    
  9. Wrap the contents of an element inside a new element

    Say you have an existing div with id content, and you want to enclose it in a new wrapper div. Here’s how you do it :

    $html->find('div#content',0)->outertext = '<div id="wrapper">' . $html->find('div#content',0)->outertext. '</div>';
    

  10. Handling memory leak issues while using PHP Simple HTML DOM Parser

    Last but definitely not least: handling the memory leak issue. Once you start using this script extensively, you will run into memory exhausted errors and keep wondering what’s wrong with your script. The cause may be the parser’s memory leak. I will not go into detail about what a memory leak is or how this one is caused, but you can read quite a bit about it here. To handle it, don’t forget to clear the $html variable and unset it once it is no longer required.

    $html->clear();
    unset($html);
    

    You can also use the function written by Flash Thunder on Stack Overflow; check it out here along with its usage example.
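
    The cleanup matters most inside loops that parse many pages, where each iteration would otherwise keep its DOM alive. A minimal sketch, assuming the pages have already been downloaded into strings; extract_titles is a hypothetical helper name, not part of the library:

    ```php
    <?php
    // Assumption: simple_html_dom.php is in the same directory, as in the examples above.
    include_once('simple_html_dom.php');

    // extract_titles() is a hypothetical helper: it takes an array of HTML strings
    // and returns the text of each page's <title>, freeing every DOM after use.
    function extract_titles(array $pages) {
        $titles = array();
        foreach ($pages as $markup) {
            $html = str_get_html($markup);
            if (!$html) {
                continue; // skip input the parser could not load
            }
            $node = $html->find('title', 0);
            if ($node !== null) {
                // Copy the value into a plain string before the DOM is cleared.
                $titles[] = $node->plaintext;
            }
            // Release the DOM before the next iteration.
            $html->clear();
            unset($html);
        }
        return $titles;
    }

    $titles = extract_titles(array(
        '<html><head><title>One</title></head><body></body></html>',
        '<html><head><title>Two</title></head><body></body></html>',
    ));
    print_r($titles);
    ```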

I guess these examples are sufficient for you to get started with PHP Simple HTML DOM Parser. If you have any doubts or queries, use the comment form below. I will add more examples as per requests and queries. Hope this article helps you scrape data efficiently.

2 thoughts on “Top 10 Best Usage Examples of PHP Simple HTML DOM Parser”

  1. Hi.. please check html

    100 Bullets (Mature Readers) #100 near mint[46373]MAXIMUM_ORDER_TEXT
    $4.99

    From this, I want to scrape only the title, i.e. “100 Bullets (Mature Readers) #100 near mint”,
    but I am getting both. Here is the output:
    [product_title] => Array
    (
    [0] => 100 Bullets (Mature Readers) #30 near mint
    [10]
    MAXIMUM_ORDER_TEXT
    [1] => $1.99
    [2] => 100 Bullets (Mature Readers) #100 near mint
    [46373]
    MAXIMUM_ORDER_TEXT
    [3] => $4.99
    [4] => 100 Bullets (Mature Readers) #32 near mint
    [12]
    MAXIMUM_ORDER_TEXT
    [5] => $1.99
    [6] => 100 Bullets (Mature Readers) #34 near mint
    [14]
    MAXIMUM_ORDER_TEXT
    [7] => $1.99
    [8] => 100th Anniversary Special Guardians of the Galaxy (2014 one shot) #1 (variant) near mint

    Please let me know what to do.

    • Hi Jyoti

      Observe the pattern and accordingly split the strings obtained.

      For e.g.

      $title = explode('[', $product_title);
      or
      $title = explode('near mint', $product_title);

      Then use $title[0] to obtain the final output.
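
      Since $product_title in the output above is an array, the split has to be applied to each entry. A minimal sketch, assuming the entries look like the ones posted; the sample array below is a shortened stand-in for the real data:

      ```php
      <?php
      // Shortened stand-in for the array shown in the question (assumed shape).
      $product_title = array(
          "100 Bullets (Mature Readers) #30 near mint\n[10]\nMAXIMUM_ORDER_TEXT",
          '$1.99',
          "100 Bullets (Mature Readers) #100 near mint\n[46373]\nMAXIMUM_ORDER_TEXT",
          '$4.99',
      );

      $titles = array();
      foreach ($product_title as $entry) {
          // Price entries contain no "[" and are skipped.
          if (strpos($entry, '[') === false) {
              continue;
          }
          // Keep only the part before the first "[" and strip the trailing newline.
          $parts = explode('[', $entry);
          $titles[] = trim($parts[0]);
      }
      print_r($titles);
      ```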

      Trust this helps.

      Thanks & Regards

      Nimish Prabhu
