Top 10 Best Usage Examples of PHP Simple HTML DOM Parser
Simple HTML DOM Parser is one of the best things that has happened to me. I remember the days when I used regular expressions and the preg_match_all function to fetch values from scraped text, and the results were not so good. Ever since I found this HTML DOM parser, fetching data and extracting values from HTML pages has been far easier.
During my initial days of using this script, I got confused quite often. The parser provides a great many features and can do almost everything you would want a parser to do. The only problem is remembering the syntax and the way each function is called, along with the distinct parameters each of them takes.
I’ve made a list of code snippets that I use from time to time and that can come in handy for you all. Read on to understand the usage of Simple HTML DOM Parser and get ready-made PHP code for each task.
-
Downloading and storing structured data
Data can be obtained from three main sources : a URL, a static file or an HTML string. Use the following code to create a DOM from each of the three.
<?php
include('simple_html_dom.php');

//to parse a webpage
$html = file_get_html("http://nimishprabhu.com");

//to parse a file using relative location
$html = file_get_html("index.html");

//to parse a file using absolute location
$html = file_get_html("/home/admin/nimishprabhu.com/testfiles/index.html");

//to parse a string as html code
$html = str_get_html("<html><head><title>Cool HTML Parser</title></head><body><h2>PHP Simple HTML DOM Parser</h2><p>PHP Simple HTML DOM Parser is the best HTML DOM parser in any programming language.</p></body></html>");

//to fetch a webpage in a string and then parse
$data = file_get_contents("http://nimishprabhu.com"); //or you can use curl too, like me :)
// Some manipulation with the $data variable, for e.g.
$data = str_replace("Nimish", "NIMISH", $data);
//now parsing it into html
$html = str_get_html($data);
?>
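Loading can fail, for example when the URL is unreachable or the content is empty. As far as I know, in the 1.5 release of the parser file_get_html and str_get_html return false in that case instead of an object, and calling find() on that value is a fatal error. Below is a minimal sketch of a guard. ensure_dom is a hypothetical helper name, not part of the library, and the usage part assumes simple_html_dom.php has already been included.

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM Parser:
// throws when a loader returned false (or anything that is not an object).
function ensure_dom($html, $source)
{
    if (!is_object($html)) {
        throw new RuntimeException('Could not load HTML from ' . $source);
    }
    return $html;
}

// Usage, only run when the parser has been included
if (function_exists('str_get_html')) {
    try {
        $html = ensure_dom(str_get_html('<h2>PHP Simple HTML DOM Parser</h2>'), 'a string');
        echo $html->find('h2', 0)->plaintext;
    } catch (RuntimeException $e) {
        echo $e->getMessage();
    }
}
```

The same wrapper can be put around file_get_html($url), passing the URL as the second argument so the error message says which page failed.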
-
Finding HTML elements based on their tag names
Suppose you want to find every image on a webpage or, say, every hyperlink. We will use the find function to extract this information from the object. Here’s how to do it using Simple HTML DOM Parser :
<?php
include('simple_html_dom.php');

$html = file_get_html('http://nimishprabhu.com/');

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);

//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
    $images[] = $img->src;
}
print_r($images);

//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
    $headlines[] = $header->plaintext;
}
print_r($headlines);
?>
-
Extracting values of attributes from elements
Suppose you want to get the names of all input fields on a webpage, let’s say for e.g., http://nimishprabhu.com/chrome-extension-hello-world-example.html. If you look at the webpage you will notice that there is a comment form on the page which has input fields. Please note that the comment box is a textarea element and not an input element, so it will not be detected. But to detect the rest of the visible as well as hidden fields you can use the following code :
<?php
include('simple_html_dom.php');

$url = 'http://nimishprabhu.com/chrome-extension-hello-world-example.html';
$html = file_get_html($url);

foreach($html->find('input') as $input) {
    echo $input->name.'<br />';
}

// Output for above script :
// author
// email
// url
// submit
// comment_post_ID
// comment_parent
?>
-
Filtering elements based on values of its attributes
When a developer designs a page, he uses various attributes to uniquely identify and classify the information on the webpage. A parser is not human and hence cannot see the visual difference, but it can detect these attributes and filter the output to obtain a precise set of data. Let us take a practical example for better understanding. If you see this page : https://www.phpbb.com/community/viewtopic.php?f=46&t=543171 you can see the page is divided into header, content and footer. The content is further subdivided into posts. This page has only one post, but I chose it because it contains quite a lot of hyperlinks. Now suppose you want to extract only the hyperlinks in the post and not those of the entire page. The approach should be as follows :
Check the source of the webpage and find out whether the hyperlinks follow some kind of pattern. If you look closely you will find that all of them have class="postlink". This makes extracting them a piece of cake. Read the code below to see how to filter HTML elements based on values of attributes.
<?php
include('simple_html_dom.php');

$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
$html = file_get_html($url);

$links = array();
foreach($html->find('a[class="postlink"]') as $a) {
    $links[] = $a->href;
}
print_r($links);
?>
One thing worth noting here : you can use the “.” and “#” prefixes to filter on the class and id attributes respectively. So the above code will work without any other change if you write the filter as :
foreach($html->find('a.postlink') as $a)
-
Pattern matching while filtering attributes of elements
Consider the above example where we are extracting all links from the post. Say you want to find only the links of the sub forums in the community. If you notice all of them begin with http://www.phpbb.com/community/viewforum.php. So let’s filter the hyperlinks using “starts with” filter to fetch only the links starting with http://www.phpbb.com/community/viewforum.php
<?php
include('simple_html_dom.php');

$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
$html = file_get_html($url);

$links = array();
foreach($html->find('a[href^="http://www.phpbb.com/community/viewforum.php"]') as $a) {
    $links[] = $a->href;
}
print_r($links);
?>
Similarly, say if you want to find all links containing phpbb.com then you can filter using “contains” filter as follows :
foreach($html->find('a[href*="phpbb.com"]') as $a)
Sometimes you are sure only about the end part of the value of an attribute. Let’s say, for e.g., you are scraping a webpage which contains numerous div elements whose id attributes look something like :
<div id="1_message_id">content here</div>
<div id="2_message_id">content here</div>
and so on.
Then you can find such div elements using the “ends with” filter as follows :
foreach($html->find('div[id$="_message_id"]') as $div)
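To make the three filters concrete, here is a plain PHP sketch of the test each one performs on an attribute value. attr_matches is a hypothetical function written only for illustration; it is not how the parser implements its selectors, and it runs without including the parser.

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM Parser:
// mirrors what the "starts with", "contains" and "ends with" filters check.
function attr_matches($value, $operator, $pattern)
{
    switch ($operator) {
        case '^=': // starts with
            return strncmp($value, $pattern, strlen($pattern)) === 0;
        case '*=': // contains
            return strpos($value, $pattern) !== false;
        case '$=': // ends with
            return $pattern === '' || substr($value, -strlen($pattern)) === $pattern;
        default:   // operators not covered by this sketch
            return false;
    }
}

var_dump(attr_matches('http://www.phpbb.com/community/viewforum.php?f=46', '^=', 'http://www.phpbb.com/community/viewforum.php'));
var_dump(attr_matches('1_message_id', '$=', '_message_id'));
var_dump(attr_matches('message_id_1', '$=', '_message_id'));
```

The first two calls match and the third does not, because in the last value “_message_id” is not at the end.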
-
Adding / Changing attributes of the elements
Let’s say you want to change the value of an attribute of a particular element. For e.g., if you wish to change all the hyperlinks having class=postlink to class=topiclink, you can do so as follows :
<?php
include('simple_html_dom.php');

$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
$html = file_get_html($url);

foreach($html->find('a.postlink') as $a) {
    $a->class = 'topiclink';
}
echo $html;
?>
-
Finding nth element from parsed data
Note that the numbering of elements starts from 0 and not 1. Thus the first element will be found at 0th location. Let’s assume that you want to extract the hyperlink of the 3rd link with class postlink on a webpage, you can use the following approach :
<?php
include('simple_html_dom.php');

$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
$html = file_get_html($url);

echo $html->find('a.postlink',2)->href;
?>
-
Manipulating the inner content of tags
If you wish to clear the inner contents of the div with id as content, you can do so as follows :
$html->find('div#content',0)->innertext = '';
If you wish to append text to existing content, you can do so as follows :
$appendcode = '<p>This is the text to append to existing innertext</p>';
$html->find('div#content',0)->innertext .= $appendcode;
In order to prepend text to existing content, you can use the following code :
$prependcode = '<h2>Nice article below</h2>';
$html->find('div#content',0)->innertext = $prependcode . $html->find('div#content',0)->innertext;
-
Wrap the contents of an element inside a new element
Say you have an existing div with id content, now you made a wrapper div and want to enclose the content div in the wrapper div. Here’s how you do it :
$html->find('div#content',0)->outertext = '<div id="wrapper">' . $html->find('div#content',0)->outertext. '</div>';
-
Handling memory leak issues while using PHP Simple HTML DOM Parser
Last but definitely not the least, handling the memory leak issue. Once you start using this script extensively you may encounter memory exhausted errors and keep wondering what’s wrong with your script. The problem might be due to not handling the memory leak issue. I will not talk in detail about what a memory leak is or how this issue is caused but you can read quite a bit about it here. To handle this issue, don’t forget to clear the $html variable and unset it once it is no longer required.
$html->clear(); unset($html);
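When many pages are parsed in one run it is easy to forget the clear() call on some code path. One way to make it automatic is a small wrapper that always clears the document after use, sketched below. with_dom is a hypothetical helper name, not part of the library, and the usage part assumes simple_html_dom.php has already been included.

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM Parser:
// runs $callback on a parsed document and always calls clear() afterwards,
// even if the callback throws.
function with_dom($html, callable $callback)
{
    try {
        return $callback($html);
    } finally {
        $html->clear();
    }
}

// Usage, only run when the parser has been included
if (function_exists('str_get_html')) {
    $hrefs = with_dom(str_get_html('<a href="/one">1</a><a href="/two">2</a>'), function ($html) {
        $found = array();
        foreach ($html->find('a') as $a) {
            $found[] = $a->href; // copy plain strings out, not element objects
        }
        return $found;
    });
    print_r($hrefs);
}
```

The callback should return plain values such as strings or arrays of strings, since the element objects are no longer usable once the document has been cleared.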
You can also use the cool function created by Flash Thunder from StackOverFlow.com, check it out here along with its usage example.
I guess these examples are sufficient for you to get started with using PHP Simple HTML DOM Parser. If you have any doubts or queries use the comment form below. I will add more examples as per requests and queries. Hope this article helps you scrape data efficiently.
Hi.. please check html
100 Bullets (Mature Readers) #100 near mint[46373]MAXIMUM_ORDER_TEXT
$4.99
From this, I want to scrape only the title, i.e. “100 Bullets (Mature Readers) #100 near mint”
But I am getting both. Here is the o/p
[product_title] => Array
(
[0] => 100 Bullets (Mature Readers) #30 near mint
[10]
MAXIMUM_ORDER_TEXT
[1] => $1.99
[2] => 100 Bullets (Mature Readers) #100 near mint
[46373]
MAXIMUM_ORDER_TEXT
[3] => $4.99
[4] => 100 Bullets (Mature Readers) #32 near mint
[12]
MAXIMUM_ORDER_TEXT
[5] => $1.99
[6] => 100 Bullets (Mature Readers) #34 near mint
[14]
MAXIMUM_ORDER_TEXT
[7] => $1.99
[8] => 100th Anniversary Special Guardians of the Galaxy (2014 one shot) #1 (variant) near mint
please let me know what to do????????
Hi Jyoti
Observe the pattern and accordingly split the strings obtained.
For e.g.
$title = explode('[', $product_title);
or
$title = explode('near mint', $product_title);
Then use $title[0] to obtain the final output.
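Since the dumped $product_title is an array, the split has to be applied entry by entry. Here is a minimal sketch, assuming the entries look like the output in the question; the sample values and the line breaks inside them are copied from that output as an assumption.

```php
<?php
// Sample values modelled on the output shown in the question (assumed shape)
$product_title = array(
    "100 Bullets (Mature Readers) #30 near mint\n[10]\nMAXIMUM_ORDER_TEXT",
    '$1.99',
    "100 Bullets (Mature Readers) #100 near mint\n[46373]\nMAXIMUM_ORDER_TEXT",
    '$4.99',
);

$titles = array();
foreach ($product_title as $text) {
    // Price entries contain no "[" and are skipped
    if (strpos($text, '[') === false) {
        continue;
    }
    // Keep everything before the first "[" and drop surrounding whitespace
    $parts    = explode('[', $text);
    $titles[] = trim($parts[0]);
}
print_r($titles);
```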
Trust this helps.
Thanks & Regards
Nimish Prabhu
Call to undefined function file_get_html()
Same here,
Call to a member function find() on a non-object in….
any ideas?
thanks
Is there a way to exclude matched elements by position? For example, if I want to alter all the matched elements after the first one, but leave the first one alone?
$counter = 0;
foreach($elements as $elementKey => $elementValue) {
    if ($counter >= 1) {
        // your code here applies to all elements after the 1st one..
    }
    $counter++;
}
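An alternative sketch without a counter: find() called without an index returns a plain PHP array, so array_slice can drop the first match before looping. The stdClass objects below are stand-ins for the matched elements so the snippet runs without the parser.

```php
<?php
// Stand-ins for the element objects that find() returns as an array
$elements = array(new stdClass(), new stdClass(), new stdClass());
foreach ($elements as $element) {
    $element->class = 'postlink';
}

// array_slice($elements, 1) skips index 0 and keeps the rest;
// the objects are shared, so changes are visible through $elements too
foreach (array_slice($elements, 1) as $element) {
    $element->class = 'topiclink';
}

foreach ($elements as $element) {
    echo $element->class . "\n";
}
```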
Hi, what about sites which auto load contents on scrolling down? How to scrape such websites?
many thanks for this great documentation – this is really outstanding. keep up the great work – it rocks
Hi,
Is there a way I can insert values inside input fields?
Nimish,
Thanks for the tutorial.
Can you build a tutorial here where the DOM Parser will extract both the urls and anchor texts from links ?
If you want to go deeper then build a tutorial that shows us how to build a simple web crawler. This is where you feed the crawler a url and it fetches the page and extracts all the urls and their anchor texts from the links found on the page.
If you want to take this another step further then show us how to get the crawler to follow DoFollow links. And the link following should be done on 2 options.
Option 1: We set the link depth level;
Option 2: We set the crawler to stay on domain of the original page it fetches.
Option 3: We set the crawler to stay on the domains of the found links found within the set depth level (link following depth level).
I think this tutorial would get a good ranking in Google.
As appreciation for the tutorial, here’s a feed-back:
I found this tutorial on the 1st page of google search with the KWs:
scrape links and anchor texts of pages AND php AND curl
Email me when your tutorial is up and running.
Thanks
I was trying to get something from a page but nothing works. I have PHP 7.2.
I think the problem is with include('simple_html_dom.php');
thank you for any help