Top 10 Best Usage Examples of PHP Simple HTML DOM Parser

February 26, 2014 | NIMISH | PHP

Simple HTML DOM Parser is one of the best things that has happened to me. I remember the days when I used regular expressions and the preg_match_all function to fetch values from scraped text, and the results were not so good. Ever since I found this HTML DOM parser, fetching data and extracting values from HTML pages has been far easier.

During my initial days with this script, I was often confused. The parser provides so many features that it can do almost everything you would want a parser to do. The only problem is remembering the syntax and the way to call its various functions, each with its own distinct parameters.

I’ve made a list of code snippets that I use from time to time and that can come in handy for you all. Read on to understand the usage of Simple HTML DOM Parser and get ready-made PHP code for the same.

Downloading and storing structured data

Data can be obtained mainly from three different sources: a URL, a static file or an HTML string. Use the following code to create a DOM from each of these alternatives.


```
<?php

include('simple_html_dom.php');

//to parse a webpage
$html = file_get_html("http://nimishprabhu.com");

//to parse a file using relative location
$html = file_get_html("index.html");

//to parse a file using absolute location
$html = file_get_html("/home/admin/nimishprabhu.com/testfiles/index.html");

//to parse a string as html code
$html = str_get_html("<html><head><title>Cool HTML Parser</title></head><body><h2>PHP Simple HTML DOM Parser</h2><p>PHP Simple HTML DOM Parser is the best HTML DOM parser in any programming language.</p></body></html>");

//to fetch a webpage in a string and then parse
$data = file_get_contents("http://nimishprabhu.com"); //or you can use curl too, like me :)
//some manipulation with the $data variable, e.g.
$data = str_replace("Nimish", "NIMISH", $data);
//now parsing it into html
$html = str_get_html($data);

?>
```
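
A loader can fail, for instance when the page cannot be fetched, and a later call such as $html->find() then stops with an error like “Call to a member function find() on a non-object”. The sketch below is one way to fail early instead. It assumes the loader signals failure by returning false (or null). load_dom_or_fail is a hypothetical helper name, not part of Simple HTML DOM Parser, and it accepts any callable so that it can wrap either file_get_html or str_get_html.

```php
<?php

// Hypothetical helper, not part of the library: run a loader such as
// 'file_get_html' or 'str_get_html' and stop early if it did not
// return a usable result.
function load_dom_or_fail(callable $loader, string $source)
{
    $dom = $loader($source);

    // Assumption: the loader signals failure with false (or null).
    if ($dom === false || $dom === null) {
        throw new RuntimeException('Could not load or parse: ' . $source);
    }

    return $dom;
}
```

With the library included, a call could look like `$html = load_dom_or_fail('file_get_html', 'http://nimishprabhu.com');` inside a try/catch block.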

Finding HTML elements based on their tag names

Suppose you wanted to find each and every image on a webpage or, say, each and every hyperlink. We will use the “find” function to extract this information from the object. Here’s how to do it using Simple HTML DOM Parser:


```
<?php

include('simple_html_dom.php');

$html = file_get_html('http://nimishprabhu.com/');

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
 $links[] = $a->href;
}
print_r($links);

//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
 $images[] = $img->src;
}
print_r($images);

//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
 $headlines[] = $header->plaintext;
}
print_r($headlines);
?>
```
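
Arrays collected this way often contain duplicates, empty values and in-page anchors such as "#top". The following is a minimal sketch of a clean-up step in plain PHP. clean_links is a hypothetical helper name, and the check for non-string entries rests on my assumption that an element without the attribute can yield a non-string value.

```php
<?php

// Hypothetical helper: tidy an array of href/src values collected in a loop.
function clean_links(array $links): array
{
    $kept = array();
    foreach ($links as $link) {
        // Assumption: a missing attribute may produce a non-string entry.
        if (!is_string($link)) {
            continue;
        }
        $link = trim($link);
        // Skip empty values and in-page anchors such as "#top".
        if ($link === '' || $link[0] === '#') {
            continue;
        }
        $kept[] = $link;
    }
    // Drop duplicates and renumber the keys from 0.
    return array_values(array_unique($kept));
}
```

It can be applied to any of the arrays built above, e.g. `print_r(clean_links($links));`.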

Extracting values of attributes from elements

Suppose you want to get the names of all input fields on a webpage, e.g. http://nimishprabhu.com/chrome-extension-hello-world-example.html. If you look at the webpage you will notice a comment form with input fields. Please note that the comment box is a textarea element and not an input element, so it will not be detected. To detect the rest of the visible as well as hidden fields, you can use the following code:
```
<?php

include('simple_html_dom.php');

$url = 'http://nimishprabhu.com/chrome-extension-hello-world-example.html';

$html = file_get_html($url);

foreach($html->find('input') as $input) {
 echo $input->name.'<br />';
}

// Output for above script :
// author
// email
// url
// submit
// comment_post_ID
// comment_parent

?>
```
Filtering elements based on values of its attributes

When a developer designs a page, he uses various attributes to uniquely identify and classify the information on the webpage. A parser is not human and hence cannot visualize the difference, but it can detect these attributes and filter the output so as to obtain a precise set of data. Let us take a practical example for better understanding. If you look at this page: https://www.phpbb.com/community/viewtopic.php?f=46&t=543171 you can see the page is divided into header, content and footer. The content is further subdivided into posts. This page has only 1 post, but I chose it because it contains quite a lot of hyperlinks. Now suppose you wanted to extract only the hyperlinks in the post and not those of the entire page. The approach should be as follows:

Check the source of the webpage and find out whether the hyperlinks follow some kind of pattern. If you look closely you will find that all of them have class="postlink". This makes extracting them a piece of cake. Read the code below to see how to filter HTML elements based on the values of attributes.
```
<?php

include('simple_html_dom.php');

$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';

$html = file_get_html($url);
$links = array();
foreach($html->find('a[class="postlink"]') as $a) {
 $links[] = $a->href;
}

print_r($links);

?>
```
There is something worth noting here: you can use the “.” and “#” prefixes to filter on class and id attributes respectively. So the above code will work without any other change if you use the filter as:
```
foreach($html->find('a.postlink') as $a)
```
Pattern matching while filtering attributes of elements

Consider the above example where we are extracting all links from the post. Say you want to find only the links to the sub-forums in the community. If you look closely, all of them begin with http://www.phpbb.com/community/viewforum.php. So let’s filter the hyperlinks using the “starts with” filter to fetch only the links starting with that prefix:
```
<?php
include('simple_html_dom.php');

$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';

$html = file_get_html($url);
$links = array();
foreach($html->find('a[href^="http://www.phpbb.com/community/viewforum.php"]') as $a) {
 $links[] = $a->href;
}

print_r($links);

?>
```
Similarly, if you want to find all links containing phpbb.com, you can filter using the “contains” filter as follows:
```
foreach($html->find('a[href*="phpbb.com"]') as $a)
```
Sometimes you are sure only about the end part of an attribute’s value. Say, for example, you are scraping a webpage which contains numerous div elements whose id attributes look something like:
<div id="1_message_id">content here</div>
<div id="2_message_id">content here</div>
and so on.
Then you can find such div elements using the “ends with” filter as follows:
```
foreach($html->find('div[id$="_message_id"]') as $div)
```
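
To make the meaning of the three pattern filters concrete, here is a small illustration in plain PHP of what “starts with” (^=), “contains” (*=) and “ends with” ($=) test for on a single attribute value. attr_matches is a hypothetical function written for this article. It uses plain case-sensitive string comparison, which is an assumption on my part, and it is not how the parser implements these filters internally.

```php
<?php

// Illustration only: what the three pattern filters mean for one
// attribute value. This is not the parser's own implementation.
function attr_matches(string $value, string $operator, string $pattern): bool
{
    // An empty pattern trivially matches every value.
    if ($pattern === '') {
        return true;
    }
    switch ($operator) {
        case '^=': // starts with
            return strncmp($value, $pattern, strlen($pattern)) === 0;
        case '*=': // contains
            return strpos($value, $pattern) !== false;
        case '$=': // ends with
            return substr($value, -strlen($pattern)) === $pattern;
        default:
            throw new InvalidArgumentException('Unknown operator: ' . $operator);
    }
}
```

For example, `attr_matches('2_message_id', '$=', '_message_id')` is true, while the same check on 'message_id_2' is false.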
Adding / Changing attributes of the elements

Let’s say you want to change the value of an attribute of a particular element. For example, if you wished to change all the hyperlinks having class=postlink to class=topiclink, you can do so as follows:
```
<?php

include('simple_html_dom.php');

$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';

$html = file_get_html($url);

foreach($html->find('a.postlink') as $a) {
 $a->class = 'topiclink';
}

echo $html;
?>
```
Finding nth element from parsed data

Note that the numbering of elements starts from 0 and not 1, so the first element is found at index 0. Let’s assume that you want to extract the href of the 3rd link with class postlink on a webpage. You can use the following approach:
```
<?php
include('simple_html_dom.php');
$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
$html = file_get_html($url);
echo $html->find('a.postlink',2)->href;
?>
```
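
If the page has fewer matches than the requested index, there is no element to read ->href from and the line above fails. Below is a minimal sketch of a guard. It assumes that find() called without an index returns a plain array of elements, and nth_attr is a hypothetical helper name.

```php
<?php

// Hypothetical helper: read a property from the n-th match (0-based),
// or return a default when there is no such match or no such value.
function nth_attr(array $matches, int $n, string $attr, $default = null)
{
    if (!isset($matches[$n])) {
        return $default;
    }
    $value = $matches[$n]->$attr;

    // Assumption: a missing attribute shows up as false or null.
    return ($value === false || $value === null) ? $default : $value;
}
```

With the library loaded, this could be used as `echo nth_attr($html->find('a.postlink'), 2, 'href', 'not found');`.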

Manipulating the inner content of tags

If you wish to clear the inner contents of the div with id content, you can do so as follows:
```
$html->find('div#content',0)->innertext = '';
```
If you wish to append text to the existing content, you can do so as follows:
```
$appendcode = '<p>This is the text to append to existing innertext</p>';
$html->find('div#content',0)->innertext .= $appendcode;
```
In order to prepend text to the existing content, you can use the following code:
```
$prependcode = '<h2>Nice article below</h2>';
$html->find('div#content',0)->innertext = $prependcode . $html->find('div#content',0)->innertext;
```

Wrap the contents of an element inside a new element

Say you have an existing div with id content, and you want to enclose it in a new wrapper div. Here’s how you do it:
```
$html->find('div#content',0)->outertext = '<div id="wrapper">' . $html->find('div#content',0)->outertext. '</div>';
```
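
If you wrap elements in several places, the string building can be moved into a small function. This is only a sketch: wrap_in_div is a hypothetical helper name, and escaping the id with htmlspecialchars is my own precaution rather than something the parser requires.

```php
<?php

// Hypothetical helper: enclose a piece of HTML in a new div with the given id.
function wrap_in_div(string $outerHtml, string $id): string
{
    // Escape the id so that quotes in it cannot break the attribute.
    $safeId = htmlspecialchars($id, ENT_QUOTES);

    return '<div id="' . $safeId . '">' . $outerHtml . '</div>';
}
```

Usage would then be `$el = $html->find('div#content',0); $el->outertext = wrap_in_div($el->outertext, 'wrapper');`, which also avoids running the same find() twice.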
Handling memory leak issues while using PHP Simple HTML DOM Parser

Last but definitely not least is handling the memory leak issue. Once you start using this script extensively, you may encounter memory exhausted errors and keep wondering what’s wrong with your script. The problem might be due to not handling the memory leak issue. I will not talk in detail about what a memory leak is or how this issue is caused, but you can read quite a bit about it here. To handle this issue, don’t forget to clear the $html variable and unset it once it’s no longer required.
```
$html->clear();
unset($html);
```
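
The clean-up matters most when many pages are parsed in one run. Here is a minimal sketch of such a loop, in which every DOM is cleared before the next one is loaded. process_pages is a hypothetical helper name. The loader is passed in as a callable (for example 'file_get_html'), and the sketch assumes the loader returns an object with a clear() method on success and a non-object on failure.

```php
<?php

// Hypothetical helper: load each source, hand the DOM to a callback,
// then release the DOM before moving on to the next source.
function process_pages(array $sources, callable $loader, callable $handler): array
{
    $results = array();
    foreach ($sources as $source) {
        $dom = $loader($source);
        if (!is_object($dom)) {
            continue; // loading failed, nothing to clear
        }
        try {
            // The handler should return plain values, not DOM nodes,
            // so that nothing keeps the parsed page alive.
            $results[$source] = $handler($dom);
        } finally {
            $dom->clear();
            unset($dom);
        }
    }
    return $results;
}
```

With the library included, a call could look like `process_pages($urls, 'file_get_html', function ($html) { return $html->find('title',0)->plaintext; });`.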
You can also use the cool function created by Flash Thunder from StackOverFlow.com, check it out here along with its usage example.

I guess these examples are sufficient for you to get started with PHP Simple HTML DOM Parser. If you have any doubts or queries, use the comment form below. I will add more examples as per requests and queries. Hope this article helps you scrape data efficiently.


11 thoughts on “Top 10 Best Usage Examples of PHP Simple HTML DOM Parser”

jyoti says:

June 19, 2015 at 7:31 am

Hi.. please check html

100 Bullets (Mature Readers) #100 near mint[46373]MAXIMUM_ORDER_TEXT
$4.99

From this, I want to scrape only the title, i.e. “100 Bullets (Mature Readers) #100 near mint”,
but I am getting both. Here is the o/p:
[product_title] => Array
(
[0] => 100 Bullets (Mature Readers) #30 near mint
[10]
MAXIMUM_ORDER_TEXT
[1] => $1.99
[2] => 100 Bullets (Mature Readers) #100 near mint
[46373]
MAXIMUM_ORDER_TEXT
[3] => $4.99
[4] => 100 Bullets (Mature Readers) #32 near mint
[12]
MAXIMUM_ORDER_TEXT
[5] => $1.99
[6] => 100 Bullets (Mature Readers) #34 near mint
[14]
MAXIMUM_ORDER_TEXT
[7] => $1.99
[8] => 100th Anniversary Special Guardians of the Galaxy (2014 one shot) #1 (variant) near mint

please let me know what to do????????

1. NIMISH says:
  
  July 25, 2015 at 12:25 pm
  
  Hi Jyoti
  
  Observe the pattern and accordingly split the strings obtained.
  
  For e.g.
  
  $title = explode('[', $product_title);
  or
  $title = explode('near mint', $product_title);
  
  Then use $title[0] to obtain the final output.
  
  Trust this helps.
  
  Thanks & Regards
  
  Nimish Prabhu
  
swapnil says:

February 8, 2016 at 10:25 am

Call to undefined function file_get_html()

1. sam says:
  
  October 1, 2017 at 8:33 am
  
  Same here,
  Call to a member function find() on a non-object in….
  any ideas?
  
  thanks
  
KarenM says:

March 24, 2017 at 2:07 am

Is there a way to exclude matched elements by position? For example, if I want to alter all the matched elements after the first one, but leave the first one alone?

1. Yavor Kirov says:
  
  March 22, 2018 at 5:45 am
  
  $counter = 0;
  foreach($elements as $elementKey => $elementValue) {
      if ($counter >= 1) {
          // your code here applies to all elements after the 1st one..
      }
      $counter++;
  }
  
Salil says:

November 5, 2017 at 2:12 pm

Hi, what about sites which auto load contents on scrolling down? How to scrape such websites?

matze says:

January 30, 2018 at 9:11 pm

many thanks for this great documentation – this is really outstanding. keep up the great work – it rocks

AkhalK says:

March 15, 2018 at 7:58 am

Hi,
Is there a way I can insert values inside input fields?

Hafis Ali says:

May 17, 2018 at 5:02 pm

Nimish,

Thanks for the tutorial.
Can you build a tutorial here where the DOM Parser will extract both the urls and anchor texts from links ?
If you want to go deeper then build a tutorial that shows us how to build a simple web crawler. This is where you feed the crawler a url and it fetches the page and extracts all the urls and their anchor texts from the links found on the page.
If you want to take this another step further then show us how to get the crawler to follow DoFollow links. And the link following should be done on 2 options.
Option 1: We set the link depth level;
Option 2: We set the crawler to stay on domain of the original page it fetches.
Option 3: We set the crawler to stay on the domains of the found links found within the set depth level (link following depth level).

I think this tutorial would get a good ranking in Google.
As appreciation for the tutorial, here’s a feed-back:
I found this tutorial on the 1st page of google search with the KWs:
scrape links and anchor texts of pages AND php AND curl

Email me when your tutorial is up and running.

Thanks

Mario Goren says:

April 23, 2019 at 9:58 pm

I was trying to get something from a page but nothing works. I have PHP 7.2.

I think the problem is with include('simple_html_dom.php');

thank you for any help
