After the previous article (which I published just 5 minutes ago) on the PHP Reflection API and how we can use it to reverse engineer scripts, let me put up another semi-advanced article.

Though PHP is a web language mainly used to serve web pages, there are numerous other things that can be achieved with it. Web bots or spiders are used for many purposes, and PHP is not the most popular language for developing one. Usually a bot's task involves accessing web pages and fetching information, which is something that can be achieved easily with PHP. Today we will learn how to do exactly this: accessing a page and fetching information from it.

How to Create Bots, Spiders and Crawlers with PHP

As said before, a basic bot's task is to fetch a web page and parse the necessary information from it. To accomplish this task we need to follow these steps:
* Request a page
* Follow redirects if asked
* Fetch the page
* Parse the page for information

This is the basic procedure followed by even the most advanced bots. We will create a link bot that checks whether a particular link exists on a page or not.

In order to do this in PHP we will use the CURL extension, so make sure it's installed and activated. Instead of calling the CURL functions manually, we will use a CURL wrapper class. This class can be found here (http://phpclasses.org).
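
The wrapper class itself is not shown in this article, so here is a minimal sketch of what such a class might look like, built only on PHP's standard cURL functions. The name SketchCurl is hypothetical, and the real class from phpclasses.org may work differently; only the method names (setopt, exec, hasError, close) mirror the ones the script in this article calls.

```php
<?php
// Hypothetical stand-in for a cURL wrapper class; not the phpclasses.org code.
class SketchCurl
{
    private $handle;

    public function __construct($url)
    {
        $this->handle = curl_init($url);
        // Return the page as a string instead of printing it directly.
        curl_setopt($this->handle, CURLOPT_RETURNTRANSFER, true);
    }

    // Pass any CURLOPT_* option straight through to the handle.
    public function setopt($option, $value)
    {
        return curl_setopt($this->handle, $option, $value);
    }

    // Perform the request; returns the body, or false on failure.
    public function exec()
    {
        return curl_exec($this->handle);
    }

    // Returns the error message, or false when no error has occurred.
    public function hasError()
    {
        $message = curl_error($this->handle);
        return $message !== '' ? $message : false;
    }

    // Dropping the last reference to the handle releases it.
    public function close()
    {
        $this->handle = null;
    }
}
```

With a class like this, each step of the procedure maps to one call: the constructor prepares the request, setopt turns on redirect following, and exec fetches the page.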

The code
Suppose the bot gets the URL of the page and the link to look for as command line arguments. The code to read them and fetch the page is:

 
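#!/usr/bin/php
<?php
// Expect exactly two arguments: the page URL and the link to look for.
if ($argc != 3 || empty($argv[1]) || empty($argv[2])) {
    echo "Please give a URL and pattern\n";
    exit(1);
}

// Show which URL is being fetched.
echo $argv[1] . "\n";
require_once("class.curl.php");

// The wrapper class takes the URL of the page as its argument.
$c = new curl($argv[1]);
// Follow redirects if the server asks for them.
$c->setopt(CURLOPT_FOLLOWLOCATION, true);
// Fetch the page and keep its contents.
$data = $c->exec();
if ($theError = $c->hasError()) {
    echo $theError . "\n";
    $c->close();
    exit(1);
}
$c->close();

?>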

The code above checks whether both arguments were passed and, if so, starts the CURL class. The CURL class takes the URL of the page as an argument. After that we set a CURL option to follow redirects. You see how easily we have gone this far. Writing this without CURL would have taken several more lines of code. Finally we call the exec method, which fetches and returns all the page data.

Parsing the data
We have done the hard part. We have fetched the page; now we just have to check whether the particular link exists or not. To do this, we will use the second argument, which specifies the link to look for. We will just use the simple strpos function. It is possible to use RegEx for better pattern matching, but we will use the simple version to keep it simple.

 
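   // strpos returns 0 when the match is at the very start of the data,
   // so compare against false instead of using the result directly.
   if (strpos($data, $argv[2]) !== false) {
   	  echo "Link found\n";
   } else {
   	  echo "Link not found\n";
   }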
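
For anyone who wants to try the RegEx route mentioned above, here is a rough sketch. The function names extractLinks and pageHasLink are made up for this example, and the pattern is a simplification: it only handles quoted href values and will be fooled by unusual markup, so treat it as a starting point rather than a proper HTML parser.

```php
<?php
// Hypothetical helper: collect the href value of every <a> tag in $html.
function extractLinks($html)
{
    // Group 1 captures the opening quote, group 2 the href value.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[2];
}

// Hypothetical helper: true when $link is the exact target of some anchor.
function pageHasLink($html, $link)
{
    return in_array($link, extractLinks($html), true);
}

$sample = '<p><a href="http://example.com/">Home</a> '
        . '<a class="x" href=\'/about\'>About</a></p>';

var_dump(pageHasLink($sample, 'http://example.com/')); // bool(true)
var_dump(pageHasLink($sample, 'http://example.com'));  // bool(false)
```

Unlike strpos, this only reports a match when the text is the complete target of an anchor tag, so a URL that merely appears somewhere in the page text is not counted.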

This is the simplest bot possible. You can extend it by adding more validation, a database and so on. The purpose of this article was to show that creating a bot, spider or crawler is not as difficult a task as it may sound. It is just another type of data recognition. Now create some advanced bots with PHP and put the links to them in the comments. I would love to hear more about this topic.
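
As one example of the extra validation mentioned above, the bot could refuse anything that is not a well-formed http or https URL before handing it to CURL. This is only a sketch; the function name isFetchableUrl is hypothetical, and you may want stricter or looser rules for your own bot.

```php
<?php
// Hypothetical helper: accept only well-formed http:// or https:// URLs.
function isFetchableUrl($url)
{
    // filter_var returns false when the string is not a valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Reject other schemes such as ftp:// or file://.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(isFetchableUrl('http://example.com/page')); // bool(true)
var_dump(isFetchableUrl('ftp://example.com/'));      // bool(false)
var_dump(isFetchableUrl('not a url'));               // bool(false)
```

A check like this would go right after the argument count test, before the CURL class is started.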
