Class PHPCrawler

Description

PHPCrawl main class

  • author: Uwe Hunfeld (phpcrawl@cuab.de)
  • version: 0.81

Located in /libs/PHPCrawler/PHPCrawler.class.php (line 10)


Direct descendants
Class Description
SMCCrawler Loading external PHPCrawler-class
Variable Summary
Method Summary
PHPCrawler __construct ()
bool addBasicAuthentication (string $url_regex, string $username, string $password)
bool addContentTypeReceiveRule (string $regex)
void addFollowMatch ( $regex)
bool addLinkPriority (string $regex, int $level)
bool addLinkSearchContentType (string $regex)
void addNonFollowMatch ( $regex)
bool addPostData (string $url_regex, array $post_data_array)
void addReceiveContentType ( $regex)
void addReceiveToMemoryMatch ( $regex)
void addReceiveToTmpFileMatch ( $regex)
bool addStreamToFileContentType (string $regex)
bool addURLFilterRule (string $regex)
bool addURLFollowRule (string $regex)
int checkForAbort ()
void cleanup ()
void disableExtendedLinkInfo ( $mode)
bool enableAggressiveLinkSearch (bool $mode)
bool enableCookieHandling (bool $mode)
int getCrawlerId ()
void getReport ()
void go ()
void goMultiProcessed ([int $process_count = 3], [int $multiprocess_mode = 1])
int handlePageData (array &$page_data)
void obeyNoFollowTags (bool $mode)
bool obeyRobotsTxt (bool $mode)
bool processUrl (PHPCrawlerURLDescriptor $UrlDescriptor)
void resume (int $crawler_id)
bool setConnectionTimeout (int $timeout)
bool setContentSizeLimit (int $bytes)
void setCookieHandling ( $mode)
bool setFollowMode (int $follow_mode)
bool setFollowRedirects (bool $mode)
void setFollowRedirectsTillContent (bool $mode)
void setLinkExtractionTags (array $tag_array)
void setPageLimit (int $limit, [bool $only_count_received_documents = false])
bool setPort (int $port)
void setProxy (string $proxy_host, int $proxy_port, [string $proxy_username = null], [string $proxy_password = null])
bool setStreamTimeout (int $timeout)
void setTmpFile ( $tmp_file)
bool setTrafficLimit (int $bytes, [bool $complete_requested_files = true])
bool setURL (string $url)
bool setUrlCacheType (int $url_cache_type)
void setUserAgentString (string $user_agent)
bool setWorkingDirectory (string $directory)
Variables
int $child_process_number = null (line 152)

Number of child-process (NOT the PID!)

  • access: protected
mixed $class_version = "0.81" (line 12)
  • access: public
PHPCrawlerCookieCache $CookieCache (line 33)

The PHPCrawlerCookieCache-Object

  • access: protected
bool $cookie_handling_enabled = true (line 98)

Flag indicating whether cookie-handling is enabled or disabled

  • access: protected
string $crawler_uniqid = null (line 129)

UID of this instance of the crawler

  • access: protected
PHPCrawlerDocumentInfoQueue $DocumentInfoQueue = null (line 173)

DocumentInfoQueue-object

  • access: protected
int $document_limit = 0 (line 77)

Limit of documents to receive

  • access: protected
mixed $follow_redirects_till_content = true (line 175)
  • access: protected
mixed $is_chlid_process = false (line 110)

Flag indicating whether this instance is running in a child-process (if crawler runs multi-processed)

  • access: protected
mixed $is_parent_process = false (line 115)

Flag indicating whether this instance is running in the parent-process (if crawler runs multi-processed)

  • access: protected
PHPCrawlerURLCache $LinkCache (line 26)

The PHPCrawlerLinkCache-Object

  • access: public
mixed $link_priority_array = array() (line 145)
  • access: protected
int $multiprocess_mode = 0 (line 166)

Multiprocess-mode the crawler is running in.

  • var: One of the PHPCrawlerMultiProcessModes-constants
  • access: protected
mixed $obey_robots_txt = false (line 70)

Defines whether robots.txt-file should be obeyed

  • access: protected
bool $only_count_received_documents = true (line 91)

Defines if only documents that were received will be counted.

  • access: protected
PHPCrawlerHTTPRequest $PageRequest (line 19)

The PHPCrawlerHTTPRequest-Object

  • access: protected
int $porcess_abort_reason = null (line 105)

The reason why the process was aborted/finished.

  • var: One of the PHPCrawlerAbortReasons::ABORTREASON-constants.
  • access: protected
PHPCrawlerProcessCommunication $ProcessCommunication = null (line 159)

ProcessCommunication-object

  • access: protected
PHPCrawlerDocumentInfoQueue $resumtion_enabled = false (line 182)

Flag indicating whether resumption is activated

  • access: protected
PHPCrawlerRobotsTxtParser $RobotsTxtParser (line 47)

The RobotsTxtParser-Object

  • access: protected
string $starting_url = "" (line 63)

The URL the crawler should start with.

The URL is full qualified and normalized.

  • access: protected
int $traffic_limit = 0 (line 84)

Limit of bytes to receive

  • var: The limit in bytes
  • access: protected
mixed $urlcache_purged = false (line 187)

Flag indicating whether the URL-cache was purged at the beginning of a crawling-process

  • access: protected
PHPCrawlerURLFilter $UrlFilter (line 40)

The UrlFilter-Object

  • access: protected
int $url_cache_type = 1 (line 122)

URL cache-type.

  • var: One of the PHPCrawlerUrlCacheTypes::URLCACHE..-constants.
  • access: protected
PHPCrawlerUserSendDataCache $UserSendDataCache (line 54)

UserSendDataCache-object.

  • access: protected
string $working_base_directory (line 136)

Base-directory for temporary directories

  • access: protected
string $working_directory = null (line 143)

Complete path to the temporary directory

  • access: protected
Methods
Constructor __construct (line 192)

Initiates a new crawler.

  • access: public
PHPCrawler __construct ()
addBasicAuthentication (line 1549)

Adds a basic-authentication (username and password) to the list of basic authentications that will be sent with requests.

Example:

    $crawler->addBasicAuthentication("#http://www\.foo\.com/protected_path/#", "myusername", "mypasswd");
This lets the crawler send the authentication "myusername/mypasswd" with every request for content placed in the path "protected_path" on the host "www.foo.com".

  • section: 10 Other settings
  • access: public
bool addBasicAuthentication (string $url_regex, string $username, string $password)
  • string $url_regex: Regular-expression defining the URL(s) the authentication should be sent to.
  • string $username: The username
  • string $password: The password
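
The following is a minimal sketch of what such a rule amounts to, not PHPCrawl's actual implementation. It assumes a hypothetical helper basicAuthHeaderFor() and a hypothetical rule array; the only facts it relies on are that the rule is a regular expression tested against the URL and that HTTP basic authentication sends the base64-encoded string "username:password".

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns the Authorization header
// for the first rule whose regex matches the URL, or null if no rule matches.
function basicAuthHeaderFor(string $url, array $rules): ?string
{
    foreach ($rules as $rule) {
        if (preg_match($rule["url_regex"], $url) === 1) {
            // HTTP basic authentication: base64("username:password")
            $token = base64_encode($rule["username"] . ":" . $rule["password"]);
            return "Authorization: Basic " . $token;
        }
    }
    return null;
}

// Hypothetical rule list mirroring the example above
$rules = [
    [
        "url_regex" => "#http://www\.foo\.com/protected_path/#",
        "username"  => "myusername",
        "password"  => "mypasswd",
    ],
];

// A URL below "protected_path" gets a header, any other URL gets null
var_dump(basicAuthHeaderFor("http://www.foo.com/protected_path/page.html", $rules));
var_dump(basicAuthHeaderFor("http://www.foo.com/public/page.html", $rules));
```

Note that the regex is matched anywhere in the URL, so a rule should be specific enough not to send credentials to unintended locations.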
addContentTypeReceiveRule (line 1182)

Adds a rule to the list of rules that decide which pages or files - regarding their content-type - should be received.

After receiving the HTTP-header of a followed URL, the crawler checks - based on the given rules - whether the content of that URL should be received. If no rule matches the content-type of the document, the content won't be received.

Example:

    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addContentTypeReceiveRule("#text/css#");
These rules let the crawler receive the content/source of pages with the Content-Type "text/html" or "text/css". Other pages or files with different content-types (e.g. "image/gif") won't be received (if these are the only rules added to the list).

IMPORTANT: By default, if no rule was added to the list, the crawler receives all content.

Note: To reduce the traffic the crawler causes, you should only add content-types of pages/files you really want to receive. At the very least you should add the content-type "text/html" to this list, otherwise the crawler can't find any links.

  • return: TRUE if the rule was added to the list. FALSE if the given regex is not valid.
  • section: 2 Filter-settings
  • access: public
bool addContentTypeReceiveRule (string $regex)
  • string $regex: The rule as a regular-expression
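
The rule semantics described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: shouldReceiveContent() is a hypothetical helper, and the rule list is passed in as a plain array.

```php
<?php
// Hypothetical helper (not part of PHPCrawl) illustrating the documented semantics.
function shouldReceiveContent(string $content_type, array $receive_rules): bool
{
    // Documented default: if no rule was added, all content is received
    if (count($receive_rules) === 0) {
        return true;
    }

    // Otherwise at least one rule has to match the content-type
    foreach ($receive_rules as $regex) {
        if (preg_match($regex, $content_type) === 1) {
            return true;
        }
    }
    return false;
}

$rules = ["#text/html#", "#text/css#"];

var_dump(shouldReceiveContent("text/html; charset=utf-8", $rules)); // matches the first rule
var_dump(shouldReceiveContent("image/gif", $rules));                // matches no rule
var_dump(shouldReceiveContent("image/gif", []));                    // no rules: default applies
```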
addFollowMatch (line 1255)

Alias for addURLFollowRule().

  • deprecated:
  • section: 11 Deprecated
  • access: public
void addFollowMatch ( $regex)
  • $regex
addLinkExtractionTags (line 1525)

Sets the list of html-tags from which links should be extracted.

This method was named wrongly in previous versions of phpcrawl. It does not ADD tags, it SETS the tags from which links should be extracted.

Example:

    $crawler->addLinkExtractionTags("href", "src");

  • deprecated: Please use setLinkExtractionTags()
  • section: 11 Deprecated
  • access: public
void addLinkExtractionTags ()
addLinkPriority (line 1073)

Adds a regular expression together with a priority-level to the list of rules that decide which links should be preferred.

Links/URLs that match an expression with a high priority-level will be followed before links with a lower level. All links that don't match any of the given rules will get the level 0 (lowest level) automatically.

The level can be any positive integer.

Example:

Telling the crawler to follow links that contain the string "forum" before links that contain ".gif" before all other found links.

    $crawler->addLinkPriority("/forum/", 10);
    $crawler->addLinkPriority("/\.gif/", 5);

  • return: TRUE if a valid preg-pattern is given as argument and was successfully added, otherwise it returns FALSE.
  • section: 10 Other settings
bool addLinkPriority (string $regex, int $level)
  • string $regex: Regular expression defining the rule
  • int $level: The priority-level
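
To make the ordering concrete, here is a small sketch with hypothetical helpers priorityLevelFor() and sortByPriority(). It assumes that a URL matching several rules gets the highest matching level; the documentation above does not state how PHPCrawl resolves that case.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): level of a URL, 0 if no rule matches.
// Assumption: if several rules match, the highest level wins.
function priorityLevelFor(string $url, array $priority_rules): int
{
    $level = 0;
    foreach ($priority_rules as $regex => $rule_level) {
        if (preg_match($regex, $url) === 1) {
            $level = max($level, $rule_level);
        }
    }
    return $level;
}

// Hypothetical helper: orders URLs so that higher priority-levels come first.
function sortByPriority(array $urls, array $priority_rules): array
{
    usort($urls, function (string $a, string $b) use ($priority_rules): int {
        return priorityLevelFor($b, $priority_rules) <=> priorityLevelFor($a, $priority_rules);
    });
    return $urls;
}

// The two rules from the example above, as regex => level
$priority_rules = ["/forum/" => 10, "/\.gif/" => 5];

$urls = [
    "http://www.foo.com/about.html",
    "http://www.foo.com/images/logo.gif",
    "http://www.foo.com/forum/index.php",
];

print_r(sortByPriority($urls, $priority_rules));
```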
addLinkSearchContentType (line 1700)

Adds a rule to the list of rules that decide in what kind of documents the crawler should search for links (regarding their content-type).

By default the crawler ONLY searches for links in documents of type "text/html". Use this method to add one or more other content-types the crawler should check for links.

Example:

    $crawler->addLinkSearchContentType("#text/css# i");
    $crawler->addLinkSearchContentType("#text/xml# i");
These rules let the crawler search for links in HTML-, CSS- and XML-documents.

Please note: It is NOT recommended to let the crawler check for links in EVERY document-type! This could slow down the crawling-process dramatically (e.g. if the crawler receives large binary-files like images and tries to find links in them).

  • return: TRUE if the rule was successfully added
  • section: 6 Linkfinding settings
  • access: public
bool addLinkSearchContentType (string $regex)
  • string $regex: Regular-expression defining the rule
addNonFollowMatch (line 1267)

Alias for addURLFilterRule().

  • deprecated:
  • section: 11 Deprecated
  • access: public
void addNonFollowMatch ( $regex)
  • $regex
addPostData (line 1781)

Adds post-data together with a URL-rule to the list of post-data to send with requests.

Example:

    $post_data = array("username" => "me", "password" => "my_password", "action" => "do_login");
    $crawler->addPostData("#http://www\.foo\.com/login\.php#", $post_data);
This example sends the post-values "username=me", "password=my_password" and "action=do_login" to the URL http://www.foo.com/login.php

  • section: 10 Other settings
  • access: public
bool addPostData (string $url_regex, array $post_data_array)
  • string $url_regex: Regular expression defining the URL(s) the post-data should be sent to.
  • array $post_data_array: Post-data-array, the array-keys are the post-data-keys, the array-values the post-values (like array("post_key1" => "post_value1", "post_key2" => "post_value2")).
addReceiveContentType (line 1194)

Alias for addContentTypeReceiveRule().

  • deprecated:
  • section: 11 Deprecated
  • access: public
void addReceiveContentType ( $regex)
  • $regex
addReceiveToMemoryMatch (line 1363)

Has no function anymore!

This method was redundant, please use addStreamToFileContentType(). It only still exists for compatibility reasons.

  • deprecated: This method has no function anymore since v 0.8.
  • section: 11 Deprecated
  • access: public
void addReceiveToMemoryMatch ( $regex)
  • $regex
addReceiveToTmpFileMatch (line 1349)

Alias for addStreamToFileContentType().

  • deprecated:
  • section: 11 Deprecated
  • access: public
void addReceiveToTmpFileMatch ( $regex)
  • $regex
addStreamToFileContentType (line 1300)

Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file.

If a content-type of a page or file matches one of these rules, the content will be streamed directly into a temporary file without claiming local RAM.

It's recommended to add all content-types of files that may be large in order to prevent memory-overflows. By default the crawler will receive all content to memory!

The content/source of pages and files that were streamed to file is not accessible directly within the overridden method handleDocumentInfo(); instead you get information about the file the content was stored in (see properties PHPCrawlerDocumentInfo::received_to_file and PHPCrawlerDocumentInfo::content_tmp_file).

Please note that this setting doesn't affect the link-finding results, file-streams will be checked for links as well.

A common setup may look like this example:

    // Basically let the crawler receive every content (default-setting)
    $crawler->addContentTypeReceiveRule("##");

    // Tell the crawler to stream everything but "text/html"-documents to a tmp-file
    $crawler->addStreamToFileContentType("#^((?!text/html).)*$#");

  • return: TRUE if the rule was added to the list and the regex is valid.
  • section: 10 Other settings
  • access: public
bool addStreamToFileContentType (string $regex)
  • string $regex: The rule as a regular-expression
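
The negative-lookahead rule in the setup above is not obvious at first sight, so here is a short check of what it matches. The helper streamsToFile() is hypothetical and only applies the rule with preg_match(); it says nothing about how PHPCrawl stores the files.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): TRUE if any rule matches the content-type.
function streamsToFile(string $content_type, array $stream_rules): bool
{
    foreach ($stream_rules as $regex) {
        if (preg_match($regex, $content_type) === 1) {
            return true;
        }
    }
    return false;
}

// The rule from the setup above: "(?!text/html)." consumes one character only if
// "text/html" does not start at that position, so the whole pattern matches only
// strings that do not contain "text/html" anywhere.
$stream_rules = ['#^((?!text/html).)*$#'];

var_dump(streamsToFile("application/pdf", $stream_rules));          // streamed to a tmp-file
var_dump(streamsToFile("image/gif", $stream_rules));                // streamed to a tmp-file
var_dump(streamsToFile("text/html; charset=utf-8", $stream_rules)); // kept in memory
```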
addURLFilterRule (line 1243)

Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.

If the crawler finds a URL and this URL matches one of the given regular-expressions, the crawler will ignore this URL and won't follow it.

Example:

    $crawler->addURLFilterRule("#(jpg|jpeg|gif|png|bmp)$# i");
    $crawler->addURLFilterRule("#(css|js)$# i");
These rules let the crawler ignore URLs that end with "jpg", "jpeg", "gif", ..., "css" and "js".

  • return: TRUE if the regex is valid and the rule was added to the list, otherwise FALSE.
  • section: 2 Filter-settings
  • access: public
bool addURLFilterRule (string $regex)
  • string $regex: Regular-expression defining the rule
addURLFollowRule (line 1220)

Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly.

If the crawler finds a URL and this URL doesn't match any of the given regular-expressions, the crawler will ignore this URL and won't follow it.

NOTE: By default and if no rule was added to this list, the crawler will NOT filter ANY URLs, every URL the crawler finds will be followed (except the ones "excluded" by other options of course).

Example:

    $crawler->addURLFollowRule("#(htm|html)$# i");
    $crawler->addURLFollowRule("#(php|php3|php4|php5)$# i");
These rules let the crawler ONLY follow URLs/links that end with "html", "htm", "php", "php3" etc.

  • return: TRUE if the regex is valid and the rule was added to the list, otherwise FALSE.
  • section: 2 Filter-settings
  • access: public
bool addURLFollowRule (string $regex)
  • string $regex: Regular-expression defining the rule
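
How follow rules and filter rules combine can be sketched like this. The helper shouldFollowUrl() is hypothetical and only restates the semantics documented for addURLFollowRule() and addURLFilterRule(); other options (follow-mode, robots.txt and so on) are left out.

```php
<?php
// Hypothetical helper (not part of PHPCrawl) combining both rule lists.
function shouldFollowUrl(string $url, array $follow_rules, array $filter_rules): bool
{
    // A URL matching any filter rule is ignored
    foreach ($filter_rules as $regex) {
        if (preg_match($regex, $url) === 1) {
            return false;
        }
    }

    // Documented default: without follow rules, every remaining URL is followed
    if (count($follow_rules) === 0) {
        return true;
    }

    // Otherwise the URL has to match at least one follow rule
    foreach ($follow_rules as $regex) {
        if (preg_match($regex, $url) === 1) {
            return true;
        }
    }
    return false;
}

$follow_rules = ["#(htm|html)$#i", "#(php|php3|php4|php5)$#i"];
$filter_rules = ["#(jpg|jpeg|gif|png|bmp)$#i", "#(css|js)$#i"];

var_dump(shouldFollowUrl("http://www.foo.com/index.html", $follow_rules, $filter_rules));
var_dump(shouldFollowUrl("http://www.foo.com/logo.PNG", $follow_rules, $filter_rules));
var_dump(shouldFollowUrl("http://www.foo.com/manual.pdf", $follow_rules, $filter_rules));
```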
checkForAbort (line 730)

Checks if the crawling-process should be aborted.

  • return: NULL if the process shouldn't be aborted yet, otherwise one of the PHPCrawlerAbortReasons::ABORTREASON-constants.
  • access: protected
int checkForAbort ()
cleanup (line 795)

Cleans up the crawler after it has finished.

  • access: protected
void cleanup ()
createWorkingDirectory (line 775)

Creates the working-directory for this instance of the crawler.

  • access: protected
void createWorkingDirectory ()
disableExtendedLinkInfo (line 1574)

Has no function anymore.

This method has no function anymore, it only still exists for compatibility reasons.

  • deprecated:
  • section: 11 Deprecated
  • access: public
void disableExtendedLinkInfo ( $mode)
  • $mode
enableAggressiveLinkSearch (line 1477)

Enables or disables aggressive link-searching.

If this is set to FALSE, the crawler tries to find links only inside html-tags (< and >). If this is set to TRUE, the crawler tries to find links everywhere in an html-page, even outside of html-tags. The default value is TRUE.

Please note that if aggressive link-searching is enabled, the crawler may find links that are not meant as links, and it may also find links in script-parts of pages that can't be rebuilt correctly, since there is no javascript-parser/interpreter implemented (e.g. javascript-code like document.location.href= a_var + ".html").

Disabling aggressive link-searching results in a better crawling-performance.

  • section: 6 Linkfinding settings
  • access: public
bool enableAggressiveLinkSearch (bool $mode)
  • bool $mode
enableCookieHandling (line 1441)

Enables or disables cookie-handling.

If cookie-handling is set to TRUE, the crawler will handle all cookies sent by webservers just like a common browser does. The default-value is TRUE.

It's strongly recommended to set or leave the cookie-handling enabled!

  • section: 10 Other settings
  • access: public
bool enableCookieHandling (bool $mode)
  • bool $mode
enableResumption (line 1875)

Prepares the crawler for process-resumption.

In order to be able to resume an aborted/terminated crawling-process, it is necessary to initially call the enableResumption() method in your script/project.

For further details on how to resume aborted processes please see the documentation of the resume() method.

  • section: 9 Process resumption
  • access: public
void enableResumption ()
getCrawlerId (line 1792)

Returns the unique ID of the instance of the crawler

  • section: 9 Process resumption
  • access: public
int getCrawlerId ()
getProcessReport (line 814)

Returns summarizing report-information about the crawling-process after it has finished.

  • return: PHPCrawlerProcessReport-object containing process-summary-information
  • section: 1 Basic settings
  • access: public
PHPCrawlerProcessReport getProcessReport ()
getReport (line 857)

Returns an array with summarizing report-information after the crawling-process has finished.

For detailed information on the contained array-keys see PHPCrawlerProcessReport-class.

  • deprecated: Please use getProcessReport() instead.
  • section: 11 Deprecated
  • access: public
void getReport ()
go (line 324)

Starts the crawling process in single-process-mode.

Make sure you have overridden the handleDocumentInfo()- or handlePageData()-method before calling the go()-method, so that the documents the crawler finds get processed.

  • section: 1 Basic settings
  • access: public
void go ()
goMultiProcessed (line 387)

Starts the crawler by using multiple processes.

When using this method instead of the go()-method to start the crawler, phpcrawl will use the given number of processes simultaneously for spidering the target-url. Using multiple processes will speed up the crawling-progress dramatically in most cases.

There are some requirements though to successfully run the crawler in multi-process mode:

  • The multi-process mode only works on unix-based systems (e.g. Linux).
  • Scripts using the crawler have to be run from the commandline (cli).
  • The PCNTL-extension for php (process control) has to be installed and activated (http://php.net/manual/en/pcntl.installation.php).
  • The SEMAPHORE-extension for php has to be installed and activated (http://php.net/manual/en/sem.installation.php).
  • The POSIX-extension for php has to be installed and activated (http://de.php.net/manual/en/posix.installation.php).
  • The PDO-extension together with the SQLite-driver (PDO_SQLITE) has to be installed and activated (http://de2.php.net/manual/en/pdo.installation.php).

PHPCrawl supports two different modes of multiprocessing:

  1. PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE The crawler uses multiple processes simultaneously for spidering the target URL, but the usercode provided to the overridable function handleDocumentInfo() always gets executed on the same main-process. This means that the usercode never gets executed simultaneously, so you don't have to care about concurrent file/database/handle-accesses or similar things. On the other hand, the usercode may slow down the crawling-procedure because every child-process has to wait until the usercode has been executed on the main-process. This is the recommended multiprocess-mode!
  2. PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE The crawler uses multiple processes simultaneously for spidering the target URL, and every child-process executes the usercode provided to the overridable function handleDocumentInfo() directly from its own process. This means that the usercode gets executed simultaneously by the different child-processes, and you should take care of concurrent file/data/handle-accesses properly (if used). If you use this mode together with any handles like database-connections or filestreams in your extended crawler-class, you should open them within the overridden method initChildProcess() instead of opening them from the constructor. For more details see the documentation of the initChildProcess()-method.

Example for starting the crawler with 5 processes using the recommended MPMODE_PARENT_EXECUTES_USERCODE-mode:

    $crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);

Please note that increasing the number of processes to high values doesn't automatically mean that the crawling-process will run faster! Using 3 to 5 processes should be a good starting point.

  • section: 1 Basic settings
  • access: public
void goMultiProcessed ([int $process_count = 3], [int $multiprocess_mode = 1])
  • int $process_count: Number of processes to use
  • int $multiprocess_mode: The multiprocess-mode to use. One of the PHPCrawlerMultiProcessModes-constants
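
The requirements listed above can be checked before calling goMultiProcessed(). The following is a sketch only: missingMultiProcessRequirements() is a hypothetical helper that is not part of PHPCrawl, it uses PHP's own extension names ("sysvsem" for the semaphore extension), and the PHP_OS_FAMILY constant it relies on needs PHP 7.2 or later.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): lists the unmet requirements
// for the multi-process mode as human-readable strings.
function missingMultiProcessRequirements(): array
{
    $missing = [];

    // Multi-process mode only works on unix-based systems
    if (PHP_OS_FAMILY === "Windows") {
        $missing[] = "unix-based operating system";
    }

    // Scripts have to be run from the commandline
    if (PHP_SAPI !== "cli") {
        $missing[] = "commandline (cli) execution";
    }

    // Extension names as PHP reports them
    foreach (["pcntl", "sysvsem", "posix", "PDO", "pdo_sqlite"] as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = "extension " . $extension;
        }
    }

    return $missing;
}

$missing = missingMultiProcessRequirements();
if (count($missing) === 0) {
    echo "All listed requirements for the multi-process mode are met.\n";
} else {
    echo "Missing: " . implode(", ", $missing) . "\n";
}
```

If the list is not empty, falling back to the go()-method is one possible option.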
handleDocumentInfo (line 990)

Override this method to get access to all information about a page or file the crawler found and received.

Every time the crawler finds and receives a document on its way, this method will be called. The crawler passes all information about the currently received page or file to this method by a PHPCrawlerDocumentInfo-object.

Please see the PHPCrawlerDocumentInfo documentation for a list of all properties describing the html-document.

Example:

    class MyCrawler extends PHPCrawler
    {
      function handleDocumentInfo($PageInfo)
      {
        // Print the URL of the document
        echo "URL: ".$PageInfo->url."<br />";

        // Print the http-status-code
        echo "HTTP-statuscode: ".$PageInfo->http_status_code."<br />";

        // Print the number of found links in this document
        echo "Links found: ".count($PageInfo->links_found_url_descriptors)."<br />";

        // ..
      }
    }

  • return: The crawling-process will stop immediately if you let this method return any negative value.
  • section: 3 Overridable methods / User data-processing
  • access: public
int handleDocumentInfo (PHPCrawlerDocumentInfo $PageInfo)
  • PHPCrawlerDocumentInfo $PageInfo: A PHPCrawlerDocumentInfo-object containing all information about the currently received document. Please see the reference of the PHPCrawlerDocumentInfo-class for detailed information.

handleHeaderInfo (line 893)

Overridable method that will be called after the header of a document was received and BEFORE the content will be received.

Every time a header of a document is received, the crawler will call this method. If this method returns any negative integer, the crawler will NOT receive the content of the particular page or file.

Example:

    class MyCrawler extends PHPCrawler
    {
      function handleHeaderInfo(PHPCrawlerResponseHeader $header)
      {
        // If the content-type of the document isn't "text/html" -> don't receive it.
        if ($header->content_type != "text/html")
        {
          return -1;
        }
      }

      function handleDocumentInfo($PageInfo)
      {
        // ...
      }
    }

  • return: The document won't be received if you let this method return any negative value.
  • section: 3 Overridable methods / User data-processing
  • access: public
int handleHeaderInfo (PHPCrawlerResponseHeader $header)
handlePageData (line 952)

Override this method to get access to all information about a page or file the crawler found and received.

Every time the crawler finds and receives a document on its way, this method will be called. The crawler passes all information about the currently received page or file to this method by the array $page_data.

  • return: The crawling-process will stop immediately if you let this method return any negative value.
  • deprecated: Please use and override the handleDocumentInfo-method to access document-information instead.
  • section: 3 Overridable methods / User data-processing
  • access: public
int handlePageData (array &$page_data)
  • array &$page_data: Array containing all information about the currently received document. For detailed information on the contained keys see PHPCrawlerDocumentInfo-class.
initChildProcess (line 935)

Overridable method that will be called by every used child-process just before it starts the crawling-procedure.

Every child-process of the crawler will call this method from within its process-context just before it starts its crawling-loop.

So when using the multi-process mode "PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE", this method should be overridden and used to open any needed database-connections, file streams or other similar handles to ensure that they get opened and are accessible for every used child-process.

Example:

    class MyCrawler extends PHPCrawler
    {
      protected $mysql_link;

      function initChildProcess()
      {
        // Open a database-connection for every used process
        // (mysqli is used because the old mysql_*-functions were removed in PHP 7)
        $this->mysql_link = mysqli_connect("myhost", "myusername", "mypassword", "mydatabasename");
      }

      function handleDocumentInfo($PageInfo)
      {
        // Insert the URL with a prepared statement so that it gets escaped properly
        $url = $PageInfo->url;
        $statement = mysqli_prepare($this->mysql_link, "INSERT INTO urls SET url = ?");
        mysqli_stmt_bind_param($statement, "s", $url);
        mysqli_stmt_execute($statement);
      }
    }

    // Start crawler with 5 processes
    $crawler = new MyCrawler();
    $crawler->setURL("http://www.any-url.com");
    $crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);

  • section: 3 Overridable methods / User data-processing
  • access: public
void initChildProcess ()
initCrawlerProcess (line 273)

Initiates a crawler-process

  • access: protected
void initCrawlerProcess ()
obeyNoFollowTags (line 1758)

Decides whether the crawler should obey "nofollow"-tags

If set to TRUE, the crawler will not follow links that are marked with rel="nofollow" (like <a href="page.html" rel="nofollow">) nor links from pages containing the meta-tag <meta name="robots" content="nofollow">.

By default, the crawler will NOT obey nofollow-tags.

  • section: 2 Filter-settings
  • access: public
void obeyNoFollowTags (bool $mode)
  • bool $mode: If set to TRUE, the crawler will obey "nofollow"-tags
obeyRobotsTxt (line 1335)

Decides whether the crawler should parse and obey robots.txt-files.

If this is set to TRUE, the crawler looks for a robots.txt-file on every host from which pages or files should be received during the crawling process. If a robots.txt-file was found for a host, the contained directives applying to the useragent-identification of the crawler ("PHPCrawl" or manually set by calling setUserAgentString()) will be obeyed.

The default-value is FALSE (for compatibility reasons).

Please note that the directives found in a robots.txt-file have a higher priority than other settings made by the user. If e.g. addURLFollowRule("#http://foo\.com/path/file\.html#") was set, but a directive in the robots.txt-file of the host foo.com says "Disallow: /path/", the URL http://foo.com/path/file.html will be ignored by the crawler anyway.

  • section: 2 Filter-settings
  • access: public
bool obeyRobotsTxt (bool $mode)
  • bool $mode: Set to TRUE if you want the crawler to obey robots.txt-files.
processRobotsTxt (line 717)
  • access: protected
void processRobotsTxt ()
processUrl (line 601)

Receives and processes the given URL

  • return: TRUE if the crawling-process should be aborted after processing the URL, otherwise FALSE.
  • access: protected
bool processUrl (PHPCrawlerURLDescriptor $UrlDescriptor)
resume (line 1846)

Resumes the crawling-process with the given crawler-ID

If a crawling-process was aborted (for whatever reasons), it is possible to resume it by calling the resume()-method before calling the go() or goMultiProcessed() method and passing the crawler-ID of the aborted process to it (as returned by getCrawlerId()).

In order to be able to resume a process, it is necessary that it was initially started with resumption enabled (by calling the enableResumption() method).

This method throws an exception if resuming of a crawling-process failed.

Example of a resumable crawler-script:

    // ...
    $crawler = new MyCrawler();
    $crawler->enableResumption();
    $crawler->setURL("www.url123.com");

    // If process was started the first time:
    // Get the crawler-ID and store it somewhere in order to be able to resume the process later on
    if (!file_exists("/tmp/crawlerid_for_url123.tmp"))
    {
      $crawler_id = $crawler->getCrawlerId();
      file_put_contents("/tmp/crawlerid_for_url123.tmp", $crawler_id);
    }

    // If process was restarted again (after a termination):
    // Read the crawler-id and resume the process
    else
    {
      $crawler_id = file_get_contents("/tmp/crawlerid_for_url123.tmp");
      $crawler->resume($crawler_id);
    }

    // ...

    // Start your crawling process
    $crawler->goMultiProcessed(5);

    // After the process is finished completely: Delete the crawler-ID
    unlink("/tmp/crawlerid_for_url123.tmp");

  • section: 9 Process resumption
  • access: public
void resume (int $crawler_id)
  • int $crawler_id: The crawler-ID of the crawling-process that should be resumed. (see getCrawlerId())
setAggressiveLinkExtraction (line 1488)

Alias for enableAggressiveLinkSearch()

  • deprecated: Please use enableAggressiveLinkSearch()
  • section: 11 Deprecated
  • access: public
void setAggressiveLinkExtraction ( $mode)
  • $mode
setConnectionTimeout (line 1640)

Sets the timeout in seconds for connection attempts to webservers.

If the connection to a host can't be established within the given time, the request will be aborted.

  • section: 10 Other settings
  • access: public
bool setConnectionTimeout (int $timeout)
  • int $timeout: The timeout in seconds, the default-value is 5 seconds.
setContentSizeLimit (line 1402)

Sets the content-size-limit for content the crawler should receive from documents.

If the crawler is receiving the content of a page or file and the contentsize-limit is reached, the crawler stops receiving content from this page or file.

Please note that the crawler can only find links in the received portion of a document.

The default-value is 0 (no limit).

  • section: 5 Limit-settings
  • access: public
bool setContentSizeLimit (int $bytes)
  • int $bytes: The limit in bytes.
setCookieHandling (line 1455)

Alias for enableCookieHandling()

  • deprecated: Please use enableCookieHandling()
  • section: 11 Deprecated
  • access: public
void setCookieHandling ( $mode)
  • $mode
setFollowMode (line 1148)

Sets the basic follow-mode of the crawler.

The following list explains the supported follow-modes:

0 - The crawler will follow EVERY link, even if the link leads to a different host or domain. If you choose this mode, you really should set a limit to the crawling-process (see limit-options), otherwise the crawler may end up crawling the whole WWW!

1 - The crawler only follows links that lead to the same domain as the one in the root-url. E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will follow links to "http://www.foo.com/..." and "http://bar.foo.com/...", but not to "http://www.another-domain.com/...".

2 - The crawler will only follow links that lead to the same host as the one in the root-url. E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will ONLY follow links to "http://www.foo.com/...", but not to "http://bar.foo.com/..." and "http://www.another-domain.com/...". This is the default mode.

3 - The crawler only follows links to pages or files located in or under the same path as the one of the root-url. E.g. if the root-url is "http://www.foo.com/bar/index.html", the crawler will follow links to "http://www.foo.com/bar/page.html" and "http://www.foo.com/bar/path/index.html", but not links to "http://www.foo.com/page.html".

  • section: 1 Basic settings
  • access: public
bool setFollowMode (int $follow_mode)
  • int $follow_mode: The basic follow-mode for the crawling-process (0, 1, 2 or 3).
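
The four follow-modes can be read as a decision function over the root-url and a candidate link. The sketch below is only an illustration of the rules listed above, not PHPCrawl's actual code. The function names are hypothetical, and the "domain" is naively assumed to be the last two labels of the host name (which is wrong for suffixes such as "co.uk"). It assumes PHP 7 or later.

```php
<?php
// Illustrative sketch only; NOT PHPCrawl's internal implementation.

// Assumption: the domain is the last two labels of the host name.
function naiveDomain(string $host): string
{
    $labels = explode('.', strtolower($host));
    return implode('.', array_slice($labels, -2));
}

// Returns TRUE if $linkUrl would be followed under the given follow-mode.
function linkMatchesFollowMode(string $rootUrl, string $linkUrl, int $mode): bool
{
    $root = parse_url($rootUrl);
    $link = parse_url($linkUrl);
    if (!isset($root['host'], $link['host'])) {
        return false;
    }
    $rootHost = strtolower($root['host']);
    $linkHost = strtolower($link['host']);

    switch ($mode) {
        case 0: // follow every link
            return true;
        case 1: // same domain
            return naiveDomain($rootHost) === naiveDomain($linkHost);
        case 2: // same host
            return $rootHost === $linkHost;
        case 3: // same host, in or under the root-url's directory
            if ($rootHost !== $linkHost) {
                return false;
            }
            $rootPath = $root['path'] ?? '/';
            $rootDir  = substr($rootPath, 0, strrpos($rootPath, '/') + 1);
            $linkPath = $link['path'] ?? '/';
            return strncmp($linkPath, $rootDir, strlen($rootDir)) === 0;
        default:
            throw new InvalidArgumentException('Follow-mode must be 0, 1, 2 or 3');
    }
}

// The examples from the list above:
$sameDomain = linkMatchesFollowMode('http://www.foo.com', 'http://bar.foo.com/x', 1);
$sameHost   = linkMatchesFollowMode('http://www.foo.com', 'http://bar.foo.com/x', 2);
$underPath  = linkMatchesFollowMode('http://www.foo.com/bar/index.html', 'http://www.foo.com/bar/path/index.html', 3);
$abovePath  = linkMatchesFollowMode('http://www.foo.com/bar/index.html', 'http://www.foo.com/page.html', 3);
```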
setFollowRedirects (line 1095)

Defines whether the crawler should follow redirects sent with headers by a webserver or not.

  • section: 10 Other settings
  • access: public
bool setFollowRedirects (bool $mode)
  • bool $mode: If TRUE, the crawler will follow header-redirects. The default-value is TRUE.
setFollowRedirectsTillContent (line 1117)

Defines whether the crawler should follow HTTP-redirects until content is first found, regardless of defined filter-rules and follow-modes.

Sometimes, when a URL is requested, the first thing the webserver does is send a redirect to another location, and sometimes the server of this new location sends a redirect again (and so on). So it is possible that you will finally find the expected content on a totally different host than expected.

If you set this option to TRUE, the crawler will follow all these redirects until it finds some content. Once content has been found, the root-url of the crawling-process will be set to this url and all defined options (follow-mode, filter-rules etc.) will relate to it from then on.

  • section: 10 Other settings
  • access: public
void setFollowRedirectsTillContent (bool $mode)
  • bool $mode: If TRUE, the crawler will follow redirects until content was finally found. Defaults to TRUE.
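
The behaviour described above can be pictured as walking a chain of redirects until a response carries content. The following sketch simulates this with a plain array instead of real HTTP requests. The function resolveRedirectChain(), the $maxHops guard and the sample URLs are assumptions made for illustration and are not taken from PHPCrawl. It assumes PHP 7.1 or later.

```php
<?php
// Illustrative sketch only; NOT PHPCrawl's internal implementation.
// $responses maps a URL to its redirect target, or to NULL if the
// URL delivers content instead of a redirect.
function resolveRedirectChain(array $responses, string $startUrl, int $maxHops = 10): ?string
{
    $url = $startUrl;
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        if (!array_key_exists($url, $responses)) {
            return null; // unknown URL in this simulation
        }
        $target = $responses[$url];
        if ($target === null) {
            return $url; // content found: this URL becomes the new root-url
        }
        $url = $target; // follow the redirect
    }
    return null; // gave up: too many redirects (e.g. a loop)
}

$responses = array(
    'http://www.foo.com/'      => 'http://foo.com/',
    'http://foo.com/'          => 'http://cdn.bar.org/start',
    'http://cdn.bar.org/start' => null,
);
$newRoot = resolveRedirectChain($responses, 'http://www.foo.com/');
```

In this simulation the new root-url ends up on a different host than the one originally requested, which is the situation the option is meant to handle.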
setLinkExtractionTags (line 1508)

Sets the list of html-tags the crawler should search for links in.

By default the crawler searches for links in the following html-tags: href, src, url, location, codebase, background, data, profile, action and open. As soon as the list is set manually, this default list will be overwritten completely.

Example:

  $crawler->setLinkExtractionTags(array("href", "src"));
This setting lets the crawler search for links (only) in "href" and "src"-tags.

Note: Reducing the number of tags in this list will improve the crawling-performance (a little).

  • section: 6 Linkfinding settings
  • access: public
void setLinkExtractionTags (array $tag_array)
  • array $tag_array: Numeric array containing the tags.
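
To see why a shorter list means less work, consider a simplified link search that builds one regular expression from the given list. This is a rough sketch under the assumption that links appear as quoted values; the function extractLinks() and its pattern are hypothetical and do not reproduce PHPCrawl's own link-finding expressions.

```php
<?php
// Illustrative sketch only; NOT PHPCrawl's internal implementation.
// Returns the quoted values of the listed names found in $html.
function extractLinks(string $html, array $tagArray): array
{
    if (count($tagArray) === 0) {
        return array();
    }
    $quoted = array();
    foreach ($tagArray as $name) {
        $quoted[] = preg_quote($name, '/');
    }
    // name = "value" or name = 'value', case-insensitive
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[2];
}

$html  = '<a href="/page.html">x</a><img src="logo.png"><body background="bg.gif">';
$links = extractLinks($html, array("href", "src"));
```

With only "href" and "src" in the list, the value of "background" is not reported as a link.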
setPageLimit (line 1379)

Sets a limit to the number of pages/files the crawler should follow.

If the limit is reached, the crawler stops the crawling-process. The default-value is 0 (no limit).

  • section: 5 Limit-settings
  • access: public
void setPageLimit (int $limit, [bool $only_count_received_documents = false])
  • int $limit: The limit, set to 0 for no limit (default value).
  • bool $only_count_received_documents: OPTIONAL. TRUE means that only documents the crawler received will be counted. FALSE means that ALL followed and requested pages/files will be counted, even if their content wasn't received.
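
The difference between the two counting modes is easiest to see in a small simulation. The function below is hypothetical and only models the counting rule described above; the exact moment at which PHPCrawl stops is an assumption here.

```php
<?php
// Illustrative sketch only; NOT PHPCrawl's internal implementation.
// $requests is a list of booleans: TRUE = content was received,
// FALSE = the page/file was requested but its content was not received.
// Returns how many requests are made before the limit stops the process.
function requestsUntilPageLimit(array $requests, int $limit, bool $onlyCountReceived = false): int
{
    $counted   = 0;
    $processed = 0;
    foreach ($requests as $received) {
        if ($limit > 0 && $counted >= $limit) {
            break; // limit reached
        }
        $processed++;
        if (!$onlyCountReceived || $received) {
            $counted++;
        }
    }
    return $processed;
}

$requests = array(true, false, false, true, true, true);
$countAll      = requestsUntilPageLimit($requests, 3, false);
$countReceived = requestsUntilPageLimit($requests, 3, true);
$noLimit       = requestsUntilPageLimit($requests, 0);
```

With a limit of 3, counting all requests stops this simulated process after three requests, while counting only received documents lets it continue until three documents have actually been received.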
setPort (line 1038)

Sets the port to connect to for crawling the starting-url set in setURL().

The default port is 80.

Note:

  $crawler->setURL("http://www.foo.com");
  $crawler->setPort(443);
has the same effect as

  $crawler->setURL("http://www.foo.com:443");

  • section: 1 Basic settings
  • access: public
bool setPort (int $port)
  • int $port: The port
setProxy (line 1624)

Assigns a proxy-server the crawler should use for all HTTP-Requests.

  • section: 10 Other settings
  • access: public
void setProxy (string $proxy_host, int $proxy_port, [string $proxy_username = null], [string $proxy_password = null])
  • string $proxy_host: Hostname or IP of the proxy-server
  • int $proxy_port: Port of the proxy-server
  • string $proxy_username: Optional. The username for proxy-authentication or NULL if no authentication is required.
  • string $proxy_password: Optional. The password for proxy-authentication or NULL if no authentication is required.
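
For orientation, proxy-authentication over HTTP is commonly done with a "Proxy-Authorization" request-header. The sketch below builds such a header for the Basic scheme. Whether PHPCrawl uses exactly this scheme is an assumption, and buildProxyAuthHeader() is a hypothetical helper, not a PHPCrawl method. It assumes PHP 7.1 or later.

```php
<?php
// Illustrative sketch only; assumes HTTP Basic proxy-authentication.
// Returns the header line, or NULL if no authentication is required.
function buildProxyAuthHeader(?string $username, ?string $password): ?string
{
    if ($username === null || $password === null) {
        return null;
    }
    return 'Proxy-Authorization: Basic ' . base64_encode($username . ':' . $password);
}

$header   = buildProxyAuthHeader('user', 'pass');
$noHeader = buildProxyAuthHeader(null, null);
```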
setStreamTimeout (line 1664)

Sets the timeout in seconds for waiting for data on an established server-connection.

If the connection to a server was established but the server stops sending data without closing the connection, the crawler will wait for the time given in timeout and then close the connection.

  • section: 10 Other settings
  • access: public
bool setStreamTimeout (int $timeout)
  • int $timeout: The timeout in seconds. The default value is 2 seconds.
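
The connection-timeout (setConnectionTimeout()) and the stream-timeout cover two different phases of a request. PHP's native socket functions make the same distinction, which the following sketch uses as an analogy. The helper openConnection() is hypothetical, and nothing here claims to be how PHPCrawl opens its sockets.

```php
<?php
// Illustrative sketch only; NOT PHPCrawl's internal implementation.
// $connectTimeout limits the time to ESTABLISH the connection,
// $streamTimeout limits the time to WAIT FOR DATA on the open connection.
function openConnection(string $host, int $port, int $connectTimeout = 5, int $streamTimeout = 2)
{
    $socket = @fsockopen($host, $port, $errno, $errstr, $connectTimeout);
    if ($socket === false) {
        return false; // connection could not be established in time
    }
    stream_set_timeout($socket, $streamTimeout);
    return $socket;
}

// After a read, stream_get_meta_data($socket)['timed_out'] tells
// whether the stream-timeout was hit while waiting for data.
```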
setTmpFile (line 1313)

Has no function anymore.

Please use setWorkingDirectory()

  • deprecated: This method has no function anymore since v 0.8.
  • section: 11 Deprecated
  • access: public
void setTmpFile ( $tmp_file)
  • $tmp_file
setTrafficLimit (line 1419)

Sets a limit to the number of bytes the crawler should receive altogether during the crawling-process.

If the limit is reached, the crawler stops the crawling-process. The default-value is 0 (no limit).

  • section: 5 Limit-settings
  • access: public
bool setTrafficLimit (int $bytes, [bool $complete_requested_files = true])
  • int $bytes: Maximum number of bytes
  • bool $complete_requested_files: This parameter has no function anymore!
setURL (line 1006)

Sets the URL of the first page the crawler should crawl (root-page).

The given url may contain the protocol (http://www.foo.com or https://www.foo.com), the port (http://www.foo.com:4500/index.php) and/or basic-authentication-data (http://loginname:passwd@www.foo.com)

This url has to be set before calling the go()-method (of course)! If this root-page doesn't contain any further links, the crawling-process will stop immediately.

  • section: 1 Basic settings
  • access: public
bool setURL (string $url)
  • string $url: The URL
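
The parts a root-url may carry (protocol, port, basic-authentication-data) correspond to the components returned by PHP's built-in parse_url(). The snippet below merely decomposes a sample URL; how PHPCrawl itself parses the URL is not shown here.

```php
<?php
// Decompose a URL that contains protocol, login data, port and path.
$parts = parse_url("http://loginname:passwd@www.foo.com:4500/index.php");

$scheme = $parts['scheme']; // "http"
$user   = $parts['user'];   // "loginname"
$pass   = $parts['pass'];   // "passwd"
$host   = $parts['host'];   // "www.foo.com"
$port   = $parts['port'];   // 4500 (integer)
$path   = $parts['path'];   // "/index.php"
```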
setUrlCacheType (line 1736)

Defines what type of cache will be internally used for caching URLs.

Currently phpcrawl is able to use an in-memory-cache or an SQLite-database-cache for caching/storing found URLs internally.

The memory-cache (PHPCrawlerUrlCacheTypes::URLCACHE_MEMORY) is recommended for spidering small to medium websites. It provides better performance, but the php-memory-limit may be hit when too many URLs get added to the cache. This is the default-setting.

The SQLite-cache (PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE) is recommended for spidering huge websites. URLs get cached in an SQLite-database-file, so the cache is only limited by the available harddisk-space. To increase the performance of the SQLite-cache you may set its location to a shared-memory device like "/dev/shm/" by using the setWorkingDirectory()-method.

Example:

  $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
  $crawler->setWorkingDirectory("/dev/shm/");

NOTE: When using phpcrawl in multi-process-mode (goMultiProcessed()), the cache-type is automatically set to PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE.

  • section: 1 Basic settings
  • access: public
bool setUrlCacheType (int $url_cache_type)
  • int $url_cache_type:

    1 -> in-memory-cache (default setting)
    2 -> SQLite-database-cache

    Or one of the PHPCrawlerUrlCacheTypes::URLCACHE..-constants.

setUserAgentString (line 1560)

Sets the "User-Agent" identification-string that will be sent with HTTP-requests.

  • section: 10 Other settings
  • access: public
void setUserAgentString (string $user_agent)
  • string $user_agent: The user-agent-string. The default-value is "PHPCrawl".
setWorkingDirectory (line 1604)

Sets the working-directory the crawler should use for storing temporary data.

Every instance of the crawler needs and creates a temporary directory for storing some internal data.

This setting defines which base-directory the crawler will use to store the temporary directories in. By default, the crawler uses the system's temp-directory as working-directory (e.g. "/tmp/" on linux-systems).

All temporary directories created in the working-directory will be deleted automatically after a crawling-process has finished.

NOTE: To speed up a crawling-process (especially when using the SQLite-urlcache), try to set a mounted shared-memory device as working-directory (e.g. "/dev/shm/" on Debian/Ubuntu-systems).

Example:

  $crawler->setWorkingDirectory("/tmp/");

  • return: TRUE on success, otherwise false.
  • section: 1 Basic settings
  • access: public
bool setWorkingDirectory (string $directory)
  • string $directory: The working-directory
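
Since a shared-memory device is not available on every system, a script may want to fall back to the system's temp-directory. The helper chooseWorkingDirectory() below is a hypothetical convenience function and not part of PHPCrawl; it only uses PHP's built-in filesystem functions.

```php
<?php
// Illustrative sketch only: pick a shared-memory directory if it is
// usable, otherwise fall back to the system's temp-directory.
function chooseWorkingDirectory(string $preferred = "/dev/shm/"): string
{
    if (is_dir($preferred) && is_writable($preferred)) {
        return rtrim($preferred, "/\\") . DIRECTORY_SEPARATOR;
    }
    return rtrim(sys_get_temp_dir(), "/\\") . DIRECTORY_SEPARATOR;
}

$directory = chooseWorkingDirectory();
// The result could then be passed on:
// $crawler->setWorkingDirectory($directory);
```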
starControllerProcessLoop (line 475)

Starts the loop of the controller-process (main-process).

  • access: protected
void starControllerProcessLoop ()
startChildProcessLoop (line 530)

Starts the loop of a child-process.

  • access: protected
void startChildProcessLoop ()

Documentation generated on Sun, 20 Jan 2013 21:18:49 +0200 by phpDocumentor 1.4.4