
Downloading the whole site by links in PHP

06. 11. 2019


I relatively often need to download all pages within a single site or domain, because I then run various measurements on the results, or use the pages for full-text search.

One possible solution is to use the ready-made tool Xenu, which is very difficult to install on a web server (it is a Windows program), or Wget, which is not supported everywhere and creates another unnecessary dependency.

If the task is just to make a copy of a site for later viewing, HTTrack is very useful and the one I like the most; however, when indexing parameterized URLs it can lose accuracy in some cases.

So I started looking for a tool that could index all pages automatically, directly in PHP, with advanced configuration. Eventually this became an open-source project.

Baraja WebCrawler

For exactly these needs I implemented my own Composer package WebCrawler, which handles the whole page-indexing process elegantly on its own, and whenever I come across a new case, I improve it further.

It is installed with the Composer command:

```shell
composer require baraja-core/webcrawler
```


And it's easy to use. Just create an instance and call the `crawl()` method:

```php
$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawl('https://example.com');
```

The complete result will be available in the `$result` variable as an instance of the `CrawledResult` entity, which I recommend studying, because it contains a lot of interesting information about the whole site.
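
For illustration, here is a minimal sketch of how the result might be consumed. The accessor names `getAllUrls()` and `getErrors()` are assumptions made for this example; check the `CrawledResult` entity in the package source for the actual API.

```php
// Sketch only: getAllUrls() and getErrors() are assumed accessor names,
// not taken from the package documentation.
$crawler = new \Baraja\WebCrawler\Crawler;
$result = $crawler->crawl('https://example.com');

foreach ($result->getAllUrls() as $url) { // hypothetical accessor
    echo $url, "\n";
}

foreach ($result->getErrors() as $error) { // hypothetical accessor
    echo 'Error: ', json_encode($error), "\n";
}
```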

Crawler settings

Often we need to limit the downloading of pages somehow, because otherwise we would probably download the whole internet.

This is done using the Config entity, which receives the configuration as a key-value array and is then passed to the Crawler constructor.

For example:

```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        // key => value
    ])
);
```

Configuration options:

| Key | Default value | Possible values |
|---|---|---|
| `followExternalLinks` | `false` | `bool`: Stay only within the same domain, or index external links as well? |
| `sleepBetweenRequests` | `1000` | `int`: Wait time between downloading each page, in milliseconds. |
| `maxHttpRequests` | `1000000` | `int`: Maximum number of downloaded URLs. |
| `maxCrawlTimeInSeconds` | `30` | `int`: Maximum crawl time, in seconds. |
| `allowedUrls` | `['.+']` | `string[]`: Array of allowed URL formats as regular expressions. |
| `forbiddenUrls` | `['']` | `string[]`: Array of forbidden URL formats as regular expressions. |

The regular expression must match the entire URL exactly as a string.
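
Putting it together, a typical configuration restricted to a single site might look like this. The keys come from the table above; the concrete values and the example domain are only illustrative.

```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        'followExternalLinks' => false,        // stay on the same domain
        'sleepBetweenRequests' => 1000,        // 1 second between requests
        'maxHttpRequests' => 500,              // stop after 500 URLs at most
        'maxCrawlTimeInSeconds' => 120,        // or after 2 minutes
        'allowedUrls' => ['https://example\.com/.+'],  // regex must match the whole URL
        'forbiddenUrls' => ['.+\.(pdf|jpg|png|zip)'],  // skip binary files
    ])
);

$result = $crawler->crawl('https://example.com');
```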

Jan Barášek

The author of this article works as a senior developer and software architect in Prague. He designs and maintains large web applications that you know and use. Since 2009 he has gathered rich experience, which he passes on through this site.
