I quite often need to download all pages within a single site or domain, because I then run various measurements on the results or use the pages for full-text search.
One option is the ready-made tool Xenu, which is hard to install on a web server (it is a Windows program), or Wget, which is not available everywhere and adds another unnecessary dependency. If the task is just to make a copy of a site for later viewing, HTTrack is very useful and it is the one I like most, but when indexing parameterized URLs it can lose accuracy in some cases.
So I started looking for a tool that could index all pages automatically, directly in PHP, with advanced configuration. Eventually this became an open-source project.
For exactly these needs I implemented my own Composer package WebCrawler, which handles the whole page-indexing process elegantly on its own, and whenever I come across a new case, I improve it further.
It is installed with the Composer command:
```shell
composer require baraja-core/webcrawler
```
And it's easy to use. Just create an instance and call the `crawl()` method:
```php
$crawler = new \Baraja\WebCrawler\Crawler;
$result = $crawler->crawl('https://example.com');
```
The complete result is then available in the `$result` variable as an instance of the `CrawledResult` entity, which I recommend studying, because it contains a lot of interesting information about the whole site.
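As a starting point, a minimal sketch like this simply dumps the entity so you can explore what it contains; the exact getters of `CrawledResult` are not listed in this article, so the dump is the safest way to look around:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

$crawler = new \Baraja\WebCrawler\Crawler;
$result = $crawler->crawl('https://example.com');

// Dump the whole CrawledResult entity to inspect which URLs, texts
// and metadata were collected (the available getters may vary by version).
var_dump($result);
```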
Often we need to limit page downloading somehow, because otherwise we would probably download the whole internet. This is done with the `Config` entity, which receives the configuration as a key-value array and is then passed to the `Crawler` constructor. For example:
```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        // key => value
    ])
);
```
Configuration options:

| Key | Default value | Possible values |
|---|---|---|
| `followExternalLinks` | `false` | `bool`: Stay only within the starting domain, or also index external links? |
| `sleepBetweenRequests` | `1000` | `int`: Wait time between downloading individual pages, in milliseconds. |
| `maxHttpRequests` | `1000000` | `int`: Maximum number of URLs to download. |
| `maxCrawlTimeInSeconds` | `30` | `int`: Maximum crawl time in seconds. |
| `allowedUrls` | `['.+']` | `string[]`: Array of allowed URL formats as regular expressions. |
| `forbiddenUrls` | `['']` | `string[]`: Array of forbidden URL formats as regular expressions. |
The regular expression must match the entire URL exactly as a string.
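To illustrate the options from the table, here is a sketch of a more restrictive configuration; the keys and defaults come from the table above, while the concrete limits and the `example.com` patterns are only placeholder assumptions for this example:

```php
$config = new \Baraja\WebCrawler\Config([
    'followExternalLinks' => false,  // stay on the starting domain
    'sleepBetweenRequests' => 1000,  // wait 1 second between requests
    'maxHttpRequests' => 500,        // stop after at most 500 downloaded URLs
    'maxCrawlTimeInSeconds' => 120,  // or after 2 minutes, whichever comes first
    // Placeholder patterns: each regular expression must match the entire URL.
    'allowedUrls' => ['https://example\.com/.+'],
    'forbiddenUrls' => ['https://example\.com/admin/.+'],
]);

$crawler = new \Baraja\WebCrawler\Crawler($config);
$result = $crawler->crawl('https://example.com');
```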