
Downloading the whole site by links in PHP

06. 11. 2019


Relatively often I need to download all the pages of a single site or domain, because I then perform various measurements on the results, or I use the pages for full-text search.

One possible solution is the ready-made tool Xenu, which is difficult to run on a web server (it is a Windows program); another is Wget, which is not supported everywhere and creates yet another unnecessary dependency.

If the task is just to make a copy of a site for later viewing, the program HTTrack is very useful and the one I like the most; only when indexing parameterized URLs can it lose accuracy in some cases.

So I started looking for a tool that can index all pages automatically, directly in PHP, with advanced configuration. Eventually this became an open-source project.

Baraja WebCrawler

For exactly these needs I implemented my own Composer package, WebCrawler, which handles the page-indexing process elegantly on its own, and whenever I come across a new case, I improve it further.

It is installed with the Composer command:

```shell
composer require baraja-core/webcrawler
```

And it's easy to use. Just create an instance and call the `crawl()` method:

```php
$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawl('');
```

The complete result will be available in the `$result` variable as an instance of the `CrawledResult` entity, which I recommend studying, because it contains a lot of interesting information about the whole site.
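As a quick sketch of what working with the result might look like — note that the accessor name `getUrls()` below is an illustrative assumption, not the confirmed API; consult the `CrawledResult` entity's source for the real accessors:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

$crawler = new \Baraja\WebCrawler\Crawler;

// 'https://example.com' is a placeholder starting URL.
$result = $crawler->crawl('https://example.com');

// NOTE: getUrls() is a hypothetical accessor used for illustration;
// check the CrawledResult entity for the actual method names.
foreach ($result->getUrls() as $url) {
    echo $url . "\n";
}
```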

Crawler settings

Often we need to limit the downloading of pages somehow, because otherwise we would probably download the whole internet.

This is done using the Config entity, which receives the configuration as an array (key => value) and is then passed to the Crawler constructor.

For example:

```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        // key => value
    ])
);
```

Setting options:

| Key | Default value | Possible values |
|---|---|---|
| `followExternalLinks` | `false` | `bool`: Stay only within the starting domain, or index external links as well? |
| `sleepBetweenRequests` | `1000` | `int`: Wait time between downloading each page, in milliseconds. |
| `maxHttpRequests` | `1000000` | `int`: Maximum number of downloaded URLs. |
| `maxCrawlTimeInSeconds` | `30` | `int`: Maximum crawl time, in seconds. |
| `allowedUrls` | `['.+']` | `string[]`: Array of allowed URL patterns as regular expressions. |
| `forbiddenUrls` | `['']` | `string[]`: Array of forbidden URL patterns as regular expressions. |

The regular expression must match the entire URL exactly as a string.
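Putting the options above together, a configuration that stays on one domain but skips an admin section might look like the following sketch — the domain and the regular expressions are illustrative placeholders, not part of the package:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

// All keys below are documented options of the Config entity;
// the example.com patterns are placeholder assumptions.
$config = new \Baraja\WebCrawler\Config([
    'followExternalLinks' => false,          // stay within the starting domain
    'sleepBetweenRequests' => 1000,          // wait 1 second between requests
    'maxHttpRequests' => 500,                // stop after 500 downloaded URLs
    'maxCrawlTimeInSeconds' => 120,          // or after 2 minutes
    'allowedUrls' => ['https://example\.com/.+'],
    'forbiddenUrls' => ['https://example\.com/admin/.+'],
]);

$crawler = new \Baraja\WebCrawler\Crawler($config);
```

Remember that, as stated above, each regular expression must match the entire URL as a string, so the patterns cover the full address including the scheme.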

Jan Barášek

The author of this article works as a senior developer and software architect in Prague. He designs and maintains large web applications that you know and use. Since 2009 he has gathered rich experience, which he passes on through this site.
