CurlMulti Plugin



  • Multi-threaded scraping with cURL.

php-curlmulti: https://github.com/ares333/php-curlmulti

GitHub: https://github.com/jae-jae/QueryList-CurlMulti

Installation

    composer require jaeger/querylist-curl-multi

API

  • CurlMulti curlMulti($urls = []): sets the collection of URLs to scrape

  • class CurlMulti

    • CurlMulti add($urls): adds URL tasks
    • array getUrls(): returns all URLs
    • CurlMulti success(Closure $callback): called when a task succeeds
    • CurlMulti error(Closure $callback): called when a task fails
    • CurlMulti start(array $opt = []): starts the scraping tasks; this method is blocking.

Installation parameters

QueryList::use(CurlMulti::class,$opt1)

  • $opt1: alias for the curlMulti function.

Usage

  • Install the plugin

    use QL\QueryList;
    use QL\Ext\CurlMulti;

    $ql = QueryList::getInstance();
    $ql->use(CurlMulti::class);
    // or register it under a custom function name
    $ql->use(CurlMulti::class,'curlMulti');

  • Example-1
    Scrape the GitHub trending pages:

    $ql->rules([
        'title' => ['h3 a','text'],
        'link' => ['h3 a','href']
    ])->curlMulti([
        'https://github.com/trending/php',
        'https://github.com/trending/go'
    ])->success(function (QueryList $ql,CurlMulti $curl,$r){
        echo "Current url:{$r['info']['url']} \r\n";
        $data = $ql->query()->getData();
        print_r($data->all());
    })->start();

Out:

    Current url:https://github.com/trending/php
    Array
    (
        [0] => Array
            (
                [title] => jupeter / clean-code-php
                [link] => /jupeter/clean-code-php
            )
        [1] => Array
            (
                [title] => laravel / laravel
                [link] => /laravel/laravel
            )
        [2] => Array
            (
                [title] => spatie / browsershot
                [link] => /spatie/browsershot
            )
        //....
    )
    Current url:https://github.com/trending/go
    Array
    (
        [0] => Array
            (
                [title] => DarthSim / imgproxy
                [link] => /DarthSim/imgproxy
            )
        [1] => Array
            (
                [title] => jaegertracing / jaeger
                [link] => /jaegertracing/jaeger
            )
        [2] => Array
            (
                [title] => jdkato / prose
                [link] => /jdkato/prose
            )
        //...
    )

  • Example-2

    $ql->curlMulti('https://github.com/trending/php')
        ->success(function (QueryList $ql,CurlMulti $curl,$r){
            echo "Current url:{$r['info']['url']} \r\n";
            if($r['info']['url'] == 'https://github.com/trending/php'){
                // append a task
                $curl->add('https://github.com/trending/go');
            }
            $data = $ql->find('h3 a')->texts();
            print_r($data->all());
        })
        ->start();

Out:

    Current url:https://github.com/trending/php
    Array
    (
        [0] => jupeter / clean-code-php
        [1] => laravel / laravel
        [2] => spatie / browsershot
        //...
    )
    Current url:https://github.com/trending/go
    Array
    (
        [0] => DarthSim / imgproxy
        [1] => jaegertracing / jaeger
        [2] => jdkato / prose
        //...
    )

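When tasks are appended from inside a success callback, as in Example-2, the same URL can easily end up queued more than once. The following is a minimal sketch of one way to guard against that. The helper `filterNewUrls()` and the `$seen` array are hypothetical names introduced for illustration; they are not part of the plugin.

```php
<?php
/**
 * Hypothetical helper (not part of QueryList-CurlMulti): returns only the
 * URLs that have not been seen before and records them in $seen.
 */
function filterNewUrls(array $urls, array &$seen): array
{
    $fresh = [];
    foreach ($urls as $url) {
        if (!isset($seen[$url])) {
            $seen[$url] = true; // remember the URL so it is never queued twice
            $fresh[] = $url;
        }
    }
    return $fresh;
}

$seen = [];

// The first call records the starting URL.
$first = filterNewUrls(['https://github.com/trending/php'], $seen);

// The second call drops the URL that is already known.
$second = filterNewUrls([
    'https://github.com/trending/php',
    'https://github.com/trending/go',
], $seen);

print_r($second);
```

Inside a success callback, each URL returned by such a helper could then be passed to `$curl->add()`, the method listed in the API section above.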
  • Example-3

    $ql->curlMulti([
        'https://github-error-host.com/trending/php',
        'https://github.com/trending/go'
    ])->success(function (QueryList $ql,CurlMulti $curl,$r){
        echo "Current url:{$r['info']['url']} \r\n";
        $data = $ql->rules([
            'title' => ['h3 a','text'],
            'link' => ['h3 a','href']
        ])->query()->getData();
        print_r($data->all());
    })->error(function ($errorInfo,CurlMulti $curl){
        echo "Current url:{$errorInfo['info']['url']} \r\n";
        print_r($errorInfo['error']);
    })->start([
        // Maximum concurrency; this value can be changed dynamically at runtime.
        'maxThread' => 10,
        // Maximum number of retries when a curl error or user error occurs;
        // once it is exceeded, the error callback is invoked.
        'maxTry' => 3,
        // Global CURLOPT_* options
        'opt' => [
            CURLOPT_TIMEOUT => 10,
            CURLOPT_CONNECTTIMEOUT => 1,
            CURLOPT_RETURNTRANSFER => true
        ],
        // Cache entries are identified by URL. When the cache is used, the
        // library returns the cached response instead of accessing the network.
        'cache' => ['enable' => false, 'compress' => false, 'dir' => null, 'expire' => 86400, 'verifyPost' => false]
    ]);

Out:

    Current url:https://github.com/trending/go
    Array
    (
        [0] => Array
            (
                [title] => DarthSim / imgproxy
                [link] => /DarthSim/imgproxy
            )
        [1] => Array
            (
                [title] => jaegertracing / jaeger
                [link] => /jaegertracing/jaeger
            )
        [2] => Array
            (
                [title] => getlantern / lantern
                [link] => /getlantern/lantern
            )
        //...
    )
    Current url:https://github-error-host.com/trending/php
    Array
    (
        [0] => 28
        [1] => Resolving timed out after 1000 milliseconds
    )

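The options array passed to start() in Example-3 is nested, and its 'opt' entry is keyed by integer CURLOPT_* constants. The sketch below shows one way to layer per-run overrides on top of a shared base configuration. `buildStartOptions()` is a hypothetical helper, not part of the plugin, and the base values are copied from Example-3 for illustration; they are not necessarily the library's built-in defaults. It assumes the curl extension is loaded so that the CURLOPT_* constants are defined.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): layers per-run overrides on
 * top of a shared base configuration for start().
 * array_replace_recursive() is used instead of array_merge() because the
 * 'opt' array is keyed by integer CURLOPT_* constants, which array_merge()
 * would renumber.
 */
function buildStartOptions(array $base, array $overrides): array
{
    return array_replace_recursive($base, $overrides);
}

// Illustrative base values taken from Example-3 (an assumption, not the
// library's documented defaults).
$base = [
    'maxThread' => 10,
    'maxTry'    => 3,
    'opt'       => [
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_CONNECTTIMEOUT => 1,
    ],
    'cache'     => ['enable' => false, 'expire' => 86400],
];

// Override only the values that differ for this run.
$options = buildStartOptions($base, [
    'maxThread' => 5,
    'opt'       => [CURLOPT_CONNECTTIMEOUT => 3],
    'cache'     => ['enable' => true],
]);

print_r($options);
```

The resulting `$options` array has the same shape as the one in Example-3 and could be handed to `->start($options)`.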
  • Example-4

    $ql->rules([
        'title' => ['h3 a','text'],
        'link' => ['h3 a','href']
    ])->curlMulti()->add('https://github.com/trending/go')
        ->success(function (QueryList $ql,CurlMulti $curl,$r){
            echo "Current url:{$r['info']['url']} \r\n";
            $data = $ql->query()->getData();
            print_r($data->all());
        })->start()
        ->add('https://github.com/trending/php')
        ->start();

Releasing memory

The multi-threaded plugin scrapes a large number of pages; if resources are not released properly, memory usage can easily grow too large:

    $ql->rules([
        'title' => ['h3 a','text'],
        'link' => ['h3 a','href']
    ])->curlMulti([
        'https://github.com/trending/php',
        'https://github.com/trending/go'
    ])->success(function (QueryList $ql,CurlMulti $curl,$r){
        echo "Current url:{$r['info']['url']} \r\n";
        $data = $ql->query()->getData();
        print_r($data->all());
        // Release resources
        $ql->destruct();
    })->start();