QueryList get($url,$args = null,$otherArgs = [])



  • HTTP GET plugin for fetching web pages with minimal code. It is built on GuzzleHttp and accepts the same request options.
GuzzleHttp manual: http://guzzle-cn.readthedocs.io/zh_CN/latest/request-options.html

Usage


Basic usage

    $ql = QueryList::get('http://httpbin.org/get?param1=testvalue');
    echo $ql->getHtml();

This is equivalent to:

    $html = file_get_contents('http://httpbin.org/get?param1=testvalue');
    $ql = QueryList::html($html);
    echo $ql->getHtml();

With URL query parameters

    // Pass the parameters as an array...
    $ql->get('http://httpbin.org/get',[
        'param1' => 'testvalue',
        'params2' => 'somevalue'
    ]);
    // ...or as a query string
    $ql->get('http://httpbin.org/get','param1=testvalue&params2=somevalue');
    echo $ql->getHtml();

Output:

    {
      "args": {
        "param1": "testvalue",
        "params2": "somevalue"
      },
      "headers": {
        "Connection": "close",
        "Host": "httpbin.org",
        "Referer": "http://httpbin.org/get",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
      },
      "origin": "112.97.*.*",
      "url": "http://httpbin.org/get?param1=testvalue&params2=somevalue"
    }
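
Both calls in the query-parameter example carry the same two parameters, once as an array and once as a query string. The sketch below uses only PHP built-ins to show that the two forms convert into each other; it illustrates the encoding and is not QueryList's actual implementation. The variable names are made up for this example.

```php
<?php
// Illustration only: plain PHP built-ins, not QueryList internals.
$asArray = [
    'param1'  => 'testvalue',
    'params2' => 'somevalue',
];
$asString = 'param1=testvalue&params2=somevalue';

// Array form -> query string (separator passed explicitly to avoid ini differences)
$built = http_build_query($asArray, '', '&');

// String form -> array
parse_str($asString, $parsed);

var_dump($built === $asString); // the array encodes to the same query string
var_dump($parsed === $asArray); // the string decodes to the same array
```

Either form should therefore produce the same `args` object in the httpbin response.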

Sending cookies

  • Example 1
    // Scrape a Sina Weibo page that is only accessible after login
    $ql = QueryList::get('http://weibo.com',[],[
        'headers' => [
            // Paste the cookie copied from your browser
            'Cookie' => 'SINAGLOBAL=546064; wb_cmtLike_2112031=1; wvr=6;....'
        ]
    ]);
    //echo $ql->getHtml();
    echo $ql->find('title')->text();
    // Output: 我的首页 微博-随时随地发现新鲜事
  • Example 2: the HTTP plugin enables cookie handling by default, but you can also supply cookies manually. See the GuzzleHttp documentation for details.
    $cookieJar = new \GuzzleHttp\Cookie\CookieJar();
    $ql = QueryList::get('https://www.baidu.com/',[],[
        'cookies' => $cookieJar
    ]);

Spoofing browser request headers

    $ql->get('http://httpbin.org/get',[
        'param1' => 'testvalue',
        'params2' => 'somevalue'
    ],[
        'headers' => [
            'Referer' => 'https://querylist.cc/',
            'User-Agent' => 'testing/1.0',
            'Accept' => 'application/json',
            'X-Foo' => ['Bar', 'Baz'],
            'Cookie' => 'abc=111;xxx=222'
        ]
    ]);
    echo $ql->getHtml();

Output:

    {
      "args": {
        "param1": "testvalue",
        "params2": "somevalue"
      },
      "headers": {
        "Accept": "application/json",
        "Connection": "close",
        "Cookie": "abc=111;xxx=222",
        "Host": "httpbin.org",
        "Referer": "https://querylist.cc/",
        "User-Agent": "testing/1.0",
        "X-Foo": "Baz"
      },
      "origin": "112.97.*.*",
      "url": "http://httpbin.org/get?param1=testvalue&params2=somevalue"
    }

Using an HTTP proxy

    $ql->get('http://httpbin.org/get',[
        'param1' => 'testvalue',
        'params2' => 'somevalue'
    ],[
        'proxy' => 'http://222.141.11.17:8118',
        // Timeout in seconds
        'timeout' => 30,
        'headers' => [
            'Referer' => 'https://querylist.cc/',
            'User-Agent' => 'testing/1.0',
            'Accept' => 'application/json',
            'X-Foo' => ['Bar', 'Baz'],
            'Cookie' => 'abc=111;xxx=222'
        ]
    ]);
    echo $ql->getHtml();

Output:

    {
      "args": {
        "param1": "testvalue",
        "params2": "somevalue"
      },
      "headers": {
        "Accept": "application/json",
        "Connection": "close",
        "Cookie": "abc=111;xxx=222",
        "Host": "httpbin.org",
        "Proxy-Connection": "Keep-Alive",
        "Referer": "https://querylist.cc/",
        "User-Agent": "testing/1.0",
        "X-Foo": "Baz"
      },
      "origin": "222.141.11.17",
      "url": "http://httpbin.org/get?param1=testvalue&params2=somevalue"
    }

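Using the HTTP cache

The HTTP cache is built on the PHP-Cache package, which supports multiple cache drivers such as file, Redis, and MySQL caches. PHP-Cache documentation: http://www.php-cache.com/en/latest/

Used sensibly, the HTTP cache avoids repeatedly fetching pages whose content has not changed, which speeds up scraping. The first time a page is fetched, its HTML is stored in the cache; later fetches of the same page read the HTML directly from the cache.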
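
The fetch-once-then-read-from-cache behavior described above can be sketched in plain PHP. This is a minimal illustration of the idea, assuming a file per URL and a lifetime in seconds; `fetchWithCache` and the stand-in fetcher are hypothetical names, and this is not how QueryList or PHP-Cache implement caching internally.

```php
<?php
// Hypothetical helper for illustration; not part of QueryList or PHP-Cache.
function fetchWithCache(string $url, string $cacheDir, int $ttl, callable $fetch): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $file = rtrim($cacheDir, '/') . '/' . md5($url) . '.html';

    // Make sure we see fresh file metadata rather than PHP's stat cache.
    clearstatcache(true, $file);

    // Cache hit: the file exists and is younger than $ttl seconds.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Cache miss: fetch the page and store the HTML for next time.
    $html = $fetch($url);
    file_put_contents($file, $html);
    return $html;
}

// Stand-in fetcher that counts how often it is called (no network access).
$calls = 0;
$fetch = function (string $url) use (&$calls): string {
    $calls++;
    return '<html><title>demo</title></html>';
};

$dir = sys_get_temp_dir() . '/ql_cache_demo_' . uniqid();
$first  = fetchWithCache('http://example.com/', $dir, 600, $fetch);
$second = fetchWithCache('http://example.com/', $dir, 600, $fetch);

var_dump($first === $second); // both calls return the same HTML
var_dump($calls);             // the fetcher ran only on the first call
```

In this sketch the second call never reaches the fetcher, which is the saving the cache option is meant to provide.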

  • Using the file cache driver
    // Path to the cache directory
    $cache_path = __DIR__.'/temp/';
    $ql = QueryList::get($url,null,[
        'cache' => $cache_path,
        'cache_ttl' => 600 // Cache lifetime in seconds; optional
    ]);
  • Using other cache drivers. Taking the Predis cache driver as an example, first install the Predis cache adapter:
    composer require cache/predis-adapter

Using the Predis cache driver:

    use Cache\Adapter\Predis\PredisCachePool;

    $client = new \Predis\Client('tcp://127.0.0.1:6379');
    $pool = new PredisCachePool($client);

    $ql = QueryList::get($url,null,[
        'cache' => $pool,
        'cache_ttl' => 600 // Cache lifetime in seconds; optional
    ]);

More powerful HTTP operations

GuzzleHTTP is a very capable HTTP client that covers most HTTP features you are likely to need. For more usage, see the GuzzleHTTP documentation: http://guzzle-cn.readthedocs.io/zh_CN/latest/request-options.html