phpspider's Introduction

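phpspider -- a PHP spider (crawler) framework

The program used in the article "I 'stole' one million Zhihu users in one day with a crawler, just to prove that PHP is the best language in the world".

phpspider is a framework for developing crawlers. With it you do not need to understand the low-level implementation of a crawler, or deal with problems such as crawlers being blocked by a site, or sites that require login or CAPTCHA recognition before they can be crawled. A few lines of PHP are enough to create your own crawler, and the multi-process Worker library bundled with the framework keeps the code concise and makes execution faster.

The demo directory contains crawl rules for several specific sites. As long as PHP is installed, the code runs directly from the command line. Developers interested in crawlers can join the QQ group to discuss: 147824717.

Taking Qiushibaike as an example, here is what a crawler looks like: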

$configs = array(
    'name' => '糗事百科',
    'domains' => array(
        'qiushibaike.com',
        'www.qiushibaike.com'
    ),
    'scan_urls' => array(
        'http://www.qiushibaike.com/'
    ),
    'content_url_regexes' => array(
        "http://www.qiushibaike.com/article/\d+"
    ),
    'list_url_regexes' => array(
        "http://www.qiushibaike.com/8hr/page/\d+\?s=\d+"
    ),
    'fields' => array(
        array(
            // Extract the article content from the content page
            'name' => "article_content",
            'selector' => "//*[@id='single-next-link']",
            'required' => true
        ),
        array(
            // Extract the article author from the content page
            'name' => "article_author",
            'selector' => "//div[contains(@class,'author')]//h2",
            'required' => true
        ),
    ),
);
$spider = new phpspider($configs);
$spider->start();

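That is the overall shape of a crawler: first define a $configs array that holds information about the site to crawl, then configure and start the crawler by calling $spider = new phpspider($configs); $spider->start();.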
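
To make the two URL regex lists in $configs more concrete, here is a minimal sketch of how a URL could be classified against them. The helper classify_url() and the '#' delimiter are assumptions made for illustration; the source does not show how phpspider applies these patterns internally. The dots are escaped in this sketch so that they only match a literal dot.

```php
<?php
// Hypothetical helper: decide whether a URL looks like a content page,
// a list page, or neither, using the regex lists from a config array.
function classify_url(string $url, array $configs): string
{
    foreach ($configs['content_url_regexes'] as $regex) {
        // '#' is used as the delimiter so slashes in URLs need no escaping.
        if (preg_match('#' . $regex . '#i', $url)) {
            return 'content';
        }
    }
    foreach ($configs['list_url_regexes'] as $regex) {
        if (preg_match('#' . $regex . '#i', $url)) {
            return 'list';
        }
    }
    return 'other';
}

// Same patterns as the example config, with the dots escaped.
$example_configs = array(
    'content_url_regexes' => array('http://www\.qiushibaike\.com/article/\d+'),
    'list_url_regexes'    => array('http://www\.qiushibaike\.com/8hr/page/\d+\?s=\d+'),
);

echo classify_url('http://www.qiushibaike.com/article/123', $example_configs), PHP_EOL;
```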

For more details, see the development documentation.

phpspider's Issues

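Problems when a field has children

When a field has children and the selector matches only one item, the result comes back as a string:

    return count($result) > 1 ? $result : $result[0];    // selector.php, line 158

    if (!empty($values) && !empty($conf['children']))    // phpspider.php, line 1784
    {
        $child_values = array();
        // The HTML extracted by the parent is the input for the children
        foreach ($values as $child_html)
        {
            // Recursive call, so any depth of children is supported
            $child_value = $this->get_fields($conf['children'], $child_html, $url, $page);
            if (!empty($child_value))
            {
                $child_values[] = $child_value;
            }
        }
        // Store the array of children if there are any, otherwise keep the HTML block
        if (!empty($child_values))
        {
            $values = $child_values;
        }
    }

Here a string ends up being iterated with foreach. Isn't that a problem?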
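
One way to avoid looping over a string is to normalize the selector result to an array before the foreach. The sketch below is not the framework's code; normalize_values() is a hypothetical helper. In PHP, foreach over a string emits a warning and the loop body never runs, so a single match would skip the child rules entirely.

```php
<?php
// Hypothetical helper: always hand foreach an array, whether the selector
// returned a single string, a list of strings, or nothing at all.
function normalize_values($values): array
{
    if ($values === null || $values === '') {
        return array();
    }
    return is_array($values) ? $values : array($values);
}

// A single match arrives as a string and becomes a one-element list.
foreach (normalize_values('<li>only item</li>') as $child_html) {
    echo $child_html, PHP_EOL;
}
```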

Is there no jsonpath method?

    // No selector type is set, or it is set to xpath
    if (!isset($conf['selector_type']) || $conf['selector_type']=='xpath')
    {
        // The return value is always a list of items
        $values = $this->get_fields_xpath($html, $conf['selector'], $conf['name']);
    }
    elseif ($conf['selector_type']=='regex')
    {
        $values = $this->get_fields_regex($html, $conf['selector'], $conf['name']);
    }

It looks like jsonpath is not actually supported?
To scrape data from a URL like http://q.stock.sohu.com/hisHq?code=cn_000001&start=19800101&end=20161010, jsonpath seems like the right fit. Could you provide a demo of jsonpath data handling?
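
Until a jsonpath selector exists, a JSON response can be handled with plain json_decode. The sketch below is a stand-in, not a jsonpath implementation: json_path_get() is a hypothetical helper that only follows a dotted key path, and the sample payload is invented for illustration and is not the actual response format of the URL above.

```php
<?php
// Hypothetical helper: follow a dotted path such as "rows.0.close"
// through a decoded JSON document. Returns null when the JSON is invalid
// or the path does not exist.
function json_path_get(string $json, string $path)
{
    $data = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        return null;
    }
    foreach (explode('.', $path) as $key) {
        if (is_array($data) && array_key_exists($key, $data)) {
            $data = $data[$key];
        } else {
            return null;
        }
    }
    return $data;
}

// Invented sample payload, used only to exercise the helper.
$sample = '{"status":0,"rows":[{"date":"2016-10-10","close":"9.10"}]}';
echo json_path_get($sample, 'rows.0.close'), PHP_EOL;
```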

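The easter egg is poisonous...

I went through the documentation item by item, and testing each piece on its own showed no problems. Yet the crawler kept reporting [error] Unknown error...
In desperation I dug into the source and found one poisonous easter egg.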

    // Easter egg
    $included_files = get_included_files();
    $content = file_get_contents($included_files[0]);
    if (!preg_match("#/\* Do NOT delete this comment \*/#", $content) || !preg_match("#/\* 不要删除这段注释 \*/#", $content))
    {
        $msg = "Unknown error...";
        log::error($msg);
        exit;
    }
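
The check above reads the entry script (the first included file) and requires both marker comments to be present. As a sketch, the same two regexes can be wrapped in a small function to test a script's contents before running it; has_required_comments() is a hypothetical name, not part of the framework.

```php
<?php
// Hypothetical helper: apply the same two patterns as the easter egg
// to the contents of an entry script.
function has_required_comments(string $content): bool
{
    return preg_match("#/\* Do NOT delete this comment \*/#", $content) === 1
        && preg_match("#/\* 不要删除这段注释 \*/#", $content) === 1;
}

// An entry script that would pass the check starts like this.
$passing_script = "<?php\n/* Do NOT delete this comment */\n/* 不要删除这段注释 */\n";
var_dump(has_required_comments($passing_script));
```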

Using CSS selectors may cause errors.

With this configuration in fields:

'selector' => 'table > tr > td > h1',
'selector_type' => 'css',

the crawler fails with: PHP Fatal error: Call to undefined function pq() in ***\core\selector.php on line 236.
After I changed the call to phpQuery::pq(), I hit another problem:

PHP Fatal error: Uncaught exception 'Exception' with message 'Can' use last created DOM, because there isn't any. Use phpQuery::newDocument() first." in ***\libary\phpQuery.php 4515

So in the end I changed every CSS selector to an XPath selector. It is still running now!
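
For reference, the CSS selector from this report can be written as XPath in a field definition along these lines. This is a sketch: the field name 'title' is a placeholder, and the XPath uses direct-child steps to mirror the `>` combinators in the CSS version.

```php
<?php
// Field definition using XPath instead of CSS.
// 'title' is a hypothetical field name chosen for this example.
$field = array(
    'name'          => 'title',
    'selector'      => '//table/tr/td/h1', // equivalent of: table > tr > td > h1
    'selector_type' => 'xpath',
);
```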

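Bug: after children is configured in $conf, the rules that follow stop working (cause located)

core/phpspider.php

Line 1715:

    public function get_fields($confs, $html, $url, $page)

Line 1781:

    foreach ($values as $html)
    {
        // Recursive call, so any depth of children is supported
        $child_value = $this->get_fields($conf['children'], $html, $url, $page);
        if (!empty($child_value))
        {
            $child_values[] = $child_value;
        }
    }

The foreach reuses the already defined outer variable $html as its loop variable, which overwrites the outer value. Once the children rules have run, $html holds the last matched HTML fragment rather than the complete HTML of the page.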
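
The effect can be reproduced outside the framework. In the sketch below, extract_buggy() and extract_fixed() are hypothetical functions that only mimic the loop structure: the first reuses $html as the loop variable, the second uses a separate $child_html.

```php
<?php
// Mimics the reported bug: the loop variable shadows the parameter.
function extract_buggy(string $html, array $fragments): string
{
    foreach ($fragments as $html) {
        // child rules would run on $html here
    }
    return $html; // now the last fragment, not the full page
}

// One possible fix: give the loop its own variable name.
function extract_fixed(string $html, array $fragments): string
{
    foreach ($fragments as $child_html) {
        // child rules would run on $child_html here
    }
    return $html; // still the full page
}

$page = '<html>full page</html>';
$fragments = array('<p>a</p>', '<p>b</p>');
echo extract_buggy($page, $fragments), PHP_EOL;
echo extract_fixed($page, $fragments), PHP_EOL;
```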

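What is the CLI runtime environment?

I set up a PHP 5.4 environment on CentOS. Running demo/jd.php reports that it must be run in a CLI environment. I do not know what a CLI environment is. Also, does it only start under PHP5.7?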
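
CLI means running the script from the command line (for example `php demo/jd.php`) rather than through a web server. A script can detect which one it is running under with the built-in PHP_SAPI constant. The sketch below shows such a check; how phpspider itself performs the check is not shown in the source, so treat this as an illustration only.

```php
<?php
// Returns true when the script was started from the command line.
function running_in_cli(): bool
{
    return PHP_SAPI === 'cli';
}

if (!running_in_cli()) {
    echo 'Please run this script from the command line, e.g. php demo/jd.php', PHP_EOL;
} else {
    echo 'Running under the CLI SAPI', PHP_EOL;
}
```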

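Crawling AJAX content

A question: does this crawler have a way to crawl AJAX content?

Ctrl-C does not terminate

Ctrl-C does not terminate the crawler properly. PHP version:7.0.8-0ubuntu0.16.04.3
I am not sure whether this is an environment problem.
After adding declare(ticks = 1); Ctrl-C exits normally.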
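
A minimal sketch of why declare(ticks = 1) matters: with the pcntl extension, a handler registered through pcntl_signal() is only dispatched on ticks (or through pcntl_signal_dispatch(), or pcntl_async_signals() from PHP 7.1 on). The loop here is a short bounded stand-in for a crawl loop, and the $interrupted flag is an assumption of this example, not the framework's mechanism.

```php
<?php
declare(ticks = 1);

$interrupted = false;

// pcntl is not available on every build (for example on Windows),
// so the handler is only registered when the extension is loaded.
if (function_exists('pcntl_signal')) {
    pcntl_signal(SIGINT, function ($signo) use (&$interrupted) {
        $interrupted = true; // ask the loop to stop after the current step
    });
}

// Bounded stand-in for the crawl loop; Ctrl-C sets $interrupted.
for ($i = 0; $i < 3 && !$interrupted; $i++) {
    usleep(1000);
}

echo $interrupted ? 'Interrupted' : 'Finished', PHP_EOL;
```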

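I would really like to know about simulated login

I skimmed your code, and then read this sentence of yours:

Because Zhihu requires login before the followers page can be fetched, I logged in with Chrome, copied the cookie, and gave it to the curl program to simulate a login.

I still do not quite understand it, so I would very much like to know how you did it.
When I copied the browser cookie, turned it into an array, and added it to curl, it failed.
Please explain, thank you O(∩_∩)O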
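
One possible cause, offered as a guess: PHP's CURLOPT_COOKIE option expects a single string of the form "name=value; name2=value2", not an array. The sketch below joins an array into that form; build_cookie_header() and fetch_with_cookies() are hypothetical helpers, and the cookie names are placeholders rather than Zhihu's real cookie names.

```php
<?php
// Hypothetical helper: turn array('name' => 'value') into the
// "name=value; name2=value2" string that CURLOPT_COOKIE expects.
function build_cookie_header(array $cookies): string
{
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: fetch a page while sending the given cookies.
// Requires the curl extension; returns the body or false on failure.
function fetch_with_cookies(string $url, array $cookies)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, build_cookie_header($cookies));
    return curl_exec($ch);
}

// Placeholder cookie names and values copied from a logged-in browser session.
echo build_cookie_header(array('session_id' => 'abc123', 'token' => 'xyz')), PHP_EOL;
```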

PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted

----------------------------- PHPSPIDER -----------------------------
PHPSpider version:3.0.4 PHP version:7.0.15
start time:2017-01-28 22:02:43 run 0 days 6 hours 12 minutes
spider name: JD.com
task number: 1
load average: 4, 4.17, 4.28
document: https://doc.phpspider.org
------------------------------- TASKS -------------------------------
taskid taskpid mem collect succ collect fail speed
1 17294 1022MB 62956 0 2.82/s
--------------------------- COLLECT STATUS --------------------------
find pages queue collected fields depth
788541 726641 61900 8968 3

Press Ctrl-C to quit. Start success.
PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 217088 bytes) in /home/ken/php/phpspider/core/requests.php on line 276
PHP Stack trace:
PHP 1. {main}() /home/ken/php/phpspider/demo/jd_demo.php:0
PHP 2. phpspider->start() /home/ken/php/phpspider/demo/jd_demo.php:184
PHP 3. phpspider->do_collect_page() /home/ken/php/phpspider/core/phpspider.php:918
PHP 4. phpspider->collect_page() /home/ken/php/phpspider/core/phpspider.php:990
PHP 5. phpspider->request_url() /home/ken/php/phpspider/core/phpspider.php:1060
PHP 6. requests::get() /home/ken/php/phpspider/core/phpspider.php:1229
PHP 7. requests::request() /home/ken/php/phpspider/core/requests.php:431
PHP 8. requests::get_response_body() /home/ken/php/phpspider/core/requests.php:617
PHP 9. implode() /home/ken/php/phpspider/core/requests.php:276

Writing CSV fails when several values are scraped

20:44:21 Result 10: {"ip":["202.108.2.42","112.92.208.19","124.193.33.233","116.253.243.20","202.99.172.165","119.132.147.219","139.196.240.207","180.161.99.75","111.202.154.88","110.72.5.9"],"port":["80","9999","3128","9000","8081","9797","808","8123","8080","8123"]}

Array
(
    [ip] => Array
        (
            [0] => 202.108.2.42
            [1] => 112.92.208.19
            [2] => 124.193.33.233
            [3] => 116.253.243.20
            [4] => 202.99.172.165
            [5] => 119.132.147.219
            [6] => 139.196.240.207
            [7] => 180.161.99.75
            [8] => 111.202.154.88
            [9] => 110.72.5.9
        )

    [port] => Array
        (
            [0] => 80
            [1] => 9999
            [2] => 3128
            [3] => 9000
            [4] => 8081
            [5] => 9797
            [6] => 808
            [7] => 8123
            [8] => 8080
            [9] => 8123
        )

)

Notice: Array to string conversion in /Users/xcxxx/test/phpspider/core/util.php on line 529
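
The notice appears because each field holds an array rather than a scalar, and an array cannot be written into a single CSV cell. One workaround, sketched here outside the framework, is to transpose the per-field columns into one row per item before writing. columns_to_rows() and write_csv() are hypothetical helpers, not phpspider functions.

```php
<?php
// Hypothetical helper: turn array('ip' => [...], 'port' => [...])
// into a list of rows such as array('ip' => ..., 'port' => ...).
function columns_to_rows(array $columns): array
{
    $rows = array();
    foreach ($columns as $field => $values) {
        foreach ((array) $values as $i => $value) {
            $rows[$i][$field] = $value;
        }
    }
    return $rows;
}

// Hypothetical helper: write one CSV line per row.
function write_csv(string $path, array $rows): void
{
    $fh = fopen($path, 'w');
    foreach ($rows as $row) {
        fputcsv($fh, $row, ',', '"', '\\');
    }
    fclose($fh);
}

// First two items from the output above.
$rows = columns_to_rows(array(
    'ip'   => array('202.108.2.42', '112.92.208.19'),
    'port' => array('80', '9999'),
));
print_r($rows);
```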

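How should the system environment be set up for multi-threading?

The environment is LNMP, and both pcntl and redis are installed, but I still get this error. Do I need to install anything else?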
2017-05-03 23:59:05 [error] Spider kept running state needs Redis support, Error: The redis extension was not found
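
A common cause of this kind of error is that the extension is enabled for the web server's PHP but not for the command-line PHP, since the two can load different php.ini files. Whether that applies here is an assumption. The sketch below, run from the command line, lists which of the required extensions the CLI binary is missing; missing_extensions() is a hypothetical helper.

```php
<?php
// Hypothetical helper: return the names of extensions that are not loaded
// in the PHP binary running this script.
function missing_extensions(array $required): array
{
    return array_values(array_filter($required, function ($name) {
        return !extension_loaded($name);
    }));
}

$missing = missing_extensions(array('redis', 'pcntl'));
echo $missing ? 'Missing: ' . implode(', ', $missing) : 'All required extensions are loaded', PHP_EOL;
```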

Why does the test demo time out?

Environment: PHP 5.6 + nginx
A curl test works normally.

[root@bogon demo]# php 13384.php 

[13384 beauty photo crawler] Starting crawl...

! Documentation:
https://doc.phpspider.org

2016-07-21 05:18:47 Curl error: Connection time-out
05:18:47  Page download failed: http://www.13384.com/qingchunmeinv/

05:18:47  HTTP CODE:0

05:18:47  Crawl finished

Crawler running time: 00 hours 00 minutes 05 seconds
Total pages crawled: 0

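Is cls_curl used anywhere?

Is cls_curl used anywhere? As far as I can tell it is multi-threaded curl, but the code in phpspider.php uses requests. Why not cls_curl?
Also, have you considered multi-process support on Windows? pcntl is not available there. I seem to remember an extension that works on Windows, but after a long search I cannot find it.
Or multi-threading? There is the pthreads extension (I once wrote a crawler with pthreads, but it felt unstable, probably because of my code, and I can no longer find where I put it...)

The first demo in the tutorial does not run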

$configs = array(
    'name' => '糗事百科',
    'domains' => array(
        'qiushibaike.com',
        'www.qiushibaike.com'
    ),
    'scan_urls' => array(
        'http://www.qiushibaike.com/'
    ),
    'content_url_regexes' => array(
        "http://www.qiushibaike.com/article/\d+"
    ),
    'list_url_regexes' => array(
        "http://www.qiushibaike.com/8hr/page/\d+\?s=\d+"
    ),
    'fields' => array(
        array(
            // Extract the article content from the content page
            'name' => "article_content",
            'selector' => "//*[@id='single-next-link']",
            'required' => true
        ),
        array(
            // Extract the article author from the content page
            'name' => "article_author",
            'selector' => "//div[contains(@class,'author')]//h2",
            'required' => true
        ),
    ),
);
$spider = new phpspider($configs);
$spider->start();

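This example cannot be run directly.

Can a list page collect URLs only from a specific block?

Can a list page collect content URLs only from a specific block of the page, filtering out the URLs that are not needed?
In my tests, debug output inside the on_list_page callback produced nothing at all. on_scan_page does receive the page, but I could not work out how to filter.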
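
Independently of how the callbacks are wired up, the filtering itself can be done with PHP's DOM extension: restrict the link search to one block with XPath. The sketch below uses only DOMDocument and DOMXPath; links_in_block() is a hypothetical helper, and the sample HTML and the //div[@id='list'] block are invented for illustration.

```php
<?php
// Hypothetical helper: return the href values of all links that sit inside
// the block selected by $block_xpath, ignoring links elsewhere on the page.
function links_in_block(string $html, string $block_xpath): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $urls = array();
    foreach ($xpath->query($block_xpath . '//a[@href]') as $a) {
        $urls[] = $a->getAttribute('href');
    }
    return array_values(array_unique($urls));
}

// Invented sample page: only the links inside #list are wanted.
$sample_html = '<html><body>'
    . '<div id="list"><a href="/article/1">one</a><a href="/article/2">two</a></div>'
    . '<div id="footer"><a href="/about">about</a></div>'
    . '</body></html>';

print_r(links_in_block($sample_html, "//div[@id='list']"));
```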

Data is not being stored

Data is not written to Redis and no table is created, yet the crawler runs without any errors. What could be the cause?

Find list page: http://www.mafengwo.cn/gonglve/ajax.php?act=get_travellist&mddid=63515
23:00:53 Find list page: http://www.mafengwo.cn/gonglve/ajax.php?act=get_travellist&mddid=140736
23:00:53 Success process page: http://www.mafengwo.cn/mdd/base/list/pagedata_citylist?page=96 Use time: 0.338 s

23:00:53 Spider running time: 00 hour 02 minutes 10 seconds

23:00:53 Find pages: 1109

23:00:53 Waiting for collect pages: 1011

23:00:53 Collected pages: 98

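The author is a bit mischievous, ʅ(´◔౪◔)ʃ

I had just started and wrote my own crawler modelled on a demo, but it kept failing with 'Unknown error...'. So I read the source, found it thoroughly commented and clearly structured, and then grinned wickedly. Knock on the blackboard, here is the key point!!

/* Do NOT delete this comment */
/* 不要删除这段注释 */

ʅ(´◔౪◔)ʃ
La la la, la la la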

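How do I run the crawler as a scheduled task?

My crawler is written and it crawled all the site's URLs in one run. I then wanted a scheduled task on Linux to crawl at a fixed time every day, but on the second run the crawler prints:
Found that the data of Redis, no continue will empty Redis data start again
Do you want to continue? [Y/n]
Given this prompt, how should I set up the crawler as a scheduled task?
Please let me know, thank you!!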
