holgerd77 / django-dynamic-scraper

Creating Scrapy scrapers via the Django admin interface

Home Page: http://django-dynamic-scraper.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 94.65%, Shell 2.11%, HTML 2.72%, JavaScript 0.51%
Topics: python, django, scraper, scraping, scrapy, spider, webscraping

django-dynamic-scraper's Introduction

django-dynamic-scraper

Django Dynamic Scraper (DDS) is an app for Django which builds on top of the scraping framework Scrapy and lets you create and manage Scrapy spiders via the Django admin interface. It was originally developed for the German web TV program site http://fernsehsuche.de.

Documentation

Read more about DDS in the ReadTheDocs documentation: http://django-dynamic-scraper.readthedocs.io

Getting Help/Updates

There is a mailing list on Google Groups; feel free to ask questions or make suggestions there.

Info about new releases and updates is posted on Twitter.

django-dynamic-scraper's People

Contributors

bashu, deansherwin, holgerd77, joetric, kas, pitervergara, rakanalh, ryokamiya, zzpwelkin


django-dynamic-scraper's Issues

Support for more than one image attribute

Hello,

I think it would be interesting to handle more than one IMAGE attribute.

I think you could replace your DjangoImagesPipeline with this code:

import hashlib

from scrapy.http import Request
from scrapy.contrib.pipeline.images import ImagesPipeline

from dynamic_scraper.models import ScraperElem


class DjangoImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Request every scraped image attribute defined on the scraper, not just one.
        try:
            for img_elem in info.spider.scraper.get_image_elems():
                if img_elem.scraped_obj_attr.name in item and item[img_elem.scraped_obj_attr.name]:
                    yield Request(item[img_elem.scraped_obj_attr.name])
        except (ScraperElem.DoesNotExist, TypeError):
            pass

    def image_key(self, url):
        image_guid = hashlib.sha1(url).hexdigest()
        return '%s.jpg' % image_guid

    def thumb_key(self, url, thumb_id):
        image_guid = hashlib.sha1(url).hexdigest()
        return '%s.jpg' % image_guid

    def item_completed(self, results, item, info):
        try:
            img_elems = info.spider.scraper.get_image_elems()
        except ScraperElem.DoesNotExist:
            return item
        # Map every successfully downloaded image path back to the attribute it came from.
        for ok, result in results:
            if ok:
                for img_elem in img_elems:
                    if result['url'] == item[img_elem.scraped_obj_attr.name]:
                        item[img_elem.scraped_obj_attr.name] = result['path']
        return item

Using dynamic XPaths based on response.data

What I am trying to achieve is a little complex: I am looking to somehow dynamically choose the XPath for a scraped_obj_attr in a particular scraper class. To put this in simple words, say I have the following Django model that I want to populate using DDS:

class Article(models.Model):
    title = models.CharField(max_length=200)
    news_website = models.ForeignKey(NewsWebsite)
    description = models.TextField(blank=True)
    url = models.URLField()
    checker_runtime = models.ForeignKey(SchedulerRuntime, blank=True, null=True, on_delete=models.SET_NULL)

    def __unicode__(self):
        return self.title

I have created the required ScrapedObjClass, Scraper and NewsWebsite objects. Now in the scraper class I use the scraped_obj_attrs that were defined in the associated scraped object class.

This runs fine. In my use case I have article pages where the required information is present in multiple elements, and I want to pick it up from the most relevant one.

Say an article page has OG tags like <meta property="og:title" ... > and also <h1 class="title"> ...</h1>, both containing the same information that I want to feed into the title field of my Article model. The OG tag is optional and only present on some article pages. I want to check whether the OG tag is present in the response and, if so, use an XPath to get the information from it (example: //meta[contains(@property, "og:title")]/@content); if it is not in the response, use the XPath for the h1 element instead (example: //h1[@class="title"]/text()).

Is the above currently achievable with DDS? If yes, could you point me in the right direction? If not, could you treat this as a feature request?

Thanks,

Amyth
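
A possible workaround, sketched here as an assumption rather than a documented DDS feature: because the <meta> element in <head> comes before the <h1> in document order, a single XPath union expression combined with a take-first style output processor would pick the og:title when it exists and fall back to the h1 otherwise:

//meta[contains(@property, "og:title")]/@content | //h1[@class="title"]/text()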

How to scrape all sites

Hi, how can we scrape all articles given only an input domain, e.g.

wikipedia.com (scrape all content)

can't run python manage.py celeryd

Running $ python manage.py celeryd -l info -B --settings=example_project.settings gives me this error:
File "manage.py", line 10, in
TypeError: invalid arguments

My system info is below:
Python 2.7
Celery 3.1.19
Django-celery 3.1.16
Django-celery is installed and can be seen in the example_project Django admin page, but I get this issue when running the example command. Any advice would be appreciated. Thanks.

SystemCheckError: System check identified some issues:

SystemCheckError: System check identified some issues:

ERRORS:
srpy.Article.checker_runtime: (fields.E300) Field defines a relation with model 'SchedulerRuntime', which is either not installed, or is abstract.
srpy.NewsWebsite.scraper: (fields.E300) Field defines a relation with model 'Scraper', which is either not installed, or is abstract.
srpy.NewsWebsite.scraper_runtime: (fields.E300) Field defines a relation with model 'SchedulerRuntime', which is either not installed, or is abstract.

I just copied the tutorial, but something went wrong.
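
A likely cause, offered as an assumption rather than a confirmed diagnosis: fields.E300 usually means the app providing the referenced models is not installed. Following the tutorial, 'dynamic_scraper' has to be listed in INSTALLED_APPS and the related models imported explicitly, roughly like this:

# settings.py -- make sure the DDS app itself is installed
INSTALLED_APPS = [
    # ... Django apps and your own apps ...
    'dynamic_scraper',
    'srpy',
]

# srpy/models.py -- import the related models instead of referring to them by bare name
from django.db import models
from dynamic_scraper.models import Scraper, SchedulerRuntime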

loaddata of example got errors

I configured the example.
When I run python manage.py loaddata open_news/open_news.json,
I get the errors below:

$ python manage.py loaddata open_news/open_news.json
Problem installing fixture 'open_news/open_news.json': Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/django/core/management/commands/loaddata.py", line 196, in handle
obj.save(using=using)
File "/usr/lib64/python2.7/site-packages/django/core/serializers/base.py", line 165, in save
models.Model.save_base(self.object, using=using, raw=True)
File "/usr/lib64/python2.7/site-packages/django/db/models/base.py", line 551, in save_base
result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
File "/usr/lib64/python2.7/site-packages/django/db/models/manager.py", line 203, in _insert
return insert_query(self.model, objs, fields, **kwargs)
File "/usr/lib64/python2.7/site-packages/django/db/models/query.py", line 1576, in insert_query
return query.get_compiler(using=using).execute_sql(return_id)
File "/usr/lib64/python2.7/site-packages/django/db/models/sql/compiler.py", line 910, in execute_sql
cursor.execute(sql, params)
File "/usr/lib64/python2.7/site-packages/django/db/backends/util.py", line 40, in execute
return self.cursor.execute(sql, params)
File "/usr/lib64/python2.7/site-packages/django/db/backends/sqlite3/base.py", line 337, in execute
return Database.Cursor.execute(self, query, params)
IntegrityError: Could not load contenttypes.ContentType(pk=25): columns app_label, model are not unique

What's wrong with it?

Error caught on signal handler

I did the example following the "Advanced Topics" API, but I got this error:

Error caught on signal handler: <bound method ?.item_scraped of <scrapy.contrib.feedexport.FeedExporter object at 0x20568d0>>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
current.result = callback(current.result, _args, *_kw)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/core/scraper.py", line 213, in _itemproc_finished
item=output, response=response, spider=spider)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
return signal.send_catch_log_deferred(_a, *_kw)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
_arguments, *_named)
--- ---
File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 137, in maybeDeferred
result = f(_args, *_kw)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
return receiver(_arguments, *_named)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/contrib/feedexport.py", line 191, in item_scraped
slot.exporter.export_item(item)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/contrib/exporter/init.py", line 87, in export_item
self.file.write(self.encoder.encode(itemdict) + '\n')
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/utils/serialize.py", line 89, in encode
return super(ScrapyJSONEncoder, self).encode(o)
File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.5-py2.7.egg/scrapy/utils/serialize.py", line 109, in default
return super(ScrapyJSONEncoder, self).default(o)
File "/usr/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
exceptions.TypeError: <NewsWebsite: Wikinews> is not JSON serializable

Can anyone help?

how can i config scraper and crawl this articles?

First of all, thanks to the author for sharing this resource. It's helpful. I have a question like this:
for example:
<div class="date">2016-1-1</div>
<div class="fruit" >
<ul>
<li><span class="name">apple</span><span class="price">1.23</span><span class="minute">11:12 AM</span></li>
<li><span class="name">Lemon</span><span class="price">2.12</span><span class="minute">11:13 AM</span></li>
.......
</ul>
</div>
<div class="date">2016-1-2</div>
<div class="fruit" >
<ul>
<li><span class="name">apple</span><span class="price">1.33</span><span class="minute">06:12 AM</span></li>
<li><span class="name">Lemon</span><span class="price">2.42</span><span class="minute">09:13 AM</span></li>
.......
</ul>
</div>
My questions are: 1. How can I configure the SCRAPER ELEMS to scrape this data? I want the data like:
[{'Name':'Apple','Price':'1.21','DateTime':'2016-01-01 11:12' }......]
2. In the model I set only a datetime field for storing the date, so how can I combine the date and the minute?
I read the code but can't find a good solution.
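
On question 2, one way to merge the two scraped strings is outside of DDS entirely, in plain Python, before saving. A minimal sketch, with the values and formats assumed from the HTML above:

from datetime import datetime

date_str = '2016-1-1'     # text of the preceding <div class="date">
minute_str = '11:12 AM'   # text of the <span class="minute">
combined = datetime.strptime('%s %s' % (date_str, minute_str), '%Y-%m-%d %I:%M %p')
print(combined)           # 2016-01-01 11:12:00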

Rename follow_url attributes in DB

  • Change attribute type naming in ScrapedObjAttr from follow_url to detail_page_url for better understanding
  • Alongside, change follow_url to from_detail_page in ScraperElem definitions
  • Adapt docs (text/screenshots)

pre_url produces ERROR Unsupported URL scheme 'doublehttp' when rerunning scrapy after saving articles to DB

Hi,

I'm stuck on this problem. I configured an example similar to the startup project, providing a detail page with 'pre_url': 'http://www.website.com'. I want it to scrape the listing every hour (using crontab) and add any new articles.

When I run the command for the first time (Article table empty), it populates the items correctly. However, if I run the command again after new articles have been added (with scrapy crawl article_spider -a id=2 -a do_action=yes), with the Article table already populated, it does scrape the page but doesn't add the new articles:

2016-08-27 10:33:45 [scrapy] ERROR: Error downloading <GET doublehttp://www.website.com/politique/318534.html>
Traceback (most recent call last):
  File "/home/akram/eb-virt/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/akram/eb-virt/local/lib/python2.7/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/akram/eb-virt/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/home/akram/eb-virt/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/home/akram/eb-virt/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 64, in download_request
    (scheme, self._notconfigured[scheme]))
NotSupported: Unsupported URL scheme 'doublehttp': no handler available for that scheme
2016-08-27 10:33:45 [scrapy] INFO: Closing spider (finished)

I searched for this "doublehttp" scheme error but couldn't find anything useful.

Versions I have:

Twisted==16.3.2
Scrapy==1.1.2
scrapy-djangoitem==1.1.1
django-dynamic-scraper==0.11.2

URL in DB (for an article):

http://www.website.com/politique/318756.html

Scraped URL without pre_url:

/politique/318756.html

Any hint?

Thank you for your consideration and for this great project.
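
One illustrative workaround, assuming the problem is that the base URL gets prepended to values that are already absolute on the second run: guard the prefixing step so it only fires for relative paths. This is a standalone sketch, not the built-in pre_url processor:

def prepend_base_if_relative(url, base='http://www.website.com'):
    # Leave URLs that already carry a scheme untouched; prefix only relative paths.
    if url.startswith(('http://', 'https://')):
        return url
    return base + url

print(prepend_base_if_relative('/politique/318756.html'))                        # prefixed
print(prepend_base_if_relative('http://www.website.com/politique/318756.html'))  # unchanged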

ScraperAdmin: UnicodeEncodeError: 'ascii' codec can't encode...

URL: /admin/dynamic_scraper/scraper/1/

Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 85, in run
    self.result = application(self.environ, self.start_response)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/contrib/staticfiles/handlers.py", line 63, in __call__
    return self.application(environ, start_response)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
    response = self.get_response(request)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 218, in get_response
    response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 164, in get_response
    response = response.render()
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/response.py", line 158, in render
    self.content = self.rendered_content
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/response.py", line 135, in rendered_content
    content = template.render(context, self._request)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/backends/django.py", line 74, in render
    return self.template.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/template_timings_panel/panels/TemplateTimings.py", line 137, in timing_hook
    result = func(self, *args, **kwargs)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 210, in render
    return self._render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/test/utils.py", line 96, in instrumented_test_render
    return self.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 135, in render
    return compiled_parent._render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/test/utils.py", line 96, in instrumented_test_render
    return self.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 135, in render
    return compiled_parent._render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/test/utils.py", line 96, in instrumented_test_render
    return self.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/template_timings_panel/panels/TemplateTimings.py", line 137, in timing_hook
    result = func(self, *args, **kwargs)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 65, in render
    result = block.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/template_timings_panel/panels/TemplateTimings.py", line 137, in timing_hook
    result = func(self, *args, **kwargs)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 65, in render
    result = block.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/defaulttags.py", line 217, in render
    nodelist.append(node.render(context))
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 159, in render
    return template.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/template_timings_panel/panels/TemplateTimings.py", line 137, in timing_hook
    result = func(self, *args, **kwargs)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 212, in render
    return self._render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/test/utils.py", line 96, in instrumented_test_render
    return self.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/defaulttags.py", line 217, in render
    nodelist.append(node.render(context))
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/defaulttags.py", line 329, in render
    return nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/defaulttags.py", line 329, in render
    return nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 92, in render
    output = force_text(output)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/utils/encoding.py", line 94, in force_text
    s = six.text_type(bytes(s), encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

Pushing to PyPI

I know that django-dynamic-scraper is still alpha code, but is there a reason why you haven't marked it as such and pushed it to PyPI?

Pushing to PyPI would be great, and it's a trivial thing to do.

Integrate parts of Scrapy CrawlSpider Rule flexibility into DDS

CrawlSpider rules (http://doc.scrapy.org/en/0.14/topics/spiders.html) give Scrapy a more powerful tool for crawling pages from different URLs following a certain pattern than what is currently realized in DDS with pagination.

See the following Google Groups discussion thread for reference:
https://groups.google.com/forum/?fromgroups#!topic/django-dynamic-scraper/tQJMpcbqbfc

It would be desirable to integrate at least a part of it.

Ideas:

  • Applying a single "allow" rule could be integrated as a pagination type together with the pagination_append_str attribute, without changing the DB structure (see the sketch below)
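
For reference, this is roughly what a single "allow" rule looks like in a plain Scrapy CrawlSpider (Scrapy 1.x import paths); the spider name, URL and pattern are illustrative only:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow only links matching the "allow" pattern and hand them to parse_item.
        Rule(LinkExtractor(allow=(r'/articles/\d+\.html',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//h1/text()').extract_first()}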

scrapy ImportError: cannot import name log

When I try to start up my django-dynamic-scraper project, I get the following error:

$ python manage.py runserver
Validating models...

/usr/local/lib/python2.7/site-packages/debug_toolbar/settings.py:137: DeprecationWarning: INTERCEPT_REDIRECTS is deprecated. Please use the DISABLE_PANELS config in the DEBUG_TOOLBAR_CONFIG setting.
  "DEBUG_TOOLBAR_CONFIG setting.", DeprecationWarning)

Unhandled exception in thread started by <function wrapper at 0x10ce28488>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/django/utils/autoreload.py", line 93, in wrapper
    fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/commands/runserver.py", line 102, in inner_run
    self.validate(display_num_errors=True)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 310, in validate
    num_errors = get_validation_errors(s, app)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/validation.py", line 34, in get_validation_errors
    for (app_name, error) in get_app_errors().items():
  File "/usr/local/lib/python2.7/site-packages/django/db/models/loading.py", line 196, in get_app_errors
    self._populate()
  File "/usr/local/lib/python2.7/site-packages/django/db/models/loading.py", line 78, in _populate
    self.load_app(app_name)
  File "/usr/local/lib/python2.7/site-packages/django/db/models/loading.py", line 99, in load_app
    models = import_module('%s.models' % app_name)
  File "/usr/local/lib/python2.7/site-packages/django/utils/importlib.py", line 40, in import_module
    __import__(name)
  File "/Users/nateaune/Dropbox/code/moocminder/moocminder/moocscraper/models.py", line 3, in <module>
    from scrapy.contrib.djangoitem import DjangoItem
  File "/usr/local/lib/python2.7/site-packages/scrapy/__init__.py", line 56, in <module>
    from scrapy.spider import Spider
  File "/usr/local/lib/python2.7/site-packages/scrapy/spider.py", line 6, in <module>
    from scrapy import log
ImportError: cannot import name log

Since I don't know if this is a problem with DDS or scrapy, I've also reported it to the scrapy issue tracker: scrapy/scrapy#942

No module named future

I tried running the latest DDS using Python 2.7.10, and the following error appeared:

  File "/project/apps/feeds/tasks/__init__.py", line 3, in <module>
    from dynamic_scraper.utils.task_utils import TaskUtils
  File "/usr/local/lib/python2.7/site-packages/dynamic_scraper/utils/task_utils.py", line 2, in <module>
    from future import standard_library
ImportError: No module named future

I then figured that I was missing the 'future' package, so I installed it. I tried checking what the issue is, and it's probably because of this commit: 83dcf29

I am not sure why 'install_requires' has been commented out. This is preventing the package from installing its dependencies, hence the error.
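
For reference, restoring dependency installation would mean something like the following in setup.py; the package list is an assumption based on the requirements mentioned in this thread and elsewhere on this page, not the project's actual pins:

from setuptools import setup

setup(
    name='django-dynamic-scraper',
    # ... other metadata unchanged ...
    install_requires=[
        'Scrapy',
        'scrapy-djangoitem',
        'future',
    ],
)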

Duplicate items

I am scraping news, and the News model has a link field that is the ID_FIELD, but it's not unique in the database.

For some reason the scraper stores duplicates. In general, two duplicate items have a time difference of 2-5 minutes, so I don't believe it's a race condition. I even tried to handle this situation in my DjangoItem, but I still get duplicates. What might be the reason?

class NewsItem(DjangoItem):
    django_model = News

    def save(self, *a, **kw):
        link = self.instance.link
        existing = News.objects.filter(link=link).only('id').order_by('-ts').first()
        if existing:
            # Reuse the existing row's primary key so save() updates it
            # instead of inserting a duplicate.
            self.instance.pk = self.instance.id = existing.id
        instance = super(NewsItem, self).save(*a, **kw)
        return instance

TypeError: can't compare offset-naive and offset-aware datetimes

  File "/usr/lib/python2.7/wsgiref/handlers.py", line 85, in run
    self.result = application(self.environ, self.start_response)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/contrib/staticfiles/handlers.py", line 63, in __call__
    return self.application(environ, start_response)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
    response = self.get_response(request)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 218, in get_response
    response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 164, in get_response
    response = response.render()
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/response.py", line 158, in render
    self.content = self.rendered_content
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/response.py", line 135, in rendered_content
    content = template.render(context, self._request)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/backends/django.py", line 74, in render
    return self.template.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 210, in render
    return self._render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 202, in _render
    return self.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 135, in render
    return compiled_parent._render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 202, in _render
    return self.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 135, in render
    return compiled_parent._render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 202, in _render
    return self.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 65, in render
    result = block.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/loader_tags.py", line 65, in render
    result = block.nodelist.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 905, in render
    bit = self.render_node(node, context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/debug.py", line 79, in render_node
    return node.render(context)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/template/base.py", line 1273, in render
    _dict = func(*resolved_args, **resolved_kwargs)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 320, in result_list
    'results': list(results(cl))}
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 293, in results
    yield ResultList(form, items_for_result(cl, res, form))
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 287, in __init__
    super(ResultList, self).__init__(*items)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 199, in items_for_result
    f, attr, value = lookup_field(field_name, result, cl.model_admin)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/django/contrib/admin/utils.py", line 278, in lookup_field
    value = attr(obj)
  File "/home/vagrant/.virtualenv/local/lib/python2.7/site-packages/dynamic_scraper/admin.py", line 171, in last_scraper_save_
    if not obj.last_scraper_save or obj.last_scraper_save < datetime.datetime.now() - td:
TypeError: can't compare offset-naive and offset-aware datetimes
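
The traceback suggests admin.py compares the model's timezone-aware datetime (with USE_TZ enabled) against a naive datetime.datetime.now(). A minimal sketch of an awareness-safe comparison, as a hypothetical helper rather than the actual DDS fix:

import datetime
from django.utils import timezone

def scraper_save_older_than(last_save, td=datetime.timedelta(hours=1)):
    """Return True if last_save is unset or lies further in the past than td."""
    if last_save is None:
        return True
    # Pick a "now" that matches the awareness of the stored value.
    now = timezone.now() if timezone.is_aware(last_save) else datetime.datetime.now()
    return last_save < now - td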

Integrate scrapy-redis and implement distributed crawling

Objective: to use multiple servers to crawl pages, with the master server parsing the pages and processing the data (using XPath in parse_item).
Celery is a distributed task queue; it doesn't seem to be meant for distributed crawling, so it may make more sense to implement distributed crawling directly.

Can you give an idea of how to implement this objective?

Thanks

OperationalError at /admin/open_news/newswebsite/

OperationalError at /admin/open_news/newswebsite/
no such table: open_news_newswebsite
Request Method: GET
Request URL: http://scrap-prabhatiitbhu.c9users.io/admin/open_news/newswebsite/
Django Version: 1.9.9
Exception Type: OperationalError
Exception Value:
no such table: open_news_newswebsite
Exception Location: /home/ubuntu/workspace/venv/local/lib/python2.7/site-packages/django/db/backends/sqlite3/base.py in execute, line 323
Python Executable: /home/ubuntu/workspace/venv/bin/python
Python Version: 2.7.6
Python Path:
[u'/home/ubuntu/workspace/django-dynamic-scraper/example_project/example_project/../..',
'/home/ubuntu/workspace/django-dynamic-scraper/example_project',
'/home/ubuntu/workspace/venv/lib/python2.7',
'/home/ubuntu/workspace/venv/lib/python2.7/plat-x86_64-linux-gnu',
'/home/ubuntu/workspace/venv/lib/python2.7/lib-tk',
'/home/ubuntu/workspace/venv/lib/python2.7/lib-old',
'/home/ubuntu/workspace/venv/lib/python2.7/lib-dynload',
'/usr/lib/python2.7',
'/usr/lib/python2.7/plat-x86_64-linux-gnu',
'/usr/lib/python2.7/lib-tk',
'/home/ubuntu/workspace/venv/local/lib/python2.7/site-packages',
'/home/ubuntu/workspace/venv/local/lib/python2.7/site-packages/django_dynamic_scraper-0.11.2-py2.7.egg',
'/home/ubuntu/workspace/venv/lib/python2.7/site-packages',
'/home/ubuntu/workspace/venv/lib/python2.7/site-packages/django_dynamic_scraper-0.11.2-py2.7.egg']
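
A likely cause, stated as an assumption from the "no such table" message: the example app's migrations were never applied to this database. With Django 1.9 the usual remedy is to run the migrations first:

python manage.py migrate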

pagination appends replace string to the end of the object url

I am using the RANGE_FUNCT option for pagination in one of my products, but the URLs being requested have the replace string appended to the end of the URL instead of actually replacing the string in the URL. So instead of:

http://www.example.com/products/brandname_products2.html

the URL being requested is

http://www.example.com/products/brandname_products.html_products2.html

For more details on the pagination options and setup I am using, please refer to the SO question I posted.

Allow the user to customize the scraper

I am a newbie to web crawling. I am going to build a Django web app in which users fill out a form (keywords, time, etc.) and submit it. Then the scraper will begin to work and crawl the data according to the user's requirements from a specific website (set by me). After that the data will be passed to another model to do some clustering work. How can I do that?

I installed django dynamic scraper to my app and when i pushed it my app broke

I added django-dynamic-scraper to my app. It worked locally, so I deployed it to my Heroku server, and it's giving me a server error 500. I have no error messages in my Heroku logs, and it works fine locally. Since I have no errors and it works locally, I didn't know what to post and was wondering if someone may have had the same issue. This is crazy to me.

EDIT

I've done some digging. I tried to

 heroku run python manage.py makemigrations

and got this

Running python manage.py makemigrations on ⬢ heights... up, run.1515
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/__init__.py", line 353, in execute_from_command_line
    utility.execute()
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/__init__.py", line 345, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/commands/makemigrations.py", line 65, in handle
    loader = MigrationLoader(None, ignore_no_migrations=True)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 49, in __init__
    self.build_graph()
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 306, in build_graph
    _reraise_missing_dependency(migration, parent, e)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 276, in _reraise_missing_dependency
    raise exc
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 302, in build_graph
    self.graph.add_dependency(migration, key, parent)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/graph.py", line 126, in add_dependency
    parent
django.db.migrations.exceptions.NodeNotFoundError: Migration blog.0011_auto_20160816_1834 dependencies reference nonexistent parent node ('dynamic_scraper', '0018_auto_20160816_1834')

EDIT

this is my 0011_auto_20160816_1834.py

# -*- coding: utf-8 -*-
# Generated by Django 1.9.2 on 2016-08-16 22:34
from __future__ import unicode_literals

from django.db import migrations, models
import django.db.models.deletion


class Migration(migrations.Migration):

    dependencies = [
        ('dynamic_scraper', '0018_auto_20160816_1834'),
        ('blog', '0010_auto_20160627_1133'),
    ]

    operations = [
        migrations.CreateModel(
            name='NewsWebsite',
            fields=[
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('name', models.CharField(max_length=200)),
                ('scraper', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, to='dynamic_scraper.Scraper')),
                ('scraper_runtime', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, to='dynamic_scraper.SchedulerRuntime')),
            ],
        ),
        migrations.AddField(
            model_name='post',
            name='news_website',
            field=models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.CASCADE, to='blog.NewsWebsite'),
        ),
    ]

can I just delete

('dynamic_scraper', '0018_auto_20160816_1834'),

and run the migrations again? I don't want to completely crash my app. I'm a novice programmer and not completely sure of the effects of doing that on my app.

EDIT

When I run showmigrations:

(practice) apples-MBP:src ray$ ./manage.py showmigrations
    admin
     [X] 0001_initial
     [X] 0002_logentry_remove_auto_add
    auth
     [X] 0001_initial
     [X] 0002_alter_permission_name_max_length
     [X] 0003_alter_user_email_max_length
     [X] 0004_alter_user_username_opts
     [X] 0005_alter_user_last_login_null
     [X] 0006_require_contenttypes_0002
     [X] 0007_alter_validators_add_error_messages
    blog
     [X] 0001_initial
     [X] 0002_auto_20160404_2019
     [X] 0003_post_image_url
     [X] 0004_auto_20160406_2353
     [X] 0005_image
     [X] 0006_auto_20160603_2317
     [X] 0007_auto_20160603_2326
     [X] 0008_auto_20160625_1708
     [X] 0009_auto_20160627_1034
     [X] 0010_auto_20160627_1133
     [X] 0011_auto_20160816_1834
    contenttypes
     [X] 0001_initial
     [X] 0002_remove_content_type_name
    dynamic_scraper
     [X] 0001_initial
     [X] 0002_scraper_render_javascript
     [X] 0003_auto_20150610_0906
     [X] 0004_scrapedobjattr_id_field
     [X] 0005_new_dict_params_for_scraper
     [X] 0006_request_type_and_body
     [X] 0007_dont_filter_attribute
     [X] 0008_new_request_page_types_construct
     [X] 0009_removed_legacy_request_page_type_scraper_fields
     [X] 0010_move_save_to_db_to_scraped_obj_attr
     [X] 0011_extracted_checker_attributes_to_own_checker_class
     [X] 0012_removed_legacy_checker_scraper_attributes
     [X] 0013_added_scraper_save_and_checker_delete_datetime_fields
     [X] 0014_added_scraper_save_and_checker_delete_alert_period_fields_for_scraper
     [X] 0015_added_datetime_fields_for_last_scraper_save_and_checker_delete_alert
     [X] 0016_optional_xpath_fields_text_type_for_x_path_reg_exp_processor_fields
     [X] 0017_added_order_to_scraped_obj_attr
     [X] 0018_auto_20160816_1834
    sessions
     [X] 0001_initial
    taggit
     [X] 0001_initial
     [X] 0002_auto_20150616_2121

EDIT: when I run showmigrations on my production server

(practice) apples-MBP:src ray$ heroku run python manage.py showmigrations
Running python manage.py showmigrations on ⬢ cheights... up, run.7252
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/__init__.py", line 353, in execute_from_command_line
    utility.execute()
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/__init__.py", line 345, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/commands/showmigrations.py", line 36, in handle
    return self.show_list(connection, options['app_labels'])
  File "/app/.heroku/python/lib/python3.5/site-packages/django/core/management/commands/showmigrations.py", line 44, in show_list
    loader = MigrationLoader(connection, ignore_no_migrations=True)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 49, in __init__
    self.build_graph()
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 306, in build_graph
    _reraise_missing_dependency(migration, parent, e)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 276, in _reraise_missing_dependency
    raise exc
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/loader.py", line 302, in build_graph
    self.graph.add_dependency(migration, key, parent)
  File "/app/.heroku/python/lib/python3.5/site-packages/django/db/migrations/graph.py", line 126, in add_dependency
    parent
django.db.migrations.exceptions.NodeNotFoundError: Migration blog.0011_auto_20160816_1834 dependencies reference nonexistent parent node ('dynamic_scraper', '0018_auto_20160816_1834')
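
One reading of this, offered as an assumption: the dynamic_scraper migration 0018 exists in the django-dynamic-scraper release installed locally (it shows up in the local showmigrations output above) but not in the release installed on Heroku. In that case, deleting the dependency line would only hide the mismatch; installing the same DDS version on Heroku should restore the missing migration, e.g.:

pip freeze | grep django-dynamic-scraper    # locally, note the exact version
# then pin that exact version in requirements.txt before deploying again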

Integrating Scrapy Commands into Django

I've been writing a few management commands for DDS that will allow you to execute Scrapy commands using Django's manage.py.

It is not too simple, but it has its benefits:

  1. You can run commands using manage.py. All other applications do it, so why shouldn't Scrapy?
  2. Having Scrapy settings in the Django settings.py file.
  3. Support for multiple settings files as allowed in Django using the --settings argument. This way, you could have multiple settings files and specify the one that Scrapy should use.
  4. Providing help and output in the same way that Django does for its other management commands. You get a standardised output format across all management commands.

I'm trying to make this management command as dynamic and flexible as possible. I don't intend to touch even a bit of Scrapy code. I was looking at how other projects are implemented and had a look at django-celery. Instead of having one management command, they have two. It seems to be a bad idea to have one single Scrapy management command; we should have them separately. For example, crawl would be one management command and server would be another.

However, there is one little issue: Scrapy has a whole bunch of management commands that don't all need to be executed through manage.py directly, and it doesn't make sense at this juncture to port them all. If someone has free time, they can add them later. Those commands will still be accessible, but only through the default scrapy command.

  • crawl: allows you to run a spider and should be implemented
  • list: allows you to view a list of all the spiders and should be implemented
  • shell: allows you to run a scraping shell for testing and should be implemented

This is a list of commands that are "good to have" but not that important, or overkill to implement at this point:

  • runspider: allows you to run a spider, and I'm not sure this is all that important. You can already run a spider using the crawl command, and this could be left out for now.
  • genspider: allows you to create a spider template, and I don't see why this should be implemented either. It uses a Scrapy template for creating a spider, but the DDS spider is a little different and we would need a different template. It would be good to have this, but I don't have the time.
  • settings: lists out all the settings, and I think it isn't that important either. We can skip this. Even Django doesn't have a separate command for viewing settings.
  • versions: lists out versions of the underlying libraries. Yolk is the de-facto method of viewing installed Python packages and their versions, so I feel this should be skipped.
  • server: allows you to run a Scrapy server that can be controlled using a web interface, but requires the scrapy binaries to be installed using apt-get. This could be implemented but, again, seems a bit of an overkill at this point.
  • deploy: allows you to package a spider and deploy it to a server using distribute. This is a little overkill at this point. You could just run the spider using the crawl command for now, and we could implement this later.
  • edit: allows you to edit a spider using your default editor. I don't see why this is critical.
  • view: allows you to view a Scrapy URL in the browser as seen by Scrapy. Again, I don't think it is critical.

These seem to be important-ish but I haven't really delved into them. They're mainly for testing, building and debugging your spiders and can be implemented later.

  • fetch: fetches the contents of a URL and displays it in the console
  • parse: fetches the contents of a URL, parses it using the default spider and prints the parsed output to the console with colours.

If you have read the long and boring aforementioned text, do let me know if this is okay. I'll fork your repo, make the changes, implement the four commands (crawl, shell, server and list) and send you a pull request.

I've gone through this issue in great detail today. If you don't agree with my points, I'd love to hear your opinion. Thanks.

(Whew)
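
To make the proposal concrete, here is a rough sketch of what a crawl-style wrapper could look like; the module path, settings module and argument handling are illustrative assumptions, not the eventual implementation:

# yourapp/management/commands/scrapy_crawl.py (hypothetical path)
import os

from django.core.management.base import BaseCommand
from scrapy.cmdline import execute


class Command(BaseCommand):
    help = 'Run a Scrapy spider through manage.py (illustrative wrapper).'

    def add_arguments(self, parser):
        parser.add_argument('spider', help='name of the spider to run')

    def handle(self, *args, **options):
        # Tell Scrapy which settings module to use before delegating to its CLI.
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'example_project.scrapy_settings')
        execute(['scrapy', 'crawl', options['spider']])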

Internal error when visiting Scrapers in Admin

Hi, I installed DDS in my test project locally.

What I did:
installed via pip
added the app to my installed apps in settings
synced the db

When I visit my admin site and click on Scrapers (or Log markers, Logs...) I get this error:


InternalError at /admin/dynamic_scraper/scraper/
current transaction is aborted, commands ignored until end of transaction block
Request Method: GET
Request URL:    http://localhost:8000/admin/dynamic_scraper/scraper/
Django Version: 1.5.1
Exception Type: InternalError
Exception Value:    
current transaction is aborted, commands ignored until end of transaction block
Exception Location: /home/nishant/venvs/datafootball/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py in execute_sql, line 840
Python Executable:  /home/nishant/venvs/datafootball/bin/python
Python Version: 2.7.3
Python Path:    
['/home/nishant/workspace/datafootball',
 '/home/nishant/venvs/datafootball/local/lib/python2.7/site-packages/distribute-0.6.24-py2.7.egg',
 '/home/nishant/venvs/datafootball/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg',
 '/home/nishant/venvs/datafootball/lib/python2.7',
 '/home/nishant/venvs/datafootball/lib/python2.7/plat-linux2',
 '/home/nishant/venvs/datafootball/lib/python2.7/lib-tk',
 '/home/nishant/venvs/datafootball/lib/python2.7/lib-old',
 '/home/nishant/venvs/datafootball/lib/python2.7/lib-dynload',
 '/usr/lib/python2.7',
 '/usr/lib/python2.7/plat-linux2',
 '/usr/lib/python2.7/lib-tk',
 '/home/nishant/venvs/datafootball/local/lib/python2.7/site-packages',
 '/home/nishant/workspace/datafootball/datafootball/datafootball',
 '/home/nishant/workspace/datafootball/datafootball']

Did I do something wrong? Any idea what the cause could be?

TIA

Install fails on Windows with pip and easy_install

After doing some digging, I believe the failure is caused by the trailing slash in the MANIFEST.in file:
prune example_project/.scrapy/

I got around the issue by installing it manually, and everything seems to be working fine; just a heads up, and thanks for this project. If you make the update I will try to install again to see if that solves the problem.

scrapyd return ERROR

I tried to run scrapyd, but received only
{"status": "error", "message": "spider 'article_checker' not found"}

conn.request("POST", "/schedule.json", params, headers)
conn.getresponse().read()

Is scrapyd supported?

TakeFirst vs Returning multiple elements found

Post on mailing list from Rakan:

Hello,

I've posted about scraping multiple images using DDS previously on this group. However, I discovered that if you're capturing an article (for example) with a body spread over multiple paragraph tags, you'll only get the first one. Digging into this, I found out that the DjangoSpider defines:

  1. _get_processors
    procs = [TakeFirst(), processors.string_strip,]
    as the default processors
    and
  2. self.loader.default_output_processor = TakeFirst()

Do you think this was the correct design decision, given that it adds the restriction of ending up with only one element per XPath or regex? I believe DDS should allow the flexibility of returning multiple elements, and users would then apply TakeFirst themselves when they only need the first element.

Thanks,
Rakan
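
For illustration, the difference between the two behaviours with Scrapy's stock loader processors (import path as in Scrapy 1.x; values made up):

from scrapy.loader.processors import TakeFirst, Join

paragraphs = ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
print(TakeFirst()(paragraphs))   # 'First paragraph.' -- what the current default returns
print(Join('\n')(paragraphs))    # all three paragraphs joined -- one multi-element option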

with scrapyd got this error: exceptions.OSError: [Errno 20] Not a directory

Sorry, my English is not very good, but I'm working on it.
I'm following the docs to learn how to use DDS. Everything works fine up to the 'Scheduling scrapers/checkers' section in the advanced topics. I started scrapyd and deployed the Scrapy project to it, then ran this command:

python manage.py celeryd -l info -B --settings=DjangoScrapy.settings

The job runs every 5 minutes, just as I set in the Django admin, but I get this error in the terminal window in which I started scrapyd:

......
2015-08-17 00:03:24+0800 [Launcher,13578/stderr] /Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/dynamic_scraper/pipelines.py:4: ScrapyDeprecationWarning: Module `scrapy.contrib.pipeline` is deprecated, use `scrapy.pipelines` instead
      from scrapy.contrib.pipeline.images import ImagesPipeline
2015-08-17 00:03:24+0800 [Launcher,13578/stderr] /Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/dynamic_scraper/pipelines.py:4: ScrapyDeprecationWarning: Module `scrapy.contrib.pipeline.images` is deprecated, use `scrapy.pipelines.images` instead
      from scrapy.contrib.pipeline.images import ImagesPipeline
2015-08-17 00:03:24+0800 [Launcher,13578/stderr] Unhandled error in Deferred:
2015-08-17 00:03:24+0800 [Launcher,13578/stderr]

Traceback (most recent call last):
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/crawler.py", line 153, in crawl
    d = crawler.crawl(*args, **kwargs)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.engine = self._create_engine()
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/crawler.py", line 83, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/core/engine.py", line 67, in __init__
    self.scraper = Scraper(crawler)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/core/scraper.py", line 70, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/middleware.py", line 56, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/pipelines/media.py", line 33, in from_crawler
    pipe = cls.from_settings(crawler.settings)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/pipelines/images.py", line 57, in from_settings
    return cls(store_uri)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/dynamic_scraper/pipelines.py", line 15, in __init__
    super(DjangoImagesPipeline,  self).__init__(*args, **kwargs)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/pipelines/files.py", line 160, in __init__
    self.store = self._get_store(store_uri)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/pipelines/files.py", line 181, in _get_store
    return store_cls(uri)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/pipelines/files.py", line 43, in __init__
    self._mkdir(self.basedir)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/site-packages/scrapy/pipelines/files.py", line 72, in _mkdir
    os.makedirs(dirname)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/Users/zhushajun/.pyenv/versions/scrapy_env2/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
exceptions.OSError: [Errno 20] Not a directory: '/private/var/folders/32/2_lmw4t55m5fx_gb5v7d_0vr0000gn/T/ScrapyDemo-1439724559-dDctsA.egg/ScrapyDemo'
2015-08-17 00:03:24+0800 [-] Process finished:  project='ScrapyDemo' spider='article_spider' job='51059b61443011e58ecf3c075429eea5' pid=13578 log='logs/ScrapyDemo/article_spider/51059b61443011e58ecf3c075429eea5.log' items='file:///Users/zhushajun/items/ScrapyDemo/article_spider/51059b61443011e58ecf3c075429eea5.jl'

(ScrapyDemo is my Scrapy project and the Django app name.)
I'm confused and don't know how to fix this; could you please help me fix this error? Thanks!

By the way, if I don't know how many pages a website has, does pagination support scraping as many pages as it can? Or is there a way to check whether the page has a 'next page' button and then scrape all the pages? That would be very useful to me.

Issues with running tests

Hello,

There are issues when running the tests.

If you use

cd tests
./run_tests.sh

The following errors will emerge:

Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/core/management/__init__.py", line 385, in execute_from_command_line
    utility.execute()
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/core/management/__init__.py", line 377, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/core/management/commands/test.py", line 50, in run_from_argv
    super(Command, self).run_from_argv(argv)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/core/management/base.py", line 288, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/core/management/commands/test.py", line 71, in execute
    super(Command, self).execute(*args, **options)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/core/management/base.py", line 338, in execute
    output = self.handle(*args, **options)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/core/management/commands/test.py", line 88, in handle
    failures = test_runner.run_tests(test_labels)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/test/runner.py", line 146, in run_tests
    suite = self.build_suite(test_labels, extra_tests)
  File "/Users/rakan/.virtualenvs/dynamic-scraper/lib/python2.7/site-packages/django/test/runner.py", line 66, in build_suite
    tests = self.test_loader.loadTestsFromName(label)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/loader.py", line 100, in loadTestsFromName
    parent, obj = obj, getattr(obj, part)
AttributeError: 'module' object has no attribute 'ModelTest'

Running

python manage.py test basic

is all fine. However, running

python manage.py test scraper

Will result in:

======================================================================
ERROR: test_xml_content_type (scraper.scraper_run_test.ScraperRunTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rakan/Work/Python/django-dynamic-scraper/tests/scraper/scraper_test.py", line 155, in setUp
    self.crawler.install()
AttributeError: 'CrawlerProcess' object has no attribute 'install'

----------------------------------------------------------------------

This is for all tests in scraper.

Allow storing extra XPATHs / add another pagination option

Currently only five XPATH types are stored: STANDARD, STANDARD_UPDATE, DETAIL, BASE and IMAGE. It would be good to have another type called EXTRA.

Quite often I need to access an XPATH value that is not necessarily mapped to a model field. In my case, I need an additional XPATH for finding the next pagination link and have had to resort to using one of the other fields as a hack.
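
For illustration, this is the kind of "follow the next-page link" behaviour an EXTRA type would be used for. The sketch below is plain Scrapy (1.x API), not a DDS feature; the spider name, URL and XPaths are assumptions chosen only to show the idea:

import scrapy

class OverviewSpider(scrapy.Spider):
    name = 'overview_example'
    start_urls = ['http://example.com/articles/']

    def parse(self, response):
        # Follow each entry on the overview page (placeholder XPath).
        for href in response.xpath('//div[@class="entry"]//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

        # The "EXTRA" XPath: a next-page link that is not mapped to any model field.
        next_page = response.xpath('//a[@rel="next"]/@href').extract()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0]), callback=self.parse)

    def parse_detail(self, response):
        yield {'url': response.url}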

ERROR: Mandatory elem title missing!

When I try to run the scraper on the example from the docs, it seems to get stuck scraping an item, and never saves the item to the Django database:

scrapy crawl article_spider -a id=1 -a do_action=yes
2013-09-11 20:18:25-0700 [scrapy] INFO: Scrapy 0.18.2 started (bot: open_news)
2013-09-11 20:18:25-0700 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2013-09-11 20:18:25-0700 [scrapy] DEBUG: Overridden settings: {'SPIDER_MODULES': ['dynamic_scraper.spiders', 'open_news.scraper'], 'ITEM_PIPELINES': ['dynamic_scraper.pipelines.ValidationPipeline', 'open_news.scraper.pipelines.DjangoWriterPipeline'], 'USER_AGENT': 'open_news/1.0', 'BOT_NAME': 'open_news'}
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Enabled item pipelines: ValidationPipeline, DjangoWriterPipeline
2013-09-11 20:18:26-0700 [article_spider] INFO: Spider opened
2013-09-11 20:18:26-0700 [article_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-09-11 20:18:26-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-09-11 20:18:26-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Main_Page> (referer: None)
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 1.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'The United States President Barack Obama announced last Saturday he was seeking Congressional authorisation for military intervention in Syria. Wikinews interviewed professors Scott Lucas, Professor of American Studies from the UK's University of Birmingham; Majid Rafizadeh, the President of the International American Council on the Middle East; and ProfEyal Zisser, a Syrian expert from Tel Aviv University about the risks of military intervention in Syria.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Wikinews_interviews_Scott_Lucas,_Eyal_Zisser,_Majid_Rafizadeh_about_risks_of_US_military_intervention_in_Syria'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 2.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'attended the finals of the Women's National Wheelchair Basketball League at the Sydney University Sports and Aquatic Centre over the weekend.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Western_Stars_win_Women%27s_National_Wheelchair_Basketball_League_championship_in_a_thriller'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 3.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'Local residents of Dungog, New South Wales, held a celebratory nature walk after they received assurance that their local forest was deemed worthy of "enduring protection." A proposal before the NSW government to log over one million hectares of protected national park forests had caused alarm among nature conservationists.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Dungog,_Australia_residents_celebrate_continued_protection_of_local_forest'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 4.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'attended a roller derby event at the Caloundra Indoor Stadium on Australia's Sunshine Coast Saturday.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Coastal_Assassins_get_wins_at_roller_derby_event_on_Australia%27s_Sunshine_Coast'
2013-09-11 20:18:26-0700 [article_spider] INFO: Starting to crawl item 5.
2013-09-11 20:18:26-0700 [article_spider] DEBUG: description         'A group of volcanologists from the UK and USA have traveled to North Korea to assist them with conducting scientific investigations near the volcano Mount Paektu.'
2013-09-11 20:18:26-0700 [article_spider] DEBUG: url                 'http://en.wikinews.org/wiki/Wikinews_interviews_Dr._Robert_Kelly_regarding_joint_scientific_venture_in_North_Korea'
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Western_Stars_win_Women%27s_National_Wheelchair_Basketball_League_championship_in_a_thriller> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u"attended the finals of the Women's National Wheelchair Basketball League at the Sydney University Sports and Aquatic Centre over the weekend.",
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Western_Stars_win_Women%27s_National_Wheelchair_Basketball_League_championship_in_a_thriller'}
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Coastal_Assassins_get_wins_at_roller_derby_event_on_Australia%27s_Sunshine_Coast> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Wikinews_interviews_Scott_Lucas,_Eyal_Zisser,_Majid_Rafizadeh_about_risks_of_US_military_intervention_in_Syria> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Dungog,_Australia_residents_celebrate_continued_protection_of_local_forest> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u"attended a roller derby event at the Caloundra Indoor Stadium on Australia's Sunshine Coast Saturday.",
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Coastal_Assassins_get_wins_at_roller_derby_event_on_Australia%27s_Sunshine_Coast'}
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] DEBUG: Crawled (200) <GET http://en.wikinews.org/wiki/Wikinews_interviews_Dr._Robert_Kelly_regarding_joint_scientific_venture_in_North_Korea> (referer: http://en.wikinews.org/wiki/Main_Page)
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u"The United States President Barack Obama announced last Saturday he was seeking Congressional authorisation for military intervention in Syria. Wikinews interviewed professors Scott Lucas, Professor of American Studies from the UK's University of Birmingham; Majid Rafizadeh, the President of the International American Council on the Middle East; and ProfEyal Zisser, a Syrian expert from Tel Aviv University about the risks of military intervention in Syria.",
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Wikinews_interviews_Scott_Lucas,_Eyal_Zisser,_Majid_Rafizadeh_about_risks_of_US_military_intervention_in_Syria'}
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u'Local residents of Dungog, New South Wales, held a celebratory nature walk after they received assurance that their local forest was deemed worthy of "enduring protection." A proposal before the NSW government to log over one million hectares of protected national park forests had caused alarm among nature conservationists.',
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Dungog,_Australia_residents_celebrate_continued_protection_of_local_forest'}
2013-09-11 20:18:27-0700 [article_spider] DEBUG: title               None
2013-09-11 20:18:27-0700 [article_spider] ERROR: Mandatory elem title missing!
2013-09-11 20:18:27-0700 [article_spider] WARNING: Dropped: 
    {u'description': u'A group of volcanologists from the UK and USA have traveled to North Korea to assist them with conducting scientific investigations near the volcano Mount Paektu.',
     u'title': None,
     u'url': u'http://en.wikinews.org/wiki/Wikinews_interviews_Dr._Robert_Kelly_regarding_joint_scientific_venture_in_North_Korea'}
2013-09-11 20:18:27-0700 [article_spider] INFO: Closing spider (finished)
2013-09-11 20:18:27-0700 [article_spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1921,
     'downloader/request_count': 6,
     'downloader/request_method_count/GET': 6,
     'downloader/response_bytes': 89787,
     'downloader/response_count': 6,
     'downloader/response_status_count/200': 6,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 9, 12, 3, 18, 27, 275875),
     'item_dropped_count': 5,
     'item_dropped_reasons_count/DropItem': 5,
     'log_count/DEBUG': 27,
     'log_count/ERROR': 5,
     'log_count/INFO': 8,
     'log_count/WARNING': 5,
     'request_depth_max': 1,
     'response_received_count': 6,
     'scheduler/dequeued': 6,
     'scheduler/dequeued/memory': 6,
     'scheduler/enqueued': 6,
     'scheduler/enqueued/memory': 6,
     'start_time': datetime.datetime(2013, 9, 12, 3, 18, 26, 203931)}
2013-09-11 20:18:27-0700 [article_spider] INFO: Spider closed (finished) 
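
Looking at the dropped items above, description and url are filled from the overview page while title is always None, and the "title None" debug lines appear right after the detail pages are fetched. This suggests the title XPath is being evaluated on a page where it does not match (for example, the title attribute accidentally configured to be scraped from the detail page). A debugging sketch with the Scrapy shell; the XPaths are the ones from the DDS open_news tutorial and are only assumptions about this particular scraper:

# scrapy shell "http://en.wikinews.org/wiki/Main_Page"
# (on older Scrapy versions use sel.xpath(...) instead of response.xpath(...))
bases = response.xpath('//td[@class="l_box"]')
len(bases)                                                     # number of article teasers found
bases[0].xpath('span[@class="l_title"]/a/@title').extract()    # title, relative to the base node
# If the title scraper element has "from detail page" enabled ("request page type"
# in newer DDS versions), the same XPath is applied to the article page instead,
# where it matches nothing and the item is dropped as above.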

how "dynamic scheduling" config and work

Hi, I love the "dynamic scheduling" idea. However, I cannot make it work in the example_project, even after reading the docs and trying everything in the example_project.
Actually, I do not quite understand the "Scheduling configuration" section in the docs.
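
For reference, scheduling in DDS has two parts: a periodic Celery task that starts all spider runs that are due, and the per-object SchedulerRuntime whose next action time is adjusted after each run according to the "Scheduling configuration" text field (values such as MIN_TIME and MAX_TIME). Below is a sketch of the task module following the open_news example from the DDS tutorial; the model and argument names ('NewsWebsite', 'scraper', 'scraper_runtime', 'article_spider') are the tutorial's, and the exact run_spiders() signature has varied between DDS versions:

# tasks.py of the Django app -- a sketch based on the tutorial's open_news example
from celery.task import task

from dynamic_scraper.utils.task_utils import TaskUtils
from open_news.models import NewsWebsite

@task()
def run_spiders():
    t = TaskUtils()
    # Starts a crawl for every NewsWebsite whose scraper_runtime says it is due;
    # the runtime's next action time is then pushed further out or pulled closer
    # depending on whether the run scraped anything.
    t.run_spiders(NewsWebsite, 'scraper', 'scraper_runtime', 'article_spider')

This task then has to be scheduled (e.g. via celery beat) to run every few minutes, as described in the scheduling section of the docs.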

Attributes crawled from the detail_page are not detected as unicode

For instance here is the log related to an attribute crawled on the detail page:

[flatrent_vivastreet_spider] DEBUG: address 'Paris 9\u00e8me ardt - 75009, Paris 9\u00e8me ardt, 75009, FR'

And when the item is saved it becomes:

{u'address': u'Paris 9\u00e8me ardt - 75009, Paris 9\u00e8me ardt, 75009, FR',

But it should be:

{u'address': u'Paris 9ème ardt - 75009, Paris 9ème ardt, 75009, FR',

Does anyone know why?

Again, this problem occurs only for attributes scraped from the detail page; the others are retrieved flawlessly.
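
A quick way to narrow this down is to check what actually ended up in the database, since a debug log always shows non-ASCII characters in escaped form. A diagnostic sketch for the Django shell (Python 2); 'Article' and 'address' are placeholders for the real model and field:

a = Article.objects.latest('id')
print type(a.address)   # unicode or str?
print repr(a.address)   # a real 'è' is shown as \u00e8 here as well -- that alone is fine
print a.address         # prints 'è' if the character was decoded correctly
# Only if the stored value literally contains the six characters "\u00e8" is something
# wrong; decoding the escape once would then recover the intended character:
# print a.address.decode('unicode_escape')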

Can't scrape multiple URLs from overview page

Hey!

I've followed the README from top to bottom, and things work fine. If I point it at a detail page, the scraper puts the item in my database. However, there's a problem with the BASE.

If there are about 50 entries on an overview page, it only keeps scraping the first URL, even though I've made multiple changes to the XPath, and the XPath returns a list of all such objects when I check it in the Scrapy shell. I've tried a lot and gone through the code, but still can't find a possible solution.
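
One pattern that produces exactly this symptom: DDS evaluates the detail-page URL XPath relative to each node matched by the BASE XPath, so if the URL XPath starts with '//' it re-selects from the document root and yields the same first link for every entry. A quick check in the Scrapy shell; the URL and XPaths below are placeholders, not the actual configuration:

# scrapy shell "http://example.com/overview/"
# (on older Scrapy versions use sel.xpath(...) instead of response.xpath(...))
bases = response.xpath('//div[@class="entry"]')   # the BASE XPath
len(bases)                                        # should be ~50
bases[3].xpath('a/@href').extract()               # relative XPath: differs per entry
bases[3].xpath('//a/@href').extract()[:1]         # absolute XPath: always the document's first link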

Pagination creating workflow

Can you suggest what I can do with links like this:

http://localhost/category/category_name/offset15/

Where does category_name come from? Do relative links work in the pagination append string?
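
A configuration sketch for offset-style URLs, using the pagination fields on the DDS Scraper model as described in the "Pagination" section of the docs; the concrete numbers are assumptions:

    Main page URL (on the reference object):  http://localhost/category/category_name
    pagination_type:          RANGE_FUNCT
    pagination_append_str:    /offset{page}/
    pagination_page_replace:  15, 150, 15

With RANGE_FUNCT, pagination_page_replace is fed to Python's range(start, end, step), and each generated number replaces {page} in pagination_append_str, which is appended to the main page URL, giving .../offset15/, .../offset30/ and so on. category_name itself is not generated by the pagination; it comes from the URL stored on the reference object, so each category needs its own object with its own URL.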
