sebdah / scrapy-mongodb

MongoDB pipeline for Scrapy. This module supports MongoDB both in standalone setups and in replica sets. scrapy-mongodb inserts items into MongoDB as soon as your spider finds data to extract.

Home Page: http://sebdah.github.com/scrapy-mongodb/

License: Other

scrapy-mongodb's Introduction

scrapy-mongodb

MongoDB pipeline for Scrapy. This library supports MongoDB both in standalone setups and in replica sets. It inserts items into MongoDB as soon as your spider finds data to extract. scrapy-mongodb can also buffer objects if you prefer to write chunks of data to MongoDB rather than one write per document (see the MONGODB_BUFFER_DATA option for details).

INSTALLATION

Dependencies

Read more here.

Instructions

Install via pip:

pip install -r requirements.txt
pip install scrapy-mongodb

CONFIGURATION

Basic configuration

Add these options to your settings.py file:

ITEM_PIPELINES = {
    ...
    'scrapy_mongodb.MongoDBPipeline': 300,
    ...
}

MONGODB_URI = 'mongodb://localhost:27017'
MONGODB_DATABASE = 'scrapy'
MONGODB_COLLECTION = 'my_items'

If you want a unique key in your database, specify the field to use with this option:

MONGODB_UNIQUE_KEY = 'url'

Replica sets

You can configure scrapy-mongodb to support MongoDB replica sets by adding the MONGODB_REPLICA_SET option and specifying the additional replica set hosts in MONGODB_URI:

MONGODB_REPLICA_SET = 'myReplicaSetName'
MONGODB_URI = 'mongodb://host1.example.com:27017,host2.example.com:27017,host3.example.com:27017'

If you need to ensure that your data has been replicated, use the MONGODB_REPLICA_SET_W option. It maps to the w write concern parameter in pymongo. Details from the pymongo documentation:

Write operations will block until they have been replicated to the specified number or tagged set of servers. w=<int> always includes the replica set primary (e.g. w=3 means write to the primary and wait until replicated to two secondaries). Passing w=0 disables write acknowledgement and all other write concern options.
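For reference, here is roughly how those settings map onto a pymongo connection. This is a minimal sketch, not the library's actual code, and it assumes pymongo 3.x, where MongoClient handles replica sets directly:

from pymongo import MongoClient

# Sketch: MONGODB_URI, MONGODB_REPLICA_SET and MONGODB_REPLICA_SET_W
# roughly correspond to these pymongo connection arguments.
client = MongoClient(
    'mongodb://host1.example.com:27017,host2.example.com:27017,host3.example.com:27017',
    replicaSet='myReplicaSetName',  # MONGODB_REPLICA_SET
    w=2,                            # MONGODB_REPLICA_SET_W: primary plus one secondary must acknowledge
)
db = client['scrapy']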

Data buffering

To ease the load on MongoDB, scrapy-mongodb has a buffering feature. You can enable it by setting MONGODB_BUFFER_DATA to the buffer size you want. For example, scrapy-mongodb will write 10 documents at a time to the database if you set:

MONGODB_BUFFER_DATA = 10

This feature cannot be combined with MONGODB_UNIQUE_KEY, because the update method in pymongo does not support multi-document updates.
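To illustrate the general idea, here is a minimal sketch of a buffering pipeline. It is not scrapy-mongodb's actual implementation; the class name is hypothetical and it assumes pymongo 3.x (insert_many):

from pymongo import MongoClient

class BufferedMongoPipeline(object):
    """Sketch of client-side buffering: collect items and flush them in batches."""

    def __init__(self, uri='mongodb://localhost:27017', database='scrapy',
                 collection='my_items', buffer_size=10):
        self.collection = MongoClient(uri)[database][collection]
        self.buffer_size = buffer_size
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.buffer_size:
            self.flush()
        return item

    def flush(self):
        if self.buffer:
            # One round trip for the whole batch instead of one insert per document.
            self.collection.insert_many(self.buffer)
            self.buffer = []

    def close_spider(self, spider):
        # Write out whatever is left when the crawl ends.
        self.flush()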

Timestamps

scrapy-mongodb can append a timestamp to your item when inserting it into the database. Enable this feature with:

MONGODB_ADD_TIMESTAMP = True

This will modify the document to something like this:

{
    ...
    'scrapy-mongodb': {
        'ts': ISODate("2013-01-10T07:43:56.797Z")
    }
    ...
}

The timestamp is in UTC.
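Conceptually, the timestamp is attached before the insert, roughly like this (a sketch, not the exact code scrapy-mongodb runs):

import datetime

# Sketch: the scraped fields plus the nested 'scrapy-mongodb' timestamp.
# pymongo stores Python datetimes as BSON dates (shown as ISODate in the mongo shell).
document = {'url': 'http://example.com'}
document['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}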

One collection per spider

It's possible to write data to one collection per spider. To enable this feature, add this option to your settings:

MONGODB_SEPARATE_COLLECTIONS = True

Full list of available options

MONGODB_DATABASE (default: scrapy-mongodb, optional)
    Database to use. Does not need to exist.

MONGODB_COLLECTION (default: items, optional)
    Collection within the database to use. Does not need to exist.

MONGODB_URI (default: mongodb://localhost:27017, optional)
    URI of the MongoDB instance or replica set you want to connect to. It must start with mongodb:// (see more in the MongoDB docs). E.g.: mongodb://user:pass@host:port or mongodb://user:pass@host:port,host2:port2

MONGODB_UNIQUE_KEY (default: None, optional)
    If you want to have a unique key in the database, enter the key name here. scrapy-mongodb will ensure the key is properly indexed.

MONGODB_BUFFER_DATA (default: None, optional)
    To ease the load on MongoDB, set this option to the number of items you want to buffer in the client before sending them to the database. Setting MONGODB_UNIQUE_KEY together with MONGODB_BUFFER_DATA is not supported.

MONGODB_ADD_TIMESTAMP (default: False, optional)
    If set to True, scrapy-mongodb will add a timestamp key to the documents.

MONGODB_FSYNC (default: False, optional)
    If set to True, it forces MongoDB to wait for all files to be synced before returning.

MONGODB_REPLICA_SET (default: None, required for replica sets)
    Set this if you want to enable replica set support. The option should be given the name of the replica set you want to connect to. MONGODB_URI should point at your config servers.

MONGODB_REPLICA_SET_W (default: 0, optional)
    Best described in the pymongo docs: write operations will block until they have been replicated to the specified number or tagged set of servers. w=<int> always includes the replica set primary (e.g. w=3 means write to the primary and wait until replicated to two secondaries). Passing w=0 disables write acknowledgement and all other write concern options.

MONGODB_STOP_ON_DUPLICATE (default: 0, optional)
    Set this to a value greater than 0 to close the spider when that number of duplicate insertions in MongoDB is detected. If set to 0, this option has no effect.

Deprecated options

Since scrapy-mongodb 0.5.0

MONGODB_HOST (default: localhost, optional)
    MongoDB host name to connect to. Use MONGODB_URI instead.

MONGODB_PORT (default: 27017, optional)
    MongoDB port number to connect to. Use MONGODB_URI instead.

MONGODB_REPLICA_SET_HOSTS (default: None, optional)
    Host string to use to connect to the replica set. See the hosts_or_uri option in the pymongo docs. Use MONGODB_URI instead.

PUBLISHING TO PYPI

make release

CHANGELOG

Read more here.

AUTHOR

This project is maintained by: Sebastian Dahlgren (GitHub | Twitter | LinkedIn).

LICENSE

Read more here.

scrapy-mongodb's People

Contributors

a358003542, italomaia, ivanfrain, jin10086, kostast, lqmanh, macedd, ravishi, sebdah, sherzberg

scrapy-mongodb's Issues

Support buffered data

It would be nice if scrapy-mongodb supported data buffering to ease the load on the MongoDB cluster if needed. It would then send every n docs to MongoDB in a batch.

Warnings on Log. Deprecated Libraries

A few warning messages are shown in the output log because the project uses deprecated libraries.

Sample:

2016-04-06 07:15:15 [py.warnings] WARNING: c:\python27\lib\site-packages\scrapy_mongodb.py:30: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.  from scrapy import log

2016-04-06 07:15:15 [py.warnings] WARNING: c:\python27\lib\site-packages\scrapy_mongodb.py:31: ScrapyDeprecationWarning: Module `scrapy.contrib.exporter` is deprecated, use `scrapy.exporters` instead  from scrapy.contrib.exporter import BaseItemExporter

2016-04-06 07:15:15 [py.warnings] WARNING: c:\python27\lib\site-packages\scrapy_mongodb.py:105: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead  self.config['collection']))

2016-04-06 07:17:44 [py.warnings] WARNING: c:\python27\lib\site-packages\scrapy_mongodb.py:256: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead  spider=spider)

This pull request may solve the issue, but it needs to be reviewed: Pull Request 31
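For reference, the replacements the warnings point at look roughly like this (a sketch of the suggested migration, not a reviewed patch):

# Deprecated imports that trigger the warnings:
#   from scrapy import log
#   from scrapy.contrib.exporter import BaseItemExporter

# Current equivalents:
import logging
from scrapy.exporters import BaseItemExporter

logger = logging.getLogger(__name__)

# Instead of log.msg(...):
logger.info('Connected to MongoDB collection %s', 'my_items')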

UnicodeError on log message

Just got this message in my deploy:

File "/home/myuser/.virtualenvs/stealth/local/lib/python2.7/site-packages/scrapy_mongodb.py", line 104, in __init__
        self.config['collection']))
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-28: ordinal not in range(128)

I believe the problem is related to passwords containing non-ASCII characters.

ScrapyDeprecationWarning: crawler.settings

scrapy-mongodb uses a deprecated method to get settings:

/usr/local/lib/python2.7/site-packages/scrapy_mongodb.py:30: ScrapyDeprecationWarning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead

from scrapy.conf import settings
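The non-deprecated pattern looks roughly like this; a sketch of the approach the warning asks for, not the project's actual code (the setting read here is just an example):

class MongoDBPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Read settings from the crawler instead of the deprecated scrapy.conf module.
        return cls(crawler.settings)

    def __init__(self, settings):
        self.mongodb_uri = settings.get('MONGODB_URI', 'mongodb://localhost:27017')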

Implement as item exporter

It would be nice to use scrapy-mongodb as a regular item exporter. Loading it as a PIPELINE feels inflexible, as you would have to change code in order to change the output target.

new pipeline

Do I need to replace my pipeline, where I process items, with the new scrapy-mongodb pipeline?
And where is this pipeline located so I can edit it?
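For context: pipelines chain in Scrapy, so an existing project pipeline and scrapy-mongodb can run side by side, and there is no need to edit scrapy-mongodb itself. A sketch of such a settings.py (MyProjectPipeline stands in for your own pipeline):

ITEM_PIPELINES = {
    'myproject.pipelines.MyProjectPipeline': 100,  # your own item processing runs first
    'scrapy_mongodb.MongoDBPipeline': 300,         # then scrapy-mongodb stores the item
}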

I can't manage to make uri with user:pass working...

Hi,

Without user:pass, your lib is working great, but I really can't figure out what I'm doing wrong... Did you try it? Perhaps you have an idea to help me?

I'm using Meteor, but that shouldn't be the issue, as I managed to make it work locally, unless the production database has some specific behaviour. If you can help me, it would be great :) Just tell me if you already got it working; if yes, I'll ask the Meteor guys.

Here is my error:

+ scrapy crawl custojusto -s MONGODB_URI=mongodb://client:[email protected]:27017 -s MONGODB_DATABASE=myapp_meteor_com
2013-11-14 23:43:22+0000 [scrapy] INFO: Scrapy 0.16.5 started (bot: house)
2013-11-14 23:43:23+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-14 23:43:23+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-14 23:43:23+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.16.5', 'scrapy')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 1207, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/commands/crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/command.py", line 33, in crawler
    self._crawler.configure()
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/crawler.py", line 41, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/core/engine.py", line 63, in __init__
    self.scraper = Scraper(crawler)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/core/scraper.py", line 66, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/middleware.py", line 50, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Library/Python/2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/middleware.py", line 31, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/Library/Python/2.7/site-packages/scrapy_mongodb.py", line 68, in from_crawler
    return cls(crawler.settings)
  File "/Library/Python/2.7/site-packages/scrapy_mongodb.py", line 89, in __init__
    read_preference=ReadPreference.PRIMARY)
  File "/Library/Python/2.7/site-packages/pymongo/mongo_client.py", line 369, in __init__
    raise ConfigurationError(str(exc))
pymongo.errors.ConfigurationError: command SON([('authenticate', 1), ('user', u'client'), ('nonce', u'xx'), ('key', u'xxx')]) failed: auth fails

duplicate key ?

Hi,
Thx for putting this out there! Good work.
I use MongoDB 2.6 + Scrapy and your pipeline.
Only the first item is stored in the database. After that I get a duplicate key error in the Scrapy log...
2014-04-14 00:04:25+0200 [scrapy] DEBUG: Duplicate key found
2014-04-14 00:04:25+0200 [lispider2] DEBUG: Stored item(s) in MongoDB liDB/liCollection
and a line stating that the item is stored in the db.

In the mongo shell still only one item stored...

db.liCollection.count()
1

settings.py looks like:

# mongodb pipeline

MONGODB_URI = 'mongodb://localhost:27017'

MONGODB_HOST = 'localhost'

MONGODB_PORT = 27017

MONGODB_DATABASE = 'liDB'
MONGODB_COLLECTION = 'liCollection'

# MONGODB_UNIQUE_KEY = 'keyLinkedin'

MONGODB_ADD_TIMESTAMP = True

ITEM_PIPELINES = {
'scrapy_mongodb.MongoDBPipeline':300,
}

As you can see, the MONGODB_UNIQUE_KEY is commented out; I first got a KeyError if I uncommented that setting...

Any suggestions?

thx

Use new-style classes

Really, this is 2014; you should use new-style classes. Even Scrapy's documentation uses new-style classes in its pipeline examples.
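For clarity, the requested change is just inheriting from object (sketch):

# Old-style class (Python 2 only):
class OldStylePipeline:
    pass

# New-style class, as used in Scrapy's own pipeline examples:
class NewStylePipeline(object):
    pass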

Is the pipeline blocking?

It looks like the scrapy-mongodb pipeline is blocking, so the crawl speed is limited by the MongoDB write speed.

Would it be better to change it to a non-blocking pipeline, for example using txmongo?
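One way to get a non-blocking pipeline without switching drivers is to return a Deferred that runs the pymongo write in Twisted's thread pool; txmongo would be the fully asynchronous alternative. A hedged sketch, not scrapy-mongodb's actual behaviour:

from pymongo import MongoClient
from twisted.internet import threads

class NonBlockingMongoPipeline(object):
    """Sketch: offload blocking pymongo writes to Twisted's thread pool."""

    def __init__(self):
        self.collection = MongoClient('mongodb://localhost:27017')['scrapy']['my_items']

    def process_item(self, item, spider):
        # Scrapy accepts a Deferred from process_item, so the reactor is not
        # blocked while the insert runs in a worker thread.
        d = threads.deferToThread(self.collection.insert_one, dict(item))
        d.addCallback(lambda _: item)
        return d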

Allow customization of Mongo connection at the Spider or Item level

First of all, thanks for the great project. I discovered it last night and it pretty much just worked for storing results from a spider.

I'm using a project that has multiple spiders storing different items, and it would be really helpful if the Mongo connection could be customized to point at a different database/collection depending on the item being stored or the spider being run. I was planning on forking the project and adding it myself, but I am wondering about your thoughts first.
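One possible direction, sketched as a separate pipeline rather than a change to scrapy-mongodb itself (the mongodb_database / mongodb_collection spider attributes are hypothetical):

from pymongo import MongoClient

class PerSpiderMongoPipeline(object):
    """Sketch: pick the database/collection from attributes on the spider."""

    def __init__(self):
        self.client = MongoClient('mongodb://localhost:27017')

    def process_item(self, item, spider):
        # Fall back to project-wide defaults if the spider does not declare
        # the (hypothetical) mongodb_database / mongodb_collection attributes.
        database = getattr(spider, 'mongodb_database', 'scrapy')
        collection = getattr(spider, 'mongodb_collection', 'my_items')
        self.client[database][collection].insert_one(dict(item))
        return item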

Call serializer before inserting

Hi folks. In my project, my item fields usually have one or two serializer functions attached to them. Using scrapy-mongodb, I noticed it does not call the serializers before inserting into the database, which creates dirty data for me. So what did I do? I patched the whole thing to call BaseItemExporter._get_serialized_fields. The result is neat. Would that be desired behavior?
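Roughly what that patch amounts to, sketched against Scrapy's exporter internals (_get_serialized_fields is a private API, and the surrounding code here is assumed, not scrapy-mongodb's own):

from pymongo import MongoClient
from scrapy.exporters import BaseItemExporter

exporter = BaseItemExporter()
collection = MongoClient('mongodb://localhost:27017')['scrapy']['my_items']

def store_serialized(item):
    # item is assumed to be a Scrapy Item with serializers declared on its fields.
    # Run the declared field serializers before handing the dict to pymongo.
    serialized = dict(exporter._get_serialized_fields(item))
    collection.insert_one(serialized)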

create index on 2 columns

How can I create an index on two fields, like this:

        self.db[self.mongo_collection].create_index(
            [
                ("url", pymongo.DESCENDING),
                ("category", pymongo.ASCENDING),
            ],
            unique=True
        )

?

Process_item to return data other than Item

Hi,
I would like to send the MongoDB pipeline a data structure other than Item.
I actually want to store a dict.
I get this error:
'dict' object has no attribute 'fields'
which means that it always looks for an Item in the return value of process_item.
I recall it was working fine in a previous version. Can you suggest a workaround?
Thanks
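One possible workaround while plain dicts are rejected: wrap the dict in a small Item class defined in your own project (FlexibleItem and its fields below are hypothetical):

import scrapy

class FlexibleItem(scrapy.Item):
    # Declare the fields your dict normally carries (hypothetical examples).
    url = scrapy.Field()
    title = scrapy.Field()

def to_item(data):
    # Convert the dict into an Item, which has the `fields` attribute the pipeline expects.
    return FlexibleItem(**data)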

get_project_settings precludes configuring settings on command line

The way you're currently getting settings:

from scrapy.utils.project import get_project_settings

only allows reading settings from the settings.py file. If a user wants to override MONGODB_DATABASE, for example, from the command line using the -s flag, it won't register in scrapy-mongodb.

The solution would be to do something like:

class MyExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        if settings['LOG_ENABLED']:
            print "log is enabled!"

Source: http://doc.scrapy.org/en/latest/topics/settings.html
