
pystock-crawler's Introduction

pystock-crawler


pystock-crawler is a utility for crawling historical data of US stocks, including:

  • Ticker symbols listed on NYSE, NASDAQ and AMEX
  • Daily prices
  • Company fundamentals from SEC EDGAR filings (10-Q and 10-K)

Example Output

NYSE ticker symbols:

DDD   3D Systems Corporation
MMM   3M Company
WBAI  500.com Limited
...

Apple's daily prices:

symbol,date,open,high,low,close,volume,adj_close
AAPL,2014-04-28,572.80,595.75,572.55,594.09,23890900,594.09
AAPL,2014-04-25,564.53,571.99,563.96,571.94,13922800,571.94
AAPL,2014-04-24,568.21,570.00,560.73,567.77,27092600,567.77
...

Google's fundamentals:

symbol,end_date,amend,period_focus,fiscal_year,doc_type,revenues,op_income,net_income,eps_basic,eps_diluted,dividend,assets,cur_assets,cur_liab,cash,equity,cash_flow_op,cash_flow_inv,cash_flow_fin
GOOG,2009-06-30,False,Q2,2009,10-Q,5522897000.0,1873894000.0,1484545000.0,4.7,4.66,0.0,35158760000.0,23834853000.0,2000962000.0,11911351000.0,31594856000.0,3858684000.0,-635974000.0,46354000.0
GOOG,2009-09-30,False,Q3,2009,10-Q,5944851000.0,2073718000.0,1638975000.0,5.18,5.13,0.0,37702845000.0,26353544000.0,2321774000.0,12087115000.0,33721753000.0,6584667000.0,-3245963000.0,74851000.0
GOOG,2009-12-31,False,FY,2009,10-K,23650563000.0,8312186000.0,6520448000.0,20.62,20.41,0.0,40496778000.0,29166958000.0,2747467000.0,10197588000.0,36004224000.0,9316198000.0,-8019205000.0,233412000.0
...
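The CSV output can be consumed with nothing more than the standard library. Here is a minimal sketch (not part of pystock-crawler) that reads a prices file like the one above, assuming it was saved as out.csv:

# read_prices.py - minimal sketch for consuming pystock-crawler price output.
# Assumes an out.csv produced by something like:
#   pystock-crawler prices AAPL -o out.csv
from __future__ import print_function

import csv

with open('out.csv') as f:
    for row in csv.DictReader(f):
        # Each row is a dict keyed by the CSV header: symbol, date, open, ...
        print(row['symbol'], row['date'], float(row['close']))

The fundamentals file can be read the same way; only the column names differ.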

Installation

Prerequisites:

  • Python 2.7

pystock-crawler is based on Scrapy, so you will also need to install prerequisites such as lxml and libffi for Scrapy and its dependencies. On Ubuntu, for example, you can install them like this:

sudo apt-get update
sudo apt-get install -y gcc python-dev libffi-dev libssl-dev libxml2-dev libxslt1-dev build-essential

See Scrapy's installation guide for more details.

After installing prerequisites, you can then install pystock-crawler with pip:

(sudo) pip install pystock-crawler

Quickstart

Example 1. Fetch Google's and Yahoo's daily prices ordered by date:

pystock-crawler prices GOOG,YHOO -o out.csv --sort

Example 2. Fetch daily prices of all companies listed in ./symbols.txt:

pystock-crawler prices ./symbols.txt -o out.csv

Example 3. Fetch Facebook's fundamentals during 2013:

pystock-crawler reports FB -o out.csv -s 20130101 -e 20131231

Example 4. Fetch fundamentals of all companies in ./nyse.txt and direct the log to ./crawling.log:

pystock-crawler reports ./nyse.txt -o out.csv -l ./crawling.log

Example 5. Fetch all ticker symbols in NYSE, NASDAQ and AMEX:

pystock-crawler symbols NYSE,NASDAQ,AMEX -o out.txt

Usage

Type pystock-crawler -h to see command help:

Usage:
  pystock-crawler symbols <exchanges> (-o OUTPUT) [-l LOGFILE] [-w WORKING_DIR]
                                      [--sort]
  pystock-crawler prices <symbols> (-o OUTPUT) [-s YYYYMMDD] [-e YYYYMMDD]
                                   [-l LOGFILE] [-w WORKING_DIR] [--sort]
  pystock-crawler reports <symbols> (-o OUTPUT) [-s YYYYMMDD] [-e YYYYMMDD]
                                    [-l LOGFILE] [-w WORKING_DIR]
                                    [-b BATCH_SIZE] [--sort]
  pystock-crawler (-h | --help)
  pystock-crawler (-v | --version)

Options:
  -h --help       Show this screen
  -o OUTPUT       Output file
  -s YYYYMMDD     Start date [default: ]
  -e YYYYMMDD     End date [default: ]
  -l LOGFILE      Log output [default: ]
  -w WORKING_DIR  Working directory [default: .]
  -b BATCH_SIZE   Batch size [default: 500]
  --sort          Sort the result

There are three commands available:

  • pystock-crawler symbols grabs ticker symbol lists
  • pystock-crawler prices grabs daily prices
  • pystock-crawler reports grabs fundamentals

<exchanges> is a comma-separated string that specifies the stock exchanges you want to include. Currently, NYSE, NASDAQ and AMEX are supported.

The output file of pystock-crawler symbols can be used as the <symbols> argument of the pystock-crawler prices and pystock-crawler reports commands.

<symbols> can be a comma-separated inline string or a text file that lists symbols line by line. For example, the inline string can be something like AAPL,GOOG,FB, and the text file may look like this:

# This line is a comment
AAPL    Put anything you want here
GOOG    Since the text here is ignored
FB
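For illustration only (this is not the crawler's actual parsing code), the format boils down to taking the first whitespace-separated token on each line and skipping comments and blank lines:

# parse_symbols.py - illustrative sketch of the symbols file format above;
# NOT the crawler's actual parser.
def read_symbols(path):
    symbols = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):  # skip blanks and comments
                continue
            symbols.append(line.split()[0])       # only the first token matters
    return symbols

print(read_symbols('./symbols.txt'))  # e.g. ['AAPL', 'GOOG', 'FB']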

Use -o to specify the output file. For the pystock-crawler symbols command, the output format is a simple text file. For pystock-crawler prices and pystock-crawler reports, the output format is CSV.

-l specifies where the crawling logs go. If not specified, the logs go to stdout.

By default, the crawler uses the current directory as the working directory. If you don't want to use the current directory, you can specify one with the -w option. The crawler keeps an HTTP cache in a directory named .scrapy under the working directory. The cache can save you time by avoiding downloading the same web pages again. However, the cache can get quite huge. If you don't need it, just delete the .scrapy directory after you're done crawling.

The -b option is only available to the pystock-crawler reports command. It allows you to split a large symbol list into smaller batches. This is actually a workaround for an unresolved bug (#2). Normally you don't have to specify this option; the default value (500) works just fine.

The rows in the output file are in an arbitrary order by default. Use the --sort option to sort them by symbol and date. But if you have a large output file, don't use --sort, because it will be slow and eat a lot of memory.
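If you skipped --sort and want to order a modest-sized output afterwards, here is a simple post-processing sketch (like --sort, it loads the whole file into memory; column positions assume the layouts shown earlier):

# sort_output.py - hedged sketch: sort a pystock-crawler CSV by symbol and date.
import csv

with open('out.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = sorted(reader, key=lambda r: (r[0], r[1]))  # (symbol, date)

with open('out_sorted.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)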

Developer Guide

Installing Dependencies

pip install -r requirements.txt

Running Test

Install test requirements:

pip install -r requirements-test.txt

Then run the test:

py.test

This will download the test data (a lot of XML/XBRL files) from SEC EDGAR on the fly, so it will take some time and disk space. The test data is saved to the pystock_crawler/tests/sample_data directory and can be reused the next time you run the test. If you don't need the data, just delete the sample_data directory.

pystock-crawler's People

Contributors

eliangcs


pystock-crawler's Issues

Follow Flake8 coding style

Some Flake8 (pep8 + pyflakes) rules are currently ignored. The code should be refactored to follow all Flake8 rules.

install issue

ubgpu@ubgpu:~/github/pystock-crawler$ pystock-crawler -h
  File "/usr/local/bin/pystock-crawler", line 209
    print 'pystock-crawler %s' % pystock_crawler.__version__
        ^
SyntaxError: invalid syntax
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 53, in apport_excepthook
    if not enabled():
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 24, in enabled
    import re
  File "/usr/lib/python3.4/re.py", line 324, in <module>
    import copyreg
  File "/usr/local/lib/python2.7/dist-packages/copyreg/__init__.py", line 7, in <module>
    raise ImportError('This package should not be accessible on Python 3. '
ImportError: This package should not be accessible on Python 3. Either you are trying to run from the python-future src folder or your installation of python-future is corrupted.

Original exception was:
  File "/usr/local/bin/pystock-crawler", line 209
    print 'pystock-crawler %s' % pystock_crawler.__version__
        ^
SyntaxError: invalid syntax
ubgpu@ubgpu:~/github/pystock-crawler$

"Cannot find context" after a long run

After crawling EDGAR for hours using the pystock-crawler reports command, there is a good chance that a lot of these warning messages show up in the log:

[scrapy] WARNING: Cannot find context: eol_PE5972----1310-Q0007_STD_273_20130930_0 in http://www.sec.gov/Archives/edgar/data/41719/000119312513438262/glt-20130930.xml
[scrapy] WARNING: Cannot find context: Y11Q4 in http://www.sec.gov/Archives/edgar/data/1041368/000093905712000032/rvsb-20111231.xml
[scrapy] WARNING: Cannot find context: D111001_120331 in http://www.sec.gov/Archives/edgar/data/1046050/000093905712000146/tsbk-20120331.xml

This leaves those reports with many null values. Perhaps the crawler hits EDGAR too often, causing EDGAR to return bad content.

A batch size option for crawling reports

#2 only occurs when the symbol list is large (~5k symbols). To work around it, maybe I can add a --batch-size option to the pystock-crawler reports command. This option would split a large symbol list into smaller parts, crawl them separately, and merge the partial results into one output file when crawling is done.
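The splitting itself is straightforward; here is a hypothetical sketch (the helper name is made up for illustration and is not the project's code):

# Hypothetical illustration of splitting a symbol list into fixed-size
# batches, as the --batch-size workaround describes.
def batches(symbols, batch_size=500):
    for i in range(0, len(symbols), batch_size):
        yield symbols[i:i + batch_size]

# A ~5k-symbol list becomes ten batches of at most 500 symbols each; each
# batch would be crawled separately and the results merged into one file.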

Support for AMEX

The American Stock Exchange (AMEX) is the third largest stock exchange in the US, and NASDAQ.com also provides an AMEX company list, so adding support for AMEX is trivial.

Add a command option to specify working directory

Currently, the HTTP cache is stored in the current working directory. It would be better if users could specify the working directory with a -w command option, e.g.:

pystock-crawler reports SYMBOLS -o OUTPUT -w WORKING_DIR

ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated

ubgpu@ubgpu:~/github/pystock-crawler$ PYTHONPATH=/usr/local/lib/python2.7/dist-packages pystock-crawler prices GOOG,YHOO -o out.csv --sort
/usr/local/bin/pystock-crawler:33: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log
Traceback (most recent call last):
  File "/usr/local/bin/pystock-crawler", line 267, in <module>
    main()
  File "/usr/local/bin/pystock-crawler", line 252, in main
    log.start(logfile=log_file)
AttributeError: 'module' object has no attribute 'start'
ubgpu@ubgpu:~/github/pystock-crawler$ sudo pip2 install scrapy
Requirement already satisfied (use --upgrade to upgrade): scrapy in /usr/local/lib/python2.7/dist-packages
Cleaning up...
ubgpu@ubgpu:~/github/pystock-crawler$

LevelDB terminates installation

Hi, I'm working on Win 8.1. After installing all prerequisites (Scrapy), I'm getting this error:

Using cached leveldb-0.193.tar.gz
No files/directories in c:\users...\ap
leveldb\pip-egg-info (from PKG-INFO)

I tried to install leveldb by myself but I get an error: "Don't know how to compile for windows!"

Merge failed if result is empty

The pystock-crawler reports command raises an error when merging subfiles if one of them is empty.

Error output:

2014-10-08 11:20:48-0400 [scrapy] INFO: Merging files to /home/eliang/2014/10/01/reports.csv
2014-10-08 11:20:48-0400 [-] ERROR: Traceback (most recent call last):
2014-10-08 11:20:48-0400 [-] ERROR:   File "/home/eliang/.virtualenvs/pystock/bin/pystock-crawler", line 264, in <module>
2014-10-08 11:20:48-0400 [-] ERROR:     main()
2014-10-08 11:20:48-0400 [-] ERROR:   File "/home/eliang/.virtualenvs/pystock/bin/pystock-crawler", line 250, in main
2014-10-08 11:20:48-0400 [-] ERROR:     crawl(spider, symbols, start_date, end_date, output, log_file, batch_size)
2014-10-08 11:20:48-0400 [-] ERROR:   File "/home/eliang/.virtualenvs/pystock/bin/pystock-crawler", line 153, in crawl
2014-10-08 11:20:48-0400 [-] ERROR:     merge_files(output, output_files, ignore_header=True)
2014-10-08 11:20:48-0400 [-] ERROR:   File "/home/eliang/.virtualenvs/pystock/bin/pystock-crawler", line 104, in merge_files
2014-10-08 11:20:48-0400 [-] ERROR:     f.next()  # Ignore CSV header
2014-10-08 11:20:48-0400 [-] ERROR:   File "/home/eliang/.virtualenvs/pystock/lib/python2.7/codecs.py", line 681, in next
2014-10-08 11:20:48-0400 [-] ERROR:     return self.reader.next()
2014-10-08 11:20:48-0400 [-] ERROR:   File "/home/eliang/.virtualenvs/pystock/lib/python2.7/codecs.py", line 615, in next
2014-10-08 11:20:48-0400 [-] ERROR:     raise StopIteration
2014-10-08 11:20:48-0400 [-] ERROR: StopIteration
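The traceback points at the header-skipping f.next() call, which raises StopIteration when a subfile is empty. Here is a hedged sketch of a merge that simply skips empty parts (the function signature follows the traceback, but the body is illustrative, not the project's actual code):

# Illustrative only: tolerate empty subfiles instead of crashing.
def merge_files(output_path, input_paths, ignore_header=True):
    with open(output_path, 'w') as out:
        header_written = False
        for path in input_paths:
            with open(path) as f:
                lines = f.readlines()
            if not lines:  # empty subfile: nothing to merge, skip it
                continue
            if ignore_header:
                if not header_written:
                    out.write(lines[0])  # keep one copy of the CSV header
                    header_written = True
                lines = lines[1:]
            out.writelines(lines)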

IndexError on PLT report

Command pystock-crawler reports PLT -o output.csv raised this error:

2014-08-10 20:41:16+0800 [edgar] ERROR: Spider error processing <GET http://www.sec.gov/Archives/edgar/data/914025/000091402513000049/plt-20130630.xml>
    Traceback (most recent call last):
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
        for x in result:
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 67, in _parse_response
        cb_res = callback(response, **cb_kwargs) or ()
      File "/Users/eliang/Projects/pystock-crawler/pystock_crawler/spiders/edgar.py", line 56, in parse_10qk
        loader = ReportItemLoader(response=response)
      File "/Users/eliang/Projects/pystock-crawler/pystock_crawler/loaders.py", line 473, in __init__
        self.add_xpath('amend', '//dei:AmendmentFlag')
      File "/Users/eliang/Projects/pystock-crawler/pystock_crawler/loaders.py", line 364, in add_xpath
        self.add_value(field_name, values, *processors, **kw)
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/loader/__init__.py", line 45, in add_value
        self._add_value(field_name, value)
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/loader/__init__.py", line 59, in _add_value
        processed_value = self._process_input_value(field_name, value)
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/loader/__init__.py", line 117, in _process_input_value
        return proc(value)
      File "/Users/eliang/.virtualenvs/pystock-crawler/lib/python2.7/site-packages/scrapy/contrib/loader/processor.py", line 27, in __call__
        next_values += arg_to_iter(func(v))
      File "/Users/eliang/Projects/pystock-crawler/pystock_crawler/loaders.py", line 63, in __call__
        return value.xpath('./text()')[0].extract()
    exceptions.IndexError: list index out of range
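The failing line indexes the first text node without checking whether one exists, so it blows up when dei:AmendmentFlag has no text. Here is a hedged sketch of a defensive version of that extraction (illustrative, not the project's actual fix):

# Illustrative guard for the failing extraction in loaders.py (line 63);
# not the project's actual fix.
def first_text(value):
    nodes = value.xpath('./text()')
    return nodes[0].extract() if nodes else None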

Company having multiple symbols

Command pystock-crawler reports GOOG -o output.csv --sort gave this output:

symbol,end_date,amend,period_focus,fiscal_year,doc_type,revenues,op_income,net_income,eps_basic,eps_diluted,dividend,assets,cur_assets,cur_liab,cash,equity,cash_flow_op,cash_flow_inv,cash_flow_fin
"GOOG, GOOGL",2014-06-30,False,Q2,2014,10-Q,15955000000.0,4258000000.0,3422000000.0,5.07,4.99,0.0,121608000000.0,77905000000.0,17097000000.0,19620000000.0,95749000000.0,10018000000.0,-8487000000.0,-640000000.0
...

"GOOG, GOOGL" at 2nd line is ugly and should be changed to GOOG/GOOGL.

get reports returns empty output

The command is pystock-crawler reports WBAI -o out.csv.
The WBAI symbol is one of the symbols in my list file.

Below is the detailed info. Cheers
2016-08-30 09:40:08+0100 [scrapy] INFO: Command: scrapy crawl edgar -a symbols="WBAI" -t csv -a limit=0,500 -o "/Users/XXXX/out.csv.1"
2016-08-30 09:40:08+0100 [scrapy] INFO: Creating temporary config: /Users/XXXX/scrapy.cfg
2016-08-30 09:40:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: pystock-crawler)
2016-08-30 09:40:08+0100 [scrapy] INFO: Optional features available: ssl, http11
2016-08-30 09:40:08+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pystock_crawler.spiders', 'FEED_URI': '/Users/XXXX/out.csv.1', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['pystock_crawler.spiders'], 'HTTPCACHE_ENABLED': True, 'RETRY_TIMES': 4, 'BOT_NAME': 'pystock-crawler', 'COOKIES_ENABLED': False, 'FEED_FORMAT': 'csv', 'HTTPCACHE_POLICY': 'scrapy.contrib.httpcache.RFC2616Policy', 'HTTPCACHE_STORAGE': 'scrapy.contrib.httpcache.LeveldbCacheStorage'}
2016-08-30 09:40:08+0100 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, PassiveThrottle, SpiderState
2016-08-30 09:40:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats, HttpCacheMiddleware
2016-08-30 09:40:08+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-08-30 09:40:08+0100 [scrapy] INFO: Enabled item pipelines:
2016-08-30 09:40:08+0100 [edgar] INFO: Spider opened
2016-08-30 09:40:08+0100 [edgar] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-30 09:40:10+0100 [edgar] INFO: Closing spider (finished)
2016-08-30 09:40:10+0100 [edgar] INFO: Dumping Scrapy stats:
{'delay_count': 0,
'downloader/request_bytes': 608,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 3253,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 30, 8, 40, 10, 300817),
'httpcache/firsthand': 1,
'httpcache/hit': 1,
'httpcache/miss': 1,
'httpcache/uncacheable': 1,
'log_count/INFO': 7,
'request_depth_count/0': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 8, 30, 8, 40, 8, 704882)}
2016-08-30 09:40:10+0100 [edgar] INFO: Spider closed (finished)
2016-08-30 09:40:10+0100 [scrapy] INFO: Deleting /Users/XXXX/scrapy.cfg
2016-08-30 09:40:10+0100 [scrapy] INFO: Merging files to /Users/XXXX/out.csv
2016-08-30 09:40:10+0100 [scrapy] INFO: Deleting /Users/XXXX/out.csv.1

period_focus not stripped

Run this command to reproduce:

pystock-crawler reports CIT -o cit.csv

One of the output rows is incorrect:

CIT,2010-06-30,False,"
  Q2",2010,10-Q,669500000.0,,142100000.0,0.71,0.71,0.0,54916800000.0,,,1060700000.0,8633900000.0,178100000.0,7122800000.0,-6218700000.0

The period_focus value needs to be stripped.
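The fix amounts to stripping whitespace from the extracted value before it is stored. A minimal hedged sketch of the cleanup (illustrative; the real change would live in pystock_crawler/loaders.py):

# Illustrative only: strip stray whitespace/newlines from a period_focus value.
def clean_period_focus(value):
    return value.strip() if value else value

print(clean_period_focus('\n  Q2'))  # prints: Q2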

Add more fundamental data

Add more fundamental data:

  • Operating income
  • Current assets
  • Current liabilities
  • Cash flow (operation)
  • Cash flow (investing)
  • Cash flow (financing)

Price report is empty

Hello,

I'm able to use your script to run the fundamental reports just fine.

However, I'm getting an empty report when I try to look up prices. Can you please help?

This is the command I ran:

Johns-MacBook-Air:~ JohnSnyder$ pystock-crawler prices GOOG -o test.csv --sort
2018-04-15 02:25:46-0700 [scrapy] INFO: Command: scrapy crawl yahoo -a symbols="GOOG" -t csv -o "/Users/JohnSnyder/test.csv"
2018-04-15 02:25:46-0700 [scrapy] INFO: Creating temporary config: /Users/JohnSnyder/scrapy.cfg
2018-04-15 02:25:47-0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: pystock-crawler)
2018-04-15 02:25:47-0700 [scrapy] INFO: Optional features available: ssl, http11
2018-04-15 02:25:47-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pystock_crawler.spiders', 'FEED_URI': '/Users/JohnSnyder/test.csv', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['pystock_crawler.spiders'], 'HTTPCACHE_ENABLED': True, 'RETRY_TIMES': 4, 'BOT_NAME': 'pystock-crawler', 'COOKIES_ENABLED': False, 'FEED_FORMAT': 'csv', 'HTTPCACHE_POLICY': 'scrapy.contrib.httpcache.RFC2616Policy', 'HTTPCACHE_STORAGE': 'scrapy.contrib.httpcache.LeveldbCacheStorage'}
2018-04-15 02:25:47-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, PassiveThrottle, SpiderState
2018-04-15 02:25:47-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats, HttpCacheMiddleware
2018-04-15 02:25:47-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-04-15 02:25:47-0700 [scrapy] INFO: Enabled item pipelines:
2018-04-15 02:25:47-0700 [yahoo] INFO: Spider opened
2018-04-15 02:25:47-0700 [yahoo] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-15 02:25:47-0700 [yahoo] ERROR: Error downloading <GET http://ichart.finance.yahoo.com/table.csv?s=GOOG&d=&e=&f=&g=d&a=&b=&c=&ignore=.csv>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/_resolver.py", line 200, in finish
    resolutionReceiver.resolutionComplete()
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/endpoints.py", line 900, in resolutionComplete
    d.callback(addresses)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 459, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 567, in _startRunCallbacks
    self._runCallbacks()
  --- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: ichart.finance.yahoo.com.

2018-04-15 02:25:47-0700 [yahoo] INFO: Closing spider (finished)
2018-04-15 02:25:47-0700 [yahoo] INFO: Dumping Scrapy stats:
{'delay_count': 0,
'downloader/exception_count': 5,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 5,
'downloader/request_bytes': 1365,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 15, 9, 25, 47, 375880),
'httpcache/miss': 5,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2018, 4, 15, 9, 25, 47, 337781)}
2018-04-15 02:25:47-0700 [yahoo] INFO: Spider closed (finished)
2018-04-15 02:25:47-0700 [scrapy] INFO: Deleting /Users/JohnSnyder/scrapy.cfg
2018-04-15 02:25:47-0700 [scrapy] INFO: Sorting: /Users/JohnSnyder/test.csv
2018-04-15 02:25:47-0700 [scrapy] INFO: No need to sort empty file: /Users/JohnSnyder/test.csv

Add stats to log

It is difficult to find out the root cause of #2. To make it easier to debug, it'd be better if we collected some stats in the log, such as the number of start URLs that were fired and the number of requests that were throttled.
