Scrapy spiders for major websites: Google Play Store, Facebook, Instagram, eBay, YTS Movies, Amazon.
git clone https://github.com/talhashraf/major-scrapy-spiders mss
cd mss
pip install -r requirements.txt
scrapy crawl <spider>
Note: Facebook now only displays profile info after you solve a captcha.
(scrapy-splash) user@user-desktop:~/mss$ scrapy crawl facebook
2022-10-11 08:39:55 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mss)
2022-10-11 08:39:55 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mss', 'NEWSPIDER_MODULE': 'mss.spiders', 'SPIDER_MODULES': ['mss.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'}
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'facebook'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 149, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 156, in _run_command
cmd.run(args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 167, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 195, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 199, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: facebook'
(scrapy-splash) user@user-desktop:~/mss$ ls
data  mss  README.md  requirements.txt  scrapy.cfg
(scrapy-splash) user@user-desktop:~/mss$ cd mss/
(scrapy-splash) user@user-desktop:~/mss/mss$ ls
__init__.py  items.py  pipelines.py  __pycache__  settings.py  spiders  utils
(scrapy-splash) user@user-desktop:~/mss/mss$ cd spiders/
(scrapy-splash) user@user-desktop:~/mss/mss/spiders$ scrapy crawl facebook
2022-10-11 08:40:41 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mss)
2022-10-11 08:40:41 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mss', 'NEWSPIDER_MODULE': 'mss.spiders', 'SPIDER_MODULES': ['mss.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'}
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'facebook'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 149, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 156, in _run_command
cmd.run(args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 167, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 195, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 199, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: facebook'
Running in a conda environment with Python 3.8 on Ubuntu 22.04 LTS (kernel 5.15.0-48-generic).
Hi talhashraf,
Thanks for your project. I ran the facebook crawl, but I got this error:
Traceback (most recent call last):
File "c:\python27\lib\runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "c:\python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Python27\Scripts\scrapy.exe\__main__.py", line 9, in <module>
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
spider = crawler.spiders.create(spname, **opts.spargs)
File "c:\python27\lib\site-packages\scrapy\spidermanager.py", line 44, in create
raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: https://m.facebook.com/PhanKhanhHung?refid=46'
'sld' is not recognized as an internal or external command,
operable program or batch file.
'fref' is not recognized as an internal or external command,
operable program or batch file.
Can you help me solve this error?
Thanks so much!
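Two things are likely going on here: `scrapy crawl` expects the spider's registered name (`facebook`), not a URL; and on Windows, an unquoted URL containing `&` is split by cmd.exe into separate commands, which is exactly why `sld` and `fref` show up as "not recognized" commands. A minimal sketch of the splitting (the query parameters after `refid=46` are hypothetical, since the full URL isn't shown above):

```python
# cmd.exe treats "&" as a command separator, so an unquoted URL like this
# (tail parameters hypothetical) is split into three "commands":
url = "https://m.facebook.com/PhanKhanhHung?refid=46&sld=abc&fref=nf"
commands = url.split("&")
print(commands)
# The first fragment is handed to scrapy as a (nonexistent) spider name;
# "sld=abc" and "fref=nf" are then executed as separate commands, producing
# the "'sld' is not recognized..." errors. Quoting the URL avoids the
# splitting, but `scrapy crawl` still needs the spider name, e.g. facebook.
```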
I've installed all the required packages, but when I run `scrapy crawl GooglePlayStore`, it says:
2017-11-17 10:53:11 [scrapy.core.scraper] ERROR: Spider error processing <GET https://play.google.com/store/apps/details?id=com.FDGEntertainment.TowerBoxing.gp> (referer: https://play.google.com/store/apps/collection/promotion_3001b85_impossible_games?clp=SjcKKAoicHJvbW90aW9uXzMwMDFiODVfaW1wb3NzaWJsZV9nYW1lcxAHGAMSC0dBTUVfQVJDQURF:S:ANO1ljJlLEA)
Traceback (most recent call last):
File "c:\users\jarvan\miniconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\Scrapy_data\mss\mss\spiders\google\playstore.py", line 80, in parse_app
if last_updated else ''),
File "c:\users\jarvan\miniconda3\lib\site-packages\dateutil\parser.py", line 1182, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "c:\users\jarvan\miniconda3\lib\site-packages\dateutil\parser.py", line 559, in parse
raise ValueError("Unknown string format")
ValueError: Unknown string format
I want to know how to fix this error.
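The Play Store localizes the "Updated" date on app pages, so `dateutil` chokes on non-English strings with `ValueError: Unknown string format`. One way to keep the spider alive is to make the date parse defensive. A stdlib-only sketch (the format string is an assumption; the repo's `playstore.py` actually calls `dateutil.parser.parse`):

```python
from datetime import datetime

def safe_parse_date(text, fmt="%B %d, %Y"):
    """Return an ISO date string, or '' when the text doesn't match.

    Defensive wrapper in the spirit of the failing call in playstore.py:
    wrapping the parse in try/except keeps one unparseable page from
    killing the whole crawl.
    """
    try:
        return datetime.strptime(text.strip(), fmt).date().isoformat()
    except ValueError:
        return ""

print(safe_parse_date("November 17, 2017"))       # -> 2017-11-17
print(safe_parse_date("17 de noviembre de 2017")) # unparsed: returns ''
```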
Hi talhashraf,
I ran the facebook crawl but didn't get anything.
This is what the console prints when I run the script:
2016-12-10 01:04:17+0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: mss)
2016-12-10 01:04:17+0700 [scrapy] INFO: Optional features available: ssl, http11, django
2016-12-10 01:04:17+0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mss.spiders', 'FEED_URI': 'data.csv', 'SPIDER_MODULES': ['mss.spiders'], 'BOT_NAME': 'mss', 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0', 'FEED_FORMAT': 'csv'}
2016-12-10 01:04:17+0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2016-12-10 01:04:19+0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-12-10 01:04:19+0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-12-10 01:04:19+0700 [scrapy] INFO: Enabled item pipelines:
2016-12-10 01:04:19+0700 [fb] INFO: Spider opened
2016-12-10 01:04:19+0700 [fb] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-10 01:04:19+0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-10 01:04:19+0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2016-12-10 01:04:20+0700 [fb] DEBUG: Crawled (200) <GET https://m.facebook.com/> (referer: None)
2016-12-10 01:04:20+0700 [fb] DEBUG: Redirecting (302) to <GET https://m.facebook.com/home.php?refsrc=https%3A%2F%2Fm.facebook.com%2F&refid=8&_rdr> from <POST https://m.facebook.com/login.php?refsrc=https%3A%2F%2Fm.facebook.com%2F&lwv=100&login_try_number=1&refid=8>
2016-12-10 01:04:21+0700 [fb] DEBUG: Crawled (200) <GET https://m.facebook.com/home.php?refsrc=https%3A%2F%2Fm.facebook.com%2F&refid=8&_rdr> (referer: https://m.facebook.com/)
2016-12-10 01:04:21+0700 [fb] INFO: Closing spider (finished)
2016-12-10 01:04:21+0700 [fb] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1763,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 26811,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 9, 18, 4, 21, 979000),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 12, 9, 18, 4, 19, 66000)}
2016-12-10 01:04:21+0700 [fb] INFO: Spider closed (finished)
How do I fix this problem?
Thanks very much!
Hi,
How can I get more package names from the search page?
For example: https://play.google.com/store/search?q=clean&c=apps
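At the time this spider was written, the Play Store search page was commonly paginated with `start`/`num` query parameters; those parameter names are an assumption about a long-changed endpoint, but the URL-building pattern is the same either way. A sketch of generating follow-up search URLs to feed into the spider:

```python
from urllib.parse import urlencode

def play_search_urls(query, pages=3, page_size=20):
    """Yield paginated Play Store search URLs.

    The `start`/`num` parameter names are assumptions about the old
    web endpoint; adjust them to whatever the live site accepts.
    """
    base = "https://play.google.com/store/search"
    for page in range(pages):
        params = {"q": query, "c": "apps",
                  "start": page * page_size, "num": page_size}
        yield base + "?" + urlencode(params)

for url in play_search_urls("clean"):
    print(url)
```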
Hi,
Trying to run the spider, but I get this error continuously:
root@2031f496dec5:/# scrapy crawl instagram
2018-03-08 12:56:53 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mss)
2018-03-08 12:56:53 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mss', 'NEWSPIDER_MODULE': 'mss.spiders', 'SPIDER_MODULES': ['mss.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'}
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'mss/spiders/instagram'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/usr/local/lib/python3.6/site-packages/scrapy/cmdline.py", line 149, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python3.6/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python3.6/site-packages/scrapy/cmdline.py", line 156, in _run_command
cmd.run(args, opts)
File "/usr/local/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 167, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 195, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 199, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/usr/local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: mss/spiders/instagram'
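The loader keys spiders by their `name` attribute, never by file path, so `scrapy crawl mss/spiders/instagram` can't match anything; the command wants `scrapy crawl instagram`, run from the directory containing `scrapy.cfg`. A minimal sketch of the lookup that raises in the traceback above (a stripped-down stand-in for `scrapy.spiderloader.SpiderLoader`, not the real class):

```python
class SpiderLoader:
    """Stand-in for scrapy.spiderloader.SpiderLoader: spiders are
    registered under their `name` attribute, not their module path."""

    def __init__(self, spider_classes):
        self._spiders = {cls.name: cls for cls in spider_classes}

    def load(self, spider_name):
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: {}".format(spider_name))

class InstagramSpider:
    name = "instagram"  # this is what `scrapy crawl instagram` matches

loader = SpiderLoader([InstagramSpider])
print(loader.load("instagram"))         # found by name
# loader.load("mss/spiders/instagram")  # KeyError: Spider not found
```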