Scrapy spiders for major websites: Google Play Store, Facebook, Instagram, eBay, YTS Movies, Amazon.
git clone https://github.com/talhashraf/major-scrapy-spiders mss
cd mss
pip install -r requirements.txt
scrapy crawl <spider>
Note: Facebook now only displays profile info after you solve a captcha.
(scrapy-splash) user@user-desktop:~/mss$ scrapy crawl facebook
2022-10-11 08:39:55 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mss)
2022-10-11 08:39:55 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mss', 'NEWSPIDER_MODULE': 'mss.spiders', 'SPIDER_MODULES': ['mss.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'}
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'facebook'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 149, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 156, in _run_command
cmd.run(args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 167, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 195, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 199, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: facebook'
(scrapy-splash) user@user-desktop:~/mss$ ls
data  mss  README.md  requirements.txt  scrapy.cfg
(scrapy-splash) user@user-desktop:~/mss$ cd mss/
(scrapy-splash) user@user-desktop:~/mss/mss$ ls
__init__.py  items.py  pipelines.py  __pycache__  settings.py  spiders  utils
(scrapy-splash) user@user-desktop:~/mss/mss$ cd spiders/
(scrapy-splash) user@user-desktop:~/mss/mss/spiders$ scrapy crawl facebook
2022-10-11 08:40:41 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mss)
2022-10-11 08:40:41 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mss', 'NEWSPIDER_MODULE': 'mss.spiders', 'SPIDER_MODULES': ['mss.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'}
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'facebook'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/scrapy-splash/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 149, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/cmdline.py", line 156, in _run_command
cmd.run(args, opts)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 167, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 195, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/crawler.py", line 199, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/home/user/anaconda3/envs/scrapy-splash/lib/python3.8/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: facebook'
Running in a conda environment with Python 3.8 on Ubuntu 22.04 LTS (kernel 5.15.0-48-generic).
Hi talhashraf,
Thanks for your project. I ran the facebook crawl, but I got this error:
Traceback (most recent call last):
File "c:\python27\lib\runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "c:\python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Python27\Scripts\scrapy.exe\__main__.py", line 9, in <module>
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
spider = crawler.spiders.create(spname, **opts.spargs)
File "c:\python27\lib\site-packages\scrapy\spidermanager.py", line 44, in create
raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: https://m.facebook.com/PhanKhanhHung?refid=46'
'sld' is not recognized as an internal or external command,
operable program or batch file.
'fref' is not recognized as an internal or external command,
operable program or batch file.
Can you help me solve this error?
Thanks so much!
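Two things are likely going on here: `scrapy crawl` expects the spider's registered name (`facebook`), not a URL; and on Windows, an unquoted URL containing `&` is split by cmd.exe into separate commands, which is exactly why `sld` and `fref` show up as "not recognized" commands. A minimal sketch of the splitting (the query parameters after `refid=46` are hypothetical, since the full URL isn't shown above):

```python
# cmd.exe treats "&" as a command separator, so an unquoted URL like this
# (tail parameters hypothetical) is split into three "commands":
url = "https://m.facebook.com/PhanKhanhHung?refid=46&sld=abc&fref=nf"
commands = url.split("&")
print(commands)
# The first fragment is handed to scrapy as a (nonexistent) spider name;
# "sld=abc" and "fref=nf" are then executed as separate commands, producing
# the "'sld' is not recognized..." errors. Quoting the URL avoids the
# splitting, but `scrapy crawl` still needs the spider name, e.g. facebook.
```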
I've installed all the required packages, but when I run `scrapy crawl GooglePlayStore`, it says:
2017-11-17 10:53:11 [scrapy.core.scraper] ERROR: Spider error processing <GET https://play.google.com/store/apps/details?id=com.FDGEntertainment.TowerBoxing.gp> (referer: https://play.google.com/store/apps/collection/promotion_3001b85_impossible_games?clp=SjcKKAoicHJvbW90aW9uXzMwMDFiODVfaW1wb3NzaWJsZV9nYW1lcxAHGAMSC0dBTUVfQVJDQURF:S:ANO1ljJlLEA)
Traceback (most recent call last):
File "c:\users\jarvan\miniconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\Scrapy_data\mss\mss\spiders\google\playstore.py", line 80, in parse_app
if last_updated else ''),
File "c:\users\jarvan\miniconda3\lib\site-packages\dateutil\parser.py", line 1182, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "c:\users\jarvan\miniconda3\lib\site-packages\dateutil\parser.py", line 559, in parse
raise ValueError("Unknown string format")
ValueError: Unknown string format
I want to know how to fix this error.
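The Play Store localizes the "Updated" date on app pages, so `dateutil` chokes on non-English strings with `ValueError: Unknown string format`. One way to keep the spider alive is to make the date parse defensive. A stdlib-only sketch (the format string is an assumption; the repo's `playstore.py` actually calls `dateutil.parser.parse`):

```python
from datetime import datetime

def safe_parse_date(text, fmt="%B %d, %Y"):
    """Return an ISO date string, or '' when the text doesn't match.

    Defensive wrapper in the spirit of the failing call in playstore.py:
    wrapping the parse in try/except keeps one unparseable page from
    killing the whole crawl.
    """
    try:
        return datetime.strptime(text.strip(), fmt).date().isoformat()
    except ValueError:
        return ""

print(safe_parse_date("November 17, 2017"))       # -> 2017-11-17
print(safe_parse_date("17 de noviembre de 2017")) # unparsed: returns ''
```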
Hi talhashraf,
I ran the facebook crawl but didn't get anything.
This is what the console prints when I run the script:
2016-12-10 01:04:17+0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: mss)
2016-12-10 01:04:17+0700 [scrapy] INFO: Optional features available: ssl, http11, django
2016-12-10 01:04:17+0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mss.spiders', 'FEED_URI': 'data.csv', 'SPIDER_MODULES': ['mss.spiders'], 'BOT_NAME': 'mss', 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0', 'FEED_FORMAT': 'csv'}
2016-12-10 01:04:17+0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2016-12-10 01:04:19+0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-12-10 01:04:19+0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-12-10 01:04:19+0700 [scrapy] INFO: Enabled item pipelines:
2016-12-10 01:04:19+0700 [fb] INFO: Spider opened
2016-12-10 01:04:19+0700 [fb] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-10 01:04:19+0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-10 01:04:19+0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2016-12-10 01:04:20+0700 [fb] DEBUG: Crawled (200) <GET https://m.facebook.com/> (referer: None)
2016-12-10 01:04:20+0700 [fb] DEBUG: Redirecting (302) to <GET https://m.facebook.com/home.php?refsrc=https%3A%2F%2Fm.facebook.com%2F&refid=8&_rdr> from <POST https://m.facebook.com/login.php?refsrc=https%3A%2F%2Fm.facebook.com%2F&lwv=100&login_try_number=1&refid=8>
2016-12-10 01:04:21+0700 [fb] DEBUG: Crawled (200) <GET https://m.facebook.com/home.php?refsrc=https%3A%2F%2Fm.facebook.com%2F&refid=8&_rdr> (referer: https://m.facebook.com/)
2016-12-10 01:04:21+0700 [fb] INFO: Closing spider (finished)
2016-12-10 01:04:21+0700 [fb] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1763,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 26811,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 9, 18, 4, 21, 979000),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 12, 9, 18, 4, 19, 66000)}
2016-12-10 01:04:21+0700 [fb] INFO: Spider closed (finished)
How do I fix this problem?
Thanks very much!
Hi,
How can I get more package names from the search page?
For example: https://play.google.com/store/search?q=clean&c=apps
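At the time this spider was written, the Play Store search page was commonly paginated with `start`/`num` query parameters; those parameter names are an assumption about a long-changed endpoint, but the URL-building pattern is the same either way. A sketch of generating follow-up search URLs to feed into the spider:

```python
from urllib.parse import urlencode

def play_search_urls(query, pages=3, page_size=20):
    """Yield paginated Play Store search URLs.

    The `start`/`num` parameter names are assumptions about the old
    web endpoint; adjust them to whatever the live site accepts.
    """
    base = "https://play.google.com/store/search"
    for page in range(pages):
        params = {"q": query, "c": "apps",
                  "start": page * page_size, "num": page_size}
        yield base + "?" + urlencode(params)

for url in play_search_urls("clean"):
    print(url)
```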
Hi,
Trying to run the spider, but I get this error continuously:
root@2031f496dec5:/# scrapy crawl instagram
2018-03-08 12:56:53 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: mss)
2018-03-08 12:56:53 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'mss', 'NEWSPIDER_MODULE': 'mss.spiders', 'SPIDER_MODULES': ['mss.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'}
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'mss/spiders/instagram'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/usr/local/lib/python3.6/site-packages/scrapy/cmdline.py", line 149, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python3.6/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python3.6/site-packages/scrapy/cmdline.py", line 156, in _run_command
cmd.run(args, opts)
File "/usr/local/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 167, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 195, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 199, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/usr/local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: mss/spiders/instagram'
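The loader keys spiders by their `name` attribute, never by file path, so `scrapy crawl mss/spiders/instagram` can't match anything; the command wants `scrapy crawl instagram`, run from the directory containing `scrapy.cfg`. A minimal sketch of the lookup that raises in the traceback above (a stripped-down stand-in for `scrapy.spiderloader.SpiderLoader`, not the real class):

```python
class SpiderLoader:
    """Stand-in for scrapy.spiderloader.SpiderLoader: spiders are
    registered under their `name` attribute, not their module path."""

    def __init__(self, spider_classes):
        self._spiders = {cls.name: cls for cls in spider_classes}

    def load(self, spider_name):
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: {}".format(spider_name))

class InstagramSpider:
    name = "instagram"  # this is what `scrapy crawl instagram` matches

loader = SpiderLoader([InstagramSpider])
print(loader.load("instagram"))         # found by name
# loader.load("mss/spiders/instagram")  # KeyError: Spider not found
```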