huaying / instagram-crawler

Get Instagram posts/profile/hashtag data without using Instagram API

License: MIT License

Python 100.00%
instagram instagram-crawler python webdriver instagram-liker autoliker auto likers instagram-scraper instagram-bot

instagram-crawler's People

Contributors

aalshakarchi, angus258963, ayushmanglani, emphasis10, huaying, jimchr-r4gn4r, juanjovazpar, kswgit, leehanyeong, rrbarioni, satnam-sandhu, sutjin, vinyguedess

instagram-crawler's Issues

Can't run script

I followed the installation guide, but when I try to launch it I get this:

Traceback (most recent call last):
File "crawler.py", line 3, in <module>
from builtins import open
ImportError: No module named builtins
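
Python 2 has no standard-library module named `builtins`; there it is provided by the third-party `future` package. So this message suggests the script was started with a Python 2 interpreter that does not have that package. The following is a minimal sketch of a defensive import, not the project's own code. The fallback to `io.open` and the temporary file used for the demonstration are assumptions made for illustration.

```python
import os
import tempfile

# Sketch only: prefer the Python 3 style `open`, with a fallback for
# Python 2 interpreters that do not have the `future` package installed.
try:
    from builtins import open  # Python 3, or Python 2 with `future`
except ImportError:
    from io import open  # io.open accepts `encoding` on Python 2 as well

# Demonstration: write and read back a small UTF-8 file.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(u"caf\u00e9")
with open(path, encoding="utf-8") as handle:
    content = handle.read()
```

Running the crawler with Python 3, or installing `future` into the Python 2 environment, are the two usual ways around this import error.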

I think this crawler is blocked. Is it right?

The results of using your code are as follows:

Strating fetching...
Number of fetched posts: 0
Wait for 1 sec...
Number of fetched posts: 0
Wait for 2 sec...
Number of fetched posts: 0
Wait for 4 sec...
Number of fetched posts: 0
Wait for 8 sec...
Number of fetched posts: 0
Wait for 16 sec...
Number of fetched posts: 0
Wait for 32 sec...
Number of fetched posts: 0
Wait for 64 sec...
Number of fetched posts: 0
Wait for 128 sec...
Number of fetched posts: 0
Wait for 256 sec...
Number of fetched posts: 0
Wait for 512 sec...
Done. Fetched 0 posts.
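
The log above shows a wait that doubles from 1 to 512 seconds before the run gives up. The sketch below reconstructs that pattern from the log alone. It is not taken from the crawler's source, and `fetch_once`, `target` and the injectable `sleep` are hypothetical names.

```python
import time


def backoff_waits(first=1, last=512):
    """Yield wait times in seconds, doubling from `first` up to `last`."""
    wait = first
    while wait <= last:
        yield wait
        wait *= 2


def fetch_with_backoff(fetch_once, target, sleep=time.sleep):
    """Call `fetch_once` until `target` posts are collected or waits run out.

    `fetch_once` is a hypothetical callable that returns a list of new posts.
    """
    posts = []
    for wait in backoff_waits():
        posts.extend(fetch_once())
        print("Number of fetched posts: %d" % len(posts))
        if len(posts) >= target:
            break
        print("Wait for %d sec..." % wait)
        sleep(wait)
    print("Done. Fetched %d posts." % len(posts))
    return posts
```

With these defaults, a run that never receives a post sleeps for 1023 seconds in total before stopping. A result of zero posts after the full schedule therefore points to the page never yielding posts, not to a wait that was too short.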

location info

Is there a way to get the latitude and longitude information for images that have it?

I want to search by hashtag and also by location, and I would like to get the hashtag and the lat/long location as well.

To overcome the 1000 results limit, would it be possible to also restrict the search by date posted?

Thanks

How to go to the next post if one of the posts can't load

Hi Huaying,

I have a problem when the internet connection is slow. When I use posts_full, the crawler tries to get some posts, but if one of them fails to load, it returns an error because it can't get an element from a NoneType object.

Is there any way to overcome this by going to the next post if it returns None? I already tried setting the wait time parameter to 120 in the findone method, but the result is the same: the post just can't be loaded, and in the end a NoneType object is returned.

Thank you
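
One way to get the behaviour asked for here is to treat a failed lookup as "skip this post" and keep going. The sketch below shows only that control flow. `load_post` is a hypothetical stand-in for whatever the crawler does per post, and nothing here is the project's actual code.

```python
def collect_posts(post_keys, load_post):
    """Load each post, skipping any that fail instead of aborting the run.

    `load_post` is a hypothetical callable that returns a dict for a post,
    or None when the page could not be loaded.
    """
    posts, skipped = [], []
    for key in post_keys:
        try:
            post = load_post(key)
        except Exception:  # e.g. a timeout or a missing page element
            post = None
        if post is None:
            skipped.append(key)  # remember the key so it can be retried later
            continue
        posts.append(post)
    return posts, skipped
```

Returning the skipped keys separately makes it possible to retry just those posts once the connection is better.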

code doesn't work at all

Hello, I'm new to Python and crawling.
Please help me with this error:

F:\DEV\python\instagram crawler from github\instagram-crawler>python crawler.py posts -u cal_foodie -n 100 -o ./output
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
stdin=PIPE)
File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 756, in __init__
restore_signals, start_new_session)
File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1155, in _execute_child
startupinfo)
PermissionError: [WinError 5] Access is denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "crawler.py", line 77, in <module>
args.debug
File "crawler.py", line 24, in get_posts_by_user
ins_crawler = InsCrawler(has_screen=debug)
File "F:\DEV\python\instagram crawler from github\instagram-crawler\inscrawler\crawler.py", line 54, in __init__
self.browser = Browser(has_screen)
File "F:\DEV\python\instagram crawler from github\instagram-crawler\inscrawler\browser.py", line 23, in __init__
chrome_options=chrome_options)
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 68, in __init__
self.service.start()
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 88, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home

posts_full failing when getting: post_num, follower_num, following_num

By running the following command:
python crawler.py posts_full -u teslamotors -n 3

I get the following error:
File "/Users/c.cardenasvelasquez/Documents/GitHub/instagram-crawler/inscrawler/crawler.py", line 102, in get_user_profile
post_num, follower_num, following_num = statistics
ValueError: not enough values to unpack (expected 3, got 0)
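
The error means `statistics` was empty when the three-way unpack ran, presumably because the profile counters were not found on the page. That cause is an assumption; the thread does not confirm it. A minimal sketch of a guard around the unpack follows. The helper name `unpack_statistics` is hypothetical and this is not the project's fix.

```python
def unpack_statistics(statistics):
    """Return (post_num, follower_num, following_num).

    Falls back to three None values when the profile page did not yield
    exactly three counters, instead of raising ValueError.
    """
    if len(statistics) != 3:
        return None, None, None
    post_num, follower_num, following_num = statistics
    return post_num, follower_num, following_num
```

A caller can then check for `None` and log that the profile header could not be read, which is easier to diagnose than the unpacking error.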

AttributeError

Thank you for sharing this great code.
I have recently been struggling with an AttributeError, so I came here to ask a question.

Whenever I try to crawl more than 500 posts,
an AttributeError comes up.

The command I ran is:
$ python crawler.py hashtag -t paris -n 10000 -o ./output

and the error message is:

AttributeError: 'NoneType' object has no attribute 'get_attribute'

File "C:\Users\inhye\projects\crawler\ins-crawler\inscrawler\crawler.py", line 155, in _get_posts_full
cur_key = ele_a_datetime.get_attribute('href')

I have been trying to fix this problem for 3 weeks without success.
Is there some way to solve it?

Caption comes as "verified"

I tried to crawl a verified profile and all the captions came back as "verified". Is there a way to get the actual caption instead?

Cannot login

Whenever the script starts, it goes to the login screen and then reports a suspicious login, asking me to confirm. I confirmed in the app, but then the browser closes and an error is thrown on the command line.

There's not enough time to confirm by text or email before the browser closes.

"Traceback" & "Failed to fetch the post" Bug happened when fetch_comments

result.json.zip
(https://github.com/huaying/instagram-crawler/files/3681900/result.json.zip)
Thanks very much for your work; this is definitely the best Instagram crawler I could find on the internet. But when I tried the example, I found an interesting bug.

I have installed the toolkit with "chromedriver_mac64.zip" (chromedriver 77.0.3865.40)

Step1: % python3 crawler.py posts_full -u elenalinnn --fetch_comments
Step2:
We can see that the first 19 of 79 posts are fine, but the 20th one shows the following in red:

Failed to fetch the post: https://www.instagram.com/p/BxxzM7ACyaX/
Traceback (most recent call last):
File "/Users/xingchen/Documents/Python/spider-Ins/instagram-crawler-master/inscrawler/crawler.py", line 217, in _get_posts_full
fetch_comments(browser, dict_post)
File "/Users/xingchen/Documents/Python/spider-Ins/instagram-crawler-master/inscrawler/fetch.py", line 145, in fetch_comments
show_comment_btn.click()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable
(Session info: headless chrome=77.0.3865.90)

After the run, the 20th item looks like this:
{"key": "https://www.instagram.com/p/BrjMfkVFaSE/", "datetime": "2018-12-19T01:08:47.000Z"},
which is very different from the others.

This bug happens for items 20, 31, 48, 50, 55, 59 and 60.

I cannot figure out what leads to this bug. Can you look into this issue?

Thank you very much. Have a nice day.

Best Regards.

[pip install -r requirements.txt] error

I have no idea why this error occurred.
I visited the official site below and checked the newest version, which was 19.3b0.
https://pypi.org/project/black/

===================error message==============================
Collecting black==19.3b0 (from -r requirements.txt (line 5))
ERROR: Could not find a version that satisfies the requirement black==19.3b0 (from -r requirements.txt (line 5)) (from versions: none)
ERROR: No matching distribution found for black==19.3b0 (from -r requirements.txt (line 5))

Getting all comments?

Hello,

Thanks for the amazing work! One step further for me would be to try to get all comments for each post. Would you know how to do this? It seems hard because it is not just a matter of scrolling down, but of finding how to click the "+" button at the bottom of the comments...

Thanks for the help!

Vincent

Crawl different data based on the config

We should have a config file that specifies the data you want to crawl.
Heavy operations, such as getting all likers or all comments, can then be switched on in the config.
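
A rough sketch of what such a config could look like. The JSON layout and the helper names are assumptions made for illustration. Only the step names `fetch_likers` and `fetch_comments` come from the tracebacks on this page.

```python
import json

# Hypothetical config format: each key switches one optional, slow step on or off.
DEFAULT_CONFIG = {"fetch_likers": False, "fetch_comments": False}


def load_config(text):
    """Parse a JSON config and merge it over the defaults."""
    user_config = json.loads(text)
    unknown = set(user_config) - set(DEFAULT_CONFIG)
    if unknown:
        # Fail early on typos rather than silently ignoring a setting.
        raise ValueError("unknown config keys: %s" % sorted(unknown))
    config = dict(DEFAULT_CONFIG)
    config.update(user_config)
    return config


def enabled_steps(config):
    """Return the names of the optional steps that are switched on."""
    return [name for name in sorted(config) if config[name]]
```

For example, a file containing `{"fetch_comments": true}` would enable comment fetching while leaving the liker lookup off.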

-fetch_comments

python crawler.py posts_full -u insidelombok -n 100 --fetch_comments

URL not fetched
Traceback (most recent call last):
File "/home/newbie/instagram-crawler/inscrawler/crawler.py", line 217, in _get_posts_full
fetch_comments(browser, dict_post)
File "/home/newbie/instagram-crawler/inscrawler/fetch.py", line 146, in fetch_comments
show_comment_btn.click()
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
self.error_handler.check_response(response)
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
ElementNotInteractableException: Message: element not interactable
(Session info: headless chrome=79.0.3945.88)

Liker bounces out of log in before the first like

Hi Huaying,

Thanks for the amazing crawler. It has been working well recently, but today when I tried to use the liker, it logged in and then bounced out (returned to the login screen). This happens right after it opens an image and before it does the first like.

Thank you for your help again!

I got a log problem

Hi Huaying, I have a problem: the log file was not found.
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/instagram-crawler-1538195382.log'
Exception ignored in: <function Logging.__del__ at 0x02C62270>
Traceback (most recent call last):
File "D:\ARIEF\CRAWLIG2\inscrawler\crawler.py", line 37, in __del__
self.logger.close()
AttributeError: 'InsCrawler' object has no attribute 'logger'

WebDriverException

How can I fix this problem?
Traceback (most recent call last):
  File "crawler.py", line 83, in <module>
    args.username, args.number, args.mode == "posts_full", args.debug
  File "crawler.py", line 27, in get_posts_by_user
    ins_crawler = InsCrawler(has_screen=debug)
  File "/root/instagram-crawler/inscrawler/crawler.py", line 67, in __init__
    self.browser = Browser(has_screen)
  File "/root/instagram-crawler/inscrawler/browser.py", line 26, in __init__
    chrome_options=chrome_options,
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 185, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: invalid argument: unrecognized capability: chromeOptions

Data format

The data does not contain the captions of the posts.
Could you add code that scrapes a hashtag dataset? For example, if I want to scrape the friends hashtag data, my dataset should contain the image name, the caption for that image, the hashtags for the same image, and the likes on the post, and the images should be stored in one folder.

question

As in posts_full mode, can I fetch post contents, mentions, etc. when crawling by hashtag?
For example: python crawler.py hashtag -t taiwan --fetch_comments --fetch_mentions

Instagram Like fetching problem

Failed to fetch the post: https://www.instagram.com/p/B1zBXiygROa/
Traceback (most recent call last):
  File "instagram-crawler/inscrawler/crawler.py", line 213, in _get_posts_full
    fetch_likers(browser, dict_post)
  File "instagram-crawler/inscrawler/fetch.py", line 93, in fetch_likers
    like_info_btn.click()
AttributeError: 'NoneType' object has no attribute 'click'
fetching: 100%|
[{"key": "https://www.instagram.com/p/B1zBXiygROa/", "datetime": "2019-08-30T17:54:17.000Z", "img_urls": ["https://scontent-mxp1-1.cdninstagram.com/vp/b4825b2b1f9fcadd1df00ef2629f22e0/5D6C7096/t51.2885-15/e35/69663551_128155128469835_2515122558944658279_n.jpg?_nc_ht=scontent-mxp1-1.cdninstagram.com"]}]

https://mixmag.net/read/instagram-remove-likes-trial-news

Crawling by # stops at 200

Hi! I want to use this crawler to crawl by hashtag and retrieve all the information about a post (number of likes, number of comments...). The original version of the crawler does not allow that, so I modified the source code. I succeeded in retrieving all the information I need, but the crawler stops at 200 posts.
Do you have any insight into that?
Thank you!

Failed to fetch the post

An error occurs when crawling user dain __. 070528
The failing post is:
https://www.instagram.com/p/BuEe8f_DeA5/

The traceback is shown below:

fetching:  29%|█████████████████████████████████████████████▋                                                                                                              | 12/41 [01:01<02:23,  4.94s/it]
Failed to fetch the post: https://www.instagram.com/p/BuEe8f_DeA5/

Traceback (most recent call last):
  File "/Users/lhy/projects/10wonders/app/inscrawler/crawler.py", line 286, in _get_posts_full
    self._fetch_post_with_key(cur_key, dict_post)
  File "/Users/lhy/projects/10wonders/app/inscrawler/crawler.py", line 183, in _fetch_post_with_key
    img_urls.add(ele_img.get_attribute('src'))
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 143, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: headless chrome=73.0.3683.103)
  (Driver info: chromedriver=2.37.544337 (8c0344a12e552148c185f7d5117db1f28d6c9e85),platform=Mac OS X 10.13.6 x86_64)
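
A stale element reference means the page changed between locating the element and using it, so a common workaround is to locate the element again and retry. The sketch below shows only the retry wrapper. `StaleElementError` is a stand-in class so that the example runs without Selenium, and `retry_on` is a hypothetical helper, not part of the crawler.

```python
import time


class StaleElementError(Exception):
    """Stand-in for selenium's StaleElementReferenceException."""


def retry_on(exc_type, action, attempts=3, delay=0.5, sleep=time.sleep):
    """Run `action`, retrying when it raises `exc_type`.

    `action` should locate the element again on every call; retrying with
    the old element object would fail in the same way each time.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exc_type:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
```

In real use, `exc_type` would be Selenium's own exception class, and `action` would wrap both the element lookup and the `get_attribute('src')` call.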

posts_full Not working

I tried multiple user profiles and got:

next_key = ele_a_datetime.get_attribute('href')
AttributeError: 'NoneType' object has no attribute 'get_attribute'
Exception KeyError: KeyError(<weakref at 0x10b479f18; to 'tqdm' at 0x10b443a90>,) in ignored

Possible to get post likes?

Is it possible to get information about who has liked a post? I want to access the usernames of everyone who liked a post.

Output is all in one line on Windows 10

So I just ran python crawler.py hashtag -t DDay -o ./output on Windows 10, and the output was given all in one line and really hard to read. I had to use vim to search and replace and get the output to look like your example pictures.
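
If the output is a JSON document, as the samples elsewhere on this page suggest, it can be re-indented without hand-editing. A minimal sketch follows, using only the standard `json` module. The helper name `pretty` is hypothetical.

```python
import json


def pretty(raw_json):
    """Re-indent a JSON document so each field sits on its own line."""
    data = json.loads(raw_json)
    # ensure_ascii=False keeps non-ASCII captions readable instead of \u escapes.
    return json.dumps(data, indent=2, ensure_ascii=False)


one_line = '[{"key": "https://www.instagram.com/p/BrjMfkVFaSE/", "datetime": "2018-12-19T01:08:47.000Z"}]'
formatted = pretty(one_line)
```

The same result is available from the command line with `python -m json.tool`, which reads a JSON file and prints it indented.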

Log in for private accounts

How do I enter the username and password in secret.py? I get errors saying there is no username and password after I edit secret.py directly. I am also running the cp inscrawler/secret.py.dist inscrawler/secret.py command after I edit.

Couldn't automatically log in

Hi, I'm new to Python.
When I put my Instagram username 'bingdnboujee' and my password in secret.py,
it didn't seem to work: the code recognized my username as 'bingd', which is my Windows system user name.
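
One possible explanation, which is an unconfirmed guess and not something stated in this thread, is that the credentials are read from an environment variable named `USERNAME`. Windows already sets that variable to the logged-in account name. The sketch below shows how prefixed variable names avoid such a clash. The `INSTAGRAM_` prefix and the helper are hypothetical.

```python
import os


def read_credentials(env=os.environ, prefix="INSTAGRAM_"):
    """Read credentials from prefixed environment variables.

    The prefix keeps the lookup from colliding with variables the
    operating system already defines, such as USERNAME on Windows.
    """
    username = env.get(prefix + "USERNAME")
    password = env.get(prefix + "PASSWORD")
    if not username or not password:
        raise KeyError("set %sUSERNAME and %sPASSWORD" % (prefix, prefix))
    return username, password
```

If the crawler turns out to read plain values from secret.py and not the environment, this sketch does not apply, and the file's contents would need checking first.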

image orders

First, thanks for this nice repo!
Is it possible to recognize the order of the images in a post with multiple images?
I want to know which image is the first photo (cover of the post).
I tried to trace the code and the ['img_urls'] to find the rule but I failed.
Are there any ways to solve this issue? Please let me know.
Thanks!
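
A traceback elsewhere on this page shows the crawler calling `img_urls.add(...)`, which suggests the URLs are collected in a set, and a Python set does not preserve insertion order. The sketch below removes duplicates while keeping the first-seen order. It assumes the images are visited in the order they appear in the post, and the helper name is hypothetical.

```python
def unique_in_order(urls):
    """Drop duplicate URLs while keeping the order in which they were seen."""
    seen = set()
    ordered = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

With a list built this way, the first entry would be the first image visited, which under the stated assumption is the cover of the post.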

Error!!

I encountered the following errors. Did anyone face the same situation?
How did you guys solve it?

Traceback (most recent call last):
File "crawler.py", line 83, in <module>
args.username, args.number, args.mode == "posts_full", args.debug
File "crawler.py", line 27, in get_posts_by_user
ins_crawler = InsCrawler(has_screen=debug)
File "C:\Users\Roshan Dhamala\Independent Class\Insta\instagram-crawler\inscrawler\crawler.py", line 68, in __init__
self.browser = Browser(has_screen)
File "C:\Users\Roshan Dhamala\Independent Class\Insta\instagram-crawler\inscrawler\browser.py", line 27, in __init__
chrome_options=chrome_options,
File "C:\Python3\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 68, in __init__
self.service.start()
File "C:\Python3\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
stdin=PIPE)
File "C:\Python3\lib\subprocess.py", line 775, in __init__
restore_signals, start_new_session)
File "C:\Python3\lib\subprocess.py", line 1178, in _execute_child
startupinfo)
OSError: [WinError 193] %1 is not a valid Win32 application

problem with posts_full

Thanks for your work; it works really well, but one issue bothers me from time to time.

selenium.common.exceptions.WebDriverException: Message: unknown error: Element <a class="HBoOv coreSpriteRightPaginationArrow" tabindex="0" href="/p/1920252025238890452/">...</a> is not clickable at point (763, 300). Other element would receive the click: <div class="QhbhU" role="button" tabindex="0"></div>
There is no other element with the class name "HBoOv" on the page, but the crawler just stops working.

Small? Error in post_full

Trying to scrape: https://www.instagram.com/explore/tags/satoyamaexperience/

python crawler.py posts_full -t satoyamaexperience -n 1000 -o ./output

Error

Traceback (most recent call last):
  File "crawler.py", line 77, in <module>
    args.debug
  File "crawler.py", line 25, in get_posts_by_user
    return ins_crawler.get_user_posts(username, number, detail)
  File "/Users/philipperemy/PycharmProjects/instagram-crawler/inscrawler/crawler.py", line 100, in get_user_posts
    user_profile = self.get_user_profile(username)
  File "/Users/philipperemy/PycharmProjects/instagram-crawler/inscrawler/crawler.py", line 89, in get_user_profile
    post_num, follower_num, following_num = statistics
ValueError: not enough values to unpack (expected 3, got 0)

Thanks a million in advance :)

login doesn't work

Hi Huaying, thanks for your crawler!

I am trying to use liker.py, and the code seems to work (briefly), but it redirects back to the login page because it isn't logged in, and then the app crashes. I have already set my own username and password in the secret file, but it still won't log in.

Thank you!

Here is the log file:

Traceback (most recent call last):
  File "liker.py", line 22, in <module>
    ins_crawler.auto_like(tag=args.hashtag, maximum=args.number)
  File "/Users/mimichindasook/Downloads/instagram-crawler-master/inscrawler/crawler.py", line 112, in auto_like
    ele_post.click()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
    return self._parent.execute(command, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Element <a href="/p/BoXEX4hhapD/?tagged=surf">...</a> is not clickable at point (132, 473). Other element would receive the click: <div class="pHxcJ">...</div>
  (Session info: chrome=69.0.3497.100)
  (Driver info: chromedriver=2.42.591059 (a3d9684d10d61aa0c45f6723b327283be1ebaad8),platform=Mac OS X 10.11.6 x86_64)
