huaying / instagram-crawler

Get Instagram posts/profile/hashtag data without using Instagram API

License: MIT License

Python 100.00%

instagram instagram-crawler python webdriver instagram-liker autoliker auto likers instagram-scraper instagram-bot

instagram-crawler's Introduction

👋🏻 Huaying Tsai

🤠 thoth - knowledge sharing platform: https://thoth.tw/

👾 tech blog: https://june.monster/

🧑🏻‍💻 linkedin: https://www.linkedin.com/in/huayingtsai/

instagram-crawler's People

grandtourismo doricardo-zz python26 imaduddin76 kts12345 ndfc angus258963 vuongducdai mavahedinia sandyway jeongchanwoo ichi-kim torst kyuhwas hamzafunnel1 iamdoodoo philipperemy bluesid okyksl tovarischsukhov anubhav-jangra nvnnghia osztro mzeiher leafinity maminh seohosong mansung ywangzi noceur wywongbd virresh alirezabayatmk hello-anmol ryoito goethelee umkevinc leehanyeong new-learn inhyek exploit-inters jayppo hama hhy5277 define16 dschwalm junwonpk stanfordhci faithfulzhening t510599 amanajaslg anik-dasgupta hysophie cnxtech lizasivriver vincy13 azridev fikrifahruroji broshanfekr dinoforcered canshot gautamsharma0095 seunghyunwang whatabeautifulname wrgsray kilinco setubal18 238mail sammilei jason9075 aalshakarchi kingking888 sutjin colafishx eunyoung4444 runeknight itsdrnicolas knifenomad brillante06 darkokoa nastako bbangdi kingkarlito cswaynecool flaviowildner zhanglipku fan-xin ktxcx rickmunene lawrencechen0921 binjjam chuajiesheng kt9302 a8252525 aarles b2220333 ljh7340 maykim51 yasharjahanshahloo zdenekhynek

instagram-crawler's Issues

Post and Post_full don't fetch content text

Can't run script

I followed the installation guide, but when I try to launch it I get this:

Traceback (most recent call last):
  File "crawler.py", line 3, in <module>
    from builtins import open
ImportError: No module named builtins
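
On Python 2 the `builtins` module comes from the third-party `future` package, so this error usually means the script is running under Python 2 without that package installed. A minimal sketch of a fallback import, assuming only `open` is needed (the demo file name is made up for illustration):

```python
# Sketch: get a Python 3 style open() on either interpreter.
try:
    from builtins import open  # Python 3, or Python 2 with `future` installed
except ImportError:
    from io import open  # io.open accepts `encoding` on Python 2.6+ as well

import os
import tempfile

# Hypothetical demo file, written and read back with an explicit encoding.
path = os.path.join(tempfile.gettempdir(), "builtins_fallback_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9")
with open(path, encoding="utf-8") as f:
    content = f.read()
```

Running the crawler with Python 3, or installing `future` on Python 2, avoids the need for a fallback.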

I think this crawler is blocked. Is that right?

The results of using your code are as follows:

Strating fetching...
Number of fetched posts: 0
Wait for 1 sec...
Number of fetched posts: 0
Wait for 2 sec...
Number of fetched posts: 0
Wait for 4 sec...
Number of fetched posts: 0
Wait for 8 sec...
Number of fetched posts: 0
Wait for 16 sec...
Number of fetched posts: 0
Wait for 32 sec...
Number of fetched posts: 0
Wait for 64 sec...
Number of fetched posts: 0
Wait for 128 sec...
Number of fetched posts: 0
Wait for 256 sec...
Number of fetched posts: 0
Wait for 512 sec...
Done. Fetched 0 posts.
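
The log above looks like an exponential backoff: the wait doubles from 1 to 512 seconds each time no new posts appear. The sketch below reproduces only that pattern and is not the crawler's actual code. `fetch_once` is a hypothetical callable that returns newly found posts. It does not explain why zero posts come back, which could be a login wall or a page layout change.

```python
import time


def fetch_with_backoff(fetch_once, target, max_wait=512, sleep=time.sleep):
    """Collect up to `target` items, doubling the wait after each empty fetch.

    fetch_once is a hypothetical callable returning a list of new items.
    """
    items = []
    wait = 1
    while len(items) < target:
        new_items = fetch_once()
        if new_items:
            items.extend(new_items)
            wait = 1  # progress was made, so reset the delay
            continue
        if wait > max_wait:
            break  # the longest allowed wait was already used; give up
        sleep(wait)
        wait *= 2
    return items[:target]
```

Passing `sleep` as a parameter makes the loop easy to test without actually waiting.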

Get mentions, hashtags, likers, likes_plays using the hashtag mode?

When searching by hashtag, only three fields show up regardless of additional options: key, caption, and img_url. Is there a way to get more information, such as the fields available with posts_full?

location info

Is there a way to get the latitude and longitude information for images that have it?

I want to search by hashtag and also by location, and I would like to get the hashtags and the lat/long location as well.

To overcome the 1000 results limit, would it be possible to also restrict the search by date posted?

Thanks

How to go to the next post if one of the post can't load

Hi Huaying,

I have a problem when the internet connection is slow. When I use posts_full, the crawler fetches posts one by one, but if one of the posts fails to load, it raises an error because it cannot get the element (it gets a NoneType instead).

Is there any way to overcome this by moving on to the next post when None is returned? I already tried raising the wait time parameter to 120 in the findone method, but the result is the same: the post cannot be loaded and a NoneType object is returned in the end.

Thank you
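
One way to express the "skip and continue" behaviour asked for above is to guard each post individually. This is a sketch, not the crawler's own loop. `fetch_post` is a hypothetical callable that returns a dict for a post, or None when the post did not load.

```python
def collect_posts(keys, fetch_post, log=print):
    """Fetch each post in turn, skipping the ones that fail to load.

    fetch_post is a hypothetical callable: key -> dict, or None on failure.
    """
    posts, skipped = [], []
    for key in keys:
        try:
            post = fetch_post(key)
        except Exception as exc:  # e.g. a timeout or a missing page element
            log("Skipping %s: %r" % (key, exc))
            skipped.append(key)
            continue
        if post is None:
            log("Skipping %s: post did not load" % key)
            skipped.append(key)
            continue
        posts.append(post)
    return posts, skipped
```

Returning the skipped keys makes it possible to retry them later on a better connection.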

code doesn't work at all

Hello, I'm new to Python and crawling.
Please help me with this error:

F:\DEV\python\instagram crawler from github\instagram-crawler>python crawler.py posts -u cal_foodie -n 100 -o ./output
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
    stdin=PIPE)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 756, in __init__
    restore_signals, start_new_session)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1155, in _execute_child
    startupinfo)
PermissionError: [WinError 5] Access is denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "crawler.py", line 77, in <module>
    args.debug
  File "crawler.py", line 24, in get_posts_by_user
    ins_crawler = InsCrawler(has_screen=debug)
  File "F:\DEV\python\instagram crawler from github\instagram-crawler\inscrawler\crawler.py", line 54, in __init__
    self.browser = Browser(has_screen)
  File "F:\DEV\python\instagram crawler from github\instagram-crawler\inscrawler\browser.py", line 23, in __init__
    chrome_options=chrome_options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 68, in __init__
    self.service.start()
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 88, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home
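
A "wrong permissions" message like this one often means the configured driver path points at a directory, a missing file, or a file without the execute bit. The helper below is a hypothetical diagnostic and not part of the crawler or Selenium. On Windows the execute check is of limited use, because `os.access` reports most existing files as executable.

```python
import os


def diagnose_driver(path):
    """Return a short problem description, or None if the path looks runnable.

    Hypothetical helper; `path` is wherever chromedriver is expected to live.
    """
    if os.path.isdir(path):
        return "path is a directory, not the chromedriver binary"
    if not os.path.isfile(path):
        return "file not found"
    if not os.access(path, os.X_OK):
        return "file is not executable"
    return None
```

On macOS or Linux, a "file is not executable" result can usually be fixed with `chmod +x` on the driver binary.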

posts_full failing when getting: post_num, follower_num, following_num

By running the following command:
python crawler.py posts_full -u teslamotors -n 3

I get the following error:
  File "/Users/c.cardenasvelasquez/Documents/GitHub/instagram-crawler/inscrawler/crawler.py", line 102, in get_user_profile
    post_num, follower_num, following_num = statistics
ValueError: not enough values to unpack (expected 3, got 0)
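
The ValueError says `statistics` was empty when the code expected three values, which typically happens when the profile page did not expose the expected elements (for example after a layout change or a login redirect). A minimal defensive sketch, assuming `statistics` is a list; `parse_statistics` is a hypothetical helper name:

```python
def parse_statistics(statistics):
    """Return (post_num, follower_num, following_num), or Nones if values are missing."""
    if len(statistics) != 3:
        # The page did not yield the three expected counters; report "unknown".
        return None, None, None
    post_num, follower_num, following_num = statistics
    return post_num, follower_num, following_num
```

This avoids the crash but does not recover the numbers; the selector that produces `statistics` would still need updating.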

Can't fetch the content if the post contains video.

utf8 character problem

How can I search for a hashtag that contains non-ASCII characters, such as '감솨함솨'?
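
If the tag ends up inside an explore/tags URL, percent-encoding it as UTF-8 is the usual way to keep the URL valid. The sketch below assumes Python 3 and that URL pattern; `hashtag_url` is a hypothetical helper, not a function from the crawler.

```python
from urllib.parse import quote, unquote


def hashtag_url(tag):
    """Build an explore/tags URL with the tag percent-encoded as UTF-8."""
    return "https://www.instagram.com/explore/tags/%s/" % quote(tag.lstrip("#"))


url = hashtag_url(u"감솨함솨")
decoded = unquote(url)  # decoding gives back the original tag
```

On Python 2 the tag would also need to be decoded from the command line's byte string first, which is a common source of Unicode errors.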

Unicode encode error

AttributeError

Thank you for sharing great code.
Recently I have been struggling with an AttributeError, so I came here to ask a question.

Whenever I try to crawl more than 500 posts,
an AttributeError comes up.

The command I ran is:
$ python crawler.py hashtag -t paris -n 10000 -o ./output

and the error message is:

AttributeError: 'NoneType' object has no attribute 'get_attribute'

File "C:\Users\inhye\projects\crawler\ins-crawler\inscrawler\crawler.py", line 155, in _get_posts_full
cur_key = ele_a_datetime.get_attribute('href')

I have been trying to fix this problem for 3 weeks without success.
Is there some way to solve it?

Write tests

Caption comes as "verified"

I tried to crawl a verified profile and all the captions came back as "verified". Is there a way to get the actual caption instead?

Crawling posts by its location

Can I search for posts by their location?

Thanks in advance.

Cannot login

Whenever the script starts, it goes to login screen, then it says suspicious login, asking me to confirm. I confirmed on the app then the browser closes and throws me an error on the command line.

There's not enough time to confirm by text or email before the browser closes

"Traceback" & "Failed to fetch the post" Bug happened when fetch_comments

result.json.zip
(https://github.com/huaying/instagram-crawler/files/3681900/result.json.zip)
Thanks very much for your work; this is definitely the best Instagram crawler I could find on the internet. But when I tried the example, I found an interesting bug.

I have installed the toolkit with "chromedriver_mac64.zip" (chromedriver 77.0.3865.40)

Step1: % python3 crawler.py posts_full -u elenalinnn --fetch_comments
Step2:
we can see that 19/79 is ok, but the 20th one will show the following in red:

Failed to fetch the post: https://www.instagram.com/p/BxxzM7ACyaX/
Traceback (most recent call last):
File "/Users/xingchen/Documents/Python/spider-Ins/instagram-crawler-master/inscrawler/crawler.py", line 217, in _get_posts_full
fetch_comments(browser, dict_post)
File "/Users/xingchen/Documents/Python/spider-Ins/instagram-crawler-master/inscrawler/fetch.py", line 145, in fetch_comments
show_comment_btn.click()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable
(Session info: headless chrome=77.0.3865.90)

After the run finishes, the 20th item is:
{"key": "https://www.instagram.com/p/BrjMfkVFaSE/", "datetime": "2018-12-19T01:08:47.000Z"},
which is very different from the others.

This bug happens for items 20, 31, 48, 50, 55, 59, and 60.

I cannot find out what leads to this bug. Can you figure out this issue?

Thank you very much. Have a nice day.

Best Regards.

[pip install -r requirements.txt] error

I have no idea why this error occurred.
I visited the official site below and checked the newest version. It was 19.3b0.
https://pypi.org/project/black/

===================error message==============================
Collecting black==19.3b0 (from -r requirements.txt (line 5))
ERROR: Could not find a version that satisfies the requirement black==19.3b0 (from -r requirements.txt (line 5)) (from versions: none)
ERROR: No matching distribution found for black==19.3b0 (from -r requirements.txt (line 5))

Display unicode correctly

Getting all comments?

Hello,

Thanks for the amazing work! One step further for me would be to get all comments for each post. Would you know how to do this? It seems hard because it is not only a matter of scrolling down but of finding how to click the "+" button at the bottom of the comments...

Thanks for the help !

Vincent

Crawl different data based on the config

We should have a config file that specifies the data you want to crawl.
Heavy operations, such as getting all likers or all comments, can be enabled in the config.

"stale element reference: element is not attached to the page document"

Add element datetime

We need the datetime for analysis.

-fetch_comments

python crawler.py posts_full -u insidelombok -n 100 --fetch_comments

URL not fetched
Traceback (most recent call last):
File "/home/newbie/instagram-crawler/inscrawler/crawler.py", line 217, in _get_posts_full
fetch_comments(browser, dict_post)
File "/home/newbie/instagram-crawler/inscrawler/fetch.py", line 146, in fetch_comments
show_comment_btn.click()
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
self.error_handler.check_response(response)
File "/home/newbie/.local/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
ElementNotInteractableException: Message: element not interactable
(Session info: headless chrome=79.0.3945.88)

Liker bounces out of log in before the first like

Hi Huaying,

Thanks for the amazing crawler. It has been working well recently, but today when I tried to use the liker it logged in and then bounced out (returned to the login screen). This happens right after it opens an image and before it does the first like.

Thank you for your help again!

only get first image of post

Even when a post has more than one image, the crawler only gets the first one.

I got a log problem

Hi Huaying, I have a problem: the log file was not found.
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/instagram-crawler-1538195382.log'
Exception ignored in: <function Logging.__del__ at 0x02C62270>
Traceback (most recent call last):
  File "D:\ARIEF\CRAWLIG2\inscrawler\crawler.py", line 37, in __del__
    self.logger.close()
AttributeError: 'InsCrawler' object has no attribute 'logger'
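
The path in the error is a hard-coded `/tmp/...`, which normally does not exist on Windows (the traceback shows a `D:\` drive). A portable sketch using the standard library's temp directory lookup; `log_path` is a hypothetical helper name and the file name only mirrors the one in the error message:

```python
import os
import tempfile
import time


def log_path(prefix="instagram-crawler"):
    """Build a timestamped log path inside the platform's temp directory."""
    filename = "%s-%d.log" % (prefix, int(time.time()))
    return os.path.join(tempfile.gettempdir(), filename)


path = log_path()
with open(path, "w") as log_file:  # the directory exists on every platform
    log_file.write("started\n")
```

`tempfile.gettempdir()` resolves to something like `/tmp` on Linux and to the user's Temp folder on Windows, so the same code works on both.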

WebDriverException

How can I fix this problem?
Traceback (most recent call last):
  File "crawler.py", line 83, in <module>
    args.username, args.number, args.mode == "posts_full", args.debug
  File "crawler.py", line 27, in get_posts_by_user
    ins_crawler = InsCrawler(has_screen=debug)
  File "/root/instagram-crawler/inscrawler/crawler.py", line 67, in __init__
    self.browser = Browser(has_screen)
  File "/root/instagram-crawler/inscrawler/browser.py", line 26, in __init__
    chrome_options=chrome_options,
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 185, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/usr/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: invalid argument: unrecognized capability: chromeOptions

posts_full function doesn't seem to work

Error message I am getting:

Post_full stops working

Posts_full stops fetching data when one post has multiple pictures.

Data format

The data does not contain the captions of the posts.
Could you add a mode that scrapes a hashtag dataset? For example, if I scrape the "friends" hashtag, the dataset should contain the image name, the caption for that image, the hashtags on the same image, and the like count of the post, with the images stored in one folder.

question

In hashtag mode, can I fetch post contents, mentions, etc., as in posts_full mode?
For example: python crawler.py hashtag -t taiwan --fetch_comments --fetch_mentions

Post with video don't fetch content

Heya,

Dope crawler, works well. Only thing is that if there is a video post it doesn't grab the caption data.

cheers!

Instagram Like fetching problem

Failed to fetch the post: https://www.instagram.com/p/B1zBXiygROa/
Traceback (most recent call last):
  File "instagram-crawler/inscrawler/crawler.py", line 213, in _get_posts_full
    fetch_likers(browser, dict_post)
  File "instagram-crawler/inscrawler/fetch.py", line 93, in fetch_likers
    like_info_btn.click()
AttributeError: 'NoneType' object has no attribute 'click'
fetching: 100%|
[{"key": "https://www.instagram.com/p/B1zBXiygROa/", "datetime": "2019-08-30T17:54:17.000Z", "img_urls": ["https://scontent-mxp1-1.cdninstagram.com/vp/b4825b2b1f9fcadd1df00ef2629f22e0/5D6C7096/t51.2885-15/e35/69663551_128155128469835_2515122558944658279_n.jpg?_nc_ht=scontent-mxp1-1.cdninstagram.com"]}]

https://mixmag.net/read/instagram-remove-likes-trial-news

Crawling by # stops at 200

Hi! I want to use this crawler to crawl by # and retrieve all the information about a post (number of likes, number of comments...). The original version of the crawler does not allow that, so I modified the source code. I succeeded in retrieving all the information I need, but the crawler stops at 200 posts.
Do you have any insight into that?
Thank you!

Failed to fetch the post

An error occurs when crawling user dain __. 070528
Error post is
https://www.instagram.com/p/BuEe8f_DeA5/

The traceback is shown below

fetching:  29%|█████████████████████████████████████████████▋                                                                                                              | 12/41 [01:01<02:23,  4.94s/it]
Failed to fetch the post: https://www.instagram.com/p/BuEe8f_DeA5/

Traceback (most recent call last):
  File "/Users/lhy/projects/10wonders/app/inscrawler/crawler.py", line 286, in _get_posts_full
    self._fetch_post_with_key(cur_key, dict_post)
  File "/Users/lhy/projects/10wonders/app/inscrawler/crawler.py", line 183, in _fetch_post_with_key
    img_urls.add(ele_img.get_attribute('src'))
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 143, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/var/pyenv/versions/3.7.2/envs/10wonders/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: headless chrome=73.0.3683.103)
  (Driver info: chromedriver=2.37.544337 (8c0344a12e552148c185f7d5117db1f28d6c9e85),platform=Mac OS X 10.13.6 x86_64)
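
A stale element reference generally means the page re-rendered between locating an element and reading from it, so the usual remedy is to locate the element again and retry. The helper below is a generic sketch and not the crawler's code. With Selenium one would pass `StaleElementReferenceException` as `exceptions` and do both the lookup and the `get_attribute` call inside `action`.

```python
def retry(action, exceptions, attempts=3):
    """Call action() up to `attempts` times, retrying on the given exceptions.

    action must re-locate whatever it needs on every call, so each retry
    works with a fresh element instead of the stale one.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except exceptions as exc:
            last_error = exc
    raise last_error
```

If every attempt fails, the last exception is re-raised so the caller can still skip the post and log the URL.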

posts_full Not working

tried for multiple user profiles,

next_key = ele_a_datetime.get_attribute('href')
AttributeError: 'NoneType' object has no attribute 'get_attribute'
Exception KeyError: KeyError(<weakref at 0x10b479f18; to 'tqdm' at 0x10b443a90>,) in ignored

Possible to get post likes?

Is it possible to get information about who has liked a post? I want to access the usernames of everyone who has liked a post.

Output is all in one line on Windows 10

So I just ran python crawler.py hashtag -t DDay -o ./output on Windows 10, and the output was given all in one line and really hard to read. I had to use vim to search and replace and get the output to look like your example pictures.
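
If the output file is plain JSON, it can be re-indented with the standard library instead of an editor. The sketch below uses made-up post data for illustration. `ensure_ascii=False` also keeps non-ASCII captions readable rather than escaping them.

```python
import json

# Hypothetical sample data standing in for the crawler's output.
posts = [
    {"key": "https://www.instagram.com/p/example/", "caption": u"감솨함솨"},
]

# indent=2 puts each field on its own line; ensure_ascii=False keeps Unicode as-is.
pretty = json.dumps(posts, indent=2, ensure_ascii=False)
```

For an existing file, `python -m json.tool` from the standard library reformats it in the same way.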

Log in for private accounts

How do I enter in the username and password in secret.py? I am getting errors that there is no username and password after I directly edit secret.py. I am also running the cp inscrawler/secret.py.dist inscrawler/secret.py command after I edit.

cannot import name 'InsCrawler' from 'inscrawler'

Can you give some instructions on where the InsCrawler class is defined? In general, it's not clear how to get everything up and running to begin with. Thank you.

Couldn't automatically log in

Hi, I'm new to Python.
When I put my Instagram username 'bingdnboujee' and password in secret.py,
it didn't seem to work: the code recognized my username as 'bingd', which is my Windows system user name.
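
One possible explanation, assuming secret.py reads the credentials from environment variables named USERNAME and PASSWORD: Windows already defines USERNAME as the logged-in account name, so it would win over the intended value. The sketch below sidesteps the clash by using prefixed variable names. `INSTAGRAM_USERNAME` and `INSTAGRAM_PASSWORD` are hypothetical names, not ones the crawler defines.

```python
import os


def load_credentials(environ=os.environ):
    """Read credentials from prefixed variables that the OS does not define."""
    # Hypothetical variable names, chosen to avoid Windows' built-in USERNAME.
    username = environ.get("INSTAGRAM_USERNAME")
    password = environ.get("INSTAGRAM_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set INSTAGRAM_USERNAME and INSTAGRAM_PASSWORD first")
    return username, password
```

Hard-coding the two strings directly in secret.py is the simpler alternative if the file is kept out of version control.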

image orders

Thanks for this nice repo first!
Is it possible to recover the order of the images in a post with multiple images?
I want to know which image is the first photo (the cover of the post).
I tried to trace the code and the ['img_urls'] value to find the rule, but I failed.
Is there any way to solve this? Please let me know.
Thanks!
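
A traceback elsewhere in these issues shows `img_urls.add(...)`, which suggests the URLs are collected in a set, and a set does not preserve insertion order. Assuming the images are visited in carousel order, an order-preserving de-duplication would keep the cover image first. A minimal sketch for Python 3.7+; `ordered_unique` is a hypothetical helper:

```python
def ordered_unique(urls):
    """Drop duplicate URLs while keeping the order of first appearance."""
    # dict keys keep insertion order on Python 3.7+, unlike a set.
    return list(dict.fromkeys(urls))


# Hypothetical URLs, visited in carousel order with one repeat.
seen_in_order = ["cover.jpg", "second.jpg", "cover.jpg", "third.jpg"]
img_urls = ordered_unique(seen_in_order)
```

With this approach the first entry of `img_urls` is the first image encountered, which would be the cover if the carousel is walked from the start.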

Crawl only the specified number of posts

Even when the number is restricted with -n, the crawler still tries to crawl as many posts as possible.

Example:
python crawler.py posts -u cal_foodie -n 5 => crawls all posts and returns the first 5.
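
The behaviour requested here amounts to stopping the scroll as soon as enough posts have been seen. One way to sketch that, independent of the crawler's real code, is a lazy generator cut off with `itertools.islice`. `pages` is a hypothetical iterable in which each item is the list of posts loaded by one scroll.

```python
from itertools import islice


def iter_posts(pages):
    """Yield posts one at a time; a new page is only loaded when needed."""
    for page in pages:
        for post in page:
            yield post


def first_n(pages, n):
    """Return the first n posts without loading pages beyond what n requires."""
    return list(islice(iter_posts(pages), n))
```

Because the generator is lazy, asking for 5 posts with 3 posts per page loads only two pages.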

Error!!

I encountered the following errors. Did anyone face the same situation?
How did you guys solve it?

Traceback (most recent call last):
  File "crawler.py", line 83, in <module>
    args.username, args.number, args.mode == "posts_full", args.debug
  File "crawler.py", line 27, in get_posts_by_user
    ins_crawler = InsCrawler(has_screen=debug)
  File "C:\Users\Roshan Dhamala\Independent Class\Insta\instagram-crawler\inscrawler\crawler.py", line 68, in __init__
    self.browser = Browser(has_screen)
  File "C:\Users\Roshan Dhamala\Independent Class\Insta\instagram-crawler\inscrawler\browser.py", line 27, in __init__
    chrome_options=chrome_options,
  File "C:\Python3\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 68, in __init__
    self.service.start()
  File "C:\Python3\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
    stdin=PIPE)
  File "C:\Python3\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\Python3\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
OSError: [WinError 193] %1 is not a valid Win32 application

Set cronjob for Travis

It will help me find issues quickly whenever Instagram gets updated.

problem with posts_full

Thanks for your work. It works really well, but one issue bothers me from time to time.

selenium.common.exceptions.WebDriverException: Message: unknown error: Element <a class="HBoOv coreSpriteRightPaginationArrow" tabindex="0" href="/p/1920252025238890452/">...</a> is not clickable at point (763, 300). Other element would receive the click: <div class="QhbhU" role="button" tabindex="0"></div>
There is nothing else with the class name "HBoOv" on the page, but the crawler just stops working.

Small? Error in post_full

Trying to scrape: https://www.instagram.com/explore/tags/satoyamaexperience/

python crawler.py posts_full -t satoyamaexperience -n 1000 -o ./output

Error

Traceback (most recent call last):
  File "crawler.py", line 77, in <module>
    args.debug
  File "crawler.py", line 25, in get_posts_by_user
    return ins_crawler.get_user_posts(username, number, detail)
  File "/Users/philipperemy/PycharmProjects/instagram-crawler/inscrawler/crawler.py", line 100, in get_user_posts
    user_profile = self.get_user_profile(username)
  File "/Users/philipperemy/PycharmProjects/instagram-crawler/inscrawler/crawler.py", line 89, in get_user_profile
    post_num, follower_num, following_num = statistics
ValueError: not enough values to unpack (expected 3, got 0)

Thanks a million in advance :)

login doesn't work

Hi Huaying, thanks for your crawler!

I am trying to use liker.py and the code seems to work (briefly) but it diverts back to the login page since it isn't logged in to the user and then app crashes. I have already set my own password and username in the secret file but it still won't log in.

Thank you!

Here is the log file:

Traceback (most recent call last):
  File "liker.py", line 22, in <module>
    ins_crawler.auto_like(tag=args.hashtag, maximum=args.number)
  File "/Users/mimichindasook/Downloads/instagram-crawler-master/inscrawler/crawler.py", line 112, in auto_like
    ele_post.click()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
    return self._parent.execute(command, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Element <a href="/p/BoXEX4hhapD/?tagged=surf">...</a> is not clickable at point (132, 473). Other element would receive the click: <div class="pHxcJ">...</div>
  (Session info: chrome=69.0.3497.100)
  (Driver info: chromedriver=2.42.591059 (a3d9684d10d61aa0c45f6723b327283be1ebaad8),platform=Mac OS X 10.11.6 x86_64)

Is it possible to fetch the number of likes of each post as well?

As in the subject. Thanks!