
scrapedin-linkedin-crawler's People

Contributors

leonardiwagner, netgator, thomasproctor


scrapedin-linkedin-crawler's Issues

insert cookie and got another error

This time I filled in the cookie field and was able to start scraping, but after scraping only one profile it throws an error:

scrapedin: 2019-10-18T20:09:22.541Z info: [profile] finished scraping url: https://www.linkedin.com/in/linda-b32a4815a
(node:9720) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home//Downloads/scrapedin-linkedin-crawler/src/crawler.js:31:55)
at
at process._tickCallback (internal/process/next_tick.js:188:7)
(node:9720) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:9720) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
scrapedin: 2019-10-18T20:09:23.286Z info: [profile] finished scraping url: https://www.linkedin.com/in/tiffanie-walker-26512757
(node:9720) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home/Downloads/scrapedin-linkedin-crawler/src/crawler.js:31:55)
at
at process._tickCallback (internal/process/next_tick.js:188:7)
(node:9720) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)

So, any idea what causes this issue? Thanks.
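Until the ReferenceError in crawler.js is fixed, one hedged workaround is to guard each crawl call so the rejection is handled instead of surfacing as an UnhandledPromiseRejectionWarning. A sketch only; `crawl` below is a stand-in for the crawler's own function, not its real API:

```javascript
// Hedged workaround sketch: wrap each crawl call so a thrown ReferenceError
// (or any other rejection) is caught and logged instead of becoming an
// unhandled promise rejection. `crawl` is a placeholder, not the real API.
async function safeCrawl(crawl, url) {
  try {
    return await crawl(url)
  } catch (err) {
    console.error(`error on crawling profile: ${url}`, err.message)
    return null // skip this profile and let the queue continue
  }
}
```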

Linkedin posts crawler

Hi,

Thank you for this library.

It seems that it does not support LinkedIn posts. Please let us know: is it possible to crawl LinkedIn posts with this library, or by adding custom code to it?

Thanks.

Infinite crawling

If I need to crawl, say, 10 LinkedIn profile URLs, the crawler keeps executing even after successfully crawling all 10 URLs provided in rootProfiles. It keeps logging:

2020-05-15T19:53:36.399Z info: starting scraping: undefined show urls undefined
2020-05-15T19:53:36.401Z error: error on crawling profile: undefined TypeError: Cannot read property 'indexOf' of undefined
2020-05-15T19:53:37.400Z info: starting scraping: undefined show urls undefined
2020-05-15T19:53:37.401Z error: error on crawling profile: undefined TypeError: Cannot read property 'indexOf' of undefined
2020-05-15T19:53:38.400Z info: starting scraping: undefined show urls undefined

again and again, even after the profiles have been saved.

Is there any mechanism to stop the script automatically once the requested profiles have been matched and fetched?

This infinite crawling consumes far more resources than necessary.
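A possible stop mechanism can be sketched as follows; `maxProfiles` and `onProfileSaved` are illustrative names, not existing config options. The idea is to count finished profiles and resolve a promise once the target is reached, so the caller can close the browser and exit instead of polling forever:

```javascript
// Sketch of a manual stop condition (illustrative, not scrapedin's API):
// resolve a promise after N profiles have been saved.
function makeStopSignal(maxProfiles) {
  let saved = 0
  let resolveDone
  const done = new Promise((resolve) => { resolveDone = resolve })
  return {
    onProfileSaved() {
      saved += 1
      if (saved >= maxProfiles) resolveDone(saved) // target reached
    },
    done, // await this, then call browser.close() / process.exit(0)
  }
}
```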

Node is either not visible or not an HTMLElement

ScrapedIn Version: 1.0.16
Windows 10 Home Build: 18362.778
WSL: Ubuntu 18.04

Hopefully this isn't because I'm using @mnpenner's creative fix to get the puppeteer dependency to work (since I'm using WSL).

The browser successfully opens, logs into LinkedIn, navigates to the profile page, and scrolls down. That's when the error message is displayed.

Here's my code:

const scrapedin = require('scrapedin')
const fs = require('fs')
const rimraf = require('del');

const PATHS = {
    win32: {
        executablePath: "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
        userDataDir: 'C:\\Users\\<USER>\\AppData\\Local\\Temp\\puppeteer_user_data',
    },
    linux: {
        executablePath: "/mnt/c/Program Files (x86)/Google/Chrome/Application/chrome.exe",
        userDataDir: '/mnt/c/Users/<USER>/AppData/Local/Temp/puppeteer_user_data',
    },
}

const cookies = fs.readFileSync('linkedincookie.json')
const options = {
  cookies: JSON.parse(cookies),
  puppeteerArgs: {
    executablePath: PATHS[process.platform].executablePath,
    userDataDir: PATHS.win32.userDataDir,
    headless: false,
  }
}

scrapedin(options)
.then((profileScraper) => profileScraper('https://www.linkedin.com/in/raphaelpreston/'))
.then((profile) => console.log(profile))
.finally(async () => {
    rimraf(PATHS[process.platform].userDataDir, {force: true}).then(() => {
        console.log("Deleted")
    })
}).catch(err => {
    console.error(err);
    process.exit(1);
});

The full error message:

Error: Node is either not visible or not an HTMLElement
    at ElementHandle._clickablePoint (/mnt/c/Users/<USER>/<DIR>/node_modules/puppeteer/lib/JSHandle.js:196:13)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async ElementHandle.click (/mnt/c/Users/<USER>/<DIR>/node_modules/puppeteer/lib/JSHandle.js:248:20)
    at async scrapAccomplishmentPanel (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/scrapAccomplishmentPanel.js:10:5)
    at async module.exports (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/profile.js:50:19)
  -- ASYNC --
    at ElementHandle.<anonymous> (/mnt/c/Users/<USER>/<DIR>/node_modules/puppeteer/lib/helper.js:108:27)
    at scrapAccomplishmentPanel (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/scrapAccomplishmentPanel.js:10:25)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async module.exports (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/profile.js:50:19)

This is my first time using a package that depends on Puppeteer, so let me know if there's something else I need to do to get stuff working.

Minor profile save issues

1.) If the config file specifies a nested save directory, mkdirSync does not create the directory structure.
2.) Profile filenames are saved as .json.json (not really an issue, but it looks odd).

Pull request forthcoming...

update scrapedin to 1.0.21

Hi all,

I just tried to use this scraper, but whether headless or not, my bot ends up on an unavailable page. The root profile is set and the keywords worked before, but it ends with this error message:

scrapedin: 2020-09-29T08:36:10.991Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (23)
scrapedin: 2020-09-29T08:36:11.362Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (25)
scrapedin: 2020-09-29T08:36:11.363Z warn: [profile/scrollToPageBottom.js] page bottom not found
scrapedin: 2020-09-29T08:36:11.363Z info: [profile/profile.js] applying 1st delay
scrapedin: 2020-09-29T08:36:11.494Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (24)
scrapedin: 2020-09-29T08:36:11.996Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (25)
scrapedin: 2020-09-29T08:36:11.996Z warn: [profile/scrollToPageBottom.js] page bottom not found
scrapedin: 2020-09-29T08:36:11.996Z info: [profile/profile.js] applying 1st delay
scrapedin: 2020-09-29T08:36:13.863Z info: [profile/profile.js] clicking on see more buttons
scrapedin: 2020-09-29T08:36:14.176Z info: [profile/profile.js] applying 2nd delay
scrapedin: 2020-09-29T08:36:14.497Z info: [profile/profile.js] clicking on see more buttons
scrapedin: 2020-09-29T08:36:14.808Z info: [profile/profile.js] applying 2nd delay
scrapedin: 2020-09-29T08:36:16.751Z info: [profile/profile.js] finished scraping url: https://www.linkedin.com/in/tps:/www.linkedin.com/feed/hashtag/motivation/
scrapedin: 2020-09-29T08:36:16.752Z error: [profile/cleanProfileData.js] LinkedIn website changed and scrapedin 1.0.18 can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-09-29T08:36:16.752Z error: error on crawling profile: https://www.linkedin.com/feed/hashtag/motivation/
Error: LinkedIn website changed and scrapedin 1.0.18 can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-09-29T08:36:16.753Z info: starting scraping: https://www.linkedin.com/feed/hashtag/construction/
scrapedin: 2020-09-29T08:36:16.753Z info: [profile/profile.js] starting scraping url: https://www.linkedin.com/in/tps:/www.linkedin.com/feed/hashtag/construction/

I guess it's due to scrapedin-crawler being at "0.0.2" in the multi scraper. The other repo is at 1.0.18, but I can't figure out how to update it for the multi scraper.

Hope my explanation makes sense, thanks!

LinkedIn Company/People Python Scraper

I am trying to write Python code that searches companies on LinkedIn and, under People, enters my desired keywords and collects the links (or just the job title/name) of the individual employees.

So far I have used Selenium WebDriver to open LinkedIn and log in.

Open the company page and loop through keyword search.

from time import sleep
from random import randint
from bs4 import BeautifulSoup as bs

company = 'interfood-group'  # just testing with one company now, but I intend to loop through a list of companies
full_link = 'https://www.linkedin.com/company/' + company + '/people/'
browser.get(full_link)

# keywords list
job_keywords_short = ['chief%20financial%20officer', 'controller']

profile_to_search = []

# start for loop of keyword list
for keyword in job_keywords_short:
    keyword_search = full_link + '?keywords=' + keyword
    browser.get(keyword_search)
    html_search = browser.page_source
    parse_search = bs(html_search, 'html5lib')
    for link in parse_search('org-people-profiles-modules_profile-item'):
        if link.has_attr('org-people-profiles-modules_profile-item'):
            sleep(randint(2, 5))

# links of profiles that match keywords
print(profile_to_search)

It seems simple; however, I cannot work out what I should tell parse_search to look for in order to collect the links.

After this step, I plan to search each profile for the name, job title, date, location.

Thank you for helping; I am new to coding, so any guidance is very helpful.

Updates

I am new to Git, so I don't know the question-answering process here.
But I am having an issue saving a file:
I have to save all the resulting JSON to a single CSV.
Can you help me with that?
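One way to flatten the saved JSON profiles into a single CSV can be sketched as below. The field names are only examples; use whichever keys your saved profiles actually contain:

```javascript
// Flatten an array of profile objects into CSV text.
// Quotes are doubled per the CSV convention; missing fields become "".
function profilesToCsv(profiles, fields) {
  const escape = (v) => `"${String(v ?? '').replace(/"/g, '""')}"`
  const header = fields.map(escape).join(',')
  const rows = profiles.map((p) => fields.map((f) => escape(p[f])).join(','))
  return [header, ...rows].join('\n')
}
```

Reading the saved files back could then be `fs.readdirSync(dir)` plus `JSON.parse` per file, and `fs.writeFileSync('profiles.csv', csv)` to write the result.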

LinkedIn website changed and scrapedin can't read basic data

Hey there,

Giving this project a go, and I have a problem to report:

scrapedin: 2019-10-22T17:06:45.854Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2019-10-22T17:06:45.854Z error: error on crawling profile: https://www.linkedin.com/in/XXXXXXXX/
Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
(node:20226) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home/rui/TEMP/scrapedin-linkedin-crawler/src/crawler.js:31:55)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:20226) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)

TimeoutError: waiting for selector "#login-email" failed: timeout 30000ms exceeded

On Mac, I followed all the steps; Chromium opens, but I get this error after a while:

(node:3042) UnhandledPromiseRejectionWarning: TimeoutError: waiting for selector "#login-email" failed: timeout 30000ms exceeded
    at new WaitTask (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/DOMWorld.js:561:28)
    at DOMWorld._waitForSelectorOrXPath (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/DOMWorld.js:490:22)
    at DOMWorld.waitForSelector (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/DOMWorld.js:444:17)
    at Frame.waitForSelector (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/FrameManager.js:628:47)
    at Frame.<anonymous> (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/helper.js:111:23)
    at Frame.waitFor (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/FrameManager.js:613:19)
    at Page.waitFor (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/Page.js:1037:29)
    at module.exports (/Users/myname/Downloads/linkedinCrawler/node_modules/scrapedin/src/login.js:10:14)
    at processTicksAndRejections (internal/process/next_tick.js:81:5)
(node:3042) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:3042) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

By the way, the opened LinkedIn page is not resized and the login button is hidden on the far right side; I'm testing on a 13" MacBook Air, so it's not visible.
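If the login button is being pushed off-screen, forcing a larger window may help. A sketch, under the assumption that scrapedin forwards puppeteerArgs straight to puppeteer.launch (as other reports in this thread use it); `defaultViewport` and `--window-size` are standard Puppeteer launch options:

```javascript
// Assumption: scrapedin passes puppeteerArgs through to puppeteer.launch.
const options = {
  puppeteerArgs: {
    headless: false,
    defaultViewport: { width: 1366, height: 768 }, // size of the page viewport
    args: ['--window-size=1366,768'],              // size of the browser window itself
  },
}
```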

UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined

Hi,
I've set up the crawler, but it stops after crawling 2 profiles with the following error:

(node:21269) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home/rossetti/PycharmProjects/LinkedinScraping/scrapedin-linkedin-crawler-master/src/crawler.js:31:55)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
(node:21269) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)

Do you have any insight into what could cause this issue?

Filter profiles to save

Hello,

relatedProfilesKeywords allows us to filter which profiles to crawl, but we save all crawled profiles. It would be very useful to also filter which profiles to save. What do you think, @leonardiwagner?

Scraping only given profiles

I am curious whether I can scrape only a given set of profiles with this.

rootProfiles in config.json contains the URLs from which to start scraping, not the actual profile URLs. Is there a provision to do that?

TypeError: Cannot read property 'replace' of undefined

2020-04-29T11:13:53.234Z info: there is no profiles to crawl right now...
scrapedin: 2020-04-29T11:13:53.995Z info: [profile] clicking on see more buttons
2020-04-29T11:13:54.235Z info: there is no profiles to crawl right now...
scrapedin: 2020-04-29T11:13:54.326Z info: [profile] applying 2nd delay
2020-04-29T11:13:55.236Z info: there is no profiles to crawl right now...
2020-04-29T11:13:56.238Z info: there is no profiles to crawl right now...
scrapedin: 2020-04-29T11:13:57.036Z info: [profile] finished scraping url: https://www.linkedin.com/in/place
2020-04-29T11:13:57.037Z error: error on crawling profile: https://www.linkedin.com/in/place/
TypeError: Cannot read property 'replace' of undefined

We should add package-lock.json to the repo

Hi, thank you for sharing :)

I get the following error when I start the crawler:

npm start

[email protected] start /home/anas/Tools/scrapedin-linkedin-crawler
node src/index

scrapedin: 2019-10-17T13:52:01.937Z info: [scrapedin] initializing
scrapedin: 2019-10-17T13:52:03.192Z info: [scrapedin] email and password was provided, we're going to login...
(node:6249) UnhandledPromiseRejectionWarning: Error: Protocol error (Network.deleteCookies): Invalid parameters name: string value expected
    at Promise (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:183:56)
    at new Promise (<anonymous>)
    at CDPSession.send (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:182:12)
    at Page.deleteCookie (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:395:26)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at Page.setCookie (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:413:16)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at module.exports (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:16:20)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:6249) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6249) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:6249) UnhandledPromiseRejectionWarning: Error: Protocol error (Page.navigate): Invalid parameters url: string value expected
    at Promise (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:183:56)
    at new Promise (<anonymous>)
    at CDPSession.send (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:182:12)
    at navigate (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:118:39)
    at FrameManager.navigateFrame (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:95:7)
    at Frame.goto (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:406:37)
    at Frame.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at Page.goto (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at module.exports (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:24:14)
  -- ASYNC --
    at Frame.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:111:15)
    at Page.goto (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at module.exports (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:24:14)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:6249) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)

Could you please provide the package-lock.json so we can be sure to get the same dependencies?

Thanks

Retry the request if failed

Hello,

After several successful crawling requests, I got the following error:

scrapedin: 2019-10-18T09:03:38.320Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2019-10-18T09:03:38.321Z error: error on crawling profile: https://www.linkedin.com/in/xxxxxxx/ 
 Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
(node:29800) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'length' of undefined
    at crawl (/home/anas/workspace/scrapedin-linkedin-crawler/src/crawler.js:26:71)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:29800) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:29800) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

We should add a configurable retry feature to avoid this problem.
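A minimal sketch of such a retry wrapper (illustrative, not scrapedin's API): re-run the profile request a few times with a delay before giving up.

```javascript
// Retry `fn` up to `retries` times, waiting `delayMs` between attempts;
// rethrow the last error if every attempt fails.
async function withRetries(fn, retries = 3, delayMs = 1000) {
  let lastErr
  for (let attempt = 1; attempt <= retries; attempt += 1) {
    try {
      return await fn()
    } catch (err) {
      lastErr = err
      if (attempt < retries) await new Promise((r) => setTimeout(r, delayMs))
    }
  }
  throw lastErr // all attempts failed
}
```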

LinkedIn website changed and cannot read basic data

finished scraping url: https://www.linkedin.com/in/inmudassar-iqbal-a9a9159b/
scrapedin: 2020-01-22T07:42:56.489Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:56.490Z error: error on crawling profile: https://linkedin/in/mudassar-iqbal-a9a9159b/
Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:56.830Z info: starting scraping: https://linkedin/in/nadeem-aslam-057341102/
scrapedin: 2020-01-22T07:42:56.830Z info: [profile] starting scraping url: https://www.linkedin.com/in/innadeem-aslam-057341102/
scrapedin: 2020-01-22T07:42:58.070Z info: [profile] finished scraping url: https://www.linkedin.com/in/inamjad-khan-a03634b7/
scrapedin: 2020-01-22T07:42:58.070Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:58.070Z error: error on crawling profile: https://linkedin/in/amjad-khan-a03634b7/
Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:58.832Z info: starting scraping: https://linkedin/in/baraa-faisal-0529a5a3/
scrapedin: 2020-01-22T07:42:58.833Z info: [profile] starting scraping url: https://www.linkedin.com/in/inbaraa-faisal-0529a5a3/

Crawler doesn't work

Hello,

The crawler doesn't work; it only scans the root profile
(even if we initialize a keyword list).

2019-06-27T13:01:12.099Z info: crawler finished: there are no more profiles found with specified keywords

TypeError: Cannot read property 'replace' of undefined

Hi, I'm scraping some URLs, and for some of them I'm receiving these warnings and error messages:

warn: [profile] profile selector was not found

warn: [scrollToPageBottom] page bottom not found

error: error on crawling profile: https://www.linkedin.com/in/IDENTIFIER/
 TypeError: Cannot read property 'replace' of undefined

I've already set isHeadless to false in config.json and updated scrapedin to the latest version (1.0.16).

Another question: does the program finish after crawling all the profiles? I keep getting the message:
info: there is no profiles to crawl right now...
but it doesn't terminate the job.

Error: net::ERR_INTERNET_DISCONNECTED at https://www.linkedin.com/login

I got this error under CentOS 7. This crawler used to work, but it throws the following error since I changed the place I work from. Does anyone have any idea?

(node:6378) UnhandledPromiseRejectionWarning: Error: net::ERR_INTERNET_DISCONNECTED at https://www.linkedin.com/login
at navigate (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:120:37)
at process._tickCallback (internal/process/next_tick.js:68:7)
-- ASYNC --
at Frame. (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:111:15)
at Page.goto (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
at Page. (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
at module.exports (/home/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:24:14)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:6378) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6378) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Save to excel

Hey, can we save all the results to Excel?
I am not a programmer, so can you provide the code? Please help me if you don't mind.

Thanks

Scraper doesn't close browser

The scraper doesn't close the browser on exception or on successful scraping. If there are 100 profiles to scrape, it launches 100 browser instances, which creates memory issues.
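A sketch of the fix: wrap the scrape in try/finally so the browser closes on success and on exception. `scrapeAndClose` and its arguments are illustrative, not the scraper's real API:

```javascript
// Ensure the browser is closed whether scraping succeeds or throws.
async function scrapeAndClose(browser, scrapeProfile, url) {
  try {
    return await scrapeProfile(url)
  } finally {
    await browser.close() // runs on success *and* on exception
  }
}
```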

Skip Profiles already downloaded

Is there an option to skip profiles that are already downloaded in "saveDirectory"?
I'm asking because whenever it hits an error and I have to run it again with the same "rootProfiles", it starts downloading everything from scratch. Why can't it check the filenames already present in "saveDirectory" and proceed accordingly?

Error Codes for NPM Start - Internal Modules Loader Error

~/crawler$ npm start
(node:16966) [DEP0139] DeprecationWarning: Calling process.umask() with no arguments is prone to race conditions and is a potential security vulnerability.
(Use node --trace-deprecation ... to show where the warning was created)

[email protected] start /home/homespace/crawler
node src/index

internal/modules/cjs/loader.js:1288
throw err;
^

SyntaxError: /home/infiniteiotnode1/crawler/config.json: Unexpected token ] in JSON at position 408
at parse ()
at Object.Module._extensions..json (internal/modules/cjs/loader.js:1285:22)
at Module.load (internal/modules/cjs/loader.js:1099:32)
at Function.Module._load (internal/modules/cjs/loader.js:964:14)
at Module.require (internal/modules/cjs/loader.js:1139:19)
at require (internal/modules/cjs/helpers.js:75:18)
at Object. (/home/infiniteiotnode1/crawler/src/index.js:2:20)
at Module._compile (internal/modules/cjs/loader.js:1250:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1271:10)
at Module.load (internal/modules/cjs/loader.js:1099:32)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: node src/index
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

Any assistance is greatly appreciated.
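"Unexpected token ] in JSON at position 408" points at config.json itself, not the crawler: it is almost always a trailing comma before a closing bracket, which strict JSON forbids. A minimal demonstration (the `rootProfiles` key is from the crawler's config; the URL is a placeholder):

```javascript
// Strict JSON rejects a comma before the closing bracket.
const bad = '{ "rootProfiles": [ "https://www.linkedin.com/in/someone/", ] }'
let parseFailed = false
try {
  JSON.parse(bad)
} catch (err) {
  parseFailed = true // same SyntaxError class as in the npm start log
}
const good = '{ "rootProfiles": [ "https://www.linkedin.com/in/someone/" ] }'
const config = JSON.parse(good) // parses once the trailing comma is removed
```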

Crawl LinkedIn skills

Hi! I'm looking for a way to extract the list of skills used by LinkedIn. This used to be fairly easy until about two years ago, but not anymore, since the site structure changed. Any ideas on how to extract skill data?

Irregular information loss when "maxConcurrentCrawlers" is greater than 1

Hi there, this project is awesome and I have successfully used it to scrape some profiles. Thanks for sharing!
However, a small issue arises when I set 'maxConcurrentCrawlers' greater than 1: some records under 'education' and 'position' are likely to be lost. Is this related to my internet speed or platform (OSX)? Thanks!

avoid related profile loop

Hi, and thank you for your amazing tool!

Sometimes the related profiles contain a previously scraped profile. How can we avoid this? Otherwise it may loop indefinitely. :)
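One way to avoid the loop can be sketched as follows (an illustrative helper, not an existing option): remember every URL already queued in a Set and skip related profiles seen before.

```javascript
// De-duplicate crawl targets: return false for any URL already visited.
function makeVisitedFilter() {
  const visited = new Set()
  return function shouldCrawl(url) {
    const key = url.replace(/\/+$/, '') // treat ".../foo" and ".../foo/" the same
    if (visited.has(key)) return false  // already scraped: don't queue again
    visited.add(key)
    return true
  }
}
```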

Scrape company employees

Is it possible to retrieve users (and then scrape their profiles) from a company URL?
Given a company URL: https://www.linkedin.com/company/mycompany
we can get a list of employees by scraping: https://www.linkedin.com/company/mycompany/people/
Is there any built-in functionality to achieve this, or should I implement it myself?

(node:11136) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!

$ npm start

[email protected] start /Users/st/Desktop/panya/scrapedin-linkedin-crawler
node src/index

(node:11136) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!
dlopen /Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/.local-chromium/mac-641577/chrome-mac/Chromium.app/Contents/MacOS/../Versions/75.0.3738.0/Chromium Framework.framework/Chromium Framework: dlopen(/Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/.local-chromium/mac-641577/chrome-mac/Chromium.app/Contents/MacOS/../Versions/75.0.3738.0/Chromium Framework.framework/Chromium Framework, 261): Library not loaded: /System/Library/Frameworks/MediaPlayer.framework/Versions/A/MediaPlayer
Referenced from: /Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/.local-chromium/mac-641577/chrome-mac/Chromium.app/Contents/Versions/75.0.3738.0/Chromium Framework.framework/Versions/A/Chromium Framework
Reason: image not found

TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md

at onClose (/Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Launcher.js:342:14)
at Interface.helper.addEventListener (/Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Launcher.js:331:50)
at emitNone (events.js:111:20)
at Interface.emit (events.js:208:7)
at Interface.close (readline.js:368:8)
at Socket.onend (readline.js:147:10)
at emitNone (events.js:111:20)
at Socket.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:138:11)

(node:11136) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:11136) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

TypeError: Cannot read property 'replace' of undefined

Hi, I'm scraping some URLs, and for most of them (almost all) I'm receiving these warnings and error messages:

warn: [profile] profile selector was not found

warn: [scrollToPageBottom] page bottom not found

warn: [seeMoreButtons] couldn't click on button.pv-profile-section__see-more-inline, it's probably invisible

error: error on crawling profile: https://www.linkedin.com/in/IDENTIFIER/
 TypeError: Cannot read property 'replace' of undefined

I've already set isHeadless to false in config.json and updated scrapedin to the latest version (1.0.14).

Happy new year, and I got this UnhandledPromiseRejection error

we're going to login...
(node:5154) UnhandledPromiseRejectionWarning: Error: net::ERR_INTERNET_DISCONNECTED at https://www.linkedin.com/login
at navigate (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:120:37)
at process._tickCallback (internal/process/next_tick.js:68:7)
-- ASYNC --
at Frame. (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:111:15)
at Page.goto (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
at Page. (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
at module.exports (/home/tony/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:26:14)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:5154) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:5154) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.


Not sure if it is a Node problem or something else.

Thanks!
