
scrapedin-linkedin-crawler's People

Contributors

leonardiwagner, netgator, thomasproctor


scrapedin-linkedin-crawler's Issues

insert cookie and got another error

This time I filled in the cookie field and was able to start scraping, but after scraping only one profile it throws an error:

scrapedin: 2019-10-18T20:09:22.541Z info: [profile] finished scraping url: https://www.linkedin.com/in/linda-b32a4815a
(node:9720) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home//Downloads/scrapedin-linkedin-crawler/src/crawler.js:31:55)
at
at process._tickCallback (internal/process/next_tick.js:188:7)
(node:9720) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:9720) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
scrapedin: 2019-10-18T20:09:23.286Z info: [profile] finished scraping url: https://www.linkedin.com/in/tiffanie-walker-26512757
(node:9720) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home/Downloads/scrapedin-linkedin-crawler/src/crawler.js:31:55)
at
at process._tickCallback (internal/process/next_tick.js:188:7)
(node:9720) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)

So, any idea what causes this issue? Thanks.
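Until the ReferenceError in crawler.js is fixed, one hedged workaround is to guard each crawl call so the rejection is handled instead of surfacing as an UnhandledPromiseRejectionWarning. A sketch only; `crawl` below is a stand-in for the crawler's own function, not its real API:

```javascript
// Hedged workaround sketch: wrap each crawl call so a thrown ReferenceError
// (or any other rejection) is caught and logged instead of becoming an
// unhandled promise rejection. `crawl` is a placeholder, not the real API.
async function safeCrawl(crawl, url) {
  try {
    return await crawl(url)
  } catch (err) {
    console.error(`error on crawling profile: ${url}`, err.message)
    return null // skip this profile and let the queue continue
  }
}
```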

Linkedin posts crawler

Hi,

Thank you for this library.

It seems that it does not support LinkedIn posts. Please let us know: is it possible to crawl LinkedIn posts with this library, or by adding custom code to it?

Thanks.

Infinite crawling

If I need to crawl, say, 10 LinkedIn profile URLs, the crawler keeps executing even after successfully crawling all 10 URLs provided in rootProfiles. It keeps logging:

2020-05-15T19:53:36.399Z info: starting scraping: undefined show urls undefined
2020-05-15T19:53:36.401Z error: error on crawling profile: undefined TypeError: Cannot read property 'indexOf' of undefined
2020-05-15T19:53:37.400Z info: starting scraping: undefined show urls undefined
2020-05-15T19:53:37.401Z error: error on crawling profile: undefined TypeError: Cannot read property 'indexOf' of undefined
2020-05-15T19:53:38.400Z info: starting scraping: undefined show urls undefined

again and again, even after the profiles have been saved.

Is there any mechanism to stop the script automatically once the requested profiles have been matched and fetched?

This infinite crawling consumes far more resources than necessary.
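A possible stop mechanism can be sketched as follows; `maxProfiles` and `onProfileSaved` are illustrative names, not existing config options. The idea is to count finished profiles and resolve a promise once the target is reached, so the caller can close the browser and exit instead of polling forever:

```javascript
// Sketch of a manual stop condition (illustrative, not scrapedin's API):
// resolve a promise after N profiles have been saved.
function makeStopSignal(maxProfiles) {
  let saved = 0
  let resolveDone
  const done = new Promise((resolve) => { resolveDone = resolve })
  return {
    onProfileSaved() {
      saved += 1
      if (saved >= maxProfiles) resolveDone(saved) // target reached
    },
    done, // await this, then call browser.close() / process.exit(0)
  }
}
```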

Node is either not visible or not an HTMLElement

ScrapedIn Version: 1.0.16
Windows 10 Home Build: 18362.778
WSL: Ubuntu 18.04

Hopefully this isn't because I'm using @mnpenner's creative fix to get the puppeteer dependency to work (since I'm using WSL).

The browser successfully opens, logs into LinkedIn, navigates to the profile page, and scrolls down. That's when the error message is displayed.

Here's my code:

const scrapedin = require('scrapedin')
const fs = require('fs')
const rimraf = require('del');

const PATHS = {
    win32: {
        executablePath: "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
        userDataDir: 'C:\\Users\\<USER>\\AppData\\Local\\Temp\\puppeteer_user_data',
    },
    linux: {
        executablePath: "/mnt/c/Program Files (x86)/Google/Chrome/Application/chrome.exe",
        userDataDir: '/mnt/c/Users/<USER>/AppData/Local/Temp/puppeteer_user_data',
    },
}

const cookies = fs.readFileSync('linkedincookie.json')
const options = {
  cookies: JSON.parse(cookies),
  puppeteerArgs: {
    executablePath: PATHS[process.platform].executablePath,
    userDataDir: PATHS.win32.userDataDir,
    headless: false,
  }
}

scrapedin(options)
.then((profileScraper) => profileScraper('https://www.linkedin.com/in/raphaelpreston/'))
.then((profile) => console.log(profile))
.finally(async () => {
    rimraf(PATHS[process.platform].userDataDir, {force: true}).then(() => {
        console.log("Deleted")
    })
}).catch(err => {
    console.error(err);
    process.exit(1);
});

The full error message:

Error: Node is either not visible or not an HTMLElement
    at ElementHandle._clickablePoint (/mnt/c/Users/<USER>/<DIR>/node_modules/puppeteer/lib/JSHandle.js:196:13)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async ElementHandle.click (/mnt/c/Users/<USER>/<DIR>/node_modules/puppeteer/lib/JSHandle.js:248:20)
    at async scrapAccomplishmentPanel (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/scrapAccomplishmentPanel.js:10:5)
    at async module.exports (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/profile.js:50:19)
  -- ASYNC --
    at ElementHandle.<anonymous> (/mnt/c/Users/<USER>/<DIR>/node_modules/puppeteer/lib/helper.js:108:27)
    at scrapAccomplishmentPanel (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/scrapAccomplishmentPanel.js:10:25)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async module.exports (/mnt/c/Users/<USER>/<DIR>/node_modules/scrapedin/src/profile/profile.js:50:19)

This is my first time using a package that depends on Puppeteer, so let me know if there's something else I need to do to get stuff working.

Minor profile save issues

1.) If the config file specifies a nested save directory, mkdirSync does not create the directory structure.
2.) Profile filenames are saved as .json.json (not really an issue, but it looks odd).

Pull request forthcoming...

update scrapedin to 1.0.21

Hi all,

I just tried to use this scraper, but whether headless or not, my bot ends up on an unavailable page. The root profile is set and the keywords worked before, but it ends with this error message:

scrapedin: 2020-09-29T08:36:10.991Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (23)
scrapedin: 2020-09-29T08:36:11.362Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (25)
scrapedin: 2020-09-29T08:36:11.363Z warn: [profile/scrollToPageBottom.js] page bottom not found
scrapedin: 2020-09-29T08:36:11.363Z info: [profile/profile.js] applying 1st delay
scrapedin: 2020-09-29T08:36:11.494Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (24)
scrapedin: 2020-09-29T08:36:11.996Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (25)
scrapedin: 2020-09-29T08:36:11.996Z warn: [profile/scrollToPageBottom.js] page bottom not found
scrapedin: 2020-09-29T08:36:11.996Z info: [profile/profile.js] applying 1st delay
scrapedin: 2020-09-29T08:36:13.863Z info: [profile/profile.js] clicking on see more buttons
scrapedin: 2020-09-29T08:36:14.176Z info: [profile/profile.js] applying 2nd delay
scrapedin: 2020-09-29T08:36:14.497Z info: [profile/profile.js] clicking on see more buttons
scrapedin: 2020-09-29T08:36:14.808Z info: [profile/profile.js] applying 2nd delay
scrapedin: 2020-09-29T08:36:16.751Z info: [profile/profile.js] finished scraping url: https://www.linkedin.com/in/tps:/www.linkedin.com/feed/hashtag/motivation/
scrapedin: 2020-09-29T08:36:16.752Z error: [profile/cleanProfileData.js] LinkedIn website changed and scrapedin 1.0.18 can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-09-29T08:36:16.752Z error: error on crawling profile: https://www.linkedin.com/feed/hashtag/motivation/
Error: LinkedIn website changed and scrapedin 1.0.18 can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-09-29T08:36:16.753Z info: starting scraping: https://www.linkedin.com/feed/hashtag/construction/
scrapedin: 2020-09-29T08:36:16.753Z info: [profile/profile.js] starting scraping url: https://www.linkedin.com/in/tps:/www.linkedin.com/feed/hashtag/construction/

I guess it's due to scrapedin-crawler being at "0.0.2" in the multi scraper. The other repo is at 1.0.18, but I can't figure out how to update it for the multi scraper.

Hope my explanation makes sense, thanks!

LinkedIn Company/People Python Scraper

I am trying to write Python code that searches companies on LinkedIn and, under People, enters my desired keywords and collects the links (or just the job title/name) of the individual employees.

So far I have used Selenium WebDriver to open LinkedIn and log in.

Open the company page and loop through keyword search.

from time import sleep
from random import randint
from bs4 import BeautifulSoup as bs

company = 'interfood-group'  # just testing with one company now, but I intend to loop through a list of companies
full_link = 'https://www.linkedin.com/company/' + company + '/people/'
browser.get(full_link)

# keywords list
job_keywords_short = ['chief%20financial%20officer', 'controller']

profile_to_search = []

# start for loop of keyword list
for keyword in job_keywords_short:
    keyword_search = full_link + '?keywords=' + keyword
    browser.get(keyword_search)
    html_search = browser.page_source
    parse_search = bs(html_search, 'html5lib')
    for link in parse_search('org-people-profiles-modules_profile-item'):
        if link.has_attr('org-people-profiles-modules_profile-item'):
            sleep(randint(2, 5))

# links of profiles that match keywords
print(profile_to_search)

It seems simple; however, I cannot work out what I should tell parse_search to look for in order to collect the links.

After this step, I plan to search each profile for the name, job title, date, location.

Thank you for helping; I am new to coding, so any guidance is very helpful.

Updates

I am new to Git, so I don't know the question-answering process here.
But I am having an issue saving a file:
I have to save all the resulting JSON to a single CSV.
Can you help me with that?
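One way to flatten the saved JSON profiles into a single CSV can be sketched as below. The field names are only examples; use whichever keys your saved profiles actually contain:

```javascript
// Flatten an array of profile objects into CSV text.
// Quotes are doubled per the CSV convention; missing fields become "".
function profilesToCsv(profiles, fields) {
  const escape = (v) => `"${String(v ?? '').replace(/"/g, '""')}"`
  const header = fields.map(escape).join(',')
  const rows = profiles.map((p) => fields.map((f) => escape(p[f])).join(','))
  return [header, ...rows].join('\n')
}
```

Reading the saved files back could then be `fs.readdirSync(dir)` plus `JSON.parse` per file, and `fs.writeFileSync('profiles.csv', csv)` to write the result.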

LinkedIn website changed and scrapedin can't read basic data

Hey there,

Giving this project a go, and I have a problem to report:

scrapedin: 2019-10-22T17:06:45.854Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2019-10-22T17:06:45.854Z error: error on crawling profile: https://www.linkedin.com/in/XXXXXXXX/
Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
(node:20226) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home/rui/TEMP/scrapedin-linkedin-crawler/src/crawler.js:31:55)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:20226) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)

TimeoutError: waiting for selector "#login-email" failed: timeout 30000ms exceeded

On Mac, I followed all the steps; Chromium opens, but I get this error after a while:

(node:3042) UnhandledPromiseRejectionWarning: TimeoutError: waiting for selector "#login-email" failed: timeout 30000ms exceeded
    at new WaitTask (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/DOMWorld.js:561:28)
    at DOMWorld._waitForSelectorOrXPath (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/DOMWorld.js:490:22)
    at DOMWorld.waitForSelector (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/DOMWorld.js:444:17)
    at Frame.waitForSelector (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/FrameManager.js:628:47)
    at Frame.<anonymous> (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/helper.js:111:23)
    at Frame.waitFor (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/FrameManager.js:613:19)
    at Page.waitFor (/Users/myname/Downloads/linkedinCrawler/node_modules/puppeteer/lib/Page.js:1037:29)
    at module.exports (/Users/myname/Downloads/linkedinCrawler/node_modules/scrapedin/src/login.js:10:14)
    at processTicksAndRejections (internal/process/next_tick.js:81:5)
(node:3042) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:3042) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

By the way, the opened LinkedIn page is not resized and the login button is hidden on the far right side; I'm testing on a 13" MacBook Air, so it's not visible.
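If the login button is being pushed off-screen, forcing a larger window may help. A sketch, under the assumption that scrapedin forwards puppeteerArgs straight to puppeteer.launch (as other reports in this thread use it); `defaultViewport` and `--window-size` are standard Puppeteer launch options:

```javascript
// Assumption: scrapedin passes puppeteerArgs through to puppeteer.launch.
const options = {
  puppeteerArgs: {
    headless: false,
    defaultViewport: { width: 1366, height: 768 }, // size of the page viewport
    args: ['--window-size=1366,768'],              // size of the browser window itself
  },
}
```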

UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined

Hi,
I've set up the crawler, but it stops after crawling 2 profiles with the following error:

(node:21269) UnhandledPromiseRejectionWarning: ReferenceError: relatedProfiles is not defined
at crawl (/home/rossetti/PycharmProjects/LinkedinScraping/scrapedin-linkedin-crawler-master/src/crawler.js:31:55)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
(node:21269) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)

Do you have any insight into what could cause this issue?

Filter profiles to save

Hello,

relatedProfilesKeywords allows us to filter which profiles to crawl, but we save all crawled profiles. It would be very useful to also filter which profiles to save. What do you think, @leonardiwagner?

Scraping only given profiles

I am curious whether I can scrape only a given set of profiles with this.

rootProfiles in config.json contains the URLs from which to start scraping, not the actual profile URLs. Is there a provision to do that?

TypeError: Cannot read property 'replace' of undefined

2020-04-29T11:13:53.234Z info: there is no profiles to crawl right now...
scrapedin: 2020-04-29T11:13:53.995Z info: [profile] clicking on see more buttons
2020-04-29T11:13:54.235Z info: there is no profiles to crawl right now...
scrapedin: 2020-04-29T11:13:54.326Z info: [profile] applying 2nd delay
2020-04-29T11:13:55.236Z info: there is no profiles to crawl right now...
2020-04-29T11:13:56.238Z info: there is no profiles to crawl right now...
scrapedin: 2020-04-29T11:13:57.036Z info: [profile] finished scraping url: https://www.linkedin.com/in/place
2020-04-29T11:13:57.037Z error: error on crawling profile: https://www.linkedin.com/in/place/
TypeError: Cannot read property 'replace' of undefined

We should add package-lock.json to the repo

Hi, thank you for sharing :)

I get the following error when I start the crawler:

npm start

[email protected] start /home/anas/Tools/scrapedin-linkedin-crawler
node src/index

scrapedin: 2019-10-17T13:52:01.937Z info: [scrapedin] initializing
scrapedin: 2019-10-17T13:52:03.192Z info: [scrapedin] email and password was provided, we're going to login...
(node:6249) UnhandledPromiseRejectionWarning: Error: Protocol error (Network.deleteCookies): Invalid parameters name: string value expected
    at Promise (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:183:56)
    at new Promise (<anonymous>)
    at CDPSession.send (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:182:12)
    at Page.deleteCookie (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:395:26)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at Page.setCookie (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:413:16)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at module.exports (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:16:20)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:6249) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6249) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:6249) UnhandledPromiseRejectionWarning: Error: Protocol error (Page.navigate): Invalid parameters url: string value expected
    at Promise (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:183:56)
    at new Promise (<anonymous>)
    at CDPSession.send (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Connection.js:182:12)
    at navigate (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:118:39)
    at FrameManager.navigateFrame (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:95:7)
    at Frame.goto (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:406:37)
    at Frame.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at Page.goto (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at module.exports (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:24:14)
  -- ASYNC --
    at Frame.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:111:15)
    at Page.goto (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
    at Page.<anonymous> (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
    at module.exports (/home/anas/Tools/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:24:14)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:6249) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)

Could you please provide the package-lock.json so we can be sure to get the same dependencies?

Thanks

Retry the request if failed

Hello,

After several successful crawling requests, I got the following error:

scrapedin: 2019-10-18T09:03:38.320Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2019-10-18T09:03:38.321Z error: error on crawling profile: https://www.linkedin.com/in/xxxxxxx/ 
 Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
(node:29800) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'length' of undefined
    at crawl (/home/anas/workspace/scrapedin-linkedin-crawler/src/crawler.js:26:71)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:29800) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:29800) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

We should add a configurable retry feature to avoid this problem.
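A minimal sketch of such a retry wrapper (illustrative, not scrapedin's API): re-run the profile request a few times with a delay before giving up.

```javascript
// Retry `fn` up to `retries` times, waiting `delayMs` between attempts;
// rethrow the last error if every attempt fails.
async function withRetries(fn, retries = 3, delayMs = 1000) {
  let lastErr
  for (let attempt = 1; attempt <= retries; attempt += 1) {
    try {
      return await fn()
    } catch (err) {
      lastErr = err
      if (attempt < retries) await new Promise((r) => setTimeout(r, delayMs))
    }
  }
  throw lastErr // all attempts failed
}
```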

LinkedIn website changed and cannot read basic data

finished scraping url: https://www.linkedin.com/in/inmudassar-iqbal-a9a9159b/
scrapedin: 2020-01-22T07:42:56.489Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:56.490Z error: error on crawling profile: https://linkedin/in/mudassar-iqbal-a9a9159b/
Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:56.830Z info: starting scraping: https://linkedin/in/nadeem-aslam-057341102/
scrapedin: 2020-01-22T07:42:56.830Z info: [profile] starting scraping url: https://www.linkedin.com/in/innadeem-aslam-057341102/
scrapedin: 2020-01-22T07:42:58.070Z info: [profile] finished scraping url: https://www.linkedin.com/in/inamjad-khan-a03634b7/
scrapedin: 2020-01-22T07:42:58.070Z error: [cleanMessageData] LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:58.070Z error: error on crawling profile: https://linkedin/in/amjad-khan-a03634b7/
Error: LinkedIn website changed and scrapedin can't read basic data. Please report this issue at https://github.com/linkedtales/scrapedin/issues
2020-01-22T07:42:58.832Z info: starting scraping: https://linkedin/in/baraa-faisal-0529a5a3/
scrapedin: 2020-01-22T07:42:58.833Z info: [profile] starting scraping url: https://www.linkedin.com/in/inbaraa-faisal-0529a5a3/

Crawler doesn't work

Hello,

The crawler doesn't work; it only scans the root profile
(even if we initialize a keyword list).

2019-06-27T13:01:12.099Z info: crawler finished: there are no more profiles found with specified keywords

TypeError: Cannot read property 'replace' of undefined

Hi, I'm scraping some URLs, and for some of them I'm receiving these warnings and error messages:

warn: [profile] profile selector was not found

warn: [scrollToPageBottom] page bottom not found

error: error on crawling profile: https://www.linkedin.com/in/IDENTIFIER/
 TypeError: Cannot read property 'replace' of undefined

I've already set isHeadless to false in config.json and updated scrapedin to the latest version (1.0.16).

Another question: does the program finish after crawling all the profiles? I keep getting the message:
info: there is no profiles to crawl right now...
but it doesn't terminate the job.

Error: net::ERR_INTERNET_DISCONNECTED at https://www.linkedin.com/login

I got this error under CentOS 7. This crawler used to work, but it throws the following error since I changed the place I work from. Does anyone have any idea?

(node:6378) UnhandledPromiseRejectionWarning: Error: net::ERR_INTERNET_DISCONNECTED at https://www.linkedin.com/login
at navigate (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:120:37)
at process._tickCallback (internal/process/next_tick.js:68:7)
-- ASYNC --
at Frame. (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:111:15)
at Page.goto (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
at Page. (/home/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
at module.exports (/home/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:24:14)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:6378) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6378) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Save to excel

Hey, can we save all the results to Excel?
I am not a programmer, so can you provide the code? Please help me if you don't mind.

Thanks

Scraper doesn't close browser

The scraper doesn't close the browser on exception or on successful scraping. If there are 100 profiles to scrape, it launches 100 browser instances, which creates memory issues.
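A sketch of the fix: wrap the scrape in try/finally so the browser closes on success and on exception. `scrapeAndClose` and its arguments are illustrative, not the scraper's real API:

```javascript
// Ensure the browser is closed whether scraping succeeds or throws.
async function scrapeAndClose(browser, scrapeProfile, url) {
  try {
    return await scrapeProfile(url)
  } finally {
    await browser.close() // runs on success *and* on exception
  }
}
```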

Skip Profiles already downloaded

Is there an option to skip profiles that are already downloaded in "saveDirectory"?
I'm asking because whenever it hits an error and I have to run it again with the same "rootProfiles", it starts downloading everything from scratch. Why can't it check the filenames already present in "saveDirectory" and proceed accordingly?

Error Codes for NPM Start - Internal Modules Loader Error

~/crawler$ npm start
(node:16966) [DEP0139] DeprecationWarning: Calling process.umask() with no arguments is prone to race conditions and is a potential security vulnerability.
(Use node --trace-deprecation ... to show where the warning was created)

[email protected] start /home/homespace/crawler
node src/index

internal/modules/cjs/loader.js:1288
throw err;
^

SyntaxError: /home/infiniteiotnode1/crawler/config.json: Unexpected token ] in JSON at position 408
at parse ()
at Object.Module._extensions..json (internal/modules/cjs/loader.js:1285:22)
at Module.load (internal/modules/cjs/loader.js:1099:32)
at Function.Module._load (internal/modules/cjs/loader.js:964:14)
at Module.require (internal/modules/cjs/loader.js:1139:19)
at require (internal/modules/cjs/helpers.js:75:18)
at Object. (/home/infiniteiotnode1/crawler/src/index.js:2:20)
at Module._compile (internal/modules/cjs/loader.js:1250:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1271:10)
at Module.load (internal/modules/cjs/loader.js:1099:32)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: node src/index
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

Any assistance is greatly appreciated.
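"Unexpected token ] in JSON at position 408" points at config.json itself, not the crawler: it is almost always a trailing comma before a closing bracket, which strict JSON forbids. A minimal demonstration (the `rootProfiles` key is from the crawler's config; the URL is a placeholder):

```javascript
// Strict JSON rejects a comma before the closing bracket.
const bad = '{ "rootProfiles": [ "https://www.linkedin.com/in/someone/", ] }'
let parseFailed = false
try {
  JSON.parse(bad)
} catch (err) {
  parseFailed = true // same SyntaxError class as in the npm start log
}
const good = '{ "rootProfiles": [ "https://www.linkedin.com/in/someone/" ] }'
const config = JSON.parse(good) // parses once the trailing comma is removed
```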

Crawl LinkedIn skills

Hi! I'm looking for a way to extract the list of skills used by LinkedIn. This used to be fairly easy until about two years ago, but not anymore, since the site structure changed. Any ideas on how to extract skill data?

Irregular information loss when "maxConcurrentCrawlers" is greater than 1

Hi there, this project is awesome and I have successfully used it to scrape some profiles. Thanks for sharing!
However, a small issue arises when I set 'maxConcurrentCrawlers' greater than 1: some records under 'education' and 'position' are likely to be lost. Is this related to my internet speed or platform (OSX)? Thanks!

avoid related profile loop

Hi, and thank you for your amazing tool!

Sometimes the related profiles contain a previously scraped profile. How can we avoid this? Otherwise it may loop indefinitely. :)
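One way to avoid the loop can be sketched as follows (an illustrative helper, not an existing option): remember every URL already queued in a Set and skip related profiles seen before.

```javascript
// De-duplicate crawl targets: return false for any URL already visited.
function makeVisitedFilter() {
  const visited = new Set()
  return function shouldCrawl(url) {
    const key = url.replace(/\/+$/, '') // treat ".../foo" and ".../foo/" the same
    if (visited.has(key)) return false  // already scraped: don't queue again
    visited.add(key)
    return true
  }
}
```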

Scrape company employees

Is it possible to retrieve users (and then scrape their profiles) from a company URL?
Given a company URL: https://www.linkedin.com/company/mycompany
we can get a list of employees by scraping: https://www.linkedin.com/company/mycompany/people/
Is there any built-in functionality to achieve this, or should I implement it myself?

(node:11136) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!

$ npm start

[email protected] start /Users/st/Desktop/panya/scrapedin-linkedin-crawler
node src/index

(node:11136) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!
dlopen /Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/.local-chromium/mac-641577/chrome-mac/Chromium.app/Contents/MacOS/../Versions/75.0.3738.0/Chromium Framework.framework/Chromium Framework: dlopen(/Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/.local-chromium/mac-641577/chrome-mac/Chromium.app/Contents/MacOS/../Versions/75.0.3738.0/Chromium Framework.framework/Chromium Framework, 261): Library not loaded: /System/Library/Frameworks/MediaPlayer.framework/Versions/A/MediaPlayer
Referenced from: /Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/.local-chromium/mac-641577/chrome-mac/Chromium.app/Contents/Versions/75.0.3738.0/Chromium Framework.framework/Versions/A/Chromium Framework
Reason: image not found

TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md

at onClose (/Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Launcher.js:342:14)
at Interface.helper.addEventListener (/Users/st/Desktop/panya/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Launcher.js:331:50)
at emitNone (events.js:111:20)
at Interface.emit (events.js:208:7)
at Interface.close (readline.js:368:8)
at Socket.onend (readline.js:147:10)
at emitNone (events.js:111:20)
at Socket.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:138:11)

(node:11136) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:11136) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

TypeError: Cannot read property 'replace' of undefined

Hi, I'm scraping some URLs, and for most of them (almost all) I'm receiving these warnings and error messages:

warn: [profile] profile selector was not found

warn: [scrollToPageBottom] page bottom not found

warn: [seeMoreButtons] couldn't click on button.pv-profile-section__see-more-inline, it's probably invisible

error: error on crawling profile: https://www.linkedin.com/in/IDENTIFIER/
 TypeError: Cannot read property 'replace' of undefined

I've already set isHeadless to false in config.json and updated scrapedin to the latest version (1.0.14).

Happy new year, and I got this UnhandledPromiseRejection error

we're going to login...
(node:5154) UnhandledPromiseRejectionWarning: Error: net::ERR_INTERNET_DISCONNECTED at https://www.linkedin.com/login
at navigate (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/FrameManager.js:120:37)
at process._tickCallback (internal/process/next_tick.js:68:7)
-- ASYNC --
at Frame. (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:111:15)
at Page.goto (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/Page.js:674:49)
at Page. (/home/tony/scrapedin-linkedin-crawler/node_modules/puppeteer/lib/helper.js:112:23)
at module.exports (/home/tony/scrapedin-linkedin-crawler/node_modules/scrapedin/src/openPage.js:26:14)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:5154) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:5154) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.


Not sure if it is a Node problem or something else.

Thanks!
