bunsly / jobspy
Jobs scraper library for LinkedIn, Indeed, Glassdoor & ZipRecruiter
Home Page: https://usejobspy.com
License: MIT License
Just wondering, as it's slow, buggy, can't be async, and support for it is unknown.
I ran the code from the usage example, but I received a 403 response code.
jobspy.scrapers.exceptions.IndeedException: bad response with status code: 403
Some job types are missing from the JobType enum, and this triggers an IndeedException.
Czech Republic:
Hi there.
So far, I can only select 50 job posts each time. Is there any way we can select the next page?
It would be good to be able to scrape information from a company's page, such as name, description, and any images/resources.
I love what you're building.
Is it possible to add a filter to search for recent jobs, e.g. a from field that accepts a datetime and uses it as a cutoff?
Alternatively, if we can order by date_posted
that would be helpful as well.
I'm not sure how the jobs are ordered now, seems pretty random.
I currently do this manually, but filtering before we get the results would reduce both latency and the likelihood of getting blocked:
jobs.sort_values(by='date_posted', ascending=False)
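The cutoff-plus-sort idea can be done client-side today; a sketch below (the sample frame is a hypothetical stand-in for the scrape_jobs result, though date_posted is the same column the sort above uses):

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by scrape_jobs
jobs = pd.DataFrame({
    "title": ["Job A", "Job B", "Job C"],
    "date_posted": pd.to_datetime(["2024-01-02", "2024-01-05", "2023-12-30"]),
})

# Keep only jobs posted on or after a cutoff, newest first
cutoff = pd.Timestamp("2024-01-01")
recent = jobs[jobs["date_posted"] >= cutoff].sort_values(
    by="date_posted", ascending=False
)
print(recent["title"].tolist())
```

A server-side version of the same filter would still be preferable, since it avoids fetching rows only to discard them.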
Hi,
Thank you for introducing #84. I wonder if we could build more onto this, such as prepending a - character for every <li> that occurs in the description?
Currently, a description like this:
will output a description like so:
What you will get in return:
Fantastic market leading Pension scheme
Excellent employee health and wellbeing services including an Employee Assistance Programme
Exceptional starting annual leave entitlement, plus bank holidays
Additional paid closure over the Christmas period
Local and national discounts at a range of major retailers
What I'm suggesting is that we prepend the - character (which could be customizable by the library user), so there is at least some indication that it was a list element:
What you will get in return:
- Fantastic market leading Pension scheme
- Excellent employee health and wellbeing services including an Employee Assistance Programme
- Exceptional starting annual leave entitlement, plus bank holidays
- Additional paid closure over the Christmas period
- Local and national discounts at a range of major retailers
Thanks
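The transformation asked for above can be sketched with stdlib regexes alone; the function name and the tag-handling rules here are mine, not jobspy's, and a real implementation would hook into wherever the library converts the HTML description to text:

```python
import re

def description_with_bullets(html: str, bullet: str = "- ") -> str:
    """Prefix each <li> with a (customizable) bullet, then strip remaining tags."""
    html = re.sub(r"<li\b[^>]*>", bullet, html)      # opening <li> becomes the bullet
    html = re.sub(r"</(?:p|ul|ol|li)>", "\n", html)  # closing block tags become newlines
    text = re.sub(r"<[^>]+>", "", html)              # drop any other tags
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

html = ("<p>What you will get in return:</p>"
        "<ul><li>Pension scheme</li><li>Paid closure</li></ul>")
print(description_with_bullets(html))
```

An HTML parser (e.g. BeautifulSoup, which jobspy already depends on) would be more robust than regexes for production use; this only illustrates the bullet-prefix behaviour.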
I have a list of company IDs that I search for jobs by, along with keyword and location. It would be nice to be able to pull this programmatically.
Hey! Love the library!!
Not an actual issue, more like a question.
Is there a way to get the company's website URL instead of the LinkedIn profile URL? Companies have their own website attached to their LinkedIn profile, and that is actually what I want.
Thanks in advance!!
I've been trying to get this package working, to no avail. The package is installed on my computer, but for some reason the code doesn't run. Can I ask for help with this? Thank you.
Can we allow multiple sites to be searched?
Right now it seems like it only defaults to one.
Can you make the rate limiter adjustable on the request polling so we can tune it to avoid being blocked? I also don't see rate limiters on Indeed and Glassdoor.
I believe it should be returned as the pd.DataFrame class rather than <class 'jobspy.ScrapeResults'>.
Any workarounds or fix?
Hi,
Thanks for developing this. I'm curious whether there is a function that only crawls today's jobs on LinkedIn.
Thanks!
There's a rare bug where they send you a different HTML page that isn't supported by the bot. I haven't received that page often enough to debug it properly.
Hi,
I'd like to set up a serverless function which runs every 15 minutes to scrape jobs and put them into a database. However, many serverless providers (like DigitalOcean Functions) have package size restrictions.
It looks like pandas (which has numpy as a dependency) takes up at least 120 MB, all just to use pd.DataFrame.
I understand this is quite a big change, so I don't expect anything.
Thanks
I did two queries, one of which was the demo one, and none of the jobs from LinkedIn had job descriptions.
For example in first call I catch 20 jobs.
So for the next call I would like to catch jobs from 20 to 40 positions.
Is it possible?
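As far as I know there is no server-side cursor exposed for "give me results 20-40", so a common workaround is to request a larger batch and keep only rows you haven't seen. A client-side sketch (the job_url column name is an assumption about the result frame; swap in whatever unique key your results carry):

```python
import pandas as pd

def new_rows(seen: pd.DataFrame, batch: pd.DataFrame,
             key: str = "job_url") -> pd.DataFrame:
    """Return only rows of `batch` whose key is not already in `seen`."""
    return batch[~batch[key].isin(seen[key])]

# Simulated first call (20 jobs) and a later, larger call with overlap
seen = pd.DataFrame({"job_url": [f"u{i}" for i in range(20)]})
batch = pd.DataFrame({"job_url": [f"u{i}" for i in range(40)]})

page2 = new_rows(seen, batch)  # the 20 previously unseen jobs
print(len(page2))
```

This trades extra fetching for simplicity; a real offset/pagination parameter in the library would avoid re-downloading the first page.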
On macOS 10.15, Python 3.11.5:
import jobspy
Segmentation fault: 11
from jobspy import scrape_jobs
Segmentation fault: 11
Any ideas ?
Hi there! Thank you for the amazing project.
Is there any way to change the country?
Hi dear creators.
I'm doing a simple test with your demo file.
Last week, on 27 Nov, I was able to fetch results from Indeed.
This week I ran the same script and was immediately blocked (403 error).
I tried to use a proxy but to no avail.
This bit of code:
jobs = scrape_jobs(
site_name=["linkedin"],
search_term="software engineer",
location="Texas, US",
results_wanted=15,
country_indeed='US' # only needed for indeed / glassdoor
)
returns a description, but it does not contain any \n
characters. This means that when the text is rendered, it comes out as one big block of text.
What I have:
Python 3.11, clean install
from jobspy import scrape_jobs
jobs = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter"],
search_term="software engineer",
location="Dallas, TX",
results_wanted=10,
country_indeed="USA", # only needed for indeed
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
jobs.to_csv("jobs.csv", index=False) # / to_xlsx
When I try to run this, I get the following error:
jobspy.scrapers.exceptions.ZipRecruiterException: 'NoneType' object is not iterable
which points to line 90 in this method:
jobs_on_page, continue_token = self.find_jobs_in_page(scraper_input, continue_token)
I tried to use different parameters for the search, seems like ZipRecruiter is not working with any of them.
All the other websites work great! However, when doing searches on Linkedin, it takes 3-4 minutes per result. Does anyone else have the same issue?
The job_type filter doesn't work, even if we leave the other criteria out (like is_remote).
For example: "fulltime", "temporary", and "contract" don't return any errors, but the data returned has different job_types from the ones requested.
PS: Just realized that bunsly.com doesn't have a job_type filter so I'm guessing you're already aware of this?
Can somebody please check the accepted types for the site_name parameter in scrape_jobs()?
Argument of type "list[str]" cannot be assigned to parameter "site_name" of type "str | Site | List[Site]" in function "scrape_jobs":
"Literal['linkedin']" is incompatible with "Site"
"Literal['zip_recruiter']" is incompatible with "Site"
Pylance (reportGeneralTypeIssues): https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportGeneralTypeIssues
This is the error I keep receiving.
I want to create an API to crawl jobs. It worked well in my local environment, but it runs for a long time, or even returns nothing, when I deploy the API to clouds like Render or Azure. I also tried Flask + Celery locally, but the scrape_jobs function does not respond with anything.
Wanted to ask: can you set keywords that need to be present in the description to prefilter the results?
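I'm not aware of a built-in description-keyword filter, but post-filtering the returned frame is straightforward. A sketch (the sample frame is hypothetical; it stands in for scrape_jobs output, which does include a description column):

```python
import pandas as pd

# Hypothetical results frame standing in for scrape_jobs output
jobs = pd.DataFrame({
    "title": ["Backend Dev", "Designer", "Data Engineer"],
    "description": ["Python and Django experience", "Figma skills", "Airflow, Python"],
})

keywords = ["python", "django"]

def has_keyword(desc: str) -> bool:
    # Case-insensitive substring match; tolerate missing descriptions
    d = (desc or "").lower()
    return any(k in d for k in keywords)

filtered = jobs[jobs["description"].apply(has_keyword)]
print(filtered["title"].tolist())
```

As the comment above notes, filtering before the request would be better still, since it would cut latency and the number of scraped pages.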
Thanks for your great work! It would be more convenient if we could use proxies to bypass the rate limit.
The ZipRecruiter and Indeed scrapers are multi-threaded to speed up processing. LinkedIn is not at the moment, as it's trickier: LinkedIn will deny too many requests, so this has to be implemented more elegantly than it was for the other two sites.
The instance's IP is blocked. Please wait until we fix it.
Working on it, I see they got mixed up
Is there a way to extract emails and other contact data from the job posting (if they exist)? That could prove invaluable.
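There's no built-in contact extraction that I know of, but a regex pass over the description column after scraping gets most of the way there. A minimal sketch (the pattern is a pragmatic approximation, not a full RFC 5322 parser):

```python
import re

# Matches common email shapes; deliberately loose, not spec-complete
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(description: str) -> list[str]:
    """Return unique email-like strings found in a job description."""
    return sorted(set(EMAIL_RE.findall(description or "")))

print(extract_emails("Apply at jobs@example.com or hr@example.com."))
```

The same approach extends to phone numbers or URLs with additional patterns, applied column-wise via DataFrame.apply.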
Running into an issue where job descriptions are truncated when pulling them. Any idea why?
Error code below
Code: The default one
import csv
from jobspy import scrape_jobs
jobs = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
search_term="software engineer",
location="Dallas, TX",
results_wanted=20,
hours_old=72, # (only linkedin is hour specific, others round up to days old)
country_indeed='USA' # only needed for indeed / glassdoor
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_xlsx
Testing
panic: runtime error: index out of range [1] with length 1
goroutine 18 [running, locked to thread]:
github.com/bogdanfinn/tls-client.stringToSpec({0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0, ...}, ...)
/Users/bogdan/go/pkg/mod/github.com/bogdanfinn/[email protected]/ja3.go:89 +0xe5f
We're using ZipRecruiter's API, but they require some cookies for auth that seem to expire, resulting in a 403 after some time.
Cookies of interest:
I found that the application
https://addons.mozilla.org/en-US/firefox/addon/simplify-jobs/
can easily auto-apply for you on the company's website from the LinkedIn links. It's an awesome plugin when used alongside yours.
Does JobSpy software work?
https://www.bunsly.com/demo/jobspy
I'm trying to download blockchain jobs, but it gives me an empty Excel or CSV file. What is the error?
I can't install python-jobspy in my venv on Python 3.12.1.
I'm getting the
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
error. I've tried many different methods but it always comes back to this.
The HTML is unreliable, but the JavaScript always has the jobs.
It would be cool if we could get the company image and the external job URL rather than the LinkedIn/Indeed/Glassdoor one, e.g. if it's an Apple job ad on LinkedIn, the URL should be the one that redirects the user directly to Apple's job page.
Used the simple example from the README and all Indeed results return a null description. (Also tried a few others)
LinkedIn, Zip, and Glassdoor all seem to be working fine