bunsly / jobspy
Jobs scraper library for LinkedIn, Indeed, Glassdoor & ZipRecruiter
Home Page: https://usejobspy.com
License: MIT License
Just wondering, as it's slow, buggy, can't be async, and support for it is unknown.
I ran the code from the usage example, but I received a 403 response code.
jobspy.scrapers.exceptions.IndeedException: bad response with status code: 403
Some job types are missing from the JobType enum, and this triggers an IndeedException.
Czech Republic:
Hi there.
So far, I can only select 50 job posts each time. Is there any way we can select the next page?
It would be good to be able to scrape information from a company's page, such as name, description, and any images/resources.
I love what you're building.
Is it possible to add a filter to search for recent jobs, e.g. a from field that accepts a datetime and uses it as a cutoff?
Alternatively, if we can order by date_posted
that would be helpful as well.
I'm not sure how the jobs are ordered now, seems pretty random.
I currently do this manually, but filtering before we get the results would reduce both latency and the likelihood of getting blocked:
jobs.sort_values(by='date_posted', ascending=False)
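The cutoff-plus-sort idea can be done client-side today; a sketch below (the sample frame is a hypothetical stand-in for the scrape_jobs result, though date_posted is the same column the sort above uses):

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by scrape_jobs
jobs = pd.DataFrame({
    "title": ["Job A", "Job B", "Job C"],
    "date_posted": pd.to_datetime(["2024-01-02", "2024-01-05", "2023-12-30"]),
})

# Keep only jobs posted on or after a cutoff, newest first
cutoff = pd.Timestamp("2024-01-01")
recent = jobs[jobs["date_posted"] >= cutoff].sort_values(
    by="date_posted", ascending=False
)
print(recent["title"].tolist())
```

A server-side version of the same filter would still be preferable, since it avoids fetching rows only to discard them.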
Hi,
Thank you for introducing #84. I wonder if we could build more onto this, such as prepending a - character for every <li> that occurs in the description?
Currently, a description like this:
will output a description like so:
What you will get in return:
Fantastic market leading Pension scheme
Excellent employee health and wellbeing services including an Employee Assistance Programme
Exceptional starting annual leave entitlement, plus bank holidays
Additional paid closure over the Christmas period
Local and national discounts at a range of major retailers
What I'm suggesting is that we prepend the - character (which could be customizable by the library user), so there is at least some indication that it was a list element:
What you will get in return:
- Fantastic market leading Pension scheme
- Excellent employee health and wellbeing services including an Employee Assistance Programme
- Exceptional starting annual leave entitlement, plus bank holidays
- Additional paid closure over the Christmas period
- Local and national discounts at a range of major retailers
Thanks
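The transformation asked for above can be sketched with stdlib regexes alone; the function name and the tag-handling rules here are mine, not jobspy's, and a real implementation would hook into wherever the library converts the HTML description to text:

```python
import re

def description_with_bullets(html: str, bullet: str = "- ") -> str:
    """Prefix each <li> with a (customizable) bullet, then strip remaining tags."""
    html = re.sub(r"<li\b[^>]*>", bullet, html)      # opening <li> becomes the bullet
    html = re.sub(r"</(?:p|ul|ol|li)>", "\n", html)  # closing block tags become newlines
    text = re.sub(r"<[^>]+>", "", html)              # drop any other tags
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

html = ("<p>What you will get in return:</p>"
        "<ul><li>Pension scheme</li><li>Paid closure</li></ul>")
print(description_with_bullets(html))
```

An HTML parser (e.g. BeautifulSoup, which jobspy already depends on) would be more robust than regexes for production use; this only illustrates the bullet-prefix behaviour.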
I have a list of company IDs that I search for jobs by, along with keyword and location. It would be nice to be able to pull this programmatically.
Hey! Love the library!!
Not an actual issue, more like a question.
Is there a way to get the company's website URL instead of the LinkedIn profile URL? Companies have their own website attached to their LinkedIn profile, and that is actually what I want.
Thanks in advance!!
I've been trying to get this package working, to no avail. The package is installed on my computer, but for some reason the code doesn't run. Can I ask for help with this? Thank you.
Can we allow multiple sites to be searched?
Right now it seems like it only defaults to one.
Can you make the rate limiter adjustable on the request polling so we can tune it to avoid being blocked? I also don't see rate limiters on Indeed and Glassdoor.
I believe it should be returned as the pd.DataFrame class rather than <class 'jobspy.ScrapeResults'>.
Any workarounds or fix?
Hi,
Thanks for developing this. I'm curious whether there is a function that only crawls today's jobs on LinkedIn.
Thanks!
There's a rare bug where they send you a different HTML page that isn't supported by the bot. I haven't received that page often enough to debug it properly.
Hi,
I'd like to set up a serverless function which runs every 15 minutes to scrape jobs and put them into a database. However, many serverless providers (like DigitalOcean Functions) have package size restrictions.
It looks like pandas (which has numpy as a dependency) takes up at least 120 MB, all just to use pd.DataFrame.
I understand this is quite a big change, so I don't expect anything.
Thanks
I did two queries, one of which was the demo one, and none of the jobs from LinkedIn had job descriptions.
For example in first call I catch 20 jobs.
So for the next call I would like to catch jobs from 20 to 40 positions.
Is it possible?
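As far as I know there is no server-side cursor exposed for "give me results 20-40", so a common workaround is to request a larger batch and keep only rows you haven't seen. A client-side sketch (the job_url column name is an assumption about the result frame; swap in whatever unique key your results carry):

```python
import pandas as pd

def new_rows(seen: pd.DataFrame, batch: pd.DataFrame,
             key: str = "job_url") -> pd.DataFrame:
    """Return only rows of `batch` whose key is not already in `seen`."""
    return batch[~batch[key].isin(seen[key])]

# Simulated first call (20 jobs) and a later, larger call with overlap
seen = pd.DataFrame({"job_url": [f"u{i}" for i in range(20)]})
batch = pd.DataFrame({"job_url": [f"u{i}" for i in range(40)]})

page2 = new_rows(seen, batch)  # the 20 previously unseen jobs
print(len(page2))
```

This trades extra fetching for simplicity; a real offset/pagination parameter in the library would avoid re-downloading the first page.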
On macOS 10.15, Python 3.11.5:
import jobspy
Segmentation fault: 11
from jobspy import scrape_jobs
Segmentation fault: 11
Any ideas ?
Hi there! Thank you for the amazing project.
Is there any way to change the country?
Hi dear creators.
I'm doing a simple test with your demo file.
Last week, on 27 Nov, I was able to fetch results from Indeed.
This week I ran the same script and was immediately blocked (403 error).
I tried to use a proxy but to no avail.
This bit of code:
jobs = scrape_jobs(
site_name=["linkedin"],
search_term="software engineer",
location="Texas, US",
results_wanted=15,
country_indeed='US' # only needed for indeed / glassdoor
)
returns a description, but it does not contain any \n
characters. This means that when the text is rendered, it comes out as one big block of text.
What I have:
Python 3.11, clean install
from jobspy import scrape_jobs
jobs = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter"],
search_term="software engineer",
location="Dallas, TX",
results_wanted=10,
country_indeed="USA", # only needed for indeed
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
jobs.to_csv("jobs.csv", index=False) # / to_xlsx
When I try to run this, I get the following error:
jobspy.scrapers.exceptions.ZipRecruiterException: 'NoneType' object is not iterable
which points to line 90 in this method:
jobs_on_page, continue_token = self.find_jobs_in_page(scraper_input, continue_token)
I tried to use different parameters for the search, seems like ZipRecruiter is not working with any of them.
All the other websites work great! However, when doing searches on Linkedin, it takes 3-4 minutes per result. Does anyone else have the same issue?
The job_type filter doesn't work, even if we leave the other criteria out (like is_remote).
For example: "fulltime", "temporary", and "contract" don't return any errors, but the data returned has different job_types from the ones requested.
PS: Just realized that bunsly.com doesn't have a job_type filter so I'm guessing you're already aware of this?
Can somebody please check the accepted types for the site_name parameter in scrape_jobs()?
Argument of type "list[str]" cannot be assigned to parameter "site_name" of type "str | Site | List[Site]" in function "scrape_jobs":
"Literal['linkedin']" is incompatible with "Site"
"Literal['zip_recruiter']" is incompatible with "Site"
Pylance (reportGeneralTypeIssues): https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportGeneralTypeIssues
This is the error I keep receiving.
I want to create an API to crawl jobs. It worked well in my local environment, but it runs for a long time, or even returns nothing, when I deploy the API to clouds like Render or Azure. I also tried Flask + Celery locally, but the scrape_jobs function does not respond with anything.
Wanted to ask: can you set keywords that need to be present in the description to prefilter the results?
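I'm not aware of a built-in description-keyword filter, but post-filtering the returned frame is straightforward. A sketch (the sample frame is hypothetical; it stands in for scrape_jobs output, which does include a description column):

```python
import pandas as pd

# Hypothetical results frame standing in for scrape_jobs output
jobs = pd.DataFrame({
    "title": ["Backend Dev", "Designer", "Data Engineer"],
    "description": ["Python and Django experience", "Figma skills", "Airflow, Python"],
})

keywords = ["python", "django"]

def has_keyword(desc: str) -> bool:
    # Case-insensitive substring match; tolerate missing descriptions
    d = (desc or "").lower()
    return any(k in d for k in keywords)

filtered = jobs[jobs["description"].apply(has_keyword)]
print(filtered["title"].tolist())
```

As the comment above notes, filtering before the request would be better still, since it would cut latency and the number of scraped pages.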
Thanks for your great work! It would be more convenient if we could use proxies to bypass the rate limit.
The ZipRecruiter and Indeed scrapers are multi-threaded to speed up processing. LinkedIn is not at the moment, as it's trickier: LinkedIn will deny too many requests, so this has to be implemented more elegantly than it was for the other two sites.
The instance's IP is blocked. Please wait until we fix it.
Working on it, I see they got mixed up
Is there a way to extract emails and other contact data from the job posting (if they exist)? That could prove invaluable.
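There's no built-in contact extraction that I know of, but a regex pass over the description column after scraping gets most of the way there. A minimal sketch (the pattern is a pragmatic approximation, not a full RFC 5322 parser):

```python
import re

# Matches common email shapes; deliberately loose, not spec-complete
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(description: str) -> list[str]:
    """Return unique email-like strings found in a job description."""
    return sorted(set(EMAIL_RE.findall(description or "")))

print(extract_emails("Apply at jobs@example.com or hr@example.com."))
```

The same approach extends to phone numbers or URLs with additional patterns, applied column-wise via DataFrame.apply.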
Running into an issue where job descriptions are truncated when pulling them. Any idea why?
Error code below
Code: The default one
import csv
from jobspy import scrape_jobs
jobs = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
search_term="software engineer",
location="Dallas, TX",
results_wanted=20,
hours_old=72, # (only linkedin is hour specific, others round up to days old)
country_indeed='USA' # only needed for indeed / glassdoor
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_xlsx
Testing
panic: runtime error: index out of range [1] with length 1
goroutine 18 [running, locked to thread]:
github.com/bogdanfinn/tls-client.stringToSpec({0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0, ...}, ...)
/Users/bogdan/go/pkg/mod/github.com/bogdanfinn/[email protected]/ja3.go:89 +0xe5f
We're using ZipRecruiter's API, but they require some cookies for auth that seem to expire, resulting in a 403 after some time.
Cookies of interest:
I found that the application
https://addons.mozilla.org/en-US/firefox/addon/simplify-jobs/
can easily auto-apply for you on the company's website from the LinkedIn links. It's an awesome plugin when used alongside yours.
Does JobSpy software work?
https://www.bunsly.com/demo/jobspy
I'm trying to download blockchain jobs, but it gives me an empty Excel or CSV file. What is the error?
I can't install python-jobspy in my venv on Python 3.12.1.
I'm getting the
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
error. I've tried many different methods but it always comes back to this.
The HTML is unreliable, but the JavaScript always has the jobs.
It would be cool if we could get the company image and the external job URL rather than the LinkedIn/Indeed/Glassdoor one, e.g. if it's an Apple job ad on LinkedIn, the URL should be the one that redirects the user directly to Apple's job page.
Used the simple example from the README and all Indeed results return a null description. (Also tried a few others)
LinkedIn, Zip, and Glassdoor all seem to be working fine