
bunsly / jobspy


Jobs scraper library for LinkedIn, Indeed, Glassdoor & ZipRecruiter

Home Page: https://usejobspy.com

License: MIT License

Python 100.00%
indeed linkedin ziprecruiter jobs-scraper linkedin-scraper job-scraper internship job-search jobs-search jobsearch

jobspy's People

Contributors

augustogunsch, cullenwatson, frzkhan, giga-sec, harryalways317, kellenmace, lluissalord, minicoz, troy-conte, vinceyyy, vitaminb16, zacharyhampton


jobspy's Issues

Received a response code 403

I ran the code from the Usage section, but I received a 403 response code.
jobspy.scrapers.exceptions.IndeedException: bad response with status code: 403

Missing job types in enum

Some job types are missing from the JobType enum, and this triggers an IndeedException.

Czech Republic:

  • Částečný úvazek
  • Deltid
  • Praktik

Feature: Time-based filtering

I love what you're building.
Is it possible to add a filter for recent jobs, e.g. a "from" field that accepts a datetime and uses it as a cutoff?

Alternatively, being able to order by date_posted would be helpful as well.
I'm not sure how the jobs are ordered now; the order seems pretty random.

I currently do this manually, but filtering before the results are fetched would reduce both latency and the likelihood of getting blocked:
jobs.sort_values(by='date_posted', ascending=False)
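
The cutoff idea can also be sketched client-side. The following is only an illustration of the requested behavior, not the library's implementation: `filter_recent` is a hypothetical helper, and it assumes each record is a dict whose `date_posted` value is a `date`, a `datetime`, or `None` (for a DataFrame, `jobs.to_dict("records")` produces such dicts). Because it runs after scraping, it does not reduce the number of requests.

```python
from datetime import date, datetime


def filter_recent(records, cutoff):
    """Keep records posted on or after `cutoff`, newest first (hypothetical helper)."""

    def posted(record):
        value = record.get("date_posted")
        # Normalise datetimes to dates so both kinds compare cleanly.
        return value.date() if isinstance(value, datetime) else value

    cutoff_date = cutoff.date() if isinstance(cutoff, datetime) else cutoff
    # Records without a posting date are dropped rather than guessed at.
    recent = [r for r in records if posted(r) is not None and posted(r) >= cutoff_date]
    return sorted(recent, key=posted, reverse=True)


jobs = [
    {"title": "A", "date_posted": date(2024, 1, 20)},
    {"title": "B", "date_posted": date(2024, 1, 5)},
    {"title": "C", "date_posted": None},
    {"title": "D", "date_posted": date(2024, 1, 23)},
]
recent_titles = [r["title"] for r in filter_recent(jobs, datetime(2024, 1, 15))]
print(recent_titles)  # ['D', 'A']
```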

improvement: add '-' for each <li> element found

Hi,

Thank you for introducing #84. I wonder if we could build more onto this, such as introducing - characters for every <li> that occurs in the description?

Currently, a description like this:

[Screenshot 2024-01-23 at 18 17 30]

will output a description like so:

What you will get in return:
Fantastic market leading Pension scheme
Excellent employee health and wellbeing services including an Employee Assistance Programme
Exceptional starting annual leave entitlement, plus bank holidays
Additional paid closure over the Christmas period
Local and national discounts at a range of major retailers

What I'm suggesting is that we prepend the - character (which could be customizable by the library user) to each item, so there is at least some indication that it was a list element:

What you will get in return:
- Fantastic market leading Pension scheme
- Excellent employee health and wellbeing services including an Employee Assistance Programme
- Exceptional starting annual leave entitlement, plus bank holidays
- Additional paid closure over the Christmas period
- Local and national discounts at a range of major retailers

Thanks
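
A minimal sketch of the suggestion, using only the standard library's `html.parser`. It is an illustration under assumptions, not the library's description parser: the class `ListAwareText`, the function `html_to_text`, and the `bullet` parameter are hypothetical names, and the input is assumed to be the raw HTML of a job description.

```python
from html.parser import HTMLParser

# Tags treated as line boundaries in this sketch.
BLOCK_TAGS = ("li", "p", "br", "ul", "ol", "div")


class ListAwareText(HTMLParser):
    """Collect text line by line, prefixing each <li> with a bullet."""

    def __init__(self, bullet="- "):
        super().__init__()
        self.bullet = bullet
        self.lines = []
        self._current = []

    def _flush(self):
        # Finish the line being built; skip lines that are only whitespace.
        text = "".join(self._current).strip()
        if text:
            self.lines.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        if tag == "li":
            self._current.append(self.bullet)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)


def html_to_text(html, bullet="- "):
    parser = ListAwareText(bullet)
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last tag
    return "\n".join(parser.lines)


sample = (
    "<p>What you will get in return:</p>"
    "<ul><li>Fantastic market leading Pension scheme</li>"
    "<li>Additional paid closure over the Christmas period</li></ul>"
)
print(html_to_text(sample))
```

Passing a different `bullet` (for example `"* "`) changes the prefix, which is one way the character could be made customizable.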

Scraping companies websites of LinkedIn

Hey! Love the library!!
Not an actual issue, more like a question.
Is there a way to get the companies' own website URLs instead of their LinkedIn profile URLs?
Every company has its own website attached to its LinkedIn profile, and that is what I actually want.
Thanks in advance!!

Error in packages

Image attached.
I've been trying to download this package to no avail. The package is installed on my computer, but for some reason, this code doesn't run. Can I ask for help regarding this matter? Thank you.

rate limiter

Could you make the rate limiter on request polling adjustable, so we can tune it to avoid being blocked? I also don't see rate limiters for Indeed and Glassdoor.
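
For illustration, an adjustable limiter could look like the sketch below. This is a generic minimum-interval limiter, not code from the library: the `RateLimiter` class and its `min_interval`, `clock`, and `sleep` parameters are hypothetical, and the injectable clock exists only so the behavior can be checked without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous request: pause for the rest.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call wait() before each outgoing request.
limiter = RateLimiter(min_interval=0.01)
for _ in range(3):
    limiter.wait()
    # ... send one request here ...
```

A scraper would expose `min_interval` as a user-facing setting; a larger value trades speed for a lower chance of being blocked.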

Returns as JobSpy Class

I believe the result should be returned as a pd.DataFrame rather than <class 'jobspy.ScrapeResults'>.

Any workarounds or fix?

Curious about new functionality

Hi,

Thanks for developing this. I am curious whether there is a new function that crawls only today's jobs on LinkedIn.

Thanks!

ZipRecruiter - HTML variant

There's a rare bug where ZipRecruiter sends a different HTML page that isn't supported by the bot. The page shows up too rarely to debug it properly.

improvement: Remove pandas dependency

Hi,

I'd like to set up a serverless function which runs every 15 minutes to scrape jobs and put them into a database. However, many serverless providers (like DigitalOcean Functions) have package size restrictions.

It looks like Pandas (which has a dependency of numpy) takes up at least 120MB. All to use pd.DataFrame.

I understand this is quite a big change, so I don't expect anything.

Thanks

macOS Catalina - unsupported

On OSX 10.15, python 3.11.5

import jobspy
Segmentation fault: 11

from jobspy import scrape_jobs
Segmentation fault: 11

Any ideas?

Can location change

Hi there! Thank you for the amazing project.
Is there any way to change the country?

Indeed USA getting instantly blocked

Hi, dear creators.

I'm doing a simple test with your demo file.
Last week, on 27 Nov., I was able to fetch results from Indeed.

This week I ran the same script and was blocked immediately ("403 error").
I tried using a proxy, but to no avail.

enh: show html tags in description

This bit of code:

jobs = scrape_jobs(
    site_name=["linkedin"],
    search_term="software engineer",
    location="Texas, US",
    results_wanted=15,
    country_indeed='US'  # only needed for indeed / glassdoor
)

returns a description that does not contain any \n characters. This means that, when rendered, the text comes out as one big block.

ZipRecruiter Job Scraper returns 'NoneType' exception

What I have:
Python 3.11, clean install

from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter"],
    search_term="software engineer",
    location="Dallas, TX",
    results_wanted=10,
    country_indeed="USA",  # only needed for indeed
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
jobs.to_csv("jobs.csv", index=False)  # / to_xlsx

When I try to run this, I get the following error:

jobspy.scrapers.exceptions.ZipRecruiterException: 'NoneType' object is not iterable
which points to:

line 90, in scrape
 jobs_on_page, continue_token = self.find_jobs_in_page(scraper_input, continue_token)

this method.

I tried different search parameters, and ZipRecruiter does not seem to work with any of them.

job_type filter doesn't work

The job_type filter doesn't work, even if we leave out the other criteria (like is_remote).
For example, "fulltime", "temporary", and "contract" don't produce any errors, but the returned data has job types different from the one requested.

PS: Just realized that bunsly.com doesn't have a job_type filter so I'm guessing you're already aware of this?

Class type error

Can somebody else please check the type of the site_name parameter in scrape_jobs()?

Argument of type "list[str]" cannot be assigned to parameter "site_name" of type "str | Site | List[Site]" in function "scrape_jobs"
  "Literal['linkedin']" is incompatible with "Site"
  "Literal['zip_recruiter']" is incompatible with "Site"
Pylance reportGeneralTypeIssues (https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportGeneralTypeIssues)

This is the error I keep receiving.
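
The message is a static type-checker complaint: a list of strings does not match the annotation "str | Site | List[Site]", even if the call works at runtime. The sketch below shows one way an annotation and a normalizing step could accept both strings and enum members. It is an assumption-laden illustration: the `Site` enum here is a stand-in (only the values 'linkedin' and 'zip_recruiter' come from the error above), and `to_sites` is a hypothetical helper, not part of jobspy.

```python
from enum import Enum
from typing import List, Union


class Site(Enum):  # hypothetical stand-in, not jobspy's actual enum
    LINKEDIN = "linkedin"
    INDEED = "indeed"
    ZIP_RECRUITER = "zip_recruiter"


SiteLike = Union[str, Site]


def to_sites(site_name: Union[SiteLike, List[SiteLike]]) -> List[Site]:
    """Normalise a string, a Site, or a mixed list of both into a list of Site."""
    names = site_name if isinstance(site_name, list) else [site_name]
    # Site("linkedin") looks an enum member up by its value.
    return [s if isinstance(s, Site) else Site(s.lower()) for s in names]


print(to_sites(["linkedin", "zip_recruiter"]))
```

With an annotation like this one, a call passing `["linkedin", "zip_recruiter"]` would satisfy the checker; on the caller's side, passing enum members instead of strings is the other way to match a `List[Site]` annotation.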

Long running time when deploying

I want to create an API to crawl jobs. It works well in my local environment, but it runs for a long time or returns nothing at all when I deploy the API to cloud platforms such as Render or Azure. I also tried Flask + Celery locally, but the scrape_jobs function does not return anything.

Add threading on LinkedIn scraper

The ZipRecruiter and Indeed scrapers are multi-threaded to speed up processing. The LinkedIn scraper is not at the moment: LinkedIn denies clients that send too many requests, so threading has to be implemented more carefully there than for the other two sites.

usejobspy.com

The instance's IP has been blocked. Please wait until we fix it.

RunTime Error

Code (the default example), followed by the error output:

import csv
from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
    search_term="software engineer",
    location="Dallas, TX",
    results_wanted=20,
    hours_old=72, # (only linkedin is hour specific, others round up to days old)
    country_indeed='USA'  # only needed for indeed / glassdoor
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_xlsx

ERROR CODE BELOW

Testing
panic: runtime error: index out of range [1] with length 1

goroutine 18 [running, locked to thread]:
github.com/bogdanfinn/tls-client.stringToSpec({0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0, ...}, ...)
	/Users/bogdan/go/pkg/mod/github.com/bogdanfinn/[email protected]/ja3.go:89 +0xe5f


ZipRecruiter unstable

We're using ZipRecruiter's API, but it requires some auth cookies that seem to expire, resulting in a 403 after some time.

Cookies of interest:

  • __cf_bm (Cloudflare)
  • ziprecruiter_session

I can't install python-jobspy on my venv python3.12.1

I can't install python-jobspy on my venv python3.12.1.
I'm getting the
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
error. I've tried many different methods but it always comes back to this.

Job URL and Company Image

It would be cool if we could get the company image and the external job URL rather than the LinkedIn, Indeed, or Glassdoor one. E.g. if it's an Apple job ad on LinkedIn, the URL should be the one that takes the user directly to Apple's job page.

All Indeed Jobs have null Descriptions

Used the simple example from the README and all Indeed results return a null description. (Also tried a few others)

LinkedIn, Zip, and Glassdoor all seem to be working fine
