When there are too many job postings from the same company at the time of job searching,

Prevent adding company to employee_counts when last recorded time is less than X days (linkedin-job-scraper, closed)

ArshKA commented on May 27, 2024

Prevent adding company to employee_counts when last recorded time is less than X days
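
One way the requested check could look is sketched below with Python's built-in sqlite3 module. This is only an illustration under assumptions, not the project's actual code: the column names (company_id, employee_count, time_recorded), the use of Unix seconds for time_recorded, and the helper name record_employee_count are all hypothetical. The insert is skipped when the same company already has a row newer than min_days days.

```python
import sqlite3
import time

SECONDS_PER_DAY = 86_400


def record_employee_count(conn, company_id, employee_count, min_days, now=None):
    """Insert a row unless this company was recorded within the last `min_days` days.

    Assumes time_recorded holds Unix seconds (an assumption, not the real schema).
    Returns True if a row was inserted, False if it was skipped.
    """
    now = int(time.time()) if now is None else now
    cutoff = now - min_days * SECONDS_PER_DAY
    before = conn.total_changes
    # INSERT ... SELECT lets the insert carry a condition, which a plain
    # INSERT ... VALUES statement cannot do.
    conn.execute(
        """
        INSERT INTO employee_counts (company_id, employee_count, time_recorded)
        SELECT ?, ?, ?
        WHERE NOT EXISTS (
            SELECT 1 FROM employee_counts
            WHERE company_id = ? AND time_recorded > ?
        )
        """,
        (company_id, employee_count, now, company_id, cutoff),
    )
    conn.commit()
    return conn.total_changes > before
```

Passing `now` explicitly is only there to make the helper easy to test; in normal use it would default to the current time.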

from linkedin-job-scraper.

Comments (7)

AndhikaWB commented on May 27, 2024

The problem with option 1 is that a plain INSERT INTO ... VALUES statement doesn't support a WHERE clause (so a more complex sub-query would be needed, if that's even possible). The problem with option 2 is that we need to make a separate script, with steps like these:

1. Drop record_id, or use SELECT without record_id
2. Remove duplicated rows using DISTINCT
3. Remove rows where the time between recordings is too short using a WHERE clause? (optional, may be pretty hard using SQL alone)
4. Re-add the primary key using something like this (not sure whether SQLite supports it or not)
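
The steps above can be sketched roughly as follows, using Python's sqlite3 module. This is a minimal sketch under assumptions: the column list (record_id, company_id, employee_count, time_recorded) and the function name rebuild_employee_counts are hypothetical and may not match the real table. SQLite cannot add a primary key to an existing table, so the sketch builds a new table and swaps it in. The optional third step is left out.

```python
import sqlite3


def rebuild_employee_counts(conn):
    """Drop record_id, remove exact duplicate rows, then re-add a primary key."""
    conn.executescript(
        """
        BEGIN;
        -- New table with a fresh integer primary key (the "re-add" step).
        CREATE TABLE employee_counts_new (
            record_id INTEGER PRIMARY KEY,
            company_id INTEGER NOT NULL,
            employee_count INTEGER,
            time_recorded INTEGER
        );
        -- Copy every column except record_id, keeping distinct rows only.
        INSERT INTO employee_counts_new (company_id, employee_count, time_recorded)
        SELECT DISTINCT company_id, employee_count, time_recorded
        FROM employee_counts
        ORDER BY time_recorded, company_id;
        -- Swap the rebuilt table into place.
        DROP TABLE employee_counts;
        ALTER TABLE employee_counts_new RENAME TO employee_counts;
        COMMIT;
        """
    )
```

Backing up the database file before running something like this would be sensible, since the old table is dropped.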

AndhikaWB commented on May 27, 2024

Or we could record the time in milliseconds/microseconds and make time_recorded the primary key, but we would still need the steps above to fix the existing values in the table. This solution will also break once quantum computers are released lol.

I think it's easier to fix these problems using Pandas rather than SQL itself.

ArshKA commented on May 27, 2024

Yeah, for future values I think making time_recorded and company_id the primary key would be sufficient. For the data already uploaded, I'll just write a SQL query to remove rows with the same company_id and time. That should work, right?
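
A cleanup query of that kind might look like the sketch below. It is an illustration only: the column names are assumed, the function name delete_duplicate_records is hypothetical, and it relies on SQLite's implicit rowid to decide which row of each duplicate group survives (the one inserted first).

```python
import sqlite3


def delete_duplicate_records(conn):
    """Keep one row per (company_id, time_recorded) pair and delete the rest.

    Returns the number of rows deleted.
    """
    before = conn.total_changes
    conn.execute(
        """
        DELETE FROM employee_counts
        WHERE rowid NOT IN (
            SELECT MIN(rowid)
            FROM employee_counts
            GROUP BY company_id, time_recorded
        )
        """
    )
    conn.commit()
    return conn.total_changes - before
```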

ArshKA commented on May 27, 2024

I'll update that tomorrow

AndhikaWB commented on May 27, 2024

> Yeah, for future values I think making time_recorded and company_id the primary key would be sufficient.

Isn't a primary key supposed to be unique? I've never used two primary keys before, but I think each primary key represents its own column (they aren't grouped together if there are multiple primary keys). In your case, making time_recorded a primary key will make time_recorded unique, which may not allow duplicated time_recorded values like in the screenshot below? Or have I been learning something wrong this whole time?

Making company_id unique may also cause loss of information if you intend to turn this into time series data in the future. People could measure company growth, mass layoffs, or turnover rate. Though I guess that's a company-specific thing, so it will rarely be needed (e.g. only when applying to that company).

It's up to you I guess, I myself rarely need that data.

ArshKA commented on May 27, 2024

I just changed it so time_recorded and company_id form a composite key. This means there shouldn't be any duplicate rows with the same time and company ID. Multiple rows can share the same time or the same ID, just not both at once. This should prevent the problem you initially described, with multiple consecutive jobs by the same company taking up unnecessary space.
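
To illustrate how a composite key behaves, here is a small self-contained sketch with sqlite3. The table definition is an assumption made for the example and is not copied from create_db.py; INSERT OR IGNORE is used so that a conflicting row is skipped silently instead of raising an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee_counts (
        company_id INTEGER NOT NULL,
        employee_count INTEGER,
        time_recorded INTEGER NOT NULL,
        PRIMARY KEY (company_id, time_recorded)
    )
    """
)

rows = [
    (10, 100, 5),
    (11, 50, 5),   # same time, different company: allowed
    (10, 120, 6),  # same company, different time: allowed
    (10, 999, 5),  # same company and same time: skipped by OR IGNORE
]
conn.executemany("INSERT OR IGNORE INTO employee_counts VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT company_id, employee_count, time_recorded "
    "FROM employee_counts ORDER BY time_recorded, company_id"
).fetchall()
print(stored)
```

Note that a key like this only rejects rows whose timestamp is identical. Rows for the same company recorded a few seconds apart would still both be stored, so it is not a substitute for an "X days" check.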

AndhikaWB commented on May 27, 2024

Oh right, a composite primary key. In that case you should undo the changes I made in the previous commit, see 0ed9cb1 (create_db.py).
