Comments (7)
The problem with option 1 is that INSERT INTO doesn't support a WHERE clause (so a complex sub-query would be needed, if that's even possible). The problem with option 2 is that we need a separate script, with steps like this:
- DELETE record_id, or use SELECT without the record id
- Remove duplicated rows using DISTINCT
- Remove rows where the recorded time is too short, using a WHERE clause? (optional, may be pretty hard using SQL alone)
- Re-add the primary key using something like this (not sure whether SQLite supports it or not)
from linkedin-job-scraper.
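The steps above could be sketched roughly like this with Python's sqlite3. The table and column names (jobs, record_id, company_id, time_recorded, job_count) are assumptions based on this discussion; the real schema in create_db.py may differ:

```python
import sqlite3

# Minimal sketch of the clean-up steps above, assuming a table named
# "jobs" with columns (record_id, company_id, time_recorded, job_count).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (
        record_id INTEGER PRIMARY KEY,
        company_id INTEGER,
        time_recorded INTEGER,
        job_count INTEGER
    );
    INSERT INTO jobs (company_id, time_recorded, job_count) VALUES
        (1, 100, 5),
        (1, 100, 5),  -- becomes a duplicate once record_id is dropped
        (2, 200, 3);
""")

# SQLite cannot add a PRIMARY KEY to an existing table, so the usual
# workaround is: build a deduplicated copy, drop the old table, rename.
conn.executescript("""
    CREATE TABLE jobs_clean (
        company_id INTEGER,
        time_recorded INTEGER,
        job_count INTEGER,
        PRIMARY KEY (company_id, time_recorded)
    );
    INSERT INTO jobs_clean
        SELECT DISTINCT company_id, time_recorded, job_count FROM jobs;
    DROP TABLE jobs;
    ALTER TABLE jobs_clean RENAME TO jobs;
""")

rows = conn.execute("SELECT * FROM jobs ORDER BY company_id").fetchall()
print(rows)  # duplicates are gone
```

This also answers the "not sure whether SQLite supports it" doubt: ALTER TABLE ... ADD PRIMARY KEY isn't supported, but the copy-drop-rename pattern achieves the same result.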
Or we could use milliseconds/microseconds as the time and make time_recorded the primary key, but we'd still need to do the steps above to fix the previous values in the table. This solution will also break once quantum computers are released lol.
I think it's easier to fix these problems using Pandas rather than SQL itself.
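For comparison, the same clean-up in pandas is a one-liner with drop_duplicates (column names are assumptions, as above):

```python
import pandas as pd

# Sample data mirroring the duplicated-rows problem discussed above.
df = pd.DataFrame({
    "company_id":    [1, 1, 2],
    "time_recorded": [100, 100, 200],
    "job_count":     [5, 5, 3],
})

# Drop rows that repeat the (company_id, time_recorded) pair,
# keeping the first occurrence of each pair.
deduped = df.drop_duplicates(subset=["company_id", "time_recorded"])
print(len(deduped))  # 2
```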
Yeah, for future values I think making time_recorded and company_id the primary key would be sufficient. For the data already uploaded, I'll just write an SQL query to remove rows with the same company_id and time. That should work, right?
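One way to write that clean-up as a single query is to use SQLite's implicit rowid to keep the first row of each (company_id, time_recorded) group and delete the rest (table/column names are assumptions again):

```python
import sqlite3

# Reproduce the already-uploaded duplicates on a throwaway table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (company_id INTEGER, time_recorded INTEGER, job_count INTEGER);
    INSERT INTO jobs VALUES (1, 100, 5), (1, 100, 5), (2, 200, 3);
""")

# Keep the lowest rowid per (company_id, time_recorded) pair,
# delete every other row in the group.
conn.execute("""
    DELETE FROM jobs
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM jobs
        GROUP BY company_id, time_recorded
    )
""")

remaining = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(remaining)  # 2
```

Note this relies on the table having a rowid (i.e. it was not created WITHOUT ROWID), which is the SQLite default.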
I'll update that tomorrow
Yeah, for future values I think turning the time_recorded and company_id as the primary key would be sufficient.
Isn't a primary key supposed to be unique? I've never used two primary keys before, but I think each primary key represents its own column (they aren't grouped together when there are multiple). In your case, making time_recorded a primary key will make time_recorded unique, which may not allow duplicated time_recorded values like the screenshot below? Or have I been learning this wrong the whole time?
Also, making company_id unique may cause loss of information if you intend to make it time-series data in the future. People could measure company growth, mass layoffs, or turnover rate. Though I guess that's company-specific, so it will rarely be needed (e.g. only when applying at that company).
It's up to you I guess, I myself rarely need that data.
I just changed it so time_recorded and company_id form a composite primary key. This means there shouldn't be any duplicate rows with the same time and company id. Multiple rows can share the same time, or the same company id, just not both at once. This should prevent the problem you initially described with multiple consecutive jobs from the same company taking up unnecessary space.
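This is exactly what a composite primary key gives you: either column may repeat on its own, and only the pair must be unique. A quick demonstration in SQLite (schema names are assumptions, not the actual create_db.py schema):

```python
import sqlite3

# Composite primary key: uniqueness is enforced on the
# (company_id, time_recorded) pair, not on either column alone.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        company_id INTEGER,
        time_recorded INTEGER,
        job_count INTEGER,
        PRIMARY KEY (company_id, time_recorded)
    )
""")

conn.execute("INSERT INTO jobs VALUES (1, 100, 5)")
conn.execute("INSERT INTO jobs VALUES (1, 200, 6)")  # same id, new time: OK
conn.execute("INSERT INTO jobs VALUES (2, 100, 3)")  # same time, new id: OK

try:
    conn.execute("INSERT INTO jobs VALUES (1, 100, 9)")  # same pair: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```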
Oh right, a composite primary key. In that case you should undo the changes I made in the previous commit; see 0ed9cb1 (create_db.py).