Comments (7)

AndhikaWB commented on May 27, 2024

The problem with option 1 is that INSERT INTO doesn't support a WHERE clause (so a complex sub-query would be needed, if that's even possible). The problem with option 2 is that we need a separate script, with steps like this:

  1. DELETE record_id, or use SELECT without record_id
  2. Remove duplicated rows using DISTINCT
  3. Remove some rows where the time between records is too short using a WHERE clause? (optional, may be pretty hard using SQL alone)
  4. Re-add the primary key using something like this (not sure whether SQLite supports it or not)
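The steps above could be sketched roughly as follows, using Python's built-in sqlite3 module. This is only a sketch under assumptions: the table name `employee_counts` and the `employee_count` column are hypothetical, only `record_id`, `company_id` and `time_recorded` come from this thread, and the optional step 3 is left out. Because SQLite cannot add a primary key to an existing table with ALTER TABLE, the sketch copies the distinct rows into a new table instead.

```python
import sqlite3


def rebuild_without_duplicates(conn):
    """Rebuild the (hypothetical) employee_counts table with distinct rows."""
    cur = conn.cursor()
    # Step 4 is prepared first: the new table has its own integer primary key,
    # so SQLite assigns fresh record_id values as rows are inserted.
    cur.execute("""
        CREATE TABLE employee_counts_clean (
            record_id INTEGER PRIMARY KEY,
            company_id INTEGER,
            employee_count INTEGER,
            time_recorded INTEGER
        )
    """)
    # Steps 1 and 2: select everything except record_id and drop exact duplicates.
    cur.execute("""
        INSERT INTO employee_counts_clean (company_id, employee_count, time_recorded)
        SELECT DISTINCT company_id, employee_count, time_recorded
        FROM employee_counts
        ORDER BY time_recorded, company_id
    """)
    # Swap the cleaned table in place of the old one.
    cur.execute("DROP TABLE employee_counts")
    cur.execute("ALTER TABLE employee_counts_clean RENAME TO employee_counts")
    conn.commit()


# Small in-memory demo with made-up values.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee_counts (
        record_id INTEGER PRIMARY KEY,
        company_id INTEGER,
        employee_count INTEGER,
        time_recorded INTEGER
    )
""")
conn.executemany(
    "INSERT INTO employee_counts (company_id, employee_count, time_recorded) VALUES (?, ?, ?)",
    [(1, 100, 1000), (1, 100, 1000), (2, 50, 1000)],
)
rebuild_without_duplicates(conn)
rows = conn.execute(
    "SELECT record_id, company_id, employee_count, time_recorded "
    "FROM employee_counts ORDER BY record_id"
).fetchall()
```

In the demo, the two identical rows for company 1 collapse into one, and the surviving rows get new consecutive record_id values.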

from linkedin-job-scraper.

AndhikaWB commented on May 27, 2024

Or we could record the time in milliseconds/microseconds and make time_recorded the primary key, but we would still need to do the steps above to fix the existing values in the table. This solution will also break once quantum computers are released lol.

I think it's easier to fix these problems using Pandas rather than SQL itself.

ArshKA commented on May 27, 2024

Yeah, for future values I think making time_recorded and company_id the primary key would be sufficient. For the data already uploaded, I'll just create an SQL query to remove rows with the same company_id and time. That should work, right?
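A cleanup query of that kind might look like the sketch below. It is not the query actually used in the repository: the table name `employee_counts` and the `employee_count` column are assumptions, and it relies on SQLite's implicit `rowid` to keep the earliest row for each (company_id, time_recorded) pair.

```python
import sqlite3

# Keep the row with the smallest rowid in each (company_id, time_recorded)
# group and delete the rest. Table and column names are partly hypothetical.
DEDUP_SQL = """
DELETE FROM employee_counts
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM employee_counts
    GROUP BY company_id, time_recorded
)
"""

# Small in-memory demo with made-up values.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee_counts (
        record_id INTEGER PRIMARY KEY,
        company_id INTEGER,
        employee_count INTEGER,
        time_recorded INTEGER
    )
""")
conn.executemany(
    "INSERT INTO employee_counts (company_id, employee_count, time_recorded) VALUES (?, ?, ?)",
    [(1, 100, 1000), (1, 101, 1000), (1, 100, 2000), (2, 50, 1000)],
)
deleted = conn.execute(DEDUP_SQL).rowcount
conn.commit()
remaining = conn.execute(
    "SELECT company_id, time_recorded, employee_count "
    "FROM employee_counts ORDER BY company_id, time_recorded"
).fetchall()
```

Unlike DISTINCT, this removes rows that share company_id and time_recorded even when their other columns differ, so it decides which of the conflicting rows survives (here, the first one inserted).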

ArshKA commented on May 27, 2024

I'll update that tomorrow

AndhikaWB commented on May 27, 2024

> Yeah, for future values I think making time_recorded and company_id the primary key would be sufficient.

Isn't a primary key supposed to be unique? I've never used two primary keys before, but I think each primary key represents its own column (they aren't grouped together if there are several). In your case, making time_recorded a primary key would make time_recorded unique, which may not allow duplicated time_recorded values like in the screenshot below? Or have I been learning something wrong this whole time?
[image: screenshot of rows with duplicated time_recorded values]

Also, making company_id unique may cause loss of information if you intend to turn this into time series data in the future. People could measure company growth, mass layoffs, or turnover rate. Though I guess it's a company-specific thing, so it will rarely be needed (e.g. only when applying to that company).

It's up to you I guess, I myself rarely need that data.

ArshKA commented on May 27, 2024

I just changed it so time_recorded and company_id form a composite ID. This means there shouldn't be any duplicate rows with the same time and company ID. Multiple rows can share the same time or the same company ID, just not both at once. This should prevent the problem you initially described, with multiple consecutive jobs by the same company taking up unnecessary space.
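As a rough illustration of how a composite key behaves in SQLite, here is a minimal sketch. The table name `employee_counts` and the `employee_count` column are assumptions for the example, not necessarily the schema in create_db.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY constraint covers the pair of columns, so only the
# combination (company_id, time_recorded) has to be unique.
conn.execute("""
    CREATE TABLE employee_counts (
        company_id INTEGER NOT NULL,
        employee_count INTEGER,
        time_recorded INTEGER NOT NULL,
        PRIMARY KEY (company_id, time_recorded)
    )
""")


def try_insert(company_id, employee_count, time_recorded):
    """Return True if the row was inserted, False if the key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO employee_counts VALUES (?, ?, ?)",
                (company_id, employee_count, time_recorded),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(1, 100, 1000),  # first row
    try_insert(2, 50, 1000),   # same time, different company: accepted
    try_insert(1, 101, 2000),  # same company, different time: accepted
    try_insert(1, 102, 1000),  # same company and same time: rejected
]
```

Only the last insert fails, because it repeats both values of the key. A scraper that would rather skip such rows silently could use SQLite's `INSERT OR IGNORE` instead of catching the exception.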

AndhikaWB commented on May 27, 2024

Oh right, a composite primary key. In that case you should undo the changes I made in the previous commit, see 0ed9cb1 (create_db.py).
