Data Cleaning pipeline task by Kopernikus

Step 1. Create and activate the conda environment

conda env create -f environment.yml
conda activate data_cleaning

Step 2. Run the main file with the path to the images

python main.py --dir <path to images>

About the dataset:

  • This dataset comprises images depicting covered parking lots.
  • Multiple cameras have been employed to capture diverse perspectives within the dataset.
  • Each image is labeled with a unique identifier composed of the camera ID and timestamp.
  • Timestamps are provided in two formats: Unix and Python datetime.
  • The images encompass various times of the day.
  • Some images are noisy, and some cannot be read at all (they load as None).
  • Images vary in size.
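As a minimal sketch of how such identifiers could be split into a camera ID and a Unix timestamp: the two filename patterns below (`c20-1616769236212.png` for Unix milliseconds and `c20_2021_03_25__15_36_02.png` for the datetime form) are assumptions made for illustration, as is the helper name `parse_image_name` and the choice to treat datetime-style names as UTC. Adjust the patterns to the actual files.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical naming patterns (not taken from the dataset description).
UNIX_NAME = re.compile(r"^(?P<cam>[A-Za-z0-9]+)-(?P<ms>\d{13})$")
DATETIME_NAME = re.compile(
    r"^(?P<cam>[A-Za-z0-9]+)_(?P<dt>\d{4}_\d{2}_\d{2}__\d{2}_\d{2}_\d{2})$"
)


def parse_image_name(path):
    """Return (camera_id, unix_seconds), or None if the name matches neither pattern."""
    stem = Path(path).stem  # file name without directory and extension

    match = UNIX_NAME.match(stem)
    if match:
        # Unix form: the timestamp is assumed to be in milliseconds.
        return match["cam"], int(match["ms"]) / 1000.0

    match = DATETIME_NAME.match(stem)
    if match:
        parsed = datetime.strptime(match["dt"], "%Y_%m_%d__%H_%M_%S")
        # Assumption: datetime-style names are in UTC.
        return match["cam"], parsed.replace(tzinfo=timezone.utc).timestamp()

    return None
```

Converting both forms to the same unit makes images from one camera directly comparable by time, whichever naming style they use.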

How does the code work?

The entire process is repeated for each unique camera_id to systematically identify and mark images for removal across all cameras.

  1. Each camera provides a distinct viewpoint, so we organize the dataset based on the camera_id. We maintain a dictionary that associates each camera_id with its corresponding set of images.
  2. To facilitate sorting, we convert Python datetime objects to Unix timestamps. This ensures the chronological ordering of the images across all cameras.
  3. We sort the images based on their timestamps. This step is crucial as images captured closer in time are more likely to exhibit similarities.
  4. We define a sliding window mechanism to efficiently compare images. This involves selecting a base image and comparing it with others within a specified window.
  5. To optimize processing, we implement a caching mechanism. This prevents the unnecessary reloading of images within a window, improving overall efficiency.
  6. The first image within a window is chosen as the base image for comparison. Subsequent images within the window are compared against this base image.
  7. Images are compared using a scoring mechanism, and if the score exceeds a predefined threshold, the image path is marked for removal.
  8. Finally, we delete the images marked for removal from the directory.
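The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in main.py: `find_removals`, `load_image`, `score_fn` and `window_size` are hypothetical names, the window is assumed to advance one image at a time, and unreadable images are simply skipped. The comparison follows the wording of step 7 (a score above the threshold marks the image); if your score measures change rather than similarity, flip the comparison.

```python
from collections import defaultdict
from functools import lru_cache


def find_removals(records, load_image, score_fn, min_score, window_size=5):
    """Return the set of paths marked for removal.

    records: iterable of (camera_id, unix_timestamp, path) tuples.
    load_image: callable path -> image (or None if unreadable).
    score_fn: callable (base_image, other_image) -> number.
    """
    # Steps 1-3: group by camera; each group is sorted by timestamp below.
    by_camera = defaultdict(list)
    for camera_id, timestamp, path in records:
        by_camera[camera_id].append((timestamp, path))

    # Step 5: keep the most recent window of images in memory.
    cached_load = lru_cache(maxsize=window_size)(load_image)

    to_remove = set()
    for items in by_camera.values():
        paths = [path for _, path in sorted(items)]
        for start, base_path in enumerate(paths):
            if base_path in to_remove:
                continue  # already marked, never used as a base image
            base = cached_load(base_path)
            if base is None:
                continue  # assumption: unreadable images are skipped
            # Steps 4 and 6: compare the base with the rest of its window.
            for other_path in paths[start + 1:start + window_size]:
                if other_path in to_remove:
                    continue
                other = cached_load(other_path)
                if other is None:
                    continue
                # Step 7: mark the image when the score exceeds the threshold.
                if score_fn(base, other) > min_score:
                    to_remove.add(other_path)
    return to_remove
```

Step 8 would then delete every path in the returned set, for example with `pathlib.Path(path).unlink()`. Passing the loader and the scoring function as arguments keeps the window logic testable without any image library.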

Choosing the right values

The values for min_contour_area and min_score were chosen through experimentation. We set min_contour_area to 100 and min_score to 3000.

Starting from conservative values and analysing the results of each run, we found that these values give good results.
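One way to support this kind of experimentation is to compute the comparison scores once and then count how many images each candidate threshold would mark. The helper below is a hypothetical sketch, not part of main.py; it assumes the scores are already available as a list of numbers and uses the same rule as above (a score above min_score marks the image).

```python
def removal_counts(scores, candidate_thresholds):
    """Map each candidate min_score to the number of scores that exceed it."""
    return {
        threshold: sum(1 for score in scores if score > threshold)
        for threshold in candidate_thresholds
    }


# Made-up scores, for illustration only.
counts = removal_counts([500, 2500, 3500, 9000], [1000, 3000, 5000])
print(counts)  # {1000: 3, 3000: 2, 5000: 1}
```

Inspecting a few images on either side of each candidate threshold then shows whether it is too strict or too lenient.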

How to improve data collection

  • Curated images: A deep learning model can decide which images to capture based on specific criteria, so that only data relevant to the use case is collected. In a parking lot, for instance, the system could keep only images that contain cars. Storing only the necessary images reduces storage costs and the compute needed for processing.
  • GDPR compliance: Where the law requires it, faces and license plates must be blurred. This protects the company from later legal trouble.
  • Image naming: Use one consistent image naming scheme to avoid confusion and errors later.
  • All angles: For a given timestamp, an image of the parking lot should be taken from every camera for better scene understanding.

Contributors

  • sauravgarg540
