This project extracts text from images using OCR and converts it into structured data in CSV format.
- Python 3.12.3+
- Tesseract OCR
- GitHub Account
- Git
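The prerequisites above can be checked from Python before going further. This is a minimal sketch rather than part of the project: the function name `check_prerequisites` is hypothetical, and it assumes the Git and Tesseract executables are named `git` and `tesseract` on your PATH.

```python
import shutil
import sys

# Minimum Python version taken from the prerequisites list.
MIN_PYTHON = (3, 12, 3)


def check_prerequisites(required_tools=("git", "tesseract")):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:3] < MIN_PYTHON:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(f"Python {found} found, 3.12.3 or newer expected")
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print("MISSING:", issue)
    if not issues:
        print("All prerequisites found.")
```

A GitHub account cannot be verified locally, so it is left out of the check.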
- Visit the GitHub Signup Page
- Fill in the required details (username, email, password) and complete the sign-up process.
- Verify your email address by clicking on the verification link sent to your email.
- Download Git for Windows from the official website: git-scm.com
- Run the installer and follow the instructions.
- Verify the installation:
git --version
- On macOS, install Homebrew if you haven't already:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Install Git:
brew install git
- Verify the installation:
git --version
- On Debian/Ubuntu Linux, update the package list:
sudo apt update
- Install Git:
sudo apt install git
- Verify the installation:
git --version
- Clone the repository:
git clone https://github.com/your-username/ocr-image-processing.git
cd ocr-image-processing
- Install the required Python packages:
pip install -r requirements.txt
- Ensure Tesseract is installed and the TESSDATA_PREFIX environment variable is set:
export TESSDATA_PREFIX=/usr/local/share/tessdata/
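To confirm that `TESSDATA_PREFIX` points somewhere useful, a short check like the following may help. It is a sketch with a hypothetical function name, and it assumes the variable points directly at the directory holding the `.traineddata` language files, as in the export above.

```python
import os
from pathlib import Path


def find_tessdata_languages(env=None):
    """Return the sorted language codes found under TESSDATA_PREFIX.

    Raises RuntimeError when the variable is unset or is not a directory.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    tessdata = Path(prefix)
    if not tessdata.is_dir():
        raise RuntimeError(f"TESSDATA_PREFIX is not a directory: {tessdata}")
    # Tesseract language models are stored as <lang>.traineddata files.
    return sorted(p.stem for p in tessdata.glob("*.traineddata"))
```

Calling `find_tessdata_languages()` with no arguments reads the real environment. An empty result means the directory exists but contains no language data.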
- On Windows, download Python from the official website: python.org
- Run the installer and follow the instructions, making sure to check the box that adds Python to your PATH.
- Verify the installation:
python --version
pip --version
- On macOS, install Homebrew if you haven't already:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Install Python:
brew install python
- Verify the installation:
python3 --version
pip3 --version
- On Debian/Ubuntu Linux, update the package list:
sudo apt update
- Install Python:
sudo apt install python3 python3-pip
- Verify the installation:
python3 --version
pip3 --version
- Place the images to be processed in the data/images directory.
- Run the main script with the images directory and output directory as arguments:
python src/main.py data/images data/output
- The processed CSV files will be saved in the specified output directory.
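The internals of `src/main.py` are not shown here, so the following is only an illustrative sketch of an images-to-CSV step, not the project's actual implementation. The function names, the accepted file extensions, and the `line_number`/`text` column layout are all assumptions. The default OCR backend uses `pytesseract.image_to_string`, which requires the third-party `pytesseract` and `Pillow` packages.

```python
import csv
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the project accepts.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_with_tesseract(image_path):
    """Default OCR backend; imported lazily so other backends need no extras."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def images_to_csv(images_dir, output_dir, ocr=ocr_with_tesseract):
    """Write one CSV per image, with one row per non-empty line of OCR text."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(images_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        text = ocr(image_path)
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        csv_path = output / f"{image_path.stem}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["line_number", "text"])  # hypothetical header
            writer.writerows(enumerate(lines, start=1))
        written.append(csv_path)
    return written
```

Passing the OCR step in as a parameter keeps the CSV logic testable without Tesseract installed. For example, `images_to_csv("data/images", "data/output", ocr=lambda p: "sample text")` exercises the file handling alone.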
This project uses GitHub Actions for CI/CD. When you push images to the master branch, the GitHub Actions workflow automatically processes the images and uploads the CSV outputs as artifacts.
- Push Images to GitHub:
git add data/images/
git commit -m "Add new images"
git push origin master
- Download CSV Outputs:
  - Go to the Actions tab in your GitHub repository.
  - Select the latest workflow run.
  - Scroll down to the Artifacts section.
  - Download the csv-files artifact.
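GitHub delivers a downloaded artifact as a zip archive. The sketch below reads every CSV inside such an archive without unpacking it to disk. The function name is hypothetical, and it assumes the CSV files are UTF-8 encoded.

```python
import csv
import io
import zipfile


def read_csvs_from_artifact(zip_source):
    """Map each CSV file name inside the archive to its rows (lists of strings).

    zip_source may be a file path or a binary file-like object.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore non-CSV entries
            with archive.open(name) as raw:
                # Wrap the binary stream so csv.reader receives text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.reader(text))
    return tables
```

If the download was saved as `csv-files.zip` (an assumed file name), `read_csvs_from_artifact("csv-files.zip")` returns a dictionary keyed by the CSV file names in the archive.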
Contributions are welcome! Please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.