Coder Social home page Coder Social logo

csv-workshop's Introduction

Data processing for large files workshop

The workshop will show you how to process large csv and Excel files using powerful command line tools, without the need to open Excel.

Setup

To get our work environment started log in to the server I sent on the Slack channel, and run (don't forget to substitute your name):

cd csv_workshop
./run_workshop.sh <your name here>

This will create a Docker container with your name - This is a sandbox with all the tools that we need already pre-installed.

Conversion

You will find a sample file in your root directory: Profile.xls. As a first step, let's convert it to a csv file for easier processing:

in2csv Profile.xls > categories.csv

This command takes the first worksheet and pipes its contents into a file called categories.csv. But what if we want to take a different worksheet? Let's do that for the prospecting terms worksheet:

in2csv --sheet "Prospecting Terms" Profile.xls > prospecting.csv

So now we also have a csv for the prospecting terms. In this way you can easily break apart a multiple worksheet excel file.

Peek-a-boo

So now that we have a file, let's see what it looks like:

csvlook categories.csv

Ouch! That's a mess. Too many columns and too many lines. Let's try and do better.

What do you column?

Let's try this (the '|' --pipe-- operator takes one command's output and uses it as input for the next):

csvcut -c1,3,5 categories.csv | csvlook

This way, we only refer to the first, third and fifth columns. Better, but still too long. Let's add one more thing:

csvcut -c1,3,5 categories.csv > short_categories.csv
csvlook short_categories.csv | head

With these two commands, we created a new smaller file, and sent the formatted view of the file to a utility that only shows the first 10 lines. What if we want to see the first 20 lines?

cat short_categories.csv | head -20

Insights

Now that we know how to look at a file, let's see what we can learn about it:

csvstat short_categories.csv

The result:

  1. Keyword
        <type 'unicode'>
        Nulls: False
        Unique values: 398
        5 most frequent values:
                Call Center Management: 5
                Infrastructure Engineer:        4
                Security Architecture Engineer: 4
                Chief Customer Officer: 4
                Infrastructure Manager: 4
        Max length: 35
  2. Type
        <type 'unicode'>
        Nulls: False
        Values: Field, Indicator, Technology, indicator, field
  3. Strength
        <type 'unicode'>
        Nulls: True
        Values: Strong, Medium, Weak, strong

Row count: 648

You can get an amazing amount of information with a short command!

Filtering & Sorting

Let's say we want to see a subset of the data, filtered by a line's strength (Weak) and sorted by it's type:

csvgrep -c Strength -m Weak short_categories.csv | csvsort -c Type | csvlook

To create a file with the results:

csvgrep -c Strength -m Weak short_categories.csv | csvsort -c Type > weak_categories_by_type.csv

Deep diving into the data

If we really want to dive into the data and manipulate it beyond a simple search, we can take the csv file and refer to it as an SQL table. First up, let's see what our file would look like if it were an SQL table:

csvsql categories.csv

The result:

CREATE TABLE categories (
        "Keyword" VARCHAR(35) NOT NULL,
        "Keyword is alias" BOOLEAN NOT NULL,
        "Type" VARCHAR(10) NOT NULL,
        "Aliases" VARCHAR(609),
        "Strength" VARCHAR(6),
        "Category" VARCHAR(33),
        "Type_2" VARCHAR(10),
        CHECK ("Keyword is alias" IN (0, 1))
);

From this we see that our table name is categories, as well as the names of the columns we can refer to. Let's make use of this. First run a script that starts a MySQL instance:

./init_mysql.sh

Now let's import our categories and prospecting files into tables. The following commands create tables and import rows from the files:

csvsql --db mysql://localhost:3306/workshop --insert categories.csv
csvsql --db mysql://localhost:3306/workshop --insert prospecting.csv

Let's run mysql. First thing, let's switch to the "workshop" database (already created by the script we ran):

mysql

In MySql:

use workshop;

You can now go crazy and do whatever MySQL allows you. For example:

select distinct c.keyword from categories c, prospecting p where c.aliases like '%'||p.`key term`||'%' and p.levels = 'Staff';

csv-workshop's People

Contributors

ygaller avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.