Coder Social home page Coder Social logo

public-person's Introduction

public-person dataset: identifying unically Brazilian public persons.

    

Semantic as schema.org (sc) complemented by Wikidata (wd):

The main source for "valid public names" (and license to use here) are the Brazilians Official Gazettes.

Data sources and preparation

Data comes from multiple sources as follows:

  • TSE - Tribunal Superior Eleitoral. Preparation here. Produces data/tse-*.csv files and/or SQL.
  • ...

For CSV files description, see datapackage.json.

License

This material is licensed by its maintainers under the Public Domain Dedication and License.

Nevertheless, it should be noted that this material is ultimately sourced from TSE, ....

public-person's People

Contributors

ppkrauss avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

public-person's Issues

normalize and canonize names

Rules to normalize:

  • same name same ASCII representation, by unaccent(name) function.
  • similar name same Metaphone-pt

Rules to canonize (choice of the official):

  1. WHEN same CPF and similar name (or same first names)
    1.1. check official name with a CPF-resolver
    1.2. (if not possible) use the most recent

  2. WHEN no CPF and same name (and same birthDate)
    2.1. use the most recent when more than 6 years diff
    2.2. (when else) use the "most accented" version or "most standard pt-BR" (eg. preffer use of i insted y)

... Use some log to notice mesages in conflict resolutions ...

Example:

               name               | birthdate  |           source         
----------------------------------+------------+---------------------------
 ANTONIO SETUBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2016
 ANTÔNIO SETUBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2008
 ANTONIO SETÚBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2004
 ANTÔNIO SETÚBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2012
 FABRICIO JOSE SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2010
 FABRICIO JOSÉ SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2004
 FABRÍCIO JOSÉ SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2000
 FABRICIO JOSÉ SÁTIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2008
 FABRÍCIO JOSÉ SÁTIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2012

Most accented "ANTÔNIO SETÚBAL SILVESTRE" of 2012, most recent "ANTONIO SETUBAL SILVESTRE" of 2016...

After canonized, delete records and register all variants in the info JSON

{
 "etc":"etc",
 "synonymous":[
     {"name":"ANTONIO SETUBAL SILVESTRE", "source":123},
     {"name":"ANTÔNIO SETUBAL SILVESTRE", "source":456}
  ]
}

check best performance after stable CSV algorithms

Small and medium files

CREATE FUNCTION lib.get_csvfile(
  file text,
  delim_char char(1) = ',',
  quote_char char(1) = '"'
) RETURNS setof text[]  AS $f$
  import csv
  return csv.reader(
     open(file, 'rb'), 
     quotechar=quote_char, 
     delimiter=delim_char, 
     skipinitialspace=True, 
     escapechar='\\'
  )
$f$ immutable language PLpythonU;

CREATE FUNCTION lib.get_csvline(
    _line text,
    _delim_char char(1) = ',',  
    _quote_char char(1) = '"'
) RETURNS text[] AS $f$
    import csv
    return [row for row in csv.reader(
        [_line], 
        quotechar = _quote_char, 
        delimiter = _delim_char, 
        skipinitialspace = True, 
        escapechar = '\\'
    )][0]
$f$ immutable language PLpythonU;

big CSV files

For big CSV files, best is to use EXTENSION file_fdw. Complex but works fine, and need only one command after implemented.

    CREATE SCHEMA csv; -- a library, easy to drop 

    CREATE FUNCTION csv.split_csv2(
      line text,                  -- the input CSV UTF-8 line
      delim_char char(1) = ',',   -- can be also E'\t', ';', '|', etc.
      quote_char char(1) = '"'   -- field quotation
    ) RETURNS setof text[] AS $f$
      import csv
      return csv.reader(
          line.splitlines(), 
          quotechar=quote_char, 
          delimiter=delim_char, 
          skipinitialspace=True, 
          escapechar='\\'
      )
    $f$ IMMUTABLE language PLpythonU;

    CREATE FUNCTION csv.split_csv_line2(
      text, char(1) DEFAULT ',', char(1) DEFAULT '"'
    ) RETURNS text[] AS $f$
      SELECT x FROM csv.split_csv($1,$2,$3) x LIMIT 1;
    $f$ language SQL IMMUTABLE;

   -- not reliable but fast? http://stackoverflow.com/a/42926265/287948
   -- compare performance with csv.split_csv_line2()
   -- ideal a C implementation
    CREATE OR REPLACE FUNCTION csv.split_csv_line(
       line text,                 -- the input CSV string
       delim_char char(1) = ',',  -- can be also E'\t', ';', '|', etc.
       quote_char char(1) = '"',  -- field quotation
       OUT result text[]
    ) LANGUAGE plpgsql AS
    $f$DECLARE
       i integer;
       t text := '';
       c char(1);
       /*
        * 0 means unquoted
        * 1 means quoted
        * 2 means quoted, and we just read a quote
        */
       q integer := 0;
    BEGIN
       /* loop through the characters */
       FOR i IN 1..length(line) LOOP
          /* get i-th character */
          c := substring(line FROM i FOR 1);
          /* end of string is at an unquoted delimiter */
          IF c = delim_char AND q <> 1 THEN
             result := result || t;
             t := '';
             q := 0;
          ELSIF c = quote_char THEN
             CASE q
                WHEN 0 THEN
                   q := 1;
                WHEN 1 THEN
                   IF c = quote_char THEN
                      q := 2;
                   ELSE
                      t := t || c;
                   END IF;
                WHEN 2 THEN
                   q := 1;
                   t := t || quote_char;
             END CASE;
          ELSE
             IF q = 2 THEN
                q := 0;
             END IF;
             t := t || c;
          END IF;
       END LOOP;
       /* add the last string */
       result := result || t;
    END;$f$;

    CREATE EXTENSION file_fdw WITH SCHEMA csv;
    CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;

    CREATE or replace FUNCTION csv.ftable_drop(tname text) RETURNS void AS $f$
      BEGIN
      EXECUTE format('DROP VIEW IF EXISTS %s', 'csv.'||$1);
      EXECUTE format('DROP FOREIGN TABLE IF EXISTS %s CASCADE', 'csv.fgn_'||$1);
      END
    $f$ language PLpgSQL;
    
    CREATE or replace FUNCTION csv.ftable_create(
         tname text,  -- table name for scan text and "csv_tname"  
         fpath text,  -- file path to the input CSV
         delim_char char(1) = ',',   -- can be also E'\t', ';', '|', etc.
         quote_char char(1) = '"',   -- field quotation
         p_check boolean default true  -- false to not check-and-drop.
    ) RETURNS text AS $f$
      DECLARE
        fgn_tname text;
        csv_tname text;
      BEGIN
      fgn_tname := 'csv.fgn_'||$1;
      csv_tname := 'csv.'||$1;
      IF p_check THEN PERFORM csv.ftable_drop($1); END IF;
      EXECUTE format(E'CREATE FOREIGN TABLE %s (line text) SERVER files OPTIONS ( filename %L, format \'text\')', fgn_tname, $2);
      EXECUTE format('CREATE VIEW %s AS SELECT csv.split_csv_line(line,%L,%L) as c FROM %s', csv_tname, delim_char,quote_char, fgn_tname);
      RETURN format(E'\ntables %I and %I created\n',fgn_tname,csv_tname);
      END
    $f$ language PLpgSQL;

    ---- TESTING ----
    SELECT csv.ftable_create('my','/tmp/TSE2/lix.csv',';');

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.