dragnet_data's Issues

Inconsistent encodings

I was trying to use the data for some experiments, but when reading it directly with open in Python 3, I encountered an encoding error for the file R121.html:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 1213: invalid continuation byte

I wrote a small script to check the encoding of the failing files using file and chardet's chardetect utility. The code is below.

for f in *.html; do
        encoding=$(file -i "$f" | cut -d"=" -f 2)  # get the MIME encoding reported by file
        if [ "$encoding" != "us-ascii" ] && [ "$encoding" != "utf-8" ]; then
                res=$(chardetect "$f")  # otherwise, try to detect it with chardet
                encoding=$(echo "$res" | cut -d" " -f 2)
                echo "$res"
        fi
done

It produces the following result:

R121.html: Windows-1254 with confidence 0.5434633906826465
R17.html: ISO-8859-1 with confidence 0.73
R736.html: Windows-1252 with confidence 0.73
R757.html: Windows-1252 with confidence 0.73
R826.html: Windows-1252 with confidence 0.73
R827.html: Windows-1252 with confidence 0.73
T156.html: windows-1251 with confidence 0.7538428528079772
T19.html: Windows-1252 with confidence 0.73
T2.html: Windows-1254 with confidence 0.5434729438118417
T31.html: Windows-1254 with confidence 0.5239184224706976
T97.html: ISO-8859-1 with confidence 0.73
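
As a cross-check that doesn't rely on file or chardetect, a strict UTF-8 decode in Python flags any file that would trigger the error above. This is only a minimal sketch using the standard library; the function name `find_non_utf8` and its directory argument are illustrative and not part of the dataset's tooling.

```python
from pathlib import Path


def find_non_utf8(directory):
    """Return (filename, error message) pairs for .html files that are not valid UTF-8."""
    failures = []
    for path in sorted(Path(directory).glob("*.html")):
        try:
            # Strict decoding raises on the first invalid byte sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            failures.append((path.name, str(err)))
    return failures
```

Calling `find_non_utf8(".")` in the directory holding the HTML files should list the same files that fail with open, along with the offending byte and its position. Pure ASCII files pass, since ASCII is a subset of UTF-8.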

These inconsistencies are not major, and I managed to fix most of them with a few changes to the detection script. A few files (R121.html, T19.html, T2.html, T31.html) failed even with recode, so I had to remove them. Here is the script I used to convert the inconsistent ones.

for f in *.html; do
        encoding=$(file -i "$f" | cut -d"=" -f 2)  # get the MIME encoding reported by file
        if [ "$encoding" != "us-ascii" ] && [ "$encoding" != "utf-8" ]; then
                res=$(chardetect "$f")  # otherwise, try to detect it with chardet
                encoding=$(echo "$res" | cut -d" " -f 2)
                echo "$res - CONVERTING TO UTF-8"
                recode "${encoding}..utf-8" "$f"  # convert the file in place
        fi
done
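
The same kind of conversion could also be sketched in Python without recode. The version below is an illustration under stated assumptions, not the approach used above: it tries a fixed list of candidate encodings in order instead of detecting one per file, and the names `CANDIDATES`, `decode_with_fallback`, and `convert_to_utf8` are hypothetical. Because latin-1 maps every byte to some character, it never fails, so a successful decode does not guarantee that the resulting text is correct.

```python
from pathlib import Path

# Assumed candidate list; order matters, and latin-1 accepts any byte sequence.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(raw, candidates=CANDIDATES):
    """Decode bytes with the first candidate encoding that succeeds."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def convert_to_utf8(path):
    """Rewrite the file as UTF-8 if needed and return the encoding it was read with."""
    path = Path(path)
    text, encoding = decode_with_fallback(path.read_bytes())
    if encoding != "utf-8":
        # Only touch files that were not already valid UTF-8.
        path.write_bytes(text.encode("utf-8"))
    return encoding
```

A fixed fallback list is cruder than per-file detection, but it is deterministic, which may matter when low-confidence guesses like the Windows-1254 ones above are involved.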

This might be an issue on my end, so I'm curious whether this has come to your attention before.

questions re: creating my own data

Hi, I'm currently compiling additional, more modern HTML documents with gold-standard content + comments for use in training dragnet models, and I have a few questions:

  • Should I consider the text of embedded tweets, posts, quotations, image captions, and other rich media to be content?
  • Should I include author byline, pubdate, etc. at the start of an article as content? What about typical addenda at the bottom of the article?
  • Should I include non-English language content?

Thanks for your help!
