The dragnet_data from seomoz

Inconsistent encodings

I was trying to use the data for some experiments, but when reading the files directly with open() in Python 3, I encountered an encoding error for R121.html:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 1213: invalid continuation byte

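I wrote a small script that checks each file's encoding with file -i and, for anything not reported as us-ascii or utf-8, falls back to chardetect (the command-line tool that ships with chardet). The code is below.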

for f in *.html; do
        encoding=$(file -i "$f" | cut -d"=" -f 2)  # get the mime encoding
        if [ "$encoding" != "us-ascii" ] && [ "$encoding" != "utf-8" ]; then
                res=$(chardetect "$f")  # try to detect it otherwise
                encoding=$(echo "$res" | cut -d" " -f 2)
                echo "$res"
        fi
done

It produces the following result:

R121.html: Windows-1254 with confidence 0.5434633906826465
R17.html: ISO-8859-1 with confidence 0.73
R736.html: Windows-1252 with confidence 0.73
R757.html: Windows-1252 with confidence 0.73
R826.html: Windows-1252 with confidence 0.73
R827.html: Windows-1252 with confidence 0.73
T156.html: windows-1251 with confidence 0.7538428528079772
T19.html: Windows-1252 with confidence 0.73
T2.html: Windows-1254 with confidence 0.5434729438118417
T31.html: Windows-1254 with confidence 0.5239184224706976
T97.html: ISO-8859-1 with confidence 0.73
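
As a cross-check that does not depend on file or chardetect, the same question ("which files are not valid UTF-8?") can be sketched in Python with the standard library only. This is a minimal sketch, not something from dragnet_data: the function name find_non_utf8 and its arguments are hypothetical, and it only reports files that fail strict UTF-8 decoding without guessing their real encoding.

```python
from pathlib import Path


def find_non_utf8(directory, pattern="*.html"):
    """Return (file name, error message) pairs for files that are not valid UTF-8.

    Hypothetical helper for illustration; it does not detect the actual encoding.
    """
    failures = []
    for path in sorted(Path(directory).glob(pattern)):
        try:
            # Strict decoding raises on the first invalid byte sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            failures.append((path.name, str(err)))
    return failures
```

Calling find_non_utf8(".") in the directory holding the HTML files should list the same kind of failure as the UnicodeDecodeError above, one entry per offending file.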

These inconsistencies are not major, and I managed to fix most of them with a few changes to the detection script. A few files (R121.html, T19.html, T2.html, T31.html) failed even with recode, and I had to remove them. Here is the script I used to convert the inconsistent ones.

for f in *.html; do
        encoding=$(file -i "$f" | cut -d"=" -f 2)  # get the mime encoding
        if [ "$encoding" != "us-ascii" ] && [ "$encoding" != "utf-8" ]; then
                res=$(chardetect "$f")  # try to detect it otherwise
                encoding=$(echo "$res" | cut -d" " -f 2)
                echo "$res - CONVERTING TO UTF-8"
                recode "${encoding}..utf-8" "$f"  # rewrite the file in place as UTF-8
        fi
done
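
For anyone who would rather do the conversion from Python than with recode, a rough equivalent might look like the sketch below. It is an assumption-laden illustration, not the method used above: convert_to_utf8 is a hypothetical name, the source encoding has to be supplied by the caller (for example, the name printed by chardetect), and files that cannot be decoded are left untouched rather than removed.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding):
    """Rewrite `path` in place as UTF-8; return False if it cannot be decoded.

    Hypothetical helper: `source_encoding` is whatever encoding name the
    caller believes the file uses.
    """
    path = Path(path)
    raw = path.read_bytes()
    try:
        text = raw.decode(source_encoding)
    except (UnicodeDecodeError, LookupError):
        # Either the guess was wrong or Python does not know the encoding
        # name; leave the file unchanged so it can be inspected by hand.
        return False
    path.write_bytes(text.encode("utf-8"))
    return True
```

A True result only means the bytes were decodable under the given encoding. With a low-confidence guess the text can still come out garbled, so the converted files are worth spot-checking.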

This might be an issue on my end, so I'm curious whether this has come to your attention before.

questions re: creating my own data

Hi, I'm currently compiling additional, more modern HTML documents with gold-standard content + comments for use in training dragnet models, and I have a few questions:

Should I consider the text of embedded tweets, posts, quotations, image captions, and other rich media to be content?
Should I include author byline, pubdate, etc. at the start of an article as content? What about typical addenda at the bottom of the article?
Should I include non-English language content?

Thanks for your help!
