
test-lists's Introduction

Usage

What Is It?

Contained are URL testing lists intended to help in testing URL censorship, divided by country codes. In addition to these local lists, the global list consists of a wide range of internationally relevant and popular websites, including sites with content that is perceived to be provocative or objectionable. Most of the websites on the global list are in English. In contrast, the local lists are designed individually for each country by regional experts. They have content representing a wide range of categories at the local and regional levels, and content in local languages. In countries where Internet censorship has been reported, the local lists also include many of the sites that are alleged to have been blocked.

Categories are divided among four broad themes:

  • Political (This category is focused primarily on Web sites that express views in opposition to those of the current government. Content more broadly related to human rights, freedom of expression, minority rights, and religious movements is also considered here.)

  • Social (This group covers material related to sexuality, gambling, and illegal drugs and alcohol, as well as other topics that may be socially sensitive or perceived as offensive).

  • Conflict/Security (Content related to armed conflicts, border disputes, separatist movements, and militant groups is included in this category).

  • Internet Tools (Web sites that provide e-mail, Internet hosting, search, translation, Voice-over Internet Protocol (VoIP) telephone service, and circumvention methods are grouped in this category.)

More information about testing methodology can be found here.

The only testing list that applies regionally (to more than one country) is the CIS testing list, which is intended for testing former Commonwealth of Independent States nations.

Lists are available in both CSV and JSON format.

Please note that these lists are not the entirety of testing lists but rather just the newest list for every unique country code.

Contributing URLs

To learn how to contribute URLs for testing see: https://ooni.org/get-involved/contribute-test-lists/

Citation

If using this dataset in a publication, please see the following BibTeX File format.

@misc{testlist,
  title={URL testing lists intended for discovering website censorship},
  author={Citizen Lab and Others},
  year={2014},
  url={https://github.com/citizenlab/test-lists},
  note={\href{https://github.com/citizenlab/test-lists}{https://github.com/citizenlab/test-lists}}
}

An example Chicago Style citation is included below:

Citizen Lab and Others. 2014. URL Testing Lists Intended for Discovering Website Censorship. https://github.com/citizenlab/test-lists.

License

All data is provided under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International and available in full here and summarized here

test-lists's People

Contributors

abitrolly, agharbeia, agrabeli, anadahz, arky, bact, bassosimone, hellais, jakubd, kaerumy, kay-wong, marek22k, masaarnet, mimi89999, moeltaher, nunesgh, oymate, raja063, randomguy48, rayasharbain, sitinurliza95, sloncocs, sneft, tafiti, te-k, tomac4t, vivivibo, willscott, xhdix, yanyiyi


test-lists's Issues

Make list of government issued blocklists

Add links into the comments below!

Are you aware of a government issued blocklist that is available for download?

Then add a link pointing to where it can be downloaded (even better if you specify whether it's legal to mirror/publish it) and which country it applies to.

If you want to be 👍 💯 then you should write the parser code and submit a pull request ;)

White-listing/Black-listing URLs per country?

At TorDev 2016 it was suggested to OONI that instead of creating criteria for URL risk assessment, perhaps we could consider white-listing and black-listing URLs per country?

Whitelists can include URLs that are commonly accessed and present low risk (e.g. Alexa top 100), while blacklists can include pornography, hate speech, and other objectionable categories.

My main concern with this suggestion is that many URLs might not clearly fall in a "white" or "black" list, or it might not be clear/obvious if they should be white-listed or black-listed. In such cases, would we be creating even more risk for users (if, for example, we have white-listed a URL which is actually quite risky to test)? Furthermore, would this inevitably lead to fewer tests for URLs that are riskier and possibly more interesting to test?

Extra consistency checks on category codes and URLs

We have run into the issue, when using the lists in OONI, that some country lists present the following problems:

  1. The same URL, present in different country-specific lists, has a different category code

ex.
id.csv:http://denypagetests.netsweeper.com,NEWS,News Media,2014-04-15,citizenlab,
kw.csv:http://denypagetests.netsweeper.com,CTRL,Control content,2014-04-15,citizenlab,

  2. The same URL is present in both the global and the country-specific list

ex.
global.csv:http://www.crazyshit.com,PORN,Pornography,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
sg.csv:http://www.crazyshit.com,NEWS,News Media,2014-04-15,citizenlab,

We should add checks to the lint-lists.py script that detect:

  1. Inconsistencies in category codes across lists
  2. URLs that are present in both the global list and a country-specific list

On this second point I would like to hear from @sneft and others to know if this is reasonable or if it's maybe just an OONI-specific usage of the lists.
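The two checks proposed above could look roughly like the following. This is only a sketch, not the actual lint-lists.py code: the function name `find_inconsistencies` and the input shape (a mapping from list name to `(url, category_code)` pairs) are assumptions made for illustration, and the URLs are placeholders.

```python
from collections import defaultdict


def find_inconsistencies(lists):
    """Return (conflicting, duplicated_from_global).

    `lists` maps a list name such as "id" or "global" to an iterable of
    (url, category_code) pairs. This input shape is hypothetical.
    """
    # url -> {list name -> category code}
    codes_by_url = defaultdict(dict)
    for name, rows in lists.items():
        for url, code in rows:
            codes_by_url[url][name] = code

    # Check 1: the same URL carries different category codes in different lists.
    conflicting = {
        url: seen
        for url, seen in codes_by_url.items()
        if len(set(seen.values())) > 1
    }

    # Check 2: a URL from the global list also appears in a country list.
    duplicated_from_global = {
        url: sorted(name for name in seen if name != "global")
        for url, seen in codes_by_url.items()
        if "global" in seen and len(seen) > 1
    }
    return conflicting, duplicated_from_global


if __name__ == "__main__":
    example = {
        "id": [("http://a.example.org", "NEWS")],
        "kw": [("http://a.example.org", "CTRL")],
        "global": [("http://b.example.org", "PORN")],
        "sg": [("http://b.example.org", "NEWS")],
    }
    conflicting, duplicated = find_inconsistencies(example)
    print(conflicting)
    print(duplicated)
```

Whether the second check should be a hard error is exactly the open question in this issue, so a real linter might report it as a warning instead.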

Add "safety" column to the test-lists

Add tags to URLs for Safe/Not safe or an extra column that provides a level of safety for that certain URL.

It was suggested by @josswright that we could also do this in an automated way by prompting the user before doing a scan "Do you feel safe testing these URLs":

URL 1 - yes/no
URL 2 - yes/no
etc.

This user feedback can then be submitted to us in a safe way and used to define a metric of safety or perceived safety (perhaps even on a country-by-country basis).

Much more is missing from List.

Pakistan has blocked many more websites and links. If these are only the ones you tested, that is fine, but if this is meant to be the full list of blocked URLs, then I can add many more.

Review the proposed new content category codes

I have drafted a proposed set of revised content category codes which would replace the existing codes. The proposed set of codes is here:

https://github.com/citizenlab/test-lists/blob/master/lists/00-proposed-category_codes.csv

The aim here has been to streamline the categories, collapsing redundant categories and revising categories which were frequently confusing for list-creators. The goal was simplicity - fewer categories will be easier to use, prevent duplication and ideally limit miscategorization.

Alternative approaches considered were to adopt an existing category scheme wholesale (such as Netsweeper or Blue Coat). These generally have many more categories than the 30 proposed here, which can be a pro or a con depending on your needs.

Please feel free to offer feedback here. This list format is very much building on the prior ONI scheme, so I would love to hear comments based on alternative uses for the lists.

Restful API

Tracking issue for development of an API for extracting URLs for specific categories and/or countries.
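As a starting point for discussion, the selection logic such an API would need might be sketched as below. No API exists in the source, so everything here is hypothetical: the function name `select_urls`, and rows represented as dicts with `country`, `category_code` and `url` keys. The HTTP layer is left out on purpose.

```python
def select_urls(rows, country=None, category=None):
    """Return the URLs matching an optional country code and category code.

    `rows` is an iterable of dicts with "country", "category_code" and "url"
    keys (an assumed in-memory shape, not a format defined by the repository).
    """
    wanted_country = country.lower() if country else None
    wanted_category = category.upper() if category else None
    return [
        row["url"]
        for row in rows
        if (wanted_country is None or row["country"] == wanted_country)
        and (wanted_category is None or row["category_code"] == wanted_category)
    ]


if __name__ == "__main__":
    rows = [
        {"country": "ir", "category_code": "NEWS", "url": "http://a.example.org"},
        {"country": "global", "category_code": "NEWS", "url": "http://b.example.org"},
    ]
    print(select_urls(rows, country="IR"))
```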

hdl.handle.net is blocked in Iran, not sure how to categorize

Handle.net is a provider of persistent URLs and is commonly used for institutional repositories (similar to DOIs). I am the administrator of a large open access repository of agricultural outputs and we recently noticed that the Handles do not work in Iran.

Here is an example of a persistent Handle link to an item on our repository: http://hdl.handle.net/10568/91553

I would submit a pull request with the URL in the Iran country list, but I'm not sure what to classify it as.

Review the URL list for your country

Are you interested in helping us out in making censorship measurement work better in your country?

Here is what you should do:

  1. Look to see if we already have a test-list for your country.

To figure out the two-letter country code for your country, see the country code mappings.

Then find the file called XY.csv (where XY is your country's two-letter code) in here: https://github.com/citizenlab/test-lists/tree/master/lists

  2. Review the list of URLs we test by adding new ones or creating a new file for your country.

Currently this process requires a bit of technical knowledge. If you want to contribute but don't know git, just send an email to art torproject org or comment on this ticket with the suggested changes.

All URLs in the Global List Should be Active URLs.

I realize that there is an intention to keep some URLs that are dead (404ing, parked pages, etc.) on the local lists because this repo is intended to measure censorship and this may sometimes last longer than the site itself. I believe that this logic does not hold as well when talking about the global list. Is there any objection to checking the consistency of the global list and making sure that all the global list URLs are active, non-dead URLs?

Discuss possibility of including application specific addresses

There are some cases where we want to have lists of some addresses (or strings) that are used specifically in the context of some application.

I believe the most logical place to put these is inside of a dedicated "global" list, but it should be placed in the context of the software using it.

Here are some examples of the sorts of addresses I am talking about:

  • The pluggable transport addresses (and in some cases shared secrets or hash fingerprints) of the Tor bridges shipped as part of Tor Browser Bundle release X
  • The IPs and ports (and possibly fingerprints) of the Tor directory authorities that are used for discovery of the Tor network by "vanilla" tor (countries that block access to the Tor network will block these, but may not also be blocking the bridges)
  • The list of IP and ports used by a $INSERT_POPULAR_INTERNET_CHAT_APP application (I'm thinking whatsapp, viber, etc.)
  • The list of IP, port and protocol by the various VPN providers (it can be useful to check if these are being blocked)
  • List of IPs of open DNS resolvers per country

I think you get an idea of the sorts of data I am talking about.

I would propose we create the following new directory structure:

lists
├── services
│   ├── tor
│   │   ├── bridges.csv
│   │   └── directory_authorities.csv
│   └── vpn
etc.

Each service-specific namespace can have its own format, maintaining some common elements such as: name, date_added, data_format_version, source, notes.

@sneft, @chokepoint-project, @mobilesuit, @jakubd what do you think?

"LGBT (excluding pornography)"

There are several different category codes (other than PORN) that could arguably include pornography, notably

Provocative Attire,PROV,PROV,"Websites which show provocative attire and portray women in a sexual manner, wearing minimal clothing. "
Sex Education,XED,XED,"Includes contraception, abstinence, STDs, healthy sexuality, teen pregnancy, rape prevention, abortion, sexual rights, and sexual health services."
Online Dating,DATE,DATE,"Online dating services which can be used to meet people, post profiles, chat, etc"
Media sharing,MMED,MMED,"Video, audio or photo sharing platforms."

The LGBT category, however, is alone in having a description that specifies that it excludes pornography:

LGBT,LGBT,GAYL,A range of gay-lesbian-bisexual-transgender queer issues. (Excluding pornography)

This seems inconsistent. I propose that either "Excluding pornography" be added to the descriptions of PROV, XED, DATE, and MMED, or that it be removed from the description of LGBT.

Can source be anon?

We have a request from a country partner that wishes not to be credited as the direct source, due to the sensitive nature of some of the URLs.

Should we put the source as anon? Or should an intermediary that is not at risk such as sinarproject or ooni be put in there instead?

Regards

Add urls blocked in Spain related to Catalan referendum

Since last week, the Spanish government has been going after websites hosting information about the Catalan referendum that will be held on 1 October. They started by blocking the main website, referendum.cat, by forcing Spanish operators to hijack the domain name:

$ host referendum.cat
referendum.cat is an alias for paginaintervenida.edgesuite.net.
paginaintervenida.edgesuite.net is an alias for a1836.b.akamai.net.
a1836.b.akamai.net has address 2.16.65.170
a1836.b.akamai.net has address 2.16.65.152

At the beginning an error was shown because they hadn't properly configured the web server to handle the redirection. Now it shows the following image:

paginaintervenida

The day after it happened, someone uploaded the content of the website to github (https://github.com/GrenderG/referendum_cat_mirror) and more mirrors have appeared online.

Again, they have been going after these mirrors, to the extent that they have blocked the whole gateway.ipfs.io domain.

A list of the blocked domains (may be incomplete) can be found here
1-o_catalan_referendum_blocked_websites_in_spain.txt

The virtual harassment has been paralleled with physical harassment. Read what EFF wrote about it at https://www.eff.org/deeplinks/2017/09/cat-domain-casualty-catalonian-independence-crackdown

Also interesting reading about technical measures taken beforehand to prevent DDoS and censorship: https://medium.com/@josepot/is-sensitive-voter-data-being-exposed-by-the-catalan-government-af9d8a909482

Categorise MISC websites

Currently I count 317 MISC category codes inside of the country and global testing lists.

It would be great if these sites were categorised. It's not something that needs to be done all at once, but maybe tackled country by country.

These are all the countries that have MISC category codes in them:

  • ae.csv
  • br.csv
  • by.csv
  • cis.csv
  • cn.csv
  • cu.csv
  • eg.csv
  • et.csv
  • global.csv
  • gt.csv
  • id.csv
  • in.csv
  • ke.csv
  • kr.csv
  • kz.csv
  • lb.csv
  • md.csv
  • mm.csv
  • np.csv
  • om.csv
  • pk.csv
  • ru.csv
  • sa.csv
  • sg.csv
  • so.csv
  • sy.csv
  • th.csv
  • tm.csv
  • tr.csv
  • tz.csv
  • ua.csv
  • ug.csv
  • uz.csv
  • vn.csv
  • ye.csv
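A count like the one above can be reproduced with a few lines of Python. This is a sketch under two assumptions: that each list starts with a header row containing a `category_code` column, and that the CSV contents have already been read into a dict keyed by filename (the name `count_category` is made up for illustration).

```python
import csv
import io
from collections import Counter


def count_category(csv_texts, code="MISC"):
    """Count rows with the given category code in each list.

    `csv_texts` maps a filename to the CSV text of that list. Lists with
    no matching rows do not appear in the result.
    """
    counts = Counter()
    for filename, text in csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            if row.get("category_code") == code:
                counts[filename] += 1
    return counts


if __name__ == "__main__":
    header = "url,category_code,category_description,date_added,source,notes\n"
    texts = {
        "ae.csv": header + "http://a.example.org/,MISC,Miscellaneous content,2014-04-15,citizenlab,\n",
    }
    print(count_category(texts))
```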

Convert whiteboard sketched out at CSI15 into tickets and milestones

Here is a dump of the content of the whiteboard that should become a roadmap for the next 6-12 months of development:

Questions that should be answered via documentation

  • How many countries have public block-lists?
  • What countries are we missing test lists for?

Things we would like to have

  • IPs of block pages
  • Government issued block-lists for the various countries
  • REST API for getting a subset of testing URLs into your measurement software
  • URLs used for captive portal testing (ooniprobe has this as part of the captive portal test)

Help wanted

  • Review current URL for your country (somebody agreed to take care of ZA and I gave them the template as an excel spreadsheet)
  • Write text for the wizard (@citizenlab can you make available the excel you used for defining the existing lists?)
  • Translation of text for the wizard

TODO

  • Define the policy by which URLs are accepted into the test-lists repository
  • Add tags to URLs for Safe/Not safe (perhaps do this in an automated way by asking the user "Do you feel safe testing these URLs?" and showing them a random sample; based on user feedback we can build a metric for safety)
  • Integrate tags from other third parties (OpenDNS, Alexa, BlueCoat, etc.)

Icons for the new citizenlab category codes

I was working on some data viz for OONI and as part of that I came up with an icon set for each of the citizenlab category codes.

Most of the icons are taken from either font-awesome or material design icons (which both use SIL Open Font License) and some are designed by OONI (mostly based on stacking font-awesome icons). @bnvk do you think it would be useful to add the OONI made ones to @opensourcedesign?

citizenlab category codes

Feedback and input on the icon set is greatly appreciated. Also, if it would be useful, I can add these to the repository itself, though that may just increase the size of the repo, and most people may not care about the icons.

"description" field

It's hard for list maintainers to figure out what some of the test URLs are, and the same applies when writing reports, whether manually or by generating them.

A one-line description field would help maintainers and users of the list (human or machine) know what a site is about, especially if it's in the global list or in the list of a country unfamiliar to the maintainer.

eg.

  • "The Citizen Lab is an interdisciplinary laboratory based at the Munk School of Global Affairs at the University of Toronto, Canada."
  • "The Coalition for Clean and Fair Elections or Bersih is a coalition of non-governmental organisations which seeks to reform the current electoral system in Malaysia to ensure free, clean and fair elections."

We might want to have a title field also.

Character 0x9d present in some lists

I noticed that some URLs contain the character 0x9d at the end. This triggers a parsing error when I try to convert it to unicode.

Can I just remove them?
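If removing the byte is acceptable, one cautious way to do it is sketched below. It assumes the lists are meant to be UTF-8 and only strips a trailing 0x9d when decoding fails, because 0x9d is also a legitimate continuation byte inside some multi-byte UTF-8 characters. The function name `decode_url` is hypothetical.

```python
def decode_url(raw: bytes) -> str:
    """Decode a URL read as bytes, dropping stray trailing 0x9d bytes if needed."""
    try:
        # Well-formed input is returned untouched.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0x9d is not valid UTF-8; strip it from the end and retry.
        # Any other invalid byte will still raise, which is intentional.
        return raw.rstrip(b"\x9d").decode("utf-8")


if __name__ == "__main__":
    print(decode_url(b"http://example.org/\x9d"))
```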

Consolidate Iran lists

It looks like there are two ir.csv files. One of them is in the root directory, and the other one is in the lists directory. Furthermore, they appear to be out of sync.

Define the policy for adding URLs to this repository

Currently we don't have a clear policy that allows us to evaluate if a certain set of URLs should or should not be merged.

We should consult some lawyers that have a good understanding of Canadian law to help us draft a good policy.

using or not www

Hello!

I'm working on a major revamp of the Venezuelan list and I'm a little confused about what to do with sites that support www., because in some lists we have sites with it, others without it, and others with both. Which approach do you recommend for this repo?

I am almost ready to upload the new list but this is stopping me.

Thank you very much!
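Whatever convention is chosen, it may help to first see which hosts appear both with and without the www. prefix. The helper below is a sketch for that purpose only; the name `www_variants` is an assumption and it does not decide which variant to keep.

```python
from urllib.parse import urlsplit


def www_variants(urls):
    """Return the bare hostnames that are listed both with and without "www."."""
    hosts = {urlsplit(url).hostname for url in urls}
    hosts.discard(None)  # entries without a parsable host are ignored
    return sorted(
        host
        for host in hosts
        if not host.startswith("www.") and "www." + host in hosts
    )


if __name__ == "__main__":
    print(www_variants([
        "http://example.org/",
        "http://www.example.org/",
        "http://www.example.net/",
    ]))
```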

Agree on convention for including or not HTTPS URLs

In #38 (comment) we discussed the pros and cons of including HTTPS URLs in the testing lists.

@sneft

This is a debate we've had internally a number of times. It's obviously inefficient to repeatedly test the full path of HTTPS URLs for the same domain, but at the same time we didn't want to make any initial assumptions in constructing the lists that would limit their usefulness in the long run. In other words, trying to future-proof the lists, leave open the possibility that someone's being MITM'd, etc.
We generally have settled on being exhaustive (in some cases testing both HTTP/HTTPS versions) as the performance impact wasn't our primary consideration.
One middle ground may be to include a small sample of HTTPS full-path URLs (e.g. a handful of full HTTPS URLs of sensitive Twitter accounts or YouTube videos) while testing the HTTP version of the rest.

@hellais

One thing to take into consideration is the fact that if you only do testing for the HTTPS version of the site you will not catch cases in which the specific path is being filtered by a transparent proxy that doesn't do MITM.
I guess in the end it's probably up to the tool that does the measurements to take into account the fact that it should be testing the URL in both its HTTPS and HTTP version.
That said, if we decide to include in the URLs to test the HTTPS version of websites, it should be consistent across all of them. That is, if the site does support HTTPS, the URL should always be listed using https. It will then be up to the tool consuming the testing list to take into account the fact that it should be testing the URL in both its HTTPS and HTTP version.

Do we agree that the convention to adopt is "when the site supports HTTPS list it in HTTPS" and it should be up to the tool using the list to figure out that it should ALSO be testing for "HTTP" when testing a URL?
If so I can apply changes in batch to all the lists to update the URLs that also support HTTPS to have them prefixed by that.
Changes will also be made to ooniprobe to support testing for HTTP AND HTTPS when it finds HTTPS URLs.

I think this is also relevant to @rpanah for centinel testing.
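Under the proposed convention, a consuming tool would have to derive the HTTP variant itself. A minimal sketch of that step follows; it is not how ooniprobe or centinel actually do it, and the function name `urls_to_test` is made up. Note that it keeps any explicit port unchanged, so a URL with `:443` would need extra handling.

```python
from urllib.parse import urlsplit, urlunsplit


def urls_to_test(url):
    """Return the URL itself plus its HTTP variant when the URL is HTTPS."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "https":
        # HTTP (or other) URLs are tested as listed.
        return [url]
    http_variant = urlunsplit(parts._replace(scheme="http"))
    return [url, http_variant]


if __name__ == "__main__":
    print(urls_to_test("https://example.org/path?q=1"))
```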

Add a tags column

I am thinking that it would be valuable to have an extra column to annotate in a machine readable way properties of the URLs.

The use-case of this is, for example, instead of removing dead URLs that should no longer be tested, simply adding the inactive tag or for URLs that are generated as part of some automated tooling, marking those as autogenerated, etc.

I think having a generic tags column is sufficiently flexible to support many use-cases.

Thoughts?

Does this create parsing issues for people consuming these test lists?

I believe this would also make it easier to integrate the @berkmancenter effort (see: #236)
cc @sneft @agrabeli @jakubd @rpanah @jdcc
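To make the parsing question concrete, here is a sketch of how a consumer might skip URLs tagged inactive. The column name `tags`, the `;` separator inside the cell, and the function name `active_urls` are all assumptions; none of them has been agreed on in this issue. Consumers that read rows by column name would be unaffected by the extra column, while those that index by position or expect a fixed column count might break.

```python
import csv
import io


def active_urls(csv_text):
    """Return the URLs whose (hypothetical) tags column lacks "inactive"."""
    urls = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A missing or empty tags cell is treated as "no tags".
        raw_tags = row.get("tags") or ""
        tags = {tag.strip() for tag in raw_tags.split(";") if tag.strip()}
        if "inactive" not in tags:
            urls.append(row["url"])
    return urls


if __name__ == "__main__":
    text = (
        "url,category_code,tags\n"
        "http://a.example.org/,NEWS,\n"
        "http://b.example.org/,NEWS,inactive;autogenerated\n"
    )
    print(active_urls(text))
```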

Add URLs for captive portal tests

Some people were interested in having the list of URLs that some vendors use to detect the presence of captive portals.

We have this information inside of the ooniprobe captive portal test.

I will paste the important text in here:

# This test is a collection of tests to detect the presence of a
# captive portal. Code is taken, in part, from the old ooni-probe,
# which was written by Jacob Appelbaum and Arturo Filastò.
#
# This module performs multiple tests that match specific vendor captive
# portal tests. This is a basic internet captive portal filter tester written
# for RECon 2011.
#
# Read the following URLs to understand the captive portal detection process
# for various vendors:
#
# http://technet.microsoft.com/en-us/library/cc766017%28WS.10%29.aspx
# http://blog.superuser.com/2011/05/16/windows-7-network-awareness/
# http://isc.sans.org/diary.html?storyid=10312&
# http://src.chromium.org/viewvc/chrome?view=rev&revision=74608
# http://code.google.com/p/chromium-os/issues/detail?3281ttp,
# http://crbug.com/52489
# http://crbug.com/71736
# https://bugzilla.mozilla.org/show_bug.cgi?id=562917
# https://bugzilla.mozilla.org/show_bug.cgi?id=603505
# http://lists.w3.org/Archives/Public/ietf-http-wg/2011JanMar/0086.html
# http://tools.ietf.org/html/draft-nottingham-http-portal-02
#
# Each entry: [URL, expected content, expected status, User-Agent, test name]
vendor_tests = [
    ['http://www.apple.com/library/test/success.html',
     'Success',
     '200',
     'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3',
     'Apple HTTP Captive Portal'],
    ['http://tools.ietf.org/html/draft-nottingham-http-portal-02',
     '428 Network Authentication Required',
     '428',
     'Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0',
     'W3 Captive Portal'],
    ['http://www.msftncsi.com/ncsi.txt',
     'Microsoft NCSI',
     '200',
     'Microsoft NCSI',
     'MS HTTP Captive Portal'],
]
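For readers unfamiliar with how such entries are used, one plausible interpretation is sketched below: fetch the URL with the listed User-Agent and treat a mismatch in status code or body as a hint that a captive portal intercepted the request. This is an assumption about the intent, not the ooniprobe implementation; the function name `portal_suspected` is made up, and it takes an already-fetched status and body so that no network access is needed.

```python
def portal_suspected(test, status, body):
    """Return True when a response does not match what the vendor test expects.

    `test` is one entry shaped like those in vendor_tests:
    [url, expected_content, expected_status, user_agent, name].
    """
    _url, expected_content, expected_status, _user_agent, _name = test
    # The expected status is stored as a string, so compare as strings.
    return str(status) != expected_status or expected_content not in body


if __name__ == "__main__":
    ncsi = ['http://www.msftncsi.com/ncsi.txt', 'Microsoft NCSI', '200',
            'Microsoft NCSI', 'MS HTTP Captive Portal']
    print(portal_suspected(ncsi, 200, "Microsoft NCSI"))
```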

Missing commas in ae.csv

The last two entries in ae.csv are missing the final comma for the 'notes' column. Please correct these.
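A problem like this could be caught automatically by checking that every row has the same number of fields. The sketch below assumes six columns (url, category code, category description, date added, source, notes), which matches the example rows quoted in other issues but should be verified against the actual header; the function name is hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = 6  # assumed: url, code, description, date, source, notes


def rows_with_wrong_column_count(csv_text, expected=EXPECTED_COLUMNS):
    """Return the 1-based row numbers whose field count differs from `expected`."""
    bad_rows = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # Completely empty lines are skipped rather than reported.
        if row and len(row) != expected:
            bad_rows.append(row_number)
    return bad_rows


if __name__ == "__main__":
    text = (
        "url,category_code,category_description,date_added,source,notes\n"
        "http://a.example.org/,NEWS,News Media,2014-04-15,citizenlab,\n"
        "http://b.example.org/,NEWS,News Media,2014-04-15,citizenlab\n"
    )
    print(rows_with_wrong_column_count(text))
```

A row that ends with the trailing comma has an empty sixth field and passes; a row without it has only five fields and is reported.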

Update URL for gay.com

As of 03 August 2017, gay.com has changed ownership. It appears that the website no longer has an HTTPS version, just an HTTP connection that redirects to https://vanguardnow.org/

This causes a problem in the "Web Connectivity" test in ooniprobe, which expects "https://gay.com" to resolve correctly.

Relevant line:
https://github.com/citizenlab/test-lists/blob/master/lists/global.csv#L582

More information:
This was tested and verified by performing

# Checks open ports
nmap gay.com
# Checks HTTPS
curl -v https://gay.com

from 3 locations:

  • Home internet
  • School internet (MTSU)
  • DigitalOcean droplet (NYC)

At all 3 locations, port 443 does not show as open, and curl fails to connect using HTTPS.

Cross-reference:
https://github.com/TheTorProject/ooniprobe-android/issues/106
ooni/probe-android#107
