environmental-conflict-tracker

This project scrapes news media articles to identify environmental conflict events such as resource conflict, land appropriation, human-wildlife conflict, and supply chain conflict. With an initial focus on India, the project also aims to connect conflict events to their jurisdictional policies to identify how to mediate conflict events or to identify where there is a gap in legislation. Currently, the database contains 65,000 candidate news articles about conflict in India, of which around 11,000 are expected to be environmental conflict events.

The data collection involves the following steps:

  1. Pulling all GDELT news records for a given country that reference a conflict event
  2. Extracting candidate articles by matching news article titles against a curated keyword dictionary with regular expressions
  3. Scraping of full news media text for candidate articles with NewsPlease
  4. Manual curation of a gold standard dataset, stored in data/gold-standard/
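Step 2 above can be sketched as a regular-expression match against a keyword dictionary. The keywords below are invented examples for illustration; the project's actual curated dictionary is maintained separately.

```python
import re

# Hypothetical excerpt of the curated keyword dictionary; the real
# dictionary used for subsetting is maintained by the project.
CONFLICT_KEYWORDS = [
    "land acquisition",
    "water scarcity",
    "human-wildlife conflict",
    "eviction",
    "protest",
]

# One case-insensitive pattern with word boundaries, so that e.g.
# "protest" does not match inside "protestant".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONFLICT_KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)

def is_candidate(title):
    """Return True if the article title matches any curated keyword."""
    return PATTERN.search(title) is not None

titles = [
    "Farmers protest land acquisition for new highway",
    "Stock markets rally on earnings news",
]
candidates = [t for t in titles if is_candidate(t)]
```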

Notebooks

  • 0-quickstart: Inspect the data
  • 1-download: Download GDELT data for a given country for an input period of time
  • 2-scrape: Subset and scrape full text of candidate event articles
  • 3-gold-standard: Create and save a gold standard dataset

Roadmap

The project has been curated to allow participants in the Omdena challenge to tackle a wide range of data science needs. The ultimate goal is to generate maps of conflict events with extracted information about their actors, types, numbers, actions, locations, and dates. This extracted information should be paired with metadata about and text from policy documents to identify which policies have jurisdiction over the conflict event.

A list of potential tasks as a starting point:

  • Coreference resolution
  • Data subsetting to remove false positives
  • Identifying news articles that refer to the same conflict event
  • Named entity recognition
  • PDF document processing
  • Sentence similarity between conflict event and policy sections

Named entity recognition

We are interested in identifying the following entities: actor, type, number, action, location, date, which can be disaggregated as follows:

  • Actor: farmer, government, trader, smallholder, etc.
  • Type: human-wildlife conflict, land tenure, land appropriation, land use rights, water scarcity, resource scarcity, livelihoods, etc.
  • Number: number of people affected
  • Action: protest, kill, threaten, seize, etc.
  • Location: provided in data/metadata/variables/$MONTH.csv
  • Date: either listed in the article or taken from date_publish in the scraped text

The named entity recognition process will likely draw on metadata extracted from GDELT in the data/metadata/variables/$MONTH.csv files, which contain information about the types of conflict, the actors, the locations, and the dates. However, this extracted information is expected to be noisy and will likely need to be paired with a manually curated NER pipeline.

The highest-priority entity is the type of conflict. Location and Date should be provided by the GDELT metadata in data/metadata/variables. Action may be provided by the GDELT metadata through the CAMEO codebook but should be verified. Number and Actor will be difficult to identify and will likely require an implementation of coreference resolution.
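To illustrate why Number is hard, a naive pattern-based extractor can pull counts that appear directly before a person noun, but it cannot attach a count to the right actor without coreference resolution. This is an illustrative sketch only; the noun list and pattern are invented, not the project's pipeline.

```python
import re

# Hypothetical sketch: pull candidate "number of people affected" mentions
# with a regular expression. A count is only captured when it directly
# precedes a person noun; attaching it to the correct actor would still
# require coreference resolution.
NUMBER_PATTERN = re.compile(
    r"\b(\d[\d,]*)\s+(?:people|villagers|farmers|families|protesters)\b",
    flags=re.IGNORECASE,
)

def extract_affected_counts(text):
    """Return integer counts that appear directly before a person noun."""
    return [int(m.group(1).replace(",", "")) for m in NUMBER_PATTERN.finditer(text)]

text = ("Around 1,200 villagers were displaced by the dam, "
        "and 300 farmers joined the protest.")
counts = extract_affected_counts(text)
```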

Document classification

Although NER alone may prove sufficient, we expect that the data is currently too noisy and that candidate articles will need to be filtered before performing NER. We have manually curated a small gold standard dataset, in data/gold-standard, that contains the months and ids of news articles that refer to an environmental conflict. These can be paired with data/metadata/variables or data/texts/$MONTH/$ID.pkl by using the data/metadata/matching/$MONTH dictionary to identify the full text or metadata for each known positive example.

Although various options exist, we suggest an initial starting point would be to generate weakly supervised class labels for the unlabeled documents with Snorkel's labeling functions and label model, and to then train a classifier on those labels. Documents with a predicted probability above a threshold should be selected as candidates for NER.
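The weak supervision idea can be sketched in pure Python as a set of labeling functions combined by majority vote. This is a simplified stand-in: Snorkel's LabelModel learns per-function accuracies rather than taking a simple majority, and the keyword lists here are invented for illustration.

```python
# Simplified stand-in for Snorkel-style weak supervision: each labeling
# function votes POSITIVE (1), NEGATIVE (0), or ABSTAIN (-1), and
# non-abstaining votes are combined by majority. Snorkel's LabelModel
# learns per-function accuracies instead of majority voting.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_env_keyword(doc):
    keywords = ("forest", "land", "water", "wildlife", "mining")
    return POSITIVE if any(k in doc.lower() for k in keywords) else ABSTAIN

def lf_conflict_keyword(doc):
    keywords = ("protest", "clash", "evict", "seize", "dispute")
    return POSITIVE if any(k in doc.lower() for k in keywords) else ABSTAIN

def lf_sports(doc):
    keywords = ("cricket", "football", "tournament")
    return NEGATIVE if any(k in doc.lower() for k in keywords) else ABSTAIN

LFS = [lf_env_keyword, lf_conflict_keyword, lf_sports]

def weak_label(doc):
    """Combine labeling-function votes by majority, ignoring abstentions."""
    votes = [v for v in (lf(doc) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if sum(votes) > len(votes) / 2 else NEGATIVE
```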

Jurisdictional policies

We are unclear as to the best methodology to tie conflict events to their jurisdictional policies and are open to suggestions or experimentation. This could be a manually defined algorithm or an NLP matching algorithm. The goal is to use the extracted information to pair a conflict event with a policy that should govern it. This will help highlight the gaps between policy and practice for use in stakeholder engagement.
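One minimal NLP baseline for this matching is bag-of-words cosine similarity between the conflict-event description and each policy section. The policy sections below are invented examples; a real pipeline would more likely use sentence embeddings.

```python
import math
from collections import Counter

# Illustrative baseline only: score how similar a conflict-event
# description is to each policy section using bag-of-words cosine
# similarity. The policy sections are invented examples.
def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

event = "farmers protest land acquisition for industrial corridor"
policy_sections = {
    "forest_rights": "recognition of forest rights of forest dwelling communities",
    "land_acquisition": "compensation and consent requirements for land acquisition",
}
best = max(policy_sections, key=lambda k: cosine(event, policy_sections[k]))
```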

Validation

The data/reference/ folder contains validated data from ACLED on conflict events in India during 2017. This database covers all conflict events, not just environmental conflict, but could serve as a baseline for regional-level violence, or be subset via keyword extraction.
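A regional sanity check against this baseline could look like the following. The column names and counts are invented for illustration; data/reference/ holds the actual ACLED export.

```python
import pandas as pd

# Hypothetical sketch: compare event counts extracted by this project
# against ACLED reference counts per state. Column names and the toy
# numbers are invented; data/reference/ holds the real ACLED data.
acled = pd.DataFrame({
    "state": ["Assam", "Odisha", "Punjab"],
    "acled_events": [40, 25, 10],
})
extracted = pd.DataFrame({
    "state": ["Assam", "Odisha", "Punjab"],
    "extracted_events": [12, 9, 2],
})

# Join on state and compute the share of baseline events we recovered
merged = acled.merge(extracted, on="state")
merged["ratio"] = merged["extracted_events"] / merged["acled_events"]
```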

Visualization

We aim to create a simple dashboard with Leaflet to visualize the geolocated conflict events.

Getting started

  1. Clone this repository
  2. Unzip the files located in data/texts/*
  3. Load the gold standard dataset and extract the full text and metadata from the articles
import pickle

import pandas as pd

# newsplease must be installed so that the pickled article objects
# can be deserialized, but it does not need an explicit import here.

def load_obj(month, idx):
    """Load the scraped article pickle for a given month and article id."""
    month = str(month).zfill(2)
    idx = str(idx).zfill(5)
    with open("data/texts/{}/{}.pkl".format(month, idx), "rb") as f:
        return pickle.load(f)

# Load the gold standard and fetch the full article for each labeled example.
# Paths are relative to the repository root.
gs = pd.read_csv("data/gold-standard/gold_standard.csv")

gs_articles = {}
for i in range(len(gs)):
    gs_articles[i] = load_obj(gs["month"][i], gs["ids"][i])
  4. Load the metadata attached to one of the gold standard articles
import pickle

import pandas as pd

def load_dict(month):
    """Load the article-to-metadata matching dictionary for a month."""
    month = str(month).zfill(2)
    with open("data/metadata/matching/{}.pkl".format(month), "rb") as f:
        return pickle.load(f)

def match_metadata(gs_sample):
    """Return the GDELT metadata rows matching one gold standard article."""
    month = str(gs_sample["month"]).zfill(2)
    idx = gs_sample["ids"]
    matching_dictionary = load_dict(month)
    print("This sample matches these rows in {}.csv: {}".format(month, matching_dictionary[idx]))
    df = pd.read_csv("data/metadata/variables/{}.csv".format(month))
    return df.iloc[matching_dictionary[idx]]

# Paths are relative to the repository root.
gs = pd.read_csv("data/gold-standard/gold_standard.csv")
gs_sample = gs.iloc[1]

match_metadata(gs_sample)

Organization

|-- data
    |-- metadata
        |-- matching
            |-- month1.pkl # dictionary mapping urls to variables/month{}.csv
            |-- ...        # because GDELT may extract multiple conflict events per article
                           # this is of the form {text/url id: [metadata/variables/month.csv row IDs]}
                           # {0: [0, 1, 2, 3, 4],
                           #  1: [5, 6],
                           #  2: [7, 11],
                           #  3: [8],
                           #  4: [9, 10, 12, 13]}
            |-- month12.pkl
        |-- variables
            |-- month1.csv # all GDELT extracted data on event
            |-- ...
            |-- month12.csv
    |-- texts
        |-- month1.zip # unzip to create a folder containing the below
            |-- 00000.pkl
                |-- # indexed with doc.title, doc.text, doc.publish_date
            |-- ...
            |-- 0000n.pkl
        |-- month2
        |-- ...
        |-- month12
    |-- urls
        |-- month1.txt # generated with set(df[url])
        |-- ...
        |-- month12.txt
    |-- gold-standard # contains hand labeled positive class examples
        |-- gold_standard.csv
               |-- month, url, id, class

Python scripts

  • scrape.py: python3 scrape.py --month $MONTH subsets and scrapes a month of data; adding --multiprocessing True parallelizes the process.

TODO: @johnmbrandt:

  • example jurisdictional linkage
  • urls data folder
