
aruneshmathur / dark-patterns


Code and data belonging to our CSCW 2019 paper: "Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites".

Home Page: https://webtransparency.cs.princeton.edu/dark-patterns/

License: GNU General Public License v3.0

Python 0.25% Java 0.03% HTML 9.90% JavaScript 0.18% Shell 0.01% Jupyter Notebook 89.64%
dark-pattern research-paper research-data human-computer-interaction public-policy sludge

dark-patterns's People

Contributors

aruneshmathur, elucherini, gunesacar, michaeljfriedman


dark-patterns's Issues

Link segments to phases

We would like to know which "page" (product page, cart page, checkout page, post-checkout page) each segment comes from. The segments table logs a timestamp; we can perhaps associate these timestamps with the times at which we click specific buttons.

Duplicate websites

The dataset contains a small number of duplicate websites (e.g., amazon.com). These are duplicated because they fall into more than one category. How do we deal with them?
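One option, sketched below under the assumption that rows carry url and category fields (hypothetical names, not necessarily the real schema), is to collapse duplicates into a single entry per domain while keeping the list of categories it appeared under:

```javascript
// Hypothetical sketch: collapse duplicate rows that differ only by category,
// keeping one entry per domain and collecting its categories.
function dedupeByDomain(rows) {
  const byDomain = new Map();
  for (const row of rows) {
    const existing = byDomain.get(row.url);
    if (existing) {
      existing.categories.push(row.category); // same site, extra category
    } else {
      byDomain.set(row.url, { url: row.url, categories: [row.category] });
    }
  }
  return [...byDomain.values()];
}

const rows = [
  { url: 'amazon.com', category: 'Books' },
  { url: 'amazon.com', category: 'Electronics' },
  { url: 'myntra.com', category: 'Clothing' },
];
console.log(dedupeByDomain(rows).length); // 2 unique domains
```

This preserves the category information while making sure each site is crawled (and counted) only once.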

Switch to "Behind the Overlay" to close overlay dialogs

Overlay dialogs, such as the one shown below, may break the code that navigates product pages. We need to be able to detect and close them, just as any user would.

(screenshot: overlay dialog)

We currently have a script that does this: https://github.com/aruneshmathur/dark-patterns/blob/master/src/crawler/zindex.js. The script finds the <div> element, if there is one, with the highest z-index value (this is non-trivial; beware), then searches within that <div> for elements with an attribute whose name or value matches "close". If such elements are found, it clicks them to dismiss the overlay.
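The attribute-matching step can be illustrated as a pure predicate. isCloseCandidate is a hypothetical helper operating on plain {name, value} pairs rather than a live DOM, so this is a sketch of the logic, not the actual zindex.js code:

```javascript
// Hypothetical sketch of the attribute test described above: an element is a
// close-button candidate if any attribute name or value contains "close"
// (case-insensitive). Attributes are modeled as plain {name, value} pairs
// so the logic can run outside a browser.
function isCloseCandidate(attributes) {
  return attributes.some(({ name, value }) =>
    /close/i.test(name) || /close/i.test(String(value))
  );
}

console.log(isCloseCandidate([{ name: 'class', value: 'modal-close-btn' }])); // true
console.log(isCloseCandidate([{ name: 'id', value: 'newsletter-signup' }]));  // false
```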

@gunesacar found a Chrome extension that does this too. I suspect their code is tested more thoroughly than ours. We should consider switching at some point.

Extension name: Behind the Overlay
Extension repository: https://github.com/NicolaeNMV/BehindTheOverlay/blob/master/chrome/js/overlay_remover.js

Navigating shopping websites

We would like to navigate shopping websites, specifically to move from the product page to the shopping cart page, and then to the checkout page. Any strategy for achieving this should work across different kinds of shopping websites, with possibly inconsistent interfaces.

Reducing overhead in segmentation

@gunesacar pointed out that our segmentation algorithm makes repeated calls to getBoundingClientRect() for element width/height. This function call is expensive and adds significant overhead.

We need to avoid making repeated calls to getBoundingClientRect(), and reuse its output where we can.
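A minimal sketch of one way to do this: memoize the rect per element for the duration of a single segmentation pass, assuming layout doesn't change within the pass (elements are mocked as plain objects here so the sketch runs outside a browser):

```javascript
// Hypothetical sketch: cache getBoundingClientRect() results per element for
// one segmentation pass. A fresh cache should be created per pass, since
// rects can change between passes.
function makeRectCache() {
  const cache = new WeakMap();
  return function cachedRect(element) {
    let rect = cache.get(element);
    if (!rect) {
      rect = element.getBoundingClientRect(); // expensive call, made once
      cache.set(element, rect);
    }
    return rect;
  };
}

// Mock element that counts how often the expensive call is made.
let calls = 0;
const el = { getBoundingClientRect() { calls++; return { width: 120, height: 40 }; } };
const getRect = makeRectCache();
getRect(el);
getRect(el);
console.log(calls); // 1: the second lookup is served from the cache
```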

Decide between textNode and innerText for getting element text

There are different ways to get text from elements: innerText takes styling into account and does not include hidden or invisible text, while textContent retrieves all text (hidden or not) and replaces newline characters with spaces. Other differences are explained here.

For the following case (red numbers updated randomly), innerText seems to work better, since textContent contains placeholder digits ("8") that are hidden from the user:
URL: https://loveandlinen.co/collections/womens-graphic-tees/products/zihuatanejo-mexico-womens-fit-t-shirt


innerText:

HURRY! Only 
3
3
4
 left!

textContent:
HURRY! Only 83834 left!

outerHTML:

<div id="products-available" style="color: rgb(0, 0, 0); margin-top: 10px;">
<img src="https://window-shoppers.azurewebsites.net/Images/hurry-left.png" class="ws-message-icon">
<span><span style="color: #030303">HURRY! Only <strong style="color: #DB1D1D"><span class="glowing odometer odometer-auto-theme odometer-animating-down odometer-animating">
<div class="odometer-inside"><span class="odometer-digit">
<span class="odometer-digit-spacer">8</span><span class="odometer-digit-inner">
<span class="odometer-ribbon"><span class="odometer-ribbon-inner">
<div class="odometer-value odometer-last-value odometer-first-value">3</div></span></span></span></span><span class="odometer-digit">
<span class="odometer-digit-spacer">8</span><span class="odometer-digit-inner">
<span class="odometer-ribbon"><span class="odometer-ribbon-inner">
<div class="odometer-value odometer-first-value">3</div>
<div class="odometer-value odometer-last-value">4</div></span></span></span></span></div></span></strong> left!</span></span></div>

innerText:

HOT! 
1
1
1
 sold in the last hour!

textContent:
HOT! 818181 sold in the last hour!

outerHTML:

<div id="products-sold" style="color: rgb(0, 0, 0); margin-top: 10px;">
<img src="https://window-shoppers.azurewebsites.net/Images/hot-sold.png" class="ws-message-icon">
<span>
<span style="color: #030303">HOT! <strong style="color: #DB1D1D">
<span class="glowing odometer odometer-auto-theme">
<div class="odometer-inside"><span class="odometer-digit">
<span class="odometer-digit-spacer">8</span>
<span class="odometer-digit-inner">
<span class="odometer-ribbon">
<span class="odometer-ribbon-inner">
<span class="odometer-value">1</span></span></span></span></span>
<span class="odometer-digit">
<span class="odometer-digit-spacer">8</span>
<span class="odometer-digit-inner">
<span class="odometer-ribbon">
<span class="odometer-ribbon-inner">
<span class="odometer-value">1</span></span></span></span></span>
<span class="odometer-digit">
<span class="odometer-digit-spacer">8</span>
<span class="odometer-digit-inner">
<span class="odometer-ribbon">
<span class="odometer-ribbon-inner">
<span class="odometer-value">1</span></span></span></span></span></div></span></strong> sold in the last hour!</span></span></div>

innerText:
114 others are viewing this right now!

textContent:
114 others are viewing this right now!

outerHTML:

<div id="people-looking" style="color: rgb(0, 0, 0); margin-top: 10px;">
<img src="https://window-shoppers.azurewebsites.net/Images/people-looking.png" class="ws-message-icon">
<span><span style="color: #030303"><strong style="color: #DB1D1D">
<span class="glowing">114</span></strong> others are viewing this right now!</span></span></div>
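The divergence above can be sketched with two recursive extractors over a mocked node tree, where a hidden flag stands in for CSS visibility. This is an illustration of the behavior, not browser code:

```javascript
// Hypothetical sketch of why innerText and textContent diverge: textContent
// concatenates every text node, while innerText-like extraction skips nodes
// the user cannot see. Nodes are plain objects with a `hidden` flag standing
// in for CSS visibility, so this runs outside a browser.
function allText(node) {
  if (node.text !== undefined) return node.text;
  return (node.children || []).map(allText).join('');
}

function visibleText(node) {
  if (node.hidden) return '';
  if (node.text !== undefined) return node.text;
  return (node.children || []).map(visibleText).join('');
}

// Odometer-style widget: spacer digits ("8") are present in the DOM but hidden.
const widget = { children: [
  { text: 'HURRY! Only ' },
  { children: [{ text: '8', hidden: true }, { text: '3' }] },
  { children: [{ text: '8', hidden: true }, { text: '4' }] },
  { text: ' left!' },
] };
console.log(allText(widget));     // "HURRY! Only 8384 left!"
console.log(visibleText(widget)); // "HURRY! Only 34 left!"
```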

Add unique IDs to elements

In some cases we receive multiple mutation summary events for the same element (e.g., due to animations or attribute changes). We currently have no way to attribute those events to individual elements or to group them by element.

We should add unique IDs to segment elements so that we can group events by elements that they belong to. These IDs should be globally unique and can be added as DOM element attributes.
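A minimal sketch, assuming a hypothetical data-dp-id attribute name and counter-based IDs (elements mocked as objects with an attributes map):

```javascript
// Hypothetical sketch: assign each segment element a globally unique ID via a
// custom attribute so later mutation events can be grouped per element. The
// attribute name "data-dp-id" is an assumption.
let counter = 0;
function assignUniqueId(element) {
  if (!element.attributes['data-dp-id']) {
    // timestamp + counter keeps IDs unique across page loads within one crawl
    element.attributes['data-dp-id'] = `dp-${Date.now()}-${++counter}`;
  }
  return element.attributes['data-dp-id'];
}

const a = { attributes: {} };
const b = { attributes: {} };
console.log(assignUniqueId(a) === assignUniqueId(a)); // true: stable per element
console.log(assignUniqueId(a) === assignUniqueId(b)); // false: unique across elements
```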

Segmenting a given webpage

We currently have a script that segments a webpage based on <div> tags. However, this approach is naive: it depends entirely on the DOM structure and fails to account for visual similarity.

We need to determine whether we can improve on this approach. We should also look into other page segmentation libraries (e.g., Fathom).

Evaluate the accuracy of shopping website detection

We want to measure and document the false positives and false negatives of the shopping website detection method.
We'll use the output logs of the link extraction crawl and manually check 100 websites to see whether the detection was correct.

Incorrect background-color value

The CSS style we dump for the segments contains the wrong background-color value since the property is not inherited by default.

We need to query each segment's ancestors recursively until we find a concrete value; this seems to be our best bet for retrieving the correct background-color.
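A sketch of that parent walk, assuming getComputedStyle reports rgba(0, 0, 0, 0) for unset backgrounds and falling back to white at the root; elements are mocked as plain objects with style and parent fields:

```javascript
// Hypothetical sketch: since background-color is not inherited, climb from the
// segment to its ancestors until a non-transparent value is found.
const TRANSPARENT = 'rgba(0, 0, 0, 0)'; // typical computed value for unset backgrounds
function effectiveBackgroundColor(element) {
  for (let node = element; node; node = node.parent) {
    const color = node.style['background-color'];
    if (color && color !== TRANSPARENT) return color;
  }
  return 'rgb(255, 255, 255)'; // assume a white page background as the fallback
}

const body = { style: { 'background-color': 'rgb(240, 240, 240)' }, parent: null };
const segment = { style: { 'background-color': TRANSPARENT }, parent: body };
console.log(effectiveBackgroundColor(segment)); // "rgb(240, 240, 240)"
```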

Implement domain categorization

[from today's discussions] One alternative for building a list of shopping sites is to categorize the Alexa top 1M sites using an external (non-Alexa) API.

The following code can be used to query the Bluecoat API:
https://github.com/PoorBillionaire/sitereview/blob/master/sitereview.py

For instance, it categorizes myntra.com as a shopping site:

python sitereview.py https://www.myntra.com/

======================
Symantec Site Review
======================

URL: https://www.myntra.com:443/
Last Time Rated/Reviewed: > 7 days 
Category: Shopping

Evaluating MutationObserver

MutationObserver is a Web API that reports changes to the DOM.

We could use this API to log dark patterns that emerge only after the page has loaded (e.g., via Ajax calls) or through certain user interactions (e.g., on selecting a product attribute).

As a first step, we could manually examine how it behaves on a small set of shopping websites.

Attribute DOM changes to scripts

We should be able to attribute DOM changes to individual scripts. OpenWPM has support for easily getting the responsible script for synchronous calls, but this won't work with mutation observers/summaries.

We need to use mutation events instead to be able to access the call stack, but this may slow things down. Let's experiment with different options.

Switch to OpenWPM

@gunesacar and I discussed that we should move away from vanilla Selenium and integrate our crawler with OpenWPM.

OpenWPM has several logging capabilities that may be of use to us in the long term (e.g., HTTP logging to determine requests that correspond to the Dark Patterns we discover).

Integrate segmentation into mutation summary event flow

Just for the record, we decided to integrate mutation summary events and segmentation as follows.
For each mutation summary event we receive about an element:

  • onNodeAdded, onNodeReparented, onNodeReordered, onAttrsChanged: re-segment the element's (old) segment, take snapshot(s) of the new segments
  • onCharacterDataChanged: find the element's segment, take snapshot
  • onNodeRemoved: NOOP

(take snapshot = store the segment's details in the database)
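The mapping above can be sketched as a dispatch table; resegment() and snapshot() are hypothetical stubs standing in for the real pipeline:

```javascript
// Hypothetical sketch of the dispatch described above. Event names match the
// mutation-summary callbacks listed in the issue.
function makeDispatcher({ resegment, snapshot }) {
  const handlers = {
    onNodeAdded: (el) => snapshot(resegment(el)),
    onNodeReparented: (el) => snapshot(resegment(el)),
    onNodeReordered: (el) => snapshot(resegment(el)),
    onAttrsChanged: (el) => snapshot(resegment(el)),
    onCharacterDataChanged: (el) => snapshot(el.segment), // existing segment, no re-segmentation
    onNodeRemoved: () => {}, // NOOP
  };
  return (event, el) => handlers[event](el);
}

const log = [];
const dispatch = makeDispatcher({
  resegment: (el) => { log.push('resegment'); return el.segment; },
  snapshot: () => log.push('snapshot'), // "snapshot" = store segment details in the DB
});
dispatch('onAttrsChanged', { segment: 's1' });
dispatch('onNodeRemoved', { segment: 's1' });
console.log(log); // ['resegment', 'snapshot']
```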

Build a product page classifier based on URL features

We'd like to identify product pages with minimal effort, i.e., without crawling them.
An idea @aruneshmathur and @randomwalker had is to build a classifier based on URL features.

List of potential features:

  • length of the URL
  • number of dashes in the URL
  • BOW based on the URL(?)

I feel like this approach assumes an SEO-friendly URL pattern.

@aruneshmathur
1. Are there other features you can think of, or that you discussed with @randomwalker?
2. Any idea how to handle non-SEO-friendly URLs (e.g., http://www2.hm.com/en_us/productpage.0629048001.html)?
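A sketch of the three features listed above; splitting the path on non-alphanumeric characters is my assumption about what a URL bag-of-words would look like:

```javascript
// Hypothetical feature extractor for the URL-based product-page classifier:
// length, dash count, and a bag of path tokens.
function urlFeatures(url) {
  const path = new URL(url).pathname;
  return {
    length: url.length,                                   // length of the URL
    dashes: (url.match(/-/g) || []).length,               // number of dashes
    tokens: path.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean), // BOW tokens
  };
}

const f = urlFeatures('https://example.com/products/zihuatanejo-mexico-womens-fit-t-shirt');
console.log(f.dashes);                      // 5
console.log(f.tokens.includes('products')); // true
```

Note that on a non-SEO-friendly URL like the H&M example, the token bag collapses to near-useless fragments, which illustrates the concern in question 2.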

Store details of the longest text node of each segment

In certain cases, a segment contains several text nodes, each with a different style. The style of the segment element may not be representative of the prominent text we are really interested in.

We decided we should store the text, style, dimensions and position of the longest text node of each segment.

Make naive segmentation recursive

We want to improve naive segmentation by descending into non-block elements until we find a textNode. We should be able to handle cases like
<div id="incorrect_segment"><span><div id="correct_segment"><span><text>..

Toggle product attributes on websites

Many websites require certain product attributes (e.g. size, color) to be selected before adding to cart. Clothing websites are one such example.

We need to be able to select these attributes during our crawls. We should also consider exploring the space of product attributes, since these may contain dark patterns. See the example below, where, on selecting a particular shoe size, the website claims "You just missed it"; this creates urgency.

(screenshot: shoe-size selector showing a "You just missed it" message)

Rate limit/throttle segmentation and element snapshots

from #26

It appears CSS animations cause a large number of mutation summary events (and segmentations) in a short time interval. Since we segment only the relevant elements, segmentation itself doesn't take long, but the amount of data stored per page exceeds 200 MB in some cases.

Let's see whether we can rate limit/throttle these expensive operations. Something like waiting at least 100 ms before re-running segmentation for an element should work. Perhaps it's a good idea to open a new issue for that.
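A sketch of per-element throttling with a 100 ms minimum interval; the injectable clock is only there to make the sketch testable without real timers:

```javascript
// Hypothetical sketch: run segmentation for an element at most once per
// interval. Callers ask shouldRun(element) on each mutation event and skip
// segmentation when it returns false.
function makeThrottle(intervalMs, clock = Date.now) {
  const lastRun = new Map(); // element -> timestamp of last segmentation
  return function shouldRun(element) {
    const now = clock();
    const last = lastRun.get(element);
    if (last !== undefined && now - last < intervalMs) return false;
    lastRun.set(element, now);
    return true;
  };
}

let now = 0;
const shouldSegment = makeThrottle(100, () => now);
const el = {};
console.log(shouldSegment(el)); // true: first event always runs
now = 50;
console.log(shouldSegment(el)); // false: within the 100 ms window
now = 150;
console.log(shouldSegment(el)); // true: window has elapsed
```

Dropping intermediate events this way trades some fidelity for bounded storage, which seems acceptable given the 200 MB-per-page figure above.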

Clustering preprocessing

Before starting the clustering, remove the URLs that have a "not a product page" or "isProductPage error (WebDriverException)" error in the log.

Build a clustering dashboard

We need a better dashboard for looking through the clusters, specifically an interface that provides more context than just the text of the segments (e.g., the actual segment UI).

Handle cases where computed style is null

This call to getComputedStyle in common.js returns null in some cases and causes an error.

I tried to see if there's something wrong with our instrumentation, but Firefox devtools itself cannot find any style info for these elements:

(screenshot: Firefox devtools showing no style information for the element)

Let's handle these cases.

Data pre-processing

The data file data/sites_with_rank_sorted.csv has four columns: url, popularity_rank, category, and overall_rank.

popularity_rank refers to a rank based on the popularity of a website. Subdomains within a specific website may have different popularity_rank values.

overall_rank refers to the Alexa rank of the website. A website and its subdomains have the same overall_rank. Also, if a website is repeated in more than one category, it has the same overall_rank.

We want to ensure that if a website has subdomains in the dataset, they carry meaning from the point of view of dark pattern measurements. If not, we intend to remove those subdomains and retain only the base domain.

Once this procedure is complete, we should have a list of websites that interest us. We can then sort them by popularity_rank.
