
aruneshmathur / dark-patterns


Code and data belonging to our CSCW 2019 paper: "Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites".

Home Page: https://webtransparency.cs.princeton.edu/dark-patterns/

License: GNU General Public License v3.0

Python 0.25% Java 0.03% HTML 9.90% JavaScript 0.18% Shell 0.01% Jupyter Notebook 89.64%
dark-pattern research-paper research-data human-computer-interaction public-policy sludge

dark-patterns's People

Contributors

aruneshmathur, elucherini, gunesacar, michaeljfriedman


dark-patterns's Issues

Link segments to phases

We would like to know which "page" (product page, cart page, checkout page, post-checkout page) each segment comes from. The segments table logs a timestamp; we can perhaps associate these timestamps with the times at which we click specific buttons.

Duplicate websites

The dataset contains a small number of duplicate websites (e.g., amazon.com). These are duplicated because they fall into more than one category. How do we deal with them?
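One option, sketched below under the assumption that rows carry url and category fields (hypothetical names, not necessarily the real schema), is to collapse duplicates into a single entry per domain while keeping the list of categories it appeared under:

```javascript
// Hypothetical sketch: collapse duplicate rows that differ only by category,
// keeping one entry per domain and collecting its categories.
function dedupeByDomain(rows) {
  const byDomain = new Map();
  for (const row of rows) {
    const existing = byDomain.get(row.url);
    if (existing) {
      existing.categories.push(row.category); // same site, extra category
    } else {
      byDomain.set(row.url, { url: row.url, categories: [row.category] });
    }
  }
  return [...byDomain.values()];
}

const rows = [
  { url: 'amazon.com', category: 'Books' },
  { url: 'amazon.com', category: 'Electronics' },
  { url: 'myntra.com', category: 'Clothing' },
];
console.log(dedupeByDomain(rows).length); // 2 unique domains
```

This preserves the category information while making sure each site is crawled (and counted) only once.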

Switch to "Behind the Overlay" to close overlay dialogs

Overlay dialogs, such as the one shown below, may break the code that navigates product pages. We need to be able to detect and close them, just as any user would.

(screenshot: overlay dialog)

We currently have a script that does this: https://github.com/aruneshmathur/dark-patterns/blob/master/src/crawler/zindex.js. The script finds the <div> element, if there is one, with the highest z-index value (this is non-trivial; beware), then searches within that <div> for elements with an attribute whose name or value matches "close". If such elements are found, it clicks them to dismiss the overlay.
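The attribute-matching step can be illustrated as a pure predicate. isCloseCandidate is a hypothetical helper operating on plain {name, value} pairs rather than a live DOM, so this is a sketch of the logic, not the actual zindex.js code:

```javascript
// Hypothetical sketch of the attribute test described above: an element is a
// close-button candidate if any attribute name or value contains "close"
// (case-insensitive). Attributes are modeled as plain {name, value} pairs
// so the logic can run outside a browser.
function isCloseCandidate(attributes) {
  return attributes.some(({ name, value }) =>
    /close/i.test(name) || /close/i.test(String(value))
  );
}

console.log(isCloseCandidate([{ name: 'class', value: 'modal-close-btn' }])); // true
console.log(isCloseCandidate([{ name: 'id', value: 'newsletter-signup' }]));  // false
```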

@gunesacar found a Chrome extension that does this too. I suspect their code is tested more thoroughly than ours. We should consider switching at some point.

Extension name: Behind the Overlay
Extension repository: https://github.com/NicolaeNMV/BehindTheOverlay/blob/master/chrome/js/overlay_remover.js

Navigating shopping websites

We would like to navigate shopping websites, specifically to move from the product page to the shopping cart page, and then to the checkout page. Any strategy for achieving this should work across different kinds of shopping websites, with possibly inconsistent interfaces.

Reducing overhead in segmentation

@gunesacar pointed out that our segmentation algorithm makes repeated calls to getBoundingClientRect() for element width/height. This function call is expensive and adds significant overhead.

We need to avoid making repeated calls to getBoundingClientRect(), and reuse its output where we can.
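A minimal sketch of one way to do this: memoize the rect per element for the duration of a single segmentation pass, assuming layout doesn't change within the pass (elements are mocked as plain objects here so the sketch runs outside a browser):

```javascript
// Hypothetical sketch: cache getBoundingClientRect() results per element for
// one segmentation pass. A fresh cache should be created per pass, since
// rects can change between passes.
function makeRectCache() {
  const cache = new WeakMap();
  return function cachedRect(element) {
    let rect = cache.get(element);
    if (!rect) {
      rect = element.getBoundingClientRect(); // expensive call, made once
      cache.set(element, rect);
    }
    return rect;
  };
}

// Mock element that counts how often the expensive call is made.
let calls = 0;
const el = { getBoundingClientRect() { calls++; return { width: 120, height: 40 }; } };
const getRect = makeRectCache();
getRect(el);
getRect(el);
console.log(calls); // 1: the second lookup is served from the cache
```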

Decide between textNode and innerText for getting element text

There are different ways to get text from elements: innerText takes styling into account and does not include hidden or invisible text, while textContent retrieves all text (hidden or not) and replaces newline characters with spaces. Other differences are explained here.

For the following case (red numbers updated randomly), innerText seems to work better, since textContent contains placeholder digits ("8") that are hidden from the user:
URL: https://loveandlinen.co/collections/womens-graphic-tees/products/zihuatanejo-mexico-womens-fit-t-shirt


innerText:

HURRY! Only 
3
3
4
 left!

textContent:
HURRY! Only 83834 left!

outerHTML:

<div id="products-available" style="color: rgb(0, 0, 0); margin-top: 10px;">
<img src="https://window-shoppers.azurewebsites.net/Images/hurry-left.png" class="ws-message-icon">
<span><span style="color: #030303">HURRY! Only <strong style="color: #DB1D1D"><span class="glowing odometer odometer-auto-theme odometer-animating-down odometer-animating">
<div class="odometer-inside"><span class="odometer-digit">
<span class="odometer-digit-spacer">8</span><span class="odometer-digit-inner">
<span class="odometer-ribbon"><span class="odometer-ribbon-inner">
<div class="odometer-value odometer-last-value odometer-first-value">3</div></span></span></span></span><span class="odometer-digit">
<span class="odometer-digit-spacer">8</span><span class="odometer-digit-inner">
<span class="odometer-ribbon"><span class="odometer-ribbon-inner">
<div class="odometer-value odometer-first-value">3</div>
<div class="odometer-value odometer-last-value">4</div></span></span></span></span></div></span></strong> left!</span></span></div>

innerText:

HOT! 
1
1
1
 sold in the last hour!

textContent:
HOT! 818181 sold in the last hour!

outerHTML:

<div id="products-sold" style="color: rgb(0, 0, 0); margin-top: 10px;">
<img src="https://window-shoppers.azurewebsites.net/Images/hot-sold.png" class="ws-message-icon">
<span>
<span style="color: #030303">HOT! <strong style="color: #DB1D1D">
<span class="glowing odometer odometer-auto-theme">
<div class="odometer-inside"><span class="odometer-digit">
<span class="odometer-digit-spacer">8</span>
<span class="odometer-digit-inner">
<span class="odometer-ribbon">
<span class="odometer-ribbon-inner">
<span class="odometer-value">1</span></span></span></span></span>
<span class="odometer-digit">
<span class="odometer-digit-spacer">8</span>
<span class="odometer-digit-inner">
<span class="odometer-ribbon">
<span class="odometer-ribbon-inner">
<span class="odometer-value">1</span></span></span></span></span>
<span class="odometer-digit">
<span class="odometer-digit-spacer">8</span>
<span class="odometer-digit-inner">
<span class="odometer-ribbon">
<span class="odometer-ribbon-inner">
<span class="odometer-value">1</span></span></span></span></span></div></span></strong> sold in the last hour!</span></span></div>

innerText:
114 others are viewing this right now!

textContent:
114 others are viewing this right now!

outerHTML:

<div id="people-looking" style="color: rgb(0, 0, 0); margin-top: 10px;">
<img src="https://window-shoppers.azurewebsites.net/Images/people-looking.png" class="ws-message-icon">
<span><span style="color: #030303"><strong style="color: #DB1D1D">
<span class="glowing">114</span></strong> others are viewing this right now!</span></span></div>
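The divergence above can be sketched with two recursive extractors over a mocked node tree, where a hidden flag stands in for CSS visibility. This is an illustration of the behavior, not browser code:

```javascript
// Hypothetical sketch of why innerText and textContent diverge: textContent
// concatenates every text node, while innerText-like extraction skips nodes
// the user cannot see. Nodes are plain objects with a `hidden` flag standing
// in for CSS visibility, so this runs outside a browser.
function allText(node) {
  if (node.text !== undefined) return node.text;
  return (node.children || []).map(allText).join('');
}

function visibleText(node) {
  if (node.hidden) return '';
  if (node.text !== undefined) return node.text;
  return (node.children || []).map(visibleText).join('');
}

// Odometer-style widget: spacer digits ("8") are present in the DOM but hidden.
const widget = { children: [
  { text: 'HURRY! Only ' },
  { children: [{ text: '8', hidden: true }, { text: '3' }] },
  { children: [{ text: '8', hidden: true }, { text: '4' }] },
  { text: ' left!' },
] };
console.log(allText(widget));     // "HURRY! Only 8384 left!"
console.log(visibleText(widget)); // "HURRY! Only 34 left!"
```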

Add unique IDs to elements

In some cases we receive multiple mutation summary events for the same element (e.g., due to animations or attribute changes). We currently have no way to attribute those events to individual elements or to group them by element.

We should add unique IDs to segment elements so that we can group events by elements that they belong to. These IDs should be globally unique and can be added as DOM element attributes.
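A minimal sketch, assuming a hypothetical data-dp-id attribute name and counter-based IDs (elements mocked as objects with an attributes map):

```javascript
// Hypothetical sketch: assign each segment element a globally unique ID via a
// custom attribute so later mutation events can be grouped per element. The
// attribute name "data-dp-id" is an assumption.
let counter = 0;
function assignUniqueId(element) {
  if (!element.attributes['data-dp-id']) {
    // timestamp + counter keeps IDs unique across page loads within one crawl
    element.attributes['data-dp-id'] = `dp-${Date.now()}-${++counter}`;
  }
  return element.attributes['data-dp-id'];
}

const a = { attributes: {} };
const b = { attributes: {} };
console.log(assignUniqueId(a) === assignUniqueId(a)); // true: stable per element
console.log(assignUniqueId(a) === assignUniqueId(b)); // false: unique across elements
```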

Segmenting a given webpage

We currently have a script that segments a webpage based on <div> tags. However, this approach is naive: it depends entirely on the DOM structure and fails to account for visual similarity.

We need to determine whether we can improve on this approach. We should also look into other page segmentation libraries (e.g., Fathom).

Evaluate the accuracy of shopping website detection

We want to measure and document the false positives and false negatives of the shopping website detection method.
We'll use the output logs of the link extraction crawl and manually check 100 websites to see whether the detection was correct.

Incorrect background-color value

The CSS style we dump for the segments contains the wrong background-color value since the property is not inherited by default.

We need to query each segment's ancestors recursively until we find a concrete value; this seems to be our best bet for retrieving the correct background-color.
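A sketch of that parent walk, assuming getComputedStyle reports rgba(0, 0, 0, 0) for unset backgrounds and falling back to white at the root; elements are mocked as plain objects with style and parent fields:

```javascript
// Hypothetical sketch: since background-color is not inherited, climb from the
// segment to its ancestors until a non-transparent value is found.
const TRANSPARENT = 'rgba(0, 0, 0, 0)'; // typical computed value for unset backgrounds
function effectiveBackgroundColor(element) {
  for (let node = element; node; node = node.parent) {
    const color = node.style['background-color'];
    if (color && color !== TRANSPARENT) return color;
  }
  return 'rgb(255, 255, 255)'; // assume a white page background as the fallback
}

const body = { style: { 'background-color': 'rgb(240, 240, 240)' }, parent: null };
const segment = { style: { 'background-color': TRANSPARENT }, parent: body };
console.log(effectiveBackgroundColor(segment)); // "rgb(240, 240, 240)"
```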

Implement domain categorization

[from today's discussions] One alternative for building a list of shopping sites is to categorize the Alexa top 1M sites using an external (non-Alexa) API.

The following code can be used to query the Bluecoat API:
https://github.com/PoorBillionaire/sitereview/blob/master/sitereview.py

For instance, it categorizes myntra.com as a shopping site:

python sitereview.py https://www.myntra.com/

======================
Symantec Site Review
======================

URL: https://www.myntra.com:443/
Last Time Rated/Reviewed: > 7 days 
Category: Shopping

Evaluating MutationObserver

MutationObserver is a Web API that reports changes to the DOM.

We could use this API to log dark patterns that emerge only after the page has loaded (e.g., via Ajax calls) or through certain user interactions (e.g., on selecting a product attribute).

As a first step, we could manually examine how it behaves on a small set of shopping websites.

Attribute DOM changes to scripts

We should be able to attribute DOM changes to individual scripts. OpenWPM has support for easily getting the responsible script for synchronous calls, but this won't work with mutation observers/summaries.

We need to use mutation events instead to be able to access the call stack, but this may slow things down. Let's experiment with different options.

Switch to OpenWPM

@gunesacar and I discussed that we should move away from vanilla Selenium and integrate our crawler with OpenWPM.

OpenWPM has several logging capabilities that may be of use to us in the long term (e.g., HTTP logging to determine requests that correspond to the Dark Patterns we discover).

Integrate segmentation into mutation summary event flow

Just for the record, we decided to integrate mutation summary events and segmentation as follows.
For each mutation summary event we receive about an element:

  • onNodeAdded, onNodeReparented, onNodeReordered, onAttrsChanged: re-segment the element's (old) segment, take snapshot(s) of the new segments
  • onCharacterDataChanged: find the element's segment, take snapshot
  • onNodeRemoved: NOOP

(take snapshot = store the segment's details in the database)
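The mapping above can be sketched as a dispatch table; resegment() and snapshot() are hypothetical stubs standing in for the real pipeline:

```javascript
// Hypothetical sketch of the dispatch described above. Event names match the
// mutation-summary callbacks listed in the issue.
function makeDispatcher({ resegment, snapshot }) {
  const handlers = {
    onNodeAdded: (el) => snapshot(resegment(el)),
    onNodeReparented: (el) => snapshot(resegment(el)),
    onNodeReordered: (el) => snapshot(resegment(el)),
    onAttrsChanged: (el) => snapshot(resegment(el)),
    onCharacterDataChanged: (el) => snapshot(el.segment), // existing segment, no re-segmentation
    onNodeRemoved: () => {}, // NOOP
  };
  return (event, el) => handlers[event](el);
}

const log = [];
const dispatch = makeDispatcher({
  resegment: (el) => { log.push('resegment'); return el.segment; },
  snapshot: () => log.push('snapshot'), // "snapshot" = store segment details in the DB
});
dispatch('onAttrsChanged', { segment: 's1' });
dispatch('onNodeRemoved', { segment: 's1' });
console.log(log); // ['resegment', 'snapshot']
```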

Build a product page classifier based on URL features

We'd like to identify product pages with minimal effort, i.e., without crawling them.
An idea @aruneshmathur and @randomwalker had is to build a classifier based on URL features.

List of potential features:

  • length of the URL
  • number of dashes in the URL
  • BOW based on the URL(?)

I feel like this approach assumes an SEO-friendly URL pattern.

@aruneshmathur
1. Are there other features you can think of, or that you discussed with @randomwalker?
2. Any idea how to handle non-SEO-friendly URLs (e.g., http://www2.hm.com/en_us/productpage.0629048001.html)?
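A sketch of the three features listed above; splitting the path on non-alphanumeric characters is my assumption about what a URL bag-of-words would look like:

```javascript
// Hypothetical feature extractor for the URL-based product-page classifier:
// length, dash count, and a bag of path tokens.
function urlFeatures(url) {
  const path = new URL(url).pathname;
  return {
    length: url.length,                                   // length of the URL
    dashes: (url.match(/-/g) || []).length,               // number of dashes
    tokens: path.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean), // BOW tokens
  };
}

const f = urlFeatures('https://example.com/products/zihuatanejo-mexico-womens-fit-t-shirt');
console.log(f.dashes);                      // 5
console.log(f.tokens.includes('products')); // true
```

Note that on a non-SEO-friendly URL like the H&M example, the token bag collapses to near-useless fragments, which illustrates the concern in question 2.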

Store details of the longest text node of each segment

In certain cases, a segment contains several text nodes, each with a different style. The style of the segment element may not be representative of the prominent text we are really interested in.

We decided we should store the text, style, dimensions and position of the longest text node of each segment.

Make naive segmentation recursive

We want to improve naive segmentation by descending into non-block elements until we find a textNode. We should be able to handle cases like
<div id="incorrect_segment"><span><div id="correct_segment"><span><text>..

Toggle product attributes on websites

Many websites require certain product attributes (e.g. size, color) to be selected before adding to cart. Clothing websites are one such example.

We need to be able to select these attributes during our crawls. We should also consider exploring the space of product attributes, since these may contain dark patterns. See the example below, where, on selecting a particular shoe size, the website claims "You just missed it"; this creates urgency.

(screenshot: shoe-size selector showing a "You just missed it" message)

Rate limit/throttle segmentation and element snapshots

from #26

It appears CSS animations cause a large number of mutation summary events (and segmentations) in a short time interval. Since we segment only the relevant elements, segmentation itself doesn't take long, but the amount of data stored per page exceeds 200 MB in some cases.

Let's see whether we can rate limit/throttle these expensive operations. Something like waiting at least 100 ms before re-running segmentation for an element should work. Perhaps it's a good idea to open a new issue for that.
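A sketch of per-element throttling with a 100 ms minimum interval; the injectable clock is only there to make the sketch testable without real timers:

```javascript
// Hypothetical sketch: run segmentation for an element at most once per
// interval. Callers ask shouldRun(element) on each mutation event and skip
// segmentation when it returns false.
function makeThrottle(intervalMs, clock = Date.now) {
  const lastRun = new Map(); // element -> timestamp of last segmentation
  return function shouldRun(element) {
    const now = clock();
    const last = lastRun.get(element);
    if (last !== undefined && now - last < intervalMs) return false;
    lastRun.set(element, now);
    return true;
  };
}

let now = 0;
const shouldSegment = makeThrottle(100, () => now);
const el = {};
console.log(shouldSegment(el)); // true: first event always runs
now = 50;
console.log(shouldSegment(el)); // false: within the 100 ms window
now = 150;
console.log(shouldSegment(el)); // true: window has elapsed
```

Dropping intermediate events this way trades some fidelity for bounded storage, which seems acceptable given the 200 MB-per-page figure above.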

Clustering preprocessing

Before starting the clustering, remove the URLs that have a "not a product page" or "isProductPage error (WebDriverException)" error in the log.

Build a clustering dashboard

We need a better dashboard for looking through the clusters, specifically an interface that provides more context than just the text of the segments (e.g., the actual segment UI).

Handle cases where computed style is null

This call to getComputedStyle in common.js returns null in some cases and causes an error.

I tried to see if there's something wrong with our instrumentation, but Firefox devtools itself cannot find any style info for these elements:

(screenshot: Firefox devtools showing no style information for the element)

Let's handle these cases.

Data pre-processing

The data file data/sites_with_rank_sorted.csv has four columns: url, popularity_rank, category, and overall_rank.

popularity_rank refers to a rank based on the popularity of a website. Subdomains within a specific website may have different popularity_rank values.

overall_rank refers to the Alexa rank of the website. A website and its subdomains have the same overall_rank. Also, if a website is repeated in more than one category, it has the same overall_rank.

We want to ensure that if a website has subdomains in the dataset, they carry meaning from the point of view of dark pattern measurements. If not, we intend to remove those subdomains and retain only the base domain.

Once this procedure is complete, we should have a list of websites that interest us. We can then sort them by popularity_rank.
