arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks and it supports well-known privacy models, such as k-anonymity, l-diversity, t-closeness and differential privacy.

Home Page: http://arx.deidentifier.org/

License: Apache License 2.0

HTML 0.37% Java 99.63%

arx cross-platform data-analytics data-anonymization de-identification open-source privacy

arx's Issues

Improve performance for very large files (IPUMS)

When loading the new 10% Decennial Census sample from 2010 (https://usa.ipums.org/usa/chapter2/chapter2.shtml#2010), which consists of ~31M records with ~25 attributes each, memory consumption is ~5 GB. This must (and can) be reduced.

GUI Freeze While Scrolling Through Results

The GUI freezes when scrolling through results. Stack trace attached. The only unusual circumstance was probably a quick switch from configuration to results, with an immediate attempt to scroll. Otherwise, I was using a simple dataset and configuration for testing the game-theoretic model.

Per an email exchange with Dr. Prasser - "likely hit a rare race condition [in] the code which synchronizes the scroll position of both tables."

How to use local recoding?

Dear Fabian, Thank you for kindly answering. I also have questions about using "local recoding" in ARX. I find it a little tricky to control the "strength" of generalization, since different terms are in use (such as the fixed point, 100 passes, etc.). Is there any documentation to download? Cheers Yang

Fail to build with Ant on OS X

I am trying to build the project using Ant (version 1.9.4 compiled on April 29 2014) on OS X 10.10.1 (Yosemite), and here's what I get from the compile task (followed by 30 errors due to missing packages):

[javac] /Users/abasu/github/arx/src/gui/org/deidentifier/arx/gui/view/impl/menu/DialogProperties.java:31: error: package de.linearbits.preferences does not exist

Am I missing something?

Ability to change population settings in the risk analysis perspective

Hi! In the "Analyze risk" tab, on the bottom of the UI, tab "Population", I can't un-tick the "Use this population" box, nor un-select USA to select another country. When clicking the box, the tick doesn't go away. I reproduced this with the example.deid project, and with some dummy data from generatedata.com.

I'm using ARX 3.5.1 on Linux, running java 1.8.0_112. uname -a returns:
Linux [redacted] 3.13.0-101-generic #148-Ubuntu SMP Thu Oct 20 22:08:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Feature Request: Editing multiple attributes at once

Hello!

Following up on our email correspondence, I would like to suggest allowing the selection and editing of multiple attributes at once. For example, if one has more than 15-20 attributes, editing each attribute manually and separately becomes cumbersome.

It would be nice to be able to mark several attributes as quasi identifiers at once - or to select all except a few as quasi identifiers. Also, editing data types in that way (or assignments of equal generalization hierarchies) could be useful.

Bug in project settings under Linux/GTK

When running ARX with Linux/GTK, the last entry in each category of the project settings dialog has rendering issues.

Bug in counting of suppressed elements

Missing locales

ARX seems to lack support for some locales and Unicode encodings, e.g. Persian/Iran.

Consider using something like "ICU - International Components for Unicode"

see: http://site.icu-project.org/

LKC-privacy

Have you come across lkc-privacy? http://www.cs.umanitoba.ca/~noman/Papers/MFHL10tkdd.pdf

[GUI] Data types are not considered when rendering data to a table

For example, 5.1234567 is displayed as 5.1234567 even when the data type is set to decimal with the format string #.###, where 5.123 would be expected.
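To make the expected rendering concrete, here is a minimal sketch using the JDK's java.text.DecimalFormat. The class and method names are hypothetical and this is not ARX's rendering code; it only shows what applying the format string to the raw value would produce.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class DecimalRenderingSketch {

    /** Formats a raw value with the given pattern, using '.' as the decimal separator. */
    static String render(double value, String pattern) {
        // Fixed symbols so the output does not depend on the default locale.
        DecimalFormat format = new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
        return format.format(value);
    }

    public static void main(String[] args) {
        System.out.println(render(5.1234567, "#.###")); // prints 5.123
    }
}
```

A table renderer would call something like render(...) per cell instead of printing the stored string unchanged.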

Add support for sampling weights

EU General Data Protection Regulation compliance support?

I am totally new to this tool, so please excuse the fact that my question/request is not informed by any practical knowledge of the software. Would it be possible to provide guidance on the website, in the documentation, and possibly within the app itself on the suitability of the different options for meeting the new GDPR requirements?

Get rid of ARXResult.getOptimizablePrivacyModels()

Implement as a method, such as PrivacyCriterion.supportsLocalRecoding().

Javadoc comments lead to warnings and they are not Java 8 compatible

Building the current master branch results in a failed build for the Javadocs.

Help with getting the information loss metric

Thanks for the ARX library, it is very helpful for me. I use the adult data set and anonymize it. At the end I need to get the information loss and the DM metric to compare with another solution. Can you show how I can obtain the information loss after anonymizing the data, and how to understand and analyze the metric's result? Can you give me an example?
Thanks a lot

Integration with Spark?

Hi,

Very nice tool. Do you know anyone who ever used it as a Spark dependency? Would it make sense?

Thanks.

Regards,
Yann

On launching ARX, call to "finishLaunching" reports (non fatal) error

Starting ARX on Mac OS X El Capitan, with Java 1.8.0_92:

java -XstartOnFirstThread -jar ARX.jar
2016-05-02 10:40:37.175 java[89155:10638489] _createMenuRef called with existing principal MenuRef already associated with menu
2016-05-02 10:40:37.176 java[89155:10638489] (
0 CoreFoundation 0x00007fff8fff94f2 __exceptionPreprocess + 178
1 libobjc.A.dylib 0x00007fff8b6a273c objc_exception_throw + 48
2 CoreFoundation 0x00007fff900604bd +[NSException raise:format:] + 205
3 AppKit 0x00007fff895f2d8a -[NSCarbonMenuImpl _createMenuRef] + 62
4 AppKit 0x00007fff895f26c9 -[NSCarbonMenuImpl _instantiateCarbonMenu] + 140
5 AppKit 0x00007fff895f0374 -[NSApplication finishLaunching] + 856
6 libswt-pi-cocoa-4234.jnilib 0x000000011c835a5e Java_org_eclipse_swt_internal_cocoa_OS_objc_1msgSendSuper__Lorg_eclipse_swt_internal_cocoa_objc_1super_2J + 89
7 ??? 0x0000000103df2554 0x0 + 4359923028

This is not fatal.

QueryBuilder crashes with very large dataset

Dataset (https://db-ip.com/db/download/city): approx. 5 million tuples

Working query: 'Column #2' = 'value'

Exception below when simply adding OR to the above statement

Exception in thread "Thread-8" java.lang.NullPointerException
at org.deidentifier.arx.gui.view.impl.menu.DialogQuery$1.run(DialogQuery.java:79)
at java.lang.Thread.run(Unknown Source)

Error 15 in import CSV data interface

Hi there,

I am currently using version 3.4.2.
The ARX tool shows an error with number 15 in the dialog for importing CSV data.
The dataset is not displayed in the GUI and the anonymized dataset is not analyzed in the tool.
Is there any solution to this CSV import issue?

Thank you.

Oracle JDBC Connection in GUI version

It would be great to have not only MySQL, MS SQL, Postgres and SQLite but also Oracle to import data in the GUI version.

pageup/down keys

Hi,
I've seen in the code that DefaultSelectionBindings has been added to the UI configurations. However, pressing the page up/down keys has no effect, i.e., the scrollbar does not move.

Add data masking functionality

Hi,
I am not aware of which generalizations or aggregate functions exist for string or categorical data. However, for an attribute set as Identifiable, no matter whether I set Transformation to Generalization or Microaggregation, and for microaggregation no matter which aggregation function I choose, the values of these attributes remain "" and the Transformation gets reset to Generalization. I am also wondering whether you have plans for salted hashing transformations. I believe these are useful for identifiers, although less anonymous than "" fields, for example when creating a test database where we want to minimize private data but still be able to do verifications and correlations. Thanks
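The salted hashing idea from this request can be sketched with the JDK's MessageDigest. The class name, method name, and choice of SHA-256 are assumptions for illustration, not an ARX feature. The same identifier and salt always map to the same token, so records stay linkable; that is also why this is pseudonymization rather than anonymization.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SaltedHashSketch {

    /** Returns the hex-encoded SHA-256 digest of salt followed by identifier. */
    static String pseudonymize(String identifier, String salt) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(salt.getBytes(StandardCharsets.UTF_8));
            byte[] hash = digest.digest(identifier.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical identifier and salt; the salt would have to be kept secret.
        System.out.println(pseudonymize("alice@example.org", "project-salt"));
    }
}
```

Anyone who knows the salt can recompute tokens for guessed identifiers, so the protection depends on keeping the salt private.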

Import not working with MAC carriage return

A CSV file with Mac-style carriage returns (CR-only line endings) cannot be imported. An error message (512) is displayed.
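Until CR-only files are handled by the importer, one possible workaround is to normalize line endings before import. The sketch below is an assumption about a preprocessing step, not ARX's import code; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class LineEndingSketch {

    /** Converts Windows (CRLF) and classic Mac (CR) line endings to LF. */
    static String normalize(String text) {
        // Replace CRLF first so a CRLF pair does not turn into two line breaks.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Splits text into lines after normalizing the line endings. */
    static List<String> lines(String text) {
        return Arrays.asList(normalize(text).split("\n"));
    }

    public static void main(String[] args) {
        System.out.println(lines("a,b\r1,2\r3,4")); // prints [a,b, 1,2, 3,4]
    }
}
```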

Add support for household structures

I can't use the ARX API for anonymization

Hello, I want to use the ARX API and write a sample that anonymizes data with t-closeness, l-diversity and k-anonymity, but I get the following error:
Attribute 'age': hierarchy misses some values or contains duplicates
at org.deidentifier.arx.framework.data.GeneralizationHierarchy.<init>(Unknown Source)
at org.deidentifier.arx.framework.data.DataManager.<init>(Unknown Source)
at org.deidentifier.arx.ARXAnonymizer.getDataManager(Unknown Source)
at org.deidentifier.arx.ARXAnonymizer.anonymize(Unknown Source)
at kanonytest.kanony.main(kanony.java:68)
I use the latest version of the ARX API.
Can you help me? Thanks a lot

Unable to open a Project

Hi,
Whenever I try to open a project which I have already created, I get the following error:

Attribute 'Workclass' : hierarchy misses some values or contains duplicates.

Could you please help me with this?

Thanks,
Udyot

Bug when creating hierarchies based on intervals for negative values

There seems to be a bug when hierarchies based on intervals are created for numerical attributes that have negative values. The problem can be triggered by building a hierarchy for the attribute "income" in the high-dimensional dataset.

Implement method for loading hierarchies specifying an instance of CSVDataInput

Toggle subset view before anonymization

The subset view of the input data cannot be toggled right after selecting a subset via the query builder. It only becomes available after a successful anonymization.

Parallelised or distributed version

I am looking at searching and anonymising data with a large number of records (at least 10M). One of the use cases is horizontally integrating results from multiple locations without sharing the raw data. While the Flash implementation is very fast at the moment, it does not appear to be parallelised for large local sets or distributable for partitioned sets.

Duplicated select option in GUI (OSX)

Hi, I would like to report a duplicated select entry in the GUI (yyyy-MM-dd), hereby a screenshot:

OSX 10.9.5
ARX 2.2.0

String data anonymization

Hi,

I am reading data from a database table. It contains values of type varchar, and I want to anonymize that data. Instead of setting the hierarchy manually, I want to set it dynamically, like we do for numbers (where we calculate the geometric or arithmetic mean). Is there any utility like that? Please help me out.
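One common way to derive a hierarchy for arbitrary strings without writing it by hand is to mask characters from the right, one more per level. The following is a minimal sketch of that idea in plain Java. The class name, method name, and padding rule are assumptions; it does not use or describe ARX's own hierarchy builders.

```java
import java.util.Arrays;

final class MaskingHierarchySketch {

    /**
     * Builds one hierarchy row: level 0 is the value itself, each further level
     * masks one more character from the right. Shorter values are padded with
     * spaces so that all rows have maxLength + 1 levels.
     */
    static String[] levels(String value, int maxLength) {
        if (value.length() > maxLength) {
            throw new IllegalArgumentException("value is longer than maxLength");
        }
        StringBuilder padded = new StringBuilder(value);
        while (padded.length() < maxLength) {
            padded.append(' ');
        }
        String[] result = new String[maxLength + 1];
        result[0] = value;
        for (int masked = 1; masked <= maxLength; masked++) {
            StringBuilder level = new StringBuilder(padded.substring(0, maxLength - masked));
            for (int i = 0; i < masked; i++) {
                level.append('*');
            }
            result[masked] = level.toString();
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [81667, 8166*, 816**, 81***, 8****, *****]
        System.out.println(Arrays.toString(levels("81667", 5)));
    }
}
```

Calling levels(...) for every distinct value in the column, with maxLength set to the longest value, yields rows of equal length that could be written out as a hierarchy file.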

Error message needs to be more verbose

The error message comes from the class GeneralizationHierarchy.java and it says "Attribute 'name': hierarchy misses some values or contains duplicates". It would be much more useful to see some of the examples that cause the problem.
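A more verbose message could name a few of the offending values. The sketch below shows one way to collect them. It assumes the hierarchy is a String[][] whose first column holds the original values, and the class and method names are hypothetical; it is not the code in GeneralizationHierarchy.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class HierarchyCheckSketch {

    /** Returns a message naming up to 'limit' offending values, or null if none are found. */
    static String describeProblems(String attribute, List<String> dataValues,
                                   String[][] hierarchy, int limit) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String[] row : hierarchy) {
            // Assumes every row has at least one column (the original value).
            if (!seen.add(row[0])) {
                duplicates.add(row[0]);
            }
        }
        Set<String> missing = new LinkedHashSet<>();
        for (String value : dataValues) {
            if (!seen.contains(value)) {
                missing.add(value);
            }
        }
        if (missing.isEmpty() && duplicates.isEmpty()) {
            return null;
        }
        return "Attribute '" + attribute + "': missing in hierarchy " + head(missing, limit)
                + ", duplicated in hierarchy " + head(duplicates, limit);
    }

    private static List<String> head(Set<String> values, int limit) {
        List<String> list = new ArrayList<>(values);
        return list.subList(0, Math.min(limit, list.size()));
    }

    public static void main(String[] args) {
        String[][] hierarchy = { {"17", "<20"}, {"18", "<20"}, {"17", "<20"} };
        // prints Attribute 'age': missing in hierarchy [99], duplicated in hierarchy [17]
        System.out.println(describeProblems("age", Arrays.asList("17", "18", "99"), hierarchy, 5));
    }
}
```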

Ability to edit imported hierarchies in the hierarchy wizard dialog?

Dear Developers,

If I create hierarchies using the "hierarchy wizard" GUI, I can later review the hierarchies in the GUI again. But if I import hierarchies from CSV files, I am not allowed to review the hierarchies in the wizard GUI. I think it would be great to have this function, particularly to help beginners understand how hierarchies can be created manually.

Many thanks,

Best wishes,
Jianliang

Wildcards in query builder

It would be nice to use wildcards in the query builder e.g. when searching for long strings
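As an illustration of what such matching could look like, the sketch below translates a wildcard expression into a java.util.regex.Pattern. It assumes '*' means any run of characters and '?' means exactly one character. This syntax and the class name are assumptions, not the query builder's grammar.

```java
import java.util.regex.Pattern;

final class WildcardSketch {

    /** Translates '*' and '?' wildcards into a regex; all other characters match literally. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    // Quote literal runs so characters such as '.' are not treated as regex syntax.
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    public static void main(String[] args) {
        System.out.println(toPattern("New*").matcher("New York").matches()); // prints true
        System.out.println(toPattern("a.c").matcher("abc").matches());       // prints false
    }
}
```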

Height of column titles in tables too small - text truncated

Column titles in most tables are too small, which leads to text that is truncated (see screenshot):

Change distribution of class sizes to distribution of re-identification risks

In the risk analysis perspective, it would be great to replace the distribution of class sizes with the distribution of re-identification risks (1/class-size). This affects the table as well as the histogram, and the ARX API/library as well as the GUI (it would be great to change this on the API level as well).
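The conversion itself is small: each equivalence class of size s contributes s records at risk 1/s. Here is a minimal sketch, assuming the class sizes are available as an int array. The class and method names are hypothetical and this is not the ARX risk API.

```java
import java.util.Map;
import java.util.TreeMap;

final class RiskDistributionSketch {

    /** Maps each re-identification risk (1/class size) to the fraction of records at that risk. */
    static Map<Double, Double> riskDistribution(int[] classSizes) {
        Map<Double, Integer> recordsPerRisk = new TreeMap<>();
        int totalRecords = 0;
        for (int size : classSizes) {
            if (size <= 0) {
                throw new IllegalArgumentException("class sizes must be positive");
            }
            // A class of this size holds 'size' records, each with risk 1/size.
            recordsPerRisk.merge(1.0 / size, size, Integer::sum);
            totalRecords += size;
        }
        Map<Double, Double> distribution = new TreeMap<>();
        for (Map.Entry<Double, Integer> entry : recordsPerRisk.entrySet()) {
            distribution.put(entry.getKey(), entry.getValue() / (double) totalRecords);
        }
        return distribution;
    }

    public static void main(String[] args) {
        // Classes of size 1, 2, 2 and 5 hold 10 records in total.
        System.out.println(riskDistribution(new int[] {1, 2, 2, 5}));
    }
}
```

For the sizes in main, one of the ten records has risk 1, four have risk 1/2, and five have risk 1/5.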

SQL Server schema extract issue

ARX can't seem to handle tables in SQL Server that are not in the dbo schema.

Hierarchy has some missing values or duplicates

Hello there,

my dataset contains columns that also include multi-valued attributes like in this example column:

"EC_Aaa123"
"EC_Xxx, EC_Yyyy, EC_Z"
"EC_Bbbb567"

(* You can think of proteins that can be labeled with one or multiple EC numbers that come from a hierarchical tree-like taxonomy. I am actually using some different taxonomies, but the principle is the same)

I know that ARX cannot generalize such multi-value strings, so I did it already in a preprocessing step. That is, I used preprocessing to simplify all multi-valued strings as much as possible with respect to the underlying taxonomy, using a most recent common ancestor approach.

Now I want to provide ARX this column along with a mock-up taxonomy, like so:

EC_Aaa123; EC_Aaa; EC_A; ****
EC_Bbb567; EC_Bbb; EC_B; ****
EC_Xxx,EC_Yyyy,EC_Z; ****; ****; ****

So, the idea is to give ARX the freedom to generalize at least the single-value attributes as best as possible, or to suppress the multi-valued ones. By the way, the items in multi-valued strings appear in sorted order, to avoid ambiguity.

This approach works except for some columns. ARX complains about missing values, but I don't know why. I checked the taxonomies several times.

Are there any latent restrictions that apply? For example, on the length of the strings in the input columns?

I really hope you can help me, as I already invested two weeks into the intricate preprocessing steps of the dataset. Unfortunately, I am not allowed to share that data.
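One thing that may be worth ruling out: in the examples as written above, the multi-valued data value has spaces after the commas ("EC_Xxx, EC_Yyyy, EC_Z") while the hierarchy entry does not ("EC_Xxx,EC_Yyyy,EC_Z"). If values are matched as exact strings, which is an assumption here, such a difference would be reported as a missing value. The sketch below lists data values without an exactly matching hierarchy key; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HierarchyKeySketch {

    /** Removes whitespace around commas so that "A, B" and "A,B" compare equal. */
    static String canonical(String value) {
        return value.trim().replaceAll("\\s*,\\s*", ",");
    }

    /** Returns the data values that have no exactly matching first column in the hierarchy lines. */
    static List<String> unmatched(List<String> dataValues, List<String> hierarchyLines) {
        Set<String> keys = new HashSet<>();
        for (String line : hierarchyLines) {
            // Columns are separated by ';' as in the mock-up taxonomy above.
            keys.add(line.split(";", -1)[0].trim());
        }
        List<String> result = new ArrayList<>();
        for (String value : dataValues) {
            if (!keys.contains(value)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hierarchy = Arrays.asList(
                "EC_Aaa123; EC_Aaa; EC_A; ****",
                "EC_Xxx,EC_Yyyy,EC_Z; ****; ****; ****");
        List<String> data = Arrays.asList("EC_Aaa123", "EC_Xxx, EC_Yyyy, EC_Z");
        System.out.println(unmatched(data, hierarchy)); // prints [EC_Xxx, EC_Yyyy, EC_Z]
    }
}
```

Applying canonical(...) to both the data values and the hierarchy keys before comparing removes this particular source of mismatch.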

Multithreaded GUI on top of a single-threaded application

ARX is single threaded and the methods provided by the API are typically not thread-safe.

However, ARX provides "Builders" for performing resource-intensive tasks asynchronously, e.g. computing statistics or estimating risks, to make the ARX GUI more responsive.

The user can probably interfere with these parallel processes and trigger errors, for example by sorting the data while an analysis is performed in the background...
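One general pattern for avoiding this kind of interference is to route every operation on the non-thread-safe state through a single worker thread, so that a background analysis and a user-triggered sort can never overlap. The sketch below illustrates the pattern with plain JDK classes. The list is a stand-in for the shared state, all names are hypothetical, and this is not how ARX's Builders are implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SerializedAccessSketch implements AutoCloseable {

    // Stand-in for state that is not thread-safe.
    private final List<Integer> data = new ArrayList<>();

    // A single worker thread runs all submitted tasks one after another, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    Future<?> add(int value) {
        return worker.submit(() -> { data.add(value); });
    }

    Future<?> sort() {
        return worker.submit(() -> { Collections.sort(data); });
    }

    Future<Long> sum() {
        return worker.submit(() -> {
            long total = 0;
            for (int v : data) {
                total += v;
            }
            return total;
        });
    }

    @Override
    public void close() {
        worker.shutdown();
    }

    /** Queues two writes, a sort and a read, then waits for the read. */
    static long demo() {
        try (SerializedAccessSketch state = new SerializedAccessSketch()) {
            state.add(3);
            state.add(1);
            state.sort();
            return state.sum().get(); // blocks until the earlier tasks have run
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4
    }
}
```

The trade-off is that a long-running analysis delays every later request, so a GUI would still need cancellation or a way to disable conflicting actions.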

Decimal Interval hierarchy generation - if interval is higher precision than a given value, we get the 'No Interval Found' exception

When the interval (in an interval hierarchy) is a higher precision than a given value, we get the no interval found exception. See this standalone class to replicate. The interval is 0.001, however one of the values is 40.812 - also 3 decimal places. Run it to see the error. To remove the error either reduce the precision of the interval to 0.01 or add a digit to 40.812x. Setting the format to #.###### doesn't remove the error - but the format is applied in the error message.

package org.deidentifier.arx.examples;

import org.deidentifier.arx.DataType;
import org.deidentifier.arx.aggregates.HierarchyBuilderGroupingBased.Level;
import org.deidentifier.arx.aggregates.HierarchyBuilderIntervalBased;
import org.deidentifier.arx.aggregates.HierarchyBuilderIntervalBased.Interval;
import org.deidentifier.arx.aggregates.HierarchyBuilderIntervalBased.Range;

import cern.colt.Arrays;

public class ExampleDecimal extends Example {

public static void main(final String[] args) {
    intervalBased(0.01d);
}

private static void intervalBased(double interval) {

    //DataType<Double> dataType = DataType.createDecimal("#.######");
    DataType<Double> dataType = DataType.DECIMAL;

    Double lower = new Double(40d);
    Double upper = new Double(41d);

    // Create the builder
    HierarchyBuilderIntervalBased<Double> builder = HierarchyBuilderIntervalBased.create(
                                                        dataType,
                                                      new Range<Double>(lower, lower, lower),
                                                      new Range<Double>(upper, upper, upper));

    // Define base intervals
    builder.setAggregateFunction(dataType.createAggregate().createIntervalFunction(true, false));
    builder.addInterval(new Double(0d), interval);

    // Define grouping fanouts
    builder.getLevel(0).addGroup(2);
    builder.getLevel(1).addGroup(3);


    System.out.println("------------------------");
    System.out.println("INTERVAL-BASED HIERARCHY");
    System.out.println("------------------------");
    System.out.println("");
    System.out.println("SPECIFICATION");

    // Print specification
    for (Interval<Double> interval1 : builder.getIntervals()){
        System.out.println(interval1);
    }

    // Print specification
    for (Level<Double> level : builder.getLevels()) {
        System.out.println(level);
    }

    // Print info about resulting levels
    System.out.println("Resulting levels: "+Arrays.toString(builder.prepare(getExampleData())));

    System.out.println("");
    System.out.println("RESULT");

    // Print resulting hierarchy
    printArray(builder.build().getHierarchy());
    System.out.println("");
}

private static String[] getExampleData() {

    String[] data = new String[]{
            "40.764725",
            "40.646866",
            "40.786007",
            "40.812",       // This data value throws illegal state exception "No interval found for: 0.006000000000000227 raw: 0.006000000000000227"
            "40.644527",
            "40.749702",
            "40.764137",

    };

    return data;
}
}
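The value in the error message (0.006000000000000227) looks like binary floating-point residue, so rounding in double arithmetic may be involved. As a point of comparison, the sketch below computes the index of the base interval with BigDecimal, which represents 40.812 and 0.001 exactly. The class and method names are hypothetical, and this is not how HierarchyBuilderIntervalBased locates intervals.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class DecimalIntervalSketch {

    /** Index of the interval [lower + i*width, lower + (i+1)*width) that contains the value. */
    static long intervalIndex(String value, String lowerBound, String width) {
        BigDecimal offset = new BigDecimal(value).subtract(new BigDecimal(lowerBound));
        // Decimal division with FLOOR rounding, so no binary rounding error can shift the index.
        return offset.divide(new BigDecimal(width), 0, RoundingMode.FLOOR).longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(intervalIndex("40.812", "40", "0.001")); // prints 812
        // The same computation in doubles is not guaranteed to give exactly 812.0.
        System.out.println((40.812d - 40d) / 0.001d);
    }
}
```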

Show aggregate function in input properties view

The aggregate function selected for attribute-based utility measures is currently not listed in the view displaying the input configuration in the utility analysis perspective. Include it.

(d, γ)-privacy => αβ-algorithm

Hi
I can't find the αβ-algorithm in ARX.
Is there any plan to add this algorithm?
http://dl.acm.org/citation.cfm?id=1325913

thanks

parsing dates from CSV via interface in OSX

Hi,

Thank you for this project!

I would like to report a problem using the GUI. I can't parse dates in yyyy-MM-dd format from a CSV file. The only format that has worked so far is the default format (dd.MM.yyyy). When I try to select the format I want, the interface refuses because it finds that 'Format doesn not match all data values', which is not true.

CSV file contents:

DateLastContact,dateSynced
2014-09-03,2014-10-03
2014-09-09,2014-09-10

software:

OSX 10.9.5
ARX 2.2.0
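The claim that the values match yyyy-MM-dd can be checked outside the GUI. The sketch below uses java.time for this. ARX may use a different date parser internally, so this only verifies the data and the pattern, not the GUI's behavior. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

final class DatePatternSketch {

    /** Returns true if every value can be parsed as a date with the given pattern. */
    static boolean matchesAll(String pattern, List<String> values) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        for (String value : values) {
            try {
                LocalDate.parse(value, formatter);
            } catch (DateTimeParseException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The four date values from the CSV file above.
        List<String> values = Arrays.asList("2014-09-03", "2014-10-03", "2014-09-09", "2014-09-10");
        System.out.println(matchesAll("yyyy-MM-dd", values)); // prints true
        System.out.println(matchesAll("dd.MM.yyyy", values)); // prints false
    }
}
```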

Highlight selected records in input data view

The selected tuples (research subset) in the input data view should be highlighted for a better overview.

Dialog for changing privacy parameters is difficult to use

I would like to set the value to 5, but it only allows 2, then 7. I've unzipped the deid file and tried to change the setting in input/config.xml and output/config.xml, but it appears to have no effect.

Ideas?

Test fails because of missing data files

After cloning this repo, I tried to run the test ant target. It failed because, obviously, the test classes couldn't find the files ../arx-data/data-junit/*.csv

So I created a folder arx-data on the same level as this repo's folder (arx) and then created the symbolic link data-junit -> ../arx/data to point to the data dir in this repo. Then most tests could run.

Please add this step to the readme file, if it is indeed needed to run the test target.

The following csv files are still missing in the data directory:

adult_age_microaggregated.csv
atus.csv
cup.csv
cup_hierarchy_RAMNTALL.csv
dis.csv
fars.csv
fars_hierarchy_istatenum.csv
ihis.csv

Could you add these to the data directory?

Thanks!

sorting data in a column

Hi,

Thank you for developing such a powerful and flexible tool for data anonymization. In order to enable sorting data per column, I've modified the implementation of the DataTableGridLayer class by (1) defining a SortHeaderLayer on top of the columnHeaderLayer:
SortHeaderLayer<String[]> sortHeaderLayer = new SortHeaderLayer<>(columnHeaderLayer, this.sortModel);
(2) passing this layer to the cornerLayer:
ILayer cornerLayer = new CornerLayer(cornerDataLayer, rowHeaderLayer, sortHeaderLayer);
and (3) setting that layer as the column header layer of the gridLayer:
setColumnHeaderLayer(sortHeaderLayer);
The sort functionality works fine now, but when the sum of the column widths is less than the width of the wrapper composite, the table body is empty, i.e., only the row/column headers are displayed. A snapshot is attached. Any help would be appreciated.

Implement further variants of l-diversity

Positive Disclosure-Recursive (c, ℓ)-Diversity
Negative/Positive Disclosure-Recursive (c1, c2, ℓ)-Diversity