arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks and it supports well-known privacy models, such as k-anonymity, l-diversity, t-closeness and differential privacy.

Home Page: http://arx.deidentifier.org/

License: Apache License 2.0

HTML 0.37% Java 99.63%
arx cross-platform data-analytics data-anonymization de-identification open-source privacy

arx's People

Contributors

arx-deidentifier, chawco, dependabot[bot], dfirman, eicherj, elmuto, ga82bos, idhamari, iylee71, jenno-verdonck, kbabioch, kohlmayer, marklackey, martinwaltl, muellerarmin, nartz, pepijndereus, prasser, raffaelbild, rourke101, scarlethue, sebastianst, shoe54, srcds, swisssoftware, tanjascats, victorxwu

arx's Issues

GUI Freeze While Scrolling Through Results

The GUI freezes when scrolling through results. A stack trace is attached. The only unusual circumstance was probably a quick switch from the configuration to the results, followed by an immediate attempt to scroll. Otherwise, I was using a simple dataset and configuration to test the game-theoretic model.

Per an email exchange with Dr. Prasser - "likely hit a rare race condition [in] the code which synchronizes the scroll position of both tables."

How to use local recoding?

Dear Fabian, thank you for kindly answering. I also have questions about using "local recoding" in ARX. I find it a little tricky to control the "strength" of generalization, since different terms are in use (such as the fixed point, 100 pass, etc.). Is there any documentation to download? Cheers, Yang

Fail to build with Ant on OS X

I am trying to build the project using Ant (version 1.9.4 compiled on April 29 2014) on OS X 10.10.1 (Yosemite), and here is what I get from the compile task (followed by 30 errors due to missing packages):

[javac] /Users/abasu/github/arx/src/gui/org/deidentifier/arx/gui/view/impl/menu/DialogProperties.java:31: error: package de.linearbits.preferences does not exist

Am I missing something?

Ability to change population settings in the risk analysis perspective

Hi! In the "Analyze risk" tab, on the bottom of the UI, tab "Population", I can't un-tick the "Use this population" box, nor un-select USA to select another country. When clicking the box, the tick doesn't go away. I reproduced this with the example.deid project, and with some dummy data from generatedata.com.

I'm using ARX 3.5.1 on Linux, running java 1.8.0_112. uname -a returns:
Linux [redacted] 3.13.0-101-generic #148-Ubuntu SMP Thu Oct 20 22:08:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Feature Request: Editing multiple attributes at once

Hello!

Following up on our email correspondence, I would like to suggest allowing the selection and editing of multiple attributes at once. For example, if one has more than 15-20 attributes, editing each attribute manually and separately becomes cumbersome.

It would be nice to be able to mark several attributes as quasi identifiers at once - or to select all except a few as quasi identifiers. Also, editing data types in that way (or assignments of equal generalization hierarchies) could be useful.

EU General Data Protection Regulation compliance support?

Totally new to this tool, so please excuse the fact that my question/request is not informed by any practical knowledge of the software, but would it be possible to provide guidance on the website and in documentation, and possibly within the app itself regarding the suitability of different options for meeting the new GDPR regulations?

Help getting the information loss metric

Thanks for the ARX library; it is very helpful for me. I use the adult data set and anonymize it. At the end I need the information loss and the DM metric to compare with another solution. Can you show how I can obtain the information loss after anonymizing the data, and how to understand and analyze the metric's result? Can you give me an example?
Thanks a lot
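For readers who want a reference value for comparison, the discernibility metric (DM) can be computed directly from an anonymized table. The following is a minimal plain-Java sketch, not the ARX API: it assumes the commonly cited definition in which every equivalence class (a group of rows with identical quasi-identifier values) contributes the square of its size. The usual extra penalty for suppressed records is left out, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DiscernibilitySketch {

    /** Sums the squared sizes of all groups of rows with identical quasi-identifier values. */
    static long discernibility(List<String[]> quasiIdentifierRows) {
        // Group rows by their values; a List<String> key compares element by element
        Map<List<String>, Long> classSizes = new HashMap<>();
        for (String[] row : quasiIdentifierRows) {
            classSizes.merge(Arrays.asList(row), 1L, Long::sum);
        }
        long score = 0;
        for (long size : classSizes.values()) {
            score += size * size;
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical anonymized rows (age range, masked ZIP code)
        List<String[]> anonymized = Arrays.asList(
            new String[] {"30-39", "81***"},
            new String[] {"30-39", "81***"},
            new String[] {"40-49", "80***"});
        System.out.println("DM = " + discernibility(anonymized));
    }
}
```

Lower values indicate smaller equivalence classes and therefore less loss under this metric; values are only comparable between solutions computed on the same dataset.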

Integration with Spark ?

Hi,

Very nice tool. Do you know anyone who ever used it as a Spark dependency? Would it make sense?

Thanks.

Regards,
Yann

On launching ARX, call to "finishLaunching" reports (non fatal) error

Starting ARX on Mac OS X El Capitan, with Java 1.8.0_92:

java -XstartOnFirstThread -jar ARX.jar
2016-05-02 10:40:37.175 java[89155:10638489] _createMenuRef called with existing principal MenuRef already associated with menu
2016-05-02 10:40:37.176 java[89155:10638489] (
0 CoreFoundation 0x00007fff8fff94f2 __exceptionPreprocess + 178
1 libobjc.A.dylib 0x00007fff8b6a273c objc_exception_throw + 48
2 CoreFoundation 0x00007fff900604bd +[NSException raise:format:] + 205
3 AppKit 0x00007fff895f2d8a -[NSCarbonMenuImpl _createMenuRef] + 62
4 AppKit 0x00007fff895f26c9 -[NSCarbonMenuImpl _instantiateCarbonMenu] + 140
5 AppKit 0x00007fff895f0374 -[NSApplication finishLaunching] + 856
6 libswt-pi-cocoa-4234.jnilib 0x000000011c835a5e Java_org_eclipse_swt_internal_cocoa_OS_objc_1msgSendSuper__Lorg_eclipse_swt_internal_cocoa_objc_1super_2J + 89
7 ??? 0x0000000103df2554 0x0 + 4359923028

This is not fatal.

Error 15 in import CSV data interface

Hi there,

I am currently using version 3.4.2.
ARX is showing me an error with number 15 in the CSV import interface.
The output dataset is not displayed in the GUI, and the anonymized dataset is not analysed in the tool.
Is there any way to resolve this CSV import issue?

Thank you.

pageup/down keys

Hi,
I've seen in the code that DefaultSelectionBindings has been added to the UI configurations. However, pressing the Page Up/Page Down keys has no effect, i.e., the scrollbar does not move.

Add data masking functionality

Hi,
I am not sure which generalizations or aggregate functions exist for string or categorical data. However, for an attribute set as Identifiable, no matter whether I set Transformation to Generalization or Microaggregation, and no matter which aggregation function I choose for microaggregation, the values of these attributes remain "" and the Transformation gets reset to Generalization. I am also wondering whether you have plans for salted hashing transformations. I believe these are useful for identifiers: although less anonymous than "" fields, they could help when creating a test database where we want to minimize private data but still be able to do verifications and correlations. Thanks
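To make the request concrete, a salted hash of an identifier can be sketched in plain Java as below. This is an illustration of the idea only, not an ARX feature: the class and method names are hypothetical, SHA-256 is an assumed choice, and the salt is prepended to the value before hashing. Equal inputs hashed with the same salt yield equal outputs, which is what keeps correlations possible.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SaltedHashSketch {

    /** Returns the lowercase hex SHA-256 digest of salt followed by the UTF-8 bytes of value. */
    static String saltedHash(String value, byte[] salt) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(salt);
            byte[] hash = digest.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical salt; in practice it would be kept secret and generated randomly
        byte[] salt = "example-salt".getBytes(StandardCharsets.UTF_8);
        System.out.println(saltedHash("alice@example.org", salt));
    }
}
```

A keyed construction such as HMAC is a common alternative. Either way the result is a pseudonym rather than an anonymized value, since anyone who knows the salt can test candidate identifiers.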

I can't use the ARX API for anonymizing

Hello, I want to use the ARX API to write a sample that anonymizes data with t-closeness, l-diversity and k-anonymity, but I get the following error:
Attribute 'age': hierarchy misses some values or contains duplicates
at org.deidentifier.arx.framework.data.GeneralizationHierarchy.<init>(Unknown Source)
at org.deidentifier.arx.framework.data.DataManager.<init>(Unknown Source)
at org.deidentifier.arx.ARXAnonymizer.getDataManager(Unknown Source)
at org.deidentifier.arx.ARXAnonymizer.anonymize(Unknown Source)
at kanonytest.kanony.main(kanony.java:68)
I use the latest version of the ARX API.
Can you help me? Thanks a lot

Unable to open a Project

Hi,
Whenever I try to open a project which I have already created, I get the following error:

Attribute 'Workclass' : hierarchy misses some values or contains duplicates.

Could you please help me with this?

Thanks,
Udyot

Parallelised or distributed version

I am looking at searching and anonymising data with a large number of records (at least 10m). One of the use cases is horizontally integrating results from multiple locations without sharing the raw data. While the Flash implementation is very fast at the moment, it does not appear to be parallelised for large local sets or distributable for partitioned sets.

String data anonymization

Hi,

I am reading data from a database table. It contains values of type varchar. I want to anonymize that data. Instead of setting the hierarchy manually, I want to build it dynamically, like we do for numbers (where we calculate the geometric mean or arithmetic mean). Is there a utility for that? Please help me out.
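One way to derive a hierarchy for strings without writing it by hand is character masking: each level hides one more character, starting from the right. The sketch below shows the idea in plain Java and is independent of the ARX API; the class name, the padding character and the mask character are all assumptions made for illustration. Shorter values are padded so that every row has the same number of levels.

```java
import java.util.Arrays;
import java.util.List;

final class RedactionHierarchySketch {

    /**
     * Builds one row per value: the original value, followed by versions in which
     * one more character (counted from the right of the padded value) is masked.
     */
    static String[][] build(List<String> values, char pad, char mask) {
        int maxLength = 0;
        for (String value : values) {
            maxLength = Math.max(maxLength, value.length());
        }
        String[][] rows = new String[values.size()][maxLength + 1];
        for (int r = 0; r < values.size(); r++) {
            String value = values.get(r);
            StringBuilder padded = new StringBuilder(value);
            while (padded.length() < maxLength) {
                padded.append(pad);
            }
            rows[r][0] = value; // level 0 must equal the value as it appears in the data
            for (int level = 1; level <= maxLength; level++) {
                padded.setCharAt(maxLength - level, mask);
                rows[r][level] = padded.toString();
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : build(Arrays.asList("81667", "8167"), ' ', '*')) {
            System.out.println(String.join(";", row));
        }
    }
}
```

The top level is identical for all values, so full suppression remains available as the most general transformation.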

Error message needs to be more verbose

The error message comes from the class GeneralizationHierarchy.java and it says "Attribute 'name': hierarchy misses some values or contains duplicates". It would be much more useful to see some of the examples that cause the problem.
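Until the message carries such examples, a check of this kind can be run outside ARX. The sketch below is plain Java with hypothetical names. It assumes the message means that the first column of the hierarchy must contain every distinct value of the data column exactly once, which is an interpretation of the wording and not taken from the ARX source.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class HierarchyCheckSketch {

    /** Values that occur in the data column but not in the first column of the hierarchy. */
    static Set<String> missingValues(Collection<String> dataValues, String[][] hierarchy) {
        Set<String> baseLevel = new HashSet<>();
        for (String[] row : hierarchy) {
            baseLevel.add(row[0]);
        }
        Set<String> missing = new TreeSet<>();
        for (String value : dataValues) {
            if (!baseLevel.contains(value)) {
                missing.add(value);
            }
        }
        return missing;
    }

    /** Values that occur more than once in the first column of the hierarchy. */
    static Set<String> duplicateValues(String[][] hierarchy) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String[] row : hierarchy) {
            if (!seen.add(row[0])) {
                duplicates.add(row[0]);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Hypothetical data column and hierarchy
        List<String> ages = Arrays.asList("34", "45", "66");
        String[][] hierarchy = {
            {"34", "30-39", "*"},
            {"45", "40-49", "*"},
            {"45", "40-49", "*"}
        };
        System.out.println("Missing:    " + missingValues(ages, hierarchy));
        System.out.println("Duplicates: " + duplicateValues(hierarchy));
    }
}
```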

Ability to edit imported hierarchies in the hierarchy wizard dialog?

Dear Developers,

If I create hierarchies using the "hierarchy wizard" GUI, I can later review the hierarchies in the GUI again. But if I import hierarchies from CSV files, I am not allowed to review the hierarchies in the wizard GUI. I think it would be great to have this function, particularly to help beginners understand how hierarchies can be created manually.

Many thanks,

Best wishes,
Jianliang

Change distribution of class sizes to distribution of re-identification risks

In the risk analysis perspective, it would be great to replace the distribution of class sizes with the distribution of re-identification risks (1/class size). This affects the table as well as the histogram, and the ARX API/library as well as the GUI (it would be great to change this at the API level as well).
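The proposed conversion can be sketched as follows. This is plain Java with hypothetical names, not the ARX API. It assumes the input maps each class size to the number of equivalence classes of that size and reports, for each risk value 1/size, the fraction of records carrying that risk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class RiskDistributionSketch {

    /**
     * Converts "class size -> number of classes of that size" into
     * "re-identification risk (1/size) -> fraction of records with that risk".
     */
    static Map<Double, Double> toRiskDistribution(Map<Integer, Integer> classSizeCounts) {
        long totalRecords = 0;
        for (Map.Entry<Integer, Integer> entry : classSizeCounts.entrySet()) {
            if (entry.getKey() < 1) {
                throw new IllegalArgumentException("Class size must be at least 1");
            }
            totalRecords += (long) entry.getKey() * entry.getValue();
        }
        Map<Double, Double> distribution = new TreeMap<>();
        if (totalRecords == 0) {
            return distribution;
        }
        for (Map.Entry<Integer, Integer> entry : classSizeCounts.entrySet()) {
            double risk = 1.0 / entry.getKey();
            double records = (double) entry.getKey() * entry.getValue();
            distribution.put(risk, records / totalRecords);
        }
        return distribution;
    }

    public static void main(String[] args) {
        // Hypothetical input: two classes of size 1 and one class of size 2
        Map<Integer, Integer> sizes = new HashMap<>();
        sizes.put(1, 2);
        sizes.put(2, 1);
        System.out.println(toRiskDistribution(sizes));
    }
}
```

Weighting by records rather than by classes is a design choice made here because risk is usually reported per record; a per-class histogram would use the class counts directly.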

Hierarchy has some missing values or duplicates

Hello there,

my dataset contains columns that also include multi-valued attributes like in this example column:

"EC_Aaa123"
"EC_Xxx, EC_Yyyy, EC_Z"
"EC_Bbbb567"

(* You can think of proteins that can be labeled with one or multiple EC numbers that come from a hierarchical tree-like taxonomy. I am actually using some different taxonomies, but the principle is the same)

I know that ARX cannot generalize such multi-value strings, so I did it already in a preprocessing step. That is, I used preprocessing to simplify all multi-valued strings as much as possible with respect to the underlying taxonomy, using a most recent common ancestor approach.

Now I want to provide ARX this column along with a mock-up taxonomy, like so:

EC_Aaa123; EC_Aaa; EC_A; ****
EC_Bbb567; EC_Bbb; EC_B; ****
EC_Xxx,EC_Yyyy,EC_Z; ****; ****; ****

So, the idea is to give ARX the freedom to generalize at least the single-value attributes as best as possible, or to suppress the multi-valued ones. By the way, the items in multi-valued strings appear in sorted order, to avoid ambiguity.

This approach works except for some columns. ARX complains about missing values, but I don't know why. I checked the taxonomies several times.

Are there any latent restrictions that apply? For example, on the length of the strings in the input columns?

I really hope you can help me, as I already invested two weeks into the intricate preprocessing steps of the dataset. Unfortunately, I am not allowed to share that data.
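One thing worth ruling out, judging only from the samples quoted in this post: the data value "EC_Xxx, EC_Yyyy, EC_Z" contains spaces after the commas, while the hierarchy entry "EC_Xxx,EC_Yyyy,EC_Z" does not. Whether this is the cause in the real dataset cannot be known from the post. The following plain-Java sketch (hypothetical names, independent of ARX) lists data values that match a hierarchy entry only after whitespace is removed.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class WhitespaceMismatchSketch {

    /** Maps each data value without an exact match to a hierarchy entry that differs only in whitespace. */
    static Map<String, String> nearMisses(Collection<String> dataValues, Collection<String> hierarchyValues) {
        Set<String> exact = new HashSet<>(hierarchyValues);
        Map<String, String> byNormalizedForm = new HashMap<>();
        for (String entry : hierarchyValues) {
            byNormalizedForm.put(removeWhitespace(entry), entry);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String value : dataValues) {
            if (exact.contains(value)) {
                continue; // exact match, nothing to report
            }
            String candidate = byNormalizedForm.get(removeWhitespace(value));
            if (candidate != null) {
                result.put(value, candidate);
            }
        }
        return result;
    }

    private static String removeWhitespace(String text) {
        return text.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        Collection<String> data = Arrays.asList("EC_Aaa123", "EC_Xxx, EC_Yyyy, EC_Z", "EC_Bbbb567");
        Collection<String> hierarchy = Arrays.asList("EC_Aaa123", "EC_Bbb567", "EC_Xxx,EC_Yyyy,EC_Z");
        System.out.println(nearMisses(data, hierarchy));
    }
}
```

Values with no counterpart at all, such as "EC_Bbbb567" against "EC_Bbb567" in the samples above, are not reported by this check and need a plain missing-value comparison.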

Multithreaded GUI on top of a single-threaded application

ARX is single-threaded, and the methods provided by the API are typically not thread-safe.

However, ARX provides "Builders" for performing resource-intensive tasks asynchronously (e.g. computing statistics, estimating risks) to make the ARX GUI more responsive.

The user can probably interfere with these parallel processes and trigger errors, for example by sorting the data while an analysis is performed in the background...
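A common way to keep a GUI responsive without calling non-thread-safe code from several threads is thread confinement: every operation on the shared model is queued on one dedicated worker thread. The sketch below illustrates the pattern in plain Java. It is not how ARX is implemented, and the class name is hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConfinedWorker implements AutoCloseable {

    // A single worker thread runs all submitted tasks one after another
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Queues a task; tasks never overlap because only one thread executes them. */
    <T> Future<T> submit(Callable<T> task) {
        return executor.submit(task);
    }

    @Override
    public void close() {
        executor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (ConfinedWorker worker = new ConfinedWorker()) {
            // Stand-ins for a background analysis and a user-triggered sort
            Future<String> analysis = worker.submit(() -> "analysis on " + Thread.currentThread().getName());
            Future<String> sort = worker.submit(() -> "sort on " + Thread.currentThread().getName());
            System.out.println(analysis.get());
            System.out.println(sort.get());
        }
    }
}
```

With this arrangement a sort requested during an analysis waits in the queue instead of interfering, at the cost of the sort being delayed until the analysis finishes.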

Decimal Interval hierarchy generation - if interval is higher precision than a given value, we get the 'No Interval Found' exception

When the interval (in an interval hierarchy) is a higher precision than a given value, we get the no interval found exception. See this standalone class to replicate. The interval is 0.001, however one of the values is 40.812 - also 3 decimal places. Run it to see the error. To remove the error either reduce the precision of the interval to 0.01 or add a digit to 40.812x. Setting the format to #.###### doesn't remove the error - but the format is applied in the error message.

package org.deidentifier.arx.examples;

import org.deidentifier.arx.DataType;
import org.deidentifier.arx.aggregates.HierarchyBuilderGroupingBased.Level;
import org.deidentifier.arx.aggregates.HierarchyBuilderIntervalBased;
import org.deidentifier.arx.aggregates.HierarchyBuilderIntervalBased.Interval;
import org.deidentifier.arx.aggregates.HierarchyBuilderIntervalBased.Range;

import cern.colt.Arrays;

public class ExampleDecimal extends Example {

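public static void main(final String[] args) {
    // Note: the description above names 0.001 as the interval that triggers
    // the exception and 0.01 as the value that avoids it.
    intervalBased(0.01d);
}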

private static void intervalBased(double interval) {

    //DataType<Double> dataType = DataType.createDecimal("#.######");
    DataType<Double> dataType = DataType.DECIMAL;

    Double lower = new Double(40d);
    Double upper = new Double(41d);

    // Create the builder
    HierarchyBuilderIntervalBased<Double> builder = HierarchyBuilderIntervalBased.create(
                                                        dataType,
                                                      new Range<Double>(lower, lower, lower),
                                                      new Range<Double>(upper, upper, upper));

    // Define base intervals
    builder.setAggregateFunction(dataType.createAggregate().createIntervalFunction(true, false));
    builder.addInterval(new Double(0d), interval);

    // Define grouping fanouts
    builder.getLevel(0).addGroup(2);
    builder.getLevel(1).addGroup(3);


    System.out.println("------------------------");
    System.out.println("INTERVAL-BASED HIERARCHY");
    System.out.println("------------------------");
    System.out.println("");
    System.out.println("SPECIFICATION");

    // Print specification
    for (Interval<Double> interval1 : builder.getIntervals()){
        System.out.println(interval1);
    }

    // Print specification
    for (Level<Double> level : builder.getLevels()) {
        System.out.println(level);
    }

    // Print info about resulting levels
    System.out.println("Resulting levels: "+Arrays.toString(builder.prepare(getExampleData())));

    System.out.println("");
    System.out.println("RESULT");

    // Print resulting hierarchy
    printArray(builder.build().getHierarchy());
    System.out.println("");
}

private static String[] getExampleData() {

    String[] data = new String[]{
            "40.764725",
            "40.646866",
            "40.786007",
            // This value throws an IllegalStateException:
            // "No interval found for: 0.006000000000000227 raw: 0.006000000000000227"
            "40.812",
            "40.644527",
            "40.749702",
            "40.764137",
    };

    return data;
}

}
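The trailing digits in the reported message ("0.006000000000000227") are typical of binary floating-point arithmetic. Whether rounding of this kind is what makes the interval lookup fail inside ARX is an assumption, not something the report confirms. The plain-Java sketch below, with hypothetical names, only illustrates the difference between double and exact decimal arithmetic when locating a value in a sequence of equally wide intervals.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class DecimalIntervalSketch {

    /** Index i of the interval [lower + i*width, lower + (i+1)*width) containing value, computed exactly. */
    static long intervalIndex(String value, String lower, String width) {
        BigDecimal offset = new BigDecimal(value).subtract(new BigDecimal(lower));
        return offset.divide(new BigDecimal(width), 0, RoundingMode.FLOOR).longValueExact();
    }

    public static void main(String[] args) {
        // Binary doubles cannot represent most decimal fractions exactly
        System.out.println("double offset:  " + (40.812d - 40d));
        // Decimal arithmetic on the textual values is exact
        System.out.println("decimal offset: " + new BigDecimal("40.812").subtract(new BigDecimal("40")));
        System.out.println("interval index: " + intervalIndex("40.812", "40", "0.001"));
    }
}
```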

Parsing dates from CSV via the interface on OS X

Hi,

Thank you for this project!

I would like to report a problem using the GUI. I can't parse dates in yyyy-MM-dd format from a CSV file. The only format that has worked so far is the default format (dd.MM.yyyy). When I try to select the format I want, the interface refuses because it finds that 'Format does not match all data values', which is not true.

CSV file contents:

DateLastContact,dateSynced
2014-09-03,2014-10-03
2014-09-09,2014-09-10

software:

OSX 10.9.5
ARX 2.2.0
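The claim that the sample values fit the pattern can be checked independently of the GUI. The sketch below is plain Java with hypothetical names. It assumes the format strings follow java.text.SimpleDateFormat conventions, which is not confirmed in this report, and parses strictly so that a value only counts as matching when it is consumed completely.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

final class DateFormatCheckSketch {

    /** Returns true if every value can be parsed completely with the given pattern. */
    static boolean matchesAll(String pattern, List<String> values) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject out-of-range fields such as month 13
        for (String value : values) {
            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(value, position);
            if (parsed == null || position.getIndex() != value.length()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Values taken from the CSV sample in this report
        List<String> values = Arrays.asList("2014-09-03", "2014-10-03", "2014-09-09", "2014-09-10");
        System.out.println("yyyy-MM-dd matches: " + matchesAll("yyyy-MM-dd", values));
        System.out.println("dd.MM.yyyy matches: " + matchesAll("dd.MM.yyyy", values));
    }
}
```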

Test fails because of missing data files

  1. After cloning this repo, I tried to run the test ant target. It failed because, obviously, the test classes couldn't find the files ../arx-data/data-junit/*.csv

So I created a folder arx-data on the same level as this repo's folder (arx) and then created the symbolic link data-junit -> ../arx/data to point to the data dir in this repo. Then most tests could run.

Please add this step to the readme file, if it is indeed needed to run the test target.

  2. The following CSV files are still missing in the data directory:
  • adult_age_microaggregated.csv
  • atus.csv
  • cup.csv
  • cup_hierarchy_RAMNTALL.csv
  • dis.csv
  • fars.csv
  • fars_hierarchy_istatenum.csv
  • ihis.csv

Could you add these to the data directory?

Thanks!
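A quick way to see which of the files listed above are absent before running the test target is sketched below. It is plain Java with hypothetical names; the directory path is the one quoted in this report and may need adjusting to the local checkout.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MissingDataFilesSketch {

    /** Returns the expected file names that do not exist as regular files in the directory. */
    static List<String> missingFiles(Path directory, List<String> expectedNames) {
        List<String> missing = new ArrayList<>();
        for (String name : expectedNames) {
            if (!Files.isRegularFile(directory.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Directory and file names as given in this report
        Path dataDirectory = Paths.get("../arx-data/data-junit");
        List<String> expected = Arrays.asList(
            "adult_age_microaggregated.csv", "atus.csv", "cup.csv", "cup_hierarchy_RAMNTALL.csv",
            "dis.csv", "fars.csv", "fars_hierarchy_istatenum.csv", "ihis.csv");
        System.out.println("Missing: " + missingFiles(dataDirectory, expected));
    }
}
```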

sorting data in a column

Hi,

Thank you for developing such a powerful and flexible tool for data anonymization. In order to enable sorting data per column, I've modified the implementation of the DataTableGridLayer class by
(1) defining a SortHeaderLayer on top of the columnHeaderLayer:
SortHeaderLayer<String[]> sortHeaderLayer = new SortHeaderLayer<>(columnHeaderLayer, this.sortModel);
(2) passing this layer to the cornerLayer:
ILayer cornerLayer = new CornerLayer(cornerDataLayer, rowHeaderLayer, sortHeaderLayer);
and (3) setting that layer as the column header layer of the gridLayer:
setColumnHeaderLayer(sortHeaderLayer);
The sort functionality works fine now, but when the sum of the column widths is less than the width of the wrapper composite, the table body is empty, i.e., only the row/column headers are displayed. A snapshot is attached. Any help would be appreciated.