juliadata / categoricalarrays.jl

Arrays for working with categorical data (both nominal and ordinal)

License: Other

Julia 100.00%

julia statistics data categorical-data

categoricalarrays.jl's Introduction

CategoricalArrays.jl

This package provides tools for working with categorical variables, both with unordered (nominal variables) and ordered categories (ordinal variables), optionally with missing values.

categoricalarrays.jl's Issues

contrasts

On the julia-stats mailing list I suggested exporting a contrasts generic from StatsBase and adding methods in this repository. I would appreciate your comments, @nalimilan, if you have any.

Adding levels to ordered categorical array

I find the following behavior inconsistent. Is it on purpose?

julia> using CategoricalArrays

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> x[3] = 10
ERROR: cannot add new level 10 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> recode(x, 3=>10)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 10

JSON serialization of CategoricalValue{T, R} is broken

CategoricalValue is no longer <: AbstractString, so JSON.jl serializes it using CompositeTypeWrapper and goes into infinite recursion due to the circular pool -> value -> pool references.

julia> using CategoricalArrays, JSON

julia> x = categorical([1, 2, 3]);

julia> JSON.json(x[1])
ERROR: StackOverflowError:
Stacktrace:
 [1] Type at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:18 [inlined]
 [2] lower(::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:40
 [3] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
 [4] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
 [5] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Array{CategoricalArrays.CategoricalValue{Int64,UInt32},1}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:302
 [6] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
 [7] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
 [8] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
 [9] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Array{CategoricalArrays.CategoricalValue{Int64,UInt32},1}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:302
 [10] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
 [11] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
 [12] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
 [13] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Array{CategoricalArrays.CategoricalValue{Int64,UInt32},1}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:302
 [14] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
 [15] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
 [16] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294

It could be easily fixed by implementing a JSON.lower(x::CategoricalValue) method, but that would require introducing a CategoricalArrays -> JSON dependency.
A more generic fix would be to implement support for circular references in JSON, but that would still print CategoricalValue incorrectly (we probably just need to show the stored value, i.e. get(x::CategoricalValue)).

add future to require

Could you add Future to the REQUIRE file (or to Manifest.toml/Project.toml)?

I am getting this error on Travis:

https://travis-ci.org/kafisatz/DecisionTrees.jl/jobs/394917060#L580

Should the values field be named indices?

If I understand correctly the values field is the numeric indices into the pool. I would think that a name such as refs or indices or indexes would be more appropriate. In most types a field name of values indicates the actual values in the type without the metadata.

== for CategoricalPool

@nalimilan
Currently:

x = categorical([1,2,3])
y = deepcopy(x)
x.pool == y.pool

produces false because for CategoricalPool the operation == is not defined and falls back to ===.

Is this intentional or should this be fixed?
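
The fallback described above can be reproduced without the package. The following is a minimal sketch using a hypothetical `ToyPool` type (a stand-in, not the package's CategoricalPool): without a dedicated method, == on a mutable struct compares object identity, and one possible fix is to define == on the stored fields together with a matching hash.

```julia
# ToyPool is a hypothetical stand-in, not the package's CategoricalPool.
mutable struct ToyPool
    levels::Vector{Int}
end

a = ToyPool([1, 2, 3])
b = deepcopy(a)

# No == method exists for ToyPool yet, so this falls back to === (identity).
identity_result = (a == b)

# Assumed fix: compare the stored levels and keep hash consistent with ==.
Base.:(==)(x::ToyPool, y::ToyPool) = x.levels == y.levels
Base.hash(x::ToyPool, h::UInt) = hash(x.levels, h)

# The same comparison now looks at the contents.
structural_result = (a == b)
```

Whether structural equality is the right semantics for the real pool type is the open question of this issue; the sketch only shows the mechanism.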

Package breaks if run without inline

When I try to use the package in a session started with julia --inline=no it breaks. All errors during the tests seem to fall back to this:

categorical(CategoricalArrays.CategoricalArray{String,1,Int64,String,CategoricalArrays.CategoricalString{Int64},Union{}}, compress=false) R1=Int64 R2=Int64: Error During Test
  Got an exception of type MethodError outside of a @test
  MethodError: in(::CategoricalArrays.CategoricalString{Int64}, ::Base.KeyIterator{ObjectIdDict}) is ambiguous. Candidates:
    in(k, v::Base.KeyIterator) in Base at associative.jl:60
    in(x::Union{CategoricalArrays.CategoricalString{R}, CategoricalArrays.CategoricalValue{T,R} where T} where R, y) in CategoricalArrays at /Users/ericperim/.julia/v0.6/CategoricalArrays/src/value.jl:118
  Possible fix, define
    in(::Union{CategoricalArrays.CategoricalString{R}, CategoricalArrays.CategoricalValue{T,R} where T} where R, ::Base.KeyIterator)

Note: in order to observe this behaviour it is necessary to use julia --inline=no test/runtests.jl, since if you run Pkg.test from inside the REPL it will use the default options (inline=yes).

cut mutates breaks input vector

I was surprised to find that cut mutates the input breaks for the interval vector when extend=true. (And strangely only when breaks is sorted.)

I would say either make a copy before mutating or document the current behaviour?
I could make a PR for either.

MWE:

julia> breaks = [2,4,6];

julia> cut(collect(1:10), breaks, extend=true);

julia> breaks
5-element Array{Int64,1}:
  1
  2
  4
  6
 10
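
The "copy before mutating" option could look roughly like this. It is only a sketch: `extended_breaks` is a hypothetical helper, not the package's implementation of cut, and it assumes that breaks is non-empty and sorted and that `pushfirst!` is available (Julia 0.7 or later).

```julia
# Hypothetical helper illustrating "copy first, then extend".
# Assumes `breaks` is non-empty and sorted in increasing order.
function extended_breaks(x::AbstractVector, breaks::AbstractVector)
    out = copy(breaks)          # work on a copy so the caller's vector is untouched
    lo, hi = extrema(x)
    if lo < first(out)
        pushfirst!(out, lo)     # extend to the left to cover the minimum of x
    end
    if hi > last(out)
        push!(out, hi)          # extend to the right to cover the maximum of x
    end
    return out
end

breaks = [2, 4, 6]
new_breaks = extended_breaks(collect(1:10), breaks)
```

With this approach the caller's `breaks` keeps its three elements, while `new_breaks` holds the extended vector.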

Cheers.

Implement optimized sort functions

We should review DataArrays's methods and decide which of them should be implemented:
https://github.com/JuliaStats/DataArrays.jl/blob/f17e4e30aa0713c794409802741f43cf8ff7f05e/src/pooleddataarray.jl#L576

These should also be used in DataFrames:
https://github.com/JuliaStats/DataFrames.jl/blob/8334bf1cbeafba98f038e65a67a732e4125dae89/src/abstractdataframe/sort.jl#L312

Implement optimized == and isequal() for arrays with different pools

We currently have a fast path in == and isequal for when two CategoricalArrays share the same pool, but that is not the most common situation. For other cases, we do essentially what the AbstractArray fallback does, by extracting CategoricalValue objects and comparing them, which implies comparing their contents.

It would be much faster to compute a correspondence table between the levels of the arrays first, and then work only with integer codes. But I'm not sure how to do that without allocating an N×M table, with N and M the number of levels of each array. Doing so would only make sense for quite large arrays.
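
One way to build such a correspondence without an N×M table, assuming the levels can be used as Dict keys, is a length-N translation vector computed through a dictionary lookup. The sketch below works on plain (levels, codes) pairs; `codes_equal` is a hypothetical name and nothing here uses the package's internals.

```julia
# Hypothetical helper: compare two categorical-like vectors given as
# (levels, integer codes) pairs, touching level values only once.
function codes_equal(levels_a, codes_a, levels_b, codes_b)
    length(codes_a) == length(codes_b) || return false
    # Position of every level of b (Dict lookup uses isequal/hash).
    pos_b = Dict(l => j for (j, l) in enumerate(levels_b))
    # Translation vector: code in a -> code in b, 0 if the level is absent from b.
    a2b = [get(pos_b, l, 0) for l in levels_a]
    # From here on only integers are compared.
    for (ca, cb) in zip(codes_a, codes_b)
        a2b[ca] == cb || return false
    end
    return true
end

same = codes_equal(["a", "b", "c"], [1, 2, 1], ["c", "b", "a"], [3, 2, 3])
different = codes_equal(["a", "b", "c"], [1, 2, 1], ["c", "b", "a"], [3, 2, 2])
```

The setup cost is proportional to N + M, so the open question of when the extra allocation pays off still applies for small arrays.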

Can't use DataFramesMeta.by with NullableCategoricalArrays

Please see this stackoverflow post

Basically I have loaded in a DataFrame using Feather.jl and then I tried to do

by(df, :some_categorical_array_col, df1 -> sum(df1[:some_value]))

it fails with

MethodError: Cannot convert an object of type String to an object of type CategoricalArrays.CategoricalValue{String,Int32}
This may have arisen from a call to the constructor CategoricalArrays.CategoricalValue{String,Int32}(...),
since type constructors fall back to convert methods.
in by at DataFrames\src\groupeddataframe\grouping.jl:320
in groupby at DataFrames\src\groupeddataframe\grouping.jl:92
in DataArrays.PooledDataArray at base\sysimg.jl:24
in DataArrays.PooledDataArray at DataArrays\src\pooleddataarray.jl:140
in convert at NullableArrays\src\primitives.jl:258
in copy! at base\abstractarray.jl:655

and I am new to Julia hence not sure how to fix it myself.

RFC: Merge NominalArray and OrdinalArray into CategoricalArray

The only reason for having separate NominalArray and OrdinalArray types (as well as their Nullable counterparts) is to return an OrdinalValue from the latter, which supports < and >. This does not sound worth the increased complexity. We could use a single CategoricalArray type, and store whether it's ordered via a Bool field. With branch prediction, < shouldn't be noticeably slower than with different types.

Incidentally, this is what Pandas does, as well as MATLAB. Another advantage is that data storage formats generally don't distinguish nominal and ordinal arrays; that distinction must be added after importing data. If we use two different types, moving to ordinal requires changing the type of the array after the fact.

Before doing that change, I'd like to hear what others think of it.

Efficiently copy levels ordering from CategoricalValue in setindex!()

Extending the logic implemented for copy! in #97, it could make sense to copy the ordering of levels from the RHS CategoricalValue in setindex!(::CategoricalArray, ::CategoricalValue, ...). That would essentially mean that copying elements manually one by one from one CategoricalArray to another would be equivalent to using the specialized copy! method: the order of levels of the source would be preserved.

Custom iteration?

How should custom iteration be implemented over CategoricalArrays?

Or to put it differently, why do next(CategoricalArray(["a","b","b"]), 2) and done(CategoricalArray(["a","b","b"]), 2) give a BoundsError?

I'm on Julia v.0.6.2

NullableCategoricalArrays constructor and levels ordering

The constructor for NullableCategoricalArray does not adhere to the ordering of the data when ordering levels.

julia> using CategoricalArrays
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> x = levels!(CategoricalArray(["B", "B", "A", "A"]), ["C", "B", "A"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "A"
 "A"

julia> levels(x)
3-element Array{String,1}:
 "C"
 "B"
 "A"

julia> nullx = NullableCategoricalArray(x)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "A"
 "A"

julia> levels(nullx) # why did the levels change?
3-element Array{String,1}:
 "A"
 "B"
 "C"

julia> nullx = NullableCategoricalArray(x, ordered=true)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "A"
 "A"

julia> levels(nullx) # still reordered even with ordered=true
3-element Array{String,1}:
 "A"
 "B"
 "C"

julia> droplevels!(nullx) # does not reset the order
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "A"
 "A"

julia> levels(nullx)
2-element Array{String,1}:
 "A"
 "B"

The same happens for Array{String}:

julia> y = ["B", "B", "A", "A"]
4-element Array{String,1}:
 "B"
 "B"
 "A"
 "A"

julia> nully = NullableCategoricalArray(y)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "A"
 "A"

julia> levels(nully) # ordering
2-element Array{String,1}:
 "A"
 "B"

julia> nully = NullableCategoricalArray(y, ordered=true)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "A"
 "A"

julia> levels(nully) # ordering
2-element Array{String,1}:
 "A"
 "B"

julia> droplevels!(nully)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "A"
 "A"

julia> levels(nully)
2-element Array{String,1}:
 "A"
 "B"

ordered() -> isordered()

Would it be more idiomatic to rename ordered() into isordered()?

Custom organization logo?

It might be nice to have a custom logo for this organization. So... call for proposals!

Gadfly seems to want isfinite defined

With

julia> Pkg.status("Gadfly")
 - Gadfly                        0.5.2+             master

julia> Pkg.status("CategoricalArrays")
 - CategoricalArrays             0.1.0

an attempt to use a CategoricalArray as x or y in a plot fails because Gadfly checks isfinite on elements of the array.

julia> plot(dyestuff2, x="Yield", y="Batch", Geom.point)
Error showing value of type Gadfly.Plot:
ERROR: MethodError: no method matching isfinite(::CategoricalArrays.CategoricalValue{String,UInt32})
Closest candidates are:
  isfinite(::Float16) at float16.jl:119
  isfinite(::BigFloat) at mpfr.jl:799
  isfinite(::DataArrays.NAtype) at /home/bates/.julia/v0.5/DataArrays/src/predicates.jl:9
  ...
 in apply_statistic_typed(::CategoricalArrays.CategoricalValue{String,UInt32}, ::CategoricalArrays.CategoricalValue{String,UInt32}, ::Array{CategoricalArrays.CategoricalValue{String,UInt32},1}, ::Array{Void,1}, ::Array{Void,1}) at /home/bates/.julia/v0.5/Gadfly/src/statistics.jl:957
 in apply_statistic(::Gadfly.Stat.TickStatistic, ::Dict{Symbol,Gadfly.ScaleElement}, ::Gadfly.Coord.Cartesian, ::Gadfly.Aesthetics) at /home/bates/.julia/v0.5/Gadfly/src/statistics.jl:811
 in apply_statistics(::Array{Gadfly.StatisticElement,1}, ::Dict{Symbol,Gadfly.ScaleElement}, ::Gadfly.Coord.Cartesian, ::Gadfly.Aesthetics) at /home/bates/.julia/v0.5/Gadfly/src/statistics.jl:38
 in render_prepare(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:766
 in render(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:819
 in display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/html")}, ::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:1092
 in macro expansion at ./multimedia.jl:143 [inlined]
 in display(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:1044
 in hookless(::Media.##7#8{Gadfly.Plot}) at /home/bates/.julia/v0.5/Media/src/compat.jl:14
 in render(::Media.NoDisplay, ::Gadfly.Plot) at /home/bates/.julia/v0.5/Media/src/compat.jl:27
 in display(::Media.DisplayHook, ::Gadfly.Plot) at /home/bates/.julia/v0.5/Media/src/compat.jl:9
 in macro expansion at ./multimedia.jl:143 [inlined]
 in display(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:1048
 in print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:154
 in print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:139
 in (::Base.REPL.##22#23{Bool,Base.REPL.##33#42{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:652
 in run_interface(::Base.Terminals.TTYTerminal, ::Base.LineEdit.ModalInterface) at ./LineEdit.jl:1579
 in run_frontend(::Base.REPL.LineEditREPL, ::Base.REPL.REPLBackendRef) at ./REPL.jl:903
 in run_repl(::Base.REPL.LineEditREPL, ::Base.##930#931) at ./REPL.jl:188
 in _start() at ./client.jl:360

The crux of the problem is

julia> ca = categorical(repeat(1:10, inner = 3));

julia> isfinite(ca[3])
ERROR: MethodError: no method matching isfinite(::CategoricalArrays.CategoricalValue{Int64,UInt32})
Closest candidates are:
  isfinite(::Float16) at float16.jl:119
  isfinite(::BigFloat) at mpfr.jl:799
  isfinite(::DataArrays.NAtype) at /home/bates/.julia/v0.5/DataArrays/src/predicates.jl:9
  ...

whereas

julia> pda = pool(repeat(1:10, inner = 3));

julia> isfinite(pda[3])
true

I'm not sure if this should be reported here or in the Gadfly package.

Add convert(Array, ::CategoricalArray) methods

We should add convert methods so that in the following example, the element type is String:

julia> Array(categorical(["a"]))
1-element Array{CategoricalArrays.CategoricalValue{String,UInt32},1}:
 "a"

It would be great if we could find a way to make it work for any AbstractArray.

Efficient broadcast comparison

On 0.6,

julia> heys1 = fill("Hey", 10000);

julia> heys3 = CategoricalVector(heys1);

julia> @btime heys1 .== "Hey";
  62.509 μs (21 allocations: 6.19 KiB)

julia> @btime heys3 .== "Hey";
  101.900 μs (21 allocations: 6.19 KiB)

Theoretically the second comparison should be faster, since it should boil down to comparing integers (or, in this case, realizing that no value in the pool is ==). Is there a way of implementing this? I'm not familiar with broadcasting innards.
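
The idea of reducing the comparison to integers can be sketched on plain data. `eq_scalar` below is a hypothetical helper operating on a levels vector and a codes vector, not the package's broadcast machinery; it assumes Julia 1.0 or later, where `findfirst` returns `nothing` when no element matches.

```julia
# Hypothetical helper: elementwise comparison of a categorical-like vector
# (levels + integer codes) against a scalar x.
function eq_scalar(levels, codes, x)
    k = findfirst(isequal(x), levels)            # look x up in the levels once
    k === nothing && return falses(length(codes)) # x is not a level: nothing matches
    return codes .== k                           # otherwise compare integer codes only
end

levels = ["Hey", "Ho"]
codes = [1, 2, 1, 1]
hits = eq_scalar(levels, codes, "Hey")
misses = eq_scalar(levels, codes, "Nope")
```

The level lookup happens once per call, so the per-element work is a single integer comparison.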

CategoricalArray type not closed under `unique` method

When one applies the unique function to a categorical array, I would expect a categorical array of the same type to be returned but this is not the case. I'm using Julia 0.6:

julia> CategoricalArray(["a","b","c", "a"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "a"

julia> unique(ans)
3-element Array{String,1}:
 "a"
 "b"
 "c"

julia> VERSION
v"0.6.2"

Slow creation of CategoricalArrays from array of strings

Converting an array of strings to a CategoricalArray is really slow.

using CategoricalArrays
import CategoricalArrays.CategoricalArray
const N = 250_000_000
const K = 100
@time pool1 = [@sprintf "id%010d" k for k in 1:(N/K)]
@time id3 = rand(pool1, N)

# very slow
@time a = CategoricalArray(id3, ordered=false)

Implement deleteat! method

I happened to run into this when constructing an example of a rank-deficient linear model caused by a missing cell in a table.

Main>     dfrm = DataFrame([categorical(repeat(string.('A':'D'), inner = 6)),
                      categorical(repeat(string.('a':'c'), inner = 2, outer = 4))],
                      [:G, :H])
24×2 DataFrames.DataFrame
│ Row │ G │ H │
├─────┼───┼───┤
│ 1   │ A │ a │
│ 2   │ A │ a │
│ 3   │ A │ b │
│ 4   │ A │ b │
│ 5   │ A │ c │
│ 6   │ A │ c │
│ 7   │ B │ a │
│ 8   │ B │ a │
│ 9   │ B │ b │
│ 10  │ B │ b │
│ 11  │ B │ c │
│ 12  │ B │ c │
│ 13  │ C │ a │
│ 14  │ C │ a │
│ 15  │ C │ b │
│ 16  │ C │ b │
│ 17  │ C │ c │
│ 18  │ C │ c │
│ 19  │ D │ a │
│ 20  │ D │ a │
│ 21  │ D │ b │
│ 22  │ D │ b │
│ 23  │ D │ c │
│ 24  │ D │ c │

Main> deleterows!(dfrm, 7:8)
ERROR: MethodError: no method matching deleteat!(::CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Union{}}, ::UnitRange{Int64})
Closest candidates are:
  deleteat!(::BitArray{1}, ::UnitRange{Int64}) at bitarray.jl:968
  deleteat!(::Array{T,1} where T, ::UnitRange{#s45} where #s45<:Integer) at array.jl:878
  deleteat!(::Array{T,1} where T, ::AbstractArray{T,1} where T) at array.jl:914
  ...
Stacktrace:
 [1] deleterows!(::DataFrames.DataFrame, ::UnitRange{Int64}) at /home/bates/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:663
 [2] eval(::Module, ::Any) at ./boot.jl:235

Superficially it seems that the call to deleteat! could be performed on the refs array and the original CategoricalArray returned but I defer to those who know more about the internals than I do.
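
A rough sketch of that suggestion, using a hypothetical `ToyCategorical` struct with a `refs` field rather than the real CategoricalArray type (whose internals may need more care, as noted above):

```julia
# ToyCategorical is a hypothetical stand-in for a categorical vector:
# integer codes in `refs`, level values in `levels`.
struct ToyCategorical
    refs::Vector{UInt32}
    levels::Vector{String}
end

# Delete elements by forwarding to the codes vector; the levels are untouched.
function Base.deleteat!(A::ToyCategorical, inds)
    deleteat!(A.refs, inds)
    return A
end

A = ToyCategorical(UInt32[1, 1, 2, 2, 3], ["A", "B", "C"])
deleteat!(A, 2:3)
```

After the call `A.refs` holds the three remaining codes, and level "B" stays in `levels` even though only one element still refers to it.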

Move CategoricalValue and CategoricalPool into separate package

I'm starting to think about queryverse/IterableTables.jl#2, and the whole design would be a lot easier if packages could take a dependency on CategoricalValue without taking a dependency on the whole CategoricalArrays package. Maybe a package called CategoricalValues.jl would work?

No efficient way to generate a categorical array or hard to discover

No efficient way to generate a categorical array, or it's really hard to discover.

CategoricalArray([1, 2, 1, 3], CategoricalPool(["a","b","c"],false))

To me, the above should generate the array ["a", "b", "a", "c"], or even simpler:

CategoricalArray([1,2,1,3], ["a","b","c"])

Otherwise it doesn't feel intuitive.
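
For reference, the mapping being asked for is plain indexing of the levels by the codes. The snippet below uses ordinary vectors only and makes no claim about which constructors the package provides.

```julia
codes = [1, 2, 1, 3]          # integer codes, 1-based
levels = ["a", "b", "c"]      # one entry per distinct level

# Indexing the levels by the codes yields the decoded values.
decoded = levels[codes]
```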

CategoricalArrays without CategoricalValue

The type CategoricalArray has two different uses:

1. represent categorical data with a particular ordering
2. indicate in a @formula call that the variable should be treated as a set of dummies.

I think these goals are in conflict, since one may want 2. without 1. For instance, I may want to treat year as a dummy in a regression while subsetting the dataframe to years below 1995:

using DataFrames, GLM
df = DataFrame(y = rand(100), x = rand(100), year = categorical(rand(1990:2000, 100)))
lm(@formula(y ~ x + year), df[df[:year] .<= 1995, :])
# ERROR: MethodError: no method matching isless(::Int64, ::CategoricalArrays.CategoricalValue{Int64,UInt32})

I see two solutions

1. Have a special syntax to hint @formula to consider a variable as categorical, such as
```
using DataFrames, GLM
df = DataFrame(y = rand(100), x = rand(100), year = rand(1990:2000, 100))
lm(@formula(y ~ x + c.year), df[df[:year] .<= 1995, :])
```
This is what Stata does. This kind of syntax would be my favorite solution, but it would require stricter variable names (see JuliaData/DataFrames.jl#1348). A related solution is to support functions in formulas, i.e. lm(@formula(y ~ x + categorical(year)), df[df[:year] .<= 1995, :]), although it can get verbose with large statistical models.
2. Define a simpler version of CategoricalArray that does not return a CategoricalValue when indexed.

Create online manual using Documenter.jl

It would be nice to convert the documentation currently in README.md to a real manual using Documenter.jl. Apart from looking better, it would allow listing the provided API online, and running doctests to ensure they still work.

See https://juliadocs.github.io/Documenter.jl/latest/ for instructions.

Handling of SubArray of CategoricalArray

Currently SubArray of CategoricalArray is not recognized as CategoricalArray.
The consequence is that the code that wants to handle both has to use:

Union{CategoricalArray, SubArray{T, N, <:CategoricalArray}} where {T,N}

which is ugly and actually never used in the code.

As a result, methods may work differently (i.e. produce different results) on a CategoricalArray and on its views. The problem was spotted in FreqTables when working with views.

Is there any standard practice for handling this in the current state of the type system?
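
One common workaround, shown here only as a sketch, is to hide the union behind a type alias and dispatch on the alias. `ToyCat`, `ToyCatOrView` and `kind` are hypothetical names; the toy type implements just enough of the AbstractArray interface for `view` to work.

```julia
# Hypothetical minimal array type standing in for a categorical vector.
struct ToyCat{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(A::ToyCat) = size(A.data)
Base.getindex(A::ToyCat, i::Int) = A.data[i]

# Alias covering both the array itself and one-dimensional views of it.
const ToyCatOrView{T} = Union{ToyCat{T}, SubArray{T, 1, <:ToyCat}}

# Methods written against the alias apply to arrays and views alike.
kind(::ToyCatOrView) = "categorical-like"
kind(::AbstractVector) = "generic"

c = ToyCat([1, 2, 3])
on_array = kind(c)
on_view = kind(view(c, 1:2))
on_plain = kind([1, 2, 3])
```

This keeps the union out of method signatures, but every method that should treat views specially still has to be written against the alias, which is the maintenance burden described above.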

CC @nalimilan

Getting CategoricalValue from a pool

Assume we have a categorical array c that has a level that has a value v. In order to get a CategoricalValue corresponding to this value v I write:

c.pool[get(c.pool, v)]

but it seems a bit cumbersome and uses access to pool. Is there a better way to do it? If not I think it would be good to have one.

The use case is, for example, an ordered categorical array in which we want to filter values greater than some level using the order defined in this array. To do this you have to compare the values in this array to a CategoricalValue that is this specific level.

Handling CategoricalValue types?

I'm in the process of updating Gadfly's code to use the new DataFrames v0.11+ infrastructure and I'm getting hung up on the eltype of CategoricalArrays. It looks like an accessed element is always wrapped? I believe this is different from PooledDataArrays?

CategoricalArrays

julia> a = CategoricalArray([RGBA{Float32}(1.0, 1.0, 1.0, 1.0)])
1-element CategoricalArrays.CategoricalArray{ColorTypes.RGBA{Float32},1,UInt32}:
 RGBA{Float32}(1.0f0,1.0f0,1.0f0,1.0f0)

julia> eltype(a)
CategoricalArrays.CategoricalValue{ColorTypes.RGBA{Float32},UInt32}

DataArrays v0.6.2

julia> a = PooledDataArray([RGBA{Float32}(1.0, 1.0, 1.0, 1.0)])
1-element DataArrays.PooledDataArray{ColorTypes.RGBA{Float32},UInt32,1}:
 RGBA{Float32}(1.0,1.0,1.0,1.0)

julia> eltype(a)
ColorTypes.RGBA{Float32}

Unfortunately this means that general functions like something(color::Color) won't work with CategoricalArrays any more.

Serialization

Was the performance/storage efficiency of [Nullable]CategoricalArray serialization checked?
Would it make sense to override serialize()/deserialize() for CategoricalPool (the invindex and especially valindex fields could be reconstructed from index)?
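
As an illustration of why the reverse lookup need not be stored, here is a sketch that rebuilds it from the level list. It assumes, as the field names suggest, that index holds the level values in order and that invindex maps each level back to its integer code; the variables are plain Julia objects, not the pool's actual fields.

```julia
# Level values in pool order (what would be written to disk).
index = ["a", "b", "c"]

# Reverse lookup rebuilt after loading: level value -> integer code.
invindex = Dict(v => UInt32(i) for (i, v) in enumerate(index))
```

Rebuilding costs one pass over the levels, which is typically small compared with the array of codes.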

precompile error when loading from other module

When I load modules RCall or MixedModels (that both import CategoricalArrays) I get the error

WARNING: Module CategoricalArrays with uuid 102572555498720 is missing from the cache.
This may mean module CategoricalArrays does not support precompilation but is imported by a module that does.
ERROR: LoadError: Declaring __precompile__(false) is not allowed in files that are being precompiled.
 in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in macro expansion; at ./none:2 [inlined]
 in anonymous at ./<missing>:?
 in eval(::Module, ::Any) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in eval(::Module, ::Any) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in process_options(::Base.JLOptions) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
while loading /Users/michael/.julia/v0.5/RCall/src/RCall.jl, in expression starting on line 3
ERROR: Failed to precompile RCall to /Users/michael/.julia/lib/v0.5/RCall.ji.
 in compilecache(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?

Lower-level constructors

I'm having a hard time figuring out how to construct a CategoricalArray at a "lower level" than just providing a vector of strings + missingness vector.

My use-case is Feather.jl, where the raw file bits provide an actual list of unique levels as strings, a vector of Ints of the actual elements, and an indication of whether the levels are ordered or not. Given these three items, is there some kind of intermediate constructor I could use? Do I need to manually build a Pool first and then use that to build my Array? Any pointers here would be most appreciated.
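
To make the request concrete, this is roughly the shape such an intermediate constructor could take. Everything here is hypothetical: `ToyCategoricalVector` and `value` are illustrative names, codes are assumed to be 1-based with 0 meaning missing, and `missing` requires Julia 1.0 or later. It is not the package's constructor.

```julia
# Hypothetical container built directly from levels, codes and an ordered flag.
struct ToyCategoricalVector
    refs::Vector{UInt32}
    levels::Vector{String}
    ordered::Bool

    function ToyCategoricalVector(refs, levels, ordered::Bool)
        # Validate the raw inputs once, at construction time.
        allunique(levels) || throw(ArgumentError("levels must be unique"))
        all(r -> 0 <= r <= length(levels), refs) ||
            throw(ArgumentError("codes must lie in 0:length(levels)"))
        return new(convert(Vector{UInt32}, refs), convert(Vector{String}, levels), ordered)
    end
end

# Decode one element; code 0 is treated as missing in this sketch.
value(v::ToyCategoricalVector, i) = v.refs[i] == 0 ? missing : v.levels[v.refs[i]]

v = ToyCategoricalVector([2, 1, 0, 2], ["low", "high"], true)
```

A file format that stores 0-based codes would need them shifted by one before being passed in.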

Update of CI scripts

@nalimilan appveyor.yml and .travis.yml need updating for Julia 1.0 (in particular Appveyor CI fails on Julia 1.0 in current configuration), but I was not sure which versions of Julia you want to test against so I am opening an issue.

Define functions like `levels` in StatsBase?

I think the conversion from DataArrays to CategoricalArrays would be made easier if functions like levels were defined in a "Base" package of some sort and the individual packages defined methods for those functions. The base package doesn't have to be StatsBase - it could be something like FactorBase that is specific to this data type.

At least my experience is that I end up defining AbstractFactor types and functions like

@compat const AbstractFactor{V,R} = Union{NullableCategoricalVector{V,R},CategoricalVector{V,R},PooledDataVector{V,R}}

"""
    asfactor(f)

Return `f` as an AbstractFactor.

This function and the `AbstractFactor` union can be removed once `CategoricalArrays` replaces
`PooledDataArray`
"""
asfactor(f::AbstractFactor) = f
asfactor(f) = pool(f)


"""
    levs(A::ReMat)

Return the levels of the grouping factor.

This is to disambiguate a call to `levels` as both `DataArrays`
and `CategoricalArrays` export it.
"""
function levs(A::ReMat)
    f = A.f
    isa(f, PooledDataArray) ? DataArrays.levels(f) : CategoricalArrays.levels(f)
end

If I have missed a better way of working around the same name being defined in different packages please let me know.

CategoricalArray constructor sometimes does not copy all data

See
JuliaData/DataFrames.jl#1309 (comment)
for an example

The problem is that lines:
https://github.com/JuliaData/CategoricalArrays.jl/blob/master/src/array.jl#L269
and
https://github.com/JuliaData/CategoricalArrays.jl/blob/master/src/array.jl#L270
do not guarantee that a copy is performed (sometimes convert returns the reference to the original object).
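
The behaviour of convert that causes this can be seen with ordinary arrays: when the argument already has the requested type, the same object is returned, so an explicit copy is needed if independence from the input is required.

```julia
v = [1, 2, 3]

# Already a Vector{Int}: convert returns the very same object.
same = convert(Vector{Int}, v)

# Different element type: a new array is allocated.
other = convert(Vector{Float64}, v)

# An explicit copy is always a distinct object.
safe = copy(v)
```

Mutating `same` therefore also mutates `v`, while `safe` stays independent.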

CC @nalimilan

recode fails with missing and value ranges

This works:

julia> recode([1,2,3], [1,2]=>100)
3-element Array{Int64,1}:
 100
 100
   3

and this works:

julia> recode(categorical([1,2,3,missing]), [1,2]=>100)
4-element CategoricalArrays.CategoricalArray{Union{Int64, Missings.Missing},1,UInt32}:
 100
 100
 3
 missing

but this fails:

julia> recode([1,2,3,missing], [1,2]=>100)
ERROR: TypeError: non-boolean (Missings.Missing) used in boolean context
Stacktrace:
 [1] any(::Base.##136#137{Missings.Missing}, ::Array{Int64,1}) at .\reduce.jl:574
 [2] in(::Missings.Missing, ::Array{Int64,1}) at .\reduce.jl:631
 [3] recode!(::Array{Union{Int64, Missings.Missing},1}, ::Array{Union{Int64, Missings.Missing},1}, ::Void, ::Pair{Array{Int64,1},Int64}, ::Vararg{Pair{Array{Int64,1},Int64},N} where N) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\recode.jl:39
 [4] recode(::Array{Union{Int64, Missings.Missing},1}, ::Void, ::Pair{Array{Int64,1},Int64}, ::Vararg{Pair{Array{Int64,1},Int64},N} where N) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\recode.jl:332
 [5] recode(::Array{Union{Int64, Missings.Missing},1}, ::Pair{Array{Int64,1},Int64}) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\recode.jl:317

Probably line:
https://github.com/JuliaData/CategoricalArrays.jl/blob/master/src/recode.jl#L39
should be fixed.
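
The stack trace suggests that `in(missing, [1, 2])` does not produce a `Bool`, so using it directly as a condition throws. A minimal Base-only sketch of one possible guard follows. `recode_sketch` is a hypothetical stand-in written for illustration, not the package's `recode!` implementation, and it handles only a single `Pair`:

```julia
# Hypothetical stand-in for illustration; not the package's recode! code.
function recode_sketch(xs, pair::Pair)
    from, to = pair
    return map(xs) do x
        # `x in from` evaluates to `missing` when x is missing, which cannot
        # be used as an `if` condition, so test for missing first.
        if !ismissing(x) && x in from
            to
        else
            x
        end
    end
end

result = recode_sketch([1, 2, 3, missing], [1, 2] => 100)
```

Here missing entries are passed through unchanged, which matches the behavior shown above for the `categorical` input.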

How to make a DataFrame out of CategoricalArrays?

I was trying to run

using DataFrames, CategoricalArrays
DataFrame(NullableCategoricalArray(["nihao","haha"]))

but it gives an error

MethodError: Cannot convert an object of type CategoricalArrays.NullableCategoricalArray{String,1,UInt32} to an object of type DataFrames.DataFrame
This may have arisen from a call to the constructor DataFrames.DataFrame(...),
since type constructors fall back to convert methods.
in DataFrames.DataFrame at base\sysimg.jl:24

so I am wondering how to make a DataFrame out of CategoricalArrays. I am quite new to Julia, so I am not sure how to fix it myself just yet.

Rename `get` for categorical values?

`levels=` for `categorical()`

IIUC, categorical() currently doesn't support specifying levels.
Would it be possible to support a levels= keyword parameter?
The use case is, e.g., "normalizing" imported data with known hard-coded category values.
That would save one levels!() call and make the user code a little less error-prone.

A somewhat related use case is making the categorical or plain-values column A in one data frame match the categorical column B in another (i.e. the same levels encoding).
Would it be possible to do this with something like similar(A, B)?

Check for overflow when adding new levels in setindex!()

PooledDataArrays have this safety feature. We should probably add it too, since it has a very low performance cost.
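
As a rough sketch of what such a check could look like (the function name, its arguments, and the error message are assumptions made for illustration, not the package's internals), the number of existing levels is compared with the largest value the reference type can hold before a new level is added:

```julia
# Hypothetical sketch; the real pool type and its fields differ.
# `levels` holds the distinct values, and R is the integer type used for references.
function add_level!(levels::Vector, ::Type{R}, level) where {R<:Unsigned}
    # A reference of type R can address at most typemax(R) levels.
    if length(levels) >= typemax(R)
        throw(OverflowError("cannot add a new level: reference type $R is full"))
    end
    push!(levels, level)
    return R(length(levels))   # reference code of the newly added level
end

levels = collect(1:254)
ref = add_level!(levels, UInt8, 255)   # fits: this becomes the 255th level
```

The cost is a single integer comparison on the path that adds a level, which is consistent with the low overhead mentioned above.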

map returns CategoricalArray{Any}

map with CategoricalArray currently returns a CategoricalArray{Any} when the function does not return a categorical value.

For example:

map(get, categorical(1:2))
Base.collect_to!(categorical([0.0, 0.0]), (1.0 for v in 1:3), 2, 2)

This appears to be due to the fact that similar(::CategoricalArray, T) returns a CategoricalArray{T}, whose element type is not T but CategoricalValue{T} or CategoricalString. So when collect_to! calls promote_typejoin(T, typeof(el)), it tries to promote CategoricalValue{T} with typeof(el) instead of doing the more appropriate CategoricalValue{promote_typejoin(valtype(T), typeof(el))}, and eventually chooses Any.

As an exception, these do not return a CategoricalArray{Any}:

map(string, categorical([1]))
Base.collect_to!(categorical(["", ""]), ("a" for v in 1:2), 2, 2)

AFAICT this is due to the fact that the CategoricalArray constructor uses CategoricalString if the array has an AbstractString element type. Since promote_type(CategoricalString, String) gives AbstractString, the resulting array will be CategoricalArray{String}.

I can see three ways of fixing this:

Change similar(::CategoricalArray, T) to return an Array{T} when T is not a categorical value. This is more correct regarding the AbstractArray interface. But it could be annoying in some cases where a CategoricalArray is more natural/efficient. Or maybe not (concrete examples are needed).
Change promote_typejoin(S::Type{<:CategoricalValue}, T) so that it returns CategoricalValue{promote_typejoin(valtype(S), T)}.
Implement a special method for map. This might be needed anyway to take advantage of the possibility of calling the function only on the levels for efficiency. But the current behavior of similar could be problematic elsewhere.

Whichever solution we choose, it seems that map should continue to return a CategoricalArray. Indeed map generally preserves the container type (e.g. for tuples), CategoricalArray is more efficient for repeated data, and array comprehensions are available to create arrays when explicitly needed. broadcast should probably also preserve the type (which it currently doesn't).

Check for null values in copy!()

The specialized copy! method added by #37 does not check for null values when copying from a NullableCategoricalArray to a CategoricalArray. This currently creates #undef entries, but it would make more sense to raise an error during "conversion".

Fix showcompact()

Looks like we need to override the default method (and maybe to improve that method in Base too):

julia> showcompact(CategoricalArray(["a", "b"]))
CategoricalArrays.CategoricalValue{String,UInt32}["a","b"]

How about a CategoricalArraysTools.jl package where things like faster sort and countmap can go?

< and > are inconsistent

> using CategoricalArrays
> x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
> @show x[1] < x[2]
> @show x[1] > x[2];

x[1] < x[2] = true
x[1] > x[2] = CategoricalArrays.CategoricalString{UInt32}["Old", "Young", "Middle", "Young"]

Segfault when testing under v0.4.6 on Ubuntu 16.04

I am consistently getting segfaults under v0.4.6 when testing the release version or the master branch. On the master branch of CategoricalArrays I get a series of warnings about overwriting convert methods:

julia> Pkg.test("CategoricalArrays")
INFO: Computing test dependencies for CategoricalArrays...
INFO: No packages to install, update or remove
INFO: Testing CategoricalArrays
INFO: Recompiling stale cache file /home/bates/.julia/lib/v0.4/CategoricalArrays.ji for module CategoricalArrays.
WARNING: Method definition convert(Type{Array{#T<:Any, N<:Any}}, AbstractArray{#S<:Any, #n<:Any}) in module Base at array.jl:240 overwritten in module CategoricalArrays at /home/bates/.julia/v0.4/CategoricalArrays/src/CategoricalArrays.jl:18.
WARNING: Method definition convert(Type{Array{#T<:Any, #n<:Any}}, AbstractArray{#S<:Any, #n<:Any}) in module Base at array.jl:241 overwritten in module CategoricalArrays at /home/bates/.julia/v0.4/CategoricalArrays/src/CategoricalArrays.jl:19.
WARNING: Method definition convert(Type{Base.Nullable}, #T<:Any) in module Base at nullable.jl:19 overwritten in module CategoricalArrays at /home/bates/.julia/v0.4/CategoricalArrays/src/CategoricalArrays.jl:21.
WARNING: New definition 
    convert(Type{CategoricalArrays.NominalPool{#S<:Any, #R<:Any, V<:Any}}, CategoricalArrays.NominalPool) at /home/bates/.julia/v0.4/CategoricalArrays/src/pool.jl:54
is ambiguous with: 
    convert(Type{CategoricalArrays.NominalPool{#T<:Any, #R<:Any, V<:Any}}, CategoricalArrays.NominalPool{#T<:Any, #R<:Any, V<:Any}) at /home/bates/.julia/v0.4/CategoricalArrays/src/pool.jl:51.
To fix, define 
    convert(Type{CategoricalArrays.NominalPool{#S<:Any, #R<:Any, V<:Any}}, CategoricalArrays.NominalPool{#S<:Any, _<:Integer, V<:Any})
before the new definition.

culminating in

WARNING: New definition 
    convert(Type{CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}}, CategoricalArrays.NullableOrdinalArray) at /home/bates/.julia/v0.4/CategoricalArrays/src/array.jl:92
is ambiguous with: 
    convert(Type{CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}}, CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}) at /home/bates/.julia/v0.4/CategoricalArrays/src/array.jl:59.
To fix, define 
    convert(Type{CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}}, CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, _<:Integer})
before the new definition.

signal (11): Segmentation fault
unknown function (ip: 0x7ff5d2f3cf76)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3ddeb)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3ddeb)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3ddeb)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3d0d5)

Although CategoricalArrays is the package that triggers this segfault, it looks like it is a problem in Julia itself.

Complete Nullable API

It would be nice to have padnull(), anynull(), etc. methods for NullableCategoricalArray, and also to maintain API parity with NullableArrays in the long term. There are three possibilities:

CategoricalArrays requires NullableArrays and extends the NullableArrays methods for NullableCategoricalArray.
NullableArrays requires CategoricalArrays and implements the nullable API for NullableCategoricalArray (the type definition still stays in CategoricalArrays).
A new package, NullableCategoricalArrays.jl, requires both NullableArrays and CategoricalArrays. That way the requirements could be made really fine-grained.
...?

To me the second option makes more sense:

if one wants to work with nullables, one very likely also wants nullable support for categorical arrays
CategoricalArrays is a smaller package with a minimal number of dependencies
it's also better for maintenance and version updates to have all methods defined in one package

Define `push!` and `append!`

Currently in DataStreams, I'm ironing out an "api" of sorts for columns, i.e. the things that actually store data in a Source or Sink. They're mainly interacted with through the Data.getfield and Data.getcolumn methods, but they also need to support a few basic methods. Currently this includes:

push!(A, item)
append!(A, B)
setindex!(A, item, i)
allocate{T}(::Type{T}, rows, ref)

Most of these are pretty basic, though allocate is the funny one. allocate takes a scalar type (i.e. T, Nullable{T}, NominalValue{S, R}), a number of rows, and a potential ref or parent Vector{UInt8}, and allocates a new column vector of some kind that the data will be streamed to.

I'm happy to have the allocate methods for CategoricalArrays live in DataStreams for now (while things settle down), with the idea that they could eventually move to the various packages (NullableArrays, DataFrames, CategoricalArrays, etc.).

The two we seem to be missing, however, are push! and append! for the various CategoricalArray types. Happy to take a crack at it, but wanted to post it here first.
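
For discussion, here is a minimal sketch of what `push!` and `append!` could look like on a pooled vector. `ToyPooled` and `decode` are hypothetical names invented for this example; the actual CategoricalArray types store their pool differently and would also need to handle ordering and nullable values:

```julia
# Hypothetical toy type for illustration; CategoricalArray's real layout differs.
struct ToyPooled{T}
    refs::Vector{UInt32}   # index into `levels` for each element
    levels::Vector{T}      # distinct values, in order of first appearance
end
ToyPooled{T}() where {T} = ToyPooled{T}(UInt32[], T[])

function Base.push!(A::ToyPooled{T}, item) where {T}
    value = convert(T, item)
    i = findfirst(isequal(value), A.levels)
    if i === nothing
        # Unseen value: register it as a new level first.
        push!(A.levels, value)
        i = length(A.levels)
    end
    push!(A.refs, i)
    return A
end

function Base.append!(A::ToyPooled, items)
    for item in items
        push!(A, item)
    end
    return A
end

# Expand the references back into plain values.
decode(A::ToyPooled) = A.levels[A.refs]

A = ToyPooled{String}()
push!(A, "a")
append!(A, ["b", "a"])
```

Implementing `append!` in terms of `push!` keeps the level bookkeeping in one place, at the cost of a level lookup per element.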

Conflict with missing causing errors on 0.7

Perhaps this issue belongs in Missings, but:

WARNING: both Missings and Base export "Missing"; uses of it in module CategoricalArrays must be qualified
ERROR: LoadError: LoadError: UndefVarError: Missing not defined
Stacktrace:
 [1] top-level scope
 [2] include at ./boot.jl:279 [inlined]
 [3] include_relative(::Module, ::String) at ./loading.jl:509
 [4] include at ./sysimg.jl:15 [inlined]
 [5] include(::String) at /home/travis/.julia/v0.7/CategoricalArrays/src/CategoricalArrays.jl:2
 [6] top-level scope
 [7] include at ./boot.jl:279 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:509
 [9] include(::Module, ::String) at ./sysimg.jl:15
 [10] top-level scope
 [11] eval at ./boot.jl:282 [inlined]
 [12] top-level scope at ./<missing>:2
in expression starting at /home/travis/.julia/v0.7/CategoricalArrays/src/value.jl:25
in expression starting at /home/travis/.julia/v0.7/CategoricalArrays/src/CategoricalArrays.jl:23
ERROR: LoadError: Failed to precompile CategoricalArrays to /home/travis/.julia/lib/v0.7/CategoricalArrays.ji.
