Documentation:
This package provides tools for working with categorical variables, both with unordered (nominal variables) and ordered categories (ordinal variables), optionally with missing values.
Arrays for working with categorical data (both nominal and ordinal)
License: Other
On the julia-stats mailing list I suggested exporting a contrasts generic from StatsBase and adding methods in this repository. I would appreciate your comments, @nalimilan, if you have any.
I find the following behavior inconsistent. Is it on purpose?
julia> using CategoricalArrays
julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
2
3
julia> x[3] = 10
ERROR: cannot add new level 10 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.
julia> recode(x, 3=>10)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
2
10
Now CategoricalValue is not <: AbstractString, so JSON.jl serializes it using CompositeTypeWrapper and goes into infinite recursion due to the circular pool -> value -> pool references.
julia> using CategoricalArrays, JSON
julia> x = categorical([1, 2, 3]);
julia> JSON.json(x[1])
ERROR: StackOverflowError:
Stacktrace:
[1] Type at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:18 [inlined]
[2] lower(::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:40
[3] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
[4] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
[5] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Array{CategoricalArrays.CategoricalValue{Int64,UInt32},1}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:302
[6] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
[7] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
[8] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
[9] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Array{CategoricalArrays.CategoricalValue{Int64,UInt32},1}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:302
[10] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
[11] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
[12] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
[13] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Array{CategoricalArrays.CategoricalValue{Int64,UInt32},1}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:302
[14] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
[15] show_pair(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::Symbol, ::CategoricalArrays.CategoricalPool{Int64,UInt32,CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:236
[16] show_json(::JSON.Writer.CompactContext{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::JSON.Serializations.StandardSerialization, ::JSON.Writer.CompositeTypeWrapper{CategoricalArrays.CategoricalValue{Int64,UInt32}}) at /home/astukalov/.julia/v0.6/JSON/src/Writer.jl:294
It could easily be fixed by implementing a JSON.lower(x::CategoricalValue) method, but that would require introducing a CategoricalArrays -> JSON dependency.
A more generic fix would be to implement support for circular references in JSON, but that would still print CategoricalValue incorrectly (we probably just need to show the stored value, i.e. get(x::CategoricalValue)).
Could you add Future to the REQUIRE file (or to Manifest.toml/Project.toml)?
I am getting this error on Travis:
https://travis-ci.org/kafisatz/DecisionTrees.jl/jobs/394917060#L580
If I understand correctly, the values field holds the numeric indices into the pool. I would think that a name such as refs, indices or indexes would be more appropriate. In most types a field named values holds the actual values of the type, without the metadata.
@nalimilan
Currently:
x = categorical([1,2,3])
y = deepcopy(x)
x.pool == y.pool
produces false because the == operation is not defined for CategoricalPool and falls back to ===.
Is this intentional or should this be fixed?
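A minimal sketch of the fallback described above, assuming Julia 1.x syntax. ToyPool is a hypothetical stand-in and not the package's CategoricalPool; its fields are assumptions chosen only to show that == on a mutable type behaves like === until a method is defined.

```julia
# ToyPool is a hypothetical stand-in for a pool type; its fields are assumptions.
mutable struct ToyPool
    levels::Vector{String}
    ordered::Bool
end

p = ToyPool(["a", "b"], false)
q = deepcopy(p)

# Without a dedicated method, == falls back to ===, so a deep copy compares as different.
before = (p == q)

# Field-wise equality; a matching hash method keeps Dict/Set lookups consistent with ==.
Base.:(==)(x::ToyPool, y::ToyPool) = x.levels == y.levels && x.ordered == y.ordered
Base.hash(x::ToyPool, h::UInt) = hash(x.ordered, hash(x.levels, h))

after = (p == q)
```

Whether a real pool should also compare its index structures is a design question this sketch does not settle.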
When I try to use the package in a session started with julia --inline=no, it breaks. All errors during the tests seem to come down to this:
categorical(CategoricalArrays.CategoricalArray{String,1,Int64,String,CategoricalArrays.CategoricalString{Int64},Union{}}, compress=false) R1=Int64 R2=Int64: Error During Test
Got an exception of type MethodError outside of a @test
MethodError: in(::CategoricalArrays.CategoricalString{Int64}, ::Base.KeyIterator{ObjectIdDict}) is ambiguous. Candidates:
in(k, v::Base.KeyIterator) in Base at associative.jl:60
in(x::Union{CategoricalArrays.CategoricalString{R}, CategoricalArrays.CategoricalValue{T,R} where T} where R, y) in CategoricalArrays at /Users/ericperim/.julia/v0.6/CategoricalArrays/src/value.jl:118
Possible fix, define
in(::Union{CategoricalArrays.CategoricalString{R}, CategoricalArrays.CategoricalValue{T,R} where T} where R, ::Base.KeyIterator)
Note: to observe this behaviour it is necessary to run julia --inline=no test/runtests.jl, since running Pkg.test from inside the REPL uses the default options (inline=yes).
I was surprised to find that cut mutates the input breaks vector when extend=true. (And, strangely, only when breaks is sorted.)
I would suggest either making a copy before mutating or documenting the current behaviour.
I could make a PR for either.
MWE:
julia> breaks = [2,4,6];
julia> cut(collect(1:10), breaks, extend=true);
julia> breaks
5-element Array{Int64,1}:
1
2
4
6
10
Cheers.
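A sketch of the "copy before mutating" option, assuming Julia 1.x (pushfirst! was called unshift! on 0.6). extended_breaks is a hypothetical helper written for illustration; it is not how cut is implemented.

```julia
# Hypothetical helper: return breaks extended to cover the data, leaving the input untouched.
function extended_breaks(x, breaks)
    b = sort(breaks)                      # sort (unlike sort!) always returns a new vector
    lo, hi = extrema(x)
    lo < first(b) && pushfirst!(b, lo)    # mutate only the private copy
    hi > last(b) && push!(b, hi)
    return b
end

breaks = [2, 4, 6]
ext = extended_breaks(collect(1:10), breaks)
```

With this approach the caller's breaks vector still has three elements after the call, at the cost of one extra allocation.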
We should review DataArrays's methods and decide which of them should be implemented:
https://github.com/JuliaStats/DataArrays.jl/blob/f17e4e30aa0713c794409802741f43cf8ff7f05e/src/pooleddataarray.jl#L576
These should also be used in DataFrames:
https://github.com/JuliaStats/DataFrames.jl/blob/8334bf1cbeafba98f038e65a67a732e4125dae89/src/abstractdataframe/sort.jl#L312
We currently have a fast path in == and isequal for when two CategoricalArrays share the same pool, but that is not the most common situation. In other cases, we do essentially what the AbstractArray fallback does: extract CategoricalValue objects and compare them, which implies comparing their contents.
It would be much faster to compute a correspondence table between the levels of the arrays first, and then work only with integer codes. But I'm not sure how to do that without allocating an N×M table, with N and M the number of levels of each array. Doing so would only make sense for quite large arrays.
Please see this Stack Overflow post.
Basically, I loaded a DataFrame using Feather.jl and then tried to do
by(df, :some_categorical_array_col, df1 -> sum(df1[:some_value]))
it fails with
MethodError: Cannot `convert` an object of type String to an object of type CategoricalArrays.CategoricalValue{String,Int32}
This may have arisen from a call to the constructor CategoricalArrays.CategoricalValue{String,Int32}(...),
since type constructors fall back to convert methods.
in by at DataFrames\src\groupeddataframe\grouping.jl:320
in groupby at DataFrames\src\groupeddataframe\grouping.jl:92
in DataArrays.PooledDataArray at base\sysimg.jl:24
in DataArrays.PooledDataArray at DataArrays\src\pooleddataarray.jl:140
in convert at NullableArrays\src\primitives.jl:258
in copy! at base\abstractarray.jl:655
I am new to Julia, so I am not sure how to fix it myself.
The only reason for having separate NominalArray and OrdinalArray types (as well as their Nullable counterparts) is to return an OrdinalValue from the latter, which supports < and >. This does not sound worth the increased complexity. We could use a single CategoricalArray type, and store whether it's ordered via a Bool field. With branch prediction, < shouldn't be noticeably slower than with different types.
Incidentally, this is what both Pandas and MATLAB do. Another advantage is that data storage formats generally don't distinguish nominal and ordinal arrays; that distinction must be added after importing data. If we use two different types, moving to ordinal requires changing the type of the array after the fact.
Before doing that change, I'd like to hear what others think of it.
Extending the logic implemented for copy! in #97, it could make sense to copy the ordering of levels from the RHS CategoricalValue in setindex!(::CategoricalArray, ::CategoricalValue, ...). That would essentially mean that copying elements manually one by one from one CategoricalArray to another would be equivalent to using the specialized copy! method: the order of levels of the source would be preserved.
How should custom iteration be implemented over CategoricalArrays?
Or to put it differently, why do next(CategoricalArray(["a","b","b"]), 2) and done(CategoricalArray(["a","b","b"]), 2) give a BoundsError?
I'm on Julia v0.6.2.
The constructor for NullableCategoricalArray does not adhere to the ordering of the data when ordering levels.
julia> using CategoricalArrays
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Cameron/.julia/v0.6/NullableArrays/src/operators.jl:128.
julia> x = levels!(CategoricalArray(["B", "B", "A", "A"]), ["C", "B", "A"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"B"
"B"
"A"
"A"
julia> levels(x)
3-element Array{String,1}:
"C"
"B"
"A"
julia> nullx = NullableCategoricalArray(x)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
"B"
"B"
"A"
"A"
julia> levels(nullx) # why did the levels change?
3-element Array{String,1}:
"A"
"B"
"C"
julia> nullx = NullableCategoricalArray(x, ordered=true)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
"B"
"B"
"A"
"A"
julia> levels(nullx) # still reordered even with ordered=true
3-element Array{String,1}:
"A"
"B"
"C"
julia> droplevels!(nullx) # does not reset the order
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
"B"
"B"
"A"
"A"
julia> levels(nullx)
2-element Array{String,1}:
"A"
"B"
The same happens for Array{String}:
julia> y = ["B", "B", "A", "A"]
4-element Array{String,1}:
"B"
"B"
"A"
"A"
julia> nully = NullableCategoricalArray(y)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
"B"
"B"
"A"
"A"
julia> levels(nully) # ordering
2-element Array{String,1}:
"A"
"B"
julia> nully = NullableCategoricalArray(y, ordered=true)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
"B"
"B"
"A"
"A"
julia> levels(nully) # ordering
2-element Array{String,1}:
"A"
"B"
julia> droplevels!(nully)
4-element CategoricalArrays.NullableCategoricalArray{String,1,UInt32}:
"B"
"B"
"A"
"A"
julia> levels(nully)
2-element Array{String,1}:
"A"
"B"
Would it be more idiomatic to rename ordered() to isordered()?
It might be nice to have a custom logo for this organization. So... call for proposals!
With
julia> Pkg.status("Gadfly")
- Gadfly 0.5.2+ master
julia> Pkg.status("CategoricalArrays")
- CategoricalArrays 0.1.0
an attempt to use a CategoricalArray as x or y in a plot fails because Gadfly checks isfinite on elements of the array.
julia> plot(dyestuff2, x="Yield", y="Batch", Geom.point)
Error showing value of type Gadfly.Plot:
ERROR: MethodError: no method matching isfinite(::CategoricalArrays.CategoricalValue{String,UInt32})
Closest candidates are:
isfinite(::Float16) at float16.jl:119
isfinite(::BigFloat) at mpfr.jl:799
isfinite(::DataArrays.NAtype) at /home/bates/.julia/v0.5/DataArrays/src/predicates.jl:9
...
in apply_statistic_typed(::CategoricalArrays.CategoricalValue{String,UInt32}, ::CategoricalArrays.CategoricalValue{String,UInt32}, ::Array{CategoricalArrays.CategoricalValue{String,UInt32},1}, ::Array{Void,1}, ::Array{Void,1}) at /home/bates/.julia/v0.5/Gadfly/src/statistics.jl:957
in apply_statistic(::Gadfly.Stat.TickStatistic, ::Dict{Symbol,Gadfly.ScaleElement}, ::Gadfly.Coord.Cartesian, ::Gadfly.Aesthetics) at /home/bates/.julia/v0.5/Gadfly/src/statistics.jl:811
in apply_statistics(::Array{Gadfly.StatisticElement,1}, ::Dict{Symbol,Gadfly.ScaleElement}, ::Gadfly.Coord.Cartesian, ::Gadfly.Aesthetics) at /home/bates/.julia/v0.5/Gadfly/src/statistics.jl:38
in render_prepare(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:766
in render(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:819
in display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/html")}, ::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:1092
in macro expansion at ./multimedia.jl:143 [inlined]
in display(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:1044
in hookless(::Media.##7#8{Gadfly.Plot}) at /home/bates/.julia/v0.5/Media/src/compat.jl:14
in render(::Media.NoDisplay, ::Gadfly.Plot) at /home/bates/.julia/v0.5/Media/src/compat.jl:27
in display(::Media.DisplayHook, ::Gadfly.Plot) at /home/bates/.julia/v0.5/Media/src/compat.jl:9
in macro expansion at ./multimedia.jl:143 [inlined]
in display(::Gadfly.Plot) at /home/bates/.julia/v0.5/Gadfly/src/Gadfly.jl:1048
in print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:154
in print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:139
in (::Base.REPL.##22#23{Bool,Base.REPL.##33#42{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:652
in run_interface(::Base.Terminals.TTYTerminal, ::Base.LineEdit.ModalInterface) at ./LineEdit.jl:1579
in run_frontend(::Base.REPL.LineEditREPL, ::Base.REPL.REPLBackendRef) at ./REPL.jl:903
in run_repl(::Base.REPL.LineEditREPL, ::Base.##930#931) at ./REPL.jl:188
in _start() at ./client.jl:360
The crux of the problem is
julia> ca = categorical(repeat(1:10, inner = 3));
julia> isfinite(ca[3])
ERROR: MethodError: no method matching isfinite(::CategoricalArrays.CategoricalValue{Int64,UInt32})
Closest candidates are:
isfinite(::Float16) at float16.jl:119
isfinite(::BigFloat) at mpfr.jl:799
isfinite(::DataArrays.NAtype) at /home/bates/.julia/v0.5/DataArrays/src/predicates.jl:9
...
whereas
julia> pda = pool(repeat(1:10, inner = 3));
julia> isfinite(pda[3])
true
I'm not sure if this should be reported here or in the Gadfly package.
We should add convert methods so that in the following example, the element type is String:
julia> Array(categorical(["a"]))
1-element Array{CategoricalArrays.CategoricalValue{String,UInt32},1}:
"a"
It would be great if we could find a way to make it work for any AbstractArray.
On 0.6,
julia> heys1 = fill("Hey", 10000);
julia> heys3 = CategoricalVector(heys1);
julia> @btime heys1 .== "Hey";
62.509 μs (21 allocations: 6.19 KiB)
julia> @btime heys3 .== "Hey";
101.900 μs (21 allocations: 6.19 KiB)
Theoretically the second comparison should be faster, since it should boil down to comparing integers (or, in this case, realizing that no value in the pool is ==). Is there a way of implementing this? I'm not familiar with broadcasting innards.
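The idea of reducing the comparison to integers can be sketched on a toy (levels, refs) representation, assuming Julia 1.x. The variable names are illustrative assumptions; this is not the package's broadcasting code.

```julia
# Toy representation: refs[i] is the integer code of element i in levels.
levels = ["Hey", "Ho"]
refs = UInt32[1, 2, 1, 1]

# Look the string up once among the levels instead of once per element.
code = findfirst(isequal("Hey"), levels)

# If the value is not a level at all, no element can match.
mask = code === nothing ? falses(length(refs)) : refs .== code
```

Only one string comparison per level is performed; the elementwise work is a comparison of integer codes.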
When one applies the unique function to a categorical array, I would expect a categorical array of the same type to be returned, but this is not the case. I'm using Julia 0.6:
julia> CategoricalArray(["a","b","c", "a"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"a"
julia> unique(ans)
3-element Array{String,1}:
"a"
"b"
"c"
julia> VERSION
v"0.6.2"
Converting an array of strings to a CategoricalArray is really slow.
using CategoricalArrays
import CategoricalArrays.CategoricalArray
const N = 250_000_000
const K = 100
@time pool1 = [@sprintf "id%010d" k for k in 1:(N/K)]
@time id3 = rand(pool1, N)
# very slow
@time a = CategoricalArray(id3, ordered=false)
I happened to run into this when constructing an example of a rank-deficient linear model caused by a missing cell in a table.
Main> dfrm = DataFrame([categorical(repeat(string.('A':'D'), inner = 6)),
categorical(repeat(string.('a':'c'), inner = 2, outer = 4))],
[:G, :H])
24×2 DataFrames.DataFrame
│ Row │ G │ H │
├─────┼───┼───┤
│ 1 │ A │ a │
│ 2 │ A │ a │
│ 3 │ A │ b │
│ 4 │ A │ b │
│ 5 │ A │ c │
│ 6 │ A │ c │
│ 7 │ B │ a │
│ 8 │ B │ a │
│ 9 │ B │ b │
│ 10 │ B │ b │
│ 11 │ B │ c │
│ 12 │ B │ c │
│ 13 │ C │ a │
│ 14 │ C │ a │
│ 15 │ C │ b │
│ 16 │ C │ b │
│ 17 │ C │ c │
│ 18 │ C │ c │
│ 19 │ D │ a │
│ 20 │ D │ a │
│ 21 │ D │ b │
│ 22 │ D │ b │
│ 23 │ D │ c │
│ 24 │ D │ c │
Main> deleterows!(dfrm, 7:8)
ERROR: MethodError: no method matching deleteat!(::CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Union{}}, ::UnitRange{Int64})
Closest candidates are:
deleteat!(::BitArray{1}, ::UnitRange{Int64}) at bitarray.jl:968
deleteat!(::Array{T,1} where T, ::UnitRange{#s45} where #s45<:Integer) at array.jl:878
deleteat!(::Array{T,1} where T, ::AbstractArray{T,1} where T) at array.jl:914
...
Stacktrace:
[1] deleterows!(::DataFrames.DataFrame, ::UnitRange{Int64}) at /home/bates/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:663
[2] eval(::Module, ::Any) at ./boot.jl:235
Superficially it seems that the call to deleteat! could be performed on the refs array and the original CategoricalArray returned, but I defer to those who know more about the internals than I do.
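The suggestion can be sketched on a toy type, assuming Julia 1.x. ToyCategoricalVector and its fields are hypothetical; the real array type has more structure that a proper method would need to respect.

```julia
# Hypothetical stand-in: integer codes plus the levels they refer to.
struct ToyCategoricalVector
    refs::Vector{UInt32}
    levels::Vector{String}
end

# Forward the deletion to the codes and return the original container.
function Base.deleteat!(v::ToyCategoricalVector, inds)
    deleteat!(v.refs, inds)
    return v
end

v = ToyCategoricalVector(repeat(UInt32[1, 2, 3, 4], inner = 6), ["A", "B", "C", "D"])
deleteat!(v, 7:8)                            # drop two of the six "B" rows
remaining = [v.levels[r] for r in v.refs]
```

The levels are left untouched, so a level whose rows are all deleted would still be listed.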
I'm starting to think about queryverse/IterableTables.jl#2, and the whole design would be a lot easier if packages could take a dependency on CategoricalValue without taking a dependency on the whole CategoricalArrays package. Maybe a package called CategoricalValues.jl would work?
There seems to be no efficient way to generate a categorical array, or it's really hard to discover.
CategoricalArray([1, 2, 1, 3], CategoricalPool(["a","b","c"],false))
To me, the above should generate the array ["a", "b", "a", "c"], or, even simpler,
CategoricalArray([1,2,1,3], ["a","b","c"])
Otherwise it doesn't feel intuitive.
The type CategoricalArray has two different uses:
@formula call that the variable should be treated as a set of dummies.
I think these goals are in conflict, since one may want 2. without 1. For instance, I may want to treat year as a dummy in a regression while subsetting the dataframe to years below 1995:
using DataFrames, GLM
df = DataFrame(y = rand(100), x = rand(100), year = categorical(rand(1990:2000, 100)))
lm(@formula(y ~ x + year), df[df[:year] .<= 1995, :])
# ERROR: MethodError: no method matching isless(::Int64, ::CategoricalArrays.CategoricalValue{Int64,UInt32})
I see two solutions
@formula to consider a variable as categorical, such as
using GLM
df = DataFrame(y = rand(100), x = rand(100), year = rand(1990:2000, 100))
lm(@formula(y ~ x + c.year), df[df[:year] .<= 1995, :])
lm(@formula(y ~ x + categorical(year)), df[df[:year] .<= 1995, :])
although it can get verbose with large statistical models.
It would be nice to convert the documentation currently in README.md to a real manual using Documenter.jl. Apart from looking better, it would allow listing the provided API online, and running doctests to ensure they still work.
See https://juliadocs.github.io/Documenter.jl/latest/ for instructions.
Currently a SubArray of a CategoricalArray is not recognized as a CategoricalArray.
The consequence is that the code that wants to handle both has to use:
Union{CategoricalArray, SubArray{T, N, <:CategoricalArray}} where {T,N}
which is ugly and actually never used in the code.
The consequence is that methods may work differently (i.e. produce different results) on CategoricalArrays and on their views. The problem was spotted in FreqTables when working with view.
Is there any standard practice for handling this in the current state of the type system?
CC @nalimilan
Assume we have a categorical array c that has a level with value v. To get the CategoricalValue corresponding to this value v, I write:
c.pool[get(c.pool, v)]
but it seems a bit cumbersome and requires accessing pool. Is there a better way to do it? If not, I think it would be good to have one.
The use case is, for example, an ordered categorical array in which we want to filter values greater than some level using the order defined in this array. To do this you have to compare the values in this array to a CategoricalValue that is this specific level.
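The filtering use case can be sketched on a toy (levels, refs) representation, assuming Julia 1.x. The data and names below are assumptions for illustration; the sketch compares integer codes directly and does not use the package's pool API.

```julia
# Toy ordered categorical data: refs index into levels, and the position in levels defines the order.
levels = ["Low", "Medium", "High"]
refs = [1, 3, 2, 3, 1]

# Find the code of the threshold level once.
threshold = findfirst(isequal("Medium"), levels)
threshold === nothing && error("unknown level")

# Keep only the elements whose code is greater than the threshold's code.
above = [levels[r] for r in refs if r > threshold]
```

A public accessor returning the categorical value for a given level would let user code express the same comparison without touching codes.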
I'm in the process of updating Gadfly's code to use the new DataFrames v0.11+ infrastructure and I'm getting hung up on the eltype of CategoricalArrays. It looks like accessing an element always returns a wrapped value? I believe this is different from PooledDataArrays?
CategoricalArrays
julia> a = CategoricalArray([RGBA{Float32}(1.0, 1.0, 1.0, 1.0)])
1-element CategoricalArrays.CategoricalArray{ColorTypes.RGBA{Float32},1,UInt32}:
RGBA{Float32}(1.0f0,1.0f0,1.0f0,1.0f0)
julia> eltype(a)
CategoricalArrays.CategoricalValue{ColorTypes.RGBA{Float32},UInt32}
DataArrays v0.6.2
julia> a = PooledDataArray([RGBA{Float32}(1.0, 1.0, 1.0, 1.0)])
1-element DataArrays.PooledDataArray{ColorTypes.RGBA{Float32},UInt32,1}:
RGBA{Float32}(1.0,1.0,1.0,1.0)
julia> eltype(a)
ColorTypes.RGBA{Float32}
Unfortunately this means that general functions like something(color::Color) won't work with CategoricalArrays any more.
Was the performance/storage efficiency of [Nullable]CategoricalArray serialization checked?
Would it make sense to override serialize()/deserialize() for CategoricalPool (the invindex and especially valindex fields could be reconstructed from index)?
When I load the modules RCall or MixedModels (which both import CategoricalArrays), I get the error
WARNING: Module CategoricalArrays with uuid 102572555498720 is missing from the cache.
This may mean module CategoricalArrays does not support precompilation but is imported by a module that does.
ERROR: LoadError: Declaring __precompile__(false) is not allowed in files that are being precompiled.
in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in eval(::Module, ::Any) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in process_options(::Base.JLOptions) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
while loading /Users/michael/.julia/v0.5/RCall/src/RCall.jl, in expression starting on line 3
ERROR: Failed to precompile RCall to /Users/michael/.julia/lib/v0.5/RCall.ji.
in compilecache(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in require(::Symbol) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
I'm having a hard time figuring out how to construct a CategoricalArray at a "lower level" than just providing a vector of strings + missingness vector.
My use-case is Feather.jl, where the raw file bits provide an actual list of unique levels as strings, a vector of Ints of the actual elements, and an indication of whether the levels are ordered or not. Given these three items, is there some kind of intermediate constructor I could use? Do I need to manually build a Pool first and then use that to build my Array? Any pointers here would be most appreciated.
@nalimilan appveyor.yml and .travis.yml need updating for Julia 1.0 (in particular, AppVeyor CI fails on Julia 1.0 in the current configuration), but I was not sure which versions of Julia you want to test against, so I am opening an issue.
I think the conversion from DataArrays to CategoricalArrays would be made easier if functions like levels were defined in a "Base" package of some sort and the individual packages defined methods for those functions. The base package doesn't have to be StatsBase - it could be something like FactorBase that is specific to this data type.
At least my experience is that I end up defining AbstractFactor types and functions like
@compat const AbstractFactor{V,R} = Union{NullableCategoricalVector{V,R},CategoricalVector{V,R},PooledDataVector{V,R}}
"""
asfactor(f)
Return `f` as a AbstractFactor.
This function and the `AbstractFactor` union can be removed once `CategoricalArrays` replace
`PooledDataArray`
"""
asfactor(f::AbstractFactor) = f
asfactor(f) = pool(f)
"""
levs(A::ReMat)
Return the levels of the grouping factor.
This is to disambiguate a call to `levels` as both `DataArrays`
and `CategoricalArrays` export it.
"""
function levs(A::ReMat)
f = A.f
isa(f, PooledDataArray) ? DataArrays.levels(f) : CategoricalArrays.levels(f)
end
If I have missed a better way of working around the same name being defined in different packages please let me know.
See JuliaData/DataFrames.jl#1309 (comment) for an example.
The problem is that the lines
https://github.com/JuliaData/CategoricalArrays.jl/blob/master/src/array.jl#L269
and
https://github.com/JuliaData/CategoricalArrays.jl/blob/master/src/array.jl#L270
do not guarantee that a copy is performed (sometimes convert returns a reference to the original object).
CC @nalimilan
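The aliasing behaviour of convert can be shown with plain Base arrays. This is a small illustration using ordinary vectors rather than the package's types.

```julia
v = [1, 2, 3]

# When the value already has the target type, convert returns the very same object.
same_object = convert(Vector{Int}, v) === v

# When the element type changes, a new array is allocated.
converted = convert(Vector{Float64}, v)

# An explicit copy guarantees independence in both cases.
independent = copy(convert(Vector{Int}, v))
independent[1] = 100    # does not affect v
```

Code that needs a guaranteed copy therefore has to call copy itself, or check whether the converted result is the input.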
This works:
julia> recode([1,2,3], [1,2]=>100)
3-element Array{Int64,1}:
100
100
3
and this works:
julia> recode(categorical([1,2,3,missing]), [1,2]=>100)
4-element CategoricalArrays.CategoricalArray{Union{Int64, Missings.Missing},1,UInt32}:
100
100
3
missing
but this fails:
julia> recode([1,2,3,missing], [1,2]=>100)
ERROR: TypeError: non-boolean (Missings.Missing) used in boolean context
Stacktrace:
[1] any(::Base.##136#137{Missings.Missing}, ::Array{Int64,1}) at .\reduce.jl:574
[2] in(::Missings.Missing, ::Array{Int64,1}) at .\reduce.jl:631
[3] recode!(::Array{Union{Int64, Missings.Missing},1}, ::Array{Union{Int64, Missings.Missing},1}, ::Void, ::Pair{Array{Int64,1},Int64}, ::Vararg{Pair{Array{Int64,1},Int64},N} where N) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\recode.jl:39
[4] recode(::Array{Union{Int64, Missings.Missing},1}, ::Void, ::Pair{Array{Int64,1},Int64}, ::Vararg{Pair{Array{Int64,1},Int64},N} where N) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\recode.jl:332
[5] recode(::Array{Union{Int64, Missings.Missing},1}, ::Pair{Array{Int64,1},Int64}) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\recode.jl:317
Probably line:
https://github.com/JuliaData/CategoricalArrays.jl/blob/master/src/recode.jl#L39
should be fixed.
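One way to guard the membership test against missing values is to check ismissing first. The sketch assumes Julia 1.x, where missing is part of Base; toy_recode is a hypothetical helper and not the package's recode.

```julia
# Hypothetical helper: replace every element found in the pair's first component.
function toy_recode(src, pair)
    olds, replacement = pair
    # Test ismissing first: `missing in olds` is not a Bool and cannot be used as a condition.
    return [(!ismissing(x) && x in olds) ? replacement : x for x in src]
end

out = toy_recode([1, 2, 3, missing], [1, 2] => 100)
```

Missing entries pass through unchanged here; whether a recode rule should be able to target missing explicitly is a separate design choice.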
I was trying to run
using DataFrames, CategoricalArrays
DataFrame(NullableCategoricalArray(["nihao","haha"]))
but it gives an error
MethodError: Cannot `convert` an object of type CategoricalArrays.NullableCategoricalArray{String,1,UInt32} to an object of type DataFrames.DataFrame
This may have arisen from a call to the constructor DataFrames.DataFrame(...),
since type constructors fall back to convert methods.
in DataFrames.DataFrame at base\sysimg.jl:24
so I am just wondering how to make a DataFrame out of CategoricalArrays. I am quite new to Julia, so I am not sure how to fix it myself just yet.
IIUC currently categorical() doesn't support specifying levels.
Would it be possible to support a levels= keyword parameter?
The use case is e.g. when "normalizing" imported data with known hard-coded category values.
That would save one levels!() call and make the user code a little bit less error-prone.
A somewhat related use case is making the categorical/plain values column A in one data frame match the categorical column B in another (i.e. the same levels encoding).
Would it be possible to do this with something like similar(A, B)?
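What a shared levels encoding would provide can be sketched with plain vectors. encode_with_levels is a hypothetical helper for illustration, not an existing or proposed function of the package.

```julia
# Hypothetical helper: encode values as integer codes against a fixed, known list of levels.
function encode_with_levels(values, levels)
    pos = Dict(l => i for (i, l) in enumerate(levels))
    # pos[v] throws a KeyError for a value outside the known levels.
    return [pos[v] for v in values]
end

levels = ["low", "medium", "high"]
codes_a = encode_with_levels(["high", "low"], levels)
codes_b = encode_with_levels(["low", "medium", "low"], levels)
```

Because both columns are encoded against the same list, equal codes mean equal categories across the two data sets, and unexpected values in imported data fail loudly.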
PooledDataArrays have this safety feature. We should probably add it too, since it has a very low performance cost.
map with CategoricalArray currently returns a CategoricalArray{Any} when the function does not return a categorical value.
For example:
map(get, categorical(1:2))
Base.collect_to!(categorical([0.0, 0.0]), (1.0 for v in 1:3), 2, 2)
This appears to be due to the fact that similar(::CategoricalArray, T) returns a CategoricalArray{T}, whose element type is not T but CategoricalValue{T} or CategoricalString. So when collect_to! calls promote_typejoin(T, typeof(el)), it tries to promote CategoricalValue{T} with typeof(el) instead of doing the more appropriate CategoricalValue{promote_typejoin(valtype(T), typeof(el))}, and eventually chooses Any.
As an exception, these do not return a CategoricalArray{Any}:
map(string, categorical([1]))
Base.collect_to!(categorical(["", ""]), ("a" for v in 1:2), 2, 2)
AFAICT this is due to the fact that the CategoricalArray constructor uses CategoricalString if the array has an AbstractString element type. Since promote_type(CategoricalString, String) gives AbstractString, the resulting array will be CategoricalArray{String}.
I can see three ways of fixing this:
1. Change similar(::CategoricalArray, T) to return an Array{T} when T is not a categorical value. This is more correct regarding the AbstractArray interface. But it could be annoying in some cases where a CategoricalArray is more natural/efficient. Or maybe not (concrete examples are needed).
2. Override promote_typejoin(S::Type{<:CategoricalValue}, T) so that it returns CategoricalValue{promote_typejoin(valtype(S), T)}.
3. Override map. This might be needed anyway to take advantage of the possibility of calling the function only on the levels for efficiency. But the current behavior of similar could be problematic elsewhere.
Whichever solution we choose, it seems that map should continue to return a CategoricalArray. Indeed, map generally preserves the container type (e.g. for tuples), CategoricalArray is more efficient for repeated data, and array comprehensions are available to create arrays when explicitly needed. broadcast should probably also preserve the type (which it currently doesn't).
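To make the "call the function only on the levels" idea concrete, here is a minimal sketch on a toy pooled layout. The variables pool, refs, and ncalls are hypothetical, and nothing here is taken from the package's internals.

```julia
# Toy pooled layout: `pool` holds the distinct levels, `refs` indexes into it.
pool = ["a", "b", "c"]
refs = UInt32[1, 3, 3, 2, 1]

# Count how often the mapped function actually runs.
ncalls = Ref(0)
f(s) = (ncalls[] += 1; uppercase(s))

# Apply f once per level instead of once per element...
new_pool = map(f, pool)

# ...then expand back to full length by indexing with the unchanged codes.
result = new_pool[refs]
```

With five elements and three levels, f runs three times. One caveat of this shortcut: if f sends two levels to the same value, new_pool contains duplicates and would need to be merged before it can serve as a proper pool.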
The specialized copy! method added by #37 does not check for null values when copying from a NullableCategoricalArray to a CategoricalArray. This currently creates #undef entries, but it would make more sense to raise an error during "conversion".
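The kind of check being asked for could look roughly like the sketch below. It is written against plain vectors in current Julia, with missing standing in for the null values discussed here. checked_copy! is a hypothetical helper name, not an existing method in the package.

```julia
# Hypothetical helper: copy src into dest, refusing to proceed if src holds
# any missing value, so that dest never ends up with undefined entries.
function checked_copy!(dest::AbstractVector, src::AbstractVector)
    length(dest) >= length(src) ||
        throw(ArgumentError("destination is shorter than source"))
    any(ismissing, src) &&
        throw(ArgumentError("cannot copy missing values into an array that does not allow them"))
    # zip stops at the shorter iterator, which is src after the length check.
    for (i, v) in zip(eachindex(dest), src)
        dest[i] = v
    end
    return dest
end

dest = zeros(Int, 3)
checked_copy!(dest, Union{Int,Missing}[1, 2, 3])
```

The scan with any(ismissing, src) is a single pass over the source, so the check adds little cost compared with the copy itself.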
Looks like we need to override the default method (and maybe to improve that method in Base too):
julia> showcompact(CategoricalArray(["a", "b"]))
CategoricalArrays.CategoricalValue{String,UInt32}["a","b"]
> using CategoricalArrays
> x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
> @show x[1] < x[2]
> @show x[1] > x[2];
x[1] < x[2] = true
x[1] > x[2] = CategoricalArrays.CategoricalString{UInt32}["Old", "Young", "Middle", "Young"]
I am consistently getting segfaults under v0.4.6 when testing the release version or the master branch. On the master branch of CategoricalArrays I get a series of warnings about overwriting convert methods:
julia> Pkg.test("CategoricalArrays")
INFO: Computing test dependencies for CategoricalArrays...
INFO: No packages to install, update or remove
INFO: Testing CategoricalArrays
INFO: Recompiling stale cache file /home/bates/.julia/lib/v0.4/CategoricalArrays.ji for module CategoricalArrays.
WARNING: Method definition convert(Type{Array{#T<:Any, N<:Any}}, AbstractArray{#S<:Any, #n<:Any}) in module Base at array.jl:240 overwritten in module CategoricalArrays at /home/bates/.julia/v0.4/CategoricalArrays/src/CategoricalArrays.jl:18.
WARNING: Method definition convert(Type{Array{#T<:Any, #n<:Any}}, AbstractArray{#S<:Any, #n<:Any}) in module Base at array.jl:241 overwritten in module CategoricalArrays at /home/bates/.julia/v0.4/CategoricalArrays/src/CategoricalArrays.jl:19.
WARNING: Method definition convert(Type{Base.Nullable}, #T<:Any) in module Base at nullable.jl:19 overwritten in module CategoricalArrays at /home/bates/.julia/v0.4/CategoricalArrays/src/CategoricalArrays.jl:21.
WARNING: New definition
convert(Type{CategoricalArrays.NominalPool{#S<:Any, #R<:Any, V<:Any}}, CategoricalArrays.NominalPool) at /home/bates/.julia/v0.4/CategoricalArrays/src/pool.jl:54
is ambiguous with:
convert(Type{CategoricalArrays.NominalPool{#T<:Any, #R<:Any, V<:Any}}, CategoricalArrays.NominalPool{#T<:Any, #R<:Any, V<:Any}) at /home/bates/.julia/v0.4/CategoricalArrays/src/pool.jl:51.
To fix, define
convert(Type{CategoricalArrays.NominalPool{#S<:Any, #R<:Any, V<:Any}}, CategoricalArrays.NominalPool{#S<:Any, _<:Integer, V<:Any})
before the new definition.
culminating in
WARNING: New definition
convert(Type{CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}}, CategoricalArrays.NullableOrdinalArray) at /home/bates/.julia/v0.4/CategoricalArrays/src/array.jl:92
is ambiguous with:
convert(Type{CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}}, CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}) at /home/bates/.julia/v0.4/CategoricalArrays/src/array.jl:59.
To fix, define
convert(Type{CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, #R<:Any}}, CategoricalArrays.NullableOrdinalArray{#T<:Any, #N<:Any, _<:Integer})
before the new definition.
signal (11): Segmentation fault
unknown function (ip: 0x7ff5d2f3cf76)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3ddeb)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3ddeb)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3ddeb)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3d0d5)
unknown function (ip: 0x7ff5d2f3d0d5)
Although CategoricalArrays is the package that triggers this segfault, it looks like it is a problem in Julia itself.
It would be nice to have padnull(), anynull(), etc. methods for NullableCategoricalArray, and also to maintain API parity with NullableArrays in the long term. There are 3 possibilities:
NullableCategoricalArray (the type definition still stays in CategoricalArrays),
NullableCategoricalArrays.jl that requires both NullableArrays and CategoricalArrays. That way the requirements could be made really fine-grained.
To me the second option makes more sense:
CategoricalArrays is a smaller package that has a minimal number of dependencies
Currently in DataStreams, I'm ironing out an "api" of sorts for columns, i.e. the things that actually store data in a Source or Sink. They're mainly interacted with through the Data.getfield and Data.getcolumn methods, but they also need to support a few basic methods. Currently this includes:
push!(A, item)
append!(A, B)
setindex!(A, item, i)
allocate{T}(::Type{T}, rows, ref)
Most of these are pretty basic, though allocate is the funny one. allocate takes a scalar type (i.e. T, Nullable{T}, NominalValue{S, R}), a number of rows, and a potential ref or parent Vector{UInt8} and allocates a new column vector of some kind that the data will be streamed to.
I'm happy to have the allocate methods for CategoricalArrays live in DataStreams for now (while things settle down), with the idea that they could eventually move to the various packages (NullableArrays, DataFrames, CategoricalArrays, etc.).
The two we seem to be missing, however, are push! and append! for the various CategoricalArray types. Happy to take a crack at it, but wanted to post it here first.
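As a starting point for discussion, here is a rough sketch of what push! and append! have to do on a pooled representation. It uses a toy pool-plus-codes pair rather than the actual CategoricalArray types. pooled_push! and pooled_append! are hypothetical names, and the sketch ignores ordered pools and null values.

```julia
# Hypothetical helper: append one item to a toy pooled column.
# `pool` holds the distinct levels, `refs` holds one integer code per element.
function pooled_push!(pool::Vector{T}, refs::Vector{UInt32}, item) where {T}
    v = convert(T, item)
    idx = findfirst(isequal(v), pool)
    if idx === nothing
        # Unseen value: extend the pool and use the new last position as its code.
        push!(pool, v)
        idx = length(pool)
    end
    push!(refs, UInt32(idx))
    return refs
end

# Hypothetical helper: append several items by pushing them one at a time.
function pooled_append!(pool::Vector, refs::Vector{UInt32}, items)
    for item in items
        pooled_push!(pool, refs, item)
    end
    return refs
end

pool = String[]
refs = UInt32[]
pooled_push!(pool, refs, "a")
pooled_append!(pool, refs, ["b", "a", "c"])
```

The linear findfirst lookup keeps the sketch short. A real implementation would more likely keep a dictionary from level to code so that each push stays cheap as the pool grows.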
Perhaps this issue belongs in Missings, but:
WARNING: both Missings and Base export "Missing"; uses of it in module CategoricalArrays must be qualified
ERROR: LoadError: LoadError: UndefVarError: Missing not defined
Stacktrace:
[1] top-level scope
[2] include at ./boot.jl:279 [inlined]
[3] include_relative(::Module, ::String) at ./loading.jl:509
[4] include at ./sysimg.jl:15 [inlined]
[5] include(::String) at /home/travis/.julia/v0.7/CategoricalArrays/src/CategoricalArrays.jl:2
[6] top-level scope
[7] include at ./boot.jl:279 [inlined]
[8] include_relative(::Module, ::String) at ./loading.jl:509
[9] include(::Module, ::String) at ./sysimg.jl:15
[10] top-level scope
[11] eval at ./boot.jl:282 [inlined]
[12] top-level scope at ./<missing>:2
in expression starting at /home/travis/.julia/v0.7/CategoricalArrays/src/value.jl:25
in expression starting at /home/travis/.julia/v0.7/CategoricalArrays/src/CategoricalArrays.jl:23
ERROR: LoadError: Failed to precompile CategoricalArrays to /home/travis/.julia/lib/v0.7/CategoricalArrays.ji.