class for Factor or CatData
Create a class to make working with labels for groups and dummy variables easier. It should produce an exog matrix to be used in regression, descriptive statistics, and ANOVA.
Blueprint information
- Status: Not started
- Approver: None
- Priority: Undefined
- Drafter: None
- Direction: Needs approval
- Assignee: None
- Definition: New
- Series goal: None
- Implementation: Unknown
- Milestone target: None
Whiteboard
What I wanted to do with the class for catdata or factor:
Two representations: a label array (either n by 1 or 1d) and a dummy matrix (n by k).
Labels are converted to integers starting from 0, and the "original" labels, e.g. names of states, are kept around.
Allow conversion between both representations.
Define two simple operations, add (+) and multiply (*), and maybe power (**). I'm not sure about subtract, but not at first.
x1 = Factor(
x2 = Factor(
x = x1 + x2 + x1*x2, similar to a formula, but I never looked at those carefully enough
or x = (1 + x1)*(1 + x2)
There are more operations in SAS, e.g. for nesting, but this would be the basic start.
But no fancy dicts, namespaces, and so on, just a simple class with datadummy and datalabel and some meta-information.
That's roughly what I wanted to do to make working with dummy variables or categorical data a bit easier.
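A minimal sketch of the two representations and the conversion between them might look like the following. Everything here is hypothetical: the class layout, the names `codes`, `levels`, `dummy` and `from_dummy`, and the choice to implement `*` as "one level per observed label combination" are assumptions for illustration, not an existing API. The `+` operation is left out of the class; the usage lines simply stack the dummy matrices column-wise.

```python
import numpy as np


class Factor:
    """Hypothetical sketch of a categorical variable: integer codes plus labels."""

    def __init__(self, values):
        values = np.asarray(values).ravel()  # accept (n, 1) or 1d label arrays
        # levels: sorted unique original labels; codes: integers starting at 0
        levels, codes = np.unique(values, return_inverse=True)
        self.levels = levels
        self.codes = codes.ravel()

    @property
    def labels(self):
        """Original labels, recovered from the integer codes."""
        return self.levels[self.codes]

    def dummy(self):
        """Return the (n, k) dummy matrix, one column per level."""
        k = len(self.levels)
        return (self.codes[:, None] == np.arange(k)).astype(float)

    @classmethod
    def from_dummy(cls, dummy, levels):
        """Rebuild a Factor from a dummy matrix and its column labels."""
        codes = np.asarray(dummy).argmax(axis=1)
        return cls(np.asarray(levels)[codes])

    def __mul__(self, other):
        """Interaction: one level per observed combination of labels."""
        pairs = [f"{a}:{b}" for a, b in zip(self.labels, other.labels)]
        return Factor(pairs)


# Made-up example data
state = Factor(["NY", "CA", "NY", "TX"])
sex = Factor(["f", "m", "m", "f"])

roundtrip = Factor.from_dummy(state.dummy(), state.levels)
inter = state * sex

# "x1 + x2 + x1*x2" as plain column stacking of the dummy matrices
exog = np.column_stack([state.dummy(), sex.dummy(), inter.dummy()])
```

Note that stacking full dummy sets like this gives a rank-deficient matrix, so a real implementation would need some rule for dropping reference columns before using it as exog in a regression.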
In the pivot table ptable_1 and in groupstats and anova, I wanted to do the basic summary statistics and statistical analysis.
On the mailing list, I also recently wrote some recipes for how the label array can be created from histogram and sorted data indices.
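The mailing-list recipes themselves are not reproduced here. As a rough illustration of the idea, and only as an assumption about what such recipes could look like, integer label arrays can be built with plain NumPy either from histogram bin edges or from the sort order of the data (the example arrays are made up):

```python
import numpy as np

# Histogram style: the bin edges define the groups.
x = np.array([0.3, 2.5, 1.1, 0.7, 2.9, 1.8])
edges = np.array([0.0, 1.0, 2.0, 3.0])
labels_hist = np.digitize(x, edges) - 1  # integer labels starting at 0

# Sorted-data style: sort, mark where the value changes,
# take the cumulative sum of the marks, and scatter the result
# back into the original order.
g = np.array(["b", "a", "c", "a", "b", "a"])
order = np.argsort(g, kind="stable")
g_sorted = g[order]
change = np.concatenate(([0], (g_sorted[1:] != g_sorted[:-1]).astype(int)))
labels_sorted = np.empty(len(g), dtype=int)
labels_sorted[order] = np.cumsum(change)

# np.unique gives the same labels directly, plus the original level names.
levels, labels_unique = np.unique(g, return_inverse=True)
```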
With label arrays and np.bincount, it is very fast and easy, for example, to subtract the group means from each observation to remove the fixed effects.
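A small sketch of that group-demeaning step, using made-up data and assuming the labels are integers 0..k-1 with every label occurring at least once (otherwise the division would hit an empty group):

```python
import numpy as np

labels = np.array([0, 1, 0, 2, 1, 0])            # integer group labels
y = np.array([1.0, 4.0, 3.0, 10.0, 6.0, 2.0])    # observations

counts = np.bincount(labels)                      # group sizes
sums = np.bincount(labels, weights=y)             # group sums of y
group_means = sums / counts

# Index the group means by label to broadcast them back to the observations
y_demeaned = y - group_means[labels]
```

The indexing `group_means[labels]` does the same job as multiplying the dummy matrix by the vector of group means, without ever building the (n by k) matrix.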