# Thread: Help with Modeling and Data variables

1. ## Help with Modeling and Data variables

I have a few questions:

Does anyone use SAS?

Either way, can anyone think of a logical way to identify whether a variable is nominal, ordinal or continuous? I will have to code this up. At the moment I'm simply thinking of counting the number of distinct values within the variable, i.e. out of a sample of 10,000, if I have 3,000 distinct values it's obviously continuous; if I only have 5 then it's probably ordinal or nominal. It's a bit of an assumption. Can anyone suggest a better method?
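The distinct-count idea described above can be sketched roughly as follows. This is only a minimal illustration in Python rather than a SAS macro; the function name `guess_variable_type` and both thresholds (`max_categories`, `min_distinct_ratio`) are assumptions made up for the example, not established cut-offs. Counting distinct values cannot tell ordinal from nominal, so the sketch lumps those two together.

```python
def guess_variable_type(values, max_categories=20, min_distinct_ratio=0.05):
    """Guess a variable's type from its distinct values.

    Both thresholds are illustrative assumptions, not established rules.
    Returns 'continuous', 'ordinal_or_nominal', 'nominal' or 'unknown'.
    """
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return "unknown"

    distinct = set(non_missing)

    # Non-numeric codes carry no usable order on their own.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if not is_numeric:
        return "nominal"

    # Many distinct numeric values relative to the sample size: treat as continuous.
    ratio = len(distinct) / len(non_missing)
    if len(distinct) > max_categories and ratio >= min_distinct_ratio:
        return "continuous"

    # Few distinct numbers: ordinal vs nominal needs a data dictionary or a human.
    return "ordinal_or_nominal"


# The two cases from the post: 3,000 vs 5 distinct values in 10,000 rows.
print(guess_variable_type([i % 3000 for i in range(10_000)]))
print(guess_variable_type([i % 5 for i in range(10_000)]))
```

A rule like this will misfire on things such as numeric ID codes or postcodes, so it is probably best used to flag candidates for review rather than as a final answer.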

Also, how many categories is too many for ordinal data? Do you think 20 distinct values is too many? Is 2 too few? What works best for predictive modelling?

If I have a sample of 20,000 observations with an even 50/50 split of my target variable, what calculation should I be using to decide the optimal split for nominal data? E.g. the first 10% of the sample is 90% targets, the first 20% is 75% targets, the first 30% is 50% targets. Is there a formula I should be using that gives a decent sample size with a decent ratio of my target?

I will be back with more questions, I'm sure. I know SAS can do all this without me knowing, but I want to learn why it's doing what it does, which also gives me more control over the model.
Thank you.

2. In reply to RhysGM:
Assuming this is a survey of some kind, the clue to the nature of the variable is usually in the question.

CB

3. In reply to CaptainBlack:
No, this is behavioural, transactional data, and I have 300 variables. Obviously I could go through each by eye, but I would rather not, as the variables I'll be using will be different each time. I would like to set up some kind of macro to do this automatically. I can use the method I described above, as it'll probably be correct most of the time.

However, when I set up my data I can group continuous items into bands. Should I be going for just high and low, or a spread of bands, say ten or so?
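For the banding question, one option that sits between a plain high/low split and hand-picked ranges is equal-frequency bands (deciles, for example), where each band holds roughly the same number of observations. A minimal sketch, assuming Python's standard `statistics` and `bisect` modules; the helper names `band_edges` and `assign_band` are hypothetical, and ten bands is simply the figure mentioned above, not a recommendation.

```python
import bisect
import statistics
from collections import Counter


def band_edges(values, n_bands=10):
    """Return the n_bands - 1 cut points that split values into equal-frequency bands."""
    return statistics.quantiles(values, n=n_bands)


def assign_band(value, edges):
    """Return the 0-based band index for value, given sorted cut points."""
    return bisect.bisect_right(edges, value)


# Made-up continuous variable: the integers 1..1000.
values = list(range(1, 1001))
edges = band_edges(values, n_bands=10)

# Count how many observations land in each band.
band_counts = Counter(assign_band(v, edges) for v in values)
print(sorted(band_counts.items()))
```

With heavily tied or skewed data some cut points can coincide, which leaves fewer usable bands than requested, so it is worth checking the band counts after binning.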

Does anyone do any predictive modeling on here? Decision trees, regression?

EDIT:
What statistical test should I be using for the optimal cut off for nominal data?

4. Sorry to double post, but I have produced an example:

I have an overall sample of 40,000 observations, 20,000 of which are my target. With the data sorted by one of the nominal variables, over 90% of the first 5% (2,000 observations) are targets. The simplified table below (sorry for my poor attempt at drawing a table) shows the rest of the data, cumulatively. Obviously I have all these numbers for every percentage point.

What is the best cut-off, and how should it be calculated? I want two segments: one with a high probability of containing my targets and one with a low probability of containing my targets.

| % of overall sample | # of targets in sample | % of sample that is target |
|---|---|---|
| 5% | 1,802 | 90.10% |
| 10% | 3,616 | 90.40% |
| 15% | 5,381 | 89.68% |
| 20% | 7,058 | 88.23% |
| 25% | 8,617 | 86.17% |
| 30% | 10,091 | 84.09% |
| 35% | 11,400 | 81.43% |
| 40% | 12,571 | 78.57% |
| 45% | 13,684 | 76.02% |
| 50% | 15,257 | 76.29% |
| 55% | 15,442 | 70.19% |
| 60% | 16,138 | 67.24% |
| 65% | 17,040 | 65.54% |
| 70% | 17,325 | 61.88% |
| 75% | 17,834 | 59.45% |
| 80% | 18,379 | 57.43% |
| 85% | 18,777 | 55.23% |
| 90% | 19,129 | 53.14% |
| 95% | 19,514 | 51.35% |
| 100% | 20,000 | 50.00% |
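One common way to compare cut-offs on a cumulative table like this is a Kolmogorov-Smirnov (KS) style gap: at each cut, take the share of all targets captured so far minus the share of all non-targets captured so far, and prefer the cut where that gap is largest. This is a technique brought in for illustration, not something proposed in the post, and it is only one of several possible criteria. The sketch below is in Python, copies the counts from the table, and assumes the totals stated above (40,000 observations, 20,000 targets); the function name `separation_by_cutoff` is hypothetical.

```python
TOTAL_OBS = 40_000
TOTAL_TARGETS = 20_000
TOTAL_NON_TARGETS = TOTAL_OBS - TOTAL_TARGETS

# (cumulative % of sample, cumulative number of targets), copied from the table.
CUMULATIVE = [
    (5, 1802), (10, 3616), (15, 5381), (20, 7058), (25, 8617),
    (30, 10091), (35, 11400), (40, 12571), (45, 13684), (50, 15257),
    (55, 15442), (60, 16138), (65, 17040), (70, 17325), (75, 17834),
    (80, 18379), (85, 18777), (90, 19129), (95, 19514), (100, 20000),
]


def separation_by_cutoff(rows):
    """For each cut, return (pct, target rate within the cut, KS-style gap)."""
    results = []
    for pct, targets in rows:
        n_obs = TOTAL_OBS * pct // 100               # observations above the cut
        non_targets = n_obs - targets
        target_rate = targets / n_obs                # third column of the table
        captured = targets / TOTAL_TARGETS           # share of all targets captured
        leaked = non_targets / TOTAL_NON_TARGETS     # share of all non-targets captured
        results.append((pct, target_rate, captured - leaked))
    return results


results = separation_by_cutoff(CUMULATIVE)
best_pct, best_rate, best_gap = max(results, key=lambda row: row[2])
print(f"largest gap at {best_pct}%: target rate {best_rate:.2%}, gap {best_gap:.4f}")
```

On the figures as posted, the largest gap falls at the 50% row. That row's target count rises much more than its neighbours do, so it may be worth checking the underlying numbers before relying on it.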

5. I don't mean to triple post; this will be my last one, I promise. Can anyone advise where I should be looking to work this out, or specifically what I should be googling?