Math Help - Help with Modeling and Data variables

  1. #1
    Junior Member
    Joined
    Jan 2009
    Posts
    44

    Help with Modeling and Data variables

    I have a few questions:

    Does anyone use SAS?

    Either way, can anyone think of a logical way to identify whether a variable is nominal, ordinal or continuous? I will have to code this up. At the moment I'm simply thinking of counting the number of distinct values within that variable, i.e. out of a sample of 10,000, if I have 3,000 distinct values it's obviously continuous; if I only have 5 then it's probably ordinal or nominal. It's a bit of an assumption. Can anyone suggest a better method?
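
A minimal sketch of the distinct-count idea described above, written in Python rather than SAS. The function name `guess_variable_type` and the `max_categories` threshold of 20 are assumptions made for illustration, not an established rule. Values alone cannot separate ordinal from nominal, so the sketch only guesses.

```python
from typing import Sequence


def guess_variable_type(values: Sequence, max_categories: int = 20) -> str:
    """Guess 'continuous', 'ordinal' or 'nominal' from the values alone.

    max_categories is an assumed threshold, not a standard one.
    """
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    distinct = set(present)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if not numeric:
        # Text codes carry no order that can be detected automatically.
        return "nominal"
    if len(distinct) > max_categories:
        return "continuous"
    # Few distinct numeric values: may be a true ordinal scale or just a
    # numeric code (for example a branch ID); the data alone cannot tell.
    return "ordinal"


print(guess_variable_type(list(range(3000)) * 3))   # many distinct numbers
print(guess_variable_type([1, 2, 3, 4, 5] * 2000))  # five distinct numbers
print(guess_variable_type(["A", "B", "C"] * 100))   # text codes
```

Numeric identifiers with few distinct values would be labelled ordinal here, so a manual override list for known ID-like variables is probably still needed.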

    Also, how many categories is too many for ordinal data? Do you think 20 distinct values is too many? Is 2 too few? What is best for predictive modelling?

    If I have a sample of 20,000 observations with an even 50/50 split of my target variable, what calculation should I be using to decide what the optimal split is for nominal data? For example, 10% of the sample contains 90% of my target variable, 20% contains 75%, 30% contains 50%. Is there a formula I should be using that gives a decent sample size with a decent ratio of my target?

    I will be back with more questions, I'm sure. I know SAS can do all this without me knowing, but I want to learn why it's doing what it does, which also gives me more control over the model.
    Thank you.
    Follow Math Help Forum on Facebook and Google+

  2. #2
    Grand Panjandrum
    Joined
    Nov 2005
    From
    someplace
    Posts
    14,972
    Thanks
    4
    Assuming this is a survey of some kind, the clue to the nature of the variable is usually in the question.

    CB

  3. #3
    Junior Member
    Joined
    Jan 2009
    Posts
    44
    No, this is behavioural, transactional data, and I have 300 data variables. Obviously I could go through each by eye, but I would rather not, as the variables I'll be using will be different each time. I would like to set up some kind of macro to do this automatically. I can use the method I described above, as it'll probably be correct most of the time.

    However, when I set up my data I can group continuous items into bands. Should I be going for just high and low, or a spread of bands, say ten or so?
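
One way to make the banding choice concrete is equal-frequency (quantile) banding, where the number of bands is just a parameter to experiment with. The sketch below is in Python rather than SAS; the helper names `quantile_band_edges` and `assign_band` are hypothetical, and the sample values are made up for illustration.

```python
from bisect import bisect_right


def quantile_band_edges(values, n_bands=10):
    """Return cut points that split the values into roughly equal-sized bands."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    edges = []
    for k in range(1, n_bands):
        edge = ordered[k * len(ordered) // n_bands]
        # Skip repeated cut points caused by heavily tied values.
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges


def assign_band(value, edges):
    """Band 0 is the lowest; a value equal to an edge goes to the higher band."""
    return bisect_right(edges, value)


amounts = list(range(100))                 # made-up continuous variable
ten_bands = quantile_band_edges(amounts, n_bands=10)
two_bands = quantile_band_edges(amounts, n_bands=2)
print(ten_bands)
print(two_bands, assign_band(49, two_bands), assign_band(50, two_bands))
```

With `n_bands=2` this reduces to a high/low split at the median, so the two options in the question are the same technique with different settings.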

    Does anyone on here do any predictive modelling? Decision trees, regression?

    EDIT:
    What statistical test should I be using to find the optimal cut-off for nominal data?
    Last edited by RhysGM; June 28th 2010 at 06:59 AM.

  4. #4
    Junior Member
    Joined
    Jan 2009
    Posts
    44
    Sorry to double post, but I have produced an example:

    I have an overall sample of 40,000 observations, 20,000 of which are my target. With the data sorted by one of the nominal data variables, over 90% of the first 5% (2,000 observations) are targets. The simplified table below (sorry for my poor attempt at drawing a table) shows the rest of the data. Obviously I have all these numbers for every percentage point.

    What is the best cut-off, and how should it be calculated? I want two segments: one with a high probability of containing my targets and the other with a low probability of containing my targets.

    % of Overall Sample | # of Targets in Sample | % of Targets in Sample
    5%                  | 1,802                  | 90.10%
    10%                 | 3,616                  | 90.40%
    15%                 | 5,381                  | 89.68%
    20%                 | 7,058                  | 88.23%
    25%                 | 8,617                  | 86.17%
    30%                 | 10,091                 | 84.09%
    35%                 | 11,400                 | 81.43%
    40%                 | 12,571                 | 78.57%
    45%                 | 13,684                 | 76.02%
    50%                 | 15,257                 | 76.29%
    55%                 | 15,442                 | 70.19%
    60%                 | 16,138                 | 67.24%
    65%                 | 17,040                 | 65.54%
    70%                 | 17,325                 | 61.88%
    75%                 | 17,834                 | 59.45%
    80%                 | 18,379                 | 57.43%
    85%                 | 18,777                 | 55.23%
    90%                 | 19,129                 | 53.14%
    95%                 | 19,514                 | 51.35%
    100%                | 20,000                 | 50.00%
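
One common way to score a two-segment cut-off is the chi-square statistic of the 2x2 table (inside or outside the segment, target or non-target), choosing the cut-off where it is largest. This is the kind of criterion CHAID-style decision trees use, but it is only one option and not necessarily what SAS does internally. The sketch below is in Python and uses the cumulative figures from the table above; the function names are hypothetical.

```python
# Cumulative rows from the table: (% of overall sample, targets found so far).
ROWS = [
    (5, 1802), (10, 3616), (15, 5381), (20, 7058), (25, 8617),
    (30, 10091), (35, 11400), (40, 12571), (45, 13684), (50, 15257),
    (55, 15442), (60, 16138), (65, 17040), (70, 17325), (75, 17834),
    (80, 18379), (85, 18777), (90, 19129), (95, 19514), (100, 20000),
]
TOTAL_N = 40_000        # overall sample size stated in the post
TOTAL_TARGETS = 20_000  # targets in the overall sample


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # one segment is empty, so there is no split to score
    return n * (a * d - b * c) ** 2 / denom


def score_cutoffs(rows, total_n, total_targets):
    """Map each cut-off percentage to the chi-square of the split it creates."""
    scores = {}
    for pct, targets_in in rows:
        n_in = total_n * pct // 100
        a = targets_in                    # targets inside the segment
        b = n_in - targets_in             # non-targets inside the segment
        c = total_targets - targets_in    # targets outside the segment
        d = (total_n - n_in) - c          # non-targets outside the segment
        scores[pct] = chi_square_2x2(a, b, c, d)
    return scores


scores = score_cutoffs(ROWS, TOTAL_N, TOTAL_TARGETS)
best_pct = max(scores, key=scores.get)
print(best_pct, round(scores[best_pct], 1))
```

On the figures as posted, the statistic peaks at the 50% row. With the full table at every percentage point, the same loop would give a finer answer.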

  5. #5
    Junior Member
    Joined
    Jan 2009
    Posts
    44
    I don't mean to triple post; this will be my last one, I promise. Can anyone advise where I should be looking to work this out, or specifically what I should be googling?
