I have 22301 rows of data involving 19 variables: a special variable X and 18 others, A to R. X is discrete and was expressed quantitatively by assigning a score to each category. Across the 22301 rows, X is near-correctly classified in some and somewhat incorrectly classified in a few.
My aim is to identify and analyse the existence or absence of links between X and the other variables. Given the nature of the data, I tried two approaches:
1. Since correlation statistics are supposed to be extremely sensitive to outliers, I divided the 22301-row data set into 20 groups and calculated the average values of A to R for each of these 20 groups. I then used the 20 group averages for each variable to calculate the correlation and covariance statistics, in the hope that outliers would be smoothed out by the averaging process.
2. In the other approach, I calculated correlation and covariance stats using all 22301 data rows individually.
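To make the two procedures concrete, here is a minimal sketch in Python with NumPy. It runs on synthetic data, not my real data, and the names (`row_level_stats`, `group_average_stats`, `n_groups`) are placeholders chosen for illustration. It assumes the 20 groups are consecutive blocks of rows, that X is averaged per group in the same way as the other variables, and that covariance is the population version (as a spreadsheet COVAR function computes it); any of these assumptions may differ from what was actually done.

```python
import numpy as np

def row_level_stats(x, y):
    """Method 2: correlation and covariance over all rows individually."""
    corr = np.corrcoef(x, y)[0, 1]
    # ddof=0 gives the population covariance (assumed convention)
    cov = np.cov(x, y, ddof=0)[0, 1]
    return corr, cov

def group_average_stats(x, y, n_groups=20):
    """Method 1: average within groups, then correlate the group means."""
    # Assumption: groups are consecutive blocks of rows of near-equal size
    x_means = np.array([g.mean() for g in np.array_split(x, n_groups)])
    y_means = np.array([g.mean() for g in np.array_split(y, n_groups)])
    return row_level_stats(x_means, y_means)

# Synthetic stand-in data: a scored categorical X and one noisy variable
rng = np.random.default_rng(0)
n_rows = 22301
x = rng.integers(1, 6, size=n_rows).astype(float)
a = 2.0 * x + rng.normal(0.0, 5.0, size=n_rows)

print("row level     :", row_level_stats(x, a))
print("group averages:", group_average_stats(x, a))
```

Note that the grouped version works with only 20 data points per variable, so its results depend on how the rows are assigned to groups.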
Here are the results from the two methods (the 18 columns are the correlation/covariance of X against A to R, respectively):
Method 1 results:
CORREL 0.76 0.69 0.35 0.58 0.1 0.66 -0.1 -0.13 0.41 0.34 -0.28 0.28 -0.29 0.12 -0.14 -0.45 0.2 -0.1
COVAR 67.17 44.24 17.19 31.78 5.67 48.64 -1.6 -0.78 25.11 11.32 -8.52 15.22 -11.3 5.16 -4.35 -0.54 3.62 -52.49
Method 2 results:
CORREL 0.55 0.52 0.33 0.46 0.18 0.52 0.34 -0.04 0.22 0.17 -0.08 0.17 -0.15 -0.01 0.27 -0.09 -0.23 -0.07
COVAR 57.66 48.94 29.08 31.56 19.36 49.69 6.31 -0.67 17.81 11.19 -5.64 13.46 -17.97 -0.35 10.01 -0.22 -5.81 -208.48
Now my question: from this data, can I infer some connection (not necessarily causal) between any of the variables A to R and the special variable X? Is it correct to treat correlations above 0.5 in absolute value as significant? If so, I can take variables A, B, D and F to be indicators of X. Further, my understanding so far leads me to believe that larger changes in A indicate smaller changes in X because of the high covariance value, while B, D and F are closer to "moving with" X. Am I on the right track, or am I totally off and missing something important?
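For reference, here is a rough sketch of two checks that seem relevant to these questions. As far as I understand, whether a correlation counts as statistically significant depends on the sample size as well as its magnitude, and covariance depends on the units of the variables, with correlation being covariance divided by the product of the two standard deviations. The sketch uses the Fisher z approximation for testing a Pearson correlation against zero, which assumes roughly bivariate-normal data (a scored categorical X may violate this). The function names `fisher_z_p_value` and `correlation_from_covariance` are hypothetical, and the sample inputs are a few values taken from the tables above.

```python
import math

def fisher_z_p_value(r, n):
    """Approximate two-sided p-value for H0: true correlation is zero.

    Uses the Fisher z transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal under H0. Requires n > 3 and |r| < 1.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2.0))

def correlation_from_covariance(cov, sd_x, sd_y):
    """Correlation is covariance rescaled by both standard deviations."""
    return cov / (sd_x * sd_y)

# A few correlations from the tables above, with each method's sample size
samples = [
    ("Method 1, A", 0.76, 20),
    ("Method 1, E", 0.10, 20),
    ("Method 2, A", 0.55, 22301),
    ("Method 2, R", -0.07, 22301),
]
for label, r, n in samples:
    print(label, "r =", r, "n =", n, "p ~", fisher_z_p_value(r, n))
```

Under this approximation the same value of r can be far from significant with 20 points and highly significant with 22301, so a fixed 0.5 cutoff does not by itself separate significant from non-significant correlations.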
Also, is my approach of averaging data rows to smooth out outliers valid, or does it introduce problems?
(Apologies for the lengthy post)