Hi everyone,

I am a linguistics student who has been tasked with doing some text classification, so please bear with me on this (math is not my strong suit). I have a dataset that comprises both numeric and nominal/string attributes. A simple example of the dataset is given below:

Attributes:
attribute1: Header {header1, header2, header3, header4} <fixed string values, of which at most one will be present in an instance>
attribute2: ProbabilityA {0.0 - 1.0} <numeric value where closer to 1.0 means more likely it is in classA>
attribute3: ProbabilityB {0.0 - 1.0} <numeric value where closer to 1.0 means more likely it is in classB>
attribute4: ProbabilityC {0.0 - 1.0} <numeric value where closer to 1.0 means more likely it is in classC>
attribute5: memberDistA {0.0 - 1.0} <numeric value where closer to 1.0 means more likely it is in classA>
attribute6: memberDistB {0.0 - 1.0} <numeric value where closer to 1.0 means more likely it is in classB>
attribute7: memberDistC {0.0 - 1.0} <numeric value where closer to 1.0 means more likely it is in classC>

Class: A, B, C

Sample training data:

classA: header4, 0.4416, 0.1235, 0.3271, 0.6726, 0.4572, 0.2018
classB: header2, 0.1333, 0.3666, 0.2876, 0.0128, 0.5904, 0.3327

1. Is there a classification or clustering algorithm that can handle both types of attributes? From what I have searched and read so far, there aren't many algorithms that can handle both numeric and nominal/string data.

2. Would it be easier to convert all the attributes to numeric (or all to nominal) without hurting classification performance?

3. Does it make sense to merge the numeric attributes that refer to the same class (e.g. ProbabilityA and memberDistA) into one final numeric value that represents the likelihood of belonging to that class?
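
To make questions 2 and 3 concrete, here is a rough sketch in Python of what I have in mind. It is only an illustration: the function names are made up by me, the one-hot encoding of Header is just one possible way to turn it into numbers, and the equal-weight average used for merging is my own assumption, not something I know to be correct.

```python
# Nominal values of the Header attribute, as listed above.
HEADERS = ["header1", "header2", "header3", "header4"]


def one_hot_header(header):
    """Question 2: encode Header as four 0/1 values.

    A missing header (None) becomes all zeros.
    """
    return [1.0 if header == h else 0.0 for h in HEADERS]


def merge_scores(probabilities, member_dists):
    """Question 3: merge ProbabilityX and memberDistX per class.

    Assumption: a plain average with equal weights.
    """
    return [(p + m) / 2 for p, m in zip(probabilities, member_dists)]


def to_numeric(header, probabilities, member_dists):
    """Build an all-numeric feature vector for one instance."""
    return one_hot_header(header) + merge_scores(probabilities, member_dists)


# The classA sample instance from above.
row = to_numeric(
    "header4",
    [0.4416, 0.1235, 0.3271],  # ProbabilityA, ProbabilityB, ProbabilityC
    [0.6726, 0.4572, 0.2018],  # memberDistA, memberDistB, memberDistC
)
print(row)
```

With this, each instance would become seven numbers: four for the header and one merged score per class. I am not sure whether averaging throws away useful information, which is part of what I am asking.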

Sorry if this question is too complicated to answer. Thanks!