This is a question by WhitleyMahnken.


Hello everyone, I am doing my Ph.D. in Machine Learning Algorithms at Florida University, Gainesville. Before this, I completed my Master's at the University of Central Florida, Orlando. I have recently started my Ph.D. dissertation and need help with some problems.


My thesis topic is simple if you have experience with non-linear SVM implementations using biased neural-network kernels and customized RBF kernels.
Everything is fine if you are allowed to use more examples, but my professor has clearly stated that I may only use 3000 inputs. Forum members who have implemented ML before will know that 3000 examples is very little for reaching 75% correct results. Regularization, generalization, overfitting, and validation all demand a lot of non-poisoned data on top of the data available as inputs. Further, since I am using non-linear inputs, more information is lost as the data is fed to the algorithm. Also remember that I am using RBF and neural-network kernels.
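
To make the setup concrete, here is a minimal sketch of the two kernel families I mean, assuming Python with scikit-learn; the data and the gamma/bias values are placeholders, not my real inputs:

Code:
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def custom_rbf(X, Y, gamma=0.1, bias=1.0):
    # Customized RBF kernel: standard RBF plus a constant bias term.
    # The sum of two positive semi-definite kernels is still a valid kernel.
    return rbf_kernel(X, Y, gamma=gamma) + bias

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))          # placeholder: 3000 inputs, 10 features
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # placeholder non-linear labels

# Customized RBF kernel, passed as a callable that returns the Gram matrix.
rbf_svm = SVC(kernel=custom_rbf)

# "Neural network" (sigmoid/tanh) kernel; coef0 plays the role of the bias.
nn_svm = SVC(kernel="sigmoid", gamma=0.1, coef0=1.0)

for name, clf in [("custom RBF", rbf_svm), ("sigmoid NN", nn_svm)]:
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))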


Since the situation is tough, here are my questions:

  • What solution would you propose?
  • Since multiple techniques have to be combined to handle the performance issues, how do I take care of regularization, generalization, and validation? (See the sketch after this list.)
  • Given the low number of inputs, how do I guard against data snooping?
  • How do I know whether the data is biased or not?
  • How do I satisfy Occam's Razor without touching the inputs?
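
For the second question, the kind of handling I have in mind is nested cross-validation: an inner loop tunes the regularization, and an outer loop validates on folds the tuning never saw. A minimal sketch, assuming scikit-learn, with placeholder data and an arbitrary parameter grid:

Code:
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))          # placeholder data, as above
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Inner loop: tunes the regularization strength C and the kernel width gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# Outer loop: estimates generalization on folds the tuning never touched,
# which is the standard guard against data snooping on a small sample.
scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))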



These are basic questions in machine learning. Low accuracy is not an option here. So far I am stuck at 37% accuracy, but my supervisor has clearly stated that I must reach 75%. This is a tough case, given the inputs described above.


To collect the data for this task, I am working with a military welfare non-profit organization that provides help to retired and injured military personnel. I am fetching data from the organization's website. At the end of the day, I will be given the data as raw HTML, and I have to extract the data of individual pages using some programming language (the visitors of those pages are the ones acting as inputs to the algorithm). The data includes each user's location, time of visit, bounce time on the website, and their comments. I have to feed this information to the algorithm as inputs. Keep in mind that I can't exceed 3000 data points.
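
The extraction step would look roughly like the sketch below, assuming Python with BeautifulSoup; the tag and class names are hypothetical, since I have not seen the organization's real markup yet:

Code:
from bs4 import BeautifulSoup

def parse_visitor(raw_html: str) -> dict:
    # Pull one visitor's raw fields out of an HTML snippet. The CSS classes
    # (.location, .visit-time, .bounce, .comment) are hypothetical
    # placeholders for whatever the real pages use.
    soup = BeautifulSoup(raw_html, "html.parser")
    comments = [c.get_text(strip=True) for c in soup.select(".comment")]
    return {
        "location": soup.select_one(".location").get_text(strip=True),
        "visit_time": soup.select_one(".visit-time").get_text(strip=True),
        "bounce_seconds": float(soup.select_one(".bounce").get_text(strip=True)),
        "comment_lengths": [len(c.split()) for c in comments],
    }

sample = """<div class="visitor">
  <span class="location">Madison, AL</span>
  <span class="visit-time">09:08:07</span>
  <span class="bounce">3723</span>
  <p class="comment">Looking for info on the Redstone Arsenal base</p>
</div>"""
print(parse_visitor(sample))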


Let me explain the situation more:
Suppose a new visitor, Alice, arrives at a Militarybases.co page at 9 hours, 8 minutes, 7 seconds. Alice leaves the site at 9+1 hours, 8+2 minutes, 7+3 seconds. Alice visited 4 pages and left 5 comments, with variable lengths of 20, 33, and 45 words. And remember, the first page Alice visited, "Redstone Arsenal Army Base in Madison, AL | Complete info, reviews, map" on Militarybases.co, is the most important, since the probability of interest is highest for that page. After feeding in this information, we receive the output in the form of rational probabilities (Pn). A number of probabilities are required, displayed as P1, P2, ..., PN, which show the type of interest of the person.
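
Alice's timestamps work out to 1 hour, 2 minutes, 3 seconds on site, i.e. 3723 seconds. One possible way to encode such a session as a fixed-length input and read off P1..PN, assuming scikit-learn, is sketched below; the feature choices and the training data are placeholders, not my real pipeline:

Code:
import numpy as np
from sklearn.svm import SVC

def encode_session(seconds_on_site, pages_visited, n_comments,
                   comment_lengths, first_page_weight):
    # Fixed-length summary of one visit: duration, counts, comment-length
    # statistics, and a weight for the first (most informative) page.
    return np.array([
        seconds_on_site, pages_visited, n_comments,
        np.mean(comment_lengths), max(comment_lengths),
        first_page_weight,
    ])

# Alice: 9:08:07 -> 10:10:10 is 3723 seconds, 4 pages, 5 comments.
alice = encode_session(3723, 4, 5, [20, 33, 45], first_page_weight=1.0)

# Placeholder training set over N = 3 interest types, just to show the API.
rng = np.random.default_rng(0)
X = rng.normal(loc=alice, scale=50.0, size=(3000, alice.size))
y = rng.integers(0, 3, size=3000)

clf = SVC(kernel="rbf", probability=True).fit(X, y)
print("P1..PN for Alice:", clf.predict_proba([alice])[0])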


So far, I can't get past 37%, and failing is not an option here.