For function optimization (error minimalization) there are relative easy methods like nelder-mead and simutaled annealing (no gradient calculation needed). I wonder if a related method that I thought of would also work. Also I wonder if it exists already (somebody must have thought of it already because it's very simple) and if there are publications about it.
The method (not a complete algorithm):
Have a set of current positions and treat it like a cloud (and not like a simplex which neldermead does): each step replace a bad position (some randomness in picking that) by a position near (some randomness in picking that) a good position (some randomness in picking that too). The new point can be chosen using a gaussian distribution (around the chosen good position). For convergence the stddev of this distribution has to decrease each step (linearly or exponentially). The stddev can (also) be a function of the error (of the chosen good position): better evaluation, then smaller stddev. For convergence randomness in picking bad and good points can also be decreased each step. Optimization can be stopped when the error has passed a threshold or when the number of iterations has passed a threshold.
This method is simpler than neldermead and maybe also better than that in avoiding local minima as final results (neldermead is deterministic after initialization and so some pattern in to be optimized function might 'sabotage' optimization). Further, in neldermead the dimension of the function determines the number of current positions; in the proposed method this is free. You can choose for more certainty to avoid local minima (by having more positions) or you can choose for more convergence speed (by having less positions).
Compared to simulated annealing the proposed method has the advantage of having a helicopter overview in the beginning. It's not a local search method (there is simulated annealing using multiple current positions, but they do not interact (for a position no information about other positions is used) and so the helicopter view is not exploited). Simulated annealling is less local than e.g. gradient based search because of the big jumps in the beginning, but there is no big picture at any moment in time, which neldermead and the proposed method do have (at least in the beginning of the search) and this big picture gives information about where you probably have to look for minima.
This initial big picture is also what misses in gradient based search. If you have a fairly smooth function and add some noise, the resulting function is hard to optimize with gradient based search, because gradients give bad information in this case. But a big picture gives very good information (for a smooth-noisy function the evaluation values for positions in a short range of each other stay within a short range). (of course if a function is too noisy this advantage does not exist. And such functions are almost impossible to optimize using any method.)
Gradient based search is (very) probably faster in case of optimizing a simple smooth function (few valleys and hills, no noise), since gradients give good information then. So in that case it's better to pick that method (given that gradients can be calculated).
Feedback is welcome.