I'm looking for your opinions on the best way to solve a need I have for finding outliers in a set of data.
A while ago I wrote a software application to help my wife and I manage our finances. It imports statement (CSV's) from our banks, credit card companies, etc, tags them with various metadata (like expense category), and loads them into a database. The application provides budgeting, data (i.e. transaction) analysis, reporting, and a slew of other things. Overall, I'm pleased with how it's been working. It's helped me to understand where our money is going and how to better manage our finances.
Over time, outliers in the data have started skewing my reports, budgeting analysis, and the conclusions I draw from these KPI's - to the point where I'm not sure I can trust the results.
In an effort to start weeding out these offending data points, I brushed up on entry level stats and added functionality into the application to account for outliers using standard deviations. By default, the application looks through the various transactions for a given category and calculates whether the transaction amount is outside 2 std dev of the mean. I believe this is the 95% CI.
This works great in a lot of situations. Balloon, one-time payments for things are omitted from the trending analysis and makes the data more informative. For example, we recently bought a house and spent a lot of money on "Home Renovations". A lot of these transactions occur at Lowes, Home Depot and the like. But a few of them were large payments to contractors. The system properly weeds these out and marks them as outliers, which is great.
But, there are 2 main problems:
First, not all my categories have more than 30 data points. Meaning, for the given category, e.g. "Vehicle (Fuel)" or "Stop n Shop", I might only have 5 transactions. This makes my current method for finding outliers unreliable.
Second, in situations where I have more than 30 transactions, the outliers identified are sometimes not expected. For example, my wife has a Student Loan payment that gets paid automatically every week. The payment amount is the same every week. Over the years, this adds up to a few hundred transactions. About halfway through the date range for these payments, she upped her monthly payment to help pay the loan off sooner. So for a while it was $55 a week and then it became $76 a week. When running the outlier routine using 2 std devs, the system identifies all the $55 payments as outliers. I think the lower range cutoff was like $55.84 - so it barely misses but misses nonetheless.
I'm not sure if I'm thinking of this stuff the right way. I guess that's why I'm here. I'd like to get your feedback on how you think I should process the transactions such that the data reported is reliable. I'd like the method to be dynamic, meaning the application makes these determinations dynamically without me having to go through each transaction and choosing if it's an outlier or not. Having some power to tweak the thresh-holds would be great too.
So, am I going about this stuff the right way? Maybe the wrong way? Are the problems I'm experiencing just the nature of statistics? I don't know... please let me know your thoughts.
First time poster Thanks!