What are you ultimately trying to do? Are you trying to make comparisons between relative sales quantities with regards to existing stock in each store for each size of clothing? What is the goal here?
I need a little advice from your expertise...
I have a pool of data which contains a list of stores and their sales quantity for 4 sizes, namely S, M, L, XL.
Store Size Quantity
2 S 5483
2 M 6444
2 L 5883
2 XL 3165
7 S 4931
7 M 5963
7 L 4915
7 XL 2469
12 S 115
12 M 149
12 L 94
12 XL 65
For each store, I need to label them to 1 of the 4 size group. Which means, store 2 belong to size group S because they are selling more of the S size, store 7 belongs to size group XL because they are selling more of size XL etc for all stores.
Please note that each store did not receive the same quantity of each size, and hence I didn't not analyze based on their sales quantity but rather their sales % contribution of each size instead... and the % variance of each size compare to the sales % contribution of all stores combined...
(Please find attached excel file for better understanding...)
And I got stuck here...
Can some please give me advice if I am going to the right direction? If yes, how should I proceed further.
Thank you for your help in advance.
Thank you for your reply Chiro.
What I am trying to find out is;
Based on the store sales qty, identity which store should be identified as a "SMALL" store, which store should be identify as "LARGE" store.
A "SMALL" store means the store is selling more of the small size compare to the average. Same logic goes to "LARGE" store.
I have a solution in mind but I am not sure if it is the correct way to solve this.
I will take the % contribution of Size XS of all stores and calculate the average and the average deviation. Then store with % contribution for size XS higher than the average + 1*sd will be a "SMALL" store, and store with % contribution for size XS higher than the average + 2*sd will be a "XS" store...
Then do the same for size XL...
One thing to point out is that your data may not be normally distributed.
However you can still do the kinds of things by specifying quantile information for what is a "XS" store and other categories.
What you do is you get your sample and get what is called an empirical distribution. You then say select the regions for "XS" and so on. For example "XS" might be top 10% (so this is the 0.9-1.0 range of cumulative probability), the next might be 20% (so 0.7-0.9 cumulative probability), next might be top 30% (so 0.4-0.7 cumulative probability) and so on.
In a normal distribution you do the same thing as above except because the Normal is symmetric, it means that you get simplifications. So when you use a normal distribution +- 1 standard deviation is roughly 66%, +- 2 is roughly 95%, +- 3 is roughly 99.7% and so on.
But this assume normality and you can't always assume this, but what you can do is use exactly the same idea but for the distribution represented by your data and not by an assumption.
Also Remember that +- includes not only data that are 2 s.d.'s in the positive direction, but also 2 s.d's in the negative: this is why I proposed breaking it in regions from the top, but if you had a reason for doing the above then just ignore that comment.
Excel should have a way to calculate the cumulative probability of an observation given a sample distribution (what I call empirical above).
Thank you for your reply again Chiro,
I did some research on the empirical distribution. The solution you proposed will group a fix number of store into each size group, because it is defining based on the cumulative probability and the cumulative probability depends on "n", in this case stores. Is this correct? Please advice.
If that is the case, this might not be the best solution to this problem. We would like to, for example, assign a store to "XS" only if they are selling a high Qty of XS. If no store is selling a lot of XS, then there will be no XS store.
Using the empirical distribution, a fixed number of store will be assigned as "XS" even if they do not sell an obvious high portion of XS compare to other sizes.
You can filter the data by requirements that you only get stuff in the size distribution if it meets a certain threshold if you wish.
The cumulative distribution will depend on the different numbers of values that the distribution can take: this is up to you and is also based on the nature of your data.