Hello all

I'm new to this forum and new to statistics as well. I'm studying forensic linguistics and am writing a module assessment in which I have to use statistical tests. Up to now I could figure out what to do, but now I am stuck.

I am analyzing two Twitter accounts and trying to attribute some anonymous posts to one of the two authors (the anonymous posts were written either by one author or by the other).

I have two data samples, which I compared using a two-tailed z test for the comparison of proportions, at a significance level of 0.05, to check for statistically significant differences. Each sample is a list of the words appearing in the tweets, together with each word's number of occurrences relative to the total number of words in all tweets. Sample 1 is for author 1 and sample 2 is for author 2. I compared the lists to find out whether one author used a specific word significantly more often than the other author.

Example:

“xy” is used 12x out of 13997 words in total by author 1.

The same word is used 51x out of 11121 words in total by author 2.

The z test result is a p value of 0.00000000445922188063719 (about 4.46e-9), so the difference in the use of this word is statistically significant.

This test was used on every word that appears in both samples.
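To make the calculation concrete, here is a minimal Python sketch of the test as I understand it. It assumes the pooled-proportion version of the two-proportion z test (I am not certain this is the only valid variant), and the function name two_proportion_z_test is just my own label, not something from a library. With the numbers from the example it should give a p value in the same range as the one I reported above.

```python
import math


def two_proportion_z_test(count1, n1, count2, n2):
    """Two-tailed z test for two proportions (pooled standard error assumed)."""
    p1 = count1 / n1
    p2 = count2 / n2
    # Pooled proportion under the null hypothesis that both rates are equal
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p value from the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# "xy": author 1 uses it 12x out of 13997 words, author 2 51x out of 11121 words
z, p_value = two_proportion_z_test(12, 13997, 51, 11121)
print(f"z = {z:.3f}, p = {p_value:.3e}")
```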

Now I have a smaller data sample, “anonymous”, with 181 words in total. I don’t know whether it was written by author 1 or author 2. I want to try to attribute the anonymous sample to one of the two authors based on the list of words in the sample and their numbers of occurrences.

How can I test if the sample “anonymous” is more likely to be attributed to author 1 or author 2? Is there a statistical test with which I can compare this small data sample with the two bigger samples?

Put another way:

“anonymous” uses “xy” 5x out of 181 words.

author 1 uses the word 251x out of 13997 words.

author 2 uses the word 81x out of 11121 words.

--> How can I compare these data and find a statistical difference? I know that author 1 uses the word significantly more often than author 2. Does anonymous use this word about as often as author 1 or about as often as author 2?

I have thought of doing two z tests, one comparing anonymous with author 1 and the other comparing anonymous with author 2. But what would the results tell me? I have trouble stating the null hypotheses for these two tests. Can I say that the null hypothesis (comparing anonymous with author 1) is that there is no difference between the two in the use of specific words? Would the alternative hypothesis (that there IS a difference) mean that anonymous should probably be attributed to author 2 rather than to author 1?
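The two z tests I have in mind would look roughly like the sketch below, using the counts from the “Put another way” example. This is only an illustration under the same pooled-proportion assumption, with a helper name I made up myself. I am also aware that with only 5 occurrences in 181 words the normal approximation behind the z test may be questionable, which is part of why I am unsure how to interpret the output.

```python
import math


def two_proportion_z_test(count1, n1, count2, n2):
    """Two-tailed z test for two proportions (pooled standard error assumed)."""
    p1 = count1 / n1
    p2 = count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p value from the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts for the word "xy" as (occurrences, total words)
anonymous = (5, 181)
author1 = (251, 13997)
author2 = (81, 11121)

# Test 1: anonymous vs. author 1 (H0: both use "xy" at the same rate)
z1, p_vs_author1 = two_proportion_z_test(*anonymous, *author1)
# Test 2: anonymous vs. author 2 (H0: both use "xy" at the same rate)
z2, p_vs_author2 = two_proportion_z_test(*anonymous, *author2)

print(f"anonymous vs author 1: z = {z1:.3f}, p = {p_vs_author1:.4f}")
print(f"anonymous vs author 2: z = {z2:.3f}, p = {p_vs_author2:.4f}")
```

This only covers a single word; in practice I would have to repeat it for every word in the list, which is where I get lost on how to combine the results.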

I hope you can help me.

Thank you for your consideration