Could someone please help me with the following question:
Would you expect word-length in general to differ in fiction and non-fiction books?
Take two books by different authors, one fiction and one non-fiction. Choose a reasonable sample size of words from each, and find the mean, median and modal word-length and the standard deviation for each. Make it clear how you have done the various calculations without presenting detailed arithmetic.
In the light of the figures found, comment on the initial question. No formal inferential work is needed, just an informed view based on the figures.
When you choose the books, avoid non-fiction rich in mathematical, chemical
or similar notation, as it will make the sampling more difficult and ambiguous.
I would suggest that the fiction be "Moby Dick", as it's almost traditional
by now to use this as a literary reference text (as in testing the "Bible
Codes" claims).
You will need to consider how big a sample you need. This will depend
on the spread of word lengths in the texts (the SD of word lengths) and
on the resolution you wish to achieve in your test (that is, do you wish to
detect a difference in mean word lengths of 1, 0.1, 0.01 ... letters with
high probability?).
A rough order-of-magnitude estimate that I made indicates you may be looking
at sample sizes of roughly 1500 or more if you want to detect differences in
the mean word length of ~0.1 letters.
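
To see how the sample size scales with the resolution, here is a minimal
sketch using the usual normal-approximation formula for comparing two means.
The function name words_per_book, the SD of 2.5 letters, the 5% significance
level and the 80% power are all my own illustrative assumptions, not figures
taken from either book. The answer depends strongly on those choices, so it
need not agree with the rough estimate above; estimate the SD from a small
pilot sample before relying on it.

```python
import math
from statistics import NormalDist


def words_per_book(sd, delta, alpha=0.05, power=0.8):
    """Approximate words needed from EACH book to detect a difference
    `delta` in mean word length (two-sided z-test, equal SDs assumed)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the wanted power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Assumed SD of 2.5 letters: an illustrative guess, not a measured value.
for delta in (1, 0.1, 0.01):
    print(delta, words_per_book(sd=2.5, delta=delta))
```

The required size grows with the square of sd/delta, so each tenfold
improvement in resolution costs about a hundred times as many words.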
You will need to devise a sampling frame, that is, a method of deciding which
words to include in your sample, that selects words fairly. How you do this
will depend on the facilities you have available (computer text with software
that can randomly sample the texts, or doing it by hand with paper copies
of the books).
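
If you have the texts as plain computer files, the sampling and the summary
figures could be sketched as below. This is only one possible approach: the
function names, the rule for what counts as a word (runs of letters, with an
optional apostrophe part such as "don't") and the fixed seed are my own
assumptions, and you may prefer a different rule for hyphens, numbers or
abbreviations.

```python
import random
import re
import statistics


def sample_word_lengths(text, n, seed=0):
    """Draw a simple random sample of up to n words (without replacement)
    and return the number of letters in each sampled word."""
    # Assumed word rule: letters, optionally followed by 'letters (e.g. don't).
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = rng.sample(words, min(n, len(words)))
    return [sum(ch.isalpha() for ch in w) for w in sample]


def summarise(lengths):
    """Mean, median, mode and sample standard deviation of word lengths."""
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),  # first mode if there is a tie
        "sd": statistics.stdev(lengths),   # needs at least two lengths
    }


# Tiny inline text for illustration; in practice pass the full book text.
demo = "Call me Ishmael. Some years ago"
print(summarise(sample_word_lengths(demo, n=100)))
```

Run the same two functions on each book with the same sample size, then
compare the four figures side by side.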
That's about all I can think of for this at present.
