I wonder if you can help with a maths problem I'm having. I've posted it here, as my limited maths knowledge says that this is a combinatorics problem, but please shout if not and I'll see if I can get the thread moved to the appropriate place in the forum.
I'm working on a linguistic tool (maths is not my strong point, you might be surprised to know), and the tool is to do with synonym replacement for a given piece of text (synonym = word or phrase which means the same as something else).
Here is the relevant information:
1. The tool works by taking a phrase (which might be one word, or a set of words running consecutively), and allows that phrase to be replaced by any one of a defined set of equivalent phrases.
eg, 'the quick brown fox' could have the "quick" replaced by "fast", "rapid", or "speedy", and the "brown fox" replaced by "tan fox", "tan arctic fox", "furry vulpine", or "hairy four-legged mammalian animal". The number of words in either the original or replacement text is not relevant to my problem, as far as I can tell, although see later...
2. The number of alternative replacements is not consistent, and can be 0 (eg, only the original text)
eg in 'the quick brown fox', the word 'the' is not replaced ever. Whether this is indicated by 0 (no additional synonyms) or 1 (only 1 total synonym) in the formula I'm seeking, is up to you guys, I guess.
3. When the software is run, it spits out a random combination of the various synonym replacements (let's call this the "proposed text"), and then runs the following check:
4A. It breaks the entire text into 4-word shingles. A shingle is a group of consecutive words, best explained like this:
'the quick brown fox jumps over the lazy dog' has shingles of 'the quick brown fox', 'quick brown fox jumps', 'brown fox jumps over', 'fox jumps over the', 'jumps over the lazy' and 'over the lazy dog'. For the purposes of the exercise, I'm ignoring the starting and ending shingles of 'the', 'the quick', and 'the quick brown', 'the lazy dog', 'lazy dog', and 'dog', although feel free to include them if it makes the maths easier.
4B. It then compares the current list of shingles to the list of shingles in all previous "proposed texts" (if any), and if more than 30% of the shingles are common between the two lists of shingles, the new "proposed text" is deleted, and another one created. If there are 30% or less duplicates, the new "proposed text" is retained, and then becomes one of the "proposed texts" against which future versions are checked.
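In case it helps to see steps 4A and 4B written down, here is a rough Python sketch of how I understand the check. The function names (`shingles`, `duplicate_fraction`, `is_retained`) are made up for this post and are not the tool's real code. I'm also assuming that the 30% is measured against the number of distinct shingles in the new text, and that the new text is compared against each earlier text separately; correct me if another reading makes more sense.

```python
def shingles(text, size=4):
    """Return the set of `size`-word shingles in `text`.

    Partial shingles at the start and end are ignored, as in step 4A.
    """
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def duplicate_fraction(new_text, old_text, size=4):
    """Fraction of the new text's distinct shingles that also occur in the old text.

    Assumption: the denominator is the new text's own shingle count.
    """
    new = shingles(new_text, size)
    if not new:
        return 0.0
    return len(new & shingles(old_text, size)) / len(new)


def is_retained(new_text, accepted_texts, threshold=0.30, size=4):
    """Step 4B: keep the new text only if it shares at most `threshold`
    of its shingles with every previously accepted text (assumed: checked
    one earlier text at a time, not against their union)."""
    return all(
        duplicate_fraction(new_text, old, size) <= threshold
        for old in accepted_texts
    )


original = "the quick brown fox jumps over the lazy dog"
variant = "the fast brown fox jumps over the lazy dog"
print(sorted(shingles(original)))
print(duplicate_fraction(variant, original))
print(is_retained(variant, [original]))
```

With this reading, swapping only "quick" for "fast" still leaves 4 of the 6 shingles in common, so that variant would be deleted.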
NB: I said earlier that the number of words is not relevant as far as I can tell. I guess it would be relevant here, since if a given 1-word phrase can be replaced by a 10-word phrase, then the number of shingles will differ between two articles that use the 1-word and the 10-word phrase.
NB: The order of the shingles is not relevant. So, in "the quick brown fox jumps over the lazy dog" vs "the lazy dog jumps over the quick brown fox", then "the quick brown fox" is a duplicate, even though it appears in different positions in each sentence.
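To illustrate the two NBs, here is a small Python sketch (again, `shingles` is just a name I've made up for this post). It treats the shingles as a set, so position doesn't matter, and it shows how a longer replacement phrase changes the number of shingles.

```python
def shingles(text, size=4):
    """Set of `size`-word shingles; partial shingles at the ends are ignored."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


# Order is not relevant: a shared shingle counts wherever it appears.
a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the lazy dog jumps over the quick brown fox")
common = a & b
print(common)

# Word count is relevant: a text of n words has n - 3 four-word shingles,
# so a longer replacement phrase produces more shingles.
short_text = shingles("the quick brown fox jumps")
long_text = shingles("the quick hairy four-legged mammalian animal jumps")
print(len(short_text), len(long_text))
```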
What I'm trying to work out is:
By looking at the source text (including its relevant replacement synonyms), what formula can I apply to work out how many of the potential "proposed texts" will pass that check, ie have no more than 30% duplicate shingles?
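To make the question concrete, here is a rough Python sketch of what I could do by brute force on a tiny example. The `slots` layout and the function names are hypothetical, invented for this post. The total number of possible texts is simply the product of the number of options per phrase (counting the original as one option). What I can't work out is a formula for how many of those survive the check. Note that this sketch walks through the combinations in a fixed order, whereas the tool picks them at random, so I assume the surviving count could differ from run to run.

```python
import itertools
import math

# Hypothetical representation: one list per phrase, original first, then synonyms.
slots = [
    ["the"],
    ["quick", "fast", "rapid", "speedy"],
    ["brown fox", "tan fox", "tan arctic fox", "furry vulpine",
     "hairy four-legged mammalian animal"],
    ["jumps over the lazy dog"],
]


def shingles(text, size=4):
    """Set of `size`-word shingles; partial shingles at the ends are ignored."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def count_retained(slots, threshold=0.30, size=4):
    """Enumerate every combination in a fixed order and keep a text only if
    at most `threshold` of its shingles appear in each already-kept text.

    Assumption: the percentage is taken over the new text's own shingles.
    """
    accepted = []
    for choice in itertools.product(*slots):
        new = shingles(" ".join(choice), size)
        if all(len(new & old) <= threshold * len(new) for old in accepted):
            accepted.append(new)
    return len(accepted)


total = math.prod(len(options) for options in slots)  # 1 * 4 * 5 * 1
retained = count_retained(slots)
print(total, retained)
```

Brute force like this is only workable for tiny inputs, since the total grows multiplicatively with every extra phrase, which is why I'm after a formula.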
I know this is my first post to the forum, but I'm desperate for an answer to this. Many thanks for any solutions or part-solutions you can offer. If anything is not clear, just let me know. And if someone can offer a solution, I can send a belated Xmas gift to their Paypal account, or an Amazon voucher/wishlist purchase, or something similar, as a way of expressing my thanks.
All the best,