How to count # of patterns in a sequence

Jan 2020
3
0
USA
Definition: Sequence 'ab' has patterns 'a', 'b', 'ab' but not 'ba', or by codes
Python:
def has_pattern(pat, lst):
  pat_i = 0
  for el in lst:
    if el == pat[pat_i]:
      pat_i += 1
      if pat_i == len(pat):
        return True
      continue
  return False
What I want is given a pattern and a sequence, the probability of a permutation of the sequence has the pattern. This function does the job
Python:
def prob(pat, lst):
  all_perm = list(itertools.permutations(lst, len(lst)))
  return Fraction(sum((has_pattern(pat, p) for p in all_perm)), len(all_perm))
but it's slow and I want a general formula so I can get, for example, prob(‘abc’, ‘aabbcc’) = 47/90, quickly.
What I can see is
1. Letters in the sequence but not in the pattern doesn't matter. prob('ab', 'abc') = prob('ab', 'ab')
2. The order of pattern doesn't matter. prob('ab', 'ab') = prob('ba', 'ab')
3, then?
Thanks.
 
Jan 2009
448
127
I decided to simplify the code using regex, then reread itertools documentation, itertools — Functions creating iterators for efficient looping — Python 3.8.1 documentation , to find the following
"Elements are treated as unique based on their position, not on their value. So if the input elements are unique, there will be no repeat values in each permutation."
That means all_perm in your code was a list with some duplicate elements.

ie.
Python:
print(list(permutations('aa')))
> [('a', 'a'), ('a', 'a')]
This slowed down your code and caused a defect, namely prob('ab','aabbcc') isn't 47/90
My code yields 23/90. It defines all_perm as a set (removes duplicates) from a list of joined strings. It then calls pattern_count_re which uses regex search.
Python:
from itertools import permutations
from fractions import Fraction
from re import search

def pattern_count_re(pat,all_perm):
    count = 0
    for perm in all_perm:
        # print(perm)
        if search(pat,perm):
            count +=1
    return count

def prob_re(pat, seq):
    all_perm = set(list(map("".join, permutations(seq))))
    return Fraction(pattern_count_re(pat,all_perm), len(all_perm))

print(prob_re('abc','aabbcc'))
The math way:
How many permutations?
\(\displaystyle \frac{n!}{x_1!x_2!...}\)
Where n is the length of the sequence and \(\displaystyle x_i\) is number of repeats per letter

So for 'aabbcc' we have
\(\displaystyle \frac{6!}{2!2!2!} = 90\)

For counting 'abc' in all permutations of 'aabbcc' we have 4 ways to put 'abc' in the string of length 6 (excluding wrap-around).
We then have 6 ways to permute remaining letters a,b,c from original sequence.
6*4 = 24, but 'abcabc' is repeated twice, so it's 24-1 = 23

I am stumped on how to count patterns in all permutes generally, but I hope the above example is at least a start for you (if you want the math closed form solution)
 
Jan 2020
3
0
USA
Thank you. Some comments.
1. Sorry my original example was incomplete. Let me give another one. 'abc' has pattern a, b, c, ab, ac, bc and abc, but not ca.
2. Using regex is a good ides. It should be faster.
3. I don't want to deduplicate. Every permutation counts, but that's an easy tweak from your code.
 
Jan 2009
448
127
'aabbcc' has 90 permutations.
'aabbcc' does not have 6! =720 permutations, the result of not removing duplicates from the all_perms list.

Why don't you want to remove duplicates?
 
Jan 2020
3
0
USA
Yes we have a x-y problem here. The original problem is a bit involved. Basically it's a permutation test. We want to use permutations to get a baseline distribution and from there to get a p value, so every permutation counts, no matter if it's a duplication or not.
 
Jan 2009
448
127
I understand now. In that case one could simply define in the problem that 'permutation' is to be understood as the output of itertools.permutations exactly and that it is the set from which to count how many of the elements have a pattern in it.

One more idea that I had would be that if one could (beforehand) know how many duplicates of each element exist in the set of permutations, then I could simply check over the unique set of permutations for patterns, then use duplicate quantities as weights in an ad-hoc calculation.
(23/90 and 47/90, and 23*2+1 = 47 appears rather convenient in the case of 'aabbcc')

I don't think it'd be faster to create a two column list of unique permutation and its count and iterate over that, than simply count over the set of original permutations as described above, but there might be patterns to exploit.
 
Similar Math Discussions Math Forum Date
Discrete Math
Statistics / Probability
Algebra
Algebra