list - Python - count words not in a distance of three (3) words from specific words -


i using following python code count words in text (.txt) files, checking whether of words in text file belong in of two lists of words considering (the word lists .csv files, imagine these "dictionaries")

import re import collections collections import counter import csv import sys  find_words = re.compile(r'(?<!\s)[a-za-z]+(?!\s)').findall wanted1 = set(find_words(open('word_list1.csv').read().lower())) wanted2 = set(find_words(open('word_list2.csv').read().lower()))    f in sys.argv[1:]:     cnt1 = cnt2 = cntwords = 0     wanted = 20     open(f) inputfile:         line in inputfile:             word in find_words(line.lower()):                 myfile.write(word+ "\n")                 cntwords += 1                 if word in wanted1:                     file1.write(word+ "\n")                     cnt1 += 1                 if word in wanted2:                     file2.write(word+ "\n")                     cnt2 += 1    

at moment, counting every word in .txt file happens belong in word lists wanted1 , wanted2.

what want count these words only when there no negator in distance of 3 words these words.

a negator any 1 of following 3 words: no, not, never.

in case, if negator in distance [-3,+3] words word examining, word should not counted if belongs in 1 of word lists examining.

any idea how implement in code? thanks.

example1:

word-2 word-1 word0 word1 word2 not word3 word4 word5 word6 word7 -> none of word0 word5 should counter, word-2, word-1, word6, word7 should counted (if belong in csv word lists). instead of "not" "never" or "no".

example2: never word-2 word-1 word0 word1 word2 -> word-2 word-1 word0 should not counted, word1 word2 should counted (if belong in csv word lists). instead of "never" "not" or "no".

i put little script similar you're asking. implemented contents of .txt file multi-line string , hardcoded word lists simplify things example. can replace bits file open / reading code. inefficient solution, clearest way organise in head. feel free optimise liking.

# -*- coding: utf-8 -*- import pprint collections import defaultdict string import punctuation  # word count each word in pair of wordlists appear in block of text. # exclude appearance of word count if of 3 words before or after word # in question member of negator set (no, not, never).  def main():     wordlist1 = ['mickey', 'pluto', 'goofy', 'minnie', 'donald']     wordlist2 = ['bugs', 'daffy', 'elmer', 'foghorn', 'porky']      # whether ensure words in wordlists lowercase depends on use-case     wordlist1 = [element.lower() element in wordlist1]     wordlist2 = [element.lower() element in wordlist2]     mergedwordset = set(wordlist1 + wordlist2)     negatorset = set(['no', 'not', 'never'])      # using collections.defaultdict here can add key value of 1      # if doesn't exist , increment value of key if exist.     countincludingneg = defaultdict(int)     countexcludingneg = defaultdict(int)      # using multi-line string here simplify example.     # parsed word count.     # adapt own uses.     # text excerpts wikipedia:     # http://en.wikipedia.org/wiki/pluto_(disney)     textblock = ''' pluto, called pluto pup, cartoon character created in 1930 walt disney productions. red-colored, medium-sized, short-haired dog black ears. unlike disney characters, pluto not anthropomorphic beyond characteristics such facial expression, though did speak short portion of history. mickey mouse's pet. officially mixed-breed dog, made debut bloodhound in mickey mouse cartoon chain gang. mickey mouse, minnie mouse, donald duck, daisy duck, , goofy, pluto 1 of "sensational six"—the biggest stars in disney universe. though 6 non-human animals, pluto alone not dressed human. pluto debuted in animated cartoons , appeared in 24 mickey mouse films before receiving own series in 1937. pluto appeared in 89 short films between 1930 , 1953. several of these nominated academy award, including pointer (1939), squatter's rights (1946), pluto's blue note (1947), , mickey , seal (1948). 1 of films, lend paw (1941), won award in 1942. because pluto not speak, films rely on physical humor. made pluto pioneering figure in character animation, expressing personality through animation rather dialogue. of pluto's co-stars, dog has appeared extensively in comics on years, first making appearance in 1931. returned theatrical animation in 1990 prince , pauper , has appeared in several direct-to-video films. pluto appears in television series mickey mouse works (1999–2000), house of mouse (2001–2003), , mickey mouse clubhouse (2006–2013). in 1998, disney's copyright on pluto, set expire in several years, extended passage of sonny bono copyright term extension act. disney, along other studios, lobbied passage of act preserve copyrights on characters such pluto 20 additional years. pluto first , appears in mickey mouse series of cartoons. on rare occasions paired donald duck ("donald , pluto", "beach picnic", "window cleaners", "the eyes have it", "donald's dog laundry", & "put put troubles"). first cartoons feature pluto solo star 2 silly symphonies, dogs (1932) , mother pluto (1936). in 1937, pluto appeared in pluto's quin-puplets first instalment of own film series, headlined pluto pup. however, not produced on regular basis until 1940, time name of series shortened pluto. first comics appearance in mickey mouse daily strips in 1931 2 months after release of moose hunt. pluto saves ship, comic book published in 1942, 1 of first disney comics prepared publication outside newspaper strips. however, not counting few cereal give-away mini-comics in 1947 , 1951, did not have own comics title until 1952. in 1936 pluto got title feature in picture book under title "mickey mouse , pluto pup" whitman publishing. pluto runs own neighborhood in disney's toontown online. it's called brrrgh , it's snowing there except during halloween. during april toons week, weekly event silly, pluto switches playgrounds minnie (all other characters well). pluto talks in minnie's melodyland. pluto has appeared in television series mickey mouse works (1999–2000), disney's house of mouse (2001–2003) , mickey mouse clubhouse (2006–present). curiously enough, however, pluto standard disney character not included when whole gang reunited 1983 featurette mickey's christmas carol, although did return in prince , pauper (1990) , runaway brain (1995). had cameo in framed roger rabbit (1988). in 1996, made cameo in quack pack episode "the mighty ducks". '''     # removing leading , trailing whitespace.     # removing new-lines can extend look-aheads / look-behinds across lines.     # removing punctuation.     # setting text lowercase     # adjust use-cases     textblock = textblock.strip().replace('\n', ' ').translate(none, punctuation).lower()     textblockwords = textblock.split()      # construct list of 7-gram (or less) word windows.     # window center on each individual word of textblock     # , include 3 words before , after appearance.     windows = n_gram_word_windows(textblockwords, 3)      # un-comment following line if you'd see representation of n-gram word windows     #pprint.pprint(windows)       windowdict in windows:         key, ngramlist in windowdict.iteritems():             # word member of wordlists?             if key in mergedwordset:                 countincludingneg[key] += 1                 # words preceeding or following appear in set of negators?                 if len(negatorset.intersection(set(ngramlist))) == 0:                     countexcludingneg[key] += 1     print "count including negators"     pprint.pprint(countincludingneg)     print "count excluding negators"     pprint.pprint(countexcludingneg)    # idea here examine each word in textblock , # create list containing 3 words before word, word itself, , 3 words following word. # method return list of dictionaries. # dictionary comprised of examined word key, , n-gram word window value. def n_gram_word_windows(textlist, lookaheadbehind=3):     wordwindows = []     index, item in enumerate(textlist):         intermediatelist = []         if index < lookaheadbehind:             preceedingword in textlist[:index]:                 intermediatelist.append(preceedingword)         else:             preceedingword in textlist[index-lookaheadbehind:index]:                 intermediatelist.append(preceedingword)         if index < len(textlist):             lookaheadword in textlist[index:index+lookaheadbehind+1]:                 intermediatelist.append(lookaheadword)         wordwindows.append({item: intermediatelist})     return wordwindows   if __name__ == '__main__':     main() 

the results following:

macbook:stackoverflow joeyoung$ python negatorparser.py  count including negators defaultdict(<type 'int'>, {'mickey': 12, 'donald': 3, 'goofy': 1, 'minnie': 2, 'pluto': 27}) count excluding negators defaultdict(<type 'int'>, {'mickey': 12, 'donald': 3, 'goofy': 1, 'minnie': 2, 'pluto': 24}) 

Comments

Popular posts from this blog

google chrome - Developer tools - How to inspect the elements which are added momentarily (by JQuery)? -

angularjs - Showing an empty as first option in select tag -

php - Cloud9 cloud IDE and CakePHP -