Stop-word removal deletes a word that is not in the stop-word list

Time: 2016-06-19 18:35:26

Tags: python split

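I am interested in mining the scientific literature, specifically PubMed. I want to identify the word modifiers to the left and right of a keyword I choose. My plan is as follows. (1) I query my database on hearing and hearing aids for the word "AID". (2) I then remove punctuation, double spaces, etc. from the field containing the title + abstract; this step exists mostly for historical reasons. (3) Next, I split the text on spaces and (4) remove the stop words, using a list I obtained from MySQL. In hindsight, the list should probably live in a class of some kind. (5) I look for the keyword "AID" and collect the keys immediately before and after it. The code is pieced together from many sources on StackOverflow and other sites, since I am new to Python and SQLite. The problem area of the code is below.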

my_stopwords = '''['A','ABLE','ABOUT','ABOVE','ACCORDING','ACCORDINGLY','ACROSS','ACTUALLY','AFTER','AFTERWARDS','AGAIN','AGAINST','ALL','ALLOW','ALLOWS','ALMOST','ALONE','ALONG','ALREADY','ALSO','ALTHOUGH','ALWAYS','AM','AMONG','AMONGST','AN','ANOTHER',
                        'ANY','ANYBODY','ANYHOW','ANYONE','ANYTHING','ANYWAY','ANYWAYS','ANYWHERE','APART','APPEAR','APPRECIATE','APPROPRIATE','ARE',
                        'AROUND','AS','ASIDE','ASK','ASKING','ASSOCIATED','AT','AVAILABLE','AWAY','AWFULLY','BE','BECAME','BECAUSE','BECOME','BECOMES',
                        'BECOMING','BEEN','BEFORE','BEFOREHAND','BEHIND','BEING','BELIEVE','BELOW','BESIDE','BESIDES','BEST','BETTER','BETWEEN','BEYOND',
                        'BOTH','BRIEF','BUT','BY','CAME','CAN','CANNOT','CANT','CAUSE','CAUSES','CERTAIN','CERTAINLY','CHANGES','CLEARLY','CO','COM','COME',
                        'COMES','CONCERNING','CONSEQUENTLY','CONSIDER','CONSIDERING','CONTAIN','CONTAINING','CONTAINS','CORRESPONDING','COULD','COURSE',
                        'CURRENTLY','DEFINITELY','DESCRIBED','DESPITE','DETERMINE','DETERMINED','DID','DIFFERENT','DO','DOES','DOING','DONE','DOWN','DOWNWARDS','DURING','EACH','EDU',
                        'EFFECT','EFFECTS','EG','EIGHT','EITHER','ELSE','ELSEWHERE','ENOUGH','ENTIRELY','ESPECIALLY','ET','ETC','EVEN','EVER','EVERY','EVERYBODY','EVERYONE',
                        'EVERYTHING','EVERYWHERE','EX','EXACTLY','EXAMPLE','EXCEPT','FAR','FEW','FIFTH','FIRST','FIVE','FOLLOWED','FOLLOWING','FOLLOWS',
                        'FOR','FORMER','FORMERLY','FORTH','FOUR','FROM','FURTHER','FURTHERMORE','GET','GETS','GETTING','GIVEN','GIVES','GO','GOES','GOING',
                        'GONE','GOT','GOTTEN','GREETINGS','HAD','HAPPENS','HARDLY','HAS','HAVE','HAVING','HE','HELLO','HELP','HENCE','HER','HERE','HEREAFTER',
                        'HEREBY','HEREIN','HEREUPON','HERS','HERSELF','HI','HIM','HIMSELF','HIS','HITHER','HOPEFULLY','HOW','HOWBEIT','HOWEVER','IE','IF',
                        'IGNORED','IMMEDIATE','IN','INASMUCH','INC','INDEED','INDICATE','INDICATED','INDICATES','INNER','INSOFAR','INSTEAD','INTO','INWARD',
                        'IS','IT','ITS','ITSELF','JUST','KEEP','KEEPS','KEPT','KNOW','KNOWN','KNOWS','LAST','LATELY','LATER','LATTER','LATTERLY','LEAST','LESS',
                        'LEST','LET','LIKED','LIKELY','LITTLE','LOOK','LOOKING','LOOKS','LTD','MAINLY','MANY','MAY','MAYBE','ME','MEAN','MEANWHILE','MERELY',
                        'MIGHT','MORE','MOREOVER','MOST','MOSTLY','MUCH','MUST','MY','MYSELF','NAME','NAMELY','ND','NEAR','NEARLY','NECESSARY','NEED','NEEDS',
                        'NEITHER','NEVER','NEVERTHELESS','NEW','NEXT','NINE','NO','NOBODY','NON','NONE','NOONE','NOR','NORMALLY','NOT','NOTHING','NOVEL','NOW',
                        'NOWHERE','OBVIOUSLY','OF','OFF','OFTEN','OH','OK','OKAY','OLD','ON','ONCE','ONE','ONES','ONLY','ONTO','OTHER','OTHERS','OTHERWISE',
                        'OUGHT','OUR','OURS','OURSELVES','OUT','OUTSIDE','OVER','OVERALL','OWN','PARTICULAR','PARTICULARLY','PER','PERHAPS','PLACED','PLEASE',
                        'PLUS','POSSIBLE','PRESUMABLY','PROBABLY','PROVIDES','QUE','QUITE','QV','RATHER','RD','RE','REALLY','REASONABLY','REGARDING',
                        'REGARDLESS','REGARDS','RELATIVELY','RESPECTIVELY','RIGHT','SAID','SAME','SAW','SAY','SAYING','SAYS','SECOND','SECONDLY','SEE','SEEING',
                        'SEEM','SEEMED','SEEMING','SEEMS','SEEN','SELF','SELVES','SENSIBLE','SENT','SERIOUS','SERIOUSLY','SEVEN','SEVERAL','SHALL','SHE','SHOULD',
                        'SHOWED','SHOWS','SINCE','SIGNIFICANTLY','SIX','SO','SOME','SOMEBODY','SOMEHOW','SOMEONE','SOMETHING','SOMETIME','SOMETIMES','SOMEWHAT','SOMEWHERE','SOON','SORRY',
                        'SPECIFIED','SPECIFY','SPECIFYING','STILL','STUDY','SUB','SUCH','SUP','SURE','TAKE','TAKEN','TELL','TENDS','TH','THAN','THANK','THANKS',
                        'THANX','THAT','THATS','THE','THEIR','THEIRS','THEM','THEMSELVES','THEN','THENCE','THERE','THEREAFTER','THEREBY','THEREFORE',
                        'THEREIN','THERES','THEREUPON','THESE','THEY','THINK','THIRD','THIS','THOROUGH','THOROUGHLY','THOSE','THOUGH','THREE','THROUGH',
                        'THROUGHOUT','THRU','THUS','TO','TOGETHER','TOO','TOOK','TOWARD','TOWARDS','TRIED','TRIES','TRULY','TRY','TRYING','TWICE','TWO',
                        'UN','UNDER','UNFORTUNATELY','UNLESS','UNLIKELY','UNTIL','UNTO','UP','UPON','US','USE','USED','USEFUL','USES','USING','USUALLY',
                        'VALUE','VARIOUS','VERY','VIA','VIZ','VS','WANT','WANTS','WAS','WAY','WE','WELCOME','WELL','WENT','WERE','WHAT','WHATEVER','WHEN',
                        'WHENCE','WHENEVER','WHERE','WHEREAFTER','WHEREAS','WHEREBY','WHEREIN','WHEREUPON','WHEREVER','WHETHER','WHICH','WHILE','WHITHER',
                        'WHO','WHOEVER','WHOLE','WHOM','WHOSE','WHY','WILL','WILLING','WISH','WITH','WITHIN','WITHOUT','WONDER','WOULD','YES','YET','YOU',
                        'YOUR','YOURS','YOURSELF','YOURSELVES, 'zzz', 'ZZZ', zzSTOPzz']'''

            str_split = string.split(' ')
            keys = [word for word in str_split if word.upper() not in my_stopwords]
            print ("Split Input: ", keys)
            num_wds = len(keys)
            print("Number of words = ", num_wds, "\n")

Most of the time this works, but the keyword "AID" leaves me with a dilemma. Sample output follows.

After the initial query (code not shown), I get the following.

Input Abstract:  PMID21839526zzz BONE-ANCHORED HEARING **AID** (BAHA) IN PATIENTS WITH TREACHER COLLINS SYNDROME:  ....

After stripping the punctuation, I get the following.

Cleaned Input:  PMID21839526zzz BONE-ANCHORED HEARING **AID** BAHA IN PATIENTS WITH TREACHER COLLINS SYNDROME....

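After I run the code above to split on spaces and remove the stop words (the stop-word list does not contain the word AID), I get the following. Note that the word "AID" has been removed from the list, which defeats my purpose.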

Split Input:  ['PMID21839526zzz', 'BONE-ANCHORED', 'HEARING', 'BAHA', 'PATIENTS', 'TREACHER', 'COLLINS', 'SYNDROME',....

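This code works for other keywords, including "AIDS", "MAGNETIC", etc. The problem occurs only with the three-letter keyword "AID". I would greatly appreciate an explanation, or any ideas, about why this might happen in this particular case. I hope this is clear. Thanks for your help.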

1 Answer:

Answer 0: (score: 0)

I don't fully follow your algorithm, but your stop-word list must be a list (or, better, a set), not a string:

my_stopwords = set(['A','ABLE','ABOUT','ABOVE','ACCORDING',])

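Otherwise you are only doing substring matching against one long string rather than exact matching against each stop word. That is why "AID" disappears: it occurs inside the stop word 'SAID', whereas "AIDS" and "MAGNETIC" do not occur anywhere in the string.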
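Here is a minimal sketch of the corrected filter, run on the cleaned sample abstract from the question. The names STOPWORDS and remove_stopwords are made up for illustration, and the set holds only a handful of entries rather than the full list.

```python
# Shortened stop-word set for illustration; the real one would hold the full list.
STOPWORDS = {'A', 'IN', 'SAID', 'THE', 'WITH'}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split on whitespace and drop tokens that exactly match a stop word."""
    return [word for word in text.split() if word.upper() not in stopwords]


cleaned = ('PMID21839526zzz BONE-ANCHORED HEARING AID BAHA '
           'IN PATIENTS WITH TREACHER COLLINS SYNDROME')
keys = remove_stopwords(cleaned)
print("Split Input: ", keys)

# The keyword survives, so its left and right neighbours can be collected.
i = keys.index('AID')
print("Before:", keys[i - 1], "After:", keys[i + 1])
```

Because the membership test now compares whole tokens, 'SAID' in the set no longer causes 'AID' to be dropped. Note that text.split() with no argument also collapses runs of whitespace, which differs slightly from split(' ').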

For example, with s = "['THEY', 'THEM']", 'HE' in s is True. With s = ['THEY', 'THEM'], 'HE' in s is False. The former is a string whose content merely looks like Python list syntax. The latter is an actual Python list.
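The difference can be checked directly in an interpreter. The variable names as_string and as_list below are only illustrative.

```python
as_string = "['THEY', 'THEM']"   # a str that merely looks like a list
as_list = ['THEY', 'THEM']       # an actual list of two strings

# On a str, `in` tests for a substring.
print('HE' in as_string)    # True: 'HE' occurs inside 'THEY'

# On a list (or set), `in` tests whether some element equals the value.
print('HE' in as_list)      # False: no element equals 'HE'

# The same effect with two entries taken from the question's stop-word list.
print('AID' in "['SAID', 'SAME']")   # True: 'AID' occurs inside 'SAID'
print('AID' in {'SAID', 'SAME'})     # False: exact match only
```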