Grouping strings by substrings in MySQL/Python or MySQL/.NET

Time: 2010-08-19 02:54:20

Tags: c# python mysql

The data will be stored in a MySQL database, like this:

5911    CD  $4.99   Eben, Landscapes of Patmos {w.Martin Lenniger, percussion}; 2 Choral Phantasies; Laudes. (All w.Sieglinde Ahrens, organ)
5913    CD  $5.99   Turina, Sevilliana; Rafaga; Hommage a Tarrega; Sonata. Rodrigo, 3 Piezas Espanolas; En Los Trigales; Sarabande Lointaine. (Eric Hill, guitar^)
145460  CD  $13.98  Wagner, The Flying Dutchman. (Hans Hotter, Astrid Varnay, Set Svanholm et al. Cond. Reiner. Rec.1950. PLEASE NOTE: Limited-pressing CDRs)
145461  CD  $13.98  Montemezzi, L'Amore dei Tre Re. (Virgilio Lazzari, Dorothy Kirsten, Charles Kullman, Robert Weede, Leslie Chabay et al. Cond. Giuseppe Antonicelli. Rec. 1949. PLEASE NOTE: Limited-pressing CDRs)
145462  CD  $13.98  Ponchielli, La Gioconda. (Zinka Milanov, Giacomo Vaghi, Leonard Warren, Rise Stevens, Richard Tucker, Margaret Harshaw et al. Cond. Emil Cooper. Rec. 1946. PLEASE NOTE: Limited-pressing CDRs)
145465  CD  $5.99   ' Yankele: Yiddish Songs'. (16 titles incl. Az der Rebe, Rozhinkes mit Mandlekh, Shabes, Yankele, Belz, Di Grine Kuzine. Moshe Leiser, voice and guitar. Ami Flammer, violin. Gerard Barreaux, accordion. Rec. 'live', Lyon Opera. Total time: 78')
145467  CD  $4.99   Brahms, Piano Trios 2 & 3. (Trio Bamberg: Evgeny Schuk, violin; Stephan Gerlinghaus, cello. Robert Benz, piano. Rec. Nuremberg, 4/7/2000. Total time: 51'45')
145468  CD  $4.99   Gaubert, Piece Romantique; Trois Aquarelles. Debussy, Premier Trio in G. Francaix, Trio. (Trio Cantabile: Hans-Jorg Wegner, flute. Guido Larisch, cello. Christiane Kroeker, piano. Rec. Hannover, 3/2001. Total time: 62'35')
145469  CD  $4.99   Gattermeyer, Heinrich [b.1923]: Ophelias Schattentheater [text by Michael Ende]. Matthias Drude [b.1960], Jorinde und Joringel. Christoph J. Keller [b.1959], Die Kristallkugel [both texts by Brother Grimm]. (Helmut Thiele, narrator w.Bernd-Christian Schulze, piano. Total time: 68'08')
145470  CD  $2.99   Morrill, Dexter [b.1938]: Dance Bagatelles for Viola & Piano; Three Lyric Pieces for Violin and Piano [Laura Klugherz, viola & violin. Jill Timmons, piano]; Fantasy for Solo Cello [James Kirkwood, cello]; String Quartet #2 [Tremont String Quartet]. (Total time: 51'03')
145471  CD  $2.99   Werntz, Julia: String Trio with Homage to Chopin [Curtis Macomber, violin. Lois Martin, viola. Ted Mook, cello]; 'To You Strangers'- Five Poems of Dylan Thomas for Mezzo-Soprano Solo [Christina Ascher]; Piano Piece [John McDonald]. John Mallia, Lock [Stephanie Kay, clarinet]; Poor Denizens of Hell [chamber ensemble/ Daniel Hosken]; Plexus 2. (Aura Group for New Music)
145472  CD  $2.99   Morrill, Dexter [b.1938]- 'Music for Trumpets': 'Ponzo' for Two Trumpets; 'Nine Pieces' for Solo Trumpet; 'TARR' for Four Trumpets & Computer; 'Studies' for Trumpet & Computer; 'Trumpet Concerto' for Trumpet & Piano. (Mark Ponzo, trumpet with Barbara Butler [trumpet] & William Koehler, piano. Total time: 52'02')
145473  CD  $2.99   Kallstrom, Michael [b.1956]: 'Stories'. (A chamber opera for solo performer with puppets and electronic tape based on Old Testament stories)
145474  CD  $2.99   Carosio, Vailati, Lechi, Ponchielli, D'Alessandro, Sterzati, Riva, Pucci, Casazza, Denti, Gnaga, Anelli, Feroldi: 'The Mandolins of Stradivari'. (16 pieces for mandolin ensemble et al. Ugo Orlandi, mandolin. Alessandro Bono, guitar. Maura Mazzonetto, piano. Giampaolo Baldin, baritone. Quartetto romantico a plettro 'Umbert Sterzati'. Orchestra di Mandolini e Chitarre 'Citta di Brescia'/ Mandonico. Total time: 77'19')
145475  CD  $3.99   Rachmaninov, Symphony #3; Symphonic Dances. (St. Petersburg Philharmonic/ Jansons. Total time: 72'16')

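I need to group each title with 4 other titles that share common words. For example, I would like to group together 4 CDs that contain the words BEETHOVEN and MOZART in the string.

However, I do not want to specify the words it should group by. I would like this to be done in some kind of artificial-intelligence way.

Here is what I think the algorithm should look like:

  1. Build a frequency distribution of all the words
  2. Throw out any words that are used frequently in English (like if, or; where can I get a list of these words?)
  3. Start grouping by the least frequently occurring words
  4. Does anyone know of any sensible grouping methods?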

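1 answer:

Answer 0 (score: 2)

Re (2), what you want are known as "stop words". For example, in NLTK (which is Python, though I imagine there are C# equivalents), per chapter 2 of its excellent online book: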

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always',
 ...]
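
As a rough illustration of points 1 and 2, the sketch below counts in how many titles each non-stop word appears. It uses only the standard library; the tiny `STOP_WORDS` set is a stand-in assumed for this example (the NLTK list above could be substituted), and `tokenize` and `document_frequency` are hypothetical helper names, not part of NLTK.

```python
import re
from collections import Counter

# Stand-in stop-word list (an assumption for this sketch); in practice
# stopwords.words('english') from NLTK could be used instead.
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "with"}

def tokenize(title):
    """Lowercase a title and return its distinct non-stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}

def document_frequency(titles):
    """Count in how many titles each non-stop word appears."""
    counts = Counter()
    for title in titles:
        counts.update(tokenize(title))
    return counts

titles = [
    "Brahms, Piano Trios 2 & 3",
    "Debussy, Premier Trio in G",
    "Morrill, Fantasy for Solo Cello and Piano",
]
df = document_frequency(titles)
```

Counting each word once per title (rather than once per occurrence) matches the grouping goal, since what matters is how many titles share a word.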

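The book I am quoting can also help with your point 1, but point 3 is really a different field: clustering. You want a very particular kind of clustering (with a specified, identical cluster size), so existing algorithms may not fit out of the box, but designing one along the lines you mention is not too hard.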

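Basically, you want each word to be worth a "score" that is higher for words that are rarer in English (NLTK, or any equally powerful natural-language-processing toolkit for C#, can of course help you with that). Minus the logarithm of the word's frequency, for example, could be a start.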
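
A minimal sketch of that scoring idea, assuming a table of word counts is already available. The counts below are made up for illustration; in practice they might come from a reference English corpus. `rarity_scores` is a hypothetical helper name.

```python
import math

def rarity_scores(word_counts):
    """Map each word to -log(relative frequency); rarer words score higher."""
    total = sum(word_counts.values())
    return {w: -math.log(c / total) for w, c in word_counts.items()}

# Hypothetical counts, for illustration only.
counts = {"piano": 6, "trio": 3, "mandolin": 1}
scores = rarity_scores(counts)
```

Here "mandolin" gets the highest score because it is the rarest word in the table.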

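By the specs you mention, you only need to score non-stop words that occur in at least five documents, so the number of meaningful words should be quite small, and an exhaustive search may even be feasible.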
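
One possible starting point is sketched below. Note that it is a simple greedy heuristic, not the exhaustive search mentioned above: it keeps only words that occur in at least `group_size` documents, then forms groups starting from the rarest such word. The function name `greedy_groups` and the input layout (document id mapped to a set of non-stop words) are assumptions made for this example.

```python
from collections import defaultdict

def greedy_groups(doc_words, group_size=5):
    """Greedily form groups of `group_size` documents sharing a word.

    doc_words maps a document id to its set of non-stop words.
    Returns (groups, leftover_ids); each group is (shared_word, [ids]).
    """
    # Inverted index: word -> ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, words in doc_words.items():
        for word in words:
            index[word].add(doc_id)

    # Only words occurring in at least `group_size` documents can define a group.
    candidates = sorted(
        (w for w in index if len(index[w]) >= group_size),
        key=lambda w: (len(index[w]), w),  # rarest first, ties alphabetical
    )

    unassigned = set(doc_words)
    groups = []
    for word in candidates:
        available = sorted(index[word] & unassigned)
        while len(available) >= group_size:
            members, available = available[:group_size], available[group_size:]
            groups.append((word, members))
            unassigned -= set(members)
    return groups, sorted(unassigned)

docs = {
    1: {"brahms", "piano"}, 2: {"piano", "trio"}, 3: {"piano", "cello"},
    4: {"piano", "violin"}, 5: {"piano", "flute"}, 6: {"mandolin"},
}
groups, leftover = greedy_groups(docs)
```

Documents that never end up in a full group are returned separately, so the caller can decide what to do with them.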

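In fact, the biggest problem may be a different one: what if there is a set of fewer than 5 documents that share no non-stop word with any others? The possibility suggests you will have to relax your specs in some way. Since I know nothing about your application I cannot make specific suggestions, but it could be anything from allowing groups with a number of documents other than 5 to relaxing the grouping criteria.

Or are you willing to simply diagnose that certain cases make your strict constraints unsatisfiable, and give an error message instead of any result?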
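
If reporting an error is acceptable, a check along the following lines could flag the problem early. It is only a sketch under assumptions: the `UngroupableError` class and `check_groupable` function are hypothetical names, and the test is a necessary condition only (a document needs at least one word that occurs in `group_size` or more documents), so passing it does not guarantee that a full grouping exists.

```python
from collections import Counter

class UngroupableError(ValueError):
    """Raised when some documents cannot belong to any full-size group."""

def check_groupable(doc_words, group_size=5):
    """Raise UngroupableError if a document shares no word with enough others.

    This is only a necessary condition: passing the check does not prove
    that a complete partition into groups exists.
    """
    # Each document contributes a set of words, so this is a document frequency.
    df = Counter(w for words in doc_words.values() for w in words)
    stranded = sorted(
        doc_id for doc_id, words in doc_words.items()
        if not any(df[w] >= group_size for w in words)
    )
    if stranded:
        raise UngroupableError(f"no group possible for documents: {stranded}")

docs = {1: {"piano"}, 2: {"piano"}, 3: {"piano"}, 4: {"piano"},
        5: {"piano"}, 6: {"mandolin"}}
try:
    check_groupable(docs)
    message = "ok"
except UngroupableError as exc:
    message = str(exc)
```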