NLTK-Python: extracting names from a CSV

Asked: 2018-08-03 14:23:17

Tags: python csv file-handling

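I have a CSV file that contains the text of articles in different rows.

For example, column 1 contains:

Hello, I am John

Tom has a dog

...more text.

I am trying to extract first and last names from these texts, and I can do that if I copy and paste a single text into the code. But I don't know how to read the CSV in the code so that it processes the text in each row and extracts the first and last names.

This is my code, with the text it processes embedded in it: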

import operator,collections,heapq
import csv
import pandas 
import json
import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    # Tokenize, POS-tag, and chunk the text into named entities.
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    person_list = []
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        # Each leaf is a (token, tag) pair; keep only the token.
        person = [leaf[0] for leaf in subtree.leaves()]
        if len(person) > 1:  # skip single-token names (lone first names or surnames)
            name = ' '.join(person)
            if name not in person_list:
                person_list.append(name)

    return person_list

text = """
M.F. Husain, Untitled, 1973, oil on canvas, 182 x 122 cm. Courtesy the Pundole Family Collection

In her essay ‘Worlding Asia: A Conceptual Framework for the First Delhi Biennale’, Arshiya Lokhandwala explores Gayatri Spivak’s provocation of ‘worlding’, which has been defined as imperialism’s epistemic violence of inscribing meaning upon a colonized space to bring it into the world through a Eurocentric framework. Lokhandwala extends this concept of worlding to two anti-cartographical terms: ‘de-worlding’, rejecting or debunking categories that are no longer useful such as the binaries of East-West, North-South, Orient-Occidental, and ‘re-worlding’, re-inscribing new meanings into the spaces that have been de-worlded to create one’s own worlds. She offers de-worlding and re-worlding as strategies for active resistance against epistemic violence of all forms, including those that stem from ‘colonialist strategies of imperialism’ or from ‘globalization disguised within neo-imperialist practices’.

Lokhandwala writes: Fourth World. The presence of Arshiya is really the main thing here. 

Re-worlding allows us to reach a space of unease performing the uncanny, thereby locating both the object of art and the postcolonial subject in the liminal space, which prevents these categorizations as such… It allows an introspected view of ourselves and makes us seek our own connections, and look at ourselves through our own eyes.

In a recent exhibition on the occasion of the seventieth anniversary of India’s Independence, Lokhandwala employed the term to seemingly interrogate this proposition: what does it mean to re-world a country through the agonistic intervention of art and activism? What does it mean for a country and its historiography to re-world? What does this re-worlded India, in active resistance and a state of introspection, look like to itself?

The exhibition ‘India Re-Worlded: Seventy Years of Investigating a Nation’ at Gallery Odyssey in Mumbai (11 September 2017–21 February 2018) invited artists to select a year from the seventy years since the country’s independence that had personal import or resonated with them because of the significance of the events that occurred at the time. The show featured works that responded to or engaged with these chosen years. It captured a unique history of post-independent India told through the perspective of seventy artists. The works came together to collectively reflect on the history and persistence of violence from pre-independence to the present day and made reference to the continued struggle for political agency through acts of resistance, artistic and otherwise. Through the inclusion of subaltern voices, imagined geographies, particular experiences, solidarities and critical dissent, the exhibition offered counter-narratives and multiple histories.

Anita Dube, Missing Since 1992, 2017, wood, electrical wire, holders, bulbs, voltage stabilizers, 223 x 223 cm. Courtesy the artist and Gallery Odyssey

Lokhandwala says she had been thinking hard about an appropriate response to the seventy years of independence. ‘I wanted to present a new curatorial paradigm, a postcolonial critique of the colonisation and an affirmation of India coming into her own’, she says. ‘I think the fact that I tried to include seventy artists to [each take up] one year in the lifetime of the nation was also a challenging task to take on curatorially.’

Her previous undertaking ‘After Midnight: Indian Modernism to Contemporary India: 1947/1997’ at the Queens Museum in New York in 2015 juxtaposed two historical periods in Indian art: Indian modern art that emerged in the post-independence period from 1947 through the 1970s, and contemporary art from 1997 onwards when the country experienced the effects of economic liberalization and globalization. The 'India Re-Worlded' exhibition similarly presented art practices that emerged from the framework of postcolonial Indian modernity. It attempted to explore the self-reflexivity of the Indian artist as a postcolonial subject and, as Lokhandwala described in the curatorial note, the artists’ resulting ‘sense of agency and renewed connection with the world at large’. The exhibition included works by Progressive Artists' Group core members F.N. Souza, S.H. Raza, M.F. Husain and their peers Krishen Khanna, Tyeb Mehta and V.S. Gaitonde, presented under the year in which they were produced. Other important and pioneering pieces included work from Somnath Hore’s paper pulp print series Wounds (1970); a blowtorch on plywood work by abstractionist Jeram Patel, who was one of the founding members of Group 1890 ; and a video documenting one of Rummana Husain’s last performances.

The methodology of their display removed the didactic, art historical preoccupation with chronology and classification, instead opting to intersperse them amongst contemporary works. This fits in with Lokhandwala’s curatorial impulses and vision: to disrupt and resist single narratives, to stage dialogues and interactions between the works, to offer overlaps, intersections and nuances in the stories, but also in the artistic impetuses.

Jeram Patel, Untitled, 1970, blowtorch Fourht World on plywood, 61 x 61 cm. Courtesy the artist and Gallery Odyssey

The show opened with Jitish Kallat’s Death of Distance (2006), then we have Arshiya, which through lenticular prints presented two overlaid found texts from 2005 and 2006. One was a harrowing news story of a twelve-year-old Indian girl committing suicide after her mother tells her she cannot afford one rupee – two US cents – for a school meal. The other one was a news clipping in which the head of the state-run telecommunications company announces a new one-rupee-per-minute tariff plan for interstate phone calls and declares the scheme as ‘the death of distance’. The images offer two realities that are distant from and at odds with each other. They highlight an economic disparity heightened by globalization. A rupee coin, enlarged to a human scale and covered in black lead, stood poised on the gallery floor in front of the prints.

Bose Krishnamachari chose 1962, the year of his birth, to discuss the relationship between memory and age. As a visual representation of the country’s past through a timeline, within which he situated his own identity-questioning experiences as an artist, his work epitomized the themes and intentions of the exhibition. In Shilpa Gupta’s single channel video projection 100 Hand drawn Maps of India (2007–8) ordinary Indian people sketch outlines of the country from memory. The subjective maps based on the author’s impression and perception of space show how each person sees the country and articulates its borders. The work seems to ask, what do these incongruent representations reveal about our collective identities and our ideas about nationhood?

The repetition of some of the years selected, or even the absence of certain years, suggested that the parameters set by the curatorial concept sought to guide rather than clamp down on. This allowed greater freedom for the artists and curator, and therefore more considered and wide responses.

Surekha’s photographic series To Embrace (2017) celebrated the Chipko tree-hugging movement that originated on 25 March 1974, when 27 women from Reni village in Uttar Pradesh in northern India staged a self-organised, non-violent resistance to the felling of trees by clinging to them and linking arms around them. The photographs showed women embracing the branches of the giant, 400-year-old Dodda Alada Mara (Big Banyan Tree) in rural Bengaluru – paying a homage to both the pioneering eco-feminist environmental movement and the grand old tree.

Anita Dube’s Missing Since 1992 (2017) hung from the ceiling like a ghost of a terrible, dark past. Its electrical wires and bulbs outlined a sombre dome to represent the demolition of the Babri Masjid on 6 December 1992, which Dube calls ‘the darkest day I have experienced as a citizen’. This piece was one of several works in the exhibition that dealt with this event and the many episodes of communal riots that followed. These works document a decade when the country witnessed economic reform and growth but also the rise of a religious right-wing.

Riyas Komu, Fourth World, 2017, rubber and metal, 244 x 45 cm each. Courtesy the artist and Gallery Odyssey 

Near the end of the exhibition, Riyas Komu’s sculptural installation Fourth World (2017) alerted us to the divisive forces that are threatening to dismantle the ethical foundations of the Republic symbolized by its official emblem, the Lion Capital – a symbol seen also on the blackened rupee coin featured in Kallat’s work – and in a way rounded off the viewing experience.

The seventy works that attempted to represent seventy years of the country’s history built a dense and complicated network of voices and stories, and also formed a cross section of the art emerging during this period. Although the show’s juxtaposition of modern and contemporary art made it seem like an extension of the themes presented in the curator’s previous exhibition at the Queens Museum, here the curatorial concept made the process of staging the exhibition more democratic blurring the sequence of modern and contemporary Indian art. Furthermore, the multi-pronged curatorial intentions brought renewed criticality to the events of past and present, always underscoring the spirit of resistance and renegotiation as the viewer could actively de-world and re-world.
"""

names = get_human_names(text)
print ("LAST, FIRST")

namex = []
for name in names:
    parsed = HumanName(name)  # parse once, then reorder as "last first"
    last_first = parsed.last + ' ' + parsed.first
    print(last_first)
    namex.append(last_first)
print(namex)


print('Saving the data to the json file named Names')
try:
    with open('Names.json', 'w') as outfile:
        json.dump(namex, outfile)
except Exception as e:
    print(e)

So I want to remove all the text from the code and have the code process the text from my CSV instead.

Thanks a lot :)

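2 answers:

Answer 0: (score: 0)

CSV stands for comma-separated values, a plain-text format for representing tabular data. Commas serve as column separators and line breaks as row separators. Your string does not look like a real CSV file. Regardless of the extension, you can still read the file as plain text like this: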

with open('your_file.csv', 'r') as f:
    my_text = f.read()

The contents of the file are now available as my_text in the rest of your code.
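If each row of the file holds one article, the csv module can hand the rows to the extractor one at a time instead of treating the whole file as a single string. The following is only a minimal sketch under assumptions: names_from_csv and capitalized_pairs are hypothetical helpers, not part of the question's code or of any library, and the article text is assumed to sit in the first column. capitalized_pairs is a crude stand-in so the sketch runs without NLTK; in practice the question's get_human_names would be passed in its place.

```python
import csv
import io


def names_from_csv(csv_file, extract, column=0):
    """Run `extract` on one column of every row; return unique names in order."""
    found = []
    for row in csv.reader(csv_file):
        if len(row) <= column:
            continue  # skip blank or short rows
        for name in extract(row[column]):
            if name not in found:
                found.append(name)
    return found


def capitalized_pairs(text):
    """Hypothetical stand-in extractor: adjacent pairs of capitalized words."""
    words = text.replace(",", " ").split()
    return [a + " " + b for a, b in zip(words, words[1:])
            if a.istitle() and b.istitle()]


# In-memory sample so the sketch is self-contained; a real file would be
# opened with open('your_file.csv', newline='', encoding='utf-8').
sample = io.StringIO('"my name is John Smith"\n"Tom Baker has a dog"\n')
result = names_from_csv(sample, capitalized_pairs)
print(result)  # ['John Smith', 'Tom Baker']
```

Passing the extractor as an argument keeps the file handling separate from the NLTK logic, so get_human_names can be dropped in unchanged.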

Answer 1: (score: 0)

pandas has a read_csv function, which returns a DataFrame:

yourText= pandas.read_csv("csvFile.csv")
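Since read_csv returns a DataFrame rather than a string, the text still has to be pulled out of a column before it can be processed. A minimal sketch of that step, under assumptions: names_from_frame and capitalized_pairs are hypothetical helpers, the column is assumed to be named "text", and capitalized_pairs is only a stand-in extractor so the example runs without NLTK. The question's get_human_names could be passed instead.

```python
import io

import pandas as pd


def names_from_frame(frame, extract, column):
    """Run `extract` on every non-empty cell of `column`; return unique names."""
    found = []
    for text in frame[column].dropna():
        for name in extract(str(text)):
            if name not in found:
                found.append(name)
    return found


def capitalized_pairs(text):
    """Hypothetical stand-in extractor: adjacent pairs of capitalized words."""
    words = text.replace(",", " ").split()
    return [a + " " + b for a, b in zip(words, words[1:])
            if a.istitle() and b.istitle()]


# In-memory sample with a header row; a real call would pass a file path.
sample = io.StringIO('text\n"my name is John Smith"\n"Tom Baker has a dog"\n')
frame = pd.read_csv(sample)
result = names_from_frame(frame, capitalized_pairs, "text")
print(result)  # ['John Smith', 'Tom Baker']
```

If the file has no header row, read_csv can be called with header=None, in which case the first column is addressed as 0 rather than by name.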