Trying to parse Twitter JSON from a text file

Time: 2013-05-08 22:07:05

Tags: python json

I am new to Python, and I am trying to parse "tweets" from a text file for analysis.

My test file contains many tweets, one per line. Here is an example:

{"created_at":"Mon May 06 17:39:59 +0000 2013","id":331463367074148352,"id_str":"331463367074148352","text":"Extra\u00f1o el trabajo en las aulas !! * se jala los cabellos","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":276765971,"id_str":"276765971","name":"Shiro","screen_name":"_Shira3mmanueL_","location":"","url":null,"description":null,"protected":false,"followers_count":826,"friends_count":1080,"listed_count":5,"created_at":"Mon Apr 04 01:36:52 +0000 2011","favourites_count":1043,"utc_offset":-21600,"time_zone":"Mexico City","geo_enabled":true,"verified":false,"statuses_count":28727,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"1A1B1F","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme9\/bg.gif","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme9\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3608152674\/45133759fb72090ebbe880145d8966a6_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3608152674\/45133759fb72090ebbe880145d8966a6_normal.jpeg","profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/276765971\/1367525440","profile_link_color":"2FC2EF","profile_sidebar_border_color":"181A1E","profile_sidebar_fill_color":"252429","profile_text_color":"666666","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[19.30303082,-99.54709768]},"coordinates":{"type":"Point","coordinates":[-99.54709768,19.30303082]},"place":{"id":"1d23a12800a574a8","url":"http:\/\/api.twitter.com\/1\/geo\/id\/1d23a12800a574a8.json","place_type":"city","name":"Lerma","full_name":"Lerma, M\u00e9xico","country_code":"MX","country":"M\u00e9xico","bounding_box":{"type":"Polygon","coordinates":[[[-99.552193,19.223171],[-99.552193,19.4343],[-99.379483,19.4343],[-99.379483,19.223171]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"es"}

My code is:

import json
import re
import sys

pattern_split = re.compile(r"\W+")


def sentment_tbl(sent_file):
    # Read in AFINN-111.txt
    tbl = dict(map(lambda (w, s): (w, int(s)), [
    ws.strip().split('\t') for ws in open(sent_file)]))
    return tbl

def sentiment(text,afinn):
    # Word splitter pattern 
    words = pattern_split.split(text.lower())
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        sentiment = float(sum(sentiments))
    else:
        sentiment = 0
    return sentiment

def main():

    sent_file = sys.argv[1]
    afinn = sentment_tbl(sent_file)

    tweet_file = (sys.argv[2])
    with open(tweet_file) as f:
        for line_str in f:
            print type(line_str)
            print line_str
            tweet = json.loads(line_str.read())
            print("%6.2f %s" % (sentiment(line_str,afinn)))


    #Test: text = "Finn is stupid and idiotic"
    #print("%6.2f %s" % (sentiment(text,afinn), text))


if __name__ == '__main__':
    main()

I am getting an error message when I run this.

I have a feeling I am mixing apples and oranges, and I would appreciate some help from someone more experienced.

Thanks, Chris

3 answers:

Answer 0 (score: 1)

Instead of looping over the file and parsing each line as JSON yourself, why not let the built-in JSON library read the file, like this:

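import json
# json.load takes an open file; json.loads takes a string
with open(tweet_file, 'r') as f:
    jsonObj = json.load(f)
# jsonObj now holds the parsed data, provided the file is a single JSON document (e.g. an array of tweets)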
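To make the distinction concrete, here is a minimal sketch of how `json.load` and `json.loads` differ, and of what happens when a file with one object per line is parsed in a single call. It is written for Python 3, and the `sample` string and the in-memory file are made-up stand-ins for the real tweet file.

```python
import io
import json

# Hypothetical two-line sample standing in for the tweet file.
sample = '{"text": "first"}\n{"text": "second"}\n'

# json.load reads from a file-like object; json.loads takes a string.
single = json.load(io.StringIO('{"text": "only one"}'))

# Two objects on separate lines are not one JSON document, so this raises.
try:
    json.loads(sample)
    whole_file_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    whole_file_ok = False

print(single["text"], whole_file_ok)
```

So reading the whole file at once should only work if the file was written as one document, such as a JSON array; a file like the one in the question still has to be parsed line by line.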

Answer 1 (score: 1)

If you have written multiple tweets to a file, e.g.:

 o.write(tweet1)
 o.write(tweet2)

then you also have to read the file back line by line, because the json module cannot decode, in a single call, a file that contains several objects written one per line:

import json

tweets = []
with open('test.txt', 'r') as f:
    for line in f:
        tweets.append(json.loads(line))  # each line holds one JSON object
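If the file might also contain blank or damaged lines (an assumption; the sample in the question shows neither), the same loop can be made more tolerant. The helper name `iter_tweets` below is hypothetical, and the sketch targets Python 3.

```python
import json

def iter_tweets(lines):
    """Yield one parsed object per non-blank line; skip lines that are not valid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank line: nothing to parse
        try:
            yield json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue  # malformed line: skip it rather than abort the whole run
```

It accepts any iterable of strings, so an open file works: `with open('test.txt') as f: tweets = list(iter_tweets(f))`. Silently skipping bad lines is a design choice; counting or logging them may be preferable for a real analysis.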

Answer 2 (score: 0)

You need to pass a string to json.loads:

tweet = json.loads(line_str)

because line_str is already a string, and a string has no read() method.

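After that, make sure you pass tweet, or some detail taken from tweet, to sentiment() for further processing. Note that you are currently calling sentiment() with line_str, and tweet is not used at all yet.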
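As a rough sketch of that last step, the tweet's "text" field can be pulled out of the parsed dict and handed to sentiment(). This is written for Python 3 rather than the Python 2 of the question; the tiny `afinn` table and its scores are made up for illustration, and `line_str` is a hypothetical stand-in for one line of the tweet file.

```python
import json
import re

pattern_split = re.compile(r"\W+")

def sentiment(text, afinn):
    """Sum the scores of the words in `text`; words missing from the table score 0."""
    words = pattern_split.split(text.lower())
    return float(sum(afinn.get(word, 0) for word in words))

# Hypothetical stand-ins for the AFINN table and for one line of the tweet file.
afinn = {"stupid": -2, "idiotic": -3}
line_str = '{"text": "Finn is stupid and idiotic"}'

tweet = json.loads(line_str)   # parse the line into a dict
text = tweet.get("text", "")   # fall back to "" if a line has no "text" key
score = sentiment(text, afinn)

# The format string has two placeholders, so it needs two values.
print("%6.2f %s" % (score, text))
```

Scoring only the "text" field keeps the other JSON fields (URLs, colour codes, screen names) from being split into words and looked up in the table.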