Parsing email addresses out of JSON data

Asked: 2013-12-17 20:12:58

Tags: python json python-2.7

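I'm new to programming and to Python. I've been working on this problem for a few days and, sadly, haven't been able to solve it. I've gotten so close, but still no success...

Below is the raw data I'm working with, before my code processes it. (My code makes the call and gets this data back from the Twitter API.)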

{"metadata":{"result_type":"recent","iso_language_code":"et"},"created_at":"Tue Dec 03 01:41:53 +0000 2013","id":407686093790662656,"id_str":"407686093790662656","text":"@emblems123 justinbieberfan12599@gamil.com","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":407677310821613569,"in_reply_to_status_id_str":"407677310821613569","in_reply_to_user_id":2201997043,"in_reply_to_user_id_str":"2201997043","in_reply_to_screen_name":"emblems123","user":{"id":1220098345,"id_str":"1220098345","name":"PYD","screen_name":"bieberfan12599","location": 

Here is my code:

import csv
import json
import oauth2 as oauth
import urllib
import sys
import requests
import time
import re

CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_KEY = ""
ACCESS_SECRET = ""

class TwitterSearch:
    def __init__(self,
        ckey    = CONSUMER_KEY,
        csecret = CONSUMER_SECRET,
        akey    = ACCESS_KEY,
        asecret = ACCESS_SECRET,
        query   = 'https://api.twitter.com/1.1/search/tweets.{mode}?{query}'
    ):
        consumer     = oauth.Consumer(key=ckey, secret=csecret)
        access_token = oauth.Token(key=akey, secret=asecret)
        self.client  = oauth.Client(consumer, access_token)
        self.query   = query

    def search(self, q, mode='json', **queryargs):
        queryargs['q'] = q
        query = urllib.urlencode(queryargs)
        return self.client.request(self.query.format(query=query, mode=mode))

def write_csv(fname, rows, header=None, append=False, **kwargs):
    filemode = 'ab' if append else 'wb'
    with open(fname, filemode) as outf:
        out_csv = csv.writer(outf, **kwargs)
        if header:
            out_csv.writerow(header)
        out_csv.writerows(rows)

def main():
    ts = TwitterSearch()
    response, data = ts.search('@gmail.com', result_type='recent')
    js = json.loads(data)

    messages = ([msg['created_at'], msg['text'], msg['user']['id']] for msg in js.get('statuses', []))
    write_csv('twitter_gmail.csv', messages, append=True)

if __name__ == '__main__':
    main()

It produces the following data:

Tue Dec 17 19:57:22 +0000 2013,"@soccerdotcom work for DQB-Planning campaign 4 RealMadrid,who should I approach to further discuss this? iturraldedebracamonte@gmail.com",399224668

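I would like it to produce the output below instead: extract the email address from the text and write only that, rather than the whole message.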

Tue Dec 17 19:57:22 +0000 2013, "iturraldedebracamonte@gmail.com",399224668

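I've gotten very close with regex and split, but I still can't get it right.

Any ideas, or a direction I should take, would be very helpful. Can I put a regular expression inside the generator while parsing the JSON?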

1 Answer:

Answer 0: (score: 1)

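You can put anything in a generator expression that is itself a valid expression. The question is, do you really want to?

Suppose you used a regular expression and .findall():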

email_re = re.compile(r'<some expression>')

messages = ([msg['created_at'], ' '.join(email_re.findall(msg['text'])), msg['user']['id']] for msg in js.get('statuses', []))

This makes your one line of code rather long and hard to read.
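The pattern itself is left as a placeholder above. As a rough sketch of what could go there, the snippet below uses a deliberately loose pattern of my own (an assumption, not something from the question, and not a full RFC 5322 validator) and runs it against the tweet text quoted in the question.

```python
import re

# Hypothetical pattern, chosen for illustration only: one or more
# local-part characters, an "@", then a dotted domain name.
email_re = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

# Tweet text copied from the sample output in the question.
text = ('@soccerdotcom work for DQB-Planning campaign 4 RealMadrid,who should I '
        'approach to further discuss this? iturraldedebracamonte@gmail.com')

# findall() returns every non-overlapping match as a list of strings.
# The leading "@soccerdotcom" mention is skipped because nothing precedes
# its "@", so the local part cannot match.
emails = email_re.findall(text)
print(emails)
```

Because the only group in the pattern is non-capturing, findall() returns whole addresses rather than group contents.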

I would break the extraction out into a function here:

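def extract_info(msg):
    # Pull the three CSV columns out of one status dict.
    created_at = msg['created_at']
    user_id = msg['user']['id']  # the id is nested inside the 'user' object
    text = msg['text']           # the key is 'text', as in the JSON above
    # findall() returns a list; join so several addresses share one column.
    emails = email_re.findall(text)
    return (created_at, ' '.join(emails), user_id)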

messages = (extract_info(msg) for msg in js.get('statuses', []))
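
To see the function-based approach end to end without calling the Twitter API, here is a minimal sketch. The `js` dict is a hand-written stand-in for `json.loads(data)`, built from the values quoted in the question, and the regex is a hypothetical, loose pattern rather than anything from the original post. Note that it reads `msg['text']` and `msg['user']['id']`, the key names that appear in the question's JSON.

```python
import re

# Hypothetical, intentionally simple email pattern (an assumption).
email_re = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def extract_info(msg):
    """Return (created_at, space-joined emails, user id) for one status."""
    emails = email_re.findall(msg['text'])
    return (msg['created_at'], ' '.join(emails), msg['user']['id'])

# Stand-in for json.loads(data); only the keys the function needs.
js = {
    'statuses': [
        {'created_at': 'Tue Dec 17 19:57:22 +0000 2013',
         'text': '@soccerdotcom work for DQB-Planning campaign 4 RealMadrid,'
                 'who should I approach to further discuss this? '
                 'iturraldedebracamonte@gmail.com',
         'user': {'id': 399224668}},
        {'created_at': 'Tue Dec 03 01:41:53 +0000 2013',
         'text': '@emblems123 justinbieberfan12599@gamil.com',
         'user': {'id': 1220098345}},
    ]
}

# Same generator expression as above; list() just materializes it here.
messages = (extract_info(msg) for msg in js.get('statuses', []))
rows = list(messages)
for row in rows:
    print(row)
```

In the real script the generator can be handed straight to `write_csv('twitter_gmail.csv', messages, append=True)`. A tweet with no match yields an empty string in the middle column, since `' '.join([])` is `''`.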