Extract n characters after the first match of a word in a file

Time: 2019-03-01 13:06:36

Tags: python-3.x

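I am a beginner in Python. I have a file that contains a single line of data. I need to extract "n" characters that follow only the first occurrence of certain words. These words are not adjacent to each other in the line.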

Data file: {"id":"1234566jnejnwfw","displayId":"1234566jne","author":{"name":"abcd@xyz.com","datetime":15636378484,"displayId":"23423426jne","datetime":4353453453}

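I want the value that comes after the first match of "displayId" and before "author", i.e. 1234566jne. The same goes for "datetime".

I tried slicing the line at the index of the word and writing the remainder to another file, to be cleaned up further until only the exact value is left.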

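tmpFile = "tmpFile.txt"
tmpFileOpen = open(tmpFile, "w+")
displayId = "displayId"  # the word to search for

with open("data file") as openfile:
    for line in openfile:
        # Write everything that follows the first occurrence of the word
        tmpFileOpen.write(line[line.index(displayId) + len(displayId):])

tmpFileOpen.close()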

However, I am sure this is not a good approach to build on.

Can anyone help me?

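2 Answers:

Answer 0: (score: 1)

This answer works for any displayId whose format is similar to the one in your question. I decided not to load the file as JSON for this answer, because that is not needed to complete the task.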

import re

tmpFile = "tmpFile.txt"

with open('data_file.txt', 'r') as infile:
    lines = infile.read()

# Use a regex to find each displayId element
# example: "displayId":"1234566jne
# \W matches a non-word character, such as " and :
# \d matches a digit
# {6,8} matches between 6 and 8 digits
# [a-z] matches a lowercase ASCII letter
# {3} matches exactly 3 lowercase ASCII letters
# The parentheses capture only the value, so no stripping is needed
# (str.strip removes a set of characters, not a prefix, and could eat
# part of a value)
id_pattern = re.compile(r'\WdisplayId\W{3}(\d{6,8}[a-z]{3})')
id_results = id_pattern.findall(lines)

with open(tmpFile, "w+") as tmpFileOpen:
    # Loop through the captured values
    for display_id in id_results:
        # Write each id to the temp file on a separate line
        tmpFileOpen.write('{}\n'.format(display_id))

# output in tmpFile.txt
# 1234566jne
# 23423426jne
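
Since the question asks only for the first occurrence of each word and for a fixed number of characters after it, a plain string search may be enough. The following is a minimal sketch, not the original answer's method: `chars_after_first` and its `skip` parameter are hypothetical names introduced here, and the `skip` and `n` values are assumptions that fit the sample line only.

```python
def chars_after_first(text, word, n, skip=0):
    """Return the n characters that follow the first occurrence of word.

    skip is the number of separator characters (such as : and ") to jump
    over after the word. Returns None if the word is not present.
    """
    start = text.find(word)  # str.find returns the first match, or -1
    if start == -1:
        return None
    begin = start + len(word) + skip
    return text[begin:begin + n]


# Sample line copied from the question
line = '{"id":"1234566jnejnwfw","displayId":"1234566jne","author":{"name":"abcd@xyz.com","datetime":15636378484,"displayId":"23423426jne","datetime":4353453453}'

print(chars_after_first(line, '"displayId"', 10, skip=2))  # 1234566jne
print(chars_after_first(line, '"datetime"', 11, skip=1))   # 15636378484
```

Searching for the quoted word ('"displayId"') avoids matching it inside a longer key. If the length of the value can vary, capturing up to the closing delimiter with a regular expression is probably more robust than a fixed n.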

This alternative version does load the file as JSON, but it will fail if the structure of the JSON changes.

import json

tmpFile = 'tmpFile.txt'

# Load the JSON file
with open('data_file.txt') as infile:
    jdata = json.load(infile)

with open(tmpFile, "w+") as tmpFileOpen:
    # Find the first ID and write it to the temp file
    first_id = jdata['displayId']
    tmpFileOpen.write('{}\n'.format(first_id))

    # Find the second ID and write it to the temp file
    second_id = jdata['author']['displayId']
    tmpFileOpen.write('{}\n'.format(second_id))

# output in tmpFile.txt
# 1234566jne
# 23423426jne
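
One caveat with the JSON route: in the sample, the "author" object contains "datetime" twice, and Python's json module keeps the last value of a repeated key by default, whereas the question asks for the first. The sketch below uses the `object_pairs_hook` parameter of `json.loads` to keep the first value instead. `keep_first` is a hypothetical helper, and the sample line has a closing brace appended, on the assumption that the line as posted was cut short (its braces are unbalanced).

```python
import json


def keep_first(pairs):
    """Build a dict from (key, value) pairs, keeping the first value per key."""
    result = {}
    for key, value in pairs:
        result.setdefault(key, value)  # ignore later duplicates of a key
    return result


# Sample line from the question, with one closing brace appended (assumption)
line = '{"id":"1234566jnejnwfw","displayId":"1234566jne","author":{"name":"abcd@xyz.com","datetime":15636378484,"displayId":"23423426jne","datetime":4353453453}}'

data = json.loads(line, object_pairs_hook=keep_first)
print(data['displayId'])           # 1234566jne
print(data['author']['datetime'])  # 15636378484
```

Without the hook, `json.loads(line)['author']['datetime']` would be 4353453453, the last of the two values.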

Answer 1: (score: 0)

If I understand your question correctly, you can do the following:

import json

tmpFile = "tmpFile.txt"

with open("data.txt") as openfile, open(tmpFile, "w+") as tmpFileOpen:
    for line in openfile:
        # Load the JSON into a dict so it is easy to manipulate
        data = json.loads(line)
        # Write only the first 3 characters of the
        # `displayId` field to the temp file
        tmpFileOpen.write(data['displayId'][:3])

This works because the data in the file is JSON, but it will stop working if the format changes.
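
If the concern is that a key might move to a different nesting level, one possible way to reduce that dependency is to search the parsed data for the key instead of hard-coding its path. This is only a sketch under assumptions: `find_first` is a hypothetical helper, the input must still be valid JSON, and both sample strings below are made up for illustration.

```python
import json


def find_first(obj, key):
    """Return the first value stored under key, or None if it is absent.

    The current object is checked before any nested object or list,
    then children are searched depth-first in document order.
    """
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None


# Two illustrative layouts: the key at the top level and the key nested
flat = json.loads('{"id": "1", "displayId": "1234566jne"}')
nested = json.loads('{"id": "1", "meta": [{"displayId": "1234566jne"}]}')

print(find_first(flat, 'displayId')[:3])    # 123
print(find_first(nested, 'displayId')[:3])  # 123
```

A key whose stored value is null is treated as absent in this sketch, which may or may not suit the real data.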