Question

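Original question: I have a StringIO object. How can I convert it into a BytesIO?

Update: The more general question is how to convert a binary (encoded) file-like object into a decoded file-like object in Python 3.

The naive approach I came up with is: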

import io
sio = io.StringIO('wello horld')
bio = io.BytesIO(sio.read().encode('utf8'))
print(bio.read())  # prints b'wello horld'

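Is there a more elegant way of doing this?

For example, for the reverse direction (BytesIO -> StringIO) there is a class, io.TextIOWrapper, which does exactly that (see this answer).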
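
For reference, the reverse direction mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; the variable names are chosen here for the example and are not from the linked answer.

```python
import io

# Start from an encoded (bytes) stream.
bio = io.BytesIO('wello horld'.encode('utf-8'))

# TextIOWrapper decodes on the fly, so reads return str instead of bytes.
text_stream = io.TextIOWrapper(bio, encoding='utf-8')
text = text_stream.read()
print(text)
```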

Answer 1

A tool that converts a character stream into a byte stream could be generally useful, so here goes:

import io

class EncodeIO(io.BufferedIOBase):
  def __init__(self,s,e='utf-8'):
    self.stream=s               # not raw, since it isn't
    self.encoding=e
    self.buf=b""                # encoded but not yet returned
  def _read(self,s): return self.stream.read(s).encode(self.encoding)
  def read(self,size=-1):
    b=self.buf
    self.buf=b""
    if size is None or size<0: return b+self._read(None)
    ret=[]
    while True:
      n=len(b)
      if size<n:
        b,self.buf=b[:size],b[size:]
        n=size
      ret.append(b)
      size-=n
      if not size: break
      b=self._read(min((size+1024)//2,size))
      if not b: break
    return b"".join(ret)
  read1=read

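Obviously write could be defined symmetrically to decode its input and send it to the underlying stream, although then you have to deal with the case where you have enough bytes for only part of a character.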
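
One way to handle the partial-character case is an incremental decoder from the codecs module, which holds back incomplete byte sequences until the remaining bytes arrive. The sketch below is an assumption about how such a write side might look; the class name DecodeWriter and its finish method are hypothetical and not part of the answer's code.

```python
import codecs
import io


class DecodeWriter:
    """Hypothetical helper: accepts bytes and writes decoded text to a text stream."""

    def __init__(self, stream, encoding='utf-8', errors='strict'):
        self.stream = stream
        # The incremental decoder keeps the bytes of an incomplete character
        # in its internal state between calls.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)

    def write(self, data):
        text = self._decoder.decode(data)
        self.stream.write(text)
        return len(data)  # number of bytes accepted

    def finish(self):
        # final=True raises UnicodeDecodeError (with errors='strict')
        # if a partial character is still pending.
        self.stream.write(self._decoder.decode(b'', final=True))


sio = io.StringIO()
writer = DecodeWriter(sio)
writer.write(b'\xc3')      # first byte of 'ä', nothing is written yet
writer.write(b'\xa4bc')    # completes 'ä', then 'bc'
writer.finish()
result = sio.getvalue()
print(result)
```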

Answer 2

@foobarna's answer can be improved by inheriting from one of the io base classes:

import io
sio = io.StringIO('wello horld')


class BytesIOWrapper(io.BufferedReader):
    """Wrap a buffered bytes stream over TextIOBase string stream."""

    def __init__(self, text_io_buffer, encoding=None, errors=None, **kwargs):
        super(BytesIOWrapper, self).__init__(text_io_buffer, **kwargs)
        self.encoding = encoding or text_io_buffer.encoding or 'utf-8'
        self.errors = errors or text_io_buffer.errors or 'strict'

    def _encoding_call(self, method_name, *args, **kwargs):
        raw_method = getattr(self.raw, method_name)
        val = raw_method(*args, **kwargs)
        return val.encode(self.encoding, errors=self.errors)

    def read(self, size=-1):
        return self._encoding_call('read', size)

    def read1(self, size=-1):
        return self._encoding_call('read1', size)

    def peek(self, size=-1):
        return self._encoding_call('peek', size)


bio = BytesIOWrapper(sio)
print(bio.read())  # b'wello horld'

Answer 3

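As some have pointed out, you need to do the encoding/decoding yourself.

However, you can achieve this in an elegant way: implement your own counterpart to TextIOWrapper for string => bytes.

Here is such a sample: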

class BytesIOWrapper:
    def __init__(self, string_buffer, encoding='utf-8'):
        self.string_buffer = string_buffer
        self.encoding = encoding

    def __getattr__(self, attr):
        return getattr(self.string_buffer, attr)

    def read(self, size=-1):
        content = self.string_buffer.read(size)
        return content.encode(self.encoding)

    def write(self, b):
        content = b.decode(self.encoding)
        return self.string_buffer.write(content)

Which yields output like this:

In [36]: bw = BytesIOWrapper(StringIO("some lengt˙˚hyÔstring in here"))

In [37]: bw.read(15)
Out[37]: b'some lengt\xcb\x99\xcb\x9ahy\xc3\x94'

In [38]: bw.tell()
Out[38]: 15

In [39]: bw.write(b'ME')
Out[39]: 2

In [40]: bw.seek(15)
Out[40]: 15

In [41]: bw.read()
Out[41]: b'MEring in here'
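
Note that in the session above, read(15) returned 18 bytes: the size argument is passed to the wrapped StringIO, so it counts characters, not bytes. A small standalone sketch of that effect (the sample text is made up for illustration):

```python
import io

sio = io.StringIO('äöü')

# Two characters are requested from the text stream...
chunk = sio.read(2).encode('utf-8')

# ...but each of them encodes to two bytes in UTF-8.
n_bytes = len(chunk)
print(chunk, n_bytes)
```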

Hope this clears things up!

Answer 4

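It's interesting that, although the question seems reasonable, it's not that easy to come up with a practical reason why I would need to convert a StringIO into a BytesIO. Both are basically buffers, and you usually need only one of them to do some additional manipulation of either bytes or text.

I may be wrong, but I think your question is actually how to use a BytesIO instance when the code you want to pass it to expects a text file.

In that case, it is a common question, and the solution is the codecs module.

The two usual cases of using it are the following:

Compose a file object to read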

In [16]: import codecs, io

In [17]: bio = io.BytesIO(b'qwe\nasd\n')

In [18]: StreamReader = codecs.getreader('utf-8')  # here you pass the encoding

In [19]: wrapper_file = StreamReader(bio)

In [20]: print(repr(wrapper_file.readline()))
'qwe\n'

In [21]: print(repr(wrapper_file.read()))
'asd\n'

In [26]: bio.seek(0)
Out[26]: 0

In [27]: for line in wrapper_file:
    ...:     print(repr(line))
    ...:
'qwe\n'
'asd\n'

Compose a file object to write

In [28]: bio = io.BytesIO()

In [29]: StreamWriter = codecs.getwriter('utf-8')  # here you pass the encoding

In [30]: wrapper_file = StreamWriter(bio)

In [31]: print('жаба', 'цап', file=wrapper_file)

In [32]: bio.getvalue()
Out[32]: b'\xd0\xb6\xd0\xb0\xd0\xb1\xd0\xb0 \xd1\x86\xd0\xb0\xd0\xbf\n'

In [33]: repr(bio.getvalue().decode('utf-8'))
Out[33]: "'жаба цап\\n'"

Answer 5

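I had exactly the same need, so I created an EncodedStreamReader class in the nr.utils.io package. It also solves the problem of reading exactly the number of bytes requested, rather than that number of characters from the wrapped stream.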

$ pip install 'nr.utils.io>=0.1.0,<1.0.0'

Example usage:

import io
from nr.utils.io.readers import EncodedStreamReader
fp = EncodedStreamReader(io.StringIO('ä'), 'utf-8')
assert fp.read(1) == b'\xc3'
assert fp.read(1) == b'\xa4'

Answer 6

bio in your example is a _io.BytesIO class object. You have called the read() function twice.

I came up with a bytes conversion and a single read() call:

import io
sio = io.StringIO('wello horld')
b = bytes(sio.read(), encoding='utf-8')
print(b)

But the second variant should be faster:

sio = io.StringIO('wello horld')
b = sio.read().encode()
print(b)
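
The speed claim is easy to check locally. Below is a rough sketch using timeit; the iteration count is arbitrary, and the timings depend on the interpreter and machine, so no particular result is implied.

```python
import timeit

text = 'wello horld'

# Both variants produce the same bytes (str.encode defaults to UTF-8).
via_bytes = bytes(text, encoding='utf-8')
via_encode = text.encode()
same = via_bytes == via_encode

t_bytes = timeit.timeit(lambda: bytes(text, encoding='utf-8'), number=100_000)
t_encode = timeit.timeit(lambda: text.encode(), number=100_000)
print(f'bytes(): {t_bytes:.4f}s  encode(): {t_encode:.4f}s')
```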
