python 3.x - binary input with an ASCII text header, read from stdin

I want to read a binary PNM image file from stdin. The file contains a header which is encoded as ASCII text, and a payload which is binary. As a simplified example of reading the header, I have created the following snippet:

#! /usr/bin/env python3
import sys
header = sys.stdin.readline()
print("header=["+header.strip()+"]")
I run it as "test.py" (from a Bash shell), and it works fine in this case:
$ printf "P5 1 1 255\n\x41" |./test.py 
header=[P5 1 1 255]
However, a small change in the binary payload breaks it:
$ printf "P5 1 1 255\n\x81" |./test.py 
Traceback (most recent call last):
  File "./test.py", line 3, in <module>
    header = sys.stdin.readline()
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
Is there an easy way to make this work in Python 3?
@hiroprotagonist: Thanks for the hint. The approach indicated there did lead me to one possible solution -- although it is a bit of a hack to apply Unicode decoding to arbitrary binary data.
– Brent Bradburn
Jul 19, 2015 at 1:42
To read binary data, you should use a binary stream, for example one obtained with the TextIOBase.detach() method:
#!/usr/bin/env python3
import sys
sys.stdin = sys.stdin.detach() # convert to binary stream
header = sys.stdin.readline().decode('ascii') # b'\n'-terminated
print(header, end='')
print(repr(sys.stdin.read()))
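To see what detach() does without piping anything in, here is a minimal in-memory sketch. It assumes io.BytesIO as a stand-in for the real stdin; the names text_stream, binary_stream and wrapper_usable are only illustrative.

```python
import io

# Stand-in for sys.stdin: a UTF-8 text wrapper around a binary buffer
text_stream = io.TextIOWrapper(io.BytesIO(b"P5 1 1 255\n\x81"), encoding='utf-8')

# detach() returns the underlying binary stream and leaves the wrapper unusable
binary_stream = text_stream.detach()

header = binary_stream.readline().decode('ascii')  # only the header is decoded
payload = binary_stream.read()                     # raw bytes, no codec involved

try:
    text_stream.read()
    wrapper_usable = True
except ValueError:
    # the text wrapper has no buffer left to read from
    wrapper_usable = False

print(repr(header), payload, wrapper_usable)
```

Because the wrapper cannot be used after detaching, rebinding sys.stdin to the detached stream (as in the snippet above) keeps later reads from hitting the dead wrapper.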
According to the docs, you can read binary data (as type bytes) from stdin with sys.stdin.buffer.read():
  To write or read binary data from/to the standard streams, use the
  underlying binary buffer object. For example, to write bytes to
  stdout, use sys.stdout.buffer.write(b'abc').
So one option is to read the data in binary mode. readline() and the other read methods still work, but they return bytes. Once you have captured the ASCII header, you can convert it to text with decode('ASCII') for additional text-specific processing.
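As a rough sketch of that binary-mode approach, the header can be split into its fields once it has been decoded. The helper read_pnm_header is hypothetical, and it assumes the whole header sits on a single line as in the question's example; real PNM files may spread the header over several lines and include comments.

```python
import io

def read_pnm_header(stream):
    """Parse a one-line PNM header from a binary stream (hypothetical helper)."""
    line = stream.readline()                # bytes, up to and including b'\n'
    magic, width, height, maxval = line.decode('ascii').split()
    return magic, int(width), int(height), int(maxval)

# In-memory stand-in for sys.stdin.buffer
stream = io.BytesIO(b"P5 1 1 255\n\x81")

header = read_pnm_header(stream)
payload = stream.read()                     # remaining bytes, never decoded

print(header, payload)
```

With real input, pass sys.stdin.buffer instead of the BytesIO object.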
Alternatively, you can use io.TextIOWrapper() to select the latin-1 character set for the input stream. With this, the implicit decode is essentially a pass-through: the data will be of type str (which represents text), but each character maps 1-to-1 to an input byte (although it could be using more than one storage byte per input byte). Note that TextIOWrapper's default universal-newlines mode also translates '\r' and '\r\n' to '\n' on input, so pass newline='\n' to keep the payload intact.
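The pass-through behaviour can be checked in memory. The sketch below assumes a made-up 16x16 image whose payload contains every byte value once, and uses a hypothetical helper read_payload to compare the default newline handling with newline='\n'.

```python
import io

# Header plus a payload containing every possible byte value exactly once
raw = b"P5 16 16 255\n" + bytes(range(256))

def read_payload(newline):
    """Read the payload through a latin-1 text wrapper (hypothetical helper)."""
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding='latin-1', newline=newline)
    wrapper.readline()                       # skip the header line
    return [ord(ch) for ch in wrapper.read()]

default_payload = read_payload(None)         # universal newlines (the default)
exact_payload = read_payload('\n')           # no newline translation

print(exact_payload == list(range(256)))     # every byte survives
print(default_payload[13])                   # byte 0x0D was rewritten
```

With newline='\n' every byte value comes back unchanged, whereas the default mode turns the carriage-return byte (13) into a line feed (10).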
Here's code that works in either mode:
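#! /usr/bin/python3
import sys, io

BINARY = True  # either way works

if BINARY:
    istream = sys.stdin.buffer
else:
    # latin-1 maps each byte to one code point; newline='\n' disables
    # universal-newlines translation, which would turn byte 0x0D into 0x0A
    istream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin-1', newline='\n')

header = istream.readline()
if BINARY:
    header = header.decode('ASCII')
print("header=[" + header.strip() + "]")

payload = istream.read()
print("len=" + str(len(payload)))
for i in payload:
    print(i if BINARY else ord(i))  # bytes iterate as ints, str as characters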
Test every possible 1-pixel payload with the following Bash command:
for i in $(seq 0 255) ; do printf "P5 1 1 255\n\x$(printf %02x $i)" |./test.py ; done
The hack of using latin-1 as a conduit for binary data works because latin-1 is 8-bit clean (every byte value decodes to a character), whereas UTF-8 is not.
– Brent Bradburn
Jul 19, 2015 at 1:47