Actual Behavior
Document.summary() fails under Python 3 when the Document is built from bytes rather than str content.
Steps to Reproduce the Problem
Follow the readme steps
>>> import requests
>>> from readability import Document
>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
Traceback (most recent call last):
...
RE_CHARSET.findall(page) + RE_PRAGMA.findall(page) + RE_XML.findall(page)
^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot use a string pattern on a bytes-like object
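The same error can be isolated without any network access. The following is a minimal sketch, assuming only that the patterns in question are compiled from str; STR_PATTERN is a simplified stand-in for illustration, not the library's actual regex:

```python
import re

# Stand-in str pattern (an assumption for illustration, not the library's regex).
STR_PATTERN = re.compile(r'charset=["\']*(.+?)["\'>]', flags=re.I)

# Bytes input, as returned by response.content in the session above.
page = b'<meta charset="utf-8">'

try:
    STR_PATTERN.findall(page)
    error = None
except TypeError as exc:
    # Python 3 refuses to apply a str pattern to a bytes-like object.
    error = str(exc)

print(error)
```

Python 3 never mixes str patterns with bytes input, so the failure does not depend on the page content.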
How to correct
The str regexes should be changed to bytes regexes, since encoding.get_encoding is only used for bytes content.
In encoding.py:
RE_CHARSET = re.compile(br'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
RE_PRAGMA = re.compile(br'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
RE_XML = re.compile(br'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
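As a rough check of the proposed patterns, the sketch below applies them to a small bytes page. The helper name declared_encodings and the sample page are hypothetical and exist only for this illustration; the real get_encoding does more than this.

```python
import re

# Bytes patterns, copied from the proposal above.
RE_CHARSET = re.compile(br'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
RE_PRAGMA = re.compile(br'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
RE_XML = re.compile(br'^<\?xml.*?encoding=["\']*(.+?)["\'>]')


def declared_encodings(page):
    """Hypothetical helper: list the encodings declared in a bytes page."""
    matches = (
        RE_CHARSET.findall(page) + RE_PRAGMA.findall(page) + RE_XML.findall(page)
    )
    # Matches are bytes; decode them so callers get ordinary str names.
    return [m.decode('ascii', 'ignore').lower() for m in matches]


# Sample bytes input (made up for this example).
page = b'<html><head><meta charset="utf-8"></head><body>hi</body></html>'
print(declared_encodings(page))  # ['utf-8']
```

Because the patterns are bytes, findall returns bytes as well, so any code that consumes the matches has to decode them before treating them as codec names.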