[IMP] tools.html2plaintext: consistent use of lxml.etree.HTMLParser to convert HTML to plaintext

We used to switch to using BeautifulSoup when available, but that
led to inconsistent behavior depending on the installed Python
packages, and sometimes led to bad surprises. There is no advantage
in using BeautifulSoup rather than HTMLParser, and the latter is
always available.

bzr revid: odo@openerp.com-20121015120934-njaylf99dc5zekfw
Olivier Dony 2012-10-15 14:09:34 +02:00
parent 99c4f31111
commit e06e3aad4a
1 changed file with 2 additions and 10 deletions


@@ -312,16 +312,8 @@ def html2plaintext(html, body_id=None, encoding='utf-8'):
     html = ustr(html)
-    from lxml.etree import tostring
-    try:
-        from lxml.html.soupparser import fromstring
-        kwargs = {}
-    except ImportError:
-        _logger.debug('tools.misc.html2plaintext: cannot use BeautifulSoup, fallback to lxml.etree.HTMLParser')
-        from lxml.etree import fromstring, HTMLParser
-        kwargs = dict(parser=HTMLParser())
-    tree = fromstring(html, **kwargs)
+    from lxml.etree import tostring, fromstring, HTMLParser
+    tree = fromstring(html, parser=HTMLParser())
     if body_id is not None:
         source = tree.xpath('//*[@id=%s]'%(body_id,))
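
For illustration only, here is a minimal standalone sketch of the parsing approach this commit settles on: always parse with lxml.etree.HTMLParser, with no BeautifulSoup branch. The helper names parse_html and select_by_id and the sample markup are hypothetical and not part of the patched module. Passing the id as an XPath variable is a choice made for this sketch; the patched code builds the expression with string formatting instead.

```python
from lxml.etree import HTMLParser, fromstring, tostring


def parse_html(html):
    """Parse an HTML string with lxml's lenient HTMLParser (hypothetical helper)."""
    # Same call shape as the new line in the diff: no optional dependency involved.
    return fromstring(html, parser=HTMLParser())


def select_by_id(tree, body_id):
    """Return the elements whose id attribute equals body_id (hypothetical helper)."""
    # The id is bound as an XPath variable instead of being interpolated.
    return tree.xpath('//*[@id=$body_id]', body_id=body_id)


# Sample input with an unclosed <p>, which HTMLParser tolerates.
tree = parse_html('<div id="main"><p>Hello <b>world</b></p><p>unclosed</div>')
matches = select_by_id(tree, 'main')

# Serialize only the text nodes of the selected element.
text = tostring(matches[0], method='text', encoding='unicode')
print(text)
```

Because the parser is fixed, the resulting tree does not depend on whether BeautifulSoup happens to be installed, which is the consistency the commit message asks for.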