Discussion:
BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer
2017-08-27 17:23:07 UTC
Greetings,

I have a Python 3.6 script on Windows to scrape comment history from a
website. It's currently set up this way:

Requestor (threads) -> list -> Parser (threads) -> queue -> CSVWriter
(single thread)

It takes 15 minutes to process ~11,000 comments.

When I replaced the list with a queue between the Requestor and Parser
to speed things up, BeautifulSoup stopped working.

When I changed BeautifulSoup(contents, "lxml") to
BeautifulSoup(contents), I get the UserWarning that no parser was
explicitly set and a reference to line 80 in threading.py (which puts it
in the RLock factory function).

When I switched back to using a list between the Requestor and Parser, the
Parser worked again.

BeautifulSoup doesn't work with a threaded input queue?

Thank you,

Chris Reimer
Peter Otten
2017-08-27 18:54:58 UTC
Post by Christopher Reimer
Greetings,
I have a Python 3.6 script on Windows to scrape comment history from a
website. It's currently set up this way:
Requestor (threads) -> list -> Parser (threads) -> queue -> CSVWriter
(single thread)
It takes 15 minutes to process ~11,000 comments.
When I replaced the list with a queue between the Requestor and Parser
to speed things up, BeautifulSoup stopped working.
When I changed BeautifulSoup(contents, "lxml") to
BeautifulSoup(contents), I get the UserWarning that no parser was
explicitly set and a reference to line 80 in threading.py (which puts it
in the RLock factory function).
When I switched back to using a list between the Requestor and Parser, the
Parser worked again.
BeautifulSoup doesn't work with a threaded input queue?
The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal...
Christopher Reimer
2017-08-27 19:35:03 UTC
Post by Peter Otten
The documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal.
A worker thread uses a request object to get the page and puts it into
the queue as page.content (HTML). Another worker thread gets the
page.content from the queue to apply BeautifulSoup, and nothing happens.

soup = BeautifulSoup(page_content, 'lxml')
print(soup)

No output whatsoever. If I remove 'lxml', I get the UserWarning that no
parser was explicitly set and a reference to threading.py at
line 80.

I verified that page.content that goes into and out of the queue is the
same page.content that goes into and out of a list.

I read somewhere that BeautifulSoup may not be thread-safe. I've never
had a problem with threads storing the output into a queue. Using a
queue (random order) instead of a list (sequential order) to feed pages
to the parser is making it wonky.
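
For reference, the worker setup looks roughly like this (a simplified
sketch, not the actual code; the function names are just illustrative):

import queue

import requests
from bs4 import BeautifulSoup

page_queue = queue.Queue()

def request_worker(url):
    # Requestor thread: fetch the page and put the raw HTML on the queue.
    page = requests.get(url)
    page_queue.put(page.content)

def parse_worker():
    # Parser thread: take HTML off the queue and build the soup.
    page_content = page_queue.get()
    soup = BeautifulSoup(page_content, 'lxml')
    print(soup)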

Chris R.
MRAB
2017-08-27 20:12:10 UTC
Post by Christopher Reimer
Post by Peter Otten
The documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal.
A worker thread uses a request object to get the page and puts it into
the queue as page.content (HTML). Another worker thread gets the
page.content from the queue to apply BeautifulSoup, and nothing happens.
soup = BeautifulSoup(page_content, 'lxml')
print(soup)
No output whatsoever. If I remove 'lxml', I get the UserWarning that no
parser was explicitly set and a reference to threading.py at
line 80.
I verified that page.content that goes into and out of the queue is the
same page.content that goes into and out of a list.
I read somewhere that BeautifulSoup may not be thread-safe. I've never
had a problem with threads storing the output into a queue. Using a
queue (random order) instead of a list (sequential order) to feed pages
to the parser is making it wonky.
What do you mean by "queue (random order)"? A queue is sequential order,
first-in-first-out.
Peter Otten
2017-08-27 20:31:35 UTC
Post by Christopher Reimer
Post by Peter Otten
The documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal.
A worker thread uses a request object to get the page and puts it into
the queue as page.content (HTML). Another worker thread gets the
page.content from the queue to apply BeautifulSoup, and nothing happens.
soup = BeautifulSoup(page_content, 'lxml')
print(soup)
No output whatsoever. If I remove 'lxml', I get the UserWarning that no
parser was explicitly set and a reference to threading.py at
line 80.
I verified that page.content that goes into and out of the queue is the
same page.content that goes into and out of a list.
I read somewhere that BeautifulSoup may not be thread-safe. I've never
had a problem with threads storing the output into a queue. Using a
queue (random order) instead of a list (sequential order) to feed pages
to the parser is making it wonky.
Here's a simple example that extracts titles from generated html. It seems
to work. Does it resemble what you do?

import csv
import threading
import time
from queue import Queue

import bs4


def process_html(source, dest, index):
    while True:
        html = source.get()
        if html is DONE:
            dest.put(DONE)
            break
        soup = bs4.BeautifulSoup(html, "lxml")
        dest.put(soup.find("title").text)


def write_csv(source, filename, to_go):
    with open(filename, "w") as f:
        writer = csv.writer(f)
        while True:
            title = source.get()
            if title is DONE:
                to_go -= 1
                if not to_go:
                    return
            else:
                writer.writerow([title])

NUM_SOUP_THREADS = 10
DONE = object()

web_to_soup = Queue()
soup_to_file = Queue()

soup_threads = [
    threading.Thread(target=process_html, args=(web_to_soup, soup_to_file, i))
    for i in range(NUM_SOUP_THREADS)
]

write_thread = threading.Thread(
    target=write_csv, args=(soup_to_file, "tmp.csv", NUM_SOUP_THREADS),
)

write_thread.start()

for thread in soup_threads:
    thread.start()

for i in range(100):
    web_to_soup.put("<html><head><title>#{}</title></head></html>".format(i))
for i in range(NUM_SOUP_THREADS):
    web_to_soup.put(DONE)

for t in soup_threads:
    t.join()
write_thread.join()
Christopher Reimer
2017-08-27 20:35:06 UTC
Post by MRAB
What do you mean by "queue (random order)"? A queue is sequential
order, first-in-first-out.
With 20 threads requesting 20 different pages, the pages aren't going
into the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19);
they arrive at different times for the parser worker threads to pick up
for processing.

It's a similar situation with a list, but I sort the list before giving
it to the parser, so all the items are in sequential order and fed to
the parser one at a time.

Chris R.
MRAB
2017-08-27 20:50:20 UTC
Post by Christopher Reimer
Post by MRAB
What do you mean by "queue (random order)"? A queue is sequential
order, first-in-first-out.
With 20 threads requesting 20 different pages, the pages aren't going
into the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19);
they arrive at different times for the parser worker threads to pick up
for processing.
It's a similar situation with a list, but I sort the list before giving
it to the parser, so all the items are in sequential order and fed to
the parser one at a time.
What if you don't sort the list? I ask because it sounds like you're
changing 2 variables (i.e. list->queue, sorted->unsorted) at the same
time, so you can't be sure that it's the queue that's the problem.
Christopher Reimer
2017-08-27 21:14:27 UTC
Post by Peter Otten
Here's a simple example that extracts titles from generated html. It seems
to work. Does it resemble what you do?
Your example is similar to my code when I'm using a list for the input
to the parser. You have soup_threads and write_threads, but no read_threads.

The particular website I'm scraping requires checking each page for the
sentinel value (i.e., "Sorry, no more comments") in order to determine
when to stop requesting pages. For my comment history that's ~750 pages
to parse ~11,000 comments.

I have 20 read_threads requesting and putting pages into the output
queue that is the input_queue for the parser. My soup_threads can get
items from the queue, but BeautifulSoup doesn't do anything after that.

Chris R.
Paul Rubin
2017-08-27 21:23:58 UTC
Post by Christopher Reimer
I have 20 read_threads requesting and putting pages into the output
queue that is the input_queue for the parser.
Given how slow parsing is, you probably want to save the scraped pages
into disk files, and then run the parser in parallel processes that read
from disk. You could also use something like Redis (redis.io) as a queue.
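
Something along these lines, say (an untested sketch; the file layout
and the "comment" class selector are made up for illustration):

import glob
from multiprocessing import Pool

import bs4

def parse_file(path):
    # Each worker process parses one page that was saved to disk earlier.
    with open(path, encoding="utf-8") as f:
        soup = bs4.BeautifulSoup(f.read(), "lxml")
    return [c.text for c in soup.find_all(class_="comment")]

if __name__ == "__main__":
    with Pool() as pool:
        for comments in pool.map(parse_file, glob.glob("pages/*.html")):
            print(len(comments))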
Christopher Reimer
2017-08-27 21:19:24 UTC
Post by MRAB
What if you don't sort the list? I ask because it sounds like you're
changing 2 variables (i.e. list->queue, sorted->unsorted) at the same
time, so you can't be sure that it's the queue that's the problem.
If I'm using a list, I'm using a for loop to feed items into the parser.

If I'm using a queue, I'm using worker threads to put or get items.

The item is still the same whether in a list or a queue.

Chris R.
Peter Otten
2017-08-27 21:45:27 UTC
Post by Christopher Reimer
Post by Peter Otten
Here's a simple example that extracts titles from generated html. It
seems to work. Does it resemble what you do?
Your example is similar to my code when I'm using a list for the input
to the parser. You have soup_threads and write_threads, but no
read_threads.
The particular website I'm scraping requires checking each page for the
sentinel value (i.e., "Sorry, no more comments") in order to determine
when to stop requesting pages.
Where's that check happening? If it's in the soup thread, you need some kind
of back channel to the read threads to inform them that you need no more
pages.
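
A shared threading.Event is one simple back channel, e.g. (a sketch;
"fetch" stands in for whatever request call the read threads use now):

import threading

no_more_pages = threading.Event()

def read_worker(page_numbers, fetch, out_queue):
    for n in page_numbers:
        if no_more_pages.is_set():  # a soup thread saw the sentinel page
            break
        out_queue.put(fetch(n))

# In a soup thread, after spotting "Sorry, no more comments":
#     no_more_pages.set()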
Post by Christopher Reimer
For my comment history that's ~750 pages
to parse ~11,000 comments.
I have 20 read_threads requesting and putting pages into the output
queue that is the input_queue for the parser. My soup_threads can get
items from the queue, but BeautifulSoup doesn't do anything after that.
Chris R.
Christopher Reimer
2017-08-27 22:48:27 UTC
Ah, shoot me. I had a .join() statement on the output queue but not on
the input queue. So the threads for the input queue got terminated
before BeautifulSoup could get started. I went down that same rabbit
hole with CSVWriter the other day. *sigh*
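
In other words, the shutdown has to wait on both queues, roughly like
this (a sketch; the queue names are stand-ins, and the worker threads
call task_done() after each item they finish):

from queue import Queue

input_queue = Queue()    # pages waiting to be parsed
output_queue = Queue()   # parsed rows waiting for the CSVWriter

# ... start requestor, parser, and writer threads here ...

input_queue.join()    # wait until every page has actually been parsed
output_queue.join()   # then wait until every row has been written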

Thanks for everyone's help.

Chris R.
