Dino Fizzotti

Engineer, maker, hacker, thinker, funner.

Oct 14, 2018 - 11 minute read - Comments - Software Python

CarbAlert - Part 2: Django and Scrapy

This is part 2 of a 4 part series of articles where I explain how I discovered and purchased my laptop by building a web application which scrapes a local PC parts forum and sends automated email alerts when posts featuring specific keywords appear:

CarbAlert on GitHub: https://github.com/dinofizz/carbalert

Django

In order to manage the search phrases and email addresses I am using Django. Django is a Python web framework, and is known for including many extras right out of the box. I am taking advantage of two specific extras: Django’s built-in ORM and the Django admin console.

Django ORM

With the ORM and admin console it becomes easy to create search phrases which are related to specific users. I simply create Django Model classes representing a SearchPhrase and a Thread. There exists a many-to-many relationship between search phrases and threads: a thread may contain multiple search phrases, and a search phrase may be found in many threads.

from django.contrib.auth.models import User
from django.db import models


class SearchPhrase(models.Model):
    phrase = models.CharField(max_length=100, blank=True)
    email_users = models.ManyToManyField(User, related_name="email_users")

    def __str__(self):
        return self.phrase


class Thread(models.Model):
    thread_id = models.CharField(max_length=20, blank=True)
    title = models.CharField(max_length=200, blank=True)
    url = models.CharField(max_length=200, blank=True)
    text = models.TextField(max_length=1000)
    datetime = models.DateTimeField(blank=True)
    search_phrases = models.ManyToManyField(SearchPhrase)

    def __str__(self):
        return self.title

Link to code: https://github.com/dinofizz/carbalert/blob/master/carbalert/carbalert_django/models.py


Django & PostgreSQL

Django makes it easy to integrate with many different database types which can be configured in the Django settings.py configuration file. I decided to use PostgreSQL just to see and learn what would be involved. It was as easy as installing the psycopg2 package which provides a PostgreSQL database adapter for Python, and updating the default database configuration in the Django settings.py file to connect to a PostgreSQL database:

...
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'postgres',
        'USER': 'postgres',
        'HOST': 'db',
        'PORT': 5432,
    }
}
...


The “db” host will resolve to a named Docker container:

...
  db:
    image: postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data/
...

Django Admin Console and Registration with Email Address

CarbAlert users are notified of search phrase “hits” via email. By default user registration through the Django admin console does not require an email address - only a username and password. To then associate an email address with a new user you need to perform an extra step after they have created their username and password. In order to eliminate this extra step I made a change to the Django admin management console to ensure that an “email address” is prompted for capture and required for user registration. My implementation is based on a StackOverflow answer, and can be found here:

from django.contrib import admin
from django.contrib.auth.admin import UserAdmin
from django.contrib.auth.forms import UserCreationForm, UserChangeForm
from django.contrib.auth.models import User

from .models import Thread, SearchPhrase


class EmailRequiredMixin(object):
    def __init__(self, *args, **kwargs):
        super(EmailRequiredMixin, self).__init__(*args, **kwargs)
        self.fields["email"].required = True


class MyUserCreationForm(EmailRequiredMixin, UserCreationForm):
    pass


class MyUserChangeForm(EmailRequiredMixin, UserChangeForm):
    pass


class EmailRequiredUserAdmin(UserAdmin):
    form = MyUserChangeForm
    add_form = MyUserCreationForm
    add_fieldsets = (
        (
            None,
            {
                "fields": ("username", "email", "password1", "password2"),
                "classes": ("wide",),
            },
        ),
    )


admin.site.unregister(User)
admin.site.register(User, EmailRequiredUserAdmin)
admin.site.register(Thread)
admin.site.register(SearchPhrase)

Link to code: https://github.com/dinofizz/carbalert/blob/master/carbalert/carbalert_django/admin.py


The Django web app is run from a Docker container as described in the docker-compose.yml file:

...
  web:
    build: .
    working_dir: /code/carbalert
    command: gunicorn carbalert.wsgi -b 0.0.0.0:8000
    volumes:
      - "/static:/static"
    ports:
      - "8000:8000"
    depends_on:
      - db
...


Scrapy

In order to scan the latest Carbonite posts I am using Scrapy. Scrapy is a Python framework for scraping web sites. I had previously used BeautifulSoup to scrape web sites for HTML content-of-interest, but after listening to Episode #50: Web scraping at scale with Scrapy and ScrapingHub of the Talk Python To Me podcast I decided to give Scrapy a go.

Scrapy is able to scrape web sites just like BeautifulSoup, but then has additional features such as item pipelines, an interactive shell and logging, and more. Scraping logic pertinent to a specific site or web resource is contained within a class which extends one of the “Spider” base classes. These spiders provide built-in mechanisms for navigating and collecting data.

I did consider using the RSS feed provided by Carbonite for each forum, but the post data was truncated and did not contain the full text for me to search for keywords of interest.

I did look over the Carbonite Terms and Conditions and there is (currently) no mention of any restriction on scraping the site. After getting an alert I would still have to click through to the thread in question and use the internal messaging tool to negotiate the purchase.

Scrapy shell

Scrapy provides an interactive shell which can be used to inspect the results of a crawl. This is useful for exploring various HTML and CSS selectors when attempting to figure out the best way of identifying how to get to the data you need. If you start a Scrapy shell with a URL as a command line argument Scrapy will run a default Spider and scrape the URL. You will end up in the interactive shell and a listing of which objects are now available for inspection. Here is what it looks like if I run the shell with the Carbonite laptop forum URL:

$ scrapy shell https://carbonite.co.za/index.php\?forums/laptops.32/
2018-07-27 14:38:47 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: carbalert)
...
<I have ommitted some of the initial Scrapy start-up lines for brevity>
...
2018-07-27 14:38:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-27 14:38:47 [scrapy.core.engine] INFO: Spider opened
2018-07-27 14:38:47 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://carbonite.co.za/robots.txt> (referer: None)
2018-07-27 14:38:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://carbonite.co.za/index.php?forums/laptops.32/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f194dd79a20>
[s]   item       {}
[s]   request    <GET https://carbonite.co.za/index.php?forums/laptops.32/>
[s]   response   <200 https://carbonite.co.za/index.php?forums/laptops.32/>
[s]   settings   <scrapy.settings.Settings object at 0x7f194dd79470>
[s]   spider     <DefaultSpider 'default' at 0x7f194c4e3e10>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response
<200 https://carbonite.co.za/index.php?forums/laptops.32/>
>>>

I knew I needed to be able to find all the posts available on the first page of the Carbonite Laptops forum. With Chrome dev tools I was able to determine that all of the threads are found under the a div identified by a .js-threadList CSS class selector:

Div containing all threads Div containing all threads

I am also able to see that individual threads can be identified by another CSS identifier .structItem--thread. By chaining the Scrapy selection results I can create a list of all the individual threads, as seen by running the following commands in the Scrapy shell:

>>> response.css('.js-threadList')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' js-threadList ')]" data='<div class="structItemContainer-group js'>]
>>> len(response.css('.js-threadList').css('.structItem--thread'))
30
>>>

In addition to using using CSS selectors, Scrapy can locate and extract HTML elements using XPath expressions. Here is an example where I continue with the list of threads and extract the text for the title from the first one in the list using a combination of CSS selectors and an XPath expression:

>>> response.css('.js-threadList') \

...     .css('.structItem--thread')[0] \

...     .css('.structItem-title') \

...     .xpath('.//a/text()') \

...     .extract_first()
'Lenovo Y580 - i7 / 8GB RAM / 1TB HDD / nVidia 660M / Win10 JBL Speakers Bcklit Keyboard - R4100 PRICE DROP'


CarbSpider

Using the results of my investigation from poking around with the Scrapy shell I decided to put all of the scraping logic into a Scrapy Spider. This Python class contains the code which knows how to scan the Carbonite “Laptop” forum index page which contains the 30 latest posts.

When a Spider is invoked it will use a URL specified in the start_urls variable to generate a Request object targeted to that URL. Once Scrapy has scraped the URL it will call-back into the parse function of the Spider with all the gathered information stored in a Response object which is passed in as a parameter:

import scrapy


class CarbSpider(scrapy.Spider):
    name = "carb"
    start_urls = [
        'https://carbonite.co.za/index.php?forums/laptops.32/',
    ]

    def parse(self, response):
        # Insert response handling here


My “CarbSpider” is used obtain metadata such as thread title, thread URL and a unique thread ID. I also want to be able to search the body text of the initial thread post for keywords of interest. With Scrapy I can direct my Spider to scrape an additional URL (the actual thread post URL) with it’s own call-back function which will parse that response. Here is the full code of my CarbSpider which shows how I loop through all threads found on the Laptop forum index page and in turn scrapes each of those thread posts for additional data (initial post body text and timestamp):

import logging
import re

import html2text
import scrapy


class CarbSpider(scrapy.Spider):
    name = "carb"
    start_urls = ["https://carbonite.co.za/index.php?forums/laptops.32/"]

    def parse(self, response):
        logging.info("CarbSpider: Parsing response")
        threads = response.css(".js-threadList").css(".structItem--thread")

        logging.info(f"Found {len(threads)} threads.")

        for thread in threads:
            item = {}

            thread_url_partial = (
                thread.css(".structItem-cell--main")
                .xpath(".//a")
                .css("::attr(href)")[1]
                .extract()
            )
            thread_url = response.urljoin(thread_url_partial)
            logging.info(f"Thread URL: {thread_url}")
            item["thread_url"] = thread_url

            thread_title = (
                thread.css(".structItem-cell--main")
                .css(".structItem-title")
                .xpath(".//a/text()")
                .extract_first()
            )
            logging.info(f"Thread title: {thread_title}")
            item["title"] = thread_title

            thread_id = re.findall(r"\.(\d+)/", thread_url_partial)[0]
            item["thread_id"] = thread_id
            logging.info(f"Thread ID: {thread_id}")

            request = scrapy.Request(item["thread_url"], callback=self.parse_thread)
            request.meta["item"] = item
            yield request

    def parse_thread(self, response):
        item = response.meta["item"]

        thread_timestamp = (
            response.css(".message-main")[0]
            .xpath(".//time")
            .css("::attr(datetime)")
            .extract_first()
        )
        logging.info(f"Thread timestamp: {thread_timestamp}")
        item["datetime"] = thread_timestamp

        html = response.css(".message-main")[0].css(".bbWrapper").extract_first()
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        thread_text = converter.handle(html)
        logging.info(f"Thread text: {thread_text}")
        item["text"] = thread_text

        yield item

Link to code: https://github.com/dinofizz/carbalert/blob/master/carbalert/carbalert_scrapy/carbalert_scrapy/spiders/carb_spider.py


As the thread post body content is formatted in HTML (it is of course being presented on a web page) I make use of the HTML2Text to convert the thread post body HTML to non-HTML formatted text.

So it may be a bit confusing to figure out what happens with the data I am collecting, but here is an attempt to explain:

Information from this Spider is populated into an item dictionary which is passed along for each thread found on the index page from parse() into parse_thread(). Once parse_thread() has completed collecting information on the specific forum post, it adds to the item object and is yielded back into parse(), whose final yield statement will pass on the item object to the Pipeline (more on this below).

(You may be better off following the documentation for Scrapy)

CarbPipeline

Thread metadata scraped from the URL and parsed by the CarbSpider is then passed into a Scrapy Item Pipeline for further processing. I created a CarbPipeline which performs the following, given the previously scraped thread item metadata:

  • Given the scraped thread ID, checks to see if an existing Thread object exists in the database. If I have already saved this thread before I discard the item. Writing this blog post I realise that newly added search phrases will now not trigger on previously saved threads which may be scraped again should a new post bring them to the front page. # TODO…
  • Checks if any of the search phrases are found in the thread title or thread body.
  • If a search phrase is found, it is added to the “mailing list” dictionary where the user is the key and a list of all associated search phrase hits is the value. Each search phrase hit is appended to a new or existing list for a specific user - this allows a user to be notified of a single thread of interest which may contain multiple search phrases hits for that user.
  • A Thread object containing all relevant metadata is saved to the database.
  • Once the thread item has been checked against all possible search phrases an asynchronous email task is kicked off which will send email notifications to all users which had search phrase hits for this thread - one email per user, with all search phrases hits listed in the email.

Code below:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging

import maya

# The path for this is weird. I spent some time trying to get it to work with the a more sane import statement
# but I was not (yet) successful.
from carbalert_django.models import Thread, SearchPhrase
from carbalert.carbalert_scrapy.carbalert_scrapy.tasks import send_email_notification


class CarbalertPipeline(object):
    def process_item(self, item, spider):
        logging.info("CarbalertPipeline: Processing item")

        thread_id = item["thread_id"]
        logging.info(f"Checking if thread ID ({thread_id}) exists in DB...")

        if Thread.objects.filter(thread_id=thread_id).exists():
            logging.debug("Thread already exists.")
            return item

        logging.info("No existing thread for ID.")

        search_phrases = SearchPhrase.objects.values_list("phrase", flat=True)

        title = item["title"]
        text = item["text"]
        thread_url = item["thread_url"]
        thread_datetime = maya.parse(item["datetime"])

        email_list = {}

        for search_phrase in search_phrases:
            logging.info(f"Scanning title and text for search phrase: {search_phrase}")
            if (
                search_phrase.lower() in title.lower()
                or search_phrase.lower() in text.lower()
            ):
                logging.info(f"Found search phrase: {search_phrase}")

                search_phrase_object = SearchPhrase.objects.get(phrase=search_phrase)

                for user in search_phrase_object.email_users.all():
                    logging.info(
                        f"Found user {user} associated to search phrase {search_phrase}"
                    )
                    if user in email_list:
                        email_list[user].append(search_phrase)
                    else:
                        email_list[user] = [search_phrase]

                logging.info(f"Saving thread ID ({thread_id}) to DB.")
                try:
                    thread = Thread.objects.get(thread_id=thread_id)
                except Thread.DoesNotExist:
                    thread = Thread()
                    thread.thread_id = thread_id
                    thread.title = title
                    thread.text = text
                    thread.url = thread_url
                    thread.datetime = thread_datetime.datetime()
                    thread.save()

                thread.search_phrases.add(search_phrase_object)
                thread.save()

        local_datetime = thread_datetime.datetime(to_timezone="Africa/Harare").strftime(
            "%d-%m-%Y %H:%M"
        )

        for user in email_list:
            logging.info(
                f"Sending email notification to user {user} for thread ID {thread_id}, thread title: {title}"
            )
            send_email_notification.delay(
                user.email, email_list[user], title, text, thread_url, local_datetime
            )

        return item

Link to code: https://github.com/dinofizz/carbalert/blob/master/carbalert/carbalert_scrapy/carbalert_scrapy/pipelines.py


Next post in series: Part 3: Celery, Mailgun and Flower