Web Crawling with Hop, Mighty, and Instructor

Sean Moriarity

Machine Learning Advisor


Introduction

Web crawling, or web scraping, is the automated process of browsing the web and extracting information from it. The internet contains an unimaginable amount of information, and you can find an answer to almost any question if you know where to look and how to get it. Web scraping is an essential tool in any machine learning practitioner’s toolkit.

In this post, I’ll show you how to use a new library called Hop to perform an analysis that extracts the most common keywords associated with a given website. Then, we’ll discuss how to combine traditional web crawlers with large language models to build intelligent data extraction applications.

Ethical Considerations

Before we get into building web scrapers, we need to discuss some ethical considerations when scraping the web. Generally speaking, there are a few important things to keep in mind before writing a web scraper:

  1. Respect each site’s robots.txt. This file outlines the website’s crawling policies. By default, Hop will adhere to each site’s robots.txt for you.

  2. Rate limit your requests! The last thing you want is to accidentally DoS a site you’re trying to scrape by overloading the server with requests (see the sketch after this list).

  3. Respect copyrights and terms of service. Some websites might explicitly ban scraping in their terms of service.
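
To make point 2 concrete, here’s a minimal sketch of client-side rate limiting while consuming a crawl with Hop.stream/1 (covered in detail later in this post). Because the stream is lazy, pausing between elements roughly throttles how quickly new requests go out; the one-second delay is an arbitrary example value, not a recommendation.

Hop.new("https://example.com")
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Crawled #{url}")
  # Arbitrary one-second pause between pages; adjust to the site you're crawling.
  Process.sleep(1_000)
end)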

Keyword Analysis with Hop and Mighty

One common application of web crawlers is search engine optimization (SEO). SEO is the process of improving a website’s visibility in search engines by tuning various aspects of the site. Common SEO tools like Semrush will crawl your website to do backlink analysis, keyword analysis, and more. For this project, we’ll implement a tool that performs keyword analysis of a given website.

With Hop, we can perform keyword analysis on a website in a few lines of code in a Livebook. The original idea for this project was borrowed (with permission) from my good friend Andres Alejos.

To get started, open up a Livebook and install hop, readability, mighty, instructor, and exla:

Mix.install([
  {:hop, "~> 0.1"},
  {:readability, "~> 0.12"},
  {:mighty, github: "acalejos/mighty"},
  {:exla, ">= 0.0.0"},
  {:instructor, "~> 0.0.5"},
  {:req, "~> 0.5", override: true}
])

Nx.default_backend(EXLA.Backend)

hop is a simple and extensible library for crawling websites in Elixir. readability is a library similar to Mozilla’s Readability that extracts the readable content from a webpage. mighty is a natural language processing (NLP) library in Elixir that implements some algorithms we’ll need to perform our keyword analysis. instructor lets us get structured outputs from large language models, which we’ll use later in this post, and exla will accelerate the algorithms we use in mighty.

Once you have everything installed, you can configure your Hop. A Hop, in this case, is the configuration for a crawl given to the hop web crawling library. Every crawl you perform with Hop has three stages you can configure:

  1. Prefetch - a pre-request validation step
  2. Fetch - the actual request to retrieve a web page
  3. Next - determining which pages to crawl next

When you start crawling a website with Hop, you give it a root URL or a list of root URLs to start on. It will then perform a prefetch step, which does things like validating that the page’s content is not too large to download or that the page you’re trying to access is not disallowed by the site’s robots.txt. If the request is properly validated during prefetch, hop will move on to the fetch stage. This stage makes the actual request to download the contents of the webpage. After fetch, hop will pass the response on to the “next” stage, which is responsible for extracting the next pages to crawl. Each of these steps is just a function, which means you can customize them to do anything you want.

To create a new hop, first you need to call Hop.new/2:

Hop.new("https://seanmoriarity.com")
%Hop{
  url: "https://seanmoriarity.com",
  prefetch: #Function<2.112089291/3 in Hop.default_prefetch>,
  fetch: #Function<3.112089291/3 in Hop.default_fetch>,
  next: #Function<4.112089291/4 in Hop.default_next>,
  config: [
    max_depth: 5,
    max_content_length: 1000000000,
    accepted_mime_types: ["text/html", "text/plain", "application/xhtml+xml", "application/xml",
     "application/rss+xml", "application/atom+xml"],
    accepted_schemes: ["http", "https"],
    crawl_query?: true,
    crawl_fragment?: false,
    req_options: [connect_options: [timeout: 15000], retry: false]
  ]
}

Hop.new/2 takes the root URL or URLs you want to crawl, plus any additional configuration options you want to pass. You can configure your Hop’s prefetch, fetch, and next implementations by piping it through any of the “builder” functions in the API:

Hop.new("https://seanmoriarity.com")
|> Hop.fetch(&my_custom_fetch/3)
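
Here, my_custom_fetch/3 is just a placeholder. As a rough sketch of what such a function might look like, the example below uses Req directly. The (url, state, opts) argument order and the {:ok, response, state} return shape are assumptions based on the default fetch’s arity shown above, so check Hop’s documentation for the exact contract:

# Assumed contract: receive the URL, the crawl state, and some options, and
# return {:ok, response, state} on success. These shapes are assumptions
# based on the defaults above, not Hop's documented API.
custom_fetch = fn url, state, _opts ->
  case Req.get(url, receive_timeout: 15_000) do
    {:ok, response} -> {:ok, response, state}
    {:error, reason} -> {:error, reason}
  end
end

Hop.new("https://seanmoriarity.com")
|> Hop.fetch(custom_fetch)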

For our SEO application, we don’t actually need any custom behavior. We just want to build a crawler that hits every page on a domain, analyzes the text, and extracts the top keywords associated with the website. Hop’s default behavior is to crawl every page on a site up to a limited depth, so we don’t need to change anything here. To actually execute the crawl with hop, all you need to do is call Hop.stream/1:

Hop.new("https://seanmoriarity.com")
|> Hop.stream()

Hop.stream/1 will return a lazy stream of {url, response, state} tuples that you can enumerate using any of the functions in the Enum library. This simplicity makes Hop perfect for building crawlers in Livebook or in smaller applications.

To see this in action, try crawling a website and just echoing out each URL crawled:

Hop.new("https://seanmoriarity.com")
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} -> IO.puts("Crawled #{url}") end)

After running, you’ll notice that Hop takes care of all of the hard work for us! We can simply call Hop.stream/1 and then map over each crawled page to perform whatever data extraction or processing we want. For our SEO application, the first thing we want to do is extract all of the readable content from each page on a website. We can do this with the readability library:

website_content =
  Hop.new("https://seanmoriarity.com")
  |> Hop.stream()
  |> Enum.map(fn {_, response, _state} ->
    case response do
      %{status: status, body: html} when status in 200..299 ->
        html
        |> Readability.article()
        |> Readability.readable_text()

      _ ->
        nil
    end
  end)
  |> Enum.reject(&(is_nil(&1) or &1 == "" or &1 == "Log into your Facebook account to share."))
  |> Enum.uniq()
["Hi, I’m Sean.\n I’m the author of the book Genetic Algorithms in Elixir: Solve Problems with Evolution.My interests include artificial intelligence, evolutionary algorithms, mathematics, and functional programming. You can check out my GitHub to see what projects I’m currently working on or follow me on Twitter to see what I’m up to.",
 "Books\nGenetic Algorithms in Elixir: Solve Problems using Evolution – (ebook) (Amazon)\nMachine Learning in Elixir: Learning to Learn with Nx and Axon – (ebook)\nShare this:\nTwitterFacebookLikeLoading...",
 "I’ve been fortunate enough to speak at a number of events and participate in a few podcast recordings. You can find them all here.\nTalks\nElixirConf US 2022 – Axon: Functional Programming for Deep LearningDenver Elixir Meetup – Building a Conversational Support Bot with Elixir, Nx, and BumblebeeElixirConf EU 2023 Sponsor Talk (Teller) – Categorizing Millions of Transactions at Scale with Elixir, Nx, and Bumblebee\nEMPEX New York 2023 – Elixir and Large Language ModelsElixirConf US 2023 – MLOps with ElixirGOTO AI Chicago, May 2024 (projected)Interviews / Podcast Recordings\nThinking Elixir #102 – Machine Learning in Elixir with Sean MoriarityElixir Wizards Podcast S10E10 – Sean Moriarity on the Future of Machine Learning with ElixirThinking Elixir #154 – Serving Up AI with Sean MoriarityBeam Radio #54 – Sean Moriarity and Machine LearningGOTO Book Club – Genetic Algorithms in Elixir with Sean MoriarityElixir Outlaws #132 – Making Diagrams for no ReasonAuthor Spotlight – Sean Moriarity (DevTalk)Software Engineering Radio #594Share this:\nTwitterFacebookLikeLoading...",
 ...
]

This pipeline will extract the readable text from each crawled page that Hop returns and filter out some erroneous or empty content. Next, we’re going to perform a TF-IDF analysis on the contents we extracted from the website.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a way to figure out how important a word is in a document within a collection of documents.

  1. TF (Term Frequency): How often a word appears in a document. If a word appears a lot, it might be important.
  2. IDF (Inverse Document Frequency): How unique or rare a word is across all documents. If a word is in every document, it’s probably not that special.
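
As a toy illustration (plain Elixir, not Mighty’s exact formula), consider two tiny documents: a term that shows up repeatedly in one document but rarely across the corpus scores high, while a term that appears in every document scores zero:

docs = [~w(the elixir compiler and the elixir runtime), ~w(the web crawler)]

# Term frequency: how often a term appears in a single document.
tf = fn term, doc -> Enum.count(doc, &(&1 == term)) / length(doc) end
# Inverse document frequency: how rare the term is across all documents.
idf = fn term -> :math.log(length(docs) / Enum.count(docs, fn doc -> term in doc end)) end
tf_idf = fn term, doc -> tf.(term, doc) * idf.(term) end

tf_idf.("elixir", hd(docs)) # about 0.2, distinctive to the first document
tf_idf.("the", hd(docs))    # 0.0, "the" appears in every document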

TF-IDF is a way to extract potentially relevant keywords from a corpus of documents like the one we’ve extracted from the crawled website here. We can use mighty to run TF-IDF for us. First, we’ll extract a vocabulary of unique words from the corpus:

english_stop_words =
  ~w| / < > ` ( ) ^ : ; , . " [ ] a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co con could couldnt cry de describe detail do done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fifty fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off often on once one only onto or other others otherwise our ours ourselves out over own part per perhaps please put rather re same see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under until up upon us very via was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you your yours yourself yourselves|

{corpus_vectorizer, corpus_tf} =
  Mighty.Preprocessing.CountVectorizer.new(
    max_features: 256,
    stop_words: english_stop_words,
    ngram_range: {1, 3}
  )
  |> Mighty.Preprocessing.CountVectorizer.fit_transform(website_content)
{%Mighty.Preprocessing.CountVectorizer{
   vocabulary: %{
     "to make" => 217,
     "work" => 238,
     "run" => 179,
     "bit" => 49,
     "like" => 128,
     "final" => 86,
     "elevenlabs" => 78,
     ...
   },
   ngram_range: {1, 3},
   max_features: 256,
   min_df: 1,
   max_df: 1.0,
   stop_words: MapSet.new(["at", ...]),
   binary: false,
   preprocessor: {Mighty.Preprocessing.Shared, :default_preprocessor, []},
   tokenizer: {String, :split, []},
   pruned: MapSet.new(["evolution –", ...])
 },
 #Nx.Tensor<
   s64[31][256]
   EXLA.Backend<host:0, 0.2724506681.405143584.227864>
   [
     [0, 0, 0, 0, ...],
     ...
   ]
 >}

This code creates a vocabulary of unique terms from the corpus of documents. It uses Mighty’s CountVectorizer, which transforms a corpus into a matrix of term counts. It also strips out stop words. Stop words are common English words that we don’t want to consider in our TF-IDF analysis. These are words like “the” and “a”, which are extremely common in any form of writing and are generally insignificant.

This code also turns the vocabulary into n-grams. Rather than extracting only single words, it will extract sequences of words. In this instance, we’ve specified the ngram_range to be {1, 3}, which means the vocabulary entries we extract will be between one and three words long.
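
For example, here is roughly what the 1- to 3-grams of a short tokenized phrase look like (plain Elixir, just to illustrate the kinds of vocabulary entries the vectorizer can produce):

tokens = ~w(machine learning in elixir)

# Build every contiguous run of 1, 2, and 3 tokens.
for n <- 1..3, gram <- Enum.chunk_every(tokens, n, 1, :discard) do
  Enum.join(gram, " ")
end
# => ["machine", "learning", "in", "elixir", "machine learning", "learning in",
#     "in elixir", "machine learning in", "learning in elixir"]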

Once we’ve created a vocabulary, we can create a new TF-IDF Vectorizer with Mighty:

vectorizer =
  Mighty.Preprocessing.TfidfVectorizer.new(
    vocabulary: corpus_vectorizer.vocabulary,
    norm: nil,
    max_features: 256,
    ngram_range: {1, 3}
  )
%Mighty.Preprocessing.TfidfVectorizer{
  count_vectorizer: %Mighty.Preprocessing.CountVectorizer{
    vocabulary: %{
      "to make" => 217,
      "work" => 238,
      "run" => 179,
      "bit" => 49,
      ...
    },
    ngram_range: {1, 3},
    max_features: 256,
    min_df: 1,
    max_df: 1.0,
    stop_words: MapSet.new([]),
    binary: false,
    preprocessor: {Mighty.Preprocessing.Shared, :default_preprocessor, []},
    tokenizer: {String, :split, []},
    pruned: nil
  },
  norm: nil,
  idf: nil,
  use_idf: true,
  smooth_idf: true,
  sublinear_tf: false
}

Then, we can fit the vectorizer to our corpus:

vectorizer = Mighty.Preprocessing.TfidfVectorizer.fit(vectorizer, website_content)
%Mighty.Preprocessing.TfidfVectorizer{
  count_vectorizer: %Mighty.Preprocessing.CountVectorizer{
    vocabulary: %{
      "to make" => 217,
      "work" => 238,
      "run" => 179,
      "bit" => 49,
      "like" => 128,
      "final" => 86,
      "elevenlabs" => 78,
      ...
    },
    ngram_range: {1, 3},
    max_features: 256,
    min_df: 1,
    max_df: 1.0,
    stop_words: MapSet.new([]),
    binary: false,
    preprocessor: {Mighty.Preprocessing.Shared, :default_preprocessor, []},
    tokenizer: {String, :split, []},
    pruned: MapSet.new([])
  },
  norm: nil,
  idf: #Nx.Tensor<
    f32[256]
    EXLA.Backend<host:0, 0.2724506681.405143584.227923>
    [3.367123603820801, 2.5198259353637695, ...]
  >,
  use_idf: true,
  smooth_idf: true,
  sublinear_tf: false
}

Now, you can use the trained vectorizer to transform your corpus into TF-IDF vectors:

tf_idf = Mighty.Preprocessing.TfidfVectorizer.transform(vectorizer, website_content)
#Nx.Tensor<
  f32[31][256]
  EXLA.Backend<host:0, 0.2724506681.405143584.227926>
  [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...],
    ...
  ]
>

This tensor has a shape of {n_documents, n_words} where each value represents the TF-IDF score of a word in your vocabulary in a particular document. To get the aggregate score of a term, we can sum across all of the documents and then take the top 10 keywords on the website:

top_keywords =
  tf_idf
  |> Nx.sum(axes: [0])
  |> Nx.argsort(direction: :desc)
  |> Nx.slice([0], [10])
  |> Nx.to_list()
[189, 14, 135, 158, 42, 153, 105, 161, 241, 112]

Each of these values is an individual keyword’s index in the vocabulary. We can use the trained TF-IDF vectorizer’s vocabulary to map them back into interpretable words:

int_to_word = Map.new(vectorizer.count_vectorizer.vocabulary, fn {k, v} -> {v, k} end)

Enum.map(top_keywords, &int_to_word[&1])
["task", "=", "model", "of the", "axon", "nx", "iex>", "on task", "you can", "in the"]

And that’s it! These results are interesting, to say the least. It seems we have a few phrases where stop words crept in; however, that’s easy enough to fix. Generally speaking, I would say the keywords returned here fit my website: Axon, Nx, model, task, and iex in particular. Let’s see if we can improve upon this method using large language models.

More Intelligent Crawlers

The SEO keyword analyzer was a quick exercise in showing you what’s possible with Hop. In reality, Hop will let you implement crawlers that are as complex as you need them to be. Large language models in particular are changing how people implement crawlers and extract information from websites. With Hop and Instructor, you can build more intelligent crawlers and data extraction pipelines.

Traditional scraping implementations generally rely on a page’s markup structure to extract information. With Instructor, however, we can extract structured information from open-ended text. For example, we can change our keyword extraction pipeline to extract keywords directly from each article, rather than relying on TF-IDF to do that for us.

We’ll start by defining a simple schema we want to extract from each crawled page:

defmodule PageInfo do
  use Ecto.Schema

  @doc """
  ## Field Descriptions

  - `:relevant?` - whether or not the given page is an SEO relevant page.
  SEO relevant pages include blog posts, static pages such as about, contact,
  index, and more
  - `:keywords` - a list of SEO keywords applicable to the given page
  """
  @primary_key false
  embedded_schema do
    field(:relevant?, :boolean)
    field(:keywords, {:array, :string})
  end
end
{:module, PageInfo, <<70, 79, 82, 49, 0, 0, 15, ...>>,
 [__schema__: 1, __schema__: 1, __schema__: 1, __schema__: 1, __schema__: 2, __schema__: 2, ...]}

With this schema, we’ll extract keywords from each crawled page. We also include a flag for Instructor to tell us whether or not a page is relevant to our analysis. Next, we can reimplement our crawling pipeline; however, this time we’ll extract page info from each crawled page:

Application.put_env(:instructor, :adapter, Instructor.Adapters.OpenAI)
Application.put_env(:instructor, :openai, api_key: System.get_env("LB_OPENAI_API_KEY"))
:ok
seo_info =
  Hop.new("https://seanmoriarity.com")
  |> Hop.stream()
  |> Stream.map(fn {url, response, _state} ->
    case response do
      %{status: status, body: html} when status in 200..299 ->
        text =
          html
          |> Readability.article()
          |> Readability.readable_text()

        {url, text}

      _ ->
        nil
    end
  end)
  |> Stream.reject(&is_nil/1)
  |> Enum.map(fn {url, text} ->
    {:ok, page_info} =
      Instructor.chat_completion(
        model: "gpt-4-turbo",
        response_model: PageInfo,
        messages: [
          %{
            role: "user",
            content:
              "Determine if this page is SEO relevant, and if so extract all of the keywords from it.\nPage: #{text}"
          }
        ]
      )

    {url, page_info}
  end)
  |> Enum.filter(&elem(&1, 1).relevant?)
  |> Map.new()
%{
  "https://seanmoriarity.com/category/paper/" => %PageInfo{
    relevant?: true,
    keywords: ["catastrophic forgetting", "neural networks", "sequential learning",
     "task-specific skills", "continual learning", "elastic weight consolidation", "TensorFlow 2"]
  },
  "https://seanmoriarity.com/2023/02/16/streaming-gpt-3-responses-with-elixir-and-liveview/?share=facebook" => %PageInfo{
    relevant?: true,
    keywords: ["Facebook", "login", "share"]
  },
  ...
}

Now we can collect the keywords from every page, lowercase them, count their frequencies, and return the top 10 again:

seo_info
|> Enum.flat_map(fn {_, page} -> page.keywords end)
|> Enum.map(&String.downcase/1)
|> Enum.frequencies()
|> Enum.sort_by(&elem(&1, 1), :desc)
|> Enum.map(&elem(&1, 0))
|> Enum.take(10)
["elixir", "nx", "neural networks", "catastrophic forgetting", "continual learning", "share",
 "facebook", "multi-task learning", "elastic weight consolidation", "mnist"]

Not bad! It appears we have some Facebook share links that leaked in, but overall this analysis is pretty good.
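
If we wanted to clean that up, one option is to drop share-link URLs from seo_info before counting keywords. This is just a sketch; the "share=" query parameter is specific to this WordPress site and is only an example pattern:

seo_info
|> Enum.reject(fn {url, _page} -> String.contains?(url, "share=") end)
|> Map.new()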

Conclusion

This post just scratches the surface of what’s possible with Hop. I hope it inspires you to go out and experiment with different applications of web crawling and machine learning. The first ever machine learning project I worked on involved an extensive amount of crawling, and I still do quite a bit to this day. There’s nothing more fun than finding and using interesting data online. If you have interesting use cases with Hop, I’d love to hear about them. Until next time!
