In the previous article we saw an introduction to Akka. Now it is time to move on to something more interesting. There are many applications where Akka shines, and today we will look at one of them: a web crawler. It contains several interesting parts that show how easy a multithreaded application can be to design and develop with Akka.
Let's start with the design, which gives us a quick view of how the system should look. Where can we find good examples for this task?
A web crawler is usually part of a web search engine, and the most popular search engine today is Google. The article "The Anatomy of a Large-Scale Hypertextual Web Search Engine" gives a brief overview of how Google was originally designed, so if you want to build a similar project, it is a good place to start.
The application we want to build is much smaller. To make it more interesting, our web crawler must satisfy several conditions:
- The web crawler must not spam websites, so consecutive requests to the same website must be separated by a delay.
- All operations must be non-blocking.
- We assume that the network connection is unreliable, so a crawl might fail sometimes. In that case the system must recover from the failure.
- Since we will be running the crawler on a regular computer, we need to stop the system once we have scraped enough links. The check is controlled by a simple threshold value.
Our system will be divided into Actors. Each Actor represents a logical block that operates on its own data. The first thing we need is a starting point; this can be a simple object that creates the main Actor and starts the messaging. The main Actor handles high-level decisions, such as how many links we still need to scrape or what to do when scraping fails. In addition, it manages the other Actors. I think the name that suits this Actor well is Supervisor.
Things get much easier if we create a separate Actor for each website we need to crawl. Then we do not need to worry about synchronizing delays with the scrapers of other websites. Since this Actor works with a single website, we will call it SiteCrawler. The Supervisor Actor will send it the urls to be scraped.
SiteCrawler needs to get all required information from a web page. The problem is that scraping might fail, as we assumed earlier. To isolate the scraping work, we create a Scraper Actor. It processes the provided url and sends the status back to the SiteCrawler. If the scraping fails, SiteCrawler sends an error message to the Supervisor after the timeout.
The scraped information is sent from the Scraper directly to the Indexer Actor, which stores the content. Once it receives the information, it not only stores the necessary data but also sends the urls that need to be scraped to the Supervisor. To make the Indexer a bit more interesting, we will print all collected data before stopping it.
You can think of each Actor as a separate machine, so to keep communication cheap we try to minimize the information sent from one Actor to another.
The diagram of the project classes is as follows:
Classes in detail
The App class of the project should do several things:
- create the main Actor and initialize it with the first website we want to process,
- stop the whole system if the process has not finished by then.
Here is the object implementation:
Note that stopping after the delay, as demonstrated here, is a last resort. It is much better to stop the system once the program has accomplished its goal.
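A minimal sketch of such an entry point, assuming Akka classic 2.4 or later (for terminate and whenTerminated). Start, PlaceholderSupervisor, the start url and the ten-minute limit are assumptions made for illustration. The placeholder Actor shuts the system down on its first message, standing in for "the crawl reached its goal", so the example ends on its own.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical start message; the real project defines its own messages.
final case class Start(url: String)

// Placeholder for the main Actor: terminates the system on the first Start.
class PlaceholderSupervisor extends Actor {
  def receive: Receive = {
    case Start(url) =>
      println(s"would start crawling at $url")
      context.system.terminate()
  }
}

object CrawlerApp {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("crawler")
    val supervisor = system.actorOf(Props(new PlaceholderSupervisor), "supervisor")

    supervisor ! Start("https://example.com")

    // Wait a bounded time for a normal shutdown; Await.ready throws on timeout.
    val stoppedInTime = Try(Await.ready(system.whenTerminated, 10.minutes)).isSuccess
    // Last resort: force the shutdown if the crawl did not finish by itself.
    if (!stoppedInTime) system.terminate()
  }
}
```

The forced terminate only runs when the bounded wait times out, which keeps the normal "goal reached" shutdown as the primary path.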
The Supervisor has four basic variables:
- How many pages we visited (sent for scraping).
- Which pages we still need to scrape.
- How many times we tried to visit a particular page. This is needed because scraping might fail, in which case we need to visit a page several times.
- A store of SiteCrawler Actors, one for each host we need to deal with.
Here is one way you can represent those variables:
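As a rough sketch, the four variables could look like this. The names host2Actor and toScrap come from the article; numVisited and scrapCounts are assumed names, and ActorRef is replaced by a type parameter so the snippet runs without Akka.

```scala
import scala.collection.mutable

// Sketch of the Supervisor state; Ref stands in for akka.actor.ActorRef.
class SupervisorState[Ref] {
  // How many pages were sent for scraping so far.
  var numVisited: Int = 0
  // Pages that still have to be scraped.
  val toScrap: mutable.Set[String] = mutable.Set.empty[String]
  // Visit attempts per url; unknown urls read as 0 without being inserted.
  val scrapCounts: mutable.Map[String, Int] =
    mutable.Map.empty[String, Int].withDefaultValue(0)
  // One SiteCrawler per host.
  val host2Actor: mutable.Map[String, Ref] = mutable.Map.empty[String, Ref]
}
```

Urls are kept as plain strings here because java.net.URL performs host resolution in equals and hashCode, which makes it a poor key for sets and maps.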
When we scrape a url, we need to send it to the Actor that processes urls for that particular host.
Here, we check whether an Actor for the given host exists; if not, we create it and add it to our host2Actor container. Next, we add the url to the collection of pages we want to scrape. Finally, we notify the SiteCrawler Actor that we want to scrape the url.
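The three steps above can be sketched without Akka as follows. HostRouter, create and notify are hypothetical names: create stands in for creating a SiteCrawler Actor and notify for sending it a Scrap message.

```scala
import java.net.URI
import scala.collection.mutable
import scala.util.Try

// Routes each url to a per-host worker, creating the worker on first use.
class HostRouter[Ref](create: String => Ref, notify: (Ref, String) => Unit) {
  val host2Actor = mutable.Map.empty[String, Ref]
  val toScrap = mutable.Set.empty[String]

  def scrap(url: String): Unit =
    // Malformed urls and urls without a host are silently skipped.
    Try(new URI(url).getHost).toOption.flatMap(Option(_)).foreach { host =>
      // 1. Reuse the worker for this host or create and register a new one.
      val actor = host2Actor.getOrElseUpdate(host, create(host))
      // 2. Remember that this url is still pending.
      toScrap += url
      // 3. Tell the worker to scrape it.
      notify(actor, url)
    }
}
```

Because getOrElseUpdate evaluates create only when the host is missing, each host gets exactly one worker no matter how many of its urls arrive.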
The receive function body of the Supervisor class contains a handler for each received message. Each message is a case class. To see all messages, check the Messages class in the project repository.
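To give an idea of what such a protocol might look like, here is a hypothetical set of messages. Only Scrap, IndexFinished and ScrapFailure are named in this article; the other names and all fields are assumptions and may differ from the Messages class in the repository.

```scala
// Hypothetical message protocol for the crawler.
final case class Start(url: String)                             // App -> Supervisor
final case class Scrap(url: String)                             // Supervisor -> SiteCrawler -> Scraper
final case class ScrapFinished(url: String)                     // Scraper -> SiteCrawler
final case class ScrapFailure(url: String, reason: Throwable)   // SiteCrawler -> Supervisor
final case class Content(title: String, urls: List[String])     // what the Scraper extracted
final case class Index(url: String, content: Content)           // Scraper -> Indexer
final case class IndexFinished(url: String, urls: List[String]) // Indexer -> Supervisor

object Protocol {
  // Case classes can be matched directly, which is what a receive block does.
  def describe(msg: Any): String = msg match {
    case Scrap(url)               => s"scrape $url"
    case IndexFinished(url, urls) => s"indexed $url with ${urls.size} links"
    case ScrapFailure(url, e)     => s"failed $url: ${e.getMessage}"
    case other                    => s"unhandled: $other"
  }
}
```

Keeping only urls and small values in the messages fits the goal of minimizing what travels between Actors.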
Let’s take a closer look at two of them. When the Indexer finishes its processing, it sends the url information to the Supervisor in an IndexFinished message. We decide whether to scrape the received urls based on the number of pages we have already visited. If so, we proceed with each url that we have not tried to scrape before. The checkAndShutdown function removes the url from the toScrap set and, if there are no more urls to visit, shuts down the whole system.
The ScrapFailure message is received when scraping fails. In that case we need to decide whether to keep trying to scrape the url; we do that by counting the number of visits for the url.
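The decision logic of both handlers can be sketched as plain bookkeeping, independent of Akka. CrawlBookkeeping, maxPages, maxAttempts, numVisited, scrapCounts and the finished flag are assumed names; finished stands in for shutting down the Actor system, and the real handlers would also send Scrap messages to the SiteCrawler Actors.

```scala
import scala.collection.mutable

class CrawlBookkeeping(maxPages: Int, maxAttempts: Int) {
  var numVisited = 0
  var finished = false
  val toScrap = mutable.Set.empty[String]
  val scrapCounts = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Register a first visit of a url.
  def scrap(url: String): Unit = {
    numVisited += 1
    scrapCounts(url) += 1
    toScrap += url
  }

  // IndexFinished: follow only unseen links, and only while under the threshold.
  def onIndexFinished(url: String, urls: List[String]): Unit = {
    val budget = maxPages - numVisited
    urls.distinct.filter(u => scrapCounts(u) == 0).take(budget).foreach(scrap)
    checkAndShutdown(url)
  }

  // ScrapFailure: retry until the url has used up its attempts, then give up.
  def onScrapFailure(url: String): Unit =
    if (scrapCounts(url) < maxAttempts) scrapCounts(url) += 1
    else checkAndShutdown(url)

  // Drop the url from the pending set and finish when nothing is left.
  def checkAndShutdown(url: String): Unit = {
    toScrap -= url
    if (toScrap.isEmpty) finished = true
  }
}
```

Taking at most budget new urls makes the threshold strict; checking the visit count once before following all links would also work but can overshoot the limit by one page's worth of links.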
The whole source code for the Supervisor class is available in the project repository.
One of the most interesting things in the system is how we handle the delays for each website in a non-blocking way. To achieve that, we put this logic in a separate class. In addition, we need to recover somehow after a scraping failure.
Want to see how it works? See the SiteCrawler class implementation below.
First of all, we create an instance of SiteCrawler for each website (host) in the Supervisor Actor, as we saw previously. That way each instance deals with only one website and does not need to worry about synchronization with the others.
We create a scheduler that sends a process message to the Actor every second. We cannot let the scheduler call the internal processing code directly, since that code would then run outside the SiteCrawler Actor's message handling with no synchronization. In that case we might modify the toProcess variable in two places at the same time: when we add a new url to it (the Scrap message) and when we remove the last added element from the list (the process message).
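A minimal sketch of this throttling idea, assuming Akka classic. ThrottledCrawler, the Process tick message and the onScrape callback are hypothetical; the callback stands in for handing the url to the Scraper. The sketch uses scheduler.schedule, which is deprecated since Akka 2.6 in favor of scheduleWithFixedDelay.

```scala
import akka.actor.{Actor, ActorSystem, Cancellable, Props}
import scala.concurrent.duration._

final case class Scrap(url: String)
case object Process // hypothetical tick message

class ThrottledCrawler(delay: FiniteDuration, onScrape: String => Unit) extends Actor {
  import context.dispatcher // ExecutionContext used by the scheduler

  private var toProcess = List.empty[String]

  // The scheduler only sends a message to self. The list is read and written
  // exclusively inside receive, which handles one message at a time.
  private val tick: Cancellable =
    context.system.scheduler.schedule(0.seconds, delay, self, Process)

  override def postStop(): Unit = { tick.cancel(); () }

  def receive: Receive = {
    case Scrap(url) =>
      toProcess = url :: toProcess // add a new url
    case Process =>
      toProcess match {
        case url :: rest =>
          toProcess = rest         // remove the last added url
          onScrape(url)
        case Nil => ()             // nothing to do on this tick
      }
  }
}

object ThrottleDemo {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("throttle-demo")
    val crawler = system.actorOf(Props(new ThrottledCrawler(1.second, u => println(s"scraping $u"))))
    crawler ! Scrap("https://example.com/a")
    crawler ! Scrap("https://example.com/b")
    Thread.sleep(2500) // demo only: give the ticks time to fire
    system.terminate()
  }
}
```

At most one url leaves the list per tick, so the delay between requests to the same host never drops below the configured interval, and no thread ever blocks while waiting.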
When we need to get a response from an Actor, we can use the ask pattern, written with the ? operator. The implicit timeout value defines how long we wait before the ask fails automatically. If it does, the process recovers with the ScrapFailure message. Finally, we send the status back to the Supervisor.
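The ask, recover and reply steps can be sketched like this, assuming Akka classic. Asker and SilentScraper are hypothetical names, and the message fields are assumptions. SilentScraper never answers, so the ask always times out and the recovery path is exercised.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.pattern.{ask, pipe}
import akka.util.Timeout
import scala.concurrent.duration._

final case class Scrap(url: String)
final case class ScrapFinished(url: String)
final case class ScrapFailure(url: String, reason: Throwable)

// Stand-in Scraper that swallows every request without replying.
class SilentScraper extends Actor {
  def receive: Receive = { case Scrap(_) => () }
}

class Asker(scraper: ActorRef, supervisor: ActorRef, askTimeout: FiniteDuration) extends Actor {
  import context.dispatcher
  // Picked up implicitly by `?`; the ask fails once this much time has passed.
  implicit val timeout: Timeout = Timeout(askTimeout)

  def receive: Receive = {
    case Scrap(url) =>
      (scraper ? Scrap(url))
        .mapTo[ScrapFinished]
        .recover { case e => ScrapFailure(url, e) } // a timed-out ask ends up here
        .pipeTo(supervisor)                         // deliver the outcome as a message
  }
}

object AskDemo {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("ask-demo")
    val printer = system.actorOf(Props(new Actor {
      def receive: Receive = { case msg => println(s"supervisor got: $msg") }
    }))
    val scraper = system.actorOf(Props(new SilentScraper))
    val asker = system.actorOf(Props(new Asker(scraper, printer, 1.second)))
    asker ! Scrap("https://example.com")
    Thread.sleep(2000) // demo only: wait for the timeout to fire
    system.terminate()
  }
}
```

Using pipeTo instead of waiting on the Future keeps the Actor non-blocking: the result arrives at the Supervisor as an ordinary message whenever it is ready.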
In the end, we get a non-blocking Actor that processes a website without spamming it.
The Scraper Actor is simple. It does not hold any state (so we could use a simple Future instead of an Actor); its purpose is to scrape the provided url, send the success flag to the sender, and send the scraped information to the Indexer. Here is the receive method of the Actor:
The most interesting part of this Actor is how it parses a page.
We used Jsoup, a popular web scraping library, to get links and other information from a web page. Since it is a Java library, we needed to convert the results to Scala collections, which is possible with the scala.collection.JavaConverters object. We also do a simple check that the url points to valid content, since we want to parse only html pages.
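A small sketch of the parsing step, assuming the jsoup library is on the classpath. PageParser, Content and isHtml are hypothetical names. To stay runnable offline it parses an HTML string; a real Scraper could fetch the page with Jsoup.connect(url).execute() and read the content type from the response.

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._ // deprecated in Scala 2.13, still available

final case class Content(title: String, urls: List[String])

object PageParser {
  // Parses already-downloaded HTML; baseUri lets Jsoup resolve relative links.
  def parse(html: String, baseUri: String): Content = {
    val doc = Jsoup.parse(html, baseUri)
    val urls = doc.getElementsByTag("a").asScala // Java list -> Scala Buffer
      .map(_.absUrl("href"))                     // "" when no absolute url can be built
      .filter(_.nonEmpty)
      .toList
    Content(doc.title(), urls)
  }

  // One possible validity check: accept only html content types.
  def isHtml(contentType: String): Boolean =
    contentType != null && contentType.toLowerCase.startsWith("text/html")
}
```

For example, PageParser.parse(html, "https://example.com/") turns a relative href such as /a into an absolute url on example.com, which is the form the Supervisor needs in order to pick a host.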
The two main goals of the Indexer Actor are to store the received information and to send all scraped urls to the Supervisor. To make things more fun, we print all received data before the Actor stops working.
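A minimal sketch of such an Indexer, assuming Akka classic. The message shapes (Content, Index, IndexFinished fields) and the store name are assumptions. Printing happens in postStop, the lifecycle hook Akka calls when an Actor is stopped.

```scala
import akka.actor.{Actor, ActorRef}
import scala.collection.mutable

final case class Content(title: String, urls: List[String])
final case class Index(url: String, content: Content)
final case class IndexFinished(url: String, urls: List[String])

class Indexer(supervisor: ActorRef) extends Actor {
  // Everything indexed so far, keyed by url.
  private val store = mutable.Map.empty[String, Content]

  def receive: Receive = {
    case Index(url, content) =>
      store += (url -> content)
      // Only the urls travel on to the Supervisor, not the whole page content.
      supervisor ! IndexFinished(url, content.urls)
  }

  // Called when the Actor stops, including during system shutdown.
  override def postStop(): Unit =
    store.foreach { case (url, content) => println(s"$url -> ${content.title}") }
}
```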
If you run the program, you will see a bunch of messages that it produces. They simply show the processing stage of each url.
We have built a simple web crawler that can successfully crawl several websites. Moreover, we achieved that with Akka alone: we did not use standard Java multithreading techniques, just messages and Actor instances.
The source code is available on GitHub under the MIT License.