Web crawling platform

Building a generic web crawling platform with an advanced workflow system, able to handle any kind of custom logic, deal with website bot protection, remain usable by technical personnel who are not experienced developers, and scale horizontally.

One of our clients is in the business of web crawling and data mining. Every month they need to collect millions of web pages from hundreds of different websites and extract the relevant parts of the data from them.

Challenge

The number of pages that need to be retrieved and analyzed per website varies from small (several hundred) to huge (several million). Some of the sites are crawled on a recurring schedule and some just once, but sometimes under a very strict deadline. The original approach to meeting such demands was to develop a dedicated app for each target website. Initially, it did not rely on any of the modern cloud platforms, so a large payload had to be split and processed by multiple instances of the same app running on many different machines, each handling its own part. Over time, keeping all those apps running became difficult from both development and operations standpoints. One had to deal with constant changes in the structure of the crawled websites, update apps to keep pace with technology trends, write new apps in the most efficient and optimized way and, finally, constantly reconfigure how and where they run.

All of that merged into the idea of developing a highly scalable, generic solution for running any kind of web crawling process - in fact, any kind of workflow - so that processing, transforming and storing the collected data is covered as well. Additionally, we had to improve the crawling logic for some of the problematic websites which, due to their advanced bot protection, required constant workflow changes and manual interaction. Finally, we had to find a way for the new solution to adapt to constantly changing load and deadline demands.


Requirements

Generic architecture

The first challenge was to abstract and standardize the way users define what the system needs to do for all the different crawling processes. The core logic may be described as follows:

  • Fetch the content of a web page (via some sort of HTTP GET logic)
  • Extract the desired parts of the content (most commonly via some sort of XPath logic)
  • Do something with the extracted data - e.g. store it in a database or a file
  • Do all of the above in a loop over a number of pages

The idea was to pack common functionality like this into tasks, which users can then invoke, defining website-specific behavior via different properties on those tasks. The most challenging part was how to present this to the users. We had to come up with a user-friendly solution so that people could define, view and understand complex workflows as easily and as quickly as possible.
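As a rough illustration, a task abstraction along these lines might look like the sketch below. The names (ITask, WorkflowContext, HttpGetTask, XPathExtractTask) and the use of HtmlAgilityPack are our assumptions for the example, not the platform's actual API.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using HtmlAgilityPack;

// Hypothetical task abstraction - the names are illustrative, not the platform's actual API.
public interface ITask
{
    // Each task reads its inputs from and writes its results to a shared context.
    void Execute(WorkflowContext context);
}

// Shared state passed between the tasks of a workflow.
public class WorkflowContext
{
    public Dictionary<string, object> Variables { get; } = new Dictionary<string, object>();
}

// Fetches a page and stores the raw HTML in a context variable.
public class HttpGetTask : ITask
{
    public string Url { get; set; }
    public string OutputVariable { get; set; }

    public void Execute(WorkflowContext context)
    {
        using (var client = new HttpClient())
        {
            context.Variables[OutputVariable] =
                client.GetStringAsync(Url).GetAwaiter().GetResult();
        }
    }
}

// Extracts parts of previously fetched HTML via an XPath expression
// (HtmlAgilityPack is used here purely for the sake of the example).
public class XPathExtractTask : ITask
{
    public string InputVariable { get; set; }
    public string XPath { get; set; }
    public string OutputVariable { get; set; }

    public void Execute(WorkflowContext context)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml((string)context.Variables[InputVariable]);
        context.Variables[OutputVariable] = doc.DocumentNode.SelectNodes(XPath);
    }
}
```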

Advanced crawling tasks

A real-life web crawling scenario may be much more complex than the simplified view given above. More and more websites introduce various crawling protection methods. We were using rotating proxies, mimicking human behavior via JavaScript and some other techniques (which we also needed to pack into tasks), but at the end of the day, there are sites out there which are so good at detecting bots that nothing helps except loading them in a real browser. So we needed to assess the best headless browser solution and incorporate it into our logic.

Horizontal scaling

The most important requirement was that the solution had to be able to scale the work across multiple machines automatically. That meant we had to come up with some kind of cluster solution in the cloud. For the many workflows where processing time is not critical and which are scheduled to run at different times, the cluster would dynamically expand and contract based on load demands. For the important ones under a strict deadline, one may need to run a dedicated cluster with specified scaling parameters, calculated so that they can complete in time.


Solution

We were tasked with building a generic web crawling workflow system, able to handle any kind of custom logic and deal with website bot protection, suitable for technical personnel who are not experienced developers and capable of scaling horizontally. We could not find a commercial solution on the market meeting such demands. We implemented a generic solution by describing workflows in XML, building a library of packaged tasks able to run that XML and adding custom task scripting via the Roslyn compiler platform. Advanced web crawling tasks and bot protection were handled via CEF - Chromium Embedded Framework. Horizontal scaling was implemented by modeling the entire application architecture with the Akka.NET actor model framework and using its built-in cluster capabilities. Docker and Kubernetes were used for running the application parts in the cloud environment.

Technical details

Defining workflows via XML

One of the initial demands was to try to produce a user interface that could make the system usable by someone technically savvy, but not necessarily an experienced software developer. We looked at commercial tools like Mozenda for ideas and developed a flowchart-like user interface in ReactJS and a .NET Web API, with the ability to drag and drop tasks and define their properties visually. We also created a class library workflow model on the backend and were able to transform it to/from our front-end representation and serialize/deserialize it to/from a database.

After testing that solution for a while, it turned out that really complex workflows with hundreds of tasks - and most of our client's workflows are like that - were simply too hard to comprehend, search through and change in such a visual manner. Another problem was that frequent changes were hard to track and revert, so it was almost as if we needed a source control system for the stored workflows.

Since we used XML for serialization, it occurred to us that using a decent XML-friendly text editor and working directly on the XML made everyone more productive. A simple file as a workflow definition was also a good fit for source control. Hence we opted for this simpler, yet more effective solution and completely separated the logic of designing workflows from running them.
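To give an idea of the approach, the sketch below shows how a workflow model might be loaded from such an XML file with the standard XmlSerializer. The element names and the WorkflowDefinition/WorkflowLoader types are hypothetical, not the actual schema used on the project.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical workflow model - element and attribute names are illustrative.
//
// A definition file might look roughly like:
// <Workflow name="example">
//   <HttpGet url="https://example.com/page/1" output="html" />
//   <XPathExtract input="html" xpath="//h1" output="titles" />
// </Workflow>
[XmlRoot("Workflow")]
public class WorkflowDefinition
{
    [XmlAttribute("name")]
    public string Name { get; set; }

    // Each child element maps to one task definition type.
    [XmlElement("HttpGet", typeof(HttpGetTaskDefinition))]
    [XmlElement("XPathExtract", typeof(XPathExtractTaskDefinition))]
    public object[] Tasks { get; set; }
}

public class HttpGetTaskDefinition
{
    [XmlAttribute("url")] public string Url { get; set; }
    [XmlAttribute("output")] public string Output { get; set; }
}

public class XPathExtractTaskDefinition
{
    [XmlAttribute("input")] public string Input { get; set; }
    [XmlAttribute("xpath")] public string XPath { get; set; }
    [XmlAttribute("output")] public string Output { get; set; }
}

public static class WorkflowLoader
{
    // Workflow definitions live in plain XML files, so they can be edited in any
    // XML-friendly editor and versioned in source control.
    public static WorkflowDefinition Load(string path)
    {
        var serializer = new XmlSerializer(typeof(WorkflowDefinition));
        using (var stream = File.OpenRead(path))
        {
            return (WorkflowDefinition)serializer.Deserialize(stream);
        }
    }
}
```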

Expression Engine

As workflows grew in complexity, our XML needed full support for almost anything one might do in a programming language. We had packed complex functionality into tasks, but there were many custom math operations and data transformations that simply couldn't be defined in such a generic manner. We realized we needed the ability to write small code snippets to accomplish any kind of custom functionality, so we looked for an expression engine able to parse and run custom code.

As our workflow was actually a script of a sort, with variables defined dynamically on the fly, we thought the most suitable scripting language would be JavaScript. Since our code was written in .NET, our first choice was ClearScript, Microsoft's .NET library for hosting the V8 JavaScript engine. We used it initially, until we realized it had major performance problems.

Looking further, we decided to try the (back then relatively new) Microsoft .NET Compiler Platform, Roslyn, and run our scripts in C#. We never looked back, since the performance was far better than anything else we had tried. There is an initial compilation penalty, but we reduced it by caching each snippet as a dynamically generated assembly on disk and loading it from there on every repeated run. This gave us the final ingredient for writing workflows with any kind of custom logic. The way our workflows looked at that point, we had an internal joke that we had produced XML#.
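For illustration, running such a snippet with the Roslyn scripting API could look roughly like this. The ScriptGlobals type and the expression are made up for the example, and the on-disk assembly caching mentioned above is omitted.

```csharp
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

// Hypothetical globals object exposing workflow variables to a snippet.
public class ScriptGlobals
{
    public double Price { get; set; }
    public double Quantity { get; set; }
}

public static class SnippetRunner
{
    // Compile the snippet once and reuse it; in the real system the emitted
    // assembly was additionally cached on disk to avoid repeated compilation cost.
    private static readonly Script<object> TotalScript =
        CSharpScript.Create<object>(
            "Price * Quantity",
            ScriptOptions.Default,
            globalsType: typeof(ScriptGlobals));

    public static async Task<object> EvaluateAsync(double price, double quantity)
    {
        var state = await TotalScript.RunAsync(
            new ScriptGlobals { Price = price, Quantity = quantity });
        return state.ReturnValue;
    }
}
```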

One may wonder why we went to all that XML and C# scripting effort - why not just pack the major functionality into a class library and have programmers write custom apps or modules that use it? There were a number of reasons. First, and most important, it was simply hard to enforce the discipline on developer-users of the app not to introduce their own modified versions of the standard tasks, which would have landed us back in a maintenance and possibly performance mess. Second, the XML scripts for simpler workflows were written by other roles, not strictly developers. The third reason was simplicity - although it would be possible to write a custom library module per workflow, recompile it on every change and plug it into the main platform, that is far more complex than just changing a simple script file and re-running it.


CEF as headless browser

A large part of our development time was dedicated to the headless browser. We had previous experience with Phantom and Selenium, but weren't able to cover all the required scenarios and didn't really have an organized code base ready to be used across the board. We had good previous experience with CEF - Chromium Embedded Framework - on another project, where we used it to render complex graphics in a desktop application. Having Chromium at its base and being actively developed for years, with a large community, it looked reliable and full of potential. We followed the guidelines on how to run it in headless (or, more precisely, windowless) mode and packed everything we could think of into scriptable tasks - creating/destroying an instance, loading a URL, getting the HTML from the current DOM, setting a proxy, setting/clearing cookies, executing JavaScript, etc.
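Conceptually, the browser-related tasks can be thought of as a thin wrapper interface like the sketch below. The member names are our own for the example and do not reflect CefGlue's actual API.

```csharp
using System.Threading.Tasks;

// Hypothetical wrapper around the windowless browser - member names are
// illustrative and do not reflect CefGlue's actual API. Each member roughly
// corresponds to one scriptable task in a workflow.
public interface IBrowserSession
{
    Task LoadUrlAsync(string url);             // load a page and wait for it to finish
    Task<string> GetHtmlAsync();               // HTML of the current DOM, not the raw response
    Task ExecuteJavaScriptAsync(string code);  // e.g. to mimic human behavior on the page
    void SetProxy(string host, int port);
    void SetCookie(string name, string value, string domain);
    void ClearCookies();
}

public interface IBrowserPool
{
    // Creating and destroying browser instances are tasks as well,
    // so a workflow controls the browser lifecycle explicitly.
    Task<IBrowserSession> CreateAsync();
    Task DestroyAsync(IBrowserSession session);
}
```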

We used the Xilium.CefGlue .NET CEF implementation, as it gave us a de facto 1:1 mapping to the native C++ CEF code, compared to CefSharp, which is more user-friendly and has less of a learning curve. It also meant we could do our own custom CEF builds and upgrade to newer versions more frequently.

Finally, with the emergence of .NET Core, we wanted to run isolated crawling processes on a number of available Linux machines, so we had to do a custom Linux CEF build and integrate it into our app.

Actor framework for distributing work

While looking into how best to do the horizontal scaling, we chose the actor model. Studying resources like Distributed Systems Done Right: Embracing the Actor Model convinced us that this was the right architecture for us. It later proved to be the right decision, not only due to the built-in scaling and cluster support, but also in terms of how it handles concurrency problems (e.g. different parts of a workflow performing conflicting actions at the same time) and fault tolerance (when one of the workflow worker parts becomes inaccessible on the network or fails due to an error).

With .NET as our core technology, the choice was between two actor framework solutions - Microsoft's Orleans and Akka.NET, a port of the Akka framework from Java/Scala. We spent some time evaluating both and chose Akka.NET despite its slightly steeper learning curve, because it is more open, more customizable and closer to the original actor model philosophy.

A cluster environment was our primary goal, but as the implementation progressed, we realized there were problems we simply couldn't have solved without the actor model. These problems are inherent to the nature of distributed systems. For example, as our workflows were parallelized, there was a need to pause, restart or cancel a whole workflow, either by manual action from the user interface or by the workflow logic itself in one of the parallel iterations. On a single machine one could use various locking techniques to synchronize the behavior of parallel executions, but in a distributed system that all changes. Akka, like other actor model implementations, is designed to deal with this kind of problem. With every logical unit of work defined as a single actor responsible for handling it, concurrency was easy to deal with, since messages (commands) are always processed one at a time. The workflow control problem was solved by having a single runner actor which responded to pause, restart or cancel messages one at a time, notifying the other (distributed) parts/actors of the same workflow and not proceeding with further message processing until it had made sure they completed their part.
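A stripped-down Akka.NET sketch of that runner idea is shown below. The message types, actor names and the simple broadcast are our assumptions for the example; the real runner also waits for acknowledgements from the workers before processing the next message, which is omitted here.

```csharp
using System.Collections.Generic;
using Akka.Actor;

// Hypothetical control messages - names are illustrative.
public sealed class PauseWorkflow { }
public sealed class ResumeWorkflow { }
public sealed class CancelWorkflow { }

// A single runner actor per workflow. Because an actor processes one message
// at a time, pause/resume/cancel commands can never interleave with each other.
public class WorkflowRunnerActor : ReceiveActor
{
    // Worker actors (possibly on other cluster nodes) executing parts of the workflow.
    private readonly List<IActorRef> _workers = new List<IActorRef>();

    public WorkflowRunnerActor()
    {
        Receive<PauseWorkflow>(_ => Broadcast(new PauseWorkflow()));
        Receive<ResumeWorkflow>(_ => Broadcast(new ResumeWorkflow()));
        Receive<CancelWorkflow>(_ =>
        {
            Broadcast(new CancelWorkflow());
            Context.Stop(Self);
        });
    }

    private void Broadcast(object message)
    {
        foreach (var worker in _workers)
            worker.Tell(message);
    }
}

// Usage sketch:
// var system = ActorSystem.Create("crawling");
// var runner = system.ActorOf(Props.Create(() => new WorkflowRunnerActor()), "workflow-runner");
// runner.Tell(new PauseWorkflow());
```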

Cluster environment

We dockerized all of our application parts and ran them in Kubernetes, which provided the adaptive load balancing ability, i.e. cluster elasticity. This was not without challenges, as we had to make all our actors persistent and resilient to both cluster expansion and contraction. At any time they had to be able to terminate on one machine and re-emerge with the same restored state on another. This was done via Akka cluster sharding, but we couldn't solve the headless browser part in this way, as it was simply impossible to save and restore its state. Thus we had to implement the browser part as a separate microservice, able to scale out to new machines via regular Akka cluster routing, but without contraction, which had to be controlled manually.
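For reference, starting a sharded region for workflow actors in Akka.NET looks roughly like the sketch below. The message envelope, shard count and entity actor (the WorkflowRunnerActor from the earlier sketch) are illustrative, and the cluster and persistence configuration is assumed to be in place.

```csharp
using Akka.Actor;
using Akka.Cluster.Sharding;

// Hypothetical envelope carrying the id of the workflow entity it targets.
public sealed class WorkflowCommand
{
    public WorkflowCommand(string workflowId, object payload)
    {
        WorkflowId = workflowId;
        Payload = payload;
    }

    public string WorkflowId { get; }
    public object Payload { get; }
}

// Routes each message to a shard/entity based on the workflow id, so a workflow
// actor can be stopped on one node and re-created on another with the same identity.
public sealed class WorkflowMessageExtractor : HashCodeMessageExtractor
{
    public WorkflowMessageExtractor() : base(100) { } // illustrative shard count

    public override string EntityId(object message) =>
        (message as WorkflowCommand)?.WorkflowId;
}

public static class ShardingSetup
{
    public static IActorRef StartWorkflowRegion(ActorSystem system)
    {
        // Returns the shard region; messages sent to it are routed to the right
        // workflow entity, wherever in the cluster it currently lives.
        return ClusterSharding.Get(system).Start(
            "workflow",
            Props.Create(() => new WorkflowRunnerActor()),
            ClusterShardingSettings.Create(system),
            new WorkflowMessageExtractor());
    }
}
```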

The client also had a number of physical machines available, which were still running legacy crawling apps but were underused. We were able to run a direct, static cluster environment on them too. There was no cluster elasticity in this case, and the work distribution was done via the adaptive load balancing provided by the Akka cluster metrics module.

Our tech stack
Microsoft .NET
Chromium
Akka
Roslyn

Book a free consultation

Let us know what you would like to do. We will probably have some ideas on how to do it.