Couple of thoughts about web crawling

These days crawling the web¹ is becoming more and more difficult, with websites trying to ensure that “only humans” access them. While there are some good arguments in favour of these efforts, I also have some good reasons for wanting to crawl:

- Crawling an auction website to feed the images / descriptions into an LLM and enhance the listings, because sellers don’t always provide the details I’m interested in.
- Crawling the premium website of a recently deceased comic author, because I’m not sure how much longer it will be around…

Anyway, here are some tips to help out in creating a crawler:

First and most important: run this code under a separate user! These days with all the supply chain attacks, using some kind of separation is highly recommended. Whether it’s a separate user (without admin rights!) or even a dedicated virtual machine, run it isolated.
The idea is to run a full browser and programmatically navigate to the pages we want to crawl. This (a) helps avoid bot detection systems, (b) lets you intervene and unblock the process if a CAPTCHA or similar obstacle is displayed, and (c) lets you help out with steps like logging in to the website.
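As a rough illustration of that idea (a sketch under assumptions, not the code from the example repo): the snippet below uses Playwright's synchronous Python API to attach to a Chrome that is already listening for CDP connections on port 9222, the setup described further down. The helper names (`url_to_stem`, `archive_page`, `crawl`), the `archive` output directory and the five second delay are all made up for this example. It also assumes that `page.pdf()` works against your Chrome; Playwright documents PDF generation for headless Chromium, so treat that call as something to verify.

```python
import re
import time
from pathlib import Path
from urllib.parse import urlparse


def url_to_stem(url: str) -> str:
    """Turn a URL into a filesystem-friendly file name stem (hypothetical helper)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    if parsed.query:
        raw += "_" + parsed.query
    # Replace every run of characters outside a conservative whitelist with "_".
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_") or "index"


def archive_page(page, url: str, out_dir: Path) -> Path:
    """Navigate `page` to `url`, then save the HTML and a PDF print of it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    page.goto(url, wait_until="load")
    html_path = out_dir / (url_to_stem(url) + ".html")
    html_path.write_text(page.content(), encoding="utf-8")
    # Assumption: PDF printing is available in the Chrome we are attached to.
    page.pdf(path=str(html_path.with_suffix(".pdf")))
    return html_path


def crawl(urls, out_dir=Path("archive"), cdp_url="http://127.0.0.1:9222", delay_s=5.0):
    """Attach to an already running Chrome over CDP and archive each URL in turn."""
    # Imported here so the helpers above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        # Reuse the existing profile (cookies, logins) instead of a fresh context.
        context = browser.contexts[0]
        page = context.pages[0] if context.pages else context.new_page()
        for url in urls:
            archive_page(page, url, out_dir)
            # Be gentle: pause between pages so we don't overwhelm the site.
            time.sleep(delay_s)
```

Calling `crawl(["https://example.com/page-1"])` (a placeholder URL) would then drive the visible browser window, so you can still step in by hand when a login form or CAPTCHA shows up.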
What I used (repo with example scripts):
- The latest Chrome controlled through the CDP (Chrome DevTools Protocol). This currently seems to be the best way to “blend in”. Browser automation libraries like Playwright or Selenium come with their own pre-packaged browsers, but those are usually older versions and they quickly get flagged as “bots” by the automated systems.
  
  To start Chrome with CDP, just run it on the command line with --remote-debugging-port=9222 --user-data-dir=Chrome_Profile. The “user-data-dir” part is technically not needed, since we’re already running under a separate user (you did create a separate user, right!?), but it provides an extra level of isolation. It does mean that if you want to use the cookies elsewhere, for downloading videos for example, you would have to use an extension like “Get cookies.txt LOCALLY”. (Chrome extensions are a notorious source of malware, but you are running under a separate user and only providing minimal information to Chrome, right!?)
- I used Playwright in the current project. It supports a couple of different programming languages, can connect to Chrome through CDP, and LLMs do a decent job of generating code that mostly does what you need it to do (I found knowledge of web technologies like JavaScript and CSS – selectors for example – useful in guiding it in the right direction). I stuck to the synchronous API for this project, since my aim was a one-off archival rather than a long-running, performant system. Also, “high performance” systems can easily overwhelm the site we are trying to archive!
- I saved the website in three ways (see the code):
  - As a dump of all the HTML content (obviously this doesn’t cover additional assets like style sheets, JavaScript or images). It also wouldn’t cover embedded iframes (which this website didn’t use).
  - As a “PDF print” which preserves the visual feel better (and also keeps text as searchable text)
  - All the requests proxied through mitmproxy
- Speaking of mitmproxy, here are a couple of things I’ve learned:
  - macOS ships with Python 3, but to get the latest version you need to update / install it from python.org (makes sense in hindsight, but I was used to the GNU/Linux way of “just update all the packages, which will get you the latest version of the software”).
  - Run mitmdump: mitmdump --listen-host 127.0.0.1 -w dump.mitm -s strip_hsts.py (the source for strip_hsts.py is in the example repo). This will save all traffic in the “dump.mitm” file.
  - To have Chrome use mitmproxy, use the Proxy SwitchyOmega (V3) extension (again, this extension seems alright for limited use in a sandbox environment). Other ways of setting up the proxy suggest setting it system-wide, which could capture other, unintended traffic and would cut internet access when I stop the proxy.
  - To set up TLS (aka SSL/HTTPS) capture – which we almost certainly need in today’s world – start with the mitmproxy documentation. Again, I would strongly recommend installing the certificates only for this instance rather than system-wide as the documentation suggests (I want to capture traffic from this instance only, not from the whole system!).
  - You can convert the dump (which now contains all the data used to display the website – all the images, style sheets, JavaScript, etc.²) from the mitmproxy-specific format to the more generic HAR format using the command mitmdump -r dump.mitm --set hardump=dump-9.har. Note that this is somewhat inefficient, in that it first needs to load the entire dump file into memory, so try splitting up the dumps in advance into chunks no larger than 3/4 of your RAM³. Theoretically you can also use the -q command line parameter to avoid dumping all the URLs on the screen again, but I found that with that option the conversion process sometimes hangs ¯\_(ツ)_/¯. And finally, these are text-based formats, so they benefit from compression with bzip2 / xz / 7zip.
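To give an idea of what can be done with the converted file afterwards, here is a minimal sketch using only the Python standard library. It assumes the export follows the usual HAR 1.2 layout (`log.entries[]`, with the body under `response.content.text` and an optional `encoding` of `base64`), and the function names `iter_har_bodies` and `compress_xz` are made up for this example. Note that `json.load` also reads the whole file into memory, so the same size limits apply.

```python
import base64
import json
import lzma
import shutil
from pathlib import Path


def iter_har_bodies(har_path):
    """Yield (url, mime_type, body_bytes) for each HAR entry that has a stored body."""
    with open(har_path, encoding="utf-8") as fh:
        har = json.load(fh)  # loads the entire HAR file into memory
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # no body was recorded for this response
        if content.get("encoding") == "base64":
            body = base64.b64decode(text)  # binary assets such as images
        else:
            body = text.encode("utf-8")
        yield entry["request"]["url"], content.get("mimeType", ""), body


def compress_xz(src, dst=None):
    """Compress `src` with xz, streaming it so the file never sits fully in memory."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_name(src.name + ".xz")
    with open(src, "rb") as fin, lzma.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `for url, mime, body in iter_har_bodies("dump-9.har"): print(url, mime, len(body))` lists what was captured, and `compress_xz("dump-9.har")` writes `dump-9.har.xz` next to the original.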

¹ Image from Pixabay.
² This still might not be enough to later completely recreate the website, because the JavaScript might do some client-side processing that involves the current time, random values (used for cache-busting for example) or other sources of randomness. That’s why I also archive the pages as PDFs. And, in my particular case, I am confident that I only need a subset of data from the captured pages, which is not affected by these potential problems.
³ I tried looking for solutions that would convert each request one by one (“streaming”) instead of loading the entire file in memory, but couldn’t find any.
