OVER THE FULLERENESHIFT
19/3/2025 - Creating a Web Scraper - Thoughts on JavaScript and Why Archival is Necessary
Recently I tried making a small web scraper in POSIX shell - with curl, but other than that, just POSIX. Although this was originally a project mostly for fun and to make a small search engine, I am planning to continue working on this to archive as much of the internet as I personally can. That might seem pointless, especially with the existence of the Internet Archive, and to an extent, it is, but I'm going to rant about it anyways because it's my website and I can do whatever I want.
What's the Point of Web Scraping?
Note: Why a web scraper as opposed to a web crawler? Fundamentally, web scrapers and crawlers are the same thing - what makes the difference is mostly whether the use is malicious or not. For example, a web crawler might be crawling pages to power a search engine, while a web scraper might be trying to automatically steal or rehost information from an entire website. I'm choosing to say web scraper here since, as of writing, I have not implemented any support for the robots.txt standard - more on that later.
Of course, web crawling can be used by companies like Google to power search engines, or in similarly questionable ways, like training AI - but most of these crawlers are operated by huge companies, with huge servers that can efficiently run the scrapers and store all the data. I don't own a server, let alone hundreds of thousands like Google, so why am I doing this? Mostly because I'm autistic.
I really care about permanence - in a way, this relates to my obsession with compatibility; the more cross-compatible something is, the longer it'll continue to work. The same goes for storing data - realistically, storing the same information with multiple online hosts is probably the safest option, but I like knowing that I have control over my data.
Theoretically, Google Drive or Dropbox have full access to any files I upload - if they wanted to, they could delete their entire servers tomorrow. As unlikely as that is, especially for companies this big, knowing that I physically own the data I want to preserve is very different from trusting those companies' promises.
At some point, all of these companies, no matter how large, will collapse, and everything on their servers - all those Google Drives and Dropboxes - will be permanently lost unless it was backed up in advance. If I put an image on an HDD, it'll last tens of years - unless the drive is physically damaged, nothing will realistically happen to that data. Combine that with an online host, and no matter which goes down first - your physical storage or your cloud storage - you'll have time to transfer and back up all of your data somewhere new.
The Internet Archive
In my opinion, one of the most important things to archive is the web - it contains so much more information than anything else that exists so far, and so much of it is easily accessible and searchable. The web holds many mediums - it's not just websites. Lots of games, shows, movies, even books are stored digitally on the internet and available for free, although not always legally. Web scrapers can download all of this, given sufficient storage, and that's a huge amount of information - lots of which could be lost if things like the Wayback Machine didn't exist.
The Wayback Machine is a great resource, but the Internet Archive has faced multiple lawsuits - one very recently - and its legality is questionable. In order to keep existing online, it has had to take websites down from the archive, and there are still people trying to shut it down.
All of this information could vanish entirely if the Internet Archive and the Wayback Machine were ever shut down. Of course, there are similar projects, like archive.today, but since they're similarly not exactly legal and are often used for piracy, all of these websites are in danger of being taken down.
So, naturally, the solution is to archive everything myself.
Writing a Web Scraper
Of course, as I said at the start, I don't have nearly as many resources as the huge companies; I can't archive everything myself. But I believe that the more people dedicate their time to archiving, downloading, and crawling as much of the internet as they can, the more of it we can safely preserve. Even if you aren't interested in writing one yourself like I am, there are many existing web crawlers and scrapers out there. So if you have an extra computer, don't need too much bandwidth on your Wi-Fi, and have HDDs and SSDs lying around that you don't need, I ask you: if you're interested in preserving anything - it can probably be found somewhere on the web - set up a web crawler!
One last thing before I go into the details that quite literally no one will ever care about:
robots.txt
Should you or should you not follow the robots.txt standard? I believe that if you're trying to archive something, you shouldn't follow robots.txt. Although it's helpful for stopping standards-conforming crawlers from being used to train AI, robots.txt can also exclude bots that aren't doing anything malicious. That said, web crawlers can overload smaller servers if they're going too fast, so I would still recommend following the Crawl-delay directive.
Back to Writing a Web Scraper
As I said at the start, I was writing it in POSIX shell. If you check the GitHub repository, though, you'll see that a lot of the code is now in JavaScript.
As much as I am against writing JavaScript for code outside of the browser, I am starting to see more and more why people continue to use it server-side regardless. I could write this project in Python - but I have already put in hours learning so many of JavaScript's small details that it's faster to just keep writing it in JavaScript.
As much as this could cost me time later compared to if I had used Python, it's a much easier place to start. The JavaScript rewrite has been operating significantly faster and better than the shell version, and writing Python code for something somewhat complex would take much, much longer considering how little Python I actually know - or any language other than JavaScript, for that matter. I really hate that I'm using JavaScript outside the browser, but I don't know how to easily avoid it. Hopefully something will lead me to learning a better language soon.
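To give a rough idea of what the core of a scraper like this looks like, here's a simplified JavaScript sketch - again, not the actual code from my repository, just an illustration assuming Node 18+ for the built-in fetch. The file naming and the regex-based link extraction are placeholders for whatever a real version would do.

// Minimal sketch of a scraping loop: fetch a page, save the HTML to disk,
// pull out links, and queue any we haven't seen yet.
const { writeFile } = require("node:fs/promises");

async function scrapePage(url, queue, seen) {
  const res = await fetch(url);
  const html = await res.text();

  // Save the raw HTML; base64url-encoding the URL keeps the filename safe.
  const name = Buffer.from(url).toString("base64url") + ".html";
  await writeFile(name, html);

  // Crude link extraction with a regex - a real crawler should use a proper
  // HTML parser, but this is enough for a sketch.
  for (const match of html.matchAll(/href="([^"]+)"/g)) {
    try {
      const link = new URL(match[1], url);
      if (link.protocol.startsWith("http") && !seen.has(link.href)) {
        seen.add(link.href);
        queue.push(link.href);
      }
    } catch {
      // Skip hrefs that don't resolve to valid URLs.
    }
  }
}

(async () => {
  const queue = ["https://example.com"];
  const seen = new Set(queue);
  while (queue.length > 0) {
    await scrapePage(queue.shift(), queue, seen);
    await new Promise((resolve) => setTimeout(resolve, 1000)); // stay polite between requests
  }
})();

Nothing clever, but it shows why the JavaScript version is so much less painful than juggling curl, grep, and temporary files in shell.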
Conclusion
If you care about preserving anything, please help archive the internet! Use resources like this if you know how to code, or just take snapshots of pages with archive.today or the Wayback Machine if you don't. Thank you so much for reading!
Note: I really don't like how tech focused this blog is becoming, especially considering just how little I actually know. I'll try to put out more content on other, unrelated topics.