My best friend is an engineer at the National Renewable Energy Laboratory and is always trying to get me to come work there, since I have an affinity for high-performance computers (supercomputers). So, for the fun of it, I wanted to see where the HPCs his organization works on stand on a global scale of computing power, which naturally led me to the TOP500 website.
The problem with this website is that you cannot search the listing for specific rankings. You can download an XML file to search through... but that requires you to create an account just to download it... nah, too hard.

So let's extract the data ourselves.
The first thing I noticed is that the URLs for the various pages (100 computers per page) differ only by a query parameter, e.g. https://www.top500.org/lists/top500/list/2022/11/?page=2. This makes it pretty easy to loop through the pages and download them one by one:
for i in {1..5}; do wget "https://www.top500.org/lists/top500/list/2022/11/?page=$i"; done
This downloads the full HTML of each page, which contains the data we want.

You'll end up with saved HTML files like these:
index.html?page=1 index.html?page=2 index.html?page=3 index.html?page=4 index.html?page=5
Instead of having the data split across multiple files, it's better to combine them into one:
cat index* >> top500.txt
Now that we have the data, we can parse through it. You can't really do this easily with Bash, as it gets messy really fast, but there is a nice Python library called Beautiful Soup that we can use to strip out the HTML tags and other nonsense we don't want.
sudo apt-get install python3-bs4
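(If you're not on a Debian-based system, the same library can be installed through pip; the PyPI package is named beautifulsoup4.)
pip3 install beautifulsoup4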
The source code is super simple: it reads in the file we combined above, strips out all the HTML tags, and writes the parsed text out to another file called top500_parsed.txt.
#!/usr/bin/env python3
from bs4 import BeautifulSoup

# Read the combined HTML and strip the markup, keeping just the text
with open("top500.txt") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

with open("top500_parsed.txt", "w") as f:
    f.write(soup.get_text())
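Save the script as something like parse_top500.py (the name is arbitrary) and run it:
python3 parse_top500.py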
There we have it. top500_parsed.txt contains the information I wanted, and I didn't have to sign up for an account to get it.
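And the listing is finally searchable. For example, to see where NREL's machines land (assuming the center's name appears in the listing text), a quick grep does the trick:
grep -i -n "NREL" top500_parsed.txt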