jpnix: (Default)
My best friend is an engineer at the National Renewable Energy Laboratory and is always trying to get me to come work there, since I have an affinity for high-performance computers (supercomputers). So for fun I wanted to see where the HPCs his organization works on stand on a global scale of computing power, which naturally led me to this website.

The problem with this website is that you cannot search the listing for specific rankings. You can download an XML file to search through... but that requires creating an account just to download it. Nah, too hard.

[Screenshot: the download links that require a login]

So let's extract the data ourselves.

The first thing I noticed is that the site paginates the list (100 computers per page) with a URL query parameter, e.g. https://www.top500.org/lists/top500/list/2022/11/?page=2. This makes it easy to loop through the pages and download them one by one.

for i in {1..5}
do
    wget "https://www.top500.org/lists/top500/list/2022/11/?page=$i"
done

This downloads the entire HTML pages with the data we want.

[Screenshot: wget output in the terminal]

You'll get saved HTML files like these:
index.html?page=1
index.html?page=2
index.html?page=3
index.html?page=4
index.html?page=5

Instead of having the data split across multiple files, it's easier to combine them into one:

cat index* > top500.txt


Now that we have the data, we can parse through it. You can't really do this easily in Bash, as it gets messy fast, but there is a nice Python library called Beautiful Soup that we can use to strip out the HTML tags and nonsense we don't want.

sudo apt-get install python3-bs4

The source code is super simple: it reads the file we combined above, strips out all the HTML tags, and writes the plain text to another file called top500_parsed.txt.
#!/usr/bin/env python3

from bs4 import BeautifulSoup

# Parse the combined HTML; naming the parser explicitly avoids a
# "no parser was explicitly specified" warning from Beautiful Soup
with open("top500.txt") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# Keep only the visible text, dropping all the tags
with open("top500_parsed.txt", "w") as f:
    f.write(soup.get_text())


There we have it: top500_parsed.txt contains the information I wanted, and I didn't have to sign up for an account to get it.
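One caveat: get_text() dumps every piece of text on the page, navigation and all. If you'd rather have structured rows, the rankings appear to be rendered as an ordinary HTML table (an assumption about the site's markup, which could change), so a small Beautiful Soup sketch along these lines could pull out just the cells:

```python
#!/usr/bin/env python3
# Hedged sketch, not part of the original script: assumes each list
# page renders the rankings as plain HTML <table> rows.
from bs4 import BeautifulSoup

def extract_rows(html):
    """Return every <td>-based table row as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows with only <th> cells are headers; skip them
            rows.append(cells)
    return rows

# Tiny demo on a stand-in snippet shaped like the real table:
sample = ("<table><tr><th>Rank</th><th>System</th></tr>"
          "<tr><td>1</td><td>Frontier</td></tr></table>")
print(extract_rows(sample))  # prints [['1', 'Frontier']]
```

To run it on the real data, you'd swap the sample string for open("top500.txt").read() and write each row out however you like, e.g. tab-separated.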
