jpnix: (Default)
My best friend is an engineer at the National Renewable Energy Laboratory and is always trying to get me to come work there, since I have an affinity for high-performance computers (supercomputers). So for fun I wanted to see where the HPCs his organization works on stand on a global scale of computing power, which naturally led me to this website.

The problem with this website is that you cannot search the listing for specific rankings. You can download an XML file to search through... but that requires creating an account just to download it. Nah, too hard.

[Screenshot: the download links that require a login]

So let's extract the data ourselves.

The first thing I noticed is that the site paginates the list (100 computers per page) with a URL query parameter, e.g. https://www.top500.org/lists/top500/list/2022/11/?page=2. This makes it easy to loop through the pages and download them one by one.

for i in {1..5}
do
    wget "https://www.top500.org/lists/top500/list/2022/11/?page=$i"
done

This downloads the entire HTML pages with the data we want.

[Screenshot: wget output in the terminal]

You'll get saved HTML files like these:
index.html?page=1
index.html?page=2
index.html?page=3
index.html?page=4
index.html?page=5

Instead of having the data split across multiple files, it's easier to combine them into one:

cat index* > top500.txt


Now that we have the data, we can parse through it. You can't really do this easily in Bash, as it gets messy fast, but there is a nice Python library called Beautiful Soup that we can use to strip out the HTML tags and nonsense we don't want.

sudo apt-get install python3-bs4

The source code is super simple: it reads the file we combined above, strips out all the HTML tags, and writes the plain text to another file called top500_parsed.txt.
#!/usr/bin/env python3

from bs4 import BeautifulSoup

# Parse the combined HTML; naming the parser explicitly avoids a
# "no parser was explicitly specified" warning from Beautiful Soup
with open("top500.txt") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# Keep only the visible text, dropping all the tags
with open("top500_parsed.txt", "w") as f:
    f.write(soup.get_text())


There we have it: top500_parsed.txt contains the information I wanted, and I didn't have to sign up for an account to get it.
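One caveat: get_text() dumps every piece of text on the page, navigation and all. If you'd rather have structured rows, the rankings appear to be rendered as an ordinary HTML table (an assumption about the site's markup, which could change), so a small Beautiful Soup sketch along these lines could pull out just the cells:

```python
#!/usr/bin/env python3
# Hedged sketch, not part of the original script: assumes each list
# page renders the rankings as plain HTML <table> rows.
from bs4 import BeautifulSoup

def extract_rows(html):
    """Return every <td>-based table row as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows with only <th> cells are headers; skip them
            rows.append(cells)
    return rows

# Tiny demo on a stand-in snippet shaped like the real table:
sample = ("<table><tr><th>Rank</th><th>System</th></tr>"
          "<tr><td>1</td><td>Frontier</td></tr></table>")
print(extract_rows(sample))  # prints [['1', 'Frontier']]
```

To run it on the real data, you'd swap the sample string for open("top500.txt").read() and write each row out however you like, e.g. tab-separated.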
