Web scraping with GitHub Actions
GitHub is a platform to store your code, organize your work, and collaborate with fellow coders, but… it’s also a place where you can run code! 🤩
GitHub Actions are mostly used to automatically run tests, builds, and deployments on projects. But for data wranglers like us, they can also be used to scrape data.
In this lesson, we reuse the code that fetches and extracts the latest temperature in Montreal from the previous lessons. But this time, we’ll use GitHub Actions to trigger it every hour and accumulate the data — for free!
If you want to see the final product, head over here.
I strongly encourage you to complete the two previous lessons (What is Git? and What is GitHub?) before diving into this one. I’ll assume you have a GitHub account and know the basic Git commands.
Setup
Create a new folder and open it with VS Code.
In it, create a main.ts file with the code below. Here’s what it does:
- It fetches the latest weather data in Montreal from the Meteorological Service of Canada API.
- The data is in XML format. We use the fast-xml-parser library to convert it to JSON.
- Then we extract the temperature and the time from the nested JSON. We also create a variable with the current time of the scrape.
- Finally, we append the scraped data to data.csv.
Each time this script runs, a new row is added to the CSV. So the data is accumulated over time.
import { XMLParser } from "fast-xml-parser";

// Fetch the latest observation for Montreal from the Meteorological Service of Canada.
const response = await fetch(
  "https://dd.weather.gc.ca/observations/swob-ml/latest/CWTQ-AUTO-minute-swob.xml",
);
const xml = await response.text();

// The data comes as XML, so we parse it to JSON, keeping the attributes.
const json = new XMLParser({ ignoreAttributes: false }).parse(xml);

// Dig into the nested JSON to extract the observation time...
const observation =
  json["om:ObservationCollection"]["om:member"]["om:Observation"];
const resultTime =
  observation["om:resultTime"]["gml:TimeInstant"]["gml:timePosition"];
console.log(resultTime);

// ...and the air temperature.
const elements = observation["om:result"]["elements"]["element"];
type element = {
  "@_name": string;
  "@_value": string;
};
const temp = elements
  .find((d: element) => d["@_name"] === "air_temp")["@_value"];
console.log(temp);

// Keep track of when we scraped the data.
const scrapeTime = new Date().toISOString();
console.log(scrapeTime);

// Append a new row to data.csv.
await Deno.writeTextFile(
  "./data.csv",
  `\n${scrapeTime},${resultTime},${temp}`,
  {
    append: true,
  },
);
main.ts needs a data.csv file, so let’s create one in the project folder with this header row:
scrapeTime,resultTime,temp
We will push this project to GitHub, so let’s add a README.md to the project to make it more presentable.
# Montreal temperature scraper
Every hour, this project retrieves the latest temperature in Montreal
from the [Meteorological Service of Canada API](https://eccc-msc.github.io/open-data/msc-data/obs_station/readme_obs_insitu_swobdatamart_en/)
and appends a new row to the `data.csv` file.
Let’s take this for a run!
First, install fast-xml-parser:
deno add npm:fast-xml-parser
Then run main.ts (the -A flag grants the script all permissions, such as network and file system access):
deno -A main.ts
And you should see the temperature being logged and a row added to the data.csv file!
Creating a new GitHub repository
Let’s initialize a new Git repository and push it to GitHub.
First, we initialize and make a first commit:
git init
git add -A
git commit -m "First commit"
Then, create a new repository on GitHub.
Give it a name and a description (optional). You can make it private or public, as you wish.
You don’t need a README because we created one. You don’t need a .gitignore, and you don’t need a license for now.
Then hit Create repository.
Copy the commands to push an existing repository from the command line.
And run these commands in your project terminal.
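They typically look like the lines below, with your own username and repository name in place of the placeholders:

git remote add origin https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git
git branch -M main
git push -u origin main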
Now, if you refresh your GitHub repository page, you’ll see your code and the README.md below it.
Adding a workflow
While it’s called GitHub Actions, you are not creating actions but workflows. And you can see your workflows in the Actions section.
Right now, there are no workflows in our repository. GitHub suggests a few for us. Feel free to explore, but we are going to add one manually.
For GitHub to automatically detect workflows in the repository, we need to create YAML files in .github/workflows/.
YAML stands for YAML Ain’t Markup Language — but to me, it still feels a bit like markup… just stricter. Indentation matters, for example. It’s very common for configuration files. It’s not complicated and we will write some together, so don’t worry about it.
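Here’s a tiny, made-up example just to show the idea: keys, nested keys (indented), and lists starting with a dash:

name: My example
settings:
  retries: 3
  tags:
    - weather
    - montreal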
Let’s create the new folders and a new file .github/workflows/scrape.yml. Let’s keep it empty for now.
Commit it and push it to GitHub:
git add -A
git commit -m "Adding scrape.yml"
git push
After a few seconds, if you refresh the Actions page of your repo, you’ll see your workflow automatically detected by GitHub! But since the file is empty, it’s failing.
Writing a workflow
Let’s do that step by step.
Name and events
When you create a workflow, you first need to give it a name and decide when it should be triggered.
Below, we tell GitHub to trigger this workflow in three situations:
- push means the workflow runs when we push new code to the main branch. Note the indentation: it indicates hierarchy and grouping.
- workflow_dispatch means we will be able to click on a button to manually trigger the workflow. Always handy.
- schedule sets a cron job, which is a way to run scripts on a schedule. Here, 0 * * * * means the workflow will run at the start of every hour (minute 0).
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
Cron jobs are widely used, but their syntax is not intuitive. I usually use crontab.guru to make sense of them. Also, it’s important to note that cron jobs in GitHub workflows are not perfectly accurate. When your workflow needs to run, GitHub pushes it to a queue where it waits for a virtual machine to become available. So it might actually run a few minutes late. For most web scraping tasks, that’s fine. But if you need to trigger critical scripts at very specific times, GitHub Actions might not be for you.
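For reference, here are a few other schedule patterns you could use instead of ours. These are just illustrations, not part of our workflow, and keep in mind that GitHub Actions interprets cron schedules in UTC:

schedule:
  - cron: "0 * * * *" # every hour, at minute 0 (what we use)
  # - cron: "*/30 * * * *" # every 30 minutes
  # - cron: "0 6 * * *" # every day at 06:00 UTC
  # - cron: "0 12 * * 1" # every Monday at 12:00 UTC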
Jobs
Now we need to describe the jobs we want the workflow to accomplish. You can have multiple jobs in a workflow. Here, we have just one. We name it scrape.
We tell GitHub we want it to run on a virtual machine with the latest Ubuntu distribution. It’s a cheap option that is perfectly fine for us, but you have other choices if you ever need something different.
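As a quick sketch (not part of our workflow), here are a couple of other GitHub-hosted runner labels you could swap in if you needed another operating system:

jobs:
  scrape:
    # runs-on: windows-latest # a Windows virtual machine
    # runs-on: macos-latest # a macOS virtual machine
    runs-on: ubuntu-latest # what we use: an Ubuntu (Linux) virtual machine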
Our main.ts script writes the freshly scraped data to data.csv, so we need to give the workflow permission to write and modify the files of the repository.
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
Steps
The only thing left to do is to tell GitHub the steps of this job. Each step has a name:
- Checking out the repo tells the workflow to actually access the repository. We use the pre-coded actions/checkout@v4 for that. There are a lot of actions available for you to use, especially for common tasks. It’s nice not to have to code everything yourself! 🥳
- Set up Deno installs Deno. We use an available action for that too. We specify that we want version 2 of Deno.
- Run main.ts runs our script, just like we would locally.
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checking out the repo
        uses: actions/checkout@v4
      - name: Set up Deno
        uses: denoland/setup-deno@v2
        with:
          deno-version: v2.x
      - name: Run main.ts
        run: deno -A main.ts
Saving the data
Now, if you commit your workflow and push it to GitHub, it will fetch the data and write it to data.csv, but… it won’t actually save data.csv!
Remember: this is running on a virtual machine. So it runs, writes a new row to data.csv, but then the machine is destroyed and removed. The data it wrote disappears with it…
This is why we have a final step to commit the changes. We create a Git user named Automated that commits and pushes the changes to data.csv. Isn’t that brilliant? 🙌
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checking out the repo
        uses: actions/checkout@v4
      - name: Set up Deno
        uses: denoland/setup-deno@v2
        with:
          deno-version: v2.x
      - name: Run main.ts
        run: deno -A main.ts
      - name: Commit changes
        run: git config user.name "Automated" && git config user.email "actions@users.noreply.github.com" && git add data.csv && git commit -m "Automated update" && git push
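That last run command is getting long. If you prefer, the same step can be written as a multi-line run block, which is just an equivalent, arguably easier-to-read way to write it:

      - name: Commit changes
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add data.csv
          git commit -m "Automated update"
          git push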
Save the file and push it to GitHub:
git add -A
git commit -m "Updating workflow"
git push
And now, if you refresh your repository’s Actions page, you’ll see your workflow running properly!
If you click on it, you’ll see the jobs. Here, there’s just one: scrape.
And if you click on the job, you’ll see its steps and can click on them. This is useful when something is failing — you’ll be able to see the error messages here.
If you go back to the home page of the repository, you’ll see that the latest commit has been made by the Automated Git user we set up in the workflow.
And if you check the data.csv file, you’ll see a second row has been added!
You can even go one step further and check the commit to see the actual change.
By the way, this is a really great way to check what changed in web pages, data files, and more. If you scrape entire files, GitHub will be able to tell you what changed and when. You’ll have a full, detailed history thanks to commits.
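For example, once the repository is on your machine, Git can show you the full history of the file, with the lines added in each commit:

git log -p data.csv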
This run was triggered by our push to main. But from now on, the data will also update automatically every hour thanks to our cron job!
Before GitHub Actions, you would have needed to pay for and set up a server to scrape this information on a schedule. But now, you can do it for free with just one simple YAML file! 💃🕺
Retrieving the data
The committed data is on GitHub. To retrieve it in your local repo, just pull it:
git pull
Note that I went for lunch before taking this screenshot — so I have three lines! The scraper ran while I was away! 😄
If your repository is public, you can also fetch the CSV data easily. This is useful when you want to work with up-to-date data in another project. Click on the Raw button to get a URL to the actual file.
Now you can directly fetch this URL.
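For example, here’s a minimal sketch that fetches the raw CSV as plain text with Deno (I’m using the URL of my repository here; swap in your own):

// Fetch the raw CSV directly from GitHub.
const response = await fetch(
  "https://raw.githubusercontent.com/nshiab/temperature-scraper/refs/heads/main/data.csv",
);
// Log the file content as text.
const csv = await response.text();
console.log(csv);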
You can use this URL to fetch the latest data and process it — for example, with the Simple Data Analysis library.
import { SimpleDB } from "@nshiab/simple-data-analysis";

const sdb = new SimpleDB();
const temperatures = sdb.newTable();
await temperatures.loadData(
  "https://raw.githubusercontent.com/nshiab/temperature-scraper/refs/heads/main/data.csv",
);
await temperatures.logTable();
await sdb.done();
Manually triggering a workflow
Right now, our scrape is triggered when we push new code to main and every hour, at the top of the hour.
But since we added workflow_dispatch in our configuration, we can also trigger it manually by clicking on it from the Actions page.
Disabling a workflow
To remove a workflow, you need to actually delete the YAML file from the repository and commit/push the change.
But sometimes, you just want to temporarily disable it — for example, while fixing something in the code, because you don’t need the data at the moment, or because you’ve hit the limit of your free GitHub account! 😬
To disable a workflow while keeping the configuration file in the repository, click on the workflow in the left sidebar, then on the ... button in the top right corner, and finally on Disable workflow.
Don’t worry — you’ll be able to enable it again later without any problem.
Conclusion
Congratulations! You now know how to create simple workflows. I made my repository public, so feel free to fork it for your future projects.
You can use this to scrape data — but also for so many other things! This is code, so the possibilities are endless! With what you learned in this lesson, you’ll be able to understand and replicate more sophisticated workflows you find around.
I hope you found this lesson interesting. But we haven’t explored everything GitHub has to offer… In the next lesson, we’ll discover GitHub Pages, which allows you to create websites for… wait for it… free! And since you now know about GitHub Actions, you’ll be able to republish your web pages automatically each time you push code to the main branch. 🤓
See you in the next lesson!