
Web scraping with GitHub Actions

GitHub is a platform to store your code, organize your work, and collaborate with fellow coders, but… it’s also a place where you can run code! 🤩

GitHub Actions are mostly used to automatically run tests, builds, and deployments on projects. But for data wranglers like us, they can also be used to scrape data.

In this lesson, we reuse the code from the previous lessons that fetches and extracts the latest temperature in Montreal. But this time, we’ll use GitHub Actions to trigger it every hour and accumulate the data — for free!
If you want to see the final product, head over here.

I strongly encourage you to complete the two previous lessons (What is Git? and What is GitHub?) before diving into this one. I’ll assume you have a GitHub account and know the basic Git commands.

Want to know when new lessons are available? Subscribe to the newsletter ✉️ and give a ⭐ to the GitHub repository to keep me motivated! Click here to get in touch.

Setup

Create a new folder and open it with VS Code.

In it, create a main.ts file with the code below. Here’s what it does:

  • It fetches the latest weather data in Montreal from the Meteorological Service of Canada API.
  • The data is in XML format. We use the library fast-xml-parser to convert it to JSON.
  • Then we extract the temperature and the time from the nested JSON. We also create a variable with the current time of the scrape.
  • Finally, we append the scraped data to data.csv.

Each time this script runs, a new row is added to the CSV. So the data is accumulated over time.

main.ts
import { XMLParser } from "fast-xml-parser";

// Fetch the latest observation for the Montreal station as XML.
const response = await fetch(
  "https://dd.weather.gc.ca/observations/swob-ml/latest/CWTQ-AUTO-minute-swob.xml",
);
const xml = await response.text();

// Convert the XML to JSON, keeping the attributes (prefixed with "@_").
const json = new XMLParser({ ignoreAttributes: false }).parse(xml);

const observation =
  json["om:ObservationCollection"]["om:member"]["om:Observation"];

// The time of the observation.
const resultTime =
  observation["om:resultTime"]["gml:TimeInstant"]["gml:timePosition"];
console.log(resultTime);

// Find the air temperature among the observed elements.
const elements = observation["om:result"]["elements"]["element"];
type element = {
  "@_name": string;
  "@_value": string;
};
const temp = elements
  .find((d: element) => d["@_name"] === "air_temp")["@_value"];
console.log(temp);

// The time of the scrape.
const scrapeTime = new Date().toISOString();
console.log(scrapeTime);

// Append a new row to data.csv.
await Deno.writeTextFile(
  "./data.csv",
  `\n${scrapeTime},${resultTime},${temp}`,
  {
    append: true,
  },
);

main.ts needs a data.csv file, so let’s create one in the project folder.

data.csv
scrapeTime,resultTime,temp

We will push this project to GitHub, so let’s add a README.md to make it more presentable.

README.md
# Montreal temperature scraper
 
Every hour, this project retrieves the latest temperature in Montreal
from the [Meteorological Service of Canada API](https://eccc-msc.github.io/open-data/msc-data/obs_station/readme_obs_insitu_swobdatamart_en/)
and appends a new row to the `data.csv` file.

Let’s take this for a run!

First, install fast-xml-parser:

  • deno add npm:fast-xml-parser

Then run main.ts:

  • deno -A main.ts

And you should see the temperature being logged and a row added to the data.csv file!
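For example, after one run, your data.csv should contain something like this (the values shown here are purely illustrative — yours will differ):

scrapeTime,resultTime,temp
2025-01-15T17:03:12.345Z,2025-01-15T17:01:00.000Z,-8.2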

The README.

Creating a new GitHub repository

Let’s initialize a new Git repository and push it to GitHub.

First, we initialize and make a first commit:

  • git init
  • git add -A
  • git commit -m "First commit"

Then, create a new repository on GitHub.

New repository on GitHub.

Give it a name and a description (optional). You can make it private or public, as you wish.

You don’t need a README because we created one. You don’t need a .gitignore and you don’t need a license for now.

Then hit Create repository.

Settings for a new repository on GitHub.

Copy the commands to push an existing repository from the command line.

Commands to sync the new GitHub repository.

And run these commands in your project terminal.

Repository synced with the new GitHub repository.

Now, if you refresh your GitHub repository page, you’ll see your code and the README.md below it.

A fresh repository on GitHub

Adding a workflow

While it’s called GitHub Actions, you are not creating actions but workflows. And you can see your workflows in the Actions section.

The Actions tab in new GitHub repository.

Right now, there are no workflows in our repository. GitHub suggests a few for us. Feel free to explore, but we are going to add one manually.

GitHub suggesting actions.

For GitHub to automatically detect workflows in the repository, we need to create YAML files in .github/workflows/.

YAML stands for YAML Ain’t Markup Language — but to me, it still feels a bit like markup… just stricter. Indentation matters, for example. It’s very common for configuration files. It’s not complicated and we will write some together, so don’t worry about it.

Let’s create new folders and a new file .github/workflows/scrape.yml. Let’s keep it empty for now.
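Your project structure should now look like this (deno.json and deno.lock were created earlier by deno add):

.github/
└── workflows/
    └── scrape.yml
README.md
data.csv
deno.json
deno.lock
main.ts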

Commit it and push it to GitHub:

  • git add -A
  • git commit -m "Adding scrape.yml"
  • git push

An empty .yml file.

After a few seconds, if you refresh the Actions page of your repo, you’ll see your workflow automatically detected by GitHub! But since the file is empty, it’s failing.

Failed workflow.

Writing a workflow

Let’s do that step by step.

Name and events

When you create a workflow, you first need to give it a name and decide when it should be triggered.

Below, we tell GitHub to trigger this workflow in three situations:

  • push means when we push new code to the main branch. Note the indentation — it indicates hierarchy and grouping.
  • workflow_dispatch means we will be able to click on a button to manually trigger the workflow. Always handy.
  • schedule sets a cron job, which is a way to run scripts on a schedule. Here, 0 * * * * means the workflow will run at the start of every hour (minute 0).
.github/workflows/scrape.yml
name: Scrape Montreal temperature
 
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
💡

Cron jobs are widely used, but their syntax is not intuitive. I usually use crontab.guru to make sense of them. Also, it’s important to note that cron jobs in GitHub workflows are not perfectly accurate. When your workflow needs to run, GitHub pushes it to a queue where it waits for a virtual machine to become available. So it might actually run a few minutes late. For most web scraping tasks, that’s fine. But if you need to trigger critical scripts at very specific times, GitHub Actions might not be for you.
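Also note that these schedules are interpreted in UTC, not in your local time zone. Here are a few other expressions you could use instead:

schedule:
  - cron: "*/30 * * * *" # Every 30 minutes
  - cron: "0 0 * * *" # Every day at midnight UTC
  - cron: "0 12 * * 1" # Every Monday at noon UTC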

Jobs

Now we need to describe the jobs we want the workflow to accomplish. You can have multiple jobs in a workflow. Here, we have just one. We name it scrape.

We tell GitHub we want it to run on a virtual machine with the latest Ubuntu distribution. Linux runners are the cheapest option and perfectly fine for us. But you have choices if you ever need something different.
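For example, if your script ever needed macOS, you could swap the runner (keep in mind that Windows and macOS runners consume your free minutes faster than Linux ones on private repositories):

runs-on: macos-latest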

Our main.ts script writes the freshly scraped data to data.csv, so we need to give the workflow permission to write and modify the files of the repository.

.github/workflows/scrape.yml
name: Scrape Montreal temperature
 
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
 
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write

Steps

The only thing left to do is to tell GitHub the steps of this job. Each step has a name:

  • Checking out the repo tells the workflow to actually access the repository. We use the pre-coded actions/checkout@v4 for that. There are a lot of actions available for you to use, especially for common tasks. It’s nice not to have to code everything yourself! 🥳
  • Set up Deno installs Deno. We use an available action for that too. We specify that we want version 2 of Deno.
  • Run main.ts runs our script, just like we would locally.
.github/workflows/scrape.yml
name: Scrape Montreal temperature
 
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
 
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checking out the repo
        uses: actions/checkout@v4
      - name: Set up Deno
        uses: denoland/setup-deno@v2
        with:
          deno-version: v2.x
      - name: Run main.ts
        run: deno -A main.ts

Saving the data

Now, if you commit your workflow and push it to GitHub, it will fetch the data and write it to data.csv, but… it won’t actually save data.csv!

Remember: this is running on a virtual machine. So it runs and writes a new row to data.csv, but then the machine is destroyed and removed. Anything written to its disk disappears with it…

This is why we have a final step to commit the changes. We create a Git user named Automated that commits and pushes the changes to data.csv. Isn’t that brilliant? 🙌

.github/workflows/scrape.yml
name: Scrape Montreal temperature
 
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
 
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checking out the repo
        uses: actions/checkout@v4
      - name: Set up Deno
        uses: denoland/setup-deno@v2
        with:
          deno-version: v2.x
      - name: Run main.ts
        run: deno -A main.ts
      - name: Commit changes
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add data.csv
          git commit -m "Automated update"
          git push

Save the file and push it to GitHub:

  • git add -A
  • git commit -m "Updating workflow"
  • git push

And now, if you refresh your repository’s Actions page, you’ll see your workflow running properly!

Success workflow.

If you click on it, you’ll see the jobs. Here, there’s just one: scrape.

Success job.

And if you click on the job, you’ll see its steps and can click on them. This is useful when something is failing — you’ll be able to see the error messages here.

Success steps.

If you go back to the home page of the repository, you’ll see that the latest commit has been made by the Automated Git user we set up in the workflow.

Automated commit.

And if you check the data.csv file, you’ll see a second row has been added!

The updated data.csv file.

You can even go one step further and check the commit to see the actual change.

By the way, this is a really great way to check what changed in web pages, data files, and more. If you scrape entire files, GitHub will be able to tell you what changed and when. You’ll have a full, detailed history thanks to commits.

The detailed commit.
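Once you’ve pulled the latest commits, you can also browse this history from your terminal. The -p flag shows the changes introduced by each commit touching the file:

  • git log -p data.csv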

This run was triggered by our push to main. But from now on, the workflow will also run automatically every hour thanks to our cron job!

Before GitHub Actions, you would have needed to pay for and set up a server to scrape this information on a schedule. But now, you can do it for free with just one simple YAML file! 💃🕺

Retrieving the data

The committed data is on GitHub. To retrieve it in your local repo, just pull it:

  • git pull

Note that I went for lunch before taking this screenshot — so I have three lines! The scraper ran while I was away! 😄

The pulled data.

If your repository is public, you can also fetch the CSV data easily. This is useful when you want to work with up-to-date data in another project. Click on the Raw button to get a URL to the actual file.

The link to the raw CSV file.

Now you can directly fetch this URL.

The link to the raw data.
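For example, here’s a minimal sketch fetching the raw CSV with Deno (the URL is the one from my own repository’s Raw button):

// Fetch the raw CSV from GitHub and log its contents.
const response = await fetch(
  "https://raw.githubusercontent.com/nshiab/temperature-scraper/refs/heads/main/data.csv",
);
const csv = await response.text();
console.log(csv);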

You can use this URL to fetch the latest data and process it — for example, with the Simple Data Analysis library.

import { SimpleDB } from "@nshiab/simple-data-analysis";
 
const sdb = new SimpleDB();
 
const temperatures = sdb.newTable();
await temperatures.loadData(
  "https://raw.githubusercontent.com/nshiab/temperature-scraper/refs/heads/main/data.csv",
);
await temperatures.logTable();
 
await sdb.done();

Fetching the data with the Simple Data Analysis library.

Manually triggering a workflow

Right now, our scrape is triggered when we push new code to main and every hour, at the top of the hour.

But since we added workflow_dispatch in our configuration, we can also trigger it manually by clicking on it from the Actions page.

Manually triggering the workflow.
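If you use the GitHub CLI, you can also trigger the workflow from your terminal:

  • gh workflow run scrape.yml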

Disabling a workflow

To remove a workflow, you need to actually delete the YAML file from the repository and commit/push the change.

But sometimes, you just want to temporarily disable it — for example, while fixing something in the code, because you don’t need the data at the moment, or because you’ve hit the limit of your free GitHub account! 😬

To disable a workflow while keeping the configuration file in the repository, click on the workflow in the left sidebar, then on the ... button in the top right corner, and finally on Disable workflow.

Don’t worry — you’ll be able to enable it again later without any problem.

Disabling a workflow.
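The GitHub CLI can do this from the terminal too:

  • gh workflow disable scrape.yml
  • gh workflow enable scrape.yml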

Conclusion

Congratulations! You now know how to create simple workflows. I made my repository public, so feel free to fork it for your future projects.

You can use this to scrape data — but also for so many other things! This is code, so the possibilities are endless! With what you learned in this lesson, you’ll be able to understand and replicate more sophisticated workflows you find around.

I hope you found this lesson interesting. But we haven’t explored everything GitHub has to offer… In the next lesson, we’ll discover GitHub Pages, which allows you to create websites for… wait for it… free! And since you now know about GitHub Actions, you’ll be able to republish your web pages automatically each time you push code to the main branch. 🤓

See you in the next lesson!

Enjoyed this? Want to know when new lessons are available? Subscribe to the newsletter ✉️ and give a ⭐ to the GitHub repository to keep me motivated! Get in touch if you have any questions.