From a761ce7e66ab10e3db414501392bd1175d424aef Mon Sep 17 00:00:00 2001
From: "Jurgen J. Vinju"
Date: Thu, 25 Jun 2026 10:47:29 +0200
Subject: [PATCH] rescued this recipe from old rascal branch

---
 .../Languages/HTML/Scraping/Scraping.md | 167 ++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 courses/Recipes/Languages/HTML/Scraping/Scraping.md

diff --git a/courses/Recipes/Languages/HTML/Scraping/Scraping.md b/courses/Recipes/Languages/HTML/Scraping/Scraping.md
new file mode 100644
index 000000000..fa62a92ef
--- /dev/null
+++ b/courses/Recipes/Languages/HTML/Scraping/Scraping.md
@@ -0,0 +1,167 @@
+---
+title: HTML Scraping
+keywords:
+  - scraping
+  - "pattern matching"
+  - recursion
+  - html
+---
+
+#### Synopsis
+
+Scraping HTML means recovering raw data from HTML documents.
+
+#### Description
+
+In this example we treat HTML as just another language that may contain relevant information. In the case
+of HTML it is smart to reuse existing parsers, so we use an ((AbstractSyntax)) format for HTML. It is
+described in ((lang::html::AST)) and its IO interface is described in ((lang::html::IO)).
+
+In this demo we extract information from the website of the [Centraal Bureau voor Statistiek (CBS)](https://www.cbs.nl), the Dutch national centre for statistics.
+
+We found an interesting [page](https://longreads.cbs.nl/nederland-in-cijfers-2022/hoeveel-fietsen-we-gemiddeld-per-week/) that lists how many kilometers the Dutch cycle per week on average:
+
+```rascal-shell
+import lang::html::IO;
+import IO;
+
+page = readHTMLFile(|https://longreads.cbs.nl/nederland-in-cijfers-2022/hoeveel-fietsen-we-gemiddeld-per-week/|);
+```
+
+As you can see, the output is truncated with `...`; to see more we can use ((IO-iprintln)):
+
+```rascal-shell,continue
+iprintln(page)
+```
+
+We used Chrome's "Inspect" feature to figure out that the `div` with class `datatable-container` is of interest.
+So let's select that using a deep match operator and bind that div to the `tab` variable:
+
+```rascal-shell,continue
+if (/tab:div(_,class=/datatable-container/) := page)
+  iprintln(tab);
+```
+
+We used a deep match pattern and then a regular expression pattern to select a `div` whose `class` attribute has `datatable-container` somewhere in the string.
+
+Every row in the table contains data, except the header row. Let's convert this entire table
+to a relation of type `rel[str persoonskenmerken, real fietskilometers]`.
+
+We create the match pattern by step-wise refinement. First let's just
+list all the rows:
+
+```rascal-shell,continue
+if (/tab:div(rows,class=/datatable-container/) := page) { // <1>
+  for (/r:tr(_) := rows) { // <2>
+    println(r);
+  }
+}
+```
+
+* <1> binds the children of the div to `rows`;
+* <2> uses deep match `/` to quickly jump to all the nested `tr` nodes.
+
+Now we refine the pattern to keep only the non-header rows:
+
+```rascal-shell,continue
+if (/tab:div(rows,class=/datatable-container/) := page) {
+  for (/r:tr([th(_,scope="row"), td(_)]) := rows) { // <3>
+    println(r);
+  }
+}
+```
+
+* <3> we match only those `tr` that have exactly two children, one `th` and one `td`. To be sure, we also require the `th` to have the `scope` attribute equal to `"row"`.
+
+Now it's time to get the final data out. The category is in the first column and the numbers are in the second.
+We could make the query deeper and more complex, but we choose to add another nesting level for the sake of clarity:
+
+```rascal-shell,continue
+if (/tab:div(rows,class=/datatable-container/) := page) {
+  for (/r:tr([category:th(_,scope="row"), number:td(_)]) := rows) {
+    if (/text(str c) := category, /text(str n) := number) {
+      println(" --- ");
+    }
+  }
+}
+```
+
+Now that we have scraped the data out of the HTML syntax tree, we have to convert it to
+raw data. But the Dutch use commas as decimal separators:
+
+```rascal-shell,continue,error
+import String;
+toReal("18,79");
+toReal("18.79");
+replaceAll("18,79", ",", ".")
+```
+
+```rascal-shell,continue
+rel[str persoonskenmerken, real fietskilometers] myData = {};
+if (/tab:div(rows,class=/datatable-container/) := page) {
+  for (/r:tr([category:th(_,scope="row"), number:td(_)]) := rows) {
+    if (/text(str c) := category, /text(str n) := number) {
+      println(" --- ");
+      myData += ;
+    }
+  }
+}
+myData;
+```
+
+Now we have the data in a format that we can compute with:
+```rascal-shell,continue
+myData
+import Set;
+theSum = sum(myData);
+relativeData = { | <- myData};
+```
+To keep this analysis for the future, for example when new data is published on the site, we
+can store the query in a function. It is also ready to be rewritten from structured programming
+style into a functional comprehension. Let's do that first:
+
+```rascal-shell,continue
+{
+| /tab:div(rows,class=/datatable-container/) := page
+, /r:tr([category:th(_,scope="row"), number:td(_)]) := rows
+, /text(str c) := category, /text(str n) := number
+}
+```
+
+The patterns have _not_ changed; they have only been moved to the generator/filter side
+of a ((Set-Comprehension)):
+* <1> here we have the resulting tuple that uses `c` and `n`, which have been selected by pattern matching;
+* <2> this is the first selector, which finds the table in the page;
+* <3> here we iterate over the rows that are not the header;
+* <4> finally we project out the text from the two cells.
+
+Now we wrap it all up in a reusable function:
+```rascal-shell,continue
+rel[str persoonskenmerken, real fietskilometers] scrapeFietskilometers(loc address=|https://longreads.cbs.nl/nederland-in-cijfers-2022/hoeveel-fietsen-we-gemiddeld-per-week/|)
+  = {
+  | /tab:div(rows,class=/datatable-container/) := readHTMLFile(address)
+  , /r:tr([category:th(_,scope="row"), number:td(_)]) := rows
+  , /text(str c) := category, /text(str n) := number
+  };
+scrapeFietskilometers()
+```
+
+Every time the function is called, the HTML is retrieved again from the site. We put the
+URL in a default keyword parameter, in case a similar page exists that we want to run our analysis
+on.
+
+#### Benefits
+
+* Rascal has a lot of powerful ((PatternMatching)) operators to dissect an HTML page with;
+* Skills used in the analysis of programming languages, like traversal and pattern matching, are equally useful for HTML scraping;
+* Deep matching and ((Statements-Visit)) skip over all uninteresting content without depending on it. The more you use these "structure shy" primitives, the more robust the scraper will be against sudden changes in the HTML.
+
+#### Pitfalls
+
+* HTML scraping is a *brittle* business. If the page changes, then it is likely that the query will no longer
+work. In that case the function will most likely start returning an empty set of tuples. If we look at the
+structural dependencies, then the word `datatable-container` is very important. This query also matches only
+tables with two columns, where the first cell is always a `th` and the second cell is a `td`. Finally, the actual
+data is stored in a single text cell under the `th` and `td`. If any of these properties change, this scraper
+breaks. However, if _anything else_ changes, the scraper keeps working;
+* The HTML parser skips SVG elements.
\ No newline at end of file