This tutorial goes over how to create a simple "web scraper" app which pulls the HTML from a page and saves it as a page in a local hyperdrive.
You can follow along and create this app using the DWeb Scratchpad.
The first thing we'll do is set up an HTML form to take URLs to scrape and an editable section to preview the page within.
<form id="urlForm">
<input autofocus type="url" name="url" placeholder="url to scrape">
<input type="submit" value="Scrape">
</form>
<section id="previewBox"></section>
Next we'll use the local AI to generate some code for us.
I used the following prompt to get some JS to do the trick:
javascript
get form "urlForm", get value of input with name "url"
fetch the url and get the response as text
pass it to DOMParser
Take the body innerHTML and set the innerHTML of the element with id "previewBox"
Use await instead of .then
Notice how I gave it some tips on specific APIs I wanted it to use and the IDs of the relevant elements.
document.getElementById('urlForm').addEventListener('submit', async (event) => {
event.preventDefault();
const url = document.getElementById('urlForm').url.value;
try {
const response = await fetch(url);
const text = await response.text();
const parser = new DOMParser();
const doc = parser.parseFromString(text, 'text/html');
document.getElementById('previewBox').innerHTML = doc.body.innerHTML;
} catch (error) {
console.error('Error fetching the URL:', error);
}
});
Let's also add some CSS to the `previewBox` to make it more obvious where it starts and ends:
#previewBox {
border: 1px dashed var(--browser-theme-primary-highlight);
margin: 1em 0em;
padding: 0.5em;
}
Notice how we make use of the built in Browser Theme CSS vars so that users can customize the look and feel of the app along with the rest of the browser.
Once this is in, test it out with an example page like the Agregore Home Page. You should see the contents of the page along with a bunch of broken iframes and images, since relative URLs don't resolve correctly within the preview.
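To see why, consider how a relative `src` resolves once the markup is embedded in our app instead of the original site (the URLs below are made up for illustration):

// A relative path resolves against the page it's embedded in, not the page it was copied from.
const originalPage = 'https://example.com/blog/' // hypothetical page we scraped
const scraperApp = 'hyper://my-scraper-app/'     // hypothetical URL our app is running from

console.log(new URL('icon.png', originalPage).href) // https://example.com/blog/icon.png (what we want)
console.log(new URL('icon.png', scraperApp).href)   // hyper://my-scraper-app/icon.png (what the preview actually requests)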
Generally speaking, it's unsafe to shove arbitrary HTML into your page, so we should add a "sanitization" step to get rid of any stray scripts, as well as make any URLs absolute relative to the original page.
For that I prompted the local AI to make us a function using the following:
javascript function called sanitizeElementTree which takes a document, an element and a rootURL. It should traverse an HTML element's tree and remove any iframes or script tags. It should go through all img, audio, video tags and change their src to be `new URL(src, rootURL).href` to make it absolute. It should go through all `a` tags and make the `href` absolute relative to the rootURL. It will also find any attributes that start with `on` (like onclick) and remove them from elements.
This gave me the following code:
function sanitizeElementTree(doc, element, rootURL) {
function isAbsolute(url) {
return /^(http|https):\/\//i.test(url);
}
function makeAbsolute(url) {
if (!url || isAbsolute(url)) return url;
return new URL(url, rootURL).href;
}
function traverse(node) {
if (node instanceof HTMLElement) {
if (node.tagName === 'IFRAME' || node.tagName === 'SCRIPT') {
node.remove();
} else if (['IMG', 'AUDIO', 'VIDEO'].includes(node.tagName)) {
const srcAttr = node.getAttribute('src');
if (srcAttr) node.setAttribute('src', makeAbsolute(srcAttr));
} else if (node.tagName === 'A' && node.hasAttribute('href')) {
const hrefAttr = node.getAttribute('href');
if (hrefAttr) node.setAttribute('href', makeAbsolute(hrefAttr));
}
for (let i = 0; i < node.attributes.length; i++) {
const attr = node.attributes[i];
if (attr.name.startsWith('on')) {
node.removeAttribute(attr.name);
i--;
}
}
}
node.childNodes.forEach(traverse);
}
traverse(element);
}
With this we can take an extra step to sanitize the body of the page before setting the `previewBox`:
const doc = parser.parseFromString(text, 'text/html');
sanitizeElementTree(doc, doc.body, url)
document.getElementById('previewBox').innerHTML = doc.body.innerHTML;
With this, images and links should work and we can be a bit less nervous about malicious JavaScript.
Since web pages can often have unnecessary cruft, we should give users a chance to manually clean stuff up before they commit to saving.
To do this we can add the `contenteditable` attribute to the `previewBox`, which enables users to edit anything in the box's content, and in particular delete it the way they would in a traditional document editor.
<section id="previewBox" contenteditable="true"></section>
Try deleting images or any sections that seem out of place.
Now that we have the page content ready, we should have a way to save it.
Ideally we should have all the scraped sites in a single archive with an easy way to browse them.
In my case I decided to store them as `{root url}/{site hostname}/{path}.html`.
I asked the AI to help make a function that would convert the user's URL to this format.
Make a JS function called urlToFile which takes a `url` and a `rootURL` and converts it to the following format:
`{root url}/{hostname}/{path}.html`
Only add `.html` if the path doesn't already have it
It gave the following:
function urlToFile(url, rootURL) {
const parsedUrl = new URL(url);
let path = parsedUrl.pathname;
if (!path.endsWith('.html')) path += '.html';
return `${rootURL}/${parsedUrl.hostname}${path}`;
}
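For example, with a made-up hyperdrive key as the root URL, it produces paths like these:

// Hypothetical inputs to show the resulting file paths
urlToFile('https://example.com/blog/post', 'hyper://abc123')
// => 'hyper://abc123/example.com/blog/post.html'
urlToFile('https://example.com/page.html', 'hyper://abc123')
// => 'hyper://abc123/example.com/page.html'

One quirk worth noting: a bare root URL like `https://example.com/` ends up as `hyper://abc123/example.com/.html`, which you may want to special-case as something like `index.html`.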
In order to save this file we need to generate that root URL. Agregore supports publishing to several protocols, and the main ones that support changes over time are `hyper://` and `ipns://`.
We should give the user the option to choose which one they want to use. The scratchpad already does this so we can cannibalize some code from it:
<form id="saveForm">
<label>
Save Type
<select name="saveType">
<option value="hyper">hyper://</option>
<option value="ipns">ipns://</option>
</select>
</label>
<button title="Save">💾</button>
</form>
We can rip out the JS for creating the root drives too, just changing the name they get stored under:
async function getDriveURL() {
const name = "scraped_sites"
const response = await fetch(`hyper://localhost/?key=${name}`, { method: 'POST' });
if (!response.ok) {
throw new Error(`Failed to generate Hyperdrive key: ${response.statusText}`);
}
return await response.text()
}
async function getIPNSURL() {
const name = "scraped_sites"
const response = await fetch(`ipns://localhost/?key=${name}`, { method: 'POST' });
if (!response.ok) {
throw new Error(`Failed to generate IPNS key: ${response.statusText}`);
}
await response.text()
return response.headers.get('Location')
}
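Since we always use the same key name, both of these should hand back the same root URL each time, so you could optionally cache the result instead of hitting the key endpoint on every save (a small optional sketch, not part of the original Scratchpad code):

// Optional: cache the generated root URLs per save type for the lifetime of the page
const rootURLCache = new Map()

async function getRootURL(saveType) {
  if (!rootURLCache.has(saveType)) {
    const rootURL = saveType === 'hyper' ? await getDriveURL() : await getIPNSURL()
    rootURLCache.set(saveType, rootURL)
  }
  return rootURLCache.get(saveType)
}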
We can now write some code to tie this all together:
saveForm.onsubmit = async (e) => {
e.preventDefault()
const publishURL = await saveScrape()
window.open(publishURL)
}
async function saveScrape() {
const body = previewBox.innerHTML
const url = urlForm.url.value
const saveType = saveForm.saveType.value
const rootURL = saveType === 'hyper' ? await getDriveURL() : await getIPNSURL()
const publishURL = urlToFile(url, rootURL)
const response = await fetch(publishURL, {
method: 'put',
body
})
if(!response.ok) throw new Error(`Unable to publish: ${await response.text()}`)
return publishURL
}
This will add a listener to the `saveForm` which publishes the scrape and opens it in a new window. The function gets the URL and scrape content and does a PUT to either the `hyper://` or `ipns://` archive.
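If you'd rather surface failures to the user than leave them in the console, you could also wrap the handler in a try/catch (an optional tweak, not part of the tutorial code above):

// Optional: show publishing errors to the user instead of failing silently
saveForm.onsubmit = async (e) => {
  e.preventDefault()
  try {
    const publishURL = await saveScrape()
    window.open(publishURL)
  } catch (error) {
    alert(`Unable to save the scrape: ${error.message}`)
  }
}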
Having the raw HTML works, but it's missing important bits like a title and description from the head, as well as meta tags to make it look nice on mobile phones.
We can take the template that the Scratchpad uses as a starting point:
function genPage({
title,
description,
html
}={}) {
return `<!DOCTYPE html>
<meta lang="en">
<meta charset="UTF-8">
<title>${title}</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="${description.trim()}">
<link rel="stylesheet" href="browser://theme/style.css">
${html}
`
}
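As a quick sanity check, calling it with some placeholder values produces a complete standalone page (the values here are made up):

// Hypothetical values just to show the shape of the output
const page = genPage({
  title: 'Example Page',
  description: 'A scraped copy of an example page',
  html: '<h1>Hello!</h1>'
})
// `page` is a full HTML document: doctype, charset, title, viewport and
// description meta tags, the browser theme stylesheet, then the scraped markup.
console.log(page)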
Next we'll want to take the title and description out of the page when we scrape it.
I used this prompt to get the AI to make a function that can do that:
Make a JS function called `getDocumentMetadata` which takes a document reference and returns the title and description from the document head
Which gave me this:
function getDocumentMetadata(doc) {
const title = doc.querySelector('title').innerText;
const description = doc.querySelector('meta[name="description"]').getAttribute('content');
return { title, description };
}
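One thing to watch out for: this version throws if a page is missing a `<title>` or a description meta tag. A slightly more defensive variant (my tweak, not the AI's output) falls back to empty strings:

// Defensive version: tolerate pages without a <title> or description meta tag
function getDocumentMetadata(doc) {
  const title = doc.querySelector('title')?.innerText || '';
  const description = doc.querySelector('meta[name="description"]')?.getAttribute('content') || '';
  return { title, description };
}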
We can then edit our initial scrape code to account for this:
let scrapedTitle = ''
let scrapedDescription = ''
document.getElementById('urlForm').addEventListener('submit', async (event) => {
event.preventDefault();
const url = document.querySelector('input[name="url"]').value;
try {
const response = await fetch(url);
const text = await response.text();
const parser = new DOMParser();
const doc = parser.parseFromString(text, 'text/html');
const {title, description} = getDocumentMetadata(doc)
scrapedTitle = title
scrapedDescription = description
sanitizeElementTree(doc, doc.body, url)
document.getElementById('previewBox').innerHTML = doc.body.innerHTML;
} catch (error) {
console.error('Error fetching the URL:', error);
}
});
And we can also adjust our publishing code to bring this all together:
async function saveScrape() {
const html = previewBox.innerHTML
const body = genPage({
title: scrapedTitle,
description: scrapedDescription,
html
})
const url = urlForm.url.value
const saveType = saveForm.saveType.value
const rootURL = saveType === 'hyper' ? await getDriveURL() : await getIPNSURL()
const publishURL = urlToFile(url, rootURL)
const response = await fetch(publishURL, {
method: 'put',
body
})
if(!response.ok) throw new Error(`Unable to publish: ${await response.text()}`)
return publishURL
}
With this in place we have our app! You can see a deployed version of it here.
Here are some things you can try to do yourself as improvements:
- Use the `https+raw` transport to bypass CORS restrictions in HTTP-based sites