Chapter 5: Tangled Web? Not At All

Chapter 5 of Linux Shell Scripting Cookbook — web interaction from the command line using curl, wget, and shell scripting.

Chapter Overview

This chapter covers interacting with the web from the command line — downloading files, scraping pages, making HTTP requests, parsing responses, and automating web tasks. curl and wget are the two main tools, and they’re used in virtually every DevOps and security workflow.


Downloading from a Web Page

wget

wget is built for downloading — it handles retries, resuming, and recursive downloads automatically.

wget https://example.com/file.tar.gz           # download a file
wget -O output.tar.gz https://example.com/f    # save with a specific name
wget -c https://example.com/largefile.iso      # resume interrupted download
wget -q https://example.com/file              # quiet mode (no output)
wget -b https://example.com/file              # background download
wget --limit-rate=500k https://example.com/f  # limit speed to 500KB/s

Download multiple files from a list:

wget -i urls.txt                               # read URLs from a file

Mirror an entire website:

wget --mirror --convert-links --adjust-extension \
  --page-requisites --no-parent \
  https://example.com/

  • --mirror = recursive + timestamps
  • --convert-links = fix links for offline use
  • --adjust-extension = append .html to files where needed
  • --page-requisites = download CSS, images, JS
  • --no-parent = don’t go above the starting URL
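
When mirroring a site you don’t control, it’s courteous to throttle the crawl. A hedged variant of the same command (the 1-second wait and 200KB/s cap are arbitrary choices):

wget --mirror --convert-links --adjust-extension \
  --page-requisites --no-parent \
  --wait=1 --random-wait --limit-rate=200k \
  https://example.com/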

Recursive download with depth limit:

wget -r -l 2 https://example.com/docs/        # 2 levels deep

curl for downloading

curl -O https://example.com/file.tar.gz        # save with original name
curl -o output.tar.gz https://example.com/f    # save with custom name
curl -C - -O https://example.com/large.iso     # resume download
curl -L https://example.com/file              # follow redirects (-L is important)
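
curl has no direct equivalent of wget -i, but xargs can drive it over a URL list. A hedged sketch — urls.txt holds one URL per line, and -P 4 (four parallel downloads) is an arbitrary choice:

xargs -P 4 -n 1 curl -s -O < urls.txt          # four downloads at a time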

Downloading a Web Page as Plain Text

wget

wget -q -O - https://example.com               # dump raw HTML to stdout

curl

curl -s https://example.com                    # -s = silent (no progress bar)
curl -s https://example.com | grep "<title>"   # extract title

Convert HTML to plain text

Using lynx:

lynx -dump https://example.com                 # render page as text (no HTML tags)
lynx -dump -nolist https://example.com         # suppress link list at bottom

Using w3m:

w3m -dump https://example.com                  # text rendering

Using html2text:

curl -s https://example.com | html2text        # convert HTML to markdown-like text

Strip HTML tags with sed (quick and dirty):

curl -s https://example.com | sed 's/<[^>]*>//g' | sed '/^[[:space:]]*$/d'

A Primer on cURL

curl (Client URL) is the Swiss Army knife of HTTP requests. Essential for testing APIs, web scraping, and automation.

Basic requests

curl https://example.com                       # GET request
curl -s https://example.com                    # silent (no progress)
curl -v https://example.com                    # verbose (show headers)
curl -I https://example.com                    # HEAD request (headers only)
curl -L https://example.com                    # follow redirects

Request methods

curl -X GET https://api.example.com/users
curl -X POST https://api.example.com/users
curl -X PUT https://api.example.com/users/1
curl -X DELETE https://api.example.com/users/1
curl -X PATCH https://api.example.com/users/1

Sending data (POST)

# Form data
curl -X POST -d "user=omar&pass=secret" https://example.com/login

# JSON body
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"username": "omar", "password": "secret"}' \
  https://api.example.com/login

# JSON from a file
curl -X POST \
  -H "Content-Type: application/json" \
  -d @payload.json \
  https://api.example.com/data

Headers

curl -H "Authorization: Bearer TOKEN" https://api.example.com/data
curl -H "Accept: application/json" https://api.example.com/
curl -H "X-Custom-Header: value" https://api.example.com/

Authentication

curl -u username:password https://api.example.com/     # Basic auth
curl -H "Authorization: Bearer TOKEN" https://api.com  # Bearer token
curl --digest -u user:pass https://example.com          # Digest auth

Cookies

curl -c cookies.txt https://example.com          # save cookies to file
curl -b cookies.txt https://example.com          # send cookies from file
curl -b "session=abc123" https://example.com     # send cookie directly

Response handling

curl -o output.html https://example.com          # save body to file
curl -D headers.txt https://example.com          # save headers to file
curl -w "%{http_code}\n" -s -o /dev/null https://example.com  # print status code only
curl -w "%{time_total}\n" -s -o /dev/null https://example.com # print response time

TLS/SSL

curl -k https://self-signed.example.com          # skip certificate verification
curl --cacert ca.pem https://example.com         # use custom CA
curl --cert client.pem --key key.pem https://example.com   # client certificate

Useful write-out variables

curl -w "Status: %{http_code}\nTime: %{time_total}s\nSize: %{size_download} bytes\n" \
  -s -o /dev/null https://example.com
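
In scripts it also helps to bound how long curl may hang and to retry transient failures. A hedged set of defaults (the numbers are arbitrary; -S keeps error messages visible even in silent mode):

curl -s -S -L \
  --connect-timeout 5 --max-time 30 \
  --retry 3 --retry-delay 2 \
  -o output.html https://example.com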

Accessing Gmail from the Command Line

Note: Gmail now requires OAuth2 — plain password access is disabled. These approaches use mutt or msmtp with app passwords or OAuth2.

mutt with Gmail (app password)

Configure ~/.muttrc:

set imap_user = "you@gmail.com"
set imap_pass = "your-app-password"
set folder = "imaps://imap.gmail.com/"
set spoolfile = "+INBOX"
set ssl_force_tls = yes
mutt -f imaps://imap.gmail.com/INBOX    # open Gmail inbox
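
Check unread count with curl (IMAP)

curl itself speaks IMAP, so you can query the unread count without a full mail client. A hedged one-liner, assuming IMAP access is enabled on the account and an app password:

curl -s --user "you@gmail.com:your-app-password" \
  "imaps://imap.gmail.com/INBOX" -X "STATUS INBOX (UNSEEN)"
# prints something like: * STATUS "INBOX" (UNSEEN 42)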

Send email from command line with msmtp

Configure ~/.msmtprc:

account gmail
host smtp.gmail.com
port 587
auth on
tls on
user you@gmail.com
password your-app-password
echo "Message body" | msmtp -a gmail recipient@example.com
echo -e "Subject: Test\n\nHello" | msmtp recipient@example.com

Send with curl (SMTP)

curl --ssl-reqd \
  --url 'smtps://smtp.gmail.com:465' \
  --user 'you@gmail.com:app-password' \
  --mail-from 'you@gmail.com' \
  --mail-rcpt 'to@example.com' \
  --upload-file email.txt

email.txt format:

From: you@gmail.com
To: to@example.com
Subject: Test

Message body here.

Parsing Data from a Website

grep for quick extraction

curl -s https://example.com | grep -oE 'href="[^"]*"'      # all links
curl -s https://example.com | grep -oE '<title>[^<]*</title>'  # page title
curl -s https://example.com | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'  # IPs

sed for structured extraction

curl -s https://example.com | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p'

awk for table data

curl -s https://example.com | \
  awk '/<table/,/<\/table/' | \
  awk '/<td/,/<\/td/' | \
  sed 's/<[^>]*>//g' | \
  sed '/^[[:space:]]*$/d'

pup — HTML parser (cleaner approach)

curl -s https://example.com | pup 'a[href] attr{href}'    # extract all hrefs
curl -s https://example.com | pup 'title text{}'           # page title
curl -s https://example.com | pup 'h2 text{}'              # all h2 text
curl -s https://example.com | pup '.classname text{}'      # by CSS class

Install: go install github.com/ericchiang/pup@latest

python (when shell isn’t enough)

curl -s https://example.com | python3 -c "
import sys
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, val in attrs:
                if attr == 'href':
                    print(val)

LinkParser().feed(sys.stdin.read())
"

Image Crawler and Downloader

Extract all image URLs from a page

curl -s https://example.com | \
  grep -oE 'src="[^"]*\.(jpg|jpeg|png|gif|webp)"' | \
  sed 's/src="//;s/"//'

Download all images from a page

# Extract and download
curl -s https://example.com | \
  grep -oE '(https?://[^"]*\.(jpg|jpeg|png|gif))' | \
  xargs -n 1 -P 4 wget -q                      # -n 1 so -P 4 actually runs in parallel

wget recursive image download

wget -r -A "*.jpg,*.png,*.gif" -nd -P ./images/ https://example.com/gallery/

  • -A = accept only these extensions
  • -nd = no directories (flat download)
  • -P = save to ./images/

Full image crawler script

#!/bin/bash
url="$1"
output_dir="./downloaded_images"
mkdir -p "$output_dir"

echo "Crawling: $url"

curl -s "$url" | \
  grep -oE '(https?://[^"'\''<>[:space:]]*\.(jpg|jpeg|png|gif|webp))' | \
  sort -u | \
  while read -r img_url; do
    filename=$(basename "$img_url" | cut -d'?' -f1)
    echo "Downloading: $img_url"
    curl -s -L -o "$output_dir/$filename" "$img_url"
  done

echo "Done. Images saved to $output_dir"

Usage: bash crawler.sh https://example.com/gallery
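
The regex above only catches absolute URLs; many pages use root-relative src paths instead. A hedged extension that prefixes those with the site origin (naive — it does not resolve ../-style paths):

origin=$(echo "$url" | grep -oE '^https?://[^/]+')

curl -s "$url" | \
  grep -oE 'src="[^"]+\.(jpg|jpeg|png|gif|webp)"' | \
  sed 's/^src="//;s/"$//' | \
  sed "s|^/|$origin/|" | \
  sort -u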


Web Photo Album Generator

Generate a simple HTML gallery from a directory of images:

#!/bin/bash
dir="${1:-.}"
output="album.html"

cat > "$output" << 'HEADER'
<!DOCTYPE html>
<html>
<head>
  <title>Photo Album</title>
  <style>
    body { font-family: sans-serif; background: #111; color: #fff; }
    .gallery { display: flex; flex-wrap: wrap; gap: 10px; padding: 20px; }
    .gallery img { width: 200px; height: 150px; object-fit: cover; border-radius: 4px; }
    .gallery a:hover img { opacity: 0.8; }
  </style>
</head>
<body>
<h1>Photo Album</h1>
<div class="gallery">
HEADER

find "$dir" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.jpeg" \) | sort | \
while read -r img; do
  filename=$(basename "$img")
  echo "  <a href=\"$filename\"><img src=\"$filename\" alt=\"$filename\"></a>"
done >> "$output"

cat >> "$output" << 'FOOTER'
</div>
</body>
</html>
FOOTER

echo "Album generated: $output"
echo "Images included: $(grep -c '<img' $output)"

Usage: bash album.sh /path/to/photos
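
If ImageMagick is available, pre-generating thumbnails makes large albums load faster. A hedged sketch — the thumbs/ directory and 200x150 geometry are my choices:

mkdir -p thumbs
for img in *.jpg *.png; do
  [ -e "$img" ] || continue                    # skip unmatched glob patterns
  convert "$img" -thumbnail 200x150^ -gravity center -extent 200x150 "thumbs/$img"
done

The album’s <img src> could then point into thumbs/ while each <a href> keeps linking to the full-size file.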


Twitter / X Command-Line Client

The official Twitter v1.1 API is now heavily restricted. The modern approach is the Twitter v2 API with Bearer-token authentication.

Basic tweet fetch with curl

BEARER_TOKEN="your_bearer_token_here"

# Get recent tweets from a user
curl -s \
  -H "Authorization: Bearer $BEARER_TOKEN" \
  "https://api.twitter.com/2/tweets/search/recent?query=from:username&max_results=10" | \
  python3 -m json.tool

twurl (official Twitter curl wrapper)

gem install twurl
twurl authorize --consumer-key KEY --consumer-secret SECRET
twurl /1.1/statuses/home_timeline.json | python3 -m json.tool

t (Ruby Twitter CLI)

gem install t
t authorize
t timeline                    # home timeline
t mentions                    # mentions
t search "keyword"            # search
t update "Hello from terminal!"  # post a tweet

Creating a Define Utility

Build a command-line dictionary lookup using free web APIs.

Using the Free Dictionary API

define() {
  word="$1"
  curl -s "https://api.dictionaryapi.dev/api/v2/entries/en/$word" | \
    python3 -c "
import sys, json
data = json.load(sys.stdin)
if isinstance(data, list):
    for entry in data:
        print(f'Word: {entry[\"word\"]}')
        for meaning in entry.get('meanings', []):
            print(f'  [{meaning[\"partOfSpeech\"]}]')
            for d in meaning.get('definitions', [])[:2]:
                print(f'    - {d[\"definition\"]}')
else:
    print(data.get('message', 'Not found'))
"
}

define "ephemeral"

Simple version with grep/sed

define() {
  curl -s "https://api.dictionaryapi.dev/api/v2/entries/en/$1" | \
    grep -oP '"definition":"\K[^"]+' | \
    head -3 | \
    nl
}

Add either version to ~/.bashrc for permanent use.


Finding Broken Links in a Website

wget spider mode

wget --spider -r -nd -nv --delete-after \
  -o wget_log.txt https://example.com

grep -i "broken\|404\|error" wget_log.txt

--spider = don’t download, just check links.

curl in a loop

#!/bin/bash
url="$1"

# Extract all links
links=$(curl -s "$url" | grep -oE 'href="(https?://[^"]*)"' | sed 's/href="//;s/"//')

while read -r link; do
  status=$(curl -s -o /dev/null -w "%{http_code}" -L --max-time 10 "$link")
  if [[ "$status" =~ ^[45] ]]; then
    echo "BROKEN [$status]: $link"
  else
    echo "OK [$status]: $link"
  fi
done <<< "$links"
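
For large pages the sequential loop is slow; xargs can probe several links in parallel. A hedged variant (-P 8 is an arbitrary degree of parallelism):

echo "$links" | xargs -P 8 -I{} sh -c \
  'printf "%s %s\n" "$(curl -s -o /dev/null -w "%{http_code}" -L --max-time 10 "$1")" "$1"' _ {}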

linkchecker (dedicated tool)

pip install linkchecker
linkchecker https://example.com
linkchecker --no-warnings https://example.com    # errors only
linkchecker -r 2 https://example.com             # limit recursion depth

Check a list of URLs

while read -r url; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
  echo "$code $url"
done < urls.txt | grep -v "^200"              # show only non-200 responses

Tracking Changes to a Website

Basic change detection with diff

#!/bin/bash
url="$1"
snapshot_file="snapshot_$(echo "$url" | md5sum | cut -c1-8).txt"

# Strip HTML tags, then drop blank lines
new_content=$(curl -s "$url" | \
  sed 's/<[^>]*>//g' | \
  sed '/^[[:space:]]*$/d')

if [[ -f "$snapshot_file" ]]; then
  if diff -q <(echo "$new_content") "$snapshot_file" > /dev/null; then
    echo "No changes detected."
  else
    echo "CHANGES DETECTED:"
    diff "$snapshot_file" <(echo "$new_content")
    echo "$new_content" > "$snapshot_file"
  fi
else
  echo "First run — saving snapshot."
  echo "$new_content" > "$snapshot_file"
fi

Run on a schedule with cron

# Check every hour and email if changed
0 * * * * /path/to/check_changes.sh https://example.com | mail -s "Site Changed" you@gmail.com

Hash-based detection (lightweight)

#!/bin/bash
url="$1"
hash_file=".site_hash"

current_hash=$(curl -s "$url" | md5sum | cut -d' ' -f1)

if [[ -f "$hash_file" ]]; then
  saved_hash=$(cat "$hash_file")
  if [[ "$current_hash" != "$saved_hash" ]]; then
    echo "$(date): CHANGED — $url"
    echo "$current_hash" > "$hash_file"
  else
    echo "$(date): No change"
  fi
else
  echo "$current_hash" > "$hash_file"
  echo "Baseline saved."
fi
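
Hashing the raw HTML flags every change, including rotating timestamps or CSRF tokens. A hedged refinement: hash only the visible text, so markup-only churn is ignored:

current_hash=$(curl -s "$url" | sed 's/<[^>]*>//g' | sed '/^[[:space:]]*$/d' | \
  md5sum | cut -d' ' -f1)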

Posting to a Web Page and Reading the Response

POST with curl

# Form submission
curl -X POST -d "username=omar&password=secret" https://example.com/login

# URL-encoded (same as above, explicit)
curl -X POST --data-urlencode "query=hello world" https://example.com/search

# JSON API
curl -s -X POST \
  -H "Content-Type: application/json" \
  -d '{"title": "New Post", "body": "Content here", "userId": 1}' \
  https://jsonplaceholder.typicode.com/posts | python3 -m json.tool

Read the full response (headers + body)

curl -i https://example.com                    # headers and body together
curl -D - https://example.com                  # dump received headers ahead of the body
curl -v https://example.com 2>&1               # verbose — everything

Check status code

code=$(curl -s -o /dev/null -w "%{http_code}" -X POST -d "data=val" https://example.com)
echo "Response: $code"

if [[ "$code" == "200" || "$code" == "201" ]]; then
  echo "Success"
else
  echo "Failed with $code"
fi

API interaction script

#!/bin/bash
API="https://jsonplaceholder.typicode.com"
TOKEN="your_token_here"

# GET
get_posts() {
  curl -s -H "Authorization: Bearer $TOKEN" "$API/posts" | python3 -m json.tool
}

# POST
create_post() {
  curl -s -X POST \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"title\": \"$1\", \"body\": \"$2\", \"userId\": 1}" \
    "$API/posts"
}

# DELETE
delete_post() {
  curl -s -X DELETE -o /dev/null -w "%{http_code}" "$API/posts/$1"
}

case "$1" in
  get)    get_posts ;;
  create) create_post "$2" "$3" ;;
  delete) delete_post "$2" ;;
  *)      echo "Usage: $0 {get|create|delete}" ;;
esac
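
Example invocations, assuming the script is saved as api.sh:

bash api.sh get                           # list posts
bash api.sh create "Title" "Body text"    # create a post
bash api.sh delete 1                      # delete post 1 (prints status code)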


This post is licensed under CC BY 4.0 by the author.