Dino Fizzotti

Jan 4, 2022 - Software, Go

SiteMapper Part 1: Exploring concurrency in Go

$ ./sm -s https://google.com | jq 
2021/12/31 14:57:26 Using mode: concurrent
2021/12/31 14:57:26 Crawling https://google.com with depth 1
2021/12/31 14:57:26 visiting URL https://google.com at depth 0 with parent https://google.com
2021/12/31 14:57:26 Elapsed milliseconds:  263
[
  {
    "URL": "https://google.com",
    "Links": [
      "https://accounts.google.com/ServiceLogin",
      "https://drive.google.com/",
      "https://mail.google.com/mail/",
      "https://news.google.com/",
      "https://play.google.com/",
      "https://www.google.com/advanced_search",
      "https://www.google.com/intl/en/about.html",
      "https://www.google.com/intl/en/ads/",
      "https://www.google.com/intl/en/policies/privacy/",
      "https://www.google.com/intl/en/policies/terms/",
      "https://www.google.com/preferences",
      "https://www.google.com/search",
      "https://www.google.com/services/",
      "https://www.google.com/setprefdomain"
    ]
  }
]

I’ve really been enjoying my journey in becoming more familiar with Go. One of Go’s strengths is its built-in concurrency primitives, namely goroutines and channels. I decided to practice my Go skills by building a tool which explores some of these features. SiteMapper accepts a root URL and a maximum depth, and then uses one of three modes of operation to crawl the site (synchronous, concurrent and concurrent-limited), building out a sitemap listing links to internal pages.

Part 1 of this project details the stand-alone CLI tool implementation written in Go. Part 2 details an implementation using Kubernetes, where site crawling is performed by ephemeral Kubernetes job pods.

Link to GitHub repo: https://github.com/dinofizz/sitemapper

Notes on the usage example above:

  • The output is piped to jq to improve readability.
  • Increasing the depth to 2 with the -d 2 flag means that each of the links above is itself crawled for further google.com links (example below).
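
For example, re-running the crawl above to a depth of 2, with the output again piped through jq:

$ ./sm -s https://google.com -d 2 | jq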

Modes of operation

Three modes of operation are available: synchronous, concurrent and concurrent-limited. Each mode is implemented as a recursive method that checks the current depth against the maximum depth to decide when to stop recursing.

Synchronous

Not much to explain here: the crawler performs a standard recursive traversal, visiting each link at each depth below the maximum, one at a time.

func (c *SynchronousCrawlEngine) crawl(u, root, parent string, depth int) {
	if c.maxDepth == depth {
		return
	}
	urls := getLinks(u, root, parent, depth, c.sm)
	depth++

	for _, urlLink := range urls {
		c.crawl(urlLink, root, u, depth)
	}
}
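
The getLinks helper (shared by all three engines) isn't shown in this post: it presumably fetches the page at the given URL, extracts the links internal to the site, and records them in the sitemap it receives via c.sm. As a rough idea only, a sketch of the link-extraction part might look like the following. The function name, signature and use of golang.org/x/net/html are my assumptions, not the repo's implementation, and the real rules for deciding what counts as "internal" are more nuanced, as noted at the end of the post.

// Hypothetical sketch only, not the repo's getLinks: fetch a page, walk the
// parsed HTML for <a href> attributes, resolve them against the page URL and
// keep only links on the same host as the root site.
// Assumed imports: "net/http", "net/url", "golang.org/x/net/html".
func extractInternalLinks(pageURL, root string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	rootURL, err := url.Parse(root)
	if err != nil {
		return nil, err
	}

	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var visit func(n *html.Node)
	visit = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key != "href" {
					continue
				}
				link, err := base.Parse(attr.Val) // resolves relative URLs
				if err != nil {
					continue
				}
				if link.Host == rootURL.Host {
					link.Fragment = "" // ignore #anchors
					links = append(links, link.String())
				}
			}
		}
		for child := n.FirstChild; child != nil; child = child.NextSibling {
			visit(child)
		}
	}
	visit(doc)

	return links, nil
}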

Concurrent

  • For the concurrent mode I run the recursive call as a goroutine wrapped in a closure.
  • A sync.WaitGroup is used to wait for all goroutines to complete.
  • For each goroutine created, the WaitGroup count is incremented using Add().
  • A corresponding decrement is performed via a deferred Done() call when the goroutine completes.
  • The caller of the crawl method waits for all goroutines to complete using Wait() (a sketch of that caller follows the code below).

func (c *ConcurrentCrawlEngine) crawl(u, root, parent string, depth int) {
	if c.maxDepth == depth {
		return
	}
	urls := getLinks(u, root, parent, depth, c.sm)
	depth++

	for _, urlLink := range urls {
		c.WG.Add(1)
		go func(urlLink, root, parent string, d int) {
			defer c.WG.Done()
			c.crawl(urlLink, root, parent, d)
		}(urlLink, root, u, depth)
	}
}
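
The Wait() call itself sits with whatever code starts the crawl. A minimal sketch of such a caller, where the Run method name and signature are my assumptions rather than the repo's actual code:

// Hypothetical caller: start the crawl at depth 0, then block until every
// goroutine spawned during the recursion has called Done().
func (c *ConcurrentCrawlEngine) Run(root string) {
	c.crawl(root, root, root, 0)
	c.WG.Wait()
}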

Concurrent Limited

The concurrent limited mode is similar to the concurrent mode, but uses a buffered channel to gate the number of goroutines running (a pattern I learnt from reading “Learning Go” by Jon Bodner).

Limiter code

  • A buffered channel is created and filled with empty structs.
  • Each time a crawl goroutine wants to run, the limiter reads a struct from the buffered channel, runs the supplied function, and then puts the struct back on the channel.
  • If no struct is available to be read from the channel (the limit has been reached), the limiter returns an error immediately.

type Limiter struct {
	ch    chan struct{}
	limit int
}

func NewLimiter(limit int) *Limiter {
	ch := make(chan struct{}, limit)

	for i := 0; i < limit; i++ {
		ch <- struct{}{}
	}

	return &Limiter{
		ch:    ch,
		limit: limit,
	}
}

func (l *Limiter) RunFunc(f func()) error {
	select {
	case <-l.ch:
		f()
		l.ch <- struct{}{}
		return nil
	default:
		return errors.New("limit reached, try again later")
	}
}
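
In isolation the limiter can be used like this (a minimal usage sketch; the work inside the closure is just a stand-in):

// Allow at most 3 tasks to run at the same time. RunFunc does not block:
// if no token is available it returns an error straight away.
limiter := NewLimiter(3)
if err := limiter.RunFunc(func() {
	// ... work for one task, e.g. crawling a single URL ...
}); err != nil {
	// Limit reached: back off and retry, which is what the crawl engine
	// below does.
}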

Concurrent limited crawl implementation

The operation is the same as for the concurrent crawl engine; however, the recursive crawl call is wrapped in a function and passed to the limiter's RunFunc. If the limiter returns an error, I wait a random amount of time before trying again.

func (c *ConcurrentLimitedCrawlEngine) crawl(u, root, parent string, depth int) {
	if c.maxDepth == depth {
		return
	}
	urls := getLinks(u, root, parent, depth, c.sm)
	depth++

	for _, urlLink := range urls {
		c.WG.Add(1)
		go func(urlLink, root, parent string, d int) {
			defer c.WG.Done()
			for {
				err := c.limiter.RunFunc(func() {
					c.crawl(urlLink, root, parent, d)
				})
				if err != nil {
					n := rand.Intn(500)
					log.Printf("task limited for URL %s, sleeping for %d milliseconds\n", urlLink, n)
					time.Sleep(time.Duration(n) * time.Millisecond)
				} else {
					break
				}
			}
		}(urlLink, root, u, depth)
	}
}

Other features

  • To keep track of the links found at each URL I’m using a map data structure, protected by a sync.Mutex that is locked around every read and write (see the sketch after this list).
  • I don’t visit URLs I’ve already visited.
  • The sitemap struct implements the io.WriterTo interface, which allows me to write the output to stdout.
  • Custom JSON marshalling methods produce a more user-friendly JSON structure for the sitemap.
  • Tests exist for the crawler and sitemap-related functionality.
  • I’m using cobra for command line argument handling.
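
A rough sketch of that mutex-protected structure follows. The type, field and method names here are my assumptions, not the repo's code; the "map of maps" shape is the one described in the next section.

// Hypothetical sketch: the outer map is keyed by visited URL, the inner map
// acts as a set of links found on that page. The mutex guards all access.
type SiteMap struct {
	mu    sync.Mutex
	sites map[string]map[string]struct{}
}

func NewSiteMap() *SiteMap {
	return &SiteMap{sites: map[string]map[string]struct{}{}}
}

func (s *SiteMap) AddLink(page, link string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.sites[page]; !ok {
		s.sites[page] = map[string]struct{}{}
	}
	s.sites[page][link] = struct{}{}
}

func (s *SiteMap) Visited(page string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.sites[page]
	return ok
}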

Things I would do differently

  • The sitemap data structure is actually a map of maps (done this way because I wanted a map of sets, but Go has no built-in set type, so an inner map stands in for a set). I can probably come up with a better structure.
  • I decided to strip trailing slashes from the URLs used as map keys, because crawl results were listing links for both http://www.example.com/about and http://www.example.com/about/. I might want to revisit that decision.
  • Rules for parsing internal URLs are trickier than expected (handling schemes, anchors, relative and absolute links). The current implementation is very trial and error.
  • Add more tests.