Buffered read split by two consecutive newlines using bufio.Scanner in Go

Tags: go golang adventofcode2020

Preface

During Advent of Code 2020, several of the puzzles so far have had input data in which items are separated by two consecutive newlines. I wanted to figure out how to use bufio.Scanner to hand me those multiline items one by one.

Golang’s bufio.Scanner

By default, a scanner created with bufio.NewScanner uses bufio.ScanLines as its split function, which splits the input on single newlines. This means that a Reader reading from a stream containing

hello, world
foo bar
baz 42

yields three lines over three consecutive calls to scanner.Scan() when using the following code:

text := `hello, world
foo bar
baz 42`
r := strings.NewReader(text)
scanner := bufio.NewScanner(r)
i := 1
for scanner.Scan() {
  fmt.Println(i, scanner.Text())
  i++
}

Code from https://golang.org/pkg/bufio/#example_Scanner_lines

However, to split the input on two consecutive newlines instead, a new split function (a bufio.SplitFunc) must be written and handed to the scanner with scanner.Split.
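
As a minimal sketch of how the splitting behaviour is swapped out, the same sample text can be scanned with the built-in bufio.ScanWords instead of the default bufio.ScanLines. The helper names tokens, lines and words are only illustrative and not part of the standard library.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// tokens is a hypothetical helper: it scans text with the given split
// function and collects every token into a slice.
func tokens(text string, split bufio.SplitFunc) []string {
	scanner := bufio.NewScanner(strings.NewReader(text))
	scanner.Split(split) // must be set before the first call to Scan
	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out
}

// lines uses the default split function, words uses a built-in alternative.
func lines(text string) []string { return tokens(text, bufio.ScanLines) }
func words(text string) []string { return tokens(text, bufio.ScanWords) }

func main() {
	text := "hello, world\nfoo bar\nbaz 42"
	fmt.Println(len(lines(text))) // 3
	fmt.Println(len(words(text))) // 6
}
```

A custom split function plugs into scanner.Split in exactly the same way as bufio.ScanWords does here.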

Writing a custom split function for bufio.Scanner

I first took a look at the original implementation. It’s much easier to write a new implementation when starting from a fully working example, especially one that already behaves almost exactly as I want.

// dropCR drops a terminal \r from the data.
func dropCR(data []byte) []byte {
  if len(data) > 0 && data[len(data)-1] == '\r' {
    return data[0 : len(data)-1]
  }
  return data
}

// ScanLines is a split function for a Scanner that returns each line of
// text, stripped of any trailing end-of-line marker. The returned line may
// be empty. The end-of-line marker is one optional carriage return followed
// by one mandatory newline. In regular expression notation, it is `\r?\n`.
// The last non-empty line of input will be returned even if it has no
// newline.
func ScanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
  if atEOF && len(data) == 0 {
    return 0, nil, nil
  }
  if i := bytes.IndexByte(data, '\n'); i >= 0 {
    // We have a full newline-terminated line.
    return i + 1, dropCR(data[0:i]), nil
  }
  // If we're at EOF, we have a final, non-terminated line. Return it.
  if atEOF {
    return len(data), dropCR(data), nil
  }
  // Request more data.
  return 0, nil, nil
}

Updating the function to split on two consecutive newlines instead shouldn’t be too hard. Additionally, I want it to join the lines within each item, so that scanner.Text() yields a single string without newlines. For example, foo\nbar\n\nbaz yields two strings: foo bar and baz. In other words, standalone newlines are replaced with spaces.
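
To pin down the intended output, here is a small non-streaming reference, assuming the whole input fits in memory and uses \n line endings only. The function name groups is hypothetical. It is not the Scanner-based solution, only a statement of what that solution should produce.

```go
package main

import (
	"fmt"
	"strings"
)

// groups is a hypothetical reference implementation: it splits the input
// on blank lines and joins the lines of each item with single spaces.
func groups(input string) []string {
	var out []string
	for _, block := range strings.Split(input, "\n\n") {
		// Standalone newlines inside an item become spaces.
		joined := strings.TrimSpace(strings.ReplaceAll(block, "\n", " "))
		if joined != "" {
			out = append(out, joined)
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", groups("foo\nbar\n\nbaz")) // ["foo bar" "baz"]
}
```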

To make things easier for myself, I am using regular expressions instead of searching for byte indices by hand. This lets me replace both \n and \r in a single operation, while also giving me the indices I need to return on each invocation. It is probably more expensive performance-wise than bytes.IndexByte, but computers are fast and my input is short, so I am not worried about that.
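
The two regexp operations this relies on can be tried in isolation. The sketch below assumes a separator pattern of two `\r?\n` line endings. The pattern and the names eols, twoEols, firstSeparator and flatten are illustrative choices, not necessarily the ones used in the final function.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Any run of carriage returns and newlines.
	eols = regexp.MustCompile(`[\r\n]+`)
	// Two consecutive line endings (an assumed pattern for this sketch).
	twoEols = regexp.MustCompile(`(?:\r?\n){2}`)
)

// firstSeparator returns the start and end offsets of the first blank-line
// separator in data, or -1, -1 when there is none.
func firstSeparator(data []byte) (start, end int) {
	loc := twoEols.FindIndex(data)
	if loc == nil {
		return -1, -1
	}
	return loc[0], loc[1]
}

// flatten replaces every run of \r and \n with a single space.
func flatten(data []byte) string {
	return string(eols.ReplaceAll(data, []byte(" ")))
}

func main() {
	data := []byte("foo\nbar\n\nbaz")
	start, end := firstSeparator(data)
	fmt.Println(start, end)            // 7 9
	fmt.Println(flatten(data[:start])) // foo bar
}
```

The end offset is what a split function reports as advance, and the flattened bytes before the start offset become the token.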

This led me to the following implementation:

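var (
	// Any run of carriage returns and newlines.
	patEols = regexp.MustCompile(`[\r\n]+`)
	// Two consecutive line endings, each `\r?\n`. A character class such as
	// `[\r\n]{2}` would also match a single Windows line ending (\r\n).
	pat2Eols = regexp.MustCompile(`(?:\r?\n){2}`)
)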

// Modified version of Go's builtin bufio.ScanLines to return strings separated by
// two newlines (instead of one). Returns a string without newlines in it, and trims
// spaces from start and end.
// https://github.com/golang/go/blob/master/src/bufio/scan.go#L344-L364
func ScanTwoConsecutiveNewlines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}

	if loc := pat2Eols.FindIndex(data); loc != nil && loc[0] >= 0 {
		// Replace newlines within string with a space
		s := patEols.ReplaceAll(data[0:loc[0]+1], []byte(" "))
		// Trim spaces and newlines from string
		s = bytes.Trim(s, "\n ")
		return loc[1], s, nil
	}

	if atEOF {
		// Replace newlines within string with a space
		s := patEols.ReplaceAll(data, []byte(" "))
		// Trim spaces and newlines from string
		s = bytes.Trim(s, "\r\n ")
		return len(data), s, nil
	}

	// Request more data.
	return 0, nil, nil
}
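
A usage sketch: the split function is handed to the scanner with scanner.Split before the first call to Scan. So that the example compiles on its own, it carries a condensed variant of the function above under the hypothetical name scanBlocks, with an assumed separator pattern of two `\r?\n` line endings. The helper collect is also illustrative.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"regexp"
	"strings"
)

var (
	eols    = regexp.MustCompile(`[\r\n]+`)
	twoEols = regexp.MustCompile(`(?:\r?\n){2}`) // assumed separator pattern
)

// scanBlocks is a condensed, hypothetical variant of the split function
// above: one token per blank-line-separated item, newlines turned to spaces.
func scanBlocks(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if loc := twoEols.FindIndex(data); loc != nil {
		s := eols.ReplaceAll(data[:loc[0]], []byte(" "))
		return loc[1], bytes.TrimSpace(s), nil
	}
	if atEOF {
		// Final item without a trailing separator.
		s := eols.ReplaceAll(data, []byte(" "))
		return len(data), bytes.TrimSpace(s), nil
	}
	return 0, nil, nil // request more data
}

// collect scans input with scanBlocks and returns all items.
func collect(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(scanBlocks)
	var items []string
	for scanner.Scan() {
		items = append(items, scanner.Text())
	}
	return items, scanner.Err()
}

func main() {
	items, err := collect("abc\nde\n\nfgh\n\nij\nkl\n")
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for i, item := range items {
		fmt.Println(i+1, item)
	}
}
```

Checking scanner.Err() after the loop matters for real input, since a single item larger than the scanner's buffer limit makes Scan stop with bufio.ErrTooLong.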

Take a look at my utility repository for updated code, along with a test for this function.

If you have any comments or feedback, please send me an e-mail. (stig at stigok dotcom).
