For such a big file, one possibility is Flex. Let unk.l be:

    %%
    \<unk\>     printf("<raw_unk>");
    %%

Then compile and execute:

    $ flex -o unk.c unk.l
    $ cc -o unk -O2 unk.c -lfl
    $ ./unk < corpus.txt > corpus.txt.new

The usual text processing tools are not designed to handle lines that don’t fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).
If there’s an ASCII character that appears frequently in the file and doesn’t appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don’t allow custom record separators, swap between that character and newlines.
tr processes bytes, not lines, so it doesn’t care about any record size. Supposing that ; works:

    <corpus.txt tr '\n;' ';\n' | sed 's/<unk>/<raw_unk>/g' | tr '\n;' ';\n' >corpus.txt.new

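To see what the swap does, here is a small in-memory sketch of the same pipeline in Python. It only illustrates the idea on a short sample and is not meant for the huge file itself; the names SWAP and replace_via_swap are made up for this example.

```python
# Byte-for-byte swap of newline and ';', like: tr '\n;' ';\n'
SWAP = bytes.maketrans(b"\n;", b";\n")

def replace_via_swap(data: bytes) -> bytes:
    # After the swap, every ';' in the input has become a line break,
    # so the "lines" are short even if the input had no newlines at all.
    swapped = data.translate(SWAP)
    # Per-line substitution, like: sed 's/<unk>/<raw_unk>/g'
    lines = [line.replace(b"<unk>", b"<raw_unk>") for line in swapped.split(b"\n")]
    # Swap back to restore the original separators.
    return b"\n".join(lines).translate(SWAP)

sample = b"a <unk> b; c <unk>;\nd"
result = replace_via_swap(sample)
print(result)
```

The original ; and newline bytes come back in their original positions because the same swap is applied twice.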
You could also anchor on the first character of the text you’re searching for, assuming that it isn’t repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.

    <corpus.txt tr '\n<' '<\n' | sed 's/^unk>/raw_unk>/g' | tr '\n<' '<\n' >corpus.txt.new

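The spurious match is easier to see on a tiny sample. The Python sketch below mimics the swap on < and treats the skip_first flag as a stand-in for restricting sed to lines 2 onwards; that reading, and the names used, are assumptions made for illustration.

```python
# Byte-for-byte swap of newline and '<', like: tr '\n<' '<\n'
SWAP = bytes.maketrans(b"\n<", b"<\n")

def replace_anchored(data: bytes, skip_first: bool = True) -> bytes:
    records = data.translate(SWAP).split(b"\n")
    out = []
    for n, rec in enumerate(records):
        # Every record except the first was preceded by '<' in the input.
        # The first record has no such guarantee, so it is skipped by default.
        if rec.startswith(b"unk>") and (n > 0 or not skip_first):
            rec = b"raw_unk>" + rec[len(b"unk>"):]
        out.append(rec)
    return b"\n".join(out).translate(SWAP)

sample = b"unk> x <unk> y\n<unk>"
safe = replace_anchored(sample)
spurious = replace_anchored(sample, skip_first=False)
print(safe)
print(spurious)
```

With skip_first left on, the leading unk> (which had no < in front of it) is left alone; with it off, that text is rewritten as well.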
Alternatively, use the last character.

    <corpus.txt tr '\n>' '>\n' | sed 's/<unk$/<raw_unk/g' | tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn’t end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you’ll avoid any portability trouble.
So you don’t have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.
The necessary operations are all included in Python. There are several annoying subtleties, but it does avoid having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error reporting for free (Python “exceptions”) :).

    #!/usr/bin/python3
    # This script takes input from stdin
    # (but it must be a regular file, to support mapping it),
    # and writes the result to stdout.

    search = b'<unk>'
    replace = b'<raw_unk>'

    import sys
    import os
    import mmap

    # sys.stdout requires str, but we want to write bytes
    out_bytes = sys.stdout.buffer

    mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
    i = mem.find(search)
    if i < 0:
        sys.exit("Search string not found")

    # mmap object subscripts to bytes (making a copy)
    # memoryview object subscripts to a memoryview object
    # (it implements the buffer protocol).
    view = memoryview(mem)

    out_bytes.write(view[:i])
    out_bytes.write(replace)
    out_bytes.write(view[i+len(search):])
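
Note that the script above replaces only the first occurrence. If every occurrence needs replacing, one possible variant is to loop over mmap.find with a start offset. This is a sketch, not a tested tool for terabyte files; the function name replace_all_mmap and its parameters are invented here, and it takes a path rather than stdin to keep the example self-contained.

```python
import io
import mmap

def replace_all_mmap(in_path, out, search, replace):
    """Copy in_path to the binary stream out, replacing every match of search.

    Returns the number of replacements. Raises ValueError on an empty
    file, since an empty file cannot be mapped.
    """
    count = 0
    with open(in_path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem, \
         memoryview(mem) as view:
        pos = 0
        while True:
            i = mem.find(search, pos)
            if i < 0:
                break
            # Slicing the memoryview does not copy the mapped bytes.
            out.write(view[pos:i])
            out.write(replace)
            pos = i + len(search)
            count += 1
        # Whatever follows the last match (or the whole file if none).
        out.write(view[pos:])
    return count

if __name__ == "__main__":
    import sys
    # Usage (hypothetical script name): replace_all.py corpus.txt > corpus.txt.new
    if len(sys.argv) == 2:
        replace_all_mmap(sys.argv[1], sys.stdout.buffer, b"<unk>", b"<raw_unk>")
```

The memoryview is released before the mapping is closed, which is why it is the innermost context manager; closing the mmap while a view is still exported raises BufferError.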