Home » Linux tools to treat files as sets and perform set operations on them

Linux tools to treat files as sets and perform set operations on them

Solutons:


Assuming elements are strings of characters other than NUL and newline (beware that newline is valid in file names though), you can represent a set as a text file with one element per line and use some of the standard Unix utilities.

Set Membership

$ grep -Fxc 'element' set   # outputs 1 if element is in set
                            # outputs >1 if set is a multi-set
                            # outputs 0 if element is not in set

$ grep -Fxq 'element' set   # returns 0 (true)  if element is in set
                            # returns 1 (false) if element is not in set

$ awk '$0 == "element" { s=1; exit }; END { exit !s }' set
# returns 0 if element is in set, 1 otherwise.

$ awk -v e="element" '$0 == e { s=1; exit } END { exit !s }'

Set Intersection

$ comm -12 <(sort set1) <(sort set2)  # outputs intersect of set1 and set2

$ grep -xF -f set1 set2

$ sort set1 set2 | uniq -d

$ join -t <(sort A) <(sort B)

$ awk '!done { a[$0]; next }; $0 in a' set1 done=1 set2

Set Equality

$ cmp -s <(sort set1) <(sort set2) # returns 0 if set1 is equal to set2
                                   # returns 1 if set1 != set2

$ cmp -s <(sort -u set1) <(sort -u set2)
# collapses multi-sets into sets and does the same as previous

$ awk '{ if (!($0 in a)) c++; a[$0] }; END{ exit !(c==NR/2) }' set1 set2
# returns 0 if set1 == set2
# returns 1 if set1 != set2

$ awk '{ a[$0] }; END{ exit !(length(a)==NR/2) }' set1 set2
# same as previous, requires >= gnu awk 3.1.5

Set Cardinality

$ wc -l < set     # outputs number of elements in set

$ awk 'END { print NR }' set

$ sed '$=' set

Subset Test

$ comm -23 <(sort -u subset) <(sort -u set) | grep -q '^'
# returns true iff subset is not a subset of set (has elements not in set)

$ awk '!done { a[$0]; next }; { if !($0 in a) exit 1 }' set done=1 subset
# returns 0 if subset is a subset of set
# returns 1 if subset is not a subset of set

Set Union

$ cat set1 set2     # outputs union of set1 and set2
                    # assumes they are disjoint

$ awk 1 set1 set2   # ditto

$ cat set1 set2 ... setn   # union over n sets

$ sort -u set1 set2  # same, but doesn't assume they are disjoint

$ sort set1 set2 | uniq

$ awk '!a[$0]++' set1 set2       # ditto without sorting

Set Complement

$ comm -23 <(sort set1) <(sort set2)
# outputs elements in set1 that are not in set2

$ grep -vxF -f set2 set1           # ditto

$ sort set2 set2 set1 | uniq -u    # ditto

$ awk '!done { a[$0]; next }; !($0 in a)' set2 done=1 set1

Set Symmetric Difference

$ comm -3 <(sort set1) <(sort set2) | tr -d 't'  # assumes not tab in sets
# outputs elements that are in set1 or in set2 but not both

$ sort set1 set2 | uniq -u

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

$ awk '!done { a[$0]; next }; $0 in a { delete a[$0]; next }; 1;
       END { for (b in a) print b }' set1 done=1 set2

Power Set

All possible subsets of a set displayed space separated, one per line:

$ p() { [ "$#" -eq 0 ] && echo || (shift; p "$@") |
        while read r; do printf '%s %sn%sn' "$1" "$r" "$r"; done; }
$ p $(cat set)

(assumes elements don’t contain SPC, TAB (assuming the default value of $IFS), backslash, wildcard characters).

Set Cartesian Product

$ while IFS= read -r a; do while IFS= read -r b; do echo "$a, $b"; done < set1; done < set2

$ awk '!done { a[$0]; next }; { for (i in a) print i, $0 }' set1 done=1 set2

Disjoint Set Test

$ comm -12 <(sort set1) <(sort set2)  # does not output anything if disjoint

$ awk '++seen[$0] == 2 { exit 1 }' set1 set2 # returns 0 if disjoint
                                             # returns 1 if not

Empty Set Test

$ wc -l < set            # outputs 0  if the set is empty
                         # outputs >0 if the set is not empty

$ grep -q '^' set        # returns true (0 exit status) unless set is empty

$ awk '{ exit 1 }' set   # returns true (0 exit status) if set is empty

Minimum

$ sort set | head -n 1   # outputs the minimum (lexically) element in the set

$ awk 'NR == 1 { min = $0 }; $0 < min { min = $0 }; END { print min }'
# ditto, but does numeric comparison when elements are numerical

Maximum

$ sort test | tail -n 1    # outputs the maximum element in the set

$ sort -r test | head -n 1

$ awk '$0 > max { max = $0 }; END { print max }'
# ditto, but does numeric comparison when elements are numerical

All available at http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/

Sort of. You need to deal with sorting yourself, but comm can be used to do that, treating each line as a set member: -12 for intersection, -13 for difference. (And -23 gives you flipped difference, that is, set2 - set1 instead of set1 - set2.) Union is sort -u in this setup.

The tiny console tool “setop” is now availible in Debian Stretch and in Ubuntu since 16.10. You can get it via

sudo apt install setop

Here are some examples. The sets to be operated on are given as different input files:

setop input # is equal to "sort input --unique"
setop file1 file2 --union # option --union is default and can be omitted
setop file1 file2 file3 --intersection # more than two inputs are allowed
setop file1 - --symmetric-difference # ndash stands for standard input
setop file1 -d file2 # all elements contained in 1 but not 2

Boolean queries return only EXIT_SUCCESS in case of true, and EXIT_FAILURE as well as a message otherwise. This way, setop can be used in the shell.

setop inputfile --contains "value" # is element value contained in input?
setop A.txt B.txt --equal C.txt # union of A and B equal to C?
setop bigfile --subset smallfile # analogous --superset
setop -i file1 file2 --is-empty # intersection of 1 and 2 empty (disjoint)?

It is also possible to precisly describe how the input streams shall be parsed, actually by regular expressions:

  • setop input.txt --input-separator "[[:space:]-]" means that a whitespace (i. e. v t n r f or space) or a minus sign is interpreted as a separator between elements (default is new line, i. e. every line of the input file is one element)
  • setop input.txt --input-element "[A-Za-z]+" means that elements are only words consisting of latin characters, all other characters are considered to be separators between elements

Furthermore, you can

  • --count all elements of the output set,
  • --trim all input elements (i. e. erase all unwanted preceding and succeeding characters like space, comma etc.),
  • consider empty elements as valid via --include-empty,
  • --ignore-case,
  • set the --output-separator between elements of the output stream (default is n),
  • and so on.

See man setop or github.com/phisigma/setop for more information.

Related Solutions

Joining bash arguments into single string with spaces

[*] I believe that this does what you want. It will put all the arguments in one string, separated by spaces, with single quotes around all: str="'$*'" $* produces all the scripts arguments separated by the first character of $IFS which, by default, is a space....

AddTransient, AddScoped and AddSingleton Services Differences

TL;DR Transient objects are always different; a new instance is provided to every controller and every service. Scoped objects are the same within a request, but different across different requests. Singleton objects are the same for every object and every...

How to download package not install it with apt-get command?

Use --download-only: sudo apt-get install --download-only pppoe This will download pppoe and any dependencies you need, and place them in /var/cache/apt/archives. That way a subsequent apt-get install pppoe will be able to complete without any extra downloads....

What defines the maximum size for a command single argument?

Answers Definitely not a bug. The parameter which defines the maximum size for one argument is MAX_ARG_STRLEN. There is no documentation for this parameter other than the comments in binfmts.h: /* * These are the maximum length and maximum number of strings...

Bulk rename, change prefix

I'd say the simplest it to just use the rename command which is common on many Linux distributions. There are two common versions of this command so check its man page to find which one you have: ## rename from Perl (common in Debian systems -- Ubuntu, Mint,...

Output from ls has newlines but displays on a single line. Why?

When you pipe the output, ls acts differently. This fact is hidden away in the info documentation: If standard output is a terminal, the output is in columns (sorted vertically) and control characters are output as question marks; otherwise, the output is...

mv: Move file only if destination does not exist

mv -vn file1 file2. This command will do what you want. You can skip -v if you want. -v makes it verbose - mv will tell you that it moved file if it moves it(useful, since there is possibility that file will not be moved) -n moves only if file2 does not exist....

Is it possible to store and query JSON in SQLite?

SQLite 3.9 introduced a new extension (JSON1) that allows you to easily work with JSON data . Also, it introduced support for indexes on expressions, which (in my understanding) should allow you to define indexes on your JSON data as well. PostgreSQL has some...

Combining tail && journalctl

You could use: journalctl -u service-name -f -f, --follow Show only the most recent journal entries, and continuously print new entries as they are appended to the journal. Here I've added "service-name" to distinguish this answer from others; you substitute...

how can shellshock be exploited over SSH?

One example where this can be exploited is on servers with an authorized_keys forced command. When adding an entry to ~/.ssh/authorized_keys, you can prefix the line with command="foo" to force foo to be run any time that ssh public key is used. With this...

Why doesn’t the tilde (~) expand inside double quotes?

The reason, because inside double quotes, tilde ~ has no special meaning, it's treated as literal. POSIX defines Double-Quotes as: Enclosing characters in double-quotes ( "" ) shall preserve the literal value of all characters within the double-quotes, with the...

What is GNU Info for?

GNU Info was designed to offer documentation that was comprehensive, hyperlinked, and possible to output to multiple formats. Man pages were available, and they were great at providing printed output. However, they were designed such that each man page had a...

Set systemd service to execute after fstab mount

a CIFS network location is mounted via /etc/fstab to /mnt/ on boot-up. No, it is not. Get this right, and the rest falls into place naturally. The mount is handled by a (generated) systemd mount unit that will be named something like mnt-wibble.mount. You can...

Merge two video clips into one, placing them next to each other

To be honest, using the accepted answer resulted in a lot of dropped frames for me. However, using the hstack filter_complex produced perfectly fluid output: ffmpeg -i left.mp4 -i right.mp4 -filter_complex hstack output.mp4 ffmpeg -i input1.mp4 -i input2.mp4...

How portable are /dev/stdin, /dev/stdout and /dev/stderr?

It's been available on Linux back into its prehistory. It is not POSIX, although many actual shells (including AT&T ksh and bash) will simulate it if it's not present in the OS; note that this simulation only works at the shell level (i.e. redirection or...

How can I increase the number of inodes in an ext4 filesystem?

It seems that you have a lot more files than normal expectation. I don't know whether there is a solution to change the inode table size dynamically. I'm afraid that you need to back-up your data, and create new filesystem, and restore your data. To create new...

Why doesn’t cp have a progress bar like wget?

The tradition in unix tools is to display messages only if something goes wrong. I think this is both for design and practical reasons. The design is intended to make it obvious when something goes wrong: you get an error message, and it's not drowned in...