Home » Find common “characters” in 2 given strings (rev5)

Find common “characters” in 2 given strings (rev5)

Solutons:


A number of efficiency issues have been addressed in your code, but there are a number of functional bugs which have not been pointed out, and some tricks you can try that make code more readable.

First up, your desire for the “most stream like” implementation is only useful if streams are the best tools for the job. Not everything in your code is suited to streams, and certainly not suited to a single stream.

Big Bug 1

Let’s look at your first bug:

if (string1.length() < string2.length()) {

There are two issues here, both are related to the fact that length only counts the number of characters in the input – where character is a 16-bit value.

The first symptom here is for wide characters which your code anticipates, are 2 characters wide, but are only one code point. Since your code deals with code points, it should only treat the number of code points in the text, not the number of char values required to store them.

The second symptom is that your text may not be similarly represented. A simple character like ö could be represented in multiple ways in Unicode, with some of them requiring “combining character sequences” to append the accents to the base character. In other words, the input string that looks like ö may be one, or two characters long….. Also, not all combining marks can be composed in a single character. Unicode is complicated.

The bottom line is that your very fist step in your code makes assumptions that may be untrue, and as a result, your code will possibly choose the wrong input as being the shortest, and, subsequently, report the common characters in the wrong order.

Big Bug 2

By the way, this same issue results in another bug – you may not identify common characters. A Character using combination marks in one input string will not match a character in composite form in the other.

Here are two additional test cases you need to solve:

{ "abcdefg", "bou0308ou0308ou0308a", "ab" }, // the second input is shorter!
{ "aöbcdedgh", "bou0308ou0308ou0308a", "aöb" }, // the second input has expanded ö

The only real way to solve this problem is to normalize both inputs in the same way, and while you are normalizing the text, you may as well convert to lower-case at the same time.

This in turn leads on to a discussion about combining characters. If, after normalization, there are un-composited combining characters, it means that there is no composite character that represents the combining accents… so, how can you treat that as a single character? By maintaining an array of characters that form one base character only.

This is beyond the scope of my answer, how to do that (bug I did it anyway), but, in short, each character should be represented as an array of base character, followed by whatever combining characters are modifying it.

So, your goal: Handle “characters” (i.e. code points) outside the Basic Multilingual Plane (BMP), including characters from Supplementary Planes. is not handled correctly.

Now, I would expect your code to have a method that breaks down the input string in to a normalized set of base characters (possibly surrogate characters), followed by combining marks. Both sides of the input strings should normalize in the same ways, leading to common normalized outputs. The Java library contains the Normalizer class for this exact reason.

Re-implemented – functional, not necessarily efficient

I have put this all together here. If it looks complicated, well, it is because it is complicated…. 😉 :

import java.text.Normalizer;
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.junit.Assert;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class CommonCharacters5 {

    private static final class CompositeCharacter {
        private final char[] sequence;
        private final int hashcode;

        public CompositeCharacter(char[] sequence) {
            this.sequence = sequence;
            this.hashcode = Arrays.hashCode(sequence);
        }

        @Override
        public int hashCode() {
            return hashcode;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof CompositeCharacter)) {
                return false;
            }
            CompositeCharacter other = (CompositeCharacter) obj;
            return hashcode == other.hashcode && Arrays.equals(sequence, other.sequence);
        }
        
        
        @Override
        public String toString() {
            return Normalizer.normalize(new String(sequence), Normalizer.Form.NFC);
        }
        
        
    }
    
    private static final CompositeCharacter[] normalize(String input) {
        // decompose all input to separate combining marks.
        final char[] normal = Normalizer.normalize(input.toLowerCase(), Normalizer.Form.NFD).toCharArray();
        final int limit = normal.length;
        final int last = limit - 1;
        // assume one char per component
        final CompositeCharacter[] span = new CompositeCharacter[limit];
        int left = 0;
        int cnt = 0;
        while (left < limit) {
            int right = left;
            if (Character.isHighSurrogate(normal[right])) {
                right++;
            }
            while (right < last && Character.NON_SPACING_MARK == Character.getType(normal[right + 1])) {
                right++;
            }
            right++;
            span[cnt++] = new CompositeCharacter(Arrays.copyOfRange(normal, left, right));
            left = right;
        }
        return Arrays.copyOf(span, cnt);
    }    

    private static String commonCharactersOf(String string1, String string2) {
        CompositeCharacter[] common = commonComposites(normalize(string1), normalize(string2));
        return Stream.of(common).map(CompositeCharacter::toString).collect(Collectors.joining());
    }

    private static CompositeCharacter[] commonComposites(final CompositeCharacter[] longer, final CompositeCharacter[] shorter) {
        // Requirement
        //
        // Always return lowercase versions of common characters. e.g.:
        //
        // OK: (a, a) -> a; OK: (a, A) -> a; OK: (A, A) -> a
        // No: (a, A) -> a; No: (A, A) -> A; No: (aA, aA) -> aA;
        //
        // Requirement
        //
        // Return common characters joined in a String, preserving the order in
        // which they appeared in the longest argument, or in the first argument
        // if
        // the arguments are of the same length.
        //
        // Requirement
        //
        // Handle "characters" (i.e. code points) outside the Basic Multilingual
        // Plane (BMP), including characters from Supplementary Planes.
        // There should be no `char' or `Character' based "false positives".
        // e.g.:
        //
        // String string1 = "uD835uDC00", string2 = "uD835uDC01";
        // string1 and string2 share no characters in the intended acceptation
        // of
        // "character".
        
        if (shorter.length > longer.length) {
            // recurse with swapped arguments.
            return commonComposites(shorter, longer);
        }
        // @formatter:off
        Set<CompositeCharacter> shorterArgumentCodePoints = Stream.of(shorter).collect(Collectors.toSet());
        return Stream.of(longer).filter(shorterArgumentCodePoints::contains)
                .limit(shorterArgumentCodePoints.size()).toArray(s -> new CompositeCharacter[s]);
    }

    @Parameters(name = "({0}, {1}) -> {2}")
    public static Collection<String[]> data() {
        return Arrays.asList(new String[][] {
                // @formatter:off
                { "", "", "" }, { "a", "", "" }, { "", "a", "" }, { "aa", "", "" }, { "", "aa", "" },
                { "a", "a", "a" }, { "aa", "b", "" }, { "b", "aa", "" }, { "ab", "ba", "ab" }, { "aba", "ab", "ab" },
                { "aba", "ba", "ab" }, { "aba", "aab", "ab" }, { "a", "A", "a" }, { "A", "a", "a" }, { "A", "A", "a" },
                { "ab", "AB", "ab" }, { "AB", "ab", "ab" }, { "aB", "Ab", "ab" }, { "aB", "Ba", "ab" },
                { "aB", "Ba", "ab" }, { "abc", "ac", "ac" }, { "abc", "ca", "ac" }, { "abc", "cba", "abc" },
                { "a", "uD835uDC1A", "" }, { "uD835uDC1A", "uD835uDC1A", "uD835uDC1A" },
                { "uD835uDC00", "uD835uDC00", "uD835uDC00" }, { "uD835uDC1A", "uD835uDC00", "" },
                { "uD835uDC00", "uD835uDC01", "" }, { "uD801uDC2B", "uD801uDC2B", "uD801uDC2B" },
                { "uD801uDC03", "uD801uDC03", "uD801uDC2B" }, { "uD801uDC2B", "uD801uDC03", "uD801uDC2B" },
                { "uD83DuDE80", "uD83DuDE80", "uD83DuDE80" }, { "a", "aaaaaaaaaaaaaaaaa", "a" },
        // The last test should still work, and work fast, with a second
        // argument string starting with "a" and ending _many_ characters later
        // The last test values doe not test it, but illustrate the scenario
                { "abcdefg", "bou0308ou0308ou0308a", "ab" }, 
                { "aöbcdedgh", "bou0308ou0308ou0308a", "aöb" }, 
        // @formatter:on
                });
    }
    
    private String string1;
    private String string2;
    private String expected;

    public CommonCharacters5(String string1, String string2, String expected) {
        this.string1 = string1;
        this.string2 = string2;
        this.expected = expected;
    }

    @org.junit.Test
    public void test() {
        Assert.assertEquals(expected, commonCharactersOf(string1, string2));
    }
}

Self-call with swapped args trick

One trick I have in the above code, is the simple 1-level recursion if the argument-length guess is wrong…. it’s a common trick that sometimes confuses people:

    if (shorter.length > longer.length) {
        // recurse with swapped arguments.
        return commonComposites(shorter, longer);
    }

In the code, you call it with a short and long argument. If you guessed the wrong argument is shorter, then just call the method again with the arguments reversed. This removes the need for a conditional “swap” routine, and additional variables.

To Do

Even with the more comprehensive handling of composited accents, etc. the code still should probably check for COMBINING_SPACING_MARK characters too. I don’t know unicode well enough to know if that is significant.

Comments

I think your huge comment block about the “Requirement” of your class can be better placed as a class-level Javadoc comment:

/**
 * Return common characters joined in a String, preserving the order in...
 */
public class CommonCharacters5 {
    // ...
}

Utility classes and methods

For utility classes and methods, the convention is to make the class final and the constructor private so that it’s clear that they should not be instantiated:

public final class CommonCharacters5 {

    private CommonCharacters5() {
        // empty
    }

    public static String commonCharactersOf(String string1, String string2) {
        // ...
    }
}

CharSequence vs String

Since codePoints() is actually a method from the CharSequence interface, you may want to consider making your method accept a pair of CharSequences so that they are even less restrictive to work with.

Unit testing

I lean more towards TestNG myself, and I think its @DataProvider annotation usage is more expressive to construct a set of tests to assert iteratively. I happen to have another answer here that touches on parameterized testing with TestNG. 🙂

Regardless of the testing framework though, you may want to consider whether to opt for a better modeling approach for your tests, to make them more expressive. For example, if you can encapsulate them as an Enum:

enum MyTestCase {
    EMPTY_STRINGS("", "", ""),
    A_AND_EMPTY("a", "", ""),
    A_AND_B("a", "b", ""),
    // ...
    ROCKET_EMOJI("uD83DuDE80", "uD83DuDE80", "uD83DuDE80"),
    // ...

    private final String one;
    private final String other;
    private final String expected;

    private MyTestCase(String one, String other, String expected) {
        this.one = one;
        this.other = other;
        this.expected = expected;
    }
}

In your case, the slight benefit I can see is that you can freely swap the inputs to test that your method works regardless of which argument is the longer one:

@org.junit.Test
public void test() {
    // assuming current is an instance of MyTestCase
    Assert.assertEquals(current.expected, commonCharactersOf(current.one, current.other));
    Assert.assertEquals(current.expected, commonCharactersOf(current.other, current.one));
}

Related Solutions

How to delete a non-empty directory in Terminal?

Use the below command : rm -rf lampp It deletes all files and folders contained in the lampp directory. In case user doesn't have the permission to delete the folder: Add sudo at the beginning of the command : sudo rm -rf folderName Otherwise, without sudo you...

mysql: Show GRANTs for all users

Nothing built-in. You have two options though: Use common_schema's sql_show_grants view. For example, you can query: SELECT sql_grants FROM common_schema.sql_show_grants; Or you can query for particular users, for example: SELECT sql_grants FROM...

Configurable values to MDB annotations

You can externalise the annotations into the ejb-jar.xml that you deploy in the META-INF of your jar file as follows: <?xml version="1.0" encoding="UTF-8"?> <ejb-jar version="3.0"> <enterprise-beans> <message-driven>...

How do I select which Apache MPM to use?

There are a number of MPM modules (Multi-Processing Modules), but by far the most widely used (at least on *nix platforms) are the three main ones: prefork, worker, and event. Essentially, they represent the evolution of the Apache web server, and the different...

Using var self = this or .bind(this)? [closed]

Things that would favor var self = this; bind isn't supported in IE8 and Safari5. If you aim to build a library or code that supports legacy browsers, then var self = this would be more cross-browser friendly. Sometimes, callbacks are bound to a certain context...

What is the difference between SSL vs SSH? Which is more secure?

SSL and SSH both provide the cryptographic elements to build a tunnel for confidential data transport with checked integrity. For that part, they use similar techniques, and may suffer from the same kind of attacks, so they should provide similar security (i.e....

How can I stop applications and services from running?

First Things First You may have some misconceptions about how Android works and what's really happening when a service is running or an app is in the background. See also: Do I really need to install a task manager? Most apps (e.g., ones you launch manually)...

How do I reset a lost administrative password?

By default the first user's account is an administrative account, so if the UI is prompting you for a password it's probably that person's user password. If the user doesn't remember their password you need to reset it. To do this you need to boot into recovery...

How can I use environment variables in Nginx.conf

From the official Nginx docker file: Using environment variables in nginx configuration: Out-of-the-box, Nginx doesn't support using environment variables inside most configuration blocks. But envsubst may be used as a workaround if you need to generate your...

Difference between .bashrc and .bash_profile

Traditionally, when you log into a Unix system, the system would start one program for you. That program is a shell, i.e., a program designed to start other programs. It's a command line shell: you start another program by typing its name. The default shell, a...

Custom query with Castle ActiveRecord

In this case what you want is HqlBasedQuery. Your query will be a projection, so what you'll get back will be an ArrayList of tuples containing the results (the content of each element of the ArrayList will depend on the query, but for more than one value will...

What is the “You have new mail” message in Linux/UNIX?

Where is this mail? It's likely to be in the spool file: /var/mail/$USER or /var/spool/mail/$USER are the most common locations on Linux and BSD. (Other locations are possible – check if $MAIL is set – but by default, the system only informs you about...

How can I find the implementations of Linux kernel system calls?

System calls aren't handled like regular function calls. It takes special code to make the transition from user space to kernel space, basically a bit of inline assembly code injected into your program at the call site. The kernel side code that "catches" the...

Is a composite index also good for queries on the first field?

It certainly is. We discussed that in great detail under this related question: Working of indexes in PostgreSQL Space is allocated in multiples of MAXALIGN, which is typically 8 bytes on a 64-bit OS or (much less common) 4 bytes on a 32-bit OS. If you are not...