You expression is almost correct, the use of
(2|3) will cause issues, as these will match their content, but not their syntax. So certain examples should fail:
11 can’t be substituted by
3, as the capture is empty at that point in time. So you’ll have to duplicate the syntax:
Then there is some additional optimization you can do. Instead of repeating each internal group 0..* times and the out group 1..* times, you can remove the inner repetition and change the outer one to 0..*:
Further simplification doesn’t seem possible without resorting to tricks such as zero-width assertions, basically ensuring that you end up with an even number of 1’s and and even number of 0’s and ensuring that both are true:
Cheating a bit further, we already know by the capturing expression that we have an even number of 0’s, so we can replace the look-ahead to just ensure we’re on an even number of characters, there are no other characters to worry about:
It’s slightly shorter and notice that the first part of the expression is zero-width, the second part is not to ensure that we actually end up with a capture.
The final trick, completely moving away from a DFA, is to use balancing groups, since we’re balancing 1’s and 0’s we know we’re even when every odd 1 and 0 has been balanced by its even counterpart. It’s not really simpler, but it does work and only needs one pass, like your example:
There is an algorithm to convert a DFA into a regular expression.
A good explanation is given by in the StackExchange answer: How to convert finite automata to regular expressions?
Here is a demonstration of how it applies to this problem:
We start with this DFA:
and we first remove the OO node yielding:
Next, remove, say, the OE node:
Finally, remove the EO node. The paths from EE to itself are:
r1: 00 r2: 0(11)*0 r3: (0(11)*10 | 1) (00 | 01(11)*10 )* (1 | 01(11)*0 ) _ to EO _/ _ around EO _/ _ back to EE _/
The final regex is:
(r1 | r2 | r3)*.
This problem is easier to solve with a lookahead combination than with a “match all possibilities” combination. Consider two zero-width lookaheads before the first character. One ensures that each 1 has a following matching 1, and the other ensures that each 0 has a matching 0. if you can guarantee that there are only 1 and 0 characters in the code then you’re set.
So, your issue here is that you are trying to count mixed up bit values, but separating them makes the logic much easier:
Note, there are two lookaheads, and combining them like this essentially makes them two “and” conditions.
The first is:
- find a 0-bit pair that has some number (perhaps 0) 1 bits before them
- allow that pattern to repeat as many times as needed (perhaps 0)
- then allow there to be other 1 bits after them to the end of string.
This ensures that there are an even number of 0 values, and all other values are 1’s.
The second regular expression is the opposite, it ensures there’s an even number of 1 bits, and the other bits are 0.
Combine them together and you can count them easily….
See the pattern working in Java here.
Note that my preference solution would simply be to ensure there are an even number of characters in total (a
length % 2 == 0) and then to just ensure an even number of the 0-bits:
use that as:
return (text.length % 2 == 0) && text.matches("^(1*01*0)*1*$")
That eliminates all the lookaheads and other magic.