Regular Expressions in Scala


A regular expression is a sequence of characters that forms a pattern used to match character combinations in a string. Regular expressions are very useful when dealing with textual data: they can extract information from a string and drive find-and-replace operations. I used regular expressions extensively during my thesis project at the Indian Institute of Technology, Delhi.

Some of the use cases where I have used regular expressions are: extracting required information, such as a person's name and date of birth, from data scraped from the internet; checking whether a date is in the required format; filtering unwanted data out of text; and extracting relationships among entities based on patterns. For all these use cases I used Python. A few weeks back, I was asked to write a REPL interpreter in Scala to evaluate a given mathematical expression. This blog post covers how to create such a REPL interpreter using Regex in Scala.

The requirements of the assignment were as follows:

  1. Evaluate a mathematical expression: 1 + 2 * n + 3 should return 12 if we assume n = 4. I used regex to extract the operators, operands and variables, and the shunting-yard algorithm to evaluate the expression.
  2. Variable Assignment: An expression like x = 2 * 4 + m should evaluate the expression at RHS and assign the answer to the variable at LHS (x in this case).
  3. Simplify a mathematical expression: If an expression starts with @, simplify the mathematical expression using the following properties:
    1. Distributive property: (a x b) + (a x c) -> a x (b + c)
    2. 0 + e -> e
    3. e + 0 -> e
    4. 1 x e -> e
    5. e x 1 -> e
    6. e x 0 -> 0
    7. 0 x e -> 0
    To detect which property applies and to simplify the expression, I used regex.
In the rest of the post, I talk about how I used regular expressions to meet the above requirements.
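
As a rough illustration of requirement (1), the sketch below tokenizes an expression with a regex and evaluates it with a small shunting-yard loop. The names ExpressionEvaluator, tokenize and evaluate are hypothetical and are not taken from the original code. It assumes non-negative integer literals, the binary operators + - * /, and that every variable appears in the supplied map.

```scala
import scala.util.matching.Regex

object ExpressionEvaluator {
  // A token is a number, a variable name, or a single operator/parenthesis.
  private val Token: Regex = raw"\d+|[A-Za-z_]\w*|[-+*/()]".r
  private val Number: Regex = raw"\d+".r
  private val Name: Regex = raw"[A-Za-z_]\w*".r

  private val precedence = Map("+" -> 1, "-" -> 1, "*" -> 2, "/" -> 2)

  def tokenize(input: String): List[String] = Token.findAllIn(input).toList

  def evaluate(input: String, env: Map[String, Int]): Int = {
    var values = List.empty[Int]  // operand stack, head is the top
    var ops = List.empty[String]  // operator stack, head is the top

    // Pop one operator and two operands, then push the result.
    def applyTop(): Unit = (ops, values) match {
      case (op :: restOps, right :: left :: restValues) =>
        val result = op match {
          case "+" => left + right
          case "-" => left - right
          case "*" => left * right
          case "/" => left / right
        }
        ops = restOps
        values = result :: restValues
      case _ =>
        throw new IllegalArgumentException(s"Malformed expression: $input")
    }

    tokenize(input).foreach {
      case tok @ Number() => values = tok.toInt :: values
      case tok @ Name()   => values = env(tok) :: values // fails if the variable is unknown
      case "("            => ops = "(" :: ops
      case ")" =>
        while (ops.nonEmpty && ops.head != "(") applyTop()
        ops = ops.drop(1) // discard the matching "("
      case op =>
        // Apply pending operators of equal or higher precedence first.
        while (ops.nonEmpty && ops.head != "(" && precedence(ops.head) >= precedence(op)) applyTop()
        ops = op :: ops
    }
    while (ops.nonEmpty) applyTop()
    values.head
  }
}
```

With n = 4, ExpressionEvaluator.evaluate("1 + 2 * n + 3", Map("n" -> 4)) gives 12, matching the example in requirement (1). Unary minus and decimals are left out to keep the sketch short.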

Regular Expression in Scala

Scala has a built-in class for regular expressions, Regex, defined in the package scala.util.matching. An instance of Regex represents a compiled regular expression pattern. Compiling a regular expression is a relatively expensive operation, so it is recommended to define frequently used regexes once, outside any loop.
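
A minimal sketch of the "define once, reuse many times" advice is shown here. The object DateCheck, the pattern IsoDate and the method countDates are made-up names for illustration, and the yyyy-mm-dd date shape is an assumption.

```scala
import scala.util.matching.Regex

object DateCheck {
  // Compiled once when the object is initialised, then reused on every call.
  private val IsoDate: Regex = raw"\d{4}-\d{2}-\d{2}".r

  // Counts the lines that contain something shaped like yyyy-mm-dd.
  def countDates(lines: Seq[String]): Int =
    lines.count(line => IsoDate.findFirstIn(line).isDefined)
}
```

Writing raw"...".r inside the count lambda would give the same result, but the pattern would be recompiled for every line.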

Defining Regular expression

A Regex object can be created from a string by calling the r method, which is available on any String through an implicit conversion. To meet the above requirements, I created multiple lists of regexes. Let us consider requirement (3), simplifying expressions using the mathematical identities.

Given an expression, the problem is to identify whether it matches the left hand side of any of the mathematical identities. I solved the problem by defining a regular expression for the left hand side of each identity. Regular expression for the LHS of the distributive property:

  • raw"\( (\d+) \* (\d+) \) \+ \( (\d+) \* (\d+) \)".r("f11", "f12", "s11", "s12")
    The string is converted to a Regex object using the r method.
Regular expressions for operations returning the element itself, i.e., where a number is multiplied by 1 or where 0 is added to a number:
  • raw"^(0 \+ (\d+))".r
  • raw"^((\d+) \+ 0 )".r
  • raw"^((\d+) \* 1 )".r
  • raw"^(1 \* (\d+))".r
  • raw"( 0 \+ (\d+))".r
  • raw"( (\d+) \+ 0 )".r
  • raw"( (\d+) \* 1 )".r
  • raw"( 1 \* (\d+))".r
Regular expression for operations where an element is multiplied by 0, so that the result is 0:
  • raw"(0 \* \d+)|(\d+ \* 0)".r
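
To see the distributive pattern in action, here is a small sketch that uses the same regex as an extractor and reads the four operands by position. DistributiveMatch and operands are hypothetical names. It uses positional groups rather than the named form r("f11", ...), because, as far as I am aware, Scala 2.13.7 and later deprecate that overload in favour of inline names such as (?<f11>\d+).

```scala
import scala.util.matching.Regex

object DistributiveMatch {
  // Same pattern as above: "( a * b ) + ( c * d )" with single spaces as separators.
  val Distributive: Regex = raw"\( (\d+) \* (\d+) \) \+ \( (\d+) \* (\d+) \)".r

  // Returns the four operands when the whole input has the expected shape.
  def operands(input: String): Option[(Int, Int, Int, Int)] = input match {
    case Distributive(a, b, c, d) => Some((a.toInt, b.toInt, c.toInt, d.toInt))
    case _                        => None
  }
}
```

A pattern match like this succeeds only when the entire string matches, so it suits validation. For searching inside a longer expression, findFirstMatchIn or replaceAllIn is the better fit.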

Decoding symbols used in above regex-

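  • \d matches a single digit from 0 to 9.
  • * is a quantifier meaning 0 or more occurrences of the pattern preceding it. In the regexes above it is always escaped as \*, so it matches the literal multiplication sign.
  • + is a quantifier meaning 1 or more occurrences of the pattern preceding it, as in \d+. When escaped as \+, it matches the literal plus sign.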
  • Parentheses () are used to define groups in a regular expression. The groups can be named and extracted. In the distributive property regex above, I have defined 4 groups and named them f11, f12, s11 and s12. f11 represents the first operand of the first part of the expression, f12 the second operand of the first part, and so on.
  • | (pipe character) represents OR.
  • ^ anchors the match at the start of the input.
  • \ is used to escape special characters in the regular expression. For example, () are used for grouping in regex, but I also expect them to appear in the mathematical expression, so I have escaped the parentheses with \ wherever they should be matched literally.
  • Note the spaces in the regexes. A space acts as the separator between operators and operands in the expression.
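
The difference between + as a quantifier and \+ as a literal character can be checked with a tiny sketch. EscapeDemo and its members are illustrative names only.

```scala
object EscapeDemo {
  // Unescaped: + is a quantifier, so this matches one or more digits.
  val digits = raw"\d+".r
  // Escaped: \+ matches the '+' character itself, between two single digits.
  val literalPlus = raw"\d \+ \d".r

  def firstNumber(s: String): Option[String] = digits.findFirstIn(s)
  def firstSum(s: String): Option[String] = literalPlus.findFirstIn(s)
}
```

The raw interpolator keeps the backslashes as written, so there is no need to double them as in an ordinary string literal.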

Simplifying the mathematical expression using methods of the Regex class

After defining regular expressions for all the mathematical identities, the next step is to simplify the expression by replacing each match with the RHS of the corresponding identity. The identities fall into 3 categories: the distributive identity, identities returning the number itself, and identities returning zero. I used the following methods for simplifying the expressions:

  1. replaceAllIn
    replaceAllIn(target: CharSequence, replacer: (Match) ⇒ String): String
  2. group method of Match class
  3. subgroups method of Match class
The replaceAllIn function searches for the regex in the input and replaces each matched part with the result of the second argument, a function that maps the Match object to the replacement string.
For the identities returning the number itself, I put all 8 regular expressions stated above in a list and applied each of them to the input string in turn. If a regex finds a match in the input expression, the match is replaced by the RHS of the identity using the subgroups method of the Match object. As you may have noticed, I added parentheses around \d+ in all 8 regexes, which makes it a group. The subgroups method returns all the captured groups as a list, excluding the full match (group 0). Each of these regexes has 2 groups, one around the full expression and another around \d+, so when a regex finds a match I replace the matched string with the subgroup at index 1. The same result can be obtained with the group method, whose group numbers start at 1, by calling group(2).
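
The indexing difference between group and subgroups is easy to get wrong, so here is a short sketch using the first identity regex. GroupDemo and inspect are hypothetical names.

```scala
import scala.util.matching.Regex

object GroupDemo {
  val AddZeroLeft: Regex = raw"^(0 \+ (\d+))".r

  // Returns (whole match, all capturing groups, inner operand) for the first match.
  def inspect(input: String): Option[(String, List[String], String)] =
    AddZeroLeft.findFirstMatchIn(input).map { m =>
      // group(0) is the whole match; subgroups starts at group 1,
      // so subgroups(1) and group(2) both refer to the inner (\d+).
      (m.group(0), m.subgroups, m.group(2))
    }
}
```

For the input "0 + 42", subgroups is List("0 + 42", "42"), so subgroups(1) and group(2) both yield "42".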
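Following is the code snippet for simplifying the expression using the identities that return the element itself:

    // listOfRegex holds the 8 identity regexes defined above
    var expression = <input>
    listOfRegex.foreach { regex =>
      // subgroups(1) is the inner (\d+) group, i.e. the operand to keep
      expression = regex.replaceAllIn(expression, m => m.subgroups(1))
    }
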
I used similar logic to further simplify the expression using the distributive and zero identities. The search-and-replace loop continues until the expression cannot be simplified any further.
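
The "repeat until nothing changes" idea can be sketched as below. This is not the original implementation: Simplifier, rules and simplify are hypothetical names, and the patterns are a variant of the ones listed earlier that uses \b word boundaries instead of leading and trailing spaces (an assumption on my part, so that 10 * 5 is not mistaken for 0 * 5).

```scala
import scala.util.matching.Regex
import scala.util.matching.Regex.Match

object Simplifier {
  private val keepOperand: Match => String = m => m.group(1)
  private val toZero: Match => String = _ => "0"
  // ( a * b ) + ( a * c ) -> a * ( b + c ), only when both first operands agree.
  private val factorOut: Match => String = m =>
    if (m.group(1) == m.group(3)) s"${m.group(1)} * ( ${m.group(2)} + ${m.group(4)} )"
    else m.matched

  // Each rule pairs a pattern with the function that builds its replacement.
  private val rules: List[(Regex, Match => String)] = List(
    raw"\( (\d+) \* (\d+) \) \+ \( (\d+) \* (\d+) \)".r -> factorOut,
    raw"\b0 \* \d+|\d+ \* 0\b".r -> toZero,
    raw"\b0 \+ (\d+)".r -> keepOperand,
    raw"(\d+) \+ 0\b".r -> keepOperand,
    raw"\b1 \* (\d+)".r -> keepOperand,
    raw"(\d+) \* 1\b".r -> keepOperand
  )

  // Apply every rule once, then repeat until a full pass changes nothing.
  def simplify(input: String): String = {
    val next = rules.foldLeft(input) { case (expr, (regex, rewrite)) =>
      regex.replaceAllIn(expr, rewrite)
    }
    if (next == input) input else simplify(next)
  }
}
```

Because the rules rewrite plain text, they know nothing about operator precedence. The sketch behaves as expected on simple inputs such as "0 + 7 * 1", but a chain like "2 * 0 * 0 + 3" can be rewritten incorrectly, so treat it as an illustration of the loop rather than a complete simplifier.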

Dealing with variables in the expression

I used a map to store the values of variables. For an expression that assigns a value to a variable, the idea is to evaluate the RHS and then update the value of the variable in the map. For an expression in which a variable is part of the RHS, I iterated through the map and replaced the variable names with the values stored in the map. The find-and-replace logic for variables is similar to the one used for simplifying the expression with regular expressions.
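
A possible shape for the variable handling is sketched below. It differs slightly from the description above: instead of iterating over the map, it matches every identifier with one regex and looks it up. The names Variables, substitute and splitAssignment are hypothetical, as is the assumption that values are integers.

```scala
import scala.util.matching.Regex

object Variables {
  private val Name: Regex = raw"[A-Za-z_]\w*".r
  private val Assignment: Regex = raw"\s*([A-Za-z_]\w*)\s*=\s*(.+)".r

  // Replace every known variable with its stored value; unknown names stay as they are.
  def substitute(expression: String, env: Map[String, Int]): String =
    Name.replaceAllIn(expression, m => env.get(m.matched).map(_.toString).getOrElse(m.matched))

  // Split "x = <rhs>" into the variable name and the right hand side.
  def splitAssignment(line: String): Option[(String, String)] = line match {
    case Assignment(name, rhs) => Some((name, rhs))
    case _                     => None
  }
}
```

For x = 2 * 4 + m with m stored as 3, splitAssignment yields the name x and the RHS 2 * 4 + m, and substitute turns that RHS into 2 * 4 + 3, ready to be evaluated and stored back under x.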

You can find the complete code here on GitHub.

That's it for today. Thanks for reading, I hope this article is helpful!