This was originally posted in this TFI topic about bulletin board code parsing.

Alright, for this I have to introduce you a bit into automata theory and languages. First a couple of definitions:

  • An alphabet in automata theory is a set of symbols (ASCII characters in applications usually) that can be used in the input of your automaton.
  • A language consists of all the strings of symbols that an automaton accepts.

That all sounds very vague probably. Let’s give an example of a finite state automaton (every automaton has states, we have a finite number of them), we have the following automaton:
Automata diagram
The alphabet for this one is: a, b, c (so input streams will only contain those characters, this is only for didactic purposes, in practice the alphabet might be the whole ASCII table).

An automaton consists of states (named q0, q1 and q2 in this case). What’s a state? Well in your program it will probably just be a variable, having the current state as value (you can number them or name them). An automaton has an initial state (q0 in this one) and an accepting state (meaning “this string of symbols is part of our language”), q2 in this case (you can see this because of the additional ellipse around it). The symbols next to the arrows are symbols.

How does it work? Ok, let’s see if the string “acccbc” is in the language defined by this FSA (Finite State Automaton).

  • We start in q0, the first symbol we have to “eat up” is the a, is there an arrow (transition) from q0 which says a? Yes there is, and it leads to state q1.
  • Ok, we’re now in q1 and as input we have cccbc left. Is there a transition from q1 with a c as symbol? Yes there is and it leads back to q1.
  • We’re now in q1 again and as input we have ccbc left. Is there a transition from q1 with a c as symbol? Yes there is and it leads back to q1.
  • We’re now in q1 still and as input we have cbc left. Is there a transition from q1 with a c as symbol? Yes there is and it leads back to q1.
  • We’re now in q1 again and as input we have bc left. Is there a transition from q1 with b as symbol? Yes there is and it leads to q2, an accepting state. Are we done then? No, an input string is only accepted when the last symbol leads to an accepting state.
  • We’re now in q2 and as input we have c left. Is there a transition from q2 with c as symbol? Yes there is, and it leads to q1

Ok, we’re done. We ended up in q1 and as that’s not an accepting state so acccbc is not in the language described by the FSA, it does not match. If we left out the last c: acccb, it would be accepted and would have matched.

You probably wonder what on earth is the use of this. Well, everyone who used regular expressions and probably everyone has, as indirectly used FSA’s. Let’s look at the FSA I just demonstrated, what’s the regex that belongs to that? It’s ac*(bc+)*b. Regular expression are something that came from automata theory as a simple way to describe languages. Every regular expression can be translated to a non-deterministic finite state automaton with lambda transition which on it’s turn can be translated to a non-detereministic finite state automaton (NFA) and a NFA can be tranlated into a finite state automaton. Internally this is what PHP/Perl/Whatever does. Every regular expression you use is first translated into a FSA and then executed. With minor enhancements this is what preg_match (and ereg) do, they match an input string with a regular expression (in practice FSA) and if it matches, return true. preg_replace is a minor modification to that, it can also replace certain sub-strings matching a regex with other expressions.

What happens when you use multiple preg_replaces is that on every run each of them is translated, the whole input is then scanned and actions are taken. On the second one this process is repeated, so if you use 20 preg_replaces your input is scanned 20 times. Many people think regexes do magic by guessing if a certain pattern matches, but of course it’s not, it just scans your input.


Rss Commenti

19 Commenti

  1. Ugh, I hated doing this in university, and I hate it even more when I read it by myself :P. FSA diagrams take all the fun out of regular expressions, ESPECIALLY when you have to make these weird regular expressions which purposely have bad looking FSA diagrams. Good little lesson though Zef, I think you’re more qualified than half of the professors I’ve been taught by ;)

    #1 Parham
  2. Sounds interesting :)

    #2 M
  3. I believe the regex would be ac*(bc+)*b

    #3 Luke Hutteman
  4. Yes indeed, you’re right.

    #4 Zef
  5. um, i like DFA’s and NFA’s but when it comes to this topic “REGULAR EXPRESSION” my mind is not working! i hate dis topic. but im trying to understand it and be familiar with it because its in the curriculum. if i ignore learning this i might get a failing grade… i just hope i can manage learning it zun!

    #5 jiLL of philippines
  6. can u help me find what is the regular expression of “set of string over(0,1) that starts with 1 when converted to binary it is divisible by 5″

    #6 Jan
  7. hi
    it,s better try for the more best ok.

    #7 ali
  8. please help me out in digesting this topic “REGULAR EXPRESSION” please please my mind is not working to understand this specific topic.
    Looking forward
    Regards
    Abdul Waheed

    #8 Abdul Waheed
  9. hey i m finding automata theory a bit difficult. i will be really grateful if u send me some stuff concerning how to get a required regular expression from a given statement. for eg:- give the regular expression of the language containing even no.of 1’s and odd no. of 0’s. hoping fur a positive reply

    #9 raghu
  10. dear sir,
    i am a student of master degree in computer application.i went through ur tutorial.really this is very lucid and self explanatory.so it is very useful.but it is a matter of pity tht,among the whole arena of automata theory and languages u just presented only a handful of tutorial and tht is also not completely. i was expecting a whole bunch of complete stuff frm u with practise questions and answers.waiting for then. …..nihar ranjan

    #10 Nihar
  11. i wont more examples of regular and irregular in diagrams

    #11 mohd
  12. dear sir,
    i need regular expressions for the following statements:
    1…A language consists of alphabets {a,b} and all strings in which the total number of “a’s” is divisible by 3 no matter how they are distributed, i.e. aabaabbbab
    2…A language consists of a set of alphabets {a,b} and in which all the strings that have an even number of a’s and an odd number of b’s.
    looking forward for your reply..
    regards..

    #12 sharry
  13. sir i have visited site and found in very helpful to me.. its a gr8 effort. can u plz help me regarding the ”Finite automata” of identifeirs,comments,float constant [,expression,… in compiler construction.. can u plz mail me the link of FAs or send me the notes i will be very thank full 2 u..bye

    #13 shiraz
  14. Hello sir,
    It is a super explanation of FSA. I am a univerisity student and FSA is one of the topic for a module and i couldn’t understand it well so i was searching about FSA on the net and finally i found out this website!

    many mnay thanks sir

    #14 suyanthan
  15. Good one indeed.

    Regards
    vijay

    #15 Vijay Kumar
  16. I have to say this was my worst subject at uni. The first was discrete maths :(

    #16 rusty
  17. can u help me out in this subject?im a student of BCA.i can’t understand turing machine.

    #17 Anuradha
  18. It’s a very interesting thing. I like Theory of Automata and all such stuff (Regular Expressions, FSA etc). You are doing a great job.

    #18 Mohammad Aazam
  19. 1…A language consists of alphabets {a,b} and all strings in which the total number
    of “a’s” is divisible by 3 no matter how they are distributed, i.e. aabaabbbab
    2…A language consists of a set of alphabets {a,b} and in which all the strings that
    have an even number of a’s and an odd number of b’s.

    1ans: ((a+b)(a+b)(a+b))*
    Is it satisfies??????

    #19 jyothsna

Sorry, the comment form is closed at this time.