
masfuerte, today at 1:18 PM

As originally written, doesn't it go from the start of the first match to the end of the last match? I feel like I'm missing something.



ieviev, today at 1:28 PM

It goes from the start of the first match to the longest "alive" end. In practice it will reach a dead state and return after finding the match end.

there's an implicit `.*` in front of the first pass but i felt it would've been a long tangent so i didn't want to get into it.

so given input 'aabbcc' and pattern `b+`,

first reverse pass (using `.*b+`) marks 'aa|b|bcc'<-

the forward pass starts from the first match:

'aa->b|b|cc' marking 2 ends

then enters a dead state after the first 'c' and returns the longest end: aa|bb|cc
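as a rough sketch of the walk-through above, here are both passes hand-written for the single pattern `b+` (no real regex compilation, and the function names are made up for illustration, not taken from the engine):

```python
def reverse_pass_starts(text):
    """Right-to-left scan standing in for the reversed `.*b+` automaton.

    Returns every index where a `b+` match can start, in the order the
    scan sees them (rightmost first).
    """
    starts = []
    for i in range(len(text) - 1, -1, -1):
        # For `b+` the reverse automaton is accepting exactly after a 'b'.
        if text[i] == "b":
            starts.append(i)
    return starts


def forward_pass_ends(text, start):
    """Left-to-right scan from `start`, collecting every accepting end.

    Stops as soon as the automaton is dead (first non-'b' character).
    """
    ends = []
    for i in range(start, len(text)):
        if text[i] != "b":
            break  # dead state: no longer match is possible
        ends.append(i + 1)  # exclusive end offset of a match
    return ends


def first_match(text):
    """Leftmost start from the reverse pass, longest end from the forward pass."""
    starts = reverse_pass_starts(text)
    if not starts:
        return None
    start = min(starts)
    return (start, max(forward_pass_ends(text, start)))


print(reverse_pass_starts("aabbcc"))   # [3, 2]  -> 'aa|b|bcc'
print(forward_pass_ends("aabbcc", 2))  # [3, 4]  -> two ends marked
print(first_match("aabbcc"))           # (2, 4)  -> 'aa|bb|cc'
```

a real engine would drive both loops from DFA transition tables instead of comparing against 'b' directly, but the shape of the two passes is the same.
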

i hope this explains it better

mananaysiempre, today at 2:37 PM

Right, the explanation seems to be a bit oversimplified, but I don’t think it’s difficult to fix it up: you need to collect non-overlapping starts (with an RTL scan) and ends (with an LTR scan) and zip them together. The non-overlapping matches are the last ones you see before you need to reset the matcher (traverse a failing edge). This feels like it should work.
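For the toy pattern `b+` only, and ignoring zero-length matches entirely, the zip idea might look roughly like the sketch below. The helper names are hypothetical and the "automaton" is just a comparison against 'b', so this is an illustration of the idea rather than a general implementation.

```python
def nonoverlapping_starts(text):
    """RTL scan: commit the last start seen before each reset."""
    starts = []
    candidate = None
    for i in range(len(text) - 1, -1, -1):
        if text[i] == "b":
            candidate = i  # still alive, the match keeps extending leftwards
        elif candidate is not None:
            starts.append(candidate)  # failing edge: commit and reset
            candidate = None
    if candidate is not None:
        starts.append(candidate)  # match runs up to the start of input
    starts.reverse()  # report starts in left-to-right order
    return starts


def nonoverlapping_ends(text):
    """LTR scan: commit the last end seen before each reset."""
    ends = []
    candidate = None
    for i, ch in enumerate(text):
        if ch == "b":
            candidate = i + 1  # exclusive end offset, still alive
        elif candidate is not None:
            ends.append(candidate)  # failing edge: commit and reset
            candidate = None
    if candidate is not None:
        ends.append(candidate)  # match runs up to the end of input
    return ends


def find_all(text):
    """Pair the k-th start with the k-th end."""
    return list(zip(nonoverlapping_starts(text), nonoverlapping_ends(text)))


print(find_all("aabbccbbb"))  # [(2, 4), (6, 9)]
```
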

(I tried to write some pseudocode here but got annoyed dealing with edge cases like zero-length matches at EOF, sorry.)