How a long sequence of dots allowed a regex to reach its internal stack limit. Premise Wikipedia’s production error logs were reporting an increase in app crashes from the search results page. The internal Logstash error report looked as follows: [RuntimeException] Cannot consume query at offset 0 (need to go to 7296) at mediawiki/…/CirrusSearch: QueryStringRegexParser->nextToken…| Timo Tijhof
Why does software accept invalid data? And at what software layer should we reject it? Also, what are “namespaces” and “special pages” on Wikipedia? Premise One day, our server monitoring reported a high frequency of fatal errors from web servers: over 10,000 an hour. The majority shared a single root cause – The program…| Timo Tijhof
These are short stories from bug hunts and incident investigations at Wikipedia. Impact After developers submit code to Gerrit, they eagerly await the result from Jenkins, an automated test runner. Every day, during the 15-minute window before 5 PM in San Francisco, code changes submitted for review would mysteriously fail their tests. Jenkins…| Timo Tijhof
These are short stories from bug hunts and incident investigations at Wikipedia. New database partition A user reported a timeout error for certain queries from the Public log viewer on commons.wikimedia.org. Database administrator Manuel Aróstegui investigated the underlying query and found that it was slow (and timing out) due to one of the database replicas…| Timo Tijhof