One of the most powerful things DataFox can do is let our users search, filter, and sort on a combination of their proprietary data and our own constantly growing company database. Users can construct complex queries such as: find me all companies in San Francisco whose headcount is between 5 and 5,000, that have recently had a product launch, and to which I have attached a note to “reach out”. We want to deliver this result (or at least the first 50 matches) in under half a second. Hiding behind that query is the complexity of keeping our database up to date with their data and maintaining extensive indexes to avoid table scans, no matter how ugly the query is.
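
To make that concrete, a query along those lines might look roughly like the sketch below against Solr's HTTP select handler. The field names (`location`, `headcount`, `last_product_launch`, `note_text`) are illustrative placeholders, not our actual schema:

```python
import requests

# Hypothetical field names -- illustrative only, not our real schema.
params = {
    "q": 'note_text:"reach out"',                    # the note the user attached
    "fq": [
        'location:"San Francisco"',                  # city filter
        "headcount:[5 TO 5000]",                     # range filter
        "last_product_launch:[NOW-30DAYS TO NOW]",   # recent product launch
    ],
    "rows": 50,                                      # first 50 matches
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/companies/select", params=params)
print(resp.json()["response"]["numFound"])
```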

Back in 2014, when DataFox was starting out and we had some foresight that search was going to be an integral part of our product, we picked Solr for our search infrastructure. We started with Solr 4 and a simple master/slave deployment. We took advantage of features like faceting, wildcard fields, spell check, result relevancy, geo-spatial search, joins, and near real time (NRT) search.

Solr is at the center of our backend processing and also our customer workflows. When a customer adds a company to a list, Solr needs to reflect this change as soon as possible. (We don’t want to explain “eventual consistency” to the user when they refresh the page and the company suddenly disappears.) But aha! We found that you can send a soft commit with the update to Solr and the update is processed immediately thanks to Solr NRT. This solved our immediate need; we moved on and promptly forgot the details of this fix.
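
A minimal sketch of what that looked like, assuming a `companies` core, Solr's JSON update handler, and a placeholder `lists` field; `softCommit=true` is what made the change visible to searchers right away (and, as we learned later, what hurt us at scale):

```python
import requests

# Atomic update: add a list membership to a company document.
# "lists" is a placeholder multi-valued field, not our real schema.
doc = {"id": "company-123", "lists": {"add": "list-456"}}

# softCommit=true opens a new searcher immediately (NRT) without flushing segments to disk.
resp = requests.post(
    "http://localhost:8983/solr/companies/update",
    params={"softCommit": "true", "wt": "json"},
    json=[doc],
)
resp.raise_for_status()
```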

As our data grew, we started to notice application searches timing out whenever the backend was running a large number of queries (we started making use of EC2 autoscaling and it became pretty easy to DDoS ourselves). We realized that we needed to isolate our customers from our backend load, so we created another replica intended just for the backend. We added logic to our search code to send app requests to one server and backend requests to another. One backend replica became four, and on we went.
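
The routing logic itself was nothing fancy; something along these lines, where the host names and the `origin` flag are made up for illustration:

```python
import random

# Placeholder hosts -- one replica reserved for customer-facing traffic,
# a pool of replicas for backend jobs.
APP_REPLICA = "http://solr-app.internal:8983/solr"
BACKEND_REPLICAS = [
    "http://solr-backend-1.internal:8983/solr",
    "http://solr-backend-2.internal:8983/solr",
]

def solr_base_url(origin: str) -> str:
    """Send app requests to the app replica, everything else to the backend pool."""
    if origin == "app":
        return APP_REPLICA
    return random.choice(BACKEND_REPLICAS)
```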

There was another case where we got to stretch our creative muscles. We knew enough to know that joins in Solr are problematic – slow and memory-hungry – so we avoided them as much as possible. However, we needed to let the user filter a list of companies by when each company was added to that list. Our clever solution? Create a wildcard field on the company core of the form date_added_*: <Date>. Now, when a user wanted to filter by a list, our query on the company core simply filtered on date_added_<list-id>. And on we went! The implementation was straightforward and fit in well with the rest of our filtering.
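
In Solr terms this relied on a dynamic field, roughly as sketched below (the schema line is shown as a comment and the list ID is a placeholder). Every list a company belonged to became another key on the company document, which is exactly what came back to bite us later:

```python
import requests

# Schema side (schema.xml / managed-schema), roughly:
#   <dynamicField name="date_added_*" type="date" indexed="true" stored="true"/>

list_id = "5a1b2c3d"  # placeholder list ID

# Filter a list's companies by when they were added to that list.
params = {
    "q": "*:*",
    "fq": f"date_added_{list_id}:[2018-01-01T00:00:00Z TO *]",
    "rows": 50,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/companies/select", params=params)
```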

So that’s where we were at one point. Our company core was getting hundreds of writes a second, and on a large portion of those we would force a “soft commit”. At the same time our documents were growing in size: the most frequently indexed and queried docs had around 2,000 keys, and some documents had 10,000 keys, since one company can easily be in thousands of lists. It’s honestly impressive how long Solr tolerated us.

The application was slowing down, eventually to the point where the same searches would take 10-20 seconds to run if you got unlucky. We were in a state of emergency. We had not been monitoring performance and by the time we acknowledged this problem, our users were knocking on our doors with pitchforks. We needed a fix and we needed it fast.

The most tempting option was to throw more hardware at the problem, so of course we tried that first. We jumped from 30G to 60G RAM boxes. Sadly, that didn’t help that much.

I bet you can guess the next thing we tried: caching! We were reluctant to add it for the usual reason: cache invalidation. Given the many ways in which the companies core gets updated, we knew we would be playing whack-a-mole for the rest of eternity to ensure we correctly reflected user changes while still utilizing the cache as much as possible. We picked a design and put a caching layer in place. The impact was tremendous, and blaming caching became a running joke, to the point where our customer success team started putting it in bug summaries whenever anything didn’t add up.
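
The shape of the caching layer, very roughly, was a Redis cache keyed on a hash of the normalized query; the key scheme, TTL, and helper names below are illustrative, not our production code:

```python
import hashlib
import json

import redis

r = redis.Redis()
CACHE_TTL_SECONDS = 300  # illustrative TTL

def cache_key(query_params: dict) -> str:
    # Key the cache on a stable hash of the normalized query.
    blob = json.dumps(query_params, sort_keys=True)
    return "search:" + hashlib.sha1(blob.encode()).hexdigest()

def cached_search(query_params: dict, run_solr_query) -> dict:
    key = cache_key(query_params)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_solr_query(query_params)
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result

# The hard part: every write path that touches the companies core has to
# invalidate (or knowingly tolerate) stale entries -- that's where the
# whack-a-mole lived.
```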

We also implemented a project that I liked to call “Predict the Future”: we preloaded tabs for the user by issuing the requests ahead of time. Those requests would get cached, and by the time the user actually navigated to one of those tabs, the response was nearly instant because we served the cached result instead of trying our luck with Solr. <3 redis.
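
The prefetch itself amounted to firing the same requests the UI would issue, purely to warm the cache; a sketch under the assumption that a `fetch_tab` helper goes through the cached search path above (the tab names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tabs a user is likely to open next from a company page.
PREFETCH_TABS = ["news", "signals", "similar_companies"]

def prefetch_company_tabs(company_id: str, fetch_tab) -> None:
    """Issue the tab requests ahead of time so the results land in the cache."""
    with ThreadPoolExecutor(max_workers=len(PREFETCH_TABS)) as pool:
        for tab in PREFETCH_TABS:
            pool.submit(fetch_tab, company_id, tab)
```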

Caching bought us some breathing room to work on the lower-level problems. Looking closer at our Solr logs, we found that Solr was actually quite happy to tell us what was wrong: the logs were littered with “Performance Warning: Overlapping onDeckSearchers”. We did some reading about soft commits, hard commits, and the transaction log, and realized how much we were abusing Solr. We stopped issuing commits from the application and slowly did all we could to remove the requirement for data to show up “instantly” in the app. More things moved to the background, and we could control the rate of indexing. This wasn’t the end of our performance problems, but the 75th percentile dropped substantially. And we no longer forced the user to wait, staring at the dreaded loading icon, while we processed their list import, for example.
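
Concretely, the fix was to stop passing commit parameters from application code and let Solr's autocommit settings in solrconfig.xml decide when new searchers open. The values below are illustrative, not our production settings:

```python
import requests

# solrconfig.xml (illustrative values):
#   <autoCommit>
#     <maxTime>60000</maxTime>          <!-- hard commit every 60s, for durability -->
#     <openSearcher>false</openSearcher>
#   </autoCommit>
#   <autoSoftCommit>
#     <maxTime>10000</maxTime>          <!-- new searcher at most every 10s -->
#   </autoSoftCommit>
#
# With that in place, application updates no longer ask for a commit at all:
resp = requests.post(
    "http://localhost:8983/solr/companies/update",
    params={"wt": "json"},  # note: no commit/softCommit parameters
    json=[{"id": "company-123", "lists": {"add": "list-456"}}],
)
resp.raise_for_status()
```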

Around this time we also started performance testing our Solr setup. Mainly, we were trying to figure out what rate of company indexing we could sustain without adversely impacting search query times. The hope was to put a rate limiter in place so we wouldn’t overwhelm Solr with too much indexing. We removed customer data for security reasons during testing, which had the side effect that our test documents had only 100 keys, while in production they had 2,000. Our testing clone did have the same number of documents, which we thought was the more crucial variable. In this more isolated environment, Solr performed great! We could easily index 500 docs/s and still get search query times of 20ms. This was inspiring but also confusing, because production was still drowning and we weren’t even close to pushing 500 docs/s. Through some luck and some process of elimination, we stumbled on the keys observation and realized the only way out of our performance problem was to remove our clever use of Solr wildcard fields for tracking when a company got added to a list. After some negotiation with our PMs and designer, and a several-month project later, our doc sizes were down to a much more moderate 200 keys and we could tolerate indexing at 300 docs/s. We switched to storing the join table between companies and lists in Postgres.
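
The rate limiter itself can be as simple as a token bucket in front of the indexer; the sketch below is illustrative, with the 300 docs/s budget being the figure we eventually landed on:

```python
import time

class TokenBucket:
    """Simple token bucket to cap how many docs per second we push to Solr."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self, n: int = 1) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

bucket = TokenBucket(rate_per_sec=300, burst=300)  # ~300 docs/s indexing budget

def index_batch(docs, send_to_solr):
    for doc in docs:
        bucket.acquire()
        send_to_solr(doc)
```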

Mostly, our performance problems were under control. That is to say, the 50th percentile was looking healthy, but the 90th and 99th percentiles were still catastrophic: somewhere between 1 in 10 and 1 in 100 searches would take 5+ seconds to respond.

The next low-hanging fruit was to isolate the company core further: if our other cores were being reindexed, we absolutely did not want that to impact company core performance. The only tricky part was that we did have a few Solr joins, and Solr joins only work if the cores are on the same machine. So we came up with the following architecture:

This meant that our application code now had to know a lot about Solr internals. We had complicated branches around the type of query we were doing and which server was equipped to handle it. Managing the Solr configuration became a headache. We effectively had 5 different Solr boxes, and each core needed to know its master. Every time one of the masters went down, we had a serious surgical operation on our hands to promote one of the slaves to master and clean up the cores. This type of recovery took a minimum of 30 minutes, during which our caching layer helped hide the fact that we were down, but plenty of requests still failed. We also had a number of bugs where the Solr configuration didn’t propagate to all the servers or didn’t take effect after a config change. (According to all the docs, a schema change only requires a core reload, but for whatever reason our Solr was special and never incorporated the schema change unless you restarted Solr.) There was about a 5% chance of something going catastrophically wrong any time we had a schema change, and I would hold my breath every time we had to push one.

This configuration was not ready for scale. We had one core, let’s call it our Text core, that was already 60GB and was going to be 80GB by the end of the year. That core sat on a server with 5 more cores, and the total index size of all the cores added up to 80GB. We could isolate the Text core onto its own server, but that would increase the branching factor in our application. The other piece that didn’t sit well with us was that any time we pushed a schema change, we would be partially down for 30 minutes. We chose those windows carefully, but it was hurting our development velocity.

We decided to move to SolrCloud. The big advantage of SolrCloud is that it’s designed for scale: it supports sharding and figures out how to route queries to the right node. We also wanted to remove the engineering overhead of managing the Solr configuration and the manual intervention required when a server had an issue. And hopefully schema changes would finally work with just a core reload instead of a full Solr restart.
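
In SolrCloud terms, the equivalent of our old cores became sharded, replicated collections created through the Collections API; the collection name, shard counts, and config set name below are illustrative:

```python
import requests

# Create a sharded, replicated collection (names and counts are illustrative).
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "companies",
        "numShards": 2,
        "replicationFactor": 2,
        "collection.configName": "companies_conf",  # config set stored in ZooKeeper
        "wt": "json",
    },
)
resp.raise_for_status()
```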

The one serious drawback was that we had to rethink joins in our application. For years that was the one sticking point about SolrCloud (SolrCloud doesn’t support joins in a distributed environment) that prevented us from migrating. However, since those initial discussions, we have become a much more performance-driven culture, and absolutely no one wanted to go back to the days when our application would show the spinning wheel of death for 10-20 seconds. We removed functionality that required joins, which was a tremendous effort by our PM team. The joins we absolutely had to keep, we implemented in memory, in a way that we could control and scale.
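
By “in memory” we mean joins done in application code rather than in Solr: fetch the IDs on one side, then intersect with the search results from the other. A rough sketch, with `list_service` and `solr_search` standing in for our internal clients:

```python
def in_memory_join(list_service, solr_search, list_id: str, filters: dict) -> list:
    """Join "companies in list X" against a Solr search, in application code."""
    # Side 1: company IDs in the list (e.g. from Postgres), held in memory as a set.
    member_ids = set(list_service.company_ids(list_id))

    # Side 2: the Solr results for the user's other filters.
    results = solr_search(filters)

    # The "join": intersect in memory instead of asking Solr to do it.
    return [doc for doc in results if doc["id"] in member_ids]
```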

We moved to the cloud, and I wouldn’t say it’s life-changing yet, but I do sleep better at night knowing that Solr can repair itself. Early on in our migration we somewhat unintentionally tested the cloud’s resilience. We started with a lean SolrCloud setup of 4 nodes, sent more and more traffic to it, and at some point started seeing timeout errors. Replica shards would become unavailable, but a few minutes later the node would catch up and come back online. Sometimes a leader would go down and shard leader elections would happen. It wasn’t stable, and certainly some requests from the application failed, but it was also not catastrophic to the point where a human had to intervene if we were to have any hope of recovery. For example, it was not the dreaded “Index locked for write” error that we saw all too often in the Solr master/slave architecture.

As promised by SolrCloud, we were able to add nodes to our cluster in a streamlined way, without having to tell our application how shards were now distributed. We were able to move shards around using the Collections API, and growing from 4 to 7 nodes fixed our timeout issues. We are a lot more confident that Solr will scale horizontally as our data grows.
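
Adding capacity again goes through the Collections API; for example, placing an extra replica of a shard on a newly added node looks roughly like this (collection, shard, and node names are placeholders):

```python
import requests

# Add a replica of shard1 on the freshly added node (illustrative names).
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "ADDREPLICA",
        "collection": "companies",
        "shard": "shard1",
        "node": "solr-node-7:8983_solr",
        "wt": "json",
    },
)
resp.raise_for_status()
```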

The story of performance at DataFox is an eventful one. There was no one hero; many team members pitched in, got creative, and put out the fire they were best suited for. Our sales team not only knows what Solr is but can also articulate to prospects why certain requests are difficult. Our frontend team, in coordination with design, not only reduced the load on Solr during list import but also objectively improved that workflow. Our CTO ruthlessly cut features that provided marginal value but were expensive for our backend. Our CS team bought engineering time by onboarding customers to use the app in ways that were more conducive to our architectural limitations.

Lessons learned:

  1. Don’t bother trying to improve performance unless you have a baseline
  2. Be wary of clever solutions
  3. Users are a lot more patient when they kick off a job & get a notification when it’s done – rather than being forced to sit there and wait
  4. Doubling your hardware won’t always fix your problem