
=== Indexing Performance Tips

If you are in an indexing-heavy environment, such as indexing infrastructure
logs, you may be willing to sacrifice some search performance for faster indexing
rates. In these scenarios, searches tend to be relatively rare and are performed
by people internal to your organization. They are willing to wait several
seconds for a search, as opposed to a consumer-facing search, which must
return in milliseconds.

Because of this unique position, certain tradeoffs can be made
that will increase your indexing performance.

.These tips only apply to Elasticsearch 1.3+
****
This book is written for the most recent versions of Elasticsearch, although much
of the content works on older versions.

The tips presented in this section, however, are _explicitly_ for version 1.3+. There
have been multiple performance improvements and bug fixes that directly impact
indexing. In fact, some of these recommendations will _reduce_ performance on
older versions due to the presence of bugs or performance defects.
****

==== Test Performance Scientifically

Performance testing is always difficult, so try to be as scientific as possible
in your approach. Randomly fiddling with knobs and turning on ingestion is not
a good way to tune performance. If there are too many "causes," it is impossible
to determine which one had the best "effect."

1. Test performance on a single node, with a single shard and no replicas.
2. Record performance under 100% default settings so that you have a baseline to
measure against.
3. Make sure performance tests run for a long time (30+ minutes) so that you can
evaluate long-term performance, not short-term spikes or latencies. Some events
(such as segment merging and GCs) won't happen right away, so the performance
profile can change over time.
4. Begin making single changes to the baseline defaults. Test these rigorously,
and if the performance improvement is acceptable, keep the setting and move on to the
next one.
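
A minimal sketch of the measurement loop behind step 4, assuming a hypothetical
`send_bulk` client call (every name here is a stand-in, not a real client API),
might look like the following. It reports sustained docs/sec per time window so
that long-running effects such as merges and GCs show up in the numbers:

```python
import time

def measure_indexing_rate(send_bulk, bulks, report_every=60.0):
    """Report sustained docs/sec per time window.

    Sketch only: `send_bulk` stands in for whatever client call you use,
    and `bulks` is an iterable of (bulk_body, doc_count) pairs.
    """
    rates = []
    window_start = time.monotonic()
    window_docs = 0
    for body, doc_count in bulks:
        send_bulk(body)
        window_docs += doc_count
        elapsed = time.monotonic() - window_start
        if elapsed >= report_every:
            rates.append(window_docs / elapsed)  # docs/sec for this window
            window_start = time.monotonic()
            window_docs = 0
    return rates
```

Run one configuration change at a time through a harness like this and compare
the per-window rates against your baseline.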

==== Using and Sizing Bulk Requests

It should be fairly obvious, but use bulk indexing requests for optimal performance.
Bulk sizing is dependent on your data, analysis, and cluster configuration, but
a good starting point is 5-15mb per bulk. Note that this is physical size.
Document count is not a good metric for bulk size. For example, if you are
indexing 1,000 documents per bulk:

- 1,000 documents at 1kb each is 1mb
- 1,000 documents at 100kb each is 100mb

Those are drastically different bulk sizes. Bulks need to be loaded into memory
at the coordinating node, so it is the physical size of the bulk that is more
important than the document count.
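
One way to size bulks by bytes instead of document count is to accumulate
action/source line pairs until a byte budget is reached. This is only a sketch,
not the official client's behavior; a real bulk action line would also carry
index and type metadata:

```python
import json

def chunk_bulk(docs, max_bytes=10 * 1024 * 1024):
    """Group documents into bulk bodies capped by physical size, not count.

    Sketch only: assumes `docs` is an iterable of JSON-serializable dicts;
    a real client would include index/type metadata on each action line.
    """
    body, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {}})          # action line of the bulk pair
        source = json.dumps(doc)                    # document source line
        pair_bytes = len(action) + len(source) + 2  # +2 for the two newlines
        if body and size + pair_bytes > max_bytes:
            yield "\n".join(body) + "\n"
            body, size = [], 0
        body.extend([action, source])
        size += pair_bytes
    if body:
        yield "\n".join(body) + "\n"
```

With a cap like this, a stream of 1kb documents and a stream of 100kb documents
both produce bulks of roughly the same physical size, regardless of how many
documents end up in each.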

Start with a bulk size around 5-15mb and slowly increase it until you do not
see performance gains anymore. Then start increasing the concurrency of your
bulk ingestion (multiple threads, and so forth).

Monitor your nodes with Marvel and/or tools such as `iostat`, `top`, and `ps` to see
when resources start to bottleneck. If you start to receive `EsRejectedExecutionException`,
then your cluster is at capacity with _some_ resource and you need to reduce
concurrency.

When ingesting data, make sure bulk requests are round-robined across all your
data nodes. Do not send all requests to a single node, since that single node
will need to hold all the bulks in memory while processing.
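
A simple way to spread bulks across data nodes is a round-robin rotation over
the node addresses (the hostnames below are placeholders):

```python
import itertools

# Hypothetical data-node addresses; round-robin spreads the memory cost of
# holding in-flight bulks across the cluster instead of one coordinating node.
nodes = ["http://node1:9200", "http://node2:9200", "http://node3:9200"]
node_cycle = itertools.cycle(nodes)

def next_bulk_target():
    # Each successive bulk request goes to the next node in the rotation.
    return next(node_cycle)
```

Most official clients can do this for you if you configure them with the full
node list, but the principle is the same: no single node should receive every
bulk.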

==== Storage

- Use SSDs. As mentioned elsewhere, they are superior to spinning media.
- Use RAID 0. Striped RAID will increase disk IO, at the obvious expense of
potential failure if a drive dies. Don't use "mirrored" or "parity" RAIDs, since
replicas provide that functionality.
- Alternatively, use multiple drives and allow Elasticsearch to stripe data across
them via multiple `path.data` directories.
- Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency introduced
here is antithetical to performance.
- If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often slower
than local drives.

==== Segments and Merging

Segment merging is computationally expensive, and can eat up a lot of disk IO.
Merges are scheduled to operate in the background because they can take a long
time to finish, especially for large segments. This is normally fine, because
large segment merges are relatively rare.

But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch
will automatically throttle indexing requests to a single thread. This prevents
a "segment explosion" problem, in which hundreds of segments are generated before
they can be merged. Elasticsearch will log INFO-level messages stating `now
throttling indexing` when it detects merging falling behind indexing.

Elasticsearch defaults here are conservative: you don't want search performance
to be impacted by background merging. But sometimes (especially on SSD, or in logging
scenarios) the throttle limit is too low.

The default is 20mb/s, which is a good setting for spinning disks. If you have
SSDs, you might consider increasing this to 100-200mb/s. Test to see what works
for your system:

[source,js]
----
PUT /_cluster/settings
{
    "persistent" : {
        "indices.store.throttle.max_bytes_per_sec" : "100mb"
    }
}
----

If you are doing a bulk import and don't care about search at all, you can disable
merge throttling entirely. This will allow indexing to run as fast as your
disks will allow:

[source,js]
----
PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.type" : "none" <1>
    }
}
----
<1> Setting the throttle type to `none` disables merge throttling entirely. When
you are done importing, set it back to `merge` to reenable throttling.
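
For example, when the import completes, you could restore throttling with the
same API:

[source,js]
----
PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.type" : "merge"
    }
}
----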

If you are using spinning media instead of SSD, you need to add this to your
`elasticsearch.yml`:

[source,yaml]
----
index.merge.scheduler.max_thread_count: 1
----

Spinning media has a harder time with concurrent IO, so we need to decrease
the number of threads that can concurrently access the disk per index. This setting
will allow `max_thread_count + 2` threads to operate on the disk at one time,
so a setting of `1` will allow three threads.

For SSDs, you can ignore this setting. The default is
`Math.min(3, Runtime.getRuntime().availableProcessors() / 2)`, which works well
for SSDs.
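
The thread arithmetic above can be sketched as follows (this mirrors the
formula quoted in the text, not the actual Lucene source):

```python
def default_max_thread_count(available_processors: int) -> int:
    # Mirrors Math.min(3, Runtime.getRuntime().availableProcessors() / 2),
    # using integer division as Java does
    return min(3, available_processors // 2)

def max_concurrent_disk_threads(max_thread_count: int) -> int:
    # Elasticsearch allows max_thread_count + 2 threads on the disk at once
    return max_thread_count + 2
```

So on an eight-core SSD box the default works out to three merge threads, while
the `max_thread_count: 1` setting for spinning media caps concurrent disk access
at three threads total.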

Finally, you can increase `index.translog.flush_threshold_size` from the default
200mb to something larger, such as 1gb. This allows larger segments to accumulate
in the translog before a flush occurs. By letting larger segments build, you
flush less often, and the larger segments merge less often. All of this adds up
to less disk IO overhead and better indexing rates.
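
For example, the threshold can be raised on a per-index basis with the index
settings API (`my_index` is a placeholder):

[source,js]
----
PUT /my_index/_settings
{
    "index.translog.flush_threshold_size" : "1gb"
}
----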

==== Other

Finally, there are some other considerations to keep in mind:

- If you don't need near real-time accuracy on your search results, consider
increasing the `index.refresh_interval` of each index to `30s`. If you are doing
a large import, you can disable refreshes by setting this value to `-1` for the
duration of the import. Don't forget to reenable it when you are done!

- If you are doing a large bulk import, consider disabling replicas by setting
`index.number_of_replicas: 0`. When documents are replicated, the entire document
is sent to the replica node and the indexing process is repeated verbatim. This
means each replica will perform the analysis, indexing, and potentially merging
process.
+
In contrast, if you index with zero replicas and then enable replicas when ingestion
is finished, the recovery process is essentially a byte-for-byte network transfer.
This is much more efficient than duplicating the indexing process.

- If you don't have a natural ID for each document, use Elasticsearch's auto-ID
functionality. It is optimized to avoid version lookups, since the autogenerated
ID is unique.

- If you are using your own ID, try to pick an ID that is http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html[friendly to Lucene]. Examples include zero-padded
sequential IDs, UUID-1, and nanotime; these IDs have consistent, "sequential"
patterns that compress well. In contrast, IDs such as UUID-4 are essentially
random; they compress poorly and slow down Lucene.
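
The refresh and replica tips above are often combined around a large import.
For example (`my_index` is a placeholder, and the restored values assume the
defaults):

[source,js]
----
PUT /my_index/_settings
{
    "index" : {
        "refresh_interval" : "-1",
        "number_of_replicas" : 0
    }
}
----

When the import is finished, restore the settings:

[source,js]
----
PUT /my_index/_settings
{
    "index" : {
        "refresh_interval" : "1s",
        "number_of_replicas" : 1
    }
}
----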