Mapping defines how JSON documents are stored in an Elasticsearch index. Issues can arise if mappings are incorrect or change unexpectedly. Explicit mappings will cause exceptions if fields don't match, while dynamic mappings can cause explosions if many new fields are introduced. To deal with mismatches, the "ignore_malformed" setting allows indexing documents that don't match the mapping, by ignoring problematic fields. However, it has limitations and may not work for complex data like nested JSON objects. The best approach is to define mappings explicitly for all expected fields.
WEBVTT
00:06.960 --> 00:12.180
Mapping is an essential foundation of an index that can generally be considered the heart of Elasticsearch.
00:12.900 --> 00:16.020
So you can be sure of the importance of a well managed mapping.
00:16.740 --> 00:20.820
But just as it is with many important things, sometimes mappings can go wrong.
00:21.300 --> 00:25.350
We'll take a look at various issues that can arise with mappings and how to deal with them.
00:28.720 --> 00:34.060
Before delving into the possible challenges with mappings, let's quickly recap some key points about
00:34.060 --> 00:34.630
mappings.
00:35.350 --> 00:42.220
A mapping essentially entails two parts: the process, a process of defining how your JSON documents will
00:42.220 --> 00:48.160
be stored in an index, and the result, the actual metadata structure resulting from the definition process.
00:51.960 --> 00:56.850
If we first consider the process aspect of the mapping definition, there are generally two ways this
00:56.850 --> 00:57.390
can happen.
00:58.170 --> 01:03.120
An explicit mapping process, where you define what fields and their types you want to store, along with
01:03.120 --> 01:04.350
any additional parameters.
01:05.340 --> 01:10.950
A dynamic mapping process, where Elasticsearch automatically attempts to determine the appropriate data type and updates
01:10.950 --> 01:12.030
the mapping accordingly.
01:17.340 --> 01:22.530
The result of the mapping process defines what we can index via individual fields and their data types,
01:22.530 --> 01:25.950
and also how the indexing happens via related parameters.
01:26.670 --> 01:28.380
Consider this mapping example here.
01:29.190 --> 01:32.850
It's a very simple mapping example for a basic log collection microservice.
01:33.570 --> 01:37.620
The individual logs consist of the following fields and their associated data types.
01:38.130 --> 01:44.280
The timestamp of the log is mapped as a date. The service name, which created the log, is mapped as a keyword.
01:44.880 --> 01:50.070
The IP of the host on which the log was produced is mapped as an IP data type. The port number is mapped as
01:50.070 --> 01:50.640
an integer.
01:51.150 --> 01:57.060
The actual log message is mapped as text to enable full-text searching and more, as we have not disabled the
01:57.060 --> 01:58.890
default dynamic mapping process.
01:58.980 --> 02:03.540
So we'll be able to see how we can introduce new fields arbitrarily and they will be added to the mapping
02:03.540 --> 02:04.260
automatically.
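Roughly, such a mapping might look like this. The exact field names, for example host_ip, are guesses based on the description above, so the version used in the exercise may differ slightly.

{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "service":   { "type": "keyword" },
      "host_ip":   { "type": "ip" },
      "port":      { "type": "integer" },
      "message":   { "type": "text" }
    }
  }
}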
02:08.380 --> 02:09.790
So what could go wrong?
02:10.390 --> 02:14.470
There are generally two potential issues that many will end up facing with mappings.
02:15.460 --> 02:20.230
If we create an explicit mapping when fields don't match, we'll get an exception if the mismatch falls
02:20.230 --> 02:21.850
beyond a certain safety zone.
02:22.390 --> 02:25.030
We'll explain this in more detail later in the exercise.
02:25.930 --> 02:31.180
If we keep the default dynamic mapping and then introduce many more fields, we're in for a mapping
02:31.180 --> 02:33.340
explosion which can take our cluster down.
02:37.980 --> 02:42.180
Let's continue with some interesting hands-on examples where we'll simulate the issues and attempt to
02:42.180 --> 02:42.810
resolve them.
02:45.830 --> 02:49.610
Let's get back to the safety zone we mentioned before when there's a mapping mismatch.
02:50.270 --> 02:52.160
We'll create our index and see it in action.
02:52.370 --> 02:55.160
We are using the same exact mapping that we saw earlier.
02:55.160 --> 03:00.110
And to save you some typing, I've uploaded some of the larger commands in this exercise to the web
03:00.110 --> 03:00.530
for you.
03:00.560 --> 03:07.250
So just head over to media that some dog tasks Ofcom slash s slash exceptions dot text and you'll see
03:07.250 --> 03:09.260
this cheat sheet that you can just copy and paste from.
03:10.100 --> 03:11.600
So we'll start by creating our index.
03:11.600 --> 03:15.050
We're going to call it microservice dash logs, containing the following properties.
03:15.350 --> 03:19.010
And note that we're defining the port as an integer type; that will be important later on.
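As a sketch, reusing the field names assumed above rather than the exact cheat sheet text, the create-index call is along these lines:

# assumed field names; adjust to match the cheat sheet
curl -X PUT "http://localhost:9200/microservice-logs?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"mappings":{"properties":{"timestamp":{"type":"date"},"service":{"type":"keyword"},"host_ip":{"type":"ip"},"port":{"type":"integer"},"message":{"type":"text"}}}}'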
03:19.760 --> 03:20.960
Going to go ahead and copy that.
03:22.360 --> 03:24.460
And back to our terminal and right click to paste.
03:25.760 --> 03:26.090
All right.
03:27.770 --> 03:33.350
Now a well-defined JSON log for this mapping would look something like this in block two. Note that the
03:33.350 --> 03:36.890
port is defined as an integer, 12345, just like it should be.
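Something along these lines, with illustrative field values:

# illustrative values; note the port is a bare integer
curl -X POST "http://localhost:9200/microservice-logs/_doc?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"timestamp":"2024-05-01T10:00:00Z","service":"payment-service","host_ip":"10.0.2.15","port":12345,"message":"Service started"}'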
03:37.880 --> 03:41.930
But what if another service tries to log its port as a string and not a numeric value?
03:42.440 --> 03:44.780
Notice that the port is in quotation marks here.
03:44.790 --> 03:46.940
I mean, that's actually a string containing the characters
03:46.940 --> 03:47.960
15000.
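Roughly like this, again with illustrative values, but with the port arriving as a string:

# the port is now the string "15000" rather than a number
curl -X POST "http://localhost:9200/microservice-logs/_doc?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"timestamp":"2024-05-01T10:01:00Z","service":"billing-service","host_ip":"10.0.2.16","port":"15000","message":"All is well"}'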
03:48.560 --> 03:49.580
Well, let's try it out.
03:50.810 --> 03:51.410
Copy that.
03:53.630 --> 03:54.650
And pasted it in.
03:57.320 --> 03:57.640
Great.
03:57.650 --> 03:59.630
It actually worked without throwing an exception.
04:00.050 --> 04:02.300
This is that safety zone that I mentioned earlier.
04:03.400 --> 04:07.600
But what if that service logs a string that has no relation to numeric values at all into the port
04:07.610 --> 04:09.580
field, which we earlier defined as an integer?
04:09.610 --> 04:10.960
Well, let's see what happens then.
04:11.200 --> 04:15.980
So on this one, our message is "I am not well" because the port is actually the string none.
04:16.000 --> 04:17.140
That's not a number at all.
04:17.360 --> 04:18.400
Well, let's see what happens.
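A sketch of that failing request, with the other field values made up:

# the port value cannot be coerced to an integer
curl -X POST "http://localhost:9200/microservice-logs/_doc?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"timestamp":"2024-05-01T10:02:00Z","service":"billing-service","host_ip":"10.0.2.16","port":"none","message":"I am not well"}'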
04:19.520 --> 04:20.120
Copy that.
04:26.810 --> 04:29.930
A number format exception under a mapper parsing exception.
04:30.000 --> 04:30.410
Hmm.
04:31.010 --> 04:33.920
So we're now entering the world of Elasticsearch mapping exceptions.
04:34.070 --> 04:39.050
We've received a code 400 and the mapper parsing exception that is informing us about our data type
04:39.050 --> 04:44.180
issue, specifically that it failed to parse the provided value of none to the type integer.
04:45.250 --> 04:46.840
So how do we solve this kind of an issue?
04:47.410 --> 04:50.590
Well, unfortunately, there isn't a one size fits all solution.
04:51.280 --> 04:56.350
In this specific case, we can partially resolve the issue by defining an ignore_malformed mapping
04:56.350 --> 04:56.920
parameter.
04:57.580 --> 05:02.050
Now keep in mind this parameter is non-dynamic, so you either need to set it when creating your index,
05:02.320 --> 05:03.770
or you need to close the index,
05:03.790 --> 05:07.960
change the setting value and then reopen the index, which is what we're going to do right now, something
05:07.960 --> 05:08.440
like this.
05:09.220 --> 05:11.920
So let's run the commands in block five here, one at a time.
05:12.610 --> 05:13.960
First, we'll close our index.
05:18.480 --> 05:21.810
And then we'll set index mapping ignore malformed to true.
05:27.610 --> 05:29.130
And we'll reopen that index.
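Roughly, those three steps in block five amount to the following calls (a sketch, not the exact cheat sheet text):

# 1. close the index
curl -X POST "http://localhost:9200/microservice-logs/_close?pretty"
# 2. enable ignore_malformed while the index is closed
curl -X PUT "http://localhost:9200/microservice-logs/_settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"index.mapping.ignore_malformed": true}'
# 3. reopen the index
curl -X POST "http://localhost:9200/microservice-logs/_open?pretty"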
05:38.120 --> 05:42.450
All right, so now let's try to index that same document again.
05:42.470 --> 05:44.090
That's what's in BLOCK Six here.
05:52.550 --> 05:53.030
All right.
05:53.040 --> 05:53.900
That one actually worked.
05:55.160 --> 05:59.330
Now, if we check the document by its ID, it will show us that the port field was actually omitted
05:59.330 --> 06:01.500
for indexing, and we'll see it in the ignored section.
06:01.520 --> 06:02.840
Let's see how that works.
06:03.290 --> 06:11.930
First, we need to copy that ID that we got back after inserting it and we'll type in curl HTTP.
06:13.130 --> 06:18.320
Local Host 9200 slash microservice dash logs.
06:19.740 --> 06:25.980
Slash underscore doc slash right click to paste in that ID question mark pretty.
06:27.780 --> 06:28.410
Single quote.
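Put together, the command is along these lines, where the placeholder stands for whatever _id came back from the previous indexing call:

curl "http://localhost:9200/microservice-logs/_doc/<your-document-id>?pretty"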
06:30.710 --> 06:34.160
So note here, it's telling you that the port field was ignored due to that rule.
06:35.000 --> 06:38.900
Now, the reason this is only a partial solution is because the setting has its limits and they are
06:38.900 --> 06:39.950
quite considerable.
06:40.460 --> 06:42.110
Let's reveal one in the next example.
06:42.800 --> 06:47.510
A developer might decide that when a microservice receives some API request, it should log the received
06:47.510 --> 06:49.610
JSON payload in the message field.
06:50.210 --> 06:55.240
Now, we already mapped the message field as text and we still have the ignore_malformed parameter set.
06:55.250 --> 06:56.270
So what would happen?
06:56.450 --> 06:57.200
Well, let's see.
06:57.770 --> 06:59.390
We'll copy block seven here.
07:01.080 --> 07:03.660
That is putting some JSON data within the message.
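Block seven is roughly along these lines, with a made-up nested object in the message field:

# message is now a JSON object, not a plain string
curl -X POST "http://localhost:9200/microservice-logs/_doc?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"timestamp":"2024-05-01T10:03:00Z","service":"api-gateway","host_ip":"10.0.2.17","port":443,"message":{"data":{"received":"yes"}}}'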
07:07.490 --> 07:08.650
Let's get a clean slate here.
07:12.120 --> 07:13.080
And we got an error.
07:13.950 --> 07:16.470
So we see our old friend the mapper parsing exception.
07:16.980 --> 07:22.710
This is because ignore malformed can't handle JSON objects on the input, which is a significant limitation
07:22.710 --> 07:23.550
to be aware of.
07:24.630 --> 07:29.670
Now when speaking of JSON objects, be aware that all the mapping ideas remain valid for the nested
07:29.670 --> 07:30.570
parts as well.
07:31.320 --> 07:36.240
Continuing our scenario: after losing some logs to mapping exceptions, we decided it's time to introduce
07:36.240 --> 07:40.380
a new payload field of the type object where we can store the JSON at will.
07:41.340 --> 07:45.960
Now remember, we have dynamic mapping in place so we can index it without first creating its mapping.
07:46.920 --> 07:48.240
Let's go ahead and try that.
07:49.020 --> 07:51.630
See, we have a payload field now that contains that JSON data.
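For example, something like this, again with illustrative values:

# the new payload field holds the nested JSON instead of message
curl -X POST "http://localhost:9200/microservice-logs/_doc?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"timestamp":"2024-05-01T10:04:00Z","service":"api-gateway","host_ip":"10.0.2.17","port":443,"message":"Request received","payload":{"data":{"received":"yes"}}}'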
08:04.430 --> 08:04.940
All good.
08:05.390 --> 08:05.680
All right.
08:05.690 --> 08:08.510
Now we can check the mapping and focus on that payload field.
08:10.060 --> 08:11.080
So we'll say curl.
08:11.080 --> 08:12.100
That's a GET request.
08:12.100 --> 08:14.080
So copy.
08:16.790 --> 08:25.010
Localhost 9200 slash microservice dash logs slash underscore mapping pretty.
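In other words, something like:

curl "http://localhost:9200/microservice-logs/_mapping?pretty"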
08:29.530 --> 08:30.960
And let's find that payload field.
08:30.970 --> 08:31.480
There it is.
08:33.410 --> 08:37.460
So it was mapped as an object with sub properties defining the nested fields.
08:37.640 --> 08:41.270
So apparently the dynamic mapping works, but there is a trap.
08:41.720 --> 08:47.120
The payloads, and generally any JSON object in the world of many producers and consumers, can consist
08:47.120 --> 08:48.080
of almost anything.
08:48.950 --> 08:53.750
So who knows what will happen with different JSON payloads, which also consist of a payload dot data
08:53.750 --> 08:56.570
dot received field, but with a different type of data.
08:58.130 --> 09:00.040
Let's try that with BLOCK nine here.
09:05.860 --> 09:08.170
Well, you see, we're just sending a slightly different payload here.
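A sketch of such a conflicting document: payload.data.received is now an object, whereas the dynamic mapping created from the previous document treats it as text:

# received is an object here, so parsing it into the existing text field fails
curl -X POST "http://localhost:9200/microservice-logs/_doc?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"timestamp":"2024-05-01T10:05:00Z","service":"api-gateway","host_ip":"10.0.2.17","port":443,"message":"Request received","payload":{"data":{"received":{"even":"more data"}}}}'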
09:15.460 --> 09:17.830
And again, we got the mapper parsing exception.
09:19.060 --> 09:20.170
So what else can we do?
09:20.620 --> 09:24.250
Well, engineers on the team need to be aware of these mapping mechanics.
09:24.670 --> 09:27.430
You can also establish shared guidelines for the log fields.
09:28.030 --> 09:33.070
Secondly, you may consider what's called a dead letter queue pattern that would store the failed documents
09:33.070 --> 09:34.030
in a separate queue.
09:34.570 --> 09:38.890
This either needs to be handled on an application level or by employing Logstash DLQ,
09:38.890 --> 09:42.100
which allows us to still process the failed documents.
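As a rough sketch of the Logstash route, assuming Logstash sits between the producers and Elasticsearch and that the data path below matches your installation: enable the dead letter queue in logstash.yml, then read the failed events back with the dead_letter_queue input plugin in a separate pipeline.

# logstash.yml
dead_letter_queue.enable: true

# reprocessing pipeline (sketch); the path is an assumed default
input {
  dead_letter_queue {
    path => "/usr/share/logstash/data/dead_letter_queue"
    commit_offsets => true
  }
}
output {
  # inspect or repair the failed documents here before re-indexing them
  stdout { codec => rubydebug }
}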
09:43.650 --> 09:46.020
Let's clear and start with a fresh slate here.
09:47.160 --> 09:50.460
So now, the second area of caution in relation to mappings is limits.
09:51.570 --> 09:53.220
Even from super simple examples
09:53.220 --> 09:57.480
with payloads, you can see that the number of nested fields can start accumulating pretty quickly.
09:57.870 --> 09:58.980
Where does this road end?
09:59.040 --> 10:03.930
Well, at the number 1000, which is the default limit of the number of fields in a mapping.
10:04.800 --> 10:09.090
Let's simulate this exception in our safe playground environment before you'll unwillingly meet it in
10:09.090 --> 10:10.230
your production environment.
10:11.040 --> 10:17.340
Let's start by creating a large dummy JSON document with 1001 fields, post it and see what happens.
10:18.300 --> 10:24.140
So to create the document, we're going to use the example command below with the JQ tool.
10:24.150 --> 10:30.180
And if you don't already have JQ installed, you'll have to do that with sudo apt dash get install JQ.
10:34.620 --> 10:38.010
And once you have that, you can create the JSON manually or if you prefer.
10:38.070 --> 10:39.240
This is a little bit easier, actually.
10:39.240 --> 10:39.860
A lot easier.
10:39.880 --> 10:44.940
Just go to block ten here and we'll set up a variable called thousand-and-one-fields-JSON that contains
10:44.940 --> 10:48.000
the following stuff using jq.
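Block ten is roughly along these lines; a minimal jq sketch that builds an object with 1001 fields and stores it in a shell variable (the exact variable and field names in the cheat sheet may differ):

# assumed variable name; generates field_0 through field_1000
thousand_and_one_fields_json=$(echo '{}' | jq -c 'reduce range(1001) as $i (.; . + {("field_" + ($i|tostring)): ("value_" + ($i|tostring))})')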
10:52.490 --> 10:52.910
Copy.
10:54.790 --> 10:55.270
Paste.
10:57.530 --> 11:03.650
And you can see all this is doing is echoing that 1001 times to that environment variable and we can
11:03.650 --> 11:05.060
echo that to take a look at what's in it.
11:09.140 --> 11:09.650
Oh, yeah.
11:09.800 --> 11:10.880
1001 things.
11:14.050 --> 11:20.980
So we can now create a new plain index with a curl that says a location request to PUT.
11:22.350 --> 11:27.680
HTTP localhost 9200 slash, and we'll call this one big dash objects.
11:31.470 --> 11:33.450
And we'll post in our generated JSON.
11:42.580 --> 11:44.260
Big dash objects.
11:45.460 --> 11:46.000
Underscore.
11:46.000 --> 11:46.450
Doc.
11:47.380 --> 11:47.790
Question mark.
11:47.830 --> 11:50.590
Pretty backslash.
11:51.040 --> 11:51.340
Dash.
11:51.340 --> 11:51.790
Dash.
11:51.970 --> 11:52.480
Data.
11:52.480 --> 11:53.020
Dash.
11:53.020 --> 11:53.680
Raw.
11:55.440 --> 11:56.010
Quote.
11:57.220 --> 11:59.170
Dollar sign, thousand and one.
12:00.310 --> 12:00.910
Fields.
12:01.240 --> 12:08.890
JSON. And that will import the contents of that into our big-objects index.
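Written out, that pair of commands is approximately the following, using the variable name assumed in the jq sketch above; note the double quotes around the variable so the shell actually expands it:

curl --location --request PUT 'http://localhost:9200/big-objects?pretty'

curl --location --request POST 'http://localhost:9200/big-objects/_doc?pretty' \
  --header 'Content-Type: application/json' \
  --data-raw "$thousand_and_one_fields_json"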
12:10.840 --> 12:12.760
And you can guess what happened.
12:13.150 --> 12:15.190
We went straight to the illegal argument
12:15.190 --> 12:15.640
exception.
12:15.640 --> 12:19.450
An exception that informs us about the limit being exceeded very explicitly.
12:20.380 --> 12:21.370
So how do you handle that?
12:22.470 --> 12:26.310
Well, first, you should definitely think about what you're storing in your indices and for what purpose.
12:26.520 --> 12:29.690
Secondly, if you still need to, you can increase this 1000 limit.
12:30.240 --> 12:31.110
But be careful.
12:31.110 --> 12:35.280
As with bigger complexity, you might pay a much bigger price of potential performance degradations
12:35.550 --> 12:36.900
and high memory pressure.
12:37.950 --> 12:40.860
Changing this limit can be performed with a simple, dynamic setting change.
12:40.860 --> 12:44.850
We can just say curl location.
12:47.300 --> 12:55.610
Request PUT, HTTP localhost 9200 slash big dash objects slash underscore settings.
12:57.920 --> 12:58.670
Data raw.
13:02.500 --> 13:08.380
It would be index dot mapping dot total underscore fields dot limit.
13:08.890 --> 13:10.450
And we could set that to 1001.
13:12.500 --> 13:14.330
And that would get around that particular issue.
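That settings change would look roughly like this:

curl --location --request PUT 'http://localhost:9200/big-objects/_settings?pretty' \
  --header 'Content-Type: application/json' \
  --data-raw '{"index.mapping.total_fields.limit": 1001}'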
13:15.350 --> 13:15.830
All right.
13:15.830 --> 13:20.180
So now you're more aware of the dangers lurking within mappings and you're much better