Write Your Own Adapter
Latest revision as of 11:31, 21 August 2019
You may have custom applications that log messages in their own format, databases that store messages in their own table structures, plain files, or third-party applications that generate data and expose it through APIs, all of which you may wish to import continuously into Khika. To do so, you write your own Adapter scripts that pump your data into Khika so you can start analyzing it.
Khika does not pose any restriction on the source of the data as long as it conforms to the standard Khika Data Format.
There are three steps to consider when developing a custom Adapter script:
1) Read the data from the source: This may involve reading a simple text file, or reading data from a third-party application through its APIs. In all cases, ensure that you have appropriate read access to the source of the data. (NOTE: The user account executing the Adapter script must have read access on the source data.)
2) Convert the data into Khika Data Format: It is important to know the format of the source data you are reading, because you have to extract the timestamp (date and time) from each message along with metadata/value pairs, where possible. We recommend adding meaningful metadata tags in this step so that the data becomes easier to work with in Khika at a later stage.
3) Write the Khika formatted data line-by-line on stdout and exit after the available data is written.
Khika executes the Adapter script after every 'Polling Interval' (or at a scheduled time). It is important that the Adapter script/program reads only the incremental data at each execution and does not re-read all the data every time. Consider an Adapter that is a shell script reading lines from a text file: it must know how many lines it read during the last execution and read only the next available lines, if any, during subsequent executions.
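The incremental-reading idea can be sketched in Python before we look at the shell version: a hypothetical helper that remembers, in a small offset file, how many lines it has already read, and returns only the new ones. The file names and the function name are illustrative, not part of KHIKA:

```python
def read_incremental(source, offset_file):
    """Return only the lines appended to `source` since the previous execution.

    `offset_file` stores the number of lines already read (the offset).
    """
    try:
        with open(offset_file) as f:
            lines_read = int(f.read().strip())
    except (IOError, ValueError):
        lines_read = 0  # first execution: no offset yet, so read the whole file

    with open(source) as f:
        lines = f.readlines()

    # Remember the new offset for the next execution
    with open(offset_file, "w") as f:
        f.write(str(len(lines)))

    return lines[lines_read:]

# An Adapter would then write the new lines on stdout, e.g.:
#   for line in read_incremental("app.log", "app.offset"):
#       sys.stdout.write(line)
```

Each call returns only the portion of the file appended since the last call, which is exactly the contract a polled Adapter must honour.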
We explain this with an example below. Log in to the KHIKA Data Aggregator and open the demo.sh script installed in the /opt/KHIKA/Apps/Adapters directory. This simple script reads data from the demo.txt file in the same directory, which already contains data in KHIKA Data Format. This being our first Adapter script, we have avoided parsing and date-conversion logic here; we will cover those in the subsequent examples.
 1 #!/bin/bash
 2 if [ -e /home/KHIKA/Adapters/out.txt ]
 3 then
 4     line_already_read=`cat /home/KHIKA/Adapters/out.txt`
 5     no_of_lines=`wc -l /home/KHIKA/Adapters/demo.txt|awk '{print $1}'`
 6     lines_to_read=$(($no_of_lines - $line_already_read))
 7     echo `date` " : " $lines_to_read >> /home/KHIKA/Adapters/log.txt
 8     tail -n $lines_to_read /home/KHIKA/Adapters/demo.txt|awk '{printf("%d ", $1);for(i=2;i<=NF;++i){printf("%s ", $i);} printf("\n");}'
 9 else
10     lines_to_read=`wc -l /home/KHIKA/Adapters/demo.txt|awk '{print $1}'`
11     echo `date` " : " $lines_to_read >> /home/KHIKA/Adapters/log.txt
12     head -n $lines_to_read /home/KHIKA/Adapters/demo.txt|awk '{printf("%d ", $1);for(i=2;i<=NF;++i){printf("%s ", $i);} printf("\n");}'
13 fi
14 wc -l /home/KHIKA/Adapters/demo.txt| awk '{print $1}' > /home/KHIKA/Adapters/out.txt
This Adapter script ingests the demo.txt file into KHIKA. We assume that some application is continuously writing messages to this file in KHIKA Data Format. Before reading demo.txt, the script first checks whether it has read the same file before; if yes, it retrieves the stored offset and reads ahead of it. The offset is stored in the out.txt file.
The script checks for the out.txt file in a certain directory (on line 2). This file keeps a record of the total lines read so far (the offset). If you are executing the script for the first time, out.txt won't exist, so the else branch on line 9 runs.

Here we use the wc -l command (line 10) to find the number of lines in demo.txt, our source file, and store it in the variable lines_to_read.

We log an info message on line 11.

On line 12 we read lines_to_read lines from the top of the source file using the head command.

Line 13 (fi) is the Unix shell syntax marking the end of the if block.

On line 14 we calculate the number of lines currently in the source file using wc -l and store it in out.txt, which we consult on every execution. The script ends here.

During the next execution, we find the out.txt file (line 2) and read how many lines we read last time (line 4), storing it in the variable line_already_read.

On line 5 we find the total number of lines in the source file demo.txt using wc -l and store it in the variable no_of_lines.

On line 6 we take the difference between the number of lines in the file right now (no_of_lines) and the lines read up to the previous execution (line_already_read). If the file was appended to during the polling interval, the difference is a positive number (lines_to_read).

On line 7 we log an informative log message.

On line 8 we read exactly lines_to_read lines from the end of the file using the tail command.
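demo.sh tracks its position as a line count and re-scans the file with wc, head and tail on every run. An alternative design, sketched below in Python, stores a byte offset instead, so each run can seek() straight to the unread portion. This is our own variation, not shipped with KHIKA; the function and file names are illustrative:

```python
import os

def read_new_data(source, offset_file):
    """Read only the bytes appended to `source` since the last run.

    Storing a byte offset lets us seek() directly to the unread portion
    instead of counting lines through the whole file each time.
    """
    try:
        with open(offset_file) as f:
            offset = int(f.read().strip())
    except (IOError, ValueError):
        offset = 0  # first run: start from the beginning

    # If the file shrank (log rotation or truncation), start over from byte 0
    if offset > os.path.getsize(source):
        offset = 0

    with open(source) as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()

    with open(offset_file, "w") as f:
        f.write(str(new_offset))
    return data
```

The rotation check matters in practice: a line-count offset silently misbehaves when the log file is rotated, whereas comparing the stored offset against the current file size lets the Adapter recover by re-reading from the start.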
You may have observed that this script reads a simple text file and writes the messages on stdout (using the head and tail commands of Unix). It stores the number of lines read (in a file) so that the next execution refers to it and reads only the appended portion of the file. It does not perform any epoch time conversion because the source file (demo.txt) already has the timestamp in epoch format. This is an unlikely case, as most applications log the timestamp in a human-readable format that must be converted to epoch time to conform to the Khika format.
In the next example, we read the /var/log/messages file (the syslog format) using a Python script and explain how to convert a human-readable timestamp into an epoch time. Below are some sample input lines:
05 31 10:17:23 khika156 systemd[1]: Detected virtualization kvm.
05 31 10:17:23 khika156 systemd[1]: Detected architecture x86-64.
05 31 10:17:23 khika156 systemd[1]: Set hostname to <khika156>.
05 31 10:17:25 khika156 systemd-udevd[522]: starting version 219
05 31 10:17:25 khika156 systemd[1]: Starting Flush Journal to Persistent Storage...
Please check the Python Adapter script given below. We will focus on converting the timestamp to epoch format.
10 import time
11 import socket
12
13 input = "/var/log/messages"
14 #Change this path to Khika's Adapter directory or some safe location
15 meta_data = "/tmp/metadata.txt"
16
17 try:
18     lines_read = int(open("/tmp/metadata.txt", 'r').read())
19 except:
20     lines_read = -1
21
22 if (lines_read == -1):
23     skip_lines = 0
24     lines_read = 0
25 else:
26     skip_lines = lines_read
27 count = 0
28 with open(input, "r") as f:
29     for line in f:
30         if (count >= skip_lines):
31             split_line = line.split()
32             month = split_line[0]
33             date = split_line[1]
34             year = str(time.gmtime().tm_year) #Since the year field is missing in the message, we add the current year
35             timestamp = split_line[2]
36             hours = timestamp.split(':')[0]
37             minutes = timestamp.split(':')[1]
38             seconds = timestamp.split(':')[2]
39             MyStr = year + " " + month + " " + date + " " + hours + " " + minutes + " " + seconds
40             s = time.strptime(MyStr, "%Y %b %d %H %M %S")
41             epoch_time = str(int(time.mktime(s)))
42             #Write the output to stdout in Khika format
43             print epoch_time+": host "+socket.gethostname()+" file:/var/log/messages event_str ", " ".join(split_line[3:])
44             lines_read += 1
45         else:
46             count += 1
47             continue
48
49 open("/tmp/metadata.txt", 'w').write(str(lines_read))
The core logic of read_syslog.py is more or less the same as that of demo.sh explained earlier. It reads the syslog-format file (/var/log/messages), stores the number of lines read in '/tmp/metadata.txt', and refers to it at each execution, skipping the already-read lines and reading only the incremental data. It then parses each line to extract the date field, converts it into Khika Data Format, and writes the output on stdout. The important step is the parsing of the data, i.e. lines 31 to 43.
Before writing this simple parser, it is important to understand the format of the messages:

1) Each line has a timestamp at the beginning (e.g. "05 31 10:17:23").

2) The format of the timestamp is consistent across the file.

3) The date field at the beginning of each line has a particular format: "MM DD Hours:Minutes:Seconds".

4) The important thing to note is that the year is missing from the timestamp.
This much information is good enough for us to write the basic parser. This is how we wrote it:

1) We split each line into a list of words, using whitespace as the separator (this works for most log files).

2) We know that the first word in the line is the month, the second is the date, and the third is the timestamp, which we further split using ':' as the separator. This gives us the month, date, hours, minutes and seconds, pretty much all the fields except the year, which we populate with the current year. (Note: Python indexes start at 0.)
3) On lines 39, 40 and 41 we use simple Python time library functions to convert this date into epoch time. Please refer to the documentation of the strptime() and mktime() library functions that we have used here.

If you use any other scripting/programming language such as Perl, Java, Ruby, C or C++, you should have access to many standard time library functions, as all popular languages provide a rich interface to the time library.

4) The other parts of the message remain the same.

5) Finally, on line 43, we print the message in Khika Data Format on stdout.
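The parsing steps above can be pulled together into one small helper. A minimal sketch, assuming the month appears as an abbreviated name (e.g. "May"), which is what the %b directive in the script expects; the function name and the optional year/host arguments are our own additions for testability:

```python
import time
import socket

def syslog_to_khika(line, year=None, host=None):
    """Convert one syslog-style line into a KHIKA Data Format line."""
    split_line = line.split()
    month, date, timestamp = split_line[0], split_line[1], split_line[2]
    if year is None:
        year = str(time.gmtime().tm_year)  # year is missing from the syslog timestamp
    hours, minutes, seconds = timestamp.split(':')
    stamp = " ".join([year, month, date, hours, minutes, seconds])
    s = time.strptime(stamp, "%Y %b %d %H %M %S")
    epoch_time = str(int(time.mktime(s)))  # interpreted as local time, as in the script
    if host is None:
        host = socket.gethostname()
    # KHIKA Data Format: epoch timestamp, then metadata/value pairs, then the message
    return (epoch_time + ": host " + host +
            " file:/var/log/messages event_str " + " ".join(split_line[3:]))

# Example:
#   syslog_to_khika("May 31 10:17:23 khika156 systemd[1]: Set hostname to <khika156>.",
#                   year="2019")
```

Note that mktime() interprets the parsed fields as local time, so the resulting epoch value depends on the timezone of the machine running the Adapter.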
Note: Needless to mention, the account executing the 'Adapter Script' must have read permission on the /var/log/messages file.
All KHIKA Adapters are open source and ship with the KHIKA Data Aggregator. We encourage you to open /opt/KHIKA/Apps/Adapters and read the code. Most Adapters are written in the Python programming language, and you will need some knowledge of regular expressions.
If you need any help writing Adapter scripts, please write to us at info@khika.com

Writing advanced adapters