Hi
I am using node-tail on a server with relatively high load. My app basically monitors a syslog text file that gets written with relatively high rate (about 25-35kB/s). File looks more or less like this:
2014-07-09T09:56:16+00:00 nszen3.dc1.company.net [info] postfix/smtpd: lost connection after CONNECT from unknown[10.12.109.70]
2014-07-09T09:56:16+00:00 nszen3.dc1.company.net [info] postfix/smtpd: disconnect from unknown[10.12.109.70]
2014-07-09T09:56:15+00:00 xe17-1d.dc2.company.net [notice] 897262: Jul 9 09:56:14.844 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/16, changed state to down
2014-07-09T09:56:13+00:00 aw1-1p.dc3.company.net [notice] 1724: Jul 9 09:56:15.584 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
2014-07-09T09:56:13+00:00 aw1-1.dc3.company.net [notice] 2673: Jul 9 09:56:15.622 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
2014-07-09T09:56:40+00:00 aw1-1.dc3.company.net [error] Postfix blah blah
multi
line message
2014-07-09T09:56:13+00:00 aw1-1.dc3.company.net [notice] 2673: Jul 9 09:56:15.622 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
...
The end goal is to 'split' the stream into an array of syslog messages. The logical step would be to use \n
as a separator but it does not work in this particular case because some of the messages are multi line.
I ended up writing code that looks like this:
var tail = new Tail("/opt/syslog/archive.log", "\n20")
tail.on('line', function(data) {
process_further('20' + );
}
tail.on('error', function(error) { console.log(error);});
The results are as expected vast majority of the time - I get the exact message that's supposed to be extracted. Unfortunately once every couple minutes the message gets corrupted in a funny way. process_further()
for some reason gets called with junk data in the begining of the string, for example:
2014-2014-07-09T06:03:36+00:00 478242-dhcp2.kickstart.company.com [notice] info: dhcpd: DHCPREQUEST for 10.132.29.112 (10.192.0.40) from 00:26:b9:4e:e3:07 via 10.192.28.3
or
202014-07-08T12:47:15+00:00 f4-42-1.company.net [notice] : 2014 Jul 8 12:46:19 UTC: %ETHPORT-5-IF_DUPLEX: Interface Ethernet1/34, operational duplex mode changed to Full
Please bear in mind that it happens only once every couple thousand of lines.
I have been troubleshooting this for last couple days and I am not able to find the culprit. So far I have checked following things:
- There are no 'weird characters' or anything else in a raw file that is being monitored when it happens. The afffected area (few lines before and after the corrupted message) has been verified with hexeditor and the only character after last line is
\n
aka 0x0A
.
- It seems to be related to CPU load. Currently the system is running with load average of 8.5
- I am not able to replicate the issue when replaying same file on other computer
- Replaying same file on the same computer does not produce corruption in a same place
- One of my theories was that file gets somehow overwritten or worse, segments are removed between reads. I've Added debugging console.log in https://github.com/lucagrulla/node-tail/blob/master/tail.coffee#L48 to check if the condition is ever met:
if (stats.size < _this.pos) {
console.log("rewinded! stats.size=" + stats.size + " @pos=" + _this.pos + " difference(@pos-size)=" + (_this.pos - stats.size));
_this.pos = stats.size;
}
and to my surprise it was fired couple times:
rewinded! stats.size=1219983229 @pos=1219983490 difference(@pos-size)=261
rewinded! stats.size=1219983750 @pos=1219984281 difference(@pos-size)=531
I thought it is not possible, but wanted to verify anyway so I wrote short ruby script to monitor file size returned by system and produce an alert when it's rewinded:
#!/usr/bin/ruby
filename = '/opt/syslog/archive.log'
prev = 0
loop do
fsize = File.stat(filename).size
if fsize - prev < 0
alert = 'REWINDED!'
puts "size: #{fsize} prev:#{prev} diff: #{fsize - prev} #{alert}"
else
alert = ''
end
prev = fsize
abort("File rewinded, aborting") if alert != ''
sleep 0.1
end
That script has been running for last couple of hours and it never fired, so the file is not being rewinded yet for some reason node thinks it is.
Still, it does not look like it correlates with the time when messages are corrupted:
rewinded! stats.size=1219983229 @pos=1219983490 difference(@pos-size)=261
rewinded! stats.size=1219983750 @pos=1219984281 difference(@pos-size)=531
rewinded! stats.size=1226892183 @pos=1226894946 difference(@pos-size)=2763
rewinded! stats.size=1232545312 @pos=1232545481 difference(@pos-size)=169
rewinded! stats.size=1232581791 @pos=1232582488 difference(@pos-size)=697
rewinded! stats.size=1235945205 @pos=1235945374 difference(@pos-size)=169
rewinded! stats.size=1245025044 @pos=1245026503 difference(@pos-size)=1459
rewinded! stats.size=1245908859 @pos=1245909751 difference(@pos-size)=892
Problem with chunk: 014-07-09T10:48:01+00:00 dev.dc3.company.net [info] postfix/cleanup: 821A8DDD33: message-id=<[email protected]>
14-07-09T10:48:01+00:00 phase3sniffer.dc3.company.net [info] postfix/qmgr: 84FA12EC0085: from=<[email protected]>, size=1159, nrcpt=1 (queue active)
rewinded! stats.size=1263001128 @pos=1263001772 difference(@pos-size)=644
rewinded! stats.size=1270921071 @pos=1270921374 difference(@pos-size)=303
rewinded! stats.size=1274044805 @pos=1274045193 difference(@pos-size)=388
rewinded! stats.size=1276448317 @pos=1276449057 difference(@pos-size)=740
rewinded! stats.size=1276991206 @pos=1276992015 difference(@pos-size)=809
rewinded! stats.size=1277356959 @pos=1277357127 difference(@pos-size)=168
rewinded! stats.size=1289780552 @pos=1289781614 difference(@pos-size)=1062
- My other theory was the race condition where
@buffer
gets overwritten. Would it be possible if readBlock()
did not finish executing lines 11-22 but @internalDispatcher
received another next
event from watch()
?
More (possibly relevant) details:
- Node 0.10.25-linux-64
- Linux 2.6.18-92
- ext3 filesystem
Simplest code to reproduce the problem:
var Tail = require('tail').Tail
var tail = new Tail("/opt/syslog/archive.log", "\n20")
tail.on('line', function(data) {
if(data.substr(0,2) != '14') {
console.log("Problem with chunk: " + data);
}
})
tail.on('error', function(error) { throw error })
Please note that first 'Problem' must be ignored since node-tail will start reading from a beginning of a file which is unlikely to start with \n20
.
Thanks for reading this. I would appreciate any ideas or suggestions as to what else may I try?
thanks,
Marek