15:40 * axfelix joined #dataverse
16:57 * donsizemore joined #dataverse
18:11 <donsizemore> Hello! just curious if anyone is around and interested in troubleshooting a pegged CPU?
18:16 <pdurbin> donsizemore: well, I can try a bit. Is this for DVN 3.x?
18:17 <donsizemore> correct - a Dataverse 4 upgrade is in our near future. and thank you for your help!
18:19 <donsizemore> I've been poring through the domain server, access, and export logs but don't see a particular culprit. Akio and I went down a few rabbit holes earlier this afternoon.
18:22 <pdurbin> donsizemore: I'm going to guess you don't want to simply restart glassfish
18:23 <donsizemore> already done, happy to do it again
18:23 <pdurbin> huh. and still pegged
18:24 <pdurbin> what's the load from `uptime`?
18:24 <donsizemore> we were getting a number of harvesting errors, but they had me disable the schedule for that
18:24 <donsizemore> it's been hovering around 5 all day. I found one IP in the access log that had sent an enormous number of requests earlier, but that stopped around 12:30 EDT
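A quick way to spot a single abusive client like the one mentioned above is a per-IP tally of the access log. A minimal sketch; the log path is an assumption and depends on where your Glassfish (or fronting httpd) writes access logs:

```sh
# Count requests per client IP, busiest first (adjust the path to your setup).
awk '{print $1}' /usr/local/glassfish3/glassfish/domains/domain1/logs/access/server_access_log.txt \
  | sort | uniq -c | sort -rn | head
```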
18:25 <pdurbin> 5 isn't crazy high at least
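For context: the load average reported by `uptime` is roughly the average number of runnable (or uninterruptibly waiting) tasks, so whether 5 is high depends on the core count. A quick comparison on a Linux box:

```sh
uptime   # e.g. "load average: 5.02, 4.87, 4.90"
nproc    # number of cores; load ~5 on an 8-core VM is busy but not saturated
```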
18:29 <pdurbin> donsizemore: I assume you have symptoms as well. site is slower than usual or whatever
18:29 <donsizemore> this particular VM has a ton of resources thrown at it, so the machine (and Glassfish) are perfectly responsive. but the load is usually closer to 0
18:30 <pdurbin> ok. responsive is good :)
18:31 <donsizemore> we spent part of the morning trying to correlate export errors / NullPointerExceptions to bad data of any sort, but can't pin anything down
18:36 <pdurbin> I wonder if that's related or not.
18:41 <donsizemore> on launch the glassfish log gets a string of javax.enterprise.system.container.ejb.com.sun.ejb.containers|_ThreadID=17;_ThreadName=Thread-2;|EJB5184:A system exception occurred during an invocation on EJB StudyServiceBean, method: public void edu.harvard.iq.dvn.core.study.StudyServiceBean.exportStudy(java.lang.Long) errors, but once exporting finishes the CPU doesn't die down
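One way to see how often those EJB5184 export failures recur after a restart is to count them in the server log. A minimal sketch; the domain path is an assumption:

```sh
# Tally EJB5184 failures and peek at the most recent exportStudy errors
# (adjust the domain path to your installation).
grep -c 'EJB5184' /usr/local/glassfish3/glassfish/domains/domain1/logs/server.log
grep 'exportStudy' /usr/local/glassfish3/glassfish/domains/domain1/logs/server.log | tail -5
```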
18:42 <pdurbin> donsizemore: so if you restart glassfish and *don't* try to export, everything is fine? it's only when you restart glassfish and try an export that the CPU gets pegged?
18:43 <donsizemore> it seems to be launching export on its own
18:44 <pdurbin> huh
18:45 <pdurbin> I assume you don't want that.
18:46 <pdurbin> donsizemore: this *might* help: http://wiki.greptilian.com/java/glassfish/howto/purge-jms-queue/
18:47 <pdurbin> if export is done via a JMS queue, which I'm not sure of
18:55 <pdurbin> donsizemore: I'd start by trying to list anything in any JMS queues
18:58 <donsizemore> DSBIngest Queue RUNNING 0 - 1 - 0 0 0 0.0
18:58 <donsizemore> IndexMessage Queue RUNNING 0 - 1 - 5 0 5 366.0
18:58 <donsizemore> mq.sys.dmq Queue RUNNING 0 - 0 - 0 0 0 0.0
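Those pasted lines look like `imqcmd list dst` output with the header row lost; the columns appear to be destination name, type, state, then producer/consumer counts and message counts (so IndexMessage is showing 1 consumer and 5 pending messages). A sketch of listing the destinations, assuming the default embedded Open MQ broker on localhost:7676:

```sh
# List JMS destinations on the Glassfish (Open MQ) broker.
# Prompts for the broker admin password (default user/password is admin/admin).
imqcmd list dst -b localhost:7676 -u admin
```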
19:00 <pdurbin> so a count of 5 in the IndexMessage queue?
19:00 <donsizemore> correct
19:00 <pdurbin> is that expected?
19:11 <donsizemore> Akio is checking documentation. if I try to view the jms/DSBIngest Connection Factory in the Glassfish console, I get SEVERE|glassfish3.1.2|org.glassfish.admingui|_ThreadID=37;_ThreadName=Thread-2;|RestResponse.getResponse() gives FAILURE. endpoint = 'https://localhost:4848/management/domain/resources/connector-connection-pool/jms%2FIndexMessage'; attrs = '{}'
19:12 <pdurbin> ok
19:18 <pdurbin> donsizemore: my only point is that when unexpected stuff starts happening when you start glassfish, it may be because stuff is in the JMS queue.
19:21 <donsizemore> point absolutely well taken - just to be sure, the JMS queue is safe to purge?
19:22 <pdurbin> well, I would think you'd be able to restart indexing anyway
19:23 <donsizemore> 10-4 -- please forgive me for being squeamish; I'm new =)
19:24 <pdurbin> probably good to be a bit squeamish
19:24 <pdurbin> :)
19:24 <pdurbin> donsizemore: please feel free to open a ticket for all this. I'm just trying to figure out why your CPU is pegged. support@dataverse.org
19:26 <donsizemore> #224689 from this morning
19:26 <donsizemore> the purge returned successful, but I still have a consumer count of 1 in DSBIngest and 5 in IndexMessage - should I purge IndexMessage specifically?
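Purging targets one destination at a time, so purging DSBIngest alone would leave IndexMessage untouched. A sketch of purging the IndexMessage queue specifically, under the same broker assumptions as above; note that purging drops pending messages but would not change the consumer count, which just reflects attached listeners:

```sh
# Purge only the IndexMessage queue; -t q means destination type "queue".
imqcmd purge dst -t q -n IndexMessage -b localhost:7676 -u admin
```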
19:27 <pdurbin> huh. sounds like it didn't do anything
19:34 <donsizemore> Lucene seems to be the CPU hog
19:37 <pdurbin> donsizemore: yeah? how can you tell?
19:38 <donsizemore> turned on threading in "top" (which I should've done in the first place, but I'm learning)
19:38 <donsizemore> then stracing those processes shows Lucene contending over write locks
19:39 <donsizemore> stat("/usr/DVN/lucene/index-dir/write.lock", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
19:39 <pdurbin> how do you turn on threading in top?
19:39 <donsizemore> on Linux, with a capital H
19:39 <donsizemore> I meant to look that up earlier and got sidetracked messing with jstat etc.
19:40 <donsizemore> I had also seen Lucene complaining about write locks timing out earlier, but thought that was a symptom (the timeout) rather than a cause
19:40 <pdurbin> in the ticket you opened I see this: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/usr/DVN/lucene/index-dir/write.lock
19:41 <donsizemore> (actually, it may still be a symptom)
19:41 <donsizemore> will stopping Glassfish and manually removing the lock cause things to go any wonkier than at present?
19:42 <pdurbin> not sure
19:43 <pdurbin> sorry, I've spent more time on Dataverse 4 by now
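The fix that worked, reconstructed as a sketch: stop Glassfish so no JVM still holds the index, remove the stale Lucene lock file (the path comes from the exception above), and start it back up. The domain name `domain1` is an assumption:

```sh
asadmin stop-domain domain1                # ensure nothing is holding the index
rm /usr/DVN/lucene/index-dir/write.lock    # stale NativeFSLock from the exception
asadmin start-domain domain1
```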
19:46 <donsizemore> it was Lucene's write lock. right in front of my (blushing) face
19:47 <donsizemore> and we, too, hope to be on Dataverse 4 soon
19:48 <pdurbin> donsizemore: you removed the lock and you're all set now?
19:51 <donsizemore> things look wonderfully normal. I apologize for my many dumb questions, but I'm learning new-to-me technologies, the Dataverse among them
19:51 <pdurbin> no no, it's all good. good reminder of how useful strace is :)
20:13 * pdurbin replies on https://help.hmdc.harvard.edu/Ticket/Display.html?id=224689
22:00 * garnett joined #dataverse