Time
S
Nick
Message
06:46
jri joined #dataverse
07:29
poikilotherm joined #dataverse
10:51
pdurbin joined #dataverse
10:52
pdurbin
poikilotherm: welcome back!
11:16
jri joined #dataverse
11:25
poikilotherm
Hey pdurbin :-)
11:25
poikilotherm
Thx
11:27
pdurbin_m joined #dataverse
11:29
pdurbin_m
poikilotherm: my trip to PIDapalooza and FOSDEM was approved. I'm hoping to book flights today. Want to try to meet up?
11:29
poikilotherm
Wow, cool!
11:29
poikilotherm
I still need to ask my boss if I may go to PIDapalooza...
11:29
poikilotherm
Fosdem is not far away... ;-)
11:46
pdurbin_m
Right. I think I'll book hotels for 3 nights in each city. Any suggestions for where to stay in Brussels?
12:34
pdurbin
pmauduit: hi! Did you see the comment I left at https://github.com/IQSS/dataverse-ansible/issues/99#issuecomment-524398212 ?
13:38
donsizemore joined #dataverse
14:00
pmauduit
not yet, sorry, I've been in a meeting all afternoon
14:00
pmauduit
I'll have a look asap
14:08
pmauduit
I guess it's kind of normal to have prometheus on /prometheus/ now, even on port 9090
14:08
pmauduit
that's what I'd have expected
14:12
pmauduit
pdurbin: what's the user on the ec2 instance again?
14:13
pmauduit
centos, right?
14:15
pmauduit
ok, I replied
14:16
pdurbin
pmauduit: yes, "centos"
14:18
pmauduit
next thing I'd like to do is configure JMX, but I have no idea where to tweak the Java options for the J2EE container
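A common way to expose JMX from any JVM is the standard `com.sun.management.jmxremote` system properties, and on Glassfish such JVM options can be added to the domain with `asadmin create-jvm-options` (followed by a domain restart). The sketch below only builds those commands. Port 9010 is a hypothetical choice, authentication and SSL are switched off (acceptable on a throwaway test box only), and it has not been checked against Glassfish's own built-in JMX connector.

```python
"""Sketch: build asadmin commands that enable remote JMX on a Glassfish domain.

Assumptions: the port is hypothetical, and auth/SSL are disabled for a test box.
"""
import shlex
from typing import List

JMX_PORT = 9010  # hypothetical; pick any port not already in use


def jmx_jvm_options(port: int) -> List[str]:
    # Standard JVM system properties for remote JMX. Using the same port for
    # the RMI registry and the RMI server simplifies firewall rules.
    return [
        "-Dcom.sun.management.jmxremote",
        f"-Dcom.sun.management.jmxremote.port={port}",
        f"-Dcom.sun.management.jmxremote.rmi.port={port}",
        "-Dcom.sun.management.jmxremote.authenticate=false",
        "-Dcom.sun.management.jmxremote.ssl=false",
    ]


def asadmin_commands(options: List[str]) -> List[str]:
    # One create-jvm-options call per flag, quoted for the shell if needed.
    return [f"asadmin create-jvm-options {shlex.quote(opt)}" for opt in options]


if __name__ == "__main__":
    for cmd in asadmin_commands(jmx_jvm_options(JMX_PORT)):
        print(cmd)
```

Printing the commands instead of running them keeps the change reviewable before it touches the domain configuration.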
14:20
donsizemore
@pmauduit this is where i limited the scope of the initial issue =)
14:22
pdurbin
pmauduit: JMX first or Grafana first? I was thinking maybe we could start with a very basic Grafana dashboard of operating-system-level stuff like disk usage, CPU, memory, etc.
14:29
pdurbin
I'm fine with whatever. :)
14:29
pdurbin
donsizemore: I'm glad to see you're still following along even though you've got a busy week coming up. Upgrades and such, right? I forget. :)
14:30
donsizemore
we're on a custom-patched 4.11 now; re-indexing just finished, i'm restoring original file sizes and tracking down dataset index failures
14:31
* pdurbin
looks at https://dataverse.unc.edu
14:31
pdurbin
awesome
14:32
pdurbin
How long does reindexing take? Roughly?
14:41
pmauduit
pdurbin: it's all the same to me too
14:42
pdurbin
pmauduit: ok, if it's ok with you, I think I'd like to have a basic Grafana dashboard first. I'm not even ssh'ed in anymore. Do you want to try to install it? :)
14:42
pmauduit
but we already have some material to be ported to the ansible playbook
14:42
pmauduit
I can see if it's yum packaged
14:43
pdurbin
True and we definitely do want to put all this in Ansible but I think donsizemore is quite busy with other stuff at the moment.
14:46
pmauduit
:( it seems a package is available, but I'm not sure it's grafana itself
14:46
pmauduit
pcp-webapp-grafana.noarch : Grafana web application for Performance Co-Pilot
14:49
pmauduit
ok done, using the yum repo advised by the grafana project
14:51
pdurbin
perfect
14:52
pdurbin
We already use custom repos for EPEL and Postgres.
14:52
pdurbin
custom yum repos
14:52
pmauduit
I commented on the ticket
14:52
pdurbin
Looks good. The next step is to create and export a dashboard?
14:53
pmauduit
expose it via the http conf first?
14:53
pdurbin
sure!
14:53
pdurbin
Do you need a hand with that? Please feel free to make whatever edits you want. :)
14:55
pmauduit
http://ec2-3-81-53-52.compute-1.amazonaws.com/grafana/login
14:55
pmauduit
that's ok ;)
14:56
pdurbin
Nice! Can you configure it so a login isn't required?
14:58
pmauduit
probably, though I can't find the option yet; admin/admin is the default login
14:59
pdurbin
Ok. We'll find it later. I know it's possible. :)
15:01
donsizemore
@pdurbin re-indexing on our current hardware and number of datasets takes ~90 minutes
15:02
pdurbin
donsizemore: cool, I expected much worse :)
15:02
donsizemore
404 datasets don't want to re-index under solr 7.3.1, so far the majority are deaccessioned and the rest display in the web interface
15:03
donsizemore
since we had to blow away solr for the 4.11 upgrade, i moved the retroactive original filesize and ReExportAll steps to the end
15:03
pdurbin
By 404 do you mean unpublished?
15:04
donsizemore
then i'll re-index one more time to see if that number changes. no, just coincidence that 404 datasets threw indexing failures
15:04
donsizemore
i didn't think reordering the filesize or JSON-LD export steps should've affected solr
15:04
donsizemore
and reordering those steps cut our downtime drastically
15:05
pdurbin
gotcha, you seem to be back up. that's good :)
15:06
donsizemore
Mandy hasn't started using it yet =) she was the anti-King Midas during our testing
15:06
donsizemore
features that worked just fine for me failed consistently for her
15:07
pdurbin
It's good to have people around like that. :)
15:07
pmauduit
pdurbin: http://ec2-3-81-53-52.compute-1.amazonaws.com/grafana/d/COPs6vKZk/overall-metrics?orgId=1&from=now-30m&to=now
15:09
pdurbin
pmauduit: CPU Load! Looks great! https://i.imgur.com/CcIEgO9.png
15:09
pmauduit
I'm adding the memory graphs, once I find out which unit collectd reports the values in ;)
15:10
pmauduit
should be in bytes (2 GB of used RAM seems legit?)
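Whether "2 GB" is the right reading of a raw byte count depends on the base used for the conversion: binary (1 GiB = 1024^3 bytes) or decimal (1 GB = 1000^3 bytes). A quick sanity check, assuming the metric really is reported in raw bytes:

```python
def bytes_to_gib(n_bytes: float) -> float:
    # Binary gibibytes: 1 GiB = 1024**3 bytes.
    return n_bytes / 1024 ** 3


def bytes_to_gb(n_bytes: float) -> float:
    # Decimal gigabytes: 1 GB = 1000**3 bytes.
    return n_bytes / 1000 ** 3


used = 2_147_483_648  # example value only, not taken from the server
print(f"{bytes_to_gib(used):.3f} GiB vs {bytes_to_gb(used):.3f} GB")
```

The two readings differ by about 7%, which is worth keeping in mind when comparing a Grafana panel against Munin's memory graph.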
15:11
pdurbin
well, we could compare it to what munin is reporting for memory
15:11
pmauduit
then reload the grafana dashboard
15:11
pdurbin
Please see http://ec2-3-81-53-52.compute-1.amazonaws.com/munin/localhost/localhost/memory.html
15:12
pmauduit
seems in sync
15:12
pdurbin
cool
15:13
pdurbin
Meanwhile I found https://grafana.com/blog/2019/05/16/worth-a-look-public-grafana-dashboards/ but not the config on how to make grafana dashboards public.
15:13
pmauduit
the dashboard might be public by default, but an admin account would be required to set up dashboards as well as datasources
15:13
pdurbin
Ah, there's something called [auth.anonymous]
15:13
pdurbin
please see https://stackoverflow.com/questions/33111835/how-to-set-up-grafana-so-that-no-password-is-necessary-to-view-dashboards
15:14
pmauduit
http://ec2-3-81-53-52.compute-1.amazonaws.com/grafana/admin/settings
15:14
pmauduit
it's over here
15:14
pmauduit
so it's stored somewhere in the db, I guess, but it might be overridden in grafana.ini
15:15
pdurbin
Ok, please feel free to go ahead and make the graphs public.
15:16
pmauduit
found it!
15:17
pmauduit
http://ec2-3-81-53-52.compute-1.amazonaws.com/grafana/d/COPs6vKZk/overall-metrics?orgId=1 accessible in "privacy mode" now
15:18
pmauduit
(with no auth)
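The `[auth.anonymous]` section mentioned above is what lets visitors view dashboards without logging in. As a minimal sketch, assuming the file is edited programmatically, this patches the ini text with Python's `configparser`. The org name "Main Org." is Grafana's usual default organization and is an assumption here; it must match an organization that actually exists.

```python
import configparser
import io


def enable_anonymous_viewing(ini_text: str, org_name: str = "Main Org.") -> str:
    """Return grafana.ini text with read-only anonymous access switched on."""
    # interpolation=None so values such as %(protocol)s in [server] are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section("auth.anonymous"):
        parser.add_section("auth.anonymous")
    section = parser["auth.anonymous"]
    section["enabled"] = "true"
    section["org_name"] = org_name  # assumed default org; must already exist
    section["org_role"] = "Viewer"  # anonymous visitors can look but not edit
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    sample = "[server]\nroot_url = %(protocol)s://%(domain)s/grafana/\n"
    print(enable_anonymous_viewing(sample))
```

Note that `configparser` drops comments when it writes the file back. Grafana also reads overrides from environment variables such as `GF_AUTH_ANONYMOUS_ENABLED=true`, which avoids rewriting the file at all.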
15:37
pdurbin
pmauduit: perfect! I was just at standup and showed Danny afterwards. I'm not sure if you know Danny but he's our project manager.
15:38
pmauduit
no, I don't think so, but great!
15:38
pmauduit
(if he's not around on irc, we've probably never talked)
15:39
pmauduit
pdurbin: I notice that centos doesn't start/enable services by default after a yum install
15:39
pmauduit
I don't know if it's a problem (as we might provision once and let the VM run until its "death")
15:40
pdurbin
pmauduit: yes, that's a feature of centos. It drives me crazy that debian does the opposite. :)
15:40
pmauduit
but it's good to know
15:43
pdurbin
Don't worry, when donsizemore and I get all this stuff added to dataverse-ansible, we'll make sure all the services start on boot.
15:44
pdurbin
pmauduit: should we move on to monitoring Glassfish or Solr via JMX? What's our next move? :)
15:45
pdurbin
Or should we export the dashboard as is? CPU and memory is a good start!
15:48
pmauduit
pdurbin: I checked the ansible module for grafana; it has everything needed to take the JSON as input and have ansible configure the result (datasource + dashboard)
15:49
pdurbin
pmauduit: oh! That might be nice. I'll let donsizemore decide though. I'm fine with whatever. :)
15:50
bjonnh
pdurbin: what may I know?
15:52
pdurbin
bjonnh: heh. Never mind. There's an "about" page coming for Harvard Dataverse that will explain it all some day. :)
15:53
pmauduit
pdurbin: https://github.com/IQSS/dataverse-ansible/issues/99#issuecomment-524916069
15:54
pdurbin
pmauduit: cool. JSON format works for me!
16:01
pdurbin
pmauduit: do you think it would be easier to monitor Glassfish or Solr?
16:07
pdurbin
pmauduit: I just found this if it helps: https://lucene.apache.org/solr/guide/7_3/using-jmx-with-solr.html (Dataverse supports Solr 7.3.x right now)
16:08
pdurbin
This is even more specific: https://lucene.apache.org/solr/guide/7_3/monitoring-solr-with-prometheus-and-grafana.html
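The linked Solr 7.3 guide describes a Prometheus exporter bundled under `contrib/prometheus-exporter`, plus a Prometheus scrape job pointing at it. A sketch that only builds the exporter command line and the scrape job text, under stated assumptions: the Solr install path is hypothetical, Solr is assumed to listen on its default `localhost:8983`, and port 9854 follows the guide's example.

```python
from typing import List

SOLR_URL = "http://localhost:8983/solr"  # assumed default Solr address
EXPORTER_PORT = 9854  # port from the guide's example; an assumption here


def exporter_argv(solr_install_dir: str) -> List[str]:
    # Command line for the exporter shipped in the Solr distribution's contrib dir.
    base = f"{solr_install_dir}/contrib/prometheus-exporter"
    return [
        f"{base}/bin/solr-exporter",
        "-p", str(EXPORTER_PORT),          # port the exporter listens on
        "-b", SOLR_URL,                    # Solr base URL to collect from
        "-f", f"{base}/conf/solr-exporter-config.xml",
        "-n", "8",                         # number of collector threads
    ]


def scrape_job_yaml(host: str = "localhost", port: int = EXPORTER_PORT) -> str:
    # A job to merge into the existing scrape_configs list in prometheus.yml.
    return (
        "  - job_name: 'solr'\n"
        "    static_configs:\n"
        f"      - targets: ['{host}:{port}']\n"
    )


if __name__ == "__main__":
    print(" ".join(exporter_argv("/usr/local/solr/solr-7.3.1")))  # hypothetical path
    print(scrape_job_yaml())
```

Generating text rather than editing prometheus.yml directly leaves the merge into the existing `scrape_configs` list (and eventually into the Ansible role) as a deliberate step.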
19:32
donsizemore joined #dataverse
19:33
donsizemore
@pdurbin knock knock, o ye of the institutional knowledge?
19:35
pdurbin
donsizemore: hit me
19:39
donsizemore
@pdurbin so, on our production dataverse we have 404 datasets which cause solr to throw an index failure
19:40
donsizemore
@pdurbin on our test server, same warfile, same solr version, import of production database, we only have our accustomed 22 dataset indexing failures that we've never tracked down
19:41
donsizemore
schema.xml is identical. i do note a couple differences in solrconfig.xml (which... i thought i used the copy from dvinstall.zip for both, but it looks like i dropped the ball on the test server there), and it only fails on 22
19:41
pdurbin
oh! 404 datasets, I thought you were talking about the http code... I'm with you now :)
19:42
pdurbin
When you try to index one of those datasets individually, is there a stacktrace in server.log?
19:42
donsizemore
correct. i'm trying to suss out the difference. one of those being solrconfig.xml. may i send you a diff?
19:42
donsizemore
they all say Exception info: null
19:43
pdurbin
right but there are probably line numbers in there
19:43
pdurbin
that's what I want to see, the line numbers and which java classes
19:44
donsizemore
at edu.harvard.iq.dataverse.__EJB31_Generated__DataverseServiceBean__Intf____Bean__.findRootDataverse(Unknown Source)
19:44
pdurbin
no line numbers?
19:44
donsizemore
there are several, i can collate them. is it because we renamed the root dataverse to 'unc' years ago?
19:45
donsizemore
javax.ejb.TransactionRolledbackLocalException: Exception thrown from bean at com.sun.ejb.containers.EJBContainerTransactionManager.checkExceptionClientTx(EJBContainerTransactionManager.java:662)
19:45
pdurbin
Can you please email the whole server.log file to support@dataverse.org? More of the stacktrace would help a lot.
19:45
donsizemore
yes yes thank you.
19:46
donsizemore
the "problem" datasets show up in the web interface, which i take to mean that solr didn't completely blow up during indexing
19:46
donsizemore
unless the dataset page itself is populated by the database rather than solr
19:46
donsizemore
most of the problem datasets are deaccessioned but not all of them
19:46
pdurbin
Can you please link me to one of the problem datasets?
19:47
donsizemore
here's one that isn't deaccessioned https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/12410
19:49
donsizemore
and if i try to index them manually through the API the server log doesn't barf
19:50
pdurbin
Thanks for the link. This dataset seems to be properly indexed. Is it?
19:50
donsizemore
the only difference i can find is the boost logic in solrconfig.xml, which is current on prod but out of date on test
19:50
donsizemore
it was our third failure during index-all
19:51
pdurbin
Are you suffering from this issue? index all fails to index some datasets but they can be indexed individually #5575 https://github.com/IQSS/dataverse/issues/5575
19:51
donsizemore
so far they all look properly indexed in the web interface. do you suppose solr is tripping on some field and we get a blue screen of indexing in the glassfish log?
19:51
donsizemore
this is exactly the stack trace, o ye of the institutional knowledge
19:53
donsizemore
so, the solution is... upgrade to 4.14+ (which we're not ready to do yet) or simply re-index them manually?
19:53
pdurbin
Yeah, I think so. We don't know where the bug is. Reindexing manually seems to be the workaround.
19:55
pdurbin
donsizemore: please feel free to leave a comment on that issue
19:57
donsizemore
the memory/resource angle makes perfect sense because our test server succeeds and it's otherwise doing jack
19:59
pdurbin
Oh, was there a comment about memory or resources in the issue?
19:59
donsizemore
yes, from jim myers
20:01
pdurbin
Ah, I see it. Man, this bug has bitten everyone. :(
20:15
donsizemore
i scripted a re-index with a sleep 1 each time. no errors! (even the 22 we /should've/ seen)
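The workaround described here (index datasets one at a time with a one-second pause between calls) can be sketched as follows. The per-dataset endpoint `/api/admin/index/datasets/{id}` and the base URL `http://localhost:8080` are assumptions to check against your Dataverse version's Admin Guide; the list of dataset IDs is hypothetical (for example, the failures reported by index-all).

```python
import time
import urllib.request
from typing import Callable, Iterable, List

BASE_URL = "http://localhost:8080"  # assumption: run on the Dataverse host itself


def index_url(dataset_id: int, base_url: str = BASE_URL) -> str:
    # Assumed per-dataset index endpoint; verify for your Dataverse version.
    return f"{base_url}/api/admin/index/datasets/{dataset_id}"


def http_get(url: str) -> str:
    with urllib.request.urlopen(url, timeout=300) as resp:
        return resp.read().decode("utf-8")


def reindex_slowly(
    dataset_ids: Iterable[int],
    fetch: Callable[[str], str] = http_get,
    pause: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Index datasets one at a time, pausing after each; return the IDs that failed."""
    failed: List[int] = []
    for ds_id in dataset_ids:
        try:
            fetch(index_url(ds_id))
        except OSError as exc:  # URLError/HTTPError are OSError subclasses
            print(f"dataset {ds_id}: {exc}")
            failed.append(ds_id)
        sleep(pause)  # give Solr and Glassfish a moment between requests
    return failed
```

Usage: `reindex_slowly([101, 102, 103])` with real dataset IDs. Passing `fetch` and `sleep` as parameters keeps the loop testable without a live server.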
20:15
pdurbin
nice
20:15
pdurbin
hooray for workarounds, I guess
20:15
pdurbin
sorry for all the bugs :)
20:16
donsizemore
kasha and i absconded to a bar to celebrate the upgrade; i'm re-indexing with bold rock
20:16
pdurbin
:)
20:17
pdurbin
cheers