IRC log for #dataverse, 2019-07-11

Connect to discuss Dataverse (an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.

07:06 juancorr Thanks pdurbin. I think that GoogleMaps is nicer for presentations like this case. We use it in e-cienciaDatos. WorldMap is much better for reusing data in a dataset. I agree with @poikilotherm that summaries will be a great improvement in both cases
08:41 poikilotherm Thx pdurbin! I just release new images for K8s.
08:41 poikilotherm +d
09:37 dataverse-user Hi, sorry if this is not the correct channel for asking this, but I issued a pull request 15 days ago for some typos I found in the documentation for the Native API (using the 'Quick Fix' procedure) and it has not progressed
09:38 dataverse-user Please note that I am not complaining. I just wanted to check if I have done anything wrong
09:40 dataverse-user Could anyone please advise on how to proceed and what info should I supply?
09:59 poikilotherm Hi @dataverse-user
10:00 poikilotherm Could you copy paste the URL to the PR here?
10:02 dataverse-user hi @poikilotherm
10:02 dataverse-user thanks for the reply
10:02 dataverse-user
10:02 dataverse-user is this what You were asking?
10:15 pdurbin_m joined #dataverse
10:16 pdurbin_m dataverse-user: hi! Thanks for that pull request! We didn't see it because it doesn't appear under
10:18 dataverse-user ok. Should I have done so that it appeared in
10:19 pdurbin_m Yes, please!
10:20 pdurbin_m This happened to another contributor recently. Her first pull request was against her fork rather than the upstream repo.
10:22 pdurbin_m So I wonder if we should improve our "quick fix" write up. We haven't touched it in years.
10:30 poikilotherm Sry, I had a talk with my colleague. What pdurbin says :-d
10:30 poikilotherm And yeah, pdurbin, it might be a good idea
10:31 poikilotherm Maybe rethink about the first pr bots?
10:31 poikilotherm Attach a message about how to create a PR to the issue=
10:31 poikilotherm ?
10:31 dataverse-user If this happened to other contributors, perhaps it would make sense to revisit the procedure
10:32 dataverse-user could You please check if my pull request is now visible?
10:34 dataverse-user I believe it is, but if You could check it would be great
10:34 pdurbin_m Yes! I see it now! Thanks! Do you want to help us fix up the page that talks about the quick fix too? :)
10:35 pdurbin_m And sorry it's almost time for breakfast here so I should get off my phone. :)
10:35 pdurbin_m poikilotherm: are you around to continue helping? :)
10:35 poikilotherm Yeah
10:35 dataverse-user30 sorry, the chat seems to have crashed on my machine
10:35 poikilotherm Back to keyboard now :-D
10:36 pdurbin_m poikilotherm: thanks!
10:36 poikilotherm Ah, there is a bug in the chat web client
10:36 poikilotherm Never mind that it crashed
10:37 pdurbin_m dataverse-user30: a memory leak: :(
10:38 poikilotherm I guess this is your PR then, right j-n-c?
10:38 dataverse-user ok. I have received an update on my pull request. Thanks for Your help
10:39 poikilotherm Yeah, I see it here
10:39 poikilotherm Is there an issue related to this PR already?
10:39 poikilotherm IQSS tries to follow "issue first - pr second" process
10:39 poikilotherm They use it to ensure QA and other devs always know what's going on
10:39 dataverse-user I have not created an issue
10:40 dataverse-user Ok. seems reasonable
10:40 poikilotherm Would you be so kind to create one and link it in the PR?
10:40 dataverse-user Sure
10:40 poikilotherm It would make their work a lot easier
10:40 poikilotherm Thank you!
10:40 poikilotherm :-)
10:41 poikilotherm Feel free to copy-paste your PR description into the issue, as it sounds like a good description ;-)
10:42 poikilotherm You can delete the PR description part of "New Contributors" till "Related issues" after reading ;-)
10:49 dataverse-user Done. Thanks again for Your help! ;)
10:52 poikilotherm No problem :-)
10:52 poikilotherm Thank you for your contribution
11:02 pdurbin_m dataverse-user: yes, thanks.
11:02 pdurbin_m poikilotherm: are you able to add the pull request to the main board?
11:03 poikilotherm Done
11:04 poikilotherm Pulled it into "Community Dev"
11:04 poikilotherm Issue + PR
11:04 poikilotherm Dude, that Column is huge
11:06 pdurbin_m poikilotherm: heh. Actually, it should be in code review, right? The pull request I mean.
11:07 poikilotherm Done as well
11:07 pdurbin_m Automation should put the pull request into Code Review if you add the pull request to the right project.
11:07 pdurbin_m Does that make sense?
11:07 poikilotherm I placed it here
11:08 pdurbin_m Yes, perfect, but I'm saying there's a simpler way.
11:09 poikilotherm I placed it into Community Dev first and it hadn't been moved
11:09 poikilotherm You say it should have been moved?
11:09 pdurbin_m No.
11:10 pdurbin_m The act of adding a pull request to the main project from the pull request itself moves it to code review thanks to automation.
11:10 poikilotherm I can't modify the PR...
11:11 poikilotherm I need to go the other way round
11:11 pdurbin_m interesting
11:11 poikilotherm Yeah
11:11 poikilotherm I can't even change it if it's my own
11:11 pdurbin_m but the pull request is in code review now so that's perfect. thanks!
11:12 poikilotherm You're welcome.
11:12 poikilotherm Err... Shouldn't you have some coffee first at 7am? Or get breakfast ready?
11:13 poikilotherm Instead you are speaking with strangers across the ocean :-D
11:13 pdurbin_m dataverse-user: oh, our process is changing so for a small doc change like this you probably don't have to create an issue but it's always appreciated :)
11:13 poikilotherm My wife would kill me ;-)
11:13 poikilotherm pdurbin: that was my fault! I requested him to follow the rulez ;-)
11:13 pdurbin_m poikilotherm: we've met. We aren't strangers. :)
11:18 pdurbin_m The rulez are changing a bit.
11:20 * poikilotherm *thumbs up*
12:15 pdurbin dataverse-user: I don't know if you're across the ocean from me or not but I assume so based on how early in the morning it is here. :)
12:16 pdurbin juancorr: thanks for the feedback on my crazy map idea :)
12:16 pdurbin poikilotherm: awesome that you've already released
12:17 poikilotherm ;-)
12:18 pdurbin Lots of chatter to catch up on. I think I missed a few things. Or have comments to make at least. :)
12:22 donsizemore joined #dataverse
12:24 donsizemore @pdurbin morning. may I ask one of my many "have-you-seen-this-before" questions?
12:37 pdurbin donsizemore: I have plumbers here installing a new washer/dryer but please go ahead.
12:38 pdurbin donsizemore: also, dataverse-user just made a pull request to add some more full curl examples.
12:39 poikilotherm donsizemore: go ahead, I'm all in :-)
12:41 donsizemore it's curl i be askin'! (pirate voice)
12:42 donsizemore since the AWS nodes were misbehaving, i was trying to help cheryl download a dataset with 108 files. via curl, a number of them return a 403 forbidden
12:44 donsizemore copy-paste the same URL into a browser... oh, wait. i'm passing the API token but it's getting chomped
12:45 donsizemore but only for certain requests?!? ugh
12:47 pdurbin What's a pirate's favorite programming language? R!!!!!!
13:00 donsizemore yeah, so draft dataset, native API. i get dataset metadata via api/datasets/:persistentId/versions/' + args.version + '/files?persistentId=
13:01 donsizemore then i start downloading files individually via api/access/datafile/' + str(fileid) + '?format=original'
13:02 donsizemore for about half of the files, this seems fine. others return a 403 forbidden. i can't help but think i'm passing dataverse bad milk and cookies
13:02 donsizemore in a draft dataset, all files should have an 'original' format, yes? the version is at the record level
13:03 donsizemore if i follow the URL i'm passing curl in a browser, i get the file. same url construct, passing API token, same dataset, nothing published
13:04 donsizemore i may bump dataverse-ansible to 4.15.1 which i need to do anyway and test this a little more locally. (we were trying to download large datasets from harvard but encountering problems)
13:12 pdurbin donsizemore: all files *should* have an original format but I'm aware that a few of these files are missing from Harvard Dataverse. We adjusted the GUI to not offer "original" under the download button in cases like this.
13:12 pdurbin So I guess the question is if you see "original" for that file in the GUI or not.
13:18 donsizemore this is great information. i have cheryl's API token but when she gets back from fauxbucks i'll pester her about what's in the GUI
13:19 pdurbin You can go to the file page based on the id.
13:19 donsizemore i was wondering if these files got uploaded during periods of instability and may not be in a fully-archived state
13:19 donsizemore i've got the DOI and her API token, i just didn't think i could call up the webpage without being logged in as her. i'll try. thanks for the info about the GUI adjustment
13:24 pdurbin Oh, that's a good point. The data is probably unpublished or restricted so you can't even see a download button unless you log in as her.
13:33 donsizemore just looked at the GUI with cheryl. for problem files, there's a download button rather than a drop-down. but she can download-all and there's an option for original format
13:34 donsizemore which means dataverse is working around this (i'll go traipse through the code). i would set forth that files which lack surrogate copies are assumed to be in original format, so that 403 forbidden might better simply return the file
13:36 pdurbin I can dig up the issue about the GUI change if you want.
13:36 pdurbin Make "download as original" disappear from download options, when there is no saved original. #4796
13:37 donsizemore i was just looking at
13:37 donsizemore the first problem file i hit, for instance, is a PDF
13:37 donsizemore no ingest, no rename, no original format
13:38 pdurbin huh
13:39 pdurbin Maybe ?format=original is only for tabular files that were ingested? I don't know.
13:39 pdurbin If the API guide is unclear we should fix it up.
13:41 donsizemore the API guide is perfectly clear, but for a scripted download pass, some file formats have an "original" format, some don't. i'm suggesting that dataverse serve an original for each
13:43 donsizemore from our perspective this is fairly related to #6006
13:47 pdurbin You want to always be able to pass ?format=original and get the original file regardless of whether the file was a PDF or a tabular file that was successfully ingested. Is that right?
13:54 donsizemore i was going to poke through the dataverse code to see what download-all in original format was using to determine whether to pass ?format=original or equivalent
13:54 donsizemore let me look at that first. but every file should have an original format...
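The scripted download pass described above can be sketched roughly as follows. This is a minimal sketch, not the script donsizemore was running: the two API paths come from the log, the `X-Dataverse-key` header is the usual way to pass an API token, and the function names, the `dv.example` host, and the "retry without `?format=original` on a 403" workaround are assumptions made for illustration.

```python
from urllib.error import HTTPError
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def files_listing_url(base_url, persistent_id, version):
    """URL listing the files of one dataset version (path taken from the log)."""
    query = urlencode({"persistentId": persistent_id})
    version_part = quote(version, safe=":")  # keeps ":draft" readable
    return f"{base_url}/api/datasets/:persistentId/versions/{version_part}/files?{query}"


def datafile_url(base_url, file_id, original=True):
    """URL for a single file, optionally asking for the original format."""
    url = f"{base_url}/api/access/datafile/{file_id}"
    return url + "?format=original" if original else url


def http_get(url, api_token):
    """Fetch a URL with the API token in a header; return (status, body)."""
    request = Request(url, headers={"X-Dataverse-key": api_token})
    try:
        with urlopen(request) as response:
            return response.status, response.read()
    except HTTPError as err:
        return err.code, b""


def download_file(base_url, file_id, api_token, get=http_get):
    """Try ?format=original first; on a 403, retry the plain URL.

    The retry is an assumed workaround for files with no saved original
    (for example a PDF that was never ingested), not documented behavior.
    """
    status, body = get(datafile_url(base_url, file_id, original=True), api_token)
    if status == 403:
        status, body = get(datafile_url(base_url, file_id, original=False), api_token)
    return status, body
```

Retrying on every 403 would also hide a genuine permission problem, so a real script would probably want to log which files needed the fallback.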
14:00 pdurbin Yeah, I agree. I think we should have a dedicated issue about this. Small chunks.
14:06 donsizemore i was thinking along the same lines, but wanted to ask first. best defense is no offense and all that =)
14:07 pdurbin New washer/dryer installed and it seems like it even works! \o/
14:45 pdurbin jri: Can you please take a look at this thread? Examples of Web Analytics Code (Matomo, formerly “Piwik”)
14:55 pdurbin dataverse-user: I just moved your pull request to QA. Thanks again.
15:06 dataverse-user @pdurbin, just checked the notification. Thanks!
15:13 pdurbin dataverse-user: sure. I can help you change your nick to j-n-c in here if you want.
15:14 pdurbin xarthisius: yes, yes, YES! I just noticed !!! Go, go, GO!!
15:21 pdurbin donsizemore: I'm tempted to add a Jenkins job for it.
15:24 donsizemore sure thing - just open an issue?
15:30 pdurbin donsizemore: sure, done:
15:32 pdurbin donsizemore: we can try it on a DOI from UNC Dataverse. :)
16:04 xarthisius FTR, it doesn't have to be DOI, it will work with raw url pointing to dataverse resource too
16:11 pdurbin xarthisius: I was wondering if Handles are also supported. Thanks!
16:11 pdurbin I just left this comment on the "Binderverse" issue:
16:12 pdurbin The thing I'm wondering about right now is where the installation instructions should go for adding an external tool to Dataverse for Binder.
16:13 pdurbin xarthisius: as you know, the Whole Tale button instructions are here, for example:
16:13 pdurbin But what's the right place to put similar instructions in the Binder world?
16:18 xarthisius pdurbin: I don't think external tools fit binder model
16:18 xarthisius you'll need to create a "resolution" service that would convert output from ext tools to a proper binder link
16:19 xarthisius that's how we do it in WT
16:20 xarthisius there's an endpoint that responds with a redirect when external tools hit it
16:20 xarthisius unless you make external tools more flexible
16:21 pdurbin Hmm. You don't think so? Do you have a Zenodo DOI I can play around with?
16:23 pdurbin It looks like they were testing with 10.5281/zenodo.3229823
16:24 pdurbin Which becomes this URL, it appears:
16:26 xarthisius yup, so external tools would need to return that
16:26 xarthisius<doi>
16:27 pdurbin But there are 46 installations of Dataverse.
16:28 xarthisius I'm not sure how's that relevant
16:29 pdurbin So should it be something like<doi> for the installations of Dataverse at ?
16:29 xarthisius it's gonna work with all instances, no matter how many there are
16:29 xarthisius doi will resolve to, won't it?
16:29 pdurbin yes
16:29 pdurbin I guess you're right.
16:31 xarthisius
16:31 xarthisius would also work
16:31 xarthisius that was my point about raw urls
16:31 xarthisius it's just a matter of external tools being able to create those urls
16:32 pdurbin Hmm.
16:33 xarthisius or running a service that would do it
16:33 pdurbin We could send dataset id.
16:35 pdurbin as a query parameter
16:38 pdurbin I just added a "MyBinder" button to (under Explore)
16:38 pdurbin If you click it, it goes to
16:40 xarthisius yeah, but that would require Binder to significantly change how they operate wouldn't it?
16:40 pdurbin I don't know. It sounds like you're saying it would. :)
16:41 pdurbin Binder doesn't like query parameters? They like paths instead? :)
16:41 xarthisius that's my understanding
16:42 xarthisius I might be wrong
16:42 pdurbin Query parameters are nice for DOIs because DOIs can have an arbitrary number of slashes in them.
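The point about slashes can be illustrated with a small sketch. The `tool.example` host and the sample DOI are hypothetical; only the standard library's `urllib.parse` is used.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TOOL = "https://tool.example/launch"  # hypothetical tool URL
doi = "10.5072/FK2/ABC/DEF"           # made-up DOI with several slashes

# As a query parameter the DOI is percent-encoded and comes back intact.
query_url = TOOL + "?" + urlencode({"persistentId": "doi:" + doi})
recovered = parse_qs(urlsplit(query_url).query)["persistentId"][0]

# As a path the DOI spreads over several segments, so the receiving
# service has to know how many segments belong to the identifier.
path_url = TOOL + "/" + doi
segments = urlsplit(path_url).path.split("/")[2:]

print(query_url)
print(recovered)
print(segments)
```

With a path-based scheme the receiver can still rejoin the trailing segments, but only if nothing else is expected to follow the identifier in the path.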
16:44 pdurbin But you seem to be saying that raw URLs will work with the content provider you're adding. That means that we need to allow external tools to append to the "toolUrl" of an external tool.
16:46 xarthisius The changes I've made are only to r2d, and yeah it will work with anything. How Binder team decides to utilize that in their UI is not really my choice to make
16:47 pdurbin Sure, that makes total sense. Do your r2d changes require a Dataverse DOI or can it also work with the database id of a dataset?
16:49 xarthisius see
16:51 pdurbin Ok so you handle...
16:51 pdurbin - {siteURL}/dataset.xhtml?persistentId={persistentId}
16:51 pdurbin - {siteURL}/dataset.xhtml?id={datasetId} is not handled.
16:52 xarthisius I can add that
16:52 pdurbin that would be great, I think
16:52 pdurbin Here's a live example:
16:52 xarthisius If there are additional schemes that should be handled let me know
16:52 pdurbin can do!
16:54 pdurbin I think I might add this example to your pull request to try to get some feedback from the Binder folks:
16:55 pdurbin To see how much they object to the query parameters. We can craft that URL today without any modifications to our external tool code.
16:55 pdurbin Or should that conversation happen in a different repo? Perhaps the BinderHub repo?
16:58 pdurbin Maybe this repo that has the front end code in it:
17:18 xarthisius yeah, I think binderhub is better place for that conversation
17:19 pdurbin ok, thanks
17:31 xarthisius BTW, until Dataverse and Binder come up with a robust solution to the problem, I'm happy to host external tools -> binder resolution since we already do that for WT
17:31 pdurbin Oh! That's fantastic! Thank you!
17:31 xarthisius it's just a matter of adding a simple switch on our end and you can have two separate json specifications
17:32 xarthisius one that will point to WT and 2nd that would bounce to binder
17:32 pdurbin Is that what's happening with the Whole Tale external tool right now?
17:32 xarthisius yes
17:33 pdurbin And that resolution service has a url something like ?
17:33 xarthisius yup
17:33 pdurbin perfect
17:34 xarthisius and accepts query params which is all this is about ;)
17:34 pdurbin right, right
17:34 pdurbin When you have an update to the resolution service that I can test, please let me know!
17:35 xarthisius I can do it right away but there's no instance of binder that I can point it to
17:36 pdurbin Sure, but once they merge your pull request and deploy to mybinder there would be.
17:36 xarthisius heh, yes, once the above happens we can deploy the resolver instantly ;)
17:37 pdurbin Nice! And I assume they have some sort of staging environment. Maybe they can let us test a bit.
17:37 xarthisius can external tools add arbitrary query parameters, or is it a white list?
17:37 pdurbin It's a white list. And some are required.
17:37 pdurbin I can't leave out fileId, for example.
17:38 xarthisius I was more interested in adding binder=True, but I'll work it around
17:38 pdurbin You *can* customize the key of the query parameter, if that makes sense.
17:39 xarthisius I'm not sure I understand, can you give me an example?
17:39 pdurbin So you could have datasetIdWT=10 and datasetIdBinder=10
17:40 xarthisius how do I do that?
17:41 pdurbin like this: { "displayName": "Custom Keys", "description": "custom keys", "type": "explore", "toolUrl": "", "contentType": "application/x-ipynb+json", "toolParameters": { "queryParameters": [ { "foo": "{siteUrl}" }, { "bar": "{datasetId}" }, { "baz": "{fileId}" } ] } }
17:42 xarthisius oh! the keys in queryParameters are arbitrary, now I get it
17:42 pdurbin yeah, you'd get something like this as a URL:
17:42 pdurbin I don't know if that helps you or not.
17:43 xarthisius that's enough, I can have empty/non empty check on binder={whatever}
17:43 xarthisius it'll work as good as boolean
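A rough sketch of how the custom keys turn into a URL, and of the empty/non-empty check mentioned above. This is not Dataverse's or Whole Tale's actual code: the `toolUrl` value (empty in the log), the hosts, and the function names `build_tool_url` and `wants_binder` are all hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Manifest modeled on the one pasted above; the toolUrl is a placeholder.
manifest = {
    "displayName": "Custom Keys",
    "type": "explore",
    "toolUrl": "https://tool.example/resolve",  # hypothetical
    "toolParameters": {
        "queryParameters": [
            {"foo": "{siteUrl}"},
            {"bar": "{datasetId}"},
            {"baz": "{fileId}"},
        ]
    },
}


def build_tool_url(manifest, values):
    """Fill each {placeholder} from `values`, keeping the custom key names."""
    pairs = []
    for entry in manifest["toolParameters"]["queryParameters"]:
        for key, placeholder in entry.items():
            name = placeholder.strip("{}")  # "{datasetId}" -> "datasetId"
            pairs.append((key, str(values[name])))
    return manifest["toolUrl"] + "?" + urlencode(pairs)


def wants_binder(url):
    """Treat a non-empty `binder` query parameter as a boolean switch."""
    params = parse_qs(urlsplit(url).query)  # blank values are dropped by default
    return bool(params.get("binder"))


url = build_tool_url(
    manifest,
    {"siteUrl": "https://dv.example", "datasetId": 10, "fileId": 42},
)
print(url)
```

On the receiving side, `wants_binder("https://tool.example/resolve?binder=10")` is true while `?binder=` or a missing key is false, which is the "as good as boolean" behavior described in the chat.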
17:43 pdurbin cool
17:43 pdurbin I love the hacks. Getting things done. :)
17:44 xarthisius feature-driven development ;)
17:44 pdurbin :)
18:05 pboon Help needed with coming to a grinding halt with a CPU load of 400%
18:10 pdurbin xarthisius: for now I left the comment at
18:10 pdurbin pboon: hi! Thanks for joining. Let's see who's around who runs Dataverse in production.
18:10 pdurbin I see andrewSC bricas_ and donsizemore
18:11 pdurbin pboon: you're running Dataverse 4.10.1, slightly forked, right?
18:12 pdurbin I always wonder what changed recently. Is it simply that you're seeing more traffic right now? Everyone is trying to download data at once? Or did you upgrade something?
18:13 pboon more details on!topic/dataverse-community/DLy56gukZ3E
18:14 pdurbin Ah, thanks!
18:14 pdurbin I also asked for reinforcements in Slack just now.
18:14 pboon Nothing was supposed to be changed, nothing happening, but will eyeball the logs again to be sure
18:16 pdurbin pboon: ok. Nice write up. The image didn't come through. Maybe you can upload it to or similar.
18:18 pdurbin We should really start using the "Feature: Performance & Stability" GitHub Issue label more consistently because I'm having trouble finding any specific fixes we made after 4.10.1.
18:21 pdurbin pboon: ah, you resent the graph of memory usage and it came through this time, thanks:
18:22 pdurbin This is interesting: "The weird thing is that it used to run without problems from the time we deployed it on production 2019-05-09 up to 2019-07-10, when we got the near 400% CPU load."
18:23 pboon I waded through all milestones this afternoon
18:23 pdurbin you poor thing
18:24 pboon if it doesn't kill you it makes you stronger...
18:25 pdurbin heh
18:26 pdurbin Some of the performance related pull requests made by donsizemore you should already have since both and shipped with Dataverse 4.9.
18:27 pboon Yes, this was probably related
18:28 pdurbin And I don't know if you're on S3 or not but you should already have from Jim Myers.
18:29 pboon Yes, the ones about 'outputStream' are probably related to the open file descriptor problems, which I don't see now
18:29 pdurbin And you should have from Pete Meyer.
18:30 pboon We are not running on S3, we were just testing the possibility
18:30 pdurbin Ok. Do you have any sort of script in place to restart Glassfish if it stops responding? We do for Harvard Dataverse.
18:32 djbrooke Hey pboon - can you check the logs to see if there are a lot of instances of GET /api/access/datafiles occurring as well?
18:32 pboon As a way of making it usable for users we planned to upgrade memory (so it takes a bit longer to become problematic) and then our sysadmin will have a restart script in place
18:39 pboon Me and my colleague already spent some time looking into the logs, but I will do it again after I downloaded it through my home WiFi
18:39 djbrooke OK, sounds good
18:40 pdurbin pboon: djbrooke is asking because that's the API endpoint that "download all files" or "download some files" buttons use to zip up files and this can be pretty inefficient. I crashed the demo server the other day while trying to download all files from a dataset. They were kind of big files, I guess.
18:40 djbrooke We were seeing a lot of instances of GET /api/access/datafiles ... yeah what pdurbin said
18:41 pdurbin We ended up limiting (for now anyway) "download files as zip" to 10 MB.
18:41 pdurbin :ZipDownloadLimit
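For reference, a minimal sketch of preparing a request that sets `:ZipDownloadLimit`. It assumes the admin settings API is reachable at `http://localhost:8080/api/admin/settings/`, that the setting is expressed in bytes, and that "10 MB" means 10 * 1024 * 1024; the helper name is hypothetical, and the request is only built here, not sent.

```python
from urllib.request import Request


def zip_limit_request(limit_bytes, admin_base="http://localhost:8080"):
    """Build (but do not send) a PUT request for the :ZipDownloadLimit setting."""
    url = f"{admin_base}/api/admin/settings/:ZipDownloadLimit"
    body = str(limit_bytes).encode("ascii")
    return Request(url, data=body, method="PUT")


req = zip_limit_request(10 * 1024 * 1024)  # assumed reading of "10 MB"
print(req.get_method(), req.full_url, req.data)
```

Passing the request to `urllib.request.urlopen` on the server itself would apply the setting, assuming the admin API is only exposed on localhost as is common.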
18:42 djbrooke pboon also if you have some long running ingest jobs, that could be contributing, those persist even after a restart
18:42 pdurbin The climb in memory usage from your graphs seems strangely regular, as if a script is hitting your site every few minutes or something. And strange patterns from certain IP addresses?
18:51 pboon My connection dropped, sorry
18:52 pboon Unless there is some deep insight/breakthrough for my memory consumption problem with Dataverse I will get back tomorrow, Thanks!
18:52 pboon left #dataverse
18:52 djbrooke pboon good to see you mentioned upping the memory above, donsizemore suggested that on the google group
18:53 donsizemore back in the 90s i remember java's memory management algorithm described as: NOM NOM NOM
18:55 pdurbin djbrooke: Paul's connection probably didn't drop. There's a memory leak in the version of Shout I stood up. We should think about upgrading to TheLounge:
19:01 pdurbin donsizemore: hopefully it's a little better these days. :)
19:03 pdurbin donsizemore: also, DataverseNL does have a harvesting set advertised at
19:03 pdurbin since you asked at ... good points about memory
19:05 pdurbin donsizemore: how long does harvesting take? Also, I think Mike just spotted a bug.
19:06 pdurbin donsizemore: do you recognize these curl braces? :)
19:06 pdurbin curly*
19:07 donsizemore i do
19:07 donsizemore but i thought that had been moved into the solrservicebean
19:08 donsizemore oh, wait. you're right =)
19:10 donsizemore this may mean that Odum can turn harvesting back on
19:23 pdurbin That would be nice. I just replied.
19:23 pdurbin How long does harvesting take?
19:24 pdurbin donsizemore: the version you're running has the same bug:
19:27 donsizemore correct
19:27 donsizemore GESIS for instance seemed to harvest against us hourly
19:27 donsizemore it runs for a while, then stops whether or not it completed. i don't remember what decides the cutoff
19:28 pdurbin Well, it should stop when it's done fetching the latest changes, I guess.
19:28 donsizemore it stops before then.
19:29 pdurbin When your installation of Dataverse crashes? :)
19:29 donsizemore we turned off harvesting
19:30 pdurbin and you don't run a fork
19:30 pdurbin which is probably good
19:30 pdurbin but maybe you'll test a patch?
19:30 donsizemore we're running on a patched 4.9.4 now; I suppose we could rebuild but we're planning to upgrade Real Soon Now to... 4.11?
19:31 pdurbin oh ho, so you are running patches, probably for memory leaks
19:31 donsizemore they really want the file hierarchy but we don't want to jump to 4.15 just yet
19:31 donsizemore 4.11 looks pretty safe
19:31 pdurbin those other ones I mentioned earlier, and file descriptor leaks
19:31 donsizemore yes our 4.9.4 warfile includes akio's two fixes we submitted as PRs into 4.10
19:31 pdurbin see, that's what I like... give 'em an incentive to upgrade... new features! :)
19:33 pdurbin Well, if you test a patch and it helps, you and Paul can have a race to the new pull request button.
19:33 donsizemore you mentioned a PR for the harvesting curly. that's something i feel like i can handle :)
19:33 donsizemore except Gustavo didn't want to go that route, he wanted to move the solr client into a service bean
19:43 pdurbin What's more important to me is learning if fixing up that part of the code in either way helps you with your problem of Dataverse crashing when people try to harvest from you.
19:43 pdurbin donsizemore: so if moving the curly is easier for you, please go for it. :)
20:23 pdurbin donsizemore: still there?
20:54 donsizemore @pdurbin back from the gym
20:54 donsizemore just saw your note about password aliases... and i've honestly had enough screwy problems for one day. i'll take a look tomorrow. have a great evening!
21:01 pdurbin I pushed a commit, if it helps. :)
