IRC log for #dataverse, 2018-06-12

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.

All times shown according to UTC.

Time Nick Message
00:06 donsizemore joined #dataverse
03:40 jri joined #dataverse
06:56 jri joined #dataverse
06:59 jri joined #dataverse
09:30 jri joined #dataverse
11:46 jri joined #dataverse
12:10 donsizemore joined #dataverse
12:29 jri joined #dataverse
13:22 amaz17 joined #dataverse
13:49 blavoie joined #dataverse
14:09 pameyer joined #dataverse
14:28 pameyer joined #dataverse
14:30 kevin-w joined #dataverse
14:30 pameyer hi kevin-w
14:30 kevin-w hi pameyer
14:30 kevin-w how's it going?
14:31 pameyer not too bad - how about you?
14:31 pameyer pdurbin's around too
14:31 kevin-w Pretty good over here too. the sun is out and it's quite warm.
14:32 pdurbin pameyer and I decided to get in the same office so we can easily chat on the side
14:32 kevin-w sounds good.
14:32 pameyer mind if I ask a couple background questions?
14:33 kevin-w so, the big question is: what is involved in setting dataverse up so it can upload to different repos?
14:34 kevin-w I guess we can start with, has this been done in a production environment?
14:34 guill joined #dataverse
14:34 pameyer as far as I know, there aren't any production environments that are uploading to different types of storage from the same installation
14:34 kevin-w but it seems the DCM has allowed this
14:35 pameyer the ones that I'm aware of are using a single type of storage; either POSIX/NFS or one of the object stores
14:36 donsizemore @pameyer it's a jvm-option, isn't it?
14:36 guill sorry, missed the beginning. is anyone using Ceph as an object store?
14:37 kevin-w so when using the dcm to upload data, does it load to the same file store as the rest of DV on s3?
14:37 pameyer @kevin-w not quite.  the DCM supports uploads that don't go through glassfish / dataverse app server. it was designed to support multiple transfer and storage protocols
14:39 pameyer @guill I know of people using swift and s3; not 100% sure about ceph.  possibly the openstack folks, possibly the eu folks - but I don't know for certain
14:39 pameyer @donsizemore - there are definitely config options; but at the moment I don't think they allow for multiple options at the same time (that's something we've got in dev at the moment)
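For context, the config option donsizemore and pameyer are referring to is, in the 4.x-era guides, a single installation-wide JVM option; a minimal sketch of switching it, assuming the dataverse.files.storage-driver-id option from those guides (values file, swift, or s3):

    # select the storage backend for the whole installation (the default is "file")
    ./asadmin delete-jvm-options "-Ddataverse.files.storage-driver-id=file"
    ./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=s3"
    # glassfish needs a restart for the change to take effect
    ./asadmin restart-domain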
14:40 pameyer @kevin-w the initial uploads to the DCM land on a temporary filestore.  the checksums get verified, and assuming they pass get moved to the same filestore as the rest of DV
14:40 pdurbin guill: welcome! You can catch up on the conversation at http://irclog.iq.harvard.edu/dataverse/2018-06-12
14:41 guill @pameyer Thanks! We'll try to implement it as it's part of our architecture for various reasons and I'll keep everyone posted. Should not be much different than Swift anyway.
14:41 pameyer one issue currently in dev at IQSS is having the DCM support S3 storage.  the user community we were targeting for the initial dev was using NFS, so we deferred on S3
14:43 amaz61 joined #dataverse
14:43 pameyer are you targeting multiple storage protocols, multiple transfer protocols, or both?
14:44 pdurbin kevin-w: you asked about production. pameyer will be going live with his Dataverse-based solution soon but it's designed after an existing homegrown solution that's been in production for years. That is to say, the existing solution also uses rsync. Here's an example dataset with the homegrown solution that will be migrated soonish: https://data.sbgrid.org/dataset/1/
14:45 pameyer that one will end up looking like https://dv.sbgrid.org/dataset.xhtml?persistentId=doi:10.15785/SBGRID/1
14:46 pameyer that's staging, not production migration
14:48 amaz61 We've been looking at ceph as well, but we have nothing in production.
14:48 kevin-w are you using the s3 protocol with the dcm, pameyer?
14:48 amaz61 Our Swift cluster is in production.
14:49 kevin-w54 joined #dataverse
14:49 pameyer our installation is not using s3
14:49 amaz61 And we could potentially use swift's S3 middleware to emulate
14:49 amaz61 https://docs.openstack.org/mitaka/config-reference/object-storage/configure-s3.html
14:49 kevin-w54 are you using the s3 protocol now with the dcm?
14:50 pameyer @amaz61 I've heard people investigating S3 emulation for swift or ceph
14:50 amaz61 We've got it running on Dev, but not in production.
14:50 pameyer @kevin-w54 no - the installation I'm associated with isn't using s3 at all
14:51 amaz61 It seems like a really good option, but a lot of work needs to be done to make the emulation more complete.
14:51 pameyer the IQSS installation is using S3, but not DCM
14:51 guill @amaz61 we have a ceph cluster for dev right now, we'll do some tests in the coming weeks. Prod cluster will be up later (waiting for network equipment).
14:51 amaz61 Sorry to join late and sorry if I'm re-asking questions here.
14:51 amaz61 nice!
14:51 pameyer @amaz61 in my very limited experience with S3-compatible implementations, there have usually been parts of the protocol not fully implemented
14:52 amaz61 @guill I'd be very interested in your results.
14:52 pameyer although I'm relatively sure that dataverse supports swift, so s3 compatibility might not be needed there
14:52 pdurbin kevin-w54: in fact, for now we say "You cannot use a DCM with non-filesystem storage options such as Swift." at http://guides.dataverse.org/en/4.9/developers/big-data-support.html but the plan is to support S3 (but not Swift) as part of https://github.com/IQSS/dataverse/issues/4703 (in dev). Swift is in the "inbox": https://github.com/IQSS/dataverse/issues/4710
14:53 pdurbin for inbox, backlog, this sprint, dev, code review, and QA, please see https://waffle.io/IQSS/dataverse
14:53 kevin-w54 thanks pdurbin
14:54 guill @amaz61 sure! in fact we're gonna use Ceph as a central datalake, supporting data analysis. We've just finished some benchmarks for Spark against HDFS vs Ceph with Red Hat, it will be published soon. Now the target is to "hook up" dataverse against the lake. Or the other way around.
14:54 amaz61 It would be great for us if the DCM could support swift :)
14:54 pdurbin kevin-w54 amaz61 one thing I forgot to say yesterday is that my usual response to issues like https://github.com/IQSS/dataverse/issues/4439 about non-robust upload is to try Dataverse APIs rather than the GUI. Any thoughts on that?
14:56 guill I guess we'll definitely have to code something around that... We'll try to catch up with all the developments that are mentioned here.
14:56 pdurbin donsizemore: what JVM option? I'm confused.
14:56 amaz61 @pdurbin Will the APIs allow for a more robust upload experience if we built an app around them?
14:56 pameyer @amaz61 one issue I've run into with swift has been trouble setting up a dev environment.  it should be possible, but we'll have a better idea after the DCM S3 is sorted
14:57 amaz61 @pameyer would it help if we gave you access to our swift cluster so you don't have to stand your own up?
14:58 pdurbin amaz61: that's sort of what I'm wondering. What if you were to build an upload client or something. Just a thought. I'd first try curl to see if upload is more robust than using the web gui. :)
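A rough sketch of what "try curl first" might look like against the native API; the server URL, token, and DOI below are placeholders:

    export SERVER_URL=https://dataverse.example.edu
    export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    # add a file to an existing dataset, identified by its persistent id (DOI)
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      -F "file=@bigfile.bin" \
      -F 'jsonData={"description":"Uploaded via the native API"}' \
      "$SERVER_URL/api/datasets/:persistentId/add?persistentId=doi:10.5072/FK2/EXAMPLE"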
14:58 pdurbin knikolla: hey, we're talking swift and stuff if you're around.
14:58 knikolla o/
14:59 pdurbin guill: if it helps, there's a group in France that uses Ceph, but I'm not sure if they're using it for Dataverse: https://groups.google.com/d/msg/dataverse-community/py0UMJV9lDg/kQcO6P51CQAJ
14:59 amaz61 @pdurbin Are there any limitations with upload and the API?
14:59 amaz61 I haven't looked at the APi in over a year
14:59 pameyer @amaz61 - the first obstacle I'd have to sort out is that the project I'm on doesn't have swift as an objective.  but there's a chance it might help the IQSS dev team if swift DCM gets prioritized.  I did some early exploration of swift/s3/object stores in general to see if they made sense for our repository; and ended up with no / not yet
15:00 kevin-w54 i guess the dcm would need to use a different protocol than rsync to be able to upload to Swift
15:00 guill_ joined #dataverse
15:00 pdurbin amaz61: nope. Out of the box there are no limits. It's configurable. The nice thing about using the APIs is that the dataset gets all the normal features of Dataverse such as versioning and the possibility of restricted data.
15:00 pameyer @kevin-w54 - not necessarily.
15:01 amaz61 the other option I mentioned yesterday on the phone call is, we could mount swift as a virtual filesystem. To dataverse, it would be like any mounted file system. But it would be swift under the hood
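One hypothetical shape for amaz61's mount idea, using the s3fs-fuse client against the Swift S3 emulation layer discussed above; the container name, endpoint, and mount point are made up, and a native Swift FUSE client would be the obvious alternative:

    # credentials for the emulated S3 endpoint (made-up values)
    echo "ACCESS_KEY:SECRET_KEY" > ~/.passwd-s3fs && chmod 600 ~/.passwd-s3fs
    # mount the container; Dataverse's files directory would then point here
    s3fs dataverse-files /mnt/dataverse-files \
      -o url=https://swift.example.edu \
      -o use_path_request_style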
15:01 guill_ @pdurbin thanks, will try to get in touch with them
15:01 pameyer the upload protocol and storage protocol can be different; although swift upload / swift storage should be a workable combination
15:02 kevin-w54 good to know
15:02 pdurbin kevin-w54 amaz61: something important you should notice at https://dv.sbgrid.org/dataset.xhtml?persistentId=doi:10.15785/SBGRID/1 is that the filetype is "Dataverse Package". This is something we made up for pameyer's use case where all 360 (or whatever) files are treated as a single "file" in Dataverse.
15:03 kevin-w54 I guess that makes creating metadata a bit easier :)
15:03 amaz61 Ok, that might be the best way forward then. Kevin and I can look at the API and think about how to use them to make an upload client.
15:03 pameyer @pdurbin - good point.  one of the problems we ran into was performance for largish numbers of files on the dataset pages
15:03 amaz61 Will the APIs care what the underlying storage is?
15:03 amaz61 ie. swift for us?
15:04 pameyer one thing about existing APIs - I don't think they support client-side checksums, and they don't preserve file hierarchy.  that may or may not matter depending on the user base
15:04 pdurbin amaz61: nope, the APIs will just put files on whatever storage is configured (filesystem, S3, or Swift)
15:04 amaz61 excellent. And the APIs would allow us versioning and entitlements as well, I'm assuming?
15:04 pdurbin amaz61: again, please try with curl first to see if you think it's a good path forward
15:04 amaz61 @pdurbin Yes, for sure!
15:05 pdurbin Yes, the APIs use all the same code paths (more or less) as the GUI. So files go through "ingest", etc.
15:05 pdurbin You can (newly) restrict files through the API as well.
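The restrict call pdurbin mentions looks roughly like this in the 4.9 native API; the file database id (42) is a placeholder:

    # restrict a file by database id; -d false lifts the restriction
    curl -H "X-Dataverse-key: $API_TOKEN" -X PUT -d true \
      "$SERVER_URL/api/files/42/restrict"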
15:07 amaz61 @pdurbin What're the odds that if we spent some time trying to create a browser upload utility built into dataverse, it would be accepted back into the project?
15:07 pdurbin kevin-w54 amaz61 what about Globus? pameyer went to a conference about it. You said some of your researchers have Globus installed on laptops? Friendlier for researchers on Windows than rsync, I imagine.
15:08 pdurbin amaz61: what technology would you use to build that utility? I'm confused.
15:08 amaz61 @pdurbin Globus is definitely another option. Am I correct in understanding that we'd need to make dataverse a globus endpoint for that to work?
15:08 pdurbin I don't know how Globus works.
15:09 amaz61 @pdurbin I'm hoping something browser based.
15:09 pameyer it's slightly complex, because of differing assumptions between how dataverse assumes file storage is organized, and how globus assumes it's organized
15:09 guill_ @amaz61 we'd be really interested too, as Globus is deployed in all Canadian Universities (provides data transfer for Compute Canada)
15:09 amaz61 @pdurbin If that doesn't work, we could always try a java desktop app or something like that
15:10 amaz61 Who here knows how Globus works?
15:10 pdurbin amaz61: Flash? Silverlight? Java applet?
15:10 pameyer we've got another component that uses globus for data replication; and looked at it for upload
15:10 amaz61 java probably
15:10 pdurbin amaz61: so you'd create a Java applet as the browser-based solution? I thought applets were deprecated.
15:10 pameyer I know enough about globus to manage our endpoint, and do some dev against their SDKs - but I'm very far from expert
15:11 pameyer applets are, but I think JNLP is still ok
15:12 pdurbin I should mention that pameyer and I would like to go to standup in 5 minutes if possible. It's quick. We'll be back.
15:12 amaz61 Cool. talk soon.
15:12 kevin-w54 cool with me too
15:12 pdurbin I don't think I want a Java applet.
15:13 amaz61 @pdurbin What would you want to use?
15:13 pdurbin I'd be fine with a desktop client. It might be nice to bundle all the dependencies into it (JDK or Python or whatever).
15:13 pameyer background question - order of magnitude, how much storage / how many files in a dataset are folks thinking to support?
15:14 pdurbin I don't think people should have to install Java on desktops these days. Unless they want to use a Java IDE.
15:14 amaz61 I was starting to look at libraries like http://www.resumablejs.com/
15:14 pdurbin pameyer: at https://github.com/IQSS/dataverse/issues/4439 it seems like they want to upload ~4 GB files
15:15 amaz61 Several multi-gigabyte files to start
15:15 amaz61 I hate java apps too.
15:15 pdurbin Resumable.js looks interesting
15:15 dataverse-user joined #dataverse
15:24 pameyer joined #dataverse
15:26 pameyer low numbers of multi-GB files, or hundreds/thousands of multi-GB?
15:29 pdurbin back
15:30 pdurbin kevin-w54 amaz61 you might find the "Dataverse Upload/Download Manager" google doc linked from here interesting: https://github.com/IQSS/dataverse/issues/2960#issuecomment-188820984
15:31 pdurbin It looks like he (Bill) was thinking about using Electron but you guys would be welcome to use whatever technology you want.
15:31 pdurbin Bill has moved on but he was involved in early Data Capture Module discussions.
15:32 kevin-w54 cool
15:32 pdurbin We are absolutely aware that the rsync support we have built is not friendly for Windows clients. But again, rsync is a proven technology that pameyer has used in production in his homegrown solution for years.
15:33 pdurbin Anyway, that google doc captures some thinking from a couple years ago at least. Early in this "big data" collaboration.
15:34 kevin-w54 I would like to know the best workflow to take when a user loads data outside of the dataverse UI and needs to connect it to an existing dataset?
15:34 pameyer another factor for us going with rsync vs native APIs was that we wanted things to be scalable; using the native APIs puts load on glassfish (transfer, IO, CPU, RAM)
15:34 pameyer that load could impact the UI, which was something we wanted to avoid
15:35 kevin-w54 In AWS is your glassfish server replicated?
15:35 pdurbin kevin-w54: well, does the non-Dataverse system have the ability to expose the datasets and files via OAI-PMH? If so, your Dataverse installation could harvest from the non-Dataverse installation.
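A quick way to check whether a non-Dataverse system speaks OAI-PMH is to query its endpoint with the standard verbs; the URL here is illustrative:

    curl "https://repository.example.edu/oai?verb=Identify"
    curl "https://repository.example.edu/oai?verb=ListRecords&metadataPrefix=oai_dc"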
15:36 kevin-w54 I'm thinking more like: I have loaded data through SWORD and want to create a link from dataverse
15:38 pdurbin kevin-w54: last I checked there are two Glassfish servers in production on AWS for https://dataverse.harvard.edu . I'm not sure if this is what you mean by replicated. Recently some students were working on scaling Glassfish but in a safe, non-production Kubernetes/OpenShift environment: https://github.com/IQSS/dataverse/issues/4617
15:39 pdurbin kevin-w54: I'm confused. If you upload data to Dataverse via SWORD... Dataverse has the files.
15:40 kevin-w54 yes, but the files were uploaded outside of dataverse using curl, so the workflow would be to create the dataset first, then open your terminal and upload the data
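For reference, the SWORD-style upload kevin-w54 describes is a single curl call per the SWORD v2 docs; the DOI and zip file are placeholders, and the API token goes in as the username with a blank password:

    curl -u $API_TOKEN: \
      --data-binary @files.zip \
      -H "Content-Disposition: filename=files.zip" \
      -H "Content-Type: application/zip" \
      -H "Packaging: http://purl.org/net/sword/package/SimpleZip" \
      "$SERVER_URL/dvn/api/data-deposit/v1.1/swordv2/edit-media/study/doi:10.5072/FK2/EXAMPLE"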
15:41 kevin-w54 regarding the replicated glassfish servers, I'm thinking about the scalability of this system.
15:42 kevin-w54 is two enough? what if the load increases 10x?
15:43 pdurbin Back when we were on physical hardware, we spun up a third Glassfish server.
15:44 pdurbin You have to put some hacks into place to use multiple Glassfish servers. To serve up logos for dataverses, for example.
15:45 pdurbin kevin-w54: I don't understand what you're saying about uploads. Sorry.
15:46 kevin-w54 thanks for the glassfish explanation, makes sense
15:46 pdurbin Sure. Should we wrap up this meeting soonish?
15:47 kevin-w54 regarding uploading files outside of dataverse, I'm just looking to map out a common workflow for how users will load large data files
15:47 kevin-w54 no worries if you have to go. I think we covered a lot during this chat. Thanks for all your help.
15:49 pdurbin I can hang out a while longer. And I'm usually lurking in here. :)
15:51 pdurbin kevin-w54: well, it sort of sounds like you and amaz61 will try uploading large files to Dataverse using curl and if it works well enough, you all may try writing a desktop client. Maybe I missed something.
15:51 kevin-w54 yeah, that's probably the direction we will take.
15:52 kevin-w54 end users will first have to create a dataset and then open the desktop client and upload the data
15:52 pdurbin Ok. Any more thoughts about Globus? It sounds like pameyer isn't seeing his researchers using it on their laptops or desktops much. A couple tickets over the span of years.
15:52 pdurbin kevin-w54: right, they'd copy and paste the DOI into the desktop client
15:53 kevin-w54 I think to get globus to work we'd need to allocate a space for this data outside of dataverse and then create some kind of linking mechanism to it.
15:54 pdurbin Of course amaz61 was talking about some sort of browser-based client. I'm fine with that, especially since he mentioned JavaScript, but please no dead technologies like Java applets, Flash, Silverlight, etc. :)
15:54 kevin-w54 lol, I'm on your side with that
15:55 kevin-w54 ok, thanks again. have a great day.
15:55 pdurbin To me Globus is like Swift. Seems like neat technology but I've never gotten my hands dirty with it.
15:55 pdurbin I hope this was helpful!
15:55 kevin-w54 I think so.
15:55 pdurbin :)
15:56 pdurbin Voice is definitely faster.
15:56 pameyer @kevin-w54 that might be the lowest complexity way to get globus to work, because it gets around the differing assumptions about storage between globus and dataverse
15:56 pdurbin Maybe we can try recording a google hangouts on air in the future instead. or something
16:00 pameyer happy to talk things over - and I'm usually around in this channel
16:08 jri joined #dataverse
16:14 pameyer joined #dataverse
16:38 icarito[m] joined #dataverse
16:57 donsizemore joined #dataverse
17:10 pameyer joined #dataverse
17:29 pameyer joined #dataverse
18:19 pameyer joined #dataverse
18:32 jri joined #dataverse
18:52 pameyer joined #dataverse
18:54 pdurbin In case anyone is feeling meta, I just worked a bit on metrics for this channel for the past year: https://github.com/IQSS/chat.dataverse.org/issues/6 :)
19:09 pameyer joined #dataverse
19:50 soU joined #dataverse
19:51 soU Hello there
19:52 soU Is it possible to stop Ingest in progress?
19:54 pameyer joined #dataverse
20:33 pameyer hi soU: I don't know of one, but pdurbin might
20:39 pdurbin soU: good question. I don't know. I think it goes into a JMS queue. We recently added an API to uningest a file, if that's helpful. It's in Dataverse 4.9: https://github.com/IQSS/dataverse/issues/3766
20:41 pdurbin soU: you're making me wonder how long the ingest has been running. You can limit the size of files for which ingest is attempted: http://guides.dataverse.org/en/4.9/installation/config.html#tabularingestsizelimit
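The two knobs pdurbin points at might be exercised like this; the byte limit and file id are placeholders, and the admin settings endpoint is normally reachable only from localhost:

    # skip tabular ingest for files larger than ~1 GB (value in bytes)
    curl -X PUT -d 1000000000 \
      "http://localhost:8080/api/admin/settings/:TabularIngestSizeLimit"
    # reverse a finished ingest on a single file (added in 4.9), by database id
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      "$SERVER_URL/api/files/42/uningest"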
20:51 dataverse-user joined #dataverse
20:53 dataverse-user Hi all, I am interested in running dataverse locally and I am having trouble when entering SMTP server
20:53 pameyer hi dataverse-user: what kind of problem are you having?
20:54 dataverse-user I ran the installer, and after entering the default Harvard smtp server it gives me this message: Could not establish connection to mail.hmdc.harvard.edu, the address you provided for your Mail server. Please select a valid mail server, and try again.
20:54 dataverse-user I tried localhost, same thing
20:55 pameyer it sounds like you don't have a SMTP server running on localhost - is that correct?
20:55 dataverse-user I do not
20:55 pameyer do you have one available to use on your local network?
20:56 dataverse-user not particularly. in the past I have tried the default email and it worked fine
20:56 dataverse-user by email I mean smtp server
20:57 pameyer possibly a silly question, but are you on the same network as mail.hmdc.harvard.edu?
20:57 dataverse-user I am not
20:58 dataverse-user but in the past I wasn't either. Not sure what happened
20:59 pameyer I don't know the details, but I believe there were some relatively recent changes that dealt with email server handling (to fix some bugs that folks had run into with emails not going out).  it's possible that's related to what you're seeing.
20:59 pameyer which version are you seeing the problem with, and do you remember which version didn't give you the problem?
21:00 dataverse-user I was using 4.8.6, the same as the one I used before
21:00 pameyer ok - that means it's not those changes
21:01 dataverse-user maybe it's the network?
21:02 pameyer it sounds that way to me, especially if the same version is behaving differently
21:03 dataverse-user I was using a Mac. Currently using fedora. Does that make a difference?
21:03 pameyer possibly.  it might be worth checking selinux and/or the system firewall
21:04 pameyer I'm pretty sure that dataverse doesn't get regularly tested on fedora - so if you have a centos system to test on, that might be something to try
21:08 soU pdurbin: i received a message saying it has been running since yesterday. i don't know the exact time. the sizes are 998MB and 532MB
21:08 soU pdurbin thanks for the links
21:12 icarito[m] joined #dataverse
21:32 pdurbin_m joined #dataverse
21:32 pdurbin_m as a temporary measure I would recommend hacking on the install script and disabling the SMTP check.
21:37 pdurbin_m definitely a bug that "harvard" is in there
21:44 pameyer joined #dataverse
22:09 pdurbin Heh, nice tweet: https://twitter.com/ronaldfar/status/1006559300498722818
23:38 icarito[m] joined #dataverse
