IRC log for #dataverse, 2018-06-12

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.

All times shown according to UTC.

Time Nick Message
00:06 donsizemore joined #dataverse
03:40 jri joined #dataverse
06:56 jri joined #dataverse
06:59 jri joined #dataverse
09:30 jri joined #dataverse
11:46 jri joined #dataverse
12:10 donsizemore joined #dataverse
12:29 jri joined #dataverse
13:22 amaz17 joined #dataverse
13:49 blavoie joined #dataverse
14:09 pameyer joined #dataverse
14:28 pameyer joined #dataverse
14:30 kevin-w joined #dataverse
14:30 pameyer hi kevin-w
14:30 kevin-w hi pameyer
14:30 kevin-w how's it going?
14:31 pameyer not too bad - how about you?
14:31 pameyer pdurbin's around too
14:31 kevin-w Pretty good over here too. the sun is out and it's quite warm.
14:32 pdurbin pameyer and I decided to get in the same office so we can easily chat on the side
14:32 kevin-w sounds good.
14:32 pameyer mind if I ask a couple background questions?
14:33 kevin-w so, the big question is: what is involved in setting dataverse up so it can upload to different repos?
14:34 kevin-w I guess we can start with, has this been done in a production environment?
14:34 guill joined #dataverse
14:34 pameyer as far as I know, there aren't any production environments that are uploading to different types of storage from the same installation
14:34 kevin-w but it seems the DCM has allowed this
14:35 pameyer the ones that I'm aware of are using a single type of storage; either POSIX/NFS or one of the object stores
14:36 donsizemore @pameyer it's a jvm-option, isn't it?
14:36 guill sorry, missed the beginning. is anyone using Ceph as an object store?
14:37 kevin-w so when using the dcm to upload data, does it load to the same file store as the rest of DV on s3?
14:37 pameyer @kevin-w not quite.  the DCM supports uploads that don't go through glassfish / dataverse app server. it was designed to support multiple transfer and storage protocols
14:39 pameyer @guill I know of people using swift and s3; not 100% sure about ceph.  possibly the openstack folks, possibly the eu folks - but I don't know for certain
14:39 pameyer @donsizemore - there are definitely config options; but at the moment I don't think they allow for multiple options at the same time (that's something we've got in dev at the moment)
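For context, the config option donsizemore and pameyer are referring to is, in the 4.x-era guides, a single installation-wide JVM option; a minimal sketch of switching it, assuming the dataverse.files.storage-driver-id option from those guides (values file, swift, or s3):

    # select the storage backend for the whole installation (the default is "file")
    ./asadmin delete-jvm-options "-Ddataverse.files.storage-driver-id=file"
    ./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=s3"
    # glassfish needs a restart for the change to take effect
    ./asadmin restart-domain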
14:40 pameyer @kevin-w the initial uploads to the DCM land on a temporary filestore.  the checksums get verified, and assuming they pass get moved to the same filestore as the rest of DV
14:40 pdurbin guill: welcome! You can catch up on the conversation at http://irclog.iq.harvard.edu/dataverse/2018-06-12
14:41 guill @pameyer Thanks! We'll try to implement it as it's part of our architecture for various reasons and I'll keep everyone posted. Should not be much different than Swift anyway.
14:41 pameyer one issue currently in dev at IQSS is having the DCM support S3 storage.  the user community we were targeting for the initial dev was using NFS, so we deferred on S3
14:43 amaz61 joined #dataverse
14:43 pameyer are you targeting multiple storage protocols, multiple transfer protocols, or both?
14:44 pdurbin kevin-w: you asked about production. pameyer will be going live with his Dataverse-based solution soon but it's designed after an existing homegrown solution that's been in production for years. That is to say, the existing solution also uses rsync. Here's an example dataset with the homegrown solution that will be migrated soonish: https://data.sbgrid.org/dataset/1/
14:45 pameyer that one will end up looking like https://dv.sbgrid.org/dataset.xhtml?persistentId=doi:10.15785/SBGRID/1
14:46 pameyer that's staging, not production migration
14:48 amaz61 We've been looking at ceph as well, but we have nothing in production.
14:48 kevin-w are you using the s3 protocol with the dcm, pameyer?
14:48 amaz61 Our Swift cluster is in production.
14:49 kevin-w54 joined #dataverse
14:49 pameyer our installation is not using s3
14:49 amaz61 And we could potentially use swift's S3 middleware to emulate
14:49 amaz61 https://docs.openstack.org/mitaka/config-reference/object-storage/configure-s3.html
14:49 kevin-w54 are you using the s3 protocol now with the dcm?
14:50 pameyer @amaz61 I've heard people investigating S3 emulation for swift or ceph
14:50 amaz61 We've got it running on Dev, but not in production.
14:50 pameyer @kevin-w54 no - the installation I'm associated with isn't using s3 at all
14:51 amaz61 It seems like a really good option, but a lot of work needs to be done to make the emulation more complete.
14:51 pameyer the IQSS installation is using S3, but not DCM
14:51 guill @amaz61 we have a ceph cluster for dev right now, we'll do some tests in the coming weeks. Prod cluster will be up later (waiting for network equipment).
14:51 amaz61 Sorry to join late and sorry if I'm re-asking questions here.
14:51 amaz61 nice!
14:51 pameyer @amaz61 in my very limited experience with S3-compatible implementations, there have usually been parts of the protocol not fully implemented
14:52 amaz61 @guill I'd be very interested in your results.
14:52 pameyer although I'm relatively sure that dataverse supports swift, so s3 compatibility might not be needed there
14:52 pdurbin kevin-w54: in fact, for now we say "You cannot use a DCM with non-filesystem storage options such as Swift." at http://guides.dataverse.org/en/4.9/developers/big-data-support.html but the plan is to support S3 (but not Swift) as part of https://github.com/IQSS/dataverse/issues/4703 (in dev). Swift is in the "inbox": https://github.com/IQSS/dataverse/issues/4710
14:53 pdurbin for inbox, backlog, this sprint, dev, code review, and QA, please see https://waffle.io/IQSS/dataverse
14:53 kevin-w54 thanks pdurbin
14:54 guill @amaz61 sure! in fact we're gonna use Ceph as a central datalake, supporting data analysis. We've just finished some benchmarks for Spark against HDFS vs Ceph with Red Hat, it will be published soon. Now the target is to "hook up" dataverse against the lake. Or the other way around.
14:54 amaz61 It would be great for us if the DCM could support swift :)
14:54 pdurbin kevin-w54 amaz61 one thing I forgot to say yesterday is that my usual response to issues like https://github.com/IQSS/dataverse/issues/4439 about non-robust upload is to try Dataverse APIs rather than the GUI. Any thoughts on that?
14:56 guill I guess we'll definitely have to code something around that... We'll try to catch up with all the developments that are mentioned here.
14:56 pdurbin donsizemore: what JVM option? I'm confused.
14:56 amaz61 @pdurbin Will the APIs allow for a more robust upload experience if we built an app around them?
14:56 pameyer @amaz61 one issue I've run into with swift has been trouble setting up a dev environment.  it should be possible, but we'll have a better idea after the DCM S3 is sorted
14:57 amaz61 @pameyer would it help if we gave you access to our swift cluster so you don't have to stand your own up?
14:58 pdurbin amaz61: that's sort of what I'm wondering. What if you were to build an upload client or something. Just a thought. I'd first try curl to see if upload is more robust than using the web gui. :)
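A rough sketch of what "try curl first" might look like against the native API; the server URL, token, and DOI below are placeholders:

    export SERVER_URL=https://dataverse.example.edu
    export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    # add a file to an existing dataset, identified by its persistent id (DOI)
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      -F "file=@bigfile.bin" \
      -F 'jsonData={"description":"Uploaded via the native API"}' \
      "$SERVER_URL/api/datasets/:persistentId/add?persistentId=doi:10.5072/FK2/EXAMPLE"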
14:58 pdurbin knikolla: hey, we're talking swift and stuff if you're around.
14:58 knikolla o/
14:59 pdurbin guill: if it helps, there's a group in France that uses Ceph, but I'm not sure if they're using it for Dataverse: https://groups.google.com/d/msg/dataverse-community/py0UMJV9lDg/kQcO6P51CQAJ
14:59 amaz61 @pdurbin Are there any limitations with upload and the API?
14:59 amaz61 I haven't looked at the APi in over a year
14:59 pameyer @amaz61 - the first obstacle I'd have to sort out is that the project I'm on doesn't have swift as an objective.  but there's a chance it might help the IQSS dev team if swift DCM gets prioritized.  I did some early exploration of swift/s3/object stores in general to see if they made sense for our repository; and ended up with no / not yet
15:00 kevin-w54 i guess the dcm would need to use a different protocol than rsync to be able to upload to Swift
15:00 guill_ joined #dataverse
15:00 pdurbin amaz61: nope. Out of the box there are no limits. It's configurable. The nice thing about using the APIs is that the dataset gets all the normal features of Dataverse such as versioning and the possibility of restricted data.
15:00 pameyer @kevin-w54 - not necessarily.
15:01 amaz61 the other option I mentioned yesterday on the phone call is, we could mount swift as a virtual filesystem. To dataverse, it would be like any mounted file system. But it would be swift under the hood
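One hypothetical shape for amaz61's mount idea, using the s3fs-fuse client against the Swift S3 emulation layer discussed above; the container name, endpoint, and mount point are made up, and a native Swift FUSE client would be the obvious alternative:

    # credentials for the emulated S3 endpoint (made-up values)
    echo "ACCESS_KEY:SECRET_KEY" > ~/.passwd-s3fs && chmod 600 ~/.passwd-s3fs
    # mount the container; Dataverse's files directory would then point here
    s3fs dataverse-files /mnt/dataverse-files \
      -o url=https://swift.example.edu \
      -o use_path_request_style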
15:01 guill_ @pdurbin thanks, will try to get in touch with them
15:01 pameyer the upload protocol and storage protocol can be different; although swift upload / swift storage should be a workable combination
15:02 kevin-w54 good to know
15:02 pdurbin kevin-w54 amaz61: something important you should notice at https://dv.sbgrid.org/dataset.xhtml?persistentId=doi:10.15785/SBGRID/1 is that the filetype is "Dataverse Package". This is something we made up for pameyer's use case where all 360 (or whatever) files are treated as a single "file" in Dataverse.
15:03 kevin-w54 I guess that makes creating metadata a bit easier :)
15:03 amaz61 Ok, that might be the best way forward then. Kevin and I can look at the API and think about how to use them to make an upload client.
15:03 pameyer @pdurbin - good point.  one of the problems we ran into was performance for largish numbers of files on the dataset pages
15:03 amaz61 Will the APIs care what the underlying storage is?
15:03 amaz61 ie. swift for us?
15:04 pameyer one thing about existing APIs - I don't think they support client-side checksums, and they don't preserve file hierarchy.  that may or may not matter depending on the user base
15:04 pdurbin amaz61: nope, the APIs will just put files on whatever storage is configured (filesystem, S3, or Swift)
15:04 amaz61 excellent. And the APIs would allow us versioning and entitlements as well, I'm assuming?
15:04 pdurbin amaz61: again, please try with curl first to see if you think it's a good path forward
15:04 amaz61 @pdurbin Yes, for sure!
15:05 pdurbin Yes, the APIs use all the same code paths (more or less) as the GUI. So files go through "ingest", etc.
15:05 pdurbin You can (newly) restrict files through the API as well.
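The restrict call pdurbin mentions looks roughly like this in the 4.9 native API; the file database id (42) is a placeholder:

    # restrict a file by database id; -d false lifts the restriction
    curl -H "X-Dataverse-key: $API_TOKEN" -X PUT -d true \
      "$SERVER_URL/api/files/42/restrict"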
15:07 amaz61 @pdurbin What're the odds that if we spent some time trying to create a browser upload utility built into dataverse, it would be accepted back into the project?
15:07 pdurbin kevin-w54 amaz61 what about Globus? pameyer went to a conference about it. You said some of your researchers have Globus installed on laptops? Friendlier for researchers on Windows than rsync, I imagine.
15:08 pdurbin amaz61: what technology would you use to build that utility? I'm confused.
15:08 amaz61 @pdurbin Globus is definitely another option. Am I correct in understanding that we'd need to make dataverse a globus endpoint for that to work?
15:08 pdurbin I don't know how Globus works.
15:09 amaz61 @pdurbin I'm hoping something browser based.
15:09 pameyer it's slightly complex, because of differing assumptions between how dataverse assumes file storage is organized, and how globus assumes it's organized
15:09 guill_ @amaz61 we'd be really interested too, as Globus is deployed in all Canadian Universities (provides data transfer for Compute Canada)
15:09 amaz61 @pdurbin If that doesn't work, we could always try a java desktop app or something like that
15:10 amaz61 Who here knows how Globus works?
15:10 pdurbin amaz61: Flash? Silverlight? Java applet?
15:10 pameyer we've got another component that uses globus for data replication; and looked at it for upload
15:10 amaz61 java probably
15:10 pdurbin amaz61: so you'd create a Java applet as the browser-based solution? I thought applets were deprecated.
15:10 pameyer I know enough about globus to manage our endpoint, and do some dev against their SDKs - but I'm very far from expert
15:11 pameyer applets are, but I think JNLP is still ok
15:12 pdurbin I should mention that pameyer and I would like to go to standup in 5 minutes if possible. It's quick. We'll be back.
15:12 amaz61 Cool. talk soon.
15:12 kevin-w54 cool with me too
15:12 pdurbin I don't think I want a Java applet.
15:13 amaz61 @pdurbin What would you want to use?
15:13 pdurbin I'd be fine with a desktop client. It might be nice to bundle all the dependencies into it (JDK or Python or whatever).
15:13 pameyer background question - order of magnitude, how much storage / how many files in a dataset are folks thinking to support?
15:14 pdurbin I don't think people should have to install Java on desktops these days. Unless they want to use a Java IDE.
15:14 amaz61 I was starting to look at libraries like http://www.resumablejs.com/
15:14 pdurbin pameyer: at https://github.com/IQSS/dataverse/issues/4439 it seems like they want to upload ~4 GB files
15:15 amaz61 Several multi-gigabyte files to start
15:15 amaz61 I hate java apps too.
15:15 pdurbin Resumable.js looks interesting
15:15 dataverse-user joined #dataverse
15:24 pameyer joined #dataverse
15:26 pameyer low numbers of multi-GB files, or hundreds/thousands of multi-GB?
15:29 pdurbin back
15:30 pdurbin kevin-w54 amaz61 you might find the "Dataverse Upload/Download Manager" google doc linked from here interesting: https://github.com/IQSS/dataverse/issues/2960#issuecomment-188820984
15:31 pdurbin It looks like he (Bill) was thinking about using Electron but you guys would be welcome to use whatever technology you want.
15:31 pdurbin Bill has moved on but he was involved in early Data Capture Module discussions.
15:32 kevin-w54 cool
15:32 pdurbin We are absolutely aware that the rsync support we have built is not friendly for Windows clients. But again, rsync is a proven technology that pameyer has used in production in his homegrown solution for years.
15:33 pdurbin Anyway, that google doc captures some thinking from a couple years ago at least. Early in this "big data" collaboration.
15:34 kevin-w54 I would like to know the best workflow to take when a user loads data outside of the dataverse UI and needs to connect it to an existing dataset?
15:34 pameyer another factor for us going with rsync vs native APIs was that we wanted things to be scalable; using the native APIs puts load on glassfish (transfer, IO, CPU, RAM)
15:34 pameyer that load could impact the UI, which was something we wanted to avoid
15:35 kevin-w54 In AWS is your glassfish server replicated?
15:35 pdurbin kevin-w54: well, does the non-Dataverse system have the ability to expose the datasets and files via OAI-PMH? If so, your Dataverse installation could harvest from the non-Dataverse installation.
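A quick way to check whether a non-Dataverse system speaks OAI-PMH is to query its endpoint with the standard verbs; the URL here is illustrative:

    curl "https://repository.example.edu/oai?verb=Identify"
    curl "https://repository.example.edu/oai?verb=ListRecords&metadataPrefix=oai_dc"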
15:36 kevin-w54 I'm thinking more like: I have loaded data through SWORD and want to create a link from dataverse
15:38 pdurbin kevin-w54: last I checked there are two Glassfish servers in production on AWS for https://dataverse.harvard.edu . I'm not sure if this is what you mean by replicated. Recently some students were working on scaling Glassfish but in a safe, non-production Kubernetes/OpenShift environment: https://github.com/IQSS/dataverse/issues/4617
15:39 pdurbin kevin-w54: I'm confused. If you upload data to Dataverse via SWORD... Dataverse has the files.
15:40 kevin-w54 yes, but the files were uploaded outside of dataverse using curl, so the workflow would be to create the dataset first, then open your terminal and upload the data
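For reference, the SWORD-style upload kevin-w54 describes is a single curl call per the SWORD v2 docs; the DOI and zip file are placeholders, and the API token goes in as the username with a blank password:

    curl -u $API_TOKEN: \
      --data-binary @files.zip \
      -H "Content-Disposition: filename=files.zip" \
      -H "Content-Type: application/zip" \
      -H "Packaging: http://purl.org/net/sword/package/SimpleZip" \
      "$SERVER_URL/dvn/api/data-deposit/v1.1/swordv2/edit-media/study/doi:10.5072/FK2/EXAMPLE"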
15:41 kevin-w54 regarding the replicated glassfish servers, I'm thinking about the scalability of this system.
15:42 kevin-w54 is two enough? what if the load increases 10x?
15:43 pdurbin Back when we were on physical hardware, we spun up a third Glassfish server.
15:44 pdurbin You have to put some hacks into place to use multiple Glassfish servers. To serve up logos for dataverses, for example.
15:45 pdurbin kevin-w54: I don't understand what you're saying about uploads. Sorry.
15:46 kevin-w54 thanks for the glassfish explanation, makes sense
15:46 pdurbin Sure. Should we wrap up this meeting soonish?
15:47 kevin-w54 regarding uploading files outside of dataverse, I'm just looking to map out a common workflow for how users will load large data files
15:47 kevin-w54 no worries if you have to go. I think we covered a lot during this chat. Thanks for all your help.
15:49 pdurbin I can hang out a while longer. And I'm usually lurking in here. :)
15:51 pdurbin kevin-w54: well, it sort of sounds like you and amaz61 will try uploading large files to Dataverse using curl and if it works well enough, you all may try writing a desktop client. Maybe I missed something.
15:51 kevin-w54 yeah, that's probably the direction we will take.
15:52 kevin-w54 end users will first have to create a dataset and then open the desktop client and upload the data
15:52 pdurbin Ok. Any more thoughts about Globus? It sounds like pameyer isn't seeing his researchers using it on their laptops or desktops much. A couple tickets over the span of years.
15:52 pdurbin kevin-w54: right, they'd copy and paste the DOI into the desktop client
15:53 kevin-w54 I think to get globus to work we'd need to allocate a space for this data outside of dataverse and then create some kind of linking mechanism to it.
15:54 pdurbin Of course amaz61 was talking about some sort of browser-based client. I'm fine with that, especially since he mentioned JavaScript, but please no dead technologies like Java applets, Flash, Silverlight, etc. :)
15:54 kevin-w54 lol, I'm on your side with that
15:55 kevin-w54 ok, thanks again. have a great day.
15:55 pdurbin To me Globus is like Swift. Seems like neat technology but I've never gotten my hands dirty with it.
15:55 pdurbin I hope this was helpful!
15:55 kevin-w54 I think so.
15:55 pdurbin :)
15:56 pdurbin Voice is definitely faster.
15:56 pameyer @kevin-w54 that might be the lowest complexity way to get globus to work, because it gets around the differing assumptions about storage between globus and dataverse
15:56 pdurbin Maybe we can try recording a google hangouts on air in the future instead. or something
16:00 pameyer happy to talk things over - and I'm usually around in this channel
16:08 jri joined #dataverse
16:14 pameyer joined #dataverse
16:38 icarito[m] joined #dataverse
16:57 donsizemore joined #dataverse
17:10 pameyer joined #dataverse
17:29 pameyer joined #dataverse
18:19 pameyer joined #dataverse
18:32 jri joined #dataverse
18:52 pameyer joined #dataverse
18:54 pdurbin In case anyone is feeling meta, I just worked a bit on metrics for this channel for the past year: https://github.com/IQSS/chat.dataverse.org/issues/6 :)
19:09 pameyer joined #dataverse
19:50 soU joined #dataverse
19:51 soU Hello there
19:52 soU Is it possible to stop Ingest in progress?
19:54 pameyer joined #dataverse
20:33 pameyer hi soU: I don't know of one, but pdurbin might
20:39 pdurbin soU: good question. I don't know. I think it goes into a JMS queue. We recently added an API to uningest a file, if that's helpful. It's in Dataverse 4.9: https://github.com/IQSS/dataverse/issues/3766
20:41 pdurbin soU: you're making me wonder how long the ingest has been running. You can limit the size of files for which ingest is attempted: http://guides.dataverse.org/en/4.9/installation/config.html#tabularingestsizelimit
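The two knobs pdurbin points at might be exercised like this; the byte limit and file id are placeholders, and the admin settings endpoint is normally reachable only from localhost:

    # skip tabular ingest for files larger than ~1 GB (value in bytes)
    curl -X PUT -d 1000000000 \
      "http://localhost:8080/api/admin/settings/:TabularIngestSizeLimit"
    # reverse a finished ingest on a single file (added in 4.9), by database id
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      "$SERVER_URL/api/files/42/uningest"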
20:51 dataverse-user joined #dataverse
20:53 dataverse-user Hi all, I am interested in running dataverse locally and I am having trouble when entering SMTP server
20:53 pameyer hi dataverse-user: what kind of problem are you having?
20:54 dataverse-user I ran the installer, and after entering the default Harvard smtp server it gives me this message: Could not establish connection to mail.hmdc.harvard.edu, the address you provided for your Mail server. Please select a valid mail server, and try again.
20:54 dataverse-user I tried localhost, same thing
20:55 pameyer it sounds like you don't have a SMTP server running on localhost - is that correct?
20:55 dataverse-user I do not
20:55 pameyer do you have one available to use on your local network?
20:56 dataverse-user not particularly. in the past I have tried the default email and it worked fine
20:56 dataverse-user by email I mean smtp server
20:57 pameyer possibly a silly question, but are you on the same network as mail.hmdc.harvard.edu?
20:57 dataverse-user I am not
20:58 dataverse-user but in the past I wasn't either. Not sure what happened
20:59 pameyer I don't know the details, but I believe there were some relatively recent changes that dealt with email server handling (to fix some bugs that folks had run into with emails not going out).  it's possible that's related to what you're seeing.
20:59 pameyer which version are you seeing the problem with, and do you remember which version didn't give you the problem?
21:00 dataverse-user I was using 4.8.6, the same as the one I used before
21:00 pameyer ok - that means it's not those changes
21:01 dataverse-user maybe it's the network?
21:02 pameyer it sounds that way to me, especially if the same version is behaving differently
21:03 dataverse-user I was using a Mac. Currently using fedora. Does that make a difference?
21:03 pameyer possibly.  it might be worth checking selinux and/or the system firewall
21:04 pameyer I'm pretty sure that dataverse doesn't get regularly tested on fedora - so if you have a centos system to test on, that might be something to try
21:08 soU pdurbin: i received a message saying it has been running since yesterday. i don't know the exact time. the sizes are 998MB and 532MB
21:08 soU pdurbin thanks for the links
21:12 icarito[m] joined #dataverse
21:32 pdurbin_m joined #dataverse
21:32 pdurbin_m as a temporary measure I would recommend hacking on the install script and disabling the SMTP check.
21:37 pdurbin_m definitely a bug that "harvard" is in there
21:44 pameyer joined #dataverse
22:09 pdurbin Heh, nice tweet: https://twitter.com/ronaldfar/status/1006559300498722818
23:38 icarito[m] joined #dataverse
