Time |
S |
Nick |
Message |
01:52 |
|
|
axfelix joined #dvn |
02:00 |
|
|
axfelix joined #dvn |
03:18 |
|
|
axfelix joined #dvn |
03:24 |
|
|
axfelix joined #dvn |
03:35 |
|
|
axfelix joined #dvn |
03:56 |
|
|
garnett joined #dvn |
12:00 |
|
|
ruebot joined #dvn |
12:12 |
|
pdurbin |
sbmarks: when you have a moment |
12:13 |
|
pdurbin |
yoh_: welcome! I don't think I've seen you here before |
13:22 |
|
sbmarks |
pdurbin: yo |
13:24 |
|
pdurbin |
sbmarks: you said "you could almost provide a CVS-like model for datasets" at https://groups.google.com/d/msg/dataverse-community/Q0HOKTR80rM/US3jDWlO1C0J |
13:25 |
|
sbmarks |
let's see....that sounds like something I would say ;) |
13:25 |
|
sbmarks |
ah yes |
13:26 |
|
sbmarks |
pdurbin: this of course relates to the current conversation on the list? |
13:29 |
|
pdurbin |
sbmarks: sure does |
13:30 |
|
pdurbin |
sbmarks: dunno if you've had a chance to check out my thought experiment |
13:30 |
|
|
LyndsySimon joined #dvn |
13:30 |
|
pdurbin |
thought experiment: datasets as git repos: https://docs.google.com/document/d/18WDIS8hrFJvMJBcnRuQ8NfD-VxGq32vJ9WwlEgyyWZs/edit?usp=sharing |
13:31 |
|
sbmarks |
i haven't yet, but now is as good a time as any! |
13:31 |
|
pdurbin |
:) |
13:31 |
|
|
jwhitney joined #dvn |
13:32 |
|
pdurbin |
something new (to me) is http://dataprotocols.org/data-packages/ ... Dublin Core is even mentioned |
13:32 |
|
pdurbin |
datapackage.json, etc. I'm not sure what do make of it |
13:33 |
|
pdurbin |
s/do/to/ |
13:33 |
|
pdurbin |
jwhitney: mornin'. meeting today, I hear |
13:33 |
|
jwhitney |
pdurbin: yessir. |
13:34 |
|
pdurbin |
top o' the mornin', I mean |
13:44 |
|
* pdurbin |
checks out "Archiving Reproducible Research with R and Dataverse by Thomas J. Leeper" at http://journal.r-project.org/archive/accepted/leeper.pdf via https://twitter.com/ChrisGandrud/status/445475741158088705 |
13:47 |
|
sbmarks |
ohhhhhh my |
13:47 |
|
sbmarks |
that dvn R package is really interesting! |
13:47 |
|
sbmarks |
i have had folks ask about this |
13:47 |
|
pdurbin |
sbmarks: see also rOpenSci - dvn - Sharing Reproducible Research from R - http://ropensci.org/blog/2014/02/20/dvn-dataverse-network/ |
13:52 |
|
yoh_ |
pdurbin: Hello ;) yes -- I (Yaroslav) am new here -- took your invitation and joined the room ;) |
13:53 |
|
pdurbin |
yoh_: oh! hi! thanks for all the links! I just added comment about datapackage.json to the Google Doc |
13:53 |
|
pdurbin |
yoh_: I also added the logo I created over the weekend to the bottom, even though you don't like my daggers ;) |
13:55 |
|
yoh_ |
I just like squares more ;) |
13:56 |
|
pdurbin |
yoh_: the squares in the qr code at http://datagit.org ? |
13:57 |
|
pdurbin |
(or https://github.com/data-git ) |
14:00 |
|
yoh_ |
yeah ;) |
14:00 |
|
pdurbin |
bleh! ;) |
14:01 |
|
pdurbin |
maybe there's something in that rorschach test I'm not seeing :) |
14:02 |
|
pdurbin |
yoh_: the specification for http://dataprotocols.org/data-packages/ really goes back to "12 November 2007"? |
14:08 |
|
yoh_ |
might well be, remember Matthew Brett talking about some of ideas a while back |
14:08 |
|
pdurbin |
nice |
14:08 |
|
pdurbin |
well, I'm poking through the issue tracker |
14:08 |
|
pdurbin |
this is right up our alley: Encapsulated extensibility · Issue #103 · dataprotocols/dataprotocols - https://github.com/dataprotocols/dataprotocols/issues/103 |
14:08 |
|
pdurbin |
"A Data Package author MAY add any number of additional fields beyond those listed in the specification here." |
14:10 |
|
pdurbin |
we have all sorts of fields in Dataverse-land. we started with social science and astrophysics (big announcement today, by the way) and now we're trying to expand into biomed, generally: http://thedata.org/blog/major-dataverse-release-coming-spring-2014 |
14:16 |
|
pdurbin |
and other fields/domains in the future. we're trying to stay flexible |
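The Data Package rule quoted above (an author MAY add fields beyond those in the specification) can be sketched roughly as follows. This is only an illustration: `make_descriptor` is a hypothetical helper, and the extra field name `x_dataverse_subject` is invented for the example, not something Dataverse or the specification defines. Only `name` and `resources` are taken from the specification.

```python
import json


def make_descriptor(name, resource_paths, extra_fields=None):
    """Build a datapackage.json-style descriptor as a plain dict.

    `name` and `resources` come from the Data Package specification;
    everything in `extra_fields` is domain-specific and optional.
    """
    descriptor = {
        "name": name,
        "resources": [{"path": path} for path in resource_paths],
    }
    for key, value in (extra_fields or {}).items():
        # Refuse to silently overwrite a field defined above.
        if key in descriptor:
            raise ValueError(f"extra field {key!r} clashes with a core field")
        descriptor[key] = value
    return descriptor


# Hypothetical extra field, made up for this sketch.
descriptor = make_descriptor(
    "example-dataset",
    ["data/survey.csv"],
    {"x_dataverse_subject": "Social Sciences"},
)
print(json.dumps(descriptor, indent=2))
```

Keeping the extra fields in a separate argument makes it easy to see which parts of a descriptor are domain-specific additions.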
14:28 |
|
skay |
yoh_: you were in the cc of the email conversation about data versioning? |
14:28 |
|
* skay |
is trying to link nicks to conversations I've had
14:29 |
|
* skay |
is Sheila |
14:30 |
|
skay |
pdurbin: join the mailing list for data protocols and start a conversation with them about this. I want to see if they will start discussing this more |
14:30 |
|
skay |
I also think some of them might idle in one of the okfn irc channels, but I am not certain about that
14:30 |
|
pdurbin |
skay: you're not the boss of me ;) |
14:30 |
|
* skay |
is not the boss of you! |
14:31 |
|
pdurbin |
heh |
14:31 |
|
pdurbin |
skay: I'll see what I can do. Can't get too mired in all this. Gotta ship. |
14:31 |
|
skay |
yeah I know that feeling :( |
14:31 |
|
skay |
I guess it can wait. It has waited for centuries! |
14:31 |
|
pdurbin |
skay: it's been worked on since 2007 apparently :) |
14:32 |
|
skay |
I was disappointed not to see more work with bibjson, or maybe it is just not surfaced |
14:32 |
|
pdurbin |
javaeebot: lucky bibjson |
14:32 |
|
javaeebot |
pdurbin: http://www.bibjson.org/ |
14:32 |
|
pdurbin |
"BibJSON is a convention for representing bibliographic metadata in JSON; it makes it easy to share and use bibliographic metadata online." |
14:32 |
|
pdurbin |
sounds good |
14:32 |
|
skay |
when I asked about it on a mailing list, they did give a timely reply |
14:34 |
|
pdurbin |
timely is good |
14:34 |
|
skay |
I was starting to think of creating citation information for a compendia -- and there are no exact standards for representing src or data... so on the one side I was looking through conventions with bibtex fields, the xml schema for datacite.org, conversations with Victoria... |
14:34 |
|
skay |
and on the other side wondering how I should represent these in json |
14:34 |
|
skay |
blargh all in my head with no other programmers to bounce ideas off of |
14:35 |
|
skay |
that means I will make ill-considered things because it is better to have thought experiments with more than just one person |
14:35 |
|
pdurbin |
skay: we're definitely basing our new work on datacite |
14:36 |
|
skay |
oh, and then I had to completely drop the conversation with myself to go focus on deliverables with higher priority |
14:37 |
|
skay |
btw squirrel: I wish there were more #openhatch friendly FLOSS java projects which is why I pinged you there |
14:37 |
|
pdurbin |
today? |
14:37 |
|
skay |
there are way more friendly python projects I know about than java projects, but we do have people looking for java projects from time to time |
14:37 |
|
skay |
no, a day or two ago? |
14:37 |
|
pdurbin |
right right |
14:38 |
|
skay |
I didn't want to bug you too much about it. I was wondering if you would be interested at some point in having some openhatch type of labels on things. |
14:39 |
|
skay |
it is tricky since shepherding new contributors can take time away from shipping
14:39 |
|
pdurbin |
amen |
14:40 |
|
skay |
so I don't want to just drop people on you... but if you have some tasks that are independent and don't soak up time it could be cool |
14:40 |
|
skay |
but that is just a thought experiment for the future |
14:41 |
|
skay |
datacite -- arg everyone wants to talk xml at each other |
14:41 |
|
skay |
I know it is so well structured but I hate it |
14:42 |
|
skay |
oh hey, I was also looking at the schema that crossref uses for describing citations |
14:42 |
|
skay |
which is just a few things on top of unixref something or the other |
14:43 |
|
skay |
because I wanted to be able to talk to crossref (perhaps deposit, perhaps only query) |
14:43 |
|
skay |
so it is another schema you could look at, but not as rich as the datacite one |
14:44 |
|
skay |
people should take a look at all these schema to gather together some nice commonalities |
14:44 |
|
skay |
also, it would be nice to see how things get used in the wild so that one knows how much of the work that goes in to schemas is YAGNI versus actually used |
14:45 |
|
skay |
because you don't want to consume cycles on working with things that don't get used |
14:45 |
|
skay |
anyway, I should also go away-from-window to focus on work work |
14:47 |
|
|
axfelix joined #dvn |
14:56 |
|
|
LyndsySimon joined #dvn |
15:02 |
|
pdurbin |
Linus had a nice anti-XML rant the other day: https://plus.google.com/+LinusTorvalds/posts/X2XVf9Q7MfV |
15:37 |
|
pdurbin |
I can't make any sense of http://www.crossref.org/schema/documentation/unixref1.0/unixref.html |
15:38 |
|
pdurbin |
handy link to get to their home page: http://crossref.org |
15:42 |
|
pdurbin |
hmm. "DiffKit is an application, and a framework, for comparing two tables of data, field-by-field" -- http://www.diffkit.org via https://twitter.com/thosjleeper/status/445582381488287744 |
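To make "comparing two tables of data, field-by-field" concrete, here is a rough sketch of the idea. It is not DiffKit's implementation or API; `diff_tables` is a hypothetical function, and it assumes each table is a list of dicts that share a unique key column.

```python
def diff_tables(old_rows, new_rows, key):
    """Compare two tables (lists of dicts) row by row, matched on `key`.

    Returns which keys were added, which were removed, and, for rows
    present in both tables, which fields changed as (old, new) pairs.
    """
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    added = sorted(new_by_key.keys() - old_by_key.keys())
    removed = sorted(old_by_key.keys() - new_by_key.keys())

    changed = {}
    for row_key in sorted(old_by_key.keys() & new_by_key.keys()):
        old, new = old_by_key[row_key], new_by_key[row_key]
        field_changes = {
            field: (old.get(field), new.get(field))
            for field in sorted(set(old) | set(new))
            if old.get(field) != new.get(field)
        }
        if field_changes:
            changed[row_key] = field_changes

    return {"added": added, "removed": removed, "changed": changed}


old_table = [{"id": "1", "score": "10"}, {"id": "2", "score": "20"}]
new_table = [{"id": "1", "score": "11"}, {"id": "3", "score": "30"}]
result = diff_tables(old_table, new_table, key="id")
print(result)
```

A line-oriented diff would report whole rows as changed; matching on a key column is what lets the comparison point at the individual field.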
15:42 |
|
pdurbin |
"DiffKit is like the Unix diff utility, but for tables instead of lines of text." |
15:43 |
|
yoh_ |
skay: yes -- that was me (sorry for being lousy with my replies) |
15:49 |
|
pdurbin |
skay: thanks for all the comments at https://docs.google.com/document/d/18WDIS8hrFJvMJBcnRuQ8NfD-VxGq32vJ9WwlEgyyWZs/edit?usp=sharing ! |
16:19 |
|
skay |
pdurbin: bd |
16:21 |
|
skay |
pdurbin: remember when I mentioned sumatra? I don't know if you checked in to it, but I will try a quicker run through of how someone would use it. as a user I'd set up a git repo with my data analysis code, and arrange it so that there is a directory where the results get dumped, call it Data/ and perhaps data is not tracked by git
16:21 |
|
skay |
I would trigger a run using a sumatra command which calls out to my code, and sumatra then captures things about the run and the data
16:21 |
|
skay |
it creates a hash, it stores dependencies (for python and R and a few other languages) |
16:21 |
|
skay |
and it stores the parameters that were passed for the run
16:22 |
|
skay |
and the system that the job runs on |
16:22 |
|
skay |
so this is all metadata that can be important as provenance information |
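The kind of provenance record described above can be sketched with the Python standard library alone. This is not Sumatra's API or data model: `capture_run` and the record layout are hypothetical, and dependency tracking (which Sumatra is said to do) is left out to keep the sketch short.

```python
import hashlib
import platform
import sys
import tempfile
from pathlib import Path


def sha1_of_file(path):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_run(parameters, output_dir):
    """Record the parameters, output hashes, and system for one run."""
    output_dir = Path(output_dir)
    outputs = {
        path.relative_to(output_dir).as_posix(): sha1_of_file(path)
        for path in sorted(output_dir.rglob("*"))
        if path.is_file()
    }
    return {
        "parameters": dict(parameters),           # what was passed to the run
        "outputs": outputs,                       # hash of every result file
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),          # system the job ran on
    }


# Stand-in for a real analysis run that dumps results into Data/.
with tempfile.TemporaryDirectory() as workdir:
    data_dir = Path(workdir) / "Data"
    data_dir.mkdir()
    (data_dir / "result.txt").write_bytes(b"hello")
    record = capture_run({"alpha": 0.5}, data_dir)

print(record["parameters"], sorted(record["outputs"]))
```

Hashing the output files means the record can later confirm that a given result file is the one a particular run produced, even if Data/ itself is not tracked by git.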
16:22 |
|
skay |
I want to use it for researchcompendia because I want to track that data |
16:22 |
|
pdurbin |
ah |
16:23 |
|
pdurbin |
this is helping me understand what sumatra is |
16:24 |
|
* skay |
that time on the desktop when you forget that focus does not follow eyes and hit ctl-p to navigate to the channel and instead pull up multiple browserconfig.properties |
16:25 |
|
skay |
also my focus follows mouse stopped being set! arg! |
16:25 |
|
|
LyndsySimon joined #dvn |
16:27 |
|
pdurbin |
skay: focus! |
16:47 |
|
|
axfelix joined #dvn |
17:12 |
|
yoh_ |
skay, pdurbin: re sumatra -- might be of interest then for you guys a recent paper: http://journal.frontiersin.org/Journal/10.3389/fninf.2013.00044/full |
17:13 |
|
pdurbin |
yoh_: IPython! yes! please check this out: Create IPython Notebook for Dataverse APIs · Issue #6 · IQSS/dvn-client-python - https://github.com/IQSS/dvn-client-python/issues/6 |
17:14 |
|
yoh_ |
I was pointing more to lancet ;-) |
17:15 |
|
skay |
pdurbin: oh hey! Victoria is teaching at Berkeley this semester and has given a talk, I think, to Fernando's group -- and I pitched the idea that some of us should meet to do some hacking with ipython and stuff
17:15 |
|
skay |
pdurbin: well I don't know if that will happen for me, but I am pitching the idea to you in case you can do it. I can enjoy it vicariously |
17:16 |
|
* pdurbin |
reads the title again: An automated and reproducible workflow for running and analyzing neural simulations using Lancet and IPython Notebook |
17:16 |
|
pdurbin |
"Launch jobs, organize the output, and dissect the results." -- http://ioam.github.io/lancet/ |
17:16 |
|
skay |
yoh_: do you use neural ensemble tools? this blog post is really cool http://neuralensemble.blogspot.com/2014/03/docker-image-to-run-software-neural_3.html |
17:17 |
|
skay |
because I want to dockerize papers for the tool I am working on. this means I am definitely enjoying that people are doing similar things out in the wild
17:17 |
|
skay |
http://bcbio.wordpress.com/2014/03/06/improving-reproducibility-and-installation-of-genomic-analysis-pipelines-with-docker/ |
17:17 |
|
skay |
sorry, squirrel |
17:18 |
|
yoh_ |
I am in neuroimaging, so do not use neural ensemble -- just maintain few (e.g. brian) for debian |
17:18 |
|
skay |
oh neat! you are a debian maintainer. bd.
17:18 |
|
skay |
bd is two thumbs up |
17:19 |
|
* pdurbin |
had forgotten ;) |
17:19 |
|
skay |
arg, I see I bookmarked the lancet paper 13 days ago and have not read it yet |
17:20 |
|
skay |
and... I wish everyone working on these toolchains could join forces and collaborate on solving the problems together
17:21 |
|
pdurbin |
skay: #dvn is becoming #hackingscience, which I'm still willing to log for you :) |
17:22 |
|
skay |
pdurbin: I think you can, okay. though to be honest it is mostly a travis bot |
17:22 |
|
skay |
I like your logs |
17:23 |
|
pdurbin |
me too |
17:23 |
|
skay |
handy for crossing the streams. https://lists.okfn.org/pipermail/data-protocols/2014-March/000089.html |
17:23 |
|
pdurbin |
travis bot? for builds of what? |
17:23 |
|
skay |
researchcompendia, which does not have nearly enough tests I am embarrassed to say
17:24 |
|
pdurbin |
meh. send that bot to #researchcompendia then |
17:24 |
|
skay |
good point |
17:25 |
|
pdurbin |
bd |
17:26 |
|
skay |
with hackingscience the idea was that Victoria was going to set up a blog/mailinglist/something for people to have technical discussions for writing toolchains for reproducible science |
17:26 |
|
skay |
but then we are too busy to maintain a blog, etc. so maybe I should not have bothered to create a related irc channel |
17:26 |
|
skay |
wishful thinking |
17:27 |
|
pdurbin |
it's a good idea. I like big tents |
17:36 |
|
skay |
pdurbin: yoh_: Rufus from okfn replied and mentioned #hyperdata as a place where people talk about data protocols and frictionless data https://lists.okfn.org/pipermail/data-protocols/2014-March/000090.html |
17:37 |
|
skay |
and hopefully some people will check your googledoc and send you comments now that I've mentioned it on the list |
17:37 |
|
|
axfelix joined #dvn |
17:37 |
|
skay |
just giving you a heads up |
17:45 |
|
pdurbin |
skay: thanks for linking to it |
17:47 |
|
|
jwhitney joined #dvn |
18:17 |
|
|
LyndsySimon joined #dvn |
18:52 |
|
|
LyndsySimon joined #dvn |
18:52 |
|
pdurbin |
I'm realizing that a link to http://blog.okfn.org/2013/07/02/git-and-github-for-data/ was sent around in an internal list shortly after it was posted but I didn't remember it. Found it by searching my mail.
18:55 |
|
pdurbin |
"Line-oriented text and its tools are, of course, far from perfect solutions to data storage and versioning. They will not work for datasets of every shape and size" |
18:55 |
|
pdurbin |
yeah |
18:56 |
|
pdurbin |
plenty of our data is not in CSV format |
18:56 |
|
pdurbin |
Stata files, FITS files, etc. |
19:07 |
|
sbmarks |
yeah, but |
19:07 |
|
sbmarks |
many of those are parseable into subsettable files |
19:07 |
|
sbmarks |
which are then exportable as CSV right? |
19:08 |
|
sbmarks |
(or as something a little more semantically meaningful than CSV) |
19:23 |
|
pdurbin |
sbmarks: I'm not the expert on this, but you're right. We "ingest" certain formats and make them available as CSV (or TSV or whatever). |
19:44 |
|
pdurbin |
sbmarks: are you suggesting we put the CSV version in regular git and the binary version in something like git-annex? |
19:55 |
|
pdurbin |
(dunno if you're familiar with git-annex) |
20:10 |
|
sbmarks |
pdurbin: I guess I'm saying it could be one way to allow some non-CSV formats to be stored in the same way as line-oriented stuff |
20:12 |
|
pdurbin |
yeah |
20:13 |
|
pdurbin |
I wonder if git is an all or nothing proposition |
20:13 |
|
sbmarks |
I'm hoping not =) I am not yet sold on the utility of it, tbh |
20:13 |
|
pdurbin |
or if we'd let the dataset author enable git for their dataset |
20:14 |
|
pdurbin |
sbmarks: not sold on git for versioning? not interested in `git clone`-ing datasets? |
20:14 |
|
sbmarks |
pdurbin: I think it sounds great for people like us who use git a lot....but a lot of researchers don't |
20:15 |
|
sbmarks |
I'm just worried about implementing things that might clash with established research workflows is all |
20:16 |
|
sbmarks |
i for sure like the idea of an option, but I also think versioning as done currently is pretty OK |
20:17 |
|
pdurbin |
sbmarks: so you're not so into papers like this: Source Code for Biology and Medicine | Full text | Git can facilitate greater reproducibility and increased transparency in science - http://www.scfbm.org/content/8/1/7 |
20:17 |
|
pdurbin |
not that I'm asking you to be :) |
20:18 |
|
sbmarks |
hahaha |
20:21 |
|
sbmarks |
I guess I would say: I can understand the logic that git promotes reproducibility. but I don't think it's a cure all. I think there are a lot of other pieces that are far more gaping holes than being able to clone the data. We can already somewhat easily get the data. I worry about reproducibility of workflows, for example, which this doesn't necessarily address.
20:22 |
|
sbmarks |
(Granted, yes, it could if those are represented in a git-able manner) |
20:23 |
|
pdurbin |
sbmarks: right. OSF is good for workflows: Open Science Framework | Home - https://osf.io |
20:26 |
|
sbmarks |
I guess.....I'm just a bit confused as to the value-added of doing it the git way. And it's probably me failing to fully get it |
20:26 |
|
pdurbin |
"git it" ;) |
20:26 |
|
sbmarks |
b ^_^ |
20:27 |
|
sbmarks |
i mean, I am Joe Researcher. I can already "clone" data from a DV by downloading it, now I can do it with the API, or even through that R module |
20:28 |
|
sbmarks |
and versioning is already supported, although maybe the point is it could possibly be more efficient |
20:29 |
|
sbmarks |
unless the idea is that this would standardize the way these things are done, which would be a good thing |
20:30 |
|
pdurbin |
sbmarks: sure. via the SWORD API in Dataverse. wouldn't it be cool if RStudio supported SWORD? See my post here: [sword-app-tech] Using SWORD via R - http://www.mail-archive.com/sword-app-tech@lists.sourceforge.net/msg00344.html |
20:30 |
|
pdurbin |
Rstudio *does* support git, which is kind of what this paper is about (mentions Dataverse): GitHub: A Tool for Social Data Set Development and Verification in the Cloud by Christopher Gandrud :: SSRN - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2199367 |
20:32 |
|
pdurbin |
sbmarks: so yeah, for me anyway, this is about standards. and not re-inventing the wheel, reinventing versioning |
20:32 |
|
sbmarks |
fair enough. I'm not trying to be a Debbie Downer or anything. |
20:32 |
|
pdurbin |
sbmarks: heh. no worries :) |
20:32 |
|
sbmarks |
totally |
20:33 |
|
sbmarks |
at the end of the day, standardization and new features are good. =) I do like anything that's going to make Dataverse more of a part of the day to day research workflow |
20:34 |
|
sbmarks |
so now that I'm reading this paper I can see some value in having a central place to check out the data set from, and to commit back to, with a way to merge versions and etc. |
20:34 |
|
sbmarks |
so I guess: http://i3.kym-cdn.com/photos/images/original/000/138/244/funny-barack-michelle-obama-face.jpg |
20:37 |
|
pdurbin |
heh |
20:37 |
|
sbmarks |
thanks for bringing me along |
20:38 |
|
pdurbin |
well don't get excited. dunno if we can actually ship any git support |
20:38 |
|
pdurbin |
it's just me making noise right now |
20:38 |
|
sbmarks |
oh totally |
20:38 |
|
sbmarks |
i just hate feeling left out of the cool kids conversation |
20:38 |
|
pdurbin |
dream a little dream |
20:38 |
|
sbmarks |
I'm still working on my magnum opus metadata email reply |
20:39 |
|
pdurbin |
! |
20:39 |
|
pdurbin |
looking forward to it |
20:39 |
|
sbmarks |
i no rite |
20:40 |
|
pdurbin |
sbmarks: oh, you might like this... now a Google Doc is the source of truth about metadata. we export 4 tsv files. then we `curl` those tsv files into an API endpoint to populate our metadata blocks
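As a rough sketch of the first half of that pipeline, here is how one exported TSV file might be read into field definitions before being sent anywhere. The column names (`name`, `title`, `fieldType`) and the sample rows are assumptions made up for the example; the real TSV layout and the API endpoint are not described in this log, so the upload step is left out.

```python
import csv
import io


def parse_metadata_block(tsv_text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical export; the real columns may differ.
SAMPLE_TSV = (
    "name\ttitle\tfieldType\n"
    "title\tTitle\ttext\n"
    "author\tAuthor\ttext\n"
)

fields = parse_metadata_block(SAMPLE_TSV)
for field in fields:
    print(field["name"], field["fieldType"])
```

Parsing the file locally first gives a cheap place to catch a malformed row before it reaches the server.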
20:40 |
|
pdurbin |
"blocks"... that's what we call them |
20:40 |
|
sbmarks |
ha! |
20:40 |
|
sbmarks |
I like the "blocks" |
20:40 |
|
sbmarks |
though I would love if there were user-serviceable blocks too =)
20:41 |
|
pdurbin |
sbmarks: well, you could hack on the tsv files but... |
20:41 |
|
pdurbin |
you break it you buy it :) |
20:41 |
|
sbmarks |
riiiiiiight |
20:41 |
|
sbmarks |
hence my desire for user serviceable =)
20:43 |
|
pdurbin |
:) |
20:45 |
|
sbmarks |
in all seriousness, I think you guys are doing really great work right now, fwiw |
20:45 |
|
sbmarks |
it's nice because it syncs up pretty well with an uptick we've seen in DV use, so yay |
21:07 |
|
* pdurbin |
blushes |
21:08 |
|
pdurbin |
definitely a team effort: http://www.iq.harvard.edu/people/filter_by/staff/data-science |
21:16 |
|
|
axfelix joined #dvn |
23:42 |
|
|
axfelix joined #dvn |