>> From the Library of Congress in Washington DC. ^M00:00:05 [ Silence ] ^M00:00:16 >> Today, we're very pleased to have Sayeed Choudhury, who is the Associate Dean for Library Digital Programs and the Hodson Director of the Digital Research and Curation Center at the Sheridan Libraries, Johns Hopkins University. >> You got it right. [ Laughter ] >> That's almost as bad. >> It's almost as bad thing. [ Laughter ] >> He serves as the Principal Investigator for a number of projects funded by the National Science Foundation, the Institute of Museum and Library Services, and the Mellon Foundation. And closest to my heart is the fact that Sayeed worked on the Archive Ingest and Handling Test a few years back, which was among the earliest-- >> That's right. >> --NDIIPP projects and one of [inaudible] still, I think, one of the most interesting in terms of findings. But what he's gonna talk to us about today is his project with the National Science Foundation, the Data Conservancy, which is an award under the NSF DataNet program, which is a very significant and substantial effort on the part of the National Science Foundation regarding the long-term preservation of and access to scientific data. And what's interesting, and I'm sure Sayeed will expand on this, is that the Data Conservancy is looking to avoid developing what they call a "rigid road map" and look instead to "principles of navigation," which I think is very wise. Sometimes projects will attempt to put a square peg in a round hole. In this case, it looks like they are trying to figure out what the shape of the hole is first and then trying to figure out what should, or could, fit into it. We don't want to belabor. >> Thank you, Bill. That's actually perfect. Not just the shape but the size; we think it's quite large, actually, at this point. It's a pleasure to be here. I wanna thank you for taking time out of your day to hear what I have to say today about the Data Conservancy. 
In fact, we just had our All Hands Meeting last week, so a lot of the discussions that took place during that meeting are still resonating in my mind. A couple of those thoughts I've already incorporated into this particular presentation, and I hope to go through them. So the first thing that I always talk about when thinking about the Data Conservancy is, what do we mean by data curation? Bill mentioned the DataNet program. I'm assuming that most of you know about this program, but very quickly, it's a large-scale effort to build national, maybe even international-scale, data curation infrastructure. NSF has spent a good deal of funding creating lots of science and engineering data sets, not nearly as much on preserving and curating those data sets, and that's what DataNet is intended to address. Data curation or digital curation has many definitions, and one that's fairly well known is the one that comes out of the UK from the Digital Curation Centre that emphasizes the life cycle aspect of data, and we certainly believe that. But in the specific context of DataNet, when our team was actually developing its proposal, not surprisingly, we spent a fair amount of time debating what we actually mean by data curation. And this is the definition that we came up with. And the main point that I'd like you to take away from this is that word preservation. I think that's very important. Sometimes in our conversations with scientists or conversations with NSF or the community, everyone jumps very quickly to what happens when you have all these data sets preserved and what can you do with them and new forms of science, and all that's important. But I always bring them back to--actually, we have a lot to do in terms of actually preserving these data sets. So let's not forget that preservation is the fundamental underpinning of everything we're trying to do. 
Now, having said that, the definition also reinforces the idea that, while preservation is absolutely necessary, it is not sufficient to address the full goal or objectives of data curation and certainly the DataNet program. We could build dark archives. We could take scientists' data sets and just lock them away. Some days I actually wish that's what we could do. It might make our life a little bit easier in terms of preserving the data sets. But clearly, that is not the intent. It's not the most useful path for us to follow. And quite significantly, the DataNet program is focused on what happens when you preserve data sets or manage data sets in a way that supports scientific inquiry, and not only current forms of scientific inquiry for the foreseeable future but new forms of scientific inquiry. And there's a great deal of emphasis now on these large, complex, grand research types of problems, climate change, food security, cleaning up oil, things of that nature, that seem to require lots of different data types, lots of different perspectives and disciplines coming together in a holistic manner to address the challenge. Maybe, in fact, they are only possible to address when you bring together this kind of viewpoint. With that definition in mind, I'll show you the goal that we've identified for the Data Conservancy. I should also point out, I don't read my slides. Actually, I don't really like PowerPoint. I use it because it's important for you to have context and I can leave you a copy. But I don't typically look at the slides and read them, so I'm sorry if you think I am focusing on them, but I don't intend to just sit there and read through them. The main point that I wanna make about the goal is the word strategy. So as Bill said, we've been thinking a great deal about how you actually build infrastructure, and I'll talk about this in detail in subsequent slides. 
But again, even in the early stages of developing and forming this team, I kept talking about strategy and a lot of people said, "Well, what do you mean? Don't we wanna talk about the architectural diagrams and requirements and use cases?" Of course, we do and we will. I'm sure we'll spend lots of time thinking about those things. But it has to be a part of an overall strategy because we don't really know, quite frankly, what types of data exist. I know many scientists do, but as the scientists start to interact with our team and interact with other libraries, we're not even sure of the inventory of the different data types. We're not sure of the characteristics of those data types, the scale, the complexity, all those wonderful kinds of things, and that's even within a particular scientific domain or within a particular project. What happens when you start looking across all those different scientific areas? So it was really important to think in terms of strategy and not necessarily in terms of implementation. The two of them are connected, clearly. But I think the strategy has to be something that is much more free of time constraints. It's something that continues to keep guiding us even as the implementation changes. The other word that's really important is the word sustain. As you may know, the DataNet program, the DataNet solicitation from NSF, is somewhat unusual, I think, in that it emphasized the two categories that NSF typically talks about, intellectual merit and broader impacts, where the intellectual merit, as the term implies, is really the substance, the science in some sense. And the broader impacts typically refer to some of the other kinds of benefits that NSF cares a great deal about: education, outreach, capacity building, things of that nature. The DataNet program had sustainability, not explicitly, but certainly implicitly, as equally important to these two categories. 
And in the sense of holding our feet to the fire, NSF has made it very clear that they do not expect to continue funding all of the DataNet partners. We anticipate there'll be five by the end of two rounds. They do not anticipate funding all five of these teams after the five-year initial awards are over. It isn't clear at this point what life will be like and what NSF opportunities may exist. But it is very clear that if our plan for sustainability is to go back to NSF for more funding, the probability of that happening is less than a hundred percent, or less than one. So we have to factor that in, in terms of this overall strategy for the Data Conservancy. These are the partnering institutions. Now, there are many institutions, many individuals involved. And when I talk about partner institutions, this slide basically refers to the institutions that are named and have sub-awards or the primary award, in the case of Johns Hopkins, for the Data Conservancy. There are already other partners who are not funded directly. There are already other individuals. But these are the ten that basically have some very specific accountability, if you wanna think about it that way. We have a very detailed project management plan at this point. It's about 80 pages long and it outlines in great detail timelines, milestones, deliverables, in more detail than I am ever accustomed to working with. We have an individual who used to work for one of our partners, Tessella, who has now moved to Hopkins as our executive director. ^M00:10:02 >> A very seasoned professional project manager who--if he hadn't arrived, I have joked that I would have found the tallest building in Baltimore and jumped off of it, and if that wasn't enough I would have gone to New York City. So it's been very important that we have very clearly defined roles and responsibilities, and accountability for these funded partners. 
As you might imagine, for the next wave of partners, the next wave of individuals, it isn't as explicit. But we are hoping over time that as we identify specific ways to work with others, they too will become part of this project management plan that we put together. Some of the names you see here are probably familiar; they're academic institutions, for example. Some of them may not be as familiar, though they are closely connected to what I would think of as higher education. DuraSpace and Portico are examples of those kinds of organizations. They are nonprofits. They're very much, in some sense, out of the academic community. DuraSpace is an organization that brought together DSpace and Fedora Commons. Portico, for those of you who don't know, is an e-journal archive. But then there are organizations that are--again, they're academic in some sense, but they're very much science-based organizations: the National Snow and Ice Data Center, the National Center for Atmospheric Research. We felt it was really critical, in addition to places like Hopkins, to engage the scientists directly and identify some organizations that would give us this kind of direct access to the scientists, direct access to the types of data that they've been using. Then you see Tessella, which is actually a for-profit company and has done some work with the National Archives and Records Administration, and a good deal of work in Europe and in the UK with digital preservation. And we brought them on board basically, in some sense, I'd like to half-jokingly say, to bring some industrial-strength perspectives to how we do things. It's been a real education for me to interact with people from Tessella. When they say we need a project plan, I go, "Oh, I can do that." And then they develop a project plan. It looks very different than what I have in mind. It's much more detailed, in many ways, much more rigorous. 
And I think that's really important for us to bring to the mix, not a traditional kind of perspective for NSF grants, but I think it's been very helpful in this particular context. Getting to this point that Bill mentioned at the beginning, and I'm really glad he did because I think it's probably one of the overarching themes that we've spent a good deal of time thinking about and in many ways are only now beginning to really embrace and understand what it means. In putting together the proposal that we submitted, I was very influenced by this particular report that you see here, and you may remember Dan Atkins, who was the original director of the Office of Cyberinfrastructure at NSF. He used to refer to this particular report when he gave presentations, and in a not terribly insightful realization, I thought, well, if the head of OCI is citing this report, I should probably go read it. So I did, and it's a very interesting discussion and a set of outcomes or observations, if you will, about infrastructure development, particularly here in the US. But I think it applies in other ways. It looks at the history of how major infrastructure development in the US took place. And there are a couple of really important ideas that came out of that, that influenced how we developed our proposal and how we're approaching this particular effort. And you can see this very clear statement: not a rigid road map but principles of navigation. Not assuming that there's only one way to develop, in our case, cyberinfrastructure, but infrastructure of any sort. And to understand, you know, that the solution space, that round or square or triangular hole, however you wanna think about it, is really quite large, typically larger than we ever imagined. I feel that every day in thinking about this particular activity, and that technical approaches are very much one piece of an overall approach. 
These are really important things for us to think about in terms of cyberinfrastructure, and when they, the participants of this workshop, looked at historical efforts, railroads, the banking system, things like that, they noticed another trend that very much, in some sense, depicts this principles of navigation idea. The idea is that systems form in, relatively speaking, local communities, and then those systems somehow come together to form infrastructure. And in the case of Baltimore, we do like to think of railroads for obvious historical reasons, and as you know, when the railroads were being formed in the United States, they were originally regional networks. It was basically, at that time, about as far as you might imagine being able to stretch. You know, the Baltimore and Ohio line was one example, and out West there was the Pacific Rail system. So these were regional networks that basically formed. And as individuals and as the government and as companies all started to think about a national view, maybe we want railroad systems that can actually go from coast to coast or expand these regional networks, that's when it made the most sense to have a formal, explicit effort to say, "Well, then what does a railroad gauge look like?" It can't be six different gauges, right? It has to be one, or there has to be some way to explicitly account for the fact that there are six and then deal with this. And if there had been attempts to introduce that kind of conversation and that kind of standard too soon, to enforce the interoperability, if you wanna think of it that way, it probably would have failed because people, in some sense, would've said, "Well, I don't really care about that. I have a more specific local problem that I'm dealing with and I need to address that first. I don't know why you're asking me about the sort of grand, overarching national or bigger view." 
And I think we're starting to see a lot of that, and we have seen a lot of that, in the scientific community. So a lot of scientific communities are organizing themselves around the people they typically know and then maybe stretching and pushing beyond those boundaries a little bit. And they're calling those infrastructure development efforts. I actually think what they're doing is system development, and what we, through the Data Conservancy, and I think the other DataNet partners and the library community in general, need to think about is this: how do we take those--they're very large and they're very complex--system development efforts and think more broadly about an infrastructure effort that encompasses those without compromising or sacrificing the original systems themselves? Now, before I jump to the next set of topics, I do wanna mention Australia. I spoke in Australia about two years ago, and when I talked about this, somebody there said, "Do you know how the Australian railroad system works?" I said, "No." He said, "We still don't have a common gauge. We actually, between states, lift up trains, put them on to new tracks, and move them on to the new system." So there are different ways you can approach this. [Laughter] This is a really important slide. This is probably the one slide that I deliberately put in here after the All Hands Meeting last week. You look at those partner institutions, you look at the number of individuals there, you have people who come from the for-profit sector who very clearly say, "This is the way things have to be done. These are the kinds of project plans you need." And then you have researchers who are listening to this, thinking, "You know, I have no idea what my research outcomes are going to be. I have no idea if they'll actually be helpful to you in terms of building infrastructure. But they're important research outcomes in and of themselves." 
And we were all talking about this in a very free-flowing kind of conversation, and it was one of those moments. If you've ever been in one of these meetings, particularly if you've ever run one of these meetings, you're standing up in the front thinking, "This is really great, but I can just see this all spinning apart and going into a million pieces." But there was one comment. Chris Borgman from UCLA in particular said, "You know, Sayeed, as we have this conversation, it occurs to me that maybe there are people, maybe even within NSF, who seem to think infrastructure is system building, and it isn't." It's all the human and technology components we're talking about here. It's the virtual organization and community that's being developed. That's infrastructure building, and we need to hold on to that and we need to absolutely embrace that. And the way that Carl Lagoze from Cornell said it during his presentation was, "We have to embrace the diversity of cultures." And it's only now dawning on me, almost a year into this, that my role as the head of the Data Conservancy, quite frankly, is to do exactly this. So yes, there are teams with deliverables and milestones, and there are all sorts of things I have to track, and financial accounting and all those wonderful things. But I'm now beginning to learn. We have an infrastructure team that has a very different view of what they're doing than the information science and computer science research team or the broader impacts team or the sustainability team. And somehow, they have to find a way to put all of these into the mix and come up with something cohesive and something useful. The principles that we are thinking about in terms of how we move forward are these two, basically. So not surprisingly, given what I've said, it really is, in some sense, fundamentally focused on the idea of where are these exemplar scientific projects or communities or efforts? How do we tap into them? 
How do we learn from what they've done, both the things they've done well and the things that they might have done differently or done better, and then formally leverage that learning and expand on that into other communities and other domains? And we need to do this through a combination of research, prototyping, and just experimentation, and then rolling out in successive waves different tools, different services, different components, and building on that foundation of preservation. ^M00:20:04 >> So there are some fundamental aspects that we believe apply at the preservation or even deeper, maybe at the storage layer. But then very quickly, as you start to move up the architectural stack, things get very different, they get very diverse. Different scientific communities have very different ways of doing research. Even within a particular domain, there may be very different ways of doing research or different data types. We have to find that common layer of infrastructure, I think, and then really hone in on that particular piece. I don't believe that the Data Conservancy or any particular effort like this can serve all scientists' needs. I just don't believe that's possible. As far as we can tell right now, the range of those needs is extremely diverse. But we can provide the foundation so that we can connect to lots of different kinds of scientific services and different frameworks. That's very much the guiding set of principles that we have for what we're doing. In order to make this a little bit more tangible, we've divided up the work that we have at hand into four particular objectives, and there are four teams within the Data Conservancy that focus on each of these objectives that you see listed here. 
I think another important aspect of the meeting we had last week at the All Hands was that I came out and explicitly said, "I never meant to imply that the infrastructure team was more important than any of the others." I don't know that anyone actually felt that I had said that or that I was moving in that direction, but I think there was a little bit of an undercurrent of concern. The DataNet program very clearly says at the end of five years you need to have working infrastructure. We can't spend five years and then say, "Look at all these really great papers we wrote, all these very interesting things we learned." We've already done that and hope to continue to do that. But there needs to be something we can point to and say, "Look, there's infrastructure. There's data being preserved. There it is, being used by scientists in all these interesting ways." But that doesn't mean that the other teams, the other objectives, are any less important in getting to that goal. And I basically look at these four teams, and as you can see in the diagram, I think of these as four different classes of requirements. I'm most familiar with technical requirements. I'm probably, you know, in that order, next most familiar with scientific types of requirements. But there are educational requirements that are very important in all this, and then there are those business requirements that are extremely important. I actually lead the sustainability team and I'm dealing with people who primarily have MBAs, and I do not, for those of you who don't know, and I don't think like people with MBAs. That's quite clear. Because when I sit in the room and I talk about these kinds of things, they basically say, "Well, what's the business potential for that, or what are the business implications of that?" And my first reaction is, "Well, who cares? I mean, why does that matter?" And then my second reaction, the more responsible one [laughter], is, "Well, of course, it matters." 
But what is really important is that it matters now. It doesn't matter in 4-1/2 years as the project is winding up. It matters right now and it matters from day one. And while we think about what types of science data we might try to work with or what types of tools and services we might build, yes, it's the National Science Foundation, and yes, we're dealing with scientists. In some sense, at some level, the science will always be paramount. But the reality is that the business aspects, the educational aspects, the technical aspects have to inform our strategy. And there may be cases where something is more scientifically compelling, but it doesn't make sense from all these other perspectives, and we may decide that that is not the highest priority. Because if we focus a lot of time and energy on that particular science problem or domain, but then we can't sustain it or it doesn't meet our technical architecture easily, we may be making more problems than we can handle. ^M00:24:02 [ Pause ] ^M00:24:08 >> This particular slide is a very interesting depiction of what astronomers are calling data flow. We have been spending years at Hopkins dealing with astronomers through two particular efforts. One is known as the Sloan Digital Sky Survey, the other is the Virtual Observatory, and now the successor to VO is the Virtual Astronomical Observatory. The astronomers basically tell us, "We look at our data and we think of different levels of data," and this slide here, from the top left corner to the bottom right, depicts this kind of data flow. At the top left corner, what you have is the instrument itself, the telescope sitting in New Mexico, which literally is grabbing the bits, the ones and zeros that come off from the instrument into some sort of system. And there are very few people, apparently, in the world who can actually understand what all that means. 
And those people have to take those bits and interpret them, process them, calibrate them, refine them, through these different layers and different levels of data flow. So starting at the top left is level zero, literally what the instrument generates. All the way down at the bottom right is level three. And what happens is that at each of these stages, the data become more processed, more refined, more accessible. And in the case of Sloan, the level three data is put into an MS SQL database and then released to the community. So they actually call the level three data sets "data releases," much smaller than the original raw data themselves. To give you some sense of the relative scale, the entire Sloan data set, we believe, is about 140 terabytes. Data releases tend to be on the order of 100 gigabytes. So that gives you some sense of the difference in scale. But the data releases are actually put out on the web for all to use, and through something called Sky Server, you can run queries against these different data releases and do your own science, derive smaller data sets, cite them, use them, do whatever you wish. And they're used by many people. There are apparently about 10,000 professional astronomers in the world. Sky Server has over 900,000 registered users. So this is clearly not only professional astronomers who have been using these kinds of data sets. And one of my favorite stories is that at Hopkins, we track the use of the Sky Server queries. And for a time, the second highest user was a high school in Orlando, and the faculty at Hopkins thought they were about to get hacked and Sky Server was going to become who knows what. So they shut off access to this high school, and the principal of the high school called and said, "Are you crazy? I have a bunch of kids getting very excited about astronomy. They were using Sky Server and you cut them off." 
Astronomers being astronomers, they completely flipped that around and said, "Sorry, we didn't know." And now they've built these huge educational resources for K through 12. So this is truly citizen science in a very significant way. The fourth level: I mentioned these queries that you run against data releases, or level three data. There is another level, called level four, if you will, that is basically the derived results from using these data releases. And that's actually where we started a lot of our data curation work at Hopkins, for a couple of reasons. One, they're even smaller, so they're even more manageable, more tractable. And two, they're not actually being collected. For the life of the Sloan project, the folks at Fermilab were pretty good about saying, "Here are the data. Here are the data releases. We know where they are. They're being backed up," things like that. But the so-called level four data sets or data pieces are the ones that individual astronomers have on their hard drives or on their websites, or in their media, or they don't have at all, and they are the ones that are being cited directly in papers. And so we thought this is a good place for us to start. It's reasonably tractable. It isn't being addressed. It's connected to workflows we're familiar with, in terms of publications and things of that nature. And it wasn't trivial to do that kind of work, but we thought it was a round hole. We thought we knew how big it was, and we dealt with it. What was interesting is that as the Sloan project ended, they basically came to us and said, "We'd like you to go up this flow. We'd like you to deal with everything all the way up to the raw data themselves." And that was a little bit more significant conversation, where we said, "Well, now wait a minute. We weren't really thinking that we'd have to do that. We thought you would take care of those levels of data." 
And we had an interesting conversation with them. I don't think there's anything malicious going on here. They basically said, "Well, of course, we're not gonna just delete the stuff. But the reality is, as new data come in, as resource constraints hit, you know, this isn't going to be the latest and greatest data set for us. So just naturally, over time, we know what will happen." And I think there are a lot of very interesting insights in this kind of exchange: even scientists who were actively managing their data during the life of the project, when it is very interesting and compelling and essential for the research, don't necessarily feel that way when the project ends, particularly when a new project comes along. The astronomy community is moving into using a telescope called Pan-STARRS. And it is an order of magnitude more complex and larger than SDSS ever was. But most astronomers, not surprisingly, are now thinking, "Well, I'd like to get Pan-STARRS data instead of Sloan data." But having said that, astronomy is a discipline that cares about time, and there are many sciences that do. They do want the Sloan data around. They don't want it to go away. They just don't wanna actively be the ones dealing with it. ^M00:29:59 >> And we had all sorts of good conversations with them about, you know, what's the intrinsic value, what can we select to keep and not keep, and how do we keep things in the rawest formats and then derive them later, and so on. I hate to tell you, at the end of the day, we're taking the whole thing. We don't know enough about the data. We don't know enough about the potential uses to do that kind of assessment in advance, but we have to get there. [Laughter] This is one astronomy project. It's 140 terabytes. We're going to keep two copies. This is not scalable. We cannot do this for every scientific project out there. 
This does not even address the so-called small science, the long tail, the bench scientists, the high-throughput biologists that are generating massive amounts of data every day. We cannot keep all of it. But we're hoping that this is an effort to go in and mine very deeply what it means to have a large-scale scientific data set on hand so that we can learn from that and move forward. When we move forward with future projects, we can ask much better, more precise questions. This slide is basically showing the strategy, if you will, that we are thinking about in terms of how we take these very compelling kinds of exchanges that we've had and continue to have with astronomers and move them into these other disciplines that the Data Conservancy is looking at. I should mention that there's another award right now, a current award in the DataNet program, called DataONE, and it's being led by someone named Bill Michener at the University of New Mexico. Really amazing project. Bill is a really great guy. They have focused very much on environmental sciences, so we are trying to also account for what the other DataNet teams might be doing. Our life sciences group is an obvious connection point to the environmental science emphasis in DataONE, but we are basically looking at astronomy as this exemplar community. They have, number one, agreed to share data. They've gotten over that particular hurdle. They have standards that actually support data sharing, and they actually have frameworks that allow them to query data in community-based ways. An interesting thing that astronomers have said to me is that in the past, they used to say they were radio astronomers or x-ray astronomers or UV astronomers. They just say they're astronomers now. So within their community, at least, I'd like to think that they don't have to be the B&O Railroad or the Pacific Railroad anymore. They have started to think more about astronomy as a community. 
So how do we learn from a lot of these things, sociological developments, technological developments, metadata frameworks, preservation activities, you name it, from astronomy and move them into these other domains? We're looking at earth sciences, life sciences, and social sciences. So three of our partners at NCAR, UCLA, and--I'm sorry, Illinois. I've been told that UIUC is now Illinois. I should have changed the slide actually. But the three partners at those institutions are the ones leading the effort to do this. NCAR is doing what we are calling needs analysis, the user-centered design to extract requirements. So at the heart of this is asking scientists to come up with scenarios that describe the kind of data they typically use or the things that they do, or more importantly, perhaps the things they cannot do and would like to be able to do, and then working with those scenarios and extracting personas, extracting tasks, use cases, and ultimately, requirements. That's a very critical process we're looking at to formalize what we're learning from interacting with scientists. We spent years talking to astronomers. I'm not an expert in user-centered design. I couldn't tell you how we went through that process in any explicit way. But ultimately, we did come up with a set of requirements. We wanna become much more formal about that. We wanna do that across the different domains for two reasons. One is scope: earth sciences, life sciences, social sciences. Well, that's a lot. So we can scope that somewhat with these scenarios and these stories that are being developed, and also start giving us a bridge, a formal bridge between these particular disciplines. UCLA is leading the effort to go very deep with astronomy. They've actually got access to the entire e-mail archive of the Sloan project.
From the day Sloan started, they kept the entire e-mail exchange and they did make us promise that we wouldn't share all the dirty laundry and all those wonderful things, but there's a lot of very useful information in there about how they made decisions, what their thought process was for coming up with particular approaches or standards, and so on. And then UCLA will also interview the astronomers associated with Sloan and give us, in some sense, their view on--this is how this very interesting success story came to be. And then it'll be UIUC that will lead the effort, in some sense, of formalizing the development of a theoretical framework. So right now, pretty much everything we're doing is bottom up. It's very much--I happen to know these scientists and I happen to know these data are not being taken care of. Well, how do you come up with a framework that says, "If you were to do data curation, this is what it means. These are some of the principles. These are some of the theoretical concepts. These are the common data practices. These are the kinds of lenses you can use to figure out whether this is effective in this kind of community"? It's very rigorous work in terms of creating these data curation profiles as they interact with scientists in lots of different disciplines. And one of the interesting things--I was at a symposium last--I guess two months ago in April and Derek Law, who is in the UK in Glasgow, made a very interesting observation that I had never heard. He said, "I believe that the biggest failure of my generation of librarianship is to not develop a theoretical framework for digital libraries. We go around talking to our younger colleagues and saying, 'Go out and build digital libraries.' And they have every right to say, 'What do you mean? What is a digital library?'" I think we need to think about that very clearly when it comes to data curation, otherwise, it'll all again feel like it's very bottom up. It's very much grassroots.
It's trying to connect things when we're not necessarily even quite sure what the big picture, the overarching framework, might look like. So I think that's a really critical part of what we're trying to do through this three-tier approach, if you will. This diagram basically shows how the work that those three partner institutions are doing is going to influence three major areas of activity, focused mainly on the technical architecture. But we also anticipate that this work will have very important implications in terms of a data framework, which I'll describe in just a second, and the education and outreach. So in particular, Illinois, but to some degree, UCLA as well, is also thinking very much about curriculum changes and design. What does it mean to train a modern information professional? What do you have to convey to this person in terms of being a data curator? We're finding that having domain knowledge is important, but information management skills are equally important. So what's the right balance? What are the right kinds of questions that these folks need to be able to address? And really, building the pipeline of the capacity for the human infrastructure is what we're hoping comes out of that kind of work. This data framework concept is something that's being led by Carl Lagoze at Cornell. And in terms of this sort of glue, if you will, or theoretical view, we think the data framework is a really important piece of that. And broadly speaking, at the highest level, really the question is, so you get all these different data types from different scientific domains, how are you even supposed to know what data types are out there? One thing we're finding is that there are these very interesting serendipitous moments of discovery. One of the people involved in our project is Steve Kelling at the Lab of Ornithology at Cornell, and when he was at the site visit many months ago, he came up to me afterwards.
I'm glad it was afterwards and he said, "You know, I'm really--well, I think that was good and I understand the whole idea and so on. But then I have to tell you, I don't care about astronomy data." And I said, "Well, that's okay Steve. We don't expect every scientist to care about every other kind of science data that we're dealing with." He said, "Okay. I just wanna be clear if that's the case." A few weeks later, he sent me an e-mail. And in the e-mail he said, "Remember I was dissing astronomy data? Well, I talked to one of my colleagues who said, 'Oh Steve, you shouldn't have done that,' because apparently, there's an entire community of ornithologists who look at the impact of astronomical phenomena on migration patterns." And what Steve said to me was he would never have even thought about asking that question or poking around in the corners of his discipline and asking, "Any of you guys care about astronomy data? I've got this crazy guy at Hopkins who keeps telling me I should think about this." Then there was another occasion where Bruce Marsh, who is a geologist at Hopkins, he's been going to Antarctica every other year to a place called the Dry Valleys. He's very concerned about the portions of Antarctica that are not covered by snow and ice, so that he can go and drill and, as he says, "See the earth as it has been for millions of years." But in order to find those places, he has to do a lot of surveying of where the snow and ice data patterns are. And as he's describing this, the people at the National Snow and Ice Data Center said, "Please tell us you kept all the data, the snow [inaudible] in Antarctica." He said, "I did, but why would anyone care about that?" So I think we're finding lots of these cases where scientists, when they're sitting in a room together, start to go, "Oh yeah. I guess you do care about that data that I think is noise."
And Dean Krafft from Cornell said really eloquently that "one scientist's noise is another scientist's signal." And I think that's become a mantra for us. I actually, rather ineloquently, said, "One scientist's garbage is another scientist's treasure." And he said, "No, no, no. Don't say garbage. That's a bad word for describing scientific data." ^M00:40:01 >> So we liked that, one scientist's signal is--or noise is another scientist's signal. So how do you formalize that into something that's beyond pair-wise conversations or pair-wise metadata crosswalks, or pair-wise anything? The number of pair-wise arrangements we'll have to make is very, very complex. So the data framework is an attempt to do this in a more formalized way, in an overarching way, so that you can discover things you don't even necessarily know you care about or that someone else has collected. And we think a really important fundamental unit or element of that is this concept of an observation. You know, I have to be honest. When we were putting the proposal together, we talked about this data framework and it sounded like a really good thing to do. And Carl and I would keep communicating with each other and saying, "It sounds like a really good thing to do, but what are we going to use as a basis to think about this?" And then, it was actually one of the advisers we have who proposed the basis. What about observations? Don't most, if not all, scientists have observations, this concept? Even in different types of science, if we think about simulations, they actually do have observations. They have observations within the simulation itself and then there's the real world observation of the simulation. So if you think about the characteristics of a simple view of observations, time, place, things like that, they actually apply in the real world, in the simulated world, in the experimental world, and so on.
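[Editor's illustration, not part of the talk] The idea that a few simple characteristics such as time and place apply to observations across domains might be sketched as below. This is only a minimal sketch under stated assumptions: the `Observation` class, the `Setting` values, the `observed_between` helper, and all the sample records are hypothetical names and made-up data, not the Data Conservancy's data framework or any community's actual observation model.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Setting(Enum):
    """Where the observation was made (hypothetical categories)."""
    REAL_WORLD = "real world"
    SIMULATED = "simulated"
    EXPERIMENTAL = "experimental"


@dataclass(frozen=True)
class Observation:
    """A domain-neutral record: who cares about it, when, where, and in what setting."""
    domain: str
    observed_at: datetime
    place: str
    setting: Setting


def observed_between(observations, start, end):
    """Return observations from any domain whose time falls within [start, end]."""
    return [o for o in observations if start <= o.observed_at <= end]


# Made-up sample records from three unrelated domains.
observations = [
    Observation("astronomy", datetime(2005, 3, 1), "site A", Setting.REAL_WORLD),
    Observation("ornithology", datetime(2005, 3, 2), "site B", Setting.REAL_WORLD),
    Observation("climate", datetime(2009, 6, 1), "model grid cell", Setting.SIMULATED),
]

# One query over shared characteristics, instead of one crosswalk per pair of domains.
hits = observed_between(observations, datetime(2005, 1, 1), datetime(2005, 12, 31))
print([o.domain for o in hits])
```

The point of the sketch is only that a shared notion of time and place lets one query span domains that would otherwise need a separate pair-wise mapping.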
And what's very encouraging for us is that many scientific domains and disciplines and communities have already thought about data models for observations. So astronomers have thought about this. Some geological communities have thought about this. Apparently, some biological communities have thought about this and we think that that concept of the observation, the data models associated with those, are really a good place for us to start. And Carl always asks me to make sure I remind everyone: this is going to be really hard. This is going to be a fundamental information science research challenge that we have. But I think it's a really great example of why I keep saying this is not just about system building. So as we build prototypes and test them out, that's important. But in parallel, and in a very important parallel way, we have to also think about these overarching kinds of information science views, which Carl is leading. And he's become involved in some community-based efforts that have already started in this area and he's becoming involved in some proposals that those folks have been writing. So I think it's really a convergence of existing efforts and current efforts and then thinking about it from this perspective of infrastructure development. This is one of those slides I put together where people have told me that, you know, sometimes you're just on the edge of saying something really strange, right? There's a really interesting book called "Emergence: The Connected Lives of Ants, Brains, Cities, and Software" by Steven Johnson. I think it was written about a decade ago, so some of it may seem a little bit out-of-date. But it's still very--it's a very interesting read. And what he talks about is, if you look at these different things, as varied as ants all the way up to our brains, to cities and software, there is this concept of emergence.
And this is a line straight out of his book, "The movement from low-level rules to higher-level sophistication is what we call emergence." And there are really three concepts here: the simple rules, feedback loops, and then adaptivity based on those simple rules and on the feedback loops. And he starts with ants, and I am just fascinated by ants now [laughter] after I've read this book. But he basically says the queen ant is a misnomer because the queen actually isn't directing anybody. Now the queen is just the one who is the hope for the future, right? But she's not sitting there saying, "Worker ants, go off and do this. And, you know, trash-moving ants, go off and do that." There seems to be some inherent understanding within the system itself: these are the things you do, these are your rules. And worker ants can become trash-moving ants or warrior ants; they flip around, they move. Ant colonies, apparently, when another ant colony moves in, they immediately assess: "They're stronger than we are. It's time for us to move. Let's get out of here." And then two years later they come back, two years, two cycles they've gone through, they have more warrior ants, and the other ant colony leaves. And you think, "Stupid ants." You know, there's a lot of wisdom in that kind of approach to conflict. And then if you get to things like cities, apparently cities, there's lots of interesting--yes, we like to think of city planners. But certainly in historical cases, cities spontaneously decided this is a good place for this kind of store, this is a good place for the rich to live, this is a good place for this to happen, and so on. And certainly when you get to the human brain and things like that, it's far more complex. I think about this a lot these days when I think about data curation. Because I think that the temptation is to say this is the top-down way we're going to build it. And I think that if we do it that way, we run the risk of catastrophic failure.
Because it might seem really great for astronomers or it might seem really great for geologists and maybe we can even make it work for the two of them. But science is not linear, it isn't constrained, and by its very definition it's exploratory. So I think that if we try to think of a framework, fix it, and try to impose it, it will break. I'm a hundred percent sure of that. I don't know what the alternative answer ought to be. But I think it looks more like this than the former idea. And I think that ultimately scientists really need this to be simple. And the way that I look at this is a slide you may have seen; it's actually the slide on the poster. We've been spending a lot of time thinking about how to describe connections between papers and data sets using the Object Reuse and Exchange protocol, OAI-ORE. And this is a diagram we put together based on real information and feedback from astronomy articles. And what it basically depicts, and you might not be able to read all of it, but what it basically depicts is an article that is made up of text, HTML, PDF, and so on, but the articles themselves have figures, tables, all sorts of things that actually have data underlying them. So even with an electronic journal, we can actually get to the data right now. I can actually verify the results that are being described in the particular article. We are thinking explicitly about ways of connecting the data to the articles; that's one gateway. There will be many, but that's one gateway to the data themselves. And then what we're also showing is that the data of course have been derived from other, more fundamental types of data. If you go back to that data flow diagram, this is an attempt to acknowledge that data flow. So here's the table, here are the derived data that I used that were derived from a Sloan data release, and the Sloan data release was derived from the telescope itself, all the way back.
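[Editor's illustration, not part of the talk] The chain just described, from a table in an article back to the telescope, can be sketched as a walk over "derived from" links. This is a minimal sketch only: the `derived_from` mapping and the `lineage` function are hypothetical, the labels simply paraphrase the talk, and a plain dictionary stands in for the OAI-ORE descriptions the project actually works with.

```python
# Each item points to the thing it was derived from (labels paraphrase the talk).
derived_from = {
    "table in article": "derived data",
    "derived data": "Sloan data release",
    "Sloan data release": "telescope",
}


def lineage(item, links):
    """Follow 'derived from' links from an item back to its original source."""
    chain = [item]
    seen = {item}
    while chain[-1] in links:
        parent = links[chain[-1]]
        if parent in seen:
            # A loop would mean the recorded provenance is inconsistent.
            raise ValueError(f"cycle in provenance at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


print(" <- ".join(lineage("table in article", derived_from)))
```

Recording who or what performed each step would be a natural extension, but that information is not in the diagram as described, so it is left out here.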
This type of view has very important implications for attribution or [inaudible], citability, all those wonderful things that really are at the heart of science. And when I think of [inaudible] what I think is who touched these data, people or machines, both. And what did they do to them? And ultimately if you look back and you see this was Fermilab that did these following transformations on the data, that gives you one level of confidence. And then if you look at another data set and it's Sayeed Choudhury's personal observations and I ran a few, you know, some code that I ran on my own computer, well, I think it gives you a very different level of confidence. So that kind of ability to automatically understand these things is going to be critical. And what I think scientists really want--now, I don't have any definitive proof to show you this right now. But I suspect what scientists would like to be able to do is take their data and just deposit it, they don't care where. I really don't think they care. It would be great if they viewed libraries and the Data Conservancy and so on as a trusted place where that can happen. But that's really all they want, is a trusted place where it's reliable and it's sustained and it's persistent and it's easy. It already--it fits very nicely with their existing workflows. Not something on the side. Not something new at the time of publication, at the time of grant submission, at the time of reporting, you name it. And they deposit those data. The data are automatically connected to other relevant data. Yes, I know that raises all sorts of questions. But they're connected automatically to other relevant data and then you can automatically identify the services that you can run against these relevant data. That's the vision that I'm looking at. And I think it is an emergent vision.
And I think it's one that supports things like the Data Conservancy, which is a fairly large-scale, deep infrastructure development effort, or an institutional repository or maybe even a personal repository. There can be data that are very large and sitting in the Data Conservancy in a diagram like this, and there can be an Excel spreadsheet that's sitting in someone's personal repository. The key is just to find out when those two are actually connected in a relevant way and to do that without imposing a lot of rules on people. So with that, I will put up my acknowledgment slide and I think if we have time, I'd be happy to take questions. So thank you for your attention. ^M00:49:38 [ Applause ] ^M00:49:45 >> Yeah. >> I think it's a really great idea [inaudible] open strategy because like you said you have work root for all kinds of different types of data. I'm also wondering, are you suggesting to those who submit the data that they include whatever they're using for their [inaudible] or their embedded data, their metadata? ^M00:50:09 >> Yeah. >> Whatever they're using, to please include that, so those of us who try to access it have an idea of what they're using, including [inaudible]. >> Right. So I think I'm supposed to repeat the questions. So what I took away is that the open strategies are a good approach, but when the scientists are depositing their data we also want them to pass in their metadata and understanding of the context, broadly speaking, but the metadata in particular. So I think that's absolutely critical and that's why I think it's part--part of our strategy is to embed it into existing workflows. So the one case that we're most familiar with right now is the publication-data connection. So when a scientist is submitting a paper they have some incentive to get through that workflow. And you might think they may also be most familiar with the data that are actually cited by that paper.
It's in that moment, I think, where the marginal cost is the least and the incentive is the highest. So we can basically say to them, you know, please give us the data or tell us where they are. Tell us how you describe them. Are you aware of the standards that would make our lives much easier if you're willing to use those? Would you like to map to those standards? Those kinds of questions, those kinds of services really, I think are where we have our best chance to do things like that. But worst case scenario, if you've got metadata that you think actually describes those things and it doesn't fit seamlessly into the network, at least get it and see if there's some possibility for fitting it into the network later on. ^M00:51:39 [ Pause ] ^M00:51:45 >> Yeah. >> You mentioned where you wanna be or you have to be at the end of 5 years. >> Yeah. >> What kinds of evaluation processes are you gonna use so that you know that you're on target? You have your detailed project plan but are there other mechanisms you're gonna use? >> Yeah, a very good question. The question is about the evaluation mechanisms we'll use to see if we're actually where we need to be. And you made a good point about the project management plan. The project management plan is an evaluation of whether we did what we purported to do. That doesn't necessarily mean that those things actually helped us meet the objectives of the overall program. So there are these two different classes of evaluation. I think we've got a very good handle right now on the project evaluation. So we say these are the milestones, the deliverables, these are the timelines, how well are we doing. We don't have as good a handle, quite frankly, on that bigger and more important class of evaluation. There is a series of metrics we've defined. We are talking to DataONE about those metrics, we're talking to NSF about these metrics, whether they make sense or not.
I think one of the most interesting things I heard last week at the all hands meeting was, let's not get into metrics that simply count things. So it would be easy for us to basically say how much data do you have, how many papers are connected to the data sets, how many scientists use your network, how many students have you trained. And those are important. Don't get me wrong. But really what's much more important is things like impact. And the question of that serendipitous discovery: how many cases did you get where scientists actually discovered data they never knew about? Or never knew they cared about, for example. How many big scientific breakthroughs took place because of the infrastructure we put in place? What's more important, that we have the data sitting in the Data Conservancy or that a scientist makes an incredible insight about climate change? You know, that's the kind of question we have to think about for the more important class of evaluation. Yeah. >> So Sayeed, as you go through this whole process, it seems to me that you and others here who are in your group see the serendipitous possible left turn you want to take, will you be able to do that within your structure because you're grant funded? 'Cause I hope that you will because that-- >> Yeah, yeah. So the question is really in some sense about how flexible can we be. We talked about these serendipitous discoveries and changes and can you make those turns, particularly because we're grant funded. So embracing the diversity of cultures is a great phrase but it's really hard. And I think one of the challenges we have is we have, for good reasons, this very specific project management plan and a very early architectural build out and technical build out that's file-based. Because we--I mean, really we don't really know much more right at this point. There isn't a whole lot of experience of preserving data sets.
At the meeting that we had last week, Alex Szalay, who really, in many ways, has helped to launch all of the e-science efforts taking place at our institution, made a very interesting observation along these lines. He basically said, "Look, I understand you've gotta get something up and running quickly and I know you had to start somewhere. I'm not trying to be critical of all that. But quite frankly speaking, science today doesn't work that way." And he basically talked about this concept of virtual data. He said that there are data that don't even exist right now. They only exist after we run a query or after we've assembled data from different places. That's what science feels like now. Now he is a cutting-edge scientist. But, okay, wouldn't we wanna support cutting-edge science? Wouldn't that be a good thing to do? So the architecture of the--the infrastructure team took away from that some very good feedback. Can we go through a phase change like that? That's really the scenario that we are going to have to think about as we move forward with the infrastructure team: one fine day. Literally, one fine day, we come to the conclusion, "Okay, this isn't going to do what we need it to do." And just to be very critical of ourselves, maybe it supports science that ended five, ten years ago. How do we make that kind of phase change very quickly, very nimbly, given that we are grant funded? Now, what I can say to you is that we fought very hard. I don't mean that in a bad way. We have sort of debated hard with NSF to basically say the project plan is a snapshot in time. And it documents what we think is our best effort to meet the goals of the program. If that changes, we need to change it immediately. And we need to embrace that that kind of change is actually a healthy part of this. So let's not get into too many questions about, why did you make the change from project management plan version 3 to 3.1 in this way?
Let's just--if we're able to argue for scientific reasons that those changes need to take place, then that's got to be a fundamental aspect of this program. So it's not a specific answer in any way, shape, or form. But I can tell you, we're already thinking about it and we've tried, to the extent possible, to reset those expectations. Because I suspect there will be phase changes. [ Inaudible Remark ] >> I suspect the two. Will you be doing interim reports talking about that or pushing that? >> Yeah, absolutely. I think without getting into too many details, the first year of this program has had a very compressed feeling to it. Things have happened much more quickly in some ways than we anticipated and much more slowly in others. We are basically looking--so the project actually began August 1st of 2009. So our first year will end on July 31st. But in the second year we're looking very actively at communication and dissemination and engagement with other communities, other partners. Because this solution space is enormous and not even the five DataNet partners themselves are going to be able to cover it all. So I think in essence, I don't know what the planning or timing would be for when these all-DataNet meetings take place. But in my view, the ideal scenario would be we sort of look at the universe, divide and conquer, and say here are some major, major gaps. Here are some major other places where we'd love someone to try exactly the same thing with a different approach, and then identify people within the community to engage. Sure. Yeah. >> Could you talk more about the validating of the information and find out when it [inaudible], one of your previous thoughts is, are you referring to that level of confidence that you talked about or is it more of authenticating the integrity of the data? And then also I can't help but wonder if maybe you get the attention of those MBA people, if you talk about possible relevance and how to [inaudible] and historical trends in that data?
I think it might [inaudible] a little more. [ Laughter ] >> Yeah, I might. The question is primarily about validation of the data. I hate to keep saying this. There are different layers to validation, as you alluded to in your question. There is the validation that the bits are in fact the bits that you gave us some time ago; can we maintain that integrity? There's the validation that it is what you say it is. So this is a series of FITS files related to the Sloan Digital Sky Survey. Well, somebody needs to be sure that that in fact is the case. And you can keep going down this--to me, all the way through scientific [inaudible]. Is this in fact a reasonable scientific claim that is being made, and are these the data that support that particular claim? This very much comes back to what I believe is the question of the appropriate role. Because I'm in a library, and the library is leading this particular effort, what's the appropriate role for the librarian? ^M01:00:05 >> We don't have the skill sets to do a lot of those scientific validation types of activities. We just don't. I joke that I know more about astronomy than I ever imagined I would know. But I'm not an astronomer by any stretch of the imagination. The other thing I think is that we have to think more about how machines can play a role in all of this, even for scientific validation. Because it isn't going to scale if we keep throwing people at the problem. So we have to think, and it's a part of the services that I talked about in terms of when you deposit data to such a network. Services are not necessarily to support your science. But they can be preservation services, they could be validation services, they could be authentication services, things of that nature. Yeah. >> [Inaudible] Do you have any observation about some promising rule out of rules that can result in the kinds of emergence that you're talking about either the technical and [inaudible]? >> Yeah.
Yes, so the question is, do we have any insights about rule out the rules [inaudible] and, as you point out, it's early days. I think the simplicity aspect of the deposit has become very, very clear. We're looking at different modes by which we acquire data and one of them is large-scale batch processing. So the Sloan interaction very much is a one-to-one kind of exchange. You have the Astrophysical Research Consortium, which is the legal entity that speaks on behalf of these data, that communicates with Johns Hopkins. It's 140 terabytes coming over the network in a big bulk transaction. I don't think there's a way for that to be simple. I think that kind of transaction is always a bit complex and it's institution-based and it has legal issues and so on. But at the other end of it, in terms of maybe things like bench science and so on, I think that it's embedding into existing workflows, embedding into existing tools. One of the most interesting conversations I saw recently was about using Facebook to deposit data. You know, at first glance, and this is my bias, I don't get Facebook. I'm on it, I follow lots of people, once in a while I post things, but then I keep thinking, "Why am I doing this? [laughter] I have other things I need to be doing." But nonetheless, many people are using Facebook. Is there some way that you can use Facebook to actually deposit your data into the Data Conservancy or your repository? I don't know. But I think that we are hearing loud and clear that if this becomes an event, if the deposit of data is this massive event with fanfare, lights, and confetti, and so on, it isn't going to happen. So really, the simple rules in that case are: existing tools, build on top of them, don't build new ones unless it's this large-scale kind of transaction that takes place. >> Thanks Sayeed very much. >> Thank you. ^M01:03:01 [ Applause ] ^M01:03:06 >> This has been a presentation of the Library of Congress. Visit us at loc.gov.