[00:02] *** Ravenloft has joined #archiveteam [00:05] Sketchcow, I would suggest scraping by year e.g. http://link.springer.com/search?facet-language=%22En%22&facet-content-type=%22Book%22&showAll=false&date-facet-mode=in&facet-start-year=1995&previous-start-year=1856&facet-end-year=1995&previous-end-year=2016 [00:05] This allows getting a smaller number of results [00:06] Note that the CSV export of search results seems limited to 1000 results though. [00:10] Yeah, exactly. [00:10] I'll do a variety of searches, etc. [00:10] It's not hard. And I will have my thing skip double-grabbed things. [00:11] I'm doing a gradual upload so it doesn't overflow IA's OCR servers. It'll be doing this for a few days. It's a bonanza. [00:21] For example, http://link.springer.com/book/10.1007/978-3-642-99297-1 is a German textbook on Roman literature from 1858. [00:28] wunderbar [00:34] *** megaminxw has joined #archiveteam [00:57] https://archive.org/details/springer_10.1007-978-3-319-07118-3 just metadata as far as the eye can see [00:57] *** xhades has quit IRC (Ping timeout: 506 seconds) [00:58] Hm, what's the difference in intended meaning between "Isbnonline" and "Isbn"? [01:04] Guess. [01:04] [01:05] The ISBN of the "online" version of the work, as opposed to the one of the "printed" version? [01:05] In this case, they appear to be the same. [01:05] oops, not the same [01:05] [01:06] ends with 7-6 for the plain ISBN, rather than 8-3 for the online one, for those following along at home. [01:06] *** dashcloud has joined #archiveteam [01:07] final ISBN digit is a checksum [01:07] yep [01:07] and the springer link page confirms this: http://link.springer.com/book/10.1007%2F978-3-319-07118-3 [01:14] hm — there are a few more identifying numbers in the PDF that could be extracted (maybe in another pass), specifically two ISSNs & an LCCN. And a few more pieces from the Springer Link page (Series Volume, Topics, Industry Sectors).
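The per-year slicing suggested at the top can be generated mechanically. A minimal sketch, assuming the facet parameter names from the pasted URL are stable (the `previous-*` parameters look like UI state and are omitted here); the point is that each single-year slice stays under the 1000-row CSV export cap:

```python
# Build one SpringerLink search URL per publication year, so every
# result set stays small enough for the (reportedly 1000-row) CSV export.
# Facet parameter names are copied from the URL pasted in the channel.
from urllib.parse import urlencode

BASE = "http://link.springer.com/search"

def year_search_url(year, content_type="Book", language="En"):
    """Return a SpringerLink search URL restricted to a single year."""
    params = {
        "facet-language": f'"{language}"',
        "facet-content-type": f'"{content_type}"',
        "showAll": "false",
        "date-facet-mode": "in",
        "facet-start-year": year,
        "facet-end-year": year,
    }
    return BASE + "?" + urlencode(params)

# One URL per year across the range mentioned in the original search.
urls = [year_search_url(y) for y in range(1856, 2017)]
```

Each of these URLs would then feed the CSV export (or a page scrape) for that year's slice.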
If you make the scrape scripts available somewhere, I may see about adding extraction of those, too. [01:14] are you guys sure that libgen does not have all of that already? [01:14] AFAIK, IA doesn't have a mirror of libgen. :-) [01:14] :O [01:14] And if they do, I doubt it's undarked. [01:18] *** BlueMaxim has joined #archiveteam [01:18] https://archive.org/details/@sketch_the_cow coming in nicely. [01:21] *** zerkalo has quit IRC (Remote host closed the connection) [01:23] *** MRX3 has quit IRC (Quit: Leaving) [01:26] *** nertzy has joined #archiveteam [01:34] *** JesseW has joined #archiveteam [01:57] *** JesseW has quit IRC (Leaving.) [01:59] *** JesseW has joined #archiveteam [02:00] *** RichardG has quit IRC (Ping timeout: 255 seconds) [02:09] JW_work: It wasn't darked until recently (past few months, I think) but it used to be here https://archive.org/details/librarygenesis [02:09] JW_work, (I think anyway.) Also I don't think it was at all up to date. [02:10] *** JesseW has quit IRC (Leaving.) 
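The check-digit arithmetic behind the "ends with 7-6 ... rather than 8-3" observation is the standard ISBN-13 checksum: weights alternate 1,3 over the first twelve digits, and the final digit brings the weighted sum to a multiple of 10. A small sketch (the online ISBN is the one quoted in full above; the print ISBN ending in 7-6 is inferred from the message and should be treated as an assumption):

```python
# ISBN-13 check digit: sum the first 12 digits with alternating
# weights 1 and 3, then pick the digit that rounds the sum up to
# the next multiple of 10.
def isbn13_check_digit(first12):
    """Return the check digit for a 12-digit ISBN-13 prefix string."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return (10 - total % 10) % 10

# Online ISBN quoted in the channel, plus the inferred print ISBN.
for isbn in ("978-3-319-07118-3", "978-3-319-07117-6"):
    digits = isbn.replace("-", "")
    assert isbn13_check_digit(digits[:12]) == int(digits[12])
```

This is why a one-digit change in the body of the ISBN (07117 vs 07118) shifts the final checksum digit (6 vs 3).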
[02:21] *** username1 has joined #archiveteam [02:24] *** schbirid2 has quit IRC (Read error: Operation timed out) [02:44] *** philpem has quit IRC (Ping timeout: 260 seconds) [03:01] *** rctbeast has joined #archiveteam [03:45] *** rctbeast has quit IRC (Ping timeout: 240 seconds) [04:08] *** dashcloud has quit IRC (Read error: Operation timed out) [04:08] *** Stiletto has quit IRC (Read error: Connection reset by peer) [04:09] *** Stiletto has joined #archiveteam [04:12] *** dashcloud has joined #archiveteam [04:18] *** wp494 has quit IRC (Read error: Operation timed out) [04:19] *** wp494 has joined #archiveteam [04:27] *** Elegance has quit IRC (Ping timeout: 369 seconds) [04:35] *** Coderjoe has quit IRC (Read error: Connection reset by peer) [04:36] *** VADemon has quit IRC (Read error: Connection reset by peer) [04:37] *** nertzy has quit IRC (This computer has gone to sleep) [04:37] *** Elegance has joined #archiveteam [04:47] *** Coderjoe has joined #archiveteam [05:01] *** JesseW has joined #archiveteam [05:15] *** arkhive has joined #archiveteam [05:16] http://www.pcmag.com/slideshow/story/340385/in-memoriam-tech-that-died-in-2015/ [05:19] kyan: interestingly, one of the items in that collection, https://archive.org/history/firstphilosopher00wate was darked apparently due to a copyright claim by the author (in 2013), then undarked by ximm 3 months later. [05:20] *** megaminxw has quit IRC (Quit: Leaving.) [05:25] *** Coderjoe has quit IRC (Read error: Connection reset by peer) [05:30] *** Coderjoe has joined #archiveteam [05:33] Can anyone think of a good reason not to use terroroftinytown to archive a mapping between Hacker News item numbers and links? e.g.
https://news.ycombinator.com/item?id=94521 -> http://fourreasonswhy.com/2008/01/03/privacy-is-doomed/ [05:43] *** Elegance has quit IRC (Ping timeout: 369 seconds) [05:44] *** Elegance has joined #archiveteam [06:06] *** Elegance has quit IRC (Ping timeout: 250 seconds) [06:13] *** Elegance has joined #archiveteam [06:57] *** Froggypwn has quit IRC (Ping timeout: 483 seconds) [06:58] *** Froggypwn has joined #archiveteam [07:05] *** Elegance has quit IRC (Ping timeout: 369 seconds) [07:05] *** Elegance has joined #archiveteam [07:33] Kind of weird to do that [07:33] We after all have all the comments [07:35] *** Elegance has quit IRC (Ping timeout: 369 seconds) [07:36] We do? [07:36] * JesseW goes looking for the dump [07:37] Hm, looks like the dump is only as of 2014-05-29 -- we should probably run a re-grab sometime. [07:38] http://archiveteam.org/index.php?title=Hacker_News [07:38] *** Elegance has joined #archiveteam [07:45] apparently there's a more recent one here: https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb -- it'd be good to copy it into IA [07:50] *** JesseW has quit IRC (Leaving.) 
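The item-number-to-link mapping discussed above could also be built from the official Hacker News Firebase API rather than scraping item pages; a minimal sketch (the API endpoint is the public one at hacker-news.firebaseio.com, but treating it as suitable for a bulk grab is an assumption):

```python
# Map Hacker News item ids to their submitted URLs via the official
# Firebase API. Comments and text posts have no "url" field and yield None.
import json
from urllib.request import urlopen

API = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def item_link(item):
    """Return (id, url) for a story dict, or None when there is no URL."""
    url = item.get("url")
    return (item["id"], url) if url else None

def fetch_item(item_id):
    """Fetch one item's JSON from the public API (network access required)."""
    with urlopen(API.format(item_id)) as resp:
        return json.load(resp)

# e.g. item_link(fetch_item(94521)) should give the fourreasonswhy.com
# link from the example above.
```

A terroroftinytown project would instead shard the id space across warriors, but the per-item extraction logic would look much like `item_link`.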
[07:52] *** Elegance has quit IRC (Ping timeout: 250 seconds) [08:00] *** Elegance has joined #archiveteam [09:11] *** ohhdemgir has quit IRC (Ping timeout: 260 seconds) [09:39] *** philpem has joined #archiveteam [09:56] *** username1 is now known as schbirid [10:31] *** ats has quit IRC (Quit: Lost terminal) [10:32] *** ats has joined #archiveteam [11:11] *** jspiros has quit IRC (Read error: Operation timed out) [11:15] *** REiN^ has joined #archiveteam [11:16] *** jspiros has joined #archiveteam [11:29] *** zerkalo has joined #archiveteam [11:52] *** bzc6p has joined #archiveteam [11:52] *** swebb sets mode: +o bzc6p [11:52] *** bzc6p has left [11:56] *** bzc6p has joined #archiveteam [11:56] *** swebb sets mode: +o bzc6p [12:03] *** PepsiMax has quit IRC (Read error: Operation timed out) [12:03] *** Famicoman has quit IRC (Read error: Operation timed out) [12:05] *** Gfy has quit IRC (Ping timeout: 250 seconds) [12:06] *** rctbeast has joined #archiveteam [12:07] *** bzc6p has left [12:10] *** PepsiMax has joined #archiveteam [12:10] *** PepsiMax has quit IRC (Connection closed) [12:14] *** Gfy has joined #archiveteam [12:30] *** BlueMaxim has quit IRC (Quit: Leaving) [12:54] *** Ghost_of_ has quit IRC (Quit: Leaving) [13:20] *** Ravenloft has quit IRC (Ping timeout: 364 seconds) [13:34] *** Famicoman has joined #archiveteam [13:52] *** redlob has quit IRC (Ping timeout: 260 seconds) [13:57] *** redlob has joined #archiveteam [14:13] *** redlob has quit IRC (Max SendQ exceeded) [14:22] *** redlob has joined #archiveteam [14:50] *** Elegance_ has joined #archiveteam [14:51] *** Elegance has quit IRC (Ping timeout: 369 seconds) [15:00] *** Morbus has quit IRC (Quit: http://www.disobey.com/) [15:09] *** SimpBrain has quit IRC (Leaving) [15:21] *** ohhdemgir has joined #archiveteam [15:48] *** Elegance_ has quit IRC (Ping timeout: 369 seconds) [15:48] *** Elegance has joined #archiveteam [15:54] *** Microguru has joined #archiveteam [16:16] *** Microguru has quit 
IRC (Read error: Operation timed out) [16:27] *** nertzy has joined #archiveteam [16:33] SketchCow: chfoo: can you please create an rsync target on FOS for oldfriends? [16:49] *** JetBalsa has quit IRC (Ping timeout: 258 seconds) [16:49] *** JetBalsa has joined #archiveteam [16:58] *** JesseW has joined #archiveteam [16:59] *** nertzy has quit IRC (This computer has gone to sleep) [17:06] *** SimpBrain has joined #archiveteam [17:08] arkiver: ok, done [17:08] thank you!! [17:24] The OldFriends project is now running! [17:25] First items are the images [17:25] like http://www.oldfriends.co.nz/InstitutionPhotoView.aspx?id=132396 [17:27] *** JesseW has quit IRC (Leaving.) [17:29] * SimpBrain throws a 3 concurrent at oldfriends [17:36] * SmileyG ponders [17:36] i need to get my code on and learn this stuff [17:54] Came across this yesterday: https://www.clockss.org/clockss/Home - a joint venture to ensure academic works not sold by anyone anymore end up freely available to the public under a CC license [18:00] probably worth copying the "triggered content" into IA, if it isn't already [18:06] http://clockss.org/clockss/Triggered_Content — looks like a total of 227 "volumes" with an unknown number of issues in each volume, and an unknown number of separate PDFs (generally one per article) in each issue. [18:06] There is also one just listed as "Coming soon" http://clockss.org/clockss/MD_Conference_Express [18:18] *** rctbeast has quit IRC (Ping timeout: 243 seconds) [18:21] *** rctbeast has joined #archiveteam [18:42] probably worth sending a note to https://www.martineve.com/2012/03/30/the-problems-for-small-open-access-journals-in-terms-of-digital-preservation/ informing him of IA (I assume he knows, but didn't mention it for some reason, which might be interesting to know) [19:06] wikiteam should probably also add http://documents.clockss.org/index.php/Main_Page to the grab [19:16] STILL uploading the first batch of books. 
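Returning to the CLOCKSS index mentioned above: a first pass over the Triggered Content page could simply collect every outbound link, then crawl each volume from there. A sketch using only the standard library; the page's structure is assumed, not verified:

```python
# Collect href targets from a CLOCKSS index page so each triggered
# volume can be queued for mirroring. Pure parsing; fetching is left
# to the caller.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Gather the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Feeding the body of http://clockss.org/clockss/Triggered_Content to
# extract_links() would yield the per-volume pages to crawl next.
```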
[19:17] And the first batch isn't even half of the total set [19:17] And the total set isn't all of the books due to the 1000-result screwup [19:21] Oh man, it's only at 1,500. [19:30] *** Ravenloft has joined #archiveteam [19:46] Sketchcow: Can you create a collection for Yahoo! Groups, please? Items: https://archive.org/details/@purplesymphony?and[]=subject%3A%22yahoo%20groups%22 [19:59] *** Ghost_of_ has joined #archiveteam [20:00] *** JetBalsa has quit IRC (- nbs-irc 2.39 - www.nbs-irc.net -) [20:06] That is some weak-ass metadata there. [20:06] What did you DO [20:10] Ah yeh, that’s just the groups in the specific item. [20:11] For the search engine. [20:13] The CDX server unfortunately can’t deal with the WARC/CDX files I’m uploading. [20:14] yahoo groups must be a pita to archive [20:14] since a lot of groups are private [20:14] Even without the private groups it's gonna take 17 years at the current rate. [20:15] *** Stilett0 has joined #archiveteam [20:18] i'm done grabbing files from romhacking.net, currently uploading them to archive.org [20:18] i'll probably grab new files every 2 weeks or every month [20:18] That metadata should be in the description [20:18] *** Stiletto has quit IRC (Read error: Operation timed out) [20:21] Are custom metadata fields discouraged? [20:21] I was aiming for search queries like “subject:"yahoo groups" /metadata/group:billiejoearmstrongsthrone” [20:24] *** xhades has joined #archiveteam [20:39] PurpleSym: at a minimum, please add *spaces* between your group names, just for improvement of display [20:42] Sure, I can change that. Semicolons are used right now, because “subject” does the same. [20:43] (Unfortunately subject’s semantics are not applied to my custom metadata field) [20:44] well space and semicolon [20:52] yeah, don't leave *out* the semicolon, just use both [20:58] I’ll bulk-update existing items tomorrow.
[21:03] cool, thanks [21:05] purpleSym: you should probably also put the group names in the "subject" metadata field, too, to improve searching. [21:06] I tried that first, but the topic bar on the right can’t keep up with that. [21:06] Even with spaces? Interesting. [21:07] Nah, semicolon-delimited. The parser does its job well. [21:07] And as Sketch said, make sure to include them in Description for human-readable purposes, too. [21:07] But the number of groups is just too large I guess. [21:10] Additional metadata that could go in the description is a count of messages in each group (which you could extract from the cdx) [21:11] PurpleSym: it seems like this could be usefully turned into a Warrior project… [21:12] That’s true. [21:13] But I’ve never written Lua code. [21:14] AFAIK, Warrior projects require pretty minimal Lua code — it's mostly just filling in blanks. (I haven't written a project, myself, though — so take this with a grain of salt) [21:17] Also as far as I see neither my machine nor network connection is limiting the grab. Yahoo’s servers seem to be kind of “fragile”. If I want to, I can easily overload them with mere search queries. [21:18] And more IP addresses wouldn't help? [21:19] Rate-limiting is IP-based, yes. But I have a /64 IPv6 ;) [21:20] ah, so you have enough IPs, then. So your network connection *alone* is enough to overload them? How are they serving everyone else in the world, I wonder? [21:20] Are there really that few people using Yahoo Groups at this point? [21:21] Are you sure they aren't rate-limiting the entire /64? [21:22] Everyone else just hits the cache. I’m trashing it. [21:22] As far as I know there’s no limit per netblock. [21:22] Or I have not hit it yet.
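Spreading requests across a /64, as mentioned above, amounts to picking a fresh source address per request and binding the outgoing socket to it. A sketch using the documentation prefix (2001:db8::/64) as a stand-in for the real allocation; in practice the whole prefix must actually be routed to the host for the bind to work:

```python
# Pick a random source address inside an IPv6 prefix, so per-IP
# rate limits see a different client on every request.
import random
from ipaddress import IPv6Network

def random_address(prefix="2001:db8::/64"):
    """Return a random IPv6 address inside the given prefix."""
    net = IPv6Network(prefix)
    return net.network_address + random.randrange(net.num_addresses)

# With urllib3/requests-style HTTP clients, the address would be passed
# as source_address=(str(random_address()), 0) when opening connections.
```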
[21:28] http://www.npr.org/sections/thetwo-way/2015/12/29/461401135/egypt-raids-2-major-independent-cultural-institutions-in-2-days [21:34] PurpleSym: hm, I wonder if you tried to aim at things already in the cache if you could boost the rate that way [21:35] in any case, we should probably take this to -bs [22:03] https://twitter.com/terryteachout/status/681957100730363904 [22:12] wait that blog entry is barely a page long [22:12] I was expecting something more in-depth [22:21] I know [22:21] I KNOW [22:26] *** VADemon has joined #archiveteam [22:40] Sketchcow: When you said you were interested in all CDs: did you mean the physical disc, or just an ISO? [22:43] mistym: Both, would be my guess. [22:43] IA has a warehouse — I'm pretty sure they'd be glad to store the original object; but I'm sure they'd be very grateful if you did the ripping yourself, saving them the time and effort. [22:44] JW_work: ive been trying to send a collection of CDs + magazines to the IA warehouse but got no reply from the email guy :( [22:45] jleclanch: he's … really busy. How long has it been? If it's been more than a week, I'd just re-send your email. [22:45] JW_work: been 3 weeks since my last poke, have poked him weekly before [22:46] eh, seems fine to poke again [22:46] last he replied was asking for a pricing estimate on the package [22:46] or you could wait till next week, as this is still likely a somewhat dead week between xmas and new years [22:47] and I presume you sent him one, and haven't heard back since? [22:47] yea [22:47] * JW_work is reminded to check if I have any outstanding things to bug IA about [22:47] OT but god damn scrapy is fun [22:48] heh. what are you ripping? [22:48] npmjs.org lol [22:48] doing a small packaging project [22:48] https://github.com/jleclanche/npm-js-metadata/ [22:48] wait, that doesn't have an API? 
ha [22:48] they have a terrible api [22:48] no way to list packages so i have to rip from the most-starred [22:48] lol [22:50] I also found out another popular language-specific package hosting website is keeping *all* packages and files, regardless of whether they were deleted [22:50] so that's lovely [22:50] must be a bunch of passwords in there /sigh [22:51] *** GLaDOS has quit IRC (Ping timeout: 260 seconds) [22:51] Eh, if you publish a password in public, you need to *change* it, not just try to get the horse back in the barn [22:51] ofc [22:51] doesn't mean people do it [22:52] so I think keeping *all* the versions (and just dark'ing them on request) is a better idea [22:55] *** ndiddy has joined #archiveteam [22:55] *** GLaDOS has joined #archiveteam [22:56] jleclanch: you've seen https://docs.npmjs.com/misc/registry , right? [22:56] JW_work: yes, that git repo mirrors the registry [22:57] so which data are you scraping that isn't available in the registry? [22:57] package lists [22:57] but can't you get that from https://web.archive.org/web/20150905225943/http://skimdb.npmjs.com/registry [22:57] no? [22:58] the "public mirror" of the underlying CouchDB? [22:58] oh hm [22:58] good point, let me look into it [23:05] jleclanch: it looks like you can get a list of packages with: https://skimdb.npmjs.com/registry/_all_docs?limit=10&skip=3000 [23:05] http://docs.couchdb.org/en/1.6.1/api/database/bulk-api.html#db-all-docs [23:05] JW_work: yeah i was just getting to this. mb not recognizing this as a couchdb instance [23:05] sweet stuff [23:06] I thought it implausible that npmjs would have such a bad api [23:06] ive been dealing with bad APIs all week [23:06] nothing is implausible, not if you can imagine it! [23:06] heh. 
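The `_all_docs` endpoint linked just above pages through every package name; a minimal walk over the skimdb mirror might look like this (the limit/skip pagination is CouchDB's standard bulk API, but skimdb's continued availability is an assumption):

```python
# Enumerate npm package names from the public CouchDB mirror by paging
# _all_docs with limit/skip, as in the URL pasted in the channel.
import json
from urllib.request import urlopen

REGISTRY = "https://skimdb.npmjs.com/registry/_all_docs"

def page_url(limit, skip):
    """Build one _all_docs page URL."""
    return f"{REGISTRY}?limit={limit}&skip={skip}"

def package_names(page):
    """Extract package names (doc ids) from one _all_docs response dict."""
    return [row["id"] for row in page.get("rows", [])]

def iter_packages(limit=1000):
    """Yield every package name, one page at a time (network required)."""
    skip = 0
    while True:
        with urlopen(page_url(limit, skip)) as resp:
            names = package_names(json.load(resp))
        if not names:
            return
        yield from names
        skip += limit
```

For very large databases, CouchDB's docs recommend `startkey`-based paging over large `skip` values, but the limit/skip form above matches the URL from the discussion.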
sure — but npm seems generally more clueful than that [23:07] well [23:07] the package names are case-sensitive [23:07] so there's that [23:07] that seems sensible enough [23:07] https://www.npmjs.com/package/jQuery [23:07] https://www.npmjs.com/package/jquery [23:07] does it [23:16] er, those are different values [23:16] different packages [23:16] that's what i mean [23:16] ok, and the problem is? [23:17] that it's possible? :p [23:17] that seems like a feature. "jquery" is not the same as "jQuery" [23:18] is not the same as "jQuErY" [23:18] you see feature, i see social engineering vector [23:18] eh, unicode allows plenty of such whether you fold case or not [23:19] how does that invalidate what I said? it just reinforces it :P [23:19] Having a *warning* when full-unicode-case-folding would show a collision, that seems good — but prohibiting it — not so much [23:21] ok, there does seem to be a bug there, though — the stats are identical for both [23:21] which I doubt is accurate [23:21] the stats server is a separate API [23:21] im not surprised [23:21] *** brayden has quit IRC (Read error: Operation timed out) [23:21] the latter must be case insensitive [23:21] heh, OK — *that's* a bug [23:22] if they don't support the case sensitivity across all their services — then they need to turn it *off* across all their services [23:23] I mean, I'd lean towards requiring lowercase ascii alphanum & underscore for package names, but that might be going too far [23:24] apparently the / character can be in the package name, too [23:25] and that breaks some more APIs [23:25] lol [23:25] what were you saying about cluefulness again?
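The "warning on a case-folding collision" being argued for above is a few lines of code: group names by their Unicode case-folded form and flag any bucket with more than one member. A sketch:

```python
# Detect package names that collide under full Unicode case folding,
# without forbidding mixed case outright.
def casefold_collisions(names):
    """Return {folded_name: [original names]} for every collision group."""
    buckets = {}
    for name in names:
        buckets.setdefault(name.casefold(), []).append(name)
    return {k: v for k, v in buckets.items() if len(v) > 1}

# e.g. casefold_collisions(["jquery", "jQuery", "lodash"]) flags the
# jquery/jQuery pair from the registry example above.
```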
:P [23:25] heh [23:25] the couchdb bit is pretty damn cool tho [23:26] ok, let me clarify — not "cluefullness", but rather, "enthusiastic openness" [23:26] heh fair enough [23:27] ah, there is a ticket: https://github.com/npm/newww/issues/380 [23:27] *** xhades has quit IRC (Read error: Operation timed out) [23:27] *** xhades has joined #archiveteam [23:29] *** Elegance has quit IRC (Quit: :(){ :|:& };:) [23:47] *** megaminxw has joined #archiveteam