[00:02] *** Ravenloft has joined #archiveteam
[00:05] <kyan> Sketchcow, I would suggest scraping by year eg. http://link.springer.com/search?facet-language=%22En%22&facet-content-type=%22Book%22&showAll=false&date-facet-mode=in&facet-start-year=1995&previous-start-year=1856&facet-end-year=1995&previous-end-year=2016
[00:05] <kyan> This allows to get a smaller number of results
[00:06] <kyan> Note that the CSV export of search results seems limited to 1000 results though.
[00:10] <Sketchcow> Yeah, exactly.
[00:10] <Sketchcow> I'll do a variety of searches, etc.
[00:10] <Sketchcow> It's not hard. And I will have my thing skip doubled-grabbed things.
[00:11] <Sketchcow> I'm doing a gradual upload so it doesn't overflow IA's OCR servers. It'll be doing this for a few days. It's a bonanza.
[00:21] <JW_work> For example, http://link.springer.com/book/10.1007/978-3-642-99297-1 is a German textbook on Roman literature from 1858.
[00:28] <DFJustin> wunderbar
[00:34] *** megaminxw has joined #archiveteam
[00:57] <Sketchcow> https://archive.org/details/springer_10.1007-978-3-319-07118-3 just metadata as far as the eye can see
[00:57] *** xhades has quit IRC (Ping timeout: 506 seconds)
[00:58] <JW_work> Hm, what's the difference in intended meaning between "Isbnonline" and "Isbn"?
[01:04] <Sketchcow> Guess.
[01:04] <Sketchcow> </wonka>
[01:05] <JW_work> The isbn of the "online" version of the work, as opposed to the one of the "printed" version?
[01:05] <JW_work> In this case, they appear to be the same.
[01:05] <JW_work> oops, not the same
[01:05] <Sketchcow> </wonka>
[01:06] <JW_work> ends with 7-6 for the plain ISBN, rather than 8-3 for the online one, for those following along at home.
[01:06] *** dashcloud has joined #archiveteam
[01:07] <xmc> final isbn digit is a checksum
[01:07] <JW_work> yep
[01:07] <JW_work> and the springer link page confirms this: http://link.springer.com/book/10.1007%2F978-3-319-07118-3
[01:14] <JW_work> hm — there are a few more identifying numbers in the PDF that could be extracted (maybe in another pass), specifically two ISSNs & a LCCN. And a few more pieces from the Springer Link page (Series Volume, Topics, Industry Sectors). If you make the scrape scripts available somewhere, I may see about adding extraction of those, too.
[01:14] <schbirid2> are you guys sure that libgen does not have all of that already?
[01:14] <JW_work> AFAIK, IA doesn't have a mirror of libgen. :-)
[01:14] <schbirid2> :O
[01:14] <JW_work> And if they do, I doubt it's undarked.
[01:18] *** BlueMaxim has joined #archiveteam
[01:18] <Sketchcow> https://archive.org/details/@sketch_the_cow coming in nicely.
[01:21] *** zerkalo has quit IRC (Remote host closed the connection)
[01:23] *** MRX3 has quit IRC (Quit: Leaving)
[01:26] *** nertzy has joined #archiveteam
[01:34] *** JesseW has joined #archiveteam
[01:57] *** JesseW has quit IRC (Leaving.)
[01:59] *** JesseW has joined #archiveteam
[02:00] *** RichardG has quit IRC (Ping timeout: 255 seconds)
[02:09] <kyan> JW_work: It wasn't darked until recently (past few months, I think) but it used to be here https://archive.org/details/librarygenesis
[02:09] <kyan> JW_work, (I think anyway.) Also I don't think it was at all up to date.
[02:10] *** JesseW has quit IRC (Leaving.)
[02:21] *** username1 has joined #archiveteam
[02:24] *** schbirid2 has quit IRC (Read error: Operation timed out)
[02:44] *** philpem has quit IRC (Ping timeout: 260 seconds)
[03:01] *** rctbeast has joined #archiveteam
[03:45] *** rctbeast has quit IRC (Ping timeout: 240 seconds)
[04:08] *** dashcloud has quit IRC (Read error: Operation timed out)
[04:08] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[04:09] *** Stiletto has joined #archiveteam
[04:12] *** dashcloud has joined #archiveteam
[04:18] *** wp494 has quit IRC (Read error: Operation timed out)
[04:19] *** wp494 has joined #archiveteam
[04:27] *** Elegance has quit IRC (Ping timeout: 369 seconds)
[04:35] *** Coderjoe has quit IRC (Read error: Connection reset by peer)
[04:36] *** VADemon has quit IRC (Read error: Connection reset by peer)
[04:37] *** nertzy has quit IRC (This computer has gone to sleep)
[04:37] *** Elegance has joined #archiveteam
[04:47] *** Coderjoe has joined #archiveteam
[05:01] *** JesseW has joined #archiveteam
[05:15] *** arkhive has joined #archiveteam
[05:16] <arkhive> http://www.pcmag.com/slideshow/story/340385/in-memoriam-tech-that-died-in-2015/
[05:19] <JesseW> kyan: interestly, one of the items in that collection, https://archive.org/history/firstphilosopher00wate was darked apparently due to a copyright claim by the author (in 2013), then undarked by ximm 3 months later.
[05:20] *** megaminxw has quit IRC (Quit: Leaving.)
[05:25] *** Coderjoe has quit IRC (Read error: Connection reset by peer)
[05:30] *** Coderjoe has joined #archiveteam
[05:33] <JesseW> Can anyone think of a good reason not to use terroroftinytown to archive a mapping between Hacker News item numbers and links? e.g. https://news.ycombinator.com/item?id=94521 -> http://fourreasonswhy.com/2008/01/03/privacy-is-doomed/
[05:43] *** Elegance has quit IRC (Ping timeout: 369 seconds)
[05:44] *** Elegance has joined #archiveteam
[06:06] *** Elegance has quit IRC (Ping timeout: 250 seconds)
[06:13] *** Elegance has joined #archiveteam
[06:57] *** Froggypwn has quit IRC (Ping timeout: 483 seconds)
[06:58] *** Froggypwn has joined #archiveteam
[07:05] *** Elegance has quit IRC (Ping timeout: 369 seconds)
[07:05] *** Elegance has joined #archiveteam
[07:33] <Sketchcow> Kind of weird to do that
[07:33] <Sketchcow> We after all have all the comments
[07:35] *** Elegance has quit IRC (Ping timeout: 369 seconds)
[07:36] <JesseW> We do?
[07:36] * JesseW goes looking for the dump
[07:37] <JesseW> Hm, looks like the dump is only as of 2014-05-29 -- we should probably run a re-grab sometime.
[07:38] <JesseW> http://archiveteam.org/index.php?title=Hacker_News
[07:38] *** Elegance has joined #archiveteam
[07:45] <JesseW> apparently there's a more recent one here: https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb -- it'd be good to copy it into IA
[07:50] *** JesseW has quit IRC (Leaving.)
[07:52] *** Elegance has quit IRC (Ping timeout: 250 seconds)
[08:00] *** Elegance has joined #archiveteam
[09:11] *** ohhdemgir has quit IRC (Ping timeout: 260 seconds)
[09:39] *** philpem has joined #archiveteam
[09:56] *** username1 is now known as schbirid
[10:31] *** ats has quit IRC (Quit: Lost terminal)
[10:32] *** ats has joined #archiveteam
[11:11] *** jspiros has quit IRC (Read error: Operation timed out)
[11:15] *** REiN^ has joined #archiveteam
[11:16] *** jspiros has joined #archiveteam
[11:29] *** zerkalo has joined #archiveteam
[11:52] *** bzc6p has joined #archiveteam
[11:52] *** swebb sets mode: +o bzc6p
[11:52] *** bzc6p has left 
[11:56] *** bzc6p has joined #archiveteam
[11:56] *** swebb sets mode: +o bzc6p
[12:03] *** PepsiMax has quit IRC (Read error: Operation timed out)
[12:03] *** Famicoman has quit IRC (Read error: Operation timed out)
[12:05] *** Gfy has quit IRC (Ping timeout: 250 seconds)
[12:06] *** rctbeast has joined #archiveteam
[12:07] *** bzc6p has left 
[12:10] *** PepsiMax has joined #archiveteam
[12:10] *** PepsiMax has quit IRC (Connection closed)
[12:14] *** Gfy has joined #archiveteam
[12:30] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:54] *** Ghost_of_ has quit IRC (Quit: Leaving)
[13:20] *** Ravenloft has quit IRC (Ping timeout: 364 seconds)
[13:34] *** Famicoman has joined #archiveteam
[13:52] *** redlob has quit IRC (Ping timeout: 260 seconds)
[13:57] *** redlob has joined #archiveteam
[14:13] *** redlob has quit IRC (Max SendQ exceeded)
[14:22] *** redlob has joined #archiveteam
[14:50] *** Elegance_ has joined #archiveteam
[14:51] *** Elegance has quit IRC (Ping timeout: 369 seconds)
[15:00] *** Morbus has quit IRC (Quit: http://www.disobey.com/)
[15:09] *** SimpBrain has quit IRC (Leaving)
[15:21] *** ohhdemgir has joined #archiveteam
[15:48] *** Elegance_ has quit IRC (Ping timeout: 369 seconds)
[15:48] *** Elegance has joined #archiveteam
[15:54] *** Microguru has joined #archiveteam
[16:16] *** Microguru has quit IRC (Read error: Operation timed out)
[16:27] *** nertzy has joined #archiveteam
[16:33] <arkiver> SketchCow: chfoo: can you please create an rsync target on FOS for oldfriends?
[16:49] *** JetBalsa has quit IRC (Ping timeout: 258 seconds)
[16:49] *** JetBalsa has joined #archiveteam
[16:58] *** JesseW has joined #archiveteam
[16:59] *** nertzy has quit IRC (This computer has gone to sleep)
[17:06] *** SimpBrain has joined #archiveteam
[17:08] <chfoo> arkiver: ok, done
[17:08] <arkiver> thank you!!
[17:24] <arkiver> The OldFriends project is now running!
[17:25] <arkiver> First items are the images
[17:25] <arkiver> like http://www.oldfriends.co.nz/InstitutionPhotoView.aspx?id=132396
[17:27] *** JesseW has quit IRC (Leaving.)
[17:29] * SimpBrain throws a 3 concurrent at oldfriends
[17:36] * SmileyG ponders
[17:36] <SmileyG> i need to get my code on and learn this stuff
[17:54] <dashcloud> Came across this yesterday: https://www.clockss.org/clockss/Home - a joint venture to ensure academic works not sold by anyone anymore end up freely available to the public under a CC license
[18:00] <JW_work> probably worth copying the "triggered content" into IA, if it isn't already
[18:06] <JW_work> http://clockss.org/clockss/Triggered_Content — looks like a total of 227 "volumes" with an unknown number of issues in each volume, and an unknown number of separate PDFs (generally one per article) in each issue. 
[18:06] <JW_work> There is also one just listed as "Coming soon" http://clockss.org/clockss/MD_Conference_Express
[18:18] *** rctbeast has quit IRC (Ping timeout: 243 seconds)
[18:21] *** rctbeast has joined #archiveteam
[18:42] <JW_work> probably worth sending a note to https://www.martineve.com/2012/03/30/the-problems-for-small-open-access-journals-in-terms-of-digital-preservation/ informing him of IA (I assume he knows, but didn't mention it for some reason, which might be interesting to know)
[19:06] <JW_work> wikiteam should probably also add http://documents.clockss.org/index.php/Main_Page to the grab
[19:16] <Sketchcow> STILL uploading the first batch of books.
[19:17] <Sketchcow> And the first batch isn't even half of the total set
[19:17] <Sketchcow> And the total set isn't all of the books due to the screwup 1000 problem
[19:21] <Sketchcow> Oh man, it's only at 1,500.
[19:30] *** Ravenloft has joined #archiveteam
[19:46] <PurpleSym> Sketchcow: Can you create a collection for Yahoo! Groups, please? Items: https://archive.org/details/@purplesymphony?and[]=subject%3A%22yahoo%20groups%22
[19:59] *** Ghost_of_ has joined #archiveteam
[20:00] *** JetBalsa has quit IRC (- nbs-irc 2.39 - www.nbs-irc.net -)
[20:06] <Sketchcow> That is some weak-ass metadata there.
[20:06] <Sketchcow> What did you DO
[20:10] <PurpleSym> Ah yeh, that’s just the groups in the specific item.
[20:11] <PurpleSym> For the search engine.
[20:13] <PurpleSym> The CDX server unfortunately can’t deal with the WARC/CDX files I’m uploading.
[20:14] <SimpBrain> yahoo groups must be a pita to archive
[20:14] <SimpBrain> since a lot of groups are private
[20:14] <PurpleSym> Even without the private groups its gonna take 17 year at the current rate.
[20:14] <PurpleSym> *years
[20:15] *** Stilett0 has joined #archiveteam
[20:18] <Start> i'm done grabbing files from romhacking.net, currently uploading them to archive.org
[20:18] <Start> i'll probably grab new files every 2 weeks or every month
[20:18] <Sketchcow> That metadata should be in the description
[20:18] *** Stiletto has quit IRC (Read error: Operation timed out)
[20:21] <PurpleSym> Are custom metadata fields discouraged?
[20:21] <PurpleSym> I was aiming for search queries like “subject:"yahoo groups" /metadata/group:billiejoearmstrongsthrone”
[20:24] *** xhades has joined #archiveteam
[20:39] <JW_work> PurpleSym: at a minimum, please add *spaces* between your group names, just for improvement of display
[20:42] <PurpleSym> Sure, I can change that. Semicolons are used right now, because “subject” does the same.
[20:43] <PurpleSym> (Unfortunately subject’s semantics are not applied to my custom metadata field)
[20:44] <DFJustin> well space and semicolon
[20:52] <JW_work> yeah, don't leave *out* the semicolon, just use both
[20:58] <PurpleSym> I’ll bulk-update existing items tomorrow.
[21:03] <JW_work> cool, thanks
[21:05] <JW_work> purpleSym: you should probably also put the group names in the "subject" metadata field, too, to improve seraching.
[21:06] <PurpleSym> I tried that first, but the topic bar on the right can’t keep up with that.
[21:06] <JW_work> Even with spaces? Interesting.
[21:07] <PurpleSym> Nah, semicolon-delimited. The parser does its job well.
[21:07] <JW_work> And as Sketch said, make sure to include them in Description for human-readable purposes, too.
[21:07] <PurpleSym> But the number of groups is just too large I guess.
[21:10] <JW_work> Additional metadata that could go in the description is a count of messages in each group (which you could extract from the cdx)
[21:11] <JW_work> PurpleSym: it seems like this could be usefully turned into a Warrior project…
[21:12] <PurpleSym> That’s true.
[21:13] <PurpleSym> But I’ve never written Lua code.
[21:14] <JW_work> AFAIK, Warrior projects require pretty minimal Lua code — it's mostly just filling in blanks. (I haven't written a project, myself, though — so take this with a grain of salt)
[21:17] <PurpleSym> Also as far as I see neither my machine nor network connection are limiting the grab. Yahoo’s servers seem to be kind of “fragile”. If I want to I can easily overload them with mere search queries.
[21:18] <JW_work> And more IP addresses wouldn't help?
[21:19] <PurpleSym> Rate-limiting is IP-based, yes. But I have a /64 IPv6 ;)
[21:20] <JW_work> ah, so you have enough IPs, then. So your network connection *alone* is enough to overload them? How are they serving everyone else in the world, I wonder?
[21:20] <JW_work> Are there really that few people using Yahoo Groups at this point?
[21:21] <JW_work> Are you sure they aren't rate-limiting the entire /64?
[21:22] <PurpleSym> Everyone else just hits the cache. I’m trashing it.
[21:22] <PurpleSym> As far as I know there’s no limit per netblock.
[21:22] <PurpleSym> Or I have not hit it yet.
[21:28] <DFJustin> http://www.npr.org/sections/thetwo-way/2015/12/29/461401135/egypt-raids-2-major-independent-cultural-institutions-in-2-days
[21:34] <JW_work> PurpleSym: hm, I wonder if you tried to aim at things already in the cache if you could boost the rate that way
[21:35] <JW_work> in any case, we should probably take this to -bs
[22:03] <Sketchcow> https://twitter.com/terryteachout/status/681957100730363904
[22:12] <yipdw> wait that blog entry is barely a page long
[22:12] <yipdw> I was expecting something more in-depth
[22:21] <Sketchcow> I know
[22:21] <Sketchcow> I KNOW
[22:26] *** VADemon has joined #archiveteam
[22:40] <mistym> Sketchcow: When you said you were interested in all CDs: did you mean the physical disc, or just an ISO?
[22:43] <JW_work> mistym: Both, would be my guess.
[22:43] <JW_work> IA has a warehouse — I'm pretty sure they'd be glad to store the original object; but I'm sure they'd be very grateful if you did the ripping yourself, saving them the time and effort.
[22:44] <jleclanch> JW_work: ive been trying to send a collection of CDs + magazines to the IA warehouse but got no reply from the email guy :(
[22:45] <JW_work> jleclanch: he's … really busy. How long has it been? If it's been more than a week, I'd just re-send your email.
[22:45] <jleclanch> JW_work: been 3 weeks since my last poke, have poked him weekly before
[22:46] <JW_work> eh, seems fine to poke again
[22:46] <jleclanch> last he replied was asking for a pricing estimate on the package
[22:46] <JW_work> or you could wait till next week, as this is still likely a somewhat dead week between xmas and new years
[22:47] <JW_work> and I presume you sent him one, and haven't heard back since?
[22:47] <jleclanch> yea
[22:47] * JW_work is reminded to check if I have any outstanding things to bug IA about
[22:47] <jleclanch> OT but god damn scrapy is fun
[22:48] <JW_work> heh. what are you ripping?
[22:48] <jleclanch> npmjs.org lol
[22:48] <jleclanch> doing a small packaging project
[22:48] <jleclanch> https://github.com/jleclanche/npm-js-metadata/
[22:48] <JW_work> wait, that doesn't have an API? ha
[22:48] <jleclanch> they have a terrible api
[22:48] <jleclanch> no way to list packages so i have to rip from the most-starred 
[22:48] <JW_work> lol
[22:50] <jleclanch> I also found out another popular language-specific package hosting website is keeping *all* packages and files, regardless of whether they were deleted
[22:50] <jleclanch> so that's lovely
[22:50] <jleclanch> must be a bunch of passwords in there /sigh
[22:51] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[22:51] <JW_work> Eh, if you publish a password in public, you need to *change* it, not just try to get the horse back in the barn
[22:51] <jleclanch> ofc
[22:51] <jleclanch> doesn't mean people do it
[22:52] <JW_work> so I think keeping *all* the versions (and just dark'ing them on request) is a better idea
[22:55] *** ndiddy has joined #archiveteam
[22:55] *** GLaDOS has joined #archiveteam
[22:56] <JW_work> jleclanch: you've seen https://docs.npmjs.com/misc/registry , right?
[22:56] <jleclanch> JW_work: yes, that git repo mirrors the registry
[22:57] <JW_work> so which data are you scraping that isn't available in the registry?
[22:57] <jleclanch> package lists
[22:57] <JW_work> but can't you get that from https://web.archive.org/web/20150905225943/http://skimdb.npmjs.com/registry
[22:57] <jleclanch> no?
[22:58] <JW_work> the "public mirror" of the underlying CouchDB?
[22:58] <jleclanch> oh hm
[22:58] <jleclanch> good point, let me look into it
[23:05] <JW_work> jleclanch: it looks like you can get a list of packages with: https://skimdb.npmjs.com/registry/_all_docs?limit=10&skip=3000
[23:05] <JW_work> http://docs.couchdb.org/en/1.6.1/api/database/bulk-api.html#db-all-docs
[23:05] <jleclanch> JW_work: yeah i was just getting to this. mb not recognizing this as a couchdb instance
[23:05] <jleclanch> sweet stuff
[23:06] <JW_work> I thought it implausible that npmjs would have such a bad api
[23:06] <jleclanch> ive been dealing with bad APIs all week
[23:06] <jleclanch> nothing is implausible, not if you can imagine it!
[23:06] <JW_work> heh. sure — but npm seems generally more clueful than that
[23:07] <jleclanch> well
[23:07] <jleclanch> the package names are case-sensitive
[23:07] <jleclanch> so there's that
[23:07] <JW_work> that seems sensible enough
[23:07] <jleclanch> https://www.npmjs.com/package/jQuery
[23:07] <jleclanch> https://www.npmjs.com/package/jquery
[23:07] <jleclanch> does it
[23:16] <JW_work> er, those are different values
[23:16] <JW_work> different packages
[23:16] <jleclanch> that's what i mean
[23:16] <JW_work> ok, and the problem is?
[23:17] <jleclanch> that it's possible? :p
[23:17] <JW_work> that seems like a feature. "jquery" is not the same as "jQuery"
[23:18] <JW_work> is not the same as "jQuErY"
[23:18] <jleclanch> you see feature, i see social engineering vector
[23:18] <JW_work> eh, unicode allows plenty of such whether you fold case or not
[23:19] <jleclanch> how does that invalidate what I said? it just reinforces it :P
[23:19] <JW_work> Having a *warning* when full-unicode-case-folding would show a colision, that seems good — but prohibiting it — not so much
[23:21] <JW_work> ok, there does seem to be a bug there, though — the stats are identical for both
[23:21] <JW_work> which I doubt is accurate
[23:21] <jleclanch> the stats server is a separate API
[23:21] <jleclanch> im not surprised
[23:21] *** brayden has quit IRC (Read error: Operation timed out)
[23:21] <jleclanch> the latter must be case insensitive
[23:21] <JW_work> heh, OK — *that's* a bug
[23:22] <JW_work> if they don't support the case sensitivity across all their services — then they need to turn it *off* across all their services
[23:23] <JW_work> I mean, I'd lean towards requiring lowercase ascii alphanum & underscore for package names, but that might be going too far
[23:24] <jleclanch> apparently the / character can be in the package name, too
[23:25] <jleclanch> and that breaks some more APIs
[23:25] <JW_work> lol
[23:25] <jleclanch> what were you saying about cluefulness again? :P
[23:25] <JW_work> heh
[23:25] <jleclanch> the couchdb bit is pretty damn cool tho
[23:26] <JW_work> ok, let me clarify — not "cluefullness", but rather, "enthusiastic openness"
[23:26] <jleclanch> heh fair enough
[23:27] <JW_work> ah, there is a ticket: https://github.com/npm/newww/issues/380
[23:27] *** xhades has quit IRC (Read error: Operation timed out)
[23:27] *** xhades has joined #archiveteam
[23:29] *** Elegance has quit IRC (Quit: :(){ :|:& };:)
[23:47] *** megaminxw has joined #archiveteam