[00:21] *** kristian_ has quit IRC (Leaving)
[00:39] Has anyone done any work on NPR's comments?
[00:54] Asked about this a few days back, didn't get any response, so I'd assume no.
[01:47] *** HCross has quit IRC (Ping timeout: 246 seconds)
[01:47] *** HCross has joined #archiveteam
[01:54] *** khaoohs has joined #archiveteam
[02:03] *** khaoohs has quit IRC (Quit: Leaving)
[02:10] *** tomwsmf has quit IRC (Read error: Operation timed out)
[02:30] *** mr-b has left
[02:45] *** db48x has joined #archiveteam
[03:10] *** db48x` has joined #archiveteam
[03:11] *** db48x has quit IRC (Read error: Operation timed out)
[03:14] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[03:15] *** BartoCH has joined #archiveteam
[03:22] *** nicolas17 has quit IRC (Quit: U+1F634)
[04:09] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[04:12] *** BartoCH has joined #archiveteam
[04:17] *** JesseW has joined #archiveteam
[04:17] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:24] *** Sk1d has joined #archiveteam
[04:26] we should probably get all the sites we can from http://www.users.totalise.co.uk as it appears to be a small ISP, in the process of being merged with another one (although they don't explicitly talk about shutting down the web sites)
[04:29] !ig 28j6lpt5lmtyrdi4dhfugpmto squarespace\.com
[04:35] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[04:35] *** HCross has quit IRC (Ping timeout: 246 seconds)
[04:35] *** HCross has joined #archiveteam
[04:43] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[04:43] *** Meroje has quit IRC (Quit: bye!)
[04:44] *** Meroje has joined #archiveteam
[04:53] *** DFJustin has joined #archiveteam
[04:53] *** swebb sets mode: +o DFJustin
[05:05] *** DFJustin has quit IRC (Remote host closed the connection)
[05:10] *** DFJustin has joined #archiveteam
[05:15] *** HCross has quit IRC (Read error: Operation timed out)
[05:15] *** HCross has joined #archiveteam
[05:45] I'm in the process of grabbing the ones I can find with archivebot
[05:51] *** quails has quit IRC (Ping timeout: 250 seconds)
[05:56] *** quails has joined #archiveteam
[05:57] *** phuzion has quit IRC (Read error: Operation timed out)
[05:58] *** phuzion has joined #archiveteam
[06:04] *** patrickod has quit IRC (Read error: Operation timed out)
[06:04] *** patrickod has joined #archiveteam
[06:05] *** phuzion has quit IRC (Read error: Operation timed out)
[06:05] *** sep332 has quit IRC (Read error: Operation timed out)
[06:05] *** midas1 has quit IRC (Read error: Operation timed out)
[06:07] *** midas1 has joined #archiveteam
[06:07] *** swebb sets mode: +o midas1
[06:07] *** sep332 has joined #archiveteam
[06:10] *** Fake-Name has quit IRC (Ping timeout: 501 seconds)
[06:13] *** BlueMaxim has joined #archiveteam
[06:13] *** phuzion has joined #archiveteam
[06:13] *** Fake-Name has joined #archiveteam
[06:49] *** zerbrnky has joined #archiveteam
[06:49] hi all, anyone around?
[06:49] Any problem?
[06:50] Zebranky: no. But ask whatever you were going to ask anyway...
[06:50] hm i should use a different nick D:
[06:51] i'm not Zebranky (i use a variant of this nick on places where longer names are allowed)
[06:51] *** zerbrnky is now known as rbraun
[06:51] oops, sorry
[06:51] i was looking through the gawker dumps on archive.org and yeah, there might be a problem
[06:52] well, a lot of our most recent stuff may not have made it up there yet
[06:52] and we know about the robots.txt issues
[06:52] is there a different problem?
[06:52] it looks like they were grabbed by grabbing the sitemap for each month and then grabbing from there
[06:52] the problem is that the sitemap for especially busy months can't be grabbed a whole month at a time
[06:53] hm, yeah that could be an issue. godane?
[06:53] so e.g. everything in January 2010 before 1/19 is missing from both this: https://archive.org/details/gawker.com-sitemap-2010-20160322
[06:53] and from web.archive.org too
[06:53] rather it's not all missing from web.archive.org but some pages are
[06:54] and many of the pages that /are/ there weren't crawled this year, indicating the bulk grab in march didn't hit them
[06:54] this seems to be a bigger problem for older pages (probably back when they still paid their writers by the article)
[06:54] do you know of a way to get a list of the missing pages?
[06:55] yeah, you just see what the start date for the sitemap was and edit the end date to be that, iterate until it grabs thru the first of the month
[06:55] i'm working on it now but i was wondering if anyone had already done it
[06:56] e.g. january 2010 takes 3 pulls
[06:56] and then of course all the pages...
[06:56] (january 2012 is complete, though)
[06:56] godane is the person who has been working on it; hopefully he'll speak up
[07:03] *** Honno has joined #archiveteam
[07:11] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[07:11] is there a faster way to force wayback to crawl a list of URLs than just loading http://web.archive.org/save/[URL] for each?
[07:13] Try #archivebot
[07:18] oh, nice, there is a non-recursive option
[07:20] archiveonly < FILE is probably what i need, thanks
[07:21] when archivebot uploads a WARC to archive.org, does it end up in web.archive.org too?
[07:21] in wayback, that is
[07:23] Yes, that’s the point.
[07:23] ok, thanks, this looks easier than i thought
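For reference, the brute-force approach asked about at [07:11] is just a loop over Wayback's save endpoint (archivebot's archiveonly < FILE, mentioned above, turned out to be the better answer). A minimal sketch, assuming a plain urls.txt; the 5-second delay is an arbitrary politeness guess, not a documented limit:

    # naive per-URL save, as asked about at [07:11]; the endpoint is from
    # the log, urls.txt and the delay are illustrative assumptions
    while read -r url; do
        curl -s -o /dev/null "http://web.archive.org/save/$url"
        sleep 5
    done < urls.txt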
[07:24] (fwiw i first discovered this issue when i noticed something from jan 2010 *wasn't* in wayback at all; then, found it wasn't in the collection i linked either)
[07:28] gut feeling is that 2007-2011 are affected in part
[07:28] (looking at http://gawkerdata.kinja.com/closing-the-book-on-gawker-com-1785555716)
[07:31] *** REiN^ has quit IRC (Read error: Connection reset by peer)
[07:33] *** phuzion has quit IRC (Read error: Operation timed out)
[07:36] *** phuzion has joined #archiveteam
[07:52] *** schbirid has joined #archiveteam
[08:14] based on the sitemap it's 2010-01-19 on: gawker.com/sitemap_bydate.xml?startTime=2010-01-01T00:00:00&endTime=2010-01-31T23:59:59
[08:14] ok i see the problem: http://gawker.com/sitemap_bydate.xml?startTime=2010-01-01T00:00:00&endTime=2010-01-01T23:59:59
[08:15] sometimes those sitemaps do some weird shit
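A sketch of the iterate-the-end-date workaround described at [06:55]: request the month, note the earliest date the truncated response actually reached, then shrink the window and pull again until the first of the month is covered. Only the sitemap_bydate.xml URL pattern comes from the log; the <lastmod> parsing and newest-first truncation are untested assumptions about the feed:

    # walk the endTime backwards until the whole month is covered
    site="gawker.com"
    first_day="2010-01-01"
    end="2010-01-31T23:59:59"
    while :; do
        curl -s "http://$site/sitemap_bydate.xml?startTime=${first_day}T00:00:00&endTime=$end" > chunk.xml
        grep -o '<loc>[^<]*</loc>' chunk.xml | sed 's/<[^>]*>//g' >> urls.txt
        # earliest date this (possibly truncated) response reached -- assumes
        # the feed carries a <lastmod> per entry, which is unverified
        earliest=$(grep -o '<lastmod>[^<]*</lastmod>' chunk.xml | sed 's/<[^>]*>//g' | sort | head -n1)
        [ "${earliest%%T*}" = "$first_day" ] && break
        end="${earliest%%T*}T00:00:00"
    done
    sort -u urls.txt -o urls.txt   # successive windows overlap at the seams

For January 2010 this should take the three pulls mentioned above; for a complete month like January 2012 it stops after one.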
[08:16] godane: do you have the missing ones or should i keep compiling them and feed them to archivebot?
[08:16] i have 2010 almost ready
[08:17] you can feed them into archivebot if you want to
[08:17] i will also see about doing it
[08:17] i checked several URLs from my file; some of them are in wayback and some not
[08:18] ok; i'm working on 2010 but i think all of 2007-11 might be affected based on volume
[08:18] (and the URLs not in wayback weren't saved in the big March dump, they were crawled earlier)
[08:19] i may be doing daily grabs now
[08:19] regrabs of what i got
[08:19] also really uncertain how long any of the sites will stay up so
[08:20] i will work on the gawker.com sitemap
[08:20] note that in every case i saw, if the monthly grab by default returned through X date, the original grab had all of those articles
[08:21] but not all the ones before that
[08:26] i'm redumping gawker.com as a daily sitemap grab
[08:30] kataku.com has the same problem
[08:30] *kotaku.com
[08:37] *** REiN^ has joined #archiveteam
[08:43] *** WinterFox has joined #archiveteam
[08:58] godane: do you want what i have for 2010? might save some time
[08:59] it's not going to save me time sadly
[09:00] my script makes a run at the sitemap by the day now
[09:00] ok
[09:00] also i will have to do that with all of the gawker sites
[09:00] some of them i think don't have enough articles for this to have been an issue
[09:01] not sure which ones though
[09:01] i have uploaded some of those
[09:01] they were in the 10 to 100mb
[09:01] range
[09:02] might save the crawler time at least to not have to recrawl what's known already in the archive?
[09:02] *** BartoCH has joined #archiveteam
[09:02] (several different ways to do that; i was just using the date cutoff)
[09:05] also i'm not sure how much time is left for gawker.com specifically
[09:08] btw the sitemap cutoff is weird
[09:09] like for 2008-11 i can get 3034 urls with gawker but only 1971 urls with kotaku.com
[09:09] google code is empty. Can someone requeue?
[09:12] godane: there are fewer articles total for that month on kotaku though
[09:13] godane: for 2008-11 if i request the whole month it cuts off at the 14th for gawker but the 6th for kotaku
[09:13] oh, i see
[09:13] yeah, why didn't it grab the whole month for kotaku...
[09:13] fwiw their own sitemaps link in 1-week increments
[09:14] http://gawker.com/sitemap.xml
[09:14] not sure i trust that given how uneven it is but i haven't found a case where it failed yet
[09:17] *** HCross has quit IRC (Ping timeout: 246 seconds)
[09:17] *** HCross has joined #archiveteam
[09:22] sitemaps for 2006-01 are starting to be uploaded: https://archive.org/details/gawker.com-sitemap-2006-01-09-20160823
[09:24] i'm doing 11 months of daily sitemaps at once :-D
[09:24] that's going to produce a lot of collections... any reason not to combine those by month?
[09:24] also FYI while investigating this, the sitemap_bydate.xml was giving me 500 errors sometimes
[09:25] that was reliable if i didn't request whole-day increments
[09:25] but it happened some other times too; just reloading fixed it
[09:25] my script uses curl to grab the sitemap by day, then starts the download
[09:26] why not cat those together like a month at a time?
[09:27] cause i was not planning on doing that
[09:27] well, the reason i ask is
[09:27] the sitemaps provide an index of article titles
[09:28] so if i know gawker published an article in 1/2010 but i don't know which day...
[09:28] and i only know one word of the title or something
[09:28] https://archive.org/details/archiveteam-fire?and[]=subject%3A%22www.dailymail.co.uk%22
[09:28] i do it by date of sitemap
[09:29] it's also easier to verify everything is in there if it's in larger chunks
[09:29] *** vOYtEC has quit IRC (Ping timeout: 244 seconds)
[09:29] if i make a month sitemap it may make me confused
[09:30] thinking it was done with the old method, where the gawker sitemap doesn't get everything
[09:30] so the daily dumps are meant to be different since the monthly and yearly ones failed
[09:31] i can turn the daily dumps into monthly or yearly for that reason
[09:32] hmm ok
[09:34] i'm mostly trying to keep the raw sitemap urls the same set as the date of the urls
[09:34] *** HCross2 has quit IRC (Quit: Connection closed for inactivity)
[09:37] *** schbird has joined #archiveteam
[09:37] is there a way to record mouse/keyboard interaction with webrecorder.io or a similar tool?
[09:37] godane: can your script handle the case where it returns a 500 error and retry?
[09:38] to actually replay all "user" interaction
[09:38] godane: i guess curl --retry 10 or something
[09:41] i was getting those intermittently even on sitemap_bydate requests that would later complete
[09:49] i'm not really getting those errors
[09:49] i get them on days that don't exist i think
[09:49] no, i get empty files (or with the front page only) on days that don't exist
[09:50] i get 500 errors when it's cranky or if i try to pull a partial day (which doesn't work)
[09:50] but in the former case i had to retry a few times
[09:50] if you pass --retry <#> to curl with some number of retries allowed, you should have no problem though
[09:52] only getting that on the sitemaps occasionally, not the actual pages
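Something like godane's per-day grab, with the curl --retry suggestion from just above folded in. A sketch: only the URL pattern and --retry come from the log; the loop, GNU date arithmetic, and file names are guesses rather than the actual script:

    # fetch one sitemap per day for a month, retrying transient 500s
    site="gawker.com"
    day="2010-01-01"
    while [ "$day" != "2010-02-01" ]; do
        curl -s --retry 10 --retry-delay 5 \
            "http://$site/sitemap_bydate.xml?startTime=${day}T00:00:00&endTime=${day}T23:59:59" \
            -o "sitemap-$site-$day.xml"
        day=$(date -d "$day + 1 day" +%F)   # GNU date
    done

curl's --retry treats most HTTP 5xx responses as transient, so the intermittent 500s should get retried automatically.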
[09:52] *** Selavi has quit IRC (Ping timeout: 260 seconds)
[09:53] *** Kksmkrn has joined #archiveteam
[09:53] *** Kksmkrn has quit IRC (Connection closed)
[09:53] *** Kksmkrn has joined #archiveteam
[09:53] i'm going to bed now
[09:53] i will continue tomorrow
[09:54] ok good night
[10:00] *** Selavi has joined #archiveteam
[10:09] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[10:16] *** BartoCH has joined #archiveteam
[10:28] *** enr1c0 has joined #archiveteam
[10:31] *** Kksmkrn has quit IRC (Quit: leaving)
[11:00] *** enr1c0 has quit IRC (Quit: ZNC 1.6.3+deb1 - http://znc.in)
[11:00] *** enr1c0 has joined #archiveteam
[11:01] *** enr1c0 has left
[11:26] *** enr1c0 has joined #archiveteam
[11:30] *** enr1c0 has quit IRC (Client Quit)
[11:31] *** enr1c0 has joined #archiveteam
[11:31] *** enr1c0 has left
[11:35] *** HCross has quit IRC (Ping timeout: 246 seconds)
[11:35] *** HCross has joined #archiveteam
[12:28] *** irl has joined #archiveteam
[12:29] ok, so i was here a while ago and i'm trying to archive a whole bunch of paper manuals and documents from the 70s-90s for obscure networking hardware and computer programs relating to networking and such
[12:30] following a complete mess trying to use the university's MFD devices (they scan to email only, and couldn't do large attachments, so i was limited to ~5 pages)
[12:30] i've now decided i want to buy a scanner with an ADF to sit in the lab
[12:30] can anyone recommend such a scanner that can handle various paper types, and paper with binding holes etc. that isn't going to break constantly?
[12:31] ideally it would have linux support and not be networked, but direct into the pc
[12:31] ideally it would also be fast-ish, but i'll take reliability over speed
[12:32] I recently *built* a 25€ DIY book scanner, but it’s quite slow.
[12:32] i'm talking ~10,000 ish pages of manuals
[12:32] they're mostly A4 paper that's been punched and hand-bound
[12:32] So, destructive scanning then?
[12:33] with those plastic binding things
[12:33] I see.
[12:33] my hope is to be able to just put the plastic things back on them afterwards
[12:34] i've looked through ebay for scanners with adf, but i have no idea how reliable these things are
[12:35] the HP 9200C Digital Sender seems to come up a lot and looks quite heavy duty
[12:35] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[12:37] *** BartoCH has joined #archiveteam
[12:46] purchased a 9200c, seems to have good reviews
[12:47] i'm guessing a lot of these things will have valid copyright
[12:47] any advice on how to work out what i can publish and what i shouldn't publish?
[12:48] is there a place i can stash things until the copyright expires?
[12:50] irl: IA :)
[12:50] irl: IA will dark things if they get complaints
[12:51] where 'dark' === "it's still in the archives but not publicly accessible"
[12:51] (also you might want to talk to SketchCow regarding manuals)
[12:51] *** atomotic has joined #archiveteam
[13:03] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:04] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:04] *** BartoCH has joined #archiveteam
[13:21] joepie91: ah cool (:
[13:21] so i can basically automate most of this then using scanner->ftp->git-annex-assistant->ia
[13:21] just need to get the right metadata in the right places
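The last hop of that scanner->ftp->git-annex-assistant->ia pipeline could be scripted with the ia CLI from the internetarchive Python package (after ia configure for credentials). A sketch; the identifier and metadata values are made-up examples, not a prescribed scheme:

    # push one finished scan to archive.org; identifier and metadata here
    # are hypothetical examples, not an agreed naming scheme
    ia upload manual-hp-9200c-user-guide scan-output.pdf \
        --metadata="mediatype:texts" \
        --metadata="title:HP 9200C Digital Sender User Guide" \
        --metadata="subject:networking; manuals"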
[13:21] SketchCow: i might want to talk to you
[13:24] *** WinterFox has quit IRC (Read error: Operation timed out)
[13:27] *** beardicus has quit IRC (bye)
[13:27] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[13:28] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:31] *** beardicus has joined #archiveteam
[13:31] *** swebb sets mode: +o beardicus
[13:35] *** beardicus has quit IRC (Client Quit)
[13:37] *** beardicus has joined #archiveteam
[13:37] *** swebb sets mode: +o beardicus
[13:45] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:45] *** BartoCH has joined #archiveteam
[13:46] *** wp494 has quit IRC (Read error: Operation timed out)
[13:47] *** dashcloud has joined #archiveteam
[14:42] *** tomwsmf has joined #archiveteam
[14:47] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[14:52] *** irl_ has joined #archiveteam
[14:53] *** irl has quit IRC (Quit: WeeChat 1.5)
[14:53] *** irl_ is now known as irl
[14:56] *** irl has quit IRC (Client Quit)
[14:57] *** irl has joined #archiveteam
[15:00] SketchCow: if you're interested in old manuals, i can get you a list of the things we may have
[15:01] *** nicolas17 has joined #archiveteam
[15:01] SketchCow: i'm now idling here via znc, so i'll see when you respond as i guess you're not around right now
[15:02] i'll be at debian uk bbq eating burgers this weekend, but will probably have a go at this the following weekend
[15:02] (slow start - not diving in)
[15:15] *** wp494 has joined #archiveteam
[15:18] *** JesseW has joined #archiveteam
[15:23] *** schbird has quit IRC (Read error: Operation timed out)
[15:25] *** JesseW has quit IRC (Read error: Operation timed out)
[15:34] *** BartoCH has joined #archiveteam
[15:56] *** VADemon has joined #archiveteam
[16:13] Hugs to irl
[16:35] hello
[16:35] SketchCow:
[16:35] still here?
[16:42] *** HCross2 has joined #archiveteam
[16:42] Yep.
[16:43] So much talking. Come to #archiveteam-bs
[17:00] *** tomaspark has quit IRC (Ping timeout: 255 seconds)
[17:02] *** db48x` is now known as db48x
[17:06] bayimg is online again
[17:06] I restarted the script
[17:06] it's not yet in the warrior
[17:06] http://tracker.archiveteam.org/bayimg/
[17:06] * restarted the projects
[17:06] project*
[17:18] OK SO FINALLY
[17:19] http://fos.textfiles.com/pipeline.html is in version 1.0. It'll run once a day (with an indication of when it was run). It's NOT real-time, it's just a way for you nerds to notice what's going on on the site, and be able to communicate with me or each other on a status.
[17:20] It's Inbox --> Outbox --> IA, and if there are interruptions at IA, the Outbox might fill and "work" but will leave some items untouched.
[17:20] some projects seem to be missing
[17:21] It's generating right now, but Orkut is such a nightmare, it will sit there for a while. I added another black-label "line" at the bottom of the table so you can see the difference between "running" and done. Looks like 10-15 minutes of disk thrashing to get through the mess.
[17:21] I see
[17:21] In the future, when it has the second black line at the bottom, if it's not there, it's not in the pipeline.
[17:22] The script in the future will probably run in 5 minutes, as long as insanities like orkut aren't going on.
[17:23] So for example, the WHOLE pipeline is backed up (google code is at 187g) because of Orkut
[17:23] Yep
[17:23] But at least now, in the future, one of you can go "Hey, looks like boombox project is at 300gb for some reason" and we can jump on that.
[17:24] Or "it's time to add an upload script to this or that project"
[17:24] orkut is going down in 8 days, so just a little more time
[17:26] i thought orkut was long gone
[17:26] still here as an archive https://orkut.google.com/en.html
[17:27] are we on track to finish orkut? I have more available if FOS can handle it, if needed.
[17:27] I think we're going to make it
[17:27] Tomorrow or the day after we're going to retry the larger communities, so you might have to do a little less concurrent
[17:28] nod
[17:28] But I'll warn you before we do that
[17:28] the larger communities can be millions of posts
[17:28] (and URLs)
[17:29] So, the script is going to finish running, and I'm going to make two improvements.
[17:29] First, it will not copy over the finished .html file until it's 100% done, so in the future, it's just "there" and not "in progress"
[17:30] Second, I'm going to make a "cheat sheet" which I will occasionally forget to update, but which will change the "Project" name into something better.
[17:30] I tried archiving orkut and it seemed like you didn't need more nodes
[17:30] since most of the time I got rate-limited by the tracker anyway
[17:31] so the download rate was limited by that setting, not by how many people were running the warrior
[17:42] *** AlexLehm has joined #archiveteam
[17:54] http://fos.textfiles.com/pipeline.html just finished.
[17:55] NOW you can rain down questions
[18:00] *** schbird has joined #archiveteam
[18:25] *** pfallenop has quit IRC (Ping timeout: 260 seconds)
[18:25] *** pfallenop has joined #archiveteam
[18:30] *** schbird has quit IRC (Read error: Operation timed out)
[18:34] *** Zialus has quit IRC (Read error: Operation timed out)
[18:34] arkiver: let me know when, and I'll reduce my quarter of a trillion concurrent
[18:38] *** Zialus has joined #archiveteam
[18:40] SketchCow: it would be nice if it also showed megaWARC size
[19:19] *** VerifiedJ has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[19:24] Not really easy to do that, since stuff will be either out or stuck.
[19:24] Oh wait.
[19:24] Mmm, let me see
[19:28] I got it working. It's re-running and it'll update with it after it's done.
[19:29] (almost all are 40gb but it's trivial to print it)
[19:29] if someone wants to be a hero and wiki all this, go ahead
[19:52] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:58] *** BartoCH has joined #archiveteam
[20:39] one that needs crawling for the magazines collection: http://www.muzines.co.uk
[20:39] sadly it has a stupid obnoxious Javascript-based interface...
[20:47] seems to work mostly fine without js here
[21:04] *** HCross2 has quit IRC (Quit: Connection closed for inactivity)
[21:13] *** Morbus has joined #archiveteam
[21:15] *** VerifiedJ has joined #archiveteam
[21:19] *** schbird has joined #archiveteam
[21:27] GTAGaming.com's database was compromised and they may be thinking about shutting the website down along with www.gta4-mods.com. http://www.gtagaming.com/news/comments.php?i=2369
[21:40] *** Honno has quit IRC (Read error: Operation timed out)
[21:50] *** VerifiedJ has left
[21:58] *** vOYtEC has joined #archiveteam
[21:59] *** schbird has quit IRC (Leaving)
[22:07] *** schbirid2 has joined #archiveteam
[22:10] *** schbirid has quit IRC (Read error: Operation timed out)
[22:33] *** RichardG has joined #archiveteam
[22:42] *** schbirid2 has quit IRC (Read error: Operation timed out)
[22:45] *** schbirid2 has joined #archiveteam
[22:47] *** AlexLehm has quit IRC (Ping timeout: 260 seconds)
[23:16] *** JW_work1 has joined #archiveteam
[23:18] *** JW_work has quit IRC (Read error: Operation timed out)
[23:23] *** RichardG has quit IRC (Read error: Operation timed out)
[23:28] Who here can read an ext3 disk and is comfortable with possibly having to do a dd and then extract the data
[23:28] US preferred
[23:29] you mean a physical disk, or?
[23:36] Physical, here in front of me.
[23:37] * nicolas17 is physically too far
[23:37] what's involved in it? i.e. why can't you do it?
[23:37] *** rchrch has joined #archiveteam
[23:37] Don't want to
[23:37] ah
[23:37] If you're asking what's involved, you're not for the job
[23:38] well, he's asking e.g. is it a corrupted ext3 you have to recover things out of, or just a clean filesystem but you have no Linux? :P
[23:38] yeah, basically^
[23:39] I can do magic with block devices but I'm not so good at fixing physically broken disks
[23:39] Not broken
[23:40] ah. can you ship?
[23:43] I assume so because you said US preferred. I'm in Canada though. But if nobody closer wants to then I volunteer
[23:44] I enjoy this sort of thing
[23:45] *** kristian_ has joined #archiveteam
[23:48] *** RichardG has joined #archiveteam
[23:56] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[23:57] You're in line
[23:57] We'll see if anyone else in the US wants it.
[23:58] I can sustain a canadian mailing
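For whoever ends up with the disk: the dd-then-extract job described at [23:28] usually looks roughly like the following. /dev/sdX and the paths are placeholders, and GNU ddrescue would be the safer tool if the drive turns out to be flaky:

    # image the whole device first, then work only from the copy
    dd if=/dev/sdX of=disk.img bs=4M conv=noerror,sync status=progress
    # loop-mount the ext3 image read-only and copy the files out
    mkdir -p /mnt/img
    mount -o ro,loop disk.img /mnt/img
    rsync -a /mnt/img/ extracted/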