[00:00] well, I suppose you can run something like qemu without installing anything, but not dosbox. (though I am not sure how you would get network connectivity to the VM)
[00:00] er
[00:00] s/dosbox/virtualbox/
[00:00] box brainfart
[00:00] Threw me for a loop there; wondered what running Wolf3d had to do with digipres ;)
[00:01] So I totally disappeared after asking a question earlier :V I've got this site with a semi-endless loop - calendar feature that generates a page for any date you throw at it, regardless of whether it's 10000 years in the future.
[00:01] Is there a way to grab that with reasonable limits using wget-warc without totally omitting the calendar?
[00:01] http://www.vbox.me/
[00:02] mistym: Is it possible to disregard the calendar when crawling, and manually grab it later?
[00:02] shaqfu: That's true, that could work.
[00:03] Then you could pass it -r -l 50 or w/e and not accidentally grab the list of events for 50000 AD
[00:03] Hm, this runs on VBox
[00:06] Hm, or maybe this is a job for wget-lua.
[00:06] Will that pull down the warc headers?
[00:08] You can save as whatever the appropriate wget uses. The lua integration creates a set of callbacks that are called as wget recurses over urls.
[00:09] * mistym didn't really know what was supported, but noticed some documentation on the wiki
[00:09] Looks appropriate for this use case, luckily.
[00:13] is there a visual means to determine a date with events vs one without? if so, you could possibly use a lua-enabled wget to look for the html that differentiates them and not give the urls for the days with no events
[00:14] you could potentially even have the lua parse things like css and javascript on these painful-to-archive "web 2.0" ajaxy pages
[00:14] Fortunately it's all in the urls, so it's really easy. date=2012-06, etc. I don't need to match any dates newer than this year and older than, I dunno, 1980, so that's all I really need to parse in this case.
[00:15] Just need to figure out how to match regexps in lua.
[00:17] you might be able to look at the picplz lua for info
[00:20] Oh no guys
[00:20] I repeat
[00:20] The target has entered the zone
[00:20] :D
[00:20] (aka, I'm at IA!)
[00:21] oh shit. there goes the galaxy.
[00:21] underscor: Awesome :D
[00:21] It's quite nice
[00:22] so you're living out there for the summer?
[00:22] Yeah
[00:22] Actually in the building
[00:22] haha
[00:22] It is the most amazing building.
[00:23] It's fabulous
[00:23] Isn't it an old Christian Science building?
[00:23] yep
[00:23] It is now the church of the internet.
[00:23] Say what you want about them, but God damn could they build nice buildings
[00:23] http://www.speedtest.net/result/2014459014.png
[00:24] i would say "great. now he's going to use up all of IAs pipe with torrents..." but you do that already :P
[00:25] I'd joke about "downloading the Internet" but they already do that
[00:27] You guys are silly
[00:27] :D
[00:28] I need someone to move http://archive.org/details/archiveteam-ftp_abit_com_tw over to the archiveteam-fire collection
[00:34] Coderjoe: http://archive.org/details/archiveteam-ftp_abit_com_tw?reCache=1
[00:36] So not knowing any lua, did I make any particularly terrible mistakes here? https://gist.github.com/56d5e8916d73af0c9f11
[00:36] Seems straightforward enough, but I don't know if lua has any hidden gotchas.
[00:37] thanks
[00:41] np
[00:47] underscor: Will you be able to finish the FP upload soon?
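(On the 00:14-00:17 exchange: Lua doesn't do POSIX/PCRE regexps; it has its own simpler pattern syntax exposed through string.match and friends, which is enough for pulling a year out of date=YYYY-MM URLs. A rough sketch of that kind of filter follows - the download_child_p callback name and the urlpos argument layout follow the wget-lua convention the picplz script uses, not the gist above, so treat them as assumptions and check them against your wget-lua build:)

    -- Sketch: tell wget-lua not to recurse into calendar pages outside 1980..today.
    -- Assumed: wget.callbacks.download_child_p exists and urlpos.url.url holds the
    -- candidate URL string (verify against the picplz script / wiki docs).
    wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict, reason)
      local url = urlpos.url.url
      -- Lua pattern, not a regexp: %d is a digit, %- is a literal hyphen.
      local year = url:match("date=(%d%d%d%d)%-%d%d")
      if year then
        year = tonumber(year)
        if year < 1980 or year > tonumber(os.date("%Y")) then
          return false   -- skip far-future / ancient calendar pages
        end
      end
      return verdict     -- otherwise keep wget's own decision
    end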
[00:47] yeah, definitely
[00:47] once I get settled in
[00:47] I'm actually right near the box it's on now :D
[00:47] Sorry for being naggy - just want to see this project done soonish, since we just found more FP
[00:48] Now we're finding collided files, blech
[00:52] :(
[00:53] Nothing like a site passing through God knows how many major reorgs - it's a wonder it functioned at all
[02:47] fanfiction.net is doing a purge again, we should put up a forum post saying we've probably got them saved
[03:15] "but I don't want them saved. please destroy all copies"
[03:15] hahaha
[03:15] I thought the plan was to keep the fanfiction.net archiving on the down-low because of the inevitable dramabomb
[03:17] Well I'm sure many an author would be happy to hear that their work has, in fact, not been BAH-leeted.
[03:17] And many others will be livid that they're known as the author of Kirk/Spock fish slashfic, FOREVER
[03:18] i liked my example subject better
[03:18] but I don't even remember what it was
[03:19] Buffy clown rape?
[03:19] Hell, I've got shitty 12 year-old fanfiction on ffnet too.
[03:19] I'd be most ashamed of ever having written "Harries" when it should have been "Harry's." Shame shame shame.
[03:20] Space Harrier Potter
[03:20] 12 year-old as in, written when I was 12, or some year around then.
[03:21] Harry Potter and BSG crossover fiction? Now that's an idea.
[03:24] probably already exists
[03:29] Is that the only fanfic site being crawled?
[03:30] Already crawled.
[03:30] Being, hopefully - new content's being added
[03:30] I thought yipdw was involved in that one.
[04:08] Morning
[04:20] SketchCow: hey, were you looking at end of January for archiveteam con? Or February?
[04:36] shaqfu: #archiveteam.log:2012-06-09 02:02:40< Coderjoe> "SHIT SHIT SHIT! I didn't want anyone to know I wrote furry self-insertion star trek star wars incest bondage fanfiction! DELETE IT!"
[04:38] Hahaha
[04:40] hmm
[04:40] aside from the star wars part, that could almost be satisfied by a viewpoint of someone claiming to be a tribble
[04:41] and why the hell am I putting any more thought into that off-the-cuff example of why fanfic authors frequently don't like having their works archived
[04:43] Blackmail ammunition?
[04:58] I was looking at January.
[05:01] Coderjoe: I dated a fanfic writer once, I think being illogical is part and parcel with the hobby
[05:05] What was her fandom?
[05:08] edward scissorhands
[05:09] Wait, what
[05:11] it's the depp and the leather and the goth
[05:12] heh
[05:13] Please please please please please don't tell me she wrote slash
[05:13] I won't lie
[05:13] she did
[05:13] Literally, in this case, slash
[05:14] hahaha
[05:14] what did slash have to say about it?
[05:15] it wasn't slash slash
[05:16] I don't have any links to it, I think she baleeted it all
[05:19] No worries; I'm sure there are dozens like it in ff.net
[05:20] hahahahaha
[05:20] she went by "jakie firecracker" if you feel a need to go find it
[05:24] Looks like she deleted everything
[05:24] BIG SHOCK
[05:24] typical fanfic butthurt
[05:24] Admittedly, that's a really grey area for archiving, ethics-wise
[05:25] how do you say
[05:26] That if someone makes the conscious decision to destroy something of theirs, is it ethical to release it later as part of a historical collection
[05:27] But if you comply and delete, you cheat history :(
[05:28] so obviously the right thing to do is wrap it up in an immutable package before they delete it
[05:29] in all seriousness, I think that's the answer
[05:30] I'm not sure there's a right answer - more so a less wrong one
[05:32] Probably best to just adhere to the "unring a bell" principle
[05:32] Do right by history *and* you don't feel like a jerk!
[05:46] http://en.wikipedia.org/wiki/Aeneid#Virgil.27s_death_and_editing_of_the_Aeneid
[06:16] however the warcs are being archived, can they be indexed so i can give the archive search an id, and find that story? also how's that warc browser coming along? that would be extremely useful, to study the format
[06:17] it absolutely can be indexed, that's how the wayback machine works
[06:21] http://wegetsignal.org/tmp/yabbersdoesntcare.png
[06:24] what
[06:39] er
[06:39] I meant to go to a different channel with that
[06:39] and/or -bs
[06:40] but the site hosting the phpbb seen in that screenshot has a notice up about a future outage at the end of may. it is now the middle of june.
[10:06] Is there a simple way to host a 687 MB backup online to make it comfortably downloadable? Either for free, or for cheaps.
[10:07] Depending on what precisely you're wanting to do, Dropbox might be a solution.
[10:09] The trick of course is uploading. If you're in the states, upload speeds range from: this fucking sucks and will never complete without errors, to maybe a few megabits a second.
[10:16] Or upload it to archive.org, if the hosting doesn't have to be private/short-term.
[11:05] Korodzik: try http://tahoe.ccnmtl.columbia.edu/
[11:12] Thanks to you all.
[13:16] has codebear been around?
[13:17] lemme check my logs.
[13:18] nope.
[13:20] of course, it might help if i check the right channel.
[13:20] i last saw him on June 4th, leaving.
[13:22] cheers
[13:36] and now archive.org stopped responding. derp
[13:39] nvm
[13:52] http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/061214.html
[17:15] alard: can you make wget use the same warc file in later invocations? eg if i download 100 pages separately but would like them all to be warced
[17:47] Schbirid: No, you'll have to download all files in one go. But you can concatenate warc files (even gzipped ones), so if you really want one big warc you can just use cat to combine them later.
[17:47] ah nice
[17:48] But that way you'll get warc files that are slightly larger, since each file has its own wget log etc.
[17:53] and boom, IGN just killed all the gamespy forums, close to 40 million posts allegedly
[17:54] those we grabbed?
[17:56] i got pretty much all i could
[17:56] found two a short time ago, it got killed as i was on the last one. but nothing too important
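(On alard's 17:47 point about combining WARCs: gzip allows multiple members in one stream, so gzipped WARCs can be joined byte-for-byte, exactly like cat part1.warc.gz part2.warc.gz > combined.warc.gz. A small stand-alone Lua sketch of the same operation - the filenames are made up for illustration:)

    -- Sketch: concatenate gzipped WARC files byte-for-byte into one archive.
    -- This works because a gzip stream may contain several members back to back,
    -- and WARC tools read the concatenation as a single archive.
    local parts = { "part1.warc.gz", "part2.warc.gz" }  -- hypothetical names
    local out = assert(io.open("combined.warc.gz", "wb"))
    for _, name in ipairs(parts) do
      local f = assert(io.open(name, "rb"))
      while true do
        local chunk = f:read(64 * 1024)  -- copy in 64 KiB chunks
        if not chunk then break end
        out:write(chunk)
      end
      f:close()
    end
    out:close()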
[17:57] http://forumplanet-archive.quaddicted.com/111archives/7z/ | http://forumplanet-archive.quaddicted.com/111archives/warcs/
[17:58] i am trying to extract them all but i guess that will be way too much for my server
[17:58] should have looked into a compressed filesystem
[17:58] gotta try https://gitorious.org/fuse-7z some day
[18:03] the forums i downloaded had close to 20 million posts, if my script worked correctly i will have all of them
[18:04] of course 50% are spam and 33% are deleted posts and 10% are IGN stupidity or something like that
[18:05] wait, that was wrong
[18:06] no, the number is correct
[18:11] be aware that i think the warcs are missing stuff because of infinite link loops resulting in very deep directory trees. iirc there were no warcs left for those boards
[19:32] SketchCow linking to facebook on twitter, i hope he is alright
[22:54] There's a site that has digitized old newspapers, most of which are in the public domain in the US. However, the site is run by one person; is there anything preventing me from uploading these newspapers to IA?
[22:55] benuski: Legally? Dunno. I've seen people get pissy over digitized materials in the public domain
[22:55] What's the page?
[22:56] fultonhistory.com (warning: terribad website ahead)
[22:56] Dear God
[22:56] And this is why I have noscript.
[22:57] oh my
[22:57] that's... um
[22:57] good luck systematically grabbing that.
[22:57] I've emailed him to ask permission, but I'm worried he'll say no
[22:57] I.) Use of Programs like Web Devil, SiteGrabber, OfflineExployer or any other mass downloading software programs will get you banned from this site. Server logs are read each morning and my software will detect this type of activity. Violators will be automatically prohibited from entering this site, I do Not have the bandwidth to support this kind of activity and serve the people who wish to use this site as it was intended.
[22:57] RULES OF THE HOUSE:
[22:58] II.) This site is FREE. All I request is a smile and a little courtesy. If you have a reasonable need for something on this site, and you will not use it for commercial purposes (Make Someone Pay For it), I will give it to you. Just e-mail me with a description of your needs at tryniski@fultonhistory.com
[22:58] yeah, i saw that and that's why I emailed him
[22:58] Meh. Let's just try not to pull another Parodius :P
[22:58] ;D
[22:58] if you're in the US it's legal under https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_Corp.
[22:58] however IANAL
[22:59] Reading the link on the site to copyright info
[22:59] The one under the really surreal three-eyed head bouncing around
[22:59] I think that's a picture of the creator of the site
[23:00] Blech, nothing useful
[23:01] A site that interesting shouldn't have design that abhorrent
[23:01] I think it's the worst web design I have ever seen
[23:03] Hopefully he gets back soon
[23:03] Otherwise, I guess a very slow crawl may be in order
[23:04] (You'll probably be slow-crawling anyway if you don't get disks shipped)
[23:04] Yeah, that was my plan... nicking stuff slowly so he doesn't ban me, unless he's into it
[23:06] I don't understand being so hard up for bandwidth that you scrutinize the logs daily
[23:06] I mean, I used to do that, but like 10 years ago when I ran a site off my cable modem
[23:07] Did your site also have animated fish and bouncing heads?
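(On the "very slow crawl" plan at 23:03-23:04: if the grab ends up going through wget-lua anyway, the usual trick is to sleep between requests from the Lua side; plain wget's --wait/--random-wait options achieve the same thing without any scripting. A rough sketch, assuming wget-lua exposes an httploop_result callback and a wget.actions table as documented on the wiki - check both names against the actual build before relying on this:)

    -- Sketch: pause a few seconds after every response so the crawl stays gentle.
    -- Assumed API: wget.callbacks.httploop_result and wget.actions.NOTHING
    -- (the default "carry on" action); verify against your wget-lua version.
    wget.callbacks.httploop_result = function(url, err, http_stat)
      os.execute("sleep " .. math.random(3, 8))  -- 3-8 second pause per request
      return wget.actions.NOTHING                -- let wget proceed as usual
    end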
[23:07] now I pay $10 a month for retardedly more bandwidth than I will ever use
[23:08] I'm kinda surprised that his bandwidth is so limited; if he's storing 20M newspapers, he obviously has decent infrastructure
[23:08] I suppose he gets hammered if he goes over average or w/e
[23:08] and he says on some part of the site that the high quality images, which aren't available online, are over 120 terabytes
[23:09] For 50 euros a month you can get what looks like an outstanding dedi in europe: http://www.hetzner.de/en/hosting/produktmatrix/rootserver-produktmatrix-ex
[23:09] Unlimited traffic. Well, 7 euros or so per terabyte over 10TB if you want to keep the 100Mbps connection.
[23:21] SketchCow: are you in need of nintendo power magazines?
[23:36] anyone know how long it usually takes before archive.org emails you back?
[23:37] about what?
[23:43] starting a collection
[23:43] oh, no idea
[23:43] day or two?