#archiveteam 2012-06-18,Mon


Time Nickname Message
00:00 🔗 Coderjoe well, I suppose you can run something like qemu without installing anything, but not dosbox. (though I am not sure how you would get network connectivity to the VM)
00:00 🔗 Coderjoe er
00:00 🔗 Coderjoe s/dosbox/virtualbox/
00:00 🔗 Coderjoe box brainfart
00:00 🔗 shaqfu Threw me for a loop there; wondered what running Wolf3d had to do with digipres ;)
00:01 🔗 mistym So I totally disappeared after asking a question earlier :V I've got this site with a semi-endless loop - calendar feature that generates a page for any date you throw at it, regardless of whether it's 10000 years in the future.
00:01 🔗 mistym Is there a way to grab that with reasonable limits using wget-warc without totally omitting the calendar?
00:01 🔗 DFJustin http://www.vbox.me/
00:02 🔗 shaqfu mistym: Is it possible to disregard the calendar when crawling, and manually grab it later?
00:02 🔗 mistym shaqfu: That's true, that could work.
00:03 🔗 shaqfu Then you could pass it -r -l 50 or w/e and not accidentally grab the list of events for 50000 AD
00:03 🔗 shaqfu Hm, this runs on VBox
00:06 🔗 mistym Hm, or maybe this is a job for wget-lua.
00:06 🔗 shaqfu Will that pull down the warc headers?
00:08 🔗 mistym You can save as whatever the appropriate wget uses. The lua integration creates a set of callbacks that are called as wget recurses over urls.
00:09 🔗 * mistym didn't really know what was supported, but noticed some documentation on the wiki
00:09 🔗 mistym Looks appropriate for this use case, luckily.
00:13 🔗 Coderjoe is there a visual means to determine a date with events vs one without? if so, you could possibly use a lua-enabled wget to look for the html that differentiates them and not give the urls for the days with no events
00:14 🔗 Coderjoe you could potentially even have the lua parse things like css and javascript on these painful-to-archive "web 2.0" ajaxy pages
00:14 🔗 mistym Fortunately it's all in the urls, so it's really easy. date=2012-06, etc. I don't need to match any dates newer than this year and older than, I dunno, 1980, so that's all I really need to parse in this case.
00:15 🔗 mistym Just need to figure out how to match regexps in lua.
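
The date filter mistym describes can be sketched as follows. This is written in Python rather than Lua, purely to illustrate the logic; it is not the gist linked later in the log and it does not show how wget-lua callbacks are wired up. The `should_fetch` name, the example URLs, and the assumption that the calendar puts the date in a `date=YYYY-MM` query parameter are illustrative, and the 1980 to 2012 range is taken from the message above.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumption: calendar pages carry a query parameter such as date=2012-06.
LEADING_YEAR = re.compile(r"(\d+)-")


def should_fetch(url, first_year=1980, last_year=2012):
    """Return True if the URL is not a calendar page, or if its year is in range."""
    params = parse_qs(urlsplit(url).query)
    if "date" not in params:
        return True  # not a calendar URL, so leave it to the normal crawl rules
    match = LEADING_YEAR.match(params["date"][0])
    if match is None:
        return False  # a date parameter we cannot parse: skip it to be safe
    year = int(match.group(1))
    return first_year <= year <= last_year
```

Matching the year with `\d+` instead of exactly four digits matters here: a URL for the year 10000 should be rejected, not silently treated as a non-calendar page.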
00:17 🔗 Coderjoe you might be able to look at the picplz lua for info
00:20 🔗 underscor Oh no guys
00:20 🔗 underscor I repeat
00:20 🔗 underscor The target has entered the zone
00:20 🔗 underscor :D
00:20 🔗 underscor (aka, I'm at IA!)
00:21 🔗 Coderjoe oh shit. there goes the galaxy.
00:21 🔗 mistym underscor: Awesome :D
00:21 🔗 underscor It's quite nice
00:22 🔗 Coderjoe so you're living out there for the summer?
00:22 🔗 underscor Yeah
00:22 🔗 underscor Actually in the building
00:22 🔗 underscor haha
00:22 🔗 mistym It is the most amazing building.
00:23 🔗 underscor It's fabulous
00:23 🔗 shaqfu Isn't it an old Christian Science building?
00:23 🔗 underscor yep
00:23 🔗 mistym It is now the church of the internet.
00:23 🔗 shaqfu Say what you want about them, but God damn could they build nice buildings
00:23 🔗 underscor http://www.speedtest.net/result/2014459014.png
00:24 🔗 Coderjoe i would say "great. now he's going to use up all of IAs pipe with torrents..." but you do that already :P
00:25 🔗 shaqfu I'd joke about "downloading the Internet" but they already do that
00:27 🔗 underscor You guys are silly
00:27 🔗 underscor :D
00:28 🔗 Coderjoe I need someone to move http://archive.org/details/archiveteam-ftp_abit_com_tw over to the archiveteam-fire collection
00:34 🔗 underscor Coderjoe: http://archive.org/details/archiveteam-ftp_abit_com_tw?reCache=1
00:36 🔗 mistym So not knowing any lua, did I make any particularly terrible mistakes here? https://gist.github.com/56d5e8916d73af0c9f11
00:36 🔗 mistym Seems straightforward enough, but I don't know if lua has any hidden gotchas.
00:37 🔗 Coderjoe thanks
00:41 🔗 underscor np
00:47 🔗 shaqfu underscor: Will you be able to finish the FP upload soon?
00:47 🔗 underscor yeah, definitely
00:47 🔗 underscor once I get settled in
00:47 🔗 underscor I'm actually right near the box it's on now :D
00:47 🔗 shaqfu Sorry for being naggy - just want to see this project done soonish, since we just found more FP
00:48 🔗 shaqfu Now we're finding collided files, blech
00:52 🔗 underscor :(
00:53 🔗 shaqfu Nothing like a site passing through God knows how many major reorgs - it's a wonder it functioned at all
02:47 🔗 bsmith093 fanfiction.net is doing a purge again, we should put up a forum post saying we've probably got them saved
03:15 🔗 Coderjoe "but I don't want them saved. please destroy all copies"
03:15 🔗 underscor hahaha
03:15 🔗 DFJustin I thought the plan was to keep the fanfiction.net archiving on the down-low because of the inevitable dramabomb
03:17 🔗 aggro Well I'm sure many an author would be happy to hear that their work has, in fact, not been BAH-leeted.
03:17 🔗 shaqfu And many others will be livid that they're known as the author of Kirk/Spock fish slashfic, FOREVER
03:18 🔗 Coderjoe i liked my example subject better
03:18 🔗 Coderjoe but I don't even remember what it was
03:19 🔗 shaqfu Buffy clown rape?
03:19 🔗 aggro Hell, I've got shitty 12 year-old fanfiction on ffnet too.
03:19 🔗 aggro I'd be most ashamed of ever having written "Harries" when it should have been "Harry's." Shame shame shame.
03:20 🔗 shaqfu Space Harrier Potter
03:20 🔗 aggro 12 year-old as in, written when I was 12, or some year around then.
03:21 🔗 aggro Harry Potter and BSG crossover fiction? Now that's an idea.
03:24 🔗 DFJustin probably already exists
03:29 🔗 shaqfu Is that the only fanfic site being crawled?
03:30 🔗 aggro Already crawled.
03:30 🔗 shaqfu Being, hopefully - new content's being added
03:30 🔗 aggro I thought yipdw was involved in that one.
04:08 🔗 SketchCow Morning
04:20 🔗 closure SketchCow: hey, were you looking at end of January for archiveteam con? Or February?
04:36 🔗 Coderjoe shaqfu: #archiveteam.log:2012-06-09 02:02:40< Coderjoe> "SHIT SHIT SHIT! I didn't want anyone to know I wrote furry self-insertion star trek star wars incest bondage fanfiction! DELETE IT!"
04:38 🔗 shaqfu Hahaha
04:40 🔗 Coderjoe hmm
04:40 🔗 Coderjoe aside from the star wars part, that could almost be satisfied by a viewpoint of someone claiming to be a tribble
04:41 🔗 Coderjoe and why the hell am I putting any more thought into that off-the-cuff example of why fanfic authors frequently don't like having their works archived
04:43 🔗 shaqfu Blackmail ammunition?
04:58 🔗 SketchCow I was looking at January.
05:01 🔗 chronomex Coderjoe: I dated a fanfic writer once, I think being illogical is part and parcel with the hobby
05:05 🔗 shaqfu What was her fandom?
05:08 🔗 chronomex edward scissorhands
05:09 🔗 shaqfu Wait, what
05:11 🔗 Coderjoe it's the depp and the leather and the goth
05:12 🔗 chronomex heh
05:13 🔗 shaqfu Please please please please please don't tell me she wrote slash
05:13 🔗 chronomex I won't lie
05:13 🔗 chronomex she did
05:13 🔗 shaqfu Literally, in this case, slash
05:14 🔗 chronomex hahaha
05:14 🔗 Coderjoe what did slash have to say about it?
05:15 🔗 chronomex it wasn't slash slash
05:16 🔗 chronomex I don't have any links to it, I think she baleeted it all
05:19 🔗 shaqfu No worries; I'm sure there are dozens like it in ff.net
05:20 🔗 chronomex hahahahaha
05:20 🔗 chronomex she went by "jakie firecracker" if you feel a need to go find it
05:24 🔗 shaqfu Looks like she deleted everything
05:24 🔗 chronomex BIG SHOCK
05:24 🔗 chronomex typical fanfic butthurt
05:24 🔗 shaqfu Admittedly, that's a really grey area for archiving, ethics-wise
05:25 🔗 chronomex how do you say
05:26 🔗 shaqfu That if someone makes the conscious decision to destroy something of theirs, is it ethical to release it later as part of a historical collection
05:27 🔗 shaqfu But if you comply and delete, you cheat history :(
05:28 🔗 chronomex so obviously the right thing to do is wrap it up in an immutable package before they delete it
05:29 🔗 chronomex in all seriousness, I think that's the answer
05:30 🔗 shaqfu I'm not sure there's a right answer - moreso a less wrong one
05:32 🔗 shaqfu Probably best to just adhere to the "unring a bell" principle
05:32 🔗 shaqfu Do right by history *and* you don't feel like a jerk!
05:46 🔗 DFJustin http://en.wikipedia.org/wiki/Aeneid#Virgil.27s_death_and_editing_of_the_Aeneid
06:16 🔗 bsmith093 however the warcs are being archived, can they be indexed so i can give the archive search an id, and find that story? also hows that warc browser coming along? that would be extremely useful, to study the format
06:17 🔗 chronomex it absolutely can be indexed, that's how the wayback machine works
06:21 🔗 Coderjoe http://wegetsignal.org/tmp/yabbersdoesntcare.png
06:24 🔗 chronomex what
06:39 🔗 Coderjoe er
06:39 🔗 Coderjoe I meant to go to a different channel with that
06:39 🔗 Coderjoe and/or -bs
06:40 🔗 Coderjoe but the site hosting the phpbb seen in that screenshot has a notice up about a future outage at the end of may. it is now the middle of june.
10:06 🔗 Korodzik Is there a simple way to host a 687 MB backup online to make it comfortably downloadable? Either for free, or for cheaps.
10:07 🔗 aggro Depending on what precisely you're wanting to do, Dropbox might be a solution.
10:09 🔗 aggro The trick of course is uploading. If you're in the states, upload speeds range from: this fucking sucks and will never complete without errors, to maybe a few megabits a second.
10:16 🔗 alard Or upload it to archive.org, if the hosting doesn't have to be private/short-term.
11:05 🔗 joepie91 Korodzik: try http://tahoe.ccnmtl.columbia.edu/
11:12 🔗 Korodzik Thanks to you all.
13:16 🔗 Schbirid has codebear been around?
13:17 🔗 MorbusIff lemme check my logs.
13:18 🔗 MorbusIff nope.
13:20 🔗 MorbusIff of course, it might help if i check the right channel.
13:20 🔗 MorbusIff i last saw him on June 4th, leaving.
13:22 🔗 Schbirid cheers
13:36 🔗 Schbirid and now archive.org stopped responding. derp
13:39 🔗 Schbirid nvm
13:52 🔗 Nemo_bis http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/061214.html
17:15 🔗 Schbirid alard: can you make wget use the same warc file in later invocations? eg if i download 100 pages separately but would like them all to be warced
17:47 🔗 alard Schbirid: No, you'll have to download all files in one go. But you can concatenate warc files (even gzipped ones), so if you really want one big warc you can just use cat to combine them later.
17:47 🔗 Schbirid ah nice
17:48 🔗 alard But that way you'll get warc files that are slightly larger, since each file has its own wget log etc.
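
alard's point about combining gzipped files with cat rests on the gzip format allowing several compressed members back to back in one file. A minimal sketch of that property, assuming nothing about wget itself: the payloads below are placeholder bytes, not real WARC records, and the file names and the `concat_files` and `demo` helpers are made up for illustration.

```python
import gzip
import os
import shutil
import tempfile


def concat_files(paths, out_path):
    """Append the raw bytes of each input file to out_path, like `cat a b > out`."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def demo():
    """Build two tiny gzip files, concatenate them, and read the result back."""
    with tempfile.TemporaryDirectory() as tmp:
        part_a = os.path.join(tmp, "a.warc.gz")  # hypothetical file names
        part_b = os.path.join(tmp, "b.warc.gz")
        combined = os.path.join(tmp, "combined.warc.gz")
        with gzip.open(part_a, "wb") as f:
            f.write(b"first record\n")
        with gzip.open(part_b, "wb") as f:
            f.write(b"second record\n")
        concat_files([part_a, part_b], combined)
        # Reading the combined file yields both payloads in order.
        with gzip.open(combined, "rb") as f:
            return f.read()
```

Calling `demo()` returns the two payloads joined together, which is why no recompression step is needed when merging the per-run files.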
17:53 🔗 Schbirid and boom, IGN just killed all the gamespy forums, close to 40 million posts allegedly
17:54 🔗 Nemo_bis those we grabbed?
17:56 🔗 Schbirid i got pretty much all i could
17:56 🔗 Schbirid found two a short time ago, it got killed as i was on the last one. but nothing too important
17:57 🔗 Schbirid http://forumplanet-archive.quaddicted.com/111archives/7z/ | http://forumplanet-archive.quaddicted.com/111archives/warcs/
17:58 🔗 Schbirid i am trying to extract them all but i guess that will be way too much for my server
17:58 🔗 Schbirid should have looked into a compressed filesystem
17:58 🔗 Schbirid gotta try https://gitorious.org/fuse-7z some day
18:03 🔗 Schbirid the forums i downloaded had close to 20 million posts, if my script worked correctly i will have all of them
18:04 🔗 Schbirid of course 50% are spam and 33% are deleted posts and 10% are IGN stupidity or something like that
18:05 🔗 Schbirid wait, that was wrong
18:06 🔗 Schbirid no, the number is correct
18:11 🔗 Schbirid be aware that i think the warcs are missing stuff because of infinite link loops resulting in very deep directory trees. iirc there were no warcs left for those boards
19:32 🔗 Schbirid SketchCow linking to facebook on twitter, i hope he is alright
22:54 🔗 benuski There's a site that has digitized old newspapers, most of which are in the public domain in the US. However, the site is run by one person; is there anything preventing me from uploading these newspapers to IA?
22:55 🔗 shaqfu benuski: Legally? Dunno. I've seen people get pissy over digitized materials in the public domain
22:55 🔗 shaqfu What's the page?
22:56 🔗 benuski fultonhistory.com (warning: terribad website ahead)
22:56 🔗 shaqfu Dear God
22:56 🔗 aggro And this is why I have noscript.
22:57 🔗 underscor oh my
22:57 🔗 underscor that's... um
22:57 🔗 aggro good luck systematically grabbing that.
22:57 🔗 benuski I've emailed him to ask permission, but I'm worried he'll say no
22:57 🔗 shaqfu I.) Use of Programs like Web Devil, SiteGrabber, OfflineExployer or any other mass downloading software programs will get you banned from this site. Server logs are read each morning and my software will detect this type of activity. Violators will be automatically prohibited from entering this site, I do Not have the bandwidth to support this kind of activity and serve the people who wish to use this site as it was intended.
22:57 🔗 shaqfu RULES OF THE HOUSE:
22:58 🔗 shaqfu II.) This site is FREE. All I request is a smile and a little courtesy. If you have a reasonable need for something on this site, and you will not use it for commercial purposes (Make Someone Pay For it), I will give it to you. Just e-mail me with a description of your needs at tryniski@fultonhistory.com
22:58 🔗 benuski yeah, i saw that and that's why I emailed him
22:58 🔗 aggro Meh. Let's just try not to pull another Parodius :P
22:58 🔗 underscor ;D
22:58 🔗 DFJustin if you're in the US it's legal under https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_Corp.
22:58 🔗 DFJustin however IANAL
22:59 🔗 shaqfu Reading the link on the site to copyright info
22:59 🔗 shaqfu The one under the really surreal three-eyed head bouncing around
22:59 🔗 benuski I think that's a picture of the creator of the site
23:00 🔗 shaqfu Blech, nothing useful
23:01 🔗 shaqfu A site that interesting shouldn't have design that abhorrent
23:01 🔗 benuski I think it's the worst web design I have ever seen
23:03 🔗 shaqfu Hopefully he gets back soon
23:03 🔗 shaqfu Otherwise, I guess a very slow crawl may be in order
23:04 🔗 shaqfu (You'll probably be slow-crawling anyway if you don't get disks shipped)
23:04 🔗 benuski Yeah, that was my plan... nicking stuff slowly so he doesn't ban me, unless he's into it
23:06 🔗 DFJustin I don't understand being so hard up for bandwidth that you scrutinize the logs daily
23:06 🔗 DFJustin I mean, I used to do that, but like 10 years ago when I ran a site off my cable modem
23:07 🔗 shaqfu Did your site also have animated fish and bouncing heads?
23:07 🔗 DFJustin now I pay $10 a month for retardedly more bandwidth than I will ever use
23:08 🔗 shaqfu I'm kinda surprised that his bandwidth is so limited; if he's storing 20M newspapers, he obviously has decent infrastructure
23:08 🔗 shaqfu I suppose he gets hammered if he goes over average or w/e
23:08 🔗 benuski and he says on some part of the site that the high quality images, which aren't available online, are over 120 terabytes
23:09 🔗 aggro For 50 euros a month you can get what looks like an outstanding dedi in europe: http://www.hetzner.de/en/hosting/produktmatrix/rootserver-produktmatrix-ex
23:09 🔗 aggro Unlimited traffic. Well, 7 euros or so per terabyte over 10TB if you want to keep the 100Mbps connection.
23:21 🔗 kennethre SketchCow: are you in need of nintendo power magazines?
23:36 🔗 Famicoman anyone know how long it usually takes before archive.org emails you back?
23:37 🔗 chronomex about what?
23:43 🔗 Famicoman starting a collection
23:43 🔗 chronomex oh, no idea
23:43 🔗 chronomex day or two?
