[00:02] so any more news on the possible archiveteam get together?
[00:02] I need to discuss it over the next few weeks.
[00:02] Assume mid January
[00:02] Huge bandwidth visual and audio connections
[00:03] save up terabytes of shit to shove into IA then, yes?
[00:03] That works too
[00:04] or I guess I could walk to the nearby university and use one of the unmetered 100mbit ports
[02:06] Warning, bandwidth of archive.org about to disappear as of 10pm PST
[02:06] I am thinking we might want to slow down the warriors
[02:10] that's about 3h from now for all not on pacific time
[02:12] is archive.org moving to even more ridiculous levels of bandwidth then?
[02:17] No, this is a city thing
[02:17] They already blew the bandwidth up
[02:34] SketchCow: Is it part of ipv6 launch or some other thing?
[02:34] SF outage
[02:35] The neighborhood's getting fucked
[02:35] Sounds like it's time for some devotion to duty: http://xkcd.com/705/
[02:42] anyone suggest a batch youtube channel downloader that I can just put a profile into and hit go?
[02:45] They're working hard to try and keep it up, but they don't want to overpromise.
[02:45] It's likely to slow down, they ran a satellite link from the top of the building
[02:46] And a microwave link to another site
[03:00] the me.com crawlers are running out of usernames
[03:01] and i would like to add some
[03:03] http://memac.heroku.com/rescue-me
[03:03] So does that mean we should all touch stop tonight? Or are we just going to forge on and hope it works?
[03:04] Aranje: best to stop, I suppose
[03:04] hmm
[03:04] You can have the trackers (herokus) stop handing out usernames
[03:04] Famicoman - to download all the episodes?
[03:04] er... videos?
[03:05] nm, found youtube-dl
[03:05] cool little cli python tool
[03:05] youtube-dl is godlike
[03:06] handy
[03:06] I was gonna say, I had a FF plugin for that, but I haven't used it for a while
[03:06] not since I had to borrow all those Robot Wars episodes, anyway...
[03:07] *borrow*
[03:07] yes, borrowing.
[03:07] like in a library
[03:07] yes, absolutely
[03:07] my favorite part is how the sarcasm patrol jumped right on that
[03:07] lol
[03:07] hey, he has every intention of returning them to... everybody
[03:07] :D
[03:08] I feel like you hand fed those to me like 3 years ago
[03:08] pretty sure I had some less consistent rips before, which I might've given you
[03:08] before I found that channel
[03:09] I feel like I found them all on thebox
[03:09] or most of them at least
[03:09] yeah, their collection was incomplete though
[03:09] missing at least a couple of series worth
[03:10] and the early series had gaps
[03:11] but either way, yay for the internet
[03:11] I wish TV archives weren't so tightly locked away in vaults
[03:12] I mean I get a lot of the reasons why on a legal level, but man, it seems a shame to let all that content rot
[03:12] you should start recording then
[03:12] I'd still love to see a crowdsourced historical TV guide, where everyone would contribute media to fill up TV schedules from a day in history
[03:12] :P
[03:13] but I know it'd get shut down quicker than you can say "copyright"
[03:13] maybe IA could get away with doing it, I dunno
[03:13] but it'd definitely be awesome
[03:13] um, they already are
[03:13] just isn't very public
[03:14] .... http://memac.heroku.com/rescue-me is complicated because it just allows ~50 at a time
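On the 50-at-a-time limit just mentioned: alard's reply a few lines down ([03:16]) is to script it, since the form is a plain POST. A rough sketch only; the endpoint is the rescue-me URL from the log, but the form field name ("usernames") is a guess, so check the page's HTML source before running anything like this.

    # split the username list into 50-line chunks and POST each one
    split -l 50 usernames.txt chunk_
    for f in chunk_*; do
      curl -s --data-urlencode "usernames@$f" http://memac.heroku.com/rescue-me   # field name assumed
      sleep 2   # be gentle with the heroku dyno
    done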
[03:14] well then, I've learnt something ne
[03:14] new, even
[03:14] jonas__: if you have a pile, you should talk to alard
[03:15] Moonlit: Archive.org will have a TV archive soon
[03:15] shweet
[03:15] Goes back 15 years
[03:15] kickass
[03:15] international, or just US?
[03:16] Just US
[03:16] aw
[03:16] But via satellite
[03:16] still, better than nothing
[03:16] Right
[03:16] You could whip up a script to submit 50 at a time. POST is a wonderful interface.
[03:16] http://archive.org/details/tv
[03:16] Try it out a little
[03:16] oh, hello
[03:16] cheers
[03:17] http://archive.org/details/tv?q=vibrator
[03:17] hah
[03:18] that seems quite specific
[03:19] but I'm not gonna argue with anything which archives telly
[03:19] It's comforting to know I'm not the only one who sometimes felt that old TV listings would be useful
[03:19] well, I'd love to have a site where you could, for example, pick your birthday and watch that entire day's TV
[03:20] impractical, perhaps, but for recent years it might not be such a crazy idea
[03:20] that may be exactly what /details/tv is for
[03:20] who knows
[03:20] alard is sleeping i guess,
[03:20] jonas___: that would not be a surprise, given his timezone
[03:20] no one else involved in the tracker?
[03:20] timezones? pah!
[03:21] * Moonlit looks down at the clock
[03:21] 4:22am
[03:21] >_>
[03:21] I expect he'll be online in 4 to 6 hours
[03:21] Moonlit: you're approaching that from the good side, right?
[03:21] the night side, of course
[03:22] yes
[03:29] are other people able to access archive.org as well?
[03:29] it sure doesn't seem down to me
[03:29] or is it not that time yet
[03:29] not time yet
[03:29] d'oh, two more hours
[03:29] 1h30 left I believe
[03:29] right
[05:56] * closure wonders if it'll come back with ipv6 for ipv6 day, that'd be cool
[05:56] indeed
[07:32] Hi. I read something about pausing trackers/warriors, is that still necessary or have I slept through the outage?
[07:42] The outage came and went.
[07:46] whoa, there's about ten zillion queued derives
[07:49] wonder what's up.
[11:55] chronomex: the outage affected ia6* datacenter, which is where catalogd's hosted; because of that, a.o "froze" the cluster for the duration, and efficiency still isn't back up to normal
[11:57] ....#ovh giveaway - did you guys also get this all the time after typing the code from the twitter DM and trying to log in? :"pease login: An error occurred, saving could not be completed, please check your information."
[11:57] nope
[11:57] I just entered my nichandle and it said "Thanks for participating" or something
[11:57] then they dm'd me in frech
[11:57] french*
[12:37] https://fbcdn-sphotos-a.akamaihd.net/hphotos-ak-ash4/403466_10150939185809588_574221299_n.jpg
[12:37] haha
[12:40] <3 grimm brothers
[12:41] they wrote some funny stuff :D
[12:52] hmmmm
[12:52] just started up warrior, selected picplz - sitting on 0% [waiting for headers] when trying to update liblua5.1-0
[12:53] mobileme seems fine tho. carrying on with that for now
[12:57] Experiment: http://warctozip.herokuapp.com/
[13:01] alard: and that reutrns the thing zipped for you?
[13:01] returns*
[13:04] Yes.
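A guessed invocation for the converter being tested here: alard's note just below ([13:07]) is that the zip filename goes after the warc filename. The local script name and filenames are placeholders, not confirmed by the log; the heroku app presumably wraps the same code.

    python warctozip.py example.warc.gz example.zip   # script name and filenames assumed; arg order per alard: warc first, then zip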
[13:04] that's awesome
[13:05] * underscor feeds it one of the youporn warcs
[13:06] alard: https://gist.github.com/8a3c5fb1f64d8bd18a4e
[13:07] with Python 2.6.1
[13:07] void_: You need to pass a zip filename
[13:07] after the warc filename
[13:07] example.zip or whatever you want it called
[13:07] same
[13:08] oh, weird
[13:08] interestingly it seems Virtualbox may soon have some network speed control stuff.
[13:08] https://gist.github.com/f5086e6188f9f6ea3e2a
[13:08] SmileyG: it does, iirc
[13:08] underscor: already?
[13:09] I see stuff for a linux host, I'm a linux host but I've not yet attempted to test.
[13:09] it said for windows users "soon" from what I've read.
[13:09] oh, I'm linux
[13:09] sorry
[13:10] me too :P
[13:10] Have you tried it?
[13:11] http://www.slashgear.com/6-5m-linkedin-passwords-reportedly-leak-hackers-crowdsourcing-encryption-crack-06232454/
[13:11] no, because two clients doesn't saturate my pipe
[13:11] ;P
[13:12] mine neither but I'm at work and if I did then there would be stabbing.
[13:12] hahaha
[13:13] as in there's very little stopping me other than me.
[13:21] http://bugs.python.org/issue5511
[13:22] maybe py2.6 is the problem
[13:27] Hello archiveteam. I'm interested in preserving certain aspects (in particular the forums) of some private torrent trackers. These community forums oftentimes have high-quality discussion of various topics and provide an insight into the world of filesharing, but they are also at risk of sudden, unannounced seizure. I'm a noob, currently fiddling with wget without much success. Is anybody interested and/or willing to help me out?
[14:18] floppywar, supposedly if you use wget with --mirror and --load-cookies to load your cookie from when you're logged in, that should work
[14:19] tried that, didn't work. maybe the robots.txt is preventing me from accessing /forums.php
[14:21] only /login.php was downloaded and that file says that neither javascript nor cookies are enabled (a necessity for logging into this website).
[14:21] I've tried HTTrack too, without success.
[14:21]
[14:28] javascript... :S
[14:28] S[h]O[r]T: can you get it to wget any other page than the login page?
[14:33] why are u asking me? :P floppywar is the one trying to wget hehe
[14:33] SmileyG: what do you mean?
[14:33] ah I see
[14:33] index.php too I think, but the content is the same as login.php
[14:34] SmileyG : so, no, nothing besides the login page.
[14:37] SmileyG: Correction, Javascript is not necessary. Cookies are though, obviously.
[14:37] hmmm
[14:37] https://chrome.google.com/webstore/detail/lopabhfecdfhgogdbojmaicoicjekelh
[14:38] that seems like a promising easy way to get your login cookie over to wget
[14:38] don't you just login on your local machine
[14:38] save the cookie.txt
[14:38] job done.
[14:39] I do
[14:39] I'll try exporting the cookie using the add-on you just linked to.
[14:40] floppywar: Talk with underscor - he's the guy to help.
[14:40] http://www.youtube.com/watch?v=rxjHbe7TrjY
[14:42] SmileyG: It seems to be working!
[14:50] SmileyG: .. or so I thought. wget doesn't seem to descend into anything; it just grabs (properly) forums.php but not the actual individual forums and threads.
[14:50] Individual forums can be reached in this way: /forums.php?action=viewforum&forumid=9
[14:53] i imagine as long as wget is following those links it shouldnt have a problem. i guess i would try to play with the options for host checking
[14:54] "options for host checking", expand please if you will?
[14:55] http://www.editcorp.com/Personal/Lars_Appel/wget/wget_4.html
[14:56] i suppose you could try -L and -nh. im not sure if -mirror enables these already i bet it does one of them
[14:59] "wget -load-cookies cookies.txt -m https://[site]/forums.php" only grabs forums.php
[15:03] replacing -m with -L makes no difference, replacing -m with -nh returns "illegal option" (already default as hinted at on editcorps.com?)
[15:03] floppywar: Is there a redirect from [site] to www.[site] or vice-versa? The -D option may be useful to you in that case.
[15:03] as far as I'm aware there isn't
[15:06] Might be worth a try anyway. Your "illegal option" error is because the h needs to be capital: -nH but that's a bad idea anyway since you're one outside link away from downloading the internet.
[15:08] "wget --load-cookies cookies.txt -m https://[site]/forums.php?action=viewforum&forumid=19" does not differ from trying to grab /forums.php; only the page specified is grabbed, no descent into the actual threads.
[15:10] Dunno then; I usually rely on someone smarter than me to set this stuff up. Like how I forgot just now that in wget you can have multi-letter options with a single dash so -nH didn't mean what I thought it did.
[15:10] I do know from previous conversations in here that downloading forums is far more of a pain than it ought to be.
[15:11] could it be because the links aren't static?
[15:12] anyway, replacing -m with -nH does not fix it. "-m -nH" doesn't fix it either.
[15:13] I've got to go in about 5 minutes. I'll probably be back within a couple of hours.
[15:13] Thanks for the help guys.
[15:13] without having a login to the site it's difficult to troubleshoot, but I've used the following before for public grabs of forums. If you've got the cookie working correctly, then this combo might work:
[15:13] wget --no-parent --html-extension --page-requisites --convert-links -e robots=off --exclude-directories=any,directory,you,do,not,want,comma,delimited --reject "*action=print,*any-parameter-you-do-not-want,comma,delimited" -w 5 --random-wait --warc-file=warc-file-name http://thesite.tld
[15:15] I'll try that aggro, cookie seems to be working correctly.
[15:16] the reject is useful for php-type sites that can have lots of different parameters for a page that are all mostly the same. Without the reject, wget will download lots of different versions of the same page, just with different parameters. You can fiddle with it until you find what url parameters are good to keep and which to ditch.
[15:17] aggro: will wget ascend into higher directories if I specify https://[site]/forums.php?
[15:18] if it's just somesite.com/forums.php , then that's already at the highest directory.
[15:19] And wget will follow any links within that domain
[15:19] typically though forums are installed in somesite.com/forums/index.php or forums.somesite.com
[15:19] (forums.somesite.com/index.php)
[15:23] aggro: It grabbed forums.php(.html is the final output) and is grabbing stuff from /static/ now.
[15:24] aggro: aaaand, it's finished. so no descent into threads again :(
[15:25] I left out --exclude-directories, --reject and --warc-file
[15:25] that's fine.
[15:25] :<
[15:25] Does that "forums.php" page have the typical listing of different forums?
[15:25] (I'm assuming phpbb or vbulletin or one of those major ones)
[15:25] aggro: no, custom board I think.
[15:26] It's what.cd
[15:27] my connection dropped. did I miss anything after "(I'm assuming [...]"?
[15:27] Hmmm... perhaps a cookie or login issue? Try running it with the "-d debug.log" option.
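A note on the "-d debug.log" suggestion: wget's -d flag turns on debug output but does not take a filename itself, so pair it with -o to capture the headers. A quick sketch for checking whether the login cookie is actually being sent; the URL is the one that comes up later in the thread.

    wget -d -o debug.log --load-cookies cookies.txt "https://ssl.what.cd/forums.php"
    grep -iE "cookie|location" debug.log | head   # look for the session cookie and any redirects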
[15:27] Oh: repeat: phpbb or vbulletin or one of those major ones)
[15:27] I was guessing the type of system the forum is running on
[15:28] what.cd runs question forum software, probably included with their in-house though open-source Gazelle private tracker software solution.
[15:29] what.cd runs homebrew forum software*
[15:29] huh. interesting :D
[15:29] what.cd?
[15:29] Sounds like one of the magazines
[15:29] The first thing I'd be looking at is the debug output and probably some of the pages to see where specifically content is getting grabbed from.
[15:29] biggest private music tracker in existence
[15:30] printscreen incoming
[15:32] http://image.bayimg.com/mapjeaada.jpg
[15:33] Bingo. You'll want the URL to be "https://ssl.what.cd/"
[15:33] :D
[15:33] aggro: It is
[15:33] only thing I notice is the https.... Oh
[15:34] perhaps I could try the non-https version.
[15:34] but I've got to go now. I'll be back in a couple of hours.
[15:34] bye
[15:34] wget doesn't trust the ssl cert they're using or something bonkers?
[15:34] Same here.
[15:34] And possibly what Smiley said. So look at the debug output from wget :P
[15:34] Back in a few hours all :D
[15:34] kbai
[15:34] o/
[16:17] WHERE'S THE HUG
[16:17] I hope that floppywar does some greatness.
[16:17] * Schbirid MUGS SketchCow
[16:18] i learned that my city's surveying office sits on 2 cabinets full of tapes with historic aerial photos they might just throw away because they dont have the readers any more and dont care
[16:18] i made my interest clear :)
[16:18] i think
[16:18] at least to a colleague who works in the department
[16:18] Keep on it.
[16:18] Be willing with a trunk.
[16:18] What form are they
[16:19] i dont have the slightest idea
[16:19] but it probably wont be decided in the near future
[16:21] http://www.engadget.com/2012/06/02/picplz-shutting-down-permanently-on-july-3rd-all-photos-to-be-d/
[16:22] and http://torrentfreak.com/worlds-oldest-bittorrent-site-shuts-down-120605/ Thought this may be of interest to some.
[16:31] We've already grabbed 75% of picplz
[16:34] Wow, that's fast. How many gigs so far?
[16:35] http://picplz.heroku.com/
[16:51] alright. I'm using archiveteam warrior and am now working on picplz
[16:53] Excellent.
[16:56] For some reason, my archiveteam warrior fails out, can't see anything with the network.
[17:06] What's your host setup? I've had problems with VirtualBox on Win7 not believing the network exists.
[17:10] Yeah
[17:10] But it worked a while ago.
[17:11] That... actually sounds like the problem I had with VirtualBox. For whatever reason bridged mode decided it didn't want to work any more.
[17:18] the VirtualBox network drivers may no longer be loaded
[17:19] I've had that happen on Linux and OS X; unfortunately the only way I found to fix that was to unload and reload the modules/kexts
[17:19] supposedly restarting VirtualBox fixes it, too, though I've had little luck with that procedure
[17:22] yipdw: Yeah, it was a pain. Restarting VBox didn't work, rebooting the host occasionally did. Then it gave up altogether so now I just deal with the pain that is NAT.
[17:24] may mean windows is up for a reinstall
[17:26] Meh. I've got something that works now, so that's how it's gonna stay. I don't have a spare weekend to deal with paving over the system.
[18:02] I just got my warrior working
[18:03] Which is a huge-ass euphemism
[18:03] But also true.
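A note on the what.cd thread above: two likely culprits are that the earlier -m runs honored the site's robots.txt (no -e robots=off was given), and that the combo quoted at [15:13] has no recursive flag at all (no -r or -m), so it can only fetch the start page plus its page requisites. A sketch combining the cookie, recursion and WARC pieces already mentioned in the thread, assuming a wget build with WARC support; the reject pattern and URL are just the examples from the log.

    wget --load-cookies cookies.txt -r -l inf -np -e robots=off \
         --adjust-extension --page-requisites \
         --reject "*action=print*" -w 5 --random-wait \
         --warc-file=whatcd-forums "https://ssl.what.cd/forums.php"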
[18:03] oh crap I left mine running at work
[18:03] doh
[18:03] Archive Team: Saving History... By Mistake
[18:03] ;)
[18:19] oh poop, my gamespy forum script is not getting it all
[18:37] it's wget's fault :(
[18:38] Oh yeah, blame your failures on wget
[18:38] :3
[18:41] might have been my tries and timeout limits
[18:42] those would be logged as errors even with -nv, right? because no errors here
[18:46] SketchCow: Following the twitter feedback, the warrior scripts now show the URL of the tracker stats page.
[18:56] Excellent.
[18:56] Codinghorror's got ideas aplenty
[18:57] They're generally very good, so they're worth considering.
[18:58] Love that warrior
[18:59] https://twitter.com/codinghorror/status/210446032104980480
[19:00] Ha ha, codinghorror is now in on the uploading.
[19:06] lol
[19:08] maybe Jeff Atwood can make Stack Overflow do some warrior work
[19:08] http://www.catwholaughed.com/previous/index.html is the artist who will do something for warrior
[19:10] what the hell, --level=10 did not do what i thought it would
[19:10] "wget -a test.log -m -nv --adjust-extension --convert-links --page-requisites -np --level=10 -X PrivateMessages -X Static http://forums.gamespy.com/fileplanet_command_center/b67434/p1"
[19:10] with the level=10 it would not download all the pXXXX, only 533
[19:10] ooh
[19:11] i thought that would control how many directories it would go down. but it makes it not crawl further than 10 "links" in terms of recursion, right?
[19:11] is there something to limit the directory level traversal?
[20:22] Hello, could someone change the status of ShoutWiki ( http://archiveteam.org/index.php?title=ShoutWiki ) to Online please as we are very much up and running, as you can see here: http://shoutwiki.com/wiki/Main_Page
[20:22] I would do so myself but account creation appears to not be working
[20:25] Further evidence is in our blog: http://blog.shoutwiki.com/
[20:30] Anyone?
[20:30] It will get done
[21:32] I've had two errors when using archiveteam warrior
[21:34] Both my fault. Network connected but internet stops working. Maybe because I'm pumping too much in and out? I'm on a 12Mb/s down 896Kb/s up connection.
[21:35] Anyway... Thought I'd let you know. I hope I didn't mess anything up or those 'items' get skipped?
[21:46] it depends
[21:46] what kind of errors?
[21:48] items that have been claimed but not marked as complete will eventually be sent to someone else to re-run
[21:48] if you've uploaded a broken pack, that's a bit different
[22:16] Hey, any Google Reader users out there, I'm building a site to filter/crawl RSS articles, so for Feeds like Gizmodo and others that only provide a snippet of the content in their RSS feed, I crawl the page, use the Readability algorithm to pull the full content and then replace their content in the RSS feed. If you guys want, feel free to test out the service and let me know how you like it.
[22:18] http://rsshose.com
[22:19] neat
[22:20] Also, if you read Hacker News, it'll provide a feed of all of the actual articles linked to, instead of "comments" as the body of each article in the feed.
[22:20] I'm adding de-duping to it soon too, so if/when the next apple event happens, you won't have to re-read the same damn article 100 times.
[22:21] I gave up HN recently
[23:39] moral quandary: is a website dedicated to acts of gore threatened with shutdown a worthy item of archiving? http://www.cbc.ca/news/politics/story/2012/06/05/pol-magnotta-best-gore-police.html?cmp=rss
[23:40] yes
[23:40] well, have at that... that site is NSFL.
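Back on the wget --level question at [19:11]: --level caps how many link hops wget follows from the start page, not how deep it goes in the directory tree, and there is no separate directory-depth switch; the directory-level controls are --no-parent plus -I/--include-directories (or -X to exclude). A sketch for keeping the gamespy grab inside that one forum, reusing the URL from the log.

    wget -m -np -I /fileplanet_command_center --adjust-extension --convert-links \
         "http://forums.gamespy.com/fileplanet_command_center/b67434/p1"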
[23:40] click the puppy, trust me.
[23:40] it will help to provide cultural context to future historians
[23:41] yeah i'd agree. i might think it's way fucked, but it's newsworthy, and noteworthy as a result
[23:41] anything that would cause someone 100 years from now to say "huh, I didn't know ..." is definitely worth it
[23:41] newsworthy and noteworthy is not the same as what I'm talking about
[23:42] right, i get your take on it though, it is historically significant to future sociologists and the like
[23:42] everyone archives for their own reasons
[23:42] these are mine
[23:44] DrainLbry: welcome back
[23:44] don't see you on freenode... :p
[23:51] attempting a magi.com mirror with wget-warc - if anyone eyeballs anything weird i'd have to do off the top of your head let me know, cause i just --mirror'd that bitch and hope it works - http://www.insidesocialgames.com/2012/06/06/magi-com-shutting-down/
[23:51] meh yeah, and that totally did not work.
[23:52] or maybe it did... site's not as epic as i thought
[23:55] yeah not so sure, anyone want to school me with some knowledge? 12 megs doesnt seem right
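On the magi.com attempt at [23:51]: a hedged sketch of the kind of wget-warc invocation that usually covers a small site; the exact URL and hostnames are assumptions. If 12 megs still looks wrong, the usual suspects are robots.txt blocking recursion (hence -e robots=off) and content living on another hostname, which needs --span-hosts with a --domains whitelist.

    wget -r -l inf -np -e robots=off --page-requisites --adjust-extension \
         -w 1 --random-wait --warc-file=magi-com \
         "http://magi.com/"
    # if pages/assets sit on other hosts, add: --span-hosts --domains=magi.com,example-cdn.tld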