[05:46] Hey.
[05:47] Back in the US
[05:53] SketchCow, Is there a preference to how we keep the links in warc files?
[05:53] I couldn't find the answer on the wiki
[05:53] OK, so I am a little worried about this.
[05:53] We're solving this this week.
[05:54] We have people wander in, go "I WANNA SAVE CAMELTOE.ORG" and later they go "I SAVED IT"
[05:54] I want to make sure we're standardized on a WGET-WARC grab.
[05:56] this http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
[05:56] and this http://www.archiveteam.org/index.php?title=Software
[05:57] You just want a single page that lays out the process from beginning to end
[05:57] Yeah, they're there.
[05:57] Right.
[05:57] Create a wiki page and I can start editing it
[05:57] I want to call it Kamikazee
[05:57] For a single individual
[05:58] I ask because we are working on backing up ugo, ign, 1up and gamespy
[05:58] my code for my scripts may have some use
[05:58] I know.
[05:58] And I want those grabbed semi-intelligently.
[05:59] like how to mirror forums and then grab images after that fact
[05:59] SketchCow, that is a tall order based on the test grabs we have made so far
[05:59] *after the fact
[06:00] i think i'm close to having all the g4tv.com videos
[06:00] the hd videos are the ones i don't know if i will have all of them
[06:00] or could get all of them or even storage for them
[06:01] In general, I would HOPE it wasn't simple to grab those sites.
[06:02] My concern is the lack of a shutdown date
[06:05] Yeah
[06:05] It'll be 30 days or more
[06:06] I have a good solution if we do a multipass archive which can be completed in a shorter amount of time. We haven't seen any bans yet
[06:06] but previous projects all appear to be a single pass type approach
[06:08] First pass wget everything we can. Scan and map all those warcs for links and the link patterns we know. Do a second pass download using all the links we found and generated.
[06:10] I am also running a link mapper on these sites at present to find more buried content
[06:13] If I had to guess, it's finding every subdomain for those domains.
[06:13] Already done
[06:13] Is that on the wiki?
[06:14] we got this wiki page http://www.archiveteam.org/index.php?title=Ispygames but I do not have access to upload files or create new pages
[06:14] yeah I got a gamespy file, ign file and a 1up file
[06:14] and ugo
[06:18] For example of the 3,702 subdomains we know of for gamespy.com only 267 of them "work"
[06:18] Some of those redirect to other existing sites
[06:39] omf you should be able to register for the wiki. i can change stuff tho if you need as well.
[06:41] I already have an account that I use to edit the wiki I just cannot create pages
[06:41] never gave it much thought
[06:41] ah
[07:20] I FINALLY wrote the stupid script to take a bunch of sets of filenames for one object.
[07:20] for creating multi-file items?
[07:21] Example: http://archive.org/details/POWERDRIVE0198
[07:21] Pumped the CUE, BIN, ISO and JPG in
[07:21] ah bitchen
[07:22] I had a whole class of waiting items for this.
[07:22] So I can clear it out and get back into the groove
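The filename-grouping script mentioned at 07:20 isn't shown in the log, but the effect it describes (several related files, e.g. CUE, BIN, ISO and JPG, landing in a single archive.org item) can be sketched with the internetarchive command-line tool. This is a hypothetical example, not SketchCow's actual script; the identifier, filenames and metadata are placeholders:

    # Hypothetical multi-file upload to one archive.org item.
    # Assumes 'pip install internetarchive' and 'ia configure' have been run.
    ITEM=POWERDRIVE0198-example        # placeholder identifier, not the real item
    ia upload "$ITEM" \
        powerdrive.cue powerdrive.bin powerdrive.iso cover.jpg \
        --metadata="mediatype:software" \
        --metadata="title:Power Drive CD (example)"

The same call is available as upload() in the library's Python API if it needs to run inside a larger script.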
[07:23] today I met a gentleman who's scanned literally an order of magnitude more BSPs than I have
[07:23] he's entirely comfortable with putting them into IA
[07:23] Good
[07:23] I'll coordinate that; do you think I should put them in the same collection?
[07:24] iirc the collection still has that restriction on it for all member items
[07:24] Make a new collection
[07:24] k
[07:24] Oh, wait, talking off the top of my head
[07:24] To be honest, no, you should get your ass in gear on the undoing from your gang
[07:25] yes, I should
[07:25] But now you have the back pocket "get it up"
[07:25] ?
[07:25] I can't parse that
[07:32] I mean that if you can't get the letter, we just make a new collection and use new guy's stuff
[07:33] ah yes
[07:41] There, done, looking good.
[07:44] PowerPlay0196.jpg PowerPlay0297.rar PowerPlay0399.jpg PowerPlay0596.rar PowerPlay0699.jpg PowerPlay0895.rar PowerPlay0998.jpg PowerPlay1195.rar PowerPlay1299.jpg
[07:44] PowerPlay0196.rar PowerPlay0298.jpg PowerPlay0399.rar PowerPlay0597.jpg PowerPlay0699.rar PowerPlay0896.jpg PowerPlay0998.rar PowerPlay1196.jpg PowerPlay1299.rar
[07:44] So these .rar files will be split up and then I can upload all
[07:45] http://archive.org/details/powerdrivecd
[08:58] http://pipeline.corante.com/archives/2013/02/22/what_if_the_journal_disappears.php
[09:25] i assume the opensolaris stuff has been dealt with
[09:26] i'm getting 403 forbidden
[09:27] http://hub.opensolaris.org/bin/view/Main/ and http://src.opensolaris.org/source/ are both up
[09:29] chronomex, you got the oss list? If not I can build one up and start testing the repo pulls against what I got
[10:57] omf_: ughhhno
[11:39] omf_: fetching now, turns out the simplest way to copy out of this system is with sftp
[11:39] :)
[11:39] I'll turn it into a bunch of hg bundles once this finishes
[11:39] hg is pleasantly slow, I must say
[11:40] pity the system doesn't let you rsync, or this would be much faster
[11:41] true dat
[11:43] actually I think this is about the same speed as rsync
[11:43] the only way to win with many small files is tar/cpio, I think
[11:54] are any of these repos large?
[11:55] * chronomex shrugs
[11:55] still sucking em down
[11:55] hoping it'll fit on my 60G of free space in this laptop
[11:55] else I'll have to fire up the disk array
[12:05] there's a lot of fucking tiny ass files here
[12:05] this will take a while ...
[12:10] omf_: did you get http://defect.opensolaris.org/ ?
[16:39] someone else mentioned they had a script or something for bugzilla so I didn't try and grab it
[17:54] Has opensolaris been dealt with?
[17:54] Oh, we better get on this.
[17:54] ------------------------------
[17:54] OPENSOLARIS COORDINATION
[17:54] #closedsolaris
[17:54] ------------------------------
[18:31] you come up with good names very quickly
[19:00] what do I need to know re: Posterous?
[19:02] That there's a posterous channel in #preposterus and that there's a AT warrior project either running or coming soon
[19:02] scraping is done afaik, now it's grabbing dataz
[19:03] I see the warrior project, it says "they will ban you, check in at IRC before running this"
[19:15] so what happened with opensolaris?
[19:18] Did you not see the notice, ten lines up? #closedsolaris
[19:23] I did, but since I was pretty sure it was dead already and/or someone had grabbed all of their stuff previously, I was confused
[19:24] the site will be up till March 23, a few people seemed to misunderstand that
[19:26] like, many
[19:54] savetz: in order to run the posterous project correctly you need to be able to change ip addresses every hour
[19:56] how do i change my ip on linux?
[19:56] if your isp can give you a new one via dhcp, then that's fairly easy
[19:56] you'll have to convince your router to make that request though, if you have one
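A minimal sketch of the DHCP route suggested above, assuming a Linux machine that holds the ISP lease itself (if a home router holds it, the renewal has to happen on the router instead) and assuming the ISP actually hands out a different address on a new lease, which many do not; the interface name is a placeholder:

    # Release the current DHCP lease, then request a fresh one.
    sudo dhclient -r eth0      # release
    sudo dhclient eth0         # request a new lease
    ip addr show eth0          # check whether the address actually changed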
[20:01] so how aggressive is the blocking? will it always block more than one instance that's running continuously?
[20:02] they'll ban your ip no matter how slow you go
[20:02] it's better to run flat out and get as much as you can in the hour
[20:02] I've had some luck today at least. Looks like an instance I'm running elsewhere isn't getting the banhammer.
[20:03] we haven't worked out a good way to hand off from address to address though
[20:04] if you use a tun device to proxy to another machine, then you can probably move your proxy connection from server to server without stopping your downloads
[20:07] dashcloud: they ban freggin' everything man
[20:08] hrm
[20:08] wget won't recurse
[20:08] I put -r -l inf and it downloads the index.html and then stops
[20:08] Well, the tor network is a good source of IP's. Not sure how you'd torify any download scripts though, or even if it's possible.
[20:08] tor is very very slow
[20:09] ^
[20:09] but yea, you could do that with relatively little work
[20:09] we'd never finish if that's all we did though
[20:11] oooh, those idiots
[20:11] all the links go to www3.whatever, so wget plays dumb
[20:20] so all non hd videos are downloaded
[20:20] computer tech videos is going to become very big soon
[20:37] thanks for doing that godane I love me some techie computer videos
[20:56] 2000CD GET
[21:01] Just a reminder we have a wiki page about IRC channels http://www.archiveteam.org/index.php?title=IRC
[21:02] I just updated it with #closedsolaris #ispygames #preposterus
[21:18] How hard would it be to take a single day snapshot of eBay? Is that impossible?
[21:20] please add #aohell to the list as well
[21:20] I just did
[21:24] turnkit, maybe not all of ebay but a good chunk is possible
[21:25] you would premap the url scheme and need more than a few dozen clients downloading pages
[21:25] all day
[21:25] maybe see if the API can save time
[21:26] ebay is full of pictures of unique, interesting, historically relevant items that disappear after a month
[21:26] it's infuriating
[21:27] same with craigslist
[21:27] yes, ebay more so I think
[21:54] craigslist gets some stuff that would never be posted on ebay because it is not shippable. Slightly different markets, both very important
[22:31] i got the total number of videos for non-HD part of g4tv.com
[22:31] 36466
[22:32] that is what you have downloaded?
[22:32] yes
[22:32] that doesn't include hd videos
[23:24] from this tweet: https://twitter.com/blefurgy/status/304955585172996096 learned about: http://matkelly.com/wail/
[23:28] his previous app WARCreate I thought was more impressive
[23:29] I'll try it when there is a linux version
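Two earlier threads are worth a concrete sketch: the 20:08-20:11 wget problem (recursion stops because every link points at www3.<domain>, which plain -r treats as a foreign host and refuses to follow) and the standardized WGET-WARC grab asked for at 05:54. A hedged invocation covering both might look like the following; the domain and rate settings are placeholders, not an agreed-upon Archive Team standard:

    # Recursive grab that follows www3.* (and other subdomains of the target)
    # while writing everything into a WARC with a CDX index.
    wget -r -l inf --page-requisites \
         --span-hosts --domains=example.com \
         -e robots=off --wait=1 \
         --warc-file=example.com-$(date +%Y%m%d) --warc-cdx \
         http://www.example.com/

--span-hosts plus --domains lets wget hop from www.example.com to www3.example.com without wandering off the site entirely; exactly which extra flags belong in the "standard" invocation is what the wiki pages linked at 05:56 are meant to pin down.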