#archiveteam 2013-02-24,Sun

Time Nickname Message
05:46 🔗 SketchCow Hey.
05:47 🔗 SketchCow Back in the US
05:53 🔗 omf_ SketchCow, Is there a preference to how we keep the links in warc files?
05:53 🔗 omf_ I couldn't find the answer on the wiki
05:53 🔗 SketchCow OK, so I am a little worried about this.
05:53 🔗 SketchCow We're solving this this week.
05:54 🔗 SketchCow We have people wander in, go "I WANNA SAVE CAMELTOE.ORG" and later they go "I SAVED IT"
05:54 🔗 SketchCow I want to make sure we're standardized on a WGET-WARC grab.
05:56 🔗 omf_ this http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
05:56 🔗 omf_ and this http://www.archiveteam.org/index.php?title=Software
05:57 🔗 omf_ You just want a single page that lays out the process from beginning to end
05:57 🔗 SketchCow Yeah, they're there.
05:57 🔗 SketchCow Right.
05:57 🔗 omf_ Create a wiki page and I can start editing it
05:57 🔗 SketchCow I want to call it Kamikazee
05:57 🔗 SketchCow For a single individual
05:58 🔗 omf_ I ask because we are working on backing up ugo, ign, 1up and gamespy
05:58 🔗 godane my code for my scripts may have some use
05:58 🔗 SketchCow I know.
05:58 🔗 SketchCow And I want those grabbed semi-intelligently.
05:59 🔗 godane like how to mirror forums and then grab images after that fact
05:59 🔗 omf_ SketchCow, that is a tall order based on the test grabs we have made so far
05:59 🔗 godane *after the fact
06:00 🔗 godane i think i'm close to having all the g4tv.com videos
06:00 🔗 godane the hd videos are the ones i don't know if i will have all of them
06:00 🔗 godane or could get all of them or even storage for them
06:01 🔗 SketchCow In general, I would HOPE it wasn't simple to grab those sites.
06:02 🔗 omf_ My concern is the lack of a shutdown date
06:05 🔗 SketchCow Yeah
06:05 🔗 SketchCow It'll be 30 days or more
06:06 🔗 omf_ I have a good solution if we do a multipass archive which can be completed in a shorter amount of time. We haven't seen any bans yet
06:06 🔗 omf_ but previous projects all appear to be a single pass type approach
06:08 🔗 omf_ First pass wget everything we can. Scan and map all those warcs for links and the link patterns we know. Do a second pass download using all the links we found and generated.
06:10 🔗 omf_ I am also running a link mapper on these sites at present to find more buried content
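
The multipass idea omf_ describes (fetch, scan for links, fetch again) could be sketched roughly as follows. This is only an illustration, not the script anyone in the channel was running. It assumes the page bodies have already been pulled out of the WARCs into a dict (WARC parsing is not shown), and the names `LinkCollector` and `second_pass_urls` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page they came from.
                self.links.add(urljoin(self.base_url, value))


def second_pass_urls(pages, already_fetched, wanted_domain):
    """Return URLs worth fetching in a second pass.

    pages: dict mapping fetched URL -> HTML text (hypothetical input,
    assumed to have been extracted from the first-pass WARCs).
    """
    found = set()
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        found |= collector.links

    todo = set()
    for link in found:
        parts = urlsplit(link)
        host = parts.hostname or ""
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        # Keep the domain itself and any of its subdomains.
        if host == wanted_domain or host.endswith("." + wanted_domain):
            if link not in already_fetched:
                todo.add(link)
    return sorted(todo)
```

The returned list could then be written to a file and handed to a second download run. URLs generated from known link patterns, which omf_ also mentions, would have to be added separately.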
06:13 🔗 SketchCow If I had to guess, it's finding every subdomain for those domains.
06:13 🔗 omf_ Already done
06:13 🔗 SketchCow Is that on the wiki?
06:14 🔗 omf_ we got this wiki page http://www.archiveteam.org/index.php?title=Ispygames but I do not have access to upload files or create new pages
06:14 🔗 omf_ yeah I got a gamespy file, ign file and a 1up file
06:14 🔗 omf_ and ugo
06:18 🔗 omf_ For example of the 3,702 subdomains we know of for gamespy.com only 267 of them "work"
06:18 🔗 omf_ Some of those redirect to other existing sites
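
A rough sketch of how a subdomain list like this might be sorted into "works", "redirects elsewhere" and "dead". The log does not say how omf_ actually tested the hosts or what counted as "work", so the rules in `classify` are an assumption, and both function names are hypothetical.

```python
import http.client
from urllib.parse import urlsplit


def probe(host, timeout=10):
    """Send one HEAD request; return (status, Location header) or (None, None)."""
    try:
        conn = http.client.HTTPConnection(host, timeout=timeout)
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        result = (resp.status, resp.getheader("Location"))
        conn.close()
        return result
    except (OSError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, malformed reply.
        return (None, None)


def classify(host, status, location):
    """Assumed rules: 2xx/3xx counts as working unless it redirects to another host."""
    if status is None:
        return "dead"
    if 300 <= status < 400 and location:
        target = urlsplit(location).hostname
        if target and target != host:
            return "redirects-elsewhere"
        return "works"
    if 200 <= status < 400:
        return "works"
    return "dead"
```

Running `classify(host, *probe(host))` over each name in the list would give a tally comparable to the 267-of-3,702 figure, under whatever definition of "works" is chosen.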
06:39 🔗 S[h]O[r]T omf you should be able to register for the wiki. i can change stuff tho if you need as well.
06:41 🔗 omf_ I already have an account that I use to edit the wiki I just cannot create pages
06:41 🔗 omf_ never gave it much thought
06:41 🔗 S[h]O[r]T ah
07:20 🔗 SketchCow I FINALLY wrote the stupid script to take a bunch of sets of filenames for one object.
07:20 🔗 chronomex for creating multi-file items?
07:21 🔗 SketchCow Example: http://archive.org/details/POWERDRIVE0198
07:21 🔗 SketchCow Pumped the CUE, BIN, ISO and JPG in
07:21 🔗 chronomex ah bitchen
07:22 🔗 SketchCow I had a whole class of waiting items for this.
07:22 🔗 SketchCow So I can clear it out and get back into the groove
07:23 🔗 chronomex today I met a gentleman who's scanned literally an order of magnitude more BSPs than I have
07:23 🔗 chronomex he's entirely comfortable with putting them into IA
07:23 🔗 SketchCow Good
07:23 🔗 chronomex I'll coordinate that; do you think I should put them in the same collection?
07:24 🔗 chronomex iirc the collection still has that restriction on it for all member items
07:24 🔗 SketchCow Make a new collection
07:24 🔗 chronomex k
07:24 🔗 SketchCow Oh, wait, talking off the top of my head
07:24 🔗 SketchCow To be honest, no, you should get your ass in gear on the undoing from your gang
07:25 🔗 chronomex yes, I should
07:25 🔗 SketchCow But now you have the back pocket "get it up"
07:25 🔗 chronomex ?
07:25 🔗 chronomex I can't parse that
07:32 🔗 SketchCow I mean that if you can't get the letter, we just make a new collection and use new guy's stuff
07:33 🔗 chronomex ah yes
07:41 🔗 SketchCow There, done, looking good.
07:44 🔗 SketchCow PowerPlay0196.jpg PowerPlay0297.rar PowerPlay0399.jpg PowerPlay0596.rar PowerPlay0699.jpg PowerPlay0895.rar PowerPlay0998.jpg PowerPlay1195.rar PowerPlay1299.jpg
07:44 🔗 SketchCow PowerPlay0196.rar PowerPlay0298.jpg PowerPlay0399.rar PowerPlay0597.jpg PowerPlay0699.rar PowerPlay0896.jpg PowerPlay0998.rar PowerPlay1196.jpg PowerPlay1299.rar
07:44 🔗 SketchCow So these .rar files will be split up and then I can upload all
07:45 🔗 SketchCow http://archive.org/details/powerdrivecd
08:58 🔗 chronomex http://pipeline.corante.com/archives/2013/02/22/what_if_the_journal_disappears.php
09:25 🔗 Lord_Nigh i assume the opensolaris stuff has been dealt with
09:26 🔗 Lord_Nigh i'm getting 403 forbidden
09:27 🔗 omf_ http://hub.opensolaris.org/bin/view/Main/ and http://src.opensolaris.org/source/ are both up
09:29 🔗 omf_ chronomex, you got the oss list? If not I can build one up and start testing the repo pulls against what I got
10:57 🔗 chronomex omf_: ughhhno
11:39 🔗 chronomex omf_: fetching now, turns out the simplest way to copy out of this system is with sftp
11:39 🔗 chronomex :)
11:39 🔗 chronomex I'll turn it into a bunch of hg bundles once this finishes
11:39 🔗 chronomex hg is pleasantly slow, I must say
11:40 🔗 chronomex pity the system doesn't let you rsync, or this would be much faster
11:41 🔗 omf_ true dat
11:43 🔗 chronomex actually I think this is about the same speed as rsync
11:43 🔗 chronomex the only way to win with many small files is tar/cpio, I think
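
The tar/cpio point is that one large archive avoids the per-file overhead of transferring many tiny files. A minimal sketch of that bundling step using Python's standard `tarfile` module, assuming the files are readable on the local side; `bundle_tree` is a hypothetical helper, not something chronomex used.

```python
import os
import tarfile


def bundle_tree(src_dir, tar_path):
    """Pack every file under src_dir into one uncompressed tar archive.

    Returns the number of files added. tar_path should live outside
    src_dir so the archive does not try to include itself.
    """
    count = 0
    with tarfile.open(tar_path, "w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Store paths relative to src_dir, not absolute paths.
                tar.add(full, arcname=os.path.relpath(full, src_dir))
                count += 1
    return count
```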
11:54 🔗 omf_ are any of these repos large?
11:55 🔗 * chronomex shrugs
11:55 🔗 chronomex still sucking em down
11:55 🔗 chronomex hoping it'll fit on my 60G of free space in this laptop
11:55 🔗 chronomex else I'll have to fire up the disk array
12:05 🔗 chronomex there's a lot of fucking tiny ass files here
12:05 🔗 chronomex this will take a while ...
12:10 🔗 chronomex omf_: did you get http://defect.opensolaris.org/ ?
16:39 🔗 omf_ someone else mentioned they had a script or something for bugzilla so I didn't try and grab it
17:54 🔗 SketchCow Has opensolaris been dealt with?
17:54 🔗 SketchCow Oh, we better get on this.
17:54 🔗 SketchCow ------------------------------
17:54 🔗 SketchCow OPENSOLARIS COORDINATION
17:54 🔗 SketchCow #closedsolaris
17:54 🔗 SketchCow ------------------------------
18:31 🔗 db48x22 you come up with good names very quickly
19:00 🔗 savetz what do I need to know re: Posterous?
19:02 🔗 ersi That there's a posterous channel in #preposterus and that there's an AT warrior project either running or coming soon
19:02 🔗 ersi scraping is done afaik, now it's grabbing dataz
19:03 🔗 savetz I see the warrior project, it says "they will ban you, check in at IRC before running this"
19:15 🔗 dashcloud so what happened with opensolaris?
19:18 🔗 ersi Did you not see the notice, ten lines up? #closedsolaris
19:23 🔗 dashcloud I did, but since I was pretty sure it was dead already and/or someone had grabbed all of their stuff previously, I was confused
19:24 🔗 omf_ the site will be up till March 23, a few people seemed to misunderstand that
19:26 🔗 ersi like, many
19:54 🔗 db48x22 savetz: in order to run the posterous project correctly you need to be able to change ip addresses every hour
19:56 🔗 godane how do i change my ip on linux?
19:56 🔗 db48x22 if your isp can give you a new one via dhcp, then that's fairly easy
19:56 🔗 db48x22 you'll have to convince your router to make that request though, if you have one
20:01 🔗 dashcloud so how aggressive is the blocking? will it always block more than one instance that's running continuously?
20:02 🔗 db48x22 they'll ban your ip no matter how slow you go
20:02 🔗 db48x22 it's better to run flat out and get as much as you can in the hour
20:02 🔗 aggrosk I've had some luck today at least. Looks like an instance I'm running elsewhere isn't getting the banhammer.
20:03 🔗 db48x22 we haven't worked out a good way to hand off from address to address though
20:04 🔗 db48x22 if you use a tun device to proxy to another machine, then you can probably move your proxy connection from server to server without stopping your downloads
20:07 🔗 ersi dashcloud: they ban freggin' everything man
20:08 🔗 db48x22 hrm
20:08 🔗 db48x22 wget won't recurse
20:08 🔗 db48x22 I put -r -l inf and it downloads the index.html and then stops
20:08 🔗 aggrosk Well, the tor network is a good source of IP's. Not sure how you'd torify any download scripts though, or even if it's possible.
20:08 🔗 db48x22 tor is very very slow
20:09 🔗 aggrosk ^
20:09 🔗 db48x22 but yea, you could do that with relatively little work
20:09 🔗 db48x22 we'd never finish if that's all we did though
20:11 🔗 db48x22 oooh, those idiots
20:11 🔗 db48x22 all the links go to www3.whatever, so wget plays dumb
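
By default a recursive wget run stays on the host it started from, which matches the symptom here: every link points at a different hostname (www3), so nothing is followed. The log does not record what db48x22 changed, so the following is only one plausible fix: allow host spanning but restrict it to a list of domains. The flags are standard GNU wget options (`--warc-file` needs a wget build with WARC support). The helper name and the example hostnames are made up.

```python
def wget_args(start_url, allowed_domains, warc_name):
    """Build a wget command line that recurses across a fixed set of domains."""
    return [
        "wget",
        "--recursive",
        "--level=inf",
        # Follow links to other hosts...
        "--span-hosts",
        # ...but only hosts under these domains.
        "--domains=" + ",".join(allowed_domains),
        "--warc-file=" + warc_name,
        start_url,
    ]


# Hypothetical usage; pass the list to subprocess.run() to actually start wget.
args = wget_args(
    "http://www.example.com/",
    ["www.example.com", "www3.example.com"],
    "example-grab",
)
```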
20:20 🔗 godane so all non hd videos are downloaded
20:20 🔗 godane computer tech videos are going to become very big soon
20:37 🔗 omf_ thanks for doing that godane I love me some techie computer videos
20:56 🔗 DFJustin 2000CD GET
21:01 🔗 omf_ Just a reminder we have a wiki page about IRC channels http://www.archiveteam.org/index.php?title=IRC
21:02 🔗 omf_ I just updated it with #closedsolaris #ispygames #preposterus
21:18 🔗 turnkit How hard would it be to take a single day snapshot of eBay? Is that impossible?
21:20 🔗 dashcloud please add #aohell to the list as well
21:20 🔗 DFJustin I just did
21:24 🔗 omf_ turnkit, maybe not all of ebay but a good chunk is possible
21:25 🔗 omf_ you would premap the url scheme and need more than a few dozen clients downloading pages
21:25 🔗 omf_ all day
21:25 🔗 omf_ maybe see if the API can save time
21:26 🔗 chronomex ebay is full of pictures of unique, interesting, historically relevant items that disappear after a month
21:26 🔗 chronomex it's infuriating
21:27 🔗 omf_ same with craigslist
21:27 🔗 chronomex yes, ebay more so I think
21:54 🔗 omf_ craigslist gets some stuff that would never be posted on ebay because it is not shippable. Slightly different markets, both very important
22:31 🔗 godane i got the total number of videos for non-HD part of g4tv.com
22:31 🔗 godane 36466
22:32 🔗 omf_ that is what you have downloaded?
22:32 🔗 godane yes
22:32 🔗 godane that doesn't include hd videos
23:24 🔗 dashcloud from this tweet: https://twitter.com/blefurgy/status/304955585172996096 learned about: http://matkelly.com/wail/
23:28 🔗 omf_ his previous app WARCreate I thought was more impressive
23:29 🔗 omf_ I'll try it when there is a linux version
