[00:10] i'm uploading more of marxists.org
[00:49] i'm uploading the latest DIY Tryin and Tekzilla episodes
[00:54] also Industrial Worker pdfs from marxists.org are getting uploaded too
[02:33] SketchCow: was there a talk by Brewster alluding to Yahoo destroying culture by burning Geocities?
[02:33] or was that you
[02:36] I'm sure both
[04:58] https://www.youtube.com/watch?v=Tsa4ogtBiv0 is pretty excellent
[05:54] is this fhqwhgads on every track of the album?
[05:59] ok. i'm curious about the downbeat tracks
[06:01] meh
[06:17] Fhqwhgads all the way down
[09:32] Is it better to download a website as just a collection of raw pages, or to extract the data from it all and create a kind of simulation of their database?
[09:38] depends what your goal is
[09:43] I guess
[09:44] this is not something like.. geocities or whatnot, but a more database-oriented website. every page has standard output, tags, comments, etc
[09:46] the more I think about it, the more I'm sure that pulling the data out and building a simulation of their database is best
[09:46] but in the general case, kind of wondering whether there's more of a tendency towards processed or raw data in archiveteam stuff
[09:47] archiveteam tends to grab everything
[09:47] never know what you have missed until it's gone, right
[09:47] drspangle: I'm pretty new to all this, been sorta lurking for a while. Archiveteam seems to just grab everything raw; I tend to grab just the bits I want, so my stuff isn't that useful
[09:48] e.g. archiveteam would grab fanfiction stories, the author page, reviews... while I just grab the chapters because it's all I need to process
[09:50] well, in the fanfic scenario, I'm thinking grab all the data from the page, including the author page, reviews, etc, but then format it into a database, rather than grabbing it all as story0237.htm, story0238.htm, ... author012.htm, author013.htm
[09:51] That can work too. It just won't show up on the Wayback Machine, I don't think, because that needs WARCs
[09:51] sorry, 'WARCs'? Not sure the definition
[09:52] A web archive format. It describes each request and response, including headers
[09:52] ah, okay
[09:53] The scripts I've seen seem to use wget-lua to do a lot of their link finding, maybe because wget has --page-requisites to grab a lot of the required css, javascript, images, etc that wouldn't be grabbed otherwise
[09:55] yeah
[09:56] It seems more efficient to pull the data out of the website, if the style is just a tool to display the data
[09:56] if the style is a feature in its own right, then it does want saving
[09:56] but I guess trying to define that difference is stupid, everyone will define it differently, so download it all so they can decide later
[09:57] For me, I'm not really sure how to do it properly. I can wget --mirror a static site and be fine, but otherwise I just scrape the stuff I want. I don't think there's a "how to archive dynamic forums" tutorial yet
[09:58] that's why there is #archivebot
[09:58] easiest way to feed sites into archive.org
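For anyone wondering what producing those WARCs looks like in practice, here is a minimal Python sketch using the warcio library (not mentioned in the channel, just one common way to do it). Everything fetched inside the capture_http block is written out as request/response records, headers included, which is the format the Wayback Machine can replay. The URL is a placeholder; GNU wget can also write WARCs directly with --warc-file=NAME.

    # Minimal sketch: write fetched pages into a WARC using warcio.
    # pip install warcio requests
    from warcio.capture_http import capture_http
    import requests  # import after capture_http so warcio can patch it

    # Placeholder URLs; substitute whatever pages you are grabbing.
    urls = ["https://example.com/some/page.html"]

    with capture_http("grab.warc.gz"):
        for url in urls:
            # Each request/response pair, headers included, is appended
            # to grab.warc.gz as WARC records.
            requests.get(url)

The resulting grab.warc.gz is the kind of file archive.org and the Wayback Machine ingest.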
[10:01] I want to rip the spacebattles forums, particularly the fanfiction forum, so I can read its posts with my editor. I can write some scripts to take out the posts I don't want
[10:02] I'm kind of tempted to just attempt to simulate the website. Download the data, convert it into a simulated database, get a copy of the style, poke the data into where it goes
[10:02] but pretty sure that's a bad idea and would be discouraged
[10:02] but it would mean easier processing of the data
[10:29] drspangle: for archiving, grab it as it is. for rehosting/displaying it differently you might want to parse it
[10:56] schbirid: ah, okay
[12:12] midas: harddisk, harddisk, harddisk? :>
[12:12] i've ordered a new 4TB disk for travel :p
[12:12] woot
[12:25] you know what would be stupid fun? people running encrypted garbage faucet services to generate lots of encrypted traffic for fun
[12:26] to annoy secret agencies and to test one's ISP's monthly limits by downloading it to /dev/null
[12:26] monthly limits? lol, how barbarian!
[12:26] :)
[12:37] yay for netcat
[15:04] http://imgur.com/gallery/nKkgfWK
[15:09] oh yes
[15:09] this is exactly what I needed
[16:00] http://apnews.excite.com/article/20140507/alibaba_ipo-yahoos_windfall-08fb26a635.html
[16:09] a video about the search for James Kim: http://download.cbsnews.com/media/2006/12/05/video2232655.flv
[17:54] so....
[17:54] i wish we could pull all the old cartoons from that app on my phone
[17:54] cartoon HD
[18:27] book i just uploaded about driving the SR-71, read it before someone submits a DMCA or something: https://archive.org/stream/SledDriver
[19:42] So, anyone else up for pulling as many cartoons from Cartoon HD as possible before it gets shut down? (I can't imagine that it won't be, but there's some old movies on there.)
[21:08] tried to find the app (ios) but it seems to be missing here. but i'm willing to grab as much as possible
[21:14] midas: it's... hidden
[21:14] http://gappcenter.com/cartoon/index.html
[21:14] i guess for iDevices you have to be rooted to install it now?
[21:22] doesn't say jailbroken only
[21:24] right
[21:24] it sideloaded on my iphone...
[21:24] that's kinda freaky
[21:30] they're exploiting enterprise certs
[21:30] can't see it lasting long
[21:30] yeah I ain't vouching for the security of it.
[21:31] so i was thinking run an android image, set up a mount, pull the movies down in the image and then pull them onto the system via that
[21:33] i'm checking out the apk to see what it does, can't be that hard
[21:33] midas: nod, it's prob just a case of finding the url ;)
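Going back to the parse-it-into-a-database idea from [09:50] and [10:02], here is a rough, hypothetical sketch of that approach for something like the fanfic case: pulling each chapter page into SQLite instead of keeping story0237.htm-style files. The table layout, CSS selectors, and URL are all invented for illustration and would need to match the real site. As noted at [10:29], this is for rehosting or processing the data, not a substitute for archiving the raw pages.

    # Hypothetical sketch of the "simulate their database" approach:
    # scrape chapter pages into SQLite rather than keeping raw HTML files.
    # pip install requests beautifulsoup4
    import sqlite3

    import requests
    from bs4 import BeautifulSoup

    db = sqlite3.connect("stories.db")
    db.execute(
        """CREATE TABLE IF NOT EXISTS chapters (
               url    TEXT PRIMARY KEY,
               author TEXT,
               title  TEXT,
               body   TEXT
           )"""
    )

    def grab_chapter(url):
        """Fetch one chapter page and store the parsed fields."""
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # These selectors are made up; inspect the real site's markup.
        author = soup.select_one(".author").get_text(strip=True)
        title = soup.select_one("h1").get_text(strip=True)
        body = soup.select_one(".chapter-text").get_text("\n", strip=True)
        db.execute(
            "INSERT OR REPLACE INTO chapters VALUES (?, ?, ?, ?)",
            (url, author, title, body),
        )
        db.commit()

    # Placeholder URL, just to show the call.
    grab_chapter("https://example.com/story/237/chapter/1")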