#archiveteam-bs 2014-05-08,Thu


Time Nickname Message
00:10 🔗 godane i'm uploading more of marxists.org
00:49 🔗 godane i'm uploading the latest DIY Tryin and tekzilla episodes
00:54 🔗 godane also Industrial Worker PDFs from marxists.org are getting uploaded too
02:33 🔗 balrog SketchCow: was there a talk by Brewster alluding to Yahoo destroying culture by burning Geocities?
02:33 🔗 balrog or was that you
02:36 🔗 SketchCow I'm sure both
04:58 🔗 yipdw https://www.youtube.com/watch?v=Tsa4ogtBiv0 is pretty excellent
05:54 🔗 Coderjoe is this fhqwhgads on every track of the album?
05:59 🔗 Coderjoe ok. i'm curious about the downbeat tracks
06:01 🔗 Coderjoe meh
06:17 🔗 Coderjoe Fhqwhgads all the way down
09:32 🔗 drspangle Is it better to download a website as just a collection of raw pages, or to extract the data from it all and create a kind of simulation of their database?
09:38 🔗 schbirid depends what your goal is
09:43 🔗 drspangle I guess
09:44 🔗 drspangle this is not something like.. geocities or whatnot, but a more database-oriented website. every page has standard output, tags, comments, etc
09:46 🔗 drspangle the more I think about it, the more I'm sure that pulling the data out and building a simulation of their database is best
09:46 🔗 drspangle but in the general case, kind of wondering whether there's more of a tendency towards processed or raw data in archiveteam stuff
09:47 🔗 midas archiveteam tends to grab everything
09:47 🔗 midas never know what you have missed until it's gone, right
09:47 🔗 tsp__ drspangle: I'm pretty new to all this, been sorta lurking for a while. Archiveteam seems to just grab everything raw; I tend to grab just the bits I want, so my stuff isn't that useful
09:48 🔗 tsp__ e.g. archiveteam would grab fanfiction stories, the author page, reviews... while I just grab the chapters because it's all I need to process
09:50 🔗 drspangle well, in the fanfic scenario, I'm thinking grab all the data from the page, including the author page, reviews, etc, but then format it into a database, rather than grabbing it all as story0237.htm, story0238.htm, ... author012.htm, author013.htm
09:51 🔗 tsp__ That can work too. It just won't show up on the Wayback Machine, I don't think, because that needs WARCs
09:51 🔗 drspangle sorry, 'warcs'? Not sure of the definition
09:52 🔗 tsp__ Some kind of web archive format. It describes the request and response, including headers
09:52 🔗 drspangle ah, okay
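[editor's note: for context, a WARC file is a sequence of records, each opening with headers describing one capture, followed by the raw HTTP traffic. A minimal sketch of a single response record per the WARC 1.0 spec — the URI, date, record ID, and length here are illustrative only:]

    WARC/1.0
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:...>
    WARC-Target-URI: http://example.com/
    WARC-Date: 2014-05-08T09:52:00Z
    Content-Type: application/http; msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Content-Type: text/html
    ...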
09:53 🔗 tsp__ The scripts I've seen seem to use wget-lua to do a lot of their link finding, maybe because wget has --page-requisites to grab a lot of the required CSS, JavaScript, images, etc. that wouldn't be grabbed otherwise
09:55 🔗 drspangle yeah
09:56 🔗 drspangle It seems more efficient to pull the data out of the website, if the style is just a tool to display the data
09:56 🔗 drspangle if the style is a feature in its own right, then it does want saving
09:56 🔗 drspangle but I guess defining that difference is stupid, everyone will define it differently, so download it all so they can decide later
09:57 🔗 tsp__ For me, I'm not really sure how to do it properly; I can wget --mirror a static site and be fine, but otherwise I just scrape the stuff I want. I don't think there's a 'how to archive dynamic forums' tutorial yet
09:58 🔗 midas that's why there is #archivebot
09:58 🔗 midas easiest way to feed sites into archive.org
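[editor's note: a minimal sketch tying together the wget flags mentioned above — --mirror to recurse, --page-requisites to pull the CSS/JS/images each page needs, and --warc-file to write the crawl out as a WARC alongside the mirrored tree; example.com is a placeholder:]

    # mirrors the site to ./example.com/ and also writes example-site.warc.gz
    wget --mirror --page-requisites --warc-file=example-site http://example.com/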
10:01 🔗 tsp__ I want to rip the spacebattles forums, particularly the fanfiction forum so I can read its posts with my editor. I can write some scripts to take out the posts I don't want
10:02 🔗 drspangle I'm kind of tempted to just attempt to simulate the website. Download the data, convert it into a simulated database, get a copy of the style, poke the data into where it goes
10:02 🔗 drspangle but pretty sure that's a bad idea and would be discouraged
10:02 🔗 drspangle but would mean easier processing of the data
10:29 🔗 schbirid drspangle: for archiving, grab it as it is. for rehosting/displaying it differently, you might want to parse it
10:56 🔗 drspangle schbirid: ah, okay
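[editor's note: a minimal sketch of the parse-for-display approach schbirid describes, assuming the site has already been mirrored to disk. extract_fields is a hypothetical helper that parses one saved page and prints one id|title|author line (pipe-separated, matching the sqlite3 shell's default import separator):]

    # create the table, then import one row per mirrored story page
    sqlite3 stories.db 'CREATE TABLE IF NOT EXISTS stories(id INTEGER, title TEXT, author TEXT);'
    for f in mirror/story*.htm; do
        ./extract_fields "$f"    # hypothetical HTML parser
    done | sqlite3 stories.db '.import /dev/stdin stories'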
12:12 🔗 schbirid midas: harddisk, harddisk, harddisk? :>
12:12 🔗 midas i've ordered a new 4TB disk for travel :p
12:12 🔗 schbirid woot
12:25 🔗 schbirid you know what would be stupid fun? people running encrypted garbage faucet services to generate lots of encrypted traffic for fun
12:26 🔗 schbirid to annoy secret agencies and to test one's ISP's monthly limits by downloading it to /dev/null
12:26 🔗 ersi monthly limits? lol, how barbarian!
12:26 🔗 schbirid :)
12:37 🔗 schbirid yay for netcat
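[editor's note: a minimal sketch of the 'garbage faucet' joke above, using netcat — /dev/urandom stands in for encrypted traffic, and the client discards everything. The -p flag is for traditional netcat, and example.com is a placeholder:]

    # faucet side: serve endless random bytes on port 9999
    nc -l -p 9999 < /dev/urandom
    # client side: download to /dev/null to burn through the monthly cap
    nc example.com 9999 > /dev/null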
15:04 🔗 SketchCow http://imgur.com/gallery/nKkgfWK
15:09 🔗 exmic oh yes
15:09 🔗 exmic this is exactly what I needed
16:00 🔗 DFJustin http://apnews.excite.com/article/20140507/alibaba_ipo-yahoos_windfall-08fb26a635.html
16:09 🔗 godane a video about the search for James Kim: http://download.cbsnews.com/media/2006/12/05/video2232655.flv
17:54 🔗 Smiley so....
17:54 🔗 Smiley i wish we could pull all the old cartoons from that app on my phone
17:54 🔗 Smiley cartoon HD
18:27 🔗 exmic book i just uploaded about driving the SR-71, read it before someone submits a DMCA or something https://archive.org/stream/SledDriver
19:42 🔗 Smiley So, anyone else up for pulling as many cartoons from cartoon HD as possible before it gets shut down? (I can't imagine that it won't, but there's some old movies on there).
21:08 🔗 midas tried to find the app (iOS) but it seems to be missing here. but I'm willing to grab as much as possible
21:14 🔗 Smiley midas: it's... hidden
21:14 🔗 Smiley http://gappcenter.com/cartoon/index.html
21:14 🔗 Smiley i guess for iDevices you have to be rooted to install it now?
21:22 🔗 midas doesn't say jailbroken only
21:24 🔗 midas right
21:24 🔗 midas it sideloaded on my iphone...
21:24 🔗 midas that's kinda freaky
21:30 🔗 balrog they're exploiting enterprise certs
21:30 🔗 balrog can't see it lasting long
21:30 🔗 Smiley yeah I ain't guaranteeing the security of it.
21:31 🔗 Smiley so i was thinking: run an Android image, set up a mount, pull the movies down in the image and then pull them onto the system via that
21:33 🔗 midas i'm checking out the apk to see what it does, can't be that hard
21:33 🔗 Smiley midas: nod, it's prob just a case of finding the url ;)
