[00:54] https://archive.org/details/JasonScottPresentationAtAmericanArchivistsMeetingWebArchivingGroupAugust2014
[15:48] Anyone able to find a cache of https://torrentfreak.com/bbc-fact-shut-down-doctor-who-fansite-140823/
[15:48] the site mentioned there?
[16:09] wow, that sounds seriously dodgy
[16:10] (i.e. FACT and the BBC just demanding the domain is handed over virtually then and there0
[16:10] 0
[16:10] hm, i can't type close-brackets for some reason
[16:15] they have done something like this with finalgear
[16:36] Smiley: http://webcache.googleusercontent.com/search?q=cache:Gh-eo6mhGnEJ:doctorwhomedia.co.uk/+&cd=1&hl=nl&ct=clnk&gl=nl
[17:53] Hello AT. A few days ago I was asking SketchCow for some help with a project I proposed in a HOPE X lightning session: http://ar.chiv.io
[17:53] The very basic idea is a private archive.today (was archive.is) + Pinboard.in
[17:54] With one fundamental feature: archiving every single page you visit
[17:55] The long-term plan is community-driven plugins for better archiving and DB federation to build a global public archive, but let's keep things simple to start
[17:55] I know I could pull that together with a few lines of scripting, but I don't want the dead HTML, I want to fetch the real page
[17:56] Everything is dynamic, API-based, AJAX-ridden, and wget is not enough anymore
[17:56] I *really* like what https://archive.today does, but it's stubbornly closed source, even if free
[17:57] So I wanted to ask if anyone has already started a project with similar goals -- load pages in real browsers and save the result -- from which I can get code
[17:58] Anyone? <3
[18:09] filippo: i'm working on wpull, which has experimental phantomjs support that can save snapshots into the warc under a non-standard uri. archivebot is using wpull.
[18:13] ciao filippo
[18:26] chfoo: are you speaking about a bare image screenshot or static HTML?
[18:27] (I think I checked out wpull while documenting)
[18:27] Hey Nemo_bis :)
[18:27] If you have a look at the archive.today zips, they are a packet of static HTML with embedded CSS and images
[18:28] I really like that approach, because as long as we have a VM we will be able to render the page, but the text content is available too
[18:28] filippo: right now wpull takes a DOM snapshot and saves it to a PDF and HTML resource. any phantomjs traffic is proxied and recorded into the warc
[18:33] chfoo: nice! So no scripts left, but what about images and external styles? Are they only recorded or also linked/embedded into the HTML?
[18:37] filippo: it only records them. the idea was that maybe someone will index the WARC file for the special Wpull resources and then rewrite the HTML so it displays properly.
[18:44] chfoo: I see, so one way to build my tool might be to do the post-processing on the WARCs or reuse the code from wpull
[18:45] The only other tool I was able to find up to now is a browser extension, SinglePage
[18:49] filippo: https://github.com/odie5533/WarcProxy ?
[18:49] not exactly what you're talking about, but it's ok
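For reference, the recording side discussed above (WarcProxy, and wpull's proxied phantomjs traffic) boils down to writing each HTTP response into a WARC record as it passes through. A minimal sketch of that step in Python, assuming the warcio library and a placeholder URL; the log itself names no specific writer:

```python
# Minimal sketch: fetch one page and record the HTTP response into a WARC,
# which is essentially what WarcProxy does for every request passing through it.
# warcio, requests and the example URL are assumptions for illustration only.
from io import BytesIO

import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

url = 'http://example.com/'                      # hypothetical page to capture
resp = requests.get(url)

with open('capture.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    # Copy the status line and headers naively; a real proxy would record the
    # raw bytes on the wire instead of the already-decoded requests response.
    http_headers = StatusAndHeaders(
        '{} {}'.format(resp.status_code, resp.reason),
        list(resp.headers.items()),
        protocol='HTTP/1.1',
    )
    record = writer.create_warc_record(
        url, 'response',
        payload=BytesIO(resp.content),
        http_headers=http_headers,
    )
    writer.write_record(record)
```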
[18:52] xmc: neat! I would not use it because of privacy concerns, but it can be useful
[18:52] (I want to archive from an external server)
[18:54] aye
[18:54] I was running it at home for a while
[18:55] it could use a few patches, such as a way to swap out the output file without restarting it, but that's fine
[20:01] Huzzah
[23:44] I just get a bunch of "Tracker rate limiting is active"
[23:44] it just downloads a file now and then
[23:45] maybe because a lot of others with low ping are also working on that project?
[23:47] it's to not crash the site (again)
[23:47] that's good
[23:48] but if I look at the stats it's like 5 or 6 people running a lot of files fast
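And the reading side that chfoo and filippo sketch at 18:37-18:44, indexing the WARC and rewriting the saved HTML so it displays properly, could start like this; again warcio is an assumption, and the actual rewrite is only hinted at in a comment:

```python
# Sketch of the WARC post-processing idea from 18:37-18:44: walk the archive,
# collect every recorded response by URL, so the saved HTML snapshot can later
# be rewritten to embed or re-link its images and stylesheets.
# warcio is an assumption; wpull's non-standard snapshot URIs are not handled here.
from warcio.archiveiterator import ArchiveIterator

resources = {}  # url -> (content type, body bytes)

with open('capture.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        ctype = record.http_headers.get_header('Content-Type', '') if record.http_headers else ''
        resources[url] = (ctype, record.content_stream().read())

# A real tool would now parse the snapshot HTML and replace <img>, <link> and
# <script> references with data: URIs (or local files) built from `resources`,
# producing the kind of self-contained packet archive.today hands out.
print(len(resources), 'responses recovered')
```

Doing this as post-processing over the WARC leaves the original records untouched, so the same capture can yield both a self-contained static packet and the raw data for replay.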