[00:54] https://archive.org/details/JasonScottPresentationAtAmericanArchivistsMeetingWebArchivingGroupAugust2014
[15:48] Anyone able to find a cache of https://torrentfreak.com/bbc-fact-shut-down-doctor-who-fansite-140823/
[15:48] the site mentioned there?
[16:09] wow, that sounds seriously dodgy
[16:10] (i.e. FACT and the BBC just demanding the domain is handed over virtually then and there0
[16:10] 0
[16:10] hm, i can't type close-brackets for some reason
[16:15] they have done something like this with finalgear
[16:36] Smiley: http://webcache.googleusercontent.com/search?q=cache:Gh-eo6mhGnEJ:doctorwhomedia.co.uk/+&cd=1&hl=nl&ct=clnk&gl=nl
[17:53] Hello AT. A few days ago I was asking SketchCow for some help with a project I proposed in a HOPE X lightning session: http://ar.chiv.io
[17:53] The very basic idea is a private archive.today (was archive.is) + Pinboard.in
[17:54] With one fundamental feature: archiving every single page you visit
[17:55] The long-term plan is community-driven plugins for better archiving and DB federation to build a global public archive, but let's keep things simple to start
[17:55] I know I could pull that together with a few lines of scripting, but I don't want the dead HTML, I want to fetch the real page
[17:56] Everything is dynamic, API-based, AJAX-ridden, and wget is not enough anymore
[17:56] I *really* like what https://archive.today does, but it's stubbornly closed source, even if free
[17:57] So I wanted to ask if anyone has already started a project with similar goals -- load pages in real browsers and save the result -- from which I can get code
[17:58] Anyone? <3
[18:09] filippo: i'm working on wpull, which has experimental phantomjs support that can save snapshots into the warc under a non-standard uri. archivebot is using wpull.
[18:13] ciao filippo
[18:26] chfoo: are you speaking about a bare image screenshot or static HTML?
[18:27] (I think I checked out wpull while documenting)
[18:27] Hey Nemo_bis :)
[18:27] If you have a look at the archive.today zips, they are a packet of static HTML with embedded CSS and images
[18:28] I really like that approach, because as long as we have a VM we will be able to render the page, but the text content is available too
[18:28] filippo: right now wpull takes a DOM snapshot and saves it to a PDF and HTML resource. any phantomjs traffic is proxied and recorded into the warc
[18:33] chfoo: nice! So no scripts left, but what about images and external styles? Are they only recorded or also linked/embedded into the HTML?
[18:37] filippo: it only records them. the idea was that maybe someone will index the WARC file for the special Wpull resources and then rewrite the HTML so it displays properly.
[18:44] chfoo: I see, so one way to build my tool might be to do the post-processing on the WARCs or reuse the code from wpull
[18:45] The only other tool I was able to find up to now is a browser extension, SinglePage
[18:49] filippo: https://github.com/odie5533/WarcProxy ?
[18:49] not exactly what you're talking about, but it's ok
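For reference, the recording side discussed above (WarcProxy, and wpull's proxied phantomjs traffic) boils down to writing each HTTP response into a WARC record as it passes through. A minimal sketch of that step in Python, assuming the warcio library and a placeholder URL; the log itself names no specific writer:

```python
# Minimal sketch: fetch one page and record the HTTP response into a WARC,
# which is essentially what WarcProxy does for every request passing through it.
# warcio, requests and the example URL are assumptions for illustration only.
from io import BytesIO

import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

url = 'http://example.com/'                      # hypothetical page to capture
resp = requests.get(url)

with open('capture.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    # Copy the status line and headers naively; a real proxy would record the
    # raw bytes on the wire instead of the already-decoded requests response.
    http_headers = StatusAndHeaders(
        '{} {}'.format(resp.status_code, resp.reason),
        list(resp.headers.items()),
        protocol='HTTP/1.1',
    )
    record = writer.create_warc_record(
        url, 'response',
        payload=BytesIO(resp.content),
        http_headers=http_headers,
    )
    writer.write_record(record)
```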
[18:52] xmc: neat! I would not use it because of privacy concerns, but it can be useful
[18:52] (I want to archive from an external server)
[18:54] aye
[18:54] I was running it at home for a while
[18:55] it could use a few patches, such as a way to swap out the output file without restarting it, but that's fine
[20:01] Huzzah
[23:44] I just get a bunch of "Tracker rate limiting is active"
[23:44] it just downloads a file now and then
[23:45] maybe because a lot of others with low ping are also working on that project?
[23:47] it's to not crash the site (again)
[23:47] that's good
[23:48] but if I look at the stats it's like 5 or 6 people running a lot of files fast
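And the reading side that chfoo and filippo sketch at 18:37-18:44, indexing the WARC and rewriting the saved HTML so it displays properly, could start like this; again warcio is an assumption, and the actual rewrite is only hinted at in a comment:

```python
# Sketch of the WARC post-processing idea from 18:37-18:44: walk the archive,
# collect every recorded response by URL, so the saved HTML snapshot can later
# be rewritten to embed or re-link its images and stylesheets.
# warcio is an assumption; wpull's non-standard snapshot URIs are not handled here.
from warcio.archiveiterator import ArchiveIterator

resources = {}  # url -> (content type, body bytes)

with open('capture.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        ctype = record.http_headers.get_header('Content-Type', '') if record.http_headers else ''
        resources[url] = (ctype, record.content_stream().read())

# A real tool would now parse the snapshot HTML and replace <img>, <link> and
# <script> references with data: URIs (or local files) built from `resources`,
# producing the kind of self-contained packet archive.today hands out.
print(len(resources), 'responses recovered')
```

Doing this as post-processing over the WARC leaves the original records untouched, so the same capture can yield both a self-contained static packet and the raw data for replay.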