[00:05] *** BlueMax has joined #archiveteam-bs
[00:13] *** Flashfire has joined #archiveteam-bs
[00:17] *** Mayeau is now known as Mayonaise
[00:25] *** coldice has joined #archiveteam-bs
[00:32] <JAA> coldice: Have a look at our wiki. It contains a wealth of information on archival.
[00:32] <JAA> Generally speaking, you'll want to archive websites in the WARC format, which preserves request and response entirely (including HTTP headers) and also contains relevant metadata.
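To make the point about WARC concrete, here is a minimal sketch of how one response record might be assembled by hand. The function name `build_warc_response_record` is hypothetical, and real tools such as wpull or warcprox also write request records, digests, and other metadata that this sketch leaves out.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes (status line, headers, body) in one WARC record.

    Illustrative only: a real crawler also stores the request and payload digests.
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # The header block ends with a blank line; the record ends with two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"
```

The HTTP headers travel inside the record block untouched, which is why a WARC can later be replayed with the original status code and content type.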
[00:34] <JAA> There are several tools and approaches to do this. The one we use most of the time (including through ArchiveBot and the warrior project) is a crawler like wpull or wget. This works pretty well for most sites. The major exception is websites that make heavy use of JavaScript.
[00:35] <coldice> So old websites from before 2010 are safe to wpull?
[00:36] <coldice> Anything else through PhantomJS or something?
[00:36] <JAA> Even modern sites might work fine with wpull. It really just depends on how the site is built.
[00:37] <JAA> If the site's browsable with JS disabled in the browser, then it will usually work fine with those crawlers.
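A rough way to see what a non-JavaScript crawler can discover is to extract only the plain anchor links from the served HTML. The sketch below uses the Python standard library; `LinkCollector` and `static_links` are hypothetical names, and actual crawlers such as wpull also follow images, stylesheets, and other resources.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip pseudo-links that only make sense with JavaScript enabled.
            if href and not href.startswith("javascript:"):
                self.links.append(urljoin(self.base_url, href))


def static_links(html, base_url):
    """Return the links visible without executing any scripts."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Navigation that exists only in click handlers never shows up in this list, which is the situation where a plain crawler misses content.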
[00:37] <JAA> PhantomJS doesn't work very well.
[00:38] <JAA> We don't really have a proper solution for JS-heavy websites yet. It's quite a tricky problem, especially when links aren't even real links, clicks get hijacked, etc.
[00:39] <JAA> You can always archive that stuff through a browser using a proxy that writes everything to WARC, e.g. warcprox. But that doesn't necessarily mean that it can also be played back later.
[00:40] <JAA> And it's not well automatable in the general case. So you typically need to write custom code for each such site you want to grab.
[00:43] <coldice> Alright, to get started I need https://github.com/ludios/grab-site right?
[00:44] <coldice> Unless I want to join the pool
[00:44] <JAA> Yeah, that's one way. grab-site is a wrapper around wpull to make it easier to use.
[00:45] <coldice> Btw, is there a list of archived websites? I can't seem to find it on the wiki
[00:45] <JAA> That would be a long list.
[00:46] <kiska> A very long list
[00:46] <kiska> From #archivebot Major: Job status: 95273 completed
[00:48] <coldice> So the data is archived, but not available? Am I missing something?
[00:48] <JAA> All our data is uploaded to the Internet Archive and included in the Wayback Machine.
[00:49] <JAA> https://archive.org/details/archiveteam
[01:02] *** sknebel has quit IRC (Quit: No Ping reply in 180 seconds.)
[01:05] *** sknebel has joined #archiveteam-bs
[01:44] *** BlueMax has quit IRC (Quit: Leaving)
[01:46] *** BlueMax has joined #archiveteam-bs
[02:08] <coldice> Thanks for your help JAA, my grabber is working fine. https://i.imgur.com/EmB3bQY.png - I got a few TB of storage, which should get me pretty far...
[02:09] <JAA> Happy to help. :-)
[02:10] *** Odd0002 has quit IRC (Read error: Operation timed out)
[02:17] *** Odd0002 has joined #archiveteam-bs
[02:55] <ivan> coldice: you can set up grab-site and an uploader to upload and remove WARCs before the crawls finish
[02:56] <ivan> the grab-site component is --finished-warc-dir= and the uploader can be something like https://gist.github.com/ivan/079530350ac94851d581b55b1d372440 for IA
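The upload-and-remove idea can be sketched as one pass over the directory that finished WARCs are moved into. This is not the linked gist: `upload_finished_warcs` and the `upload` callable are hypothetical, and the `*.warc.gz` file pattern is an assumption about how the finished files are named.

```python
from pathlib import Path
from typing import Callable, List


def upload_finished_warcs(finished_dir: Path, upload: Callable[[Path], bool]) -> List[str]:
    """Upload every finished WARC once and delete it only after success.

    `upload` is a caller-supplied function (hypothetical here) that returns
    True when the file has been stored safely on the remote side.
    """
    uploaded = []
    for warc in sorted(finished_dir.glob("*.warc.gz")):
        if upload(warc):
            warc.unlink()  # free local disk space only after a confirmed upload
            uploaded.append(warc.name)
    return uploaded
```

Running this periodically while the crawl continues keeps local disk usage bounded, since files that failed to upload stay in place for the next pass.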
[03:02] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[03:06] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:27] *** bitBaron has joined #archiveteam-bs
[03:45] <coldice> Anyone... I think my grab-site is running off... seeing a lot of requests to https://static.xx.fbcdn.net/rsrc.php/* - should that be in the ignore pattern?
[03:45] <FlashBack> All good
[03:45] <Flashfire> coldice: it's Facebook JavaScript crap
[03:45] <coldice> Whelp, a lot of it too
[03:45] <Flashfire> grabbing it does no harm at all, but feel free to ignore it as well
[03:46] <coldice> Is it possible for me to interact with the script too, like the IRC bot? Just command-line wise
[03:47] <Flashfire> No clue with grab-site
[03:58] <coldice> JAA, may I know what you use in custom scripts to scrape websites for archiving? Scrapy?
[04:23] *** ndiddy has quit IRC (Read error: Operation timed out)
[04:27] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[04:30] <Raccoon> I just saw the bot's link to a wiki of ISP Hosts.  Maybe somebody would similarly find this list interesting.  https://gist.github.com/a-raccoon/15c55e8d4048bb120b56
[04:38] *** faoling__ has joined #archiveteam-bs
[04:42] *** Pixi` has joined #archiveteam-bs
[04:44] *** faolingf_ has quit IRC (Ping timeout: 360 seconds)
[04:47] *** dxrt has quit IRC (Read error: Operation timed out)
[04:47] *** dxrt has joined #archiveteam-bs
[04:47] *** Atom-- has joined #archiveteam-bs
[04:48] *** Frogging has quit IRC (Read error: Operation timed out)
[04:48] *** Frogging has joined #archiveteam-bs
[04:48] *** twigfoot has quit IRC (Ping timeout: 360 seconds)
[04:48] *** Pixi has quit IRC (Read error: Operation timed out)
[04:49] *** underscor has quit IRC (Ping timeout: 360 seconds)
[04:49] *** underscor has joined #archiveteam-bs
[04:49] *** svchfoo1 sets mode: +o underscor
[04:50] *** arkiver has quit IRC (Read error: Operation timed out)
[04:50] *** superkuh has quit IRC (Excess Flood)
[04:51] *** twigfoot has joined #archiveteam-bs
[04:51] *** betamax_ has joined #archiveteam-bs
[04:52] *** swebb has quit IRC (Ping timeout: 360 seconds)
[04:52] *** Somebody2 has quit IRC (Ping timeout: 360 seconds)
[04:52] *** unlobito has quit IRC (Ping timeout: 360 seconds)
[04:52] *** unlobito has joined #archiveteam-bs
[04:52] *** swebb has joined #archiveteam-bs
[04:52] *** svchfoo1 sets mode: +o swebb
[04:53] *** Cameron_D has quit IRC (Read error: Operation timed out)
[04:53] *** sknebel_ has joined #archiveteam-bs
[04:54] *** arkiver has joined #archiveteam-bs
[04:54] *** Darkstar has quit IRC (Read error: Connection reset by peer)
[04:54] *** Cameron_D has joined #archiveteam-bs
[04:55] *** Somebody2 has joined #archiveteam-bs
[04:55] *** godane has quit IRC (Read error: Operation timed out)
[04:56] *** twigfoot has quit IRC (Read error: Operation timed out)
[04:56] *** betamax has quit IRC (Read error: Operation timed out)
[04:57] *** Atom has quit IRC (Read error: Operation timed out)
[04:57] *** godane has joined #archiveteam-bs
[04:58] *** svchfoo1 sets mode: +o godane
[04:58] *** Yurume has joined #archiveteam-bs
[04:59] *** astrid has quit IRC (Read error: Operation timed out)
[05:00] *** twigfoot has joined #archiveteam-bs
[05:00] *** Cameron_D has quit IRC (Ping timeout: 360 seconds)
[05:01] *** Cameron_D has joined #archiveteam-bs
[05:02] *** Somebody2 has quit IRC (Ping timeout: 360 seconds)
[05:02] *** phirephl- has quit IRC (Ping timeout: 360 seconds)
[05:02] <godane> SketchCow: any news?
[05:04] *** astrid has joined #archiveteam-bs
[05:04] *** swebb sets mode: +o astrid
[05:04] *** MrRadar has quit IRC (Read error: Operation timed out)
[05:05] *** Darkstar has joined #archiveteam-bs
[05:06] *** sknebel has quit IRC (Read error: Operation timed out)
[05:07] *** twigfoot has quit IRC (Read error: Operation timed out)
[05:07] *** twigfoot has joined #archiveteam-bs
[05:07] *** Yurume_ has quit IRC (Read error: Operation timed out)
[05:08] *** zino_ has quit IRC (Excess Flood)
[05:11] *** MrRadar has joined #archiveteam-bs
[05:11] *** superkuh has joined #archiveteam-bs
[05:12] *** phirephly has joined #archiveteam-bs
[05:12] *** Darkstar has quit IRC (Read error: Connection reset by peer)
[05:13] *** Somebody2 has joined #archiveteam-bs
[05:15] *** zino has joined #archiveteam-bs
[05:15] *** Darkstar has joined #archiveteam-bs
[05:25] <hook54321> JAA: Have we started grabbing XUL addons from addons.mozilla.org? The deadline is "early October, 2018"
[05:36] *** m007a83 has quit IRC (Fuck you Comcast)
[05:55] *** HCross has quit IRC (Ping timeout: 268 seconds)
[05:56] *** HCross has joined #archiveteam-bs
[05:56] *** HCross has quit IRC (Excess Flood)
[05:57] *** Yurume has quit IRC (Ping timeout: 268 seconds)
[05:57] *** TC04 has quit IRC (Ping timeout: 268 seconds)
[05:57] *** svchfoo1 has quit IRC (Ping timeout: 268 seconds)
[05:57] *** TC01 has joined #archiveteam-bs
[05:57] *** Yurume has joined #archiveteam-bs
[05:57] *** kiskabak2 has quit IRC (Ping timeout: 268 seconds)
[05:58] *** Kaz has quit IRC (Ping timeout: 268 seconds)
[06:02] *** betamax_ has quit IRC (Ping timeout: 268 seconds)
[06:02] *** betamax has joined #archiveteam-bs
[06:14] *** BlueMax has quit IRC (Quit: Leaving)
[06:26] *** dxrt_ has joined #archiveteam-bs
[06:28] *** sec^nd has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
[06:36] *** second has joined #archiveteam-bs
[06:43] *** BlueMax has joined #archiveteam-bs
[06:55] *** HCross has joined #archiveteam-bs
[07:01] *** erin has joined #archiveteam-bs
[07:55] *** svchfoo1 has joined #archiveteam-bs
[07:55] *** svchfoo3 sets mode: +o svchfoo1
[08:14] *** kiskabak2 has joined #archiveteam-bs
[08:30] *** coldice_ has joined #archiveteam-bs
[08:35] *** coldice has quit IRC (Read error: Operation timed out)
[09:31] *** BartoCH has quit IRC (Quit: WeeChat 2.2)
[09:31] *** BartoCH has joined #archiveteam-bs
[09:39] *** faoling__ is now known as faolingfa
[09:53] *** faolingfa has quit IRC (Leaving)
[10:20] <coldice_> Flashfire, yeah, the part about uploading the WARC files I'm not quite sure about yet
[10:21] <Flashfire> Yeah, I'm not so great with that. You're going to need to ask someone else for help, I'm sorry
[10:50] *** Kaz has joined #archiveteam-bs
[11:02] <JAA> coldice_: I'm using custom code on top of a modified version of aiohttp when it has to be fast and can easily be split up into individual work items. If I just want to do a recursive crawl, I use wpull.
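The "individual work items" approach can be sketched with plain asyncio and a concurrency limit. This is not JAA's code: `run_work_items` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP call (for example one made with aiohttp) so the sketch runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def run_work_items(
    items: Iterable[str],
    fetch: Callable[[str], Awaitable[bytes]],
    concurrency: int = 4,
) -> Dict[str, bytes]:
    """Process independent work items with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    results: Dict[str, bytes] = {}

    async def worker(item: str) -> None:
        async with semaphore:  # wait here while too many fetches are running
            results[item] = await fetch(item)

    await asyncio.gather(*(worker(item) for item in items))
    return results


async def fake_fetch(item: str) -> bytes:
    """Placeholder for a real HTTP request; returns a canned body."""
    await asyncio.sleep(0)
    return f"body of {item}".encode()
```

Because each item is independent, the list can be split across processes or machines, which is harder to do with a recursive crawl that discovers URLs as it goes.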
[11:02] *** SimpBrain has quit IRC (Read error: Operation timed out)
[11:03] <JAA> hook54321: The warrior project hasn't started yet, but arkiver said it should be ready soon. I grabbed all Firefox addons yesterday, and I'll grab the Thunderbird and SeaMonkey ones today. But I'm only grabbing the actual .xpi (and occasionally .zip) files, not the web page; the latter is also very important since it contains the description, screenshots, metadata, changelogs, license information, etc.
[11:03] <JAA> -> #outofammo
[12:14] *** Mateon1 has quit IRC (Read error: Operation timed out)
[12:16] *** Mateon1 has joined #archiveteam-bs
[12:19] *** TC01 has quit IRC (Read error: Operation timed out)
[12:23] *** TC01 has joined #archiveteam-bs
[12:38] *** chferfa has quit IRC ()
[13:02] *** coldice_ is now known as coldice
[13:04] <coldice> Oops, turns out the site I was crawling requires login to access the forum part.... anyone know how to get a crawler past a login page? :|
[13:25] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[13:47] *** m007a83 has joined #archiveteam-bs
[13:50] <coldice> Anyone, grab-site has hit a nationalgeographic URL and doesn't proceed... think it's stuck
[13:51] <coldice> can I stop and continue where it left off or something?
[14:08] *** wp494 has quit IRC (Read error: Operation timed out)
[14:09] *** wp494 has joined #archiveteam-bs
[14:11] *** Atom__ has joined #archiveteam-bs
[14:14] *** Atom-- has quit IRC (Read error: Operation timed out)
[14:20] <mr_archiv> @coldice, manually log in using a web browser, note the cookie(s) it sets and their values, and send them as part of each request with the web scraper you are using.
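The cookie approach can be illustrated with the Python standard library. The cookie names and values below are made up, `build_authenticated_request` is a hypothetical helper, and how cookies are passed to a particular crawler depends on that tool's own options.

```python
import urllib.request
from typing import Dict


def build_authenticated_request(url: str, cookies: Dict[str, str]) -> urllib.request.Request:
    """Attach cookies copied from a logged-in browser session to a request."""
    # A Cookie header is a list of name=value pairs separated by "; ".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": cookie_header})
```

Passing the result to `urllib.request.urlopen` would send the request as the logged-in user for as long as the session cookie stays valid on the server.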
[15:32] *** odemg has joined #archiveteam-bs
[15:45] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[16:14] *** zhongfu has joined #archiveteam-bs
[16:42] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[18:17] *** Mateon1 has quit IRC (Quit: Mateon1)
[18:18] *** Mateon1 has joined #archiveteam-bs
[19:03] <godane> so i got a beta player at Savers for $8
[19:03] <godane> i will have to see if it works but i did test in store and it does power one
[19:03] <godane> *on
[19:42] *** RichardG_ has quit IRC (Read error: Operation timed out)
[19:43] *** ndiddy has joined #archiveteam-bs
[19:59] <godane> so tape will not load
[19:59] <godane> figures
[20:05] <godane> i'm digitizing a tape called 'The Valley of Miracles'
[20:19] <godane> this is a vhs tape i bought from savers
[20:19] <godane> the only thing that i would think needs to be digitized, maybe
[20:34] <Raccoon> what sort of tapes do you like to digitize?
[20:36] <Raccoon> I have a bunch of VHS from our wildlife refuge I was about to toss, because my bitch cat peed in the box (destroying them with smell even if I cleaned them well).  But the cassettes themselves were undamaged.
[20:36] <Raccoon> they're either visitor education films or wildlife management and heavy machinery crew instructional videos.
[20:37] <Raccoon> Bobcat and John Deere brand training
[20:45] *** atluxity has quit IRC (Be the person your dog think you are.)
[20:56] <ivan> coldice: add nationalgeographic to ignores and raise the concurrency
[20:56] <ivan> coldice: it'll resume soon enough
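As a sketch of how ignore patterns act on a crawl queue, assume each ignore is a regular expression searched against the full URL (an assumption about the crawler, not a quote from its documentation). The two patterns below are illustrative guesses based on the URLs mentioned in this log, and `filter_ignored` is a hypothetical helper.

```python
import re
from typing import Iterable, List

# Illustrative patterns only; adjust them to the URLs actually seen in the crawl.
EXAMPLE_IGNORES = [
    r"^https?://static\.xx\.fbcdn\.net/rsrc\.php/",
    r"^https?://([^/]+\.)?nationalgeographic\.com/",
]


def filter_ignored(urls: Iterable[str], ignore_patterns: Iterable[str]) -> List[str]:
    """Return only the URLs that match none of the ignore patterns."""
    compiled = [re.compile(pattern) for pattern in ignore_patterns]
    return [url for url in urls if not any(rx.search(url) for rx in compiled)]
```

Escaping the dots and anchoring at the start keeps a pattern from accidentally matching unrelated URLs that merely contain the same text in a query string.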
[20:58] *** RichardG has joined #archiveteam-bs
[21:04] <godane> Raccoon: i'm not taking cat-peed-on tapes if possible
[21:04] <godane> at least i would like to see pictures of the tapes first
[21:05] <Raccoon> thought not :)  I mean, the tapes are clean, but the pretty case cover art can't be salvaged except for maybe a photograph
[21:06] <Raccoon> boring stuff about birds and sandhill cranes anyway
[21:06] <Raccoon> riogrande
[21:11] *** RichardG has quit IRC (Ping timeout: 246 seconds)
[21:29] <godane> i'm doing 'from boxing to ballet' tape
[21:29] <godane> it's pushing 10Mbits
[22:22] *** coldice has quit IRC (Read error: Operation timed out)
[23:13] *** BlueMax has joined #archiveteam-bs