[00:05] *** BlueMax has joined #archiveteam-bs
[00:13] *** Flashfire has joined #archiveteam-bs
[00:17] *** Mayeau is now known as Mayonaise
[00:25] *** coldice has joined #archiveteam-bs
[00:32] coldice: Have a look at our wiki. It contains a wealth of information on archival.
[00:32] Generally speaking, you'll want to archive websites in the WARC format, which preserves request and response entirely (including HTTP headers) and also contains relevant metadata.
[00:34] There are several tools and approaches to do this. The one we use most of the time (including through ArchiveBot and the warrior project) is a crawler like wpull or wget. This works pretty well for most sites. The major exception here is websites that make heavy use of JavaScript.
[00:35] So old websites from before 2010 are safe to wpull
[00:36] Anything else through PhantomJS or something?
[00:36] Even modern sites might work fine with wpull. It really just depends on how the site is built.
[00:37] If the site's browsable with JS disabled in the browser, then it will usually work fine with those crawlers.
[00:37] PhantomJS doesn't work very well.
[00:38] We don't really have a proper solution for JS-heavy websites yet. It's quite a tricky problem, especially when links aren't even real links, clicks get hijacked, etc.
[00:39] You can always archive that stuff through a browser using a proxy that writes everything to WARC, e.g. warcprox. But that doesn't necessarily mean that it can also be played back later.
[00:40] And it's not easily automatable in the general case. So you typically need to write custom code for each such site you want to grab.
[00:43] Alright, to get started I need https://github.com/ludios/grab-site right?
[00:44] Unless I want to join the pool
[00:44] Yeah, that's one way. grab-site is a wrapper around wpull to make it easier to use.
[00:45] Btw, is there a list of archived websites? I can't seem to find it on the wiki
[00:45] That would be a long list.
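To make the WARC description above (full response including HTTP headers, plus metadata) more concrete, here is a minimal sketch of what a single WARC `response` record looks like on disk. It is an illustration only and not how wpull or warcprox are implemented: real tools also write `warcinfo` and `request` records, payload digests, and usually gzip each record. The function name `warc_response_record` and the example URL are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(url: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes (status line, headers, body) in one WARC record.

    Hypothetical helper for illustration; real crawlers add more fields.
    """
    warc_headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", url),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    # Version line, then named fields, then a blank line before the block.
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in warc_headers)
    head += "\r\n"
    # Each record ends with two CRLF pairs after the content block.
    return head.encode("utf-8") + http_response + b"\r\n\r\n"


# The HTTP headers are stored verbatim alongside the body.
raw_http = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hello</body></html>"
)
record = warc_response_record("http://example.com/", raw_http)
print(record.decode("utf-8"))
```

Because the record holds the response exactly as received, a replay tool can later serve the same status line, headers, and body.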
[00:46] A very long list
[00:46] From #archivebot Major: Job status: 95273 completed
[00:48] So the data is archived, but not available? Am I missing something?
[00:48] All our data is uploaded to the Internet Archive and included in the Wayback Machine.
[00:49] https://archive.org/details/archiveteam
[01:02] *** sknebel has quit IRC (Quit: No Ping reply in 180 seconds.)
[01:05] *** sknebel has joined #archiveteam-bs
[01:44] *** BlueMax has quit IRC (Quit: Leaving)
[01:46] *** BlueMax has joined #archiveteam-bs
[02:08] Thanks for your help JAA, my grabber is working fine. https://i.imgur.com/EmB3bQY.png - I got a few TB of storage, which should get me pretty far....
[02:09] Happy to help. :-)
[02:10] *** Odd0002 has quit IRC (Read error: Operation timed out)
[02:17] *** Odd0002 has joined #archiveteam-bs
[02:55] coldice: you can set up grab-site and an uploader to upload and remove WARCs before the crawls finish
[02:56] the grab-site component is --finished-warc-dir= and the uploader can be something like https://gist.github.com/ivan/079530350ac94851d581b55b1d372440 for IA
[03:02] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[03:06] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:27] *** bitBaron has joined #archiveteam-bs
[03:45] Anyone.. think my grab-site is running off... seeing a lot of requests to https://static.xx.fbcdn.net/rsrc.php/* - should that be in the ignore pattern?
[03:45] All good
[03:45] coldice, it's Facebook JavaScript crap
[03:45] Whelp, a lot of it too
[03:45] grabbing it does no harm at all but feel free to ignore it as well
[03:46] Is it possible for me to interact with the script too like the IRC bot? Just command-line wise
[03:47] No clue with grab-site
[03:58] JAA, may I know what you use in custom scripts to scrape websites for archiving? Scrapy?
[04:23] *** ndiddy has quit IRC (Read error: Operation timed out)
[04:27] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[04:30] I just saw the bot's link to a wiki of ISP Hosts. Maybe somebody would similarly find this list interesting. https://gist.github.com/a-raccoon/15c55e8d4048bb120b56
[04:38] *** faoling__ has joined #archiveteam-bs
[04:42] *** Pixi` has joined #archiveteam-bs
[04:44] *** faolingf_ has quit IRC (Ping timeout: 360 seconds)
[04:47] *** dxrt has quit IRC (Read error: Operation timed out)
[04:47] *** dxrt has joined #archiveteam-bs
[04:47] *** Atom-- has joined #archiveteam-bs
[04:48] *** Frogging has quit IRC (Read error: Operation timed out)
[04:48] *** Frogging has joined #archiveteam-bs
[04:48] *** twigfoot has quit IRC (Ping timeout: 360 seconds)
[04:48] *** Pixi has quit IRC (Read error: Operation timed out)
[04:49] *** underscor has quit IRC (Ping timeout: 360 seconds)
[04:49] *** underscor has joined #archiveteam-bs
[04:49] *** svchfoo1 sets mode: +o underscor
[04:50] *** arkiver has quit IRC (Read error: Operation timed out)
[04:50] *** superkuh has quit IRC (Excess Flood)
[04:51] *** twigfoot has joined #archiveteam-bs
[04:51] *** betamax_ has joined #archiveteam-bs
[04:52] *** swebb has quit IRC (Ping timeout: 360 seconds)
[04:52] *** Somebody2 has quit IRC (Ping timeout: 360 seconds)
[04:52] *** unlobito has quit IRC (Ping timeout: 360 seconds)
[04:52] *** unlobito has joined #archiveteam-bs
[04:52] *** swebb has joined #archiveteam-bs
[04:52] *** svchfoo1 sets mode: +o swebb
[04:53] *** Cameron_D has quit IRC (Read error: Operation timed out)
[04:53] *** sknebel_ has joined #archiveteam-bs
[04:54] *** arkiver has joined #archiveteam-bs
[04:54] *** Darkstar has quit IRC (Read error: Connection reset by peer)
[04:54] *** Cameron_D has joined #archiveteam-bs
[04:55] *** Somebody2 has joined #archiveteam-bs
[04:55] *** godane has quit IRC (Read error: Operation timed out)
[04:56] *** twigfoot has quit IRC (Read error: Operation timed out)
[04:56] *** betamax has quit IRC (Read error: Operation timed out)
[04:57] *** Atom has quit IRC (Read error: Operation timed out)
[04:57] *** godane has joined #archiveteam-bs
[04:58] *** svchfoo1 sets mode: +o godane
[04:58] *** Yurume has joined #archiveteam-bs
[04:59] *** astrid has quit IRC (Read error: Operation timed out)
[05:00] *** twigfoot has joined #archiveteam-bs
[05:00] *** Cameron_D has quit IRC (Ping timeout: 360 seconds)
[05:01] *** Cameron_D has joined #archiveteam-bs
[05:02] *** Somebody2 has quit IRC (Ping timeout: 360 seconds)
[05:02] *** phirephl- has quit IRC (Ping timeout: 360 seconds)
[05:02] SketchCow: any news?
[05:04] *** astrid has joined #archiveteam-bs
[05:04] *** swebb sets mode: +o astrid
[05:04] *** MrRadar has quit IRC (Read error: Operation timed out)
[05:05] *** Darkstar has joined #archiveteam-bs
[05:06] *** sknebel has quit IRC (Read error: Operation timed out)
[05:07] *** twigfoot has quit IRC (Read error: Operation timed out)
[05:07] *** twigfoot has joined #archiveteam-bs
[05:07] *** Yurume_ has quit IRC (Read error: Operation timed out)
[05:08] *** zino_ has quit IRC (Excess Flood)
[05:11] *** MrRadar has joined #archiveteam-bs
[05:11] *** superkuh has joined #archiveteam-bs
[05:12] *** phirephly has joined #archiveteam-bs
[05:12] *** Darkstar has quit IRC (Read error: Connection reset by peer)
[05:13] *** Somebody2 has joined #archiveteam-bs
[05:15] *** zino has joined #archiveteam-bs
[05:15] *** Darkstar has joined #archiveteam-bs
[05:25] JAA: Have we started grabbing XUL addons from addons.mozilla.org? The deadline is "early October, 2018"
[05:36] *** m007a83 has quit IRC (Fuck you Comcast)
[05:55] *** HCross has quit IRC (Ping timeout: 268 seconds)
[05:56] *** HCross has joined #archiveteam-bs
[05:56] *** HCross has quit IRC (Excess Flood)
[05:57] *** Yurume has quit IRC (Ping timeout: 268 seconds)
[05:57] *** TC04 has quit IRC (Ping timeout: 268 seconds)
[05:57] *** svchfoo1 has quit IRC (Ping timeout: 268 seconds)
[05:57] *** TC01 has joined #archiveteam-bs
[05:57] *** Yurume has joined #archiveteam-bs
[05:57] *** kiskabak2 has quit IRC (Ping timeout: 268 seconds)
[05:58] *** Kaz has quit IRC (Ping timeout: 268 seconds)
[06:02] *** betamax_ has quit IRC (Ping timeout: 268 seconds)
[06:02] *** betamax has joined #archiveteam-bs
[06:14] *** BlueMax has quit IRC (Quit: Leaving)
[06:26] *** dxrt_ has joined #archiveteam-bs
[06:28] *** sec^nd has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
[06:36] *** second has joined #archiveteam-bs
[06:43] *** BlueMax has joined #archiveteam-bs
[06:55] *** HCross has joined #archiveteam-bs
[07:01] *** erin has joined #archiveteam-bs
[07:55] *** svchfoo1 has joined #archiveteam-bs
[07:55] *** svchfoo3 sets mode: +o svchfoo1
[08:14] *** kiskabak2 has joined #archiveteam-bs
[08:30] *** coldice_ has joined #archiveteam-bs
[08:35] *** coldice has quit IRC (Read error: Operation timed out)
[09:31] *** BartoCH has quit IRC (Quit: WeeChat 2.2)
[09:31] *** BartoCH has joined #archiveteam-bs
[09:39] *** faoling__ is now known as faolingfa
[09:53] *** faolingfa has quit IRC (Leaving)
[10:20] Flashfire, yeah, the part about uploading the WARC file I'm not quite sure about yet
[10:21] Yeah, I am not so great with that; you are going to need to ask someone else for help, I am sorry
[10:50] *** Kaz has joined #archiveteam-bs
[11:02] coldice_: I'm using custom code on top of a modified version of aiohttp when it has to be fast and can easily be split up into individual work items. If I just want to do a recursive crawl, I use wpull.
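The "individual work items" approach mentioned above can be sketched with a small asyncio worker pool. This is not the custom code referred to in the log, which is not shown. It uses only the standard library, and `fetch_item` is a hypothetical placeholder where a real HTTP request (for example via aiohttp) would go. The item names and the `concurrency` parameter are likewise assumptions.

```python
import asyncio


async def fetch_item(item: str) -> str:
    # Placeholder for a real HTTP request; hypothetical, no network access.
    await asyncio.sleep(0)
    return f"fetched:{item}"


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker pulls independent work items until it is cancelled.
    while True:
        item = await queue.get()
        try:
            results.append(await fetch_item(item))
        finally:
            queue.task_done()


async def run(items, concurrency: int = 4) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    results: list = []
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]
    await queue.join()  # wait until every queued item has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


# Work items here are made-up thread identifiers.
demo_results = asyncio.run(run([f"thread-{n}" for n in range(10)]))
print(len(demo_results), "items processed")
```

Splitting a site into items like this only works when each item can be fetched independently; a recursive crawl, where new URLs are discovered along the way, is the case the log hands to wpull.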
[11:02] *** SimpBrain has quit IRC (Read error: Operation timed out)
[11:03] hook54321: The warrior project isn't started yet, but arkiver said it should be ready soon. I grabbed all Firefox addons yesterday, and I'll grab the Thunderbird and Seamonkey ones today. But I'm only grabbing the actual .xpi (and occasionally .zip) files, not the web page; the latter is also very important since it contains description, screenshots, metadata, changelogs, license information, etc.
[11:03] -> #outofammo
[12:14] *** Mateon1 has quit IRC (Read error: Operation timed out)
[12:16] *** Mateon1 has joined #archiveteam-bs
[12:19] *** TC01 has quit IRC (Read error: Operation timed out)
[12:23] *** TC01 has joined #archiveteam-bs
[12:38] *** chferfa has quit IRC ()
[13:02] *** coldice_ is now known as coldice
[13:04] Oops, turns out the site I was crawling requires login to access the forum part.... anyone know how to parse a login site? :|
[13:25] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[13:47] *** m007a83 has joined #archiveteam-bs
[13:50] Anyone, grab-site has hit a nationalgeographic url and doesn't proceed... think it's stuck
[13:51] can I stop and continue where it left off or something?
[14:08] *** wp494 has quit IRC (Read error: Operation timed out)
[14:09] *** wp494 has joined #archiveteam-bs
[14:11] *** Atom__ has joined #archiveteam-bs
[14:14] *** Atom-- has quit IRC (Read error: Operation timed out)
[14:20] @coldice, manually log in using a web browser, note the cookie(s) it sets and their values, and send them as part of each request with the web scraper you are using.
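The cookie advice above can be sketched with the Python standard library: copy the session cookies from the browser after logging in, then attach them to every request the scraper makes. The cookie names, placeholder values, forum URL, and user agent below are all hypothetical; the real ones depend on the site being crawled.

```python
import urllib.request


def cookie_header(cookies: dict) -> str:
    # Serialise name/value pairs into a single Cookie request header value.
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_request(url: str, cookies: dict,
                  user_agent: str = "example-archiver/0.1") -> urllib.request.Request:
    # Attach the browser's session cookies so the site treats us as logged in.
    req = urllib.request.Request(url)
    req.add_header("Cookie", cookie_header(cookies))
    req.add_header("User-Agent", user_agent)
    return req


# Values copied by hand from the browser's developer tools after logging in.
SESSION_COOKIES = {
    "sessionid": "PASTE-VALUE-FROM-BROWSER",
    "csrftoken": "PASTE-VALUE-FROM-BROWSER",
}

request = build_request("https://forum.example.com/index.php", SESSION_COOKIES)
print(request.get_header("Cookie"))
# urllib.request.urlopen(request) would send it; left out so the sketch
# runs without network access.
```

Session cookies expire, so a long crawl may need the values refreshed partway through.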
[15:32] *** odemg has joined #archiveteam-bs
[15:45] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[16:14] *** zhongfu has joined #archiveteam-bs
[16:42] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[18:17] *** Mateon1 has quit IRC (Quit: Mateon1)
[18:18] *** Mateon1 has joined #archiveteam-bs
[19:03] so i got a beta player at Savers for $8
[19:03] i will have to see what works but i did test in store and it does power one
[19:03] *on
[19:42] *** RichardG_ has quit IRC (Read error: Operation timed out)
[19:43] *** ndiddy has joined #archiveteam-bs
[19:59] so tape will not load
[19:59] figures
[20:05] i'm digitizing a tape called 'The Valley of Miracles'
[20:19] this is a vhs tape i bought from savers
[20:19] the only thing that i would think needs to be digitized maybe
[20:34] what sort of tapes do you like to digitize?
[20:36] I have a bunch of VHS from our wildlife refuge I was about to toss, because my bitch cat peed in the box (destroying them with smell even if I cleaned them well). But the cassettes themselves were undamaged.
[20:36] they're either visitor education films or wildlife management and heavy machinery crew instructional videos.
[20:37] Bobcat and John Deere brand training
[20:45] *** atluxity has quit IRC (Be the person your dog think you are.)
[20:56] coldice: add nationalgeographic to ignores and raise the concurrency
[20:56] coldice: it'll resume soon enough
[20:58] *** RichardG has joined #archiveteam-bs
[21:04] Raccoon: i'm not taking cat-peed-on tapes if possible
[21:04] at least i would like to see pictures of the tapes first
[21:05] thought not :) I mean, the tapes are clean, but the pretty case cover art can't be salvaged except for maybe a photograph
[21:06] boring stuff about birds and sandhill cranes anyway
[21:06] riogrande
[21:11] *** RichardG has quit IRC (Ping timeout: 246 seconds)
[21:29] i'm doing 'from boxing to ballet' tape
[21:29] it's pushing 10Mbits
[22:22] *** coldice has quit IRC (Read error: Operation timed out)
[23:13] *** BlueMax has joined #archiveteam-bs
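The ignore advice in the log (the fbcdn resource URLs earlier, and nationalgeographic here) amounts to filtering URLs against patterns before they are fetched. A minimal sketch, assuming ignores behave like regular expressions searched against the full URL. The two patterns are illustrative guesses, the `nationalgeographic.com` domain is an assumption since the log only says "a nationalgeographic url", and `is_ignored` is a hypothetical helper, not part of grab-site.

```python
import re

# Hypothetical ignore patterns; adjust to the URLs actually seen in the crawl.
IGNORE_PATTERNS = [
    r"^https?://static\.xx\.fbcdn\.net/rsrc\.php/",
    r"^https?://([^/]+\.)?nationalgeographic\.com/",
]
COMPILED = [re.compile(pattern) for pattern in IGNORE_PATTERNS]


def is_ignored(url: str) -> bool:
    # A URL is skipped if any ignore pattern matches somewhere in it.
    return any(pattern.search(url) for pattern in COMPILED)


queue = [
    "https://static.xx.fbcdn.net/rsrc.php/v3/example.js",
    "https://www.nationalgeographic.com/some/page",
    "https://forum.example.com/thread/1",
]
to_fetch = [url for url in queue if not is_ignored(url)]
print(to_fetch)
```

Escaping the dots and anchoring at the scheme keeps a pattern from accidentally matching unrelated URLs that merely contain the same text.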