#archiveteam-bs 2018-09-15,Sat

Time Nickname Message
00:05 🔗 BlueMax has joined #archiveteam-bs
00:13 🔗 Flashfire has joined #archiveteam-bs
00:17 🔗 Mayeau is now known as Mayonaise
00:25 🔗 coldice has joined #archiveteam-bs
00:32 🔗 JAA coldice: Have a look at our wiki. It contains a wealth of information on archival.
00:32 🔗 JAA Generally speaking, you'll want to archive websites in the WARC format, which preserves request and response entirely (including HTTP headers) and also contains relevant metadata.
00:34 🔗 JAA There are several tools and approaches to do this. The one we use most of the time (including through ArchiveBot and the warrior project) is a crawler like wpull or wget. This works pretty well for most sites. The major exception is websites that make heavy use of JavaScript.
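
To make the WARC description above concrete, here is a minimal sketch of how one WARC/1.0 "response" record is laid out: a version line, named header fields, a blank line, the raw HTTP response (headers included), and a two-CRLF terminator. The function name `build_warc_record` and the example.com URL are made up for illustration. Real crawlers such as wpull also write request and warcinfo records, payload digests, and per-record gzip compression, all of which are omitted here.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type: str, target_uri: str, block: bytes,
                      content_type: str) -> bytes:
    """Serialize one WARC/1.0 record: header lines, blank line, block, two CRLFs."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    return head.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"


# The block of a "response" record is the raw HTTP response, headers included.
http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)
record = build_warc_record(
    "response", "http://example.com/", http_response,
    "application/http; msgtype=response",
)
```

Because the HTTP status line and headers sit inside the record block, a replay tool can reproduce the original response instead of just the page body.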
00:35 🔗 coldice So old websites from before 2010 are safe to wpull?
00:36 🔗 coldice Anything else through PhantomJS or something?
00:36 🔗 JAA Even modern sites might work fine with wpull. It really just depends on how the site is built.
00:37 🔗 JAA If the site's browsable with JS disabled in the browser, then it will usually work fine with those crawlers.
00:37 🔗 JAA PhantomJS doesn't work very well.
00:38 🔗 JAA We don't really have a proper solution for JS-heavy websites yet. It's quite a tricky problem, especially when links aren't even real links, clicks get hijacked, etc.
00:39 🔗 JAA You can always archive that stuff through a browser using a proxy that writes everything to WARC, e.g. warcprox. But that doesn't necessarily mean that it can also be played back later.
00:40 🔗 JAA And it's not well automatable in the general case. So you typically need to write custom code for each such site you want to grab.
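
As a rough illustration of the "links aren't even real links" problem described above, the sketch below collects only the anchors that a non-JavaScript crawler could follow. It is a simplified heuristic, not how wpull or wget actually extract links, and the names `LinkCollector` and `crawlable_links` are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping JS-only pseudo-links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Anchors without a usable href only work when JavaScript runs.
        if href and not href.lower().startswith(("javascript:", "#")):
            self.links.append(href)


def crawlable_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


sample = """
<a href="/forum/page2">next</a>
<a href="javascript:void(0)" onclick="load(3)">more</a>
<a href="#">menu</a>
<a onclick="go()">no href</a>
"""
print(crawlable_links(sample))  # prints ['/forum/page2']
```

If a page yields few or no such links while the browser shows plenty of navigation, the site probably needs JavaScript, and a plain recursive crawl will miss most of it.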
00:43 🔗 coldice Alright, to get started I need https://github.com/ludios/grab-site right?
00:44 🔗 coldice Unless I want to join the pool
00:44 🔗 JAA Yeah, that's one way. grab-site is a wrapper around wpull to make it easier to use.
00:45 🔗 coldice Btw, is there a list of archived websites? I can't seem to find it on the wiki
00:45 🔗 JAA That would be a long list.
00:46 🔗 kiska A very long list
00:46 🔗 kiska From #archivebot Major: Job status: 95273 completed
00:48 🔗 coldice So the data is archived, but not available? Am I missing something?
00:48 🔗 JAA All our data is uploaded to the Internet Archive and included in the Wayback Machine.
00:49 🔗 JAA https://archive.org/details/archiveteam
01:02 🔗 sknebel has quit IRC (Quit: No Ping reply in 180 seconds.)
01:05 🔗 sknebel has joined #archiveteam-bs
01:44 🔗 BlueMax has quit IRC (Quit: Leaving)
01:46 🔗 BlueMax has joined #archiveteam-bs
02:08 🔗 coldice Thanks for your help JAA, my grabber is working fine. https://i.imgur.com/EmB3bQY.png - I got a few TB of storage, which should get me pretty far....
02:09 🔗 JAA Happy to help. :-)
02:10 🔗 Odd0002 has quit IRC (Read error: Operation timed out)
02:17 🔗 Odd0002 has joined #archiveteam-bs
02:55 🔗 ivan coldice: you can set up grab-site and an uploader to upload and remove WARCs before the crawls finish
02:56 🔗 ivan the grab-site component is --finished-warc-dir= and the uploader can be something like https://gist.github.com/ivan/079530350ac94851d581b55b1d372440 for IA
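
The upload-and-remove idea above could be sketched as follows. This is not the script from ivan's gist: `drain_finished_warcs` is a hypothetical helper, the `upload` callable stands in for whatever actually sends a file to IA, and the `*.warc.gz` file naming is an assumption about what ends up in the `--finished-warc-dir=` directory.

```python
import tempfile
from pathlib import Path
from typing import Callable


def drain_finished_warcs(finished_dir: Path, upload: Callable[[Path], bool]) -> list:
    """Upload every finished WARC once; delete it only after upload succeeds."""
    uploaded = []
    for warc in sorted(finished_dir.glob("*.warc.gz")):
        if upload(warc):  # upload() returns True on success (assumed contract)
            warc.unlink()
            uploaded.append(warc.name)
    return uploaded


# Demo with a fake uploader that fails for one file (no network involved).
with tempfile.TemporaryDirectory() as tmp:
    finished = Path(tmp)
    (finished / "a.warc.gz").write_bytes(b"x")
    (finished / "b.warc.gz").write_bytes(b"y")
    done = drain_finished_warcs(finished, lambda p: p.name != "b.warc.gz")
    left = sorted(p.name for p in finished.iterdir())
```

Deleting only after a confirmed upload means a failed transfer leaves the file in place for the next pass, so running this in a loop with a sleep between passes keeps disk usage bounded without losing data.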
03:02 🔗 bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
03:06 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
03:27 🔗 bitBaron has joined #archiveteam-bs
03:45 🔗 coldice Anyone... I think my grab-site is running off... seeing a lot of requests to https://static.xx.fbcdn.net/rsrc.php/* - should that be in the ignore pattern?
03:45 🔗 FlashBack All good
03:45 🔗 Flashfire coldice: it's Facebook JavaScript crap
03:45 🔗 coldice Whelp, a lot of it too
03:45 🔗 Flashfire grabbing it does no harm at all, but feel free to ignore it as well
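
For the ignore pattern asked about above, here is a small sketch of what such a pattern might look like and how it can be checked against sample URLs before use. It assumes ignores are regular expressions searched against the full URL; the `should_ignore` helper is purely illustrative and not part of grab-site.

```python
import re

# Candidate ignore pattern for the Facebook static resource URLs seen in the crawl.
IGNORE_PATTERNS = [
    r"^https?://static\.xx\.fbcdn\.net/rsrc\.php/",
]
_compiled = [re.compile(pattern) for pattern in IGNORE_PATTERNS]


def should_ignore(url: str) -> bool:
    """Return True if any ignore pattern matches somewhere in the URL."""
    return any(pattern.search(url) for pattern in _compiled)


checks = {
    "https://static.xx.fbcdn.net/rsrc.php/v3/abc/l/0,cross/x.css": should_ignore(
        "https://static.xx.fbcdn.net/rsrc.php/v3/abc/l/0,cross/x.css"
    ),
    "https://example.com/forum/thread/1": should_ignore(
        "https://example.com/forum/thread/1"
    ),
}
```

Escaping the dots matters: an unescaped `.` matches any character, so a loose pattern could end up ignoring URLs that were never intended.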
03:46 🔗 coldice Is it possible for me to interact with the script too, like with the IRC bot? Just command-line wise
03:47 🔗 Flashfire No clue with grab-site
03:58 🔗 coldice JAA, may I know what you use in custom scripts to scrape websites for archiving? Scrapy?
04:23 🔗 ndiddy has quit IRC (Read error: Operation timed out)
04:27 🔗 bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
04:30 🔗 Raccoon I just saw the bot's link to a wiki of ISP Hosts. Maybe somebody would similarly find this list interesting. https://gist.github.com/a-raccoon/15c55e8d4048bb120b56
04:38 🔗 faoling__ has joined #archiveteam-bs
04:42 🔗 Pixi` has joined #archiveteam-bs
04:44 🔗 faolingf_ has quit IRC (Ping timeout: 360 seconds)
04:47 🔗 dxrt has quit IRC (Read error: Operation timed out)
04:47 🔗 dxrt has joined #archiveteam-bs
04:47 🔗 Atom-- has joined #archiveteam-bs
04:48 🔗 Frogging has quit IRC (Read error: Operation timed out)
04:48 🔗 Frogging has joined #archiveteam-bs
04:48 🔗 twigfoot has quit IRC (Ping timeout: 360 seconds)
04:48 🔗 Pixi has quit IRC (Read error: Operation timed out)
04:49 🔗 underscor has quit IRC (Ping timeout: 360 seconds)
04:49 🔗 underscor has joined #archiveteam-bs
04:49 🔗 svchfoo1 sets mode: +o underscor
04:50 🔗 arkiver has quit IRC (Read error: Operation timed out)
04:50 🔗 superkuh has quit IRC (Excess Flood)
04:51 🔗 twigfoot has joined #archiveteam-bs
04:51 🔗 betamax_ has joined #archiveteam-bs
04:52 🔗 swebb has quit IRC (Ping timeout: 360 seconds)
04:52 🔗 Somebody2 has quit IRC (Ping timeout: 360 seconds)
04:52 🔗 unlobito has quit IRC (Ping timeout: 360 seconds)
04:52 🔗 unlobito has joined #archiveteam-bs
04:52 🔗 swebb has joined #archiveteam-bs
04:52 🔗 svchfoo1 sets mode: +o swebb
04:53 🔗 Cameron_D has quit IRC (Read error: Operation timed out)
04:53 🔗 sknebel_ has joined #archiveteam-bs
04:54 🔗 arkiver has joined #archiveteam-bs
04:54 🔗 Darkstar has quit IRC (Read error: Connection reset by peer)
04:54 🔗 Cameron_D has joined #archiveteam-bs
04:55 🔗 Somebody2 has joined #archiveteam-bs
04:55 🔗 godane has quit IRC (Read error: Operation timed out)
04:56 🔗 twigfoot has quit IRC (Read error: Operation timed out)
04:56 🔗 betamax has quit IRC (Read error: Operation timed out)
04:57 🔗 Atom has quit IRC (Read error: Operation timed out)
04:57 🔗 godane has joined #archiveteam-bs
04:58 🔗 svchfoo1 sets mode: +o godane
04:58 🔗 Yurume has joined #archiveteam-bs
04:59 🔗 astrid has quit IRC (Read error: Operation timed out)
05:00 🔗 twigfoot has joined #archiveteam-bs
05:00 🔗 Cameron_D has quit IRC (Ping timeout: 360 seconds)
05:01 🔗 Cameron_D has joined #archiveteam-bs
05:02 🔗 Somebody2 has quit IRC (Ping timeout: 360 seconds)
05:02 🔗 phirephl- has quit IRC (Ping timeout: 360 seconds)
05:02 🔗 godane SketchCow: any news?
05:04 🔗 astrid has joined #archiveteam-bs
05:04 🔗 swebb sets mode: +o astrid
05:04 🔗 MrRadar has quit IRC (Read error: Operation timed out)
05:05 🔗 Darkstar has joined #archiveteam-bs
05:06 🔗 sknebel has quit IRC (Read error: Operation timed out)
05:07 🔗 twigfoot has quit IRC (Read error: Operation timed out)
05:07 🔗 twigfoot has joined #archiveteam-bs
05:07 🔗 Yurume_ has quit IRC (Read error: Operation timed out)
05:08 🔗 zino_ has quit IRC (Excess Flood)
05:11 🔗 MrRadar has joined #archiveteam-bs
05:11 🔗 superkuh has joined #archiveteam-bs
05:12 🔗 phirephly has joined #archiveteam-bs
05:12 🔗 Darkstar has quit IRC (Read error: Connection reset by peer)
05:13 🔗 Somebody2 has joined #archiveteam-bs
05:15 🔗 zino has joined #archiveteam-bs
05:15 🔗 Darkstar has joined #archiveteam-bs
05:25 🔗 hook54321 JAA: Have we started grabbing XUL addons from addons.mozilla.org? The deadline is "early October, 2018"
05:36 🔗 m007a83 has quit IRC (Fuck you Comcast)
05:55 🔗 HCross has quit IRC (Ping timeout: 268 seconds)
05:56 🔗 HCross has joined #archiveteam-bs
05:56 🔗 HCross has quit IRC (Excess Flood)
05:57 🔗 Yurume has quit IRC (Ping timeout: 268 seconds)
05:57 🔗 TC04 has quit IRC (Ping timeout: 268 seconds)
05:57 🔗 svchfoo1 has quit IRC (Ping timeout: 268 seconds)
05:57 🔗 TC01 has joined #archiveteam-bs
05:57 🔗 Yurume has joined #archiveteam-bs
05:57 🔗 kiskabak2 has quit IRC (Ping timeout: 268 seconds)
05:58 🔗 Kaz has quit IRC (Ping timeout: 268 seconds)
06:02 🔗 betamax_ has quit IRC (Ping timeout: 268 seconds)
06:02 🔗 betamax has joined #archiveteam-bs
06:14 🔗 BlueMax has quit IRC (Quit: Leaving)
06:26 🔗 dxrt_ has joined #archiveteam-bs
06:28 🔗 sec^nd has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
06:36 🔗 second has joined #archiveteam-bs
06:43 🔗 BlueMax has joined #archiveteam-bs
06:55 🔗 HCross has joined #archiveteam-bs
07:01 🔗 erin has joined #archiveteam-bs
07:55 🔗 svchfoo1 has joined #archiveteam-bs
07:55 🔗 svchfoo3 sets mode: +o svchfoo1
08:14 🔗 kiskabak2 has joined #archiveteam-bs
08:30 🔗 coldice_ has joined #archiveteam-bs
08:35 🔗 coldice has quit IRC (Read error: Operation timed out)
09:31 🔗 BartoCH has quit IRC (Quit: WeeChat 2.2)
09:31 🔗 BartoCH has joined #archiveteam-bs
09:39 🔗 faoling__ is now known as faolingfa
09:53 🔗 faolingfa has quit IRC (Leaving)
10:20 🔗 coldice_ Flashfire: Yeah, I'm not quite sure yet about the part where you upload the WARC file
10:21 🔗 Flashfire Yeah, I am not so great with that. You are going to need to ask someone else for help, I am sorry
10:50 🔗 Kaz has joined #archiveteam-bs
11:02 🔗 JAA coldice_: I'm using custom code on top of a modified version of aiohttp when it has to be fast and can easily be split up into individual work items. If I just want to do a recursive crawl, I use wpull.
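
The "individual work items" approach above can be illustrated with a small asyncio worker pool. This is not JAA's code: it uses only the standard library, `fetch_item` is a stand-in for the real HTTP request (which would use aiohttp in the setup described), and the item names are invented.

```python
import asyncio


async def fetch_item(item: str) -> str:
    """Stand-in for the real HTTP request (hypothetical; no network here)."""
    await asyncio.sleep(0)
    return f"fetched:{item}"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull work items off the queue until cancelled."""
    while True:
        item = await queue.get()
        try:
            results[item] = await fetch_item(item)
        finally:
            queue.task_done()


async def run_items(items, concurrency: int = 4) -> dict:
    queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    results = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every item has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(run_items([f"page-{n}" for n in range(10)]))
```

Since each item is independent, throughput scales with `concurrency` until the target server or the network becomes the limit, which is why this pattern suits jobs that can be enumerated up front rather than discovered by recursion.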
11:02 🔗 SimpBrain has quit IRC (Read error: Operation timed out)
11:03 🔗 JAA hook54321: The warrior project isn't started yet, but arkiver said it should be ready soon. I grabbed all Firefox addons yesterday, and I'll grab the Thunderbird and Seamonkey ones today. But I'm only grabbing the actual .xpi (and occasionally .zip) files, not the web page; the latter is also very important since it contains description, screenshots, metadata, changelogs, license information, etc.
11:03 🔗 JAA -> #outofammo
12:14 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
12:16 🔗 Mateon1 has joined #archiveteam-bs
12:19 🔗 TC01 has quit IRC (Read error: Operation timed out)
12:23 🔗 TC01 has joined #archiveteam-bs
12:38 🔗 chferfa has quit IRC ()
13:02 🔗 coldice_ is now known as coldice
13:04 🔗 coldice Oops, turns out the site I was crawling requires a login to access the forum part.... anyone know how to get a crawler past a login page? :|
13:25 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
13:47 🔗 m007a83 has joined #archiveteam-bs
13:50 🔗 coldice Anyone, grab-site has hit a nationalgeographic URL and doesn't proceed... think it's stuck
13:51 🔗 coldice can I stop and continue where it left off or something?
14:08 🔗 wp494 has quit IRC (Read error: Operation timed out)
14:09 🔗 wp494 has joined #archiveteam-bs
14:11 🔗 Atom__ has joined #archiveteam-bs
14:14 🔗 Atom-- has quit IRC (Read error: Operation timed out)
14:20 🔗 mr_archiv @coldice, log in manually using a web browser, note the cookie(s) it sets and their values, and send them as part of each request with the web scraper you are using.
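
The cookie approach above amounts to attaching a `Cookie` header to every request. Here is a minimal standard-library sketch; the forum URL and the cookie names `session_id` and `logged_in` are placeholders, and the real names and values have to be copied from the browser after logging in. Crawlers usually offer their own way to supply cookies (wget, for instance, has `--load-cookies`), so this only shows what ends up on the wire.

```python
from urllib.request import Request


def request_with_cookies(url: str, cookies: dict) -> Request:
    """Build a request that carries the given cookies in a Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    req = Request(url)
    req.add_header("Cookie", cookie_header)
    return req


# Placeholder values: replace with the cookies your browser received at login.
req = request_with_cookies(
    "https://forum.example.com/index.php",
    {"session_id": "PASTE_VALUE_FROM_BROWSER", "logged_in": "1"},
)
```

Session cookies expire, so a long crawl may start receiving login pages again partway through; spot-checking a few archived responses is worthwhile.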
15:32 🔗 odemg has joined #archiveteam-bs
15:45 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
16:14 🔗 zhongfu has joined #archiveteam-bs
16:42 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
18:17 🔗 Mateon1 has quit IRC (Quit: Mateon1)
18:18 🔗 Mateon1 has joined #archiveteam-bs
19:03 🔗 godane so i got a beta player at Savers for $8
19:03 🔗 godane i will have to see if it works but i did test in store and it does power one
19:03 🔗 godane *on
19:42 🔗 RichardG_ has quit IRC (Read error: Operation timed out)
19:43 🔗 ndiddy has joined #archiveteam-bs
19:59 🔗 godane so tape will not load
19:59 🔗 godane figures
20:05 🔗 godane i'm digitizing a tape called 'The Valley of Miracles'
20:19 🔗 godane this is a vhs tape i bought from savers
20:19 🔗 godane maybe the only thing that i would think needs to be digitized
20:34 🔗 Raccoon what sort of tapes do you like to digitize?
20:36 🔗 Raccoon I have a bunch of VHS from our wildlife refuge I was about to toss, because my bitch cat peed in the box (destroying them with smell even if I cleaned them well). But the cassettes themselves were undamaged.
20:36 🔗 Raccoon they're either visitor education films or wildlife management and heavy machinery crew instructional videos.
20:37 🔗 Raccoon Bobcat and John Deere brand training
20:45 🔗 atluxity has quit IRC (Be the person your dog think you are.)
20:56 🔗 ivan coldice: add nationalgeographic to ignores and raise the concurrency
20:56 🔗 ivan coldice: it'll resume soon enough
20:58 🔗 RichardG has joined #archiveteam-bs
21:04 🔗 godane Raccoon: i'm not taking cat-peed-on tapes if possible
21:04 🔗 godane at least i would like to see pictures of the tapes first
21:05 🔗 Raccoon thought not :) I mean, the tapes are clean, but the pretty case cover art can't be salvaged except for maybe a photograph
21:06 🔗 Raccoon boring stuff about birds and sandhill cranes anyway
21:06 🔗 Raccoon riogrande
21:11 🔗 RichardG has quit IRC (Ping timeout: 246 seconds)
21:29 🔗 godane i'm doing 'from boxing to ballet' tape
21:29 🔗 godane its pushing 10Mbits
22:22 🔗 coldice has quit IRC (Read error: Operation timed out)
23:13 🔗 BlueMax has joined #archiveteam-bs
