Time |
Nickname |
Message |
00:05
🔗
|
|
BlueMax has joined #archiveteam-bs |
00:13
🔗
|
|
Flashfire has joined #archiveteam-bs |
00:17
🔗
|
|
Mayeau is now known as Mayonaise |
00:25
🔗
|
|
coldice has joined #archiveteam-bs |
00:32
🔗
|
JAA |
coldice: Have a look at our wiki. It contains a wealth of information on archival. |
00:32
🔗
|
JAA |
Generally speaking, you'll want to archive websites in the WARC format, which preserves request and response entirely (including HTTP headers) and also contains relevant metadata. |
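To make the "request and response plus metadata" idea concrete, here is a minimal sketch of how a single WARC `response` record could be serialized using only the Python standard library. It is a simplified illustration, not what wpull or wget actually run: the function name `build_warc_response_record` and the sample response are made up for this example, and real tools also write `warcinfo` and `request` records, payload digests, and gzip compression.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one simplified WARC/1.0 'response' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http;msgtype=response"),
        # Length of the record block, i.e. the raw HTTP response bytes.
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The block is stored verbatim (status line and HTTP headers included),
    # followed by the two CRLFs that terminate a record.
    return head.encode("utf-8") + http_response + b"\r\n\r\n"


# A made-up HTTP response, exactly as it would arrive on the wire.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)
record = build_warc_response_record("http://example.com/", raw_response)
print(record.decode("utf-8"))
```

The point of the format is visible in the output: the HTTP status line and headers survive untouched inside the record, next to the capture date and target URL.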
00:34
🔗
|
JAA |
There are several tools and approaches to do this. The one we use most of the time (including through ArchiveBot and the warrior project) is a crawler like wpull or wget. This works pretty well for most sites. The major exception is websites that make heavy use of JavaScript. |
00:35
🔗
|
coldice |
So old websites from before 2010 are safe to wpull? |
00:36
🔗
|
coldice |
Anything else through PhantomJS or something? |
00:36
🔗
|
JAA |
Even modern sites might work fine with wpull. It really just depends on how the site is built. |
00:37
🔗
|
JAA |
If the site's browsable with JS disabled in the browser, then it will usually work fine with those crawlers. |
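A rough way to approximate the "browsable with JS disabled" test without opening a browser is to look at the raw HTML and count how many links are real `href` targets versus script-driven placeholders. The sketch below assumes that heuristic; the class name `LinkCollector`, the function `summarize_links`, and the sample markup are invented for illustration and are not part of wpull or grab-site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects <a> tags, separating real hrefs from script-only links."""

    def __init__(self):
        super().__init__()
        self.crawlable = []   # hrefs a non-JS crawler could follow
        self.script_only = 0  # anchors that only work with JavaScript

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href and not href.startswith(("#", "javascript:")):
            self.crawlable.append(href)
        else:
            self.script_only += 1


def summarize_links(html: str):
    parser = LinkCollector()
    parser.feed(html)
    return parser.crawlable, parser.script_only


# Made-up markup: one real link and three links that need JavaScript.
sample = """
<a href="/forum/thread-1.html">Thread 1</a>
<a href="#" onclick="loadThread(2)">Thread 2</a>
<a href="javascript:void(0)">Thread 3</a>
<a onclick="loadThread(4)">Thread 4</a>
"""
links, script_only = summarize_links(sample)
print(links, script_only)
```

If most anchors fall into the script-only bucket, a plain recursive crawl will probably miss large parts of the site.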
00:37
🔗
|
JAA |
PhantomJS doesn't work very well. |
00:38
🔗
|
JAA |
We don't really have a proper solution for JS-heavy websites yet. It's quite a tricky problem, especially when links aren't even real links, clicks get hijacked, etc. |
00:39
🔗
|
JAA |
You can always archive that stuff through a browser using a proxy that writes everything to WARC, e.g. warcprox. But that doesn't necessarily mean that it can also be played back later. |
00:40
🔗
|
JAA |
And it's not well automatable in the general case. So you typically need to write custom code for each such site you want to grab. |
00:43
🔗
|
coldice |
Alright, to get started I need https://github.com/ludios/grab-site right? |
00:44
🔗
|
coldice |
Unless I want to join the pool |
00:44
🔗
|
JAA |
Yeah, that's one way. grab-site is a wrapper around wpull to make it easier to use. |
00:45
🔗
|
coldice |
Btw, is there a list of archived websites? I can't seem to find it on the wiki |
00:45
🔗
|
JAA |
That would be a long list. |
00:46
🔗
|
kiska |
A very long list |
00:46
🔗
|
kiska |
From #archivebot Major: Job status: 95273 completed |
00:48
🔗
|
coldice |
So the data is archived, but not available? Am I missing something? |
00:48
🔗
|
JAA |
All our data is uploaded to the Internet Archive and included in the Wayback Machine. |
00:49
🔗
|
JAA |
https://archive.org/details/archiveteam |
01:02
🔗
|
|
sknebel has quit IRC (Quit: No Ping reply in 180 seconds.) |
01:05
🔗
|
|
sknebel has joined #archiveteam-bs |
01:44
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
01:46
🔗
|
|
BlueMax has joined #archiveteam-bs |
02:08
🔗
|
coldice |
Thanks for your help JAA, my grabber is working fine. https://i.imgur.com/EmB3bQY.png - I got a few TB of storage, which should get me pretty far.... |
02:09
🔗
|
JAA |
Happy to help. :-) |
02:10
🔗
|
|
Odd0002 has quit IRC (Read error: Operation timed out) |
02:17
🔗
|
|
Odd0002 has joined #archiveteam-bs |
02:55
🔗
|
ivan |
coldice: you can set up grab-site and an uploader to upload and remove WARCs before the crawls finish |
02:56
🔗
|
ivan |
the grab-site component is --finished-warc-dir= and the uploader can be something like https://gist.github.com/ivan/079530350ac94851d581b55b1d372440 for IA |
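The upload-and-remove idea can be sketched as a small loop over the directory that `--finished-warc-dir=` points at. This is not the linked gist: the function `upload_finished_warcs`, the `*.warc.gz` naming, and the `fake_upload` callback are assumptions for illustration, and a real uploader would call the Internet Archive instead of the stand-in used here.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def upload_finished_warcs(finished_dir: Path, upload: Callable[[Path], bool]) -> List[str]:
    """Upload every finished WARC and delete it only after a successful upload."""
    removed = []
    for warc in sorted(finished_dir.glob("*.warc.gz")):
        if upload(warc):
            warc.unlink()  # free disk space while the crawl keeps running
            removed.append(warc.name)
    return removed


def fake_upload(path: Path) -> bool:
    """Stand-in for a real uploader: 'succeeds' only for non-empty files."""
    return path.stat().st_size > 0


# Demonstration against a temporary directory instead of a live crawl.
with tempfile.TemporaryDirectory() as tmp:
    finished = Path(tmp)
    (finished / "site-00000.warc.gz").write_bytes(b"data")
    (finished / "site-00001.warc.gz").write_bytes(b"")  # the fake upload rejects this one
    removed = upload_finished_warcs(finished, fake_upload)
    remaining = sorted(p.name for p in finished.iterdir())

print(removed, remaining)
```

Deleting only after the upload reports success means a failed upload leaves the file in place for the next pass.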
03:02
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…) |
03:06
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
03:27
🔗
|
|
bitBaron has joined #archiveteam-bs |
03:45
🔗
|
coldice |
Anyone? I think my grab-site is running off... I'm seeing a lot of requests to https://static.xx.fbcdn.net/rsrc.php/* - should that be in the ignore pattern? |
03:45
🔗
|
FlashBack |
All good |
03:45
🔗
|
Flashfire |
coldice: it's Facebook JavaScript crap |
03:45
🔗
|
coldice |
Whelp, a lot of it too |
03:45
🔗
|
Flashfire |
grabbing it does no harm at all, but feel free to ignore it as well |
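For anyone who does want to skip those URLs, a quick way to check a candidate pattern before adding it is to try it against a few sample URLs. The sketch below assumes that ignore patterns behave like regular expressions searched against the full URL; the list `IGNORE_PATTERNS` and the helper `is_ignored` are hypothetical names for this example and not grab-site code.

```python
import re

# Candidate pattern for the Facebook static resources mentioned above.
IGNORE_PATTERNS = [r"^https?://static\.xx\.fbcdn\.net/rsrc\.php/"]


def is_ignored(url: str, patterns=IGNORE_PATTERNS) -> bool:
    """Return True if any pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


print(is_ignored("https://static.xx.fbcdn.net/rsrc.php/v3/y4/r/abc.js"))
print(is_ignored("https://example.com/forum/index.html"))
```

Escaping the dots keeps the pattern from accidentally matching unrelated hosts.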
03:46
🔗
|
coldice |
Is it possible for me to interact with the script too, like with the IRC bot? Just command-line wise |
03:47
🔗
|
Flashfire |
No clue with grab-site |
03:58
🔗
|
coldice |
JAA, may I know what you use in custom scripts to scrape websites for archiving? Scrapy? |
04:23
🔗
|
|
ndiddy has quit IRC (Read error: Operation timed out) |
04:27
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…) |
04:30
🔗
|
Raccoon |
I just saw the bot's link to a wiki of ISP Hosts. Maybe somebody would similarly find this list interesting. https://gist.github.com/a-raccoon/15c55e8d4048bb120b56 |
04:38
🔗
|
|
faoling__ has joined #archiveteam-bs |
04:42
🔗
|
|
Pixi` has joined #archiveteam-bs |
04:44
🔗
|
|
faolingf_ has quit IRC (Ping timeout: 360 seconds) |
04:47
🔗
|
|
dxrt has quit IRC (Read error: Operation timed out) |
04:47
🔗
|
|
dxrt has joined #archiveteam-bs |
04:47
🔗
|
|
Atom-- has joined #archiveteam-bs |
04:48
🔗
|
|
Frogging has quit IRC (Read error: Operation timed out) |
04:48
🔗
|
|
Frogging has joined #archiveteam-bs |
04:48
🔗
|
|
twigfoot has quit IRC (Ping timeout: 360 seconds) |
04:48
🔗
|
|
Pixi has quit IRC (Read error: Operation timed out) |
04:49
🔗
|
|
underscor has quit IRC (Ping timeout: 360 seconds) |
04:49
🔗
|
|
underscor has joined #archiveteam-bs |
04:49
🔗
|
|
svchfoo1 sets mode: +o underscor |
04:50
🔗
|
|
arkiver has quit IRC (Read error: Operation timed out) |
04:50
🔗
|
|
superkuh has quit IRC (Excess Flood) |
04:51
🔗
|
|
twigfoot has joined #archiveteam-bs |
04:51
🔗
|
|
betamax_ has joined #archiveteam-bs |
04:52
🔗
|
|
swebb has quit IRC (Ping timeout: 360 seconds) |
04:52
🔗
|
|
Somebody2 has quit IRC (Ping timeout: 360 seconds) |
04:52
🔗
|
|
unlobito has quit IRC (Ping timeout: 360 seconds) |
04:52
🔗
|
|
unlobito has joined #archiveteam-bs |
04:52
🔗
|
|
swebb has joined #archiveteam-bs |
04:52
🔗
|
|
svchfoo1 sets mode: +o swebb |
04:53
🔗
|
|
Cameron_D has quit IRC (Read error: Operation timed out) |
04:53
🔗
|
|
sknebel_ has joined #archiveteam-bs |
04:54
🔗
|
|
arkiver has joined #archiveteam-bs |
04:54
🔗
|
|
Darkstar has quit IRC (Read error: Connection reset by peer) |
04:54
🔗
|
|
Cameron_D has joined #archiveteam-bs |
04:55
🔗
|
|
Somebody2 has joined #archiveteam-bs |
04:55
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
04:56
🔗
|
|
twigfoot has quit IRC (Read error: Operation timed out) |
04:56
🔗
|
|
betamax has quit IRC (Read error: Operation timed out) |
04:57
🔗
|
|
Atom has quit IRC (Read error: Operation timed out) |
04:57
🔗
|
|
godane has joined #archiveteam-bs |
04:58
🔗
|
|
svchfoo1 sets mode: +o godane |
04:58
🔗
|
|
Yurume has joined #archiveteam-bs |
04:59
🔗
|
|
astrid has quit IRC (Read error: Operation timed out) |
05:00
🔗
|
|
twigfoot has joined #archiveteam-bs |
05:00
🔗
|
|
Cameron_D has quit IRC (Ping timeout: 360 seconds) |
05:01
🔗
|
|
Cameron_D has joined #archiveteam-bs |
05:02
🔗
|
|
Somebody2 has quit IRC (Ping timeout: 360 seconds) |
05:02
🔗
|
|
phirephl- has quit IRC (Ping timeout: 360 seconds) |
05:02
🔗
|
godane |
SketchCow: any news? |
05:04
🔗
|
|
astrid has joined #archiveteam-bs |
05:04
🔗
|
|
swebb sets mode: +o astrid |
05:04
🔗
|
|
MrRadar has quit IRC (Read error: Operation timed out) |
05:05
🔗
|
|
Darkstar has joined #archiveteam-bs |
05:06
🔗
|
|
sknebel has quit IRC (Read error: Operation timed out) |
05:07
🔗
|
|
twigfoot has quit IRC (Read error: Operation timed out) |
05:07
🔗
|
|
twigfoot has joined #archiveteam-bs |
05:07
🔗
|
|
Yurume_ has quit IRC (Read error: Operation timed out) |
05:08
🔗
|
|
zino_ has quit IRC (Excess Flood) |
05:11
🔗
|
|
MrRadar has joined #archiveteam-bs |
05:11
🔗
|
|
superkuh has joined #archiveteam-bs |
05:12
🔗
|
|
phirephly has joined #archiveteam-bs |
05:12
🔗
|
|
Darkstar has quit IRC (Read error: Connection reset by peer) |
05:13
🔗
|
|
Somebody2 has joined #archiveteam-bs |
05:15
🔗
|
|
zino has joined #archiveteam-bs |
05:15
🔗
|
|
Darkstar has joined #archiveteam-bs |
05:25
🔗
|
hook54321 |
JAA: Have we started grabbing XUL addons from addons.mozilla.org? The deadline is "early October, 2018" |
05:36
🔗
|
|
m007a83 has quit IRC (Fuck you Comcast) |
05:55
🔗
|
|
HCross has quit IRC (Ping timeout: 268 seconds) |
05:56
🔗
|
|
HCross has joined #archiveteam-bs |
05:56
🔗
|
|
HCross has quit IRC (Excess Flood) |
05:57
🔗
|
|
Yurume has quit IRC (Ping timeout: 268 seconds) |
05:57
🔗
|
|
TC04 has quit IRC (Ping timeout: 268 seconds) |
05:57
🔗
|
|
svchfoo1 has quit IRC (Ping timeout: 268 seconds) |
05:57
🔗
|
|
TC01 has joined #archiveteam-bs |
05:57
🔗
|
|
Yurume has joined #archiveteam-bs |
05:57
🔗
|
|
kiskabak2 has quit IRC (Ping timeout: 268 seconds) |
05:58
🔗
|
|
Kaz has quit IRC (Ping timeout: 268 seconds) |
06:02
🔗
|
|
betamax_ has quit IRC (Ping timeout: 268 seconds) |
06:02
🔗
|
|
betamax has joined #archiveteam-bs |
06:14
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
06:26
🔗
|
|
dxrt_ has joined #archiveteam-bs |
06:28
🔗
|
|
sec^nd has quit IRC (Quit: ZNC 1.6.5 - http://znc.in) |
06:36
🔗
|
|
second has joined #archiveteam-bs |
06:43
🔗
|
|
BlueMax has joined #archiveteam-bs |
06:55
🔗
|
|
HCross has joined #archiveteam-bs |
07:01
🔗
|
|
erin has joined #archiveteam-bs |
07:55
🔗
|
|
svchfoo1 has joined #archiveteam-bs |
07:55
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
08:14
🔗
|
|
kiskabak2 has joined #archiveteam-bs |
08:30
🔗
|
|
coldice_ has joined #archiveteam-bs |
08:35
🔗
|
|
coldice has quit IRC (Read error: Operation timed out) |
09:31
🔗
|
|
BartoCH has quit IRC (Quit: WeeChat 2.2) |
09:31
🔗
|
|
BartoCH has joined #archiveteam-bs |
09:39
🔗
|
|
faoling__ is now known as faolingfa |
09:53
🔗
|
|
faolingfa has quit IRC (Leaving) |
10:20
🔗
|
coldice_ |
Flashfire, yeah, the part about uploading the WARC file I'm not quite sure about yet |
10:21
🔗
|
Flashfire |
Yeah, I am not so great with that. You are going to need to ask someone else for help, I am sorry |
10:50
🔗
|
|
Kaz has joined #archiveteam-bs |
11:02
🔗
|
JAA |
coldice_: I'm using custom code on top of a modified version of aiohttp when it has to be fast and can easily be split up into individual work items. If I just want to do a recursive crawl, I use wpull. |
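The "individual work items" approach can be sketched with plain `asyncio`. This is not JAA's code and does not use aiohttp: `fetch_item` is a stub standing in for a real HTTP request, and the names `run_items` and `concurrency` are assumptions for illustration.

```python
import asyncio


async def fetch_item(item_id: int) -> str:
    """Stub for one work item; a real version would issue an HTTP request."""
    await asyncio.sleep(0)  # yield to the event loop, as network I/O would
    return f"item-{item_id}: done"


async def run_items(item_ids, concurrency: int = 4):
    """Process independent work items with a cap on simultaneous requests."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(item_id: int) -> str:
        async with semaphore:
            return await fetch_item(item_id)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(i) for i in item_ids))


results = asyncio.run(run_items(range(10)))
print(results[:3])
```

Because each item is independent, speed is mostly a matter of how many requests run at once, which is what the semaphore limits.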
11:02
🔗
|
|
SimpBrain has quit IRC (Read error: Operation timed out) |
11:03
🔗
|
JAA |
hook54321: The warrior project isn't started yet, but arkiver said it should be ready soon. I grabbed all Firefox addons yesterday, and I'll grab the Thunderbird and Seamonkey ones today. But I'm only grabbing the actual .xpi (and occasionally .zip) files, not the web page; the latter is also very important since it contains description, screenshots, metadata, changelogs, license information, etc. |
11:03
🔗
|
JAA |
-> #outofammo |
12:14
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
12:16
🔗
|
|
Mateon1 has joined #archiveteam-bs |
12:19
🔗
|
|
TC01 has quit IRC (Read error: Operation timed out) |
12:23
🔗
|
|
TC01 has joined #archiveteam-bs |
12:38
🔗
|
|
chferfa has quit IRC () |
13:02
🔗
|
|
coldice_ is now known as coldice |
13:04
🔗
|
coldice |
Oops, turns out the site I was crawling requires a login to access the forum part.... anyone know how to handle a login page? :| |
13:25
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
13:47
🔗
|
|
m007a83 has joined #archiveteam-bs |
13:50
🔗
|
coldice |
Anyone? grab-site has hit a nationalgeographic URL and doesn't proceed... I think it's stuck |
13:51
🔗
|
coldice |
can I stop and continue where it left off or something? |
14:08
🔗
|
|
wp494 has quit IRC (Read error: Operation timed out) |
14:09
🔗
|
|
wp494 has joined #archiveteam-bs |
14:11
🔗
|
|
Atom__ has joined #archiveteam-bs |
14:14
🔗
|
|
Atom-- has quit IRC (Read error: Operation timed out) |
14:20
🔗
|
mr_archiv |
@coldice, manually log in using a web browser, note the cookie(s) it sets and their values, and send those as part of each request with the web scraper you are using. |
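The cookie approach can be sketched with the standard library: copy the cookie names and values from the browser's developer tools and attach them as a `Cookie` header. The URL, the cookie names, and the helper `build_authenticated_request` below are placeholders for illustration; the actual names depend on the forum software.

```python
import urllib.request


def build_authenticated_request(url: str, cookies: dict) -> urllib.request.Request:
    """Build a request that carries the session cookies copied from a browser."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    request = urllib.request.Request(url)
    request.add_header("Cookie", cookie_header)
    return request


# Placeholder values: replace with the cookies your browser shows after login.
session_cookies = {"sessionid": "PASTE-VALUE-FROM-BROWSER", "logged_in": "1"}
request = build_authenticated_request("https://forum.example.com/board/1", session_cookies)
print(request.get_header("Cookie"))

# Sending it would look like this (left commented out, since the URL is fictional):
# with urllib.request.urlopen(request) as response:
#     html = response.read()
```

Session cookies expire, so a long crawl may need fresh values partway through.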
15:32
🔗
|
|
odemg has joined #archiveteam-bs |
15:45
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
16:14
🔗
|
|
zhongfu has joined #archiveteam-bs |
16:42
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
18:17
🔗
|
|
Mateon1 has quit IRC (Quit: Mateon1) |
18:18
🔗
|
|
Mateon1 has joined #archiveteam-bs |
19:03
🔗
|
godane |
so i got a beta player at Savers for $8 |
19:03
🔗
|
godane |
i will have to see if it works but i did test in store and it does power one |
19:03
🔗
|
godane |
*on |
19:42
🔗
|
|
RichardG_ has quit IRC (Read error: Operation timed out) |
19:43
🔗
|
|
ndiddy has joined #archiveteam-bs |
19:59
🔗
|
godane |
so tape will not load |
19:59
🔗
|
godane |
figures |
20:05
🔗
|
godane |
i'm digitizing a tape called 'The Valley of Miracles' |
20:19
🔗
|
godane |
this is a vhs tape i bought from savers |
20:19
🔗
|
godane |
the only thing that i would think needs to be digitized, maybe |
20:34
🔗
|
Raccoon |
what sort of tapes do you like to digitize? |
20:36
🔗
|
Raccoon |
I have a bunch of VHS from our wildlife refuge I was about to toss, because my bitch cat peed in the box (destroying them with smell even if I cleaned them well). But the cassettes themselves were undamaged. |
20:36
🔗
|
Raccoon |
they're either visitor education films or wildlife management and heavy machinery crew instructional videos. |
20:37
🔗
|
Raccoon |
Bobcat and John Deere brand training |
20:45
🔗
|
|
atluxity has quit IRC (Be the person your dog think you are.) |
20:56
🔗
|
ivan |
coldice: add nationalgeographic to ignores and raise the concurrency |
20:56
🔗
|
ivan |
coldice: it'll resume soon enough |
20:58
🔗
|
|
RichardG has joined #archiveteam-bs |
21:04
🔗
|
godane |
Raccoon: i'm not taking cat-peed-on tapes if possible |
21:04
🔗
|
godane |
at least i would like to see pictures of the tapes first |
21:05
🔗
|
Raccoon |
thought not :) I mean, the tapes are clean, but the pretty case cover art can't be salvaged except for maybe a photograph |
21:06
🔗
|
Raccoon |
boring stuff about birds and sandhill cranes anyway |
21:06
🔗
|
Raccoon |
riogrande |
21:11
🔗
|
|
RichardG has quit IRC (Ping timeout: 246 seconds) |
21:29
🔗
|
godane |
i'm doing 'from boxing to ballet' tape |
21:29
🔗
|
godane |
its pushing 10Mbits |
22:22
🔗
|
|
coldice has quit IRC (Read error: Operation timed out) |
23:13
🔗
|
|
BlueMax has joined #archiveteam-bs |