Time |
Nickname |
Message |
00:44
🔗
|
|
susanandd has joined #archiveteam |
00:45
🔗
|
|
susanandd has quit IRC (Client Quit) |
01:27
🔗
|
|
Stilett0 is now known as Stiletto |
01:50
🔗
|
|
espes__ has quit IRC (Ping timeout: 250 seconds) |
02:00
🔗
|
|
drumstick has joined #archiveteam |
02:07
🔗
|
|
espes__ has joined #archiveteam |
03:37
🔗
|
|
Stilett0 has joined #archiveteam |
03:42
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
03:47
🔗
|
|
marvinw is now known as ivan |
03:51
🔗
|
|
drumstick has quit IRC (Read error: Operation timed out) |
03:57
🔗
|
SketchCow |
People are on it |
04:12
🔗
|
|
emanuel has joined #archiveteam |
04:17
🔗
|
|
kyounko has joined #archiveteam |
04:32
🔗
|
|
Stilett0 is now known as Stiletto |
04:37
🔗
|
|
z00nx has joined #archiveteam |
05:00
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
05:07
🔗
|
|
Sk1d has joined #archiveteam |
05:10
🔗
|
sb057 |
good to know |
05:19
🔗
|
emanuel |
Anyone here working on the Cambodia Daily archive? |
05:22
🔗
|
|
Asparagir has quit IRC (Asparagir) |
05:26
🔗
|
SketchCow |
archivebot is |
05:33
🔗
|
emanuel |
Ok. I have a repo with all of the Cambodia Daily's article URLs (https://github.com/emanuelfeld/cambodia-daily), but if archivebot is working its way through the site, that's fine. |
05:35
🔗
|
|
BlueMaxim has joined #archiveteam |
05:46
🔗
|
dxrt |
emanuel: ^I think we should work on those as well. The ArchiveBot job is progressing very slowly and has barely managed 1 GB / 21k responses. |
05:47
🔗
|
emanuel |
There are about 50k articles between the English and Khmer sites. |
05:56
🔗
|
emanuel |
dxrt: I'm not familiar with the way Archive Team works, so let me know if I should be adding the repo link somewhere else. |
05:58
🔗
|
dxrt |
It's ok. Someone here should see it and be able to grab those articles from your list pretty quickly. If not, I'll get it done in a few hours when I'm free. |
06:01
🔗
|
emanuel |
Sweet, thanks! |
06:14
🔗
|
|
sb057 has quit IRC (Quit: Page closed) |
06:15
🔗
|
dxrt |
All good! Starting it now. |
06:27
🔗
|
|
drumstick has joined #archiveteam |
06:30
🔗
|
|
nwf__ has quit IRC (Read error: Operation timed out) |
06:40
🔗
|
|
nwf__ has joined #archiveteam |
07:30
🔗
|
|
HCross2 has joined #archiveteam |
08:07
🔗
|
|
drumstick has quit IRC (Remote host closed the connection) |
08:08
🔗
|
|
drumstick has joined #archiveteam |
08:38
🔗
|
|
drumstick has quit IRC (Read error: Operation timed out) |
08:50
🔗
|
|
Honno has joined #archiveteam |
08:51
🔗
|
|
drumstick has joined #archiveteam |
08:52
🔗
|
|
refeed has joined #archiveteam |
09:20
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
09:34
🔗
|
refeed |
Hi guys, I wanna ask (maybe this is a dumb question): I don't know much about the warc file format. Can warc also save HTML page resources like images, CSS, etc., or does it just save the HTML page and its request headers? |
09:35
🔗
|
HCross2 |
refeed: the warc stores everything needed to replicate the web page as a whole in the wayback |
09:44
🔗
|
|
kristian_ has joined #archiveteam |
09:46
🔗
|
refeed |
HCross2: thanks :) |
09:53
🔗
|
|
refeed has quit IRC (Leaving) |
09:53
🔗
|
|
atomotic has joined #archiveteam |
11:20
🔗
|
|
atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
11:36
🔗
|
|
drumstick has quit IRC (Ping timeout: 600 seconds) |
11:38
🔗
|
|
kitties has quit IRC (Quit: Connection closed for inactivity) |
12:11
🔗
|
|
atomotic has joined #archiveteam |
12:26
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
12:28
🔗
|
|
Kalroth has quit IRC (Ping timeout: 250 seconds) |
12:32
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 250 seconds) |
12:33
🔗
|
|
Mateon1 has joined #archiveteam |
12:35
🔗
|
|
kristian_ has quit IRC (Quit: Leaving) |
12:38
🔗
|
|
Kalroth has joined #archiveteam |
12:49
🔗
|
|
odemg has joined #archiveteam |
12:50
🔗
|
|
emanuel has quit IRC (Quit: Page closed) |
12:59
🔗
|
|
nertzy has joined #archiveteam |
13:27
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |
14:04
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
14:23
🔗
|
SketchCow |
-- SHOUT OUT TO TED RHEINGOLD, RIP -- |
14:39
🔗
|
|
refeed has joined #archiveteam |
14:56
🔗
|
|
Honno has joined #archiveteam |
16:17
🔗
|
|
ld1 has quit IRC (Ping timeout: 260 seconds) |
16:17
🔗
|
|
ld1 has joined #archiveteam |
16:25
🔗
|
|
schbirid has joined #archiveteam |
16:33
🔗
|
refeed |
Is there any specific case where we should use the .arc format rather than .warc, or the other way around? |
16:51
🔗
|
|
ld1 has quit IRC (Ping timeout: 260 seconds) |
16:52
🔗
|
|
Jimmy_ has joined #archiveteam |
16:53
🔗
|
|
ld1 has joined #archiveteam |
16:53
🔗
|
astrid |
refeed: .warc is better, if you're choosing between the two |
16:54
🔗
|
|
Jimmy_ has quit IRC (Client Quit) |
16:55
🔗
|
refeed |
astrid, euhm okay, reasonable |
16:59
🔗
|
refeed |
Can the warc format also be used to save other data types besides webpages? In my case I wanna use it to store some of my scrapy JSON outputs |
17:00
🔗
|
astrid |
warc is designed to record any http transaction, including both request and response headers |
17:00
🔗
|
astrid |
you can also use it for any other request/response protocol, like FTP or DNS, but the tooling is almost all HTTP |
17:00
🔗
|
MrRadar |
Yes. You can also store additional data in "resource" records |
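[Editor's note] A minimal sketch of the "resource" record idea mentioned above, using only the Python standard library and assuming the WARC/1.0 record layout (a version line, named header fields, a blank line, the payload, then two CRLFs). The helper name `build_resource_record`, the sample item, and the `file:///scrapy/items.json` target URI are hypothetical, chosen only for illustration; nobody in the channel posted this code.

```python
import json
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record as bytes (hypothetical helper)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Header block: version line, one "Name: value" line per field, then a blank line.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLFs after the payload.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Made-up scraped items standing in for a scrapy JSON output.
items = [{"title": "example", "url": "https://example.com/a"}]
payload = json.dumps(items).encode("utf-8")
record = build_resource_record("file:///scrapy/items.json", payload, "application/json")
```

Appending `record` to a file opened in binary mode would give a plain `.warc` file. For real use an existing WARC library is probably the safer choice, since the tooling is, as noted above, almost all built around HTTP captures.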
17:08
🔗
|
refeed |
OK, thanks :) |
17:09
🔗
|
astrid |
you might also get some use out of https://github.com/odie5533/WarcProxy |
17:13
🔗
|
refeed |
oh nice, thanks |
18:02
🔗
|
|
cf has quit IRC (Read error: Operation timed out) |
18:03
🔗
|
|
cf has joined #archiveteam |
18:04
🔗
|
|
refeed has quit IRC (Leaving) |
18:18
🔗
|
|
emanuel has joined #archiveteam |
19:07
🔗
|
|
Sanqui has quit IRC (Ping timeout: 260 seconds) |
19:11
🔗
|
|
scyther has quit IRC (Ping timeout: 260 seconds) |
19:12
🔗
|
|
elwisp has quit IRC (Quit: leaving) |
19:15
🔗
|
|
Sanqui has joined #archiveteam |
19:22
🔗
|
|
emanuel has quit IRC (Quit: Page closed) |
19:26
🔗
|
|
scyther has joined #archiveteam |
19:50
🔗
|
|
atomotic has joined #archiveteam |
20:26
🔗
|
|
Aranje has joined #archiveteam |
20:39
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
20:46
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
20:58
🔗
|
|
sun_shine has joined #archiveteam |
21:14
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:47
🔗
|
|
kitties has joined #archiveteam |
22:07
🔗
|
|
drumstick has joined #archiveteam |
22:07
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:10
🔗
|
|
dashcloud has joined #archiveteam |
22:21
🔗
|
|
namibj_ has quit IRC (Ping timeout: 260 seconds) |
22:33
🔗
|
|
namibj_ has joined #archiveteam |
22:34
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:34
🔗
|
|
dashcloud has joined #archiveteam |
22:35
🔗
|
|
Soni has quit IRC (Read error: Operation timed out) |
22:36
🔗
|
|
Soni has joined #archiveteam |
22:51
🔗
|
|
sun_shine has left Leaving |
23:19
🔗
|
|
Asparagir has joined #archiveteam |
23:30
🔗
|
Asparagir |
emanuel: not sure if you got an answer yet (I was offline for a while) but we started archiving Cambodia Daily about two days ago. It's still running, got 1.6 GB of it so far. Got their Twitter too. |
23:31
🔗
|
Asparagir |
And I just added a job for the Khmer language version of the paper too, although it may not start running for a little bit. |
23:41
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |