[00:44] *** susanandd has joined #archiveteam
[00:45] *** susanandd has quit IRC (Client Quit)
[01:27] *** Stilett0 is now known as Stiletto
[01:50] *** espes__ has quit IRC (Ping timeout: 250 seconds)
[02:00] *** drumstick has joined #archiveteam
[02:07] *** espes__ has joined #archiveteam
[03:37] *** Stilett0 has joined #archiveteam
[03:42] *** Stiletto has quit IRC (Read error: Operation timed out)
[03:47] *** marvinw is now known as ivan
[03:51] *** drumstick has quit IRC (Read error: Operation timed out)
[03:57] People are on it
[04:12] *** emanuel has joined #archiveteam
[04:17] *** kyounko has joined #archiveteam
[04:32] *** Stilett0 is now known as Stiletto
[04:37] *** z00nx has joined #archiveteam
[05:00] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:07] *** Sk1d has joined #archiveteam
[05:10] good to know
[05:19] Anyone here working on the Cambodia Daily archive?
[05:22] *** Asparagir has quit IRC (Asparagir)
[05:26] archivebot is
[05:33] Ok. I have a repo with all of the Cambodia Daily's article urls (https://github.com/emanuelfeld/cambodia-daily), but if archivebot is working its way through the site, that's fine.
[05:35] *** BlueMaxim has joined #archiveteam
[05:46] emanuel: ^I think we should work on those as well. The ArchiveBot job is progressing very slowly and has barely managed 1GB / 21k responses.
[05:47] There are about 50k articles between the English and Khmer sites.
[05:56] dxrt: I'm not familiar with the way Archive Team works, so let me know if I should be adding the repo link somewhere else.
[05:58] It's ok. Someone here should see it and be able to grab those articles from your list pretty quickly - if not I'll get it done in a few hours when I'm free.
[06:01] Sweet, thanks!
[06:14] *** sb057 has quit IRC (Quit: Page closed)
[06:15] All good! Starting it now.
[06:27] *** drumstick has joined #archiveteam
[06:30] *** nwf__ has quit IRC (Read error: Operation timed out)
[06:40] *** nwf__ has joined #archiveteam
[07:30] *** HCross2 has joined #archiveteam
[08:07] *** drumstick has quit IRC (Remote host closed the connection)
[08:08] *** drumstick has joined #archiveteam
[08:38] *** drumstick has quit IRC (Read error: Operation timed out)
[08:50] *** Honno has joined #archiveteam
[08:51] *** drumstick has joined #archiveteam
[08:52] *** refeed has joined #archiveteam
[09:20] *** BlueMaxim has quit IRC (Quit: Leaving)
[09:34] Hi guys, I wanna ask (maybe this is a dumb question), I don't know much about the warc file format. Can warc also save html page resources like images, css, etc, or does it just save the html webpage and its request header?
[09:35] refeed: the warc stores everything needed to replicate the web page as a whole in the wayback
[09:44] *** kristian_ has joined #archiveteam
[09:46] HCross2: thanks :)
[09:53] *** refeed has quit IRC (Leaving)
[09:53] *** atomotic has joined #archiveteam
[11:20] *** atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[11:36] *** drumstick has quit IRC (Ping timeout: 600 seconds)
[11:38] *** kitties has quit IRC (Quit: Connection closed for inactivity)
[12:11] *** atomotic has joined #archiveteam
[12:26] *** odemg has quit IRC (Read error: Operation timed out)
[12:28] *** Kalroth has quit IRC (Ping timeout: 250 seconds)
[12:32] *** Mateon1 has quit IRC (Ping timeout: 250 seconds)
[12:33] *** Mateon1 has joined #archiveteam
[12:35] *** kristian_ has quit IRC (Quit: Leaving)
[12:38] *** Kalroth has joined #archiveteam
[12:49] *** odemg has joined #archiveteam
[12:50] *** emanuel has quit IRC (Quit: Page closed)
[12:59] *** nertzy has joined #archiveteam
[13:27] *** Honno has quit IRC (Read error: Operation timed out)
[14:04] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[14:23] -- SHOUT OUT TO TED RHEINGOLD, RIP --
[14:39] *** refeed has joined #archiveteam
[14:56] *** Honno has joined #archiveteam
[16:17] *** ld1 has quit IRC (Ping timeout: 260 seconds)
[16:17] *** ld1 has joined #archiveteam
[16:25] *** schbirid has joined #archiveteam
[16:33] Is there any specific case where we should use the .arc or the .warc format?
[16:51] *** ld1 has quit IRC (Ping timeout: 260 seconds)
[16:52] *** Jimmy_ has joined #archiveteam
[16:53] *** ld1 has joined #archiveteam
[16:53] refeed: .warc is better, if you're choosing between the two
[16:54] *** Jimmy_ has quit IRC (Client Quit)
[16:55] astrid, euhm okay, reasonable
[16:59] Can the warc format also be used to save other data types besides webpages? In my case I want to use it to store some of my scrapy json outputs
[17:00] warc is designed to record any http transaction, including both request and response headers
[17:00] you can also use it for any other request/response protocol, like FTP or DNS, but the tooling is almost all HTTP
[17:00] Yes. You can also store additional data in "resource" records
[17:08] OK, thanks :)
[17:09] you might also get some use out of https://github.com/odie5533/WarcProxy
[17:13] oh nice, thanks
[18:02] *** cf has quit IRC (Read error: Operation timed out)
[18:03] *** cf has joined #archiveteam
[18:04] *** refeed has quit IRC (Leaving)
[18:18] *** emanuel has joined #archiveteam
[19:07] *** Sanqui has quit IRC (Ping timeout: 260 seconds)
[19:11] *** scyther has quit IRC (Ping timeout: 260 seconds)
[19:12] *** elwisp has quit IRC (Quit: leaving)
[19:15] *** Sanqui has joined #archiveteam
[19:22] *** emanuel has quit IRC (Quit: Page closed)
[19:26] *** scyther has joined #archiveteam
[19:50] *** atomotic has joined #archiveteam
[20:26] *** Aranje has joined #archiveteam
[20:39] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[20:46] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[20:58] *** sun_shine has joined #archiveteam
[21:14] *** schbirid has quit IRC (Quit: Leaving)
[21:47] *** kitties has joined #archiveteam
[22:07] *** drumstick has joined #archiveteam
[22:07] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:10] *** dashcloud has joined #archiveteam
[22:21] *** namibj_ has quit IRC (Ping timeout: 260 seconds)
[22:33] *** namibj_ has joined #archiveteam
[22:34] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:34] *** dashcloud has joined #archiveteam
[22:35] *** Soni has quit IRC (Read error: Operation timed out)
[22:36] *** Soni has joined #archiveteam
[22:51] *** sun_shine has left Leaving
[23:19] *** Asparagir has joined #archiveteam
[23:30] emanuel: not sure if you got an answer yet (I was offline for a while) but we started archiving Cambodia Daily about two days ago. It's still running, got 1.6 GB of it so far. Got their Twitter too.
[23:31] And I just added a job for the Khmer language version of the paper too, although it may not start running for a little bit.
[23:41] *** Honno has quit IRC (Read error: Operation timed out)
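
Note on the [16:59]-[17:00] exchange about keeping scrapy JSON output in "resource" records: a minimal sketch of what one such record might look like, using only the Python standard library. The function name build_resource_record, the example target URI, and the sample item are hypothetical and not from the log. This is not how any particular WARC tool does it; for real archives an established WARC library is likely the safer choice.

```python
import json
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper, sketch only)."""
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates headers from the block,
    # and every record is terminated by two CRLFs.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


# Assumed stand-in for a scrapy JSON export; the URI is a placeholder.
items = [{"title": "example item", "url": "https://example.com/a"}]
payload = json.dumps(items).encode("utf-8")
record = build_resource_record(
    "https://example.com/scrapy/items.json", payload, "application/json"
)
```

Appending the returned bytes to a file opened in binary mode gives a plain .warc file; the JSON sits in the record block unchanged, so it can be read back by slicing Content-Length bytes after the blank line.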