#archiveteam 2017-09-05,Tue


Time Nickname Message
00:44 🔗 susanandd has joined #archiveteam
00:45 🔗 susanandd has quit IRC (Client Quit)
01:27 🔗 Stilett0 is now known as Stiletto
01:50 🔗 espes__ has quit IRC (Ping timeout: 250 seconds)
02:00 🔗 drumstick has joined #archiveteam
02:07 🔗 espes__ has joined #archiveteam
03:37 🔗 Stilett0 has joined #archiveteam
03:42 🔗 Stiletto has quit IRC (Read error: Operation timed out)
03:47 🔗 marvinw is now known as ivan
03:51 🔗 drumstick has quit IRC (Read error: Operation timed out)
03:57 🔗 SketchCow People are on it
04:12 🔗 emanuel has joined #archiveteam
04:17 🔗 kyounko has joined #archiveteam
04:32 🔗 Stilett0 is now known as Stiletto
04:37 🔗 z00nx has joined #archiveteam
05:00 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
05:07 🔗 Sk1d has joined #archiveteam
05:10 🔗 sb057 good to know
05:19 🔗 emanuel Anyone here working on the Cambodia Daily archive?
05:22 🔗 Asparagir has quit IRC (Asparagir)
05:26 🔗 SketchCow archivebot is
05:33 🔗 emanuel Ok. I have a repo with all of the Cambodia Daily's article urls (https://github.com/emanuelfeld/cambodia-daily), but if archivebot is working its way through the site, that's fine.
05:35 🔗 BlueMaxim has joined #archiveteam
05:46 🔗 dxrt emanuel: ^ I think we should work on those as well. The ArchiveBot job is progressing very slowly and has barely managed 1 GB / 21k responses.
05:47 🔗 emanuel There's about 50k articles between the english and khmer sites.
05:56 🔗 emanuel dxrt: I'm not familiar with the way Archive Team works, so let me know if I should be adding the repo link somewhere else.
05:58 🔗 dxrt It's ok. Someone here should see it and be able to grab those articles from your list pretty quickly. If not, I'll get it done in a few hours when I'm free.
06:01 🔗 emanuel Sweet, thanks!
06:14 🔗 sb057 has quit IRC (Quit: Page closed)
06:15 🔗 dxrt All good! Starting it now.
06:27 🔗 drumstick has joined #archiveteam
06:30 🔗 nwf__ has quit IRC (Read error: Operation timed out)
06:40 🔗 nwf__ has joined #archiveteam
07:30 🔗 HCross2 has joined #archiveteam
08:07 🔗 drumstick has quit IRC (Remote host closed the connection)
08:08 🔗 drumstick has joined #archiveteam
08:38 🔗 drumstick has quit IRC (Read error: Operation timed out)
08:50 🔗 Honno has joined #archiveteam
08:51 🔗 drumstick has joined #archiveteam
08:52 🔗 refeed has joined #archiveteam
09:20 🔗 BlueMaxim has quit IRC (Quit: Leaving)
09:34 🔗 refeed Hi guys, I wanna ask (maybe this is a dumb question), I don't know much about the warc file format. Can a warc also save html page resources like images, css, etc., or does it just save the html webpage and its request header?
09:35 🔗 HCross2 refeed: the warc stores everything needed to replicate the web page as a whole in the wayback
09:44 🔗 kristian_ has joined #archiveteam
09:46 🔗 refeed HCross2: thanks :)
09:53 🔗 refeed has quit IRC (Leaving)
09:53 🔗 atomotic has joined #archiveteam
11:20 🔗 atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
11:36 🔗 drumstick has quit IRC (Ping timeout: 600 seconds)
11:38 🔗 kitties has quit IRC (Quit: Connection closed for inactivity)
12:11 🔗 atomotic has joined #archiveteam
12:26 🔗 odemg has quit IRC (Read error: Operation timed out)
12:28 🔗 Kalroth has quit IRC (Ping timeout: 250 seconds)
12:32 🔗 Mateon1 has quit IRC (Ping timeout: 250 seconds)
12:33 🔗 Mateon1 has joined #archiveteam
12:35 🔗 kristian_ has quit IRC (Quit: Leaving)
12:38 🔗 Kalroth has joined #archiveteam
12:49 🔗 odemg has joined #archiveteam
12:50 🔗 emanuel has quit IRC (Quit: Page closed)
12:59 🔗 nertzy has joined #archiveteam
13:27 🔗 Honno has quit IRC (Read error: Operation timed out)
14:04 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
14:23 🔗 SketchCow -- SHOUT OUT TO TED RHEINGOLD, RIP --
14:39 🔗 refeed has joined #archiveteam
14:56 🔗 Honno has joined #archiveteam
16:17 🔗 ld1 has quit IRC (Ping timeout: 260 seconds)
16:17 🔗 ld1 has joined #archiveteam
16:25 🔗 schbirid has joined #archiveteam
16:33 🔗 refeed Are there any specific cases where we should use the .arc format versus the .warc format?
16:51 🔗 ld1 has quit IRC (Ping timeout: 260 seconds)
16:52 🔗 Jimmy_ has joined #archiveteam
16:53 🔗 ld1 has joined #archiveteam
16:53 🔗 astrid refeed: .warc is better, if you're choosing between the two
16:54 🔗 Jimmy_ has quit IRC (Client Quit)
16:55 🔗 refeed astrid, euhm okay, reasonable
16:59 🔗 refeed Can the warc format also be used to save other data types besides webpages? In my case I want to use it to store some of my scrapy json outputs
17:00 🔗 astrid warc is designed to record any http transaction, including both request and response headers
17:00 🔗 astrid you can also use it for any other request/response protocol, like FTP or DNS, but the tooling is almost all HTTP
17:00 🔗 MrRadar Yes. You can also store additional data in "resource" records
17:08 🔗 refeed OK, thanks :)
17:09 🔗 astrid you might also get some use out of https://github.com/odie5533/WarcProxy
17:13 🔗 refeed oh nice, thanks
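
The "resource" records mentioned at 17:00 can be illustrated with a small sketch. This is not taken from the channel or from any particular tool: it hand-writes a single WARC/1.0 record using only the Python standard library, and the function name `build_resource_record`, the sample items, and the `file:///scrapy/items.json` target URI are hypothetical placeholders. Real projects would more likely use an existing WARC library and gzip each record.

```python
import json
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, named header fields, then a blank line before the content block.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record is terminated by two CRLF pairs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Made-up scraper output, standing in for a scrapy JSON export.
items = [{"title": "Example article", "url": "https://example.com/a"}]
payload = json.dumps(items).encode("utf-8")
record = build_resource_record("file:///scrapy/items.json", payload, "application/json")
```

Appending `record` to a file opened in binary mode gives an uncompressed .warc containing one resource record; the JSON sits in the content block exactly as produced, with no HTTP request or response involved.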
18:02 🔗 cf has quit IRC (Read error: Operation timed out)
18:03 🔗 cf has joined #archiveteam
18:04 🔗 refeed has quit IRC (Leaving)
18:18 🔗 emanuel has joined #archiveteam
19:07 🔗 Sanqui has quit IRC (Ping timeout: 260 seconds)
19:11 🔗 scyther has quit IRC (Ping timeout: 260 seconds)
19:12 🔗 elwisp has quit IRC (Quit: leaving)
19:15 🔗 Sanqui has joined #archiveteam
19:22 🔗 emanuel has quit IRC (Quit: Page closed)
19:26 🔗 scyther has joined #archiveteam
19:50 🔗 atomotic has joined #archiveteam
20:26 🔗 Aranje has joined #archiveteam
20:39 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
20:46 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
20:58 🔗 sun_shine has joined #archiveteam
21:14 🔗 schbirid has quit IRC (Quit: Leaving)
21:47 🔗 kitties has joined #archiveteam
22:07 🔗 drumstick has joined #archiveteam
22:07 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:10 🔗 dashcloud has joined #archiveteam
22:21 🔗 namibj_ has quit IRC (Ping timeout: 260 seconds)
22:33 🔗 namibj_ has joined #archiveteam
22:34 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:34 🔗 dashcloud has joined #archiveteam
22:35 🔗 Soni has quit IRC (Read error: Operation timed out)
22:36 🔗 Soni has joined #archiveteam
22:51 🔗 sun_shine has left Leaving
23:19 🔗 Asparagir has joined #archiveteam
23:30 🔗 Asparagir emanuel: not sure if you got an answer yet (I was offline for a while) but we started archiving Cambodia Daily about two days ago. It's still running, got 1.6 GB of it so far. Got their Twitter too.
23:31 🔗 Asparagir And I just added a job for the Khmer language version of the paper too, although it may not start running for a little bit.
23:41 🔗 Honno has quit IRC (Read error: Operation timed out)
