[00:02] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[00:13] *** omarroth has joined #archiveteam-bs
[00:19] JAA How long did it take to archive? How much GB/TB/PB/whatever in data was saved?
[00:20] *** RichardG_ is now known as RichardG
[00:20] ayanami_: Just a couple hours, 3 GiB or so. This is only the thread HTML and a few associated things though, no images, attachments, etc.
[00:21] Don't have the exact size for just the thread URLs since it's all running in the same grab and writing to the same WARCs.
[00:21] About 3 hours and 20 minutes for the thread URLs.
[00:22] Post URLs will take a while since they go up to over 2.3 million (so it takes 2.3 million requests). Should still finish in time easily though.
[00:22] I'm doing roughly 10k requests per minute at the moment.
[00:23] Oh, that's ~3 GiB of compressed WARCs. I don't know the uncompressed size.
[00:24] It's text, so it compresses very well, but it won't be huge since the forums aren't *that* large.
[00:27] *** bitBaron has joined #archiveteam-bs
[00:47] *** wyatt8740 has quit IRC (Ping timeout: 360 seconds)
[00:55] Current ETA is around 10:00 UTC.
[01:06] *** tuluu has quit IRC (Read error: Connection refused)
[01:07] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[01:08] *** tuluu has joined #archiveteam-bs
[01:10] *** ATrescue has quit IRC (Ping timeout: 260 seconds)
[02:03] *** ATrescue has joined #archiveteam-bs
[02:17] *** Anthony_ has joined #archiveteam-bs
[02:27] *** Anthony_ has quit IRC (Ping timeout: 262 seconds)
[02:33] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[02:42] *** omarroth has quit IRC (Remote host closed the connection)
[02:55] Huh, my request rate plummeted the second I started the ArchiveBot job. I guess I shouldn't go much faster then.
[02:57] For what? 99?
JAA
[03:01] Yeah
[03:03] *** ayanami_ has quit IRC (Quit: Leaving)
[03:08] *** qw3rty114 has joined #archiveteam-bs
[03:14] *** BlueMax has joined #archiveteam-bs
[03:15] *** qw3rty113 has quit IRC (Read error: Operation timed out)
[03:16] *** odemgi has joined #archiveteam-bs
[03:16] *** RomeSilva has quit IRC (Ping timeout: 246 seconds)
[03:18] *** odemgi_ has quit IRC (Ping timeout: 252 seconds)
[03:18] How about #Etch-A-Sketch for Sony Sketch ?
[03:18] Nah I think that is TradeMarked
[03:19] not that we really cared about that in the past
[03:25] *** odemg has quit IRC (Ping timeout: 615 seconds)
[03:28] #sketchy ?
[03:29] Oh, occupied.
[03:29] occupado
[03:30] Who dares to steal a potential AT IRC channel?!
[03:30] THE CHEEK OF THEM
[03:31] *** odemg has joined #archiveteam-bs
[03:35] #SketchyGrab
[03:40] #EraseASketch
[04:41] *** balrog has quit IRC (Read error: Operation timed out)
[05:13] *** Mayonaise has quit IRC (Read error: Operation timed out)
[05:14] *** balrog has joined #archiveteam-bs
[05:15] *** Mayonaise has joined #archiveteam-bs
[05:18] *** d5f4a3622 has quit IRC (Quit: WeeChat 2.4)
[05:23] *** d5f4a3622 has joined #archiveteam-bs
[05:41] *** balrog has quit IRC (Read error: Operation timed out)
[05:53] *** Zerote_ has joined #archiveteam-bs
[06:10] *** Despatche has quit IRC (Quit: Read error: Connection reset by deer)
[06:11] *** RomeSilva has joined #archiveteam-bs
[06:26] *** tuluu has quit IRC (Read error: Connection refused)
[06:27] *** tuluu has joined #archiveteam-bs
[06:31] *** balrog has joined #archiveteam-bs
[06:36] *** ivan has quit IRC (Leaving)
[06:38] *** ivan has joined #archiveteam-bs
[06:56] *** RichardG_ has joined #archiveteam-bs
[06:56] *** RichardG has quit IRC (Read error: Connection reset by peer)
[07:16] *** Zerote_ has quit IRC (Ping timeout: 600 seconds)
[07:21] *** Zerote_ has joined #archiveteam-bs
[07:33] *** RichardG has joined #archiveteam-bs
[07:33] *** RichardG_ has quit IRC (Read error: Connection reset by
peer)
[08:18] *** RomeSilva has quit IRC (Read error: Connection reset by peer)
[08:19] *** RomeSilva has joined #archiveteam-bs
[08:28] *** RichardG has quit IRC (Read error: Connection reset by peer)
[08:29] *** RichardG has joined #archiveteam-bs
[09:00] *** tuluu has quit IRC (Read error: Connection refused)
[09:01] *** tuluu has joined #archiveteam-bs
[09:34] *** VerifiedJ has joined #archiveteam-bs
[09:38] *** Verified_ has quit IRC (Ping timeout: 252 seconds)
[09:39] *** VerifiedJ has quit IRC (Ping timeout: 252 seconds)
[09:56] *** ColdIce has quit IRC (Remote host closed the connection)
[09:56] *** ColdIce has joined #archiveteam-bs
[10:10] *** Dallas has quit IRC (Quit: The Lounge - https://thelounge.chat)
[10:12] *** Dallas has joined #archiveteam-bs
[10:15] *** Dallas has quit IRC (Client Quit)
[10:16] *** Dallas has joined #archiveteam-bs
[10:22] *** BlueMax has quit IRC (Quit: Leaving)
[10:29] Oops, bug in my 99.se script causing an infinite loop. Welp.
[10:34] *** Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
[10:47] Fixed and resumed.
[10:47] All threads and posts pages have been retrieved, now doing the user profiles.
[10:48] *** arbin has quit IRC (Quit: .)
[10:52] *** arbin has joined #archiveteam-bs
[10:56] any ideas why http://web.archive.org/web/*/https://geoportal-hamburg.de/gdi3d/datasource-data/Schraegluftbilder2018/50_07032_lvl02-oblique-left/5/10/10.jpg might not want to archive?
[11:06] *** VerifiedJ has joined #archiveteam-bs
[11:09] schbirid: Not really, but I've had that issue before on kkl-luzern.ch. Perhaps a ban of IA's IP range?
[11:10] By the way, it looks like 99.se is running backups at 03:00 UTC. That's the only time I got a few timeout errors during my grab.
[11:11] hm weird
[11:11] i had no issues getting images added earlier http://web.archive.org/web/*/https://geoportal-hamburg.de/gdi3d/datasource-data/Schraegluftbilder2018//*
[11:11] Huh
[11:12] they had an expired certificate last weekend
[11:12] maybe that got cached somewhere?
[11:17] Yeah, also sounds plausible.
[11:17] Probably a case for info@archive.org, but I haven't had much luck getting through there lately.
[11:48] *** icedice has joined #archiveteam-bs
[11:50] *** Zerote_ has quit IRC (Ping timeout: 600 seconds)
[11:53] *** enowaldo has joined #archiveteam-bs
[12:02] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[12:21] 99.se is done. I covered a couple user profiles many times due to missing item deduplication in qwarc (coming soon). 35 GiB of WARCs, mostly from the posts pages I think.
[12:22] *** bitBaron has joined #archiveteam-bs
[12:25] *** bitBaron has quit IRC (Read error: Operation timed out)
[12:28] Awesome JAA
[12:28] dedup could be done post crawl if later needed
[12:33] *** enowaldo has joined #archiveteam-bs
[12:36] *** Madbrad has joined #archiveteam-bs
[12:40] *** enowaldo has quit IRC (Read error: Operation timed out)
[12:42] *** odemgi_ has joined #archiveteam-bs
[12:45] *** odemgi has quit IRC (Ping timeout: 252 seconds)
[12:51] *** Damme has quit IRC (Read error: Connection reset by peer)
[12:56] *** bitBaron has joined #archiveteam-bs
[12:57] *** gilbahat has joined #archiveteam-bs
[13:00] Who/what is OTW? Do you have an idea of the size of bookcity.co.il ?
[13:00] OTW is the organization for transformative works, mainly known for their flagship project 'Archive of our own' which is a fanfiction specific archive
[13:01] one of their subprojects is called 'open doors' which specializes in rescuing and re-cataloguing fanfiction archives
[13:02] ~200k "books" (stories?) according to the homepage.
[13:03] *** a_spook_ has joined #archiveteam-bs
[13:03] yes
[13:03] by fanfiction sites, this is considered a decent amount
[13:03] Each story has one page it seems, no pagination.
[13:03] there are multiple episodes though
[13:04] I do wonder if the 200k count really is 'books' or 'episodes'
[13:04] What's the semantic feature you mentioned?
[13:04] they have a tag-based system
[13:05] Do you have an example of such a multi-episode book?
[13:05] yes, sec. it also has fan-out (multiple options for an episode)
[13:05] schbirid: removing https seemed to work: http://web.archive.org/web/20190430130123/http://geoportal-hamburg.de/gdi3d/datasource-data/Schraegluftbilder2018/50_07032_lvl02-oblique-left/5/10/10.jpg
[13:05] http://bookcity.co.il/book.asp?id=206044 has fan-out: episode 3 / episode 3.1
[13:05] *** m007a83 has quit IRC (Quit: Fuck you Comcast)
[13:08] *** m007a83 has joined #archiveteam-bs
[13:08] Oh, I see now that there's a menu on the right with the pagination. Requires JS, so ArchiveBot will *not* cover it unless those pages are linked elsewhere.
[13:08] So the right side nav, it's flipping to different cgi get URLs
[13:08] (Menu only appears with JS enabled as well.)
[13:09] *** Despatche has joined #archiveteam-bs
[13:10] After the power outage, FOS came back without having the script that uploads archivebot uploads.
[13:10] A lot of data comes in via archivebot.
[13:10] Just wanted to pass along. Seems like terabytes a day
[13:11] I think the mobile pages have a non-js nav
[13:11] it has a different url scheme for mobile
[13:12] SketchCow: Ok, Noted. Do you have the script or is it just not running?
[13:13] gilbahat: Where can I find the mobile site? Difficult to navigate not knowing Hebrew. :-)
[13:13] JAA: there's an additional nav in center of page that uses forms drop down
[13:15] *** omarroth has joined #archiveteam-bs
[13:15] marked: Come again?
[13:15] (i'll be surprised if this works) where it says: רשימת הפרקים
[13:16] Oh right, that's the one I was looking at before.
[13:16] *** RichardG has quit IRC (Read error: Connection reset by peer)
[13:16] The one on the right side doesn't use JS. :-)
[13:17] It only shows up with JS enabled it seems, but the links are plain HTML.
[13:17] *** RichardG has joined #archiveteam-bs
[13:22] I can help with any hebrew issues
[13:24] woops, on Android, front page redirects to http://bookcity.co.il/mobile/
[13:27] *** enowaldo has joined #archiveteam-bs
[13:29] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[13:31] the mobile version is pageable without JS: http://bookcity.co.il/mobile/page.asp?id=267514
[13:33] Hmm, it does that redirect for ArchiveBot as well.
[13:33] Guess it's time to restart with a browser UA and a separate job for the mobile page.
[13:37] Igloo: It's running now.
[13:37] I'm just noting how much the machine uploads
[13:42] *** Oddly has joined #archiveteam-bs
[13:44] yeah, at least the mobile side will be sure to get all the content. on the desktop side, the only other thing I can think of is enumerating all the story ID's and figure it will come together on playback when that browser has javascript turned on
[13:44] ^grab by sequential ID
[13:44] Yup
[13:45] gilbahat: Any idea how urgent this is, i.e. when the site might disappear?
[13:46] Or is this rather a "site has been unhealthy for a while, better safe than sorry"-type thing?
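[editor's note] The "grab by sequential ID" idea at 13:44 could be sketched roughly as below. This is only an illustration, not what was actually run: the URL pattern is copied from the mobile link quoted at 13:31, while the function name and the ID bounds are hypothetical and the real range would have to be discovered from the site.

```python
from typing import Iterator

# URL pattern taken from the mobile link quoted in the log at 13:31.
MOBILE_PAGE_URL = "http://bookcity.co.il/mobile/page.asp?id={id}"


def mobile_page_urls(first_id: int, last_id: int) -> Iterator[str]:
    """Yield one mobile page URL per ID in [first_id, last_id].

    The bounds are assumptions supplied by the caller; IDs that do not
    exist on the site would simply return an error page when fetched.
    """
    for page_id in range(first_id, last_id + 1):
        yield MOBILE_PAGE_URL.format(id=page_id)


if __name__ == "__main__":
    # Hypothetical small range around the ID quoted in the log.
    for url in mobile_page_urls(267514, 267516):
        print(url)
```

The printed list could then be handed to whatever crawler is in use as a plain URL list, one URL per line.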
[13:47] *** enowaldo has quit IRC (Read error: Operation timed out)
[13:49] *** bitBaron has joined #archiveteam-bs
[13:50] *** gilbahat has quit IRC (Ping timeout: 260 seconds)
[13:58] *** gilbahat has joined #archiveteam-bs
[13:58] back, sorry
[14:08] *** Zerote_ has joined #archiveteam-bs
[14:13] *** Smiley has joined #archiveteam-bs
[14:20] *** Oddly has quit IRC (Read error: Operation timed out)
[14:28] gilbahat, the question to you was: what do you know about the urgency/when the site might disappear?
[14:29] *** omarroth has quit IRC (Read error: Connection reset by peer)
[14:29] I know for sure that the site is doomed (confirmed in private chat with owner, still not publicly known)
[14:29] but no shuttering date
[14:32] *** omarroth has joined #archiveteam-bs
[14:38] *** gilbahat has quit IRC (gilbahat)
[14:54] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[15:04] *** gilbahat has joined #archiveteam-bs
[15:05] did we make a decision on the channel for the sketch website?
[15:06] a_spook_: thx
[15:06] retch?
[15:12] *** a_spook_ has quit IRC (Quit: Connection closed for inactivity)
[15:20] *** gilbahat has quit IRC (gilbahat)
[15:23] *** gilbahat has joined #archiveteam-bs
[15:28] *** enowaldo has joined #archiveteam-bs
[15:35] arkiver: Don't think so.
[15:40] *** gilbahat has quit IRC (Quit: gilbahat)
[15:41] *** gilbahat has joined #archiveteam-bs
[15:45] what's the website called
[15:46] nyany: Sketch: https://sketch.sonymobile.com/
[15:46] oh this
[15:46] *** enowaldo has quit IRC (Read error: Operation timed out)
[15:47] if you were looking for a name i was going to suggest something with the word sketchy in it, seems right for sony
[15:47] Yeah, I suggested #sketchy, but that channel's occupied.
[15:49] ... #sketchedout?
[15:49] *** omarroth has quit IRC (Read error: Connection reset by peer)
[15:50] i'm presently the only occupant of said channel
[15:50] *** gilbahat has quit IRC (Quit: gilbahat)
[15:53] oh wow.
the front page of their site is very depressing
[15:53] *** omarroth has joined #archiveteam-bs
[15:53] https://sketch.sonymobile.com/explore/featured/sketch/1dddc116-ed8b-45a3-a042-40c96fd2de46
[16:00] *** killsushi has joined #archiveteam-bs
[16:03] *** Madbrad has quit IRC (Quit: Madbrad)
[16:07] i'll hold the channel until you get back to me.
[16:19] *** gilbahat has joined #archiveteam-bs
[16:21] *** enowaldo has joined #archiveteam-bs
[16:31] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[16:36] *** omarroth has quit IRC (Read error: Connection reset by peer)
[16:36] *** gilbahat has quit IRC (Quit: gilbahat)
[16:40] *** omarroth has joined #archiveteam-bs
[16:41] *** tuluu has quit IRC (Read error: Connection refused)
[16:42] *** tuluu has joined #archiveteam-bs
[16:44] *** enowaldo has joined #archiveteam-bs
[16:48] *** enowaldo has quit IRC (Ping timeout: 265 seconds)
[16:58] *** omarroth has quit IRC (Read error: Connection reset by peer)
[17:07] *** tuluu has quit IRC (Read error: Connection refused)
[17:08] *** tuluu has joined #archiveteam-bs
[17:21] *** enowaldo has joined #archiveteam-bs
[17:35] *** Verified_ has joined #archiveteam-bs
[17:37] *** martinlig has joined #archiveteam-bs
[17:43] *** enowaldo has quit IRC (Read error: Operation timed out)
[17:52] sketched out is pretty good. Could we just recreate it with capitals? #SketchedOut
[18:03] done, but in looking at the other channels on at they all seem to be in lowercase?
[18:19] *** enowaldo has joined #archiveteam-bs
[18:19] IRC channel names are treated as case-insensitive by almost all implementations I believe (although RFC 1459 doesn't say that). Anyway.
[18:19] #SketchedOut it is.
[18:19] they're case insensitive but case preserving
[18:20] Probably depends on the server implementation. The specs don't say anything about it.
[18:22] And depending on the client, it may also not show up with the "correct" capitalisation in your client.
I know I've joined a channel with a different capitalisation than it was created as before in irssi, though I don't remember which it was.
[18:22] in irssi, [14:22] [@nyany_(+i)] [2:choopa/#sketchedout(+nt)]
[18:22] shows with lowercase
[18:23] Did you /join #sketchedout or #SketchedOut?
[18:23] yes that may depend on how you typed it when you joined
[18:23] *** Mateon1 has quit IRC (Quit: Mateon1)
[18:23] JAA: I joined sketchedout
[18:23] but ah
[18:23] yeah, that makes sense. forgive me.
[18:23] Right, that's what I meant above.
[18:24] *** enowaldo has quit IRC (Ping timeout: 265 seconds)
[18:25] *** Tsuser has quit IRC (Ping timeout: 260 seconds)
[18:26] *** Mateon1 has joined #archiveteam-bs
[18:32] *** Tsuser_ has joined #archiveteam-bs
[18:40] it kaput NodePing: [AT] HTTPS tracker.archiveteam.org: HTTP is down
[18:41] > 100% iowait
[18:41] oof
[18:44] ouch
[18:47] 100% iowait almost always means that something is seriously fucked, as is probably the case here
[18:48] so uh, chfoo Kaz wanna take a look?
[18:48] I have nothing but a phone on me
[18:49] *** Fusl sets mode: +o Kaz
[18:51] I tried, I can't even log in
[18:53] fun
[18:53] yeah with 100% iowait you very certainly wont be able to do anything on it anyway
[18:53] oh look its slowly coming back
[18:55] we use puppet in a production environment, one of the servers managed was a rather small kvm. there was an ensure set to make sure x application was running
[18:55] problem is, the ensure was misconfigured, so every time it checked, it assumed the app wasnt running, and started it. results were similar to what just happened here.
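[editor's note] On the earlier exchange about #SketchedOut versus #sketchedout: a minimal sketch of how a client or server might compare channel names case-insensitively while preserving the typed capitalisation. It assumes the widely used "rfc1459" casemapping, in which `[]\~` count as the uppercase forms of `{}|^`; a given network may use a different mapping (typically announced by the server), so treat the table as an assumption. The function names are illustrative only.

```python
# Assumed "rfc1459" casemapping: ASCII letters plus []\~ folding to {}|^.
_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\\~"
_LOWER = "abcdefghijklmnopqrstuvwxyz{}|^"
_FOLD_TABLE = str.maketrans(_UPPER, _LOWER)


def irc_casefold(name: str) -> str:
    """Return the folded form used only for comparisons."""
    return name.translate(_FOLD_TABLE)


def same_channel(a: str, b: str) -> bool:
    """True if two channel names refer to the same channel under the mapping."""
    return irc_casefold(a) == irc_casefold(b)


if __name__ == "__main__":
    # The displayed name stays as typed; only the comparison key is folded.
    print(same_channel("#SketchedOut", "#sketchedout"))
```

Keeping the original string for display and the folded string as a lookup key is what makes a name "case insensitive but case preserving".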
[18:57] yeah, this looks like a hardware failure to me though
[18:57] disk ops dropped to 0, iowait went up to 100%
[18:59] *** astrid has quit IRC (Ping timeout: 1212 seconds)
[19:02] *** wyatt8740 has joined #archiveteam-bs
[19:03] Maybe we'll finally fix/replace it
[19:03] i dont think so
[19:03] it will just be a "i restarted it, its working again"
[19:14] *** m007a83 has quit IRC (Read error: Connection reset by peer)
[19:24] *** astrid has joined #archiveteam-bs
[19:41] *** enowaldo has joined #archiveteam-bs
[19:49] VoynichCr: I added a whole bunch of URLs to https://www.archiveteam.org/index.php/ArchiveBot/Educational_institutions/list, but they haven't been processed into the corresponding table yet. Any idea why?
[19:49] Is it possible the list is too long?
[19:50] *** wyatt8740 has quit IRC (Ping timeout: 246 seconds)
[19:52] *** ayanami_ has joined #archiveteam-bs
[19:52] Update on what I wrote in here a couple days ago about WARC payload digests being incorrect in WARCs produced by wpull and warcio, there was quite a bit of discussion about this on the warcio issue I opened yesterday: https://github.com/webrecorder/warcio/issues/74 TL;DR: "Someone" needs to do a comparison between existing toolery and identify which tools produce payload digests according to the
[19:52] standard and which keep the transfer encoding. Then a decision can be made whether software or standard need to be fixed.
[19:54] chfoo, ivan, PurpleSym: ^ You might be interested in this.
[19:54] *** enowaldo has quit IRC (Read error: Operation timed out)
[19:55] *** wyatt8740 has joined #archiveteam-bs
[19:58] *** Ravenloft has quit IRC (Remote host closed the connection)
[20:39] hmm, are any projects active on the tracker atm?
[20:39] Only URLTeam I think.
[20:41] Well, when the tracker responds, that is.
[20:41] i was going to ask something about the "active" projects. tumblr is there still.
seems that when i was looking at the stats there's been next to no activity since early april
[20:41] is that project still ongoing?
[20:42] That's why I moved it to the Hiatus section on the wiki earlier.
[20:43] no kidding. must've been just after i was looking
[20:43] actually, that doesn't appear to be the case, JAA.
[20:44] I did edit it, but the main page is cached.
[20:44] Now it should be good.
[20:45] jolly good
[20:48] *** enowaldo has joined #archiveteam-bs
[20:57] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[21:29] I am looking for someone who wants to help with YouTube archiving by monitoring everything submitted in #youtubearchive and feeding in current events, so that I can focus on the software stuff for a bit
[21:34] *** tuluu has quit IRC (Read error: Connection refused)
[21:35] *** tuluu has joined #archiveteam-bs
[21:36] the first task involves clicking every link and making sure people aren't gumming up the works with boring gaming or 10 Hour videos
[21:37] the second task involves thinking about what is on YouTube when Something Is Happening and should be saved
[21:39] How many WARC tools are going to be enough to resolve an answer to the digest controversy?
[21:44] marked: All of them. Well, all common ones at least. Things in the WARC standard have been driven by implementations, so it should be as complete a picture as possible. As mentioned, my plan is to write a little HTTP server that every author/maintainer can run their tool against, and then the WARCs can be compared against each other. I won't do that immediately though since I also want to hear what
[21:44] wumpus has found so far in his investigation to possibly also cover some other pitfalls than just this transfer encoding/payload digest thing.
[21:53] *** VerifiedJ has quit IRC (Quit: Leaving)
[22:12] *** enowaldo has joined #archiveteam-bs
[22:14] ivan: For the first issue, have you considered just restricting access like is done with archivebot?
[22:14] I guess that probably requires software work.
[22:19] jodizzle: I kind of like the unrestricted access
[22:20] and giving someone permission doesn't guarantee they'll behave anyway
[22:25] *** enowaldo has quit IRC (Read error: Operation timed out)
[22:26] *** DashEqual has joined #archiveteam-bs
[22:47] *** Dj-Wawa has joined #archiveteam-bs
[22:55] *** Zerote_ has quit IRC (Read error: Connection reset by peer)
[23:01] *** tuluu has quit IRC (Read error: Connection refused)
[23:03] *** tuluu has joined #archiveteam-bs
[23:04] *** astrid has quit IRC (Read error: Operation timed out)
[23:18] *** BlueMax has joined #archiveteam-bs
[23:18] *** enowaldo has joined #archiveteam-bs
[23:22] *** astrid has joined #archiveteam-bs
[23:22] *** Fusl sets mode: +o astrid
[23:28] *** enowaldo has quit IRC (Ping timeout: 268 seconds)
[23:57] *** enowaldo has joined #archiveteam-bs
[23:57] *** PhrackD has quit IRC (Read error: Operation timed out)
[23:59] *** PhrackD has joined #archiveteam-bs
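[editor's note] The payload digest question raised at 19:52 and 21:44 comes down to which bytes get hashed. The sketch below is an illustration under stated assumptions, not code from wpull or warcio: it builds a chunked HTTP body by hand, decodes it again, and computes a SHA-1 digest in the common `sha1:<base32>` form over both byte strings, showing that a tool that keeps the transfer encoding and a tool that strips it will record different values for the same response. All function names here are hypothetical.

```python
import base64
import hashlib


def warc_sha1(data: bytes) -> str:
    """Digest string in the 'sha1:' + Base32 form commonly seen in WARC headers."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def encode_chunked(payload: bytes, chunk_size: int = 8) -> bytes:
    """Apply HTTP/1.1 chunked transfer encoding (no trailers, no extensions)."""
    out = bytearray()
    for i in range(0, len(payload), chunk_size):
        chunk = payload[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(body: bytes) -> bytes:
    """Undo the encoding above; chunk extensions are skipped, trailers ignored."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end].split(b";")[0], 16)
        pos = line_end + 2
        if size == 0:
            break
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


if __name__ == "__main__":
    payload = b"<html><body>Hello, archive!</body></html>"
    wire_body = encode_chunked(payload)
    # Digest over the bytes as transferred vs. over the decoded payload.
    print("with transfer encoding:   ", warc_sha1(wire_body))
    print("without transfer encoding:", warc_sha1(decode_chunked(wire_body)))
```

Running a comparison like this against WARCs from several tools is one way the survey described at 21:44 could tell the two behaviours apart.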