[00:27] *** kristian_ has quit IRC (Quit: Leaving)
[00:44] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:55] *** dashcloud has joined #archiveteam-bs
[00:56] *** Soni has quit IRC (Ping timeout: 272 seconds)
[00:58] *** Soni has joined #archiveteam-bs
[02:11] *** balrog has quit IRC (Read error: Operation timed out)
[02:13] *** balrog has joined #archiveteam-bs
[02:13] *** swebb sets mode: +o balrog
[02:23] *** ftpdingus has joined #archiveteam-bs
[02:23] Anyone alive?
[02:43] *** ZexaronS- has quit IRC (Read error: Operation timed out)
[02:59] *** ZexaronS has joined #archiveteam-bs
[03:18] *** ZexaronS has quit IRC (Read error: Operation timed out)
[03:45] *** ZexaronS has joined #archiveteam-bs
[03:55] *** godane has quit IRC (Read error: Operation timed out)
[04:27] yeah
[04:27] *** ftpdingus has quit IRC (Read error: Connection reset by peer)
[04:27] oh
[04:28] *** ftpdingus has joined #archiveteam-bs
[04:37] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:40] oh?
[04:44] *** Sk1d has joined #archiveteam-bs
[04:52] oh.
[05:29] *** divingkat has joined #archiveteam-bs
[05:30] http://miscdatadigs.shoutwiki.com/
[05:30] The TCRF of regular everyday computer programs
[05:30] Or something.
[05:30] Basically digging through programs, apps, etc. to find interesting things in them.
[05:31] Or at least, uncommon knowledge.
[06:06] *** BlueMaxim has joined #archiveteam-bs
[06:12] *** ftpdingus has quit IRC (Ping timeout: 255 seconds)
[07:03] *** mls has quit IRC (Ping timeout: 250 seconds)
[07:11] *** mls has joined #archiveteam-bs
[08:04] *** divingkat has quit IRC (Leaving)
[10:33] *** Soni has quit IRC (Ping timeout: 272 seconds)
[10:33] *** Soni has joined #archiveteam-bs
[11:12] Whoa. I managed to trigger an infinite loop in wpull. It tried to retrieve the same broken URL (NXDOMAIN) over 6000 times although I specified --tries 3.
[12:00] https://board.byuu.org/viewtopic.php?f=6&t=1804
[12:26] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[13:16] *** midas_ has joined #archiveteam-bs
[13:19] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[13:19] *** zhongfu has joined #archiveteam-bs
[14:04] *** kristian_ has joined #archiveteam-bs
[14:34] *** brayden_ has joined #archiveteam-bs
[14:34] *** swebb sets mode: +o brayden_
[14:37] *** brayden has quit IRC (Ping timeout: 255 seconds)
[14:37] *** brayden_ is now known as brayden
[14:38] *** brayden has left Closing Window
[14:38] *** brayden has joined #archiveteam-bs
[14:38] *** swebb sets mode: +o brayden
[14:44] *** Mateon1 has quit IRC (Read error: Operation timed out)
[15:25] *** kristian_ has quit IRC (Quit: Leaving)
[15:27] *** Stiletto has quit IRC (Read error: Operation timed out)
[15:41] *** robink has joined #archiveteam-bs
[15:41] *** robinak has quit IRC (Read error: Connection reset by peer)
[16:20] *** godane has joined #archiveteam-bs
[16:20] so i had to find the irc.efnet.org ip address to connect to there
[16:21] *here
[16:22] i kept on getting a not resolve error in pidgin
[17:11] *** schbirid has joined #archiveteam-bs
[17:13] Can confirm, irc.efnet.org does not resolve for me currently. Looks like EFNet's name servers are down.
[17:16] ok then
[17:16] at least its not comcast maybe
[17:17] Definitely not, unless the EFNet admins are suicidal and host their name servers at Comcast.
[17:37] So, regarding Dead Format: I'm planning on creating a seed list and letting ArchiveBot take it from there.
[17:37] Artists, labels, and users are all not enumerable as far as I can see.
[17:37] Releases are, however, although there is also a slug in the URL. So I've grabbed all releases (IDs 1 to 23k) and will extract the artist, label, and user links from those for seeding. This will, however, only cover a tiny part of the website.
[17:37] My second idea was search engines, but DDG and Searx didn't return many results.
[17:38] Another plan would be bruteforcing the collection search, which appears to be basically the only way to really discover content on that site.
[17:38] *** schbirid2 has joined #archiveteam-bs
[17:39] <3
[17:42] I found a list of the 10k most common English words; that sounds like a good start. I'll also try to find some lists in other important languages.
[17:43] If anyone has other ideas, let me know.
[17:43] (Or such word lists.)
[17:44] It looks like their search fails when there are too many results. You can't search for "the", for example.
[17:45] *** schbirid has quit IRC (Read error: Operation timed out)
[17:46] JAA: try wildcards
[17:46] also
[17:47] sec
[17:47] Ooh, nice one. "Search for t* in Haves returned 544,902 items"
[17:48] Unfortunately, this is incredibly slow, i.e. it causes very high load on their server.
[17:48] But it's a very good idea.
[17:52] JAA: will have more for you in a few mins
[17:52] hold
[17:52] Huh, the search is funny. Searching for "entrée" or "entrèe" returns the same result either way, but "entree" doesn't.
[17:52] Ok
[18:16] So my releases grab yielded 8355 artists, 3743 labels, and 1624 users.
[18:24] Note for myself: the ArchiveBot job will have to follow external links to grab the full-resolution images.
[18:26] JAA: https://github.com/joepie91/webshots
[18:26] JAA: in particular https://github.com/joepie91/webshots/blob/master/search.py
[18:26] note the last commit description
[18:26] JAA: does wildcard searches that are automatically refined when it detects an incomplete result set, and un-refines afterwards
[18:27] you'd need to change the exact parameter for determining 'incomplete result set' (you need 'too many results error' instead of 'X amount of results') but otherwise the idea is the same
[18:28] JAA: tl;dr obtains a complete set with the minimum amount of search requests
[18:28] Oh, neat.
[18:28] Thanks! I have to leave now, but I'll take a closer look later.
[18:28] This sounds perfect.
[18:29] JAA: it was, incidentally, for an older archiveteam project :)
[18:35] SketchCow: here are some items i am having trouble with: https://pastebin.com/W6mkfij5
[18:35] at least 2 of the items history i have checked don't say it's dark
[18:35] *history log
[18:36] *** bwn has quit IRC (Read error: Operation timed out)
[19:19] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:20] *** dashcloud has joined #archiveteam-bs
[19:24] *** kristian_ has joined #archiveteam-bs
[19:28] joepie91: How surprising. :-P
[19:32] *** kristian_ has quit IRC (Ping timeout: 370 seconds)
[19:32] *** kristian_ has joined #archiveteam-bs
[19:38] *** Stilett0- has joined #archiveteam-bs
[19:38] *** ruunyan has quit IRC (Ping timeout: 250 seconds)
[19:44] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:47] So, regarding the entrée/entrèe situation: the search completely ignores the special characters, splits it into two words, and probably ignores "e" as a search word. "entr" obviously matches the entry which contains "entrèe".
[19:48] *** dashcloud has joined #archiveteam-bs
[19:49] Interestingly, you can't search for "esp", but "espé" works (but is not the same as "esp*").
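
A rough sketch of the wildcard-refinement idea joepie91 describes above (18:26-18:28): start with broad wildcard queries and only split a prefix into longer ones when the server complains about too many results. This is not the webshots code itself, and the endpoint URL, the "q" parameter, and the "too many results" error text below are made-up placeholders for whatever Dead Format actually uses.

    # Hypothetical sketch only; endpoint, parameter name, and error marker are assumptions.
    import string
    import requests

    SEARCH_URL = "https://example.org/search"  # placeholder endpoint

    class TooManyResults(Exception):
        """Raised when the search rejects a query for matching too much."""

    def search(term):
        """Run one search and return the result page HTML."""
        r = requests.get(SEARCH_URL, params={"q": term}, timeout=60)
        r.raise_for_status()
        if "too many results" in r.text.lower():  # assumed error text
            raise TooManyResults(term)
        return r.text

    def crawl(prefix=""):
        """Fetch prefix*, splitting into longer prefixes only when forced to."""
        try:
            return [search(prefix + "*")]
        except TooManyResults:
            pages = []
            for c in string.ascii_lowercase:
                pages.extend(crawl(prefix + c))
            return pages

The recursion bottoms out as soon as a prefix is specific enough to be accepted, which is how the approach keeps the number of requests close to the minimum joepie91 mentions.
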
[20:26] *** kristian_ has quit IRC (Read error: Operation timed out)
[20:33] *** kristian_ has joined #archiveteam-bs
[20:36] *** kristian_ has quit IRC (Client Quit)
[20:42] Huh, I found a query which returns over one third of all "items" (= collection entries, as far as I can tell). I probably don't have to mention how slow it is.
[20:43] I don't have the slightest idea why it works though. I entered "*a-b*".
[21:16] *** schbirid2 has quit IRC (Quit: Leaving)
[21:20] *** noirscape has quit IRC (Read error: Connection reset by peer)
[22:11] lol
[22:18] *** bwn has joined #archiveteam-bs
[22:28] My scrape is running now. I'm simply bruteforcing all three-character prefixes, i.e. aaa*, aab*, ... zzz*.
[22:29] I ended up writing my own code since the specifics were quite a bit different than in your case and I figured I'd be faster this way.
[22:36] *** Stilett0- is now known as Stiletto
[23:09] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:57] *** dashcloud has joined #archiveteam-bs
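
For comparison, the three-character prefix bruteforce described at 22:28 boils down to roughly the following; the actual scrape code is not shown in the log, and the endpoint and parameter are again placeholders.

    # Sketch of the aaa* ... zzz* bruteforce; endpoint and parameter are placeholders.
    import itertools
    import os
    import string
    import time
    import requests

    SEARCH_URL = "https://example.org/search"  # placeholder endpoint

    os.makedirs("results", exist_ok=True)
    for combo in itertools.product(string.ascii_lowercase, repeat=3):
        prefix = "".join(combo)  # aaa, aab, ..., zzz
        r = requests.get(SEARCH_URL, params={"q": prefix + "*"}, timeout=60)
        r.raise_for_status()
        with open(os.path.join("results", prefix + ".html"), "w", encoding="utf-8") as f:
            f.write(r.text)
        time.sleep(1)  # the search is slow and heavy on their server, so pace the requests

Unlike the adaptive refinement sketched earlier, this simply enumerates all 17,576 prefixes, trading extra requests for simpler code, which matches the "figured I'd be faster this way" rationale.
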