[00:27] *** kristian_ has quit IRC (Quit: Leaving)
[00:44] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:55] *** dashcloud has joined #archiveteam-bs
[00:56] *** Soni has quit IRC (Ping timeout: 272 seconds)
[00:58] *** Soni has joined #archiveteam-bs
[02:11] *** balrog has quit IRC (Read error: Operation timed out)
[02:13] *** balrog has joined #archiveteam-bs
[02:13] *** swebb sets mode: +o balrog
[02:23] *** ftpdingus has joined #archiveteam-bs
[02:23] Anyone alive?
[02:43] *** ZexaronS- has quit IRC (Read error: Operation timed out)
[02:59] *** ZexaronS has joined #archiveteam-bs
[03:18] *** ZexaronS has quit IRC (Read error: Operation timed out)
[03:45] *** ZexaronS has joined #archiveteam-bs
[03:55] *** godane has quit IRC (Read error: Operation timed out)
[04:27] yeah
[04:27] *** ftpdingus has quit IRC (Read error: Connection reset by peer)
[04:27] oh
[04:28] *** ftpdingus has joined #archiveteam-bs
[04:37] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:40] oh?
[04:44] *** Sk1d has joined #archiveteam-bs
[04:52] oh.
[05:29] *** divingkat has joined #archiveteam-bs
[05:30] http://miscdatadigs.shoutwiki.com/
[05:30] The TCRF of regular everyday computer programs
[05:30] Or something.
[05:30] Basically digging through programs, apps, etc. to find interesting things in them.
[05:31] Or at least, uncommon knowledge.
[06:06] *** BlueMaxim has joined #archiveteam-bs
[06:12] *** ftpdingus has quit IRC (Ping timeout: 255 seconds)
[07:03] *** mls has quit IRC (Ping timeout: 250 seconds)
[07:11] *** mls has joined #archiveteam-bs
[08:04] *** divingkat has quit IRC (Leaving)
[10:33] *** Soni has quit IRC (Ping timeout: 272 seconds)
[10:33] *** Soni has joined #archiveteam-bs
[11:12] Whoa. I managed to trigger an infinite loop in wpull. It tried to retrieve the same broken URL (NXDOMAIN) over 6000 times although I specified --tries 3.
[12:00] https://board.byuu.org/viewtopic.php?f=6&t=1804
[12:26] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[13:16] *** midas_ has joined #archiveteam-bs
[13:19] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[13:19] *** zhongfu has joined #archiveteam-bs
[14:04] *** kristian_ has joined #archiveteam-bs
[14:34] *** brayden_ has joined #archiveteam-bs
[14:34] *** swebb sets mode: +o brayden_
[14:37] *** brayden has quit IRC (Ping timeout: 255 seconds)
[14:37] *** brayden_ is now known as brayden
[14:38] *** brayden has left Closing Window
[14:38] *** brayden has joined #archiveteam-bs
[14:38] *** swebb sets mode: +o brayden
[14:44] *** Mateon1 has quit IRC (Read error: Operation timed out)
[15:25] *** kristian_ has quit IRC (Quit: Leaving)
[15:27] *** Stiletto has quit IRC (Read error: Operation timed out)
[15:41] *** robink has joined #archiveteam-bs
[15:41] *** robinak has quit IRC (Read error: Connection reset by peer)
[16:20] *** godane has joined #archiveteam-bs
[16:20] so i had to find the irc.efnet.org ip address to connect to there
[16:21] *here
[16:22] i kept on getting a not resolve error in pidgin
[17:11] *** schbirid has joined #archiveteam-bs
[17:13] Can confirm, irc.efnet.org does not resolve for me currently. Looks like EFNet's name servers are down.
[17:16] ok then
[17:16] at least its not comcast maybe
[17:17] Definitely not, unless the EFNet admins are suicidal and host their name servers at Comcast.
[17:37] So, regarding Dead Format: I'm planning on creating a seed list and letting ArchiveBot take it from there.
[17:37] Artists, labels, and users are all not enumerable as far as I can see.
[17:37] Releases are, however, although there is also a slug in the URL. So I've grabbed all releases (IDs 1 to 23k) and will extract the artist, label, and user links from those for seeding. This will, however, only cover a tiny part of the website.
[17:37] My second idea was search engines, but DDG and Searx didn't return many results.
[17:38] Another plan would be bruteforcing the collection search, which appears to be basically the only way to really discover content on that site.
[17:38] *** schbirid2 has joined #archiveteam-bs
[17:39] <3
[17:42] I found a list of the 10k most common English words; that sounds like a good start. I'll also try to find some lists in other important languages.
[17:43] If anyone has other ideas, let me know.
[17:43] (Or such word lists.)
[17:44] It looks like their search fails when there are too many results. You can't search for "the", for example.
[17:45] *** schbirid has quit IRC (Read error: Operation timed out)
[17:46] JAA: try wildcards
[17:46] also
[17:47] sec
[17:47] Ooh, nice one. "Search for t* in Haves returned 544,902 items"
[17:48] Unfortunately, this is incredibly slow, i.e. it causes very high load on their server.
[17:48] But it's a very good idea.
[17:52] JAA: will have more for you in a few mins
[17:52] hold
[17:52] Huh, the search is funny. Searching for "entrée" or "entrèe" returns the same result either way, but "entree" doesn't.
[17:52] Ok
[18:16] So my releases grab yielded 8355 artists, 3743 labels, and 1624 users.
[18:24] Note for myself: the ArchiveBot job will have to follow external links to grab the full-resolution images.
[18:26] JAA: https://github.com/joepie91/webshots
[18:26] JAA: in particular https://github.com/joepie91/webshots/blob/master/search.py
[18:26] note the last commit description
[18:26] JAA: does wildcard searches that are automatically refined when it detects an incomplete result set, and un-refines afterwards
[18:27] you'd need to change the exact parameter for determining 'incomplete result set' (you need 'too many results error' instead of 'X amount of results') but otherwise the idea is the same
[18:28] JAA: tl;dr obtains a complete set with the minimum amount of search requests
[18:28] Oh, neat.
[18:28] Thanks! I have to leave now, but I'll take a closer look later.
[18:28] This sounds perfect.
[18:29] JAA: it was, incidentally, for an older archiveteam project :)
[18:35] SketchCow: here are some items i am having trouble with: https://pastebin.com/W6mkfij5
[18:35] at least 2 of the items history i have checked don't say it's dark
[18:35] *history log
[18:36] *** bwn has quit IRC (Read error: Operation timed out)
[19:19] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:20] *** dashcloud has joined #archiveteam-bs
[19:24] *** kristian_ has joined #archiveteam-bs
[19:28] joepie91: How surprising. :-P
[19:32] *** kristian_ has quit IRC (Ping timeout: 370 seconds)
[19:32] *** kristian_ has joined #archiveteam-bs
[19:38] *** Stilett0- has joined #archiveteam-bs
[19:38] *** ruunyan has quit IRC (Ping timeout: 250 seconds)
[19:44] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:47] So, regarding the entrée/entrèe situation: the search completely ignores the special characters, splits it into two words, and probably ignores "e" as a search word. "entr" obviously matches the entry which contains "entrèe".
[19:48] *** dashcloud has joined #archiveteam-bs
[19:49] Interestingly, you can't search for "esp", but "espé" works (but is not the same as "esp*").
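
A rough sketch of the wildcard-refinement idea joepie91 describes above (18:26-18:28): start with broad wildcard queries and only split a prefix into longer ones when the server complains about too many results. This is not the webshots code itself, and the endpoint URL, the "q" parameter, and the "too many results" error text below are made-up placeholders for whatever Dead Format actually uses.

    # Hypothetical sketch only; endpoint, parameter name, and error marker are assumptions.
    import string
    import requests

    SEARCH_URL = "https://example.org/search"  # placeholder endpoint

    class TooManyResults(Exception):
        """Raised when the search rejects a query for matching too much."""

    def search(term):
        """Run one search and return the result page HTML."""
        r = requests.get(SEARCH_URL, params={"q": term}, timeout=60)
        r.raise_for_status()
        if "too many results" in r.text.lower():  # assumed error text
            raise TooManyResults(term)
        return r.text

    def crawl(prefix=""):
        """Fetch prefix*, splitting into longer prefixes only when forced to."""
        try:
            return [search(prefix + "*")]
        except TooManyResults:
            pages = []
            for c in string.ascii_lowercase:
                pages.extend(crawl(prefix + c))
            return pages

The recursion bottoms out as soon as a prefix is specific enough to be accepted, which is how the approach keeps the number of requests close to the minimum joepie91 mentions.
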
[20:26] *** kristian_ has quit IRC (Read error: Operation timed out)
[20:33] *** kristian_ has joined #archiveteam-bs
[20:36] *** kristian_ has quit IRC (Client Quit)
[20:42] Huh, I found a query which returns over one third of all "items" (= collection entries, as far as I can tell). I probably don't have to mention how slow it is.
[20:43] I don't have the slightest idea why it works though. I entered "*a-b*".
[21:16] *** schbirid2 has quit IRC (Quit: Leaving)
[21:20] *** noirscape has quit IRC (Read error: Connection reset by peer)
[22:11] lol
[22:18] *** bwn has joined #archiveteam-bs
[22:28] My scrape is running now. I'm simply bruteforcing all three-character prefixes, i.e. aaa*, aab*, ... zzz*.
[22:29] I ended up writing my own code since the specifics were quite a bit different than in your case and I figured I'd be faster this way.
[22:36] *** Stilett0- is now known as Stiletto
[23:09] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:57] *** dashcloud has joined #archiveteam-bs
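
For comparison, the three-character prefix bruteforce described at 22:28 boils down to roughly the following; the actual scrape code is not shown in the log, and the endpoint and parameter are again placeholders.

    # Sketch of the aaa* ... zzz* bruteforce; endpoint and parameter are placeholders.
    import itertools
    import os
    import string
    import time
    import requests

    SEARCH_URL = "https://example.org/search"  # placeholder endpoint

    os.makedirs("results", exist_ok=True)
    for combo in itertools.product(string.ascii_lowercase, repeat=3):
        prefix = "".join(combo)  # aaa, aab, ..., zzz
        r = requests.get(SEARCH_URL, params={"q": prefix + "*"}, timeout=60)
        r.raise_for_status()
        with open(os.path.join("results", prefix + ".html"), "w", encoding="utf-8") as f:
            f.write(r.text)
        time.sleep(1)  # the search is slow and heavy on their server, so pace the requests

Unlike the adaptive refinement sketched earlier, this simply enumerates all 17,576 prefixes, trading extra requests for simpler code, which matches the "figured I'd be faster this way" rationale.
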