#archiveteam-bs 2017-09-26,Tue

Time Nickname Message
00:27 kristian_ has quit IRC (Quit: Leaving)
00:44 dashcloud has quit IRC (Read error: Operation timed out)
00:55 dashcloud has joined #archiveteam-bs
00:56 Soni has quit IRC (Ping timeout: 272 seconds)
00:58 Soni has joined #archiveteam-bs
02:11 balrog has quit IRC (Read error: Operation timed out)
02:13 balrog has joined #archiveteam-bs
02:13 swebb sets mode: +o balrog
02:23 ftpdingus has joined #archiveteam-bs
02:23 ftpdingus Anyone alive?
02:43 ZexaronS- has quit IRC (Read error: Operation timed out)
02:59 ZexaronS has joined #archiveteam-bs
03:18 ZexaronS has quit IRC (Read error: Operation timed out)
03:45 ZexaronS has joined #archiveteam-bs
03:55 godane has quit IRC (Read error: Operation timed out)
04:27 ndiddy yeah
04:27 ftpdingus has quit IRC (Read error: Connection reset by peer)
04:27 ndiddy oh
04:28 ftpdingus has joined #archiveteam-bs
04:37 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:40 Somebody2 oh?
04:44 Sk1d has joined #archiveteam-bs
04:52 astrid oh.
05:29 divingkat has joined #archiveteam-bs
05:30 divingkat http://miscdatadigs.shoutwiki.com/
05:30 divingkat The TCRF of regular everyday computer programs
05:30 divingkat Or something.
05:30 divingkat Basically digging through programs, apps, etc. to find interesting things in them.
05:31 divingkat Or at least, uncommon knowledge.
06:06 BlueMaxim has joined #archiveteam-bs
06:12 ftpdingus has quit IRC (Ping timeout: 255 seconds)
07:03 mls has quit IRC (Ping timeout: 250 seconds)
07:11 mls has joined #archiveteam-bs
08:04 divingkat has quit IRC (Leaving)
10:33 Soni has quit IRC (Ping timeout: 272 seconds)
10:33 Soni has joined #archiveteam-bs
11:12 JAA Whoa. I managed to trigger an infinite loop in wpull. It tried to retrieve the same broken URL (NXDOMAIN) over 6000 times although I specified --tries 3.
12:00 ranma https://board.byuu.org/viewtopic.php?f=6&t=1804
12:26 BlueMaxim has quit IRC (Read error: Connection reset by peer)
13:16 midas_ has joined #archiveteam-bs
13:19 zhongfu has quit IRC (Ping timeout: 260 seconds)
13:19 zhongfu has joined #archiveteam-bs
14:04 kristian_ has joined #archiveteam-bs
14:34 brayden_ has joined #archiveteam-bs
14:34 swebb sets mode: +o brayden_
14:37 brayden has quit IRC (Ping timeout: 255 seconds)
14:37 brayden_ is now known as brayden
14:38 brayden has left Closing Window
14:38 brayden has joined #archiveteam-bs
14:38 swebb sets mode: +o brayden
14:44 Mateon1 has quit IRC (Read error: Operation timed out)
15:25 kristian_ has quit IRC (Quit: Leaving)
15:27 Stiletto has quit IRC (Read error: Operation timed out)
15:41 robink has joined #archiveteam-bs
15:41 robinak has quit IRC (Read error: Connection reset by peer)
16:20 godane has joined #archiveteam-bs
16:20 godane so i had to find the irc.efnet.org ip address to connect to there
16:21 godane *here
16:22 godane i kept on getting a "could not resolve" error in pidgin
17:11 schbirid has joined #archiveteam-bs
17:13 JAA Can confirm, irc.efnet.org does not resolve for me currently. Looks like EFNet's name servers are down.
17:16 godane ok then
17:16 godane at least it's not comcast maybe
17:17 JAA Definitely not, unless the EFNet admins are suicidal and host their name servers at Comcast.
17:37 JAA So, regarding Dead Format: I'm planning on creating a seed list and letting ArchiveBot take it from there.
17:37 JAA Artists, labels, and users are all not enumerable as far as I can see. Releases are, however, although there is also a slug in the URL. So I've grabbed all releases (IDs 1 to 23k) and will extract the artist, label, and user links from those for seeding. This will, however, only cover a tiny part of the website.
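
[Editor's sketch] The seed-extraction step JAA describes (pulling artist, label, and user links out of already-downloaded release pages) could look roughly like the following. This is not JAA's code. The path prefixes `/artist/`, `/label/`, and `/user/` are assumptions for illustration only, since the log does not show the site's real URL layout.

```python
from html.parser import HTMLParser

# Hypothetical path prefixes; the real site's URL layout is not given in the log.
SEED_PREFIXES = ("/artist/", "/label/", "/user/")


class SeedLinkParser(HTMLParser):
    """Collects href values of <a> tags that start with one of SEED_PREFIXES."""

    def __init__(self):
        super().__init__()
        self.seeds = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # href is None for <a> tags without an href or with a valueless one.
        if href and href.startswith(SEED_PREFIXES):
            self.seeds.add(href)


def extract_seeds(html):
    """Return the sorted, de-duplicated seed links found in one release page."""
    parser = SeedLinkParser()
    parser.feed(html)
    return sorted(parser.seeds)
```

Running `extract_seeds` over every saved release page and merging the results would give a de-duplicated seed list of the kind ArchiveBot could be started from.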
17:37 JAA My second idea was search engines, but DDG and Searx didn't return many results.
17:38 JAA Another plan would be bruteforcing the collection search, which appears to be basically the only way to really discover content on that site.
17:38 schbirid2 has joined #archiveteam-bs
17:39 astrid <3
17:42 JAA I found a list of the 10k most common English words, that sounds like a good start. I'll also try to find some lists in other important languages.
17:43 JAA If anyone has other ideas, let me know.
17:43 JAA (Or such word lists.)
17:44 JAA It looks like their search fails when there are too many results. You can't search for "the", for example.
17:45 schbirid has quit IRC (Read error: Operation timed out)
17:46 joepie91 JAA: try wildcards
17:46 joepie91 also
17:47 joepie91 sec
17:47 JAA Ooh, nice one. "Search for t* in Haves returned 544,902 items"
17:48 JAA Unfortunately, this is incredibly slow, i.e. causes very high load on their server.
17:48 JAA But it's a very good idea.
17:52 joepie91 JAA: will have more for you in a few mins
17:52 joepie91 hold
17:52 JAA Huh, the search is funny. Searching "entrée" or "entrèe" both return the same result, but "entree" doesn't.
17:52 JAA Ok
18:16 JAA So my releases grab yielded 8355 artists, 3743 labels, and 1624 users.
18:24 JAA Note for myself: the ArchiveBot job will have to follow external links to grab the full-resolution images.
18:26 joepie91 JAA: https://github.com/joepie91/webshots
18:26 joepie91 JAA: in particular https://github.com/joepie91/webshots/blob/master/search.py
18:26 joepie91 note the last commit description
18:26 joepie91 JAA: does wildcard searches that are automatically refined when it detects an incomplete result set, and un-refines afterwards
18:27 joepie91 you'd need to change the exact parameter for determining 'incomplete result set' (you need 'too many results error' instead of 'X amount of results') but otherwise the idea is the same
18:28 joepie91 JAA: tl;dr obtains a complete set with the minimum amount of search requests
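
[Editor's sketch] The refine-on-failure idea joepie91 outlines can be illustrated with a small recursive function. This is not the code from the linked repository. `search` is a hypothetical callable standing in for the site's search, and `TooManyResults` is an assumed way of signalling the "too many results" condition JAA observed.

```python
import string


class TooManyResults(Exception):
    """Raised by the (hypothetical) search callable when a query is too broad."""


def collect_all(search, prefix="", alphabet=string.ascii_lowercase, max_depth=6):
    """Gather results via wildcard queries, refining a prefix only when needed.

    `search(query)` is assumed to return an iterable of results, or to raise
    TooManyResults when the result set is too large to be returned.
    """
    try:
        # Try the broadest query first; if it succeeds, no refinement is needed.
        return set(search(prefix + "*"))
    except TooManyResults:
        if len(prefix) >= max_depth:
            raise
        results = set()
        # Refine: extend the prefix by one character and recurse. Returning
        # from the recursion "un-refines" back to the shorter prefix.
        for ch in alphabet:
            results |= collect_all(search, prefix + ch, alphabet, max_depth)
        return results
```

A limitation of this sketch: once a prefix is refined, entries that consist of exactly that prefix, or that continue with a character outside `alphabet` (digits, punctuation, accented letters), are missed, so a real scrape would need a wider alphabet and an extra exact-match query.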
18:28 JAA Oh, neat.
18:28 JAA Thanks! I have to leave now, but I'll take a closer look later.
18:28 JAA This sounds perfect.
18:29 joepie91 JAA: it was, incidentally, for an older archiveteam project :)
18:35 godane SketchCow: here are some items that are having trouble : https://pastebin.com/W6mkfij5
18:35 godane the history of at least 2 of the items i have checked doesn't say they're dark
18:35 godane *history log
18:36 bwn has quit IRC (Read error: Operation timed out)
19:19 dashcloud has quit IRC (Read error: Operation timed out)
19:20 dashcloud has joined #archiveteam-bs
19:24 kristian_ has joined #archiveteam-bs
19:28 JAA joepie91: How surprising. :-P
19:32 kristian_ has quit IRC (Ping timeout: 370 seconds)
19:32 kristian_ has joined #archiveteam-bs
19:38 Stilett0- has joined #archiveteam-bs
19:38 ruunyan has quit IRC (Ping timeout: 250 seconds)
19:44 dashcloud has quit IRC (Read error: Operation timed out)
19:47 JAA So, regarding the entrée/entrèe situation: the search completely ignores the special characters, splits it into two words, and probably ignores "e" as a search word. "entr" obviously matches the entry which contains "entrèe".
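
[Editor's sketch] JAA's hypothesis about how the search handles accented characters can be modelled in a few lines. This is a guess at the behaviour, not the site's actual implementation. The minimum token length of 2 is an assumption; the log only suggests that the single letter "e" is dropped.

```python
import re

# Assumption: tokens shorter than this are dropped by the search.
MIN_TOKEN_LENGTH = 2


def guess_tokens(query):
    """Split a query the way the search is hypothesised to.

    Every character that is not an ASCII letter or digit (including accented
    letters) is treated as a word separator, then short tokens are discarded.
    """
    parts = re.split(r"[^A-Za-z0-9]+", query.lower())
    return [p for p in parts if len(p) >= MIN_TOKEN_LENGTH]
```

Under this model "entrée" and "entrèe" both reduce to the single token "entr", while "entree" stays one six-letter token, which is consistent with the two accented spellings returning the same result and the unaccented one not. It models only that observation and may not capture other quirks of the search.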
19:48 dashcloud has joined #archiveteam-bs
19:49 JAA Interestingly, you can't search for "esp", but "espé" works (but is not the same as "esp*").
20:26 kristian_ has quit IRC (Read error: Operation timed out)
20:33 kristian_ has joined #archiveteam-bs
20:36 kristian_ has quit IRC (Client Quit)
20:42 JAA Huh, I found a query which returns over one third of all "items" (= collection entries, as far as I can tell). I probably don't have to mention how slow it is.
20:43 JAA I don't have the slightest idea why it works though. I entered "*a-b*".
21:16 schbirid2 has quit IRC (Quit: Leaving)
21:20 noirscape has quit IRC (Read error: Connection reset by peer)
22:11 joepie91 lol
22:18 bwn has joined #archiveteam-bs
22:28 JAA My scrape is running now. I'm simply bruteforcing all three-character prefixes, i.e. aaa*, aab*, ... zzz*.
22:29 JAA I ended up writing my own code since the specifics were quite a bit different than in your case and I figured I'd be faster this way.
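
[Editor's sketch] JAA's own code is not shown in the log. A minimal way to enumerate the three-character prefix queries described (aaa*, aab*, ... zzz*) is shown below, assuming the alphabet is the 26 lowercase ASCII letters, which the log implies but does not state outright.

```python
from itertools import product
import string


def three_char_prefix_queries(alphabet=string.ascii_lowercase):
    """Yield wildcard queries for every three-character prefix, in order."""
    for chars in product(alphabet, repeat=3):
        yield "".join(chars) + "*"
```

With 26 letters this yields 26 ** 3 = 17,576 queries, each of which would then be sent to the collection search.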
22:36 Stilett0- is now known as Stiletto
23:09 Stiletto has quit IRC (Read error: Operation timed out)
23:56 dashcloud has quit IRC (Read error: Operation timed out)
23:57 dashcloud has joined #archiveteam-bs