I would prefer something like `{"name": contains("k")}`, where contains("k") returns an object with a custom __eq__ that compares equal to any string (or iterable) that contains "k". Then you can just filter by equality.
I recently started using this pattern for pytest equality assertions, as pytest helpfully produces a detailed diff on mismatch. It's not perfect, as pytest doesn't always produce a correct diff with this pattern, but it's better than some alternatives.
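A minimal sketch of that helper (the `Contains` class name and the equality-filter loop below are my own illustration, not any particular package's API):

```python
class Contains:
    """Compares equal to any string (or other container) that contains `needle`."""

    def __init__(self, needle):
        self.needle = needle

    def __eq__(self, other):
        try:
            return self.needle in other
        except TypeError:
            return NotImplemented

    def __repr__(self):
        return f"Contains({self.needle!r})"


# Filtering by plain equality then works for exact and "fuzzy" fields alike:
people = [{"name": "Kim", "age": 35}, {"name": "Bob", "age": 19}]
query = {"name": Contains("i"), "age": 35}
matches = [d for d in people if all(d.get(k) == v for k, v in query.items())]
assert matches == [{"name": "Kim", "age": 35}]
```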
`{"name__contains": "k", "age__lt": 20}`
Kind of tangential to this package, but I've always loved this filter query syntax. Does it have a name? I first encountered it in the Django ORM, and then in DRF, which exposes the same lookups as URL query params. I recently built a parser for it in JavaScript to use on the frontend. Does anyone know of JS libraries that make working with this easy? I'm thinking of something that handles the parsing and offers some kind of database-agnostic marshaling API. (If not, I might have to open-source my own code!)
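Django's docs refer to these as "field lookups", and the parsing side is small enough to sketch. Below is a hedged illustration in Python (not the posted library's code and not the JS parser mentioned above); the `LOOKUPS` table and `matches` helper are made-up names:

```python
import operator

# Hypothetical lookup table; real implementations support many more operators.
LOOKUPS = {
    "exact": operator.eq,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "contains": lambda value, arg: arg in value,
}

def matches(record, **filters):
    """Return True if `record` satisfies every `field__lookup=arg` filter."""
    for key, arg in filters.items():
        field, _, lookup = key.partition("__")
        op = LOOKUPS[lookup or "exact"]
        if not op(record[field], arg):
            return False
    return True

rows = [{"name": "Kim", "age": 35}, {"name": "Karl", "age": 19}]
assert [r for r in rows if matches(r, name__contains="K", age__lt=20)] == [rows[1]]
```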
Interesting... I've been playing with the idea of embedding more Python in my C, no Cython or anything, just using <Python.h> and <numpy/arrayobject.h>. From one perspective it's just "free" C bindings to a lot of optimized packages. Trying some different C libraries, the Python code is often faster. Python almost becomes C's package manager.
E.g. sorting 2^23 random 64-bit integers: qsort: 850ms, custom radix sort: 250ms, ksort.h: 582ms, np.sort: 107ms (including PyArray_SimpleNewFromData and PyArray_Sort). NumPy uses Intel's x86-simd-sort there, I believe.
E.g. inserting 8M entries into a hash table (random 64-bit keys and values): MSI-style hash table: ~100ns avg insert/lookup, cc_map: ~95ns avg insert/lookup, Python.h: 91ns insert, 60ns lookup.
I'm curious whether OP's tool might fit in similarly. I've found LMDB to be quite slow even in tmpfs with no sync, etc.
Having seen a lot of work come to grief because of the decision to use pandas, anything that's not pandas has my vote. Pandas: if you're not using it interactively, don't use it at all. This advice goes double if your use case is "read a CSV"; the Python standard library has you covered there.
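For the CSV case, the stdlib route is roughly this (a sketch: the file name and the `age` column are made up):

```python
import csv

# csv.DictReader yields one dict per row, keyed by the header line,
# with no third-party dependency.
with open("people.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Values come back as strings; convert explicitly where numbers are needed.
under_20 = [r for r in rows if int(r["age"]) < 20]
```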
Apologies for being off topic, but after reading the implementation code, I was amazed at how short it is!
I have never been a huge fan of Python (Lisp person), but I really appreciate how concise Python can be, and it's the dynamic nature of Python that allows the nice query syntax.
Django ORM for plain lists is interesting I guess... but being faster than pandas at that is quite a surprise, bravo!
Title should be prefixed with Show HN and the name of the project in order to not mislead readers about the content of the link.
I don't understand why numeric filters are included. The library is written in Python, so shouldn't a filter based on a lambda function be roughly as fast, but much easier and clearer to write?
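For comparison, here's what that lambda-based version looks like (my own example data, not the library's benchmark):

```python
people = [{"name": "Kim", "age": 35}, {"name": "Karl", "age": 19}]

# The same numeric filter a declarative {"age__lt": 20} would express:
under_20 = list(filter(lambda d: d["age"] < 20, people))
# ...or as a comprehension, which linters and type checkers handle just as well:
under_20 = [d for d in people if d["age"] < 20]
```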
It's nice that it's fast at 10k dictionary entries, but how does it scale?
You can create a DataFrame from a list of dictionaries in pandas:
`df = pd.DataFrame([{}, {}, {}])`
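For example, with a filter attached (column names invented for illustration):

```python
import pandas as pd

records = [{"name": "Kim", "age": 35}, {"name": "Karl", "age": 19}]
df = pd.DataFrame(records)

# Boolean-mask filtering roughly corresponding to name__contains / age__lt:
subset = df[df["name"].str.contains("K") & (df["age"] < 20)]
```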
Maybe this is just me, but embedding the language in strings like this seems like it's just asking for trouble.
Interesting work. I'd be curious to know the timing relative to list comprehensions for similar queries, since that's the common standard library alternative for many of these examples.
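A rough way to measure that, assuming some `query(data, **filters)` entry point (a hypothetical name; substitute the library's real call):

```python
import timeit

data = [{"name": f"user{i}", "age": i % 90} for i in range(10_000)]

# Standard-library baseline: a list comprehension over the same predicate.
baseline = timeit.timeit(
    lambda: [d for d in data if "1" in d["name"] and d["age"] < 20],
    number=100,
)
print(f"list comprehension: {baseline / 100 * 1e3:.2f} ms per query")

# The library call would be timed the same way, e.g.:
# timeit.timeit(lambda: query(data, name__contains="1", age__lt=20), number=100)
```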
Interesting project and approach, thanks for sharing!
If you're interested in a simple solution to query a list with SQL including vector similarity, check this out: https://gist.github.com/davidmezzetti/f0a0b92f5281924597c9d1...
I feel like the scale where a library like this is meaningful for performance, and therefore worth the dependency+DSL complexity, is also the scale where you should use a proper database (even just SQLite).
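For reference, getting the same dicts into an in-memory SQLite table is only a few lines (a sketch; the schema and column names are assumed):

```python
import sqlite3

rows = [{"name": "Kim", "age": 35}, {"name": "Karl", "age": 19}]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany("INSERT INTO people (name, age) VALUES (:name, :age)", rows)

# From here, indexes, EXPLAIN QUERY PLAN, and the rest of SQL come for free.
matches = con.execute(
    "SELECT name, age FROM people WHERE name LIKE ? AND age < ?", ("%K%", 20)
).fetchall()
```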
Embedding functionality in strings prevents any kind of static analysis. It's the same issue as embedding plain SQL, plain regexes, etc.
I am always in favor of declarative approaches where applicable. But whenever they are embedded this way, you get that static-analysis barrier and a possible mismatch between the imperative and declarative code: you change a return type or a field on the declarative side, and it doesn't show up as an error in the surrounding code.
A positive example is VerbalExpressions in Java, which only allows expressing valid regular expressions; every invalid regular expression is inexpressible in valid Java code. jOOQ is another example: it makes incorrect (even incorrectly typed) SQL inexpressible in Java.
I know Python is a bit different, as the compiler does no extensive static analysis, but we do have plenty of static-analysis tools and good type checkers for Python that could add value here. A statically type-safe query is a wonderful thing for safety and maintainability.
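To make that concrete, here is one shape a string-free, statically checkable query could take in Python. This is purely illustrative (the `Person` model and `where` helper are my own, not the posted library's API), but mypy or pyright can verify every field name and comparison in it:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

@dataclass
class Person:
    name: str
    age: int

def where(items: Iterable[T], *predicates: Callable[[T], bool]) -> list[T]:
    """Keep the items for which every predicate holds."""
    return [item for item in items if all(p(item) for p in predicates)]

people = [Person("Kim", 35), Person("Karl", 19)]

# Field names and types are visible to the type checker: a typo like `p.nme`
# or comparing `p.age` to a string is flagged before runtime.
young_ks = where(people, lambda p: "K" in p.name, lambda p: p.age < 20)
```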